Assignment 2

Binary Classification with Imbalanced Data using H2O

Due October 4th

This assignment involves modeling imbalanced tweet sentiment datasets, similar to the dataset used in Assignment 1, except that there are many more negative tweets than positive tweets in these datasets. To understand how to evaluate binary classification models, especially with imbalanced data, you will vary the classification threshold in order to balance False Positive Rate (FPR, Type I Error Rate) and False Negative Rate (FNR, Type II Error Rate). Next, Random Undersampling will be used to alleviate class imbalance, by randomly removing majority (Negative) instances from the training dataset. This will be tested on several datasets with different imbalance levels, and with different post-sampling ratios.


Update the course folder on your VM by pulling the latest code from the git repository. This will download a zip file with the datasets for this assignment in the data/assignment2data folder. There is no other starter code for this assignment, as everything will be performed from new H2O Flow notebooks. H2O will automatically parse the .arff files, so you will not need to make any modifications when parsing the files. The files have slightly different attributes, so you will need to parse each file individually.

Submit the notebook(s) that you use to perform the experiments, along with a report as detailed on the course home page. There is no specific requirement for how you lay out your notebooks, or how many notebooks you use, so feel free to do what is best to accomplish the assignment. The notebooks are turned in to verify that you are the one who performed the experiments.

Part 1: Balancing Misclassification Rates

In many classification scenarios, the positive class is much more valuable than the negative class. Consequently, False Negative (Type II) errors are more serious than False Positive (Type I) errors. Therefore, overall model accuracy is not an appropriate metric, especially when the dataset is imbalanced. The objective in this part is to determine the classification threshold that results in balanced misclassification rates, with False Negative Rate (Type II error rate) as low as possible.

Build a Distributed Random Forest model with H2O, using the sentiment_25_75.arff file with the following parameters:

  • nfolds: 5
  • response_column: sentiment_class
  • ntrees: 5
  • max_depth: 20
  • seed: 0 (important: the random seed needs to be set for grading purposes)
  • fold_assignment: Random
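
For reference only (the assignment itself is done in H2O Flow), the parameter list above could be expressed through H2O's Python API roughly as follows. This is a minimal sketch, not part of the required workflow: the names DRF_PARAMS, RESPONSE_COLUMN, and train_drf are hypothetical, and it assumes the h2o Python package is installed and that sentiment_class is parsed as a categorical column.

```python
# Parameters from the list above, keyed by the names used in H2O's Python API.
DRF_PARAMS = {
    "nfolds": 5,
    "ntrees": 5,
    "max_depth": 20,
    "seed": 0,  # fixed seed so results are reproducible
    "fold_assignment": "Random",
}
RESPONSE_COLUMN = "sentiment_class"


def train_drf(arff_path):
    """Train the Part 1 model on one .arff file and return it.

    Requires the h2o package; the import is kept inside the function so
    the parameter dictionary above can be inspected without it.
    """
    import h2o
    from h2o.estimators import H2ORandomForestEstimator

    h2o.init()  # connect to (or start) a local H2O cluster
    frame = h2o.import_file(arff_path)  # H2O parses ARFF files directly
    model = H2ORandomForestEstimator(**DRF_PARAMS)
    model.train(y=RESPONSE_COLUMN, training_frame=frame)
    return model
```

Under those assumptions, calling train_drf on the sentiment_25_75.arff file (wherever it was unzipped on your VM) and then model.auc(xval=True) should report the cross-validation AUC rather than the training AUC.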

Look at the “ROC Curve – Cross Validation Metrics” section. Click various points on the ROC curve to see the performance metrics for that particular threshold value, or select the threshold from the dropdown below the graph. Observe the trends of FNR and FPR as you select different threshold values. Select the threshold that gives max accuracy by selecting “max accuracy” from the Criterion dropdown. Include these metrics in your report. What is the model doing in this scenario, and why is accuracy not a good metric?

H2O reports the performance metrics and confusion matrices for each threshold value in the “Output – cross_validation_metrics – Metrics for Thresholds (Binomial metrics as a function of classification thresholds)” section. Copy these values (along with the header row of the table) into Excel, and plot the threshold vs. FNR and FPR to visualize where the optimal threshold is. Refer to the course references to see examples of these graphs. Select the threshold that balances FNR and FPR, with FNR as low as possible. Include your graph(s) and the performance metrics for the optimal threshold. What happens when the threshold decreases/increases?
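
The threshold search described above can also be sketched in a few lines of Python, as a cross-check on the Excel plot. This is an illustration on made-up scores, not H2O output: the function names are hypothetical, labels are encoded as 1 (positive) and 0 (negative), and an instance is assumed to be predicted positive when its score is at or above the threshold.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (FPR, FNR) when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    false_pos = sum(1 for s, y in pairs if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in pairs if s < threshold and y == 1)
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return false_pos / negatives, false_neg / positives


def balanced_threshold(scores, labels, thresholds):
    """Pick the threshold with the smallest |FPR - FNR|, preferring lower FNR on ties."""
    def imbalance(threshold):
        fpr, fnr = rates_at_threshold(scores, labels, threshold)
        return (abs(fpr - fnr), fnr)

    return min(thresholds, key=imbalance)


# Toy data: 3 positive and 5 negative instances with made-up scores.
scores = [0.9, 0.8, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

for t in [0.3, 0.5, 0.7]:
    fpr, fnr = rates_at_threshold(scores, labels, t)
    print(f"threshold={t:.1f}  FPR={fpr:.3f}  FNR={fnr:.3f}")

best = balanced_threshold(scores, labels, [0.3, 0.5, 0.7])
print("balanced threshold:", best)
```

With the real data, the thresholds, FPR, and FNR values would come from the "Metrics for Thresholds" table rather than being recomputed from raw scores.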

Part 2: Random Undersampling

To perform Random Undersampling (RUS) with H2O, the balance_classes, class_sampling_factors, and max_after_balance_size parameters must be used.
Note: the class_sampling_factors and max_after_balance_size parameters only show up in the Expert section after the balance_classes parameter is checked.

Build Random Forest models for each dataset in the data/assignment2data folder, using the same RF parameters as in Part 1. In addition, build models with both 50/50 and 35/65 (35% positive, 65% negative) post-sampling class ratios. The following table outlines all models and parameter combinations for this experiment:

Dataset   Desired RUS ratio   class_sampling_factors   max_after_balance_size
25/75     N/A                 N/A                      N/A
25/75     50/50               0.333,1                  0.5
25/75     35/65               0.619,1                  0.929
15/85     N/A                 N/A                      N/A
15/85     50/50               0.176,1                  0.3
15/85     35/65               0.328,1                  0.557
05/95     N/A                 N/A                      N/A
05/95     50/50               0.0526,1                 0.1
05/95     35/65               0.0977,1                 0.1857

After you build each model, you can verify that the training data was properly sampled by viewing the confusion matrix in the training metrics. To evaluate and compare the models, report the cross validation AUC score for each combination.
Note: DO NOT use the training metrics to evaluate the models. Make sure you are reading from the cross validation metrics when reporting model performance.

First, compare the performance of the sampled (RUS) models versus the original (no sampling) model for each dataset. Does sampling improve classifier performance? Which ratio performs better?

Next, compare all 9 combinations. Which combination has the best performance? Provide some commentary and explain your observations.