Assignment 4

Parameter Tuning and Model Evaluation with Spark

Due November 15th (Part 1 due November 1st)

This assignment will be performed on a real Hadoop cluster, and consists of tuning parameters for text modeling with Random Forest. There is a trade-off when changing model parameters to improve performance, as it often increases the time needed to train a model. This assignment explores different parameter options with regard to model building time and performance (AUC).

Part 1 of the assignment is due on November 1st, to make sure that all students can access the cluster and run jobs without issue. Parts 2-3, along with the full report, are due on November 15th.

Setup

This assignment is performed on a Hadoop cluster provisioned for this course under the FAU KoKo high-performance computing platform. The cluster has 4 nodes, each with 128GB of RAM and 20 CPU cores. Note that KoKo contains many compute nodes, but the Hadoop nodes are logically separated from the non-Hadoop ones; KoKo itself is only used to log in. As a student in this class, you should have a login for KoKo. If you have any issues logging into the cluster, you can contact Eric Borenstein (eric@fau.edu), who is the HPC administrator. Please only contact Eric if you have trouble logging into the cluster; any assignment-related questions should still be routed through Piazza.

You will use the Hadoop cluster by logging into KoKo with X2Go, a remote desktop client used for the cluster. Follow these instructions provided by the FAU HPC group for logging into KoKo: https://hpc.fau.edu/x2go/. The desktop that you see is the head node of KoKo, which is not a Hadoop NameNode or DataNode. Hadoop is accessed by viewing web pages that are hosted on the Hadoop cluster, and by loading the Hadoop modules from the terminal.

Viewing Hadoop Web Pages

You can view the Hadoop management web pages (the same pages that were explored in Assignment 1) with the following URLs. Note that these URLs will only work from inside X2Go when you are logged into KoKo.

YARN Applications: http://hda002:8088/

HDFS NameNode information: http://hda001:50070

Accessing Hadoop from the Terminal

KoKo uses the module tool to load and run different software modules for each user. To access the Hadoop cluster from the terminal, you need to load the hdacademic module:

module load hdacademic
module list

This will allow you to perform all of the familiar Hadoop operations (hadoop fs -ls, for example). Note that you have to load the module each time you open a terminal window.

Running PySpark Scripts

Each part of the assignment has an associated script located in the python directory of the course repository (don't forget to clone the git repository on the cluster!). To run these scripts in batch mode on the cluster, we use spark-submit with several options. To run a job, use the spark-job.sh script:

cd ~/fau-bigdatacourse/python
./spark-job.sh <SCRIPT TO RUN>

Inside the spark-job.sh script you will notice several options that set how many nodes to use and how much memory to use on each node. These result in the Spark job using all 4 nodes on the cluster, with 7GB of RAM on each node (28GB total for the job). The assignment4.py file contains functions that must be referenced by all workers, so it is sent as part of the job. DO NOT change any of these parameters, as they are required for running Spark on YARN and for sharing the cluster. It is very important that you run only one job at a time; many students are using the cluster, and no one person should hog all the resources.

NOTE: If you cancel/close the terminal window, or your KoKo session gets terminated, the Spark job will still continue running. Check the YARN Applications page before submitting a new job.

When you run the scripts, you will notice that the output from the script is not printed to the screen (only the default Spark output). To view the job’s output, you need to find it on the YARN Applications page (http://hda002:8088/), click on the application ID, then “logs” at the bottom right, and then “stdout”. You can then refresh that page to see updated output. Status messages will be printed here to show that the job is successfully running.

If you have trouble cloning the git repository, you might need to specify your username when cloning:

git clone https://<username>@bitbucket.org/rikturr/fau-bigdatacourse

Part 1: Create Datasets

To reduce the computational cost of Parts 2-3, we will generate several datasets from the original tweet data (similar to Assignment 3). In this assignment, we are using the full dataset of 1.6 million tweets (800,000 positive, 800,000 negative). This part involves pre-processing the raw data into four vectorized data files, with 500, 1,000, 5,000, and 10,000 features, respectively. These data files are generated from the raw text using the RDD functions created in Assignment 3, and are saved in LibSVM format, a data format for storing datasets with sparse numerical features.
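For reference, each line of a LibSVM file holds one example: the label followed by index:value pairs for the non-zero features only (the label and indices below are made up for illustration):

1.0 4:1.0 27:1.0 305:1.0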

Run the following command from inside the terminal (on KoKo) to download the raw data:

wget https://www.dropbox.com/s/knejjohvt2tgz9q/assignment4_data.tsv

You will then need to put the file in your folder on HDFS with hadoop fs -put (note that you cannot access other users' files). WebHDFS is not working on the cluster at the moment, so use the command line (for example, hadoop fs -ls) to double-check that the file is on HDFS.

The datasets are generated using the assignment4-part1.py script. Before running it, edit the file and update the sourceFile variable to point to the data file you put on HDFS. Then use the spark-job.sh script, as specified in the Setup section, to run the script. This will generate four files and save them into your HDFS directory under a folder named a4data.
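If you are curious what the script is doing, the sketch below outlines the kind of PySpark pipeline it runs. This is illustrative only: the column layout of the TSV, the helper logic, and the output file names are assumptions, and the real assignment4-part1.py already handles all of this for you.

# Illustrative sketch of the Part 1 pipeline (assignment4-part1.py does this for you)
from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName='a4-part1-sketch')
sourceFile = '/user/<your-username>/assignment4_data.tsv'  # point this at your HDFS copy

# Assumed layout: label <tab> tweet text (check the real script for the actual parsing)
def parse(line):
    label, text = line.split('\t', 1)
    return float(label), text.lower().split()

tweets = sc.textFile(sourceFile).map(parse).cache()

# Keep the num_features most frequent words as the vocabulary
num_features = 1000
top_words = (tweets.flatMap(lambda lt: set(lt[1]))
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .takeOrdered(num_features, key=lambda wc: -wc[1]))
vocab = {w: i for i, (w, _) in enumerate(top_words)}

# Binary word-presence features stored as sparse vectors
def to_point(label, words):
    idx = sorted({vocab[w] for w in words if w in vocab})
    return LabeledPoint(label, SparseVector(num_features, idx, [1.0] * len(idx)))

MLUtils.saveAsLibSVMFile(tweets.map(lambda lt: to_point(*lt)), 'a4data/data_%d' % num_features)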

To show that the script successfully generated the datasets, run the following:

hadoop fs -du /user/<your-username>/a4data

The output of this command is all that is required to be submitted on November 1st. This shows that you successfully logged into the cluster and ran a Spark job. There is no need for a write-up, just the output of this command.

Part 2: Number of Trees

The goal of this part is to select the best number of trees to use in the Random Forest model, based on AUC and running time. Increasing the number of trees increases running time roughly linearly, as the model has to train more trees. The best value of this parameter is the one where AUC still improves over the previous (lower) value without a disproportionately larger increase in running time. For example, consider the following three (fictional) runs:

nTrees   AUC    Time
10       0.6    10 seconds
20       0.7    25 seconds
30       0.72   125 seconds

The best nTrees to use in this scenario would be 20, as the model building time is still relatively low compared to using 30 trees, and the AUC value is comparable.

For this part, we will run tests to determine the best nTrees to use with the 1,000-feature dataset. That value will then be used for all the models built in Part 3. As with Part 1, use the spark-job.sh script to run the assignment4-part2.py job. As you will see in the schemes array, we are testing 7 different values for nTrees: 10, 30, 50, 100, 150, 200, and 250. When you run the job, the YARN stdout output will only show the progress of the job, not the actual AUC values.
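For context, the sketch below shows roughly what the job does for each nTrees value: 5-fold cross-validation with MLlib's RandomForest, timing each model and scoring AUC on the held-out fold. The fold splitting, file names, and bookkeeping here are assumptions based on the description above, not the script's exact code.

import time
from pyspark import SparkContext
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName='a4-part2-sketch')
data = MLUtils.loadLibSVMFile(sc, 'a4data/data_1000')  # assumed file name
folds = data.randomSplit([0.2] * 5, seed=42)

results = []
for n_trees in [10, 30, 50, 100, 150, 200, 250]:
    for i, test in enumerate(folds):
        train = sc.union([f for j, f in enumerate(folds) if j != i])
        start = time.time()
        model = RandomForest.trainClassifier(train, numClasses=2,
                                             categoricalFeaturesInfo={},
                                             numTrees=n_trees)
        elapsed = time.time() - start
        predictions = model.predict(test.map(lambda p: p.features))
        labels_and_preds = test.map(lambda p: p.label).zip(predictions)
        auc = BinaryClassificationMetrics(labels_and_preds.map(lambda lp: (lp[1], lp[0]))).areaUnderROC
        results.append((1000, n_trees, i, auc, elapsed))

# Writing the results as an RDD is why the output has no header and no guaranteed row order
sc.parallelize(results).map(lambda r: ','.join(str(x) for x in r)).saveAsTextFile('a4part2-results')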

After each run, the job saves an HDFS directory called a4part2-results in your /user folder. When the job is complete, you can download these results from HDFS (for example, with hadoop fs -get) and use them to interpret your results. Note that since the results are stored in an RDD, their order is not maintained. Due to this, a header row could not be included, but each row is organized as follows:

  • Dataset (Number of Features)
  • Number of Trees
  • Fold # (for CV)
  • AUC
  • Running Time (in seconds)

Before analyzing the results, make sure to average the AUC and sum the running time across all 5 folds for each nTrees value. In the report, explain your results and your justification for choosing a particular value as the best number of trees.
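As a sketch of that bookkeeping, the snippet below summarizes the downloaded results on your own machine (for example, after hadoop fs -get a4part2-results). It assumes the part files are comma-separated with the five columns listed above; adjust the parsing if your files differ.

import csv, glob
from collections import defaultdict

auc, secs = defaultdict(list), defaultdict(float)
for path in glob.glob('a4part2-results/part-*'):
    with open(path) as f:
        for features, n_trees, fold, a, t in csv.reader(f):
            auc[int(n_trees)].append(float(a))
            secs[int(n_trees)] += float(t)

for n in sorted(auc):
    print('nTrees=%d  mean AUC=%.3f  total time=%.0f seconds'
          % (n, sum(auc[n]) / len(auc[n]), secs[n]))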

This job will take about 1.5 hours to run. To reduce resource hogging, all jobs running for longer than 3 hours are automatically killed. If your job is killed because it passed the time limit, there is something wrong.

Part 3: Maximum Depth and Feature Vector Size

In this part, we will examine the performance of Random Forest (in terms of both AUC and running time) when the model receives datasets with different numbers of features, as well as when the maximum depth of each tree is increased.

As with Parts 1 and 2, use the spark-job.sh script to run the assignment4-part3.py job. As you will see in the schemes array, we are testing the 4 different datasets created in Part 1 (500, 1,000, 5,000, and 10,000 features) and 4 different values for maxDepth: 4, 8, 12, and 16. When you run the job, the YARN stdout output will only show the progress of the job, not the actual AUC values. You will need to fill in the value for nTrees based on your results from Part 2.
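Schematically, the job sweeps a grid over the four datasets and the four maxDepth values, something like the sketch below. A simplified single train/test split is shown for brevity; the real job uses the same 5-fold CV, timing, and AUC scoring as in the Part 2 sketch, and the file names here are assumptions.

import time
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName='a4-part3-sketch')
best_n_trees = 50  # placeholder: use the value you chose in Part 2

for num_features in [500, 1000, 5000, 10000]:
    data = MLUtils.loadLibSVMFile(sc, 'a4data/data_%d' % num_features)
    train, test = data.randomSplit([0.8, 0.2], seed=42)  # test would be scored for AUC as in Part 2
    for depth in [4, 8, 12, 16]:
        start = time.time()
        model = RandomForest.trainClassifier(train, numClasses=2,
                                             categoricalFeaturesInfo={},
                                             numTrees=best_n_trees,
                                             maxDepth=depth)
        print('%d features, maxDepth=%d: trained in %.0f seconds'
              % (num_features, depth, time.time() - start))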

After each run, the job saves an HDFS directory called a4part3-results in your /user folder. When the job is complete, you can download these results from HDFS and use them to interpret your results. Note that since the results are stored in an RDD, their order is not maintained. Due to this, a header row could not be included, but each row is organized as follows (a sketch for summarizing these columns follows the list):

  • Dataset (Number of Features)
  • Number of Trees
  • Maximum Depth
  • Fold # (for CV)
  • AUC
  • Running Time (in seconds)
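One convenient way to view these results locally is with pandas (assuming it is installed on your own machine and the part files are comma-separated with the six columns above):

import glob
import pandas as pd

cols = ['features', 'n_trees', 'max_depth', 'fold', 'auc', 'seconds']
df = pd.concat(pd.read_csv(p, header=None, names=cols)
               for p in glob.glob('a4part3-results/part-*'))

# Mean AUC across the 5 folds, laid out as features x maxDepth
print(df.pivot_table(values='auc', index='features', columns='max_depth', aggfunc='mean'))
# Total running time for each combination
print(df.pivot_table(values='seconds', index='features', columns='max_depth', aggfunc='sum'))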

Your report should contain a detailed analysis of both Parts 2 and 3. The analysis should include, but does not need to be limited to:

  •  How does the performance of the model change when the number of trees is increased (in terms of AUC and running time)?
  •  How does the performance of the model change when the maximum depth is increased (in terms of AUC and running time)?
  •  How does the performance of the model change when the number of features is increased (in terms of AUC and running time)?
  • Examine the relationship between the number of features and maximum depth. Provide some commentary for the results.

Your analysis should show an understanding of how the Random Forest learner works, as well as the nature of our classification task (Tweet sentiment with binary word vectors). These factors should be considered when analyzing the results.

This job will take about 5.5 hours to run. To reduce resource hogging, all jobs running for longer than 7 hours are automatically killed. If your job is killed because it passed the time limit, there is something wrong.

References

For more details about Spark's implementation of Random Forest, as well as the meaning of the various parameters, see the Spark MLlib documentation on tree ensembles.