Assignment 1

Course Environment Setup and Basic Analytics Tasks

Due September 20th

In this assignment, you will set up your course environment and make sure that every component works, and you will perform basic tasks with Hadoop to get an understanding of how it works. You will also build and analyze a Random Forest classification model.

To get started, download and set up the course virtual environment according to these instructions.

Part 1 – Hadoop/HDFS

Launch HDFS and YARN. Verify that HDFS and YARN are running by executing the following commands. Include the console output in your report.

hdfs dfsadmin -report
yarn node -list

Explore management tools

  • Launch Firefox and look through the “Namenode Information”, “Browse HDFS”, and “Resource Manager” bookmarks
  • Include a screenshot of the “Datanode Information” page to verify that HDFS is running
  • Include a screenshot of the “All Applications” page to verify that YARN is running

Perform the following in the terminal and capture the outputs to include in your report (example commands are sketched after the list).

  • Create an HDFS directory called data in the /user/hadoop/ folder
  • Put the tweet_sentiment_2000.csv and tweet_sentiment_2000_wordvec.csv data files from the course repository on the local filesystem onto HDFS, into the /user/hadoop/data/ folder
  • What is the size of the files on HDFS? Verify that it matches the size of your local files. Print a sample of one of the HDFS files to the console using the -tail command (make sure you read the HDFS copy, not the local file).
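
A possible set of commands for these steps is sketched below, assuming the two CSV files are in your current local directory (adjust the paths as needed):

# create the data directory on HDFS
hdfs dfs -mkdir /user/hadoop/data
# copy the two local CSV files onto HDFS
hdfs dfs -put tweet_sentiment_2000.csv tweet_sentiment_2000_wordvec.csv /user/hadoop/data/
# check the file sizes on HDFS (compare with ls -lh on the local copies)
hdfs dfs -du -h /user/hadoop/data/
# print the last kilobyte of the HDFS copy to the console
hdfs dfs -tail /user/hadoop/data/tweet_sentiment_2000.csv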

The two text files are a sample of the Sentiment140 dataset created at Stanford. They contain tweets labeled by sentiment (either positive or negative). The tweet_sentiment_2000.csv file contains the tweets in their raw text form, along with the sentiment label. The tweet_sentiment_2000_wordvec.csv file contains a binary word vector for each instance, built from the top 1,000 words used across the 2,000 instances. There are 1,000 features, each corresponding to an individual word (“T” meaning the word is present, “F” meaning it is not), along with the sentiment class label. Normally such word vectors are encoded as 0/1, but using letters makes the file easier to load into H2O for the data mining portion of this assignment. There will be more about text data and generating features from text in Assignment 3.
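
As a toy illustration of this encoding, the snippet below maps a tweet to a row of “T”/“F” presence flags using a made-up five-word vocabulary (the real file uses the top 1,000 words, so this is only a sketch of the idea):

# Hypothetical five-word vocabulary, for illustration only
vocabulary = ["good", "bad", "love", "hate", "day"]

tweet = "what a good day"
tokens = set(tweet.lower().split())

# "T" if the vocabulary word appears in the tweet, "F" otherwise
word_vector = ["T" if word in tokens else "F" for word in vocabulary]
print(word_vector)  # ['T', 'F', 'F', 'F', 'T']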

Part 2 – MapReduce

Make sure that you have cloned the course git repository in your home directory. Navigate to the fau-bigdatacourse/python/mapreduce_wordcount directory to run the MapReduce word count program.

Use the run.sh script to run a word count on the tweet data file (tweet_sentiment_2000.csv). Modify the run.sh file and fill in the -input and -output arguments with the proper HDFS paths. NOTE: If you want to run the program more than once, you will need to remove or rename the output directory on HDFS. Use the Unix time command to measure how long the program takes to execute:
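
As a sketch, the two arguments in run.sh might be filled in as follows, where /user/hadoop/wordcount_output is a hypothetical output path (use whatever path you choose); the output directory can be removed between runs with the last command:

# hypothetical argument values inside run.sh
-input /user/hadoop/data/tweet_sentiment_2000.csv
-output /user/hadoop/wordcount_output

# remove the output directory before re-running the job
hdfs dfs -rm -r /user/hadoop/wordcount_output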

time ./run.sh
  • What does the console output of the MapReduce program look like?
  • The last 3 lines of output describe how long it took to execute the program. The line starting with “real” indicates the actual wall clock time that the program took to execute. How long did the word count program take?
  • Provide a sample of the word count results (the last few lines of the part-00000 file on HDFS) in your report; a command for this is sketched after this list.
  • Look through the results of the word count program. How could the word count results be made more accurate?
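
Assuming the hypothetical output path used above, the end of the results file can be printed with:

hdfs dfs -tail /user/hadoop/wordcount_output/part-00000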

Part 3 – Spark

Navigate to your home folder (/home/hadoop). Verify that Spark is running by launching the pyspark console. Enter sc in the console to show that the SparkContext is loaded.

Run the following lines in the console to verify that Spark is working properly. Capture the console output and include it in your report.

data = sc.textFile("hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv")
data.count()

Verify that the Jupyter Python notebook is running. Launch Spark with Jupyter notebook functionality using the spark_notebook.sh script.

  • Open the fau-bigdatacourse/notebooks/wordcount.ipynb file and execute each cell in the notebook (a sketch of a typical Spark word count appears after this list)
  • The wordcount output will be on HDFS in /user/hadoop/spark_wordcount
  • Download the notebook using “File -> Download As -> IPython Notebook (.ipynb)” and submit that as part of your homework zip file.
  • The output of the final block will display the total wall clock execution time of the wordcount. How long does it take? Compare this to the execution time of the MapReduce program. What is the cause of the difference in execution time?
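
For reference, a minimal sketch of a typical Spark word count is shown below. It assumes the SparkContext sc provided by the notebook and the standard flatMap/reduceByKey pattern; the course notebook may differ in its details.

import time

start = time.time()

# read the tweet file from HDFS
lines = sc.textFile("hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000.csv")

# split each line into words, pair each word with a count of 1, and sum the counts
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# write the results to HDFS (remove this directory first if re-running)
counts.saveAsTextFile("hdfs://localhost:9000/user/hadoop/spark_wordcount")

print("Wall clock time: %.1f seconds" % (time.time() - start))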

Part 4 – H2O

Launch H2O.

  • Once H2O opens, click “Import Files” and type hdfs://localhost:9000/user/hadoop/data/ in the search box. Load the tweet_sentiment_2000_wordvec.csv file by clicking the plus sign next to it, click “Import”, and then parse the file.
  • Make sure the “First row contains column names” radio button is selected under the “Column Headers” section, and then click Parse at the bottom of the cell. Once parsing is done, click View.
  • Click “Build Model”, select “Distributed RF”, change the following parameters, and then click “Build Model” (a rough equivalent of these steps in the h2o Python API is sketched after this list):
    • nfolds: 2
    • response_column: sentiment (last in the list)
    • ntrees: 3
    • max_depth: 10
  • Look at the “ROC curve – Cross Validation Metrics” section. What is the AUC value for this model?
  • Change the ntrees and max_depth parameters until you get a higher AUC value. What values did you select, and why do you think they improve the classification performance?
  • Save your H2O Flow by editing the notebook name at the top of the screen, then clicking the save button. Click “Flow -> Download this Flow” from the menu bar to download the notebook. The notebook will also be saved in the /user/hadoop/h2oflows/notebook directory on HDFS. Attach this notebook as part of your homework zip file.
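
The assignment only requires the Flow UI, but for reference a rough equivalent of these steps in the h2o Python API might look like the sketch below (this assumes the h2o Python package is available in the course environment):

import h2o
from h2o.estimators import H2ORandomForestEstimator

# connect to the running H2O instance (default host and port assumed)
h2o.init()

# import the word-vector file from HDFS; the first row supplies the column names
tweets = h2o.import_file("hdfs://localhost:9000/user/hadoop/data/tweet_sentiment_2000_wordvec.csv")

# same settings as in Flow: 2-fold cross validation, 3 trees, maximum depth 10
rf = H2ORandomForestEstimator(nfolds=2, ntrees=3, max_depth=10)
rf.train(y="sentiment", training_frame=tweets)

# cross-validation AUC, comparable to the "ROC curve - Cross Validation Metrics" section in Flow
print(rf.auc(xval=True))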