Assignment 3

Extracting Features from Text and Modeling High-Dimensional Data using Spark

Due October 25th

This assignment involves modeling tweet sentiment data, but we will use Spark to generate datasets from the raw text. The data for this assignment is balanced, so we will examine the effects of different feature extraction and feature selection methods.

Setup

Update the course folder on your VM by pulling the latest code from the git repository. This downloads the dataset (data/assignment3_data.tsv) and the .ipynb notebooks for this assignment (in the notebooks folder). Most of the instructions for each part are inside the notebooks in the form of markdown cells. Read all of the content in the notebooks carefully: some sections will require you to fill in code, and other sections will require you to include analysis in your PDF report.

Refer to the course setup page for instructions on how to launch Spark inside a Jupyter notebook.

After you run the experiments, submit the notebooks (.ipynb) along with a report as detailed on the course home page. If you are comfortable with Python and Markdown, you can write all your analysis from inside the notebooks and just submit those. If you do that, download the notebooks as HTML files and submit those in addition to the .ipynb files.

Part 1: Extracting Features & Feature Selection

Open the assignment3-part1.ipynb notebook and follow the instructions. This notebook outlines the steps necessary to extract binary word vectors from the tweet data, as well as how to perform classification using Naive Bayes with and without Chi-Squared feature selection.

The first cell in the notebook tries to import a couple of packages that will be needed for the assignment. These packages are not preinstalled on your VM, so you will need to follow the instructions in the notebook to install them from the terminal.

The notebook will require you to fill in code in several code blocks. You need to write code whenever you see “<FILL IN>”. Each section that requires code has an associated problem number, like “1b”. You can refer to specific parts of the assignment in this way when asking questions on Piazza. Additionally, after working through the notebook, you should answer the following and include it in your report:

Is the AUC of the model with feature selection greater than the AUC of the model without it? Do these results prove that feature selection improves (or does not improve) Naive Bayes? If not, explain why not.

NOTE: It may help to restart the kernel between Parts 1 and 2. This will release the resources used by the Part 1 notebook, which should make the Part 2 notebook run faster. Click “Kernel -> Restart” from inside a Jupyter notebook to do this. Alternatively, shut down Jupyter entirely and restart it.

Part 2: Experimentation

Open the assignment3-part2.ipynb notebook and follow the instructions. This notebook will run an experiment to see the effects of different classifiers, word vector sizes, and numbers of features retained by feature selection. This part will require detailed analysis of your experimental results:

Include a detailed analysis of these results in your report. Your analysis should cover, but need not be limited to, the following questions:

  • What is the effect of changing the size of the vocabulary/word vector?
  • What is the effect of changing the classifier?
  • What is the effect of Chi-Squared feature selection for various word vector sizes?
  • How do the results change for different combinations of these parameters?

Include the results.csv file in your homework submission zip file.