Set Up the Course Virtual Environment

Assignments 1-3 will use a virtual machine (VM) made specifically for this course, with Hadoop, Spark, and H2O already installed and configured. Assignment 4 will be performed on a real Hadoop cluster, but most of what you do in this VM will carry over to the cluster. The software versions in the VM are:

  • Hadoop: 2.6.0
  • Spark: 1.4.1
  • H2O: 3.0.1.7
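
If you want to double-check these versions once the VM is running, the following commands should print them (assuming Hadoop is on the PATH and Spark is installed in /home/hadoop/spark, as described below):

hadoop version
/home/hadoop/spark/bin/spark-submit --version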

Download and Install VM

Download both VirtualBox AND the VirtualBox Extension Pack. From the VirtualBox downloads page, get:

  • VirtualBox platform packages (choose the proper package for your OS).
  • VirtualBox 5.0.2 Oracle VM VirtualBox Extension Pack (the same pack works for all platforms).

Download the VM image (about a 4.5GB download) here: https://drive.google.com/open?id=0Bxl7nF3lofTJcUZrWFUyT05YRFE

Open VirtualBox, click “File -> Import Appliance”, and open the .ova file that you downloaded. Once the appliance is imported, check the VM settings and make sure at least 2GB of RAM is allocated to the machine. If your computer can handle it, you can allocate more RAM; the assignments may run a bit faster. You can also set up a shared folder so you can pass files between your machine and the VM. Once the machine is started, you can enable a shared clipboard from the “Devices -> Shared Clipboard” menu in VirtualBox, which allows copy/paste between your machine and the VM.
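
If you prefer the command line, the same settings can be changed with VBoxManage while the VM is powered off. This is just a sketch; substitute the actual name of your imported VM (“BigDataVM” below is a placeholder) and a real host folder path:

# allocate 4GB of RAM (placeholder VM name)
VBoxManage modifyvm "BigDataVM" --memory 4096
# share a host folder with the VM
VBoxManage sharedfolder add "BigDataVM" --name shared --hostpath /path/on/host --automount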

The login credentials are:

  • user: hadoop
  • password: password

The root password is also “password”, but you shouldn’t need the root account for any of the assignments.

When you close the machine, you can choose to “Save the machine state” so you can pick up where you left off next time you launch it.
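
The same can be done from the command line if you ever need it (again with a placeholder VM name):

VBoxManage controlvm "BigDataVM" savestate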

Run Software

All the software for the assignments is installed in the /home/hadoop/ folder.

Hadoop

For all the assignments, you will need to have Hadoop running. To start Hadoop (HDFS and YARN), run the following from the terminal:

start-dfs.sh
start-yarn.sh
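
To verify that everything started, list the running Java daemons with jps. On this single-node setup you should see processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (the exact list may vary slightly):

jps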

To stop Hadoop, run:

stop-dfs.sh
stop-yarn.sh

You will want to stop Hadoop if you are shutting down the VM completely; otherwise, you can leave Hadoop running all the time. There are several bookmarks in Firefox that you can use to view the status of Hadoop.
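
You can also confirm from the terminal that HDFS is responding by listing the root of the filesystem:

hdfs dfs -ls /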

Spark

There are many ways to run Spark. If you look in the spark/bin folder, you will find scripts that launch interactive consoles for various programming languages:

  • spark-shell: Scala
  • pyspark: Python
  • sparkR: R
  • spark-sql: SQL
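
For example, to open the Python console directly for quick experiments outside the notebook (assuming Spark is installed in /home/hadoop/spark):

/home/hadoop/spark/bin/pyspark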

In this class, we will only be using pyspark, primarily within a Jupyter notebook. To launch pyspark with Jupyter, navigate to the /home/hadoop folder and run the spark_notebook script:

cd /home/hadoop
./spark_notebook.sh

The Jupyter home page should launch automatically in Firefox, but if it does not, there is a bookmark for it. Create a new notebook by pressing “New -> Python 2” on the Jupyter home page. If you have not used the notebook interface before, click “Help -> User Interface Tour” to get familiar with it.

H2O

To launch H2O, navigate to the /home/hadoop/h2o folder and run the H2O jar file on Hadoop. If you have allocated more RAM to your VM, you can increase the -mapperXmx option:

cd /home/hadoop/h2o
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 1g -output testH2o

NOTE: If you kill the H2O process and want to start it again, you have to remove the testH2o directory from HDFS or provide a different output directory name.
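
A quick way to do that from the terminal (the relative path resolves to /user/hadoop/testH2o in HDFS):

hdfs dfs -rm -r testH2o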

Once you see the line “Blocking until the H2O cluster shuts down...”, H2O is running. Open Firefox and click the “H2O Flow” bookmark to open H2O.

Since the VM is limited on resources, do not run Spark and H2O at the same time.

Code

All sample code, as well as starter code for the homework assignments, is in this Git repository: https://bitbucket.org/rikturr/fau-bigdatacourse/.

You will need to sign up for a BitBucket account with your FAU email address; you should have received an email invitation to join the repository. Even if you already have a BitBucket account, you must create one associated with your FAU email. Feel free to fork the repository if you would like to keep your assignments under version control; you will not be able to commit to the fau-bigdatacourse repo or branch from it.
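
If you do fork, a typical workflow is to clone your fork and keep the course repo as an “upstream” remote so you can still pull updates. A sketch, with <your-username> as a placeholder for your BitBucket username:

git clone https://bitbucket.org/<your-username>/fau-bigdatacourse
cd fau-bigdatacourse
git remote add upstream https://bitbucket.org/rikturr/fau-bigdatacourse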

To get the code on your VM, navigate to your home folder, then clone the repository:

cd /home/hadoop
git clone https://bitbucket.org/rikturr/fau-bigdatacourse

You will be prompted to log into your BitBucket account. Cloning creates a folder called fau-bigdatacourse in the home folder, which contains the starter code for all the assignments. You may need to change the folder’s permissions to run certain scripts:

chmod -R 777 /home/hadoop/fau-bigdatacourse

The repository may be updated for future assignments, in which case you will need to pull to update your local files. Note that git pull must be run from inside the repository folder:

cd /home/hadoop/fau-bigdatacourse
git pull