Welcome to the first week of the Programming for Big Data Module.

This week we look at why newer technologies such as Hadoop became necessary in the age of Big Data, who uses them, and why. An important consideration is the scale of the data: Hadoop is designed for massive volumes of data, far larger than most people expect, and anything smaller can usually be handled easily with traditional methods. We will review the Hadoop ecosystem, then move on to the main components, starting with HDFS, before moving on to the lab work.

FAQ: Check out the questions and suggestions from previous students.


Click here to download the notes.


Managing your VM
Make sure you regularly clear down the temp files and the files in bin.
The VM has a small disk, so careful management of the available space is necessary.
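
If you want a quick way to see how much space is left before and after a clean-up, a small script along these lines may help. This is only a sketch: it assumes Python 3 is available on the VM, which is not confirmed by these notes, and the helper names `disk_report` and `is_low_on_space` and the 2 GiB threshold are illustrative choices rather than part of the course material.

```python
import shutil
import tempfile


def disk_report(path="/"):
    """Return (total, used, free) in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return tuple(round(v / gib, 2) for v in (usage.total, usage.used, usage.free))


def is_low_on_space(path="/", min_free_gib=2.0):
    """True when free space is below the threshold (2 GiB is an assumed value)."""
    return disk_report(path)[2] < min_free_gib


if __name__ == "__main__":
    total, used, free = disk_report("/")
    print(f"total={total} GiB  used={used} GiB  free={free} GiB")
    # Shows where the system keeps temporary files; nothing is deleted here.
    print("temp directory:", tempfile.gettempdir())
    if is_low_on_space("/"):
        print("Low on space: consider clearing temp files.")
```

The script only reports usage and never deletes anything, so it is safe to run at any time. Adjust the threshold to suit the disk size you have given the VM.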

If you need to change the size of the disk for the VM, you can follow the instructions here. Make sure you follow them very carefully.

IMPORTANT: Do not update any of the software on the VM. Do not update the OS, the version of Java, etc. No updates.

Lab Exercises (complete all exercises before next class)

Exercise 0 – You should have already completed these tasks

Install VirtualBox software.

Download the pre-built Virtual Machine (VM). I will show you how to install and use this VM during the first week's class.

This is an 8GB download, and additional storage space will be required. You need a minimum of 4GB RAM on your laptop to run the VM. Ideally you should have 8GB RAM, so that you can allocate more memory to the VM.

Alternatively, you can create your own VM and install Hadoop yourself.

Docker: If you like working with Docker, try out the pre-built Docker images on the Docker Hub Store.

IMPORTANT: There are lots of options for having your own Hadoop environment. You can use one from AWS, GCP, Databricks, etc. The VM provided here is just one option.

Exercise 1 – Setup the Hadoop VM

Exercise 1 – Notes

Exercise 2 – Explore the Hadoop environment

Exercise 2 – Notes

Exercise 3 – Java Environment Setup – needed for Map-Reduce next week

Exercise 3 – Notes

Additional Reading

Google white paper: Google File System => HDFS
Google white paper: MapReduce

Video: What is Hadoop & Map-Reduce

Hadoop Website
Hadoop Documentation
Hadoop APIs
HDFS Commands Cheat Sheet
Hadoop in Action
Relational databases are far from dead — just ask Facebook
Microsoft CEO Satya Nadella reveals which product he wishes the company had developed first