Welcome to the first week of the Programming for Big Data Module.
This week we will look at why newer technologies, such as Hadoop, became necessary in the age of Big Data, and at who uses them and why. An important consideration is the scale of the data: Hadoop is designed for massive volumes of data, far beyond what a single machine can comfortably store or process. Anything smaller can usually be handled with traditional methods. We will review the Hadoop ecosystem, then move on to the main components, starting with HDFS, before moving on to the lab work.
Managing your VM
Make sure you regularly clear down the temp files and the files in bin.
The VM has a small disk size and careful management of the available space is necessary.
If you need to change the size of the disk for the VM, you can follow the instructions here. Make sure you follow them very carefully.
IMPORTANT: Do not update any of the software on the VM. Do not update the OS, or version of Java, etc. No Updates.
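Because the VM disk is small, it can help to check how much space is left and which temp files are taking up the most room before you start clearing down. The following is a minimal sketch using only the Python standard library. The helper names `free_space_gb` and `largest_files` are hypothetical, and the use of the system temp directory is an assumption; point it at whichever directories fill up on your VM.

```python
import shutil
import tempfile
from pathlib import Path


def free_space_gb(path="/"):
    """Return the free space, in GB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def largest_files(directory, top_n=5):
    """Return up to `top_n` (size_in_bytes, path) pairs, largest file first."""
    sizes = []
    for entry in Path(directory).rglob("*"):
        try:
            if entry.is_file() and not entry.is_symlink():
                sizes.append((entry.stat().st_size, entry))
        except OSError:
            # Files can disappear or be unreadable while scanning; skip them.
            continue
    return sorted(sizes, key=lambda pair: pair[0], reverse=True)[:top_n]


if __name__ == "__main__":
    print(f"Free space on /: {free_space_gb('/'):.1f} GB")
    temp_dir = tempfile.gettempdir()  # assumption: temp files live here
    for size, path in largest_files(temp_dir):
        print(f"{size / 1024**2:8.1f} MB  {path}")
```

The script only reports; it does not delete anything, so you can review the list and remove files yourself.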
Lab Exercises (complete all exercises before next class)
Exercise 0 – You should have already completed these tasks
The VM is an 8 GB download, and additional storage space will be required once it is installed. You need a minimum of 4 GB of RAM on your laptop to run the VM. Ideally you should have 8 GB of RAM, which allows you to allocate more memory to the VM.
Alternatively, you can create your own VM and install Hadoop yourself.
Docker: If you like working with Docker, try out the pre-built Docker images on the Docker Hub Store.
IMPORTANT: There are lots of options for having your own Hadoop environment. You can use one from AWS, GCP, Databricks, etc. The VM provided here is just one option.
Exercise 1 – Setup the Hadoop VM
Exercise 2 – Explore the Hadoop environment
Exercise 3 – Java Environment Setup – needed for MapReduce next week
HDFS Commands Cheat Sheet
Hadoop in Action
Relational databases are far from dead — just ask Facebook
Microsoft CEO Satya Nadella reveals which product he wishes the company had developed first