Introduction to Spark

This week we start working with Spark. Spark has become one of the most popular engines for processing and analyzing big data. It builds on many of the ideas from Hadoop. We will look at some of these ideas, and at some of the components of the Spark architecture.

We will work through a live demo of PySpark, the Python interface for Spark, which we will use for most of our examples during this part of the module. Spark can also be used from Scala, R and Java.

You will need a Spark environment. You can create your own by installing Spark on your own laptop/PC. You don’t need Hadoop to run Spark. Spark was originally developed to work with Hadoop, but it can now run on a variety of different distributed computing frameworks. As an alternative to installing Spark on your own laptop/PC, you can use a free cloud account on the Databricks website. Databricks is the company founded by the team who created Spark.
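
Once Spark is installed, a quick way to confirm that the environment works is to start a local session and run one small job. The following is a minimal sketch, not part of the lab notes: it assumes PySpark was installed with pip and that a Java runtime is available, and the function name, application name and sample data are made up for illustration. It uses local mode, so no Hadoop cluster is involved.

```python
# Minimal environment check. Assumptions: PySpark installed via `pip install pyspark`
# and a Java runtime on the machine. No Hadoop installation is needed.

def check_spark_environment():
    """Return (spark_version, row_count), or None if Spark cannot be started."""
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        print("PySpark is not installed; try `pip install pyspark` or use Databricks.")
        return None

    try:
        # local[*] runs Spark inside this process, using all available cores.
        spark = (
            SparkSession.builder
            .master("local[*]")
            .appName("spark-env-check")  # hypothetical application name
            .getOrCreate()
        )
    except Exception as exc:  # for example, no Java runtime was found
        print(f"Spark could not start: {exc}")
        return None

    try:
        # Build a tiny DataFrame from illustrative data and run one action on it.
        df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
        return spark.version, df.filter(df.id > 1).count()
    finally:
        spark.stop()


result = check_spark_environment()
print(result)
```

If everything is set up, this prints the Spark version together with the number of rows that passed the filter. In a Databricks notebook a session named spark is already provided, so the builder and stop() calls are not needed there.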


Click here to download the Notes for this week.



will be available soon!

Lab Exercises

Download the Spark Lab-1 Notes & Exercises.

Files needed for Lab Exercise

Additional Materials & Reading

Apache Spark
Scala Cheat Sheet
PySpark – Complete Guide on DataFrame Operations in PySpark
PySpark – Dataframe Quickstart

10-minute Spark Tutorials