Overview of this Lecture / Week

This week we will look at clustering, also called data clustering. This is a machine learning technique that tries to identify related groups of records or cases. In other words, it asks what makes one record more similar or less similar to the others. Clustering is an undirected, or unsupervised, machine learning method.

Clustering can be used to identify similar records. It can also be used to help with missing data treatment.
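To make both uses concrete, here is a minimal sketch in plain Python. It assumes k-means as the clustering method (the notes may cover other algorithms), and the customer records, the `kmeans` function and the `impute_from_cluster` helper are all made up for illustration. It groups similar records and then fills a missing value from the nearest cluster.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns (centroids, labels)."""
    # Assumption: start from the first k points so the run is repeatable.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def impute_from_cluster(partial, centroids):
    """Fill None entries using the centroid nearest on the known fields."""
    known = [i for i, v in enumerate(partial) if v is not None]
    nearest = min(
        centroids,
        key=lambda c: math.dist([partial[i] for i in known],
                                [c[i] for i in known]),
    )
    return [nearest[i] if v is None else v for i, v in enumerate(partial)]

# Hypothetical records: (age, annual spend in thousands).
records = [(22, 1.0), (25, 1.5), (24, 1.2), (58, 9.0), (61, 8.5), (60, 9.5)]
centroids, labels = kmeans(records, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]

# A record with a missing spend value is filled from its nearest cluster.
filled = impute_from_cluster((59, None), centroids)
print(filled)  # [59, 9.0]
```

In practice you would normally scale the features first and use a library implementation. This version is only meant to show the idea.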

Sometimes the terms Segmentation and Clustering are used interchangeably. This is incorrect. They are different concepts.

Clustering will tell us which case records are similar, but it will not tell us what this grouping or similarity means. It will be your job as the data scientist to investigate the outputs of clustering and to determine the meaning of each cluster. This is not an easy task and requires a lot of domain and business knowledge. You need to apply this knowledge to understand the outputs.
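One common starting point for this investigation is to profile each cluster, that is, to summarise its size and typical feature values. The sketch below is one possible way to do that in plain Python. The records, the cluster labels, the feature names and the `profile_clusters` helper are all hypothetical; the labels stand in for the output of whatever clustering algorithm you ran.

```python
from collections import defaultdict
from statistics import mean

def profile_clusters(records, labels, feature_names):
    """Summarise each cluster by its size and per-feature mean."""
    groups = defaultdict(list)
    for rec, lab in zip(records, labels):
        groups[lab].append(rec)
    profile = {}
    for lab, members in sorted(groups.items()):
        profile[lab] = {"size": len(members)}
        # zip(*members) turns the rows of a cluster into feature columns.
        for name, column in zip(feature_names, zip(*members)):
            profile[lab][name] = mean(column)
    return profile

# Hypothetical records: (age, annual spend in thousands).
records = [(22, 1.0), (25, 1.5), (24, 1.2), (58, 9.0), (61, 8.5), (60, 9.5)]
# Hypothetical cluster labels, as if produced by a clustering algorithm.
labels = [0, 0, 0, 1, 1, 1]

profile = profile_clusters(records, labels, ["age", "spend"])
for lab, summary in profile.items():
    print(lab, summary)
```

The profile only describes the clusters. Deciding that one group is, say, "younger, low-spend customers" and whether that matters to the business is still your call.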


Click here to download the notes.


Videos of Notes

Planning my Summer Vacation using Clustering

Lab Exercises

You have two options for the lab work this week. You can work through the Clustering lab exercises and/or use the lab time to work on your assignment.

I would suggest you use the lab time to start or make progress on your assignment. Then come back during the week and complete the Clustering lab exercises.

Click here to download the Clustering Lab Exercises.


Additional Reading Materials

Tan Book Chapter

ACM Review Paper on Data Clustering