Overview of this Lecture / week

This week we will look at some of the typical data preparation steps that you will need to perform. It would be great if data was in a clean state. Sadly, this is never true. You are always going to have inconsistencies in the data. You will also have attributes/variables/columns/features that contain null values. What does a null value really mean? For machine learning we never want to have null values, but what can we do about them? Well, we have a number of ways of working out what a possible value should be. Other things we have to do include integrating data from different sources, doing some dimensionality reduction, and so on.
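As a small illustration of "working out what a possible value should be", here is a minimal sketch using pandas. The toy data and column names are made up for this example, and median/mode imputation is only one of several possible approaches, not necessarily the right one for your data.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Dublin", "Cork", None, "Dublin"],
})

# Count the nulls per column before deciding what to do with them.
null_counts = df.isna().sum()
print(null_counts)

imputed = df.copy()
# Numeric column: replace nulls with the median of the observed values.
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
# Categorical column: replace nulls with the most frequent value (the mode).
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])
print(imputed)
```

Whether a null should be filled in at all depends on what it means in your domain. A missing age might be safe to estimate, but a null that really means "not applicable" may be better kept as its own category.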

There are lots and lots of things you can do to prepare and format the data. You will cover many different techniques in other modules. No one approach is correct but with practice you will learn what works best for your data and scenario. We will cover the main ones in this module.


Click here to download Week 3 notes.

L3 - DM Data Prep

Videos of Notes

Related Videos

Lab Exercises

This week's lab involves loading data into your SAS Enterprise Miner workspace and using the features in the tool to explore the data. Remember, all Data Science tools and languages just give you more data. It is your job as the data scientist to put meaning to it, by taking your domain and business knowledge and applying it to the statistical outputs from exploring the data.

Task 1 (you should have completed this last week. Skip to Task 2 if already completed)

Create a SAS EM Project for your Lab work. Call it ‘My Lab Work’

Create and Open a SAS EM Project

Task 2 – Access the SAS data sets

How to access the SAS Data Sets for your Lab work

Task 3 – Accessing and Analysing your data

Lab 1 – Accessing and Analysing your Data – start at Page 18

Refer back to Task 2 for the location of all data sets for the SAS exercises.

Task 4 – Optional – Load your own data into SAS Enterprise Miner

You can load your own data into SAS Enterprise Miner. The following two guides will show you how to do this.

How to load your own data sets into SAS OnDemand
SAS Guide for loading your own data using SAS Studio

Task 5 – Optional – Using Python or R to Analyse and Prepare Data for Data Mining

In this exercise task you are going to take a data set, analyse it and prepare it for data mining. The main tasks include:

      • Access and download the data set
      • Load the data set into a data frame
      • Perform some descriptive analytics on the data set
      • Perform some data transformations, such as creating new features, re-coding categorical variables, normalization, one-hot encoding, etc.
      • Check for imbalance in the data set and create a balanced data set using a variety of methods
        • You may end up having multiple versions of the original data set, based on the different methods used
      • Create training and test data sets
      • Verify the training and test data sets to ensure they have similar data distributions.
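
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive recipe: it uses a synthetic data frame in place of a downloaded data set, all column names are hypothetical, and random undersampling is just one of the balancing methods you could try. It also splits before balancing so that the test set keeps the original class ratio; that ordering is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a downloaded data set (all names are hypothetical).
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, n),
    "region": rng.choice(["north", "south", "east"], n),
    "target": rng.choice([0, 1], n, p=[0.9, 0.1]),
})

# Descriptive analytics: summary statistics and class counts.
print(df.describe())
print(df["target"].value_counts())

# Transformations: min-max normalization and one-hot encoding.
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)
df = pd.get_dummies(df, columns=["region"])

# Train/test split: sample 80% of the rows within each class,
# so both sets keep roughly the same class ratio.
train = df.groupby("target").sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Balance the training set by randomly undersampling every class
# down to the size of the smallest class.
minority_size = train["target"].value_counts().min()
balanced = train.groupby("target").sample(n=minority_size, random_state=42)

# Verify that train and test have similar distributions.
print("Positive rate (train, test):",
      train["target"].mean(), test["target"].mean())
print(train["income_norm"].describe())
print(test["income_norm"].describe())
```

Oversampling the minority class instead of undersampling the majority class would give you a second version of the training set to compare against.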

Go to Kaggle and find a data set to use. Check out their list of data sets. Then process the data set using the steps outlined above. You can use the following examples from my Python for ML notes.

What to prepare for next week

Make sure you complete all the steps in the lab document before next week.

The future lab exercises for SAS Enterprise Miner are dependent on you completing the SAS tasks for this week.

Next week will involve some exercises using R & Python. Make sure you have installed R and Python, and are familiar with the environment, installing packages and writing some basic code.

Additional Reading Materials

Han & Kamber Book Chapter – Data Processing

3 Epic Data Quality Blunders

One-page Survival Guide to Data Science with R

Reducing Dimensionality from Dimensionality Reduction Techniques

Some interesting and inaccurate Correlations

Data science is different now

Data Visualization 101 – Common Charts and When to Use Them

Seven Techniques for Data Dimensionality Reduction