Overview of this Lecture / week

Part of data preparation is exploring and investigating the data. You can discover a lot of information about what is happening in your data by using a variety of different analytical and visualisation techniques to look at the data in different ways.

This week we will have a look at Text Mining and Forecasting. These are very common approaches for analysing data and can be used as part of the data preparation stage (of CRISP-DM) for creating and enriching the data set.

Notes

Click here to download notes for Week 4 – Text Mining.

L12 - Text Mining

Click here to download notes for Week 4 – Forecasting.

L10 - Forecasting

Videos of Notes

Text Mining video

Forecasting video

Other Videos

Lab Exercises

Task 1 – Load a data set into SAS Enterprise Miner and explore the data

Find a data set from a data set repository. For example, see A Plethora of Data Set Repositories.
There are a great many of these.
Pick one of these repositories, find a data set, download it, load it into SAS EM and analyse the data (same as last week).

How to load your own data sets into SAS OnDemand
SAS Guide for loading your own data using SAS Studio

Task 2 – Text Mining & Word Cloud in R

Demo R script for building a word cloud

Text Analysis 101 : Document Classification

Or check out my blog post on Creating a WordCloud using Python
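
A word cloud is essentially a picture of word frequencies, so it can help to see the counting step on its own. The following is a minimal Python sketch of that step, not the demo R script or the blog post code linked above. The stop-word set and the sample sentence are small placeholders chosen for illustration; a real analysis would use a full stop-word list and your own text.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}


def word_frequencies(text, stop_words=STOP_WORDS, min_length=2):
    """Lower-case the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    return Counter(kept)


# Placeholder text; replace with the text of your own document.
sample = (
    "Text mining is the process of exploring text. "
    "Text mining and forecasting are common approaches to analysing data."
)

freqs = word_frequencies(sample)
for word, count in freqs.most_common(5):
    print(word, count)
```

The most frequent words returned by `most_common` are the ones a word cloud would draw largest.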

Task 3 – Find a website or some PDF documents you want to explore and perform text mining on these.

Use the code (and if needed expand it) to analyse 3 or 4 webpages from a company
or (maybe do both if you can)
Use this code (and if needed expand it) to analyse and/or compare some news stories from newspaper websites
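
When comparing several pages or news stories, raw counts are often dominated by words that every page shares. TF-IDF (see the Additional Reading Materials) is one way to highlight the words that are distinctive to each page. The sketch below implements one common, unsmoothed variant (term frequency times log(N / document frequency)) using only the Python standard library. Libraries typically use slightly different, smoothed formulas, and the three short "pages" are made-up placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / number of words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Made-up placeholder pages; replace with the text of the pages you scraped.
pages = [
    "our bank offers loans and savings",
    "our bank offers insurance and pensions",
    "our bank sponsors football",
]

scores = tf_idf(pages)
for i, page_scores in enumerate(scores, start=1):
    top = sorted(page_scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print("page", i, top)
```

Words that appear on every page (here "our" and "bank") score zero, so the ranking surfaces what is specific to each page.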

Work individually or in pairs.
Discuss the usefulness of WordClouds and how they can give you interesting insights on the topics covered on those websites.
Does the pattern of words match what you would expect?

What is the purpose of performing text mining on these?

What do you want to discover?

How can you use this information?

After performing text mining, explain what you discovered, what changes you would recommend, and what impact these changes might have (technical and non-technical).

What if you could add a Time Series element to the WordCloud? For example, for a particular news topic (such as a general election), how does the coverage change during the weeks when the parties and candidates are campaigning? A series of WordClouds might be able to show how the themes change and evolve throughout the campaign. The same idea applies to different Marketing strategies.
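
One possible way to prepare the data for such a series is to group the text by time period and count the words in each period separately; each period's counts could then feed one word cloud. The sketch below assumes the articles are available as (period label, text) pairs. The function name `top_words_by_period`, the period labels, the stop-word set and the headlines are all hypothetical placeholders, not real news data.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "and", "in", "of", "on", "the", "to"}


def top_words_by_period(articles, top_n=3):
    """Count words separately for each period.

    articles: iterable of (period_label, text) pairs.
    Returns {period_label: [(word, count), ...]} with the top_n words per period.
    """
    counts = defaultdict(Counter)
    for period, text in articles:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[period].update(t for t in tokens if t not in STOP_WORDS)
    return {period: c.most_common(top_n) for period, c in counts.items()}


# Made-up placeholder headlines; replace with text you have collected yourself.
articles = [
    ("week 1", "Housing and tax dominate the debate on housing"),
    ("week 1", "Parties clash over housing"),
    ("week 2", "Health and the health budget lead the debate"),
    ("week 2", "Candidates promise health spending"),
]

result = top_words_by_period(articles)
for period, words in result.items():
    print(period, words)
```

Comparing the top words from one period to the next gives a simple view of how the themes shift over time.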

A company can see if certain words or phrases are appearing on their websites and related materials. What impact would changing these words or phrases have?

Task 4 – Forecasting

Go to my notes pages on Forecasting using Python.

Follow and complete all the steps.

Task 5 – Select a different data set for Forecasting

Select a suitable data set from one of these sources. Use the sample code as the basis for performing Forecasting on this data.

Follow the steps from my Python examples to see if you can Forecast values using the same approaches and algorithms.

You may need to make some changes to the code based on the data set used.
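
Whatever data set you choose, it is worth comparing any forecast against a simple baseline on values that were held back. The sketch below is not the code from my notes pages and does not claim to use the same algorithms. It shows two basic baselines (a naive "last value" forecast and simple exponential smoothing) written in plain Python, evaluated with mean absolute error on a hold-out. The series values and the `alpha` setting are made-up placeholders.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon


def simple_exponential_smoothing(history, horizon, alpha=0.5):
    """Smooth the series into a single level, then use it as a flat forecast.

    alpha close to 1 follows recent values; alpha close to 0 smooths heavily.
    """
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up placeholder series; replace with a column from your own data set.
series = [10, 12, 13, 12, 15, 16, 18, 17, 19, 21]

# Hold back the last two observations to check the forecasts against.
train, test = series[:-2], series[-2:]

naive = naive_forecast(train, len(test))
ses = simple_exponential_smoothing(train, len(test), alpha=0.5)

print("naive MAE:", mean_absolute_error(test, naive))
print("SES MAE:  ", mean_absolute_error(test, ses))
```

Both baselines produce a flat forecast, so they will lag behind a series with a clear trend or seasonality. That gap is the reason to try the richer approaches on your data set.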

What to prepare for next week

Make sure you are keeping up with all the lab exercises.

We will be back using SAS Enterprise Miner next week.

Additional Reading Materials

Text Analysis 101 : Document Classification

WTF is TF-IDF? – KDnuggets

TF-IDF Tutorial 1 – using Python

TF-IDF Tutorial 2 – using Python

Sentiment Analysis in R Tutorial – Kaggle

Machine Learning basics for a Newbie

Sentiment Analysis using R

Comprehensive set of English Stop Words lists

Introduction to Time Series Analysis and Forecasting

Forecasting s-curves is hard

Try the Sktime library in Python