1. Setup

The first thing you need to do is install five R packages: “tm”, “wordcloud”, “RCurl”, “XML” and “SnowballC”. The first two of these packages are needed for the main part of the text processing and for generating the word cloud. “RCurl” and “XML” are needed by the function “htmlToText”, which you can download from GitHub. “SnowballC” provides the stemming algorithms used in a later step.

HINT: If you get an error message when using the ‘htmlToText’ function, try commenting out the last line of the file. If you copied and pasted the file, the last 3 to 4 words of the last line may have wrapped onto a new line at the end of the file. Just comment that line out.

Open R and run the following:

> install.packages(c("tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install the required packages

> library(tm)
> library(wordcloud)
> library(SnowballC)

> # load htmlToText
> source("/Users/brendan.tierney/htmltotext.R")

Change "/Users/brendan.tierney/htmltotext.R" to where you have saved the file.

Warning: You might get an error message when installing the ‘tm’ package. The message will say something about another package called ‘slam’. If this occurs, you will need to download and install the ‘slam’ package manually. There are a couple of ways of doing this and a quick Google search will show you how.

If you continue to get error messages relating to the ‘slam’ package, then install an earlier version of the package, and keep trying earlier versions until the install works.
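
One way of doing this, as a sketch, is to install an archived version of ‘slam’ directly from the CRAN archive. The version number below is only an example; pick one that suits your version of R (and note that installing from source on Windows requires Rtools):

> # install an archived version of 'slam' from the CRAN archive
> # (the version number 0.1-40 is only an example)
> install.packages("https://cran.r-project.org/src/contrib/Archive/slam/slam_0.1-40.tar.gz", repos = NULL, type = "source")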

This is an example of the type of issue you can hit with the R language, where packages and their dependencies get out of sync with each other and with the version of R you are using. For this reason it is recommended that you do not use the very latest version of R. Use the previous release, or the one before that.

2. Read in the Web Pages you want to analyse (Repeat with some webpages from your company)

Read the webpages into some local variables. Make sure that the main content of the webpages is in the HTML itself, rather than generated by JavaScript.
In the following example I read in four webpages from the Oracle website.

HINT: The following webpages may not work. Websites are constantly changing the format and location of their content. If needed, find alternative websites that have their content embedded in their HTML pages.

> data1 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/overview/index.html")
> data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")
> data3 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-technologies/overview/index.html")
> data4 <- htmlToText("http://www.oracle.com/technetwork/database/database-technologies/r/r-enterprise/overview/index.html")

You will need to combine these webpages into one variable for processing in the later steps.

> data <- c(data1, data2, data3, data4) # combine the four webpages into one variable

3. Convert into a Corpus and perform Data Cleaning & Transformations

Convert the web documents into a Corpus.

> txt_corpus <- Corpus(VectorSource(data)) # create a corpus

We can use the summary function to get some of the details of the Corpus. We can see that we have 4 documents in the corpus. (The exact output depends on your version of the ‘tm’ package; with older versions it looks something like the following.)

> summary(txt_corpus)

A corpus with 4 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

Remove the White Space in these documents

> tm_map <- tm_map(txt_corpus, stripWhitespace) # remove white space

Remove the Punctuations from the documents

> tm_map <- tm_map(tm_map, removePunctuation) # remove punctuation

Remove Numbers from the documents

> tm_map <- tm_map(tm_map, removeNumbers) # remove numbers

Remove the typical list of Stop Words

> tm_map <- tm_map(tm_map, removeWords, stopwords("english")) # remove stop words (like 'as', 'the', etc.)
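
If you are curious about what is in this stop word list, you can print the first few entries:

> head(stopwords("english"), 10) # the first few entries: "i", "me", "my", "myself", ...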

Apply stemming to the documents

If needed you can also apply stemming to your data. I decided not to perform this step as it seemed to truncate some of the words in the word cloud.

> # tm_map <- tm_map(tm_map, stemDocument)

If you do want to perform stemming then just remove the # symbol.
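
To see what the truncation looks like, here is a small sketch using the wordStem function from the ‘SnowballC’ package (the example words are my own). The stemmer reduces each word to its root form, which is often not a complete word:

> wordStem(c("analytics", "databases", "mining", "enterprise"), language = "english")

This returns something like "analyt", "databas", "mine" and "enterpris", which is why stemmed word clouds can look odd.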

Remove any additional words (you could add other words to this list)

> tm_map <- tm_map(tm_map, removeWords, c("work", "use", "java", "new", "support"))

If you want to have a look at the output of each of the above commands you can use the inspect function.

> inspect(tm_map)

4. Convert into a Term Document Matrix and Sort

> Matrix <- TermDocumentMatrix(tm_map) # create a term document matrix, with terms in rows

> matrix_c <- as.matrix(Matrix)

> freq <- sort(rowSums(matrix_c)) # frequency count for each word

> freq # view the words and their frequencies
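
Note that freq is sorted in ascending order, so the most frequent words are at the end of the vector. To look at just the top 10 words you could use either of the following:

> tail(freq, 10) # the 10 most frequent words
> head(sort(freq, decreasing = TRUE), 10) # the same words, highest frequency first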

5. Generate the Word Cloud

> tmdata <- data.frame(words = names(freq), freq)

> wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))

and the Word Cloud will look something like the following. Every time you generate the Word Cloud you will get a slightly different layout of the words.
[Image: OAA Word Cloud]
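
If you want to keep a copy of the Word Cloud, one approach is to write it to a PNG file using the standard R graphics devices. This is just a sketch; the filename and dimensions are examples:

> png("wordcloud.png", width = 800, height = 800) # open a PNG graphics device
> wordcloud(tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
> dev.off() # close the device and write the file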

6. Examine the WordCloud

Have a look at the WordCloud and then examine the original webpages.

What insights does the WordCloud give about the webpages?

What stands out from the WordCloud?

What is missing from the WordCloud?

What conclusions and recommendations can you report to the company?

7. Exercises
Exercise 1: Use the code (and if needed expand it) to analyse 3 or 4 webpages from a company

Exercise 2: Use this code (and if needed expand it) to analyse and/or compare some news stories from newspaper websites
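
For both of these exercises it can be handy to wrap the whole pipeline into a single function. The following is only a sketch (the function name and arguments are my own, and it assumes the htmlToText function has already been sourced). Save it in a script file and source it, just like htmlToText:

# sketch of a helper that builds a word cloud from a vector of URLs
urlsToWordCloud <- function(urls, extra_words = c()) {
    data <- sapply(urls, htmlToText)       # read each webpage
    corp <- Corpus(VectorSource(data))     # convert to a corpus
    corp <- tm_map(corp, stripWhitespace)  # same cleaning steps as above
    corp <- tm_map(corp, removePunctuation)
    corp <- tm_map(corp, removeNumbers)
    corp <- tm_map(corp, removeWords, stopwords("english"))
    if (length(extra_words) > 0)
        corp <- tm_map(corp, removeWords, extra_words)  # any additional words
    freq <- sort(rowSums(as.matrix(TermDocumentMatrix(corp))))
    wordcloud(names(freq), freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
}

You would then call it with your own URLs, for example (the URLs below are placeholders):

> urlsToWordCloud(c("http://example.com/story1.html", "http://example.com/story2.html"), c("said", "will"))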

Work individually or in pairs.
Discuss the usefulness of Word Clouds and how they can give you interesting insights into the topics covered on those websites.
Does the pattern of words match what you would expect?