In this section I will show the steps for building a cluster model using k-Means and how to inspect the various elements of the clusters.

I’ll also include some elements of visualizing clusters using Google Maps.

The Data Set: The data set used in this example is based on a famous Scottish Whiskies data set. This website contains the original data set, but be careful when using it, as the LAT and LONG coordinates are not in the normal international format. The data set used here contains two additional variables holding the internationally recognized coordinates, which are recognized by Google Maps and other mapping tools.

Let’s get started and read in the data set.

import pandas as pd

whisky_file = "whiskies_b.txt"
whisky = pd.read_csv(whisky_file, header=0)

Now explore the data a bit more. The data set has 86 records and 19 variables:

(86, 19)
Index(['RowID', 'Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral', 'Postcode', 'Latitude', 'Longitude', 'lat', 'long'], dtype='object')

# calculates measures of central tendency
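
The shape and column listing shown above, together with summary statistics such as the mean, are typically obtained with `shape`, `columns` and `describe()`. A minimal sketch follows, using a small made-up stand-in frame; the `demo` name and its values are assumptions for illustration and are not rows from the real file.

```python
import pandas as pd

# Hypothetical stand-in for the whisky dataframe; values are invented.
demo = pd.DataFrame({
    "RowID": [1, 2, 3, 4],
    "Distillery": ["A", "B", "C", "D"],
    "Body": [2, 3, 1, 4],
    "Smoky": [2, 1, 2, 4],
})

print(demo.shape)    # (number of rows, number of columns)
print(demo.columns)  # the column labels

# describe() reports count, mean, std, min, quartiles and max
# for each numeric column
summary = demo[["Body", "Smoky"]].describe()
print(summary)
```

On the real data the same calls would be made on `whisky` instead of `demo`.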

List subsets of records and variables
Here we only display the first 10 records and the variables in positions 1 to 13 (numbering starts at zero, and the end of a slice is not included).
The RowID variable and the location variables are excluded.

whisky.iloc[0:10, 1:14]

Data Preparation

Let us define the subset of variables to use for clustering

X = whisky.iloc[0:whisky.shape[0], 2:14]


Do the clustering. But how many clusters? Use the elbow method to determine the optimal number of clusters to use.

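# Using the elbow method to find the optimal number of clusters
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    # inertia_ is the within-cluster sum of squares (WCSS) for this number of clusters
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()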
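
The quantity the elbow method tracks is the within-cluster sum of squares (WCSS), which scikit-learn exposes as `inertia_` on a fitted model. The sketch below recomputes it by hand to show what the number means. The `points` array is made up for illustration and is not taken from the whisky data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two obvious groups; invented for illustration.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(points)

# Look up the center of the cluster each point was assigned to
assigned_centers = model.cluster_centers_[model.labels_]

# WCSS: sum of squared distances from each point to its assigned center
manual_wcss = ((points - assigned_centers) ** 2).sum()

print(manual_wcss, model.inertia_)  # the two values should agree up to rounding
```

WCSS always falls as more clusters are added, so the elbow is the point where adding another cluster stops giving a large drop.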

Let’s use 4 for the number of clusters and build the cluster model based on having 4 clusters. Then fit the model using the data set defined above.

# Applying k-means to the whisky dataset
kmeans_model = KMeans(n_clusters=4,init='k-means++',max_iter=300,n_init=10,random_state=0) 
y_kmeans = kmeans_model.fit_predict(X)

List the cluster identified for each record. This only lists the assigned clusters.
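
The assigned clusters are the integer labels returned by `fit_predict`, one per record, and the same values are stored on the fitted model as `labels_`. A minimal sketch on made-up points (the `points` array is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two obvious groups; invented for illustration.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)

# fit_predict fits the model and returns one cluster label per row
labels = model.fit_predict(points)

print(labels)         # array of cluster numbers, one per record
print(model.labels_)  # the same assignments, kept on the fitted model
```

The label numbers themselves are arbitrary; only which records share a label is meaningful.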

What are the centers (mid-points) of the clusters?
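
A fitted scikit-learn `KMeans` model stores its centers in `cluster_centers_`, one row per cluster and one column per input variable. Wrapping that array in a dataframe with the original column names makes it easier to read. A minimal sketch on a made-up frame (the `demo` name and values are assumptions for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical flavour scores; invented for illustration.
demo = pd.DataFrame({"Body": [1, 1, 2, 4, 4, 3],
                     "Smoky": [0, 1, 0, 4, 3, 4]})

model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(demo)

# One row per cluster; each value is the mean of that variable within the cluster
centers = pd.DataFrame(model.cluster_centers_, columns=demo.columns)
print(centers)
```

Reading across a row shows the typical profile of a cluster, for example whether its members tend to score high on `Smoky`.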

Now get the predicted cluster for each record and attach it to the dataframe

y = kmeans_model.predict(X)

y_df = pd.DataFrame(y)
cluster_results = whisky
cluster_results["CLUSTER_NUM"] = y_df
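
Note that a plain assignment such as `cluster_results = whisky` does not copy the dataframe, so the new column is also added to the original. If the original should stay unchanged, one option is to work on a copy. A minimal sketch, using a made-up frame and made-up labels (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in dataframe and cluster labels; invented for illustration.
demo = pd.DataFrame({"Distillery": ["A", "B", "C"], "Body": [2, 3, 1]})
labels = [1, 0, 1]

# copy() gives an independent dataframe, so demo is left untouched
cluster_results = demo.copy()
cluster_results["CLUSTER_NUM"] = labels

print(cluster_results)
print(demo.columns)  # no CLUSTER_NUM column here
```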

How many records are in each cluster?

0    25
1    19
2    36
3     6
dtype: int64
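
A count like the one above can be produced in several ways, and the call used here is not shown. One option is `value_counts()` followed by `sort_index()`, sketched below with made-up labels (the `labels` array is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels; invented for illustration.
labels = np.array([0, 2, 1, 2, 0, 2, 3])

# value_counts() counts each label; sort_index() orders the result by cluster number
counts = pd.Series(labels).value_counts().sort_index()
print(counts)
```

On the dataframe built earlier, the same idea would be applied to the `CLUSTER_NUM` column.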

I can now plot these clusters on Google Maps.