Association Rule mining is an unsupervised machine learning method. It goes by a number of commonly used alternative names, such as Market Basket Analysis and Frequent Item Set mining. The primary algorithm used for it is the Apriori algorithm. As with the other machine learning examples on this website, the theory behind this algorithm and its associated metrics will not be covered. This page works through an example of how to perform Association Rule Mining using Python.
1. Python Packages.
There are a number of Python packages available that support various elements of Association Rule Mining. The most commonly used/cited package is MLxtend. To install this package, run:
pip3 install mlxtend
Note: The implementation, features and graphing of frequent item sets and the Apriori algorithm in Python are not at the same level of maturity as in the R language, but I'm sure this will change over time.
2. The Data Set
The data set used in the following examples is the Online Retail Sales data set, available from the UCI Machine Learning Repository.
The following code accesses this data set and downloads it into a pandas dataframe. The data set has approximately 500K records, so it might take a few seconds to a minute to download, depending on your internet connection speed.
import pandas as pd

df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()
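Since the download is slow, it can be worth caching the data locally so repeated runs don't hit the network. This is a sketch of one way to do it; the file name and helper function are my own additions, not part of the original steps.

```python
import os
import pandas as pd

DATA_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
            '00352/Online%20Retail.xlsx')
CACHE_FILE = 'online_retail.pkl'  # illustrative local cache file name

def load_retail_data(url=DATA_URL, cache=CACHE_FILE):
    """Download the data set once, then reuse a local pickle on later runs."""
    if os.path.exists(cache):
        return pd.read_pickle(cache)
    df = pd.read_excel(url)   # slow first time: ~500K rows over the network
    df.to_pickle(cache)       # fast to reload afterwards
    return df
```

After the first run, `load_retail_data()` reloads from the pickle almost instantly.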
This data set contains sales transactions from 37 different countries. Let's see how many records we have for each country.
country_sales = df.groupby(['Country'])['InvoiceNo'].count().reset_index(name='NumRecords')
country_sales
Now order the data set in descending order of the number of records.
country_sales_order = country_sales.sort_values(['NumRecords'], ascending=False)
country_sales_order
I see Ireland (EIRE) has the 4th largest number of transactions, and as I'm from Ireland, let's use those transactions for the Association Rules analysis.
3. Preparing the Data Set
First we need to create our data subset containing the transactions for EIRE (Ireland).
Ireland_sales = df[df['Country'] == "EIRE"].reset_index()
Ireland_sales.head(5)
There is a little cleanup we need to do: some of the descriptions have leading or trailing spaces that need to be removed.
Ireland_sales['Description'] = Ireland_sales['Description'].str.strip()
After the cleanup, we need to consolidate the items into one transaction per row, with each product one-hot encoded. Note that we build the basket from the cleaned Ireland_sales dataframe so the stripped descriptions are used.

basket = (Ireland_sales
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
basket
Here is a subset of the 'basket' dataframe. It takes each distinct product value from the Description column and creates a new column for each value. Then each record contains either the original Quantity value or a zero to indicate the product wasn't purchased.
The new data frame contains 2027 columns.
# List the columns. There are 2027 columns.
basket.columns
Index([' 4 PURPLE FLOCK DINNER CANDLES', ' 50'S CHRISTMAS GIFT BAG LARGE', ' DOLLY GIRL BEAKER', ' NINE DRAWER OFFICE TIDY', ' OVAL WALL MIRROR DIAMANTE ', ' RED SPOT GIFT BAG LARGE', ' SPACEBOY BABY GIFT SET', ' TRELLIS COAT RACK', '10 COLOUR SPACEBOY PEN', '12 COLOURED PARTY BALLOONS', ... 'ZINC FOLKART SLEIGH BELLS', 'ZINC HEART FLOWER T-LIGHT HOLDER', 'ZINC HERB GARDEN CONTAINER', 'ZINC METAL HEART DECORATION', 'ZINC SWEETHEART WIRE LETTER RACK', 'ZINC T-LIGHT HOLDER STAR LARGE', 'ZINC T-LIGHT HOLDER STARS SMALL', 'ZINC WILLIE WINKIE CANDLE STICK', 'ZINC WIRE KITCHEN ORGANISER', 'ZINC WIRE SWEETHEART LETTER TRAY'], dtype='object', name='Description', length=2027)
Now we need to recode the values to indicate presence with a 1 or 0.
# Recode the numbers in the cells to indicate if the item was bought (1) or not (0).
# The actual quantity is not important.
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_sets = basket.applymap(encode_units)
basket_sets
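Applying a Python function cell by cell with applymap can be slow on a frame this wide (2027 columns). A vectorized comparison achieves the same encoding in one line; the tiny basket below is a made-up example purely for illustration.

```python
import pandas as pd

# Tiny illustrative basket: quantities per invoice (rows) and product (columns).
basket = pd.DataFrame(
    {'CANDLE': [6, 0, -1], 'MUG': [0, 2, 4]},
    index=['536370', '536852', '537065'])

# Vectorized equivalent of encode_units: 1 where the quantity is positive, else 0.
# (Negative quantities in this data set are returns, so they also map to 0.)
basket_sets = (basket > 0).astype(int)
```

The result is identical to the applymap version, just computed column-wise rather than cell by cell.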
The data is now ready for input to the Association Rule algorithms.
4. Create the Frequent Item Sets and Association Rules
The first thing we need to do is to load the MLxtend package and the particular functions we are going to use.
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Now use the Apriori algorithm to extract the frequent itemsets. A very important parameter to set is the minimum support percentage (min_support); check your machine learning notes/textbook for the details. Setting it too high can result in too few itemsets, while setting it too low can generate too many. Some experimentation may be necessary to determine a suitable value.
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)
frequent_itemsets
We can now generate the rules from the frequent item sets.
# Generate the rules with their corresponding support, confidence and lift.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
5. Filtering the Data Set and Rules
After generating the association rules, you will want to query the data by filtering on certain criteria, for example on the lift and confidence values.
# Filter the dataframe to select a subset of rules.
rules[(rules['lift'] >= 8) & (rules['confidence'] >= 0.8)]
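One quirk when inspecting the filtered rules is that association_rules returns the antecedents and consequents columns as frozensets, which print awkwardly. A small post-processing step turns them into plain strings. The rule values below are made up for illustration (the product names are taken from the column listing above):

```python
import pandas as pd

# Illustrative stand-in for the 'rules' dataframe; in practice this comes
# from association_rules(). The support/confidence/lift values are invented.
rules = pd.DataFrame({
    'antecedents': [frozenset({'RED SPOT GIFT BAG LARGE'}),
                    frozenset({'10 COLOUR SPACEBOY PEN'})],
    'consequents': [frozenset({'ZINC HERB GARDEN CONTAINER'}),
                    frozenset({'TRELLIS COAT RACK'})],
    'confidence': [0.90, 0.85],
    'lift': [9.1, 8.2]})

# Join each frozenset into a plain comma-separated string for readability.
for col in ('antecedents', 'consequents'):
    rules[col] = rules[col].apply(lambda s: ', '.join(sorted(s)))
```

This makes the rules much easier to read, sort, or export to a CSV file.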
At the time of writing, there are only limited visualizations available for Association Rules in Python compared to R. A simple support-versus-confidence scatter plot can be produced with matplotlib.
import matplotlib.pyplot as plt

# Note: DataFrame.as_matrix() has been removed from pandas; use .values instead.
support = rules['support'].values
confidence = rules['confidence'].values

plt.figure(figsize=(10, 8))
plt.scatter(support, confidence, alpha=0.5, marker="*")
plt.xlabel('support')
plt.ylabel('confidence')
plt.show()
Repeat the above steps for the following countries. Some additional data cleaning might be necessary.
- United Kingdom
Compare the rules to see if there is any overlap or commonality in sales transactions between these countries.
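One way to make that comparison concrete is to treat each rule as an (antecedents, consequents) pair and intersect the sets of pairs from the two countries. The rule tables below are hypothetical placeholders standing in for the association_rules() output of each country:

```python
import pandas as pd

# Hypothetical rule tables standing in for the association_rules() output of
# two countries; only the antecedents/consequents columns matter here.
eire_rules = pd.DataFrame({
    'antecedents': [frozenset({'A'}), frozenset({'B'})],
    'consequents': [frozenset({'B'}), frozenset({'C'})]})
uk_rules = pd.DataFrame({
    'antecedents': [frozenset({'A'}), frozenset({'X'})],
    'consequents': [frozenset({'B'}), frozenset({'Y'})]})

def rule_keys(rules):
    # Represent each rule as a hashable (antecedents, consequents) pair.
    return set(zip(rules['antecedents'], rules['consequents']))

# Rules that appear in both countries' results.
common = rule_keys(eire_rules) & rule_keys(uk_rules)
print(common)
```

Because frozensets are hashable, the pairs can be compared directly with ordinary set operations; the size of `common` relative to each country's rule count gives a rough measure of overlap.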