In this section we will look at creating classification models using a number of different algorithms. In most cases the data preparation needed is the same. The data set used for this example is the Portuguese Bank Marketing Data Set.

In Data Mining/Machine Learning you cannot decide in advance which algorithm will give you the best results. You need to prove which algorithm gives the best results. This idea is known as ‘No Free Lunch’: no single algorithm is best for every problem, so you need to build and test a model with each algorithm to see which one works best for your particular data set.

1. Load and Inspect the Data

import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

bank_file = "/.../bank-additional-full.csv"

# import dataset
df = pd.read_csv(bank_file, sep=';')

# get basic details of df (num records, num features)
df.shape
(41188, 21)
df.describe() # basic descriptive statistics

Calculate the distribution of values for the target variable. We have an imbalanced data set.

df['y'].value_counts() # dataset is imbalanced with majority of class label as "no".

no     36548
yes     4640
df['y'].value_counts()/len(df)   #calculate percentages

no     0.887346
yes    0.112654
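
With roughly 89% of the records labelled "no", accuracy on its own can flatter a model. As a rough sketch, the snippet below rebuilds only the class counts shown above (the real features are replaced by a placeholder column, which is an assumption made for illustration) and scores scikit-learn's DummyClassifier, which always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Rebuild only the target distribution reported above: 36548 "no", 4640 "yes".
y = np.array([0] * 36548 + [1] * 4640)
# Placeholder feature column (assumption); DummyClassifier ignores the inputs.
X = np.zeros((len(y), 1))

# Always predict the most frequent class ("no").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)  # share of "no" records
baseline_recall = recall_score(y, y_pred)      # share of "yes" records found
print("Baseline accuracy:", round(baseline_accuracy, 4))
print("Baseline recall for 'yes':", baseline_recall)
```

Any model we build later should be judged against this baseline: an accuracy close to 0.887 can be reached without finding a single "yes".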

2. Data Clean-up and Reformatting
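The Target variable has text for the values. Converting these to numerical values keeps the later steps simple: the One-Hot Encoding step would otherwise split the target into two columns, and functions such as roc_curve need to know which value is the positive class.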

df['y'] = df['y'].map({'no':0, 'yes':1}) # binary encoding of class label

The ‘duration’ variable is highly correlated with the target variable. The description of the data set also tells us this, and it recommends removing the variable when building a realistic predictive model because the duration of a call is not known before the call is made.

df = df.drop('duration', axis=1)

Now we need to perform One-Hot Encoding. This takes each categorical variable and creates an additional variable for each of its values. A 1 is then placed in the new variable that corresponds to the value for each instance, and a 0 in the others.

df_new = pd.get_dummies(df)
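
To see what this does, here is a minimal sketch on a toy data frame. The column values are made up for illustration and are not rows from the bank data set.

```python
import pandas as pd

# Toy frame with hypothetical values, not rows from the bank data set.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "single"],
})

# Numeric columns are left alone; each categorical value gets its own column.
# dtype=int forces 0/1 columns; recent pandas versions default to True/False.
toy_encoded = pd.get_dummies(toy, dtype=int)
print(toy_encoded)
```

The ‘marital’ column is replaced by ‘marital_married’ and ‘marital_single’, while ‘age’ is carried over unchanged.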

3. Prepare the Training and Test Data Sets

After the data is prepared, we can now divide the data into the Training and Test data sets.

# use Stratified sampling to divide the data
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=18)
# each pass overwrites the previous one, so only the last of the 10 splits is kept
for train_index, test_index in split.split(df_new, df_new['y']):
   # split() returns positional indices, so use iloc rather than loc
   train_set = df_new.iloc[train_index]
   test_set = df_new.iloc[test_index]
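
When only a single split is needed, train_test_split with the stratify parameter is a shorter way to get the same kind of division. The sketch below uses a synthetic stand-in for df_new (the ‘demo’ frame and its values are assumptions for illustration) to show that the class proportions are preserved in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_new: 1000 rows, 10% in the positive class.
demo = pd.DataFrame({
    "x": np.arange(1000),
    "y": [1] * 100 + [0] * 900,
})

# stratify keeps the share of each class the same in both parts.
demo_train, demo_test = train_test_split(
    demo, test_size=0.2, stratify=demo["y"], random_state=18
)

print("Train positives:", demo_train["y"].mean())
print("Test positives:", demo_test["y"].mean())
```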

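Check the sizes of the Training and Test data sets.

train_set['y'].value_counts()
0    29238
1     3712
train_set['y'].value_counts()/len(train_set)
0    0.887344
1    0.112656
test_set['y'].value_counts()
0    7310
1     928
test_set['y'].value_counts()/len(test_set)
0    0.887351
1    0.112649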

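We have an imbalanced Training data set. Here we will up-sample the ‘1’ class so that it has approximately the same number of instances as the other target value. Only the Training data set is up-sampled; the Test data set keeps its original distribution.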

from sklearn.utils import resample
# Separate majority and minority classes
train_negative = train_set[train_set['y']==0]
train_positive = train_set[train_set['y']==1]

# Upsample minority class
train_positive_upsample = resample(train_positive,
                                   replace=True, # sample with replacement
                                   n_samples=len(train_negative), # to match majority class
                                   random_state=18) # reproducible results

# Combine majority class with upsampled minority class
train_upsample = pd.concat([train_negative, train_positive_upsample])

# Display new class counts
train_upsample['y'].value_counts()
1    29238
0    29238
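
If you need to repeat this step for other data sets, it can be wrapped in a small function. The following is only a sketch: upsample_minority is a hypothetical helper name, not part of scikit-learn, and the demo frame is made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(frame, target="y", random_state=18):
    """Return a frame in which every class has as many rows as the largest class.

    Hypothetical helper for illustration; it is not a scikit-learn function.
    """
    counts = frame[target].value_counts()
    largest = counts.max()
    parts = []
    for label in counts.index:
        rows = frame[frame[target] == label]
        if len(rows) < largest:
            # Sample with replacement until this class matches the largest one.
            rows = resample(rows, replace=True, n_samples=largest,
                            random_state=random_state)
        parts.append(rows)
    return pd.concat(parts)


# Made-up frame: 8 rows of class 0 and 2 rows of class 1.
demo = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})
balanced = upsample_minority(demo)
print(balanced["y"].value_counts())
```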

The final data preparation step is to divide the data into input variables and target variable.

# Final data preparation step is to
## create separate dataframes for Input features (X) and for Target feature (Y)
X_train = train_upsample.drop('y', axis=1)
X_test = test_set.drop('y', axis=1)
Y_train = train_upsample['y']
Y_test = test_set['y']

4. Algorithm Setup

As we want to test a number of algorithms, we can set up a list of these.

from sklearn import model_selection
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

log_cols=["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)

classifiers = [
   ('KNN', KNeighborsClassifier(3)),
   ('NB', GaussianNB()),
   ('DT', DecisionTreeClassifier()),
   ('RF', RandomForestClassifier()),
   ('LR', LogisticRegression())]

Others can be added to this, e.g. SVM, Neural Network, etc.
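
As a sketch of how such additions might look, the entries below use scikit-learn's SVC and MLPClassifier. The list name extra_classifiers, the parameter values and the synthetic data are assumptions for illustration. Note that SVC only provides predict_proba, which a ROC chart needs, when it is created with probability=True.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical extra entries in the same (name, model) format.
extra_classifiers = [
    # probability=True enables predict_proba, at the cost of slower training
    ('SVM', SVC(probability=True, random_state=18)),
    ('NN', MLPClassifier(max_iter=500, random_state=18)),
]

# Quick check on synthetic data that both models can produce probabilities.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=18)
for name, model in extra_classifiers:
    model.fit(X_demo, y_demo)
    proba = model.predict_proba(X_demo)[:, 1]
    print(name, "probabilities for", proba.shape[0], "records")
```

Both of these algorithms are sensitive to the scale of the input variables, so on real data you would normally scale the inputs before fitting them.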

5. Create Models & Generate Confusion Matrix & ROC Chart

Create a loop to process each algorithm in the list and calculate the accuracy, the confusion matrix and the values for creating a ROC chart.

from sklearn.metrics import roc_curve, auc


for name, model in classifiers:
   model.fit(X_train, Y_train)
   name = model.__class__.__name__

   # predictions for the Test data set
   test_predictions = model.predict(X_test)
   acc = accuracy_score(Y_test, test_predictions)
   print(name, "Accuracy=", acc)

   # DataFrame.append was removed in pandas 2.0, so use pd.concat
   log_entry = pd.DataFrame([[name, acc*100]], columns=log_cols)
   log = pd.concat([log, log_entry], ignore_index=True)

   # probability of the positive class, needed for the ROC curve
   Y_predict_prob = model.predict_proba(X_test)[:,1]

   fpr, tpr, thresholds = roc_curve(Y_test, Y_predict_prob)
   roc_auc = auc(fpr, tpr)

   # pass plain arrays so crosstab pairs rows by position, not by index label
   print(pd.crosstab(Y_test.to_numpy(), test_predictions, rownames=['Actual'], colnames=['Predicted'], margins=True))

   plt.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % (name, roc_auc))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc=0, fontsize='small')
plt.show()
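In the output below, each confusion matrix totals 1,634 records rather than the 8,238 in the Test data set. These matrices came from passing Y_test together with a new pd.Series of the predictions to pd.crosstab. The new Series has an index of 0 to 8237, which only partly overlaps the index of Y_test, and pd.crosstab keeps only the rows whose index appears in both. Passing plain arrays to pd.crosstab gives a matrix over the full Test data set.

KNeighborsClassifier Accuracy= 0.8131828113619811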
Predicted     0    1   All
0          1286  285  1571
1            56    7    63
All        1342  292  1634

GaussianNB Accuracy= 0.8329691672736101
Predicted     0    1   All
0          1288  283  1571
1            60    3    63
All        1348  286  1634

DecisionTreeClassifier Accuracy= 0.840616654527798
Predicted     0    1   All
0          1355  216  1571
1            56    7    63
All        1411  223  1634

RandomForestClassifier Accuracy= 0.8800679776644816
Predicted     0    1   All
0          1445  126  1571
1            61    2    63
All        1506  128  1634

LogisticRegression Accuracy= 0.8345472201990775
Predicted     0    1   All
0          1259  312  1571
1            60    3    63
All        1319  315  1634

import seaborn as sns

sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")

plt.xlabel('Accuracy %')
plt.title('Classifier Accuracy')
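
Accuracy is only one of the measures that can be read from a confusion matrix. With an imbalanced target it is also worth looking at precision and recall for the ‘1’ class. The sketch below uses ten hypothetical labels and predictions (not output from the models above) to show how the three measures are derived from the same four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for ten records (not model output).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For a binary target, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)           # all correct predictions
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
```

Here the accuracy is 0.6, yet only one of the three actual ‘1’ records is found, which is the kind of gap that accuracy alone hides.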

6. Inspect Model Properties

You can also inspect the model properties. The following is an example for Random Forest, using the fitted model from the list of classifiers above and the column names of the training data.

# Get numerical feature importances
rf = dict(classifiers)['RF']           # fitted Random Forest from the loop above
feature_list = list(X_train.columns)   # names of the input features
importances = list(rf.feature_importances_)

# list of x locations for plotting
x_values = list(range(len(importances)))

# Make a bar chart
plt.bar(x_values, importances, orientation = 'vertical')

# Tick labels for x axis
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');
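
With many one-hot encoded columns the bar chart can become crowded, so it can help to rank the importances and look only at the top few. The following is a sketch on synthetic data; the feature names f0 to f5 and the model settings are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up feature names f0..f5.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=18
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

rf_demo = RandomForestClassifier(n_estimators=50, random_state=18)
rf_demo.fit(X_demo, y_demo)

# Pair each importance with its feature name and sort, largest first.
ranked = pd.Series(rf_demo.feature_importances_, index=feature_names)
ranked = ranked.sort_values(ascending=False)
print(ranked.head(3))
```

The same two lines that build ‘ranked’ should work with any fitted Random Forest, using the column names of its training data as the index.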