We have already seen an example of using Logistic Regression in the section covering classification.

Logistic Regression is used to model a categorical target variable, typically one with just two possible values.

Logistic regression fits your data to a particular equation (a weighted combination of the input variables passed through the logistic function) so you can make predictions for new data. This type of regression is a good choice when modeling binary variables, which happen frequently in real life (e.g. work or don’t work, marry or don’t marry, buy a house or rent…). The logistic regression model is popular, in part, because it gives probabilities between 0 and 1. Let’s say you were modeling the risk of credit default: values closer to 0 indicate a tiny risk, while values closer to 1 mean a very high risk.
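As a quick illustration (a minimal sketch, separate from the Titanic example below), the logistic, or sigmoid, function is what squeezes any score into that 0 to 1 range:

import numpy as np

def sigmoid(z):
    #map any score z onto a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-4))   # ~0.02, a tiny risk
print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.98, a very high risk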

On this page I will give another example, using the Titanic data set.

Go get the data set from the Kaggle website. Download the train and test CSV files.

Let’s get started by reading in the training data set, located in the ‘train.csv’ file.

import pandas as pd

titanic = pd.read_csv('/Users/brendan.tierney/Dropbox/titanic/train.csv')
titanic.head()

We have the following variables in the data set:

# Survived – Survival (0 = No; 1 = Yes)
# Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
# Name – Name
# Sex – Sex
# Age – Age
# SibSp – Number of Siblings/Spouses Aboard
# Parch – Number of Parents/Children Aboard
# Ticket – Ticket Number
# Fare – Passenger Fare (British pound)
# Cabin – Cabin
# Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Now perform a quick correlation analysis to get some early insights into the data.

#only the numeric variables can be correlated (numeric_only is needed in newer versions of pandas)
titanic.corr(numeric_only=True)

Check the number of records/cases for each value in the target variable ‘Survived’.

#check the number of passengers for each target variable value
import seaborn as sb
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = 10, 8

sb.countplot(x='Survived',data=titanic, palette='hls')
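If you want the actual counts as well as the chart, value_counts gives the same information numerically:

#numeric check of the class balance in the target variable
titanic['Survived'].value_counts()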

Now check to see what variables have missing data.

titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Cabin variable has a lot of missing values: over 77% of records have no value for it. We can drop that variable from the data set.
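That percentage can be checked directly, for example:

#percentage of missing values in each variable
round(titanic.isnull().sum() / len(titanic) * 100, 1)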

But are there any other variables that can be dropped:
– PassengerId – can be dropped as it is just a sequence number with no meaning
– Ticket – can be dropped for a similar reason
– Name – can be dropped as the name is not important
– Cabin – can be dropped as it contains a lot of missing data

titanic_data = titanic.drop(['PassengerId','Name','Ticket','Cabin'], axis=1)
titanic_data.head()

But what about the Age variable? It has approximately 20% missing values. It would be great if we could keep this variable, but can we impute or calculate what the Age might be for the records that are missing a value?

We need to explore the data and see how Age is related to some of the other variables.

Let’s look at how Age is related to the Pclass variable.

sb.boxplot(x='Pclass', y='Age', data=titanic_data, palette='hls')

We could say that the younger a passenger is, the more likely they are to be in 3rd class, and the older a passenger is, the more likely they are to be in 1st class. So there is a loose relationship between these variables.
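You can check the typical Age for each class directly; these group values (roughly 37, 29 and 24) are where the figures used in the function below come from:

#typical (median) Age for each passenger class, ignoring the missing values
titanic_data.groupby('Pclass')['Age'].median()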

We can write a function that fills in a missing Age based on the passenger's class: the typical age of 1st class passengers is about 37, 2nd class passengers is about 29, and 3rd class passengers is about 24.

def missing_age(cols):
    #if Age is missing, fill it in with the typical age for the passenger's class
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

#Now apply this function to the data set and the re-check
titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(missing_age, axis=1)
titanic_data.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

There are 2 null values in the Embarked variable. We can drop those 2 records without losing too much important information from our data set.

titanic_data.dropna(inplace=True)
titanic_data.isnull().sum()

#Display the first 10 records to remind us of the data
titanic_data.head(10)

What about Pclass and Fare? You could say these are related to each other. I’ll leave that for you to explore in detail, but we can use a correlation heat map to examine their relationship.

sb.heatmap(titanic_data.corr(numeric_only=True), annot=True, cmap='RdBu')

They aren’t really correlated.
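If you want the single figure rather than the full heat map, you can ask for just that pair of variables:

#correlation between passenger class and fare only
titanic_data['Pclass'].corr(titanic_data['Fare'])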

Now we need to process the categorical variables Sex and Embarked by converting them into numerical values. The easiest way to do this is one-hot encoding, using the get_dummies function.

titanic_data2 = pd.get_dummies(titanic_data)
titanic_data2.head(10)

New variables are created for each value of the categorical variables (Sex_female, Sex_male, Embarked_C, etc.).

When you examine the new Sex_female and Sex_male variables, each is just the opposite encoding of the other, so one of them is redundant. This means we can use just one of those variables (and drop the other one).

titanic_data2.drop(['Sex_male'],axis=1,inplace=True)
titanic_data2.head()
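As an aside, get_dummies can do this dropping for you with the drop_first parameter. Note that it drops one level of every categorical variable (including one of the Embarked columns), so the resulting data set would be slightly different from the one used below; titanic_alt is just an illustrative name here.

#alternative: keep only one indicator column per categorical variable
titanic_alt = pd.get_dummies(titanic_data, drop_first=True)
titanic_alt.head()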

Create the Train and Test data sets.

#create the training and test dataframes 2/3 to 1/3 split
from sklearn.model_selection import train_test_split

X = titanic_data2.drop('Survived', axis = 1)
Y = titanic_data2['Survived']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(595, 9)
(294, 9)
(595,)
(294,)

We can now create, or fit, the logistic regression model to the training data.

from sklearn.linear_model import LogisticRegression

LogReg = LogisticRegression()
LogReg.fit(X_train, Y_train)

> LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Use the model to score the test data set and evaluate the results.

Y_pred = LogReg.predict(X_test)
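As mentioned earlier, the model produces a probability between 0 and 1 for each prediction; predict_proba returns these, with one column per class:

#probability of Not Survived (column 0) and Survived (column 1) for the first few test records
LogReg.predict_proba(X_test)[:5]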

from sklearn.metrics import confusion_matrix

#store the result under a different name so it doesn't shadow the imported function
conf_matrix = confusion_matrix(Y_test, Y_pred)
conf_matrix

array([[174,  19],
       [ 31,  70]])

The results from the confusion matrix are telling us that 174 and 70 are the numbers of correct predictions, while 19 and 31 are the numbers of incorrect predictions. That works out at an accuracy of (174 + 70) / 294, or roughly 83%.

Now let’s plot the confusion matrix. The following code/function is taken from the scikit-learn documentation website.

def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm: confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title: the text to display at the top of the matrix

    cmap: the gradient of the values displayed from matplotlib.pyplot.cm
          see http://matplotlib.org/examples/color/colormaps_reference.html
          plt.get_cmap('jet') or plt.cm.Blues

    normalize: If False, plot the raw numbers
               If True, plot the proportions

    Usage
    -----
     plot_confusion_matrix(cm = cm, # confusion matrix created by
                                    # sklearn.metrics.confusion_matrix
    normalize = True, # show proportions
    target_names = y_labels_vals, # list of names of the classes
    title = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()

plot_confusion_matrix(cm = conf_matrix,
                      normalize = False,
                      target_names = ['0', '1'],
                      title = "Confusion Matrix")

The model correctly predicted that 174 passengers did NOT survive. For 19 people who did not survive, it predicted they DID survive.

Similarly, it correctly predicted 70 people who DID survive, but incorrectly predicted 31 survivors as NOT surviving.

Now print some additional evaluation measures derived from the confusion matrix.

from sklearn.metrics import classification_report

print(classification_report(Y_test, Y_pred))
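If you just want a single overall accuracy figure, accuracy_score (or the model's score method) gives the proportion of correct predictions on the test data set:

#overall accuracy on the test data set
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, Y_pred))
print(LogReg.score(X_test, Y_test))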