We have already seen an example of using Logistic Regression in the section covering classification.

Logistic regression is used to model categorical outcomes, typically binary ones, rather than continuous numeric data.

Logistic regression fits your data to a particular equation (an S-shaped curve rather than a straight line) so you can make predictions from it. This type of regression is a good choice when modeling binary variables, which happen frequently in real life (e.g. work or don’t work, marry or don’t marry, buy a house or rent…). The logistic regression model is popular, in part, because it gives probabilities between 0 and 1. Let’s say you were modeling the risk of credit default: values closer to 0 indicate a tiny risk, while values closer to 1 mean a very high risk.
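The probabilities come from the logistic (sigmoid) function, which squashes any linear score into the (0, 1) range. A minimal sketch of that mapping (the scores here are just illustrative numbers, not fitted model output):

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# A linear score of 0 corresponds to a probability of exactly 0.5
print(sigmoid(0))            # 0.5
# Large positive scores approach 1; large negative scores approach 0
print(round(sigmoid(4), 3))  # 0.982
print(round(sigmoid(-4), 3)) # 0.018
```

This is why, in the credit-default example, a score well below zero maps to a near-zero risk and a score well above zero maps to a near-certain default.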

On this page I will give another example, using the Titanic Data Set.

Go get the data set from the Kaggle website. Download the train and test CSV files.

Let’s get started by reading in the training data set located in the ‘train.csv’ file.

import pandas as pd

titanic = pd.read_csv('/Users/brendan.tierney/Dropbox/titanic/train.csv')
titanic.head()

We have the following variables in the data set:

– Survived – Survival (0 = No; 1 = Yes)

– Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

– Name – Name

– Sex – Sex

– Age – Age

– SibSp – Number of Siblings/Spouses Aboard

– Parch – Number of Parents/Children Aboard

– Ticket – Ticket Number

– Fare – Passenger Fare (British pound)

– Cabin – Cabin

– Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Now perform a quick correlation analysis to get some early insights into the data.

titanic.corr()

Check the number of records/cases for each value in the target variable ‘Survived’.

# check the number of passengers for each target variable value
import seaborn as sb
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = 10, 8
sb.countplot(x='Survived', data=titanic, palette='hls')

Now check to see what variables have missing data.

titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Cabin variable has a lot of missing values: over 77% of records (687 of 891) are missing it. We can drop that variable from the data set.
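That 77% figure can be computed directly with `isnull().mean()`, which gives the fraction of missing values per column (687/891 ≈ 0.77 on the real data). A minimal sketch on a tiny synthetic frame standing in for the Titanic data:

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: 3 of 4 Cabin values missing, 1 of 4 Age values missing
df = pd.DataFrame({'Cabin': ['C85', np.nan, np.nan, np.nan],
                   'Age':   [22.0, 38.0, np.nan, 35.0]})

# isnull().mean() gives the fraction of missing values in each column
missing_frac = df.isnull().mean()
print(missing_frac['Cabin'])   # 0.75
print(missing_frac['Age'])     # 0.25
```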

But are there any other variables that can be dropped:

– PassengerId – can be dropped as it is just a sequence number with no meaning

– Ticket – can be dropped for a similar reason

– Name – can be dropped as the name is not important

– Cabin – can be dropped as it contains a lot of missing data

titanic_data = titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
titanic_data.head()

But what about the Age variable? It has approximately 20% missing values (177 of 891). It would be great if we could keep this variable, so can we impute or calculate what the Age might be for those records missing a value?

We need to explore the data and see how Age is related to some of the other variables.

Let’s look at how Age is related to the Pclass variable.

sb.boxplot(x='Pclass', y='Age', data=titanic_data, palette='hls')

We could say that the younger a passenger is, the more likely they are to be in 3rd class, and the older a passenger is, the more likely they are to be in 1st class. So there is a loose relationship between these variables.

We can write a function to impute the missing ages using these class averages: the average age of 1st class passengers is about 37, 2nd class passengers is 29, and 3rd class passengers is 24.
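Those per-class averages can be checked with a quick `groupby`; a sketch on a toy frame mimicking the Age/Pclass columns (the values below are illustrative, not the real Titanic figures):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the Age/Pclass columns (illustrative values only)
df = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3, 3],
                   'Age':    [40.0, 34.0, 29.0, np.nan, 25.0, 23.0]})

# Mean age per passenger class; NaN values are ignored by mean()
mean_ages = df.groupby('Pclass')['Age'].mean()
print(mean_ages)
```

Running the same `groupby('Pclass')['Age'].mean()` on `titanic_data` is how you would arrive at the 37/29/24 figures used in the imputation function below.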

def missing_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

# Now apply this function to the data set and re-check
titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(missing_age, axis=1)
titanic_data.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    2
dtype: int64

There are 2 null values in the Embarked variable. We can drop those 2 records without losing too much important information from our data set.

titanic_data.dropna(inplace=True)
titanic_data.isnull().sum()

# Display the first 10 records to remind us of the data
titanic_data.head(10)

What about Pclass and Fare? You could say these are related to each other. I’ll leave that for you to explore, or we can use a correlation heat map to examine their relationship.

sb.heatmap(titanic_data.corr(), annot=True, cmap='RdBu')

They aren’t really correlated.

Now we need to process the categorical variables Sex and Embarked by converting them into numerical values. The easiest way is to use one-hot encoding via the get_dummies function.

titanic_data2 = pd.get_dummies(titanic_data)
titanic_data2.head(10)

New variables are created for each of the categorical variables (Sex_female, Sex_male, Embarked_C, etc.).

When you examine the new Sex_female and Sex_male variables, you will see each is just the mirror-image encoding of the other. One of them is redundant, so we can keep just one of those variables and drop the other.

titanic_data2.drop(['Sex_male'], axis=1, inplace=True)
titanic_data2.head()

Create the Train and Test data sets.

# create the training and test dataframes with a 2/3 to 1/3 split
from sklearn.model_selection import train_test_split

X = titanic_data2.drop('Survived', axis=1)
Y = titanic_data2['Survived']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=5)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(595, 9)
(294, 9)
(595,)
(294,)

We can now create, or fit, the logistic regression model using the training data.

from sklearn.linear_model import LogisticRegression

LogReg = LogisticRegression()
LogReg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Use the model to score the test data set and evaluate the results.

Y_pred = LogReg.predict(X_test)

from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(Y_test, Y_pred)
confusion_matrix

array([[174,  19],
       [ 31,  70]])

The confusion matrix is telling us that 174 and 70 are the numbers of correct predictions, while 19 and 31 are the numbers of incorrect predictions.

Now let’s plot the confusion matrix. The following code/function is taken from the scikit-learn documentation website.

def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix
    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']
    title:        the text to display at the top of the matrix
    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues
    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,             # confusion matrix created by
                                                         # sklearn.metrics.confusion_matrix
                          normalize    = True,           # show proportions
                          target_names = y_labels_vals,  # list of names of the classes
                          title        = best_estimator_name)  # title of graph

    Citation
    --------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()

plot_confusion_matrix(cm           = confusion_matrix,
                      normalize    = False,
                      target_names = ['0', '1'],
                      title        = "Confusion Matrix")

It correctly predicted that 174 passengers did NOT survive. For 19 people who did not survive, it predicted they DID survive.

Similarly, it correctly predicted 70 people who DID survive, but incorrectly predicted 31 survivors as NOT surviving.
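From these four cells we can derive the overall accuracy ourselves: correct predictions are on the diagonal, so accuracy is their sum over the total number of test records. A quick sketch using the matrix above:

```python
import numpy as np

# Confusion matrix from the test set: rows = actual class, columns = predicted class
cm = np.array([[174, 19],
               [ 31, 70]])

correct = np.trace(cm)      # diagonal: 174 + 70 = 244 correct predictions
total = cm.sum()            # 294 test records in all
accuracy = correct / total
print(round(accuracy, 3))   # 0.83
```

So the model classifies roughly 83% of the test passengers correctly, which should match the accuracy figure in the classification report below.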

Now print the additional evaluation measures derived from the confusion matrix.

from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred))