L14-1-Linear_Regression

In the first of the Regression examples, we are going to perform a typical Linear Regression.

The data set used is the Boston Housing data set, one of the most commonly used data sets for illustrating Linear Regression. It used to be available in the UCI repository, but is no longer hosted there. It is available on Kaggle, and it is also included as a built-in data set in the scikit-learn library in Python.

# The examples shown here demonstrate Linear Regression.
# - The data set is the Boston Housing data set. This is one of the typical data sets used for Linear Regression
# - The data set can be loaded directly from the Scikit-Learn library

from sklearn.datasets import load_boston
boston = load_boston()
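
Note that load_boston has been deprecated and was removed from recent versions of scikit-learn (1.2 onwards). If it is not available in your environment, a minimal fallback sketch, along the lines of the approach suggested in the scikit-learn deprecation notice, is shown below. It assumes the CMU StatLib copy of the data is still reachable at the URL shown, and it gives plain data and target arrays rather than the dictionary-like object used in the rest of this walkthrough.

#fallback loader for newer scikit-learn versions where load_boston is no longer available
#assumption: the CMU StatLib copy of the Boston Housing data is still online at this URL
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

#each record spans two physical lines in the raw file, so stitch them back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]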

The data set, when loaded into your Python environment, is returned as a dictionary-like object consisting of the following key-value components:

  • data
  • target
  • DESCR
  • feature_names

Let’s look at each of these. The keys() method lists the available keys, and each one can also be accessed as an attribute of the object.

boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
boston.data
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])
boston.target
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
...
boston.DESCR
"Boston House Prices dataset\n===========================\n\nNotes\n------\nData Set Characteristics:  \n\n    :Number of Instances: 506 \n\n    :Number of Attributes: 13 numeric/categorical predictive\n    \n    :Median Value (attribute 14) is usually the target\n\n    :Attribute Information (in order):\n        - CRIM     per capita crime rate by town\n        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n        - INDUS    proportion of non-retail business acres per town\n        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n        - NOX      nitric oxides concentration (parts per 10 million)\n        - RM       average number of rooms per dwelling\n        - AGE      proportion of owner-occupied units built prior to 1940\n        - DIS      weighted distances to five Boston employment centres\n        - RAD      index of accessibility to radial highways\n        
...
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

The data set has the following attributes (see https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names):

  1. CRIM per capita crime rate by town
  2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS proportion of non-retail business acres per town
  4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX nitric oxides concentration (parts per 10 million)
  6. RM average number of rooms per dwelling
  7. AGE proportion of owner-occupied units built prior to 1940
  8. DIS weighted distances to five Boston employment centres
  9. RAD index of accessibility to radial highways
  10. TAX full-value property-tax rate per $10,000
  11. PTRATIO pupil-teacher ratio by town
  12. B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT % lower status of the population
  14. MEDV Median value of owner-occupied homes in $1000’s

Having the data in a dictionary format can be a bit difficult to work with. The easiest way to work with the data is to convert it into a pandas DataFrame.

#convert the data set into a pandas DataFrame
import pandas as pd

df = pd.DataFrame(boston.data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

#the target variable is defined separately in the dictionary object,
#so we need to add that variable to the data frame
df['MEDV'] = boston.target

df.head(10)
Now let’s do some correlation analysis of the attributes/variables/features.
df.corr()

This is a very useful table that allows us to quickly see which variables are more or less correlated with each other. But as the number of variables increases, so does the difficulty of picking out these correlations, and we could very easily miss some.
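For example, here is a minimal sketch (using the df dataframe created above) that ranks the features by the strength of their correlation with the target MEDV, so the strongest relationships are easy to spot:

#rank the features by the absolute strength of their correlation with MEDV
corr_with_target = df.corr()['MEDV'].drop('MEDV')
print(corr_with_target.abs().sort_values(ascending=False))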

An alternative way to view these correlations is to add some color to them.

import matplotlib.pyplot as plt
import seaborn as sns

#alternatively, create a heat map for the correlations
plt.figure(figsize=(12,9))
sns.heatmap(df.corr(), annot=True, cmap='RdBu')
plt.show()

#Strongest positive correlations are displayed in blue while the strongest negative correlations are displayed in red.

We can also use the Scatter Plot feature with pandas to plot comparisons between variables.

#Now let's take a look at some scatter plots and histograms using Pandas' built-in scatter matrix function
pd.plotting.scatter_matrix(df[['RM','LSTAT','MEDV']], figsize=(12,12), s=75)

#you can see 'RM' (rooms) and 'LSTAT' (lower status of the population) have a clear linear relationship 
# with the median value of homes. 
#The scatter matrix also includes histograms on the diagonal, which are great for getting a sense of the distribution of the data.

#Notice that there is a straight line of anomalous data points where the median values max out at 50 ($500,000)

You can do lots more exploring of the data.
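For example, here is a quick sketch to look at the spread of the target variable and confirm how many records sit at the MEDV ceiling of 50 noted in the scatter matrix above:

#summary statistics for the target and a count of records capped at the maximum value of 50
print(df['MEDV'].describe())
print("Records at the MEDV cap of 50:", (df['MEDV'] == 50.0).sum())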

After exploring the data, and getting a better understanding of it and the various relationships, you are now ready to move on to the Linear Regression model. With Regression, the target is numeric, as opposed to the binary or multi-class target variable in your typical classification problems.
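As a quick check (a minimal sketch, not part of the original walkthrough), you can confirm that the target here is a continuous numeric variable rather than a small set of class labels:

#confirm the target is continuous and numeric rather than a small set of class labels
print(df['MEDV'].dtype)
print(df['MEDV'].nunique())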

#prepare data for input to model.
# Need a training and test data set
from sklearn.model_selection import train_test_split

#split the dataframe into 'input' attributes and target attribute
X = df.drop('MEDV', axis = 1)
Y = df['MEDV']

#create the training and test dataframes 2/3 to 1/3 split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(339, 13)
(167, 13)
(339,)
(167,)

Now that the data is prepared, we can run the Linear Regression algorithm.

#create and fit the Linear Regression model to the data
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, Y_train)

Now take the LM model and apply it to the test data set, make the predictions, and plot the Actual vs Predicted values.

#take the LM model and score the test data
Y_pred = lm.predict(X_test)

#plot the actual values against the predicted values
plt.figure(figsize=(12,9))
plt.scatter(Y_test, Y_pred)
plt.xlabel("Prices (Acutal)")
plt.ylabel("Predicted prices: (Predicted) ")
plt.title("Prices vs Predicted prices: Actual(X) vs Predicted(Y)")

As you can see, there is a bit of a spread in the values, which indicates that the model carries a noticeable amount of error. We can calculate some error metrics and other details for the model.

from sklearn.metrics import mean_squared_error, r2_score

# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
# Explained variance score: 1 is perfect prediction
print('R squared score: %.2f' % r2_score(Y_test, Y_pred))
print("Model intercept:", lm.intercept_)
Coefficients: 
 [-1.07236741e-01  6.78482210e-02  4.14870781e-02  1.98584442e+00
 -2.07245202e+01  2.99104758e+00 -2.76241212e-03 -1.70484078e+00
  3.42025266e-01 -1.22042386e-02 -9.29929587e-01  1.05398067e-02
 -5.52673433e-01]
Mean squared error: 21.18
R squared score: 0.77
Model intercept: 43.0458714954825
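
The raw coefficient array is hard to interpret on its own. Here is a minimal sketch to pair each coefficient with its feature name, using the training dataframe defined above:

#pair each coefficient with the corresponding feature name for easier reading
coef_table = pd.Series(lm.coef_, index=X_train.columns)
print(coef_table.sort_values())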

The next step would be to repeat the above steps, iteratively removing some of the variables to see what effect this has on the model and its error rate. The aim is to find the combination of variables that gives the smallest error rate.
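A minimal sketch of one such iteration is shown below; the particular variables dropped here ('AGE' and 'INDUS') are purely illustrative, not a recommendation:

#example iteration: drop a couple of variables, refit, and compare the R squared score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_reduced = df.drop(['MEDV', 'AGE', 'INDUS'], axis=1)
Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(X_reduced, Y, test_size=0.33, random_state=5)

lm_reduced = LinearRegression()
lm_reduced.fit(Xr_train, Yr_train)
print('R squared score (reduced model): %.2f' % r2_score(Yr_test, lm_reduced.predict(Xr_test)))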

I’ll leave that task for you to explore.