In the first of the Regression examples, we are going to perform a typical Linear Regression.
The data set is the Boston Housing data set, which is commonly used to illustrate Linear Regression. It used to be available on the UCI repository but is no longer available there. It is available on Kaggle, and it is also available as a data set in the scikit-learn library in Python. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below needs an older version of the library.
# The examples shown here demonstrate Linear Regression.
# - The data set is the Boston Housing data set. This is one of the typical data sets used for Linear Regression
# - The data set can be loaded directly from the Scikit-Learn library
from sklearn.datasets import load_boston
boston = load_boston()
When loaded into your Python environment, the data set is a dictionary-like object. Calling keys() lists its components.
Let's look at each of these key-value pairs in turn.
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
boston.data
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02, 4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02, 9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02, 4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02, 5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02, 6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02, 7.8800e+00]])
boston.target
array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. , ...
boston.DESCR
"Boston House Prices dataset\n===========================\n\nNotes\n------\nData Set Characteristics: \n\n :Number of Instances: 506 \n\n :Number of Attributes: 13 numeric/categorical predictive\n \n :Median Value (attribute 14) is usually the target\n\n :Attribute Information (in order):\n - CRIM per capita crime rate by town\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n - INDUS proportion of non-retail business acres per town\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n - NOX nitric oxides concentration (parts per 10 million)\n - RM average number of rooms per dwelling\n - AGE proportion of owner-occupied units built prior to 1940\n - DIS weighted distances to five Boston employment centres\n - RAD index of accessibility to radial highways\n ...
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
The data set has the following attributes (see https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000’s
Having the data in a dictionary format can be a bit difficult to work with. The easiest way to work with the data is to convert it into a pandas DataFrame.
# convert the data set into a pandas DataFrame
import pandas as pd
df = pd.DataFrame(boston.data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
# the target variable is defined separately in the dictionary object,
# so we need to add that variable to the data frame
df['MEDV'] = boston.target
df.head(10)
A table of pairwise correlations, such as the one returned by df.corr(), is very useful because it lets us quickly see which variables are more or less correlated with each other. But as the number of variables increases, so does the difficulty of picking out these correlations, and we could very easily miss some.
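Each entry in a correlation table like this is, by default in pandas, a Pearson correlation coefficient. As a minimal sketch of what that number measures, the function below computes it by hand. The function name and the sample values are made up for illustration and are not taken from the Boston data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (un-normalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical values, chosen so the relationships are exactly linear
rooms = [5.0, 6.0, 6.5, 7.0, 8.0]
price = [15.0, 21.0, 24.0, 27.0, 33.0]         # rises with rooms
lower_status = [20.0, 16.0, 14.0, 12.0, 8.0]   # falls as rooms rise

print(round(pearson(rooms, price), 3))          # close to +1
print(round(pearson(rooms, lower_status), 3))   # close to -1
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.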
An alternative way to view these correlations is to add some color to them.
import matplotlib.pyplot as plt
import seaborn as sns
# alternatively, create a heat map for the correlations
plt.figure(figsize=(12,9))
sns.heatmap(df.corr(), annot=True, cmap='RdBu')
plt.show()
# Strongest positive correlations are displayed in blue while the strongest negative correlations are displayed in red.
We can also use the scatter matrix feature of pandas to plot comparisons between variables.
# Now let's take a look at some scatter plots and histograms using the pandas built-in scatter matrix function
pd.plotting.scatter_matrix(df[['RM','LSTAT','MEDV']], figsize=(12,12), s=75)
# you can see 'RM' (rooms) and 'LSTAT' (lower status of the population) have a clear linear relationship
# with the median value of homes.
# The scatter matrix also includes histograms on the diagonal, which are great for getting a sense of the distribution of the data.
# Notice that there is a straight line of anomalous data points where the median values max out at 50 ($50,000, since MEDV is in $1000's)
You can do lots more exploring of the data.
After exploring the data and getting a better understanding of it and its various relationships, you are ready to move on to the Linear Regression model. With Regression, the target is numeric, as opposed to the binary or multi-class target variable in a typical classification problem.
# prepare data for input to model.
# Need a training and test data set
from sklearn.model_selection import train_test_split
# split the dataframe into 'input' attributes and target attribute
X = df.drop('MEDV', axis = 1)
Y = df['MEDV']
# create the training and test dataframes, 2/3 to 1/3 split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(339, 13)
(167, 13)
(339,)
(167,)
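To see where the 339 and 167 come from, here is a rough sketch of a shuffled split using only the standard library. It assumes the test count is rounded up, which is consistent with the shapes printed above. The helper name split_indices is hypothetical, and the shuffle order will not match the one scikit-learn produces for the same seed.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the row indices and cut them into train and test parts."""
    # Assumption: the number of test rows is rounded up
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # First n_test shuffled indices go to the test set, the rest to training
    return indices[n_test:], indices[:n_test]

# 506 rows in the Boston data set, one third held out for testing
train_idx, test_idx = split_indices(506, 0.33, seed=5)
print(len(train_idx), len(test_idx))  # 339 167
```

Every row ends up in exactly one of the two sets, so the model is scored on data it did not see during fitting.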
Now that the data is prepared, we can run the Linear Regression algorithm.
# create or fit the Linear Regression model to the data
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, Y_train)
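Fitting a Linear Regression means finding the coefficients and intercept that minimise the squared differences between the predictions and the actual values (ordinary least squares). As a simplified sketch, here is the closed-form solution for a single input variable. The function name and the data points are made up for illustration, and the model above uses 13 input variables rather than one.

```python
def fit_simple_ols(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying roughly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_simple_ols(xs, ys)
print(round(slope, 2), round(intercept, 2))
```

With several input variables the idea is the same, but there is one coefficient per variable, which is what lm.coef_ holds after fitting.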
Now take the LM model, apply it to the Test data set to make the predictions, and plot the Actual vs Predicted values.
# take the LM model and score the test data
Y_pred = lm.predict(X_test)
# plot the actual values against the predicted values
plt.figure(figsize=(12,9))
plt.scatter(Y_test, Y_pred)
plt.xlabel("Prices (Actual)")
plt.ylabel("Prices (Predicted)")
plt.title("Prices vs Predicted prices: Actual(X) vs Predicted(Y)")
plt.show()
As you can see, the points are spread out rather than lying on a straight diagonal line. This indicates that the model's predictions carry a fair amount of error. We can display some error metrics for the model.
from sklearn.metrics import mean_squared_error, r2_score
# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
# R squared score: 1 is perfect prediction
print('R squared score: %.2f' % r2_score(Y_test, Y_pred))
print("Model intercept:", lm.intercept_)
Coefficients:
 [-1.07236741e-01  6.78482210e-02  4.14870781e-02  1.98584442e+00
 -2.07245202e+01  2.99104758e+00 -2.76241212e-03 -1.70484078e+00
  3.42025266e-01 -1.22042386e-02 -9.29929587e-01  1.05398067e-02
 -5.52673433e-01]
Mean squared error: 21.18
R squared score: 0.77
Model intercept: 43.0458714954825
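To make these two metrics less abstract, here is a small sketch that computes them by hand. The function names mse and r_squared and the sample values are hypothetical. The last lines take the square root of the reported mean squared error to express it in the same units as MEDV.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up actual and predicted values, for illustration only
y_true = [20.0, 25.0, 30.0, 35.0]
y_pred = [22.0, 24.0, 29.0, 37.0]

print(mse(y_true, y_pred))        # 2.5
print(r_squared(y_true, y_pred))  # approximately 0.92

# Root of the MSE reported above, in the units of MEDV ($1000's)
rmse_reported = math.sqrt(21.18)
print(round(rmse_reported, 1))    # 4.6
```

Read this way, a mean squared error of 21.18 corresponds to a typical prediction error of roughly $4,600 on the median home value.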
The next step would be to repeat the above steps, iteratively removing some of the variables to see what effect this has on the model and the error rate. The aim is to find the combination of variables that gives the smallest error rate.
I’ll leave that task for you to explore.
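As a possible starting point, here is a rough sketch of one way to automate that search: greedily drop one variable at a time for as long as the test error keeps falling. It is only an illustration under stated assumptions. It uses NumPy's least-squares solver instead of the LinearRegression class, it runs on synthetic data rather than the Boston data, and the helper names holdout_mse and backward_elimination are hypothetical.

```python
import numpy as np

def holdout_mse(X_train, y_train, X_test, y_test, cols):
    """Fit least squares on the given columns and return the test-set MSE."""
    # A column of ones lets the fit include an intercept
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    A_test = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residuals = y_test - A_test @ coef
    return float(np.mean(residuals ** 2))

def backward_elimination(X_train, y_train, X_test, y_test):
    """Greedily drop one column at a time while the test MSE keeps falling."""
    cols = list(range(X_train.shape[1]))
    best = holdout_mse(X_train, y_train, X_test, y_test, cols)
    while len(cols) > 1:
        # Try removing each remaining column and record the resulting error
        trials = [
            (holdout_mse(X_train, y_train, X_test, y_test,
                         [c for c in cols if c != drop]), drop)
            for drop in cols
        ]
        err, drop = min(trials)
        if err >= best:
            break  # no single removal improves the error
        best = err
        cols.remove(drop)
    return cols, best

# Synthetic data: columns 0 and 1 drive the target, columns 2 and 3 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

full_mse = holdout_mse(X_train, y_train, X_test, y_test, [0, 1, 2, 3])
kept, best_mse = backward_elimination(X_train, y_train, X_test, y_test)
print("MSE with all columns:", round(full_mse, 3))
print("Columns kept:", kept, "MSE:", round(best_mse, 3))
```

One caveat: choosing variables by their test-set error means the test set is no longer an independent check, so in practice a separate validation set or cross-validation is a safer basis for the selection.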