L13-9 – k-Fold Cross Validation and Repeated k-Fold Cross Validation

When it comes to evaluation the performance of a machine learning model there are a number of different approaches. Plus there are as many different view points on what is the best or better evaluation metric to use.

One of the common approaches is to use k-Fold cross validation. This divides the data in to ‘k‘ non-overlapping parts (or Folds). One of these part/Folds is used for hold out testing and the remaining part/Folds (k-1) are used to train and create a model. This model is then used to applied or fitted to the hold-out ‘k‘ part/Fold. This process is repeated across all the ‘k‘ parts/Folds until all the data has been used. The results from applying or fitting the model are aggregated and the mean performance is report.

Traditionally, ‘k‘ is set to 10 and will be the default value in most/all languages, libraries, packages and application. This number can be changed to anything you want. Most reports indicated a value of between 5 and 10, as these seem to indicate results that don’t suffer from bias or variance.

Let’s take a look at an example of using k-Fold Cross Validation using Scikit-Learning library. First step is to prepare the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

bank_file = "/.../4-Datasets/bank-additional-full.csv"

# import dataset
df = pd.read_csv(bank_file, sep=';',)

# get basic details of df (num records, num features)
df.shape

print('Percentage per target class ')
df['y'].value_counts()/len(df) #calculate percentages

#Data Clean up
df = df.drop('duration', axis=1) #this is highly correlated to target variable
df_new = pd.get_dummies(df) #simple and easy approach for categorical variables
df_new.describe()
df['y'] = df['y'].map({'no':0, 'yes':1}) # binary encoding of class label

#split data set into input variables and target variables
## create separate dataframes for Input features (X) and for Target feature (Y)
X_train = df_new.drop('y', axis=1)
Y_train = df_new['y']

Now we can perform k-fold cross valuation.

#load scikit-learn k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#setup for k-Fold Cross Validation
cv = KFold(n_splits=10, shuffle=True, random_state=1)
#n_splits = number of k-folds
#shuffle = shuffles data set prior to split
#radnom_state = seed for (pseydo)random number generator
#define model
model = LogisticRegression()
#create model, perform cross validation and evaluate model
scores = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=cv, n_jobs=-1)
#performance result
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

 

We can see from the above example the model is evaluated across 10 folds, giving the accuracy score for each of these. The mean of these 10 accuracy scores is calculated along with the standard deviation, which in this example is very small. You may have slightly different results and this will vary from data set to data set.

The results from k-fold can be nosy, as in each time the code is run a slightly different result may be achieved. This is due to having differing splits of the data set into the k-folds. The model accuracy can vary between each execution and it can be difficult to determine which iteration of the model should be used.

One way to address this possible noise is to estimate the model accurary/performance based on running k-fold a number of times and calculating the performance across all the repeats. This approach is called Repeated k-Fold Cross-Validation. Yes there is a computation cost for performing this approach, and it therefore suited to datasets of smaller scale. In most scenarios having data sets up to 1M records/cases is possible, and depending on the hardware and memory, it can scale to many times that and still be relatively quick to run.

[a small data set for one person could be another persons Big Data set!]

How many repeats should be performed? It kind of depends on how noisy the data is, but in a similar way of having ten as a default value for k, the number of repeats default is ten. Although the typical default is ten, but can be adjusted to say 5, but some testing/experimentation is needed to determine a suitable value.

Building upon the k-fold example code given previously, the following shows can example of using the Repeated k-Fold Cross Validation.

#Repeated k-Fold Cross Validation
#load the necessary libraries
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#using the same data set created for k-Fold => X_train, Y_train

#Setup and configure settings for Repeated k-Fold CV (k-folds=10, repeats=10)
rcv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)

#define model
model = LogisticRegression()

#create model, perform Repeated CV and evaluate model
scores = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=rcv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))