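K-Fold divides the data into k parts, also known as folds. The estimator/algorithm is tested on one fold and trained on the remaining k-1 folds, and this is repeated k times so that every fold serves as the test set exactly once. The resulting performance estimate tends to be more reliable than one from a single train/test split, because the estimator is evaluated on several different splits of the data.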
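To make the fold idea concrete, here is a minimal sketch on a toy array. The names `X_toy`, `kf` and `fold_sizes` are illustrative assumptions and not part of the recipe; only `KFold` and its `split()` method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

# toy data: 10 samples, 1 feature (values are arbitrary, for illustration only)
X_toy = np.arange(10).reshape(-1, 1)

# 5 folds without shuffling: each test fold is a consecutive block of samples
kf = KFold(n_splits=5)

fold_sizes = []
tested = []
for i, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested.extend(test_idx.tolist())
    print("fold %d: train=%s test=%s" % (i, train_idx, test_idx))
```

With 10 samples and 5 folds, each iteration trains on 8 samples and tests on the remaining 2, and every sample ends up in a test fold exactly once.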
Cross-validation can be done with the cross_val_score() helper function, which takes the estimator, the dataset and the split technique (k-fold).
- cross_val_score() returns the estimator's score for each fold
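As a quick illustration of the return value, the sketch below runs cross_val_score() on synthetic data. The dataset from `make_classification`, the variable names, and the `max_iter=1000` setting are assumptions made for this example only; they are not the data or settings used in the recipe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic binary classification data (assumption: stands in for a real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 shuffled folds with a fixed seed so the splits are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# for a classifier, the default score is accuracy on each test fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)
print(scores.shape)  # one score per fold
```

The returned NumPy array has one entry per fold, which is why the recipe can call mean() and std() on it directly.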
This recipe includes the following topics:
- Load the data file from GitHub
- Split columns into the usual feature columns (X) and target column (Y)
- Set k-fold count to 10
- Set a seed so the same random splits are reproduced each time
- Split data using KFold() class
- Instantiate a classification model (LogisticRegression)
- Call cross_val_score() to run cross validation
- Calculate mean and standard deviation from scores returned by cross_val_score()
# import modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)
# convert into numpy array for scikit-learn
pimaArr = pimaDf.values
# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'class' column whose value is either '0' or '1'
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]
# set k-fold count
folds = 10
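# set seed so the shuffled splits are the same on every run
seed = 7
# split data using KFold
# shuffle=True is needed for random_state to have any effect; recent
# scikit-learn versions raise an error if random_state is set without it
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)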
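# instantiate a classification model
# liblinear was the default solver in older scikit-learn releases; newer
# releases default to lbfgs, which may warn about convergence on unscaled data
model = LogisticRegression(solver='liblinear')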
# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)
# calculate mean of scores for all folds
meanAccuracy = resultArr.mean() * 100
# calculate standard deviation of scores for all folds
stdAccuracy = resultArr.std() * 100
# display accuracy
print("Mean accuracy: %.3f, Standard deviation: %.3f" % (meanAccuracy, stdAccuracy))
Mean accuracy: 76.951, Standard deviation: 4.841
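For readers curious about what the helper does per fold, the following is a rough sketch of an equivalent manual loop. It is not scikit-learn's actual implementation, and it uses synthetic data from `make_classification` so that it runs without a network connection; `X_demo`, `y_demo`, `base_model` and `max_iter=1000` are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in data (assumption: not the Pima dataset used above)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=7)
base_model = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, test_idx in cv.split(X_demo):
    # fit a fresh, unfitted copy of the model on the training folds
    fold_model = clone(base_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy on the held-out fold for classifiers
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# the helper should give the same per-fold scores for the same splitter
helper_scores = cross_val_score(base_model, X_demo, y_demo, cv=cv)

print("Mean accuracy: %.3f, Standard deviation: %.3f"
      % (manual_scores.mean() * 100, manual_scores.std() * 100))
```

Because the splitter has a fixed seed, both approaches see identical folds, so the manual scores and the helper's scores should match.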