A voting ensemble builds multiple models (generally of different types) and combines their predictions with a simple statistic, such as a majority vote over the predicted labels or the mean of the predicted probabilities. scikit-learn's VotingClassifier wraps the sub-models and combines their predictions.
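To make the combining step concrete, here is a minimal sketch of the two usual voting rules in plain NumPy. It is an illustration only, not scikit-learn's internal code: the helper names `hard_vote` and `soft_vote` and the small prediction arrays are made up for this example. VotingClassifier uses hard (majority) voting by default.

```python
import numpy as np

def hard_vote(label_preds):
    """Majority vote over class labels; rows are models, columns are samples."""
    label_preds = np.asarray(label_preds)
    classes = np.unique(label_preds)
    # count how many models voted for each class, per sample
    counts = np.array([(label_preds == c).sum(axis=0) for c in classes])
    # pick the class with the most votes (ties go to the lowest class label)
    return classes[counts.argmax(axis=0)]

def soft_vote(proba_preds):
    """Average class probabilities; shape is (models, samples, classes)."""
    mean_proba = np.mean(proba_preds, axis=0)
    return mean_proba.argmax(axis=1)

# hypothetical label predictions from three models for four samples
labels = [[0, 1, 1, 0],
          [1, 1, 0, 0],
          [0, 1, 1, 1]]
hard_result = hard_vote(labels)

# hypothetical class probabilities from three models for two samples
probas = [[[0.90, 0.10], [0.4, 0.6]],
          [[0.40, 0.60], [0.3, 0.7]],
          [[0.45, 0.55], [0.7, 0.3]]]
soft_result = soft_vote(probas)

print("Hard vote:", hard_result)
print("Soft vote:", soft_result)
```

In the first soft-vote sample, two of the three models lean towards class 1, yet the averaged probabilities favour class 0 because one model is very confident. This is the practical difference between the two rules.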
In this example, we will use three different classification algorithms/models:
1. Logistic Regression
2. Classification and Regression Trees (CART)
3. Support Vector Machines
This recipe includes the following topics:
- Load classification problem dataset (Pima Indians) from GitHub
- Split columns into the usual feature columns (X) and target column (Y)
- Split data using KFold() with k-fold count: 10, seed: 7
- Instantiate the Logistic Regression model
- Instantiate the CART model
- Instantiate the Support Vector Machines model
- Instantiate the voting ensemble method: VotingClassifier with the above three models
- Call cross_val_score() to run cross validation
- Calculate mean estimated accuracy from scores returned by cross_val_score()
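The last two steps can be unpacked into an explicit loop. The sketch below is a rough equivalent of what cross_val_score() does for a single model, under some assumptions: it uses synthetic stand-in data from make_classification() instead of the Pima Indians file so that it runs offline, and the names `X_demo`, `y_demo`, `kfold_demo` and `fold_scores` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: NOT the Pima Indians dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# 10 folds, shuffled with a fixed seed so the splits are reproducible
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

fold_scores = []
for train_idx, test_idx in kfold_demo.split(X_demo):
    # train a fresh model on 9 folds, then score it on the held-out fold
    model = DecisionTreeClassifier(random_state=7)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean estimated accuracy: %.5f" % fold_scores.mean())
```

Each sample is used for testing exactly once, so the mean over the folds is an estimate of accuracy on unseen data.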
# import modules
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
# read data file from GitHub
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)
# convert into numpy array for scikit-learn
pimaArr = pimaDf.values
# split columns into the usual feature columns (X) and target column (Y)
# Y represents the target 'class' column, whose value is either 0 or 1
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]
# set k-fold count
folds = 10
# set seed to reproduce the same random data each time
seed = 7
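# split data using KFold
# shuffle=True is required for random_state to take effect;
# recent scikit-learn versions raise a ValueError otherwise
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)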
# instantiate the LogisticRegression model
lr = LogisticRegression()
# instantiate the DecisionTreeClassifier model
cart = DecisionTreeClassifier()
# instantiate the Support Vector Machines model (SVC)
svm = SVC()
# create the list of (name, model) sub-models
estimators = [('lr', lr), ('cart', cart), ('svm', svm)]
# instantiate the voting ensemble method: VotingClassifier
ensemble = VotingClassifier(estimators)
# call cross_val_score() to run cross validation
resultArr = cross_val_score(ensemble, X, Y, cv=kfold)
# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()
# display mean estimated accuracy
print("Mean estimated accuracy: %.5f" % meanAccuracy)
Mean estimated accuracy: 0.72905
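A single ensemble score is easier to interpret next to the scores of its sub-models on the same folds. The following is a minimal sketch of such a comparison, again on synthetic stand-in data from make_classification() rather than the Pima Indians file, so the numbers it prints say nothing about the result above. The names `X_demo`, `y_demo`, `kfold_demo`, `sub_models` and `mean_scores` are illustrative, and `max_iter=1000` and the fixed `random_state` values are choices made for this sketch only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (assumption: NOT the Pima Indians dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=7)
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

# the same three model types used in the recipe
sub_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('cart', DecisionTreeClassifier(random_state=7)),
    ('svm', SVC()),
]
# cross_val_score clones each estimator, so reusing the instances is safe
candidates = sub_models + [('voting', VotingClassifier(sub_models))]

mean_scores = {}
for name, model in candidates:
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold_demo)
    mean_scores[name] = scores.mean()
    print("%-6s mean accuracy: %.5f" % (name, mean_scores[name]))
```

If the ensemble does not beat its best sub-model, the sub-models are probably making similar mistakes, and voting has little to correct.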