The bagging ensemble method builds multiple models (generally of the same type) from different samples of the training dataset, drawn with replacement. The predictions of all the sub-models are then averaged.
Extra Trees (extremely randomized trees) are another modification of bagging in which randomized decision trees are constructed from the training set, with split thresholds drawn at random rather than searched for. The model is built using the ExtraTreesClassifier class.
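As a rough illustration of the idea, the sketch below draws bootstrap samples (with replacement) and averages the sub-model predictions using only the standard library. The toy data, the nearest-neighbour sub-model, and all function names are assumptions made for this example; it is not how scikit-learn implements bagging.

```python
import random
from statistics import mean

def bootstrap_sample(rows, rng):
    # draw len(rows) rows at random, WITH replacement
    return [rng.choice(rows) for _ in rows]

def fit_nearest_neighbour(sample):
    # hypothetical toy sub-model: predict the label of the closest training row
    def predict(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return predict

def bagged_predict(models, x):
    # average the 0/1 predictions of all sub-models
    return mean(model(x) for model in models)

rng = random.Random(7)
# toy (feature, label) pairs: label is 1 when the feature is 10 or more
data = [(x, 1 if x >= 10 else 0) for x in range(20)]

# one sub-model per bootstrap sample
models = [fit_nearest_neighbour(bootstrap_sample(data, rng)) for _ in range(25)]

averaged = bagged_predict(models, x=15)
# for 0/1 labels, thresholding the average at 0.5 is a majority vote
predicted_class = 1 if averaged >= 0.5 else 0
print("Averaged prediction: %.2f, class: %d" % (averaged, predicted_class))
```

Each bootstrap sample typically omits some rows and repeats others, which is what makes the sub-models differ from one another.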
This recipe includes the following topics:
- Load classification problem dataset (Pima Indians) from GitHub
- Split columns into the usual feature columns (X) and target column (Y)
- Split data using KFold() with k-fold count: 10, seed: 7
- Instantiate the bagging ensemble method: ExtraTreesClassifier with num_trees: 100 and max_features: 7
- Call cross_val_score() to run cross validation
- Calculate mean estimated accuracy from scores returned by cross_val_score()
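To make the k-fold step in the list above more concrete, here is a minimal standard-library sketch of how row indices can be partitioned into 10 folds. The helper name kfold_indices, the shuffling step, and the row count of 768 (the size of the Pima Indians dataset) are assumptions for illustration; scikit-learn uses its own random number generator, so the actual fold membership will differ.

```python
import random

def kfold_indices(n_rows, n_folds, seed):
    """Shuffle the row indices once, then cut them into n_folds (train, test) pairs."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # the first n_rows % n_folds folds receive one extra row
    base, extra = divmod(n_rows, n_folds)
    pairs, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        pairs.append((train, test))
        start += size
    return pairs

fold_pairs = kfold_indices(n_rows=768, n_folds=10, seed=7)
fold_sizes = [len(test) for _, test in fold_pairs]
print(fold_sizes)  # [77, 77, 77, 77, 77, 77, 77, 77, 76, 76]
```

Every row lands in exactly one test fold, so each row is used for evaluation once and for training nine times.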
# import modules
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
# read data file from GitHub
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names=cols)
# convert into numpy array for scikit-learn
pimaArr = pimaDf.values
# Let's split columns into the usual feature columns (X) and target column (Y)
# Y represents the target 'class' column whose value is either 0 or 1
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]
# set k-fold count
folds = 10
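# set seed so the shuffled fold assignment is the same on every run
seed = 7
# split data using KFold; shuffle=True is needed for random_state to take effect
# (recent scikit-learn versions raise a ValueError if random_state is set without it)
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)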
# set total number of trees
num_trees = 100
# set number of features considered at each split
max_features = 7
# instantiate the bagging ensemble method: ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold)
# calculate mean of scores for all folds
meanAccuracy = resultArr.mean()
# display mean estimated accuracy
# (the exact value varies between runs because the model itself is not seeded)
print("Mean estimated accuracy: %.5f" % meanAccuracy)
Mean estimated accuracy: 0.76676
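The recipe above does not seed the model itself, so the reported accuracy can change from run to run. The sketch below shows one possible way to make the result repeatable and to report the spread across folds as well as the mean. It uses synthetic data from make_classification as a stand-in (an assumption, not the Pima dataset), and the helper name run_cv is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in data with 8 features (an assumption, not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

def run_cv(seed):
    # seed both the fold assignment and the tree construction
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    model = ExtraTreesClassifier(n_estimators=100, max_features=7, random_state=seed)
    return cross_val_score(model, X_demo, y_demo, cv=cv)

first = run_cv(seed=7)
second = run_cv(seed=7)

# report the spread across folds alongside the mean
print("Mean: %.5f, std: %.5f" % (first.mean(), first.std()))
print("Identical across runs:", (first == second).all())
```

Passing random_state to ExtraTreesClassifier fixes the random split thresholds, so two runs with the same seed should produce the same per-fold scores.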