# 4. Feature Importance

Bagged ensembles of decision trees such as Random Forest can be used to estimate the importance of features. The higher the score, the more important the feature.
This recipe includes the following topics:
- Initialize the RandomForestClassifier class
- Call fit() to build a forest of trees from the training set (X, Y)
- Display the feature importances
# import modules
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)
# convert into numpy array
pimaArr = pimaDf.values
# split the data into features (X) and target (Y)
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]
# initialize the RandomForestClassifier class and
# call fit() to build a forest of trees from the training set (X, Y)
rfc = RandomForestClassifier().fit(X, Y)
# display the feature_importances_ attribute
print("Feature importances: %s" % rfc.feature_importances_)
print('-'*60)
# The scores suggest that plas, mass, and pedi are the most important features
Feature importances: [0.10203687 0.25106337 0.08872303 0.06846597 0.07482446 0.15623041
0.13915677 0.11949911]
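
Because no random_state was passed to RandomForestClassifier, the exact scores will differ slightly from run to run. To make the ranking easier to read, the importances can be paired with the column names and sorted; the snippet below is a minimal sketch that reuses the cols list and the fitted rfc from above, and uses the forest's estimators_ attribute to inspect how much the scores vary across the individual trees.

# pair each importance score with its feature name and rank them
importances = pd.Series(rfc.feature_importances_, index = cols[:-1])
print(importances.sort_values(ascending = False))
# spread of each feature's importance across the individual trees
stds = np.std([tree.feature_importances_ for tree in rfc.estimators_], axis = 0)
print("Std across trees: %s" % stds)

Features with a high mean importance and a low spread across trees are the most dependable candidates to keep.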