The SelectKBest class selects the features that have the strongest relationship with the target (univariate selection), keeping the k features with the highest scores.
This recipe includes the following topics:
- Initialize the SelectKBest class with the number of features to keep set to k=4
- Use the chi-squared statistic as the score function
- Call fit() to run the score function and score each feature
- Display the feature scores
- Call transform() to reduce X to the selected features
################################
## Feature Selection ##
################################
# 1. Univariate Selection
# import modules
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)
# convert into numpy array
pimaArr = pimaDf.values
# split the data into features (X) and target (Y)
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]
# initialize the SelectKBest class:
# 1. select chi2 as the score function
# 2. keep the k=4 best features
# 3. call fit() to run the score function and score each feature
uniSelector = SelectKBest(score_func=chi2, k=4).fit(X, Y)
# display scores of features
print(uniSelector.scores_)
print('-'*60)
# call transform to reduce X to the selected features/columns
selectedDf = uniSelector.transform(X)
# print first 3 rows of output with only the best 4 features/columns
print(selectedDf[:3,])
[ 111.51969064 1411.88704064 17.60537322 53.10803984 2175.56527292
127.66934333 5.39268155 181.30368904]
------------------------------------------------------------
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]]
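The transformed array above no longer carries column names, so it can be hard to tell which features survived. One way to recover the names is the selector's get_support() method, which returns a boolean mask over the input columns. Here is a minimal sketch using hypothetical toy data (the column names 'a'–'d' and the data-generating rules are invented for illustration, not taken from the Pima dataset):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: two informative columns ('a', 'c')
# and two pure-noise columns ('b', 'd')
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
df = pd.DataFrame({
    'a': y * 5 + rng.randint(0, 3, size=200),  # strongly related to y
    'b': rng.randint(0, 3, size=200),          # noise
    'c': y * 3 + rng.randint(0, 2, size=200),  # related to y
    'd': rng.randint(0, 4, size=200),          # noise
})

selector = SelectKBest(score_func=chi2, k=2).fit(df.values, y)

# get_support() returns a boolean mask over the input columns;
# zip it with the column names to see which features were kept
mask = selector.get_support()
selected = [name for name, keep in zip(df.columns, mask) if keep]
print(selected)
```

The same zip-with-mask pattern applies directly to the Pima example: pairing `uniSelector.get_support()` with the `cols` list reveals which of the eight features the selector kept.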