Feature selection: Univariate Selection

The SelectKBest class scores each feature individually against the target column (a univariate test) and keeps the k features with the highest scores.

This recipe includes the following topics:

  • Initialize the SelectKBest class to keep the 4 best features (k=4)
  • Use the chi-squared statistic as the score function
  • Call fit() to run the score function on every feature
  • Display the score of each feature
  • Call transform() to reduce X to the selected features
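
To make the chi-squared score less of a black box, here is a minimal sketch of how such a score can be computed by hand for non-negative features, compared against scikit-learn's chi2. The helper name manual_chi2 and the random demo data are assumptions made for illustration only; they are not part of the recipe or of the Pima dataset.

```python
import numpy as np
from sklearn.feature_selection import chi2

def manual_chi2(X, y):
    """Illustrative chi-squared score per feature (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    # one-hot encode the target: shape (n_samples, n_classes)
    Y = (np.asarray(y)[:, None] == classes[None, :]).astype(float)
    # observed: per-class sum of each feature, shape (n_classes, n_features)
    observed = Y.T @ X
    # expected: class frequency times the feature's overall sum
    feature_total = X.sum(axis=0, keepdims=True)
    class_prob = Y.mean(axis=0, keepdims=True)
    expected = class_prob.T @ feature_total
    # sum the chi-squared terms over classes, leaving one score per feature
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# made-up non-negative demo data (not the Pima dataset)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(50, 3)).astype(float)
y_demo = rng.integers(0, 2, size=50)

manual_scores = manual_chi2(X_demo, y_demo)
sklearn_scores, p_values = chi2(X_demo, y_demo)
print(manual_scores)
print(sklearn_scores)
```

Note that chi2 requires non-negative feature values, which holds for the Pima columns used in this recipe.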

##      Feature Selection     ##

# 1. Univariate Selection
# import modules
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# read data file from github
# dataframe: pimaDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array
pimaArr = pimaDf.values

# Split the data into input features (X) and the target column (Y)
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# initialize SelectKBest class
# 1. select chi2 as score function
# 2. keep the 4 best features (k=4)
# 3. call fit() to run the score function on every feature
uniSelector = SelectKBest(score_func=chi2, k=4).fit(X, Y)

# display scores of features (one score per column of X)
print(uniSelector.scores_)

# call transform to reduce X to the selected features/columns
selectedDf = uniSelector.transform(X)

# print first 3 rows of output with only the best 4 features/columns
print(selectedDf[:3])

[ 111.51969064 1411.88704064   17.60537322   53.10803984 2175.56527292
  127.66934333    5.39268155  181.30368904]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]]
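
The transformed array no longer carries column names, so it can be hard to tell which features were kept. A minimal sketch of one way to recover them uses the selector's get_support() mask. The feature names f0 to f4 and the random data below are hypothetical stand-ins so the example runs without downloading anything; in the recipe above, the same idea would apply to uniSelector and the first eight entries of cols.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# hypothetical feature names and made-up non-negative data
feature_names = np.array(['f0', 'f1', 'f2', 'f3', 'f4'])
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 20, size=(100, 5)).astype(float)
y_demo = rng.integers(0, 2, size=100)

selector = SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)

# boolean mask with True for each column that transform() keeps
mask = selector.get_support()
selected_names = feature_names[mask]

# pair every feature name with its score, highest score first
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
print("selected:", list(selected_names))
```

transform() keeps the selected columns in their original order, not in order of score, which is why the mask is more reliable than sorting the scores by eye.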

