Preprocessing data: Rescaling using scikit-learn

Rescaling transforms the data so that every feature shares the same scale.
Each transformed value lies between a chosen minimum and maximum, most often zero and one.

This recipe includes the following topics:

  • Rescale using the MinMaxScaler class
  • Call fit() to compute the per-feature min and max values used for later scaling
  • Call transform() on the input data
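
As a minimal sketch of what these steps compute, the snippet below applies the min-max formula (x - min) / (max - min) column by column with NumPy and compares it with MinMaxScaler. The array X_toy and the names col_min, col_max, manual and scaled are made up for illustration and are not part of the recipe's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration: 3 samples, 2 features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 50.0]])

# Per-feature (column-wise) minimum and maximum
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Min-max formula for a target range of (0, 1)
manual = (X_toy - col_min) / (col_max - col_min)

# The same rescaling done by the scikit-learn class
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(manual)
print(np.allclose(manual, scaled))
```

Each column is scaled independently, so the smallest value in a column maps to 0 and the largest maps to 1.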

# import modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# read data file from github
# dataframe: pimaDf
gitFileURL = ''
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
pimaDf = pd.read_csv(gitFileURL, names = cols)

# convert into numpy array
pimaArr = pimaDf.values

# Split the array into input features (X) and the class label (Y)
# Y is not used in this example, since only the input features are rescaled
X = pimaArr[:, 0:8]
Y = pimaArr[:, 8]

# 1. initialize MinMaxScaler class to limit output range between 0 and 1
# 2. call fit() function to compute the min and max value
scaler = MinMaxScaler(feature_range=(0,1)).fit(X)

# rescale input data using transform()
rescaledX = scaler.transform(X)

# limit precision to 3 decimal places for printing
np.set_printoptions(precision=3)

# print first 3 rows of input data
print(X[0:3, :])

# print first 3 rows of output data
print(rescaledX[0:3, :])
[[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
 [  1.     85.     66.     29.      0.     26.6     0.351  31.   ]
 [  8.    183.     64.      0.      0.     23.3     0.672  32.   ]]
[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]]
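
Because fit() stores the min and max values, the same scaler can be applied later to rows it has not seen. The sketch below illustrates this on hypothetical data (X_fit, X_new and scaler_demo are invented for the example). It reads the learned values from the data_min_ and data_max_ attributes and assumes the default clip=False, under which values outside the fitted range fall outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data, made up for illustration
X_fit = np.array([[0.0, 100.0],
                  [5.0, 200.0],
                  [10.0, 300.0]])
X_new = np.array([[2.5, 150.0],
                  [20.0, 400.0]])  # second row lies outside the fitted range

# fit() records the per-feature min and max of X_fit only
scaler_demo = MinMaxScaler(feature_range=(0, 1)).fit(X_fit)
print(scaler_demo.data_min_, scaler_demo.data_max_)

# transform() reuses the stored min and max; nothing is refitted
X_new_scaled = scaler_demo.transform(X_new)
print(X_new_scaled)

# inverse_transform() maps scaled values back to the original units
X_new_restored = scaler_demo.inverse_transform(X_new_scaled)
print(X_new_restored)
```

Fitting on one set of rows and reusing the scaler on another keeps both sets on the same scale, which is the usual reason for separating fit() from transform().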

