Correlation refers to the relationship between two attributes.
– The correlation coefficient ranges from −1(full negative correlation) to 1(full positive)
– A value of 0 implies that there is no linear correlation between the columns
Note: It is recommended to remove highly correlated columns in some machine learning algorithms
This recipe includes the following topics:
- Use the standard Pearson’s Correlation Coefficient
- Compute pairwise correlation of columns
# import module
import pandas as pd
fileGitURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
# define column names
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# load file as a Pandas DataFrame
pimaDf = pd.read_csv(fileGitURL, names=cols)
# set options
pd.set_option('precision', 3)
pd.set_option('display.width', 100)
# calculate correlation between columns
correlation = pimaDf.corr(method='pearson')
print(correlation)
preg plas pres skin test mass pedi age class
preg 1.000 0.129 0.141 -0.082 -0.074 0.018 -0.034 0.544 0.222
plas 0.129 1.000 0.153 0.057 0.331 0.221 0.137 0.264 0.467
pres 0.141 0.153 1.000 0.207 0.089 0.282 0.041 0.240 0.065
skin -0.082 0.057 0.207 1.000 0.437 0.393 0.184 -0.114 0.075
test -0.074 0.331 0.089 0.437 1.000 0.198 0.185 -0.042 0.131
mass 0.018 0.221 0.282 0.393 0.198 1.000 0.141 0.036 0.293
pedi -0.034 0.137 0.041 0.184 0.185 0.141 1.000 0.034 0.174
age 0.544 0.264 0.240 -0.114 -0.042 0.036 0.034 1.000 0.238
class 0.222 0.467 0.065 0.075 0.131 0.293 0.174 0.238 1.000