Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset’s distribution.
This recipe includes the following topics:
- Load a CSV file using pandas
- Display shape of data
- Display data types for each attribute
- Set display options
- Generate descriptive statistics
# import module
import pandas as pd
fileGitURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/pima-indians-diabetes.data.csv'
# define column names
cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# load file as a Pandas DataFrame
pimaDf = pd.read_csv(fileGitURL, names=cols)
# get shape as (number of rows, number of columns)
shape = pimaDf.shape
# get data types for each attribute
types = pimaDf.dtypes
# set options
pd.set_option('display.precision', 3)
pd.set_option('display.width', 100)
# generate descriptive statistics
stats = pimaDf.describe()
# display results
print(shape)
print(types)
print(stats)
(768, 9)
preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
          preg     plas     pres     skin     test     mass     pedi      age    class
count  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000
mean     3.845  120.895   69.105   20.536   79.799   31.993    0.472   33.241    0.349
std      3.370   31.973   19.356   15.952  115.244    7.884    0.331   11.760    0.477
min      0.000    0.000    0.000    0.000    0.000    0.000    0.078   21.000    0.000
25%      1.000   99.000   62.000    0.000    0.000   27.300    0.244   24.000    0.000
50%      3.000  117.000   72.000   23.000   30.500   32.000    0.372   29.000    0.000
75%      6.000  140.250   80.000   32.000  127.250   36.600    0.626   41.000    1.000
max     17.000  199.000  122.000   99.000  846.000   67.100    2.420   81.000    1.000
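
To make the rows of the describe() output less opaque, here is a minimal sketch that recomputes each row by hand. It uses a small made-up DataFrame (toyDf and its values are hypothetical, not taken from the Pima dataset) so it runs without downloading anything.

```python
import pandas as pd

# hypothetical toy data, used only for illustration
toyDf = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                      'y': [2.0, 4.0, 4.0, 6.0, 9.0]})

# summary produced by pandas
stats = toyDf.describe()

# the same rows, computed one statistic at a time
manual = pd.DataFrame({
    'count': toyDf.count(),         # number of non-null values
    'mean': toyDf.mean(),           # arithmetic mean
    'std': toyDf.std(),             # sample standard deviation (ddof=1)
    'min': toyDf.min(),
    '25%': toyDf.quantile(0.25),    # first quartile
    '50%': toyDf.median(),          # median
    '75%': toyDf.quantile(0.75),    # third quartile
    'max': toyDf.max(),
}).T

# display both tables for comparison
print(stats)
print(manual)
```

The two printed tables should match row for row. Note that std is the sample standard deviation, which divides by n - 1 rather than n.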