Least Absolute Shrinkage and Selection Operator (or LASSO for short) is an extension of linear regression. It reduces model complexity by adding a penalty to the loss function: the sum of the absolute values of the coefficients (also called the L1-norm).
Coefficients are basically the weights assigned to the features, based on their importance.
Example: In a linear regression equation: (y = ax + b), ‘a’ is a coefficient.
The basic loss function measures the difference between our prediction and the actual value; in this recipe that is the mean squared error (MSE), and LASSO adds its L1 penalty on top of it.
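To make this concrete, here is a minimal sketch (illustrative only, not part of the recipe's code) of the LASSO objective as scikit-learn formulates it: (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1. The helper lasso_loss() is a name chosen here for illustration.
# illustrative helper: LASSO objective = squared-error term + L1 penalty
import numpy as np
def lasso_loss(X, y, w, alpha):
    n = len(y)
    residuals = y - X @ w
    mse_term = (residuals ** 2).sum() / (2 * n)  # squared-error term
    l1_penalty = alpha * np.abs(w).sum()         # L1-norm of the coefficients
    return mse_term + l1_penalty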
Note: LASSO regression is a linear machine learning (ML) algorithm, which makes it simpler and faster than non-linear algorithms.
This recipe includes the following topics:
- Load the Boston house price (regression) dataset from GitHub
- Split columns into the usual feature columns (X) and target/prediction column (Y)
- Split data using the KFold() class with 10 folds and seed 7
- Instantiate a regression model (Lasso)
- Set scoring parameter to ‘neg_mean_squared_error’
- Call cross_val_score() to run cross validation
- Calculate Mean Squared Error from scores returned by cross_val_score()
Caveat: scikit-learn's cross_val_score() follows the convention that a larger score is always better. MSE is an error metric, where the smallest value is best, so we pass scoring='neg_mean_squared_error', which reports the negated MSE. This preserves the 'larger is better' convention, but it also means the reported score is negative even though MSE itself can never be negative.
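A quick way to convince yourself of the sign flip is the self-contained sketch below (synthetic data, not part of the recipe): the 'neg_mean_squared_error' scorer returns exactly the negated MSE.
# illustrative sketch: 'neg_mean_squared_error' is the negated MSE
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, get_scorer
rng = np.random.RandomState(7)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.randn(100) * 0.1
demo_model = Lasso(alpha=0.01).fit(X_demo, y_demo)
mse_demo = mean_squared_error(y_demo, demo_model.predict(X_demo))
neg_mse_demo = get_scorer('neg_mean_squared_error')(demo_model, X_demo, y_demo)
print(mse_demo, neg_mse_demo)  # neg_mse_demo == -mse_demo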
# import modules
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
# read data file from github
# dataframe: houseDf
gitFileURL = 'https://raw.githubusercontent.com/andrewgurung/data-repository/master/housing.csv'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
houseDf = pd.read_csv(gitFileURL, sep=r'\s+', names=cols)
# convert into numpy array for scikit-learn
houseArr = houseDf.values
# Let's split columns into the usual feature columns(X) and target column(Y)
# Y represents the target 'MEDV' column
# MEDV: median value of owner-occupied homes in $1000s
X = houseArr[:, 0:13]
Y = houseArr[:, 13]
# set k-fold count
folds = 10
# set seed to reproduce the same random data each time
seed = 7
# split data using KFold
# note: recent scikit-learn requires shuffle=True for random_state to take effect;
# shuffling changes the fold assignment, so the exact scores below may differ slightly
kfold = KFold(n_splits=folds, shuffle=True, random_state=seed)
# instantiate a regression model
model = Lasso()
# set scoring parameter to 'neg_mean_squared_error'
scoring = 'neg_mean_squared_error'
# call cross_val_score() to run cross validation
resultArr = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
# calculate mean of scores for all folds
mse = resultArr.mean()
# display Mean Squared Error
# the scorer negates the MSE, so the printed value is negative even though MSE itself can never be negative
print("Mean Squared Error: %.3f" % mse)
Mean Squared Error: -34.464
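As a follow-up, the short sketch below (reusing model, X, Y, and cols from above) shows why LASSO is also a "selection operator": after fitting on the full dataset, inspecting coef_ typically reveals some coefficients shrunk exactly to zero, effectively dropping those features from the model.
# illustrative follow-up: fit on the full data and inspect the learned coefficients
model.fit(X, Y)
for name, coef in zip(cols[:13], model.coef_):
    print("%8s: %.3f" % (name, coef))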