Quick Run Through for Ridge Regression from Scratch
Posted on Tue 31 January 2017 in Projects
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.preprocessing import normalize
import plotly.plotly as py
import plotly.graph_objs as go
This tutorial covers setting up the Ridge Regression and working through the optimization to determine the best weights using gradient descent. I plan to introduce this topic to my numerical class as we go through the linear regression chapter, and to let them use the Ridge Regression for handwriting recognition and image recognition as a fun exercise.
Data
The data used for this tutorial is from the UCI Machine Learning Repository. The dataset contains the median value of homes in Boston along with 13 features, and the goal is to predict the value of each home from those features. The columns in the dataset are as follows:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per \$10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in \$1000's
# load the Boston housing features and target
X = load_boston().data
y = load_boston().target
load_boston().feature_names
# scale each feature column to unit norm
X = normalize(X, axis = 0)
The Ridge Regression is a linear regression with an $\ell_2$ penalty. The $\ell_2$ penalty lets us deal with overfitting by penalizing large values of $\beta$ in our model. First, we will set up a cost/energy function that describes the "cost" to us, that is, how much error our model has with the parameters $\alpha$ and $\beta$.
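Concretely, matching the E function implemented below, the energy function is $$ E(\alpha, \beta) = \sum_{i=1}^{n} \left( y_i - \alpha - x_i^\top \beta \right)^2 + \lambda \lVert \beta \rVert_2^2 $$ where $x_i$ is the $i$-th row of $X$, $n$ is the number of samples, and $\lambda$ controls the strength of the penalty.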
The goal is to minimize this function by determining the best weights $\alpha$ and $\beta$. Since this energy function is convex, gradient descent will work just fine for this model.
Gradient descent is simply an iterative method that updates each parameter by taking a step opposite to the direction of the gradient. This lets us step "downhill" until we reach the minimum. First, we find the gradient with respect to the parameters $\alpha$ and $\beta_j$.
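Written out (these are the expressions computed as grad_alpha and grad_beta in the code below), the gradients are $$ \frac{\partial E}{\partial \alpha} = -2 \sum_{i=1}^{n} \left( y_i - \alpha - x_i^\top \beta \right) \\ \frac{\partial E}{\partial \beta_j} = -2 \sum_{i=1}^{n} x_{ij} \left( y_i - \alpha - x_i^\top \beta \right) + 2 \lambda \beta_j $$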
Now that we have the gradients, we simply update each parameter: $$ \alpha \rightarrow \alpha - \eta \frac{\partial E}{\partial \alpha} \\ \beta_j \rightarrow \beta_j - \eta \frac{\partial E}{\partial \beta_j} $$ where $\eta$ is the learning rate of the descent.
def E(X, y, alpha, beta, lam):
    # ridge energy: squared residuals plus the l2 penalty on beta
    return np.sum(pow(y - np.repeat(alpha, len(X)) - np.dot(X, beta), 2)) + lam * np.linalg.norm(beta)**2
def ridge(X, y, lam):
    alpha = 0.0
    beta = np.zeros(X.shape[1])
    learning_rate = 0.001
    max_iter = 10000
    E_new = []
    for i in range(max_iter):
        E_old = E(X, y, alpha, beta, lam)
        # current predictions and residuals
        h = np.repeat(alpha, len(X)) + np.dot(X, beta)
        loss = y - h
        # gradient step for the intercept
        grad_alpha = -2.0 * np.sum(loss)
        alpha = alpha - learning_rate * grad_alpha
        # gradient step for the coefficients (includes the l2 penalty term)
        grad_beta = -2.0 * np.dot(X.T, loss) + 2.0 * lam * beta
        beta = beta - learning_rate * grad_beta
        E_new.append(E(X, y, alpha, beta, lam))
        # stop once the energy function has converged
        if abs(E_old - E_new[i]) < 1e-5:
            print('iteration:', i)
            break
    return alpha, beta, E_new
# fit with lambda = 1.0; E_hist records the energy at each iteration
alpha, beta, E_hist = ridge(X, y, 1.0)
alpha
beta
# residual sum of squares for our fit
np.sum((y - np.repeat(alpha, len(X)) - np.dot(X, beta))**2)
trace = go.Scatter(
    x = list(range(len(E_hist))),
    y = E_hist
)
layout = go.Layout(
    title = 'Energy Function Over Iterations',
    xaxis = dict(title = 'iteration'),
    yaxis = dict(title = 'Energy Function')
)
data = [trace]
figure = go.Figure(data = data, layout = layout)
py.iplot(figure)
From the plot above, we can see that the energy function decreases as gradient descent iterates, which is exactly what we want when minimizing a function.
Now, to make sure the gradient descent found the parameters correctly, I will compare our result against scikit-learn's Ridge Regression model.
Sklearn Ridge Model
# fit scikit-learn's Ridge with the same penalty strength (its alpha plays the role of our lam)
ridge_model = Ridge(alpha = 1.0)
ridge_model.fit(X, y)
beta_r = ridge_model.coef_
alpha_r = ridge_model.intercept_
ridge_model.n_iter_
alpha_r
beta_r
# residual sum of squares for scikit-learn's fit
sum((y - ridge_model.predict(X))**2)
Conclusion
Our code and scikit-learn's model give the same results, which is a good indicator that we are on the right track and that the gradient descent found the correct parameters.
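As a quick numerical sanity check (this snippet is an addition, and the tolerance is an arbitrary choice that depends on how tightly the gradient descent converged), we can compare the two sets of parameters directly:
# compare our gradient-descent estimates with scikit-learn's (atol chosen loosely)
print(np.allclose(alpha, alpha_r, atol=1e-2))
print(np.allclose(beta, beta_r, atol=1e-2))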