Quick Run Through for Ridge Regression from Scratch

Posted on Tue 31 January 2017 in Projects

In [1]:
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.preprocessing import normalize
import numpy as np
from copy import copy

import plotly.plotly as py
from plotly.graph_objs import *
import plotly.graph_objs as go

This tutorial covers setting up Ridge Regression and working through the optimization to determine the best weights using gradient descent. I plan to introduce this topic to my numerical class as we go through the linear regression chapter and let them use Ridge Regression for handwriting recognition and image recognition as a fun exercise.

Data

The data used for this tutorial is the Boston housing dataset from the UCI Machine Learning Repository. The dataset contains the median value of homes in Boston along with 13 features, and the goal is to predict the value of each home given those 13 features. The columns in the dataset are as follows:

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per \$10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV: Median value of owner-occupied homes in \$1000's (this is the target we predict)
In [2]:
# Load the Boston housing data, then normalize each feature column
boston = load_boston()
X = boston.data
y = boston.target

boston.feature_names

X = normalize(X, axis = 0)

Ridge Regression is linear regression with an $\ell_2$ penalty. The $\ell_2$ penalty helps us deal with overfitting by penalizing large values of $\beta$ in our model. First, we will set up a cost/energy function that describes the "cost" to us: how much error our model has with the parameters $\alpha$ and $\beta$.

$$ \displaystyle E(\alpha, \beta) = \sum_{i = 1}^{n} (y_i - \alpha - \langle \beta, x_i \rangle)^2 + \lambda \|\beta\|^2 $$

The goal is to minimize this function by finding the best weights $\alpha$ and $\beta$. Since this energy function is convex, gradient descent will work just fine for this model.
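As an aside (not part of the original walkthrough): because the intercept $\alpha$ is not penalized, this energy function also has a closed-form minimizer. Setting $\partial E / \partial \alpha = 0$ gives $\alpha = \bar{y} - \langle \beta, \bar{x} \rangle$, and after centering the data the coefficients satisfy $\beta = (X_c^T X_c + \lambda I)^{-1} X_c^T y_c$. The sketch below is only meant as an independent check on the gradient-descent result; the function name ridge_closed_form is mine, and it assumes X and y are the normalized data and targets loaded above.

def ridge_closed_form(X, y, lam):
    # Center the data so the unpenalized intercept drops out of the problem
    x_mean = X.mean(axis = 0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # Solve (Xc^T Xc + lambda * I) beta = Xc^T yc for beta
    d = X.shape[1]
    beta = np.linalg.solve(np.dot(Xc.T, Xc) + lam * np.eye(d), np.dot(Xc.T, yc))
    alpha = y_mean - np.dot(x_mean, beta)
    return alpha, beta

# alpha_cf, beta_cf = ridge_closed_form(X, y, 1.0)  # should roughly match the values found below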

Gradient descent is simply an iterative method that updates each parameter by taking a step opposite to the direction of the gradient. This lets us step "downhill" toward the minimum. First, we find the gradient with respect to the parameters $\alpha$ and $\beta_j$:

$$ \begin{align} \frac{\partial E}{\partial \alpha} &= -2\sum_{i = 1}^{n} \left(y_i - \alpha - \langle \beta, x_i \rangle\right) \\ \frac{\partial E}{\partial \beta_j} &= -2 \sum_{i = 1}^{n} \left(y_i - \alpha - \langle \beta, x_i \rangle\right)x_{ij} + 2\lambda \beta_j \\ \nabla_{\beta} E &= -2 X^{T}(\vec{y} - \alpha\vec{1} - X \vec{\beta}) + 2 \lambda \vec{\beta} \end{align} $$

Now that we have the gradients, we simply update each parameter: $$ \begin{align} \alpha &\rightarrow \alpha - \eta \frac{\partial E}{\partial \alpha} \\ \beta_j &\rightarrow \beta_j - \eta \frac{\partial E}{\partial \beta_j} \end{align} $$ where $\eta$ is the learning rate of the descent. We repeat these updates until the change in the energy function falls below a small tolerance.
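Before coding up the full descent, it is worth sanity-checking the derived gradients against a finite-difference approximation, which catches sign and indexing mistakes early. The sketch below is my own addition (the names energy, analytic_grads, and check_gradient are only illustrative); it re-implements the same cost function and gradients as above and compares them at a random point.

def energy(X, y, alpha, beta, lam):
    # E(alpha, beta): squared error plus the l-2 penalty on beta
    resid = y - alpha - np.dot(X, beta)
    return np.sum(resid**2) + lam * np.dot(beta, beta)

def analytic_grads(X, y, alpha, beta, lam):
    # The gradients derived above
    resid = y - alpha - np.dot(X, beta)
    grad_alpha = -2.0 * np.sum(resid)
    grad_beta = -2.0 * np.dot(X.T, resid) + 2.0 * lam * beta
    return grad_alpha, grad_beta

def check_gradient(X, y, lam = 1.0, eps = 1e-6, seed = 0):
    rng = np.random.RandomState(seed)
    alpha = rng.randn()
    beta = rng.randn(X.shape[1])
    grad_alpha, grad_beta = analytic_grads(X, y, alpha, beta, lam)

    # Central finite difference for alpha
    num_alpha = (energy(X, y, alpha + eps, beta, lam) -
                 energy(X, y, alpha - eps, beta, lam)) / (2 * eps)

    # Central finite difference for each component of beta
    num_beta = np.zeros_like(beta)
    for j in range(len(beta)):
        b_plus, b_minus = beta.copy(), beta.copy()
        b_plus[j] += eps
        b_minus[j] -= eps
        num_beta[j] = (energy(X, y, alpha, b_plus, lam) -
                       energy(X, y, alpha, b_minus, lam)) / (2 * eps)

    # Both differences should be small relative to the gradient magnitudes
    return abs(grad_alpha - num_alpha), np.max(np.abs(grad_beta - num_beta))

# check_gradient(X, y)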

In [3]:
def E(X, y, alpha, beta, lam):
    # Energy/cost function: squared error plus the l-2 penalty on beta
    return np.sum(pow(y - np.repeat(alpha, len(X)) - np.dot(X, beta), 2)) + lam * np.linalg.norm(beta)**2

def ridge(X, y, lam):
    alpha = 0
    beta = np.zeros(X.shape[1])
    learning_rate = 0.001
    max_iter = 10000
    E_new = []

    for i in range(max_iter):
        E_old = E(X, y, alpha, beta, lam)

        # Current predictions and residuals
        h = np.repeat(alpha, len(X)) + np.dot(X, beta)
        loss = y - h

        # Gradient step for the intercept
        grad_alpha = -2.0 * np.sum(loss)
        alpha = alpha - learning_rate * grad_alpha

        # Gradient step for the coefficients (includes the l-2 penalty term)
        grad_beta = -2 * np.dot(X.T, loss) + 2 * lam * beta
        beta = beta - learning_rate * grad_beta

        # Stop once the energy function has essentially stopped decreasing
        E_new.append(E(X, y, alpha, beta, lam))
        if abs(E_old - E_new[i]) < 1e-5:
            print('iteration: ', i)
            break
    return alpha, beta, E_new
In [4]:
alpha, beta, E = ridge(X,y,1.0)
iteration:  3009
In [5]:
alpha
Out[5]:
26.087474894848473
In [6]:
beta
Out[6]:
array([-23.67539041,  23.35184911, -21.72476754,  17.63173486,
        -6.75024633,  11.7497408 ,  -9.88680766,  -0.24042206,
       -14.83054516, -14.39131326,  -7.63006426,   8.24639461, -44.82654007])
In [7]:
np.sum((y - np.repeat(alpha, len(X)) - np.dot(X, beta))**2)
Out[7]:
27416.928725946458
In [8]:
trace = go.Scatter(
    x = list(range(len(E))),
    y = E
)
layout = go.Layout(
    title = 'Energy Function Over Iterations',
    xaxis = dict(title = 'iteration'),
    yaxis = dict(title = 'Energy Function')
)
data = [trace]
figure = go.Figure(data = data, layout = layout)
py.iplot(figure)
Out[8]:

From the plot above, we can see that the energy function decreases as gradient descent iterates, which is exactly what we want when minimizing a function.

Now, to make sure the gradient descent was implemented correctly and found the right parameters, I will compare our result against scikit-learn's Ridge Regression model.

Sklearn Ridge Model

In [9]:
# sklearn's Ridge minimizes the same objective; its alpha plays the role of our lambda
ridge_model = Ridge(alpha = 1.0)
ridge_model.fit(X, y)

beta_r = ridge_model.coef_

alpha_r = ridge_model.intercept_

ridge_model.n_iter_
In [10]:
alpha_r
Out[10]:
26.089811478652788
In [11]:
beta_r
Out[11]:
array([-23.67274775,  23.35236303, -21.72914621,  17.62952319,
        -6.75423947,  11.76105965,  -9.88461594,  -0.26512071,
       -14.82209149, -14.3965544 ,  -7.64299595,   8.24769453, -44.85749936])
In [12]:
sum((y - ridge_model.predict(X))**2)
Out[12]:
27413.730425497597

Conclusion

Our implementation and scikit-learn's model give essentially the same parameters and residual error. This is a good indicator that we set up the gradient descent correctly and found the right parameters.
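For a more explicit check than eyeballing the printed values, the two solutions can also be compared numerically. This snippet is a suggested addition (not part of the original notebook); it assumes alpha, beta, alpha_r, and beta_r are still defined from the cells above, and the small remaining gap comes from stopping gradient descent at a finite tolerance rather than running it to full convergence.

# Differences between the hand-rolled solution and sklearn's Ridge
print('intercept difference:', abs(alpha - alpha_r))
print('max coefficient difference:', np.max(np.abs(beta - beta_r)))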