Quick Run Through for Ridge Regression from Scratch

Posted on Tue 31 January 2017 in Projects

In [1]:
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.preprocessing import normalize
import numpy as np
from copy import copy

import plotly.plotly as py
from plotly.graph_objs import *
import plotly.graph_objs as go

This tutorial covers setting up Ridge Regression and working through the optimization to determine the best weights using gradient descent. I plan to introduce this topic to my numerical class as we go through the linear regression chapter and let them use Ridge Regression for handwriting recognition and image recognition as a fun exercise.

Data

The data used for this tutorial is the Boston housing dataset from the UCI Machine Learning Repository. The dataset contains the median value of homes in Boston along with 13 features, and the goal is to predict the value of each home given those 13 features. The columns in the dataset are as follows:

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per \$10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV: Median value of owner-occupied homes in \$1000's (this is the target we predict)
In [2]:
# Load the Boston housing data, then normalize each feature column
boston = load_boston()
X = boston.data
y = boston.target

boston.feature_names

X = normalize(X, axis = 0)

Ridge Regression is linear regression with an $\ell_2$ penalty. The $\ell_2$ penalty helps us deal with overfitting by penalizing large values of $\beta$ in our model. First, we will set up a cost/energy function that describes the "cost" to us: how much error our model has with the parameters $\alpha$ and $\beta$.

$$ \displaystyle E(\alpha, \beta) = \sum_{i = 1}^{n} (y_i - \alpha - \langle \beta, x_i \rangle)^2 + \lambda \|\beta\|^2 $$

The goal is to minimize this function by finding the best weights $\alpha$ and $\beta$. Since this energy function is convex, gradient descent will work just fine for this model.
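As an aside (not part of the original walkthrough): because the intercept $\alpha$ is not penalized, this energy function also has a closed-form minimizer. Setting $\partial E / \partial \alpha = 0$ gives $\alpha = \bar{y} - \langle \beta, \bar{x} \rangle$, and after centering the data the coefficients satisfy $\beta = (X_c^T X_c + \lambda I)^{-1} X_c^T y_c$. The sketch below is only meant as an independent check on the gradient-descent result; the function name ridge_closed_form is mine, and it assumes X and y are the normalized data and targets loaded above.

def ridge_closed_form(X, y, lam):
    # Center the data so the unpenalized intercept drops out of the problem
    x_mean = X.mean(axis = 0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # Solve (Xc^T Xc + lambda * I) beta = Xc^T yc for beta
    d = X.shape[1]
    beta = np.linalg.solve(np.dot(Xc.T, Xc) + lam * np.eye(d), np.dot(Xc.T, yc))
    alpha = y_mean - np.dot(x_mean, beta)
    return alpha, beta

# alpha_cf, beta_cf = ridge_closed_form(X, y, 1.0)  # should roughly match the values found below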

Gradient descent is simply an iterative method that updates each parameter by taking a step opposite to the direction of the gradient. This lets us step "downhill" toward the minimum. First, we find the gradient with respect to the parameters $\alpha$ and $\beta_j$:

$$ \begin{align} \frac{\partial E}{\partial \alpha} &= -2\sum_{i = 1}^{n} \left(y_i - \alpha - \langle \beta, x_i \rangle\right) \\ \frac{\partial E}{\partial \beta_j} &= -2 \sum_{i = 1}^{n} \left(y_i - \alpha - \langle \beta, x_i \rangle\right)x_{ij} + 2\lambda \beta_j \\ \nabla_{\beta} E &= -2 X^{T}(\vec{y} - \alpha\vec{1} - X \vec{\beta}) + 2 \lambda \vec{\beta} \end{align} $$

Now that we have the gradients, we simply update each parameter: $$ \begin{align} \alpha &\rightarrow \alpha - \eta \frac{\partial E}{\partial \alpha} \\ \beta_j &\rightarrow \beta_j - \eta \frac{\partial E}{\partial \beta_j} \end{align} $$ where $\eta$ is the learning rate of the descent. We repeat these updates until the change in the energy function falls below a small tolerance.
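Before coding up the full descent, it is worth sanity-checking the derived gradients against a finite-difference approximation, which catches sign and indexing mistakes early. The sketch below is my own addition (the names energy, analytic_grads, and check_gradient are only illustrative); it re-implements the same cost function and gradients as above and compares them at a random point.

def energy(X, y, alpha, beta, lam):
    # E(alpha, beta): squared error plus the l-2 penalty on beta
    resid = y - alpha - np.dot(X, beta)
    return np.sum(resid**2) + lam * np.dot(beta, beta)

def analytic_grads(X, y, alpha, beta, lam):
    # The gradients derived above
    resid = y - alpha - np.dot(X, beta)
    grad_alpha = -2.0 * np.sum(resid)
    grad_beta = -2.0 * np.dot(X.T, resid) + 2.0 * lam * beta
    return grad_alpha, grad_beta

def check_gradient(X, y, lam = 1.0, eps = 1e-6, seed = 0):
    rng = np.random.RandomState(seed)
    alpha = rng.randn()
    beta = rng.randn(X.shape[1])
    grad_alpha, grad_beta = analytic_grads(X, y, alpha, beta, lam)

    # Central finite difference for alpha
    num_alpha = (energy(X, y, alpha + eps, beta, lam) -
                 energy(X, y, alpha - eps, beta, lam)) / (2 * eps)

    # Central finite difference for each component of beta
    num_beta = np.zeros_like(beta)
    for j in range(len(beta)):
        b_plus, b_minus = beta.copy(), beta.copy()
        b_plus[j] += eps
        b_minus[j] -= eps
        num_beta[j] = (energy(X, y, alpha, b_plus, lam) -
                       energy(X, y, alpha, b_minus, lam)) / (2 * eps)

    # Both differences should be small relative to the gradient magnitudes
    return abs(grad_alpha - num_alpha), np.max(np.abs(grad_beta - num_beta))

# check_gradient(X, y)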

In [3]:
def E(X, y, alpha, beta, lam):
    # Energy/cost function: squared error plus the l-2 penalty on beta
    return np.sum(pow(y - np.repeat(alpha, len(X)) - np.dot(X, beta), 2)) + lam * np.linalg.norm(beta)**2

def ridge(X, y, lam):
    alpha = 0
    beta = np.zeros(X.shape[1])
    learning_rate = 0.001
    max_iter = 10000
    E_new = []

    for i in range(max_iter):
        E_old = E(X, y, alpha, beta, lam)

        # Current predictions and residuals
        h = np.repeat(alpha, len(X)) + np.dot(X, beta)
        loss = y - h

        # Gradient step for the intercept
        grad_alpha = -2.0 * np.sum(loss)
        alpha = alpha - learning_rate * grad_alpha

        # Gradient step for the coefficients (includes the l-2 penalty term)
        grad_beta = -2 * np.dot(X.T, loss) + 2 * lam * beta
        beta = beta - learning_rate * grad_beta

        # Stop once the energy function has essentially stopped decreasing
        E_new.append(E(X, y, alpha, beta, lam))
        if abs(E_old - E_new[i]) < 1e-5:
            print('iteration: ', i)
            break
    return alpha, beta, E_new
In [4]:
alpha, beta, E = ridge(X,y,1.0)
iteration:  3009
In [5]:
alpha
Out[5]:
26.087474894848473
In [6]:
beta
Out[6]:
array([-23.67539041,  23.35184911, -21.72476754,  17.63173486,
        -6.75024633,  11.7497408 ,  -9.88680766,  -0.24042206,
       -14.83054516, -14.39131326,  -7.63006426,   8.24639461, -44.82654007])
In [7]:
np.sum((y - np.repeat(alpha, len(X)) - np.dot(X, beta))**2)
Out[7]:
27416.928725946458
In [8]:
trace = go.Scatter(
    x = list(range(len(E))),
    y = E
)
layout = go.Layout(
    title = 'Energy Function Over Iterations',
    xaxis = dict(title = 'iteration'),
    yaxis = dict(title = 'Energy Function')
)
data = [trace]
figure = go.Figure(data = data, layout = layout)
py.iplot(figure)
Out[8]:

From the plot above, we can see that the energy function decreases as gradient descent iterates, which is exactly what we want when minimizing a function.

Now, to make sure the gradient descent was implemented correctly and found the right parameters, I will compare our result against scikit-learn's Ridge Regression model.

Sklearn Ridge Model

In [9]:
# sklearn's Ridge minimizes the same objective; its alpha plays the role of our lambda
ridge_model = Ridge(alpha = 1.0)
ridge_model.fit(X, y)

beta_r = ridge_model.coef_

alpha_r = ridge_model.intercept_

ridge_model.n_iter_
In [10]:
alpha_r
Out[10]:
26.089811478652788
In [11]:
beta_r
Out[11]:
array([-23.67274775,  23.35236303, -21.72914621,  17.62952319,
        -6.75423947,  11.76105965,  -9.88461594,  -0.26512071,
       -14.82209149, -14.3965544 ,  -7.64299595,   8.24769453, -44.85749936])
In [12]:
sum((y - ridge_model.predict(X))**2)
Out[12]:
27413.730425497597

Conclusion

Our implementation and scikit-learn's model give essentially the same parameters and residual error. This is a good indicator that we set up the gradient descent correctly and found the right parameters.
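For a more explicit check than eyeballing the printed values, the two solutions can also be compared numerically. This snippet is a suggested addition (not part of the original notebook); it assumes alpha, beta, alpha_r, and beta_r are still defined from the cells above, and the small remaining gap comes from stopping gradient descent at a finite tolerance rather than running it to full convergence.

# Differences between the hand-rolled solution and sklearn's Ridge
print('intercept difference:', abs(alpha - alpha_r))
print('max coefficient difference:', np.max(np.abs(beta - beta_r)))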