Collaborative Filtering + Matrix Factorization Recommendation Systems
Posted on Wed 11 January 2017 in Projects
Motivation
It has been a year or two since I last did any analysis involving recommendation systems, so I wanted to shake off the rust and build a recommendation system for movies. It is good practice, and since I am on break with some free time, it doesn't hurt to have a working recommender that gives me a good list of movies to watch.
Additionally, my girlfriend and I never know what to watch after we finish our Netflix and Hulu queues, so I can use this model to help with that problem.
import graphlab as gl
import pandas as pd
import numpy as np
Dataset
The dataset was downloaded from MovieLens via http://grouplens.org/datasets/movielens/. It is the most recent dataset published on the site, updated in October 2016. There are 24,404,096 total ratings by 259,137 different users in this dataset for 40,110 movies.
I will be using GraphLab Create to build the models for this analysis. GraphLab Create is a machine learning tool created by Turi, formerly known as Dato.
# ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')
ratings = gl.SFrame.read_csv('ratings.csv')
# movies = gl.SFrame.read_csv('movies.csv')
ratings.head(4)
movies.head(4)
print ratings.shape, movies.shape
len(ratings['userId'].unique())
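# NOTE: str() of a whole pandas column is one long string, so every row receives the
# same placeholder value; this only gives `ratings` a string-typed 'genres' column.
ratings['genres'] = str(movies['genres'])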
Collaborative Filtering
I will use two algorithms from the collaborative filtering family: Cosine Similarity and Non-negative Matrix Factorization. Let's take a quick look at the two algorithms before we build the models.
Cosine similarity takes the matrix of users and their ratings and compares the orientation of two rating vectors, either user to user or item to item, as a measure of how similar they are. Let's take an example user-movie rating matrix:
| | movie 1 | movie 2 | movie 3 | movie 4 |
|---|---|---|---|---|
| user 1 | 3 | 5 | | 1 |
| user 2 | | 1 | 2 | |
| user 3 | | 5 | 4 | 1 |
Cosine Similarity:
The cosine similarity of two vectors $\pmb x$ and $\pmb y$ is defined as:
$$\displaystyle \cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}, \quad ||\pmb x|| = \sqrt{\sum_{i=1}^{n} x_i^2}$$
The larger $\displaystyle \cos(\pmb x, \pmb y)$ is, the closer the directions of the two vectors are, implying that the two vectors are more similar. The smaller $\displaystyle \cos(\pmb x, \pmb y)$ is, the less similar the vectors are. To measure similarity between users, we take the values in their rows and calculate the cosine similarity. To measure similarity between movies, we do the same but take the values in each movie's column. In either case, we fill in the missing values with a 0.
Let's say we want to measure how similar user 1 is to user 2 compared to user 1 to user 3.
User 1 - User 2:
Using the formula above, we will let $\pmb x$ = (3, 5, 0 , 1) and $\pmb y$ = (0, 1, 2, 0), then:
$\displaystyle \cos (\pmb x, \pmb y) = \frac{3 \times 0 + 5 \times 1 + 0 \times 2 + 1 \times 0}{\sqrt{3^2 + 5^2 + 0^2 + 1^2} \times \sqrt{0^2 + 1^2 + 2^2 + 0^2}}$ = 0.3780
User 1 - User 3:
Let $\pmb x$ = (3, 5, 0 , 1) and $\pmb y$ = (0, 5, 4, 1), then:
$\displaystyle \cos (\pmb x, \pmb y) = \frac{3 \times 0 + 5 \times 5 + 0 \times 4 + 1 \times 1}{\sqrt{3^2 + 5^2 + 0^2 + 1^2} \times \sqrt{0^2 + 5^2 + 4^2 + 1^2}}$ = 0.6781
Comparing the similarity values, we can see that User 1 is more similar to User 3 than to User 2. We can also observe this from the table/matrix. User 1 and User 3 gave exactly the same ratings for movie 2 and movie 4, while User 1 and User 2 gave opposite ratings for movie 2. We can use this result to recommend User 1 a movie or multiple movies that User 3 rated highly.
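As a quick numerical check of the worked example, here is a minimal NumPy sketch that computes both similarities with missing ratings filled in as 0. The helper name `cosine_similarity` is my own for illustration and is not part of GraphLab.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors (missing ratings as 0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Rows of the example table, with missing ratings filled in as 0.
user1 = [3, 5, 0, 1]
user2 = [0, 1, 2, 0]
user3 = [0, 5, 4, 1]

sim_12 = cosine_similarity(user1, user2)
sim_13 = cosine_similarity(user1, user3)
print(round(sim_12, 3), round(sim_13, 3))  # 0.378 0.678
```

Because every rating is non-negative, the result always falls between 0 and 1, which makes a handy sanity check on hand calculations.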
Above is a user-based recommender. We can go through the same process to build an item-based recommender: find the items/movies that are closest in similarity and recommend users the movies that are most similar to the ones they rated highly.
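The item-based version can be sketched the same way by comparing columns instead of rows. The snippet below is only an illustration on the small example matrix, assuming missing ratings are 0; it is not how GraphLab computes item similarities internally.

```python
import numpy as np

# Example ratings matrix: rows are users, columns are movies, 0 = not rated.
ratings_matrix = np.array([
    [3, 5, 0, 1],
    [0, 1, 2, 0],
    [0, 5, 4, 1],
], dtype=float)

# Scale every movie column to unit length; one matrix product then gives
# the cosine similarity between every pair of movies.
column_norms = np.linalg.norm(ratings_matrix, axis=0)
unit_columns = ratings_matrix / column_norms
item_similarity = unit_columns.T.dot(unit_columns)

# Find the movie most similar to movie 2 (column index 1).
scores = item_similarity[1].copy()
scores[1] = -np.inf  # ignore the movie's similarity with itself
most_similar = int(np.argmax(scores))
print('movie %d' % (most_similar + 1))  # movie 4
```

A user who rated movie 2 highly would then be pointed at movie 4 first.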
Non-Negative Matrix Factorization
Consider the same matrix/table:
| | movie 1 | movie 2 | movie 3 | movie 4 |
|---|---|---|---|---|
| user 1 | 3 | 5 | | 1 |
| user 2 | | 1 | 2 | |
| user 3 | | 5 | 4 | 1 |
Let A be the matrix of values from the table
$ A = \begin{bmatrix} 3 & 5 & & 1 \\ & 1 & 2 & \\ & 5 & 4 & 1 \end{bmatrix} $
The fundamental idea of this process is just what its name states: it takes the data matrix and factors it into a product of two matrices (with non-negative entries, in the non-negative variant) that best approximates the data matrix A. Here, with two latent factors, that is a 3 × 2 matrix times a 2 × 4 matrix:
$$ A = \begin{bmatrix} 3 & 5 & & 1 \\ & 1 & 2 & \\ & 5 & 4 & 1 \end{bmatrix} \approx \begin{bmatrix} & \\ & \\ & \end{bmatrix} \begin{bmatrix} & & & \\ & & & \end{bmatrix} $$
The goal is that if we find the two matrices whose product best approximates A, then we can use the values of that product as approximations for the missing values in A. We can then use these approximations to decide whether or not to recommend a movie to a user. The model GraphLab uses is fundamentally the process described above. However, GraphLab adds bias and weight terms to account for user bias and item bias.
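To make the idea concrete, here is a minimal sketch that factors the example matrix with plain stochastic gradient descent over the observed entries only. It is not GraphLab's implementation: there are no bias terms and no non-negativity constraint, and the number of factors, learning rate, regularization, and epoch count are arbitrary values I picked for illustration.

```python
import numpy as np

# Example matrix; np.nan marks a missing rating.
A = np.array([
    [3, 5, np.nan, 1],
    [np.nan, 1, 2, np.nan],
    [np.nan, 5, 4, 1],
])
observed = [(u, m) for u in range(A.shape[0]) for m in range(A.shape[1])
            if not np.isnan(A[u, m])]

rng = np.random.RandomState(0)
k = 2                                              # number of latent factors (assumed)
W = rng.uniform(0.1, 1.0, size=(A.shape[0], k))    # user factors, 3 x 2
H = rng.uniform(0.1, 1.0, size=(k, A.shape[1]))    # movie factors, 2 x 4

learning_rate = 0.01
regularization = 0.01
for epoch in range(5000):
    for u, m in observed:
        error = A[u, m] - W[u].dot(H[:, m])
        w_old = W[u].copy()
        # Gradient step on the squared error of this single rating.
        W[u] += learning_rate * (error * H[:, m] - regularization * W[u])
        H[:, m] += learning_rate * (error * w_old - regularization * H[:, m])

approx = W.dot(H)  # every cell is filled in, including the missing ones
observed_rmse = np.sqrt(np.mean([(A[u, m] - approx[u, m]) ** 2 for u, m in observed]))
print(np.round(approx, 2))
print('RMSE on observed ratings:', observed_rmse)
```

The entries of `approx` that correspond to blanks in A are the predicted ratings we would use to decide what to recommend.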
We'll get started now. First I will split the data into a train set and a test set, then I will add my movies and ratings along with my girlfriend's so we can get recommendations from our models.
# Create train and test set
train, test = gl.recommender.util.random_split_by_user(ratings, user_id = 'userId', item_id = 'movieId')
cos_rec = gl.item_similarity_recommender.create(train, user_id='userId',
item_id='movieId', target='rating',
similarity_type='cosine', verbose = False)
# Pick a user ID that is not already in the data (does not rely on row order)
my_id = ratings['userId'].max() + 1
def search_title(movie_title):
    # Print every title in `movies` that contains the search string
    x = np.array([1 if movie_title in movie else 0 for movie in movies['title']])
    index = np.where(x == 1)[0].tolist()
    print movies['title'][index]
search_title('Avengers')
my_movies = ['Hitch (2005)', 'Iron Man (2008)', 'Everybody\'s Fine (2009)', 'Horrible Bosses 2 (2014)',
'Other Woman, The (2014)', 'Think Like a Man Too (2014)',
'Seeking a Friend for the End of the World (2012)', 'Iron Man 2 (2010)', 'Iron Man 3 (2013)',
'Avengers, The (2012)', 'Avengers: Age of Ultron (2015)']
my_ratings = [5., 5, 2, 3, 2, 3, 3, 5, 4, 5, 5]
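# Look up each title in list order so that my_ratings lines up with the movie IDs
# (isin() alone returns rows in the order of `movies`, not of `my_movies`).
my_lookup = movies.drop_duplicates('title').set_index('title').loc[my_movies]
my_genres = list(my_lookup.genres)
my_matrix = pd.DataFrame({'userId':[my_id] * len(my_ratings),
                          'rating':my_ratings,
                          'movieId':list(my_lookup.movieId),
                          'genres': my_genres,
                          'timestamp':[1484126312] * len(my_ratings)})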
my_matrix = gl.SFrame(my_matrix)
myrec_movies = cos_rec.recommend(users = [my_id],
new_observation_data = my_matrix, k = 30, diversity = 1)
my_index = list(myrec_movies['movieId'])
print 'List of Recommended Movies from Cosine Similarity:'
list(myrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
Matrix Factorization
nmf_model = gl.factorization_recommender.create(train, user_id = 'userId', item_id = 'movieId',
target = 'rating', verbose = False,
side_data_factorization = True
)
myrec_movies = nmf_model.recommend(users = [my_id],
new_observation_data = my_matrix, k = 30)
my_index = list(myrec_movies['movieId'])
print 'List of Recommended Movies from Matrix Factorization:'
list(myrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
Girlfriend's Recommendations
search_title('27 Dresses')
gf_movies = ['Hitch (2005)', 'Chef (2014)', 'Juno (2007)', 'How to Lose a Guy in 10 Days (2003)',
'Devil Wears Prada, The (2006)', 'Proposal, The (2009)',
'Flight (2012)', 'Avengers, The (2012)', 'Friends with Kids (2011)',
'Legally Blonde (2001)', 'Princess Diaries, The (2001)', 'Finding Nemo (2003)', 'Elf (2003)',
'Along Came Polly (2004)', 'Notebook, The (2004)', 'Ratatouille (2007)', 'Knocked Up (2007)',
'Iron Man (2008)', 'Up (2009)', 'Zombieland (2009)', 'Avatar (2009)', 'Thor (2011)',
'Horrible Bosses (2011)', 'Captain America: The Winter Soldier (2014)', 'Mean Girls (2004)', 'WALL·E (2008)',
'Leap Year (2010)','27 Dresses (2008)']
gf_ratings = [5., 4, 4, 4, 4, 5, 2, 5, 3, 4, 4, 5, 5, 3, 4, 3, 4, 4, 4, 1, 4, 3, 3, 4, 5, 4, 4, 4]
# Look up each title in list order so that gf_ratings lines up with the movie IDs
gf_lookup = movies.drop_duplicates('title').set_index('title').loc[gf_movies]
gf_genres = list(gf_lookup.genres)
gf_id = my_id + 1
gf_matrix = pd.DataFrame({'userId':[gf_id] * len(gf_ratings),
                          'rating':gf_ratings,
                          'movieId':list(gf_lookup.movieId),
                          'genres': gf_genres,
                          'timestamp':[1484126312] * len(gf_ratings)})
gf_matrix = gl.SFrame(gf_matrix)
gfrec_movies = cos_rec.recommend(users = [gf_id], new_observation_data = gf_matrix, k = 30)
gf_index = list(gfrec_movies['movieId'])
print 'List of Recommended Movies from Cosine Similarity:'
list(gfrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
gfrec_movies = nmf_model.recommend(users = [gf_id], new_observation_data = gf_matrix, k = 30)
gf_index = list(gfrec_movies['movieId'])
print 'List of Recommended Movies from Matrix Factorization:'
list(gfrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
gfrec_movies = nmf_model.recommend(users = [gf_id + 2000], new_observation_data = gf_matrix, k = 30)
gf_index = list(gfrec_movies['movieId'])
print 'List of Recommended Movies from Matrix Factorization:'
list(gfrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
There seems to be a slight problem with the matrix factorization model. It recommended almost exactly the same movies for my girlfriend and me. I also ran the recommendation again for a user who is not in the training data, and the recommendations are almost the same. Looking into their GitHub, it seems that the recommendation only works if the user was originally in the training set. They state that recommendations for new users with observations work best with item similarity models such as the cosine similarity model we used. My theory is that the recommender defaults to the most popular movies, which is what happens when we input a new user who is not in the dataset. Thus, the movies that were recommended to my girlfriend and me were probably the most popular movies in the set.
I am still curious which movies the matrix model would recommend to us, so I will train another factorization model on a dataset that has our movies and ratings appended to the original dataset.
Train new matrix model
new_ratings = ratings.append(gf_matrix)
new_ratings = new_ratings.append(my_matrix)
train2, test2 = gl.recommender.util.random_split_by_user(new_ratings, user_id = 'userId', item_id = 'movieId')
nmf_model2 = gl.factorization_recommender.create(train2, user_id = 'userId', item_id = 'movieId',
target = 'rating', verbose = False,
)
myrec_movies = nmf_model2.recommend(users = [my_id], k = 30)
my_index = list(myrec_movies['movieId'])
list(myrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
gfrec_movies = nmf_model2.recommend(users = [gf_id], k = 30)
gf_index = list(gfrec_movies['movieId'])
list(gfrec_movies.join(gl.SFrame(movies), on = 'movieId', how = 'left').sort('rank')['title'])
These recommendations are much better. We can tell that they are different and the model is not just suggesting the most popular movies as the prior model did.
Compare Models
Two kinds of metrics are commonly used to evaluate a recommender: the Root Mean Squared Error (RMSE), and precision and recall.
The root mean squared error measures how far off a typical prediction is. The cosine similarity recommender has an RMSE of 3.5. Using this recommender, the typical predicted rating would be off by about 3.5 stars. This is not a good RMSE for prediction purposes.
The factorization model, on the other hand, has an RMSE of 0.86 (first factorization model) and 0.83 (second factorization model). This is much better than the cosine similarity recommender, as we are only about 0.86 stars in error for the typical rating prediction.
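For reference, RMSE is the square root of the average squared difference between the actual and predicted ratings. The sketch below uses made-up ratings purely for illustration; the numbers are not taken from the MovieLens data or from either model.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Hypothetical held-out ratings and model predictions (made-up numbers).
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

example_rmse = rmse(actual_ratings, predicted_ratings)
print(example_rmse)  # 0.75
```

Squaring the errors before averaging means a few large misses are penalized more heavily than many small ones.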
If we are tasked to use RMSE as the evaluation metric, then the choice is as obvious as possible; the matrix factorization model is the model to use. However, RMSE is not the only metric that can be used. We can also use precision and recall as evaluation metrics.
For this problem, precision and recall can be thought of as follows:
- Precision:
- If our recommender suggested 5 movies and the user liked 3 of them, then the precision is 0.6.
- Recall:
- If the user likes 4 movies, and our recommender recommended 2 of those movies, then the recall is 0.5.
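The two examples above can be reproduced with a small helper. This is a sketch of the definitions only, with `k` standing for the number of movies recommended; the function name and the placeholder movie labels are mine, and GraphLab's own evaluation may differ in details such as how it decides which movies a user "liked".

```python
def precision_recall_at_k(recommended, liked, k):
    """Precision and recall of the top-k recommendations against liked items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(liked))
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(liked)) if liked else 0.0
    return precision, recall

# Precision example: 5 movies recommended, the user liked 3 of them.
precision, _ = precision_recall_at_k(['A', 'B', 'C', 'D', 'E'], ['A', 'C', 'E'], k=5)

# Recall example: the user likes 4 movies, 2 of which were recommended.
_, recall = precision_recall_at_k(['A', 'B', 'C'], ['A', 'B', 'X', 'Y'], k=3)

print(precision, recall)  # 0.6 0.5
```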
The tables below show the precision and recall of each model at each cutoff. The cutoff value is the number of movies we recommend to the user, i.e. if the cutoff is 2, then the recommender recommends 2 movies to the user.
If we look closer at the precision and recall tables, we can see how poorly the factorization model does in terms of precision and recall. The factorization model doesn't even reach the double digits in percentage for either precision or recall. On the other hand, the cosine similarity recommender's precision and recall are much better. For a cutoff of 10 movies, the cosine similarity recommender has a precision of 17.3% and recall of 13.9%.
print 'Cosine Similarity'
print 'RMSE:', cos_rec.evaluate_rmse(test, target = 'rating')['rmse_overall']
print cos_rec.evaluate_precision_recall(test)['precision_recall_overall']
print 'Factorization (First Model):'
print 'RMSE:', nmf_model.evaluate_rmse(test, target = 'rating')['rmse_overall']
print nmf_model.evaluate_precision_recall(test)['precision_recall_overall']
print 'Factorization (Second Model):'
print 'RMSE:', nmf_model2.evaluate_rmse(test2, target = 'rating')['rmse_overall']
print nmf_model2.evaluate_precision_recall(test2)['precision_recall_overall']
# Save Models
cos_rec.save('cosine_model')
nmf_model.save('matrix_fac_model')
nmf_model2.save('matrix_fac_us_model')
Which Model to Choose?
Choosing a particular model here is not obvious. While the factorization model dominates the cosine similarity recommender in terms of RMSE, the roles are flipped when we consider precision and recall. My take is that it depends on the question you are asking and the problem you are trying to solve. Take Netflix for example: if I am scrolling through a list of movies and wonder what my predicted rating would be for each of them, then the factorization model is the one to use. If all I want is a list of recommended movies shown to me when I sign in, then the cosine similarity model is the one to use. We could also use a combination of both methods to draw on the strengths of each model.
At the end of the day, I now have a recommender to help my girlfriend and me decide what movie to watch next. To make sure we will both enjoy our next movie, I will find the intersection of our two recommended lists. I am actually surprised Netflix and Hulu do not offer this as a feature for couples or families, especially now that they have profiles: predicted ratings for a movie viewed by multiple people, such as a couple or a family.
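Finding that intersection takes only a few lines. The sketch below assumes each person's recommendations are available as a ranked list of titles; ordering the shared titles by combined rank is my own choice rather than anything GraphLab provides, and the titles are placeholders.

```python
def shared_recommendations(first, second):
    """Titles present in both ranked lists, ordered by combined rank."""
    rank_in_second = {title: rank for rank, title in enumerate(second)}
    shared = [(rank + rank_in_second[title], title)
              for rank, title in enumerate(first) if title in rank_in_second]
    return [title for _, title in sorted(shared)]

# Placeholder ranked lists, best recommendation first.
my_list = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
gf_list = ['Movie D', 'Movie E', 'Movie A', 'Movie C']

watch_together = shared_recommendations(my_list, gf_list)
print(watch_together)  # ['Movie A', 'Movie D', 'Movie C']
```

With the real output, the two lists would come from the title columns of the joined recommendation tables for each of us.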