15.2 Relational Learning

15.2.2 Learning Hidden Properties: Collaborative Filtering

Consider the problem of, given some tuples of a relation, predicting whether other tuples are true. Here we provide an algorithm that does this, even if there are no other relations defined.

In a recommender system users are given personalized recommendations of items they may like, and may not be aware of. One technique for recommender systems is predict the rating of a user on an item from the ratings of similar users or similar items, by what is called collaborative filtering.

One of the tasks for recommender systems is the top-n task, where n is a positive integer, say 10, which is to present the user n items they have not rated, and to judge the recommendation by how much the user likes one of these items. One way to select the items is to estimate the rating of each item for a user and to present the n items with the highest predicted rating. An alternative is to take the diversity of the recommendations into account; it may be better to recommend very different items than to recommend similar items.

Example 15.15.

MovieLens (https://movielens.org/) is a movie recommendation system that acquires movie ratings from users. The rating is from 1 to 5 stars, where 5 stars is better. A data set for such a problem is of the form of Figure 15.3,

UserItemRatingTimestamp196242388125094918630238917177422237718788871162445128806069232534655891628467

Figure 15.3: Part of the MovieLens data set

where each user is given a unique number, each item (movie) is given a unique number, and the timestamp is the Unix standard of seconds since 1970-01-01 UTC. These data can be used in a recommendation system to make predictions of other movies a user might like.

Consider the problem of predicting the rating for an item by a user. This prediction can be used to give personalized recommendations: recommend the items with the highest predicted ratings for a user on items they have not rated. The recommendations for each user may be different when they have provided different ratings. In the example above, the items were movies, but they could also be consumer goods, restaurants, holidays, or other items.

Netflix Prize There was a considerable amount of research on collaborative filtering with the Netflix Prize to award $1,000,000 to the team that could improve the prediction accuracy of Netflix’s proprietary system, measured in terms of sum-of-squares by 10%. Each rating gives a user, a movie, the rating from 1 to 5 stars, and a date and time the rating was made. The data set consisted of approximately 100 million ratings from 480,189 anonymized users on 17,770 movies that was collected over a 7 year period. The prize was won in 2009 by a team that averaged over a collection of hundreds of predictors, some of which were quite sophisticated. After 3 years of research the winning team beat out another team by just 20 minutes to win the prize. They both had solutions which had essentially the same error, which was just under the threshold to win. Interestingly, an average of the two solutions was better than either alone. The algorithm presented here is the basic algorithm that gave the most improvement. The Netflix data set is no longer available because of privacy concerns. Although users were only identified by a number, there was enough information, if combined with other information, to potentially identify some of the users.

We develop an algorithm that does not take properties of the items or the users into account. It simply suggests items based on what other people have liked. It could potentially do better if it takes properties of the items and the users into account.

Suppose E is a data set of u,i,r triples, where u,i,r means user u gave item i a rating of r. (We are ignoring the timestamp here.) Let r^(u,i) be the predicted rating of user u on item i. The aim is to optimize the sum-of-squares error:

u,i,rE(r^(u,i)-r)2.

which penalizes large errors much more than small errors. As with most machine learning, we want to optimize for the test examples, not for the training examples.

Here we give a sequence of more sophisticated predictions:

Make a single prediction

In the simplest case, we could predict the same rating for all users and items: r^(u,i)=μ, where μ is the mean rating. Recall that if we are predicting the same for every user, predicting the mean minimizes the sum-of-squares error.

Add user and item biases

Some users might give, on average, higher ratings than other users, and some movies may have higher ratings than other movies. We can take this into account using

r^(u,i)=μ+ib[i]+ub[u]

where item i has an item-bias parameter ib[i] and user u has a user-bias parameter ub[u]. The parameters ib[i] and ub[u] are chosen to minimize the sum-of-squares error. If there are n users and m items, there are n+m parameters to tune (assuming μ is fixed). Finding the best parameters is an optimization problem that can be done with a method like gradient descent, as was used for the linear learner.

One might think that ib[i] should be directly related to the average rating for item i and ub[u] should be directly related to the average rating for user u. However, it is possible that ub[u]<0 even if all of the ratings for user u were above the mean μ. This can occur if user u only rated very popular movies and rated them lower than other people.

Optimizing the ib[i] and ub[u] parameters can help get better estimates of the ratings, but it does not help in personalizing the recommendations, because the movies are still ordered the same for every user.

Add a hidden property

We could hypothesize that there is an underlying property of users and movies that makes more accurate predictions. For example, we could have the age of users, and the age-appropriateness of movies. Instead of using observed properties, we can invent a hidden property and tune it to fit the data.

The hidden property has a value ip[i] for item i and a value up[u] for user u. The product of these is used to predict the rating of that item for that user:

r^(u,i)=μ+ib[i]+ub[u]+ip[i]*up[u]

If ip[i] and up[u] are both positive or both negative, the property will increase the prediction. If one ip[i] or up[u] is negative and the other is positive, the property will decrease the prediction.

Figure 15.4: Movie ratings by users as a function of a single property
Example 15.16.

Figure 15.4 shows a plot of the ratings as a function of a single property. This was for a subset of the MovieLens data set, where we selected 20 movies that had the most ratings, then we selected 20 users who had the most ratings for these movies. It was then trained for 1000 iterations of gradient descent, with a single property.

On the x-axis are the users, ordered by their value, up[u], on the property. On the y-axis are the movies, ordered by their value, ip[i] on the property. Each rating is then plotted against the user and the movie, so that triple u,i,r is depicted by plotting r at the (x,y) position (up[u],ip[i]). Thus each vertical column of numbers corresponds to a user, and each horizontal row of numbers is a movie. The columns overlap if two users have very similar values on the property. The rows overlap for movies that have very similar values on the property. The users and the movies where the property values are close to zero are not affected by this property, as the prediction uses the product of values of the properties.

What we would expect is that generally high ratings are in the top-right and the bottom-left, as these are the ratings that are positive in the product, and low ratings in the top-left and bottom-right, as their product is negative. Note that what is high and low is relative to the user and movie biases; what is high for one movie may be different from what is high for another movie.

Add k hidden properties

There might not be just one property that makes users like movies, but there may be many such properties. Here we introduce k such properties. There is a value ip[i,p] for every item i and property p{1,,k}, and a value up[u,p] for every user u and property p. The contributions of the properties are added. This gives the prediction:

r^(u,i)=μ+ib[i]+ub[u]+pip[i,p]*up[u,p]

This is often called a matrix factorization method as the summation corresponds to matrix multiplication.

Regularize

To avoid overfitting, a regularization term can be added to prevent the parameters from growing too much and overfitting the given data. Here an L2 regularizer for each of the parameters is added to the optimization. The goal is to choose the parameters to

minimize (u,i,rE(r^(u,i)-r)2)
+λ(i(ib[i]2+pip[i,p]2)+u(ub[u]2+pup[u,p]2)) (15.1)

where λ is a regularization parameter, which can be tuned by cross validation.

We can optimize the ib, ub, ip, and up parameters using stochastic gradient descent. The algorithm is shown in Figure 15.5. Note that ip[i,p], up[u,p] need to be initialized randomly (and not to the same value) to force each property to be different.

1:procedure Collaborative_filter_learner(E,η,λ)
2:      Inputs
3:            E: set of user,item,rating triples
4:            η: gradient descent step size
5:            λ: regularization parameter       
6:      Output
7:            function to predict rating for a user,item pair
8:      μ:= average rating
9:      assign ip[i,p], up[u,p] randomly
10:      assign ib[i],ub[u] arbitrarily
11:      define r^(u,i)=μ+ib[i]+ub[u]+pip[i,p]*up[u,p]
12:      repeat
13:             Update parameters from training data
14:            for each u,i,rE do
15:                 error:=r^(u,i)-r
16:                 ib[i]:=ib[i]-η*error
17:                 ub[u]:=ub[u]-η*error
18:                 for each property p do
19:                       ip[i,p]:=ip[i,p]-η*error*up[u,p]
20:                       up[u,p]:=up[u,p]-η*error*ip[i,p]                              
21:             Regularize the parameters
22:            for each item i do
23:                 ib[i]:=ib[i]-η*λ*ib[i]
24:                 for each property p do
25:                       ip[i,p]:=ip[i,p]-η*λ*ip[i,p]                              
26:            for each user u do
27:                 ub[u]:=ub[u]-η*λ*ub[u]
28:                 for each property p do
29:                       up[u,p]:=up[u,p]-η*λ*up[u,p]                              
30:      until termination
31:      return r^
Figure 15.5: Gradient descent for collaborative filtering

This algorithm can be evaluated by how well it predicts future ratings. We can train it with the data up to a certain time, and test it on future data.

The algorithm can be improved in various ways including

  • taking into account observed attributes of items and users. This is important for the cold-start problem, which is how to make recommendations about new items or for new users

  • taking the timestamp into account, as users’ preferences may change and items may come in and out of fashion.