
Matrix Factorization Methods for Recommender Systems

Shameem Ahamed Puthiya Parambath

June 20, 2013

Master’s Thesis in Computing Science, 15 credits
Under the supervision of:

Prof. Dr. Patrik Eklund, Umeå University, Sweden

Examined by:
Dr. Jerry Eriksson, Umeå University, Sweden

Umeå University

Department of Computing Science

SE-901 87 UMEÅ
SWEDEN

June 2013


Abstract

This thesis is a comprehensive study of matrix factorization methods used in recommender systems. We study and analyze the existing models, specifically probabilistic models used in conjunction with matrix factorization methods, for recommender systems from a machine learning perspective. We implement two different methods suggested in the scientific literature and conduct experiments on the prediction accuracy of the models on the Yahoo! Movies rating dataset.


Contents

Abstract

List of Figures

List of Tables

1 Introduction
  1.1 Introduction

2 Recommender Systems
  2.1 Introduction
  2.2 Recommendation System Models
  2.3 Summary

3 Bayesian Matrix Factorization
  3.1 Factor Analysis
  3.2 Bayesian Model
  3.3 Summary

4 Experiments
  4.1 Dataset
  4.2 Experiments
  4.3 Summary

5 Related Work and Conclusion
  5.1 Related Work
  5.2 Conclusion

6 Acknowledgements

Bibliography


List of Figures

4.1 Probabilistic Matrix Factorization
4.2 Bayesian Probabilistic Matrix Factorization

List of Tables

2.1 User-Item Rating Matrix
4.1 Dataset Field Description


Chapter 1

Introduction

1.1 Introduction

The enormous growth of data has resulted in the foundation of new research areas within the field of computer science. Recommender systems, fully automated systems that analyze user preferences and predict user behavior, are one among them. Research interest in this area remains very high, mainly due to the practical significance of the problem. As stated by Adomavicius and Tuzhilin [1], a recommender system "helps to deal with information overload and provide personalized recommendations, content and service". Most online companies have already incorporated recommender systems into their services. Examples of such systems include product recommendations by Amazon, product advertisements shown by Google based on search history, and movie recommendations by Netflix, Yahoo! Movies and MovieLens.

Recommender systems come under the broad research field of collaborative filtering. In the scientific literature the terms collaborative filtering and recommender systems are used interchangeably. A collaborative filtering system consists of one task, filtering user data according to the user preferences, whereas a recommender system consists of two tasks: first, predicting the ratings, and second, ranking the predictions. According to Resnick and Varian [2], a recommender system differs from a collaborative filtering system in two aspects. First, the recommenders may not explicitly collaborate with end users. Second, recommendations suggest particularly interesting items in addition to filtering out noise.

Matrix factorization methods have recently received greater exposure, mainly as an unsupervised learning method for latent variable decomposition. They have been successfully applied in spectral data analysis and text mining [3]. Most matrix factorization models are based on the linear factor model. In a linear factor model, the rating matrix is modeled as the product of a user coefficient matrix and an item factor matrix. During the training step, a low rank approximation matrix is fitted under a given loss function. One of the most commonly used loss functions is the sum-squared error. The low rank approximation that minimizes the sum-squared error can be found using Singular Value Decomposition (SVD) or QR factorization.

SVD and QR factorization have been successfully employed in information retrieval systems [4]. Latent Semantic Indexing (LSI) [5] is a latent model based on Singular Value Decomposition that finds hidden semantic content in a given text corpus. A probabilistic approach to the LSI model, called Probabilistic Latent Semantic Indexing (PLSI), was suggested by Hofmann [6, 7]; it is the basis for more complex latent generative models, including Latent Dirichlet Allocation (LDA) [8, 9].

Even though SVD and QR factorization methods have been successfully applied to solve practical problems, they suffer from severe overfitting when applied to sparse matrices. Most real-world datasets are very sparse, i.e. the density ratio can be very low. The density ratio is one measure of the sparseness of a data matrix. It is defined as δ = |R| / (|U| · |I|), where |R| is the number of known entries in the rating matrix (the number of observed ratings), |U| is the total number of users and |I| is the total number of items. The density ratio of real-world datasets can be less than 0.25%. When SVD and QR factorization are applied to sparse data, the error function has to be modified to consider only the observed ratings, by setting non-observed entries to zero. This minor modification results in a non-convex optimization problem.
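As a concrete example, the Yahoo! Movies training set used in Chapter 4 contains |R| = 211,231 observed ratings from |U| = 7,642 users over |I| = 11,915 movies, giving δ = 211231/(7642 · 11915) ≈ 0.0023, i.e. only about 0.23% of the rating matrix is observed.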

Factor based models have been used extensively in recommender systems research, as they provide a simple framework for modeling user preference. The fundamental idea behind factor models is that user preference can be described by latent factors. In linear factor models, the preference ratings are decomposed into two matrices, one representing the user preferences and the other representing the item factors. As it turns out, matrix factorization methods provide one of the simplest and most effective approaches to recommender systems [10, 11]. In this thesis we explore the probabilistic matrix factorization methods used for recommender systems.

We begin with a general discussion of recommender systems from a machine learning perspective. We briefly explain two of the commonly used models for recommendation systems. Since we will not be able to cover the mathematical theory behind all of the methods, an in-depth study is not presented in this thesis, but proper references are provided for the enthusiastic reader. The formulation of the problem and recommendation system approaches based on clustering and dimensionality reduction are briefly given in Chapter 2.

The probabilistic models for factor analysis used in recommender systems arediscussed in Chapter 3. This chapter is organized in a step-wise manner. We will startwith a simple probabilistic model for matrix factorization and develop the model toa proper Bayesian model as we go along. We will cover some of the mathematicaltheory behind Bayesian methods here.

In Chapter 4, we discuss the data set used for the implementation. We elaboratethe experimental results we obtained with two different implementations. Discussionand analysis of the results will be given in this Chapter. We conclude with a briefdescription of related works in Chapter 5.


Chapter 2

Recommender Systems

2.1 Introduction

In this chapter, we describe different models used to build recommender systems. However, we first define the formal problem statement associated with the recommender system. A brief discussion of the limitations of each approach is also provided at the end of each model.

According to Resnick and Varian [2], "A recommender system assist and augment the natural social process of word of mouth recommendation provided by friends, recommendation letters, book reviews etc for different products and services". In the research community the term collaborative filtering is used as a synonym for recommendation system. But, as Resnick and Varian pointed out [2], a recommender system need not be a collaborative one, and in addition to filtering out information, it can suggest particularly interesting items. In this thesis, we use the generic term recommender systems, and collaborative filtering is described as a specific model of recommender system.

Recommender systems are a very active research field in the data mining community due to the large industrial potential. Every e-commerce company implements recommendation systems for the products it is trying to sell. The online retailer Amazon provides customers with a "Customer Who Bought This Item Also Bought" recommendation for each product viewed. Yahoo! News provides two kinds of recommendations: "News For You", a user centric news recommendation for each user, and "Explore Related Contents", a news centric recommendation for each viewed news item. Other major applications of recommender systems are in recommending movies, as provided by Netflix, Yahoo! Movies and MovieLens, and music, as in Yahoo! Music and Last.fm.


2.1.1 Formal Definition

According to Adomavicius and Tuzhilin [1], the recommendation problem is the problem of estimating ratings for the items that have not been seen by a user, in the context of user-item settings. The rating estimation is based on the ratings given by the user to other items and on the meta information associated with the users and items.

The formal definition of the recommendation problem is given in the user-item context and can be formulated as follows. Let C be the set of all users and let S be the set of all possible items, such as music, movies, books, electronic items etc. The set S of possible items and the set C of possible users can be very large, in the millions in most cases. The recommendation system takes the set of users, the set of items and a set of partial ratings of some items by some users, and outputs the items with the top ratings for a selected user. Intuitively, this can be decomposed into two sub-problems:

• Finding the unknown ratings associated with users and items.

• Sorting the ratings to select the top k items

The second problem is a trivial problem of sorting a hash structure or dictionary, which can be solved using any of the efficient sorting algorithms given in standard algorithms textbooks, as in Johnsonbaugh and Schaefer [12]. The rating of an item given by a user can be expressed as a utility function. Let u be a utility function that measures the usefulness of item s to user c, i.e., u : C × S → R, where R is a totally ordered set of integer or real numbers, as we consider only numeric values for ratings. A recommender system finds, for each user c ∈ C, the item s′ ∈ S that maximizes the associated utility function. Mathematically,

\[ \forall c \in C, \quad s' = \arg\max_{s \in S} u(c, s) \]
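The ranking sub-problem, for instance, amounts to sorting the predicted ratings per user. A minimal sketch in Python/NumPy might look as follows; the function and variable names are ours, and the predicted rating matrix is assumed to come from one of the models discussed later.

```python
import numpy as np

def top_k_items(pred_ratings, rated_mask, k=5):
    """Rank predicted ratings and return the indices of the top-k
    unseen items for each user (already-rated items are excluded)."""
    scores = np.where(rated_mask, -np.inf, pred_ratings)
    # argsort the negated scores to sort each user's row in descending order
    return np.argsort(-scores, axis=1)[:, :k]
```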

As mentioned earlier, in the context of recommender systems the utility function is the rating, which is either a positive integer valued or real valued number that indicates how much a particular user liked or disliked an item. One example of this utility function can be given here: the online entertainment database IMDB [13] allows users to rate movies, TV shows etc. Suppose a user Aybuke gave the movie "Pride and Prejudice" a rating of 9 (out of 10); this clearly indicates that she liked the movie very much and would prefer similar movies.

The utility value u, i.e. the rating, can be acquired in two ways, depending on the application: it can be specified explicitly by the user, or computed automatically by the application from the user profiles on the fly. Each user in the user space C is associated with a profile. A profile is a representation of a user that contains various characteristics such as age, gender, occupation, country etc. In addition, each user profile is associated with a unique user id. Similarly, each item is associated with corresponding metadata. For example, in the case of the IMDB database mentioned above, the item metadata can be the movie/show name, genre, director name, actor/actress names etc.

The user ratings are represented as a matrix called the user-item rating matrix. A sample user-item rating matrix for the movie preferences of three users is given in Table 2.1. Unknown values in the matrix are represented using the symbol ⊥. The matrix representation is one of the most common approaches, as it provides an intuitive way for manipulation and prediction. As we mentioned earlier, the underlying problem of a recommender system is to predict the ratings of each user for each item given the limited rating values available. This means the utility u needs to be extrapolated to the whole user-item space given by C × S. This also gives an idea about the sparsity of the matrix: in most real cases the user-item matrix sparsity rate is more than 99%, i.e. less than 1% of the entries in the user-item matrix are filled.

User   | Looper | Memento | Atonement | Pride & Prejudice
Aybuke | 8      | ⊥       | 7         | ⊥
Sham   | ⊥      | 7       | 8         | 9
Shani  | ⊥      | ⊥       | ⊥         | 6

Table 2.1: User-Item Rating Matrix

The new ratings of the not-yet-rated items can be estimated in many different ways using techniques from machine learning, data mining, approximation theory and artificial intelligence.


2.2 Recommendation System Models

In this section we provide a brief survey of the methods suggested in the scientific literature for extrapolating the user-rating matrix. The most commonly accepted and implemented methods are classified into two groups:

• Content Based Recommendation

• Collaborative Recommendation

2.2.1 Content Based Recommendation

In content based recommendation systems, extrapolation of the ratings is carried out based on past rating values and user profile characteristics. In the formal setting, the utility u(c, s) of an item s for a user c is estimated based on the utilities u(c, s_i) assigned by user c to items s_i ∈ S that are "similar" to item s [1]. The notion of "similarity" above can be quantitatively derived based on the metadata of the items. For example, in the case of movie ratings, since the director of the movies Memento and Pride & Prejudice is the same, a person who likes one movie 'may' like the other one also.

Most content based recommendation techniques root back to profile based information retrieval methods. In profile based approaches each user is associated with a user profile that contains information about the tastes and preferences of the user. This information is elicited from the user explicitly through questionnaires while signing up for the service, or learned implicitly from transactional behavior. In a similar way, each item is associated with an item profile that is gathered from the item descriptions or item features. The item profiles are usually described using keywords (k), which represent a set of the most important words associated with the item. The importance of a keyword k_i in an item d_j is determined using some weighting measure w_ij.

The weighting measure w_ij can be defined in many different ways. One of the most common measures used in the information retrieval community is the tf∗idf (term frequency * inverse document frequency) measure [14]. The tf∗idf represents the importance of a term (keyword) in the set. It is the product of two measures: the first, called term frequency, represents the number of times a term appears for the corresponding item, i.e. the frequency of the term in the item description, and the second, called inverse document frequency, represents the relevancy of the term in the whole set of items. The idf is defined as the logarithm of the ratio of the total number of items to the number of items containing the term:

\[ tf{*}idf(s, k) = tf(s, k) \cdot \log\!\left(\frac{|S|}{df(k)}\right) \]

Here tf(s, k) is the term frequency of the term k in item s, |S| is the number of items in the set and df(k) is the number of items in the set in which the term k appears. The above equation is a general equation that can be found in any standard textbook on information retrieval. Normalization is carried out on the tf∗idf measure to offset different factors; normalization reduces the advantage of well described items over shorter ones. There exist different normalization techniques and motives; a brief overview and comparison can be found in [15, 16]. A detailed description of tf∗idf can be found in [9]. Other applications and methods for calculating tf∗idf can be found in [17].
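A toy computation of these weights, following the formula above directly (no normalization; the function name and data are ours), might look like:

```python
import math
from collections import Counter

def tf_idf(item_descriptions):
    """tf*idf weights for items described as lists of keywords."""
    n_items = len(item_descriptions)
    # df(k): number of items whose description contains the term k
    df = Counter(term for desc in item_descriptions for term in set(desc))
    weights = []
    for desc in item_descriptions:
        tf = Counter(desc)  # tf(s, k): term frequency within this item
        weights.append({t: tf[t] * math.log(n_items / df[t]) for t in tf})
    return weights

items = [["romance", "drama", "classic"], ["action", "drama"], ["romance", "comedy"]]
print(tf_idf(items)[0])  # "classic" gets the highest weight for item 0
```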

As stated above, content based systems recommend items similar to the ones preferred by a user. A vector of keyword weights for the past preferences of a user is constructed by the keyword analysis techniques mentioned earlier. This vector is matched against the item vectors of unrated items to compute the utility score u(c, s), and the most similar items are chosen as recommendations. The notion of similarity is measured based on the association coefficients and correlation coefficients used in cluster analysis and information retrieval [18]. Association coefficients can be used with qualitative features and are in widespread use. The most commonly used measure is the cosine coefficient. The cosine coefficient between a user profile vector a and an item profile vector b is calculated as

\[ \cos(t_a, t_b) = \frac{t_a \cdot t_b}{|t_a|\,|t_b|} \]

Here t_a and t_b are tf∗idf vectors representing the user vector a and the item vector b respectively. One important point to note is that the cosine similarity measure is one of many variants of the inner product similarity measure; other variants include the plain inner product measure, the pseudo-cosine measure and the dice measure. All variants of the inner product similarity measure come under association coefficients. The cosine measure is the Euclidean-length-normalized version of the inner product similarity. A detailed study of inner product based similarity measures, with advantages and disadvantages, is given in [19].
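As a sketch, with hypothetical tf∗idf profile vectors:

```python
import numpy as np

def cosine_similarity(ta, tb):
    """Cosine coefficient between two tf*idf vectors."""
    return ta @ tb / (np.linalg.norm(ta) * np.linalg.norm(tb))

user = np.array([1.2, 0.0, 0.7])  # hypothetical user profile weights
item = np.array([0.9, 0.1, 0.5])  # hypothetical item profile weights
print(round(float(cosine_similarity(user, item)), 3))
```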

The correlation coefficient is based on the Pearson product-moment correlation coefficient used widely in statistics. The standard formula for calculating the correlation coefficient is

\[ P_{a,b} = \frac{\mathrm{Cov}(a, b)}{\sigma_a \sigma_b}, \]

where σ_a and σ_b are the standard deviations of the data sets a and b, and Cov(a, b) is the covariance between a and b.

In the case of profile vectors, we can define the correlation coefficient based on the tf∗idf weights explained earlier. Unlike other similarity measures, the numeric value of the correlation coefficient varies between -1 and 1, where 1 and -1 indicate perfectly (positively or negatively) correlated profiles and 0 indicates unrelated profiles [20]. A more rigorous mathematical treatment of the notion of similarity can be found in Eklund et al. [21].

Limitations

Even though content based models are in widespread use, they have several limitations. According to [1], some of the major limitations are:

• Limited Content Analysis

• Over-specialization

• New user problem

As the success of content based techniques is limited by the effectiveness of the keywords or features associated with objects, it is of vital importance to acquire or generate explicit keywords for each object. Explicitly acquiring features for a user profile is an interactive task, requiring the users to select from a list of the most commonly used tags. This is discouraged from a user point of view, as most customers will not be interested in providing more than the minimum required data. Most automatic information extraction techniques work very well with text based objects, but outside the text model these techniques have not been employed successfully. One example is multimedia data, including images, audio and video files.

The second major limitation is over-specialization. It is similar to overfitting in the context of statistical modeling of data. Since content based systems recommend items with a high rating score against past preferred items, the user is limited to being recommended items similar to the ones preferred in the past. In the movie rating context, a user who rated romantic comedies in the past will be recommended more romantic comedies and will not be recommended any other genres. This can be addressed by adding some randomness to the profile, with the assumption that each profile contains some intrinsic error.


The new user problem, also known as the cold-start problem, is associated with all kinds of recommendation systems. A recommender system should be capable of making non-trivial recommendations to a user without sufficient prior ratings in his profile.

2.2.2 Collaborative Recommendation

Collaborative recommendation systems try to predict the ratings of items based on the ratings of other users, instead of the similarity between new items and past preferred items. Formally, the utility u(c, s) of item s for user c is calculated based on the utilities u(c_i, s) of item s assigned by users c_i ∈ C who are similar to user c. An example of 'peer' based recommendation ('peer' is a term used to indicate similar users) is the product recommendation by Amazon.

Based on the algorithms used for collaborative recommendation, such systems can be grouped into two general classes [1]:

• memory based

• model based

In memory based systems, heuristic functions are run over the entire collection of past ratings to derive new unknown ratings. The heuristic function calculates properly normalized aggregate values of ratings, i.e. the new rating r_{c,s} for an item s and a user c is calculated by aggregating the ratings of the other users for the same item s [1]. The aggregate function can be the mean (average), median, mode or some other, more complex aggregate function, including a standard normalized value. Cosine similarity and the Pearson correlation coefficient, explained in the content based recommendation section, can also be used to weight the aggregation.
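A minimal sketch of such a memory based prediction in Python/NumPy, using cosine similarity over co-rated items as the weighting (one choice among the aggregates mentioned above; the names are ours):

```python
import numpy as np

def predict_rating(R, mask, user, item):
    """Predict r_{c,s} as the similarity-weighted mean of other users'
    ratings of the item; similarity is cosine over co-rated items."""
    raters = np.where(mask[:, item] == 1)[0]
    num = den = 0.0
    for other in raters:
        if other == user:
            continue
        common = (mask[user] == 1) & (mask[other] == 1)  # co-rated items
        if not common.any():
            continue
        a, b = R[user, common], R[other, common]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        num += sim * R[other, item]
        den += abs(sim)
    return num / den if den else np.nan
```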

Model based systems are among the most successful methods, implemented by many large online companies including Google and Yahoo!. The fundamental idea behind model based systems is to learn a model, commonly a statistical data model, from the given (observed) data and predict values using the best-fit parameters of the model. Bayesian network models and cluster based models are the most important kinds of models in this class. A detailed study of cluster models can be found in [9]. In Bayesian network based models, a Bayesian network is constructed from the observed data with proper prior distribution assumptions on the model parameters. Most of these methods are based on the principles of data mining and machine learning. The main theme of this thesis also comes under model based recommendation systems. We will discuss Bayesian learning methods in more detail in Chapter 3.

Limitations

Collaborative recommendation systems suffer from almost the same types of limitations as content based recommendation systems. Three types of limitations are reported in the scientific literature [1]:

• New user problem

• New item problem

• Sparsity

The new user problem, also known as the cold-start problem, is the same as the one discussed in the context of content based recommendation systems. The new item problem arises mainly due to the fact that collaborative systems rely on similar users' ratings of an item for prediction; if an item is not rated by enough users, the collaborative system's results can be very biased. As explained for the content based methods, the sparsity of the user-item rating matrix poses a big modeling problem in collaborative recommendation systems as well. The sparsity of the data may lead to severe over-fitting and result in very bad prediction accuracy.

2.2.3 Other Models

In addition to the models specified above, some scientific articles divide recommendation models based on the mathematical models behind them. Even though most of these models can be classified as either content based or collaborative models, we briefly discuss them for the sake of completeness of the discussion.

Clustering Models

Clustering based models are used mainly as a pre-processing step for recommendation. The users and items can be clustered to reduce the dimensions of the item space and user space, and to identify groups of users with similar preferences and items with similar features. Clustering leads to a reduction in dimensionality, and collaborative recommendation methods can then be applied to the clusters of users or items to achieve better prediction ratings. The most commonly used clustering schemes are k-means and k-medians. Other variations of clustering, fuzzy c-means clustering and soft k-means clustering, can also be used for this purpose. In these clustering methods, each item or user is a member of different clusters with an associated grade of membership; the grades for a specific user/item should sum to 1.

Dimensionality Reduction

Dimensionality reduction methods are another family of mathematical models used for recommender systems. Dimensionality reduction techniques are analogous to clustering schemes, but the basic principle behind them is the assumption that the data given by the user-item rating matrix is generated by a fixed number of latent variables rather than a single latent variable. A special case of dimensionality reduction is matrix factorization, where a data matrix is decomposed into a product of two low rank matrices. The most commonly used matrix decomposition methods are Singular Value Decomposition (SVD), QR factorization, factor analysis (FA) and principal component analysis (PCA). A detailed study of SVD and QR can be found in [3, 5]. A tutorial on PCA can be found in [22]. More details about factor analysis are provided in Chapter 3.
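As an illustration of the matrix factorization special case, a rank-k approximation via truncated SVD takes a few lines of Python/NumPy (assuming, for simplicity, a fully observed matrix, which real rating data is not):

```python
import numpy as np

def low_rank_approx(R, k):
    """Best rank-k approximation of R in the sum-squared-error sense,
    obtained from the truncated Singular Value Decomposition."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```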

2.3 Summary

We discussed the two most popular models used to implement recommender systems: 1) content based models and 2) collaborative models. In content based models, unknown ratings are estimated based on past ratings of similar items. In collaborative models, the ratings are estimated based on the ratings of similar users, for instance through a probabilistic model fitted on the observed rating values. We concluded each model description with a discussion of the advantages and disadvantages of the model.


Chapter 3

Bayesian Matrix Factorization

In this chapter, we introduce the recommendation model based on factor analysis. We build a Bayesian matrix factorization model for recommender systems step by step, starting from a simple probabilistic matrix factorization model.

3.1 Factor Analysis

Factor analysis is a simple, constrained linear Gaussian latent variable model. In factor analysis, the conditional distribution of the observed variable x given the latent variable z is assumed to be Gaussian with a diagonal covariance matrix. The model is specified as

\[ p(x \mid z) = \mathcal{N}(x \mid Wz + \mu, \Psi) \]

where x is the D-dimensional observed variable, z is the M-dimensional latent variable, W is a D×M matrix, μ is the D-dimensional mean vector of x and Ψ is the diagonal D×D covariance matrix. The basic assumptions in factor analysis models are given below:

• The observed variables x_1, x_2, ..., x_D are independent; in our context this indicates that a user's ratings of different items are independent of each other.

• The observed variables conditioned on the latent variable follow a Gaussian distribution.

• The latent variable follows a Gaussian distribution with zero mean and a spherical (isotropic) covariance matrix.

The matrix W captures the variance of the observed data points and Ψ captures the covariance between the variables in the matrix W [23]. The factor analysis model explains the observed covariance structure of the data through W and Ψ. The matrix W is called the factor loading matrix and Ψ is called the uniqueness matrix.

The corresponding generative viewpoint of factor analysis can be explained as follows: a sample of the observed variable is obtained by first choosing a value of the latent variable by sampling from the zero-mean Gaussian distribution p(z),

\[ p(z) = \mathcal{N}(z \mid 0, I) \]

where I is the identity matrix, and then sampling the observed variable conditioned on this latent value. The D-dimensional observed variable x is defined as a linear transformation of the M-dimensional latent variable z plus additive Gaussian 'noise', so that

\[ x = Wz + \mu + \varepsilon \]

where z is the M-dimensional Gaussian latent variable and ε is a D-dimensional zero-mean Gaussian noise variable with covariance Ψ. Factor analysis can be seen as a generalized version of probabilistic principal component analysis. In probabilistic principal component analysis, the covariance matrix of the conditional distribution of the observed variable is assumed to be spherical or isotropic (a diagonal matrix with all diagonal elements equal), whereas in factor analysis the covariance matrix is assumed to be a general diagonal matrix.
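The generative process can be simulated in a few lines of Python/NumPy; the dimensions and parameter values below are illustrative, and the empirical covariance of the samples approaches W Wᵀ + Ψ, the covariance implied by the model:

```python
import numpy as np

def sample_factor_analysis(W, mu, psi_diag, n_samples=10000, seed=0):
    """Sample x = W z + mu + eps with z ~ N(0, I), eps ~ N(0, diag(psi_diag))."""
    rng = np.random.default_rng(seed)
    D, M = W.shape
    Z = rng.standard_normal((n_samples, M))                 # latent variables
    eps = rng.standard_normal((n_samples, D)) * np.sqrt(psi_diag)
    return Z @ W.T + mu + eps                               # observed variables

W = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])          # D = 3, M = 2
X = sample_factor_analysis(W, mu=np.zeros(3), psi_diag=0.1 * np.ones(3))
print(np.cov(X.T).round(2))                                 # ≈ W @ W.T + 0.1*I
```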

3.1.1 Simple Probabilistic Model

The simple factor analysis model we discuss here is the probabilistic model called Probabilistic Matrix Factorization. Consider a dataset of N users and M items; an N×M preference matrix R is given by the product of an N×D user coefficient matrix U and a D×M factor matrix Vᵀ. Training such a model amounts to finding the best rank-D approximation to the observed N×M target matrix R. Here D is the assumed size of the latent feature space.

Suppose we are given a set of M items and N users. Let R be the rating matrix, where the entry R_ij represents the rating for item j by user i. Let U be the N×D user latent matrix and V the M×D movie feature (loading) matrix. We define the conditional distribution of the observed ratings as given below.

\[ P(R \mid U, V, \sigma^2) = \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ \mathcal{N}\!\left(R_{ij} \mid U_i V_j^{T}, \sigma^2\right) \right]^{I_{ij}} \tag{3.1} \]

where N(x|μ, σ²) is the Gaussian distribution with mean μ and variance σ², and I_ij is the indicator function, equal to 1 if user i rated movie j in the dataset R and 0 otherwise. We place a zero-mean spherical (isotropic) Gaussian prior distribution on the user and movie feature vectors:

\[ P(U \mid \sigma_U^2) = \prod_{i=1}^{N} \mathcal{N}\!\left(U_i \mid 0, \sigma_U^2 I\right) \tag{3.2} \]

\[ P(V \mid \sigma_V^2) = \prod_{j=1}^{M} \mathcal{N}\!\left(V_j \mid 0, \sigma_V^2 I\right) \tag{3.3} \]

The posterior distribution of the model can be found as follows:

\[ P(U, V \mid R, \sigma^2, \sigma_U^2, \sigma_V^2) = \frac{P(U, V, R \mid \sigma^2, \sigma_U^2, \sigma_V^2)}{P(R \mid \sigma^2)} = \frac{P(R \mid U, V, \sigma^2)\, P(U, V \mid \sigma_U^2, \sigma_V^2)}{P(R \mid \sigma^2)} \]

Since we assume that the user and item vectors are independent,

\[ P(U, V \mid \sigma_U^2, \sigma_V^2) = P(U \mid \sigma_U^2)\, P(V \mid \sigma_V^2), \]

which implies that

\[ P(U, V \mid R, \sigma^2, \sigma_U^2, \sigma_V^2) = \frac{P(R \mid U, V, \sigma^2)\, P(U \mid \sigma_U^2)\, P(V \mid \sigma_V^2)}{P(R \mid \sigma^2)} \]

The parameters of the above model can be estimated by maximizing the log posterior. Taking the log of the posterior distribution gives

\[
\begin{aligned}
\ln P(U, V \mid R, \sigma^2, \sigma_U^2, \sigma_V^2) = {}& -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left(R_{ij} - U_i V_j^{T}\right)^2 - \frac{1}{2\sigma_U^2} \sum_{i=1}^{N} U_i^{T} U_i - \frac{1}{2\sigma_V^2} \sum_{j=1}^{M} V_j^{T} V_j \\
& - \frac{1}{2} \left( \Big(\sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij}\Big) \ln \sigma^2 + ND \ln \sigma_U^2 + MD \ln \sigma_V^2 \right) + C
\end{aligned}
\tag{3.4}
\]

In Eq. 3.4, the constant C does not depend on the parameters; it also absorbs the normalization term P(R|σ²). Maximizing Eq. 3.4 with the hyperparameters kept fixed is equivalent to minimizing the following sum-of-squared-errors objective function with quadratic regularization terms [11]:


\[ E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left(R_{ij} - U_i V_j^{T}\right)^2 + \frac{\lambda_U}{2} \sum_{i=1}^{N} \lVert U_i \rVert^2_{Fro} + \frac{\lambda_V}{2} \sum_{j=1}^{M} \lVert V_j \rVert^2_{Fro} \tag{3.5} \]

In the above, ‖·‖²_Fro denotes the squared Frobenius norm, and λ_U and λ_V are regularization parameters. A local minimum of Equation 3.5 can be found by running gradient descent on U and V. The gradient descent updates for U and V are

\[ U \leftarrow U - \varepsilon \left( -(R - UV^{T})\, V + \lambda U \right) \tag{3.6} \]

\[ V \leftarrow V - \varepsilon \left( -(R - UV^{T})^{T} U + \lambda V \right) \tag{3.7} \]

where the residual R − UVᵀ is taken over the observed entries only (entries with I_ij = 0 are set to zero), ε is the learning rate and λ is the regularization parameter (we use a symmetric regularization parameter for both U and V).
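As an illustration of Eqs. 3.5–3.7, a minimal training loop in Python/NumPy might look as follows. This is a sketch, not the thesis implementation: the function name and hyperparameter values are ours, and the toy data comes from Table 2.1.

```python
import numpy as np

def pmf_gradient_descent(R, mask, D=2, lam=0.02, eps=0.02, iters=2000, seed=0):
    """PMF via gradient descent on U and V (Eqs. 3.6 and 3.7).
    mask is the indicator matrix I: 1 where a rating is observed."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = 0.1 * rng.standard_normal((N, D))
    V = 0.1 * rng.standard_normal((M, D))
    for _ in range(iters):
        E = mask * (R - U @ V.T)       # residual on observed entries only
        U += eps * (E @ V - lam * U)   # step along the negative gradient
        V += eps * (E.T @ U - lam * V)
    return U, V

R = np.array([[8., 0., 7., 0.], [0., 7., 8., 9.], [0., 0., 0., 6.]])
mask = (R > 0).astype(float)           # 0 marks the unknown ratings here
U, V = pmf_gradient_descent(R, mask)
print(np.round(U @ V.T, 1))            # predicted user-item rating matrix
```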

Rating prediction is carried out by multiplying the corresponding user vector U_i and item vector V_j. Ideally, since we have a probabilistic model, the prediction should be expressed in terms of the predictive distribution over R, which is Gaussian, rather than a simple point estimate. The individual recommendation value for an item j by a user i can be found by looking at R_ij. Further details about the implementation, test runs and results are given in the Experiments chapter.

3.1.2 Maximum a Posteriori Model

Maximum a posteriori (MAP) estimation is a partial Bayesian approach. MAP estimation gives a point estimate rather than a predictive distribution. In MAP estimation, we introduce a prior distribution over the parameters of the user vectors U_i and item vectors V_j. We consider zero-mean spherical Gaussian distributions for σ_U and σ_V. We denote the sets of parameters as Θ_U and Θ_V. A point estimate of the parameters and hyperparameters (Θ_U, Θ_V) can be found by maximizing the posterior, whose logarithm is given by

\[ \ln P(U, V, \sigma^2, \Theta_U, \Theta_V) = \ln P(R \mid U, V, \sigma^2) + \ln P(U \mid \Theta_U) + \ln P(V \mid \Theta_V) + \ln P(\Theta_U) + \ln P(\Theta_V) + C \tag{3.8} \]

A local maximum can be found by running gradient descent on the four variables U_i, V_j, Θ_U and Θ_V.


3.2 Bayesian Model

In a fully Bayesian approach, we would integrate or sum over all values of the parameters and hyperparameters. In most cases this integration is too complex to be evaluated exactly using analytical techniques. Here we use Gibbs sampling, a Markov chain Monte Carlo sampling method. Gibbs sampling allows us to draw samples that asymptotically follow the conditional distribution of the parameters given the data, without having to explicitly calculate the integrals. In the first section we explain the Bayesian model for matrix factorization, and in the second section we discuss Gibbs sampling, the parameter estimation method employed in the implementation.

The likelihood of the observed data in the Bayesian setting is the same as in the simple probabilistic model, given in Eq. 3.1. The prior distributions over the user and item vectors are again assumed to be Gaussian, but with non-zero mean and full precision. In the simple probabilistic model we assumed the vectors to be Gaussian with zero mean and a spherical covariance matrix; here we generalize the Gaussian prior as given below.

\[ P(U \mid \mu_U, \Lambda_U) = \prod_{i=1}^{N} \mathcal{N}\!\left(U_i \mid \mu_U, \Lambda_U^{-1}\right) \]

\[ P(V \mid \mu_V, \Lambda_V) = \prod_{j=1}^{M} \mathcal{N}\!\left(V_j \mid \mu_V, \Lambda_V^{-1}\right) \]

In the above model, we parameterize the Gaussians using the precision matrices Λ_U and Λ_V instead of covariance matrices, as the prior over the hyperparameters of the user and item vectors is easier to represent using the precision matrix. The precision matrix is defined as the inverse of the covariance matrix. Now, for a fully Bayesian approach, we introduce a Gaussian-Wishart prior distribution over the hyperparameters. The Gaussian-Wishart prior is the conjugate prior for the joint distribution of the mean and precision of a Gaussian likelihood function.

\[ P(\Theta_U \mid \Theta_0) = \mathcal{N}\!\left(\mu_U \mid \mu_0, (\beta_0 \Lambda_U)^{-1}\right) \mathcal{W}(\Lambda_U \mid W_0, \nu_0) \tag{3.9} \]

\[ P(\Theta_V \mid \Theta_0) = \mathcal{N}\!\left(\mu_V \mid \mu_0, (\beta_0 \Lambda_V)^{-1}\right) \mathcal{W}(\Lambda_V \mid W_0, \nu_0) \tag{3.10} \]

In the above, β_0 is a constant, and W(Λ|W_0, ν_0) represents the Wishart distribution with D×D scale matrix W_0 and degrees of freedom ν_0. One thing to keep in mind is that this joint distribution is not the product of two independent distributions: the Gaussian distribution of the mean has a covariance that depends on the precision matrix.

\[ \mathcal{W}(\Lambda \mid W_0, \nu_0) = B\, |\Lambda|^{(\nu_0 - D - 1)/2} \exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left(W_0^{-1} \Lambda\right) \right) \tag{3.11} \]

The above equation represents the Wishart distribution. Here Tr denotes the trace of a matrix, i.e. the sum of its diagonal elements, W_0 is the D×D scale matrix and B is the normalization constant, defined as

\[ B(W_0, \nu_0) = |W_0|^{-\nu_0/2} \left( 2^{D\nu_0/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left(\frac{\nu_0 + 1 - i}{2}\right) \right)^{-1} \]

\[ p(R^{*}_{ij} \mid R, \Theta_0) = \iint p(R^{*}_{ij} \mid U_i, V_j)\, p(U, V \mid R, \Theta_U, \Theta_V)\, p(\Theta_U, \Theta_V \mid \Theta_0)\; d\{U, V\}\; d\{\Theta_U, \Theta_V\} \tag{3.12} \]

The resulting predictive distribution over an unobserved rating R*_ij is given by Equation 3.12. An exact solution of this equation cannot be found using analytical methods; in practice, such equations are solved using approximate inference techniques. Two types of approximate inference methods are generally used in the scientific literature:

• Variational Bayesian methods

• Sampling methods

In variational Bayesian methods, the posterior distribution is approximated by a product over the constituent parameters, with each parameter following a different distribution. In sampling methods, we generate samples from the distribution without calculating the probability distribution explicitly. In this thesis, we used a Markov chain Monte Carlo sampling algorithm called Gibbs sampling.

3.2.1 Gibbs Sampling

The Gibbs sampler is a technique for generating random variables from a marginal probability distribution indirectly, without calculating the density. Consider a joint distribution P(x_1, x_2, x_3, ..., x_k) from which we wish to generate samples. In each step of the Gibbs sampling algorithm, we replace one of the variables with a value drawn from the conditional distribution of that variable conditioned on the remaining variables. The variables are initialized to arbitrary values, and at each step of the algorithm the sequence of states forms a Markov chain. The Gibbs algorithm is given in Algorithm 1.


Algorithm 1 Gibbs Sampling
1: Initialize z_i, i = 1, 2, ..., k
2: for τ = 1, ..., T do
3:   for i = 1, ..., k do
4:     Sample z_i^{τ+1} ∼ p(z_i | z_1^{τ+1}, z_2^{τ+1}, ..., z_{i−1}^{τ+1}, z_{i+1}^{τ}, ..., z_k^{τ})
5:   end for
6: end for

The conditional distribution specified in Algorithm 1 can be found from the basic principles of probability as

\[ p(z_i \mid z_1^{\tau+1}, \ldots, z_{i-1}^{\tau+1}, z_{i+1}^{\tau}, \ldots, z_k^{\tau}) = \frac{p(z_1^{\tau+1}, \ldots, z_{i-1}^{\tau+1}, z_i, z_{i+1}^{\tau}, \ldots, z_k^{\tau})}{p(z_1^{\tau+1}, \ldots, z_{i-1}^{\tau+1}, z_{i+1}^{\tau}, \ldots, z_k^{\tau})} \]

The only difference between the numerator and the denominator is that the numerator is the joint distribution of all variables, whereas the denominator is the joint distribution of all variables except z_i. One full execution of the inner loop in Algorithm 1 produces one new sample (z_1^{τ+1}, z_2^{τ+1}, ..., z_k^{τ+1}) from the joint distribution. A more detailed explanation of Gibbs sampling theory, with examples, can be found in [24, 25].
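As a toy illustration of Algorithm 1 (not the BPMF sampler used later), the following Python/NumPy sketch samples a zero-mean bivariate Gaussian with correlation ρ, where both conditionals are known univariate Gaussians:

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, T=5000, seed=0):
    """Gibbs sampling for (z1, z2) ~ N(0, [[1, rho], [rho, 1]]).
    Each conditional is N(rho * z_other, 1 - rho**2)."""
    rng = np.random.default_rng(seed)
    z1 = z2 = 0.0                       # arbitrary initialization
    samples = np.empty((T, 2))
    for t in range(T):
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
        samples[t] = (z1, z2)
    return samples

samples = gibbs_bivariate_gaussian(rho=0.8)
print(np.corrcoef(samples[1000:].T))    # empirical correlation ≈ 0.8 after burn-in
```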

In our model, we sample the conditional distribution of each user vector conditioned on all the other variables and parameters, and the same method is used for the item vectors. Since we use conjugate priors for the parameters and hyperparameters, the conditional distributions are easy to find. In our model, the conditional distribution takes the form

\[ p(U_i \mid R, V, \Theta_U, \alpha) = \mathcal{N}\!\left(U_i \mid \mu_i^{*}, [\Lambda_i^{*}]^{-1}\right) \propto \prod_{j=1}^{M} \left[\mathcal{N}\!\left(R_{ij} \mid U_i V_j^{T}, \alpha^{-1}\right)\right]^{I_{ij}} p(U_i \mid \mu_U, \Lambda_U), \]

where

\[ \Lambda_i^{*} = \Lambda_U + \alpha \sum_{j=1}^{M} \left[V_j^{T} V_j\right]^{I_{ij}}, \qquad \mu_i^{*} = [\Lambda_i^{*}]^{-1} \left( \alpha \sum_{j=1}^{M} \left[V_j R_{ij}\right]^{I_{ij}} + \Lambda_U \mu_U \right) \]

The conditional distribution over the user (and, analogously, item) hyperparameters follows a Gaussian-Wishart distribution, as given below:

\[ p(\mu_U, \Lambda_U \mid U, \Theta_0) = \mathcal{N}\!\left(\mu_U \mid \mu_0^{*}, (\beta_0^{*} \Lambda_U)^{-1}\right) \mathcal{W}\!\left(\Lambda_U \mid W_0^{*}, \nu_0^{*}\right), \]

where

\[ \mu_0^{*} = \frac{\beta_0 \mu_0 + N \bar{U}}{\beta_0 + N}, \qquad \beta_0^{*} = \beta_0 + N, \qquad \nu_0^{*} = \nu_0 + N, \]

\[ [W_0^{*}]^{-1} = W_0^{-1} + N \bar{S} + \frac{\beta_0 N}{\beta_0 + N} (\mu_0 - \bar{U})(\mu_0 - \bar{U})^{T}, \]

\[ \bar{U} = \frac{1}{N} \sum_{i=1}^{N} U_i, \qquad \bar{S} = \frac{1}{N} \sum_{i=1}^{N} U_i U_i^{T} \]

The update formulas for the item vectors V_j follow the same pattern. In the Gibbs sampling procedure, we initialize the model parameters, first sample the hyperparameters conditioned on the initial values of the user and item vectors, and then sample the user and item vectors conditioned on the hyperparameters.
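Under these update formulas, one Gibbs sweep over the user side can be sketched as follows in Python (with SciPy for the Wishart draw); the item side is symmetric. This is an illustrative sketch of the equations above, not the MATLAB implementation used in the experiments, and the function and variable names are ours:

```python
import numpy as np
from scipy.stats import wishart

def bpmf_user_sweep(R, mask, U, V, alpha, mu0, beta0, W0, nu0, rng):
    """Sample (mu_U, Lambda_U) from the Gaussian-Wishart posterior,
    then each U_i from its Gaussian conditional."""
    N, D = U.shape
    U_bar = U.mean(axis=0)
    S_bar = U.T @ U / N
    beta_star, nu_star = beta0 + N, nu0 + N
    mu_star = (beta0 * mu0 + N * U_bar) / beta_star
    W_star_inv = (np.linalg.inv(W0) + N * S_bar
                  + (beta0 * N / beta_star) * np.outer(mu0 - U_bar, mu0 - U_bar))
    Lambda_U = wishart.rvs(df=nu_star, scale=np.linalg.inv(W_star_inv),
                           random_state=rng)
    mu_U = rng.multivariate_normal(mu_star, np.linalg.inv(beta_star * Lambda_U))
    for i in range(N):                      # sample each user vector in turn
        js = mask[i] == 1                   # items rated by user i
        Lam_i = Lambda_U + alpha * V[js].T @ V[js]
        cov_i = np.linalg.inv(Lam_i)
        mu_i = cov_i @ (alpha * V[js].T @ R[i, js] + Lambda_U @ mu_U)
        U[i] = rng.multivariate_normal(mu_i, cov_i)
    return U, mu_U, Lambda_U
```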

3.3 Summary

We covered a model based recommender system, one based on factor analysis. We implemented two probabilistic model based algorithms, a simple Probabilistic Matrix Factorization model and a Bayesian Matrix Factorization model.


Chapter 4

Experiments

4.1 Dataset

We used the movie ratings provided by Yahoo! Research, which are free to use in academic research and can be accessed through the Yahoo! Sandbox [26]. According to Yahoo!, the data was gathered on or before November 2003, with some modifications and additions by Yahoo! Research, through the movie rating and recommendation portal Yahoo! Movies. All user ids in the dataset are anonymized to preserve the privacy of the users. The fields in the dataset are delimited with tab ("\t") characters. The fields of the dataset are given in Table 4.1.

Field | Description
1     | Anonymized User ID
2     | Movie ID
3     | User Rating
4     | Normalized User Rating

Table 4.1: Dataset Field Description

The dataset is divided into two sets, a training set and a test set. The training data contains 7,642 users (|U|), 11,915 movies (|M|) and 211,231 ratings (|R|). The average user rating, defined as \bar{R}_U = \frac{\sum_U R_U}{|U|}, is 9.64, and the average item rating, defined as \bar{R}_M = \frac{\sum_M R_M}{|M|}, is 9.32. The average number of ratings per user, defined as R^U_{avg} = \frac{|R|}{|U|}, is 27.64, and the average number of ratings per item, defined as R^M_{avg} = \frac{|R|}{|M|}, is 17.73. In the training dataset, every user has rated at least 10 items and every item has been rated by at least one user. The density ratio, defined as δ = |R| / (|U| · |M|), is 0.0023, meaning that only 0.23% of the entries in the user-item matrix are filled. Please refer to the discussion about the sparsity of the rating matrix in Chapters 2 and 3.

The ratings come in two forms, normalized ratings and un-normalized ratings. Un-normalized ratings range from 1 to 13, where 1 denotes a rating of 'F' and 13 denotes 'A+'. We used the normalized ratings, which range from 1 to 5, with 5 being the best. The test data contains a small sample of Yahoo! users' ratings of movies, gathered chronologically after the training data. The data fields are the same as in the training data, given in Table 4.1. The test data contains 2,309 users, 2,380 items, and 10,136 ratings. There are no test users or movies that do not also appear in the training data. The average user rating is 9.66 and the average item rating is 9.54. The average number of ratings per user is 4.39 and the average number of ratings per item is 4.26. All users have rated at least one movie and all items have been rated by at least one user.

4.1.1 Implementation

We used NumPy [27] and SciPy [28], scientific libraries for Python, to implement Probabilistic Matrix Factorization; MATLAB [29] was used for the implementation of Bayesian Matrix Factorization, as the Gibbs sampler is easy to implement in MATLAB. The graphs were created using the statistical computing language R [30].

4.2 Experiments

For Probabilistic Matrix Factorization, we ran gradient descent on the parameter matrices U and V for two different ratings, the normalized user ratings and the un-normalized user ratings. The parameter matrices were updated as given in Equations 3.6 and 3.7, with the regularization factor λ set to 100 and the learning rate ε set to 2. We set the feature size D to 50 in both the PMF and BPMF models. As explained in Chapters 2 and 3, the feature space is the latent space which captures different characteristics of users and items. Since the rating matrix was quite large (7,642 users with 211,231 observed entries), we ran the code sequentially, 1,000 entries per iteration. On a 64GB RAM machine, the PMF algorithm took 5 hours to converge.


[Figure 4.1: RMSE (Root Mean Squared Error) versus the number of iterations for the Probabilistic Matrix Factorization model on the training data, with learning rate 3 and regularization parameter 100.]

For the Bayesian Matrix Factorization setup, we initialized ν_0, the degrees of freedom, to D (the number of factors), μ_0 to zero, and W_0 to the D×D identity matrix. We initialized the starting user and item matrices to the ones obtained from the PMF model. We took 50 samples of the variables, and the parameter estimation took 8 hours to finish on a 4GB machine. The RMSE with Gibbs sampling was 0.27473624 on the test data.


[Figure 4.2: RMSE (Root Mean Squared Error) versus the number of samples generated for the Bayesian Matrix Factorization model on the training data.]

4.2.1 Results

The cost function of the gradient descent algorithm is plotted in Figure 4.1. Here the cost function is the Root Mean Squared Error (RMSE) between the observed ratings and the calculated ratings. The RMSE on the test data, with the optimized U and V matrices and normalized ratings, is 0.588912961491.
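For reference, the RMSE over observed entries can be computed as in the following sketch (the names are ours, not from the thesis code):

```python
import numpy as np

def rmse(R_true, R_pred, mask):
    """Root Mean Squared Error over the observed entries only."""
    err = mask * (R_true - R_pred)
    return np.sqrt((err ** 2).sum() / mask.sum())
```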

For Bayesian Matrix Factorization, the RMSE value for each generated sample is plotted against the number of iterations in Figure 4.2. The RMSE on the test data was estimated as 0.27473624. As we can see, the result obtained with Gibbs sampling is far better than the one obtained with the PMF model: the test RMSE of the PMF model is roughly double that of the Bayesian model.

4.3 Summary

We described the dataset and the experiments used to measure the validity of the models we implemented. The result obtained using the Bayesian Matrix Factorization method is far superior to the result obtained using the simple Probabilistic Matrix Factorization method. In both cases, the results obtained are superior to results published by researchers in experiments conducted on a similar dataset, Netflix.


Chapter 5

Related Work and Conclusion

5.1 Related Work

Recommender systems are a very active research field, and a number of works have already been mentioned in Chapter 2. Most online retail and travel companies study the field very actively. Sihem Amer-Yahia, Cong Yu et al. describe a content based recommender system for travel recommendation in [31]. A similar content based recommender system, with algorithms to find the most similar items in the context of online shopping, is described in [32].

Model based recommender systems are studied in great detail in [33]. This PhD thesis covers recommender systems from a machine learning perspective. It discusses different algorithms used in recommender systems, including clustering, dimensionality reduction, Naïve Bayes etc. One of the most advanced algorithms used in recommender systems, based on topic modeling, is explained in detail in [8].

A detailed probabilistic study of latent Gaussian models and factor analysis can be found in the famous textbook of Bishop [23]. Other parameter estimation techniques for probabilistic models, such as the Expectation Maximization (EM) algorithm, can be found in the same textbook. The textbook also covers the k-means algorithm, soft k-means etc. A study of similarity measures in the fuzzy setting can be found in [21].

5.2 Conclusion

The results we obtained by applying both Probabilistic Matrix Factorization and Bayesian Probabilistic Matrix Factorization are very satisfying. The results are superior compared to results obtained by applying similar techniques on other datasets like the Netflix movie rating data [11, 10].

In the future, we would like to do more work on the statistical modeling of data and on machine learning algorithms for approximate inference. Initially we planned to use variational Bayesian inference to approximate the posterior distribution in the Bayesian analysis.


Chapter 6

Acknowledgements

It is always a pleasure to finish a research project on a high note. It gives extra pleasure to look back at working with kind, humble and wise people.

First of all, I would like to thank the Almighty God for helping me to successfully finish this project. His blessings guided me through some of the hard times in my life and showed me the correct path. I would like to express my heartfelt gratitude to my supervisor Prof. Dr. Patrik Eklund for allowing me to work with him and helping me to see this work through to completion. He was very supportive, encouraging and patient during the whole life-cycle of the project and the final thesis preparation. He took the pain of reading through the initial drafts of my thesis during his busy schedule, and I am greatly thankful to him for the suggestions regarding the structure and content of the final paper. I thank my examiner Dr. Jerry Eriksson for reading through the final paper and making the necessary arrangements for the successful presentation.

Finally, I would like to thank my family, colleagues and friends, Ms. Aybüke Öztürk, Mr. Sujith, Mr. Nishanth, Mr. Tony, Mr. Awad and Mr. Harishankar, for all the support given during the project implementation stage.


Bibliography

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. on Knowl. and Data Eng., 17(6):734–749, June 2005.

[2] Paul Resnick and Hal R. Varian. Recommender systems. Commun. ACM, 40(3):56–58, March 1997.

[3] Michael W. Berry, Zlatko Drmac, and Elizabeth R. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41:335–362, 1999.

[4] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, pages 155–173, 2006.

[5] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[6] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.

[7] Kathryn B. Laskey and Henri Prade, editors. UAI '99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 – August 1, 1999. Morgan Kaufmann, 1999.

[8] David M. Blei and John Lafferty. Topic models. Text Mining: Classification, Clustering, and Applications, 10:71, 2009.


[9] Shameem Ahamed Puthiya Parambath. Topic extraction and bundling of related scientific articles. Master's thesis, Umeå University, Department of Computing Science, 2012.

[10] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 880–887, New York, NY, USA, 2008. ACM.

[11] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In NIPS, 2007.

[12] Richard Johnsonbaugh and Marcus Schaefer. Algorithms, volume 2. Pearson Education, 2004.

[13] Internet movie database. http://www.imdb.com. Accessed: 2013-05-12.

[14] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series. McGraw-Hill, 1983.

[15] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. In SIGIR, pages 21–29, 1996.

[16] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[17] Aybüke Öztürk. Textual Summarization of Scientific Publications and Usage Patterns. PhD thesis, Umeå University, 2012.

[18] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, 1999.

[19] William P. Jones and George W. Furnas. Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci., 38(6):420–442, November 1987.

[20] A. Huang. Similarity measures for text document clustering. pages 49–56, 2008.

[21] Patrik Eklund, Maria A. Galán, Jesús Medina, Manuel Ojeda-Aciego, and Agustín Valverde. Similarities between powersets of terms. Fuzzy Sets and Systems, 144(1):213–225, 2004.

[22] Jonathon Shlens. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.


[23] Christopher M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.

[24] Philip Resnik and Eric Hardisty. Gibbs sampling for the uninitiated. Technical report, DTIC Document, 2010.

[25] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.

[26] Yahoo! research. http://webscope.sandbox.yahoo.com/. Accessed: 2013-05-12.

[27] Numpy, scientific computing tools for Python. http://www.numpy.org/. Accessed: 2013-05-12.

[28] Scipy, open source library of scientific tools. http://www.scipy.org/. Accessed: 2013-05-12.

[29] Matlab, the language of technical computing. http://www.mathworks.se/products/matlab/. Accessed: 2013-05-12.

[30] The r project for statistical computing. http://www.r-project.org/. Accessed:2013-05-12.

[31] Munmun De Choudhury, Moran Feldman, Sihem Amer-Yahia, Nadav Golbandi, Ronny Lempel, and Cong Yu. Automatic construction of travel itineraries using social breadcrumbs. In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, HT '10, pages 35–44, New York, NY, USA, 2010. ACM.

[32] Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das, and Cong Yu. Constructing and exploring composite items. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 843–854, New York, NY, USA, 2010. ACM.

[33] Benjamin Marlin. Collaborative filtering: A machine learning perspective. PhD thesis, University of Toronto, 2004.

[34] Ruth Gracia, Shameem A. Puthiya Parambath, Aybuke Ozturk, and Sihem Amer-Yahia. Crowd sourcing literature review in sunflower. Technical report, 2012.

[35] Google scholar. http://www.scholar.google.com. Accessed: 2013-05-12.

[36] Yahoo! movies. http://www.movies.yahoo.com. Accessed: 2013-05-12.

[37] Yahoo! music. http://www.music.yahoo.com. Accessed: 2013-05-12.


[38] Netflix inc. http://www.netflix.com. Accessed: 2013-05-12.

[39] Lindsay I Smith. A tutorial on principal components analysis. Cornell University,USA, 51:52, 2002.
