Rating Prediction with Topic Gradient Descent Method for Matrix Factorization in Recommendation

In many online review sites and social media, users are encouraged to assign a numeric rating and write a textual review as feedback for each product they have bought. Based on a user's history of feedbacks, recommender systems predict how he/she would assess unpurchased products, in order to discover the ones that he/she may like and buy in the future. A traditional approach to predicting the unknown ratings is matrix factorization, but it uses only the ratings included in the feedbacks. Recent researches have pointed out that its ignorance of textual reviews is the drawback that brings mediocre performance. To solve this issue, we propose a rating prediction method which uses both the ratings and the reviews, built around a new first-order gradient method for matrix factorization named Topic Gradient Descent (TGD). The proposed method first derives the latent topics from the reviews via Latent Dirichlet Allocation. Each topic is characterized by a probability distribution of words and is assigned to a latent factor. Second, to predict the users' ratings, it uses matrix factorization trained by the proposed TGD method. In the training process, the updating step of each latent factor is dynamically assigned depending on the stochastic proportion of its corresponding topic in the review. In the evaluation, we use the YELP challenge dataset and per-category Amazon review datasets. The experimental results show that the proposed method certainly converges the squared error of the prediction, and improves the performance of traditional matrix factorization by up to 12.23%.

Keywords—Gradient descent; matrix factorization; Latent Dirichlet Allocation; information recommendation


I. INTRODUCTION
Nowadays, recommender systems play a significant role in how online services and social networks communicate with their users. In order to discover and provide the items (e.g. products or news) that users are potentially interested in and may buy in the future, one approach is to predict how they would assess unpurchased items based on their history of feedbacks. Such feedbacks are written by the users after their purchases, and each of them includes a numeric rating as the evaluation and a textual review. The most well-known approach to predicting the ratings is Collaborative Filtering (CF) [1]. It assumes that users who shared similar evaluations of their common items in the past are also likely to evaluate a given item similarly in the future. Among all CF algorithms, latent factor-based matrix factorization (MF) is the most famous [2]-[4]. It characterizes both items and users by vectors of latent factors, which comprise computerized alternatives to human-created genres. The rating of a user for a specific item is modeled by the inner product of their factor vectors. Using a machine learning algorithm, the latent factors can be learnt from the ratings in the history of feedbacks.

Fig. 1. A graph that characterizes the actual topics in reviews of two users for three products on Amazon. In the reviews, each user mentions many topics in which he is interested. For different products, each user mentions different parts of his interested topics; for a specific product, different users focus on different associated topics.
However, recent researches have pointed out that the ignorance of the reviews is the major shortcoming of MF and brings it mediocre performance [5]-[7]. Fig. 1 shows two users and their feedbacks for three products from Amazon. For product B (a music player), user A gives a low rating and points out the bad quality of music playing and the camera; conversely, user B rates it highly, and his principal reason is that he is a fan of the player's maker. This represents the fact that although a user often has his own overall opinion (i.e. like or dislike) of the obvious properties of a product, he/she may focus on only a part of them in the evaluation. While the descriptions of the properties are contained in the textual review, MF cannot capture such unequal treatment since their correspondence to the latent factors is not defined. To bridge this gap, existing works [6], [7] model the latent topics of reviews with distributions of words, and transform them into latent factors. Unfortunately, the transformation is complicated and makes these methods time consuming in dealing with large-scale data.
In this paper, in order to solve the issues mentioned above, we propose a new method to predict the ratings of users' unpurchased items for recommendation. To model the latent topics in the reviews, we train a Latent Dirichlet Allocation (LDA) [8] model independently. Each of the topics is assigned to a latent factor. Our idea is to present a new first-order gradient descent method, called Topic Gradient Descent (TGD), which binds the latent topics to the latent factors through the training process of MF instead of through a transformation. Since a topic mentioned more in a review is considered more important when the user rates, its proportion among all topics represents its degree of importance. When its corresponding latent factor is iteratively updated in training, the updating step is dynamically set based on this importance. In other words, the importance of the topics points out the direction in which to update the latent factor vectors of users and items. With these trained latent factor vectors, we use biased MF to predict the ratings.
In our evaluation, we conducted a series of experiments using 11 datasets, including the YELP challenge dataset and per-category reviews from Amazon. The evaluation covers not only the overall performance on the problem of missing rating prediction, but also the convergence of the squared error of rating prediction during training.
The contribution of this paper is as follows:
• It proposes a new gradient descent method named TGD, which provides a solution for introducing latent topics into MF and thereby realizes the unequal treatment of latent factors. The experimental results demonstrate that it certainly converges the objective function in MF's training. Furthermore, the speed of convergence depends on the dispersion of the topics' proportions.
• TGD works well in the proposed method for rating prediction in recommendation. Compared with standard MF [2], the proposed system derives an improvement of up to 9.03% in terms of MAE, and 12.23% in terms of RMSE.
It even outperforms a state-of-the-art model [7] for recommendation on most of the datasets. Additionally, the proposed method is also demonstrated to have higher accuracy than standard MF in the prediction of high-scored ratings.
The remainder of this paper is organized as follows: Section II overviews related works on latent factor models. Section III describes the problem that we focus on and briefly reviews MF. Section IV describes the details of the proposed method. Section V presents the evaluation methodology and its results. Finally, Section VI concludes the paper with future work.

II. RELATED WORKS
In recent years, latent factor-based approaches have been popular for their efficiency in handling large-scale data sources. In order to model users' ratings for further prediction, they approximate the user-item rating matrix by singular value decomposition [2], [9]. It decomposes the rating matrix into two orthogonal matrices, which represent the latent factors of users and items, respectively. In training, the factors are often learnt via a gradient descent method. For an unpurchased item of a given user, the rating prediction is made by calculating the inner product of their latent factors. Salakhutdinov et al. [9] proposed Probabilistic Matrix Factorization (PMF) and introduced Gaussian priors as hyperparameters of the latent factors. They noted that maximizing the log-posterior of the ratings over users' and items' latent factors is equivalent to minimizing the squared error of the rating prediction. Hu et al. [10] applied the latent factor model to recommendation for implicit feedback datasets, which include indirectly reflected opinions such as the purchase history of products and the browsing history of web pages. In the optimization of the model, they set the derivatives with respect to the user's and item's latent factors to zero, and recalculate the closed-form expressions for them. Koren et al. [11] proposed a model for recommendation which integrates the latent factor model and the traditional neighborhood model of CF. In its rating prediction, it directly sums up the predictions of these two models. All these approaches ignore the reviews in users' feedbacks, which gives them scalability but mediocre accuracy in rating prediction.
To improve the performance of latent factor models, semantic analysis of the textual summary of an item has been introduced in many existing researches [5], [12], [13]. In the case of reviews in feedbacks, a common idea is to take the reviews correlated with an individual item as one summary. Wang et al. [5] proposed a model named Collaborative Topic Regression (CTR), which combines PMF and LDA for the recommendation of scientific articles. The latent topics of a given article are derived from its title and abstract. Their distribution, together with a set of parameters, forms the latent factors of the article. Purushotham et al. [14] pointed out that CTR performs poorly when the dataset contains few feedbacks, the so-called sparsity of data. To solve the problem, they integrated follower relationships among users as auxiliary information. The social network structure was transformed into a social matrix, and approximated by an additional MF model. On the other hand, Wang et al. [15] proposed Collaborative Deep Learning (CDL), a latent factor model based on CTR to improve its performance. Instead of LDA, they use a stacked denoising autoencoder to infer the distribution of latent topics for a scientific article. Different from other CTR-based methods, CDL directly uses the probabilistic distribution of topics as the latent factors of the given article. The idea of these CTR-based methods is to join the distribution of latent topics into the users' or items' latent factors, or to replace them. Since the topics are not derived from the individual review, they cannot perceive the unequal treatment of factors in the user's evaluation of a specific item. In contrast, we learn the topics of each review by LDA independently, and use them as the direction for updating the latent factors in their learning.
Another way to combine a latent factor model and a topic model is to define a transformation between the topic distribution of reviews and the latent factor vector of the corresponding item. McAuley et al. [6] proposed a latent factor-based model named Hidden Factors of Topics (HFT) which integrates LDA and MF. The two models are combined through a softmax function of the item's latent factors, which transforms them into the topic distribution of the correlated reviews. Based on HFT, Bao et al. [7] proposed TopicMF, in which they replace LDA with a Non-negative Matrix Factorization model. Not only the items' latent factors but also the users' factors are introduced into the softmax transformation. Therefore, the topic distribution no longer represents the topics in the reviews that correlate with a specific item, but those of a single review in a given feedback. Although HFT and TopicMF have been demonstrated to outperform traditional models such as MF, both of them suffer from the drawback of the complicated transformation between latent factors and topic distributions.

III. PRELIMINARIES

A. Problem Definition
The problem we study is to accurately predict the ratings of unpurchased items for users based on their history of feedbacks. Each feedback includes a numerical rating on a scale of [a, b] (e.g. ratings of one to five stars on Amazon) and a correlated textual review. Suppose the feedbacks involve I users and J items overall. The rating made by user u_i (i ∈ {1, ..., I}) to item v_j (j ∈ {1, ..., J}) is denoted as r_i,j. If r_i,j exists, it must have a correlated review d_i,j written by u_i. Therefore, a feedback is a 4-tuple ⟨u_i, v_j, r_i,j, d_i,j⟩. Note that for a given user u_i and his unpurchased item v_j, we only predict the unknown rating r̂_i,j, not the unknown review d_i,j.
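As a minimal illustration of the data layout (the names here are our own, not from the paper), the 4-tuple ⟨u_i, v_j, r_i,j, d_i,j⟩ can be sketched as:

```python
from typing import NamedTuple

class Feedback(NamedTuple):
    """One feedback <u_i, v_j, r_ij, d_ij> from the purchase history."""
    user: int      # index i of user u_i
    item: int      # index j of item v_j
    rating: float  # r_ij, on the scale [a, b] (e.g. 1..5 stars)
    review: str    # the correlated textual review d_ij

fb = Feedback(user=0, item=3, rating=4.0, review="great player, weak camera")
```

A missing rating has no associated `Feedback` at all; only r̂_i,j is predicted for such pairs.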

B. Matrix Factorization for Recommendation
Biased matrix factorization [2] is an influential method to predict the missing ratings in recommendation. It maps users and items into a joint latent factor space with K dimensions, where K is arbitrarily predefined. Accordingly, each user u_i is associated with a vector U_i ∈ R^K, whose elements measure his/her extent of interest in those factors. On the other hand, a vector V_j ∈ R^K is associated with a given item v_j, and represents the positive or negative extent to which v_j possesses those factors. The inner product of U_i and V_j represents the interaction of u_i and v_j, and approximates the corresponding rating r_i,j:

r̂_i,j = µ + b_i + b_j + U_i^T V_j,    (1)

where µ is the average of ratings over all users and items, and b_i and b_j denote the observed biases of user u_i and item v_j, respectively. Normally, the bias of a given user or item is calculated by subtracting µ from the average of the correlated ratings.
The objective is to learn U_i and V_j from a given training set of observed ratings by minimizing the regularized squared error:

min Σ_(i,j) c_i,j (r_i,j − r̂_i,j)² + λ (‖U_i‖² + ‖V_j‖² + b_i² + b_j²),    (2)

where λ is the parameter that controls the regularization to avoid over-fitting in learning, and ‖·‖ denotes the L2 norm. c_i,j is the confidence parameter of rating r_i,j, which indicates how much we trust it. A large c_i,j should be assigned to deliberate ratings, and a small c_i,j to those that do not deserve serious treatment, such as advertisements and fakes.
A typical way to minimize the objective function (2) is to use a gradient descent algorithm [2], [9]. It calculates the gradients of U_i and V_j for every given rating r_i,j as

g_U_i = −c_i,j (r_i,j − r̂_i,j) V_j + λ U_i,
g_V_j = −c_i,j (r_i,j − r̂_i,j) U_i + λ V_j,    (3)

and updates them in the inverse direction of the gradients iteratively. The updating step is usually uniform and controlled by a constant learning rate. Since a large learning rate causes divergence of the objective function and a small one may result in slow learning, it is crucial to find a proper learning rate [16].
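The update of equations (1)-(3) can be sketched as a stochastic gradient step; this is a minimal illustration (toy sizes and the constant learning rate are our own choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 3
U = rng.normal(0, 0.1, (I, K))   # user latent factors
V = rng.normal(0, 0.1, (J, K))   # item latent factors
b_u = np.zeros(I)                # user biases b_i
b_v = np.zeros(J)                # item biases b_j
mu = 3.5                         # global rating average

def sgd_step(i, j, r, lr=0.03, lam=0.01, c=1.0):
    """One stochastic update of biased MF for rating r_ij,
    moving opposite to the gradients of equation (3)."""
    pred = mu + b_u[i] + b_v[j] + U[i] @ V[j]   # equation (1)
    e = c * (r - pred)                          # weighted prediction error
    Ui = U[i].copy()                            # keep pre-update value for V's step
    U[i] += lr * (e * V[j] - lam * U[i])
    V[j] += lr * (e * Ui - lam * V[j])
    b_u[i] += lr * (e - lam * b_u[i])
    b_v[j] += lr * (e - lam * b_v[j])
    return (r - pred) ** 2

errs = [sgd_step(0, 1, 5.0) for _ in range(50)]
# repeated updates on one observed rating drive its squared error down
```

The uniform learning rate `lr` is exactly the constant step the text criticizes: every latent factor moves by the same scale regardless of what the review discussed.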

IV. PROPOSED METHOD
In this section, we present the proposed method, whose structure is shown in Fig. 2. Given a set of feedbacks, the first task is to derive the topics from each review. As pre-processing, we use LDA [8], a probabilistic generative latent topic model over a set of semantic documents called a corpus. Its idea is that each latent topic is characterized by a distribution over words, and a document is a random mixture over such topics. We take each review in a feedback as a single document, and all reviews as the corpus D. Assume that there are K topics overall in D, shared by all documents. A topic is denoted by t_k with k ∈ {1, ..., K}. For a review d_i,j ∈ D, its topic distribution is denoted by θ_i,j, which is a K-dimensional stochastic vector. Each element θ^k_i,j represents the proportion of the corresponding topic t_k mentioned in d_i,j. Following the method presented in [17], we independently train the LDA model on D and infer θ_i,j for each review d_i,j by Gibbs sampling.
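The Gibbs-sampling inference of θ_i,j can be sketched with a minimal collapsed Gibbs sampler; this is an illustrative simplification, not the exact procedure of [17] (hyperparameters, iteration counts and the toy corpus below are our own assumptions):

```python
import numpy as np

def lda_gibbs(docs, V, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (a sketch).
    docs: list of word-id lists (one list per review); V: vocabulary size;
    K: number of topics. Returns theta, one stochastic K-vector per review."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # document-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # per-topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):   # initialize counts from random assignments
        for n, w in enumerate(doc):
            k = z[d][n]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]          # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k          # resample topic and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    return theta

docs = [[0, 1, 0, 2], [3, 4, 3], [0, 2, 1]]   # toy word-id reviews
theta = lda_gibbs(docs, V=5, K=2, iters=50)
# each row of theta is the stochastic topic vector theta_ij of one review
```

Each row of `theta` plays the role of θ_i,j below: it sums to one, and its k-th entry is the estimated proportion of topic t_k in that review.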
Next, we use MF to model the ratings and further to predict the missing ones. The difficulty lies in linking the topic distributions of reviews to the latent factors without a complicated transformation between them. We propose a new first-order gradient descent method, named Topic Gradient Descent (TGD), to correlate them through the training process of MF. Since reviews provide an efficient tool for users to explain their ratings, important topics are often mentioned much in the reviews. Therefore, the topic distribution θ_i,j represents the degree of importance of the topics in the evaluation of user u_i for item v_j, rather than his/her preference for v_j. In other words, when θ^k_i,j = 0, t_k is not worth mentioning for u_i and has no impact on the evaluation of v_j. Assume that the number of latent factors equals the number of topics, and that topic t_k corresponds to and interprets the elements U^k_i and V^k_j of U_i and V_j. The key idea is to use θ^k_i,j to affect the learning of U^k_i and V^k_j in the training process of MF. More specifically, a given error of the rating prediction r_i,j − r̂_i,j is distributed over the factors as a linear combination with θ_i,j. With the gradients g_U_i and g_V_j in (3), we write the updating equations for U_i and V_j as

U_i ← U_i − γ H_i,j g_U_i,
V_j ← V_j − γ H_i,j g_V_j,    (4)

where γ is a predefined constant, and H_i,j is a K×K diagonal matrix with θ_i,j as its diagonal elements. H_i,j together with γ forms the learning rate, which assigns a different updating step to each latent factor. For topics which have high importance and generate much error, the corresponding latent factors are updated with large steps. In contrast, the factors of unimportant topics are updated with small steps in every epoch of training. When U_i and V_j are initialized with vectors of an extremely small constant, such factors remain close to their initial values and thus have little impact on the rating prediction.
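A single TGD update of equation (4) can be sketched as follows (a minimal illustration; the toy values of θ_i,j, γ and the initial factors are our own assumptions):

```python
import numpy as np

def tgd_step(U_i, V_j, theta_ij, r, r_hat, gamma, lam=0.01, c=1.0):
    """One Topic Gradient Descent update, equation (4).
    theta_ij: K-dim topic distribution of review d_ij; H_ij = diag(theta_ij)
    scales the step of each latent factor by its topic's importance."""
    e = c * (r - r_hat)
    g_U = -e * V_j + lam * U_i        # gradients as in equation (3)
    g_V = -e * U_i + lam * V_j
    step = gamma * theta_ij           # diagonal H_ij applied elementwise
    return U_i - step * g_U, V_j - step * g_V

theta = np.array([0.7, 0.3, 0.0])     # topic 3 is never mentioned in the review
U_i = np.full(3, 0.001)               # factors initialized to a small constant
V_j = np.full(3, 0.001)
U_new, V_new = tgd_step(U_i, V_j, theta, r=5.0, r_hat=3.6, gamma=1.2)
# the factor of the unmentioned topic keeps its initial value exactly
```

Because H_i,j is diagonal, the matrix product reduces to an elementwise scaling: a factor whose topic has proportion zero receives a zero step, which is the "unequal treatment" the text describes.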
Although we have correlated the latent factors with topics and realized their unequal treatment, a remaining issue is that the convergence of the objective function (2) may be slow. Since the average of θ^k_i,j is 1/K, the average updating step reduces to 1/K of that of the traditional gradient descent method. Let s ∈ [1, +∞) be the timestamp that represents the epochs of training. Following the idea of a previous effort [18], we introduce the timestamp into the learning rate. Instead of a constant, γ is re-defined as a function of the timestamp s:

γ = α · s^(−1/2),    (5)

where α is an arbitrary predefined constant. γ is inversely related to s, so that it decreases as s grows. Therefore, U_i and V_j are updated with large steps at the beginning of training, and slightly adjusted to find the most proper values at the end.
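The decay of equation (5) is straightforward to compute; the value α = 1.2 below is taken from the evaluation section, while the sampled epochs are our own choice:

```python
import math

def gamma(s, alpha=1.2):
    """Decaying learning rate of equation (5): gamma = alpha * s**(-1/2)."""
    return alpha / math.sqrt(s)

schedule = [gamma(s) for s in (1, 4, 16)]
# steps shrink as training proceeds: roughly 1.2, 0.6, 0.3
```

Large early steps compensate for the 1/K shrinkage introduced by H_i,j, while the 1/√s decay lets the factors settle in later epochs.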
We present the TGD method in Algorithm 1, where U^s_i and V^s_j denote the values of U_i and V_j in epoch s. Note that although the form of the update is similar to the second-order Newton's method, we only use first-order information of U_i and V_j.
Let |D| denote the number of reviews in corpus D. In each epoch, the time complexities of calculating the gradients and updating U_i and V_j are O(|D| · K). Assume that the objective function converges at epoch T. The time complexity of TGD is therefore O(T · |D| · K), the same as existing first-order methods.
With the MF model trained by TGD, for a given user u_i and an unpurchased item v_j, we calculate the rating prediction r̂_i,j following (1).

V. EVALUATION
In this section, we conduct the evaluation from three perspectives: 1) whether the proposed TGD method makes the objective function (2) converge rapidly; 2) how the parameters impact the performance of the proposed method; 3) how the proposed method performs compared with MF and a state-of-the-art model for recommendation.

A. Datasets and Implementation
In the evaluation, we use several datasets derived from YELP and Amazon [19]. They are filtered by the following constraints to keep the feedbacks such that: 1) each review has at least 10 words; 2) each user has at least 5 feedbacks; 3) each item is concerned with at least 5 feedbacks.
Additionally, since large datasets make the following comparison with existing methods [2], [7] time consuming, we cut each of them by the publishing date of the feedbacks. For the YELP challenge dataset, we only utilize the feedbacks from the states of Arizona and Nevada because of the sparsity of the data. Stop-word removal and stemming are also conducted for each review. After these processes, Table I shows the statistics of the datasets, including the number of users, items and feedbacks contained. The third and seventh columns show the average rating and the average number of words per review, respectively. The sparsity of a dataset is calculated as #feedbacks / (#users × #items).
For each dataset, we randomly take 80% of its feedbacks as the training set, and the rest as the testing set for the experiments.

B. Convergence of Topic Gradient Descent
For each training set, we train the proposed method and observe the sum of squared errors of rating prediction in each epoch. Considering the total number of reviews in the datasets, the parameters K and λ are fixed to 20 and 0.01, respectively. The latent factors in U_i and V_j are initialized to a uniform value of 0.001. As a comparison, we also train MF by the method presented in [2], with K and λ fixed to the same values as in the proposed method. Unlike the proposed method, the factors in U_i and V_j are initialized with randomly generated values following the zero-mean Gaussian distribution N(0, λ²). In order to guarantee fairness, we set the confidence parameter c_i,j to 1 if r_i,j exists, and 0 otherwise, for both the proposed method and MF.
As typical results, we show the first 500 epochs for the dataset Video Games, and 150 epochs for Movies and Videos, in Fig. 3. The parameter α in (5) is varied from 1.0 to 1.3, and the learning rate of MF is set to 0.03. For both datasets, MF reaches lower levels of squared error than the proposed method. For Video Games, the proposed method reduces the squared error more slowly than MF, which is the opposite of the result for Movies and Videos. Especially for Movies and Videos, α = 1.3 is not a proper assignment since the squared error diverges early. Considering that for a given feedback the updating steps of the latent factors depend on the topic distribution of its review, we calculated and observed the standard deviation (SD) of the topic distribution of each review. The average SD of Movies and Videos (0.067) is much higher than that of Video Games (0.029). This indicates that the speed of convergence depends on the dispersion of the topics' proportions in the reviews.

C. Impact of Parameters
Since the problem is to predict the ratings of users for their unpurchased items, the performance of the proposed method is evaluated by the accuracy of the predictions. For a given feedback in the testing set, we compare the rating prediction r̂_i,j with the actual rating r_i,j. For quantification, we use the mean absolute error (MAE) and the root mean square error (RMSE), calculated as follows:

MAE = (1/N) Σ |r_i,j − r̂_i,j|,
RMSE = √( (1/N) Σ (r_i,j − r̂_i,j)² ),

where N denotes the number of feedbacks in the testing set, and |·| denotes the absolute value. In general, RMSE is more sensitive than MAE to large prediction errors. The assignment of parameters and the initialization follow the previous experiment in Section V-B. For each training set, the proposed method is trained until the objective function converges. Fig. 4 shows the performance of the proposed method with α varied from 1.0 to 1.3. For Video Games, RMSE is stable in all cases of α. On the other hand, the RMSE of Movies and Videos is over 0.6 when α = 1.3, and reduces to roughly 0.3 when α ∈ {1.2, 1.1, 1.0}. Combined with the results of the previous experiment, this indicates that divergence of the objective function in learning further affects the performance. In other words, the performance of the proposed method is stable for small enough α. In order to avoid such an effect, we fix α to 1.2 in the following experiments.
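The two metrics can be sketched directly from their definitions (the toy ratings below are our own example, not data from the paper):

```python
import numpy as np

def mae_rmse(r_true, r_pred):
    """MAE and RMSE over the N test feedbacks, as defined in the text."""
    err = np.asarray(r_true, dtype=float) - np.asarray(r_pred, dtype=float)
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, rmse

mae, rmse = mae_rmse([5, 3, 4], [4.5, 3.5, 4.0])
# RMSE penalizes large individual errors more heavily than MAE
```

Squaring before averaging is what makes RMSE more sensitive to a single decisive failure of prediction than MAE.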
Fig. 5 shows the performance with K varied from 10 to 50. Recall that K denotes the number of overall topics, and also the dimension of U_i and V_j. For Video Games, MAE and RMSE vary in parallel with K. When K = 20 the proposed method has the best performance, and when K ≥ 40 the performance declines. In the case of Movies and Videos, although MAE is stable, a trough of RMSE is observed at K = 25. Therefore, in order to achieve the best performance, K should be set within a proper range which depends on the dataset; an assignment of too small or too large a value degrades the performance.

D. Performance in Recommendation
According to the previous experimental results, we set K to 20 and 40 to conduct a detailed evaluation of the rating prediction performance. Besides MF, we also implemented TopicMF [7], an extension of HFT [6], as a comparison. Following the setup of their experiments, we set λ = 1, c_i,j = 1 if r_i,j exists, and λ_u = λ_v = λ_B = 0.001. Since the training of TopicMF is time consuming (3 to 5 minutes per epoch for a training set with a scale of 1,000 reviews), we train it for 100 epochs and report its performance.
Table II summarizes the results on all datasets, with the best performance for each dataset emphasized in boldface. The last lines of the two tables present the averages of MAE and RMSE. The improvement of the proposed method is presented in the last four columns, both per dataset and for the average performance. When K = 20, the proposed method shows the best performance in terms of RMSE on 10 datasets. Compared with MF, the improvement of the proposed method is 3.77% in MAE and 5.82% in RMSE on average. This indicates that the proposed method is effective in reducing decisive failures of prediction. Especially on YELP, Movies and Videos, Video Games and Digital Music, the proposed method gains an improvement in MAE from 6.40% up to 9.03%. Referring also to Table I, the average number of words per review in these four datasets is more than 65 in each case, which suggests that their reviews are written in more detail than those of the other datasets. Therefore, the topics can be more clearly inferred, and the latent factors better trained in learning. On the other hand, the proposed method also outperforms TopicMF on 11 datasets, with average improvements of 4.87% in MAE and 3.99% in RMSE. When K = 40, the performance of the proposed method declines on most of the datasets, except Digital Music and Sports and Outdoors. This suggests that for such datasets, setting K to 40 prevents the topics from being clearly derived, which further affects the performance. In terms of RMSE, the improvement also drops to −0.52% compared with MF, and −3.40% compared with TopicMF, on average.
Additionally, we underline the best performance among the approaches across both cases of K for each dataset. For example, the proposed method obtains the smallest RMSE on the YELP dataset, which is underlined in the first table (K = 20). Overall, the proposed method obtains the best performance on 8 datasets in terms of RMSE, and on 7 datasets in terms of MAE. It is also observed that only two of these occur in the case of K = 40. Therefore, a proper assignment of K (20 for most of the datasets) guarantees that the proposed method gains better performance than the two existing methods.
In practical applications, if the predicted rating of an unpurchased item is high, the item may be recommended to the given user in the future. Therefore, we particularly evaluate the accuracy of predictions for the actual ratings with the highest score. Considering that both on YELP and Amazon a user evaluates an item with up to 5 stars, we take the feedbacks in the testing set with 5-star ratings as the objective ones. For such feedbacks, a prediction is successful if the predicted rating falls in [4.5, ∞). The precision is calculated as the proportion of successful predictions among the objective feedbacks.
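This precision measure can be sketched as follows (the function name and toy numbers are our own; the 5-star target and the [4.5, ∞) success range are from the text):

```python
def high_score_precision(ratings, predictions, top=5.0, threshold=4.5):
    """Precision on the highest-rated feedbacks: among test feedbacks
    whose actual rating equals `top`, the fraction whose predicted
    rating falls in [threshold, +inf)."""
    objective = [(r, p) for r, p in zip(ratings, predictions) if r == top]
    if not objective:
        return 0.0
    hits = sum(1 for _, p in objective if p >= threshold)
    return hits / len(objective)

prec = high_score_precision([5, 5, 4, 5], [4.7, 4.2, 4.6, 4.9])
# two of the three 5-star feedbacks are predicted at 4.5 or above
```

Note that the 4-star feedback is excluded entirely: only the actual 5-star ratings form the objective set, regardless of what was predicted for the others.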
Table III shows the precision of the proposed method and MF on each dataset with K = 20. For example, for the 5-star ratings in the YELP dataset, 55.5% of them are predicted in the range [4.5, ∞) by MF, and 59.4% by the proposed method. Across all datasets, the improvement of the proposed method is up to 6.977%. For Movies and Videos, since the precision of both MF and the proposed method is at a high level of more than 0.98, the improvement is correspondingly the slightest (0.334%). Also note that although the performance of the proposed method is worse than MF on Sports and Outdoors and Grocery and Gourmet Food (lines 9 and 10 in Table II), its precision is higher than MF's. This demonstrates that the proposed method has higher accuracy than MF in the prediction of such highest ratings.

VI. CONCLUSION
In this paper, we proposed a new method to predict ratings for recommendation, including a topic gradient descent method (TGD) for the MF model. From the given textual reviews, their topics are derived by the Latent Dirichlet Allocation model. Using such topics, in the learning of the proposed method the latent factors of the users and items are iteratively updated with dynamically assigned updating steps. In the evaluation, we conducted a series of experiments utilizing 11 datasets, including the YELP challenge dataset and per-category Amazon reviews. Firstly, the experimental results verified that TGD certainly converges the squared error of the rating prediction. Secondly, they also show that the proposed method outperforms MF in recommendation: the accuracy of rating prediction improves by up to 12.23% in terms of RMSE, and by 5.82% on average over all datasets. Compared with TopicMF, a state-of-the-art model for recommendation, it also achieves superior performance. Finally, the proposed method is demonstrated to have higher accuracy than MF in the prediction of high-scored ratings, which is an ordinary scenario in recommendation.
In the future, we intend to develop a mechanism to automatically search for the proper assignment of parameters for a given dataset. On the other hand, we hope to evaluate the ability of the learnt latent factors and derived topics to explain the predicted ratings. Beyond MF, we also plan to apply the proposed TGD method to tensor factorization, extending it as an optimizer for general latent factor-based models.

Fig. 3. The squared error of rating prediction in the training of MF using the proposed TGD and the existing method. α is fixed to 1.0, 1.1, 1.2 and 1.3. For the existing method, the learning rate is set to a general value of 0.03.

Fig. 4. The performance on Video Games and Movies and Videos in terms of RMSE with α fixed to 1.3, 1.2, 1.1 and 1.0.

Fig. 5. The performance on Video Games and Movies and Videos with K varied from 10 to 50.
Algorithm 1 Topic Gradient Descent
Require: θ_i,j for each d_i,j ∈ D
Initialize U_i and V_j with vectors of a uniform small value, set α to a constant and s = 1.
while the objective function (2) has not converged do
    γ ← α · s^(−1/2)
    for d_i,j ∈ D do
        Compute gradients g_U_i and g_V_j
        Update U^(s+1)_i ← U^s_i − γ H_i,j g_U_i
        Update V^(s+1)_j ← V^s_j − γ H_i,j g_V_j
    end for
    s ← s + 1
end while

TABLE I. THE DESCRIPTION OF DATASETS USED IN EXPERIMENTS

TABLE II. THE PERFORMANCE IN TERMS OF MAE AND RMSE OF MF, TOPICMF AND THE PROPOSED METHOD ON ALL DATASETS

TABLE III. THE PRECISION OF THE PROPOSED METHOD AND MF IN PREDICTION OF 5-STAR RATINGS