Simulated Annealing with Levy Distribution for Fast Matrix Factorization-Based Collaborative Filtering

Matrix factorization is one of the best approaches for collaborative filtering, because of its high accuracy in presenting users and items latent factors. The main disadvantages of matrix factorization are its complexity, and being very hard to be parallelized, specially with very large matrices. In this paper, we introduce a new method for collaborative filtering based on Matrix Factorization by combining simulated annealing with levy distribution. By using this method, good solutions are achieved in acceptable time with low computations, compared to other methods like stochastic gradient descent, alternating least squares, and weighted non-negative matrix factorization.


INTRODUCTION
The objective of Recommender Systems is to recommend new products or items for users based on their history [1]. There are two major approaches to create a Recommender System. The first one is the Content Filtering or (Content Based ). This approach tries to create a profile for each user and item, and then tries to match these profiles [3]. The second approach is the Collaborative Filtering. It uses the rating history of users and items, and creates a large sparse matrix called Rating Matrix. This matrix usually contains ratings from 1 to 5. The 0's are for the incomplete ratings. The objective of the Collaborative Filtering is to predict these missing ratings. One of the most successful method for Collaborative Filtering is the Latent Factor Model [3]. This method tries to learn the latent features of each user and item in a fixed number of dimensions. Then represent each of them in a latent feature vector, that can be used to predict the incomplete ratings, or measure the similarity.
Matrix Factorization is one of the best techniques used for Latent Factor Model. The basic idea is to construct the low-dimensional matrices to approximate the original rating matrix [3] [7] [12] [13], where R M,N is the rating matrix, U M,K is the users matrix, I K,N is the items matrix. M and N are the number of users and items respectively, K is the number of latent feature that represent each user and item. Where K min(M, N ). Row m in matrix U represents user number m in the rating matrix, whereas column n in matrix I represents item number n in the rating matrix. So, in the rating matrix, the rating of user m for item n can be calculated by the dot product of arXiv:1708.02867v1 [cs.
LG] 9 Aug 2017 row m of matrix U by the column n of matrix I.
The rating matrix is very sparse, because it contains a few users ratings. The objective of this paper is to use the known ratings to construct the low-rank matrices, to predict the unknown or incomplete ratings.
One of the most common evaluation metrics for Collaborative Filtering is RMSE (root mean squared error). We calculate RMSE only for the known rating using the following equation: There is a lot of work done on Matrix Factorization and Collaborative Filtering. Here we discus three of the most popular methods.

Stochastic Gradient Descent (SGD)
SGD is one of the popular Matrix Factorization methods [3]. The idea is to minimize the following cost equation: where λ is a regularization term, µ is the overall average rating, b m and b n are the user and item bias respectively.
To minimize the squared-error equation (4), the algorithm iterates over all ratings in the training set. Then, it computes the associated prediction error in equation (5). Next, the error value is used to compute the gradient. The algorithm finally uses the gradient to update user bias, item bias, user matrix U , and item matrix I.
u m = u m + γ(e m,n · i n − λ · u m ) (8) where γ is the learning rate. The learning rate determines the moving speed towards the optimal solution. If γ is very large, we might skip the optimal solution. If it is too small, we may need too many iterations to reach the optimal solution. So using an appropriate γ is very important.

Alternating Least Squares (ALS)
ALS is very good for parallelization [11]. When we have large data, and need to distribute the computations over cluster of nodes. ALS objective is to minimize the following equation: (10) It is the same like equation (4), but without the bias terms. the basic idea can be summarized as follows: 1. Initialize U , and I matrices.
3. Fix U , solve for I by minimizing equation (10). 4. Repeat the previous two steps until converging or reaching the max iteration.
to solve the user U and item I matrices we use the following two equations respectively: where Eye is the Identity matrix.

Weighted Non-Negative Matrix Factorization (WNMF)
Here we present a special type of Matrix Factorization called Non-Negative Matrix Factorization (NMF). The only difference is the non-negativity constraint for the input matrix, and the low rank matrices as well. The problem can be formulated as an optimization problem: where V is the original matrix, W andH are the two factorized matrices. One of the most simplest methods for NMF is Multiplicative Update Rules [12]. It is a good compromise between speed and ease of implementation. So NMF objective equation (13) can be optimized using the following update rules: The original version of Multiplicative Update Rules will not fit in our problem. The two rules will not be able to differentiate between the true ratings, and the incomplete ratings. So we need to modify the original rules, to be able to learn from the true ratings, then predict the incomplete. In [10] It is similar to equations (4), (10), but without the bias b or regularization λ. Now we can optimize function (16) using the following two rules: where W M,N is a matrix which its elements are equal to 1 if the corresponding entry in R is known rating, and 0 otherwise. ( * ) denotes to the element wise multiplication.
In this paper we focus on efficiency more than effectiveness. We assume that we have a very large data, and limited time. So we need an acceptable solution in reasonable time. So we chose the Metaheuristic algorithms for this problem, because of its ability to scape from the local optimal, and reaching good solutions in reasonable time. We used Simulated Annealing algorithm, with Levy Flight as a random walk operator.
The rest of the paper is organized as follows: in section (2) we briefly describe the prerequisite topics that are needed before going through the proposed method. In section (3) we describe our proposed method. In section (4) we discuss the experimental results, effect of each parameter, and compare our proposed method against others. Finally section (5) concludes the paper.

Metaheuristic Optimization
There are two types of optimization algorithms, Deterministic and Stochastic. Deterministic algorithms usually focus on optimal solution, like Simplex method in linear programming, some of these algorithms use the derivative of the objective function, these algorithms are called gradient based algorithms.
In Stochastic Optimization we will talk about Metaheuristic Algorithms, we can divide Metaheuristic into two parts, META and HEURISTIC, META means "beyond" or "higher level", and HEURISTIC means "to find" or "to discover by trial and error". This type of algorithms depends on randomization and local search to find the optimal solution iteratively, whereas each iteration tries to improve the current solutions from previous iteration. Also Metaheuristic doesn't guarantee the optimal solution, but it gives good quality solutions in a reasonable time. Metaheuristic achieves its goal by making a good balance be-tween two major components, intensification and diversification. Intensification is to search for a better solution within the local area of the current solution. Diversification is to use the randomization to escape from the local optimum, and explore all the search space [4].
There are many types of Metaheuristic algorithms, like single solution, or population based, in this paper we use Single Solution.

Levy Distribution and Random Walk
We presented randomization techniques for exploring the search space (Diversification), local search for optimizing the current solution, and searching within the local area of it (Intensification). In (19) x t is the current solution state, s is a new step or random number drawn from a probability distribution, we add s to x to move it from state t to t + 1.
Levy Flights are a random walk that their steps are drawn from Levy Distribution. Mantegna algorithm is the best and easiest way to generate random numbers from Levy Distribution [4] [5], so the random walk can be achieved using the following equations: Where α is the step size.

Simulated Annealing
SA is one of the most popular metaheuristic algorithms. It simulates the annealing process for solids by cooling to reach the crystal state. Reaching the crystal state is like reaching the global optimum in optimization. It is a single solution algorithm. The basic idea is to perform a random walk, but with some probability called Transition Probability that may accept new solutions that do not improve the objective function, see equation (25). Accepting bad solutions with Transition Probability gives more exploration for the search space (Diversification). Transition Probability decreases gradually during the iterations to decrease the Diversification and increase the Intensification. This means that the algorithm will end up with accepting only better solutions [4] [6].
In (25) f is the difference between the two evaluation function values of the current solution and the new one. T is the current temperature which is decreased iteratively by the cooling rate. r is a random number. So the algorithm accepts bad solution if the Transition Probability p is greater than r.
One of the common cooling schedules is linear cooling schedule, in (26) T 0 is the initial temperature, β is the cooling rate, t is the pseudo time for iterations. The following pseudo code demonstrate the basic implementation of Simulated Annealing.

THE PROPOSED METHOD
In this section we introduce our new method for solving Matrix Factorization. In our method we use Simulated Annealing based on Levy Flight as a random walk, instead of Gaussian distribution. See section.(2.1). We called it SA-Levy. We choose Simulated Annealing because its low computations, as it is a single solution Metaheuristic Move randomly to a new location: x t+1 = x t + (random walk) 8: Accept the new solution if better 10: if not improved then Update the best x and f 15: t = t + 1 16: end while algorithm. So it will be very fast compared to other population based Metaheuristic algorithms. Also compared to the state of art methods like ALS, and WNMF Simulated Annealing is much faster, because it needs only one matrix multiplication per iteration. Regarding SGD, Simulated Annealing is easier to be parallelized.
In Simulated Annealing we just need to represent the solution, and implement the evaluation function to compare the current solution against others. We use RMSE as an evaluation function. Figure 1 shows the representation of the solution, we put the two matrices users and items in one matrix to simplify the solution and the calculation [8], the number of rows is equal the number of users M plus the number of items N , and the number of columns is equal the number of latent features K.

EXPERIMENTAL RESULTS
In this section we show the effect of the parameters on the RMSE results, we uses MovieLens 1M dataset [9] in our experiments, 80% of the dataset is used for training and 20% for testing. Tables 1 2 3 show how RMSE can be affected by the number of iteration, number of latent features, and step size. see equation (20). In table 1 we can see that good RMSE can be achieved by few number of iterations, so there is no need for many iteration to converge. In table 2 we can see that best RMSE can be achieved starting from 20 latent features. In table 3 we found that the best value for the step size is 0.01. We can say that the step size is the most important parameter in our method. It manages the balance between Intensification and Diversification, see section (2.1). Small values of step size give more Intensification, and and large values give more Diversification. Table 4 shows the difference between using Gaussian distribution and Levy distribution as a random walk. Levy distribution outperform Gaussian because of it's ability to escape from the local minimum [4] [5]. Table 5 shows that SA-Levy can be outperformed by other methods in terms of effectiveness. But SA-Levy can outperform all other methods in terms efficiency, because of its low computations, where it needs only one matrix multiplication in

Choice of Parameters
We conducted these experiments using Simanneal. It is a python module for simulated annealing optimization 1 , also the project source code can be found here 2 . Based on the MovieLens 1M dataset [9] we found that the best parameters are 10 Iteration, 20 latent features, 0.01 step size. For the temperature parameter we found that the best values for maximum and minimum temperature are 25000 and 2.5 respectively. To focus more