A Proposed Method to Solve Cold Start Problem using Fuzzy user-based Clustering

As almost everything becomes accessible online, many systems and algorithms must be revised to keep pace with trends in social networks. One such system, the recommender system, has become very important for social networks, where large volumes of data such as movies, books and research articles are uploaded to the internet. Recommendation not only connects the operators of a social network; it also gives users references for assessing other users, which affects their social relations directly or indirectly. Collaborative filtering (CF) is the technique of recommending items that match a user's taste by drawing on the collaboration of other users, and it is used mostly by social networking sites. The technique is now common for recommending data to users, and it motivates researchers to find more effective systems and algorithms so that users can be satisfied with recommendations based on their search history. This paper proposes a CF model based on the user's truthfulness information, to which FCM (Fuzzy C-means) clustering is applied. The fuzzy truthfulness information of the user is combined with other users' ratings of the content to produce a recommender similarity formula with a combination coefficient and new parameters. The MovieLens data set is used in the study, and the results show a significant improvement in recommendation under the cold start condition.
Keywords: Recommender system; collaborative filtering; cold start problem; clustering; user-based clustering


I. INTRODUCTION
Commercial e-business and e-commerce websites and social networking websites can be regarded as instances of recommender systems. These sites use recommender systems to suggest similar articles and content to their viewers. A recommender system filters the information available on the internet about particular content in order to suggest that content to the user. The idea is that content rated or watched frequently by similar users should be recommended to other users with the same taste. The historical interests of the user play an important role in recommending content based on the ratings and preferences of other users. Recommender systems are used in a variety of areas, such as recommending videos on YouTube, shopping products on Amazon and similar apps on the Play Store. Several approaches are used to recommend items to the user, such as collaborative filtering, content-based recommendation, personality-based recommendation and knowledge-based recommendation. The choice of approach depends on the needs of the platform. A recommender system is built from several components. The first is the data collector, which gathers relevant data available on the internet based on the search keyword or on other web analytics tools. The second is the data processor, which prepares the collected data for the third component, the recommender model. The output of the model is then processed for the user interface, and the final component post-processes the recommendations produced by the model. All of these components take their input from the first cue that the user gives on the platform [1] [2].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 2, 2020, www.ijacsa.thesai.org
Such tools recommend content to a user or a group of users and help them explore content they are likely to enjoy, drawing on the interests inferred from their search history and on the items rated or watched by other users in the group. Online item recommendation is achieved with the collaborative filtering technique. Past activities of the user and of other users, such as making a purchase or selecting an item, are assigned numerical values called item ratings, which help the collaborative recommender model to filter and suggest content [3]. This technique has been used in many applications and platforms and has shown very efficient results in certain environments. As its first step, a collaborative filtering recommender system uses the ratings of an item to judge the similarity between the users who rated that item. The system works on the assumption that users who agreed on items in the past will agree again: if a particular user likes some content, that content is likely to be liked by users with the same interests. Hence, the items recommended to other users are based on the computed similarity between users and on the ratings of the content by other users. Many statistical similarity measures can be used to determine the similarity between users' ratings. Pearson's correlation is one such measure; its value lies between -1 and 1 and expresses the extent to which two variables are linearly related. Similarly, cosine similarity measures the similarity between two vectors as the cosine of the angle between them in an inner product space. The two measures are related through mean subtraction, but they differ in the coefficients they produce [4].
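The relation between the two measures can be illustrated with a minimal Python sketch. It assumes that the ratings of two users are given as equal-length lists over the same items; the vectors `ratings_a` and `ratings_b` are made-up values, not taken from the data set. Under these assumptions, the Pearson correlation is computed as the cosine similarity of the mean-centered vectors.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # undefined for a zero vector; 0.0 is a convention chosen here
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical ratings of four items by two users.
ratings_a = [5, 3, 4, 4]
ratings_b = [3, 1, 2, 3]
print(round(cosine(ratings_a, ratings_b), 4))
print(round(pearson(ratings_a, ratings_b), 4))
```

The two printed values differ because cosine similarity works on the raw ratings, while Pearson first removes each user's mean rating.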
The most common challenge in generating top recommendations is the cold start problem. Researchers have proposed different methodologies to address it. The problem occurs when little data is available for a particular item or a particular user. In essence, an information system must filter the data available on the internet using data such as item ratings and user interests, and when this data is insufficient the recommender model cannot infer any result. In this study, cold start refers to inadequate interaction between users and the recommender model, which degrades the quality of recommendations significantly. In addition, this study focuses on the user-item ratings used to find the nearest neighbors.
The problem is addressed here using a collaborative filtering technique. As mentioned earlier, collaborative filtering takes two forms: item-based filtering, which recommends items from the same user's actions on similar items, and user-based filtering, which recommends items from the actions of other users. Both techniques are very effective in practice, but in a dynamic environment of users or groups of users they are hard to scale unless the items do not change much. However, the item-based approach is easy to compute offline and does not require re-training, which is why the KNN technique is a suitable choice for the recommendation model. It also provides a foundation for the development of recommender systems. KNN is a lazy, non-parametric learning method: results for new queries are compiled from a database of separated data points that are grouped by proximity. It is a popular technique because of its simplicity; despite this simplicity, it can outperform more complex models, especially in economic forecasting and the related demographics, data compression, etc. In a recommender system, the KNN collaborative filtering algorithm searches the whole user-item space to determine the neighbors of a user. For this reason it is difficult to provide recommendations in real time, since searching through a large number of users is time consuming.

A. Literature Review
To overcome the cold user (cold start) issue in existing techniques such as CF, many researchers have proposed models, architectures and frameworks. Before conducting this study, around 20 studies that identified and addressed the same problem were reviewed. These studies discuss the problem and propose solutions to overcome it. The solutions are reviewed here in detail and grouped according to their nature, in categories such as cross-domain and social network data systems, association rules applied to the user profile, similarities in ratings and demographics, behavior and historical interests of users, deep learning strategies, ICHM, CBCF, OWA and CART.
In 2015, one study addressed the cold start problem while generating recommendations for cross-selling items and products on large e-commerce websites such as Amazon and eBay [5]. It notes that the cold start problem hinders the recommender system when there is insufficient information about the user or too few ratings of a particular item. According to this study, knowledge-based links among large e-commerce platforms, together with sharing and transferring knowledge about users and items across domains, can mitigate the cold start problem. A cross-domain recommender system addresses cold start by sharing and transferring information about the item or user from other similar platforms. This solution has been implemented and can be seen in many applications of renowned tourism organizations. An effective application of the cross-domain recommender system is suggesting tourist places that a particular user has not yet explored, based on pictures uploaded and tagged by other tourists on social networking websites [6] [7,8].
In this regard, researchers developed an application that uses cross-domain knowledge sharing to suggest travel destinations based on the tourist's travel interests, on geotagged pictures uploaded to social media and on the weather forecast [8]. The study notes that the problem arises when a new tourist goes on a trip. Recommendations for the new tourist are then generated from trips that other tourists planned for the last location. When the interests of a particular user are matched against similar social media profiles of other users, some irrelevant recommendations are generated, which leads to incorrect results and reduces the efficiency of the whole recommender system. Shapira, Rokach, and Freilikhman [9] discussed the cold start problem and suggested a solution. They suggest that the recommender system can collect data from the user's Facebook profile and filter out the data related to the particular domain. If there is insufficient information about the user's interests in that domain, the system may extract information from the user's behavior and from the reactions the user posted on friends' profiles. The study suggests a cross-domain learning method, k-nearest neighbor with source aggregation, for collecting information about the new user in the specific domain. The authors concluded that this aggregation-based k-NN method significantly improves the recommended results when the content that users reacted to is not too sparse.
This implies that the functionality of the cross-domain model depends on data set density, which is the significant variable for resolving the cold start problem.
Sobhanam concluded that the cold start issue can be addressed by combining association rules and clustering [10]. Association rules are applied to the profile of the new user to generate a new, enriched profile; a profile based on a frequency pattern taxonomy also aids in generating these user profiles. This technique can generate the top-N recommendations for the new user.
Park addressed the same issue in a different domain [11]. This is a relatively old study, but it addresses cold start using the filterbots algorithm. In the naïve filterbot algorithm, the system is infused with bots, i.e. hypothetical users. The algorithm collects information about the rating of the item (here an item can also be a user). The model extracts the average rating for the item or user based on the attributes or associations of the specific user or item. It also calculates the similarity between users using their demographics and determines the average rating. The system treats the infused bots and pseudo users in the user matrix as additional users with actual item ratings and applies collaborative filtering algorithms to generate recommendations. Item-based and user-based algorithms are used to predict the interests of the user, and the interest similarity between users a and b is determined with the Pearson correlation, which is given as:

$$ sim(a,b) = \frac{\sum_{i \in I_a \cap I_b} (r_{a,i} - \bar{r}_a)(r_{b,i} - \bar{r}_b)}{\sqrt{\sum_{i \in I_a \cap I_b} (r_{a,i} - \bar{r}_a)^2}\,\sqrt{\sum_{i \in I_a \cap I_b} (r_{b,i} - \bar{r}_b)^2}} $$

where $(r_{a,i} - \bar{r}_a)$ and $(r_{b,i} - \bar{r}_b)$ are the differences between the rating of item i by users a and b and their respective average ratings, and $I_a \cap I_b$ is the set of items rated by both users a and b.
Besides the naïve filterbot model, another approach for recommending items to new users based on demographic similarity and user interest similarity is the triadic aspect model. Some models are discussed in more detail below as part of the literature review.
Lam and his team proposed a triadic aspect model that predicts item ratings using information about similar users as its input [12]. It extracts demographic information about the user, such as gender, age and work. The aspect model was suggested by Hofmann for analyzing two-mode and co-occurrence data and has applications in machine learning, information retrieval and information filtering. The model predicts the interests of users from the triadic aspects of age, gender and work: the interests of existing users are extracted and used to generate recommendations for new users with similar demographics and triadic features. The study concluded that the results are satisfactory when the model is applied to about 280 different types of new users, but that on relatively larger data sets the results are not satisfactory and the recommendations generated for new users to address the cold start problem are not precise.
One research paper described predicting users' interests and behavior in order to generate recommendations for them as a form of web intelligence system [13]. The study addresses the cold start problem by applying clustering techniques within the item-based collaborative filtering framework and by integrating content information into collaborative filtering. This results in a hybrid approach to the cold start problem, the Item-based Clustering Hybrid Method (ICHM). The features of the content information are clustered, and the clusters complement the user rating preferences. The study suggests a statistical approach that linearly combines the two similarities: it uses the cosine measure for the clustered content and the Pearson measure for the user ratings. The experimental results led to the conclusion that data sparsity is addressed effectively and that the recommendations improve significantly [14]. This was not the only hybrid model; a number of studies suggested hybrid models, for example one in which collaborative filtering, content-based and demographic-based models were combined to address the cold start problem [15].
Meng used collaborative filtering in a different way to improve performance under cold start [16]. The user's interests are extracted by walking through the interest history and the cognitive similarity among like users, and a social sub-community division is then implemented using the well-known Pearson similarity measure and a clustering approach based on K-means. Recommendations were generated by building a CART based on the user's static information and the distinct groups of like users.
Sakarkar and Deshpande used a distinct approach to collect information about the user. In their study [17], the past educational data of the user was subjected to k-means clustering. The approach somewhat resembles the triadic model, but it uses three different features as input to the recommendation model: besides educational data, current professional experience and information about the parents are used. The new user is then classified by a k-nearest neighbor model based on his attendance in the computer-based assessment. The experiment was carried out on the IMSAA online real dataset along with the MovieLens 100K dataset. This technique directly and effectively addressed the cold start for new users.
Jazayeriy and his team used item ratings to mitigate the cold start issue for new users [18]. The technique is based on the existing category of the item: it determines the average ranking of each item in the category and compares it against the user's average rating to improve the ranking of the recommended items. It showed an acceptable improvement when tested on the MovieLens 100K, 1M and 10M datasets.

B. Existing Recommender Algorithms
Researchers have addressed the cold start problem in multiple ways; it is established by now that its solution lies in the effectiveness and efficiency of the algorithms used to determine the input for the recommender model. This leads the study to infer that the two most commonly proposed models are collaborative filtering and content-based methods. In the former, the new user receives recommendations based on how much other similar users with like preferences liked the item; feature-based analysis of the users is not mandatory. In the latter, items are analyzed by content-based methods, or the features of the users are taken into account, to extract items that match the interests of the new user. This study, however, considers the distinctive cold start situation in which the new user arrives without any previous ratings. In that situation both collaborative filtering and content-based methods perform with limited accuracy and address the issue less effectively. Recommendations will then not be generated for the cold user, which affects the efficiency of the algorithms. The number of objects in the system becomes much larger than the number of objects the user has rated so far, especially when data is sparse, and this intensifies the cold start problem [19].
This situation gives rise to two issues: the cold start user and the cold start item. The former arises from insufficient rating activity by the new user, which makes it difficult for the recommender system to find similarities among users. The latter arises from a lack of ratings on items from other users, which makes those items less likely to be picked up by the recommender system. Given the scale of this research, the model-based CF approach is not pursued, since it uses deep learning and machine learning algorithms to predict the ratings of items that users have not rated adequately. There are a number of collaborative filtering algorithms based on Bayesian networks, semantic models and clustering models. Markov decision processes, singular value decomposition and multiple multiplicative factor models are examples of model-based collaborative filtering [20].
Collaborative filtering with such algorithms first builds a model of the user's ratings in order to recommend items. Algorithms in this group take a probabilistic view of the collaborative filtering process: the prediction for a user is computed as the expected value of the rating, given the ratings of that user on other items.
However, the scope of this study requires a memory-based model, since the contribution of the study is an effective way of determining the similarity between users and among items. Memory-based collaborative filtering determines the list of items for the target users based on similar behavior among users. It can be discussed from both item-based and user-based perspectives. In line with the model proposed in this study, user-based collaborative filtering is discussed in the following section.

C. User based Approach
User-based collaborative filtering predicts the rating of an item from the ratings it received from multiple users; for this purpose the similarity among these users is calculated. The rating of the same item by the target user can then be predicted by computation. As discussed earlier, the Pearson correlation coefficient gives the association between users "u" and "v", as shown in eq. 1:

$$ sim(u,v) = \frac{\sum_{i \in I} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I} (r_{v,i} - \bar{r}_v)^2}} \qquad (1) $$

where $i \in I$ and I is the set of all items co-rated by users u and v, $r_{u,i}$ is the rating of item "i" by user "u", and $\bar{r}_u$ and $\bar{r}_v$ are the mean ratings of users "u" and "v".
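The user-based Pearson similarity can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each user's ratings are held in a dictionary mapping a hypothetical item id to a rating, the mean is taken over all items the user has rated, and users with no co-rated items are assigned a similarity of 0.0.

```python
import math


def pearson_users(ratings_u, ratings_v):
    """Pearson similarity of two users over their co-rated items.

    ratings_u, ratings_v: dict mapping item id -> rating (assumed layout).
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no co-rated items: similarity is taken as zero
    mean_u = sum(ratings_u.values()) / len(ratings_u)
    mean_v = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # a user with constant ratings carries no correlation signal
    return num / (den_u * den_v)


# Hypothetical users: the first two share three movies, the third shares none.
user_a = {"m1": 5, "m2": 3, "m3": 4}
user_b = {"m1": 4, "m2": 2, "m3": 3}
user_c = {"m7": 4}
print(pearson_users(user_a, user_b))
print(pearson_users(user_a, user_c))
```

The second call shows why sparse data is a problem for this measure: two users without any co-rated item always receive a similarity of zero, regardless of how alike their tastes may be.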

D. Prediction Computation
The prediction for the active user on a specific item is obtained with eq. 2:

$$ p_{a,i} = \bar{r}_a + \frac{\sum_{u \in U} sim(a,u)\,(r_{u,i} - \bar{r}_u)}{\sum_{u \in U} |sim(a,u)|} \qquad (2) $$

That is, the ratings of item "i" by the neighbors are weighted by the similarity between users "a" and "u", the weighted sum is normalized by the sum of the weights, and the result is added to the mean rating of the active user "a" to give the predicted rating of item "i" for "a". Here user "u" belongs to the neighbor set "U" of users with a high similarity to the active user, and sim(a,u) is the similarity weight between the active user "a" and that neighbor.
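The prediction step can be sketched as a short Python function. It assumes the mean-centered weighted average that is commonly used in user-based collaborative filtering: the active user's mean rating plus the similarity-weighted deviations of the neighbors' ratings, normalized by the sum of absolute similarities. The function name, the tuple layout and the fallback to the user's mean are illustrative assumptions.

```python
def predict_rating(active_mean, neighbors):
    """Predict the active user's rating of one item.

    active_mean: mean rating of the active user.
    neighbors: list of (similarity, neighbor_rating_of_item, neighbor_mean)
               tuples, one per neighbor who rated the item (assumed layout).
    """
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(abs(sim) for sim, _, _ in neighbors)
    if denominator == 0:
        # No usable neighbor: fall back to the user's own mean rating.
        return active_mean
    return active_mean + numerator / denominator


# Hypothetical neighbors: one rated the item above, one below, their own mean.
neighbors = [(0.9, 5, 4.0), (0.5, 2, 3.0)]
print(round(predict_rating(3.5, neighbors), 3))
```

A cold start user with no overlapping neighbors falls into the zero-denominator branch, so every item receives the same predicted value and no meaningful ranking is possible.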

E. Fuzzy Clustering and C-means Algorithms
To predict the rating of an item for the target user, the similarity of ratings from other similar users is computed using statistical techniques. For this purpose, this study proposes clustering, which groups similar objects hidden in multidimensional, multivariate datasets and partitions unordered items [21]. With this technique, scattered and uncategorized data can be gathered into clusters of similar items. Clustering is widely used in machine learning, so this study relies on it to analyze the closeness and likeness of users. The clustering technique used in the present study is mainly the fuzzy algorithm. Fuzzy clustering is also known as soft clustering: unlike in hard clustering, data points do not belong strictly to one distinct cluster but can belong to more than one cluster at a time [22]. This is the main reason for using soft clustering in this study, because the ratings of an item may not come only from similar users; the item may sometimes be rated by a unique user or by a user from a different cluster. Fuzzy clustering becomes especially significant when clusters have no distinguishable boundary and overlap with neighboring clusters. With this technique, the proposed recommender system does not fall short when exact similarity within a cluster is unavailable, particularly with regard to the membership function. Fuzzy clustering is achieved with the Fuzzy C-means algorithm, which identifies fuzzy clusters by minimizing a cost function [23].

F. Clustering with Fuzzy C-Mean
The Fuzzy C-means algorithm is very similar to the K-means algorithm and is one of the most widely used fuzzy clustering algorithms. The algorithm works with a chosen number of clusters rather than a single one. Each data point is initially assigned random coefficients for being in the respective clusters. The algorithm iterates until it converges, that is, until the change in the coefficients between two iterations is no more than a given threshold ϵ. In this study the algorithm determines the clusters for the set of data points "x". It partitions a finite collection of n elements X = {x1, ..., xn} into a collection of c fuzzy clusters according to a given criterion. For a given finite data set, the algorithm returns a list of c cluster centers C = {c1, ..., cc} and a partition matrix whose entries in [0, 1] give the degree to which each element belongs to each cluster. For a set of data points $x_j \in R^d$, j = 1, ..., N, the partition is found by minimizing the cost function in eq. 3:

$$ J(\mu, M) = \sum_{i=1}^{c} \sum_{j=1}^{N} \mu_{i,j}^{m}\, D_{i,j}^{2} \qquad (3) $$

where $\mu = [\mu_{i,j}]$ is the fuzzy partition matrix, i denotes the cluster, j denotes the object, $\mu_{i,j}$ is the fuzzy membership of object j in cluster i, $M = [m_1, ..., m_c]$ is the matrix of cluster means (centers), and the fuzzification parameter m lies in the range [1, ∞).
The distance between $x_j$ and $m_i$ is denoted by $D_{i,j} = D(x_j, m_i)$.
Following the formulation above, the algorithm proceeds as follows:
Step 1 Define the values of m, c and ϵ, set the step counter t = 0 and initialize the mean matrix $M = [m_1, ..., m_c]$.
Step 2 Compute (when t = 0) or update (when t > 0) the membership matrix μ using:

$$ \mu_{i,j}^{(t+1)} = \frac{1}{\sum_{l=1}^{c} \left( D_{i,j} / D_{l,j} \right)^{2/(m-1)}} $$

for i = 1, ..., c and j = 1, ..., N.
Step 3 With the new values of $\mu_{i,j}$ from the preceding step, update the mean matrix M using:

$$ m_i^{(t+1)} = \frac{\sum_{j=1}^{N} \left( \mu_{i,j}^{(t+1)} \right)^{m} x_j}{\sum_{j=1}^{N} \left( \mu_{i,j}^{(t+1)} \right)^{m}} $$

Step 4 Repeat steps 2 and 3 until the function converges and the minimum cost is reached, i.e. until the change in the mean matrix, $\| M^{(t+1)} - M^{(t)} \|$, is smaller than the small number ϵ.
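The four steps can be sketched in plain Python as follows. This is a minimal illustration of the standard Fuzzy C-means iteration, not the authors' code. Several details are assumptions made for the sketch: Euclidean distance, initial centers drawn at random from the data points, a fuzzifier strictly greater than 1 (the update is undefined at m = 1), a `max_iter` safety cap, and made-up two-dimensional sample points.

```python
import math
import random


def _memberships(points, centers, m):
    """Step 2: membership of every point in every cluster (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [math.dist(x, ctr) for ctr in centers]
        zeros = [d == 0.0 for d in dists]
        if any(zeros):
            # The point coincides with one or more centers: share membership there.
            rows.append([1.0 / sum(zeros) if z else 0.0 for z in zeros])
        else:
            rows.append([1.0 / sum((d_i / d_l) ** exponent for d_l in dists)
                         for d_i in dists])
    return rows


def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples."""
    if m <= 1.0:
        raise ValueError("this sketch requires a fuzzifier m > 1")
    rng = random.Random(seed)
    centers = rng.sample(points, c)  # Step 1: initialize the mean matrix
    dim = len(points[0])
    for _ in range(max_iter):
        mu = _memberships(points, centers, m)  # Step 2
        new_centers = []
        for i in range(c):  # Step 3: membership-weighted means
            weights = [row[i] ** m for row in mu]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * x[k] for w, x in zip(weights, points)) / total
                for k in range(dim)))
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:  # Step 4: stop when the centers barely move
            break
    return centers, _memberships(points, centers, m)


# Hypothetical 2-D feature vectors forming two loose groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centers, mu = fuzzy_c_means(points, c=2)
for row in mu:
    print([round(value, 3) for value in row])
```

Each printed row holds one point's degrees of membership in the two clusters; unlike hard clustering, no point is forced to belong to exactly one cluster.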
III. PROPOSED MODEL
Two algorithms form the framework of the proposed model. The first algorithm trains the recommender system: the MovieLens data set is its input, and its expected output is the fuzzy user-based similarity measure. It is summarized as follows:

START
Step 1 Access the MovieLens website and load the data set in order to construct two matrices.
Step 2 Construct the user-movie rating matrix.
Step 3 Produce the user-based similarity matrix from the user-movie rating matrix of the preceding step by applying the Pearson similarity measure.
Step 4 Construct the truthfulness matrix of the users.
Step 4.1 Compute the activity of user u: User_activity(u) = count(Mu).
Step 4.2 Compute the probity of the user, where R is the rating, U is the user, J is the movie and Ru is the average rating of user u.
Step 4.3 Compute the score of the user's neighbors: User_friend_score(u) = MAX AVG(ui), where i = 1, ..., K (number of nearest users).
Step 5 Cluster similar users by applying fuzzy c-means clustering to the truthfulness matrix, taking the degree of membership into account; membership values lie in the range [0, 1].
Step 6 Compute the fuzzy-based similarity matrix, which combines the two similarity matrices, with the proposed formula:

$$ Similarity(i,j)_{fuzzy\ based} = C \times Similarity(i,j)_{user\ based} + (1 - C) \times Similarity(i,j)_{fuzzy\ truthfulness} $$

where C is the combination coefficient. Combining the user-based similarity matrix and the fuzzy truthfulness similarity matrix in this way yields the fuzzy-based similarity matrix.

END
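The combination in Step 6 can be sketched in Python as an element-wise weighted sum of two similarity matrices. The sketch assumes both matrices are already available as nested lists of equal shape; how the fuzzy truthfulness similarity itself is derived from the cluster memberships is not reproduced here, and the numbers below are made-up values for illustration only.

```python
def combine_similarity(sim_user, sim_truth, coeff=0.5):
    """Element-wise coeff * sim_user + (1 - coeff) * sim_truth.

    sim_user:  user-based (Pearson) similarity matrix as nested lists.
    sim_truth: fuzzy truthfulness similarity matrix of the same shape.
    coeff:     combination coefficient C, expected to lie in [0, 1].
    """
    if not 0.0 <= coeff <= 1.0:
        raise ValueError("combination coefficient must lie in [0, 1]")
    if len(sim_user) != len(sim_truth):
        raise ValueError("similarity matrices must have the same shape")
    return [
        [coeff * s_user + (1.0 - coeff) * s_truth
         for s_user, s_truth in zip(row_user, row_truth)]
        for row_user, row_truth in zip(sim_user, sim_truth)
    ]


# Hypothetical pair of users with no co-rated movies (user-based similarity 0)
# but similar truthfulness memberships (fuzzy similarity 0.8).
sim_user = [[1.0, 0.0], [0.0, 1.0]]
sim_truth = [[1.0, 0.8], [0.8, 1.0]]
fuzzy_sim = combine_similarity(sim_user, sim_truth, coeff=0.5)
print(fuzzy_sim[0][1])
```

With a coefficient below 1, a pair of users whose rating-based similarity is zero still receives a nonzero combined similarity whenever their truthfulness similarity is nonzero, which is the property the model relies on for cold start users.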
The first algorithm yields the fuzzy-based similarity matrix. The second algorithm is used in the testing phase of the recommender system; it takes a new user with few ratings as input and produces the recommended items for that user.

START
Step 1 Select cold start users from the dataset.
Step 2 Compute the truthfulness information of the new user.
Step 3 Apply the Pearson correlation coefficient to the new user's rating matrix.
Step 4 Cluster according to the new user's truthfulness matrix.
Step 5 Compute the proposed similarity formula, i.e.

$$ Similarity(i,j)_{fuzzy\ based} = C \times Similarity(i,j)_{user\ based} + (1 - C) \times Similarity(i,j)_{fuzzy\ truthfulness} $$

Step 6 Apply the prediction formula (from the Prediction Computation section).
Step 7 Fetch the results for the new user from the clustering of the truthfulness matrix in Step 4 (the highest estimated ratings) and recommend them to the new user.

END
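The final recommendation step can be sketched in Python as a selection of the items with the highest estimated ratings. The function name, the dictionary layout and the exclusion of items the user has already rated are assumptions made for illustration; the predicted values are made up.

```python
import heapq


def recommend_top_n(predicted, already_rated, n=5):
    """Return the ids of the n items with the highest predicted rating.

    predicted:     dict mapping item id -> predicted rating (assumed layout).
    already_rated: collection of item ids the user has rated; these are skipped.
    """
    candidates = {item: score for item, score in predicted.items()
                  if item not in already_rated}
    return heapq.nlargest(n, candidates, key=candidates.get)


# Hypothetical predictions for a cold start user who has rated only movie "m3".
predicted = {"m1": 4.5, "m2": 3.0, "m3": 4.9, "m4": 2.0}
print(recommend_top_n(predicted, already_rated={"m3"}, n=2))
```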
IV. DISCUSSION AND CONCLUSION
The proposed model was tested experimentally by providing MovieLens data to the developed algorithms. The data consists of more than 1600 movies and their corresponding 100 thousand ratings from more than 900 users; the ratings are extracted and processed further after analysis. Cold start users for this model are selected randomly from the users who have few ratings. A close look at Table I shows similarity between the ratings of movies by different users. However, some users rated very few of the first 10 movies and some did not rate any of them. The ratings of the first 10 movies by 10 users are given in Table I.
For instance, users 3, 4 and 5 did not rate any of the first 10 movies, whereas user 9 rated only one of them. These users have no common ratings. On the other hand, users 8 and 9 rated only movie 7. Users 4 and 9 are considered for the purpose of computation and analysis. In Table III, user 9 rated 12 movies and user 4 rated 13 movies. The similarity between users computed from the user data is shown in Table II. The user-based similarity matrix makes clear that the similarity between user 4 and user 9 is zero. In Algorithm 1 the processing is done offline, which is effective in reducing recommendation time. The proposed model produces the user-movie rating matrix depicted in Table I, and Table II is constructed by applying the Pearson correlation similarity formula to the user-movie matrix.
The computed user activity and user probity are shown in Table III for the whole data set, which demonstrates the truthfulness of each user and their behavior.
In addition, fuzzy c-means clustering plays an important role in prediction performance. Constructing the fuzzy matrix is vital for this study, because clustering the truthfulness of the users reveals which cluster each user belongs to, as illustrated in Table IV. The values are the membership values of each user in the respective clusters and are used to compute the similarity between users with the formula developed in equation 3. This computation mitigates the sparsity problem and hence improves prediction accuracy, especially under cold start. The two clusters shown were computed with fuzzy c-means; for instance, user 1 has 32% membership in cluster 1 and 68% membership in cluster 2. Accurate prediction cannot be achieved with the user-based similarity matrix alone in user-based collaborative filtering, especially for users who have no common ratings for the same item. For instance, user 4 shares ratings neither with user 6 nor with user 9, hence the similarity value is zero. With the model proposed in this study, however, similarity and prediction can be obtained for new users using the upgraded equation 3, in which user truthfulness plays the key role in successfully recommending items to the cold user. The results and explanation of the suggested model make it apparent that the sparsity and cold user problems are well addressed by fuzzy user-based collaborative filtering within the proposed framework.