A Review-based Context-Aware Recommender Systems: Using Custom NER and Factorization Machines

Recommender Systems depend fundamentally on user feedback to provide recommendations. Classical recommenders rely only on historical data and suffer from several problems linked to the lack of data, such as sparsity. Users' reviews represent a massive amount of valuable and rich information, but they are still ignored by most current recommender systems. Information such as users' preferences and contextual data could be extracted from reviews and integrated into Recommender Systems to provide more accurate recommendations. In this paper, we present a Context-Aware Recommender System model that uses a pretrained Bidirectional Encoder Representations from Transformers (BERT) model to build a custom Named Entity Recognition (NER) component. The model automatically extracts contextual information from reviews, then feeds the extracted data into a Contextual Factorization Machine to compute and predict ratings. Empirical results show that our model improves the quality of recommendation and outperforms the Recommender Systems it is compared with.


I. INTRODUCTION
In the last few years, the number of services and items offered by businesses and websites has increased quickly, which makes it more difficult to choose the products and services that meet customers' needs.
Recommender systems (RS) tackle this problem by helping users to find suitable resources, based on their past behaviors and preferences. Today, companies use RS in several domains to assist users, enhance customer experience and make it easier for them to satisfy their needs by speeding up searches.
Traditional Recommender Systems such as Collaborative Filtering techniques [1] are based on two dimensions (User X Item) and on numeric ratings (e.g., a 5-star rating) to compute similarities between users and items and produce recommendations. Context-Aware Recommender Systems [2], in contrast, use contextual information as additional dimensions beside the two classical ones (User X Item X Context) to enhance accuracy. The contextual information represents the environmental factors that influence the user's decision. In general, a numeric rating expresses whether a user likes or dislikes an item; however, it does not reveal why, when or where he/she made this choice.
Since sparsity and the lack of information represent big challenges for Recommender Systems, customers' reviews could be a good resource to address these problems and help companies better understand the decisions made by users. Many models have been proposed to extract valuable information from reviews, for example through sentiment analysis and rating extraction, but only a few works extract contextual information from users' reviews.
In this article, we present a new model for contextual information extraction based on a custom Named Entity Recognition and on BERT, a language representation method pre-trained on large amounts of data such as Wikipedia. The extracted data is used by a Contextual Factorization Machine algorithm to predict the user's interest.
The remainder of this paper is presented as follows. In Section 2, we give a review of related works. In Section 3, we define the context dimension and we present the Named Entity Recognition and BERT. We introduce Factorization Machine algorithm and its use in Context-Aware Recommender Systems in Section 4. In Section 5, we present the proposed work in detail. In Section 6 we discuss obtained results. Finally, a conclusion of the work is presented in the last section.

II. RELATED WORK
Recently, many works have been proposed for extracting valuable data from reviews and integrating them into the recommendation process. This section presents recent applications of review-based Recommender Systems.
Zheng et al. [3] presented a Deep Cooperative Neural Networks (DeepCoNN) model based on word embedding techniques and two convolutional neural networks (CNNs). The first network extracts user behaviors from users' reviews and the second network extracts item properties from reviews written on items. The model merges the network outputs and passes them to a factorization machine algorithm for the prediction.
Similarly to DeepCoNN [3], R. Catherine and W. Cohen [4] proposed a model called Learning to Transform for Recommendation (TransNets) based on two parallel CNNs, one to process the target review and the other to process the texts of the user and item pair, and a Factorization Machine to predict the rating. The difference between the two models is that TransNets integrates an additional Transform layer to represent the target user-target item pair.
McAuley and Leskovec [5] introduced a Hidden Factors and Hidden Topics (HFT) model that merges reviews written by users with ratings to provide recommendations. The model uses Latent-Factor Recommender Systems to predict ratings and Latent Dirichlet Allocation to discover hidden dimensions in review text.
Tan et al. [6] introduced a Rating-Boosted Latent Topics (RBLT) framework which models item features and user preferences by combining textual information extracted from reviews with numeric ratings. The RBLT model represents users and items as a latent rating factor distribution, and repeats a review with rating n a total of n times so that it dominates the topics. To perform predictions, the outputs are fed into a Latent Factorization Machine (LFM).
Zhang et al. [7] proposed an Explicit Factor Model (EFM) to produce explainable recommendations. EFM extracts user sentiments and explicit item features from reviews, then recommends or does not recommend items based on the hidden features learned, the item features and the users' interest. All previously cited works have exploited reviews to boost recommender systems, but they ignore contextual information, which could significantly improve recommendations.
Other researchers succeeded in integrating context into recommendation tasks. Aciar [8] proposed a Mining Context Information method based on classification rules and text mining techniques to automatically identify users' preferences and contextual information inside reviews, extract them and integrate them into recommendation. However, this method identifies sentences containing context but cannot extract the contextual information from these sentences.
Hariri et al. [9] proposed a Context-Aware Recommender System that models user reviews to obtain contextual data and combines it with rating history to compute the utility function and suggest items to users. The model handles context as a supervised topic modeling problem and builds the context classifier using labeled-LDA. The system uses conventional recommendation algorithms to predict ratings. However, this work predicts the utility function, not the rating.
Levi et al. [10] introduced a Cold Start Context-Based Hotel Recommender System based on context groups extracted from reviews. This approach uses many elements, including a weighted algorithm for text mining, an analysis to understand sentiment towards hotel features, and clustering to build a hotel's vocabulary and nationality groups. Although this study tackles the cold start issue, it does not show how to adequately integrate the extracted context into recommendation.
Compos et al. [11] introduced an approach to extract contextual information from user reviews using a large-scale, generic context taxonomy based on semantic entities obtained from DBpedia. In this approach a software tool builds the taxonomy by exploring DBpedia automatically, and also allows for manual adjustments of the taxonomy. Although this work presents a semi-automatic method to extract context from reviews, it does not explain how to use the extracted data to predict ratings.
Lahlou et al. [12] proposed a review-aware Recommender System that uses users' reviews to build contextual recommendations. The proposed architecture automatically exploits contextual information from reviews to build recommendations. They also presented Textual Context Aware Factorization Machines (TCAFM), which are tailored to context. This work shows good performance in terms of accuracy, but it considers the whole review as context instead of extracting contextual data, and in real-world datasets only a few reviews contain this kind of data.

III. CONTEXT EXTRACTION
To extract contextual information from reviews, we must first define the context dimensions (a.k.a. categories of context). Many context modeling approaches have been introduced in the literature, but the most commonly used context representation is [13]; the majority of these approaches use ontologies to build a context taxonomy. For instance, Castelli et al. [14] use the W4 model (Who, When, Where, What) as components of context: "Who" is linked to the Person, "When" is associated with the Time, "Where" refers to the Location and "What" refers to the Fact. Similarly, Kim et al. [15] instantiate the 5W1H model (Who, Why, Where, What, When, How) as contextual components associated respectively with Status, Goal, Location, Role, Time and Action. Chaari et al. [16] proposed a basic context descriptor that describes contextual components as Service, User, Activity, Location, Device, Resource and Network. Table I summarizes some of the principal context modeling techniques. After reviewing and analysing the proposed works, we chose to use four contextual dimensions: Time, Location, Companion and Environmental.

A. Named Entity Recognition
Named Entity Recognition (NER) is commonly treated as a sequence labeling problem, where a given sentence is represented as a token sequence w = (w_1, w_2, w_3, ..., w_n) and transformed into a token label sequence y = (y_1, y_2, y_3, ..., y_n) [18]. The neural model is generally composed of three elements: a word embedding layer, a context encoder layer and a decoder layer [19]. Bidirectional long short-term memory networks (Bi-LSTM) [20][21] are widely applied in Natural Language Processing tasks and adopted by most NER models, due to their sequential nature and their capacity to learn contextual word representations. Although NER has been employed in several application domains, many fields remain unexplored, such as Context-Aware Recommender Systems.

B. BERT
BERT (Bidirectional Encoder Representations from Transformers) is, as its name indicates, a language representation model that relies on a module called the "Transformer". A Transformer is a component that relies on attention mechanisms and is built from an encoder and a decoder. In contrast to directional and shallow-bidirectional models (OpenAI GPT [22], ELMo [23]), BERT pre-trains deep bidirectional representations from unstructured text, using both left and right context in all layers [24]. It has been pre-trained on large corpora such as the entire BookCorpus and Wikipedia. Language modeling is a common NLP task that consists of predicting the next word given the start of the sentence. The Masked Language Model (MLM) allows BERT to learn in an unsupervised way: the input is sufficient on its own, and nothing needs to be labeled. The principle of masked language modeling is to predict "masked" tokens from the other tokens in the sequence. In the first step of BERT's pre-training, 15% of the tokens of each sequence are masked at random. This step is essential because BERT gets its deep bidirectionality from it.
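The masking step described above can be illustrated with a minimal sketch. It assumes the simplest variant, in which every selected token is replaced by the mask symbol; BERT's actual procedure also leaves some selected tokens unchanged or swaps them for random tokens. The function name `mask_tokens`, the seed and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # mask symbol used by BERT's vocabulary


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly mask a fraction of tokens.

    Returns the masked sequence and a dict mapping each masked position
    to the original token the model would be trained to predict.
    Simplification (assumption): every selected token becomes [MASK].
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


sentence = "i watched the movie with my friend at the cinema last saturday night".split()
masked_sentence, mlm_targets = mask_tokens(sentence)
```

Because the targets are taken from the input itself, no manual labels are needed, which is the sense in which this training signal is unsupervised.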

IV. MACHINE FACTORIZATION FOR CONTEXT AWARE RECOMMENDER SYSTEMS
Factorization machines (FM), proposed by Rendle [25], are a general-purpose supervised learning algorithm that can be used in regression and classification tasks. FM rapidly became one of the most popular algorithms for recommendation and prediction. It is a generalization of the linear model that captures interactions between variables while reducing the polynomial complexity to linear computation time. FM is especially efficient on high-dimensional sparse datasets.
Let x ∈ R^d be the feature vector and y the corresponding label. The model equation for a factorization machine is defined as:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle \, x_i x_j \qquad (1)$$

where w_0 ∈ R is the bias term, w ∈ R^d contains the weight of each feature, V ∈ R^{d×k} is the interaction matrix, v_i is the i-th row of V, and ⟨v_i, v_j⟩ is the interaction between the i-th and j-th variables. It is important to point out that this factorization can compute all pairwise interactions, even hidden feature interactions, which can significantly reduce feature engineering effort.
Context-Aware Factorization Machines are an application of the original FM algorithm without any tuning. In effect, the algorithm can easily incorporate the additional dimensions without any changes, since it uses a sparse vector representation. Fig. 2 shows how the contextual dimensions are transformed into a prediction problem over real-valued features using a Sparse Feature Vector Representation.
We used FM for two main reasons. The first reason is that the algorithm is designed to support sparse data, and the extracted contextual information makes the matrix even sparser. The second reason is that the computation cost of FM is linear (O(kd)), even with additional contextual dimensions.
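As an illustration, the second-order factorization machine of Rendle [25] (bias, linear terms, and factorized pairwise interactions) can be evaluated directly from its definition. The sketch below is a plain, unoptimized version; the function name `fm_predict` and the toy parameter values are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """Factorization machine prediction, written directly from the definition.

    x  : list of d feature values
    w0 : bias term
    w  : list of d feature weights
    V  : d x k interaction matrix (list of rows)
    """
    d = len(x)
    k = len(V[0])
    linear = sum(w[i] * x[i] for i in range(d))
    pairwise = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            # <v_i, v_j>: dot product of the two latent rows
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


# Toy values chosen only for illustration.
x = [1.0, 0.0, 1.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
y_hat = fm_predict(x, w0, w, V)  # 0.5 + 0.4 + 5.0 = 5.9
```

Only the pair of features 0 and 2 contributes to the interaction term here, because feature 1 is zero; this is why sparse inputs are cheap for this model.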
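The sparse feature vector idea can be sketched as follows: each user, item and context value gets its own column, and a rating event activates only the columns that apply. This is a minimal illustration, not the paper's implementation; the helper names and the context values such as `location=cinema` are hypothetical.

```python
def build_feature_index(users, items, contexts):
    """Assign one column per user, per item and per context value."""
    index = {}
    for group, values in (("user", users), ("item", items), ("context", contexts)):
        for value in values:
            index[(group, value)] = len(index)
    return index


def encode(index, user, item, context_values):
    """Return one rating event as a sparse vector {column: 1.0}."""
    active = [("user", user), ("item", item)]
    active += [("context", c) for c in context_values]
    return {index[key]: 1.0 for key in active}


index = build_feature_index(
    users=["u1", "u2"],
    items=["i1", "i2", "i3"],
    contexts=["time=weekend", "location=cinema", "companion=friend"],
)
# User u2 rated item i1 while at the cinema with a friend.
row = encode(index, "u2", "i1", ["location=cinema", "companion=friend"])
```

Adding a new contextual dimension only appends columns to the index, so the learning algorithm itself does not need to change.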
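The linear cost comes from a known rewriting of the pairwise term in Rendle's formulation: the sum over all feature pairs equals half of, for each latent factor, the squared sum minus the sum of squares. The sketch below compares that form with the naive double loop on toy values; function names and data are hypothetical.

```python
def pairwise_naive(x, V):
    """Pairwise interaction term computed over all feature pairs, O(k * d^2)."""
    d, k = len(x), len(V[0])
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            total += sum(V[i][f] * V[j][f] for f in range(k)) * x[i] * x[j]
    return total


def pairwise_linear(x, V):
    """Same term in O(k * d), touching only the non-zero features."""
    k = len(V[0])
    nonzero = [(i, xi) for i, xi in enumerate(x) if xi != 0.0]
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in nonzero)            # sum of v_if * x_i
        s_sq = sum((V[i][f] * xi) ** 2 for i, xi in nonzero)  # sum of squares
        total += 0.5 * (s * s - s_sq)
    return total


x = [1.0, 0.0, 2.0, 1.0]
V = [[1.0, 2.0], [0.5, 1.0], [3.0, 1.0], [0.0, -1.0]]
naive = pairwise_naive(x, V)
fast = pairwise_linear(x, V)
```

Both functions return the same value, so extra contextual columns add cost only in proportion to the number of non-zero entries per row.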

V. METHODOLOGY
The implementation of our model is a two-step process. The first step extracts context from reviews using a custom NER and a BERT model. As shown in Fig. 3, in this step we aim to switch from the two-dimensional mode used by classic Recommender Systems to the multi-dimensional mode used by Context-Aware Recommenders. In the second step, a contextual Factorization Machine is applied to predict ratings and generate recommendations based on the outputs of the previous step.

A. Context Extraction Step
In this step, Named Entity Recognition is treated as a sequence labeling problem. Our model consists of three layers, as shown in Fig. 4: a word embedding layer, a Bi-LSTM layer and a CRF layer. In the first layer, the BERT pretrained model takes a sequence of n words (w_1, w_2, ..., w_n), then outputs a contextual embedding vector representation of each word. In contrast to context-independent word embedding techniques such as Word2Vec [26], BERT is deeply bidirectional and uses the surrounding text to learn each word's context. BERT has two variants, BERT Base and BERT Large. In this work we use the BERT Base model [27].
The second layer is the Bidirectional Long Short-Term Memory (Bi-LSTM). Bi-LSTM is an extension of LSTM proposed by [28] that uses forward and backward networks to process sequences. It is designed to avoid vanishing and exploding gradients and to overcome the problem of long-term dependencies.
The output of the embedding layer is sent to the Bi-LSTM to extract feature vectors from words. Bi-LSTM concatenates the outputs of the forward and backward networks as a final result [H_l, H_r]. In the last layer, Conditional Random Fields (CRF) [28] output the most probable tag sequence. CRF is a probabilistic discriminative model that is used to label sequences. The use of CRF helps the model to learn labels and constraints that ensure the validity of the sequence. For example, in the BIO format (Beginning, Inside, Outside), a common format for tagging tokens, the label of the first word must begin with "B" or "O", not with "I"; this constraint is learned automatically by the CRF.
Let X be the input sequence and Y the corresponding tag sequence, P the matrix of scores obtained from the previous layer and T the transition matrix, whose entry T_{y_i, y_{i+1}} scores the transition from label y_i to label y_{i+1}. The score of the tag sequence is computed as follows:

$$s(X, Y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \qquad (2)$$

where y_0 and y_{n+1} denote the start and end tags of the sequence.
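The BIO constraint mentioned above can be made concrete with a small validity check. In the model the CRF learns this constraint from data; the function below only states the rule explicitly for illustration. The function name and the tag names (`LOC`, `TIME`) are hypothetical.

```python
def is_valid_bio(tags):
    """Return True if every I- tag continues an entity of the same type.

    An "I-X" tag is valid only directly after "B-X" or "I-X", so a
    sequence can never start with an I- tag.
    """
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous not in ("B-" + entity, "I-" + entity):
                return False
        previous = tag
    return True


valid = is_valid_bio(["O", "B-LOC", "I-LOC", "O"])
starts_with_inside = is_valid_bio(["I-LOC", "O"])
type_switch = is_valid_bio(["B-LOC", "I-TIME"])
```

A softmax over each token independently could emit the two invalid sequences; scoring whole sequences with transitions is what lets the CRF rule them out.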
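A minimal sketch of how such a sequence score can be computed is given below. It assumes the common BiLSTM-CRF formulation in which the score is the sum of the per-token scores from P and the transition scores from T, with extra start and end tags at the sequence boundaries; whether the paper uses boundary tags is an assumption. All names and numbers are hypothetical.

```python
def sequence_score(P, T, tags, start, end):
    """Score one tag sequence as emission scores plus transition scores.

    P[i][t] : score of tag t at position i (output of the previous layer)
    T[a][b] : transition score from tag a to tag b
    start, end : indices of the boundary tags in T (assumed convention)
    """
    path = [start] + list(tags) + [end]
    transitions = sum(T[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[i][t] for i, t in enumerate(tags))
    return transitions + emissions


# Two real tags (0 and 1) plus start (2) and end (3); toy values.
P = [[2.0, 0.5], [0.1, 1.5], [1.0, 0.2]]
T = [
    [0.5, 1.0, 0.0, 0.3],
    [0.2, 0.4, 0.0, 0.6],
    [0.7, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
score = sequence_score(P, T, tags=[0, 1, 0], start=2, end=3)  # 2.2 + 4.5 = 6.7
```

Decoding then amounts to finding the tag sequence with the highest score, which is typically done with the Viterbi algorithm rather than by enumeration.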

B. Rating Prediction Step
In this step, a Contextual Factorization Machine (CFM) takes the output from the previous step and predicts the rating. CFM is similar to FM, except that a matrix of weights is added to capture the importance of the contextual dimensions. In the CFM equation, w_0 ∈ R is the global bias, w ∈ R^d contains the weight of each feature, V ∈ R^{d×k} is the interaction matrix, v_i is the i-th row of V, ⟨v_i, v_j⟩ is the interaction between the i-th and j-th variables, and B ∈ R^p is the matrix of importance weights. The parameter b_i is equal to 1 for the item and user dimensions, and b_i x_i is used for the other dimensions.

A. Corpora and Dataset
We use three corpora to pre-train our custom NER:
- Corpus 1 is the CoNLL-2003 [29] NER dataset, which consists of 18,453 sentences, 254,983 tokens and four entities, namely persons, locations, organizations and miscellaneous.
- Corpus 2 is the Groningen Meaning Bank (GMB) [30] for named entity classification, developed at the University of Groningen. It comprises 63,256 sentences, 1,388,847 tokens and eight entities (Geographical Entity, Organization, Person, Geopolitical Entity, Time, Artifact, Event, Natural Phenomenon).
- Corpus 3 is a custom corpus that we built to address some gaps in the two aforementioned corpora. After training on them, our custom NER is still not able to extract some categories such as Companion context and Environmental context. For example, in "I watched the movies with my friend at the cinema", friend and cinema should be annotated as companion and location, but this is not the case. The new corpus allows us to fine-tune our custom NER and tackle this problem. It is created in BIO (Beginning, Inside, Outside) format and consists of 3,500 sentences, 43,565 tokens and three entities.
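To illustrate what a BIO annotation of the example sentence could look like, the sketch below tags it and groups the tags back into contextual entities. The label names `COMPANION` and `LOCATION` and the function `extract_entities` are hypothetical; the actual label set of the custom corpus is not given here.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # an "O" tag closes any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = "I watched the movies with my friend at the cinema".split()
tags = ["O", "O", "O", "O", "O", "O", "B-COMPANION", "O", "O", "B-LOCATION"]
context = extract_entities(tokens, tags)
# context == [("COMPANION", "friend"), ("LOCATION", "cinema")]
```

The resulting pairs are the kind of contextual values that can then be turned into extra feature columns for the rating prediction step.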
To evaluate our model, we selected the Amazon Customer Reviews Dataset [31] and the Yelp dataset [32]. The Amazon dataset consists of customer reviews, ratings and product metadata (price, brand, descriptions, ...). It includes more than 233.1M reviews collected between 1996 and 2018 across 21 categories of products. This dataset is considered the largest public dataset for rating. The Yelp dataset is free to use for academic and personal purposes; it contains more than 8.5M reviews and more than 160K businesses. We adopted three metrics to evaluate our custom NER, namely Precision, Recall and F1:

$$\text{Precision}(P) = \frac{TP}{TP + FP}$$

$$\text{Recall}(R) = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$

where TP, FP and FN are the numbers of true positives, false positives and false negatives.
We use the Mean Square Error (MSE) to evaluate the performance of the CFM algorithm. The corresponding equation is:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where N is the number of evaluated ratings, y_i is the observed rating and \hat{y}_i is the predicted rating.
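The evaluation metrics can be computed with a few lines of code. The sketch below uses the standard definitions of precision, recall, F1 and mean squared error; the function names and the toy counts are hypothetical and are not results from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy numbers for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
mse = mean_squared_error([5.0, 3.0, 4.0], [4.0, 3.0, 2.0])
```

For NER, the counts are usually taken at the entity level, so a predicted span counts as a true positive only when both its boundaries and its type match the annotation.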

B. Experimental Setting
As previously mentioned, the proposed approach consists of two steps: the context extraction step and the rating prediction step. In the first step, the custom NER is trained using the FLAIR library, an open source NLP framework for state-of-the-art text classification and sequence labeling. We use Google Colab to train the model (GPU 12GB). In order not to exceed the available GPU memory, we fix the mini-batch size to 32 and the maximum sequence length to 512. We use a single-layer Bi-LSTM with a hidden size of 256 to process input sequences, and a learning rate of 0.1.
In the second step, we build our CFM using the Tensorflow framework [33]; our implementation is inspired by [34] [35]. We split the data into 80% for training and 20% for testing. As we are dealing with a regression problem, the model parameters are learned by minimizing the loss function, and we prevent overfitting by adding an L2 regularization term. Since it is efficient with sparse data, we use a gradient-based optimizer. We fix the number of iterations to 1000 since the CFM algorithm needs more time to converge.

C. Results and Discussion
1) Custom NER: Results of the custom NER for the different corpora are presented in Fig. 5. As we can see, the custom NER achieves its best F1 score of 91.59% on Corpus 1, followed by 89.92% on Corpus 3, and its worst F1 score of 80.17% on Corpus 2. The disparity in results can be explained by the fact that the quality of the data differs from one corpus to another. The GMB corpus yields the worst performance because its annotation is not perfect: it is not completely human-annotated, since the corpus is built using existing annotation and must be corrected manually by humans.
All works already mentioned have incorporated reviews except FM and PMF. The results are shown in Fig. 6. The difference in results is mainly related to the sparsity of the two datasets, and it is worth noting that our model may perform better on datasets that are less sparse than the two used in these experiments. Table III presents all obtained results.
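The data split and the regularized objective described above can be sketched as follows. This is a framework-free illustration rather than the Tensorflow implementation; the function names, the random seed and the regularization strength `l2` are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def regularized_loss(y_true, y_pred, params, l2=0.01):
    """Mean squared error plus an L2 penalty on the model parameters."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    penalty = l2 * sum(p * p for p in params)
    return mse + penalty


ratings = list(range(10))  # stand-in for rating records
train, test = train_test_split(ratings)  # 8 training rows, 2 test rows
loss = regularized_loss([4.0, 2.0], [3.0, 2.0], params=[1.0, -2.0], l2=0.1)
```

The penalty grows with the magnitude of the parameters, so the optimizer is discouraged from fitting the training ratings with very large weights.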

VII. CONCLUSION
This article presented an automatic method to extract contextual information from users' reviews and use it to improve recommendation quality with less time spent on feature engineering. The work is divided into two main steps. The first step extracts context using a custom NER built from three layers: a word embedding layer, which takes a sequence of words and outputs a contextual embedding vector for each word using the BERT model; a Bi-LSTM layer, which extracts feature vectors from the embeddings produced by the previous layer; and a CRF layer, which automatically learns labels and constraints and guarantees the validity of the sequence. The second step predicts ratings: the CFM algorithm takes the output of the first step and computes the ratings. In contrast to the generic FM, the CFM captures the importance of the contextual dimensions and incorporates them into the recommendation process.
To evaluate its performance, the proposed model was compared with five models, and the obtained results show that it achieves good results. In future work, the proposed model will be improved in both steps, namely the data extraction step and the techniques used for it, and the rating prediction step.