A Deep Learning Approach Combining CNN and Bi-LSTM with SVM Classifier for Arabic Sentiment Analysis

Deep learning models have recently proven successful in various natural language processing tasks, including sentiment analysis. Conventionally, a deep learning model's architecture includes a feature extraction layer followed by a fully connected layer used to train the model parameters and perform the classification task. In this paper, we employ a deep learning model with a modified architecture that combines a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (Bi-LSTM) network for feature extraction with a Support Vector Machine (SVM) for Arabic sentiment classification. In particular, we use a linear SVM classifier that utilizes the embedded vectors obtained from CNN and Bi-LSTM for polarity classification of Arabic reviews. The proposed method was tested on three publicly available datasets. The results show that the method achieved superior performance compared to the two baseline algorithms, CNN and SVM, on all datasets.

Keywords—Sentiment analysis; Arabic sentiment analysis; deep learning approach; convolutional neural network (CNN); bidirectional long short-term memory (Bi-LSTM); support vector machine (SVM)


I. INTRODUCTION
Sentiment analysis (SA) is one of the most active research areas in natural language processing (NLP). The SA task aims to equip a machine with the ability to categorize people's opinions into positive or negative based on the sentiment expressed in the texts. SA is technically a challenging task, as human language is complex and diverse. Nevertheless, research on SA has achieved considerable progress, especially for the English language, while Arabic SA is developing slowly despite the increasing usage of the Arabic language on the Internet [1]. This can be attributed to the Arabic language's complex morphological nature and diverse dialects [2]. Additionally, technical issues such as scarce linguistic resources and limited linguistic tools for Arabic increase the level of complexity [3]. The Arabic language is one of the most common languages, used by more than 400 million individuals (https://en.unesco.org/commemorations/worldarabiclanguageday) for daily communication. Arabic is formally written in a form called Modern Standard Arabic (MSA), which is understood all over the Arab world. Despite this fact, internet users usually tend to use dialectal words alongside MSA to write reviews, tweets, or comments. Dialectal words result from behaviors such as replacing characters and changing the pronunciation or the style of writing of nouns, verbs, and pronouns of MSA [4]. Thus, there are no standard rules that can handle the morphological or syntactic aspects of reviews written in MSA with the presence of dialects.
One of the prominent approaches employed in SA is machine learning (ML). In this context, SA is formalized as a classification problem addressed with algorithms like Naive Bayes (NB), Support Vector Machine (SVM), and Maximum Entropy (MaxEnt). In fact, ML has proven its efficiency and competence; therefore, it has mostly dominated the SA task over the past two decades. However, a fundamental issue that can degrade the performance of ML is the text representation through which the features are created and fed to the learning process. For text representation, the most used model in the literature is the bag-of-words (BOW) with unigrams, bigrams, or part-of-speech (POS) tags. With BOW, words are often weighted by their presence (binary scheme) or frequency, as in term frequency (TF) or term frequency-inverse document frequency (TF-IDF). This model is relatively straightforward and important; however, it can be problematic and may degrade ML-based sentiment classification performance when it comes to a large number of features or the need for semantic information [5][6][7][8].
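For illustration, a minimal scikit-learn sketch of a BOW representation with TF-IDF weighting over unigrams and bigrams is given below (the toy corpus is hypothetical and not drawn from our datasets):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; real experiments would use the review texts.
corpus = ["the film was excellent", "the film was terrible"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(corpus)              # sparse document-term matrix
print(X.shape)                                    # (2 documents, |vocabulary| features)

Note how the feature space grows with the vocabulary size and n-gram order, which is precisely the dimensionality problem noted above.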
Over the past decade, the deep learning approach and word vector representations have attracted increasing interest from NLP researchers, as they can successfully handle the limitations of traditional ML methods on NLP tasks, including SA (Kim, 2014). Deep learning is an extension of ML and a subset of neural networks. It has a structure composed of multiple hidden layers, which enables it to automatically discover semantic representations of texts from data without feature engineering [9]. This approach has improved the state-of-the-art in many SA tasks, including sentiment classification, opinion extraction, and fine-grained sentiment analysis [10]. Along with deep learning, word vector representations, also known as word embeddings, have emerged as a powerful feature representation model for classification tasks. Word embedding is a modern approach for learning real-valued, low-dimensional vector space representations of text [11]. It aims to encode continuous semantic similarities between words based on their distributional properties in a large corpus [12]. Word embeddings are typically extracted by using neural networks as the underlying predictive model [13]. One of the most used methods for word embedding is Word2Vec, proposed by [11]. Word2Vec produces useful word representations learned by a two-layer neural network based on the model proposed in [14], and it is one of the key methods behind the state-of-the-art results achieved by the deep learning approach on SA. Word2Vec can be implemented using two main learning models: continuous bag-of-words (CBOW) and Skip-gram.
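As a brief illustration, both learning models can be trained with the gensim library, where the sg flag switches between CBOW and Skip-gram (the toy sentences are hypothetical; gensim version 4 or later is assumed):

from gensim.models import Word2Vec

# Hypothetical tokenized sentences; sg=1 selects Skip-gram, sg=0 selects CBOW.
sentences = [["the", "movie", "was", "great"], ["the", "plot", "was", "weak"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vector = model.wv["movie"]  # 100-dimensional real-valued word vector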
Researchers have proposed many variants of deep learning models to address the SA of English, which have shown promising efficacy. For instance, in [15], the authors use a Recursive Neural Tensor Network (RNTN) along with word embeddings to address fine-grained sentiment classification at the phrase level. In another work [16], the authors apply an ensemble model that includes Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks on top of pre-trained word vectors for tweet sentiment classification. The authors in [17] develop an adaptive Recursive Neural Network (RNN) that uses more than one composition function and adaptively selects among them depending on the linguistic tags and the combined vectors for tweet sentiment classification. Unlike English, there have not been many studies that address Arabic SA based on a deep learning approach and word embeddings, except a few such as [18][19][20][21][22], which will be discussed with other studies in the following section. Accordingly, there is still room for exploring different deep learning algorithms and investigating their performance on Arabic SA, especially given the challenges that the Arabic language still imposes.
In this work, we propose a new method that combines deep learning models with SVM to improve the SA of Arabic reviews. Conventionally, a deep learning model's architecture includes feature extraction layers (e.g. CNN or RNN) followed by a fully connected layer used for training the model parameters and performing the classification task. In this work, however, we modify this structure and explore the impact of the modification on sentiment classification performance. The structure is altered by replacing the fully connected network with a linear SVM classifier that is trained on the embedded vectors generated by the combined deep learning models. This is justified by SVM's remarkable performance in recent research on SA, including Arabic SA, such as in [20,23,24]. The idea of combining neural networks and SVM was initially proposed in [25] to improve multilayer perceptron classifiers. In this respect, many researchers have presented various models for different classification tasks, including image classification [26,27], visual recognition [28], intrusion detection [29], object categorization [30], speech recognition [31], and sentiment analysis [32]. Although the proposed method is inspired by these studies, it is different in that we use a linear SVM as a replacement for the fully connected layer instead of replacing only the top layer's activation function (i.e. Softmax or Logistic). In particular, this research presents a combination of CNN and Bidirectional LSTM with a linear SVM to improve sentiment classification performance on Arabic reviews at the document level. To the best of our knowledge, this is the first attempt to use such a combination for Arabic SA. In this work, we used three publicly available datasets for evaluation: LABR [33], OMCCA [34], and the dataset presented in [35].
The remaining sections of this paper are organized as follows: Section 2 discusses the related work in Arabic SA. In Section 3, we present the proposed method. Section 4 describes the experimental details used to evaluate the performance of the proposed method. In Section 5, we report and discuss the obtained results. Section 6 presents the conclusion and future work.
II. RELATED WORK

Researchers have presented several methods and models to tackle Arabic SA based on three major approaches, namely, machine learning (ML) [36,37], semantic orientation [38,39], and the hybrid approach [40]. In this research, we focus on the studies that employ the ML approach, specifically deep learning methods, which have been increasingly utilized in the past decade. These methods have brought remarkable improvements to the SA field, especially for the English language, such as in [41][42][43][44][45]. On the other hand, only a few studies have attempted to utilize the deep learning approach for Arabic SA. For instance, in [46], the authors use an ensemble model of CNN and LSTM proposed by [16] to predict the sentiment of Arabic tweets at the sentence level. The model is trained on top of pre-trained word vectors developed in [47]. They evaluated the model on the ASTD dataset [48] and achieved an F1-score of around 64% and an accuracy of around 65%. Another work, described in [49], applies CNN and LSTM to sentiment analysis of Arabic tweets. The authors designed a system to identify the sentiment's class and intensity as a score between 0 and 1, using word and document embedding vectors to represent the tweets. They translated the tweets into English to benefit from the available preprocessing tools; as they highlighted, this translation step degraded the overall performance. Although the system includes some preprocessing steps, it excludes other important normalization processes such as removing diacritics, punctuation, and repeated characters.
In [50], the authors investigate the performance of LSTM and Bi-LSTM models on Saudi dialectal tweets. After some basic normalization processes, they use the CBOW model to represent words with vectors. As reported, Bi-LSTM outperformed LSTM and SVM with an accuracy of 94%. The study in [19] explores different deep learning models, including a combination of LSTM and CNN, to predict tweets' sentiment. The authors also investigate the use of the CBOW and Skip-gram models of the available pre-trained word vectors introduced in [51]. The experiments show that the highest results were obtained by a two-layer LSTM model. Barhoumi et al. [52] utilized Bi-LSTM and CNN to evaluate different types of Arabic-specific embeddings. They suggested using available Arabic NLP tools to address the effect of the agglutinative and morphologically rich nature of Arabic on word embedding quality. However, these NLP tools, such as lemmatization, light stemming, and stemming, have mostly been developed for MSA texts.
The work in [53] reports efforts to detect the sentiment of tweets over time series data in the domains of the stock exchange and sport. The authors use a deep neural network model with eight fully connected hidden layers, each with 100 neurons, and compare it against KNN, NB, and decision tree classifiers. Their model showed superiority with an F1-score of around 91% and an accuracy of around 88%. Nevertheless, the authors did not mention the kind of feature representation used in this work. The work in [54] explores different architectures for Arabic sentiment classification using Deep Belief Networks (DBN), Deep Auto Encoders (DAE), and Recursive Auto Encoders (RAE). The experimental results show that RAE obtained the best performance due to its capability to handle the sentence's context and order of parsing. Yet, the authors focus on sentence-level sentiment classification and use the traditional BOW model to represent texts. In [55], the authors implemented LSTM, along with character/word embedding features, to handle Arabic sentiment classification. In their experimental results, LSTM showed the ability to improve sentiment classification and achieved an accuracy of around 82%. In another work [56], the same authors found that SVM works better than RNN on the same dataset, though RNN's execution time was faster. However, both studies focus on aspect-based sentiment analysis of Arabic reviews.
This work introduces a combination of CNN and Bi-LSTM with a linear SVM classifier for Arabic sentiment classification. Unlike the studies mentioned above, the proposed model does not follow the conventional CNN and Bi-LSTM architecture; rather, the fully connected layers are replaced with a linear SVM algorithm that performs the classification task.
III. PROPOSED METHOD

The proposed method for sentiment classification of Arabic reviews starts with preparing the data by applying some preprocessing techniques. Then, the word vector representation is built using pre-trained word embeddings trained on other, larger Arabic text corpora. After that, we combine CNN and Bi-LSTM to generate the final feature representation, which is passed to a linear SVM classifier.

A. Data Preprocessing
The preprocessing stage consists of removing noise from the data and normalization. The noise removal process includes removing repeated letters, diacritics, punctuation, numerals, English words, and elongation. After that, a normalization process was applied to particular letters: the letters (أ، إ، آ) were converted to (ا), the letters (ئ، ى) were converted to (ي), the letter (ة) was converted to (ه), and finally, the letter (ؤ) was converted to (و). It is worth mentioning that we implemented these particular preprocessing steps to be compatible with those used in the pre-trained word embeddings model we intend to use. More details about the pre-trained word embeddings model are given in the next section.
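The following sketch illustrates such a normalization pipeline in Python; the exact character sets are our reading of the steps above and would need to be adjusted to match the Aravec preprocessing exactly:

import re

def normalize_arabic(text):
    text = re.sub(r'[\u064B-\u065F\u0670]', '', text)  # remove diacritics
    text = text.replace('\u0640', '')                  # remove tatweel (elongation)
    text = re.sub(r'(.)\1{2,}', r'\1', text)           # collapse repeated letters
    text = re.sub(r'[أإآ]', 'ا', text)                 # unify alef variants
    text = re.sub(r'[ئى]', 'ي', text)                  # unify yaa variants
    text = text.replace('ة', 'ه').replace('ؤ', 'و')    # taa marbuta and waw-hamza
    text = re.sub(r'[^\u0600-\u06FF\s]', ' ', text)    # drop Latin letters, digits, punctuation
    return re.sub(r'\s+', ' ', text).strip()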

B. Word Vector Representations
To generate representations for the reviews, we adopted the word vectors approach. Word vectors (also known as word embeddings) are vectors of real numbers representing the distribution of words or phrases in a given text [14]. The key benefit of this approach is that high-quality word embeddings can be learned efficiently, especially if a much larger corpus of text is used for training [57]. However, in our case the adopted datasets are relatively small, which might lead to low-quality word embeddings. Therefore, we decided to use a pre-trained word embeddings model trained on other, large Arabic texts. One of the most common Arabic pre-trained word embedding models is Aravec [47]. This model was trained on texts from multiple sources, such as Twitter, Web pages, and Wikipedia, with more than 3,300,000,000 tokens. It was implemented with the well-known Word2Vec technique described in [11], based on the CBOW and Skip-gram architectures.
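As a sketch, the pre-trained Aravec vectors can be loaded with gensim and used to fill an embedding matrix for the network (the file name is hypothetical; word_index and vocab_size are assumed to come from a tokenizer fitted on the reviews, and out-of-vocabulary rows stay zero, as described in Section IV):

import numpy as np
from gensim.models import Word2Vec

aravec = Word2Vec.load("aravec_web_sg_300.mdl")       # hypothetical Aravec model file
embed_dim = 300
embedding_matrix = np.zeros((vocab_size, embed_dim))  # zeros for unseen words
for word, idx in word_index.items():                  # word_index: assumed tokenizer vocabulary
    if word in aravec.wv:
        embedding_matrix[idx] = aravec.wv[word]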

C. Convolutional Neural Network Layer
We first build a CNN layer to extract the features of reviews. The inputs to the CNN are matrices of word embeddings that reflect the word vector representations of the reviews. Each review is mapped to a matrix of size n × d, where n is the number of words in the review and d is the dimension of the embedding space. All reviews must be zero-padded to have the same matrix dimension X ∈ ℝ^(n′×d). Then, in order to compute a new feature map, inputs are convolved with a filter matrix W ∈ ℝ^(h×d), where h is the size of the convolution window. The convolution is computed as in Equation 1:

c_i = f(W · x_{i:i+h−1} + b) (1)

where b ∈ ℝ is a bias term and f(·) is a nonlinear activation function. Each value c_i is the result of the element-wise product between an input submatrix x_{i:i+h−1} and the filter matrix W, summed into a single value of the feature map c = [c_1, c_2, …, c_{n−h+1}], where c ∈ ℝ^(n−h+1). To obtain a rich feature representation, the model can use multiple filters of different lengths that work in parallel. Then, the convolved results are passed to the pooling layer to reduce the representation dimensionality by identifying the most important features. We employed the max-pooling technique, which returns the maximum value from the feature map: ĉ = max(c). This technique forces the network to capture the most useful local features produced by the convolutional layer [58]. A typical architecture of a CNN layer is illustrated in Fig. 1. The output of this layer is fed into the next layer, Bi-LSTM, based on the assumption that a more informative representation will be obtained; more details in the next section.
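A minimal Keras sketch of this convolutional stage is given below; the filter count, kernel size, and sequence length are illustrative rather than the exact settings of Table II, and the embedding layer would in practice be initialized from the Aravec matrix built above:

from tensorflow.keras import layers, models

max_len, vocab_size, embed_dim = 100, 50000, 300  # assumed sizes
inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)                 # word-vector matrix n x d
x = layers.Conv1D(filters=64, kernel_size=3, activation='relu')(x)  # convolution of Equation 1
x = layers.MaxPooling1D(pool_size=2)(x)                             # keep strongest local features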

D. Bidirectional-Long Short-Term Memory (Bi-LSTM) Layer
As mentioned above, the input to this layer is the final feature map obtained from the CNN. First, let us present some background about LSTM in order to understand Bi-LSTM. LSTM is a type of recurrent neural network (RNN) developed by [59]. The main objective of LSTM is to avoid the vanishing gradient problem encountered by RNNs, where the gradient that is propagated back through the network either decays or grows exponentially over time [60]. In other words, RNN is not an effective model for understanding the context behind an input with long-term dependencies. On the other hand, LSTM, with its complicated dynamics consisting of several so-called memory blocks, can effectively resolve this issue and make predictions based on time series data. Each memory block is composed of an input gate i_t, a forget gate f_t, an output gate o_t, and a cell state c_t. These elements are used to compute the hidden state h_t by the following composite function in Equation 2:

i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t) (2)

where x_t is the input vector to the memory cell at time t, σ is the sigmoid function, W are the weight matrices, and b are the bias vectors.
However, one downside of LSTM, inherited from RNN, is that it does not consider the whole context because the sequence is only read in a forward direction [16]. In this case, the context information provided by future words is dismissed, resulting in low classification performance. An extension of LSTM called bidirectional LSTM (Bi-LSTM) [61] was proposed to avoid this problem. In this work, we adopted this method to obtain more informative features based on the features given by the CNN. Bi-LSTM simultaneously trains two separate LSTMs of opposite directions (one forward and one backward) on the input sequence and then merges the outputs. More formally, as explained above, the input vectors are fed one at a time into the LSTM, making use of all the available input information up to the current time frame t to predict y_t. Applying the bidirectional notion enables the network to also use input information that arrives later, by delaying the output by a certain number M of time frames, i.e. using information up to t + M to predict y_t. The structure of Bi-LSTM is illustrated in Fig. 2, which shows the basic structure of a folded Bi-LSTM that computes the forward hidden sequence h→, the backward hidden sequence h←, and the output sequence y by iterating the forward and backward layers. This layer is also followed by a max-pooling layer, as outlined in the previous section.
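Continuing the Keras sketch from the CNN layer, the Bi-LSTM stage and its pooling might look as follows (a single-branch simplification with illustrative layer sizes; the actual model concatenates the outputs of parallel branches, as described next):

x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # forward + backward LSTMs
x = layers.GlobalMaxPooling1D()(x)           # max-pooling over time steps
feature_extractor = models.Model(inputs, x)  # emits the embedded feature vectors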
After this layer, we have a layer that concatenates the list of outputs from the previous layers into a single vector.
Conventionally, this vector is passed to a fully connected layer that applies weights over the generated features, f(W · x + b), to predict the class of a given input, where W ∈ ℝ^(m×k) is the weights matrix and f is the activation function. However, in this work, we replace these fully connected layers with a linear SVM to perform the prediction; more details in the next section.

E. Support Vector Machine (SVM) Classifier
As mentioned earlier, we present a model that combines CNN and Bi-LSTM with SVM, where CNN and Bi-LSTM are used for feature vector generation and SVM for sentiment classification. SVM was employed as it has proven to be an effective algorithm in many NLP tasks, including SA. The key idea behind combining these heterogeneous methods is to use each method's advantages to improve sentiment classification for the Arabic language. In general, SVM is a machine learning algorithm for binary classification problems introduced by [63]. The SVM classifier aims to find an optimal hyperplane that separates given data points into one of two classes by maximizing the margin. More formally, a linear SVM receives a concatenated vector of features associated with its labels and fits a hyperplane f(w, x) = w · x + b, computing the coefficients of the hyperplane as feature weights, where w is the normal vector to the hyperplane (also known as the weight vector) and b is the bias.
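As an illustration of how the embedded vectors can be handed to a linear SVM, consider the following sketch using scikit-learn's LinearSVC (feature_extractor continues the earlier Keras sketches; X_train, y_train, and X_test are assumed to be padded index sequences and binary labels, and the deep layers are assumed to have been trained beforehand, e.g. with a temporary sigmoid output):

from sklearn.svm import LinearSVC

train_feats = feature_extractor.predict(X_train)  # embedded vectors from CNN + Bi-LSTM
svm = LinearSVC()                                 # margin-based linear classifier
svm.fit(train_feats, y_train)
preds = svm.predict(feature_extractor.predict(X_test))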
The architecture of the proposed model is illustrated in Fig. 3. Our model's architecture is essentially the same as the basic CNN and Bi-LSTM, with the difference of using the SVM algorithm for classification. For further clarity, CNN or Bi-LSTM models typically use a fully connected network with an activation function (e.g. logistic) for learning and classification, as explained earlier. However, in this work, we replace the fully connected layer with a linear SVM, which uses a margin-based loss function for classification.

IV. EXPERIMENTS
This section presents the datasets, model settings, and evaluation metrics. For evaluation, we developed several experiments to compare our model's performance against baseline models. In particular, we used the individual classifiers of CNN and SVM as baseline models. Additionally, we compared the proposed model's results with some of the state-of-the-art deep learning models proposed in related studies. The experiments also explore the best pre-trained word embeddings architecture for the proposed model; for this purpose, each performance experiment is carried out with both CBOW and Skip-gram. To perform all the experiments in this work, we used the Keras library. More details about the experiments are provided in the following subsections.

A. Datasets
We used three publicly available datasets, namely, LABR, OMCCA, and the dataset presented in [35]. These datasets vary in size and writing form, where MSA and different Arabic dialects were detected. For example, LABR contains over 63,000 reviews written mostly in MSA with the presence of dialectal phrases. On the other hand, OMCCA consists of over 28,000 reviews written mostly in Jordanian and Saudi Arabic dialects. The last dataset contains 2,400 reviews written in the Jordanian dialect with the presence of MSA. All the datasets are annotated at the document level, and we considered only two polarity classes, positive and negative. The training/validation/testing split is set to 70%, 15%, and 15%, respectively, in our experiments. Due to the limited capabilities of the device used to perform the experiments, we considered only the randomly balanced versions of these datasets. Table I presents more details about the datasets.

B. Evaluation Metrics
The performance is quantified using the following evaluation metrics: F1-score, Precision, and Recall, computed as in Equations 3, 4, and 5:

Precision = TP / (TP + FP) (3)
Recall = TP / (TP + FN) (4)
F1-score = 2 × (Precision × Recall) / (Precision + Recall) (5)

where TP, TN, FP, and FN indicate true positives, true negatives, false positives, and false negatives, respectively.
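These metrics can be computed directly, for example with scikit-learn, assuming binary 0/1 labels and the predictions from the sketch in Section III:

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, preds)
recall = recall_score(y_test, preds)
f1 = f1_score(y_test, preds)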

C. Hyper-parameters Settings
This section describes the hyper-parameter settings for each model used in the experiments. Selecting the optimal hyper-parameters is a challenging task, and it varies based on the characteristics of the dataset. To this end, we performed several trials to choose the hyper-parameters with which the classifiers may yield the best performance. As a result, the hyper-parameters in Table II have been assigned to the layers of CNN and Bi-LSTM. It is worth mentioning that all the data points have been zero-padded before being passed to the deep learning models, so all the input vectors have the same length. When CNN is used as a baseline model, the word vectors are fed to a fully connected layer of 64 nodes. This dense hidden layer is trained and regularized using a ReLU activation function and the L2-norm method, respectively. The weights are then passed to an output layer with a sigmoid function to give the final classification probability. We also applied a dropout layer after each model and an early stopping strategy to reduce over-fitting, and we added a max-pooling layer after each layer to keep the most important features. During the networks' training process, the Adam algorithm is employed to perform the optimization with a learning rate of 0.001, and the binary cross-entropy function is used for loss minimization. A linear SVM classifier with its default parameters was employed for the classification process. Regarding generating word embeddings based on Aravec, after many trials, we decided to apply the Skip-gram and CBOW models built on Web pages, with 132,750,000 tokens and an embedding dimension of 300. For words that are not contained in the pre-trained model, the embedding is set to a vector of zeros.
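A minimal sketch of this training configuration in Keras is shown below; model denotes the deep stack with a temporary sigmoid output used to train the feature layers, and values other than the stated learning rate and loss (epochs, batch size, patience) are assumptions:

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer=Adam(learning_rate=0.001),
              loss='binary_crossentropy', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=20, batch_size=64, callbacks=[early_stop])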

V. RESULTS
Results of the proposed model against the baseline models on the three datasets are summarized in Table III. As can be seen, our model outperforms the other models on all datasets with both Skip-gram and CBOW. The results show a significant improvement in the F1-score of the proposed model on all datasets; the highest is achieved on the third dataset, with around 8% over CNN and SVM when Skip-gram is used and around 7% when CBOW is applied. Although the proposed model outperforms SVM and CNN with CBOW on OMCCA, the highest results are obtained using Skip-gram, with improvements of 7% and 8% over CNN and SVM, respectively. It is also noticed that the proposed model shows improvement on the LABR dataset, with around 6% and 7% over CNN and SVM, respectively, when Skip-gram is used, and around 5% over both models when CBOW is used.
Additionally, the proposed model performs very well in terms of recall and precision on all datasets, with a slight trade-off: the highest recall across all models, 91.6%, is achieved by our model on the third dataset, while the highest precision, around 91%, is achieved by CNN on the OMCCA dataset. Across the two word-embedding models, CNN achieved classification performance close to SVM, with slight superiority to the former. This might indicate that the representations captured by both baseline models are not enough to identify the latent connections between words, and that a richer representation is required for better performance.
Finally, after the proposed model has proven to be effective in Arabic sentiment classification on all the datasets used in this work, we compare our results with the results of other studies that employed state-of-the-art deep learning models. We chose work that used the LABR dataset, as it is the most common dataset in the Arabic SA literature. To guarantee a fair comparison, we only consider studies that experimented on balanced class labels and used the Aravec pre-trained model. In this sense, the work in [64] reported a significant improvement in Arabic sentiment classification based on a combination of Bi-LSTM and CNN on different datasets, including LABR. However, our proposed model outperforms their model: they achieved an F1-score of 76.9%, which is 10% less than the highest result of our model on LABR.

VI. CONCLUSION AND FUTURE WORK

This paper presented a model that uses a linear SVM classifier on top of a combination of CNN and Bi-LSTM for Arabic sentiment classification. Unlike the conventional architecture of a deep learning model, the proposed model was tailored by replacing the fully connected layer with an SVM classifier, which receives the embedded vectors extracted by CNN and Bi-LSTM. The experiments have shown the effectiveness of the proposed model, with a significant improvement over the baseline models. Furthermore, we showed that the proposed model outperforms one of the state-of-the-art deep learning models with a considerable improvement. Yet, there is room for improvement, as the proposed model does not address some issues that might affect the performance (e.g. negation handling). Further work is also required to explore the effectiveness of the proposed model with deeper layers and diverse architectures on different Arabic benchmark datasets. Additionally, we plan to extend the work so that it can be applied to other Arabic SA tasks such as aspect-based sentiment analysis.