An Ensemble of Arabic Transformer-based Models for Arabic Sentiment Analysis

—In recent years, sentiment analysis has gained momentum as a research area. This task aims at identifying the opinion that is expressed in a subjective statement. An opinion is a subjective expression describing personal thoughts and feelings. These thoughts and feelings can be assigned with a certain sentiment. The most studied sentiments are positive, negative, and neutral. Since the introduction of attention mechanism in machine learning, sentiment analysis techniques have evolved from recurrent neural networks to transformer models. Transformer-based models are encoder-decoder systems with attention. Attention mechanism has permitted models to consider only relevant parts of a given sequence. Making use of this feature in encoder-decoder architecture has impacted the performance of transformer models in several natural language processing tasks, including sentiment analysis. A significant number of Arabic transformer-based models have been pretrained recently to perform Arabic sentiment analysis tasks. Most of these models are implemented based on Bidirectional Encoder Representations from Transformers (BERT) such as AraBERT, CAMeLBERT, Arabic ALBERT and GigaBERT. Recent studies have confirmed the effectiveness of this type of models in Arabic sentiment analysis. Thus, in this work, two transformer-based models, namely AraBERT and CAMeLBERT have been experimented. Furthermore, an ensemble model has been implemented to achieve more reasonable performance.


I. INTRODUCTION
Given the tremendous amounts of Arabic digital content that has been produced in the last couple of years, an increasing number of research works have been devoted to the automatic processing of this language. In this regard, different techniques have been used to classify a specific text. Many studies have addressed this task by making use of basic machine learning models such as Naïve Bayes (NB) and Support Vector Machine (SVM). The authors in [1] addressed Arabic text classification using SVM and NB combined with the N-gram feature. The best accuracy of SVM was achieved without the N-gram, as for NB the best accuracy was achieved when the Ngram feature was considered. Whereas the authors in [2] introduced their Arabic Jordanian twitter corpus, then evaluated N-grams and stemming techniques together with TF-IDF or TF weighting schemes. Experiments have been carried out by making use of SVM and NB. Results showed that training SVM model on top of stems and bigrams using TF-IDF could give better performance compared to NB model. In a similar work [3], the authors performed sentiment analysis on Arabizi text which is Arabic text written in Latin alphabets. For experimentation purposes, the authors used NB and SVM classifiers. Besides, they evaluated the filtering step, which consists of removing stop words and mapping emojis to their corresponding Arabic words. Results indicated that SVM model outperformed NB model. However, filtering step did not greatly improve the accuracy.
Recently deep learning models such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) have proven to be efficient for analyzing Arabic content. Many researchers have relied on deep learning models to tackle Arabic sentiment analysis task. In [4] performed a binary sentiment classification in Arabic. Initially, they applied preprocessing to clean input texts. Next, a word embedding layer has been used to represent texts as numerical vectors to be fed to the LSTM layer. Finally, a SoftMax layer followed to predict the polarity of the text. The experiments showed quite good results with an accuracy ranging from 80% to 82%. In another work [5] the authors made an empirical comparison between deep learning models (LSTM, CNN) and other machine learning models for both binary and multiclass classification using different datasets. The paper showed that deep learning models are effective for larger datasets. In contrast, basic machine learning algorithms perform well on smaller datasets.
More recently, with the increasing popularity of transformer models, sentiment analysis task has been significantly improved in terms of performance. Transformer models have replaced deep learning models and achieved stateof-the-art results on many automatic language processing tasks such as sentiment analysis [6], named entity recognition [7], question answering [8], and many other tasks.
The ineffectiveness of the existing methods performing sentiment analysis in Arabic is the main motivation for proposing a transformer-based ensemble method. In the last few years transformer language models alone led to significant improvements in sentiment analysis. Hence, making use of the advantages of this type of models to investigate more reliable approaches is indeed necessary to tackle sentiment analysis in Arabic being a morphologically rich language. In this paper, we propose an ensemble model that combines the strengths of two transformer language models.
Many different disciplines have made significant use of ensemble approaches to address text classification. However, there is relatively few studies on the use of ensemble methods in Arabic sentiment analysis. The primary goal in this paper is to propose an ensemble model that combines two base transformer models namely AraBERT and CAMeLBERT into a single model. To the best of our knowledge this is the first www.ijacsa.thesai.org study that investigates the ensemble of transformer-based models in Arabic sentiment analysis. Experimental results demonstrate that the proposed ensemble method outperforms stand-alone classifiers and majority voting ensemble model. The rest of this paper is structured as follows. Related works will be introduced in Section II. The overall proposed methodology will be discussed in Section III. Experiments and results are given in Section IV. Then conclusions are drawn in Section V.

II. RELATED WORK
Given the effectiveness of transformer-based models, there have been various transformer models used in Arabic sentiment analysis. The widely utilized models are Multilingual BERT, AraBERT, and MARBERT [9]. The author in [10] addressed sentiment analysis in Modern Standard Arabic (MSA) and other Arabic dialects such as Levantine, Egyptian, and Gulf. The author opted for three-way classification according to three scales (positive, neutral, and negative) and using different algorithms, namely: Naive Bayes classifiers (NB), Support Vector Machine (SVM), Random Forest Classifier, and BERT model (Bidirectional Encoder Representations from Transformers). The best results are obtained with BERT model reaching an accuracy score of more than 83%. The author in [11] addressed sarcasm and sentiment detection using two variants of transformer-based models, namely AraELECTRA and AraBERT. Evaluation results showed that AraBERT performs the best in terms of accuracy for both sarcasm and sentiment detection. In a similar work, [12] addressed the same tasks: sarcasm detection and sentiment analysis. The authors have examined six BERT-based models including: MARBERT [13], QARiB [14], AraBERTv02 [15], GigaBERTv3 [16], Arabic BERT [17], and mBERT [18]. MARBERT achieved promising results for both tasks.
Several studies in the literature investigated ensemble methods in Arabic sentiments analysis. The authors in [19] investigated different deep learning models to improve Arabic sentiment analysis accuracy. The authors proposed an ensemble model combining a Convolutional Neural Network (CNN) model and a Long Short-Term Memory (LSTM) model. To evaluate their model, they used the ASTD dataset [20] which consists of 10000 tweets. In this work, they focused only on opinion classification, hence the objective class tweets were removed. To construct their ensemble model, they experimented different CNN models and LSTM models with different hyper-parameters. The best CNN model is obtained by configuring the parameter fully connected layer size to 100. As for LSTM, the best model is obtained by using a dropout rate of 0.2, based on the best CNN model and the best LSTM model they built an ensemble model where the final predicted class is obtained using soft voting. Results show that the ensemble model achieved better results in terms of accuracy and F1-score compared to LSTM model and CNN model. In another study [21] , the authors introduced their approach to address three SemEval related sentiment analysis subtasks in Arabic. First Subtask (A) addresses Message Polarity Classification, then Subtask (B) addresses Topic-Based Message Polarity Classification, finally Subtask (D) which addresses Tweet quantification. The authors proposed two systems, the first is developed using their previous proposed sentiment analyzer [22] based on a scored lexicon. The second system is an ensemble of three different classifiers namely Convolutional Neural Network using Word2vec, Multilayer Perceptron and Logistic Regression. Using voting between the three classifiers the authors determined the final outcome. Evaluation results showed that their systems outperformed all the other systems by achieving an accuracy of 0.58 and 0.77 on Subtask A and Subtask B respectively, as for Subtask D their system outperformed the other systems as well by achieving 0.127 in terms of KLD score.
It seems clear that none of the existing ensemble methods has addressed Arabic sentiment analysis by making use of transformer language models. Accordingly, this study will be focusing on investigating the use of transformer language models in Arabic sentiment classification as well as proposing an ensemble technique based on transformer models to enhance classification accuracy.

III. PROPOSED METHODOLOGY
This section presents the methodology proposed in this paper. We will be discussing the background of transformerbased models and the models adopted in this work. Then, describe the proposed ensemble model architecture.

A. Background
A variety of neural network architectures have been proposed and used for text classification tasks, including sentiment classification. Among these numerous architectures, the best adapted architecture to sequential data is recurrent neural networks (RNNs). They have demonstrated to be effective on data where elements order is important. For example, in a sentence, the order in which words occur has a significant impact on the meaning of that sentence. Since its introduction, RNNs have been the state-of-the-art for capturing and processing dependencies in sequences. Nevertheless, it also has its drawbacks, it has been proved that RNNs cannot process large sequences of text such as long paragraphs. Moreover, in practice, data is processed sequentially, which makes it difficult to perform parallel computing using RNNs. Recently, a new architecture called Transformers has been introduced. Similar to RNNs, transformers use attention mechanism and inherit the encoder-decoder architecture of the sequence-to-sequence models to deal with sequential data. However, its architecture does not involve recurrent networks in order to speed up the training process. Transformers were firstly introduced by [23], and they were initially designed to perform translation. As illustrated below in Fig. 1, a transformer consists of two blocks, on the left, the encoder stack, and on the right, the decoder stack. The encoder stack is made up of a multi-head self-attention layer and a fully connected feed forward network. In addition to these two layers, the decoder stack has one more layer called the masked multi-head attention layer. As transformer architecture does not rely on recurrence, word position is not provided. To preserve this information positional encoding technique has been introduced. In addition to the input embedding vector, a positional vector with the same dimension as the input embedding vector is added to capture the context of a word in a sentence. www.ijacsa.thesai.org Transformer-based models include three types of models: encoder-only, decoder-only, and encoder-decoder, following a brief introduction to each type of transformers, its architecture and its applications.

1) Encoder-only models:
In this type of transformers only the encoder part is needed. A vector representing the input sequence is fed to the first encoder block that consists of a bidirectional self-attention layer and a feed-forward layer, the output of this block is passed to the following encoder block, which itself is composed of two layers. Each encoder block tries to enrich the embedding vector with contextual information. The final encoder block outputs the last contextual encoding. This type of transformer is suitable for tasks such as text classification or named entity recognition. The most popular examples of this type of models are: BERT [18], ELECTRA [24], and RoBERTA [25].
2) Decoder-only models: In decoder models only decoder stack is used. It consists of N identical decoder blocks; a single decoder block is composed of three layers. The first layer is a masked multi-head attention layer, in which future information is masked and only previous positions in the input sequence have attention. Similarly, to the encoder block, the next layers are multi-head self-attention layers and a fully connected feed-forward network. Decoder-only based models are also called autoregressive models and are more suited for tasks such as text generation. The most widely used models trained with decoder-only architectures are GPT (Generative Pre-trained Transformer) [26].
3) Encoder-decoder models: Also called sequence-tosequence models, this type of models is implemented using both blocks: encoder block and decoder block. In the encoder block, the whole sequence is considered, whereas in the decoder block for a given word, only the words that precede are considered. Encoder-decoder models are best suited for tasks that involve the input of a sequence of items (words, letters, etc.) and then outputs another sequence. This architecture can be applied in the case of machine translation or question answering, where a sequence of words is treated sequentially and the result is also a sequence of words. Recently, many encoder-decoder based models have been introduced.

B. Transformer Language Models for Arabic
Transformers were initially introduced as a novel architecture for translation. Ever since, they have been mostly used for natural language processing. In sentiment analysis task, pretrained transformer language architectures have significantly improved the performance of models. Each model has its own size and trained on different type of datasets. Table  I summarizes the most applied models in Arabic text classification.
In this paper, we have selected some of the most effective Arabic transformation models in sentiment analysis in Arabic. Each of these models is based on different architectures and trained using different Arabic variants. Hereafter, we discuss each of the selected models and their architecture.

1)
AraBERT [27] pretrained BERT model using a pretrained dataset of 70 million sentences, collected from Wikipedia dumps, Arabic news websites and two large corpora: Abulkhair Arabic Corpus [31] and OSIAN [32]. AraBERT comes in two versions AraBERTv0.1 and AraBERTv1. In this study AraBERTv0.2 is used for experiments. 2) CAMeLBERT [28] implemented their Arabic pretrained language model on top of three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic. The authors evaluated the proposed model by making use of 12 datasets to address five tasks: Named Entity Recognition, POS tagging, sentiment analysis, dialect identification, and poetry classification.

C. Ensemble Models
Ensemble learning is a technique that combines multiple machine learning models to improve the performance of the learning model and achieve a higher accuracy score than would be achieved by any single model in the ensemble. In this study, two ensemble techniques have been evaluated. The first technique is the majority voting. It is the most commonly used technique for ensemble learning. The second technique is based on calculating the sum of raw outputs of each model in the ensemble.

1) Majority voting:
In majority voting, the final output of the ensemble model is determined by counting for each class the number of votes of multiple models. The class with the majority of votes is predicted.
2) The proposed method SUM: As illustrated in Fig. 2, that represents the proposed ensemble model. Firstly, a raw text is fed to the model as input, then transformed into vector representation so that it can be processed with encoderdecoder approach. Then the decoder-block of each model outputs probabilities for each class. Finally, the output is obtained by calculating the weighted sum of the probabilities from the same class, then for each class, the argmax operation is applied to find the class with higher probability value.
For a given text, let and denote the probabilities predicted with AraBERT model for the class Negative and the class Positive respectively. Whereas, and denote the probabilities predicted with CAMeLBERT model for the class Negative and the class Positive respectively. For each class, the final probability is obtained by calculating the weighted sum of both probabilities, namely the probability given with AraBERT model and CAMeLBERT model. Weights values are not selected randomly, the main reason of selecting weight values 0, 7 and 0, 3 for CAMeLBERT and AraBERT respectively, is that CAMeLBERT model tend to perform well on the majority of the datasets (see Table II). Thus, we considered 70% of the probability generated by CAMeLBERT model and 30% of the probability generated by AraBERT model. The final probabilities are calculated using the following equations: Next, the final output corresponds to the index with higher probability value.
Therefore, if the output index is 0, the model will assign to the input text the class Negative, and if the output index is 1 the model will assign Positive.

IV. EXPERIMENTS
In this section, we discuss the implemented models and their results. Two transformer language models are experimented namely CAMeLBERT and AraBERT, an ensemble of these two models, as well as majority voting ensemble model.

A. Dataset
For experimentation purposes, in this work we consider four datasets of different sizes and sources. The first dataset is Twitter dataset collected by Abdulla et al,. [33] composed of about 2000 tweets, written in MSA and Jordanian dialect. And consisted of 958 negative tweets and 993 positive tweets. The second dataset is Arabic Gold-Standard Twitter dataset collected by Refaee and Rieser [35] composed of 6512 tweets, divided in three classes: Negative, neutral, and positive. The negative class contains 1941 tweets, the neutral class contains 3694, and the positive class contains 876 tweets. In this study we perform a binary classification, thus only Positive and negative classes are utilized. The third dataset is Arabic sentiment tweets dataset (ASTD) proposed by [19] which contains 10006 tweets written in MSA and Arabic dialect. Tweets are labeled as one of four classes: negative, positive, neutral, and objective. As this study focuses on binary classification only positive and negative tweets are considered. After data preprocessing, we are left with 1684 negative tweets and 799 positive tweets. The fourth dataset is a dataset that we have proposed in a previous work [36] which is consisted of 1299 Modern Standard Arabic books reviews with a balance between positive and negative reviews. Reviews are collected from Goodreads website and annotated manually. After data collection, each given text is decomposed into tokens. Then, Arabic stop words are filtered out as they do not hold any information. The next step is normalization, where elongation, hamza, and diacritics are removed. Finally, all emoticons and emojis are deleted based on a preselected list of the most commonly used emoticons and emojis.

B. Results
To investigate the effectiveness of the proposed ensemble model three models have been implemented, including two transformer language-based models: AraBERT and CAMeLBERT and majority voting model. All models are trained on the same training set, which represents 80% of the whole dataset, and tested on the same testing set composed of the 20% remaining data. For each of the four datasets the models are trained and tested on both balanced and unbalanced datasets. Performance results of the proposed ensemble method are compared with stand-alone models and majority voting ensemble model. The models are evaluated using accuracy, F1score, precision and recall metrics. The mathematical formulas of each of the used metrics is defined as follows: www.ijacsa.thesai.org Where, TP, FP, TN, and FN refer to "True Positive", "False Positive", "True Negative", and "False Negative" respectively. Table II shows the performance of each model on balanced and unbalanced datasets compared to existing models. As can be seen, it is clear that ensemble models provide remarkable improvement over baseline models. Majority voting model has achieved the best accuracy score on unbalanced Gold dataset with an accuracy score of 90.78% followed by our ensemble model with 89.01%, whereas on Twitter, and ASTD datasets the best accuracy score is achieved by our proposed ensemble model SUM. On Twitter dataset SUM model achieved an accuracy score of 96.16% with unbalanced dataset followed by majority voting model and CAMeLBERT model with the same accuracy score of 95.65%. As for ASTD dataset our proposed model outperformed all other models by achieving the first best accuracy score of 89.38% on balanced dataset, the second-best accuracy score is achieved by majority voting model with 87.53% on unbalanced dataset.
The results of all proposed models on our dataset are shown on Table III. AraBERT model has achieved better results than CAMeLBERT in terms of accuracy and F1-score. On the other hand, the performance of ensemble models varies from one model to another. Compared to the proposed ensemble model, majority voting model has failed to improve the performance. It has achieved an accuracy of 94.98% against an accuracy of 95.75% achieved by AraBERT. Whereas, our proposed ensemble model has reached the best results in terms of accuracy and F1-score. The mediocre performance of majority voting may be explained by the size of the dataset and the number of combined models. In summary, the best results have been achieved by our proposed ensemble model on balanced datasets. Thus, it is obvious from the conducted comparative experiments that training models on balanced data can improve classification performance, it can help models to learn better and achieve better accuracy results.

V. CONCLUSION
In this work, we have implemented an ensemble model based on two transformer language models, namely AraBERT and CAMeLBERT. The proposed ensemble model was evaluated on top of our balanced dataset composed of modern standard Arabic book reviews. In addition, to investigate more the performance of our proposed model it has been trained on top of three other datasets namely Twitter dataset, Gold dataset and ASTD dataset. Compared to majority voting and the two stand-alone transformer-based models, our proposed ensemble model has achieved the highest score of accuracy and F1 metrics on all datasets. In this paper, we have proposed a domain-independent model, the proposed ensemble model has achieved state-of-the-art on several datasets of different sources and domains. Thus, researchers can adopt our proposed model to address sentiment analysis in Arabic regardless of data type (MSA/Dialect) and domain. To continue working towards improving the model's performance, for future work, we plan to experiment more transformer models, combine multiple models and evaluate all possible combinations to determine the optimized model. Finally, we will be considering increasing the size of our training set as accuracy increases with the size of training data.