Utilizing Deep Learning in Arabic Text Classification: Sentiment Analysis of Twitter

Abstract—The number of social media users has increased. These users share and reshare their ideas in posts, and this information can be mined and used by decision-makers in different domains, who analyse and study user opinions on social media networks to improve the quality of products or to study specific phenomena. During the COVID-19 pandemic, sentiment analysis of social media was used to inform decisions to limit the spread of the disease. Substantial research on this topic has been done; however, Arabic textual resources on social media are limited, which has resulted in fewer quality sentiment analyses of Arabic texts. This study proposes a model for Arabic sentiment analysis using a Twitter dataset and deep learning models with Arabic word embedding, applying supervised deep learning algorithms to the proposed dataset. The dataset contains 51,000 tweets, of which 8,820 are classified as positive, 37,360 as neutral, and 8,820 as negative; after cleaning, it contains 31,413 tweets. The experiment was carried out by applying the deep learning models Convolutional Neural Network and Long Short-Term Memory, and comparing the results with different machine learning techniques such as Naive Bayes and Support Vector Machine. The accuracy of the AraBERT model is 92% when tested on 3,505 tweets.


I. INTRODUCTION
Recently, sentiment analysis has been prioritized by researchers because it plays an important role in many domains. It is primarily used to study user feedback (user opinion) on a specific event, product or social phenomenon. Many studies have proposed models, approaches or novel databases to predict and detect user opinions. These methods use machine learning classifiers, deep learning models and natural language techniques as pre-processing methods. Most sentiment analysis research focuses on languages other than Arabic. Recent Natural Language Processing research is now increasingly focused on deep neural learning [1]. Some research initiatives are being launched in a competition funded by the King Abdullah University of Science and Technology (KAUST); these focus on the Arabic language, alongside some individual research efforts.
Generally, work in other languages, specifically English as the universal language, has proven significant due to the vast amount of data contributed by users on social networks (Facebook, Twitter, etc.). In machine learning, the form of classification known as supervised learning is used in sentiment analysis. The methods used in sentiment analysis can be categorized into binary classification, multi-classification, polarity, multilingual and aspect-based sentiment analysis. In binary classification, the classes can be represented only as positive and negative; in multi-class classification, there are more than two classes. Additionally, there are classifiers used in binary classification such as DT and TH, while KNN and LR are used in multi-classification. Polarity in sentiment analysis is based on a dictionary that assigns a score to each word. Multilingual sentiment analysis requires many pre-processing steps to be performed for opinion detection, while aspect-based sentiment analysis is focused on one aspect, concept or word.
To the best of our knowledge, less attention has been given to Arabic sentiment analysis, and there are fewer public Arabic datasets [2]. Therefore, this paper proposes a model for Arabic sentiment analysis based on the proposed dataset. This work uses supervised deep learning algorithms. The original dataset, before the cleaning process, contains 51,000 tweets classified as 8,820 positive, 37,360 neutral and 8,820 negative. After cleaning, it contains 31,413 tweets classified as 4,855 positive, 21,842 neutral and 4,716 negative. This work introduces and applies deep learning methods to multi-class Arabic sentiment analysis with parameter optimization, and improves the text pre-processing stage. We apply the deep learning methods Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) and compare the results with different supervised machine learning techniques such as Naive Bayes (NB) and Support Vector Machine (SVM). The accuracy of the best CNN model is 95.8% and the accuracy of LSTM is 96.6%, which are better than the SVM and NB results of 82.5% and 69.4%, respectively. We used BERT pre-trained specifically on Arabic to achieve the same success that BERT achieved in English [3]. Based on a review of the literature and the high accuracy achieved by the deep learning models, the main contributions of this paper can be summarized as follows:
• Develop a model for Arabic sentiment analysis using machine learning and deep learning models.
• Explore the most recent approaches to Arabic sentiment analysis.
• Propose a novel dataset called ASAD that is publicly available.
• Perform a comparative analysis of the results of the machine learning classifiers (MLC) and deep learning models (DLM).
The remainder of this paper is organized as follows: Section II presents an overview of related studies on sentiment analysis. Our research methods and materials are explained in Section III. Section IV presents the results. The conclusion is in Section V.

II. RELATED STUDIES
Scholars have not given enough attention to Arabic sentiment analysis. The Arabic Sentiment Analysis Dataset (ASAD) [4] provides a comprehensive overview of a new Twitter-based benchmark dataset for Arabic sentiment analysis. ASAD is a massive, high-quality annotated dataset (including 95,000 tweets) with three-class sentiment labels (positive, negative and neutral), compared to other publicly released Arabic datasets. In [5], Twitter sentiment analysis is researched using three machine learning algorithms, Logistic Regression, Support Vector Classification and NB, with two sets of features. In [6], the word frequency approach, word embeddings and machine learning classifiers correctly identify rumour-related tweets with 84% accuracy, classifying tweets into four categories: academic, media, government and health.
In [7], two new Arabic text categorization datasets are introduced. The first consists of Twitter, Facebook and YouTube posts from well-known Arabic news channels, and the second consists of tweets from popular Arabic accounts. The posts in the former are almost entirely written in Modern Standard Arabic (MSA), while the tweets in the latter contain both MSA and dialectal Arabic.
In [8], Word2Vec models were built from a broad Arabic corpus collected from 10 newspapers in different Arab countries. The authors report increased sentiment classification accuracy (91%-95%) after applying various machine learning algorithms and convolutional neural networks with different text feature choices.
In [9], an in-house-built dataset of tweets and comments is presented, to which three classifiers were applied: NB, SVM and K-Nearest Neighbour (KNN). The findings show that SVM provides the highest accuracy, while KNN (K=10) provides the highest recall.
In [10], four classifiers, namely NB, SVM, Multinomial Logistic Regression and K-Nearest Neighbour, were trained on a dataset consisting of 4,712 tweets to conduct a comparative study of classifier output. When run against the tweets dataset, these algorithms revealed that SVM gives the highest F1 score (72.0), while KNN (K=2) achieved the best accuracy, equivalent to 92.0.
In [11], the processes of gathering Twitter data and filtering, pre-processing and annotating the Arabic text to create a large dataset for Arabic sentiment analysis are summarized. In the sentiment analysis experiments, machine learning algorithms (NB, SVM and Logistic Regression) as well as deep learning with CNN were applied to the health dataset.
Several versions of RNN and CNN classifiers using GloVe-based word embedding were introduced in [12]. All classifiers performed well, with the best achieving accuracies between 90% and 91%. Experimental findings indicate that BRAD 2.0 is rich and stable. To encourage further study in the field of Arabic computational linguistics, the benchmark dataset was made available as the key contribution.
In [13], a new hybrid model combining CNN and LSTM is proposed, using vector representations of sentences and the SoftMax regression classifier to identify sentiment tendencies in text. In [14], a method for evolving a CNN and creating an Arabic sentiment classification system is proposed using the differential evolution (DE) algorithm. In [15], a novel architecture for Arabic word classification and understanding is proposed, based on CNNs and recurrent neural networks, that addresses the difficulty of handling unstructured social media texts under low data availability. The work in [16] attempts to identify expressions related to feelings, such as happiness, rage, anxiety and sadness. In addition, it presents emotion classification in Arabic tweets using CNNs and compares them with the machine learning methods SVM, NB and Multi-Layer Perceptron (MLP).
In [17], an Arabic sentiment analysis corpus culled from Twitter is presented, consisting of 36,000 tweets categorized as positive or negative, plus 8,000 manually annotated tweets used to assess the corpus intrinsically by comparing it to human classification and pre-trained sentiment analysis models, with an accuracy of 86%. In [18], a survey is proposed that focuses on 90 recent research papers (74% published after 2015). In [19], supervised and unsupervised transformation methods, such as principal component analysis (PCA) and latent Dirichlet allocation (LDA), are presented. They are tested on five Arabic opinion text datasets of various domains and sizes (1.6-94,000 reviews). In the two-class classification problem, accuracy values range from 95.5%-99.8%, and in the three-class classification problem, accuracy values range from 92%-97.3%.
In [20], a new study is presented to develop a model to predict an individual's awareness of precautionary procedures. Tweets related to COVID-19 were collected from the five main regions in Saudi Arabia, and the accuracy level achieved was 85%. A systematic comparative overview of the most appropriate methods for analysing Arabic sentiment is presented in [21]. It carries out a thorough comparison of various machine learning methods for Arabic sentiment analysis, such as NB, SVM, CNN, LSTM and several recently developed language models; the models achieve F-scores of 0.69, 0.76 and 0.92. A method for extracting knowledge from Arabic text on social media in four stages (data collection, cleaning, enrichment and availability) is shown in [22]. It offers an integrated solution to the challenges of pre-processing Arabic text on social media. This was undertaken to investigate the performance metrics given in [23,24,25,26,27,28,29] and validates the proposed model for small- and large-scale datasets. Disambiguation using deep learning techniques with an Arabic corpus is presented in [30]. An Arabic model for text clustering using word embedding and Arabic WordNet is presented in [31].

III. METHODOLOGY
This section describes the research methodology used to conduct this research. It consists of six main interrelated phases: text retrieval; text pre-processing; tokenization and feature extraction; application of the deep learning models; model performance evaluation, as shown in Fig. 1; and use of transfer learning by applying the AraBERT model.
The analysis concentrated on three classes of tweets: positive, negative and neutral. A large number of collected tweets was necessary to perform the sentiment analysis experiment, and the total collection (51,000 tweets) contained a lot of noisy data. The model architecture is shown in Fig. 1. The first phase in this work is text retrieval; the second is text cleaning; the third is tokenization; the fourth is embedding the text using the Word2Vec corpus in [32]; the fifth is applying the deep learning models; and the last phase is evaluating the results.

A. Tweet Text Retrieval
This section describes the first phase, text retrieval. Because the text characters are in Arabic, we implemented an Arabic text retrieval module in Python using the Tweepy Twitter API library.

B. Text Pre-Processing
Improving the accuracy of the text classification required enhancing the text features [22]. We added some feature selection improvements, such as noise removal; removing the noisy characters from the text enhanced the word representation. The following steps were run on the text to remove the noisy data:
• Remove the advertisement tweets.
• Remove the retweeted tweets, which start with the segment "RT".
• Remove the duplicate tweets, which were retrieved more than once.
1) Normalization: Normalization is a pre-processing method of text data cleaning that formats a sequence of texts into a standard, uniform form [33]. Arabic texts are difficult to analyse because of the nuances of the language, both in terms of structure and form. Arabic has an abundance of diverse inflexions, dialects and spellings that change the meanings of words, and it uses special marks (diacritics) rather than vowels, which differ according to the shape of the word. This method is necessary and useful in word processing to minimise uncommon terms and increase classification accuracy. The following steps were applied to the stored tweets in the dataset at two levels: the first level retained the emoji and the second level removed the emoji in the data collection phase.
• Remove digits, non-Arabic letters, single letters, punctuation, diacritics and special characters ($, %, &, #, .).
We performed pre-processing operations on the data, such as removing the stop words, special characters such as "@" and "#", URLs, non-Arabic characters and punctuation, to create a clean text for analysis. In addition, we used NLTK (https://www.nltk.org) word tokenization on the text data (refer to Table I). Table II shows the dataset before and after cleaning. As the dataset published by [34] contains IDs and annotations only, the first step was to retrieve the tweet text using the authorized API object from Tweepy, using the API to retrieve all IDs from the file one by one.
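The noise-removal and normalization steps above can be sketched as follows; the regular expressions and the character ranges are illustrative assumptions, not the exact patterns used in this work:

```python
import re

ARABIC_LETTERS = r"\u0621-\u064A"            # core Arabic letter block
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tashkeel (diacritic) marks

def clean_tweet(text: str) -> str:
    """Remove URLs, mentions, hashtags, diacritics, digits and
    non-Arabic characters, then collapse whitespace."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)          # URLs
    text = re.sub(r"[@#]\w+", " ", text)                   # mentions, hashtags
    text = DIACRITICS.sub("", text)                        # diacritics
    text = re.sub(rf"[^{ARABIC_LETTERS}\s]", " ", text)    # digits, Latin, punctuation
    text = re.sub(rf"\b[{ARABIC_LETTERS}]\b", " ", text)   # single letters
    return re.sub(r"\s+", " ", text).strip()
```

The same function can serve both pre-processing levels by optionally whitelisting the emoji range before the non-Arabic filter.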
3) Tokenization: Tokenization is the method used to break down text into individual tokens separated by white space. Tokenization removes all special characters, determines phrase boundaries, and processes abbreviations and numbers [35]. Due to the morphological complexity of the language, the number of tokens derived from an Arabic word can exceed four. Since Arabic words often contain many affixes and clitics, the tokenization process was preceded by a segmentation process to eliminate suffixes.
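A minimal illustration of whitespace tokenization preceded by a light suffix-segmentation pass; the clitic list here is a small assumed sample, not the segmenter actually used in this work:

```python
# Common attached pronoun suffixes (an assumed, non-exhaustive sample).
SUFFIXES = ("ها", "هم", "كم", "نا", "ه", "ك")

def tokenize(text: str) -> list[str]:
    """Split on white space, then split off a trailing clitic
    when the remaining stem is long enough to stand alone."""
    segmented = []
    for tok in text.split():
        for suf in SUFFIXES:
            if tok.endswith(suf) and len(tok) > len(suf) + 2:
                segmented.extend([tok[: -len(suf)], suf])
                break
        else:
            segmented.append(tok)
    return segmented
```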

C. Word Vectors Lookup
Word embedding [36] is a language modelling and feature learning technique in which each word is mapped to a vector of real values in such a way that words with similar meanings have similar representations. The vector values can be learned using neural networks. Word2Vec, which has models such as skip-gram and continuous bag of words (CBOW), is a widely used word embedding method. Both models are based on the likelihood of words occurring in proximity to one another. Skip-gram starts with a word and predicts the words that are likely to accompany it; CBOW reverses this, predicting a word that is likely to occur based on particular background terms. The CBOW model used to learn domain-specific word embeddings was trained on a large amount of Arabic text collected from the free online encyclopaedia Wikipedia (2,000,000 Word2Vec word vectors) to create the corpus defined in [31].
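The relationship between CBOW and skip-gram described above can be illustrated by the (context, target) pairs both models train on; the window size is an assumed example:

```python
def cbow_pairs(tokens: list[str], window: int = 2) -> list[tuple[list[str], str]]:
    """Build (context, target) pairs: CBOW predicts the target word
    from its context; skip-gram trains on the same pairs reversed."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs
```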
The Word2Vec corpus was used to look up all word vectors. In [11], it is suggested that pre-trained word embeddings trained on very large text corpora, such as the free Word2Vec vectors trained on 100 billion Google News tokens, can provide universal features for use in natural language processing.
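The vector look-up phase can be sketched as a simple table look-up with a zero-vector fallback for out-of-vocabulary words; the tiny 4-dimensional table below is a stand-in for the 2,000,000-vector corpus of [31], and the entries are invented for illustration:

```python
EMBED_DIM = 4  # real Word2Vec models typically use 100-300 dimensions

# Illustrative stand-in for the pre-trained Word2Vec table.
word_vectors = {
    "جميل": [0.2, 0.1, -0.3, 0.7],   # "beautiful"
    "سيء":  [-0.5, 0.4, 0.1, -0.2],  # "bad"
}

def lookup(tokens: list[str]) -> list[list[float]]:
    """Map each token to its embedding; unknown words get zeros."""
    zero = [0.0] * EMBED_DIM
    return [word_vectors.get(tok, zero) for tok in tokens]
```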

D. Deep Learning Models
This section presents the two types of deep learning models used in the experiments. Deep learning is powerful for feature extraction in text classification. In this work, we improve the CNN model in [13], as shown in Fig. 3, and the LSTM model, as shown in Fig. 4, with parameter optimization to improve the classification accuracy, and compare them with different ML methods (KNN, NB and SVM). Additionally, we propose a new CNN model as shown in Table III.

IV. RESULTS

A. Experiment Settings
This section describes the settings of all three experiments. The experiment settings and hyperparameter tuning were performed to improve the accuracy of the model. In the first experiment, each CNN layer has various parameters, such as the number of filters, kernel size, strides, padding, dropout rate, batch size and activation function; all CNN parameters are tuned and optimized to achieve high accuracy. In the second experiment, LSTM is used, with the embedding layer feeding the LSTM layers as in the CNN model. In the third experiment, the CNN and LSTM models with N-gram ranges were applied to achieve high model performance: the first run used the N-gram ranges method along with CBOW embedding vectors, and the second used the N-gram ranges method with skip-gram embedding vectors. All the parameters and details of the experiments are shown in Table II. In these experiments, we applied the SoftMax activation function, as shown in formula 1. The key benefit of using SoftMax is that it outputs probabilities: each value ranges from 0 to 1, and the sum of all the probabilities equals 1. When the SoftMax function is used in a multi-classification model, it returns the probability of each class, and the target class should have the highest probability.
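The SoftMax activation referenced above as formula 1, σ(z)_i = e^{z_i} / Σ_j e^{z_j}, can be sketched with the standard library; the scores below are illustrative, not values from the experiments:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Map raw class scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three classes: positive, neutral, negative (illustrative scores).
probs = softmax([2.0, 1.0, 0.1])
```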

B. Experimental Results
This section explains the results of the experiments. Several experiments were conducted, and these can be categorized into three main experiments, namely, the experiment based on the CNN model, the experiment based on the LSTM model, and the experiment based on the feature methods. The results of the experiment based on the CNN model are shown in Table III, Table IV and Table V. Fig. 5 shows a scatter plot diagram of the labelled tweets against the predicted labels. In addition, this figure explains the correlation with different epochs compared to the pre-processing approaches, and whether the emoji were retained or removed.
2) Experiment based on the LSTM model: In this experiment, the LSTM model has been utilized. LSTM is used to enhance the memorisation of important information; in text classification, LSTM is applied to strings of multiple words to identify the class to which each belongs. In this experiment, the dataset has been divided into three groups with different dataset sizes of 3,000 tweets, 15,000 tweets and 31,000 tweets, respectively. The results of this experiment are shown in Table VI.
3) Comparison of retaining and removing emoji: Table VII shows the results of our models' accuracy with the maximum number of tweets (31,000) and compares the classification accuracy when emoji are retained or removed from the content of the tweets. Training and testing accuracy for the CNN and LSTM models is shown in Fig. 6, which also shows the accuracy of retaining or removing emoji. This comparison was implemented with different epochs, up to 100 epochs. The accuracy of CNN-1 is 95.8% with emoji retained and 95% with emoji removed. The accuracy of CNN-2 is 82.7% with emoji retained and 70% with emoji removed; the CNN-1 model is more accurate than CNN-2. The accuracy of LSTM is 95.5% with emoji retained and 96.6% with emoji removed. When SVM and NB are applied, the results are 82.5% and 69.4%, respectively. Fig. 8 shows the comparison between the deep learning accuracies with emoji retained and removed.
By applying the same models with N-gram and with skip-gram, the results in Table VIII and Table IX, represented in Fig. 7 and Fig. 8, show that the LSTM model is better with both CBOW and skip-gram. After training the dataset on the AraBERT model using the parameter list shown in Table X, the training accuracy (Fig. 9) of the AraBERT model is 92% when the test is applied on 3,505 tweets.

Fig. 3. CNN architecture

As the loss function in all experiments, Categorical Cross-Entropy loss has been utilized to train the CNN to output a probability over the C classes for each tweet; the target is three classes, with the SoftMax activation function at the last layer. The combination of SoftMax activation and Cross-Entropy loss is shown in formula 2:

CE = −log(e^{S_p} / Σ_j e^{S_j})    (2)

where S_p is the CNN score for the positive class. Before training the model, the dataset is divided into 25,130 training samples and is validated on 6,283 samples.
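A small stdlib sketch of Categorical Cross-Entropy over SoftMax as described above, with illustrative scores; the target index 0 stands for the positive class:

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Map raw class scores to probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(scores: list[float], target: int) -> float:
    """CE = -log(e^{S_p} / sum_j e^{S_j}), where S_p is the score
    of the target class."""
    return -math.log(softmax(scores)[target])
```

The loss shrinks toward zero as the score of the target class dominates the others.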

1) Experiments based on the CNN model: In this experiment, the CNN model applied two different architectures, as shown in Table III, Table IV and Table V.

Fig. 6. Deep learning accuracy comparing retaining and removing emoji

4) Experiments based on feature methods: In this experiment, two methods have been utilized to examine the accuracy of the model performance. In the first experiment, the two methods used are N-gram and CBOW. The results are shown in Table VIII.
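The N-gram ranges method used in these feature experiments can be sketched as follows; the range (1, 2) in the usage example is an assumed setting, not necessarily the one used in the experiments:

```python
def ngram_range(tokens: list[str], n_min: int, n_max: int) -> list[str]:
    """Collect all contiguous word n-grams for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams

# Example: unigrams and bigrams over a three-word tweet.
features = ngram_range(["a", "b", "c"], 1, 2)
```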

Fig. 8. Skip-gram and N-gram results comparison

In the second experiment, skip-gram and N-gram have been utilized.
5) AraBERT model: In this experiment, AraBERT for Sequence Classification (a Transformer-based model for Arabic language understanding) is applied, with the parameters shown in Table X.

Fig. 9. Results of the training for the AraBERT model

V. CONCLUSION AND FUTURE WORK

The results of this work show that there is an improvement in CNN model accuracy when emoji are retained in the text content, while LSTM is more accurate when emoji are removed. These results are summarized in Table II. The accuracy of CNN-1 is 95.8% with emoji retained and 95% with emoji removed. The accuracy of CNN-2 is 82.7% with emoji retained, whereas its accuracy with emoji removed is 70%; CNN-1 outperforms CNN-2 in terms of accuracy. The accuracy of LSTM is 95.5% when the emoji are retained and 96.6% when the emoji are removed. When we use SVM and NB, the outcomes are 82.5% and 69.4%, respectively. The accuracy of the AraBERT model is 92%. In this work, we have shown that the LSTM architecture is the most suitable for the analysis of Arabic tweets. In the future, we can build a new system to analyse Arabic texts using the modern GPT-3 model from OpenAI and apply sentiment analysis to this dataset.

TABLE III. LSTM ARCHITECTURE LAYERS

TABLE VI. EVALUATION RESULTS OF OUR MODELS' ACCURACY WITH DIFFERENT DATASET SIZES

TABLE VII. EVALUATION RESULTS OF OUR MODELS' ACCURACY WITH 31,000 TWEETS, WITH EMOJI RETAINED AND REMOVED

TABLE VIII. CBOW WITH N-GRAM RESULTS COMPARISON

TABLE IX. SKIP-GRAM AND N-GRAM RESULTS COMPARISON