Developing Lexicon-based Algorithms and Sentiment Lexicon for Sentiment Analysis of Saudi Dialect Tweets

Most studies on sentiment analysis, and on the Arabic lexicon-based approach in particular, focus on preprocessing the text collected from Twitter (the Twitter dataset) rather than on the lexicon itself. This study proposes a new method: we concentrate first on building a new sentiment lexicon with a reasonable number of words, and then apply adequate preprocessing to the lexicon's words as well as to the Twitter dataset. The study presents a Saudi dialect sentiment lexicon called SaudiSentiPlus, which contains 7139 words, most of them generated from Saudi tweets and other dictionaries. The study also presents two lexicon-based algorithms for the Saudi dialect that deal with prefix and suffix letters in order to increase the performance of the proposed lexicon. The experiment conducted to evaluate the performance of SaudiSentiPlus comprises four phases, and precision, recall, accuracy, and F-Score are measured in every phase. We built our testing dataset from Twitter by focusing on Saudi dialect hashtags (971 thousand tweets from 162 hashtags). The results show that the accuracy of SaudiSentiPlus with the two lexicon-based algorithms reached 81%.


I. INTRODUCTION
A Social Network Site (SNS) is a platform that enables people to share their opinions on any issue and to build social relations with individuals within and beyond their social circle [1].
Twitter is one of the most popular SNSs and has been growing rapidly in recent years. The number of Twitter users has increased by more than 500% since 2009 [1]. Twitter users express their feelings and opinions, or spread news and facts, about 200 billion times annually via their tweets: 500 million tweets per day, 350,000 per minute, and 6,000 per second [2].
In 2014, the total number of active Twitter users in the Arab world reached 5,797,500. The country with the highest number of active Twitter users in the Arab region is Saudi Arabia, with 2.4 million users, accounting for over 40% of all active Twitter users in the region. The estimated number of tweets produced by Twitter users in the Arab world in March 2014 was 533,165,900, an average of 17,198,900 tweets per day [3] (see Fig. 1).
As of 2019, Saudi Arabia is ranked fourth in the world, with around 10 million active users, after the United States, Japan and the United Kingdom (see Fig. 2) [4]. This means that a massive amount of content and opinions of Saudis toward phenomena, topics, institutions or individuals can be obtained and studied via Twitter. This content can be either objective (e.g., news, facts) or subjective (e.g., opinions or sentiments about entities). Opinion mining, also referred to as sentiment analysis, is the research discipline that aims to analyze individuals' sentiments or opinions toward entities such as topics, people, issues, organizations or events [5] and to classify them as negative, positive or neutral.
Saudis speak and write in Arabic, and few of them are fluent in English. Arabic is the fastest growing language on the web (8,917.3%) and is ranked fourth among languages on the web, as illustrated in Table I [6].
Arabic has many variants, which can be grouped into three categories. The first is classical Arabic, the language of the Qur'an. The second is Modern Standard Arabic (MSA), which is used in formal speech and writing. The third is informal or dialectal Arabic, which refers to all the oral varieties spoken in daily communication across 27 Arabic countries, and which differ from one area to another within the same country [7].
According to Darwish and Magdy [8], Arabic social media users tend to use Arabic dialects online rather than MSA. Likewise, Saudis use their colloquial language on social media, and on Twitter in particular, which makes studying their opinions or performing sentiment analysis on their tweets a challenging task. On social media, colloquial or dialectal Arabic is changeable and contains word elongations and nonstandard spellings. Consequently, sentiment analysis based on a lexicon of standard formal Arabic is inefficient, since it cannot capture the colloquial or dialectal language of social media text. Thus, there is a need for a more efficient method that builds a dialect-dependent lexicon for sentiment analysis of social media.
In this study, we take a different approach: we concentrate first on building a new sentiment lexicon with a reasonable number of words and phrases, and then apply adequate preprocessing to the lexicon's words and phrases as well as to the Twitter dataset. To the best of our knowledge, no prior study has applied preprocessing to both the lexicon and the collected dataset. This study presents a Saudi dialect sentiment lexicon (SaudiSentiPlus) that contains 7139 words and can be used for sentiment analysis of Saudi dialect tweets. Moreover, we propose a new method based on two lexicon-based algorithms that deal with the prefix and suffix letters of the lexicon's words and phrases. This method has a significant positive effect on the performance (accuracy) of the SaudiSentiPlus lexicon. We evaluated the performance of SaudiSentiPlus in four phases, measuring precision, recall, accuracy, and F-Score in every phase. We built our testing dataset from Twitter by focusing on Saudi dialect hashtags (971 thousand tweets from 162 hashtags). We asked three annotators to manually classify randomly selected tweets from the dataset into three classes (positive, negative, and neutral), as presented in the evaluation section.
The next section presents the proposed methodology in detail, followed by the evaluation and then the results and discussion. We conclude this study in Section 5.

II. PROPOSED METHODOLOGY
Various methodologies have been widely adopted by researchers to produce sentiment lexicons [9,10]; however, three approaches are broadly accepted in the Arabic lexicon construction process. The first approach builds a sentiment lexicon by taking words from Arabic dictionaries or from other sentiment lexicons and dividing these words based on their polarities (see [11], [12]). The second approach is based on the translation of English lexicons (see [13]). The third approach is based on selecting seed sentiment words and then finding the words that occur in conjunction with the seed words (see [14]).
In this study, we started building the sentiment lexicon with the second approach, that is, automatic translation of English sentiment lexicons already created by two prior studies [15] and [16]. Then, to enrich the lexicon with Saudi dialect words, we manually extracted all the Saudi dialect sentiment words from the Twitter data (datasets). This step is partly in line with the third approach.
Finally, more words (4431 words) were taken from the Saudi dialect sentiment lexicon SauDiSenti, built for sentiment analysis of Saudi dialect tweets [17]. We then deleted the repeated words and divided all the words based on their polarities. This step is consistent with the first approach. To determine polarity, a two-way classification (positive or negative) was adopted. We asked three annotators to classify the words manually. All the annotators are Saudi native speakers of Arabic, and two of them are Arabic language teachers. Any disagreement among the three annotators was resolved by voting. We called the resulting lexicon SaudiSentiPlus; it contains 7139 words.
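The merging and voting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `merge_lexicons` and `majority_label` are hypothetical, the toy words and votes are invented for the example, and the paper does not describe its tooling.

```python
from collections import Counter


def merge_lexicons(*sources):
    """Merge word lists, keeping the first occurrence of each word (drops repeats)."""
    merged = []
    seen = set()
    for words in sources:
        for word in words:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


def majority_label(votes):
    """Return the label chosen by most annotators.

    With three annotators and two labels (positive/negative) a tie cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data, not taken from SaudiSentiPlus.
translated = ["جميل", "سيء"]
from_tweets = ["غبي", "جميل"]  # "جميل" is repeated and will be dropped
candidates = merge_lexicons(translated, from_tweets)

annotations = {
    "جميل": ["positive", "positive", "negative"],
    "سيء": ["negative", "negative", "negative"],
    "غبي": ["negative", "positive", "negative"],
}
lexicon = {word: majority_label(annotations[word]) for word in candidates}
```

Keeping the first occurrence when merging preserves the order in which sources were added, which makes it easy to trace where a word entered the lexicon.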

A. Preprocessing
Applying sentiment analysis directly to textual data collected from Twitter (the Twitter dataset) could lead to inaccurate outcomes [18]; thus, the collected data need to be prepared for further processing. This is called data preprocessing, which means transforming the collected textual data into a format that is more adequate for the purpose of the study. Applying preprocessing techniques before sentiment analysis is highly recommended in many Arabic sentiment analysis studies, particularly for Arabic dialectal datasets, because dialectal text is commonly written in an unstructured form [19,20,5,11].
Thus, in this study, to enhance the sentiment analysis results for the collected Twitter dataset, we applied the following preprocessing techniques:
1) Tweet cleaning. This step removes irrelevant data such as user names, Twitter characters, URLs and all non-Arabic letters.
2) Elimination of redundant letters. In a moment of emotion, some Twitter users repeat letters of a word when they want to emphasize something, as in Goooooooal, COOOOL or WOOOW. The same happens in tweets by Arab and Saudi users, for example "ألييييييييييييييم" or "روووووعة". This step removes the redundant repetitions.
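The two preprocessing steps above can be illustrated with a short Python sketch. The regular expressions are assumptions made for this example (the paper does not give its exact rules): mentions are taken to start with `@`, "Arabic letters" are approximated by the Unicode range U+0621 to U+064A, and a letter repeated three or more times is collapsed to a single occurrence.

```python
import re

# Assumed patterns; the exact rules used in the study are not specified.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Anything that is not an Arabic letter (U+0621..U+064A) or whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")
# Any character repeated three or more times in a row.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def clean_tweet(text):
    """Step 1: drop URLs, user names, and every non-Arabic character."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ARABIC_RE.sub(" ", text)
    return " ".join(text.split())  # normalise whitespace


def remove_elongation(text):
    """Step 2: collapse a run of three or more identical letters to one letter."""
    return REPEAT_RE.sub(r"\1", text)


def preprocess(text):
    return remove_elongation(clean_tweet(text))
```

For instance, `preprocess` applied to an elongated form of "أليم" returns the plain word. Collapsing to a single letter is a simplification: a word whose correct spelling contains a doubled letter would lose one of them if a user elongated it.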

B. The Proposed Algorithms
To increase the performance of the proposed Saudi dialect lexicon (SaudiSentiPlus), we developed two lexicon-based algorithms to deal with the prefix and suffix letters of the lexicon's words (see Fig. 3 and Fig. 4).
Saudi dialect words mostly originate from Arabic words, and Arabic is a morphologically rich language in which words vary depending on the presence and position of certain well-known letters. Some of these letters come at the beginning of a word (prefixes) and others at the end (suffixes). These letters also take different shapes depending on how the word appears in the text or context. For instance, the Arabic letter (ا), pronounced Alif and corresponding to (A), may be written in four ways (ا, أ, إ, آ). This affects the Arabic definite article (The), which combines the two letters "ا" and "ل" and therefore appears in four forms (ال, أل, إل, آل) [21].
Because these letters recur in most Saudi dialect words, we counted and listed the most frequently repeated prefixes and suffixes. The prefixes are ("ال", "أل", "إل", "آل") and the suffixes are ("ه", "ة", "وا", "ون", "ين", "هم", "هن", "وهن", "وهم", "نهم", "نهن", "ني"). We applied the first algorithm (see Fig. 3) to remove these prefixes and suffixes from most of the lexicon's words and saved the resulting words in a new lexicon whose words carry no prefix or suffix letters. We then applied the second algorithm (see Fig. 4) to this new lexicon. For each word, the second algorithm generates the possible forms step by step: it first adds the prefixes alone, one at a time, and compares each resulting form with all the text of the Twitter dataset; it then adds the suffixes, one at a time, and compares again; finally, it adds some combinations of prefixes and suffixes together and compares these forms with the dataset's text. This is repeated until the last word in the new SaudiSentiPlus lexicon (see Fig. 4). For example, take the word "غبي", which means "stupid" (masculine). Applying this light stemming algorithm increases the chance of matching the word against the text taken from Twitter, because the algorithm will, for instance, add the letter "ة" as a suffix to "غبي" to give the new word "غبية", which means "stupid" (feminine).
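A minimal Python sketch of the two algorithms may help make the procedure concrete. It is not the implementation shown in Fig. 3 and Fig. 4. The function names, the `MIN_STEM_LEN` guard, the choice to strip at most one prefix and one suffix, and the choice to try every prefix and suffix pair (the text says only "some" combinations) are all assumptions of this example.

```python
PREFIXES = ["ال", "أل", "إل", "آل"]
SUFFIXES = ["ه", "ة", "وا", "ون", "ين", "هم", "هن", "وهن", "وهم", "نهم", "نهن", "ني"]

# Assumption: never strip a word down to fewer than two letters.
MIN_STEM_LEN = 2


def strip_affixes(word):
    """Algorithm 1 (sketch): remove at most one listed prefix and one listed suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    # Try longer suffixes first so that "وهم" is preferred over "هم".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def expand(stem):
    """Algorithm 2 (sketch): the stem alone, with each prefix, each suffix, and each pair."""
    variants = {stem}
    variants.update(p + stem for p in PREFIXES)
    variants.update(stem + s for s in SUFFIXES)
    variants.update(p + stem + s for p in PREFIXES for s in SUFFIXES)
    return variants


def match_tweet(tweet, stem_polarity):
    """Return (stem, polarity) pairs whose expanded forms occur among the tweet's tokens."""
    tokens = set(tweet.split())
    return [(stem, pol) for stem, pol in stem_polarity.items() if expand(stem) & tokens]


# Example from the text: the stem "غبي" matches the feminine form "غبية".
hits = match_tweet("هي غبية جدا", {"غبي": "negative"})
```

Blind stripping of this kind can over-stem words whose final letters merely happen to coincide with a suffix, which may be why the text says the first algorithm was applied to most, rather than all, lexicon words.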

III. EVALUATION
To evaluate the effectiveness of SaudiSentiPlus, we compared it with SauDiSenti. We built our testing dataset from Twitter by focusing on Saudi dialect hashtags (971 thousand tweets from 162 hashtags). We asked three annotators to manually classify randomly selected tweets from the dataset into three classes (positive, negative, and neutral). All the annotators are Saudi native speakers of Arabic, and two of them are Arabic language teachers. They labeled 300 tweets for each class.
These tweets were used to evaluate the accuracy of the SaudiSentiPlus and SauDiSenti lexicons. Four of the most widely used accuracy measures in the literature are utilized [17,13]: precision (P), recall (R), F-measure (F), and accuracy (Acc). Their equations are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$F = \frac{2 \cdot P \cdot R}{P + R}, \qquad Acc = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP (True Positive) is the number of tweets correctly predicted as positive, TN (True Negative) is the number of tweets correctly predicted as negative, FP (False Positive) is the number of tweets incorrectly predicted as positive, and FN (False Negative) is the number of tweets incorrectly predicted as negative.
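As a small worked illustration, the four measures can be computed from the confusion counts as follows. The function name `evaluation_metrics` and the counts in the example are hypothetical and are not taken from Table II. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F-measure and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"P": precision, "R": recall, "F": f_measure, "Acc": accuracy}


# Hypothetical counts, for illustration only.
scores = evaluation_metrics(tp=8, tn=7, fp=2, fn=3)
```

With these toy counts, precision is 8/10, recall is 8/11, and accuracy is 15/20.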

IV. RESULTS AND DISCUSSION
The purpose of this experiment is to study the effect of increasing the size of the lexicon and to determine whether applying the first algorithm (see Fig. 3) and the second algorithm (see Fig. 4) has any effect. As mentioned above, these two lexicon-based algorithms were developed to deal with prefix and suffix letters in order to increase the performance of the proposed Saudi dialect lexicon.
Table II lists the precision, recall, accuracy, and F-Score results of the experiment. Accuracy improved to 74% when the lexicon size was increased (from 4431 words to 7139 words), and it increased further to 81% when the two algorithms were applied with the lexicon-based approach.
As mentioned above, the lexicon construction was accomplished in four phases. In the first phase, the lexicon was at its smallest size, with 4554 words taken from the automatic translation of English sentiment lexicons created by two prior studies [15] and [16], plus Saudi dialect sentiment words manually extracted from the Twitter data (datasets). The accuracy of this lexicon (SaudiSentiPlus 1) reached 61%, which is better than the 54% of SauDiSenti with its 4431 words [17]. In the second phase (SaudiSentiPlus 2 in Table II), no new words were added to the lexicon; instead, we applied the two lexicon-based algorithms to SaudiSentiPlus 1, which yielded better accuracy (68%).
In the third phase, more words (4431 words) were taken from the Saudi dialect sentiment lexicon SauDiSenti [17]; we then deleted the repeated words and divided all the words based on their polarities, reaching 7139 words. At this stage (see SaudiSentiPlus 3 in Table II), accuracy improved to 74%. Finally, accuracy increased to 81% when we applied the two algorithms with the lexicon-based approach (see SaudiSentiPlus 4 in Table II).

V. CONCLUSION
In this study, we took a different approach: we concentrated first on building a new sentiment lexicon with a reasonable number of words, and then applied adequate preprocessing to the lexicon's words as well as to the Twitter dataset. The study presents a Saudi dialect sentiment lexicon called SaudiSentiPlus, which contains 7139 words.
Saudi dialect words mostly originate from Arabic words, and Arabic is a morphologically rich language in which words vary depending on the presence and position of certain well-known letters. Some of these letters come at the beginning of a word (prefixes) and others at the end (suffixes). These letters also take different shapes depending on how the word appears in the text or context.
To increase the performance of the proposed Saudi dialect lexicon (SaudiSentiPlus), we developed two lexicon-based algorithms to deal with the prefix and suffix letters of the lexicon's words (see Fig. 3 and Fig. 4).
The experiment conducted to evaluate the performance of SaudiSentiPlus comprises four phases, and precision, recall, accuracy, and F-Score are measured in every phase. We built our testing dataset from Twitter by focusing on Saudi dialect hashtags (971 thousand tweets from 162 hashtags). We asked three annotators to manually classify randomly selected tweets from the dataset into three classes (positive, negative, and neutral). All the annotators are Saudi native speakers of Arabic, and two of them are Arabic language teachers. They labeled 300 tweets for each class.
We compared SauDiSenti, with its 4431 words [17], against the lexicon proposed in this study (SaudiSentiPlus). The results, as illustrated in Table II, show that SaudiSentiPlus with the two lexicon-based algorithms achieved 81% accuracy, outperforming SauDiSenti with its 54% accuracy.

Fig. 4. Algorithm 2 to Increase the Chances of Finding Saudi Dialect Words.

TABLE I. TOP TEN LANGUAGES USED IN THE WEB, April 30, 2019 (Number of Internet Users by Language). Source: [6]
Arabic 444,016,517 226,595,470 51.0% 8,917.3% 5.2%

TABLE II. PERFORMANCE OF THE SAUDISENTIPLUS LEXICON COMPARED WITH SAUDISENTI