Arabic Lexicon Learning to Analyze Sentiment in Microblogs

Sentiment analysis is the study and classification of opinions extracted from social media. The goal of this study is to build an adaptive sentiment lexicon for the Arabic language; sentiment polarity classification can then be improved using such lexicons. The classification problem is stated as a mathematical programming problem in which we search for a lexicon that optimizes classification accuracy. A genetic algorithm is presented to solve this optimization problem. A meta-level feature is generated from the adaptive lexicons produced by the genetic algorithm. The algorithm is also used alongside n-gram features and Bing Liu's lexicon. In this work, lexicon-based and corpus-based approaches are integrated, and the lexicons are produced from the corpus. Five data sets are tested experimentally. The sentiments in all data sets are classified into five polarity levels. The lexicons generated by the proposed algorithm can support a better understanding of the sentiment orientation of words, the culture of social media users and the Arabic language. Since stop words can contribute to sentiment polarity, they are retained rather than deleted. The results show that the F-measure is greater than 80% in three data sets and the accuracy is greater than 80% for all data sets. The proposed method outperforms current methods in the literature on two of the datasets. Finally, in terms of F-measure, the proposed method achieved better results on three datasets.
Keywords: Sentiment analysis; sentiment lexicon; social media; Twitter; optimization; mathematical programming; genetic algorithm; evolutionary computation; Arabic language


I. INTRODUCTION
Sentiment analysis is the study of opinions and related concepts such as emotions, attitudes, evaluations and sentiments. For the first time in human history, we have a massive volume of recorded data that reflects the opinions, emotions and attitudes of people around the globe. This data comes from Twitter, reviews, social networks, forum discussions, blogs and microblogs. It is therefore natural that the field of sentiment analysis has emerged.
In business, sentiment analysis addresses the problem of studying customer opinions about products by analyzing and extracting opinions from product reviews. However, most current algorithms developed for business purposes are not suitable for analyzing sentiment in the social domain.
The objective of the sentiment classification task is to take a piece of text written by an author about a topic and determine the author's general feeling toward this topic, whether this feeling is positive or negative.
The current work tries to improve the classification of sentiments in microblogs by building sentiment lexicons. The sentiment classification problem is written as an optimization problem whose goal is to find the optimum sentiment lexicon. A proposed genetic algorithm produces the solution by searching for lexicons that classify text. A meta-level feature is then extracted from the resulting lexicon. The experiments are conducted on several Arabic datasets. The sentiment lexicons produced by the algorithm can support a better understanding of the Arabic language, the culture of Arab Twitter users and the sentiment orientation of words in different contexts.
Since adaptive lexicons are developed in this work, trends in the ever-changing environment of Twitter can be captured [1]. The lexicons can easily be updated to adapt to changes in the culture of the users. Even though it is based on only one feature, the proposed method gives promising results.
In terms of practical benefits, understanding social media and the context of words in known domains allows users to choose the words in their messages more effectively. Similarly, this idea might be used to produce lexicons for languages that do not have one. By analogy, the method can be employed to calculate sentiment scores for the same terms in different contexts and websites. The modification of the method for strength and emotion classification will be explored. Based on the method, it is planned to generate lexicons for the Arabic language.
The rest of this paper is organized as follows: Section II presents the related work; Section III presents the method. Experiments, results and discussion are presented in Section IV. Finally, the conclusion and main results are presented in Section V.

II. RELATED WORK
The proposed method aims to develop an adaptive lexicon for sentiment analysis; accordingly, statistical methods for sentiment analysis, lexicon-based approaches and evolutionary methods are reviewed here. Statistical methods have been developed based on the following observation: if two words frequently appear together within the same context, they tend to have the same polarity. So, by calculating the relative frequency with which a given word co-occurs with special seed words, the polarity of this word can be determined. These algorithms did not give the same, or even similar, results when applied to training data labeled with emotions, which has the potential of being independent of domain, topic and time [2].
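The co-occurrence idea behind these statistical methods can be illustrated with a minimal sketch. It is not taken from any of the cited works: the function name `cooccurrence_polarity`, the seed sets and the toy documents are all hypothetical, and the score is simply the normalized difference between co-occurrences with positive and negative seed words.

```python
def cooccurrence_polarity(word, documents, positive_seeds, negative_seeds):
    """Score in [-1, 1]: how much more often `word` co-occurs with
    positive seed words than with negative ones (illustrative only)."""
    pos = neg = 0
    for doc in documents:
        tokens = set(doc.split())
        if word not in tokens:
            continue
        # Count the seed words that share a document with `word`.
        pos += len(tokens & positive_seeds)
        neg += len(tokens & negative_seeds)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Toy corpus and seed sets (assumed for illustration).
docs = [
    "the phone is great and fast",
    "fast delivery great service",
    "slow and terrible support",
    "the screen is terrible and slow",
]
POS = {"great"}
NEG = {"terrible"}

fast_score = cooccurrence_polarity("fast", docs, POS, NEG)
slow_score = cooccurrence_polarity("slow", docs, POS, NEG)
the_score = cooccurrence_polarity("the", docs, POS, NEG)
```

In this toy corpus, "fast" only appears with the positive seed and "slow" only with the negative one, while "the" appears once with each and therefore stays neutral.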
In that area, many approaches that address different dimensions of opinions, such as subjectivity, polarity, intensity and emotion, were proposed to extract sentiment indicators from natural language texts, whether these indicators are at the syntactic or semantic level. Mohammad and Turney, 2013, conducted experiments on how to formulate emotion-annotation questions and showed that asking whether a term is associated with an emotion leads to markedly higher inter-annotator agreement than asking whether a term evokes an emotion [3].
T. Wilson et al., 2005, presented an approach to phrase-level sentiment analysis that first determines whether an expression is neutral or polar and then disambiguates the polarity of the polar expressions [4]. M.M. Bradley and P.J. Lang, 2009, developed a set of verbal materials that had been rated in terms of pleasure, arousal and dominance to complement the existing International Affective Picture System [5].
Although manual classification gives the most accurate results, it is extremely difficult to use manual labeling to determine the polarity of users' comments or posts in social media. For this reason, some papers use emoticons as labels [6], [7]. In [8], the author discussed how this method produces much noise. Using emoticons, Go et al., 2010, extracted 1,600,000 tweets from a Twitter dataset [7]. Liu et al., 2012, presented a dataset and used a labeling method that combines emoticons and manual classification [9]. Da Silva et al., 2014, created a classifier ensemble for Twitter sentiment classification [10]. Hu et al., 2013, combined networked data to benefit from emotional spread in sentiment classification [11]. In [12], features based on semantic concepts are added to the training set [13]. Bravo-Marquez et al., 2013, used a different approach that employs meta-level features for social media sentiment classification, namely for Twitter. In this method, different features of words are used for polarity and subjectivity classification. Kaewpitakkun et al., 2014, created a lexicon that finds scores for objective and out-of-vocabulary words, and used a calculation method that depends on a weighting scheme for features [14]. Saif et al., 2014, developed a method that extracts patterns of terms and phrases and evaluated it on tweet-level and entity-level sentiment analysis [15]. A feature learning approach was introduced by Baecchi et al., 2015, who used it to classify tweets, targeting in particular posts that might contain pictures [16]. An unsupervised learning framework was proposed by Hu et al., 2013; in this method, they combined emotional signals in Twitter datasets [11]. In [17], a sentiment scoring function was used for the classification of tweets. Wu et al., 2016, combined social connections between users with the social emotions shared between posts of the same author to get better accuracy [18].
Although sentiment is often expressed implicitly through patterns, dependencies among words in tweets and latent semantic relations, most existing approaches to Twitter sentiment analysis assume that sentiment is explicitly expressed through affective words. Also, these methods do not consider that words' sentiment orientations and strengths change continuously across the various contexts in which the words appear.
Sentiment lexicons can be defined as groups of terms and phrases that are assigned numeric scores giving the sentiment or emotional value of each term or phrase. Some lexicons simply allocate a label, either positive or negative, to each term or phrase. Bing Liu's lexicon is the best-known lexicon that uses this simple method. Many studies have tried to establish lexicons for sentiment analysis [9], [19], [20], [21], [22], [23].
Lexicon-based approaches to Twitter sentiment analysis have become more popular because of their simplicity, domain independence and good performance. These approaches depend on sentiment lexicons, where a list of words is marked with fixed sentiment polarities; for example, [17], [24], [25], [26], [27]. Arora et al., 2010, and Govindarajan, 2013, used a hybrid of a Naive Bayes classifier and a genetic algorithm for the classification of movie reviews [28].
For Arabic sentiment analysis, Hossam et al., 2015, presented a sentiment analysis based on two lexicons. The first is a lexicon of adjectives and adjectival nouns. The other lexicon contains known idioms. They developed a method to expand the lexicon from seed words and idioms. The method yields a static lexicon with fixed values for the polarity of each term. They also depend heavily on a translated version of the Hu-Liu lexicon [29]. Haidy et al., 2017, used a hybrid method to determine the sentiment polarity of a tweet. In the first phase they used a lexicon to classify a set of tweets. The result of this phase is the input of the second phase. The lexicon was composed of two parts: a lexicon of words and a lexicon of idioms [29]. Al-Ayyoub and Essa, 2015, adopted a lexicon-based approach to sentiment analysis. The polarity of a given word is obtained from the corresponding English translation. Stop words are deleted, except for some stop words that can affect the polarity of a given word. The lexicon words and sentiment expressions are stemmed. Using the polarity of translated terms reduces the functionality of the words, as does neglecting stop words, which contribute to the overall meaning that the author wants to convey [30].
Most of these works stated that they follow a supervised or unsupervised learning approach without describing the training and testing phases. Saying that a lexicon-based approach is an unsupervised approach is not correct in general. In this work, no translation is applied to get the polarity of words. Also, the proposed method builds a dynamic lexicon where the polarity of the words is related to the corpus. The polarity of the same word can differ from one corpus to another and can change for the same topic as more and more sentiments are added. In addition, all these works classified the sentiments into two classes, +ve and -ve. In the current work the level of polarity is considered: the sentiment polarity can be strongly +ve, +ve, neutral, -ve or strongly -ve.

III. THE METHOD
Based on one feature, namely the Adaptive Arabic Lexicon (AAL), this work tries to find an optimized Arabic lexicon. The problem is written as an optimization problem and solved with a genetic algorithm. The problem can be stated as: find the lexicon that minimizes the error of polarity classification for a given set of texts. Suppose the set of lexicons is $L$ and the set of texts is $T$. For a given text $t$ in $T$ and a lexicon $\ell$ in $L$, the score of $t$ with respect to $\ell$ is the sum of the scores of all words in $t$ with respect to $\ell$:
$$S(t, \ell) = \sum_{w \in t} s(w, \ell),$$
where $s(w, \ell)$ is the score of the word $w$ in the lexicon $\ell$. The text $t$ is then assigned to one of the five polarity classes according to the value of $S(t, \ell)$.
The accuracy of a lexicon $\ell$ for the set of texts $T$ is the ratio of the number of correctly classified texts in $T$, $N_c(\ell, T)$, to the total number of texts in $T$, $|T|$:
$$\mathrm{Acc}(\ell, T) = \frac{N_c(\ell, T)}{|T|}.$$
So, the optimization problem can be written as: find
$$\ell^{*} = \arg\max_{\ell \in L} \mathrm{Acc}(\ell, T).$$
Fig. 1 shows how the above classification works. To solve the optimization problem with a genetic algorithm, we need to define the fitness function. If the accuracy function were used directly as the fitness function, the algorithm would tend to maximize the value of the accuracy function itself rather than improve the classification. To get a better approach, the concept of punishment and reward is used. This means that if a given text is classified correctly, the lexicon is rewarded by adding a positive value to the fitness function, and if the text is not classified correctly, the lexicon is punished by adding a negative value to the fitness function. Let the fitness function be $F$ and the increment function be $\mathrm{INC}$. The fitness of a chromosome $c$ of the current generation is the sum of the increments over all texts:
$$F(c) = \sum_{t \in T} \mathrm{INC}(t, c).$$
Fig. 2 shows an example of the classification of a sentiment based on a given lexicon. In this example the sentiment is "أردوغان: تعرضنا لمحاولة اغتيال اقتصادي في أغسطس" ("Erdogan: We were hit by an economic assassination attempt in August"). The algorithm retrieves the polarity of each term from the lexicon, adds all the values, and then classifies the sentiment based on the proportional position of this sum between min AAL and max AAL.
Fig. 5 shows the details of the genetic algorithm. The genetic algorithm consists of five main parts. The first part is initialization, where a random population is chosen. The algorithm generates random vectors; the length of each vector is equal to the number of distinct words and the values are distributed randomly over the interval (minv, maxv). The algorithm tried many values of minv and maxv during the training phase and kept the values that gave the best results. The second part calculates the fitness value for each chromosome and selects the parents of the next generation. The remaining parts are crossover, mutation and replacement, which produce the new generation. Based on the roulette wheel strategy, lexicons with higher fitness values are more likely to be selected. The crossover is applied randomly: if a random number selected between 0 and 1 is less than a given probability value, a crossover of the current parents produces the children; otherwise the children are identical to their parents. Mutation is applied in the same way: a random number between 0 and 1 is selected, and the resulting children are mutated if the number is less than a specific probability. Finally, a replacement is applied; lexicons with lower fitness values are more likely to be replaced. Fig. 4 gives the details of the calculation of the fitness function.
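The genetic loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the fitness function is passed in as a callable, roulette weights are shifted so that negative fitness values remain usable, mutation resamples one randomly chosen cell, and replacement is simplified to overwriting the two lowest-fitness chromosomes.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to shifted fitness."""
    low = min(fitnesses)
    # Shift so every weight is positive; punishment can make fitness negative.
    weights = [f - low + 1e-9 for f in fitnesses]
    return rng.choices(population, weights=weights, k=1)[0]


def uniform_crossover(a, b, rate, rng):
    """With probability `rate`, swap randomly chosen cells; else copy parents."""
    c1, c2 = a[:], b[:]
    if rng.random() < rate:
        for i in range(len(a)):
            if rng.random() < 0.5:
                c1[i], c2[i] = b[i], a[i]
    return c1, c2


def mutate(child, rate, minv, maxv, rng):
    """With probability `rate`, resample one randomly chosen cell (assumed)."""
    if rng.random() < rate:
        child[rng.randrange(len(child))] = rng.uniform(minv, maxv)
    return child


def evolve(fitness, n_words, pop_size=20, generations=50,
           minv=-10.0, maxv=10.0, cx_rate=0.8, mut_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Initialization: one random score per distinct word in each chromosome.
    population = [[rng.uniform(minv, maxv) for _ in range(n_words)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(c) for c in population]
        p1 = roulette_select(population, fits, rng)
        p2 = roulette_select(population, fits, rng)
        children = uniform_crossover(p1, p2, cx_rate, rng)
        children = [mutate(c, mut_rate, minv, maxv, rng) for c in children]
        # Simplified replacement: children overwrite the two worst members.
        worst = sorted(range(pop_size), key=fits.__getitem__)[:2]
        for idx, child in zip(worst, children):
            population[idx] = child
    fits = [fitness(c) for c in population]
    return population[max(range(pop_size), key=fits.__getitem__)]
```

Any callable that maps a chromosome (a list of word scores) to a number can be passed as `fitness`; the default rates mirror the 0.8 crossover and 0.05 mutation settings reported in the experiments.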

IV. EXPERIMENTS
In this section, data sets, parameters and results are introduced. The results of using the proposed method on the datasets are analyzed and reported.

A. Data Sets
AAL was run on five different data sets of tweets from Twitter users. These data sets were given the names TLC, MBH, NSC, SIE and TRE. TRE consists of 982 tweets about the American elections: 95 strongly +ve, 315 strongly -ve, 74 +ve, 357 -ve and 141 neutral tweets. Table I summarizes the data set information.
A program was written to implement the proposed algorithm; Fig. 8 shows how the program runs. A k-fold method with k=15 was used. Each time, the data set is divided into 15 subsets; 14 of these subsets are used as the training set and the 15th subset is used as the validation set. This process is repeated 15 times, each time with a different subset used as the validation set and the remaining 14 subsets used as the training set. The final result is the average of the 15 runs of the algorithm. Fig. 6 shows how the crossover process is applied: some cells are chosen randomly from each chromosome, and the chosen cells from Parent A are replaced by the corresponding cells from Parent B. Fig. 7 shows an example of mutation. The range of term polarity, the crossover rate and the mutation rate were set as follows:
• The range of polarity for each term in the lexicon was set to be between -10 and +10
• A uniform crossover was applied with rate 0.8
• The mutation rate was set to 0.05
The algorithm was run until no further improvement could be achieved. Sets of parameters were chosen to run the algorithm on different data sets. Namely, two sets of parameters were used with two different sets of data. The original sets of data were randomly divided into two equal data sets. Equal here means that the number of sentences in each set is equal to the number of sentences in the other set.
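The 15-fold protocol can be sketched as below. This is an illustrative sketch only: the helper names `k_fold_indices` and `cross_validate` are hypothetical, the shuffle before splitting is an assumption, and `train_and_score` stands for any routine that trains on one split and returns a score on the other.

```python
import random


def k_fold_indices(n_items, k=15, seed=0):
    """Yield (train, valid) index lists; each fold is the validation set once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # shuffling is an assumption of this sketch
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid


def cross_validate(items, train_and_score, k=15, seed=0):
    """Average the validation score over the k train/validation splits."""
    scores = []
    for train, valid in k_fold_indices(len(items), k, seed):
        scores.append(train_and_score([items[j] for j in train],
                                      [items[j] for j in valid]))
    return sum(scores) / len(scores)
```

For a data set of 982 tweets such as TRE, each validation fold holds 65 or 66 tweets and the remaining tweets form the training set.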

B. Results
In this section, we provide the results of our approach to building an adaptive lexicon in terms of F1-measure for our data sets. Table II shows the F1-measure and accuracy values for different mutation and crossover rates on the SIE and TRE datasets. In each case, the best values of the crossover and mutation rates were reported. To test the mutation and crossover rate settings, we examined different values. For these two datasets, Fig. 9 and Fig. 10 show the relation between different parameter values and F-measure. The algorithm was run for each dataset and setting, and the reported results are averages over the runs. From the results we can conclude that the best performance was at values between 0.6 and 0.9 for crossover and at values between 0.05 and 0.1 for mutation. To ensure that the results are independent of the crossover and mutation rates, these rates were fixed at 0.8 and 0.06. Reviewing the results in Table III, the proposed method gave good results that outperform the currently available methods in many cases. Regarding the number of iterations, a limited number of iterations, 100,000, was enough to achieve convergence for small data sets. For big data sets, convergence was achieved at around 250,000 iterations. This leads us to use 250,000 iterations for all data sets.

C. Discussion
The random search approach and Bing Liu's lexicon are considered the best methods, so it was natural to compare the performance of the proposed method with these approaches. Table IV shows the comparison results. Best values are in bold. In the random search, using the same representation as the proposed algorithm, a single chromosome is initialized with random values. For 250,000 iterations, a neighbor of the chromosome is generated by randomly changing a single cell in it. If the fitness value of the generated neighbor is higher (based on AAL calculations), the neighbor replaces the original one. A confidence interval is reported since the algorithm is run fifteen times for each fold and there are 15 folds. The 0.95 confidence intervals are shown in Table IV. Several variations enhance the AAL performance. AAL-SW is AAL after removing the stop words. AAL+1,2,3-grams are variations that result from applying AAL supported by n-gram features. Enhancing AAL with meta-level features from Bing Liu's lexicon produces a modified version of AAL, AAL+lex. Adding n-gram features and meta-level features of Bing Liu's lexicon improves the results on many measures across the datasets. From Table III, we note that AAL alone could outperform the other methods on the MBH data set. This is due to the clarity of the positivity and negativity levels in this data set. The worst results of AAL were on the NSC data set, because the level of polarity ambiguity in this data set is the highest among the data sets. Overall, the results obtained using AAL alone are promising. As a classifier, AAL outperforms other classifiers; see [7], [9]. Incorrect results from AAL can be explained by the tone-of-tweet problem. Terms that have low frequency tend to have higher variance when the algorithm is run multiple times. Consequently, those terms tend to have improper values. The standard deviation of the sentiment score values of terms is shown in Table IV.
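The random search baseline described above amounts to a simple single-cell hill climb, sketched below. The function name `random_search`, the score range (-10, +10) and the seed handling are assumptions made for illustration; `fitness` is any callable that scores a chromosome.

```python
import random


def random_search(fitness, n_words, iterations=250_000,
                  minv=-10.0, maxv=10.0, seed=0):
    """Start from one random chromosome and accept single-cell changes
    only when they raise the fitness."""
    rng = random.Random(seed)
    current = [rng.uniform(minv, maxv) for _ in range(n_words)]
    current_fit = fitness(current)
    for _ in range(iterations):
        # Neighbor: copy the chromosome and resample one randomly chosen cell.
        neighbor = current[:]
        neighbor[rng.randrange(n_words)] = rng.uniform(minv, maxv)
        neighbor_fit = fitness(neighbor)
        if neighbor_fit > current_fit:
            current, current_fit = neighbor, neighbor_fit
    return current, current_fit
```

Because a neighbor is accepted only when its fitness is strictly higher, the fitness of the retained chromosome never decreases over the run.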

V. CONCLUSION
In this work, we proposed a genetic algorithm to build an adaptive Arabic lexicon for sentiment analysis. The F-measure of AAL is 4.13 percentage points better than the average of reported results on the MBH dataset, 3.28 on the TLC dataset, 2.14 on the SIE dataset and 1.56 on the TRE dataset. AAL achieved better accuracy than traditional methods on three data sets, and better accuracy than state-of-the-art methods on three datasets. In terms of F-measure, the proposed method achieved better results on four datasets. This work shows that adaptive lexicons can be applied to the Arabic language. In fact, it shows that the method is independent of the language. The proposed method can enable a better understanding of sentiment words. Since we did not remove stop words, this shows that all words in Arabic can be considered sentiment words. In this paper, we showed that formulating the generation of an adaptive lexicon as an optimization search and applying a genetic algorithm to find the optimal solution can give excellent results when applied to the Arabic language. It is also shown that AAL can give high accuracy with small data sets. From the business point of view, companies can use AAL to create lexicons that help in finding and exploring what users think. Companies can also use AAL to enrich their knowledge about individual words and their importance; this will increase the effectiveness of manual sentiment analysis. For example, a supermarket manager can use AAL to create a lexicon for the store's products and use it for sentiment analysis of customer behavior. In this paper, AAL was used to analyze the strength of opinions. In the future, building a deep network that can apply AAL online with active learning to provide real-time adaptive lexicons will be explored.

Fig. 3 illustrates how to calculate INC. The following algorithm explains how to calculate the fitness function of a chromosome c in a data set T.
Algorithm 1. Fitness function of chromosome c in data set T
1. Fitness(c, T)
2.   F = 0
3.   for each text t in T
4.     S = 0
5.     for each word w in t
6.       S = S + s(w, c) // the score of w in chromosome c
7.     end for
8.     if the value of S makes t be classified correctly and |S| …
9.     then … |S| …
10.    if the value of S makes t be classified correctly and |S| …
11.    then …
12.    if the value of S makes t be classified incorrectly and |S| …
13.    then … |S| …
14.    if the value of S makes t be classified incorrectly and |S| …
15.    then …
16.    end if
17.  end for
18. return F
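The reward-and-punishment idea of Algorithm 1 can be sketched in Python as below. This is a simplified stand-in rather than the algorithm itself: the magnitude-dependent branches are replaced by a flat reward and penalty, the five classes are assumed to be equal-width bins between two bounds `lo` and `hi` (playing the role of min AAL and max AAL), unknown words are assumed to score 0, and all function names and the toy lexicon are hypothetical.

```python
LABELS = ["strongly -ve", "-ve", "neutral", "+ve", "strongly +ve"]


def text_score(tokens, lexicon):
    """Sum of per-word scores; unknown words contribute 0 (assumption)."""
    return sum(lexicon.get(w, 0.0) for w in tokens)


def classify(score, lo, hi):
    """Map a score to one of five labels by its relative position in [lo, hi]."""
    if hi <= lo:
        return "neutral"
    position = (score - lo) / (hi - lo)   # 0.0 at lo, 1.0 at hi
    index = min(int(position * 5), 4)     # five equal-width bins (assumed)
    return LABELS[max(index, 0)]


def fitness(lexicon, dataset, lo, hi, reward=1.0, penalty=1.0):
    """Add `reward` for each correctly classified text, subtract `penalty`
    for each incorrect one (flat increments assumed for this sketch)."""
    total = 0.0
    for tokens, gold in dataset:
        predicted = classify(text_score(tokens, lexicon), lo, hi)
        total += reward if predicted == gold else -penalty
    return total


# Toy lexicon and labeled texts (illustrative English tokens).
toy_lexicon = {"good": 4.0, "bad": -4.0, "very": 1.0}
toy_data = [
    (["good"], "+ve"),
    (["bad"], "-ve"),
    (["very", "good", "good"], "neutral"),  # deliberately mislabeled
]
toy_fitness = fitness(toy_lexicon, toy_data, lo=-10.0, hi=10.0)
```

Two of the three toy texts are classified correctly and one is not, so the lexicon earns two rewards and one penalty.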

Fig. 10. Values of F1-Measure for Multiple Mutation and Crossover Rates on the TRE Dataset.

TABLE II. THE F1-MEASURE AND ACCURACY VALUES FOR DIFFERENT MUTATION AND CROSSOVER RATES ON THE SIE AND TRE DATASETS

TABLE III. AAL RUNNING RESULTS ON ALL DATA SETS

TABLE IV. ACCURACY AND F1 VALUES FOR 0.95 CONFIDENCE INTERVAL ON THE FIVE DATASETS

Fig. 9. Values of F1-Measure for Multiple Mutation and Crossover Rates on the SIE Dataset.