Mining Trending Hash Tags for Arabic Sentiment Analysis

Abstract: People post millions of messages every day on microblogging social networks, especially Twitter, which makes microblogs a rich source of public opinion, customer comments and reviews. Companies and public sector bodies are looking for ways to measure public response and feedback on a particular service or product. Sentiment analysis is a promising technique that can sense public opinion faster and at lower cost than traditional survey methods such as questionnaires and interviews. Sentiment analysis methods have been developed for many languages, including English and Arabic, with far more studies on English. Sometimes hash tags are misleading or have a title that does not really reflect the subject. Tweets in trending hash tags may contain keywords or topic titles that better represent the subject of the hash tag. This research proposes an approach that explores Twitter hash tag trends to retrieve tweets, groups the retrieved tweets to learn topic profiles, applies sentiment analysis to test the subjectivity of the tweets, and then develops a prediction model using deep learning to classify a new tweet into the appropriate topic profile. Arabic hash tag trends were used to evaluate the proposed approach. The proposed approach (clustering topics within a hashtag trend to learn topic profiles, then performing sentiment analysis) shows better accuracy than sentiment analysis without topic clustering.
Keywords: Arabic sentiment analysis; twitter; opinion mining; trending hashtags; text analysis; deep learning


I. INTRODUCTION
With the development of information technology, people interact with electronic devices of all sizes. Users of all age groups depend on the Internet to collect information and to exchange news and jokes. Technological development has strongly affected and transformed social life, which has prompted social, cultural and political figures, and even government agencies, to create accounts on various social media platforms to share their views and to influence their followers.
With the rapid increase in the number of social networks and growing awareness of the crucial role that user posts play, opinion miners are interested in finding automated ways to retrieve, analyse and report user responses so that decision makers can shape their policies for developing community services. Traditionally, public opinion polls survey the relevant people through questionnaires or interviews. These traditional survey methods rely on small samples and may not be designed carefully enough to capture what the public really feels about a specific domain, which can lead to inaccurate results, not to mention the cost. Automated artificial intelligence methods, by contrast, can scan larger samples and measure people's wishes indirectly by analysing the emotions and expressions that represent their feelings. In addition, their cost may be negligible compared with that of the traditional survey methods mentioned above.
Data mining professionals use artificial intelligence techniques to develop methodologies that identify the public's response, measure their opinions and interests, and classify them into categories that can be used to promote relevant products. AI-based techniques depend on analysing written posts according to the natural language in which they are written, using natural language processing methods. Specialists can use social networks such as Twitter to sense the pulse of the public and to measure, via opinion mining, how strongly the public is affected by or interested in a particular subject. Companies use opinion mining to measure public interest in and admiration for specific products, for better marketing and to guide users with targeted advertisements in order to increase profits.
Opinion miners investigate and develop methodologies for better understanding of natural languages in order to improve the automated retrieval and understanding of the desired written texts. The proposed techniques face difficulties in natural language processing, since each language needs its own methodology and rules. Arabic, for example, is rich in synonyms, morphology and dialects [1]. Furthermore, web sites that broadcast the Twitter hash tags that become trends, as in Fig. 1, show that many popular trends are titled with uncommon words that may not reflect the trend's contents. Hence, the different topics found within one trend may lead commercial companies to post in an unsuitable hash tag trend. Knowing the key words of the tweets within a trend may resolve this confusion. In addition, identifying the right search query to accurately measure the public response to a specific topic is crucial for interested individuals and companies. Therefore, a hot area of research is to develop a methodology that identifies key phrases and synonyms in a way that adapts to changes in public interest, so that more relevant texts are retrieved, and then to develop a mechanism suited to processing the written language. This research is concerned with finding a mechanism that searches for better key words and phrases (seed words) that retrieve the related tweets in a way that ensures continuous follow-up of changes in public opinion. Twitter, as the most popular social network in Saudi Arabia, is used as the platform for analysis and experiment. The experiments test the proposed approach using seed words from the Arabic language and report the precision, recall and accuracy of classification.
The rest of the paper is organized as follows: Section II overviews the work related to sentiment analysis in Arabic. Section III describes the data collection and dataset construction. Section IV describes the proposed methodology for mining the trending hashtags. Section V describes the experimental setup. Section VI presents the results of the evaluation of the proposed approach and Section VII concludes the paper.

II. RELATED WORKS
Sentiment analysis attracts researchers who focus on mining the tremendous amount of information, discussions, texts, reviews and opinions available in digital form [2]. Recently, opinion mining, or sentiment analysis, has been widely used for various purposes because of its encouraging findings. Typically, sentiment analysis is used in the business sector to measure public response and points of view [3]. Government agencies, businesses and research agencies use sentiment analysis for a better, indirect understanding of public opinion on a specific service, product or suggestion. Sentiment analysis shows promising results in terms of efficiency and lower cost compared with traditional survey methods. Various articles have investigated sentiment analysis on different social networks, such as Facebook and Twitter, and for different languages, such as English and Arabic. Since microblogging services such as Twitter were introduced relatively recently, few studies have investigated sentiment analysis on them, although attention is increasing [4]. Other research highlights the importance of sentiment analysis for unstructured information in social networks such as Facebook [5]. In [6], the authors develop a sentiment prediction model to investigate the added value of auxiliary data, including leading information, lagging information and traditional post variables, for Facebook posts.
Each natural language has its own rules, lexicons, morphology and diversity of dialects. Hence, sentiment analysis should account for these differences to obtain the desired results. Sentiment analysis has been investigated extensively for English, resulting in two families of methodologies: corpus-based and lexicon-based [7]. Arabic has a huge number of synonyms, which results in data sparsity [8]. Various methods for subjectivity and sentiment analysis of Arabic are outlined in [9]. SemEval-2017 Task 4 [10] describes the fifth year of the Sentiment Analysis in Twitter task, a competition among 48 teams in both English and Arabic. In [11], the author presents NileULex, an Arabic sentiment lexicon containing thousands of Arabic terms and compound phrases. A semantic approach has been developed to extract user attitudes from Arabic-language social networks, covering both standard Arabic and dialects, and introducing an Arabic Sentiment Ontology (ASO) that includes terms describing how strongly the extracted terms express feelings [12]. A lexicon of expressions and proverbs has been used to improve sentiment polarity analysis of Arabic sentences [13].
AdaMC was presented to improve the accuracy of sentence-level negative/positive classification via adaptive multi-compositionality [14]. This research focuses on finding an approach that identifies the best seed words for expressing public opinion, keeping in mind that they change frequently over time.

III. DATA COLLECTION
Millions of tweets are posted every day in different languages in response to popular hashtags. Twitter hashtags that attract immediate public attention at a particular time become trending hashtags. The main aim of this research is to mine Arabic trending hashtags to sense the public response to a particular trending hashtag. A dataset of trending hashtags is needed to assess the proposed approach. To construct a Twitter dataset that can be analysed to evaluate the proposed methodology, the following steps were followed:
1) To obtain a large number of tweets for experiments with the potential to produce clear results, the site https://trends24.in/ was used, as suggested by SemEval-2017 [10], to identify the trending topics of interest within the search time interval. Further supporting information, such as filtering by geographical location, is of extra benefit.
2) Subscribe to the streaming API service provided by Twitter. Registration provides an access token that permits downloading the tweets that match the search query.
3) Using Python code, create a session to download the tweet texts (along with extra supporting information), then save the retrieved tweets to a file. The limit on the number of tweets inspected for each search query was set to 100,000.
4) Pre-process the tweets to remove duplicates and URLs from the tweet contents.
5) Retain only topics with more than 100 tweets for the dataset construction.
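Steps 4) and 5) above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names, the URL pattern and the input layout (a dictionary mapping each topic to its list of raw tweet texts) are assumptions made for the example.

```python
import re

# Assumed pattern for links; real tweets may need a broader expression.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_tweet(text):
    """Remove URLs from a tweet and collapse the remaining whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


def build_dataset(tweets_by_topic, min_tweets=100):
    """Clean and de-duplicate tweets, keeping topics with more than min_tweets.

    tweets_by_topic is a hypothetical dict: topic -> list of raw tweet texts.
    """
    dataset = {}
    for topic, tweets in tweets_by_topic.items():
        seen = set()
        unique = []
        for raw in tweets:
            cleaned = clean_tweet(raw)
            # Duplicates are detected after URL removal (step 4).
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                unique.append(cleaned)
        # Step 5: retain only topics with more than min_tweets tweets.
        if len(unique) > min_tweets:
            dataset[topic] = unique
    return dataset
```

Duplicates are compared after URL removal, so two tweets that differ only in their shortened link count as one; whether that matches the original pre-processing is not stated in the paper.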

IV. METHODOLOGY
The research aims to provide a methodology that helps to better understand and measure the public response or opinion on a particular topic by analysing the sentiment expressed in social networks. Twitter was chosen for its popularity. The Twitter streaming API was used to search for and retrieve the trending hashtags. The proposed methodology is illustrated in Fig. 2 and proceeds as follows:
• The search query (mother seed) is used to extract the tweets of a hash tag trend, in which candidate seed words representing public opinion on the desired topic are sought. Additional factors, such as the geographic location of tweets, which is sometimes mentioned in the tweets, may refine the search and can be added as key words.
• Next, tokenize the tweets into words and exclude stop words, the very common words that cannot distinguish one subject from another. Microblogs differ from regular texts in that they include noisy text blocks; TF-IDF has nevertheless proved able to extract keywords from microblogs [15]. Stem the remaining words, then weight them using a weighting technique such as TF-IDF.
• Considering that public feelings change over time and that a trend might contain several different topics, cluster the tweets into groups to formulate topic profiles within a trend. Each cluster (topic profile) consists of the weighted words from the previous step.
• Sort the weighted words by weight (highest to lowest), as Algorithm 1 directs, then pick the top five words as the surrogate seed words that best represent the tweets' topic according to the users' interests and their change over time.
• Mine the topic profiles using a sentiment analysis tool such as AYLIEN Text Analysis. Sentiment analysis classifies the tweets as objective (containing factual information) or subjective (describing opinions or feelings about a specific topic).
• Finally, apply several classification methods and calculate the precision, recall and accuracy of each. The classification method with the highest accuracy is chosen.

V. EXPERIMENTS
This section describes the experimental setup and settings.
1) First, as proposed by SemEval-2017 [10], use https://trends24.in/ to explore the recent hash tag trends, with the region set to Saudi Arabia.
2) Second, for each hash tag trend:
a) Use RapidMiner Studio to retrieve the tweets related to that trend using the Twitter search operator, which requires the access token issued by Twitter on subscription to the Twitter streaming API. Write the retrieved tweets to a file for later processing. The saved file contains the tweets in addition to some details {created at, from user, from user ID, to user ID, language, source,

VI. RESULTS AND DISCUSSION
Keeping in mind that a hash tag might contain more suitable seed words that better represent the topics, the evaluation of the proposed approach first clusters the tweets, then performs sentiment analysis, and finally develops a prediction model that classifies a new tweet as either subjective or objective. Precision, recall and accuracy were used to test the efficiency of the proposed approach; these measures are commonly used to assess and compare detection algorithms [17]. In relation to the sentiment analysis of tweets, they are defined as follows.
Precision measures the proportion of relevant tweets among all tweets retrieved for a specific class (true positives and false positives):
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$
Recall measures the proportion of retrieved relevant tweets over the total number of relevant tweets:
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$
Accuracy measures the proportion of correctly classified tweets among all classified tweets:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
where TP is the number of true positives, FP the number of false positives, TN the number of true negatives and FN the number of false negatives.
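As a small worked illustration of the three measures, the following Python sketch computes them from lists of actual and predicted labels. The function names and the choice of "subjective" as the positive class are assumptions made for the example and are not taken from the experiments.

```python
def confusion_counts(actual, predicted, positive="subjective"):
    """Count TP, FP, TN and FN for one class treated as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 when nothing was retrieved."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 when there are no relevant items."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, tn, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0
```

For example, with two true positives, one false positive, one true negative and one false negative, precision and recall are both 2/3 and accuracy is 3/5.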
Text analysis classifies texts into two categories: facts and opinions [18]. Facts are "objective expressions about entities, events and their properties" while opinions "are subjective expressions that describe people's sentiments, appraisals or feelings" [19]. Several experiments were conducted on various trending hashtags, but only trends with more than 100 tweets were reported. Tables I to VIII report the precision and recall for the four top Arabic hash tag trends retrieved from the website https://trends24.in/. Sentiment analysis of tweets normally classifies a tweet as either objective or subjective. Subjectivity conveys two things: people's feelings and the language expressions used to describe those feelings [19], while objectivity conveys factual information. The tables illustrate the efficiency of the proposed approach (clustering the trending hashtags into profiles, then performing sentiment analysis on the tweets' texts) against the usual sentiment analysis (performing sentiment analysis on the tweets' texts without first learning trend profiles). Table IX reports the accuracy calculated for the four hash tags, together with the number of tweets retrieved for each hash tag and the number of topic profiles (clusters) for each hash tag. The following observations are drawn from Table IX:
1) The performance of normal sentiment analysis is higher than that of the proposed approach only for the first hash tag. This can be explained by the small number of tweets retrieved for that hash tag (96 tweets) compared with the remaining three hash tags. Hence, the results support the assumption mentioned in the methodology that only hash tags with a minimum of 100 tweets should be retained.
2) Notably, as hash tags 2 to 4 illustrate, the more tweets retrieved for a hash tag, the better the accuracy.

VII. CONCLUSION
Mining the tremendous amount of text, information, posts and customer comments is essential for extracting the desired knowledge. Companies might survey customers via traditional methods such as questionnaires and interviews, which are time consuming and costly for large samples, and to which people might respond inaccurately for various reasons. Sentiment analysis is a technique that helps to understand and measure the opinions of the targeted population. The proposed approach learns topic profiles that help to better understand the public response to a particular service, product or feedback by analysing recent Twitter hash tag trends. Occasionally, a hash tag title is not understandable, or is misleading because its tweets represent different topics. To solve this problem, clustering of the tweets was proposed to learn topic profiles. Recent Arabic hash tag trends, as listed by a trends website, were retrieved, and the proposed approach was tested on the tweets of popular hash tags that became trends. Results show that the more tweets retrieved for a hash tag, the more groups or clusters (topic profiles) are found, leading to enhanced sentiment analysis. Applying deep learning, the findings show that the accuracy of the proposed approach is better than that of normal sentiment analysis. As future work, the proposed approach will be investigated further to automatically use the topic profiles learned for each hashtag to retrieve similar topics.
Yahya AlMurtadha, Department of Computer Science, Faculty of Computing and Information Technology, University of Tabuk, Tabuk, Kingdom of Saudi Arabia. www.ijacsa.thesai.org
text, ID, geo-location latitude, geo-location longitude, ID, retweet count}.
b) Remove duplicate tweets.
c) For each tweet text, remove the retweet symbol and links (http).
d) Use the Support Vector Clustering technique to cluster the collected tweets of a specific trend into groups that represent topics of related seed words.
e) Process each cluster: tokenize the texts into words, remove stop words, stem the words and then weight them.
f) Learn the topic profile by sorting the weighted words, then pick the top five words as the surrogate words that best represent the topic profile.
Use 10-fold (k = 10) cross validation to test the prediction accuracy of the developed prediction model. Cross validation splits the dataset into k subsets; each time, one subset is used for testing and the remaining k-1 subsets are used for training. The average over all k experiments is used as the validation value.
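The word weighting, the top-five selection and the k-fold split described in these steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the RapidMiner process used in the experiments: the function names are hypothetical, tokenization is a simple whitespace split, stemming is omitted (an Arabic stemmer would be needed), and each cluster is treated as one document when computing a smoothed TF-IDF weight.

```python
import math
from collections import Counter


def topic_profiles(clusters, stop_words=frozenset(), top_n=5):
    """Return {cluster_id: [(word, weight), ...]} with the top_n weighted words.

    Assumption: each cluster is treated as one document, so a word shared by
    many clusters receives a lower (smoothed) IDF than a word unique to one.
    """
    counts = {}
    for cluster_id, tweets in clusters.items():
        tokens = [w for tweet in tweets for w in tweet.split()
                  if w not in stop_words]
        counts[cluster_id] = Counter(tokens)

    # Number of clusters in which each word occurs.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    n_clusters = len(counts)
    profiles = {}
    for cluster_id, word_counts in counts.items():
        total = sum(word_counts.values())
        weights = {}
        for word, n in word_counts.items():
            tf = n / total
            idf = math.log((1 + n_clusters) / (1 + doc_freq[word])) + 1.0
            weights[word] = tf * idf
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        profiles[cluster_id] = ranked[:top_n]
    return profiles


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

A classifier would be trained on each `train` index set and scored on the matching `test` set, and the k scores averaged to give the validation value. The smoothed IDF is one common variant and is a design choice of this sketch; the paper does not state which TF-IDF formula was used.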

TABLE IX. ACCURACY FOR THE 4 HASH TAGS