Annotated Corpus of Mesopotamian-Iraqi Dialect for Sentiment Analysis in Social Media

Research on sentiment analysis of social media in the Mesopotamian-Iraqi Dialect (MID) of Arabic is rare: no reliable dataset has been developed in MID, nor is there an annotated corpus for sentiment analysis of social media in this dialect. This gap has been the main stumbling block for researchers of sentiment analysis in MID. This paper therefore introduces an annotated corpus of the Mesopotamian-Iraqi Dialect for sentiment analysis in social media, named ACMID (Annotated Corpus of Mesopotamian-Iraqi Dialect), to support future studies. To the best of our knowledge, this is the first annotated corpus in MID that covers both polarity and emotion classification. Facebook, the most popular social platform among Iraqis, was used as the data source: 5000 comments were extracted from popular Iraqi pages and classified by polarity (Positive, Negative, Neutral, Spam) by two Iraqi annotators. The same annotators simultaneously classified the same comments according to Ekman's seven universal emotions (Anger, Fear, Disgust, Happiness, Sadness, Surprise, Contempt) or as showing no emotion. Cohen's kappa coefficient was then used to compare the two annotators' results and assess their reliability. The agreement between the two annotators reached a kappa of 0.82 for the polarity classification and 0.65 for the emotion classification.

Keywords: Sentiment analysis; Mesopotamian dialect; Iraqi dialect; social media; annotated corpus; emotion classification; Arabic language


I. INTRODUCTION
Mesopotamian-Iraqi Dialect (MID) is a main dialect of Arabic, spoken by more than 40 million people in Iraq and its neighbors, which makes it the second most widely spoken dialect of Arabic in the Arab world after the Egyptian dialect (which has around 100 million speakers). Facebook is the most popular social network among Iraqis, and Iraqi people usually use their dialect in Facebook comments and posts.
Iraq is an important country in the Middle East and the world. It is the cradle of civilization and one of the wealthiest countries in oil reserves and production, which can affect the world economy. Iraq has been at the forefront of many global events throughout history, and it is hard to find anyone who has not heard of Iraq because of the events that keep happening there.
Therefore, MID, as the dialect of most residents of this country, plays an important role in extracting the opinions of its people. Analyzing what Iraqis write themselves gives a fuller picture of their thinking than second-hand accounts, which can be inaccurate or misleading. Understanding people's opinions can also be useful in making trading and social decisions, as well as in investing in many fields of the economy.
Social media is the main source of people's opinions: by extracting data from comments and posts and classifying their polarity and emotion towards certain events and ideas, useful information can be obtained. Facebook, as mentioned above, is the main social media platform used by Iraqi people, with more than 21 million users in Iraq [1], so extracting data from Iraqi Facebook pages can be very useful for capturing people's thoughts and opinions.
Despite the importance of the Mesopotamian-Iraqi Dialect (MID), and of the Arabic language in general, studies on sentiment analysis in social media using this dialect are rare, and there is neither a real dataset developed in MID nor an annotated corpus that can be relied on for sentiment analysis of social media in this dialect [2].
Some researchers have preferred to work on an English version of the original Arabic text instead, because of the complexity of the Arabic language in general and the features of English that facilitate extracting more accurate results [3]. This gap has been the main stumbling block for researchers of sentiment analysis in MID. For this reason, this paper introduces a new annotated corpus named ACMID, whose data are extracted from popular Iraqi Facebook pages, to help researchers in future studies of sentiment analysis in social media written in MID.
To build the new annotated corpus ACMID, data were extracted from popular Iraqi pages on Facebook, as it is the most popular social platform among Iraqis. 5000 comments were extracted from these pages and classified by polarity (Positive, Negative, Neutral, Spam) by two Iraqi annotators. The same annotators simultaneously classified the same comments according to Ekman's seven universal emotions (Anger, Fear, Disgust, Happiness, Sadness, Surprise, Contempt) or as showing no emotion.
The rest of this paper is organized as follows. Section II reviews related works, Section III briefly describes Arabic dialects, Section IV covers data collection and pre-processing, Section V describes the data annotation and the rules followed by the annotators, and Section VI discusses the results of this work.

II. RELATED WORKS
Related works on sentiment analysis in MID are very limited. Most of the related works on the Arabic language address MSA and some regional dialects: the Egyptian dialect, the dialects of Saudi Arabia (Najdi and Gulf Arabic, usually referred to together as the Saudi dialect), and other dialects of Arabic (Levantine, Maghrebi, etc.).
The AWATEF corpus is one of the corpora most relied on by researchers of Arabic; its data were extracted from different sources in MSA [4]. COLABA (Cross-Lingual Arabic Blog Alerts) is a project covering many Arabic dialects, including MID, that developed Natural Language Processing (NLP) resources for these dialects [5]. The DIWAN software, on the other hand, was developed to help train annotators to create their tagged corpora; it can capture the morphological characteristics of a given text [6]. Itani et al. built Arabic corpora by extracting data from Arabic Facebook pages (Al-Arabiyya and The Voice) [7].
Al-Kabi et al. [8] created an Arabic corpus from reviews written in MSA and in five Arabic dialects (Egyptian, Levantine, Arabian Peninsula, Maghrebi, and Mesopotamian-Iraqi); this corpus has 250 topics and 1442 reviews.
Meanwhile, many researchers have studied sentiment analysis in the Saudi Arabic dialect. Assiri et al. created the first reliable Saudi annotated corpus from Twitter comments [9], while SDTC [10] was the first Saudi Twitter corpus labeled by three annotators.
Alnawas et al. [11] are among the few researchers who have focused on MID as their dialect of interest; they used Doc2Vec to represent text as input to binary machine learning classifiers (Decision Tree, Logistic Regression, Naïve Bayes, and Support Vector Machine).

III. MSA, CA/QA AND MID
Modern Standard Arabic (MSA) was derived from Classical Arabic (CA) in the late 19th and early 20th centuries by Arab linguists as a modern form of CA. MSA is used widely in the Arab world (the Arab Homeland, as Arabs prefer to call it) as the main language for learning, writing, conversation among educated people in universities, legislation, and other formal speech. It sometimes also serves as a lingua franca among Arabs from remote regions whose dialects are not mutually intelligible (e.g., an Iraqi speaking with an Algerian).
Classical Arabic (CA), or Quranic Arabic (QA), is the root language of all other Arabic dialects. It is based on the text of the Quran (the holy book of Muslims around the world), which was first introduced in the 7th century in the western part of the Arabian Peninsula. The Quran used the Arabic dialect of that time and region, which eventually became the root of all later Arabic dialects.
Most Arabic speakers cannot distinguish between MSA and CA, and most of them consider the two to be one dialect. Arabs usually refer to both as Al-Arabiya Al-Fusha (العربية الفصحى) [12].
Arabic dialects can be divided into five groups. Mesopotamian-Iraqi Dialect (MID) is a main dialect of Arabic in most of present-day Iraq, in some regions of neighboring countries, and among Iraqis in the diaspora around the world. People of this region usually use MID as their mother tongue in daily conversation, while using Modern Standard Arabic (MSA) in writing, formal conversation, and media. Writing in MID was rare throughout its development over the last ten centuries, until the arrival of the Internet and the mobile phone, where it was first used for texting and chatting and later on social media. The South Mesopotamian dialect (gelet) was used in this work, as it is the main dialect among Iraqis, especially in Baghdad, the capital and largest city of Iraq. Iraqis mostly use this dialect on social media, including people from the northern part of Iraq [13].

IV. DATA EXTRACTION AND PRE-PROCESSING
Facebook, as one of the most popular social media platforms among Iraqi people, was used as the source of data in the Mesopotamian-Iraqi Dialect for sentiment analysis. Three Iraqi Facebook pages were targeted, and data were collected from the comments on different kinds of posts on these pages. The first page, "دليل مطاعم بغداد" (Baghdad Restaurants Directory), has more than one million followers. The second page, "برنامج ولاية بطيخ" (Melon City Show), belongs to a comedy show that is famous among Iraqis and has more than three million followers. The third page, "جامعة بغداد" (University of Baghdad), is an unofficial page of the University of Baghdad and had around forty thousand followers at the time this paper was written.
Facepager, an application for retrieving data from the web, was used to extract data from Facebook. First, the ID of each Facebook page was obtained from the Findmyfbid website to specify the page whose comments Facepager would retrieve. The comments were then exported to a CSV file.
In the next step, the retrieved data were pre-processed using the following procedures:
• Remove empty comments from the corpus.
• Remove comments that contain just a tagged name without a real review.
• Remove redundant (duplicate) comments from the corpus.
• Remove seriously offensive words that are not acceptable in any way.
• Remove comments that contain just one character or symbol (e.g., ".", "م").
• Remove any comment that was not written in MID or in the Arabic language in general.
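The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' actual pipeline: the function name `clean_comments`, the `@name` convention for tagged names, the blocked-word set, and the Unicode-range check for Arabic script are all assumptions made for the example. In particular, a script check cannot tell MID from other Arabic varieties, and the sketch drops a whole comment when it contains a blocked word, which is only one reading of the rule.

```python
import re

# Assumption: a comment counts as Arabic if it contains at least one
# character from the basic Arabic Unicode block (U+0600 to U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

# Assumption: tagged names appear as "@name" tokens; real Facebook
# exports may represent tags differently.
TAG_ONLY = re.compile(r"^@\S+(\s+@\S+)*$")


def clean_comments(comments, blocked_words=frozenset()):
    """Apply the filtering rules in order and return the kept comments."""
    seen = set()
    kept = []
    for raw in comments:
        text = (raw or "").strip()
        if not text:                      # empty comment
            continue
        if TAG_ONLY.match(text):          # only tagged names, no review
            continue
        if len(text) <= 1:                # single character or symbol
            continue
        if not ARABIC_CHAR.search(text):  # not written in Arabic script
            continue
        if any(word in blocked_words for word in text.split()):
            continue                      # contains a blocked word
        if text in seen:                  # duplicate of an earlier comment
            continue
        seen.add(text)
        kept.append(text)
    return kept


sample = ["", "@Ali", ".", "م", "nice food", "الاكل طيب", "الاكل طيب", "كلش حلو"]
print(clean_comments(sample))
```

On this hypothetical sample only the two distinct Arabic comments survive; the empty string, the tag-only comment, the single characters, the English comment, and the duplicate are dropped.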

V. DATA ANNOTATION
To build the new annotated corpus ACMID, two native Iraqi Arabic speakers (a doctor in his thirties and a 25-year-old engineer) tagged each comment extracted from the Facebook pages and classified it according to its polarity as Positive, Negative, Neutral, or Spam.
Simultaneously, the annotators classified these comments according to Ekman's seven universal emotions (Anger, Fear, Disgust, Happiness, Sadness, Surprise, Contempt) [14]; if a comment showed no emotion, the annotator tagged it as (no emotion).
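One way to represent this two-layer labeling scheme in code is sketched below. It assumes the four polarity labels reported for the corpus (Positive, Negative, Neutral, Spam) and the eight emotion choices (seven Ekman emotions plus no emotion). The class and field names (`Polarity`, `Emotion`, `Annotation`, `comment_id`, `annotator`) are hypothetical and chosen only for illustration; the paper does not describe its storage format.

```python
from dataclasses import dataclass
from enum import Enum


class Polarity(Enum):
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"
    SPAM = "Spam"


class Emotion(Enum):
    ANGER = "Anger"
    FEAR = "Fear"
    DISGUST = "Disgust"
    HAPPINESS = "Happiness"
    SADNESS = "Sadness"
    SURPRISE = "Surprise"
    CONTEMPT = "Contempt"
    NO_EMOTION = "No emotion"


@dataclass(frozen=True)
class Annotation:
    """One annotator's labels for one comment (hypothetical record layout)."""
    comment_id: int
    annotator: str
    polarity: Polarity
    emotion: Emotion


# Looking a label up by its text rejects anything outside the scheme:
# Polarity("Mixed") would raise ValueError.
example = Annotation(1, "A", Polarity("Positive"), Emotion("Happiness"))
print(example.polarity.value, example.emotion.value)
```

Keeping the labels as enumerations means a typo in a label file fails loudly at load time instead of silently creating a ninth emotion class.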
The classification of these comments followed these steps and rules:
• A brief explanation of sentiment analysis was given to the annotators.
• An example of annotating five comments was shown to the annotators.
• At first, the annotators were asked to classify ten comments only.
• After that, a short discussion among the annotators about their work took place.
• The annotators were then asked to complete tagging all the comments separately.
• The annotators were asked not to discuss their work with each other.
• The annotators were asked not to let their personal views about a certain topic influence their classification.

VI. RESULTS AND DISCUSSION
The 5000 comments were classified according to their polarity and emotions by two annotators, as described in the previous sections. The polarity is either positive, negative, neutral, or spam. This four-way scheme gives the annotators a wider range of choices than a purely positive or negative classification, which can be confusing for some comments.
The second classification concerns emotion according to Ekman's seven universal emotions (Anger, Fear, Disgust, Happiness, Sadness, Surprise, and Contempt). If the annotator sees no emotion in a certain comment, he can choose the eighth option (no emotion).
For the first annotator, the polarity classification shows that positive took 2243 of the 5000 comments (44.86%), negative took 1682 (33.64%), neutral took 1038 (20.76%), and spam took only 37 (0.74%).
For the second annotator, positive took 2179 of the 5000 comments (43.58%), negative took 1662 (33.24%), neutral took 1080 (21.6%), and spam took 79 (1.58%). Table I shows that the annotators agreed on 88.32% of the comments in the polarity classification, which is considered very high. To ensure the reliability of the polarity classification, Cohen's kappa coefficient was used to compare the results of the two annotators. Cohen's kappa measures inter-rater reliability for qualitative items [15]; κ takes into account the possibility of agreement by chance (AC).
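In its standard form, the Cohen's kappa coefficient for the agreement between the two annotators is

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement between the annotators and $p_e$ is the agreement expected by chance (AC). The final result for the polarity classification is a kappa coefficient of 0.82 for the agreement between the two annotators.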
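As a consistency check, the reported polarity kappa can be reproduced from the figures given in this section. The sketch below assumes the standard definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement), takes the 88.32% agreement as the observed agreement, and estimates the chance agreement from each annotator's label counts. The function name `cohen_kappa` is illustrative; this is not the authors' code.

```python
def cohen_kappa(observed_agreement, counts_a, counts_b):
    """Cohen's kappa from observed agreement and per-class label counts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Chance agreement: probability that both annotators pick the same
    # class when each labels independently according to their own marginals.
    expected = sum(
        (counts_a[label] / total_a) * (counts_b[label] / total_b)
        for label in counts_a
    )
    return (observed_agreement - expected) / (1 - expected)


# Label counts reported for the two annotators (5000 comments each).
polarity_a = {"Positive": 2243, "Negative": 1682, "Neutral": 1038, "Spam": 37}
polarity_b = {"Positive": 2179, "Negative": 1662, "Neutral": 1080, "Spam": 79}

kappa = cohen_kappa(0.8832, polarity_a, polarity_b)
print(round(kappa, 2))  # 0.82
```

With these inputs the chance agreement is about 0.352, and the resulting kappa rounds to the reported 0.82.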
For the first annotator, the emotion classification of the 5000 comments is as follows: Anger, 256 comments (5.12%); Fear, 38 (0.76%); Disgust, 227 (4.54%); Happiness, 976 (19.52%); Sadness, 346 (6.92%); Surprise, 336 (6.72%); Contempt, 400 (8%); and No emotion, 2421 (48.42%).
For the second annotator, the results are: Anger, 369 comments (7.38%); Fear, 45 (0.9%); Disgust, 198 (3.96%); Happiness, 803 (16.06%); Sadness, 360 (7.2%); Surprise, 347 (6.94%); Contempt, 422 (8.44%); and No emotion, 2456 (49.12%). Table II shows that the annotators agreed on 75.06% of the comments in the emotion classification.
Cohen's kappa coefficient was again used to compare the results of the two annotators for the emotion classification, computed in the same way as for the polarity classification. The final result for the emotion classification is a kappa coefficient of 0.65 for the agreement between the two annotators.
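The emotion kappa can be checked by hand from the reported figures. The derivation below assumes the standard definition of Cohen's kappa and takes the 75.06% agreement as the observed agreement $p_o$. The chance agreement $p_e$ is the sum, over the eight labels, of the product of the two annotators' label proportions (in the order Anger, Fear, Disgust, Happiness, Sadness, Surprise, Contempt, No emotion):

```latex
\begin{align*}
p_e &= (0.0512)(0.0738) + (0.0076)(0.0090) + (0.0454)(0.0396) + (0.1952)(0.1606) \\
    &\quad + (0.0692)(0.0720) + (0.0672)(0.0694) + (0.0800)(0.0844) + (0.4842)(0.4912) \\
    &\approx 0.2912, \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.7506 - 0.2912}{1 - 0.2912} \approx 0.65.
\end{align*}
```

The chance agreement is dominated by the No emotion label, which both annotators chose for almost half of the comments.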

VII. CONCLUSION
Mesopotamian-Iraqi Dialect (MID) is a main dialect of Arabic, yet research on this dialect is rare. Researchers have difficulty studying sentiment analysis in MID because of the lack of a reliable annotated corpus and of a real dataset.
To the best of our knowledge, this paper introduces the first annotated corpus, ACMID, that covers both polarity and emotion classification in MID. Two annotators tagged the comments extracted from three famous Iraqi Facebook pages. The Kappa coefficient for the agreement between the two annotators is 0.82 for the polarity classification and 0.65 for the emotion classification.
The future plan is to apply machine learning techniques to the created corpus ACMID (Annotated Corpus of Mesopotamian-Iraqi Dialect).