BAAC: Bangor Arabic Annotated Corpus

This paper describes the creation of the new Bangor Arabic Annotated Corpus (BAAC), a Modern Standard Arabic (MSA) corpus that comprises 50K words manually annotated with part-of-speech tags. To evaluate the quality of the corpus, the Kappa coefficient and a direct percent agreement for each tag were calculated; a Kappa value of 0.956 was obtained, with an average observed agreement of 94.25%. The corpus was used to evaluate the widely used Madamira Arabic part-of-speech tagger and to further investigate compression models for text compressed using part-of-speech tags. Also, a new annotation tool was developed and employed for the annotation process of the BAAC.

Keywords: Component; Arabic language; corpus; annotated corpora; analysis results


I. BACKGROUND AND MOTIVATION
The Arabic language "العربية" is acknowledged to be one of the most widely used languages, with 330 million people using it as their first language, as shown in Table 1, plus 1.4 billion more using it as a secondary language [1]. The majority of the speakers are located across twenty-two nations, primarily in the Middle East, North Africa and Asia, and the United Nations considers Arabic one of its six official languages. The Arabic language is part of the Semitic family of languages, which includes Tigrinya, Amharic, Hebrew, etc., and shares almost the same structure as those languages. It has 28 letters, two genders (feminine and masculine), as well as singular, dual and plural forms. The Arabic language has a right-to-left writing system with a basic grammatical structure that consists of verb-subject-object and other structures, such as VOS, VO and SVO [2]-[4].
The non-colloquial written text for the Arabic language can be divided into two types: Classical Arabic and Modern Standard Arabic [5]-[8]. The Classical Arabic (CA) epoch, as shown in Figure 1, is usually measured from the sixth century, which marks the start of Arabic literature. It is the language of the Holy Quran, the 1,400-year-old primary religious book of Islam with 77,430 words [9], and of other ancient Islamic books from that era, such as the Hadith books [10]. Modern Standard Arabic (MSA) came with the beginning of journalism and the spread of literacy in the eighteenth century. MSA is the language of current printed Arabic media and most Arabic publications.
Most Arabic natural language processing (NLP) tasks perform better for MSA [11]. One example of those tasks is part-of-speech (POS) tagging of the Arabic language as reported in [10], [12], [13], where the performance of the taggers is best when tagging MSA text. The reason for the variation in performance between MSA and CA is that most Arabic language NLP systems were trained using MSA text [14], [15]. More effort is currently being made to fill this gap in research, such as the creation of manually annotated CA corpora [16] and the evaluation of different Arabic POS taggers on CA text by Alosaimy and Atwell [12].
The term corpus can be defined as a computerised set of genuine texts or discourses provided by language speakers and saved in a machine-readable form [17]-[20]. Xiao [21] argues that a corpus is neither a random collection of texts nor an archive, but a file that manifests four essential aspects: a corpus is a set of (1) machine-readable (2) genuine texts (including transcripts of spoken data) that are (3) sampled to be (4) representative of a specific language or a group of languages. Corpora play a significant role in the development, improvement and evaluation of many NLP applications such as machine translation [22], [23], part-of-speech tagging [24] and text classification [14], [23]. The design of any corpus depends on its intended applications [25]. Some corpora are for general use and can be utilised in many applications, and others may serve a specific purpose, such as building dictionaries or examining the language of a specific author or period of time [10].
There are several kinds of annotations which could be applied to corpora, and each annotation is usually designed to handle a certain aspect of the language [26]. One type of corpus annotation is structural annotation, which attaches descriptive information about the text, like mark-ups that specify the boundaries of the sentence, section and chapter, or a header file that names the author of the text or adds information about participants, such as age and gender. Another type is morphological annotation, where information about the text, like the stems, or the roots in a root-based language like Arabic, is added to the corpus. This research applies the most common type of corpus annotation, which is POS tagging of the text [26], where a tag, such as noun, verb or particle, is assigned to each term in the corpus; the number of tags used in the annotation varies from a few to 400 tags or more [27].
Based on the type of text and creation purposes, the corpus can be categorised into six categories, including Raw Text Corpora, Lexicon Corpora, Annotated Corpora and Miscellaneous Corpora. Examples of corpora for the Arabic language are provided below.
B. Multilingual corpora, also known as comparable corpora or parallel corpora, are corpora that are written in two or more languages. Multilingual corpora, such as the UN corpus [34], which is the most important and widely known free corpus [23], Corpus A [22], the Hadith Standard Corpus [35], [36] and MEEDAN Translation Memory [37], are widely used in NLP fields such as machine translation [22], [23].
C. Dialectal Corpora, where the corpus is written in a specific language dialect, such as the Bangor Twitter Arabic Corpus for the Egyptian, Gulf, Iraqi, Maghrebi and Levantine Arabic dialects [38]. Such corpora are used in fields such as text classification [14].
D. Web-based corpora, such as the KACST Arabic Corpus [39], the Leeds Arabic Internet Corpus [40] and the International Corpus of Arabic [41], where the corpora are only accessible online through a query interface and cannot be downloaded.
2) The second type is Lexicon corpora, which can be divided into:
A. Lexical Databases, such as the BAMA 1.0 English-Arabic Lexicon [42] and the Arabic-English Learner's Dictionary [43].
B. Word Lists, such as the Word Count of Modern Standard Arabic [43] and the Arabic Wordlist for Spellchecking [44], [45]. These types of corpora act like a vocabulary or a list of words and can be employed by linguists to study many aspects of a language or combined with the lexicons of systems, like spell checking applications, to improve their performance [23].
4) Annotated corpora are essential for the development of many NLP systems, such as part-of-speech tagging [24] and text parsing [50]. Annotated corpora are divided into:
A. Named Entities Corpora, such as JRC-Names [51] and ANERCorp [52]. Most corpora of this type include the names of persons, companies or organisations, and locations.
B. Error-Annotated Corpora, such as the KACST Error corpus [53], are a beneficial resource for systems such as spelling correction and the correction of machine translation output [54].
C. Miscellaneous Annotated Corpora, such as the OntoNotes corpus [55] and the Arabic Wikipedia Dependency Corpus [56], which are semantically annotated corpora [55].
D. Part-of-Speech (POS) tagged corpora are an important resource for the training and development of POS systems [24]. Some of the existing resources will be presented in detail in the existing resources section below. POS annotated corpora are essential for the development of many NLP systems, such as part-of-speech tagging [24], statistical modelling [57] and tag-based compression, which provides more effective compression for Arabic text than word or character-based compression methods [13]. The lack of such resources limits some researchers from progressing further in their efforts. The limited availability of some existing annotated corpora and the cost of acquiring others are among the main reasons that contribute to resource scarcity. Several efforts have been made to overcome the lack of resources [12], [16], [20].
Fig. 2. A social news item from the Press sub-corpus [28] in MSA text.
Some annotated corpora for the Arabic language, such as the Arabic Treebank corpus [58], cannot be utilised by many researchers, for example in the tag-based text compression research conducted by Alkhazi, Alghamdi and Teahan [13], due to availability and cost issues. Other resources are designed to be used for particular research or annotated using a distinctive tagset produced for an explicit purpose. The Qur'anic Arabic Dependency Treebank is one example, where the text is written in CA and the corpus uses a tagset which is designed to tag CA text using traditional Arabic grammar [16], [22]. This need for annotated corpora, which are necessary for the development of many NLP systems, provided the motivation to create a manually annotated corpus for the Arabic language. This research produces a manually annotated POS tagged corpus that is written in MSA. The tagset used in the new corpus was suggested by Alkhazi, Alghamdi and Teahan [13]; further details about the tagset will be discussed in the annotation tagset section (section III-B), and the annotation process follows the annotation guidelines prescribed by Maamouri [59].

II. EXISTING RESOURCES
In 2001, the Linguistic Data Consortium (LDC) published the first versions of the Penn Arabic Treebank (ATB) [58]. This resource is widely used in many Arabic NLP applications such as the training of POS taggers, like the Madamira Arabic POS tagger [60] and the Stanford Arabic POS tagger [3]. The corpus consists of three parts with a total of 1 million annotated words. The first part, v2.0, was a newswire text written in Modern Standard Arabic and consisted of 166K terms acquired from the Agence France Presse corpus. The second part was obtained from the Al-Hayat corpus, which was distributed by Ummah Arabic News Text, and consists of 144K terms [58]. The last part of the ATB corpus, part 3 v1.0, as shown in Figure 3, is a newswire text obtained from the An-Nahar corpus and consists of about 350K morphologically annotated words. For non-members of the LDC, the cost of acquiring any part of the ATB corpus exceeds several thousand US dollars, which prevents access for researchers with a limited budget [57], [58].
Khoja [61]-[63] has published a 50,000-term manually annotated POS tagged corpus written in MSA. According to the author, the corpus is divided into two parts; the first part is a newspaper text consisting of 1,700 terms that are manually tagged using a tagset that differentiates between the three moods of the verb and the case structures of the noun [64]. The second part of the corpus is tagged using a simple tagset that includes only the following POS tags: noun, verb, particle, punctuation or number [62]. However, access to this resource was not provided.
Another annotated corpus was published by Mohit [56]. The AQMAR Arabic Wikipedia Dependency Tree Corpus is a manually annotated corpus that contains 1262 sentences collected from ten Arabic Wikipedia articles, and the 36K terms of the corpus are manually annotated using the Brat annotation tool [56]. The ten articles were annotated for named entities beforehand [65]-[67] and cover topics such as Linux, Internet, Islamic Civilisation, Football, etc. The tagset used in this corpus contains a small number of tags and therefore cannot be used for the research concerning tag-based text compression.
The Columbia Arabic Treebank (CATiB) [27] is another manually annotated Treebank corpus that consists of newswire feeds from the years 2004 to 2007, written in MSA. The corpus was initially tokenized and then POS tagged by the MADA&TOKAN toolkit [15], [27]. The TrEd annotation interface [68] was utilised in the annotation process. The number of tags used by CATiB is relatively small as it consists of only six POS tags, NOM, PROP, VRB, VRB-PASS, PRT and PNX, where each tag comprises a group of subtags; for example, the tag "NOM" can be used to tag nouns, adverbs, pronouns and adjectives.

III. BAAC: THE BANGOR ARABIC ANNOTATED CORPUS
The goal of this annotated corpus is to contribute to filling the gap created by the scarcity of freely available Arabic resources, manually annotated POS tagged corpora in particular, which is caused by the lack of availability and cost issues. Another goal is to provide a new resource required by many kinds of research, such as the ongoing tag-based text compression research conducted by Alkhazi, Alghamdi and Teahan [13], where the only annotation required at this stage is POS tags. The tagset used to annotate the new corpus is the same as that used by the Madamira Arabic tagger, for reasons that will be discussed in the annotation tagset section (section III-B). Since the Madamira Arabic POS tagger is trained on the Arabic Treebank corpus [13], [14], and that corpus is written in MSA, the newly annotated corpus must also be written in MSA.

A. The Data Source.
The data source for the new corpus is the Press sub-corpus from the BACC corpus [28]. The BACC corpus was created originally to test the performance of various text compression algorithms on different text files. The results of the text classification performed by Alkhazi and Teahan [14] reveal that the Press sub-corpus is 99% written in MSA, as shown in Figure 2. According to the authors, the sub-corpus is a newswire text consisting of 51K terms, gathered from various news websites between 2010 and 2012, and covers many topics such as political and technology news.

B. The Annotation Tagset.
The tagset used in the BAAC corpus is the same as that used by the Madamira tagger [60], which was used initially by the MADA tagger [15]. The tagset is a subset of the English tagset which was presented with the English Penn Treebank; it consists of 32 tags and was initially proposed by Diab, Hacioglu and Jurafsky [69]. The experiments conducted by Alkhazi, Alghamdi and Teahan [13] have concluded that the quality of tag-based compression varies from one tagset to another. The different tagsets, some of which are shown in Table 3, were used to compress MSA text using POS tags, and tag-based compression using the Madamira tagset outperforms other tagsets such as Stanford [70] and Farasa [71]. One of the main goals of creating a gold-standard POS annotated text is to investigate the effect of manual annotation on tag-based text compression, as described below in the experiments. Therefore, the Madamira tagset, which outperformed other tagsets and consists of only 32 tags that are shown in Table 2, is used to annotate the BAAC and to create the ground-truth data which will be used later for training and evaluation purposes.

C. Automatic POS Tagging.
Madamira [60] was utilised to automatically tag the corpus by POS. The manual annotation process of the BAAC corpus followed the annotation guidelines proposed by Maamouri [72] for annotating POS tags. All the previous corrections that are made to a tag are shown to the annotators during the process of annotation, as illustrated in section III-E, and the Madamira tagset used to annotate this corpus applies the criteria proposed by the author.

D. The Annotation Tool.
Most existing tools, such as the TrEd tool [68], [73], which was used in the annotation of the Prague Dependency Treebank, are developed to annotate Treebank types of corpora, such as dependency tree corpora, that contain other information about the term, such as the gloss or a comment from an annotator, as shown in Figure 3. As mentioned earlier, the first stage of the BAAC annotation process will only add the POS tags to the corpus. Other linguistic information, such as the structural annotation, will be added in future work; therefore, the tool used to manually annotate this corpus only annotates POS tags. During the preparation for the annotation process, many constraints arose and defined four requirements that had to be met by the annotation tool. First, as the annotators are native Arabic speakers, a well-detailed Arabic translation of the tagset had to be provided with examples during the annotation process. Second, the software used for the annotation had to comply with the hardware and software requirements of the computer used to perform the annotation. Third, the annotation tool, as shown in Figure 4, had to be executed on different operating systems; therefore, the tool was designed to be portable. Finally, online backup procedures that record the ID of the annotators were needed to ensure the safety of the data.
The previous requirements were met by developing a new annotation tool. First, a detailed Arabic translation of the tagset, which was obtained from Alrabiah [10] and then examined by Arabic specialists, was coded in the annotation tool as shown in Figure 4. The annotation tool also offers examples of the tag if required by the annotator. To comply with the hardware requirements and reduce memory dependency, the tool loads only one sentence to be modified at a time. To follow the Maamouri [72] annotation guidelines, the tool also displays the history of annotation by showing two types of modifications: the original tag assigned by the Madamira tagger and any tag chosen by previous annotators, if they exist. The current status of the annotation process is also displayed to the annotator, such as the number of annotated tags in the current session and the number of modified tags in the whole document. The Java programming language was used to develop the annotation tool, and therefore the tool can be executed on different operating systems. The tool also performed an online backup each time the annotator modified a tag, to eliminate any data loss.
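The annotation history described above can be pictured with a small data structure. The following is a minimal Python sketch, not the authors' Java implementation; the class name `TokenAnnotation`, its fields and the annotator IDs are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TokenAnnotation:
    """One token with its automatic tag and the history of manual changes.

    Hypothetical structure for illustration; the real tool's internal
    representation is not described in the paper.
    """
    word: str
    madamira_tag: str                            # tag assigned automatically
    history: list = field(default_factory=list)  # (annotator_id, tag) pairs

    def modify(self, annotator_id, new_tag):
        """Record a manual change together with the annotator's ID."""
        self.history.append((annotator_id, new_tag))

    @property
    def current_tag(self):
        """Latest manual tag if any, otherwise the automatic tag."""
        return self.history[-1][1] if self.history else self.madamira_tag


token = TokenAnnotation(word="في", madamira_tag="prep")
token.modify("A1", "noun")
print(token.madamira_tag, token.history, token.current_tag)
```

Keeping the automatic tag separate from the list of manual changes means that both kinds of information shown to the annotator (the original tag and any earlier choices) remain available after every modification.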

E. Data Preparation.
After using Madamira [60] to automatically POS tag the corpus, a copy of the corpus was given to each annotator. Each copy was split into batches of documents that have 10-20 sentences, and the ID of the annotator was coded with each batch to be used later in the evaluation section. The two annotators, who are native Arabic speakers and postgraduate students in Arabic Studies, started working to manually annotate the corpus on a full-time basis in two stages.
In the first stage of the annotation process, the annotators were required to work on-site to resolve any issues with the annotation tool, and the annotation of the corpus was completed using the facilities provided by Tabuk Public Library. When the annotation process was finished, the two versions were evaluated and the Inter Annotator Agreement was calculated using two metrics, as will be discussed below in the BAAC evaluation section. The differences between the two versions were examined and adjusted off-site by a third annotator, who is a native Arabic speaker and a PhD candidate in Arabic Studies, to produce a final version of the corpus. The total time needed to annotate the corpus was two months: three weeks for the first stage and the rest for the final stage.
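The second stage, in which a third annotator resolves the differences between the two versions, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: both versions are taken to be tag lists aligned token by token, and the function names `find_disagreements` and `adjudicate` are hypothetical rather than part of the tool described in this paper.

```python
def find_disagreements(tags_a, tags_b):
    """Return the token positions where the two annotators chose different tags."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both versions must cover the same tokens")
    return [i for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]


def adjudicate(tags_a, tags_b, decisions):
    """Build the final tag sequence.

    Agreed tags are kept unchanged; each disagreement is replaced by the
    third annotator's decision, given as a {position: tag} mapping.
    """
    final = list(tags_a)
    for i in find_disagreements(tags_a, tags_b):
        final[i] = decisions[i]  # raises KeyError if a disagreement is unresolved
    return final


first = ["noun", "verb", "adj"]
second = ["noun", "verb", "noun"]
print(find_disagreements(first, second))
print(adjudicate(first, second, {2: "adj"}))
```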

IV. BAAC EVALUATION
The quality of the annotated corpus affects the quality of the NLP application that utilises it. For instance, Reidsma and Carletta [74] have illustrated that the errors produced by machine learning tools are the same errors made by the annotators of the corpus that was used for training those tools.
Two metrics were used to evaluate the quality of the BAAC: the Kappa coefficient [75], to calculate the inter-annotator agreement (IAA) between the two annotators, and a direct percent agreement for each tag [76]. Using the data in Table 4, the obtained Kappa value is 0.956, which is recognised as almost perfect agreement according to Landis and Koch [77]. The total observed agreement from Table 2, which displays the number of agreements and disagreements of different tags between the two annotators in a reverse frequency order, is 94.25%. Taking the number of tag occurrences into consideration, Table 2 shows that the tag verb or 'فعل' has the highest agreement between the annotators with 99.24% agreement. It also shows that the annotators agreed only 25 times out of 136 (18%) on the tag 'adv_interrog' or 'حال'. Also, the annotators agreed on only 45.98% of the instances of the tag 'adv' and 38.78% of the instances of the tag 'pron_interrog'. The reasons for such variation between the annotators were:
• The different understanding of the tag and, in some cases, its subset of tags by the annotators. For example, Table 4 shows that the two annotators disagreed concerning the tag 'noun' and the tag 'adj' in many instances. The different understanding of the tag 'adv_interrog' and the tag 'adj' has also caused a noticeable number of disagreements between the two annotators.
• Human error in the annotation process contributed to some of the errors in the annotated corpus. This was confirmed by random samples taken to be re-annotated by the same annotator.
The previous reasons were taken into consideration, and all the disagreements were highlighted and then given to the third annotator, who went through all of them and resolved them based on his judgment. Finally, a final version of the corpus, which contains the agreements from the first two annotators and the decisions of the third one, was produced and used for further applications, as illustrated in the experiments section.
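The two agreement metrics can be illustrated with a short Python sketch. It assumes that the Kappa coefficient used here is Cohen's kappa for two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The per-tag measure is also an assumption: it divides the number of tokens on which both annotators chose a tag by the number of tokens on which at least one of them chose it. The function names and the toy data are hypothetical.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two aligned tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: product of the two annotators' marginal proportions.
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1 - expected)


def per_tag_agreement(tags_a, tags_b):
    """Share of tokens tagged t by both annotators among tokens tagged t by either."""
    both, either = Counter(), Counter()
    for a, b in zip(tags_a, tags_b):
        if a == b:
            both[a] += 1
            either[a] += 1
        else:
            either[a] += 1
            either[b] += 1
    return {t: both[t] / either[t] for t in either}


first = ["noun", "noun", "verb", "adj"]
second = ["noun", "adj", "verb", "adj"]
print(round(cohen_kappa(first, second), 3))
print(per_tag_agreement(first, second))
```

Unlike the raw percent agreement, kappa discounts the agreement that two annotators would reach by chance, which matters in a corpus where a single tag such as 'noun' covers almost half of the tokens.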

V. CORPUS STATISTICS
As stated, the text of the BAAC corpus was obtained from the Press sub-corpus of the BACC. The first annotator made 3150 changes to the originally tagged corpus and the second made 2959 modifications. Table 5 and Table 8 list the ten most frequent tags for each annotator. The most frequent tag is 'noun', representing 47.52% for the first annotator and 46.48% for the second. The least used of these tags is 'noun_quant', being 1.13% of the tags for both annotators. A noticeable difference between the two annotators is the use of the tag 'adj', which represents 11.57% of the tags for the first annotator, occurring 1235 more times than for the second annotator (9.13%).
Table 6 shows the ten most frequently used terms in the BAAC. The first and second most frequent words in the BAAC are 'في', which is a 'prep' that translates as 'in', and 'من', which is also a 'prep' and translates as 'from', representing 2.83% and 2.65% of the text respectively. The table also shows that the most commonly used bigram is 'من خلال', which translates as 'through', occurring 37 times in the corpus. Since the Press sub-corpus, which is the source of the BAAC, was gathered between 2010 and 2012 from several Arabic news websites, the most commonly used trigrams in the BAAC are 'في ميدان التحرير', which translates as 'In Tahrir Square', and 'الأعلى للقوات المسلحة', which translates as 'Higher Council of the Armed Forces'; each was mentioned 12 times, and both trigrams relate to the events that happened in Egypt during the same period.
Figure 5 plots, using log scales, the ranked tag, bi-tag and tri-tag sequences versus their frequencies in the BAAC. There are 32 unique tags used in the annotated corpus, as mentioned earlier. The corpus also has 433 unique bi-tags, where the sequence 'noun noun' dominates the bi-tag sequences. Finally, there are 2,113 distinct tri-tags used in the BAAC. The figure shows a Zipf's Law-like behaviour which mirrors the behaviour of a similar plot for the English language [78]. More details about the BAAC n-tag sequences are found in Table 7 and will be discussed below.
Table 7 illustrates the ten most frequently used tag, bi-tag and tri-tag sequences in the BAAC. The tag 'noun' was utilised 23,782 times (46.9%), followed by the tag 'verb', which appeared in 11.44% of the text. The sequence of two nouns, the bi-tag 'noun noun', appeared on 11,035 occasions (21.76%), followed by the bi-tag 'prep noun', which was used 4,255 times in the BAAC. The sequence of three nouns occurred 5,133 times in the text, which represents 10.12% of the text, followed by the tri-tag 'noun prep noun', which occurred in 4.18% of the BAAC.
To further analyse the n-tag results of the BAAC, Table 9 shows the tag, bi-tag and tri-tag statistics of the News sub-corpus from a different corpus, the Khaleej corpus [31], which was also tagged using the Madamira tagger for comparison purposes. The sub-corpus contains 967K terms gathered from news websites. The table shows that both corpora, the News and the BAAC, share the same most frequent tag, bi-tag and tri-tag sequence, where the tag 'noun' in the News sub-corpus represents 50.2% of the text, the bi-tag 'noun noun' was used 243,525 times (25.2%) and the tri-tag 'noun noun noun' appeared in 0.13% of the text. These results confirm that the tag statistics are comparable between the different corpora.
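The tag, bi-tag and tri-tag statistics discussed above amount to counting n-grams over the tag sequence. The Python sketch below shows one way to do this; it assumes that sequences are counted within sentences and do not cross sentence boundaries, which the paper does not specify, and the function name `ntag_counts` and the toy sentences are hypothetical.

```python
from collections import Counter


def ntag_counts(sentences, n):
    """Count tag n-grams (n=1 tags, n=2 bi-tags, n=3 tri-tags).

    `sentences` is a list of tag lists, one list per sentence.
    """
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


tagged = [
    ["noun", "noun", "noun", "verb"],
    ["prep", "noun", "noun"],
]
bi_tags = ntag_counts(tagged, 2)
print(len(bi_tags))            # number of distinct bi-tags
print(bi_tags.most_common(1))  # most frequent bi-tag with its count
```

Sorting the counts in descending order and plotting rank against frequency on log scales gives the kind of curve described for Figure 5.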

VI. EXPERIMENTS AND FUTURE WORK
We have utilised the BAAC corpus in two applications: to evaluate the performance of the Madamira tagger, and to further investigate the tag-based text compression models as applied by Alkhazi and Teahan [13]. Using the BAAC corpus, the Madamira tagger achieved an accuracy of 93.1%. To evaluate the effect of manual annotation on tag-based text compression, the two versions of the BAAC were compressed using tag-based text compression models. The results of the compression were then compared to the compressed results of the original Madamira auto-tagged corpus. Table 10 lists the compressed size (in bytes) and ratio (in bits per character) of all three files, and the results confirm that (1) manual annotation of the text reduces the quality of tag-based compression, as mentioned by Teahan and Alkhazi [13], [78]-[82], and (2) compressing the text using other text compression algorithms outperforms the tag-based text compression when compressing small text files, such as the BAAC corpus, as mentioned by Alkhazi and Teahan [13].
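The two quantities reported in this section can be computed as in the Python sketch below. It assumes that tagger accuracy is the fraction of tokens whose automatic tag matches the gold-standard tag, and that the compression ratio in bits per character is the compressed size in bits divided by the number of characters in the original text. The function names and the example numbers are hypothetical and are not taken from Table 10.

```python
def tagging_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold-standard tag."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def bits_per_character(compressed_bytes, num_characters):
    """Compression ratio in bits per character (lower is better)."""
    if num_characters <= 0:
        raise ValueError("the original text must contain at least one character")
    return 8 * compressed_bytes / num_characters


auto_tags = ["noun", "verb", "adj", "noun"]
gold_tags = ["noun", "verb", "noun", "noun"]
print(tagging_accuracy(auto_tags, gold_tags))
print(bits_per_character(compressed_bytes=250, num_characters=1000))
```

Note that the character count, not the byte count, is used in the denominator; for Arabic text stored in UTF-8 the two differ, since most Arabic letters occupy two bytes.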
Further investigation is required to study the effect of using POS tagging systems, such as the OpenNLP project [83], trained using the BAAC on the tag-based text compression. Future work will add more annotated MSA text and will expand to cover CA text. More linguistic information, such as the structural annotation, will also be added to the BAAC to increase the possible NLP applications of the corpus.

VII. CONCLUSION
A new corpus, BAAC, was presented in this paper. It is an MSA corpus that contains 50K words manually annotated with part-of-speech tags. The annotated corpus used the same tagset utilised by the Madamira tagger and followed the annotation guidelines proposed by Maamouri for annotating the POS tags. Also, a new annotation tool was developed and employed for the annotation process of the BAAC, which obtained a Kappa value of 0.956 and an average observed agreement of 94.25%. The BAAC was used to evaluate the Madamira tagger and to study the effect of manual annotation on the performance of tag-based Arabic text compression.

ACKNOWLEDGEMENT
Many thanks to our sponsor, Tabuk Public Library, for providing the facilities required by this research, and to its head, Mr Khalaf Alonazy, for his support. I would also like to show my appreciation to Ms Khawla S Alkhazi for the time and effort she spent contributing to this research.

TABLE I. THE MOST UNIVERSALLY USED LANGUAGES

TABLE V. THE TEN MOST FREQUENT TAGS BY THE FIRST ANNOTATOR

TABLE VII. MOST FREQUENT TAG, BI-TAG AND TRI-TAG SEQUENCES FROM THE BAAC

TABLE VIII. THE TEN MOST FREQUENT TAGS BY THE SECOND ANNOTATOR

TABLE IX. MOST FREQUENT TAG, BI-TAG AND TRI-TAG SEQUENCES OF THE KHALEEJ SUB-CORPUS 'NEWS'