A Proposed Adaptive Scheme for Arabic Part-of-Speech Tagging

This paper presents an Arabic-compliant part-of-speech (POS) tagging scheme based on atomic tag markers that are grouped together using brackets. The scheme promotes the speedy production of annotations while preserving their richness. It comprises two main elements: a new tokenization approach and a custom tool that enables the semi-automatic implementation of the scheme. The proposed model can serve in many scenarios where the user needs better Arabic support and more control over the part-of-speech tagging process. The scheme was used to annotate sample narratives, and it demonstrated capability and adaptability while addressing the various distinguishing features of the Arabic language, including its unique declension system. It also sets new baselines that are open to further exploration by future efforts.

Keywords: Arabic natural language processing (ANLP); part-of-speech (POS) tagging; part-of-speech tokenization scheme; morpho-syntactic tagging; Arabic declension system


I. INTRODUCTION
Part-of-Speech (POS) tagging is the process of classifying and labeling words in a sentence according to their grammatical categories, e.g., verbs, nouns and particles [1]. It is considered an important step in many Natural Language Processing (NLP) implementations [2], as it delivers a layer of abstraction over the vast variance in the lexical, syntactic and semantic content of natural language. This generalization renders that vast amount of knowledge into controllable artifacts that are valuable for many related implementations.
In contrast to other languages, Arabic has several distinguishing and challenging features, most importantly its rich morphology and highly inflectional nature. A single Arabic word can bear more meaning than its English counterparts [3]. Consequently, information is often either lost or misrepresented under conventional Part-of-Speech tagging schemes. Moreover, there is a noticeable shortage of standards for Arabic Part-of-Speech tagging schemes, whether for the tagsets used or for the tokenization process [4], [5].
To assist in mitigating some of these challenges, we propose a new Part-of-Speech tagging scheme that can provide rich annotations while being simpler and less demanding than the detailed parsing of corpora, which is cumbersome and time consuming [6]. The scheme we propose is based on tagsets of atomic tag markers that can be aggregated and grouped together using brackets. With such arrangements, users are provided with fundamental baselines that enable them to seamlessly commence a rich morpho-syntactic annotation process for Arabic text.
The contributions of this work include the definition of a morpho-syntactic tagging scheme that is compliant with the Arabic declension system (نظام الإعراب) and that promotes simplicity, clarity and agility of the produced annotations as well as of the tagging process itself. Further, to the best of our knowledge, this is one of the few studies that survey Arabic Part-of-Speech tagging schemes and discuss their pros and cons. This important subject needs further investigation due to the unique linguistic features of the Arabic language, while most related work concentrates on establishing rule-based or statistically motivated Part-of-Speech taggers and morphology analyzers.
This paper is structured as follows. In Section II, we present a brief introduction to the distinguishing features of the Arabic language. In Section III, we discuss the related previous work. Section IV presents some of the challenges related to the conventional Arabic Part-of-Speech tagging schemes. In Section V, the proposed tagging scheme is presented in more detail. Section VI presents the custom annotation tool. Section VII presents a sample narrative annotated using the proposed scheme. Finally, Section VIII presents the conclusion and the suggested future work.

II. ARABIC DISTINGUISHING FEATURES
Arabic is a Semitic language spoken by over 300 million speakers in 22 Arab countries. It has liturgical importance as it is the language of the Quran, the holy book for over 1.2 billion Muslims around the world [7].
In contrast to many other languages, e.g., Indo-European languages, Arabic has many distinguishing features. These features are related to its rich morphology, highly inflectional nature, subject dropping, free word order, omission of short vowels, large lexicon and vocabulary, and many others [8], [9]. Accordingly, it is quite often challenging to identify the correct Part-of-Speech of a given word in a certain context.
The rich morphology of Arabic can be related to its templatic nature, where new words are derived from roots by applying a set of fixed patterns. In addition, Arabic has a concatenative nature where words (nouns and verbs) are inflected to indicate different senses. For example, Arabic nouns can be inflected to indicate number (singular, dual, plural), gender (masculine, feminine), definiteness (definite, indefinite) and case (nominative, accusative, genitive) as well as possession. Similarly, Arabic verbs are inflected to indicate aspect (perfective, imperfective, imperative), voice (active, passive), tense (past, present, future), mood (indicative, subjunctive, jussive), subject (person, number, gender) as well as object clitics. In addition, Arabic words can be prefixed with functional morphemes (single particles or prepositions) to indicate various senses (causality, conjunction, assertion, inquiry, association, etc.).
www.ijacsa.thesai.org
To demonstrate the richness of the Arabic language and the amount and diversity of information that can be conveyed in a single word, we consider the surface word (wa sa nokhberu hum, وسنخبرهم, and we shall inform them) as an example. This single word comprises the following constituents:
• The proclitic morpheme (wa, و, and), which indicates a coordinating conjunction.
• The proclitic morpheme (sa, س, shall), which indicates a future event.
• The inflection particle (nun, ن), which indicates a first person plural subject (we).
• The stem (khabara, خبر, tell), which is the verb itself.
• Finally, the enclitic morpheme (hum, هم, them), which is an attached pronoun that indicates a plural object of the verb.
In [10], the author provides a more detailed discussion of Arabic morphology and its distinguishing features. Nevertheless, annotating the previous sample word with only a verb marker (VB) according to its grammatical category would discard much of this information. Therefore, a viable Arabic part-of-speech tagging scheme has to possess the capacity to address the distinguishing features of Arabic and to accurately classify Arabic words without losing information or creating ambiguities. To do so, the scheme has to fully support the Arabic declension system (نظام الإعراب). In the next section, we present a brief discussion of the related previous work and highlight its main challenges.
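The decomposition of the sample word can be sketched as a small data structure. This is only an illustration of how much information a single category marker discards; the class name, field names and the fallback marker "UNK" are assumptions introduced here and are not part of the proposed scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    """One constituent of a surface word (hypothetical helper type)."""
    translit: str  # transliteration as given in the text
    role: str      # proclitic, inflection particle, stem or enclitic
    meaning: str   # what the constituent contributes to the word


# Constituents of "wa sa nokhberu hum" (and we shall inform them), in surface order.
SAMPLE_WORD = [
    Morpheme("wa", "proclitic", "coordinating conjunction (and)"),
    Morpheme("sa", "proclitic", "future event (shall)"),
    Morpheme("nun", "inflection particle", "first person plural subject"),
    Morpheme("khabara", "stem", "the verb itself (tell)"),
    Morpheme("hum", "enclitic", "plural object pronoun (them)"),
]


def coarse_tag(morphemes):
    """Collapse the whole word to one category marker, as a flat tagger would."""
    # "UNK" is a placeholder for words without a stem; it is not a marker from the paper.
    return "VB" if any(m.role == "stem" for m in morphemes) else "UNK"


def information_kept(morphemes):
    """Meanings that survive when only the stem's category is recorded."""
    return [m.meaning for m in morphemes if m.role == "stem"]
```

With a single (VB) marker, only one of the five constituents listed above is reflected in the annotation.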

III. RELATED WORK
A limited number of part-of-speech taggers have been presented for the Arabic language [11]. Generally, these automated taggers can be classified under three main schemes: statistical schemes, rule-based schemes and hybrid ones [2]. More importantly, in reviewing the previous related work, we noticed an overlap between part-of-speech tagging and morphological analysis. For example, the Stanford NLP toolkit uses the reduced Penn tagset, while others, like the Buckwalter AraMorph, incorporate the syntactic category of a given word within the generated morphological analysis results.
Nevertheless, in this work, we are interested in the part-of-speech annotation scheme and format implemented by each of these tools. We start our listing with an early effort presented by [12], who introduced a hybrid algorithm for Arabic part-of-speech tagging. That algorithm used a custom tagset comprising (130) fixed morpho-syntactic markers that were defined based on Arabic grammar rules. Each marker identifies the grammatical category and the inflections of a given word. For example, a perfect verb in the second person masculine plural form is annotated using the (VPPl2M) marker, and a singular masculine accusative definite adjective is annotated using the (NACSgMAD) marker.
An interesting tagging scheme was presented in the Arabic Treebank (ATB) project [13]. That tagging scheme was based on the well-known rule-based Buckwalter Arabic Morphology Analyzer (BAMA) [14]. (BAMA) uses around (70) basic tag markers that can be combined to form a larger number of composite tags. For example, in (BAMA), the (IV_PASS) marker indicates an imperfective passive verb; three types of information are aggregated in that composite tag, i.e., imperfective, passive and verb. (BAMA) includes tags for indicating person, voice, mood and aspect for verbs, and gender and number for their subjects. It also includes gender, number, case and state for different types of nominals [5].
Another important tagging scheme was introduced by the Prague Arabic Dependency Treebank (PADT) project presented in [15]. In that work, a multi-level annotation scheme for a selected corpus was implemented. The first level of annotation involved the morphological analysis of Arabic words. For that part, a morphology-compliant tagset was used to construct a (15)-slot structure covering the various morphological aspects of a given word, e.g., gender, number, person, aspect, etc. In (PADT), a single character represents each morphological feature. A challenge in the (PADT) tagging scheme is that the meaning of the same character might differ according to a specific internal structuring procedure. For example, the letter (P) in the second position is to be read as Passive Participle if it is preceded by (N, Noun), and as Perfect if it is preceded by (V, Verb). This arrangement requires specialized skills and knowledge to be able to use and interpret the (PADT) tagging scheme [16].
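The context dependence described above can be illustrated with a minimal lookup. The sketch covers only the two readings of (P) mentioned in the text; the two-character strings are truncated illustrations, not complete 15-slot (PADT) tags, and the function name is an assumption.

```python
# Reading of the second-position character, keyed by (first character, second character).
# Only the two cases named in the text are listed.
SECOND_POSITION_MEANING = {
    ("N", "P"): "passive participle",
    ("V", "P"): "perfect",
}


def read_second_position(tag: str) -> str:
    """Interpret the second character of a positional tag in the context of the first."""
    if len(tag) < 2:
        raise ValueError("a positional tag needs at least two characters")
    return SECOND_POSITION_MEANING.get((tag[0], tag[1]), "unknown")
```

The same character yields different readings, so a reader (or a program) cannot interpret a slot without first inspecting the slots before it.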
Similarly, the CATiB project [17] presented an Arabic treebanking scheme designed to provide rich annotations while being simpler than other similar efforts, e.g., ATB and PADT. The focus of CATiB was primarily on the speedy production of the manually annotated corpus, while the guiding principle was not to duplicate information that could be extracted or indicated by other means, i.e., by syntactic analysis. Consequently, CATiB introduced a succinct (POS) tagset comprising (6) POS tags: NOM for nominals, PROP for proper nouns, VRB for active-voice verbs, VRB-PASS for passive-voice verbs, PRT for particles and PNX for punctuation. Other markers were identified for the deeper level of syntactically motivated annotations.
In [18], the authors presented a function-based (POS) tagset where words are tokenized and (POS) tagged based on their grammatical functions rather than their morpho-syntactic structure. For example, in the sentence (زمانها خلصت المسيرة, the march must have finished), the word زمانها is labeled as a modal (MD), although the direct (POS) tag for the Arabic word (زمان, time) is (NN, noun).
A relatively recent effort was introduced by [11], who presented a systematic scheme for establishing Arabic-compliant tagsets. In that work, a three-level categorization of Arabic morpho-syntactic tagsets was defined. The first level comprised 7 tags, the inner level included 23 tags, while the lowest level included 54 tags. Accordingly, users of the system can choose the depth of tagging that best addresses their needs.
Finally, [2] and [5] presented interesting reviews of Arabic part-of-speech taggers and tagsets, where the former concentrated on tagsets while the latter presented a listing of the most prominent taggers along with a discussion of their challenges and limitations.

IV. CHALLENGES RELATED TO THE EXISTING ARABIC (POS) SCHEMES
The review presented in the previous section revealed several challenges and limitations related to the existing tagging schemes. To further assess these schemes, we examined a number of the accessible taggers and morphology analyzers, which included the Stanford NLP toolkit [19], the NLTK toolkit [20], the AL-Khalil morphology analyzer [3], the BAMA morphology analyzer [14] as well as the MADAMIRA [21] and SAFAR [22] platforms. Table I presents a listing of the results that were captured while examining these tools over a sample sentence. Analyzing the results from a linguistic perspective, we arrived at the following list of observations:
a) There is no standardized or community-adopted (POS) scheme for the Arabic language. Our examination revealed that different (POS) tagsets were used by different (POS) taggers; some of these tagsets were generalized while others were more detailed to better address the distinguishing features of Arabic. This observation was also noted by [4]. Similarly, the tokenization scheme of the tag markers is different in each tool.
b) The accuracy of the examined (POS) taggers was questionable. For example, Stanford NLP produced numerous tagging errors, such as the noun (كرته, his ball), which was annotated as (NNP), i.e., a proper noun. Similarly, for a different sample sentence, MADAMIRA identified the word (شعر, felt) as a noun (poetry) rather than a verb; also, the verb (حضر, came) was identified as a verb inflected for third person singular masculine, while the correct interpretation according to the context was third person plural masculine. Likewise, the verb (خدعتكم, I deceived you) was identified as a verb inflected with a third person singular feminine subject, while it was masculine according to the context. Moreover, the library failed to analyze some words, e.g., (مسرعين, in a hurry), which were tagged as NO-ANALYSIS.
c) Some of the examined tools were not suitable for automated (POS) processing as they generate all the possible interpretations for a given word. This was observed in the BAMA and AL-Khalil morphology analyzers. Moreover, AL-Khalil does not employ any (POS) tokenization scheme; rather, it generates all its results in plain Arabic text according to the Arabic declension system, a feature that makes it unsuitable for integration.
d) Some of the investigated tools, e.g., SAFAR, were collections of other tools that were aggregated and compiled under a single platform. These tools were not stand-alone products by themselves and they did not introduce any original additions in terms of Part-of-Speech tagging functionality.
e) In many situations, words were tagged with overly generalized tag markers, so useful information was lost. This can be seen in the Stanford (POS) tagger, which employs the English Penn Treebank tagset for annotating Arabic words. That tagset lacks Arabic morphology features. Similarly, useful information is lost as the examined tools are not fully compliant with the Arabic declension system (نظام الإعراب). For example, gender information for proper nouns, some adjectives and nouns was not included. Likewise, functional characters have an important role in the Arabic language, yet the functional specificity of some Arabic particles, such as the conditional (إذا, if), was neglected.
f) The number of basic tag markers and the number of their possible combinations can grow large enough to complicate the tagging process. In [23], the authors identified over (2000) markers for Arabic, while the combinations of these markers can theoretically reach (33000) different tag combinations [24].
g) Overlaps and duplications can be seen in some of the existing tagging schemes. Such overlaps can complicate string-based matching over the Part-of-Speech strings. For example, in the Penn Treebank tag markers presented below, the concept of feminine gender is represented using the single character (F), yet this same character appears as part of the (PVSUFF) marker in the same string.

VERB_PASSIVE+PVSUFF_SUBJ:3FS
VERB_PASSIVE+PVSUFF_SUBJ:3FS
VERB_PASSIVE+PVSUFF_SUBJ:3MP
The same remark can be observed for the singular number marker (S) and the plural (P) as they overlap with characters in the word (PASSIVE).
h) In addition, the same concept might be represented using different markers within the same scheme. For example, the tag markers presented below demonstrate how the singular number was represented using (SG) in the first sample and using (S) in the second.

ADJ+NSUFF_FEM_SG
IV3FS+VERB_IMPERFECT
The same is true for the feminine gender markers, i.e., (F) and (FEM). Such inconsistency can create confusion during the use of the markers and weakens the scheme's standardization potential.
i) Generating morpho-syntactic tagging requires a full tokenization of sentences prior to the tagging process. Such a requirement can be cumbersome and time consuming, and it would be useful to develop a simpler scheme that replaces the explicit tokenization with an implicit one, as the missing information can be recovered algorithmically.
j) Considering the previously discussed challenges and limitations, manual intervention is often required to fine-tune the automatically generated annotations. This intervention is required to verify and/or extend the generated annotations and to validate their accuracy and adequacy for further stages of processing. This brings us to another challenge: the scarcity of available and accessible annotation tools that can enable and facilitate such manual intervention.
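The overlap described in item g) can be reproduced with a few lines of string matching. This is a minimal sketch: the regular expression assumes that the trailing feature block has the layout person digit, gender letter, number letter, as read off the sample strings, and the letter (D) for dual is an assumption.

```python
import re

SAMPLE_TAGS = [
    "VERB_PASSIVE+PVSUFF_SUBJ:3FS",
    "VERB_PASSIVE+PVSUFF_SUBJ:3MP",
]

# Assumed layout of the trailing feature block: ":<person><gender><number>".
FEATURES_RE = re.compile(r":(\d)([MF])([SDP])$")


def naive_is_feminine(tag: str) -> bool:
    """Plain substring search: also hits the F characters inside PVSUFF."""
    return "F" in tag


def feature_is_feminine(tag: str) -> bool:
    """Inspect only the gender letter of the trailing feature block."""
    match = FEATURES_RE.search(tag)
    return bool(match) and match.group(2) == "F"
```

The substring search reports the masculine plural sample as feminine, because (PVSUFF) contains the same character; only a matcher that knows the internal layout of the tag avoids the false positive.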
In the next section, we present our proposed (POS) tagging scheme, which might assist in addressing some of the aforementioned challenges as well as setting new perspectives for further exploration in the future.

V. THE PROPOSED TAGGING SCHEME
In this section, we present the proposed part-of-speech tagging scheme including its objectives, design principles, the initial tagset, the tagging process as well as the custom tool that was prepared to enable this scheme.

A. Objectives and Design Principles of the Proposed Scheme
The main objective of the proposed tagging scheme is to provide users with initial baselines that enable them to implement a rich, morpho-syntactic, declension-system-compliant annotation for Arabic words in a clear, simple and agile manner. Using this scheme, users can experiment with different tag markers that are more compliant with the Arabic language, and can examine their influence on different Natural Language Processing (NLP) applications, e.g., Information Extraction, Text Translation, Text Summarization, etc.
The clarity, simplicity and agility of the proposed scheme were established by allowing users to commence the annotation process without the need for the explicit tokenization of words. Rather, the tokenization is achieved using different brackets, as presented later. This arrangement was inspired by the tagging scheme presented in [17]. In that work, the speedy production of annotations was enabled by eliminating the annotation of information that could be extracted by other means. For instance, case markers for nominals could be identified from syntax; therefore, the Part-of-Speech annotation scheme presented in [17] did not include such markers in its tagset.
The morpho-syntactic richness of the annotations is enabled by supporting different categories of tag markers that are compliant with the Arabic declension system; these include lexical categories of words, morphology-related markers, functional markers as well as declension-system-specific ones. To enable the aforementioned objectives, the proposed scheme was based on the following design principles:
a) All the tag markers defined in the scheme are standalone and atomic. Each marker is self-explanatory and self-contained and clearly defines a single concept, e.g., gender, number, case, mood, etc. This design principle promotes the clarity of markers and ensures that no duplication or overlap between markers can occur. For example, if a marker indicates a certain concept, e.g., FEM for feminine gender, this same marker is used for all word categories that might be inflected to indicate gender, i.e., nouns, verbs, adjectives, pronouns, relative pronouns, etc. No other marker is used for the same concept regardless of the word category. Therefore, the challenges stated in items g) and h) of Section IV cannot occur.
b) Composite markers are established as aggregates of the basic, atomic ones. For instance, a plural noun is represented using the (NN) marker and the (PLR) marker, not with a single marker, e.g., (NNS), for both concepts. This design principle preserves clarity and allows extensibility through the clear composition of markers; it also facilitates string-based matching operations over part-of-speech annotations.
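The two design principles can be sketched in a few lines. The markers (NN), (PLR) and (FEM) come from the text; the helper names `compose` and `has_marker` are assumptions introduced for illustration, and the plus sign is used as the separator between atomic markers.

```python
def compose(*markers: str) -> str:
    """Join atomic markers into a composite annotation, e.g. NN+PLR."""
    for marker in markers:
        # An atomic marker must be non-empty and must not contain the separator.
        if not marker or "+" in marker:
            raise ValueError(f"not an atomic marker: {marker!r}")
    return "+".join(markers)


def has_marker(annotation: str, marker: str) -> bool:
    """Exact match on whole atomic markers, never on substrings."""
    return marker in annotation.split("+")
```

Because every concept has exactly one marker and markers are matched as whole units, a query such as `has_marker("NN+PLR+FEM", "FEM")` cannot be confused by characters that happen to occur inside another marker.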

B. Initial (POS) Tagset
The definition of a coherent Arabic-compliant tagset is out of the scope of our current work.In [11] and [25], the authors provided interesting guidelines that can assist in defining an Arabic-compliant tagset in a more systematic manner.
Nevertheless, to assess our proposed model, we established an initial tagset to demonstrate the capability of the scheme and the diversity of markers that it can seamlessly support. This initial tagset (presented in Appendix A) classifies the tag markers according to the following categories:

• Lexical markers:
This category includes the basic grammatical classification of words according to Arabic language rules. It covers nominals, verbs and particles, the three main Arabic word types, along with their direct subsets.

• Morphology-related markers:
This category includes the markers that identify affixations and inflections related to nouns and verbs.

• Functional markers:
Functional markers include the tags that indicate the functional role of a given lexical entity. This includes senses of causality, modality, time and space relations, assertion, confirmation, negation, sequencing and conjunction coordination, as well as others.

• Arabic declension system:
This category includes markers that are related to case definitions for Arabic nouns and mood definitions for Arabic verbs, as well as other features that signal specific insights related to the Arabic language, e.g., (Kana and its sisters, كان وأخواتها).

C. The Proposed Tokenization Scheme
A main objective of the proposed model was to better support the Arabic declension system (نظام الإعراب), where users are able to employ a combination of markers that suits their needs and their language proficiency.
Having an extended and diverse tagset, it was important to define an adaptive, dynamic and flexible tokenization scheme that can utilize these diverse markers in a simple, clear and agile manner.
Two types of brackets were employed to establish the proposed tokenization scheme: round brackets or parentheses "( )" and braces or curly brackets "{ }". Using these brackets, different levels of grouping and hierarchy can be established to annotate different word categories. The parentheses are used to establish word-level groupings while the curly brackets are used to create token-level annotations. This arrangement combines concepts from conventional Part-of-Speech tagging, morphology analysis and syntactic tree parsing, as a single Arabic word can encompass a multi-token phrase according to its morphology.
To demonstrate the proposed bracketing scheme, we consider the sample surface word that was presented in Section II (wa sa nokhberu hum, وسنخبرهم, and we shall inform them). Using the proposed scheme, this single word is annotated as follows:
• {RP+WA+CC}: The proclitic morpheme (wa, و, and), which indicates a coordinating conjunction particle.
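A bracketed annotation of this kind can be read back with a short parser. The sketch below assumes the layout described in the text: parentheses around the whole word, one pair of curly brackets per token, and "+" between atomic markers. The function name and the error messages are assumptions, and the example string in the usage note is deliberately incomplete and hypothetical beyond its first token.

```python
import re

# One token-level group: curly brackets around a non-empty run of characters.
TOKEN_RE = re.compile(r"\{([^{}]+)\}")


def parse_word(annotation: str) -> list[list[str]]:
    """Split a word-level annotation into tokens, each a list of atomic markers."""
    text = annotation.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("a word-level annotation must be wrapped in parentheses")
    inner = text[1:-1]
    # Anything left outside the curly brackets, other than whitespace, is malformed.
    if TOKEN_RE.sub("", inner).strip():
        raise ValueError("unexpected text outside curly brackets")
    return [token.split("+") for token in TOKEN_RE.findall(inner)]
```

For instance, `parse_word("({RP+WA+CC}{VB})")` returns one list of markers per token; here `{VB}` merely stands in for the remaining constituents of the sample word and is not the full annotation from the paper.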

D. Advantages of the Proposed Scheme
To demonstrate the advantages of the proposed tagging scheme over other available schemes, we annotated several sample words using the Stanford (POS) tagger, the MADAMIRA morphology analyzer and the proposed scheme.
The Stanford tagger produces basic syntax-based tag markers for Arabic, while MADAMIRA provides a more extended set of markers that includes syntactic word classifications as well as morphology-analysis-related ones. Table 2 presents a listing of the gathered results.
As demonstrated in the table, the proposed scheme can deliver the same set of capabilities that are provided by the other models, but it has the following additional advantages:
• The format of the proposed tagging scheme falls between the brief Stanford format and the extended format of MADAMIRA. Nevertheless, the proposed scheme provides all the information that is delivered by those two schemes in a simplified manner that includes the syntactic word type classification as well as the morphology-related ones.
• The use of brackets replaces the explicit tokenization of composite words. As demonstrated in the first sample, the composite word comprises two parts, the perfect verb and the attached pronoun. Curly brackets surround each of these two word parts and parentheses surround the whole string. In the other schemes, by contrast, the aggregation is achieved by attaching characters together without any separators, or by using separators such as the underscore "_", the plus sign "+" and the colon ":", as well as other approaches, e.g., PV+PVSUFF_SUBJ:3MS.
• The proposed scheme does not use single-character markers, as they can create ambiguities and overlaps. Rather, multi-character atomic tag markers are used to establish a self-explanatory set of annotations.
• Also, unlike [12], [14], [16], [17], no aggregate markers are used in the proposed scheme; rather, all aggregations are established using the plus sign "+", which is inserted between the atomic markers. In the previous efforts, by contrast, different approaches were employed to achieve the same objective, with tokenization, part-of-speech tagging and morphology analysis all combined, causing overlap and ambiguity. Reference [26] presents an interesting listing of the tokenization alternatives used by a number of different schemes.
• Finally, the proposed scheme enables the introduction of different categories and types of tagsets and tag markers, whether they are basic syntactic and grammatical markers, functional markers, morphology-related and semantic markers or any other type that might be needed for a specific objective. Extensibility combined with clarity and simplicity is a powerful feature that maximizes the benefits of the proposed scheme. This can be observed in many samples in the previous table, where explicit markers are used for different Arabic linguistic features, e.g., active participle, passive participle, KANA and its sisters, etc. Using such explicit markers can facilitate later efforts such as information extraction, since explicit markers can signal the existence of specific types of information.

VI. THE CUSTOM ANNOTATION TOOL
To enable the proposed scheme, a Java-based custom tool was prepared. We refer to this custom tool as the Bracket Based Arabic Annotation (B2A2) tool, as it employs brackets to establish morpho-syntactic compliant part-of-speech annotations for the Arabic language. Fig. 1 presents a screenshot of the (B2A2) tool that demonstrates the tagging hierarchies (left) and the available tag markers (right). To commence a new tagging process, a newline-terminated text file is uploaded into the tool, where it is initially annotated by bootstrapping with the Stanford (POS) tagger. Later, the user uses the custom tool to review the initial annotations and modify or extend them accordingly. As demonstrated in the figure, the tool is delivered with an initial tagset where markers are classified into a number of categories, e.g., base or lexical tags, functional tags, Arabic-specific tags, etc. These tags and tagsets can be easily modified and configured by users, who can introduce new tagsets or tag markers or modify the existing ones according to their needs. Modifications to these markers are introduced into the designated (tag_def) table of a SQL Server database. The structure of the tag definition table is described in Table 3. The user can modify the markers themselves as well as their categorization. The custom tool dynamically incorporates any modifications to the markers or their categories during its initialization process. This dynamic definition and utilization of markers allows users to use different formats for annotating the same word.
The variance in annotations is related to the defined tag markers, the required depth of coverage and richness of the annotation process, as well as the user's linguistic proficiency.
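The idea of a tagset that is read from a definition table at start-up can be sketched as follows. Everything here is an assumption made for illustration: the text does not give the columns of the (tag_def) table, so `marker` and `category` are hypothetical column names, the category labels are illustrative, and an in-memory SQLite database stands in for the SQL Server database used by the actual Java tool.

```python
import sqlite3
from collections import defaultdict


def load_tagset(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Group the markers found in tag_def by category (hypothetical columns)."""
    tagset = defaultdict(list)
    rows = conn.execute(
        "SELECT marker, category FROM tag_def ORDER BY category, marker"
    )
    for marker, category in rows:
        tagset[category].append(marker)
    return dict(tagset)


# Stand-in database with a few markers that appear in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tag_def (marker TEXT PRIMARY KEY, category TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO tag_def (marker, category) VALUES (?, ?)",
    [
        ("NN", "lexical"),
        ("VBD", "lexical"),
        ("FEM", "morphology"),
        ("PLR", "morphology"),
        ("KANA", "declension"),
    ],
)
tagset = load_tagset(conn)
```

A marker added to the table is picked up the next time `load_tagset` runs, which mirrors the behavior described above where the tool incorporates modifications during its initialization.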

VII. ANNOTATING A SAMPLE NARRATIVE
To assess the proposed scheme in action, we used the (B2A2) tool to annotate a sample narrative comprising a few sentences. As discussed in the previous section, the (B2A2) tool provides different alternatives for annotating text in terms of the tag markers that can be used as well as their arrangement and grouping using brackets. In this respect, the following guidelines were defined and enforced during the annotation process:
• Verb annotations were extended with number, gender and person markers.
• Nominal tagging was extended using number and gender markers.
• Prefix particles, prepositions and affixes were separated and grouped using dedicated brackets.
• Arabic (KANA, كان وأخواتها, was) was annotated using a custom tag (VBD+KANA) so that it can be better identified for future purposes.
• Occasionally during the annotation process, the Arabic declension system was used to determine the correct grammatical analysis of some words and phrases so that ambiguous interpretations were resolved.

VIII. CONCLUSION AND FUTURE WORK
This paper presented a proposed scheme for Arabic-compliant part-of-speech tagging (POST). Acknowledging the complexity and richness of the Arabic language, along with the shortage of related standardization efforts and resources, the proposed (POST) scheme presented new perspectives that might assist in enhancing the Arabic part-of-speech tagging process as well as opening doors for new insights to regulate such efforts.
The theme of the proposed model is relatively simple and straightforward, yet powerful and capable of representing different types of information specific to the Arabic language and its declension system. This scheme is based on: 1) using well-defined atomic part-of-speech markers; and 2) grouping these markers using two types of brackets, curly brackets for sub-word-level groupings and parentheses for word-level groupings.
A custom tool that is bootstrapped using the Stanford (POS) tagger enabled the initial version of the proposed (POST) scheme. This tool is freely available online and it can assist users in commencing a rich Part-of-Speech tagging process in a controllable and seamless manner.
As future work, we intend to examine the benefits that can be achieved by using the proposed scheme in information extraction implementations. In addition, we intend to investigate bootstrapping the enabling tool using a morphology-aware part-of-speech tagging library, e.g., MADAMIRA.

Fig. 2 shows a screenshot of the (B2A2) tool, which clarifies how different annotations can be implemented for the same word according to the user's defined annotation guidelines.

Fig. 2. Words and their constituents can be annotated differently according to the user's definitions and requirements.

TABLE I. ANALYSIS RESULTS OF ARABIC (POS) TAGGERS AND MORPHOLOGY ANALYZERS FOR A SAMPLE SENTENCE

TABLE III. TAGS DEFINITION TABLE