A Hybrid POS Tagger for Khasi, an Under Resourced Language

Khasi is an Austro-Asiatic language spoken mainly in the state of Meghalaya, India, and can be considered an under-resourced and under-studied language from the natural language processing perspective. Part-of-speech (POS) tagging, in which a part of speech is assigned automatically to each word in a sentence, is one of the major initial requirements of any natural language processing task. It is therefore natural to begin with the development of a POS tagger for Khasi, and this paper presents the construction of a Hybrid POS tagger for Khasi. The tagger is developed to address the tagging errors of a Khasi Hidden Markov Model (HMM) POS tagger by integrating conditional random fields (CRF). This integration incorporates language features that are otherwise not feasible in an HMM POS tagger. The Hybrid Khasi tagger shows a significant improvement in accuracy and substantially reduces most of the tagging confusion of the HMM POS tagger.
Keywords: Khasi corpus; BIS tagset; Khasi POS tagger; Conditional Random Fields (CRF); Hidden Markov Model (HMM)


I. INTRODUCTION
Part-of-speech (POS) tagging is the process of automatically assigning a part of speech to each word in a sentence. Unlike a morphological analyzer, which gives a detailed analysis of a word (its root, its possible parts of speech, and so on), a POS tagger assigns a single part of speech to the word depending on its context. These parts of speech are drawn from a specific list of POS tags, called a tagset, applicable to the language at hand. The annotated corpus and tagset utilized in this work are described in [1]. The tagset was formulated according to the Bureau of Indian Standards (BIS) guidelines and is referred to as the Khasi BIS tagset. In this paper, the Hidden Markov Model (HMM) approach is used to develop a Khasi POS tagger, and ten-fold cross-validation is carried out to rigorously test its performance. To address the tagging errors of the Khasi HMM tagger, conditional random fields (CRF) are integrated. The CRF approach has shown its capability in resolving issues in various natural language processing tasks [2], [3], [4], and integrating CRF allows the inclusion of sentence features such as capitalization, prefixes (which are prevalent in Khasi), the part-of-speech tag of the previous word, and context words. This leads to a Hybrid POS tagger for Khasi with improved performance; the details are given in the sections below. The background work is discussed in Section II, which briefly introduces Khasi and POS tagging approaches. Section III describes the resources utilized in this work, while the construction of the Khasi HMM POS tagger is described in Section IV. The integration of CRF in developing a Hybrid Khasi POS tagger is presented in Section V, the Hybrid tagger is evaluated in Section VI, and the conclusion of the paper is given in Section VII.

A. A Brief Overview of the Khasi Language
Khasi belongs to the Austro-Asiatic language family and is categorized under the Mon-Khmer branch. It is spoken by the Khasi tribe, who mainly inhabit the state of Meghalaya in India. According to the 2011 census of the Government of India, there are about 1.4 million speakers of the language in the state. Khasi is an analytic and isolating language, devoid of inflection, but, as is typical of Mon-Khmer languages, it exhibits simple derivational morphology that contributes to the partial agglutination present in the language [5], [6]. Derivational morphology occurs when affixes attach themselves to a word base; in Khasi these affixes are easily distinguished within any given word. Another Mon-Khmer characteristic is the subject-verb-object (SVO) word order. Khasi is written in the Latin script with 23 letters, excluding the letters c, f, q, v, x, and z and including the diacritic letters ï and ñ and the digraph ng [1].

B. Part-of-Speech Tagging Approaches
India's rich language diversity is reflected in the presence of five language families, namely Indo-Aryan, Dravidian, Austro-Asiatic, Tibeto-Burmese, and Semito-Hamitic. The reported accuracies of POS taggers for Hindi, a morphologically rich language and one of India's official languages, are 87.55% for a rule-based tagger [7], 93.45% using a small training corpus of 15,562 words aided by an extensive morphological analyzer and a massive lexicon [8], and 93.12% using an HMM on a corpus of 66,900 words [9]. A trend observed across POS taggers for Indian languages is that stochastic taggers have to cope with only small training sets. For Khasi, two HMM POS taggers have been reported, but they were trained and tested on independent data sets and tagsets. The first, trained on a dataset of 86,087 tokens using the Khasi BIS tagset of 33 tags, achieved an accuracy of 95.68% [1]. The second was trained on 7,500 words with a custom-made tagset of 54 tags and reported an accuracy of 76.7% [10]. However, both taggers reported accuracy from only a single run on their respective test data.

[...] [1]. Excluding punctuation, there are 83,312 tokens and 5,465 word types. Text segmentation has been performed on the corpus to clearly identify characters, words, and sentences. Written Khasi is very similar to written English in that it uses the Latin script and marks word boundaries with whitespace. Each sentence in the corpus is written on one line and ends with an end-of-sentence marker: the period (.), the question mark (?), or the exclamation mark (!). Each token in a sentence is separated by whitespace. Punctuation marks are also treated as tokens, with two exceptions: the apostrophe ('), which is part of a contracted word, and the hyphen (-), which is part of a compound word.
The data has been annotated with the Khasi BIS tagset containing 33 tags [1].
Corpus analysis revealed that 10.9% of the word types are multifunctional. The abbreviations used are in accordance with the Leipzig glossing rules except where clearly specified (see footnote 1). Table I shows the most common words, each occurring more than five hundred times in the corpus. These 15 most frequent words account for 34.7% of the word tokens in the corpus. If all words occurring 100 times or more are taken into account, they amount to 55.8% of the tokens in the corpus. However, Table II shows that approximately 47.7% of the word types occur only once in the corpus. These statistics are in line with what Manning and Schütze [11] report about the difficulty of predicting the behavior of words even when a larger corpus is available.
1 http://www.eva.mpg.de/lingua/resources/glossing-rules.php

Another phenomenon related to natural language data is the Zipfian distribution. When the frequencies (f) of the different word types are calculated and the types are ranked (r) in order of occurrence, then according to Zipf's law f is inversely proportional to r; equivalently, there is a constant k such that f * r = k. Drawing this information from the training corpus, the extracted values and their respective calculations are shown in Table III. Based on this extraction, and using logarithmic scales, Fig. 1 plots the rank of each word type on the X-axis against its frequency on the Y-axis. The double-line graph shows the ranks and frequencies of the words in the corpus, and the straight line shows Zipf's predicted values for k = 10000. The corpus appears to follow Zipf's law approximately, except at very low and very high ranks.

Table I also reveals other suspected language phenomena. For instance, pronouns such as ka, u, i, and ki can have other functions, and various researchers [12], [13], [14], [15] have referred to them as articles, pronominal markers, noun gender markers, subject enclitics, and so on. The frequencies indicate that the third-person pronouns ka "singular feminine", i "singular neutral" (183 occurrences as a pronominal marker versus 93 occurrences as a pronoun), and ki "plural" are more likely to function as pronominal markers (tagged as PR_PRP_M) than as personal pronouns (tagged as PR_PRP). However, u "singular masculine" is more likely to be a personal pronoun than a pronominal marker. Since all animate and inanimate objects in Khasi have gender, and given that the Khasi tribe follows a matrilineal system, the corpus analysis indicates that more objects are tagged as feminine than as masculine. The feminine pronominal marker ka is approximately twice as frequent as the masculine pronominal marker u (Table I).
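The Zipf relation f * r = k described above can be checked on any tokenized corpus with a few lines of code. The following is a minimal sketch, not the procedure used to produce Table III: the function name `zipf_table` and the toy token list are illustrative assumptions, and the toy counts are not taken from the Khasi corpus.

```python
from collections import Counter


def zipf_table(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]


# Toy token list (hypothetical counts, NOT from the Khasi corpus).
toy_tokens = ["ka"] * 6 + ["u"] * 3 + ["ki"] * 2 + ["la"]
rows = zipf_table(toy_tokens)
for rank, word, freq, k in rows:
    print(rank, word, freq, k)
```

Under Zipf's law the last column stays roughly constant over the middle ranks; plotting rank against frequency on logarithmic axes, as in Fig. 1, then gives an approximately straight line.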
Finally, Table IV highlights the 10 most common tags in the training corpus, excluding the punctuation tag. The most common tag is the common noun (N_NN). Notably, the fact that Khasi is known to be rich in adverbs is reflected in the corpus: RB is the fifth most common tag out of 33 tags.

As proposed by Brants [16], a second-order Markov model, with additional tags $t_{-1}$, $t_0$, and $t_{n+1}$ as beginning-of-sentence and end-of-sentence indicators, is used for part-of-speech tagging for Khasi, as given in (1). To handle data sparsity in (1), he suggested linear interpolation of unigrams, bigrams, and trigrams. Hence, the probability is recalculated as in (2), where the λs are estimated using deleted interpolation and $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
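For reference, the second-order model and the interpolated trigram probability referred to as (1) and (2) can be written as below. This is a sketch of the standard formulation given by Brants [16], under the assumption that the tagger follows it without modification; $\hat{P}$ denotes maximum likelihood estimates from the training corpus, $w_i$ the words, and $t_i$ the tags.

```latex
% (1) Second-order HMM tagging objective, with boundary tags t_{-1}, t_0, t_{n+1}
\operatorname*{argmax}_{t_1, \dots, t_n}
  \left[ \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \right]
  P(t_{n+1} \mid t_n)
\tag{1}

% (2) Linear interpolation of unigram, bigram, and trigram estimates
P(t_i \mid t_{i-1}, t_{i-2}) =
  \lambda_1 \hat{P}(t_i)
  + \lambda_2 \hat{P}(t_i \mid t_{i-1})
  + \lambda_3 \hat{P}(t_i \mid t_{i-1}, t_{i-2}),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
\tag{2}
```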

A. Integrating Khasi Morphology to Handle Unknown Words
As mentioned in Tham [1], Khasi affixes are easily detectable, especially the prefixes, which play a major role in Khasi derivational morphology. There is a consistent pattern of Khasi words with prefixes such as jing-, nong-, and maw- mapping to common nouns (N_NN), and prefixes such as pyn- and ïa- (excluding the preposition ïa) mapping to verbs (V_VM). To estimate the probability of unknown words having these features, words in the Khasi corpus having the prefixes jing-, nong-, maw-, pyn-, and ïa- (excluding the preposition) are mapped to the pseudowords _JING_, _NONG_, _MAW_, _PYN_, and _IA_ respectively. To handle unknown words that do not have the above-mentioned prefixes, low-frequency words in the training data are mapped to the pseudoword _UNK_. As suggested by Manning and Schütze [11], words occurring only once in the corpus are treated as rare words or out-of-vocabulary items, and hence can be mapped to the pseudoword _UNK_. They state that these words, known as hapax legomena, tend to comprise half of the word types but only a fraction of the tokens in the corpus, and so will not significantly affect the model. The same phenomenon is observed in Khasi, where such words comprise 47.7% of the word types but only about 3% of the tokens in the corpus. Therefore, low frequency is taken to be less than or equal to a selected value of γ, and in this tagger γ = 1. After the mappings are done, the HMM parameters are estimated as described earlier, with the pseudowords _JING_, _NONG_, _MAW_, _PYN_, _IA_, and _UNK_ treated like regular words. This mapping ensures that the probability $P(w_i \mid t_i)$ is never zero.
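The pseudoword mapping described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the tagger: the names `PREFIX_PSEUDOWORDS` and `map_token` are hypothetical, prefixes are detected with a crude string test, every prefixed word is mapped regardless of its frequency (one possible reading of the description), and the toy tokens are for illustration only.

```python
from collections import Counter

# Prefix -> pseudoword pairs, as listed in the text.
PREFIX_PSEUDOWORDS = [
    ("jing", "_JING_"),
    ("nong", "_NONG_"),
    ("maw", "_MAW_"),
    ("pyn", "_PYN_"),
    ("ïa", "_IA_"),
]


def map_token(word, counts, gamma=1):
    """Map a token to a prefix pseudoword, to _UNK_ if rare, or leave it unchanged."""
    lower = word.lower()
    for prefix, pseudo in PREFIX_PSEUDOWORDS:
        # A bare prefix (e.g. the preposition "ïa") is not a prefixed word.
        if lower.startswith(prefix) and len(lower) > len(prefix):
            return pseudo
    # Words seen at most gamma times (including unseen words) become _UNK_.
    if counts[lower] <= gamma:
        return "_UNK_"
    return word


# Toy training tokens (illustrative only).
train = ["ka", "jingbam", "ka", "nongrep", "u", "briew", "ka", "u"]
counts = Counter(w.lower() for w in train)
mapped = [map_token(w, counts) for w in train]
print(mapped)
```

The same function can be applied to test sentences with the training counts, so that an unseen word without a listed prefix falls through to _UNK_ and therefore always has a non-zero emission probability.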

B. Testing Results and Error Analysis of HMM POS Tagger
The corpus, comprising 94,651 tokens, is used for training and testing a baseline tagger, a Natural Language Toolkit (NLTK) tagger [17], and an HMM POS tagger. The mappings described in Section IV-A are incorporated in the baseline tagger and the HMM POS tagger. The baseline tagger employed here assigns to each word in the test data its most probable tag, as put forward by Jurafsky and Martin [18]. NLTK provides unigram, bigram, and trigram taggers, among others. The NLTK tagger used here is a trigram tagger that backs off to a bigram tagger, which backs off to a unigram tagger, which in turn backs off to a Khasi regular expression tagger. The Khasi regular expression tagger incorporates Khasi morphology, tagging words with the prefixes jing-, nong-, and maw- as common nouns (N_NN) and words with the prefixes pyn- and ïa- as verbs (V_VM), and defaults to the most common tag, which is the common noun (N_NN). Hapax legomena not containing the mentioned prefixes are preprocessed and mapped to the pseudoword _UNK_. The results of all three taggers using ten-fold cross-validation are given in Table V, with the HMM POS tagger giving a relatively good performance of 93.39% accuracy.
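The idea of a most-probable-tag tagger that falls back on prefix rules can be sketched in plain Python as below. This is a simplified illustration, not the baseline or NLTK configuration evaluated in Table V: it combines a most-frequent-tag lookup with the regular expression fallback in a single step, and the names `PREFIX_RULES`, `regex_tag`, `train_baseline`, and `tag_sentence`, as well as the toy training sentences, are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Prefix rules mirroring the description in the text.
PREFIX_RULES = [
    (re.compile(r"^(jing|nong|maw).+", re.IGNORECASE), "N_NN"),
    (re.compile(r"^(pyn|ïa).+", re.IGNORECASE), "V_VM"),
]


def regex_tag(word):
    """Tag a word by prefix; default to the most common tag, N_NN."""
    for pattern, tag in PREFIX_RULES:
        if pattern.match(word):
            return tag
    return "N_NN"


def train_baseline(tagged_sentences):
    """Learn the most frequent tag of every word seen in training."""
    by_word = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            by_word[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in by_word.items()}


def tag_sentence(words, model):
    """Use the learned tag when the word is known, else the prefix rules."""
    return [(w, model[w] if w in model else regex_tag(w)) for w in words]


# Toy training data (illustrative only).
train = [
    [("u", "PR_PRP"), ("la", "V_VAUX"), ("kylla", "V_VM")],
    [("u", "PR_PRP_M"), ("briew", "N_NN")],
    [("u", "PR_PRP"), ("la", "V_VAUX"), ("bam", "V_VM")],
]
model = train_baseline(train)
tagged = tag_sentence(["u", "pynbeit", "jingbam", "Sein"], model)
print(tagged)
```

A lookup of this kind ignores context entirely, which is why it serves only as a lower bound against which the trigram and HMM taggers are compared.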
A confusion matrix of the HMM POS tagger, shown in Table VI, is used to analyze the HMM tagging errors; it lists the errors occurring at 0.5% and above (i.e., with an average frequency of 3 and above). The rows in Table VI indicate the correct tags, the columns indicate the HMM tagger's predicted tags, and each cell indicates the percentage of the tagging error. The most common error, which is difficult to disambiguate, occurs when proper nouns are tagged as common nouns and vice versa, accounting for 12% of the errors. Here, the HMM tagger is unable to take the capitalization of proper nouns into consideration. A brief discussion of some of the tagging errors follows.

Verb Noun / Noun Verb confusion
An interesting phenomenon, which collectively accounts for 20% of the errors, is that when pronouns are tagged as pronominal markers, the words following them are in turn tagged as nouns rather than verbs. Likewise, when pronominal markers are tagged as pronouns, the words following them are tagged as verbs rather than nouns. For example, the verb lum "reap" is incorrectly tagged as the noun lum "hill/mountain" when the preceding word u is incorrectly tagged as a pronominal marker rather than a personal pronoun. Apart from this, a noun is also erroneously tagged as a verb when the tagger cannot recognize a verb functioning as a noun.

Adjective Verb / Verb Adjective confusion
Various researchers have put forward their views on the existence of adjectives in Khasi due to their syntactic similarity with verbs [12], [19], [20]. The confusion matrix also indicates that the tagger has tagged some adjectives as verbs, especially when the words follow the subordinating conjunction ba or the auxiliary word la.

When a noun is tagged as an adverb, it is because the noun follows the verb without a preceding pronominal marker. The following is an example where the mandatory pronominal marker is dropped before the noun Sein Iong "black snake".
U/PR_PRP la/V_VAUX kylla/V_VM Sein/N_NN Iong/N_NN ./RD_PUNC
"He turned into a black snake."
U/PR_PRP la/V_VAUX kylla/V_VM Sein/RB * Iong/RB * ./RD_PUNC

Adjective Noun / Noun Adjective confusion
When an adjective is tagged as a noun, it is most likely because the tagger cannot differentiate a compound noun from a noun with a qualifying adjective, since in most cases adjectives follow the nouns they qualify. For example, the adjective badon baem "well-to-do" is tagged as a compound noun. Nouns are erroneously tagged as adjectives in compound nouns, or when an adjective or a verb semantically functions as a noun. For example, u rit u riat "low-class people" is confused with rit and ria, which mean "small" in another sense of the words. Additionally, the preceding word being tagged as a pronoun rather than a pronominal marker adds to the confusion.
The above discussion indicates that one way of addressing the existing confusion is to consider the properties or attributes of words such as capitalization, next occurring word, and others. These considerations are presented in the next section.
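The kind of error breakdown discussed in this section can be derived from paired gold and predicted tag sequences. The sketch below assumes that each confusion is expressed as a share of all tagging errors, which is how the 12% and 20% figures above are read here; the function name `confusion_percentages` is hypothetical, and the input is the Sein Iong example given earlier.

```python
from collections import Counter


def confusion_percentages(gold_tags, predicted_tags):
    """Map each (gold, predicted) error pair to its share of all errors, in percent."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    errors = Counter(
        (g, p) for g, p in zip(gold_tags, predicted_tags) if g != p
    )
    total = sum(errors.values())
    if total == 0:
        return {}
    return {pair: 100.0 * n / total for pair, n in errors.items()}


# Gold tags and erroneous HMM output for "U la kylla Sein Iong ."
gold = ["PR_PRP", "V_VAUX", "V_VM", "N_NN", "N_NN", "RD_PUNC"]
predicted = ["PR_PRP", "V_VAUX", "V_VM", "RB", "RB", "RD_PUNC"]
result = confusion_percentages(gold, predicted)
print(result)
```

Accumulating these counts over all ten folds and keeping only the pairs above a chosen threshold gives a table of the same shape as Table VI.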

V. A HYBRID KHASI POS TAGGER TO ADDRESS TAGGING ERRORS
To reduce the errors present in the HMM POS tagger output, the errors identified in Section IV-B need to be addressed.
To do so, sklearn-crfsuite is used. Sklearn-crfsuite is a thin python-crfsuite wrapper which provides a fast implementation of conditional random fields (CRF). Unlike HMMs, CRFs allow the inclusion of features that are non-independent and of varying granularity, even on the same observation [21]. Using CRF, given a sentence $w = w_1 \dots w_n$, the conditional probability of the tag sequence $t = t_1 \dots t_n$ is given by:

$$P(t \mid w) = \frac{1}{Z(w)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(t_{i-1}, t_i, w, i) \right)$$

where $Z(w) = \sum_{t'} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(t'_{i-1}, t'_i, w, i) \right)$ is the normalization factor, $\lambda_k$ is the weight, and $f_k(t_{i-1}, t_i, w, i)$ is the feature function.

Implementing POS tagging in sklearn-crfsuite makes it possible to include as many word features as needed to aid the tagging process. The word features included for Khasi are capitalization, prefixes of length 2 to 4 (prefixation is prevalent in Khasi, unlike suffixation in English), the current word, the previous word, the next word, and whether a word begins or ends a sentence. An additional feature that can be included is the tag of the previous word. In the interface provided by sklearn-crfsuite, the features are extracted from both the training data and the test data. The training data is expected to be annotated, i.e., to contain the words and their respective POS tags, which enables sklearn-crfsuite to extract all the specified features and learn the tagging process. A problem arises, however, during feature extraction from the test data. In the provided interface, all of the above-mentioned features can be extracted from the test data except the previous tag feature. This feature is not available because the test data contains only sentences whose words are yet to be tagged. To overcome this problem, the output of the Khasi HMM POS tagger is used as input to the Khasi CRF POS tagger, so that the previous tag feature can be extracted from the tagged output of the HMM tagger. Fig. 2 shows the block diagram of the implementation.
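The feature set listed above can be expressed as one dictionary per token, which is the input format accepted by sklearn-crfsuite. The following is a minimal sketch under stated assumptions: the function names `word2features` and `sent2features` and all feature keys are illustrative rather than those used in the actual tagger, and the `tags` argument is assumed to hold gold tags during training and HMM output during tagging.

```python
def word2features(words, tags, i):
    """Build the feature dictionary for the word at position i.

    `tags` supplies the previous-tag feature: gold tags at training time,
    HMM tagger output at tagging time (an assumption of this sketch).
    """
    word = words[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalization
        "BOS": i == 0,                   # word begins the sentence
        "EOS": i == len(words) - 1,      # word ends the sentence
    }
    # Prefixes of length 2 to 4, when the word is long enough.
    for n in (2, 3, 4):
        if len(word) >= n:
            features[f"prefix{n}"] = word[:n].lower()
    if i > 0:
        features["prev.word"] = words[i - 1].lower()
        features["prev.tag"] = tags[i - 1]
    if i < len(words) - 1:
        features["next.word"] = words[i + 1].lower()
    return features


def sent2features(words, tags):
    """Feature dictionaries for every token in a sentence."""
    return [word2features(words, tags, i) for i in range(len(words))]


# Example sentence from Section IV-B with its (erroneous) HMM output.
words = ["U", "la", "kylla", "Sein", "Iong", "."]
hmm_tags = ["PR_PRP", "V_VAUX", "V_VM", "RB", "RB", "RD_PUNC"]
feats = sent2features(words, hmm_tags)
print(feats[3])
```

Lists of such dictionaries, one list per sentence, can then be passed to the `fit` and `predict` methods of `sklearn_crfsuite.CRF`, with the HMM output standing in for the unavailable previous tags of the test sentences.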
The features mentioned above are included in the CRF tagger. The CRF POS tagger is then trained on the same training data used by the HMM POS tagger. To ensure consistency, ten-fold cross-validation is undertaken for training and testing the CRF tagger. During tagging, the word features are extracted from the test data and the previous tag feature is provided by the output of the HMM tagger. In doing so, training on 4k sentences and tagging on 10% of the training sentences, an average tagging accuracy of 95.29% is achieved, an average improvement of 1.9 percentage points over the performance of the HMM tagger shown earlier in Table V.

VI. EVALUATION OF HYBRID POS TAGGER
Table VII shows the average precision, recall, and F-measure of both the HMM POS Tagger and the Hybrid POS Tagger. The F-measure of the Hybrid POS Tagger shows a significant improvement over that of the HMM POS Tagger in 23 tags out of 32 tags, with one tag, RD_UNK (for unknown tags), giving an F-measure of 0 in both taggers. This may be attributed to the fact that RD_UNK occurs exactly once in the corpus. Because the corpus covers only the prose genre, symbols occur only 4 times in it, which may explain why the Hybrid POS Tagger fails to predict the symbol tag, giving an F-measure of 0 for RD_SYM. Fig. 3 shows a graphical comparison of the average confusion frequency among the tags for the HMM tagger and the Hybrid tagger, arranged in descending order of the HMM tagger confusion percentage. Table VIII shows the percentage reduction or increase in confusion in the Hybrid POS tagger relative to the HMM POS tagger. The rows in Table VIII indicate the correct tags, the columns indicate the Hybrid tagger's predicted tags, and each cell indicates the percentage increase or reduction in tagging error from the HMM tagger. The biggest improvement of the Hybrid tagger is the 100% elimination of all the tags confused as echo tags (RD_ECH) shown earlier in Table VI.
Another significant improvement is in its ability to disambiguate proper nouns (N_NNP) from common nouns (N_NN), with an 89% reduction in confusion; CRF classifiers are good at capturing word features such as capitalization. The same is observed in the noun and foreign word confusion (N_NN, RD_RDF), the noun and adverb confusion (N_NN, RB), and the coordinating conjunction and preposition confusion (CC_CCD, IN), all of which show a confusion reduction of over 80%. Overall, it is clear that the Hybrid tagger has reduced most of the confusion mentioned in Section IV-B. Interestingly, even the Hybrid tagger has a problem in disambiguating adjectives from verbs, a language phenomenon debated by researchers [11], [13], [19], [20]. In this category, the confusion has not reduced but increased by 65%. The other two tag pairs that showed a relatively small increase in confusion are the adverb and noun (RB, N_NN) confusion, by 10%, and the interjection and adverb (RP_INJ, RB) confusion, by 8%. The remaining three confused tag pairs showed an increase in confusion of 2% or less. All the most common confused tag pairs that have an average frequency of 3 or more are indicated in Table VIII.
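Per-tag precision, recall, and F-measure of the kind reported in Table VII follow directly from the gold and predicted tag sequences. The sketch below is illustrative only: the function name `per_tag_prf` is hypothetical, scores with a zero denominator are set to 0 by convention, and the input is the single example sentence from Section IV-B rather than the evaluation data.

```python
from collections import Counter


def per_tag_prf(gold_tags, predicted_tags):
    """Return {tag: (precision, recall, f_measure)} over paired tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold_tags, predicted_tags):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = {}
    for tag in set(gold_tags) | set(predicted_tags):
        predicted_n = tp[tag] + fp[tag]
        gold_n = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_n if predicted_n else 0.0
        recall = tp[tag] / gold_n if gold_n else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f_measure)
    return scores


gold = ["PR_PRP", "V_VAUX", "V_VM", "N_NN", "N_NN", "RD_PUNC"]
predicted = ["PR_PRP", "V_VAUX", "V_VM", "RB", "RB", "RD_PUNC"]
scores = per_tag_prf(gold, predicted)
for tag in sorted(scores):
    print(tag, scores[tag])
```

The example also shows how a tag that is never predicted correctly, such as N_NN here, receives an F-measure of 0, which is the situation described above for the very rare tags RD_UNK and RD_SYM.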

VII. CONCLUSION
Although the present annotated 90k corpus available for Khasi is relatively small, experiments with automatic tagging using HMM along with the BIS tagset for Khasi have shown performance that does not lag behind the performance reported for other languages. As shown in this paper, addressing the tagging errors of the HMM POS tagger by coupling it with sklearn-crfsuite, a fast implementation of CRF, has given a Hybrid POS tagger for Khasi with an improved accuracy of 95.29%.
The results are very promising for an under-resourced language such as Khasi. Beyond the performance of the tagger, the development of the Khasi POS tagger has also highlighted issues that have often been raised in the literature on the Khasi language. Does the Hybrid tagger's confusion between verbs and adjectives imply that the confused adjectives are actually attributive verbs? Answering this question requires further investigation. Finally, the next step is to include a wider range of genres in the corpus; with the current POS tagger in place, this should ease the development of a larger annotated corpus.