Tagging Urdu Sentences from English POS Taggers

Abstract: As a global language, English has attracted the majority of researchers and academics working on Natural Language Processing (NLP) applications. Other languages have received far less attention. Part-of-speech (POS) tagging is a necessary component of several NLP applications, but an accurate POS tagger for a particular language is not easy to construct because of the diversity of that language. English POS taggers, by contrast, are well developed and widely used by researchers for NLP. In this paper, the idea of reusing English POS taggers to tag non-English sentences is proposed. As an example, Urdu sentences are tagged with English POS taggers: state-of-the-art taggers were surveyed from the literature, and 11 well-known ones were applied to the Urdu sentences. Google Translator is used to translate the sentences between the two languages. Data extracted from twitter.com is used for evaluation. A confusion matrix with the kappa statistic is used to measure the accuracy of actual vs. predicted tagging. The two English POS taggers that tagged Urdu sentences best were the Stanford POS Tagger and the MBSP POS Tagger, with accuracies of 96.4% and 95.7%, respectively. The system can be generalized to multi-lingual sentence tagging.


I. INTRODUCTION
One of the most fundamental parts of the linguistic pipeline is part-of-speech (POS) tagging. POS tagging is the process of assigning a grammatical tag (noun, verb, adjective, adverb, and so on) to each word in a text. It is a basic form of syntactic analysis with many applications in NLP. Most POS taggers are trained on treebanks from the newswire domain, such as the Wall Street Journal corpus of the Penn Treebank. The Stanford POS Tagger, however, is widely used by researchers because of its support packages for many programming languages and environments, such as Docker, F#/C#/.NET, GATE, Go, JavaScript (node.js), PHP, Python, Ruby, XML-RPC, and Matlab. Therefore, the Stanford POS Tagger is used as the example in this paper. Output from the rest of the POS taggers is not discussed because of page limitations. Challenges arise from tagging out-of-domain data, from the conversational nature of Twitter text, from its lack of traditional orthography, and from the 140-character length limit on each message ("tweet").
The Internet has become a major medium of social interaction and communication. Because much of this communication takes place in English, a rich pool of information is growing at a very fast pace. However, filtering the useful information out of such a massive volume of material is difficult. Most tool development has targeted English-based communication. In the case of POS tagging, a rich literature is available on English POS taggers compared with other languages. Each POS tagger works well inside its domain and within its limitations. Many researchers who are not native English speakers also contribute to the English literature. However, valuable information in languages other than English is equally important. Rather than relying on a large number of researchers to work on non-English text, the idea of reusing English tools, techniques, and methodology is proposed. More specifically, English POS taggers are to be reused for tagging non-English text.
In this research, after an extensive literature review of English POS taggers, the Stanford POS Tagger, written specifically for English sentences, is reused to tag Urdu sentences as an example. The Twitter API is used to extract Urdu sentences (tweets) on a specific topic from Twitter. After a refinement process, a sample of Urdu sentences is randomly selected for further processing. Google Translator is used to translate the sampled Urdu sentences into English so that they can be tagged by the Stanford POS Tagger. Other state-of-the-art English POS taggers were also collected and included in this exercise; their detailed results will be included in the extended version of this study. The translated English sentences were fed into the Stanford POS Tagger to yield tagged English sentences. These tagged English sentences are translated back to their original language with the help of Google Translator. Two human annotators tagged the original sample of Urdu sentences to produce benchmark tagged sentences. The kappa statistic, along with a confusion matrix, is applied to measure the accuracy of each tagger for Urdu tagging.
www.ijacsa.thesai.org
The rest of the paper is structured as follows: Section II presents background knowledge. Section III discusses the methodology of the research. Results and future implications are discussed in Section IV. The conclusion, limitations, and future work form the final sections.

II. BACKGROUND KNOWLEDGE
In this section, extensive background knowledge is presented, as shown in Tables 1(a) and (b). A considerable amount of work has been carried out to date; however, the current research differs in its reuse of benchmark POS taggers and in the generalizability of the idea. State-of-the-art English POS taggers are also covered in this section.

III. METHODOLOGY
This section describes the methodology of the current research. Data on a new topic, the PANAMA CASE, is extracted from Twitter with the help of the Twitter API. The raw data are refined, and ten sample sentences are randomly picked for further processing. Google Translator was used to translate the sampled Urdu sentences into English so that they could be tagged by well-known English POS taggers, which were surveyed extensively from the literature. The translated English sentences were fed into each tagger to yield tagged English sentences. These tagged English sentences were translated back to their original language with the help of Google Translator. Two human annotators tagged the original sample of Urdu sentences to produce benchmark tagged sentences. The kappa statistic, along with a confusion matrix, was applied to measure the accuracy of each tagger for Urdu tagging, and the two best POS taggers for Urdu sentences were identified on that basis. The whole process, from sample selection to accuracy measurement, was repeated three times to obtain the most reliable results. As an example, only the Stanford POS Tagger is considered at this stage, because it outperformed the rest of the POS taggers with a kappa statistic of 96.4%. The detailed results of the other POS taggers can be provided on demand. The research methodology of the current study is shown in Fig. 1.
Twitter is a social networking platform where millions of users communicate each day, exchanging billions of short text messages (tweets) of up to 140 characters. Tweets on specific political issues were collected using the keywords Panama, PMLN, and TTP. While retrieving tweets through the Twitter API, we made sure to keep only unique tweets written in Urdu. A check was also placed in the API query to exclude retweets. Hash functions were used to eliminate duplicate tweets. At the very first stage of the refinement, all non-Urdu content, i.e. URLs, Twitter mentions (@username), and hashtags (#PTI, #PMLN), was filtered out of each tweet, and the cleaned text was used as a key in a HashMap. The original tweet was stored as the value of that key. After running this procedure on all tweets, the number of tweets was reduced by approximately 40%. The remaining tweets can safely be regarded as unique. Every tweet was treated as a new sentence. A random sample of 10 sentences (tweets) was taken for further processing, as shown in Table 2.
The literature describes many different English POS taggers. The Stanford POS Tagger was used at this stage for further processing; all other well-known state-of-the-art POS taggers will be discussed in the extended version of the current study. These taggers can likewise be reused to tag multi-lingual sentences. The overall result of all POS taggers is provided in Fig. 2.
To translate the sampled Urdu sentences into English, an Urdu-to-English translator, namely Google Translator, was used.
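The tweet refinement step described above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `normalize` and `deduplicate` are hypothetical, a Python `dict` stands in for the HashMap, and "Urdu characters" are approximated by the Unicode Arabic block (U+0600 to U+06FF), which is an assumption.

```python
import re

# Patterns for the non-content parts of a tweet named in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Assumption: Urdu text is approximated by the Unicode Arabic block;
# every other non-whitespace character is dropped.
NON_URDU_RE = re.compile(r"[^\u0600-\u06FF\s]")


def normalize(tweet: str) -> str:
    """Strip URLs, mentions, hashtags and non-Urdu characters from a tweet."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_URDU_RE.sub(" ", text)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())


def deduplicate(tweets):
    """Map each cleaned text (key) to the first original tweet (value)."""
    unique = {}
    for tweet in tweets:
        key = normalize(tweet)
        # Skip tweets with no Urdu content and tweets already seen.
        if key and key not in unique:
            unique[key] = tweet
    return unique
```

Because the key is the cleaned text, a retweet or a copy that differs only in its URL, mention, or hashtag collapses onto the same key as the original. A production version would probably also need to preserve characters such as the zero-width non-joiner, which this sketch discards.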
The translated English sentences were fed into the Stanford POS Tagger. The output of this step was a set of tagged English sentences, as shown in Table 3.
Google Translator was then used again to translate the tagged English sentences back into their original language, i.e. Urdu, as shown in Table 4.

IV. RESULTS AND FUTURE IMPLICATIONS
To check the accuracy of the POS tagger under study with respect to the Urdu language, the kappa statistic with a confusion matrix was used. Two annotators manually assigned the best possible tags to the original sampled Urdu data. The kappa statistic with a confusion matrix was then applied to each tag used by the Stanford POS Tagger for Urdu, as shown in Table 5. There were 15 unique tags in total. For each of these fifteen tags, a confusion matrix of the actual tag (the best possible tag) vs. the predicted tag (the tag assigned by the Stanford POS Tagger) was built. Total accuracy and random accuracy were calculated from each confusion matrix, and the kappa statistic was then computed from these two values. The average was obtained by dividing the sum of the individual kappa values by the number of tags. The accuracy of Urdu sentences tagged by reusing the Stanford English POS Tagger was 96.4% on average, which is higher than that of any existing Urdu POS tagger. The random selection of sample sentences was performed three times to reduce sample selection bias.
POS tagging is considered an essential component of several NLP applications. A new POS tagger is not easy to develop for unstructured data, and the diversity of a language affects the accuracy of tagging. In this study, the idea of reusing well-known English POS taggers is applied to tagging non-English sentences. Google Translator is used to translate the sentences between the languages. Data extracted from twitter.com is used for evaluation. A confusion matrix with the kappa statistic is used to measure the accuracy of actual vs. predicted tagging. The results show an accuracy of 96.4% for the Stanford POS Tagger, which is the best among the 11 well-known English POS taggers. The system can be generalized to multi-lingual sentence tagging.
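The per-tag kappa computation described in the results can be illustrated with a short Python sketch. The exact formula used in the study is not reproduced in this text, so the code assumes the standard Cohen's kappa for a two-by-two confusion matrix, kappa = (total accuracy - random accuracy) / (1 - random accuracy), applied to each tag in a one-vs-rest fashion. The function names and the example tag sequences are hypothetical.

```python
def binary_confusion(actual, predicted, tag):
    """Count TP, FP, FN, TN for one tag, treating all other tags as 'rest'."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == tag and p == tag:
            tp += 1
        elif a != tag and p == tag:
            fp += 1
        elif a == tag and p != tag:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def kappa(tp, fp, fn, tn):
    """Cohen's kappa from total (observed) and random (chance) accuracy."""
    total = tp + fp + fn + tn
    total_accuracy = (tp + tn) / total
    random_accuracy = (
        (tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)
    ) / (total * total)
    if random_accuracy == 1.0:
        # Degenerate case: both sides use a single class and agree everywhere.
        return 1.0
    return (total_accuracy - random_accuracy) / (1.0 - random_accuracy)


def mean_kappa(actual, predicted):
    """Per-tag kappa values and their average over all observed tags."""
    tags = sorted(set(actual) | set(predicted))
    scores = {t: kappa(*binary_confusion(actual, predicted, t)) for t in tags}
    return scores, sum(scores.values()) / len(scores)
```

Here `actual` would hold the annotators' tags and `predicted` the tags obtained from the English tagger, aligned token by token; the second value returned by `mean_kappa` corresponds to the averaged kappa described above.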

Like other studies, the current study has some limitations. Different translators produce different translations of the same sentence when translating from the source language to the target language. Moreover, even a single translator can produce a different result when the translated text is translated back. In this study, re-translation was therefore carried out by mapping the words. For example, "He is a boy." corresponds to "Wo aik larka ha.", with the word pairs (he, wo), (a, aik), (boy, larka), and (is, ha). A customized translator for a specific language could ease the whole process. Another limitation of this study was the random selection of sentences. This was mitigated by drawing the sample three times; the results were approximately the same each time.
Short texts were used in this study; text from sources other than Twitter will be used in an upcoming paper. Beyond the overall results, a detailed comparison of state-of-the-art English POS taggers will be carried out in the near future to rank the best POS tagger for Urdu sentence tagging. Furthermore, sample data from sources other than Twitter will be considered for validation purposes. The current methodology could be used for multilingual tagging to extract useful information. Therefore, a generic methodology for several different languages will be considered in the future. Additionally, each language has a different level of diversity; therefore, the same methodology could be applied to several languages to avoid the development of new, complex taggers.
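The word-mapping step used for re-translation can be sketched as follows, using the example sentence from the limitations above. This is only an illustration under stated assumptions: `project_tags` is a hypothetical helper, the word pairs are assumed to be supplied by hand or by some alignment step, the tags shown are Penn Treebank tags chosen for illustration, and `UNK` is a placeholder label for unmapped words.

```python
from collections import defaultdict, deque


def project_tags(tagged_english, word_pairs, urdu_tokens, unknown="UNK"):
    """Carry tags from tagged English words to their paired Urdu words.

    tagged_english: list of (english_word, tag) from the English tagger.
    word_pairs: list of (english_word, urdu_word) mappings.
    urdu_tokens: the Urdu sentence, tokenized in its original word order.
    """
    # Queue the tags of each English word so repeated words are used in order.
    tags_by_english = defaultdict(deque)
    for word, tag in tagged_english:
        tags_by_english[word.lower()].append(tag)

    # Hand each Urdu word the tag of its paired English word.
    tags_by_urdu = defaultdict(deque)
    for english, urdu in word_pairs:
        queue = tags_by_english[english.lower()]
        if queue:
            tags_by_urdu[urdu].append(queue.popleft())

    # Emit the tags in the original Urdu word order.
    result = []
    for token in urdu_tokens:
        queue = tags_by_urdu[token]
        result.append((token, queue.popleft() if queue else unknown))
    return result


tagged = [("He", "PRP"), ("is", "VBZ"), ("a", "DT"), ("boy", "NN")]
pairs = [("he", "wo"), ("a", "aik"), ("boy", "larka"), ("is", "ha")]
print(project_tags(tagged, pairs, ["wo", "aik", "larka", "ha"]))
```

Mapping word by word keeps each tag attached to a specific Urdu token regardless of how the word order differs between the two languages, which is the property the re-translation step relies on.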