On the State-of-the-Art of POS Taggers, ‘Sandhi’ Splitters, ‘Alankaar’ Finders and ‘Samaas’ Finders for Indo-Aryan and Dravidian Languages

Computational Linguistics refers to the development of computer systems that deal with human languages. In this paper, different computational linguistic techniques such as Parts of Speech (POS) taggers, ‘Sandhi’ Splitters, ‘Alankaar’ Finders and ‘Samaas’ Finders are considered. A thorough literature review shows that fifteen techniques have been used for POS tagging, nine techniques have been used for ‘Sandhi’ splitting, only one work exists on ‘Alankaar’ finding, and no techniques are available for ‘Samaas’ finding for the Indo-Aryan and Dravidian languages. The analysis shows that the Rule Based Approach (RBA) and the Hidden Markov Model (HMM) are the most frequently used approaches for POS tagging, RBA is the most frequently used approach for ‘Sandhi’ splitting, general Human Intelligence (HI) is used for ‘Alankaar’ finding, and no ‘Samaas’ finder technique is available for any Indian language.

Keywords—‘Alankaar’; ‘samaas’; ‘sandhi’; parts of speech tagger (POST)


I. INTRODUCTION
Natural Language Processing (NLP) has two main branches: Natural Language Understanding (NLU) and Natural Language Generation (NLG). Computational Linguistics is a part of NLP, and it requires a good understanding of both programming and the language itself. Computational linguistic techniques include Machine Translation, Speech Recognition systems, Text-to-Speech Synthesizers, Interactive Voice Response systems, Search Engines, POST, 'Sandhi' Splitters, 'Alankaar' Finders and 'Samaas' Finders. Hindi, recognized as the official language of India, is one of the most widely spoken languages in India [1]. It has 38 million native speakers and is the fourth most spoken language in the world [2]. Hindi also has various dialects. For instance, Awadhi, one of its dialects, is spoken in 20 districts of India and 8 districts of Nepal [3]. Prominent texts such as 'Ramcharitmanas', 'Hanuman Chalisa' and 'Padmavat' are written in Awadhi [4][5][6]. This paper presents a thorough and exhaustive study of the various types of tools available for the Indian languages. The tools covered include the POS tagger, 'Sandhi' Splitter, 'Alankaar' finder and 'Samaas' finder. An attempt has been made to present the research work done in this area to date.
The rest of the paper is organized as follows: Section II describes related work, Section III discusses the analysis of NLP techniques for Indian languages, and Section IV presents the conclusion and future work.

II. RELATED WORK
Basit et al. [7] discussed an Awadhi POS tagger and its tag set. For developing the tag set, the authors referred to the Bureau of Indian Standards (BIS) and used a Feature Based Approach (FBA). Various features such as word-level, tag-level, character-level and Boolean-level features are used for POS tagging. Ekbal et al. [8] developed a Bengali POS tagger using the Maximum Entropy Approach (MEA). They worked on 72,341 words and used 26 tags. MEA is based on feature selection; the features can include lexical features, named entity information, word suffixes and prefixes, context features, digit features, etc. Using these features, the system achieved 88.2% accuracy. Proisl et al. [9] experimented with part-of-speech tagging for Magahi and Bhojpuri using SoMeWeTa, a Bi-directional Long Short Term Memory (LSTM) + Conditional Random Field (CRF) tagger, and a standard tagger approach. The SoMeWeTa tagger is based on an averaged structured perceptron. The Bi-LSTM uses character and word embeddings and supports transfer learning. The standard tagger is based on a Maximum Entropy Cyclic Dependency Network (MECDN). The authors achieved 90.70% accuracy for Magahi and 94.08% for Bhojpuri. Ojha et al. [10] used CRF and Support Vector Machines (SVM) for tagging Indo-Aryan languages, specifically Hindi, Odia and Bhojpuri. 90K tokens were used for training the system and 2K tokens for testing. Accuracies of 88% to 93.7% were achieved with SVM and 82% to 86.7% with CRF.
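The feature-based and maximum-entropy taggers above rely on extracting per-token features of exactly these kinds. As a minimal illustration (the feature names and choices below are assumptions for exposition, not the feature sets of the cited systems), such an extractor can be sketched in Python as follows:

```python
def token_features(sentence, i):
    """Illustrative feature extraction for one token of a tokenized sentence.
    Mirrors the word-level, suffix/prefix, context and digit features commonly
    used by feature-based / maximum-entropy POS taggers."""
    word = sentence[i]
    return {
        "word": word,
        "prefix3": word[:3],            # prefix feature
        "suffix3": word[-3:],           # suffix feature
        "is_digit": word.isdigit(),     # Boolean digit feature
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",               # context feature
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }

# Example: features for the second token of a (transliterated) Hindi sentence.
print(token_features(["raam", "vidyaalay", "jaataa", "hai"], 1))
```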
Singh et al. [11] presented a Bhojpuri POS tagger developed using SVM with 87.3% to 88.6% accuracy; the errors can be minimized by increasing the corpus size. Pandey et al. [12] developed a Chhattisgarhi POS tagger using RBA. 40,000 words (taken from story books) and 30 tags were used for testing, and 78% accuracy was achieved. Sinha et al. [13] presented rules of the Chhattisgarhi language so that they could further be used for developing parsers and translators for Chhattisgarhi. Reddy et al. [14] developed a cross-language POS tagger using HMM, i.e. a Kannada POS tagger built using Telugu resources. Bhirud et al. [15] discussed the significance of various Computational Linguistics (CL) tools such as grammar checkers, POST, 'Sandhi' Splitters and 'Samaas' Finders.
Verma et al. [16] discussed the lexical analysis or tokenization process. The authors used different religious texts such as the Bible, the Gita, the Guru Granth Sahib, the Rigveda and the Quran to perform lexical analysis. Bhatt et al. [17] evaluated the accuracy of a Gujarati POS tagger. For this, the authors worked on two different datasets and two different methods. The datasets were a sports information dataset and an amusement dataset. Using HMM, 70% and 56% accuracy were obtained for the sports information dataset and the amusement dataset respectively. Using the RBA model, the authors obtained 76% and 80% accuracy for the sports information dataset and the amusement dataset respectively. Sharma et al. [18] stated that multiple techniques have been used to perform POS tagging on Hindi text. These techniques are based either on rules, on statistics, or on both; the statistical models include SVM, HMM, CRF and MEA.
Narayan et al. [19] developed a Hindi POS tagger using an Artificial Neural Network (ANN) and achieved 91.03% accuracy. Narayan et al. [20] developed a Hindi POS tagger using a Quantum Neural Network (QNN) and achieved 99.13% accuracy. Mohnot et al. [21] proposed a Hindi POS tagger developed using a Hybrid Approach (HA), which can be a combination of RBA, CRF, HMM and so on. 80,000 words and seven types of tags were used for the experiments. Joshi et al. [22] stated that three approaches are very common for POS tagging: RBA, the Statistical Approach (SA) and HA. Garg et al. [23] used RBA for a Hindi POS tagger. The authors collected 26,149 words from news, essays and short stories, used 30 different tags, and achieved 87.55% accuracy. Shrivastava et al. [24] developed a Hindi POS tagger using the longest-suffix-matching approach with HMM and obtained 93.12% accuracy. Dalal et al. [25] stated that the Maximum Entropy Markov Model (MEMM) can be used for POS tagging and chunking. This model has various features such as corpus-based, word-based, dictionary-based and context-based features; the first three are used for POS tagging and the last for chunking.
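Longest-suffix matching of the kind used in [24] is typically applied to unseen words: the longest word ending observed during training stands in for the unknown word when estimating its tag. A minimal sketch, using a hypothetical suffix table, is shown below.

```python
# Hypothetical suffix -> most likely tag table, learned from a tagged corpus.
SUFFIX_TAGS = {
    "taa": "VM",     # illustrative verbal suffix
    "on": "NN",
    "iyaan": "NN",
}

def guess_tag(word, suffix_tags=SUFFIX_TAGS, default="NN"):
    """Return the tag associated with the longest matching suffix of an unseen word."""
    for length in range(len(word), 0, -1):   # try the longest suffix first
        suffix = word[-length:]
        if suffix in suffix_tags:
            return suffix_tags[suffix]
    return default

print(guess_tag("jaataa"))   # -> "VM" via the suffix "taa"
```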
Antony et al. [26] developed a Kannada POS tagger using SVM. The authors built their own corpus, with words taken from Kannada newspapers and books. The corpus size was initially 1,000 words, then 25,000 words, and finally 54,000 words with 30 tags; accordingly, the authors obtained 48%, 66% and 86% accuracy respectively. Priyadarshi et al. [27] proposed a Maithili POS tagger using CRF. The authors annotated Maithili text themselves and created a corpus consisting of 52,190 words. 85.88% accuracy was achieved when the experiment was performed on Wikipedia dumps and other Maithili web resources. Mundotiya et al. [28] developed a Maithili POS tagger using CRF and achieved 0.77 precision and recall, 0.78 F1 score and 0.77 accuracy. Jha et al. [29] discussed 'Sandhi' rules and machine learning models for analyzing words, generating multiple words, and concatenating root words with suffixes or prefixes. Singh et al. [30] developed a morphology-based Manipuri POS tagger. The authors used dictionaries for root words, prefixes and suffixes. The system was tested on 3,784 sentences consisting of 10,917 words. The results show that 69% of the words were correctly tagged while 31% were incorrectly tagged (23% unknown words and 8% known words).
Patil et al. [31] developed a rule-based Marathi POS tagger. The system was tested on a small corpus and achieved 78.82% accuracy; the authors state that the accuracy can be increased by increasing the corpus size. Singh et al. [32] presented an N-gram HMM POS tagger. The authors considered the tourism domain and collected 1,95,647 words for the experiments. Kaur et al. [33] described a Punjabi POS tagger developed using HMM with a tag set of 630 tags. A large tag set creates a data sparseness problem, which can be resolved by reducing the tag set. The authors suggested the new tag set proposed by Technology Development for Indian Languages (TDIL), which consists of only 36 tags instead of 630. The accuracies with 36 tags and 630 tags were 92-95% and 85-87% respectively. Mittal et al. [34] described an N-gram HMM model for a Punjabi POS tagger. The results showed that the N-gram model is not suitable for unknown words arising from spelling mistakes or foreign-language words.
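The HMM-based taggers discussed in this section share a common core: transition probabilities over tag bigrams (or higher-order N-grams), emission probabilities of words given tags, and Viterbi decoding. The following is a minimal bigram sketch with toy probabilities; the tag set and numbers are purely illustrative and not taken from any cited corpus.

```python
import math

# Toy bigram HMM parameters (illustrative only).
TAGS = ["NN", "VM"]
START = {"NN": 0.7, "VM": 0.3}                       # P(tag | <s>)
TRANS = {("NN", "NN"): 0.3, ("NN", "VM"): 0.7,
         ("VM", "NN"): 0.6, ("VM", "VM"): 0.4}       # P(tag_i | tag_{i-1})
EMIT = {("NN", "raam"): 0.5, ("NN", "ghar"): 0.4, ("NN", "jaataa"): 0.1,
        ("VM", "jaataa"): 0.8, ("VM", "hai"): 0.2}   # P(word | tag)

def viterbi(words):
    """Return the most probable tag sequence under the toy bigram HMM."""
    small = 1e-6  # crude smoothing for unseen (tag, word) pairs
    best = [{t: (math.log(START[t]) + math.log(EMIT.get((t, words[0]), small)), [t])
             for t in TAGS}]
    for w in words[1:]:
        col = {}
        for t in TAGS:
            score, path = max(
                (best[-1][p][0] + math.log(TRANS[(p, t)]) +
                 math.log(EMIT.get((t, w), small)), best[-1][p][1] + [t])
                for p in TAGS)
            col[t] = (score, path)
        best.append(col)
    return max(best[-1].values())[1]

print(viterbi(["raam", "ghar", "jaataa"]))  # -> ['NN', 'NN', 'VM']
```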
Sharma et al. [35] stated that the correctness of a POS tagger depends on how accurately it tags the words of a sentence. The problem with existing taggers is that they fail to tag compound words and complex sentences. The authors aimed to increase the efficiency of the existing Punjabi POS tagger by implementing the Viterbi algorithm with a bi-gram HMM. Suresh et al. [36] developed a Telugu POS tagger using HMM with 620 tags, whereas TDIL proposes only 34 tags for Indian languages; the experiments achieved 92-95% and 85-87% accuracy with 34 tags and 620 tags respectively. Jagadeesh et al. [37] used unsupervised learning algorithms and Deep Learning (DL) methods for developing a Telugu POS tagger. Table I summarizes the approaches used for POS tagging.

Kovida et al. [43] discussed General Approaches (GA) used for a language-independent 'Sandhi' Splitter; the system was tested on two languages, Telugu and Malayalam. Devadath et al. [44] conducted 'Sandhi' splitting experiments on Dravidian languages. The authors evaluated the performance of a 'Sandhi' splitting tool and analyzed the error propagation rate. Joshi et al. [45] presented a 'Sandhi' viched ('Sandhi' Splitter) using different Hindi rules. They experimented on 847 Hindi compound words and obtained 75% accuracy. Gupta et al. [46] developed a rule-based 'Sandhi' viched system for the Hindi language. The authors tested the system on more than 200 words and obtained 60% to 80% accuracy. Deshmukh et al. [47] compared four 'Sandhi' analyzer and 'Sandhi' Splitter systems developed for Sanskrit, Marathi, Hindi and Malayalam, and found that RBA was used for all four languages.
Murthy et al. [48] developed the first 'Sandhi' Splitter for Kannada using 'Sandhi' Place Determination (SPD) and the Prefix Suffix Method (PSM). The experiment was performed on 7,000 Kannada words and achieved 80% accuracy. Shashirekha et al. [49] presented an RBA-based agama 'Sandhi' Splitter, covering Yakaragama and Vakaragama. The experiment was conducted on words taken from Kannada newspapers and online resources, and the developed system achieved 98.85% accuracy.
Shree et al. [50] proposed a Kannada 'Sandhi' Splitter using the CRF method. Sebastian et al. [51] discussed the results and issues of a Malayalam word splitter developed using Machine Learning (ML) approaches. Premjith et al. [52] used DL methods such as RNN, LSTM and Gated Recurrent Units (GRU) for constructing and splitting words and obtained 98.08%, 97.88% and 98.16% accuracy respectively. Nisha et al. [53] developed a Malayalam 'Sandhi' Splitter using a Memory Based Language Processing (MBLP) algorithm. This algorithm is based on suffix separation; the authors discussed three methods for suffix separation: the root-driven method, the affix-stripping method and the suffix-stripping method. Devadath et al. [54] developed a Malayalam 'Sandhi' Splitter using HA and obtained 91.1% accuracy; the authors stated that HA was better than RBA and SA because it is faster and more accurate.
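The DL-based splitters above typically cast 'Sandhi' splitting as character-level sequence labelling: each character is labelled with whether a split point occurs immediately after it. A minimal PyTorch sketch of such a model is given below; the architecture and dimensions are illustrative assumptions, not the configuration reported in [52].

```python
import torch
import torch.nn as nn

class SandhiSplitTagger(nn.Module):
    """Character-level tagger: for each input character, predict whether a
    'Sandhi' split point occurs immediately after it (label 1) or not (0)."""
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 2)

    def forward(self, char_ids):                  # char_ids: (batch, seq_len)
        hidden, _ = self.rnn(self.emb(char_ids))  # (batch, seq_len, 2*hidden)
        return self.out(hidden)                   # per-character split logits

# Toy usage: a batch containing one word encoded as integer character ids.
model = SandhiSplitTagger(vocab_size=100)
logits = model(torch.randint(0, 100, (1, 9)))
print(logits.shape)   # torch.Size([1, 9, 2])
```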
Das et al. [55] developed a Malayalam 'Sandhi' Splitter using HA, with the Malayalam characters represented in Unicode. Nair et al. [56] developed a Malayalam 'Sandhi' Splitter using RBA to split compound words. The system was tested on 4,000 compound words and obtained 90% accuracy; the authors stated that the work can be extended to other Dravidian languages because of their structural similarity. Joshi et al. [57] developed a Marathi 'Sandhi' Splitter using RBA. The experiment was conducted on 150 words and obtained 70-80% accuracy. Patil et al. [58] proposed a 'Sandhi' viched system for the Sanskrit language using RBA.
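Rule-based 'Sandhi' splitters such as those in [45], [46] and [56]-[58] essentially invert the joining rules: a joined sound is rewritten as the pair of sounds that could have produced it, and the candidate splits are validated against a lexicon. A minimal sketch for a single vowel ('dirgha') sandhi rule, using transliterated forms and a hypothetical lexicon, is shown below.

```python
# Hypothetical lexicon of valid word forms (transliterated for readability).
LEXICON = {"vidya", "alay", "ganga", "jal"}

def split_dirgha_sandhi(word, lexicon=LEXICON):
    """Return all splits of `word` licensed by the a + a -> aa rule and the lexicon."""
    splits = []
    for i, ch in enumerate(word):
        if ch == "A":                       # "A" marks the long vowel "aa" here
            left, right = word[:i] + "a", "a" + word[i + 1:]
            if left in lexicon and right in lexicon:
                splits.append((left, right))
    return splits

# "vidyAlay" (vidyaalay) -> vidya + alay under the illustrative rule and lexicon.
print(split_dirgha_sandhi("vidyAlay"))   # [('vidya', 'alay')]
```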
Bhardwaj et al. [59] developed a Sanskrit benchmark called 'Sandhi'kosh. 'Sandhi'kosh includes a rule-based corpus, a literature corpus, a Bhagavad Gita corpus, a UoH corpus and an Astaadhyaayi corpus. The authors presented three of the most popular 'Sandhi' splitting tools, namely the JNU tool, the UoH tool and the INRIA tool, which use 'Sandhi'kosh as a reference for 'Sandhi' rules. All of these resources are openly available and can be used by anyone for validating their tools.
Hellwig et al. [60] introduced Convolutional Neural Network (CNN) and RNN models for splitting Sanskrit compound words; the model is also suitable for German compound words. Hellwig et al. [61] developed a 'Sandhi' resolution and 'Sandhi' splitting system using RNN. Natarajan et al. [62]  Adhikari et al. [65] discussed rules for improving the existing Nepali morphological analyzers. Paul et al. [66] discussed a Nepali stemmer developed using an affix-stripping technique and a rule-based technique. The system was tested on 1,800 words from different domains, including news on economics, health and politics in the Nepali language, which is based on the Devanagari script. The overall accuracy of the designed system was 90.48%. Basapur et al. [67] stated that developing a 'Sandhi' Splitter or 'Sandhi' joiner for the Pali language is difficult because of the complex nature of its grammar rules. Table IV presents the approaches used for 'Sandhi' Splitters for international languages from 2014 to 2020.
Hemlata et al. [68] stated that translation is the process of rendering words from one language into another without altering the meaning. Translation is a difficult task because it involves a large number of 'Ras' and 'Alankaar', which enhance the beauty of the literature. The 'Ramcharitmanas' is an Awadhi epic that makes tremendous use of 'Alankaar'. It can be translated by machine, but doing so diminishes the beauty of the epic; the authors performed this work better with the help of Human Intelligence (HI).
Das et al. [69] stated that a parse structure and a simple-sentence generation algorithm are used to generate simple sentences from complex or compound sentences. Sharma [70] stated two things: first, sentence simplification methods are used to simplify compound sentences; second, RBA, an HMM POS tagger and a lexicon-based morphological analyzer are used to identify syntactic errors. On testing, the system achieved 93.30% precision, 97.32% recall and 95.25% F-measure. Garain et al. [71] stated that sentences can be simplified by preparing a parse tree, and the efficiency of the simplification depends on the efficiency of the parse tree. Poornima et al. [72] defined an RBA for sentence simplification as a two-step process: in the first step, the sentence is split at delimiters, and in the second step it is split again at connectives (a minimal sketch of this idea is given below). Zhu et al. [73] stated that the sentence simplification process consists of a source and a target; a complex sentence and a simple sentence can be the source and the target respectively. A tree-based simplification model is used for splitting, dropping, reordering and substitution.
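The following is a minimal sketch of the two-step rule-based simplification described above (split at delimiters, then split again at connectives); the delimiter and connective inventories are illustrative assumptions, not those of [72].

```python
import re

# Illustrative delimiter and connective inventories (not the sets used in [72]).
DELIMITERS = r"[;:]"
CONNECTIVES = ["and", "but", "because", "although"]

def simplify(sentence):
    """Two-step rule-based simplification: split at delimiters, then at connectives."""
    clauses = [c.strip() for c in re.split(DELIMITERS, sentence) if c.strip()]
    simple = []
    for clause in clauses:
        parts = [clause]
        for conn in CONNECTIVES:
            parts = [p.strip() for part in parts
                     for p in re.split(rf"\b{conn}\b", part) if p.strip()]
        simple.extend(parts)
    return simple

print(simplify("Ram went to school; he was late because the bus broke down"))
# ['Ram went to school', 'he was late', 'the bus broke down']
```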
As discussed above, although some papers on sentence simplification were found, no papers were found on 'Samaas' Finder for any Indian language.

III. ANALYSIS OF NLP TECHNIQUES FOR INDIAN LANGUAGES
After analyzing the contents of Table I and Table III, we find that fifteen techniques are used for POS tagging and nine techniques are used for 'Sandhi' splitting across many Indian languages. Very little work has been done on 'Alankaar' Finders and no work has been done on 'Samaas' Finders. Various graphs have been prepared by considering different parameters: Figure 1 shows the language-wise available POS taggers, Figure 2 shows the number of approaches used by POS taggers, and Figure 3 shows the year-wise distribution of POS taggers. Similar graphs have been prepared for 'Sandhi' Splitters: Figure 4 presents the language-wise available 'Sandhi' Splitters, Figure 5 shows the number of approaches used by 'Sandhi' Splitters, and Figure 6 shows the year-wise distribution of 'Sandhi' Splitters. Figure 7 shows the various approaches used by the different computational linguistic tools. After reviewing all the research papers, it is observed that most of the computational linguistics work has been done in Maharashtra, Punjab, Telangana, Tamil Nadu and Uttar Pradesh; Table VIII depicts the state-wise statistics. As future work, the authors would like to extend this work and use ML techniques for the linguistic tools, i.e. the POS tagger, 'Sandhi' Splitter, 'Alankaar' Finder and 'Samaas' Finder, for the Indo-Aryan and Dravidian languages.