Highly Efficient Parts of Speech Tagging in Low Resource Languages with Improved Hidden Markov Model and Deep Learning

Abstract—Over the years, many different algorithms have been proposed to improve the accuracy of automatic parts of speech tagging. High accuracy of parts of speech tagging is very important for any NLP application. Powerful models like the Hidden Markov Model (HMM) used for this purpose require a huge amount of training data and are also less accurate in detecting unknown (untrained) words. Most of the languages in the world lack enough resources in computable form to be used for training such models. NLP applications for such languages also encounter many unknown words during execution. This results in a low accuracy rate. Improving accuracy for such low-resource languages is an open problem. In this paper, one stochastic method and a deep learning model are proposed to improve accuracy for such languages. The proposed language-independent methods improve unknown word accuracy, and overall accuracy, with a low amount of training data. At first, bigrams and trigrams of characters that are already part of training samples are used to calculate the maximum likelihood for tagging unknown words using the Viterbi algorithm and HMM. With training datasets below the size of 10K, an improvement of 12% to 14% in accuracy has been achieved. Next, a deep neural network model is also proposed to work with a very low amount of training data. It is based on word level, character level, character bigram level, and character trigram level representations to perform parts of speech tagging with a small amount of available training data. The model improves the overall accuracy of the tagger along with the accuracy for unknown words. Results for "English" and a low-resource Indian language, "Assamese", are discussed in detail. Performance is better than many state-of-the-art techniques for low-resource languages. The methods are generic and can be used with any language with a very small amount of training data.


I. INTRODUCTION
Parts of speech tagging can be viewed as a word classification problem. Each such class contains words having some common properties regarding their usage in sentences. For example, the English language contains the following parts of speech: "noun", "pronoun", "verb", "adjective", "preposition", "conjunction", and "interjection". The meaning of a sentence and its grammatical correctness depend on the kinds of parts of speech used in the sentence. Only accurate parts of speech tagging can determine the correct interpretation of the text. Any language processing task, therefore, depends heavily on the accuracy of tagging. Information in natural languages, like parts of speech, can be helpful in many language-related tasks. However, most of the developments in natural language processing are observed for a few dominant languages spoken widely in the world. This is because of a lack of extensive research and the non-availability of computable resources for other languages. It is therefore very important to identify the key factors that affect accuracy and to make proper use of them so that such languages can also benefit from the advancements of natural language processing techniques. The concern is to find innovative ideas to improve accuracy for languages with smaller amounts of usable data. With methods that work well with little training data, we can develop NLP applications for languages that are poor in terms of computable resources. It is also important to design systems that can be used across languages so that the benefit can be transferred to any language.

The Hidden Markov Model is one of the most popular stochastic models used for natural language processing. The Viterbi algorithm uses the Hidden Markov Model to find the most likely sequence of hidden states that could have produced an observed sequence. Thus, the algorithm can be used to predict the parts of speech tags of a sequence of words [1]. The model is trained with a large amount of already tagged training data. Such a model does not work properly when training data is small in volume. The model fails to learn the behaviour due to insufficient training and also due to unknown words that were not seen during training. The accuracy is even lower for languages for which adequate training data is not available. A lot of research has been carried out to overcome this by modifying the Hidden Markov Model for unknown words, but there is still much scope for improvement. Most of the research has been carried out to improve the Hidden Markov Model and Viterbi algorithm using a large training dataset and with the imposition of some rules. The usage of huge training datasets limits the scope of such methods only to languages that are very rich in computable resources. Also, rule-based methods limit the scope only to the specific language in context because rules are language-dependent. Hence, a language that has not been studied in much detail cannot get the benefit of rule-based NLP methods.

This paper describes two specific works to improve the performance of automatic parts of speech tagging for languages with a low quantity of resources in computable form. At first, we introduce a new method to calculate the probability for unknown words. We use bigrams and trigrams of characters for this purpose and have also modified the Viterbi algorithm accordingly.
The bigram and trigram combinations of characters are used as resources during training of the Hidden Markov Model. Bigrams and trigrams of characters help to model unknown words and to improve recognition of untrained words. Improving accuracy for unknown words improves performance for languages with smaller amounts of computable resources available for training. Experiments have been conducted for the English language. We have also tested the same with a small set of Assamese, a low resource language spoken widely in North East India. We have reported results that show considerable improvement. The concept is generic and can be used with any language. This is especially helpful for languages with a very small amount of computable resources. Such languages cannot be trained with a huge vocabulary due to lack of data, which results in many unknown words being encountered while testing the system.
Next, a deep learning method is proposed to improve recognition of words using bigrams and trigrams of characters. Machine learning and deep learning models are very popular these days. This kind of model learns the behaviour of the system after being trained with enough labeled data. The model learns associations from the training datasets and applies them to real data with good accuracy. Such a model needs to be trained with a huge amount of training data to learn the behaviour well. The challenge is to make the system learn from the least amount of available data so that it works for low-resource languages. We have therefore discussed the proposed deep learning architecture that is capable of classifying words into defined parts of speech with considerable accuracy. It is based on word level, character level, character bigram level, and character trigram level representations.
The rest of the paper is organized as follows: Section II describes some of the earlier works and popular methods used for automatic parts of speech tagging, the Hidden Markov Model, parts of speech tagging for low resource languages, and machine learning methods. Section III describes the research objective in brief. Section IV describes the proposed methods and the datasets used for the purpose. Section V outlines the experiments and results. Section VI discusses the outcomes and Section VII concludes with the future scope of improvements. The reference section lists the references used in the paper.

II. RELATED WORKS

A. Works Related to Parts of Speech Tagging
Automatic parts of speech tagging has been one of the key areas of research since people started working on processing natural languages. Rule-based methods, stochastic methods, and transformation-based learning approaches are the most widely used supervised techniques for parts of speech tagging. The stochastic or probabilistic methods use a training set and calculate the probability of a word belonging to each possible tag. Based on this calculation, the tag with the highest probability is assigned. As outlined by Martinez (2011), one of the popular methods for POS tagging is the rule-based method, which is based on a set of rules set by humans [2]. However, it requires too much manual intervention and also requires in-depth knowledge of grammatical rules that vary from language to language. Transformation-based learning has also been used recently to automatically tag POS, where the rules are learned from an initially annotated corpus. This method requires a huge amount of training to make the system learn and provide accurate results. The most popular methods for POS tagging are Markov model taggers based on the Hidden Markov Model. They work on statistical methods to find the best possible sequence of tags out of the possible tag sequences. The model consists of three components: outputs, transitions, and states, where states represent the tags in the case of POS tagging. Maximum entropy methods, support vector machines, neural networks, decision trees, etc. are some of the other methods used for this purpose. Accuracy above 95% can be obtained, but the model needs to be trained with a huge training dataset [2]. Among the early works, Janas [3] in 1977 proposed a two-step method based on knowledge of linguistic regularities for English texts. He used a large corpus to get 84% of words tagged correctly. Research into parts of speech tagging for languages other than English has also progressed a lot recently. Fernando et al. (2016) presented a support vector machine based part-of-speech tagger for the Sinhala language [4]. Application of Hidden Markov Model based taggers for the language lags far behind, as stated by the authors. They reported an overall accuracy of 84.68% and an unknown word accuracy of 59.86%. Better use of techniques like the Hidden Markov Model will be helpful to improve accuracy for the Sinhala language, but the unavailability of a huge corpus like that of English is a concern. Hyun-Je Song and Seong-Bae Park (2020) have recently addressed two tagging problems for the Korean language using a two-step mechanism [5]. Udomcharoenchaikit, Boonkwan, and Vateekul (2020) have introduced an evaluation scheme for sequential tagging methods based on an example-based system using known spelling errors for the Thai language [6].

B. Related Works using Hidden Markov Model
The Hidden Markov Model is the most popular stochastic method for automatic parts of speech tagging. Cutting et al. [1] described some initial works on automatic parts of speech tagging based on the Hidden Markov Model (HMM). They reported an accuracy of 96% using the Brown Corpus [7], developed by Francis et al. (1979) for English. Cing et al. (2019) presented a comparison of using HMM only, and HMM with morphological analysis, for parts of speech tagging of the Myanmar language. Morphological rules are used to improve performance for unknown words. They have stated in conclusion that using only HMM for a small dataset has no scope at all; a large training dataset or a rule-based method in combination with HMM is the only solution [8]. Jurgen et al. (2011) used an infinite HMM for parts of speech tagging in an unsupervised manner [9]. Myint et al. [10] proposed to use lexical information with HMM for the Myanmar language, as HMM is only capable of using contextual information, not lexical information. They have therefore proposed Lexicalized Hidden Markov Models (L-HMMs) for improving recognition. Thorsten Brants [11] reported that a Markov model-based tagger performs at least as well as other approaches, including the maximum entropy framework. He has used some rules on top of HMM for unknown words and found an accuracy of up to 81%. Ferran Pla and Antonio Molina [12] applied Lexicalized Hidden Markov Models and reported an improvement in accuracy for part-of-speech tagging; they reported a 6% improvement for unknown words. Recently Tham et al. (2020) have reported the usage of a hybrid POS tagger for the Khasi language. A tagger is developed using the Hidden Markov Model (HMM). It is then integrated with conditional random field (CRF) rules. The errors obtained from the first tagger are used to improve accuracy [13]. Rule-based features are very much dependent on language. Jassim et al. (2021) have used an N-iterative HMM model for parts of speech tagging of Iraqi national songs. The iterative approach has improved accuracy as claimed by the authors [14]. Since HMM uses a huge amount of training data, it can map the context of words very well and provide high accuracy, as seen in the works discussed above. However, the model suffers in the case of words that were not seen during training. Among the early works, Ratnaparkhi (1996) proposed a maximum entropy model [15] to successfully tag unseen words with accuracy up to 96%. It uses some specialized features to take decisions and uses approximately 900 thousand words from the Wall Street Journal corpus as a training dataset, taken from the Treebank project (Marcus et al., 1994) [16]. Robert M. Losee [17] used tagging for improving decision-making with the help of linguistic information. Toutanova et al. [18] proposed part-of-speech tagging using lexical features and preceding and following text context, with fine-grained modeling of features for unknown words. Martin Haulrich [19] reported the implementation of a part-of-speech tagger based on the first-order Hidden Markov Model and compared different strategies to improve the result for unknown words. Other rule-based techniques take decisions based on the presence or absence of a number, upper-case letter, etc., as proposed by researchers [15]. Rules may be added to improve accuracy for unknown words, but these rules are language-dependent. Scott M. Thede [20] proposed a few statistical methods for predicting unknown words; the methods are applicable to any language, and the author used the Brown Corpus [7] in this case too. Mikheev (1996) used the beginning and end of a word to predict the parts of speech of unknown words [21], using certain morphological rules: prefix rules, suffix rules, and ending-guessing rules. Anastasyev et al. [22] described rules for detecting unknown word tags for rich languages; context and word endings are used as clues for detecting parts of speech. The following subsections detail the basic principles of the Hidden Markov Model and the Viterbi algorithm.

C. Basic Principle of Hidden Markov Model
The Hidden Markov Model (HMM) comprises two kinds of events: observed events and hidden events. For the parts of speech tagging problem, observed events are the words that appear in the input text and hidden events are the parts of speech that are to be predicted. The components of a hidden Markov model are: a state transition probability matrix, a sequence of observations, emission probabilities, and an initial probability distribution [23][24].

1) Transition probability:
It is the first Markov assumption, which implies that the current state depends only on the previous state. We will call this the transition probability. A transition probability matrix is to be calculated based on training data for all pairs of tags (states). Mathematically it is represented by:

P(qi | q1, q2, …, qi-1) = P(qi | qi-1) (1)

where qi is the state at the i-th instance.

2) Emission probability
This second Markov assumption implies that the observation at any instant depends only on the present state:

P(oi | q1, …, qi, o1, …, oi-1) = P(oi | qi) (2)

where oi is the observation at the i-th instance. We will term this the emission probability. An emission probability matrix is to be calculated based on training data for all pairs of tags (states) and words (observations).
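To make the two quantities concrete, the following minimal Python sketch (not part of the paper; function and variable names are hypothetical) estimates both the transition and emission probabilities as relative frequencies from a list of tagged sentences:

```python
from collections import defaultdict

def estimate_hmm_parameters(tagged_sentences):
    """Estimate transition and emission probabilities from a tagged corpus.

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    Returns plain dictionaries; a full tagger would also apply smoothing.
    """
    transition_counts = defaultdict(lambda: defaultdict(int))  # prev_tag -> tag -> count
    emission_counts = defaultdict(lambda: defaultdict(int))    # tag -> word -> count
    tag_counts = defaultdict(int)

    for sentence in tagged_sentences:
        prev_tag = "<s>"  # sentence-start pseudo-state
        for word, tag in sentence:
            transition_counts[prev_tag][tag] += 1
            emission_counts[tag][word.lower()] += 1
            tag_counts[tag] += 1
            prev_tag = tag

    # P(tag | prev_tag) and P(word | tag) as relative frequencies
    transition_prob = {
        prev: {t: c / sum(nxt.values()) for t, c in nxt.items()}
        for prev, nxt in transition_counts.items()
    }
    emission_prob = {
        tag: {w: c / tag_counts[tag] for w, c in words.items()}
        for tag, words in emission_counts.items()
    }
    return transition_prob, emission_prob
```

These tables play the role of the transition matrix a and the emission matrix b used by the Viterbi algorithm described next.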

3) Viterbi algorithm:
It is used to predict the part of speech of a word based on the maximum likelihood calculated as:

Vt(j) = max i [ Vt-1(i) * aij ] * bj(Ot) (3)

where Vt(j) is the Viterbi path probability of the model being at state j at any instance, after observing the first t observations. Here a is the transition probability matrix and b is the emission probability matrix [25][26].
The problem with this basic algorithm is that the emission probability P(oi|qi) is zero for an unknown word, because

P(oi | qi) = N(oi | qi) / N(qi) (4)

where N(oi|qi) is the number of times observation oi appears in the training set with state qi and N(qi) is the number of times state qi appears in the training set. Hence, over the years, researchers have come up with different methods to improve accuracy for such words. One simple method is Laplace smoothing, where the following modification is made:

P(oi | qi) = ( N(oi | qi) + 1 ) / ( N(qi) + V )

where V is the size of the trained vocabulary.

Considering only the transition probability in the case of unknown words, i.e., replacing the zero emission value by one, is another option to overcome the problem. Earlier works in this area are discussed above in this section.
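A minimal sketch of how the Viterbi recursion of equation (3) can be combined with Laplace smoothing for unseen words is shown below. It assumes count tables such as those produced by the earlier sketch; the function names, the sentence-start state "<s>", and the back-off constant for unseen transitions are illustrative choices, not the paper's implementation.

```python
import math

def viterbi_tag(words, tags, transition_prob, emission_counts, tag_counts, vocab_size):
    """Viterbi decoding with Laplace-smoothed emission probabilities so that
    unknown words do not force a zero probability. Log probabilities are used
    to avoid numeric underflow; 1e-12 backs off unseen tag transitions."""
    def emit(tag, word):
        count = emission_counts.get(tag, {}).get(word, 0)
        return (count + 1) / (tag_counts[tag] + vocab_size)   # Laplace smoothing

    # initialisation from the sentence-start pseudo-state "<s>"
    best = [{t: (math.log(transition_prob.get("<s>", {}).get(t, 1e-12))
                 + math.log(emit(t, words[0])), None)
             for t in tags}]

    # recursion: keep the best previous tag for every current tag
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0]
                  + math.log(transition_prob.get(p, {}).get(t, 1e-12))
                  + math.log(emit(t, word)))
                 for p in tags),
                key=lambda item: item[1])
            column[t] = (score, prev)
        best.append(column)

    # backtracking from the highest-scoring final tag
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
```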

D. Related Work for Low Resource Languages
The resources available for the languages of the world are extraordinarily unbalanced [27]. There are many organizations working dedicatedly on technology development for low resource languages. Under the LORELEI (Low Resource Languages for Emergent Incidents) program of the Linguistic Data Consortium for DARPA, researchers are working on developing NLP technologies for natural disaster management for almost three dozen low resource languages [28]. NLP research teams are working seriously on approximately 20 of the almost 7000 languages of the world, leaving a majority of the population without access to advanced NLP applications [29]. Low resource languages are those languages that are less computerized, less privileged, resource-scarce languages. These are languages where statistical methods cannot be applied due to the availability of little data [29], [30], [31], [32]. Simpson et al. [33] reported that low resource languages "all meet the basic criteria of being significant in terms of the number of native speakers but poorly represented in terms of available language resources." Christopher et al. state that defining a language as a "low resource language" depends on the demography, linguistics, and resource availability of the language and its speakers [31]. Some amount of work has been done for low resource languages, but researchers are not yet able to develop NLP applications for the majority of languages. This is because a strong language-independent method is essential to work with languages having smaller amounts of training data. With a low amount of training, testing always encounters more and more unknown data, which eventually makes the hidden Markov model not very useful for such cases. Recently, some works [34] [35] for resource-poor languages and rule-based methods have shown some improvement. But rule-based NLP applications are still language-specific and the advantages are limited. Researchers have used unsupervised techniques for low resource languages. N-gram models [36] can also be very useful for many natural language processing tasks. Authors reported considerable results by taking help from another resourceful language as a parent language and with standardized text for the two languages [37], [38]. But they also reported difficulty in choosing the parent language because typologically close languages do not always work best. Researchers have used modern supervised techniques based on long short-term memory networks (LSTM) on multilingual embeddings to get good results. But that also requires quite a large training dataset [39], which is not available for most languages. Some researchers have used bilingual lexicons, available to some extent for a few languages, to investigate the possibility of designing language models with limited training data. The method uses the learning of cross-lingual word embeddings to train monolingual language models; the training shows improvements due to the pre-training process [40]. Some amount of work has also been done for Assamese using stochastic methods [41], [42], [43] [44] [45]. Recently authors have discussed the key areas of NLP research in Assamese [46]. Researchers have also been working on parts of speech tagging and other NLP issues for Arabic [47] [48].

E. Related Work using Machine Learning
Recently deep learning methods have gained high popularity, and they are being used for Indian and other low-resource languages. Sequential deep learning methods are very popular in this regard; some of them are long short-term memory (LSTM), bidirectional LSTM, gated recurrent units (GRU), recurrent neural networks (RNN), etc. Authors in [49] have applied deep learning for tagging the Chinese Buddhist language. Their learning model is based on RNN. The model, as reported by the authors, is more effective than traditional methods. Bidirectional LSTM is another popular method in this regard; authors in [50] experimented with BLSTM and auxiliary loss over a set of 22 languages. They used the auxiliary loss to improve the performance for rare words. Techniques of using character-level along with word-level representations have been used recently for POS tagging with deep neural networks. Authors in [51] used the method for English and Portuguese. A convolutional layer was used to prepare the data with character representation. Authors in [52] discussed in detail how to represent character-level information from raw text. They successfully used it to predict the next character from a given sequence of characters, using a simple recurrent network for this purpose.
Authors in [53] reported using the character-level outputs of a convolutional neural network (CNN) as inputs to an LSTM RNN model. The authors have stated that it highly improves performance in the case of morphologically rich languages like Russian, Spanish, French, Czech, Arabic, and German. Machine learning approaches have recently been used with many low-resource languages. Authors in [54] described their architecture for Korean POS tagging. They addressed the issue of rare word detection by an input-feeding and copying mechanism and got considerable results. Authors in [55] used machine learning models for POS tagging of the Sanskrit language. They represented each word as a point and then used clustering with an LSTM autoencoder to get the tagging. Authors in [56] used deep learning methods for the Nepali language. They used Long Short-Term Memory networks (LSTM), Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU), and bidirectional variants to successfully tag Nepali words with high accuracy.

III. RESEARCH OBJECTIVES
Parts of speech tagging is a preliminary pre-processing step that needs to be executed for any NLP application. But the popular methods available for tagging require a huge amount of training data. They perform very poorly for languages for which a huge amount of training data is not available. The problem is to develop parts of speech tagging methods that are applicable to any language in the world with a very low amount of computable resources.
The most common methods, like the Hidden Markov Model and machine learning, are not well applicable to languages where large amounts of computable data resources are not available. Accuracy is also not very high for words that were not trained earlier, because the models cannot extract much information for such words. Some rule-based methods are mainly used to address these issues. But the scope of such rules is limited and dependent on the language for which the rules are set. The need is to develop systems that can improve accuracy, especially for unknown words, with low training, and that are applicable to any language. This paper concentrates on language-independent approaches to using the Hidden Markov Model and deep learning techniques with a very small amount of training data, so that they can be used for any low resource language with considerable performance. The main focus is to devise language-independent methods to improve accuracy, especially the accuracy of unknown words, which is the major concern for any supervised method trained with a small training dataset.

IV. PROPOSED METHODS

A. Using Modified Viterbi Algorithm
We have proposed modifications to the Hidden Markov Model to improve accuracy for untrained words with a small set of training data. We have used a subset of the Brown Corpus [7], which is widely used by researchers in this area. This gives a better platform for comparing the proposed modifications with existing methods. The proposed method is based on the probabilities of character bigrams and character trigrams. The emission probability of a character bigram bi given tag ti is calculated as follows:

P(bi | ti) = N(bi | ti) / N(ti) (5)

where N(bi|ti) is the number of times bigram bi appears in the training set with tag ti and N(ti) is the number of times tag ti appears in the training set. Similarly,

P(tri | ti) = N(tri | ti) / N(ti) (6)

where N(tri|ti) is the number of times trigram tri appears in the training set with tag ti.
The probability of unknown words is calculated based on the fact that the bigrams (and trigrams) constituting the unknown word may have already appeared in some trained words, and thus this information may be used to predict the possible tag of the unknown word being tested. We assume that the probability of a word given a tag is proportional to the product of the probabilities of the sequence of character bigrams (or trigrams) of the word given the tag. Thus, instead of considering the emission probability of an unknown word as zero or one, we make the following change:

P(oi | qi) ∝ c * P(b1 | ti) * P(b2 | ti) * … * P(bn | ti) (7)

where c is a constant and b1, b2, …, bn are the bigrams of the characters constituting the word oi.
Hence, equation (3) can be rewritten as

Vt(j) = max i [ Vt-1(i) * ( P(b1 | ti) * P(b2 | ti) * … * P(bn | ti) ) ] * bj(Ot) (8)

If any value of P(bk | ti) turns out to be zero, it is replaced by a very small value to avoid a zero product.
Similarly, the same kind of modification can be made for trigrams of characters:

Vt(j) = max i [ Vt-1(i) * ( P(tri1 | ti) * P(tri2 | ti) * … * P(trin | ti) ) ] * bj(Ot) (9)

If any value of P(trik | ti) turns out to be zero, it is replaced by a very small value to avoid a zero product. This is an alternative to considering a zero or one value for the emission probability of the entire word, as discussed in the previous section. This helps us to guess the probability of an unknown word by using the probabilities of bigrams that may have occurred in other words that are already trained.
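The modified emission term for an unknown word can be sketched as follows. This is an illustrative Python fragment, not the paper's code; the helper names and the floor value used to avoid a zero product are assumptions.

```python
def char_ngrams(word, n):
    """Character n-grams of a word, e.g. bigrams of 'cat' -> ['ca', 'at']."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_emission(word, tag, ngram_counts, tag_counts, n=2, floor=1e-8):
    """Approximate P(word | tag) for an unknown word as the product of
    P(ngram | tag) over its character bigrams (n=2) or trigrams (n=3),
    following equations (5)-(7)."""
    prob = 1.0
    for gram in char_ngrams(word, n):
        count = ngram_counts.get(tag, {}).get(gram, 0)
        prob *= (count / tag_counts[tag]) if count else floor  # avoid zero product
    return prob
```

Inside the Viterbi recursion, this product stands in for the zero emission probability of the unknown word, while known words keep their usual emission estimate.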

B. Using Deep Learning Architecture
The recent usage of neural network and machine learning methods has proved to be very useful in modern technology. One of the most popular neural networks used for this purpose is the recurrent neural network (RNN). It feeds the output of one stage as input to the next. The states in an RNN can store inputs of variable-length sequences. This particular property makes it very useful for variable-length inputs like text sentences, speech, etc. LSTM is a kind of RNN that uses a special unit that can store memory for the long term. This kind of model can retain information for a long time because of its ability to select the kind of information to retain.
We have designed an architecture for the deep learning model and have successfully implemented it to work considerably well with a small amount of training data. The work is inspired by [57], [58], and [59], where the authors have used character-level representation along with word-level representation to train the model. We take a sequence of tagged words and feed the words, characters, and bigrams and trigrams of characters of the words into the first layer of the learning model. We have experimented with different parameters of the layers to get the best-suited architecture. The first layer of the learning network transforms words into feature vectors. It captures information about the words' semantics and their morphological characteristics. Every word is converted into a vector of sub-vectors of word-level embedding, character-level embedding, bigram character-level embedding, and trigram character-level embedding.

1) Word-level embeddings:
Word-level embeddings are encoded in an embedding matrix by column vectors, where each column represents the word-level embedding of the corresponding word in the vocabulary. Every word is thus converted into its word-level embedding. A word is first encoded as a one-hot column vector and fed to the input layer. It is multiplied by the word embedding matrix to finally get the word embedding. A word vector at time instant t, WVt, is multiplied with the embedding matrix WMw to get the word embedding Ew as follows:

Ew = WMw * WVt

2) Character-level embeddings: All characters in a word are represented by character-level embeddings. Like word-level embeddings, character-level embeddings are encoded in an embedding matrix by column vectors, where each column represents the character-level embedding of the corresponding character in the character vocabulary. A character vector at time instant t and position i, CVit, is multiplied with the embedding matrix CMc to get the character embedding:

ECi = CMc * CVit

The combined word representation is calculated by concatenating the word and character embeddings, where € is the concatenation symbol:

E = Ew € EC1 € EC2 € … € ECn

3) Bigram-character-level embeddings: Similarly, bigram combinations of characters in a word are represented by a bigram character-level embedding. It is encoded in an embedding matrix by column vectors, where each column represents the bigram character-level embedding of the corresponding character bigram in the bigram-character vocabulary. A bigram character vector at time instant t and position i, BVit, is multiplied with the embedding matrix BMc to get the bigram character embedding:

EBi = BMc * BVit

The combined representation concatenates the word, character, and bigram character embeddings, where € is the concatenation symbol.

4) Trigram-character-level embeddings:
Similarly, trigram combinations of characters in a word are represented by a trigram character-level embedding. It is encoded in an embedding matrix by column vectors, where each column represents the trigram character-level embedding of the corresponding character trigram in the trigram-character vocabulary. A trigram character vector at time instant t and position i, TVit, is multiplied with the embedding matrix TMc to get the trigram character embedding:

ETi = TMc * TVit

The combined representation concatenates the word, character, and trigram character embeddings, where € is the concatenation symbol.
The addition of bigram and trigram combinations helps the model to learn the inflections and morphological patterns related to tagging very well. The results are discussed in the next section. The basic principle of concatenating the different kinds of embeddings is illustrated in Fig. 1. The difference from the usual LSTM is that the word embedding in the proposed model is a concatenation of word embedding, character embedding, and bigram or trigram character embedding. The order of characters is the same as the order in which they appear in the word. Similarly, the bigrams and trigrams of characters also follow the order in which they appear in the word. The embedding dimensions are decided after repeated experiments to get the most suitable values.
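A compact PyTorch sketch of this kind of architecture is given below. It is a simplified illustration, not the paper's exact network: the vocabulary sizes, embedding dimensions, mean-pooling of the character-level sequences into fixed-size vectors, and the use of all three n-gram levels at once (the paper evaluates the bigram and trigram variants separately as Model2 and Model3) are all assumptions.

```python
import torch
import torch.nn as nn

class NgramTagger(nn.Module):
    """Sketch of a tagger where each word is represented by its word embedding
    concatenated with pooled character, character-bigram and character-trigram
    embeddings, fed to a bidirectional LSTM with a per-token tag classifier."""

    def __init__(self, n_words, n_chars, n_bigrams, n_trigrams, n_tags,
                 word_dim=100, char_dim=30, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.bi_emb = nn.Embedding(n_bigrams, char_dim)
        self.tri_emb = nn.Embedding(n_trigrams, char_dim)
        self.lstm = nn.LSTM(word_dim + 3 * char_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, chars, bigrams, trigrams):
        # words: (batch, seq); chars/bigrams/trigrams: (batch, seq, max_len)
        w = self.word_emb(words)
        c = self.char_emb(chars).mean(dim=2)    # pool over the characters of each word
        b = self.bi_emb(bigrams).mean(dim=2)    # pool over character bigrams
        t = self.tri_emb(trigrams).mean(dim=2)  # pool over character trigrams
        x = torch.cat([w, c, b, t], dim=-1)     # concatenated word representation
        h, _ = self.lstm(x)                     # (padding handling omitted for brevity)
        return self.out(h)                      # per-token tag scores
```

A cross-entropy loss over the per-token tag scores would then be used to train such a model.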

C. Data Sets
The English dataset is based on the already tagged Brown Corpus [7]. The corpus is based on current American English and contains about a million words. It comprises elements of statistics, psychology, linguistics, and sociology. Kučera and Francis (1967) reported their initial work on basic statistics of the corpus, which eventually turned into the Brown Corpus [7] [60]. The dataset considered for Assamese is developed in-house. Assamese is a language spoken in the state of Assam and across North-East India. However, not much information is available in computable form, and a publicly available set of tagged words does not exist for Assamese. Hence it was prepared in-house. It comprises an annotated text of approximately ten thousand words. The dataset is prepared from a corpus collected from TDIL (Technology Development for Indian Languages). It contains articles of different categories like storybooks, scientific articles, health articles, drama, etc. The corpus was tagged with different parts of speech as per Assamese grammar.

V. EXPERIMENTS AND RESULTS
We have prepared three small datasets of three different sizes taken from the already tagged Brown Corpus. The original implementation of the Viterbi algorithm performs poorly due to low training data. Because of the limited training, it cannot approximate the transition matrix accurately. It also encounters many unknown words during testing because of the small training size. In the original Viterbi algorithm, due to zero occurrences of unknown words in the training data, the emission probability evaluates to zero, resulting in a zero value for the observation, as already stated in equation (4). Many researchers consider only the transition probability in the product of transition and emission probabilities, which is equivalent to considering the emission probability as one for unknown words. This is obviously a biased method, resulting in poor performance for both unknown and known words. Again, if a zero value for the emission probability is considered for unknown words, then the calculated value becomes zero for all unknown words, which is an obvious fault.
For small training data, this is a big challenge because, with a small training set, the transition history cannot be learnt properly and the transition matrix can never be measured accurately. Therefore, the performance is very poor, especially for unknown words. The proposed method replaces the transition probability with the product of the individual emission probabilities of the bigrams. The result shows improvement as it gives the word a tag based on the probability distribution of its constituent bigrams towards the tags as per the corpus. The product of the individual bigram probabilities is multiplied with the transition probability, thus considering both emission and transition rather than considering none or only the transition probability.
First, an experiment was conducted with 5,000 words as the training set and 1,250 words as the test set for the very basic Viterbi algorithm. The performance is poor because the system was trained with too little data. Similarly, the same was done with 10,000 and 20,000 training words, tested with 2,500 and 5,000 words respectively, for the basic Viterbi algorithm as a baseline for comparison.
The result of the implementation of the basic Viterbi algorithm on the three small sets of tagged words (4:1 ratio of training to testing words) is tabulated in Table I. The result shows improvement with more training data due to a better estimate of the transition probability. As discussed earlier, the evaluation of the correct tag (state) is based upon equation (3). It can be simplified as finding the maximum likelihood of a state over all possibilities based upon the equation:

P(State) = P(Tag | PrevTag) * P(Word | Tag) (14)

In simple terms, it can be written as:

P(State) = P(emission) * P(transition)

Next, two modifications are made to this basic formula:

A. Modification Method1
It is based on the emission probabilities of character bigrams and trigrams. The probability of a state is calculated based upon the following equation:

P(State) = Product of emission probabilities of bigrams (trigrams) (15)

The transition probability is discarded, as the work is concentrated on a small set of training data.

B. Modification Method2
It is based on the emission probabilities of character bigrams and trigrams and the transition probability of tags. The probability of a state is calculated based upon the following equation:

P(State) = Product of emission probabilities of bigrams (trigrams) * P(transition) (16)

A part of the bigram probability matrix is shown in Table II. It is calculated from the first 5,000 words of the Brown Corpus. The table shows the probabilities of BIGR (bigram) for the following parts of speech: NOU (Noun), VRB (Verb), ADJ (Adjective), DET (Determiner), ADP (Adposition), PRO (Pronoun), ADV (Adverb), and CON (Conjunction). The matrix for only five bigram combinations is shown.
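For clarity, the two scoring variants can be written as two small functions. This sketch reuses the hypothetical ngram_emission helper from the earlier fragment and is illustrative only; the back-off constant for unseen transitions is an assumption.

```python
def method1_score(word, tag, prev_tag, ngram_counts, tag_counts, transition_prob):
    """Modification Method1, equation (15): only the product of character
    bigram (or trigram) emission probabilities; prev_tag and transition_prob
    are accepted but ignored, to keep the two signatures identical."""
    return ngram_emission(word, tag, ngram_counts, tag_counts)

def method2_score(word, tag, prev_tag, ngram_counts, tag_counts, transition_prob):
    """Modification Method2, equation (16): the same product multiplied by the
    tag transition probability P(tag | prev_tag)."""
    return (ngram_emission(word, tag, ngram_counts, tag_counts)
            * transition_prob.get(prev_tag, {}).get(tag, 1e-8))
```

The tag assigned to a word is then the one maximising the chosen score over all candidate tags.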
The basic algorithm performs better with a larger training dataset. An accuracy value above 96% has been reported [1] using the entire Brown Corpus [7] of approximately 540 thousand words. However, unknown word accuracy is not good. The proposed modifications are then applied to the same sets of data. The goal is to improve the results with a small set of training data. The trigram character probability has also been used instead of the bigram probability for the same sets of data. The results for the four datasets of different sizes are described in Table III, Table IV, Table V, and Table VI, respectively.
A comparison of the results for the four different sizes of training data, with the same 4:1 ratio of training to testing data, is shown in Fig. 2. The improvement is more visible with bigram characters than with trigrams. For subsequent experiments, bigram characters are used.
Next, experiments were also conducted for bigrams with an increased proportion of testing data (50% training and 50% testing), and the results are tabulated below in Table VII. In this case, three different sizes of text were taken from the Brown Corpus and the systems were trained using them. An equal volume of data was used to test each of the systems. The test dataset was selected from a different part of the corpus. With the test dataset being almost double the size used in the previous experiments, the results are seen to be consistent for unknown words. A comparison is detailed in Fig. 3. Only the accuracy for unknown words is detailed above. The overall accuracy is not an issue with the Hidden Markov Model; it remains high in this case too, even with a very small training set. The overall accuracy for the different sizes using bigram characters is depicted in Fig. 4.
The system is also tested with "Assamese" language with low training data. Assamese is a low resource language spoken in the state of Assam located in North East India. The findings are tabulated below in Table VIII and a brief comparison is shown in Fig. 5 and Fig. 6.
The system is also tested with "Assamese" language with 1:1 ratio of training and testing data. The findings are tabulated below in Table IX and Fig. 5.
The Assamese corpus was collected from TDIL (Technology Development for Indian Languages); it can be procured from their website after due permission. The corpus was tagged as per Assamese grammar rules to prepare the training and test datasets. The result shown here is based on a training and test size of 10k words each, which are non-overlapping. The results are satisfactory even with a large proportion of unknown words.

C. Result with Deep Learning Model
The proposed deep learning model is then applied to automatically tag the same set of English words used above. The result obtained is satisfactory and is described in Table X and Fig. 7. The training dataset used is deliberately small in order to check the usefulness of the model for low resource languages. Datasets of sizes 10k and 20k are used to train the system and it is tested with an equal size of different data. Next, the proposed deep learning model is used for the same purpose for the Assamese language. Initially, only the traditional machine learning method of word-level embedding is applied, giving an accuracy of 72.51%, which is comparable to the proposed traditional stochastic model. Next, the system is trained with word- and character-level embedding (Model1) and the accuracy jumps to 88.21%. Then, the models combining word and character sequences with bigrams (Model2) and trigrams (Model3) are implemented. As these models learn the morphological behaviour much better than before, the accuracy goes up to 93.52% for bigrams and 94.51% for trigrams. The accuracy for unknown words also rises due to better learning. Table XI and Fig. 8 compare the results of applying the proposed deep learning models for Assamese.

VI. DISCUSSION

The results of the experiments with the modified HMM methods clearly show an improvement in accuracy for unknown words. Training sets of 5k, 10k, 20k, and 50k are used on an experimental basis, and the system can also be tested with a larger dataset. However, as the goal is to improve accuracy with a small training set, only these limited sizes of data are considered for training. The methods do not use any language-specific rules and hence can be implemented for any language. The initial experiments are based on a 4:1 ratio of training and testing data, which shows considerable improvement with the proposed modifications, even though the test size increased with the increase in training size to maintain the ratio. The results are also encouraging for the next set of experiments, which are based on a 1:1 ratio of training and testing data. Even with equal sizes of testing and training datasets, the system performs considerably well, especially for the second modification. In the experiments conducted, it is observed that the accuracy improves with an increase in the training set from 5,000 to 50,000, particularly in the case of modification Method2. This happens because, with an increase in the size of the training set, the transition probability starts contributing along with the emission probability, thus increasing accuracy. The improvement observed is less for Method1 as it only considers emission probability. Table VII and Table IX clearly show that the accuracy does not degrade even after increasing the test size and the number of unknown words. The system maintains an overall accuracy above 85% in all cases, which is considered good with such a low amount of training. The overall accuracy is higher than 80% even when the system is exposed to a test dataset of size equal to that of the training set.
The accuracy further improves with the proposed deep learning model. Training the words along with characters and bigrams or trigrams of characters improves the accuracy further, as shown in Table X and Table XI. The model learns better when bigrams and trigrams of characters are fed into the input along with the character sequences. The bigrams and trigrams of characters allow the model to better learn the inflections, and hence accuracy improves. The improvement is more pronounced for Assamese because of its high degree of inflection.

VII. CONCLUSION
It is observed that the accuracy of the transition probability increases with an increase in the size of the training dataset. The transitions among the different parts of speech can easily be computed in the form of transition probabilities with the help of a very large training set. But, for a small set of training data, the transition probability is not predictable. Hence, for a small set of training data, the emission probability plays the most important role in deciding the total probability. Due to non-appearance in the training set, the emission probability for unknown words is zero, and this is the root cause of the problem of detecting the correct tags of unknown words. With the usage of bigrams and trigrams of characters in the proposed modifications, unknown words may also have non-zero emission probability if such bigrams and trigrams have ever occurred during the training of other words. This increases the accuracy while classifying unknown words. The experiments conducted for English with low training data show that the results are comparable with other methods that use large training data. The same technique of character sequences, bigram sequences of characters, and trigram sequences of characters, applied to design a deep learning model, also makes the system learn the behaviour so well that the accuracy increases to a great extent. The system also performs well for a low resource language like "Assamese" when used with a very small volume of training data. The methods are language independent, and we hope that they will be useful for future implementation for any low resource language. More in-depth research on this will further improve accuracy for low resource languages.