Highly Efficient Parts of Speech Tagging in Low Resource Languages with Improved Hidden Markov Model and Deep Learning

Diganta Baishya; Rupam Baruah

doi:10.14569/IJACSA.2021.0121011

DOI: 10.14569/IJACSA.2021.0121011

PDF

Highly Efficient Parts of Speech Tagging in Low Resource Languages with Improved Hidden Markov Model and Deep Learning

Author 1: Diganta Baishya

Author 2: Rupam Baruah

International Journal of Advanced Computer Science and Applications(IJACSA), Volume 12 Issue 10, 2021.

Abstract and Keywords
How to Cite this Article
{} BibTeX Source

Abstract: Over the years, many different algorithms are proposed to improve the accuracy of the automatic parts of speech tagging. High accuracy of parts of speech tagging is very important for any NLP application. Powerful models like The Hidden Markov Model (HMM), used for this purpose require a huge amount of training data and are also less accurate to detect unknown (untrained) words. Most of the languages in this world lack enough resources in the computable form to be used during training such models. NLP applications for such languages also encounter many unknown words during execution. This results in a low accuracy rate. Improving accuracy for such low-resource languages is an open problem. In this paper, one stochastic method and a deep learning model are proposed to improve accuracy for such languages. The proposed language-independent methods improve unknown word accuracy and overall accuracy with a low amount of training data. At first, bigrams and trigrams of characters that are already part of training samples are used to calculate the maximum likelihood for tagging unknown words using the Viterbi algorithm and HMM. With training datasets below the size of 10K, an improvement of 12% to 14% accuracy has been achieved. Next, a deep neural network model is also proposed to work with a very low amount of training data. It is based on word level, character level, character bigram level, and character trigram level representations to perform parts of speech tagging with less amount of available training data. The model improves the overall accuracy of the tagger along with improving accuracy for unknown words. Results for “English” and a low resource Indian Language “Assamese” are discussed in detail. Performance is better than many state-of-the-art techniques for low resource language. The method is generic and can be used with any language with very less amount of training data.

Keywords: Hidden markov models; viterbi algorithm; machine learning; deep learning; text processing; low resource language; unknown words; parts of speech tagging

Diganta Baishya and Rupam Baruah, “Highly Efficient Parts of Speech Tagging in Low Resource Languages with Improved Hidden Markov Model and Deep Learning” International Journal of Advanced Computer Science and Applications(IJACSA), 12(10), 2021. http://dx.doi.org/10.14569/IJACSA.2021.0121011

@article{Baishya2021,
title = {Highly Efficient Parts of Speech Tagging in Low Resource Languages with Improved Hidden Markov Model and Deep Learning},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2021.0121011},
url = {http://dx.doi.org/10.14569/IJACSA.2021.0121011},
year = {2021},
publisher = {The Science and Information Organization},
volume = {12},
number = {10},
author = {Diganta Baishya and Rupam Baruah}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.

Highly Efficient Parts of Speech Tagging in Low Resource Languages with Improved Hidden Markov Model and Deep Learning

Upcoming Conferences