International Journal of Advanced Computer Science and Applications (IJACSA), Volume 12 Issue 12, 2021.
Abstract: Nowadays, digital documents are the primary source of information. Natural language processing is therefore widely used in information retrieval, topic modeling, document classification, and document clustering. Preprocessing plays a significant role in all of these applications, and one of its critical steps is removing stopwords. Many languages have defined stopword lists. However, no stopword list is publicly available for Tamil, since it is an under-resourced language. This study identified 93 general stopwords and some domain-specific stopwords for sports, entertainment, and local and foreign news by analyzing more than 1.7 million Tamil documents with more than 21 million words. The study also shows that removing stopwords improves the accuracy of a Tamil document clustering system: the F-score improved by 2.4% for TF-IDF with the one-pass algorithm and by 0.95% for FastText with the one-pass algorithm.
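The abstract does not say how the stopwords were identified, so the following is only a minimal sketch of one common approach: treating words that appear in a large share of documents as stopword candidates, then filtering them out before clustering. The function names, the `min_doc_ratio` threshold, and the English toy documents are all assumptions made for illustration, not the authors' method or data.

```python
from collections import Counter


def candidate_stopwords(documents, min_doc_ratio=0.5):
    """Return words that occur in at least min_doc_ratio of the documents.

    Hypothetical helper: document frequency is an assumed criterion,
    not one stated in the abstract.
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.split()))
    threshold = min_doc_ratio * len(documents)
    return {word for word, df in doc_freq.items() if df >= threshold}


def remove_stopwords(document, stopwords):
    """Return the document's tokens with stopwords filtered out."""
    return [word for word in document.split() if word not in stopwords]


# Toy English stand-ins for a tokenized corpus (assumed data).
docs = [
    "the match was won by the home team",
    "the film was released in the summer",
    "the minister was in the capital",
]

stopwords = candidate_stopwords(docs, min_doc_ratio=1.0)
print(sorted(stopwords))
print(remove_stopwords(docs[0], stopwords))
```

In practice, a list produced this way would likely still need manual review, since frequent domain terms can pass a purely statistical threshold.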
M. S. Faathima Fayaza and F. Fathima Farhath, “Towards Stopwords Identification in Tamil Text Clustering,” International Journal of Advanced Computer Science and Applications (IJACSA), 12(12), 2021. http://dx.doi.org/10.14569/IJACSA.2021.0121267
@article{Fayaza2021,
title = {Towards Stopwords Identification in Tamil Text Clustering},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2021.0121267},
url = {http://dx.doi.org/10.14569/IJACSA.2021.0121267},
year = {2021},
publisher = {The Science and Information Organization},
volume = {12},
number = {12},
author = {M. S. Faathima Fayaza and F. Fathima Farhath}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.