International Journal of Advanced Computer Science and Applications (IJACSA), Volume 16 Issue 9, 2025.
Abstract: Tokenization is a foundational step in the NLP pipeline, and its design strongly influences the performance of transformer-based models, particularly for morphologically rich and low-resource languages such as Assamese. While most tokenizers are traditionally assessed using intrinsic metrics, their practical impact on downstream tasks has remained underexplored. This study systematically evaluates nine subword tokenizer configurations, spanning Byte-Pair Encoding (BPE), WordPiece, and Unigram algorithms with vocabulary sizes of 8K, 16K, and 32K, on sentiment classification in Assamese. Each tokenizer was integrated into a BERT-base-multilingual-cased model by replacing the default tokenizer and reinitializing the embedding layer. On a manually curated dataset, naïve fine-tuning proved unstable under class imbalance, but a class-weighted loss restored effective training and exposed clear performance differences across tokenizers. WordPiece consistently outperformed BPE and Unigram, with the WordPiece 16K configuration achieving a weighted F1-score of 0.4897 across 10 random seeds. This score was statistically comparable to mBERT (0.4919) and competitive with larger multilingual baselines such as XLM-R (0.4978), despite relying on a far smaller, Assamese-specific vocabulary. These findings underscore that tokenizer choice is not a neutral preprocessing step but a critical design decision, highlighting the importance of downstream evaluation when developing practical NLP pipelines for low-resource languages.
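The abstract refers to a class-weighted loss and a weighted F1-score without giving their definitions. The following is a minimal sketch of one common way to compute both, not the authors' implementation. It assumes inverse-frequency ("balanced") class weights and a support-weighted average of per-class F1; the paper's actual weighting scheme is not stated here, and the function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights (assumed scheme): n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted cross-entropy, normalised by the sum of the weights used.

    probs is a list of dicts mapping each class to its predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    score = 0.0
    for c, n_c in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = n_c - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_c / len(y_true)) * f1
    return score


if __name__ == "__main__":
    # Toy imbalanced label set, purely illustrative (not data from the paper).
    toy_labels = ["neg"] * 6 + ["neu"] * 3 + ["pos"] * 1
    print(balanced_class_weights(toy_labels))
    print(weighted_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Under this scheme the rarest class receives the largest weight, so its errors contribute more to the loss, which is the usual motivation for class weighting under imbalance.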
Basab Nath, Sagar Tamang, Osman Elwasila and Yonis Gulzar. “Task-Oriented Evaluation of Assamese Tokenizers Using Sentiment Classification”. International Journal of Advanced Computer Science and Applications (IJACSA) 16.9 (2025). http://dx.doi.org/10.14569/IJACSA.2025.0160979
@article{Nath2025,
title = {Task-Oriented Evaluation of Assamese Tokenizers Using Sentiment Classification},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2025.0160979},
url = {http://dx.doi.org/10.14569/IJACSA.2025.0160979},
year = {2025},
publisher = {The Science and Information Organization},
volume = {16},
number = {9},
author = {Basab Nath and Sagar Tamang and Osman Elwasila and Yonis Gulzar}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.