The Science and Information (SAI) Organization

DOI: 10.14569/IJACSA.2025.0160979
Task-Oriented Evaluation of Assamese Tokenizers Using Sentiment Classification

Author 1: Basab Nath
Author 2: Sagar Tamang
Author 3: Osman Elwasila
Author 4: Yonis Gulzar

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 16 Issue 9, 2025.

Abstract: Tokenization is a foundational step in the NLP pipeline, and its design strongly influences the performance of transformer-based models, particularly for morphologically rich and low-resource languages such as Assamese. While most tokenizers are traditionally assessed using intrinsic metrics, their practical impact on downstream tasks has remained underexplored. This study systematically evaluates nine subword tokenizer configurations (Byte-Pair Encoding (BPE), WordPiece, and Unigram algorithms, each with vocabulary sizes of 8K, 16K, and 32K) on sentiment classification in Assamese. Each tokenizer was integrated into a BERT-base-multilingual-cased model by replacing the default tokenizer and reinitializing the embedding layer. On a manually curated dataset, naïve fine-tuning proved unstable under class imbalance, but a class-weighted loss restored effective training and exposed clear performance differences across tokenizers. WordPiece consistently outperformed BPE and Unigram, with the WordPiece 16K configuration achieving a weighted F1-score of 0.4897 across 10 random seeds. This score was statistically comparable to mBERT (0.4919) and competitive with larger multilingual baselines such as XLM-R (0.4978), despite relying on a far smaller, Assamese-specific vocabulary. These findings underscore that tokenizer choice is not a neutral preprocessing step but a critical design decision, highlighting the importance of downstream evaluation when developing practical NLP pipelines for low-resource languages.

Keywords: Assamese NLP; tokenization; subword tokenization; sentiment analysis; low-resource languages; BERT; class imbalance
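
The class-weighted loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which weighting scheme was used, so the inverse-frequency ("balanced") weights, the three-class label set, and the toy label counts below are all assumptions chosen for illustration, not the authors' setup. The reduction divides by the sum of the applied weights, which is one common convention for a weighted mean.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example negative log-likelihoods."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the class logits.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Hypothetical, imbalanced toy labels (0 = negative, 1 = neutral, 2 = positive).
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(labels)

# Uniform logits: every example has the same unweighted loss, log(3),
# so the weighted mean is also log(3).
logits = [[0.0, 0.0, 0.0] for _ in labels]
loss = weighted_cross_entropy(logits, labels, weights)
```

Under this scheme the rarest class receives the largest weight, so misclassifying a minority-class example contributes more to the loss than misclassifying a majority-class one.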

Basab Nath, Sagar Tamang, Osman Elwasila and Yonis Gulzar. “Task-Oriented Evaluation of Assamese Tokenizers Using Sentiment Classification”. International Journal of Advanced Computer Science and Applications (IJACSA) 16.9 (2025). http://dx.doi.org/10.14569/IJACSA.2025.0160979

@article{Nath2025,
title = {Task-Oriented Evaluation of Assamese Tokenizers Using Sentiment Classification},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2025.0160979},
url = {http://dx.doi.org/10.14569/IJACSA.2025.0160979},
year = {2025},
publisher = {The Science and Information Organization},
volume = {16},
number = {9},
author = {Basab Nath and Sagar Tamang and Osman Elwasila and Yonis Gulzar}
}



Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.

The Science and Information (SAI) Organization Limited is a company registered in England and Wales under Company Number 8933205.