Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning

Mahmoud Masadeh; Moustapha. A; Sharada B; Hanumanthappa J; Hemachandran K; Channabasava Chola; Abdullah Y. Muaad

doi:10.14569/IJACSA.2024.01501110

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 15, Issue 1, 2024.

Abstract: Arabic Text Classification (ATC) is a crucial step in many Natural Language Processing (NLP) applications. It emerged in response to the exponential growth of online content such as social media posts and review comments. This study evaluates how preprocessing techniques and representation models affect the effectiveness of ATC using Machine Learning (ML). In general, ATC performance depends on several factors, such as stemming during preprocessing, feature extraction and selection, and the nature of the dataset. To improve overall classification performance, preprocessing is used primarily to reduce each Arabic term to its root form and to reduce the dimensionality of the representation. Feature extraction and selection are essential to representing Arabic text, as they significantly improve ATC performance. The chosen classifiers are implemented with various feature selection algorithms. Classification results are assessed by comparing six classifiers: Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Stochastic Gradient Descent (SGD), Support Vector Classifier (SVC), Logistic Regression (LR), and Linear Support Vector Classifier (LSVC). These ML classifiers are evaluated on short- and long-text Arabic benchmark datasets: the BBC Arabic corpus and the COVID-19 dataset. The findings indicate that classification efficacy is significantly influenced by the preprocessing methods, the representation model, the classification algorithm, and the characteristics of the datasets. In most cases, SGD and LSVC outperformed the other classifiers on the datasets under consideration when significant features were selected.

Keywords: Arabic Text Classification (ATC); Text Mining (TM); Machine Learning (ML); preprocessing methods; representation models; Feature Extraction (FE); Feature Selection (FS)
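
To make the pipeline described in the abstract concrete, the following is a minimal sketch of two of its stages: preprocessing followed by a bag-of-words representation fed to a Multinomial Naive Bayes classifier. It is not the authors' implementation. The light normalization rules stand in for the root stemming the abstract mentions, no feature selection is performed, and the function names, the class, and the four training sentences are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below: folded into bare alef.
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def normalize(text):
    """Light normalization (an assumption, not the paper's root stemming)."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0629", "\u0647")  # teh marbuta -> heh
    text = text.replace("\u0649", "\u064A")  # alef maksura -> yeh
    return text


def tokenize(text):
    """Normalize, then split into runs of Arabic letters."""
    return re.findall(r"[\u0621-\u064A]+", normalize(text))


class MultinomialNB:
    """Bag-of-words Multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(tokenize(doc))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        # Terms never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.term_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set (not taken from either benchmark dataset).
train_docs = [
    "فاز الفريق في المباراة",
    "سجل اللاعب هدفا في المباراة",
    "ارتفع عدد إصابات الفيروس",
    "أعلنت الوزارة عن لقاح جديد للفيروس",
]
train_labels = ["sport", "sport", "health", "health"]

clf = MultinomialNB().fit(train_docs, train_labels)
print(clf.predict("اللاعب سجل هدفا"))
print(clf.predict("لقاح الفيروس"))
```

Because nothing is stemmed here, "الفيروس" and "للفيروس" remain distinct vocabulary entries. Merging such surface forms is the kind of dimensionality reduction the abstract attributes to root-based preprocessing.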

Mahmoud Masadeh, Moustapha. A, Sharada B, Hanumanthappa J, Hemachandran K, Channabasava Chola and Abdullah Y. Muaad, “Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning,” International Journal of Advanced Computer Science and Applications (IJACSA), 15(1), 2024. http://dx.doi.org/10.14569/IJACSA.2024.01501110

@article{Masadeh2024,
title = {Investigating the Impact of Preprocessing Techniques and Representation Models on Arabic Text Classification using Machine Learning},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2024.01501110},
url = {http://dx.doi.org/10.14569/IJACSA.2024.01501110},
year = {2024},
publisher = {The Science and Information Organization},
volume = {15},
number = {1},
author = {Mahmoud Masadeh and Moustapha. A and Sharada B and Hanumanthappa J and Hemachandran K and Channabasava Chola and Abdullah Y. Muaad}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.
