Investigate the Impact of Stemming on Mauritanian Dialect Classification using Machine Learning Techniques

Despite the abundance and diversity of research on Natural Language Processing (NLP), a technique that allows computers to understand, generate, and manipulate human language, the field remains insufficiently developed, especially with regard to the processing of Arabic texts and Arabic dialects, which are widely used. The proposed approach applies machine learning techniques, taking evaluation criteria into account, to comments written in the Mauritanian dialect and published on social media, notably Facebook, and compares the results produced by three algorithms: Random Forest (RF), Naïve Bayes Multinomial (NBM), and Logistic Regression (LR). We then study how these techniques behave when different stemmers are combined with other features, such as the tokenizers used to process the dataset. Major challenges exist: the morphology of Arabic is completely different from that of languages written in Latin script, and there is no pre-existing dataset or dictionary with which to train the algorithms. Nevertheless, the experiments carried out in Weka show that the RF and NBM algorithms are most effective when applied with ArabicStemmerKhoja, giving 96.37% and 71.40% respectively, whereas Logistic performs best with the Null Stemmer, at 81.65%. The results obtained by all three techniques with a light Arabic stemmer exceeded 70%. This article presents a contribution to machine-learning-based NLP and describes a study that can help determine the best classifier for Arabic.


I. INTRODUCTION
Mauritania, like other countries around the world, has been swept by new technology, which has given rise to the exchange platforms commonly known as social media. These platforms host exchanges between families on the one hand, and intergovernmental and two-way exchanges between government agencies and the public (citizens) on the other. As a result, a stream of data in the Mauritanian dialect and in Arabic is generated that reflects citizens' sentiment. Sentiment analysis is an approach that uses natural language processing (NLP) [1], machine learning analysis methods [2] [3], or other lexicon-based methods [4] to extract, convert, and interpret opinions from a text and classify them as positive, negative, or neutral. The emergence of new technologies allows governments and companies to take into account the opinions their public expresses on social media, which helps them make better decisions. Artificial intelligence (AI), together with other technologies, has addressed challenges of business practice and introduced Business Intelligence (BI) applications that have promoted the transformation of information. In this sense, several processing techniques have been used to classify large volumes of data (Big Data), for example regression analysis, Naïve Bayes (NB), Support Vector Machine (SVM), and Neural Network (NN) [5]. So far, most machine-learning text classification research has targeted languages such as English rather than Arabic, even though Arabic native speakers are numerous, according to [4].
We consider dialectal Arabic to be a new field of research in text classification, for several reasons. Firstly, dialect is widely used on social networks, which generates a large amount of data. Secondly, dialect, whether HASSANIYA or another, is generally more widely used than the main language, even though it has no official status. Moreover, the Mauritanian dialect uses the Arabic alphabet and script, which makes dialect an important area of research. Given the complexity of HASSANIYA, justified by the reasons listed above, we propose in this work an approach that gives a clear view of the classification of HASSANIYA text using machine learning algorithms and compares the results achieved. To implement the proposed approach, we use Weka, which implements several filters and classifiers from machine learning algorithms [6].
The differences between Standard Arabic and dialectal Arabic are minimal in terms of derivation and grammar, as well as word endings. On this basis, we decided to study the classification of Mauritanian dialectal texts taking into account the effect of the stemming method. The goal of this work is to identify the best machine learning technique for dialect classification and the effect of stemming and tokenization on text classification, particularly for the HASSANIYA dialect; to this end, we combine different filters to build our own models.
More specifically, a HASSANIYA dataset was collected on Facebook; it contains comments posted on popular pages in Mauritania, such as bloggers' pages or government pages (the Ministry of Hydraulics, the Ministry of the Interior and Decentralization, and the Ministry of Housing and Urban Development).
The paper is organized as follows: the introduction is followed by a state of the art and literature review, which gives an overview of the Arabic language and its dialects; we then explain the research methodology followed in this paper; finally, we discuss the results obtained and give a conclusion.

II. RELATED WORKS
Text classification is an active field within data analysis that remains rich in research opportunities and continues to grow. Many researchers have studied it and published numerous articles, but the Arabic language and its dialects still need more work. In this context, several studies have been carried out. For example, [4] presents an approach for the classification of Arabic texts using various algorithms and shows an improvement in the accuracy of the classifier models.
The authors of [7] explored a comparative system on two different datasets based on machine learning techniques; the classification models are compared in terms of accuracy for each dataset.
Another study [8] applied six variations of the Bayes classifier to Arabic data. A comparison of the results showed that the best values came from Naïve Bayes followed by Naïve Bayes Update, whereas Naïve Bayes Multinomial Text produced the worst results.
The work in [9] contributes to big data processing, which is considered a challenging stage of data analysis, and provides a solution to these challenges in four stages: data collection, cleaning, enrichment, and availability. The authors sought to convert social media data from a source-based form into a computation-based form.
The authors of [10] proposed a model for dialectal text classification. They prepared a dataset of Moroccan dialect scraped from Twitter comments and applied a combination of feature extraction (n-grams), weighting schemes (BoW, TF-IDF), and word embedding in order to improve Moroccan dialect classification and obtain the best classification model. The machine learning techniques they applied are Naive Bayes, Random Forest, Support Vector Machine, and Logistic Regression, together with a deep learning model, Long Short-Term Memory (LSTM). The experimental work showed that the SVM achieved an accuracy of 70%.
The paper [11] proposed a new algorithm to generate all potential derivation roots of an Arabic word without deleting initial affixes. The author seeks to address the weaknesses and errors of existing algorithms in order to improve the accuracy of Arabic Natural Language Processing. The study used a dataset comprising a collection of roots, patterns, and affixes, against which the derived word is matched to identify the root. It reports an average accuracy rate of 96%.
The study [12] proposed a novel assembly of CNNs for the task of Arabic dialect classification from a spontaneous Arabic speech dataset. The model is based on a fusion of linguistic and acoustic features and uses a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. The proposed approach achieves an accuracy of 82.44% for the identification of five Arabic dialects.
The author of [13] proposes an approach to improve P-Stemmer by combining it with various classifiers such as Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star. The study used a dataset synthesized from various online news pages and ran the experiments in Weka; the results showed that the P-Stemmer improved when used with NB.

III. REVIEW OF ARABIC AND ITS DIALECTS
Arabic is one of the major languages of the world. It is used by all Muslims because it is the language of the holy book of Islam [6], [14]. According to [14], Arabic is divided into three categories. First, Classical Arabic (CA), considered the oldest type, is the language of Arabic literature and the Holy Quran. Second, Modern Standard Arabic (MSA) can be defined as a simplified version of Classical Arabic, intended to be comprehensible to everyone, and it has become widely used. Third, Arabic dialects use the same Arabic characters for writing and are used more than the two types above in daily life. The Arabic dialects are divided into several types, as shown in Fig. 1; for instance:
• Moroccan Dialect: The Moroccan dialect is commonly called Darija [10];
• Egyptian Dialect: The Egyptian dialect is spoken in Egypt;
• North African Dialect: The North African dialect is spoken in Algeria, Libya, and Morocco [12];
• Tunisian Dialect: Tunisian [15], and others are also Arabic dialects [16];
• Mauritania Dialect: This is named the HASSANIYA dialect and is spoken mainly in the middle Maghreb region, more specifically in Mauritania.
Mauritania Dialect: The Mauritanian dialect, named HASSANIYA, is a local dialect and a variety of Maghrebi Arabic spoken by Mauritanian Arabs. It is widely used in daily life, not only for exchanges between families but also to express or share feelings and opinions on social media and to interact with others' posts. HASSANIYA text classification is becoming increasingly complicated, for three reasons: firstly, HASSANIYA is an Arabic dialect written with the same letters of the alphabet, with changes in pronunciation and meaning depending on diacritics, and with ambiguity between a word's root and its derivations; secondly, it is an unstructured language; thirdly, no pre-existing dataset or dictionary is available.
In Table I, we segment an example of a HASSANIYA word into sub-segments that show its basic construction. As mentioned above, this dialect uses Arabic letters and can be conjugated with all subjects and tenses; as the table shows, a HASSANIYA word carries affixes such as a prefix, suffix, and postfix, determined by usage.

IV. MATERIALS AND METHODS
The main stages of the proposed methodology are data collection, preprocessing, model building, classification, and evaluation; they are described in the following subsections. This approach was applied using Weka. For a better selection of dialectal words, we adopt in this work a methodology consisting of the phases shown in Fig. 2.

A. DataSet Description
Social media data is the main source of our dataset. We built our own dataset, which contains words and sentences in HASSANIYA, by gathering hundreds of comments from Facebook pages with scraping tools. These comments present citizens' reactions to government activities, and we annotated them according to their polarity, that is, according to the opinion behind each extracted comment.
The corpus is presented in Table II. Based on our knowledge of the local language, we divided the corpus into three categories according to the opinions reflected (positive, negative, and neutral), as shown in Fig. 3. We stored the comments in interim storage as a CSV file and then converted it to ARFF format for use in Weka.
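The CSV-to-ARFF conversion step can be sketched as follows. This is a minimal illustration rather than the script used in the study: the column names `text` and `label`, the relation name, and the helper `csv_to_arff` are assumptions introduced here; only the general ARFF layout (`@relation`, `@attribute`, `@data`) follows the Weka file format.

```python
import csv
import io


def csv_to_arff(csv_text, relation="hassaniya_comments",
                labels=("positive", "negative", "neutral")):
    """Convert CSV text with 'text' and 'label' columns (assumed names) to ARFF."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for row in rows:
        if row["label"] not in labels:
            raise ValueError(f"unknown label: {row['label']!r}")
        # Escape backslashes and single quotes, then wrap the comment in quotes.
        text = row["text"].replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{text}',{row['label']}")
    return "\n".join(lines) + "\n"


sample = "text,label\nhello world,positive\nit's bad,negative\n"
arff = csv_to_arff(sample)
```

The class attribute is declared as a nominal attribute listing the three polarity labels, and the comment itself as a `string` attribute, which is the form a text-to-vector filter would then consume.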

B. Pre-processing
Pre-processing is the first step in the data analysis process and is crucial when dealing with Arabic documents [17]. In order to convert the input data into clean text that is useful for machine analysis, we used in this work a process consisting of the steps presented below. These steps and filters are provided by the Weka tool.
Tokenization : Tokenization is a technique that divides the text into tokens while preserving the meaning of the words, removing spaces, punctuation, and non-Arabic words [18]; in this way, the document is reduced to words.
Normalisation : Word normalization gives a single format to letters that can appear in different forms [19]; for instance, , can appear in different forms like , and .
StopWordsHandler : The stop-words handler is used to eliminate stop words, that is, frequent words that contribute little to the meaning of a text.
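The tokenization and stop-word steps described above can be sketched in a few lines. This is an illustrative sketch, not Weka's tokenizer or stop-word handler: the Unicode ranges, the function names, and the tiny stop-word list are assumptions made for the example, and the sample words are ordinary Arabic words rather than items from the HASSANIYA dataset.

```python
import re

# Assumption: Arabic words are runs of letters in U+0621..U+064A.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")
# Assumption: short-vowel diacritics (U+064B..U+0652) and tatweel (U+0640)
# are stripped before tokenizing so that they do not split words.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def tokenize(text):
    """Return the Arabic word tokens of a comment, dropping everything else."""
    return ARABIC_WORD.findall(DIACRITICS.sub("", text))


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"في", "من"}

tokens = tokenize("الماء في المدينة، hello 123!")
filtered = remove_stop_words(tokens, STOP_WORDS)
```

Latin words, digits, and punctuation (including the Arabic comma) fall outside the letter range and are therefore discarded by `tokenize`.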

Stemming :
Stemming is an essential step in Natural Language Processing and text classification that converts a word into its corresponding root or stem. A stem is the combination of a root and its derivational affixes (prefix, suffix, and postfix) [16]. There are two main types of stemming in Arabic: stem-based (light) stemming and root-based stemming. The first removes suffixes and prefixes from the word in order to obtain its stem. The second is divided, according to [17], into three sub-categories: (i) dictionary-based, which uses a dictionary file, as in Khoja; (ii) non-dictionary-based; and (iii) hybrid, as shown in Fig. 5. Several stemming approaches have been applied to the Arabic language; the following is a non-exhaustive list.
Light stemmer : Light stemming is a category of stemming that reduces words to their stem by removing the most frequent prefixes and/or suffixes [19], [20], [21].
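The idea of light stemming can be illustrated with a short sketch. It is not the implementation of Weka's light stemmer: the affix lists below are a small, assumed subset of common Arabic prefixes and suffixes, and the function name `light_stem` and the `min_len` guard are choices made for this example.

```python
# Assumed, non-exhaustive affix lists; longer prefixes are tried first.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word, min_len=3):
    """Strip at most one frequent prefix and one frequent suffix.

    An affix is removed only if at least `min_len` letters remain,
    which keeps very short words intact.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


stems = [light_stem(w) for w in ("المعلمون", "والكتاب", "في")]
```

Unlike a root-based stemmer, this procedure never consults a dictionary or rewrites letters inside the word; it only trims the edges.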
Heavy Stemmer : Heavy stemming is the process of eliminating affixes and changing certain letters in words to obtain the root word [22].
SetMinTermFreq() : The SetMinTermFreq() method defines the minimum frequency a term (word) must reach to be included in the feature vector. In Weka, the StringToWordVector filter converts a collection of text documents into a set of numerical features, where each feature represents the frequency of a specific word in the document. In this study, we used a minimum frequency of 2.
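The effect of a minimum term frequency on the feature vector can be sketched as follows. This is a simplified illustration, not Weka's StringToWordVector filter, whose exact counting rules may differ; the function names `build_vocabulary` and `to_count_vector` are introduced here for the example.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_term_freq=2):
    """Keep the terms that occur at least `min_term_freq` times in the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return sorted(term for term, n in counts.items() if n >= min_term_freq)


def to_count_vector(doc, vocabulary):
    """Represent one tokenized document as term counts over the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


docs = [["a", "b", "a"], ["b", "c"]]
vocabulary = build_vocabulary(docs, min_term_freq=2)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
```

With a threshold of 2, the term `c`, which occurs only once, is dropped from the vocabulary, so rare words no longer contribute a feature.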

C. WEKA tools
WEKA is a machine learning framework with a graphical user interface, which makes it easy for beginners to use. It includes a large collection of machine learning models, such as Neural Networks, Decision Trees, and K-means, and provides implementations of learning algorithms that can be applied for data analysis purposes [23].
Weka also provides tools for transforming datasets, such as discretization and sampling algorithms, so that a dataset can be pre-processed and fed into a learning scheme and the resulting classifier and its performance can be analyzed.

D. Machine Learning Algorithms
1) Naive Bayes Multinomial : Naïve Bayes (NB) is a data mining algorithm dedicated to data classification [24]. It is used to deduce the probability of a datum belonging to a class, based on the assumption that all attributes are independent of each other given the class [25]. In this work, we use Multinomial Naive Bayes to assign texts to classes based on statistical analysis of their content. This algorithm offers an alternative to the often cumbersome semantic analysis based on artificial intelligence and considerably simplifies the classification of textual data. It assigns text fragments to classes by estimating the probability that a document belongs to the same class as other documents on the same subject.
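The probability computation described above can be made concrete with a small sketch of multinomial Naive Bayes. It is a textbook formulation with Laplace (add-one) smoothing, written for illustration and not taken from Weka; the function names and the toy English training set are assumptions.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect the class priors and per-class word counts from tokenized docs."""
    classes = sorted(set(labels))
    vocab = {w for doc in docs for w in doc}
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    return {"classes": classes, "vocab": vocab,
            "class_counts": Counter(labels), "word_counts": word_counts,
            "n_docs": len(docs)}


def log_posterior(model, doc, c, alpha=1.0):
    """log P(c) + sum of log P(w | c), with add-alpha smoothing."""
    total = sum(model["word_counts"][c].values())
    score = math.log(model["class_counts"][c] / model["n_docs"])
    for w in doc:
        if w in model["vocab"]:  # words never seen in training are ignored
            num = model["word_counts"][c][w] + alpha
            score += math.log(num / (total + alpha * len(model["vocab"])))
    return score


def predict_nb(model, doc):
    """Return the class with the highest posterior score."""
    return max(model["classes"], key=lambda c: log_posterior(model, doc, c))


train_docs = [["good", "great"], ["good", "nice"], ["bad", "awful"], ["bad", "poor"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
```

Each word contributes independently to the score of a class, which is exactly the conditional-independence assumption stated above; smoothing prevents a word that is absent from one class from driving its probability to zero.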
2) Random Forest : According to [26], RF is a set of decision trees where each tree is built from a bootstrap version of the training dataset. Each tree is built according to the principle of repetitive partitioning: starting from the root node, the same node-splitting procedure is applied repeatedly until certain stopping rules are met. Its predictive power comes from the aggregation of many weaker learners (decision trees). Performance is particularly good if correlations between forest trees are low.
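The ensemble mechanism described above (bootstrap samples, randomly chosen features, and aggregation by majority vote) can be sketched as follows. To stay short, the sketch replaces full decision trees with single-split trees (stumps) that test the presence of one word, so it is a deliberate simplification and not Weka's RandomForest; all names and the toy data are assumptions.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(docs, labels, features):
    """Pick the word whose presence/absence best separates the labels."""
    default = majority(labels)
    best = None
    for word in features:
        present = [y for d, y in zip(docs, labels) if word in d]
        absent = [y for d, y in zip(docs, labels) if word not in d]
        yes = majority(present) if present else default
        no = majority(absent) if absent else default
        correct = sum((yes if word in d else no) == y for d, y in zip(docs, labels))
        if best is None or correct > best[0]:
            best = (correct, word, yes, no)
    return best[1:]  # (word, label if present, label if absent)


def fit_forest(docs, labels, n_trees=15, n_features=2, seed=0):
    """Train each stump on a bootstrap sample with a random feature subset."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(docs)) for _ in docs]  # sample with replacement
        feats = rng.sample(vocab, min(n_features, len(vocab)))
        forest.append(fit_stump([docs[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, doc):
    """Aggregate the individual votes by majority."""
    return majority([yes if word in doc else no for word, yes, no in forest])


toy_docs = [["good", "service"]] * 4 + [["bad", "service"]] * 4
toy_labels = ["positive"] * 4 + ["negative"] * 4
forest = fit_forest(toy_docs, toy_labels)
```

Because each stump sees a different bootstrap sample and a different feature subset, the individual learners are decorrelated, which is the property the paragraph above identifies as important for good performance.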
3) Logistic : Logistic regression is an important technique in the field of artificial intelligence and machine learning for data analysis that uses mathematics to find relationships between two data factors. It then uses this relationship to predict the value of one of these factors as a function of the other. The prediction usually has a finite number of outcomes, such as yes or no. Logistic regression belongs to the family of supervised machine learning models. It is also considered a discriminative model, meaning that it attempts to distinguish between classes (or categories) [27].
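For the two-outcome case mentioned above, logistic regression can be sketched with plain batch gradient descent. This is a minimal illustration under simplifying assumptions (binary labels, no regularization, fixed learning rate), not the Logistic classifier used in the experiments; the three-class setting of this study would require a multinomial extension. All names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability that sample x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # derivative of the log-loss
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
w, b = fit_logistic(X_toy, y_toy)
```

The model is discriminative in the sense described above: it learns the boundary between the two classes directly, here a threshold on the single input feature.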

E. Evaluation Metrics
Text classification models are evaluated using well-defined criteria, and we used such a set of criteria to evaluate our models [28]. The confusion matrix is defined by [10] as a tool for evaluating the accuracy of ML models by comparing their predictions with reality. Since we have three classes, six important terms enter the evaluation process, as shown in Fig. 6. The results obtained are assessed using the F1 score, precision, accuracy, and recall.
Tp: true positive, where the prediction is positive and the actual value is also positive.
Fp: false positive, where the prediction is positive but the actual value is negative or neutral.
Tng: true negative, where the prediction is negative and the actual value is also negative.
Fng: false negative, where the prediction is negative but the actual value is positive or neutral.
Tn: true neutral, where the prediction is neutral and the actual value is also neutral.
Fn: false neutral, where the prediction is neutral but the actual value is positive or negative.
Precision : Precision (P) measures how many of the "positive" predictions made by the model are correct, that is, the number of samples correctly assigned to a class divided by the number of all samples assigned to that class.
Recall : Recall (R) measures how many of the positive class samples present in the dataset were correctly identified by the model, that is, the number of samples correctly assigned to a class divided by the number of all samples that actually belong to that class.
F-Measure : The F-Measure, or F1 score, is a machine-learning evaluation metric that combines the precision and recall scores of a model; it is the harmonic mean of the two.
Accuracy : The accuracy metric computes how many times a model made a correct prediction across the entire dataset, that is, the number of correct predictions divided by the total number of predictions.
Fig. 6. Confusion matrix for three classes.
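The four measures can be computed directly from a three-class confusion matrix, as the following sketch shows. The matrix layout (rows are actual classes, columns are predicted classes), the function name `per_class_metrics`, and the counts in the example are assumptions made for illustration; the numbers are invented and are not results from this study.

```python
def per_class_metrics(confusion, labels):
    """Compute per-class precision, recall, F1, and the overall accuracy.

    confusion[i][j] is the number of samples whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    report = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report, (correct / total if total else 0.0)


# Invented counts, for illustration only.
example = [[8, 1, 1],
           [2, 6, 2],
           [0, 2, 8]]
report, accuracy = per_class_metrics(example, ["positive", "negative", "neutral"])
```

Each class is scored in a one-versus-rest fashion: its column sum counts every sample predicted as that class, and its row sum counts every sample that actually belongs to it.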

V. RESULTS AND DISCUSSION
Three experiments were carried out using Weka, as shown in Table III, in order to investigate the effect of the stemming method on Mauritanian dialect classification and to compare the performance of the machine learning techniques applied. In the first experiment, Exp (i), we combined the ArabicStemmerKhoja, the MultiStopwords, and the word tokenizer in order to construct an appropriate feature set; Exp (ii) combines the ArabicLightStemmer, MultiStopwords, and the word tokenizer; the last one, Exp (iii), combines the Null Stemmer with MultiStopwords and the WordTokenizer. The accuracy of the three experiments is illustrated in Fig. 7 and Table VII, which show that three machine learning techniques (Random Forest, Logistic Regression, and Naive Bayes Multinomial) were tested using the training data at three different stages, with the result changing according to the features used. Tables IV, V, and VI show the results obtained by the Random Forest, NBM, and Logistic techniques on the basis of the training data. It can be seen that all three classifiers classified the positive class better than the other classes, with the best value, 98.5%, obtained by RF with Light Stemmer Arabic; moreover, RF obtained better results than the other classifiers in all three cases. The results obtained from Exp (i) with Light Stemmer Arabic are shown in Table IV, which gives the performance evaluation measures for each class or sentiment (positive, negative, and neutral); positive sentiment was ranked highest by RF.
Table V shows the results obtained when using Arabic Stemmer Khoja. This experiment shows that RF correctly classifies a significant number of instances in all classes, followed by Logistic, which performs better on the positive and negative classes than on the neutral class.
The results obtained during Exp (iii), given in Table VI, show that RF and Logistic perform better than NBM in terms of classification. However, the number of correctly classified instances is lower for the neutral class than for the other classes. The metrics obtained with NBM and Logistic for the three classes across the three experiments show that these techniques are good at predicting the positive class, especially in experiment (i), and poor at predicting the negative class, unlike RF, which managed to predict all classes.
Table VIII compares our results with those of previous work, which focuses on different dialects; likewise, our approach studies a dialect. The experimental study in this approach gave a result of 96.37%, higher than those obtained by the existing studies. We therefore consider the approach successful.
The limitations of this approach are diacritization and the derivation, or root extraction, of Arabic words. We recommend that future research enhance the algorithms by taking diacritization and all possible word lengths into account, so that the correct word meaning can be processed.

VI. CONCLUSION
This study focuses on the Mauritanian dialect, looking at its morphology, structure, and meaning, with the aim of analyzing it using machine learning algorithms. In order to assess the classification of the Mauritanian dialect using ML algorithms, we experimented on a corpus of dialect words, which gave satisfactory results. The study also showed that the results obtained are influenced by the stemming method. Three types of stemmer were tested with the objective of measuring and comparing their effect on the classification of dialect text. This process showed that the stemming method "ArabicStemmerKhoja" is the most efficient with the NBM and RF algorithms in terms of prediction, unlike Logistic, which performs better without a stemmer. These results will guide us toward a deeper study of the language data in order to uncover the sentiments behind comments written in the Mauritanian dialect and to obtain accurate predictions.

ACKNOWLEDGMENT
First of all, I would like to thank God for allowing me to get through all the difficulties. Secondly, I would like to thank the members of the research team of which I am a member for their efforts during this work. Finally, I would also like to give special thanks to my supervisors for their guidance and advice.

Fig. 7. Accuracy of the algorithms in each of the three experiments.

TABLE I.
EXAMPLE OF A HASSANIYA WORD WITH DIFFERENT AFFIXES ATTACHED TO A ROOT WORD

TABLE II.
DATASET

TABLE III.
COMBINED FEATURES

TABLE IV.
EXP(I) CLASSIFICATION RESULTS OF EACH CLASS USING STEM-BASED (LIGHT STEMMER ARABIC)

TABLE V.
EXP(II) CLASSIFICATION RESULTS OF EACH CLASS USING ROOT-BASED (ARABIC STEMMER KHOJA)

TABLE VI.
EXP(III) CLASSIFICATION RESULTS OF EACH CLASS USING NULL STEMMER

TABLE VII.
MODELS PERFORMANCE
As shown in Table VII, the RF and NBM algorithms performed better when using ArabicStemmerKhoja, whereas Logistic performed better with the Null Stemmer. Overall, with the Light Stemmer Arabic feature, the RF algorithm had the highest accuracy compared with the NBM and Logistic algorithms.

TABLE VIII.
COMPARISON OF EXPERIMENTAL RESULTS