Different Classification Algorithms Based on Arabic Text Classification: Feature Selection Comparative Study

Abstract: Feature selection is necessary for effective text classification, and dataset preprocessing is essential for obtaining sound results and good performance. This paper investigates the effectiveness of feature selection by comparing the performance of different classifiers in two situations: feature selection with stemming and feature selection without stemming. The evaluation used a BBC Arabic dataset and four classification algorithms: decision tree (D.T), K-nearest neighbors (KNN), the Naïve Bayesian (NB) method, and the Naïve Bayes Multinomial (NBM) classifier. The experimental results are presented in terms of precision, recall, F-measure, accuracy, and time to build the model.


I. INTRODUCTION
The amount of Arabic information available on the internet is very large and is increasing rapidly. This growth motivates researchers to find effective mechanisms and tools for managing, filtering, processing, and classifying large Arabic information resources. Text classification (TC) is the task of assigning the documents of a dataset to different classes; it is also called document classification, text categorization, or document categorization.
TC is also used in research areas such as information retrieval (IR), data mining, and natural language processing. Its applications include document indexing, document organization, text filtering, word sense disambiguation, speech recognition, and hierarchical categorization of web text.
TC can be performed as binary classification, using methods such as K-nearest neighbors (KNN), the Naïve Bayesian method, and SVM, or as multi-class classification, using methods such as boosting and multi-class SVM.
A TC task divides the dataset into two parts, a training set and a testing set. The classifier algorithm learns from the training set to build a TC model, and the TC system then uses that model to classify the testing set into different classes. To achieve effective performance, we used feature selection methods.
To improve performance further, we applied several preprocessing steps to the dataset, which are described later in this paper. Section II reviews related work, Section III states our objectives, Section IV presents the experimental results, and Section V gives the conclusion and future work, followed by the references.

II. RELATED WORK
In [1], the authors evaluated a text classification system based on Support Vector Machines (SVMs) on Arabic text. They used the CHI square method for feature selection and applied preprocessing steps to obtain a better evaluation. The proposed system gave good results.
Classifying any text requires determining a set of features that yields the best classification. A comparison of six feature selection methods (CHI, NGL, GSS, IG, OR, and MI) for extracting and choosing good features from Arabic documents was carried out with an SVM classifier. The authors in [2] used an in-house corpus collected from online Arabic newspaper archives, including Al-Jazeera, Al-Nahar, Al-Hayat, Al-Ahram, and Al-Dostor. The corpus consists of 1445 documents in nine categories. The authors preprocessed the dataset by removing digits and punctuation marks, filtering out all non-Arabic text, removing Arabic function words (stop words), and applying other steps. The results in [2] show that CHI, NGL, and GSS were the most effective with SVMs for Arabic TC tasks, whereas OR and MI performed poorly.
In [3], the authors listed three contributions: (i) showing successful classification of Arabic documents, (ii) making their database available to other researchers, and (iii) identifying the better performance between Binary PSO and K-nearest neighbor using feature selection methods. They presented BPSO-KNN as a feature selection method and applied it to three Arabic text datasets, using three classification algorithms: SVM, Naïve Bayes, and C4.5 decision tree learning.
In [4], the authors used the Chi-Square method as a preprocessing step applied to the dataset before classification. They compared the proposed method with other feature selection methods, and the results show that the proposed method performed better than the others.

III. OUR OBJECTIVES
Our objective is to compare the performance of different classification algorithms (decision tree, K-nearest neighbors (KNN), the Naïve Bayesian method, and the Naïve Bayes multinomial classifier) in two situations: feature selection with a light stemmer (Khoja stemmer), and feature selection with full words.

A. TC Process
A text classification system is usually separated into three main phases: a data preprocessing and feature selection phase, which makes the dataset more suitable for training the text classifier; a text classifier phase, which classifies the dataset into different classes; and an evaluation phase, which shows the performance of the classification algorithm used.

B. Arabic Dataset Preprocessing
Many Arabic datasets are available on the internet. We used the BBC Arabic dataset, which contains 4763 documents belonging to seven categories: Middle East News (2356), World News (1489), Economy and Business (296), Sport (219), World Press (49), Science and Technology (232), and Arts & Culture (122). The dataset contains 1,860,786 words and 106,733 keywords. The dataset is processed according to the following steps:
1) Remove digits, dashes, punctuation marks, and any other marks.
4) Use feature selection methods with a stemmer and with full words.
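
Step 1 of the preprocessing can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used in the experiments: the function name `strip_marks` is hypothetical, and treating every Unicode number, punctuation, and symbol character as a mark to remove is an assumption about what "any other mark" covers.

```python
import unicodedata


def strip_marks(text):
    """Replace digits, dashes, punctuation, and symbols with spaces.

    Assumption: a "mark" is any character whose Unicode general category
    starts with N (numbers), P (punctuation, including dashes), or
    S (symbols). Letters and Arabic diacritics are left untouched.
    """
    kept = []
    for ch in text:
        if unicodedata.category(ch)[0] in ("N", "P", "S"):
            kept.append(" ")  # keep a word boundary where the mark was
        else:
            kept.append(ch)
    # Collapse the runs of whitespace left behind by the removed marks.
    return " ".join("".join(kept).split())
```

Replacing each mark with a space rather than deleting it keeps words that were joined only by a dash or comma from being fused into a single token.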

C. Feature Selection Methods
Feature selection (FS) is the task of choosing a subset of features from the original feature set, and it is widely used in TC. FS consists of the following steps:
1) Feature generation: a candidate subset of features is generated using a search process.
2) Feature evaluation: evaluation metrics are used to measure the goodness of the selected features.
3) Feature validation: a validation procedure is used to check whether the selected features are valid.
In this paper we used two feature selection methods, Information Gain (IG) and the χ² statistic (CHI), as shown in Table I.
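
As a rough illustration of the two scoring functions, the sketch below computes CHI and IG for a single term from document counts. It uses the standard textbook definitions, which are assumed here to correspond to the methods named in Table I; the function names and count-based arguments are hypothetical and are not taken from the paper or from Weka.

```python
import math


def chi_square(a, b, c, d):
    """CHI score of a term for one class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    Formula: N * (a*d - c*b)^2 / ((a+c) * (b+d) * (a+b) * (c+d))
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so there is no evidence
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(with_term, without_term):
    """IG of a term: H(C) - [P(t) * H(C|t) + P(not t) * H(C|not t)].

    with_term[i]:    documents of class i that contain the term
    without_term[i]: documents of class i that lack the term
    """
    n_with, n_without = sum(with_term), sum(without_term)
    n = n_with + n_without
    if n == 0:
        return 0.0
    class_totals = [w + wo for w, wo in zip(with_term, without_term)]
    conditional = (n_with / n) * entropy(with_term) \
        + (n_without / n) * entropy(without_term)
    return entropy(class_totals) - conditional
```

With either score, feature selection amounts to ranking all terms and keeping the highest-scoring ones; a term spread evenly across classes scores zero under both measures.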

D. Text Classifier
In this paper we used four classifiers: decision tree, K-nearest neighbors (KNN), the Naïve Bayesian method, and Naïve Bayes multinomial. We compared the performance of these classifiers in terms of categorization effectiveness. We divided the dataset into two parts, one for training and the other for testing.
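
To make the train-then-classify flow concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier in plain Python. It is not the Weka implementation used in the experiments; the function names, the add-one (Laplace) smoothing, and the toy English tokens in the usage example are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nbm(docs, labels):
    """Build a multinomial Naive Bayes model.

    docs:   list of token lists (one list per training document)
    labels: list of class names, aligned with docs
    """
    class_docs = Counter(labels)        # documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_docs": class_docs, "term_counts": term_counts,
            "vocab": vocab, "n_docs": len(docs)}


def predict_nbm(model, tokens):
    """Return the class with the highest log posterior for a token list."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_c in model["class_docs"].items():
        counts = model["term_counts"][label]
        total = sum(counts.values())
        score = math.log(n_c / model["n_docs"])  # log prior
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # ignore terms never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((counts[tok] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Usage with hypothetical toy data: train on one part, classify the other.
train_docs = [["goal", "match"], ["match", "team"],
              ["bank", "market"], ["market", "price"]]
train_labels = ["sport", "sport", "economy", "economy"]
model = train_nbm(train_docs, train_labels)
predicted = predict_nbm(model, ["match", "goal"])
```

In a feature selection setting, the token lists would first be filtered down to the selected terms, which shrinks the vocabulary the classifier has to model.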

E. TC Evaluation Measure
We evaluated the performance of the classifiers (decision tree, K-nearest neighbors (KNN), the Naïve Bayesian method, and Naïve Bayes multinomial) in terms of precision, recall, accuracy, F-measure, and time to build the model, as shown in equations 1, 2, and 3.
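
The measures can be computed per class from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below uses the standard definitions and the balanced F-measure; it is assumed, not confirmed by this text, that these match the paper's numbered equations, and the function name is hypothetical.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute precision, recall, F-measure, and accuracy for one class.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    F-measure = 2 * precision * recall / (precision + recall)
    accuracy  = (TP + TN) / (TP + FP + FN + TN)
    Ratios with an empty denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}
```

For example, a class with 3 true positives, 1 false positive, 3 false negatives, and 3 true negatives has precision 0.75 and recall 0.5.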

IV. TC EXPERIMENTAL RESULTS
We used two feature selection methods (CHI and IG) and four classifiers (decision tree, K-nearest neighbors (KNN), the Naïve Bayesian method, and the Naïve Bayes multinomial classifier), all run with the Weka tool, version 3.7. The results are shown in Table II to Table X.

V. CONCLUSION
We investigated the performance of two FS methods with four classifiers (decision tree, K-nearest neighbors (KNN), the Naïve Bayesian method, and the Naïve Bayes multinomial classifier) using an Arabic dataset. The accuracy of decision tree, the Naïve Bayesian method, and Naïve Bayes multinomial was better than that of K-nearest neighbors (KNN) in all cases. In future work we will use more feature selection methods with different classification algorithms.

TABLE I. FS METHODS