NADA: New Arabic Dataset for Text Classification

In recent years, Arabic Natural Language Processing (ANLP), including text summarization, text simplification, text categorization and other natural-language-related disciplines, has been attracting more researchers. Appropriate resources for Arabic Text Categorization (ATC) are becoming a necessity for the development of this research. The few existing corpora are not ready for use: they require preprocessing and filtering operations. In addition, most of them are not organized according to a standard classification scheme, which leads to unbalanced classes and thus reduces classification accuracy. This paper proposes a New Arabic Dataset (NADA) for text categorization. The corpus is composed of two existing corpora, OSAC and DAA. The new corpus is preprocessed and filtered using recent state-of-the-art methods. It is also organized according to the Dewey Decimal Classification scheme and balanced using the Synthetic Minority Over-Sampling Technique. The experimental results show that NADA is an efficient dataset ready for use in Arabic Text Categorization.

Keywords: Data collection; Arabic natural language processing; Arabic text categorization; Dewey decimal classification; synthetic minority over-sampling


I. INTRODUCTION
Data collection consists of gathering information to assess outcomes and validate a research study. Accurate data collection is crucial to preserving the integrity of research. Data collection is required in all research areas, such as mathematics, physics, the humanities, business, computer science and many more.
Arabic Text Categorization is an application of Natural Language Processing that needs a large number of text documents to perform classification. Access to a freely available corpus is therefore desirable. Unfortunately, such corpora are not easily found, or are not designed for Arabic Text Categorization, such as the Al-Dostor newspaper corpus [1]. In other words, the existing corpora ([2], [3] and [4]) need modification before use, for example increasing the number of classes, applying preprocessing techniques and providing the corpus in specific formats to facilitate data integration. In fact, most of the existing Arabic corpora do not follow any technique for organizing the class hierarchy. Such a hierarchy helps identify the needed classes and keep the corpus balanced so that accurate results can be obtained. Moreover, some of the existing Arabic corpora are not suited to classification, because either no classes are defined, as in the 1.5 billion words Arabic Corpus [5], or the existing classes are not well defined ([6], [7] and [8]). Furthermore, most of the available corpora are published as raw data, which requires applying linguistic preprocessing operations such as cleaning, tokenization, normalization and stemming before use.
Consequently, researchers in this field face a fundamental problem in comparing the results of their proposed methods with those of state-of-the-art techniques. This makes the validation step more difficult and time-consuming. A new Arabic corpus that overcomes the above limitations is therefore greatly needed.
In this paper, we present NADA, a New Arabic Dataset built from two existing Arabic corpora and complemented with extra classes and documents. To cover classes from different domains, a standard classification scheme, the Dewey Decimal Classification scheme (DDC) [9], is used to provide the logical hierarchy of classes needed in document classification. In addition, to reach a high classification accuracy, the Synthetic Minority Over-Sampling Technique (SMOTE) [10] is applied to balance the classes. NADA is composed of 10 categories belonging to different domains, including Social science (e.g. economics and law), Religious science (e.g. Islamic religion), Applied science (e.g. health), Pure science (e.g. technology), Literature science, and Arts science (e.g. sport). After the data was assembled and organized, preprocessing and filtering methods were applied to make the data ready for use in ANLP, and particularly in the ATC field.
This paper is organized as follows. Section II introduces the Arabic language. Section III presents the Dewey Decimal Classification scheme. Section IV surveys the existing Arabic corpora. Section V shows the formation of the NADA corpus. Section VI presents the experimental results and, finally, Section VII concludes this work.

II. ARABIC LANGUAGE
Arabic is a complex language with diverse characteristics that distinguish it from other languages. Arabic words use diacritics, placed above or below the letters, in place of short vowels. However, these diacritics are omitted in contemporary writing, and readers are expected to fill them in from their knowledge of the Arabic language [11]. Furthermore, many Arabic letters have a similar structure and are differentiated only by the presence and number of dots. For example, the letters (b: ب, n: ن, t: ت) have the same structure but differ in the location and number of their dots. Moreover, the shape of an Arabic letter depends on its position in the word. Four shapes are found for 22 letters in Arabic, which are (isolated, word-initial, word-medial, and word-final). In Arabic, nouns and adjectives involve genders [12]. Another obvious complex characteristic of the Arabic language is the richness of its vocabulary. For example, the word "darkness" has 52 synonyms, "short" has 164, and "cloud" has 50 [12].
www.ijacsa.thesai.org

III. DEWEY DECIMAL CLASSIFICATION
In order to arrange resources on shelves and facilitate retrieval, the Dewey Decimal Classification scheme (DDC) can be used. It is most widely used in libraries. DDC is a hierarchical numbering system that organizes all resources into ten main categories [9]. Each main category is then divided into ten sub-categories, and so on. In this study, this scheme is used to help build NADA.
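The ten-by-ten hierarchy described above can be illustrated with a short sketch. The class labels below follow the commonly published DDC summaries and are given only for illustration (this paper refers to classes 500 and 600 as Pure science and Applied science); the function names `ddc_path` and `main_class_name` are hypothetical and are not part of the NADA tooling.

```python
from typing import Tuple

# The ten DDC main classes (first level of the hierarchy), keyed by the
# hundreds digit. Labels are illustrative and vary between DDC editions.
DDC_MAIN_CLASSES = {
    0: "Computer science, information and general works",
    1: "Philosophy and psychology",
    2: "Religion",
    3: "Social sciences",
    4: "Language",
    5: "Science",
    6: "Technology",
    7: "Arts and recreation",
    8: "Literature",
    9: "History and geography",
}


def ddc_path(number: int) -> Tuple[str, str, str]:
    """Return the (main class, division, section) codes of a DDC number."""
    if not 0 <= number <= 999:
        raise ValueError("three-digit DDC numbers range from 000 to 999")
    main = number // 100 * 100      # first level: hundreds
    division = number // 10 * 10    # second level: tens
    return f"{main:03d}", f"{division:03d}", f"{number:03d}"


def main_class_name(number: int) -> str:
    """Return the label of the main class that a DDC number belongs to."""
    ddc_path(number)  # reuse the range check
    return DDC_MAIN_CLASSES[number // 100]
```

For example, `ddc_path(346)` walks from main class `300` through division `340` to section `346`, which is the kind of parent-child chain a corpus organized by DDC can use to place a category at the first or second level.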

IV. RELATED WORKS
The first step in text classification studies is data collection. The collected data must be suitable for the classification purpose. Data collection is required for every language in which text classification or other NLP applications are performed. Many corpora are available for English (for example the Newsgroup English benchmark [13], the ACL Anthology Reference polish Corpus (ACLARC) [14], the Reuters 21578 English corpus [15] and Reuters Corpus Volume 1 (RCV1) [16]) as well as for other languages, such as the Chinese Souhu News corpus [17] and a Thai dataset [18].
For Arabic, state-of-the-art studies have presented a number of corpora, such as the Al-Nahar, Al-Jazeera, Al-Hayat and Al-Dostor newspapers [1], the Hadith corpus [4], the Akhbar-Alkhaleej corpus [2], Arabic NEWSWIRE [3], the Quranic Arabic Corpus [4], the Watan-2004 corpus [6], Khaleej-2004 [19], the KACST Arabic corpus [20], the BBC Corpus [7], the CNN Corpus [8], Open Source Arabic Corpora (OSAC) [21] and an Arabic corpus that is composed of the Watan-2004 and Khaleej-2004 corpora. Table 1 summarizes the existing corpora dedicated to ATC research. Even though freely available Arabic corpora are used in Arabic processing projects, most of them are either not suitable for text classification or, where they might be appropriate, still need further filtering, processing and format conversion steps, which can negatively affect classification accuracy.
On the other hand, a few commercial corpora are available, but at a very high cost. The need to develop new free corpora is therefore critical in Arabic Text Categorization.

V. NADA DATASET SETUP
The NADA corpus is collected from two existing corpora: the Diab Dataset (DAA) corpus and the OSAC corpus. The DAA dataset has nine categories, each of which contains 400 documents. Each category has its own directory that includes all files belonging to that category. These files have already been preprocessed and filtered [22]. The documents in each class of the DAA corpus are included in the NADA corpus. The OSAC dataset [21], on the other hand, has six classes, each containing between 500 and 3000 raw documents. Each category likewise has its own directory that includes all of its files.
The OSAC dataset is raw data that requires preprocessing. Each text file is therefore preprocessed as follows: 1) digits, numbers, hyphens, punctuation marks and all non-Arabic characters are removed; 2) some letters are normalized to unify their written forms; 3) Arabic stop words such as pronouns, articles and prepositions are removed; 4) light stemming is applied to the dataset to remove prefixes and suffixes from each word. However, the Chen stemmer and the Khoja algorithm for extracting roots are not employed, because root extraction is usually not valuable for Arabic text classification tasks, due to the conflation of various words into the same root form [12]. Furthermore, to reduce the dimensionality of the dataset, the recently proposed Firefly-based feature selection [23] is used. The Firefly Algorithm is a well-known Artificial Intelligence technique, applied here to select the relevant words from a given document. It is applied to each document to reduce its size. The processed and filtered documents are included in the NADA dataset.
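The four preprocessing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used to build NADA: the normalization table, the stop-word list, the affix lists and the minimum stem length are all illustrative choices, since the text does not enumerate them. Diacritics and the tatweel character are deleted before tokenization so that vocalized words are not split, which is an additional assumption. The Firefly-based feature selection [23] is not covered.

```python
import re
from typing import List

# Assumed extra step: delete diacritics (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Step 1: every run of characters that are not basic Arabic letters
# (digits, punctuation, Latin text, ...) becomes a token separator.
NON_ARABIC = re.compile(r"[^\u0621-\u063A\u0641-\u064A]+")

# Step 2 (assumed table): unify common spelling variants.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def normalize(word: str) -> str:
    return word.translate(NORMALIZE)


# Step 3 (tiny illustrative list, stored in normalized form).
STOP_WORDS = {normalize(w) for w in ("في", "من", "إلى", "على", "عن", "هو", "هي", "هذا")}

# Step 4 (assumed affix lists; longer prefixes are tried first).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")
MIN_STEM = 3  # assumed: never shorten a word below three letters


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix; no root extraction."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text: str) -> List[str]:
    text = DIACRITICS.sub("", text)
    tokens = NON_ARABIC.sub(" ", text).split()            # step 1
    tokens = [normalize(t) for t in tokens]               # step 2
    tokens = [t for t in tokens if t not in STOP_WORDS]   # step 3
    return [light_stem(t) for t in tokens]                # step 4
```

Because only affixes are stripped, distinct words that share a root keep distinct stems, which is the property the text gives as the reason for preferring light stemming over root extraction.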
In this study, the DAA and OSAC datasets are each partitioned into two parts to build the training and testing data for classification. With this step, the NADA corpus is constructed and becomes available for use. Its construction is based on the DDC scheme so that its classes are well organized. Figure 1 displays the hierarchy of the NADA corpus; only the green classes and subclasses are considered in NADA. Furthermore, the SMOTE technique is used to balance the classes and thereby increase classification performance [10]. The data collection is summarized in Tables 2 and 3. The corpus is provided in the following forms:
• ARFF file: an ASCII file that contains a group of instances with a set of attributes. The instances are the texts contained in the text files; each instance represents one text file. This file format is needed to analyze and process the corpus using the WEKA tool [5].
• Text files: each file contains Arabic text belonging to a specific category. These text files are classified into 7 categories, as shown in Table 3.
• Sampled file: to avoid the impact of class imbalance on the classification results of the collected dataset, SMOTE [10] is used to balance the dataset classes. The impact of SMOTE is shown in Figure 2.
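The idea behind SMOTE can be sketched in a few lines: a synthetic minority instance is placed at a random point on the segment between a minority instance and one of its nearest minority neighbours. The sketch below is a simplified variant that draws the base instance at random rather than iterating over every minority instance, and it is not the implementation used to build the sampled file; the function name `smote` and its parameters `n_new`, `k` and `seed` are assumptions made for illustration.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def _squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Vector], n_new: int, k: int = 5,
          seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic vectors from the minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base instance (itself excluded)
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: _squared_distance(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

A minority class with 500 document vectors could, for instance, be brought level with a 3000-document class by calling `smote(vectors, n_new=2500)`. Because every synthetic vector lies between two real minority vectors, the class grows without duplicating documents exactly.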

VI. EXPERIMENTAL RESULTS
After the CSV file is generated, it is converted into a sparse ARFF file using the TextDirectoryToArff converter and the StringToWordVector filter in WEKA (version 3-7-13). To measure the performance of classifying NADA, recall, precision and F1 measures are calculated and averaged using an SVM classifier.
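The three evaluation measures can be computed per class and then averaged, as in the sketch below. The text does not say which averaging is used; the sketch assumes a support-weighted average, which is the form WEKA prints in its "Weighted Avg." summary row. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F1)


def per_class_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Scores]:
    """Precision, recall and F1 for every class seen in the labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # actual class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def weighted_average(scores: Dict[str, Scores], y_true: Sequence[str]) -> Scores:
    """Average each measure, weighting every class by its number of instances."""
    support = Counter(y_true)
    n = len(y_true)
    precision, recall, f1 = (
        sum(scores[label][i] * support[label] for label in scores) / n
        for i in range(3)
    )
    return precision, recall, f1
```

Weighting by support means that larger classes dominate the average, so the choice of averaging matters less once the classes have been balanced.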
To run the experiment, training and testing data are required. The entire dataset is therefore gathered in one ARFF file. The data is then divided into two partitions using the percentage split method: the first partition, with 60% of the dataset, is the training data, and the second, with 40% of the dataset, is the testing data.
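A percentage split of this kind can be sketched as below. Shuffling before the cut, the seed value and the rounding of the cut point are assumptions made for illustration; the text does not state how WEKA was configured beyond the 60/40 ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def percentage_split(instances: Sequence[T], train_percent: float = 60.0,
                     seed: int = 1) -> Tuple[List[T], List[T]]:
    """Shuffle the instances, then cut them into training and testing parts."""
    if not 0.0 < train_percent < 100.0:
        raise ValueError("train_percent must lie strictly between 0 and 100")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # assumed: randomize the order first
    cut = round(len(shuffled) * train_percent / 100.0)
    return shuffled[:cut], shuffled[cut:]
```

Under these assumptions, a dataset of 13066 instances yields 7840 training instances and 5226 testing instances.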
According to the results in Table 6, the classification accuracy of NADA is 93.8792%, whereas the classification accuracy of OSAC is 98.1758% (Table 4). The lower accuracy of NADA is due to the low accuracy of DAA, which is 80.9087% (Table 5). This can be explained by the fact that DAA is not well preprocessed and/or filtered, which negatively affected the classification result.
For the running time, Tables 4, 5 and 6 show the time taken to classify each dataset. The time required to classify NADA is 1467.62 seconds, which is about 24 minutes and 28 seconds. This is higher than the time needed to classify the OSAC and DAA datasets, because the number of instances in the NADA dataset (13066) is higher than that of the OSAC and DAA datasets (3710 and 3600 instances, respectively).
To conclude, NADA is a well-organized dataset that is ready for use in ATC and can be considered a benchmark in this field of research.

VII. CONCLUSION
This research study was performed to meet the pressing need for Arabic corpora and to overcome the difficulties that ANLP researchers, especially in the ATC field, face in finding an appropriate corpus. NADA is a New Arabic Dataset built from two existing Arabic corpora, the OSAC and DAA datasets. The corpus follows a standard classification scheme (DDC) to provide a logical hierarchical presentation of classes. The NADA corpus is composed of 10 categories, which cover 5 classes from the first level of DDC and some classes from the second level. To increase classification performance, the SMOTE technique is applied to balance all the classes. The dataset was passed through preprocessing and filtering steps to reduce the effort researchers spend rebuilding Arabic corpora. NADA is tested and validated using an SVM classifier and three evaluation measures. The experimental results show that NADA is an efficient dataset for ATC. The corpus can be extended by adding new classes and documents to broaden its use, especially in Big Data and Deep Learning.

Fig. 1. NADA corpus based on the DDC hierarchy.

Table 2 shows the categories and the number of documents of the OSAC and DAA datasets, and Table 3 displays the content of the new corpus.

TABLE II. OSAC AND DAA ARABIC DATASETS

TABLE III