The Multi-Class Classification for the First Six Surats of the Holy Quran

Abstract: The Holy Quran is one of the holy books, revealed to the prophet Muhammad in the form of separate verses. These verses were written on tree leaves, stones, and bones during his life; as such, they were not arranged or grouped into one book until later. There is no intelligent system that can distinguish the verses of the Quran chapters automatically. Accordingly, in this study we propose a model, based on machine learning techniques, that recognizes and categorizes Quran verses automatically and extracts their essential features by classifying verses from the first six surats of the Holy Quran into their chapters. Classifying Quran verses into chapters with machine learning classifiers is considered an intelligent task. Classification algorithms such as Naïve Bayes, SVM, KNN, and the J48 decision tree help to classify texts into categories or classes. The aim of this research is to apply machine learning algorithms to the text classification of the Holy Quran verses. Because the Quran consists of 114 chapters, we restrict this work to the first six. In this paper, we build a multi-class classification model for the chapter names of the Quranic verses using the Support Vector Classifier (SVC) and GaussianNB. The results show that the best overall accuracy is 80% for the SVC and 60% for the Gaussian Naïve Bayes.


I. INTRODUCTION
Text classification of the Holy Quran is a research topic that deserves attention in the context of machine learning algorithms.
The Holy Quran is a book that was sent down from the heavens into the heart of the prophet Muhammad to be delivered to all human beings, not only Muslims. The sacred words were revealed by Allah and written into a meaningful textual format that could be analysed and classified using machine learning classification algorithms.
It is considered a comprehensive book covering every component of life and accessible to all people. It addresses the heart and mind as one.
The texts of the Holy Quran are fertile ground for natural language processing and text classification. Their uniqueness and meanings yield distinguishing features. The Holy Quran is the first source of legislation in Islam. It is therefore necessary to apply data-mining techniques to classify the verses into chapters (surats) intelligently, based on machine learning techniques.
Furthermore, annotation of the verses of the Holy Quran's surats depends not only on the text itself but also on the ordering of the surats. Therefore, this study builds a model to classify and differentiate Quranic verses, according to their surats.
We have previously studied the architecture of the Arabic Language Sentiment Analysis (ALSA) [1]. Here we extend the concept of text classification and apply it to the Holy Quran's verses. The total number of verses in the Holy Quran is about 6000. Multi-class classification means that we need an automated model that assigns each text to one of several classes. For this reason, this paper looks at the first six chapters of the Holy Quran; their approximately 1000 verses contain a total of 8000 features for the training and testing data. This paper is organized as follows: Section II presents related work on multi-class text classification of the Holy Quran. The experimental method and analysis are covered in Section III. Section IV presents the results, and Section V gives the conclusions and anticipated future work.

II. RELATED WORK
The study detailed in [2] proposed an automation model that could classify Al-hadeeth features into Sahih, Hasan, Da'if, and Maudu, using machine learning techniques (LinearSVC, SGDClassifier, and LogisticRegression).
The author of [3] built a machine-learning model using KNN, SVM, and Naïve Bayes classifiers to annotate the Quranic verses with labels. The accuracy of these text-classification algorithms exceeded 70% for the multi-label annotation of the Quranic verses.
www.ijacsa.thesai.org
[...] commentary on the verses and the English translation. In addition, they proposed the IG-CFS technique to label Quranic verses of surats al-Baqara and al-Anaam [9].

III. EXPERIMENT AND ANALYSIS
The proposed model consists of four phases: 1) data collection, 2) text feature engineering, 3) Term Frequency-Inverse Document Frequency (TF-IDF) feature representation, and 4) classification with the GaussianNB and SVC classifiers. The framework architecture of the multi-class Quran classification model is shown in Fig. 1.
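As a minimal sketch of the TF-IDF representation in phase 3, the following computes term weights with the textbook form tf × ln(N / df). This variant, the whitespace tokenization, the placeholder documents, and the function name `tfidf` are assumptions for illustration; library implementations typically add smoothing and normalization, and the exact variant used in this study is not specified.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder documents standing in for cleaned verses.
docs = ["a b a", "a c"]
vectors = tfidf(docs)
```

A term that occurs in every document (here `a`) receives weight zero, which is why TF-IDF emphasizes the words that distinguish one verse from the others.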

A. Data Pre-processing and Cleaning
Before machine-learning modelling, we applied text preprocessing and cleaning techniques to extract features according to the following steps: 1) remove the Arabic Tashkeel symbols (e.g., ً); and 2) remove consecutive Tatweel ('ـــ') within Arabic characters.
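The two cleaning steps can be sketched with regular expressions as follows. The code-point range U+064B to U+0652 (FATHATAN through SUKUN) is an assumption about which Tashkeel marks are removed; Quran-specific annotation marks outside this range are not handled. The function name `clean_verse` and the final whitespace collapsing are also illustrative additions.

```python
import re

# Assumption: Tashkeel = the basic Arabic diacritics U+064B..U+0652.
TASHKEEL = re.compile("[\u064B-\u0652]")
# Tatweel (U+0640) is a typographic elongation character; match runs of it.
TATWEEL = re.compile("\u0640+")


def clean_verse(text: str) -> str:
    """Strip Tashkeel and Tatweel, then collapse repeated whitespace."""
    text = TASHKEEL.sub("", text)
    text = TATWEEL.sub("", text)
    return " ".join(text.split())


# Example: a diacritized word and a word stretched with Tatweel.
print(clean_verse("\u0628\u0650\u0633\u0652\u0645\u0650"))
print(clean_verse("\u0627\u0644\u0644\u0640\u0640\u0640\u0647"))
```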

B. Corpus
The corpus consists of 954 verses collected from the first six surats of the Holy Quran. Table I shows descriptive statistics summarizing the central tendency, dispersion, and shape of the corpus distribution. Table II presents a sample extracted from the corpus: the first column lists the six class labels ["Fatiha", "Albaqrah", "AlEimran", "Alnisaa", "Almayida", "Alaneam"], the second column gives the number of verses, and the third and fourth columns show a selected verse and its translation.

C. Exploratory Data Analysis
The goal of the Exploratory Data Analysis (EDA) is to summarize the information contained in the corpus data. Fig. 2 shows the number of verses per corpus class.

D. Feature Engineering and Selection
Text feature selection and engineering is the process of choosing the essential features needed to represent the text for the machine-learning classifiers. Fig. 3-8 show word clouds for each surat in the Holy Quran corpus.
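A word cloud is a visualization of per-class term frequencies, so the underlying counts can be sketched as below. The rows are placeholder data, not verses from the corpus, and the function name `top_terms_per_class` and whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_terms_per_class(rows, k=3):
    """Return the k most frequent tokens for each class label."""
    freq = defaultdict(Counter)
    for label, text in rows:
        freq[label].update(text.split())  # assumption: whitespace tokens
    return {label: counter.most_common(k) for label, counter in freq.items()}


# Placeholder (label, cleaned text) pairs.
rows = [("A", "x y x"), ("A", "x z"), ("B", "y y")]
top = top_terms_per_class(rows, k=1)
print(top)
```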

IV. RESULTS
We calculated Accuracy, Recall and F1-value according to the following mathematical equations:
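For reference, these metrics are conventionally defined per class from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The forms below are the standard textbook definitions and are assumed, not confirmed, to match those used in this study; precision is included because the F1-value depends on it.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```

In a multi-class setting, each surat is treated in turn as the positive class and the per-class values are then averaged.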

A. Machine-Learning Classifiers
The Support Vector Classifier (SVC) is an implementation of the Support Vector Machine (SVM) [5] for solving multi-class classification problems. GaussianNB performs accurate feature-vector classification for multi-class text problems [10]. We tested the proposed model against the performance metrics. The results are shown in Table III.
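A minimal sketch of training both classifiers on TF-IDF features with scikit-learn is given below. The toy corpus, the labels, and the linear kernel are assumptions for illustration only; the hyperparameters and data split used in this study are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Placeholder corpus: tokens and labels are illustrative, not Quranic data.
train_texts = ["alpha beta", "alpha gamma", "delta epsilon", "delta zeta"]
train_labels = ["A", "A", "B", "B"]
test_texts = ["alpha beta gamma", "delta zeta epsilon"]

# Fit the vocabulary and IDF weights on the training texts only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# SVC accepts sparse matrices; the linear kernel is an assumption.
svc = SVC(kernel="linear").fit(X_train, train_labels)
# GaussianNB requires dense arrays, so the sparse matrix is converted.
gnb = GaussianNB().fit(X_train.toarray(), train_labels)

svc_pred = svc.predict(X_test)
gnb_pred = gnb.predict(X_test.toarray())
print(list(svc_pred), list(gnb_pred))
```

The dense conversion for GaussianNB is affordable at the scale described here (about 1000 verses and 8000 features) but would become costly for a much larger corpus.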

B. Evaluation Metrics
Classification algorithms need performance metrics to measure model accuracy and losses. Fig. 9 shows most of the performance metrics we used to evaluate the proposed multi-class Quranic model. The performance metrics are: 1) cohen_kappa; 2) log_loss; 3) zero_one_loss; 4) hamming_loss; and 5) matthews_corrcoef.
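To make two of these metrics concrete, the sketch below computes the zero-one loss and Cohen's kappa from scratch. The label lists are placeholders rather than results from this study, and the function names `zero_one` and `kappa` are illustrative; in practice a library such as scikit-learn would normally be used instead.

```python
from collections import Counter


def zero_one(y_true, y_pred):
    """Fraction of samples assigned to the wrong class."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(y_true)
    observed = 1.0 - zero_one(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Placeholder labels: one of six predictions is wrong.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]
print(zero_one(y_true, y_pred), kappa(y_true, y_pred))
```

For single-label multi-class problems such as this one, the Hamming loss coincides with the normalized zero-one loss, since each verse carries exactly one label.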
The proposed model is evaluated with two classifiers, SVC [7] and GaussianNB, as shown in Table V, Table VI, Fig. 10, and Fig. 11. The performance of the proposed model is measured in terms of accuracy, precision, recall, f-measure, AUC, and ROC curves. The SVC classifier had the highest AUC value, 0.97, while GaussianNB had an AUC value of 0.82 (see Fig. 12 and Fig. 13). Finally, the SVC [3] and GaussianNB classifiers were applied to each verse of each surat, and the results were measured in terms of the area under the curve (AUC) (see Fig. 14 and Fig. 15) [8]. The experimental results show that the proposed model has a significant impact on multi-class Holy Quran verse classification (see Fig. 16-19).

V. CONCLUSIONS
Classifying verses of the Holy Quran into chapters is a multi-class classification problem. In this paper, the Holy Quran corpus was used to train GaussianNB and SVC classifiers to predict the classification of the Quran verses into six surats. Increasing the size of the corpus and improving feature classification may improve the quality and accuracy of the framework. The experiment shows that the SVC provides the best results, with an average f1-score of 88%. Future work will build a larger corpus of verses from the Holy Quran chapters.