Fine-Grained Quran Dataset

Extracting knowledge from text documents has become one of the main topics in Natural Language Processing (NLP) in the era of information explosion. Arabic NLP is considered immature for several reasons, including the scarcity of available resources. At the same time, automatically extracting reliable knowledge from specialized sources such as holy books is a highly challenging task, but one of great benefit. In this context, this paper provides a comprehensive Quranic dataset as the first part (foundation) of ongoing research that attempts to lay the groundwork for approaches and applications that explore the holy Quran. The paper presents the algorithms and approaches designed to extract aggregated data from massive Arabic text sources, including the holy Quran and closely associated books. The text of the holy Quran is transformed into structured multi-dimensional data records, starting from the chapter level, through the word level, down to the character level. All of these are linked with interpretations and meanings, parsing, translations, intonation, and the roots and stems of words, all from authentic and reliable sources. The final dataset is represented as Excel sheets and database records. The paper also presents models of the dataset at all levels. The Quranic dataset was designed to suit database, data mining, text mining and Artificial Intelligence applications; it is also designed to serve as a comprehensive encyclopedia of the holy Quran and the Quranic science books.

Keywords: Arabic Language; Holy Quran; Quranic Dataset; Text Mining; NLP


I. INTRODUCTION
In recent years, a large number of language datasets and corpora have been developed, and their number has grown with the spread of cloud computing applications and data linking. Datasets in various forms are now available on the web. However, Arabic-language datasets have not received as much attention as datasets in other languages such as English.

A. Objective of the Study
This study aims to build a group of datasets for the holy Quran, its interpretations, its meanings and the related science books. This group of books is considered one of the largest in Arabic literature. The holy Quran is composed of 114 chapters and 6,236 verses with 77,477 words [1]. The related books address every verse and word through interpretation, parsing and clarification, supported by the reasons of revelation and the sayings (Hadith) of Prophet Mohammad, Peace Be Upon Him (PBUH). The result is a massive amount of text that is hard to process separately as unstructured text, and compiling it into one group requires significant effort and special processing. Therefore, this study focuses on developing a model (algorithms and methodologies) for building a homogeneous dataset that accommodates all of this content. The goal is a comprehensive set of structured data for the holy Quran and its science books, to be used as an encyclopedia of the holy Quran and as infrastructure for technical applications that carry out research on this vast amount of data.

B. Problem
The major challenge of this study is how to convert this amount of unstructured text into groups of datasets that are comprehensive and accurate, and that preserve the sanctity, characteristics and relationships of the data. A second question is whether these groups of datasets can suit all types of applications, which require different characteristics and techniques.

C. Previous Studies
Few research studies are concerned with producing datasets related to the holy Quran, and those that exist are not comprehensive [2], [3], [4], [5]. For example, [6] built a semantic Quranic dataset that includes the Quran in several languages. The authors established the dataset by integrating data from different sources in Arabic, Amharic and Amazigh. Their goal was a dataset appropriate for Natural Language Processing (NLP) applications, and their study focused on developing an ontology to organize the data. In [7], the author compiled a dataset of nearly 8,000 Quranic verses related to each other in the form of pairs. It included explanations of the Quran and anaphora information for the purpose of conducting a number of text mining tasks. In [8], the authors built a standard dataset for the interpretation of the holy Quran.
The rest of the paper is organized as follows: Section II presents the model of the study, Section III illustrates the building of the dataset, Section IV details the testing of the dataset, and Section V gives the conclusion and future directions.

II. THE PROPOSED MODEL
The proposed model works according to the following algorithm (see figure 1):
• Prepare the sources and ensure their credibility and authenticity.
• Convert the sources into text files (first phase of data processing).
• Design the databases.
• Process the data records.
• Segment the text and save it in data tables (second phase of data processing).

A. Sources processing
The dataset is designed to hold all possible Quranic texts. At this stage we provide the most readily available and most needed sources, with the capability to accept new and updated sources later. The sources used at this stage are:
1) The text of the holy Quran in the Hafs/Nafi narration and Uthmani script, from http://tanzil.net/download/.

4) A French translation of the holy Quran text [14].
5) Three English translations of the holy Quran text:
• Yusuf Ali [15].
• Mohammad Habib Shakir [16].
• Mohammed Marmaduke William Pickthall [18].
6) One book of Quran word meanings in English.

B. Convert sources format to text files (the first stage of data processing)
At this stage, the Quran and its science books are processed and prepared as text files that suit the software which later processes them and converts them into data records. One text file was extracted that represents the Quran in the Hafs/Nafi narration and Uthmani script, downloaded from the source cited in footnote i. Figure 2 illustrates a sample of the text file of the holy Quran. All interpretations and other books were also modified and preprocessed manually to put them in an appropriate text format.

C. Design data records
Datasets are provided in two formats, spreadsheets and database records. They are designed to capture all properties as fields in a multi-dimensional approach, based on the structures of the holy Quran (character and word):
• Records at the character level: each record contains an index of the character that gives the chapter, the order of the word in the verse and the order of the character in the word, followed by the characteristics of the character in intonation, its diacritic (superimposition, whispered characteristic), pronunciation rules, etc.

The records appear at each level in a multi-dimensional manner, as shown in figure 5. Because the dataset is based on the Quran text, and the Quran text is structured by verse, word and character, our data schema is based on these components (verse, word, character). The components of the other books are designed to be properties of the verse, word and/or character. Figure 6 shows the entity relationship (ER) diagram for the dataset schema.
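As an illustration of a verse/word/character schema of this kind, the following is a minimal sketch using SQLite from Python. The table names, column names and the choice of SQLite are assumptions made for the example; the paper's actual schema is the one shown in its ER diagram and may differ.

```python
import sqlite3

# Hypothetical three-level schema: every name below is illustrative only.
SCHEMA = """
CREATE TABLE verses (
    chapter_no     INTEGER NOT NULL,
    verse_no       INTEGER NOT NULL,
    verse_text     TEXT    NOT NULL,
    interpretation TEXT,
    translation_en TEXT,
    translation_fr TEXT,
    PRIMARY KEY (chapter_no, verse_no)
);
CREATE TABLE words (
    chapter_no INTEGER NOT NULL,
    verse_no   INTEGER NOT NULL,
    word_no    INTEGER NOT NULL,   -- order of the word in the verse
    word       TEXT    NOT NULL,
    root       TEXT,
    stem       TEXT,
    meaning_en TEXT,
    PRIMARY KEY (chapter_no, verse_no, word_no),
    FOREIGN KEY (chapter_no, verse_no)
        REFERENCES verses (chapter_no, verse_no)
);
CREATE TABLE characters (
    chapter_no INTEGER NOT NULL,
    verse_no   INTEGER NOT NULL,
    word_no    INTEGER NOT NULL,
    char_no    INTEGER NOT NULL,   -- order of the character in the word
    letter     TEXT    NOT NULL,
    diacritic  TEXT,
    PRIMARY KEY (chapter_no, verse_no, word_no, char_no),
    FOREIGN KEY (chapter_no, verse_no, word_no)
        REFERENCES words (chapter_no, verse_no, word_no)
);
"""


def create_schema(conn):
    """Create the three illustrative tables on an open SQLite connection."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

Keying each lower level by the index of the level above it mirrors the character index described in this subsection (chapter, word order in the verse, character order in the word), so properties taken from other books can be attached as extra columns at the matching level.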

D. Segment Data into data tables records (the second stage of data processing)
In this phase the text documents are segmented and saved as records according to the designed database tables. This phase is composed of three stages:

Fig. 6. Entity relationship (ER) diagram of the proposed dataset schema.

• Stage one: Identification of the templates used to divide the text: the verse template with the corresponding interpretation and translation sentences; the word template with the corresponding meanings, parsing and intonation rule; and the character template with the corresponding intonation rule and diacritic. These templates are the basis on which the text of the holy Quran and its sciences is divided.
• Stage two: Building the algorithms that read the text and break it down into:
o Verses, relying on the punctuation inside the Quranic texts (the Quran and its sciences), that is, the stop marks that help the segmentation process and are not considered an original part of the Quranic text, such as spaces, commas, semicolons and so on. The related text from the Quran science books is then inserted accordingly. Figure 7 shows a sample of the algorithms used in this stage.
o Words, with the corresponding intonation rule, meanings and translation. Figure 8 illustrates the algorithm used to segment the words of the Quran.
o Characters, with their corresponding TAJWEED properties (type, shape, diacritic, weight and phoneme level).
• Stage three: Storing the text fragments in the appropriate tables according to the designed relational database. By the end of this stage we obtain clean datasets.
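The verse, word and character breakdown described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm of figures 7 and 8: it assumes each input line has the hypothetical form `chapter|verse|text`, that words are separated by whitespace, and that diacritics are Unicode combining marks. The function names and record fields are invented for the example.

```python
import unicodedata


def split_characters(word):
    """Group each base letter with the combining marks (diacritics) after it."""
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and units:
            units[-1]["diacritics"] += ch
        else:
            # A leading mark with no base letter is kept as its own unit.
            units.append({"letter": ch, "diacritics": ""})
    return units


def segment_verse(line):
    """Split one 'chapter|verse|text' line into verse, word and character records."""
    chapter, verse, text = line.strip().split("|", 2)
    key = {"chapter": int(chapter), "verse": int(verse)}
    verse_record = {**key, "text": text}
    word_records, char_records = [], []
    for word_no, word in enumerate(text.split(), start=1):
        word_records.append({**key, "word_no": word_no, "word": word})
        for char_no, unit in enumerate(split_characters(word), start=1):
            char_records.append(
                {**key, "word_no": word_no, "char_no": char_no, **unit}
            )
    return verse_record, word_records, char_records
```

Each returned record carries the full positional index (chapter, verse, word order, character order), so it can be stored directly as a row at the matching level. Attaching interpretations, translations or intonation rules from the other books would be a separate step keyed on the same index.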

III. THE APPLICATION (BUILDING THE DATASET)
The proposed model has been applied to a range of texts selected from 14 electronic books, as mentioned in subsection II-A. The text was then organized and implemented in three structured database tables and electronic sheets: the verse data, the word data and the character data. The following subsections show how these data are implemented.

A. Verse template dataset
As illustrated in figure 9, all verses of the holy Quran are now in the form of data records. Each record contains an index for the verse together with its interpretation, its English translation, parsing, etc. The table contents match the original data, with no errors, inconsistencies or missing data relative to the verses of the holy Quran and its science books. These records make it easy to extract information related to the verses and their meanings, and to look up their English or French translations. For example, one can search for a verse by its significance, by its translation, or by a word or subject it contains; one can link verses to their meanings and reasons of revelation; and one can search by theme or similarity. Statistical and relational tasks can also be carried out easily.

B. Word Template Dataset
As illustrated in figure 10, all words of the holy Quran are now in the form of data records. Each record contains a single index for the word, together with the word itself, its meaning, its parsing, its meaning in English, its root and stem, etc. The results were identical to the original text, without any error or missing data. These records make accessing the words of the holy Quran easy and quick in several ways. One can also use some of the word characteristics to link subjects and draw conclusions from the holy Quran and the interpretation books. For example, to conduct research on usury, prayer or Zakat, one can search for all words related to the desired subject through the root or stem, or directly through the word, and then connect the verses and their interpretations. The outcome is an integrated study of what the Quran and the interpretation books state about the subject. Statistical and relational tasks at the word level can also be carried out easily and quickly.
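A root-based lookup of this kind might look like the sketch below, which joins word records to verse records through their shared index. The table and column names are hypothetical, SQLite is used only for convenience, and every row is a placeholder rather than real dataset content.

```python
import sqlite3


def verses_for_root(conn, root):
    """Return (chapter, verse, word, interpretation) for every word with this root."""
    query = """
        SELECT w.chapter_no, w.verse_no, w.word, v.interpretation
        FROM words AS w
        JOIN verses AS v
          ON v.chapter_no = w.chapter_no AND v.verse_no = w.verse_no
        WHERE w.root = ?
        ORDER BY w.chapter_no, w.verse_no, w.word_no
    """
    return conn.execute(query, (root,)).fetchall()


# Tiny in-memory stand-in for the dataset; all values are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE verses (chapter_no INTEGER, verse_no INTEGER, interpretation TEXT);
    CREATE TABLE words (chapter_no INTEGER, verse_no INTEGER, word_no INTEGER,
                        word TEXT, root TEXT);
""")
conn.executemany(
    "INSERT INTO verses VALUES (?, ?, ?)",
    [(1, 1, "<interpretation A>"), (1, 2, "<interpretation B>")],
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "<word 1>", "root_x"),
        (1, 1, 2, "<word 2>", "root_y"),
        (1, 2, 1, "<word 3>", "root_x"),
    ],
)
hits = verses_for_root(conn, "root_x")
```

Here `hits` holds one row per matching word, each paired with the interpretation of its verse. Searching by stem or by the surface word would only change the `WHERE` clause.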

C. Character template data set
As can be seen in figure 11, all characters of the holy Quran are now data records, where each record contains a single index for the character together with the character, its diacritic and its time. These records make it easy to produce reports at the character level. For example, the authors of [17] presented statistics on the letters only, without any information that distinguishes between look-alike characters such as (أ, إ, ا) and (ى, ي). The character dataset can be considered more statistically accurate and comprehensive (see figures 14 and 15). One interesting aspect of this dataset is its comprehensiveness: the methodology does not ignore even the diacritic of the character, which allows users to handle the character with or without its diacritic. This provides multiple benefits, for example for intonation studies. Figure 16 shows statistics on diacritics.
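To make the distinction concrete, the sketch below counts letters either with look-alike forms kept separate or folded together, and counts diacritics on their own. It is an illustration rather than the paper's reporting code: the function names and the folding map are assumptions, and the input is assumed to use precomposed letters with diacritics encoded as Unicode combining marks.

```python
import unicodedata
from collections import Counter

# Illustrative folding map covering only the look-alike forms named above.
FOLD = {
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064a",  # alef maksura -> yeh
}


def is_diacritic(ch):
    """True for Unicode combining marks, which include the Arabic diacritics."""
    return unicodedata.category(ch).startswith("M")


def letter_counts(text, keep_variants=True):
    """Count letters without diacritics; optionally fold look-alike forms."""
    counts = Counter()
    for ch in text:
        if ch.isspace() or is_diacritic(ch):
            continue
        counts[ch if keep_variants else FOLD.get(ch, ch)] += 1
    return counts


def diacritic_counts(text):
    """Count each diacritic mark separately."""
    return Counter(ch for ch in text if is_diacritic(ch))
```

Calling `letter_counts(text)` keeps each variant as a separate entry, while `letter_counts(text, keep_variants=False)` reproduces the coarser letter-only view. Comparing the two on the same text shows how much detail the folded statistics hide.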

D. Application interface
The dataset is available and ready for use. It is implemented in three formats: Access tables, Excel sheets and a SQL Server database. An application for searching the content of the dataset has also been built and is available online at http://www.anwermustafa.com/su/. This application provides a search engine over all the contents of the dataset, for example searching by the exact Quranic word, the meaning, the English or French word, any part of a sentence, the root, the stem, etc. Figure 12 shows an example of using this application to search for the word "نافق".

IV. TESTING THE DATASET
This section presents the aspects that confirm the algorithms used and the correctness of the data produced. Because automatic algorithms were used to transfer the text into database tables, it is important to validate these algorithms by comparing the results with the actual data (the original books). As shown in figures 9-16, the results match the real data. For example:
1) The holy Quran is composed of 114 chapters and 6,236 verses with 77,477 words [1]; the statistics generated from our dataset give the same result. Figures 13-14 show statistical reports generated from our dataset.
2) All Quranic verses are included in the dataset and linked correctly to their chapters and to their parameters, such as interpretation, translation, etc. Figure 9 shows an example of this aspect.
3) The Quranic words generated by our algorithm are the same as those in the holy Quran: the same number of words, the same relation to verse and chapter, and correct links to the corresponding meaning, translation, syntax, root, stem, etc. Figure 10 shows an example of this aspect.
4) The statistical analysis of Quran characters gives good results in comparison with previous work, for example [17]. Figures 14-16 show examples of such results.

V. CONCLUSION
This paper presented a model containing algorithms that read Arabic texts and convert them into datasets in the form of relational database records and electronic sheets. The model was applied to one of the most important and largest bodies of Arabic text: the holy Quran and more than 13 books, including English and French translations and science books such as interpretations, syntax, word meanings, etc. The algorithms achieved accurate results when compared with the original text. This makes the model appropriate for processing Arabic texts and building datasets from them.
Applying the model to the construction of the Quranic dataset showed that the automated algorithms give correct results when converting text from the Quran and its science books into data tables. The findings were tested against the original sources, and the data in the records are the same as the original text. This implies that the model presented in this paper can be trusted and can accommodate future data, either within the prepared templates or through new templates, automatically and without new settings.
The Quranic datasets presented in this paper put the text of the Quran and its science books, which contain a tremendous amount of knowledge, meanings and relationships, in the form of an encyclopedia that serves all areas of research. This is a scientific contribution to the area, since most previous studies did not reach this level of dataset.
The capabilities of the model used to construct these groups of datasets make it flexible enough to accommodate new sets of texts related to the holy Quran. Because the datasets are relational, the model also allows reports and search results to be produced in a multi-dimensional manner.
The Quranic datasets presented in this paper were placed in templates and then converted into relational database records, which establish the infrastructure for multi-technology applications such as database applications, data mining and knowledge discovery applications, Artificial Intelligence applications, machine learning applications, natural language processing (NLP) and statistical applications. Accordingly, our next work will exploit the generated datasets to explore and discover results worthy of the greatness of the holy Quran and its unlimited knowledge of everything: "And We sent down to you the Book giving an account of everything, and a guidance and a mercy and glad tidings to Muslims" (Nahl: 89). In addition, we expect to use this dataset in research on Arabic language processing and Arabic text mining, because the holy Quran has been considered a standard Arabic reference for more than 14 centuries.