Ontology-based Query Expansion for Arabic Text Retrieval

Semantic resources are an important part of Information Retrieval (IR) applications such as search engines and Question Answering (QA) systems; these resources should be available, readable, and understandable. In the semantic web, ontologies play a central role in information retrieval, where they are used to retrieve more relevant information from unstructured data. This paper presents a semantic-based retrieval system for Arabic text that expands the input query semantically using an Arabic domain ontology. In the proposed approach, the search engine index is represented using the Vector Space Model (VSM), and the query is expanded with an Arabic place-nouns domain ontology that was constructed from an Arabic corpus and implemented in the Web Ontology Language (OWL). The proposed approach was evaluated on the Arabic Quran corpus, and the experiments show that it outperforms traditional keyword-based methods in terms of both precision and recall.

Keywords: Information Retrieval; Arabic Ontology; Semantic Search; Arabic Quran Corpus


I. INTRODUCTION
In the era of information overload, search engines are among the most important applications. Existing search engines still suffer from several problems related to the user's query, such as word mismatch and the retrieval of many irrelevant documents, particularly when the query is not specific enough. Today the World Wide Web (WWW) has become a huge library of unstructured data, which is difficult to understand and process unless the web content is represented in a machine-processable form.
Semantic-based retrieval aims to search with concepts rather than with terms, and thereby retrieves more relevant information. Various semantic search techniques have been adopted since the emergence of the semantic web. The Semantic Web (SW) is an extension of the current web in which information is given well-defined meaning, enabling systems and people to understand it better. The SW thus makes it possible to work effectively with information from different sources [1]. In general, semantic search is a process that improves retrieval by applying data from semantic networks to disambiguate queries and web texts. At the ontology level, a semantic network expresses a vocabulary that is useful for machine processing [2], [3]. Ontologies also play a major role in supporting information search and retrieval for the Arabic language [4]. The Arabic WordNet (AWN) is a free lexical and semantic resource for Arabic ontologies [5]. Constructing AWN, however, presents challenges that include the script and the morphological properties of Semitic languages, which are centered on roots [6].
The semantic web contains metadata, that is, data about data. Ontologies play a major role in the semantic web: they are added to a web page to let machines understand the document. In this context, Tim Berners-Lee (2006) introduced the semantic web architecture, which contains eight layers, of which the Resource Description Framework (RDF) and Ontology layers are considered the most important [7]. RDF is a language for creating a data model of objects ("resources") and the relations among them, and it allows information to be represented as a graph. The Resource Description Framework Schema (RDFS) provides a basic vocabulary for describing the properties and classes of RDF resources. The Web Ontology Language (OWL) extends RDFS by adding more advanced constructs for describing the semantics of RDF statements [8], [9].
Arabic is a Semitic language and the language of the Islamic Holy Quran [4], [10]. The complexity and ambiguity of Arabic grammar make it difficult to develop Arabic ontologies. Nevertheless, many researchers are interested in Islamic research such as the Holy Quran and have developed various domain ontologies, most of which are freely available [11], [12]. Many tools exist for searching the Holy Quran, and most of them use keyword-based search. These tools face problems related to the meaning of the query terms; hence, finding accurate information directly from the Holy Quran is very difficult [7]. Intelligent methods are therefore required to enhance current search engines, especially for the Arabic Holy Quran and its Islamic concepts, because Muslims need the Quran in all their affairs.

This paper presents a semantic-based system for Arabic information retrieval that improves Arabic retrieval results. The system semantically expands the input query using an Arabic domain ontology. In this approach, we use our previous work, a domain ontology for the "Place Nouns" mentioned in the Arabic Quran [13]. The term index is built using the Vector Space Model (VSM) to represent both the documents and the user query. The proposed approach was applied to the Arabic Quran corpus, and the experiments show that query expansion is promising and gives good results compared with traditional Arabic information retrieval systems.

www.ijacsa.thesai.org

This paper is organized as follows. Section II presents the related work. Section III describes the proposed system architecture, and Section IV explains the ontology-based query expansion. Section V presents the experimental results and evaluation. Finally, Section VI concludes the paper.

II. RELATED WORK
A semantic-based approach aims to search with concepts rather than words. A Semantic Search Engine (SSE) attempts to make sense of the search results based on the document context. Domain-specific ontologies and upper ontologies are the two main types of ontology [14]. Recently, many researchers have become interested in developing Arabic ontologies; the vast majority of this work targets a specific domain, while other work reuses available ontologies [12]. Semantic-based approaches expand the input query semantically using term co-occurrence or external resources such as a lexical thesaurus or a domain ontology. The work in [15] proposed a web-based multilingual information retrieval tool that includes the Arabic language. The authors built a domain ontology for legal Arabic. By applying the Arabic ontology, the authors improved recall from 115 to 1230 and precision from 2 to 7.
The Arabic Quran is the Holy book of Islam and is undoubtedly an important book, covering many themes and concepts. The Arabic Quran corpus is an annotated linguistic resource that consists of 6,236 verses (Ayah) and a total of 77,430 words [16]. Several studies have aimed to facilitate and develop the search process for the Holy Quran and to build Islamic ontologies, such as [10], [17], [18], [19], and [13]. Most of these studies were directed at the Arab user; other studies have investigated Quran translations into other languages, such as [18], which covered limited knowledge for the domain of solat (prayers) in the Holy Quran. Similarly, [21] worked on the last Juz' of the Quran (Juz' Amma), for which the authors developed an ontology.
ISWSE, the "Islamic Semantic Web Search Engine" presented in [7], is based on an Islamic ontology and uses Azhary [22] as a lexical ontology for the Arabic language. The experiments were conducted on the stories of the Prophets in the Quran, the most detailed part of the Islamic ontology, which contains 1153 concepts. The system reports 98.5% precision and 97% recall for 30 executed queries. Retrieval in the system is based on the classified concepts in the ontology.
In his PhD thesis, Dukes [23] presented the Arabic Quran corpus ontology based on the Tafsir by Ibn-Kathir, which defines 300 concepts in the Quran and 350 relations of type Part-of or IS-A between concepts. On the same corpus, [2] presented QurSim, a language resource for Quran scholars and researchers; the dataset consists of 7600 pairs of related verses for evaluating the relatedness of short texts. In addition, [24] developed the tool "Quran Search for a Concept" for searching the Quran topics index drawn from academic sources, namely Tafsir Ibn-Kathir and the topics of the Mushaf Altajweed book, which consists of 1100 concepts.
The works in [6], [21] proposed a model for detecting the concepts in the Quran using knowledge representation, showing the relationships between the concepts in Description Logic (predicate logic). This research attempts to reuse and improve the existing Arabic Quran corpus ontology of [23], extending it with more than 650 relationships drawn from the Quran, Hadith, and Islamic websites. The authors also proposed a semantic search system for the Quran domain that follows a top-down approach and consists of 15 abstract concepts. The papers [10], [25] proposed a methodology for automatically extracting concepts from the Holy Quran in English translation and used these concepts to build a domain ontology. The ontology is based on information taken from domain experts. The authors do not cover all subjects in the Quran, only 63 verses, and do not discuss the format or the ontology technologies used. In another work, [18], the framework takes into account the sciences of the Quran, such as the reasons for revelation (Asbab Al Nuzul). The ontology consists of 374 extracted cases covering the verses that contain the word salat (prayer).
In the same vein, [17] developed a simple ontology for the domain of animals and birds mentioned in the Holy Quran and applied it to semantic search of the Quran. The Pickthall English translation of the Quran was used in that paper, and the ontology consists of 167 references to animals mentioned directly or indirectly in the Holy Quran, based on the book entitled "Hewanat-El-Qurani".

III. THE PROPOSED SYSTEM ARCHITECTURE
The architecture of the system is shown in Figure 2. It consists of two phases: offline and online. In the offline phase, the index of the Arabic information retrieval system is created and maintained for the Arabic corpus based on the vector space model. In addition, the Arabic domain ontology is designed and implemented from the Arabic corpus. In the online phase, the user query is expanded using the ontology, and then the search results are retrieved and ranked. These phases are described as follows.

The first phase is the offline phase, which consists of three modules: Documents Pre-processing, Indexing, and Ontology Building. These modules are briefly described as follows.

1) Documents Pre-Processing Module: This module consists of three processes: word tokenization, stop-word removal, and word weighting. Word tokenization breaks the Arabic sentences of each document in the corpus into tokens. Stop-word removal removes uninformative words such as "من" (from) and "على" (on). Finally, word weighting computes the term weights based on the Term Frequency-Inverse Document Frequency (TF-IDF) statistical measure. Equation (1) computes the term frequency, equation (2) computes the inverse of the number of documents in which the term occurs, and equation (3) shows how the term weight is computed [26]:

$$tf_{ij} = \frac{f_{ij}}{\max_{k} f_{kj}} \qquad (1)$$

where $tf_{ij}$ denotes the normalized term frequency of term $i$ in document $j$, and $f_{ij}$ is the number of occurrences of term $i$ in document $j$.

$$idf_{i} = \log \frac{N}{df_{i}} \qquad (2)$$

where $N$ denotes the number of documents in the corpus, and $df_{i}$ is the number of documents in which term $i$ occurs.

$$w_{ij} = tf_{ij} \times idf_{i} \qquad (3)$$

where $w_{ij}$ denotes the weight of term $i$ in document $j$.
2) Indexing Module: This module indexes the document terms using the weights generated by the word-weighting process; the index stores the term weights using the Vector Space Model (VSM) [26]. VSM is an algebraic model that represents both documents and queries as vectors. For example, the vector $d_j = (w_{1j}, w_{2j}, w_{3j}, \ldots, w_{nj})$ holds the weights of document $j$, where $w_{ij}$ is the weight of term $i$ in document $j$.
3) Ontology Building Module: This module is responsible for building the domain ontology for the Arabic language. The ontology is represented in the Web Ontology Language (OWL), the standard ontology language for the semantic web. In this paper, the proposed system is an Islamic semantic search engine for the Holy Quran. It performs ontology-based search and uses the Arabic vocabulary associated with the "Place Nouns" mentioned in the Arabic Quran [13].
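The pre-processing and indexing modules of the offline phase can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the stop-word list, the toy documents, and the function names `preprocess` and `build_index` are hypothetical, tokenization is plain whitespace splitting, and the weighting assumes term counts normalized by the largest count in the document and a natural logarithm for the inverse document frequency.

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration; a real system would load a full Arabic list.
STOP_WORDS = {"من", "على"}


def preprocess(text):
    """Tokenize on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def build_index(docs):
    """Map each document id to a sparse vector {term: TF-IDF weight}."""
    tokens = {doc_id: preprocess(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)

    # df[t] = number of documents that contain term t at least once
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    index = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        # Assumption: normalize by the most frequent term in the document
        max_count = max(counts.values(), default=1)
        index[doc_id] = {
            term: (count / max_count) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return index


# Toy documents (not Quran text), used only to exercise the functions.
docs = {
    "d1": "مكه من مكه البلد",
    "d2": "البلد على الجبل",
}
index = build_index(docs)
```

In this toy corpus the term "البلد" occurs in both documents, so its inverse document frequency is log(2/2) = 0 and its weight vanishes, while "مكه" keeps a positive weight in d1. Storing each vector as a dictionary keeps only the non-zero dimensions of the VSM vector.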
The second phase is the online phase, which consists of four modules: User Interface, Query Pre-processing, Semantic Query Expansion, and VSM-based Information Retrieval and Results Ranking. These modules are briefly described as follows.
1) User-Interface Module: This module accepts the query from the end user and displays the retrieved results.
2) Query-Pre-processing Module: This module pre-processes the input query by applying tokenization, stop-word removal, etc.
3) Semantic Query Expansion Module: This module expands the input query based on the domain ontology. For each query word that matches a concept in the domain ontology, the relations between concepts, including their individuals, are retrieved and used to enrich the expanded query. In this paper, the OWL API [27] package is used in the proposed system to load the ontology and extract the concepts and relations.
4) Information Retrieval based VSM and Ranking Module: This module matches the expanded query vector against the document vectors to compute the similarity between them. In this paper, the cosine similarity is used, as shown in equation (4) [26]:

$$sim(d_j, q) = \frac{\sum_{i=1}^{n} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{n} w_{ij}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{iq}^{2}}} \qquad (4)$$

where $sim(d_j, q)$ is the similarity between document $j$ and query $q$, $w_{ij}$ denotes the weight of term $i$ in document $j$, and $w_{iq}$ is the weight of term $i$ in query $q$. The retrieved documents are then ranked according to their similarity to the user query.
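The online phase (query expansion followed by cosine ranking) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the system's actual code: the synonym map is written by hand here (in the paper it is derived from the ontology), every term of the expanded query is given weight 1, and the document weights and the identifiers `v1` to `v3` are invented for the example.

```python
import math

# Hypothetical synonym map; the paper derives such pairs from the domain ontology.
SYNONYMS = {"مكه": ["بكه"], "بكه": ["مكه"]}


def expand_query(tokens, synonyms):
    """Append the synonyms of each query token, keeping order and dropping duplicates."""
    expanded = []
    for tok in tokens:
        for term in [tok, *synonyms.get(tok, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


def cosine(doc_vec, query_vec):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


def rank(index, query_tokens, synonyms=None):
    """Return (doc_id, score) pairs with non-zero similarity, best first."""
    terms = expand_query(query_tokens, synonyms or {})
    query_vec = {term: 1.0 for term in terms}  # assumption: unit query weights
    scored = [(doc_id, cosine(vec, query_vec)) for doc_id, vec in index.items()]
    return sorted((pair for pair in scored if pair[1] > 0), key=lambda p: -p[1])


# Hypothetical index with made-up weights, for illustration only.
index = {
    "v1": {"مكه": 1.0},
    "v2": {"بكه": 0.8, "البيت": 0.6},
    "v3": {"الجنه": 1.0},
}
plain = rank(index, ["مكه"])
expanded = rank(index, ["مكه"], SYNONYMS)
```

With this toy index, `plain` contains only `v1`, whereas `expanded` also retrieves `v2`, which mentions the synonym but not the literal query term. This is the effect the expansion module is meant to produce.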
IV. THE ONTOLOGY-BASED QUERY EXPANSION
The main contribution of this paper is semantic-based retrieval that automatically expands the input query based on an Arabic domain ontology. Both the Arabic domain ontology and the query expansion process are described in detail as follows.

A. The Arabic Domain Ontology
In this paper, we use our previous Arabic domain ontology, which was built entirely by hand using Protégé [13]. The Protégé-OWL editor was used to implement the ontology, covering the Arabic vocabulary associated with the Place Nouns mentioned in the Holy Quran. The output of this process is an Islamic ontology that organizes the Islamic concepts in a hierarchy of classes.
The domain ontology consists of three main classes: "مكان جغرافي" (Geographic place), "مكان عباده" (Devotional place), and "اماكن ما بعد الحياة" (After Life place), the last of which contains two main sub-classes: "الجنة" (Paradise) and "الجحيم" (Hell). The vocabulary contains a total of 99 words. Words in the hierarchy are linked through ontological semantic relations. The synonym relation is used (named "same individual as" in Protégé); for example, the individual "مكه" (Makka) has several synonymous names in the Holy Quran, such as "ام القرى" (mother of cities), "البلد الامين" (city of security), "البلد" (city), and "بكه" (Bakka). Figure 3 shows a sample of the RDF that represents the synonyms of a term, and Figure 4 displays the ontology graph of the classes with their individuals.
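To illustrate how "same individual as" assertions group synonyms, the following Python sketch reads SameIndividual axioms in OWL/XML form and builds a term-to-synonyms map. The paper's system uses the Java OWL API for this step; the standard-library XML parser is swapped in here only to keep the example self-contained. The sample document is a hand-written fragment modeled on the axioms of this ontology, and the helper names (`local_name`, `fragment`, `synonym_map`) are hypothetical. Because the relation is transitive, individuals that appear in different axioms but share a member are merged with a small union-find.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample modeled on the ontology's SameIndividual axioms.
SAMPLE = """<Ontology xmlns="http://www.w3.org/2002/07/owl#">
  <SameIndividual>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#بكه"/>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#مكه"/>
  </SameIndividual>
  <SameIndividual>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#ام_القرى"/>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#مكه"/>
  </SameIndividual>
</Ontology>"""


def local_name(tag):
    """Strip an XML namespace such as '{http://...}SameIndividual'."""
    return tag.rsplit("}", 1)[-1]


def fragment(iri):
    """Use the part after '#' as the term; underscores become spaces."""
    return iri.rsplit("#", 1)[-1].replace("_", " ")


def synonym_map(owl_xml):
    """Return {term: sorted list of its synonyms} from SameIndividual axioms."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for axiom in ET.fromstring(owl_xml).iter():
        if local_name(axiom.tag) != "SameIndividual":
            continue
        names = [fragment(el.attrib["IRI"]) for el in axiom
                 if local_name(el.tag) == "NamedIndividual"]
        for other in names[1:]:
            parent[find(other)] = find(names[0])

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    return {name: sorted(members - {name})
            for members in groups.values() for name in members}


synonyms = synonym_map(SAMPLE)
```

For the sample above, the entry for "مكه" contains both "بكه" and "ام القرى", even though the two names never appear together in a single axiom.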

B. The Query Expansion using Arabic Domain Ontology
This section describes how the input query is expanded based on the ontology; the query preparation module improves the system's retrieval of information relevant to the user. It converts the RDF triples into a hash map and extracts the equivalent terms from the relations between concepts, such as "الكعبه" (alkaba) same-as "البيت الحرام" (the sacred house), which have the same meaning in Arabic. For example, Figure 5 shows the concept "الكعبه" in the ontology as OWL with its relations and object properties. These relations and object properties of the terms are used in the query expansion process.

<SameIndividual>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#بكه"/>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#مكه"/>
</SameIndividual>
<SameIndividual>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#ام_القرى"/>
    <NamedIndividual IRI="http://www.Ain-Shams.org/ont.owl#مكه"/>
</SameIndividual>

V. EXPERIMENTAL RESULTS AND EVALUATION

The proposed semantic-based Arabic information retrieval system was designed and implemented using the Java language and the SQLite database. Figure 6 shows the system user interface, which enables the user to enter the query and set the search mode (with or without query expansion). The system displays a ranked set of retrieved documents for the input query, with or without expansion; Figure 6 presents the search results for the input query "مكه" (Makka) using the ontology-based expansion.

A. Dataset and Arabic Domain Ontology
The system was evaluated on the corpus of the Holy Quran. The number of documents (verses) is 6236, the number of words in the corpus is 77430, and the number of unique tokens is 14662 [28]. The Arabic domain ontology was implemented using Protégé, as discussed in Section IV-A. The ontology was tested on the place-nouns vocabulary and on new terms from a new semantic field in the Quran. Because the Protégé editor was used to build the domain ontology, both its syntactic and semantic quality were verified by this tool. Therefore, the proposed approach extracted more relevant concepts for the input query, which are used in the expansion as shown in Table 1.

Two experiments were conducted to test our system: word-based retrieval and ontology-based query expansion. Word-based retrieval uses the full-form terms of the input query without modification. Ontology-based expansion expands the query terms automatically using the place-nouns domain ontology. In addition, to check the accuracy and functionality of the ontology, that is, the concepts, their relationships, and their characteristics, we used the DL Query language in the Protégé tool [20].
For example, given the query "الكعبه" (AlKaba), the query expansion process expands it using the place-nouns domain ontology by extracting concepts based on their relations. The proposed system searches the ontology for the concept "الكعبه" (AlKaba), and then all synonyms and related derivatives attached to the concept are extracted, as shown in Table 1. The expanded query then becomes {الكعبه (Ka'aba), مكه (Makka), البيت الحرام (the Sacred House), البيت العتيق (the Ancient House), البيت (the House), بكه (Bakka)}, following the translation by Yusuf Ali [29].

Table 2 shows the total number of relevant documents retrieved for a simple query by the word-based method and by the proposed approach using the place-nouns domain ontology. The word-based search method retrieved only 2 documents from the corpus. The proposed ontology-based method retrieved a total of 54 documents that are semantically related to the input query. These preliminary results are promising and indicate a significant improvement in precision and recall.

In addition, the proposed approach was tested using five queries, as shown in Table 3. The results of the proposed system were compared with those of the word-based and stem-based methods. Figure 7 shows that the proposed system achieves better precision than both the word-based and stem-based methods.

The SPARQL query language [30] was also used to confirm that the ontology supports semantic manipulation and inference. Several sample queries were used, such as the following example:

Query: "ماهي اماكن العبادة في مكه" (What are the devotional places in Makka?)
Answer: "الكعبه" (Alka'aba), "جبل عرفات" (Arafat Mount), "الصفا" (Safa), "المروه" (Marwa).

SELECT ?Place_Nounc_Class WHERE { ?Devotional_Palce :مكه . :مكه rdf:type :City .
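The precision and recall measures used in the comparison in this section follow the standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal Python sketch is given here; the function name and the verse identifiers are hypothetical and do not correspond to the paper's experimental data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved = ["v1", "v2", "v3", "v4"]
relevant = ["v1", "v2", "v5"]
p, r = precision_recall(retrieved, relevant)
```

Here two of the four retrieved items are relevant, so precision is 0.5, and two of the three relevant items are found, so recall is 2/3.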
VI. CONCLUSION

In this paper, we presented a semantic-based Arabic information retrieval system that semantically expands the input query using a domain ontology. The search engine index is represented using the Vector Space Model (VSM), and the place-nouns domain ontology, constructed from an Arabic corpus and implemented in the Web Ontology Language (OWL), is used for expansion. Much research has been done in the information retrieval domain, but no work has been done on efficient topic search in the Arabic Holy Quran. In addition, attempts to develop Arabic ontologies exist but are of limited usefulness because of the complexity and ambiguity of Arabic grammar; as a result, most of these ontologies are developed for specific domains. The proposed approach improves retrieval for Arabic queries through the place-nouns domain ontology. It outperforms the term-based method in terms of precision and recall. Moreover, it is useful for Islamic learning, linguistic research, and semantic web applications. In future work, we plan to merge and integrate the Place-Nouns ontology with the Time-Nouns ontology (Al-khalefa H. et al., 2010) and apply the proposed approach to semantic retrieval.

Figure 1 displays the ontology-based systems developed for Arabic, which are classified into Arabic ontologies and Holy Quran ontologies.

Fig. 6. The user interface with sample query results.

Fig. 7. Precision for the word-based, stem-based, and ontology-based methods.

TABLE I. SAMPLE OF THE OBJECT PROPERTIES FOR THE CONCEPT "الكعبه"

TABLE II. RELEVANT DOCUMENTS RETRIEVED FOR THE QUERY "الكعبه"

TABLE III. LIST OF QUERIES