Relation Inclusive Search for Hindi Documents

Information retrieval (IR) has become a challenge for researchers owing to the huge growth of digital information. As a wide variety of Hindi data and literature is now available on the web, we have developed an information retrieval system for Hindi documents. This paper presents a new searching technique that gives promising results in terms of F-measure. Historically, there have been two major approaches to IR: keyword based search and concept based search. We introduce a new relation inclusive search, which searches documents using the case role, spatial and temporal relations of query terms and gives better results than the previously used approaches. The method uses a new indexing technique that stores information about the relation between terms along with their positions. We have compared four types of searching: Keyword Based search without Relation Inclusive, Keyword Based search with Relation Inclusive, Concept Based search without Relation Inclusive and Concept Based search with Relation Inclusive. Our proposed searching method gave a significant improvement in terms of F-measure. For the experiments we used the Hindi document corpus Gyannidhi from C-DAC. The technique is also expected to improve search performance for documents in English.


I. INTRODUCTION
The World Wide Web is used to share a huge repository of texts in many languages, as anybody can post content in any language. With the growing number of languages on the web, search across multiple languages is desirable. To enable a wider proportion of the population to benefit from information technology, the human-machine interface should permit communication in one's native language. In the context of a multilingual country like India, this can be of immense value. Users are greatly helped if they can retrieve information in a language in which they are comfortable. Among Indian languages, Hindi has been given much emphasis, leading to the development of a significant number of Hindi documents. According to Ethnologue statistics, in the list of the top 10 languages Hindi comes fourth with 260 million first language speakers and English comes third with 335 million first language speakers [14]. Hindi is spoken by 41% of the population of India, whereas about 5% of the population understands English as a second language [10].
A wide variety of Hindi data and literature is now available on the web [6]. The number of users who want information in Hindi is increasing. We have developed an Information Retrieval System (IRS) for Hindi documents. Various search engines that support Hindi are available on the internet, such as Google [11], Raftaar [12] and Hinkhoj [13]. The retrieval accuracy of these search engines does not satisfy users' needs. Increasing the appropriateness of the results returned by these search engines is critical to dealing with the huge repository of data.
Historically, there have been two major approaches to IR: keyword based search and concept based search. In keyword based search, search engines use words or multiword phrases that occur in documents and queries as atomic elements, and the content of a document is described by a list of keywords. The search procedure used by these search engines is principally based on exact matching of document and query terms and does not take into consideration the various meanings or possible concepts that a word represents [4], [5]. If the user chooses a valid synonym that does not occur in any document, the search fails. In general, this approach has many problems, as the user may not get the most relevant and useful content related to the query. The first problem is low precision, which is due to the irrelevance of many of the search results; this makes it difficult to find the relevant information. The second problem is low recall, which is due to the inability to index all the information available on the web; this makes it difficult to find relevant information that is not indexed. Also, the depth of language analysis is very low, so relevant information is not retrieved by the search engines.
On the other hand, concept based search fetches documents based on their meaning rather than on the presence of keywords. Here, the meaning of words is analysed and not only their syntactic representations. This type of searching uses query expansion techniques, in which the initial query is appended with related, contextual or synonymous terms so that the new query defines the required concept more completely [6], [7], [8]. However, this approach also uses the words that occur in the query and in documents as atomic elements, and no relation between them is used for retrieval. Even so, concept based approaches reach a higher precision than keyword based approaches. [...] contains a temporal relation given by "पहले". If documents are indexed without considering relational information, and prepositions/postpositions are removed as stopwords before indexing to reduce the index file size, then the relation existing between terms is lost. Taking this point into account, we propose relation inclusive search as a new and promising way of improving search in the IRS. We call it RSearch (relation inclusive search). The main idea is to keep the same infrastructure that has made previous methods so successful while improving system performance. Informally, RSearch does not treat the terms present in the query and in documents as atomic elements; instead, the semantics of the relationships between query terms is considered to improve search. While indexing, it stores in the index file all the relational information existing between terms present in the documents, and uses that information to retrieve relevant documents.
To reduce the size of the index file, we remove stop words before indexing the documents. We have divided stopwords into two categories: relational (नीचे, ऊपर, पर, आगे, अंदर, ने, को, है, था etc.) and non-relational (ही, तब, यह etc.). These categories have a different impact on the information retrieval process. Relational stopwords indicate semantic relevance that is necessary for effective information retrieval. Removing relational stopwords from a document would lose this semantic information and so reduce the relevance of the results. Removing non-relational stopwords, on the other hand, reduces the document length, resulting in faster search. So, we remove only non-relational stopwords.
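The stopword policy described above can be sketched as follows. This is only an illustration: the two sets contain just the words listed in the text (the full lists are not given here), and the function name and the sample sentence are hypothetical.

```python
# Stopword categories as listed in the text; the real lists are longer ("etc.").
RELATIONAL_STOPWORDS = {"नीचे", "ऊपर", "पर", "आगे", "अंदर", "ने", "को", "है", "था"}
NON_RELATIONAL_STOPWORDS = {"ही", "तब", "यह"}


def remove_non_relational(tokens):
    """Drop only non-relational stopwords; relational ones are kept
    because they carry case role, spatial or temporal information."""
    return [tok for tok in tokens if tok not in NON_RELATIONAL_STOPWORDS]


# Hypothetical sentence, tokenised on whitespace for simplicity.
tokens = "यह किताब टेबल पर ही है".split()
kept = remove_non_relational(tokens)
print(kept)
```

Here the relational words पर and है survive the filter, while यह and ही are dropped.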

II. RELATION INCLUSIVE SEARCH
A query contains multiple objects that depend on each other. Relations between objects are given by prepositions (in English) or postpositions (in Hindi), depending on the language of the document. We propose a new relation inclusive searching technique, RSearch, which stores the relation between terms present in a document in the index file and also uses that relation to retrieve relevant documents from the corpus. It is based on a new indexing scheme. It fetches documents based on the Case Role (Karaka) relation, Spatial relation and Temporal relation existing between query terms. All these types of relations are discussed in detail in the following subsections. Before that, consider the document collection shown in figure 1. Figure 2 shows ten example queries submitted to this document collection. We will use the same document collection and sample queries in the following subsections.

III. TYPES OF RELATIONS
There are three types of relations we have considered in relation inclusive search.

A. Case Role (Karaka) relation
Hindi has eight types of case roles, which are shown with their related परसर्ग/suffixes (see Table I). As Hindi is a free word order language, the order of words carries only secondary information such as emphasis. Primary information relating to 'gross' meaning (e.g., one that includes semantic relationships) is contained elsewhere [1], [2], [3].
Therefore, the same sentence could be written as:
Sentence 2: श्याम को जानवर ने मारा।
An answer for query Q6 in figure 2 using keyword based searching would be D5, as this document contains all the query terms. But this is not the correct answer, as the case roles of जानवर and श्याम do not match. In D5, the case role of श्याम is कर्ता (nominative case) and that of जानवर is कर्म (objective case), but the reverse is the case in Q6.
So, to improve the information retrieval effectiveness in terms of precision and recall, we have stored the case role relation between terms in the index file.
www.ijacsa.thesai.org
1) Indexing a Case Role relation
While indexing the documents, the case role (giving the karaka relation) of the word is also added in its posting list along with its position in a particular document. For example, see Fig. 3.

2) Steps for matching
To retrieve the documents for query Q6, matching involves the following steps:
1) Retrieve the posting lists of जानवर and श्याम.
2) From the posting lists, get the set of documents containing both terms, together with the case roles of the terms in those documents.
3) Find the case roles of the query terms.
4) For each document, check whether the case roles of both terms match their case roles in the query.
In this example, document D5 contains both the terms. In D5, the case role of जानवर is कर्म (K2) and that of श्याम is कर्ता (K1). But जानवर is कर्ता (K1) and श्याम is कर्म (K2) in Q6. As the case roles of जानवर and श्याम do not match, the answer for query Q6 computed by relation based search is the empty set, which is the correct answer.
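The four matching steps can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes that a term's case role can be read off the postposition that follows it (ने for K1, को for K2), it indexes only role-bearing terms, and the sentences used for D5 and Q6 are hypothetical ones chosen to agree with the case roles stated in the text.

```python
from collections import defaultdict

# Assumed postposition-to-role mapping covering only the two roles discussed:
# K1 = karta (nominative), K2 = karma (objective).
PARSARG_TO_ROLE = {"ने": "K1", "को": "K2"}


def case_roles(tokens):
    """Return {term: (position, role)} for terms followed by a known postposition."""
    roles = {}
    for i, tok in enumerate(tokens[:-1]):
        role = PARSARG_TO_ROLE.get(tokens[i + 1])
        if role is not None:
            roles[tok] = (i + 1, role)  # positions are 1-based
    return roles


def build_index(docs):
    """Inverted index: term -> {doc_id: (position, role)}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, posting in case_roles(text.split()).items():
            index[term][doc_id] = posting
    return index


def rsearch_case_role(index, query):
    """Return documents whose terms carry the same case roles as in the query."""
    # Step 3: case roles of the query terms.
    query_roles = {t: role for t, (_, role) in case_roles(query.split()).items()}
    if not query_roles:
        return set()
    # Steps 1-2: posting lists, then documents containing all the terms.
    candidates = set.intersection(*(set(index.get(t, {})) for t in query_roles))
    # Step 4: keep documents where every term has the role it has in the query.
    return {d for d in candidates
            if all(index[t][d][1] == role for t, role in query_roles.items())}


# Hypothetical D5 text: श्याम is K1 and जानवर is K2, as stated for D5.
index = build_index({"D5": "श्याम ने जानवर को मारा"})
# Hypothetical Q6 text: जानवर is K1 and श्याम is K2, as stated for Q6.
result_q6 = rsearch_case_role(index, "जानवर ने श्याम को मारा")
print(result_q6)
```

With these inputs the result for the Q6-style query is the empty set, whereas a query with the same roles as D5 returns D5. Reordering the query words, as in Sentence 2, does not change the outcome, because only the role markers are compared.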

B. Spatial relation
A spatial relation gives the position of an object with respect to another object.
2) Retrieve posting lists of query terms.

C. Temporal relation
All the temporal related words (पहले, बाद में) come under relational stop words.

For example, दो दिन पहले बारिश हुई थी। (do din pehle barish huee thee)
Here, "दो दिन पहले" gives the temporal relation. If we remove it, the temporal relation is lost and the system does not retrieve the relevant document. To improve the effectiveness of the system, we do not discard temporal related words (पहले, बाद में) as stop words.

1) Indexing a Temporal relation
All the relational stopwords giving a temporal relation are indexed in a separate inverted file. The posting lists of temporal related words contain temporal information. We have considered two types of temporal relation: Number_of_days and Order_of_entities.
1. Number_of_days: If the document/query contains words like {दिन, दिवस} along with words {पहले, बाद}, then the temporal relation is "Number_of_days". In that case we store the number of days in the posting list of the temporal related word along with its position.
For example, in document D1, the temporal related word पहले is present at the 3rd position and the number of days is 2. This information is stored in the posting list of पहल (the stemmed word) (see figure 6).

2. Order_of_entities: If the document/query contains words like से, के along with words {पहले, बाद}, then the temporal relation is "Order_of_entities". In that case we store the order of entities in the posting list of the temporal related word along with its position. At query time, if the query contains the temporal related word पहले, we also check the posting list of बाद, and vice versa, as पहले and बाद are opposites.
For example, in document D6, the temporal related word बाद is present at the 3rd position, giving the order of entities between श्याम and राम, which are stored at the 1st and 4th positions respectively. This information is stored in the posting list of बाद. The relation between श्याम and राम is श्याम => बाद <= राम (see figure 6).

Fig. 6. Indexing a temporal relation
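A rough sketch of how such temporal posting entries might be built is given below. Everything beyond what the text states is an assumption made for illustration: the entry layout, the tiny number-word lexicon, the lookup-table stemmer and the function name are hypothetical, and the D6 sentence is an invented one that merely matches the positions reported for D6.

```python
from collections import defaultdict

TEMPORAL_STEMS = {"पहले": "पहल", "बाद": "बाद"}  # पहले is stemmed to पहल, as in the text
DAY_WORDS = {"दिन", "दिवस"}
ORDER_MARKERS = {"से", "के"}
NUMBER_WORDS = {"एक": 1, "दो": 2, "तीन": 3}  # hypothetical mini-lexicon


def index_temporal(doc_id, tokens, index):
    """Append temporal posting entries for one document (positions are 1-based)."""
    for i, tok in enumerate(tokens):
        stem = TEMPORAL_STEMS.get(tok)
        if stem is None or i < 2:
            continue
        prev, before_prev = tokens[i - 1], tokens[i - 2]
        entry = {"doc": doc_id, "pos": i + 1}
        if prev in DAY_WORDS and before_prev in NUMBER_WORDS:
            # e.g. "दो दिन पहले": store the number of days
            entry.update(type="Number_of_days", days=NUMBER_WORDS[before_prev])
        elif prev in ORDER_MARKERS and i + 1 < len(tokens):
            # e.g. "X के बाद Y": store both entities with their positions
            entry.update(type="Order_of_entities",
                         entities=((before_prev, i - 1), (tokens[i + 1], i + 2)))
        else:
            continue
        index[stem].append(entry)


index = defaultdict(list)
# Example sentence from the text; it agrees with what is stated for D1.
index_temporal("D1", "दो दिन पहले बारिश हुई थी".split(), index)
# Hypothetical sentence matching the positions stated for D6.
index_temporal("D6", "श्याम के बाद राम आया".split(), index)
print(dict(index))
```

The entry for पहल records position 3 and two days, and the entry for बाद records position 3 with श्याम at position 1 and राम at position 4, in line with the two examples above.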

2) Steps for matching
First we identify whether temporal related words are present in the query. If the query contains temporal words, we retrieve the posting lists of those words from their index file.

For example, consider the query Q10 in figure 2. The following steps are involved in searching for relevant documents:
1) Stem the query terms and remove non-relational stop words. Query Q10 is then reduced to: राम से पहल श्याम घर जाए
2) Retrieve the posting lists of the query terms excluding relational stop words (see figure 7).
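Because पहले and बाद are opposites, a query relation expressed with one of them has to be compared against document relations expressed with either. One possible way to do this, shown purely as an assumption and not as the authors' procedure, is to rewrite every triple into a canonical form that uses पहल before comparing. The triple layout (left, relation, right) mirrors the Word 1 => RELATION <= Word 2 notation, and the document identifier "DX" is hypothetical.

```python
OPPOSITE = {"पहल": "बाद", "बाद": "पहल"}


def canonical(triple):
    """Rewrite (left, 'बाद', right) as (right, 'पहल', left); leave 'पहल' triples as is."""
    left, relation, right = triple
    if relation == "बाद":
        return (right, OPPOSITE[relation], left)
    return triple


def temporal_match(query_triple, doc_triples):
    """Return ids of documents whose Order_of_entities triple equals the query's."""
    wanted = canonical(query_triple)
    return {doc_id for doc_id, triple in doc_triples if canonical(triple) == wanted}


# Relation read from the reduced Q10: राम => पहल <= श्याम
q10 = ("राम", "पहल", "श्याम")
doc_triples = [
    ("D6", ("श्याम", "बाद", "राम")),  # relation stated for D6 in the text
    ("DX", ("राम", "बाद", "श्याम")),  # hypothetical document with the reverse order
]
matches = temporal_match(q10, doc_triples)
print(matches)
```

Under this assumed normalisation the D6 triple compares equal to the Q10 triple, while the reversed order in the hypothetical DX does not.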

A. Estimation of F-measure
The sample queries shown in figure 2 are run on the four environments mentioned above to get the results. Table IV gives the estimated F-measure for the four types of searching when they are run on the same set of queries. From the estimates, it can be seen that the F-measure for relation inclusive searching is much higher than that for searching without relation inclusion. In some cases, where synonyms of words are not used, the F-measure comes out the same for keyword based searching and concept based searching. For example, for queries Q4, Q6 and Q10 the F-measures are 0.57 and 0.80, 0.75 and 0.86, and 0.67 and 0.77 respectively, in both keyword based searching and concept based searching. Fig. 8 gives the graphical representation of Table IV. A graph is plotted between F-measure and the search items to compare the four search environments. This experiment indicates that system performance in terms of F-measure increases with relation inclusive searching, both in keyword based searching and in concept based searching. The performance gains came from query classes that had relations between query terms, which could be case role, spatial or temporal. The experiments showed the benefit of relation based searching in improved precision and recall values. The relation inclusive searching method is expected to give better search performance for documents in English and other similar languages where case roles are represented by spatial/temporal prepositions. For the experiments, we used a Hindi test collection extracted from the Gyannidhi corpus from C-DAC. Our proposed searching method gave a significant improvement in terms of F-measure.
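For reference, the F-measure of a single query can be computed from the retrieved and relevant document sets as sketched below. The sketch assumes the standard balanced F-measure, F = 2PR / (P + R), since this section does not state which variant was used; the document sets in the example are hypothetical and are not taken from Table IV.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, balanced F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical result set and relevance judgements for one query.
p, r, f = precision_recall_f({"D1", "D2", "D3", "D4"}, {"D1", "D2", "D5"})
print(round(p, 2), round(r, 2), round(f, 2))
```

Two of the four retrieved documents are relevant and two of the three relevant documents are retrieved, so precision is 0.5, recall is about 0.67 and the F-measure is about 0.57.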

Fig. 3. Inverted Index file in relation inclusive search

Fig. 4. Inverted Index of Relational Stopwords

2) Representation of Spatial relation
We extract relations in the form of triples, Word 1 => RELATION <= Word 2, and use them as indexing terms. For example, in documents D2 and D3 the relation टेबल => पर <= किताब exists, which means that the किताब is lying on the टेबल.
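The triple representation can be illustrated with a small sketch. It assumes, for simplicity, that a spatial word sits directly between the two related words; the set of spatial words is taken from the relational stopword examples, and the document texts (including the extra document "DX") are hypothetical sentences that merely contain the relation stated for D2 and D3.

```python
from collections import defaultdict

# Relational stopwords treated as spatial markers here (an assumption).
SPATIAL_WORDS = {"पर", "नीचे", "ऊपर", "आगे", "अंदर"}


def spatial_triples(tokens):
    """Extract (word1, relation, word2) triples: word1 => relation <= word2."""
    return [(tokens[i - 1], tokens[i], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] in SPATIAL_WORDS]


def build_triple_index(docs):
    """Map each triple to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for triple in spatial_triples(text.split()):
            index[triple].add(doc_id)
    return index


docs = {
    "D2": "टेबल पर किताब है",       # hypothetical text
    "D3": "टेबल पर किताब रखी है",   # hypothetical text
    "DX": "कुर्सी पर किताब है",      # hypothetical document with a different relation
}
triple_index = build_triple_index(docs)
found = triple_index[("टेबल", "पर", "किताब")]
print(found)
```

Looking up the triple टेबल => पर <= किताब returns D2 and D3 but not DX, even though DX also contains both पर and किताब.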

4) Properties of Spatial relation
Following are the properties of spatial relation:
VI. CONCLUSION
In this paper, we have introduced a new relation based technique that searches documents using the case role, spatial and temporal relations of query terms and gives better results than the previously used keyword based and concept based approaches. Our method uses a new indexing technique that stores information about case role, spatial and temporal relations along with term position. We have compared four types of searching: Keyword Based search without Relation Inclusive, Keyword Based search with Relation Inclusive, Concept Based search without Relation Inclusive and Concept Based search with Relation Inclusive.

TABLE I. CASE ROLES WITH RELATED SUFFIXES

TABLE II. DETAILS OF HINDI DOCUMENT CORPUS