in the IRS

In information retrieval systems, query expansion improves the extraction of relevant documents. However, current expansion techniques focus on retrieving as many documents as possible (reducing silence). In Arabic, query terms can be derived into many morphosemantical variants, and the resulting diversity of semantic interpretations often creates ambiguity. Our objective is to prepare the Arabic query before it is submitted to the document retrieval system. This preparation is based on a preprocessing step that makes morphological changes to the query by separating the affixes from the words. All the morphosemantical derivatives are then presented to the lexical audit agent, and the consistency between the words is checked by the context parser. Finally, we present a new method of semantic similarity based on calculating the probability of equivalence between two words.


I. INTRODUCTION
In information retrieval systems, relevance is a function of the degree of similarity between the query and the document. Many similarity functions have been proposed recently. Most of them are based on the principle of vector distance, which does not take the meaning of words into account: two words whose distance is zero are treated as similar. In practice, however, there are words separated by a large distance that carry the same meaning (synonyms), and words at distance zero that have several meanings.
Functions based on vector distance therefore cannot give the exact value of the semantic similarity of terms. There are also many stemming algorithms that contribute to the calculation of relevance by comparing the roots of words. They have yielded good results, but these remain insufficient, because some Arabic words have roots that are written the same way while their meanings differ.
The difficulty of defining a semantic similarity function lies in the fact that the meanings of two words can be compared only after a valid morphosemantical analysis. Hence the need to prepare the query before it is submitted to the data retrieval system.
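The limitation of root comparison can be illustrated with a minimal Python sketch. It is an illustration only: the hand-made root table `ROOT_OF`, the English glosses in `GLOSS` and the function `same_root` are hypothetical stand-ins for a real stemmer and dictionary.

```python
# Hypothetical hand-made tables; a real system would use a stemmer and a dictionary.
ROOT_OF = {"طريق": "طرق", "طريقة": "طرق", "مطرقة": "طرق"}
GLOSS = {"طريق": "road", "طريقة": "method", "مطرقة": "hammer"}


def same_root(a: str, b: str) -> bool:
    """Root-based comparison: two words match when their roots are identical."""
    return ROOT_OF[a] == ROOT_OF[b]


# The two words share a root, so a root-based comparison treats them as equal,
# although their dictionary meanings differ.
pair = ("طريقة", "مطرقة")
print(same_root(*pair), GLOSS[pair[0]], GLOSS[pair[1]])
```

Under these assumptions the comparison reports a match for two words whose glosses differ, which is the case a morphosemantical analysis is meant to separate.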

II. STATE OF THE ART
Most work on Arabic syntactic analysis has led to laboratory prototypes. Indeed, to our knowledge, no parser has yet been marketed or distributed for scientific research. In the remainder of this section, we present a few Arabic language analysis systems.
A. The AraParse system for syntactic analysis of unvowelized Arabic
AraParse is a system for analyzing Arabic texts in their vowelized, unvowelized or partially vowelized form [1]. The objective is to build a core morpho-syntactic analysis system that can be reused in other applications such as information retrieval and machine translation.
To recognize unknown sequences or unknown words, this system uses an approximate matching technique implemented in the AGFL formalism, using regular expressions and the priority operator between the alternatives of a rule [1].
Ouersighni [1] proposed using AraParse to detect and diagnose agreement errors. He used the agreement rules proposed by Belguith [4] in the DECORA system.

B. The DECORA system for detection and correction of Arabic agreement errors
In research on the analysis of the Arabic language, Belguith [4] proposed a method for detecting and correcting agreement errors. This method has been implemented in the DECORA system. It is based on syntagmatic analysis for error detection and on multi-criteria analysis for correction. An extended phrase is defined as a group of one or more initial phrases linked by agreement relations.
Extended syntagmatic analysis operates in two stages ([5], [6]). The first stage splits the input sentence into initial phrases by locating the boundaries between them. This division is guided by a set of rules that use syntagmatic boundaries to identify the initial phrases. The second stage builds the extended phrases. Their construction is guided by surface indicators and is based on a set of rules that locate the agreement links between the initial phrases. These rules make it possible, for example, to attach possessive pronouns to the phrases they refer to, to integrate verbs into the phrase that contains their subject, and to attach initial phrases that represent anaphoric clauses to the phrase containing the syntactic unit they refer to.

C. Spoken Levantine Arabic analyzer
Chiang [7] is interested in the analysis of Levantine Arabic (AL), a group of Arabic dialects spoken in Syria, Palestine, western Jordan and Lebanon. He proposed an approach that translates AL into Modern Standard Arabic (MSA) and then links the AL sentence to the corresponding analysis in MSA.
Note that machine translation is particularly difficult when no resources such as parallel texts or transfer lexicons are available. Chiang therefore relies primarily on an annotated corpus of Modern Standard Arabic (the MSA Treebank) [8] and on an annotated corpus of Levantine Arabic, more specifically of the Jordanian dialect (i.e., the TBPC Treebank [10]).
He built a lexicon containing AL/MSA pairs of word forms, as well as a synchronous MSA-dialect grammar. He assumes that each tree in the MSA grammar extracted from the MSA Treebank is also a tree of Levantine Arabic, given the syntactic similarity between MSA and AL.

D. A morphological and syntactic analyzer for Arabic text
Debili and Zouari [9] proposed the automatic construction of a dictionary containing all the inflected forms. This construction is carried out by a conjugator and a derivator.
The principle of the morphological analysis is to perform:
• the division of the text into graphical words;
• the search for the enclitics and proclitics of each word;
• the verification, for each possible division, of compatibility (proclitic/enclitic, enclitic/root, root/proclitic).
When consulting the dictionary, Zouari and Debili use rewrite rules to find the "normal" form of the word.
The parsing process follows the morphological analysis phase and relies on the construction of binary and ternary frequency matrices of precedence. These matrices are built from texts annotated by hand (the learning phase) and are then used to analyze new texts.
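One way to picture such precedence matrices is as counts of adjacent pairs and triples of grammatical categories in the annotated texts. The Python sketch below assumes that reading; the tag names, the toy corpus and the function `precedence_counts` are hypothetical and are not taken from the cited system.

```python
from collections import Counter


def precedence_counts(tagged_sentences):
    """Count adjacent tag pairs (binary) and tag triples (ternary)."""
    binary, ternary = Counter(), Counter()
    for tags in tagged_sentences:
        binary.update(zip(tags, tags[1:]))
        ternary.update(zip(tags, tags[1:], tags[2:]))
    return binary, ternary


# Hypothetical hand-annotated sentences, reduced to their category sequences.
corpus = [["PREP", "NOUN", "ADJ"], ["VERB", "NOUN", "ADJ"]]
binary, ternary = precedence_counts(corpus)
print(binary[("NOUN", "ADJ")], ternary[("PREP", "NOUN", "ADJ")])
```

Counts of this kind, learned from annotated texts, could then be used to prefer the category sequences that are most frequent when analyzing a new text.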

E. IRLA analyzer
The IRLA analyzer is a system for querying in Arabic natural language [11]. It takes an Arabic sentence as input and translates it into a query to be run by an operating system. This parser handles a subset of natural language (essentially imperative sentences) and produces a parenthesized form expressing the semantics of the query [12].
The parser can handle some simple linguistic problems (synonymy, negation, coordination). Its analysis is based on the detection of conceptual and linguistic surface indicators.

F. Elliptical sentence analyzer
In research on Arabic analysis, Haddar [13] developed a parser for the detection and resolution of elliptical sentences in Arabic texts. This parser is based on a method of syntactic analysis that verifies the syntactic structures of clauses. The method uses the rules of a formal grammar that generates verbal clauses written in Arabic. Access to these rules is controlled by augmented transition networks (ATN). The parser is coupled with another parser that handles semantic ellipses.

III. CONTRIBUTION
The calculation of relevance in our approach centers on a semantic similarity function that gives its result as a percentage of equivalence between two Arabic words. Since these words are written in various derived forms, we have to begin with a morphological analysis that returns the origin of the derived form in question. Separating the affixes from the word yields the unvowelized origin of the word, which may refer to several meanings. The meaning most likely to be correct is the one that is consistent with the user profile. To filter out the true meaning, we developed an automatic profiling system that collects user queries and stores them in an indexed format in a database. Our approach has given good results on morphosemantical ambiguity. In the remainder of this article, we present the various stages of analysis that we introduced into the relevance calculation [17] [18].

G. The morphological analysis
After a query is sent through the meta-search engine, we get a list of results. This list is sorted according to the relevance algorithm used by each data source (search engine).
We begin by re-indexing the documents found, using a semantic analysis module. This module receives three parameters: the document, the query and the user profile. The document and the query are modified in place by a step that dissects the words to remove affixes [13]. If there are several possibilities (e.g., بطريق can be read as بطريق, 'penguin', or as ب + طريق, 'by road'), we test the consistency of each derived form with the other words in the query, the document and the profile. We then accept only the possibility that fits the current context [5]. Separating the terms of the query and the document makes the similarity of the triplet (document, query, profile) more precise, hence the need for a flexible, fast and easy method [17].
The stemming methods in use cannot resolve the problem of semantics, because reducing a word to its root means that several forms can be derived from it, producing a set of words that are not necessarily semantically similar [6]. For example, the root of the word 'طريق' is 'طرق'. This root has several derived forms, such as 'مطرقة' and 'طريقة', which do not mean the same thing. We therefore have to find the origin of the word while preserving its semantic content. The method we introduce dissects the word to extract its origin after applying light stemming. The origin is then put into the singular to test whether it exists in the dictionary of Arabic words (the ARRAMOOZ ALWASEET dictionary). If the word exists, we retrieve its definition and its type to construct a semantic entity (SE). These SEs are used to test the consistency of the word in a text. For this purpose, we present the following heuristic algorithm.
At the first entry into the 'derive' function, the word 'بطريقنا' is not found in the set of words M. We test whether the word ('Mot' in the algorithm) begins with a prefix. If so, we remove the prefix to obtain the new word 'طريقنا' (En: our path), which is passed as a parameter to the recursive function 'derive'. On entry, the word 'طريقنا' is indefinite (نكرة), is not in the set of words, and does not start with a prefix. We therefore move on to the second test, on the suffix. The word ends with the suffix 'نا' (En: our), which is also removed to obtain the new derived word 'طريق' (En: path), which is in turn passed to the third level of the recursive function. On entry, we test again whether the word 'طريق' exists in the set of Arabic words. This time the word is found and is added to the set D of derived words.
On returning to the first level of recursion, the word 'بطريقنا' must undergo the suffix test. We remove the suffix 'نا' to obtain the word 'بطريق' (En: penguin). This word is passed as a parameter to the function 'derive', which tests whether it exists in the set M. The word 'بطريق' is also found in M and is added to the set of derived words D. The following figure illustrates the changes [14]. In this way, the statistical parser and the profiling algorithm receive well-structured data. We will next see how the morphological analysis is used to produce valid semantic representations [11].
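The walk-through above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the toy word set `M`, the single-entry prefix and suffix lists, and the choice to stop recursing once a word is found in `M` are assumptions made for illustration.

```python
# Hypothetical toy data: a real system would use the full dictionary and affix lists.
M = {"طريق", "بطريق"}   # set of known Arabic words
PREFIXES = ("ب",)        # "by"
SUFFIXES = ("نا",)       # "our"


def derive(word, derived=None):
    """Collect in `derived` (the set D) every dictionary word reachable by stripping affixes."""
    if derived is None:
        derived = set()
    if word in M:
        # Assumption: a dictionary hit ends this branch of the recursion.
        derived.add(word)
        return derived
    # First test: strip a prefix and recurse on the remainder.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            derive(word[len(prefix):], derived)
    # Second test: strip a suffix and recurse on the remainder.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            derive(word[:-len(suffix)], derived)
    return derived


D = derive("بطريقنا")
print(D)
```

With this toy data, the call reproduces the two readings discussed in the text: the prefix branch reaches 'طريق' and the suffix branch reaches 'بطريق'. Choosing between them is left to the later consistency checks.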

H. Semantic gene building
The semantic gene is an object containing the information (drawn from a database) needed to disambiguate Arabic words. Construction of the gene starts during morphological analysis, when the origin of the word is determined. The following diagram illustrates the format of the semantic gene [17].
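As a rough illustration, a semantic gene could be modeled as a small record type. The sketch below is an assumption, not the format shown in the diagram: the field list follows the description given in the conclusion (context, definition, word type, morphological form, successors, predecessors), while the class name, field names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemanticGene:
    """Hypothetical record for one disambiguated word."""
    origin: str                    # word origin found by the morphological analysis
    definition: str                # definition retrieved from the dictionary
    word_type: str                 # e.g. object or property
    morphological_form: str        # surface form seen in the query or document
    context: str = ""              # context inferred from the user profile
    successors: List[str] = field(default_factory=list)
    predecessors: List[str] = field(default_factory=list)


# Sample values are illustrative only.
gene = SemanticGene(
    origin="طريق",
    definition="path",
    word_type="object",
    morphological_form="بطريقنا",
)
gene.predecessors.append("ب")
print(gene)
```

Using `default_factory` gives each gene its own successor and predecessor lists, so linking one gene does not alter another.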

I. Analysis of two-word queries
The need is expressed as a multi-word query. The information retrieval system returns a set of documents that contain all the semantically valid sentences, including those that contain the desired word. This word can mean several things, hence the problem of semantic ambiguity.
To reduce the effect of this type of ambiguity, we designed a semantic filtering system that recognizes the type of a word based on the rules governing the formation of Arabic sentences. Given the difficulty of semantic analysis of Arabic sentences, we consider the case of meaningful sentences of two lemmas. The following table shows the different cases of a semantically consistent two-word sentence [7]. Note that some cases are prohibited (such as "O و P"). We therefore designed a set of templates covering all possible cases of two-lemma sentences. This set of templates is an array of objects in which each element describes a sentence model (pattern). The correction process is first applied to the user's list of semantic entities (the enriched query) to remove inconsistent morphosemantical variants [12]. The remaining lists are then sent to the contextual correction system, which uses the contextual corpus to filter the list again. The result is one or more lists of semantic entities that are consistent at both the contextual and the semantic level. Finally, the retrieval system receives a sequence of semantic genes containing all the information that can help in the extraction, selection and filtering of relevant documents [8].
Some words can be either objects or properties. Our approach handles this by recording the type of the word in the gene [17] [18].
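The template-based filtering can be sketched as a lookup against a list of allowed patterns. The sketch below is illustrative only: it assumes that O and P stand for object and property, and the two entries in `TEMPLATES` are placeholders, since the actual table of cases is not reproduced here. The only case taken from the text is that "O و P" is prohibited.

```python
from itertools import product

# Placeholder templates (type of first lemma, connector, type of second lemma).
# These entries are hypothetical; the real set comes from the table of cases.
TEMPLATES = [("O", "", "P"), ("O", "و", "O")]


def consistent_readings(first_types, connector, second_types):
    """Keep the (type1, type2) readings that match one of the templates."""
    return [
        (a, b)
        for a, b in product(first_types, second_types)
        if (a, connector, b) in TEMPLATES
    ]


# A first lemma that may be an object or a property, followed by a property.
print(consistent_readings(["O", "P"], "", ["P"]))
# The prohibited case "O و P" matches no template and is filtered out.
print(consistent_readings(["O"], "و", ["P"]))
```

Each surviving pair fixes the type of both lemmas, which is the information the gene then records.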
This aspect has been widely addressed for Latin-script languages (English, French, ...) and even in some work on Arabic. Indeed, research that uses the user profile to reduce noise and silence in information retrieval has yielded satisfactory results, especially when the user profile and the search domain are modeled with ontologies. However, when the query terms are ambiguous, the domain to choose cannot be guessed. Hence, we must prepare the query to reduce morphosemantical ambiguity, then infer the context from the user profile and create genes to clarify the semantic field and the context intended by the user. The following diagram illustrates the various steps of our approach [16] [17].
To test the effectiveness of our method, we developed a test meta-search engine that uses "Bing", "Yahoo" and "Yandex" as data sources. We then compared our results with Google's results. We obtained the following table after running 100 queries [17].

V. CONCLUSION AND PERSPECTIVES
In this article, we introduced the concept of the semantic gene, which contributes to the elimination of ambiguity in information retrieval systems. We also explained how to create semantic genes from morphological, contextual and semantic analysis, and how to differentiate between homonyms. Automatic profiling is also a useful factor in getting closer to the needs of users.
Our goal is to automatically create semantic graphs whose nodes are semantic genes rich in side information, where each node has a context, a definition, a word type, a morphological form, a list of successors and a list of predecessors. Finally, we wish to develop a meta-search engine that can return optimal results.