The Paradox of the Fuzzy Disambiguation in the Information Retrieval

Current methods of data mining, word sense disambiguation
in information retrieval, semantic relations, fuzzy set
theory, fuzzy description logic, fuzzy ontologies and their
implementations overlook a paradox, called here the
paradox of the fuzzy disambiguation. The paradox lies in the
fact that precise knowledge can be obtained from fuzzy data
combined with the experts' knowledge. This paper introduces a
conceptual apparatus to describe this paradox and formulates
an information retrieval logic. Certain applications of this
logic to searching for information on the Web are also
suggested.


I. INTRODUCTION
In recent work, information retrieval (IR) on the Semantic Web usually means searching for a reliable source of information. So far, information retrieval systems and semantic relation systems have indicated only the source most semantically similar to the searched information [13], [14].
Semantic relationships are typically defined by measuring the incidence of keywords. However, the meaning of these words must then be exactly identified and represent certain knowledge. Therefore, these IR methods cannot always be used.
Sometimes, information retrieval about an object leads to uncertain knowledge described in an appropriate ontology language. In spite of this uncertainty, a disambiguated source of information about this object can be found. In this way, compliance with the description of the object model (compliance with the thesaurus) is obtained. The situation described above is called here the paradox of the fuzzy disambiguation in the information retrieval. Methods of data mining [11], word sense disambiguation in information retrieval [2], [7], [8], [13], [16], semantic relation [7], [8], fuzzy set theory and fuzzy logic [10], fuzzy description logic [3], [4], [17], [18], and their implementations (e.g. in the OWL language) do not address this paradox. The following two examples illustrate it.

A. Example 1
Data from the user of iron X: the fabric Y shrank and became wavy after ironing.
Data from the expert: some threads of the fabric Y shrink, but only in water steam at 100 °C. Only program 1 uses water steam.
The data from the user of iron X are uncertain if we want to find out which program was used for ironing. However, these uncertain data yield precise information: program 1, with steam, was turned on during ironing. The reasoning in this case uses specific data from the expert. The expert's knowledge excludes the situation in which the fabric was only heated by the iron during ironing.

B. Example 2
Are "Ralf Möller", a German bodybuilder and TV actor, and "Ralf Möller", a professor at the Hamburg University of Technology, the same person?
Data from Web resources: Ralf Möller, the actor, was born in 1959; Ralf Möller, the professor, was born when the German Chancellor was Ludwig Erhard (historically, between 1963 and 1966).
Data from the expert: any attribute of a single person has only one value.
The data in the Web resources are ambiguous and uncertain. The names of the people are: name(person1) = "Ralf Möller" for the actor and name(person2) = "Ralf Möller" for the professor. Is person1 = person2? Complementing the uncertain knowledge of the attribute values, birthDate(person1) = 1959 and birthDate(person2) ≥ 1963, with the expert knowledge, it can be concluded that person1 ≠ person2. The result of the reasoning is accurate information.
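The reasoning of Example 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the uncertain birth date is modelled as a closed interval of years, and the names `YearInterval` and `may_be_same_person` are hypothetical helpers that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearInterval:
    """Closed interval of possible birth years (an uncertain attribute value)."""
    low: int
    high: int

    def overlaps(self, other: "YearInterval") -> bool:
        # Two intervals share at least one year.
        return self.low <= other.high and other.low <= self.high


def may_be_same_person(name1, born1, name2, born2):
    """Expert rule: any attribute of a single person has only one value,
    so two descriptions can denote one person only if every attribute agrees."""
    return name1 == name2 and born1.overlaps(born2)


actor = ("Ralf Möller", YearInterval(1959, 1959))
# Assumption: "born during Erhard's chancellorship" is read as 1963..1966.
professor = ("Ralf Möller", YearInterval(1963, 1966))

same = may_be_same_person(*actor, *professor)
print(same)  # False
```

Although the second birth date is known only as a range, the single-value rule supplied by the expert is enough to give a sharp answer.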

II. RESIDUUM RULE IN IR
Information retrieval on the Semantic Web is based on finding copies of data which are: 1) values of single-attribute arguments, i.e. concepts - data representing knowledge of certain properties or object types; 2) values of multi-attribute arguments, i.e. roles - data representing knowledge about relationships between objects. First, concepts and roles are described in the language of the Description Logic (DL) [1]. Second, the DL describing concepts and roles is extended with some first-order logic formulas. Then, in this extended logic, the thesaurus is created: a language describing the reference concepts and roles. The ontology, in turn, describes the real concepts and roles found on Web pages, which are the ones searched for. If the interpretation of concepts and roles from the ontology, in accordance with the experts' knowledge and criteria, results in the interpretation of the concepts and roles of the thesaurus, then this relationship is called the residuum. This interpretation determines the membership degree of the data copies (the set of Web addresses X). This degree also takes into account the semantic structure of the resources, determined by the Semantic Web.
www.ijarai.thesai.org
Because knowledge is fuzzy to some degree, concepts and roles can be interpreted as fuzzy sets in the space X ∪ X × X of Web addresses and their pairs. The knowledge can then be fuzzified [5], whereas determining the residuum requires defuzzifying the knowledge [5]. Then, for a given query, a reliable-for-experts set of Web addresses representing this knowledge can be indicated. The interpretation sets forming the residuum are further treated as the information search result. The following search rule is then applied.
First, the question (search query) is compared to the thesaurus. Second, if for all interpretations the set of searched addresses is empty, the question is compared to the ontology, so that the compared entry has the meaning most similar to the thesaurus. The set of Web addresses found for this entry represents knowledge which is identical or most similar to the searched knowledge. This IR rule is called the residuum rule. The information retrieval logic (IRL) using this rule is introduced below. Applying this logic to information search is an attempt to develop a new, more universal IR method using artificial intelligence.
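The two steps of the residuum rule can be sketched as a simple lookup with a fallback. The sketch below is an assumption-laden illustration, not the method of the paper: the thesaurus and ontology are modelled as plain dictionaries, the expert-assigned similarity degree is given directly as a number in [0, 1], and the function name `residuum_search` and the example.org addresses are hypothetical.

```python
def residuum_search(query, thesaurus, ontology):
    """Return the set of Web addresses for `query` following the residuum rule.

    thesaurus: dict mapping a reference term to a set of addresses
    ontology:  dict mapping a term to a list of (similarity, addresses) entries,
               where similarity in [0, 1] is the assumed expert-assigned degree
               of similarity of that entry to the thesaurus
    """
    # Step 1: compare the query with the thesaurus.
    addresses = thesaurus.get(query, set())
    if addresses:
        return addresses
    # Step 2: fall back to the ontology entry closest in meaning to the thesaurus.
    entries = ontology.get(query, [])
    if not entries:
        return set()
    _, best_addresses = max(entries, key=lambda entry: entry[0])
    return best_addresses


thesaurus = {"steam ironing": {"http://example.org/manual"}}
ontology = {
    "fabric shrinkage": [
        (0.4, {"http://example.org/forum"}),
        (0.9, {"http://example.org/expert-note"}),
    ]
}

print(residuum_search("steam ironing", thesaurus, ontology))
print(residuum_search("fabric shrinkage", thesaurus, ontology))
```

The first query is answered from the thesaurus; the second is not a thesaurus term, so the ontology entry with the highest similarity degree is returned.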

III. LANGUAGE OF IRL
In the context of Semantic Web research [1], [3]-[5], the representation of knowledge in the Semantic Web can be defined by the attribute language (AL) of the Description Logic (DL) [1]. Knowledge is then represented by concepts (TBox), roles (RBox) and assertions (ABox). The Semantic Web may be extended with edges representing relationships between concepts and roles. Descriptions of these edges are called axioms. Knowledge is then represented by two systems: the terminology, called TBox, and the set of assertions, called ABox, where an assertion is a relationship between a concept or a role and its instances.
The syntax of the language AL for the IRL is presented below, by analogy with the fuzzy Description Logic (fuzzyDL) [3]. Articles [10], [11] present the semantics and the interpretations of this language, which are called the fuzzification and the defuzzification.

A. Syntax of TBox
The following names are included in the set of concept and role names: the universal concept ⊤ (Top) and the empty concept ⊥ (Bottom).
The universal concept includes all instances of concepts, and the empty concept has no instances.
Let C, D be names of concepts, R be the name of a role, and m be a modifier. Then complex concepts are: ∀R.C - the universal quantification; it means all occurrences of the concept C which are in role R with some occurrence of the concept C; m(C) - the modification m of the concept C; it means the concept C modified by the word m. For example, m can be one of the words: very, more, the most, high, higher or the highest.
Concepts which are not complex are called atomic.

B. Syntax of ABox
For any concept instances t_1, t_2, the concept name C and the role name R, the assertions are "t_1 : C" and "(t_1, t_2) : R". We read them: t_1 is an instance of the concept C; the pair (t_1, t_2) is an instance of the role R.
For any concept instances t_1, t_2, the concept name C and the role name R, the assertions with membership degree α are "⟨t_1 : C, α⟩" and "⟨(t_1, t_2) : R, α⟩". We read them: t_1 is an instance of the concept C with membership degree α; the pair (t_1, t_2) is an instance of the role R with membership degree α.
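The TBox and ABox syntax above maps naturally onto small immutable data types. The following is a minimal sketch, not part of the paper: all class names are hypothetical, only some concept constructors are shown, and treating an assertion without a degree as having degree 1 is an assumption made for convenience.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atomic:
    """An atomic concept, identified by its name."""
    name: str


@dataclass(frozen=True)
class ForAll:
    """The universal quantification ∀R.C."""
    role: str
    concept: "Concept"


@dataclass(frozen=True)
class Modified:
    """The modification m(C), e.g. very(C)."""
    modifier: str
    concept: "Concept"


Concept = Union[Atomic, ForAll, Modified]


def _check_degree(degree: float) -> None:
    if not 0.0 <= degree <= 1.0:
        raise ValueError("membership degree must lie in [0, 1]")


@dataclass(frozen=True)
class ConceptAssertion:
    """The assertion ⟨t : C, α⟩; degree 1.0 is assumed when α is omitted."""
    instance: str
    concept: Concept
    degree: float = 1.0

    def __post_init__(self):
        _check_degree(self.degree)


@dataclass(frozen=True)
class RoleAssertion:
    """The assertion ⟨(t_1, t_2) : R, α⟩; degree 1.0 is assumed when α is omitted."""
    source: str
    target: str
    role: str
    degree: float = 1.0

    def __post_init__(self):
        _check_degree(self.degree)


shrunk = ConceptAssertion("fabricY", Modified("very", Atomic("Shrunk")), 0.8)
ironed = RoleAssertion("ironX", "fabricY", "ironed", 0.6)
print(shrunk)
print(ironed)
```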

C. Syntax of axioms TBox
For any concept names C, D and any number α ∈ [0, 1], the axioms are:

IV. POSTULATES OF FUZZYDL AND FUZZY DISAMBIGUATION IN IR
The occurrence of the paradox of the fuzzy disambiguation in IR is determined by the following postulates (P1-P9):
P1. There is a thesaurus which is a set of certain reference terms and formulas of the IRL language. Thesaurus terms represent knowledge in the same area as the searched information and can be found in a text document (the thesaurus document).
P2. An ontology includes all terms and formulas of the IRL language which are semantically related to the searched information. The ontology includes the thesaurus. All thesaurus formulas are constructed from certain terms and assertions belonging to the base set Tez. Likewise, all ontology formulas are constructed from certain terms and assertions belonging to the base set Ont. The degree of semantic similarity of ontology expressions to thesaurus expressions is determined by an expert system based on the experts' knowledge.
P3. The space of IR is a group of addresses of knowledge resources on the Web which are semantically related to the terms and formulas representing the searched information. Knowledge resources are text documents available at these addresses.
P4. Finding information means searching for the text document whose semantic structure (terms from the Semantic Web created by the ontology) is most similar to the structure of the thesaurus document. If both documents contain expressions that are used equally by agents in the communication process, then these expressions represent the same knowledge. Furthermore, information retrieval is a search for the text document that represents the same knowledge as, or the knowledge most similar to, that from the thesaurus [6]. Therefore, the residuum rule of information retrieval is applied.
P5. The information retrieval of the intersection of concepts represents the collective knowledge of these concepts. Complementary formulas ϕ, φ are formulas whose combination ϕ&φ represents the collective knowledge represented by the data sets of these formulas.
P6. According to the intuition and practice of IR, if x ∈ [0, 1] is the degree of semantic similarity of the instance t_1 of a concept C_1 or of a formula ϕ to a concept instance or formula belonging to the thesaurus, and y ∈ [0, 1] is the degree of semantic similarity of the instance t_2 of a concept C_2 or of a formula φ to a concept instance or formula belonging to the thesaurus, then the degree of the intersection of the concepts C_1, C_2 or of the complementary formulas ϕ&φ is the number x • y. The operation • : [0, 1] × [0, 1] → [0, 1] is some t-norm. This t-norm has the following properties, for all x, y, z, x_0, y_0 ∈ [0, 1]:
Each t-norm uniquely determines its corresponding implication → (the residuum), which defines the similarity degree of implications between formulas and satisfies, for all x, y, z ∈ [0, 1]:
or
The implication of these formulas is semantically similar to the thesaurus formulas if the similarity degree of the antecedent of the implication is closest to that of its consequent. Furthermore, this means that the found text document contains a formula ϕ which implies a certain formula φ that is contained in this document or represents the searched information. When φ represents the searched information and is not contained in this document, the formula ϕ is supplemented with a formula θ representing the experts' knowledge, so that the complementary formulas ϕ&θ have the same similarity degree as the search formula φ (a degree equal to 1). Thus, in the case of an imprecise antecedent ϕ of the implication, the consequent is a sharp expression with semantic similarity degree equal to 1 (Example 1).
P7. If a conjunction ϕ∧φ occurs in the text document, then, according to the classical propositional calculus, ϕ ⇒ φ. Thus, the conjunction is recognized by first recognizing the formula ϕ and then recognizing the implication ϕ ⇒ φ. Therefore, the conjunction ϕ∧φ is recognized as ϕ&(ϕ ⇒ φ). It is further assumed that:
Hence, the similarity degree of the conjunction ϕ ∧ φ to the thesaurus formulas is defined:
where x, y are the similarity degrees of the formulas ϕ, φ to the thesaurus formulas.
P8. If an alternative ϕ∨φ occurs in the text document, then, based on propositional logic, it can be assumed that this alternative is recognized based on the following assignment:
Hence, the similarity degree of the alternative ϕ ∨ φ to the thesaurus formulas is defined:
where x, y ∈ [0, 1] are the similarity degrees of the formulas ϕ, φ to the thesaurus formulas.
P9. The algebra BL = ⟨L, ⊗, ⊕, •, →, 0, 1⟩ is a regular residuated lattice (or a BL-algebra). It is an algebra such that: 1) ⟨L, ⊗, ⊕, 0, 1⟩ is a complete lattice with the largest element 1 and the least element 0; 2) ⟨L, •, 1⟩ is a commutative semigroup with the unit element 1, i.e. • is commutative, associative, and 1 • x = x for all x; 3) the following conditions hold (for all x, y, z ∈ [0, 1]):
In the BL-algebra, the operations of complement and equivalence ↔ can be defined:

V. FUZZIFICATION
The expressions of the IRL logic are interpreted in the regular residuated lattice BL = ⟨L, ⊗, ⊕, •, →, 0, 1⟩ and in the chosen ordered algebra of fuzzy sets:
where, for the space X ∪ X × X, F is a family of fuzzy sets µ : X ∪ X × X → [0, 1], described as follows. For any fuzzy set µ there are exactly two fuzzy sets µ_1 : X → [0, 1] and µ_2 : X × X → [0, 1]:
F is the set of all fuzzy sets in the F algebra to which only the operations and relation mentioned below apply, described by the t-norm • [8], inclusion, equality and modification norm.
The operation: of equality of fuzzy sets [8]. These operations are defined in the regular residuated lattice BL = ⟨L, ⊗, ⊕, •, →, 0, 1⟩ as follows. For any fuzzy sets µ_A, µ_B ∈ F and x ∈ [0, 1]:
The symbol 0_F is any fuzzy set with values 0 only, the symbol 1_F is any fuzzy set with values 1 only, M is a set of one-argument operations f : [0, 1] → [0, 1] called the modification functions, and F_0 is a subset of F. Let X be the set of all objects (data copies) which are part of the Semantic Web, and let X × X be the set of all ordered pairs of elements of X. Then the interpretation I = (F, ·^I) can be described, which:
F1. For the concept instances t assigns certain values t^I ∈ X, and for the pairs of instances (t_1, t_2) assigns pairs (t_1^I, t_2^I) ∈ X × X. Most frequently, concept instances are associated with data copies. These copies are considered by IT specialists as objects. Thus, the space X ∪ X × X is a set of Web resources which include documents with the considered data. For example, a specific word on a computer screen is an instance of a data copy indicated by a specific Web resource address. Likewise, the relationship between this word and another word is indicated by a pair of Web resource addresses.
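The t-norm • and its residuum → from postulate P6, together with the reading of a conjunction as ϕ&(ϕ ⇒ φ) from postulate P7, can be tried out numerically. The sketch below is illustrative only: the paper does not fix a particular t-norm, so the three standard examples (product, Łukasiewicz and Gödel) are assumptions, as are all function names. The check `satisfies_adjunction` tests the usual residuation condition, z ≤ (x → y) if and only if x • z ≤ y, on a finite grid of exact fractions.

```python
from fractions import Fraction
from itertools import product


def t_product(x, y):
    return x * y


def r_product(x, y):
    """Residuum of the product t-norm; x > y >= 0 guarantees x != 0."""
    return 1 if x <= y else y / x


def t_lukasiewicz(x, y):
    return max(0, x + y - 1)


def r_lukasiewicz(x, y):
    return min(1, 1 - x + y)


def t_goedel(x, y):
    return min(x, y)


def r_goedel(x, y):
    return 1 if x <= y else y


def satisfies_adjunction(t, r, grid):
    """Check z <= (x -> y)  iff  x • z <= y for every triple of grid points."""
    return all((z <= r(x, y)) == (t(x, z) <= y)
               for x, y, z in product(grid, repeat=3))


def conjunction_degree(t, r, x, y):
    """Degree of ϕ ∧ φ read as ϕ & (ϕ ⇒ φ), i.e. x • (x → y)."""
    return t(x, r(x, y))


# Exact rational grid 0, 1/10, ..., 1 avoids floating-point rounding issues.
grid = [Fraction(k, 10) for k in range(11)]
pairs = {
    "product": (t_product, r_product),
    "Lukasiewicz": (t_lukasiewicz, r_lukasiewicz),
    "Goedel": (t_goedel, r_goedel),
}
for name, (t, r) in pairs.items():
    print(name, satisfies_adjunction(t, r, grid))
```

For each of these three pairs, x • (x → y) evaluates to min(x, y) on the grid, so the conjunction degree does not depend on which of them is chosen, whereas the degree x • y of complementary formulas ϕ&φ does.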

A. Semantics of concepts TBox
For any x ∈ X, concept names C, D, the role name R and the modifier m:
⊥^I(x) = 0    (25)

B. Semantics of assertions ABox
For any instance t of the concept C and any instances t_1, t_2 of the role R:

D. Semantics of formulas
For any formulas (assertions and axioms) ϕ, φ and degree α ∈ [0, 1]:
(∃xϕ(x))^I = sup{y ∈ [0, 1] : there exists an instance t of the concept T such that y = (ϕ(t))^I}    (43)
When the interpretation function I satisfies the conditions F1-F5 and (24)-(45), it is called the fuzzification of the IRL logic language. If the fuzzification yields only characteristic functions, then this interpretation is called exact. It is then equivalent to the standard interpretation of the description logic DL [2], and the conditions F1-F5 and (24)-(37) are satisfied.
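Equation (43) and the notion of an exact interpretation can be illustrated on a finite set of instances, where the supremum is simply the maximum. This is a minimal sketch under assumptions: the function names and the example.org addresses are hypothetical, and returning 0 for an empty set of instances follows the convention that the supremum of the empty subset of [0, 1] is 0.

```python
def exists_degree(phi_degree, instances):
    """Degree of (∃x ϕ(x)): the supremum of (ϕ(t))^I over the given instances."""
    return max((phi_degree(t) for t in instances), default=0.0)


def is_exact(degrees):
    """An interpretation is exact when every membership degree is 0 or 1."""
    return all(d in (0.0, 1.0) for d in degrees)


# Hypothetical degrees of ϕ(t) for two Web addresses t.
degrees = {"http://example.org/a": 0.3, "http://example.org/b": 0.7}

print(exists_degree(degrees.get, degrees))  # 0.7
print(is_exact(degrees.values()))           # False
print(is_exact([0.0, 1.0, 1.0]))            # True
```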

VI. BASIC INFORMATION RETRIEVAL LOGIC
The Information Retrieval System can be extended to search for reliable-for-experts subsets of X ∪ X × X, where X is a set of Web resources relating to a chosen field of knowledge. For a given query, these subsets indicate reliable-for-experts Web addresses representing the searched knowledge. Therefore, these sets can be used to interpret the IRL expressions reliably.
For this purpose, as in statistics, the confidence range V is used. The most important requirement is that all experts, on the basis of the confidence range V, accept the set of membership degrees of an object to the fuzzy set. This set represents the fuzzification of the concept or the role defined by the knowledge base K = ⟨Tez, Fuz, V, Ont⟩. Tez is a set of concepts and roles from the thesaurus, and Ont is a set of concepts and roles from the ontology, defined according to postulate P2. Fuz is a set of possible-to-use interpretations (fuzzifications), defined according to postulates P1-P9, conditions F1-F5 and (24)-(45). All IRL formulas are built from the set Tez ∪ Ont and are interpreted in the F algebra by means of the fuzzification set Fuz. The formula ϕ is true in the knowledge base K when ϕ^I = 1 for any fuzzification I ∈ Fuz. Since the interpretations of formulas are at the same time interpretations in the BL-algebra, which is uniquely defined by these interpretations (the F algebra), we write val_BL(ϕ) = 1. Thus, the set of all IRL formulas belongs to the class of formula sets of the fuzzy logic BL∀ interpreted in the BL-algebra. These logics were studied by Hájek [9]. In BL∀ there are the following inference rules: 1) Modus ponens: from ϕ and ϕ → φ infer φ; 2) Substitution rule: any formulas can be substituted for the propositional variables; 3) Generalization: from ϕ infer ∀xϕ(x).
Theorem (Soundness and Completeness). Let ϕ be a formula of BL∀ and T be the set of all formulas of a BL∀ theory. Then the following conditions are equivalent (for the proof, see [9]): 1) the formula ϕ is derived from the theory T with the use of the inference rules; 2) val_BL(ϕ) = 1 for each BL-algebra (with infinite intersection and infinite union) that is a model of T.
The IRL is the two-variable fragment of the second-order logic. The values of all predicate variables are concepts and roles. However, the validities of a monadic predicate calculus with identity are decidable. When the formulas with predicate variables and with roles, which are predicates, are removed, we obtain the monadic predicate calculus. This fragment of the IRL is decidable. Some fragments of the IRL with roles are decidable, and some are known for their use [9], [12].

VII. DEFUZZIFICATION IN IRL
In this paper, the defuzzification is identified with the interpretation ⟨K, Def⟩ in the IRL logic. For a given query, this interpretation indicates a reliable-for-experts set of Web addresses representing knowledge from the knowledge base K.
For this purpose, for the knowledge base K = ⟨Tez, Fuz, V, Ont⟩ and for some fuzzification I ∈ Fuz, any concept C or any role R has an interpretation belonging to the set of the fuzzy confidence range V. This range is accepted by all experts and is defined by:
V(C) ⊆ {α : for some instance t of the concept C or I ∈ Fuz, α = (t : C)^I}    (46)
V(R) ⊆ {α : for some instances (t_1, t_2) of the role R or I ∈ Fuz, α = ((t_1, t_2) : R)^I}.
Experts consider knowledge which lies in the fuzzy confidence range to be exact knowledge and part of the possible interpretations. The designation of such subsets is identified with the knowledge defuzzification of the objects belonging to X or X × X [5].
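The role of the confidence range can be illustrated by turning a fuzzy set of Web addresses into a crisp one. The sketch below rests on assumptions that are not stated in the paper: V(C) is modelled as a closed interval of accepted degrees, a single fuzzification is given as a dictionary from addresses to degrees, and the name `defuzzify` and the example.org addresses are hypothetical.

```python
def defuzzify(membership, confidence_range):
    """Return the crisp set of addresses whose degree lies in the confidence range.

    membership:       dict mapping a Web address to its membership degree in [0, 1]
    confidence_range: pair (low, high), an assumed interval form of V(C)
    """
    low, high = confidence_range
    return {address for address, degree in membership.items()
            if low <= degree <= high}


membership = {
    "http://example.org/a": 0.95,
    "http://example.org/b": 0.40,
    "http://example.org/c": 1.0,
}

reliable = defuzzify(membership, (0.9, 1.0))
print(sorted(reliable))
```

Addresses whose degrees fall inside the range are treated as exact members of the concept; the rest are discarded.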
The function (.)^Def is called the defuzzification interpretation, or the defuzzification of the knowledge base K = ⟨Tez, Fuz, V, Ont⟩, if the following formulas are true. For any concepts C, D, roles R, R_1, R_2 and instances of concepts t, t_1, t_2:

¬C - the concept negation; it means all instances of concepts which are not an instance of the concept C;
C ⊓ D - the intersection of concepts C and D; it means all instances of both concepts C and D;
C ⊔ D - the union of concepts C and D; it means all instances either of the concept C or the concept D;
∃R.C - the existential quantification; it means all instances of the concept C which are in role R with at least one occurrence of the concept C;