A New Method to Build NLP Knowledge for Improving Term Disambiguation

Term sense disambiguation is essential for many NLP applications, including Internet search engines, information retrieval, data mining, and classification. However, earlier methods based on case frames and semantic primitives are not adequate for resolving term ambiguities, which require a large amount of information about the sentences. This paper introduces a system for building structured natural language knowledge. All surface case patterns are classified in advance with consideration of the meaning of the nouns. Moreover, this paper introduces an efficient data structure using a trie that defines the linkage among leaves and multi-attribute relations. Using these linked multi-attribute relations, we obtain frequent access between verbs and nouns together with automatic generation of hierarchical relationships. In our experiment, a large tagged corpus (Penn Treebank) is used to extract data. Around 11,000 verbs and nouns are used to verify the new method and to build hierarchy groups of the nouns. The term disambiguation accuracy of our trie structure method with links among leaves is 6 percentage points higher than that of the old method.
Keywords: Information Retrieval; NLP Knowledge; Disambiguation; Word Semantics; trie structure


I. INTRODUCTION
Natural language processing (NLP) systems use many dictionaries. In this paper, we discuss two types of information. The first is morphological information about morphemes, or words, and their fundamental attributes such as part of speech [11]; the second is semantic information such as semantic primitives [16][17][28].
Understanding what is implicit in events has attracted great interest in recent years. Because much about nouns is left implicit, most work assumes that nouns in natural language denote real-world entities. Lexical nouns name classes of entities, some of which are kinds and some of which are not. This is compatible with the view of compositional semantics in which nouns are one-place predicates: argument-taking functions that map individuals to truth values. Verbs, on the other hand, are viewed as n-place predicates, functions that map n-tuples to truth values. The (extensional) meaning of any sentence is composed by recursively combining functional terms with quantifiers, operators, and logical connectives.
First, generic knowledge of events, consisting of implications, is essential for understanding. For example, a person buys something because he wants it. Such knowledge incorporates the implications that buying is enabled by having enough money, and that asking implies that the subject of the asking wants something. The second type of knowledge classifies verbs with their subject, object, and place into groups, which are regarded as relations. It is important to design an implicit inference of nouns and a linking group that can efficiently integrate multi-attribute relations.
Implicit inference information is defined by knowing the deepest meaning of the verb, which determines deep knowledge about the nouns. Multi-attribute relation information is defined by a pair of basic words, and its record includes the attribute of the relation. Consequently, the problems are the very large space cost of storing all pairs and the frequent access to pairs and their attributes in the record. Since a word is basically a string, a trie, or digital search structure, must be introduced into the basic scheme. Relational information such as compound words is produced in large quantities and occupies a large space in the morphological dictionary. The basic IS-A knowledge of artificial intelligence (AI) also depends on term relationships.
A case frame [25][27] is an important technique for resolving ambiguity in syntactic and semantic analysis [20][21]. Machine translation systems between Japanese and English in both directions [25] require case frames to build translation dictionaries.
Aoe et al. [1][2][3][4] and Morita et al. [5] introduced a two-trie structure for storing compound words compactly. Morita et al. [6] presented a link trie. This paper presents an implicit inference of nouns, collects all knowledge about the sentence, and organizes it into linking groups with frequent access between verbs and their subjects, objects, and places. Moreover, by introducing a trie that can define the linkage among leaves, this paper presents an efficient data structure. The proposed structure therefore defines multi-attribute relationships between words that can be merged into the same record.
Section II of this paper describes relational information as multi-relations among terms with a case frame of the basic knowledge. The link trie and its integration with morphological information are presented in Section III. The proposed method is verified by simulation results in Section IV. In Section V, we discuss conclusions and potential future work.

A. Information Of Multi-Attribute Relation
MOR(x) is the morphological information for word x. Here we briefly discuss relational information, called a multi-attribute relation, for a finite set of relational attributes.
Multi-attribute relation information can be defined as a triplet (x, y, Alpha), where x and y are interrelated words and Alpha is the attribute of the relation. In natural language processing, a variety of attributes and the clearest meaning can be obtained by using relationships among words.
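The triplet (x, y, Alpha) can be illustrated with a minimal Python sketch. The class name `RelationStore` and the attribute labels such as "subject-verb" are assumptions introduced only for illustration; the sketch shows how several attributes of the same word pair could share one record, and it is not the storage scheme proposed in this paper (which uses a link trie).

```python
from collections import defaultdict


class RelationStore:
    """Hypothetical store for multi-attribute relations (x, y, attribute)."""

    def __init__(self):
        # (x, y) -> set of attributes; all attributes of a pair share one record
        self._records = defaultdict(set)

    def add(self, x, y, attribute):
        """Record that x and y are related by the given attribute."""
        self._records[(x, y)].add(attribute)

    def attributes(self, x, y):
        """Return the attributes relating x to y (empty set if none)."""
        return set(self._records.get((x, y), set()))

    def related(self, x):
        """Return every y related to x by at least one attribute."""
        return {y for (a, y) in self._records if a == x}


store = RelationStore()
store.add("Atlam", "buy", "subject-verb")      # illustrative labels
store.add("buy", "computer", "verb-object")
store.add("buy", "store", "verb-place")
store.add("information", "retrieval", "compound-word")
```

A plain dictionary keyed by word pairs makes the space problem visible: every pair needs its own key, which is the cost the trie-based structure is meant to reduce.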

B. Case frame
To cope with this complexity, syntactic and semantic information must be used at the same time to analyze a sentential structure. The best grammatical framework for this purpose is case grammar (C. Fillmore, 1968). The semantic primitives shown in Table 2 are used to determine which kind of noun can fill which case slot. For instance, the verb eat takes a noun with the semantic primitive animal as the actor of the verb, and a noun with the semantic code eatable stuff as the object. This case slot determination is specified for each usage of every verb in the dictionary.
The information stored in a dictionary record differs with each part of speech, but in general it includes the following: head word, number of characters in the word ending, alternates, root word, related words, morphological part of speech, conjugation, prefix information, area code, grammatical part of speech, subcategorization of part of speech, case patterns, features, model, options, semantic primitives, co-occurrence information (adverb, predicative modifier), idiomatic expressions, degrees, degrees of nominality, and so on.
Here, the case patterns of verbs and nouns are among the most important information. We have identified over 30 patterns; see Table 1. Each case slot in a verb usage pattern includes semantic information about the noun that can appear in the slot, and the noun has the matching semantic code in its entry. We have identified over 50 semantic primitives (codes), listed in Table 2. Following Nagao et al. [26], [27], and as in Table 1, the semantic primitive is used to determine which kind of noun can fill which case slot. For instance, the verb eat requires a noun with the semantic primitive animal as the actor of the verb, and a noun with the semantic code eatable substance as the object. This case slot determination is specified for each usage of every verb in the dictionary.
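The slot check described above can be sketched as follows. The tables `NOUN_PRIMITIVES` and `CASE_FRAMES` are tiny hypothetical stand-ins for Tables 2 and 1, the slot names follow the ACTOR/OBJECT notation used in the examples, and treating a HUMAN as also carrying the ANIMAL code is an assumption made for this illustration.

```python
# Hypothetical semantic codes for a few nouns (stand-in for Table 2).
NOUN_PRIMITIVES = {
    "dog": {"ANIMAL"},
    "Hala": {"HUMAN", "ANIMAL"},   # assumption: a human also counts as an animal
    "apple": {"EATABLE"},
    "stone": {"NATURAL_OBJECT"},
}

# Hypothetical case patterns (stand-in for Table 1): slot -> required primitive.
CASE_FRAMES = {
    "eat": {"ACTOR": "ANIMAL", "OBJECT": "EATABLE"},
}


def fits_case_frame(verb, fillers):
    """Return True if every slot of the verb's frame is filled by a noun
    whose semantic codes include the primitive required by that slot."""
    frame = CASE_FRAMES.get(verb)
    if frame is None:
        return False  # no case pattern known for this verb
    for slot, required in frame.items():
        noun = fillers.get(slot)
        if noun is None or required not in NOUN_PRIMITIVES.get(noun, set()):
            return False
    return True
```

For example, `fits_case_frame("eat", {"ACTOR": "dog", "OBJECT": "apple"})` accepts the sentence, while replacing the object with "stone" rejects it.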
The relations between a VERB and a NOUN (OBJECT, PLACE) are defined as follows:
Sentence: Jhon will climb the Chocolate in the next winter holiday.
(ACTOR: Hala, HUMAN) (OBJECT: Chocolate)
From the semantic primitives in Table 2, "Chocolate" is an OBJECT, and from the information of the verb "eats", the noun "Chocolate" is an EATABLE MATERIAL that a HUMAN will eat. Therefore, SEMANTIC("Chocolate") in this sentence is EATABLE MATERIAL.

Example [3].
SEMANTIC("Chocolate") [PRODUCT MOBILE PHONE]
Sentence: Data is organized in Chocolate.
(ACTOR: Data, INTELLIGENT PRODUCTS) (OBJECT: Chocolate)
From the semantic primitives in Table 2, "Chocolate" is an OBJECT, and from the information of the verb "organized", the noun "Chocolate" is an INTELLIGENT PRODUCT on which data can be organized.
These examples show that, for sentence ambiguities, word sense disambiguation (WSD) can be carried out using the clear semantic primitives in the sentences. Context ambiguities, however, remain hard to resolve even when the sentence includes semantic primitives (see Appendix A).

C. Implicit Inference of a Noun
A systematic study of verbs and nouns is essential for obtaining deep knowledge. Because much is implicit in events, generic knowledge is necessary. By creating a semantic representation of the verb in the case frame, we can obtain more information about the nouns. A detailed study with many examples has been carried out to obtain the implicit inference of nouns, as follows:
Example 1: Mr. Atlam eats fried fish in a restaurant.
The case frame of this sentence is (ACTOR: Mr. Atlam) (OBJECT: fried fish) (LOCATION: Restaurant). Note that a noun has only a semantic primitive; in this example, fried fish is merely one kind of food. By using implicit inference (the deep information of a verb), we find a SEMANTIC_REFER slot, which indicates that the frame may refer to the semantic description of that slot. The knowledge of the verb eat in this example refers to the knowledge of fried fish and a restaurant, so the object fried fish is referred to as 'eatable material' and a restaurant is referred to as an 'eating place.'
Example 2: Mr. Samouda swims in a river.
The case frame of this sentence is (ACTOR: Mr. Samouda) (LOCATION: River). Using only the semantic primitives of nouns in Table 2, a river refers to a LOCATION and carries no knowledge about a 'swimming place,' and Mr. Samouda refers to a HUMAN.
By employing the deep information of the verb, however, we find a SEMANTIC_REFER slot indicating that the frame may refer to the semantic description of that slot. The knowledge of the verb swim in this example refers to the knowledge of a river, so the place a river is referred to as a 'swimmable place' where the HUMAN swims, and if there is a relation that LOCATION has OBJECT, we obtain the dynamic knowledge that a river has water.
Example 3: Mr. Atlam wants to buy a computer from a store.
The case frame of this sentence is (ACTOR: Mr. Atlam) (OBJECT: Computer) (LOCATION: Store) (TOOL: $10,000). Using only the semantic primitives of nouns in Table 2, a computer refers to an ARTIFICIAL PRODUCT, a store refers to a LOCATION, and $10,000 is just a TOOL (with no information about the price of the computer).
By employing implicit inference, however, we find the SEMANTIC_REFER slot indicating that the frame may refer to the semantic description of that slot. The knowledge of the verb buy in this example refers to the knowledge of a computer, a store, and $10,000. This knowledge implies that Mr. Atlam is enabled to buy by having money, that the cost is $10,000, and that Mr. Atlam intends to use what he buys. The store is referred to as a 'buyable place', and if there is a relation that LOCATION has OBJECT, then a store has a computer.
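The three examples can be summarized in a short Python sketch. The table `SEMANTIC_REFER` below is a hypothetical encoding of the verb knowledge quoted in the examples, keyed by verb and slot, and the function names are assumptions; the sketch only illustrates how a verb could refine the description of the nouns in its case frame.

```python
# Hypothetical SEMANTIC_REFER table: (verb, slot) -> refined description.
SEMANTIC_REFER = {
    ("eat", "OBJECT"): "eatable material",
    ("eat", "LOCATION"): "eating place",
    ("swim", "LOCATION"): "swimmable place",
    ("buy", "LOCATION"): "buyable place",
}


def infer_implicit(verb, case_frame):
    """Return {noun: refined description} for the slots the verb knows about."""
    inferred = {}
    for slot, noun in case_frame.items():
        description = SEMANTIC_REFER.get((verb, slot))
        if description is not None:
            inferred[noun] = description
    return inferred


def location_has_object(case_frame):
    """Apply the 'LOCATION has OBJECT' relation when both slots are filled."""
    if "LOCATION" in case_frame and "OBJECT" in case_frame:
        return (case_frame["LOCATION"], "has", case_frame["OBJECT"])
    return None


frame = {"ACTOR": "Mr. Atlam", "OBJECT": "computer",
         "LOCATION": "store", "TOOL": "$10,000"}
print(infer_implicit("buy", frame))
print(location_has_object(frame))
```

With the frame of Example 3, the first call refines the store to a 'buyable place' and the second derives that a store has a computer.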
By using verb information in the case structure, implicit knowledge of nouns can be derived. By extending this knowledge, we can build linkage groups between a subject and a verb, a verb and an object, and a verb and a place. By collecting a large number of such examples, we can build groups of this kind. The linking between nouns and verbs is shown in Figure 1, where the groups are arranged from bottom to top according to whether the relation is strong or weak. High-leaky verbs, such as talk, think, and speak, are verbs of higher animal action and are strong verbs. General verbs, such as eat and drink, are weak verbs.
Although each knowledge dictionary in primitive systems is built separately, almost all modern natural language applications have become more complicated by combining the above relationships. For this reason, it becomes necessary to design a fast and compact structure that can be efficiently integrated with any multi-attribute relation.
Since multi-attribute relation information is defined by a pair of basic words and a record holding the attribute of the relation, the problems are the very large space cost of storing all pairs and the frequent access to pairs and their attributes in the record. Since a word is basically a string, a trie must be added to the basic representation scheme.

D. Compound Word
The triple <x, y, α> is called a compound word relation, which indicates that x is composed with y to give new information. The case frame relations <Tool> + <Verb> in "computer processing" and <Subject> + <Verb> in "language processing" are compound word relations. By using this case frame relation, a clearer meaning and more information about a word can be extracted than from the single word alone. Other examples are Information Retrieval and Natural language.

A. Tries and Efficient Representation of Verb and Noun Linkage
A trie is an n-ary tree [2], [10], [11], [15] whose nodes are n-place vectors with components corresponding to digits or characters. To avoid confusion between keys such as 'the' and 'then', a special end marker '#' is appended to every key, so that no prefix of a key can itself be a key [1]. Let K be a set of keys. Each path in the trie from the initial node (root) to a leaf corresponds to a key in K. Therefore, the nodes of the trie correspond to the prefixes of keys in K. A trie is defined as follows [3], [4], [5].

1) S is a finite set of nodes, each represented as a positive integer.
2) I is a finite set of input characters, or symbols.
3) g is a goto function from S × I to S ∪ {fail}.
This means that a node r is in F (the set of final nodes) if and only if there is a path from node 1 to r that reads some string x in K. A transition labeled a (in I) from r to t means g(r, a) = t. The absence of a transition means failure. Figure 2 shows an example trie for eleven words with '#'; the nodes enclosed in squares are discussed later. The key 'Atlam#' can be retrieved by applying the transitions g(1, 'a') = 3, g(3, 't') = 22, g(22, 'l') = 14, g(14, 'a') = 32, g(32, 'm') = 15, and g(15, '#') = 2, sequentially.
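The definition above can be illustrated with a minimal dictionary-based trie in Python. This is a sketch under stated assumptions: the class and method names are hypothetical, nodes are numbered in insertion order (so the node numbers will not match Figure 2), and the goto table is a plain dictionary rather than the compact array structures cited in [1]-[5].

```python
FAIL = None   # stands for the 'fail' value of the goto function
ROOT = 1      # nodes are positive integers; node 1 is the root


class Trie:
    def __init__(self):
        self.goto = {}         # (node, character) -> node
        self.leaves = set()    # nodes reached by reading a whole key plus '#'
        self._next = ROOT + 1  # next unused node number

    def g(self, node, ch):
        """Goto function: next node, or FAIL when no transition exists."""
        return self.goto.get((node, ch), FAIL)

    def insert(self, key):
        """Insert key followed by the end marker '#' and return its leaf."""
        node = ROOT
        for ch in key + "#":
            nxt = self.g(node, ch)
            if nxt is FAIL:
                nxt = self._next
                self._next += 1
                self.goto[(node, ch)] = nxt
            node = nxt
        self.leaves.add(node)
        return node

    def get_leaf(self, key):
        """Return the leaf for key + '#', or FAIL if the key is not stored."""
        node = ROOT
        for ch in key + "#":
            node = self.g(node, ch)
            if node is FAIL:
                return FAIL
        return node


trie = Trie()
for word in ["the", "then", "Atlam", "buy", "store"]:
    trie.insert(word)
```

Because every key ends in '#', the keys 'the' and 'then' end in different leaves, and a prefix such as 'th' is not reported as a key.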

B. Link Trie (LT) Function [K. Morita, 6]: Term Relationships Definition
Let (X, Y, R) denote the relation R between terms X and Y. In a trie there is a one-to-one correspondence between leaves and keys, so a link trie can be defined by linking leaf s for X and leaf t for Y. In this case, the function LINK is defined by t ∈ LINK(s) and the relation is recorded by R ∈ CONTENTS(s, t). A link trie is a trie that includes the functions LINK and CONTENTS. The link information for Figure 2 is shown in Table 3.
For example, the relationship between Atlam as a subject and buy as a verb can be read from the trie through this one-to-one correspondence: leaf 2 corresponds to the key Atlam and leaf 52 corresponds to the key buy, the link function is defined by 52 ∈ LINK(2), and the record is (<subject>, <verb>) ∈ CONTENTS(2, 52). The relationships between the words (<verb>, <object>) and (<verb>, <place>) can be seen in the same way.

C. Retrieval Algorithm
For the relationship (X, Y, R), the proposed retrieval algorithm supports two operations: (i) retrieve Y and R from X, and (ii) retrieve R from X and Y.
For a link trie LT and a key X, the function GET_LEAF(LT, X) returns the leaf for X#, or fail if LT does not contain X#. For example, GET_LEAF(LT, "store") returns leaf 6 in Figure 2.
For the relationship (X, Y, R), the following ALGORITHM returns leaf s for X# and leaf t for Y# if they are recorded in the trie. s and t can then be used to recover CONTENTS(s, t), which includes the relationship R. If either s or t is not recorded in the trie, ALGORITHM outputs s = t = 0.
[ALGORITHM]
start
  s ← GET_LEAF(LT, X); t ← GET_LEAF(LT, Y);
  if (s = fail or t = fail) then output s = t = 0;
  else if (t ∈ LINK(s) and R ∈ CONTENTS(s, t)) then output s and t;
end;
(Algorithm End)
Figure 3 shows the framework of our approach. The approach searches English textbooks and papers concerning cross-language information, classification, and summarization, and extracts nouns from the Penn Treebank. Compound nouns are extracted from the large corpus after stemming and with the use of a stop-word dictionary. Moreover, the linkages between verbs and nouns and between verbs and places are extracted using a part-of-speech dictionary, and linkage groups and high-leaky relations between them are built. Using these frequent, high-leaky relations, we can disambiguate a word: the surrounding words frequently associated with a sense are used to disambiguate it.
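The functions GET_LEAF, LINK, and CONTENTS and the retrieval ALGORITHM can be sketched in Python as follows. The class `LinkTrie`, the helper `retrieve_from_x`, and the relation labels are hypothetical names introduced for this illustration. Leaf numbers are assigned in insertion order and therefore differ from Figure 2 and Table 3. Returning (0, 0) when both leaves exist but no matching link is recorded is an assumption, since the ALGORITHM does not state that case.

```python
class LinkTrie:
    """Dictionary-based trie whose leaves carry LINK and CONTENTS tables."""

    def __init__(self):
        self.goto = {}       # (node, character) -> node; the root is node 1
        self.next_node = 2
        self.link = {}       # LINK: leaf s -> set of leaves t
        self.contents = {}   # CONTENTS: (s, t) -> set of relations R
        self.key_of = {}     # leaf -> key, used to report Y in operation (i)

    def _walk(self, key, create):
        node = 1
        for ch in key + "#":              # '#' is the end marker
            nxt = self.goto.get((node, ch))
            if nxt is None:
                if not create:
                    return None           # plays the role of 'fail'
                nxt = self.next_node
                self.next_node += 1
                self.goto[(node, ch)] = nxt
            node = nxt
        return node

    def get_leaf(self, key):
        """GET_LEAF(LT, X): leaf for X#, or None if X# is not stored."""
        return self._walk(key, create=False)

    def add_relation(self, x, y, r):
        """Store (X, Y, R) so that t is in LINK(s) and R is in CONTENTS(s, t)."""
        s, t = self._walk(x, True), self._walk(y, True)
        self.key_of[s], self.key_of[t] = x, y
        self.link.setdefault(s, set()).add(t)
        self.contents.setdefault((s, t), set()).add(r)


def retrieve(lt, x, y, r):
    """Operation (ii): return leaves (s, t) if (x, y, r) is recorded, else (0, 0)."""
    s, t = lt.get_leaf(x), lt.get_leaf(y)
    if s is None or t is None:
        return (0, 0)
    if t in lt.link.get(s, set()) and r in lt.contents.get((s, t), set()):
        return (s, t)
    return (0, 0)  # assumption: an unlinked pair is reported like a missing key


def retrieve_from_x(lt, x):
    """Operation (i): return sorted (Y, R) pairs linked from X."""
    s = lt.get_leaf(x)
    if s is None:
        return []
    return sorted((lt.key_of[t], r)
                  for t in lt.link.get(s, ())
                  for r in lt.contents[(s, t)])


lt = LinkTrie()
lt.add_relation("Atlam", "buy", "subject-verb")
lt.add_relation("buy", "computer", "verb-object")
lt.add_relation("buy", "store", "verb-place")
```

Storing the relation on a pair of leaf numbers rather than on a pair of strings is the point of the design: each word is stored once in the trie, and every relation costs only two integers and a record.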

D. Semantic Field Information
As discussed in Section 2.1, some words have several semantic meanings. Therefore, the various values of semantic(x) usually appear in different branches. This section describes how to obtain more information and new knowledge from stored case frames by using the trie structure and links between leaves, keeping links between them to reflect certain relationships. For example, suppose Jhon* is an unknown word with the following context (case frame):
Level 1: Jhon* eats apple. Jhon IS-A animal? Jhon is similar to a dog or a human.
Level 2: Jhon* buys computer. Jhon IS-A human.
Fig. 4. Trie structure
Using this information as common knowledge, the trie structure and link trie allow new and varied relations to be derived automatically. This new knowledge is very useful in NLP because it makes a text more readable and understandable for humans, and it can be combined to provide additional useful <IS-A> hierarchical information. In these examples with the <SUB.-VERB> relation, the automated linking information shows that things which can only eat and drink, and cannot speak or buy (i.e., eatable and drinkable only), are animals; things which can eat, buy, drink, and speak but cannot treat sickness (i.e., buyable, speakable, eatable, and drinkable only) are normal humans; and a man who can eat food, drink drinks, buy goods, speak languages, and has the ability to care for sickness (treatable) is a doctor. This group can also be created as in Figure 5. From this link trie, we obtain the <IS-A> hierarchy relationship that a doctor is a human, and a human is an animal.
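One way to read the derivation of the <IS-A> hierarchy is as a subset test on the verbs linked to each subject. The Python sketch below follows that reading; the table `SUBJECT_VERBS` is a hypothetical encoding of the <SUB.-VERB> links described above, and the proper-subset rule is an assumption made for illustration rather than the procedure specified in this paper.

```python
# Hypothetical <SUB.-VERB> links as they might be read from the link trie.
SUBJECT_VERBS = {
    "animal": {"eat", "drink"},
    "human": {"eat", "drink", "buy", "speak"},
    "doctor": {"eat", "drink", "buy", "speak", "treat"},
}


def is_a_pairs(subject_verbs):
    """Return (child, parent) pairs where the parent's verbs form a proper
    subset of the child's verbs, i.e. the child can do everything the parent can."""
    pairs = set()
    for child, child_verbs in subject_verbs.items():
        for parent, parent_verbs in subject_verbs.items():
            if parent_verbs < child_verbs:
                pairs.add((child, parent))
    return pairs


def direct_is_a(subject_verbs):
    """Keep only direct parents by dropping pairs implied by transitivity."""
    pairs = is_a_pairs(subject_verbs)
    return {
        (child, parent) for (child, parent) in pairs
        if not any((child, mid) in pairs and (mid, parent) in pairs
                   for mid in subject_verbs)
    }


print(sorted(direct_is_a(SUBJECT_VERBS)))
```

Under this reading, the unknown word Jhon* at Level 2 (eats, buys) would fall under human, because buy is not among the verbs linked to animal.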

V. SIMULATION RESULTS
A. Experimental data and information
In this experiment, 99,714 statements with diverse features from a tagged corpus (Penn Treebank) are used.

Data Set 1:
About 11,970 subject-verb case relationships, about 2,514 verb-object relationships, and 679 verb-place relationships are used. Because of the high access frequency of the pairs, we could not list them all here; see Table 6.

Data Set 2:
A case frame with a trie structure is used to represent many relationships between words, as shown in Section II.

Data Set 3:
A trie structure with links among leaves is employed for additional information, as shown in Figure 4 and Table 3.

Data Set 4:
Ten groups of typical verbs and objects are selected from Data Set 1, as in Table 5.

Result [1].
By employing Data Sets 1, 2, and 3, we can establish an automated generation of the hierarchy of relations among words, as shown in Figure 5 and Algorithm 2.

Result [2].
By gathering these data and establishing relationships between verbs and other kinds of keys with the link trie, we can see the hierarchy groups. Figure 6, for example, shows the subject and verb linkage. The high-leaky relations indicate that sandbags are maintainable and supportable but not exchangeable or buyable; a company is exchangeable, buyable, supportable, and maintainable but not eatable or talkable; and Jhon can maintain, support, buy, exchange, eat, and talk.
Result [3]. Disambiguation
Figure 7 shows how the algorithm fares with sentences containing an ambiguous element; it is able to handle many such cases, as illustrated here. Consider the pair of sentences below:
1) Investment company support the bank.
2) The sandbags support the bank.
These sentences show that the word bank has two semantic meanings, financial house and edge of river. By using more information about the word bank from other verbs, the ambiguous case can be turned into a disambiguated case, as follows.
The first approach: the semantic meaning of bank in the first sentence is financial house. Other verbs make this meaning explicit, as in the sentence: Jhon exchange from bank. For more information about bank we can say: Bank buy money. With this additional information, all the sentences speak about money, which further disambiguates the word bank; its clear semantic is now financial institution.
The second approach: the semantic meaning of bank in the second sentence is edge of river. Another verb makes this meaning explicit, as in the sentence: Sandbags maintain bank. For more information about bank we can say: Bank maintain river.
With this additional information, all the sentences speak about holding up physically, which further disambiguates the word bank; its clear semantic is now edge of river.
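The bank example can be approximated by a small overlap score in Python. Everything in the sketch is hypothetical: `NOUN_VERBS` encodes the subject and verb linkages reported in Result [2], `SENSE_VERBS` encodes the verbs used above to make each sense of bank explicit, and counting the shared verbs is an assumed scoring rule, not the disambiguation procedure evaluated in this paper.

```python
# Verbs linked to each context noun (after the linkages in Result [2]).
NOUN_VERBS = {
    "company": {"exchange", "buy", "support", "maintain"},
    "sandbags": {"support", "maintain"},
}

# Verbs that make each sense of "bank" explicit in the examples above.
SENSE_VERBS = {
    "financial institution": {"exchange", "buy"},
    "edge of river": {"maintain"},
}


def disambiguate(context_noun, sense_verbs=SENSE_VERBS, noun_verbs=NOUN_VERBS):
    """Pick the sense whose verbs overlap most with the verbs linked to the
    context noun; return None when there is no evidence or a tie."""
    linked = noun_verbs.get(context_noun, set())
    scores = {sense: len(linked & verbs) for sense, verbs in sense_verbs.items()}
    best = max(scores.values(), default=0)
    if best == 0:
        return None
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


print(disambiguate("company"))
print(disambiguate("sandbags"))
```

The verb support appears with both context nouns and therefore carries no evidence, which is why it is left out of `SENSE_VERBS`; the decision rests on the verbs that only one sense admits.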

Fig. 7. Example of disambiguation
Result [4]. The accuracy of the experimental results is defined as
Accuracy = α/β
where α is the number of words disambiguated correctly and β is the total number of ambiguous words. Table 7 summarizes the experimental results for the old method (TM) and the new method using the trie structure and links between leaves. All English terms used in our experiment are listed in Appendix A.
In Table 7, the accuracy is 73% for TM and 79% for the new method using the link trie between leaves, which is 6 percentage points higher than the value obtained by TM. This indicates that our new approach is viable for resolving term ambiguities.
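The accuracy measure can be written as a short Python function. The counts in the usage lines are hypothetical values chosen only to reproduce the reported percentages; the actual values of α and β are not given in this section.

```python
def accuracy(correct, total):
    """Accuracy = alpha / beta, where alpha is the number of words
    disambiguated correctly and beta is the number of ambiguous words."""
    if total <= 0:
        raise ValueError("total number of ambiguous words must be positive")
    return correct / total


# Hypothetical counts per 100 ambiguous words, matching 73% and 79%.
tm_accuracy = accuracy(73, 100)
new_accuracy = accuracy(79, 100)
gain_points = round((new_accuracy - tm_accuracy) * 100)
print(tm_accuracy, new_accuracy, gain_points)
```

The difference is expressed in percentage points: 79% minus 73% gives 6 points, which corresponds to a relative improvement of roughly 8% over TM.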

VI. CONCLUSION
In this paper, we proposed a new approach for building structured natural language knowledge. All surface case patterns are classified in advance with consideration of the meaning of the nouns. Moreover, this paper introduces an efficient data structure using a trie that defines the linkage among leaves and multi-attribute relations. Using these linked multi-attribute relations, we obtain frequent access between verbs and nouns together with automated generation of hierarchical relationships. In our experiment, a large tagged corpus (Penn Treebank) is used to extract data. Around 11,000 verbs and nouns are used to verify the new method and to build hierarchy groups of the nouns. The term disambiguation accuracy of our trie structure method with links among leaves is 6 percentage points higher than that of the traditional method. The preliminary results are promising, because the extracted information structures of a special database can be extended with a larger input of data and more general relations from a large information corpus. The results of disambiguating word ambiguities are better than those of case frames. Experimental results also show that enough distinctive terms can help determine the semantic sense of a word in a specific context. Since preliminary syntactic analysis can be performed by many natural language processing systems, we will be able to obtain more precise semantic information from the syntactic resource. Future work could focus on using context analysis to improve word disambiguation, and on extracting Arabic keywords from a large Arabic corpus using a stop-word dictionary and stemming rules, together with classification of Arabic text using a classification engine.