Semi Supervised Method for Detection of Ambiguous Word and Creation of Sense: Using WordNet

Machine translation, information retrieval and knowledge acquisition are the three main applications of Word Sense Disambiguation (WSD). The sense of a target word can be identified from a dictionary using a ‘bag of words’, i.e. the neighbours of the target word. A target word is a word whose spelling stays the same while its meaning differs between uses, e.g. chair, light. In WSD, the key inputs are the sentence and the target word. However, instead of being supplied, the target word should be detected automatically. If a sentence contains more than one candidate target word, further filtering is required. In this study, a framework consisting of buzz words and query words has been developed to detect target words using the WordNet dictionary. Buzz words are defined as a ‘bag of words’ selected using POS tags, and query words are those words having multiple meanings. The proposed framework then finds the sense of the detected target word using those of its glosses and examples that contain buzz words. The approach is semi-supervised because 266 words with multiple meanings were labelled from various sources and then used within an otherwise unsupervised procedure to detect the target word and its sense (meaning). In experiments on a dataset of 300 hotel reviews, 100 % of the target words were detected, and the generated sense was related to the sentence or phrase in 84 % of cases.

Keywords: Word sense disambiguation; machine translation; information retrieval and knowledge acquisition; target word; WordNet; bag of words


I. INTRODUCTION
Choosing the correct sense of a word in context is the task of Word Sense Disambiguation (WSD). The task arises because most words have multiple meanings: the word "run" has 179 meanings, while the word "take" has 127 different definitions [1]. WSD methods are usually classified into two types: knowledge-based and machine learning [2], [3]. Knowledge-based WSD systems exploit the information in a lexical knowledge base, such as WordNet or Wikipedia, to perform WSD. These approaches usually choose the sense whose definition is most similar to the context of the ambiguous word, using textual overlap or graph-based measures [4]. Machine learning approaches, also called corpus-based approaches, do not make use of any knowledge resources for disambiguation. They range from supervised learning [5], in which a classifier is trained for each distinct word on a corpus of manually sense-annotated examples, to entirely unsupervised methods that cluster the occurrences of words, thereby inducing senses. Recent advances in WSD have significantly benefited from the availability of corpora annotated with word senses.
The most accurate WSD systems to date exploit supervised methods, which automatically learn cues useful for disambiguation from manually sense-annotated data [6], [7], [8].
In this study, WSD is categorized into two approaches:
• WSD-1: WSD can be used to determine a summary of a sentence. However, a sentence may contain a word with more than one meaning, e.g. "date" or "bass", and the sense of such a word within the sentence must be resolved by a device or application.
• WSD-2: WSD can be used to detect the semantics of a word in a sentence with respect to polarity, e.g. "his work is unpredictable". Here the word "unpredictable" is normally a negative word, but in this instance it will be considered positive.
In this study, work is focused on WSD-1. There has been quite a lot of work conducted on WSD-1 by other researchers. However, in those studies the target word has already been provided, e.g. "Sit on a chair", "Take a seat on this chair", "The chair of the Math Department". These phrases reflect different meanings of "chair", as the word has multiple senses. Here, the target word "chair" is supplied in order to determine whether it means furniture or a person [9]. Similarly, in "I find a switch for the light" or "I do like to eat something light", the target word is given as "light", and the sense may be viewed as "shine" or "weight" respectively [10]. Word detection has also been conducted in the work of [11], but there the detected words are aspect or entity words on which an opinion has been given. Consider "An electric guitar and bass player stand off to one side, not part of the scene". What is the sense of "bass" in this sentence? In [12], the target word "bass" is given. Therefore, there is a need to filter a single target word from among the obscure words, i.e. words with multiple meanings, in a sentence. In this paper, we investigate which word should be the target word.
A manually created multiple meaning words list (MMWL) was developed through:
• The union of words taken from the English language [1];
• Multiple Meaning Words (100) grouped by the word "Grad" [13];
• Multiple meaning word list of 200 words [14], [15];
• Easy vocabulary words [16];
• Speech therapy ideas [15].
The manually developed MMWL contained 266 words with multiple meanings. Earlier work [18], [19] has used only the candidate definition/gloss, but the sense could also be detected from the examples to improve accuracy, because sometimes the ‘bag of words’ is not present in the definition/gloss of the target word.
The contributions of this study are as follows:
• We propose a method to filter a target word (the word whose sense is required) from multiple ambiguous words with the help of buzz words and query words, using the multiple meaning words list (MMWL) as a lexicon; and
• We generate the correct sense of the target word with the help of buzz words, using the glosses and examples of the target word from the WordNet lexicon.

II. RELATED WORK
A word in a sentence can be expanded by relating it to other words in the sentence to determine its actual meaning. This is essential because the majority of research studies to date have investigated opinion words [20], [21] or examined text mining through the creation of summaries of a given document [22], [23] in different languages such as Arabic [24], [24] and Chinese [25], [26]. The aim is for a document or sentence to be easily understood by users as well as by intelligent machines. Automatic summary generation procedures have faced many problems, including word sense disambiguation (WSD). WSD is also involved in natural language processing applications [10]. For example, human intelligence can automatically sense and detect the meaning of a word by examining a sentence. However, in the field of artificial intelligence, efforts are continuing to be made to understand a sentence from the correct dimension or aspect, given that a single word may have multiple meanings.
The MeSH-based disambiguation method assumes that the meaning of a target word is the same throughout a document and that the word tends to have the same meaning when used in the same collocation; using MeSH, which consists of words from different domains, it achieved a precision of 0.5841 [27]. Automatically disambiguated words on Wikipedia have several limitations due to the small sample size and the large number of fine-grained senses found in WordNet [18]. In one study, supervised WSD [28] determined the sense value 2 and 3 from 57 target words. In a separate study [29], the sense of the target word was determined using the three words to the left and the three words to the right of the target word. This performed well only where the supporting words were present directly before and after the target word. Instead of taking left and right words, the authors of [30] used a ‘bag of words’ from the sentence, i.e. the neighbours of the target word, to build a binary vector. The sense of a word can also be detected if there are dependent words in the sentence near the target word. Many aspects of sense evaluation have been standardised through the efforts of SENSEVAL and SEMEVAL. This framework provides a shared task along with training and testing materials and sense inventories for all-words and lexical sample tasks in a variety of languages [31]. A relatively small set of training examples (seed sets) is identified in the framework to represent each sense. Sense clusters are then generated by adding the most similar words to the seed set elements. The sense cluster most similar to the input text context is then taken as the sense of the target word [32]. To address the limitations of the supervised scenario, studies have progressed on kernel methods for automatic WSD using four target words: interest, line, hard and serve [6].
The original gloss-based algorithm relied on traditional dictionaries such as the Oxford dictionary: the definition, or gloss, of each sense of a word in a phrase is compared to the glosses of every other word in the phrase. A word is assigned the sense whose gloss shares the largest number of words with the glosses of the other words. The authors of [33] did not use the examples of the word in the WordNet dictionary but instead used Lesk's basic approach, taking advantage of the highly interconnected set of relations among synonyms that WordNet offers, with the target word provided. Besides the confusions in WSD itself, there are many difficulties in handling them using supervised and unsupervised methods. Work in [34] determined that supervised methods are the optimal predictors of WSD difficulties but are limited by their dependence on labelled training data in different domain types such as biomedical [35], [36]. The unsupervised method performed well in some situations and can be applied more broadly [37], [38]. The accuracy of the unsupervised WSD algorithm is lower than that of its supervised alternative [39]. Word sense can also be detected from different sentences using latent semantic indexing by providing the target word as a query [40]. WSD is not only used in document clustering [19] but also in many natural language applications based on artificial intelligence. Work on WSD for the English language is progressing, and WSD is also being applied to other languages such as Hindi, Hebrew, Russian and Tatar [41], [42], [43]. WSD has been applied not only to text but also to images, for determining the correct sense from a picture [44].
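The gloss-overlap idea described above can be illustrated with a short sketch. Note that this is the simplified Lesk variant, which compares each candidate gloss with the words of the context rather than with the glosses of every other word, so it is not the original algorithm. The function names, the stopword list and the two-sense inventory are assumptions made for illustration only.

```python
import re

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "or", "is"}

def content_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def simplified_lesk(context, glosses):
    """Return the sense label whose gloss shares the most words with the context."""
    context_set = content_words(context)
    return max(glosses, key=lambda s: len(content_words(glosses[s]) & context_set))

# Hypothetical sense labels; the glosses are the ones quoted for "bass" in this paper.
BASS_GLOSSES = {
    "bass_music": "the lowest part of the musical range",
    "bass_fish": "the lean flesh of a saltwater fish of the family Serranidae",
}

sense = simplified_lesk("striped bass and other fish live in the river", BASS_GLOSSES)
print(sense)
```

With the context above, only the second gloss shares a content word ("fish") with the sentence, so that sense is selected.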

III. PROPOSED METHODOLOGY
In this work, two major tasks are performed. First, the target word is detected within the sentence; second, the sense of that word is generated.
2) Query words: Query words are words having multiple meanings. They are obtained by comparing each filtered token with the manually created multiple meaning words list (MMWL). In Figure 1, suppose T2 and T8 are present in the MMWL; they are then the query words whose glosses and examples will be retrieved. Equation 5 expresses this membership test, where x = 1, 2, 3, …, n and the i-th token of the x-th sentence is considered a query word if it belongs to the MMWL.
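The membership test for query words can be sketched as follows. This is a minimal illustration that assumes the buzz words have already been extracted; the function name, the sample buzz words and the five-word MMWL subset are hypothetical, not the paper's actual lists.

```python
def find_query_words(buzz_words, mmwl):
    """Keep, in sentence order, the buzz words that appear in the MMWL."""
    return [word for word in buzz_words if word.lower() in mmwl]

# Tiny illustrative subset of a multiple meaning words list (assumed).
MMWL_SAMPLE = {"part", "bass", "date", "chair", "light"}

# Assumed buzz words for the salmon/bass example sentence.
buzz = ["researchers", "worms", "part", "life", "cycle", "fish", "salmon", "bass"]

query_words = find_query_words(buzz, MMWL_SAMPLE)
print(query_words)
```

If no buzz word is in the list, the result is empty and no target word can be proposed for that sentence.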
3) Query strings: Once all query words have been identified, their glosses/definitions can be located in the WordNet dictionary. As shown in the WordNet dictionary, a word can have multiple definitions, with each definition having multiple examples [45], [46], [47], [48]. Concatenating all definitions and the examples of each definition yields a query string. In Figure 1, string-1 and string-2 are the query strings of T2 and T8, because there are two query words. The query strings of all query words are created using Equation 6.
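A query string can be sketched as a plain concatenation of glosses and examples. The snippet below uses a small in-memory dictionary as a stand-in for WordNet so that it runs without external data; the structure `word -> [(gloss, [examples])]`, the function name and the example sentence are assumptions. With NLTK's WordNet reader, the same information would presumably come from `wordnet.synsets(word)` together with each synset's `definition()` and `examples()`.

```python
# Hypothetical stand-in for WordNet: word -> list of (gloss, [examples]).
# The glosses are quoted in this paper; the example sentence is invented.
TOY_LEXICON = {
    "bass": [
        ("the lowest part of the musical range", ["he sang the bass line"]),
        ("the lean flesh of a saltwater fish of the family Serranidae", []),
    ],
}

def build_query_string(word, lexicon):
    """Concatenate every gloss of the word, each followed by its examples."""
    parts = []
    for gloss, examples in lexicon.get(word, []):
        parts.append(gloss)
        parts.extend(examples)
    return " ".join(parts)

query_string = build_query_string("bass", TOY_LEXICON)
print(query_string)
```

A word that is missing from the lexicon yields an empty string, which mirrors the case where no definition can be found.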
In Equation 6, x = 1, 2, 3, …, n, and the equation yields the complete string of each query word, containing all glosses of the query word and all examples of each gloss, for the x-th sentence, using the synsets (sets of synonyms) found in the WordNet dictionary.
4) Frequency of buzz words from the query string: Next, the occurrence (frequency) of each buzz word in every query string is determined and summed. In Figure 1, T5 and T9 are the buzz words that are not query words. F1 and F2 are the frequencies of T5 and T9 in string-1, and Sum1 is the sum of F1 and F2. Likewise, F3 and F4 are the frequencies of T5 and T9 in string-2, and Sum2 is the sum of F3 and F4. These sums are found by applying Equation 7.
Finally, for x = 1, 2, 3, …, n, the concept of the target word in the x-th sentence is determined. This concept consists of those glosses and examples of the query word that contain buzz words.
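The frequency-and-sum step for detecting the target word can be sketched as below. This is an illustrative reading of the procedure, not the authors' implementation: the function names, the whole-word tokenisation and the shortened query strings are assumptions.

```python
import re

def buzz_word_frequencies(helpers, query_string):
    """Count how often each non-query buzz word occurs in one query string."""
    tokens = re.findall(r"[a-z]+", query_string.lower())
    return {word: tokens.count(word) for word in helpers}

def detect_target_word(buzz_words, query_strings):
    """Return (target, sums); the target is the query word with the largest sum."""
    # Buzz words that are themselves query words are excluded from the counts.
    helpers = [w for w in buzz_words if w not in query_strings]
    sums = {
        query_word: sum(buzz_word_frequencies(helpers, text).values())
        for query_word, text in query_strings.items()
    }
    target = max(sums, key=sums.get)  # on a tie, the first query word wins
    return target, sums

buzz = ["worms", "part", "fish", "bass"]
# Shortened, assumed query strings for the two query words.
strings = {
    "part": "something less than the whole of a human artifact",
    "bass": "the lean flesh of a saltwater fish any of various freshwater fish",
}

target, sums = detect_target_word(buzz, strings)
print(target, sums)
```

Here "fish" occurs twice in the string for "bass" and no helper occurs in the string for "part", so "bass" is selected. The tie behaviour of `max` is only a placeholder for the tie case.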

IV. SAMPLES BASED ON PROPOSED WORK
Table 2 considers the sentence "The researchers said the worms spend part of their life cycle in such fish as Pacific salmon and striped bass and Pacific rockfish or snapper".
The target word "bass" is determined (using the first part of the methodology) by progressing through the following steps: identifying the tags; the buzz words; the query words; the strings of the query words; the frequency of buzz words in the strings; the sum of the frequencies of buzz words for each string; and the query word with the largest sum as the target word.
Similarly, Table 3 gives a further example, "Sweet date can be used as the last course of a meal", where the target word "date" has been identified by the first part of the proposed methodology. Table 1 contains the analysis of example-1, where the frequency of each buzz word has been determined from the strings of the query words ["part", "bass"]. The sum from string2 is 3, i.e. the largest; therefore, the target word is "bass".
The sense/concept is generated from those definitions and examples of the target word that contain helping words (i.e. all buzz words except the target word). From Table 2, the target word was "bass", and Table 4 shows the concept of "bass" generated in relation to the sentence. From Table 3, the target word was "date", and Table 5 shows the concept of "date" generated in relation to the sentence.
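The concept generation step can be sketched as a filter over the glosses and examples of the target word. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, whole-word matching is assumed (the reported output suggests the original matching may be looser), and the "Example:" label is invented by analogy with the "Gloss:" label seen in the paper's output.

```python
import re

def contains_word(word, text):
    """True if the word occurs as a whole token in the text (case-insensitive)."""
    return word in re.findall(r"[a-z]+", text.lower())

def generate_concept(target, buzz_words, senses):
    """Collect the glosses and examples of the target that mention a helping word."""
    helpers = [w for w in buzz_words if w != target]
    concept = []
    for helper in helpers:
        for gloss, examples in senses:
            if contains_word(helper, gloss):
                concept.append(f"{helper} : Gloss:{gloss}")
            for example in examples:
                if contains_word(helper, example):
                    concept.append(f"{helper} : Example:{example}")
    return concept

# Two glosses of "bass" as quoted in this paper; no examples are assumed.
BASS_SENSES = [
    ("the lowest part of the musical range", []),
    ("the lean flesh of a saltwater fish of the family Serranidae", []),
]

concept = generate_concept("bass", ["part", "fish", "bass"], BASS_SENSES)
for line in concept:
    print(line)
```

Each output line pairs a helping word with a gloss or example that mentions it, in the same "word : Gloss:..." shape as the concept listed for "bass".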

Sentence
The researchers said the worms spend part of their life cycle in such fish as Pacific salmon and striped bass and Pacific rockfish or snapper

Concept of bass
[u'part : Gloss:the lowest part of the musical range', u'part : Gloss:the lowest part in polyphonic music', u'fish : Gloss:the lean flesh of a saltwater fish of the family Serranidae', u'fish : Gloss:any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)', u'fish : Gloss:nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes']

V. RESULTS AND DISCUSSIONS
The evaluation strategy of WSD is based on the correctness of the sense selected for an ambiguous word in a context, according to human judgment.

A. Dataset Preparation
The dataset consisted of 300 hotel reviews together with the MMWL (as previously defined). Approximately 66 % of the contexts in the hotel reviews contained ambiguous words from the MMWL and were selected for this purpose. A sample listing of the dataset is presented in Table 6, where S1 contains the multiple-meaning word "chair", S2 contains "class", and S3 contains "brand". These words are also listed in the MMWL.

S1
"Last year the chair of the food Department is retired"

S2
"The stay in the hotel was awsome. As a flight attendant, I see a lot of high class hotels and also know their service"

S3
"Diazepam is an example of the chemical (generic) name of a sedative. It is marketed by some companies under its generic name and by other companies under brand names such as Valium or Vazepam."

The proposed framework develops the sense of detected target words using the frequencies (occurrences) of buzz words in the strings of the query words. The work in this study identified 100 % of the target words (ambiguous words) in the 66 % of hotel-review contexts that contained ambiguous words, and the concept/sense was generated in 84 % of cases, as shown in Table -.
This study was based on WSD-1, and only the WordNet dictionary was used. A sense detection rate of 84 % was achieved; combining several other dictionaries, e.g. the Oxford dictionary, could increase the accuracy. Because the string of a query word is generated from its definitions and examples, the sense cannot be generated when, as occasionally happens, the definition of a word is not found in WordNet. The MMWL consisted of 266 obscure words. Updating the list would be useful for the remaining contexts: buzz words can always be generated from a sentence, but if none of the buzz words is in the MMWL, then the number of query words is zero. Target words depend on query words, which is why the sense cannot be generated in such cases.

VI. SUGGESTIONS FOR FUTURE WORK
If there is only a single query word, no further processing is needed, as this query word can be taken as the target word. In the proposed work, if there are several query words, further processing is carried out to detect the target word: the largest sum of frequencies of buzz words over the strings (a separate string for each query word) identifies the target word. If more than one query word attains the same sum of frequencies, the case is left for future work, in which the distance of each buzz word from every query word would be calculated to detect the target word. In addition, detecting the polarity sense of a word based on an opinion within a sentence, as defined in WSD-2, would be useful for future investigation.
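One possible reading of the distance-based tie-break is sketched below. The paper does not specify how the distance would be computed or used, so everything here is an assumption: distance is taken as the absolute difference in token positions, only the first occurrence of each word is used, and the tied query word with the smallest total distance to the buzz words is chosen. The function names and the token list are hypothetical.

```python
def total_distance(tokens, query_word, buzz_words):
    """Sum of token-position distances from the query word to each buzz word."""
    q_pos = tokens.index(query_word)  # first occurrence only (assumed)
    return sum(
        abs(tokens.index(b) - q_pos)
        for b in buzz_words
        if b in tokens and b != query_word
    )

def break_tie_by_distance(tokens, tied_query_words, buzz_words):
    """Pick the tied query word that lies closest to the buzz words overall."""
    return min(tied_query_words, key=lambda q: total_distance(tokens, q, buzz_words))

tokens = ["worms", "spend", "part", "of", "life", "cycle", "in", "fish",
          "as", "salmon", "and", "striped", "bass"]

chosen = break_tie_by_distance(tokens, ["part", "bass"], ["fish", "salmon"])
print(chosen)
```

In this toy sentence "bass" is closer to "fish" and "salmon" than "part" is, so it would be preferred. A single query word is returned unchanged, matching the case where no further processing is needed.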

Fig. 1. Diagram of Proposed Framework for Detection of Target Word.

Fig. 2. Diagram of Proposed Framework for Detection of Sense of Target Word.

TABLE II. SOLVED EXAMPLE-1 TO DETECT THE TARGET WORD

TABLE III. SOLVED EXAMPLE-2 TO DETECT THE TARGET WORD
Sentence: sweet date can be used as last course of meal.

TABLE IV. CONCEPT OF DETECTED TARGET WORD IN EXAMPLE-1

TABLE V. CONCEPT OF DETECTED TARGET WORD IN EXAMPLE-2

TABLE VII. DETECTED TARGET WORDS AND THEIR SENSES