Word Sense Disambiguation Approach for Arabic Text

Word Sense Disambiguation (WSD) consists of identifying the correct sense of an ambiguous word occurring in a given context. Most Arabic WSD systems rely on information extracted from the local context of the word to be disambiguated. This information is often insufficient for accurate disambiguation. To overcome this limitation, we propose an approach that takes into consideration, in addition to the local context, the global context extracted from the full text. More specifically, the sense attributed to an ambiguous word is the one whose semantic proximity is closest to both its local and its global context. Experiments show that the proposed system achieves an accuracy of 74%.
Keywords—Word Sense Disambiguation; Arabic Text; local context; global context; Arabic WordNet; Semantic Similarity


I. INTRODUCTION
WSD is a task in natural language processing (NLP). It aims at determining the appropriate sense of an ambiguous word occurring in a given context [1] [2]. It allows a better understanding, and consequently a better exploitation, of the processed linguistic material. It is therefore an essential task for NLP applications such as Machine Translation (MT), Information Retrieval (IR), and text classification.
The earliest WSD experiments suggested that two words before and after the ambiguous word are sufficient for its disambiguation [3]. For the Arabic language, however, the information extracted from this local context is not always sufficient.
To address this problem, this paper proposes an Arabic WSD system that relies not only on the local context but also on the global context extracted from the full text. The objective is to combine local and global contextual information for better disambiguation.
More specifically, the proposed system uses the Arabic WordNet (AWN) resource to select word senses. The sense attributed to an ambiguous word is the one with the closest semantic proximity to both the local context and the global one. This proximity is measured using the semantic hierarchy offered by WordNet.
The rest of the paper is organized as follows: Section II presents the architecture of WSD systems. Section III reviews some Arabic WSD systems. Section IV describes the proposed system. Section V presents the experiments and the results obtained. The last section gives the conclusion and some perspectives.

II. WSD SYSTEMS ARCHITECTURE
In 1949, Weaver [4] discussed the necessity of WSD for MT and explained that, to carry out this process, the ambiguous word must be considered within the context in which it occurs. In 1950, Kaplan [3] conducted experiments to determine how large the context should be in order to disambiguate a word. He showed that two words to the right and to the left (size = 2) of the ambiguous word are sufficient for its disambiguation; Masterman [5] confirmed this result for Russian, while Choueka and Lusignan [6] confirmed it for French.
Over the years, WSD systems have been developed according to different approaches. These systems generally share an architecture structured around three main steps:
• Sense inventory: selects the senses of the words.
• Context representation: represents senses and contexts in a formal manner.
• Disambiguation process: assigns to every ambiguous word its correct sense according to its context.
The sense inventory step is the one that differs from one system to another, depending on the adopted approach. Generally, two approaches exist. The first one, called the knowledge-based approach, relies on external lexical resources that contain the words of a language together with their senses. These resources can be dictionaries [7], thesauri [8], or ontologies [9] [10].
Unlike the first approach, the second one does not use external lexical resources; instead, it acquires the information needed to define word senses from a corpus, and is therefore called the corpus-based approach. This information is obtained by applying statistical language models to the corpus. Three families are distinguished in this category: supervised approaches, which require an annotated corpus [11] [12] [13]; unsupervised approaches [14] [10], which require an unannotated corpus; and semi-supervised approaches, which require both annotated and unannotated corpora [15].
www.ijacsa.thesai.org

A. Challenges
Arabic presents several challenges for WSD, due essentially to the particularities of the language and to the lack of resources needed for the disambiguation process. The absence of diacritics in Arabic texts is the most challenging characteristic for WSD, because it increases the number of possible senses of a word and consequently makes the disambiguation task more difficult. For example, the undiacritized word صوت (Swt) has 11 senses according to the AWN, while the diacritized form صَوَّتَ (Saw~ata) cuts the number of senses down to two.
In addition, the Arabic language is morphologically very rich. This causes ambiguity during lexical segmentation and consequently affects the detection of the correct word sense during disambiguation. For example, the word وجد (Wjd) has two possible segmentations: the first considers the letter و (W) to be a prefix of جد (Jad), while the second considers it a letter of the word itself, which yields two entirely different words.

B. State of the Art
For several decades, the first WSD systems were mostly concerned with Latin-script languages such as English and French. Arabic, for its part, did not receive attention until the last decade.
The first Arabic WSD system was proposed by Mona Diab in 2002 [10]. In this work, the author introduced an unsupervised method for annotating Arabic words with their senses using the English WordNet and an English-Arabic parallel corpus.
Another contribution was proposed by Elmougy [13], in which a Naïve Bayes classifier was used to disambiguate Arabic words without diacritics.
Merhbene [16] relied on semantic trees and a collocation measure to choose the most appropriate sense of an ambiguous word.
Zouaghi [17] proposed a WSD system that combines information retrieval measures with the Lesk algorithm to estimate the most appropriate sense of the ambiguous word.
The most recent work was proposed by Menai [18], who relied on genetic algorithms. His objective was to exploit the power of these algorithms for Arabic WSD.
All of the previously mentioned works used only one kind of contextual information for disambiguation. The proposed system, in contrast, relies on two: the first is extracted from the local context of the ambiguous word and the second from its global context.

IV. THE PROPOSED SYSTEM
Before describing the system's process, the structure of Arabic WordNet is first presented.

A. Arabic WordNet
The Arabic WordNet (AWN) [19] [20] [21] is a lexical resource for Modern Standard Arabic. It was constructed following the content of the Princeton WordNet. It is organized around elements called synsets, each of which is a set of synonyms together with pointers connecting it to other synsets. The AWN is thus a lexical network in which synsets are the nodes and the connections between synsets are the edges.
This resource currently contains 23,481 words organized into 11,269 synsets. A word can belong to one or more synsets.
In this work, the senses of a word are defined by the synsets to which it belongs in the AWN. Below, some word synsets (i.e. senses) extracted from the AWN are presented:

B. Description of the proposed system
1) Sense inventory
In this step, a preprocessing phase is applied; it comprises text segmentation, stop-word removal, and finally stemming to remove word affixes (prefixes and suffixes).
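Afterwards, the obtained words are classified, according to the AWN, into two categories:
• Non-ambiguous words: belonging to one synset, i.e. possessing one sense.
• Ambiguous words: belonging to several synsets, i.e. possessing several senses.

2) Context representation
This step consists of representing word senses as vectors. For this purpose, the set of all non-ambiguous word senses $S = \{s_1, \dots, s_n\}$ is first considered; afterwards, the vector space spanned by the standard basis $(e_1, \dots, e_n)$ is built, where $e_1 = (1, 0, \dots, 0)$, $e_2 = (0, 1, \dots, 0)$, ..., $e_n = (0, 0, \dots, 1)$ are respectively the unit vectors of the senses $s_1, s_2, \dots, s_n$.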
Using this space, each word sense $s$ is represented by the vector $V_s = \sum_{i=1}^{n} a_i e_i$, where $e_i$ is the unit vector of the non-ambiguous sense $s_i$ and $a_i$ is the $i$-th coordinate, representing the semantic distance between the word sense $s$ and the sense $s_i$ in the AWN. To calculate this distance, the Wu and Palmer (wu-p) measure is used [2].
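As an illustration of the measure, the Wu and Palmer score of two senses is commonly computed as 2·depth(lcs) / (depth(a) + depth(b)), where lcs is their deepest common ancestor in the hierarchy; the score grows as the senses get closer. The sketch below applies this to a small hypothetical hypernym tree. The tree, the names `PARENT`, `wu_palmer` and `sense_vector`, and the convention that the root has depth 1 are assumptions made for illustration; they are not taken from the AWN or from the system's implementation.

```python
# Toy hypernym tree (child -> parent). Hypothetical data, not taken from AWN.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "book": "artifact",
}


def path_to_root(node):
    """Return the chain node, parent, ..., root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth in the tree; the root has depth 1 (assumed convention)."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu and Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking upward from b, the first node shared with a is the deepest
    # common ancestor (lcs).
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def sense_vector(sense, basis_senses):
    """Coordinates of a sense: its score against each basis sense."""
    return [wu_palmer(sense, s) for s in basis_senses]
```

With this toy tree, `sense_vector("dog", ["cat", "book"])` gives a larger first coordinate than second, since "dog" and "cat" share the deeper ancestor "animal".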
The global context is then defined by the set of sense vectors of the non-ambiguous words present in the full text, $C_G = \{V_{s_1}, \dots, V_{s_n}\}$, while the local context is defined by the set of sense vectors of the non-ambiguous words present only locally, $C_L \subseteq C_G$. Finally, an ambiguous word $w$ that has $m$ senses is represented by the set of its sense vectors, $\{V_{w_1}, \dots, V_{w_m}\}$.

3) Disambiguation process
This last step consists of assigning to each ambiguous word its appropriate sense. This is done by choosing the sense with the closest semantic proximity to its local and global contexts.
The semantic proximity of a sense to a context is defined as the percentage of vectors in that context that are similar to the vector of this sense.

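The similarity between two vectors $V = (v_1, \dots, v_n)$ and $W = (w_1, \dots, w_n)$ can be calculated with three measures, namely the dot product, the cosine, and the Jaccard coefficient, defined respectively as follows (standard forms, with the Jaccard coefficient in its usual generalization to real-valued vectors):

$$\mathrm{dot}(V, W) = \sum_{i=1}^{n} v_i w_i$$
$$\cos(V, W) = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2}\,\sqrt{\sum_{i=1}^{n} w_i^2}}$$
$$\mathrm{Jaccard}(V, W) = \frac{\sum_{i=1}^{n} v_i w_i}{\sum_{i=1}^{n} v_i^2 + \sum_{i=1}^{n} w_i^2 - \sum_{i=1}^{n} v_i w_i}$$

According to the previous definitions, the local and global semantic proximities are measured for each sense of an ambiguous word; as a result, a pair of percentages, one for each proximity, is obtained. The sense with the highest average of its two percentages is finally assigned to the ambiguous word.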
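The selection rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the text does not say when two vectors count as "similar", so the `threshold` parameter (and its default value) is an assumption, as are the function names and the toy vectors in the usage note. The Jaccard measure is written in its generalized form for real-valued vectors.

```python
import math


def dot(v, w):
    return sum(a * b for a, b in zip(v, w))


def cosine(v, w):
    norm_v, norm_w = math.sqrt(dot(v, v)), math.sqrt(dot(w, w))
    return dot(v, w) / (norm_v * norm_w) if norm_v and norm_w else 0.0


def jaccard(v, w):
    # Generalized (real-valued) Jaccard coefficient.
    denom = dot(v, v) + dot(w, w) - dot(v, w)
    return dot(v, w) / denom if denom else 0.0


def proximity(sense_vec, context, sim=cosine, threshold=0.9):
    """Percentage of context vectors judged similar to sense_vec.

    'Similar' is taken to mean sim >= threshold (an assumption).
    """
    if not context:
        return 0.0
    similar = sum(1 for c in context if sim(sense_vec, c) >= threshold)
    return 100.0 * similar / len(context)


def disambiguate(sense_vectors, local_ctx, global_ctx,
                 sim=cosine, threshold=0.9):
    """Return the sense whose local/global proximity average is highest."""
    def average(name):
        v = sense_vectors[name]
        local = proximity(v, local_ctx, sim, threshold)
        global_ = proximity(v, global_ctx, sim, threshold)
        return (local + global_) / 2
    return max(sense_vectors, key=average)
```

For instance, with the hypothetical sense vectors `{"vote": [1.0, 0.1], "sound": [0.1, 1.0]}`, a local context `[[0.9, 0.2]]` and a global context `[[0.9, 0.2], [0.8, 0.1], [0.1, 0.9]]`, the sense "vote" is similar to every local vector and to two of the three global vectors, so it is the one selected. When two senses tie, `max` simply returns the first one; the text identifies tie handling as an open issue.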

V. EXPERIMENTATIONS AND RESULTS
To evaluate the proposed system, a test corpus was constructed by collecting texts from various fields (news, sport, medicine, religion, etc.); each word was then manually annotated with its correct sense according to the AWN.
The system was implemented in Java, and the 'Java API for Arabic WordNet' ¹ was used to access the XML AWN database. The stemming process relies on the SAFAR platform ². To measure the system's efficiency, precision was used; it is the number of correctly disambiguated words divided by the number of all ambiguous words. Experimental results show a precision of 74%.
¹ https://sourceforge.net/projects/javasourcecodeapiarabicwordnet/
² http://sibawayh.emi.ac.ma/safar/publications.php
Another experiment has shown that using a stemming process during the sense inventory phase increases the system's efficiency. More specifically, the results (Table II) show first that the use of this process increases efficiency by 30%, and second that the AlKhalil analyzer [23] performs better than Buckwalter [24] by 4%.
The results of the last experiment show that the proposed approach is better by 0.34% than the classical method (based on the local context only). This limited gain can be explained by several challenges, described as follows:
• The non-recognition of named entities (names of persons, locations, organizations, etc.), which should not be split during the segmentation process. Experiments show that words like عبد الله and أبو ظبي were not recognized as named entities.
• A similar challenge that decreases the system's efficiency is the inability to recognize multiword expressions such as قاعدة بيانات and الأمم المتحدة.
• The absence of a component for deciding between senses that have the same average semantic proximity.
• The absence of part-of-speech tagging, which would make it possible to categorize words into verbs and nouns and consequently to process nouns and verbs separately.
• The last challenge lies in the lexical resource used. The AWN does not cover all Arabic words, which consequently affects the system's efficiency. For example, the word منطق does not belong to the AWN.

VI. CONCLUSION
In this paper, a WSD system for Arabic texts was presented. Unlike other systems, the proposed system takes two types of context into consideration during the disambiguation process. The first is the local context, defined by the words in the neighborhood of the ambiguous word, and the second is the global context, defined by the full text.
Experiments have shown an accuracy of 74% for the proposed system.
In future work, named-entity and multiword-expression recognition components will need to be incorporated into the process to obtain better results, and the other challenges mentioned above will need to be addressed.

Fig. 1. Sense inventory process

Sense inventory Algorithm
Input: Arabic Text T
Output: List of Ambiguous Words (AW) and Non-Ambiguous Words (NAW)
1: Segment the text
2: Remove stop words
3: Apply the stemming process to all obtained words
4: For each word w do:
5:   If w belongs to one synset in AWN
6:   Then: add w to the NAW list
7:   Else if w belongs to several synsets in AWN
8:   Then: add w to the AW list
9: End
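The sense inventory algorithm of Fig. 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the stop-word set, the toy lexicon `TOY_AWN`, the stemmer `toy_stem` and the transliterated tokens are all hypothetical stand-ins. A real system would query Arabic WordNet and use a proper Arabic stemmer, as described in the experiments.

```python
# Hypothetical stand-ins: a real system would query Arabic WordNet and use
# a real Arabic stemmer instead of these toy tables.
STOP_WORDS = {"fy", "mn", "ElY"}            # transliterated stop words (toy)
TOY_AWN = {                                  # word -> synset identifiers (toy)
    "Swt": ["sound_n1", "vote_n1", "voice_n1"],
    "ktAb": ["book_n1"],
    "Eyn": ["eye_n1", "spring_n1"],
}


def toy_stem(word):
    """Hypothetical stemmer: strips only a leading definite article 'Al'."""
    return word[2:] if word.startswith("Al") and len(word) > 3 else word


def sense_inventory(text, lexicon=TOY_AWN, stop_words=STOP_WORDS,
                    stem=toy_stem):
    """Return (ambiguous words, non-ambiguous words) found in the text."""
    ambiguous, non_ambiguous = [], []
    for token in text.split():               # step 1: segment the text
        if token in stop_words:              # step 2: remove stop words
            continue
        word = stem(token)                   # step 3: stemming
        synsets = lexicon.get(word, [])      # steps 4-8: classify by synsets
        if len(synsets) == 1:
            non_ambiguous.append(word)
        elif len(synsets) > 1:
            ambiguous.append(word)
        # Words missing from the lexicon are skipped (an assumption).
    return ambiguous, non_ambiguous
```

Calling `sense_inventory("AlSwt fy AlktAb mnTq")` with these toy tables places "Swt" in the ambiguous list and "ktAb" in the non-ambiguous list, while "mnTq", which is absent from the toy lexicon, is ignored.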

Fig. 2. Context representation

Context representation Algorithm
Input: lists of AW and NAW
Output: global context, local context
1: For all words in NAW do:
2:   Extract associated senses
3: End
4: For each word w do:
5:   For each sense s of w do:
6:     For each non-ambiguous sense do:
7:       Calculate the wu-p semantic distance between the word sense s and that sense
8:     End
9:     Calculate the word sense vector
10:   End
11: End
12: Construct the global context
13: Construct the local context
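A possible way to build the two contexts of Fig. 2 is sketched below, restricted to the non-ambiguous words. Several points are assumptions for illustration: the pairwise scores in `TOY_SCORES` stand in for the wu-p measure computed on the AWN, the sense identifiers are invented, and the local context is taken to be the non-ambiguous words within `window` token positions of the target word (the text mentions a window of two words for the classical local context).

```python
# Hypothetical pairwise scores standing in for the wu-p measure on AWN.
TOY_SCORES = {
    frozenset(("book_n1", "library_n1")): 0.8,
    frozenset(("book_n1", "river_n1")): 0.2,
    frozenset(("library_n1", "river_n1")): 0.3,
}


def toy_distance(a, b):
    """Symmetric lookup; a sense compared with itself scores 1.0."""
    return 1.0 if a == b else TOY_SCORES[frozenset((a, b))]


def build_contexts(naw, target_pos, distance=toy_distance, window=2):
    """naw: (token position, sense id) pairs of the non-ambiguous words."""
    # Steps 1-3: the senses of the non-ambiguous words form the basis.
    basis = list(dict.fromkeys(sense for _, sense in naw))
    # Steps 4-11: one coordinate per basis sense.
    vectors = {s: [distance(s, b) for b in basis] for s in basis}
    # Step 12: global context = vectors of all non-ambiguous senses.
    global_ctx = [vectors[s] for s in basis]
    # Step 13: local context = senses close to the target position.
    local_senses = dict.fromkeys(
        s for pos, s in naw if abs(pos - target_pos) <= window
    )
    local_ctx = [vectors[s] for s in local_senses]
    return basis, global_ctx, local_ctx
```

The vectors of the senses of an ambiguous word would be computed in the same way, by scoring each candidate sense against every sense in `basis`.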
[Disambiguation algorithm listing: takes as input the set of non-ambiguous word sense vectors present locally; its final step assigns to the ambiguous word the sense that has the highest average.]

TABLE II. IMPORTANCE OF THE STEMMING PROCESS

The table below (Table III) shows some disambiguated words from this piece of text:

TABLE III. EXAMPLES OF DISAMBIGUATED WORDS