Efficient Hybrid Semantic Text Similarity using WordNet and a Corpus

Text similarity plays an important role in natural language processing tasks such as question answering and text summarization. At present, state-of-the-art text similarity algorithms rely on inefficient word pairings and/or knowledge derived from large corpora such as Wikipedia. This article evaluates previous word similarity measures on benchmark datasets and then uses a hybrid word similarity in a novel text similarity measure (TSM). The proposed TSM is based on information content and WordNet semantic relations. TSM accounts for exact word matches, the lengths of both sentences in a pair, and the maximum similarity between each word and the compared text. Compared with other well-known measures, TSM surpasses or is comparable with the best algorithms in the literature.
Keywords: text similarity; distributional similarity; information content; knowledge-based similarity; corpus-based similarity; WordNet


I. INTRODUCTION
Text similarity is a field of research whereby two terms or expressions are assigned a score based on the likeness of their meaning. Short text similarity measures play an important role in many applications such as word sense disambiguation [1], synonymy detection [2], spell checking [3], thesauri generation [4], machine translation [5], information retrieval [6]-[8], and question answering [9].
There are three predominant approaches to computing text similarity: corpus-based or distributional semantic models (DSMs), knowledge-based models, and hybrid methods. DSMs are based on the assumption that the meaning of a word can be inferred from its usage (i.e., its distribution in text). They rest on the following hypothesis: linguistic items with similar distributions have similar meanings [10]. Consequently, these models derive vector-based representations of a word's meaning from its co-occurrence patterns in a corpus. The vector-based representation is most often built from large text collections [5]. In this category, latent Dirichlet allocation (LDA) assumes that each document is based on a mixture of topics, where each topic probabilistically generates various words [6], [11]-[13]. In the same category, latent semantic analysis (LSA) is based on the observation that words that share similar meanings tend to occur in similar texts [6], [9], [14], [15].
Knowledge-based methods usually employ taxonomic information (e.g., WordNet) to estimate semantic similarity [18], [19]. Sentence knowledge-based methods use semantic dictionary information such as word relationships [19]-[21], information content [22], [23], parts of speech [18], [24], word senses [25], [26], and gloss definitions from a corpus [27], [28] to obtain the overall semantic score. These methods are limited by the small number of words in general-purpose dictionaries, which cover vocabulary common in general English literature and may not suit specific domains.
Hybrid methods integrate various knowledge-based and/or corpus-based methods. They generally perform better [29]. In recent years, much of the work on lexical semantics has focused on distributional vector representation models [30], [31].
We have identified three cases where knowledge-based, corpus-based, or traditional hybrid methods perform poorly, and we illustrate these cases with examples. Table I shows two sentence pairs taken from the STS-65 benchmark dataset [16] that were compared using LSA [15] (a corpus-based method), the method of [16] (a knowledge-based method), and the method of [17] (a hybrid method).
The first case is as follows: methods that depend on a large corpus (e.g., LSA) tend to overestimate the similarity of relatively unrelated or relatively related sentences. For the first sentence pair, we obtained a similarity score of 0.19 (relatively high) for the LSA measure, whereas the reported mean human similarity score is 0.01. The LSA method depends on word frequencies, which tend to be relatively high in a large corpus (e.g., TASA). The second case is as follows: knowledge-based methods have the same drawback as the previously discussed method (LSA). The method of [16] depends on WordNet semantic relations (i.e., path and depth). This method can distinguish between general and specific concepts using WordNet but has no information about word distributions (or context). The third case is as follows: traditional hybrid methods that combine multiple measures through an average function generally perform poorly [17]. From [17], we determined that the score of each sub-similarity method diverges from the overall similarity score. Each of the eight different measures has its strengths and weaknesses and thus will not produce an acceptable semantic score in all cases. In many cases over the STS-65 dataset, one measure reports high similarity (e.g., >0.5 for LSA) while another reports low similarity (e.g., <0.1 for the path measure). The same finding holds for the second sentence pair: the LSA and [17] measures overestimate the similarity score of the compared sentence pair. Therefore, a similarity measure that uses minimal data resources and still achieves acceptable scores is needed.
Our work presents a hybrid text similarity measure that utilizes WordNet [32] information and a corpus [33]. WordNet is a manually built ontology that shows promising results in the text similarity domain. The proposed method uses a small word corpus, thereby eliminating the processing of large corpora. Using the weighted word similarity of [34], a new text similarity measure is proposed. The proposed measure compares the shorter text to the longer text and finds the maximum word similarity and the total number of exactly matching words. The final similarity is calculated using the total similarity of the comparable words weighted by the text length in words.
First, the related work is summarized. Next, the proposed approach is presented and explained. Then, the proposed method is evaluated. Finally, the article is concluded.

II. RELATED WORK
Sentence similarity methods (also called short text similarity methods) use the similarities between words to estimate the overall semantic similarity of the compared sentences. In general, sentence similarity methods can be categorized as corpus-based, knowledge-based, and hybrid methods.

A. Corpus-based Methods
Corpus models learn word co-occurrence from large corpora to predict the similarity of the compared texts. Many models use information from internet sources such as Wikipedia [35], Google Tri-grams [5], [36], and search engine documents [37]. These models can be categorized as DSMs and distributed vector representation models. DSMs derive vector-based representations of semantic meaning from patterns of word co-occurrence in corpora. In this category, LSA is based on the idea that the frequency of words in certain contexts can determine the semantic similarity of words to each other. That is, words that are similar tend to occur in similar texts [6], [9], [14], [15]. In latent Dirichlet allocation (LDA), each document is based on a mixture of topics, where each topic probabilistically generates various words [6], [11]-[13]. The idea of the vector space model (VSM) [38] is to represent each document in a collection as a point in a space (a vector in a vector space). Points that are close together in the space are semantically similar, whereas points that are far apart are semantically different. The construction of a suitable VSM for a particular task is highly parameterized, and there appears to be little consensus over which parameter settings to use [39]. Moreover, many of these models are based on large corpora. The global vector model (GloVe) is an unsupervised learning model for word representation [40], which is trained on the non-zero elements of a global word-word co-occurrence matrix. The distributional model of [41] combines visual features with textual ones, resulting in a performance increase. Explicit semantic analysis (ESA) represents the meaning of any text as a weighted vector of Wikipedia-based concepts [42]. Furthermore, the distributional method of LSA [43] is enhanced with WordNet semantic relations.
Distributed vector representations of words can capture syntactic and semantic regularities in language and help learning algorithms achieve better performance in natural language processing tasks by grouping similar words. The unified architecture for NLP [44] learns features relevant to the tasks at hand given very limited prior knowledge. This is achieved by training a deep neural network, building upon the work of [30], [45]. Their models [44], [46] learn word representations in a binary classification task (whether a word is related to its context or not). They use the learned word representations to initialize the neural network models for other NLP tasks that also have word representation layers. One of the recent works on distributed representations is that of [31], which used a probabilistic feed-forward neural network language model to estimate word representations in vector space. The align, disambiguate, and walk (ADW) model is a graph-based approach that has two steps: transformation of words to word senses (i.e., one of the meanings of a word) and disambiguation by taking into account the context of the compared words [47]. Based on WordNet, [48] exploit semantic representations of sentences using features extracted from a logic prover.
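The vector space idea described above can be illustrated with a minimal sketch. It assumes raw term counts as vector components and cosine similarity as the closeness measure; both are common choices but are assumptions here, and the toy texts and the names `bag_of_words` and `cosine` are hypothetical rather than taken from any cited system.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to a sparse term-count vector (assumed weighting: raw counts)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents (illustrative only).
doc_a = bag_of_words("the boy plays football")
doc_b = bag_of_words("the boy plays tennis")
doc_c = bag_of_words("stock prices fell sharply")

print(cosine(doc_a, doc_b))  # the two texts share three of four terms
print(cosine(doc_a, doc_c))  # the two texts share no terms
```

Texts that share many terms end up close together in the space, and texts with no shared terms are orthogonal. This also hints at why the choice of weighting and corpus matters so much for such models.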

B. Knowledge-based Methods
Sentence knowledge-based methods use semantic dictionary information such as word relationships [19]-[21], information content [22], [23], word senses [25], [26], and gloss definitions from a corpus [27], [28] to obtain word semantics. Based on human comprehension of sentence meaning, [49] proposed measuring sentence similarity from three aspects that people identify in a sentence. People obtain information from a sentence on all or some of three aspects: the objects the sentence describes, the properties of these objects, and the behaviors of these objects. Consequently, they propose three similarities (objects-specified similarity, objects-property similarity, and objects-behavior similarity) together with an overall similarity. Some similarity models [19] measure the semantic relatedness between texts based on their implicit semantic links extracted from a thesaurus. Other models [25] measure sentence similarity based on word sense disambiguation and WordNet synonym expansion. They perform word sense disambiguation using gloss interactions and expand the result with synonyms. Then, the sentence similarity is calculated using the cosine of the resulting vectors. The reference [50] proposed a sentence similarity that uses weighted noun and verb vectors along with the order of words in a text.
In general, the knowledge-based approach is limited to the use of human-crafted dictionaries. Because of this, not all words are available in the dictionary, and even when a word exists, it may not have full semantics.

C. Hybrid-based Methods
Hybrid-based methods are combinations of the previously mentioned methods. The reference [16] proposed a sentence similarity based on a non-linear function of WordNet path and depth, combined with information content from the Brown Corpus and sentence word order. The reference [7] proposed a weighted similarity vector based on shortest path and term frequency to replace the semantic vector of [16]. They applied the similarity measure to photographic description data. The weighted textual matrix factorization (WTMF) model [11] is built on WordNet, Wiktionary, and the Brown Corpus. The reference [18] generated a semantic vector space using part of speech and WordNet. The reference [51] proposed a sentence similarity measure for paraphrase recognition and text entailment based on WordNet for existing words and an edit distance for proper nouns. The reference [24] proposed a sentence similarity based on WordNet information content and part-of-speech tree kernels.
The reference [29] proposed a three-layer sentence measure: a lexical layer, a syntactic layer, and a semantic layer. The overall sentence measure depends on the number of tokens and the RDF triples that entail the semantic layer. In the same area, [52] combined word meanings and phrase context in a sentence measure. Word meanings are obtained by extracting each word's lemma from a dictionary, whereas phrase context usage is extracted using a huge paraphrase alignment database [53].
Many hybrid methods are supervised models that predict the similarity of test sentences from training data. The UNT model [54] uses regression machine learning based on the hybrid text similarity methods of [17], [55], [56]. The UKP system, which performed best in the Semantic Textual Similarity (STS) task at SemEval-2012, uses a log-linear regression model to combine multiple text similarity measures of varying complexity. The reference [57] proposed the yiGou model, which uses a support vector machine with literal similarity, shallow syntactic similarity, WordNet-based similarity, and latent semantic similarity to predict the semantic similarity score of two short texts. The Takelab model [58] uses a support vector regression model with multiple features measuring word-overlap similarity and syntax similarity to predict human sentence similarity. Each sentence is represented as a vector in the LSA model based on word vectors. Hybrid approaches show promising results on benchmark datasets.

III. PROPOSED METHOD
We highlighted the shortcomings of word similarity measures [34] that are either distance (knowledge)-based [16] or information content (IC)-based [22]. Distance-based methods assign the same similarity value to word pairs that share the same path length or depth in a taxonomy such as WordNet. In contrast, IC measures are limited by the words available in the corpus and return the same similarity whenever the compared words have the same LCS (least common subsumer) ratio. We borrow the word similarity of [34] as shown in (1). Furthermore, we modified the word similarity factor of [34] as shown in (2).
In (2), the two arguments are the compared words, ψ ∈ [0,1] is a weighting factor that combines the IC of the pairs, and the two component terms are the word similarities of Li [16] and Lin [22], respectively.
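The two component measures named above can be sketched in their commonly cited forms. The Lin measure is written as 2 * IC(lcs) / (IC(a) + IC(b)), and the Li measure as a non-linear function of path length and subsumer depth, exp(-alpha * l) * tanh(beta * h). The default values of `alpha` and `beta`, the concept probabilities, and the path and depth numbers are illustrative assumptions, and the sketch does not attempt to reproduce (1) or (2).

```python
import math


def information_content(prob):
    """IC of a concept with corpus probability `prob` (0 < prob <= 1)."""
    return -math.log(prob)


def lin_similarity(p_a, p_b, p_lcs):
    """Lin-style similarity from the probabilities of two concepts and their LCS."""
    denom = information_content(p_a) + information_content(p_b)
    if denom == 0:
        return 1.0  # both concepts are the root (assumed convention)
    return 2 * information_content(p_lcs) / denom


def li_similarity(path_len, lcs_depth, alpha=0.2, beta=0.45):
    """Li-style similarity; alpha and beta defaults are assumed, not from the source."""
    return math.exp(-alpha * path_len) * math.tanh(beta * lcs_depth)


# Two different (hypothetical) word pairs with the same path length and depth
# receive the same distance-based score.
print(li_similarity(path_len=2, lcs_depth=5))
print(li_similarity(path_len=2, lcs_depth=5))

# Two different (hypothetical) word pairs with the same IC ratio receive the
# same IC-based score, even though the underlying probabilities differ.
print(lin_similarity(0.01, 0.01, 0.1))
print(lin_similarity(0.0001, 0.0001, 0.01))
```

The printed pairs make the two weaknesses discussed above concrete: each measure on its own collapses distinct word pairs onto one score, which is the motivation for combining them.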
This article proposes a novel text similarity measure (TSM) that builds on the word similarity in (1). The TSM finds the maximum word similarity and the total number of exactly matching words between the compared sentences. Then, the similarities of the compared words are summed and weighted by the sentence lengths and a logarithmic function.
The proposed maximum similarity between a word and a text is shown in (3).
From [1], [33], [58], we inferred that the lengths of the compared texts and the exact word matches have a direct effect on the final similarity score. The longer the compared text, the higher the chance of finding similar words.
The proposed TSM between two text fragments is shown in (4).
In (4), one term represents the exact word matches between the compared sentences, and a length function computes the maximum length of the compared sentences. The Sim function, as defined in (3), stands for the maximum similarity between a word and the compared text fragment.
The application of (3) and (4) can be shown by the following sentence pair taken from the STS-65 dataset [16]:

S1: A boy is a child who will grow up to be a man.

S2: A rooster is an adult male chicken.
When we compare the two sentences using (3), the most similar word pairs from sentence S1 to sentence S2 are as follows: boy and male (0.282), child and male (0.153), and man and adult (0.786). The length of both sentences is 4 after stemming and removing stop words. Thus, applying (4), we obtained a similarity of 0.152. Compared with the reported mean human score (0.11), the proposed method produced an acceptable similarity score.
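The maximum-similarity step in (3) can be sketched as follows. The pairwise scores are the three values reported in the example above; every other pair defaults to 0.0 and an exact match scores 1.0, both of which are assumptions made only to keep the sketch small. The token lists are an assumed result of stemming and stop-word removal, the names `PAIR_SCORES`, `word_sim`, and `max_sim_to_text` are hypothetical, and the final weighting in (4) is not reproduced.

```python
# Word-pair scores reported in the worked example; all other pairs are
# assumed to score 0.0 for the purpose of this sketch.
PAIR_SCORES = {
    ("boy", "male"): 0.282,
    ("child", "male"): 0.153,
    ("man", "adult"): 0.786,
}


def word_sim(w1, w2):
    """Stand-in for the word similarity measure (symmetric lookup table)."""
    if w1 == w2:
        return 1.0  # assumed score for an exact match
    return PAIR_SCORES.get((w1, w2), PAIR_SCORES.get((w2, w1), 0.0))


def max_sim_to_text(word, text_words):
    """Maximum similarity between one word and any word of the compared text."""
    return max((word_sim(word, other) for other in text_words), default=0.0)


# Assumed content words of S1 and S2 (four each, as stated in the example).
s1 = ["boy", "child", "grow", "man"]
s2 = ["rooster", "adult", "male", "chicken"]

for w in s1:
    print(w, max_sim_to_text(w, s2))
```

Each word of the first sentence contributes only its best match in the second sentence, so the cost grows with the product of the two sentence lengths rather than requiring an optimal one-to-one word alignment.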

IV. EVALUATION AND EXPERIMENTAL RESULTS
The evaluation of the word and sentence measures is as follows.

A. Word Similarities
We evaluated the word similarity of [34] on relatively small benchmark datasets [60], [61]. Below, we extend the comparison to larger benchmark datasets: WordSim (WS)-353 [62], the MEN dataset [63], and SimLex-999 [64]. The WordSimilarity-353 test collection contains two sets of English word pairs along with human-assigned similarity judgments. All the subjects in both experiments possessed near-native command of English. Their instructions were to estimate the relatedness of the words in each pair on a scale from 0 (totally unrelated words) to 10 (very much related or identical words). The MEN test collection contains two sets of English word pairs (one for training and one for testing) together with human-assigned similarity judgments, obtained by crowdsourcing using Amazon Mechanical Turk via the CrowdFlower interface. The MEN dataset consists of 3,000 randomly selected word pairs rated on a scale from 1 (lowest) to 7 (highest) similarity. SimLex-999 comprises 666 noun-noun pairs, 222 verb-verb pairs, and 111 adjective-adjective pairs. SimLex-999 is a challenging dataset for computational models to replicate: to perform well, they must learn to capture similarity independently of relatedness/association. The Spearman correlations of the different methods are shown in Table II. The LSA, [65], [44], and VSM correlations were taken from [64]. We used the Brown Corpus and WordNet 3.0 for the JDIC (joint distance and information content) measure [34] and for the Li and Lin measures. According to the results, both the Li and Lin methods perform poorly, which supports our initial hypothesis that a single (corpus-based or knowledge-based) similarity method often does not perform well. In general, word similarity results vary from one method to another depending on the features of the method. Some methods use all word tags, while others use nouns only. Moreover, some methods support disambiguation or use additional domain information. The JDIC method achieved the highest Spearman correlation on the SimLex-999 dataset. The JDIC approach looks for similar words, and the SimLex-999 dataset is composed of similar words rather than related words. The WS-353 [62] list contains pairs that are associated but not similar in the semantic sense, for example, liquid-water. The list also contains many culturally biased pairs, for example, Arafat-terror [4]. Nevertheless, on average the borrowed JDIC method achieved acceptable results compared with the state-of-the-art methods, as shown in Figure I. However, without a real system the comparison remains questionable.
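The Spearman correlation used for this comparison can be sketched with the standard library only: rank both score lists (averaging tied ranks) and take the Pearson correlation of the ranks. The human and model scores below are made-up toy values, not results from any of the datasets above, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: human ratings on a 0-10 scale vs. model scores in [0, 1].
human = [9.0, 7.5, 3.2, 1.0]
model = [0.9, 0.6, 0.7, 0.1]
print(spearman(human, model))
```

Because only the ordering of the pairs matters, a measure can be compared against human ratings on any scale (0 to 10 for WS-353, for instance) without rescaling its output.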

B. Text Similarities
Table III shows the Pearson correlation of a list of text similarity methods over the benchmark dataset of SemEval-2012 [69]. The dataset comprises pairs of sentences drawn from publicly available datasets that have been manually tagged with a score from 0 to 5. Table III shows that the proposed method (TSM) was significant (p < 0.01) over all datasets except for a few. We implemented the Li sentence measure, while the Lin method was implemented based on the sentence measure proposed by [55]. The Google Tri-grams method [36] does not perform as well for text similarity as it does for word similarity (Table II). This finding shows that word similarity on its own does not always lead to a good text similarity measure. It also supports our hypothesis that measures that use large collections of data could overestimate unrelated sentences (shown in Table I). The low performance of the Li measure was related to its inability to capture relatedness in the compared sentences. Preliminary research showed that path and depth alone (the Li measure) cannot give better semantic relatedness. In contrast, the Lin text similarity method shows an average Pearson correlation of 0.51; thus, information content yielded better similarity scores. Further comparisons on the STS-65 dataset can be found in [70]. We noted high performance of TSM (Pearson 0.66) on the OnWN dataset because WordNet is one of the resources used by TSM. TSM performs better than methods (1-9) because each of them is considered to use one technique (knowledge-based or corpus-based), whereas TSM is hybrid. Applying TSM to the two sentence pairs in Table I yielded the scores (0.002, 0.80); thus, the proposed TSM does not suffer from the discussed drawbacks of knowledge-based and corpus-based measures. We found that the performance of TSM was mainly due to the proposed text similarity measure and the borrowed word similarity measure.

Figure I. Average Spearman correlation by similarity method: Li [16], Huang [65], VSM [39], LSA [67], Collobert [44], Mikolov [68], Islam [36], JDIC [34], Pennington [40].

However, our method has some limitations. Compared with methods (11-13), it has lower performance. The main reason is that the top-scoring methods tend to use most of the available resources and tools. For example, yiGou 2015 adds LSA features along with WordNet similarity features. The TakeLab method uses multiple features, including syntax similarity, which is not part of TSM. The UKP method uses a combination of approximately 20 features. These features include n-grams, ESA vector comparisons, and word similarity based on lexical-semantic resources. Furthermore, TSM cannot disambiguate words in different contexts. Therefore, we consider the performance of our method acceptable given that it uses limited data resources. On average (Figure II), the proposed TSM achieved an acceptable Pearson correlation. The proposed method may be used in applications that do not require high accuracy, such as search engines, or on systems that have low resources, such as mobile applications.
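The evaluation protocol described above, a Pearson correlation per dataset followed by an average, can be sketched as follows. The dataset names and all scores are made-up placeholders, and the unweighted mean across datasets is an assumption; the source does not state how its average was computed.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical gold scores (0 to 5) and system scores (0 to 1) per dataset.
gold = {
    "dataset_a": [0.0, 2.5, 5.0, 4.0],
    "dataset_b": [1.0, 3.0, 2.0, 5.0],
}
predicted = {
    "dataset_a": [0.1, 0.4, 0.9, 0.8],
    "dataset_b": [0.2, 0.5, 0.6, 0.9],
}

per_dataset = {name: pearson(gold[name], predicted[name]) for name in gold}
mean_r = sum(per_dataset.values()) / len(per_dataset)  # unweighted mean (assumed)

for name, r in per_dataset.items():
    print(name, round(r, 3))
print("mean", round(mean_r, 3))
```

Pearson correlation is invariant to positive linear rescaling, so gold scores on a 0 to 5 scale can be compared directly with system scores in [0, 1].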

V. CONCLUSION
This article presented a new text similarity measure based on the previously proposed joint distance and information content word similarity measure and the information content of the compared words. The proposed text similarity is weighted by the comparable text length and the total number of exact word matches. The similarity measure outperforms many of the compared similarity measures and is significant at the 0.05 level. The strong performance of our method is due to the use of additional information (corpus and information content) and the effectiveness of the borrowed word similarity measure. Although the proposed method has lower performance than some of the compared models, it requires less machinery and fewer information resources. In the future, we plan to apply the proposed method to a real application in software quality.

Figure II. Pearson correlation by similarity method: G. Tri-grams [36], Li [16], LDA, ESA [42], Lin [22], LSA [43], ADW [47], UNT [54], WTMF [11], TSM, yiGou [57], Takelab [58], UKP [71].

TABLE I
a. Using TASA Space

TABLE III. PEARSON CORRELATIONS OF SEVERAL METHODS OVER THE SEM-EVAL 2012 DATA SETS