Semantic Similarity Calculation of Chinese Word

Abstract: This paper puts forward a two-layer computing method to calculate the semantic similarity of Chinese words. First, a Latent Dirichlet Allocation (LDA) topic model is used to generate a topic space. Words are then mapped into the topic space to form topic distributions, which are used to calculate the semantic similarity of words (the first layer of computing). Finally, the semantic dictionary "HowNet" is used to refine the semantic similarity of words (the second layer of computing). This method overcomes the problem that LDA alone is not specific enough when calculating the semantic similarity of words. It also addresses two problems of similarity calculation based on the semantic dictionary "HowNet": new words that have not yet been added to the dictionary, and the lack of consideration of specific context. Experimental comparison shows the feasibility, availability and advantages of the method.


I. INTRODUCTION
Methods for calculating the semantic similarity of words have been widely used in question-answering systems, information retrieval, machine translation, etc. Different application backgrounds define semantic similarity differently. In question-answering systems and information retrieval, the semantic similarity of words mainly concerns the degree to which words are synonymous. In machine translation, it concerns the degree to which words can substitute for each other in different contexts. The application background of this paper is a Chinese question-answering system, so word semantic similarity is understood here as the degree of synonymity of two words without regard to context. The more synonymous two words are across different contexts, the higher their semantic similarity; otherwise the similarity is lower.
There are mainly two kinds of methods for computing the semantic similarity of words [1]. One counts word information in documents; the other constructs knowledge of the "world". The first kind uses statistical information about words and is based on the observation that similar words tend to cluster together. It is objective and specific, so it can reflect the similarities and differences of words in syntactic, semantic and pragmatic respects. However, it depends on the training corpus and the counting algorithm, and it is easily disturbed by data sparsity and noise, which sometimes produces obvious errors. For example, the LDA (Latent Dirichlet Allocation) topic model [2] generates topic-word and document-topic distributions. Words are aggregated according to topics, so words in the same topic have semantic similarity. The second kind, which uses knowledge of the "world", is based on the fact that everything is interrelated. Generally it describes the characteristics of words and the relations between words using a special description language and builds a dictionary-like structure. For example, the semantic dictionary "HowNet" describes the connections between words through relations between "sememes" and reflects the synonymity of words through the similarity of their sememes [1]. This method accurately reflects the semantic similarities and differences of words, but its results are strongly influenced by the subjective judgment of the dictionary builders. Moreover, a dictionary can never be complete and cannot keep pace with the times, so it cannot always reflect objective facts accurately.
In summary, both kinds of methods have advantages and disadvantages. This paper puts forward a new semantic similarity computing method (the two-layer computing method) by combining the two and redefining the similarity calculation of words. First, the method uses the LDA topic model to mine the topic-word distribution, which reflects the objective existence of words. Then it uses the semantic dictionary "HowNet" to further mine the semantic similarity of words, which reflects the objective substance of words. The new method lays a foundation for judging the similarity of question sentences in a Chinese question-answering system.

A. Problem description
Sentence C1: What is the fastest search engine in the search field?
Sentence C2: In Chinese retrieval, Baidu is more efficient than Google.
We can see that C1 and C2 have no words in common, yet they are still similar. The reason is that Google and Baidu are two specific examples of a search engine. In fact, problems of word and sentence correlation and similarity are often encountered in search engine algorithms and question-answering systems. In the traditional information retrieval field, there are many methods for measuring sentence similarity, such as the classical VSM model. However, those methods are often based on the assumption that the more words two sentences share, the more similar they are. The example above shows that this assumption does not always hold. Most of the time, the degree of synonymity of sentences depends on the semantic relations behind words rather than on repeated words, especially for short texts and questions with few words. Therefore, we adopt the LDA topic model to find the topic distribution behind words and judge the semantic similarity of words.

B. Brief introduction of the Latent Dirichlet Allocation topic model
The LDA topic model, proposed by Blei et al., is a three-layer (text-topic-word) Bayesian generative model [2]. The essence of LDA is to find the topic structure of text using the co-occurrence of words in text. In the generation process, each text is represented as a mixture distribution over topics, and each topic is a probability distribution over words. Building on pLSA [4], the model introduces a hyper-parameter $\alpha$ into the document-topic probability distribution, so that this distribution follows a Dirichlet distribution. Griffiths et al. then applied a Dirichlet prior with hyper-parameter $\beta$ to the topic-word distribution, which made LDA a complete model. The model is represented in Fig. 1, with the meanings of the symbols shown in Table I.

TABLE I. SYMBOLS IN THE LDA MODEL
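According to Fig. 1, the joint probability distribution of LDA is:
$$p(\vec{w}_m, \vec{z}_m, \vec{\theta}_m, \Phi \mid \alpha, \beta) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\theta}_m) \cdot p(\vec{\theta}_m \mid \alpha) \cdot p(\Phi \mid \beta)$$
The hyper-parameters $\alpha$ and $\beta$ are usually set to fixed values; see [4] for detailed information on choosing the $\alpha$ and $\beta$ values.
We can estimate the parameters using:
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta}{\sum_{t} \left(n_k^{(t)} + \beta\right)}, \qquad \theta_{m,k} = \frac{n_m^{(k)} + \alpha}{\sum_{k=1}^{K} \left(n_m^{(k)} + \alpha\right)}$$
where $n_k^{(t)}$ denotes the number of times that word $t$ has been observed with topic $k$, and $n_m^{(k)}$ denotes the number of times that topic $k$ has been observed with a word of document $m$. For more detailed information, see the paper of Blei [4].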
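The parameter estimation step can be illustrated with a minimal sketch. It assumes symmetric scalar hyper-parameters and that the two count matrices have already been collected by a Gibbs sampler; the function name `estimate_phi_theta`, the toy counts and the hyper-parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def estimate_phi_theta(n_kt, n_mk, alpha, beta):
    """Turn Gibbs-sampling counts into smoothed, row-normalized estimates.

    n_kt[k][t]: times word t was observed with topic k (K x V).
    n_mk[m][k]: times topic k was observed with a word of document m (M x K).
    alpha, beta: symmetric Dirichlet hyper-parameters (assumed scalars).
    """
    n_kt = np.asarray(n_kt, dtype=float)
    n_mk = np.asarray(n_mk, dtype=float)
    # Topic-word distribution: each row (topic) sums to 1.
    phi = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)
    # Document-topic distribution: each row (document) sums to 1.
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    return phi, theta

# Toy counts: 2 topics, 3 vocabulary words, 2 documents (hypothetical).
n_kt = [[3, 1, 0], [0, 2, 4]]
n_mk = [[4, 0], [0, 6]]
phi, theta = estimate_phi_theta(n_kt, n_mk, alpha=0.5, beta=0.1)
```

The smoothing terms keep every probability strictly positive, so a word never seen with a topic still receives a small weight in that topic.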

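C. Semantic similarity calculation method of words in the topic space
Running the LDA topic model with Gibbs sampling on the document corpus D, we get the K topics hidden in the documents and the topic-word probability distribution $\varphi$. The topic distribution of a word $w$ is the vector $(\varphi_{1,w}, \varphi_{2,w}, \ldots, \varphi_{K,w})$. The semantic similarity of two words $w_1$ and $w_2$ is the cosine of their topic distributions:
$$Sim_1(w_1, w_2) = \frac{\sum_{k=1}^{K} \varphi_{k,w_1}\, \varphi_{k,w_2}}{\sqrt{\sum_{k=1}^{K} \varphi_{k,w_1}^2}\, \sqrt{\sum_{k=1}^{K} \varphi_{k,w_2}^2}} \qquad (4)$$
The higher the value of (4), the more similar the two words $w_1$ and $w_2$ are, and vice versa.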
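As a minimal sketch of the first-layer calculation, the cosine of two words' topic vectors can be computed as below. It assumes each word is represented by its column of the topic-word matrix; the name `topic_cosine` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def topic_cosine(phi, i, j):
    """Cosine between the topic vectors (columns of phi) of words i and j."""
    u, v = phi[:, i], phi[:, j]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Guard against all-zero vectors, for which the cosine is undefined.
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy topic-word matrix: 2 topics x 3 words (hypothetical values).
# Words 0 and 1 lean towards topic 0, word 2 leans towards topic 1.
phi = np.array([[0.5, 0.4, 0.1],
                [0.1, 0.2, 0.7]])
sim_01 = topic_cosine(phi, 0, 1)
sim_02 = topic_cosine(phi, 0, 2)
```

In this toy case `sim_01` is larger than `sim_02`, because words 0 and 1 concentrate on the same topic.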

A. Problem description
Sentence C1: What is the fastest search engine in the search field?
Sentence C2: In Chinese retrieval, Baidu is more efficient than Google.
Sentence C3: The search result on Google is more accurate than on Baidu.
By constructing the topic space, we find that "search engine", "Google" and "Baidu" are semantically similar by calculating the cosine of their topic distributions (4), and conclude that C1 is similar to C2 and C3. But after further analysis, we find that C1 concerns search speed, C2 concerns retrieval efficiency, and C3 concerns search accuracy. In other words, when C1 is submitted to a search engine, we want to know whether the results are about the aspect of search engine performance that the question asks about. So we need to further judge the synonymity of the other words. There is synonymity among speed, efficiency and accuracy, but the semantic similarity between speed and efficiency is higher than that between speed and accuracy. The topic space created by the LDA topic model can judge the correlation between words through the cosine of their topic distributions (4), but it cannot present more specific semantic information about the words. To make up for this shortcoming, we use the following method based on the semantic dictionary "HowNet" to analyze the specific semantic similarity between words.

B. Brief introduction of "HowNet"
"HowNet" is a common-sense knowledge base whose description objects are concepts and semantic items, and it can describe Chinese and English words using these description objects [1]. The basic content of "HowNet" can be used to compute the relationships between words or phrases. As the meanings of Chinese words are very complex, a word's meaning differs in different contexts. So in "HowNet" one word is described as a collection of several semantic items and concepts. "HowNet" uses "sememes" to further describe semantic items. A sememe is the smallest unit of semantic meaning and does not vary with context.
Sememes are the most basic units for describing semantic items, and complicated relations exist among them [1]. In "HowNet" there are eight relations between sememes: hyponymy, synonymy, relative, antonymy, part-whole, attribute-host, event-role and material-product. Hyponymy is the most important sememe relation. It forms a hierarchy that is described by a tree structure, which is easy for a computer to process. The top of the tree describes abstract concepts and the bottom describes specific concepts. In what follows, we use the hyponymy relation of sememes to compute the semantic similarity of words. For a more precise calculation, the other sememe relations can also be taken into account.

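C. Similarity computing method of words based on "HowNet"
Consider two Chinese words: $w_1$ has $n$ semantic items $S_{11}, S_{12}, \ldots, S_{1n}$, and $w_2$ has $m$ semantic items $S_{21}, S_{22}, \ldots, S_{2m}$. The similarity of $w_1$ and $w_2$ is the largest similarity between their semantic items [1]:
$$Sim_2(w_1, w_2) = \max_{i=1,\ldots,n;\ j=1,\ldots,m} Sim(S_{1i}, S_{2j})$$
Thus, the similarity between two words is transformed into the similarity of two semantic items. The specific context of the two words is not considered here. In practice it would be best to use the sentence context to disambiguate the words first, that is, to assign each word a particular semantic item, and then compute the similarity of the corresponding semantic items. This would be more accurate and will be researched further in the future.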
Observing the semantic dictionary "HowNet", we find that semantic items are divided into function-word semantic items and notional-word semantic items, and that the two classes are described differently. A function-word semantic item is described as {relation sememe} or {syntactic sememe}, so its similarity only requires computing the similarity of the corresponding relation sememes or syntactic sememes. The descriptions of notional-word semantic items are more complex and are divided into four parts:
1) The first independent sememe description: the first of the independent sememes (sememes without a special symbol or relation symbol in front of them).
2) Other independent sememe description: specific words and the independent sememes other than the first one.
3) The relation sememe description: sememes described with a relation symbol.
4) The symbol sememe description: sememes described with a special symbol.
The first part presents the main meaning of the word, so it has the highest weight. In order to lower the weight of the other parts, the calculation formula, following [1], is:
$$Sim(S_1, S_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} Sim_j(S_1, S_2)$$
where $Sim_j(S_1, S_2)$ is the similarity of the $j$-th part, $\beta_1 + \beta_2 + \beta_3 + \beta_4 = 1$ and $\beta_1 \ge \beta_2 \ge \beta_3 \ge \beta_4$, reflecting that the latter parts have lower significance for the overall similarity. The parameters $\beta_i$ can be adjusted.
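One way to realise this weighting is sketched below. It assumes the weighted sum of running products used in HowNet-based similarity work such as [1], so that a low similarity in the first part damps the contribution of every later part. The function name `sense_similarity` and the default weights are illustrative assumptions, not values stated in this paper.

```python
def sense_similarity(part_sims, betas=(0.5, 0.2, 0.17, 0.13)):
    """Combine the part similarities of two notional-word semantic items.

    part_sims: similarities of the four description parts, each in [0, 1],
               ordered from the first independent sememe onwards.
    betas:     non-increasing weights summing to 1 (assumed values).
    Returns sum_i betas[i] * prod_{j <= i} part_sims[j].
    """
    if len(part_sims) != len(betas):
        raise ValueError("part_sims and betas must have the same length")
    total, running_product = 0.0, 1.0
    for beta, sim in zip(betas, part_sims):
        running_product *= sim          # later parts inherit earlier mismatches
        total += beta * running_product
    return total

identical = sense_similarity((1.0, 1.0, 1.0, 1.0))
partial = sense_similarity((0.8, 0.5, 1.0, 1.0))
unrelated = sense_similarity((0.0, 1.0, 1.0, 1.0))
```

Because each term multiplies in all earlier part similarities, two semantic items whose first independent sememes are unrelated score zero regardless of the remaining parts.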
When computing the similarity between a function word and a notional word, the possibility that they express the same meaning is very small in practice. So in this paper the similarity between a function word and a notional word is always taken to be zero.
Finally, all similarity calculations for semantic items ultimately reduce to similarity calculations for sememes. We use the hyponymy relation of sememes to compute the semantic similarity of sememes. The following formula is obtained by experimental analysis:
$$Sim(p_1, p_2) = \frac{\alpha}{d + \alpha}$$
where $d$ is the path length between the sememes $p_1$ and $p_2$ in the hierarchy tree, and $\alpha$ is a parameter that can be adjusted according to the practical application. More information about "HowNet" can be found in [1].
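A distance-based sememe similarity over the hyponymy tree can be sketched as follows. The sketch assumes the similarity takes the form alpha / (d + alpha), with d the number of edges between the two sememes, as in HowNet-based methods such as [1]. The toy hierarchy, the function names and the value of `alpha` are illustrative assumptions and are not taken from the HowNet data files.

```python
def path_to_root(node, parent):
    """Return [node, parent of node, ..., root] using child -> parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of edges between a and b in the tree, or None if unconnected."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:           # first common ancestor
            return steps_from_a[n] + steps_from_b
    return None

def sememe_similarity(a, b, parent, alpha=1.6):
    """alpha / (d + alpha); 0.0 when the sememes lie in different trees."""
    d = tree_distance(a, b, parent)
    return 0.0 if d is None else alpha / (d + alpha)

# Hypothetical hyponymy fragment: child -> parent.
parent = {"animal": "entity", "human": "animal",
          "beast": "animal", "tool": "entity"}
sim_close = sememe_similarity("human", "beast", parent)
sim_far = sememe_similarity("human", "tool", parent)
```

Sememes that share a nearby ancestor ("human" and "beast") score higher than sememes that only meet at the root ("human" and "tool").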

IV. THE TWO-LAYER SEMANTIC SIMILARITY CALCULATION METHOD
The word similarity calculation method based on the LDA topic model, $Sim_1$, embodies the co-occurrence characteristics of words. The word similarity calculation method based on the semantic dictionary "HowNet", $Sim_2$, reflects the semantic connections between words. We combine the two algorithms into a two-layer semantic similarity calculation method, $Sim$. If two words have similar topic distributions and a close semantic connection, their similarity should be high, and vice versa.
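A possible combination of the two layers is sketched below. The sketch assumes a weighted sum of the two scores, which the text does not confirm. The function name, the weight names `w_lda` and `w_hownet`, their default values, and the treatment of words missing from "HowNet" are all assumptions made for illustration.

```python
def two_layer_similarity(sim_lda, sim_hownet, w_lda=0.5, w_hownet=0.5):
    """Combine the LDA-based and HowNet-based similarities of two words.

    sim_lda:    first-layer similarity (topic-distribution cosine).
    sim_hownet: second-layer similarity, or None when a word is not in the
                dictionary (assumed here to contribute 0).
    """
    if sim_hownet is None:
        sim_hownet = 0.0
    return w_lda * sim_lda + w_hownet * sim_hownet

both_layers = two_layer_similarity(0.9, 0.5)
new_word = two_layer_similarity(0.8, None)
```

With this choice, a pair of new words that "HowNet" does not cover still receives a non-zero score from the LDA layer.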
The similarity of words $w_1$ and $w_2$ is computed by combining $Sim_1(w_1, w_2)$ and $Sim_2(w_1, w_2)$. The two parameters of the combination can be adjusted according to the actual application.

A. Preparations of the Latent Dirichlet Allocation topic model
Experimental data (document number M)
We use the complete version of the Chinese text classification corpus of Sougou laboratory (107M). The text set has 10 categories: automobile, finance, IT, health, sports, tourism, education, employment, culture and military. Each category has 8000 documents, 80000 documents in total. The data set can be obtained from [8].

Preprocess
The original documents are preprocessed by word segmentation and stop-word removal. Chinese word segmentation uses the ICTCLAS segmentation system of the Chinese Academy of Sciences. Stop-word removal first applies a conventional removal method; then, after repeatedly inspecting the generated data, regular expressions are written to remove further words (for example named entities and words without specific meaning, such as times and places). Removing stop-words lowers the dimension of the word space, which is useful for computing the semantic similarity of words. The final word dimension is 207499 (word number N). Chinese words are combinations of single characters, and the ways of combining them are very complex, which leads to a very high word dimension. Reducing the word dimension further requires studying the features of Chinese word formation.

Topic number K
We extract 20000 documents from M (2000 documents from each category) to find the most suitable topic number. The number of topics is determined by observing the perplexity index, which represents the uncertainty when predicting data: the lower the value, the better the performance. The calculation formula is as follows [10]:
$$Perplexity(M) = \exp\left( - \frac{\sum_{m \in M} \log p(\vec{w}_m)}{\sum_{m \in M} N_m} \right) \qquad (8)$$
In (8), $N_m$ denotes the length of document $m$, and $M$ denotes the document set. Three experiments are made to set the topic number, each varying it from 10 to 100 in steps of 10. Fig. 2 shows that the perplexity index decreases as the topic number increases. When the topic number is about 97, the decline of the perplexity index is no longer obvious. The larger the topic number, the more complicated the parameter estimation of the LDA topic model, so we set K=100.
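The perplexity index can be computed from estimated model parameters as in the minimal sketch below. It assumes the usual LDA word likelihood, in which the probability of a word in document m is the topic mixture of that document applied to the topic-word matrix. The function name and the toy inputs are illustrative assumptions.

```python
import numpy as np

def perplexity(docs, phi, theta):
    """exp(-(sum of token log-likelihoods) / (total number of tokens)).

    docs:  list of documents, each a non-empty list of word ids.
    phi:   K x V topic-word matrix; theta: M x K document-topic matrix.
    """
    log_likelihood, n_tokens = 0.0, 0
    for m, doc in enumerate(docs):
        # p(w | document m) for every token: sum_k theta[m, k] * phi[k, w].
        token_probs = theta[m] @ phi[:, doc]
        log_likelihood += np.log(token_probs).sum()
        n_tokens += len(doc)
    return float(np.exp(-log_likelihood / n_tokens))

# Sanity check on a toy model: with uniform topics over 4 words, every
# token has probability 1/4, so the perplexity equals the vocabulary size.
phi = np.full((2, 4), 0.25)
theta = np.array([[0.5, 0.5], [0.9, 0.1]])
docs = [[0, 1, 2], [3, 3]]
toy_perplexity = perplexity(docs, phi, theta)
```

Repeating this computation for each candidate topic number and plotting the results gives a curve of the kind used above to choose K.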

D. Experimental
Table II shows the results of the three semantic similarity computing methods for Chinese words. We choose seven groups of words; details are given in the experimental results. From group one, we find that new specific words ("search engine", "Baidu" and "Google") are not included in the semantic dictionary "HowNet", so "HowNet" cannot be used to calculate their semantic similarity. The result for group one shows the limited application scope of "HowNet". The LDA topic model, however, uses a statistical approach (training on a large-scale corpus, generating latent topics and grouping words according to their topic distributions) and so overcomes the limitation on new words. Therefore, as long as the training corpus is broad enough and kept up to date, the application field of the LDA topic model can keep being extended. The extensibility of the LDA topic model makes up for the limitation of "HowNet" very well.
In group two, our purpose is to find which of "efficiency" and "accuracy" is more similar to "speed". Speed expresses how fast or slow something is, accuracy refers to precision or recall in the search field, and efficiency is a comprehensive noun that can express both speed and accuracy. The experimental data show that if we use only the semantic dictionary "HowNet", speed, efficiency and accuracy have identical similarity, without any difference. The two-layer semantic similarity calculation method, however, reflects the differences between the words very well.
In the third group, we want to find which of "sick person", "doctor" and "disease" is most similar to "patient". The experimental results show that when only LDA is used, "doctor", "sick person" and "disease" all have high similarity with "patient". This method distinguishes the words to some degree but does not meet the aim of our application, because our application is ultimately a Chinese question-answering system and the similarity we need is synonymity. The LDA topic model differentiates word similarity through sampling and complicated calculation, but it is still a probabilistic model that reflects word co-occurrence. A document in which the word "patient" appears frequently is very likely to contain "sick person", "doctor", "disease", etc. That is why these words have high similarity with "patient" when only LDA is used. Therefore, to further distinguish the meanings of the words, we use the semantic dictionary to mine their semantics further. As the table shows, the two-layer semantic similarity calculation method gives the greatest similarity to "patient" and "sick person", while separating "doctor" and "disease" from "patient". Introducing the semantic dictionary thus refines the similarity of words.
In group four, the keywords are about colors. In the semantic dictionary "HowNet", semantic items do not reflect the closeness of color attribute values (red and pink are similar, while red and white look very different). It is very difficult to describe the closeness of colors in objective language, but the method in this paper can tell the difference to some extent. The effect, however, will not be consistent across different training corpora.
The words in group five are about sentiment tendency, the words in group six are about the degree of an action, and the words in group seven are computer vocabulary. The experimental data show that the method in this paper can also distinguish the similarity between these words to some extent, in agreement with intuition, which supports the feasibility of the method.

VI. CONCLUSION
This paper presents a two-layer semantic similarity calculation method to mine the semantic similarity of Chinese words. The experiments show that the method is feasible and applicable.

$w_2$ has $m$ semantic items, and the similarity of $w_1$ and $w_2$ is the largest similarity between their semantic items.

Fig. 2. The relation of topic number and perplexity-index
B. Preparations of semantic dictionary "HowNet"
Data sets
We use two data sets compiled by Liuqun (gloss.dat and whole.dat). gloss.dat stores the descriptions of the semantic items of words (66142 records in total). whole.dat stores the hierarchy relations of sememes (1618 records in total). As gloss.dat is large and frequently accessed, it is stored in a MySQL database, which makes lookups faster.
... as 0.5; you can change the value according to the application.

Method 1: Based on the LDA subject model described in this paper.
Method 2: Word semantic similarity computation based on "HowNet" by Qun Liu [1].
Method 3: The two-layer semantic similarity calculation method.
$N_m$ denotes the length of document $m$, and $M$ denotes the document set.