Query Expansion based on Word Embeddings and Ontologies for Efficient Information Retrieval

Information retrieval is an ongoing effort to let end users fetch relevant data in a single attempt. The problem intensifies with unstructured data in a semantic web environment, and it remains a promising area for researchers to refine. Expanding and reformulating the user query is one probable way to increase the efficiency of an information retrieval system. In this paper we propose “WeOnto”, a novel two-level query expansion algorithm that combines web ontologies and word embeddings for similarity calculation. In the first level, a Real Estate Ontology (REO) is created using Protégé, and SPARQL queries retrieve probable semantic words from the ontology for each user query. The first level alone improved information retrieval by 18%. The second level uses word embeddings enhanced with domain knowledge to retrieve similar, meaningful words for the same user query based on cosine similarity. The word embeddings are implemented with the Word2Vec method, which offers two architectures, CBOW and Skip-Gram. In the proposed algorithm, the most similar words are retrieved using CBOW embeddings and concatenated with the semantic keywords generated from the real estate ontology to form a reformulated query. Finally, the two topmost words by similarity index are used to reformulate the original user query. Experimental results show that the proposed algorithm achieves an improvement of 93% over the initial user query.
Keywords—CBOW; information retrieval; ontology; query reformulation; semantic web; skip gram; word embeddings


I. INTRODUCTION
The Internet is a deep ocean of information, and efficient information retrieval has been a constant desire of users. Researchers have continually worked towards this goal, designing various methodologies and algorithms to give intended users easy and quick access to information. The main aim has been to work at the basic level and frame the user query such that the expanded query gives better results with increased precision.
Traditional IR methods were based on TF-IDF, Boolean models, vector space models (VSM), or BM25, all of which rely on term and document frequencies. They suffered from word mismatch, known as the lexical gap problem [1], and at times queries were formulated incorrectly or contained ambiguous words, which led to poor retrieval. This leads to the following problems that need to be addressed:
• With ever-growing data on the Internet, improving the efficiency of information retrieval systems has always been an issue.
• How well can a user query be formulated so that it increases the efficiency of the retrieved web results?
• How can the effectiveness of a formulated user query in retrieving the required relevant results be measured?
Thus, the research objective is to increase the efficiency of information retrieval systems by focusing on query expansion, such that reformulated user queries return the desired web results when their performance is evaluated.
Our research work therefore focuses on the problem of information retrieval and poor web results and proposes WeOnto, a novel algorithm that increases the efficiency of information retrieval by combining word embeddings with web ontologies in a semantic web environment. The algorithm reformulates the user query by expanding it with the most similar new words, thereby giving better retrieved results.
Such expanded queries include the original query and some additional keywords that are found relevant to the given query keywords. These additional keywords are derived using word embeddings combined with ontologies, both of which help to extract the semantics of the given query words.
Web ontologies are stored in the form of triples, i.e., subject, predicate and object, with the relation between the subject and object expressed within the triple. Word embeddings, on the other hand, represent each word as a vector learned from its surrounding context; cosine similarity between these vectors is then calculated to find the most similar words.
Word embeddings are a method from natural language processing that has gradually found application in information retrieval as well [2]. Pre-trained word embeddings are applied to the user corpus to retrieve the most similar words for the query words, so that the given user query can be expanded for efficient information retrieval.
In this paper, Section II explains the background and related work, while Section III covers the method and material used and describes the proposed algorithm, “WeOnto”, used for query expansion to increase the efficiency of information retrieval. Section IV presents the results and discussion, along with an analysis of the experiment conducted on a user-defined corpus. Section V is the conclusion.
II. RELATED WORK
Information retrieval has always been a topic of concern for researchers worldwide, and many experiments and methodologies have been devised to increase its efficiency over time. Traditional IR models include the Boolean model, the vector space model (VSM), probability-based models, and fuzzy set models [3]. These models enhanced the workability of IR systems, but with ever-growing information on the Internet, the need to improve efficiency continues. The keyword matching approach could not handle problems like polysemy, where the semantics of words is required rather than a syntactic approach.
Recent researchers have therefore developed methods that focus more on semantics and on the meaning of words in the context in which they are used. For this purpose, a natural language processing technique called word embeddings [4] has emerged as a probable solution for information retrieval. Word2vec is a shallow neural network method in NLP that learns word embeddings from the context in the given corpus and outputs the most similar words. Siriguleng [5] used word2vec together with an LDA topic model to expand Mongolian queries and improve retrieval. B. Wang et al. [6] discussed experimental results from six embedding models; they compared these models but could not find one universal method that would cater to all possibilities. B. Mansurov and A. Mansurov [7] applied word embeddings to the Uzbek language to obtain semantically similar words. Farhan et al. [8] took the top relevant results and calculated average vector values using word embeddings in a deep neural network, improving the IR system to an extent. Various researchers have recently recognized the power of using ontologies with word embeddings and have demonstrated their effectiveness; some of these works are listed here. WE-based Arabic IR models also use WordNet and embeddings and compare performance after incorporating embeddings, as in [9]. QSST, a Quranic searching tool based on word embeddings, gave a high performance with an average precision of 91.95% [10]. Jin Ren et al. [11] reported effective results from combining ontology-related predicate expressions with word embeddings. Jayawardana et al. [12] describe the use of word embeddings for semi-supervised ontology population. Lastra-Díaz et al. [13] stated that, in an experimental survey, averaging two models, i.e., word embedding models and ontology measures, gave better results.

A. Word Embeddings
Word embeddings are an application of unsupervised learning that also involves transfer learning, as learned representations are applied to the given user corpus. Embeddings can be character level or word level [14]. Word-level embeddings use the word2vec method, whose basic construct is to convert words into vectors and then apply mathematical relations to them based on the corpus being used. Vectors of similar words lie close to each other and have similar values; their similarity value is mostly greater than 0.6. The closer the value is to 1, the higher the similarity index, and the more similar the two vectors or words are considered.
Word2vec is a shallow neural network method in NLP that learns word embeddings from the context in the given corpus and outputs the most similar words. The similarity is calculated using cosine similarity [15] such that:
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \qquad (1)$$
where A and B are the vectors of the two words being compared. The similarity value of the vectors ranges from -1 to +1. The Gensim library in Python provides everything needed to run this model and inspect its output. The model can be configured with different vector dimensions and window sizes.
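As a minimal sketch of the cosine similarity used to compare word vectors, the function below computes it for plain Python lists. The function name and the toy three-dimensional vectors are illustrative assumptions, not values from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative values only).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Vectors pointing in the same direction score close to +1, orthogonal vectors score 0, and opposite vectors score close to -1, which matches the stated range of the similarity value.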
The Word2Vec model is further divided into two architectures, Continuous Bag of Words (CBOW) and Skip-Gram (see Fig. 1). Each calculates vectors in its own way, giving different but closely related results.
The CBOW architecture predicts the ‘current word’ from the input context words, whereas Skip-Gram works the other way around: it takes the current word as input and predicts the context words before and after it [15].
The window size plays an important role in capturing the context of the corpus and giving similar words as output. In Fig. 1, the window size is 2, where W(t) is the target word and the words at positions t-2, t-1, t+1, and t+2 are the neighboring words that form the contextual window from which the meaning is understood. The closer the words are, the more strongly they are related to each other.
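The difference between the two architectures can be illustrated by listing the training examples each one would see for a window size of 2. This is a simplified sketch: the helper names and the sample sentence are assumptions, and real implementations add details such as subsampling and window shrinking that are omitted here.

```python
def context_window(tokens, index, window):
    """Return the neighbours within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """(context words, target word) pairs: CBOW predicts the target from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window):
    """(target word, context word) pairs: Skip-Gram predicts each neighbour from the target."""
    return [(tokens[i], neighbour)
            for i in range(len(tokens))
            for neighbour in context_window(tokens, i, window)]

# Hypothetical sentence from a real-estate corpus.
sentence = ["buyer", "signs", "sale", "deed", "today"]
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)
```

For the target word "sale", CBOW receives the single example (["buyer", "signs", "deed", "today"], "sale"), while Skip-Gram receives four separate pairs, one per neighbour.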
All of these researchers have tried various ways of implementing ontologies or word embeddings to achieve efficient information retrieval. A few of them have achieved a precision of 91% using their own methods. Our proposed algorithm, “WeOnto”, demonstrates the power of ontologies combined with word embeddings, which together work on the semantics of keywords rather than pattern matching alone, and yields an effective information retrieval system with 93% precision.
In the proposed algorithm, the user query goes through a series of steps that work on its semantics by fetching the most similar words and reformulating the query, so that the efficiency of information retrieved during web search improves.
The next section gives a detailed description of the WeOnto algorithm, a novel query optimization approach that provides users with more meaningful similar words and thereby improves retrieval results.

III. METHOD AND MATERIAL
With the information explosion on the Internet, better information retrieval systems are always in demand. Query reformulation is one probable way to increase their efficiency.
For this purpose, we expand the query by first preprocessing it so that stop words are removed, and then gathering similar terms related to the major keywords that remain. The query after expansion has two sets of words: i) keywords from the query, and ii) newly added words.
The question that arises is how to obtain these new terms.
“WeOnto”, the proposed algorithm, answers this by applying a combination of ontologies and word embeddings to the user query and reformulating it into a new query that is better suited to the domain and aims to give more relevant results with increased precision.
In the WeOnto algorithm, step 1 takes an input user query Q (see Fig. 2) that is reformulated so that the expanded query Q' at the end of the algorithm retrieves more relevant web documents with better precision.
To expand the given user query Q, the query is sent for pre-processing, which includes removal of stop words, conversion to lower case, and tokenization, giving q[] = {t_1, t_2, t_3, …, t_n}. Here, q[] is the query after preprocessing, i.e., the list of tokens {t_1, t_2, t_3, …, t_n}. These tokens t_i, i = 1…n, are sent through a two-level process of query expansion, as can be seen in the algorithm. In the first level, shown in step 2, the tokens t_i are passed to the real estate ontology (REO), which was created to store the legal glossary terms used in real estate documentation during the buying and selling of properties [18]. The ontology was created using the WordNet vocabulary to capture both the syntax and semantics of the English language as well as the legal terminology used for query reformulation. SPARQL queries are issued in the background to fetch semantically enriched keywords, or synsets, for each token t_i. For every token t_i, if its synset exists, the synset is added to the semantic words list SW[]; otherwise the token itself is added to SW[]. This gives the list of semantic words fetched from REO, as seen in Fig. 3.
www.ijacsa.thesai.org
In the second level, depicted in step 3, a shallow-learning NLP technique is used: a Word2Vec model M that applies the Continuous Bag of Words (CBOW) method to learn from the given corpus C. The corpus was built by scraping 2000 web documents related to the real estate legal documentation domain. The word embedding model M learned about 12.5 thousand word vectors from roughly 1.05 million words of text, and these vectors are used to derive the most similar words for the same set of tokens t_i.
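The pre-processing step and the first-level fallback rule (use the synset if one exists, otherwise keep the token) can be sketched as follows. This is only an illustration: the stop-word list is a small hypothetical sample, and the dictionary `TOY_ONTOLOGY` is an assumed stand-in for the SPARQL lookup against REO, with made-up entries.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "to", "is", "what", "how"}

# Assumed stand-in for the SPARQL lookup against REO: token -> synset terms.
TOY_ONTOLOGY = {
    "deed": ["title", "conveyance"],
    "lease": ["tenancy"],
}

def preprocess(query):
    """Lower-case the query, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def semantic_words(tokens, ontology):
    """Build SW[]: the synset of each token if it exists, else the token itself."""
    sw = []
    for token in tokens:
        synset = ontology.get(token)
        sw.extend(synset if synset else [token])
    return sw

q = preprocess("What is the Deed of a lease property?")
sw = semantic_words(q, TOY_ONTOLOGY)
```

Here the token "property" has no entry in the toy ontology, so it is carried into SW[] unchanged, mirroring the rule stated for tokens without a synset.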

These similar words are retrieved by calculating the cosine similarity for each pair and are stored in the list sim_list[].
A threshold value is then calculated, as per step 3B in Fig. 2, by taking the average of the similarity indices obtained for every token; it is used to select the top k most similar words. Words whose similarity index is higher than this threshold are retained and transferred to the actual similar words list, act_sim_words[].
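A minimal sketch of this averaging threshold is shown below. The function name and the similarity scores are illustrative assumptions; only the list names sim_list and act_sim_words come from the text.

```python
def filter_by_average(sim_list):
    """Keep the (word, similarity) pairs whose similarity exceeds the mean."""
    if not sim_list:
        return []
    threshold = sum(score for _, score in sim_list) / len(sim_list)
    return [(word, score) for word, score in sim_list if score > threshold]

# Hypothetical similar words for one token, with made-up similarity scores.
sim_list = [("title", 0.82), ("conveyance", 0.74), ("registry", 0.61), ("garden", 0.43)]
act_sim_words = filter_by_average(sim_list)
```

With these scores the average is about 0.65, so only "title" and "conveyance" pass the threshold. Because the threshold is recomputed per token, it adapts to how tightly clustered each token's neighbours are.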
Step 4 of the WeOnto algorithm takes the union of the two lists obtained above. The ontology is applied at the first level to find semantically enriched words for the tokens; word embeddings are then applied at the second level to capture the context of real estate related queries with the help of the knowledge model that has learnt from the user-defined corpus.
Step 5 of the proposed algorithm is the final step: the union of the original tokens from the user query with the most suitable semantically enriched similar words, as in Eq. (2):
$$Q' = q[\,] \cup SW[\,] \cup act\_sim\_words[\,] \qquad (2)$$
Here, we need to remember that Q' holds unique words only.
Hence, Q' is the final reformulated query, obtained by expanding the user query with the proposed algorithm: the list of semantic words retrieved from the ontology is concatenated with the list of most similar words obtained from the word embeddings-based NLP model using cosine similarity values.
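The union that keeps only unique words can be sketched in a few lines. Preserving first-seen order (so the original tokens stay at the front of the query) is a design assumption made here, and the word lists are hypothetical.

```python
def reformulate(tokens, semantic_words, similar_words):
    """Union of the three lists with duplicates removed, keeping first-seen order."""
    # dict.fromkeys preserves insertion order and drops repeated keys.
    return list(dict.fromkeys([*tokens, *semantic_words, *similar_words]))

q_prime = reformulate(
    ["deed", "lease"],             # tokens of the pre-processed query
    ["title", "deed", "tenancy"],  # semantic words from the ontology
    ["title", "registry"],         # similar words from the embedding model
)
```

The repeated words "deed" and "title" appear only once in the result, giving the expanded query deed, lease, title, tenancy, registry.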
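The increase in the performance of information retrieval systems is calculated as defined in [19]:
$$\text{Improvement (\%)} = \frac{P_{\text{expanded}} - P_{\text{baseline}}}{P_{\text{baseline}}} \times 100 \qquad (3)$$
where P_baseline is the mean average precision of the baseline queries and P_expanded is that of the reformulated queries. The proposed algorithm was tested and gives promising results, showing a remarkable increase in the efficiency of the IR system by incorporating a methodology that uses both ontology and word embeddings from NLP.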
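The gains reported in this paper are consistent with a relative-improvement calculation over the baseline precision. The sketch below assumes that form; the function name is hypothetical, and the two inputs are the mean average precision values reported in the results (0.44 for baseline queries and 0.85 after the second level).

```python
def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline` (assumed form of the metric)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0

# Mean average precision values reported for baseline and fully expanded queries.
gain = relative_improvement(0.44, 0.85)
```

Rounded to the nearest integer, this reproduces the 93% improvement quoted for the complete algorithm.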

A. Experiment Setup
The proposed algorithm, “WeOnto”, has a two-level procedure in which the first level uses the real estate ontology (REO) as defined in [20]. The real estate ontology was created for the domain of real estate related legal documentation and has a glossary of legal terminology built using the WordNet dictionary, as seen in Fig. 4. The first level uses REO to retrieve semantically enriched keywords for the given user query and improved the reformulated query by 18%. The second level of the algorithm is designed to further improve the information retrieval system and obtain more relevant results.
Hence, Step 3 of the WeOnto algorithm covers the second level, in which similar words are generated using word embeddings from natural language processing.
The Word2Vec model is first trained on a real estate related dataset containing data from 2000 web documents that were either government based or related to legal matters or real estate buying and selling. The implementation is done in Python, using the Gensim library to train the model. The resulting model contains word vectors for a vocabulary of 12,462 words, trained on around 1.05 million words from the user corpus, and various Gensim methods are then applied to it to derive similarity values.
Various parameters were set while training the model with the Gensim library in Python: vector size = 100; initial learning rate, alpha = 0.025; window size = 5, meaning that up to five context words are considered on each side of the target word; min_count = 1, meaning that every word occurring at least once in the corpus is kept; and sg, which selects the architecture (sg = 0 for CBOW and sg = 1 for skip-gram).
Fig. 4. Glossary Entity of Real Estate Ontology [19].
After the model has learnt and created vectors, or embeddings, from the corpus, it is loaded to fetch similar words using cosine similarity (see Eq. 1) for a given user query. A high cosine similarity between two words indicates that the words are semantically similar, i.e., that their vectors point in similar directions in the embedding space.
The ‘most_similar’ method returns the most similar words, ranked by cosine similarity, for every token in the user query. To find the similarity between two specific words, i.e., between a user query token and a synset retrieved from REO, the ‘model.similarity()’ method from the Gensim library in Python is used, and all values are stored in final_list.
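The training settings and lookup calls described above might look like the sketch below. It assumes Gensim 4.x, where the vector dimension argument is named `vector_size` and the similarity methods are reached through `model.wv`; the helper names `train_cbow`, `expand_token`, and `pair_similarity` are hypothetical, and the corpus is expected as an iterable of token lists.

```python
# Hyperparameters quoted in the text, using Gensim 4.x argument names.
WORD2VEC_PARAMS = {
    "vector_size": 100,  # dimension of each word vector
    "alpha": 0.025,      # initial learning rate
    "window": 5,         # max distance between target and context word
    "min_count": 1,      # keep every word that occurs at least once
    "sg": 0,             # 0 selects CBOW, 1 selects skip-gram
}

def train_cbow(sentences):
    """Train a Word2Vec model on an iterable of token lists."""
    # Imported lazily so the settings above can be inspected without Gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **WORD2VEC_PARAMS)

def expand_token(model, token, topn=10):
    """Return (word, cosine similarity) pairs for a token, or [] if it is out of vocabulary."""
    if token not in model.wv.key_to_index:
        return []
    return model.wv.most_similar(token, topn=topn)

def pair_similarity(model, word_a, word_b):
    """Cosine similarity between two in-vocabulary words."""
    return float(model.wv.similarity(word_a, word_b))
```

A typical use would be to call `train_cbow` once on the tokenized corpus, then `expand_token(model, "deed")` for each query token and `pair_similarity` for each token and synset pair. The out-of-vocabulary guard is an added precaution, since looking up an unseen word in a trained model raises a `KeyError`.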
Once the training is done, the entire algorithm is applied stepwise to a test set of 50 random user queries. The result generated at each step is shown in Fig. 5, where every column corresponds to a sub-step of the algorithm.
Column 1 depicts the initial user query, Q. The query is pre-processed and converted into a set of tokens, Q1, as shown in column 2 of Fig. 5. These tokens are passed to the real estate ontology (REO), and synsets (named set A) are retrieved for each token, as in column 3.
In the second level of the algorithm, the tokens are sent to the Word2Vec model, which produces the most similar words (named set B), stored in column 4. Column 5 depicts the union of sets A and B along with the cosine similarity calculated for all paired vectors. A threshold value is first calculated as the average similarity of the N vectors retrieved in column 4; it changes with every set of word vectors. Words whose similarity is greater than this average threshold are counted as the best and kept in column 5.
Hence, column 5 holds the topmost words with the highest cosine similarity, and column 6 shows the best two words derived for each token, which are added to the final expanded query.
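Selecting the best two words from the merged candidates can be sketched as below. The function name, the choice to keep the higher score when a word appears in both sets, and the scores themselves are assumptions made for illustration.

```python
def best_two(set_a, set_b):
    """Merge two {word: similarity} maps and return the two highest-scoring words."""
    merged = dict(set_a)
    for word, score in set_b.items():
        # If a word occurs in both sets, keep its higher similarity.
        merged[word] = max(score, merged.get(word, score))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:2]]

# Hypothetical candidates for one token: set A from the ontology, set B from Word2Vec.
set_a = {"title": 0.71, "conveyance": 0.66}
set_b = {"registry": 0.78, "title": 0.80, "garden": 0.40}
best = best_two(set_a, set_b)
```

For these made-up scores the two selected words are "title" and "registry", which would then be appended to the token in the expanded query.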
Column 7 depicts the final expanded query, Q3, which holds the tokens after pre-processing together with the topmost two similar contextual words retrieved by the algorithm. This query Q3 is finally tested on the test bed, www.google.com, to retrieve the most relevant web documents against the initial user query. Table I shows the number of relevant documents retrieved at three levels: the initial query Q1; Q2, the query transformed by applying the real estate ontology (REO) only; and the final expanded query Q3, which combines REO with the word2vec method used in word embeddings. Table I also depicts the average precision of each query calculated at the baseline, ontology, and embeddings levels.

B. Result Discussion
The graph in Fig. 6 shows the average precision of retrieving relevant documents for the given 50 queries.
After the complete algorithm is implemented, the average precision of the post word embeddings queries, Q3, is substantially higher than that of the baseline queries, Q1 (see Fig. 6).
Another metric, precision at 10 (P@10), is also used to evaluate the performance of the WeOnto algorithm and of information retrieval at large. P@10 is based on the number of relevant documents among the top 10 retrieved documents. Table II depicts a sample of the P@10 values computed for all 50 queries, and Fig. 7 shows the P@10 metric for the 50 queries together. The values of P@10 increased considerably for every query after the implementation of word embeddings as compared to the baseline queries.
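As a minimal sketch of how P@10 could be computed from relevance judgements, the function below takes a ranked list of 0/1 labels (1 meaning the document at that rank is relevant) and returns the relevant fraction of the top k. The function name and the judgement lists are hypothetical.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k

# Hypothetical relevance judgements for the top 10 results of two queries.
judgements = [
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],  # 7 relevant documents in the top 10
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],  # 4 relevant documents in the top 10
]
p_at_10 = [precision_at_k(r) for r in judgements]
mean_p_at_10 = sum(p_at_10) / len(p_at_10)
```

Averaging the per-query values gives a single figure that can be compared between baseline and expanded queries, in the same way the per-query results are summarized in the tables.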
The results show a major increase in the number of relevant documents retrieved and hence a higher mean average precision upon implementing the proposed algorithm. Table III displays a mean average precision of 0.44 for baseline queries, which increased to 0.85 after the second stage of the WeOnto algorithm. As per Eq. 3, this is an improvement of 93% in the efficiency of the information retrieval system, compared to an improvement of 18% at the first level of the algorithm. Precision at 10 also shows a clear increase, indicating that the top 10 documents retrieved are 75% more relevant than those for the initial baseline queries.
The graph in Fig. 8 depicts a significant improvement in the values of the metrics used for performance evaluation of the REIR model, calculated at each level as described in the paper. It clearly shows an increase in the efficiency of information retrieval when the semantically enriched ontology and the word embeddings model of NLP are used for quick retrieval of real estate related legal documents.
A trend towards using semantic ontologies [21] for query expansion already existed. Combining them with word embeddings has been shown to give better information retrieval results.
It is evident that the WeOnto algorithm proposed in this paper, which combines a web ontology with word embeddings for query expansion, as also mentioned in [22], gives significant results for the retrieval of web documents compared to the baseline user queries.

V. CONCLUSION
Improving information retrieval so that web documents are retrieved efficiently and with high precision has been an ongoing process. Numerous methods have been developed over time, including traditional Boolean models, vector space models, and probabilistic models. Each of them was more concerned with keyword matching in queries and had very little understanding of the semantics or context of the query.
Query expansion, which reformulates the user query to obtain better IR results, has been a promising solution. The proposed algorithm, WeOnto, works on query expansion and suggests a two-level procedure that uses ontologies and word embeddings. The ontology gives semantically enriched keywords for the user-query tokens, whereas the Word2Vec model learns from the given corpus and gives the most similar words for those tokens. The best keywords from the entire set are extracted to form the final reformulated query, which gave remarkable results and increased the precision of the web documents retrieved. In future, sentence-based embeddings can be devised instead of word embeddings. Also, as embeddings are shallow unsupervised NLP techniques, the learning of the model can be improved by growing the size of the corpus.