Hybrid Spelling Correction and Query Expansion for Relevance Document Searching

A digital library is a type of information retrieval (IR) system. Existing IR methodologies generally have problems with keyword searching: some search engines cannot return results when keywords only partially match or contain typographical errors. It is therefore necessary to provide search results that are relevant to the keywords supplied by the user. We propose a model that addresses this problem by combining spelling correction and query expansion. Searching starts by indexing the document titles: the titles of all incoming documents are preprocessed and then weighted with Term Frequency–Inverse Document Frequency (TF-IDF) over all terms in the collection. The Levenshtein Distance algorithm is used in the search process to correct typo-indicated keywords. Before the relevance between the keywords and the documents is calculated with Cosine Similarity, the keywords are expanded using Query Expansion to increase the number of documents retrieved. The Cosine Similarity results are then combined with the Query Expansion weight calculation to obtain the final ranking. The results show improvements over an IR system without spell checking and query expansion. The study produced a web-based application that was tested 50 times on 2,045 records. The system was able to correct typo-indicated keywords and search documents with an average recall of 95.91%, an average precision of 63.82%, and an average Non-Interpolated Average Precision (NIAP) of 86.29%.

Keywords—Cosine similarity; information retrieval; Levenshtein distance; TF-IDF; typographical error; query expansion


I. INTRODUCTION
A digital library is an information service provider in the form of digital documents that can be accessed online. It is very helpful for students searching for information to complete assignments, as well as for supporting documents for the research they are conducting. The large number of digital documents in a digital library (digilib) widens the scope of information search, so the amount of information and the need for relevant information keep increasing [1]. Based on our observation, some digilibs are not able to provide search results for partially matching or typo-indicated (mistyped) keywords.
In information searching, the model used and the choice of keywords influence how relevant the retrieved documents are to the user's keywords. One such model is the Vector Space Model (VSM), which represents documents as vectors in a vector space. VSM makes it possible to determine which documents are relevant to the keywords based on a similarity measurement [2]. One popular VSM similarity measure is Cosine Similarity, which calculates the cosine of the angle between two vectors; it can be applied to both full document matching and partial matching [3]. With the ever increasing size of the web, extracting relevant information from the Internet with a query formed by a few keywords has become a big challenge. Query Expansion (QE) plays a crucial role in improving Internet searches [4]: it reformulates the user's initial query into one closer to the user's intended meaning by retrieving the most relevant expansion words and adding them to the initial query. The reformulated query is then used to obtain more appropriate results from the large amount of information on the web [5].
Spelling errors are words that a spell checker cannot find in its lexicon [6]. A typo in the keywords is one reason why search results are not relevant: the keywords entered do not exist in the database, so the search engine cannot find documents relevant to the typo-indicated keywords. Several studies on search engines concluded that the rate of keyword spelling errors made by users is relatively high [7]. The causes of these errors are usually ignorance of the correct spelling, the positions of keyboard keys, and finger movement [8]. Spelling correction is therefore needed. Levenshtein Distance, also called edit distance, is used to find candidate words based on the minimum number of characters that must be substituted, inserted, or deleted to change string A into string B [9]. Levenshtein Distance gives good results in string matching problems used for text suggestion, for instance in handwriting recognition and in finding misspelled search words, so input effectiveness increases, misspellings can be avoided, and auto-completion speeds up human-computer interaction [10] [11].
Based on these problems, this research aims to improve the relevance of document search by combining Query Expansion, the Levenshtein Distance algorithm, and Cosine Similarity calculation.

A. Text Preprocessing
Text Preprocessing is the process of converting unstructured data into the structured form needed for further processing in text mining. In this research, text preprocessing is applied to the titles of research documents and to the user's query [12]. It uses several general steps:
1) Case folding: changing all letters in a document to upper case or lower case. In this research, lower case is used.
2) Tokenizing: breaking a string down into smaller units called terms. A token can be a word, a number, a sentence, or a paragraph. In this research, the terms produced by tokenizing are single words [13].
3) Filtering: removing symbols from the string. In this research, all non-alphanumeric symbols are removed.
4) Stopword removal: removing unessential words by checking whether each word from the parsing result appears in a list of unessential words (stoplist), for instance the conjunctions "adalah", "dan", "dari", "yang", "di", and "ke" [14].
5) Stemming: removing affixes, including prefixes, infixes, and/or suffixes, from the words to be processed.
This research adds one more step that expands acronyms into their full forms so that the term table used for TF-IDF weighting becomes more structured.
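As an illustration, the following minimal Python sketch chains these steps together; the stoplist, acronym map, and suffix-stripping stemmer are simplified stand-ins rather than the actual resources used by the system, which processes Indonesian text.

```python
import re

# Illustrative stand-ins; the real stoplist, acronym map, and Indonesian
# stemmer used by the system are not reproduced here.
STOPLIST = {"adalah", "dan", "dari", "yang", "di", "ke"}
ACRONYMS = {"ir": "information retrieval"}  # hypothetical entry

def simple_stem(term):
    # Placeholder stemmer: strips a few common Indonesian suffixes only.
    for suffix in ("kan", "nya", "an"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    text = text.lower()                                 # 1) case folding
    for acro, full in ACRONYMS.items():                 # extra step: acronym expansion
        text = re.sub(rf"\b{acro}\b", full, text)
    tokens = re.findall(r"[a-z0-9]+", text)             # 2) tokenizing, 3) keep alphanumerics only
    tokens = [t for t in tokens if t not in STOPLIST]   # 4) stopword removal
    return [simple_stem(t) for t in tokens]             # 5) stemming

print(preprocess("Pencarian Dokumen yang Relevan di Perpustakaan Digital"))
# ['pencari', 'dokumen', 'relev', 'perpustaka', 'digital']
```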

B. Synonym Table Formation
Every word in the wordlist table is then processed to find its synonyms, which serve as the Query Expansion reference. Synonyms are found by scraping the web pages of kamuslengkap.com, and the results are stored in the wordlist table as shown in Fig. 1.
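A minimal scraping sketch is given below; the URL pattern and CSS selector are hypothetical placeholders (the actual page structure is not documented here) and would need to be adapted to the real markup of the synonym pages.

```python
import requests
from bs4 import BeautifulSoup

def scrape_synonyms(word):
    """Fetch synonym candidates for a word. The URL pattern and the CSS
    selector below are hypothetical and must be adapted to the actual
    markup of the synonym pages being scraped."""
    url = f"https://kamuslengkap.com/kamus/sinonim/arti-kata/{word}"  # assumed URL pattern
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return []
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumption: synonyms appear as links inside a container element.
    return [a.get_text(strip=True) for a in soup.select("div.synonyms a")]

# Each scraped synonym list is then stored in the wordlist table
# alongside the word it belongs to.
```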

C. Query Expansion
Query Expansion reformulates the user's original query to improve the effectiveness of information retrieval [15]. In this research, query expansion aims to increase recall by also retrieving documents whose terms are synonyms of, or have similar meaning to, the query terms. Expansion is applied to terms that have not yet been stemmed [16]. The synonyms of the keywords are looked up in the wordlist table (Table I), which contains the words taken from the document titles together with the synonyms found through the prior scraping. For instance, for the keywords "Pencarian dokumen" (document search), each term is expanded with its synonyms from the wordlist table, so the documents to be searched are not only those containing the terms "pencarian" and "dokumen" but also "penelusuran", "pelacakan", "arsip", and "naskah".
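As a minimal illustration, the following Python sketch expands the example query above using an in-memory dictionary standing in for the wordlist table.

```python
# Stand-in for the synonym data scraped into the wordlist table.
SYNONYMS = {
    "pencarian": ["penelusuran", "pelacakan"],
    "dokumen": ["arsip", "naskah"],
}

def expand_query(terms):
    """Return (term, origin) pairs: the original terms plus their synonyms."""
    expanded = []
    for term in terms:
        expanded.append((term, "original"))
        for synonym in SYNONYMS.get(term, []):
            expanded.append((synonym, "expansion"))
    return expanded

print(expand_query(["pencarian", "dokumen"]))
# [('pencarian', 'original'), ('penelusuran', 'expansion'), ('pelacakan', 'expansion'),
#  ('dokumen', 'original'), ('arsip', 'expansion'), ('naskah', 'expansion')]
```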

D. Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF weights a term of a document against the whole document collection. It measures the relevance of a term to a document by combining two weighting concepts: the frequency with which the term occurs in the document (term frequency) and the inverse of the number of documents containing the term (inverse document frequency) [17].
A term with a high Term Frequency (TF) in a document is more important because it can indicate the topic of the document. Several formulas exist for calculating term frequency; in this research, binary TF is used: a term that appears in a document scores one (1) regardless of how often it occurs, and a term that does not appear scores zero (0). Inverse Document Frequency (IDF) indicates the discriminative power of term i; in general, a term that occurs in many documents is a weaker indicator of a particular topic. The inverse document frequency is defined as:

IDF_i = log(n / df_i)    (1)

where df_i is the document frequency of term i, i.e. the number of documents containing term i, and n is the total number of documents.

The weight W_ij is the product of the term frequency and the IDF value of each term:

W_ij = tf_ij × IDF_i    (2)

where tf_ij is the (binary) term frequency of term i in document j and IDF_i is the inverse document frequency of term i.
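The following Python sketch illustrates this binary-TF/IDF weighting; the base-10 logarithm is an assumption (consistent with the df > 10 threshold used later in the expansion weighting), and the sample documents are illustrative only.

```python
import math

def tfidf_weights(docs):
    """Binary-TF / IDF weighting: tf is 1 if the term occurs in the document,
    0 otherwise, and IDF_i = log10(n / df_i) as in (1)."""
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    idf = {t: math.log10(n / df[t]) for t in vocab}
    weights = [{t: (1.0 if t in doc else 0.0) * idf[t] for t in vocab} for doc in docs]
    return weights, idf

docs = [
    ["sistem", "pakar", "penyakit", "tulang"],
    ["sistem", "temu", "kembali", "informasi"],
    ["pakar", "penyakit", "kulit"],
]
weights, idf = tfidf_weights(docs)
```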

E. Spell Checker
Spellcheck is a technique that identifies incorrect or misspelled words and then replaces them with the proper word combinations. There are two main steps in developing a spelling checker application, namely identification (error detection) and correction (error correction). Spelling checkers are also divided into two types: non-word error spell checkers and real-word spell checkers. A non-word error spell checker handles misspelled words caused by typing errors, while a real-word spell checker handles the substitution of valid but incorrect words in a sentence [20] [21].

F. Levenshtein Distance
Levenshtein Distance is an algorithm that measures the distance between two strings by counting the minimum number of operations needed to change one string into the other, where the operations are deletion, insertion, and substitution. Mathematically, the Levenshtein Distance between two strings a and b can be formulated as:

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1, lev_{a,b}(i, j-1) + 1, lev_{a,b}(i-1, j-1) + 1_{(a_i ≠ b_j)} ), otherwise.    (3)

In this research, the Levenshtein Distance algorithm is applied to the user's query to determine whether the typed query contains a typo; typo-indicated words are those that are not in the wordlist table [22]. The distance between each query word and the words in the wordlist table is calculated, and the word with the smallest Levenshtein Distance is chosen as the correction of the typo-indicated query word.
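A straightforward Python sketch of this dynamic-programming recurrence and of its use to correct a typo-indicated word against a wordlist is shown below; the wordlist entries are illustrative.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance counting insertions, deletions,
    and substitutions, following the recurrence in (3)."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

def correct(word, wordlist):
    """Pick the wordlist entry closest to a typo-indicated query word."""
    return min(wordlist, key=lambda candidate: levenshtein(word, candidate))

print(levenshtein("levemstein", "levenshtein"))                   # 2
print(correct("levemstein", ["levenshtein", "rocchio", "sistem"]))  # levenshtein
```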

G. Relevance Calculation with Cosine Similarity
In this stage, the relevance between the query and each document is calculated with Cosine Similarity using the TF-IDF weights obtained previously. The result is a similarity value between a document and the query: the higher the value, the more relevant the document is to the query. The similarity is computed by dividing the dot product of the document vector and the query vector by the product of the Euclidean norms of the two vectors, where the Euclidean norm is the square root of the sum of the squared term weights. The calculation is given in (4):

Cosim(d_j, q) = (d_j · q) / (||d_j|| ||q||) = Σ_i (W_ij × W_iq) / ( sqrt(Σ_i W_ij²) × sqrt(Σ_i W_iq²) )    (4)
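A compact Python sketch of this calculation, with document and query vectors represented as term-to-weight dictionaries such as the TF-IDF weights computed above:

```python
import math

def cosine_similarity(doc_vec, query_vec):
    """Cosine Similarity between a document vector and a query vector,
    both given as {term: weight} dictionaries, as in equation (4)."""
    dot = sum(weight * query_vec.get(term, 0.0) for term, weight in doc_vec.items())
    norm_doc = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_query = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0
    return dot / (norm_doc * norm_query)
```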

H. Calculation of Term Weight of Expansion Result
This stage adds a calculation based on the IDF value of each term in the retrieved documents, because a term obtained from query expansion has a lower degree of importance than a term from the original query; documents containing terms from the original query can therefore obtain higher similarity than documents containing only terms from the expansion result. This research uses the calculation shown in (5), based on [23] [24].

The calculation adds the weight defined in (5) to the Cosine Similarity (Cosim) result. For each original query term (QA) found in a document, a value of one (1) is added, so a document containing n original query terms receives n. For each expansion term (QE) found in a document with document frequency df_i greater than 10, a value of 1 - (1 / log(df_i)) is added; if df_i ≤ 10, nothing (0) is added:

Score(d_j) = Cosim(d_j, q) + Σ w(t_i), where w(t_i) = 1 if t_i ∈ QA and t_i occurs in d_j; w(t_i) = 1 - 1/log(df_i) if t_i ∈ QE, t_i occurs in d_j, and df_i > 10; w(t_i) = 0 otherwise.    (5)

An example of the calculation is as follows [25]. Query: "Sistem pakar penyakit tulang". Table II contains the expansion result obtained from tokenizing the query. Next, the terms from the title of each document that intersect with the original and expansion keywords are listed in Table IV. The intersecting terms are then weighted using equation (5); the results can be seen in Table V.
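A small Python sketch of this additive weighting is shown below; it assumes the logarithm is base 10 (consistent with the df > 10 threshold), and the function and variable names are illustrative rather than taken from the paper.

```python
import math

def expansion_bonus(doc_terms, original_terms, expansion_terms, df):
    """Additive weight from (5): +1 for each original query term (QA) found in
    the document, +1 - 1/log10(df_i) for each expansion term (QE) found in the
    document with df_i > 10, and 0 otherwise."""
    bonus = 0.0
    for term in original_terms:
        if term in doc_terms:
            bonus += 1.0
    for term in expansion_terms:
        if term in doc_terms and df.get(term, 0) > 10:
            bonus += 1.0 - 1.0 / math.log10(df[term])
    return bonus

# Final ranking score of a document:
# score = cosine_similarity(doc_vec, query_vec) + expansion_bonus(doc_terms, QA, QE, df)
```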

I. Recall, Precision and Non-Interpolated Average Precision (NIAP) Testing
The effectiveness of an information retrieval system is a measure of the system's ability to retrieve documents from the database that match the user's request. Two measures are commonly used: recall and precision. Recall and precision are computed over the whole set of retrieved documents (set-based measures), so they cannot describe the performance of an information retrieval system in terms of how relevant documents are ranked [2]. NIAP is therefore also used to check the retrieval quality of the developed software [26].
Testing is carried out 50 times, with and without spelling correction and query expansion, to measure the relevance of the document search results retrieved by the system. The formulas for recall, precision, and NIAP are shown in (6), (7), and (8):

Recall = |Ra| / |R|    (6)
Precision = |Ra| / |A|    (7)
NIAP = (1 / |R|) × Σ_{d ∈ Ra} P(rank of d)    (8)

where R is the set of relevant documents in the collection, A is the set of retrieved documents, Ra is the set of relevant documents that are retrieved, and P(rank of d) is the precision at the rank at which document d is retrieved.

The storage process stores the document data, consisting of id, title, year of publication, and author, into the document table. The document title is then extracted: its terms are used for TF-IDF weighting, and the words from the title are used to build the synonym table. Fig. 3 shows the flow of the storage process. The document search process consists of correcting the keywords of the user's query with Levenshtein Distance, expanding the query based on the synonym table, calculating relevance with Cosine Similarity, and weighting the expansion terms to add to the relevance value from the Cosine Similarity calculation. Fig. 4 shows the flow of the document search process.
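A compact Python sketch of the three measures in (6)-(8), taking the ranked list of retrieved document ids and the set of relevant document ids for one test:

```python
def recall_precision(retrieved, relevant):
    """Set-based recall (6) and precision (7) for one search result."""
    relevant_retrieved = [doc for doc in retrieved if doc in relevant]
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    return recall, precision

def niap(retrieved, relevant):
    """Non-interpolated average precision (8): the mean of the precision values
    at each rank where a relevant document appears in the ranked result list."""
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Example: ranked result list vs. the set of relevant document ids.
print(recall_precision(["d3", "d1", "d7"], {"d1", "d3", "d9"}))  # (0.666..., 0.666...)
print(niap(["d3", "d1", "d7"], {"d1", "d3", "d9"}))              # (1/1 + 2/2) / 3 = 0.666...
```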

A. Data Collection
The dataset used in this research consists of research document titles in an Excel file. The scraping results comprise 2,083 records with four attributes: ID, title, author, and year.

IV. RESULT AND DISCUSSION
Testing is performed 50 times using 2,045 document titles in the database, with different keywords in every test, both with and without spelling correction and query expansion applied to the query keywords.

A. Testing without Spelling Correction and Query Expansion
This test uses only the Cosine Similarity calculation, without spelling correction and query expansion of the keywords, to provide baseline recall and precision values for comparison with the test that uses spelling correction and query expansion. Table VII shows that several rows have keyword writing errors, which results in an Rt value of 0 and, consequently, recall, precision, and NIAP values of 0. The writing errors can be seen in the highlighted rows, such as rows 3, 4, 5, 8, 11, 17, and so on. The errors include missing letters, extra letters, and misplaced letters. This test obtains an average recall of 69.11%, precision of 69.28%, and NIAP of 60.72%.

B. Testing with Spelling Correction and Query Expansion
This test uses the Cosine Similarity calculation together with spelling correction of the keywords using the Levenshtein Distance algorithm, followed by query expansion. It obtains an average recall of 95.91%, precision of 63.82%, and NIAP of 86.29%. Table VIII shows an increase in the recall value, which is due to the correction of keywords in the tests that contain writing errors. The Rt values that were 0 in Table VII change according to the data in the database. In the corrected-keyword column, the highlighted rows contain spell-checked words that result in the appropriate words, so they can be found in the database, for example: "levemstein" becomes "levenshtein", "roshio" becomes "rocchio", "steganogarfi" becomes "steganography", and so on. Search results with keyword correction have higher average recall and precision than those without keyword correction, even though each trial uses the same keywords. The reason is that the Levenshtein Distance algorithm corrects some typo-indicated keywords correctly, so the system can find the documents that contain those keywords.

Fig. 5, Fig. 6, and Fig. 7 compare the recall, precision, and NIAP values by number of terms. Fig. 5 shows that the recall of the test with spelling correction and query expansion is higher than that of the test without them, because spelling correction allows documents to be found despite typo-indicated keywords, while query expansion finds documents whose terms have meanings similar to the user's query keywords. Fig. 6 shows that the precision of the test with spelling correction and query expansion tends to be lower than that of the test without them, because some keyword expansions are too general, so some less relevant documents are also retrieved.
The NIAP values in Fig. 7 show the advantage of the test with spelling correction and query expansion over the test without them. The reason is that query expansion retrieves relevant documents with meanings similar to the query keywords, and the ranking produced by Cosine Similarity and the expansion term weighting places them above the less relevant documents.

V. CONCLUSION
Hybrid spelling correction and query expansion of the keywords improve the relevance of document searching: spelling correction allows documents to be found despite typo-indicated keywords, while query expansion finds documents whose terms have meanings similar to the user's query keywords. The proposed method improves the average recall from 69.11% to 95.91%, although the precision decreases from 69.28% to 63.82%. However, this decrease does not affect the ranking of the relevant documents retrieved by the system, because the NIAP value increases from 60.72% to 86.29%, so the lower precision is tolerable.

VI. FUTURE WORK
This research can be further developed by optimizing the algorithm that corrects typo-indicated keywords so that it considers the relationships between the keywords in the query and not only their Levenshtein Distance. Query expansion can also be optimized when finding synonyms of the keywords, so that the expansion terms are not too broad, and a feature can be added to expand phrases, so that expansion is not limited to single terms from the user's query. In this way, the recall value can increase without lowering the precision value.