Introducing Edit-Error Weights into the Levenshtein Distance

In this paper, we present a new approach to correcting spelling errors in Arabic. The approach handles typographical errors such as insertion, deletion, and character substitution. Our method is inspired by the Levenshtein algorithm and allows a finer and better ranking of candidate corrections than Levenshtein does. The results obtained are very satisfactory and encouraging, which shows the interest of our new approach.


I. INTRODUCTION
Automatic correction of spelling errors is one of the most important areas of natural language processing; research in this area started in the 1960s [1]. Spell checking consists in finding, among the words of a lexicon, the word closest to an erroneous word; this approach is based on the similarity, or distance, between words. We are interested here in the treatment of misspellings out of context. Several studies have presented methods of automatic correction; among these works, we cite the following.

The first studies were devoted to determining the different types of elementary spelling errors, called editing operations [2]:
- Insertion: adding a character.
- Deletion: omitting a character.
- Permutation (transposition): exchanging the positions of two adjacent characters.
- Replacement: substituting one character for another.

Following the work of Damerau, Levenshtein [3] considered only three editing operations (insertion, deletion, replacement) and defined his method as an edit distance: it compares two words by counting the editing operations that transform the wrong word into the correct word. The variant extended with Damerau's permutation operation is known as the Damerau-Levenshtein distance.

Oflazer [4] proposed an approach called "Error-tolerant Recognition", based on a dictionary represented as a finite-state automaton. In this approach, correcting an erroneous word amounts to traversing the automaton-dictionary, computing for each transition a distance called the cut-off edit distance and keeping all transitions that do not exceed a maximum error threshold. Savary [5] proposed a variant of this method that dispenses with the cut-off edit distance.

Pollock and Zamora [6] defined another way to represent a spelling error by computing a so-called alpha-code (skeleton key), which requires two dictionaries: one of words and one of alpha-codes. To correct an erroneous word, its alpha-code is extracted and compared with the closest alpha-codes. This method is effective for permutation errors. Ndiaye and Faltin [7] proposed an alternative based on a modified alpha-code, defining a spelling-correction system suited to learning French that combines further techniques, such as phonetic reinterpretation, when the first method finds no solution.

Critical analyses of existing spell-checking systems, carried out by Souque [8] and Mitton [9], confirm that these systems have limitations in the solutions they propose for some types of erroneous words.
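To make these elementary operations concrete, the sketch below (ours, not drawn from the cited works) enumerates every candidate reachable from a word by a single editing operation; the Latin ALPHABET is an assumption for brevity, and an Arabic application would substitute the Arabic letters.

```python
import string

# Assumption: Latin alphabet for brevity; an Arabic setting would use Arabic letters.
ALPHABET = string.ascii_lowercase

def edits1(word: str) -> set[str]:
    """All strings reachable from `word` by one elementary editing operation."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions      = {L + R[1:] for L, R in splits if R}
    permutations   = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replacements   = {L + c + R[1:] for L, R in splits if R
                      for c in ALPHABET if c != R[0]}
    insertions     = {L + c + R for L, R in splits for c in ALPHABET}
    return deletions | permutations | replacements | insertions

print(len(edits1("form")))  # a 4-letter word has a few hundred neighbours
```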
In the work presented in this paper, we propose a new metric inspired by the Levenshtein algorithm. This approach associates with each comparison between two words a weight that is a decimal number rather than an integer. This weight allows a better ranking of the solutions proposed by the spelling-correction system.

II. LEVENSHTEIN ALGORITHM
The metric method developed by Levenshtein [3] measures the minimal number of elementary editing operations needed to transform one word into another; the requirement of minimality was later formalized by Wagner and Fischer. The calculation procedure of the Levenshtein distance between two strings $X = x_1 x_2 \dots x_m$ of length $m$ and $Y = y_1 y_2 \dots y_n$ of length $n$ consists in recursively computing the edit distance between the different substrings of $X$ and $Y$.
The edit distance between the substrings $X_1^i = x_1 x_2 \dots x_i$ and $Y_1^j = y_1 y_2 \dots y_j$ is given by the following recursive relationship:

$$
D(i,0) = i, \qquad D(0,j) = j,
$$
$$
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{(deletion of } x_i\text{)} \\
D(i,j-1) + 1 & \text{(insertion of } y_j\text{)} \\
D(i-1,j-1) + \mathbf{1}_{x_i \neq y_j} & \text{(substitution)}
\end{cases}
$$

The matrix of this recursion, computed for the erroneous word "مدسرة" and the dictionary word "مدرسة", gives a distance of 2.
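A minimal dynamic-programming implementation of this recurrence (a sketch of the textbook algorithm, not the authors' code) reproduces the distance of 2 for this pair:

```python
def levenshtein(x: str, y: str) -> int:
    """Classic edit distance over insertion, deletion, and substitution."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                    # delete the first i characters of x
    for j in range(n + 1):
        D[0][j] = j                    # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution
    return D[m][n]

print(levenshtein("مدسرة", "مدرسة"))  # -> 2 (the adjacent س and ر are both substituted)
```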
The limitation of a spelling-correction system based on this edit distance is that it cannot properly order the suggested solutions when several candidates share the same edit distance. For example, given the erroneous word "السيق" and the dictionary word "السيف", the Levenshtein method returns the same edit distance for a whole set of dictionary words.
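Reusing the levenshtein function from the sketch above, the tie is easy to reproduce; the candidate words below are illustrative dictionary words of our choosing, not the paper's test set:

```python
# Each plausible correction of "السيق" differs by a single substitution,
# so plain edit distance scores them identically and cannot rank them.
for candidate in ["السيف", "السوق", "السبق"]:
    print(candidate, levenshtein("السيق", candidate))  # every line prints 1
```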
To remedy this limitation, we propose an adaptation of the Levenshtein distance that gives a better ranking of the solutions having the same edit distance.

III. ADJUSTED LEVENSHTEIN METHOD
To remedy the ranking problem, we introduce the frequencies of the three types of editing-operation errors. We conducted a test with four experienced users who typed a set of Arabic documents, from which we computed the error frequency of each editing operation. For this, we define the following three matrices (a sketch of how they can be estimated follows the list):
- the frequency matrix of insertion errors;
- the frequency matrix of deletion errors;
- the frequency matrix of substitution (permutation) errors.
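A minimal sketch of how such matrices could be estimated from aligned (typed, intended) word pairs; the single-error alignment and the pair format are our assumptions, not the paper's exact procedure.

```python
from collections import Counter

ins_freq: Counter = Counter()  # ins_freq[c]      : character c wrongly inserted
del_freq: Counter = Counter()  # del_freq[c]      : character c wrongly omitted
sub_freq: Counter = Counter()  # sub_freq[(a, b)] : b typed instead of a

def record_error(typed: str, intended: str) -> None:
    """Tally single-operation errors between a typed word and its intended form."""
    if len(typed) == len(intended):                    # substitution(s)
        for a, b in zip(intended, typed):
            if a != b:
                sub_freq[(a, b)] += 1
    elif len(typed) == len(intended) + 1:              # one spurious insertion
        for i in range(len(typed)):
            if typed[:i] + typed[i + 1:] == intended:
                ins_freq[typed[i]] += 1
                break
    elif len(typed) + 1 == len(intended):              # one omitted character
        for i in range(len(intended)):
            if intended[:i] + intended[i + 1:] == typed:
                del_freq[intended[i]] += 1
                break

record_error("مدسرة", "مدرسة")  # tallies the two confused characters
total = sum(ins_freq.values()) + sum(del_freq.values()) + sum(sub_freq.values())
# Dividing each count by `total` yields relative frequencies in [0, 1]
# that fill the three matrices.
```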
In this context, we modify the Levenshtein distance between two words to take these three matrices into account. More formally, for two strings $X = x_1 x_2 \dots x_m$ of length $m$ and $Y = y_1 y_2 \dots y_n$ of length $n$, the measure between $X$ and $Y$ is computed in the same manner as in the Levenshtein algorithm, but introducing the error-frequency matrices. The measure $\mathcal{M}(i,j)$ is given by the recursive relationship

$$
\mathcal{M}(i,0) = \sum_{k=1}^{i} \bigl(1 - \mathcal{F}_{aj}(x_k)\bigr), \qquad
\mathcal{M}(0,j) = \sum_{k=1}^{j} \bigl(1 - \mathcal{F}_{sup}(y_k)\bigr),
$$
$$
\mathcal{M}(i,j) = \min
\begin{cases}
\mathcal{M}(i-1,j) + 1 - \mathcal{F}_{aj}(x_i) \\
\mathcal{M}(i,j-1) + 1 - \mathcal{F}_{sup}(y_j) \\
\mathcal{M}(i-1,j-1) + \bigl(1 - \mathcal{F}_{permut}(x_i/y_j)\bigr)\,\mathbf{1}_{x_i \neq y_j}
\end{cases}
$$

where:
- $\mathcal{F}_{aj}(x_i)$ is the error frequency of inserting the character $x_i$ into a word;
- $\mathcal{F}_{sup}(y_j)$ is the error frequency of deleting the character $y_j$ from a word;
- $\mathcal{F}_{permut}(x_i/y_j)$ is the error frequency of confusing the character $x_i$ with the character $y_j$.

Each elementary operation thus costs 1 minus its observed frequency: a frequent error yields a slightly smaller measure, which breaks ties between candidates sharing the same integer edit distance.
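The sketch below implements this weighted measure under the 1 − frequency reading above; the arguments f_ins, f_del and f_sub stand for lookups into the three frequency matrices, are our naming, and are assumed to return values in [0, 1):

```python
def weighted_measure(x: str, y: str, f_ins, f_del, f_sub) -> float:
    """Levenshtein recurrence with each unit cost lowered by the observed
    frequency of that error, so frequent errors yield a smaller measure."""
    m, n = len(x), len(y)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + 1 - f_ins(x[i - 1])
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + 1 - f_del(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else 1 - f_sub(x[i - 1], y[j - 1])
            M[i][j] = min(M[i - 1][j] + 1 - f_ins(x[i - 1]),  # spurious insertion
                          M[i][j - 1] + 1 - f_del(y[j - 1]),  # omitted character
                          M[i - 1][j - 1] + sub)              # substitution
    return M[m][n]

# With all frequencies at zero the measure reduces to the plain Levenshtein
# distance, matching the zero-frequency remark in the conclusion.
print(weighted_measure("مدسرة", "مدرسة",
                       lambda c: 0.0, lambda c: 0.0, lambda a, b: 0.0))  # -> 2.0
```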

IV. TESTS AND RESULTS
The statistical study we conducted determines the frequencies of the editing-operation errors (insertion, deletion, substitution). To this end, we ran a typing test of Arabic documents with a set of users: our training corpus is a set of Arabic documents typed by four expert users, from which we computed the three error matrices defined above. For the remaining 10 erroneous words, our method ranked the correct word in fourth position, whereas the edit distance ranked it 3 times in 4th, 3 times in 5th, 2 times in 8th and 2 times in 10th position, i.e. a rate of 5.26% for our method against 1.57% for Levenshtein. The table below summarizes the results obtained.

V. CONCLUSION
In conclusion, we note the interest of our method for the ranking of correct words in the first and second positions; the weaker results in the third and fourth positions can be explained by the unavailability of frequencies (zero frequency) for some characters of the Arabic alphabet during our test.