Unsupervised Morphological Relatedness

The assessment of similarity between texts has been studied for decades from different perspectives and for several purposes. One interesting perspective is morphology. This article reports the results of a study on the assessment of the morphological relatedness between natural language words. The main idea is to adapt a formal string alignment algorithm, namely the Needleman-Wunsch algorithm, to accommodate the statistical characteristics of the words in order to approximate how similar the linguistic morphologies of two words are. The approach is unsupervised from end to end, and the experiments show an nDCG reaching 87% and an R-precision reaching 81%.

Keywords: Arabic Language; Computational Linguistics; Morphological Relatedness; Semitic Morphology; Unsupervised Learning


I. INTRODUCTION
Expanding a query word to its variants is one of the challenges an Information Retrieval (IR) system faces in order to achieve a decent recall. Take the Arabic word " " ([kitaAb]: book). An IR system seeking information related to this word in a collection of documents should pick all the documents in which the word itself or any of its variants occurs, such as " " ([kaAtib]: writer), " " ([kutub]: books), " " ([kutayib]: small book), etc.
The purpose is to also capture the documents in which words with meanings close to the query words occur. One way to do this is to exploit the fact that two words derived from the same morphological origin are likely to share the same broad meaning.
Experience shows that such a technique depends on the type of the language morphology [2], [3]. In languages like English, a word is generally a concatenation of prefixes, a stem and suffixes. For instance, the word "unbreakable" is composed of "un", "break" and "able". While the principle of decomposition can be applied to Arabic, words in Semitic languages such as Arabic are actually derived by combining two entities, each of which might be regarded as an origin: a root and a pattern [4]. For instance, the word " " ([kaAtib]: writer) is coined by combining the root " " ([k t b]) and the pattern " ".
This means that the normal form to which an IR system reduces an Arabic query word may be one of three different types, each expressing a different level of similarity.
• Stem: The extraction of the stem is simply the elimination of the prefixes and suffixes.
• Root: Specific to the Semitic languages, its extraction is more complicated than the extraction of the stem as it tries to identify the three, four or five core letters among all the letters of the word [5]. The words derived from the same root have a common meaning broader than the one shared by words having a common stem.
• Pattern: The extraction of the pattern is the identification of the non-core letters and their positions among the core ones. The authors are not aware of any IR system that makes use of the pattern as the normal form of words.
For instance, the word " " ([alkaAtib]: the writer) might be reduced to its stem " ", to its root " " ([k t b]) or to the pattern " ".
While the trend is to reduce the query words to the normal form then to match them against the stored normal forms, another approach [6] is to redesign the matching itself in such a way that it identifies words morphologically close to the query word by measuring the Morphological Relatedness (MR).
The present work is an attempt to enhance this approach [6] in order to improve the effectiveness of morphology-aware matching. Three major changes are introduced to the computation of the MR:
1. The words are first processed by an unsupervised morpho-segmenter which tries to remove the prefixes and suffixes.
2. The frequency is involved earlier in the computation of the Longest Common Subsequence (LCS). An alignment algorithm is adapted to take into account the frequencies of the letters in the computation of the cost.
3. The comparison is extended to n-grams.
Section II reviews the principle of string alignment that will be used to calculate the MR [7]. Section III overviews works related to the idea of computing the similarity among natural words. Section IV introduces the proposed approach. Section V details the tests and discusses the results.

II. SEQUENCE ALIGNMENT
Sequence Alignment (SA) is the process of identifying the minimal number of edit operations required to transform one string of characters into another [8], [9]. In an edit operation, a character may undergo one of the following changes:
• Indel: The character is simply deleted or a new character is inserted.
• Substitution: The character is substituted by another.
Other operations might be defined on the basis of these. This section is limited to the simplest definitions of the underlying concepts. Each edit operation is assigned a cost.
Given two strings (words), the objective is first to calculate the minimal cost of the edit operations required to transform the first string into the second one, then to identify what those edit operations are and where they occur. The algorithm this paper focuses on is due to Needleman and Wunsch [7]. Algorithm 1 depicts the steps to align two words A and B, where $cost_s$ denotes the cost of a substitution, and $cost_{gd}$ and $cost_{gi}$ are the gap penalties applied when a letter is aligned with a null (deletion and insertion, respectively). First, a two-dimensional matrix M of size $(n+1) \times (m+1)$ is built, where n is the size of A and m is the size of B; the rows are labeled with the letters of A and the columns with the letters of B. The extra row and column at index zero are added to deal with the empty string. Second, all cells are filled with the similarity values, row by row from the top and from left to right within each row, down to the bottom-right cell. Each cell in this matrix holds the similarity between the two substrings whose ends intersect at that cell; that is, the cell $M[i, j]$ holds the similarity between the substrings $a_1 \ldots a_i$ and $b_1 \ldots b_j$.
Example: Table 1 shows an example of applying Algorithm 1 to find the similarity between two words, A = "winter" and B = "write". The cost $cost_s$ is set to 1 when the two letters match, and all the other costs are set to -1. For instance, the value in the cell $M[4, 4]$ indicates that the similarity between "wint" and "writ" is 1.
Likewise, the value in the last cell $M[6, 5]$ means that the similarity between the words "winter" and "write" is 1.
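The matrix filling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the listing of Algorithm 1 itself: the function name `needleman_wunsch` and its parameters are assumptions made for the example, and a single gap penalty is used for both deletion and insertion, as in the "winter"/"write" example.

```python
from typing import List


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> List[List[int]]:
    """Fill the (len(a)+1) x (len(b)+1) similarity matrix (hypothetical helper)."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Row and column zero align a substring with the empty string.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # match or substitution
                          M[i - 1][j] + gap,     # deletion of a[i-1]
                          M[i][j - 1] + gap)     # insertion of b[j-1]
    return M


matrix = needleman_wunsch("winter", "write")
print(matrix[4][4], matrix[6][5])  # similarity of "wint"/"writ" and of the full words
```

With the costs of the example (+1 for a match, -1 otherwise), the sketch yields the two values quoted in the text for the cells [4, 4] and [6, 5].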

III. RELATED WORKS
Besides [6], the authors are not aware of any published work on the concept of the MR. This section overviews a number of approaches that make use of the concept of edit distance in the context of Natural Language Processing (NLP).
Ghafour et al. [10] suggest adapting the Levenshtein distance [11] to the comparison of Arabic words. The cost of an operation captures three levels of similarity: phonetic, character-form and keyboard-wise similarities. Gomaa & Fahmy [12] proposed a system to automatically grade answers to an essay question. They tested different similarity measures, trying to maximize the correlation between the grades given by the proposed system and those given by human experts. The Needleman-Wunsch similarity [7] is one of the measures they tested; it achieved a correlation score of 26.5%.
Mustafa & Al-Radaideh [13], who investigated n-gram-based comparison, claim that the bigram-based comparison is more effective than the trigram-based comparison, and that the pure n-gram technique alone does not perform as well with Arabic words as it does with English words. In [14], Mustafa suggests extending the comparison to non-contiguous letters. For instance, the n-grams of the word $W = w_1 w_2 \ldots w_n$ might be $\{w_1 w_2, w_1 w_3, \ldots, w_{n-2} w_n, w_{n-1} w_n\}$. Tested on 160,000 words, the author claims that this approach outperforms the classical one when using a rule-based stemming. The approach meets the balancing point of recall and precision at around 40%.
Reference [15] opted for a rule-based approach to match Arabic words. It identifies the common letters between two words, compares their order and checks whether the uncommon letters are valid affixes. If these two conditions hold, a match is raised. This approach uses a predefined list of affixes. The authors tested 1,500 distinct words and claim to have achieved a 15% error rate at a 13% missing rate. The error rate measures how many erroneous hits are found among all the relevant variants, while the missing rate measures how many relevant variants are missing among all the actual relevant variants in the dataset.

IV. A TWO-STEP MORPHOLOGICAL RELATEDNESS
To the best of the authors' knowledge, the concept of MR was introduced by Khorsi [6], who tried to replace the classical normalize-then-match approach used in the matching process of IR systems with a direct comparison. This comparison takes into account both the core letters, which are intended to carry the core meaning of the word, and the non-core ones, which are meant to carry the variation in the meaning. Basically, two challenges had to be addressed:
1. How to distinguish the core letters from the non-core ones.
2. How to model the matching and mismatching, either within the core letters or within the non-core letters.
As the core letters in a word might not be contiguous, the matching made use of the computation of the Longest Common Subsequence (LCS) [16]. As its name suggests, the LCS is the longest sequence of letters, contiguous or not, that the two words share in the same order. As the LCS does not guarantee that the common letters are all core ones, the formula to calculate the MR tries to exploit the fact that the non-core letters are usually more frequently used than the core ones. The words in a collection of documents found to have the highest MRs with the word at hand were considered the most morphologically related to it and the most likely to carry a meaning very close to its meaning, as shown in Algorithm 2. The MR measure of [6] is expressed in terms of the following quantities: $w_1$ and $w_2$ are the two strings to be compared, $|w_1|$ is the length of $w_1$ and $w_1[i]$ is the $i$-th letter of $w_1$; $LCS_{(w_1,w_2)}$ is the LCS of $w_1$ and $w_2$, and $\overline{LCS}_{(w_1,w_2)}$ is its complement (i.e., it contains all the letters that are not included in $LCS_{(w_1,w_2)}$); $freq(a)$ is the frequency (count) of the letter a in a corpus.
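The LCS extraction that the original approach relies on can be illustrated with the classical dynamic-programming formulation. The sketch below is only an illustration of the LCS itself, not of the MR formula of [6]; the function name `longest_common_subsequence` is an assumption made for the example, and when several subsequences of maximal length exist it returns just one of them.

```python
def longest_common_subsequence(w1: str, w2: str) -> str:
    """Return one longest subsequence shared, in order, by w1 and w2."""
    n, m = len(w1), len(w2)
    # L[i][j] is the LCS length of w1[:i] and w2[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if w1[i - 1] == w2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Walk back from the bottom-right cell to recover the letters.
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if w1[i - 1] == w2[j - 1]:
            letters.append(w1[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(longest_common_subsequence("winter", "write"))
```

Note that the letters of the result need not be contiguous in either word, which is the property that makes the LCS a candidate for capturing scattered core letters.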
Tested on more than 200,000 words, this simple approach achieved 82% nDCG and 78% R-precision when the five highest MRs are picked.
Based on the analysis of the results and the shortcomings identified in the original work [6], the present work introduces three major changes to the computation of the MR:
1. Stemming: To avoid the interference of prefix and suffix letters in the computation of the MR, an unsupervised morphological segmentation is applied beforehand to extract the stems on which the actual computation of the MR is applied.
2. Alignment cost: The LCS extraction in the original approach does not make any distinction between letters. In an attempt to involve the frequency factor early in the process, the cost model of an alignment is adapted to accommodate the frequencies.
3. N-grams: The investigation is extended to the effect of making the comparison unit n-grams of letters rather than single letters.
The first step relies on a morphological segmentation of words, which is also suggested in a separate work.The following paragraphs summarize its main traits.

A. Morphological Segmentation
The objective of this section is to take a quick look at the step introduced before the actual computation of the MR. This step aims at identifying the prefix, stem and suffix of a word, as experience has shown that an unsupervised morphological segmentation is feasible and can reach an acceptable performance [17]-[28]. The following paragraphs describe how the unsupervised learning and segmentation of natural words are approached.
Take the word "unbreakable", which is formed by concatenating the prefix "un", the stem "break" and the suffix "able". If this segmentation is correct, the vocabulary should contain other words with different combinations of prefixes, stems and suffixes (e.g., "rebreakable", "unbreaking", etc.), which makes the occurrence of "un" only weakly dependent on the occurrence of "break", whose occurrence is in turn relatively independent of the occurrence of the suffix "able". On the other hand, it is obvious, but worth mentioning, that a morpheme cannot be separated from its first letter or from its last one. The proposed approach is all about exploiting this fact: a morpheme depends only on the letters of which it is formed. To address the challenge of how to assess the dependence among the letters of a word, probabilistic dependence was employed [29].
1) Segmentation Algorithm
Algorithm 3 iterates over the word, letter by letter, and, for every position, computes two dependencies: 1. the dependency of the prefix on its last letter; 2. the dependency of the suffix on its first letter, where the prefix ends (inclusive) at the current letter and the suffix starts (inclusive) at it. The difference between the two values then points to which of the two fragments (i.e., the prefix or the suffix) is more attached to the current letter. The algorithm keeps going until it encounters a change in the direction of the highest dependency. If, in the immediately preceding position, the prefix depends on the letter more than the suffix does and, in the current position, the suffix depends on the letter more than the prefix does, a cutting point is marked between the previous and the current letter.
2) Computation of the Dependence
By definition, the concept of dependence that is used is symmetric [29]. In this context, "the string depends on the letter" means "the letter depends on the string" and vice versa.
The dependency of a letter a on the prefix α will be called the forward dependency, and it is denoted fdep(u), where u = αa is the prefix appended with the letter under consideration a. The dependency of the letter a on the suffix β will be called the backward dependency, and it is denoted bdep(v), where v = aβ is the suffix headed by the letter under consideration a. The beginning and the end of a word are marked by # and $, respectively. The forward and the backward dependencies are computed from the conditional probabilities P(α → a) and P(a → β) and from the probability P(a) of the letter a, which is approximated by its normalized frequency in a corpus. In these definitions, A is the alphabet and Count(α) expresses how often an n-gram α occurs in the corpus.
Table 2 is a simulation of Algorithm 3 on the word "unbreakable". The second letter "n" depends on the prefix "#u" more than it does on the suffix "breakable$", whereas the third letter "b" depends on the suffix "reakable$" more than it does on the prefix "#un". This change in the direction of the dependence makes the point "un|breakable" a cutting point. The same logic applies to the seventh letter "k" and the eighth letter "a".
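The cutting-point rule described above (a cut is marked when the strongest attachment switches from the prefix to the suffix) can be sketched as follows. This is an illustrative sketch under explicit assumptions: the dependency functions are passed in as parameters because their formulas are not reproduced here, the names `cut_points`, `split_at`, `toy_fdep` and `toy_bdep` are hypothetical, the toy dependency values are hand-picked to mimic the "unbreakable" simulation rather than learned from a corpus, and ties are arbitrarily treated as "the suffix wins".

```python
from typing import Callable, List


def cut_points(word: str,
               fdep: Callable[[str], float],
               bdep: Callable[[str], float]) -> List[int]:
    """Return indices i such that a cut falls between word[i-1] and word[i]."""
    marked = "#" + word + "$"              # word boundaries, as in the text
    cuts: List[int] = []
    prefix_won_before = False
    for pos in range(1, len(marked) - 1):  # positions of the real letters
        prefix = marked[:pos + 1]          # ends (inclusive) at the current letter
        suffix = marked[pos:]              # starts (inclusive) at the current letter
        prefix_wins = fdep(prefix) > bdep(suffix)
        if prefix_won_before and not prefix_wins:
            cuts.append(pos - 1)           # index of the current letter in `word`
        prefix_won_before = prefix_wins
    return cuts


def split_at(word: str, cuts: List[int]) -> List[str]:
    """Split `word` at the given cut indices."""
    bounds = [0] + list(cuts) + [len(word)]
    return [word[s:e] for s, e in zip(bounds, bounds[1:])]


# Toy (hypothetical) dependencies: letters at these indices of "unbreakable"
# are declared more attached to their prefix than to their suffix.
PREFIX_ATTACHED = {0, 1, 4, 5, 6, 9, 10}


def toy_fdep(u: str) -> float:
    return 1.0 if len(u) - 2 in PREFIX_ATTACHED else 0.0


def toy_bdep(v: str) -> float:
    return 0.5


cuts = cut_points("unbreakable", toy_fdep, toy_bdep)
print(split_at("unbreakable", cuts))
```

Keeping the dependency measure as a parameter makes it possible to plug in any estimate of fdep and bdep computed from corpus counts without touching the cutting logic.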

B. Morphological Relatedness
An MR assessment is expected to fulfil two assumptions.
• The longer the shared sequences are, the higher the relatedness should be.
• Words sharing core letters are much more related to each other than are words sharing only non-core letters.
The following shows that an adaptation of a string alignment algorithm might be the answer.
1) Relatedness Algorithm
Algorithm 4 is an adaptation of Algorithm 1, where a word A contains α n-grams and a word B contains β n-grams. Instead of a cost, the term gain will be used, which fits well with the aforementioned assumptions.
2) Computation of the Relatedness
The gain of each operation is computed from the inverse of the normalized frequency of the letter involved: it is positive for a match and negative for a substitution, a deletion or an insertion. In the normalization, "." denotes any letter of the alphabet.
Example: The MR between the two words A = " " ([kitaAb]: book) and B = " " ([kaAtib]: writer) is calculated as shown in Table 3, where the frequencies used in this example are shown in Table 4. To fill each cell of the matrix, the value of each of the three adjacent cells is added to the gain of the underlying operation, and the maximum of the three sums is picked. The three adjacent cells are the upper-left one ($M[i-1, j-1]$), the upper one ($M[i-1, j]$) and the left one ($M[i, j-1]$). The resulting value at the last cell $M[4, 4]$ indicates the MR between the two words " " and " ".
The value in the cell $M[2, 2]$ is the relatedness between the two substrings " " and " ". It corresponds to a substitution of the letter " " ([A]: 1st Arabic letter) with " " ([t]: 3rd Arabic letter). The gain $gain_{subs}$(" ", " ") used to evaluate $M[2, 2]$ is based on the inverse of the frequency of the letter " ", which is higher than the inverse of the frequency of the letter " ". The gain $gain_{del}$(" ") uses the frequency of the deleted letter " ". The last gain, $gain_{ins}$(" "), is based on the inverse of the frequency of the inserted letter " ". The value of the cell $M[2, 2]$ is calculated as follows:
$M[2, 2] = \max(M[1, 1] + gain_{subs}(\text{" "}, \text{" "}),\ M[1, 2] + gain_{del}(\text{" "}),\ M[2, 1] + gain_{ins}(\text{" "})) = \max(38.45 - 15.27,\ 30.14 - 15.27,\ 23.17 - 8.31) \approx 23.17$
The cell $M[2, 3]$ corresponds to a match: the letter " " ([t]: 3rd Arabic letter) appears both at position 2 of A and at position 3 of B. The gain of the match is therefore positive, and the inverse of the frequency of the matched letter is used for the three operations (match, deletion and insertion). $M[2, 3]$ is calculated as follows:
$M[2, 3] = \max(M[1, 2] + gain_{match}(\text{" "}),\ M[1, 3] + gain_{del}(\text{" "}),\ M[2, 2] + gain_{ins}(\text{" "})) = \max(30.14 + 15.27,\ 14.86 - 15.27,\ 23.17 - 15.27) = 45.41$
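The frequency-weighted filling of the matrix can be sketched as below. This is a hedged illustration, not the listing of Algorithm 4: the function name `relatedness_matrix` is hypothetical, row and column zero are assumed to accumulate deletion and insertion gains (the text does not state how they are initialized), and the words are written with transliterated letters. The inverse frequencies 38.45, 15.27 and 8.31 are taken from the worked example and assigned to k, t and A so as to be consistent with it, while the value for b is made up for the illustration.

```python
from typing import List, Mapping, Sequence


def relatedness_matrix(a: Sequence[str], b: Sequence[str],
                       freq: Mapping[str, float]) -> List[List[float]]:
    """Fill the alignment matrix using gains based on inverse frequencies."""
    def inv(unit: str) -> float:
        return 1.0 / freq[unit]

    n, m = len(a), len(b)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumption: the borders accumulate deletion / insertion gains.
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - inv(a[i - 1])
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - inv(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                diagonal = inv(x)                 # gain_match
            else:
                diagonal = -max(inv(x), inv(y))   # gain_subs: least frequent unit
            M[i][j] = max(M[i - 1][j - 1] + diagonal,
                          M[i - 1][j] - inv(x),   # gain_del of a's unit
                          M[i][j - 1] - inv(y))   # gain_ins of b's unit
    return M


# Normalized frequencies chosen so that 1/freq matches the worked example;
# the entry for "b" is hypothetical.
freq = {"k": 1 / 38.45, "t": 1 / 15.27, "A": 1 / 8.31, "b": 1 / 20.0}
M = relatedness_matrix("ktAb", "kAtb", freq)
print(round(M[2][2], 2), round(M[2][3], 2))
```

Because the units are taken from generic sequences, the same function can be fed lists of n-grams instead of strings of single letters, provided `freq` holds the n-gram frequencies.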

V. TESTS AND RESULTS
This section first describes the test settings, then discusses the results.

A. Test Dataset
The morphological segmentation is fed with a corpus of plain classical Arabic texts. It contains around 1M distinct words (122M in total) with an average size of 6.22 letters per word. The morphological segmentation step produced 596,356 distinct stems (1,116,919 in total) with an average size of 5.94 letters per stem. The resulting stems were then used to feed the computation of the morphological relatedness.

B. Performance Metrics
1) Morphological Segmentation
Three samples of 100 words each are randomly picked out of the whole segmented corpus. The results of the segmentation are evaluated manually using three metrics computed on the cutting points: recall, precision and F-measure.
• Recall measures how many correct points are found among all the existing correct points.
$Recall = \frac{\text{correct points in the result}}{\text{all correct points in the dataset}}$
• Precision measures how many points are actually correct among all the points the algorithm found.
$Precision = \frac{\text{correct points in the result}}{\text{all found points in the result}}$
• F-measure measures the (harmonic) average of the recall and the precision.
$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
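The three metrics can be computed directly from the sets of cutting points. The sketch below is illustrative: the function name `cut_point_scores` and the representation of a cutting point as a (word, position) pair are assumptions made for the example, the F-measure is taken to be the usual harmonic mean, and the sample points are invented.

```python
from typing import Set, Tuple

CutPoint = Tuple[str, int]  # (word, index of the cut inside the word)


def cut_point_scores(found: Set[CutPoint],
                     gold: Set[CutPoint]) -> Tuple[float, float, float]:
    """Return (recall, precision, F-measure) of the found cutting points."""
    correct = len(found & gold)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(found) if found else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


# Invented example: three correct points exist, two were found, one is right.
gold = {("unbreakable", 2), ("unbreakable", 7), ("books", 4)}
found = {("unbreakable", 2), ("books", 3)}
recall, precision, f_measure = cut_point_scores(found, gold)
print(recall, precision, f_measure)
```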
2) MR Computation Evaluation
The MR computation is run over the whole set of stems resulting from the previous phase. For every stem, the ten other stems with the highest MRs are picked. Two metrics are then used to assess the performance of the MR computation.
• Normalized Discounted Cumulative Gain (nDCG) measures how many related stems are returned in the results and how well they are ordered, relative to all the existing related stems. The higher the nDCG, the more related and the better ordered the results are. A related stem is a stem which is derived from the same root. For this purpose, the binary value $rel(s, s_i)$ is defined: it is set to 1 when $s$ and $s_i$ are derived from the same root (relevant to each other) [6] and to 0 otherwise.
• R-precision measures the proportion of related stems among the stems that appear in the first k positions of the returned results. The higher the R-precision, the better the performance is. This study sets k to 10.
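Both metrics can be sketched from a list of binary relevance values, one per returned stem in ranked order. The sketch assumes the common formulation of the DCG with a base-2 logarithmic discount and an ideal ordering built from the returned list only; these choices, like the function names `dcg`, `ndcg` and `precision_at_k`, are assumptions made for illustration.

```python
import math
from typing import Sequence


def dcg(rels: Sequence[int]) -> float:
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels, start=1))


def ndcg(rels: Sequence[int]) -> float:
    """DCG normalized by the DCG of the best possible ordering of `rels`."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


def precision_at_k(rels: Sequence[int], k: int = 10) -> float:
    """Proportion of related stems among the first k returned stems."""
    return sum(rels[:k]) / k


# rel(s, s_i) for ten returned stems: 1 means "derived from the same root".
ranked = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
print(ndcg(ranked), precision_at_k(ranked))
```

A list in which all related stems come first gets an nDCG of 1, and pushing a related stem down the ranking lowers the score even though the precision over the top k stays the same.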

C. N-grams vs Letters
The generalization of the approach that makes the unit of comparison n-grams rather than letters is investigated. Three cases are tested: unigram (one letter, as in the original version), bigram (two letters) and trigram (three letters).
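As an illustration of the three units of comparison, the sketch below splits a transliterated stem into overlapping, contiguous n-grams. Whether the n-grams overlap is not stated in the text, so the overlapping window is an assumption, as is the function name `ngrams`.

```python
from typing import List


def ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping contiguous n-grams of `word` (empty if too short)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in (1, 2, 3):
    print(n, ngrams("kaAtib", n))
```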

D. Noise
Typos are known to be a source of errors and a single kind of typo may influence the performance of the whole system.The study seeks to go deeper and investigate the extent of the influence of the misspellings on the performance of the MR computation.The study first runs the computation on a test set without any filtration (normal), then runs it on a test set with simple normalization rules that eliminate the effect of a number of common mistakes (Typos-free): 1. " " ([´]) confused with " " ([A]).
• Nonempty affix: Samples are picked randomly only among the words for which the morphological segmentation has identified at least one cutting point. For a number of words, the segmentation simply did not identify any cutting point. In most of these cases, this was because of the scarcity of the stem or of the stem-affix combination. The study then tries to assess the impact of such cases on the performance of the segmentation and on the accuracy of the identified cutting points.
• Thresholded: An additional filter is applied to the nonempty affix sample, where a segmentation is picked only if the affix reappears in more than 1,000 other segmentations.
The results of the raw sample are the lowest in Table 5. The two obvious causes might be the irregularities in Arabic morphology and the typos in the test set. The latter is confirmed by the results obtained when the sample is restricted to the words with a relatively high frequency (thresholded), as shown in Fig. 1.
Another factor that affects the performance is the limited coverage of the dataset. The proposed approach is entirely unsupervised and depends only on the counts of words in the corpus. However, it is difficult to include all possible derivatives in the corpus. For example, the derivative " " ([Al•<ij•HaAfaAt]: the prejudices) appears only once in the dataset; that is, no other derivative appears in the dataset without the prefix " ", for instance. As a result, the word " " appears strongly dependent on the prefix " ", and the proposed algorithm does not learn that the prefix " " can be cut off from the word " ". This problem would disappear if the word " " were added to the dataset: the approach would then find the segmentation position " "|" ".
Running the morphological segmentation over the whole corpus resulted in 1,805,231 morphemes, 596,358 of them distinct, with an average size of 5.94 letters per morpheme.
It is worth noting that the objective of the study is to address classical Arabic (CA) words. To the best of the authors' knowledge, there is no suitable gold standard for CA. The study therefore had to build its own set of words and proceed to the manual segmentation of three different randomly picked samples for each setting.
2) Morphological Relatedness
Obviously, checking the whole test set manually is impractical. Thus, the study opts for a sampling approach to approximate the performance. Three samples of 100 words each are randomly picked out of the whole corpus. Along with the nDCG and the R-precision computed for every sample, the average and the standard deviation are recorded. This process is repeated for the different settings explained earlier.
The values in Table 6 and Fig. 2 clearly indicate a high performance of around 80% in all cases. The highest value of nDCG occurs with bigrams in the typos-free case. This confirms the utility of such simple typo handling. It also suggests that the performance might be enhanced further if a heavier typo filtering is applied.
The combination of letters in bigrams and trigrams shows positive and negative effects. The positive effect is the lowering of the frequencies, which increases the influence of the core (root) letters and widens the difference between the frequencies of the n-grams formed of core letters and the frequencies of the n-grams formed of non-core letters. A good distinction is then made between the two classes of letters. The negative effect occurs when the combination becomes less common and fails to capture the distinct classes. Instead, it may mix up core and non-core letters. The results show that the bigram is a good compromise. The standard deviation is a good indication that the results are reliable.
The changes introduced to the original version [6] were fruitful, and the performance increased considerably. It is worth mentioning that the evaluation reported in [6] was on the five top results. The proposed approach is evaluated on the top ten results, and the values are still higher. Of course, one of the factors that boosted the performance is the stemming. However, given that the stemming was also totally unsupervised, this is further evidence that an end-to-end unsupervised method can handle a morphologically complex language such as Arabic.

VI. CONCLUSION AND FUTURE WORK
The concept of Morphological Relatedness seems promising in the area of the unsupervised processing of languages. Moreover, it handles a complex language such as classical Arabic well. The purpose of the work reported in this article was to enhance the computation of the MR originally suggested in [6] without falling into the trap of human supervision. The study overcomes the problem of the long prefixes and suffixes by introducing an unsupervised morpho-segmentation. The study also handles the unclear boundaries between the frequencies by extending the comparison to bigrams.
The study commits itself to keeping all processing independent of human intervention and as generic as possible. The open issues are diverse, and the directions to explore are numerous. A few of them are listed below:
• Does the morphological relatedness perform well when generalized to upper levels, such as the morphology of partially structured texts?
• Can the computation of the MR be sped up by using an index?

Algorithm 1: Needleman-Wunsch similarity
Input: Two strings $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$
Output: The Needleman-Wunsch similarity between the two strings A and B
Each cell $M[i, j]$ holds the similarity between the substrings $a_1 \ldots a_i$ and $b_1 \ldots b_j$, so the last cell $M[n, m]$ holds the similarity between the strings A and B.

Algorithm 2: Top five morphological relatedness
Input: A word w
Output: The five words that have the highest MRs with w
foreach word $w_i$ in a corpus do
    $mr_i \leftarrow MR_{(w, w_i)}$
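The selection step of Algorithm 2 can be sketched with a bounded selection over the corpus. The sketch is only illustrative: the function name `top_related` is hypothetical, the relatedness measure is passed in as a parameter, and the toy measure used in the example (number of shared distinct letters) merely stands in for the MR, as do the transliterated words.

```python
import heapq
from typing import Callable, Iterable, List


def top_related(word: str, corpus: Iterable[str],
                mr: Callable[[str, str], float], k: int = 5) -> List[str]:
    """Return the k corpus words with the highest relatedness to `word`."""
    return heapq.nlargest(k, corpus, key=lambda other: mr(word, other))


def toy_mr(w1: str, w2: str) -> float:
    """Stand-in measure: number of distinct letters the two words share."""
    return float(len(set(w1) & set(w2)))


corpus = ["ktAb", "qlm", "mktb", "byt", "drs", "kAtb", "ktyb"]
top = top_related("ktb", corpus, toy_mr)
print(top)
```

Using `heapq.nlargest` avoids sorting the whole corpus when only a handful of candidates is needed.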

1. $gain_{match}(a) = +\frac{1}{freq(a)}$: when two letters match.
2. $gain_{subs}(a, b) = -\frac{1}{freq(a)}$: in case of substitution, where the letter a is the one with the lowest frequency.
3. $gain_{del}(a) = -\frac{1}{freq(a)}$: in case of deletion, where the letter a is the deleted letter.
4. $gain_{ins}(a) = -\frac{1}{freq(a)}$: in case of insertion, where the letter a is the inserted letter.
where $freq(a)$ is the normalized frequency of the letter a: $freq(a) = \frac{Count(a)}{Count(.)}$

Figure 1: Results of morphological segmentation

Table 2: Example of morphological segmentation

Table 3: Example of the computation of morphological relatedness

Table 4: A sample of frequencies of Arabic letters

Table 5: Morphological segmentation results