Hindustani or Hindi vs. Urdu: A Computational Approach for the Exploration of Similarities Under Phonetic Aspects

Semantic coexistence is one reason people adopt the language spoken by others. In shared human habitats, languages exchange loan words, which not only enrich vocabulary but also let languages influence each other, build stronger relationships, and foster multilingualism. In this context, the spoken words are usually common while the writing scripts differ, or the language may become digraphic. In this paper, we present the similarities and relatedness between Hindi and Urdu, two mutually intelligible major languages of the Indian sub-continent. The method modifies edit distance: instead of comparing the letters of words, it compares articulatory features from the International Phonetic Alphabet (IPA) to obtain a phonetic edit distance. The paper reports results for consonants under this method, which quantify the evidence that Urdu and Hindi are 67.8% similar on average despite their script differences.

Keywords: Lexical Similarity; Urdu; Hindi; Edit Distance; Phonetics; Natural Language Processing; Computational Linguistics


I. INTRODUCTION
Hundreds of languages are spoken across the Indian sub-continent, most of which belong to the Dravidian and Aryan families. Linguists generally accept that the Aryan family of languages evolved from Sanskrit [1]. Historically, during the medieval period of India, Sanskrit was the language of rulers and of the upper class; this period also witnessed the Prakrits and other languages derived from Sanskrit [2]. Later, Persian ruled the Indian courts, and near the end of the Mughal era Urdu eventually became the official language of the court [2], [3]. Many researchers argue that Urdu and Hindi are the same language because they share the same grammar and a large common vocabulary, while many others dispute this. The debate turns on the origin and development of Urdu. A common account holds that Urdu is a creole that arose from contact between local Indian people and foreign invaders of different backgrounds and ethnicities [4], and it is hence often referred to as the 'camp language'. In contrast, the veteran Urdu lexicographer Parekh [5] rejects the theories that describe Urdu as a creole and maintains that it is the 'Khariboli' of the central zone of India: when its vocabulary draws on Persian and Arabic it becomes Urdu, and when it draws on Sanskrit it becomes Hindi. Urdu and Hindi are mutually intelligible; however, they are the victims of a language split that led to their being written in a modified Perso-Arabic script called 'Nastaliq' and in the Devanagari script, respectively.
Among the many characteristics of these scripts, two are salient: Nastaliq is cursive and written right to left, whereas Devanagari (used for Hindi) is written in block letters from left to right. Figure 1 depicts the languages of the world and their respective sizes (shown by the spread of the leaves); Urdu and Hindi appear at the top left, in the Indo-European→Indo-Iranian→Indo-Aryan branch. In the same context, the two languages can today be grouped under the common term 'Hindustani' and also recognized as separate Persianised and Sanskritised registers of the Hindustani language.
Before proceeding, another development is noteworthy. Since the partition of India in 1947, the respective governments and the print and electronic media of the two countries (i.e., Pakistan and India) have made an official effort to induct Persian and Arabic words into Urdu and Sanskrit words into Hindi. Thus, although these languages share a vast vocabulary and morphological structure, their speakers attempt to distinguish them through loan words borrowed from the source languages mentioned earlier. Consequently, young people today may find it very difficult to comprehend news bulletins delivered in pure official Urdu or Hindi. Movies and other entertainment channels, however, remain a principal medium of vocabulary enhancement.
With more than 329.1 million native speakers around the world [2], and being victims of digraphia, Hindi and Urdu pose several challenges. The main challenges and points of investigation regarding the similarities between the two languages that this paper considers are discussed in the following paragraphs.
The difference in script affects not only reading but also pronunciation. Core Hindi speakers tend not to differentiate the Perso-Arabic letters 'ج', 'ذ', 'ز', 'ض', and 'ظ', pronouncing all of them as 'ज'. A diacritic dot, the bindi, is very often added to transform 'ज' into 'ज़' in order to differentiate 'ض' and 'ظ' from the rest of these Urdu letters. Similarly, for the Urdu letters 'گ' and 'غ', Hindi uses the single letter 'ग' and adds a bindi ('ग़') to stand for 'غ'; in the same fashion, it uses 'क' and 'क़' for 'ک' and 'ق', respectively. Conversely, the two Urdu letters 'ا' and 'ع' correspond to three Hindi letters, 'अ', 'इ', and 'उ'. For 'ح', 'ھ', and 'ہ' Hindi has only one letter, 'ह'; for 'ث', 'س', and 'ص' it has only 'स'; and for 'ے' and 'ی' it has only 'य'. All of these letter mappings are shown in Figure 2. With such many-to-many mappings between the letters of the two scripts, severe semantic mistakes are easy to anticipate. For example, the Urdu word 'ذلیل' [ða.liːl] (humiliated) may be pronounced by a Hindi speaker as 'जलील' [dʒaˈliːl] (exalted, magnificent). Similarly, Urdu writes some compound words with an internal space, so a single word consists of multiple tokens; for example, 'ان پڑھ' [ən‿pəɽʱ] (illiterate) is 'ان' + 'پڑھ'. Hindi has no such requirement of whitespace between the tokens, so the same word is rendered 'अनपढ़' (pronounced with the same IPA and carrying the same meaning).
Thus, to measure the similarity between the two languages, every cognate must be transformed into a common scheme. Romanized transliteration is the popular choice, but it suffers from the same many-to-many letter mapping; for example, 'ث', 'س', and 'ص' all have a single substitute in the Latin script, 'S' [6]. The alternative, used in this paper, is to transform words into IPA. In addition, this paper presents a modified version of the conventional edit distance, namely the 'Phonetic Edit Distance' (PED), which employs the articulatory features of the IPA. This lets the same word, as pronounced by speakers from core Urdu and Hindi backgrounds, register at a slight or negligible distance, rather than the hard distance produced by the standard edit distance on romanized transliterations. Like Nizami et al. [7], we compute the similarities per part of speech (PoS); it is more meaningful to look for the cognate of 'book' as a noun in the noun list of the other language than to make a generalized look-up over all possible words.
The rest of the paper is organized as follows. Section II reviews the literature on the lexical similarity of languages and earlier techniques, Section III describes the methodology, Section IV presents the detailed results, and Section V gives the conclusion and future work, followed by the bibliographical references.

II. LITERATURE REVIEW
This section reviews the existing techniques and approaches for lexical similarity with respect to script and sound.
The two most frequent methods for problems of this kind are string matching algorithms and the Soundex algorithm. The edit distance algorithm has different variants for different tasks, such as string alignment and spelling correction in language processing [8]. Its limitation is that it treats characters or letters as distinct units: if two characters are identical no operation is needed, and otherwise a full operation is charged. The user may assign different weights to the operations. Soundex, in contrast, works on sound-based matching rather than on letter or spelling matching [9].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020
Jinugu [10] presented a variant of the Tarhio-Ukkonen algorithm [11] that maximizes string matching by finding the longest patterns in the string while ignoring character mismatches. The algorithm works in multiple shifts over strings of varying lengths; the shift distance and the number of characters involved in matching also affect its performance.
The work in [12] uses the Soundex algorithm to retrieve nouns from a database of Hindi words consisting of vowels and consonants. Soundex groups letters into six agreement classes; vowels are eliminated and only consonants are converted into their phonetic class [13]. Similar work in [14] and [15] performs phonetic matching with rule-based algorithms and an encoding scheme for matching homophones written in different languages.
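As a hedged illustration of the six-class encoding described above, the following Python sketch implements the commonly cited American Soundex rules for Latin-script input. It is not the implementation used in [12] or [13]; the names `soundex` and `CODES` are chosen here for illustration only.

```python
# Six consonant classes of classic (American) Soundex; vowels carry no code.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a Latin-script word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                # the first letter is kept as is
    prev = CODES.get(letters[0], "")   # class of the previously coded letter
    for ch in letters[1:]:
        if ch in "HW":                 # H and W do not separate equal codes
            continue
        code = CODES.get(ch, "")
        if code and code != prev:      # skip vowels and repeated classes
            result += code
        prev = code                    # a vowel resets prev to ""
    return (result + "000")[:4]        # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))
```

Words that differ in spelling but fall into the same consonant classes receive the same code, which is the coarse, class-level notion of sound similarity that the articulatory features discussed in this paper refine.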
Soundex groups many letters together according to the articulation of their sounds. The IPA chart is available on the website of the International Phonetic Association (see footnote 1). IPA symbols are ordered by manner and place of articulation. A few symbols share the same articulation, e.g., plosive and bilabial [16]. Similarly, consonants may be voiced or unvoiced [17], aspirated, pharyngeal, and so on. A set of such features can therefore represent each IPA symbol. In Germanic languages, according to [18] and [19], some voiced stops can change into voiceless stops and vice versa.
Several studies address lexical similarity: [20] studied dialectal differences between pairs of texts using cosine similarity, Hamming distance, and Levenshtein distance, and [21] worked on cognate identification among different languages based on inter-related vocabulary. These studies show that lexical similarity can be computed from phonetic-level features of words rather than orthographic features.
Another work computes word similarity for a limited set of PoS using WordNet synsets and extends it to the phrase and sentence levels with the help of word-level similarity [22]. Similarly, lexical similarity has been computed for source code using string-level matching [23]. Other researchers distinguish multiple dimensions of lexical similarity, such as knowledge-based, string-based, and corpus-based [24]. An experiment in [25] examined cross-language similarity for Dutch and English cognates; on similar grounds, [26] computed lexical similarity using phonetics-based cognates with high frequencies.
The researchers in [25] and [26] showed that similarity matters phonologically for cross-language cognates and loan words. Similarly, the historical background and origin of languages have been analyzed, as [27] did for Urdu and French words. Another work supporting phonological-level similarity modeled English word structures as a language network in which words are linked phonologically [28].
1 https://www.internationalphoneticassociation.org/content/full-ipa-chart

III. METHODOLOGY
This section describes the main components. Section III-A discusses the languages chosen for the experiment, the source of the dataset, and the PoS-specific word lists used for lexical similarity. Section III-B discusses the standard edit distance and the proposed phonetic modifications based on articulatory features. Section III-C explains the proposed modified phonetic edit distance. Section III-D details the computation of lexical similarity for the chosen languages.
Lexical similarity is basically calculated as the ratio or count of similar words between two languages. This raises two questions: 1) How do we infer that two words are similar? 2) To what extent should the comparison be carried out, and by which criteria should the words be chosen?
In answer, there should be a dynamic method to decide whether two words are cognates, with edit distance employed as the measure of similarity; the comparison should be made over equal or otherwise acceptable numbers of words; and the selection criteria should take into account the origin and source of the words picked for comparison. For the first part, we propose a modified edit distance for computing lexical similarity, explained in the following subsections as the phonetic edit distance. The remaining parts concern the word lists and the feature used to select words for comparison. For this, we have chosen different parts of speech (PoS), because PoS carry the main content of any language and because earlier work on lexical similarity also used a PoS tag set [7]. Figure 3 gives the system diagram of the proposed phonetic edit distance method.

A. Dataset
We used the Universal Dependencies (UD) corpora to extract the PoS word lists. UD provides data for many languages in a standard format. Another reason for restricting the comparison to PoS is that any language in a textual corpus can be divided according to a PoS tag set. In this experiment, we chose Urdu and Hindi, which are written mainly in two scripts, Perso-Arabic and Devanagari, respectively, and we developed a conversion system from these scripts to IPA. The PoS used for the similarity computation are verbs, proper nouns, nouns, particles, auxiliaries, pronouns, coordinating and subordinating conjunctions, and adpositions. Figure 4 shows the length of each PoS word list for Hindi and Urdu.

B. Understanding the Phonetic Edit Distance
The standard edit distance [29] takes two words or strings and returns the distance between them. Internally, it relies on insertion, deletion, and substitution operations, each of which has a unit cost; these costs accumulate during the comparison of the two strings. This is simple string matching and carries no information about sound, such as phonetic articulatory features. Our proposed method modifies the edit distance on the basis of these articulatory features; we call the result the phonetic edit distance.
The proposed Phonetic Edit Distance (PED) works like the standard edit distance, but its internal mechanism is based on phonetic features, as explained in Section III-C. It takes two IPA-encoded words and returns the phonetic distance between them. If both words sound the same, the phonetic distance is zero. Otherwise, insertions and deletions contribute their cost, and each mismatch of sounds adds the phonetic cost of the substitution to the total.
Under the standard edit distance Φ, the distance between the IPA strings /bæd/ and /pæd/, which represent the English words 'bad' and 'pad', is Φ(/bæd/, /pæd/) = 1: a single replacement of 'b' with 'p' makes the strings identical. With the proposed PED, the distance for the same pair is 0.2, because the replacement of 'b' with 'p' costs only 0.2: the two sounds differ little and lie close together in the articulatory feature-based IPA chart. The phonetic distance between 'bad' and 'pad' is thus much smaller than the standard edit distance.
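The contrast between the two distances on the 'bad'/'pad' pair can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `edit_distance` is an ordinary dynamic-programming edit distance with a pluggable substitution cost, and `toy_phonetic_cost` is a hypothetical stand-in that hard-codes only the b/p cost of 0.2 quoted above.

```python
def unit_cost(a: str, b: str) -> float:
    """Standard edit distance: any mismatch costs one full operation."""
    return 0.0 if a == b else 1.0


def toy_phonetic_cost(a: str, b: str) -> float:
    """Hypothetical stand-in for a feature-based cost (only b/p is modelled)."""
    if a == b:
        return 0.0
    if {a, b} == {"b", "p"}:
        return 0.2  # value quoted in the text for replacing 'b' with 'p'
    return 1.0


def edit_distance(x, y, sub_cost=unit_cost, indel_cost=1.0) -> float:
    """Edit distance between sequences x and y with a pluggable substitution cost."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                        # deletion
                d[i][j - 1] + indel_cost,                        # insertion
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),  # substitution
            )
    return d[m][n]


standard = edit_distance("bæd", "pæd")
phonetic = edit_distance("bæd", "pæd", sub_cost=toy_phonetic_cost)
print(standard, phonetic)
```

Only the substitution cost changes between the two calls; the recurrence itself is the same, which is what allows a phonetic cost to be dropped into an otherwise standard edit distance.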
In the true sense, lexical similarity should be calculated over both vowels and consonants. However, vowels carry additional features in different languages; in Urdu, for example, vowels include short vowels as a component [30]. In this paper, we leave vowels to future work because of the complexity of their phonetic features. For consonants, we propose the following features (with their value types): voiced (binary), airflow (discrete), place (continuous), aspirated (binary), pharyngeal (binary), and manner (discrete).
The feature values follow articulatory positions. Place denotes the place of articulation inside the human mouth, and we assign values according to position: for example, the lips (bilabial) have 0.05, the teeth have 0.15, and the throat (glottal) has 0.95. Two other important fields are type and label: the label is the IPA symbol of the sound, and the type indicates whether the sound is a vowel or a consonant.
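One possible way to hold these features in code is sketched below in Python. The field names follow the feature list in the text, and the three place values are the ones quoted above; everything else (the class name `Phone`, the airflow value "pulmonic", the manner labels, and the example entries) is an assumption made for illustration.

```python
from dataclasses import dataclass

# Place values quoted in the text; other places of articulation are omitted
# here because their values are not given.
PLACE_VALUES = {"bilabial": 0.05, "dental": 0.15, "glottal": 0.95}


@dataclass(frozen=True)
class Phone:
    label: str        # IPA symbol of the sound
    type: str         # "consonant" or "vowel"
    voiced: bool      # binary
    aspirated: bool   # binary
    pharyngeal: bool  # binary
    airflow: str      # discrete (the value names are assumptions)
    manner: str       # discrete, e.g. "plosive", "fricative"
    place: float      # continuous position, lips (low) to throat (high)


# Hypothetical example entries.
P = Phone("p", "consonant", False, False, False, "pulmonic", "plosive",
          PLACE_VALUES["bilabial"])
B = Phone("b", "consonant", True, False, False, "pulmonic", "plosive",
          PLACE_VALUES["bilabial"])
H = Phone("h", "consonant", False, False, False, "pulmonic", "fricative",
          PLACE_VALUES["glottal"])


def place_distance(a: Phone, b: Phone) -> float:
    """Absolute difference of the continuous place values."""
    return abs(a.place - b.place)


print(place_distance(P, B), place_distance(P, H))
```

Encoding place as a number makes nearby articulation points cheap to substitute and distant ones expensive, while the binary and discrete features can be compared by simple equality.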

C. Algorithm for Phonetic Edit Distance
The sound-based articulatory features were presented in Section III-B together with their corresponding values. In this work, we compare consonants only with consonants, never vowels with consonants.
For the comparison of consonants, as shown in Algorithm 1, we give a weight of 2/3 to the place and manner features and assign the remaining 1/3 to the other features. Of this, the voiced feature has a weight of 1/5, and the other features share the remaining weight, 1 − 2/3 − 1/5 = 2/15. At present, the remaining features in the proposed system are airflow and aspirated. More features can be added without decreasing the weight of the major features.
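The weighting above, together with the threshold of 1/2 on the joint manner-and-place distance that the text states next, can be sketched in Python as follows. Several details are not specified in the text and are assumptions of this sketch: the joint distance is taken as the mean of the place and manner distances, the threshold is applied to that unweighted joint distance, the remaining weight is split evenly over airflow and aspirated, and the values in `MANNER_DISTANCE` are hypothetical.

```python
W_MANNER_PLACE = 2 / 3
W_VOICED = 1 / 5
W_OTHER = 1 - W_MANNER_PLACE - W_VOICED   # remaining weight, 2/15
THRESHOLD = 1 / 2

# Hypothetical rule table keyed on an unordered pair of manners.
MANNER_DISTANCE = {frozenset(["plosive", "fricative"]): 0.5}


def manner_distance(m1: str, m2: str) -> float:
    """Rule-based look-up; unknown pairs default to the maximum distance."""
    if m1 == m2:
        return 0.0
    return MANNER_DISTANCE.get(frozenset([m1, m2]), 1.0)


def consonant_distance(a: dict, b: dict) -> float:
    """Weighted phonetic difference of two consonants, in [0, 1]."""
    joint = (abs(a["place"] - b["place"])
             + manner_distance(a["manner"], b["manner"])) / 2  # assumption: mean
    if joint > THRESHOLD:
        # Above the threshold only the manner-and-place distance is returned.
        return W_MANNER_PLACE * joint
    voiced = float(a["voiced"] != b["voiced"])
    others = ("airflow", "aspirated")
    # Manhattan distance over the remaining features, scaled to [0, 1].
    other = sum(float(a[f] != b[f]) for f in others) / len(others)
    return W_MANNER_PLACE * joint + W_VOICED * voiced + W_OTHER * other


P = dict(place=0.05, manner="plosive", voiced=False,
         airflow="pulmonic", aspirated=False)
B = dict(P, voiced=True)
H = dict(place=0.95, manner="fricative", voiced=False,
         airflow="pulmonic", aspirated=False)

print(consonant_distance(B, P), consonant_distance(P, H))
```

Under these assumptions, 'b' and 'p' differ only in voicing, so their distance equals the voiced weight of 1/5, which agrees with the cost of 0.2 used for the 'bad'/'pad' example in Section III-B.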
Further, the manner and place features are the most substantial. We therefore add the distance of the other features only when the joint manner-and-place distance (δ_{m+p}) is less than or equal to the threshold of 1/2. If the joint distance is above the threshold, we do not add the distance of the other features and return only this distance. The distance for the manner feature is rule-based and is obtained by dictionary look-up with the key ⟨manner, manner⟩. Finally, the Manhattan distance is used for all the remaining features, as presented in line 10 of the algorithm.

[Algorithm listing: takes the lengths of the input strings X and Y and returns min(ins_cost, del_cost, rep_cost).]

In Algorithm 1, PDC is the phonetic difference of consonants, Θ is the computing unit for two features, MPED is the modified phonetic edit distance, and ED is the edit distance. The pseudo-code of the overall lexical similarity for the experimental languages is given in Algorithm 3, where the per-word result lies in the range [0, 1]: 0 if the sounds are the same and 1 if they are entirely different. In this way, all PoS words of Urdu are compared with those of Hindi using the modified phonetic edit distance to find the lexical similarity and the ratio of loan words or cognates between the languages.

[...] the IPA letter /sA/ and /s "/ we can substitute/suppose the Urdu letters ص and س, respectively. In their romanized variant both would produce the sound s, but the former has a lower whistle sound than the latter, where the whistle is a bit higher because of the dental place. Thus, with the romanized equivalents of these (Hindi) words we can get a higher distance through the standard edit distance (see

D. Computing Lexical Similarity
This section describes the flow of computing lexical similarity. The word lists are created from the UD corpora and converted into their IPA codes; the IPA codes are then enriched with phonetic articulatory features; and finally the lexical similarity between the languages is computed from these features using the proposed method. The pseudo-code is presented in Algorithm 3.
The Universal Dependencies website provides corpora for many languages in CoNLL-U format with tagged PoS. The tagged dependency structure includes the word lemma and the Universal Part-of-Speech (UPoS) tag. Because word forms are inflected, we used lemmas rather than word forms in the computation.
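Extracting per-PoS lemma lists from CoNLL-U text can be sketched in Python as follows. CoNLL-U lines carry ten tab-separated fields, of which the third is the lemma and the fourth the UPoS tag. The function name `lemmas_by_upos` and the two-token sample are made up for illustration; the sample is not taken from a UD treebank.

```python
from collections import defaultdict


def lemmas_by_upos(conllu_text: str) -> dict:
    """Group the lemmas of a CoNLL-U document by their UPoS tag."""
    lists = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):   # blank lines and comments
            continue
        cols = line.split("\t")
        if len(cols) != 10:                    # not a token line
            continue
        token_id, lemma, upos = cols[0], cols[2], cols[3]
        if "-" in token_id or "." in token_id:  # multiword ranges, empty nodes
            continue
        lists[upos].add(lemma)
    return dict(lists)


# Made-up sample sentence in CoNLL-U layout (illustrative only).
sample = "\n".join([
    "# text = (illustrative sample)",
    "\t".join(["1", "किताबें", "किताब", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "हैं", "है", "AUX", "_", "_", "0", "root", "_", "_"]),
])

result = lemmas_by_upos(sample)
print(result)
```

Collecting lemmas into sets removes duplicates, so each PoS list contains every distinct lemma once regardless of how often it occurs in the corpus.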
After extracting the word lists, we converted them into IPA strings. Many online platforms and dictionaries convert words into IPA strings, but the chosen languages contain short vowels, izafat letters, and diacritics that cause conversion problems on such platforms [2]. We therefore created our own word-to-IPA (in fact script-to-IPA) mapper for both languages.
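A script-to-IPA mapper of this kind can be sketched in Python as a table look-up. The sketch below is not the authors' mapper: it covers only the four consonant pairs discussed in the Introduction ('ک'/'क', 'ق'/'क़', 'گ'/'ग', 'غ'/'ग़'), skips every other character, and the IPA values and the names `CONSONANT_TO_IPA` and `consonants_to_ipa` are assumptions made for illustration. Unicode NFD normalization is applied first so that a Devanagari letter with a dot below is matched whether it is encoded as one code point or as base letter plus combining sign.

```python
import unicodedata

NUKTA = "\u093c"  # Devanagari combining dot below

# Illustrative, deliberately tiny table (assumed IPA values).
CONSONANT_TO_IPA = {
    "ک": "k", "ق": "q", "گ": "g", "غ": "ɣ",      # Perso-Arabic script
    "क": "k", "क" + NUKTA: "q",                   # Devanagari script
    "ग": "g", "ग" + NUKTA: "ɣ",
}


def consonants_to_ipa(word: str) -> list:
    """Map the known consonants of a word to IPA, skipping everything else."""
    text = unicodedata.normalize("NFD", word)  # split precomposed dotted letters
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in CONSONANT_TO_IPA:  # letter + dot below
            out.append(CONSONANT_TO_IPA[pair])
            i += 2
        elif text[i] in CONSONANT_TO_IPA:
            out.append(CONSONANT_TO_IPA[text[i]])
            i += 1
        else:
            i += 1                                       # unmapped character
    return out


print(consonants_to_ipa("\u0958"), consonants_to_ipa("قک"))
```

Mapping both scripts into one symbol inventory is what makes words from the two languages directly comparable; a full mapper would additionally have to handle vowels, diacritics, and context-dependent letters.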
Further, the articulatory features from the IPA chart are used for a one-to-one mapping of the symbols of each word. The IPA chart is the standard chart for the phonetic-level weighting of the letters in any word. Based on these features, each word is compared with the other words. The articulatory features are described in Sections III-B and III-C.

[Algorithm 3, surviving steps: for every word b in Lang 2 do; 5: x ← PED(a, b) · 1/max(‖a‖, ‖b‖); 6: [Tot] ← least value as key.]

In Algorithm 3, which compares languages Lang 1 and Lang 2, every word of Lang 1 is compared with all words of Lang 2 in steps 3 and 4. In step 5, we normalize the PED value by the length of the longer of the two compared words; otherwise, shorter words would receive smaller phonetic distances.
In Algorithm 3, every word a is compared with every word b and the minimum value is recorded; these minima are aggregated into the overall edit distance, from which the average distance per letter over both lists is obtained. If this value is zero, the two lists (languages) are identical; the closer it is to 1, the more the words differ in sound.
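The comparison loop of Algorithm 3 can be sketched in Python as follows, under stated assumptions. The function `lexical_distance` takes, for each word of the first list, the minimum length-normalized distance to any word of the second list and averages these minima over the first list; averaging over the first list is an assumption about how the aggregate is normalized. A plain unit-cost edit distance stands in for PED, and the word lists are made-up consonant skeletons.

```python
def unit_edit_distance(x, y) -> int:
    """Plain unit-cost edit distance, used here as a stand-in for PED."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        cur = [i]
        for j, yj in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (xi != yj)))  # substitution
        prev = cur
    return prev[-1]


def lexical_distance(lang1, lang2, ped=unit_edit_distance) -> float:
    """Average, over lang1, of the closest normalized distance into lang2."""
    total = 0.0
    for a in lang1:
        # Normalize by the longer word so that short words are not favoured.
        total += min(ped(a, b) / max(len(a), len(b)) for b in lang2)
    return total / len(lang1)


# Made-up consonant skeletons, one character per IPA symbol (illustrative).
urdu = ["ktb", "qlm"]
hindi = ["ktb", "klm"]
score = lexical_distance(urdu, hindi)
print(score)
```

A score of 0 means that every word in the first list has a phonetically identical counterpart in the second; passing a feature-based distance as `ped` instead of the stand-in would let near-identical sounds such as q and k contribute less than a full substitution.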

IV. RESULTS
We computed the lexical similarity of Urdu and Hindi PoS-wise with the proposed method based on articulatory features. The results are shown in Figure 5. They indicate that these languages have cognates and genetic affinity. Urdu and Hindi are quite similar at the spoken level, although Hindi is written in the Devanagari script, which is entirely different from the Urdu script. The similarity index shows that Hindi and Urdu are most similar in auxiliaries, determiners, articles, coordinating conjunctions, and pronouns. Our experiment shows that Urdu and Hindi are on average 67.8% similar despite their different scripts; this average is computed from the similarity of each PoS employed in this study. In the PoS-wise comparisons, most of the results are comprehensible; only the determiner is less similar, as shown in Figure 5. A few PoS, such as coordinating conjunctions, subordinating conjunctions, and verbs, show high similarity because of the unbalanced sizes of their word lists. The low similarity in adpositions and determiners deserves discussion: it is due to erroneous tagging of Hindi and Urdu PoS in the UD dataset. If the tagging of adpositions and determiners were corrected, the results would support with more emphasis the claim that Urdu and Hindi are the same language, though victims of a language split.

V. CONCLUSION
In this paper, we have introduced a modified algorithm, based on articulatory features, for the lexical similarity of Urdu and Hindi. The algorithm identifies intelligibility, cognates, and borrowed words despite differences in spelling, script, and phonetics. In the experiment conducted with the proposed algorithm, the majority of PoS similarity pairs agree with the genetic affinity of the languages. The proposed algorithm gives more interpretable results than simple string matching with the standard edit distance at the phonetic level. The method is also effective where a speaker is unable to pronounce a certain sound of another language (for example, many Arabic speakers have difficulty with the sounds 'p' and 'ch') and must substitute a similar or nearby sound. In such scenarios, PED yields a much smaller distance than the standard edit distance.
The ≈ 67.8% similarity is high enough to answer positively the question of whether Urdu and Hindi are mutually intelligible. Since the similarity under the phonetic aspect is high, we maintain that, within the context of speech, it is justified to term both languages 'Hindustani'; the difference of script is only a trivial ground for distinguishing them as 'Hindi' or 'Urdu'. There are, however, some limitations in the UD corpora used (variation in the sizes of the languages, format errors, and basic processing) and issues in the languages themselves, such as silent letters, diacritics, and short and long vowels. The results could be improved by using digital lexicographic resources and dictionaries rather than a letter-to-IPA scheme. In the future, comprehensive work could be done on the lexical similarity of the whole Indo-Aryan family by extending and enriching the proposed algorithm with vowels along with consonants.