Urdu to Punjabi Machine Translation: An Incremental Training Approach

Statistical machine translation is a highly popular approach in automatic translation research and a promising way to achieve good accuracy. This work develops an Urdu to Punjabi statistical machine translation system. The system uses an incremental training approach to train the statistical model. In place of a parallel sentence corpus, manually mapped phrases were used to train the model. In the preprocessing phase, various rules were used for tokenization and segmentation. Along with these rules, a text classification system was implemented to classify the input text into predefined classes, and the decoder translates the text according to the domain selected by the text classifier. The system uses a Hidden Markov Model (HMM) for learning and the Viterbi algorithm for decoding. Experiments and evaluation show that a simple statistical model such as an HMM yields good accuracy for a closely related language pair like Urdu-Punjabi. The system achieved a BLEU score of 0.86 and more than 85% accuracy in manual testing.
Keywords: Machine Translation; Urdu to Punjabi Machine Translation; NLP; Urdu; Punjabi; Indo-Aryan Languages


I. INTRODUCTION
Machine translation is a very active topic in the area of artificial intelligence. In this digital era, different communities across the world are connected to each other and share a vast amount of resources. In this kind of digital environment, the differences between natural languages are the main obstacle to communication. To remove this barrier, researchers from different countries and big companies are putting effort into developing machine translation systems. Various approaches have been developed to decode natural languages, such as rule-based, example-based, statistical and various hybrid approaches. Among these, the statistical approach is dominant and popular in the machine translation research community. Statistical systems yield good accuracy compared to other approaches, but statistical models need a huge amount of training data. In comparison to European languages, Asian languages are resource-poor; therefore, collecting a parallel corpus to train these statistical models is a challenging task. Many machine translation systems have been developed for Indo-Aryan languages [Garje G V, 2013]. Most of this work has been done using rule-based or hybrid approaches because of the non-availability of resources. The proposed system is based on an incremental training process for the machine learning algorithm. Efforts have been made to develop a parallel phrase corpus in place of a parallel sentence corpus. Collecting parallel phrases was more convenient than collecting parallel sentences.

II. URDU AND PUNJABI: A CLOSELY RELATED LANGUAGE PAIR
Urdu is the national language of Pakistan and has official language status in a few states of India, such as New Delhi, Uttar Pradesh, Bihar, Telangana, and Jammu and Kashmir, where it is widely spoken. It is also well understood in other states of India such as Punjab, Rajasthan, Maharashtra, Jharkhand, Madhya Pradesh and many others. The majority of Urdu speakers live in India and Pakistan: there are 70 million native Urdu speakers in India and around 10 million in Pakistan, and thousands of Urdu speakers live in the US, UK, Canada, Saudi Arabia and Bangladesh. The word Urdu is derived from the Turkic word ordu, which means army camp. The Urdu language developed between the 6th and 13th centuries. Urdu vocabulary is mainly derived from Arabic, Persian and Sanskrit, and the language is very closely related to modern Hindi. Urdu is written in the Nastaliq style from right to left, using an alphabet derived largely from the Persian alphabet, which is itself an extension of the Arabic alphabet.
Punjabi is an Indo-Aryan language and the 10th most widely spoken language in the world, with around 102 million native speakers worldwide. Punjabi-speaking people live mainly in India's Punjab state and in Pakistan's Punjab. Punjabi is an official language in Indian states such as Punjab, Haryana and Delhi, and is well understood in many other northern Indian regions. Punjabi is also a popular language in the Punjab region of Pakistan, but it has not yet been given official language status there. In India, Punjabi is written in the Gurmukhi script, meaning "from the Guru's mouth", and in Pakistan the Shahmukhi script is used, meaning "from the king's mouth". Apart from the different scripts used to write Punjabi, both forms share all other linguistic features, from grammar to vocabulary.
Urdu and Punjabi are closely related languages: both belong to the same family tree and share many linguistic features, such as grammatical structure and a vast amount of vocabulary. For example:
Urdu: وہ پنجابی یونیورسٹی کا طالب علم ہے۔
Punjabi: ਉਹ ਪੰਜਾਬੀ ਯੂਨੀਵਰਸਿਟੀ ਦਾ ਵਿਦਿਆਰਥੀ ਹੈ।
English: He is a student of Punjabi University.
www.ijacsa.thesai.org
Apart from the script and the writing direction, where Urdu is written from right to left in the Arabic script and Punjabi from left to right in the Gurmukhi script, every other linguistic feature is the same in both sentences. Both sentences share the same grammatical order and most of the vocabulary, and this also holds for more complex sentences. Our analysis of the two languages found that they share many similarities and are used by a vast community in India and Pakistan. Therefore, a natural language processing system is needed that can help these people share and understand text and knowledge. Efforts have been made to develop a machine translation system for Urdu to Punjabi text to overcome the language barrier between the two communities. With the help of this system, a native Punjabi reader can understand Urdu text by translating it into Punjabi.

III. CHALLENGES TO DEVELOP URDU TO PUNJABI MT SYSTEM
A. Resource poor languages: Urdu and Punjabi are new to the natural language processing area, like other Indo-Aryan languages. Both are resource-poor languages: very small or no annotated corpora are available for the development of a full-fledged system.
To develop a machine translation system based on a statistical model, one needs a huge parallel corpus to train the model. For a rule-based or hybrid machine translation system, one needs a good part-of-speech tagger or stemmer and large parallel dictionaries. To the best of our knowledge, the Urdu-Punjabi language pair does not have these resources in sufficient quantity to train or develop the system. Therefore, the development of resources is one of the key challenges of working on this language pair.
Because of free word order, the same sentence can be written in various ways, and all of these sentences have exactly the same meaning. It is therefore always difficult to write every possible rule needed to interpret source-language text for machine translation.
In naturally written text, diacritical marks are used very rarely. Because of the missing diacritical marks, an Urdu word can be mapped to many different target-language translations. For example, the word dil/دل is often written without diacritical marks and, without knowing its context, can be interpreted as either "Heart" or "DELL". Missing diacritical marks are a key challenge in choosing a proper translation in the target language, and the system always needs to disambiguate such words. In addition, missing diacritical marks create several variants of the same word. For example, the word "Urdu" can be written in three ways: (اُردُو), (اُردو) and (اردو). Therefore, all of these variants need to be included in the training examples.

IV. METHODOLOGY
An incremental machine learning process has been used in place of a manually developed parallel sentence corpus of the source and target languages. Urdu and Punjabi are resource-poor languages, and the non-availability of a parallel corpus is the primary challenge in developing a statistical machine translation system. Efforts have therefore been made to develop a corpus of manually mapped parallel phrases. Figure 1 shows the overall learning process of the machine translation system. The system takes an Urdu text document as input and translates it using initial, uniformly distributed data. Initially, the system has phrase tables for the 5000 most frequent Urdu words, mapped to their Punjabi translations. Because the phrase tables contain insufficient data, many Urdu words are returned without translation in the parallel phrase file generated by the decoding module (shown in Appendix 1).
The generated file is then manually corrected and updated with new translations by linguists. The updated file is submitted to the system again to generate the language model and the translation model. The system learns new parameters from all the updated files present in the repository of generated training files. It then updates the language model and the phrase tables with the new vocabulary and the updated probabilities. With this incremental learning process, the system is trained by each document it processes, learning from it and updating the language and translation models. The complete system is divided into five processes or modules: tokenization and segmentation, text classification, translation model learning, language model learning, and decoding.
A. Tokenization and segmentation process: Tokenization is the primary and most significant task of any machine translation system. In the preprocessing phase, the input text is divided into isolated tokens or words by a whitespace-based tokenization process. Identifying valid tokens is challenging when the input is noisy, with tokens often attached to neighboring tokens without any whitespace between them. This kind of writing is quite common in Urdu, where whitespace is optional. The proposed tokenization process works on two levels: (1) sentence boundary identification and (2) word boundary identification.

1) Tokenization into Sentences:
In the sentence tokenization process, the system identifies sentence boundaries based on a few symbols used in Urdu to end a sentence. For example, Urdu sentences often end with {۔, ؟}, but the symbol {۔} is ambiguous and does not always mark a sentence boundary. It is also used as a separator in abbreviations, for example آئی۔ سی۔ سی۔. Therefore, to tokenize text into sentences, a few rules were formed to check boundary conditions based on abbreviations. For example, the system always looks up the words surrounding a sentence termination symbol in an abbreviation list.
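The boundary rule described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `SENTENCE_END`, `ABBREVIATION_PARTS` and `split_sentences` are hypothetical, the three abbreviation entries are placeholders for the real abbreviation list (which the paper does not publish), and the rule is read literally as "a terminator is a boundary only when neither neighboring token is in the abbreviation list".

```python
# Sentence terminators: Urdu full stop (U+06D4) and Arabic question mark (U+061F).
SENTENCE_END = {"\u06d4", "\u061f"}

# Placeholder abbreviation list (assumption); the real list is not given in the paper.
ABBREVIATION_PARTS = {"آئی", "سی", "بی"}


def split_sentences(tokens):
    """Group whitespace-separated tokens into sentences.

    A terminator closes a sentence only when neither the previous nor the
    next token appears in the abbreviation list.
    """
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok not in SENTENCE_END:
            continue
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        if prev_tok in ABBREVIATION_PARTS or next_tok in ABBREVIATION_PARTS:
            continue  # the symbol is treated as an abbreviation separator
        sentences.append(current)
        current = []
    if current:  # keep any trailing tokens that had no terminator
        sentences.append(current)
    return sentences
```

One consequence of reading the rule this way is that a sentence ending directly in an abbreviation is not split; a fuller rule set would need to handle that case separately.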
2) Tokenization into words: The word tokenization process identifies the individual tokens or words in the input text. To identify all individual tokens, one first needs to separate the words from the symbols attached to them. For example, the system inserts whitespace between symbols and words, changing آئے ہیں۔ to آئے ہیں ۔.
3) Segmentation process: Segmentation is a key challenge in NLP applications that process Urdu text. It can be handled on two levels, space insertion and space omission, as discussed in the MT challenges. In the tokenization process, the system handles only the space insertion issue. The space omission problem is negligible in Unicode Urdu text, but space insertion is quite frequent. Resolving the word segmentation problem in Urdu is a challenging task and needs a full-fledged algorithm of its own. Rather than handling all segmentation issues, the system handles the most frequent cases. For example, in Urdu text a word is very often attached to one of the prefixes {کے, اور, سے}, which end with non-joining letters. Such words are easily understood by an Urdu reader but are difficult for a computer algorithm to process. A few examples of segmentation cases that start with these prefixes are {اورنام, اورترک, کےبعد, کےلیے, سےکہیں, سےپہلے}. The analysis shows that these three words account for 65% of all segmentation cases found in Urdu text, and that 5% of segmentation cases are related to alphanumeric words. The alphanumeric segmentation issue is also quite common in Urdu.
A Naïve Bayes model has been used to classify the input text. The Naïve Bayes model treats a document as a bag of words, in which word positions are not important for classification. The approach is based on Bayes' rule, defined as:

$$c_{MAP} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}$$

Rewriting by dropping the denominator, which is a constant factor:

$$c_{MAP} = \arg\max_{c \in C} P(d \mid c)\, P(c)$$

Representing the document by its features $x_1, \ldots, x_n$, the equation can be written as:

$$c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)$$

The joint probability of the whole set of independent features is defined as:

$$P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c)\, P(x_2 \mid c) \cdots P(x_n \mid c)$$

Simplified as:

$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

The prior and the maximum likelihood estimate are defined as:

$$\hat{P}(c) = \frac{N_c}{N}, \qquad \hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$$

where $N_c$ is the number of training documents of class $c$ and $N$ is the total number of training documents. To handle unknown words, the classifier uses Laplace smoothing, defined as:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + \alpha}{\sum_{w \in V} \left( count(w, c) + \alpha \right)}$$

Rewritten as:

$$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + \alpha}{\sum_{w \in V} count(w, c) + \alpha\, |V|}$$

where $|V|$ is the size of the vocabulary and $\alpha$ is a constant value added to the frequency count of a word.
HMM is a generative model defined over the joint probability

$$p(x_1, \ldots, x_n, y_1, \ldots, y_n)$$

where $x_1, \ldots, x_n$ are the source language phrases and $y_1, \ldots, y_n$ are the target language phrases. Given the input $x_1, \ldots, x_n$, we take the highest-probability phrase sequence $y_1, \ldots, y_n$ as the target language output. The bigram HMM model is defined as:

$$p(x_1, \ldots, x_n, y_1, \ldots, y_n) = \prod_{i=1}^{n} q(y_i \mid y_{i-1}) \prod_{i=1}^{n} e(x_i \mid y_i)$$

where $q(y_i \mid y_{i-1})$ is the transition probability supplied by the language model, $e(x_i \mid y_i)$ is the emission probability supplied by the translation model, and $y_0$ is a start symbol.
IBM models provide an elegant solution for automatically mapping source and target language phrases, but they need a large parallel corpus to train the model. Urdu and Punjabi are resource-poor languages, as discussed in the challenges. Therefore, efforts have been made to find a simple and effective solution for the training process.
The system takes manually mapped phrases as a training file and calculates translation probabilities. A sample training file is shown in Appendix 1.
For example, the word اتفاق can be translated in four different ways. If the training algorithm knows the mapping in advance, it is straightforward to calculate translation probabilities from the occurrences in the training data. In the proposed method, the training algorithm already has the alignments of all phrases; therefore, it can calculate the parameters of the generative model.
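As a rough illustration of how translation probabilities can be obtained from already-aligned phrases and updated file by file, the sketch below uses plain relative-frequency counts. The class name `PhraseTable`, its methods, and the Latin placeholder phrases are assumptions made for this example; the paper does not specify its data structures or the exact estimator.

```python
from collections import defaultdict


class PhraseTable:
    """Relative-frequency translation table that is updated incrementally.

    Each corrected training file contributes (source_phrase, target_phrase)
    pairs; probabilities are always derived from the accumulated counts.
    """

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, mapped_pairs):
        """Add the manually mapped pairs of one training file."""
        for src, tgt in mapped_pairs:
            self.counts[src][tgt] += 1

    def prob(self, src, tgt):
        """Relative frequency of tgt among all translations seen for src."""
        translations = self.counts.get(src)
        if not translations:
            return 0.0
        return translations.get(tgt, 0) / sum(translations.values())


# Usage with placeholder phrases: two training files processed in sequence.
table = PhraseTable()
table.update([("src_a", "tgt_1"), ("src_a", "tgt_1"), ("src_a", "tgt_2")])
first_estimate = table.prob("src_a", "tgt_1")   # 2 of 3 occurrences
table.update([("src_a", "tgt_2")])
second_estimate = table.prob("src_a", "tgt_2")  # 2 of 4 occurrences
```

Keeping raw counts rather than stored probabilities means that each newly corrected file only increments counters, which fits the incremental training loop described in the methodology.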
Appendix 1 shows phrases with one-to-one, one-to-many, many-to-one and many-to-many word mappings. In the training data, we try to combine multiple words into a phrase when they frequently occur together or when the combined words yield a valid translation in the target language. For comparison with the IBM models, we used 50,000 parallel Urdu-Punjabi sentences to train a model with the Moses toolkit, which uses Giza++ for phrase alignment. For the 50,000 sentences, Moses generated 3,168,873 phrases, with a phrase table size of 503 MB. We examined the generated phrase table manually and found many misalignments and unnecessarily long phrases, which increased the size of the phrase table and added complexity to the search space of the decoding algorithm. In comparison, our manually mapped phrase table for the same set of sentences contains 56,023 phrases, which are sufficient to translate the given sentences of that domain accurately, as shown in the experiment section. In our phrase table, the maximum phrase length is four words and there are 1093 four-gram phrases in total, whereas the automatically generated phrase table contains several thousand four-gram phrases.
Automatically finding the alignment of words and phrases from a parallel corpus is a graceful solution, but when dealing with resource-poor languages, alternative ways need to be found. Developing machine learning resources such as a sentence-aligned parallel corpus is a time-consuming job, and training a machine translation model requires millions of parallel sentences. Therefore, when no parallel corpus is available, it is a better idea to map phrases than to write parallel sentences. Mapping and checking phrases incrementally makes the job easier, and it has three advantages. First, one only needs to write a short phrase in the target language rather than a whole sentence: during training, the system generates a partial or nearly complete translation of an input document, so one only needs to check the generated files and map the new words. Second, the phrase table is very small compared to an automatically generated phrase table, which makes the decoding process more efficient. Third, a linguist needs less time to produce parallel phrases than parallel sentences.
2) Language model: The language model is responsible for generating natural language. The system uses the Kneser-Ney smoothing algorithm to build the language model (Chen and Goodman 1998). Kneser-Ney is an extension of absolute discounting and provides a state-of-the-art solution for predicting the next word. The absolute discounting method is defined as:

$$P_{AbsDisc}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)$$

Kneser-Ney is a refined version of absolute discounting and gives better predictions from the lower-order models when the higher-order models have no counts. The following equation shows the second-order Kneser-Ney model:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\left(c(w_{i-1} w_i) - d,\, 0\right)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)$$

where $\lambda$ is a normalizing constant, defined as:

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\, \left|\{ w : c(w_{i-1} w) > 0 \}\right|$$

where $|\{ w : c(w_{i-1} w) > 0 \}|$ is the number of word types that can follow $w_{i-1}$.
$P_{cont}(w_i)$ is used as a replacement for the maximum likelihood unigram probability; this continuation probability estimates how likely the unigram is to continue a new context. The continuation probability distribution is defined as:

$$P_{cont}(w) = \frac{\left|\{ w_{i-1} : c(w_{i-1} w) > 0 \}\right|}{\sum_{w'} \left|\{ w'_{i-1} : c(w'_{i-1} w') > 0 \}\right|}$$

where the numerator is the count of different word types that precede the word $w$, and the denominator is a normalizing factor, the total count of different word types preceding all words. The recursive formulation of Kneser-Ney for higher-order models is defined as:

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c_{KN}(w_{i-n+1}^{i}) - d,\, 0\right)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

To form the language model, we used a mixture of phrase-based and word-based language models. Generally, machine translation systems and other NLP applications use a word-based language model. We developed a phrase-based model alongside the word-based model, which gives accurate predictions when choosing the correct phrases or words to generate the target language. The system generates training data files with phrase separators, from which the phrase-based and word-based language model file shown in Appendix 2 is built. Changes were made to the language model training data to reduce the vocabulary size. For example, we replaced all numeric tokens with a unique token, so that numeric values such as 22.201 and 545.1 become 11.111 and 111.1 respectively. Replacing numeric tokens with unique tokens helps the smoothing algorithm predict phrase sequences that share the same pattern but contain different numeric tokens. Along with numeric patterns, we replaced patterns such as email addresses with the unique token [e@e], which helped to decrease the size of the language model.
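For readers who want to see the bigram (second-order) Kneser-Ney estimate in executable form, here is a minimal word-level sketch. It is an assumption-laden illustration rather than the system's language model: the function name `train_kn_bigram`, the discount `d = 0.75`, the `<s>`/`</s>` boundary symbols and the toy corpus are all hypothetical, and the phrase-level mixture used in the paper is not modeled.

```python
from collections import Counter, defaultdict


def train_kn_bigram(sentences, d=0.75):
    """Return prob(w, u) estimating P_KN(w | u) with interpolated Kneser-Ney."""
    bigrams = Counter()
    for sent in sentences:
        toks = ["<s>"] + list(sent) + ["</s>"]
        bigrams.update(zip(toks, toks[1:]))

    context_count = Counter()     # c(u): how often u occurs as a context
    followers = defaultdict(set)  # word types seen after u
    preceders = defaultdict(set)  # word types seen before w
    for (u, w), c in bigrams.items():
        context_count[u] += c
        followers[u].add(w)
        preceders[w].add(u)
    n_bigram_types = len(bigrams)

    def prob(w, u):
        # Continuation probability: share of bigram types that end in w.
        p_cont = len(preceders.get(w, ())) / n_bigram_types
        if context_count[u] == 0:
            return p_cont  # unseen context: fall back to continuation prob.
        discounted = max(bigrams[(u, w)] - d, 0) / context_count[u]
        lam = d * len(followers[u]) / context_count[u]
        return discounted + lam * p_cont

    return prob


# Toy corpus (placeholder tokens) to exercise the estimator.
kn_prob = train_kn_bigram([["a", "b"], ["a", "c"], ["b", "c"]])
```

With this construction the probabilities over all words that can appear in second position sum to one for any seen context, which is a quick sanity check on the discount and the normalizing constant.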
D. Decoding: The decoding problem is to find the most likely state sequence for a given observation sequence. To decode the Hidden Markov Model and find the state sequence with the maximum likelihood, the system uses the Viterbi algorithm. The sequence of states is backtracked after the whole sequence has been decoded.

V. EXPERIMENT AND EVALUATION
The system has been evaluated using the BLEU score, an automatic evaluation metric (Papineni et al. 2002), and by human evaluators, who were monolingual non-expert translators with knowledge of only the target language. The BLEU score ranges between 0 and 1, and for manual checking we set four parameters, as shown below. For the BLEU-based evaluation, one reference translation was used to calculate the score; it was prepared by the same linguistic experts who prepared the training data. For incremental training, all training data was collected from the BBC Urdu website. The system was evaluated after every 100 training documents. The BLEU scores per domain are shown in charts 1 to 5.
Manual testing was performed at the end of the training process. The test set contained 10 documents from each domain, 1123 sentences in total. In manual testing, 85% of the sentences received a score of 3 or 2, 10% received a score of 1, and the remaining sentences, which were new to the system, received a score of 0. The overall BLEU score for the same set of sentences was 0.86. Applying the text classifier before translation increased the overall accuracy, because the classifier helped the translation algorithm pick the correct translation phrases for the domain of the input text. The text classifier was evaluated using standard metrics, as shown below. It was able to classify any given text document with an overall accuracy of 0.961. The classifier failed when a document did not contain enough text to classify or when the text was very ambiguous, such as a political document that contains more sports-related text than political text.

Our experiment shows that a simple statistical model such as an HMM also yields good results for a closely related language pair. HMM-based models are popular in part-of-speech (POS) tagging and Named Entity (NE) tagging, and researchers have reported very good results for sequence tagging NLP applications. Various researchers [Thorsten Brants, 200] have shown that, with a good amount of training tokens, even a simple statistical model performs well compared to models such as MaxEnt. Appendix 3 shows sample output and a comparison of Google translator and our machine translation system. The proposed system generates a perfect or nearly perfect translation of the given text, whereas Google translator generates grammatically incorrect, meaningless and partial output in all cases. The system's output was compared for all five domains. The Urdu input examples were quite simple, without any ambiguous words.
A comparison between the two systems is difficult because they use different training data sets. However, we checked the entire word list manually on Google translator, and nearly all the words were in its translation database, yet its decoder was not able to translate the input text using that knowledge base. Google translator has a very rich phrase translation database, but its translation is still quite poor for the Urdu-Punjabi language pair.

VI. CONCLUSION
This paper has presented an incremental learning based Urdu to Punjabi machine translation system. Instead of learning its parameters from a parallel corpus of source and target language sentences, the proposed system uses manually mapped parallel phrases as training data and learns the parameters of the translation model and the language model from them. In the preprocessing phase, the system uses rules for segmentation and tokenization, and a text classification system to translate the given text according to a preferred domain, which also helped to improve the overall accuracy of the translation system. The system has been trained and tested on the Urdu-Punjabi language pair, two closely related languages that share grammatical structure and vocabulary. Urdu and Punjabi are resource-poor languages, and a huge parallel corpus is normally needed to train a statistical machine translation model to decent accuracy. With our learning method, the system was able to achieve a BLEU score of 0.86, which is relatively good compared to other statistical translation systems. Like Urdu and Punjabi, many other Asian languages are resource-poor, and this approach can be applied directly to other closely related language pairs.

Fig. 1. Incremental MT training and decoding system

The system uses a list of 100 stop words to remove uninformative words that are common in the training examples. Urdu is a morphologically rich language, and one word can appear in the corpus with different suffixes; therefore, Urdu stemming rules [Rohit Kansal et al. 2012] are used to transform all inflected words in the training examples to their root form.
C. Translation and Language model Training: The training process of the machine translation system is divided into two main parts, translation model learning and language model learning. The system uses a Hidden Markov Model (HMM) for learning and the Viterbi algorithm as the decoder.
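To make the HMM-plus-Viterbi setup concrete, the following sketch decodes a sequence of source phrases into the most likely sequence of target phrases, treating target phrases as hidden states and source phrases as observations. It is a simplified illustration under stated assumptions: `viterbi`, `candidates`, `trans_logp` and `emit_logp` are hypothetical names, every source phrase is assumed to have at least one candidate translation, and the toy probabilities are invented purely to exercise the code.

```python
import math


def viterbi(observations, candidates, trans_logp, emit_logp, start="<s>"):
    """Most likely target-phrase sequence for a source-phrase sequence.

    candidates(x)    -> list of target phrases for source phrase x
    trans_logp(p, y) -> log transition score of y after p (language model)
    emit_logp(x, y)  -> log emission score of x given y (translation model)
    """
    scores = [{start: 0.0}]  # scores[t][y]: best log-score of a path ending in y
    backptr = [{}]           # backptr[t][y]: previous state on that best path
    for x in observations:
        prev, cur, ptr = scores[-1], {}, {}
        for y in candidates(x):
            best_prev = max(prev, key=lambda p: prev[p] + trans_logp(p, y))
            cur[y] = prev[best_prev] + trans_logp(best_prev, y) + emit_logp(x, y)
            ptr[y] = best_prev
        scores.append(cur)
        backptr.append(ptr)

    # Backtrack from the best final state once the whole sequence is decoded.
    state = max(scores[-1], key=scores[-1].get)
    path = []
    for t in range(len(observations), 0, -1):
        path.append(state)
        state = backptr[t][state]
    return path[::-1]


# Toy model with placeholder phrases and made-up probabilities.
toy_candidates = {"x1": ["A", "B"], "x2": ["C", "D"]}
toy_trans = {("<s>", "A"): 0.5, ("<s>", "B"): 0.5,
             ("A", "C"): 0.1, ("A", "D"): 0.9,
             ("B", "C"): 0.5, ("B", "D"): 0.5}
toy_emit = {("x1", "A"): 0.6, ("x1", "B"): 0.4,
            ("x2", "C"): 0.5, ("x2", "D"): 0.5}

toy_result = viterbi(
    ["x1", "x2"],
    lambda x: toy_candidates[x],
    lambda p, y: math.log(toy_trans[(p, y)]),
    lambda x, y: math.log(toy_emit[(x, y)]),
)
```

Working in log space avoids numerical underflow on long documents, and restricting the states at each position to the candidate translations of that source phrase keeps the search space small when the phrase table is compact.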

1) Translation model: Urdu and Punjabi are closely related languages. Both share an identical grammatical structure as well as the same word order [Durrani, Nadir et al. 2010]. To learn the translation model, we manually mapped the phrases of the source and target languages.
He paid $50 to shopkeeper.
He paid $30 to shopkeeper.
Both of these sentences are changed to: He paid $11 to shopkeeper.
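The normalization illustrated by these sentences, together with the email placeholder used for the language model training data, can be sketched with two regular expressions. The function name `normalize_tokens` and the email pattern are assumptions for this example; the paper only states that digits collapse to a single pattern and that email addresses become the token [e@e].

```python
import re

# Simplified email pattern (assumption); real data may need a stricter one.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGIT_RE = re.compile(r"\d")


def normalize_tokens(text):
    """Replace emails with [e@e] and every digit with 1 to shrink the vocabulary."""
    text = EMAIL_RE.sub("[e@e]", text)  # emails first, so their digits are not touched
    return DIGIT_RE.sub("1", text)
```

For instance, `normalize_tokens("22.201")` gives `"11.111"`, matching the pattern described for the language model data. In Python 3, `\d` also matches Unicode decimal digits such as the Extended Arabic-Indic digits used in Urdu text, so those would be mapped to `1` as well.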
IF: Current word length > 3 AND word starts with {کے, اور, سے} AND word is not present in DB
    Apply rules to split prefix and suffix parts
    IF: suffix part is present in Phrase Table
        Add prefix and suffix words in FinalList[]
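The segmentation rule in this fragment can be sketched as a small Python function. The names `PREFIXES`, `split_attached_prefix`, `known_words` and `phrase_table` are hypothetical stand-ins for the system's word database and phrase table, which are represented here as plain sets.

```python
# The three frequently attached prefixes named in the segmentation section.
PREFIXES = ("اور", "کے", "سے")


def split_attached_prefix(word, known_words, phrase_table):
    """Return [prefix, rest] when the rule fires, otherwise [word].

    The rule fires only if the word is longer than three characters, is not
    already a known word, starts with one of the prefixes, and the remaining
    part is present in the phrase table.
    """
    if len(word) > 3 and word not in known_words:
        for prefix in PREFIXES:
            rest = word[len(prefix):]
            if word.startswith(prefix) and rest in phrase_table:
                return [prefix, rest]
    return [word]
```

Requiring the remainder to be in the phrase table keeps the rule conservative: a genuine word that merely begins with the same letters as a prefix is left untouched unless splitting it yields a translatable phrase.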
B. Text Classification: Most statistical machine translation systems use a single phrase table for translation. Instead, the proposed system uses five different phrase tables, one for each domain. The system has been trained on the political, health, entertainment, tourism and sports domains. After tokenization, the text classifier classifies the input text into the most probable class, and the translation module then uses the phrase table of that domain to translate the input text. The text classifier returns a list of all domains, with the most probable domain on top followed by the less probable domains. The other domains are used as a back-off model: when the system does not find an Urdu phrase in the top domain, it searches the next most probable domain, and so on.

Fig. 2. Text classification system

ALGORITHM 1. Tokenization and Segmentation Process
Read Input Text in InputText
FinalList[]
Sentences[][]
LOOP: For each token in InputText
    IF: Current token is not a sentence separator
        Sentence += token + " "
    ELSE IF: Current token is a sentence separator AND previous and next are not abbreviation tokens
        Add Sentence in Sentences[][]
END LOOP
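The domain ranking and back-off lookup described in the text classification paragraph can be sketched as follows, using a multinomial Naïve Bayes classifier with add-alpha (Laplace) smoothing. Everything named here (`DomainClassifier`, `rank`, `lookup_with_backoff`, the toy documents and phrase tables) is hypothetical, and stop-word removal and stemming are left out for brevity.

```python
import math
from collections import Counter


class DomainClassifier:
    """Multinomial Naive Bayes over bag-of-words documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = {}        # domain -> Counter of word frequencies
        self.doc_counts = Counter()  # domain -> number of training documents
        self.vocab = set()

    def train(self, domain, tokens):
        self.word_counts.setdefault(domain, Counter()).update(tokens)
        self.doc_counts[domain] += 1
        self.vocab.update(tokens)

    def rank(self, tokens):
        """Return all domains, most probable first."""
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for domain, counts in self.word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[domain] / n_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[domain] = score
        return sorted(scores, key=scores.get, reverse=True)


def lookup_with_backoff(phrase, ranked_domains, phrase_tables):
    """Search the top domain first, then back off to less probable domains."""
    for domain in ranked_domains:
        translation = phrase_tables.get(domain, {}).get(phrase)
        if translation is not None:
            return translation
    return None  # phrase is unknown in every domain


# Toy usage with placeholder tokens and phrase tables.
clf = DomainClassifier()
clf.train("sports", ["match", "team", "goal"])
clf.train("health", ["doctor", "hospital", "team"])
ranked = clf.rank(["goal", "match"])
tables = {"sports": {"p": "sports_p"}, "health": {"q": "health_q"}}
found = lookup_with_backoff("q", ranked, tables)
```

Returning the full ranking rather than only the best class is what makes the back-off possible: the decoder can fall through to the next domain's phrase table whenever the preferred one has no entry.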

TABLE III. MANUAL EVALUATION SCORES

TABLE IV. CONFUSION MATRIX OF TEXT CLASSIFIER