Transliterating Nôm Scripts into Vietnamese National Scripts using Statistical Machine Translation

Abstract: Nôm scripts were used as the Vietnamese writing system from the 10th century to the early 20th century. During this period, they recorded a broad range of historical events, literary works, and medical knowledge, as well as wisdom from many other domains. Unfortunately, since hardly any native Vietnamese speaker can read Nôm scripts nowadays, these valuable documents have not been fully harnessed. To address this gap, it is necessary to build an automatic transliteration system that helps decode the ancient scripts and recover the knowledge of our Vietnamese ancestors. This study categorizes and reviews current progress on Statistical Machine Translation (SMT) approaches to transliterating Nôm scripts into Vietnamese national scripts. In this paper, we discuss the differences between Nôm scripts and Vietnamese national scripts, systematically compare SMT models for transliterating Nôm scripts into Vietnamese national scripts, and outline several promising research directions.


I. INTRODUCTION
Transliteration is the conversion of a text from one script to another within the same language. For instance, the Russian name "Путин", written in the Cyrillic script, is transliterated into the Latin script as "Putin". This transliteration is relatively straightforward because most Cyrillic letters have exactly one Latin correspondence. Since both scripts are based on alphabets that use a limited number of graphemes to represent speech, transliteration can be done by looking up a mapping table (Table I). In contrast, transliteration from Nôm scripts to Vietnamese national scripts is challenging because the two do not belong to the same type of writing system: Nôm scripts are logographic, whereas Vietnamese national scripts are alphabetic. In other words, the mapping from Nôm scripts to Vietnamese national scripts is one-to-many.
For instance, the Nôm character can be transliterated into nghĩ or nghỉ. Due to differences between the two writing systems, the mapping table method presented in the aforementioned Russian language example is not applicable when transliterating from Nôm scripts into Vietnamese national scripts.
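The contrast between the two cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Cyrillic table covers just the letters of the example word, the function names are hypothetical, and the Nôm entry uses the code point 2025Dh with the readings nghĩ and nghỉ that this paper gives for that character in Section II.

```python
# Toy one-to-one table for an alphabetic script pair.
# Assumption: it covers only the letters of the example word.
CYRILLIC_TO_LATIN = {"П": "P", "у": "u", "т": "t", "и": "i", "н": "n"}


def transliterate_by_table(text, table):
    """Replace each character by its single table entry; keep unknown ones."""
    return "".join(table.get(ch, ch) for ch in text)


# One-to-many table for a logographic source script: each character maps
# to a list of candidate readings, so a plain lookup cannot decide.
NOM_TO_NATIONAL = {"\U0002025D": ["nghĩ", "nghỉ"]}


def candidate_readings(char, table):
    """Return every reading listed for a Nôm character (empty if unseen)."""
    return list(table.get(char, []))


latin = transliterate_by_table("Путин", CYRILLIC_TO_LATIN)
readings = candidate_readings("\U0002025D", NOM_TO_NATIONAL)
print(latin, readings)
```

The first lookup returns a finished string, while the second only returns a list of candidates. Choosing among those candidates is the part that requires context.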
The one-to-many mapping from Nôm scripts to Vietnamese national scripts makes transliteration difficult, because the reader has to read the Nôm text and infer the appropriate meaning at the same time. Successful Nôm transliteration also requires extra-linguistic knowledge about the culture, history, geography, dialects, and specialized terminology of ancient Vietnam. In recent years, machine translation has been applied successfully to rich-resource languages, including Chinese [1] and many European languages such as German [2], Greek [3], English [4], Spanish [5], French [6], Finnish [7], Italian [8], Dutch [9], and Portuguese [10]. Research on low-resource Southeast Asian languages such as Indonesian [11], Khmer [12], Lao [13], Malay [14], Myanmar [15], Filipino [16], and Thai [17] has also yielded significant results, which motivates us to apply machine translation to transliterating Nôm scripts into Vietnamese national scripts. Two state-of-the-art approaches in machine translation are Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). However, NMT requires a large amount of data [18], which is impractical for the low-resource pair of Nôm and Vietnamese national scripts. Therefore, we apply SMT to the transliteration task in this study.
With this approach, the more manually transliterated training data the computer is given, the more accurate its transliteration becomes. The machine can also improve its accuracy if humans supervise it and manually revise the incorrect results it previously produced. The more often this supervising and revising loop is repeated, the better the transliteration results become.
In this paper, we present automatic transliteration from Nôm scripts to Vietnamese national scripts using Statistical Machine Translation. Our research steps are as follows: (1) collect and clean (i) the Nôm-Vietnamese national parallel corpus as the training data for the translation model and (ii) the monolingual Vietnamese national scripts as the training data for the language model, (2) classify corpora according to literary forms and domains, (3) experiment, and (4). The remainder of the paper is organized as follows: in Sections II and III, we provide an overview of Nôm scripts and of related studies, respectively. Then, we present our proposed model in Section IV and discuss the experimental results in Section V. Section VI concludes the study.

II. OVERVIEW OF NÔM SCRIPTS
Nôm scripts were created based on Chinese characters, which results in various similarities between Nôm scripts and Chinese scripts. Unlike phonological writing systems, Chinese script is the only logographic writing system currently in use in the world [19]. In a phonological writing system, symbols record the phonemes of a language. Chinese characters, in contrast, mark morphemes, ideas, and basic concepts such as the sun (日), moon (月), tree (木), human (人), water (水), and heart (心). These basic elements are called radicals (部首). Radicals are the building blocks from which Chinese characters (Hanzi, 漢字) are built. According to the Han dictionary Shuowen (說文), there are six methods (六書) of constructing Chinese characters, including:
• Pictograms (象形): 日 (sun), 月 (moon), 木 (tree), etc.
Among these six types of characters, 90 percent belong to the ideogram-plus-phonetic category [19], i.e., each character is a morpheme-syllable compound. Meanwhile, the Vietnamese language is constituted by morpho-syllables, which means that the units constituting the two writing systems are equivalent. In the Chinese language, morphemes are radicals because a radical is the smallest meaningful unit of the Chinese writing system. Radicals are also the basis for arranging entries in Chinese dictionaries. For instance, to look up the Chinese character for mother, we first search for the radical meaning woman, since the character contains that radical. Then, we look up the remaining component by its number of strokes, which is three.
According to [20], there are about 10,000 distinct pure morpho-syllables in Vietnamese (not including transliterated morpho-syllables of loan words or scripts of ethnic languages) and approximately 13,000 distinct Chinese characters in Chinese (not including ancient characters and characters used for transliterating loan words). Also according to [20], each Chinese character has its own Unicode code point. The Chinese Unicode character set was constructed from various Chinese encoding character sets such as Big5 and GB, which were gathered and aggregated into the Unicode CJK character set. The first version of CJK was released in 1980 with roughly 13,000 Chinese characters; the number of encoded Chinese characters has grown over the years and reached 80,000 in 2018.
Most of the Nôm scripts were also created in the form of semantic (meaning)-phonetic (sound) compounds. The ancient Vietnamese usually borrowed two elements -one element for meaning and the other for the sound -from Chinese character collection to construct a Nôm character. For instance, the Nôm character means number three. In the Nôm character , the Chinese character , which has pinyin /bā/, denotes the sound, while the Chinese character indicates the meaning. Similarly, in the Nôm character , which means father, the Chinese character signifies the sound, while the Chinese character expresses the meaning. Apart from the aforementioned semantic-phonetic compounds, there are a number of Nôm characters created by other methods, such as rebus, repetition, transfer, and diacritics adding. These methods signify the phonological difference between Nôm and Chinese characters [21].
Because Nôm scripts are mainly built on the semantic-phonetic compound method, there are cases in which one Nôm character is mapped to two or more Vietnamese national scripts. This typically happens when the national scripts have similar pronunciation and indicate synonymous meanings. This phenomenon can be explained by linguistic characteristics. While Vietnamese and Chinese are both tonal languages, they do not have the same number of tones: there are six tones corresponding to six diacritics in Vietnamese, but only four tones in Chinese. Moreover, different script creation methods, regional dialects, and Sino-Vietnamese variants due to different times of adoption also account for the one-to-many mapping between Nôm scripts and Vietnamese national scripts. For instance, the Nôm character 味 has two corresponding national scripts. The first one is mùi (smell), which was adopted before the Tang Dynasty. The second one is vị (flavor), which was adopted from the Tang Dynasty onwards [22]. In the Nôm-Vietnamese national scripts dictionary 1 , a considerable number of Nôm characters are polyphonic (a polyphonic Nôm character has more than one corresponding Vietnamese national script). For example, the character 折 (Unicode code 6298h) has 19 corresponding national scripts (chệch, chét, chết, chẹt, chiết, chít, chịt, díp, gãy, gẩy, giẹp, giết, giỡn, nhét, nhít, siết, trét, triếp, xiết). This is also the character with the highest number of meanings in the Nôm-Vietnamese national scripts dictionary. In contrast, each monophonic Nôm character has only one corresponding Vietnamese national script.
Choosing the suitable Vietnamese national script for a given Nôm character is a difficult problem not only for the machine but also for human transliterators. Consider the Nôm character with Unicode code 2025Dh, which appears in the 12th sentence of Tale of Kieu in Fig. 2. This character might be transliterated into two national scripts, nghĩ (to think) or nghỉ (a pronoun used to indicate an old man in ancient Vietnamese) [23]. Scholars have been debating for over 50 years which national script is correct in the given situation. Both sides provide various arguments, historical evidence, and literary evidence to demonstrate why one of the two national scripts is more suitable than the other. Therefore, requiring a computer to generate a 100-percent accurate transliteration output is impracticable, at least at present and in the near future.
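To make the effect of polyphony concrete, the following Python sketch counts how many candidate transliterations a short input admits. The 19 readings of the character with code 6298h and the two readings of the character with code 2025Dh are quoted from the text above. The third entry (code 4E09h read as tam) is an illustrative monophonic entry assumed for this example only, and the function names are hypothetical.

```python
from math import prod

# Mini-dictionary: Nôm character -> readings in national script.
# The first two entries quote the paper; the third is an assumed,
# illustrative monophonic entry, not taken from the paper's dictionary.
READINGS = {
    "\u6298": ["chệch", "chét", "chết", "chẹt", "chiết", "chít", "chịt",
               "díp", "gãy", "gẩy", "giẹp", "giết", "giỡn", "nhét", "nhít",
               "siết", "trét", "triếp", "xiết"],
    "\U0002025D": ["nghĩ", "nghỉ"],
    "\u4e09": ["tam"],
}


def is_polyphonic(char, readings=READINGS):
    """A character is polyphonic if it has more than one listed reading."""
    return len(readings.get(char, [])) > 1


def count_candidate_sequences(text, readings=READINGS):
    """Number of distinct reading sequences for a string of characters.

    Unseen characters count as a single pass-through candidate.
    """
    return prod(len(readings.get(ch, [ch])) for ch in text)


sample = "\u6298\U0002025D\u4e09"
n_candidates = count_candidate_sequences(sample)
print(n_candidates)
```

Because the counts multiply, even a short line with a few polyphonic characters has many candidate outputs, which is why a context-aware model is needed to rank them.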

III. RELATED WORKS
The digitization of Nôm scripts has been proposed and implemented since the 1990s by Ngo Thanh Nhan and Nguyen Quang Hong, among other scholars 2 . Thanks to these contributors, most of the common Nôm characters have become part of the Unicode encoding system. This significant work is a solid foundation for later digitization steps such as storage, lookup, processing, and automatic transliteration.
Moreover, Việt Hán Nôm 2002, a software package developed by Phan Anh Dung 3 , allows users to type and look up both Chinese and Nôm characters. Another package, Hanosoft, developed by Tong Phuoc Khai 4 , also includes several utilities for looking up and transliterating Chinese characters into Nôm characters. These packages provide tools to automatically transcribe Chinese characters into Sino-Vietnamese, Chinese characters into pinyin, and Nôm characters into national scripts. However, the central issue, which is choosing the proper national script for a given polyphonic Nôm character, has not yet been addressed: the software simply selects at random a Sino-Vietnamese phonetic transcript or a phonetic transcript among all possibilities. Besides, the website of the Vietnamese Nôm Preservation Foundation 5 includes a Chinese character-Nôm lookup tool and a digital library of Nôm documents, most of which are images of hand-written Nôm. Some literary works have also been digitized.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 2, 2021
The work most closely related to our study is the Nôm converter 6 , a toolkit used to automatically transliterate Nôm scripts to national scripts and vice versa. The system applies the Statistical Machine Translation (SMT) approach and is based on Moses [24]. The data used to train Moses are parallel corpora consisting of 22 manually transliterated texts with 3,234 lines in total. The tool works well, except when the input contains Nôm scripts that were not seen in training. In those cases, Nôm converter ignores the unseen scripts and transliterates the rest of the input as normal. Nôm converter has a rather high rate of choosing the correct national script when compared with reference transliterations produced by humans. To the best of our knowledge, it may be considered the first automatic Nôm-Vietnamese national script transliteration tool that utilizes machine learning technology. Our approach is similar to Nôm converter, but with modifications and improvements that address the limitations of the existing system.

IV. PROPOSED MODEL
In our proposed model, we customized a Statistical Machine Translation model (SMT) and improved the transliteration accuracy based on our work in automatic translation from English into Vietnamese [25]. Instead of following the Nôm converter system's approach in transliterating both directions (from Nôm scripts into Vietnamese national scripts and vice versa), we only focused on one-way transliteration from Nôm scripts into Vietnamese national scripts. Our core aim is to harness the Vietnamese ancient Nôm text, and the transliteration from national scripts to Nôm scripts does not imply as much practical significance. Besides, focusing on a one-way transliteration from Nôm scripts into national scripts allows us to invest more in improving national script output through various language models.
To overcome the shortage of parallel corpora for training as in the Nôm converter system, we added a Sino-Vietnamese dictionary into the phrase table of the Moses system. To improve the accuracy, we also added more manually transliterated literary works that Nôm converter has not yet included. Our major contribution is categorizing the Nôm script input data and providing language models for the Vietnamese national script output. The most challenging issue that we observed in transliterating Nôm script into Vietnamese national script was choosing the correct national script among all possibilities. This selection depends on context, form, domain, and even on the chronology of the input data. Nôm converter merely selects the national scripts according to the context in the training dataset, which is mixed in terms of form, domain, and time. Therefore, we classified the training dataset and language models by form and domain in our proposed model.
Because each form has its own rules for choosing the national script output, we classified the form into two categories: verse (such as Tale of Kieu, Tran Te Xuong's poems, etc.) and prose (The legend of Quynh, Biography of Phan Boi Chau, etc.). That is, these two forms required different language models. Besides, we also built corpora for three different domains: literature, history, and religion. New domains will be added to the current list as we construct and develop more corpora. Since each domain has its own terminology, determining the domain to which the input scripts belong helps us narrow down the set of possible national script outputs and raises the probability of selecting the correct national scripts, especially when the input contains polyphonic Nôm scripts.
The final step was to build language models in the target language, Vietnamese national scripts. The principle of machine learning is that the more training data we feed into the model, the better the transliteration accuracy becomes. For this reason, we not only used the national script side of the parallel corpora that trained Moses in the previous step, but also provided additional national script data that were already categorized by form and domain. This step improved the accuracy of the proposed model significantly, since we used hundreds of thousands of sentences to build language models, compared to only thousands of sentences in the parallel corpora. A larger dataset for the N-gram language models also allows the machine to generate more linguistically natural transliteration output.
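The role of the monolingual data can be illustrated with a very small bigram language model. This is a minimal sketch, not the language model toolkit used in the experiments: the three training sentences are invented for the example, add-one smoothing is an assumed choice, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over syllable-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-score of a sentence.

    Unseen words are not added to the vocabulary, so the value is only
    meant for comparing candidates, not as a normalized probability.
    """
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1)
                          / (unigrams[prev] + vocab_size))
    return total


# Invented monolingual national-script sentences (illustration only).
corpus = ["tôi nghĩ về quê", "anh nghĩ về mẹ", "tôi nghỉ ở nhà"]
uni, bi = train_bigram_lm(corpus)

score_nghi_think = log_prob("em nghĩ về nhà", uni, bi)
score_nghi_rest = log_prob("em nghỉ về nhà", uni, bi)
print(score_nghi_think > score_nghi_rest)
```

In this toy corpus the pair "nghĩ về" occurs twice and "nghỉ về" never occurs, so the first candidate receives the higher score. With more monolingual text, such counts become more reliable.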
Later, when our proposed model is put into use, users will be able to select from a menu the form and domain of the input data they want to transliterate. According to the users' selection, the system will apply the models that were trained for that form and domain.
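The selection step described above can be sketched as a simple lookup keyed by form and domain. This is only an illustration: the file names, the fallback to a general model, and the function name are assumptions made for this example and are not part of the described system.

```python
from typing import Dict, Tuple

# Forms and domains named in the text.
FORMS = ("verse", "prose")
DOMAINS = ("literature", "history", "religion")


def select_language_model(form: str, domain: str,
                          models: Dict[Tuple[str, str], str],
                          default: str) -> str:
    """Return the model registered for (form, domain), else the default."""
    if form not in FORMS:
        raise ValueError(f"unknown form: {form}")
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    return models.get((form, domain), default)


# Hypothetical registry of trained language model files.
MODELS = {
    ("verse", "literature"): "lm.verse.literature.arpa",
    ("prose", "history"): "lm.prose.history.arpa",
}

chosen = select_language_model("verse", "literature", MODELS, "lm.general.arpa")
fallback = select_language_model("prose", "religion", MODELS, "lm.general.arpa")
print(chosen, fallback)
```

Falling back to a general model when no specific model exists is one possible design. It keeps the system usable while the corpora for a given domain are still small.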
Let n be the source-language sentence (Nôm scripts) and q be the target-language sentence (Vietnamese national scripts). The SMT model is given by the following equation:

$$\hat{q} = \arg\max_{q} P(q \mid n) = \arg\max_{q} P(q)\, P(n \mid q) \tag{1}$$

According to equation (1), the SMT model works as follows: i) estimate the probability of seeing the target string q with the language model P(q); ii) estimate the probability that the source string n is the translation of the target string q with the translation model P(n|q); iii) choose the sentence q for which the product P(q)P(n|q) is maximal. Fig. 3 shows a complete statistical translation system.
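The three steps above can be sketched as a small dynamic-programming search over the candidate readings. This is a simplified, character-by-character illustration and not the Moses phrase-based decoder: the probabilities are invented, the source symbols "n1" and "n3" are placeholders for monophonic characters, and the function name is hypothetical.

```python
import math


def decode(nom_chars, channel, bigram, floor=1e-6):
    """Return the reading sequence q that maximizes P(q) * P(n | q).

    channel[n][q] approximates P(n | q) for one character;
    bigram[(prev, q)] approximates P(q | prev); missing bigrams get `floor`.
    """
    # best[q] = (log score of the best path ending in reading q, that path)
    best = {"<s>": (0.0, [])}
    for n in nom_chars:
        new_best = {}
        for q, p_n_given_q in channel[n].items():
            candidates = []
            for prev, (score, path) in best.items():
                lm = bigram.get((prev, q), floor)
                new_score = score + math.log(lm) + math.log(p_n_given_q)
                candidates.append((new_score, path + [q]))
            new_best[q] = max(candidates, key=lambda item: item[0])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Invented toy models (illustration only).
channel = {
    "n1": {"tôi": 1.0},
    "\U0002025D": {"nghĩ": 0.5, "nghỉ": 0.5},
    "n3": {"về": 1.0},
}
bigram = {
    ("<s>", "tôi"): 0.5,
    ("tôi", "nghĩ"): 0.3,
    ("tôi", "nghỉ"): 0.2,
    ("nghĩ", "về"): 0.4,
    ("nghỉ", "về"): 0.05,
}

best_path = decode(["n1", "\U0002025D", "n3"], channel, bigram)
print(" ".join(best_path))
```

Here the translation model alone cannot separate the two readings because both have probability 0.5. The language model term decides in favor of the reading that fits its neighbors.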

A. Experiments and Results
In this sub-section, we describe the training data and experimental results of the proposed model.
1) Training, testing, and tuning datasets: We used the single-character dictionaries listed in Table IV and the compound-character dictionaries listed in Table V. In single-character dictionaries, each entry is one morpho-syllable, such as một (one) or là (to be). Entries in compound-character dictionaries have at least two morpho-syllables, such as một vài (several) or chúc mừng năm mới (happy new year).
The Tale of Kieu (1902) dictionary in Table IV is not a publicly available dictionary. We observed that some Nôm characters in Tale of Kieu are not included in the other six single-character dictionaries. Therefore, we manually created the Tale of Kieu (1902) dictionary by listing all distinct pairs of Nôm and Vietnamese national scripts. We then used a computer program to aggregate all dictionaries listed in Table IV and Table V.
The parallel texts in Table VI contain 7,920 lines in total. Currently, we do not use the domain information because the majority of the dataset belongs to the Literature domain. Classification of data is only useful if we have a considerably large corpus covering various domains. Although we are not using the category information at the moment, we still include it in the program as a foundation for future work. We also collected corpora written in Vietnamese national scripts from the websites Gác Sách 11 , Sách Phật giáo 12 , and Ô Cửa Sổ 13 . The domain and form of the monolingual corpora are listed in Table VII. The training, tuning, and testing datasets are split in the ratio 8:1:1 as follows: for each text in Table VI, roughly 8/10 goes to the training set, 1/10 to the tuning set, and 1/10 to the testing set.
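The per-text split described above can be sketched as follows. This is a minimal illustration under assumptions: the text does not state whether lines are taken contiguously or sampled, so contiguous slices are assumed here, and the function names and the toy corpus are hypothetical.

```python
def split_text(lines, train=8, tune=1, test=1):
    """Split one text's parallel lines into training, tuning and testing parts.

    Assumption: parts are contiguous slices; rounding leftovers go to training.
    """
    total = train + tune + test
    n_test = len(lines) * test // total
    n_tune = len(lines) * tune // total
    n_train = len(lines) - n_tune - n_test
    return (lines[:n_train],
            lines[n_train:n_train + n_tune],
            lines[n_train + n_tune:])


def split_corpus(texts):
    """Apply the split to every text and concatenate the three parts."""
    train_set, tune_set, test_set = [], [], []
    for lines in texts.values():
        tr, tu, te = split_text(lines)
        train_set += tr
        tune_set += tu
        test_set += te
    return train_set, tune_set, test_set


# Hypothetical corpus: line identifiers stand in for parallel sentence pairs.
texts = {
    "text_a": [f"a{i}" for i in range(100)],
    "text_b": [f"b{i}" for i in range(25)],
}
train_set, tune_set, test_set = split_corpus(texts)
print(len(train_set), len(tune_set), len(test_set))
```

Splitting each text separately keeps every text represented in all three sets. This also means the test set resembles the training set, a point the cross-validation experiment later addresses.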
2) Experimental results: Using the Moses SMT system [24], we conducted experiments on the corpora previously discussed and obtained the results in Table VIII. In Exp2, the data were the same as in Exp1, but the model was tuned for better transliteration quality. The resulting BLEU score was 14.56, slightly better than the 13.32 obtained in Exp1.
In Exp3, we measured the impact of the language model by adding 370,817 lines of Vietnamese national script to train the model instead of using the default national scripts extracted from the parallel corpus. This made a significant difference, as the BLEU score increased from 13.32 to 63.89. This was because the language model supported phrase-based translation and provided context for the transliteration model to choose the most likely national script for a given Nôm script.
In Exp4, we tuned the model from Exp3, and the BLEU score increased from 63.89 to 65.94.
In Exp5, we added 6,205 entries of compound dictionaries, growing the parallel corpus compared to that of Exp1. Consequently, the BLEU score increased from 13.32 to 36.82.
In Exp6, we tuned the model from Exp5, and the BLEU score increased from 36.82 to 44.24.
In Exp7, we added 370,817 lines of Vietnamese national script to train the language model. Compared to the results in Exp5, the BLEU score in this experiment increased from 36.82 to 67.19.
In Exp8, we tuned the model from Exp7, and the BLEU score increased from 67.19 to 69.16.
In Exp9, we added 6,348 pairs to the parallel corpus. The BLEU score increased to 80.50, compared to 13.32 in Exp1 and 36.82 in Exp5.
In Exp10, we tuned the model from Exp9, and the BLEU score increased from 80.50 to 80.83, which was not a considerable difference.
In Exp11, we added 370,817 lines of Vietnamese national script to train language models. Compared to Exp9, the BLEU score increased from 80.50 to 82.30. In this case, since we already had parallel corpus with long sentences of national script, adding a language model corpus did not yield a significant difference as in Exp3 and Exp7.
In Exp12, we tuned the model from Exp11, and the BLEU score increased from 82.30 to 85.38.
In the last four experiments, from Exp13 to Exp16, we used the parallel corpus without dictionaries to train the model and got acceptable results. However, there was similarity between the training corpus and the testing corpus, so the BLEU score was quite high, ranging from 75.71 to 79.40.
We hypothesized that if the testing data contain Nôm scripts from other domains, such as medicine or agriculture, that are not yet in the training data set, then the models with dictionaries will work better. This was based on the assumption that, although they lack context, dictionaries cover a broader scope of vocabulary. That missing context could be made up by the additional language model as in Exp3 and Exp4, where the BLEU scores were 63.89 and 65.94 respectively. Those results were acceptable given that we trained the model only with dictionaries and additional language model data, without any parallel sentences. At the moment, we do not have data to verify our hypothesis. We leave this verification to future work, when we will have collected data from various other domains.
The corpora we used to train and test the transliteration system included single-character dictionaries, compound-character dictionaries, and parallel pairs of Nôm-Vietnamese national script sentences, whose model was trained by dictionaries. Parallel sentences were separated with a train:tune:test ratio of 8:1:1 (tune here refers to the data used to tune the model, that is, to find the optimal parameters for the transliteration model).
In the third column of Table VIII, "Default" refers to the monolingual national scripts extracted from the parallel corpus in the second column. In experiments with the "Default" monolingual corpus, we did not use the additional language model corpora in Table VII. In the fourth column of Table VIII, BLEU stands for Bilingual Evaluation Understudy, a metric that measures the quality of machine translation output in comparison to human-generated output. The BLEU score is reported in the format overall, uni-gram/2-gram/3-gram/4-gram. The fifth column indicates whether an experiment was tuned or not. As mentioned previously, the purpose of tuning is to find optimal parameters for the transliteration model and thereby generate better transliteration output than the untuned model. After training the model, we chose 10 percent of the sentences in the testing data to evaluate the proposed model. Exp12 yielded the highest BLEU score, which was 85.38. Therefore, we selected some sentences in the testing set of this experiment to compare with the corresponding output generated by Nôm converter. Twelve sentences from Tale of Kieu (version 1902) were tested, and the results are presented in Table IX.
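To show how a score in the format overall, uni-gram/2-gram/3-gram/4-gram arises, the following Python sketch computes a corpus-level BLEU with clipped n-gram precisions and a brevity penalty. It is a simplified illustration with one reference per sentence and no smoothing, not the evaluation script used for Table VIII. The example sentences are invented and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis.

    Returns (overall, [p1, ..., p_max_n]) on a 0-100 scale.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g])
                                  for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if hyp_len == 0 or min(precisions) == 0.0:
        overall = 0.0
    else:
        # Brevity penalty: penalize output shorter than the reference.
        brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
        log_mean = sum(math.log(p) for p in precisions) / max_n
        overall = brevity * math.exp(log_mean)
    return 100 * overall, [100 * p for p in precisions]


# Invented example: the hypothesis differs from the reference in one syllable.
overall, precisions = bleu(["tôi nghĩ về quê cũ"], ["tôi nghĩ về quê nhà"])
print(round(overall, 2), "/".join(f"{p:.1f}" for p in precisions))
```

A single wrong syllable lowers every n-gram precision that spans it. This is why BLEU rewards outputs that get whole phrases right rather than isolated syllables.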
We use different typefaces to distinguish between correct and incorrect transliteration. The differences are explained as follows:
• Compared transliteration
The BLEU score in Exp12.2.1 was the highest. However, the way we previously separated the data into training and testing sets might bias the results because of the similarity between the two sets. That is, it may not be realistic to distribute each poem (text) across both the training and testing sets with the ratio of 8:1, because in real-world situations users might want to transliterate an unseen Nôm text that is completely different from those used to train our model. Therefore, we applied k-fold cross-validation to better evaluate the performance of our proposed model. There are 7,920 lines of parallel Nôm-Vietnamese national scripts in total. We shuffled the data and distributed the parallel text equally into 10 folds (parts), so that each fold contained 792 pairs of sentences. Based on the experimental results presented in Table VIII, we observed that the Exp12.2.1 set-up generated the highest transliteration quality. Consequently, we implemented k-fold cross-validation using dictionaries and an additional language model for model training, as in Exp12.2.1. The 10 folds were then distributed into training, tuning, and testing sets as follows: eight folds for training the model, one fold for tuning, and the remaining fold for testing. This time, the data used for language model training was slightly different from that of Exp12.2.1: in addition to the data used in Exp12.2.1, we also extracted the national scripts from the eight training folds of the parallel corpus to feed more data into the language model and thereby improve the transliteration quality.
We conducted the experiment 10 times, with the corresponding BLEU evaluations presented in Table X. The cross-validation outcome indicates that the results in Table VIII were relatively fair, as they were not biased due to the data distribution among the training set and the testing set.
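The fold construction and rotation can be sketched as follows. This is an illustration under assumptions: the text does not say which fold serves for tuning in each round, so the fold following the test fold is assumed here, and the random seed and function names are hypothetical.

```python
import random


def make_folds(pairs, k=10, seed=0):
    """Shuffle the sentence pairs and cut them into k equal folds.

    Any remainder is dropped; 7,920 pairs divide evenly into 10 folds.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    fold_size = len(shuffled) // k
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(k)]


def rotations(folds):
    """Yield (train, tune, test) per round: one test fold, one tuning fold
    (assumed to be the next fold), and the remaining folds for training."""
    k = len(folds)
    for i in range(k):
        tune_index = (i + 1) % k
        train = [pair
                 for j, fold in enumerate(folds)
                 if j not in (i, tune_index)
                 for pair in fold]
        yield train, folds[tune_index], folds[i]


# Integers stand in for the 7,920 parallel sentence pairs.
folds = make_folds(range(7920))
rounds = list(rotations(folds))
print(len(rounds), len(rounds[0][0]), len(rounds[0][1]), len(rounds[0][2]))
```

In each round, the national-script side of the eight training folds could additionally be appended to the language model data, as described above. The test fold itself is never used for that purpose.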

B. Limitations and Development Directions
Based on the test results presented in Section V-A, we observe that our proposed model still has limitations in choosing the correct national script for a given Nôm script input. While our goal of resolving this difficulty remains, a 100-percent accurate transliteration output is unlikely to be attained, since even humans argue over which national script should be used for a given Nôm character.
To overcome the aforementioned limitations, we will continue to collect and build a larger parallel corpus for the translation model as well as a monolingual corpus for language models. We will also categorize input data into domains to improve the transliteration quality. We will keep collecting corpora from some other domains such as medicine and ideology. In addition, we will conduct more experiments and train our proposed model with new machine learning models.

VI. CONCLUSION
In this paper, we have presented automatic transliteration from Nôm scripts into Vietnamese national scripts using the SMT paradigm in computational linguistics. Our proposed model demonstrates significant improvements over the existing transliteration system, Nôm converter. Not only does the model recognize a broader range of Nôm scripts, it also chooses the national script for a given Nôm character with higher accuracy according to the context of the input Nôm scripts. Our analysis of the distinct characteristics of the Nôm and Vietnamese national script pair, together with our contribution of a separate corpus for the language model in addition to the default language model extracted from the parallel corpus, leads to a high result with the SMT approach.
In the future, we will build domain-specific language models and integrate linguistic knowledge to improve transliteration accuracy. Moreover, we can conduct manual post-editing to introduce further improvement. Our proposed model, therefore, will be able to generate more accurate transliteration results. This automatic transliteration system will bridge the gap between our past and our present by reconciling the differences between our two writing systems, the historical Nôm scripts and our current national scripts. Thanks to this system, the priceless treasure of our ancestors in history, literature, religion, geography, and traditional medicine can be explored and harnessed effectively. Scholars will be able to browse and understand the main ideas of a Nôm text without having to invest an immense amount of time in working manually on the ancient scripts.