Factored Phrase-Based Statistical Machine Pre-training with Extended Transformers

This paper presents a cascaded hybrid multilingual automatic translation system that tightly couples the two underlying research approaches in machine translation, namely the neural (deterministic) approach and the statistical (probabilistic) approach, while taking full advantage of each method to improve translation performance. The architecture addresses two major problems that frequently occur when dealing with morphologically rich languages in MT: the large number of unknown tokens generated by out-of-vocabulary (OOV) words, and the size of the output vocabulary. Additionally, we incorporate factors (additional word-level linguistic information) to alleviate the data sparseness problem and potentially reduce language ambiguity; the factors we consider are lemmas and part-of-speech tags (taking into consideration their various compounds). We combine a fully-factored transformer and a factored PB-SMT (FPB-SMT) system: the training data is pre-translated using the trained fully-factored transformer and then used to build the FPB-SMT system, while the pre-translated development set is used to tune its parameters. Finally, to produce the desired results, the FPB-SMT system re-decodes the pre-translated test set in a post-processing step. Experiments on Japanese-to-English and English-to-Japanese translation reveal that our proposed cascaded hybrid framework outperforms the strong HMT state of the art by over 8.61% BLEU and 7.25% BLEU, respectively, on the validation set, and by over 8.70% BLEU and 7.70% BLEU, respectively, on the test set.
Keywords: Machine translation; transformer; statistical machine; morphologically rich; hybrid


I. INTRODUCTION
Machine translation has seen an improvement in state-of-the-art performance with the introduction of Transformers [1], a new paradigm in Neural Machine Translation (NMT) [2] [3] powered by sequence-to-sequence learning frameworks, which has since rivaled the factored statistical machine translation paradigm [4] that had achieved the state of the art among SMT frameworks [5] [6]. However, the fundamental design of NMT models requires them to observe several instances of a word in multiple examples in order to learn a reliable input representation of it, and they limit the input and output vocabulary sizes to control computational complexity, which eventually causes coverage issues. Both properties greatly affect translation performance on rare or OOV (out-of-vocabulary) words (those neither included in the vocabulary nor seen in the training data set, and therefore mapped to an UNK token as unknown words) for languages that are morphologically rich and low-resourced (such as Cameroonian local languages and some well-known national languages, namely Arabic, Czech, German, Italian and Turkish). Although NMT produces fluent translations in most cases, it faces challenges in modeling the deeper syntactic and semantic aspects of languages.
As such, for low-resource (small-corpus) and morphologically rich language conditions, incorporating various linguistic annotations alongside the surface-level words was found to resolve semantic ambiguities and data sparseness, leading to better translation of rare or OOV words and greater generalization capacity, as illustrated by [4], who addressed this issue for the traditional SMT architecture [7] by proposing the factored translation model. These linguistic annotations, or factors, include features such as lemmas, stems, morphological classes, roots, data-driven clusters, parts of speech, constituency parses and compounds. With the aim of alleviating data sparseness and reducing language ambiguity, such extra features may be of enormous benefit when added to both NMT and phrase-based SMT frameworks.
The aim of improving translation performance has also inspired much research on combining the NMT and SMT paradigms [8] [9] [10] [11] in order to take full advantage of each system's strengths, and thereby overcome the meaningless translations (those whose meaning is totally different from that of the source sentence) and limited vocabulary size usually faced by pure NMT models, despite their strong language modeling capacity. By contrast, the hard word alignment technique of PBSMT models reflects the adequacy of source sentences extremely well, which helps to some extent to restore the meaning of the source sentence whenever wrong translations are produced. The framework proposed by [12] is very close to our work in its global context and overall architecture, but unlike theirs, ours integrates the best-performing paradigms in both the NMT and PBSMT frameworks, namely Factored Transformers and Factored SMT, respectively. We also use linguistic features that take compounds into consideration at both the NMT level (augmenting its embedding layer so as to learn various compositional input representations at different levels of granularity) and the SMT level, and finally we propose a novel UNK replacement algorithm. Our experimental findings reveal that our hybrid model provides consistently and significantly better translation quality than the state-of-the-art hybrid translation models for morphologically rich and low-resourced languages when it comes across rare and unknown words.
This paper is organized as follows. A literature review is given in Section 2. We discuss the factorization process with the integration of compounds in Section 3. In Section 4, we describe in detail the transformer operation with the incorporation of linguistic factors. Section 5 details our proposed neural hybrid MT framework. In Section 6, the results of two sets of experiments on the Japanese-to-English and English-to-Japanese tasks are reported, measured by BLEU score. Finally, in Section 7 we summarize our findings and outline future plans.
www.ijacsa.thesai.org

II. ANALOGOUS RESEARCHES
Many researchers have explored hybrid MT systems that combine different modules, paradigms, resources and approaches. To produce translations of publishable quality, repetitive errors have to be corrected through various automatic or semi-automatic post-processing techniques, and human post-editing usually still has to be performed on the resulting MT output [13] [14]. Although human post-editing (PE) of MT output is needed, it is more often cheaper and faster than translating from scratch. The authors in [15], [16] and [17] showed that in some cases productivity can be increased and that the quality of MT plus PE can exceed the quality of human translation. Moreover, the PE process needs to be further optimized to make the use of MT time-saving and cost-effective [13].
The authors in [18] and [19] introduced the idea of linearly combining machine translation systems based on different paradigms, which has been applied successfully to SMT and rule-based MT (RBMT). In this setup, the systematic errors produced by the RBMT system are corrected by an automatic PE (APE) system based on PB-SMT, which reduces the post-editing effort. For translation into a morphologically rich language, a rule-based approach (20 hand-written rules) to APE of English-Czech MT output at the morphological level was applied by [20] and [21], based on the most frequent errors encountered in translation. This approach efficiently corrects morphosyntactic categories of words such as case, number, person and gender, as well as dependency labels. Intuitively, one useful way to improve APE performance is to integrate source-language information into APE. The author in [22] proposed a pipeline to overcome data sparsity issues, in which the best pruned phrase table and language model are selected through task-specific dense features. They also found that consistent improvements can be obtained in all language pairs by including source-language information in statistical APE. The author in [23] considered the potential links of individual alignment occurrences and used an arbitrary number of alignments generated by different models (including both a refined model and minimum Bayes risk based models), constructing weighted alignment matrices over the 1-best alignments from multiple alignments [24] [25], rather than combining exactly two bidirectional alignments as proposed by [26] and [27]. The work presented in [28] was motivated by the fact that word alignment quality is constrained by word-alignment-based reordering of source words; its principal objective was to produce monotone source and target chunk alignments by reordering source chunks.
We argue that the problem of long-range reordering can be reduced to short-range, intra-chunk reordering by obtaining monotone chunk associations from monotone word alignments while preserving some source-language syntax. This assumption is founded on the observation that human translators much prefer to translate at the chunk level rather than at the word level.
Translation outputs produced by an SMT system have also been either re-ranked in a post-processing step using NMT [29] [30] [31] [32] [33], or used to build an NMT system [10]. Other scenarios involve re-ranking the translation outputs of an NMT system in a post-processing step using SMT [12], or guiding NMT translation by integrating an SMT system into it, which yielded significant translation quality improvements on Chinese-English translation tasks [9] [34]. In [34], an NMT architecture is trained end to end: at each NMT decoding step, the SMT system offers additional word recommendations based on the decoding information, which are scored by an auxiliary classifier, and a gating function combines the SMT recommendations with the NMT generations, with all components trained jointly.
The aforementioned attempts to improve MT system performance still did not properly handle the issues faced by morphologically rich and low-resourced languages, or long-term dependency modelling. We argue that, in order to limit the vocabulary size, words could equally be split into sub-word units as proposed in [35]. Lexical probabilities could also be integrated into the NMT system, as successfully investigated in [36]. Another way to achieve more monotone translation could be to exploit pre-reordering, as experimented with in [37]. Last but not least, the NMT translation of rare words could be improved in a post-processing step, as suggested in [38].

III. WORD COMPOUNDING AND FACTORIZATION
Factored models are mainly used to reduce the rate of out-of-vocabulary (OOV) occurrences and the amount of bilingual data needed when processing morphologically rich languages. Factorization consists of extracting from a given word linguistic information (factors) such as dependency information, syntactic information, part-of-speech tags and lemma, using Tree-Tagger [39], and integrating it as a vector into a translation system. Machine translation from one morphologically rich language to another is a tedious task, especially when the required morphological information is lacking on the source side and an exact target-language word form has to be produced. Word compounding is highly productive in such languages [27] [40] [41]; it leads to sparse data problems and increases the vocabulary size. Integrating compound processing in the preprocessing phase has therefore proven useful for adding extra morphological information to the linguistic/morphological factors of the source and target languages.
Compounding is handled at the level of the POS tag: minimized part-of-speech tags are produced by refining the POS tags from the Tree-Tagger with a dependency parser, which adds morphological information (such as gender, number, case, person and definiteness) for nouns, verbs, pronouns, determiners and adjectives, provided that both tools agree on the POS tag. In case of disagreement, the Tree-Tagger POS tags were chosen. Compounds in morphologically rich languages are formed by joining words, possibly inserting filler letters (for example -s, -en, -er, -ien) or removing letters (for example -en, -n) from the end of all but the last word of the compound [42].

A. Compound Splitting
The language model data for the morphologically rich language is POS-tagged and used to compute the frequencies of adverbs, adjectives, negative particles, verbs and nouns. Then, using an adjusted version of the corpus-based method proposed by [27], each adjective and noun is split into known words from the corpus, while permitting filler additions and truncations. Proper names are not split, since translating them in parts would give rise to errors. Because compound parts often contain the base form, lemmas are also used to calculate word frequencies in addition to surface forms. Candidate splits are scored by the arithmetic mean of the frequencies of their parts, which yields more splits than the geometric mean, and the split with the highest arithmetic mean is selected. The length of each compound part was limited to 4 characters, the number of parts for adjectives in particular was restricted to ≤ 2, and only words of length ≥ 7 were split.
All compound parts but the last were marked with the symbol # so that they are handled as separate words.
Special POS tags are assigned to the parts of split words based on the POS of the compound's last word, with the full word and the last part receiving the same POS. Finally, words containing hyphens are split using the same algorithm, and different POS tags are assigned to their parts, with hyphens left at the end of all but the last part. Factorization with compound splitting is integrated as a pre-processing step for training and translation in both the Transformer framework and the phrase-based statistical machine translation framework.
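The frequency-based splitting step can be illustrated with a minimal sketch. It is not the authors' implementation: the filler list, the reading of the 4-character limit as a minimum part length, the default of three parts, and the function names are all assumptions made for illustration, and truncations, lemma frequencies and POS filtering are omitted.

```python
# Fillers that may follow a non-final part; "" means no filler (assumed subset).
FILLERS = ("", "s", "en", "er")


def candidate_splits(word, freq, min_part=4, max_parts=3):
    """Yield every way to cover `word` with at most `max_parts` known parts."""
    if max_parts < 1:
        return
    if len(word) >= min_part and word in freq:
        yield [word]
    for cut in range(min_part, len(word) - min_part + 1):
        head, tail = word[:cut], word[cut:]
        for filler in FILLERS:
            if not head.endswith(filler):
                continue
            # Strip the filler so the remaining stem can be looked up as a word.
            stem = head[:len(head) - len(filler)]
            if len(stem) < min_part or stem not in freq:
                continue
            for rest in candidate_splits(tail, freq, min_part, max_parts - 1):
                yield [stem] + rest


def split_compound(word, freq, min_word_len=7, min_part=4, max_parts=3):
    """Return the split with the highest arithmetic mean of part frequencies."""
    if len(word) < min_word_len:
        return [word]
    best, best_score = [word], freq.get(word, 0)
    for parts in candidate_splits(word, freq, min_part, max_parts):
        score = sum(freq[p] for p in parts) / len(parts)
        if score > best_score:
            best, best_score = parts, score
    return best


def mark_parts(parts):
    """Mark all parts but the last with '#', as described in the text."""
    return [p + "#" for p in parts[:-1]] + parts[-1:]
```

With a toy frequency table such as `{"akten": 50, "koffer": 100, "aktenkoffer": 2}`, `split_compound("aktenkoffer", freq)` selects the two-part split because its mean frequency (75) exceeds that of the unsplit word (2). Passing `max_parts=2` would reflect the restriction stated for adjectives.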

B. Compound Merging
For translation into the morphologically rich language, the split compounds are merged based on POS in a post-processing step applied to the outputs of both the Transformer framework and the phrase-based statistical machine translation framework. If a word has a compound POS and the following word has a matching POS, the two are merged. Otherwise, if the next POS does not match, a hyphen is added to the word, which allows coordinated compounds to be handled.
We used the merging algorithm proposed by [41], based on [40]; its advantage is that unseen compounds can be merged and coordinated compounds handled.
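The POS-based merging rule can be sketched as follows. This is an illustrative approximation rather than the algorithm of [41]: the tag scheme (a hypothetical `-PART` suffix marking compound parts, with the base tag as the matching POS) and the function name are assumptions.

```python
def merge_compounds(tagged, part_suffix="-PART"):
    """Merge split compounds in a list of (word, pos) pairs.

    A word whose POS ends in `part_suffix` (hypothetical tag scheme) is a
    compound part. It is joined to the next word when that word carries the
    matching base POS or the same part POS; otherwise it receives a hyphen,
    which covers coordinated compounds.
    """
    merged = []
    pending = ""  # compound parts collected so far, waiting for their head
    for i, (word, pos) in enumerate(tagged):
        if pos.endswith(part_suffix):
            base = pos[:-len(part_suffix)]
            next_pos = tagged[i + 1][1] if i + 1 < len(tagged) else None
            part = word.rstrip("#")  # drop the split marker
            if next_pos in (base, pos):
                pending += part
            else:
                merged.append(pending + part + "-")
                pending = ""
        else:
            merged.append(pending + word)
            pending = ""
    return merged
```

For example, `[("akten#", "N-PART"), ("koffer", "N")]` is merged into a single word, while a part followed by a conjunction is emitted with a trailing hyphen.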

IV. INCORPORATING LINGUISTIC FACTORS INTO THE TRANSFORMER
Our principal innovation over the standard encoder-decoder Transformer architecture is that we express the encoder input and decoder output as a combination of features, as in [43] [44] [45]. Our generalized model supports an arbitrary number of input features.
In this paper we focus on a number of well-known linguistic features, with the empirical question being to what extent providing linguistic features to both the encoder and the decoder improves translation quality, especially for morphologically rich languages, when using the transformer paradigm.
In order to better integrate linguistic factors into our NMT framework, we extended the Transformer architecture proposed by [1], which employs multiple stacked layers in an encoder-decoder structure. Each encoder layer consists of two sub-layers: a self-attention sub-layer followed by a position-wise feed-forward sub-layer. The decoder is similar to the encoder but has an additional sub-layer that attends over the encoder output, and masking in its self-attention prevents information about future output positions from being incorporated at a given output position during training. For all positions in a sequence, the transformer model computes attention scores using each position's input representation as the query. A weighted average of the input representations is then computed using these attention scores. More generally, attention maps a query and a set of key/value vector pairs to an output. Our work thus extends [1] by integrating additional linguistic factors.
Suppose we have $K$ layers of annotations for linguistic factors and $M$ parallel training sentences, where the word sequence of the $m$-th sentence pair is denoted in layer zero as $x^{(m,0)}$ and its length as $|x^{m}|$. The annotations of its $K$ layers are denoted by $\{x^{(m,k)}\}_{k=1}^{K}$, and the target sentence by $y^{(m,k)}$. In other words, for each feature we look up a separate embedding vector, and we concatenate these vectors. The length of the concatenated vector matches the total embedding size, and the internal structure of the transformer's encoder and decoder is unchanged. Under this setting, our extended encoder-decoder Transformer operates as follows. Each attention head operates on an input sequence $x = (x_1, \dots, x_n)$ of $n$ elements, where $x_i \in \mathbb{R}^{d_x}$, and computes a new sequence $z = (z_1, \dots, z_n)$ of the same length, where $z_i \in \mathbb{R}^{d_z}$.
Each output element $z_i$ is computed as a weighted sum of linearly transformed input elements [46]:
$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V)$$
Each weight coefficient $\alpha_{ij}$ is computed using a softmax function:
$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{l=1}^{n} \exp e_{il}}$$
and $e_{ij}$ is computed using a compatibility function that compares two input elements:
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{\top}}{\sqrt{d_z}}$$
To enable efficient computation, a scaled dot product was chosen as the compatibility function. The parameter matrices $W^Q, W^K, W^V \in \mathbb{R}^{d_x \times d_z}$ are unique per layer.
In multi-headed self-attention, the input representations are first linearly mapped to lower-dimensional spaces, and the output of one multi-headed self-attention layer is formed by concatenating the output vectors of several attention mechanisms (each attention mechanism being referred to as a head). In the first self-attention layer, the vector $\vec{h}_i$ for position $i$ of a single attention head is thus computed from the concatenated factor embeddings at that position.
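As an illustration of how concatenated factor embeddings feed a single attention head, the following pure-Python sketch looks up one embedding per factor, concatenates them, and applies scaled dot-product self-attention with query, key and value projections. The function names, the toy embedding tables and the projection matrices are assumptions for illustration only; a real system would use a tensor library and learned parameters.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def embed_factors(tokens, tables):
    """Concatenate one embedding per factor for every token.

    tokens: list of tuples of factor ids, e.g. (lemma_id, pos_id).
    tables: one embedding table (list of vectors) per factor.
    """
    return [[v for k, idx in enumerate(tok) for v in tables[k][idx]] for tok in tokens]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; returns (z, alpha)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_z = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    alpha = [softmax([s / math.sqrt(d_z) for s in row]) for row in scores]
    return matmul(alpha, v), alpha
```

Because the factor embeddings are concatenated before the projections, the attention computation itself is unchanged; only the width of the input vectors depends on the number and size of the factors.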

A. Beam Search Integration with Factors
We extended the beam search procedure to find the best sequences when dealing with multiple word features (outputs). For simplicity, one beam is responsible for generating lemmas and another for generating the concatenation of the different factors. We performed the grammatical and morphological analysis with a toolkit such as MACAON [47] or the more specialized KyTea [48]. The lemma and factors of each word are output by the MACAON/KyTea POS tagger [49], taking the context into consideration. The lemmas and factors are generated in synchronous streams across the outputs, which can lead to sequences of different lengths, each ending after the generation of the <eos> (end-of-sequence) symbol, and thus to multiple representations of the <eos> symbol in an output word. Because lemmas carry most of the meaning and are closer to the final objective, we constrained the length of the factor sequence to be equal to that of the lemma sequence. That is, when the generation of the lemma sequence has ended, we stop generating factors and ignore their <eos> symbol, thereby avoiding both longer and shorter factor sequences.
To generate the next word in the sequence, the feedback (the previous word) is used, taking into consideration its various features (outputs). To obtain the full benefit of both feedback outputs, we apply a tanh (non-linear) transformation to the concatenation of both embeddings, so that the model has more information and learns better from their combination:
$$fb(y_{t-1}) = \tanh\left(W_{fb}\,[\,y^{L}_{t-1};\, y^{F}_{t-1}\,]\right)$$
where $fb(y_{t-1})$ is the feedback computed from the previous output $y_{t-1}$, $W_{fb}$ are trained weights, and $y^{L}_{t-1}$ and $y^{F}_{t-1}$ are the embeddings of the lemma and factors generated at the previous time step, respectively.
Finally, for each partial hypothesis we take the cross product of the output spaces of the best generated lemma and factor hypotheses, thus associating each factor hypothesis with each lemma hypothesis. With $k$ as the beam size, the $k$-best combinations are kept for each sample. To obtain the word candidate from the lemma and its factors, the MACAON toolkit was used. When named entities are processed and no factors are found, the system outputs the lemma.
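The cross-product step of the beam search can be sketched as below. The scoring rule (summing the log-probabilities of the lemma and factor hypotheses) and the function name are assumptions for illustration; the text does not state how combined hypotheses are scored.

```python
import heapq
from itertools import product


def combine_hypotheses(lemma_hyps, factor_hyps, k):
    """Pair every lemma hypothesis with every factor hypothesis, keep the k best.

    Each hypothesis is a (token, log_prob) pair. The combined score is the sum
    of the two log-probabilities, which is an assumed scoring rule.
    """
    pairs = (
        (lemma, factors, lemma_lp + factor_lp)
        for (lemma, lemma_lp), (factors, factor_lp) in product(lemma_hyps, factor_hyps)
    )
    return heapq.nlargest(k, pairs, key=lambda item: item[2])
```

With two lemma hypotheses and two factor hypotheses and `k = 2`, four combinations are scored and only the two highest-scoring ones survive to the next decoding step.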

V. HYBRID MACHINE TRANSLATION SUCCESSION
Although the translations produced by NMT are more fluent than those of SMT, NMT still does not exploit the source information as fully and explicitly as SMT. It thus sometimes generates translations that are quite different from the original meaning of the source sentence [50], and at other times mistakenly ignores some source words during translation while translating others repeatedly [51].
If we regard the translation output by the NMT system as an "intermediate language" (another language), we may to some extent amend the duplicated and meaningless translations by building a translation model and performing word alignment with an SMT system.
We therefore propose a factored multi-engine hybrid MT system consisting of an NMT and an SMT framework, illustrated in Fig. 1.
First, a preprocessing phase is performed by the transformer: the transformer system is trained on the initial factored training data and is then used to translate the training data, development set and test set into factored pre-translations. Second, a target-target SMT system is built using the factored pre-translated training data, with parameters tuned on the pre-translated development set. Finally, the desired output is produced by decoding the pre-translated test set with the tuned SMT system. When the transformer performs the pre-translations, any OOV word in the source sentence causes it to generate an 'UNK' token when translating the training data, development set and test set. We therefore propose a simple and efficient technique to replace the 'UNK' token in the translated sentence by the corresponding source word. This method, which we call the "labeled UNK replacement algorithm", alleviates the weaknesses of the UNK replacement algorithm proposed by [12] and inspired by [52]. The technique is presented in Algorithm 1. Our algorithm simply traverses the translation and replaces each UNK token it encounters with its corresponding source word (the key at that position), if the source word does not exist in the vocabulary. Reference [12] proposed a naïve algorithm for UNK replacement, whose weakness is that the source and target sentences may have different word orders, which leads to wrong replacements.

Algorithm 1 Labelled UNK replacement by source words
Require: The translation e with UNK tokens from the Transformer, where e is an array of key-value pairs in which each key is a source word and the value is the corresponding translation or the UNK token. For example, for the French sentence "un chat noir" to be translated into English, e pairs each of the source words "un", "chat" and "noir" with its English translation or with UNK.
Each UNK value in e is then replaced by its key, the corresponding source word.
Finally, to post-process these unknown words, instead of using a back-off dictionary [53], we use a factored phrase-based SMT system, which takes more context into account, to perform the desired translation. In the factored PB-SMT system, we used the Moses toolkit [7] for training and decoding, SRILM [46] to train a 5-gram language model, Giza++ [54] to create word alignments, and MERT (Minimum Error Rate Training) [26] to tune the feature weights.
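The labeled UNK replacement of Algorithm 1 can be sketched in a few lines. This follows the prose description only (traverse the labeled translation and substitute the source word for each UNK whose source word is not in the vocabulary); the function name, the data layout and the example pairs are hypothetical.

```python
UNK = "UNK"


def replace_unk(labelled_translation, vocabulary):
    """Replace each UNK token by the source word it is labelled with.

    labelled_translation: list of (source_word, target_token) pairs, in the
    order of the translation output (assumed layout of the array e).
    vocabulary: set of source words known to the translation model.
    """
    output = []
    for source_word, target_token in labelled_translation:
        if target_token == UNK and source_word not in vocabulary:
            output.append(source_word)  # copy the OOV source word through
        else:
            output.append(target_token)
    return output


# Hypothetical labelling for "un chat noir" in which "chat" is out of vocabulary.
example = [("un", "a"), ("noir", "black"), ("chat", UNK)]
result = replace_unk(example, vocabulary={"un", "noir"})
```

Because each target token carries its own source word, the replacement does not depend on the source and target sentences sharing the same word order. The copied source words are then left for the factored PB-SMT system to translate.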

VI. EXPERIMENTS
To verify our proposed framework, we chose translation between Japanese and English, noting that Japanese differs drastically from English in word order and has a far richer grammatical structure.
For a fair comparison, we re-implemented the hybrid frameworks proposed by [10], and trained our models on a machine with 8 NVIDIA P100 GPUs.

A. Datasets and Setup
As training data for the JP-EN translation task, we used Part 1 of the JP-EN Scientific Paper Abstract Corpus (ASPEC-JE), which contains 1M sentence pairs, together with the 1,790 sentence pairs of the development/validation set and the 1,812 sentence pairs of the test set [55]. Each source sentence in the validation and test sets has only one reference.
For factorization at both the Transformer and the PB-SMT level, we used lemmas and POS tags with compounds (as explained in Section 3 above) as input and output features, which can be produced using either the MACAON toolkit [47] or the more specialized KyTea [48], especially for the Japanese data.
Because unknown words cannot be generated when using Byte-Pair Encoding (BPE) [56], since all words are encoded as BPE units, we keep words as translation units. In addition, generating BPE units sometimes produces incorrect words during the final word-level processing, and thus does not lead to any noticeable improvement in terms of % BLEU [57]. We used the PB-SMT system described in Section 5 above. As the NMT system, we used the transformer [1] with its default settings and some variations: mini-batches of size 80, a maximum sentence length of 60, and word embeddings of size 600. The input and output vocabulary sizes were both set to 45K. We reshuffled the training corpus between epochs and trained the models with the AMSGrad optimizer [58]. Every 5,000 mini-batches we validated the model on the validation set using BLEU (BiLingual Evaluation Understudy) scores, and every 30,000 mini-batches we saved a model checkpoint.
We use only the training data and development set pre-translated by the baseline transformer system as input to the SMT engine for its training and tuning. For tuning, the optimized configuration settings for our translation model are found either with Batch MIRA (also known as k-best MIRA) [59] [60], a version of MIRA (a margin-based classification algorithm) that works within a batch tuning framework and is suited to sparse features, or with Minimum Error Rate Training (MERT), which cannot support more than about 20-30 features. The pre-translated test set is then re-decoded using the tuned SMT system.

B. Evaluation and Results
We calculated statistical significance with the bootstrap re-sampling significance test [61], and all results are reported as case-insensitive BLEU scores. Table I shows the BLEU-based translation results for JP↔EN with non-reordered data, with a standard PB-SMT system [62] as the baseline for statistical translation and the NMT system proposed by [3] as the baseline for neural translation. We observe that:
• The hybrid translation system in which the SMT system pre-translates the data that serves as input to the NMT system (SMT ⇒ NMT) performs significantly worse than both the baseline NMT systems and the FNMT system on JP→EN and EN→JP. All SMT ⇒ NMT systems outperform the baseline SMT systems in BLEU points on JP→EN and EN→JP, except on one validation set, where the result decreases (see Table I).
• In the hybrid NMT ⇒ SMT model, the translations produced by the baseline NMT system are re-decoded by the SMT system, which leads to significant improvements over the baseline NMT system on the validation and test sets in both translation directions. Compared to the factored NMT system, the hybrid Factored NMT ⇒ SMT model shows a slight but noticeable improvement on the validation and test sets in one translation direction, and a significant improvement on the validation and test sets in the other (see Table I).
• Finally, the hybrid model in which translations are produced by the transformer factored at both its input and output (fully-factored transformer) and then re-decoded by the factored SMT system outperforms both the fully-factored transformer and the transformer on the validation and test sets in both translation directions (see Table I).
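The bootstrap re-sampling significance test mentioned at the start of this subsection can be sketched as follows. A paired variant is assumed here, since the text does not specify the exact procedure, and the toy `exact_match` metric merely stands in for a corpus-level BLEU implementation; all names are illustrative.

```python
import random


def exact_match(hyps, refs):
    """Toy corpus-level metric standing in for BLEU: share of exact matches."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats B.

    Sentences are drawn with replacement, and the same indices are used for
    both systems so that the comparison is paired.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins += 1
    return wins / n_samples
```

A win rate of at least 0.95 over the resampled sets is commonly read as system A being better than system B at the 95% confidence level.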

C. Discussion
From the above results, with reference to the state of the art, we make the following observations. Compared in particular to the framework of [10], an SMT ⇒ NMT pipeline that has a higher computational complexity because it integrates the source information into both the SMT and the NMT system (concatenating the pre-translated and source sentences as input at this level), and to other state-of-the-art hybrid frameworks, particularly the NMT ⇒ SMT pipeline of [12], our hybrid MT pipeline is simpler, more viable and more efficient: it uses source-side information only during transformer training and, exceptionally, during OOV processing, which makes its computation faster. We analyzed the impact of rare/OOV words on translation quality using the Scientific Paper Abstract Corpus (ASPEC-JE) for Japanese-to-English: validation sentences were sorted by the average inverse frequency of their words and split into groups with comparable numbers of rare words, which were evaluated independently. For all analyzed systems, all target words occurring fewer than N times in the training data were replaced by the UNK token, with N ∈ {0, 0.5, 1, 1.5, 2, 2.5, 3}.
Thus, a larger N yields more rare words, so that only the most frequent words are kept in the reference, whereas a smaller N yields fewer rare words and hence more words are kept. We observed that our best-performing model (FF-Transformer ⇒ FSMT) considerably outperforms the state-of-the-art stand-alone and hybrid MT systems on sentences with many OOV words, as a greater occurrence of OOV words implies a larger amount of data. This boost in performance can be explained by the fact that the attention mechanisms that make up the Transformer operate better on larger data sizes.
We point out that neural networks use attention mechanisms to encode each position while relating it to distant words in both the inputs and outputs, which allows training to be accelerated through parallelization. An attention mechanism is a technique for paying attention to specific words; it has proven useful for addressing the bottleneck that arises when handling long sentences with complicated dependencies between words, where it is hard for a single context vector to capture all the information contained in the sentence because of the sequential order of word processing. More precisely, attention focuses on a subset of the information it is given, with one hidden state vector produced for each input word. These vectors can then be concatenated, averaged or, better still, weighted in order to give higher importance to the words of the input sentence that are most relevant for decoding the next word of the output sentence. In addition, owing to the larger vocabulary of the test set obtained by integrating factors during the PB-SMT post-processing translation, our proposed framework showed a significant decrease in the OOV rate compared to the NMT system, of 1.06% and 5.37%, respectively.
We emphasize that the results on the ASPEC Japanese-to-English corpus should be interpreted with caution. We expect attention-based HMT to show its true potential on longer sentences. To investigate the effect of translating long sentences, sentences of similar lengths, including those containing words unknown to the models, were grouped together and the BLEU score was computed per group. The results, analyzed over the full validation set, are shown in Fig. 2.
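The per-length analysis can be sketched as a simple bucketing routine. The bucket width, the whitespace-based length measure (which presumes pre-tokenized input) and the toy `exact_match` metric standing in for BLEU are assumptions for illustration.

```python
from collections import defaultdict


def exact_match(hyps, refs):
    """Toy corpus-level metric standing in for BLEU: share of exact matches."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def score_by_length(sources, hyps, refs, metric, bucket_width=10):
    """Group sentences by source length and score each group separately.

    Returns a dict mapping the lower bound of each length bucket to the
    metric value computed on the sentences that fall into it.
    """
    buckets = defaultdict(list)
    for src, hyp, ref in zip(sources, hyps, refs):
        start = (len(src.split()) // bucket_width) * bucket_width
        buckets[start].append((hyp, ref))
    return {
        start: metric([h for h, _ in pairs], [r for _, r in pairs])
        for start, pairs in sorted(buckets.items())
    }
```

Plotting the returned scores against the bucket bounds gives one curve per system, in the manner of a length-versus-BLEU comparison.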
We observe in Fig. 2 that the buckets of longer sentences are handled more effectively by our Transformer-based HMT (purple curve), which integrates the attention mechanism at both the encoder and decoder levels, than by the winning recurrent HMT entry (green curve), in which the attention mechanism is integrated only at the decoder level; hence the quality of our model does not degrade as sentences become longer. For shorter sentences, however, our model performs worse, indicating that although the attention mechanism speeds up training, it is likely not very important there and may be redundant. Moreover, applying attention mechanisms to short sentences produces higher perplexities, as the model becomes less certain about its predictions than without them.
We believe that translation performance will improve further if corrected and reordered phrases are considered. We will investigate this in future work.

VII. CONCLUSION
We have proposed a novel HMT framework, cascaded as a Fully-Factored Transformer ⇒ Factored SMT pipeline, which integrates linguistic factors at both the source and target sides of the transformer model and at the source side (the pre-translated language) of the SMT model. The linguistic factors considered were lemmas and part-of-speech tags (taking into consideration their various compounds). Our experimental results on the JP↔EN language pairs clearly show that our proposed HMT framework with integrated linguistic factors outperforms the state-of-the-art HMT frameworks in terms of both perplexity and BLEU points. Moreover, we observed a reduction in the OOV rate, due to the generation of new word forms derived from the integrated additional linguistic resources.
As future work, we aim to explore whether integrating a grammatical error detection and correction (GEC) process [34] will further help reduce the OOV rate. We also plan to use compositional word representations learned from smaller orthographic units inside words, such as character n-grams, which can easily fit in the model vocabulary.

VIII. CONFLICTS OF INTEREST STATEMENT
The authors whose names are listed above certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.