Transformer is a neural machine translation model which has revolutionized machine translation. Compared with traditional statistical machine translation models and other neural machine translation models, the recently proposed transformer model fundamentally changes machine translation with its self-attention and cross-attention mechanisms, which effectively model token alignments between source and target sentences. It has been reported that the transformer model provides accurate posterior alignments. In this work, we empirically demonstrate the reverse effect, showing that prior alignments help transformer models produce better translations. Experiment results on a Vietnamese-English news translation task show not only the positive effect of manually annotated alignments on transformer models but also, surprisingly, that statistically constructed alignments, reinforced with the flexibility of token-type selection, outperform manual alignments in improving transformer models. Statistically constructed word-to-lemma alignments are used to train a word-to-word transformer model. The novel hybrid transformer model improves over the baseline transformer model and the transformer model trained with manual alignments by 2.53 and 0.79 BLEU, respectively. In addition to the BLEU score, we perform a limited human evaluation of translation results. Strong correlation between human and machine judgment confirms our findings.

1. Introduction

There was a long period of time when statistical machine translation (SMT) was the dominant translation paradigm. The most effective SMT model is phrase-based. Phrase-based SMT is interpretable, intuitive, and reminiscent of the human translation process. It consists of several separate processing steps concatenated in sequence. For example, the famous phrase-based SMT system Moses, created by Koehn [1], contains 9 separate steps including token alignment, lexical translation table creation, and phrase-table creation. The explicitly modular architecture of phrase-based SMT has both advantages and disadvantages. It allows us to easily modify any module to improve the overall system, but it requires us to study multiple modules to create an effective phrase-based SMT system. State-of-the-art neural machine translation (NMT) based on deep learning, on the other hand, adopts an end-to-end approach different from traditional SMT. The whole NMT model is represented as a large neural network consisting of millions of trained parameters, taking as input a sequence of source tokens and returning a sequence of target tokens. NMT does not require us to study each stage of translation separately since it can function as a black box, i.e., if we enter a source sentence, then it will perform some complex numerical operations and return a predicted target sentence for us. Nevertheless, it has been reported that different parts of SMT actually improve NMT models. Han et al. [2] concatenated source token embeddings with their corresponding lexical translation embeddings as an additional input feature. Their experiments show the improvement in translation accuracy for the Chinese-English language pair. Song et al. [3] replaced source phrases with their corresponding one-to-one target phrases in a phrase table.
Their experiments on Chinese-English and English-Russian language pairs demonstrate that hybrid source sentences consistently lead to better translations. Chen et al. [4] proposed the use of prior alignments to guide NMT models. Their experiments with recurrent NMT models in translating from German to English and from English to French reveal large gains in translation quality of recurrent NMT models trained with prior alignments. Garg et al. [5] proposed an adjustment to the state-of-the-art transformer NMT model [6, 7], making the model capable of learning statistical prior alignments. Their experiments for the three language pairs German-English, Romanian-English, and English-French exhibit that the adjusted transformer model consistently produces better posterior alignments, compared with the baseline transformer model. However, an improvement in translation quality does not materialize. There are two possible reasons that the improvement does not occur. First, their statistical prior alignments are perhaps not good enough. Second, the studied language pairs are resource-rich; consequently, the state-of-the-art transformer NMT model successfully captures their properties without the help of prior alignments. Nonetheless, there are many machine translation tasks without the luxury of available rich resources. The problem of translating news articles from Vietnamese into English that we are interested in is one of those tasks. Vietnamese-English is a low-resource language pair, and fortunately, a Vietnamese-English bilingual dataset with manually annotated prior alignments has been made publicly available by Ngo and Winiwarter [8, 9]. Based on these conditions, in this work, we first verify whether manual prior alignments (MA) improve the translation quality for the Vietnamese-English transformer-based NMT model.
Second, we experiment with different Vietnamese-English transformer-based NMT models trained with statistical prior alignments (SAs), with the objective of approaching the quality of the model trained with manual prior alignments.

The rest of the paper is divided into six sections. The first section reviews related works. The second section introduces the proposed transformer-based neural machine translation models guided by prior alignments. The third section presents the raw material and the preprocessing steps applied to it to get datasets for our study. The fourth section describes the experiments and discussion on their results. The fifth section unveils a limitation of the proposed models and a future work on improvement. The final section gives conclusions from this work.

In this section, we briefly review works related to our study on improving transformer-based neural machine translation with prior alignments.

2.1. Token Alignments

Token alignments for a pair of sentences are a relation from the set of token positions in the source sentence to the set of token positions in the target sentence. An alignment can be intuitively represented in Pharaoh format [10] as a tuple (i, j), where the first element i indicates the i-th source token and the second element j indicates the j-th target token. Preparing token alignments is a crucial part of the traditional SMT models. The most popular token alignment tool is Giza++ [11], which is used by default in the famous SMT system Moses [1]. Giza++ implements the IBM Model 4 [12]. In addition to Giza++, there is another efficient token alignment tool, fast_align, by Dyer et al. [13], which effectively implements the IBM Model 2 [12]. Dyer et al. reported that the fast_align tool provides alignments as good as Giza++ does, while running significantly faster. Based on the efficiency and alignment quality, in this study, we prefer fast_align to Giza++ for statistically aligning source and target tokens.
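For illustration, a Pharaoh-format alignment line can be parsed into index pairs with a few lines of Python (a minimal sketch; the function name is ours):

```python
def parse_pharaoh(line):
    """Parse a Pharaoh-format alignment line such as "0-0 1-2 2-1"
    into a set of (source_index, target_index) tuples."""
    pairs = set()
    for item in line.split():
        src, tgt = item.split("-")
        pairs.add((int(src), int(tgt)))
    return pairs
```

For example, parse_pharaoh("0-0 1-2 2-1") yields the relation {(0, 0), (1, 2), (2, 1)}, stating that the first source token aligns with the first target token, the second with the third, and the third with the second.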

2.2. Recurrent NMT Models Trained with Prior Alignments

While modern NMT models outperform SMT models in terms of translation quality, the task of token alignment is still dominated by traditional statistical tools [5]. Chen et al. [4] combined the advantages of the two approaches by using statistical prior alignments to train recurrent NMT models. For German-English and English-French tasks, they experimented with two recurrent NMT models trained with prior alignments which had been generated with Giza++ [11]. Their experiment results show that the proposed models significantly improve over baseline recurrent NMT models. Chen et al. also introduced an alignment cost for the mismatch between prior alignments and the computed single-head attention mechanism of the recurrent models. Further developments on using prior alignments to improve recurrent NMT models can be found in [14–17]. Moreover, a recurrent neural network model trained with prior alignments has also been proved effective in a speech synthesis task [18], which has a sequence-to-sequence pattern similar to the machine translation task.

2.3. Baseline Transformer Model

Recently, a novel deep neural network model, transformer [6], with an innovative multihead attention mechanism has been introduced. It has become the state-of-the-art model for many artificial intelligence tasks, including machine translation [19–22]. In comparison with other NMT models, including recurrent ones, transformer not only provides better translation results but also can be trained in a shorter period of time [6]. In this work, we use the transformer model as the baseline translation system. The transformer model is composed of encoder and decoder modules. The output probability distribution of the decoder is then used to predict the next target token.

Given a reference target sentence containing n tokens, the mathematical formulation of the optimization criterion for training the transformer model is presented in equation (1), revised from the one provided by Muller et al. [23]:

L_T = -\sum_{t=1}^{n} \sum_{k=1}^{|V|} y_{t,k} \log p_{t,k}. \quad (1)

In equation (1), the symbol y_{t,k} indicates whether the k-th token in the dictionary V is the true value at the t-th position in the target sentence, and p_{t,k} is the probability the model assigns to that token.
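A toy numeric check of this criterion may help: with one-hot reference labels, only the probability of the reference token at each position contributes to the sum (a sketch with made-up probabilities; the function name is ours):

```python
import math

def token_cross_entropy(true_ids, probs):
    """Cross-entropy of equation (1): with one-hot y_{t,k}, only the
    probability of the reference token at each position t contributes."""
    return -sum(math.log(probs[t][k]) for t, k in enumerate(true_ids))

# Two target positions over a toy 3-token vocabulary.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
loss = token_cross_entropy([0, 1], probs)  # -(log 0.7 + log 0.8)
```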

2.4. Transformer Model Guided by Prior Alignments

Garg et al. [5] altered the state-of-the-art transformer NMT model [6, 7] for joint alignment and translation tasks, making use of prior alignments in training the model. The revised transformer model has the same architecture as the baseline transformer model with a slightly different training procedure. They replace the optimization criterion with a modified one including prior alignments. Specifically, for a pair of source and target sentences of length m and n, respectively, and a prior alignment set A, they randomly take the output of just one head h (h can be any number from 1 to 8) of the fifth decoder layer and then project it into a sequence of probability distributions over tokens of the corresponding source sentence for every target token. They compare the probability distributions with the reference probability generated from prior alignments via cross-entropy:

L_A = -\sum_{t=1}^{n} \sum_{s=1}^{m} a_{t,s} \log p_{t,s}. \quad (2)

In equation (2), the symbol a_{t,s} indicates the probability that the t-th target token is correctly aligned with the s-th source token, and p_{t,s} is the alignment probability computed from the selected attention head.

Taken together, the optimization criterion for the transformer-M model is the sum of the cross-entropy for tokens and a weighted cross-entropy for alignments between source and target sentences in the training dataset:

L = L_T + \lambda L_A. \quad (3)

2.5. Proposed Transformer-Based Models Trained with Prior Alignments

In experiments for German-English, Romanian-English, and English-French translation tasks, Garg et al. used prior alignments created with Giza++ to train the revised transformer models. The models generate better posterior alignments but do not provide better translations. Motivated by the improvement in translation quality of recurrent NMT models trained with prior alignments [4], we experiment with training transformer models on manually constructed alignments (transformer-M) for our Vietnamese-English translation task. The availability of manual token alignments allows us to assess whether prior alignments help us to build a better transformer model. Unfortunately, the manual approach is labor-intensive and does not give us the freedom to choose a token type other than the one used in the manual token alignments. Consequently, aside from the transformer-M model, we build other transformer models trained on statistically constructed prior alignments (transformer-S). Transformer-S models employ different token types and are trained on statistically constructed prior alignments instead of manually annotated prior alignments, while keeping the same architecture and training procedure as for the transformer-M model.

2.6. Syllable-to-Word Transformer Model

The first transformer-S model (transformer-S1) is guided by alignments constructed with the fast_align token aligner in place of Giza++, which was used in the study by Garg et al. [5]. In addition to the change of aligner, we adapt their procedure for constructing statistical alignments to suit the Vietnamese-English translation task. The adapted procedure is presented as Algorithm 1.

(1)We tokenize both Vietnamese source sentences and English target sentences. We use the same token types in the transformer-S1 model as in the transformer-M model: a token in both source and target sentences is a sequence of characters delimited by spaces. Linguistically, the Vietnamese-English transformer-M and transformer-S1 models are syllable-to-word models, since spaces in Vietnamese delimit syllables and spaces in English delimit words.
(2)We construct many-to-one alignments from Vietnamese to English, using the fast_align token aligner.
(3)We repeat step 2 in the reverse direction from English to Vietnamese.
(4)We merge the bidirectional alignments generated in steps 2 and 3, following grow-diagonal heuristics proposed by Koehn et al. [24].
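The merge in step 4 can be sketched in Python under our reading of the grow-diagonal heuristic of Koehn et al. [24]: start from the intersection of the two directional alignments and repeatedly add diagonal or adjacent neighbors that appear in the union and touch an uncovered token. (In practice, the atools companion of fast_align performs this symmetrization; the function below is an illustrative sketch, and its name is ours.)

```python
def grow_diag(forward, backward, src_len, tgt_len):
    """Symmetrize two directional alignments (sets of (s, t) index pairs)
    with a grow-diag heuristic: seed with the intersection, then grow
    along neighboring points of the union that touch an uncovered token."""
    union = forward | backward
    alignment = forward & backward
    neighbors = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                 (0, 1), (1, -1), (1, 0), (1, 1)]
    added = True
    while added:
        added = False
        for (s, t) in sorted(alignment):  # snapshot; alignment grows below
            for ds, dt in neighbors:
                ns, nt = s + ds, t + dt
                if not (0 <= ns < src_len and 0 <= nt < tgt_len):
                    continue
                src_covered = any(a[0] == ns for a in alignment)
                tgt_covered = any(a[1] == nt for a in alignment)
                if (ns, nt) in union and not (src_covered and tgt_covered):
                    alignment.add((ns, nt))
                    added = True
    return alignment
```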
2.7. Word-to-Subword Transformer Model

Influenced by the work of Nguyen et al. [25] for Russian-Vietnamese NMT, we create the second transformer-S model (transformer-S2). While utilizing the same architecture, training procedure, and procedure to construct statistical alignments (Algorithm 1) as the transformer-S1 model, we tokenize the sentences differently in the transformer-S2 model. On the Vietnamese source side, we segment sentences into words, and on the English target side, we divide the sentences into subwords. We decide to adopt this mixed model due to the difference in linguistic morphology between Vietnamese and English. While Vietnamese is a noninflectional language, English is an inflectional language, although not as morphologically rich as Russian. We use the VnCoreNLP tool developed by Vu et al. [26] and further improved by Nguyen et al. [27] to segment Vietnamese sentences into words. In Vietnamese, it is common for a syllable to appear in many different words; such syllables are therefore ambiguous for classifiers to recognize. We deploy segmentation of Vietnamese sentences into words to reduce ambiguity and, consequently, to enhance the quality of the transformer-S2 model. An example of a Vietnamese sentence and the result of its segmentation into words are presented in Table 1.

The VnCoreNLP tool uses the character “_” to indicate that neighboring syllables have been concatenated into a word. In Table 1, the two syllables “lãnh” and “thổ” are concatenated into the word “lãnh_thổ.”

On the English target side, we divide sentences into subwords with BPE tool proposed by Sennrich et al. [28]. An example of an English sentence and the result of its segmentation into subwords are presented in Table 2.

The BPE tool uses the character pair “@@” to indicate that the containing token is a subword and should be concatenated with the next token to form a word in the inference phase of the transformer-S2 model. For some words, segmentation into subwords is interpretable: for example, the word “personally” is divided into 2 subwords “person” and “ally” (Table 2). The subword “person” is the root of many other words, such as “personal,” “personalize,” and “personality,” so the segmentation actually has some grammatical meaning. A similarly meaningful segmentation can be found for the word “ignorant,” divided into “ignor” and “ant.” Meanwhile, there are other words whose segmentation is not understandable. In Table 2, the word “mob” is divided into two meaningless subwords “mo” and “b.”
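Reversing the segmentation at inference time amounts to deleting every “@@ ” marker, which rejoins subwords into full words (a minimal sketch; the function name is ours):

```python
import re

def restore_bpe(tokens):
    """Join subwords marked with a trailing "@@" back into full words,
    reversing the BPE segmentation at inference time."""
    return re.sub(r"@@ ", "", " ".join(tokens))
```

For example, restore_bpe(["person@@", "ally"]) returns "personally", and restore_bpe(["mo@@", "b", "rules"]) returns "mob rules".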

Overall, the transformer-S2 model is a variant of the transformer-S1 model with different token representations on the source and target side.

Moreover, the imperfect segmentation of English sentences into subwords stimulates us to propose a novel transformer-S3 model without the use of English subwords, which puts more focus on the linguistic aspects of machine translation, such as the use of lemmas.

2.8. Hybrid Word-to-Word Transformer Model Trained with Statistical Word-to-Lemma Alignments

The transformer-S3 model can be seen as a hybrid of the transformer-S1 and transformer-S2 models. Specifically, the transformer-S3 model is a word-to-word model. On the Vietnamese source side, we segment sentences into words, as in the transformer-S2 model, while on the English target side, we divide sentences into words, as in the transformer-S1 model. Nevertheless, in preparing prior alignments, we revise the procedure to construct statistical alignments (Algorithm 1), replacing English words with their lemmas. The step-by-step procedure to construct alignments is presented as Algorithm 2.

(1)We tokenize both Vietnamese source sentences and English target sentences into words
(2)We replace English words with their lemmas
(3)We construct many-to-one alignments from Vietnamese words to English lemmas, using the fast_align token aligner
(4)We repeat step 3 in the reverse direction from English lemmas to Vietnamese words
(5)We merge the bidirectional alignments generated in steps 3 and 4, following grow-diagonal heuristics proposed by Koehn et al. [24]

In Algorithm 2, we replace English words with their lemmas, using the Stanza tool created by Qi et al. [29]. A word is a surface form of a lemma, determined by its grammatical role in the sentence. For example, the words “life” and “lives” are both inflected from the same lemma “life,” depending on grammatical number. An example of an English sentence and the result of its lemmatization are shown in Table 3.

We adopt lemmatization of English words to reduce the vocabulary size of the training dataset. The English side of the training dataset contains 36,672 distinct tokens inflected from a smaller set of 28,583 lemmas. We hope that a reduced vocabulary and an unchanged number of tokens will allow the fast_align aligner to produce better alignments and, consequently, lead to a better translation model trained on them. The relation between English words and their lemmas is one-to-one (see the index sequences in Table 4); therefore, Vietnamese-word-to-English-lemma alignments can be employed in training the word-to-word transformer-S3 model.
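The key property exploited here is that lemmatization substitutes tokens in place, so positions are preserved and any alignment computed over lemmas carries over directly to the original words. A minimal sketch, with a toy lemma table standing in for the Stanza lemmatizer (the table and function name are illustrative assumptions, not the actual tool output):

```python
# Toy lemma table standing in for the Stanza lemmatizer (illustration only).
LEMMAS = {"lives": "life", "stopped": "stop", "states": "state"}

def lemmatize_tokens(tokens):
    """Replace each word with its lemma. The token count and positions
    are unchanged, so alignments computed over the lemma sequence
    transfer one-to-one to the original word sequence."""
    return [LEMMAS.get(tok.lower(), tok) for tok in tokens]

words = ["Nine", "lives", "stopped"]
lemmas = lemmatize_tokens(words)
assert len(lemmas) == len(words)  # positions preserved
```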

We want to restate an important characteristic of the transformer-S3 model. The lemmatization of English target words is only applied in the construction of statistical alignments. We still use English words in the translation model.

3. Materials

In this work, we use English-Vietnamese Word Alignment Corpus (EVWACorpus) provided by Ngo et al. [9]. The dataset consists of 1000 news articles with 45,531 sentence pairs. These sentence pairs are already tokenized and manually aligned at the token level. A token is a sequence of characters delimited by spaces.

We apply the following processing procedures to the original EVWACorpus so that it fits our study.

3.1. True-Cased Corpus

First, we true-case the sentences in the dataset with the Moses tool of Koehn et al. [1]. The term “true-case” means converting a token to its most probable case. For example, the true-cased form of the token “The” is “the.” An example of a sentence in its natural form and its converted true-cased form is presented in Table 5.

The true-casing procedure focuses on capitalized tokens (in Table 5, these are “The,” “Fenqing,” and “China”). Based on frequencies calculated from the corpus, these tokens are either converted to lower-cased form or left unchanged.
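The frequency-based decision can be sketched as follows: count the casing variants of each word type over the corpus and map every token to its most frequent variant (a simplified illustration of the idea, not the Moses truecaser itself; function names are ours):

```python
from collections import Counter

def build_truecase_model(sentences):
    """Count casing variants of each word type across a corpus and
    keep the most frequent variant for each lower-cased key."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    best = {}
    for form, n in counts.items():
        key = form.lower()
        if key not in best or n > counts[best[key]]:
            best[key] = form
    return best

def truecase(sentence, model):
    """Map each token to its most frequent casing variant."""
    return " ".join(model.get(tok.lower(), tok) for tok in sentence.split())
```

For instance, if “the” occurs more often than “The” in the corpus, the model converts sentence-initial “The” to “the,” while a name that is always capitalized stays capitalized.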

3.2. Filtered Corpus

Secondly, we leave some sentence pairs out of our work. We filter out wrongly aligned sentence pairs: sentence pairs are considered wrongly aligned if any alignment index is greater than the length of the corresponding sentence. For computational reasons, we also remove sentence pairs containing any sentence longer than 80 tokens. Moreover, we transform the alignment representation in EVWACorpus into Pharaoh format for later use. Finally, we obtain 45,035 sentence pairs with manually annotated alignments. An example of a sentence pair in the filtered corpus is presented in Table 6.
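The two filters above can be expressed as a single predicate over a sentence pair and its alignment (a sketch assuming tokenized sentences as lists and alignments as (source, target) index pairs; the function name is ours):

```python
def keep_pair(src_tokens, tgt_tokens, alignment, max_len=80):
    """Return True if a sentence pair passes both filters: neither
    sentence exceeds max_len tokens, and every alignment index falls
    inside the corresponding sentence."""
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    return all(0 <= s < len(src_tokens) and 0 <= t < len(tgt_tokens)
               for s, t in alignment)
```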

3.3. Datasets Extracted from Filtered Corpus

We divide the filtered corpus into three datasets: training, validation, and testing datasets for training and evaluating different translation models. We apply a dividing procedure similar to the one used by Nguyen et al. [30]. Specifically, we randomly take 1,527 sentence pairs from 30 news articles and use them as the testing dataset. Then, we randomly take another 1,482 sentence pairs from another 40 news articles and use them as the validation dataset. The remaining 42,026 sentence pairs from 930 news articles form the training dataset. An overview of the datasets is shown in Table 3.

4. Experiments and Discussion

The Google Brain team released an implementation of the transformer model in the Tensor2tensor library [7]. The library has since been replaced by its successor Trax (available at https://github.com/google/trax). The transformer model is also implemented in other popular NMT libraries, such as OpenNMT [31, 32] and Fairseq [33] of the Facebook AI Research team. To carry out our experiments, we choose the Fairseq library because it allows us to build transformer models trained both with and without prior alignments.

Following the architecture and training procedure for transformer models presented in previous sections, we apply the Adam optimizer with learning rate 0.0002 to train them for 10,000 steps with batches of 3,200 tokens. After each epoch over the training dataset, we save the model. Among all saved models, we choose the one with the best performance on the validation dataset.

We use the testing dataset to evaluate the translation models. Each model translates all Vietnamese sentences from the testing dataset, deploying a beam search of size 5. The predicted English sentences are then compared with the corresponding reference English sentences from the testing dataset via BLEU score [34]. We apply the script multi-bleu.perl (available at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) from the Moses program [1] to calculate the score. Since the BLEU score is a statistical metric designed to be applied at the dataset level, we also make complementary human judgments at the sentence level. Specifically, we randomly take 5 Vietnamese-English sentence pairs from the testing dataset, where the source sentences are composed of 10, 15, 20, 25, and 30 tokens, respectively. We then make human judgments on the selected sentence pairs to complement the automatic machine judgment in the form of BLEU scores.

Figure 1 shows BLEU scores of the translation results of the testing dataset by the transformer models. We can see that the transformer-M model trained with manual prior alignments significantly outperforms the baseline transformer model by 1.74 BLEU at the overall dataset level. The first question of our work thus has an answer: prior alignments actually help improve the translation quality of the transformer model.

Figure 1 also reveals a surprising result. Performance of the statistical transformer-S3 model is even better than expected. It not only outperforms the other statistical models but also exceeds our expectation of approaching the result of the manual transformer-M model by giving the highest BLEU score. The statistical transformer-S3 model improves over the manual transformer-M model by 0.79 BLEU. This can be explained by the fact that the quality of manual alignments relies on humans, and humans do not always provide correct alignments. It is worth noticing that it is difficult to manually align tokens between source and target sentences; this language-related task is generally ambiguous, as stated by Lambert et al. [35]. Moreover, the highest BLEU score of the translation result by the transformer-S3 model demonstrates the power of the statistical approach and its flexibility.

We now examine whether human judgment on translation results is correlated with automatic machine judgment on the sentence level. Here are five testcases which we randomly take and study.

Table 7 shows the translation results of a Vietnamese sentence comprising 10 tokens by the transformer models. Clearly, the three presented translation models fail to translate the Vietnamese source sentence. However, from the semantic standpoint, the transformer-S3 model is better than the others, successfully translating the subject “siêu nhân” of the Vietnamese source sentence into the reference target word “Superman.” Nevertheless, from the technical standpoint, the baseline transformer model performs better by providing the largest number of reference target tokens, “only,” “can,” “do,” while the transformer-S3 model misunderstands the source phrase “làm được” and translates it into a passive verb phrase “are done.” This incorrect translation is very interesting because the Vietnamese token “được” is mostly used in the passive voice. Thus, the transformer-S3 model makes the same mistake that foreign learners of Vietnamese usually make.

Table 8 presents the translation results of a Vietnamese sentence consisting of 15 tokens by the transformer models. This test case demonstrates the superiority of the transformer-S3 model in comparison with the other models. The translation by the transformer-S3 model bears the most resemblance in meaning to the full English reference target sentence. Nevertheless, the transformer-S3 model chooses a wrong tense of the verb “stop.” Instead of the reference verb phrase in the past perfect tense, “had stopped,” the transformer-S3 model uses the verb in the simple present tense, “stop.” This is understandable, considering the fact that Vietnamese verbs, such as “ngừng” in the source sentence, do not inflect for tense; hence, translation models and even human translators find it difficult to translate Vietnamese verbs.

Table 9 shows the translation results of a Vietnamese sentence consisting of 20 tokens by the transformer models. All three presented translation models perform pretty well in this case. Their translations generally reflect the meaning of the source sentence. Still, the translation by the transformer-S3 model is semantically closest to the reference target sentence. The transformer-S3 model translates the key phrase “nhàm chán” into the correct target word “boring.” However, it repeats the error of translating Vietnamese verbs seen in test case 2. It mistranslates the source verb phrase “không biết” into the present simple tense target verb phrase “don’t know,” while the reference target phrase “didn’t know” is in the past simple tense. At the same time, the baseline transformer model correctly identifies the tense, producing the target phrase “didn’t know.”

Translations of a Vietnamese sentence comprising 25 tokens are presented in Table 10. This test case unveils the positive effect of the flexibility of the statistical alignment approach. We can apply a statistical aligner to different kinds of tokens without limiting ourselves to a preselected kind of tokens as in the case of manual alignments. Specifically, the transformer-S3 model successfully produces the target word “appearance,” having concatenated two neighboring syllables “ngoại” and “hình” into one word “ngoại_hình” (see Table 10). This happens due to the fact that we choose to build the transformer-S3 model as a linguistics-informed word-to-word model, while the baseline transformer model and transformer-M model are syllable-to-word models. These models require tokenization of Vietnamese sentences into syllables and English sentences into words.

Table 11 displays the translations of a Vietnamese sentence comprising 30 tokens. All three presented translation models fail to translate the key phrases of the source sentence. The subject “bang Gujarat” (meaning: the state of Gujarat) of the source sentence is mistranslated into different things: “federal federal government,” “the federal states,” and “the state of states.” Nevertheless, the translation by the baseline transformer model is smoother, consisting of many reference tokens. Unfortunately, it misses two key words “illegal” and “toxic”; therefore, its meaning is totally different from the reference. While the transformer-S3 model delivers stutters (“state of states” and “production of production”), it yields a correct key word “illegal,” making the translation result better resemble the reference in meaning.

On the whole, human judgment is in line with automatic machine judgment on the quality of the translation models. From the semantic point of view, the transformer-S3 model is the best model. Moreover, we discover that the transformer-S3 model does not succeed at handling the verb tenses.

4.1. Limitation and Future Work

Despite the many advantages of training transformer-based NMT models with prior alignments, especially statistical ones, the approach still has a noticeable disadvantage: the models trained with prior alignments handle verb tenses poorly in translation. Translations by the best transformer-S3 model may reflect the meaning of the source sentences; however, they do not guarantee a high BLEU score, since they often render verbs in an incorrect tense.

This work is the first step towards enhancing translation quality of transformer-based NMT models trained with prior alignments. Future work will address the pitfall of the word-to-word transformer-S3 model trained with statistical word-to-lemma alignments. Research into solving this problem is in progress. We will explore the selection of a head in the multihead attention mechanism, whose output is compared with prior alignments.

5. Conclusions

In this study, we have proved that prior alignments help better train the Vietnamese-English transformer-based neural machine translation model. Experiment results show the improvement of translation quality in terms of BLEU score. Moreover, to free ourselves from dependence on costly manual alignments, we have proposed a novel hybrid word-to-word transformer model trained on statistical word-to-lemma alignments. Unlike strict manual alignments, the flexible statistical aligner allows us to construct word-to-lemma alignments, representing a Vietnamese source sentence as a sequence of words and the corresponding English target sentence as a sequence of lemmas. The statistically constructed word-to-lemma alignments are then used to train the word-to-word transformer-S3 model in place of word-to-word alignments. Experiments have demonstrated that the novel word-to-word transformer-S3 model trained with statistical word-to-lemma alignments outperforms the transformer-M model trained with manual alignments in terms of BLEU score. In addition to machine judgment, we have made limited human judgments on translation results. Strong correlation between human and machine judgment has validated our findings.

Based on the experiment results, we recommend the use of statistical prior alignments in training transformer-based neural machine translation models, at least in the context of low-resource translation tasks.

Data Availability

Readers can obtain the datasets used in this work by contacting the corresponding author Thien Nguyen via e-mail: [email protected].

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.


Acknowledgments

The authors truly appreciate Ms. Trang Nguyen, a translator and scientist, who provided invaluable recommendations and encouragement while they prepared the manuscript and chose a journal to submit their work to.