Improving Transformer-Based Neural Machine Translation with Prior Alignments
Algorithm 1
Procedure to construct statistical alignments.
(1)
We tokenize both the Vietnamese source sentences and the English target sentences. The Transformer-S1 model uses the same token types as the Transformer-M model: a token in both the source and target sentences is a sequence of characters delimited by spaces. Linguistically, the Vietnamese-English Transformer-M and Transformer-S1 models are therefore syllable-to-word models, since spaces delimit syllables in Vietnamese but words in English.
(2)
We construct many-to-one alignments from Vietnamese to English using the fast_align token aligner.
(3)
We repeat step 2 in the reverse direction, from English to Vietnamese.
(4)
We merge the bidirectional alignments generated in steps 2 and 3 using the grow-diagonal heuristic proposed by Koehn et al. [24].
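The symmetrization in step 4 can be sketched as follows. This is a minimal illustration of the grow-diag idea, not the exact implementation used in the paper: starting from the intersection of the two directional alignments, links from their union are added along diagonal and adjacent neighbours as long as the new link touches a source or target token that is still unaligned. The function name and the set-of-pairs representation are our own choices for illustration.

```python
def grow_diag(src2tgt, tgt2src):
    """Symmetrize two directional word alignments with the grow-diag
    heuristic. Each alignment is a set of (src_index, tgt_index) links."""
    union = src2tgt | tgt2src
    alignment = src2tgt & tgt2src  # start from the high-precision intersection
    # Horizontal, vertical, and diagonal neighbours of an alignment point.
    neighbours = [(-1, 0), (0, -1), (1, 0), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:  # keep growing until no new link can be added
        added = False
        for (s, t) in sorted(alignment):
            for ds, dt in neighbours:
                cand = (s + ds, t + dt)
                # Only consider candidate links present in the union.
                if cand not in union or cand in alignment:
                    continue
                src_aligned = any(a[0] == cand[0] for a in alignment)
                tgt_aligned = any(a[1] == cand[1] for a in alignment)
                # Add the link if either end is still unaligned.
                if not (src_aligned and tgt_aligned):
                    alignment.add(cand)
                    added = True
    return alignment
```

For example, merging the forward alignment {(0, 0), (1, 1), (1, 2)} with the reverse alignment {(0, 0), (1, 1), (2, 2)} starts from the intersection {(0, 0), (1, 1)} and then grows diagonally to recover the links (1, 2) and (2, 2) from the union.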