Research Article
An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
Table 3
Overall evaluation scores of four attention-based models on the IWSLT16 DE-EN development (dev) and test sets.
Column headings give the configuration as number of heads/head size.

(a) BLEU

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 22.73 | 25.71 | 26.44 | 27.23 | 23.92 | 19.50 |
| Multihead attention | test | 21.15 | 24.00 | 24.38 | 25.18 | 22.05 | 17.94 |
| Multihead attention (head size = n, d_model = 512) | dev | 20.93 | 23.08 | 24.79 | 26.00 | 26.94 | 26.38 |
| Multihead attention (head size = n, d_model = 512) | test | 19.35 | 21.38 | 22.81 | 24.61 | 24.86 | 24.59 |
| Talking-head attention | dev | 20.66 | 21.39 | 22.47 | 24.02 | 24.21 | 23.65 |
| Talking-head attention | test | 19.13 | 19.82 | 20.87 | 22.24 | 22.30 | 21.72 |
| Interacting-head attention | dev | 29.59 | 29.94 | 29.54 | 30.01 (+2.78) | — | — |
| Interacting-head attention | test | 27.61 | 28.01 | 27.44 | 27.61 (+2.43) | — | — |

(b) WER

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 5.82 | 4.85 | 3.56 | 3.10 | 3.51 | 8.73 |
| Multihead attention | test | 9.03 | 7.92 | 7.71 | 5.87 | 7.02 | 14.39 |
| Multihead attention (head size = n, d_model = 512) | dev | 6.90 | 6.00 | 5.29 | 4.59 | 4.55 | 3.97 |
| Multihead attention (head size = n, d_model = 512) | test | 11.47 | 10.10 | 8.98 | 8.16 | 8.11 | 7.96 |
| Talking-head attention | dev | 9.03 | 7.31 | 6.58 | 4.81 | 4.51 | 4.33 |
| Talking-head attention | test | 14.46 | 11.90 | 9.83 | 9.10 | 8.69 | 8.90 |
| Interacting-head attention | dev | 2.41 | 2.44 | 2.31 | 2.25 (0.85) | — | — |
| Interacting-head attention | test | 6.60 | 5.34 | 4.77 | 4.49 (1.38) | — | — |

(c) METEOR

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 25.42 | 26.72 | 28.58 | 29.14 | 26.42 | 22.63 |
| Multihead attention | test | 24.77 | 25.96 | 27.67 | 28.20 | 25.57 | 22.02 |
| Multihead attention (head size = n, d_model = 512) | dev | 23.47 | 25.69 | 25.76 | 26.51 | 27.05 | 27.12 |
| Multihead attention (head size = n, d_model = 512) | test | 22.84 | 24.95 | 24.95 | 25.88 | 25.28 | 26.36 |
| Talking-head attention | dev | 22.48 | 24.28 | 23.80 | 24.46 | 25.28 | 26.76 |
| Talking-head attention | test | 21.85 | 23.46 | 23.18 | 23.94 | 24.49 | 25.85 |
| Interacting-head attention | dev | 31.73 | 31.94 | 31.65 | 32.04 (+2.90) | — | — |
| Interacting-head attention | test | 30.87 | 31.11 | 30.76 | 31.25 (+3.05) | — | — |

(d) ROUGE_L

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 53.72 | 55.68 | 57.85 | 58.76 | 55.93 | 50.52 |
| Multihead attention | test | 52.23 | 54.30 | 56.23 | 57.26 | 54.57 | 49.24 |
| Multihead attention (head size = n, d_model = 512) | dev | 51.46 | 54.02 | 55.83 | 55.52 | 56.53 | 56.99 |
| Multihead attention (head size = n, d_model = 512) | test | 50.21 | 52.70 | 55.44 | 54.31 | 54.18 | 55.60 |
| Talking-head attention | dev | 49.26 | 51.64 | 52.38 | 53.72 | 54.92 | 55.78 |
| Talking-head attention | test | 48.23 | 50.19 | 51.19 | 52.30 | 53.33 | 54.45 |
| Interacting-head attention | dev | 61.02 | 61.29 | 61.14 | 61.41 (+2.65) | — | — |
| Interacting-head attention | test | 59.25 | 59.80 | 59.62 | 59.96 (+2.70) | — | — |

(e) CIDEr

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 1.94 | 2.39 | 2.61 | 2.69 | 2.39 | 1.89 |
| Multihead attention | test | 1.80 | 2.20 | 2.39 | 2.49 | 2.22 | 1.75 |
| Multihead attention (head size = n, d_model = 512) | dev | 1.99 | 2.20 | 2.24 | 2.27 | 2.33 | 2.38 |
| Multihead attention (head size = n, d_model = 512) | test | 1.84 | 2.05 | 2.08 | 2.10 | 2.16 | 2.21 |
| Talking-head attention | dev | 1.70 | 1.87 | 1.84 | 2.09 | 2.20 | 2.32 |
| Talking-head attention | test | 1.57 | 1.72 | 1.67 | 1.91 | 2.01 | 2.05 |
| Interacting-head attention | dev | 2.93 | 2.96 | 2.93 | 2.98 (+0.29) | — | — |
| Interacting-head attention | test | 2.69 | 2.75 | 2.71 | 2.79 (+0.30) | — | — |

(f) YiSi

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 54.33 | 55.98 | 57.91 | 58.48 | 55.25 | 49.83 |
| Multihead attention | test | 53.34 | 54.18 | 55.88 | 56.45 | 53.57 | 48.44 |
| Multihead attention (head size = n, d_model = 512) | dev | 52.05 | 54.25 | 55.85 | 57.15 | 57.33 | 56.45 |
| Multihead attention (head size = n, d_model = 512) | test | 50.63 | 52.57 | 54.18 | 55.51 | 56.79 | 55.94 |
| Talking-head attention | dev | 50.19 | 51.32 | 52.32 | 53.02 | 55.55 | 57.06 |
| Talking-head attention | test | 48.97 | 49.63 | 50.63 | 51.69 | 53.72 | 55.83 |
| Interacting-head attention | dev | 60.92 | 61.22 | 60.91 | 61.45 (+2.97) | — | — |
| Interacting-head attention | test | 58.68 | 59.15 | 58.67 | 60.04 (+3.59) | — | — |
Note. The metrics are (a) BLEU, (b) WER, (c) METEOR, (d) ROUGE_L, (e) CIDEr, and (f) YiSi. Values in parentheses give the margin between interacting-head attention at 16 heads and the best score obtained by any baseline: an improvement for (a) and (c)–(f), where higher is better, and a reduction for (b) WER, where lower is better.
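The paper's exact scoring pipeline is not reproduced here; as a minimal, self-contained sketch of how the first two metrics could be recomputed from decoded output, the snippet below uses the sacrebleu and jiwer packages. The file names `hyps.txt` and `refs.txt` are hypothetical placeholders, not artifacts of this work.

```python
# Sketch: corpus-level BLEU and WER for a decoded test set.
# Assumes one sentence per line in each file; file names are hypothetical.
import sacrebleu  # pip install sacrebleu
import jiwer      # pip install jiwer

with open("hyps.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("refs.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# BLEU as in panel (a): sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# WER as in panel (b): jiwer returns a fraction; scale as needed to
# match the units used in the table.
wer = jiwer.wer(references, hypotheses)
print(f"WER: {100 * wer:.2f}")
```

METEOR, ROUGE_L, CIDEr, and YiSi each require their own scorers and are not covered by this sketch.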