Research Article
An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
Table 5
Overall evaluation scores of four attention-based models on the WMT17 EN-CS evaluation (dev) set and test sets (newstest2014 to newstest2017).
| | | Number of heads/head size | | | | | |
| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| (a) BLEU |
| Multihead attention | dev | 11.69 | 13.96 | 13.76 | 14.14 | 12.10 | 11.98 |
| Multihead attention | newstest2014 | 12.90 | 15.65 | 14.82 | 15.52 | 13.26 | 12.71 |
| Multihead attention | newstest2015 | 11.03 | 12.48 | 11.66 | 12.62 | 10.04 | 10.08 |
| Multihead attention | newstest2016 | 11.98 | 13.85 | 13.09 | 14.32 | 11.16 | 10.85 |
| Multihead attention | newstest2017 | 10.26 | 12.45 | 11.71 | 12.36 | 9.96 | 10.14 |
| Multihead attention (head size = n) | dev | 10.14 | 11.87 | 12.99 | 13.35 | 14.36 | 14.46 |
| Multihead attention (head size = n) | newstest2014 | 11.32 | 13.34 | 14.04 | 14.49 | 15.79 | 15.28 |
| Multihead attention (head size = n) | newstest2015 | 9.11 | 10.18 | 11.55 | 11.67 | 12.69 | 12.05 |
| Multihead attention (head size = n) | newstest2016 | 10.23 | 11.86 | 12.51 | 12.62 | 13.73 | 13.83 |
| Multihead attention (head size = n) | newstest2017 | 8.99 | 10.43 | 11.30 | 11.23 | 12.54 | 12.47 |
| Talking-head attention | dev | 9.80 | 10.06 | 11.77 | 12.06 | 12.75 | 12.68 |
| Talking-head attention | newstest2014 | 10.90 | 10.65 | 11.65 | 12.33 | 12.94 | 12.34 |
| Talking-head attention | newstest2015 | 8.17 | 8.54 | 10.06 | 11.70 | 11.51 | 11.57 |
| Talking-head attention | newstest2016 | 9.11 | 9.38 | 10.50 | 11.25 | 11.78 | 11.73 |
| Talking-head attention | newstest2017 | 8.50 | 8.63 | 9.92 | 11.05 | 11.52 | 11.83 |
| Interacting-head attention (our model) | dev | 17.01 | 17.76 | 17.93 | 18.01(+3.87) | — | — |
| Interacting-head attention (our model) | newstest2014 | 18.71 | 19.48 | 20.01 | 20.14(+4.62) | — | — |
| Interacting-head attention (our model) | newstest2015 | 15.52 | 16.49 | 16.38 | 16.40(+3.78) | — | — |
| Interacting-head attention (our model) | newstest2016 | 17.20 | 18.37 | 18.29 | 18.74(+4.42) | — | — |
| Interacting-head attention (our model) | newstest2017 | 14.66 | 15.66 | 15.79 | 15.78(+3.42) | — | — |
| (b) WER |
| Multihead attention | dev | 15.13 | 12.33 | 15.27 | 10.30 | 29.60 | 16.80 |
| Multihead attention | newstest2014 | 13.95 | 11.66 | 14.15 | 8.62 | 25.57 | 13.85 |
| Multihead attention | newstest2015 | 23.68 | 17.96 | 23.34 | 17.17 | 39.53 | 24.06 |
| Multihead attention | newstest2016 | 18.44 | 14.90 | 19.31 | 12.70 | 35.28 | 21.31 |
| Multihead attention | newstest2017 | 20.70 | 18.20 | 21.06 | 16.21 | 36.44 | 21.63 |
| Multihead attention (head size = n) | dev | 17.30 | 15.53 | 13.43 | 12.43 | 10.83 | 14.37 |
| Multihead attention (head size = n) | newstest2014 | 15.02 | 14.35 | 12.12 | 11.12 | 9.19 | 8.56 |
| Multihead attention (head size = n) | newstest2015 | 27.60 | 23.87 | 20.22 | 18.71 | 16.79 | 16.91 |
| Multihead attention (head size = n) | newstest2016 | 21.34 | 20.34 | 18.24 | 16.24 | 13.80 | 14.37 |
| Multihead attention (head size = n) | newstest2017 | 24.23 | 21.56 | 20.07 | 19.27 | 15.91 | 18.66 |
| Talking-head attention | dev | 28.37 | 43.43 | 27.00 | 34.33 | 27.57 | 27.51 |
| Talking-head attention | newstest2014 | 21.25 | 33.70 | 28.94 | 34.97 | 25.86 | 26.22 |
| Talking-head attention | newstest2015 | 38.03 | 55.35 | 37.95 | 47.21 | 42.48 | 46.87 |
| Talking-head attention | newstest2016 | 30.31 | 45.82 | 34.44 | 44.11 | 36.35 | 36.83 |
| Talking-head attention | newstest2017 | 31.01 | 47.75 | 35.77 | 41.63 | 35.86 | 35.76 |
| Interacting-head attention (our model) | dev | 7.63 | 7.27 | 8.03 | 7.25(3.05) | — | — |
| Interacting-head attention (our model) | newstest2014 | 6.99 | 6.43 | 7.16 | 6.21(2.41) | — | — |
| Interacting-head attention (our model) | newstest2015 | 12.99 | 12.09 | 11.94 | 12.08(5.09) | — | — |
| Interacting-head attention (our model) | newstest2016 | 10.14 | 9.80 | 9.70 | 9.83(2.87) | — | — |
| Interacting-head attention (our model) | newstest2017 | 12.48 | 12.95 | 11.71 | 12.24(3.97) | — | — |
| (c) METEOR |
| Multihead attention | dev | 17.37 | 19.52 | 19.16 | 19.55 | 18.85 | 17.92 |
| Multihead attention | newstest2014 | 18.41 | 20.82 | 20.42 | 20.80 | 19.92 | 18.93 |
| Multihead attention | newstest2015 | 16.69 | 18.90 | 18.51 | 18.77 | 18.17 | 17.32 |
| Multihead attention | newstest2016 | 17.21 | 19.59 | 19.05 | 19.36 | 18.44 | 17.45 |
| Multihead attention | newstest2017 | 15.65 | 17.96 | 17.51 | 17.85 | 17.13 | 16.26 |
| Multihead attention (head size = n) | dev | 16.09 | 17.62 | 18.65 | 18.91 | 19.68 | 18.10 |
| Multihead attention (head size = n) | newstest2014 | 17.04 | 18.82 | 19.69 | 20.09 | 20.88 | 20.06 |
| Multihead attention (head size = n) | newstest2015 | 15.30 | 16.68 | 18.08 | 18.16 | 19.05 | 18.37 |
| Multihead attention (head size = n) | newstest2016 | 15.85 | 17.42 | 18.41 | 18.62 | 19.39 | 18.92 |
| Multihead attention (head size = n) | newstest2017 | 14.70 | 16.00 | 16.99 | 17.10 | 19.99 | 17.42 |
| Talking-head attention | dev | 15.93 | 15.82 | 16.05 | 16.85 | 16.48 | 16.72 |
| Talking-head attention | newstest2014 | 16.86 | 16.58 | 16.33 | 17.04 | 16.76 | 16.76 |
| Talking-head attention | newstest2015 | 15.00 | 14.74 | 14.97 | 15.75 | 15.68 | 16.12 |
| Talking-head attention | newstest2016 | 15.59 | 15.08 | 15.07 | 15.87 | 15.50 | 15.85 |
| Talking-head attention | newstest2017 | 14.33 | 14.01 | 14.11 | 15.15 | 14.87 | 15.62 |
| Interacting-head attention | dev | 21.76 | 22.43 | 25.90 | 28.77(+9.22) | — | — |
| Interacting-head attention | newstest2014 | 23.30 | 24.28 | 28.07 | 30.62(+9.82) | — | — |
| Interacting-head attention | newstest2015 | 21.08 | 21.98 | 25.28 | 27.86(+9.09) | — | — |
| Interacting-head attention | newstest2016 | 21.99 | 22.95 | 21.98 | 22.57(+3.21) | — | — |
| Interacting-head attention | newstest2017 | 19.81 | 20.55 | 19.72 | 20.64(+2.79) | — | — |
| (d) ROUGE_L |
| Multihead attention | dev | 39.80 | 43.15 | 42.51 | 43.27 | 41.93 | 40.36 |
| Multihead attention | newstest2014 | 41.25 | 44.98 | 44.22 | 45.14 | 43.31 | 42.08 |
| Multihead attention | newstest2015 | 39.04 | 42.13 | 41.42 | 42.20 | 40.74 | 39.46 |
| Multihead attention | newstest2016 | 39.00 | 42.82 | 41.81 | 42.82 | 40.41 | 39.08 |
| Multihead attention | newstest2017 | 37.29 | 40.34 | 39.80 | 40.46 | 38.83 | 37.75 |
| Multihead attention (head size = n) | dev | 37.50 | 40.03 | 41.71 | 42.17 | 43.36 | 43.13 |
| Multihead attention (head size = n) | newstest2014 | 38.90 | 41.91 | 43.11 | 43.66 | 44.95 | 44.47 |
| Multihead attention (head size = n) | newstest2015 | 36.23 | 38.51 | 40.99 | 40.83 | 42.40 | 41.67 |
| Multihead attention (head size = n) | newstest2016 | 36.87 | 39.16 | 40.81 | 40.99 | 42.41 | 42.19 |
| Multihead attention (head size = n) | newstest2017 | 35.11 | 37.52 | 38.95 | 38.94 | 40.59 | 40.38 |
| Talking-head attention | dev | 36.01 | 37.56 | 38.58 | 40.74 | 41.37 | 40.95 |
| Talking-head attention | newstest2014 | 37.76 | 39.21 | 38.36 | 40.42 | 41.46 | 41.27 |
| Talking-head attention | newstest2015 | 34.23 | 35.22 | 36.80 | 39.02 | 40.06 | 40.81 |
| Talking-head attention | newstest2016 | 35.10 | 36.06 | 36.42 | 38.53 | 39.47 | 39.95 |
| Talking-head attention | newstest2017 | 33.50 | 34.71 | 35.46 | 37.76 | 38.62 | 38.41 |
| Interacting-head attention (our model) | dev | 46.07 | 45.90 | 46.87 | 47.08(+3.81) | — | — |
| Interacting-head attention (our model) | newstest2014 | 48.17 | 48.07 | 49.13 | 49.96(+4.82) | — | — |
| Interacting-head attention (our model) | newstest2015 | 45.22 | 45.28 | 46.01 | 46.44(+4.24) | — | — |
| Interacting-head attention (our model) | newstest2016 | 45.87 | 45.94 | 46.87 | 47.24(+4.42) | — | — |
| Interacting-head attention (our model) | newstest2017 | 42.83 | 42.93 | 43.65 | 44.12(+3.66) | — | — |
| (e) CIDEr |
| Multihead attention | dev | 1.42 | 1.38 | 1.37 | 1.40 | 1.32 | 1.20 |
| Multihead attention | newstest2014 | 1.24 | 1.53 | 1.47 | 1.56 | 1.44 | 1.30 |
| Multihead attention | newstest2015 | 1.09 | 1.30 | 1.27 | 1.32 | 1.24 | 1.11 |
| Multihead attention | newstest2016 | 1.10 | 1.37 | 1.30 | 1.37 | 1.23 | 1.11 |
| Multihead attention | newstest2017 | 1.02 | 1.22 | 1.20 | 1.25 | 1.44 | 1.06 |
| Multihead attention (head size = n) | dev | 1.01 | 1.18 | 1.29 | 1.32 | 1.42 | 1.44 |
| Multihead attention (head size = n) | newstest2014 | 1.07 | 1.29 | 1.39 | 1.44 | 1.53 | 1.53 |
| Multihead attention (head size = n) | newstest2015 | 0.91 | 1.06 | 1.22 | 1.22 | 1.32 | 1.31 |
| Multihead attention (head size = n) | newstest2016 | 0.96 | 1.12 | 1.23 | 1.27 | 1.35 | 1.37 |
| Multihead attention (head size = n) | newstest2017 | 0.88 | 1.03 | 1.14 | 1.35 | 1.25 | 1.28 |
| Talking-head attention | dev | 0.90 | 0.74 | 0.92 | 0.80 | 0.79 | |
| Talking-head attention | newstest2014 | 0.99 | 0.81 | 0.92 | 0.79 | 0.78 | 0.34 |
| Talking-head attention | newstest2015 | 0.79 | 0.63 | 0.90 | 0.70 | 0.70 | 0.32 |
| Talking-head attention | newstest2016 | 0.83 | 0.67 | 0.78 | 0.69 | 0.69 | 0.30 |
| Talking-head attention | newstest2017 | 0.78 | 0.61 | 0.77 | 0.68 | 0.67 | 0.33 |
| Interacting-head attention | dev | 1.63 | 1.74(+0.36) | 1.64 | | | |
| Interacting-head attention | newstest2014 | 1.79 | 1.97(+0.44) | 1.80 | | | |
| Interacting-head attention | newstest2015 | 1.55 | 1.65(+0.35) | 1.56 | | | |
| Interacting-head attention | newstest2016 | 1.62 | 1.75(+0.38) | 1.64 | | | |
| Interacting-head attention | newstest2017 | 1.43 | 1.55(+0.33) | 1.47 | | | |
| (f) YiSi |
| Multihead attention | dev | 28.64 | 32.37 | 31.74 | 32.12 | 31.41 | 29.07 |
| Multihead attention | newstest2014 | 30.37 | 34.58 | 33.72 | 34.47 | 33.02 | 31.14 |
| Multihead attention | newstest2015 | 27.67 | 31.15 | 30.47 | 30.93 | 30.13 | 28.24 |
| Multihead attention | newstest2016 | 28.10 | 32.11 | 31.22 | 31.89 | 30.07 | 28.22 |
| Multihead attention | newstest2017 | 26.20 | 29.89 | 29.24 | 29.54 | 28.53 | 26.80 |
| Multihead attention (head size = n) | dev | 26.65 | 28.94 | 30.62 | 31.06 | 32.26 | 32.57 |
| Multihead attention (head size = n) | newstest2014 | 28.00 | 31.21 | 32.42 | 32.99 | 34.23 | 34.23 |
| Multihead attention (head size = n) | newstest2015 | 25.02 | 27.25 | 29.69 | 29.68 | 31.07 | 31.05 |
| Multihead attention (head size = n) | newstest2016 | 25.80 | 28.29 | 30.02 | 30.27 | 31.50 | 31.75 |
| Multihead attention (head size = n) | newstest2017 | 24.27 | 26.59 | 28.11 | 28.11 | 29.76 | 30.05 |
| Talking-head attention | dev | 25.19 | 27.48 | 29.82 | 30.23 | 31.50 | 31.47 |
| Talking-head attention | newstest2014 | 27.12 | 29.28 | 29.94 | 30.20 | 31.75 | 31.82 |
| Talking-head attention | newstest2015 | 23.60 | 25.37 | 27.98 | 28.57 | 30.17 | 30.55 |
| Talking-head attention | newstest2016 | 24.51 | 26.37 | 28.09 | 28.39 | 29.68 | 29.64 |
| Talking-head attention | newstest2017 | 22.79 | 24.89 | 26.90 | 27.37 | 28.81 | 28.88 |
| Interacting-head attention | dev | 35.34 | 35.09 | 36.50 | 36.51(+4.14) | — | — |
| Interacting-head attention | newstest2014 | 37.77 | 37.59 | 38.75 | 39.83(+5.25) | — | — |
| Interacting-head attention | newstest2015 | 34.32 | 34.25 | 35.17 | 35.48(+3.97) | — | — |
| Interacting-head attention | newstest2016 | 35.28 | 35.39 | 35.68 | 36.94(+4.83) | — | — |
| Interacting-head attention | newstest2017 | 32.40 | 32.58 | 34.81 | 33.89(+4.00) | — | — |
Note. The metrics reported are (a) BLEU, (b) WER, (c) METEOR, (d) ROUGE_L, (e) CIDEr, and (f) YiSi. WER is an error rate, so lower values are better; for the other metrics, higher values are better.
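The table does not state how the parenthesized values are computed. For panel (a), they are consistent with the difference between the interacting-head score and the Multihead attention score at the same 16 heads/head size 32 setting. The following minimal sketch checks that reading using the BLEU values copied from the table; the variable and function names are illustrative, and the interpretation is an assumption rather than a statement from the authors.

```python
from decimal import Decimal

# Panel (a) BLEU scores at the 16 heads / head size 32 setting, copied from Table 5.
SUBSETS = ["dev", "newstest2014", "newstest2015", "newstest2016", "newstest2017"]
multihead_16_32 = ["14.14", "15.52", "12.62", "14.32", "12.36"]
interacting_16_32 = ["18.01", "20.14", "16.40", "18.74", "15.78"]


def gains(ours, baseline):
    """Return per-subset differences (ours - baseline) as exact decimals."""
    return [Decimal(o) - Decimal(b) for o, b in zip(ours, baseline)]


# Assumption: the parenthesized values are gains over the same-column baseline.
bleu_gains = dict(zip(SUBSETS, gains(interacting_16_32, multihead_16_32)))

for subset, gain in bleu_gains.items():
    print(f"{subset}: {gain:+.2f}")
```

Decimal arithmetic is used so that two-decimal scores subtract exactly, without binary floating-point rounding noise.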