Research Article

An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Table 4

Overall evaluation scores of four attention-based models on the WMT17 EN-DE evaluation and test sets.

Columns: model, WMT17 subset, and number of heads/head size (2/256, 4/128, 8/64, 16/32, 32/16, 64/8).

(a)

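| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|---|
| Multihead attention | dev | 8.38 | 12.56 | 12.48 | 15.15 | 12.20 | 12.45 |
| | newstest2014 | 8.77 | 11.31 | 12.85 | 15.73 | 11.61 | 12.28 |
| | newstest2015 | 9.36 | 12.27 | 13.80 | 16.79 | 11.90 | 13.23 |
| | newstest2016 | 11.40 | 14.83 | 16.79 | 20.97 | 15.17 | 16.16 |
| | newstest2017 | 8.83 | 11.83 | 13.15 | 16.41 | 11.63 | 12.99 |
| Multihead attention (head size = n) | dev | 11.66 | 12.51 | 13.70 | 14.57 | 14.61 | 15.29 |
| | newstest2014 | 11.64 | 13.26 | 14.93 | 15.42 | 14.96 | 15.00 |
| | newstest2015 | 12.04 | 14.07 | 15.78 | 16.57 | 16.67 | 17.57 |
| | newstest2016 | 14.71 | 17.57 | 19.77 | 20.33 | 19.80 | 20.08 |
| | newstest2017 | 11.72 | 13.54 | 14.92 | 16.03 | 16.07 | 16.33 |
| Talking-head attention | dev | 10.90 | 13.89 | 15.31 | 15.37 | 15.23 | 15.50 |
| | newstest2014 | 11.21 | 12.64 | 14.33 | 15.28 | 15.29 | 15.39 |
| | newstest2015 | 11.85 | 12.81 | 14.61 | 15.95 | 15.55 | 15.36 |
| | newstest2016 | 14.22 | 16.33 | 18.84 | 19.53 | 19.34 | 20.43 |
| | newstest2017 | 11.15 | 12.81 | 14.35 | 15.49 | 14.44 | 15.36 |
| Interacting-head attention | dev | 13.07 | 14.84 | 14.53 | 17.46 (+2.31) | | |
| | newstest2014 | 10.91 | 14.72 | 15.75 | 17.35 (+1.62) | | |
| | newstest2015 | 11.99 | 15.43 | 16.90 | 18.40 (+1.21) | | |
| | newstest2016 | 14.95 | 19.61 | 21.12 | 22.36 (+1.39) | | |
| | newstest2017 | 11.09 | 15.00 | 17.30 | 17.67 (+1.26) | | |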

(b)

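| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|---|
| Multihead attention | dev | 39.23 | 37.80 | 28.23 | 15.57 | 32.93 | 22.47 |
| | newstest2014 | 46.85 | 45.59 | 32.10 | 19.68 | 39.59 | 28.80 |
| | newstest2015 | 45.22 | 44.31 | 32.64 | 22.68 | 42.78 | 28.58 |
| | newstest2016 | 40.14 | 39.35 | 28.51 | 15.91 | 32.04 | 23.97 |
| | newstest2017 | 43.85 | 42.74 | 32.12 | 19.07 | 38.38 | 26.50 |
| Multihead attention (head size = n) | dev | 39.37 | 36.73 | 25.11 | 22.97 | 16.67 | 13.42 |
| | newstest2014 | 46.40 | 35.77 | 32.56 | 28.65 | 18.56 | 18.77 |
| | newstest2015 | 46.93 | 38.45 | 33.16 | 28.03 | 23.13 | 20.07 |
| | newstest2016 | 42.91 | 30.24 | 27.66 | 22.77 | 17.15 | 13.26 |
| | newstest2017 | 44.65 | 34.60 | 29.82 | 26.78 | 19.75 | 16.52 |
| Talking-head attention | dev | 48.17 | 40.63 | 39.03 | 24.80 | 20.13 | 16.25 |
| | newstest2014 | 48.19 | 48.02 | 52.98 | 33.93 | 27.58 | 20.32 |
| | newstest2015 | 61.09 | 53.80 | 52.33 | 33.43 | 31.13 | 22.05 |
| | newstest2016 | 47.02 | 47.28 | 47.42 | 31.08 | 23.16 | 14.82 |
| | newstest2017 | 51.13 | 47.34 | 50.63 | 34.52 | 28.01 | 20.05 |
| Interacting-head attention | dev | 58.10 | 35.20 | 20.83 | 9.63 (5.94) | | |
| | newstest2014 | 68.39 | 47.29 | 29.01 | 13.64 (6.04) | | |
| | newstest2015 | 79.39 | 48.55 | 29.64 | 16.05 (6.63) | | |
| | newstest2016 | 67.12 | 44.21 | 25.94 | 11.27 (4.64) | | |
| | newstest2017 | 81.56 | 49.07 | 30.89 | 15.23 (3.84) | | |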

(c)

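| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|---|
| Multihead attention | dev | 19.36 | 19.54 | 20.31 | 23.16 | 20.56 | 20.53 |
| | newstest2014 | 19.39 | 19.64 | 20.83 | 23.71 | 20.72 | 20.97 |
| | newstest2015 | 19.36 | 19.60 | 21.15 | 24.11 | 20.70 | 21.06 |
| | newstest2016 | 20.43 | 21.27 | 22.73 | 26.29 | 22.44 | 22.63 |
| | newstest2017 | 19.52 | 19.85 | 21.05 | 23.96 | 20.70 | 21.11 |
| Multihead attention (head size = n) | dev | 19.47 | 20.96 | 21.04 | 23.01 | 23.32 | 23.79 |
| | newstest2014 | 19.69 | 21.45 | 22.30 | 23.52 | 23.91 | 24.45 |
| | newstest2015 | 19.64 | 21.54 | 22.61 | 23.70 | 24.02 | 24.46 |
| | newstest2016 | 20.95 | 23.26 | 24.47 | 25.65 | 25.44 | 26.99 |
| | newstest2017 | 19.68 | 21.50 | 22.25 | 23.73 | 24.05 | 24.49 |
| Talking-head attention | dev | 17.71 | 19.55 | 21.38 | 23.22 | 23.51 | 23.19 |
| | newstest2014 | 17.73 | 19.59 | 21.87 | 23.65 | 23.73 | 23.75 |
| | newstest2015 | 17.33 | 19.39 | 22.03 | 23.81 | 24.14 | 24.26 |
| | newstest2016 | 18.60 | 20.52 | 23.19 | 25.40 | 26.02 | 25.88 |
| | newstest2017 | 17.74 | 19.39 | 21.90 | 23.71 | 23.66 | 24.15 |
| Interacting-head attention | dev | 22.43 | 22.61 | 23.87 | 24.62 (+1.46) | | |
| | newstest2014 | 22.67 | 22.66 | 24.35 | 25.10 (+1.39) | | |
| | newstest2015 | 22.81 | 23.09 | 24.52 | 25.53 (+1.42) | | |
| | newstest2016 | 24.69 | 25.00 | 26.68 | 27.27 (+0.98) | | |
| | newstest2017 | 22.12 | 22.76 | 24.38 | 25.66 (+1.70) | | |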

(d)

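| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|---|
| Multihead attention | dev | 40.08 | 40.19 | 41.54 | 45.77 | 42.47 | 42.79 |
| | newstest2014 | 36.65 | 37.81 | 40.22 | 49.04 | 40.40 | 41.18 |
| | newstest2015 | 37.33 | 39.63 | 42.56 | 46.84 | 41.69 | 42.71 |
| | newstest2016 | 39.88 | 41.98 | 44.40 | 44.88 | 44.52 | 45.13 |
| | newstest2017 | 36.94 | 39.01 | 41.43 | 46.09 | 41.02 | 42.02 |
| Multihead attention (head size = n) | dev | 41.39 | 43.67 | 44.89 | 45.79 | 45.83 | 46.01 |
| | newstest2014 | 39.75 | 42.35 | 43.62 | 44.69 | 45.07 | 47.25 |
| | newstest2015 | 40.76 | 43.37 | 45.54 | 46.18 | 46.22 | 48.15 |
| | newstest2016 | 43.18 | 46.51 | 48.21 | 49.34 | 48.96 | 50.08 |
| | newstest2017 | 40.41 | 43.05 | 44.87 | 45.52 | 45.55 | 46.08 |
| Talking-head attention | dev | 38.42 | 40.20 | 42.03 | 43.99 | 44.67 | 45.33 |
| | newstest2014 | 35.67 | 39.23 | 42.51 | 42.74 | 45.86 | 46.09 |
| | newstest2015 | 37.94 | 38.61 | 41.00 | 44.15 | 44.92 | 46.67 |
| | newstest2016 | 38.83 | 41.21 | 42.88 | 46.28 | 48.05 | 46.99 |
| | newstest2017 | 37.29 | 39.58 | 40.93 | 43.30 | 45.23 | 45.65 |
| Interacting-head attention | dev | 45.47 | 46.03 | 46.54 | 47.12 (+1.35) | | |
| | newstest2014 | 43.95 | 43.77 | 45.15 | 48.20 | | |
| | newstest2015 | 45.33 | 45.83 | 46.61 | 47.31 (+0.51) | | |
| | newstest2016 | 47.97 | 48.83 | 49.94 | 50.47 (+5.59) | | |
| | newstest2017 | 43.61 | 44.58 | 45.91 | 45.99 | | |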

(e)

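| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|---|
| Multihead attention | dev | 1.26 | 1.30 | 1.37 | 1.64 | 1.37 | 1.38 |
| | newstest2014 | 1.13 | 1.19 | 1.26 | 1.52 | 1.26 | 1.26 |
| | newstest2015 | 1.24 | 1.31 | 1.42 | 1.63 | 1.36 | 1.36 |
| | newstest2016 | 1.39 | 1.50 | 1.62 | 1.93 | 1.61 | 1.60 |
| | newstest2017 | 1.17 | 1.22 | 1.35 | 1.61 | 1.29 | 1.33 |
| Multihead attention (head size = n) | dev | 1.23 | 1.38 | 1.45 | 1.63 | 1.63 | 1.66 |
| | newstest2014 | 1.13 | 1.30 | 1.39 | 1.54 | 1.55 | 1.58 |
| | newstest2015 | 1.19 | 1.37 | 1.45 | 1.67 | 1.68 | 1.65 |
| | newstest2016 | 1.38 | 1.65 | 1.80 | 1.94 | 1.85 | 1.99 |
| | newstest2017 | 1.16 | 1.35 | 1.43 | 1.60 | 1.71 | 1.70 |
| Talking-head attention | dev | 1.06 | 1.20 | 1.38 | 1.42 | 1.49 | 1.58 |
| | newstest2014 | 0.96 | 1.11 | 1.35 | 1.38 | 1.46 | 1.51 |
| | newstest2015 | 1.04 | 1.20 | 1.37 | 1.39 | 1.36 | 1.39 |
| | newstest2016 | 1.16 | 1.40 | 1.67 | 1.45 | 1.48 | 1.62 |
| | newstest2017 | 0.98 | 1.15 | 1.43 | 1.39 | 1.48 | 1.47 |
| Interacting-head attention | dev | 1.63 | 1.64 | 1.67 | 1.71 (+0.07) | | |
| | newstest2014 | 1.43 | 1.51 | 1.57 | 1.63 (+0.11) | | |
| | newstest2015 | 1.63 | 1.61 | 1.74 | 1.81 (+0.18) | | |
| | newstest2016 | 1.89 | 1.88 | 2.02 | 2.17 (+0.24) | | |
| | newstest2017 | 1.52 | 1.50 | 1.65 | 1.74 (+0.13) | | |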

(f)

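| Model | WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|---|
| Multihead attention | dev | 43.74 | 44.71 | 45.35 | 48.82 | 46.28 | 45.37 |
| | newstest2014 | 41.50 | 42.57 | 43.80 | 47.84 | 44.44 | 43.77 |
| | newstest2015 | 43.17 | 44.10 | 45.99 | 49.68 | 45.81 | 45.49 |
| | newstest2016 | 44.90 | 45.96 | 47.37 | 52.29 | 47.86 | 47.13 |
| | newstest2017 | 42.11 | 43.36 | 44.99 | 48.71 | 44.91 | 44.71 |
| Multihead attention (head size = n) | dev | 44.58 | 46.00 | 47.06 | 48.51 | 48.99 | 49.01 |
| | newstest2014 | 43.09 | 44.63 | 45.63 | 47.40 | 48.04 | 47.99 |
| | newstest2015 | 44.37 | 46.09 | 47.54 | 48.83 | 49.15 | 49.56 |
| | newstest2016 | 46.24 | 48.30 | 49.65 | 51.20 | 51.30 | 51.06 |
| | newstest2017 | 43.82 | 45.47 | 46.38 | 48.34 | 48.55 | 48.77 |
| Talking-head attention | dev | 38.11 | 39.63 | 40.73 | 44.07 | 46.14 | 48.60 |
| | newstest2014 | 35.46 | 38.32 | 39.02 | 42.77 | 44.52 | 47.33 |
| | newstest2015 | 37.48 | 38.71 | 40.35 | 44.35 | 46.84 | 49.52 |
| | newstest2016 | 37.09 | 40.25 | 41.43 | 45.96 | 48.36 | 50.70 |
| | newstest2017 | 36.20 | 38.93 | 39.47 | 43.75 | 45.85 | 48.05 |
| Interacting-head attention | dev | 48.03 | 48.23 | 48.78 | 49.15 (+0.33) | | |
| | newstest2014 | 46.72 | 46.63 | 47.78 | 48.71 (+0.87) | | |
| | newstest2015 | 48.11 | 48.52 | 49.09 | 50.20 (+0.52) | | |
| | newstest2016 | 50.88 | 50.77 | 51.47 | 52.71 (+0.42) | | |
| | newstest2017 | 47.73 | 48.48 | 49.83 | 50.01 (+1.30) | | |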

Note. The metrics reported in the panels are (a) BLEU, (b) WER, (c) METEOR, (d) ROUGE_L, (e) CIDEr, and (f) YiSi.
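The table does not say how the parenthesized values in the interacting-head rows are obtained. As a hedged reading, they appear to be the gap between the last reported interacting-head score and the multihead attention score at 16 heads/head size 32. The minimal sketch below checks that reading on the BLEU panel; the dictionary names and the `gains` helper are illustrative assumptions, and the numbers are copied from panel (a). Under this assumption the subtraction reproduces the printed gains for dev, newstest2014, newstest2016, and newstest2017, while for newstest2015 it gives 1.61 where the table prints +1.21.

```python
# BLEU at 16 heads / head size 32, copied from panel (a) of Table 4.
# Assumption: the last interacting-head value belongs to the 16/32 column.
multihead = {
    "dev": 15.15,
    "newstest2014": 15.73,
    "newstest2015": 16.79,
    "newstest2016": 20.97,
    "newstest2017": 16.41,
}
interacting = {
    "dev": 17.46,
    "newstest2014": 17.35,
    "newstest2015": 18.40,
    "newstest2016": 22.36,
    "newstest2017": 17.67,
}


def gains(baseline, model):
    """Return per-subset score differences (model minus baseline), rounded to 2 decimals."""
    return {subset: round(model[subset] - baseline[subset], 2) for subset in baseline}


bleu_gains = gains(multihead, interacting)
for subset, gain in bleu_gains.items():
    print(f"{subset}: {gain:+.2f}")
```

For WER in panel (b), where lower is better, the same check would subtract in the opposite direction (baseline minus model).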