Research Article

An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Table 5

Overall evaluation scores of the four attention-based models on the WMT17 EN-CS development set and test sets.

WMT17 subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8
(Column headings give the number of heads/head size for each configuration.)

(a) BLEU

Multihead attention
dev | 11.69 | 13.96 | 13.76 | 14.14 | 12.10 | 11.98
newstest2014 | 12.90 | 15.65 | 14.82 | 15.52 | 13.26 | 12.71
newstest2015 | 11.03 | 12.48 | 11.66 | 12.62 | 10.04 | 10.08
newstest2016 | 11.98 | 13.85 | 13.09 | 14.32 | 11.16 | 10.85
newstest2017 | 10.26 | 12.45 | 11.71 | 12.36 | 9.96 | 10.14

Multihead attention (head size = n)
dev | 10.14 | 11.87 | 12.99 | 13.35 | 14.36 | 14.46
newstest2014 | 11.32 | 13.34 | 14.04 | 14.49 | 15.79 | 15.28
newstest2015 | 9.11 | 10.18 | 11.55 | 11.67 | 12.69 | 12.05
newstest2016 | 10.23 | 11.86 | 12.51 | 12.62 | 13.73 | 13.83
newstest2017 | 8.99 | 10.43 | 11.30 | 11.23 | 12.54 | 12.47

Talking-head attention
dev | 9.80 | 10.06 | 11.77 | 12.06 | 12.75 | 12.68
newstest2014 | 10.90 | 10.65 | 11.65 | 12.33 | 12.94 | 12.34
newstest2015 | 8.17 | 8.54 | 10.06 | 11.70 | 11.51 | 11.57
newstest2016 | 9.11 | 9.38 | 10.50 | 11.25 | 11.78 | 11.73
newstest2017 | 8.50 | 8.63 | 9.92 | 11.05 | 11.52 | 11.83

Interacting-head attention (our model)
dev | 17.01 | 17.76 | 17.93 | 18.01 (+3.87) | – | –
newstest2014 | 18.71 | 19.48 | 20.01 | 20.14 (+4.62) | – | –
newstest2015 | 15.52 | 16.49 | 16.38 | 16.40 (+3.78) | – | –
newstest2016 | 17.20 | 18.37 | 18.29 | 18.74 (+4.42) | – | –
newstest2017 | 14.66 | 15.66 | 15.79 | 15.78 (+3.42) | – | –

(b) WER

Multihead attention
dev | 15.13 | 12.33 | 15.27 | 10.30 | 29.60 | 16.80
newstest2014 | 13.95 | 11.66 | 14.15 | 8.62 | 25.57 | 13.85
newstest2015 | 23.68 | 17.96 | 23.34 | 17.17 | 39.53 | 24.06
newstest2016 | 18.44 | 14.90 | 19.31 | 12.70 | 35.28 | 21.31
newstest2017 | 20.70 | 18.20 | 21.06 | 16.21 | 36.44 | 21.63

Multihead attention (head size = n)
dev | 17.30 | 15.53 | 13.43 | 12.43 | 10.83 | 14.37
newstest2014 | 15.02 | 14.35 | 12.12 | 11.12 | 9.19 | 8.56
newstest2015 | 27.60 | 23.87 | 20.22 | 18.71 | 16.79 | 16.91
newstest2016 | 21.34 | 20.34 | 18.24 | 16.24 | 13.80 | 14.37
newstest2017 | 24.23 | 21.56 | 20.07 | 19.27 | 15.91 | 18.66

Talking-head attention
dev | 28.37 | 43.43 | 27.00 | 34.33 | 27.57 | 27.51
newstest2014 | 21.25 | 33.70 | 28.94 | 34.97 | 25.86 | 26.22
newstest2015 | 38.03 | 55.35 | 37.95 | 47.21 | 42.48 | 46.87
newstest2016 | 30.31 | 45.82 | 34.44 | 44.11 | 36.35 | 36.83
newstest2017 | 31.01 | 47.75 | 35.77 | 41.63 | 35.86 | 35.76

Interacting-head attention (our model)
dev | 7.63 | 7.27 | 8.03 | 7.25 (3.05) | – | –
newstest2014 | 6.99 | 6.43 | 7.16 | 6.21 (2.41) | – | –
newstest2015 | 12.99 | 12.09 | 11.94 | 12.08 (5.09) | – | –
newstest2016 | 10.14 | 9.80 | 9.70 | 9.83 (2.87) | – | –
newstest2017 | 12.48 | 12.95 | 11.71 | 12.24 (3.97) | – | –

(c) METEOR

Multihead attention
dev | 17.37 | 19.52 | 19.16 | 19.55 | 18.85 | 17.92
newstest2014 | 18.41 | 20.82 | 20.42 | 20.80 | 19.92 | 18.93
newstest2015 | 16.69 | 18.90 | 18.51 | 18.77 | 18.17 | 17.32
newstest2016 | 17.21 | 19.59 | 19.05 | 19.36 | 18.44 | 17.45
newstest2017 | 15.65 | 17.96 | 17.51 | 17.85 | 17.13 | 16.26

Multihead attention (head size = n)
dev | 16.09 | 17.62 | 18.65 | 18.91 | 19.68 | 18.10
newstest2014 | 17.04 | 18.82 | 19.69 | 20.09 | 20.88 | 20.06
newstest2015 | 15.30 | 16.68 | 18.08 | 18.16 | 19.05 | 18.37
newstest2016 | 15.85 | 17.42 | 18.41 | 18.62 | 19.39 | 18.92
newstest2017 | 14.70 | 16.00 | 16.99 | 17.10 | 19.99 | 17.42

Talking-head attention
dev | 15.93 | 15.82 | 16.05 | 16.85 | 16.48 | 16.72
newstest2014 | 16.86 | 16.58 | 16.33 | 17.04 | 16.76 | 16.76
newstest2015 | 15.00 | 14.74 | 14.97 | 15.75 | 15.68 | 16.12
newstest2016 | 15.59 | 15.08 | 15.07 | 15.87 | 15.50 | 15.85
newstest2017 | 14.33 | 14.01 | 14.11 | 15.15 | 14.87 | 15.62

Interacting-head attention (our model)
dev | 21.76 | 22.43 | 25.90 | 28.77 (+9.22) | – | –
newstest2014 | 23.30 | 24.28 | 28.07 | 30.62 (+9.82) | – | –
newstest2015 | 21.08 | 21.98 | 25.28 | 27.86 (+9.09) | – | –
newstest2016 | 21.99 | 22.95 | 21.98 | 22.57 (+3.21) | – | –
newstest2017 | 19.81 | 20.55 | 19.72 | 20.64 (+2.79) | – | –

(d) ROUGE_L

Multihead attention
dev | 39.80 | 43.15 | 42.51 | 43.27 | 41.93 | 40.36
newstest2014 | 41.25 | 44.98 | 44.22 | 45.14 | 43.31 | 42.08
newstest2015 | 39.04 | 42.13 | 41.42 | 42.20 | 40.74 | 39.46
newstest2016 | 39.00 | 42.82 | 41.81 | 42.82 | 40.41 | 39.08
newstest2017 | 37.29 | 40.34 | 39.80 | 40.46 | 38.83 | 37.75

Multihead attention (head size = n)
dev | 37.50 | 40.03 | 41.71 | 42.17 | 43.36 | 43.13
newstest2014 | 38.90 | 41.91 | 43.11 | 43.66 | 44.95 | 44.47
newstest2015 | 36.23 | 38.51 | 40.99 | 40.83 | 42.40 | 41.67
newstest2016 | 36.87 | 39.16 | 40.81 | 40.99 | 42.41 | 42.19
newstest2017 | 35.11 | 37.52 | 38.95 | 38.94 | 40.59 | 40.38

Talking-head attention
dev | 36.01 | 37.56 | 38.58 | 40.74 | 41.37 | 40.95
newstest2014 | 37.76 | 39.21 | 38.36 | 40.42 | 41.46 | 41.27
newstest2015 | 34.23 | 35.22 | 36.80 | 39.02 | 40.06 | 40.81
newstest2016 | 35.10 | 36.06 | 36.42 | 38.53 | 39.47 | 39.95
newstest2017 | 33.50 | 34.71 | 35.46 | 37.76 | 38.62 | 38.41

Interacting-head attention (our model)
dev | 46.07 | 45.90 | 46.87 | 47.08 (+3.81) | – | –
newstest2014 | 48.17 | 48.07 | 49.13 | 49.96 (+4.82) | – | –
newstest2015 | 45.22 | 45.28 | 46.01 | 46.44 (+4.24) | – | –
newstest2016 | 45.87 | 45.94 | 46.87 | 47.24 (+4.42) | – | –
newstest2017 | 42.83 | 42.93 | 43.65 | 44.12 (+3.66) | – | –

(e) CIDEr

Multihead attention
dev | 1.42 | 1.38 | 1.37 | 1.40 | 1.32 | 1.20
newstest2014 | 1.24 | 1.53 | 1.47 | 1.56 | 1.44 | 1.30
newstest2015 | 1.09 | 1.30 | 1.27 | 1.32 | 1.24 | 1.11
newstest2016 | 1.10 | 1.37 | 1.30 | 1.37 | 1.23 | 1.11
newstest2017 | 1.02 | 1.22 | 1.20 | 1.25 | 1.44 | 1.06

Multihead attention (head size = n)
dev | 1.01 | 1.18 | 1.29 | 1.32 | 1.42 | 1.44
newstest2014 | 1.07 | 1.29 | 1.39 | 1.44 | 1.53 | 1.53
newstest2015 | 0.91 | 1.06 | 1.22 | 1.22 | 1.32 | 1.31
newstest2016 | 0.96 | 1.12 | 1.23 | 1.27 | 1.35 | 1.37
newstest2017 | 0.88 | 1.03 | 1.14 | 1.35 | 1.25 | 1.28

Talking-head attention
dev | 0.90 | 0.74 | 0.92 | 0.80 | 0.79 | –
newstest2014 | 0.99 | 0.81 | 0.92 | 0.79 | 0.78 | 0.34
newstest2015 | 0.79 | 0.63 | 0.90 | 0.70 | 0.70 | 0.32
newstest2016 | 0.83 | 0.67 | 0.78 | 0.69 | 0.69 | 0.30
newstest2017 | 0.78 | 0.61 | 0.77 | 0.68 | 0.67 | 0.33

Interacting-head attention (our model)
dev | 1.63 | 1.74 (+0.36) | 1.64 | – | – | –
newstest2014 | 1.79 | 1.97 (+0.44) | 1.80 | – | – | –
newstest2015 | 1.55 | 1.65 (+0.35) | 1.56 | – | – | –
newstest2016 | 1.62 | 1.75 (+0.38) | 1.64 | – | – | –
newstest2017 | 1.43 | 1.55 (+0.33) | 1.47 | – | – | –

(f) YiSi

Multihead attention
dev | 28.64 | 32.37 | 31.74 | 32.12 | 31.41 | 29.07
newstest2014 | 30.37 | 34.58 | 33.72 | 34.47 | 33.02 | 31.14
newstest2015 | 27.67 | 31.15 | 30.47 | 30.93 | 30.13 | 28.24
newstest2016 | 28.10 | 32.11 | 31.22 | 31.89 | 30.07 | 28.22
newstest2017 | 26.20 | 29.89 | 29.24 | 29.54 | 28.53 | 26.80

Multihead attention (head size = n)
dev | 26.65 | 28.94 | 30.62 | 31.06 | 32.26 | 32.57
newstest2014 | 28.00 | 31.21 | 32.42 | 32.99 | 34.23 | 34.23
newstest2015 | 25.02 | 27.25 | 29.69 | 29.68 | 31.07 | 31.05
newstest2016 | 25.80 | 28.29 | 30.02 | 30.27 | 31.50 | 31.75
newstest2017 | 24.27 | 26.59 | 28.11 | 28.11 | 29.76 | 30.05

Talking-head attention
dev | 25.19 | 27.48 | 29.82 | 30.23 | 31.50 | 31.47
newstest2014 | 27.12 | 29.28 | 29.94 | 30.20 | 31.75 | 31.82
newstest2015 | 23.60 | 25.37 | 27.98 | 28.57 | 30.17 | 30.55
newstest2016 | 24.51 | 26.37 | 28.09 | 28.39 | 29.68 | 29.64
newstest2017 | 22.79 | 24.89 | 26.90 | 27.37 | 28.81 | 28.88

Interacting-head attention (our model)
dev | 35.34 | 35.09 | 36.50 | 36.51 (+4.14) | – | –
newstest2014 | 37.77 | 37.59 | 38.75 | 39.83 (+5.25) | – | –
newstest2015 | 34.32 | 34.25 | 35.17 | 35.48 (+3.97) | – | –
newstest2016 | 35.28 | 35.39 | 35.68 | 36.94 (+4.83) | – | –
newstest2017 | 32.40 | 32.58 | 34.81 | 33.89 (+4.00) | – | –

Note. The evaluation metrics are (a) BLEU, (b) WER, (c) METEOR, (d) ROUGE_L, (e) CIDEr, and (f) YiSi. A dash (–) marks head configurations for which no score is reported; values in parentheses are the improvements of the interacting-head model over the multihead attention baseline.
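Scores of this kind are usually produced with standard MT evaluation toolkits. As a point of reference, the sketch below shows how corpus-level BLEU and WER, the metrics of parts (a) and (b), can be computed with the sacrebleu and jiwer Python packages. It is a minimal illustration on hypothetical files (hyps.txt and refs.txt are placeholder names), not the evaluation pipeline used for Table 5; METEOR, ROUGE_L, CIDEr, and YiSi require their own tools.

```python
# Illustrative only -- not the evaluation pipeline used in this paper.
# Computes corpus-level BLEU (sacrebleu) and WER (jiwer) for a pair of
# hypothetical line-aligned text files.
import sacrebleu
from jiwer import wer

# "hyps.txt" and "refs.txt" are placeholder file names: one detokenized
# sentence per line, hypotheses aligned with references line by line.
with open("hyps.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("refs.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# sacrebleu expects the references as a list of reference streams
# (one stream per reference set); here there is a single reference set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# jiwer returns the overall word error rate across all aligned pairs
# (total edit operations divided by total reference words).
print(f"WER:  {100.0 * wer(references, hypotheses):.2f}")
```

Both scores are reported on a 0-100 scale, matching the units of Table 5(a) and 5(b).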