Research Article

An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Table 3

Overall evaluation scores of four attention-based models on the IWSLT16 DE-EN development (dev) and test sets. Columns give the number of heads/head size.

(a) BLEU

Model                                               Set    2/256   4/128   8/64    16/32           32/16   64/8
Multihead attention                                 dev    22.73   25.71   26.44   27.23           23.92   19.50
                                                    test   21.15   24.00   24.38   25.18           22.05   17.94
Multihead attention (head size = n, dmodel = 512)   dev    20.93   23.08   24.79   26.00           26.94   26.38
                                                    test   19.35   21.38   22.81   24.61           24.86   24.59
Talking-head attention                              dev    20.66   21.39   22.47   24.02           24.21   23.65
                                                    test   19.13   19.82   20.87   22.24           22.30   21.72
Interacting-head attention                          dev    29.59   29.94   29.54   30.01 (+2.78)   —       —
                                                    test   27.61   28.01   27.44   27.61 (+2.43)   —       —

(b) WER

Model                                               Set    2/256   4/128   8/64    16/32           32/16   64/8
Multihead attention                                 dev    5.82    4.85    3.56    3.10            3.51    8.73
                                                    test   9.03    7.92    7.71    5.87            7.02    14.39
Multihead attention (head size = n, dmodel = 512)   dev    6.90    6.00    5.29    4.59            4.55    3.97
                                                    test   11.47   10.10   8.98    8.16            8.11    7.96
Talking-head attention                              dev    9.03    7.31    6.58    4.81            4.51    4.33
                                                    test   14.46   11.90   9.83    9.10            8.69    8.90
Interacting-head attention                          dev    2.41    2.44    2.31    2.25 (0.85)     —       —
                                                    test   6.60    5.34    4.77    4.49 (1.38)     —       —

(c) METEOR

Model                                               Set    2/256   4/128   8/64    16/32           32/16   64/8
Multihead attention                                 dev    25.42   26.72   28.58   29.14           26.42   22.63
                                                    test   24.77   25.96   27.67   28.20           25.57   22.02
Multihead attention (head size = n, dmodel = 512)   dev    23.47   25.69   25.76   26.51           27.05   27.12
                                                    test   22.84   24.95   24.95   25.88           25.28   26.36
Talking-head attention                              dev    22.48   24.28   23.80   24.46           25.28   26.76
                                                    test   21.85   23.46   23.18   23.94           24.49   25.85
Interacting-head attention                          dev    31.73   31.94   31.65   32.04 (+2.90)   —       —
                                                    test   30.87   31.11   30.76   31.25 (+3.05)   —       —

(d) ROUGE_L

Model                                               Set    2/256   4/128   8/64    16/32           32/16   64/8
Multihead attention                                 dev    53.72   55.68   57.85   58.76           55.93   50.52
                                                    test   52.23   54.30   56.23   57.26           54.57   49.24
Multihead attention (head size = n, dmodel = 512)   dev    51.46   54.02   55.83   55.52           56.53   56.99
                                                    test   50.21   52.70   55.44   54.31           54.18   55.60
Talking-head attention                              dev    49.26   51.64   52.38   53.72           54.92   55.78
                                                    test   48.23   50.19   51.19   52.30           53.33   54.45
Interacting-head attention                          dev    61.02   61.29   61.14   61.41 (+2.65)   —       —
                                                    test   59.25   59.80   59.62   59.96 (+2.70)   —       —

(e) CIDEr

Model                                               Set    2/256   4/128   8/64    16/32           32/16   64/8
Multihead attention                                 dev    1.94    2.39    2.61    2.69            2.39    1.89
                                                    test   1.80    2.20    2.39    2.49            2.22    1.75
Multihead attention (head size = n, dmodel = 512)   dev    1.99    2.20    2.24    2.27            2.33    2.38
                                                    test   1.84    2.05    2.08    2.10            2.16    2.21
Talking-head attention                              dev    1.70    1.87    1.84    2.09            2.20    2.32
                                                    test   1.57    1.72    1.67    1.91            2.01    2.05
Interacting-head attention                          dev    2.93    2.96    2.93    2.98 (+0.29)    —       —
                                                    test   2.69    2.75    2.71    2.79 (+0.30)    —       —

(f) YiSi

Model                                               Set    2/256   4/128   8/64    16/32           32/16   64/8
Multihead attention                                 dev    54.33   55.98   57.91   58.48           55.25   49.83
                                                    test   53.34   54.18   55.88   56.45           53.57   48.44
Multihead attention (head size = n, dmodel = 512)   dev    52.05   54.25   55.85   57.15           57.33   56.45
                                                    test   50.63   52.57   54.18   55.51           56.79   55.94
Talking-head attention                              dev    50.19   51.32   52.32   53.02           55.55   57.06
                                                    test   48.97   49.63   50.63   51.69           53.72   55.83
Interacting-head attention                          dev    60.92   61.22   60.91   61.45 (+2.97)   —       —
                                                    test   58.68   59.15   58.67   60.04 (+3.59)   —       —

Note. The metrics are (a) BLEU, (b) WER, (c) METEOR, (d) ROUGE_L, (e) CIDEr, and (f) YiSi. Values in parentheses are the difference between interacting-head attention and multihead attention at the 16-head/32 head-size configuration; for WER this is a reduction, since lower WER is better.
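Two properties of Table 3 can be checked directly from the reported numbers: every head configuration factors the model dimension dmodel = 512 as number of heads × head size, and each parenthesized gain equals the difference between interacting-head and multihead attention at the 16/32 configuration. A minimal Python check using the panel (a) BLEU values (variable names are illustrative, not from the paper):

```python
# All head configurations in Table 3 satisfy heads * head_size = d_model = 512.
configs = [(2, 256), (4, 128), (8, 64), (16, 32), (32, 16), (64, 8)]
assert all(heads * head_size == 512 for heads, head_size in configs)

# BLEU at 16 heads / head size 32, panel (a): multihead vs. interacting-head.
multihead_dev, multihead_test = 27.23, 25.18
interacting_dev, interacting_test = 30.01, 27.61

# The parenthesized gains in the table are the differences at this configuration.
print(round(interacting_dev - multihead_dev, 2))    # dev-set gain, matches (+2.78)
print(round(interacting_test - multihead_test, 2))  # test-set gain, matches (+2.43)
```

The same arithmetic reproduces the parenthesized values in panels (b) through (f), e.g. the WER reduction 3.10 − 2.25 = 0.85 on the dev set.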