Research Article
An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
Table 3
Overall evaluation scores of four attention-based models on the IWSLT16 DE-EN development (dev) and test sets.
Column headings give the configuration as number of heads/head size.

(a) BLEU

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 22.73 | 25.71 | 26.44 | 27.23 | 23.92 | 19.50 |
| Multihead attention | test | 21.15 | 24.00 | 24.38 | 25.18 | 22.05 | 17.94 |
| Multihead attention (head size = n, d_model = 512) | dev | 20.93 | 23.08 | 24.79 | 26.00 | 26.94 | 26.38 |
| Multihead attention (head size = n, d_model = 512) | test | 19.35 | 21.38 | 22.81 | 24.61 | 24.86 | 24.59 |
| Talking-head attention | dev | 20.66 | 21.39 | 22.47 | 24.02 | 24.21 | 23.65 |
| Talking-head attention | test | 19.13 | 19.82 | 20.87 | 22.24 | 22.30 | 21.72 |
| Interacting-head attention | dev | 29.59 | 29.94 | 29.54 | 30.01 (+2.78) | — | — |
| Interacting-head attention | test | 27.61 | 28.01 | 27.44 | 27.61 (+2.43) | — | — |

(b) WER

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 5.82 | 4.85 | 3.56 | 3.10 | 3.51 | 8.73 |
| Multihead attention | test | 9.03 | 7.92 | 7.71 | 5.87 | 7.02 | 14.39 |
| Multihead attention (head size = n, d_model = 512) | dev | 6.90 | 6.00 | 5.29 | 4.59 | 4.55 | 3.97 |
| Multihead attention (head size = n, d_model = 512) | test | 11.47 | 10.10 | 8.98 | 8.16 | 8.11 | 7.96 |
| Talking-head attention | dev | 9.03 | 7.31 | 6.58 | 4.81 | 4.51 | 4.33 |
| Talking-head attention | test | 14.46 | 11.90 | 9.83 | 9.10 | 8.69 | 8.90 |
| Interacting-head attention | dev | 2.41 | 2.44 | 2.31 | 2.25 (0.85) | — | — |
| Interacting-head attention | test | 6.60 | 5.34 | 4.77 | 4.49 (1.38) | — | — |

(c) METEOR

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 25.42 | 26.72 | 28.58 | 29.14 | 26.42 | 22.63 |
| Multihead attention | test | 24.77 | 25.96 | 27.67 | 28.20 | 25.57 | 22.02 |
| Multihead attention (head size = n, d_model = 512) | dev | 23.47 | 25.69 | 25.76 | 26.51 | 27.05 | 27.12 |
| Multihead attention (head size = n, d_model = 512) | test | 22.84 | 24.95 | 24.95 | 25.88 | 25.28 | 26.36 |
| Talking-head attention | dev | 22.48 | 24.28 | 23.80 | 24.46 | 25.28 | 26.76 |
| Talking-head attention | test | 21.85 | 23.46 | 23.18 | 23.94 | 24.49 | 25.85 |
| Interacting-head attention | dev | 31.73 | 31.94 | 31.65 | 32.04 (+2.90) | — | — |
| Interacting-head attention | test | 30.87 | 31.11 | 30.76 | 31.25 (+3.05) | — | — |

(d) ROUGE_L

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 53.72 | 55.68 | 57.85 | 58.76 | 55.93 | 50.52 |
| Multihead attention | test | 52.23 | 54.30 | 56.23 | 57.26 | 54.57 | 49.24 |
| Multihead attention (head size = n, d_model = 512) | dev | 51.46 | 54.02 | 55.83 | 55.52 | 56.53 | 56.99 |
| Multihead attention (head size = n, d_model = 512) | test | 50.21 | 52.70 | 55.44 | 54.31 | 54.18 | 55.60 |
| Talking-head attention | dev | 49.26 | 51.64 | 52.38 | 53.72 | 54.92 | 55.78 |
| Talking-head attention | test | 48.23 | 50.19 | 51.19 | 52.30 | 53.33 | 54.45 |
| Interacting-head attention | dev | 61.02 | 61.29 | 61.14 | 61.41 (+2.65) | — | — |
| Interacting-head attention | test | 59.25 | 59.80 | 59.62 | 59.96 (+2.70) | — | — |

(e) CIDEr

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 1.94 | 2.39 | 2.61 | 2.69 | 2.39 | 1.89 |
| Multihead attention | test | 1.80 | 2.20 | 2.39 | 2.49 | 2.22 | 1.75 |
| Multihead attention (head size = n, d_model = 512) | dev | 1.99 | 2.20 | 2.24 | 2.27 | 2.33 | 2.38 |
| Multihead attention (head size = n, d_model = 512) | test | 1.84 | 2.05 | 2.08 | 2.10 | 2.16 | 2.21 |
| Talking-head attention | dev | 1.70 | 1.87 | 1.84 | 2.09 | 2.20 | 2.32 |
| Talking-head attention | test | 1.57 | 1.72 | 1.67 | 1.91 | 2.01 | 2.05 |
| Interacting-head attention | dev | 2.93 | 2.96 | 2.93 | 2.98 (+0.29) | — | — |
| Interacting-head attention | test | 2.69 | 2.75 | 2.71 | 2.79 (+0.30) | — | — |

(f) YiSi

| Model | Subset | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | dev | 54.33 | 55.98 | 57.91 | 58.48 | 55.25 | 49.83 |
| Multihead attention | test | 53.34 | 54.18 | 55.88 | 56.45 | 53.57 | 48.44 |
| Multihead attention (head size = n, d_model = 512) | dev | 52.05 | 54.25 | 55.85 | 57.15 | 57.33 | 56.45 |
| Multihead attention (head size = n, d_model = 512) | test | 50.63 | 52.57 | 54.18 | 55.51 | 56.79 | 55.94 |
| Talking-head attention | dev | 50.19 | 51.32 | 52.32 | 53.02 | 55.55 | 57.06 |
| Talking-head attention | test | 48.97 | 49.63 | 50.63 | 51.69 | 53.72 | 55.83 |
| Interacting-head attention | dev | 60.92 | 61.22 | 60.91 | 61.45 (+2.97) | — | — |
| Interacting-head attention | test | 58.68 | 59.15 | 58.67 | 60.04 (+3.59) | — | — |
Note. The metrics are (a) BLEU, (b) WER, (c) METEOR, (d) ROUGE_L, (e) CIDEr, and (f) YiSi. Values in parentheses give the margin between interacting-head attention at 16 heads and the best score obtained by any baseline: an improvement for (a) and (c)–(f), where higher is better, and a reduction for (b) WER, where lower is better.
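The paper's exact scoring pipeline is not reproduced here; as a minimal, self-contained sketch of how the first two metrics could be recomputed from decoded output, the snippet below uses the sacrebleu and jiwer packages. The file names `hyps.txt` and `refs.txt` are hypothetical placeholders, not artifacts of this work.

```python
# Sketch: corpus-level BLEU and WER for a decoded test set.
# Assumes one sentence per line in each file; file names are hypothetical.
import sacrebleu  # pip install sacrebleu
import jiwer      # pip install jiwer

with open("hyps.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("refs.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# BLEU as in panel (a): sacrebleu expects a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# WER as in panel (b): jiwer returns a fraction; scale as needed to
# match the units used in the table.
wer = jiwer.wer(references, hypotheses)
print(f"WER: {100 * wer:.2f}")
```

METEOR, ROUGE_L, CIDEr, and YiSi each require their own scorers and are not covered by this sketch.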