Research Article

An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Table 8

Training time on the WMT17 EN-CS training dataset (column headings give the number of heads/head size).

| Model | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
|---|---|---|---|---|---|---|
| Multihead attention | 1 h | 1 h 5 m | 1 h 8 m | 1 h 10 m | 1 h 15 m | 1 h 30 m |
| Multihead attention (head size = n) | 50 m | 53 m | 57 m | 1 h 3 m | 1 h 15 m | 1 h 50 m |
| Talking-head attention | 58 m | 59 m | 1 h | 1 h 5 m | 1 h 10 m | 1 h 30 m |
| Interacting-head attention | 53 m | 58 m | 1 h 13 m | 1 h 50 m | | |

Note. The units h and m stand for hours and minutes, respectively.
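
Note that every column keeps the product of head count and head size fixed at 512 (2 × 256 = 4 × 128 = … = 64 × 8 = 512), which presumably corresponds to a constant model width of d_model = 512, so the rows compare equally sized models that differ only in how attention is partitioned across heads. The sketch below is our illustration of how such heads/head-size configurations arise; it uses PyTorch's stock nn.MultiheadAttention as a stand-in and does not implement the paper's interacting-head variant:

```python
import torch
import torch.nn as nn

# All configurations in Table 8 split a fixed d_model = 512 into
# num_heads heads of size d_model // num_heads: 2/256, 4/128, ..., 64/8.
# Standard multi-head attention only, used here for illustration.
d_model = 512
for num_heads in (2, 4, 8, 16, 32, 64):
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                 batch_first=True)
    x = torch.randn(1, 10, d_model)   # (batch, sequence length, d_model)
    out, _ = attn(x, x, x)            # self-attention: query = key = value = x
    print(f"{num_heads} heads x head size {d_model // num_heads}: {out.shape}")
```

Because the model width is constant, the timing differences across columns reflect only the cost of splitting and recombining more, smaller heads (plus, for the talking-head and interacting-head rows, the extra cross-head projections), not a change in model capacity.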