Research Article

An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Table 6

Training time on IWSLT16 DE-EN training dataset.

| Model | Number of heads: 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | 10 m 10 s | 10 m 30 s | 10 m 55 s | 11 m 30 s | 12 m 45 s | 15 m 30 s |
| Multihead attention (head size = n) | 9 m 20 s | 9 m 40 s | 10 m 10 s | 11 m 05 s | 12 m 55 s | 16 m 55 s |
| Talking-head attention | 10 m 30 s | 10 m 40 s | 11 m | 11 m 40 s | 15 m 50 s | — |
| Interacting-head attention | 10 m 30 s | 11 m 30 s | 14 m 30 s | 24 m 30 s | — | — |

Note. m and s denote minutes and seconds, respectively; — indicates that no result is reported for that setting.