Research Article

An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention

Table 7

Training time on the WMT17 EN-DE training dataset.

| Model | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | 9 h 55 m | 11 h | 11 h 30 m | 12 h | 12 h 30 m | 14 h |
| Multihead attention (head size = n) | 11 h 30 m | 11 h 40 m | 11 h 50 m | 12 h 50 m | 13 h 10 m | 17 h 30 m |
| Talking-head attention | 11 h 5 m | 11 h 40 m | 11 h 50 m | 12 h 20 m | 13 h 50 m | 15 h |
| Interacting-head attention | 10 h 40 m | 11 h 30 m | 12 h 50 m | 20 h | – | – |

Note. Column headings give the number of heads/head size; h and m stand for hour and minute, respectively.
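All six head-count/head-size configurations in Table 7 factor the same model width, since 2 × 256 = 4 × 128 = 8 × 64 = 16 × 32 = 32 × 16 = 64 × 8 = 512, the usual Transformer-base d_model. The minimal sketch below, which assumes d_model = 512 and uses standard PyTorch multi-head attention rather than the paper's own implementation, illustrates how head size follows from head count while the output shape stays fixed:

```python
import torch
import torch.nn as nn

EMBED_DIM = 512  # assumed d_model; every column of Table 7 satisfies heads * head_size = 512

for num_heads in (2, 4, 8, 16, 32, 64):
    # Standard multi-head attention splits the embedding dimension evenly
    # across heads, so head size shrinks as head count grows.
    head_size = EMBED_DIM // num_heads
    attn = nn.MultiheadAttention(embed_dim=EMBED_DIM, num_heads=num_heads)
    x = torch.randn(10, 1, EMBED_DIM)  # (seq_len, batch, embed_dim)
    out, _ = attn(x, x, x)             # self-attention over the dummy sequence
    print(f"{num_heads} heads x {head_size} dims -> output {tuple(out.shape)}")
```

Because the total width is constant, the parameter count of the projection layers is the same in every configuration; the timing differences across columns come from how each attention variant handles a growing number of heads.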