Research Article
An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention
Table 8
Training time on the WMT17 EN-CS training dataset.
| Model \ (number of heads/head size) | 2/256 | 4/128 | 8/64 | 16/32 | 32/16 | 64/8 |
| --- | --- | --- | --- | --- | --- | --- |
| Multihead attention | 1 h | 1 h 5 m | 1 h 8 m | 1 h 10 m | 1 h 15 m | 1 h 30 m |
| Multihead attention (head size = n) | 50 m | 53 m | 57 m | 1 h 3 m | 1 h 15 m | 1 h 50 m |
| Talking-head attention | 58 m | 59 m | 1 h | 1 h 5 m | 1 h 10 m | 1 h 30 m |
| Interacting-head attention | 53 m | 58 m | 1 h 13 m | 1 h 50 m | — | — |
Note. The units h and m stand for hour and minute, respectively.