Review Article
Bidirectional Language Modeling: A Systematic Literature Review
| Paper | Batch size | Max sequence length | Learning rate | Training steps | Parameters | Layers | Hidden size | Attention heads |
|---|---|---|---|---|---|---|---|---|
| [17] | 2K | 512 | 1e-6 | 125K | 360 | 24 | 1024 | 16 |
| [10] | 256 | 128 | 1e-4 | 2.4M | 340 | 24 | 1024 | 16 |
| [11] | 32 | 128 | 2e-5 | 1M | 340 | 24 | 1024 | 16 |
| [14] | 512 | 256 | 5e-5 | 1M | 114 | 6 | 768 | 12 |
| [15] | 400K | 256 | 5e-5 | 4K | 114 | 24 | 1024 | 16 |
| [27] | 256 | 128 | 1e-4 | 1M | 110 | 12 | 768 | 12 |
| [28] | 2048 | 512 | 1e-5 | 500K | 340 | 24 | 1024 | 16 |
| [29] | 330 | 512 | 3e-5 | 777K | 340 | 24 | 1024 | 16 |
| [20] | 32 | 512 | 1e-4 | 1M | 330 | 24 | 1024 | 16 |
| [30] | 32 | 512 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [31] | 256 | 128 | 1 | 1M | 14.5 | 4 | 312 | 12 |
| [32] | 256 | 128 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [33] | 4096 | 512 | 0.00176 | 125K | 233 | 12 | 4096 | 128 |
| [34] | 1024 | 128 | 1.0e-4 | 1M | 3.9 | 48 | 2560 | 40 |
| [35] | 4096 | 512 | 0.00176 | 125K | 233 | 12 | 4096 | 64 |
| [36] | 2048 | 128 | 0.01 | 2.1M | 11 | 12 | 768 | 12 |
| [37] | 2K | 512 | 1e-3 | 125K | 356 | 24 | 1024 | 16 |
| [38] | 8K | 512 | 1e-6 | 500K | 360 | 24 | 1024 | 16 |
| [39] | 32 | 128 | 2e-5 | 1M | 340 | 12 | 768 | 12 |
| [40] | 2048 | 512 | 2e-4 | 1.75M | 335 | 24 | 1024 | 16 |
| [41] | 1024 | 512 | 5e-4 | 400K | 33 | 12 | 768 | 12 |
| [42] | 32 | 256 | 2−5 to 10−5 | 1M | 66 | 6 | 768 | 12 |
| [43] | 256 | 128 | 1e-4 | 1M | 108 | 12 | 768 | 12 |
| [44] | 128 | 128 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [45] | 256 | 128 | 1e-4 | 1M | 110 | 12 | 768 | 12 |
| [46] | 256 | 128 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [47] | 256 | 128 | 5−5 to 10−5 | 1M | 110 | 12 | 768 | 12 |
| [48] | 128 | 512 | 3e-4 | 50K | 9.5 | 24 | 1024 | 16 |
| [49] | 6 | 512 | 1.5e-5 | 1M | 340 | 24 | 1024 | 16 |
| [50] | 8000 | 512 | 1e-6 | 500K | 400 | 12 | 1024 | 12 |
| [51] | 5120 | 128 | 1.8e-4 | 25K | 340 | 24 | 1024 | 16 |
| [52] | 7680 | 128 | 6e-4 | 0.5M | 110 | 12 | 768 | 12 |
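
As a rough cross-check of how the layer count and hidden size reported in the table relate to the parameter count, the sketch below estimates the size of a BERT-style encoder. It is illustrative only: the function name `estimate_encoder_params` is hypothetical, and the vocabulary size (30,522), maximum positions (512), two segment types, and a feed-forward width of four times the hidden size are assumptions taken from the standard BERT configuration, not values given in the table. Models that share embeddings, use a different vocabulary, or change the feed-forward width will deviate from this estimate.

```python
def estimate_encoder_params(layers, hidden, vocab_size=30522,
                            max_positions=512, type_vocab=2, ffn_mult=4):
    """Estimate the parameter count of a BERT-style encoder (assumed layout)."""
    # Token, position, and segment embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward block: hidden -> ffn_size -> hidden, with biases.
    ffn_size = ffn_mult * hidden
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden)

    # Two LayerNorms per layer, each with scale and bias vectors.
    layer_norms = 2 * 2 * hidden

    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


if __name__ == "__main__":
    for layers, hidden in [(12, 768), (24, 1024)]:
        millions = estimate_encoder_params(layers, hidden) / 1e6
        print(f"{layers} layers, hidden {hidden}: about {millions:.1f}M parameters")
```

Under these assumptions, the 12-layer, 768-hidden configuration comes out near 110M parameters and the 24-layer, 1024-hidden configuration near 335M. This matches the 110 and 335 to 340 figures that recur in the table and suggests that most entries in the parameter column are in millions. Rows that depart sharply from this pattern likely reflect a different architecture or unit.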