Review Article

Bidirectional Language Modeling: A Systematic Literature Review

Table 1

Overall hyperparameters.

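| Paper | Batch size | Max sequence | Learning rate | Step size | Parameters | Layers | Hidden | Attention heads |
|---|---|---|---|---|---|---|---|---|
| [17] | 2K | 512 | 1e-6 | 125K | 360 | 24 | 1024 | 16 |
| [10] | 256 | 128 | 1e-4 | 2.4M | 340 | 24 | 1024 | 16 |
| [11] | 32 | 128 | 2e-5 | 1M | 340 | 24 | 1024 | 16 |
| [14] | 512 | 256 | 5e-5 | 1M | 114 | 6 | 768 | 12 |
| [15] | 400K | 256 | 5e-5 | 4K | 114 | 24 | 1024 | 16 |
| [27] | 256 | 128 | 1e-4 | 1M | 110 | 12 | 768 | 12 |
| [28] | 2048 | 512 | 1e-5 | 500K | 340 | 24 | 1024 | 16 |
| [29] | 330 | 512 | 3e-5 | 777K | 340 | 24 | 1024 | 16 |
| [20] | 32 | 512 | 1e-4 | 1M | 330 | 24 | 1024 | 16 |
| [30] | 32 | 512 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [31] | 256 | 128 | 1 | 1M | 14.5 | 4 | 312 | 12 |
| [32] | 256 | 128 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [33] | 4096 | 512 | 0.00176 | 125K | 233 | 12 | 4096 | 128 |
| [34] | 1024 | 128 | 1.0e-4 | 1M | 3.9 | 48 | 2560 | 40 |
| [35] | 4096 | 512 | 0.00176 | 125K | 233 | 12 | 4096 | 64 |
| [36] | 2048 | 128 | 0.01 | 2.1M | 11 | 12 | 768 | 12 |
| [37] | 2K | 512 | 1e-3 | 125K | 356 | 24 | 1024 | 16 |
| [38] | 8K | 512 | 1e-6 | 500K | 360 | 24 | 1024 | 16 |
| [39] | 32 | 128 | 2e-5 | 1M | 340 | 12 | 768 | 12 |
| [40] | 2048 | 512 | 2e-4 | 1.75M | 335 | 24 | 1024 | 16 |
| [41] | 1024 | 512 | 5e-4 | 400K | 33 | 12 | 768 | 12 |
| [42] | 32 | 256 | 2−5 to 10−5 | 1M | 66 | 6 | 768 | 12 |
| [43] | 256 | 128 | 1e-4 | 1M | 108 | 12 | 768 | 12 |
| [44] | 128 | 128 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [45] | 256 | 128 | 1e-4 | 1M | 110 | 12 | 768 | 12 |
| [46] | 256 | 128 | 1e-4 | 1M | 340 | 24 | 1024 | 16 |
| [47] | 256 | 128 | 5−5 to 10−5 | 1M | 110 | 12 | 768 | 12 |
| [48] | 128 | 512 | 3e-4 | 50K | 9.5 | 24 | 1024 | 16 |
| [49] | 6 | 512 | 1.5e-5 | 1M | 340 | 24 | 1024 | 16 |
| [50] | 8000 | 512 | 1e-6 | 500K | 400 | 12 | 1024 | 12 |
| [51] | 5120 | 128 | 1.8e-4 | 25K | 340 | 24 | 1024 | 16 |
| [52] | 7680 | 128 | 6e-4 | 0.5M | 110 | 12 | 768 | 12 |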
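
The Parameters, Layers, and Hidden columns of Table 1 are linked: for a standard BERT-style encoder, the parameter count follows largely from the number of layers and the hidden size. The sketch below is a rough consistency check, not a method taken from any of the reviewed papers. It assumes a WordPiece vocabulary of 30,522 tokens, 512 positions, two segment types, a feed-forward width of four times the hidden size, and a pooling layer; none of these values are stated in the table, and the function name `estimate_bert_params` is hypothetical. Under these assumptions, the two most common configurations come out close to the values near 110 and 340 in the Parameters column, which are presumably in millions.

```python
def estimate_bert_params(layers, hidden, vocab_size=30522,
                         max_positions=512, type_vocab=2):
    """Estimate the parameter count of a BERT-style encoder.

    Assumptions (not taken from Table 1): feed-forward width of 4 * hidden,
    learned position embeddings, and a final pooler layer.
    """
    # Embeddings: token, position, and segment tables, plus one LayerNorm
    # (scale and bias vectors).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value, and output projections,
    # each a hidden x hidden matrix with a bias vector.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward block: hidden -> 4*hidden -> hidden, with biases.
    inner = 4 * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)

    # Two LayerNorms per layer, each with scale and bias vectors.
    layer_norms = 2 * (2 * hidden)

    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense hidden x hidden layer with bias.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


if __name__ == "__main__":
    for layers, hidden in [(12, 768), (24, 1024)]:
        millions = estimate_bert_params(layers, hidden) / 1e6
        print(f"{layers} layers, hidden {hidden}: {millions:.1f}M parameters")
    # 12 layers, hidden 768: 109.5M parameters
    # 24 layers, hidden 1024: 335.1M parameters
```

The number of attention heads does not enter the estimate, because the heads split the hidden dimension without adding weights. Rows whose parameter counts depart strongly from this estimate likely use a different vocabulary, parameter sharing, or a different feed-forward width.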