Research Article
Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos
Table 3
Architecture of the proposed LSTM network used to capture long-range temporal information from video sequences.
| Layer (type) | Output shape | No. of parameters |
| --- | --- | --- |
| Input data | (None, 30, 1000) | 0 |
| LSTM | (None, 30, 128) | 578048 |
| LSTM | (None, 64) | 49408 |
| Dropout | (None, 64) | 0 |
| Batch normalization | (None, 64) | 256 |
| Activation | (None, 64) | 0 |
| Dense | (None, 64) | 4160 |
| Dense | (None, 51) | 3315 |
| Activation | (None, 51) | 0 |
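The parameter counts in Table 3 can be checked against the standard formulas for LSTM, dense, and batch-normalization layers. The sketch below is illustrative and is not the authors' code. The helper names are hypothetical. It assumes each LSTM uses four gates with one bias vector per gate, and that the batch-normalization count includes the two non-trainable moving statistics, as Keras-style model summaries report them. The input dimension of each layer is taken from the output shape of the preceding row.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


def batch_norm_params(features):
    # gamma, beta, moving mean, moving variance (assumed to be counted together).
    return 4 * features


# Input dimensions follow the output shapes listed in Table 3.
layer_params = {
    "LSTM (128 units)": lstm_params(1000, 128),
    "LSTM (64 units)": lstm_params(128, 64),
    "Batch normalization": batch_norm_params(64),
    "Dense (64 units)": dense_params(64, 64),
    "Dense (51 units)": dense_params(64, 51),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print(f"Total: {total_params}")
```

Under these assumptions every computed value matches the corresponding table entry. The dropout and activation layers have no parameters, so they do not affect the total.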