Research Article
Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos
Table 5
Comparative analysis of the proposed method with traditional, deep LSTM, and non-LSTM-based techniques using HMDB51 and UCF50 datasets.
| Domain | Technique | HMDB51 accuracy (%) | UCF50 accuracy (%) |
| --- | --- | --- | --- |
| Handcrafted methods | Gradient boundary histogram + motion boundary descriptor [41] | 62.2 | -- |
| | Improved dense trajectories (IDT) hybrid approach [42] | 61.1 | 92.3 |
| | Multiview super vector [43] | 55.9 | -- |
| LSTM-based methods | Adaptive recurrent convolutional hybrid (ARCH) network [44] | 58.2 | -- |
| | Lattice-LSTM [45] | 66.2 | -- |
| | Relational LSTM [35] | 71.4 | -- |
| | TS-LSTM and temporal inception [46] | 69.0 | -- |
| | Temporal optical flow with multilayer LSTM [47] | 72.2 | 94.9 |
| | 3D-CNNs and bidirectional hierarchical LSTM [48] | 71.9 | -- |
| | CNN and DS-GRU [21] | 72.3 | 95.2 |
| Non-LSTM-based methods | Improved trajectory [49] | 57.2 | 91.2 |
| | Hierarchical clustering multitask learning [50] | 51.4 | 93.2 |
| The proposed method | ViT and multilayer LSTM | 73.714 | 96.144 |
The methods shown in bold text achieve the highest performance in their respective categories.