Research Article

Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Table 5

Comparison of the proposed method with traditional (handcrafted), LSTM-based, and non-LSTM-based techniques on the HMDB51 and UCF50 datasets.

Domain                  Technique                                                      Accuracy (%)
                                                                                       HMDB51      UCF50
Handcrafted methods     Gradient boundary histogram + motion boundary descriptor [41]  62.2        --
                        Improved dense trajectories (IDT) hybrid approach [42]         61.1        92.3
                        Multiview super vector [43]                                    55.9        --
LSTM-based methods      Adaptive recurrent convolutional hybrid (ARCH) network [44]    58.2        --
                        Lattice-LSTM [45]                                              66.2        --
                        Relational LSTM [35]                                           71.4        --
                        TS-LSTM and temporal inception [46]                            69.0        --
                        Temporal optical flow with multilayer LSTM [47]                72.2        94.9
                        3D-CNNs and bidirectional hierarchical LSTM [48]               71.9        --
                        CNN and DS-GRU [21]                                            72.3        95.2
Non-LSTM-based methods  Improved trajectory [49]                                       57.2        91.2
                        Hierarchical clustering multitask learning [50]                51.4        93.2
The proposed method     ViT and multilayer LSTM                                        73.714      96.144

The methods shown in bold text achieve the highest performance in their respective categories.