Computational Intelligence and Neuroscience

Research Article

Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Comparison of the proposed method with traditional (handcrafted), LSTM-based, and non-LSTM-based techniques on the HMDB51 and UCF50 datasets.


Domain	Technique	Accuracy (%) on HMDB51	Accuracy (%) on UCF50

Handcrafted methods	Gradient boundary histogram + motion boundary descriptor [41]	62.2	--
	Improved dense trajectories (IDT) hybrid approach [42]	61.1	92.3
	Multiview super vector [43]	55.9	--
LSTM-based methods	Adaptive recurrent convolutional hybrid (ARCH) network [44]	58.2	--
	Lattice-LSTM [45]	66.2	--
	Relational LSTM [35]	71.4	--
	TS-LSTM and temporal inception [46]	69.0	--
	Temporal optical flow with multilayer LSTM [47]	72.2	94.9
	3D-CNNs and bidirectional hierarchical LSTM [48]	71.9	--
	CNN and DS-GRU [21]	72.3	95.2
Non-LSTM-based methods	Improved trajectory [49]	57.2	91.2
	Hierarchical clustering multitask learning [50]	51.4	93.2
The proposed method	ViT and multilayer LSTM	73.714	96.144

The methods shown in bold text achieve the highest performance in their respective categories.
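
As a reading aid, the per-category best results and the margin of the proposed method over the strongest prior result can be recomputed from the table. The following is a minimal sketch, not part of the original study: the accuracy values are transcribed by hand from the table, and the names `RESULTS`, `PROPOSED`, `best_per_category`, and `margin_over_best` are hypothetical helpers introduced only for illustration. Entries reported as "--" are treated as missing and excluded from the comparison.

```python
# Accuracy values (%) transcribed from the table above.
# Each tuple is (HMDB51, UCF50); None marks a "--" entry (not reported).
RESULTS = {
    "Handcrafted methods": {
        "Gradient boundary histogram + motion boundary descriptor [41]": (62.2, None),
        "Improved dense trajectories (IDT) hybrid approach [42]": (61.1, 92.3),
        "Multiview super vector [43]": (55.9, None),
    },
    "LSTM-based methods": {
        "ARCH network [44]": (58.2, None),
        "Lattice-LSTM [45]": (66.2, None),
        "Relational LSTM [35]": (71.4, None),
        "TS-LSTM and temporal inception [46]": (69.0, None),
        "Temporal optical flow with multilayer LSTM [47]": (72.2, 94.9),
        "3D-CNNs and bidirectional hierarchical LSTM [48]": (71.9, None),
        "CNN and DS-GRU [21]": (72.3, 95.2),
    },
    "Non-LSTM-based methods": {
        "Improved trajectory [49]": (57.2, 91.2),
        "Hierarchical clustering multitask learning [50]": (51.4, 93.2),
    },
}

# The proposed method (ViT and multilayer LSTM), as reported in the table.
PROPOSED = (73.714, 96.144)
DATASETS = ("HMDB51", "UCF50")


def best_per_category(results, col):
    """Return {category: (technique, accuracy)} for dataset column `col`.

    Techniques with no reported value for that dataset are skipped.
    """
    best = {}
    for category, rows in results.items():
        reported = {
            name: acc[col] for name, acc in rows.items() if acc[col] is not None
        }
        if reported:
            name = max(reported, key=reported.get)
            best[category] = (name, reported[name])
    return best


def margin_over_best(results, proposed, col):
    """Percentage-point gap between the proposed method and the best prior result."""
    best_acc = max(acc for _, acc in best_per_category(results, col).values())
    return round(proposed[col] - best_acc, 3)


if __name__ == "__main__":
    for col, dataset in enumerate(DATASETS):
        print(dataset)
        for category, (name, acc) in best_per_category(RESULTS, col).items():
            print(f"  {category}: {name} ({acc})")
        print(f"  margin of proposed method: {margin_over_best(RESULTS, PROPOSED, col)}")
```

With the transcribed values, the strongest prior result on both datasets is CNN and DS-GRU [21], and the proposed method exceeds it by 1.414 percentage points on HMDB51 and 0.944 percentage points on UCF50.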