Research Article

Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Table 3

The proposed LSTM network for capturing long-range temporal information from video sequences.

| Layer (type) | Output shape | No. of parameters |
| --- | --- | --- |
| Input data | (None, 30, 1000) | 0 |
| LSTM | (None, 30, 128) | 578048 |
| LSTM | (None, 64) | 49408 |
| Dropout | (None, 64) | 0 |
| Batch normalization | (None, 64) | 256 |
| Activation | (None, 64) | 0 |
| Dense | (None, 64) | 4160 |
| Dense | (None, 51) | 3315 |
| Activation | (None, 51) | 0 |
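
The parameter counts in Table 3 can be checked by hand from the layer shapes. The sketch below is not taken from the article; it assumes the standard counting conventions used by Keras-style model summaries: an LSTM layer has four gates, each with input weights, recurrent weights, and a bias; a dense layer has one weight per input-output pair plus a bias per output; batch normalization stores four vectors per feature (scale, shift, moving mean, moving variance). The helper names `lstm_params`, `dense_params`, and `batchnorm_params` are hypothetical and introduced only for illustration.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and a bias vector (assumed standard formulation)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return input_dim * units + units


def batchnorm_params(features):
    """Batch normalization keeps gamma, beta, moving mean, and moving
    variance for every feature (the last two are not trained)."""
    return 4 * features


# Layer sizes read from Table 3: 1000 input features per time step,
# LSTM layers with 128 and 64 units, dense layers with 64 and 51 units.
layer_params = {
    "lstm_1": lstm_params(1000, 128),
    "lstm_2": lstm_params(128, 64),
    "batch_norm": batchnorm_params(64),
    "dense_1": dense_params(64, 64),
    "dense_2": dense_params(64, 51),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print(f"total: {total_params}")
```

Under these assumptions the five computed counts agree with the corresponding rows of the table. Dropout and activation layers contribute no parameters, which is why they are listed with 0.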