Research Article

Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Table 2

Variants of the ViT model used for image classification.

Model      Layers  Hidden size  MLP size  Heads  Params (M)
ViT-Base   12      768          3072      12     86
ViT-Large  24      1024         4096      16     307
ViT-Huge   32      1280         5120      16     632

The variant used for feature extraction in the proposed method is shown in bold.
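
As a rough sanity check on the parameter counts in Table 2, the sketch below estimates the size of each variant from its layer count, hidden size, and MLP size. It is not taken from the article. It assumes a standard Transformer encoder block (four attention projections with biases, a two-layer MLP, two layer norms), 16×16 patches on 224×224 RGB input, a learned class token and position embeddings, and no classification head. The names `ViTConfig` and `estimate_params` are hypothetical. Under these assumptions the estimates should land within a few percent of the tabulated values; the remaining gap depends on details such as patch size and the classification head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    layers: int   # number of encoder blocks
    hidden: int   # hidden (embedding) size D
    mlp: int      # MLP inner size
    heads: int    # attention heads (does not change the parameter count)


# Values copied from Table 2.
VARIANTS = {
    "ViT-Base": ViTConfig(layers=12, hidden=768, mlp=3072, heads=12),
    "ViT-Large": ViTConfig(layers=24, hidden=1024, mlp=4096, heads=16),
    "ViT-Huge": ViTConfig(layers=32, hidden=1280, mlp=5120, heads=16),
}


def estimate_params(cfg: ViTConfig, patch: int = 16, image: int = 224,
                    channels: int = 3) -> int:
    """Approximate trainable parameters of a ViT backbone (no classifier head).

    Assumed layout: patch size, image size, and channel count are
    illustrative defaults, not values stated in the article.
    """
    d, m = cfg.hidden, cfg.mlp
    attention = 4 * (d * d + d)            # Q, K, V and output projections
    mlp = (d * m + m) + (m * d + d)        # two linear layers with biases
    norms = 2 * (2 * d)                    # two layer norms (scale + shift)
    encoder = cfg.layers * (attention + mlp + norms)

    tokens = (image // patch) ** 2 + 1     # patches plus class token
    patch_embed = patch * patch * channels * d + d
    pos_embed = tokens * d
    cls_token = d
    final_norm = 2 * d
    return encoder + patch_embed + pos_embed + cls_token + final_norm


if __name__ == "__main__":
    for name, cfg in VARIANTS.items():
        print(f"{name}: about {estimate_params(cfg) / 1e6:.1f} M parameters")
```

The head count is carried in the configuration for completeness only: splitting the hidden size across more heads changes how attention is computed, not how many weights it uses.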