Research Article
Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos
Figure 2
The Vision Transformer architecture: (a) the main architecture of the model, (b) the transformer encoder module, (c) the multi-head self-attention (MSA) module, and (d) the self-attention (SA) head.
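
The relationship between panels (c) and (d) can be illustrated with a minimal NumPy sketch. It assumes the standard Vision Transformer formulation, in which the MSA block runs several SA heads in parallel, concatenates their outputs, and applies a linear projection. The function names, the random weights, and the toy sizes (5 tokens, embedding size 8, 2 heads) are hypothetical choices for illustration and are not taken from the article.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """One SA head. x: (tokens, dim); each weight matrix: (dim, head_dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scaled dot-product similarity between every pair of tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


def multi_head_attention(x, heads, w_o):
    """Run several SA heads, concatenate their outputs, project with w_o."""
    outputs = [self_attention(x, *head)[0] for head in heads]
    return np.concatenate(outputs, axis=-1) @ w_o


# Toy setup (assumed sizes, random weights instead of learned ones).
rng = np.random.default_rng(0)
tokens, dim, num_heads = 5, 8, 2
head_dim = dim // num_heads

x = rng.normal(size=(tokens, dim))
heads = [
    tuple(rng.normal(size=(dim, head_dim)) for _ in range(3))  # (w_q, w_k, w_v)
    for _ in range(num_heads)
]
w_o = rng.normal(size=(num_heads * head_dim, dim))

out = multi_head_attention(x, heads, w_o)
_, attn = self_attention(x, *heads[0])
```

Here `out` has the same shape as the input token matrix, which is what allows the encoder module in panel (b) to wrap the attention block in a residual connection.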