Research Article

Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Figure 2

The Vision Transformer architecture: (a) the main architecture of the model, (b) the transformer encoder module, (c) the multi-head self-attention (MSA) module, and (d) the self-attention (SA) head.
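
As a rough illustration of panels (c) and (d), the following minimal sketch computes one scaled dot-product SA head and then combines several heads in an MSA step, following the standard Vision Transformer formulation rather than the exact implementation shown in the figure. All names here (`self_attention`, `multi_head_self_attention`, `w_q`, `w_k`, `w_v`, `w_out`) are hypothetical, and the example uses plain Python lists with toy weights, not trained parameters.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # Plain matrix product of two list-of-lists matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def self_attention(z, w_q, w_k, w_v):
    # One SA head: project tokens to queries, keys, values,
    # then weight the values by softmax(q k^T / sqrt(d_h)).
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_h = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(z, heads, w_out):
    # MSA: run every head, concatenate the per-token outputs,
    # and apply a final linear projection (assumed, as in standard ViT).
    outputs = [self_attention(z, *head) for head in heads]
    concat = [sum((o[i] for o in outputs), []) for i in range(len(z))]
    return matmul(concat, w_out)


# Toy input: two tokens with two features each, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
sa_out = self_attention(tokens, identity, identity, identity)
msa_out = multi_head_self_attention(
    tokens,
    heads=[(identity, identity, identity)] * 2,
    w_out=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
)
```

With identity projections each token attends most strongly to itself, so every row of `sa_out` is a probability-like vector whose largest entry sits on the diagonal.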