Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

<div>The Vision Transformer architecture: (a) the main architecture of the model, (b) the transformer encoder module, (c) multiscale self-attention (MSA) head, and (d) the self-attention (SA) head.</div>

Computational Intelligence and Neuroscience

Figure 2
