Research Article

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

Figure 4: Patch-based representation. First, the regions relevant to the target verb are detected and patches are extracted from each input frame. Features are then extracted from every patch in every frame. An LSTM block encodes the whole input video, the resulting elements are locally aggregated, and a class score is estimated for each verb class.
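To make the pipeline of Figure 4 concrete, the sketch below shows one possible PyTorch arrangement of the stages named in the caption: per-patch feature extraction, temporal encoding with an LSTM, aggregation, and verb-class scoring. All module names, dimensions, the small CNN patch encoder, and the simple mean pooling used in place of the caption's local aggregation step are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the patch-based verb-recognition pipeline (assumed layout).
import torch
import torch.nn as nn


class PatchVerbScorer(nn.Module):
    def __init__(self, patch_feat_dim=512, lstm_hidden=256, num_verbs=100):
        super().__init__()
        # Per-patch feature extractor; a tiny CNN stands in for whatever
        # image backbone is actually used to describe each patch.
        self.patch_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, patch_feat_dim),
        )
        # LSTM block that encodes the sequence of per-frame descriptors.
        self.lstm = nn.LSTM(patch_feat_dim, lstm_hidden, batch_first=True)
        # Linear head producing one score per verb class.
        self.classifier = nn.Linear(lstm_hidden, num_verbs)

    def forward(self, patches):
        # patches: (batch, time, num_patches, 3, H, W), already cropped around
        # the verb-relevant regions detected in each frame.
        b, t, p, c, h, w = patches.shape
        feats = self.patch_encoder(patches.view(b * t * p, c, h, w))
        feats = feats.view(b, t, p, -1)
        # Combine the patch features of each frame into one frame descriptor
        # (mean pooling here; an assumption, not the paper's choice).
        frame_feats = feats.mean(dim=2)            # (b, t, patch_feat_dim)
        # Temporal encoding of the whole video.
        outputs, _ = self.lstm(frame_feats)        # (b, t, lstm_hidden)
        # Aggregate the per-step elements into a single video representation;
        # plain temporal mean pooling stands in for the "local aggregation"
        # step mentioned in the caption.
        video_repr = outputs.mean(dim=1)           # (b, lstm_hidden)
        return self.classifier(video_repr)         # (b, num_verbs)


if __name__ == "__main__":
    model = PatchVerbScorer(num_verbs=10)
    dummy = torch.randn(2, 8, 4, 3, 64, 64)  # 2 videos, 8 frames, 4 patches each
    print(model(dummy).shape)  # torch.Size([2, 10])
```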