Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

<div>Focal representation. First, the foreground region is detected in each input frame, and the background is blurred with a low-pass Gaussian filter. Features are then extracted from each blurred frame, and an LSTM block encodes the whole input video. Finally, the elements are locally aggregated and a class score is estimated for each verb class.</div>
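
The first step in the caption, keeping the foreground sharp while low-pass filtering the background, can be illustrated with a minimal sketch. It assumes a grayscale frame stored as a list of rows and a foreground mask that is already available. The foreground detector, the feature extractor, the LSTM block, and the aggregation step are not shown. The function names, the `sigma` and `radius` defaults, and the border clamping are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return 1-D Gaussian weights, normalized to sum to 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values, kernel):
    """Convolve one row or column with the kernel, clamping at the borders."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(x + k - radius, 0), n - 1)
            acc += w * values[idx]
        out.append(acc)
    return out


def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def focal_frame(image, foreground_mask, sigma=1.0, radius=2):
    """Keep foreground pixels unchanged and replace the rest with blurred values.

    `foreground_mask` is assumed to come from a separate detector (hypothetical here).
    """
    blurred = gaussian_blur(image, sigma, radius)
    return [
        [pixel if is_fg else soft
         for pixel, is_fg, soft in zip(img_row, mask_row, blur_row)]
        for img_row, mask_row, blur_row in zip(image, foreground_mask, blurred)
    ]
```

In a full pipeline, each frame returned by `focal_frame` would be passed to the feature extractor before the sequence model. A practical implementation would more likely use an image library's Gaussian filter than this pure-Python loop.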

Computational Intelligence and Neuroscience


Figure 5: Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning