Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

<div>Motion representation. The short-term temporal information in a video clip can be represented by optical flow. First, features are extracted from the optical flow of each frame. The whole input video is then represented by an LSTM block. Finally, the resulting elements are locally aggregated, and class scores are estimated for each verb class.</div>
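The pipeline described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the per-frame optical-flow features are replaced by random vectors, the weights are untrained, the names `TinyLSTM` and `verb_scores` and all dimensions are hypothetical, and mean pooling over time is an assumed stand-in for the aggregation step, which the caption does not specify.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Hypothetical single-layer LSTM without biases, for illustration only."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One weight matrix per gate (input, forget, output, candidate),
        # each applied to the concatenation [x_t ; h_{t-1}].
        self.W = {g: rand_matrix(hid_dim, in_dim + hid_dim, rng) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation: [x_t ; h_{t-1}]
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def run(self, seq):
        """Return the hidden state at every time step."""
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        outputs = []
        for x in seq:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def verb_scores(flow_features, lstm, W_cls):
    """Per-frame flow features -> LSTM -> pooled vector -> verb class scores."""
    hidden = lstm.run(flow_features)
    # Assumption: mean pooling over time stands in for the aggregation step.
    pooled = [sum(h[k] for h in hidden) / len(hidden) for k in range(lstm.hid_dim)]
    logits = matvec(W_cls, pooled)
    # Numerically stable softmax over the verb classes.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Stand-in data: 8 frames, each with a 16-dimensional optical-flow feature.
flow_features = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
lstm = TinyLSTM(in_dim=16, hid_dim=32, rng=rng)
W_cls = rand_matrix(5, 32, rng)  # 5 hypothetical verb classes
scores = verb_scores(flow_features, lstm, W_cls)
```

Here `scores` holds one probability per verb class. In a real system the random inputs would be replaced by features computed from actual optical flow, and the weights would be learned from labeled clips.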

Computational Intelligence and Neuroscience


Figure 6: Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning