Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

<div>Block diagram of the processing steps in each stream of the verb recognition branch shown in Figure <a href="../fig1/">1</a>. First, convolutional features are extracted from each frame. Then, the LSTM block produces a representation of the whole input video. Finally, the elements are locally aggregated.</div>
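
The three stages named in the caption can be illustrated with a minimal, dependency-free sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: `frame_features` is a hypothetical stand-in for the convolutional network, the LSTM uses random weights without biases, and the final step assumes a VLAD-style residual sum as one plausible reading of "locally aggregated" (the caption does not say which aggregation is used). The names `encode_video`, `lstm_step`, and `vlad_aggregate` are likewise hypothetical.

```python
import math
import random


def frame_features(frame, dim=4):
    """Hypothetical stand-in for the convolutional feature extractor:
    averages the frame's values over `dim` equal chunks.
    Assumes len(frame) >= dim."""
    chunk = len(frame) // dim
    return [sum(frame[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_dim, hidden_dim, seed=0):
    """Random (untrained) weights for the four LSTM gates; biases omitted."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)]

    return {gate: matrix() for gate in ("i", "f", "o", "g")}


def lstm_step(params, x, h, c):
    """One standard LSTM update on the concatenated input and hidden state."""
    z = x + h  # list concatenation: [x; h]

    def linear(weights):
        return [sum(w * v for w, v in zip(row, z)) for row in weights]

    i = [sigmoid(v) for v in linear(params["i"])]    # input gate
    f = [sigmoid(v) for v in linear(params["f"])]    # forget gate
    o = [sigmoid(v) for v in linear(params["o"])]    # output gate
    g = [math.tanh(v) for v in linear(params["g"])]  # candidate cell state
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new


def vlad_aggregate(states, centres):
    """Assumed aggregation: sum the residuals of each state to its nearest
    centre, then flatten the per-centre sums into one vector."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    for s in states:
        k = min(range(len(centres)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(s, centres[j])))
        for d in range(dim):
            sums[k][d] += s[d] - centres[k][d]
    return [v for row in sums for v in row]


def encode_video(frames, feat_dim=4, hidden_dim=3, centres=None):
    """Per-frame features -> LSTM over the sequence -> local aggregation."""
    params = make_lstm(feat_dim, hidden_dim)
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    states = []
    for frame in frames:
        h, c = lstm_step(params, frame_features(frame, feat_dim), h, c)
        states.append(h)
    if centres is None:
        # Illustrative fixed centres; in practice these would be learned.
        centres = [[0.0] * hidden_dim, [0.1] * hidden_dim]
    return vlad_aggregate(states, centres)
```

Calling `encode_video` on a list of frames (each a flat list of numbers) returns one fixed-length vector whose size is the number of centres times the hidden dimension, regardless of how many frames the video has.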

Computational Intelligence and Neuroscience


Figure 3
