Research Article

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

Figure 3

Block diagram of the processing performed in each stream of the verb recognition branch shown in Figure 1. First, convolutional features are extracted from each frame. Then, the LSTM block produces a representation of the whole input video. Finally, the elements are locally aggregated.
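
The three steps in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The caption does not specify the convolutional network, the LSTM configuration, or the aggregation operator, so everything here is an assumption: a random linear projection with ReLU stands in for the convolutional feature extractor, the LSTM is a single layer with random weights, and local aggregation is taken to be a mean over non-overlapping temporal windows. All function names and parameters (`frame_features`, `lstm_hidden_states`, `local_aggregate`, `verb_stream`, `window`, `feat_dim`, `hidden`) are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frame_features(frames, proj):
    """Stand-in for a CNN (assumption): flatten each frame, project, apply ReLU."""
    flat = frames.reshape(frames.shape[0], -1)       # (T, H*W*C)
    return np.maximum(flat @ proj, 0.0)              # (T, feat_dim)


def lstm_hidden_states(x, W, U, b):
    """Run a single-layer LSTM over x of shape (T, D); return all hidden states."""
    hidden = U.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x_t in x:
        z = x_t @ W + h @ U + b                      # stacked gate pre-activations
        i = sigmoid(z[:hidden])                      # input gate
        f = sigmoid(z[hidden:2 * hidden])            # forget gate
        o = sigmoid(z[2 * hidden:3 * hidden])        # output gate
        g = np.tanh(z[3 * hidden:])                  # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)                          # (T, hidden)


def local_aggregate(states, window):
    """Assumed operator: mean over non-overlapping temporal windows."""
    T, hidden = states.shape
    n = T // window                                  # a trailing partial window is dropped
    return states[:n * window].reshape(n, window, hidden).mean(axis=1)


def verb_stream(frames, window=4, feat_dim=32, hidden=16, seed=0):
    """Chain the three stages for one stream; weights are random placeholders."""
    rng = np.random.default_rng(seed)
    in_dim = int(np.prod(frames.shape[1:]))
    proj = rng.normal(scale=0.01, size=(in_dim, feat_dim))
    W = rng.normal(scale=0.1, size=(feat_dim, 4 * hidden))
    U = rng.normal(scale=0.1, size=(hidden, 4 * hidden))
    b = np.zeros(4 * hidden)
    feats = frame_features(frames, proj)             # step 1: per-frame features
    states = lstm_hidden_states(feats, W, U, b)      # step 2: sequence encoding
    return local_aggregate(states, window)           # step 3: local aggregation


if __name__ == "__main__":
    video = np.random.default_rng(1).random((16, 8, 8, 3))   # 16 toy frames
    print(verb_stream(video).shape)
```

With 16 input frames and a window of 4, the sketch returns one aggregated vector per window. In a real system the random projection would be replaced by a pretrained convolutional network and the LSTM weights would be learned.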