Research Article

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

Figure 4: Patch-based representation. First, the regions relevant to the target verb are detected and patches are extracted from each input frame. Features are then extracted from every patch in every frame. An LSTM block encodes the whole input video, the resulting elements are locally aggregated, and a class score is estimated for each verb class.
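To make the pipeline of Figure 4 concrete, the sketch below shows one possible PyTorch arrangement of the stages named in the caption: per-patch feature extraction, temporal encoding with an LSTM, aggregation, and verb-class scoring. All module names, dimensions, the small CNN patch encoder, and the simple mean pooling used in place of the caption's local aggregation step are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the patch-based verb-recognition pipeline (assumed layout).
import torch
import torch.nn as nn


class PatchVerbScorer(nn.Module):
    def __init__(self, patch_feat_dim=512, lstm_hidden=256, num_verbs=100):
        super().__init__()
        # Per-patch feature extractor; a tiny CNN stands in for whatever
        # image backbone is actually used to describe each patch.
        self.patch_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, patch_feat_dim),
        )
        # LSTM block that encodes the sequence of per-frame descriptors.
        self.lstm = nn.LSTM(patch_feat_dim, lstm_hidden, batch_first=True)
        # Linear head producing one score per verb class.
        self.classifier = nn.Linear(lstm_hidden, num_verbs)

    def forward(self, patches):
        # patches: (batch, time, num_patches, 3, H, W), already cropped around
        # the verb-relevant regions detected in each frame.
        b, t, p, c, h, w = patches.shape
        feats = self.patch_encoder(patches.view(b * t * p, c, h, w))
        feats = feats.view(b, t, p, -1)
        # Combine the patch features of each frame into one frame descriptor
        # (mean pooling here; an assumption, not the paper's choice).
        frame_feats = feats.mean(dim=2)            # (b, t, patch_feat_dim)
        # Temporal encoding of the whole video.
        outputs, _ = self.lstm(frame_feats)        # (b, t, lstm_hidden)
        # Aggregate the per-step elements into a single video representation;
        # plain temporal mean pooling stands in for the "local aggregation"
        # step mentioned in the caption.
        video_repr = outputs.mean(dim=1)           # (b, lstm_hidden)
        return self.classifier(video_repr)         # (b, num_verbs)


if __name__ == "__main__":
    model = PatchVerbScorer(num_verbs=10)
    dummy = torch.randn(2, 8, 4, 3, 64, 64)  # 2 videos, 8 frames, 4 patches each
    print(model(dummy).shape)  # torch.Size([2, 10])
```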