Research Article

Scaling Human-Object Interaction Recognition in the Video through Zero-Shot Learning

Figure 6

Motion representation. The short-term temporal information in a video clip is represented by optical flow. First, features are extracted from the optical flow of each frame. Then, the whole input video is represented by the LSTM block. Finally, the elements are locally aggregated, and the class scores for each verb class are estimated.
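
The pipeline in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the per-frame optical-flow features are random placeholders, the LSTM weights are untrained, and mean pooling over time stands in for the local aggregation step, whose exact form the caption does not specify. All names and dimensions (`TinyLSTM`, `verb_scores`, `FEAT_DIM`, `NUM_VERBS`, and so on) are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyLSTM:
    """Single-layer LSTM with untrained random weights (illustration only)."""
    GATES = ("input", "forget", "output", "cand")

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.W = {k: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                  for k in self.GATES}
        self.b = {k: [0.0] * hidden_dim for k in self.GATES}

    def run(self, sequence):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in sequence:
            z = list(x) + h  # concatenate current input and previous hidden state
            pre = {}
            for k in self.GATES:
                pre[k] = [p + q for p, q in zip(matvec(self.W[k], z), self.b[k])]
            i_gate = [sigmoid(v) for v in pre["input"]]
            f_gate = [sigmoid(v) for v in pre["forget"]]
            o_gate = [sigmoid(v) for v in pre["output"]]
            cand = [math.tanh(v) for v in pre["cand"]]
            c = [f * cj + i * g for f, cj, i, g in zip(f_gate, c, i_gate, cand)]
            h = [o * math.tanh(cj) for o, cj in zip(o_gate, c)]
            outputs.append(h)
        return outputs

def mean_pool(vectors):
    # Assumed stand-in for the aggregation step: average over time.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def verb_scores(flow_features, lstm, W_cls, b_cls):
    hidden_states = lstm.run(flow_features)   # temporal modelling of the clip
    pooled = mean_pool(hidden_states)         # aggregate per-frame elements
    logits = [p + q for p, q in zip(matvec(W_cls, pooled), b_cls)]
    return softmax(logits)                    # one score per verb class

rng = random.Random(0)
NUM_FRAMES, FEAT_DIM, HIDDEN_DIM, NUM_VERBS = 8, 16, 12, 5
# Placeholder for features extracted from the optical flow of each frame.
flow_features = [[rng.gauss(0, 1) for _ in range(FEAT_DIM)]
                 for _ in range(NUM_FRAMES)]
lstm = TinyLSTM(FEAT_DIM, HIDDEN_DIM, rng)
W_cls = rand_matrix(NUM_VERBS, HIDDEN_DIM, rng)
b_cls = [0.0] * NUM_VERBS
scores = verb_scores(flow_features, lstm, W_cls, b_cls)
```

Here `scores` holds one probability per verb class. In a real system the placeholder features would come from a network applied to optical-flow frames, and all weights would be learned.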