Research Article
Context-Fused Guidance for Image Captioning Using Sequence-Level Training
Figure 1
The main idea of our proposed network. The compositional visual feature consists of image representations at the regional and global levels. At each decoding step, the context gate calculates the textual context by dynamically aggregating the visual concept and the word embedding. The context-fused image guidance is formulated from the compositional visual features and the fused textual context.
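To make the role of the context gate more concrete, the following is a minimal sketch of one plausible gating scheme. The caption does not give the gate's equation, so everything here is an assumption for illustration: the sigmoid gate, the elementwise convex blend, and the names `context_gate`, `w_concept`, `w_word`, and `bias` are hypothetical and may differ from the actual network.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def context_gate(concept, word, w_concept, w_word, bias):
    """Blend a visual-concept vector with a word embedding (assumed form).

    gate    = sigmoid(W_c @ concept + W_w @ word + bias)
    context = gate * concept + (1 - gate) * word

    Both inputs are assumed to share one dimension d, and both weight
    matrices are assumed to be d x d. This is an illustrative guess,
    not the formulation used in the paper.
    """
    from_concept = matvec(w_concept, concept)
    from_word = matvec(w_word, word)
    gate = [sigmoid(c + w + b) for c, w, b in zip(from_concept, from_word, bias)]
    context = [g * c + (1.0 - g) * w for g, c, w in zip(gate, concept, word)]
    return gate, context
```

Under this assumed form, each gate value lies strictly between 0 and 1, so every component of the textual context falls between the corresponding components of the visual concept and the word embedding. Because the gate is recomputed from the current inputs at each decoding step, the balance between the two sources can change from word to word, which is one way to read "dynamically aggregating" in the caption.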