Research Article

Context-Fused Guidance for Image Captioning Using Sequence-Level Training

Figure 1

Overview of our proposed network. The compositional visual features consist of image representations at the regional and global levels. At each decoding step, the context gate computes the textual context by dynamically aggregating the visual concept and the word embedding. The context-fused image guidance is then formed from the compositional visual features and the fused textual context.
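To make the gating step in the caption concrete, the following is a minimal sketch of one way a context gate could blend a visual concept vector with a word embedding at a single decoding step. The gate form used here (an element-wise sigmoid over the concatenated inputs, followed by a convex combination) is an assumption chosen for illustration, not necessarily the formulation used in this article. The names `context_gate`, `weights`, and `bias` are hypothetical, and the parameters are random rather than learned.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def context_gate(visual_concept, word_embedding, weights, bias):
    """Blend two equal-length vectors with an element-wise sigmoid gate.

    Assumed form (illustrative only):
        g = sigmoid(W [c; w] + b)
        context = g * c + (1 - g) * w
    """
    # Concatenate the two inputs so the gate can see both of them.
    joint = visual_concept + word_embedding
    fused = []
    for i, (c, w) in enumerate(zip(visual_concept, word_embedding)):
        pre_activation = sum(
            weights[i][j] * joint[j] for j in range(len(joint))
        ) + bias[i]
        g = sigmoid(pre_activation)
        # g near 1 favors the visual concept, g near 0 favors the word.
        fused.append(g * c + (1.0 - g) * w)
    return fused


# Toy inputs for one decoding step (random values, hypothetical dimension).
random.seed(0)
dim = 4
visual_concept = [random.uniform(-1, 1) for _ in range(dim)]
word_embedding = [random.uniform(-1, 1) for _ in range(dim)]
weights = [
    [random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)
]
bias = [0.0] * dim

textual_context = context_gate(visual_concept, word_embedding, weights, bias)
```

Because the gate value lies strictly between 0 and 1, each component of `textual_context` falls between the corresponding components of the two inputs. In a trained model the gate parameters would be learned jointly with the decoder, so the balance between visual and textual information could change from one decoding step to the next.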