Context-Fused Guidance for Image Captioning Using Sequence-Level Training

<div>The main idea of our proposed network. The compositional visual feature consists of the image representation at regional and global level. At each decoding step, the context gate calculates the textual context by dynamically aggregating the visual concept and word embedding. The context-fused image guidance is formulated on the compositional visual features and fused textual context.</div>

Computational Intelligence and Neuroscience

fig1

Figure 1

Figure 1: Context-Fused Guidance for Image Captioning Using Sequence-Level Training