Research Article

Context-Fused Guidance for Image Captioning Using Sequence-Level Training

Figure 2

Overview of our proposed network. For the visual concept set , a unidirectional LSTM is adopted to obtain the encoded vector . The region image feature r is extracted by a Faster R-CNN, and the image representation is obtained by max pooling applied to r. In the decoder, a two-layer LSTM architecture is adopted. indicates the fused textual context. Both and the context-fused guidance are passed into the language LSTM along with the hidden state from the attention LSTM. The input vector X consists of the image representation, the word embedding, and the hidden state of the language LSTM.
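The decoder described in the caption can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact model: all class and parameter names (`TwoLayerDecoder`, `embed_dim`, `feat_dim`, `hidden_dim`), the dimensions, the simplified additive attention, and the assumption that the context-fused guidance is a single vector of the same size as the region features are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class TwoLayerDecoder(nn.Module):
    """Sketch of the two-layer LSTM decoder in Figure 2 (illustrative only).

    Attention LSTM input X = [pooled image feature; word embedding;
    previous language-LSTM hidden state]; language LSTM input =
    [attended region feature; context-fused guidance; attention-LSTM
    hidden state], following the caption.
    """

    def __init__(self, embed_dim=256, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.attn_lstm = nn.LSTMCell(feat_dim + embed_dim + hidden_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + feat_dim + hidden_dim, hidden_dim)
        # Simplified additive attention over region features.
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, regions, word_emb, guidance, state):
        # regions: (B, R, feat_dim) region features r from Faster R-CNN.
        # Image representation: max pooling over the region axis.
        v_bar = regions.max(dim=1).values
        (h1, c1), (h2, c2) = state

        # Attention LSTM consumes the input vector X.
        x = torch.cat([v_bar, word_emb, h2], dim=1)
        h1, c1 = self.attn_lstm(x, (h1, c1))

        # Soft attention over regions conditioned on h1.
        expanded = h1.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.attn(torch.cat([expanded, regions], dim=2))
        alpha = torch.softmax(scores, dim=1)
        attended = (alpha * regions).sum(dim=1)

        # Language LSTM receives the attended feature, the context-fused
        # guidance, and the attention-LSTM hidden state.
        h2, c2 = self.lang_lstm(torch.cat([attended, guidance, h1], dim=1), (h2, c2))
        return h2, ((h1, c1), (h2, c2))
```

In a full captioning model, `h2` would feed a linear-plus-softmax layer over the vocabulary to predict the next word at each decoding step.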