Context-Fused Guidance for Image Captioning Using Sequence-Level Training

<div>An illustration of the context gate. <i>f</i><sub>t</sub> is the scalar factor, s<sub>t</sub> is the fused textual context, and <i>E(y)</i> indicates the word embedding vectors.</div>

Computational Intelligence and Neuroscience

Figure 3: Context-Fused Guidance for Image Captioning Using Sequence-Level Training