Research Article

Context-Fused Guidance for Image Captioning Using Sequence-Level Training

Figure 2

Overview of our proposed network. For the visual concept set , a unidirectional LSTM is adopted to obtain the encoded vector . The region image feature r is extracted by a Faster R-CNN, and the image representation is obtained by max pooling applied to r. In the decoder, a two-layer LSTM architecture is adopted. indicates the fused textual context. Both and the context-fused guidance are passed into the language LSTM along with the hidden state from the attention LSTM. The input vector X consists of the image representation, the word embedding, and the hidden state of the language LSTM.
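The decoder described in the caption can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact model: all class and parameter names (`TwoLayerDecoder`, `embed_dim`, `feat_dim`, `hidden_dim`), the dimensions, the simplified additive attention, and the assumption that the context-fused guidance is a single vector of the same size as the region features are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class TwoLayerDecoder(nn.Module):
    """Sketch of the two-layer LSTM decoder in Figure 2 (illustrative only).

    Attention LSTM input X = [pooled image feature; word embedding;
    previous language-LSTM hidden state]; language LSTM input =
    [attended region feature; context-fused guidance; attention-LSTM
    hidden state], following the caption.
    """

    def __init__(self, embed_dim=256, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.attn_lstm = nn.LSTMCell(feat_dim + embed_dim + hidden_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + feat_dim + hidden_dim, hidden_dim)
        # Simplified additive attention over region features.
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)

    def forward(self, regions, word_emb, guidance, state):
        # regions: (B, R, feat_dim) region features r from Faster R-CNN.
        # Image representation: max pooling over the region axis.
        v_bar = regions.max(dim=1).values
        (h1, c1), (h2, c2) = state

        # Attention LSTM consumes the input vector X.
        x = torch.cat([v_bar, word_emb, h2], dim=1)
        h1, c1 = self.attn_lstm(x, (h1, c1))

        # Soft attention over regions conditioned on h1.
        expanded = h1.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.attn(torch.cat([expanded, regions], dim=2))
        alpha = torch.softmax(scores, dim=1)
        attended = (alpha * regions).sum(dim=1)

        # Language LSTM receives the attended feature, the context-fused
        # guidance, and the attention-LSTM hidden state.
        h2, c2 = self.lang_lstm(torch.cat([attended, guidance, h1], dim=1), (h2, c2))
        return h2, ((h1, c1), (h2, c2))
```

In a full captioning model, `h2` would feed a linear-plus-softmax layer over the vocabulary to predict the next word at each decoding step.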