Research Article
Two-Level Multimodal Fusion for Sentiment Analysis in Public Security
Figure 1
Overall architecture of TlMF. The first stage is data preparation, which turns the raw data into unimodal sequences for the text, audio, and video modalities. Once the unimodal sequences are obtained, the second stage extracts features from each modality. Then, a tensor fusion layer is used to fuse the text-based audio feature and the text-based video feature. Finally, a decision fusion layer is employed to improve the accuracy of classification and prediction in the sentiment analysis task.
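The caption does not specify the internals of the tensor fusion layer, but a common formulation fuses two feature vectors via an outer product, appending a constant 1 to each vector so the fused tensor retains the unimodal features alongside their bimodal interactions. The sketch below illustrates this style of fusion for the text-based audio and text-based video features; the feature dimensions and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tensor_fusion(z_a: np.ndarray, z_v: np.ndarray) -> np.ndarray:
    """Fuse two feature vectors via an outer product.

    A constant 1 is appended to each vector so the resulting tensor
    contains the original unimodal features as well as all pairwise
    (bimodal) interaction terms.
    """
    z_a1 = np.concatenate([z_a, [1.0]])  # shape (d_a + 1,)
    z_v1 = np.concatenate([z_v, [1.0]])  # shape (d_v + 1,)
    # Outer product gives a (d_a + 1) x (d_v + 1) interaction matrix,
    # flattened into a single fused feature vector.
    return np.outer(z_a1, z_v1).flatten()

# Hypothetical text-based audio and video features (dimensions assumed).
audio_feat = np.random.randn(4)
video_feat = np.random.randn(8)
fused = tensor_fusion(audio_feat, video_feat)
print(fused.shape)  # (45,) = (4 + 1) * (8 + 1)
```

The flattened fused vector can then be passed to a downstream classifier, whose outputs a decision fusion layer would combine for the final sentiment prediction.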