Research Article

Two-Level Multimodal Fusion for Sentiment Analysis in Public Security

Figure 1

Overall architecture of TlMF. The first stage, data preparation, converts the raw data into unimodal sequences for the text, audio, and video modalities. In the second stage, features are extracted from each unimodal sequence. A tensor fusion layer then fuses the text-based audio features and the text-based video features. Finally, a decision fusion layer is employed to improve the accuracy of classification and prediction in the sentiment analysis task.
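The tensor fusion step described above can be sketched with the standard outer-product construction, in which each feature vector is augmented with a constant 1 so the fused tensor preserves the individual features alongside their pairwise interactions. This is a minimal illustration, not the authors' exact implementation; the function name and feature dimensions are hypothetical.

```python
import numpy as np

def tensor_fusion(text_audio, text_video):
    """Fuse two feature vectors via an augmented outer product.

    Prepending a constant 1 to each vector means the resulting
    tensor contains the original features (the "1 x feature" rows
    and columns) as well as all cross-modal products.
    """
    a = np.concatenate(([1.0], text_audio))   # shape (d_a + 1,)
    v = np.concatenate(([1.0], text_video))   # shape (d_v + 1,)
    fused = np.outer(a, v)                    # shape (d_a + 1, d_v + 1)
    return fused.flatten()                    # flattened fusion feature

# Hypothetical feature dimensions, for illustration only
ta = np.random.randn(4)   # text-based audio feature
tv = np.random.randn(8)   # text-based video feature
z = tensor_fusion(ta, tv)
print(z.shape)            # (45,) = (4 + 1) * (8 + 1)
```

The flattened fusion feature would then be passed to downstream classifiers, whose outputs the decision fusion layer combines for the final prediction.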