Research Article | Open Access
GIF Video Sentiment Detection Using Semantic Sequence
With the development of social media, an increasing number of people use short videos in social media applications to express their opinions and sentiments. However, sentiment detection for short videos is a very challenging task because of the semantic gap problem and the sequence based sentiment understanding problem. In this context, we propose a SentiPair Sequence based GIF video sentiment detection approach with two contributions. First, we propose a Synset Forest method to extract sentiment related semantic concepts from WordNet to build a robust SentiPair label set. This approach considers the semantic gap between label words and selects a robust label subset that is related to sentiment. Second, we propose a SentiPair Sequence based GIF video sentiment detection approach that learns the semantic sequence to understand the sentiment of GIF videos. Our experimental results on the GSO-2016 (GIF Sentiment Ontology) dataset show that our approach not only outperforms four state-of-the-art classification methods but also shows better performance than the state-of-the-art middle level sentiment ontology features, Adjective Noun Pairs (ANPs).
1. Introduction

Nowadays, social applications (such as Facebook, Twitter, and Weibo) contain a huge number of texts, images, and video clips (GIFs). With faster Internet connections, people are more willing to post GIF videos than static images to make a personalized and appealing post. According to a recent study, the total proportion of visual content among all shared links on Twitter is 36%. Our statistical results on Sina Weibo, the largest microblog in China, show that 24% of multimedia posts contain GIF videos. However, despite the popularity of GIF videos in social networks, most sentiment detection approaches obtain users’ opinions by using only text based sentiment analysis technology. Research on GIF sentiment analysis is still in its early stages. There are two main challenges for GIF sentiment analysis: the semantic gap problem and the sequence based sentiment understanding problem. Firstly, the learning process lacks middle level features and a corresponding semantic label measure. Without a semantic label measure, a machine cannot learn the middle level sentiment semantic elements and their relations from low level features. Secondly, semantic sequence based sentiment expression is one of the important issues in GIF sentiment analysis. Because sentiments are hidden in the sequence of images, a machine cannot mine the impact of semantic sequence based sentiment expression from bag-of-words based feature representations.
In particular, we make the following contributions to solve the above two problems: (1) we propose a Synset Forest method to select semantic SentiPair labels, which solves the semantic gap problem in the label set; (2) we propose a SentiPair Sequence based GIF video sentiment detection approach, which solves the sequence based sentiment understanding problem.
The remainder of this paper is organized as follows. Section 2 briefly describes the background and related works in visual sentiment analysis. Section 3 presents the middle level feature, SentiPair Sequence. The algorithm and framework of SentiPair Sequence based approach are detailed in Section 4. Experimental results on GIF video dataset are given in Section 5. Section 6 draws conclusions and gives directions for future work.
2. Related Work
Traditional sentiment research focuses on text based sentiment analysis because words are the most common way of expressing opinions. According to the granularity of analysis, text based sentiment research can be divided into three levels: document level, sentence level, and entity level.
With the development of mobile devices and social media, an increasing number of GIF videos are used to express users’ opinions in social media. Hence, visual sentiment analysis has become a hot topic in the multimedia and social media fields. According to the visual content type, recent studies can be divided into two types: image sentiment analysis and video sentiment analysis.
For image sentiment analysis, You et al. used a progressive CNN and bypassed the midlevel features. Without a midlevel ontology, the number of neurons and connections is huge due to the “abstract” nature of visual sentiment. Deep networks need huge amounts of less “noisy” labeled training instances to adjust this huge number of neurons; otherwise, they get stuck in local optima. To build a robust visual sentiment ontology, Borth et al. and Yuan et al. proposed to employ midlevel entities or attributes as features for image sentiment analysis. In the former work, 1,200 Adjective Noun Pairs (ANPs), which correspond to different levels of different emotions, were extracted. These ANPs were used as queries to retrieve images from Flickr. Then, pixel-level features of images in each ANP were employed to train 1,200 ANP detectors. The responses of these 1,200 classifiers were finally used as midlevel features for visual sentiment analysis. The latter work employed a similar mechanism; the main difference is that 102 scene attributes were used instead. Furthermore, Jou et al. proposed a large-scale Multilingual Visual Sentiment Ontology (MVSO), based on VSO, to solve the multilingual problem in visual sentiment expression. Campos et al. used a fine-tuned CNN to improve vision based sentiment prediction. Using ANPs, Cao et al. proposed a visual sentiment topic model for topic level sentiment detection, and Wang et al. used a bag-of-words model for cross-media sentiment detection. To solve the problem of modeling object-based visual concepts, Chen et al. proposed a hierarchical framework to handle concept classification in an object specific manner. To handle the multimodality problem in sentiment learning, Li et al. proposed a multimodal correlation model to build the correlation between modalities. Furthermore, Chen et al. proposed a multimodal hypergraph learning model to bridge modalities of cross-media. You et al. [14, 15] constructed a joint visual-textual sentiment framework which utilized both state-of-the-art visual and textual sentiment analysis techniques for joint visual-textual sentiment analysis.
For video sentiment analysis, Morency et al. proposed a framework which utilized video sound and facial expressions to analyze “interview clips.” They focused on sentiment analysis of videos with fixed content, similar patterns, and moderate noise. The experiment results are promising, but, because the subject matter is constrained, the method cannot be used to deal with large-scale GIF videos. Jou et al. proposed to use features such as color histograms to train a framework for online GIF sentiment analysis. They also proposed a good GIF emotion dataset. However, the labels of the dataset lacked temporal sequence information, which is important for understanding how an action in a GIF video yields sentiment. Cai et al. proposed a spatial-temporal visual midlevel ontology and dataset. They constructed a semantic tree to label visual sentiments. However, there was no learning approach which could use those midlevel ontology labels to learn the semantic sequence for GIF sentiment analysis.
In general, the study of GIF sentiment detection is still in its early stages; the semantic gap problem and the sequence based sentiment understanding problem remain the main challenges in this topic.
3. SentiPair Sequence
To solve the semantic gap problem, we propose a middle level sentiment representation named SentiPair Sequence. In the construction of SentiPairs, we consider three important criteria: emotional correlation, universality, and detectability. Emotional correlation means that the middle level features should be related to the expression of sentiment in videos. Universality means that the middle level features should cover most kinds of visual sentiment concepts in videos. Detectability means that the middle level features should be easy to detect.
3.1. Emotional Correlation
For the first criterion, we introduce the SentiPair Sequence to show why it satisfies emotional correlation. Here, the SentiPair is the joint name of Adjective Noun Pair (ANP) and Verb Noun Pair (VNP). We think that there are two important sentiment expression factors in GIF videos: appearances and motions. Firstly, people often use adjective words to describe the appearances of an object which contain the subjective sentiment of users, like “lovely girl” and “cute dog.” Secondly, the motions of an object are also used to express the dynamic changes of sentiment, like “girl cry” and “girl smile.” To describe the appearances and motions, we use ANPs and VNPs, respectively.
After we obtain ANPs and VNPs, we can form a SentiPair Sequence as follows:

S = {(p_1, t_1), (p_2, t_2), …, (p_n, t_n)},  (1)

where p_i is the i-th SentiPair and t_i is the time at which the i-th SentiPair appears.
The above equation shows that SentiPair Sequence denotes a sequence of appearances and motions under time series. Therefore, it effectively combines two important sentiment expression factors and enriches emotion labels for learning.
More specifically, each SentiPair refers to either a concrete concept like “smile face” or a specific motion like “falling cup.” In a SentiPair Sequence, SentiPairs are sorted by the order of their occurrence. Figure 1 shows a typical SentiPair Sequence. As we can see, the girl in the video acts differently. In the first frame, the girl was smiling and hence the first SentiPair indicates “Lovely Girl.” In the second frame, the girl looked a bit worried, and the second SentiPair is “Innocent Girl.” With the third SentiPair indicating “Girl Frown,” we can find out that the girl looks sad, which contains a negative sentiment tendency. In the last frame, the girl failed to suppress her feeling and the SentiPair indicates “Girl Shout.” As a result, we can denote the SentiPair Sequence of this GIF video as follows: “Lovely Girl,” “Innocent Girl,” “Girl Frown,” and “Girl Shout.”
SentiPair Sequence describes the concepts associated with sentiment judgment. In general, SentiPair Sequence carries two kinds of concepts. The first one is the existing objects (by ANPs), and the second one is object’s motions (by VNPs).
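The time-ordered structure described above can be sketched as a small data structure; the class and field names below are illustrative, not from the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class SentiPair:
    """An Adjective Noun Pair (ANP) or Verb Noun Pair (VNP)."""
    modifier: str   # adjective (ANP) or verb (VNP)
    noun: str

# A SentiPair Sequence is a time-ordered list of (pair, time) entries,
# mirroring S = {(p_1, t_1), ..., (p_n, t_n)}.
# The example sequence from Figure 1:
sequence = [
    (SentiPair("lovely", "girl"), 0.0),
    (SentiPair("innocent", "girl"), 0.8),
    (SentiPair("frown", "girl"), 1.6),   # VNP: verb + noun
    (SentiPair("shout", "girl"), 2.4),
]

# Pairs must stay sorted by their order of occurrence.
assert all(t1 <= t2 for (_, t1), (_, t2) in zip(sequence, sequence[1:]))
```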
3.2. Universality

For the second criterion, we introduce the Synset Forest to build an ANP and VNP word set which covers most words that are semantically related to sentiment. The Synset Forest consists of three trees: an adjective tree, a verb tree, and a noun tree. An example of all three trees can be found in Figure 2. For example, the word “smile” mostly denotes positive sentiment and belongs to the verb tree, and the word “good” also denotes positive sentiment and belongs to the adjective tree. All words are organized in a hierarchical tree structure. Furthermore, considering the semantic meaning of each word, our Synset Forest is built from WordNet, a famous lexical database of English, in which Synsets are interlinked by means of conceptual-semantic and lexical relations. By using WordNet Synsets, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms, as Figure 2 shows. Therefore, the proposed Synset Forest models a unified semantic and concept architecture related to sentiment. In the construction of SentiPairs, the Synset Forest acts as a collection of candidate words for ANPs and VNPs.
Beyond that, the Synset Forest can be used to improve the detection performance of SentiPairs through relative ranking, which calculates the semantic distance between two entities in the Synset Forest. Even when the SentiPair classification result of a GIF video is wrong, we can adjust the score by considering the semantic relations between SentiPairs.
Relative ranking means that we rerank the classification scores according to semantic relations in the Synset Forest. The calculation formula is shown as follows:

Rscore(l_i) = α · Cscore(l_i) + β · Σ_{j ≠ i} Cscore(l_j) / Semdis(l_i, l_j),  with  Semdis(l_i, l_j) = minstep(l_i, l_j),  (2)

where each label l is a SentiPair whose words are selected from the Synset Forest, Cscore(l_i) is the classification score of label l_i, Semdis(l_i, l_j) is the semantic distance between labels l_i and l_j, α and β are tuning parameters, and minstep(l_i, l_j) is the minimum number of steps needed to walk from l_i to l_j in the Synset Forest. For example, minstep(dog, cat) is 2 and minstep(dog, table) is 4. In our experiments, α and β are set to 0.5.
By using the above equations, we can determine whether a new label is related to sentiment by calculating its sentiment classification score and its semantic distance to a positive or negative sentiment label. Therefore, we can cover most words that are semantically related to sentiment. For example, if the ground truth label of an image is “cute animal” and the SentiPair detectors predict “cute dog,” the score of “cute animal” can be improved through relative ranking in (2), because “animal” is close to “dog” in Figure 2: the second part of (2) increases when Semdis(“cute dog”, “cute animal”) is small and Cscore(“cute dog”) is high.
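As an illustration of relative ranking, the sketch below builds a toy fragment of the noun tree, computes minstep by breadth-first search (reproducing the minstep(dog, cat) = 2 and minstep(dog, table) = 4 examples from the text), and reranks a classifier score. The exact combination of Cscore and Semdis in `rscore` is our assumption, not the paper's published formula:

```python
from collections import deque

# Toy noun-tree fragment (hypothetical edges; the real Synset Forest
# is built from WordNet).
edges = [("entity", "animal"), ("entity", "furniture"),
         ("animal", "dog"), ("animal", "cat"), ("furniture", "table")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def minstep(u, v):
    """Minimum number of edges between two words in the Synset Forest."""
    if u == v:
        return 0
    seen, frontier = {u}, deque([(u, 0)])
    while frontier:
        node, d = frontier.popleft()
        for nxt in graph[node]:
            if nxt == v:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    raise ValueError("words are not connected")

assert minstep("dog", "cat") == 2      # example from the text
assert minstep("dog", "table") == 4    # example from the text

def rscore(label, cscore, alpha=0.5, beta=0.5):
    """Reranked score: the label's own classifier score plus the scores
    of other labels, discounted by semantic distance (a sketch of the
    idea behind Eq. (2); the exact combination is an assumption)."""
    own = alpha * cscore[label]
    neighbours = sum(s / minstep(label, other)
                     for other, s in cscore.items() if other != label)
    return own + beta * neighbours

# "cute animal" vs "cute dog": a confident "dog" detection lifts "animal".
cscore = {"animal": 0.2, "dog": 0.9}
assert rscore("animal", cscore) > cscore["animal"]
```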
3.3. Detectability

For the third criterion, we define two indicators, Sentiment Richness and Sentiment Appearance Probability, to determine which SentiPairs have enough sentiment meaning and enough samples to learn from.
3.3.1. Sentiment Richness
The calculation of Sentiment Richness comes from the scores of SentiWordNet. The SentiWordNet score of a word denotes the sentiment level of the word and lies in the range [−1, 1], where 0 denotes neutral sentiment and values close to 1 or −1 mean that the word carries strong emotional meaning. Therefore, Sentiment Richness is formulated as follows:

SR(w) = |Sscore(w)|,  (3)

where Sscore(w) is the SentiWordNet score of word w.
3.3.2. Sentiment Appearance Probability
We think that a middle level feature can be detected only when there are enough samples. Hence, we choose high frequency words that are used to express sentiment. The Sentiment Appearance Probability is calculated as follows:

SAP(w) = freq(w) / max_{w′} freq(w′),  (4)

where freq(w) denotes the frequency of word w on a famous GIF video website (https://giphy.com/) and the denominator is the maximum word frequency on that site. Giphy is one of the biggest websites collecting GIF videos with annotations, and we consider these annotations helpful in calculating Sentiment Appearance Probability.
The final selection of SentiPair words is based on a threshold over the combined score:

Score(w) = γ · SR(w) + δ · SAP(w),  (5)

where γ and δ are tuning parameters. In our experiments, γ and δ are set to 0.5.
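A minimal sketch of the two indicators and their weighted combination, assuming Sentiment Richness is the absolute SentiWordNet score; the scores, frequencies, and weighting below are made up for illustration:

```python
# Hypothetical SentiWordNet scores (in [-1, 1]) and Giphy word
# frequencies; real values would come from SentiWordNet and crawled
# giphy.com annotations.
senti_score = {"cute": 0.75, "man": 0.0, "cry": -0.6}
frequency = {"cute": 42_000, "man": 90_000, "cry": 18_000}
max_frequency = max(frequency.values())

def richness(word):
    """Sentiment Richness: absolute SentiWordNet score (assumed form)."""
    return abs(senti_score[word])

def appearance_probability(word):
    """Sentiment Appearance Probability: word frequency normalized by
    the maximum word frequency."""
    return frequency[word] / max_frequency

def selection_score(word, gamma=0.5, delta=0.5):
    """Weighted combination used to threshold candidate words
    (the weighting scheme is an assumption)."""
    return gamma * richness(word) + delta * appearance_probability(word)

# A very frequent but sentiment-neutral word like "man" can still score
# well, since frequency compensates for a low SentiWordNet score.
for word in senti_score:
    print(word, round(selection_score(word), 3))
```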
Through the above three criteria, we can cover most words that are semantically related to sentiment and select a suitable word subset as SentiPair labels. Therefore, by learning the middle level features, a machine can perceive most kinds of sentiment expressions in GIF videos. Finally, as shown in Table 1, we select 889 nouns, 91 verbs, and 375 adjectives from WordNet. Examples of selected and unselected words are given in Table 2. The unselected words not only have low scores in SentiWordNet but also have very low frequencies on https://giphy.com/. In our examples, although “man” and “cat” have low SentiWordNet scores, they have high frequencies on https://giphy.com/.
4. SentiPair Sequence Based Sentiment Detection
To effectively learn the middle level features and understand GIF video sequences, we propose a two-step learning framework which combines the advantages of two different deep neural networks: the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). In the first step, it learns how to obtain SentiPair features from GIF videos by using the image learning ability of the CNN. In the second step, it learns how to detect GIF video sentiment by using the semantic sequence learning ability of the LSTM. The learning framework is shown in Figure 3.
From bottom to top, firstly, each frame of the GIF video is fed into a 7-layer CNN to learn SentiPair features; secondly, the SentiPair Sequence is used as the input of the LSTM layer to learn the semantic sequence; and, finally, the output of the LSTM layer is used to determine one of three types of sentiment (positive, negative, or neutral) through a mean pooling layer. The details of this framework are given in the following subsections.
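The bottom-to-top pipeline can be sketched end to end as follows; the CNN and LSTM here are untrained random stand-ins with toy dimensions, meant only to show how the pieces connect:

```python
import numpy as np

rng = np.random.default_rng(1)
N_SENTIPAIRS, D_HID, N_CLASSES = 1274, 64, 3

def cnn_sentipair_features(frame):
    """Stand-in for the 7-layer CNN: maps a frame to a vector of
    SentiPair responses (random here; a trained CNN in the paper)."""
    return rng.random(N_SENTIPAIRS)

def lstm_over_sequence(features):
    """Stand-in for the LSTM layer: one hidden vector per time-step
    (a simplified recurrence, not the full gated update)."""
    W = rng.normal(size=(D_HID, N_SENTIPAIRS)) * 0.01
    h, hidden = np.zeros(D_HID), []
    for x in features:
        h = np.tanh(W @ x + h)
        hidden.append(h)
    return np.stack(hidden)

def classify(hidden):
    """Mean pooling over time-steps followed by a 3-way softmax."""
    pooled = hidden.mean(axis=0)
    W_out = rng.normal(size=(N_CLASSES, D_HID)) * 0.1
    z = W_out @ pooled
    e = np.exp(z - z.max())
    return e / e.sum()

frames = [None] * 6                       # six GIF frames (placeholders)
features = [cnn_sentipair_features(f) for f in frames]
probs = classify(lstm_over_sequence(features))
assert probs.shape == (3,)                # positive / negative / neutral
```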
4.1. Middle Level Features Learning
For the first step, SentiPair learning is a multilabel learning problem because each image may contain more than one SentiPair. To effectively capture the ability to yield multiple labels, we use the sigmoid cross-entropy function as the loss function in our deep neural network:

L = −(1/N) Σ_{n=1}^{N} Σ_j [p̂_{n,j} log p_{n,j} + (1 − p̂_{n,j}) log(1 − p_{n,j})],  (6)

where N is the number of samples, p̂ is the distribution probability of the ground truth for a given input, and p is the distribution probability of the detection.
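The multilabel sigmoid cross-entropy can be sketched as follows, assuming per-sample losses are summed over labels and averaged over samples (the averaging convention is our assumption):

```python
import numpy as np

def sigmoid_cross_entropy(p_true, p_pred, eps=1e-12):
    """Multilabel sigmoid cross-entropy: each SentiPair output is
    treated as an independent binary prediction."""
    p_pred = np.clip(p_pred, eps, 1 - eps)   # numerical safety
    per_label = -(p_true * np.log(p_pred)
                  + (1 - p_true) * np.log(1 - p_pred))
    return per_label.sum(axis=1).mean()      # sum labels, average samples

# Two samples, three SentiPair labels (toy dimensions).
p_true = np.array([[1., 0., 1.],
                   [0., 1., 0.]])
p_pred = np.array([[0.9, 0.1, 0.8],
                   [0.2, 0.7, 0.1]])
loss = sigmoid_cross_entropy(p_true, p_pred)
assert loss > 0
```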
In our CNN based SentiPair learning network, we use a 7-layer neural network to learn 1,274 SentiPairs (Figure 4). This framework is similar to previous work. Because we do not have enough data and annotation is labor-intensive, we cannot directly train the network to learn SentiPairs from scratch. To obtain a robust model from limited data, we use a large dataset, ImageNet, as a supplement. Firstly, we use ImageNet data to learn the basic image features yielded by the 6th fully connected layer. Then we fix the parameters from layer 1 to layer 6 and change the output layer from an object vector to a SentiPair vector. Finally, after training, layer 7 learns a mapping from image features to SentiPair features.
4.2. Sentiment Sequence Learning
For the second step, to learn sentiment from a SentiPair Sequence, we use the LSTM model, which is often used to model semantic sequences in text. LSTM is a classical recurrent neural network. Its main advantage is that it alleviates the problems of gradient vanishing and explosion in sequence learning; therefore, it can learn long dependencies in a sequence through a memory unit and a three-gate mechanism. Formally, the update formulas of LSTM are as follows:

i_t = σ(W_i x_t + U_i h_{t−1}),
f_t = σ(W_f x_t + U_f h_{t−1}),
o_t = σ(W_o x_t + U_o h_{t−1}),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1}),
h_t = o_t ⊙ tanh(c_t),  (7)

where i_t, f_t, o_t, and c_t are the input gate, forget gate, output gate, and memory cell activation vector at time-step t, respectively, h_t is the hidden vector, σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_i, U_i, W_f, U_f, W_o, U_o, W_c, and U_c are training parameters.
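A single LSTM update step, following the standard gate equations, can be sketched as follows; the dimensions and random parameters are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update: input/forget/output gates modulate the memory
    cell c_t, which in turn yields the hidden vector h_t."""
    W, U = params["W"], params["U"]          # per-gate weight matrices
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)
    c = f * c_prev + i * np.tanh(W["c"] @ x_t + U["c"] @ h_prev)
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 8, 4   # toy sizes; the real input is a SentiPair vector
params = {
    "W": {g: rng.normal(size=(d_hid, d_in)) for g in "ifoc"},
    "U": {g: rng.normal(size=(d_hid, d_hid)) for g in "ifoc"},
}
h, c = np.zeros(d_hid), np.zeros(d_hid)
for _ in range(5):                           # five frames of features
    x = rng.normal(size=d_in)
    h, c = lstm_step(x, h, c, params)
assert h.shape == (d_hid,)
```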
Because the output dimension of the LSTM increases with the length of the GIF video sequence, we feed the hidden vectors h_t into a mean pooling layer to reduce the dimension of the LSTM output, as shown in Figure 3. The mean value is calculated as follows:

h̄ = (1/K) Σ_{t=1}^{K} h_t,  (8)

where K is the window size of mean pooling. In our experiment, K is set to the maximum length of the LSTM.
After the mean pooling layer, the pooled values are fed into a softmax layer to determine the final sentiment: positive, negative, or neutral. The loss function is defined as follows:

L = −(1/N) Σ_{n=1}^{N} Σ_{c=1}^{C} y_{n,c} log ŷ_{n,c},  (9)

where N is the number of samples, ŷ_n is the detection result for sample n, and C is the number of labels.
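Mean pooling over the hidden vectors followed by the softmax loss can be sketched as follows; the toy hidden states and the identity "classifier" are illustrative stand-ins:

```python
import numpy as np

def mean_pool(hidden_states):
    """Average the LSTM hidden vectors over all time-steps (window K
    set to the maximum sequence length)."""
    return np.mean(hidden_states, axis=0)

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_cross_entropy(logits, true_class):
    """Final 3-way sentiment loss for one sample."""
    return -np.log(softmax(logits)[true_class])

hidden = np.array([[0.2, -0.1, 0.4],     # toy h_t vectors, 3 time-steps
                   [0.1,  0.0, 0.5],
                   [0.3, -0.2, 0.6]])
pooled = mean_pool(hidden)
logits = pooled @ np.eye(3)              # identity "classifier" for the sketch
loss = softmax_cross_entropy(logits, true_class=0)  # 0 = positive
assert loss > 0
```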
5. Experiments

In this section, we design several experiments to verify our framework and compare its performance with other state-of-the-art algorithms. As the main contribution of our work is SentiPair Sequence based GIF video sentiment detection, the first experiment shows the SentiPair detection performance. In the second experiment, four state-of-the-art bag-of-words machine learning methods are compared with our approach using both SentiPair Sequence and ANP features. Finally, some examples are compared between SentiPair Sequence and ANP features.
5.1. Experiment Setting
Since there is no suitable GIF video dataset labeled with SentiPairs, we construct a new labeled dataset named GSO-2016 to train the sentiment classifiers. The GIF videos of the GSO-2016 dataset came from one of the most popular microblogs. All GIF videos were posted by online users and were collected automatically. We recruited 7 workers who are undergraduate students at our university. Each worker was shown one GIF video at a time and was expected to accomplish two tasks. Task 1 is to select suitable words from the Synset Forest and form a SentiPair Sequence description for the given GIF. To be more specific, for each GIF, SentiPairs were chosen by browsing the words and tree structure of the Synset Forest. Each SentiPair consists of either an adjective and a noun (ANP) or a verb and a noun (VNP). For example, in the second row of Table 3, the given GIF video was labeled with “sad man” (ANP) and “combust money” (VNP) according to the GIF video sequence. In Task 2, workers were expected to give an overall sentiment judgment (positive, negative, or neutral) for each video. For example, in the second row of Table 3, the GIF video is labeled with negative sentiment.
In GSO-2016, we provide labeled and unlabeled GIF videos for supervised learning, one-shot learning, and unsupervised learning. There are 36,039 GIF videos in total, of which 1,874 were labeled with SentiPairs and one of three kinds of sentiment (positive, negative, or neutral). The dataset can be downloaded from our website (https://pan.baidu.com/s/1hrJBSAo). Examples of positive, negative, and neutral sentiment GIF videos are shown in Table 3. We hope that this dataset can promote the development of semantic vision understanding.
The labeled data in the GSO-2016 dataset consist of 1,111 positive instances (59.2%), 164 negative instances (8.8%), and 599 neutral instances (32%). The evaluation metric in our experiments is sentiment detection accuracy. We use 80% of the labeled data as the training set and the remaining 20% as the test set. The distribution of the experiment dataset is shown in Table 4.
5.2. Experiment Result and Analysis
5.2.1. SentiPair Experiment
In this experiment, we tried three methods of SentiPair detection: a 7-layer CNN with single labels, a 7-layer CNN with single labels and relative ranking (see (2)), and a 7-layer CNN with multiple labels and relative ranking. Single label means that we choose only the first SentiPair label in training and yield multiple labels in testing according to the scores of the labels. Multilabel means that we use multiple labels in training under the sigmoid cross-entropy loss. The experiment results are shown in Figure 5, with relative ranking scores calculated according to (2).
In Figure 5, Top k means the k labels with the highest classification scores. According to the thresholds for choosing SentiPairs (0.7, 0.8, and 0.9), we obtain 31,463, 5,111, and 1,274 SentiPairs, respectively. From the results, we can draw the following conclusions. Multilabel learning with relative ranking achieves the best detection performance on 1,274 SentiPairs, obtaining 1.6% and 2.3% improvements over single-label learning with relative ranking in the Top 5 and Top 10 results, respectively. Considering that the accuracy of the 7-layer CNN with single labels and relative ranking is only 2.1% and 4.1% in the Top 5 and Top 10 results, respectively, our improvement is significant. Moreover, the accuracy increases as the number of SentiPairs decreases.
5.2.2. Sentiment Prediction Experiment
In this experiment, to show the performance of SentiPair Sequence based GIF sentiment detection, we compared our approach with four state-of-the-art classification methods (SMO, Naive Bayes, AdaBoost, and Logistic Regression) under two different middle level features (SentiPair and ANP). All four classification methods used ANPs and SentiPairs through a bag-of-words model, which means that they cannot use GIF sequence information. We choose the same learning structure (Figure 3) trained with only the 1,874 labeled GIF videos as our baseline to show the effectiveness of the middle level features. The ANP detectors were trained with AlexNet on more than 500,000 images from Flickr. The SentiPair detectors were trained with the 7-layer CNN on the 1,874 labeled GIF videos from the GSO-2016 dataset and a large number of unlabeled images from ImageNet. The experiment results are shown in Figure 6.
From the results, we can draw the following conclusions: (1) middle level features (SentiPair and ANP) outperform low level features (raw data), indicating that middle level features are more robust than low level features in representing visual sentiment; (2) LSTM outperforms the other four state-of-the-art classification methods, which do not learn the GIF sequence, on both SentiPairs and ANPs, indicating that LSTM effectively learns the impact of time sequence information on expressing sentiment; (3) SentiPairs outperform ANPs in all learning methods except Logistic Regression, which shows that SentiPairs are better than ANPs for sentiment learning and indicates that the combination of ANPs and VNPs is helpful in GIF video sentiment detection.
Comparing the experiment results in Figures 5 and 6, we can see that, although the SentiPair prediction performance is low, SentiPairs still improve the sentiment prediction results. In our SentiPair experiments, there are only 1,874 GIF videos labeled with SentiPairs in GSO-2016, while more than 1,274 SentiPairs need to be learned. As a consequence, it is hard to achieve good enough SentiPair detection performance. However, the SentiPairs are strongly related to the three kinds of sentiment. According to our results on the 1,274 SentiPairs in Table 5, only 16.7% of SentiPairs are related to more than one kind of sentiment, and 83.3% are strongly related to one kind of sentiment. Even when the SentiPair prediction is wrong, the predicted labels of a GIF video and its ground truth labels are related to the same sentiment with high probability.
5.2.3. Case Study
To further compare the details of SentiPairs and ANPs in sentiment detection, we show some cases in Table 6. In this table, pictures in red circles are incorrectly classified by both the ANP based and the SentiPair based approaches, and pictures in green boxes demonstrate cases where SentiPairs outperform ANPs on GIF videos. Although the SentiPair predictions are imperfect, there are semantic relations between the ground truth labels and our predictions, and these still yield good sentiment classification results. For example, “smile boy” (prediction) is similar to “laugh boy” (ground truth) because “smile” and “laugh” have similar semantic meanings; “cry boy” (prediction) is similar to “weep boy” (ground truth) because “cry” and “weep” both describe the action of crying; “cute dog” (prediction) is similar to “cute snail” (ground truth) because “dog” and “snail” both belong to “animal.” Furthermore, by combining the advantages of ANPs and VNPs, SentiPairs can outperform the ANP based approach. Table 6 shows two examples with green boxes illustrating this advantage: although the predicted ANPs, “one boy” and “tearful face,” show neutral and negative sentiment, respectively, our predicted SentiPairs, “smile boy” and “dance girl,” correct the wrong sentiment.
6. Conclusions and Future Work

GIF video sentiment detection is a challenging task. Considering the role of GIF video sequences and motions in sentiment expression, in this paper, we propose a SentiPair Sequence based approach for GIF video sentiment detection. The SentiPair Sequence not only bridges the low level image features and the high level sentiment semantic space but also supervises the learning process to learn the sentiment expressed by motions and video sequences. The experiments show a prediction accuracy of 81.2%, a significant improvement over the other four state-of-the-art classification methods and the state-of-the-art middle level features, ANPs. We also released our dataset, GSO-2016, to the public. GSO-2016 contains 1,874 manually labeled GIF videos selected from more than 30,000 candidates; each video was labeled with both sentiments and SentiPair Sequences. We believe it will be helpful for further research. Finally, since the performance of SentiPair detection is not yet good enough to fully support sentiment detection, enhancing the CNN is one of the important issues in our future work.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments

This work is supported by the National Natural Science Foundation of China (no. 61402386, no. 61305061, no. 61502105, no. 61572409, no. 81230087, and no. 61571188), the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (no. MJUKF201743), the Education and Scientific Research Projects of Young and Middle-Aged Teachers in Fujian Province under Grant no. JA15075, and the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management and Collaborative Innovation Center of Chinese Oolong Tea Industry—Collaborative Innovation Center (2011) of Fujian Province.
References

- J. Yuan, Q. You, S. Mcdonough, and J. Luo, “Sentribute: Image sentiment analysis from a mid-level perspective,” in Proceedings of the 2nd International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM '13), ACM, August 2013.
- B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP '02), vol. 10, pp. 79–86, Association for Computational Linguistics, July 2002.
- L. Zhou, B. Li, W. Gao, Z. Wei, and K.-F. Wong, “Unsupervised discovery of discourse relations for eliminating intra-sentence polarity ambiguities,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pp. 162–171, Association for Computational Linguistics, 2011.
- M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pp. 168–177, ACM, August 2004.
- Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI '15), pp. 381–388, January 2015.
- D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 223–232, October 2013.
- B. Jou, T. Chen, N. Pappas, M. Redi, M. Topkara, and S.-F. Chang, “Visual affect around the world: a large-scale multilingual visual sentiment ontology,” in Proceedings of the 23rd ACM International Conference on Multimedia (MM '15), pp. 159–168, October 2015.
- V. Campos, A. Salvador, X. Giró-I-Nieto, and B. Jou, “Diving deep into sentiment: understanding fine-tuned CNNs for Visual sentiment prediction,” in Proceedings of the 1st International Workshop on Affect and Sentiment in Multimedia (ASM '15), pp. 57–62, ACM, 2015.
- D. Cao, R. Ji, D. Lin, and S. Li, “Visual sentiment topic model based microblog image sentiment analysis,” Multimedia Tools and Applications, vol. 75, no. 15, pp. 8955–8968, 2016.
- M. Wang, D. Cao, L. Li, S. Li, and R. Ji, “Microblog sentiment analysis based on cross-media bag-of-words model,” in Proceedings of the 6th International Conference on Internet Multimedia Computing and Service (ICIMCS '14), pp. 76–80, July 2014.
- T. Chen, F. X. Yu, J. Chen, Y. Cui, Y.-Y. Chen, and S.-F. Chang, “Object-based visual sentiment concept analysis and application,” in Proceedings of the 2014 ACM Conference on Multimedia (MM '14), pp. 367–376, November 2014.
- L. Li, D. Cao, S. Li, and R. Ji, “Sentiment analysis of Chinese micro-blog based on multi-modal correlation model,” in Proceedings of the IEEE International Conference on Image Processing (ICIP '15), pp. 4798–4802, Quebec City, Canada, September 2015.
- F. Chen, Y. Gao, D. Cao, and R. Ji, “Multimodal hypergraph learning for microblog sentiment prediction,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '15), July 2015.
- Q. You, J. Luo, H. Jin, and J. Yang, “Joint visual-textual sentiment analysis with deep neural networks,” in Proceedings of the 23rd ACM International Conference on Multimedia (MM '15), pp. 1071–1074, October 2015.
- Q. You, J. Luo, H. Jin, and J. Yang, “Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia,” in Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM '16), pp. 13–22, February 2016.
- L.-P. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment analysis: harvesting opinions from the web,” in Proceedings of the 2011 ACM International Conference on Multimodal Interaction (ICMI '11), pp. 169–176, November 2011.
- B. Jou, S. Bhattacharya, and S.-F. Chang, “Predicting viewer perceived emotions in animated GIFs,” in Proceedings of the ACM Conference on Multimedia (MM '14), pp. 213–216, November 2014.
- Z. Cai, D. Cao, D. Lin, and R. Ji, “A spatial-temporal visual mid-level ontology for GIF sentiment analysis,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '16), pp. 4860–4865, IEEE, Vancouver, Canada, July 2016.
- S. Baccianella, A. Esuli, and F. Sebastiani, “SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining,” in Proceedings of the International Conference on Language Resources and Evaluation (LREC '10), pp. 83–90, Valletta, Malta, May 2010.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
Copyright © 2017 Dazhen Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.