Abstract

Visual relationship detection (VRD) aims to locate objects and recognize their pairwise relationships for parsing scene graphs. To enable a higher-level understanding of the visual scene, we propose a symmetric fusion learning model for visual relationship detection and scene graph parsing. We integrate object and relationship features at the visual and semantic levels for better relation feature mapping. First, we apply feature fusion to construct the visual module and introduce a semantic representation learning module combined with large-scale external knowledge. We then minimize the loss by matching the visual and semantic embeddings with our designed symmetric learning module. The symmetric learning module, based on reverse cross-entropy, boosts cross-entropy symmetrically and performs reverse supervision for inaccurate annotations. Our model is compared with other state-of-the-art methods on two public data sets. Experiments show that the proposed model achieves encouraging performance on various metrics for both data sets. Further detailed analysis demonstrates that the proposed method performs better by partially alleviating the impact of inaccurate annotations.

1. Introduction

The rapid development of the computer vision community has pushed object detection and semantic segmentation forward in a short time. These advances are driven by deep neural network baselines such as the R-CNN and fully convolutional network (FCN) frameworks for object detection and semantic segmentation. Advanced deep convolutional neural networks (CNNs) have achieved strong performance in visual tasks such as image classification [1], object detection [2], and visual relationship detection [3]. Nevertheless, these CNNs need to be trained in a fully supervised manner, requiring manually annotated data sets such as ImageNet [4], MS-COCO [5], and Pascal VOC [6]. Most existing VRD models detect semantic relationships on the VisualGenome (VG) [7] and Visual Relationship Detection (VRD) [3] data sets. However, collecting and labeling a multimodal data set is costly and error-prone in practice. Inaccurate and insufficient labels are common noise in manual annotations, and even high-quality data sets likely contain incorrect labels. Therefore, training accurate neural networks in the presence of noisy manual annotations has become a task of crucial practical significance in deep learning.

Image understanding research has gradually developed from low-level feature extraction to high-level semantic learning. The next step is to reason about the semantic relationships between multiple objects, which could help many multimodal tasks such as visual question answering [8], image captioning [9], visual commonsense reasoning [10], human-centered activity recognition [11], and intention recognition [12]. Johnson et al. [13] proposed scene graphs, which provide a platform for reasoning about the visual scene. Given an image, the scene graph generation (SGG) task essentially parses a fully connected graph, treating pairwise interactions between nodes (objects) as edges. These interactions can be spatial, comparative, or action-based and are expressed as subject-relationship-object (SRO) triplets such as person-ride-horse (action), plate-on-table (spatial), and person1-taller-person2 (comparative). As Figure 1(b) shows, the interactions between objects in an image can generate a scene graph that captures the relationships among multiple objects, so SGG plays a vital role in the high-level understanding of images. Object entities are frequently semantically connected: Chao et al. [14] improved class-representative visual features as semantic embeddings to achieve better object recognition, and several methods have been proposed to learn semantic associations between the visual and semantic modules. Most of them [3, 15–17] mainly followed a pipelined approach, which detects visual relationships in two separate steps: first, the entities in an image are detected; next, the relations between entities are predicted by classification. However, with such pipelines, classification accuracy is affected by errors from the preliminary step. To address this, we detect SRO triplets in a manner that maintains visual similarity instead of relying on similarity-based relation retrieval, designing a visual module and a semantic module that learn the mapping from the visual feature space to the semantic embedding space.

In this work, we propose a symmetric fusion learning model for detecting visual relationships. Instead of modeling objects and relations as discrete labels, we can precisely detect visual relationships by matching the visual and semantic embedding spaces. We utilize fusion learning to design the structure of the visual module and introduce a semantic representation learning module combined with large-scale external knowledge. In addition, because inaccurate and insufficient labels are common noise in manual annotations, robustness to label noise matters: Luo et al. [18] employed an adaptive loss function to mitigate the effects of noise in their video semantic recognition task. Inspired by the symmetric cross-entropy learning loss [19], we propose a symmetric learning module that boosts cross-entropy symmetrically with reverse cross-entropy, performing reverse supervision for inaccurate annotations and better parsing scene graphs. We demonstrate that our model is highly competitive on the VisualGenome (VG) data set, which contains 108,249 images, each with an average of 35 objects, 26 attributes, and 21 pairwise relationships. Furthermore, we also evaluate our model on the Visual Relationship Detection (VRD) data set, showing that it can significantly improve visual relationship prediction in scene graphs.

The key contributions are summarized as follows:
(i) We build a symmetric fusion learning model, which can precisely detect visual relations by matching the visual and semantic embedding spaces.
(ii) We propose a symmetric learning module that boosts cross-entropy symmetrically to perform reverse supervision for inaccurate annotations and better parsing of scene graphs.
(iii) Experiments on two public data sets show that our model achieves encouraging performance and consistent improvements on various metrics by effectively handling the visual relationship detection problem.

2. Related Work

2.1. Visual Relationship Detection

Recently, many visual tasks have focused on visual relationship detection for better parsing a scene graph. Early work mostly focused on predicting specific types of predicates, such as predicting the spatial relationship of image objects [20] and detecting the human-interaction relationships [21, 22]. As a mid-level visual task, VRD benefits many high-level visual tasks, such as visual question answering [8], image captioning [9], and visual commonsense reasoning [10].

Early VRD methods used specific phrases to detect relationships; Lu et al. [3] first employed a "language prior" from semantic word embeddings to predict visual relationships. Zhuang et al. [15] applied feature representations to characterize the interaction pattern based on a context-aware interaction classifier. Like these methods, many other works detected objects and pairwise relationships separately [3, 15–17]. Unlike these approaches, we utilize a fusion learning manner that integrates the subject and object features to design the visual module, aiming to learn the mapping from the visual feature space to the semantic embedding space.

Context information learning is another attempt considered by researchers. Yu et al. [23] integrated a prior distribution obtained from external linguistic knowledge into the visual relationship prediction model. Liang et al. [24] proposed a deep neural network model with a structural ranking loss to model objects and predicates separately. Subsequently, Yin et al. [25] discussed feature interactions and message sharing, forming a spatiality-aware contextual feature learning model, Zoom-Net, to promote feature interaction.

The VRD methods mentioned above focus on detecting predicate relationships. More recently, researchers have considered all three components of each relationship triplet, detecting object pairs that contain specific predicates. Zhang et al. [26] embedded object pairs and predicates separately into independent semantic spaces for objects and relations. Zhan et al. [27] improved visual relationship detection by utilizing undetermined relationships. Furthermore, Zhan et al. [28] correlated object detection, significance detection, and predicate detection for better visual relationship prediction. Unlike these methods, we employ symmetric learning to adjust the representation of pairwise relationships and maintain stable scene parsing performance.

2.2. Scene Graph Parsing

SGG has attracted extensive attention over the past couple of years due to the significance of scene parsing in various computer vision tasks. Most context-based modeling methods form the scene graph by passing messages within a local subgraph structure, and several scene graph generation methods transfer messages between object pairs and predicates to capture contextual information. Xu et al. [29] proposed a model that passes messages containing contextual information within subgraphs. Li et al. [30] introduced a subgraph-based Factorizable Net that passes messages between object feature vectors and subgraph feature maps. Zellers et al. [31] represented the global context of objects and relationships with a recurrent sequential architecture (LSTM). Chen et al. [32] introduced prior knowledge of statistical correlations, represented by a knowledge graph, to propagate node messages. More recently, Chen et al. [33] employed generated missing labels to train scene graphs. Yang et al. [34] proposed probabilistic modeling to ease the semantic ambiguity of visual relationship prediction. Saha et al. [35] proposed a context-aware detection method to identify obscured regions of the scene, leading to better visual scene understanding.

Many studies have been proposed to solve the various problems in this task. We design a semantic module to better infer the semantic relationships between entities; it projects the word vectors of the triplet into an embedding space in which related words maintain higher semantic similarity to each other. Furthermore, because inaccurate and insufficient labels are common noise in manual annotations, we propose a symmetric learning module that alleviates the impact of noisy labeling through reverse supervision. It is straightforward to use, requires only minor intervention during training, and, more importantly, plays a vital role in tolerating label noise in manually annotated data sets.

3. Methodology

In this section, we first describe the visual module architecture of our model. Then, based on our visual network structure, we introduce semantic representation learning combined with large-scale external knowledge. Finally, we incorporate symmetric learning against noisy labels into our model for better parsing scene graphs. The brief training process of our model is shown in Figure 2.

3.1. Visual Network Structure

Object Detection: Given an image, we employ the Faster R-CNN [36] object detector to obtain proposals for each image, as in previous works. First, we utilize the region proposal network (RPN) to generate a set of object proposals. Each object in a pair is enclosed by a bounding box, from which we obtain its appearance feature. The appearance feature of the bounding box covers the object and its surrounding context, which is helpful for predicting relationships. Because the relationship between two objects often arises from the visual area where they interact, we extract features from the union region of each object pair for triplet fusion learning. We follow a process similar to Zhang et al. [37] to extract the feature for each proposal.
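To make the union-region step concrete, the sketch below computes the union bounding box of a subject-object proposal pair. This is a minimal illustration rather than the authors' code: it assumes boxes in [x1, y1, x2, y2] format, and the helper name `union_box` is ours.

```python
import torch

def union_box(subj_box: torch.Tensor, obj_box: torch.Tensor) -> torch.Tensor:
    """Smallest box [x1, y1, x2, y2] enclosing both the subject and object boxes.

    Features pooled from this union region cover the area where the two
    objects interact, which is what the relationship branch consumes.
    """
    x1 = torch.minimum(subj_box[0], obj_box[0])
    y1 = torch.minimum(subj_box[1], obj_box[1])
    x2 = torch.maximum(subj_box[2], obj_box[2])
    y2 = torch.maximum(subj_box[3], obj_box[3])
    return torch.stack([x1, y1, x2, y2])

# Example: a person box and a horse box produce one union region
person = torch.tensor([40., 20., 120., 200.])
horse = torch.tensor([80., 100., 260., 240.])
print(union_box(person, horse))  # tensor([ 40.,  20., 260., 240.])
```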

Fusion Learning: For each object region, the feature vectors of the subject, relationship, and object are extracted by the ROI (region of interest) pooling layer. These features are sent to fully connected layers, which extract and integrate visual information through a feature space transformation; we then obtain the implicit semantic embeddings of the subject, relationship, and object by mapping the original features to the hidden nodes. To jointly identify predicates, the visual feature for the relationship is formed by fusing the hidden features of the subject, relationship, and object, as shown in Figure 3. The fusion learning of the triplet is then carried out through the concatenation of the object features, and each proposal is fed into the fusion learning module. Finally, three visual embeddings for the triplet are output by considering the independent object features together with their fusion embedding.
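The sketch below shows one plausible PyTorch realization of this fusion scheme. The layer sizes, the ReLU activations, and the exact way the fused feature is combined with the predicate branch are illustrative assumptions, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class FusionLearning(nn.Module):
    """Maps subject/predicate/object ROI features to hidden embeddings,
    fuses them by concatenation, and outputs three visual embeddings."""

    def __init__(self, in_dim: int = 4096, hid_dim: int = 512, emb_dim: int = 300):
        super().__init__()
        self.fc_s = nn.Linear(in_dim, hid_dim)       # subject branch
        self.fc_p = nn.Linear(in_dim, hid_dim)       # predicate (union region) branch
        self.fc_o = nn.Linear(in_dim, hid_dim)       # object branch
        self.fuse = nn.Linear(3 * hid_dim, hid_dim)  # fuse concatenated hidden features
        self.out_s = nn.Linear(hid_dim, emb_dim)
        self.out_p = nn.Linear(2 * hid_dim, emb_dim)  # assumed: predicate uses its own + fused features
        self.out_o = nn.Linear(hid_dim, emb_dim)

    def forward(self, f_s, f_p, f_o):
        h_s = torch.relu(self.fc_s(f_s))
        h_p = torch.relu(self.fc_p(f_p))
        h_o = torch.relu(self.fc_o(f_o))
        h_fused = torch.relu(self.fuse(torch.cat([h_s, h_p, h_o], dim=-1)))
        x_s = self.out_s(h_s)
        x_p = self.out_p(torch.cat([h_p, h_fused], dim=-1))
        x_o = self.out_o(h_o)
        return x_s, x_p, x_o  # visual embeddings for the SRO triplet

# One proposal pair with 4096-d ROI-pooled features (same tensor reused for shape demo)
f = torch.randn(1, 4096)
x_s, x_p, x_o = FusionLearning()(f, f, f)
```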

3.2. Semantic Modeling

Because relationships are semantically correlated with one another, we can infer the correct triplet from similar relationships that occur more frequently. Our approach represents visual relationships by grouping similar language expressions. The semantic module projects the word vectors of the triplet into an embedding space in which related words maintain higher semantic similarity to each other. We first introduce the process that maps the word vectors into the embedding space; we then describe the training process that pushes related relationships closer together in the embedding space.

Suitable word vectors for the object and relationship labels are essential for fine-tuning. We obtain semantic knowledge through mining large-scale, publicly available text data and adopt pretrained fastText word vectors trained on Common Crawl [38] for this purpose. Unlike word vector models that ignore the morphological features inside words, the fastText model utilizes a bag of character n-grams to capture word-internal information. First, we employ the pretrained fastText vectors [38] to project the objects and predicates into a word embedding space.

Specifically, the initial word vectors of the relationships and objects are obtained by standardizing the raw pretrained fastText vectors of all predicate and object classes, respectively.

Next, the word vectors of the subject, relationship, and object labels are fed into an FC layer, as shown in Figure 3, which outputs three intermediate hidden embeddings. The approach aims to generate a word embedding space that projects similar relationships closer to one another than the initial fastText word vector space does. Finally, we obtain the three word embeddings of the SRO triplet by passing the hidden embeddings through a second FC layer.
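A minimal sketch of this two-FC-layer semantic branch is given below. Random vectors stand in for the pretrained fastText word vectors, and the hidden and output dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SemanticModule(nn.Module):
    """Two stacked FC layers that refine word vectors so that semantically
    related SRO labels land closer together in the embedding space."""

    def __init__(self, word_dim: int = 300, hid_dim: int = 512, emb_dim: int = 300):
        super().__init__()
        self.fc1 = nn.Linear(word_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, emb_dim)

    def forward(self, w_s, w_p, w_o):
        # Intermediate hidden embeddings for subject, predicate, and object labels
        hidden = [torch.relu(self.fc1(w)) for w in (w_s, w_p, w_o)]
        # Final word embeddings of the SRO triplet
        return tuple(self.fc2(h) for h in hidden)

# Placeholder 300-d vectors standing in for pretrained fastText word vectors
w_person, w_ride, w_horse = (torch.randn(1, 300) for _ in range(3))
e_s, e_p, e_o = SemanticModule()(w_person, w_ride, w_horse)
```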

3.3. Symmetric Learning Module

During training, we obtain the visual embedding and the semantic embedding of a triplet from the above two modules. Here, we minimize the loss by matching the visual and semantic embeddings with our designed symmetric learning module; the predicted distribution $p$ is the output of matching the visual and semantic embeddings, as shown in Figure 4. We employ the cross-entropy (CE) for matching the embeddings of triplets, as cross-entropy is the most commonly used loss for training deep neural networks. Consider an $M$-class visual relationship data set $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ of $N$ samples, where $\mathbf{x}_i$ represents a triplet sample in the multidimensional input space and $y_i \in \{1, \dots, M\}$ is its ground-truth label from the manual annotations. The probability that the model assigns to each class for a triplet is $p(m \mid \mathbf{x}) = e^{z_m} / \sum_{j=1}^{M} e^{z_j}$, where $z_m$ represents the logits. With $q(m \mid \mathbf{x})$ denoting the ground-truth distribution for the data sets, the CE loss for a triplet is

$$\ell_{ce} = -\sum_{m=1}^{M} q(m \mid \mathbf{x}) \log p(m \mid \mathbf{x}).$$

For the two distributions in this study, $p(m \mid \mathbf{x})$ is the distribution learned from the training data, and $q(m \mid \mathbf{x})$ denotes the ground-truth distribution for the data sets. The Kullback–Leibler divergence (denoted as $KL$) can be used to calculate the difference between these two distributions:

$$KL(q \,\|\, p) = \sum_{m=1}^{M} q(m \mid \mathbf{x}) \log \frac{q(m \mid \mathbf{x})}{p(m \mid \mathbf{x})} = H(q, p) - H(q),$$

where $H(q)$ is the entropy of the ground-truth distribution for the data sets and $H(q, p)$ is the cross-entropy of $q$ and $p$. In order to make our training model closer to the real distribution, we minimize $KL(q \,\|\, p)$.

Various works have confirmed the weakness of the cross-entropy used for deep neural network learning [19]. When there are noisy labels in the data set, it may cause inadequate learning and ambiguity. A single $q(m \mid \mathbf{x})$ cannot accurately represent the true distribution; instead, the predicted $p(m \mid \mathbf{x})$ can partly reflect the true distribution. Consequently, apart from treating $q(m \mid \mathbf{x})$ as the ground truth, we also need to combine the reverse direction $KL(p \,\|\, q)$ to help the model fit better. Here, we take the relative entropy of this reverse fitting to obtain a logically symmetric counterpart and extend it to the reverse cross-entropy $\ell_{rce}$:

$$\ell_{rce} = -\sum_{m=1}^{M} p(m \mid \mathbf{x}) \log q(m \mid \mathbf{x}).$$

We introduce the reverse cross-entropy, which boosts cross-entropy symmetrically, into our loss function, thus performing reverse supervision for inaccurate annotations. Formally, the final loss function for the symmetric learning module is

$$\ell_{sl} = \alpha\, \ell_{ce} + \beta\, \ell_{rce},$$

where $\alpha$ and $\beta$ are hyperparameters: $\alpha$ is adopted to alleviate the overfitting issue of the standard cross-entropy $\ell_{ce}$, and $\beta$ mitigates label noise through the robust adjustment of $\ell_{rce}$.

In addition, the two terms are applied when fine-tuning different modules: in the visual modeling stage, we only use $\ell_{ce}$ for learning to extract the regions of interest, while in the symmetric learning stage, both $\ell_{ce}$ and $\ell_{rce}$ are utilized for matching the visual and semantic embeddings.
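As a hedged illustration, the snippet below implements a symmetric loss of the form $\ell_{sl} = \alpha \ell_{ce} + \beta \ell_{rce}$ in PyTorch, in the spirit of the symmetric cross-entropy formulation of [19]; the clamp value used to avoid $\log 0$ on one-hot targets and the class count are assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_loss(logits, targets, alpha=0.1, beta=1.0, clamp_min=1e-4):
    """alpha * CE(q, p) + beta * RCE(p, q) with one-hot q.

    RCE involves log q, which is -inf for the zero entries of a one-hot
    target, so q is clamped to a small constant before taking the log."""
    num_classes = logits.size(-1)
    p = F.softmax(logits, dim=-1)
    q = F.one_hot(targets, num_classes).float()

    ce = F.cross_entropy(logits, targets)                               # forward term
    rce = -(p * torch.log(q.clamp(min=clamp_min))).sum(dim=-1).mean()   # reverse term
    return alpha * ce + beta * rce

logits = torch.randn(8, 50, requires_grad=True)   # 8 triplets, 50 predicate classes (illustrative)
labels = torch.randint(0, 50, (8,))
loss = symmetric_loss(logits, labels)
loss.backward()
```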

4. Experiments and Results

In this section, we explore the performance of relationship detection and the effectiveness of noise mitigation. We first introduce the experimental settings, including data sets, evaluation metrics, and implementation details. Comparison results between our model and the baseline methods are presented in Section 4.2.

4.1. Experimental Settings
4.1.1. Data Sets

We conduct experiments on two public data sets: VisualGenome (VG) [7] and Visual Relationship Detection (VRD) data sets [3].

VG: in our experiments, we use the pruned version of VG [29] that only contains 150 object categories and 50 predicates. We follow the same train/test splits as in Xu et al. [29].

VRD: the VRD data set we used consists of 5,000 images with 100 object categories and 70 predicate categories. We use the same train/test split as in previous work [3].

4.1.2. Evaluation Metrics

VG: Following Zellers et al. [31], we adopt three evaluation modes: scene graph generation (SGGen), scene graph classification (SGCls), and predicate classification (PredCls). SGGen requires predicting the subject/object boxes and all labels. SGCls predicts all labels given the ground-truth subject and object boxes. PredCls predicts the predicate labels given the ground-truth subject and object boxes and labels. We use Recall@n (R@20, R@50, and R@100) as the evaluation metric, following previous works. Recall (R@N) is defined as the fraction of true relationships among the top-N most confident relation predictions in an image.
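For clarity, the sketch below computes Recall@N for a single image under a simplified, label-only matching criterion; the standard protocol additionally requires the predicted boxes to overlap the ground-truth boxes (typically IoU at or above 0.5), which is omitted here, and the function name is ours.

```python
def recall_at_n(pred_triplets, gt_triplets, n=50):
    """Fraction of ground-truth (subject, predicate, object) triplets that
    appear among the top-n most confident predictions for one image."""
    top_n = set(pred_triplets[:n])  # predictions assumed already sorted by confidence
    hits = sum(1 for gt in gt_triplets if gt in top_n)
    return hits / max(len(gt_triplets), 1)

preds = [("person", "ride", "horse"), ("plate", "on", "table"), ("person", "wear", "hat")]
gts = [("person", "ride", "horse"), ("person", "near", "horse")]
print(recall_at_n(preds, gts, n=50))  # 0.5
```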

VRD: Following Zhang et al. [26], we apply an object detector pretrained on the COCO data set. We follow previous works [23] in using Recall@n (R@50, R@100) as the evaluation metrics, reporting R@50 and R@100 for relationship and phrase detection at k = 1, 10, and 70 predicates per entity pair.

4.1.3. Implementation Details

In our experiments, to ensure compatibility with the structures of previous works, we utilize VGG-16 as the backbone on the VG and VRD data sets. For our symmetric learning module, a relatively large $\alpha$ helps convergence on difficult data sets, but a large $\alpha$ tends to cause overfitting, while a small $\alpha$ can ease the overfitting of the single CE. The reverse cross-entropy term is noise tolerant, but convergence becomes slow when $\beta$ is too large. We therefore use a relatively small $\alpha$ to avoid overfitting and a large $\beta$ against noisy labels; the parameters $\alpha$ and $\beta$ are set to 0.1 and 1, respectively.

Our model is optimized using SGD with momentum, and the base learning rate is set to . Moreover, since the labels of the subject and object play an important role in predicting visual relationships, we employ the empirical distribution over relations between object pairs to aid scene graph generation, as in previous work.
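Purely for illustration, a hedged sketch of this optimization setup is shown below; the learning-rate and momentum values are placeholders (the exact base learning rate is not reproduced here), and the stand-in model is not the actual detection network.

```python
import torch

# Hypothetical configuration mirroring the settings described above.
config = {
    "backbone": "VGG-16",
    "alpha": 0.1,      # weight of the forward cross-entropy term
    "beta": 1.0,       # weight of the reverse cross-entropy term
    "base_lr": 1e-3,   # placeholder value; the paper's exact base learning rate is not shown here
    "momentum": 0.9,   # assumed SGD momentum
}

model = torch.nn.Linear(512, 50)  # stand-in for the full detection model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=config["base_lr"],
                            momentum=config["momentum"])
```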

4.2. Experimental Results
4.2.1. Compared Results

In this section, we compare our proposed method with previous state-of-the-art models. We conducted experiments on the two data sets (VG and VRD) and compared the performance with previous works. Tables 1 and 2 show the results of the different baseline models together with our framework on the two data sets.

VG: We compare our method with eight state-of-the-art (SoTA) methods on the VG data set: IMP [29], frequency [31], frequency + overlap [31], MotifNet-LeftRight [31], graph R-CNN [39], VCTREE-SL [40], RelDN [37], and VCTREE + TranstextNet [41]. Table 1 presents the performance of our method and the other SoTA methods. As shown in Table 1, our method achieves encouraging R@n scores on various metrics on the VG data set. On SGGen, our method performs the best; compared with the current baseline method VCTREE + TranstextNet [41], it is 3% higher at R@100. On PredCls, our method outperforms the other methods at R@20, R@50, and R@100. On SGCls, our method outperforms the best baseline by 0.5% at R@50 and is only 0.5% lower at R@100. Although our method does not make outstanding progress on the SGCls and PredCls tasks, the improvement on SGGen is clear compared with the other tasks, as we keep improving the relationship prediction capability.

VRD: Table 2 presents comparisons on VRD with eight state-of-the-art methods: VRD [3], KL distillation [23], Zoom-Net [25], CAI + SCA-M [25], RelDN [37], AVR [42], GPS-Net [43], and SABRA [44]. For a fair comparison on VRD, we adopt the VGG-16 backbone pretrained on ImageNet used by these baselines to train our model. As shown in Table 2, our method consistently achieves superior performance on the two metrics, reaching encouraging R@100 (k = 70) scores of 35.21% on relationship detection and 43.07% on phrase detection. These improvements again verify the ability of our framework and the necessity of symmetric learning for visual relationship detection. Moreover, they show that our framework can be applied to data sets of different scales as well as to more complex situations.

4.2.2. Ablation Study

We conduct an ablation study analyzing the contributions of two key components: the fusion learning structure in the visual module (denoted FL) and the symmetric learning module (denoted SL). The baseline indicates the prediction model of Figure 3 without symmetric fusion learning; that is, we take images and word vectors as input to the visual and semantic modules without fusing the hidden features, and we minimize the loss by matching the visual and semantic embeddings using only cross-entropy. We validate the performance of the FL and SL components on the two harder tasks: phrase detection and relationship detection.

The R@n scores (R@50 and R@100) of phrase detection and relationship detection on the VRD data set are chosen as the evaluation metrics. The results of the ablation experiments are summarized in Table 3, where baseline denotes the baseline model and baseline + FL + SL denotes our model with all proposed components. From the rows in Table 3, we can see that performance improves consistently when all the components are utilized together.

To further evaluate the robustness of our model against label noise, experimental noisy labels are generated by randomly flipping 20% of the training labels to one of the other class labels. In Table 4, baseline (20% noisy labels) is the baseline model trained on the noisy labels; rows 2 and 4 demonstrate the effectiveness of our model against label noise. We can see that the baseline model with FL and SL can partially alleviate the impact of noisy labels.
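The noise-injection protocol can be sketched as follows; the seed and class count are illustrative, and the function name is ours.

```python
import random

def corrupt_labels(labels, num_classes, noise_rate=0.2, seed=0):
    """Flip `noise_rate` of the labels to a uniformly chosen *different* class,
    mimicking the 20% symmetric label noise used in the robustness experiment."""
    rng = random.Random(seed)
    noisy = list(labels)
    flip_idx = rng.sample(range(len(noisy)), k=int(noise_rate * len(noisy)))
    for i in flip_idx:
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy

clean = [3, 7, 7, 1, 0, 4, 2, 9, 5, 6]
print(corrupt_labels(clean, num_classes=10))  # two of the ten labels are flipped
```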

4.2.3. Qualitative Results

Figure 4 presents a qualitative analysis of our model on VRD. We provide qualitative statistics and visualizations on the VRD data set to better show the performance improvement of our model.

To verify the effectiveness of the key components of our model, we visualize the extraction results of the two models (baseline and baseline + FL + SL) on test examples in Figure 4. The baseline indicates the prediction model without symmetric fusion learning. The comparison with the ground-truth triplets shows that our proposed model can properly detect the correct triplets, which demonstrates the effectiveness of the fusion learning structure and the symmetric learning module.

Compared with the baseline, our proposed model utilizing the FL and SL components detects more correct relationships, for example, tree-to-building, grass-beneath-hydrant, and train-under-sky. Our model can also correct some mistaken predicates; for example, grass-on-hydrant is revised to grass-beneath-hydrant, which makes the parsed scene graph more precise. It can also be observed that our model produces a more complete detection of visual relationships; some triplets that are not in the ground truth are also precisely detected.

5. Conclusion

In this work, we introduce a symmetric fusion learning model for detecting visual relationships and parsing scenes. The visual module is designed by integrating the subject and object representations. We can precisely detect visual relationships by matching the visual and semantic embedding space. Moreover, the model can also alleviate the impact of noise with the symmetric learning module. Comprehensive experimental results on VRD and VG data sets show the effectiveness of our proposal.

Data Availability

All data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

X.L. and X.J. conceptualized the study. X.L. and Z.Z. contributed to data curation. X.L. and X.J. contributed to methodology. X.L., Z.Z., W.D., X.D., and Q.Z. contributed to software. X.L. contributed to formal analysis. X.L., W.D., X.D., and Q.Z. contributed to funding acquisition. X.L. wrote the original draft of the manuscript. X.L. and X.J. reviewed and edited the manuscript.