Abstract

Generation of scene graphs and natural language captions from images for deep image understanding is an ongoing research problem. Scene graphs and natural language captions have a common characteristic in that they are generated by considering the objects in the images and the relationships between the objects. This study proposes a deep neural network model named the Context-based Captioning and Scene Graph Generation Network (C2SGNet), which simultaneously generates scene graphs and natural language captions from images. The proposed model generates results through communication of context information between these two tasks. For effective communication of context information, the two tasks are structured into three layers: the object detection, relationship detection, and caption generation layers. Each layer receives related context information from the lower layer. In this study, the proposed model was experimentally assessed using the Visual Genome benchmark data set. The performance improvement effect of the context information was verified through various experiments. Further, the high performance of the proposed model was confirmed through performance comparison with existing models.

1. Introduction

Image understanding is one of the core elements of computer vision and has been extensively researched. Traditional subjects of image understanding research include image classification, object detection, and semantic segmentation [1]. Previous studies have focused on superficial information such as the identification of objects included in images and their locations. However, such results are insufficient for expressing image content and have therefore been used as basic modules for solving complex image understanding problems such as visual question answering (VQA) [2–4] and referring expression comprehension [5]. Recently, there has been increased research interest in deeper understanding of images, in contrast to traditional studies on image understanding. Thus, efforts have been made to achieve more specific high-level image expressions by obtaining image captions, scene graphs, etc. [6–10].

Figure 1 shows examples of captions and scene graphs generated for an input image. Here, “A woman is riding a horse” is part of an image caption, and <woman riding horse> is part of a scene graph. The image captions expressed in natural language sentences and the formal knowledge expressed in scene graphs for the same image scene have complementary characteristics. Image captions expressed in natural sentences have the advantage of being in the form most easily understood by humans; however, they have high complexity as a learning problem because they include linguistic elements such as grammar in addition to the core elements of the scene. Meanwhile, simple sentence knowledge in triple form, consisting of a subject, predicate, and object, as in a scene graph, requires transformation into another form for real-world application, for example, into natural language sentences. However, compared to natural language sentences, which may be vague, scene graphs can clearly express the relationships among the objects that are the core elements of image scenes. Further, the scene graph approach has lower difficulty as a learning problem, because there is no need to consider complex grammatical structures. Moreover, the knowledge graphs acquired from images can easily be combined with numerous existing background and prior knowledge datasets and thus have potential in a wider range of application areas [11–13]. Image captions and scene graphs have a common characteristic in that they are generated with consideration of the objects in the images and the relationships between those objects. In view of this characteristic, Li et al. [14] attempted to generate image captions and scene graphs simultaneously from images, which had the effect of solving two different problems in a complementary manner.

For high-level image understanding, the present study proposes the Context-based Captioning and Scene Graph Generation Network (C2SGNet), which is a deep neural network model that simultaneously generates natural language captions and scene graphs from input images. The proposed model consists of three layers in total: the object detection, relationship detection, and caption generation layers. Each layer attempts to predict an accurate scene graph or image caption using the context information of the lower layer. Furthermore, the dependence relations between layers are trained through delivery of features via the context information. To analyze the performance of the proposed model, various experiments are conducted using the Visual Genome benchmark dataset [15].

In general, the scene graph generation process consists of an object detection step, which detects the objects in the image, and a relationship detection step, which detects the relationships between the objects. In many studies, objects in images have been found using object detectors based on convolutional neural networks (CNNs), with the relationships between the objects then being predicted by extracting various features of each object pair. To generate accurate scene graphs, an object detector with high prediction accuracy is required; among the available object detectors, the high-accuracy Faster R-CNN [16] is mainly used [17–19]. In some studies [20–22], attempts have been made to increase the object classification performance by adopting unconventional methods: the object regions were predicted using only the region proposal network (RPN) of Faster R-CNN, with object classification then being performed using various features. In the present study, the proposed model is based on Faster R-CNN for more accurate object detection, with the region prediction and object classification processes being trained separately.

In previous studies on scene graph generation, effective features were obtained from the detected objects and used for relationship detection. For example, Lu et al. [23] employed language features that consider the visual features of the objects and the semantic similarity of the object and relation words. Further, Dai et al. [17] and Liao et al. [24] used spatial features (the location information of individual objects) to more accurately identify locational relations. Zhang et al. [25] and Newell et al. [26] converted different region features into comparable features in an embedding space, with predictions then being made based on the distances between features. However, in all the above works, a sequential pipeline was essential to predict the relationships between the objects based on the detection results for individual objects. This sequential design has the limitation that the relationship detection performance depends heavily on the performance of the preceding object detection step.

Recently, attempts have been made to perform object detection and relationship detection in a complementary manner [14, 20, 22]. In those studies, a neural network model was designed that allows the context information acquired in both processes to be shared through a message-passing system, so that the individual object detection process and the relationship detection process for object pairs are performed simultaneously. These complementary models have stabilized the otherwise unstable performance of individual object detection and also helped improve the relationship detection performance. Furthermore, Li et al. [14] proposed a deep neural network model that can simultaneously generate scene graphs and image captions, by newly adding caption generation to this approach and expanding the message-passing system beyond the existing object detection and relationship detection processes. However, in the techniques presented in the above studies, subcontext features related to the prediction target are obtained, and the simple sum or average of these features is then used as the context information. These methods are therefore limited in that it is impossible to tell which related element each subcontext feature refers to.

Furthermore, in studies on general image caption generation, the visual features of images have been extracted with a CNN and appropriate sentences for those visual features have then been generated using a recurrent neural network (RNN) [27]. Among RNN methods, long short-term memory (LSTM), which readily processes sequential data, has mainly been used. Previously, Johnson et al. [8, 9, 28] investigated the problem of caption generation for various partial regions of images, in contrast to the existing image captioning problem that treats the image as a whole. This technique involves prediction of candidate regions for captioning from the image, as well as caption generation for each region. However, it is difficult to focus on object information using the technique presented in [28], because the features of the object elements included in the caption candidate regions are not used; only the overall visual features are used.

The present paper proposes the C2SGNet model, which generates both natural language captions and scene graphs in a complementary manner for improved accuracy. For this purpose, the model separates the object detection, relationship detection, and caption generation processes and predicts the results using the context information corresponding to each process. For detection of interobject relationships, the context information for the corresponding two objects is used. For caption generation, the context information for objects in the caption regions and their relationships is used. Furthermore, to overcome the limitations of previous studies, an effective context information extraction method for related elements of the prediction object is proposed.

3. Image Captioning and Scene Graph Generation Model

3.1. Model Outline

This study proposes C2SGNet, a deep neural network model that simultaneously generates natural language captions and scene graphs for input images using context information.

Figure 2 shows the overall framework of the proposed model. C2SGNet is largely composed of three layers: the object detection, relationship detection, and caption generation layers. The overall process consists of four steps: candidate region proposal, region feature extraction, context feature extraction, and scene graph generation and captioning. First, in the candidate region proposal step, the candidate regions needed by each layer are generated from a visual feature map extracted from the input image by VGG-16. In the region feature extraction step, the features specific to each layer are extracted from the candidate regions. In the context feature extraction step, the context information to be used in an upper layer is extracted from the layer below it through the relationship context network (RCN) and the caption context network (CCN). Finally, in the scene graph generation and captioning step, scene graphs composed of triples in <subject predicate object> form and natural language captions are generated by combining the region features of each layer with the context features obtained from the lower layer. Whereas the natural language captions of the input image are obtained directly from the top caption generation layer, the scene graphs are obtained by combining the results of the two lower layers, i.e., the object detection and relationship detection layers.

3.2. Candidate Region Proposal

As shown in Figure 2(a), C2SGNet generates candidate regions for object detection, relationship detection, and caption generation, which are used for scene graph generation and captioning. Hereafter, an object region denotes the candidate region of an individual object, a relationship region denotes a candidate region that encloses two object regions having a relationship, and a caption region denotes a candidate region for caption generation. In this paper, we define a region R as shown in (1): R = (x, y, w, h), where (x, y) is the center coordinate of the region and w and h are its width and height.

In C2SGNet, the individual object regions and caption regions are generated by two separate region proposal networks (RPNs), one for object regions and one for caption regions. These two RPNs have the same network structure but are trained differently in accordance with their roles. The proposal process for these two regions is illustrated in Figure 3. To generate the regions, the visual feature map of the input image I is first extracted through VGG-16, a CNN, as shown in (2). The extracted feature map is then used as the input feature of both RPNs, as shown in (3) and (4). Sharing the image features reduces the model size and increases the prediction speed.

Each RPN predicts the bounding box of a region and a region score through a convolution layer. The region score consists of two probabilities that sum to 1: the first indicates that the region is correct, and the second that it is incorrect. Therefore, only the first probability is used to judge whether the bounding box contains a correct target. For the object RPN, the target is an object: if a bounding box closely matches a true object region, its correctness score will be high. Similarly, for the caption RPN, the target is a region from which a proper caption can be generated. The region scores of the two RPNs are illustrated as the object score and caption score in Figure 3. If the correctness score of a region is greater than a predefined threshold, the region is selected as a candidate region.
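To make this selection step concrete, the following PyTorch-style sketch filters RPN proposals by their correctness probability. The function name, the 0.5 threshold, and the assumption that the second score column holds the correctness probability are illustrative choices, not details taken from the paper.

import torch

def select_candidate_regions(box_preds, score_logits, threshold=0.5):
    # box_preds:    (N, 4) bounding boxes predicted by an RPN.
    # score_logits: (N, 2) raw two-way scores; softmax yields two probabilities
    #               that sum to 1, and only the "correct region" probability is
    #               compared with the threshold (the column order is assumed here).
    probs = torch.softmax(score_logits, dim=1)
    keep = probs[:, 1] > threshold
    return box_preds[keep], probs[keep, 1]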

A relationship region encloses a subject and an object. Therefore, as shown in Figure 4, relationship regions are generated as combinations of predicted object regions. Object pairs are first formed from the object regions predicted through the RPN. Then, the minimum-sized rectangle enclosing both object regions is defined as the relationship region of the corresponding object pair. This is expressed as follows:
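The referenced equation defines the relationship region as the smallest rectangle enclosing the two object regions. A minimal sketch of the same computation is given below; it assumes boxes given by corner coordinates (x1, y1, x2, y2) rather than the center/width/height form of (1), so a conversion would be needed in practice.

import torch

def relationship_region(subj_box, obj_box):
    # Minimum-sized rectangle enclosing a subject box and an object box,
    # both given as (..., 4) tensors of (x1, y1, x2, y2) corner coordinates.
    x1 = torch.minimum(subj_box[..., 0], obj_box[..., 0])
    y1 = torch.minimum(subj_box[..., 1], obj_box[..., 1])
    x2 = torch.maximum(subj_box[..., 2], obj_box[..., 2])
    y2 = torch.maximum(subj_box[..., 3], obj_box[..., 3])
    return torch.stack([x1, y1, x2, y2], dim=-1)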

3.3. Region Feature Extraction

For the generated candidate regions, C2SGNet extracts the unique region features for object detection, relationship detection, and caption generation, respectively, in each layer, as shown in Figure 2(b). This process is detailed in Figure 5.

Each layer performs region of interest (RoI) pooling on its candidate regions, which are of various sizes, to obtain visual features of a fixed size. Here, all three layers use the preextracted visual feature map of the image as the feature map for pooling. The region features appropriate for each layer are then obtained through two fully connected layers. The object and caption region features are extracted as follows:

In the case of the relationship detection layer, the region features are obtained by inputting the bounding boxes of the two objects comprising the relationship region, together with the visual features, to the fully connected layer, as shown in (11). This aids identification of spatial relationships (e.g., relationships indicated by an “in” or “on” preposition) through the locational relationship between the two objects. The region features of each layer obtained through this process are referred to as the object features, relationship features, and caption features.
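The sketch below illustrates both kinds of region feature heads described above: RoI pooling followed by two fully connected layers for the object and caption regions, and a head that additionally takes the two object bounding boxes for the relationship region. All dimensions (pooled size, feature widths, spatial scale) and the use of concatenation to combine boxes with visual features are assumptions for illustration.

import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class RegionFeatureHead(nn.Module):
    # RoI pooling over the shared visual feature map, then two fully connected
    # layers, as used for the object and caption regions.
    def __init__(self, in_channels=512, pooled_size=7, feat_dim=4096):
        super().__init__()
        self.pooled_size = pooled_size
        self.fc = nn.Sequential(
            nn.Linear(in_channels * pooled_size * pooled_size, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, feature_map, boxes, spatial_scale=1.0 / 16):
        # boxes: list of (N_i, 4) tensors in (x1, y1, x2, y2) image coordinates;
        # spatial_scale 1/16 assumes a VGG-16 conv feature map.
        pooled = roi_pool(feature_map, boxes, output_size=self.pooled_size,
                          spatial_scale=spatial_scale)
        return self.fc(pooled.flatten(start_dim=1))

class RelationshipFeatureHead(nn.Module):
    # Combines the pooled (flattened) visual feature of a relationship region
    # with the bounding boxes of its subject and object (4 coordinates each).
    def __init__(self, visual_dim=512 * 7 * 7, feat_dim=4096):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(visual_dim + 8, feat_dim), nn.ReLU())

    def forward(self, rel_visual_feat, subj_boxes, obj_boxes):
        # Box coordinates would normally be normalised (e.g., by the image size)
        # before concatenation; omitted here for brevity.
        return self.fc(torch.cat([rel_visual_feat, subj_boxes, obj_boxes], dim=1))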

3.4. Context Feature Extraction

The region features of each layer include only the visual features of each candidate region and do not include structural information about the scene of the corresponding image. Therefore, besides the visual features of their candidate regions, the relationship detection and caption generation layers additionally require context information about the core elements composing each relationship or scene and the structure in which these elements are combined. For example, the relationship detection layer requires additional context information on the two objects comprising a relationship in order to effectively determine the relationships between object pairs. Likewise, the caption generation layer requires additional context information on the relationships between the object pairs to be described in the natural language caption. To provide this information, in this study, the context features required by the relationship detection and caption generation layers are extracted through the RCN and CCN, respectively, as shown in Figure 2(c).

Figure 6 shows the RCN and CCN, which are the networks for extracting the relationship context and caption context features, respectively. To overcome the above-mentioned limitations of existing techniques, a context feature is generated by combining the region feature of the candidate region with the region features of its two core components, placed before and after it. To extract the relationship context feature, the RCN uses the region features of the three components in <subject, relationship, object> form. The relationship context feature is expressed as

To extract the caption context feature, the CCN uses the region features and context feature of the three components in <relationship, caption, relationship_context> form. As one caption region can include multiple relationship regions, the proposed model selects the relationship region having the highest intersection over union (IoU) with the caption region and uses it for extraction of the caption context feature. The caption context feature is expressed as

To effectively combine the three input features, the RCN and CCN are composed of two bidirectional LSTM (Bi-LSTM) layers. Bi-LSTM is a bidirectional RNN, as apparent from the following relations:

The aim here is to extract context features by sufficiently considering the combination sequence of the relationship and caption components. In this study, for effective expression of context information, the hidden state corresponding to the middle input, which depends on the input features on both sides, is selected as the context feature (see (15)).
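A minimal sketch of such a context network is given below: a two-layer Bi-LSTM reads a three-element feature sequence, and the output at the middle position, which depends on both neighbours, is returned as the context feature. The dimensions are assumptions, and in practice the three inputs would need to be projected to a common size (e.g., the CCN receives a relationship context feature whose width may differ from the region features).

import torch
import torch.nn as nn

class ContextNetwork(nn.Module):
    # Two Bi-LSTM layers over a length-3 feature sequence; this stands in for
    # both the RCN (<subject, relationship, object>) and the CCN
    # (<relationship, caption, relationship_context>).
    def __init__(self, feat_dim=4096, hidden_dim=2048):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim,
                              num_layers=2, bidirectional=True, batch_first=True)

    def forward(self, first, middle, last):
        seq = torch.stack([first, middle, last], dim=1)   # (B, 3, feat_dim)
        out, _ = self.bilstm(seq)                         # (B, 3, 2 * hidden_dim)
        return out[:, 1, :]   # output at the middle position is the context feature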

3.5. Scene Graph Generation

The scene graph of an image is expressed as a set of triples in <subject predicate object> form, each consisting of a subject, an object, and the relationship between them. To generate this scene graph, the object detection layer detects the objects and the relationship detection layer predicts the relationships between object pairs.

Figure 7 shows the scene graph generation network and process, which correspond to Figure 2(d). The two layers input the features extracted in the previous process into a fully connected layer and predict the probability distribution over the object or relationship classes. The object detection layer uses only the object feature as the input feature, whereas the relationship detection layer uses both the relationship feature and the relationship context feature. The predicted probability distribution represents the individual probability of each predefined type, and the type having the highest probability is selected as the prediction result for the corresponding region. This process is expressed in the following equations, which yield the predicted object type and relationship type, respectively. In the case of the object region, not only the probability distribution but also a bounding-box delta value is predicted, which is used to refine the predicted object region more accurately.
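The two prediction heads can be sketched as follows. The class counts assume the 150 object types and 50 relationship types mentioned in Section 4.1 plus one background class each; the background classes, the feature widths, and the use of concatenation to combine the relationship feature with its context feature are assumptions.

import torch
import torch.nn as nn

class ObjectHead(nn.Module):
    # Predicts an object-class distribution and a bounding-box delta from the
    # object feature.
    def __init__(self, feat_dim=4096, num_classes=151):  # 150 types + background (assumed)
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.bbox_delta = nn.Linear(feat_dim, 4)

    def forward(self, obj_feat):
        return self.cls(obj_feat), self.bbox_delta(obj_feat)

class RelationshipHead(nn.Module):
    # Predicts a relationship-class distribution from the relationship feature
    # combined with the relationship context feature.
    def __init__(self, feat_dim=4096, ctx_dim=4096, num_predicates=51):
        super().__init__()
        self.cls = nn.Linear(feat_dim + ctx_dim, num_predicates)

    def forward(self, rel_feat, rel_ctx):
        return self.cls(torch.cat([rel_feat, rel_ctx], dim=1))

# The predicted type is the argmax of the class scores, e.g.:
#   obj_logits, deltas = object_head(obj_feat); obj_type = obj_logits.argmax(dim=1)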

Finally, to assemble the scene graph, the detected objects are expressed as nodes and the relationship between two detected objects is represented by an edge. Each edge is directed from the subject node to the object node.

3.6. Region Captioning

To create captions for partial regions of images, C2SGNet uses LSTM, which is an RNN method. The specific caption generation process is illustrated in Figure 8.

First, to generate a caption with the LSTM, the hidden state of the LSTM is initialized using the caption feature and caption context feature extracted in the previous stage. Then, a word feature is obtained by inputting the <start> word token to the LSTM. This word feature is passed to a fully connected layer to predict the word probability distribution. The word having the highest probability is selected as the newly generated word and is fed back into the LSTM through the embedding layer, as shown in (18) and (19). This process is repeated until the LSTM generates the <end> word token, at which point the region caption is complete.
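The decoding loop can be sketched as follows for a batch size of one. The LSTM cell, embedding layer, output layer, token ids, and maximum length are placeholders; only the overall greedy <start>-to-<end> loop follows the description above.

import torch
import torch.nn as nn

def generate_caption(lstm_cell, embed, word_fc, init_state, start_id, end_id, max_len=20):
    # init_state: (h, c) hidden state initialised from the caption feature and
    # caption context feature of the region.
    h, c = init_state
    word = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        h, c = lstm_cell(embed(word), (h, c))   # feed the previous word back in
        next_word = word_fc(h).argmax(dim=1)    # most probable next word
        if next_word.item() == end_id:
            break
        caption.append(next_word.item())
        word = next_word
    return caption

# Example wiring (all sizes assumed):
#   lstm_cell = nn.LSTMCell(input_size=300, hidden_size=512)
#   embed     = nn.Embedding(vocab_size, 300)
#   word_fc   = nn.Linear(512, vocab_size)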

4. Performance Evaluation

4.1. Dataset

In this study, the Visual Genome benchmark dataset was used for the experiment to evaluate the performance of the proposed C2SGNet. The Visual Genome dataset has definitions of various objects and relationships for each image it contains, and includes natural language captions for partial image regions. For appropriate model training, a partial set of the Visual Genome dataset proposed by Li et al. [14] was used in this experiment. For this dataset, 150 object types and 50 relationship types with high frequencies in the Visual Genome dataset were selected and very small object regions were removed. Among the images in the acquired dataset, 70,998 were used as training data and 25,000 were taken as test data.

4.2. Model Training

Before the experiment, C2SGNet was implemented using PyTorch, a Python deep learning library, in the Ubuntu 16.04 LTS environment. Model training and experiments were performed on hardware equipped with a GeForce GTX 1080 Ti GPU.

In the approach presented in this study, the model is trained in two steps for greater efficacy. In the first step, only the two RPNs used for proposal of the object and caption regions are trained in advance. In the second step, the total network, including the two pretrained RPNs, is trained. First, to train the two RPNs, the smooth L1 loss function and the cross-entropy loss function of Faster R-CNN [16] are used as follows: Consequently, the loss functions of the two RPNs, which are trained independently, both have the following structure:

In (24), the first pair of terms indicates the region score and bounding box value predicted for a region, while the second pair indicates the ground-truth region label and bounding box value from the dataset. A region is regarded as positive if the IoU between the predicted region and a ground-truth region is higher than 0.7, and as negative if this value is lower than 0.3; all other regions are excluded from training.
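A simplified version of this RPN loss is sketched below; anchor sampling and the loss-balancing weights of the original Faster R-CNN formulation are omitted, and the function assumes that regions outside the IoU thresholds have already been filtered out.

import torch
import torch.nn.functional as F

def rpn_loss(pred_scores, pred_boxes, gt_labels, gt_boxes):
    # pred_scores: (N, 2) two-way region scores; pred_boxes: (N, 4) box values.
    # gt_labels:   (N,) 1 for positive regions (IoU > 0.7) and 0 for negative
    #              regions (IoU < 0.3); other regions are excluded beforehand.
    cls_loss = F.cross_entropy(pred_scores, gt_labels)
    pos = gt_labels == 1
    # Box regression is applied only to positive regions.
    if pos.any():
        reg_loss = F.smooth_l1_loss(pred_boxes[pos], gt_boxes[pos])
    else:
        reg_loss = pred_boxes.sum() * 0.0
    return cls_loss + reg_loss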

In the second step, each layer of C2SGNet has its own loss function. For the object detection layer, the loss function consists of the smooth L1 loss for the bounding-box delta value and the cross-entropy loss for the object classification result, as shown in (25). The relationship detection layer uses a cross-entropy loss for its classification result, as shown in (26). The loss function of the caption generation layer is the sum of the cross-entropy losses over the generated words, as shown in (27).

The total loss function of C2SGNet, which is defined as the sum of the loss values of the two RPNs and the loss value of each layer, is expressed as follows:
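Put together, the per-layer losses and the overall objective can be sketched as below; the unweighted sum mirrors the description in the text, and any weighting coefficients used in practice are not specified here.

import torch.nn.functional as F

def object_layer_loss(cls_logits, gt_classes, bbox_deltas, gt_deltas):
    # Cross entropy for the classification result plus smooth L1 for the
    # bounding-box delta, as in (25).
    return F.cross_entropy(cls_logits, gt_classes) + F.smooth_l1_loss(bbox_deltas, gt_deltas)

def relationship_layer_loss(rel_logits, gt_relations):
    # Cross entropy for the relationship classification result, as in (26).
    return F.cross_entropy(rel_logits, gt_relations)

def caption_layer_loss(word_logits, gt_words):
    # Sum of per-word cross-entropy losses over the generated sequence, as in (27).
    return F.cross_entropy(word_logits, gt_words, reduction='sum')

def total_loss(loss_rpn_obj, loss_rpn_cap, loss_obj, loss_rel, loss_cap):
    # Sum of the two RPN losses and the three layer losses.
    return loss_rpn_obj + loss_rpn_cap + loss_obj + loss_rel + loss_cap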

In the experiment conducted in this study, the Adam optimization algorithm was used to minimize the above loss function. The initial learning rate was set to 0.01 and the learning rate decay method was used, which multiplies the existing learning rate by 0.1 whenever one epoch finishes.
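The optimization setup translates directly into the following training-loop sketch; the model, data loader, loss function, and epoch count are passed in as placeholders rather than taken from the paper.

import torch

def train(model, train_loader, compute_total_loss, num_epochs):
    # Adam with an initial learning rate of 0.01; the rate is multiplied by 0.1
    # at the end of every epoch via a step scheduler.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    for _ in range(num_epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_total_loss(model, batch)
            loss.backward()
            optimizer.step()
        scheduler.step()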

4.3. Metric

To evaluate the scene graph generation and caption generation performance of the proposed model, the SGGen and Meteor mAP metrics were used [14, 28]. SGGen measures the recall of the triples comprising the scene graph; that is, it measures how many of the ground-truth triples of a given image are found among the predicted ones. In SGGen, a predicted triple is counted as positive if its object pair and relationship match the ground truth and the two detected object regions each have an IoU of 0.5 or higher with the corresponding ground-truth object regions. Furthermore, Meteor mAP, an extension of the single-caption Meteor metric [29], evaluates the multiple captions generated for one image. To calculate Meteor mAP, only the captions with a Meteor score above a certain threshold are counted as positive; Meteor mAP then represents the mean proportion of positive captions among the generated captions.
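For illustration, a simplified SGGen-style recall computation is sketched below. The triple representation, greedy matching, and top-k truncation are assumptions that capture the described criterion (matching labels plus an IoU of at least 0.5 for both object boxes) rather than the exact evaluation protocol.

def box_iou(box_a, box_b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def sggen_recall(pred_triples, gt_triples, k=50, iou_thr=0.5):
    # Each triple: (subj_label, predicate_label, obj_label, subj_box, obj_box).
    # A prediction matches a ground-truth triple if all three labels agree and
    # both boxes overlap the ground-truth boxes with IoU >= iou_thr.
    matched = set()
    for ps, pp, po, psb, pob in pred_triples[:k]:
        for i, (gs, gp, go, gsb, gob) in enumerate(gt_triples):
            if i in matched:
                continue
            if (ps, pp, po) == (gs, gp, go) and \
               box_iou(psb, gsb) >= iou_thr and box_iou(pob, gob) >= iou_thr:
                matched.add(i)
                break
    return len(matched) / max(len(gt_triples), 1)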

4.4. Experiments

The first experiment performed in this study analyzed the effects of the RCN and CCN, i.e., the proposed context information extraction networks, on the scene graph generation and captioning performance using the SGGen and Meteor mAP scales. For scene graph generation in particular, SGGen values for the top 50 (R@50) and 100 (R@100) results were measured. Table 1 lists the evaluation results depending on usage of the RCN and/or CCN. The baseline was the C2SGNet model structure excluding both the RCN and CCN. As apparent from the experiment results listed in Table 1, the model employing both the RCN and CCN exhibited the highest performance for both scene graph generation and captioning. Furthermore, higher performance was achieved for the cases using context information through application of the RCN or CCN compared to the baseline. As regards comparison of the RCN and CCN, the RCN yielded better performance improvement than the CCN for scene graph generation, but the opposite was observed for caption generation. This seems to be because the RCN delivers the context information to the relationship detection process, which is a core element of the scene graph procedures, whereas the CCN delivers it to the caption generation process.

The second experiment compared the performances of the proposed C2SGNet model and the existing state-of-the-art models. As apparent from the experiment results listed in Table 2, the proposed C2SGNet model exhibited better performance compared to the existing models for both scene graph generation and captioning. This experiment result confirms the excellence of the proposed C2SGNet model, which can effectively employ context information.

Figure 9 presents the qualitative evaluation results of the C2SGNet model. The left column shows the input images and the objects detected by the model. The results in the right column consist of the scene graphs and image captions generated by the model. The results for Figures 9(a) and 9(b) are examples of appropriate scene graph and image caption generation for the given image, whereas the result for Figure 9(c) is an example of an inappropriate outcome. Examination of the results for Figures 9(a) and 9(b) confirms that the proposed model correctly detected even detailed object regions. Furthermore, the generated scene graphs show that the relationships between objects were also predicted properly. The region captions show that various expressions were generated depending on the image complexity. In the case of Figure 9(c), although various partial objects in the image were found, the person, who was the key element of the image, was not detected. As a result, the main triple representing the image could not be generated. This reveals the limitation that the scene graph generation is greatly dependent on the object detection performance. Furthermore, examination of the scene graph shows that inappropriate triples such as <pant wearing hat> were created. This suggests that the model was trained to predict relationships with large emphasis on the spatial relationships between object pairs.

5. Conclusion

In this paper, we proposed a method for solving high-level image understanding problems using existing low-level image understanding models. This method can be used to solve problems that demand high-level image understanding, such as referring expression comprehension, image retrieval, and visual question answering. Specifically, this paper presented the C2SGNet deep neural network model, which can simultaneously generate scene graphs and natural language captions from input images for high-level image understanding. This model uses features related to each task as context information, based on the characteristic that scene graphs and natural language captions are both generated from objects and the relationships between objects. For effective prediction and model training, the two tasks are structured into three layers: the object detection, relationship detection, and caption generation layers. The results are predicted through four steps: candidate region proposal, region feature extraction, context feature extraction, and scene graph generation and captioning. In particular, Bi-LSTM, a bidirectional RNN, is used to effectively extract the context features.

In this study, to evaluate the performance of the proposed model, experiments were conducted using the Visual Genome dataset. The experimental results confirmed that context information is helpful for performance improvement. Furthermore, a performance comparison with existing models confirmed the high performance of C2SGNet. However, the proposed model has the limitation that each layer obtains context information only from the layer below it, not from the layer above. Future research will aim to overcome this limitation and make full use of the available context information.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Institute for Information & Communications Technology Promotion (IITP), grant funded by the Korean Government (MSIT) [Grant no. 2018-0-00677], “Development of Robot Hand Manipulation Intelligence to Learn Methods and Procedures for Handling Various Objects with Tactile Robot Hands.” Also, this work was supported by Kyonggi University’s Graduate Research Assistantship 2018.