Abstract

Most cross-modal retrieval methods based on subspace learning focus only on learning the projection matrices that map different modalities into a common subspace and pay little attention to task specificity and class information. To address these two limitations and to make full use of unlabelled data, we propose a novel semi-supervised method for cross-modal retrieval named modal-related retrieval based on discriminative comapping (MRRDC). Task-specific projection matrices are learned to map multimodal data into a common subspace for different retrieval tasks. During projection matrix learning, a linear discriminant constraint is introduced to preserve the original class information in each modal space. An iterative optimization algorithm based on label propagation is presented to solve the proposed joint learning formulation. Experimental results on several datasets demonstrate the superiority of our method over state-of-the-art subspace methods.

1. Introduction

In real applications, data are often represented in different ways or obtained from various domains. As a consequence, data with the same semantics may exist in different modalities or exhibit heterogeneous properties. With the rapid growth of multimodal data, there is an urgent need for effectively analyzing the data obtained from different modalities [15]. Although multimodal analysis has attracted much attention, the most common strategy is simply to ensemble the multimodal data to improve performance [6–9]. Cross-modal retrieval is an efficient way to retrieve data of one modality with a query from another. A typical example is to take an image as a query to retrieve related texts (I2T) or to search for images using a textual description (T2I). Figure 1 shows the detailed process for the I2T and T2I tasks. The results obtained by cross-modal retrieval are more comprehensive than those of traditional single-modality retrieval.

Generally, the semantic gap and the relevance measure between modalities impede the development of cross-modal retrieval. Although many approaches have been proposed to solve this problem, their performance is still not satisfactory. Therefore, several methods [10–16] learn a common subspace by minimizing pairwise differences to make different modalities comparable. However, task specificity and class information are often ignored, which limits retrieval performance.

To solve these problems mentioned above, this paper proposes a novel semi-supervised joint learning framework for cross-modal retrieval by integrating the common subspace learning, task-related learning, and class discriminative learning. Firstly, inspired by canonical correlation analysis (CCA) [7] and linear least squares, a couple of projection matrices are learnt by coupled linear regression to map original multimodal data to the common subspace. At the same time, linear discriminant analysis (LDA) and task-related learning (TRL) are used to keep the data structure in different modalities and the semantic relationship in the projection space. Furthermore, to mine the category information of unlabelled data, a semi-supervised strategy is utilized to propagate the semantic information from labelled data to unlabelled data. Experimental results on three public datasets show that the proposed method outperforms the previous state-of-the-art subspace approaches.

The main contributions of this paper can be summarized as follows:
(1) The proposed joint formulation seamlessly combines semi-supervised learning, task-related learning, and linear discriminative analysis into a unified framework for cross-modal retrieval.
(2) The class information of labelled data is propagated to unlabelled data, and a linear discriminative constraint is introduced to preserve the interclass and intraclass similarity among different modalities.

The remainder of the paper is organized as follows. In Section 2, we briefly overview the related work on the cross-modal retrieval problem. The details of the proposed methodology and the iterative optimization method are introduced in Section 3. Section 4 reports the experimental results and analysis. Conclusions are finally given in Section 5.

2. Related Work

Because cross-modal retrieval plays an important role in various applications, many subspace-based methods have been proposed that establish intermodal and intramodal correlations. Rasiwasia et al. [7] investigated the retrieval performance of various combinations of image features and textual representations, covering all possibilities in terms of their two guiding hypotheses. Partial least squares (PLS) [17] has also been used for the cross-modal matching problem. Sharma and Jacobs [18] used PLS to linearly map images from different views into a common linear subspace in which the images are highly correlated. Chen et al. [19] addressed cross-modal document retrieval by using PLS to transform image features into the text space, so that the similarity between the two modalities can be measured directly. In [20, 21], the bilinear model and generalized multiview analysis (GMA) were proposed and perform well in cross-modal retrieval.

In addition to CCA, PLS, and GMA, Mahadevan et al. [22] proposed a manifold learning algorithm that simultaneously reduces the dimensionality of data from different modalities. Mao et al. [23] introduced a cross-media retrieval method named parallel field alignment retrieval, which integrates a manifold alignment framework from the perspective of vector fields. Lin and Tang [24] proposed a common discriminant feature extraction (CDFE) method, which learns transforms to a common feature space while taking both intraclass and interclass scatter into account. Sharma et al. [21] extended LDA and marginal Fisher analysis (MFA) from the single-modality setting to multiple modalities, yielding generalized multiview LDA (GMLDA) and generalized multiview MFA (GMMFA). Inspired by semantic information, Gong et al. [25] proposed a three-view CCA to explore the correlation between features and their corresponding semantics in different modalities.

Furthermore, other methods, such as dictionary learning, graph-based learning, and multiview embedding, have been proposed for the cross-modal problem [26–29]. Zhuang et al. [30] proposed SliM2, which adds a group sparse representation to pairwise relation learning to project different modalities into a common space. Xu et al. [31] combined dictionary learning and feature learning to learn the projection matrices adaptively. Deng et al. [32] proposed a discriminative dictionary learning method with common label alignment by learning the coefficients of different modalities. Wei et al. [33] proposed a modality-dependent method named MDCR, which learns separate projections for different retrieval tasks. Wu et al. [34] utilized spectral regression and a graph model to jointly learn a minimum-error regression and a latent space. Wang et al. [35] proposed an adversarial learning framework that learns modality-invariant and discriminative representations; in this framework, a modality classifier and a feature projector compete with each other to obtain a better pair of feature representations. Cao et al. [36] used multiview embedding to obtain latent representations for visual object recognition and cross-modal retrieval. Zhang et al. [37] utilized a graph model to learn a common space for cross-modal retrieval by incorporating intraclass and interclass relationships into the projection process.

The main purpose of these methods is to learn correlations so that cross-modal distances can be measured, but class information and task specificity are not well exploited. Therefore, addressing both problems simultaneously for different tasks is particularly important. Based on this idea, we learn two couples of projections for different retrieval tasks and apply a linear discriminative constraint to the projection matrices. To achieve this goal, we combine task-related learning with linear discriminative analysis through semi-supervised label propagation. Figure 2 shows the flowchart of our method. Experimental results on three open cross-modal datasets demonstrate that our method outperforms the latest methods.

3. Methodology

To improve the retrieval performance, we introduce the discriminative comapping and pay more attention to different retrieval tasks and class information preservation. Here, we focus on the I2T and T2I retrieval tasks, and it is easy to extend our method to retrieval between other modalities.

3.1. The Objective Function

Define the image data as $X = [X_l, X_u] \in \mathbb{R}^{p \times n}$ and the text data as $Y = [Y_l, Y_u] \in \mathbb{R}^{q \times n}$ separately, where $X_l$ and $Y_l$ denote the labelled images and their texts with $p$ and $q$ dimensions, and $X_u$ and $Y_u$ represent the unlabelled images and their texts with the same dimensions. Let $O = \{(x_i, y_i)\}_{i=1}^{n}$ be the $n$ pairs of image and text documents, where $O_l$ and $O_u$ denote the labelled and unlabelled documents, respectively. $S \in \mathbb{R}^{c \times n}$ is the semantic matrix, where $c$ is the number of categories, $S_l$ is the one-hot label matrix of the labelled data, and $S_u$ is the pseudo-label matrix of the unlabelled data. The goal of our method is to learn two couples of projection matrices that project data from different modalities into a common space for the two tasks. Then, cross-modal retrieval can be performed in the common space.
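As a concrete illustration of this data layout, the following Python sketch builds the column-wise aligned matrices $X$, $Y$, and $S$; the sizes, the random features, and the uniform initialization of the pseudo-labels are our own placeholder choices, not values prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: p-dim image features, q-dim text features,
# n_l labelled pairs, n_u unlabelled pairs, c categories.
p, q, n_l, n_u, c = 128, 10, 80, 40, 5
n = n_l + n_u

X = rng.standard_normal((p, n))        # image feature matrix, columns are samples
Y = rng.standard_normal((q, n))        # text feature matrix, aligned column-wise with X

labels = rng.integers(0, c, size=n_l)  # class index of each labelled pair
S = np.zeros((c, n))
S[labels, np.arange(n_l)] = 1.0        # S_l: one-hot labels for the labelled columns
S[:, n_l:] = 1.0 / c                   # S_u: uninformative initial pseudo-labels
```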

We propose a novel modal-related projection strategy based on semi-supervised learning for task specificity. Here, the pairwise closeness of the multimodal data and the semantic projection are combined into a unified formulation. For I2T and T2I, the minimization problems are formulated as follows:

$\min_{U_1, V_1} \left\| U_1^{T}X - V_1^{T}Y \right\|_F^2 + \lambda \left\| U_1^{T}X - S \right\|_F^2$,  (1)

$\min_{U_2, V_2} \left\| U_2^{T}X - V_2^{T}Y \right\|_F^2 + \lambda \left\| V_2^{T}Y - S \right\|_F^2$,  (2)

where $U_1, U_2 \in \mathbb{R}^{p \times c}$ and $V_1, V_2 \in \mathbb{R}^{q \times c}$ stand for the projection matrices of modalities $X$ and $Y$, respectively.

A linear discriminant constraint is introduced into equations (1) and (2) to preserve the class information in the latent projection subspace. We denote $m_k$ as the mean of the labelled samples in the $k$th class and $m$ as the mean of all labelled samples. The intraclass scatter matrix can be defined as $S_w = \sum_{k=1}^{c} \sum_{x_i \in C_k} (x_i - m_k)(x_i - m_k)^{T}$, and the total scatter matrix can be represented as $S_t = \sum_{i=1}^{n_l} (x_i - m)(x_i - m)^{T}$, where $C_k$ is the set of labelled samples in the $k$th class and $n_l$ is the number of labelled samples. The objective function of LDA is represented as follows:

$\max_{W} \dfrac{\operatorname{tr}\left(W^{T} S_t W\right)}{\operatorname{tr}\left(W^{T} S_w W\right)}$,  (3)

where $W \in \mathbb{R}^{p \times d}$ is the projection matrix and $d$ is the dimension of the basis vectors.

According to equation (3), the linear discriminant constraint can be transformed into minimizing $\operatorname{tr}\left(W^{T}(S_w - S_b)W\right)$, where $S_b$ is $S_t - S_w$. The intraclass scatter of $X$ is represented as $S_w^{x}$, and the interclass scatter of $X$ is $S_b^{x}$. Under the multimodal condition, our method utilizes LDA projections to preserve the class information of each modality. The corresponding formula is as follows:

$\Omega(U, V) = \operatorname{tr}\left(U^{T}\left(S_w^{x} - S_b^{x}\right)U\right) + \operatorname{tr}\left(V^{T}\left(S_w^{y} - S_b^{y}\right)V\right)$,  (4)

where $S_w^{y}$ and $S_b^{y}$ denote the intraclass and interclass scatter matrices of $Y$, respectively.
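The scatter matrices used above can be computed per modality as in the following sketch, assuming samples are stored column-wise and labels are integer class indices; the helper name is ours.

```python
import numpy as np

def scatter_matrices(Z_l, labels, num_classes):
    """Intraclass (S_w), total (S_t), and interclass (S_b = S_t - S_w)
    scatter matrices of the labelled samples Z_l (features x samples)."""
    labels = np.asarray(labels)
    m = Z_l.mean(axis=1, keepdims=True)            # mean of all labelled samples
    S_w = np.zeros((Z_l.shape[0], Z_l.shape[0]))
    for k in range(num_classes):
        Z_k = Z_l[:, labels == k]
        if Z_k.shape[1] == 0:
            continue
        m_k = Z_k.mean(axis=1, keepdims=True)      # class mean
        D_k = Z_k - m_k
        S_w += D_k @ D_k.T
    D = Z_l - m
    S_t = D @ D.T
    S_b = S_t - S_w
    return S_w, S_t, S_b
```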

We add equation (4) to equations (1) and (2), respectively, and then obtain the objective functions of I2T and T2I as follows:

$\min_{U_1, V_1} \left\| U_1^{T}X - V_1^{T}Y \right\|_F^2 + \lambda \left\| U_1^{T}X - S \right\|_F^2 + \lambda_1 \operatorname{tr}\left(U_1^{T}\left(S_w^{x} - S_b^{x}\right)U_1\right) + \lambda_2 \operatorname{tr}\left(V_1^{T}\left(S_w^{y} - S_b^{y}\right)V_1\right)$,  (5)

$\min_{U_2, V_2} \left\| U_2^{T}X - V_2^{T}Y \right\|_F^2 + \lambda \left\| V_2^{T}Y - S \right\|_F^2 + \lambda_1 \operatorname{tr}\left(U_2^{T}\left(S_w^{x} - S_b^{x}\right)U_2\right) + \lambda_2 \operatorname{tr}\left(V_2^{T}\left(S_w^{y} - S_b^{y}\right)V_2\right)$,  (6)

where $\lambda$ is a tradeoff coefficient to balance the pairwise information and the semantic information, and $\lambda_1$ and $\lambda_2$ are regularization parameters to balance the structure information of the image and text modalities. According to equations (1) and (2), the structure projections of $X$ and $Y$ are the same as the semantic projections. Consequently, our method can bridge the feature space and the semantic space, which decreases the projection loss and improves the performance of cross-modal retrieval.
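The following Python sketch evaluates the I2T objective in equation (5) for given projections; the function and variable names are illustrative, and the scatter matrices are assumed to be precomputed as above.

```python
import numpy as np

def i2t_objective(U1, V1, X, Y, S, Sw_x, Sb_x, Sw_y, Sb_y, lam, lam1, lam2):
    """Value of the I2T objective: pairwise term + semantic term + LDA terms."""
    pairwise = np.linalg.norm(U1.T @ X - V1.T @ Y, "fro") ** 2
    semantic = np.linalg.norm(U1.T @ X - S, "fro") ** 2
    lda_img  = np.trace(U1.T @ (Sw_x - Sb_x) @ U1)
    lda_txt  = np.trace(V1.T @ (Sw_y - Sb_y) @ V1)
    return pairwise + lam * semantic + lam1 * lda_img + lam2 * lda_txt
```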

We further introduce a semi-supervised learning strategy. To propagate the label information from the labelled data, we utilize the radial basis function (RBF) kernel to evaluate the pairwise similarities between the labelled and unlabelled data after projection, and the similarities are then regarded as the label information to be updated in the optimization process until the results converge. For any data $x_i$ and $x_j$, the kernel function is defined as follows:

$k(x_i, x_j) = \exp\left(-\dfrac{\left\| x_i - x_j \right\|^2}{2\sigma^2}\right)$,  (7)

where $\sigma$ is the kernel parameter.
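As one possible instantiation of this strategy (the exact weighting scheme is our assumption), the sketch below computes RBF similarities between the projected labelled and unlabelled data and uses the similarity-weighted average of the one-hot labels as soft pseudo-labels.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF similarities between the columns of A and the columns of B."""
    sq = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def propagate_labels(P_l, P_u, S_l, sigma):
    """Soft pseudo-labels for the projected unlabelled data P_u (dims x n_u),
    taken as the similarity-weighted average of the one-hot labels S_l of the
    projected labelled data P_l (dims x n_l)."""
    K = rbf_kernel(P_l, P_u, sigma)                         # (n_l, n_u) similarities
    W = K / np.maximum(K.sum(axis=0, keepdims=True), 1e-12) # normalize per unlabelled sample
    return S_l @ W                                          # (c, n_u) pseudo-labels
```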

3.2. Algorithm Optimization

The objective functions in equations (5) and (6) are nonconvex, so an iterative method is used to update each variable while the other variables are fixed alternately.

Denote the objectives of equations (5) and (6) by $L_1$ and $L_2$, respectively. The partial derivatives of equation (5) with respect to $U_1$ and $V_1$ are represented as follows:

$\dfrac{\partial L_1}{\partial U_1} = 2X\left(X^{T}U_1 - Y^{T}V_1\right) + 2\lambda X\left(X^{T}U_1 - S^{T}\right) + 2\lambda_1\left(S_w^{x} - S_b^{x}\right)U_1$,  (8)

$\dfrac{\partial L_1}{\partial V_1} = -2Y\left(X^{T}U_1 - Y^{T}V_1\right) + 2\lambda_2\left(S_w^{y} - S_b^{y}\right)V_1$.  (9)

Similarly, the partial derivatives of equation (6) with respect to $U_2$ and $V_2$ are given as follows:

$\dfrac{\partial L_2}{\partial U_2} = 2X\left(X^{T}U_2 - Y^{T}V_2\right) + 2\lambda_1\left(S_w^{x} - S_b^{x}\right)U_2$,  (10)

$\dfrac{\partial L_2}{\partial V_2} = -2Y\left(X^{T}U_2 - Y^{T}V_2\right) + 2\lambda Y\left(Y^{T}V_2 - S^{T}\right) + 2\lambda_2\left(S_w^{y} - S_b^{y}\right)V_2$.  (11)

According to equations (8)–(11), our method can be solved by gradient descent. Algorithm 1 describes the optimization of cross-modal learning. After the projection matrices for the I2T and T2I tasks are obtained, $X$ and $Y$ can be mapped into the common space, where cross-modal retrieval is achieved.

Input: all image feature matrices $X$, all text feature matrices $Y$, and the corresponding semantic matrix $S$.
Initial: initialize $U_1$, $V_1$, $U_2$, and $V_2$, and set the parameters $\lambda$, $\lambda_1$, $\lambda_2$, $\sigma$ and the maximum iteration number; $\eta$ is the step size in the alternating updating process, and $\varepsilon$ is the convergence condition.
Repeat:
  Repeat:
    Update $U_1 \leftarrow U_1 - \eta\,\partial L_1 / \partial U_1$ and $V_1 \leftarrow V_1 - \eta\,\partial L_1 / \partial V_1$ with equations (8) and (9)
  Until the change of equation (5) is smaller than $\varepsilon$
  Repeat:
    Update $U_2 \leftarrow U_2 - \eta\,\partial L_2 / \partial U_2$ and $V_2 \leftarrow V_2 - \eta\,\partial L_2 / \partial V_2$ with equations (10) and (11)
  Until the change of equation (6) is smaller than $\varepsilon$
  Update the pseudo-labels $S_u$ of the unlabelled data with the RBF kernel in equation (7)
Until the maximum iteration number is reached
Output: $U_1$, $V_1$, $U_2$, and $V_2$
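To make the alternating updates concrete, the following Python sketch implements the inner I2T loop of Algorithm 1 with plain gradient descent, using the gradients in equations (8) and (9); the default parameter values and the initialization are placeholders, not the settings used in the experiments. The T2I pair $(U_2, V_2)$ is updated in the same way with equations (10) and (11), and the pseudo-labels are refreshed between the inner loops.

```python
import numpy as np

def optimize_i2t(X, Y, S, Sw_x, Sb_x, Sw_y, Sb_y,
                 lam=1.0, lam1=0.1, lam2=0.1, eta=1e-4,
                 eps=1e-4, max_iter=200, seed=0):
    """Gradient-descent sketch of the I2T inner loop of Algorithm 1."""
    rng = np.random.default_rng(seed)
    c = S.shape[0]
    U1 = rng.standard_normal((X.shape[0], c)) * 0.01   # placeholder initialization
    V1 = rng.standard_normal((Y.shape[0], c)) * 0.01
    prev = np.inf
    for _ in range(max_iter):
        R = X.T @ U1 - Y.T @ V1                        # pairwise residual (n x c)
        gU = 2 * X @ R + 2 * lam * X @ (X.T @ U1 - S.T) + 2 * lam1 * (Sw_x - Sb_x) @ U1
        gV = -2 * Y @ R + 2 * lam2 * (Sw_y - Sb_y) @ V1
        U1 -= eta * gU                                 # update with equation (8)
        V1 -= eta * gV                                 # update with equation (9)
        obj = (np.linalg.norm(X.T @ U1 - Y.T @ V1, "fro") ** 2
               + lam * np.linalg.norm(X.T @ U1 - S.T, "fro") ** 2
               + lam1 * np.trace(U1.T @ (Sw_x - Sb_x) @ U1)
               + lam2 * np.trace(V1.T @ (Sw_y - Sb_y) @ V1))
        if abs(prev - obj) < eps:                      # convergence condition
            break
        prev = obj
    return U1, V1
```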

4. Experiments

To evaluate the performance of the proposed method (MRRDC), we conduct comparison experiments with several state-of-the-art methods on three public datasets.

4.1. Datasets
4.1.1. Wikipedia Dataset

This dataset consists of 2,866 image-text pairs labelled with one of 10 semantic classes. In this dataset, 2,173 pairs are selected as the training set, and the rest form the testing set. In our experiments, we use the public version [7] provided by Rasiwasia et al. (wiki-R), where images are represented by 128-dimensional SIFT descriptor histograms [38] and texts are represented by 10-dimensional features derived from a latent Dirichlet allocation topic model [39]. We also use the version provided by Wei et al. (wiki-W) [40], where 4,096-dimensional CNN features [41] are used to represent the images and 100-dimensional latent Dirichlet allocation features are used to represent the texts.

4.1.2. Pascal Sentence Dataset [40]

This dataset consists of 1,000 image-text pairs with 20 categories. We randomly choose 30 pairs from each category as training samples and the rest as test samples. The image features are 4,096-dimensional CNN features, and the text features are 100-dimensional LDA features.

4.1.3. INRIA-Websearch [42]

This dataset contains 71,478 pairs of image and text annotations from 353 classes. We remove some pairs which are marked as irrelevant and select the pairs that belong to any one of the 100 largest categories. Then, we get a subset of 14,698 pairs for evaluation. We randomly select 70% of pairs from each category as the training set (10,332 pairs), and the rest are treated as the testing set (4,366 pairs). Similarly, images are represented with 4,096-dimensional CNN features, and the textual tags are represented with 100-dimensional LDA features.

4.2. Evaluation Metrics

To evaluate the performance of the proposed method, two typical cross-modal retrieval tasks are conducted: I2T and T2I. In the test phase, the learned projection matrices are used to map the multimodal data into the common subspace, where data of one modality can be retrieved with a query from the other. In all experiments, the cosine distance is adopted to measure feature similarities. Given a query, the aim of each cross-modal task is to return the top-k nearest neighbors from the other modality.
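For illustration, a minimal sketch of the I2T test phase (projection into the common space followed by cosine-similarity ranking) is given below; the function name and the column-wise data layout are our assumptions.

```python
import numpy as np

def retrieve_i2t(U1, V1, X_query, Y_gallery, k=10):
    """Project image queries and the text gallery into the common space
    (columns are samples) and rank the texts by cosine similarity."""
    q = U1.T @ X_query                                 # (c, n_q) projected queries
    g = V1.T @ Y_gallery                               # (c, n_g) projected gallery
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-12)
    g = g / (np.linalg.norm(g, axis=0, keepdims=True) + 1e-12)
    sims = q.T @ g                                     # cosine similarities (n_q, n_g)
    topk = np.argsort(-sims, axis=1)[:, :k]            # indices of the top-k texts per query
    return topk, sims
```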

The performance of the algorithms is evaluated by mean average precision (mAP), which is one of the standard information retrieval metrics. To obtain mAP, the average precision (AP) of a query is calculated by

$\mathrm{AP} = \dfrac{1}{R}\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k)$,

where $R$ is the number of relevant items in the test dataset, $P(k)$ is the precision of the top $k$ retrieved items, and $\mathrm{rel}(k) = 1$ if the item at rank $k$ is relevant and $\mathrm{rel}(k) = 0$ otherwise. Then, the value of mAP is obtained by averaging AP over all queries. The larger the mAP, the better the retrieval performance. Besides mAP, precision-recall curves and the per-class mAP are used to evaluate the effectiveness of the different methods.
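A minimal sketch of the AP and mAP computation (assuming, as in the datasets above, that ground-truth class labels define relevance) is as follows; the helper names are ours.

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP of one ranked list: mean of precision@k over the ranks of relevant items."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(sims, query_labels, gallery_labels):
    """mAP over all queries given an (n_q, n_g) similarity matrix."""
    order = np.argsort(-sims, axis=1)                  # rank gallery items per query
    aps = [average_precision(np.asarray(gallery_labels)[order[i]], query_labels[i])
           for i in range(sims.shape[0])]
    return float(np.mean(aps))
```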

4.3. Comparison Methods

To verify that our method has good performance, we compare it with ten state-of-the-art methods: PLS [18], CCA [7], SM [7], SCM [7], GMLDA [21], GMMFA [21], MDCR [33], JLSLR [34], ACMR [35], and SGRCR [37].

PLS, CCA, SM, and SCM are typical methods that utilize pairwise information to learn a common latent subspace, in which the similarity between different modalities can be measured directly. These approaches make paired multimodal data closer in the learned common subspace. GMLDA, GMMFA, and MDCR exploit semantic category information via supervised learning; owing to the use of label information, these methods can learn a more discriminative subspace.

4.4. Experimental Setup

The parameters of the proposed MRRDC in Algorithm 1 for the I2T and T2I retrieval tasks, namely, the tradeoff coefficient $\lambda$, the regularization parameters $\lambda_1$ and $\lambda_2$, the kernel parameter $\sigma$, the convergence condition $\varepsilon$, and the maximum iteration number, share one setting on the Wikipedia dataset provided by Rasiwasia et al. and on INRIA-Websearch. On the Wikipedia dataset provided by Wei et al. and on Pascal Sentence, only part of these parameters differ, and the rest are the same as above. The step size (learning rate) $\eta$ is fixed in all experiments.

4.5. Results and Analysis

Table 1 shows the mAP scores achieved by PLS, CCA, SM, SCM, GMMFA, GMLDA, MDCR, and our method on wiki-R, wiki-W, Pascal Sentence, and INRIA-Websearch. We observe that our method outperforms its counterparts. This may be because the projection matrices preserve more discriminative class information via semi-supervised learning, and the common subspace is made more discriminative and effective by exploiting the class semantics of intramodality and intermodality similarity simultaneously. From Table 1, we also find that, in most cases, GMMFA, GMLDA, MDCR, and MRRDC perform better than PLS, CCA, SM, and SCM, and that CNN image features are superior to the shallow features. The first observation arises because PLS, CCA, SM, and SCM use only pairwise information, whereas the other approaches add class information to their objective functions, which provides better separation between categories in the latent common subspace. The second observation is due to the powerful semantic representation ability of CNN features.

The precision-recall curves on wiki-R, wiki-W, Pascal Sentence, and INRIA-Websearch are plotted in Figure 3. Figure 4 shows the mAP scores of the comparison approaches and our method, and the rightmost bar of each subfigure shows the average mAP score. For most categories, the mAP of our method outperforms that of the comparison methods. From these experimental results, we can draw the following conclusions:
(1) Compared with the current state-of-the-art methods, our method improves the average mAP considerably. It consistently outperforms the compared methods because MRRDC learns projection matrices in a task-related and linearly discriminative way for different modalities, so that each modality preserves its semantic and original class information. Besides, both the labelled and unlabelled data of all modalities are exploited, and the label information is propagated to the unlabelled data during training.
(2) In most cases, GMLDA and GMMFA outperform CCA since they add category information to their formulations, which makes the common projection subspace more suitable for cross-modal retrieval.
(3) Compared with shallow features, CNN features have a clear advantage for the I2T task, because CNN features capture the semantic information of the original images more directly.

To further verify the effectiveness of the proposed MRRDC, we also provide the confusion matrices of single-modal retrieval and the query examples for I2T and T2I in Figures 5 and 6, respectively. Intuitively, Figure 5 shows that our method achieves high precision in each category, which proves that the projection space is discriminative. We also observe from Figure 6 that, in many categories, our proposed method successfully obtains the best retrieval results for the query samples.

4.6. Convergence

Our objective formulation is solved by an iterative optimization algorithm, and in practical applications fast model training and retrieval are necessary. In Figure 7, we plot the convergence curves of the optimization algorithm, that is, the objective function values of equations (5) and (6) at each iteration, on the wiki-W and Pascal Sentence datasets, respectively. The curves decrease monotonically, and the algorithm generally converges within about 20 iterations on these datasets. This fast convergence ensures the efficiency of our method.

5. Conclusion

In this paper, we propose an effective semi-supervised cross-modal retrieval approach based on discriminative comapping. Our approach uses different couples of discriminative projection matrices to map different modalities into a common space, where the correlation between modalities is maximized for each retrieval task. In particular, we use labelled samples to propagate category information to unlabelled samples, and the original class information is preserved by linear discriminant analysis. Therefore, the proposed method not only exploits the specificity of different retrieval tasks but also keeps the structure information of different modalities. In the future, we will further mine the correlation between modalities and focus on unsupervised cross-modal retrieval for unlabelled data.

Data Availability

The data supporting this paper are from the reported studies and datasets in the cited references.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (no. 61702310), the Major Fundamental Research Project of Shandong, China (no. ZR2019ZD03), and the Taishan Scholar Project of Shandong, China (no. ts20190924).