Abstract
Most crossmodal retrieval methods based on subspace learning focus only on learning the projection matrices that map different modalities to a common subspace and pay little attention to retrieval task specificity and class information. To address these two limitations and make full use of unlabelled data, we propose a novel semisupervised method for crossmodal retrieval named modal-related retrieval based on discriminative comapping (MRRDC). Projection matrices are learned to map multimodal data into a common subspace for different tasks. During projection matrix learning, a linear discriminant constraint is introduced to preserve the original class information in the different modal spaces. An iterative optimization algorithm based on label propagation is presented to solve the proposed joint learning formulations. Experimental results on several datasets demonstrate the superiority of our method over state-of-the-art subspace methods.
1. Introduction
In real applications, data are often represented in different ways or obtained from various domains. As a consequence, data with the same semantics may exist in different modalities or exhibit heterogeneous properties. With the rapid growth of multimodal data, there is an urgent need to analyze data obtained from different modalities effectively [1–5]. Although multimodal analysis has received much attention, the most common approach is to ensemble the multimodal data to improve performance [6–9]. Crossmodal retrieval is an efficient way to retrieve data across modalities. The typical example is to take an image as a query to retrieve related texts (I2T) or to search for images by using a textual description (T2I). Figure 1 shows the detailed process for the I2T and T2I tasks. The results obtained by crossmodal retrieval are more comprehensive than those of traditional single-modality retrieval.
Generally, the semantic gap and the relevance measure impede the development of crossmodal retrieval. Although many approaches have been proposed to address these problems, their performance is still not satisfactory. To make different modalities comparable, the methods in [10–16] learn a common subspace by minimizing the pairwise differences. However, task specificity and class information are often ignored, which leads to poor retrieval performance.
To solve the problems mentioned above, this paper proposes a novel semisupervised joint learning framework for crossmodal retrieval that integrates common subspace learning, task-related learning, and class discriminative learning. Firstly, inspired by canonical correlation analysis (CCA) [7] and linear least squares, a pair of projection matrices is learned by coupled linear regression to map the original multimodal data to the common subspace. At the same time, linear discriminant analysis (LDA) and task-related learning (TRL) are used to keep the data structure in different modalities and the semantic relationship in the projection space. Furthermore, to mine the category information of unlabelled data, a semisupervised strategy is used to propagate the semantic information from labelled data to unlabelled data. Experimental results on three public datasets show that the proposed method outperforms previous state-of-the-art subspace approaches.
The main contributions of this paper can be summarized as follows:
(1) The proposed joint formulation seamlessly combines semisupervised learning, task-related learning, and linear discriminative analysis into a unified framework for crossmodal retrieval
(2) The class information of labelled data is propagated to unlabelled data, and the linear discriminative constraint is introduced to preserve the interclass and intraclass similarity among different modalities
The remainder of the paper is organized as follows. In Section 2, we briefly overview the related work on the crossmodal retrieval problem. The details of the proposed methodology and the iterative optimization method are introduced in Section 3. Section 4 reports the experimental results and analysis. Conclusions are finally given in Section 5.
2. Related Work
Because crossmodal retrieval plays an important role in various applications, many subspace-based methods have been proposed that establish intermodal and intramodal correlations. Rasiwasia et al. [7] investigated the retrieval performance of various combinations of image features and textual representations, which cover all possibilities in terms of the two guiding hypotheses. Later, partial least squares (PLS) [17] was also applied to the crossmodal matching problem. Sharma and Jacobs [18] used PLS to linearly map images from different views into a common linear subspace in which the images are highly correlated. Chen et al. [19] addressed crossmodal document retrieval by using PLS to transform image features into the text space, where the similarity between the two modalities can be measured directly. In [20, 21], the bilinear model and generalized multiview analysis (GMA) were proposed and performed well in crossmodal retrieval.
In addition to CCA, PLS, and GMA, Mahadevan et al. [22] proposed a manifold learning algorithm that can simultaneously reduce the dimension of data from different modalities. Mao et al. [23] introduced a cross-media retrieval method named parallel field alignment retrieval, which integrates a manifold alignment framework from the perspective of vector fields. Lin and Tang [24] proposed a common discriminant feature extraction (CDFE) method to learn the difference within each scattering matrix and between scattering matrices. Sharma et al. [21] generalized LDA and marginal Fisher analysis (MFA) to generalized multiview LDA (GMLDA) and generalized multiview MFA (GMMFA) by extending them from a single modality to multiple modalities. Inspired by semantic information, Gong et al. [25] proposed a three-view CCA to explore the correlation between features and their corresponding semantics in different modalities.
Furthermore, other methods, such as dictionary learning, graph-based learning, and multiview embedding, have been proposed for the crossmodal problem [26–29]. Zhuang et al. [30] proposed SliM2, which adds a group sparse representation to pairwise relation learning to project different modalities into a common space. Xu et al. [31] proposed combining dictionary learning and feature learning to learn the projection matrix adaptively. Deng et al. [32] proposed a discriminative dictionary learning method with common label alignment that learns the coefficients of different modalities. Wei et al. [33] proposed a modal-related method named MDCR to solve the modal semantic problem. Wu et al. [34] utilized spectral regression and a graph model to jointly learn the minimum error regression and the latent space. Wang et al. [35] proposed an adversarial learning framework that learns modality-invariant and discriminative representations of different modalities. In this framework, the modality classifier and the feature projector compete with each other to obtain a better pair of feature representations. Cao et al. [36] used multiview embedding to obtain latent representations for visual object recognition and crossmodal retrieval. Zhang et al. [37] utilized a graph model to learn a common space for crossmodal retrieval by adding intraclass and interclass relationships to the projection process.
The main purpose of these methods is to address the correlation and distance measure between modalities, but class information and task specificity are not handled well. Handling both at the same time for different tasks is therefore particularly important. Based on this idea, we learn two couples of projections for different retrieval tasks and apply a linear discriminative constraint to the projection matrices. To achieve this goal, we combine task-related learning with linear discriminative analysis through semisupervised label propagation. Figure 2 shows the flowchart of our method. Experimental results on three open crossmodal datasets demonstrate that our crossmodal retrieval method outperforms the compared methods.
3. Methodology
To improve the retrieval performance, we introduce the discriminative comapping and pay more attention to different retrieval tasks and class information preservation. Here, we focus on the I2T and T2I retrieval tasks, and it is easy to extend our method to the retrieval of other modalities.
3.1. The Objective Function
Define image data as and text data as separately, where and denote the labelled image and its text with dimensions, and and represent the unlabelled image and its text with dimensions. Let be pairs of image and text documents, where and denote the labelled and unlabelled documents, respectively. is the semantic matrix, where is the category number, is the label of labelled data with onehot coding, and is the pseudolabel of unlabelled data. The goal of our method is to learn two couples of projection matrices that project data from different modalities into a common space for different tasks. Then, the crossmodal retrieval can be performed in the common space.
We propose a novel modalrelated projection strategy based on semisupervised learning for task specificity. Here, the pairwise closeness of multimodal data and the semantic projection are combined into a unified formulation. For I2T and T2I, the minimization forms are obtained as follows:where and stand for the projection matrices for modalities and separately.
The linear discriminant constraint to equations (1) and (2) is introduced to preserve the class information in the latent projection subspace. We denote as the mean of the labelled samples in the th class and as the mean of all labelled samples. The intraclass scatter matrix can be defined as , and the total scatter matrix can be represented as . The objective function is represented as follows:where is the projection matrix and is the dimension of the basic vector.
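As a concrete illustration of the scatter matrices described above, the following minimal sketch computes an intraclass and a total scatter matrix with NumPy. The function name `scatter_matrices`, the row-per-sample layout, and the unnormalized sums are assumptions made for illustration; the paper's exact normalization is not recoverable from the text.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_w, S_t) for labelled samples stored as rows of X.

    Assumption: unnormalized scatter sums; the paper may scale them differently.
    """
    labels = np.asarray(labels)
    mu = X.mean(axis=0)                      # mean of all labelled samples
    centered = X - mu
    S_t = centered.T @ centered              # total scatter
    S_w = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        Xc = X[labels == c]
        dc = Xc - Xc.mean(axis=0)            # deviations from the class mean
        S_w += dc.T @ dc                     # intraclass scatter, summed over classes
    return S_w, S_t

# Toy data: 3 classes, 20 samples each, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
labels = np.repeat(np.arange(3), 20)
S_w, S_t = scatter_matrices(X, labels)
S_b = S_t - S_w                              # interclass scatter, since S_t = S_w + S_b
```

With these definitions, the interclass scatter does not need to be computed separately, because the total scatter decomposes into the intraclass and interclass parts.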
According to equation (3), the linear discriminant constraint can be transformed into , where is . The intraclass scatter of is represented as , and the interclass scatter of is . Under the multimodal condition, our method utilizes LDA projections to preserve class information of each modal. The corresponding formula is as follows:where and denote and separately.
We add equation (4) to equations (1) and (2), respectively, and then get the objective functions of I2T and T2I in the following:where is a tradeoff coefficient to balance pairwise information and semantic information and and are regularization parameters to balance the structure information of the image and text. According to equations (1) and (2), the structure projection of and is the same as the semantic projection. Consequently, our method can bridge the feature and semantic spaces. This can decrease the loss of projection and improve the performance of crossmodal retrieval.
We then introduce the semisupervised learning strategy. To propagate the label information from the labelled data, we utilize the radial basis function (RBF) kernel to evaluate the pairwise similarities between the unlabelled data after projection; these similarities are then treated as label information and updated during optimization until the results converge. For any two samples $x_i$ and $x_j$, the kernel function is defined as follows:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \qquad (7)$$

where $\sigma$ is the kernel parameter.
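The label propagation step can be sketched as follows. This is only one plausible reading of the description above, not the authors' procedure: it assumes the common RBF form exp(-||a - b||^2 / (2 sigma^2)), measures similarities between projected unlabelled and projected labelled samples, and takes the pseudo-label of each unlabelled sample to be the similarity-weighted average of the one-hot labels. The names `rbf_kernel` and `propagate_labels` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * sigma^2)) (assumed kernel form)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))  # clip tiny negatives

def propagate_labels(Z_lab, Y_lab, Z_unl, sigma=1.0):
    """Soft pseudo-labels for projected unlabelled samples Z_unl.

    Assumption: each pseudo-label is a row-normalized, similarity-weighted
    average of the one-hot labels Y_lab of the projected labelled samples Z_lab.
    """
    K = rbf_kernel(Z_unl, Z_lab, sigma)
    W = K / K.sum(axis=1, keepdims=True)     # each row sums to 1
    return W @ Y_lab

# Two well-separated clusters in a toy 2-D common space.
Z_lab = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
Y_lab = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Z_unl = np.array([[0.2, 0.0], [4.8, 5.0]])
Y_pseudo = propagate_labels(Z_lab, Y_lab, Z_unl)
```

In an iterative scheme, `Y_pseudo` would be recomputed after every update of the projection matrices until it stops changing.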
3.2. Algorithm Optimization
The objective functions of equations (5) and (6) are nonconvex, so an alternating iterative method is used: each variable is updated in turn while the other variables are held fixed.
For any matrix , the partial derivative of equation (5) is represented as follows:
Similarly, the partial derivative of equation (6) is given as follows:
According to equations (8)–(11), our method can be solved by gradient descent. Algorithm 1 describes the optimization of crossmodal learning. After the projection matrices for the I2T and T2I tasks are obtained, the image and text data can be mapped to the common space, where crossmodal retrieval is performed.
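The alternating gradient updates can be illustrated with a simplified stand-in objective. The sketch below is not equation (5): it assumes an I2T-style objective lam * ||XU - TV||^2 + (1 - lam) * ||XU - Y||^2 + eta * (||U||^2 + ||V||^2), omits the linear discriminant term, and uses hypothetical names (`fit_i2t`, `lam`, `eta`) and a step size derived from spectral norms rather than the paper's learning rate.

```python
import numpy as np

def objective(X, T, Y, U, V, lam, eta):
    pair = np.linalg.norm(X @ U - T @ V) ** 2      # pairwise closeness term
    sem = np.linalg.norm(X @ U - Y) ** 2           # semantic regression term
    reg = np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2
    return lam * pair + (1 - lam) * sem + eta * reg

def fit_i2t(X, T, Y, lam=0.5, eta=0.1, n_iter=200, seed=0):
    """Alternating gradient descent on the assumed objective above."""
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.normal(size=(X.shape[1], Y.shape[1]))
    V = 0.01 * rng.normal(size=(T.shape[1], Y.shape[1]))
    # Conservative step size: the spectral norms bound the curvature of each block.
    lr = 1.0 / (4 * (np.linalg.norm(X, 2) ** 2 + np.linalg.norm(T, 2) ** 2) + 2 * eta)
    history = []
    for _ in range(n_iter):
        # Update U while V is fixed.
        R = X @ U - T @ V
        grad_U = 2 * lam * (X.T @ R) + 2 * (1 - lam) * (X.T @ (X @ U - Y)) + 2 * eta * U
        U = U - lr * grad_U
        # Update V while the new U is fixed.
        R = X @ U - T @ V
        grad_V = -2 * lam * (T.T @ R) + 2 * eta * V
        V = V - lr * grad_V
        history.append(objective(X, T, Y, U, V, lam, eta))
    return U, V, history

# Toy data: 40 pairs, 6-D image features, 4-D text features, 3 classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
T = rng.normal(size=(40, 4))
Y = np.eye(3)[rng.integers(0, 3, size=40)]         # one-hot semantic matrix
U, V, history = fit_i2t(X, T, Y)
```

With this step size, each block update cannot increase the stand-in objective, so `history` is nonincreasing; a full implementation would also refresh the pseudo-labels inside the loop.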

4. Experiments
To evaluate the performance of the proposed method (MRRDC), we compare it with several other methods on three public datasets.
4.1. Datasets
4.1.1. Wikipedia Dataset
This dataset consists of 2,866 image-text pairs, each labelled with one of 10 semantic classes. In this dataset, 2,173 pairs are selected as the training set, and the rest form the testing set. In our experiments, we use the public dataset [7] provided by Rasiwasia et al. (wikiR), where images are represented by 128-dimensional SIFT descriptor histograms [38], and the 10-dimensional text representation is derived from a latent Dirichlet allocation (LDA) topic model [39]. We also use the dataset provided by Wei et al. (wikiW) [40], where 4,096-dimensional CNN features [41] represent the images and 100-dimensional LDA features represent the texts.
4.1.2. Pascal Sentence Dataset [40]
This dataset consists of 1,000 image-text pairs from 20 categories. We randomly choose 30 pairs from each category as training samples and use the rest as test samples. The image features are 4,096-dimensional CNN features, and the text features are 100-dimensional LDA features.
4.1.3. INRIAWebsearch [42]
This dataset contains 71,478 pairs of image and text annotations from 353 classes. We remove the pairs that are marked as irrelevant and select the pairs that belong to any one of the 100 largest categories. This gives a subset of 14,698 pairs for evaluation. We randomly select 70% of the pairs from each category as the training set (10,332 pairs), and the rest are treated as the testing set (4,366 pairs). Similarly, images are represented with 4,096-dimensional CNN features, and the textual tags are represented with 100-dimensional LDA features.
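A per-category random split of this kind can be sketched as follows. The helper `split_per_class`, its rounding rule, and the fixed seed are assumptions for illustration; the exact sampling procedure used for the experiments is not specified in the text.

```python
import numpy as np

def split_per_class(labels, train_fraction=0.7, seed=0):
    """Randomly assign a fraction of each category to training, the rest to testing."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))   # shuffle within the class
        n_train = int(round(train_fraction * len(idx)))      # assumed rounding rule
        train_idx.extend(idx[:n_train].tolist())
        test_idx.extend(idx[n_train:].tolist())
    return np.array(sorted(train_idx)), np.array(sorted(test_idx))

# Toy example: 4 categories with 10 pairs each.
labels = np.repeat(np.arange(4), 10)
train_idx, test_idx = split_per_class(labels)
```

Splitting inside each category keeps the class proportions of the training and testing sets close to those of the full subset.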
4.2. Evaluation Metrics
To evaluate the performance of the proposed method, two typical crossmodal retrieval tasks are conducted: I2T and T2I. In the test phase, the projection matrices are used to map the multimodal data into the common subspace, where data of one modality can be retrieved with a query from the other. In all experiments, the cosine distance is adopted to measure the similarity between features. Given a query, the aim of each crossmodal task is to find its top-k nearest neighbors among the retrieval results.
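The retrieval step in the common subspace can be sketched as a cosine-similarity ranking. The function `cosine_topk` is a hypothetical helper, and its inputs are assumed to be features that have already been projected into the common subspace.

```python
import numpy as np

def cosine_topk(queries, gallery, k=5):
    """Indices of the k gallery items most similar to each query (cosine similarity)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sim = q @ g.T                            # higher similarity = smaller cosine distance
    order = np.argsort(-sim, axis=1)         # sort each row in descending similarity
    return order[:, :k]

# Toy example: two projected image queries against three projected texts.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.1], [0.1, 3.0]])
top2 = cosine_topk(queries, gallery, k=2)    # rows: [0, 2] and [1, 2]
```

Ranking by descending cosine similarity is equivalent to ranking by ascending cosine distance, so either convention gives the same top-k list.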
The performance of the algorithms is evaluated by mean average precision (mAP), which is one of the standard information retrieval metrics. To obtain mAP, the average precision (AP) of each query is calculated by

$$\mathrm{AP} = \frac{1}{R}\sum_{k=1}^{n} P(k)\,\delta(k), \qquad (12)$$

where $n$ is the number of retrieved items, $R$ is the number of relevant items in the test dataset, $P(k)$ is the precision of the top $k$ retrieved items, and $\delta(k) = 1$ if the $k$th retrieved item is relevant; otherwise, $\delta(k) = 0$. The value of mAP is then obtained by averaging AP over all queries. The larger the mAP, the better the retrieval performance. Besides mAP, the precision-recall curves and the per-class mAP are used to evaluate the effectiveness of different methods.
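The metric can be computed as in the following minimal sketch. The helper names are hypothetical, and the sketch assumes that an item is relevant when it shares the query's class label and that the whole test gallery is ranked, so the number of relevant items equals the number of relevant entries in the ranked list.

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked list; relevant[k-1] is 1 if the k-th retrieved item is relevant."""
    relevant = np.asarray(relevant, dtype=float)
    n_rel = relevant.sum()
    if n_rel == 0:
        return 0.0                                   # no relevant item retrieved
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_k = np.cumsum(relevant) / ranks     # precision of the top-k items
    return float((precision_at_k * relevant).sum() / n_rel)

def mean_average_precision(ranked_labels, query_labels):
    """Average AP over queries; ranked_labels[i] holds the labels of the items
    retrieved for query i, ordered from most to least similar."""
    aps = [average_precision(np.asarray(r) == q)
           for r, q in zip(ranked_labels, query_labels)]
    return float(np.mean(aps))

ap = average_precision([1, 0, 1, 0])                 # (1/1 + 2/3) / 2 = 5/6
m = mean_average_precision([[0, 1, 0], [1, 1, 0]], [0, 1])
```

In the first call, the relevant items sit at ranks 1 and 3, so only the precisions at those two ranks contribute to the average.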
4.3. Comparison Methods
To verify the performance of our method, we compare it with the following state-of-the-art methods: PLS [18], CCA [7], SM [7], SCM [7], GMLDA [21], GMMFA [21], MDCR [33], JLSLR [34], ACMR [35], and SGRCR [37].
PLS, CCA, SM, and SCM are typical methods that utilize pairwise information to learn a common latent subspace, where the similarity between different modalities can be measured directly by metric methods. These approaches bring the paired data in the multimodal dataset closer together in the learned common subspace. GMLDA, GMMFA, and MDCR are based on semantic category information obtained via supervised learning. Because they use label information, these methods can learn a more discriminative subspace.
4.4. Experimental Setup
The parameters of the proposed MRRDC in Algorithm 1 for the retrieval tasks of I2T and T2I are set as follows: , , , , , , , , and on Wikipedia provided by Rasiwasia and INRIAWebsearch. On Wikipedia provided by Wei and Pascal, , and the rest are the same with the above. In our experiment, learning rate is set .
4.5. Results and Analysis
Table 1 shows the mAP scores achieved by PLS, CCA, SM, SCM, GMMFA, GMLDA, MDCR, and our method on wikiR, wikiW, Pascal Sentence, and INRIAWebsearch. We observe that our method outperforms its counterparts. This may be because the projection matrices preserve more discriminative class information via semisupervised learning. The common subspace of our method is more discriminative and effective because it further exploits the class semantics of intramodality and intermodality similarity simultaneously. From Table 1, we also find that, in most cases, GMMFA, GMLDA, MDCR, and MRRDC perform better than PLS, CCA, SM, and SCM, and that CNN image features are superior to the shallow features. The first observation arises because PLS, CCA, SM, and SCM only use pairwise information, whereas the other approaches add class information to their objective functions, which provides better separation between different categories in the latent common subspace. The second observation is due to the powerful semantic representation of CNN.
The precision-recall curves on wikiR, wikiW, Pascal Sentence, and INRIAWebsearch are plotted in Figure 3. Figure 4 shows the per-class mAP scores of the comparison approaches and our method, and the rightmost bar of each figure shows the average mAP scores. For most categories, the mAP of our method is higher than that of the comparison methods. From these experimental results, we can draw the following conclusions:
(1) Compared with the current state-of-the-art methods, our method improves the average mAP greatly. Our method consistently outperforms the compared methods, which is due to the fact that MRRDC learns projection matrices in task-related and linear discriminative ways for different modalities, so that different modalities can preserve semantic and original class information. Besides, both labelled data and unlabelled data of all the different modalities are explored. The labelled information can be propagated to the unlabelled data during the training process.
(2) In most cases, GMLDA and GMMFA outperform CCA since GMLDA and GMMFA add category information to their formulation, which makes the common projection subspace more suitable for crossmodal retrieval.
(3) Compared with the shallow features, CNN features have great advantages for the I2T task, because CNN features capture semantic information directly from the original images.
To further verify the effectiveness of our proposed MRRDC, we also provide the confusion matrices on single-modal retrieval and the query examples for I2T and T2I in Figures 5 and 6, respectively. From Figure 5, our method achieves high precision in each category, which indicates that the projection space is discriminative. We also observe from Figure 6 that, in many categories, our proposed method obtains the best retrieval results for the query samples.
4.6. Convergence
Our objective formulation is solved by an iterative optimization algorithm. In practical applications, fast convergence is necessary. In Figure 7, we plot the convergence curves of our optimization algorithm, that is, the objective function values of equations (5) and (6) at each iteration, on the wikiW and Pascal Sentence datasets. In this figure, the curves are monotonic over the iterations, and the algorithm generally converges within about 20 iterations on these datasets. This fast convergence supports the efficiency of our method.
5. Conclusion
In this paper, we propose an effective semisupervised crossmodal retrieval approach based on discriminative comapping. Our approach uses different couples of discriminative projection matrices to map different modalities to a common space in which the correlation between modalities is maximized for different retrieval tasks. In particular, we use labelled samples to propagate the category information to unlabelled samples, and the original class information is preserved by using linear discriminant analysis. Therefore, the proposed method not only exploits the relationship between different retrieval tasks but also keeps the structure information of different modalities. In the future, we will further mine the correlation between different modalities and focus on unsupervised crossmodal retrieval for unlabelled data.
Data Availability
The data supporting this paper are from the reported studies and datasets in the cited references.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (no. 61702310), the Major Fundamental Research Project of Shandong, China (no. ZR2019ZD03), and the Taishan Scholar Project of Shandong, China (no. ts20190924).