Abstract
Fast cross-modal retrieval based on hash coding has become a hot topic for rich multimodal data (text, image, audio, etc.), especially given the security and privacy challenges in the Internet of Things and mobile edge computing. However, most hash coding methods map all modalities to a common hash coding space of a single length and relax the binary constraints of hash coding. As a result, the learned multimodal hash codes may not express the original multimodal data sufficiently and effectively, and the hash codes become less discriminative between categories. To solve these problems, this paper proposes mapping each modality to a hash coding space of its own optimal length and then solving the hash codes of each modality with a discrete cross-modal hash algorithm that preserves the binary constraints. Finally, the similarity of multimodal data is compared in the potential space. In experiments on the WIKI, NUS-WIDE, and MIRFlickr data sets, cross-modal retrieval based on variable-length hash coding outperforms the compared methods, which shows that the proposed method is feasible and effective.
1. Introduction
With the advent of the big data era, the volume of data in different modalities, e.g., text, image, and audio, for the Internet of Things and mobile edge computing is increasing dramatically [1]. Traditional single-modal retrieval, e.g., retrieving text with text, images with images, or audio with audio, is gradually shifting to cross-modal retrieval, e.g., retrieving images or audio with text, or text with images, so that the returned results carry diverse information and rich content [2]. Over the last few years, cross-modal retrieval algorithms have received significant attention and made considerable progress, driven by applied research on guaranteed data privacy and privacy-preserving cooperative object classification [3, 4].
These research methods fall into two main categories. The first is based on potential subspace learning [5–8], among which canonical correlation analysis (CCA) is the most commonly used model [5]. CCA maps the data of two modalities into a potential subspace that maximizes the correlation of associated data pairs, and similarity queries are then performed directly in that subspace. Building on the idea of maximizing the correlation of relevant data in a subspace, researchers have proposed several variants of the CCA model. Fu et al. proposed generalized multiview analysis (GMA), which maximizes the subspace correlation of multimodal data and adds label information to make the subspace class-discriminative, further boosting the accuracy of cross-modal retrieval [6]. Costa Pereira et al. first projected the original feature data of each modality into its own semantic feature space and then mapped the semantic features of all modalities into a unified subspace by applying CCA or kernel CCA. This model uses the label information of the data to improve class discrimination and avoids mapping the original multimodal features directly into the unified subspace, so the cross-modal retrieval performance is notably improved [7]. Mandal and Biswas proposed the generalized dictionary pair algorithm and achieved good results by learning a unified sparse coding subspace [8]. Although cross-modal retrieval algorithms based on unified subspace learning have made progress, they still face problems on large-scale multimodal data, e.g., high computational cost, high storage consumption, and weak stationarity. Therefore, a second category of cross-modal retrieval algorithms, based on hash coding, has attracted considerable interest in the research community.
With low storage consumption and fast retrieval speed, hash coding is well suited to large-scale cross-modal and cross-media tasks, e.g., real-time personalized recommendation of multimodal data [9], hot topic detection, and cross-media retrieval. In hash coding-based cross-modal retrieval methods [10–13], multimodal data are projected into a low-dimensional Hamming space through linear mappings that maintain the connection between modalities, and an XOR operation is then used to measure the similarity distance. This effectively solves the speed problem of large-scale data retrieval. However, most prior methods are only suitable for scenarios with single-label and paired training data. Mandal et al. therefore proposed a hashing cross-modal retrieval model for multiple training scenarios [14]. However, like the methods presented in Refs. [15, 16], this model maps multimodal data into hash codes of equal length, so the data of the various modalities may not be well represented. In addition, solving for binary hash codes is an NP-hard problem, so existing methods relax the binary constraint, and the learned hash codes are not accurate enough. To address these issues, this paper proposes a cross-modal retrieval model based on variable-length hash coding and keeps the binary constraints while solving for the hash codes. The learned variable-length hash codes can therefore better represent the original multimodal data and achieve higher accuracy. The main contributions of this paper are as follows.
(1) To address the issue caused by equal code lengths, we propose a variable-length hash coding-based cross-modal retrieval model, i.e., the data of each modality are projected into a hash coding space of its optimal length. Compared with a fixed-length hash coding space, the original multimodal data can be represented more easily, and the model is more flexible to tune in experiments.
(2) We propose a more generalized multiscene cross-modal retrieval model. The great majority of existing cross-modal retrieval models are based on single-label and pairwise multimodal data sets and cannot be applied to multilabel or unpaired multimodal data sets. In contrast, the cross-modal retrieval model in this paper adapts well to single-label or multilabel, paired or unpaired multimodal data sets.
(3) Building on single-modal data hashing, we propose a variable-length discrete hash coding-based cross-modal retrieval algorithm and verify its validity on several public data sets.
2. Related Works
This section introduces several related hash coding cross-modal retrieval algorithms, which also serve as benchmark algorithms in the experiments. Readers interested in other cross-modal retrieval models, such as those incorporating feedback technology and deep learning, can refer to Ref. [17].
2.1. Hashing Cross-Modal Retrieval Based on Semantic Correlation Maximization
Taherkhani et al. proposed a cross-modal hash retrieval model based on Semantic Correlation Maximization (SCM). Compared with other supervised cross-modal hashing models, it has lower training time complexity, better adaptability, and more stability on large-scale data sets [10]. Its main highlights are as follows. (1) The similarity matrix is computed directly from the label information of the training set, which avoids explicitly calculating the complex pairwise similarity matrix; the training time complexity is therefore only linear, which also makes the model more stable. (2) A sequential solution method for hash coding is proposed that computes the code bit by bit in closed form, so there is no need to set hyperparameters or stopping conditions. To use the label semantic information, the cosine similarity between label vectors is used to construct the similarity matrix: the similarity between two data objects is the inner product of their label vectors divided by the product of the 2-norms of those label vectors (1). To support cross-modal similarity queries, the hash functions should preserve the semantic similarity of the multimodal data. More specifically, the hash codes of the modalities should be able to reconstruct the semantic similarity matrix. The SCM objective function (2) is defined over the data of the two modalities, the linear transformation matrices, a balance parameter, and the similarity between data of different modalities. Because (2) contains a sign function, its optimization is an NP-hard problem; SCM therefore relaxes the sign function constraint and adds constraints between the bits of the hash code. Finally, the transformation matrices of each modality can be calculated, so that the hash codes of new data can be obtained.
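The label-based cosine similarity described above can be illustrated with a short NumPy sketch. The function name `cosine_label_similarity` and the convention that each row is a binary label vector are assumptions made for illustration; only the cosine formula itself comes from the text.

```python
import numpy as np

def cosine_label_similarity(labels_x, labels_y):
    """Cosine similarity between label vectors of two modalities.

    labels_x: (n, c) array, one label vector per row (assumed layout).
    labels_y: (m, c) array, one label vector per row (assumed layout).
    Returns an (n, m) similarity matrix.
    """
    lx = np.asarray(labels_x, dtype=float)
    ly = np.asarray(labels_y, dtype=float)
    norm_x = np.linalg.norm(lx, axis=1, keepdims=True)
    norm_y = np.linalg.norm(ly, axis=1, keepdims=True)
    # Guard against division by zero for samples without any label
    # (an assumption; such samples get similarity 0 to everything).
    norm_x[norm_x == 0] = 1.0
    norm_y[norm_y == 0] = 1.0
    return (lx / norm_x) @ (ly / norm_y).T

# Two image samples and two text samples over three categories.
S = cosine_label_similarity([[1, 0, 0], [1, 1, 0]],
                            [[1, 0, 0], [0, 0, 1]])
```

Because the matrix is obtained from two normalized label matrices, it never requires comparing raw features pair by pair, which is the source of the low training cost mentioned above.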
2.2. Hashing Cross-Modal Retrieval Based on Semantic Preserving
Chen et al. proposed a Semantic Preserving Hash (SEPH) cross-modal retrieval model, which converts the similarity information of the data into a probability distribution and then approximates it with hash codes by minimizing the Kullback–Leibler (KL) divergence [11]. The whole objective function is well founded in mathematical theory. As in the SCM model, a similarity matrix is first constructed to provide supervisory information for the learned hash codes. The model consists of two steps: solving for the hash codes and learning kernel logistic regression functions. To solve for the hash codes, the similarity matrix is first transformed into a probability distribution, and the corresponding probability distribution over the unified hash codes is computed from the Hamming distances between hash codes. Learning the best hash codes aims to make these two distributions as similar as possible, so the KL divergence between them is minimized to obtain the semantic-preserving hash codes.
In all, unified semantic-preserving hash codes can be calculated by these steps, and then the logistic regression functions that map the data of each modality to the unified hash codes are learned. One logistic regression function is learned per bit for each modality, using the corresponding bit of the common binary codes as the target, which yields the transformation matrix. For a new sample in a given modality, the probabilities that each bit of its binary code takes the value −1 or +1 can then be calculated, and each bit is set to the value with the higher probability. Finally, once all logistic regression functions of a modality have been learned, a new sample can be mapped to a binary code of the corresponding length. The final hash code is obtained by changing every element with the value −1 into 0.
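As a minimal illustration of the last step, the sketch below converts codes with entries in {−1, +1} into {0, 1} bits and compares equal-length codes with an XOR-based Hamming distance. The function names `to_bits` and `hamming_distances` are hypothetical, and the bit packing via NumPy is an implementation choice for this example, not something specified by the SEPH description.

```python
import numpy as np

def to_bits(codes_pm1):
    """Map codes with entries in {-1, +1} to {0, 1} by sending -1 to 0."""
    return (np.asarray(codes_pm1) > 0).astype(np.uint8)

def hamming_distances(query_bits, database_bits):
    """Hamming distance between one query code and every database code.

    query_bits: (k,) array of 0/1 values.
    database_bits: (n, k) array of 0/1 values, one code per row.
    Returns an (n,) array of distances.
    """
    # Pack bits into bytes; both sides are zero-padded identically,
    # so the padding never contributes to the XOR result.
    q = np.packbits(query_bits)
    db = np.packbits(database_bits, axis=1)
    xor = np.bitwise_xor(db, q)
    # Count the set bits of the XOR result for each database code.
    return np.unpackbits(xor, axis=1).sum(axis=1)

database = to_bits([[1, -1, 1, 1], [-1, -1, 1, 1], [-1, 1, -1, -1]])
query = to_bits([1, -1, 1, 1])
distances = hamming_distances(query, database)
```

This comparison only applies when the query and database codes have the same length, which is the setting of the unified-code methods reviewed in this section.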
2.3. Hashing Cross-Modal Retrieval Based on Generalized Semantic Preserving
Most existing cross-modal retrieval methods require multimodal data to appear in pairs, i.e., every text or image in the training set must have a counterpart in the other modality. Mandal et al. therefore proposed the Generalized Semantic Preserving Hashing (GSPH) model for N-label cross-modal retrieval, which is suitable for single-label or multilabel, paired or unpaired multimodal data [14]. The GSPH model first learns the optimal hash codes of each modality so that the hash codes preserve the semantic similarity between the multimodal data, and then learns the hash functions that map the data of each modality into the hash coding space. Its main highlights are as follows. (1) A hash model that can deal with both single-label paired data and single-label unpaired data is proposed for the first time. (2) A generalized cross-modal hashing model is proposed that can be applied to single-label paired data, single-label unpaired data, multilabel paired data, and multilabel unpaired data, while the semantic similarity of the data is maintained by the common hash codes. As with the SCM and SEPH methods, the GSPH algorithm also needs to define the similarity matrix between multimodal data, whose size is determined by the numbers of samples of the two modalities, and the GSPH objective function is defined on this matrix.
The binary codes of the two modalities can be calculated by the GSPH method, and then the mapping functions from the original data of each modality to the hash codes need to be learned. As in the SEPH method, logistic regression is selected as the mapping function, so readers can refer to Section 2.2 for how to learn the hash functions and generate the hash codes of new samples.
3. Cross-Modal Retrieval Based on Variable-Length Hash Coding
In this section, the cross-modal retrieval algorithm based on variable-length hash coding is presented, and the optimization of the objective function and the time complexity of the algorithm are analyzed. For analytical simplicity and to reduce the experimental workload, this paper mainly studies the case of two modalities and gives the extension of the model to three or more modalities in Section 3.5.
3.1. Algorithmic Model
The variables presented in this paper are defined as follows. and represent the original feature data sets of the two modes, respectively, and are the corresponding variablelength hash coding, where each column represents a sample and each row represents attribute features. In addition, and are the projection matrixes, and is the association matrix of two modes. The similarity matrix between multimode data is constructed as follows:where defines the label vector of the sample, and each element of the similarity matrix represents the similarity between modal data and modal data . The next goal of this paper is to learn the compact hash coding of the optimal length for each model, so that these hash coding can perfectly represent the original multimode data and maintain the semantic similarity of multimode data sets. This paper calculates the similarity of different modal data in potential space by referring to Ref. [7] and assumes that there is a common potential abstract semantic space between multimodal data, in which multimodal data can be queried and retrieved directly. And, each modal hash coding is projected into the potential abstract semantic space in the following form:
In the space , the similarity between data can be calculated according to the relation of the inner product, which is defined as follows:
Remembering , we do not need to explicitly solve the existing form of each mode data in the potential abstract semantic space , but only calculate the similarity between the variedlength hash coding of each mode. The crossmodal retrieval objective function of the specific variablelength hash coding is defined as follows:
The first two terms of (12) project the data of the two modalities into hash coding spaces of their respective optimal lengths, and the last term ensures that the variable-length hash codes in the potential space still maintain the semantic similarity of the original multimodal data. The corresponding projection matrices, hash codes, and association matrix can be solved jointly through optimization.
3.2. Model Solution Procedure
To simplify solving for the hash codes, prior methods relax the binary constraints into a continuous real-valued problem and then obtain approximate hash codes through a sign function [10–12]. However, hash codes solved this way have essential defects and cannot represent the original multimodal data effectively. In this subsection, the binary constraints of the hash codes are maintained throughout the solving process. The objective function is nonconvex in all variables jointly and is difficult to solve directly. Therefore, this paper solves for one variable at a time while fixing the remaining variables, and iterates over all variables until the objective function converges.
(a) Fix the other variables and solve for the projection matrices. The objective function reduces to two regression problems, whose analytical solutions are given by the regression formula (14).
(b) Fix the other variables and solve for the association matrix. The simplified objective function (15) is a bilinear regression model, whose analytical solution is given by (16).
(c) Fix the other variables and solve for the hash code matrix of the first modality. The objective function simplifies to (17). Because of the binary constraint, it is complicated to solve directly. Therefore, in this paper, the hash code matrix is solved row by row, i.e., when one row vector is solved, the remaining row vectors are fixed, and the other row vectors are then solved iteratively. (17) can be further transformed into (18). Because of the binary constraint, the first term is a constant. If constant terms and irrelevant variables are removed, (18) can be rewritten in the more concise form (19), which is expressed with the trace of a matrix. After this transformation, (19) is closely related to the objective function in Ref. [16], so this paper follows its solution process: when a given row vector of the hash code matrix is solved, that row is removed from the hash code matrix and the corresponding rows and columns are removed from the related matrices, and the solution in Ref. [16] is applied. Once this row vector is solved, the remaining row vectors can be solved by a similar procedure.
(d) Fix the other variables and solve for the hash code matrix of the second modality. This step is similar to step (c), so readers can refer to step (c) for the detailed solution.
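The closed-form updates in steps (a) and (b) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: it assumes the regression terms have the least-squares form ||B - P X||_F^2 and the similarity term has the bilinear form ||S - Bx^T W By||_F^2, with one sample per column. The symbol names (`X`, `B`, `P`, `W`, `S`), the function names, and the small `ridge` term added for numerical stability are hypothetical and are not taken from the paper's equations. The discrete row-by-row update of steps (c) and (d) is not shown.

```python
import numpy as np

def update_projection(X, B, ridge=1e-6):
    """Least-squares solution of min_P ||B - P X||_F^2 (assumed form).

    X: (d, n) features, one sample per column.
    B: (k, n) hash codes of the same modality.
    Returns P with shape (k, d).
    """
    gram = X @ X.T + ridge * np.eye(X.shape[0])
    # P = B X^T (X X^T + ridge I)^{-1}, computed with a linear solve.
    return np.linalg.solve(gram, X @ B.T).T

def update_association(Bx, By, S, ridge=1e-6):
    """Solution of min_W ||S - Bx^T W By||_F^2 (assumed bilinear form).

    Bx: (kx, n) codes of modality x, By: (ky, m) codes of modality y,
    S: (n, m) cross-modal similarity matrix. Returns W with shape (kx, ky).
    The two code lengths kx and ky may differ.
    """
    gram_x = Bx @ Bx.T + ridge * np.eye(Bx.shape[0])
    gram_y = By @ By.T + ridge * np.eye(By.shape[0])
    # W = (Bx Bx^T)^{-1} Bx S By^T (By By^T)^{-1}
    left = np.linalg.solve(gram_x, Bx @ S @ By.T)
    return np.linalg.solve(gram_y, left.T).T

rng = np.random.default_rng(0)
Bx = np.where(rng.standard_normal((4, 60)) >= 0, 1.0, -1.0)  # 4-bit codes
By = np.where(rng.standard_normal((6, 50)) >= 0, 1.0, -1.0)  # 6-bit codes
W_true = rng.standard_normal((4, 6))
W_est = update_association(Bx, By, Bx.T @ W_true @ By)
```

In this toy setting the similarity matrix is generated exactly from `W_true`, so the update recovers it up to the effect of the ridge term; with real label-based similarities the update would only give the best bilinear fit.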
3.3. Algorithm Description
To project the hash codes into the optimal space for comparison, measurement, and retrieval, an association transformation matrix is introduced into the variable-length hash coding cross-modal retrieval model on the basis of the GSPH model, so that the similarity between data can be compared in the potential space through this matrix. Subsection 3.2 provides the solution process for each variable in the model, and the overall training steps are shown in Algorithm 1.

According to the proposed training process, the projection matrix of each modality can be calculated separately, and the hash code of a query sample from either modality is then obtained by applying a sign function to the projected sample. To improve the accuracy of the generated hash codes, when a query is available in both modalities as a sample pair, the information of both modalities can be used to generate the hash code jointly, either in the hash coding space of the first modality or in that of the second, with a nonnegative balance parameter weighting the two modalities. The overall testing steps for the model are summarized in Algorithm 2.
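The single-modality query path can be sketched as follows: a query is encoded with the sign of its projection, and database items of the other modality are ranked by a bilinear score through the association matrix. The names `encode`, `cross_modal_scores`, and `rank`, the score form b_x^T W b_y, and the choice to map a projection of exactly zero to +1 are assumptions made for this example; the joint two-modality variant with the balance parameter is not shown.

```python
import numpy as np

def encode(P, x):
    """Hash code sign(P @ x) with entries in {-1, +1}.

    A projection of exactly 0 is mapped to +1 (an assumption).
    """
    return np.where(P @ x >= 0, 1, -1)

def cross_modal_scores(b_query, W, B_db):
    """Bilinear similarity between one query code and database codes.

    b_query: (kx,) code of the query modality.
    W: (kx, ky) association matrix linking the two code spaces.
    B_db: (ky, n) database codes of the other modality, one per column.
    """
    return b_query @ W @ B_db

def rank(b_query, W, B_db):
    """Database indices sorted from most to least similar."""
    return np.argsort(-cross_modal_scores(b_query, W, B_db), kind="stable")

P_img = np.array([[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])  # 3-bit image codes
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])       # links 3 bits to 2 bits
B_txt = np.array([[1, -1, 1], [1, 1, -1]])                # three 2-bit text codes
b = encode(P_img, np.array([2.0, 1.0]))
order = rank(b, W, B_txt)
```

Because the score goes through `W`, the image code (3 bits here) and the text codes (2 bits here) do not need to have the same length, which is the point of the variable-length design.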
3.4. Time Complexity
The time complexity of the crossmodal retrieval algorithm in this section is mainly composed of computationrelated variables. In the training phase, the time of each iteration is consumed in updating the projection matrixes , transformation matrix , and corresponding hash coding matrixes , in which these variables are calculated by (14) and (16), and (17), respectively, and the corresponding calculation time complexity is . Therefore, the total time complexity of the proposed model is , where represents the total number of iterations, where . More specially, , , and are the original dimension, hash length, and the total number of samples of mode data, respectively, and , , and are the original dimension, hash length, and the total number of samples of mode data, respectively. Once the training process is end, the time and space complexity for generating a new sample is .

3.5. Application Scenario
The crossmodal retrieval model can be easily extended to the scenarios of three or more modal data, assuming that modal data, then the crossmodal retrieval model of variablelength hash coding for modal data is defined as follows:
The first item in (21) represents the hash code mapping of all modal data into the optimal length, and the second item represents the semantic relationship preservation between the hash coding of each mode and another modal hash coding. The process of model optimization and query sample hash coding generation can follow the way of twomodal data scenarios.
4. Results and Discussion
4.1. Data Sets and Performance Metrics
To verify the validity of the model, the commonly used WIKI, NUS-WIDE, and MIRFlickr data sets are selected for the cross-modal retrieval experiments. In addition, precision-recall curves and Mean Average Precision (MAP) are used to measure model performance, as in Refs. [11–13].
The WIKI data set is collected from Wikipedia pages [7]. Each image has a corresponding description text of no fewer than 70 words. It is a single-label data set with 10 categories: each image or text belongs to one of these categories, and images or texts belonging to the same category are considered to have similar semantic information. There are 2866 samples (2173 for training and 693 for testing); image data are represented by 128-dimensional Scale Invariant Feature Transform (SIFT) features and text data by 10-dimensional Latent Dirichlet Allocation (LDA) features.
The NUS-WIDE data set was collected from the Internet by the National University of Singapore [18] and contains 269,648 images with explanatory annotations provided by about 5,000 people. It is a multilabel data set with 81 categories. Because the numbers of samples in the categories differ greatly, this paper follows Refs. [10, 11] and selects the 10 categories with the most samples, which yields 186,577 text-image pairs. A text and an image are considered similar if they share at least one category. Subsequently, 1% of the data (about 1866 samples) are randomly selected as the test set and 5000 samples as the training set. The images are represented by 500-dimensional SIFT features and the texts by 1000-dimensional word frequency vectors.
The MIRFlickr data set originates from the Flickr website and contains 25000 images with corresponding manually annotated text [19]. As in Ref. [11], we removed samples without labels or whose labeled words appear fewer than 20 times, and the remaining 16,738 samples are divided into 24 categories. Each image-text pair is multilabel data with at least one category label. This paper selects 5% of the data as the test set and 5000 samples as the training set. Images are represented by 150-dimensional edge histograms and texts by 500-dimensional vectors. The evaluation criteria are precision and recall: precision is the number of relevant samples among the retrieved results divided by the number of retrieved results, and recall is the number of relevant samples among the retrieved results divided by the number of samples in the whole database that are related to the query.
Average Precision (AP): Given a query sample and its top returned results, the AP of this query is computed by summing the precision at every rank whose result is relevant to the query and dividing the sum by the number of returned results that are relevant to the query. Here, the precision at a rank is the accuracy of the results returned up to that rank, and an indicator equals 1 if the result at that rank is relevant to the query and 0 otherwise. Finally, the AP values of all query samples are averaged to give the MAP, which evaluates the overall retrieval performance.
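The AP and MAP computation described above can be sketched as follows. The function names, the `top_r` argument for truncating the returned list, and the convention that a query with no relevant result gets an AP of 0 are assumptions made for illustration.

```python
import numpy as np

def average_precision(relevance, top_r=None):
    """AP of one query from the relevance flags of its ranked results.

    relevance: sequence of 0/1 flags, ordered from best to worst rank.
    top_r: optional number of top results to evaluate (assumed parameter).
    """
    rel = np.asarray(relevance, dtype=float)
    if top_r is not None:
        rel = rel[:top_r]
    n_relevant = rel.sum()
    if n_relevant == 0:
        return 0.0  # assumed convention when nothing relevant is returned
    ranks = np.arange(1, len(rel) + 1)
    precision_at_rank = np.cumsum(rel) / ranks
    # Sum the precision only at ranks whose result is relevant.
    return float((precision_at_rank * rel).sum() / n_relevant)

def mean_average_precision(relevance_lists, top_r=None):
    """MAP: the mean of the AP values over all queries."""
    return float(np.mean([average_precision(r, top_r) for r in relevance_lists]))

ap = average_precision([1, 0, 1, 0])          # (1/1 + 2/3) / 2
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

For the multilabel data sets, the relevance flag of a result would be 1 when it shares at least one category with the query, as stated in the data set descriptions.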
4.2. Benchmark Algorithm
In this subsection, the multimodal data are preprocessed according to the method in Ref. [16], i.e., the distances between the sample points and randomly selected reference points are calculated. The discrete supervised hashing model is then used to initialize the hash codes of each modality. To highlight the importance of the label matrix during optimization, the label matrix of all data is scaled by a factor of 10. For comparison, we select CCA, a canonical correlation analysis method commonly used in cross-modal retrieval, and recent cross-modal retrieval algorithms based on semantic-correlation hash coding, namely SCM, SEPH, and GSPH. The comparison experiments are implemented in MATLAB with the parameters set in the original papers. Both SEPH and GSPH include two ways to learn hash functions: (1) training the hash functions on randomly selected samples (SEPH_rnd and GSPH_rnd); (2) training the hash functions on samples selected through clustering (SEPH_knn and GSPH_knn). Our experiments show that the hash functions obtained by these two training methods perform the same, so the first method, random sample selection, is used to train the hash functions of both SEPH and GSPH in the comparative experiments. Moreover, the SCM model has two variants, SCM_seq and SCM_orth; the experimental results show that the former is generally superior to the latter, so SCM_seq is used in the comparative experiments [10].
4.3. Experimental Results
This subsection presents the experimental results of cross-modal retrieval on the WIKI, NUS-WIDE, and MIRFlickr data sets. Two retrieval tasks, retrieving text with images and retrieving images with text, are analyzed in detail. Figure 1 shows the precision-recall curves on the three data sets. To facilitate comparison with the benchmark algorithms, both images and texts are projected into hash coding spaces of equal length (64 bits). Figure 1 shows that the proposed method generally outperforms the comparison methods, although the front part of the curve for the image-to-text task on the WIKI data set (subgraph (a) of Figure 1) is slightly lower than those of the SEPH and GSPH methods. However, subgraph (a) of Figure 2 shows that, with the optimal combination of hash code lengths, the proposed method performs slightly better than SEPH and GSPH. Figure 1 also shows that, on the other two multilabel data sets, the proposed method improves more over the comparison methods, because the model in this paper is more suitable for multilabel data sets than the CCA, SCM, SEPH, and GSPH models.
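Figure 1: Precision-recall curves on the WIKI, NUS-WIDE, and MIRFlickr data sets with 64-bit hash codes, panels (a) to (f).
Figure 2: Retrieval results for different combinations of image and text hash code lengths, panels (a) to (f).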
The MAP values of each method for the image-to-text and text-to-image tasks are presented in Tables 1 and 2, respectively, and the highest MAP value in each column is marked in bold. To compare CCA with the other methods, this paper projects the data into subspaces of different dimensions to observe the effect on CCA. Tables 1 and 2 show that the MAP values of the proposed method and the other hash coding methods increase slightly as the hash code length increases. As the bold values in the tables show, the MAP of the proposed method is superior to that of the comparison methods in both the image-to-text and the text-to-image tasks. With a hash code length of 64 bits, the proposed method improves on the GSPH method by about 15%, 10%, and 13% in the image-to-text task on the WIKI, NUS-WIDE, and MIRFlickr data sets, and by about 12%, 11%, and 5% in the text-to-image task.
Figure 2 shows the experimental results for different length combinations of the proposed hash codes (image hash code length × text hash code length). To show the trend across hash length combinations, the curve colors for combinations from 16 × 16 to 128 × 128 change gradually from dark blue through light blue and light red to dark red. Generally speaking, the cross-modal retrieval performance improves as the image hash code length grows, especially in subgraphs (d) and (f) of Figure 2. In addition, Figure 2 shows that variable-length hash coding has a more significant impact on the WIKI data set.
The three-dimensional MAP histograms in Figure 3 show that a single fixed hash code length cannot be set for all data sets. Specifically, the optimal hash code length combination for the img2txt task is 48 × 64 (text × image) on the NUS-WIDE data set but 32 × 64 (text × image) on the MIRFlickr data set. The reason is that the text information of NUS-WIDE is richer, so more hash bits are needed to represent the text features. From another point of view, for some retrieval tasks, a shorter hash code can achieve a comparable retrieval effect. Thus, we can conclude that variable-length hash codes can balance data redundancy and retrieval accuracy.
Figure 3: Three-dimensional MAP histograms for different hash code length combinations, panels (a) to (f).
5. Conclusion
In this paper, a variable-length hash coding-based cross-modal retrieval algorithm is proposed, which projects multimodal data into hash coding spaces of the optimal length for each modality. The similarity matrix of the multimodal data is constructed from the label matrix of each modality, and the semantic similarity of the original data is preserved after the multimodal hash codes are projected into the potential abstract semantic space. The binary constraints of the hash codes are maintained throughout the optimization of the model, so that the learned multimodal hash codes can better represent the original multimodal data. Experiments on the WIKI, NUS-WIDE, and MIRFlickr data sets show that the proposed method generally outperforms the benchmark algorithms, so the method is feasible and effective. Compared with deep learning-based hashing methods, however, its retrieval performance is relatively low. Thus, in future work, we will embed the proposed similarity matrix into a deep learning-based method to further improve retrieval accuracy and to measure the relationships among multiple data sources effectively.
Data Availability
The datasets used and/or analyzed during the current study are available from the author on reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant 61871062), Science and Technology Research Program of Chongqing Municipal Education Commission (Grants KJQN201800615 and KJQN201900609), Natural Science Foundation of Chongqing, China (Grants cstc2020jcyjzdxmX0024 and cstc2021jcyjmsxm2025), Key Project of Science and Technology of Chongqing Municipal Education Commission (Grant KJZDK201901203), and University Innovation Research Group of Chongqing (Grants CXQT20017 and CXQT20024).