Abstract

Fast cross-modal retrieval based on hash coding has become a hot topic as rich multimodal data (text, image, audio, etc.) proliferate, especially under the security and privacy challenges of the Internet of Things and mobile edge computing. However, most hash coding-based methods map all modalities into a common hash coding space and relax the binary constraints of the hash codes. As a result, the learned multimodal hash codes may not represent the original multimodal data sufficiently and effectively, and the resulting codes may be less discriminative across categories. To solve these problems, this paper proposes a method that maps the data of each modality into a hash coding space of its own optimal length and then solves the hash codes of each modality with a discrete cross-modal hashing algorithm that keeps the binary constraints. Finally, the similarity of multimodal data is compared in a latent space. The experimental results of the proposed variable-length hash coding-based cross-modal retrieval on the WIKI, NUS-WIDE, and MIRFlickr data sets are better than those of the compared methods, which demonstrates that the proposed method is feasible and effective.

1. Introduction

With the advent of the big data era, different types of modal data, e.g., text, image, and audio in the Internet of Things and mobile edge computing, are increasing dramatically [1]. Traditional single-modal retrieval, e.g., retrieving text with text, images with images, and audio with audio, is gradually shifting to cross-modal retrieval, e.g., retrieving images with text, audio with text, or text with images, which makes the returned results more diverse in information and richer in content [2]. Over the last few years, cross-modal retrieval algorithms have received significant attention and made notable progress, driven by application research on guaranteed data privacy and privacy-preserving cooperative object classification [3, 4].

These research methods fall into two main categories. The first is the latent subspace learning-based method [5-8], among which canonical correlation analysis (CCA) is the most commonly used model [5]. CCA maps two-modal data into a latent subspace that maximizes the correlation of associated data pairs, and similarity queries are then carried out directly in that subspace. Following this idea of maximizing the correlation of relevant data in a subspace, other variants of the CCA model have been proposed. Fu et al. proposed generalized multiview analysis (GMA), which maximizes the subspace correlation of multimodal data and achieves class discrimination by adding label information, further boosting the accuracy of cross-modal retrieval [6]. Costa Pereira et al. first projected the original features of each modality into their respective semantic feature spaces and then mapped the multimodal semantic features into a unified subspace via CCA or kernel CCA. The model uses the label information of the data to improve class discrimination while avoiding the direct mapping of the original multimodal features into the unified subspace, so the cross-modal retrieval performance is notably improved [7]. Mandal and Biswas proposed a generalized dictionary-pair algorithm and achieved good results by learning a unified sparse coding subspace [8]. Although unified subspace learning-based cross-modal retrieval algorithms have made progress, problems remain in large-scale multimodal retrieval scenarios, e.g., high computational cost, high data storage consumption, and poor stability. Therefore, another kind of cross-modal retrieval algorithm, based on hash coding, has stimulated a lot of interest in the research community.

With its low storage consumption and high retrieval speed, hash coding technology is well suited to large-scale cross-modal and cross-media tasks, e.g., real-time multimodal personalized recommendation [9], hot topic detection, and cross-media retrieval. In hash coding-based cross-modal retrieval methods [10-13], to maintain the connection between multimodal data, the multimodal data are projected into a low-dimensional Hamming space through linear mappings, and an XOR operation is then performed to measure the similarity distance, which effectively solves the speed problem of large-scale retrieval. However, most prior works are only suitable for single-label and paired training data. Therefore, Mandal et al. first proposed a hashing cross-modal retrieval model for multiple training scenarios [14]. However, this model, like the methods presented in Refs. [15, 16], maps multimodal data into equal-length hash codes, so the data of the various modalities may not be well represented. In addition, solving binary hash codes is an NP-hard problem, and these methods relax the binary constraint, so the learned hash codes are not accurate enough. To address these issues, this paper proposes a cross-modal retrieval model based on variable-length hash coding and keeps the binary constraints during the solution of the hash codes. Therefore, the learned variable-length hash codes can better represent the original multimodal data and achieve higher accuracy. The main highlights of this paper are as follows.
(1) To combat the issues caused by equal code lengths, we propose a variable-length hash coding-based cross-modal retrieval model, i.e., the data of each modality are projected into a hash coding space of its optimal length. Compared with a fixed-length hash coding space, the original multimodal data can be represented more faithfully, and the model is more flexible to tune in experiments.
(2) We propose a more generalized multiscenario cross-modal retrieval model. The great majority of existing cross-modal retrieval models are designed for single-label, pairwise multimodal data sets and cannot be applied to multilabel, unpaired data sets. The cross-modal retrieval model in this paper adapts well to single-label or multilabel, paired or unpaired multimodal data set scenarios.
(3) Based on the single-modal discrete hashing method, we propose a variable-length discrete hash coding-based cross-modal retrieval algorithm, and the validity of the algorithm is verified on several public data sets.

2. Related Work

This section mainly introduces several related hash coding cross-modal retrieval algorithms, which also serve as the benchmark algorithms in the experiments. Readers interested in other cross-modal retrieval models, such as those incorporating feedback techniques or deep learning, can refer to Ref. [17].

2.1. Hashing Cross-Modal Retrieval Based on Semantic Correlation Maximization

Taherkhani et al. proposed a semantic correlation maximization (SCM)-based cross-modal hash retrieval model. Compared with other supervised cross-modal hashing models, it has lower training time complexity, better adaptability, and greater stability on large-scale data sets [10]. The main highlights are as follows. (1) The expensive computation of the pairwise similarity matrix is avoided by using the label information of the training set to compute similarities directly, so only a small linear time complexity is required, which also makes the model more stable. (2) A sequential solution method for the hash codes is proposed, in which the code is computed bit by bit in closed form, so there is no need to set hyperparameters or stopping conditions. To use the label semantic information, the cosine similarity between label vectors is used to construct the similarity matrix: the similarity between two data objects is defined as the inner product of their label vectors divided by the product of the ℓ2 norms of the label vectors. To achieve cross-modal similarity queries, the hash functions should maintain the semantic similarity of the multimodal data; more specifically, the hash codes of each modality should be able to reconstruct the semantic similarity matrix. The specific objective function of the SCM model, (2), involves the data matrices of the two modalities, the corresponding linear transformation matrices, an equilibrium parameter, and the similarity measurement between data of different modalities. Because a sign function appears in (2), the optimization is an NP-hard problem, so SCM relaxes the sign-function constraint and adds constraints between the bits of the hash codes. Finally, the transformation matrices of the two modalities can be calculated, and the hash codes of new data can then be obtained.
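For illustration, the following is a minimal sketch of constructing such a label-based cosine similarity matrix; the names Lx and Ly and the row-per-sample layout are assumptions made for this sketch, not notation taken from the paper.

import numpy as np

def label_cosine_similarity(Lx, Ly, eps=1e-12):
    # Lx, Ly: label matrices of the two modalities, one label vector per row.
    # S[i, j] = <l_i, l_j> / (||l_i||_2 * ||l_j||_2), the cosine similarity of label vectors.
    Lx = Lx / (np.linalg.norm(Lx, axis=1, keepdims=True) + eps)
    Ly = Ly / (np.linalg.norm(Ly, axis=1, keepdims=True) + eps)
    return Lx @ Ly.T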

2.2. Hashing Cross-Modal Retrieval Based on Semantic Preserving

Chen et al. proposed a semantics-preserving hashing cross-modal retrieval (SEPH) model, which converts the similarity information of the data into a probability distribution and then approximates the hash codes by minimizing the Kullback-Leibler (KL) divergence [11]. The whole objective function is well supported by mathematical theory. As in the SCM model, a similarity matrix is first constructed to provide supervisory information for the learned hash codes. The model mainly includes two steps, i.e., solving the hash codes and learning kernel logistic regression functions. When solving the hash codes, the similarity matrix is first transformed into a probability distribution, the corresponding semantic probability distribution induced by the unified hash codes is computed from the Hamming distances between codes, and then the KL divergence between the two distributions is minimized to obtain the semantics-preserving hash codes; learning the best hash codes aims to make the two distributions as similar as possible, and the KL divergence between them serves as the training objective.
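As a rough sketch of this step, the snippet below builds a semantic distribution from the similarity matrix, a code-induced distribution from Hamming distances, and evaluates the KL divergence between them; the normalization and the 1/(1 + distance) kernel are assumptions made for illustration rather than the paper's exact formulas.

import numpy as np

def kl_objective(S, B):
    # S: pairwise semantic similarity matrix (n x n).
    # B: unified codes in {-1, +1}, one row per sample.
    P = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(P, 0.0)
    P = P / P.sum()                         # semantic probability distribution (assumed normalization)
    D = (B.shape[1] - B @ B.T) / 2.0        # Hamming distances between all pairs of +/-1 codes
    Q = 1.0 / (1.0 + D)                     # code-induced affinities (assumed decreasing kernel)
    np.fill_diagonal(Q, 0.0)
    Q = Q / Q.sum()
    eps = 1e-12
    mask = P > 0
    # KL divergence between the semantic and code-induced distributions.
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))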

In summary, a unified semantics-preserving hash code can be computed according to the above solution steps, and then the logistic regression functions mapping each modality's data to the unified hash code are learned. For each modality, one logistic regression function is learned per bit of the common binary code, from which the corresponding transformation matrix can be solved. Then, for a new sample of a given modality, the probability that the value at each bit of its binary code is −1 or +1 can be calculated from the learned function.

Therefore, the value at each bit of the binary code of the new sample is set to the value (−1 or +1) with the higher predicted probability.

Finally, a logistic regression function is learned for every bit of every modality, and a new sample is thereby mapped into a binary code of the required length. The final hash code is obtained by changing the elements with value −1 into 0.
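A minimal sketch of this per-bit scheme, using plain (non-kernel) logistic regression from scikit-learn for brevity; all function and variable names are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_bitwise_hash_functions(X, B):
    # X: features of one modality (one row per sample); B: unified codes in {-1, +1}.
    # One logistic regression classifier is trained for each code bit.
    return [LogisticRegression(max_iter=1000).fit(X, B[:, k]) for k in range(B.shape[1])]

def hash_new_sample(models, x):
    # Each bit takes the value (-1 or +1) with the higher predicted probability.
    code = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    return np.where(code == -1, 0, code)   # map -1 to 0 for the final hash code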

2.3. Hashing Cross-Modal Retrieval Based on Generalized Semantic Preserving

Because most existing cross-modal retrieval methods require multimodal data to appear in pairs, i.e., for each text or image, the corresponding data of the other modality must exist in the training set, Mandal et al. proposed a generalized semantic preserving hashing model (GSPH) for cross-modal retrieval, which is suitable for single-label or multilabel, paired or unpaired multimodal data scenarios [14]. The GSPH model first learns the optimal hash codes of each modality such that the hash codes preserve the semantic similarity between the multimodal data, and then learns the hash functions mapping the multimodal data to the hash coding space. The main highlights are as follows. (1) A hashing model that can deal with single-label paired data and single-label unpaired data is proposed for the first time. (2) A generalized hashing cross-modal retrieval model is proposed, which can be applied to single-label paired, single-label unpaired, multilabel paired, and multilabel unpaired data scenarios, while the semantic similarity of the data is maintained by the common hash codes. As with the SCM and SEPH methods, the GSPH algorithm also needs to define the similarity matrix between the multimodal data, whose dimensions are the sample numbers of the two modalities; the objective function of the GSPH model is defined over this similarity matrix and the hash codes of the two modalities.

The binary codes of the two modalities can be calculated by the GSPH method, and then the mapping functions from the original data of each modality into the hash codes need to be learned. As in the SEPH method, logistic regression functions are selected as the mapping functions; readers can refer to Section 2.2 for learning the mapping hash functions and generating the hash codes of new samples.

3. Cross-Modal Retrieval Based on Variable-Length Hash Coding

In this section, the cross-modal retrieval algorithm based on variable-length hash coding is presented, and the optimization process of the objective function and the time complexity of the algorithm are analyzed. To simplify the analysis and reduce the experimental workload, this paper mainly studies the case of two-modal data; the extension of the model to three or more modalities is given in Section 3.5.

3.1. Algorithmic Model

The variables used in this paper are defined as follows. The original feature data sets of the two modalities and their corresponding variable-length hash codes are arranged so that each column represents a sample and each row represents an attribute feature. In addition, a projection matrix is associated with each modality, and an association matrix links the two modalities. The similarity matrix between the multimodal data is constructed from the label vectors of the samples: each element of the similarity matrix represents the similarity between a sample of one modality and a sample of the other modality. The next goal of this paper is to learn a compact hash code of the optimal length for each modality, so that these hash codes can faithfully represent the original multimodal data and maintain the semantic similarity of the multimodal data sets. This paper calculates the similarity of different modal data in a latent space by referring to Ref. [7], assuming that there is a common latent abstract semantic space shared by the multimodal data, in which multimodal data can be queried and retrieved directly. The hash codes of each modality are projected into this latent abstract semantic space through linear transformations associated with the two modalities.

In this latent space, the similarity between data of different modalities can be calculated through the inner product of their projected representations.

By absorbing the two projection operations into a single association matrix, we do not need to solve explicitly for the representation of each modality in the latent abstract semantic space; it suffices to calculate the similarity between the variable-length hash codes of the two modalities through the association matrix. The cross-modal retrieval objective function of the proposed variable-length hash coding model is given in (12).
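Under assumed notation (X and Y for the feature matrices of the two modalities, B_x and B_y for their variable-length binary codes, W_x and W_y for the projection matrices, R for the association matrix, S for the semantic similarity matrix, and \lambda for a balance parameter), one plausible form of (12), consistent with the description above and with the update steps in Section 3.2, is

\min_{W_x, W_y, R, B_x, B_y} \; \|X - W_x B_x\|_F^2 + \|Y - W_y B_y\|_F^2 + \lambda \,\|S - B_x^{\top} R\, B_y\|_F^2
\quad \text{s.t. } B_x \in \{-1,+1\}^{k_x \times n_x},\; B_y \in \{-1,+1\}^{k_y \times n_y}.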

The first two terms of (12) project the data of the two modalities into hash coding spaces of their optimal lengths, and the last term enforces that the variable-length hash codes, compared in the latent space, still maintain the semantic similarity relations of the original multimodal data. The corresponding projection matrices, hash codes, and association matrix can be solved jointly through optimization.

3.2. Model Solution Procedure

To reduce the difficulty of solving for the hash codes, prior works convert the binary constraints into continuous real-valued problems and then obtain approximate hash codes through a sign function [10-12]. However, the hash codes solved in this way have essential defects and cannot represent the original multimodal data effectively. In this subsection, the binary constraints on the hash codes are maintained throughout the solving process. The objective function is nonconvex in all variables jointly and difficult to solve directly. Therefore, this paper solves for one variable while fixing the remaining variables, and then solves for the other variables in the same way. All variables are solved iteratively until the objective function converges.
(a) Fix the other variables and solve for the projection matrices. The objective function simplifies to independent regression problems, and the analytical formulae of the projection matrices can be obtained from the corresponding regression solutions.
(b) Fix the other variables and solve for the association matrix. The simplified objective (15) is a bilinear regression model, and its analytical solution follows directly.
(c) Fix the other variables and solve for the hash codes of one modality. Because of the binary constraint, it is complicated to solve for the whole code matrix directly. Therefore, the variable is solved row by row: when solving one row vector of the code matrix, the remaining row vectors are fixed, and the rows are then updated iteratively. (17) can be further transformed into (18). Because of the binary constraint, the first term of (18) is a constant; after removing constant terms and irrelevant variables, (18) can be rewritten in the more concise trace form (19). After this transformation, the solution of (19) is related to that of the objective function in Ref. [16], so this paper follows its solution process: when solving one row vector of the code matrix, the submatrices obtained by deleting the corresponding row (or column) from the code matrix and the related auxiliary matrices are formed, and the row vector is then obtained from the closed-form result in Ref. [16]. The remaining row vectors are solved by a similar procedure.
(d) Fix the other variables and solve for the hash codes of the other modality.

The procedure for solving the hash codes of the second modality is similar to that of the first, so readers can refer to step (c) for the detailed solution.
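As a rough illustration of steps (a) and (b), the sketch below gives the closed-form updates of the projection and association matrices under the objective form assumed in Section 3.1; the names X, Y, Bx, By, and S are hypothetical, columns are samples as described in Section 3.1, and the discrete row-wise code updates of steps (c) and (d) are omitted.

import numpy as np

def update_projections(X, Y, Bx, By, eps=1e-6):
    # Step (a): least-squares regression of each modality onto its codes,
    # W_x = X Bx^T (Bx Bx^T)^(-1), and similarly for W_y.
    Wx = X @ Bx.T @ np.linalg.inv(Bx @ Bx.T + eps * np.eye(Bx.shape[0]))
    Wy = Y @ By.T @ np.linalg.inv(By @ By.T + eps * np.eye(By.shape[0]))
    return Wx, Wy

def update_association(Bx, By, S, eps=1e-6):
    # Step (b): bilinear regression, R = (Bx Bx^T)^(-1) Bx S By^T (By By^T)^(-1).
    Gx = np.linalg.inv(Bx @ Bx.T + eps * np.eye(Bx.shape[0]))
    Gy = np.linalg.inv(By @ By.T + eps * np.eye(By.shape[0]))
    return Gx @ Bx @ S @ By.T @ Gy

In a full implementation, these updates would alternate with the row-wise discrete code updates until objective (12) converges, as in Algorithm 1.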

3.3. Algorithm Description

To project the hash codes into an optimal space for comparison, measurement, and retrieval, an association transformation matrix is introduced into the variable-length hash coding cross-modal retrieval model on the basis of the GSPH model, so that the similarity between data can be compared in the latent space through this matrix. Subsection 3.2 provides the solution process of each variable in the model, and the overall training steps are shown in Algorithm 1.

Input: Training data sets and label matrices of the two modalities; initialized association matrix; initialized variable-length hash codes; initialized iteration control parameter
Output: The projection matrices, association matrix, and variable-length hash codes
Procedure:
(0) Apply the label matrices and (9) to construct the semantic similarity matrix;
(1) Initialize the iteration counter;
(2) while the iteration control condition holds do
(3)  According to (14), update the projection matrix of each modality;
(4)  According to (16), update the association matrix;
(5)  According to (18) and the detailed solving process in Ref. [14], update the variable-length hash codes one row at a time and finally as a whole;
(6)  If the objective function (12) tends to converge, stop the iteration; otherwise, skip to step (2);
(7) End while

According to the above training process, the projection matrix of each modality can be calculated separately, and the hash code of a new sample can then be obtained through a sign function. For a query sample of either modality, the corresponding hash code is generated by applying the sign function to the projection of that sample. To improve the accuracy of the generated hash codes, the pairing information of the two modalities can be used to generate the hash code jointly: if the final hash code is expected to lie in the hash coding space of one modality, the projection of the query sample in that modality is combined with the projection of its counterpart, weighted by a non-negative equilibrium parameter, and conversely for the other modality's space. The overall testing steps for the model are summarized in Algorithm 2.
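A minimal sketch of single-modality code generation, under the assumption (not stated explicitly in the text) that an out-of-sample projection is obtained by ridge-regressing the training codes onto the training features; Px, x, and lam are hypothetical names.

import numpy as np

def learn_out_of_sample_projection(X, Bx, lam=1e-3):
    # Regress training codes onto training features (ridge regression).
    # This is an assumed out-of-sample strategy, not necessarily the paper's exact mapping.
    d = X.shape[0]
    return Bx @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

def hash_query(Px, x):
    # Binarize the projected query with the sign function.
    return np.sign(Px @ x)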

3.4. Time Complexity

The time complexity of the proposed cross-modal retrieval algorithm is mainly determined by the computation of the related variables. In the training phase, the time of each iteration is consumed in updating the projection matrices, the association matrix, and the corresponding hash code matrices, which are calculated by (14), (16), and (17), respectively. The total training time complexity is therefore the number of iterations multiplied by the per-iteration cost of these updates, which depends on the original feature dimension, hash code length, and number of samples of each modality. Once the training process ends, the time and space complexity of generating the hash code of a new sample depends only on the feature dimension and hash code length of its modality.

Input: Testing data sets; trained projection matrices and association matrix.
Output: The top n cross-modal data matching the sample to be retrieved.
Procedure:
(1) if an independent sample of either modality is input then
(2)  compute the corresponding hash code by applying the sign function to its projection;
(3) end if
(4) if a paired sample is input then
(5)  if the hash code is to lie in the space of the Y-modality data:
   combine the projections of the pair through the equilibrium parameter and binarize in the Y-modality space;
(6)  else:
   combine the projections of the pair through the equilibrium parameter and binarize in the X-modality space;
(7)  end if
(8) end if
(9) Calculate the Hamming distance between the query hash code b' and the hash codes of all samples in the retrieval database;
(10) Sort the distances in ascending order, and return the first n samples.
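Steps (9) and (10) can be sketched as follows for a database of ±1 codes; all names are hypothetical.

import numpy as np

def hamming_rank(query_code, database_codes, top_n=10):
    # query_code: vector of +/-1 values; database_codes: one code per row.
    # For +/-1 codes, the Hamming distance is the number of disagreeing bits.
    dists = np.sum(database_codes != query_code, axis=1)
    order = np.argsort(dists, kind="stable")
    return order[:top_n], dists[order[:top_n]]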
3.5. Application Scenario

The cross-modal retrieval model can be easily extended to scenarios with three or more modalities; assuming there are several modalities, the variable-length hash coding cross-modal retrieval model for multimodal data is given by (21).
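Under assumed notation (X_m for the features of the m-th modality, B_m for its variable-length codes, W_m for its projection matrix, R_{mm'} and S_{mm'} for the association and similarity matrices of modalities m and m', and \lambda for a balance parameter), one plausible form of (21) is

\min_{\{W_m\},\{B_m\},\{R_{mm'}\}} \; \sum_{m=1}^{M} \|X_m - W_m B_m\|_F^2
 + \lambda \sum_{m=1}^{M}\sum_{m' \neq m} \|S_{mm'} - B_m^{\top} R_{mm'} B_{m'}\|_F^2
 \quad \text{s.t. } B_m \in \{-1,+1\}^{k_m \times n_m}.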

The first term in (21) maps the data of every modality into hash codes of the optimal lengths, and the second term preserves the semantic relationship between the hash codes of each pair of modalities. The model optimization and the generation of hash codes for query samples follow the same procedure as in the two-modality scenario.

4. Results and Discussion

4.1. Data Sets and Performance Metrics

To verify the validity of the model, the commonly used WIKI, NUS-WIDE, and MIRFlickr data sets are selected for cross-modal retrieval. Precision-recall curves and the mean average precision (MAP) index are used to measure model performance, as in Refs. [11-13].

The WIKI data set is collated from Wikipedia pages [7]; each image has a corresponding description text, and each text contains no less than 70 words. The data set is single-label data with 10 categories: each image or text belongs to one of these categories, and images or texts belonging to the same category are considered to have similar semantic information. There are 2866 samples (2173 in the training set and 693 in the test set); the image data are represented by 128-dimensional scale-invariant feature transform (SIFT) features and the text data by 10-dimensional latent Dirichlet allocation (LDA) features.

The NUS-WIDE data set was collected and organized from the Internet by the National University of Singapore [18]; it contains 269,648 images together with explanatory annotations contributed by about 5,000 people. Each sample is multilabel data, and the samples are divided into 81 categories. Because the sample numbers of some categories differ greatly, as in Refs. [10, 11], the 10 categories with the most samples are first selected, yielding 186,577 text-image pairs. A text and an image are considered similar if they share at least one category attribute. Subsequently, 1% of the data (about 1,866 samples) are randomly selected as the test set and 5,000 samples as the training set. The images of the NUS-WIDE data set are represented by 500-dimensional SIFT features and the text data by 1,000-dimensional word-frequency vectors.

The MIRFlickr data set originates from the Flickr website and contains 25,000 images with corresponding manually annotated text information [19]. As in Ref. [11], we delete samples without labels or whose annotated words appear fewer than 20 times, leaving 16,738 samples divided into 24 categories. Each image-text pair is multicategory data containing at least one category label. This paper selects 5% of the data as the test set and 5,000 samples as the training set. Images in the data set are represented by 150-dimensional edge histograms and texts by 500-dimensional vectors. The evaluation criteria are precision and recall: precision is the proportion of samples relevant to the query among the returned retrieval results, and recall is the proportion of returned relevant samples among all samples related to the query in the whole database.
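Written out with hypothetical symbols (k returned results, N_r relevant results among them, and N_g samples related to the query in the whole database), the criteria are

\text{precision} = \frac{N_r}{k}, \qquad \text{recall} = \frac{N_r}{N_g}.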

Average precision (AP) calculation: given a query sample and the first several returned results, the precision of the top-k returned results is computed at every position k, and the AP of the query is the average of these precision values over the positions where the returned result is related to the query (a position counts as 1 if the k-th result is relevant and 0 otherwise), normalized by the number of relevant returned results. Finally, the mean of the AP values over all query samples is taken as the MAP index used to evaluate the overall retrieval performance.
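For reference, a small sketch of the AP and MAP computation as described above; each relevance list holds 0/1 indicators of whether the ranked results are related to the query, and all names are hypothetical.

import numpy as np

def average_precision(relevance, top_r=None):
    # relevance: 1/0 indicators for the returned results, in ranked order.
    rel = np.asarray(relevance[:top_r], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # P(k) for each position k
    return float(np.sum(precisions * rel) / rel.sum())        # average over relevant positions

def mean_average_precision(all_relevance_lists, top_r=None):
    return float(np.mean([average_precision(r, top_r) for r in all_relevance_lists]))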

4.2. Benchmark Algorithm

In this subsection, the multimodal data are preprocessed according to the method presented in Ref. [16], i.e., the distances between sample points and randomly selected reference points are calculated. Then the discrete supervised hashing model is used to initialize the hash codes of each modality. To highlight the importance of the label matrix in the optimization process, the label matrices of all data are enlarged by a factor of 10. In addition, CCA, a typical correlation analysis method commonly used in cross-modal retrieval, and recent semantic correlation hashing-based cross-modal retrieval algorithms are selected for the comparative experiments. These hashing cross-modal retrieval models are SCM, SEPH, and GSPH, and the comparison methods are implemented in MATLAB with the parameter settings of the original papers. Both the SEPH and GSPH models include two ways to learn the hash functions: (1) training the hash functions SEPH_rnd and GSPH_rnd on randomly selected samples; (2) training the hash functions SEPH_knn and GSPH_knn on samples selected through clustering. Experiments show that the hash functions obtained by these two training methods perform the same; therefore, the first method, with randomly selected samples, is used to train the hash functions of both SEPH and GSPH in the comparative experiments. Moreover, the SCM model has two variants, SCM_seq and SCM_orth, and experimental results show that the former is generally superior to the latter, so the former is used in the comparative experiments [10].

4.3. Experimental Results

This subsection presents the experimental results of cross-modal retrieval on the WIKI, NUS-WIDE, and MIRFlickr data sets. The cross-modal retrieval tasks include retrieving text with an image query (img2txt) and retrieving images with a text query (txt2img), and these two tasks are analyzed in detail. Figure 1 shows the precision-recall curves on the three data sets. To facilitate the comparison with the benchmark algorithms, both image and text are projected into equal-length hash coding spaces (64 bits). As can be seen from Figure 1, the performance of the proposed method is generally superior to that of the comparison methods, although the front part of the curve (subgraph (a) of Figure 1) for the img2txt task on the WIKI data set is slightly lower than those of the SEPH and GSPH methods. However, subgraph (a) of Figure 2 shows that the effect of the optimal hash code length combination in this paper is slightly higher than those of the SEPH and GSPH methods. Figure 1 also shows that, for the other two groups of multilabel data, the improvement of the proposed method over the comparison methods is larger, because the proposed model is more suitable for multilabel data sets than the CCA, SCM, SEPH, and GSPH models.

Tables 1 and 2 present in detail the MAP indices of each method for the img2txt and txt2img tasks, respectively, with the highest MAP value in each column marked in bold. To compare CCA with the other methods, the data are projected into subspaces of different dimensions to observe the influence of the subspace dimension on CCA. Tables 1 and 2 show that the MAP values of the proposed method and the other hash coding methods increase slightly as the hash code length increases. As can be seen from the bold values in the tables, the MAP value of the proposed method is superior to those of the comparison methods in both the img2txt and txt2img tasks. With a hash code length of 64 bits, the proposed method improves on the GSPH method by about 15%, 10%, and 13% in the img2txt task on the WIKI, NUS-WIDE, and MIRFlickr data sets, respectively, and by about 12%, 11%, and 5% in the txt2img task.

Figure 2 shows the experimental results of different hash code length combinations for the proposed method (image hash code length × text hash code length). To show the variation tendency of the different length combinations, the curve colors of the combinations from 16 × 16 to 128 × 128 gradually change from dark blue through light blue and light red to dark red, as shown in Figure 2. Generally speaking, as the image hash code length grows, the cross-modal retrieval effect also becomes better, especially in subgraphs (d) and (f) of Figure 2. In addition, Figure 2 also shows that the variable-length hash coding cross-modal retrieval model has a more significant impact on the WIKI data set.

From the MAP three-dimensional histograms in Figure 3, it can be seen that the same fixed hash code length should not be set for all data sets. To be specific, the optimal hash code length combination for the img2txt task on the NUS-WIDE data set is 48 bits for text and 64 bits for image, but the optimal combination for the img2txt task on the MIRFlickr data set is 32 bits for text and 64 bits for image. The reason is that the text information of NUS-WIDE is richer, so more hash bits are needed to represent the text features. From another point of view, for some retrieval tasks, a shorter hash code length can achieve a comparable retrieval effect. Thus, we can conclude that using variable-length hash codes can balance data redundancy and retrieval accuracy.

5. Conclusion

In this paper, a variable-length hash coding-based cross-modal retrieval algorithm is first proposed, which projects the data of each modality into a hash coding space of its optimal length. The similarity matrix of the multimodal data is constructed from the label matrices of the modalities, and the semantic similarity relationships of the original data are still preserved after the multimodal hash codes are projected into the latent abstract semantic space. The binary constraints on the hash codes are maintained throughout the optimization of the model, so that the learned multimodal hash codes can better represent the original multimodal data. Extensive experiments on the WIKI, NUS-WIDE, and MIRFlickr data sets show that the performance of the proposed method is generally superior to that of the related benchmark algorithms; therefore, the method in this paper is feasible and effective. Compared with deep learning-based hashing methods, however, the retrieval performance is still relatively low. Thus, in future work, we will embed the proposed similarity matrix into a deep learning-based method to further improve the retrieval accuracy and effectively measure the relationships among multiple source data.

Data Availability

The datasets used and/or analyzed during the current study are available from the author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant 61871062), Science and Technology Research Program of Chongqing Municipal Education Commission (Grants KJQN201800615 and KJQN201900609), Natural Science Foundation of Chongqing, China (Grants cstc2020jcyj-zdxmX0024 and cstc2021jcyj-msxm2025), Key Project of Science and Technology of Chongqing Municipal Education Commission (Grant KJZDK201901203), and University Innovation Research Group of Chongqing (Grants CXQT20017 and CXQT20024).