Abstract

As more and more image data are stored in encrypted form in cloud computing environments, efficiently retrieving images in the encrypted domain has become an urgent problem. Recently, Convolutional Neural Network (CNN) features have achieved promising performance in image retrieval, but their high dimensionality leads to low retrieval efficiency, and they are not suitable for direct use in retrieval in the encrypted domain. To address these issues, this paper proposes an improved CNN-based hashing method for encrypted image retrieval. First, the input image size is increased before the image is fed into the CNN, which improves the representation ability. Then, a lightweight module replaces part of the modules in the CNN to reduce the number of parameters and the computational cost. Finally, a hash layer is added to generate a compact binary hash code. In the retrieval process, the hash code is used for encrypted image retrieval, which greatly improves the retrieval efficiency. The experimental results show that the scheme allows effective and efficient retrieval of encrypted images.

1. Introduction

With the development of cloud computing, more and more companies and individuals store image data on cloud servers. Therefore, how to efficiently retrieve images in the cloud has become an urgent problem. Cloud computing [1] is an emerging computing paradigm that offers efficient image storage, which makes it an attractive choice for image retrieval. Despite these benefits, the privacy of image information is the main concern for image retrieval in cloud computing.

To protect the image information, it is necessary to encrypt an image before it is submitted to the cloud. Widely used encryption methods include chaotic image encryption [2] and the Arnold transform [3]. However, image retrieval techniques designed for the plaintext domain are not suitable for direct use in the encrypted domain. Therefore, how to protect image information in cloud computing while quickly retrieving the images that users need is an urgent problem in the field of encrypted image retrieval.

In the field of image retrieval, most previous approaches exploit frequency-domain features [4, 5] or SIFT [6]. However, these approaches rely on hand-crafted features, which cannot represent the image content comprehensively and therefore result in low retrieval accuracy.

With the development of deep learning, CNNs [7–11] have shown significant performance improvements on various tasks. However, many CNNs have hundreds of layers, which makes them computationally inefficient. State-of-the-art lightweight architectures, such as MobileNet [12] and ShuffleNet [13], achieve higher efficiency through their network design. These networks can run in a timely fashion on computationally limited platforms.

Even though the CNN-based representation is an appealing solution for image retrieval in the plaintext domain, it is inefficient to directly compute the similarity between two CNN features, such as the 4096-dimensional vectors of the fully connected layer in AlexNet. Recently, some approaches have used deep architectures to learn hash codes for image retrieval [14, 15]. However, most of them target the plaintext domain, and research on the encrypted domain is lacking.

To address the above issues, this paper proposes an improved CNN-based hashing method for encrypted image retrieval (DLHEIR). In our method, we increase the size of the CNN input image to obtain better features and replace part of the DenseNet structure with inverted residual blocks to reduce the computational cost and the number of parameters. The improved CNN is used to generate hash codes for encrypted image retrieval.

Our main contributions are as follows:

(1) This paper proposes an improved CNN-based hashing method for encrypted image retrieval (DLHEIR). This network can learn image representations to generate binary hash codes for rapid image retrieval.

(2) We use larger images as input to the CNN to obtain better features. Moreover, the inverted residual block is introduced into our method, which reduces the computational cost and the number of parameters.

The remainder of this paper is organized as follows. Section 2 discusses the related works. Section 3 introduces the proposed method. Section 4 presents our experimental results, and Section 5 concludes the paper.

2. Related Works

Content-based image retrieval (CBIR) refers to retrieving the needed information from large-scale multimedia data according to the content of the image. Recently, image retrieval has been applied in many fields, such as image search [16, 17] and image steganography [18]. However, it cannot be applied directly in cloud computing because of image privacy concerns.

The searchable encryption (SE) method enables users to store encrypted data in the cloud and supports data search in the encrypted domain. Xia et al. [19] proposed an encrypted image retrieval scheme (PSSE) for the cloud environment, which uses MPEG-7 visual descriptors as image features. The KNN is used to protect the features, and locality-sensitive hashing is used to improve retrieval efficiency. Qin et al. [20] proposed an encrypted image retrieval approach for the cloud computing environment, which employs an improved Harris algorithm and locality-sensitive hashing (LSH) to retrieve encrypted images. Shen et al. [21] proposed a secure content-based image retrieval method, which uses a secure multiparty computation technique to encrypt image features. Cheng et al. [4] proposed an encrypted JPEG image retrieval scheme based on the Markov process, which encrypts the DCT coefficients to protect the confidentiality of the JPEG image content. Xia et al. [22] proposed an outsourced CBIR scheme based on the BOEW model. Ferreira et al. [23] proposed a secure framework for outsourced privacy-protected storage and retrieval in a large shared image repository. Lu et al. [24] proposed a privacy-preserving image retrieval method over an encrypted image collection, which uses a set of visual words to represent images and the Jaccard distance to measure the similarity between images. Xia et al. [25] proposed a privacy-preserving image retrieval method based on Scale Invariant Feature Transform (SIFT) features and the Earth Mover’s Distance (EMD). Weng et al. [26] proposed a privacy-preserving framework for an application called outsourced media search. The framework relies on multimedia hashing and symmetric encryption to protect image information. However, these approaches are based on hand-crafted features, which do not consider the global information of the image, resulting in low accuracy for encrypted image retrieval.

CNNs have recently provided an attractive solution for many vision tasks. Their success is attributed to the ability of CNNs to learn rich image representations, which can be applied to the field of image retrieval [27, 28]. Because computing the similarity between two CNN features is computationally expensive, some approaches use CNNs to automatically learn binary hash codes [29–31]. However, these approaches are applicable only in the plaintext domain, and few approaches focus on CNN-based encrypted image retrieval.

In this paper, CNNs are applied to the field of encrypted image retrieval. The powerful representation ability of CNN features improves the accuracy of encrypted image retrieval. At the same time, the retrieval efficiency is greatly improved by using the hash code.

3. Proposed Method

3.1. System Model

The system model is shown in Figure 1. It mainly consists of three parties: the data owner, the cloud server, and the query user.

The data owner holds the image dataset. To protect the image content, the dataset is encrypted, which produces an encrypted dataset containing the same number of images. To achieve rapid image retrieval, the data owner also generates the hash codes corresponding to the image dataset. Both the encrypted images and the hash codes are outsourced to the cloud server. In addition, the data owner sends the key to the query user upon receiving a retrieval request.

The cloud server stores the encrypted dataset and the hash codes received from the data owner. When it receives a retrieval request from the query user, the cloud server calculates the similarity between the stored hash codes and the trapdoor of the query image and returns the top retrieval results to the query user.

The query user generates the trapdoors for the query images and uploads them to the cloud server. We define the trapdoor as the hash code of a query image, generated with the same method that the data owner uses. After receiving the resulting images, the query user sends a request to the data owner, obtains the key, and decrypts the encrypted images with the key.

3.2. Overview of the Proposed Method

The proposed method mainly includes six functions, which are executed by the data owner, cloud server, and query user.

The following functions are executed by the data owner:

(1) Key Generation. The function takes a parameter as input and returns the key. After user authorization, the data owner sends the key to the user for decrypting the encrypted images.

(2) Image Encryption. The inputs of the function are the key and the image dataset, and it returns the encrypted image dataset.

(3) Hash Code Generation. Using our method, the function takes the image dataset as input and returns the corresponding hash codes.

The following functions are executed by the query user:

(1) Trapdoor Generation. The input of this function is the query image. It constructs the trapdoor by generating the hash code of the query image.

(2) Image Decryption. The inputs of this function are the key and a similar encrypted image returned by the cloud server. It decrypts the encrypted image and returns the similar image.

The following function is executed by the cloud server:

(1) Search. The function calculates the similarity between the trapdoor of the query image and the hash codes of the encrypted image dataset, and it returns the set of similar encrypted images.
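The interaction of the six functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the XOR keystream cipher, and the toy `gen_hash` are hypothetical stand-ins for the encryption method and the CNN hashing network, and the toy cipher is not meant to be secure.

```python
import hashlib
import secrets


def key_gen(n_bytes=16):
    """Data owner: return a random secret key (hypothetical stand-in)."""
    return secrets.token_bytes(n_bytes)


def _keystream(key, length):
    # Toy keystream built from SHA-256 in counter mode; an assumption,
    # not the encryption method used in the paper.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key, image_bytes):
    """Data owner: XOR the image bytes with the keystream."""
    stream = _keystream(key, len(image_bytes))
    return bytes(a ^ b for a, b in zip(image_bytes, stream))


def decrypt(key, cipher_bytes):
    """Query user: XOR is its own inverse, so decryption mirrors encryption."""
    return encrypt(key, cipher_bytes)


def gen_hash(image_bytes, n_bits=8):
    """Toy replacement for the CNN hash network: threshold chunk means at 128."""
    chunk = max(1, len(image_bytes) // n_bits)
    bits = []
    for i in range(n_bits):
        part = image_bytes[i * chunk:(i + 1) * chunk] or b"\x00"
        bits.append(1 if sum(part) / len(part) >= 128 else 0)
    return bits


def search(trapdoor, index, top_k):
    """Cloud server: rank stored hash codes by distance to the trapdoor."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    ranked = sorted(index, key=lambda item: distance(trapdoor, item[0]))
    return [cipher for _, cipher in ranked[:top_k]]


# Data owner: generate the key, then outsource (hash code, encrypted image) pairs.
key = key_gen()
images = [bytes([200] * 16), bytes([10] * 16), bytes([200] * 8 + [10] * 8)]
index = [(gen_hash(img), encrypt(key, img)) for img in images]

# Query user: the trapdoor is the hash code of the query image.
trapdoor = gen_hash(bytes([210] * 16))

# Cloud server: search without ever seeing the key or the plaintext images.
results = search(trapdoor, index, top_k=1)

# Query user: decrypt the returned images with the key from the data owner.
recovered = [decrypt(key, c) for c in results]
```

The point of the sketch is the separation of roles: the cloud server only ever handles hash codes and ciphertexts, while the key travels directly from the data owner to the authorized query user.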

3.3. Improved Convolutional Neural Network Hashing

In this section, we will introduce our method, which consists of two main components, image preprocessing and network architecture.

3.3.1. Image Preprocessing

Before training or testing the network, the input images should be resized to the same size. For example, when training and testing DenseNet, all images should be resized to 224 × 224 before being fed into the network.

A large image is typically resized to 224 × 224 or 299 × 299 by cropping or warping. Cropping may lose important information in the image, while warping may change the aspect ratio of the image; both affect the features extracted by the CNN.

Consequently, in this paper, we increase the input image size of the CNN. Specifically, we calculate the maximum image height and the maximum image width over the dataset, and the larger of the two values is taken as the side length of the input image. For example, the maximum image height and width in the Corel10K dataset are 384 and 256, so the input images are resized to 384 × 384.
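The size selection rule described above can be written as a small helper. The function name and the sample list of (height, width) pairs are illustrative assumptions; only the 384 and 256 maxima come from the text.

```python
def square_input_size(image_sizes):
    """Return (side, side), where side is the larger of the dataset's
    maximum height and maximum width."""
    max_height = max(h for h, _ in image_sizes)
    max_width = max(w for _, w in image_sizes)
    side = max(max_height, max_width)
    return side, side


# Hypothetical (height, width) pairs whose maxima are 384 and 256.
sizes = [(384, 256), (256, 256), (192, 128)]
input_size = square_input_size(sizes)
print(input_size)  # (384, 384)
```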

3.3.2. Network Architecture

Inverted Residual Block. The network architecture of our method is shown in Figure 2. Specifically, the image is resized to 384 × 384 as the input of the DenseNet201. Then, the inverted residual block is introduced to replace a part of the architecture in the DenseNet, which can greatly reduce computational cost and parameters.

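The inverted residual block is built from depthwise separable convolutions. The computational cost of the depthwise separable convolution is shown in the following equation:

$$\mathrm{Cost}_{\mathrm{DSC}} = D_K^2 \, C_i \, H_o W_o + C_i \, C_o \, H_o W_o. \tag{1}$$

The number of parameters of the depthwise separable convolution is computed in the following equation:

$$\mathrm{Param}_{\mathrm{DSC}} = D_K^2 \, C_i + C_i \, C_o. \tag{2}$$

For standard convolutions, the computational cost and the number of parameters are computed by the following equation:

$$\mathrm{Cost}_{\mathrm{Std}} = D_K^2 \, C_i \, C_o \, H_o W_o, \qquad \mathrm{Param}_{\mathrm{Std}} = D_K^2 \, C_i \, C_o. \tag{3}$$

Suppose the input feature map of the depthwise separable convolution has the size $H_i \times W_i \times C_i$ and the output feature map has the size $H_o \times W_o \times C_o$, where $C_i$ and $C_o$ are the channels of the feature maps, $W_i$ and $W_o$ are the widths of the feature maps, $H_i$ and $H_o$ are the heights of the feature maps, respectively, and $D_K$ denotes the kernel size. The computational cost ratio of the depthwise separable convolution to the standard convolution is shown in the following equation:

$$\frac{\mathrm{Cost}_{\mathrm{DSC}}}{\mathrm{Cost}_{\mathrm{Std}}} = \frac{D_K^2 \, C_i \, H_o W_o + C_i \, C_o \, H_o W_o}{D_K^2 \, C_i \, C_o \, H_o W_o} = \frac{1}{C_o} + \frac{1}{D_K^2}. \tag{4}$$

The parameter ratio of the depthwise separable convolution to the standard convolution is shown in the following equation:

$$\frac{\mathrm{Param}_{\mathrm{DSC}}}{\mathrm{Param}_{\mathrm{Std}}} = \frac{D_K^2 \, C_i + C_i \, C_o}{D_K^2 \, C_i \, C_o} = \frac{1}{C_o} + \frac{1}{D_K^2}. \tag{5}$$

Equations (4) and (5) show that the depthwise separable convolution has a lower computational cost and fewer parameters than the standard convolution.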
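A quick numeric check makes the saving concrete. The sketch below uses the standard multiply-accumulate and weight counts for the two convolution types (biases ignored); the layer shape, a 3 × 3 kernel with 128 input channels, 32 output channels, and a 12 × 12 output map, is an assumed example rather than a figure reported in the paper.

```python
def standard_conv(kernel, c_in, c_out, h_out, w_out):
    """Return (cost, params) of a standard convolution, biases ignored."""
    params = kernel * kernel * c_in * c_out
    cost = params * h_out * w_out
    return cost, params


def depthwise_separable_conv(kernel, c_in, c_out, h_out, w_out):
    """Return (cost, params) of a depthwise conv followed by a 1x1 pointwise conv."""
    params = kernel * kernel * c_in + c_in * c_out
    cost = params * h_out * w_out
    return cost, params


# Assumed layer shape, for illustration only.
std_cost, std_params = standard_conv(3, 128, 32, 12, 12)
dsc_cost, dsc_params = depthwise_separable_conv(3, 128, 32, 12, 12)

cost_ratio = dsc_cost / std_cost
param_ratio = dsc_params / std_params
print(round(cost_ratio, 3), round(param_ratio, 3))  # both about 0.142
```

Both ratios equal 1/32 + 1/9 for this shape, so the depthwise separable version needs roughly one seventh of the operations and weights of the standard convolution.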

DenseNet201 consists of four dense blocks, which consist of 6, 12, 48, and 32 BN-ReLU-Conv (1 × 1)-BN-ReLU-Conv (3 × 3) structures, respectively, where BN indicates batch normalization, ReLU indicates the rectified linear unit, and Conv (1 × 1) indicates a Conv2D layer with filters of kernel size 1-by-1. To reduce the computational cost and the number of parameters of the network, the last 14 BN-ReLU-Conv (1 × 1)-BN-ReLU-Conv (3 × 3) structures were replaced by inverted residual blocks. Then, a hash layer is added, which consists of a convolutional layer, batch normalization, sigmoid activation, and a pooling layer. Finally, a SoftMax layer is added to form our network.

Hash Layer. In this section, we describe the hash layer. It consists of four components: a convolutional layer, a batch normalization layer, an activation function, and a global average pooling layer. The convolutional layer is a Conv2D layer with filters of kernel size 1-by-1. For the activation function, we choose the sigmoid so that the outputs lie in the range (0, 1).

Suppose the input feature map has a size of $h \times w \times c$, where $h$, $w$, and $c$ are the height, width, and channel of the feature map, respectively. The output feature map of the hash layer has a size of $1 \times 1 \times L$, where $L$ is the length of the hash code, and the feature $f$ is obtained.

In feature extraction, all images are first resized to 384 × 384 before being fed into the network. The feature of the global average pooling layer is then extracted, and the binary code is obtained by using the hash function to binarize the feature with a threshold. The hash function is shown in the following equation:

$$b_j = \begin{cases} 1, & f_j \geq t, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $f_j$ is the $j$-th parameter in $f$ and $t$ is the threshold of the hash function.
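The thresholding step can be illustrated with a minimal sketch. The threshold value of 0.5, the inclusive comparison, and the sample feature values are assumptions chosen for illustration; 0.5 is simply the midpoint of the sigmoid output range.

```python
def binarize(features, threshold=0.5):
    """Map each sigmoid-activated feature value to a bit by thresholding."""
    return [1 if value >= threshold else 0 for value in features]


def pack_bits(bits):
    """Pack a list of bits into one integer for compact storage."""
    packed = 0
    for bit in bits:
        packed = (packed << 1) | bit
    return packed


# Hypothetical output of the global average pooling layer (4 bits for brevity).
feature = [0.91, 0.12, 0.50, 0.47]
code = binarize(feature)
print(code)             # [1, 0, 1, 0]
print(pack_bits(code))  # 10
```

Packing the bits into an integer is optional, but it shows why a 48-bit code is far cheaper to store and compare than a high-dimensional floating-point feature.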

4. Experimental Results and Analysis

The experiments were performed on the Corel10K dataset [32]. Corel10K is a benchmark dataset for image retrieval. It includes 100 categories, and each category contains 100 similar images.

The experiment code was written in Python and MATLAB R2016a under the Windows 10 system, using an Intel(R) Core(TM) i7-9700KF CPU @ 3.60 GHz, 16.00 GB of RAM, and an Nvidia GeForce RTX 2080 Ti GPU.

In the experiment, 80 images were randomly selected from each category of the Corel10K dataset as the training set, and the remaining images were used as the test set. DenseNet201 was selected as the backbone network. For fine-tuning, we use a model pretrained on the ImageNet dataset. Stochastic gradient descent (SGD) is used as the network optimizer, with the learning rate set to 0.01, the momentum set to 0.9, the batch size set to 64, and the number of epochs set to 200.
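The per-category split can be reproduced along the following lines. The function name, the fixed seed, and the use of integer category labels are assumptions made for the sketch; the paper does not state how the random selection was implemented.

```python
import random


def split_per_category(labels, n_train=80, seed=0):
    """Randomly pick n_train indices per category for training; the rest are test."""
    rng = random.Random(seed)
    by_category = {}
    for index, label in enumerate(labels):
        by_category.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_category.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:])
    return sorted(train), sorted(test)


# Corel10K layout: 100 categories with 100 images each.
labels = [category for category in range(100) for _ in range(100)]
train_idx, test_idx = split_per_category(labels)
print(len(train_idx), len(test_idx))  # 8000 2000
```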

4.1. Retrieval Precision

In our experiments, “precision” was used as the evaluation metric, which is defined as $P_k = k'/k$, where $k'$ is the number of truly similar images among the $k$ retrieved images. In the experiment, we use the images of the test set as query images and the training set as the image collection to be searched. We compare our method with the methods in [6, 17]. The experimental results are reported in Figure 3.
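The metric is straightforward to compute once the retrieved images are labeled. In the sketch below, two images are treated as similar when they share a category label, which is an assumption about how relevance is judged on Corel10K; the function name and sample labels are illustrative.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved images that share the query's category."""
    top_k = retrieved_labels[:k]
    relevant = sum(1 for label in top_k if label == query_label)
    return relevant / k


# Hypothetical ranked result list for a query from the "bus" category.
ranked = ["bus", "bus", "horse", "bus", "rose"]
p5 = precision_at_k(ranked, "bus", 5)
print(p5)  # 0.6
```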

As shown in Figure 3, our method outperforms the conventional methods [6, 17]. This is because these methods all rely on hand-crafted features, which limits their performance. In particular, the performance gap is not obvious as increases, except . Also, note that our method achieves better performance with 48 bits than with the other code lengths.

We also evaluate the role of image size for retrieval precision. The experimental results are shown in Table 1.

It is clear from Table 1 that increasing the image scale consistently improves retrieval precision across the different hash code lengths, because larger images preserve more of the image content for feature extraction. However, a scale larger than the one used in our method would increase the GPU memory consumption and the computational cost.

4.2. Comparison of Model Parameters and MFLOPs

In this section, we compare the parameters and MFLOPs of our method with those of the original CNN combined with the hash layer. The experimental results are reported in Table 2.

The number of floating-point operations (FLOPs), reported here in millions (MFLOPs), is widely used to measure the computational cost of CNN models, such as ShuffleNet [13]. As can be seen from Table 2, our method has fewer parameters and MFLOPs.

4.3. Efficiency

The time consumptions of the retrieval, feature extraction, index construction, and trapdoor generation are compared in this section.

Time Consumption of Retrieval. To exploit the computing power of the cloud server, retrieval is performed on the cloud server, and the most similar images are returned by calculating the Euclidean distance between hash codes. Table 3 presents the time consumption of retrieval.

As can be seen from Table 3, the retrieval time increases as the size of the retrieval collection increases. Our method is more efficient than the methods in [6, 17]. This is because our method uses low-dimensional binary hash codes, which makes the similarity computation fast.
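For binary codes, each squared coordinate difference is either 0 or 1, so the Euclidean distance is the square root of the Hamming distance and both produce the same ranking. The sketch below demonstrates this on a small made-up database; the codes and function names are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length codes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank(query, database, distance):
    """Return database indices ordered from most to least similar."""
    return sorted(range(len(database)), key=lambda i: distance(query, database[i]))


# Hypothetical 4-bit hash codes stored on the cloud server.
database = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1], [1, 0, 1, 0]]
query = [1, 0, 1, 0]

by_euclidean = rank(query, database, euclidean)
by_hamming = rank(query, database, hamming)
print(by_euclidean)  # [3, 0, 1, 2]
```

In practice this means the server can use cheap bit comparisons in place of floating-point arithmetic without changing which images are returned.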

Time Consumption of Feature Extraction. We also compared the time consumption of feature extraction with the CSD and SCD descriptors in the MPEG-7 feature extraction method of [17], and the time consumption of SIFT feature extraction in [6]. The experimental results are shown in Figure 4.

Figure 4 shows the feature extraction times for image collections of different sizes. Compared with [6, 17], the time consumption of our method is shorter in most cases. The time consumption of feature extraction in our method mainly consists of two parts: loading the model and hashing. Compared with the more complex conventional feature extraction methods, our method is more efficient.

Time Consumption of Index Construction. In our method, the similarity is computed directly between two hash codes without index construction, so our method incurs no index construction time. The index construction times of the methods of Xia et al. and Qin et al. are shown in Figure 5 for comparison.

Time Consumption of Trapdoor Generation. Trapdoor generation uses the same hash code generation procedure as the data owner, so the time consumption of trapdoor construction is the hash code generation time of the query image. The experimental results are shown in Figure 6.

Figure 6 compares the time consumption of trapdoor generation with that of [6, 17]. Our method takes more time than these methods. This is because, in feature extraction, we need to extract features from the deep layers of DenseNet, so the time consumption is higher than that of [6, 17].

4.4. Security Analysis

(i) The Privacy of Image Content. In our method, the images stored on the cloud server are encrypted with an encryption method. The key is generated by the data owner. Thus, the privacy of the image content in our scheme is well protected.

(ii) The Privacy of Hash Code. The hash code may reveal information about the image content. In our method, the hash code mapped from the feature vectors is protected by a one-way hash function. Thus, the hash code is well protected.

5. Conclusion

This paper proposes an improved CNN-based hashing method for encrypted image retrieval. In our method, we increase the size of the CNN input image to obtain better features, replace part of the DenseNet structure with inverted residual blocks to reduce the computational cost and the number of parameters, and add a hash layer for hash code generation. These hash codes are used for encrypted image retrieval. The experimental results show that the method achieves better retrieval precision and greatly improves the retrieval efficiency. In the future, we plan to design more efficient methods to reduce the burden on users.

Data Availability

The Corel10K data used to support the findings of this study have been deposited at “http://www-db.stanford.edu/~wangz/image.vary.jpg.tar”.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61972205, U1836208, and U1836110, National Key R&D Program of China under Grant 2018YFB1003205, Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions Fund, Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) Fund, China, and Ministry of Science and Technology (MOST), Taiwan, under Grant nos. 108-2221-E-259-009-MY2 and 109-2221-E-259-010.