Abstract

The rapid advancement of Internet of Things (IoT) and cloud computing technologies has significantly promoted the collection and sharing of various data. In order to reduce the communication cost and the storage overhead, it is necessary to exploit data deduplication mechanisms. However, existing data deduplication technologies still suffer from security and efficiency drawbacks. In this paper, we propose two secure data deduplication schemes based on Rabin fingerprinting over wireless sensing data in cloud computing. The first scheme is based on deterministic tags and the other adopts random tags. The proposed schemes realize data deduplication before the data is outsourced to the cloud storage server, and hence both the communication cost and the computation cost are reduced. In particular, variable-size block-level deduplication is enabled based on the technique of Rabin fingerprinting, which generates data blocks according to the content of the data. Before outsourcing data to the cloud, users encrypt the data based on convergent encryption technologies, which protects the data from being accessed by unauthorized users. Our security analysis shows that the proposed schemes are secure against offline brute-force dictionary attacks. In addition, the use of random tags gives the second scheme more reliable security guarantees. Extensive experimental results indicate that the proposed data deduplication schemes are efficient in terms of the deduplication rate, the system operation time, and the tag generation time.

1. Introduction

The wireless sensor network (WSN) is an ad hoc network composed of a large number of sensors, which communicate with each other over a wireless channel in a multihop manner [1–5]. Sensors are usually low-cost, simple devices with limited computing power and battery capacity, which have the ability to collect, process, and transfer data. With the rapid development of Internet of Things (IoT) and cloud computing technologies, WSN has found many promising applications. As an extension of the cloud computing paradigm, fog computing makes it possible to execute IoT applications at the network edge. Xu et al. [6] proposed a dynamic resource allocation method for load balancing in the fog environment. Cloud computing [7, 8] supports distributed data storage and parallel processing, and its data processing framework handles huge amounts of data on local machines rather than requiring these data to be transmitted remotely [9–11]. Cloud storage is the most common and most popular cloud computing service today. The extensive application of cloud storage motivates enterprises and organizations to outsource data storage to third-party cloud providers [12–16]. Zhang et al. [17] proposed a fine-grained access control system suitable for resource-constrained users in cloud computing. It is reported that the average size of backup data for a medium-sized enterprise is 285 TB, with an annual growth rate of about 24-27%. According to an analysis report of IDC, personal user data had reached the terabyte scale by 2006. From 2006 to 2010, the global data volume continued to grow at a rate of 57% annually. In 2011, the global data volume entered the zettabyte (ZB) era, and the total amount of data used globally exceeded 1.8 ZB. It is expected that the global data volume will reach 40 ZB by 2020 [18].

Data deduplication has been widely accepted as an effective technique to reduce the workload and overhead of cloud storage systems [19–23]. Today’s commercial cloud storage services, such as Dropbox, Google Drive, Bitcasa, Mozy, and Memopal, have applied deduplication to save maintenance costs. However, the extensive application of data deduplication makes its security problems increasingly prominent [24, 25]. Compared with traditional information security, cloud storage security [26–28] has two main characteristics: users do not have physical control over the data they upload to the cloud storage system, and the same physical resources are shared by multiple users. As a result, the confidentiality and integrity of data may be threatened. It is noted that cloud storage security has drawn much attention [29, 30]. Xu et al. [31] proposed a cost- and energy-aware data placement method for privacy-aware applications over big data in hybrid clouds. Harnik et al. [32] pointed out that there were security vulnerabilities in the deduplication technology used by providers. Douceur et al. [33] introduced convergent encryption (CE), which uses the hash value of the data itself as a secret key, to resolve the contradiction between deduplication and confidentiality. Bellare et al. [34] defined a cryptographic primitive called message-locked encryption. Li et al. [35] implemented Dekey using the Ramp secret sharing scheme to manage CE keys. The authors of [21] pointed out that, in data deduplication, simply using the hash value of a file to represent the entire file makes the deduplication process vulnerable: the hash value is not confidential, and an attacker who obtains it can retrieve the entire file content. Abadi et al. [36] proposed two schemes, a completely random one and a deterministic one, which support the randomization of tags to ensure the security of the data deduplication system. In these schemes, CE directly uses the data fingerprint as the input of the key derivation function and hence only achieves security for unpredictable data. In fact, offline brute-force dictionary attacks can be easily launched because CE keys are deterministic [37]. Moreover, current deduplication schemes [35, 37] directly deduplicate the encrypted data, which increases the computational overhead. In the future, it may be possible to realize decentralized data deduplication schemes via blockchain technologies, which have been used to realize decentralized outsourcing computation [38, 39] and searchable encryption with two-sided verifiability [40] in cloud computing.

Deduplication can be defined at different granularities [41]: file-level deduplication and block-level deduplication (with fine-grained fixed-size or variable-size data blocks). File-level deduplication is the simplest but least efficient method. Fixed-size block-level deduplication divides a file into fine-grained fixed-size (such as 4 MB or 512 KB) data blocks and then deletes the duplicate blocks [42]. However, it is difficult for fixed-size block-level deduplication to deal with insertions of data into a file. Abadi et al. [36] proposed a completely random scheme that avoids generating tags directly from deterministic messages and better guarantees the security of the data deduplication process. On the basis of [36], Jiang et al. [43] added static and dynamic data deduplication decision trees and optimized the duplicate detection operations. However, most previous schemes realize data deduplication after the data is encrypted by users, and hence the computation and communication efficiencies remain to be improved. In [44], the authors proposed a data deduplication scheme based on Rabin fingerprinting, which is a preliminary version of the work given in Section 4.2 of this paper. In this paper, we significantly revise the preliminary scheme [44] and add more technical details compared to that preliminary version. First, we add Section 3 to describe a system architecture of secure deduplication based on Rabin fingerprinting over wireless sensing data in cloud computing. Second, we improve the basic construction to support randomized tags and provide detailed procedures of data deduplication using randomized tags in Section 4.3. Third, we present the security analysis of both schemes in Section 5 and conduct extensive experiments to evaluate the proposed deduplication schemes in Section 6.

Our Contribution. The contributions of this paper can be summarized as follows. In order to tackle the security and efficiency drawbacks of existing data deduplication technologies, we propose two secure data deduplication schemes based on Rabin fingerprinting over wireless sensing data in cloud computing. The first scheme is based on deterministic tags and the other adopts random tags. Note that the randomized tag achieves more reliable security guarantees than the deterministic tag. In order to reduce the communication cost and the computation cost, data deduplication in the proposed schemes is performed before the data is outsourced to the cloud storage server. For the sake of practicability, we realize variable-size block-level deduplication of the data, which is enabled by the technique of Rabin fingerprinting. In order to protect the outsourced data from being accessed by unauthorized users, the data is encrypted by users based on convergent encryption technologies before it is outsourced to the cloud. Our security analysis shows that the proposed schemes are secure against both external attacks and internal attacks. Extensive experimental results indicate that the proposed data deduplication schemes are efficient in terms of the deduplication rate, the system operation time, and the tag generation time.

Organization. The rest of this paper is organized as follows. Notations and cryptographic backgrounds are reviewed in Section 2. The system model, the threat model and security requirements of a secure deduplication scheme are described in Section 3. We present the proposed two data deduplication schemes in Section 4. Section 5 gives the security analysis of the proposed schemes and Section 6 shows the performance evaluation. Finally, our concluding remarks are made in Section 7.

2. Preliminaries

In this section, we first explain the notations used throughout this paper and then briefly review some cryptographic background involved in the proposed data deduplication schemes.

2.1. Notations

In Table 1, we list notations mainly used in the description of the proposed data deduplication schemes.

2.2. Rabin Fingerprinting

The technique of Rabin fingerprinting is widely used for quick comparison and recognition of duplicate data. It is based on arithmetic modulo an irreducible polynomial over $GF(2)$ [45]. Let $A = (a_1, a_2, \ldots, a_m)$ be a bit string. We define a polynomial of degree $m-1$ over $GF(2)$ as
$$A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_m.$$
Let $P(x)$ be an irreducible polynomial of degree $k$ over $GF(2)$. Given a fixed $P(x)$, the Rabin fingerprint of $A$ is defined as the polynomial
$$f(A) = A(x) \bmod P(x).$$
The computation of Rabin fingerprinting is illustrated in Figure 1, where the input is a continuous string and each character is a tuple of 8 bits.

Note that a sliding window of width $w$ is used. Assume the starting point is $t_1$ and the string in the window is represented by a polynomial $A_1(x)$; thus the Rabin fingerprint of the string in the window is
$$f_1 = A_1(x) \bmod P(x).$$
When the window slides forward 8 bits, $t_2$ becomes the starting point and the Rabin fingerprint of the string in the window becomes
$$f_2 = A_2(x) \bmod P(x).$$
In fact, the Rabin fingerprinting algorithm computes a rolling checksum of the data [46]. The window of the data is configurable, but it is typically a few dozen bytes long. The Rabin module reads through a file and lets the window slide over the data. When a byte is read, the fingerprint is recalculated. If the fingerprint equals a special value, the Rabin module considers the corresponding window position to be a boundary. The data preceding this window position is taken to be a “block” of the file. For $i = 1, 2, \ldots, n$, let $B_i$ be a “block”; the fingerprint of the data block is defined as
$$fp_i = B_i(x) \bmod P(x).$$
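To make the rolling computation concrete, the following is a minimal sketch of content-defined chunking in Java (the language used for our simulations). It replaces the GF(2) polynomial arithmetic with an ordinary integer rolling hash, and the window width, boundary condition, and maximum block size are illustrative assumptions rather than the parameters used in this paper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of content-defined chunking with a rolling hash, in the
// spirit of Rabin fingerprinting. A real Rabin fingerprint works in GF(2)
// modulo an irreducible polynomial; an ordinary integer rolling hash stands
// in for it here. WINDOW, BOUNDARY_MASK, and MAX_BLOCK are illustrative.
public class ContentDefinedChunker {
    static final int WINDOW = 48;                     // sliding window width (bytes)
    static final long BASE = 257;                     // rolling-hash base
    static final long MOD = 1_000_000_007L;           // prime modulus
    static final long BOUNDARY_MASK = (1 << 13) - 1;  // ~8 KB expected block size
    static final int MAX_BLOCK = 64 * 1024;           // hard upper bound on block size

    public static List<byte[]> chunk(byte[] data) {
        List<byte[]> blocks = new ArrayList<>();
        long pow = 1;                                  // BASE^(WINDOW-1) mod MOD
        for (int i = 1; i < WINDOW; i++) pow = (pow * BASE) % MOD;

        long fp = 0;
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (i - start >= WINDOW) {                 // drop the byte leaving the window
                long out = data[i - WINDOW] & 0xffL;
                fp = Math.floorMod(fp - out * pow % MOD, MOD);
            }
            fp = (fp * BASE + (data[i] & 0xff)) % MOD; // add the incoming byte
            // Declare a boundary when the fingerprint hits a special value,
            // or when the block would exceed the maximum size.
            boolean boundary = (fp & BOUNDARY_MASK) == BOUNDARY_MASK
                    || (i - start + 1) >= MAX_BLOCK;
            if (boundary) {
                blocks.add(Arrays.copyOfRange(data, start, i + 1));
                start = i + 1;
                fp = 0;
            }
        }
        if (start < data.length) {                     // trailing partial block
            blocks.add(Arrays.copyOfRange(data, start, data.length));
        }
        return blocks;
    }
}
```

Because block boundaries depend only on the local content inside the window, inserting bytes into a file shifts only the blocks around the insertion point, which is exactly the property that fixed-size blocking lacks.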

2.3. Proof of Ownership

A proof of ownership (PoW) protocol [47] enables a client to prove to the server that it owns a given file. The server can derive a small piece of metadata from the data $F$. To prove ownership of $F$, the user sends the corresponding tag and runs a proof algorithm with the server. Its ownership is accepted if and only if the tag matches the stored metadata and the proof is correct [48].
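The following is a simplified challenge-response illustration of what "proving ownership" means operationally. It assumes the verifier can still read its stored copy of the block; the PoW scheme cited above [47] instead commits to the file with a Merkle tree so that only short metadata must be kept. All names here are illustrative.

```java
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;

// Simplified challenge-response sketch of a proof of ownership (PoW).
public class PowSketch {
    public static byte[] challenge() {
        byte[] nonce = new byte[16];                  // fresh random challenge
        new SecureRandom().nextBytes(nonce);
        return nonce;
    }

    // Prover side: bind the challenge to the claimed block.
    public static byte[] respond(byte[] nonce, byte[] block) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(nonce);
        sha.update(block);
        return sha.digest();
    }

    // Verifier side: recompute the expected response and compare.
    public static boolean verify(byte[] nonce, byte[] storedBlock, byte[] response)
            throws Exception {
        return Arrays.equals(respond(nonce, storedBlock), response);
    }
}
```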

2.4. Convergent Encryption

The notion of convergent encryption was proposed by Douceur et al. [33]. In order to ensure the confidentiality of outsourced data in the data deduplication process, users first encrypt data and then upload ciphertexts. In practice, if traditional encryption mechanisms are adopted, different users hold different encryption keys, so the same file will be encrypted into different ciphertexts by different users. This property poses a serious challenge to data deduplication from the point of view of efficiency. In convergent encryption, the key is derived from the outsourced data, and hence the same data corresponds to the same ciphertext even if the users are different. Therefore, CE makes it possible to realize secure data deduplication over ciphertexts. Figure 2 illustrates the process of convergent encryption. A convergent encryption scheme consists of the following algorithms:

(i) Key generation. The key generation algorithm generates a convergent key $K$ based on the data $F$. For a secure use of convergent encryption, the convergent key should be unpredictable, which can be realized by introducing randomness based on a message authentication code (MAC). A MAC, also known as a keyed hash function, is a value computed from a secret key and a message digest, and it is usually used for data source authentication and integrity checking. The key generation is defined as follows.
(a) $H$ is a hash algorithm, such as SHA-1 or SHA-256, which takes the message as input and outputs the hash value $h = H(F)$.
(b) $\mathrm{MAC}$ is a message authentication code that takes as inputs the hash value $h$ and a random parameter $r$ and outputs a randomized convergent key $K = \mathrm{MAC}_r(h)$.
(ii) Encryption. A symmetric encryption algorithm that takes the convergent key $K$ and the data $F$ as inputs and outputs a ciphertext $C$.
(iii) Decryption. The corresponding decryption algorithm that takes the convergent key $K$ and the ciphertext $C$ as inputs and outputs the original data $F$.
(iv) Tag generation. The tag generation algorithm maps the original data $F$ to a tag $T$. Essentially, the Rabin fingerprint of the data is used as the tag in the deterministic tag based scheme and is used to generate tags for the random tag based scheme.
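The sketch below illustrates the randomized convergent encryption described above, assuming SHA-256 for the hash, HMAC-SHA256 for the MAC, and AES with a fixed IV as the deterministic symmetric cipher; these concrete algorithm choices and method names are assumptions for illustration, not the exact instantiation used in this paper.

```java
import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch of randomized convergent encryption: the convergent key is an HMAC
// of the block's hash under the secret parameter r held by MS, and the block
// is encrypted deterministically so identical blocks yield identical ciphertexts.
public class ConvergentEncryptionSketch {
    // KeyGen: K = MAC_r(H(data)).
    public static byte[] keyGen(byte[] data, byte[] r) throws Exception {
        byte[] h = MessageDigest.getInstance("SHA-256").digest(data);
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(r, "HmacSHA256"));
        return Arrays.copyOf(mac.doFinal(h), 16);     // truncate to a 128-bit AES key
    }

    public static byte[] encrypt(byte[] key, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                new IvParameterSpec(new byte[16]));   // fixed IV keeps encryption deterministic
        return c.doFinal(data);
    }

    public static byte[] decrypt(byte[] key, byte[] ct) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                new IvParameterSpec(new byte[16]));
        return c.doFinal(ct);
    }
}
```

Because the key is derived from the data (via $H$) and a shared secret parameter $r$, two users holding the same block obtain the same key and hence the same ciphertext, which is what makes deduplication over ciphertexts possible.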

2.5. Decision Trees

As a predictive model, a decision tree is a tree-like structure in which each internal node denotes a test on an attribute, each branch represents a test outcome, and each leaf node represents a category. For example, as shown in Figure 3, a decision tree consists of nodes and branches. Typically, a decision tree begins with the root node, and branches connect the nodes. A branch that originates from a decision node is called a decision branch. Note that different conditions are associated with different branches. A leaf node acts as a termination node, which indicates the final outcome of the branch.

3. Models and Security Goals

In this section, we first introduce the system model and then describe the threat model and security goals.

3.1. System Model

The system model is illustrated in Figure 4, in which three entities are involved, including a management server (MS), users, and a cloud storage server (CSS). In the model, users outsource their data to CSS and access the data later with the help of MS, while keeping the ability of data deduplication. The details are described as follows.

(i) MS. It is trusted by users and manages secret keys and users’ information. MS introduces a random secret parameter to generate randomized convergent keys for users.

(ii) Users. Users can compute the block fingerprints before data deduplication. They encrypt data and then upload ciphertexts to CSS. For recovering the data, they decrypt the corresponding ciphertext from CSS.

(iii) CSS. It is honest but curious and provides data storage services to users. It stores and manages users’ unique data copies in the form of ciphertexts. In the subsequent random tag based deduplication scheme, CSS checks duplicate data based on a decision tree.

3.2. Threat Model and Security Goals

We consider both external attackers and internal attackers for the security of outsourced data storage with data deduplication. On the one hand, external attackers are able to obtain partial information about the data from the public channels, and an external attacker can access CSS by disguising itself as a legitimate user. On the other hand, internal attackers are honest but curious: they follow the procedures of the proposed scheme but try to obtain as much confidential information as possible. The goal of the internal attackers is to obtain the contents of the data from CSS and the randomized convergent keys from MS.

Considering the above threat model, we specify the following security goals. First, we need to ensure the semantic security of the encrypted data blocks. This requirement has been formalized in [49]. Accordingly, an adversary that does not hold the corresponding convergent key can neither decrypt a data block nor claim ownership of it. Second, the convergent keys should be kept secure. Since the attackers aim to obtain other users’ keys and data block ciphertexts, we need to guarantee the security of key transmission and storage so that neither external attackers nor internal attackers can obtain other users’ convergent keys.

4. Data Deduplication Schemes Based on Rabin Fingerprinting

In this section, we propose two data deduplication schemes based on Rabin fingerprinting, including a deterministic tag based scheme and a random tag based scheme. In each scheme, three phases, system setup, file uploading, and file downloading, are performed for data outsourcing storage with deduplication. The proposed deduplication schemes perform block-level data deduplication before users’ data encryption, in which the file blocks are generated based on Rabin fingerprinting.

4.1. Overview of Our Schemes

In the first scheme, the outsourced data is first divided into many data blocks based on the Rabin fingerprinting technique. For each data block, a deterministic tag is generated based on a hash function. With the tag, the cloud storage server can check whether the corresponding data block already exists. If it exists, the user proves to the cloud server that it indeed owns the data block. Otherwise, the user encrypts the data block and uploads the generated ciphertext to the cloud server, where the ciphertext is produced by a convergent encryption whose convergent key is generated by the management server. The security of data in the deduplication process is ensured by encryption techniques, and the convergent keys are also effectively managed. However, deterministic tags fail to meet the standard confidentiality requirement of semantic security. To be specific, if the plaintext space can be enumerated, an attacker can learn the content of the plaintext by computing tags and comparing ciphertexts. If the tag is unpredictable, this security drawback can be avoided. In the second scheme, the tag is therefore randomized. The new scheme supports randomized tags and also allows decision tree based duplicate detection; the decision tree supports deletion and updating without requiring expensive bilinear pairing operations. The randomized tags sacrifice efficiency to some extent but provide more reliable protection of data confidentiality in data deduplication systems.

4.2. Data Deduplication with Deterministic Tags
4.2.1. System Setup

In the system setup phase, the necessary parameters are generated through the following procedures:

(S1) Given a security parameter, MS specifies a convergent encryption scheme (consisting of key generation, encryption, decryption, and tag generation algorithms), an asymmetric encryption scheme, and a PoW algorithm. MS runs the asymmetric key generation algorithm to generate a public and secret key pair $(pk, sk)$ for each user. Note that the convergent key generation is realized based on a MAC (as in Section 2.4) and the tag is computed based on Rabin fingerprinting.
(S2) CSS initializes two types of storage systems: a fast storage system for the efficient detection of duplicate data tags and a file storage system for storing encrypted outsourced data.
(S3) MS initializes its local storage system for storing users’ metadata and randomized convergent keys.

4.2.2. File Uploading

The uploading phase is shown in Figure 5. Suppose that a user uploads a file $F$; the block-level deduplication proceeds as follows:

(S1) The user sends a file-backup request to MS, including its authentication information. Then, MS performs identity authentication. If it passes, the following steps are performed.
(S2) Based on the Rabin fingerprinting technique, the user divides $F$ into a set of blocks denoted by $B_1, B_2, \ldots, B_n$. The user computes each block fingerprint $fp_i$ and sends the fingerprints as tags to CSS for duplicate checking.
(S3) In addition, the fingerprints are sent to MS for generating convergent keys later.
(S4) Once the data block fingerprints are received, CSS computes the data block signal vector $s = (s_1, \ldots, s_n)$ as follows (a code sketch of this check is given after these steps):
(i) For each $fp_i$, if an existing block fingerprint matches $fp_i$, CSS sets $s_i = 1$ to indicate “block duplicate.”
(ii) Otherwise, CSS sets $s_i = 0$ to indicate “no block duplicate” and stores $fp_i$ into the fast storage system.
After the duplicate check is finished, CSS returns the signal vector $s$ to the user.
(S5) After receiving $s$, the user checks whether $s_i = 1$. If so, the user runs a PoW algorithm to prove to CSS that it owns the data block $B_i$. If CSS accepts the proof, it directly returns the corresponding pointer of $B_i$ to the user, and the user stores the block pointer so that $B_i$ does not need to be uploaded. If the proof is rejected, the protocol is terminated and the involved entities quit the protocol.
(S6) Otherwise, the user sends $s_i$ to MS. Upon receiving $s_i$, MS checks whether $s_i = 0$. If so, MS generates the randomized convergent key $K_i = \mathrm{MAC}_r(H(B_i))$ corresponding to the nonduplicate block, where $r$ is a randomly chosen parameter, and sends it to the user. The user encrypts $B_i$ under $K_i$ to obtain the ciphertext $C_i$ and uploads $C_i$ to CSS.
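As referenced in step (S4), the following sketch shows how CSS could compute the signal vector with a fast in-memory fingerprint index; representing fingerprints as hex strings and using a HashSet index are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the CSS-side duplicate check of step (S4): each received block
// fingerprint is looked up in a fast in-memory index; 1 means "block duplicate"
// and 0 means "no block duplicate", in which case the fingerprint is stored.
public class DuplicateChecker {
    private final Set<String> fingerprintIndex = new HashSet<>();

    public int[] checkBlocks(List<String> fingerprints) {
        int[] signal = new int[fingerprints.size()];
        for (int i = 0; i < fingerprints.size(); i++) {
            String fp = fingerprints.get(i);
            if (fingerprintIndex.contains(fp)) {
                signal[i] = 1;              // duplicate: the user runs PoW instead of uploading
            } else {
                signal[i] = 0;              // new block: the user encrypts and uploads it
                fingerprintIndex.add(fp);
            }
        }
        return signal;                      // returned to the user as the signal vector
    }
}
```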

4.2.3. File Downloading

Suppose that a user intends to download a file $F$. The user first sends a downloading request to MS, including its authentication information. If the authentication is successfully verified, the following procedures are performed:

(S1) MS encrypts the randomized convergent key $K_i$ under the user’s public key $pk$ and sends the resulting ciphertext to the user.
(S2) Upon receiving the ciphertext, the user decrypts it with its secret key $sk$ to recover the randomized convergent key $K_i$. Subsequently, the user obtains the encrypted data block $C_i$ from CSS.
(S3) The user decrypts the corresponding ciphertext $C_i$ with $K_i$ to recover the data block $B_i$ and then restores the file $F$.

4.3. Data Deduplication with Randomized Tags
4.3.1. System Setup

The details are the same as those in the deterministic tag based scheme. In addition, MS specifies a cyclic group $G$ of prime order $p$ with generator $g$.

4.3.2. File Uploading

Suppose that a user intends to outsource the file $F$. The tag corresponding to a data block is computed from the block fingerprint, a value randomly chosen from $Z_p$, and a flag $\delta$. The value of $\delta$ can be 0 or 1. If $\delta = 1$, the corresponding data block of the tag in the decision tree has not been deleted. When the data block is deleted, $\delta$ is set to 0, which means there is no corresponding data block in CSS. The data uploading process is illustrated in Figure 6.

(S1) The user sends a file-backup request to MS, including its authentication information. Then, MS performs identity authentication. If it passes, the following steps are performed.
(S2) Based on the Rabin fingerprinting technique, the user divides $F$ into a set of blocks denoted by $B_1, B_2, \ldots, B_n$. The user computes each block fingerprint $fp_i$ and sends the fingerprints to MS.
(S3) Upon receiving the data backup request, CSS iterates through the tag nodes in the order of the decision tree (a structural sketch of this traversal is given after these steps). If $\delta = 0$, it moves on to the next node’s tag until a tag with $\delta = 1$ or a leaf node is reached. If $\delta = 1$, CSS returns the tag to the user. Note that the flag of the root node tag of the decision tree has a default value of 1.
(S4) Once the user receives the tag sent by CSS, the user performs a verification computation to check whether the received tag matches the fingerprint of its own block.
(i) If they match, the user sends “data duplication” to CSS and skips to step (S6).
(ii) Otherwise, the user computes a comparison result between its fingerprint and the received tag and sends it to CSS.
(S5) The server moves the pointer to the next node in the decision tree based on the comparison result.
(i) If the comparison indicates one outcome, CSS moves the pointer to the left child node.
(ii) Otherwise, the pointer is moved to the right child node, and step (S3) is performed again.
If the decision tree pointer has not found a duplicate node after reaching a leaf node, CSS sends a “data non-duplication” instruction to the user and skips to step (S7).
(S6) Once the user receives a “data duplication” instruction for a block $B_i$, it runs a PoW protocol with CSS to prove its ownership of the block. If the proof passes, CSS returns to the user a pointer to the duplicate data block $B_i$. The user then stores the pointer, and the data block does not need to be uploaded.
(S7) Once the user receives the “data non-duplication” instruction associated with the data block $B_i$, MS generates the randomized convergent key $K_i = \mathrm{MAC}_r(H(B_i))$, where $r$ is a randomly chosen parameter, and sends it to the user. The user runs the encryption algorithm to compute the ciphertext $C_i$ and uploads it to CSS. At the same time, the user chooses a fresh random value, generates the corresponding randomized tag, and sends the tag to CSS.
(S8) Upon receiving the tag of the block from the user, CSS evaluates it against the current node. Depending on the outcome, the tag either overwrites the left child node whose flag is $\delta = 0$ (or is placed at a new left leaf node), or overwrites the right child node whose flag is $\delta = 0$ (or is placed at a new right leaf node).
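As referenced in step (S3), the following structural sketch shows how the decision tree could be traversed for duplicate detection. It condenses the user/CSS interaction of steps (S3)-(S5) into a single routine and abstracts the cryptographic tag comparison behind a match predicate and an ordering function, since the exact group operations are omitted here.

```java
import java.util.Comparator;
import java.util.function.BiPredicate;

// Structural sketch of decision tree based duplicate detection: nodes hold a
// tag and a deletion flag, the tree is walked from the root, deleted nodes
// never match, and the direction of descent is given by an abstract ordering.
public class TagDecisionTree {
    static class Node {
        Object tag;               // randomized tag of a stored block
        boolean active = true;    // false once the corresponding block is deleted
        Node left, right;
        Node(Object tag) { this.tag = tag; }
    }

    // Returns the matching node ("data duplication"), or null if a leaf is
    // passed without any match ("data non-duplication").
    static Node findDuplicate(Node root, Object fingerprint,
                              BiPredicate<Object, Object> matches,
                              Comparator<Object> order) {
        Node cur = root;
        while (cur != null) {
            if (cur.active && matches.test(cur.tag, fingerprint)) {
                return cur;
            }
            // Descend left or right according to the comparison outcome,
            // mirroring the pointer movement of step (S5).
            cur = (order.compare(fingerprint, cur.tag) < 0) ? cur.left : cur.right;
        }
        return null;
    }
}
```

Because the traversal is an ordinary binary-tree walk, deleting a block only requires flipping its node’s flag and inserting a new tag only touches one path, which is why no bilinear pairing operations are needed for updates.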

4.3.3. File Downloading

Suppose that a user intends to download a file $F$. The user first sends a downloading request to MS, including its authentication information. If the authentication is successfully verified, the following procedures are performed:

(S1) MS encrypts the randomized convergent key $K_i$ under the user’s public key $pk$ and sends the resulting ciphertext to the user.
(S2) Upon receiving the ciphertext, the user decrypts it with its secret key $sk$ to recover the randomized convergent key $K_i$. Subsequently, the user obtains the encrypted data block $C_i$ from CSS.
(S3) The user decrypts the corresponding ciphertext $C_i$ with $K_i$ to recover the data block $B_i$ and then restores the file $F$.

5. Security Analysis of the Proposed Schemes

The proposed two data deduplication schemes differ in their tags: the first scheme adopts deterministic tags and the second scheme uses random tags. The involvement of random parameters in the tag generation makes the second scheme more secure. In the following, we mainly show that the deterministic tag based scheme is secure against both external attacks and internal attacks; the additional protection offered by random tags is discussed at the end of Section 5.2.

5.1. Security against External Attacks

In data deduplication systems, external attackers must be prevented from accessing data. For instance, the data transmitted between the user and CSS may be intercepted by an external attacker. After selecting the range of a dictionary, the attacker could try to recover the data corresponding to the intercepted metadata by way of a brute-force dictionary attack. In addition, an external attacker may maliciously modify and destroy users’ transmitted data in order to compromise both the integrity and the availability of the data. In the proposed deduplication scheme, random information is added to the convergent key by MS, which randomizes the convergent key and alleviates the key compromise risk. The randomization of the convergent key makes offline brute-force attacks very difficult. Because each user first encrypts the outsourced data and then transmits only ciphertexts in the system, it is impossible for external attackers to obtain the original data without the relevant key.

5.2. Security against Internal Attacks

In order to perform unauthorized operations, attackers often try to hide their own identities. For example, an attacker may disguise itself as another legitimate user to violate the privacy of other users. To prevent such internal attackers, the secure deduplication system performs identity authentication when a user initially communicates with MS, which stores and manages the convergent keys, so as to prevent unauthorized reading of information. At the side of CSS, if a user aims to access a file, a PoW protocol is required to be performed between the user and CSS, by which the user proves to CSS its ownership of the file. The proposed deduplication scheme can therefore effectively prevent attackers from accessing any files and keys beyond their ownership. The random tag based scheme, besides providing the security of the first scheme, also avoids the use of deterministic tags during duplicate data detection. Accordingly, even if an attacker obtains a tag, the randomness of the tag makes it infeasible to derive the corresponding convergent key, which further improves the system security.

6. Performance Evaluation

In this section, we evaluate the performance of the proposed Rabin fingerprinting based data deduplication systems. We also compare the deduplication rate of the trivial fixed-size block scheme with that of our Rabin fingerprinting based schemes.

6.1. Simulation Environment

The hardware used in the simulation is a 64-bit Lenovo 80ER laptop with Windows 7 Home Basic operating system, and its CPU is Intel(R) Core(TM) i5-5300U CPU @2.30GHz. The simulation code is written with Java language by using the MyEclipse development platform. In our experiments, the data samples are 5400 different journal articles from the China national knowledge infrastructure, and they are about 15.9 GB in size.

6.2. Experimental Results and Analysis
6.2.1. The Optimized Sliding Step Size

In this section, we aim to find the optimized sliding step size of the data deduplication scheme based on the Rabin fingerprinting. The optimized sliding step size enables a better performance of the data deduplication system. Specifically, given a fixed-size data set, we set the upper bound of the data block size as 8 KB and the sliding window size of the Rabin fingerprinting as 64 KB. Then, when the window sliding step size varies from 1 B to 20 B, we test the running time and the deduplication rate of the Rabin fingerprinting, respectively. Note that the data deduplication rate is defined as the ratio of the remaining nonduplicate data size after data deduplication to the total data size. The smaller the ratio, the better the data deduplication effect.
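Restating this definition as a formula, with both sizes measured on the same data set:

$$\text{deduplication rate} = \frac{\text{size of nonduplicate data remaining after deduplication}}{\text{total size of the original data}}.$$

For example, if 100 GB of input data shrinks to 40 GB of nonduplicate data, the deduplication rate is 0.4; the smaller the value, the more duplicate data has been removed.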

Figure 7 illustrates the variation of the data deduplication rate with the sliding step size of the window, and Figure 8 shows the variation of the data deduplication time with the sliding step size. We can see from Figure 7 that the deduplication rate is optimal when the Rabin fingerprinting uses a window sliding step of 1 B, in which case more than 50% of the duplicate data is removed. In this case, however, the deduplication time is the longest, as shown in Figure 8. As the sliding step size increases, the data deduplication rate fluctuates within a limited range and tends to be steady over several adjacent sliding step sizes. When the window sliding step size is 1 B, the deduplication time based on Rabin fingerprinting is the longest; the larger the sliding step size, the less time data deduplication takes. Generally, in order to balance the deduplication effect of the Rabin fingerprinting based scheme against the system operation time, we adopt a sliding window of 64 KB and a sliding step size of 18 B in the following experiments.

6.2.2. The Performance Comparison of Rabin Fingerprinting Based Scheme and Fixed-Size Block Scheme

Figure 9 shows how the time for data block generation varies with the file size and compares the block generation time of the fixed-size block deduplication scheme with that of the Rabin fingerprinting based scheme. For a test file of the same size, the time required by the fixed-size block scheme is smaller than that of the Rabin fingerprinting based scheme. Nevertheless, we will show later that the total system operation time of the Rabin fingerprinting based scheme is lower. Figure 10 compares the deduplication time of the Rabin fingerprinting based scheme with that of the fixed-size block scheme. The deduplication time does not include the block generation time, so the two schemes are compared purely from the perspective of duplicate detection. It can be seen from Figure 10 that the data deduplication time of both schemes increases with the file size. For a given test file, the deduplication efficiency of the Rabin fingerprinting based scheme is obviously better than that of the fixed-size block scheme. The comparison reflects the advantage of Rabin fingerprinting in data deduplication.

In the subsequent simulation, the fixed-size block algorithm is first used to divide the test files into fixed-size data blocks of sizes 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 KB, respectively. Then, the Rabin fingerprinting algorithm is used to divide the same files into variable-size data blocks with upper size bounds of 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 KB, respectively. Finally, the data deduplication time and the deduplication rate are tested and compared.

Figure 11 compares the total running time of the Rabin fingerprinting based data deduplication system with that of the fixed-size block-level data deduplication system. For the sake of clarity, the horizontal axis adopts a logarithmic scale. We can see that the Rabin fingerprinting based system performs better: duplicate data can be found quickly by using Rabin fingerprinting, which makes the data deduplication more efficient. Figure 12 shows the data deduplication rate of the Rabin fingerprinting based scheme and the fixed-size block scheme. It can be seen that the duplicate data detection of the former is better than that of the latter. As the data block size increases, the deduplication rate of both schemes degrades; however, the deduplication rate of the Rabin fingerprinting based system is always better than that of the fixed-size block scheme. Figure 13 compares the overall data storage performance of the Rabin fingerprinting based system and the fixed-size block-level system. From this figure, we can see that the overall storage performance of the Rabin fingerprinting based deduplication scheme is better than that of the fixed-size block-level deduplication system. Moreover, as the file data volume increases, the storage time of the fixed-size block-level deduplication system grows faster than that of the Rabin fingerprinting based deduplication system.

6.2.3. The Encryption Time of Nonduplicated Data Blocks

Besides the deduplication performance, we also consider the cost of encryption. As shown in Figure 14, the encryption time of the nonduplicated data after the deduplication operation increases with the file size. Based on Figure 10, we know that the fixed-size block-level data deduplication scheme generates more data blocks, and hence its data block encryption time is longer than that of the Rabin fingerprinting based scheme. In particular, the fixed-size block-level data deduplication scheme needs to encrypt data before performing data deduplication. Put another way, duplicate data is also encrypted, which further increases the encryption overhead of the system.

6.2.4. The Performance Comparison of Deterministic Tags and Random Tags

The schemes analyzed above use deterministic tags. In the following, we test and analyze the performance of the Rabin fingerprinting based data deduplication systems with deterministic tags and with random tags. In Figure 15, we show the comparison of tag generation performance between the deterministic tag based scheme and the random tag based scheme.

It can be seen from Figure 15 that the time for generating random tags is much longer than that for deterministic tags. As the number of uploaded files increases, the time cost of generating random tags also increases, and its rising trend is obvious. In Figure 16, we compare the storage performance of the two types of schemes. As can be seen from Figure 16, the larger the number of uploaded files, the greater the total data deduplication time of both deduplication schemes. Generally, the random tag based deduplication system is more secure, while the deterministic tag based scheme is more efficient.

7. Conclusions and Future Work

In this paper, we proposed two secure data deduplication schemes based on Rabin fingerprinting. The schemes are realized based on deterministic tags and random tags, respectively. In our schemes, data deduplication is enabled before the data is outsourced to the cloud storage server, and hence both the communication cost and the computation cost are reduced. In particular, we realized variable-size block-level deduplication by using Rabin fingerprinting. Data confidentiality is ensured based on convergent encryption technologies. Our security analysis showed that the proposed schemes can resist offline brute-force dictionary attacks. Our simulation results indicated that the proposed schemes are practical in terms of efficiency.

In the future research, it would be interesting to design decentralized block-level data deduplication schemes with fine-grained access control.

Data Availability

The data used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by National Key R&D Program of China (no. 2017YFB0802000), the National Natural Science Foundation of China (nos. 61772418, 61472472, and 61402366), and the Natural Science Basic Research Plan in Shaanxi Province of China (nos. 2018JZ6001 and 2015JQ6236). Yinghui Zhang is supported by New Star Team of Xi’an University of Posts and Telecommunications (2016-02).