Abstract

With the rapid development of cloud computing, cloud storage is widely used. In the cloud environment, users’ virtual machine system mirrors and data are stored on the cloud server. Virtual machine escapes and Trojan virus attacks make it challenging to ensure the integrity of virtual machine systems. With trusted computing, randomly verifying data integrity is expensive, and it does not adapt well to dynamic data changes. Provable data integrity is a potential solution to this problem. The Merkle Hash Tree (MHT) model is widely adopted in provable data integrity. Although MHT requires only a small amount of evidence for verification, the number of hash calculations on the verifier side and the efficiency of evidence queries on the server side are not optimal. Moreover, MHT does not consider the verification frequency of each piece of data. Properly handling these factors can improve the actual verification performance. In this paper, a lightweight and efficient data integrity verification approach called HB+-MHT is proposed for the tenant virtual machine (TVM) in cloud computing. In HB+-MHT, the Huffman hash tree scheme is used for small file verification to ensure that hot files have shorter paths, which reduces the amount of evidence required for verification. Meanwhile, the B+ hash tree scheme is used for big file verification, which can effectively reduce the evidence query time and the number of hash calculations. The experimental results show that the scheme proposed in this paper can perform data integrity verification well, with reduced computing and storage overhead.

1. Introduction

The wide application of cloud computing provides users with convenient and cheap services, which greatly reduces the storage overhead and management burden of users. However, due to the huge scale of cloud computing systems and unprecedented openness and complexity, these systems are faced with security issues more severe than before. The centralized management of cloud computing centers has become the target of attacks, which endangers the data confidentiality, integrity, and availability of cloud platforms [13].

When users or enterprises outsource their data to cloud computing service providers or entrust cloud computing service providers to run their applications, cloud computing service providers have priority access to the data or applications. Cloud system tenants usually use encryption to ensure the security of the private data stored on the cloud server [4]. Meanwhile, the development of encryption technology and access control technology provides reliable confidentiality for the tenant’s private data [5]. However, due to various risks such as negligence in internal personnel management, hacker attacks, or server system failures, cloud service providers cannot make users believe that their data has not been maliciously tampered with. Since all data of the tenant virtual machine are stored in the cloud, security issues often occur, such as virtual machine escapes and attacks on virtual machines by illegal users outside the system. These issues bring great challenges to the integrity protection of tenant data in the system. In the cloud environment, it is usually assumed that the cloud server is not trusted, and the tenant virtual machine can only establish trust through online interaction with the cloud server. In addition, there are many dynamic services in the cloud environment, so effectively ensuring data integrity in a dynamic environment is crucial to users’ experience of cloud services. Due to the openness and dynamic characteristics of cloud computing, it is difficult to establish trust, which makes guaranteeing the integrity of tenants’ dynamic data a challenging problem.

Therefore, it is urgent to design an efficient integrity verification scheme to ensure the integrity of tenant virtual machine systems and data in the cloud. Building a secure and reliable integrity verification method has become a research hotspot in the field of information security.

The traditional trusted computing method is based on the trusted computing technology proposed by the Trusted Computing Group (TCG) [6], which tries to provide credibility for the endpoints in a distributed computing environment. The Trusted Platform Module (TPM) is introduced into the hardware layer of the computing platform to provide a hardware-based trusted root. Starting from the trusted root, the integrity of the hardware and software of a local platform is measured layer by layer through the trust chain transmission mechanism. The traditional trusted computing methods have a low verification level and coarse granularity, and they face high computational costs in dynamic system measurement. When verifying the integrity of a file, these methods need to calculate the integrity of all previous files in turn, which requires a lot of computing power for random integrity verification and does not adapt to dynamic data changes.

Provable data integrity (PDI) is considered a potential solution to this problem. According to whether fault-tolerant processing is adopted, PDI can be divided into two types: the proof of retrievability (POR) model, which supports data recovery, and the provable data possession (PDP) model [7, 8]. PDP judges whether the cloud data is damaged, while POR attempts to recover the data after identifying the damaged data. PDP pays more attention to detection efficiency, while POR pays more attention to data recovery. From the perspective of monitoring Trojans or viruses destroying cloud data, this paper mainly discusses the PDP mechanism.

Merkle Hash Tree (MHT) is commonly used in PDP for the cloud environment. MHT is a well-studied authentication structure [9] that is intended to efficiently prove that a data set is undamaged and unaltered. It is constructed as a binary tree where the leaves are the hashes of the authentic data values. Each middle node is the hash of the concatenation of its left and right child nodes, and the root node is signed by the integrity management authority. The research on MHT has greatly improved the efficiency of PDP. However, there are two issues that MHT does not consider. One issue is that the verification frequency of specific data is not considered, and every data item has an authentication path of equal length; it is reasonable that frequently verified data should have a shorter path. The other issue is that MHT designs the authentication structure to reduce the amount of evidence but ignores the verifier’s number of hash calculations and the server’s evidence query time, which may be important performance constraints in the actual environment.
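To make the structure concrete, the following C++ sketch (illustrative only, not the authors’ implementation; H stands for a cryptographic hash such as the SHA-1 used later in the paper) builds an MHT root from a list of leaf hashes:

#include <string>
#include <vector>

// Placeholder for a cryptographic hash function (e.g., SHA-1) returning a digest string.
std::string H(const std::string& data);

// Compute the MHT root: each middle node is the hash of the concatenation of its
// two children; the leaves are the hashes of the authentic data values.
std::string MerkleRoot(std::vector<std::string> level) {
    if (level.empty()) return "";
    while (level.size() > 1) {
        std::vector<std::string> next;
        for (size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(H(level[i] + level[i + 1]));
        if (level.size() % 2 == 1)      // an odd node is carried up unchanged
            next.push_back(level.back());
        level.swap(next);
    }
    return level.front();               // this value is signed by the integrity authority
}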

To address the aforementioned challenges, this paper proposes a data integrity verification approach called HB+-MHT for TVM. HB+-MHT adopts different verification schemes according to the file size. Small files are loaded into the TVM, and an optimized hash tree is adopted for verification. In this case, compared with the communication overhead of transmitting small files to TVM, the traditional PDP methods with multiple signature operations might take more verification time. For large files, an improved PDP method is adopted because the communication overhead of transmitting large files to TVM may be too high. Our contributions are summarized as follows:

(i) The Huffman Merkle scheme is designed to verify the integrity of small files. According to different integrity verification frequencies or weights, the authentication paths of hot files are effectively shortened.

(ii) The B+ hash tree scheme is designed to verify the integrity of large files, which optimizes the server’s evidence query and reduces the number of client hash verifications.

(iii) Experiments show that our scheme can realize data integrity verification well, reducing computational overhead effectively and achieving high verification efficiency.

The rest of the paper is arranged as follows. Section 2 discusses the related work, Section 3 describes the detailed design of the scheme, Section 4 evaluates and analyzes the experimental results, and finally, Section 5 concludes the paper.

2. Related Work

TCG introduces the TPM in the hardware layer. Based on the trusted root and the trust chain transfer mechanism, integrity measurement can be implemented for the hardware and software layers of the local platform. However, the integrity measurement mechanism proposed by TCG has a low verification level. It can only verify the credibility of the operating system of the computing platform and does not specify a verification method for application-layer credibility. In the process of verification, the whole measurement list must be sent to the verifier, which easily leaks the measurements. Sailer et al. [10] proposed IMA, a more secure system integrity measurement architecture. The system kernel maintains a list of measurements and hashes the measured values to obtain an aggregate value as a reliable basis for verifying the measurements. Meanwhile, the system extends the TCG trust measurement concept to dynamic application-layer systems. However, every verification needs to hash all the measurements again to obtain the aggregate value, which greatly reduces the efficiency of the system. Also, the privacy protection of the TCG remote attestation mechanism remains unsolved. To address the shortcomings of IMA, Xu et al. [11] store the measurements in a Merkle tree structure so that only the authentication path related to a measurement needs to be sent. However, since the Merkle tree is a balanced binary tree, too many middle nodes need to be queried, so the average search complexity is high, and this method needs further optimization.

The current PDP mechanisms mainly rely on MAC authentication codes, RSA signatures, or BLS signatures, and they can support dynamic operations, multiple copies, privacy protection, etc. In the data possession proof based on MAC authentication codes [12], the MAC value serves as the verification metadata to verify the integrity of the data stored on the remote server. However, users need to download the data for integrity verification, which results in a large communication overhead and leakage of private data. Also, this method does not support dynamic data integrity verification. To support dynamic data operations, Erway et al. [13] introduced the skip-list dynamic data structure. This scheme has some problems, such as a long authentication path, a large amount of required auxiliary information, and high computational and communication overhead. Wang et al. [14] proposed a PDP mechanism based on the Merkle tree, which uses the MHT structure to ensure the correctness of data block locations. Meanwhile, they used the BLS signature mechanism to ensure the integrity of data block content. Although the model supports dynamic data operations, data insertion easily increases the scale of MHT and throws it out of balance. Shen et al. [15] proposed a fully dynamic structure combining a bidirectional linked list and a position array to support dynamic data updates more effectively. Tan et al. [16] proposed a lightweight integrity verification scheme based on the skip list and the BLS signature mechanism to reduce the overhead of generating verification metadata for verifiers. Sun et al. [17] proposed an adaptive authenticated data structure with privacy preservation for big data streams in the cloud, which can provide real-time authentication of outsourced big data. Jin et al. [18] verified frequently accessed cloud data by signature verification. However, this method requires verifiers to download the data locally, and data that has not been queried for a long time is verified through PDP.

In general, many methods reduce the amount of verification evidence through MHT and extend the MHT structure for further performance optimization or security enhancement. However, MHT may not be optimal in some cases. For example, a frequently verified file has the same amount of evidence as an infrequently verified file in traditional MHT-based schemes; it is more reasonable for hotspot files to have shorter authentication paths. For the verification of large files, because the amount of evidence is already significantly reduced in the PDP scheme (the file itself is not transferred), how quickly the cloud storage server can find the evidence and how quickly the client can calculate the hash values may become the main bottlenecks. To this end, this paper proposes the Huffman hash tree and the B+ hash tree to overcome these bottlenecks.

3. Detailed Design

3.1. Overall Framework

As shown in Figure 1, HB+-MHT consists of TVM, Cloud Storage Server (CSS), and Key Management Center (KMC). KMC securely distributes public and private keys to tenants and transmits the tenants’ public keys to CSS. TVM runs the query agent to initiate a data integrity verification challenge to CSS. CSS queries the evidence and generates proof as a response. TVM verifies the data integrity by analyzing the correctness of the proof.

The HB+-MHT cloud data integrity verification approach consists of two verification schemes. One is the Huffman hash tree (HHT) scheme, which is generally used to verify the integrity of small files. Compared with the communication overhead of transmitting small files, the traditional PDP methods with more signature operations might take more computing time. Therefore, for small files, this scheme loads the files into TVM and combines the hash tree with the Huffman tree to realize HHT, thereby shortening the authentication path and improving the verification efficiency of files with high verification weight. The other scheme is based on the combination of the B+ hash tree and the BLS aggregate signature. It reduces communication overhead by not loading files into TVM and achieves copy-free verification of large files. Because the amount of evidence is significantly reduced when the PDP scheme is adopted, how quickly the cloud storage server can find the evidence and how quickly the client can calculate the hash values become the main bottleneck. The B+ hash tree optimizes hash computation and evidence organization, which can effectively reduce the processing time of CSS and TVM.

3.2. Hash Tree Construction
3.2.1. B+HT: B+ Hash Tree

B+HT is an extension of the combination of the B+ tree and MHT, and it is also a balanced multiway evidence lookup tree. For a B+HT of order m, each node in the tree has at most m subtrees, and all nonleaf nodes except the root have at least ⌈m/2⌉ subtrees. A middle node consists of a tuple (C1, H1, P1, C2, H2, P2, …, Cm, Hm, Pm), where Ci is the number of leaf nodes of its i-th subtree, Hi is the hash value of the concatenation of all data items in the root node of its i-th subtree, and Pi points to the root node of its i-th subtree. The leaf nodes represent all data blocks of the file. The root node has one more item than a middle node, namely the signature of the root node data.

The data of B+HT usually has to be stored on disk due to limited memory. Most operating systems read data in blocks and place data written at one time into one disk block or multiple consecutive disk blocks. Therefore, if the size of a B+HT node is slightly smaller than the disk block size, quick query of the evidence in a node can be achieved. Accordingly, m should be approximately equal to the disk block size divided by the size of the per-subtree information (Ci, Hi, Pi) stored in a node. To ensure performance and tree balance, the number of subtrees of each middle node cannot be less than ⌈m/2⌉. Each hash calculation in MHT takes only the data of two subtrees as input, while each hash calculation in B+HT takes the data of multiple subtrees as input. Therefore, B+HT obtains the total hash value of all blocks through far fewer hash operations than MHT, which means that the number of hash calculations for evidence verification is greatly reduced. Each middle node stores the number of leaves of each subtree, which enables fast data block search. As an example, Figure 2 shows looking up the data block d11. According to the root node, it is deduced that d11 is the seventh leaf node of V3’s subtree. Similarly, according to the V3 node, it is known that d11 is the fourth leaf node of V7’s subtree.
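The following C++ sketch (field names and layout are illustrative assumptions, not the paper’s code) shows how the per-subtree leaf counts Ci support this lookup: the search descends from the root and subtracts the leaf counts of the subtrees it skips.

#include <string>
#include <vector>

struct BPlusHTNode {
    struct Entry {
        int count;              // Ci: number of leaf nodes in the i-th subtree
        std::string hash;       // Hi: hash of the concatenated data of the i-th subtree's root
        BPlusHTNode* child;     // Pi: pointer to the i-th subtree (null at the leaf level)
    };
    std::vector<Entry> entries; // at most m entries per node
};

// Descend from the root to the leaf-level node holding the idx-th data block
// (1-based), choosing the branch at each node from the per-subtree leaf counts.
const BPlusHTNode* Locate(const BPlusHTNode* node, int idx) {
    while (node && !node->entries.empty() && node->entries.front().child != nullptr) {
        const BPlusHTNode* next = nullptr;
        for (const auto& e : node->entries) {
            if (idx <= e.count) { next = e.child; break; }
            idx -= e.count;     // skip the leaves of earlier subtrees
        }
        if (!next) return nullptr;  // idx out of range
        node = next;
    }
    return node;
}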

We define the authentication path of a data block di as the set of nodes passed on the path from the leaf node of di to the root node. The set of hash values of the sibling entries of all nodes on the authentication path is called the auxiliary authentication information of di. As shown in Figure 2, the authentication path of d11 passes through the nodes V7 and V3 up to the root, and its auxiliary authentication information consists of the hash values of the other entries stored in these nodes. Obviously, the authentication path is much shorter than in MHT.

3.2.2. HHT: Huffman Hash Tree

Huffman hash tree (HHT) is an extension of the combination of the Huffman tree and MHT, in which each leaf node is assigned a weight to reflect the frequency of data verification. A middle node is represented as a tuple (LC, LP, H, RC, RP), where H is the hash of the concatenation of the data of its left child and right child; LP and RP point to the left and right child nodes, respectively; LC and RC are the numbers of leaf nodes contained in the left and right subtrees, respectively. In HHT, the leaf nodes are numbered sequentially from left to right. The path to a leaf can be quickly located according to the LC and RC values stored in the middle nodes. The root node also carries a signature on its hash value H. For a leaf l, this paper defines l’s weighted path length as the product of l’s path length and l’s weight. Let AWPL be the average weighted path length of all leaf nodes in the tree. Given the weights w1, w2, …, wn of the n leaf nodes, the binary hash tree with the smallest AWPL is called the Huffman hash tree.
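In symbols (notation added here for clarity), if leaf i has verification weight w_i and path length l_i, the quantity minimized by HHT is

\mathrm{AWPL} = \sum_{i=1}^{n} w_i \, l_i ,

and the Huffman hash tree is the binary hash tree that attains the minimum of this sum over all binary trees with the given n weighted leaves.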

For example, in Figure 3, there are four files a, b, c, and d with verification probabilities of (0.6, 0.2, 0.1, 0.1), respectively. Three binary hash trees are constructed with (I) AWPL = 2, (II) AWPL = 2.8, and (III) AWPL = 1.6. It can be seen that AWPL of the tree in (III) is the smallest. It can be analyzed and verified that the tree in (III) is exactly a Huffman hash tree.

The general process of generating HHT is as follows. (1) According to the given weights w1, w2, …, wn, a set F of n binary trees T1, T2, …, Tn is formed. Each Ti has only one node, and its left and right subtrees are empty. (2) Select the two trees Ts1 and Ts2 with the smallest root node weights from F and build a new binary tree Tnew with Ts1 and Ts2 as its left and right subtrees. The weight of Tnew’s root node is the sum of the weights of Ts1 and Ts2, and the LC, LP, RC, and RP fields of Tnew’s root node are assigned according to Ts1 and Ts2. (3) Remove Ts1 and Ts2 from F and add Tnew to F. (4) Repeat Steps 2 and 3 until there is only one tree in F. This tree is the Huffman hash tree.

3.3. Data Integrity Verification Process
3.3.1. HHT Scheme for Small Files

(1) Initialization Phase

Step 1. The KMC runs the key generation algorithm. Let G be a multiplicative cyclic group of large prime order p, and let g and u be two generators of G. KMC randomly selects a value from Zp*, calculates the corresponding public element, publishes the public parameters, and keeps the selected value as the system master private key. According to the tenant’s user identity, KMC generates the tenant’s private key and public key and sends the generated keys to TVM.

Step 2. TVM establishes a hash tree HHT of integrity measurements. The generation of HHT is shown in Algorithm 1. The data stored in the leaf nodes of HHT are the hash values of the initial small files of the system. The internal nodes are constructed according to the properties of the hash tree and generated by hash operations. TVM signs the HHT root node with the tenant’s private key, and a timestamp [19] is added to the signature, i.e., the signature is computed over the hash of the concatenation of the root value and the timestamp.

Step 3. TVM sends the file set that needs to be verified, its HHT, and the root signature to CSS. The data are saved and maintained by CSS.
(2) Challenge Verification Phase. Challenge verification is initiated by TVM or a third-party auditor (TPA) on behalf of the tenant. TVM generates a query challenge about the file fi and sends it to CSS. Then, CSS generates the authentication path according to HHT and sends a proof to TVM. After receiving the proof, TVM calculates the hash value of the file and uses the auxiliary authentication information to recompute the root value, which is then checked against the root value bound by the timestamped root signature.
If the verification is successful, the integrity of the file is not damaged.
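A minimal C++ sketch of the TVM-side check (illustrative; the complete scheme also verifies the timestamped signature on the root with the tenant’s public key): the root is recomputed from the file hash and the sibling hashes on the authentication path and compared with the signed root value.

#include <string>
#include <vector>

std::string H(const std::string& data);     // cryptographic hash, e.g., SHA-1

// One step of the authentication path: the sibling's hash and whether the
// sibling is the left child of the common parent.
struct PathStep {
    std::string sibling_hash;
    bool sibling_is_left;
};

// Recompute the root hash bottom-up from a leaf hash and its auxiliary
// authentication information.
std::string RecomputeRoot(std::string h, const std::vector<PathStep>& aux) {
    for (const auto& step : aux)
        h = step.sibling_is_left ? H(step.sibling_hash + h) : H(h + step.sibling_hash);
    return h;
}

// The file fi is considered intact if the recomputed root equals the root value
// protected by the timestamped signature.
bool VerifyFile(const std::string& file_data, const std::vector<PathStep>& aux,
                const std::string& signed_root_hash) {
    return RecomputeRoot(H(file_data), aux) == signed_root_hash;
}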
(3) Dynamic Data Update Phase. The HHT integrity verification scheme supports file addition, deletion, and modification. The operation flow of file-level data modification is shown in Figure 4. First, TVM sends a request to modify the file fi to CSS, and CSS returns the corresponding authentication information. TVM then uses the hash value of the original file fi to verify the correctness of the received information. If it passes the verification, TVM uses the updated file and the auxiliary authentication information to calculate the new root node of HHT and then uploads the new signature with a timestamp to CSS.
For the file insertion operation, since a newly added file fnew has not been verified yet and thus has the lowest weight, a leaf node flow with the lowest weight is selected, and flow is expanded into a subtree fsubtree with only three nodes: fsubtree’s left child is fnew, and its right child is flow. Accordingly, all nodes on the path from fsubtree’s root node to the root of the tree need to be updated, so the complexity of the insertion update is the height of the tree.
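The following C++ sketch illustrates this insertion (the node type is an assumption for illustration): the lowest-weight leaf flow becomes an internal node whose left child holds fnew and whose right child is a copy of the old leaf; the caller then updates the hashes and leaf counts on the path up to the root and re-signs the root.

#include <string>

std::string H(const std::string& data);     // cryptographic hash, e.g., SHA-1

struct HHTNode {
    std::string h;                           // hash stored at this node
    double weight = 0.0;
    int lc = 0, rc = 0;                      // leaf counts of the left/right subtrees
    HHTNode *lp = nullptr, *rp = nullptr;    // child pointers
    HHTNode *parent = nullptr;
};

// Expand the lowest-weight leaf into a three-node subtree: left child = new file,
// right child = the old leaf.
void InsertFile(HHTNode* flow, const std::string& new_file_data, double new_weight) {
    HHTNode* fnew = new HHTNode;
    fnew->h = H(new_file_data);
    fnew->weight = new_weight;
    HHTNode* old_leaf = new HHTNode(*flow); // copy of the old leaf node
    fnew->parent = flow;
    old_leaf->parent = flow;

    flow->lp = fnew;  flow->rp = old_leaf;  // flow now acts as fsubtree's root
    flow->lc = 1;     flow->rc = 1;
    flow->weight = fnew->weight + old_leaf->weight;
    flow->h = H(fnew->h + old_leaf->h);     // recompute hashes upward from here
}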
For the file deletion operation, if nodes with high weight are deleted, the nodes with low weight would need to move up as a whole, which requires a lot of calculation, and CSS would need to exchange a lot of information with TVM. Considering this overhead, this paper keeps the leaf nodes corresponding to deleted files in place and only clears the hash values stored in them. Accordingly, all nodes on the path need to be updated, so the complexity of the deletion update is the height of the tree. When files are added later, these dummy node locations are reused preferentially.
(4) Verification Frequency Change. Tenants’ personalized verification requirements may change dynamically and often show local characteristics; i.e., some files are frequently verified in a certain period. A static HHT generated once may not provide good verification services under such dynamic changes, while regenerating the whole HHT brings inconvenience and cost to tenants. Thus, this paper proposes the VFC mechanism, which combines local updates and global updates to deal with changes in verification frequency. VFC records the verification history and counts the verification frequency of each file in the latest period into a vector Wcur. Meanwhile, it maintains a vector Whht that denotes the verification probability of each file used in the current HHT. The elements in Wcur and Whht are arranged in descending order. In the local update mode, when a file f under verification is ranked in the top k of Wcur, that is, it is a hot file, CSS checks whether f is ranked in the top k of Whht. If not, CSS seeks a file f′ and exchanges the positions of f and f′ in HHT. f′ satisfies the following conditions: (1) it is in the top k of Whht, (2) it is not in the top k of Wcur, and (3) it has the lowest weight among the files satisfying conditions 1 and 2. After CSS exchanges f and f′, it updates their values in Whht and requests the tenant to re-sign the relevant path. This operation occurs when TVM verifies f, so the evidence information related to f can be reused and verified at one time. In the global update mode, the Euclidean distance between the vectors Wcur and Whht is calculated regularly. When the distance is greater than the threshold, HHT is regenerated and re-signed. Considering the high overhead of this method, the update interval is long and the threshold is high. The local update follows the locality principle so that hot files can always be verified quickly with shorter authentication paths. Moreover, since the update occurs when the tenant verifies the hot file, the reuse of evidence saves communication bandwidth and computational overhead. The global update ensures that HHT will not fall into a local optimum and keeps the updated HHT well adapted to the current verification pattern.
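The two VFC triggers can be sketched as follows (data structures and the threshold are illustrative assumptions): the local update fires when the file just verified is in the top k of Wcur but not in the top k of Whht, and the global update fires when the Euclidean distance between the two vectors exceeds the threshold.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// True if file f is among the k largest entries of the weight vector w (ties count as hot).
bool InTopK(const std::vector<double>& w, std::size_t f, std::size_t k) {
    std::vector<double> sorted(w);
    std::sort(sorted.begin(), sorted.end(), std::greater<double>());
    return w[f] >= sorted[std::min(k, sorted.size()) - 1];
}

// Local update trigger: f is hot in the recent history Wcur but not in the HHT weights Whht.
bool NeedsLocalSwap(const std::vector<double>& Wcur, const std::vector<double>& Whht,
                    std::size_t f, std::size_t k) {
    return InTopK(Wcur, f, k) && !InTopK(Whht, f, k);
}

// Global update trigger: the verification-frequency distribution has drifted too far
// from the one used to build the current HHT.
bool NeedsGlobalRebuild(const std::vector<double>& Wcur, const std::vector<double>& Whht,
                        double threshold) {
    double d2 = 0.0;
    for (std::size_t i = 0; i < Wcur.size(); ++i)
        d2 += (Wcur[i] - Whht[i]) * (Wcur[i] - Whht[i]);
    return std::sqrt(d2) > threshold;
}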

Input: FileInfoArray satisfying Σ f.weight = 1
Output: NodeList containing the root of the HHT
Begin
  For each item f in FileInfoArray {
   //Initialization: each tree has only one root node
   HHTNode n = new HHTNode();
   n.h = hash(f.data);
   n.weight = f.weight;
   n.lc = 0; n.rc = 0;
   n.lp = null; n.rp = null;
   NodeList.enqueue(n);
  }
  While (NodeList.size != 1)
   //Select the two trees with the smallest root weights from the queue
   nl, nr ← min(NodeList);
   //Merge the two trees
   HHTNode n = new HHTNode();
   n.h = hash(nl.h || nr.h);
   n.weight = nl.weight + nr.weight;
   //A leaf subtree contributes one leaf node
   n.lc = max(nl.lc + nl.rc, 1); n.rc = max(nr.lc + nr.rc, 1);
   n.lp = nl; n.rp = nr;
   //Add the new binary tree to the queue
   NodeList.enqueue(n);
   //Remove the two merged subtrees
   NodeList.dequeue(nl);
   NodeList.dequeue(nr);
  end while
  return NodeList
end
3.3.2. B+HT Scheme for Big Files

(1) Initialization Phase. Firstly, TVM divides the file into fixed-size blocks to obtain the data block set and hashes all blocks to obtain the hash set. Then, TVM randomly generates a public element and uses the tenant private key to sign each data block, which yields the signature set. Next, TVM uses the block hashes as leaf nodes to construct the B+HT structure and signs the root node with the tenant private key, where a timestamp is added to the signature. Finally, TVM sends the data blocks, the signature set, the B+HT, and the root signature to CSS for saving.
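The per-block signature equations are not legible above; as a point of reference, in BLS-based PDP schemes of this kind (e.g., Wang et al. [14]) they typically take a form such as

\sigma_i = \big( H(d_i) \cdot u^{\,d_i} \big)^{sk}, \qquad i = 1, \dots, n, \qquad
sig_{root} = \mathrm{Sign}_{sk}\big( hash(root \,\|\, timestamp) \big),

where u is a random public group element chosen by the tenant and sk is the tenant’s private key; this is a hedged sketch of the usual construction, not necessarily the paper’s exact equations.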

(2) Challenge Verification Phase. Challenge verification is initiated by TVM. TVM randomly selects the location set I of the data blocks to be verified. Then, it randomly generates a coefficient for each selected block and sends the challenge message to CSS. CSS calculates the aggregated signature and the aggregated block value over the challenged blocks. Next, it sends these two aggregates, the hash values of the target data blocks, and the auxiliary authentication information of the data blocks to TVM as the proof.

After receiving the proof, TVM uses the tenant’s public key and the returned block hashes to perform the first verification, formula (1), which checks the content of the challenged blocks through the BLS aggregate signature.

If this verification is correct, TVM continues to judge formula (2), which recomputes the B+HT root from the returned block hashes and the auxiliary authentication information and compares it with the value bound by the timestamped root signature.

If both verifications are successful, the integrity of the file is not damaged. The whole process of challenge verification is shown in Figure 5.
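Formulas (1) and (2) are not reproduced legibly in this copy. For orientation, in BLS-aggregate PDP schemes of this type (cf. [14]) the two checks commonly have the following shape, where chal = {(i, ν_i)}_{i∈I} is the challenge, σ = ∏_{i∈I} σ_i^{ν_i} and μ = Σ_{i∈I} ν_i d_i are the aggregates returned by CSS, e is the bilinear pairing, and Ω is the auxiliary authentication information (a sketch under these assumptions, not the paper’s verbatim equations):

e(\sigma, g) \stackrel{?}{=} e\Big( \prod_{i \in I} H(d_i)^{\nu_i} \cdot u^{\mu},\ pk \Big),
\qquad
\mathrm{Root}\big( \{ H(d_i) \}_{i \in I},\ \Omega \big) \stackrel{?}{=} hash_{root},

with hash_{root} the value bound by the timestamped root signature.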

Correctness:

Through formula (1), TVM can verify the integrity of a data block di of a file F stored in CSS without downloading di. However, verifying the integrity of multiple data blocks in F does not guarantee that F is not destroyed, even if TVM has verified all data blocks. This is because formula (1) does not bind the location of the data block di in F, and an untrusted CSS can substitute another data block dj from a different location; if the data of dj is correct, this forgery can pass the verification of formula (1). Formula (2) fixes the position of each data block with the hash tree. Although the hash tree alone can also verify the content, that verification can only be conducted when the verified data block di is loaded into TVM. Therefore, only the combination of formulas (1) and (2) realizes PDP verification of both content and location.

(3) Dynamic Data Update Phase. To make the B+HT scheme more suitable for the cloud environment, the integrity verification scheme must support the dynamic update of data. The block update operations include modification, insertion, and deletion.

(a) Data modification. TVM sends the modification request for a block to CSS and verifies the authentication information returned by CSS. If the verification is correct, the auxiliary authentication information and the new block hash are used to calculate the new root value, and CSS updates the B+HT accordingly.

(b) Data insertion. The data insertion operation inserts a new data block. Besides changing the physical structure of the file, this operation may cause node splitting when the structural properties of the B+ hash tree are no longer met and thus change the authentication structure. We use the B+HT tree in Figure 2 to illustrate the insertion operation. Suppose TVM wants to insert a block dnew after a block di. First, TVM generates the signature corresponding to dnew and then sends the insertion request message to CSS. After CSS receives the message, it first finds the leaf-level node that involves hash(di). Assume that the order of the B+HT tree is k. There are two situations for this node, as shown in Figure 7. (1) Its number of subtrees is less than k. In this case, hash(dnew) is added to the node, and the involved nodes on the authentication path are updated. Figure 7(a) shows the case of inserting a new node m after the block d1 in Figure 2. Then, CSS directly generates a proof. (2) Its number of subtrees is equal to k. In this case, the node needs to be split, and any parent node whose number of subtrees is also equal to k needs to be split as well, and so on. Specifically, CSS splits the node into two nodes: the first contains the information of all subnodes up to and including the insertion position, and the second contains the information of all subnodes after it; the information of hash(dnew) is appended to the end of the first node. Then, CSS updates the parent node: the item about the split node in the parent is divided into two items, whose hash values are the hashes of the two new nodes and whose leaf counts and child pointers are assigned according to the two new nodes, respectively. Figure 7(b) shows the case of inserting a new node n after block d9 in Figure 2. CSS then generates a proof according to the new structure of the B+HT. After receiving the proof, TVM continues to calculate the values of the nodes on the authentication path until the root value is obtained. If the verification passes, TVM updates the root signature and sends it to CSS. A sketch of the split logic is given after this list.

(c) Data deletion. Data deletion removes a leaf node, and it is the opposite of data insertion. Assume the order of the B+HT tree is k. If a node ddel is deleted, a simple case is that the number of subtrees of the node dhash-del involving hash(ddel) is greater than ⌈k/2⌉. In this case, the item of ddel in dhash-del is simply deleted. Taking the B+HT tree in Figure 2 as an example, Figure 8(a) shows the processing of deleting node d5. If the number of subtrees of dhash-del after the deletion would be less than ⌈k/2⌉, the remaining entries of dhash-del must be merged with those of its sibling nodes to regenerate nodes satisfying the properties of the B+HT tree, and the related parent node changes are processed similarly. For example, Figure 8(b) shows the process of deleting node d1 and merging its parent node with its neighboring peer node.
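A compressed C++ sketch of the split step in case (2) above (types and the split point follow the description; edge cases are omitted, and this is illustrative rather than the paper’s code): the new hash is inserted after its predecessor, and a node whose entry count exceeds the order k is split, with the new right sibling handed back to the caller for registration in the parent (recursing if the parent also overflows).

#include <cstddef>
#include <string>
#include <vector>

struct LeafLevelNode {
    std::vector<std::string> hashes;   // per-entry hash values of the data blocks
    LeafLevelNode* parent = nullptr;
};

// Insert new_hash immediately after position pos in node n.  If n then exceeds the
// order k, split it: n keeps the entries up to and including the new one, and a new
// right sibling takes the rest.  The caller must insert an item describing the new
// sibling into n->parent (splitting upward if necessary) and refresh the hashes and
// leaf counts on the path to the root so that the tenant can verify and re-sign it.
LeafLevelNode* InsertAndMaybeSplit(LeafLevelNode* n, std::size_t pos,
                                   const std::string& new_hash, std::size_t k) {
    n->hashes.insert(n->hashes.begin() + pos + 1, new_hash);
    if (n->hashes.size() <= k) return nullptr;           // case (1): no split needed

    auto* right = new LeafLevelNode;                     // case (2): split the node
    right->parent = n->parent;
    std::size_t split = pos + 2;                         // one past the newly inserted entry
    right->hashes.assign(n->hashes.begin() + split, n->hashes.end());
    n->hashes.erase(n->hashes.begin() + split, n->hashes.end());
    return right;                                        // to be registered in the parent
}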

4. Analysis and Evaluation

4.1. Security Analysis

The security of HB+-MHT is analyzed from the aspects of integrity verification and data update, including whether it can prevent an untrusted CSS from cheating tenants with incorrect data and whether it can prevent untrusted third parties from maliciously updating the authentication data in CSS.

Theorem 1. If the hash algorithm is collision-resistant and the signature algorithm is unforgeable, no adversary against the B+HT scheme can make the verifier accept in a data verification protocol instance with nonnegligible probability, unless it responds with correctly calculated values.

Proof. The BLS signature scheme is adopted, whose unforgeability is proved in [14]. The security of the HHT and B+HT trees is mainly analyzed here. It will be proven that if there is an adversary A who can break the scheme with a nonnegligible probability advantage, then there exists an algorithm B that uses A to find a pair of hash collisions with a nonnegligible probability advantage.
Given a block di to be verified, with its correct authentication path pathi and correct auxiliary authentication information, TVM’s verification process can be denoted as a Boolean function verify with parameters (i, H(di), auxiliary authentication information), and its calculation of path nodes starts from the leaf node. The analysis of the jth node on the path yields

Hj = hash(hj,1 || … || hj,r−1 || Hj−1 || hj,r+1 || … || hj,K),

where the (j−1)th node on pathi is the rth child of the jth node, K is the order of B+HT, Hj denotes the value recomputed for the jth node on the path, and hj,s is the hash value of the sth child of the jth node on the path. Finally, the correctness of the recomputed root value is checked against the root signature, and the verification result is obtained.
If the adversary A can construct a tuple that differs from the correct proof but makes the function verify return true, A wins the security game.
In this case, the forged proof and the correct proof yield the same root value, while at least one of the submitted block hashes or auxiliary values differs from the correct one. Comparing the two authentication paths, there must exist a node at which the two hash inputs are different but the resulting hash values are equal. However, this contradicts the collision-resistant property of the hash function. It can be concluded from the above discussion that the probability that the adversary can break the data verification protocol of the B+HT is negligible. Therefore, the data verification protocol of the B+HT is secure.

Theorem 2. If the hash algorithm is collision-resistant and the signature algorithm is unforgeable, no adversary against the B+HT scheme can make CSS accept in a data update protocol instance with nonnegligible probability, unless it responds with correctly calculated values.

Proof. When an adversary A submits an update request of inserting a block after the ith block to CSS, A will receive a response message from CSS. According to this message and the current time timestampnow, A needs to recalculate the root value hashroot of the tree and sign the hash value of the concatenation of hashroot and timestampnow. Because the signature cannot be forged and A has no private key, A can only reuse a previously signed message signatureold, which signs hash(hashroot-old || timestampold). The current timestamp timestampnew is greater than timestampold and cannot be changed. If A can construct a hashroot-new by adjusting the update request such that hash(hashroot-new || timestampnew) = hash(hashroot-old || timestampold), then A wins the security game. However, because hashroot-new ≠ hashroot-old and timestampnew ≠ timestampold, the two hash inputs are different, which contradicts the collision-resistant property of the hash function. It can be concluded from the above discussion that the probability that the adversary can destroy the data insertion protocol of the B+HT is negligible. The analysis of the deletion and modification operations is similar. Therefore, the data update protocol of the B+HT is secure.
From the perspective of security, HHT is a variant of MHT, and the security of the binary hash tree authentication structure has been analyzed in [14]. Therefore, the analysis of HHT is omitted here.

4.2. Performance Evaluation

Here, representative schemes are compared to better evaluate the performance of HB+-MHT. More specifically, HHT is compared with MHT, and B+HT is compared with Wang’s scheme [14] in terms of space and time complexity. The experiments are performed in a simulation environment. TVM and CSS run on two hosts, each equipped with an Intel(R) Core(TM) i7-8700K @ 3.70 GHz CPU and 16.0 GB RAM and running the Windows 10 operating system. The two hosts are directly connected through a gigabit network cable. B+HT is implemented in C++, and the development tool is Visual Studio 2019. The hash operation uses SHA-1, and the signature operation uses BLS. Besides, a pairing-based cryptography library [20] is adopted to implement these cryptographic algorithms.

4.2.1. Space Complexity

In PDP integrity verification schemes, TPA or TVM only stores a small amount of information such as signatures and timestamps, which takes up little storage space. The storage overhead of CSS is analyzed in Table 1, where lsig is the signature length, lH is the length of a hash, M is the number of files, itemhht is the size of a node in HHT, itemmht is the size of a node in MHT, N is the number of file blocks, and K is the order of the B+HT tree.

As shown in Table 1, the storage overhead of CSS is mainly related to the storage of tree nodes. For small file verification, HHT occupies slightly more storage space than MHT because each HHT node additionally stores the numbers of leaf nodes in its subtrees, with a size of 32 bits. However, compared with the 160-bit SHA-1 hash value stored in a tree node, the increase in storage space is small. For large file verification, the storage overhead of B+HT and Wang’s scheme [14] for storing block signatures is the same. However, B+HT occupies significantly less storage space than Wang’s scheme for storing tree nodes. With the increase of K, the storage saving of B+HT over Wang’s scheme [14] grows logarithmically. Of course, K cannot increase indefinitely, and the node size should not exceed the size of a disk block.

4.2.2. Time Complexity

(1) HHT Scheme. Since HHT is derived from MHT, their time complexities are evaluated and compared. Integrity verification services are used much more frequently than data update services, so the time from the initiation of a user request to the end of verification is analyzed here. The main overhead includes CSS’s evidence query time Tquery, the communication time Tcom, and TVM’s verification time Tver. The time cost of each phase of the two schemes is shown in Table 2. It can be seen that whether HHT is superior to MHT mainly depends on whether HHT can provide an AWPL lower than ⌈log2 M⌉, where M is the number of files.

Therefore, the value of AWPL is tested. During the initialization stage, given M files with randomly allocated weights, Algorithm 1 is run to establish an HHT tree. In the verification stage, to simulate locality of selection, we set up a candidate set of size 50, and only the files in this set can be selected. After a file in the set is selected and verified, a small number of files in the set are randomly replaced. According to their verification weights, the files to be verified are selected by roulette-wheel selection. Meanwhile, a variation factor is introduced to slightly adjust the weights during verification. According to whether HHT enables the VFC mechanism, HHT is divided into HHT-1 (disabled) and HHT-2 (enabled). Let C be the number of tests and let L be the average authentication path length. The value of L is recorded in Table 3 to intuitively reflect the performance of HHT. The test results show that the HHT scheme is generally better than the MHT scheme. When M is small or C is large, the HHT scheme performs better, which is mainly attributed to sampling variance: HHT optimizes the authentication paths of the tree according to the weights, and when the number of samples increases, the sample indicators better reflect the real indicators of HHT. In addition, the result that the HHT-2 scheme outperforms the HHT-1 scheme shows that the VFC mechanism can better adapt to changes in verification probability.

(2) B+HT Scheme. B+HT and Wang’s scheme [14] adopt the same signature scheme for the content verification of data blocks. The difference between them is the hash tree structure for the location authentication of data blocks. Thus, this paper mainly compares the performance of their hash trees. TVM’s verification time Tver is mainly related to the number of hash calculations, which depends on the height of the tree. The communication time Tcom is mainly related to the product of the node size and the height of the tree. CSS’s evidence query time Tquery mainly depends on the number of node queries, i.e., the height of the tree. The complexity of location verification for a single data block is shown in Table 4. It can be seen that the verification performance of TVM and CSS in our scheme is log2 K times higher than that of Wang’s scheme [14]. When the order K of B+HT equals 4, Tver and Tquery are halved. Of course, this is accompanied by an increase in Tcom. Assuming that the number of blocks is 1204, K = 4, and the hash length is 20 bytes, B+HT only increases the traffic by about 0.6 KB over Wang’s scheme [14] for each verified data block. In fact, with the development of communication technology, the increased traffic is negligible. This scheme is suitable for the case where the computing power of both TVM and CSS is weak or their load is heavy while the communication bandwidth is good, which is common in the cloud environment.
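For intuition, the per-block cost is proportional to the tree height, so with N data blocks the saving is the ratio of the balanced-tree heights (the concrete N below is only an example):

h_{\mathrm{MHT}} = \lceil \log_2 N \rceil, \qquad
h_{\mathrm{B^{+}HT}} \approx \log_K N = \frac{\log_2 N}{\log_2 K};
\qquad \text{e.g., } N = 4096,\ K = 4:\ h_{\mathrm{MHT}} = 12,\ h_{\mathrm{B^{+}HT}} = 6.

This matches the observation above that Tver and Tquery are roughly halved when K = 4.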

From the perspective of tenant experience, the total verification response time is further evaluated because tenants care about how long it takes to complete verification after a query request is sent. The total time Tsum is equal to the sum of Tquery, Tcom, and Tver. The experimental parameters are as follows: K = 4, BLS signature, a block size of 20 bytes, and 1 Gbps bandwidth. Figure 9 shows the time overhead of the two schemes with the same detection probability of 99% under different file sizes. It can be seen that Tsum of our scheme is lower than that of Wang’s scheme. When K = 4, the overhead of Tcom increases by about 1.5 times, while the overheads of Tquery and Tver decrease by 2 times. The test results show that the time saved on Tquery and Tver exceeds the extra time spent on Tcom; that is, Tquery and Tver are the performance bottleneck. Moreover, the larger the file, the higher the tree and the more complex the storage management, resulting in slower disk I/O searches and a larger Tquery.

5. Conclusions

This paper proposes a lightweight and efficient integrity verification method for cloud data file systems, which uses different integrity verification schemes for different file types to improve verification efficiency. The Huffman Merkle tree verification scheme is used for small files, which shortens their authentication paths according to the file verification frequency or a user-defined file verification weight. Meanwhile, the B+ hash tree integrity verification scheme is used to verify large files, which can effectively reduce the query response time of nodes. Besides, the bilinear aggregate signature is used to verify the existence of large files, which effectively reduces the requirements on user computing power and communication bandwidth. The experimental results indicate that our scheme can perform data integrity verification well, with less computation and communication overhead and higher verification efficiency than the existing methods.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research work was supported by the National Natural Science Foundation of China under Grant nos. 62176265 and 61972040.