Abstract

With the wide application of the Internet of Things (IoT), a huge amount of data is collected from IoT networks and must be processed, for example by data mining. Although outsourcing storage and computation to the cloud is popular, it may compromise the privacy of participants’ information. Cryptography-based privacy-preserving data mining has been proposed to protect the privacy of the participating parties’ data during this process. However, computing on and analyzing ciphertexts from multiple participants is still an open problem. Moreover, existing algorithms rely on the semihonest security model, which requires all parties to follow the protocol rules. In this paper, we address the challenge of outsourcing the ID3 decision tree algorithm in the malicious model. In particular, to securely store and compute on private data, a two-participant symmetric homomorphic encryption scheme supporting addition and multiplication is proposed. To guard against malicious behavior by the cloud computing server, secure garbled circuits are adopted to construct a privacy-preserving weighted average protocol. The security and performance of the scheme are analyzed.

1. Introduction

In the modern Internet of Things (IoT), huge volumes of data are collected from sensor networks and must be analyzed by highly effective techniques such as data mining. This process requires enormous computation and storage resources, which cloud computing can provide. However, the process may leak participants’ private information. Privacy-preserving data mining (PPDM) based on encryption has emerged as a solution to this problem.

Privacy-Preserving Data Mining Framework. PPDM originated in 2002 with two different frameworks and underlying theories, due to Lindell et al. [1] and Agrawal et al. [2], respectively. Lindell’s framework is essentially a secure cryptography-based two-participant computation protocol without outsourcing. In other words, two parties interactively compute on their private inputs. Agrawal’s framework is essentially a single-participant perturbation-based data storage and computation outsourcing algorithm. In particular, one party uploads perturbed data to a server for private computation. With the development of cloud computing and IoT, a framework for multiparty storage and computation outsourcing is preferred.

Cryptography-based privacy-preserving data mining supporting one-party outsourcing has been studied [3, 4] using homomorphic encryption. However, multiple-key homomorphic encryption remains an open problem when multiple parties are involved in the outsourcing framework. For example, it is unclear how to perform addition and multiplication on ciphertexts encrypted under different public keys.

Security Models. Two security models are usually considered: the semihonest model and the malicious model. In the semihonest model, all users must follow the rules of the protocol, but dishonest users are allowed to obtain the internal states of the other users. In the malicious model, by contrast, corrupted users are allowed to deviate from the specified protocol. The adversary succeeds if it can obtain the results of these protocols.

Data Distribution. Three types of distributed datasets are defined in related work: horizontally distributed, vertically distributed, and arbitrarily distributed datasets. In a horizontally distributed dataset, each user holds a different subset of the records over the same attributes. In a vertically distributed dataset, users hold different attributes. In an arbitrarily distributed dataset, the data can be divided among and stored by the users in any way.
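As a small illustration of the first two distribution types, the following Python sketch splits a toy table between two parties. The table, the attribute names, and the helper functions `horizontal_split` and `vertical_split` are hypothetical and serve only to make the definitions concrete.

```python
# Hypothetical toy table; attribute names are illustrative only.
RECORDS = [
    {"id": 1, "outlook": "sunny", "windy": "no", "play": "no"},
    {"id": 2, "outlook": "rain", "windy": "no", "play": "yes"},
    {"id": 3, "outlook": "sunny", "windy": "yes", "play": "no"},
    {"id": 4, "outlook": "rain", "windy": "yes", "play": "yes"},
]


def horizontal_split(rows, k):
    """Both parties see every attribute, but each holds different records."""
    return rows[:k], rows[k:]


def vertical_split(rows, attrs_a, attrs_b, key="id"):
    """Both parties see every record, but each holds different attributes."""
    part_a = [{key: r[key], **{a: r[a] for a in attrs_a}} for r in rows]
    part_b = [{key: r[key], **{a: r[a] for a in attrs_b}} for r in rows]
    return part_a, part_b


h1, h2 = horizontal_split(RECORDS, 2)
v1, v2 = vertical_split(RECORDS, ["outlook"], ["windy", "play"])
```

This paper works in the horizontal setting, so each data owner corresponds to one of `h1` and `h2`.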

Because malicious participants exist in real environments, participants may not follow the protocol. For example, they can intentionally tamper with the data, abort the protocol at any time during its execution, and so on. To solve this problem, this paper combines noninteractive commitments with the garbled circuit mechanism, studies an average computing protocol based on garbled circuits, and then proposes a framework for a secure cryptography-based two-participant protocol with data storage and computation outsourcing. The framework consists of two data owners and two cloud servers, a cloud storage server (CSS) and a cloud computing server (CCS). Each data owner has a horizontally distributed private database that is encrypted before being outsourced to the cloud for storage and computation.

1.1. Our Contribution

We decompose the key function of distributed ID3 decision tree, , into counting, , sum, multiplication, and comparison.

In counting, we propose the Secure Equivalent Testing (SET) protocol to calculate the number of items for each attribute value based on the encrypted data.

To calculate the value of in malicious model, we implement the Outsourcing Secure Circuit (OSC) protocol.

To perform the sum and multiplication operations over ciphertext, we adopt the Paillier encryption system and implement the Secure Multiplication (SM) protocol.

To execute comparison over ciphertext, we adopt the Secure Minimum out of 2 Numbers (SMIN2) protocol.

1.2. Related Work

Distributed PPDM without Outsourcing. Distributed PPDM without outsourcing mainly targets data that are stored and processed locally by the participants. Data mining methods over distributed data are decomposed into basic operations, such as average computation, logarithm computation, and vector inner-product computation, and cryptographic techniques are then used to design a privacy-preserving protocol for each operation. In 2002, Lindell and Pinkas [1] proposed a secure ID3 decision tree algorithm over horizontally partitioned data. They decomposed the distributed ID3 algorithm into logarithm computation, polynomial evaluation, and data comparison and then designed a secure logarithm protocol, a polynomial evaluation protocol, and a secure comparison protocol, thereby achieving privacy preservation in the distributed ID3 algorithm. In 2007, Emekci et al. [5] implemented a secure addition protocol based on secret sharing and extended the secure logarithm protocol from two parties to multiple parties, thus realizing a multiparty privacy-preserving ID3 method. However, the complexity of the algorithm increases exponentially as the participant data grow. In 2012, Lory et al. [6] replaced the Taylor expansion in [1] with a Chebyshev polynomial expansion, further improving the computational efficiency of secure logarithm protocols. However, the efficiency of these protocols remains limited in practice.

In contrast, in 2003, Vaidya et al. [7, 8] designed a multiparty privacy-preserving ID3 algorithm for vertically distributed data sets. They vectorized all attribute value information by constructing constraint sets and then computed on it using a secure intersection protocol, thus obtaining a privacy-preserving ID3 algorithm for vertically distributed data sets.

In 2007, Han and Ng [9] proposed a multiparty privacy-preserving ID3 method for arbitrarily distributed data sets. First, each participant’s data set is vectorized, and the attribute value information is computed using, among others, a secure intersection protocol. Then, the entropy of each attribute is computed using, among others, a secure logarithm protocol. This yields a privacy-preserving ID3 decision tree classification method for arbitrarily distributed data sets. However, the computation load of each client increases exponentially with the number of participants.

Li et al. [10] and Gao et al. [11] addressed the Naive Bayes Learning for aggregated arbitrary distributed databases.

PPDM with Computation Outsourcing. Cryptography-based privacy-preserving data mining involves many encryption and decryption operations and is therefore difficult to apply to large-scale data. As a way of overcoming resource restrictions, outsourcing has been widely used in cloud computing applications, such as data sharing [12, 13], data storage [14, 15], data updating [16, 17], and social network analysis [18, 19]. In this context, secure outsourcing techniques are needed to move the computing or storage tasks of all participants to the cloud, greatly reducing the computation and storage load of each participant. In 2014, Liu et al. [3] adopted a new encryption scheme that supports both addition and multiplication over ciphertexts. In this scheme, most of the computation is performed in the cloud, which reduces the workload of the data owner. However, the scheme is limited to data mining by a single party. Chen et al. [20] designed new algorithms for the secure outsourcing of modular exponentiations. In 2015, Bost et al. [21] proposed privacy-preserving hyperplane decision, naive Bayes, and decision tree classifiers and proved that these schemes satisfy semantic security in the semihonest secure two-party computation model. Their protocols can also be combined with adaptive boosting (AdaBoost) to further improve accuracy, and they form a library of building blocks for constructing privacy-preserving classifiers, laying a foundation for the future development of privacy-preserving classification.

PPDM with Multiparticipant Data Storage and Computation Outsourcing. In 2013, Peter et al. [22] proposed a new solution for outsourcing multiparty computation. Such a technique can be used in our setting. However, as in previous work, its security analysis only achieves security in the semihonest model. In [23], a new protocol was proposed for data mining between two parties. In [24], association rule mining was addressed in the malicious model. In [25], privacy-preserving KNN classification was addressed. In [26], deep learning was addressed. Besides the above work, several fundamental secure algorithms have been proposed and considered in the malicious model, such as dynamic homomorphic encryption [27, 28], authentication [29, 30], and lightweight multiparty computation [31]. However, to the best of our knowledge, no existing study has considered a method for outsourcing computation in the malicious model.

In this study, the secure outsourcing of ID3 data mining is considered in the malicious model for the cloud environment. We show how to solve the outsourcing problem for the ID3 protocol over horizontally partitioned data.

2. Preliminaries

In this section, we present a brief overview of the preliminaries used in this paper, including the ID3 decision tree algorithm, Paillier’s homomorphic encryption scheme, and the other related protocols.

2.1. Distributed ID3 Decision Tree Algorithm

The ID3 algorithm builds a decision tree in a top-down manner from the information in the samples. Starting at the root, it selects at each node the attribute that best classifies the objects, where the best attribute is the one with the largest information gain. The information gain of an attribute is defined as

Assume that there are 2 parties, , with 2 databases, . Each party has one database . All databases share the same general attribute (column) set and each attribute has several general discrete attribute values, denoted by , and one class attribute .

Without considering privacy, each party shares his own and to all other parties. As a result, any party can calculate and .

where is the subset of with tuples that have value for class attribute . equals the set of transactions with class attribute set to in database .

Then the value of can be calculated as

where is the subset of with tuples that have values for attribute and for class attribute .

Therefore, (3) can be easily computed by party and parties all of the values and from its database. Each party then sums these together with the values and from its database and completes the computation.

Then each party can calculate value at its own side.
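The non-private computation described in this subsection can be sketched in Python as follows. The sketch assumes the standard ID3 definition, in which the information gain of an attribute is the entropy of the class attribute minus its conditional entropy given that attribute; the function names and the toy databases `db1` and `db2` are hypothetical. Each party contributes only its local counts, and the counts are summed before the entropies are evaluated.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def class_counts(db, class_attr):
    """Local count of records per class value."""
    return Counter(row[class_attr] for row in db)


def joint_counts(db, attr, class_attr):
    """Local count of records per (attribute value, class value) pair."""
    return Counter((row[attr], row[class_attr]) for row in db)


def information_gain(dbs, attr, class_attr):
    """Gain of `attr` over horizontally partitioned databases `dbs`."""
    cls, joint = Counter(), Counter()
    for db in dbs:  # each party only contributes its local counts
        cls += class_counts(db, class_attr)
        joint += joint_counts(db, attr, class_attr)
    total = sum(cls.values())
    h_class = entropy(cls.values())
    h_cond = 0.0
    for value in {a for a, _ in joint}:
        per_class = [n for (a, _), n in joint.items() if a == value]
        h_cond += sum(per_class) / total * entropy(per_class)
    return h_class - h_cond


# Hypothetical toy databases held by the two parties.
db1 = [{"outlook": "sunny", "windy": "no", "play": "no"},
       {"outlook": "rain", "windy": "no", "play": "yes"}]
db2 = [{"outlook": "sunny", "windy": "yes", "play": "no"},
       {"outlook": "rain", "windy": "yes", "play": "yes"}]
best = max(["outlook", "windy"],
           key=lambda a: information_gain([db1, db2], a, "play"))
```

In the toy data, `outlook` determines the class while `windy` carries no information, so `best` is `"outlook"`. The privacy-preserving protocol later in the paper targets exactly the counts and logarithms that this sketch exchanges in the clear.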

2.2. Paillier’s Homomorphic Encryption Scheme

Homomorphic encryption is a special type of encryption in which the result of applying an algebraic operation to plaintexts can be obtained by applying another algebraic operation (which may be different or the same) to the corresponding ciphertexts. Thus, even a user who does not know the plaintexts can still obtain the result of applying that operation to them.

Let and be two plain texts with encryptions and , respectively.

The Paillier encryption scheme [32] is described as follows:
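The additive homomorphism that the later protocols rely on can be illustrated with a toy version of the textbook Paillier construction. This is a minimal sketch under stated assumptions, not necessarily the exact parameterization used by the authors: it fixes the generator to g = n + 1, uses two small Mersenne primes that offer no security, and requires Python 3.8 or later for the modular inverse via `pow`. All function names are hypothetical.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy key generation; p and q are small known primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of L(g^lam mod n^2) when g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """E(m) = g^m * r^n mod n^2, with g = n + 1 so that g^m = 1 + m*n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def scalar_mul(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


n, sk = keygen()
total = decrypt(n, sk, add(n, encrypt(n, 20), encrypt(n, 22)))
scaled = decrypt(n, sk, scalar_mul(n, encrypt(n, 7), 6))
```

Both `total` and `scaled` equal 42. Paillier does not provide ciphertext-by-ciphertext multiplication on its own, which is why an interactive multiplication protocol is needed on top of it.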

2.3. Li’s Symmetric Homomorphic Encryption Scheme

The symmetric homomorphic encryption scheme proposed by Li et al. [33] is described as follows.
(i) : is used to generate key for users as . and are primes with the condition that . is chosen from .
(ii) : is a small positive integer, which is denoted as ciphertext degree in this paper.
(iii) :

2.3.1. Properties of the Proposed Homomorphic Encryption

Homomorphic Multiplication. Let , be the ciphertexts of two plaintexts , . Then we have and for some random ingredients and . And we can obtain that

Homomorphic Addition

Readers may refer to [18] for details on the scheme.
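Because the formulas of this scheme are not reproduced in the text above, the following Python sketch is only one construction that is consistent with the description: a secret pair (s, q), a public prime p much larger than q, and a ciphertext degree d. It assumes encryption of the form s^d (r q + m) mod p and decryption of the form (c s^(-d) mod p) mod q. These formulas, the toy parameters, and all names are assumptions for illustration, and readers should consult the cited reference for the actual scheme.

```python
import secrets

P = 2**127 - 1  # public prime modulus (a Mersenne prime), toy size
Q = 2**31 - 1   # secret prime, much smaller than P


def keygen():
    """Secret key (s, q) with s drawn from the nonzero residues mod P."""
    s = secrets.randbelow(P - 2) + 2
    return s, Q


def encrypt(key, m, d=1):
    """Assumed form: c = s^d * (r*q + m) mod P, with a fresh random r."""
    s, q = key
    r = secrets.randbelow(2**20) + 1  # kept small so products stay below P
    return pow(s, d, P) * (r * q + m) % P


def decrypt(key, c, d=1):
    """Assumed form: m = (c * s^(-d) mod P) mod q."""
    s, q = key
    s_inv_d = pow(pow(s, d, P), -1, P)
    return (c * s_inv_d % P) % q


key = keygen()
c1, c2 = encrypt(key, 6), encrypt(key, 7)

# Addition works on ciphertexts of the same degree.
sum_plain = decrypt(key, (c1 + c2) % P, d=1)

# Multiplication adds the degrees of the two ciphertexts.
prod_plain = decrypt(key, c1 * c2 % P, d=2)
```

With these toy bounds the masked values never wrap around P, so `sum_plain` is 13 and `prod_plain` is 42. The degree has to be tracked alongside each ciphertext, since decrypting with the wrong degree yields garbage.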

2.4. Garbling Scheme

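A garbling scheme [34] consists of four algorithms, denoted by (Gb, En, De, Ev). A function f is transformed by Gb into a triple (F, e, d), where F is the garbled circuit and e and d are the encoding and decoding information, respectively. En uses e to encode an input into a garbled input, Ev evaluates F on the garbled input to obtain a garbled output, and De uses d to decode the garbled output and obtain the result.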
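To make the four algorithms concrete, the following Python sketch garbles a single AND gate. It is a toy construction under stated assumptions: SHA-256 stands in for the gate encryption, an all-zero tag marks the correct row, and the function names `gb_and`, `en`, `ev`, and `de` are hypothetical. It is not the optimized AES-based garbling used in practice.

```python
import hashlib
import secrets

LABEL = 16          # wire label length in bytes
TAG = bytes(LABEL)  # all-zero tag used to recognise the correct row


def _pad(a, b):
    """Derive a 32-byte one-time pad from two input wire labels."""
    return hashlib.sha256(a + b).digest()


def _xor(x, y):
    return bytes(i ^ j for i, j in zip(x, y))


def gb_and():
    """Gb for f = AND: returns (F, e, d)."""
    wa = [secrets.token_bytes(LABEL) for _ in range(2)]  # labels for bit 0/1
    wb = [secrets.token_bytes(LABEL) for _ in range(2)]
    wo = [secrets.token_bytes(LABEL) for _ in range(2)]
    rows = [_xor(_pad(wa[x], wb[y]), wo[x & y] + TAG)
            for x in (0, 1) for y in (0, 1)]
    secrets.SystemRandom().shuffle(rows)  # hide which row is which
    return rows, (wa, wb), wo


def en(e, x, y):
    """En: map the plaintext input bits to their wire labels."""
    wa, wb = e
    return wa[x], wb[y]


def ev(F, X):
    """Ev: find the one row that decrypts to a label followed by the tag."""
    a, b = X
    pad = _pad(a, b)
    for row in F:
        plain = _xor(row, pad)
        if plain[LABEL:] == TAG:
            return plain[:LABEL]
    raise ValueError("no row decrypted correctly")


def de(d, Y):
    """De: map an output label back to its bit; unknown labels are rejected."""
    return d.index(Y)


F, e, d = gb_and()
result = de(d, ev(F, en(e, 1, 1)))
```

The evaluator holds only one label per input wire, so it learns a single output label and cannot open the other rows. Rejecting unknown labels in `de` reflects the authenticity property that the security proof relies on.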

2.5. Noninteractive Commitment

A noninteractive commitment scheme [35] is also required in our paper, denoted by (). The distribution of is determined by the value of as .
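A hash-based commitment is one simple way to instantiate such a scheme, and the experiments reported later implement the commitment protocol on top of SHA-256. The following Python sketch is an assumption about what such an instantiation could look like rather than the authors' code, and the names `commit` and `check` are hypothetical.

```python
import hashlib
import hmac
import secrets


def commit(message: bytes):
    """Return (commitment, opening) for `message`."""
    opening = secrets.token_bytes(32)  # fresh randomness hides the message
    commitment = hashlib.sha256(opening + message).digest()
    return commitment, opening


def check(commitment: bytes, opening: bytes, message: bytes) -> bool:
    """Verify that (opening, message) opens `commitment`."""
    expected = hashlib.sha256(opening + message).digest()
    return hmac.compare_digest(commitment, expected)


c, o = commit(b"wire-label-0")
ok = check(c, o, b"wire-label-0")
bad = check(c, o, b"wire-label-1")
```

Here `ok` is true and `bad` is false. The random opening provides hiding, and the collision resistance of the hash function provides binding.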

2.6. Basic Cryptographic Subprotocols

In this section, we present a set of cryptographic subprotocols that will be used as subroutines when constructing the proposed protocol.

2.6.1. Outsourcing Secure Comparison Protocol (OSCP)

The value of is kept secure from the cloud and users. The value of is computed. is kept by the data owner (Algorithm 1).

1. The computation at Data owner:
Compute for the cloud.
2. The computation at cloud:
are chosen such that
3. The cloud computes the following value for the data owner:
, and sends .
4. Data Owner computes the following values:
,
and compares with .
If . Otherwise, .
2.6.2. Secure Equivalent Testing Protocol (SET)

With two ciphertexts and , SET is to compute and decide if the plaintexts are identical () (Algorithm 2).

Two ciphertexts are computed by the cloud as and .
1. The cloud computes and .
2. Check if or and computes
if , . else if , .
3. The value of is computed as follows:
.
If , set .
2.6.3. Secure Multiplication Protocol (SMP)

The protocol is described in Algorithm 3.

The values are computed and ; keeps ().
1. :
Choose
  ,
computes and sends to
2. computes:
  , , mod ,
The value of is sent to C
3. computes:
  ,
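The three steps above follow the shape of the well-known blinding-based secure multiplication over Paillier ciphertexts, which can be sketched in Python as follows. The sketch assumes that party C holds two ciphertexts and that a second party holds the decryption key; that second party, the toy primes, and all function names are hypothetical, and both roles run in one process purely for illustration. Python 3.8 or later is required for the modular inverse via `pow`.

```python
import math
import secrets

# Toy Paillier parameters (insecure sizes), with generator g = N + 1.
_P, _Q = 2**31 - 1, 2**61 - 1
N = _P * _Q
N2 = N * N
_LAM = (_P - 1) * (_Q - 1) // math.gcd(_P - 1, _Q - 1)
_MU = pow(_LAM, -1, N)


def enc(m):
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) % N2 * pow(r, N, N2) % N2


def dec(c):
    """Only the key holder can run this."""
    return (pow(c, _LAM, N2) - 1) // N * _MU % N


def secure_multiply(ea, eb):
    """Given E(a) and E(b), return E(a * b mod N)."""
    # Step 1 (C): blind both inputs with fresh random values.
    ra, rb = secrets.randbelow(N), secrets.randbelow(N)
    a_blind = ea * enc(ra) % N2  # E(a + ra)
    b_blind = eb * enc(rb) % N2  # E(b + rb)

    # Step 2 (key holder): decrypt, multiply, and re-encrypt.
    h = dec(a_blind) * dec(b_blind) % N  # (a + ra) * (b + rb)
    eh = enc(h)

    # Step 3 (C): strip the cross terms a*rb, b*ra, and ra*rb.
    s = eh * pow(ea, N - rb, N2) % N2
    s = s * pow(eb, N - ra, N2) % N2
    return s * pow(enc(ra * rb % N), N - 1, N2) % N2


product = dec(secure_multiply(enc(6), enc(7)))
```

The key holder only ever sees the uniformly blinded values a + ra and b + rb, and C only ever sees ciphertexts, so neither learns a, b, or their product.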
  
2.6.4. Secure Minimum out of 2 Numbers Protocol (SMIN2)

The protocol is described in Algorithm 4.

1. :
The function is chosen
for    to    do
.
if  :   then
,
end
else
end
,
and
,
end
Sends to
2. :
  , for
if     such that    then
(which means )
end
else
(which means )
end
2.7. Secure Circuit Protocol (SCP)

We denote the three parties of the protocol by , , and CCS and their respective inputs by , , or Their goal is to securely compute the function [34] (Algorithm 5).

Input:   has , has
Output:  
1. CCS:
Randomly selects for commitment, and sends to and .
Generates the random number and share it secretly as , sends to , and sends to .
2. : Select seed for pseudo random function and send to .
3. and : Generate corresponding circuit based on function . Random
selection of and generate the following commitments for all and :
.
.
   and send the following information to :
.
.
4. CCS: Abort if and report different values for these items.
5. and :
   sends decommitment
, , and to CCS
   sends decommitment , , and to CCS.
6. CCS: For , compute , , for the appropriate . If any call to
returns , then abort. Similarly, CCS knows the values and , and aborts if or
can not open the corresponding commitments of and : , , and .
Run and , then broadcasts and to and .
7. and : Compute .

For simplicity, we assume that . All communication between parties is via private point-to-point channels. Next, we assume that and can learn the same output , while CCS can get the garbled values only for the portion of the output wires corresponding to its own outputs. CCS cannot recover the output from these garbled values. This protocol uses a garbling scheme, a four-tuple of algorithms (Gb, En, De, Ev), as the underlying primitive. Gb is a randomized garbling algorithm that transforms a function into a triple. En and De are the encoding and decoding algorithms, respectively. Ev is an algorithm that produces a garbled output from a garbled input and a garbled circuit. Further, the commitment scheme provides an algorithm that can verify commitments.

3. Outsourcing Privacy-Preserving ID3 Decision Tree Algorithm in Malicious Model

In this section, we present our secure outsourcing ID3 decision tree in cloud computing using the homomorphic encryption scheme and subprotocols proposed in Section 2 as building blocks.

3.1. Main Concept

The aim is to privately compute ID3 over encrypted databases, and the key is to find privately the attribute for which is maximum. From the above description, the key value which needs to be calculated with other parties is .

Since all the data was encrypted and sent to the cloud, the cloud server can count the number of , using the SET protocol described in Section 2. Now, (3) can be executed as , and the calculation of the logarithmic operation can be performed in CSS. The value to be calculated is the value of , which can be easily determined using our SCP protocol as explained in Section 2. Then, all the parties can calculate the value of independently.

3.2. System Model

The system model is shown in Figure 1, which includes two data owners and cloud servers (cloud storage server , , and cloud computing server CCS). Each data owner owns a private data set that is encrypted and outsourced to the cloud storage server. Data owners can request the cloud servers to run ID3 data mining on the encrypted data. The CSS and CCS servers jointly carry out the steps of the outsourced privacy-preserving ID3 data mining algorithm, and after the algorithm has been executed, the final results are sent to the data owners. We assume that the data owners and the CSS server are semihonest participants and that CCS is a malicious participant.

3.3. Details of the Proposed Algorithm

Our secure outsourcing ID3 decision tree (SOID3) algorithm proceeds as follows:

   and run to generate the secret key and a public parameter of Li’s homomorphic encryption scheme. Further, each party shares with the other party and the cloud but shares only with itself.

Each party uses its key to encrypt every attribute value of its database, and then outsources the encrypted database to the CSS ( and ).

The and use the SET protocol to calculate the value of and for each attribute with each party .

Each party generates its Paillier public and private keys (), , and sends the public keys to the and .

CCS, , and jointly use the SCP protocol to compute . Here, has (), and has ().

Each party decrypts the received information, calculates it with the logarithmic operation of , and then encrypts it with its public key. Then, it sends it back to the cloud.

After getting the result, and use the SMIN2 protocol to select the ciphertext data with the minimum value and then further select the attribute label with the maximum information gain and return it to each participant.

The participants divide the data sets and build tree nodes. Then, go to Step until termination.

4. Security Analysis

In this section, we prove that the secure outsourcing ID3 decision tree (SOID3) algorithm can offer protection against the malicious cloud server.

Theorem 1. The SOID3 algorithm can achieve privacy for each party and the semihonest cloud storage server.

Proof. We consider security in the noncolluding semihonest model with a semihonest cloud server. Suppose there are two parties, and , and cloud storage server .
Let be the participants of all protocols. Consider three types of attackers ( , , and ) that can invade , , and . In the real model, and have data sets and , respectively, and has encrypted data sets and . Make a collection of honest participants. For all , indicates the output of . If is invaded, represents all views of participant in running protocol .
For each , the attacker , view in the runtime protocol can be defined asIn the ideal model, there exists an ideal model for function , and all participants can interact with the model . That is, Challenger and participant can send data and to . If or is , returns . Finally, can return to challenger . As mentioned earlier, is a collection of honest participants. For each participant in the collection, return the as output to . If is intruded on by a semihonest attacker, is still consistent with the output of in previous realistic models.
For all , in the ideal model, in the presence of independent simulators , , the view is . Therefore, the protocol is considered secure in the presence of noncolluding semihonest attackers.
Definition 2. Let be a deterministic functionality among parties in . Let be the subset of honest parties in . We say that securely realizes if there exists a set of transformations (where and so on) such that for all semihonest adversaries , for all inputs and auxiliary inputs , and for all parties the following holds:where denotes computational indistinguishability.

Theorem 2. The SOID3 algorithm is secure with the semihonest cloud storage server and the malicious cloud computing server.

Proof. First consider the case where or is corrupted. It is necessary to prove that, in the SCP protocol, the ideal model and the realistic model are not distinguishable. That is, in the following interactions, it is impossible to distinguish between the various types of interaction information and outputs of the participants in the ideal model and the real model.
In the real model, assume that there is a simulator that can simulate the behavior of a semihonest participant (or ) and that receives inputs and from the execution environment of the protocol. At the same time, the simulator can simulate the function , which sends all inputs and to the simulated . Since the simulator does not do anything computed by , there is no difference between the real and the simulated from the point of view of the execution environment.
In Step 2, and uniformly select the seed of the pseudorandom function (PRF), so by the security of the PRF the real model in Step 2 is indistinguishable from the ideal model.
In Step 3, we modify the simulator so that it knows in advance which commitments will be opened when it generates commitment . First, the simulator selects the random numbers , that mark which commitments are to be opened and calculates the values of , . At this point, the simulator has obtained the values of , , and , . Then, the simulator can commit to dummy values for the commitments that will not be opened. In this process, due to the hiding property of the commitment scheme, the real model and the ideal model are indistinguishable.
In Step 6a, the simulated and stop executing when = . Change the simulator to make . By the authenticity of the garbled circuit, has only a negligible probability of obtaining in = . Therefore, in this step, the real model and the ideal model are indistinguishable.
In Step 6b, the correctness of the garbled circuit guarantees that both the simulated and can produce their output. Therefore, if the protocol does not abort in Step 6a, we can modify the simulator to use a simulated garbled circuit that generates . We can simulate the output of and by simulating the instructions of . By the security of the garbling scheme, the real model is also indistinguishable from the ideal model in this step.
Therefore, in this protocol, the execution environment cannot distinguish between the real model and the ideal model, and the protocol is secure when is a malicious participant.

5. Performance Analysis

In this paper, we assume that CSS has strong computational power, so we ignore its computation time. Each data owner does not need to store the ciphertexts and only uses the public key to encrypt messages and the private key to decrypt ciphertexts.

In each iteration, first, each data owner will execute the SBD protocol and SMIN2 protocol with the cloud. There are two interactions in the SBD protocol and interactions in the SMIN2 protocol. Then, , , and CCS will execute 6 interactions in the SCP protocol. Finally, each data owner will execute 1 interaction when it goes to the new iteration. We assume that is the iteration time, so the communication traffic is at most .

In this paper, a secure average computing protocol based on SCP is implemented. The experimental server has an Intel(R) Xeon(R) E5-2620 CPU and 32 GB of memory and runs Ubuntu 16.04.4 LTS. In the experiment, AES-128 is chosen as the basic encryption primitive of the garbled circuit, the open-source JustGarble code is modified, and the commitment protocol is implemented based on SHA-256. Finally, the average values obtained from the experiments are as follows.

In our secure outsourcing ID3 decision tree (SOID3) algorithm experiment, two participants were tested with different numbers of records. The experimental results are shown in Figure 2.

Figure 2 shows that the time consumption of the client is very low, since the client is only responsible for encrypting and uploading data. In the cloud, and servers need to run the SCP protocol, which consumes considerable time (Table 1). The main reason is that the garbled circuit requires a large number of bit commitments to be processed, and follow-up work on performance will focus on this issue.

6. Conclusion

In this paper, we proposed a secure outsourcing ID3 decision tree algorithm for two parties in the malicious model. Our algorithm preserves the privacy of the users’ data, as well as that of the data mining scheme, with respect to the cloud servers. The parties obtain only the resulting trees and learn nothing about the data mining scheme. Moreover, the cloud servers cannot obtain any private information about the parties. In summary, our protocol offers protection against malicious cloud servers.

In the future, we intend to extend our algorithm to vertical and arbitrary partitioning in the malicious model. In addition, we plan to extend it to a general multiparty privacy-preserving framework suitable for other useful schemes, such as random decision trees, Bayes, SVM, and other data mining methods, and to wireless sensor networks [36, 37].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by National Key Research and Development Program of China (no. 2017YFB0803002), Basic Research Project of Shenzhen, China (no. JCYJ20160318094015947), National Natural Science Foundation of China (nos. 61872109 and 61771222), and Key Technology Program of Shenzhen, China (no. JSGG20160427185010977).