Wireless Communications and Mobile Computing

Volume 2018, Article ID 2385150, 10 pages

https://doi.org/10.1155/2018/2385150

## Securely Outsourcing ID3 Decision Tree in Cloud Computing

^{1}Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China

^{2}Jinan University, Guangzhou, China

^{3}Henan Normal University, Henan, China

^{4}School of Computer Science, Guangzhou University, China

Correspondence should be addressed to Zoe L. Jiang; zoeljiang@hit.edu.cn

Received 2 May 2018; Accepted 2 September 2018; Published 4 October 2018

Academic Editor: Jaime Lloret

Copyright © 2018 Ye Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

With the wide application of the Internet of Things (IoT), a huge amount of data is collected from IoT networks and must be processed, for example by data mining. Although outsourcing storage and computation to the cloud is popular, it may compromise the privacy of participants’ information. Cryptography-based privacy-preserving data mining has been proposed to protect the privacy of the participating parties’ data in this process. However, handling computation and analysis over ciphertexts from multiple participants is still an open problem. Moreover, existing algorithms rely on the semihonest security model, which requires all parties to follow the protocol rules. In this paper, we address the challenge of outsourcing the ID3 decision tree algorithm in the malicious model. In particular, to securely store and compute on private data, a two-participant symmetric homomorphic encryption scheme supporting addition and multiplication is proposed. To guard against malicious behavior of the cloud computing server, secure garbled circuits are adopted to construct a privacy-preserving weighted average protocol. Security and performance are analyzed.

#### 1. Introduction

In the modern Internet of Things (IoT), huge volumes of data are collected from sensor networks and need to be analyzed by highly effective techniques such as data mining. This process requires enormous computation and storage resources, which cloud computing can provide. However, it may leak participants’ private information. Privacy-preserving data mining (PPDM) based on encryption has emerged as a solution to this problem.

*Privacy-Preserving Data Mining Framework.* Based on different frameworks and theories, PPDM was originated independently by Lindell et al. [1] and Agrawal et al. [2] in 2002. Lindell’s framework is essentially a secure cryptography-based two-participant computation protocol without outsourcing; in other words, two parties interactively compute on their private inputs. Agrawal’s framework is essentially a single-participant, perturbation-based data storage and computation outsourcing algorithm; in particular, one party uploads perturbed data to a server for private computation. With the development of cloud computing and IoT, a multiparty storage and computation outsourcing framework is preferred.

Cryptography-based privacy-preserving data mining supporting one-party outsourcing has been studied using homomorphic encryption [3, 4]. However, multiple-key homomorphic encryption remains an open problem when multiple parties are involved in the outsourcing framework. For example, how can addition and multiplication be executed on ciphertexts encrypted under different public keys?

*Security Models.* Two security models are usually considered: the semihonest model and the malicious model. In the semihonest model, all users are required to follow the protocol rules, but dishonest users are allowed to obtain the internal states of the other users. In the malicious model, by contrast, corrupted users are allowed to deviate from the specified protocol. The adversary succeeds if it can obtain the results of these protocols.

*Data Distribution.* Three types of distributed datasets are defined in related work: horizontally distributed, vertically distributed, and arbitrarily distributed datasets. In a horizontally distributed dataset, each user holds a different subset of the records over the same set of attributes. In a vertically distributed dataset, users hold different attributes of the same records. In an arbitrarily distributed dataset, the data can be divided among the users and stored by them in any way.
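The difference between horizontal and vertical distribution can be illustrated with a toy table. The following Python sketch is only an illustration: the records, field names, and the two-party split are made up for this example and do not come from the paper.

```python
# Toy table: every record has the same attributes (all values are illustrative).
records = [
    {"id": 1, "outlook": "sunny", "windy": "no", "play": "yes"},
    {"id": 2, "outlook": "rain", "windy": "yes", "play": "no"},
    {"id": 3, "outlook": "sunny", "windy": "yes", "play": "no"},
    {"id": 4, "outlook": "rain", "windy": "no", "play": "yes"},
]


def split_horizontally(rows, cut):
    """Each party keeps whole records: same columns, different rows."""
    return rows[:cut], rows[cut:]


def split_vertically(rows, cols_a, cols_b, key="id"):
    """Each party keeps some columns of every record, linked by a shared key."""
    part_a = [{k: r[k] for k in (key, *cols_a)} for r in rows]
    part_b = [{k: r[k] for k in (key, *cols_b)} for r in rows]
    return part_a, part_b


h1, h2 = split_horizontally(records, 2)
v1, v2 = split_vertically(records, ["outlook"], ["windy", "play"])
```

An arbitrary distribution would mix both kinds of split, so that neither the rows nor the columns held by a party need to be complete.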

Malicious participants exist in real environments and may not follow the protocol. For example, they can intentionally tamper with data or abort the protocol at any time during its execution. To solve this problem, this paper combines noninteractive commitments with garbled circuits, studies an average computing protocol based on garbled circuits, and then proposes a framework for a secure cryptography-based two-participant protocol with data storage and computation outsourcing. The framework consists of two data owners and two cloud servers: a cloud storage server (CSS) and a cloud computing server (CCS). Each data owner has a horizontally distributed private database that is encrypted before being outsourced to the cloud for storage and computation.

##### 1.1. Our Contribution

We decompose the key function of distributed ID3 decision tree, , into counting, , sum, multiplication, and comparison.

In counting, we propose the Secure Equivalent Testing (SET) protocol to calculate the number of items for each attribute value based on the encrypted data.

To calculate the value of in malicious model, we implement the Outsourcing Secure Circuit (OSC) protocol.

To perform the sum and multiplication operations over ciphertext, we adopt the Paillier encryption system and implement the Secure Multiplication (SM) protocol.

To execute comparison over ciphertext, we adopt the Secure Minimum out of 2 Numbers (SMIN2) protocol.

##### 1.2. Related Work

*Distributed PPDM without Outsourcing.* Distributed PPDM without outsourcing mainly targets data that are stored and processed locally by the participants. Data mining methods over distributed data are decomposed into different operations, such as average calculation, calculation, and calculation of logarithmic vector inner product, and cryptographic techniques are then used to design privacy-preserving protocols for these operations. In 2002, Lindell and Pinkas [1] proposed a secure ID3 decision tree algorithm over horizontally partitioned data. They decomposed the distributed ID3 algorithm into logarithm calculation, polynomial evaluation, and data comparison, and then designed a secure logarithm protocol, a polynomial evaluation protocol, and a secure comparison protocol, thereby achieving privacy preservation in the distributed ID3 algorithm. In 2007, Emekci et al. [5] implemented a secure addition protocol based on secret sharing and extended the secure logarithm protocol from two parties to multiple parties, thus realizing a multiparty privacy-preserving ID3 method. However, the complexity of the algorithm increases exponentially as the amount of participant data grows. In 2012, Lory et al. [6] replaced the Taylor expansion in [1] with a Chebyshev polynomial expansion, further improving the computational efficiency of the secure logarithm protocol. However, the efficiency of these privacy-preserving protocols remains limited.

In contrast, in 2003, Vaidya et al. [7, 8] designed a multiparty privacy-preserving ID3 algorithm for vertically distributed datasets. They vectorized all attribute value information by constructing constraint sets and then computed on it using a secure intersection protocol, thus obtaining a privacy-preserving ID3 for vertically distributed datasets.

In 2007, Han and Ng [9] proposed a multiparty privacy-preserving ID3 method for arbitrarily distributed datasets. First, each participant’s dataset is vectorized, and the attribute value information is computed using a secure intersection protocol, among others. Then, the entropy of each attribute is computed using a secure logarithm protocol, among others. This yields a privacy-preserving ID3 decision tree classification method for arbitrarily distributed datasets. However, the client-side computation grows exponentially with the number of participants.

Li et al. [10] and Gao et al. [11] addressed Naive Bayes learning over aggregated, arbitrarily distributed databases.

*PPDM with Computation Outsourcing.* Cryptography-based privacy-preserving data mining involves many encryption and decryption operations during computation, which makes large-scale data processing difficult. As a remedy for resource-constrained settings, outsourcing has been widely used in cloud computing applications such as data sharing [12, 13], data storing [14, 15], data updating [16, 17], and social network analysis [18, 19]. In this context, secure outsourcing is used to move the computing or storage tasks of all participants to the cloud, greatly reducing the computing and storage load of each participant. In 2014, Liu et al. [3] adopted a new encryption scheme that supports both addition and multiplication over ciphertexts. In this scheme, most of the computation is performed in the cloud, which reduces the workload of the data owner. However, the scheme is limited to data mining by a single party. Chen et al. [20] designed new algorithms for secure outsourcing of modular exponentiations. In 2015, Bost et al. [21] proposed privacy-preserving hyperplane decision, Naive Bayes, and decision tree classification algorithms and proved, in the semihonest secure two-party computation model, that these schemes satisfy semantic security. Their protocols can also be combined with adaptive boosting to further improve accuracy, and the resulting building blocks can serve as a library for constructing privacy-preserving classifiers, laying a foundation for the future development of privacy-preserving classification.

*PPDM with Multiparticipant Data Storage and Computation Outsourcing.* In 2013, Peter et al. [22] proposed a new solution for outsourcing multiparty computation. Such a technique can be used in our setting, but, as in the security analyses of previous works, it achieves security only in the semihonest model. In [23], a new protocol was proposed to achieve data mining for two parties. In [24], association rule mining was addressed in the malicious model. In [25], privacy-preserving *KNN* classification was addressed. In [26], the deep learning task was addressed. Besides the above related work, several fundamental secure algorithms have been proposed and considered in the malicious model, such as dynamic homomorphic encryption [27, 28], authentication [29, 30], and lightweight multiparty computation [31]. However, to the best of our knowledge, no existing study has considered a method for outsourcing computation in the malicious model.

In this study, the secure outsourcing of ID3 data mining is considered in the malicious model for the cloud environment. We show how to solve the outsourcing problem for ID3 protocol over horizontally partitioned data.

#### 2. Preliminaries

In this section, we present a brief overview of the preliminaries used in this paper, including the ID3 decision tree algorithm, Paillier’s homomorphic encryption scheme, and the other related protocols.

##### 2.1. Distributed ID3 Decision Tree Algorithm

The ID3 algorithm is described as follows. It builds a decision tree in a top-down manner from the information in the samples. Starting at the root, the attribute that best classifies the samples is selected at each node. The best attribute is determined by the *information gain*. The *information gain* of an attribute is defined as
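For reference, the textbook ID3 definition can be written as below. The symbols are an assumption of this sketch and not necessarily the paper's own notation: $T$ is the set of training tuples, $C$ is the class attribute with values $c_i$, $A$ is a general attribute with values $a_j$, and $T(c_i)$ and $T(a_j)$ are the subsets of $T$ with class value $c_i$ and attribute value $a_j$, respectively.

$$
\mathrm{Gain}(A) = H_C(T) - H_C(T \mid A)
$$

$$
H_C(T) = -\sum_{i} \frac{|T(c_i)|}{|T|}\log\frac{|T(c_i)|}{|T|}, \qquad
H_C(T \mid A) = \sum_{j} \frac{|T(a_j)|}{|T|}\, H_C\bigl(T(a_j)\bigr)
$$

The attribute with the largest gain, or equivalently the smallest conditional entropy $H_C(T \mid A)$, is chosen for the split.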

Assume that there are two parties, each holding one database. The databases share the same set of general attributes (columns); each general attribute takes several discrete values, and there is one class attribute.

Without considering privacy, each party shares his own and to all other parties. As a result, any party can calculate and .

where is the subset of with tuples that have value for class attribute . equals the set of transactions with class attribute set to in database .

Then the value of can be calculated as

where is the subset of with tuples that have values for attribute and for class attribute .

Therefore, (3) can be computed easily if each party sends the other parties all of the corresponding counts from its database. Each party then sums the received values together with the counts from its own database and completes the computation.

Then each party can calculate value at its own side.
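The plaintext computation described in this subsection, without any privacy protection, can be sketched as follows. This is a minimal Python illustration under stated assumptions: records are dictionaries, the function names are hypothetical, entropy uses base-2 logarithms, and the local counts are exchanged in the clear, which is exactly what the secure protocols are meant to avoid.

```python
import math
from collections import Counter


def local_counts(rows, attr, cls):
    """Counts that one party can derive from its own records only."""
    by_value = Counter(r[attr] for r in rows)
    by_value_class = Counter((r[attr], r[cls]) for r in rows)
    by_class = Counter(r[cls] for r in rows)
    return by_value, by_value_class, by_class


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def gain(parties, attr, cls):
    """Information gain of `attr` after summing every party's local counts."""
    by_value, by_value_class, by_class = Counter(), Counter(), Counter()
    for rows in parties:
        v, vc, c = local_counts(rows, attr, cls)
        by_value += v
        by_value_class += vc
        by_class += c
    total = sum(by_class.values())
    conditional = 0.0
    for a, n_a in by_value.items():
        per_class = [n for (v, _), n in by_value_class.items() if v == a]
        conditional += n_a / total * entropy(per_class)
    return entropy(list(by_class.values())) - conditional


# Two horizontally partitioned toy databases (illustrative values).
party_1 = [{"windy": "yes", "outlook": "sunny", "play": "no"},
           {"windy": "no", "outlook": "sunny", "play": "yes"}]
party_2 = [{"windy": "yes", "outlook": "rain", "play": "no"},
           {"windy": "no", "outlook": "rain", "play": "yes"}]

gain_windy = gain([party_1, party_2], "windy", "play")
gain_outlook = gain([party_1, party_2], "outlook", "play")
```

In this toy data, `windy` determines the class completely while `outlook` carries no information about it, so ID3 would split on `windy`.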

##### 2.2. Paillier’s Homomorphic Encryption Scheme

Homomorphic encryption is a special type of encryption in which the result of applying an algebraic operation to plaintexts can be obtained by applying another algebraic operation (possibly the same one) to the corresponding ciphertexts. Thus, even a user who does not know the plaintexts can still obtain the result of applying that algebraic operation to them.

Let and be two plain texts with encryptions and , respectively.

The Paillier encryption scheme [32] is described as follows:
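The additive homomorphism of Paillier's scheme can be checked with a toy implementation. The sketch below follows the textbook construction with the common choice g = n + 1; the tiny primes and the function names are chosen for illustration only and offer no security.

```python
import math
import secrets


def keygen(p, q):
    """Textbook Paillier keys from two primes, with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because g = n + 1 gives L(g^lam mod n^2) = lam
    return (n, n + 1), (lam, mu)


def encrypt(pk, m):
    """c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pk
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n2) * pow(r, n, n2) % n2


def decrypt(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    n, _ = pk
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


pk, sk = keygen(1789, 1861)  # toy primes, far too small for real use
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 20), encrypt(pk, 22)
sum_ct = c1 * c2 % n2        # multiplying ciphertexts adds the plaintexts
scaled_ct = pow(c1, 3, n2)   # raising to a constant k multiplies the plaintext by k
```

Decrypting `sum_ct` gives the sum of the two plaintexts, and decrypting `scaled_ct` gives three times the first plaintext, both reduced modulo n.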

##### 2.3. Li’s Symmetric Homomorphic Encryption Scheme

The symmetric homomorphic encryption scheme proposed by Li et al. [33] is described as follows.

(i) : is used to generate key for users as . and are primes with the condition that . is chosen from .

(ii) : is a small positive integer, which is denoted as **ciphertext degree** in this paper.

(iii) :

###### 2.3.1. Properties of the Proposed Homomorphic Encryption

*Homomorphic Multiplication. *Let , be the ciphertexts of two plaintexts , . Then we have and for some random ingredients and . And we can obtain that

*Homomorphic Addition*

Readers may refer to [18] for details on the scheme.
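The formulas of the scheme are not reproduced in the text above, so the following Python sketch should be read as an assumption rather than as the construction of [33]. It implements one symmetric scheme of this general shape: a secret key (s, q), a modulus p much larger than q, and a degree-d ciphertext of the form s^d (r q + m) mod p. All names and parameter sizes are hypothetical and were chosen only so that the two homomorphic properties hold for a few operations.

```python
import secrets

P = 2**127 - 1  # large prime modulus (assumed here to be public)
Q = 2**31 - 1   # much smaller prime; plaintexts lie in [0, Q)


def keygen():
    """Secret key (s, q), with s drawn at random from 1..P-1."""
    return secrets.randbelow(P - 1) + 1, Q


def encrypt(sk, m, d=1):
    """Degree-d ciphertext s^d * (r*q + m) mod P with a small random r."""
    s, q = sk
    r = secrets.randbelow(2**16) + 1
    return pow(s, d, P) * (r * q + m) % P


def decrypt(sk, c, d):
    """Remove s^d, then reduce modulo q to drop the random multiple of q."""
    s, q = sk
    return c * pow(s, -d, P) % P % q


def mul(c1, c2):
    """Product of ciphertexts; the degrees of the inputs add up."""
    return c1 * c2 % P


def add(c1, c2):
    """Sum of two ciphertexts that have the same degree."""
    return (c1 + c2) % P


sk = keygen()
c1, c2 = encrypt(sk, 6), encrypt(sk, 7)
product = decrypt(sk, mul(c1, c2), 2)  # degree 1 + 1 = 2
total = decrypt(sk, add(c1, c2), 1)    # degree stays 1
```

Decryption is correct only while the masked value r q + m, and any product or sum of such values, stays below P, which is why the random term is kept small in this sketch. The ciphertext degree tells the key holder which power of s to strip.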

##### 2.4. Garbling Scheme

A garbling scheme [34] consists of four algorithms. A circuit is transformed by Gb into a garbled circuit together with encoding and decoding information, which are used by the encoding and decoding algorithms, respectively. The output of the garbled circuit can then be decoded to obtain the result.
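To make the roles of garbling, encoding, evaluation, and decoding concrete, the following Python sketch garbles a single two-input gate. It is a simplified illustration, not the scheme of [34]: the function names `gb`, `en`, `ev`, and `de`, the use of SHA-256 as the row cipher, and the zero-padding check are all assumptions made for this example.

```python
import hashlib
import secrets

LABEL = 16  # bytes per wire label


def _pad(la, lb):
    """One-time pad derived from a pair of input labels."""
    return hashlib.sha256(la + lb).digest()


def _xor(x, y):
    return bytes(a ^ b for a, b in zip(x, y))


def gb(gate):
    """Garble one gate: return (garbled table, encoding info, decoding info)."""
    wires = {w: (secrets.token_bytes(LABEL), secrets.token_bytes(LABEL))
             for w in ("a", "b", "out")}
    table = []
    for a in (0, 1):
        for b in (0, 1):
            # Output label followed by zero bytes that mark a valid row.
            plain = wires["out"][gate(a, b)] + bytes(LABEL)
            table.append(_xor(_pad(wires["a"][a], wires["b"][b]), plain))
    secrets.SystemRandom().shuffle(table)  # hide which row belongs to which input
    e = {"a": wires["a"], "b": wires["b"]}
    d = {wires["out"][0]: 0, wires["out"][1]: 1}
    return table, e, d


def en(e, a, b):
    """Encode plaintext input bits as wire labels."""
    return e["a"][a], e["b"][b]


def ev(table, labels):
    """Evaluate: only the row matching the held labels decrypts to valid padding."""
    pad = _pad(*labels)
    for row in table:
        plain = _xor(pad, row)
        if plain[LABEL:] == bytes(LABEL):
            return plain[:LABEL]
    raise ValueError("no row decrypted correctly")


def de(d, out_label):
    """Decode the garbled output label into a plaintext bit."""
    return d[out_label]


F, e, d = gb(lambda a, b: a & b)  # garbled AND gate
result = de(d, ev(F, en(e, 1, 1)))
```

The evaluator sees only random-looking labels and table rows, so it learns the output bit only if it is also given the decoding information.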

##### 2.5. Noninteractive Commitment

A noninteractive commitment scheme [35] is also required in our paper, denoted by (). The distribution of is determined by the value of as .
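As an illustration of what a noninteractive commitment provides, the sketch below uses a common hash-based construction. This is a swapped-in, heuristic example and not necessarily the scheme of [35]; the function names and the 32-byte opening value are assumptions.

```python
import hashlib
import hmac
import secrets


def commit(message: bytes):
    """Return (commitment, opening); the committer keeps the opening secret."""
    opening = secrets.token_bytes(32)
    commitment = hashlib.sha256(opening + message).digest()
    return commitment, opening


def verify(commitment: bytes, message: bytes, opening: bytes) -> bool:
    """Check that (message, opening) matches a previously sent commitment."""
    expected = hashlib.sha256(opening + message).digest()
    return hmac.compare_digest(commitment, expected)


com, opening = commit(b"42")
accepted = verify(com, b"42", opening)
rejected = verify(com, b"43", opening)
```

The committer sends `com` in a single message, which is what makes the scheme noninteractive, and later reveals the message together with the opening value. The random opening hides the message, and changing the message afterwards would require finding a hash collision.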

##### 2.6. Basic Cryptographic Subprotocols

In this section, we present a set of cryptographic subprotocols that will be used as subroutines when constructing the proposed protocol.

###### 2.6.1. Outsourcing Secure Comparison Protocol (OSCP)

The value of is kept secure from the cloud and users. The value of is computed. is kept by the data owner (Algorithm 1).