Abstract

Privacy protection is a challenging concern in data collecting, publishing, and mining when personal information is held by untrustworthy cloud services with unpredictable risks of privacy leakage. In this paper, we formulate an information-theoretic model for privacy protection and present a concrete solution to the theoretical architecture of privacy computing from the perspectives of quantification and optimization. Within this model, metrics of privacy and utility for randomized response (RR) schemes satisfying differential privacy are derived as average mutual information and average distortion rate. Finally, a discrete multiobjective particle swarm optimization (MOPSO) is proposed to search for optimal RR distorted matrices. To the best of our knowledge, our approach is the first to optimize RR distorted matrices using discrete MOPSO. In detail, particles' position and velocity are redefined in a problem-guided initialization and velocity updating mechanism, and two mutation strategies are introduced to escape from local optima. The experimental results illustrate that our approach outperforms existing state-of-the-art works and can contribute optimal Pareto solutions of extensive RR schemes to future study.

1. Introduction

With the rapid development of artificial intelligence [1, 2] and mobile computing, big data has spawned a booming series of services and has come to be regarded as a ubiquitous, fundamental resource. New paradigms such as smart city [3], Takeout [4], and mobile pay [5] are growing at an amazing speed. Undoubtedly, these applications can provide users with accurate and personalized services of great convenience. However, large amounts of users' personal information, including consuming habits, income status, and location, are collected beyond the control of their real owners. In particular, massive privacy incidents, including the AOL search logs scandal (2006) [6, 7], the de-anonymization and cancellation of the Netflix Prize (2009) [8, 9], the complaint over privacy settings by Facebook (2009) [10], the privacy breach of New York taxi trips (2016) [11], and the massive breach of Equifax data (2017) [12], have heightened the whole society's concern over sensitive data collecting, releasing, and mining. This can be regarded as a profound privacy challenge and a motivating research field in privacy computing. On the whole, privacy computing can be divided into three imperative categories: privacy preserving data collecting (PPDC), privacy preserving data mining (PPDM), and privacy preserving data publishing (PPDP).

Fruitful works on anonymity-based traditional privacy preserving methods [13–17], as well as randomization-based approaches [18–20], have been prominently studied and extensively applied in fields ranging from social networks [21], location-based services [22], and smart healthcare [23–25] to intelligent transportation [26]. However, most traditional approaches are designed for specific application scenarios. Notably, the newly arising challenges in privacy computing can be summarized into the following four urgent issues: (1) there is still a lack of rigorous theoretical architecture and fundamental principles of privacy computing that can systematically quantify privacy and describe the relationship between protection level and utility loss; (2) large amounts of personal data are collected by untrustworthy collectors, which enlarges the gap between owners and collectors in the risk of privacy leakage; (3) new computing paradigms, such as MapReduce [27], Storm [28], Spark [29], and deep learning, need further security enhancement from the perspective of privacy computing; (4) the optimization of the trade-off between individual privacy and data utility in privacy computing schemes should be further studied under the premise of satisfying users' diversified requirements.

In this paper, our work concentrates on the most challenging issue: quantifying and optimizing the trade-off between data privacy and utility in privacy computing. Derived from the concept of privacy computing, randomized response [30, 31] is highly efficient in PPDC and capable of masking private data while maintaining the ability to reconstruct aggregate information with tolerable errors. In RR schemes, individuals' private data is transformed by RR distorted matrices (RR matrices for short) into nonsensitive disguised data. Although RR schemes have been widely studied in sensitive surveys, the primary problem of searching for optimal RR matrices is rarely explored. Additionally, the search space of RR matrices is astronomical and infeasible to exploit with brute-force methods. We transform the comparison of RR schemes into a multiobjective optimization problem, for which multiobjective particle swarm optimization (MOPSO) [32] is extremely fit, providing a much more diversified set of optimal Pareto fronts than other solutions. Thus, different RR schemes can be compared in a quantified manner by two conflicting metrics, privacy and utility, under the information-theoretic model. Moreover, differential privacy can provide a rigorous and fundamental guarantee for the optimization of RR schemes in privacy computing. Overall, MOPSO is utilized to search for optimal RR matrices guided by the metrics of privacy and utility derived from the information-theoretic model. Our experimental results show a satisfying improvement over existing works, and a wide range of optimal Pareto fronts is obtained for extensive schemes. The contributions of our work can be summarized as follows:
(i) We formulate the information-theoretic model for privacy quantification and present a solution to the theoretical architecture of privacy computing.
(ii) We derive metrics of privacy and utility under the information-theoretic model for RR matrices which satisfy differential privacy.
(iii) We propose a specialized discrete MOPSO to find the optimal set of RR matrices under two conflicting goals: privacy and utility.

The remainder of our paper is structured as follows. Section 2 makes a systematic review of related works in privacy computing from the perspectives of quantification and optimization in PPDC, PPDP, and PPDM. In Section 3, the information entropy model in privacy computing is illustrated, and the multiobjective optimization problem for quantifying privacy and data utility of RR schemes under ε-differential privacy is modeled. Section 4 presents a solution to searching for the optimal set of RR matrices by discrete MOPSO, and Section 5 discusses the experimental results. Conclusion and future work are provided in Section 6.

2. Related Work

Roughly speaking, there are three typical scenarios in privacy computing, namely, data collecting, data publishing, and data mining, according to the convoluted life-cycle of private data. In fact, privacy computing (detailed in Section 3) emphasizes quantification and optimization.

Privacy preserving data collection (PPDC), also regarded as local privacy, is a strategy that perturbs data locally before it reaches the untrustworthy data-collector. The goal of PPDC is to allow accurate estimation of population statistics while simultaneously guaranteeing the privacy of individuals. Wang et al. [33] compared randomized response with the Laplace mechanism under differential privacy in PPDC and recommended an RR-based scheme with less utility loss. RAPPOR [31] is also an RR-based application of differential privacy in Chrome, which allows crowd-sourcing statistics from client-end browsers with a rigorous ε-differential privacy guarantee. PPDC can achieve meaningful aggregate inferences while preserving the privacy of client-side users. Works in [30, 34–37] can be categorized into the scenario where the data-collector wishes to grasp the distribution of the original data while each client is only required to submit a perturbed version. Intuitively, these methods are similar to reconstruction from noised data. Notably, randomized response and local differential privacy [38] are two powerful tools in PPDC. Coincidentally, randomized response can satisfy ε-differential privacy under specified parameter settings. However, FRAPP [39] generalized each element in the distorted matrix by only a single metric of accuracy, while Agrawal and Srikant [36] measured data reconstruction under differential privacy only in terms of data utility, by mean square error, in the process of data mining. Huang and Du proposed the scheme OptRR [34], which employs SPEA2 to find optimal distorted matrices in a heuristic way. Unfortunately, they mistook the dominance relationship of Pareto solutions. Additionally, the upper bound of privacy derived from the maximum a posteriori (MAP) estimate is not rigorous compared with a differential privacy budget. In our work, the two metrics of individual privacy and data utility are optimized simultaneously under a unified multiobjective optimization model, which presents a much more thorough illustration of RR matrices than FRAPP. Additionally, a discrete MOPSO is proposed to provide more diversified optimal solutions than SPEA2: particles' position and velocity are both redefined, and a novel velocity updating mechanism is adopted together with two mutation strategies. The experiments show that our approach outperforms SPEA2 and FRAPP.

Intuitively, there is no prominent distinction between privacy preserving data publishing (PPDP) and privacy preserving data mining (PPDM), for the reason that the two models are built on the same assumption of a trusted data-curator. Also, privacy preserving data collection can be covered by PPDM when the collected data directly serves data mining; according to the life-cycle of data and its typical applications, PPDC can be studied independently. On the one hand, there are two settings in PPDP, namely, interactive and noninteractive. Thereinto, k-anonymity [13] and its variants [14–17] are extensively studied on the assumption that data in tabular form can be divided into sensitive attributes and quasi-identifiers. However, linkage attacks, differential attacks, and background knowledge attacks have been proposed in sequence to break these privacy preserving models. Thus, anonymity-based approaches fail to provide strong privacy guarantees compared with the rigorous definition of differential privacy. Li et al. [40] preceded k-anonymity with randomized sampling to achieve (ε, δ)-differential privacy, which shed light on the gap between semantic security and syntactic security in PPDP. On the other hand, privacy preserving data mining was first proposed by Agrawal and Srikant in 2000 [36]. Later, Agrawal and Aggarwal [35] utilized the EM algorithm to preserve individual privacy in the reconstruction of data distributions. In [30], Du and Zhan proposed an RR-based scheme to build decision trees on perturbed data, and Agrawal and Haritsa [39] presented a solution for privacy preserving computation on perturbed data with multiple attributes from multiple clients.

Although several lines of privacy computing fall into the scopes of PPDC, PPDP, and PPDM, there is still a lack of a systematic architecture of privacy computing with fundamental support from rigorous and provable theory. Our work systematically illustrates the domain of privacy computing and builds a new paradigm of multiobjective optimization aimed at quantifying individual privacy and data utility in PPDC. Guaranteed by differential privacy in randomized response, our discrete MOPSO presents an excellent solution to the trade-off between privacy and data utility in optimal RR matrices and performs better than state-of-the-art schemes.

3. Privacy Computing

3.1. Information-Theoretic Model

Privacy computing [41] can be defined as an open theoretical system comprised of computing models, application scenarios, quantitative definitions, and powerful privacy preserving techniques. We quantify the concept of privacy under the general definitions of information theory rather than in terms restricted to sensitive or personal data from laws and regulations.

Inspired by the process of information transfer in communication systems, an intrinsic similarity can be exploited in privacy computing: information transferred to the receiver side is affected by a noisy channel, and this noise can be regarded as a beneficial protection. Here, we review information theory basics before demonstrating how to quantify privacy and utility in privacy computing.

Information theory [42] is a mathematical framework for quantifying information transmission in communication systems. Figure 1 depicts the general channel model between source and receiver, which is formalized in Scenario 1.

Scenario 1. The probability space is comprised of a discrete memoryless information source X, receiver Y, and channel p(y | x):

{X, p(x_i)} → p(y_j | x_i) → {Y, p(y_j)}

Source X and receiver Y are discretized into mutually exclusive and exhaustive categories x_i and y_j with corresponding probabilities p(x_i) and p(y_j), as shown in Abbreviations. From the perspective of the receiver's observations, the information source is an abstract mathematical representation of a physical entity that produces a succession of discrete symbols in a randomized manner. The transition probability p(y_j | x_i) characterizes the channel with distortion noise. Overall, the transmission flow can be formulated as

X → channel p(y | x) → Y

Information theory quantifies how much information a receiver Y carries about the source X, which is applicable to the scenario of measuring how much original information can be estimated from distorted data. Applying Shannon's mathematical theory to privacy computing, we can derive several quantitative representations of information.

Pertinent to an information source in the probability space, self-information denotes the uncertainty of event x_i before it occurs:

I(x_i) = −log p(x_i)

The prior entropy of X denotes the average uncertainty about X before Y is received and the probability of secret leakage. The Shannon entropy of X is defined as

H(X) = −Σ_i p(x_i) log p(x_i)

The posterior entropy denotes the average uncertainty about X after Y is received and the chance of attackers detecting secrets from the output:

H(X|Y) = −Σ_j p(y_j) Σ_i p(x_i|y_j) log p(x_i|y_j)

Mutual information denotes how much of the source is preserved at the receiver and what an observer gains from the information leakage:

I(x_i; y_j) = log [ p(x_i|y_j) / p(x_i) ]

Average mutual information can be calculated as

I(X;Y) = H(X) − H(X|Y) = Σ_i Σ_j p(x_i, y_j) log [ p(x_i, y_j) / (p(x_i) p(y_j)) ]

On the other hand, a decoding function F(y_j) = x_i can be used as the inverse process of the channel when y_j is received from source X.

When signal y_j is received and decoded as x* = F(y_j), the conditional error rate can be measured as follows:

p(e | y_j) = 1 − p(x* | y_j)

It is worth noting that privacy has many similarities with information theory and anonymity from a mathematical point of view. Thus, privacy can be generalized and formulated as follows.

Definition 1 (privacy). Given an i.i.d. vector X = (x_1, x_2, …, x_n) with probabilities p(x_i), the privacy of X can be defined by the information entropy in the probability space:

H(X) = −Σ_i p(x_i) log p(x_i)

When the variables X and Y are not independent, mutual information can be adopted to measure the information contained in one process about another through a noisy channel environment:

I(X;Y) = H(X) − H(X|Y)

In practice, privacy can be estimated from distorted data given certain prior knowledge, and data utility can be measured by the mean square error (MSE) between the original data X and its estimator X̂ [34]:

MSE = E[(X̂ − X)²]
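As a small numerical illustration of Definition 1 (our own Python sketch; the distributions below are assumptions, not data from the paper):

import numpy as np

# Privacy of a categorical attribute as Shannon entropy.
p = np.array([0.5, 0.25, 0.125, 0.125])
h = -np.sum(p * np.log2(p))
print(f"H(X) = {h} bits")        # 1.75 bits for this distribution

# Utility as the mean square error between the true distribution and an
# estimate reconstructed from distorted data (cf. [34]); p_hat is made up.
p_hat = np.array([0.52, 0.24, 0.12, 0.12])
print(f"MSE = {np.mean((p_hat - p) ** 2):.6f}")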

Equipped with the information-theoretic model of privacy above, we establish an open architecture of privacy computing with 7 key components: private information, participants, metrics, constraints, operations, paradigms, and a supplementary set, as shown in Abbreviations.

In detail, the distribution of the private information plays a significant role in the reconstruction and estimation of the original attributes. The participants denote the set of parties involved, including owners of private information, analysts, attackers with arbitrary background knowledge, and curators and collectors of private information. The metrics quantify privacy and utility. The constraints involve restrictions on time and scenarios, such as data collection, data releasing, and data mining. The operations cover the whole life-cycle of privacy, including PPDC, PPDP, and PPDM. The paradigms represent mature privacy computing techniques, such as randomized response, differential privacy, and k-anonymity. Additionally, the supplementary set completes the privacy computing system. Theoretically, the relationship between privacy and utility in the system of privacy computing can be formulated through metric functions over these components (given concretely in Section 3.3).

In summary, privacy and utility are quantified by metric functions taking the tuple components as input parameters (the detailed metric functions will be presented in Section 3.3). Specifically, in the concrete setting of our work, the private information is the categorical data collected from distributed clients; the participants are the client-side users and the server-side untrustworthy collector; the metrics evaluate privacy and utility; the constraints depict the scenario in which data collection occurs; the operation is randomized response in PPDC; and the paradigm is randomized response guaranteed by differential privacy (the concrete usage will be described in Section 4).

Intuitively, the scope of privacy computing over the life-cycle of private data is illustrated in Figure 2. When private data is collected, published, mined, or stored without protection, the outcome is no better than destroying the data directly.

3.2. Randomized Response under Differential Privacy

Randomized response aims at eliciting information in sensitive surveys and can be used to derive unbiased estimates from untruthful responses. Typically, aggregate information and a relatively accurate reconstruction can be estimated and built from distorted responses, while local individual privacy is protected by RR under differential privacy.

Treated as the most rigorous privacy preserving tool, differential privacy (DP) [43, 44] is widely studied in privacy computing. DP provides effective protection over the outcome no matter whether an individual opts in or out of the database.

Definition 2 (ε-differential privacy [18]). Given two statistical databases D and D′ which satisfy |D Δ D′| = 1 (Hamming distance), a randomized function K achieves ε-differential privacy on condition that

Pr[K(D) ∈ S] ≤ e^ε · Pr[K(D′) ∈ S]

where S denotes any output within the range of K, and ε is the privacy budget, which balances the privacy preserving level and the utility of the outputs.

It is well accepted that the Laplace mechanism, which calibrates i.i.d. noise from the Laplace distribution, satisfies differential privacy. However, compared with randomized response, the Laplace mechanism performs poorly in MSE under the single metric of data utility [33]. Thus, our work emphasizes randomized response for sensitive categorical data collection in privacy computing, as shown in Scenario 2 and Figure 3.

Scenario 2. Given individuals U_1, …, U_N, each holding a categorical private value from the domain of k categories {x_1, …, x_k}, each individual submits a distorted version within the same category domain to the untrustworthy data-collector, produced by a designed distorted matrix in randomized response.

Let π_i denote the proportion of category x_i in the original sensitive data and λ_j denote the proportion of category x_j in the distorted data, with Σ_i π_i = 1 and Σ_j λ_j = 1. Our designed distorted matrix is P = [p_{ji}], where p_{ji} denotes the probability that the disguised output is x_j when the original input is x_i, so that each column of P sums to 1 and we can get the following equation:

λ_j = Σ_i p_{ji} π_i, i.e., λ = P π

Compared with Scenario 1, we can bridge the gap between information theory and the RR scheme in Scenario 2 of privacy computing. From the perspective of the data-collector, which plays the role of the receiver in the information-theoretic model, the observed count n_j of each category x_j in the distorted data and the total number of records N can be used to derive the maximum likelihood estimate (MLE) of λ:

λ̂_j = n_j / N

In [37], Agrawal et al. proposed an iterative approach to estimate π, which however incurs high running time. Thus, we derive π̂ by matrix transformation based on (16); in other words, the MLE of π can be obtained when P is invertible:

π̂ = P^(−1) λ̂

We regard invertibility as a constraint on P, requiring that the determinant of P is nonzero. It can be concluded that P is the key factor that determines the accuracy of estimating π. Undoubtedly, as the amount of data grows, the reconstructed distribution π̂ will come close to the true distribution of the original data. However, how to design distorted matrices, and how to compare which matrix is optimal, remain fundamentally unsolved. In [33], a designed distorted matrix with larger diagonal elements is deemed to achieve better data utility, while the metric of privacy is thoroughly ignored and no diversified set of optimal distorted matrices is provided. We now take a concrete example of binary data collection for better understanding [45].
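The matrix-inversion estimate can be demonstrated end to end; the sketch below is our own (the matrix, the true distribution, and the sample size are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 100_000                          # categories and number of records
true_p = np.array([0.4, 0.3, 0.2, 0.1])    # hidden original distribution

# Column-stochastic RR matrix with P[j, i] = Pr(report j | true value i);
# diagonal dominance keeps it invertible.
P = np.full((k, k), 0.1)
np.fill_diagonal(P, 0.7)

originals = rng.choice(k, size=n, p=true_p)
# Each client perturbs its value locally before submission.
distorted = np.array([rng.choice(k, p=P[:, x]) for x in originals])

# The collector observes only the distorted proportions ...
lam_hat = np.bincount(distorted, minlength=k) / n
# ... and inverts P to obtain the MLE of the original proportions.
pi_hat = np.linalg.solve(P, lam_hat)
print(np.round(pi_hat, 3))                 # close to [0.4, 0.3, 0.2, 0.1]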

Example 1. Given a binary attribute with true sensitive proportion π and the 2 × 2 distorted matrix P = [[p_{11}, p_{10}], [p_{01}, p_{00}]], the unbiased estimator and its variance are as follows:

π̂ = (λ̂ − p_{10}) / (p_{11} − p_{10}), Var(π̂) = λ(1 − λ) / (N (p_{11} − p_{10})²)

where λ̂ is the observed proportion of the sensitive response.

Derived from differential privacy in Definition 2, we can reach the conclusion that, as long as max_j p_{ji} / min_j p_{ji} ≤ e^ε holds for every column of the distorted matrix, ε-differential privacy is achieved. When p_{11} = e^ε/(1 + e^ε) and p_{10} = 1/(1 + e^ε), the designed distorted matrix achieves the optimal utility [33]. Additionally, when a participant responds untruthfully with a deniable probability of 1/4 in an RR survey, we get e^ε = (3/4)/(1/4) = 3. That is, ln 3-differential privacy can guarantee the process of the sensitive information survey [31].
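The ln 3 budget and the unbiased estimator of Example 1 can be checked numerically; this sketch is ours, with an assumed true proportion:

import numpy as np

# Respond truthfully with probability 3/4, lie with probability 1/4.
P = np.array([[0.75, 0.25],
              [0.25, 0.75]])

# The per-column max/min ratio is bounded by e^eps, so eps = ln 3 here.
eps = float(np.log(P.max(axis=0) / P.min(axis=0)).max())
print(f"eps = {eps:.4f}, ln 3 = {np.log(3):.4f}")

rng = np.random.default_rng(1)
pi, n = 0.3, 200_000                        # assumed true sensitive proportion
truth = rng.random(n) < pi
response = truth ^ (rng.random(n) < 0.25)   # flip each answer w.p. 1/4

lam_hat = response.mean()                   # observed "yes" proportion
pi_hat = (lam_hat - 0.25) / 0.5             # unbiased estimator from Example 1
print(f"pi_hat = {pi_hat:.4f}, true pi = {pi}")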

However, while obtaining the highest data utility, privacy is not perfectly protected. Meanwhile, although all guaranteed by differential privacy, it is difficult to determine which of the differently designed distorted matrices, such as Warner's scheme [45], Uniform Perturbation [37], and FRAPP [39], is optimal. Notably, the relationship between the two conflicting objectives can be defined as follows. Let f_1(P) and f_2(P) denote the privacy and utility of an RR matrix P, respectively.

Definition 3 (Pareto dominance). P dominates P′, written P ≺ P′, iff f_i(P) ≤ f_i(P′) for every i ∈ {1, 2} and f_j(P) < f_j(P′) for at least one j ∈ {1, 2}.

Definition 4 (optimal RR matrix set). The set of all nondominated P from the feasible domain Ω is

P* = {P ∈ Ω | ¬∃ P′ ∈ Ω, P′ ≺ P}

Definition 5 (optimal RR matrix front). The set of objective vectors obtained by the Pareto optimal set is

F* = {(f_1(P), f_2(P)) | P ∈ P*}

Obviously, optimal RR matrices are not unique; in fact, we can find a whole set of optimal RR matrices in the search space for further selection. In the next subsection, we focus on the quantification scheme for the metrics of privacy and utility pertinent to f_1 and f_2.

3.3. Scheme for Quantification

Measuring privacy and utility in RR schemes is crucial for many reasons [46, 47]. It is difficult to justify RR matrices by an investigator's intuition and experience alone, and the two criteria are conflicting in nature: when data utility increases, privacy will undoubtedly decrease to a certain degree, and vice versa. The combination of the two metrics therefore gives a more thoroughly reasonable quantification scheme than a single one.

Meanwhile, information theory can quantify how much information a receiver carries about the source, and a noisy channel links source coding and receiver decoding through the quantitative notions of average mutual information and distortion rate. In OptRR [34], privacy is measured by the amount of individual privacy that can be estimated from the distorted data, and utility is measured by the MSE of the original data and the estimator. Differently and more convincingly, we derive metrics of privacy and utility for distorted matrices as average mutual information and average distortion rate in the information-theoretic model, with a more rigorous mathematical foundation.

3.3.1. Quantification for Privacy

Mutual information can be interpreted as the information about the source that is preserved at the receiver. Similarly, privacy can be defined as average mutual information in the discrete random probability space of Scenario 2: just as a noisy channel sits between a symmetric discrete source and receiver, the client side and server side communicate through distorted matrices, as shown in Table 1.

Our goal is to optimize distorted matrices by designed metrics inspired by the noisy channel. In detail, privacy can be defined as the average mutual information that the server side obtains from the client side, namely, the prior entropy H(X) minus the posterior entropy H(X|Y):

I(X;Y) = H(X) − H(X|Y)

where the posterior entropy H(X|Y) is the average uncertainty about X that remains after Y is observed.

Undoubtedly, the upper bound of I(X;Y) can be restricted, with an obvious proof, by

0 ≤ I(X;Y) ≤ min{H(X), H(Y)}

From the perspective of the adversary, the smaller the value of I(X;Y) is, the better the privacy is protected. According to Jensen's inequality and (23), average mutual information is bounded by the source entropy, which is suitable as the bound of privacy. The metric function of privacy in privacy computing can be represented by

f_1(P) = I(X;Y) = Σ_i Σ_j π_i p_{ji} log [ p_{ji} / λ_j ]
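Evaluating this privacy metric for a given RR matrix, together with the entropy bound above, can be sketched as follows (our code; the matrix and distribution are assumptions):

import numpy as np

def privacy_metric(p_x, P):
    # P[j, i] = p(y_j | x_i): a column-stochastic RR matrix.
    joint = P * p_x                        # joint[j, i] = p(x_i, y_j)
    p_y = joint.sum(axis=1)                # marginal p(y_j)
    prod = np.outer(p_y, p_x)              # p(y_j) * p(x_i)
    mask = joint > 0
    # I(X;Y) = sum over x, y of p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))

p_x = np.array([0.4, 0.3, 0.2, 0.1])
P = np.full((4, 4), 0.1)
np.fill_diagonal(P, 0.7)
h_x = -np.sum(p_x * np.log2(p_x))
print(f"I(X;Y) = {privacy_metric(p_x, P):.4f} <= H(X) = {h_x:.4f}")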

3.3.2. Quantification for Utility

In the information-theoretic model of privacy computing, data utility can be quantified by the average distortion rate, which measures how much X and Y are distorted. The distortion function and the distortion matrix are represented by

d(x_i, y_j) ≥ 0, D = [d(x_i, y_j)]_{k×k}

The distortion function defined in (26) and (27) is nonnegative, and the distortion matrix can be computed as in (28). Finally, the average distortion rate can be computed as

D̄ = Σ_i Σ_j π_i p_{ji} d(x_i, y_j)

Thus, utility is defined by the average distortion rate: the smaller its value, the higher the utility. The metric function of utility in privacy computing can be represented by

f_2(P) = D̄
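The average distortion rate is straightforward to compute once a distortion function is fixed; in this sketch (ours) we assume the common Hamming distortion, d(x_i, y_j) = 0 if i = j and 1 otherwise, one natural choice for categorical data:

import numpy as np

def average_distortion(p_x, P):
    # Under Hamming distortion, the average distortion rate is simply the
    # probability that a record is reported in a different category.
    d = 1.0 - np.eye(len(p_x))             # distortion matrix: 0 diag, 1 off
    joint = P * p_x                        # joint[j, i] = p(x_i) p(y_j | x_i)
    return float(np.sum(joint * d))

p_x = np.array([0.4, 0.3, 0.2, 0.1])
P = np.full((4, 4), 0.1)
np.fill_diagonal(P, 0.7)
print(average_distortion(p_x, P))          # 0.3 for this matrix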

Finally, our work focuses on minimizing the two objectives of privacy and utility simultaneously under specified constraints. In the next section, we illustrate our solution, by the proposed discrete MOPSO, to the problem

min_P { f_1(P), f_2(P) } s.t. Σ_j p_{ji} = 1, max_j p_{ji} ≤ e^ε · min_j p_{ji}, det(P) ≠ 0   (31)

Notably, the constraint that the determinant of P is nonzero guarantees that P is invertible.

4. MOPSO in Privacy Computing

Even in a simplified condition where each matrix entry is drawn from a discrete range of g values (g an integer), the NP-hard problem of searching optimal k × k RR matrices already faces an astronomical number of candidates on the order of g^(k²), while the real search boundary of our scheme is the whole probability space, which is far more complex. Therefore, we provide a discrete MOPSO to optimize the conflicting goals of privacy and utility simultaneously and to evaluate different RR schemes with fewer tunable variables as well as a fast convergence rate.

4.1. Outline of Algorithm

The two objectives of privacy and utility are derived from RR matrices, which reflect the nature of private information in RR schemes. In detail, average mutual information and average distortion rate are constrained by ε-differential privacy and the probability boundary. Our proposed discrete MOPSO is comprised of the following stages, as shown in Algorithm 1: (1) particles are initialized in a structure of discrete position, velocity, personal-best, global-best, repository, and mutation archive with dynamic hypervolume; (2) velocity updating is redefined by the ReLU activation function, which retains the best genes from the global-best for the new discrete position; (3) repository updating guarantees hypercube quality and diversity by hypervolume, in combination with roulette wheel selection and crowding distances, dynamically and adaptively; (4) the mutation archive applies two strategies, a fixed mutation percentage and a guided-random mechanism, to maintain the diversity of the computation; (5) constraints judgment: ε-differential privacy guarantees rigorous protection bounds against arbitrary background knowledge; in each column of an RR matrix, the ratio of the maximum to the minimum is bounded by the privacy budget, the sum of each column is 1, and the determinant of each matrix is nonzero; (6) termination judgment. Notably, the combination of the repository and the mutation archive is the final output of optimal RR matrices after a Pareto dominance check. The process of our algorithm is illustrated in the following, and the parameters are listed in Abbreviations.

Input: Parameters of population size, particle dimension, repository size, mutation archive size, and maximum iteration number
Output: Final optimal set repository.
Step  1. Initialization:
Do
Step  1.1. Initialization in particles of position POS, Velocity VEL, PBEST,
GBEST;
Step  1.2. Initialization in repository REP;
Step  1.3. Initialization in mutation archive ARC;
Step  1.4. Check constraints:
(1) Each column in POS is bounded by the e^ε max/min ratio;
(2) Sum of each column in POS is 1;
(3) Determinant of POS cannot be 0.
While constraints cannot be satisfied simultaneously, return to Step  1.1.
Step  2. Repeat:
Step  2.1. Updating VEL and POS for each particle according to velocity
updating mechanism and check constraints for new POS as Step  1.1;
Step  2.2. Calculate fitness under evaluation function for each particle and update
PBEST and GBEST;
Step  2.3. Update REP by Pareto dominance and hyper-volume respectively;
Step  2.4. Mutation is operated on ARC while ARC is nonempty. Two strategies are applied: the mutation
percentage is set to 1/3, and the guided-random mechanism is a partial imitation of
GBEST in REP under boundary constraints;
Step  3. Termination:
If stopping criterion is achieved, halt and check the combination of REP and
ARC by Pareto dominance. Output the final optimal RR matrices from REP and ARC;
Otherwise, loop to Step  2.
4.2. Initialization

An RR matrix is coded as the position of a particle, POS, in which the sum of each column is normalized to 1 and whose determinant is nonzero under the constraint of ε-differential privacy. POS_fit is a vector of privacy and utility for each particle, evaluated by the two objective functions in (25) and (30). The velocity VEL is initialized as a randomized matrix of the same size as POS. The personal-best PBEST is initialized from POS and POS_fit, and GBEST is selected from the PBEST of each particle at random. The repository REP and the mutation archive ARC are initialized as empty sets.
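A minimal Python sketch of this constrained initialization (our own rendering; all names are assumptions). Drawing every entry from [1, e^ε] before column normalization is one simple way to guarantee the per-column max/min ratio bound:

import numpy as np

rng = np.random.default_rng()

def init_position(k, eps):
    # Entries in [1, e^eps] keep max/min <= e^eps in every column,
    # and dividing by a per-column scalar preserves that ratio.
    while True:
        pos = rng.uniform(1.0, np.exp(eps), size=(k, k))
        pos /= pos.sum(axis=0)                  # each column sums to 1
        if abs(np.linalg.det(pos)) > 1e-12:     # determinant must be nonzero
            return pos

pos = init_position(k=4, eps=2.0)
vel = rng.random(pos.shape)    # velocity: random matrix of the same size as POS
pbest = pos.copy()             # personal best starts at the initial position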

4.3. Velocity Updating

According to the continuous velocity updating rule in (32), we redefine the operation of velocity in a discrete form. Inspired by the activation function of the rectified linear unit (ReLU), velocity is discretized into a copy of certain features from GBEST by our ReLU scheme in (32) and (33), as shown in Figure 4:

VEL ← ω · VEL + c_1 r_1 (PBEST − POS) + c_2 r_2 (GBEST − POS)

Each element in VEL is a representation of certain excellent episodes from GBEST. In this paper, the cognitive and social components c_1 and c_2 are fixed to 2, the inertia weight ω is set to 0.4, and r_1 and r_2 are random factors generated from the interval (0, 1).

In detail, VEL is transformed by ReLU with a threshold of 0. Utilizing the AND operator with GBEST in (34), the outstanding episodes truncated from GBEST are preserved directly; then the corresponding positions in POS are substituted by the outstanding episodes maintained in VEL with the replacing operator. After normalization of POS, a new position is obtained in (35).

Overall, an illustrative example of velocity updating for each particle can be depicted in Figure 5.
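Our reading of this discrete update, rendered as a Python sketch (the function name, the masking rule, and the per-element copy are our interpretation of the description above):

import numpy as np

rng = np.random.default_rng()
w, c1, c2 = 0.4, 2.0, 2.0      # inertia weight and acceleration coefficients

def update_particle(pos, vel, pbest, gbest):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # Classical velocity rule (32), then ReLU thresholding at 0 (33).
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    mask = vel > 0                        # positive entries mark "episodes"
    new_pos = np.where(mask, gbest, pos)  # AND with GBEST, then replace (34)
    new_pos = new_pos / new_pos.sum(axis=0)   # renormalize columns (35)
    return new_pos, vel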

4.4. Repository Updating

The repository REP is the archive of good RR matrices in the representative form of POS. In order to maintain diversity and optimize the distribution, our updating mechanism is based on hypervolume and crowding distance. The hypervolume is divided by hyperlimits into hypercubes dynamically and adaptively according to the boundaries of each Pareto solution in REP. Each hypercube bounded by hyperlimits is assigned a hyperquality based on the number of particles inside, and GBEST is selected by roulette wheel from a hypercube with higher hyperquality. When particles exceed the capacity of the repository, the extra particles with the smallest crowding distances, measured by (36), are removed and transferred to the mutation archive ARC for the use of mutation. In (36), the distance is taken over the M-dimensional objective space, where M is the dimension of the objective vector and |REP| denotes the number of particles in REP; in other words, the crowding distance of a particle is a measure of its average distance to all the other particles in REP.
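Reading (36) as each member's average Euclidean distance to all the others in REP, the overflow rule can be sketched as follows (our code; the fitness values are made up):

import numpy as np

def crowding_distances(fitness):
    # fitness: shape (n, 2), one (privacy, utility) row per REP member.
    diff = fitness[:, None, :] - fitness[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # pairwise distance matrix
    return dist.sum(axis=1) / (len(fitness) - 1)

fitness = np.array([[0.20, 0.90], [0.50, 0.50], [0.48, 0.52], [0.90, 0.10]])
order = np.argsort(crowding_distances(fitness))
print(order)   # most crowded members first: move these to ARC when REP overflows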

4.5. Strategies for Mutation

To maintain diversity and enhance the ability to escape from local optima, two strategies for mutation are designed: a fixed mutation percentage and guided-randomization, as shown in Algorithm 2.

Input: ARC and mutation percentage P
Output: New particles.
Step  1. Randomized mutation
Select particles from ARC according to percentage P;
Each selected particle is re-randomized;
Some columns in each particle are replaced by newly generated columns which still
satisfy constraints as in Algorithm 1;
Calculate fitness for each newly generated particle.
Step  2. Guided mutation
Select the remaining particles from ARC;
Each particle is mutated by the guidance of GBEST;
Some columns in each particle are replaced by the corresponding columns in
GBEST which still satisfy constraints as in Algorithm 1;
Calculate fitness for each newly generated particle.
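The two strategies can be rendered as a Python sketch (ours; the number of columns replaced per particle is an assumption, since the listing above does not fix it):

import numpy as np

rng = np.random.default_rng()

def mutate_archive(arc, gbest, eps, p_random=1/3, n_cols=2):
    mutated = []
    for particle in arc:
        k = particle.shape[0]
        out = particle.copy()
        cols = rng.choice(k, size=min(n_cols, k), replace=False)
        if rng.random() < p_random:
            # Strategy 1: re-randomize the chosen columns (still feasible).
            fresh = rng.uniform(1.0, np.exp(eps), size=(k, len(cols)))
            out[:, cols] = fresh / fresh.sum(axis=0)
        else:
            # Strategy 2: guided mutation - imitate GBEST on those columns.
            out[:, cols] = gbest[:, cols]
        if abs(np.linalg.det(out)) > 1e-12:    # keep only feasible mutants
            mutated.append(out)
    return mutated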
4.6. Boundaries Checking

The positions of all particles are bounded by ε-differential privacy, which means that the maximum in each column is no more than e^ε times the minimum. The tunable privacy budget ε is regulated to satisfy users' different privacy preserving requirements. In addition, mutation is performed in the particle updating phase whenever a position falls out of bounds.
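The check itself is a one-liner (our sketch):

import numpy as np

def within_budget(pos, eps):
    # In every column, the maximum entry is at most e^eps times the minimum.
    return bool(np.all(pos.max(axis=0) <= np.exp(eps) * pos.min(axis=0)))

print(within_budget(np.array([[0.75, 0.25], [0.25, 0.75]]), eps=1.1))
# True: the column ratio is 3 <= e^1.1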

4.7. Termination Checking

The termination criterion for MOPSO can be set in two ways: (1) limiting the maximum iteration number; (2) stopping when the quality of the repository REP does not improve over a certain number of consecutive generations. When the termination criterion is met, the combined set of optimal RR matrices in Pareto dominance from REP and ARC is output as the final solution. In our work we combine both methods, since either criterion can guarantee a diversified optimal solution set for different requirements while benefiting from the fast convergence of MOPSO.

4.8. Complexity Analysis
4.8.1. Space Complexity

In our scheme of discrete MOPSO, two RR matrix sets, namely, REP and ARC, are needed; with repository size N_R, archive size N_A, and particle dimension k², they occupy O((N_R + N_A) · k²) space. All particles need O(N · k²) space, where N is the population size. Thus, the total space complexity of discrete MOPSO for RR matrices is O((N + N_R + N_A) · k²).

4.8.2. Time Complexity

The main time cost lies in the Pareto dominance check and the repository updating in Algorithm 1. For ease of representation, let n be the number of solutions under evaluation, namely, the population together with the repository. Initialization in Step 1 needs O(n²) Pareto dominance check operations, and repository updating in Step 2 occupies O(n²) per generation. Thus, the worst-case running time of our scheme is O(T · n²), where T is the iteration number.

5. Experiments and Discussion

We conduct our experiments on two kinds of datasets: synthetic datasets and real datasets from UCI. The performance of our proposed discrete MOPSO is compared with Warner, FRAPP, UP, and SPEA2 as in [34]. The parameters of discrete MOPSO are shown in Table 2.

5.1. Pareto Front

RR matrices generated by Warner, UP, and FRAPP have been proved to obtain the same Pareto front distribution [34]; thus, for ease of illustration, we take FRAPP as the comparison. Synthetic datasets are generated in the following way: the diagonal element of the RR matrices is generated from (0, 1) with a stride of 0.001, so 999 RR matrices are obtained. Random numbers are drawn from different distributions to form a single-dimensional set of 10000 records with 8 different categories. Nondominated solutions that are beyond differential privacy are deleted in advance. The same privacy budget ε is set for the different distributions; when ε varies, the results feature similar trends. Six kinds of Pareto fronts of different RR schemes are presented in Figure 6, and the corresponding statistics from the boxplot are shown in Figure 7.

From Figure 6, we can conclude that the utility metric obtained by our approach reaches a wider range than FRAPP's. Our solutions dominate almost all of those of FRAPP, which shows the better global optimization capability of discrete MOPSO. In terms of privacy, we can find more Pareto solutions for RR matrices than FRAPP. FRAPP considers RR matrices by a single metric of accuracy only, while our method optimizes the two metrics of privacy and utility simultaneously.

We take the Gamma distribution in Figure 6(f) as an example. Only 389 out of the 999 RR matrices in FRAPP satisfy differential privacy under the chosen budget, while our approach maintains a wider range of privacy and utility by adopting REP and ARC; the mutation and velocity updating operations enhance the searching ability, especially toward a wider range of privacy. On the other hand, it can be calculated that the source entropy is 2.6741, and the maximum of the privacy metric falls within the minimum of H(X) and H(Y), which confirms the bound derived in Section 3.3.1.

Figure 7 clearly shows that the RR matrices optimized by our method have a much wider range of privacy and utility than FRAPP's, especially in the privacy metric of average mutual information. In summary, a lower distortion rate corresponds to better utility, and better privacy corresponds to less average mutual information.

5.2. Privacy Budget

The privacy budget is the most important index for balancing the trade-off between privacy and utility. Under the differential privacy bound that the max/min column ratio does not exceed e^ε, the effect of different privacy budgets on the number of RR matrices is shown in Figure 8.

In Figure 8, the case of ε = 0 means that every element in the RR matrix is equal, which is the special condition of maximum privacy with minimum utility according to the information-theoretic model. Notably, for the smaller privacy budgets the size of REP and ARC is set to 200, and for the larger ones it is set to 400 to evaluate the further exploitative ability of our method in Figure 8(a); in Figure 8(b), the size of REP and ARC is set to 200. It can be concluded that our method performs better at finding more RR matrices. As ε grows larger, FRAPP can find more RR matrices; however, our method is focused on a more diversified distribution of Pareto solutions with much higher quality on the trade-off between privacy and utility. As ε decreases, privacy is strongly preserved: there is a sharp drop in the number of RR matrices obtained by FRAPP, while our method stays almost stable thanks to the adoption of REP and ARC.

5.3. Case Study

We adopt two datasets from UCI: the car evaluation dataset, including 1728 instances over 4 classes, unacc. (70.023%), acc. (22.222%), good (3.993%), and v. good (3.762%), with 6 attributes; and the Connect-4 opening database, containing all legal 8-ply positions, with 67557 instances and 42 attributes over 3 classes, win (65.83%), loss (24.62%), and draw (9.55%).

In Figure 9, under a strong bound on the privacy budget, we can conclude that our method obtains a wider range of the Pareto front in terms of privacy and utility than FRAPP on car evaluation, especially in the privacy metric. On Connect-4, both methods have nearly the same range and statistics in the boxplot figure; meanwhile, our approach still gains better statistics, in Pareto dominance of FRAPP, as shown in Figures 9(a) and 9(c). In addition, when estimating the best-utility reconstruction of the original data distribution by the two methods, our method performs better than FRAPP in the metric of utility, as illustrated in Figure 10.

In addition, to test our method fully, we compared it with OptRR [34] in terms of the running time for obtaining 200 RR matrices in REP, with the dimension of the RR matrices set from 5 to 150. OptRR is developed from an evolutionary multiobjective optimization method, namely, SPEA2. The experiment is conducted on an Intel Core i5-4590 CPU @ 3.30 GHz with 4 GB RAM, coded in MATLAB on 64-bit Windows 7. The running time results are shown in Table 3.

The running time results in Table 3 are the mean values of 20 independent runs. Our approach has a faster convergence rate than OptRR in finding optimal RR matrices, especially as the dimension of the RR matrices grows. Larger dimensions challenge the computing capability as well as the optimization strategies in both MOPSO and OptRR. The simple operators in our method, such as AND and replacement, are beneficial in reducing complexity, and the fast convergence of MOPSO plays a key role in speeding up the search.

6. Conclusion

In this paper, we formulate an information-theoretic model for privacy quantification and establish an open framework of privacy computing. For optimizing RR matrices in information collection by randomized response, we derive metrics of privacy and utility, namely, average mutual information and average distortion rate. Furthermore, our approach is, to the best of our knowledge, the first to optimize RR matrices using discrete MOPSO under the information-theoretic model. Our discrete MOPSO solves the two-objective optimization problem under differential privacy, and the two mutation strategies in the extra mutation archive, together with the velocity updating mechanism, succeed in helping MOPSO escape from local optima. The experimental results on synthetic datasets and real UCI datasets show that our approach outperforms existing state-of-the-art works in the aspects of Pareto front, privacy budget, and data distribution reconstruction, and is especially strong in privacy protection. Moreover, we hope to contribute our optimal Pareto solutions of extensive RR schemes to the public for future study, and the work can be extended to other domains of privacy computing.

Abbreviations

Illustration of General Channel Model
p(x_i): Probability that source X takes value x_i
p(y_j): Probability that receiver Y takes value y_j
p(y_j | x_i): Probability that receiver Y takes value y_j when source value x_i is presented.
Tuples of Privacy Computing
Private information
Participants involved in privacy computing
Quantifying metrics on private attributes
Constraints on privacy disclosure
Operation on private information
Effective privacy computing tools and paradigms
Supplementary set of the privacy computing system.
Description of Parameters in Our Approach
Population size
Dimension
Repository size
Mutation archive size
Maximum number of iterations.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this manuscript.

Acknowledgments

This work was supported by the Chinese National Natural Science Foundation (Grant no. U1603261) and the Natural Science Foundation of Xinjiang Province, China (Grant no. 2016D01A080). The authors would also like to thank their partners at their research lab for their generous gifts in support of this research. Special thanks are due to Professors Yi Qu, Yiliang Han, Anming Gong, and Zhen Liu for their beneficial advice.