Abstract
Utility and privacy are competing factors in data publishing: improving one tends to sacrifice the other. A dataset therefore cannot be published without privacy protection, and it is crucial to maintain an equilibrium between the utility and privacy of data. In this paper, a novel technique for the tradeoff between utility and privacy is developed, where the former is handled with a metaheuristic algorithm and the latter with a cryptographic model. Utility is achieved through clustering, and the privacy model encrypts and decrypts the data. The input datasets are first clustered, and after clustering, the privacy of the data is maintained. The simulation is conducted on manufacturing datasets and compared against various existing models. The results show that the proposed model achieves better clustering accuracy and data privacy than the existing models. The evaluation shows that the proposed model offers a tradeoff between privacy preservation and clustering utility in smart manufacturing datasets.
1. Introduction
Data anonymization [1] is considered a major factor for data preservation in data mining. It maintains the privacy and sensitivity of data for data publishing. Since a large portion of data publishing or data mining schemes are subject to linkage attacks [2] or probabilistic inference attacks [3], anonymization is typically achieved through attribute suppression or generalization. These techniques either cause data loss or lead to an NP-hard problem. Therefore, it is essential to use privacy preservation models with a lightweight mechanism that avoids constraints such as NP-hardness and data loss.
While maintaining privacy, it is essential to address data utility, as it preserves the accuracy of the context and keeps the dataset useful when it is presented to users. Eliminating appropriate data diminishes the sensitive information [4]. Therefore, it is essential to maintain the tradeoff between the utility and privacy metrics in data publishing schemes [5].
Privacy issues in statistical databases have been investigated thoroughly, and privacy-preserving techniques in data mining have been an active research area. Various privacy-preserving clustering methods [4, 6–22] are used as privacy-preserving data mining models to secure the data and to balance security with data utility. In all these techniques, the frequency of sensitive information within the formed clusters is not uniform: certain values occur more frequently in each cluster than in the original dataset. This leads to probabilistic inference attacks. Hence, the sensitive information must be distributed uniformly across the clusters. Therefore, data records with a specific sensitive attribute value need to be distributed among all classes using a neighborhood method. This leads to anonymized clusters whose sensitive information is distributed uniformly. In this regard, we develop a model that maintains the sensitivity of data while considering privacy and utility as its utmost concerns. The main contributions of the paper are as follows:
(1) The authors develop a novel tradeoff scheme that maintains the utility and privacy of smart manufacturing data.
(2) The utility is handled by a metaheuristic algorithm, and the privacy is maintained using a lightweight cryptographic algorithm.
(3) Finally, the model is tested over various smart manufacturing datasets in order to validate it against privacy-utility metrics.
The outline of the paper is as follows. Section 2 provides the related works. Section 3 provides the data anonymization and data clustering technique, including the preliminaries and the algorithm required to resolve the problem. Section 4 provides the evaluation of the proposed work against existing methods. Section 5 concludes the work with possible directions for future work.
2. Related Works
Liu et al. [23] proposed the minimum monotonicity of privacy-utility, which constitutes differential privacy limits. The study first defined the restricted concession of varying privacy and suggested limited data protection monotony based on technological irrelevance. The study also showed, in theory, various distinct data protection structures.
Ruddell et al. [24] offered a statistical overview of the relationship between effectiveness and privacy in Los Angeles. The tradeoff depends on the economic field and the type of services. This study provides guidelines for a better and more ethically balanced collection and distribution of customer knowledge.
Cho et al. [25] offered a guideline for automated biomedical databases, optimising the concept of differentiating privacy from meaning.
Asikis and Pournaras [26] studied privacy requirements that balance between maximum utility with minimal privacy and low utility with maximal privacy, where the utility of aggregation calculations relates to accuracy. Data security settings can be applied universally as system-wide rules and checks. They may, however, also be enforced individually by each customer or decided through offers.
Valdez and Ziefle [10] explored how sharing personal health information can be beneficial while security and confidentiality are protected. The data exchange value, the type of data, and the privacy maintained by the data recipient were varied, and users were asked to select the ideal communication scenario. Users accepted the use of data for clinical studies and physical disorders, objected to sharing mental disease data and data with a heavy deanonymisation risk, and were not interested in sharing data for commercial purposes.
3. ECCACO Algorithm
The ECCACO algorithm is illustrated in Figure 1. The study protects the privacy of datasets using the lightweight ECC algorithm prior to the formation of clusters and their relevant classes. The clusters are formed so that the clustering error is reduced. To validate the clustered samples, classification is used to assign the samples to their relevant classes. The data are transmitted or published only if the classification error is lower than the threshold level. The privacy of data is not compromised or adjusted based on the clustered instances in order to maintain the utility.
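The paper does not specify which ECC construction is used, so the following is only a minimal sketch of how a record could be encrypted and decrypted with elliptic curve keys in Java. It assumes a hybrid scheme (ECDH key agreement on the secp256r1 curve, SHA-256 key derivation, AES-GCM for the payload), which is a stand-in rather than the authors' method. The class name `EccSketch` and all method names are hypothetical; only standard JDK classes are used.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.spec.ECGenParameterSpec;
import javax.crypto.Cipher;
import javax.crypto.KeyAgreement;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: ECDH-derived AES-GCM encryption (an assumed scheme, not the paper's). */
final class EccSketch {
    private static final int IV_BYTES = 12;   // recommended IV length for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    /** Generates an elliptic curve key pair on the secp256r1 curve. */
    static KeyPair newKeyPair() throws GeneralSecurityException {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("EC");
        gen.initialize(new ECGenParameterSpec("secp256r1"));
        return gen.generateKeyPair();
    }

    /** Derives a 256-bit AES key from the ECDH shared secret of own private and peer public key. */
    static SecretKeySpec deriveKey(PrivateKey own, PublicKey peer) throws GeneralSecurityException {
        KeyAgreement agreement = KeyAgreement.getInstance("ECDH");
        agreement.init(own);
        agreement.doPhase(peer, true);
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(agreement.generateSecret());
        return new SecretKeySpec(digest, "AES");
    }

    /** Encrypts one record; the random IV is prepended to the ciphertext. */
    static byte[] encrypt(SecretKeySpec key, byte[] plain) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] body = cipher.doFinal(plain);
        byte[] out = new byte[IV_BYTES + body.length];
        System.arraycopy(iv, 0, out, 0, IV_BYTES);
        System.arraycopy(body, 0, out, IV_BYTES, body.length);
        return out;
    }

    /** Decrypts a message produced by encrypt(); fails if the data were tampered with. */
    static byte[] decrypt(SecretKeySpec key, byte[] message) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
        return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
    }
}
```

In this sketch the data owner and the recipient each hold a key pair, and both derive the same AES key from their own private key and the other party's public key.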
3.1. Preliminaries
Datasets published by owners usually have private and public attributes. The public attributes are known to all users and are publicly available.
Formally, let I = {i_1, i_2, …, i_M} represent a set of M items. Consider an itemset X, which is a subset of the itemset I, i.e., X ⊆ I. A transaction T is represented as a tuple (t_{id}, X), where X is the itemset and t_{id} is an identifier that differentiates transactions containing the same items.
An itemset X is supported by a transaction T if X is a subset of the items in T. The transactional database D over the itemset I contains N transactions, where T_{nm} is assigned 1 if the n^{th} transaction contains the m^{th} item and 0 otherwise.
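As an illustration of the 0/1 encoding of the transactional database D, the following sketch stores D as a boolean matrix and checks whether a transaction supports an itemset. The class `TransactionDb` and its methods are hypothetical names introduced for this example, and items are assumed to be identified by their column index.

```java
import java.util.Set;

/** Illustrative only: a binary transactional database with N rows (transactions) and M columns (items). */
final class TransactionDb {
    private final boolean[][] t; // t[n][m] is true when transaction n contains item m

    TransactionDb(boolean[][] t) {
        this.t = t;
    }

    /** Returns true when transaction n contains every item index in itemset x. */
    boolean supports(int n, Set<Integer> x) {
        for (int m : x) {
            if (!t[n][m]) {
                return false;
            }
        }
        return true;
    }

    /** Counts the transactions in the database that support itemset x. */
    int supportCount(Set<Integer> x) {
        int count = 0;
        for (int n = 0; n < t.length; n++) {
            if (supports(n, x)) {
                count++;
            }
        }
        return count;
    }
}
```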
The publicly available attributes through which intruders can reveal a user's identity are called quasi-identifier (QID) attributes. At the time of data publishing, certain attributes are kept confidential, and these are called sensitive attributes (SA). Datasets usually contain information such as gender, age, and zip code, which are QID attributes. Further, the disease of an individual is a sensitive attribute, and the identity of a person is a public attribute. It is easy for an attacker to reveal private or sensitive user information without the user's knowledge. Knowing the QID information of a neighbor, the attacker can link it with the neighbor's disease; such an attack is called a linking attack or identity disclosure attack. However, the inferred disease may not be certain, which relates to another attack called the probabilistic inference attack. This attack is common where one sensitive attribute value occurs more frequently than the others, which allows the attacker to predict the most common disease from the QID values. The proposed clustering method developed in this paper provides better resilience against these two attacks.
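To make the probabilistic inference risk concrete, the sketch below computes the share of the most frequent sensitive value within one cluster. A share close to 1 means an attacker who knows the QID values can guess the sensitive value with high confidence. The class `InferenceRisk` and the method name are hypothetical, and this measure is an illustrative assumption rather than a metric defined in the paper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: how skewed the sensitive values of one cluster are. */
final class InferenceRisk {
    /** Returns the fraction of records that carry the most frequent sensitive value. */
    static double maxSensitiveShare(List<String> sensitiveValues) {
        if (sensitiveValues.isEmpty()) {
            throw new IllegalArgumentException("cluster must not be empty");
        }
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String value : sensitiveValues) {
            int current = counts.merge(value, 1, Integer::sum); // updated count for this value
            max = Math.max(max, current);
        }
        return (double) max / sensitiveValues.size();
    }
}
```

For a cluster with the sensitive values flu, flu, flu, and cold, the share is 0.75, whereas a cluster with four distinct values yields 0.25.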
3.2. ACO Clustering Process
The data clustering or partitioning process follows the steps given in [20] and is presented in Algorithm 1. After the data are partitioned into clusters, the size of each cluster may vary. Hence, it is necessary to adjust the size of the clusters to obtain k-anonymity. A cluster with fewer than k records is merged with other clusters. This reduces the deterministic penalty since the combinations of all the clusters are the same, as given in Algorithm 2.
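The size adjustment step can be sketched as follows. This is not the paper's Algorithm 2: it assumes a simple greedy rule in which the smallest cluster is repeatedly merged into the next smallest one until every cluster holds at least k records. The class `ClusterAdjust` and the method `mergeUndersized` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: greedy merging of clusters that are too small for k-anonymity. */
final class ClusterAdjust {
    /**
     * Returns new clusters in which every cluster has at least k records,
     * unless the input holds fewer than k records in total (then one cluster remains).
     */
    static <T> List<List<T>> mergeUndersized(List<List<T>> clusters, int k) {
        List<List<T>> result = new ArrayList<>();
        for (List<T> cluster : clusters) {
            result.add(new ArrayList<>(cluster)); // copy so the input is not modified
        }
        result.sort((a, b) -> Integer.compare(a.size(), b.size()));
        while (result.size() > 1 && result.get(0).size() < k) {
            List<T> smallest = result.remove(0);
            result.get(0).addAll(smallest);       // merge into the next smallest cluster
            result.sort((a, b) -> Integer.compare(a.size(), b.size()));
        }
        return result;
    }
}
```

A distance-aware variant would merge each undersized cluster into its nearest neighbor instead of the next smallest one; the greedy rule is used here only to keep the sketch short.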


To improve the uniformity of the distributed clusters, the study uses a K-nearest neighbor (KNN) classifier after clustering the datasets using ACO. The classifier generates the equivalence classes of the clustered samples with uniformity, and hence the information provided for data privacy is considered sensitive. In each cluster, KNN finds the equivalence class of the best neighbor while maintaining the sensitive information of each data sample. Algorithm 3 provides the pseudocode of ECCACO with the classifier.
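The K-nearest neighbor step can be illustrated with a minimal sketch, assuming numeric feature vectors, Euclidean distance, and a majority vote among the k closest samples. It is a generic KNN classifier and not the paper's Algorithm 3; the class `KnnSketch` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: plain K-nearest neighbor classification with majority voting. */
final class KnnSketch {
    /** Returns the label that occurs most often among the k samples closest to the query. */
    static String classify(double[][] points, String[] labels, double[] query, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort sample indices by their distance to the query (closest first).
        Arrays.sort(order, (a, b) -> Double.compare(
                squaredDistance(points[a], query), squaredDistance(points[b], query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (int r = 0; r < Math.min(k, order.length); r++) {
            String label = labels[order[r]];
            int v = votes.merge(label, 1, Integer::sum);
            if (v > bestVotes) { // on a tie, the label that reached the count first is kept
                bestVotes = v;
                best = label;
            }
        }
        return best;
    }

    /** Squared Euclidean distance; sufficient for ranking neighbors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```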

4. Results and Discussion
The implementation of ECCACO is carried out in Java on the Eclipse IDE. The study considers two different experiments over smart manufacturing datasets. ECCACO is tested with Smart Energy Data: Aachen/Cologne Smart Factory [21] and Power System Datasets [22]. It can be applied effectively to power system analysis or to any application that generates data rapidly. The model is tested using three metrics: average equivalence class size after classification, discernibility cost, and Kullback-Leibler (KL) divergence. Further, the performance of utility is estimated in terms of F-measure, accuracy, and execution time.
4.1. Cluster Output
The conventional methods work on anonymization, which forms equivalence classes or groups; however, they fail to provide information related to cluster size. The model uses both of these parameters, as the cluster size clearly shows the difference between the ACO and ECCACO clustering algorithms. The experiments are conducted on the synthetic dataset and the adult dataset (case one). Further, the results provide the cluster sizes formed by ACO.
The study considers six parameters: the total number of ants (m); α, which controls the influence of the pheromone τ on the sample selection probability; β, which controls the influence of the local heuristic factor η on the sample selection probability; the pheromone persistence coefficient ρ; the pheromone reward factor R; and the pheromone penalty factor P_{p}.
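The roles of α and β can be illustrated with the standard ACO selection rule, in which candidate j is chosen with probability proportional to τ_j^α · η_j^β. The paper does not print its exact rule, so the sketch below assumes this textbook form; the class `AcoSelection` and the method name are hypothetical. The update rule involving ρ, R, and P_{p} is not shown because the source does not specify it.

```java
/** Illustrative only: the textbook ACO choice probabilities for a set of candidates. */
final class AcoSelection {
    /**
     * Returns p[j] = tau[j]^alpha * eta[j]^beta / sum over all candidates,
     * where tau holds pheromone values and eta holds local heuristic values.
     */
    static double[] selectionProbabilities(double[] tau, double[] eta, double alpha, double beta) {
        if (tau.length != eta.length) {
            throw new IllegalArgumentException("tau and eta must have the same length");
        }
        double[] p = new double[tau.length];
        double sum = 0.0;
        for (int j = 0; j < p.length; j++) {
            p[j] = Math.pow(tau[j], alpha) * Math.pow(eta[j], beta);
            sum += p[j];
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("at least one candidate needs a positive weight");
        }
        for (int j = 0; j < p.length; j++) {
            p[j] /= sum; // normalize so the probabilities add up to 1
        }
        return p;
    }
}
```

A larger α makes the ants follow existing pheromone trails more closely, while a larger β gives more weight to the local heuristic.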
Figure 2 shows the total number of clusters and the cluster frequency for case one. Efficient clusters are created using the ant colony optimization (ACO) clustering scheme over the given datasets with different sensitive values S of 5, 10, and 15 (Figures 2(a)–2(c)), with a total of 10000 records and a fixed k value. In this experiment, the value of S is varied to show how the clusters vary across frequencies. Similarly, results are obtained by varying the value of k, i.e., 50 (Figure 2(d)) and 100 (Figure 2(e)), with a fixed S value. Finally, Figure 2(f) shows the clusters formed by the elliptic curve cryptography-ant colony optimization (ECCACO) algorithm with fixed k (=300) and fixed S (=2), i.e., case one. The results show that ACO produces equivalence clusters of differing sizes, whereas the ECCACO algorithm forms clusters of uniform size.
Figure 2: Total number of clusters and cluster frequency for case one, panels (a)–(f).
Finally, ECCACO is compared with existing models in terms of detection accuracy and F-measure. Here, the study classifies the clustered samples in order to find the accurate classes. These two measures are based on the number of correctly classified instances, obtained with classifiers such as Naïve Bayes and J48 on the smart manufacturing dataset. The experiments on case one and case two are carried out on HDFS.
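For reference, accuracy and F-measure can be computed from the confusion counts of a binary classification as sketched below. The standard definitions are assumed (accuracy = (TP + TN) / total, F-measure = 2 · precision · recall / (precision + recall)); the class `ClassificationMetrics` and its method names are hypothetical.

```java
/** Illustrative only: accuracy and F-measure from true/false positive and negative counts. */
final class ClassificationMetrics {
    /** Fraction of all instances that were classified correctly. */
    static double accuracy(int tp, int fp, int fn, int tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    /** Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN). */
    static double fMeasure(int tp, int fp, int fn) {
        int denominator = 2 * tp + fp + fn;
        if (denominator == 0) {
            return 0.0; // no positive instances and no positive predictions
        }
        return 2.0 * tp / denominator;
    }
}
```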
4.2. Comparison with Existing Methods
Tables 1 and 2 show the percentage of optimal clustering of smart manufacturing data instances for varying values of k, namely 5, 10, 25, and 30. The clustering results are optimal on the smart manufacturing data when the instances are classified after clustering with two different classifiers, Naive Bayes and J48.
Tables 3 and 4 show the results of classified instances using the classifiers with varying values of k. The Hadoop system is tuned by varying its parameters in order to increase the classification accuracy. The results show a significant improvement with increased random shuffling prior to data chunk creation. The ECCACO algorithm performs close to optimal with the J48 classifier. With Naive Bayes, the ACO clustering algorithm performs better than the other algorithms. Overall, ECCACO exhibits better classification accuracy and F-measure than existing algorithms.
4.3. Data Utility and Privacy Metrics
The model is evaluated with three metrics, namely equivalence class size, discernibility cost, and Kullback-Leibler divergence, to measure the utility loss of the privacy-preserved smart manufacturing datasets. The proposed model incurs a lower data loss than other methods. Tables 5 and 6 show the discernibility cost for the experiments conducted on the different datasets. The proposed model achieves a lower discernibility cost than the other models, which indicates better performance on data utility.
Table 7 shows the average class size on dataset 1, and Table 8 shows the average class size on dataset 2.
Table 9 shows the degree of privacy for the manufacturing datasets with increasing values of k. The results show that ECCACO achieves a higher degree of privacy than the ACO model as k increases.
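The three utility-loss metrics named in this section can be sketched as follows, assuming their commonly used definitions: the discernibility cost as the sum of squared equivalence class sizes, the average equivalence class size as the number of records divided by the number of classes, and the KL divergence as the sum of p_i · ln(p_i / q_i). The paper does not print its formulas, so these definitions are assumptions; the class `UtilityMetrics` and its methods are hypothetical names.

```java
/** Illustrative only: common definitions of three anonymization utility metrics. */
final class UtilityMetrics {
    /** Discernibility cost: every record is penalized by the size of its equivalence class. */
    static long discernibilityCost(int[] classSizes) {
        long cost = 0;
        for (int size : classSizes) {
            cost += (long) size * size;
        }
        return cost;
    }

    /** Average equivalence class size: total records divided by the number of classes. */
    static double averageClassSize(int[] classSizes) {
        long total = 0;
        for (int size : classSizes) {
            total += size;
        }
        return (double) total / classSizes.length;
    }

    /** KL divergence of p from q (natural logarithm); terms with p[i] == 0 contribute 0. */
    static double klDivergence(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                sum += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return sum;
    }
}
```

Under these definitions, lower values of the discernibility cost and of the KL divergence indicate that less utility is lost through anonymization.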
4.4. Analysis
The evaluation shows the level of confidence P^{−1}, which helps to maintain the privacy of data even when the data are attacked. The study considers an individual QID attribute that maps to a sensitive attribute for the given unique clustered sample. The results also show that ACO clustering improves data utility and yields better clustering results than other models.
The study also observes that the tradeoff provides the smart manufacturing datasets with improved data utility and privacy. With increasing cluster size, the balance between utility and privacy is achieved without compromising either of them. However, with increased dataset size, data privacy tends to be reduced while data utility is maintained. The reduction of data privacy also occurs with an increased number of data samples in the smart manufacturing datasets.
5. Conclusion
In this paper, the tradeoff between the utility and privacy of smart manufacturing datasets is maintained in an optimal manner. Utility through ACO clustering and privacy through ECC encryption offer an optimal tradeoff, such that the model does not sacrifice one factor for the other. The extensive evaluation conducted on the smart manufacturing datasets shows reduced cost, optimal divergence, and accurate class size. The model further improves the privacy of data while considering utility metrics such as F-measure and clustering accuracy. The privacy results indicate that the data are distributed in a secure way and that the sensitive information is protected against various attacks. The ECCACO is not algorithm specific and can be applied to any data type. It is highly scalable to larger datasets. However, the major limitation of the study is that the method has been applied in a domain-specific setting and not across multiple applications. In future, the running time of ECCACO can be tuned, as it is directly affected by the formation of an increased number of clusters. The privacy of datasets can also be improved against other attacks.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.