Abstract

Utility and privacy are competing factors in data publishing: improving one tends to sacrifice the other. A dataset therefore cannot be published without privacy protection, and it is crucial to maintain an equilibrium between the utility and privacy of data. In this paper, a novel technique for the trade-off between utility and privacy is developed, where the former is handled by a metaheuristic algorithm and the latter by a cryptographic model. Utility is achieved through clustering, and the privacy model encrypts and decrypts the data. At first, the input datasets are clustered, and after clustering, the privacy of data is maintained. The simulation is conducted on manufacturing datasets and compared against various existing models. The results show that the proposed model achieves better clustering accuracy and data privacy than the existing models. The evaluation shows that the proposed model balances privacy preservation and clustering utility on smart manufacturing datasets.

1. Introduction

Data anonymization [1] is considered a major factor for data preservation in data mining. It maintains the privacy and sensitivity of data for data publishing. Since a large portion of data publishing and data mining schemes are subject to linkage attacks [2] or probabilistic inference attacks [3], anonymization is typically achieved through attribute suppression or generalization. These techniques either cause data loss or lead to an NP-hard problem. Therefore, it is essential to use privacy preservation models with lightweight mechanisms to avoid constraints such as NP-hardness and data loss.

While maintaining privacy, it is essential to address data utility, since it preserves the accuracy of the content and keeps the dataset useful when it is presented to users. Eliminating appropriate data diminishes the sensitive information [4]. Therefore, it is essential to maintain the trade-off between the utility and privacy metrics in data publishing schemes [5].

Privacy issues in statistical databases have been investigated thoroughly, and privacy-preserving techniques in data mining remain an active research area. Various privacy-preserving clustering methods [4, 6–22] are used as privacy-preserving data mining models to secure the data and to balance security with data utility. In all these techniques, the frequency of sensitive information within the formed clusters is not uniform: unlike in the original datasets, certain values occur more frequently in each cluster. This leads to probabilistic inference attacks. Hence, the sensitive information in the clusters must be uniformly distributed. Therefore, the data records that share a specific value of the sensitive attribute need to be distributed among all classes using a neighborhood-based method. This yields anonymized clusters whose sensitive information is distributed uniformly. In this regard, we develop a model that maintains the sensitivity of data while treating privacy and utility as its utmost concerns. The main contributions of the paper are as follows:
(1) The authors develop a novel trade-off scheme that maintains the utility and privacy of smart manufacturing data
(2) The utility is handled by a metaheuristic algorithm, and the privacy is maintained using a lightweight cryptographic algorithm
(3) Finally, the model is tested on various smart manufacturing datasets in order to validate it against privacy-utility metrics

The outline of the paper is as follows. Section 2 reviews the related works. Section 3 presents the data anonymization and data clustering technique, including the preliminaries and the algorithms required to resolve the problem. Section 4 evaluates the proposed work against existing methods. Section 5 concludes the work with possible directions for future research.

2. Related Works

Liu et al. [23] proposed the minimum monotonicity of privacy-utility, which constitutes the limits of differential privacy. The study first defined the restricted concession of varying privacy and suggested limited data protection monotony based on technological irrelevance. The study also demonstrated, in theory, various distinct data protection structures.

Ruddell et al. [24] offered a statistical overview of the relationship between effectiveness and privacy in Los Angeles. The trade-off depends on the economic sector and the type of service. This study provides guidelines for a better and more ethically balanced collection and distribution of customer knowledge.

Cho et al. [25] offered a guideline for automated biomedical databases that optimizes the trade-off between differential privacy and utility.

Asikis and Pournaras [26] studied privacy settings that trade off between maximum utility with minimal privacy and minimal utility with maximal privacy, where utility refers to the accuracy of aggregation calculations. Privacy settings can be applied universally as system-wide rules and checks. They may, however, also be chosen individually by each customer, possibly in response to offers.

Valdez and Ziefle [10] explored under which conditions sharing personal health information is acceptable and beneficial while security and confidentiality are protected. The benefit of the data exchange, the type of data, the privacy level, and the data recipient were varied, and users were asked to select the ideal scenario. Participants accepted the use of data for clinical studies and for physical disorders, but objected to sharing data on mental illness, to data with a high de-anonymisation risk, and to use for commercial purposes.

3. ECC-ACO Algorithm

The ECC-ACO algorithm is illustrated in Figure 1. The study protects the privacy of datasets using the lightweight ECC algorithm prior to the formation of clusters and their relevant classes. The clusters are formed in such a manner that the clustering error is reduced. To validate the clustered samples, we use classification to assign the samples to their relevant classes. The data are transmitted or published if the classification error is lower than the threshold level. The privacy of data is not compromised or adjusted based on the clustered instances in order to maintain the utility.

3.1. Preliminaries

The datasets published by owners usually have private and public attributes. The public attributes are known to all users and are publicly available.

Formally, let $I = \{i_1, i_2, \ldots, i_M\}$ represent a set of $M$ items. Consider an itemset $X$, which is a subset of $I$, that is, $X \subseteq I$. A transaction $T$ is represented as a tuple (tid, $X$), where $X$ is the itemset and tid is an identifier that differentiates transactions with the same items.

An itemset $X$ is supported by a transaction $T$ if $X$ is a subset of the items in $T$. The transactional database $D$ over the itemset $I$ contains $N$ transactions and can be written as a binary matrix in which $T_{nm} = 1$ if the $n$th transaction contains the $m$th item and $T_{nm} = 0$ otherwise.
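
The binary representation and the support test can be illustrated with a minimal sketch. The item names, the transaction identifiers, and the helper names `build_transaction_matrix` and `supports` are hypothetical and serve only as an illustration of the definitions above.

```python
from typing import Dict, List, Set


def build_transaction_matrix(items: List[str],
                             transactions: Dict[int, Set[str]]) -> List[List[int]]:
    """Return the N x M binary matrix: entry [n][m] is 1 if transaction n contains item m."""
    return [[1 if item in itemset else 0 for item in items]
            for _, itemset in sorted(transactions.items())]


def supports(transaction_items: Set[str], x: Set[str]) -> bool:
    """A transaction supports the itemset x when x is a subset of its items."""
    return x <= transaction_items


# Toy database (hypothetical): tid -> itemset
items = ["a", "b", "c"]
transactions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "b", "c"}}

matrix = build_transaction_matrix(items, transactions)
# Number of transactions that support the itemset {a, b}
support_count = sum(supports(t, {"a", "b"}) for t in transactions.values())
```

Here the first row of `matrix` corresponds to tid 1, and `support_count` counts the transactions that contain both items.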

Publicly available attributes that an intruder can use to reveal a user's identity are called quasi-identifier (QID) attributes. At the time of data publishing, certain attributes are kept confidential, and these are called sensitive attributes (SA). Datasets usually contain information such as gender, age, and zipcode, which are QID attributes. Further, the disease of an individual is a sensitive attribute, and the identity of a person is a public attribute. An attacker can easily reveal private or sensitive user information without the user's knowledge. Knowing the QID information of a neighbor, the attacker can link it to the neighbor's disease; such an attack is called a linking attack or identity disclosure attack. However, the conclusion about the disease may be uncertain, which leads to another attack called the probabilistic inference attack. This attack is common where one value of a sensitive attribute occurs more frequently than the others, which allows the attacker to predict the most common disease from the QID values. The proposed clustering method developed in this paper provides better resilience against the above two attacks.
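
The probabilistic inference attack can be made concrete with a small sketch that groups records by their QID values and reports how confidently the most frequent sensitive value could be guessed in each group. The records, attribute names, and the function `inference_risk` are hypothetical and are not taken from the datasets used in this study.

```python
from collections import Counter, defaultdict


def inference_risk(records, qid_keys, sensitive_key):
    """Map each QID group to (most frequent sensitive value, its relative frequency)."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in qid_keys)].append(record[sensitive_key])
    risk = {}
    for qid, values in groups.items():
        value, count = Counter(values).most_common(1)[0]
        risk[qid] = (value, count / len(values))
    return risk


# Hypothetical generalized records: two QID groups with different value frequencies
records = [
    {"age": "30-39", "zipcode": "130**", "disease": "flu"},
    {"age": "30-39", "zipcode": "130**", "disease": "flu"},
    {"age": "30-39", "zipcode": "130**", "disease": "flu"},
    {"age": "30-39", "zipcode": "130**", "disease": "asthma"},
    {"age": "40-49", "zipcode": "148**", "disease": "flu"},
    {"age": "40-49", "zipcode": "148**", "disease": "asthma"},
]

risk = inference_risk(records, ["age", "zipcode"], "disease")
```

In the first group an attacker who knows only the QID values can guess the disease with a confidence of 0.75, whereas in the second group, where the sensitive values are uniformly distributed, the confidence drops to 0.5.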

3.2. ACO Clustering Process

The data clustering or partitioning process is given in [20] as a sequence of steps, which are listed in Algorithm 1. After the data are partitioned into clusters, the cluster sizes may vary. Hence, it is necessary to adjust the sizes of the clusters to obtain k-anonymity. A cluster whose size is less than k is merged with other clusters. This reduces the deterministic penalty since the combinations of all the clusters are the same. The adjustment is given in Algorithm 2.

Algorithm 1: ACO clustering.
Input: privacy datasets
Output: unequally clustered datasets
Step 1. Pheromone matrix initialization, where the matrix elements are set to small values.
Step 2. Ants initialization
  (a) Start the initial process of iteration
  (b) Reset the memory, cluster center matrix, and weight matrix
Step 3. Random selection of a data object (other than those in the memory list)
Step 4. Cluster selection using the process of exploitation and exploration
  (a) Exploitation using the greedy algorithm
  (b) Exploration allots probabilities to the nodes and chooses the higher-probability ant in a stochastic manner
  (c) This is regarded as the cluster center, which is used for forming the cluster
Step 5. Update the ants
  (a) Update the memory, cluster center matrix, and weight matrix
  (b) If the memory list is not full, go back to Step 3
  (c) Else continue with the process
  (d) End
Step 6. Find the fitness
Step 7. Update the value of the pheromone
Step 8. Check the end condition
  (a) If total iterations > maximum iterations
    (i) End the process
  (b) Else go to Step 6
  (c) End
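
The steps of the ACO clustering procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the pheromone matrix holds one value per (object, cluster) pair, the local heuristic is omitted, the fitness is the sum of squared distances to the cluster centers, and only the best solution found so far deposits pheromone. The names `fitness`, `aco_cluster`, `q0`, and `rho`, as well as all default values, are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def fitness(points: List[Point], labels: List[int], k: int) -> float:
    """Sum of squared distances of every point to the center of its cluster."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dims):
            sums[c][d] += p[d]
    cost = 0.0
    for p, c in zip(points, labels):
        cost += sum((p[d] - sums[c][d] / counts[c]) ** 2 for d in range(dims))
    return cost


def aco_cluster(points: List[Point], k: int, n_ants: int = 10, n_iter: int = 50,
                q0: float = 0.8, rho: float = 0.1,
                seed: int = 0) -> Tuple[List[int], float]:
    rng = random.Random(seed)
    n = len(points)
    tau = [[0.01] * k for _ in range(n)]           # Step 1: small initial pheromone
    best_labels, best_cost = None, float("inf")
    for _ in range(n_iter):                        # Step 8: iteration limit
        for _ in range(n_ants):                    # Step 2: fresh memory per ant
            labels = [0] * n
            order = list(range(n))
            rng.shuffle(order)                     # Step 3: random object order
            for i in order:                        # Step 4: choose a cluster
                if rng.random() < q0:              # exploitation (greedy choice)
                    top = max(tau[i])
                    labels[i] = rng.choice([c for c in range(k) if tau[i][c] == top])
                else:                              # exploration (stochastic choice)
                    labels[i] = rng.choices(range(k), weights=tau[i])[0]
            cost = fitness(points, labels, k)      # Step 6: evaluate the solution
            if cost < best_cost:
                best_labels, best_cost = labels, cost
        for i in range(n):                         # Step 7: evaporate and deposit
            for c in range(k):
                tau[i][c] *= (1.0 - rho)
            tau[i][best_labels[i]] += 1.0 / (1.0 + best_cost)
    return best_labels, best_cost


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]
labels, cost = aco_cluster(points, k=2, seed=1)
```

The returned clusters are generally of unequal size, which is why a separate size adjustment is needed before k-anonymity holds.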

Algorithm 2: Cluster size adjustment for k-anonymity.
Input: unequal clusters from the ACO algorithm
Output: k-anonymity table
Step 1. For each cluster
  (a) Find the size of the cluster
Step 2. If the size of the cluster is less than the maximum cluster size
  (a) Merge the clusters
Step 3. Else
Step 4. Divide the groups into subgroups, s.t. each group has at least k tuples
Step 5. End
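
The size adjustment can be sketched as below. The merge policy (an undersized cluster is absorbed by the next-smallest cluster) and the split policy (consecutive chunks of k records, with the remainder appended to the last chunk) are assumptions chosen for brevity; the function names are hypothetical.

```python
from typing import List


def merge_small(clusters: List[list], k: int) -> List[list]:
    """Merge clusters with fewer than k records into the next-smallest cluster."""
    merged = [list(c) for c in clusters if c]
    while len(merged) > 1 and min(len(c) for c in merged) < k:
        merged.sort(key=len)
        smallest = merged.pop(0)
        merged[0].extend(smallest)
    return merged


def split_large(cluster: list, k: int) -> List[list]:
    """Split a cluster into groups that each hold at least k records."""
    n_groups = len(cluster) // k
    if n_groups <= 1:
        return [list(cluster)]
    groups = [list(cluster[i * k:(i + 1) * k]) for i in range(n_groups)]
    groups[-1].extend(cluster[n_groups * k:])  # leftover records join the last group
    return groups


def adjust_clusters(clusters: List[list], k: int) -> List[list]:
    """Merge undersized clusters, then split oversized ones."""
    result = []
    for cluster in merge_small(clusters, k):
        result.extend(split_large(cluster, k))
    return result


# Record identifiers grouped into unequal clusters (hypothetical)
groups = adjust_clusters([[1, 2, 3, 4, 5, 6, 7], [8], [9, 10]], k=3)
```

Every group in the result contains at least k records as long as the dataset itself contains at least k records; a distance-based merge would preserve more utility than the size-based merge used here.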

To improve the uniformity of the distributed clusters, the study uses a K-nearest neighbor (KNN) classifier after clustering the datasets using ACO. The classifier generates uniform equivalence classes of the clustered samples, and hence the information provided for data privacy is considered sensitive. In each cluster, KNN finds the equivalence class of the best neighbor while maintaining the sensitive information of each data sample. Algorithm 3 provides the pseudocode of ECC-ACO with the classifier.

Algorithm 3: ECC-ACO with classifier.
Input: privacy datasets
Output: classified sensitive samples
Step 1. Sort the private dataset by sensitive attributes
Step 2. Split the sorted dataset into subgroups
Step 3. If identical attributes exist
  (a) Cluster using ACO
Step 4. Else
  (a) Go to Step 1
Step 5. End
Step 6. Repeat over the other clusters
  (a) If the cluster size = Dmin
  (b) Distribute the data samples among the clusters
  (c) Create a single-element cluster
  (d) Add the nearest neighbor to the cluster
  (e) Eliminate unwanted instances
  (f) End
Step 7. In case two, where the value of S is greater than K
  (a) Distribute the instances equally to all clusters
  (b) Add the record from Dmin to the single-element cluster
  (c) Follow Step 6(c)
  (d) Split the Dmin values from each cluster
  (e) Follow Step 8
  (f) Compute the KNN values and add the neighboring values to each cluster, as in case one
  (g) Replace the QID values with the center values in each cluster to anonymize the data clusters
  (h) Repeat the process
Step 8. Calculate the centroid
Step 9. Use the Euclidean distance to find the instances nearest to the centroid in each cluster
Step 10. Classify the similar instances using classifiers
Step 11. Merge the equivalent class samples
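
The centroid computation, the Euclidean nearest-instance search, and the replacement of QID values by the cluster center (Steps 7(g), 8, and 9) can be sketched as follows. The sketch assumes numeric QID attributes; the sample values and function names are hypothetical.

```python
import math
from typing import List, Sequence


def centroid(rows: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of the QID vectors in one cluster."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def nearest_to_centroid(rows: List[Sequence[float]], count: int) -> List[Sequence[float]]:
    """Return the `count` rows closest to the cluster centroid (Euclidean distance)."""
    center = centroid(rows)
    return sorted(rows, key=lambda row: math.dist(row, center))[:count]


def anonymize_cluster(rows: List[Sequence[float]]) -> List[List[float]]:
    """Replace every QID vector in the cluster by the cluster center."""
    center = centroid(rows)
    return [list(center) for _ in rows]


# Hypothetical cluster with two numeric QID attributes: (age, zipcode)
cluster = [[30, 13000], [34, 13010], [38, 13020]]
center = centroid(cluster)
closest = nearest_to_centroid(cluster, 1)
published = anonymize_cluster(cluster)
```

After anonymization all records of a cluster share the same QID values, so they can no longer be told apart by a linking attack within that cluster.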

4. Results and Discussion

The implementation of ECC-ACO is carried out in Java using the Eclipse IDE. The study considers two different experiments on smart manufacturing datasets. ECC-ACO is tested with Smart Energy Data: Aachen/Cologne Smart Factory [21] and Power System Datasets [22]. It can be applied effectively to power system analysis or to any application that generates data rapidly. The model is tested using three metrics: average equivalence class size after classification, discernibility cost, and Kullback-Leibler (KL) divergence. Further, utility is estimated in terms of F-measure, accuracy, and execution time.
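
For orientation, the three metrics can be computed as in the sketch below, which uses their commonly cited definitions: the discernibility cost as the sum of squared equivalence class sizes, the average class size as the plain mean, and the KL divergence between the sensitive-value distribution of one class and that of the whole dataset. Whether the study uses exactly these variants (for example, a class size normalized by k) is not stated, so the definitions and the toy data are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def discernibility_cost(classes: List[Sequence]) -> int:
    """Each record is penalized by the size of its equivalence class."""
    return sum(len(c) ** 2 for c in classes)


def average_class_size(classes: List[Sequence]) -> float:
    """Mean number of records per equivalence class."""
    return sum(len(c) for c in classes) / len(classes)


def kl_divergence(p_values: Sequence, q_values: Sequence) -> float:
    """KL(P || Q) between the empirical distributions of two value lists."""
    p_counts, q_counts = Counter(p_values), Counter(q_values)
    total = 0.0
    for value, count in p_counts.items():
        p = count / len(p_values)
        q = q_counts[value] / len(q_values)
        if q == 0:
            return math.inf  # P puts mass where Q has none
        total += p * math.log(p / q)
    return total


# Sensitive values of two hypothetical equivalence classes
classes = [["flu", "flu", "asthma"], ["flu", "asthma", "asthma"]]
whole = [v for c in classes for v in c]

dc = discernibility_cost(classes)
avg = average_class_size(classes)
kl = kl_divergence(classes[0], whole)
```

A KL divergence close to zero indicates that the sensitive values in a class follow the overall distribution, which is the uniformity property that limits probabilistic inference.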

4.1. Cluster Output

The conventional methods work on anonymization, which forms equivalence classes or groups; however, they fail to provide information on cluster size. The model uses both of these parameters, since the cluster size clearly shows the difference between the ACO and ECC-ACO clustering algorithms. The experiments are conducted on the synthetic dataset and the adult dataset for case one. Further, the results report the cluster sizes formed by ACO.

The study considers six parameters: the total number of ants (m), a parameter α that controls the influence of the pheromone τ on the sample selection probability, a parameter β that controls the influence of the local heuristic factor η on the sample selection probability, the pheromone persistence coefficient ρ, the pheromone reward factor R, and the pheromone penalty factor Pp.
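
For reference, these parameters enter the selection probability and the pheromone update of a textbook ACO formulation as sketched below. This is the standard form and not necessarily the exact rule used in the study; the indices $i$ (sample) and $j$ (cluster), the number of clusters $K$, the deposit term $\Delta\tau_{ij}$, and the way $R$ and $P_p$ are applied are assumptions made for illustration.

```latex
p_{ij} = \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}
              {\sum_{l=1}^{K} \tau_{il}^{\alpha}\,\eta_{il}^{\beta}},
\qquad
\tau_{ij} \leftarrow \rho\,\tau_{ij} + \Delta\tau_{ij},
\qquad
\Delta\tau_{ij} =
\begin{cases}
R, & \text{if assigning sample } i \text{ to cluster } j \text{ is rewarded},\\
-P_p, & \text{if the assignment is penalized},\\
0, & \text{otherwise}.
\end{cases}
```

A larger α makes the ants follow the accumulated pheromone more closely, a larger β favors the local heuristic, and ρ determines how much of the previous pheromone persists between iterations.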

Figure 2 provides the total number of clusters and the cluster frequency for case one. Efficient clusters are created by applying the ant colony optimization (ACO) clustering scheme to the given datasets with different numbers of sensitive values, namely 5, 10, and 15 (Figures 2(a)–2(c)), with a total of 10000 records and a fixed k value. In this experiment, the value of S is varied in order to show how the clusters differ across frequencies. Similarly, results are obtained by varying the value of k, i.e., 50 (Figure 2(d)) and 100 (Figure 2(e)), with a fixed S value. Finally, Figure 2(f) shows the clusters formed by the elliptic curve cryptography-ant colony optimization (ECC-ACO) algorithm with a fixed k value (=300) and a fixed S value (=2), i.e., case one. The results show that ACO produces equivalence clusters of differing sizes, whereas the ECC-ACO algorithm forms clusters of uniform size.

Finally, ECC-ACO is compared with existing models in terms of detection accuracy and F-measure. Here, the study classifies the clustered samples in order to find the correct classes. These two measures quantify the correctly classified instances obtained with classifiers such as Naïve Bayes and J48 on the smart manufacturing dataset. The experiments on case one and case two are carried out on HDFS.
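
The two utility measures can be computed from true and predicted class labels as in the following sketch. It uses the standard definitions of accuracy and of the F-measure as the harmonic mean of precision and recall for one positive class; the label vectors are hypothetical and do not reproduce any result of the study.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of instances whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
              positive: Hashable) -> float:
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight classified instances
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred, positive=1)
```

In this toy case six of eight instances are classified correctly, and precision and recall for class 1 are both 2/3.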

4.2. Comparison with Existing Methods

Tables 1 and 2 show the percentage of optimal clustering of smart manufacturing data instances for varying values of k, namely 5, 10, 25, and 30. The clustering is optimal on the smart manufacturing data when, after clustering, the instances are classified with the two classifiers, Naïve Bayes and J48.

Tables 3 and 4 show the results for the instances classified by the classifiers with varying values of k. The Hadoop system is tuned by varying its parameters in order to increase the classification accuracy. A significant improvement is observed with increased random shuffling prior to data chunk creation. ECC-ACO exhibits close-to-optimal performance with the J48 classifier. With Naïve Bayes, the ACO clustering algorithm obtains better performance than the other algorithms. Overall, ECC-ACO exhibits better classification accuracy and F-measure than the existing algorithms.

4.3. Data Utility and Privacy Metrics

The model is tested with three metrics, namely equivalence class size, discernibility cost, and Kullback-Leibler divergence, to find the utility loss of the privacy-preserved smart manufacturing datasets. The proposed model incurs less data loss than the other methods. Tables 5 and 6 show the discernibility cost for the experiments conducted on the different datasets. The model achieves a lower discernibility cost than the other methods, which indicates better data utility.

Table 7 shows the average class size on dataset 1, and Table 8 shows the average class size on dataset 2.

Table 9 shows the degree of privacy for the manufacturing datasets with increasing values of k. The results show that ECC-ACO achieves a higher degree of privacy than the ACO model as k increases.

4.4. Analysis

The evaluation shows a confidence level of P−1, which helps to maintain the privacy of data even when the data are attacked. The study considers an individual QID attribute that offers a sensitive attribute to the given unique clustered sample. The results also show that ACO clustering improves data utility and yields better clustering results than the other models.

The study also observes that the trade-off provides the smart manufacturing datasets with improved data utility and privacy. With increasing cluster size, the balance between utility and privacy is achieved without compromising either of them. However, with increasing dataset size, data privacy tends to be reduced while data utility is maintained. The reduction in data privacy also occurs with an increased number of data samples in the smart manufacturing datasets.

5. Conclusion

In this paper, the trade-off between the utility and privacy of smart manufacturing datasets is maintained in an optimal manner. Utility through ACO clustering and privacy through ECC encryption offer an optimal trade-off such that the model does not sacrifice one factor for the other. The extensive evaluation conducted on the smart manufacturing datasets shows reduced cost, optimal divergence, and accurate class size. The model further improves the privacy of data while accounting for utility metrics such as F-measure and clustering accuracy. The privacy results indicate that the data are distributed in a secure way and that the sensitive information is protected against various attacks. ECC-ACO is not specific to one data type and can be applied to any data type. It is highly scalable to larger datasets. The major limitation of the study is that the method can be applied only to a specific domain and not across multiple applications. In future, ECC-ACO can be tuned to reduce the running time, which is directly affected by the formation of an increased number of clusters. The privacy of datasets can also be improved against other attacks.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.