Abstract

In recent years, blockchain and machine-learning techniques have received increasing attention in both theoretical and practical aspects. However, applying these techniques raises many challenges, one of which is the privacy-preserving issue. In this paper, we focus specifically on the privacy-preserving issue for imbalanced datasets, a problem commonly found in real-world applications. Building on the fully homomorphic encryption technique, this paper presents two new secure protocols, the Privacy-Preserving Synthetic Minority Oversampling Protocol (PPSMOS) and the Borderline Privacy-Preserving Synthetic Minority Oversampling Protocol (Borderline-PPSMOS). Our analysis reveals that PPSMOS is generally more efficient than Borderline-PPSMOS, whereas Borderline-PPSMOS achieves a better TP rate and F-Value than PPSMOS.

1. Introduction

In the past few years, new information technologies, such as blockchain [1–4] and machine learning [5–15], have been developing rapidly and have been used successfully in various real-life applications. However, they still face a critical challenge in the privacy-preserving issue. For example, the openness of a blockchain system poses a serious threat to the privacy and security of user transactions. Thus, research on privacy-preserving techniques is becoming ever more crucial.

Datasets in the wild come with a variety of problems. One of the most common is class imbalance. The imbalanced dataset issue arises in many real-world sectors, such as disease detection [16], bankruptcy prediction [17], fraud detection [18], etc. As the distribution of samples across classes is skewed in an imbalanced dataset, classification algorithms may produce inaccurate results and further issues. An imbalanced dataset usually consists of a number of classes, each falling into one of two types: majority classes, which have a larger number of examples, and minority classes, which have fewer examples. In this paper, we consider the situation where there are only two classes in a dataset, i.e., one majority class and one minority class.

The existing solutions to the imbalanced dataset problem are categorized according to the level at which they address it, e.g., the data level, the feature level, and the machine-learning algorithm level. In this paper, we focus on fixing the problem at the data level. There are two kinds of data-level techniques, namely undersampling and oversampling methods. The undersampling method removes part of the samples from the majority class to balance the ratio of majority and minority samples, whereas the oversampling method balances the classes by generating new minority samples. In 1972, Wilson [19] proposed an undersampling method in which a majority sample is deleted if all of its neighbors are minority samples. In 2020, Wang et al. [20] proposed a novel entropy and confidence-based undersampling boosting framework to solve imbalanced dataset issues, which can also be applied to noniterating algorithms such as decision trees.

Random oversampling of the minority class is the simplest oversampling method: samples are continuously drawn from the minority class with replacement. This method, however, can easily lead to overfitting. In 2002, Chawla et al. [21] proposed the Synthetic Minority Oversampling Technique (SMOTE), which remains one of the best-known oversampling methods to date. The algorithm generates artificial data using bootstrapping and the k-nearest neighbor algorithm. Further improving SMOTE, Han et al. [22] proposed Borderline-SMOTE in 2005. The algorithm focuses on the samples that lie on the boundary between the majority and minority classes. Their experiments showed that Borderline-SMOTE achieves a better TP rate and F-Value than its predecessor. In 2008, Douzas et al. [23] presented a simple and effective oversampling method based on K-means clustering and SMOTE, which is able to eliminate noise generation and effectively overcome imbalances between and within classes. Furthermore, Li et al. [24] presented three sampling approaches for imbalanced learning in 2020. Unlike the previous solutions, their approaches consider a new class-imbalance metric, which captures the difference of information content between classes, instead of the traditional imbalance ratio.
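
To make the privacy-preserving constructions in Sections 3 and 4 easier to follow, the short sketch below illustrates the core plaintext SMOTE step of [21] on toy data: pick a minority sample, pick one of its k nearest minority neighbors, and interpolate between them. The function name, toy data, and parameter choices are ours, not taken from the cited work.

```python
import numpy as np

def smote_sample(minority, k=5, rng=np.random.default_rng(0)):
    """Generate one synthetic minority sample as in plaintext SMOTE:
    pick a minority point x_i, pick one of its k nearest minority
    neighbors x_j, and return x_i + delta * (x_j - x_i), delta in [0, 1)."""
    m = len(minority)
    i = rng.integers(m)
    # squared Euclidean distances from x_i to all other minority points
    d = np.sum((minority - minority[i]) ** 2, axis=1)
    d[i] = np.inf                      # exclude the point itself
    neighbors = np.argsort(d)[:k]      # indices of the k nearest minority neighbors
    j = rng.choice(neighbors)
    delta = rng.random()               # random interpolation factor in [0, 1)
    return minority[i] + delta * (minority[j] - minority[i])

# toy minority class with 6 samples and 2 features
P = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [1.1, 1.3], [0.9, 0.7], [1.3, 1.2]])
print(smote_sample(P, k=3))
```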

Although many solutions have been proposed for the imbalanced dataset problem, the privacy-preserving issue has not been well addressed. To the best of our knowledge, the closest existing work is by Hong et al. [25], who proposed a secure collaborative machine-learning solution that uses secure multiparty computation to adjust the class weights for the imbalanced dataset; that is, the privacy-preserving issue of the imbalanced dataset is tackled at the machine-learning algorithm level. Such an algorithm-level solution is specific: when the machine-learning algorithm changes, a new privacy-preserving solution to the imbalanced dataset problem has to be designed. By contrast, privacy-preserving solutions at the data level address the problem in the preprocessing stage, so their output can be widely used, independent of the machine-learning algorithms. Therefore, in this paper, we focus on the privacy-preserving issue of imbalanced datasets at the data level.

Currently, Secure Multiparty Computation (SMC) is one of the most widely used techniques for tackling the privacy-preserving issue. In SMC, multiple parties jointly perform a computation on their individual private inputs, and no party learns anything about the others' inputs; when the computation ends, the designated parties obtain the output prescribed by the protocol. The first SMC solution [26], for the millionaires' problem, was presented by Yao. Since then, SMC has developed rapidly. In 2017, Makri et al. [27] proposed a private image classification scheme with SVM built on the SPDZ SMC framework. Mohassel et al. [28] presented SecureML, a privacy-preserving machine-learning framework that addresses the privacy-preserving training of linear regression, logistic regression, and neural networks using the stochastic gradient descent method.

SMC relies on various underlying cryptographic tools, such as garbled circuits, homomorphic encryption schemes, oblivious transfer, and secret sharing schemes. In this paper, we handle the imbalanced dataset problem with privacy-preserving two-party computation based on homomorphic encryption. Homomorphic encryption is one of the most active research areas in cryptography. It was initially proposed by Rivest et al. [29] in 1978. In 1985, ElGamal [30] proposed a widely used multiplicatively homomorphic encryption scheme, known as the ElGamal scheme. In 2001, Damgård et al. [31] presented a generalization of the additively homomorphic Paillier scheme. In 2009, Gentry [32] proposed the first fully homomorphic encryption scheme, a ground-breaking development in the study of homomorphic encryption. Currently, the two most widely used fully homomorphic encryption schemes are the BGV scheme [33] by Brakerski et al. and the BFV scheme [34] by Fan and Vercauteren. In 2021, Chen et al. [35] presented a dynamic multikey fully homomorphic encryption scheme based on the LWE assumption in the public key setting.

1.1. Contributions

In this paper, we propose two novel privacy-preserving oversampling protocols, namely PPSMOS and Borderline-PPSMOS. Both PPSMOS and Borderline-PPSMOS aim to solve the imbalanced dataset problem while preserving the participants' input and output privacy. With the client and the service provider denoted as Bob and Alice, respectively, the work in this paper can be summarized as follows.
(1) PPSMOS: This protocol works in a distributed architecture, where Bob inputs no examples at the beginning of the protocol. All the examples, both majority and minority, are provided by Alice. After the protocol, Bob gets the synthetic minority example while he learns nothing of Alice's examples. At the same time, Alice learns nothing of the output Bob receives. PPSMOS is a good privacy-preserving solution to the data-balance problems encountered in the cold-start phase of many real-life applications.
(2) Borderline-PPSMOS: In this protocol, Bob has some majority examples as his input at the start, while Alice has a number of minority examples. After the protocol, Bob receives synthetic minority examples, while he learns nothing of Alice's minority examples, and Alice learns nothing of Bob's input and output.
(3) Performance analysis of PPSMOS and Borderline-PPSMOS: Our analysis shows that PPSMOS generally works more efficiently than Borderline-PPSMOS, while Borderline-PPSMOS achieves a better TP rate and F-Value than PPSMOS. We also show that both PPSMOS and Borderline-PPSMOS are secure in the semihonest model.

1.2. Roadmap of This Paper

The rest of this paper is organized as follows. In Section 2, we introduce the preliminaries. We present the Privacy-Preserving Synthetic Minority Oversampling (PPSMOS) protocol in Section 3 and Borderline-PPSMOS in Section 4. We compare and analyse our protocols in Section 5. We, then, give our concluding remarks in Section 6.

2. Preliminaries

2.1. Homomorphic Encryption

A homomorphic encryption scheme allows us to operate on ciphertexts directly: decrypting the result of such an operation yields the same value as performing the corresponding operation on the underlying plaintexts. Homomorphic encryption schemes are divided into three categories: additively homomorphic, multiplicatively homomorphic, and fully homomorphic. For our protocols, we adopt a fully homomorphic encryption scheme, which we describe as follows.

We denote $(pk, sk)$ as the system key pair, where $pk$ is the public key and $sk$ is the secret key. Furthermore, $E_{pk}(m)$ is the encryption operation on the plaintext $m$ and $D_{sk}(c)$ is the decryption operation on the ciphertext $c$. Letting $\oplus$ and $\otimes$ denote the homomorphic addition and multiplication of ciphertexts, the fully homomorphic encryption scheme satisfies the following properties: $D_{sk}\big(E_{pk}(m_1) \oplus E_{pk}(m_2)\big) = m_1 + m_2$ and $D_{sk}\big(E_{pk}(m_1) \otimes E_{pk}(m_2)\big) = m_1 \times m_2$.
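
To make the notation above concrete, the toy wrapper below mirrors the interface $(pk, sk)$, $E_{pk}(\cdot)$, $D_{sk}(\cdot)$, $\oplus$, and $\otimes$ that the protocols in Sections 3 and 4 rely on. It performs no real encryption (the "ciphertext" simply stores the plaintext), so it is only an illustration of the interface; a real deployment would instantiate a scheme such as BGV or BFV through an existing library.

```python
from dataclasses import dataclass

# Illustrative stand-in for a fully homomorphic encryption scheme.
# It only mirrors the interface used in this paper (E, D, addition, multiplication);
# the "ciphertext" is the plaintext itself, so there is NO security here.

@dataclass
class Ciphertext:
    value: float  # a real FHE ciphertext would be a lattice element

class ToyFHE:
    def keygen(self):
        return "pk", "sk"                      # placeholders for (pk, sk)

    def encrypt(self, pk, m):                  # E_pk(m)
        return Ciphertext(m)

    def decrypt(self, sk, c):                  # D_sk(c)
        return c.value

    def add(self, c1, c2):                     # E(m1) ⊕ E(m2) -> encryption of m1 + m2
        return Ciphertext(c1.value + c2.value)

    def mul(self, c1, c2):                     # E(m1) ⊗ E(m2) -> encryption of m1 * m2
        return Ciphertext(c1.value * c2.value)

fhe = ToyFHE()
pk, sk = fhe.keygen()
c1, c2 = fhe.encrypt(pk, 3.0), fhe.encrypt(pk, 4.0)
assert fhe.decrypt(sk, fhe.add(c1, c2)) == 7.0    # additive homomorphism
assert fhe.decrypt(sk, fhe.mul(c1, c2)) == 12.0   # multiplicative homomorphism
```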

2.2. Semihonest Model

There are two widely used adversarial models in SMC, the semihonest model, and the malicious model. In this work, we design our protocols in the semihonest model.

In the semihonest model, there are two kinds of participants, honest participants and semihonest participants. The honest participants follow the protocol without doing anything else. The semihonest participants also follow the protocol, but they record the data they obtain during its execution; after the protocol, they may try to infer additional information from the recorded data. A protocol is secure in the semihonest model if the semihonest participants can obtain no valuable information from the data they collected.

3. Privacy-Preserving Synthetic Minority Oversampling Protocol

In this section, we present our Privacy-Preserving Synthetic Minority Oversampling Protocol (PPSMOS) and analyze its security aspect.

Suppose that Alice has the total dataset $T = \{x_1, x_2, \ldots, x_s\}$ with $x_i \in \mathbb{R}^n$. To simplify, Alice puts all the minority samples at the front of $T$. In other words, we denote the minority subclass by $P = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of minority samples. Both Alice and Bob wish to generate a minority sample based on $P$. After the protocol, Bob gets the output $x_{new}$, under the condition that Alice cannot learn any information about $x_{new}$ and Bob cannot learn any information about $T$.

3.1. PPSMOS
3.1.1. Input

Alice inputs $T = \{x_1, x_2, \ldots, x_s\}$, where $x_i \in \mathbb{R}^n$ and $s$ is the dataset size, with the first $m$ elements belonging to the minority class $P$. Bob inputs nothing.

3.1.2. Output

Bob obtains a newly synthesized minority sample $x_{new}$, while Alice gets nothing.

3.1.3. Preprocessing Stage

(1) Alice calls the key generation algorithm of the fully homomorphic encryption system to generate the system keys $(pk, sk)$.
(2) Alice computes the ciphertexts as follows: (i) $c_i = E_{pk}(x_i)$, where $1 \le i \le s$ (each feature of $x_i$ is encrypted separately).
(3) Alice constructs a matrix $I$ that contains the indices of the $k$-nearest neighbors of every element in $P$, i.e., every $I_{ij}$ in $I$ presents the index of the $j$-th nearest neighbor of $x_i$ in the minority class $P$.
(4) Alice discloses the public key $pk$, the ciphertexts $c_1, \ldots, c_s$, and the matrix $I$ on the network.
(5) Bob gets $pk$, $c_1, \ldots, c_s$, and $I$ published by Alice.

3.1.4. Processing Stage

(1) Bob generates two random integers $i$ and $l$, where $1 \le i \le m$ and $1 \le l \le k$, and sets $j = I_{il}$, so that $x_j$ is one of the $k$ nearest minority neighbors of $x_i$.
(2) Bob generates two random numbers $\delta$ and $r$, where $0 < \delta < 1$. Then, using the public key $pk$ and the encryption algorithm $E$, he computes the ciphertexts $E_{pk}(\delta)$ and $E_{pk}(r)$.
(3) Using both ciphertexts obtained in (2), Bob does the following operation to produce $c$: $c = c_i \oplus \big((c_j \ominus c_i) \otimes E_{pk}(\delta)\big) \oplus E_{pk}(r)$ (where $\ominus$ denotes homomorphic subtraction), i.e., an encryption of $x_i + \delta \times (x_j - x_i) + r$. Then he sends $c$ to Alice.
(4) Alice decrypts $c$ using the secret key $sk$, obtains $w = x_i + \delta \times (x_j - x_i) + r$, and sends $w$ to Bob.
(5) Bob gets the final result as follows: $x_{new} = w - r = x_i + \delta \times (x_j - x_i)$.
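
The following minimal sketch traces the data flow of PPSMOS end to end for one-dimensional samples, reusing toy (insecure) stand-ins for $E$, $D$, and the homomorphic operations in the spirit of Section 2.1; with real data, every feature would be processed in the same way, and all variable names and toy values are ours.

```python
import random

# Toy stand-ins for E_pk / D_sk and the homomorphic operations.
# The "ciphertexts" are plaintext floats, so this only illustrates the data flow.
E = lambda pk, m: m          # E_pk(m)
D = lambda sk, c: c          # D_sk(c)
hadd = lambda a, b: a + b    # homomorphic addition
hsub = lambda a, b: a - b    # homomorphic subtraction
hmul = lambda a, b: a * b    # homomorphic multiplication

# --- Alice's preprocessing (one feature per sample, minority samples first) ---
T = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9]   # minority class: first m = 4 samples
m, k, pk, sk = 4, 2, "pk", "sk"
C = [E(pk, x) for x in T]                 # published ciphertexts c_1, ..., c_s
# I[i] = indices of the k nearest minority neighbors of minority sample i
I = [[3, 1], [3, 0], [0, 3], [0, 1]]

# --- Bob's processing ---
i = random.randrange(m)                   # random minority index
j = random.choice(I[i])                   # one of its k nearest minority neighbors
delta, r = random.random(), random.uniform(-100, 100)
c = hadd(hadd(C[i], hmul(hsub(C[j], C[i]), E(pk, delta))), E(pk, r))
# c encrypts x_i + delta * (x_j - x_i) + r; Bob sends c to Alice.

# --- Alice decrypts the masked value and returns it ---
w = D(sk, c)

# --- Bob removes his mask r to obtain the synthetic minority sample ---
x_new = w - r
print(x_new)
```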

3.2. Security Analysis

Theorem 1. Under the assumption that the underlying fully homomorphic encryption scheme is secure, PPSMOS securely generates the minority samples in the semihonest model.

Proof. First, we analyze the situation where Alice is corrupted. In PPSMOS, Alice receives $c$ from Bob. Using the secret key, Alice is able to recover the plaintext $D_{sk}(c) = x_i + \delta \times (x_j - x_i) + r$. As $i$, $\delta$, and $r$ are random numbers, Alice does not have the ability to infer the matchup between $D_{sk}(c)$ and the samples in $T$. Furthermore, since $x_i + \delta \times (x_j - x_i)$ is confused by the random number $r$, Alice has no way of knowing Bob's newly generated point $x_{new}$. Hence, even if Alice is corrupted, Bob's output is isolated from Alice and, thus, secure.
Next, we analyze the case where Bob is corrupted. In the preprocessing stage, Bob gets the ciphertexts $c_1, \ldots, c_s$ and a matrix $I$, which are both disclosed by Alice. As the underlying homomorphic encryption scheme is secure in the semihonest model, Bob will not be able to infer any information regarding Alice's private input $T$ from the ciphertexts. As $I_{ij}$ only presents the index of the $j$-th nearest neighbor of $x_i$ in the minority class $P$, Bob is unable to get any information about the specific points of $P$ through $I$. Therefore, even if Bob is corrupted, Alice's private information is still secure and undisclosed from Bob.
Thus, we can deduce that Theorem 1 holds.

4. Borderline Privacy-Preserving Synthetic Minority Oversampling Protocol

In this section, we present our Borderline Privacy-Preserving Synthetic Minority Oversampling Protocol (Borderline-PPSMOS) and analyze its security aspect.

Suppose that Alice has a minority class $P = \{p_1, p_2, \ldots, p_m\}$, where $p_i \in \mathbb{R}^n$. Bob has a majority class $N = \{q_1, q_2, \ldots, q_t\}$, where $q_j \in \mathbb{R}^n$. Both Alice and Bob wish to generate a minority sample based on $P$ and $N$. After the protocol, Bob gets the output $x_{new}$. Meanwhile, Alice cannot know any information about $N$ and $x_{new}$, and Bob cannot know any information about $P$.

4.1. Borderline-PPSMOS
4.1.1. Input

Alice inputs $P = \{p_1, p_2, \ldots, p_m\}$, where $p_i \in \mathbb{R}^n$. Bob inputs $N = \{q_1, q_2, \ldots, q_t\}$, where $q_j \in \mathbb{R}^n$.

4.1.2. Output

Alice gets nothing. Bob obtains a newly synthesized minority sample $x_{new}$.

4.1.3. Preprocessing Stage

(1) Alice generates the key pair $(pk, sk)$ of the fully homomorphic encryption system.
(2) Alice computes the ciphertexts as follows: (i) $c_i = E_{pk}(p_i)$, where $1 \le i \le m$ (each feature of $p_i$ is encrypted separately).
(3) Alice constructs a matrix $D$, where the element $d_{ij}$ in $D$ represents the square of the Euclidean distance between the point $p_i$ and the point $p_j$ in the minority class $P$.
(4) Alice encrypts every element in $D$ and obtains $E_{pk}(D)$.
(5) Alice discloses the public key $pk$, the ciphertexts $c_1, \ldots, c_m$, and $E_{pk}(D)$ on the network.
(6) Bob gets $pk$, $c_1, \ldots, c_m$, and $E_{pk}(D)$, which were published by Alice.
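
As an illustration of Alice's preprocessing, the snippet below computes the plaintext matrix $D$ of squared intra-minority Euclidean distances described in step (3); the encryption of the samples and of $D$ is omitted, and the toy data and names are ours.

```python
import numpy as np

# Alice's minority class P: m samples with n features each.
P = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

# D[i][j] = squared Euclidean distance between p_i and p_j
diff = P[:, None, :] - P[None, :, :]
D = np.sum(diff ** 2, axis=2)

# Alice would then encrypt P feature by feature and every entry of D with E_pk
# before publishing them, as described in steps (2), (4), and (5) above.
print(D)
```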

4.1.4. Processing Stage

(1) Bob computes the ciphertexts of his majority samples using the public key $pk$ and the encryption algorithm $E$: (i) $e_j = E_{pk}(q_j)$, where $1 \le j \le t$.
(2) For every ciphertext $c_i$ published by Alice, Bob calculates the ciphertext of the square of the Euclidean distance between $p_i$ and the elements in $N$: (i) $b_{ij} = E_{pk}(\lVert p_i - q_j \rVert^2)$, computed homomorphically from $c_i$ and $e_j$, where $1 \le i \le m$ and $1 \le j \le t$; this yields an encrypted $m \times t$ distance matrix $B$.
(3) Bob generates $m$ random numbers $r_1, \ldots, r_m$. Then, he obtains the ciphertexts $E_{pk}(r_1), \ldots, E_{pk}(r_m)$, using the public key $pk$ and the encryption algorithm $E$, and uses $E_{pk}(r_i)$ to confuse the distance ciphertexts of the $i$-th row computed in step (2).
(4) Bob connects the confused distance ciphertexts to the encryption matrix $E_{pk}(D)$ to form a new matrix $C$, whose first $t$ columns correspond to the majority samples in $N$ and whose last $m$ columns correspond to the minority samples in $P$.
(5) Bob performs row and column confusion on $C$ to obtain the confused matrix $C'$. Then he sends $C'$ to Alice.
(6) Alice receives $C'$ and decrypts it with the private key $sk$ to obtain the matrix $M$.
(7) For every row in $M$, Alice computes the $k$ smallest values and denotes the positions of these elements in the matrix $V$. Then, Alice sends $V$ to Bob.
(8) Bob performs the inverse obfuscation on the matrix $V$ to get the matrix $Q$.
(9) For the $i$-th row in $Q$, where $1 \le i \le m$, Bob counts the number $m'$ of the elements smaller than $t + 1$, i.e., the number of majority samples among the $k$ nearest neighbors of $p_i$. Similar to that in Borderline-SMOTE, we call the point $p_i$ DANGER if $k/2 \le m' < k$.
(10) Bob randomly selects a point $p_i$ from the DANGER class and randomly selects an element $q$, which is greater than $t$, from the $i$-th row of $Q$; the corresponding minority neighbor is $p_j$ with $j = q - t$.
(11) Bob generates two random numbers $\delta$ and $r$, where $0 < \delta < 1$. Next, he generates the ciphertexts $E_{pk}(\delta)$ and $E_{pk}(r)$ by using the public key $pk$ and the encryption algorithm $E$.
(12) Bob performs an operation using the ciphertexts $c_i$, $c_j$, $E_{pk}(\delta)$, and $E_{pk}(r)$ to obtain $c$ as follows: $c = c_i \oplus \big((c_j \ominus c_i) \otimes E_{pk}(\delta)\big) \oplus E_{pk}(r)$, an encryption of $p_i + \delta \times (p_j - p_i) + r$. Then he sends $c$ to Alice.
(13) Alice decrypts $c$ using the secret key $sk$ and obtains $w = p_i + \delta \times (p_j - p_i) + r$. She then proceeds to send $w$ to Bob.
(14) Bob gets the final result as follows: $x_{new} = w - r = p_i + \delta \times (p_j - p_i)$.
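
The plaintext sketch below illustrates the neighbor counting behind steps (7)–(10): for every minority point, the $k$ nearest neighbors are read off the combined distance matrix (majority columns first, minority columns after), the majority neighbors among them are counted as $m'$, and the point is marked DANGER when $k/2 \le m' < k$. In the protocol these quantities are only ever handled in encrypted, masked, or permuted form; the toy data and variable names here are ours.

```python
import numpy as np

k = 3
# Combined per-row distances for each minority point p_i:
# columns 0..t-1   -> squared distances to Bob's t majority points (from B)
# columns t..t+m-1 -> squared distances to Alice's m minority points (from D)
t, m = 4, 3
B = np.array([[0.5, 0.7, 6.0, 7.0],     # p_0 is close to two majority points
              [5.0, 6.0, 5.5, 7.5],     # p_1 sits safely inside the minority class
              [0.2, 0.3, 0.4, 0.6]])    # p_2 is surrounded by majority points
D = np.array([[0.0, 0.1, 4.0],
              [0.1, 0.0, 4.1],
              [4.0, 4.1, 0.0]])
C = np.hstack([B, D])
np.fill_diagonal(C[:, t:], np.inf)      # ignore each point's zero self-distance

danger = []
for i in range(m):
    nearest = np.argsort(C[i])[:k]      # positions of the k smallest distances
    m_prime = int(np.sum(nearest < t))  # how many of them are majority points
    if k / 2 <= m_prime < k:            # borderline ("DANGER") condition
        danger.append(i)
print("DANGER minority points:", danger)   # only p_0 is borderline here
```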

4.2. Security Analysis

Theorem 2. Under the assumption that the underlying fully homomorphic encryption scheme is secure, Borderline-PPSMOS securely generates the minority sample in the semihonest model.

Proof. First, we analyze the situation where Alice is corrupted. In Borderline-PPSMOS, Alice receives $C'$ from Bob. Alice is able to recover the plaintext matrix using the private key. However, since Bob obtained $C'$ by applying the row and column confusion operation on $C$, Alice will not be able to infer the true rank order of $C$. Furthermore, as the distances involving Bob's samples are confused by using $r_1, \ldots, r_m$, where the $r_i$ are random numbers, Alice will not be able to know the information of the set $N$ owned by Bob.
Also, when Alice receives $c$ from Bob, she can recover the plaintext $D_{sk}(c) = p_i + \delta \times (p_j - p_i) + r$. However, as $\delta$ and $r$ are random numbers, Alice does not have the ability to infer the matchup between $D_{sk}(c)$ and her samples. In addition, as $p_i + \delta \times (p_j - p_i)$ is confused by the random number $r$, Alice has no way of knowing Bob's newly generated point $x_{new}$. Hence, even if Alice is corrupted, Bob's input and output are totally isolated from Alice and remain secure.
Next, we analyze the case where Bob is corrupted. In the preprocessing stage, Bob gets the ciphertexts $c_1, \ldots, c_m$, the encrypted matrix $E_{pk}(D)$, and the public key $pk$. As the underlying homomorphic encryption scheme is secure in the semihonest model, Bob is unable to infer any information regarding Alice's private input $P$ from $c_1, \ldots, c_m$ and $E_{pk}(D)$. In the processing stage, Alice computes the $k$ smallest values of every row and denotes the positions of these elements in the matrix $V$. During this step, Alice only sends the location indices to Bob, which do not reveal any information about Alice's input. Next, Bob gets $w$ from Alice. As $w$ is only the masked synthetic sample $x_{new} + r$, Bob cannot infer the individual samples $p_i$ and $p_j$ of $P$ from $w$. Thus, even if Bob is corrupted, Alice's private information is still secure and undisclosed from Bob.
Therefore, we can deduce that Borderline-PPSMOS is secure in the semihonest model under the assumption that the underlying fully homomorphic encryption scheme is secure, i.e., Theorem 2 holds.

5. Performance Analysis

In this section, we present the performance analysis of both PPSMOS and Borderline-PPSMOS. For the efficiency analysis, we look at the computational complexity and the communication complexity. Given that $s$ is the size of Alice's input in PPSMOS, $n$ is the feature size, $t$ is the size of the majority class, $m$ is the size of the minority class, and $k$ is the nearest-neighbor parameter, we analyze the protocols' performance as follows.

First, we analyze the performance of PPSMOS. During the preprocessing stage, Alice performs the encryption operation $s \times n$ times (one encryption per feature of every sample), while Bob gets $s \times n$ ciphertexts. Furthermore, in the processing stage, Bob performs the encryption operation 2 times and $O(n)$ homomorphic additive and multiplicative operations (a constant number per feature). Alice then performs the decryption operation $n$ times. Bob sends $n$ ciphertexts to Alice.

Secondly, we analyze the efficiency of Borderline-PPSMOS. In the preprocessing stage, Alice performs the encryption operation roughly $m \times n + m^2$ times (her $m$ samples feature by feature, plus the distance matrix $D$), while Bob gets the corresponding ciphertexts. Next, in the processing stage, Bob performs the encryption operation roughly $t \times n + m + 2$ times and $O(m \times t \times n)$ homomorphic additive and multiplicative operations to evaluate the encrypted distance matrix. Alice performs the decryption operation roughly $m \times (t + m) + n$ times. Finally, Bob transfers roughly $m \times (t + m) + n$ ciphertexts to Alice.
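
As a rough illustration of how the communication of the two processing stages scales under the accounting above (about $n$ ciphertexts from Bob to Alice in PPSMOS versus about $m(t+m)+n$ in Borderline-PPSMOS), the arithmetic below plugs in example parameter values chosen by us; the numbers are illustrative only, not measured results.

```python
# Example parameters (ours): feature size n, minority size m, majority size t.
n, m, t = 20, 200, 2000

# Ciphertexts Bob sends to Alice in the processing stage,
# under the rough accounting given in the text above.
ppsmos_comm = n                      # one masked ciphertext per feature
borderline_comm = m * (t + m) + n    # the permuted distance matrix plus the masked sample

print(f"PPSMOS:            ~{ppsmos_comm} ciphertexts")
print(f"Borderline-PPSMOS: ~{borderline_comm} ciphertexts")
```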

We summarise the computational complexity and communication complexity of both protocols below in Table 1.

We visualize the operational efficiency of both PPSMOS and Borderline-PPSMOS during the processing stage by instantiating the parameters, as shown below in Figures 1–4.

From these figures, we can conclude the following: (1) the computational complexity and communication complexity of PPSMOS are lower than those of Borderline-PPSMOS; (2) the computational complexity and communication complexity of PPSMOS in the processing stage depend only on the feature size $n$; (3) the computational complexity and communication complexity of Borderline-PPSMOS depend on almost all of the parameters, i.e., $n$, $m$, and $t$; (4) Borderline-SMOTE achieves a better TP rate and F-Value than SMOTE [22]. Furthermore, as our privacy-preserving schemes do not affect the TP rate and F-Value of the underlying minority oversampling techniques, we can further deduce that Borderline-PPSMOS achieves a better TP rate and F-Value than PPSMOS.

6. Conclusion

In this paper, we propose two novel privacy-preserving oversampling protocols, PPSMOS and Borderline-PPSMOS, which aim to address the imbalanced dataset issue while preserving the privacy of the participants' inputs and outputs.

PPSMOS works in a setting where the client inputs no majority examples, as opposed to Borderline-PPSMOS, where the client holds some majority examples. Both PPSMOS and Borderline-PPSMOS are secure in the semihonest model. This means that both methods are suitable for the preprocessing stage of machine learning and applicable to any case where synthesizing minority examples in a privacy-preserving manner is needed. Our results show that PPSMOS is more efficient than Borderline-PPSMOS in general, while Borderline-PPSMOS achieves a better TP rate and F-Value than PPSMOS.

Since our protocols are designed in the semihonest model, they cannot resist malicious attacks, and our analysis also shows that their efficiency can be further improved. As future work, we will continue to improve these two aspects and to design better privacy-preserving protocols for the preprocessing stage of machine learning.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by Project of High-level Teachers in Beijing Municipal Universities in the Period of 13th Five-year Plan (CIT&TCD201904097) and The Fundamental Research Funds for Beijing Local Universities From Capital University of Economics and Business.