Special Issue: Artificial Intelligence for Cyberspace Security
Research Article | Open Access
Joon Soo Yoo, Ji Won Yoon, "t-BMPNet: Trainable Bitwise Multilayer Perceptron Neural Network over Fully Homomorphic Encryption Scheme", Security and Communication Networks, vol. 2021, Article ID 7621260, 19 pages, 2021. https://doi.org/10.1155/2021/7621260
t-BMPNet: Trainable Bitwise Multilayer Perceptron Neural Network over Fully Homomorphic Encryption Scheme
Homomorphic encryption (HE) is notable for enabling computation on encrypted data while guaranteeing a high level of security based on the hardness of lattice problems. This advantage has facilitated research on data analysis in an encrypted state, with the purpose of achieving security and privacy for both clients and the cloud. However, much of the literature centers on building a network that only provides an encrypted prediction result rather than constructing a system that can learn from the encrypted data to provide more accurate answers for the clients. Moreover, these works use simple polynomial approximations to design the activation function, which can cause a significant error in prediction results. Our approach is more fundamental: we present t-BMPNet, a neural network over a fully homomorphic encryption scheme that is built upon primitive gates and fundamental bitwise homomorphic operations. Our model can therefore tackle the nonlinearity problem of approximating the activation function in a more sophisticated way. Moreover, we show that t-BMPNet can perform training (both the backpropagation and feedforward algorithms) in the encrypted domain, unlike the existing literature. Last, we apply our approach to a small dataset to demonstrate the feasibility of our model.
Homomorphic encryption is a cryptographic scheme that has long been considered a holy grail by cryptographers. Its intriguing property is that any function on plaintexts can be evaluated in the encrypted domain while maintaining the same functionality. Moreover, homomorphic encryption based on the learning with errors (LWE) problem ensures a high level of security even in the postquantum computing environment. Thus, by using homomorphic encryption, one can construct an environment that is both secure and private, since no information about the data is leaked to an adversary.
In recent years, some global companies and institutions have strived to construct a system to provide secure and privacy-preserving services to the clients. In the system, the client sends the encrypted data along with its query to the cloud company; the client’s encrypted data are evaluated in a “magic box” to output an encrypted result. Next, the cloud sends back the result, and the client decrypts the encrypted output using its secret key to obtain the desired result. Homomorphic encryption enables constructing the “magic box,” in which the encrypted data are being processed without revealing any information to the cloud. Moreover, in some homomorphic encryption schemes, the client is unable to retrieve any information about the design of the circuit from the cloud. Therefore, homomorphic encryption can provide privacy for both the cloud and the client.
The building blocks of our model (t-BMPNet) are homomorphic encryption and neural networks. In particular, we are interested in designing a multilayer perceptron neural network (MLPNN), a foundational model of deep learning, under a fully homomorphic encryption (FHE) scheme. Our goal is to construct a model that can learn from the clients’ encrypted data to update the parameters in the network and provide accurate results for the clients. Specifically, we use the Boolean circuit approach in the FHE scheme; plaintexts are encrypted on a bit-by-bit basis and computations are expressed as Boolean circuits. Our code is freely available at https://github.com/joonsooyoo/t-bmpnet.
1.2. Related Works and Some Problems
In general, most of the works [2, 3] related to the “training” of neural networks such as the MLPNN and the convolutional neural network (CNN) are performed in a nonencrypted state (Figure 1(a)). As a representative example, Gilad-Bachrach et al. proposed CryptoNets, which is based on the modular arithmetic FHE scheme to construct a CNN model. In this model, a client uses a public-key homomorphic cryptosystem to encrypt data under her public key and sends it to the server, which holds an already pretrained network. The server returns the evaluated result to the client, and the client decrypts the prediction result with her secret key. Thus, the limitation is clear: the model can only provide prediction results through a pretrained neural network that performs only the feedforward algorithm. Likewise, CryptoDL falls in this line of research, where the training is executed in advance on unencrypted data.
The construction of a nonlinear sigmoid function is a critical problem in designing a neural network. However, much of the literature approaches this problem with a rather simple idea: approximating the sigmoid function with polynomials. This is because these works are mainly based on two homomorphic binary operations, addition and multiplication. Mohassel and Zhang used a Taylor series to approximate the sigmoid function in logistic regression. Kim et al. improved the accuracy of the polynomial approximation by using the least squares method. Polynomial approximation works for some nonlinear functions; however, it cannot be applied to arbitrary nonlinear functions in every circumstance, and it does not guarantee high accuracy. A more general approach is therefore needed.
Work that includes training in the encrypted phase was proposed by Phong et al. [4, 5]. They demonstrated training a multilayer perceptron neural network with a partially homomorphic operation, namely, addition (Figure 1(b)). The model learns from each client’s data through gradient vectors that the client supplies and that are homomorphically added to the global weight parameters. The server interacts with clients multiple times to improve the optimization of the parameters, and later, these parameters are distributed to each client. The scheme is suitable for a model that assumes the client to be an honest entity such as a big organization; however, it is vulnerable to a semihonest or malicious client, who can violate the circuit privacy of the server. Moreover, the calculation of gradient vectors places an additional burden on the client.
From this point of view, the key contributions of this study are summarized as follows:
(i) We propose a novel approach of implementing the most accurate homomorphic sigmoid function among the existing literature from the basis of bitwise operations
(ii) We present t-BMPNet, a general framework for designing a multilayer perceptron neural network over a fully homomorphic encryption scheme that performs training in the encrypted domain
(iii) We propose a trainable FHE neural network under the minimum interaction between the client and the server compared to other FHE trainable neural networks
(iv) Our approach broadens the horizon of feasibility in an application to various deep learning studies that require a secure cloud computing model
2.1. Homomorphic Encryption
To discuss a simple notion of homomorphic encryption, we consider two messages $m_1$ and $m_2$ from a message space $\mathcal{M}$ and their corresponding ciphertexts $c_1$ and $c_2$. We say that the encryption scheme is additively homomorphic if there exists an operation $\oplus$ such that $\mathrm{Dec}(c_1 \oplus c_2) = m_1 + m_2$, where $\oplus$ is not necessarily the same as the addition in the plaintext. Generally, the operation $\oplus$ is more complex and requires more computation than normal addition. Likewise, we can consider multiplicative homomorphism in the same way.
We refer to fully homomorphic encryption (FHE) when both of the algebraic operations are supported. The idea of FHE was first proposed by Rivest et al. in 1978, but it remained unrealized for nearly 30 years; in 2005, the first attempt was realized by Boneh et al. They constructed a somewhat homomorphic encryption scheme (SWHE) that allows numerous additions, with the caveat that only a single multiplication can be performed. Before then, public-key cryptosystems such as RSA and Paillier offered partial homomorphisms that allow only a single algebraic operation, namely multiplication and addition, respectively. Typically, a homomorphic encryption scheme follows a structure of four basic steps: key generation, encryption, decryption, and evaluation. Formally, the definition of HE is as follows.
Definition (public-key homomorphic encryption): a homomorphic encryption scheme is a tuple of probabilistic polynomial-time algorithms that involves the following four stages.
(i) Key generation: the algorithm takes a security parameter $\lambda$ as an input and outputs a secret key $sk$, a public key $pk$, and an evaluation key $evk$. We write $\mathrm{KeyGen}(\lambda) \rightarrow (sk, pk, evk)$.
(ii) Encryption: the algorithm takes a single message bit $m$ from the message space $\mathcal{M}$ and a public key $pk$ to output a ciphertext $c$ of a single bit. We write $\mathrm{Enc}(pk, m) \rightarrow c$.
(iii) Decryption: the algorithm takes a secret key $sk$ and its corresponding ciphertext $c$ and outputs a message $m$. We write $\mathrm{Dec}(sk, c) \rightarrow m$.
(iv) Evaluation: the algorithm takes $n$ ciphertext bits $c_1, \ldots, c_n$ with a function $f$ to output a ciphertext $c_f$ using an evaluation circuit Eval. We write $\mathrm{Eval}(evk, f, c_1, \ldots, c_n) \rightarrow c_f$. Also, the evaluation circuit Eval needs to satisfy $\mathrm{Dec}(sk, c_f) = f(m_1, \ldots, m_n)$.
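As a minimal sketch of the four stages, the following toy example uses textbook RSA, which the text above mentions as a partially (multiplicatively) homomorphic cryptosystem. It is not the scheme used in this work: the tiny primes, the function names `enc`, `dec`, and `eval_mul` are illustrative assumptions, and textbook RSA is deterministic and insecure at this size.

```python
# Toy illustration only: textbook RSA with tiny primes is NOT secure.
# KeyGen: fixed small primes stand in for a real key generation step.
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent
d = pow(e, -1, phi)            # secret exponent (modular inverse, Python 3.8+)


def enc(m: int) -> int:
    """Enc(pk, m): encrypt an integer message 0 <= m < n."""
    return pow(m, e, n)


def dec(c: int) -> int:
    """Dec(sk, c): decrypt a ciphertext."""
    return pow(c, d, n)


def eval_mul(c1: int, c2: int) -> int:
    """Eval for f(m1, m2) = m1 * m2: touches ciphertexts only."""
    return (c1 * c2) % n


m1, m2 = 7, 3
c_prod = eval_mul(enc(m1), enc(m2))
print(dec(c_prod))  # 21
```

The check `dec(eval_mul(enc(m1), enc(m2))) == m1 * m2` is exactly the correctness condition in stage (iv), restricted to a single multiplication and to products smaller than `n`.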
In 2009, Gentry and Boneh made a significant breakthrough in FHE whose contribution can be divided into two pieces: leveled HE (LHE for short) and bootstrapping. LHE supports both addition and multiplication, but the number of operations is limited. The reason is that noise accumulates with each evaluation of a Boolean circuit. After some number of evaluations (depending on the designated noise parameter), the accumulated noise means that the ciphertext is no longer guaranteed to decrypt correctly. The bootstrapping procedure handles this problem by reducing the noise after a “well-calculated” number of operations on the ciphertext, yielding a “fresh” ciphertext.
Various FHE schemes have been actively proposed and improved based on the work of Gentry et al. in 2013 . These FHE schemes provide basic homomorphic addition and multiplication in common; however, they have several different features; it is important to select the right scheme for implementation considering different aspects of the schemes. In general, FHE schemes can be classified into three different categories—Boolean circuits, modular arithmetic, and approximate number arithmetic.
Note that each encryption scheme shows different performance in terms of precision, accuracy, and throughput. We briefly describe each scheme to support the experimental section, which compares our scheme with existing works that are built on different FHE schemes. In particular, we give more detail and emphasis to the Boolean circuit method since our work is based on this approach.
2.1.1. Modular Arithmetic Approach (BFV)
The modular arithmetic approach is generally used without bootstrapping (leveled HE) and thus performs a fast evaluation of ciphertexts. It is based on integer arithmetic and provides the exact result after decryption. This approach is efficient for SIMD computations over vectors of integers and supports fast scalar multiplication. In this study, we describe an encryption scheme proposed by Bos et al., which is one of the modular arithmetic approaches and is the underlying encryption model for CryptoNets.
The encryption model maps a plaintext message from a ring $R_t^n := \mathbb{Z}_t[x]/(x^n+1)$ to a ciphertext in a ring $R_q^n := \mathbb{Z}_q[x]/(x^n+1)$. Observe that the rings have integer coefficients modulo $t$ and $q$, respectively. For a secret key polynomial $f$, we choose two random polynomials $f'$ and $g$ in $R_q^n$, such that $f$ satisfies the equation $f = t f' + 1$ as the first criterion for choosing a secret key. Next, we check whether $f$ has an inverse, which is needed for the public key $h = t g f^{-1}$; otherwise, we discard $f$ and repeat these steps until we obtain keys that satisfy the criterion.
Now, we encrypt a message $m$ under the public key $h$:
$$c = \left[\left\lfloor \frac{q}{t} \right\rfloor m + e + h s\right]_q,$$
where $e$, $s$ are random noise polynomials and $[\cdot]_q$ is the reduction of the coefficients modulo $q$ to the symmetric interval around 0 of length $q$. The decryption process simply takes the ciphertext and performs a multiplication followed by rounding and modular operations: $m = \left\lfloor \frac{t}{q} [f c]_q \right\rceil \bmod t$.
For the addition of two messages $m_1$ and $m_2$, it suffices to add the two corresponding ciphertexts $c_1$ and $c_2$; one can verify this by decrypting $c_1 + c_2$ and showing that the result matches $m_1 + m_2$. In the case of multiplication, however, an additional step, the relinearization process, is required so that the result remains decryptable under the secret polynomial $f$ (not $f^2$) after the multiplication of two ciphertexts $c_1$ and $c_2$.
One last note is that this modular encryption scheme takes only integers as input values; it cannot handle floating-point arithmetic. As we discuss in later sections, this feature prevents the calculation of a nonlinear sigmoid function by approximating polynomials, since their coefficients are real numbers. Thus, when designing a neural network (e.g., CryptoNets) with the modular arithmetic approach, the calculation of the sigmoid function is a major obstacle.
2.1.2. Approximate Number Arithmetic Approach (CKKS)
The approximate number arithmetic method is the most recently published of these FHE schemes; it was proposed by Cheon et al. in 2017 and is considered a nearly practical HE scheme for computing over real data. It can perform efficient SIMD computations over vectors of real numbers using batching and evaluates polynomial approximations quickly. It is effective in deep approximate computations such as logistic regression learning and is often used without bootstrapping (i.e., leveled HE). One distinctive feature is that the result of the encryption model is approximate; the result includes an error that was generated by the encryption process. Thus, contrary to the modular arithmetic approach, it provides an approximate (not exact) result.
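One notable feature that distinguishes HEAAN from other HE schemes is its encoding (and decoding) technique, which maps a vector of complex (or real) numbers to the plaintext message space of a polynomial ring $\mathcal{R} = \mathbb{Z}[X]/(X^N+1)$ via an isomorphic mapping $\phi: \mathbb{R}[X]/(X^N+1) \rightarrow \mathbb{C}^{N/2}$. A plaintext vector $\mathbf{z}$ can then be encoded as a plaintext polynomial $m$ by computing $m = \lfloor \Delta \cdot \phi^{-1}(\mathbf{z}) \rceil$, where $\Delta$ is a scaling factor.
We omit the detailed process of the model, but provide a simplified description of the whole process for comparison with other approaches. First, the parameter generation process outputs the ring dimension $N$, a modulus $q$, and a discrete Gaussian distribution $\chi$, selected according to a security parameter $\lambda$. KeyGen yields $sk = (1, s)$ and $pk = (b, a)$, where $s$ is a secret polynomial whose coefficients are randomly selected from a sparse distribution, $a$ is a polynomial sampled uniformly at random from $\mathcal{R}_q$, and $b = -as + e \pmod{q}$ with $e$ sampled from $\chi$. Also, $evk$ is built from polynomials $a'$, $e'$, and $b'$, such that $a'$ is uniform in $\mathcal{R}_{Pq}$, $e'$ is sampled from $\chi$, and $b' = -a's + e' + Ps^2$, which are combined to output $evk = (b', a') \pmod{Pq}$, where $P$ is an auxiliary modulus.
Enc takes $m$ and outputs $c = v \cdot pk + (m + e_0, e_1) \pmod{q}$, where $v$ is a random polynomial with small coefficients and $e_0$, $e_1$ are polynomials whose coefficients are randomly chosen from $\chi$. For Dec of $c = (b, a)$, we compute $m' = b + a s \pmod{q}$. For the addition of two ciphertexts $c_1$ and $c_2$, we simply compute $c_{\mathrm{add}} = c_1 + c_2 \pmod{q}$. Last, for the multiplication of two ciphertexts $c_1 = (b_1, a_1)$ and $c_2 = (b_2, a_2)$, we let $(d_0, d_1, d_2) = (b_1 b_2,\ a_1 b_2 + a_2 b_1,\ a_1 a_2) \pmod{q}$. Then, $c_{\mathrm{mult}} = (d_0, d_1) + \lfloor P^{-1} \cdot d_2 \cdot evk \rceil \pmod{q}$.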
As discussed in a later section, CKKS is not well suited to evaluating complex circuits, particularly a neural network with a nonlinear activation function. This is because CKKS primarily relies on addition and multiplication, so nonlinear functions inevitably require polynomial approximation. In contrast, our approach (bitwise operation) benefits from atomic gate operations, resulting in better accuracy and a smaller multiplicative depth.
2.1.3. Boolean Circuit Approach (TFHE)
Unlike the previous two approaches, modular and approximate, the Boolean circuit method treats a plaintext as bits and evaluates an arbitrary function by a sequence of Boolean gates. Following the proposal of Gentry et al. in 2013, the FHEW and TFHE schemes are the most prominent instances of this approach. In particular, TFHE significantly improves on FHEW with faster bootstrapping of the noise generated by Boolean gates, which makes it a nearly practical scheme. Specifically, the execution time of a binary gate (AND, OR, and NAND) is about 13 milliseconds of single-core time, an improvement by a factor of 53 compared to FHEW. Therefore, throughout the study, our work uses the TFHE scheme for implementation; note that our work is not limited to the TFHE scheme and can also be realized with other Boolean circuit methods.
Our work considers two factors: (1) encrypted bits and (2) the evaluation of a Boolean circuit that works on encrypted bits. We explain in detail how to design a Boolean circuit (e.g., a neural network) that operates on encrypted bits in a later section of the study. We emphasize that our approach is very different from the previous approaches, which evaluate integer or floating-point arithmetic circuits using basic operations such as HE addition and multiplication. These works most often use a leveled version of FHE for the evaluation of circuits and therefore cannot evaluate circuits of large depth, although they are almost always faster than the Boolean circuit method on shallow circuits. Since our work is based on the TFHE scheme, we can evaluate a circuit of arbitrary depth, because the bootstrapping procedure runs after the evaluation of each bootstrapping gate.
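We briefly explain the basics of TFHE to bridge the understanding of its underlying work and our approach. TFHE operates over the real torus $\mathbb{T} = \mathbb{R}/\mathbb{Z}$, that is, the set of real numbers mod 1. Notice that $\mathbb{T}$ is an additive Abelian group; however, it is not a ring (it is not closed under multiplication). Instead, $\mathbb{T}$ is a $\mathbb{Z}$-module, which enables a mapping $\mathbb{Z} \times \mathbb{T} \rightarrow \mathbb{T}$ (multiplication of a torus element by an integer) with the usual module properties. Based on the torus $\mathbb{T}$, we can encrypt a message $\mu \in \mathbb{T}$ as $(\mathbf{a}, b)$ with $b = \mathbf{s} \cdot \mathbf{a} + \mu + e$, where $\mathbf{s} \in \{0,1\}^n$ is a secret key, $\mathbf{a}$ is sampled uniformly at random from $\mathbb{T}^n$, and $e$ is sampled from $\chi$, where $\chi$ is a sub-Gaussian distribution. One can retrieve the original message by computing the phase $b - \mathbf{s} \cdot \mathbf{a}$ and rounding it to the nearest message in the message space.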
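The encrypt, phase, and round steps above can be sketched in a few lines. This is a toy model only, not the TFHE library: the dimension `N`, the noise level `NOISE_STD`, the message grid of multiples of 1/8, and all function names are assumptions chosen for illustration, and floating-point numbers stand in for torus elements.

```python
import random

N = 16            # hypothetical LWE dimension (far too small to be secure)
NOISE_STD = 1e-3  # hypothetical noise standard deviation
MSG_SLOTS = 8     # messages are multiples of 1/8 on the torus [0, 1)

rng = random.Random(0)
secret = [rng.randint(0, 1) for _ in range(N)]   # binary secret key s


def encrypt(mu: float):
    """Return (a, b) with b = <s, a> + mu + e (mod 1)."""
    a = [rng.random() for _ in range(N)]          # uniform mask on the torus
    e = rng.gauss(0.0, NOISE_STD)                 # small Gaussian noise
    b = (sum(ai * si for ai, si in zip(a, secret)) + mu + e) % 1.0
    return a, b


def decrypt(ct) -> float:
    """Compute the phase b - <s, a> and round to the nearest multiple of 1/8."""
    a, b = ct
    phase = (b - sum(ai * si for ai, si in zip(a, secret))) % 1.0
    return (round(phase * MSG_SLOTS) % MSG_SLOTS) / MSG_SLOTS


def add(ct1, ct2):
    """Component-wise addition of samples adds the underlying messages."""
    (a1, b1), (a2, b2) = ct1, ct2
    return [(x + y) % 1.0 for x, y in zip(a1, a2)], (b1 + b2) % 1.0


print(decrypt(encrypt(3 / 8)))  # 0.375
```

Because the noise terms also add up under `add`, repeated operations would eventually push the phase past the rounding boundary, which is the motivation for bootstrapping.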
The novelty of TFHE lies in the bootstrapping procedure that refreshes the ciphertext’s noise (the step that takes most of the execution time), which it achieves by defining an external product between TGSW and T(R)LWE samples, $\boxdot$: TGSW $\times$ TLWE $\rightarrow$ TLWE, where the TGSW and T(R)LWE messages are polynomials in $\mathbb{Z}_N[X]$ and $\mathbb{T}_N[X]$, respectively. In short, with the newly defined external product, TFHE improves the running time of bootstrapping, which consists of a series of procedures: KeySwitch $\circ$ SampleExtract $\circ$ BlindRotate (we refer interested readers to [17, 18] for more details).
We can construct bootstrapping FHE Boolean gates using the previous procedures. For example, the AND bootstrapping gate between TLWE samples $c_1$ and $c_2$ can be constructed by $(\mathbf{0}, -\tfrac{1}{8}) + c_1 + c_2$, followed by the bootstrapping procedure. Other FHE Boolean gates such as OR, XOR, and NAND are designed likewise. In practice, TFHE encrypts a message from the message space $\{-\tfrac{1}{8}, \tfrac{1}{8}\}$. Note that $-\tfrac{1}{8}$ and $\tfrac{1}{8}$ correspond to 0 and 1, respectively, in the integer space $\{0, 1\}$ used for Boolean arithmetic.
Note that the detailed process of TFHE is not required to understand our work; what matters is how the TFHE approach differs from the other methods. Based on the bootstrapping FHE gates provided by TFHE (10 binary gates such as NAND, NOT, AND, and OR), we can combine encrypted bits and gates to construct an arbitrary circuit for evaluation (in our work, the neural network).
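To illustrate why a linear combination followed by bootstrapping yields a Boolean gate, the sketch below works directly on noiseless torus phases, with bits encoded as ±1/8. The gate constants follow the convention used by the TFHE library as we understand it; the `bootstrap` function here is only a stand-in that resets the phase by its sign and performs no real bootstrapping, and all names are illustrative assumptions.

```python
from fractions import Fraction

EIGHTH = Fraction(1, 8)


def encode(bit: int) -> Fraction:
    """Map a bit to its torus message: 0 -> -1/8, 1 -> +1/8."""
    return EIGHTH if bit else -EIGHTH


def centered(x: Fraction) -> Fraction:
    """Representative of x mod 1 in the interval (-1/2, 1/2]."""
    x = x % 1
    return x - 1 if x > Fraction(1, 2) else x


def bootstrap(phase: Fraction) -> Fraction:
    """Stand-in for bootstrapping: reset the phase to +-1/8 by its sign."""
    return EIGHTH if centered(phase) > 0 else -EIGHTH


def gate_and(p1, p2):
    return bootstrap(-EIGHTH + p1 + p2)


def gate_or(p1, p2):
    return bootstrap(EIGHTH + p1 + p2)


def gate_nand(p1, p2):
    return bootstrap(EIGHTH - p1 - p2)


def gate_xor(p1, p2):
    return bootstrap(Fraction(1, 4) + 2 * (p1 + p2))


def decode(phase: Fraction) -> int:
    return 1 if phase > 0 else 0


for x in (0, 1):
    for y in (0, 1):
        print(x, y, decode(gate_and(encode(x), encode(y))))
```

In the real scheme the same linear combination is applied to the LWE samples rather than to the phases, so the server never sees the sign that `bootstrap` inspects here.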
2.2. Bitwise Operations
We start from the premise that any FHE evaluation function can, in theory, be built from universal homomorphic gates. Some FHE schemes [13, 16, 18] allow the construction of FHE Boolean gates such as AND and OR that support bootstrapping of noise after each operation. Therefore, we use the FHE Boolean gates from the scheme and construct more complicated evaluation functions based on bitwise operations. However, unlike operations on the plaintext, evaluation functions in FHE must be designed for the worst-case scenario: the circuit has to follow every possible path, because a server that could choose a path depending on the data would learn information about the encrypted values, which contradicts the security of the scheme. With this in mind, we designed various fundamental operations such as shift, compare, arithmetic, and nonlinear functions based on the bitwise operation, that is, on movements of the encrypted bits. Table 1 provides the average execution time for the basic types of operations in our scheme.
Yoo et al.  explained the specific design of each bitwise function in detail, and therefore, we briefly introduce the concept of its design in this study.
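The worst-case design requirement can be illustrated with a branch-free selection. In the sketch below, plain Python bit operations stand in for bootstrapped homomorphic gates, and the names `mux_bit` and `select` are hypothetical; the point is only that both inputs are always processed, so the evaluation path does not depend on the (encrypted) selector.

```python
# Plain bit operations stand in for bootstrapped homomorphic gates.
def AND(a: int, b: int) -> int:
    return a & b


def OR(a: int, b: int) -> int:
    return a | b


def NOT(a: int) -> int:
    return 1 - a


def mux_bit(sel: int, a: int, b: int) -> int:
    """Return a if sel == 1 else b, without branching on sel."""
    return OR(AND(sel, a), AND(NOT(sel), b))


def select(sel: int, xs: list, ys: list) -> list:
    """Bitwise selection between two equal-length bit vectors."""
    return [mux_bit(sel, x, y) for x, y in zip(xs, ys)]


print(select(1, [1, 0, 1], [0, 1, 1]))  # [1, 0, 1]
print(select(0, [1, 0, 1], [0, 1, 1]))  # [0, 1, 1]
```

A plaintext program would write `xs if sel else ys`; in the encrypted domain the server cannot evaluate that condition, so every gate in `select` is executed regardless of the value of `sel`.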
2.3. Multilayer Perceptron Neural Network
The multilayer perceptron neural network (MLPNN) [20–22] is one of the representative algorithms of deep learning. It is a network of multiple layers that have perceptrons in each layer that play as neurons in our brain. It is a basic structure in the neural network system that endeavors to learn a representation of a given set of data by mimicking our brain system in a simple form. The technique is widely used in a variety of applications such as computer vision, speech recognition, and natural language processing.
2.3.1. Single Perceptron
To understand the structure of the network, we decompose it into a set of layers and analyze a single perceptron in a layer. A perceptron computes a prediction value $z$ as a linear summation $z = \sum_i w_i x_i + b$ with respect to the input $x$, where $w_i$ and $b$ represent a weight and a bias, respectively. More compactly, we can write $z$ as $z = \mathbf{w} \cdot \mathbf{x} + b$, where a bold lowercase letter and $\cdot$ represent a vector and an inner product operation, respectively. The output of the single perceptron is a binary number taking 0 or 1. The output is determined by a threshold value 0, and thus, a rule for the perceptron can be written:
$$\text{output} = \begin{cases} 0, & \mathbf{w} \cdot \mathbf{x} + b \le 0, \\ 1, & \mathbf{w} \cdot \mathbf{x} + b > 0. \end{cases}$$
Training the MLPNN is almost the same as training a single perceptron, but we add one more assumption to the structure: we want a small change in the weights and biases to cause a small change in the corresponding output. The concept can be implemented immediately when we define a sigmoid function $\sigma$ that outputs a value between 0 and 1 with a “smoothness” property that allows differentiation of the function. In this way, the weighted sum is “squashed” into a small interval in which a small change in parameters causes a small change in the output. Also, the sigmoid function has the property that its derivative can be calculated by $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which involves only a single multiplication and a single subtraction. This algebraic property is especially helpful when we update the parameters in the backpropagation algorithm. Therefore, our rule for the output of the single sigmoid neuron is the following:
$$a = \sigma(z),$$
where $z = \mathbf{w} \cdot \mathbf{x} + b$ and $\sigma(z) = \frac{1}{1 + e^{-z}}$.
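The derivative identity for the sigmoid function is easy to check numerically. The following plaintext sketch (function names are illustrative) compares the closed form, which needs one multiplication and one subtraction once the sigmoid value is known, against a central finite difference.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z: float) -> float:
    """Closed form: one subtraction and one multiplication given sigmoid(z)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z: float, h: float = 1e-6) -> float:
    """Central finite difference, used only as an independent check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 0.5, 3.0):
    closed = sigmoid_prime(z)
    approx = numeric_derivative(sigmoid, z)
    print(f"z={z:+.1f}  closed={closed:.6f}  numeric={approx:.6f}")
```

This is why an accurate sigmoid is enough for both passes: the backward pass reuses the forward-pass value instead of evaluating a second nonlinear function.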
The MLPNN has a series of layers that are divided into three categories: an input layer, hidden layer(s), and an output layer. The input layer takes a set of data $\mathbf{x}_i$ for $i = 1, \ldots, N$, where each $\mathbf{x}_i$ is a vector of $n_1$ elements and $n_l$ denotes the number of neurons in layer $l$. The purpose of the network is to learn from the training data so that it can estimate the corresponding label $y_i$ as accurately as possible. The whole training process can be divided into two steps: the feedforward and backpropagation algorithms.
The feedforward algorithm takes an encoding of the data $\mathbf{x}$ that is fed into the input layer as $\mathbf{a}^1$. Then, in each layer $l$, we perform the same procedure as in the derivation of the output of the single sigmoid neuron, with slightly different notation:
$$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l, \qquad a_j^l = \sigma(z_j^l). \quad (4)$$
We refer to $w_{jk}^l$ as the weight between two neurons, the $k$th neuron in layer $l-1$ and the $j$th neuron in layer $l$. Equation (4) can be simplified as $\mathbf{z}^l = W^l \mathbf{a}^{l-1} + \mathbf{b}^l$ and $\mathbf{a}^l = \sigma(\mathbf{z}^l)$ in matrix form. Therefore, the feedforward algorithm aims to calculate the value $\mathbf{a}^L$ in the output layer $L$, that is, the likelihood that the data are classified as the label $y$.
The backpropagation algorithm is a popular way of training the neural network using the gradient descent method. Its purpose is to train the network so that it estimates the label of given data as accurately as possible. To achieve this, we define the cost function of a single training example $\mathbf{x}$ by
$$C_{\mathbf{x}} = \frac{1}{2} \left\| \mathbf{y} - \mathbf{a}^L \right\|^2,$$
where $\mathbf{y}$ is the desired value, whose entries take a value of 0 or 1. The total cost function for the given training dataset is the mean squared error (MSE), that is, $C = \frac{1}{N} \sum_{\mathbf{x}} C_{\mathbf{x}}$. Therefore, our goal of training is to find the values for weights and biases that minimize the cost function.
We use the gradient descent algorithm, which finds the partial derivatives of $C$ with respect to $w$ and $b$ to iteratively adjust the weights and biases:
$$w \leftarrow w - \eta \frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial C}{\partial b}, \quad (6)$$
where $\eta$ is a learning rate. We also denote by $\nabla C_{\mathbf{x}}$ the gradient vector whose elements are all the partial derivatives $\partial C_{\mathbf{x}} / \partial w$ and $\partial C_{\mathbf{x}} / \partial b$ derived from a single data item $\mathbf{x}$. Therefore, for a general gradient descent algorithm, we update our parameters by the mean of the gradient vectors as in equation (6).
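As a plaintext sketch of one gradient descent update, the example below trains a single sigmoid neuron on one training example with the quadratic cost. The initial weights, input, label, and learning rate are arbitrary illustrative values, not taken from the experiments of this work.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x) -> float:
    """Feedforward for one neuron: a = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cost(a: float, y: float) -> float:
    """Quadratic cost of a single training example."""
    return 0.5 * (a - y) ** 2


def gradient_step(w, b, x, y, eta):
    """One update w <- w - eta * dC/dw, b <- b - eta * dC/db."""
    a = forward(w, b, x)
    delta = (a - y) * a * (1.0 - a)          # dC/dz, using sigma' = a(1 - a)
    new_w = [wi - eta * delta * xi for wi, xi in zip(w, x)]
    new_b = b - eta * delta
    return new_w, new_b


w, b = [0.2, -0.4], 0.1      # illustrative initial parameters
x, y = [1.0, 0.5], 1.0       # illustrative training example and label
before = cost(forward(w, b, x), y)
w, b = gradient_step(w, b, x, y, eta=0.5)
after = cost(forward(w, b, x), y)
print(before > after)  # True
```

In the encrypted setting every arithmetic operation in `gradient_step` would be replaced by its homomorphic counterpart, which is why the accuracy of the homomorphic sigmoid affects both passes.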
3. Our Model: t-BMPNet over FHE
Our model is based on FHE and performs both training and prediction in the encrypted state (Figure 2). We model the system as a server and a client participating in a communication protocol, where the client is responsible “only” for encrypting the data and transmitting it to the server through a secure channel.
3.1.1. Threat Model
We assume both the server and the client to be honest-but-curious entities, which differs from [4, 5], where the clients are assumed to be honest entities such as financial institutions or hospitals. In our model, the client need not be limited to such trusted organizations; the model can be applied in a broader scope that involves individuals.
3.1.2. High-Level Illustration
The communication protocol for learning from the client’s encrypted data in our model is as follows (Figure 2):
(1) The client encrypts its data using its public key and outsources the encrypted data to the server
(2) The server evaluates the feedforward and backpropagation algorithms in the encrypted state to learn from the client’s data and optimize the global weight parameters
(3) The server classifies each data item using the feedforward algorithm and sends the encrypted result to the client
(4) The client receives the encrypted result and decrypts it to obtain the classification output
Based on this model, we propose (1) an encrypted training algorithm and (2) a more accurate design of the sigmoid function based on the bitwise operations on the ciphertexts as our main points of this study. Specifically, (1) and (2) are evaluation functions (Eval) that operate on ciphertexts.
3.2. Number Formatting, Encoding/Decoding, and Encryption/Decryption
Our number system uses a fixed-point number for the bitwise operation. We use a fixed-point number representation rather than a floating-point number representation due to the efficiency of designing various FHE functions.
We encode to a plaintext vector of size : , where (or and concatenated. To prevent confusion, we represent the plaintext by , where denotes the separation of the fractional part and integer part of . Notice that we enumerate the plaintext elements in the reverse order and allocate the left numbers of bits to the fractional part, to the integer part, and one for the signed bit (or . For instance, of a real number is encoded as for the -bit number system.
A plaintext is decoded to a real number by a rule of usual conversion from the fixed-point number to a real number. For a positive , we perform , whereas for a negative , we perform the two’s complement of followed by the same conversion to a real number with different signs. For instance, 8-bit plaintext is decoded to .
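The encoding and decoding rules can be sketched as follows. The bit widths used here (8 bits in total with 4 fractional bits) are hypothetical values chosen for the example, and the function names are illustrative; the layout follows the description above, with the fractional bits first, then the integer bits, and the sign bit last, and with negative numbers stored in two’s complement.

```python
def encode(x: float, n_bits: int = 8, frac_bits: int = 4) -> list:
    """Encode a real number as n_bits fixed-point bits, least significant first."""
    scaled = round(x * (1 << frac_bits))
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if not lo <= scaled <= hi:
        raise ValueError("value out of range for this bit width")
    unsigned = scaled % (1 << n_bits)          # two's complement wrap-around
    return [(unsigned >> i) & 1 for i in range(n_bits)]


def decode(bits: list, frac_bits: int = 4) -> float:
    """Decode fixed-point bits (least significant first) back to a real number."""
    n_bits = len(bits)
    value = sum(b << i for i, b in enumerate(bits))
    if bits[-1]:                               # sign bit set: negative number
        value -= 1 << n_bits
    return value / (1 << frac_bits)


print(encode(1.5))           # [0, 0, 0, 1, 1, 0, 0, 0]
print(encode(-1.5))          # [0, 0, 0, 1, 0, 1, 1, 1]
print(decode(encode(-1.5)))  # -1.5
```

Each entry of the returned list would then be encrypted separately, so that the homomorphic gates can operate on individual encrypted bits.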
The plaintext is encrypted to a ciphertext , where each from the message space is encrypted over the torus under the same secret key of the size , where is the security parameter. is equivalent to , but we use both notations. As mentioned earlier in the preliminaries section, and are mapped to the message space ; each is encrypted from a randomly selected LWE sample by . We denote an encryption of under a secret key to be . For example, we refer to an encryption of the plaintext as .
As mentioned, we bootstrap the noise after each gate operation, so the noise of the ciphertext never grows large enough for the decryption circuit to output an incorrect message. Therefore, under the same original secret key, we are guaranteed to retrieve the original plaintext message from the ciphertext by decrypting each ciphertext bit, that is, by computing its phase followed by a rounding operation (or taking the expectation).
3.3. Some Basic Operations
In this section, we summarize some of the basic operations given in Table 1 to give insight into how we designed the basis of our system. For the full details of their construction, we refer the reader to [19, 23]. Note that we use the notations $\oplus$, $\wedge$, $\vee$, and $\neg$ to denote the logical gates XOR, AND, OR, and NOT, which are homomorphically designed and provide bootstrapping after each operation. The following is a brief sketch of some of the basic operations in our system.
We start with the simplest case, the shift operation, which moves the bits in an array. From the standpoint of the ciphertext bits, it is likewise just moving the encrypted bits in a certain direction. Therefore, we can roughly write the right shift of ct by $k$ bits as ct$'[i]$ = ct$[i+k]$.
In a similar manner, the addition operation HE.Add between two ciphertexts ct.$a$ and ct.$b$ is designed using the binary full-adder circuit: (1) ct.$s[i]$ = ct.$a[i] \oplus$ ct.$b[i] \oplus$ ct.$c[i]$ and (2) ct.$c[i+1]$ = (ct.$a[i] \wedge$ ct.$b[i]$) $\vee$ (ct.$c[i] \wedge$ (ct.$a[i] \oplus$ ct.$b[i]$)), where ct.$s[i]$ and ct.$c[i]$ represent the encrypted sum and carry at index $i$, respectively. Additionally, we improved and optimized the full-adder circuit with respect to the execution time of each homomorphic operation (see [19, 23] for the details).
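The shift and full-adder constructions can be sketched on plaintext bits, with ordinary bit operations standing in for the bootstrapped gates. The helper names, the 8-bit width, the least-significant-bit-first ordering, and the zero fill of the shift are assumptions made for this illustration; the optimized adder of the original work is not reproduced here.

```python
# Plain bit operations stand in for bootstrapped homomorphic gates.
def XOR(a, b):
    return a ^ b


def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def to_bits(v: int, n: int = 8) -> list:
    """n-bit two's complement representation, least significant bit first."""
    return [(v >> i) & 1 for i in range(n)]


def from_bits(bits: list) -> int:
    """Interpret the bits as an unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))


def shift_right(x: list, k: int) -> list:
    """Move every bit k positions toward index 0 (zero fill assumed)."""
    return x[k:] + [0] * k


def he_add(x: list, y: list) -> list:
    """Ripple-carry addition built from full adders; the final carry is dropped."""
    carry, out = 0, []
    for a, b in zip(x, y):
        t = XOR(a, b)
        out.append(XOR(t, carry))                 # sum bit
        carry = OR(AND(a, b), AND(carry, t))      # carry into the next index
    return out


print(from_bits(he_add(to_bits(5), to_bits(6))))   # 11
print(from_bits(shift_right(to_bits(12), 2)))      # 3
```

Each loop iteration uses five gates, so the cost of an addition grows linearly with the bit width when every gate is followed by bootstrapping.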
3.3.3. Two’s Complement
The two’s complement method HE.Twos is used for representing signed numbers, both integers and real numbers, in our system. It is an extension of the addition operation in which we take the one’s complement followed by the addition of 1: HE.Twos(ct) = HE.Add($\neg$ct, ct.one), where $\neg$ct is the bitwise homomorphic NOT of ct and ct.one is an encryption of the value whose only nonzero bit is the least significant bit.
We define the subtraction operation HE.Subt using the previous operations: addition and two’s complement. For the subtraction of two ciphertext inputs (ct.$a$ and ct.$b$), we take the two’s complement of ct.$b$ and add ct.$a$, which leads to HE.Subt(ct.$a$, ct.$b$) = HE.Add(ct.$a$, HE.Twos(ct.$b$)).
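The two’s complement and subtraction operations compose directly from the adder, as the following plaintext sketch shows. As before, plain bit operations stand in for homomorphic gates, and the helper names and the 8-bit integer format are assumptions for illustration.

```python
def he_add(x: list, y: list) -> list:
    """Ripple-carry addition on LSB-first bit lists; the final carry is dropped."""
    carry, out = 0, []
    for a, b in zip(x, y):
        t = a ^ b
        out.append(t ^ carry)
        carry = (a & b) | (carry & t)
    return out


def he_twos(x: list) -> list:
    """Two's complement: bitwise NOT followed by the addition of 1."""
    one = [1] + [0] * (len(x) - 1)
    return he_add([1 - b for b in x], one)


def he_subt(x: list, y: list) -> list:
    """Subtraction x - y as x + twos(y)."""
    return he_add(x, he_twos(y))


def to_bits(v: int, n: int = 8) -> list:
    return [(v >> i) & 1 for i in range(n)]


def from_bits_signed(bits: list) -> int:
    """Interpret LSB-first bits as a two's complement signed integer."""
    value = sum(b << i for i, b in enumerate(bits))
    return value - (1 << len(bits)) if bits[-1] else value


print(from_bits_signed(he_subt(to_bits(9), to_bits(5))))  # 4
print(from_bits_signed(he_subt(to_bits(3), to_bits(5))))  # -2
```

Since subtraction costs one NOT per bit plus two additions in this naive form, it is a natural target for the kind of circuit optimization mentioned for the full adder.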
We define the multiplication operation by a sum of products of bits where multiplication between two bits is just AND operation. Let be the result of a product between two plaintext binary numbers and . We define by
Then, we can express by the sum of ’s of length : . Typically, we take ’s of from index to for the real number multiplication.
With this idea, we consider the multiplication of and for the case in the ciphertext. Since we have only considered multiplication in positive cases, and also cannot determine whether the and are positive or negative values, the algorithm for “encrypted” multiplication has to follow both of the cases. This is just an outline of the bitwise homomorphic multiplication, and for the interested readers, check  for more specific details of the algorithm.
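The outline above can be sketched on plaintext bits as follows. This is one possible reading, not the algorithm of the original work: it multiplies the magnitudes with AND partial products and a shift-and-add accumulation, and it handles the signs with branch-free selections so that both the positive and the negative case are always evaluated. It is restricted to 8-bit integers (no fractional rescaling), truncates the result to 8 bits, and all helper names are hypothetical.

```python
def he_add(x, y):
    """Ripple-carry addition on LSB-first bit lists; the final carry is dropped."""
    carry, out = 0, []
    for a, b in zip(x, y):
        t = a ^ b
        out.append(t ^ carry)
        carry = (a & b) | (carry & t)
    return out


def he_twos(x):
    return he_add([1 - b for b in x], [1] + [0] * (len(x) - 1))


def select(sel, xs, ys):
    """Branch-free choice: xs if sel == 1 else ys."""
    return [(sel & a) | ((1 - sel) & b) for a, b in zip(xs, ys)]


def mul_unsigned(x, y):
    """Sum of shifted partial products; each bit product is a single AND."""
    n = len(x)
    acc = [0] * n
    for i, yi in enumerate(y):
        partial = [0] * i + [xj & yi for xj in x[: n - i]]
        acc = he_add(acc, partial)
    return acc


def he_mult(x, y):
    """Signed multiplication that always evaluates both sign cases."""
    sx, sy = x[-1], y[-1]                       # sign bits
    mag_x = select(sx, he_twos(x), x)           # |x| without branching
    mag_y = select(sy, he_twos(y), y)           # |y| without branching
    prod = mul_unsigned(mag_x, mag_y)
    return select(sx ^ sy, he_twos(prod), prod)  # restore the sign


def to_bits(v, n=8):
    return [(v >> i) & 1 for i in range(n)]


def from_bits_signed(bits):
    value = sum(b << i for i, b in enumerate(bits))
    return value - (1 << len(bits)) if bits[-1] else value


print(from_bits_signed(he_mult(to_bits(-3), to_bits(5))))  # -15
```

For fixed-point real numbers the product would additionally need to be rescaled by the number of fractional bits, which corresponds to keeping a window of the full-length product rather than its lowest bits.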
To keep the flow of this study, we refer the reader to [19, 23, 24] for a deeper understanding of the above operations as well as the other homomorphic operations given in Table 1.
4. Learning from the Encrypted Data
4.1. Overall Process
Training a model involves mainly two phases (feedforward and backpropagation) that process back and forth to provide the optimal parameters that minimize the total cost , where superscript of indicates the number of epoch for th number of training . The Algorithm 1 illustrates the whole process of the training in plaintext, where lines (7–10) and lines (12, 15, 16) are the feedforward and the backpropagation algorithms, respectively.
Our model that learns from the encrypted data is close to the plaintext model, with some similarities and differences. We follow the same path as in Algorithm 1 for training; we calculate the cost and then update the parameters $w$ and $b$. However, in the encrypted domain, since all parameters and data are encrypted, we need a different approach. First, we replace plaintext operations with homomorphic operations. For instance, the weighted sum in line 8 is computed with the homomorphic multiplication and HE.Add in place of plaintext multiplication and addition.
In particular, the sigmoid function is crucially important in both feedforward and backpropagation. In the feedforward algorithm, the activation function is used in every node to compute $a$ in line 9 as the input to the next layer. Moreover, in the backpropagation algorithm, the partial derivatives are derived from the sigmoid function. Since line 12 is the step of obtaining partial derivatives without details of the actual steps, we state its process in more detail as follows:
$$\delta_j^L = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L), \qquad \delta_j^l = \sum_k w_{kj}^{l+1} \delta_k^{l+1} \sigma'(z_j^l), \qquad \frac{\partial C}{\partial b_j^l} = \delta_j^l, \qquad \frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l, \quad (8)$$
where $\delta_j^l = \partial C / \partial z_j^l$ is the partial derivative of $C$ with respect to the weighted input of the $j$th neuron at layer $l$. Equation (8) demonstrates that all the partial derivatives $\partial C / \partial z$, $\partial C / \partial w$, and $\partial C / \partial b$ contain $\sigma'(z)$, which can be calculated by $\sigma(z)(1 - \sigma(z))$. Thus, we conclude that approximating the sigmoid function accurately is significant in both the feedforward and backpropagation algorithms and thereby increases the accuracy of the learning model as a whole.
4.2. Key Operations
The sigmoid function mainly consists of four operations: addition, two’s complement, exponential function, and division. The former two have been discussed in the previous section. Now, we explain the other two key functions for the construction of the sigmoid function in the encrypted domain. The general approach for deriving the exponential function and division is discussed in this section; we provide concrete examples in the Appendix, followed by figures illustrating the detailed process of the bitwise operations performed behind the scenes.
4.2.1. Exponential Function
Our binary exponential function requires several steps to perform in the ciphertext (Figure 3). This is because the algorithms are different depending on the sign of the input value. First, suppose we only consider deriving of in the plaintext. Then, we can divide by the fractional part and integer part . We express in a binary vector as follows:
The algorithm for positive is as follows: (1) right shift by , (2) right shift 1 in binary vector by , and (3) add the results of (1) and (2).
Equation (10) illustrates details of the plaintext positive exponentiation algorithm in steps (1) and (2).
The algorithm for the plaintext negative exponentiation differs from that of the positive exponentiation in the direction and magnitude of the shifts and in using subtraction instead of addition. We also need to make a positive number by performing an absolute value operation at the beginning. We perform the algorithm with : (1) left shift by , (2) left shift 1 in binary vector by , and (3) subtract the result of (1) from the result of (2).
The result of the positive and negative exponentiation yields a set of line segments that approximates the exponential function, that is, a line segment connects a point and a point . A simple intuitive proof of the idea is that, for instance, step (2) in positive exponentiation corresponds to the point in , and step (1) is a proportional increase of with respect to . Therefore, stays in the line from to .
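The line-segment approximation described above can be sketched in plaintext arithmetic. The shift amounts are missing from this text, so the sketch assumes they are the integer part of the magnitude for the positive case and, for the negative case, one more than the integer part for the fractional term; this choice reproduces the appendix example, where an input of magnitude 1.5 with a negative sign yields 0.375. The function name and the use of floating-point multiplication in place of bit shifts are illustrative only.

```python
import math


def exp2_piecewise(x):
    """Piecewise-linear approximation of 2**x that is exact at integer x.

    Writing |x| = n + f with integer part n and fractional part f:
      x >= 0:  2**n + f * 2**n            (shift 1 and f by n, then add)
      x <  0:  2**(-n) - f * 2**(-(n+1))  (shift 1 by n and f by n + 1, then subtract)
    """
    magnitude = abs(x)
    n = math.floor(magnitude)
    f = magnitude - n
    if x >= 0:
        return 2.0 ** n + f * 2.0 ** n
    return 2.0 ** (-n) - f * 2.0 ** (-(n + 1))
```

Between consecutive integers the result lies on the straight line joining the two exact powers of two, so the approximation error is largest near the middle of each segment.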
We transform the above plaintext exponentiation to an encrypted version. To achieve this, the crucial task is to find the value of . In fact, it is impossible to find this value since, otherwise, it means that we know the value of from which is contradictory. Therefore, we must consider all possible cases that the exponentiation outcome can be. This task involves comparing values of to of all possible values that the value can have by using HE.Equi function. In short, the function HE.Equi outputs if the two given ciphertexts are equivalent, and otherwise . We denote to be the results of all the comparisons. We also perform all possible cases of the exponentiation and call it where refers to exponentiation of by times. Next, we apply the bitwise AND operation to and pairwise, which outputs the result of exponentiation if , otherwise . Finally, adding all the results of the bitwise AND operations gives the result of the exponentiation, which has considered all possible scenarios.
Up to this point, we have proceeded steps in the absolute value operation, all the possible cases of and positive and negative exponentiation (Figure 3). Now, we consider the sign of the input whether to perform the positive or negative exponentiation algorithm. We perform bitwise AND operation of the result of the positive exponentiation with . The result is if is positive (i.e., sign bit which means and returns if is negative. Likewise, we perform the similar procedure: bitwise AND operates the result of negative exponentiation with . The procedure only gives whenever is negative, and otherwise . Last, we add the two results which output or depending on the sign of the input to obtain .
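The "compute every case, mask, and add" pattern described above can be simulated in plaintext as follows. Everything here is a hypothetical stand-in: integers scaled by `2**FRAC_BITS` play the role of fixed-point ciphertexts, `int(n == i)` stands in for HE.Equi, `mask` stands in for the bitwise AND with a single selector bit, and `max_int` bounds the integer parts that are tried. Integer shifts are used, so `<<` multiplies by two regardless of the shift direction named in the text, which depends on the bit-array layout.

```python
FRAC_BITS = 8  # hypothetical fixed-point precision for this sketch


def encode(x):
    return round(x * (1 << FRAC_BITS))


def decode(v):
    return v / (1 << FRAC_BITS)


def mask(bit, value):
    """AND every bit of `value` with one selector bit (0 or 1), without branching."""
    return value & -bit


def exp2_oblivious(x, max_int=4):
    """Evaluate the piecewise-linear 2**x without branching on the integer part or sign."""
    sign = 1 if x < 0 else 0              # sign bit of the input
    mag = encode(abs(x))                  # absolute value operation
    n = mag >> FRAC_BITS                  # integer part (hidden when encrypted)
    f = mag & ((1 << FRAC_BITS) - 1)      # fractional part
    one = 1 << FRAC_BITS
    pos = neg = 0
    for i in range(max_int + 1):          # try every possible integer part
        e_i = int(n == i)                 # stands in for HE.Equi
        pos_i = (one << i) + (f << i)     # positive exponentiation candidate
        neg_i = (one >> i) - (f >> (i + 1))  # negative exponentiation candidate
        pos += mask(e_i, pos_i)           # only the matching case survives
        neg += mask(e_i, neg_i)
    # Select by sign: keep `pos` when sign is 0 and `neg` when sign is 1.
    return decode(mask(1 - sign, pos) + mask(sign, neg))
```

Inputs whose integer part exceeds `max_int` match no case and return 0 in this sketch, which mirrors the limited valid range that a fixed number of integer bits imposes.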
For a general exponential function $a^x$, we preprocess the input at the very beginning of the whole algorithm, that is, we multiply $x$ by $\log_2 a$; the logarithm property gives $2^{\log_2 a} = a$. If the input $x$ is replaced by $x \log_2 a$, we obtain $2^{x \log_2 a} = a^x$. Therefore, in ciphertext, we multiply by $\log_2 a$ in the first place to obtain the general formula $a^x$.
Remark: In this study, we restrict the exponential function to be $e^x$. Therefore, we multiply the input value by $\log_2 e$ as a constant to make the algorithm work for $e^x$.
The binary division algorithm in ciphertext takes two real number ciphers and and performs divided by . The result of the function is , where is a quotient (Figure 4). The algorithm is somewhat similar to the exponential function and intuitive to understand.
Suppose we consider the algorithm in plaintext, where are all unencrypted positive real number vectors and have a vector of size . First, we concatenate and , where is set to be a zero vector. We denote the vector as . Then, we repeat the steps as follows:
(1) Right shift by 1 and compare with
(2) If is less than , set to be 0, and otherwise, set to be 1 and subtract by .
We iterate the steps for times and obtain for the result of the plaintext real number binary division algorithm. The repeated steps are analogous to the normal division process; at each step of the iteration, we compare and subtract the values of the divisor and the dividend until we acquire quotient of size .
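The shift, compare, and subtract loop can be illustrated with ordinary restoring long division on fixed-point integers. The register layout and shift direction of the original algorithm are not recoverable from this text, so this is a sketch of the general technique rather than the paper's exact procedure; the parameter names and default widths are assumptions.

```python
def divide_fixed(a, b, frac_bits=8, total_bits=16):
    """Divide two non-negative fixed-point integers (scaled by 2**frac_bits), with b > 0.

    Returns floor(a / b) in the same fixed-point scale. One quotient bit is
    produced per iteration by bringing down one dividend bit, comparing the
    running remainder with the divisor, and subtracting when it fits.
    """
    dividend = a << frac_bits  # extra bits so the quotient keeps its fraction
    remainder = 0
    quotient = 0
    for i in reversed(range(total_bits + frac_bits)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        q_bit = int(remainder >= b)
        remainder -= q_bit * b
        quotient = (quotient << 1) | q_bit
    return quotient
```

With 8 fractional bits, 3.0 and 2.0 are encoded as 768 and 512, and `divide_fixed(768, 512)` returns 384, which decodes to 1.5.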
As for the binary division in ciphertext, it is necessary to consider the sign of the inputs in the same manner as in the binary exponential function. The difference is that we now consider two variables as our inputs compared to one variable in the exponential function algorithm. Thus, we consider two cases: (1) , are both positive or negative and (2) both have different signs. The result of division from the former condition is positive, whereas it is negative in (2). A convenient way to distinguish the two cases is to apply the XOR operation to the sign bits of and , which outputs 0 if both have the same sign and 1 otherwise. We use to denote the result of the operation: .
Let be the result of the binary division step in Figure 4 which outputs a positive division result. Suppose and are the case (1) variables. Then, since is , bitwise AND operation between and is the itself, whereas if and are the case (2) variables, then the bitwise AND operation will output (since ). Thus, the result of the bitwise AND operation between and is case (1) output. Likewise, for the negative result of (2), we instead perform two’s complement operation in the beginning to make a negative division result. The bitwise AND operation of follows and outputs the negative result for case (2) and for case (1). Last, we add the former results: HE.ADD (bitwise and bitwise ).
The main part of the binary division, where the goal is to obtain , is designed in the following way. We first define the HE.Compare function  as an operation that outputs if the former input is less than the latter and otherwise. This function is extremely useful since we can use it in each step of the iteration to set the value for , that is, . We also use to perform bitwise AND operation with the subtraction of by since the subtraction is only performed when is . Hence, we can update in both cases of being and .
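The branch-free update and the sign handling described in the last three paragraphs can be simulated in plaintext. In this hypothetical sketch, `int(remainder < divisor)` stands in for HE.Compare, `mask` stands in for the bitwise AND with a single bit, the XOR of the two sign bits selects between the positive result and its negation, and fixed-point integers scaled by `2**frac_bits` replace ciphertexts. None of these names come from the paper.

```python
def mask(bit, value):
    """AND every bit of `value` with one selector bit (0 or 1).

    Python integers behave as infinite two's complement, so `-1` keeps the
    value unchanged (including negative values) and `0` clears it.
    """
    return value & -bit


def divide_branch_free(a, b, frac_bits=8, total_bits=16):
    """Signed fixed-point division without data-dependent branches in the loop."""
    sign = int(a < 0) ^ int(b < 0)        # XOR of the sign bits
    dividend = abs(a) << frac_bits        # absolute value of both inputs
    divisor = abs(b)
    remainder = quotient = 0
    for i in reversed(range(total_bits + frac_bits)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        less = int(remainder < divisor)   # stands in for HE.Compare
        q_bit = 1 - less                  # quotient bit is the negated comparison
        # Keep the remainder when it is too small, otherwise take the subtraction.
        remainder = mask(less, remainder) + mask(q_bit, remainder - divisor)
        quotient = (quotient << 1) | q_bit
    # Same sign keeps the positive quotient; opposite signs keep its negation.
    return mask(1 - sign, quotient) + mask(sign, -quotient)
```

Both candidate updates are always computed and exactly one of them survives the masking, which is why the loop needs no knowledge of the comparison result.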
Generally, the goal of an encrypted model is to mimic the plaintext model and provide the same level of functionality. In this sense, a key factor in constructing a good homomorphic MLPNN is how well the underlying functions are designed to evaluate accurate results. That is, the output after decryption should be as close as possible to the result of the plaintext model under the same configuration, such as the number of layers and neurons. Our model evaluates every homomorphic operation with the same level of accuracy as the plaintext model except the sigmoid function; in fact, the same holds for the previous literature that we have discussed. Therefore, the real key factor in constructing such an encrypted MLPNN model narrows down to the construction of the activation function: approximating the sigmoid function is crucial.
5.1. Sigmoid Function: Low-Degree Polynomial vs. Binary Approximation
Approximating the sigmoid function is important in two ways: it is used in (1) feedforward activation of neurons and (2) computation of partial derivatives in backward propagation. First, during the feedforward phase, the sigmoid function is computed in every neuron except the neurons in the input layer. Therefore, a model becomes error-prone when the sigmoid function is not well approximated. Also, in case (2), partial derivatives of the cost with respect to the parameters are calculated from derivatives of the sigmoid (equation (8)). Moreover, the number of sigmoid derivatives required is about three times that of (1) because the derivative of the sigmoid function is used for every partial derivative of with respect to parameters , , and a neuron .
Figure 5 shows graphs of sigmoid functions and their derivatives under different approaches. We use the term true sigmoid function for the nonapproximated sigmoid function evaluated in plaintext and use our binary sigmoid function that takes 32-bit input. Moreover, we present the two most famous approximation techniques, the Taylor series  and the least square method , for comparison. The Taylor series method is the most common method for approximating a nonlinear function as sums of polynomials (denoted by . Along the same line of polynomial approximation, the least square method aims to approximate the nonlinear function by minimizing mean squared error (MSE) of , where is our target function, and is defined by . In our experiment, we use the Taylor series of degree 7 and the least square method of degree 9 polynomials. The equations are the following: where the coefficients of are .
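To make the limited range of a low-degree Taylor approximation concrete, the sketch below evaluates the standard degree-7 Maclaurin expansion of the sigmoid function next to the true function. The coefficients are the textbook expansion around zero and are not copied from the paper, whose equations are missing from this text; the least square coefficients are not recoverable and are therefore not reproduced.

```python
import math


def sigmoid(x):
    """True (nonapproximated) sigmoid evaluated in plaintext."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_taylor7(x):
    """Degree-7 Maclaurin polynomial of the sigmoid function.

    sigma(x) ~ 1/2 + x/4 - x^3/48 + x^5/480 - 17 x^7/80640
    """
    return (0.5
            + x / 4.0
            - x ** 3 / 48.0
            + x ** 5 / 480.0
            - 17.0 * x ** 7 / 80640.0)


def max_error(lo, hi, steps=200):
    """Largest absolute error of the polynomial on [lo, hi], sampled uniformly."""
    worst = 0.0
    for k in range(steps + 1):
        x = lo + (hi - lo) * k / steps
        worst = max(worst, abs(sigmoid(x) - sigmoid_taylor7(x)))
    return worst
```

Comparing `max_error(-2, 2)` with `max_error(-5, 5)` shows the polynomial tracking the sigmoid closely near the origin and diverging quickly outside it, since the highest-order term dominates for large inputs.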
5.1.1. Comparison Result
In Figure 5(a), the valid range of each approach is different: the binary sigmoid function shows a wider valid range than the two other approaches. The binary approximation provides the valid range of to 11, while the Taylor series and the least square approximation provide the ranges of to 2 and to 8, respectively. The loss of precision of the binary sigmoid function from the point is caused by the design of our exponential function and the number of bits that are allocated to the integer part in our number formatting system; increasing the size of the bit array or allocating more bits to the integer part can provide a wider range.
Figure 5(b) shows graphs of the sigmoid function derivatives for each approach. The binary sigmoid derivative function demonstrates the highest similarity to the derivative of the true sigmoid function within all ranges in the figure, whereas the Taylor series approximation only works in the interval of around to 2, in accordance with the range shown in Figure 5(a). An interesting result is that the least square approximation is unstable: it is less accurate from to 3 and 3 to 7.5 compared to its graph shown in Figure 5(a). This implies that the least square method is not a recommended approach for the sigmoid function in the backpropagation. Therefore, considering the results shown in Figure 5, we conclude that our binary sigmoid approximation, which builds upon bitwise operations rather than following the tradition of low-degree polynomials, is significantly more accurate than the other two well-known approaches.
5.2. Time Performance of t-BMPNet
5.2.1. Environment Setting
Our experiments are conducted on an AMD Ryzen 5 3500X 6-Core 3.60 GHz processor with 8.0 GB RAM running Ubuntu 20.04.1 LTS, and we use TFHE version 1.0 to implement the bitwise scheme of our approach.
We measure the execution time of our binary approximation for three key algorithms (sigmoid function, feedforward, and backpropagation) with different input bit lengths (Table 2). Since some key functions in our system take a considerable amount of time to execute, the time performance of feedforward and backpropagation is measured on an MLPNN with the simplest setting: three layers, one neuron per layer, and a single training sample.
For the time performance of the training algorithm under t-BMPNet, we obtain 1.87, 7.14, and 28.02 minutes for 8, 16, and 32 input bits, respectively. Our main focus, the feedforward and backpropagation algorithms, is measured at 17.00 and 11.02 minutes, respectively, which together account for the 28.02 minutes of the 32-bit case. In terms of the number of operations evaluated in each algorithm, the feedforward algorithm takes 2 additions, 2 multiplications, and 2 sigmoid functions for a total of 6 operations, whereas the backpropagation algorithm requires 3 subtractions and 7 multiplications, resulting in 10 operations for training on a single data point.
Addition and subtraction operations are negligible compared to other heavy operations in terms of computational costs; for instance, the execution time for multiplication is times that of addition and subtraction. Concerning only the remaining operations, the net amount of operations for feedforward and backpropagation is 2 sigmoids and 5 multiplications, respectively. However, since the sigmoid function in 32-bit is approximately equivalent to multiplications, the feedforward algorithm takes a longer time than the backpropagation algorithm. In fact, the sigmoid derivative property significantly reduces the computational cost of the backpropagation algorithm. It bypasses the computationally expensive derivatives of the exponential function in the sigmoid function, so that only one multiplication and one subtraction are evaluated per function.
In general, for a small network in t-BMPNet, we expect the feedforward algorithm to be more computationally expensive than the backpropagation algorithm, since most of the time is spent on deriving values for sigmoid functions, which the backpropagation does not need to do. However, as the network becomes more complex, there is a crossover between the algorithms in terms of time performance. This is because, for any two layers containing neurons, the time complexity of deriving sigmoid functions in the feedforward algorithm is , whereas, in the backpropagation algorithm, updating parameter requires evaluation of multiplication operations, and thus, the time complexity increases by (equation (8) and Algorithm 1).
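The sigmoid derivative property referred to here is the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. A short plaintext sketch, with illustrative function names, shows why backpropagation can reuse the activation already computed in the feedforward pass instead of evaluating another exponential:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative_from_output(s):
    """Derivative of the sigmoid expressed through its own output s = sigmoid(x).

    Only one subtraction and one multiplication are needed.
    """
    return s * (1.0 - s)


def numerical_derivative(x, h=1e-6):
    """Central finite difference, used here only to check the identity."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
```

In the encrypted setting the same identity would be evaluated with one homomorphic subtraction and one homomorphic multiplication on the stored activation.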
5.3. Comparison with Other Approaches
We use SEAL version 3.10 to construct MLPNN in two different schemes, BFV and CKKS, for comparison with t-BMPNet. Both schemes are provided by the SEAL library and are the most actively used cryptographic schemes for implementing neural network algorithms. Specifically, the BFV scheme is the cryptographic basis for the Low-Latency CryptoNets (LoLa) , the most recent version of CryptoNets designed for private inference. Likewise, the CKKS scheme is applied in neural networks since it is considered one of the most practical FHE schemes; much of the literature [27, 28] focuses on the enhancement of throughput and memory efficiency of the neural network.
We implement MLPNN with the network architecture of [2, 3, 6, 7] (each entry represents the number of nodes in the layer) based on the three schemes—TFHE, BFV, and CKKS—to compare and analyze their characteristics (Table 3). We assume that the network parameters are encrypted, i.e., weights and biases are encrypted, and thus, operations are between ciphertexts. Also, we measure the time performance of the three approaches in two different ways by selecting different types of the activation function—square and sigmoid. Particularly, we measure the square function since CryptoNets and its variants use the square operation for the activation function.
5.3.1. Our Approach
t-BMPNet is designed as in the previous section using the TFHE library, with a different network architecture. We measure the time taken to run inference on a single encrypted data point for each input length and obtain 4.94, 19.33, and 75.91 minutes, respectively, in the setting where the square function is the activation function. With the sigmoid function, the total execution time increases to 9.48, 39.32, and 147.5 minutes, respectively.
5.3.2. Other Approaches
The MLPNNs using the BFV and CKKS schemes are designed as leveled HE schemes. That is, unlike TFHE, which is an FHE scheme that performs bootstrapping for every gate operation, bootstrapping is not supported; parameters are carefully chosen before the execution so that the whole procedure is performed without having to undergo an expensive bootstrapping procedure. We pack a single data point and then encrypt it to yield a ciphertext polynomial in in both schemes. The encrypted weights and biases are encoded row-wise to perform matrix multiplication with some rotations involved.
In the BFV setting, we obtain 0.15 minutes for evaluation with square function with a polynomial degree of . Since the BFV scheme can only support integer arithmetic, the normal sigmoid function is not applicable to the scheme. In the CKKS environment, we use and acquire 0.57 and 0.97 minutes for the evaluation using the square and sigmoid functions, respectively.
One of the major benefits of using the TFHE scheme is that it is not constrained by the complexity of the neural network, that is, we can implement as many layers as needed without having to consider parameters in the first place (this is because TFHE is a fully HE scheme with bootstrapping). However, in both leveled CKKS and BFV schemes, a complex neural network is not applicable. In our experiment setting with CKKS, a single inference on the neural network requires a total multiplicative depth of 16 in the circuit; in each layer, 6 multiplication operations are performed, one between the weights and the input and the rest for the sigmoid function. In particular, the sigmoid function requires at least 5 multiplications since it is approximated by the least square polynomial with the highest exponent equal to 7; computing the powers requires at least 4 multiplications, and finally, we need a multiplication between the coefficients and the powers. Therefore, expanding the network by a layer adds 6 more multiplicative depths, which is not practical for a complex network since the growth of the multiplicative depth requires an exponential increase of the modular space . In our experiment, we set the parameter to be the largest degree that the SEAL library can support in order to perform inference on encrypted data. A larger neural network cannot be implemented in this setting unless the library supports a higher polynomial degree and a lower degree is used in the least square polynomials (however, as a tradeoff, the accuracy of the network will decrease).
t-BMPNet can train the model without the constraint of choosing parameters from the beginning since it provides bootstrapping, for the same reason discussed for the complexity problem above. However, both CKKS and BFV schemes in the leveled environment cannot support training in the encrypted domain since training requires a feedforward and a backpropagation step for every iteration of the training procedure. One can choose the fully homomorphic variants of the BFV and CKKS schemes to facilitate complex neural networks; however, in terms of bootstrapping efficiency, TFHE is much faster, e.g., less than a second per gate, whereas the CKKS bootstrapping procedure requires several minutes to refresh the ciphertext.
5.4. Application to a Dataset
We have previously demonstrated that the training of our model works in the encrypted domain within a small network consisting of three neurons. Now, we expand our model to experiment with a larger network: we compare the real training model with our simulated model implemented in Matlab 2019a in terms of the mean squared error (MSE), the number of epochs, and the classification result.
We use a small dataset from the Sharky neural network consisting of 32 coordinates of and with their labels. The goal is to classify the points with a decision boundary that is determined by minimizing the cost. We randomly chose the weights and biases, fixed their values as the initial setting, and set the learning rate to 0.15. The network consists of 3 layers with 2, 4, and 2 neurons, respectively. We perform the backpropagation algorithm with the normal gradient descent method.
Sharky data consists of data in the range of to 1. We compare each MLPNN model with respect to different types of approximation techniques—plaintext, our approach, Taylor series, and least square method—to test accuracy for each model. In our experiment with the original Sharky dataset, we obtain 100% accuracy for all the models. We suspect that the results of 100% accuracy are because all of the techniques approximate sigmoid functions without a significant loss of precision within a certain degree of range. Thus, we perturb the original dataset by multiplying the data with a constant value —without changing labels—to test each model with different input ranges. For , the Taylor series approximation fails to locate the decision boundary, whereas the least square method fails at . The result in Figure 6 shows that both—real training MLPNN and t-BMPNet—at classify 32 points with the mean squared error equal to 0 at epoch number 35.
6. Discussion and Conclusion
In this study, we presented t-BMPNet, which can evaluate a multilayer perceptron neural network over a fully homomorphic encryption scheme on the basis of bitwise operations. We introduced our basic FHE operations with an emphasis on the key operations, the exponential function and division, for the construction of the sigmoid function. Our work is different from the other literature in several aspects: privacy for both the user and the server, backward and forward training in the encrypted domain, and minimal requirement of operations performed by users. Moreover, our work has achieved a highly accurate design of the nonlinear sigmoid function compared to other literature, which mostly uses polynomial approximation to construct the function; our approach therefore provides more accurate prediction results. Despite these notable results, our research has shown low time performance. This is because our work is at the foundational stage of the bitwise neural network, and it leaves much room for improvement in time performance.
First, the overall time performance can be enhanced by improvements in the bootstrapping of binary gates. In fact, since Gentry’s  proposal of bootstrapping of noisy ciphertexts in 2013, there has been a significant increase in the time performance of the bootstrapping procedures. The FHEW  library reduced the time for bootstrapping to less than half a second, and the TFHE  library improved the speed by a factor of 53, resulting in a single-core evaluation time of 13 milliseconds. Next, hardware acceleration techniques can be used to boost the time performance. For example, GPUs and FPGAs can increase the computational speed to improve latency and throughput. Moreover, to a lesser extent, algorithms for the basic operations can be optimized to enhance the performance. For instance, the full-adder binary addition can be improved by using 9 NAND gates instead, since, in our scheme, NAND gates have better performance than other binary gates. Last, as a further step, using the Chimera  scheme can significantly enhance the time performance. This is because the scheme acts as a bridge between FHE schemes such as TFHE, BFV , and HEAAN , making it possible to exploit only the advantages of each.
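As an illustration of the NAND-only optimization mentioned above, the following plaintext sketch wires a full adder from exactly nine NAND gates using the classic construction. The function names are illustrative; in an encrypted implementation each `nand` call would correspond to one bootstrapped gate evaluation.

```python
def nand(a, b):
    """Single NAND gate on bits 0/1."""
    return 1 - (a & b)


def full_adder_nand(a, b, c_in):
    """Full adder built from exactly nine NAND gates.

    Returns (sum_bit, carry_out).
    """
    n1 = nand(a, b)
    n2 = nand(a, n1)
    n3 = nand(b, n1)
    x = nand(n2, n3)        # a XOR b
    n4 = nand(x, c_in)
    n5 = nand(x, n4)
    n6 = nand(c_in, n4)
    s = nand(n5, n6)        # (a XOR b) XOR c_in
    c_out = nand(n1, n4)    # (a AND b) OR ((a XOR b) AND c_in)
    return s, c_out
```

The two intermediate values `n1` and `n4` are shared between the sum and carry paths, which is what brings the gate count down to nine.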
A. Binary Exponential Function Example
We provide an illustration of the exponential function being performed behind the scenes with an example of . In this example, our goal is to perform a binary exponential operation of , that is, . Note that we only perform in this example. For the general result of , we multiply by in the first place; we omit this procedure in Figure 7 and show only the remaining (important) parts of the procedure.
Initially, we encode the number into the plaintext vector of size (Figure 7(a)). We draw plaintext vectors as dotted rectangular boxes and their encryptions as solid-line boxes. The numbers written in red denote the fractional part of the fixed-point number in order to distinguish it from the integer part. Then, the plaintext vector is encrypted bit by bit under the same secret key to output followed by absolute value operation to yield . We present numbers in parentheses next to the boxes to indicate the decimal value of the encrypted number. For instance, in Figure 7(a), we write (1.5) for .
The core technique of the binary exponentiation algorithm is to search for the integer value of (denote it by which is the indicator of deciding the direction and number of shifts. However, since it is encrypted, in other words, we are unable to use to perform such actions, we exhaustively perform all possible cases of positive exponentiations (likewise negative). The key part is at Figure 7(b) where HE.Equi operation (or in Figure 7(b)) yields Enc (1) only when is equal to one of the values of Enc (0), Enc (1), , Enc ; each result is denoted by respectively. In this example, since , only is .
In Figures 7(c) and 7(d), ct. and ct. are the outcomes of positive and negative binary exponentiations of , respectively. For instance, ct. is HE.Add (ct., ct.), where ct. is the outcome of right shift Enc (1) by 1 and ct.v is the outcome of right shift by 1. In contrast, ct. is HE.Subt (ct., ct.), where ct. is the outcome of left shift Enc (1) by 1 and ct. is the outcome of left shift by 2.
Given ’s and their corresponding ’s (likewise, ’s), we perform bitwise AND operation (denote between the two ciphertexts. The result of the operation is non- only if is for some . In our example, and are the only non- values. Thus, after adding (i.e., HE.Add or denote in Figures 7(c) and 7(d)) ’s and ’s, we obtain and , respectively. (We refer to the results as and , respectively.)
So far, we have proceeded the positive and negative exponentiation steps in Figure 3. Now, we combine the positive exponentiation result and the negative exponentiation result considering the sign of . Since is negative (the sign bit is ), we expect our output to be . Thus, we perform bitwise AND operation between and which results in . On the contrary, yields the negative exponentiation result . By adding both of the results, we obtain for (Figure 7(e)).
As a final step, we retrieve the encrypted result and perform decryption followed by a decoding procedure. We decrypt using the same secret key and obtain plaintext . We decode the plaintext by performing and obtain 0.375 for our desired output of the binary exponential procedure.
B. Binary Division Function Example
As an example of binary division function in FHE scheme, we consider two real numbers and (Figure 8(a)). We first encode both numbers of length and obtain and , followed by their encryption and . We consider that the rest of the operations are being performed on the positive real ciphertexts; the absolute value operation for both and gives and , respectively.
Now, we consider , concatenation of two ciphertexts and (is initially set to ) generating -bits vector of encrypted bits. We iterate the following steps for times:
(1) We right shift once and HE.Compare (, ), where is the right element of and assign the result to (e.g., in Figure 8(b)), and is still after the shift operation from the first iteration. The comparison operation between and yields for .
(2) If (1) is less than or equal to (i.e., , we assign as it is. On the other hand, if (2) is greater than , we assign to . As in Figure 7(e), we use bitwise AND operation (or and to handle both conditions. We perform the following operations: and . We add (i.e., ) both results and assign it to . (e.g., in Figure 8(b)), and is less than . We need not . Thus, bitwise AND operation of , that is, , to elements of yields , whereas bitwise AND operation of (or ) to yields . Thus, we gain by adding both results.
After iterations, we obtain the result of -bits vector. Since we are only interested in the precision of -bit, we rescale by taking only the left bits from and call it . (Note that one can get better precision at most -bits of the result by performing more iterations). In our example, we obtain from .
Last, we consider the signs of and (Figure 8(d)). Using XOR operation between and , we obtain that is if and have the same sign, whereas if and have the opposite sign values. Taking advantage of this fact, we apply operation of to which yields only when and have the same sign value (i.e., ). In Figure 8(d), since and have the opposite sign value, is ; bitwise AND operation yields . Likewise, we can design a circuit that outputs only the negative result of , that is, HE.Twos by the following: . In our example, is ;