Security and Communication Networks

Volume 2018, Article ID 6728020, 15 pages

https://doi.org/10.1155/2018/6728020

## Robust Fully Distributed Minibatch Gradient Descent with Privacy Preservation

University of Szeged, and MTA-SZTE Research Group on AI, Szeged, Hungary

Correspondence should be addressed to Márk Jelasity; jelasity@inf.u-szeged.hu

Received 3 November 2017; Revised 3 March 2018; Accepted 4 April 2018; Published 14 May 2018

Academic Editor: Po-Ching Lin

Copyright © 2018 Gábor Danner et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Privacy and security are among the highest priorities in data mining approaches over data collected from mobile devices. Fully distributed machine learning is a promising direction in this context. However, it is a hard problem to design protocols that are efficient yet provide sufficient levels of privacy and security. In fully distributed environments, secure multiparty computation (MPC) is often applied to solve these problems. However, in our dynamic and unreliable application domain, known MPC algorithms are not scalable or not robust enough. We propose a light-weight protocol to quickly and securely compute the sum query over a subset of participants assuming a semihonest adversary. During the computation the participants learn no individual values. We apply this protocol to efficiently calculate the sum of gradients as part of a fully distributed minibatch stochastic gradient descent algorithm. The protocol achieves scalability and robustness by exploiting the fact that in this application domain a “quick and dirty” sum computation is acceptable. We utilize the Paillier homomorphic cryptosystem as part of our solution combined with extreme lossy gradient compression to make the cost of the cryptographic algorithms affordable. We demonstrate both theoretically and experimentally, based on churn statistics from a real smartphone trace, that the protocol is indeed practically viable.

#### 1. Introduction

Data mining over personal data harvested from mobile devices is a very sensitive problem due to the strong requirements of privacy preservation and security. Recently, the *federated learning* approach was proposed to solve this problem by not collecting the data in the first place but instead processing the data in place and creating the final models in the cloud based on the models created locally [1, 2].

We go one step further and propose a solution that does not utilize centralized resources at all. The main motivation for a fully distributed solution in our cloud-based era is to preserve privacy by avoiding the central collection of any personal data, even in preprocessed form. Another advantage of distributed processing is that this way we can make full use of all the local personal data, which is impossible in cloud-based or private centralized data silos that store only specific subsets of the data. The key issue here of course is to offer decentralized algorithms that are competitive with approaches like federated learning in terms of time and communication complexity and that provide increased levels of privacy and security.

Previously, we proposed numerous distributed machine learning algorithms in a framework called gossip learning. In this framework, models perform random walks over the network and are trained using stochastic gradient descent [3] (see Section 4). This involves an update step in which nodes use their local data to improve each model they receive and then forward the updated model along the next step of the random walk. Assuming the random walk is secure, which is a research problem in its own right (see, e.g., [4]), it is hard for an adversary to obtain the two versions of the model right before and right after the local update step at any given node. This provides reasonable protection against uncovering private data.

However, this method is susceptible to collusion. If the nodes before and after an update in the random walk collude, they can recover private data. In this paper we address this problem and improve gossip learning so that it can tolerate a much higher proportion of honest but curious (or semihonest) adversaries. The key idea behind the approach is that in each step of the random walk we form a group of peers that securely computes the sum of their gradients, and the model update step is performed using this aggregated gradient. In machine learning this is called minibatch learning, which, apart from increasing the resistance to collusion, is known to often speed up the learning algorithm as well (see, e.g., [5]).

It might seem attractive to run a secure multiparty computation (MPC) algorithm within the minibatch to compute the sum of the gradients. The goal of MPC is to compute a function of the private inputs of the parties in such a way that, at the end of the computation, no party knows anything except what can be determined from the result and its own input [6]. Secure sum computation is an important application of secure MPC [7].

However, we do not only require our algorithm to be secure but also fast, light-weight, and robust, since the participating nodes may go offline at any time [8] and they might have limited resources. One key observation is that for the minibatch algorithm we do not need a precise sum; in fact, the sum over any group that is large enough to protect privacy will do. At the same time, it is unlikely that all the nodes will stay online until the end of the computation. We propose a protocol that—using a binomial tree topology and Paillier homomorphic encryption—can produce a “quick and dirty” partial sum even in the event of failures, has adjustable capability of resisting collusion, and can be completed in logarithmic time.

We also put a great emphasis on demonstrating that the proposed protocol is practically viable. This is a nontrivial question because homomorphic cryptosystems can quickly become very expensive when applied with sufficiently large key sizes (such as 2048-bit keys), especially considering that in machine learning the gradients can be rather large. To achieve practical viability, we propose an extreme lossy compression, where we discretize floating-point gradient values to as few as two bits. We demonstrate experimentally that this does not affect learning accuracy yet allows for an affordable cryptography cost. Our simulations are based on a real smartphone trace we collected [8].

#### 2. Related Work

There are many approaches that have goals similar to ours, that is, to perform computations over a large and highly distributed database or network in a secure and privacy preserving way. Our work touches upon several fields of research including machine learning, distributed systems and algorithms, secure multiparty computation, and privacy. Our contribution lies in the intersection of these areas. Here we focus only on related work that is directly relevant to our present contributions.

Algorithms exist for completely generic secure computations; Saia and Zamani give a comprehensive overview with a focus on scalability [9]. However, due to their focus on generic computations, these approaches are relatively complex, and in the context of our application they still do not scale well enough, nor do they tolerate dynamic membership.

Approaches targeted at specific problems are more promising. Clifton et al. propose, among other things, an algorithm to compute a sum [7]. This algorithm requires time linear in the network size and does not tolerate node failures. Bickson et al. focus on a class of computations over graphs, where the computation is performed in an iterative manner through a series of local updates [10]. They introduce a secure algorithm to compute local sums over neighboring nodes based on secret sharing. Unfortunately, this model of computation does not cover our problem, as we want to compute minibatches of a size independent of the size of the direct neighborhood, and the proposed approach does not scale well in that sense. Besides, the robustness of the method is not satisfactory either [11]. Han et al. address stochastic gradient search explicitly [12]. However, they assume that the parties involved have large portions of the database, so their solution is not applicable in our scenario.

Bonawitz et al. [13] address a similar problem setting where the goal is to compute a secure sum in an efficient and robust manner. They also assume a semihonest adversarial model (with a limited set of potentially malicious behaviors by a server). However, their solution requires a server and an all-to-all broadcast primitive even in the most efficient version of their protocol. Our solution requires a linear number of messages only.

The algorithm of Ahmad and Khokhar is similar to ours [14], as they also use a tree to aggregate values using homomorphic encryption. However, in their solution all the nodes have the same public key and the private key is distributed over a subset of elite nodes using secret sharing. The problem with this approach in our minibatch gradient descent application is that for each minibatch a new key set has to be generated for the group, which requires frequent access to a trusted server; otherwise the method is highly vulnerable in the key generation phase. In our solution, all the nodes have their own public/private key pair and no keys have to be shared at any point in time. Besides, these key pairs may remain the same in every minibatch the given node participates in without compromising our security guarantees.

We need to mention the area of differential privacy [15], which is concerned with the problem that the (perhaps securely computed) output itself might contain information about individual records. The approach is that a carefully designed noise term is added to the output. Gradient search has been addressed in this framework (e.g., [16]). In our distributed setup, this noise term can be computed in a distributed and secure way [17].

We also strongly build on our previous work [18]. There, we proposed an algorithm very similar to the one presented here. In this study we offer several optimizations of the algorithm and we propose the binomial topology for building the minibatch overlay tree. We also explore the issue of gradient compression necessary for keeping the cost of cryptography under control and we perform a thorough experimental study of the algorithm based on a smartphone churn trace.

#### 3. Model

*Communication*. We model our system as a very large set of nodes that communicate via message passing. At every point in time each node has a set of neighbors forming a connected network. The neighbor set can change over time, but nodes can send messages only to their current neighbors. Nodes can leave the network or fail at any time. We model leaving the network as a node failure. Messages can be delayed up to a maximum delay. Messages cannot be dropped, so communication fails only if the target node fails before receiving the message.

The set of neighbors is either hard-wired, or given by other physical constraints (e.g., proximity), or set by an overlay service. Such overlay services are widely available in the literature and are out of the scope of our present discussion. It is not strictly required that the set of neighbors be random; however, we will assume this for the sake of simplicity. If the set is not random, then implementing a random walk with a uniform stationary distribution requires additional well-proven techniques such as Metropolis-Hastings sampling or structured routing [19].

*Data Distribution*. We assume a horizontal distribution, which means that each node has full data records. We are most interested in the extreme case when each node has only a single record. The database that we wish to perform data mining over is given by the union of the records stored by the nodes.

We assume that the adversaries are honest but curious (or semihonest). That is, nodes corrupted by an adversary will follow the protocol but the adversary can see the internal state of the node. The goal of the adversary is to learn about the private data of other nodes (note that the adversary can obviously see the private data on the node it observes directly). Wiretapping is allowed, since all the sensitive messages in our protocol are encrypted.

We also assume that adversaries are not able to manipulate the set of neighbors. In each application domain this assumption translates to different requirements. For example, if an overlay service is used to maintain the neighbors then this service has to be secure itself.

#### 4. Background on Gossip Learning

Although not strictly required for understanding our key contribution, it is important to briefly overview the basic concepts of stochastic gradient descent search and our gossip learning framework (GOLF) [3].

The basic problem of *supervised binary classification* can be defined as follows. Let us assume that we are given a labeled database in the form of pairs of feature vectors and their correct classification, that is, (x₁, y₁), …, (xₙ, yₙ), where xᵢ ∈ ℝᵈ and yᵢ ∈ {−1, 1}. The constant d is the *dimension* of the problem (the number of features). We are looking for a *model* f: ℝᵈ → {−1, 1} parameterized by a vector w that correctly classifies the available feature vectors and that can also *generalize* well, that is, one that can classify unseen examples too.

Supervised learning can be thought of as an optimization problem, where we want to minimize the empirical risk

J(w) = (1/n) ∑ᵢ ℓ(f(xᵢ; w), yᵢ),

where the function ℓ is a loss function capturing the prediction error on example i.

Training algorithms that iterate over the available training data, or process a continuous stream of data records, and evolve a model by updating it for each individual data record according to some update rule are called *online learning algorithms*. Gossip learning relies on this type of learning algorithm. Ma et al. provide a nice summary of online learning for large scale data [21].

*Stochastic gradient search* [22, 23] is a generic algorithmic family for implementing online learning methods. The basic idea is that we iterate over the training examples in a random order repeatedly, and for each training example we calculate the gradient of the error function (which describes the classification error) and modify the model along this gradient to reduce the error on this particular example, according to the rule

wₜ₊₁ = wₜ − ηₜ ∇ℓ(f(xᵢ; wₜ), yᵢ),

where ηₜ is the learning rate at step t, which often decreases as t increases.
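The update rule above can be sketched in a few lines of Python. Logistic loss is our illustrative choice here (the rule in the text applies to any differentiable loss), and the function names are ours, not the paper's:

```python
import math

def log_loss(w, x, y):
    """Loss log(1 + exp(-y * w.x)) of a linear model on example (x, y)."""
    return math.log(1.0 + math.exp(-y * sum(wi * xi for wi, xi in zip(w, x))))

def sgd_step(w, x, y, eta):
    """One stochastic gradient step on a single example (x, y)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # gradient of log(1 + exp(-y * w.x)) with respect to w is coef * x
    coef = -y / (1.0 + math.exp(margin))
    return [wi - eta * coef * xi for wi, xi in zip(w, x)]
```

A single step moves the model against the gradient, so the loss on the example just visited decreases for a small enough learning rate.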

A popular way to accelerate convergence is the use of minibatches, that is, updating the model with the gradient of the sum of the loss functions of a few training examples (instead of only one) in each iteration. This allows for fast distributed implementations as well [24].
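A minibatch update can be sketched as follows, where `grad` is any per-example gradient function (a hypothetical parameter of this sketch, not an identifier from the paper):

```python
def minibatch_step(w, batch, eta, grad):
    """Update the model w with the summed gradient over a minibatch,
    instead of the gradient of a single example."""
    total = [0.0] * len(w)
    for x, y in batch:                      # accumulate per-example gradients
        g = grad(w, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    return [wi - eta * ti for wi, ti in zip(w, total)]
```

In the distributed protocol of this paper, the loop body is what each group member computes locally; only the sum `total` needs to be aggregated.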

In gossip learning, models perform random walks on the network and are trained on the local data using stochastic gradient descent. Besides, several models can perform random walks at the same time, and these models can be combined from time to time to accelerate convergence. Our approach here will be based on this scheme, replacing the local update step with a minibatch approach.

#### 5. Our Solution

As explained previously, at each step, when a node receives a model to update, it coordinates the distributed computation of a minibatch gradient and then uses this gradient to update the model. Based on the assumptions in Section 3 and building on the GOLF framework outlined in Section 4 we now present our algorithm for computing a minibatch gradient.

##### 5.1. Minibatch Tree Topology

The very first step for computing a minibatch gradient is to create a temporary group of random nodes that form the minibatch. In our decentralized environment we do this by building a rooted overlay tree. The basic version of our algorithm will require the overlay tree not only to be rooted at the node computing the gradient but also to be *trunked*.

*Definition 1* (trunked tree). Any rooted tree is *1-trunked*. For k > 1, a rooted tree is *k-trunked* if the root has exactly one child node, and the corresponding subtree is a (k − 1)-trunked tree.

Let b denote the intended size of the minibatch group. We assume that b is significantly smaller than the network size. Let f be a parameter that determines the desired security level (1 ≤ f ≤ b). We can now state that we require an f-*trunked tree* rooted at the node that is being visited by gossip learning. As we will see later, this is to prevent a malicious root from collecting too much information.
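The recursive structure of Definition 1 translates directly into a membership test. A minimal sketch (the tree is assumed to be given as a child-list dictionary, a representation of our choosing):

```python
def is_trunked(children, root, k):
    """Check Definition 1: any rooted tree is 1-trunked; for k > 1 the
    root must have exactly one child whose subtree is (k-1)-trunked."""
    if k == 1:
        return True
    kids = children[root]
    return len(kids) == 1 and is_trunked(children, kids[0], k - 1)
```

Equivalently, a k-trunked tree starts with a chain of k − 1 edges below the root before any branching may occur.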

Apart from the trunk, the tree can be arbitrary; however, we propose a *binomial tree* as a preferable choice. If every node already in the tree spawns a new child node in periodic rounds (starting from a single root node), then the result is a binomial tree. It is not possible to construct a tree of a given size faster, since in a binomial tree every node keeps working continuously, so the efficiency is maximal. Of course, we assumed here that child nodes can be added only sequentially at a given node. However, if we also assume that all the nodes have the same upload and download bandwidth cap, then adding children in parallel is proportionally slower, so parallelism provides no advantage as long as we utilize the maximal available bandwidth. The identical-bandwidth requirement is naturally satisfied in our application domain because we assume that the protocol is allowed to use only a fixed, relatively small amount of bandwidth (such as 1 Mbps) and low bandwidth connections are excluded from the set of possible overlay connections.
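The round-based growth just described can be simulated in a few lines; this is an illustrative sketch, not the paper's construction protocol. The tree size doubles each round, so a tree of 2ᵈ nodes is built in d rounds:

```python
def grow_binomial(rounds):
    """Starting from a single root (id 0), every node already in the
    tree spawns exactly one new child per round."""
    children = {0: []}            # node id -> list of child ids
    next_id = 1
    for _ in range(rounds):
        for node in list(children):   # snapshot: only pre-round nodes spawn
            children[node].append(next_id)
            children[next_id] = []
            next_id += 1
    return children
```

After d rounds the root has d children (one per round), matching the recursive structure of the binomial tree B_d.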

Another advantage of binomial trees is that we can use the links in reverse order of construction for uploading and aggregating data along the tree. This way, we get a data aggregation schedule that is similarly efficient and also collision-free in the sense that each node communicates with at most one node at a given time.

The tree overlay network we have described so far can be constructed over a random overlay network by first building the trunk (which takes a random walk of f − 1 steps) and then recursively constructing a binomial tree of depth d, resulting in an f-trunked tree of size f − 1 + 2^d and total depth f − 1 + d. Every child node is chosen randomly from those neighbors of the node that are both online and not yet in the tree. No attention needs to be paid to reliability: we generate the tree quickly and use it only once, immediately after construction. Normally, some subtrees will be lost in the process because of churn, but our algorithm is designed to tolerate this. The effect of certain parameters, such as the binomial tree depth and node failures, will be discussed later in the evaluation.

##### 5.2. Calculating the Gradient

The sum we want to calculate is over vectors of real numbers. Without loss of generality, we discuss the one-dimensional case from now on for simplicity. Homomorphic encryption works over integers, or, to be precise, over the set ℤ_m of residue classes modulo some large m. For this reason we need to discretize the real interval that includes all possible sums we might calculate, and we need to map the resulting discrete intervals to residue classes in ℤ_m, where the number of intervals defines the granularity of the resolution of the discretization. This mapping is natural, so we do not go into details here. Since the gradient of the loss function is bounded for most learning algorithms, this is not a practical limitation. Also, in Section 7 we evaluate the effect of discretization on learning performance and show that even an extreme compression (discretizing the gradient down to two bits) is tolerable due to the high robustness of the minibatch gradient method itself.
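The two-bit discretization mentioned above can be sketched as follows (the function names and the symmetric clipping bound are our illustrative assumptions, not the paper's exact mapping):

```python
def quantize(g, bound, bits=2):
    """Clip a gradient value g to [-bound, bound] and map it to one of
    2**bits levels, represented as a small nonnegative integer."""
    levels = 2 ** bits
    g = max(-bound, min(bound, g))
    return round((g + bound) / (2 * bound) * (levels - 1))

def dequantize(q, bound, bits=2):
    """Map a level (or a scaled sum of levels) back to the real interval."""
    levels = 2 ** bits
    return q / (levels - 1) * 2 * bound - bound
```

Since each quantized value lies in [0, 2^bits − 1], the sum over a minibatch of b values lies in [0, b(2^bits − 1)]; choosing the modulus m larger than this bound ensures the sum never wraps around in ℤ_m.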

In a nutshell, the basic idea of the algorithm is to divide the local value at each node into shares, encrypt these with asymmetric additively homomorphic encryption (such as the Paillier cryptosystem), and send them to the root via the chain of ancestors. Although the shares travel together, they are encrypted with the public keys of different ancestors. Along the route, the arrays of shares are aggregated and periodically reencrypted. Finally, the root calculates the sum.
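The aggregation relies on the defining property of the Paillier cryptosystem: multiplying two ciphertexts modulo n² yields an encryption of the sum of the plaintexts. A toy sketch follows, with tiny, thoroughly insecure primes chosen only to make the homomorphism visible (a deployment would use keys of 2048 bits or more, as discussed in Section 1; the function names are ours):

```python
import math
import random

def keygen(p, q):
    """Textbook Paillier key generation with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^-1 mod n, where L(u) = (u - 1) // n
    x = pow(g, lam, n * n)
    mu = pow((x - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """c = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pub
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, priv, c):
    """m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n
```

Multiplying the ciphertexts of two shares therefore aggregates them without any intermediate node ever seeing a plaintext, which is exactly what the ancestors do with the share arrays they relay.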

The algorithm consists of three procedures, shown in Algorithm 1. These are run locally on the individual nodes. Procedure INIT is called once, after the node becomes part of the tree. Here, the function call ANCESTOR(i) returns the descriptor of the ith ancestor on the path towards the root. The descriptor contains the necessary public keys as well. During tree building this information can be given to each node, so the nodes can look up the keys of their ancestors locally. For the purposes of the ANCESTOR function, the parent of the root is defined to be the root itself. Function ENCRYPT(x, v) encrypts the integer x with the public key of node v using an asymmetric additively homomorphic cryptosystem.