Abstract

Secure Function Evaluation (SFE) has received recent attention due to the massive collection and mining of personal data, but remains impractical due to its large computational cost. Garbled Circuits (GC) is a protocol for implementing SFE which can evaluate any function that can be expressed as a Boolean circuit and obtain the result while keeping each party's input private. Recent advances have led to a surge of garbled circuit implementations in software for a variety of different tasks. However, these implementations are inefficient, and therefore GC is not widely used, especially for large problems. This research investigates, implements, and evaluates secure computation generation using a heterogeneous computing platform featuring FPGAs. We have designed and implemented SIFO: secure computational infrastructure using FPGA overlays. Unlike traditional FPGA design, a coarse-grained overlay architecture is adopted which supports mapping SFE problems that are too large to fit on a single FPGA. Host tools provided include an SFE problem generator, a parser, and automatic host code generation. Our design allows repurposing an FPGA to evaluate different SFE tasks without the need for reprogramming and fully explores the parallelism for any GC problem. Our system demonstrates an order of magnitude speedup compared with an existing software platform.

1. Introduction

The statistical analysis of data collected from human subjects has a long history in empirical sciences such as medicine, sociology, and economics. It has recently also become a ubiquitous practice among Internet companies, now occurring at a massive and unprecedented scale. Companies such as Google, Netflix, and Amazon routinely monitor and mine a broad array of behavioral signals collected from their users and monetize them through targeted advertising or personalized product recommendations. Behavioral data collection is therefore of considerable business value to online companies [1]; moreover, there are often benefits to society at large, as in aiding the detection of epidemics [2] or terrorist threats [3], in assessing news or product penetration [4], and in political online polling [5]. At the same time, these practices have also given rise to privacy concerns and threats, documented extensively by researchers [6–12] as well as the popular press [1, 13].

1.1. Secure Function Evaluation

This state of affairs gives rise to the following challenge: given the benefits of mining behavioral data to both online companies and to society at large, is it possible to enable data mining practices without jeopardizing user privacy? A series of recent research efforts [14–19] have attempted to address this issue through cryptographic means and, in particular, through secure function evaluation (SFE). SFE allows an interested party to evaluate any desirable polynomial-time function over private data, while revealing only the answer and nothing else about the data. This offers a strong privacy guarantee: an entity executing a secure data mining algorithm over user data learns only the final outcome of the computation, while the data are never revealed to the entity. SFE can thus enable, e.g., a data analyst, a medical professional, or a statistician, to conduct a study of sensitive data without jeopardizing the privacy of the participants (online users and patients).

Any algorithm to be executed over amounts of data at the scale encountered in the above settings needs to be highly efficient and scalable. SFE over private data therefore poses a significant challenge, as it comes at a considerable additional computational cost compared with execution in the clear. Prior work has made positive steps in this direction, showing that a variety of important data mining algorithms [14–16] can be computed using Yao's Garbled Circuits (GCs) [20, 21] in a parallel fashion. The function to be evaluated is converted to a binary circuit which is "garbled" in such a way that an evaluator of the circuit learns only the values of its output gates. Execution of this circuit is subsequently parallelized, e.g., over threads [15] or across a cluster of machines [16]. Nevertheless, this approach to parallelization leaves much to be desired: for example, in [16], even under parallelization over 128 cores, executing a typical data mining algorithm like Matrix Factorization (MF) through SFE is orders of magnitude slower than (parallel) execution in the clear. In practice, this means that applying MF to a dataset of 1M entries requires roughly 11 days under SFE, a time largely prohibitive for practical purposes.

1.2. FPGA Overlays

There has been a surge of interest in FPGAs in the datacenter, as evidenced by a large number of systems that have recently become available. Amazon is offering FPGA instances through Amazon Web services [22], Microsoft has the Catapult system [23], and IBM offers cloud FPGA [24]. In this paper, we advocate leveraging hardware acceleration to tackle the scalability and efficiency challenges inherent in SFE. FPGAs are an excellent hardware platform for the implementation of SFE primitives and, in particular, garbled circuits. This is precisely because FPGAs are tailored to executing many low-level operations in parallel. The types of operations encountered in garbled circuits (namely, garbling and ungarbling gates) fit this pattern precisely: they involve, e.g., a series of symmetric key encryptions, XORs, and other well-defined primitive operations (see Section 2). Thus, an FPGA implementation of SFE benefits from both high-speed evaluation and hardware-level parallelization.

The amount of computation required to evaluate a garbled circuit for an application at the usual data mining scale cannot fit in a single FPGA. Thus, evaluating a function securely entails partitioning computations into subtasks to be programmed and evaluated over a single FPGA. A practical implementation therefore needs to allow repurposing an FPGA to quickly compute different SFEs or different subtasks of a larger SFE. For this reason, tailored approaches that are tied to the execution of a specific SFE structure and require full reprogramming of an FPGA with each new execution cannot be applied efficiently to the types of SFE problems we wish to address. To address these challenges, we propose a generic, reconfigurable implementation of SFE as a coarse-grained FPGA overlay architecture. As FPGAs have become more dense and capable of holding a large number of gate equivalents, there has been an increased interest in FPGA overlay architectures [25–31]. An FPGA overlay consists of two parts: (1) a circuit design implemented on the FPGA fabric using the usual design flow and (2) a user circuit mapped onto that overlay circuit. Garbled circuits are excellent candidates for an FPGA overlay design. Precisely because components of a garbled circuit follow a generic structure, an overlay approach that does not reprogram FPGAs from scratch but simply reroutes connections between elementary components (in our case, garbled AND and XOR gates) leads to important efficiency improvements.

1.3. Contributions

This paper introduces SIFO: secure computational infrastructure using FPGA overlays. We make the following contributions:

(i) We provide a complete workflow to map any garbled circuit problem to garbled circuit overlay cells on an FPGA, including software (SFE problem generator, parser, and scheduler) and an FPGA overlay circuit to accelerate the GC problem.

(ii) Our workflow and tools enable accelerating any garbled circuit operation without requiring knowledge of the underlying implementation. We integrate our implementation with FlexSC [32], which uses ObliVM [33] as the backend for any garbled circuit operation. In conjunction with our tools, each problem is analyzed, and layers of operations that can be executed in parallel are extracted. The resulting circuit is then mapped to our FPGA overlay architecture for processing.

(iii) Our FPGA overlay architecture handles different parts of the same GC problem (if a problem is too large to fit in a single FPGA) as well as different GC problems without reprogramming. The FPGA is programmed once for all garbled circuit problems. Wiring and instantiation are determined at execution time by the controller and the host. This overlay architecture is scalable and enables users to avoid the long design and compile times on FPGAs for new problems. The overhead for a new problem is very low, simply requiring the transfer of initial data and circuit information from host to device.

(iv) We demonstrate the benefits of our approach by mapping a large number of circuit examples onto a heterogeneous computing platform featuring a Stratix V FPGA. We tackle different aspects of performance bottlenecks and alleviate them. This includes (a) investigating different numbers of FPGA overlay cells, (b) optimizing the host to FPGA communication via PCIe, and (c) managing onchip block memory to minimize accesses to the offchip DDR memory. We compare the performance of these improvements for various problems and show significant speedup against the naive design and against the software implementation, ranging from 6.21 to 45.78 times faster than the latter. Many of the optimizations presented can be applied to other FPGA projects as well.

This journal paper represents an extension to our previously published research [34], which presented an implementation where the entire GC problem fits on a single FPGA, with all intermediate results fitting into onchip memory (block RAM). In the research presented here, we relax that constraint to significantly increase the size of problems supported. This introduces new challenges since with GC, data are not accessed in order. Our new implementation treats block RAM as a user-managed cache and investigates how best to access data so that data fetching does not become a bottleneck.

The remainder of this paper is structured as follows. Section 2 covers background information on garbled circuits as well as related work. Section 3 presents the design methodology, demonstrating how we tackle the garbled circuit problem in a heterogeneous reconfigurable system and how we alleviate bottlenecks in the system to improve overall performance. Experiments and corresponding results are presented in Section 4. Finally, we present our conclusions and future work. Materials in this article are excerpted from the first author's PhD dissertation [35].

2. Background

In this section, we introduce the relevant background on garbled circuits, including terminology and techniques. Related work on garbled circuit implementations is also discussed.

2.1. Garbled Circuits

Our research accelerates Secure Function Evaluation (SFE), specifically Garbled Circuits (GC), using FPGAs. In this model, there are two or more users with data which they wish to keep private and a function to be evaluated over those data. All parties know the function being evaluated and learn the outcome of the evaluation, but users do not reveal their data. A canonical problem exemplifying SFE is the “Millionaires’ Problem:” two millionaires wish to know who is worth more without revealing their personal worth to each other.

Garbled circuits were initially introduced by Yao [21] for two users and have been extended to multiple users. They rely on cryptographic primitives. In the variant we study here (adapted from [15, 36]), Yao's protocol runs between (a) a set of private input owners, (b) an Evaluator, who wishes to evaluate a function over the private inputs, and (c) a third party called the Garbler, who facilitates and enables the secure computation.

Garbled circuits work for any problem that can be expressed as a Boolean circuit. In our and many other implementations, this function is represented as a circuit made up of AND and XOR gates (recall that AND and XOR gates form a complete basis for Boolean circuits). The Evaluator wishes to evaluate a function f, represented as a Boolean circuit of AND and XOR gates, over private user inputs x_1, …, x_n. We break the problem into three phases, as shown in Figure 1. In Phase I, the Garbler "garbles" each gate of the circuit, outputting (a) a "garbled circuit," namely, the garbled representation of every gate in the circuit representing f and (b) a set of keys, each corresponding to a possible bit value in the strings representing the inputs x_1, …, x_n. These values are shared with the Evaluator. In Phase II, through proxy oblivious transfer [37], the Evaluator learns the keys corresponding to the true user inputs. In the final phase, the Evaluator uses the keys as input to the garbled circuit to evaluate the circuit, ungarbling the gates. At the conclusion of Phase III, the Evaluator learns f(x_1, …, x_n). To ensure privacy of users' data and to protect against side-channel attacks, both garbling and evaluation are run whenever user data change. Hence, garbling is performed as often as evaluation.

2.1.1. Garbling Phase

A function to be evaluated is represented as a Boolean circuit consisting of AND and XOR gates. In the garbling phase, each of these gates is garbled as described in this section. Each gate is associated with three wires: two input wires and one output wire. At the beginning of the garbling phase, the Garbler associates two random strings, k_i^0 and k_i^1, with each wire i in the circuit. Intuitively, each k_i^b is an encoding of the bit value b that wire i can take.

Here, we describe how to garble an AND gate. The same principles can be applied to garble an XOR gate, using its respective truth table. We note, however, that, in practice, XOR gates are handled via the free-XOR optimization [38], discussed in Section 2.1.3. A garbled AND gate is shown in Figure 2. For each AND gate g, where g is the gate number, with input wires i and j and output wire o, the garbler computes the following four ciphertexts, one for each pair of values b_i, b_j ∈ {0, 1}:

Enc_{k_i^{b_i}, k_j^{b_j}}(k_o^{b_i ∧ b_j}) = SHA(k_i^{b_i} ‖ k_j^{b_j} ‖ g) ⊕ k_o^{b_i ∧ b_j}. (1)

Here, SHA represents the hash function, ‖ indicates concatenation, g is an identifier for the gate, and ⊕ is the XOR operation. Note that each value k on a wire is implemented with 80 bits in our implementation. The "garbled" gate is then represented by a random permutation of these four ciphertexts. Observe that given the pair of keys k_i^{b_i}, k_j^{b_j}, it is possible to successfully recover the key k_o^{b_i ∧ b_j} by decrypting the corresponding ciphertext c using the following equation (note that the above encryption scheme is symmetric, as Enc and Dec are the same function):

Dec_{k_i^{b_i}, k_j^{b_j}}(c) = SHA(k_i^{b_i} ‖ k_j^{b_j} ‖ g) ⊕ c = k_o^{b_i ∧ b_j}. (2)

In contrast, the other output wire key, namely, k_o^{1 − (b_i ∧ b_j)}, cannot be recovered. More generally, it is worth noting that knowledge of (a) the four ciphertexts and (b) the keys k_i^{b_i} and k_j^{b_j} for some inputs b_i and b_j yields only the value of the key k_o^{b_i ∧ b_j}; no other input or output keys of gate g can be recovered. Any Boolean function can be garbled in this manner, by first representing it in ANDs and XORs and garbling each such gate.
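To make the garbling step concrete, the following minimal Python sketch garbles one AND gate per equation (1), assuming 80 bit keys and SHA-1 output truncated to the key length; the helper names (new_key, xor, pad, garble_and) are ours for illustration, not the paper's implementation.

```python
# Minimal sketch of garbling an AND gate per equation (1); illustrative only.
import hashlib
import os
import random

KEY_BYTES = 80 // 8  # each wire key is 80 bits in this design

def new_key() -> bytes:
    return os.urandom(KEY_BYTES)

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def pad(ki: bytes, kj: bytes, gate_id: int) -> bytes:
    # SHA(k_i || k_j || g), truncated to the 80 bit key length
    return hashlib.sha1(ki + kj + gate_id.to_bytes(4, "big")).digest()[:KEY_BYTES]

def garble_and(gate_id: int, ki: dict, kj: dict, ko: dict) -> list:
    # ki, kj, ko map a bit value b to the corresponding wire key k^b
    table = [xor(pad(ki[bi], kj[bj], gate_id), ko[bi & bj])
             for bi in (0, 1) for bj in (0, 1)]
    random.shuffle(table)  # the garbled gate is a random permutation
    return table
```

Because Enc and Dec are the same XOR-with-pad operation, holding one key per input wire lets the Evaluator strip the pad from exactly one ciphertext, which is the content of equation (2).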

2.1.2. Evaluation Phase

The output of the garbling process is (a) the garbled gates, each comprising a random permutation of the four ciphertexts representing each gate and (b) the keys for every wire in the circuit. At the conclusion of the first phase, the Garbler sends this information for all garbled gates to the Evaluator. It also provides the correspondence between the garbled values and the real bit values for the circuit-output wires (the outcome of the computation): if o is a circuit-output wire, the pairs (k_o^0, 0) and (k_o^1, 1) are given to the Evaluator. To transfer the garbled values of the input wires, the Garbler engages in a proxy oblivious transfer with the Evaluator and the users, so that the Evaluator obliviously obtains the garbled circuit input key k_i^b corresponding to the actual bit b of input wire i.

Having the garbled inputs, the Evaluator can “evaluate” each gate, by decrypting each ciphertext of a gate in the first layer of the circuit by applying equation (2): only one of these decryptions will succeed (this can be detected, e.g., by appending a prefix of zeros to each key and checking if this prefix is present upon decryption), revealing the key corresponding to the output of this gate. Each output key revealed can subsequently be used to evaluate any gate that uses it as an input. Using the table mapping these keys to bits, the Evaluator can learn the final output.
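A matching sketch of the evaluation step, reusing the helpers from the garbling sketch above; detecting which decryption succeeded (e.g., via a zero prefix appended to keys) is elided for brevity.

```python
def eval_and(gate_id: int, ki_b: bytes, kj_b: bytes, table: list) -> list:
    # Apply equation (2) to every ciphertext; exactly one candidate is the
    # true output key, detectable, e.g., by a zero prefix appended to keys.
    p = pad(ki_b, kj_b, gate_id)
    return [xor(p, c) for c in table]
```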

2.1.3. Optimization

Several improvements over the original Yao's protocol have been proposed that lead to both computational and communication cost reductions. These include the point-and-permute [39], row-reduction [40], and free-XOR [38] optimizations, all of which we implement in our design. The free-XOR optimization, in particular, significantly reduces the computational cost of garbled XOR gates; XOR gates do not need to be encrypted and decrypted, as the XOR output wire key is computed through an XOR of the corresponding input keys. In addition, the free-XOR optimization fully eliminates communication between the Garbler and the Evaluator for XOR gates; no ciphertexts need to be communicated for these gates. Our implementation takes advantage of all of these optimizations; as a result, the circuit for computing garbled AND gates differs slightly from the garbling algorithm outlined above.
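The free-XOR idea itself fits in a few lines. Continuing the sketch above (and again illustrative rather than the paper's code), every wire's two keys differ by one global random offset R, so an XOR gate's output keys follow from XORing input keys, with no ciphertexts at all:

```python
R = new_key()  # global offset, chosen once by the Garbler and kept secret

def wire_keys() -> dict:
    k0 = new_key()
    return {0: k0, 1: xor(k0, R)}  # invariant: k^1 = k^0 XOR R on every wire

def garble_xor(ki: dict, kj: dict) -> dict:
    k0 = xor(ki[0], kj[0])  # output zero-key; no encryption, no garble table
    return {0: k0, 1: xor(k0, R)}
```

The Evaluator simply XORs whichever two input keys it holds to obtain the correct output key.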

2.2. Related Work

Acceleration of garbled circuits is a hot research area in the SFE field. Researchers use different parallel models and hardware platforms to speed up execution. These platforms include FPGAs, CPUs, and GPUs.

2.2.1. FPGA and ASIC Designs

TinyGarble [41] uses techniques from hardware design to implement GCs as sequential circuits and then optimizes these designs. The circuits can be optimized to reduce the non-XOR operations using traditional high-level synthesis tools and simulation.

The offline circuit synthesis provides a ready-to-use circuit description for any garbled circuit problem. The resulting designs are customized for each problem; thus, for each new problem, a new circuit must be generated. In addition, their results describe simulations, but no actual hardware implementation. In [42, 43], the authors describe the first FPGA implementations of GC. In both of these implementations, there is limited parallelism to allow garbling to happen in a small footprint. In [42], two FPGA-based prototypes are described: a system-on-chip with access to a single hardware cryptographic accelerator core and a stand-alone hardware implementation targeting ASICs. In [43], the authors use a nonstandard garbling technique in order to reduce communication. Our approach uses standard GC techniques as implemented in popular software implementations. In addition, our architecture aims to reduce the computational cost of garbling by using much more parallelism than these early FPGA implementations. First, we implement four SHA cores in hardware for each garbled AND gate. In addition, we implement as many garbled AND gates as we can keep busy at the same time and implement garbled circuits directly on top of an efficient overlay, which eliminates the need to recompile the hardware for every new user problem. With MAXelerator [44], the authors implement a very efficient garbling of matrix multiplication in FPGAs. While their design is more efficient for matrix multiplication, ours is more general purpose and supports any problem that a user may wish to garble.

2.2.2. CPU

One approach to accelerating GC on CPUs is to provide instructions that support encryption to speed up the base operations. JustGarble [45] shows that using AES-NI (Advanced Encryption Standard New Instructions), circuits can be garbled and evaluated faster than using traditional instructions. Intel AES-NI is an encryption instruction set that improves AES operations in the Intel Xeon processor family. Others have proposed a 32 bit MIPS architecture specifically implemented with instructions to support SFE. GarbledCPU [46] is a MIPS-based general-purpose sequential processor which enables the high-level description of garbled circuits in hardware. Problems to be evaluated securely are compiled to MIPS assembly and then run securely on their garbled MIPS processor. The goal of this project is to fabricate the MIPS core; FPGAs are used for prototyping the design. Using MIPS assembly code to represent the problem being evaluated alleviates the problem of lengthy FPGA place and route cycles. However, the availability of this specialized hardware is likely to be limited. Our approach introduces more parallelism than either of these CPU approaches, as we implement many hashing cores in parallel. In addition, through an overlay, we can rapidly switch between problems.

2.2.3. GPUs

Researchers have used GPUs for hardware implementations of garbled circuits. Fastplay [47] uses a GPU architecture to accelerate garbling arithmetic operations and achieves a 35 to 40x improvement over a serial implementation. Frederiksen et al. [48] implement a protocol based on cut-and-choose of garbled circuits for the malicious setting using GPUs. Husted et al. [49] implement free-XOR, pipelining, and OT extension on GPUs, which exploits some of the parallel nature of these tasks. They report on the difference between implementations on the Single Instruction, Multiple Data (SIMD) architecture of GPUs and on Multiple Instruction, Multiple Data (MIMD) architectures for multicore CPUs. They also comment on the difficulty of comparing different implementations. Husted et al. assume a malicious adversary and thus implement k different versions of a garbled circuit, which gives them increased parallelism. We assume an "honest but curious" adversary, which results in less parallelism.

2.2.4. Summary

Our approach, SIFO, differs from prior art with respect to (a) the level of parallelism implemented, (b) the ability to support any user problem, and (c) the ease of changing between problems without requiring regeneration of the FPGA circuit.

3. System Design Methodology

Our approach implements a coarse-grained overlay architecture to accelerate GC problems. Garbled AND and XOR gates are implemented on an FPGA along with memory and control for support. Software tools support the mapping of different garbled circuit problems onto this overlay architecture and leverage the interaction between hardware and software while maintaining small communication and memory access overhead. We describe the hardware architecture (Section 3.1), discuss why an overlay architecture is needed (Section 3.2), and present the software structure (Section 3.3). We conclude the section with a discussion of optimizations implemented for performance improvement.

To demonstrate the utility of FPGAs in the datacenter for accelerating GC, we start with circuits generated from FlexSC based on ObliVM [33]. FlexSC is a software framework that allows developers without any cryptography expertise to convert algorithms expressed in a high-level language to GC. FlexSC generates a gate netlist of the problem to be garbled, where gates are restricted to AND and XOR gates. This research takes this netlist and processes it on an FPGA and compares it with the same processing done by FlexSC on a CPU. Our recent results [50] show that garbling takes up about two-thirds of the total run time and is thus the bottleneck in our overall design. In this paper, we focus on garbling; accelerating evaluation will be addressed in future work.

The overall process starts from user data and a problem to be garbled. The steps required are generating the netlist for the garbled circuit, mapping that netlist onto implementations of AND and XOR gates, generating the garble tables for the Evaluator, and then transmitting the table for each AND gate to the Evaluator. The Evaluator receives data inputs from the users via oblivious transfer. In the process of garbling, we use FlexSC to generate the netlist and use the wire numbers from that netlist as the memory locations for each wire. Software that runs on the host processor performs layer extraction to process the circuit in the breadth-first order and assigns AND operations in the garbled circuit to specific AND gates on the FPGA. Hence, what is communicated to the FPGA is wire IDs and gate numbers for garbling. The garble tables are transferred back to the host processor for transfer to the evaluator. This flow is shown in Figure 3 and described more completely in this section.

3.1. Hardware Architecture
3.1.1. gAND and gXOR Overlay Cells

The garbled AND cells required for garbled circuit generation are much more complicated than single bit operations. To emphasize this fact, we refer to them as gAND in the remainder of this paper. Each wire of the gAND is represented with 80 bits. A basic garbling AND operation implements the functionality described in Section 2.1. The design we use, shown in Figure 4, implements the row-reduction [40] and free-XOR [38] optimizations. Each line of the truth table is implemented according to equation (1). This implementation requires four Secure Hash Algorithm 1 (SHA-1) cores, although only three output values need to be transmitted to the evaluator. k_i^0 and k_j^0 are the two garbled values representing the value 0 on the input wires of a gAND operation. R is a global random value from which the garbled value representing 1 on a wire can be derived: for any wire i, k_i^1 = k_i^0 ⊕ R. The implementation still uses four SHA-1 primitives which run in parallel; however, only three values in the garbling table need to be stored, reducing the size of the garble table by 25%. Since all values in the garble table need to be transmitted to the host, and later to the evaluator, this optimization also results in a 25% saving in the amount of data that needs to be transmitted for each gAND gate. The implementation includes two arbitrators for the four outputs of the SHA-1 operations. The first arbitrator decides the sequence of the results and picks one of them to XOR with the other three. The second arbitrator rearranges the sequence of those three values and stores them in the garble table. Note that these arbitrators do not introduce any latency to the system. The latency of a gAND gate is 82 cycles, which is determined by the latency of the SHA-1 core. The implementation, based on an open source core [51], uses 512 bit values derived from the garbled inputs and additional information. gAND requires 82 clock cycles on the FPGA and uses 3070 ALMs and 3750 one bit registers on our target hardware, a Stratix V FPGA.

There are several things to note about this implementation. First, SHA-1 is known to be vulnerable; however, since new keys are generated for every new problem and every new set of inputs, this is not a concern in the context of GC. A user who wants a stronger privacy guarantee can replace the SHA-1 cores with AES or another cryptographic primitive. This may reduce the performance of our implementation, as the number of cores that can be implemented in parallel could be reduced. Second, other optimizations have been introduced, most importantly the half-gates optimization [52], which reduces the amount of data that needs to be transmitted between the Garbler and Evaluator. This optimization will not accelerate garbling, the focus of this paper, but will reduce communication costs. These and other optimizations can easily be introduced into the design of gAND and will be considered in the future. Note that SHA was chosen in order for us to compare our performance directly with a widely used software implementation, FlexSC [32].

The Garbled Circuit XOR overlay cell (referred to as gXOR) benefits from the free-XOR protocol [38]. A free gXOR gate consists of 80 bit plaintext XOR operations. For any garbled circuit operation, the free-XOR approach provides the same privacy guarantees as using standard cryptographic primitives. This optimization means that gXOR is both much smaller and much faster than gAND. Note that the gXOR gate is combinational and thus has no latency. The time to garble an XOR operation is less than the time to transfer the input and output wire information from the host processor. However, it is still advantageous to do this in FPGA hardware since the input and output garbled values stay local to the FPGA and would otherwise have to be communicated back to the host.

3.1.2. FPGA Overlay Architecture

Figure 5 shows the overlay architecture we use for garbled circuit acceleration. This architecture includes the gAND and gXOR circuits described above, a workload dispatcher and data controller (described below), block RAM that is used as an onchip cache, and a DDR memory interface for accessing the main memory for the problem being garbled. An architectural decision in our overlay design is how many gAND and gXOR gates to instantiate. We experimented with different numbers of AND and XOR overlay cell combinations, and the results are presented in Section 4.

3.1.3. Workload Dispatcher and Data Controller

The workload dispatcher and data controller are responsible for fetching garbled values from memory based on input and output addresses, delivering input data to the correct gAND or gXOR overlay cell, and writing back the results to memory after each operation. The timing of the entire system is determined by the states in this module. Figure 6 shows the timing information of the workload dispatcher and data controller, which implements the following steps: (1) determining the type of the next batch of operations sent from the host, (2) reading input values from memory and forwarding them to the correct overlay cell, and (3) writing the output result back to the corresponding location in memory. Our design uses both onchip block RAM (BRAM) and offchip DDR memory. States in the FSM are added depending on the type of memory accessed. An entry using block RAM for storage will wait for only one clock cycle for a read or write; however, for the DDR memory, the memory access operation has variable latency and is not finished until the "complete" flag is raised. One of the challenges of GC is that memory locations are accessed in random order, and hence the timing and organization of memory are complicated. We currently use onchip BRAM as a cache for values that would otherwise be stored in offchip memory. A detailed discussion of different memory optimizations is presented in Section 4. The onchip BRAM is organized with a single read port and a single write port with 108 bit data, of which 80 bits are used. 6.75 Mbits of BRAM is used in this design to store values. Multiple BRAMs on the chip can be accessed in parallel. However, the random access nature of memory accesses makes it challenging to take full advantage of this feature. The onboard memory is accessed using 512 bit reads and writes, and four garbled values are accessed in one data word. Two data ports are available in parallel. The port widths are dictated by the architecture of the Gidel ProceV board.
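The following behavioral sketch (our own simplification in Python, not the RTL) summarizes the dispatcher's per-gate steps: decode the flag bit of each address, read each operand from BRAM or DDR accordingly, run the appropriate overlay cell, and write the result back.

```python
def dispatch(op, in0, in1, out, gate_id, bram, ddr, gand, gxor):
    # gand/gxor stand in for the gAND/gXOR overlay cells; bram/ddr are
    # dict-like memories. Addresses carry a flag in the last bit:
    # 0 -> offchip DDR (variable latency), 1 -> onchip block RAM (1 cycle).
    def mem(addr):
        return bram if addr & 1 else ddr

    a = mem(in0)[in0 >> 1]
    b = mem(in1)[in1 >> 1]
    result = gand(gate_id, a, b) if op == "AND" else gxor(a, b)
    mem(out)[out >> 1] = result  # write back to the flagged memory
```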

3.2. Why Use an Overlay Architecture?

The examples used in Section 4 are shown in Table 1. These problems are addition, Hamming Distance (HD), multiplication, sorting, and matrix multiplication. We also analyze the scalability of our design by testing several different sizes of these problems. Note that FlexSC tries to maximize the number of XOR gates used, as XOR gates are much less computationally expensive to implement with the free-XOR optimization [38]. Thus, the percentage of gates that are AND gates never exceeds 26% in our examples.

The last column of Table 1 shows the number of times the FPGA would need to be reprogrammed assuming layers are processed one at a time and 10 garbled AND gates are implemented on the FPGA. We assume that as many XORs as needed can be accommodated, as XORs take much less time and space to process, as discussed above. These examples motivate the need for an overlay architecture.

The FPGA architecture is implemented as a coarse-grained overlay circuit with a sea-of-gates approach, where the implemented gates are gAND and gXOR. Using this architecture, any user problem can be mapped to the garbling hardware with no need to reprogram the FPGA, and the results are available rapidly. This is in contrast to the traditional FPGA design workflow, which would require synthesis, place, and route as well as downloading a new bitstream for each new problem. In our approach, all that is needed is to program the FPGA ahead of time and generate the software host code for each problem. The overlay architecture can be reused without recompilation, while the traditional FPGA workflow has to go through the entire tool flow. The traditional approach is infeasible for garbling large problems as many recompilations would be required for a single problem. Each compilation can take minutes to hours. Using the overlay architecture, we compile the hardware once and generate the software in minutes. Let us assume that we used a very efficient program, such as TinyGarble [41], to generate each instance of a problem. Let us further assume that TinyGarble can fit a design with 100 garbled AND cores on an FPGA, i.e., it is ten times more efficient in hardware usage compared with our approach. To handle multiple problems, such as those in a datacenter setting, each new problem would need to be generated, placed, and routed, and this takes on the order of tens of minutes. Hence, our approach is more efficient even for those problems that fit entirely on one FPGA. For large problems, such as the larger matrix multiply problems in Table 1, the FPGA would have to be reprogrammed more than a hundred times, which would require hours. The overlay approach provides an architecture that maps different designs to the FPGA without requiring reprogramming. Thus, the end-to-end run time of an application with FPGAs is faster than the end-to-end run time using FlexSC, as presented in Section 4. We summarize this discussion in Table 2.

3.3. Software Workflow

This section discusses the software workflow including problem generation, problem parsing, layer extraction, and code generation.

3.3.1. Problem Generation and Validation

Our design makes use of our GC overlay architecture, SIFO, in a way that is seamless for a user of FlexSC. FlexSC, based on ObliVM [33], is a software framework that allows developers without any cryptography expertise to convert algorithms expressed in a high-level language to GC. We modify FlexSC to output the netlist for a garbled circuit problem. This research extends FlexSC by taking the netlist, consisting of gAND and gXOR gates, and processing it on an FPGA. We use the same optimizations as FlexSC, namely, free XOR [38] and row reduction [40]. In its normal operation, FlexSC outputs the results of garbling each gate; we use these values for verification. Note that input values are random and generated for each new computation. These are generated on the host and used for both the FlexSC and FPGA versions to ensure consistent results. The speed and validity of results can thus be easily compared.

The netlist generated by FlexSC is garbled in the breadth-first order. To support this, we generate layer information and separate each layer into a “batch” of operations, where each batch represents the number of gates that can be garbled in parallel on the FPGA and is implementation-specific. As the netlist can be quite large, it may require many batches to garble a single layer. A typical Boolean gate generated from FlexSC has the form: wire ID1 AND/XOR wire ID2 = wire ID3. We use wire IDs as memory addresses; intermediate data from garbling need to be stored. We use both onchip and onboard memory for this purpose.
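For illustration, a hypothetical reader for this line format might look as follows; FlexSC's actual output handling may differ.

```python
def parse_gate(line: str):
    # "123 AND 124 = 125" -> ("AND", 123, 124, 125); wire IDs double as
    # memory addresses for the garbled values.
    lhs, out = line.split("=")
    in0, op, in1 = lhs.split()
    return op, int(in0), int(in1), int(out)
```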

Our hardware consists of an FPGA board connected to a PC via PCIe. Figure 3 highlights the workflow for garbling a circuit that involves both software running on a CPU (on the right of the figure) and FPGA design (on the left).

The host transfers the information to the FPGA for processing. We automatically generate the host code for each problem through our tools. The host code for a user problem is responsible for initial data transmission, assigning gates in the problem to specific gAND or gXOR instances on the FPGA, and allocating the output of each garbling operation to memory.

Figure 7 shows the timeline for one batch of Boolean operations for the workload dispatcher and data controller, assuming that all information is transferred from the host before the batch begins operation. This is improved on by overlapping communication and computation as described in the optimization section.

3.3.2. Problem Parser and Layer Extractor

The problem parser analyzes the generated gate netlist from FlexSC for gate, wire, and layer information. The output consists of the total number of wires representing the total number of memory locations required, and, for each AND gate and XOR gate, the addresses that correspond to the input and output wires. Other output includes information for separating wires into different groups (onchip or onboard) which is used in implementing the hardware memory hierarchy. We consider several different approaches for using onchip memory effectively. Analysis of results for sample problems is presented in Section 4.1.

We process gates in the breadth-first order. The netlist generated from FlexSC is fed to a layer extractor which extracts each layer of the circuit that can be garbled in parallel. We also identify the primary input values, whose wire IDs are not the output of any gate. Layer extraction identifies the AND and XOR gates that can be processed at the same time. For most problems, an entire layer will not fit onto the implemented FPGA overlay architecture. Thus, a single layer may take several rounds. We refer to the number of gates that map directly onto the FPGA as a batch. Each operation in a batch is assigned a gate ID that corresponds to the gate it uses in hardware. Wire numbers for input and output wires are used as the addresses in memory where input and output values are stored. The processor assigns wire IDs and gate IDs and transmits this information to the FPGA. At the end of a batch, the processor transmits the next batch of information and continues until the circuit is fully garbled. Within a batch, all gates belong to the same layer of the circuit. Note that the amount of information transferred from the host to the FPGA is minimal. The data remain on the FPGA; only memory addresses and gate IDs are transmitted.
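A compact sketch of layer extraction and batching under these rules, assuming the parsed (op, in0, in1, out) tuples from above and a netlist listed in topological order (function names are ours):

```python
from collections import defaultdict

def extract_layers(gates):
    # A gate's layer is one past the deepest layer among its inputs;
    # primary inputs (wires never produced by a gate) sit at layer 0.
    wire_layer = defaultdict(int)
    layers = defaultdict(list)
    for op, in0, in1, out in gates:
        layer = max(wire_layer[in0], wire_layer[in1]) + 1
        wire_layer[out] = layer
        layers[layer].append((op, in0, in1, out))
    return layers, wire_layer

def batches(layer_gates, batch_size):
    # Split one layer into overlay-sized batches; each gate in a batch is
    # assigned to a physical gAND/gXOR instance on the FPGA.
    for i in range(0, len(layer_gates), batch_size):
        yield layer_gates[i:i + batch_size]
```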

3.3.3. Host Code Generation

We developed tools to automatically generate the host code for any garbled circuit operation. The input is the layer information from the layer extractor. This tool generates the batches and assigns wire IDs and gate IDs for different problems. Initial input data are generated and sent to the FPGA. For very large problems, the host code separates the main function into groups of smaller problems to avoid exceeding the heap size allocated for a problem. The tools support a debug mode, as well as different allocation policies for memory, which are discussed in more detail below. More details can be found in [35].

3.4. Optimizations

There are two major sources of bottlenecks in our design. The first is transferring data over PCIe. The second is the delay in accessing onboard memory. In this section, we address optimizations to the design that mitigate both of these bottlenecks.

3.4.1. PCIe Communication and FPGA Memory

The first few optimizations target improving communications over the PCIe bus. In our implementation, for each gate, the location of the input and output wire values and the gate type, AND or XOR, needs to be communicated. Since the circuits representing problems to be garbled are large, this information is transferred as a batch of operations at a time.

Our first optimization involves overlapping communication and computation of gate and wire information, as shown in Figure 8. Overlay cells can start working as soon as the information for a new Boolean operation has been transmitted. For different batches, the same gates implemented as part of the overlay architecture are reused for different garbled gates in the user design. This optimization is applied in all subsequent designs and in all reported results. Another optimization we apply is to remove unnecessary handshaking signals between the host and the FPGA.

The communication channel between the host and FPGA supports direct communication to data registers on the FPGA or, using DMA, to the onboard DDR memory. We use DMA to transmit the initial data (values on input wires) to the DDR memory. We directly transfer gate information to onchip registers. The time for the host to write to one register on the FPGA is 50 ns. As there are three addresses for a Boolean gate, the data transmission time is 150 ns per gate in a batch.

An optimization we apply is to pack more than one address into a register to reduce the number of transfers required. We use two registers to represent the three addresses needed for each gate. The total width of the two registers is 64 bits, so the bit-width available for each address location in our design is 21 bits. Besides the flag bit representing the memory type, this leaves 20 bits for a real address, which is enough to represent about one million wires. Figure 9 shows this optimization. We are investigating generating memory addresses locally to the FPGA, which will remove this limitation.
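A sketch of the packing arithmetic, assuming 32 bit interface registers and the 21 bit fields (1 flag bit plus 20 address bits) described above; the exact field layout on the real interface is an assumption.

```python
FIELD = 21  # 1 flag bit + 20 address bits per wire address

def pack(a0, a1, a2):
    # Three 21 bit fields occupy 63 bits, fitting in two 32 bit registers.
    word = (a0 << 2 * FIELD) | (a1 << FIELD) | a2
    return word & 0xFFFFFFFF, word >> 32  # (reg_lo, reg_hi)

def unpack(reg_lo, reg_hi):
    word = (reg_hi << 32) | reg_lo
    mask = (1 << FIELD) - 1
    return (word >> 2 * FIELD) & mask, (word >> FIELD) & mask, word & mask
```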

3.4.2. Hybrid Memory Hierarchy

The second source of bottleneck in our design is the transfer between the FPGA and the on-board DDR memory. Accesses to the DDR memory require many clock cycles with a latency of about 180 ns, and, as wires are not accessed in sequential order, we cannot take advantage of the burst mode. Block RAM available on the FPGA has much faster access times of one clock cycle (5 ns for a 200 MHz CLK), but is not large enough to support the size of problems we are processing. To address this issue, we make use of a hybrid memory hierarchy where some values are stored in the onchip memory while most values are stored offchip. In essence, we are using the onchip memory as a cache. However, unlike a traditional cache, the policy for using the cache is completely under user control.

Figure 5 shows the hardware architecture using both the block RAM (BRAM) inside the FPGA and the DDR memory on board. Our previous work used only onchip BRAM and thus was limited in the size of problems garbled [34]. In this research, we investigated two different allocation policies for block RAM. We refer to these as directly used and most frequently used. Results for both policies are reported in Section 4. Software on the host determines whether a wire is stored in block RAM or in DDR. A single bit in the address indicates which it is. Wire IDs are generated on the host, so the code that generates wire IDs also implements the memory policy. Using a bit to indicate the location not only reduces the number of addressable memory locations but also removes the need for hardware to track the locations of specific wires.

Addresses for wires are used to store values that represent the output of the garbled gate. These values are used in generating the garbled values for the next gate; however, they are not transmitted to the Evaluator. Only the garble tables need to be transmitted. Hence, values are stored in memory for the duration of the garbling computation across all layers, but are not needed after that.

Some wires are generated as outputs from one gate and feed directly into another gate. In other words, their fanout is 1. In the directly used policy, we store the values generated on such directly used wires in block RAM to save the time to store and fetch these values. The criteria for a value to be stored in block RAM are: (1) the wire is used only once after it is generated and (2) the Boolean gate which uses this wire ID is in the adjacent layer. The directly used policy saves significant memory bandwidth. The second criterion of only using values in an adjacent layer allows block RAM space to be reused once the garbled value is no longer needed. For the directly used policy, we use a ping-pong buffering approach, where half the block RAM in any layer is used for reading and the other half for writing, and these roles are swapped with each layer. A total of 13 megabits of onchip BRAM is used.

For the most frequently used policy, the host code analyzes the complete netlist that is generated and identifies the wires that are most frequently used. These values are stored in block RAM for the duration of their lifetime. This policy is similar to a most frequently used cache allocation policy; however, in our design, the values are never stored in DDR RAM. The goal behind this policy is to reduce the number of reads and writes to offchip memory. The host code sorts wire IDs based on their number of accesses and assigns those wires with a large number of accesses to block RAM. Once block RAM cannot fit more wires, the rest of the wire IDs are assigned to addresses in the DDR memory. We do not currently reuse memory locations; this is planned for future work.
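Both policies amount to a wire classification computed on the host; a sketch using the (op, in0, in1, out) tuples and the wire_layer map from the earlier sketches (our simplification of the tools' analysis):

```python
from collections import Counter

def directly_used(gates, wire_layer):
    # Select wires with fanout 1 whose single consumer is in the adjacent
    # layer; these go to block RAM under the directly used policy.
    produced = {out for _, _, _, out in gates}
    fanout, consumer = Counter(), {}
    for op, in0, in1, out in gates:
        for w in (in0, in1):
            fanout[w] += 1
            consumer[w] = wire_layer[out]
    return {w for w in fanout if w in produced and fanout[w] == 1
            and consumer[w] == wire_layer[w] + 1}

def most_frequently_used(gates, capacity):
    # Rank wires by access count; the top ones live in BRAM for their whole
    # lifetime, and the rest are assigned DDR addresses.
    counts = Counter(w for _, in0, in1, _ in gates for w in (in0, in1))
    return {w for w, _ in counts.most_common(capacity)}
```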

Table 3 shows wire information. Each wire corresponds to one memory location: the more wires, the more memory is needed to store the values that correspond to the output of each garbled gate. The problems have a wide range of wire counts, from several dozen to over one million wires. There are two types of wires: 1-to-1 wires, whose value is consumed by exactly one gate, and 1-to-N wires, whose value is used multiple times. 1-to-1 wires further divide into those whose output is used immediately in the next layer and those whose output is used in a later layer; we use the first kind in the directly used policy. The maximum number of 1-to-1 wires in a layer gives the number of memory locations needed in block RAM for the directly used policy if all such wires are kept on chip.

Table 4 shows the percentage of each type of wire. Percent A is the percentage of 1-to-1 wires among all wires. Percent B shows that most 1-to-1 wires are used in the next layer. Percent C shows the percentage, among all wires, of 1-to-1 wires used in the next layer. These data show that the directly used policy is a good fit for many garbled circuit problems and may require less onchip memory compared with the most frequently used policy.

For the hardware architecture, the workload dispatcher and data controller are designed to accommodate this hybrid memory hardware architecture. The controller monitors the flag of the address provided by the host and determines whether the value should be stored in block RAM or DDR. We embed the flag as the last bit of the address, and if zero, the location is DDR; otherwise, it is block RAM. Timing results for both of the implemented policies are presented in the next section.

4. Experiments and Results

We compare our results to FlexSC [32] both for correctness and for performance. For software timing, we run FlexSC on an Intel Core i7 processor running at 3.6 GHz, using any optimizations implemented in FlexSC. Our target hardware platform consists of a host PC and FPGA card, specifically the ProceV board from Gidel. The ProceV board hosts a Stratix V FPGA with two DDR3 external memories each of which can support 8 GB, or a total of 16 GB. It provides communication between host and FPGA via a PCIe Gen 3 bus with 8 lanes, each of which supports 8 gigatransfers per second. Our results show the end-to-end effect of replacing FlexSC garbling in software with an FPGA solution.

In our results, we examine the effect of the basic overlay architecture that we implemented, as well as each of the optimizations discussed in Section 3. Note that all of the results presented use overlapped communications and computation as described in Section 3.4.1. In the reported results, the FPGA clock is 200 MHz. The interface clock responsible for data transmission between host and FPGA is running at 300 MHz. Timing results compare the FPGA design with an Intel processor running the same algorithm at 3.6 GHz.

4.1. Problem Analysis

The problems we analyze are addition, Hamming distance (HD), multiplication, sorting, and matrix multiplication. We also analyze the scalability of our design by testing several different sizes of these problems. Results for some of these problems were reported in our previous publication [34]. The biggest difference between the work presented here and our previous work is that the previous work did not make use of offchip memory, which significantly limited the size of problems we could garble. The maximum number of gates in a previously presented example was 32,000 gates. Here, we analyze much larger problems, with up to a million gates. Problem information is summarized in Table 1. This table shows the total numbers of layers, wires, and gates in the garbled circuit problems we analyze. The examples presented are a 6 bit adder, several different bit sizes for HD, several different bit sizes for multiplication, and an example of sorting four bit numbers. All of these were reported previously [34]. The new examples include different sizes of matrix multiplication that show how the problem size scales, as well as larger problems that could not fit with the previous approach. The largest problem reaches one million garbled circuit operations and several thousand independent gates within each layer. Note that the largest number of layers is not from the largest problem. Also, note that, to best take advantage of the free-XOR optimization, FlexSC generates examples that predominantly make use of gXORs; gANDs never exceed 26% of total gates. In addition to these examples, we analyze different sizes of page rank (PR) and present results below.

4.2. Heterogeneous Computing System Results

We have implemented all of the design variants described in this paper on a Gidel ProceV board. In this section, we present step-by-step performance improvements using the different optimizations described in Section 3. We compare subsequent designs to one another and also present the speedup compared with software using FlexSC.

We experiment with different numbers of overlay cells implemented in hardware, as shown in Table 5. Results show speedup compared with FlexSC for 5 and 10 garbled AND gates; in both cases, a single XOR overlay cell is used. We do not observe much improvement when we increase the number of AND gates, which indicates that this is not the bottleneck in our base design. The bottleneck here is PCIe communications between host and FPGA board.

In the base design, the transmission time for the information for each XOR operation is larger than the time to garble the XOR itself. Knowing this, we can remove the synchronization steps used by the host when sending XOR gates and let the host keep sending them. Note that it is still worthwhile to garble the XOR gates on the FPGA because the output keys generated are used in subsequent gates in the design. Table 6 shows the results of applying this optimization to the ten garbled AND gate design. Additional speedup is the speedup compared with the version with the XOR check, and the total speedup compares this new design with FlexSC. In this design, the host sends all the XOR operations within one layer without synchronization before sending batches of AND operations. This optimization contributes significant speedup, and the effect of this optimization grows as the size of the user problem increases.

Next, we show the speedup from using both onchip block RAM and offchip DDR memory. Table 7 shows total time in μs and the speedup compared with the software version in FlexSC. In this table, we apply the directly used policy, described in Section 3.4. The results show that the smallest speedup compared with software is 8 times. Thus, using a hybrid memory architecture results in significant savings.

We also implemented the most frequently used policy using the hybrid memory system. We tried three of the larger problems to compare the two policies, as shown in Table 8. The policy comparison column compares the most frequently used policy with the directly used policy. If it is larger than 1, then the most frequently used policy is faster. Note that the results show that there is not much difference in performance between the two policies for large problems. This is likely due to the fact that the directly used policy reuses block RAM memory locations, while in our current implementation, the most frequently used policy does not. In addition, the preprocessing cost for the most frequently used policy is more expensive as the fanout of every wire needs to be computed. In addition, to reuse memory locations, lifetimes of these wires will need to be computed. While we plan to investigate this in the future, for now, we conclude that the directly used policy is the most advantageous.

For the FPGA operation, we use 200 MHz as the local clock. The PCIe protocol allows us to set a different "main" clock speed for transmitting data; for this, we use 300 MHz. Because the main clock is faster, the time to transmit the operands for a garbled XOR is no longer larger than the XOR operation time. Thus, we cannot apply the XOR-without-synchronization optimization between the host and FPGA. However, we can instead use multiple XOR overlay cells to improve the total performance. Table 9 shows the results of using 5 AND and 5 XOR overlay cells; 10 AND and 10 XOR; and 15 AND and 15 XOR. The speedup improvement from changing from 5 to 10 cells is 1.25 times, and from 10 to 15 cells is 1.22 times. We will continue to investigate adding more gates to see when this improvement saturates.

The speedup results from packing 3 addresses into 2 registers are shown in Table 10. We use the page-ranking examples, and the results show 1.06 to 1.13 speedup improvement compared with the method of using 1 register for 1 address. Note that this optimization limits the size of valid address bits that can be used to 20, which in turn limits the size of problems that can be garbled.

We combine all of the optimizations that led to speedup and present the results in Table 11. These results apply the following optimizations: (1) 15 AND overlay cells and 15 XOR overlay cells; (2) the hybrid memory system with the directly used policy; (3) a 300 MHz main clock frequency for the PCIe interface and a 200 MHz local clock frequency; and (4) pipelined operation between the host and FPGA. The results are shown for working designs on the Gidel ProceV board and compare end-to-end system running time with the same problems running in software using FlexSC. We observe one or two orders of magnitude speedup across a range of problems. Note that the software is running at 3.6 GHz, while the FPGA implementations are running at 200 MHz. FlexSC runs with one thread; however, parallelizing this particular implementation of GC, with the optimizations used and the "honest but curious" model, is not trivial. Note that the number of AND gates garbled per second continues to increase as the size of the problem grows. While we see significant speedup across all problems, the amount of speedup diminishes as the problem size grows. This is because onchip BRAM can hold a smaller and smaller percentage of the memory locations as the size of each layer grows, which highlights the importance of our hybrid memory optimization. We intend to continue to optimize our design to be able to garble larger and larger problems in less elapsed time.

4.3. Bandwidth Bottleneck

Under the two optimizations we employ (free-XOR gates [38] and garbled row reduction [40]), the garbler needs to send three 80 bit ciphertexts (240 bits) to the evaluator per AND gate and 0 ciphertexts (0 bits) per XOR gate in the circuit. The latter is due precisely to the use of the free-XOR optimization: garbled XOR gates require neither encryption during garbling nor any transmission using this technique.

Under these optimizations, garbling is computation-bound in our setting. Taking two 4 bit m_mult cases in Table 11 as examples (the case with the largest speedup and the largest example run), our processing times indicate we can process 0.67 M and 0.75 M AND gates per second, respectively. At the cost of 240 bits per gate, this garbling rate corresponds to a required communication bandwidth between the garbler and evaluator of 160.8 Mbps and 180 Mbps, respectively. This is well within the range of the bandwidth available at, e.g., Amazon Web Services (AWS) EC2 instances (5 Gbps); this implies that a garbler and evaluator deployed by distinct entities on AWS would be computation-, not communication-bound. We note that this observation, as well as our estimates, agrees with experimental observations of garbler-evaluator execution pairs on AWS [16].
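The bandwidth arithmetic is easy to verify; the short check below recomputes the numbers quoted above rather than reporting new measurements.

```python
BITS_PER_AND = 3 * 80  # three 80 bit ciphertexts per gAND after row reduction

for rate in (0.67e6, 0.75e6):  # measured AND gates garbled per second
    print(f"{rate / 1e6:.2f} M gates/s -> {rate * BITS_PER_AND / 1e6:.1f} Mbps")
# 0.67 M gates/s -> 160.8 Mbps
# 0.75 M gates/s -> 180.0 Mbps
```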

5. Conclusions and Future Work

This article demonstrates a heterogeneous reconfigurable computing system using an FPGA overlay architecture for general garbled circuit operations. This system lets users implement and accelerate their applications without any knowledge of hardware development or of the secure function evaluation protocol by providing a complete workflow that maps any garbled circuit problem onto it. We demonstrate the benefit of using this system by showing significant speedup compared with existing software platforms. This research makes possible wider adoption of garbled circuit schemes in the future.

For the hardware architecture on the FPGA, our design uses a coarse-grained overlay architecture and enables the evaluation of different SFE tasks without the need for reprogramming. The host-side workflow includes garbled circuit generator, problem parser, and host code generation tools, which can be configured for different hardware architectures. These tools explore the parallelism for any GC problem and generate the host program based on the structure of the problem. We also provide analytical tools to show the different characteristics of a problem. We explore the bottlenecks encountered while working on this heterogeneous reconfigurable computing system and tackle them using different methods. This exploration also provides directions for other researchers improving their own heterogeneous system designs.

There are several directions for future research. First is the further improvement of the heterogeneous system. This research may benefit from a closely connected FPGA, such as the Intel HARP, to alleviate the bottleneck of the PCIe interface. Another direction is to expand the overlay cell library to abstract more complicated computational patterns built from Boolean AND and XOR operations. The current work uses garbled circuit AND and XOR overlay cells as the two components of the hardware architecture library, and this fine-grained pattern suffers from DDR access delay on every operation. Based on Tables 3 and 4, we know that there are many 1-to-1 wires to be used in the next layer. One solution is to build other overlay cells which consist of cross-layer Boolean operations. A final direction is to separate a large problem into several small problems which can be computed independently across several host nodes, each with its own FPGA board. This enables expanding to even larger data mining problems, such as page ranking with more nodes using GraphSC, and eventually provides a large, scalable, efficient platform for privacy-preserving computation. We have already begun to test these ideas using Amazon Web Services F1 instances.

Data Availability

The examples used to support the findings of this study are available from the corresponding author upon request.

Disclosure

Xin Fang is currently affiliated with Qualcomm. The bulk of this research was done at Northeastern University.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was based upon work supported by the National Science Foundation under grant no. 1717213. The research was also supported by the Google Faculty Research Award. We would like to thank Mehmet Gungor and Kai Hwang for valuable contributions to this research.