#### Abstract

Private set intersection (PSI) allows participants to securely compute the intersection of their inputs and has a wide range of applications, such as privacy-preserving contact tracing for COVID-19. Most existing PSI protocols are based on asymmetric or symmetric cryptosystems, so key-related operations burden these systems. In this paper, we transform the set intersection problem into the problem of finding the roots of polynomials by using the point-value polynomial representation, blind the polynomials’ point-value pairs with a pseudorandom function for secure transmission and computation, and then propose an efficient PSI protocol without any cryptosystem. We optimize the protocol with the permutation-based hashing technique, which divides a set into multiple subsets to reduce the degree of the polynomial. The experimental results and theoretical analysis show the following advantages: (1) no cryptosystem is needed for data hiding or encryption, so our design provides a lightweight system; (2) with set elements less than , our protocol is highly efficient compared to the related protocols; and (3) a detailed formal proof is given in the semihonest model.

#### 1. Introduction

Private set intersection (PSI) allows participants to complete a computation based on their private inputs without learning any information other than the set intersection. PSI has a wide range of applications such as privacy-preserving contact tracing for infection detection [1, 2], private contact discovery [3], similar document detection [4], suspects detection [5], relationship path discovery in social networks [6], and satellite collisions matching [7].

PSI has been well studied, and several cryptographic technologies have been proposed to implement it. According to the cryptographic techniques involved, PSI protocols are mainly divided into the following three categories:

(1) PSI based on the public-key technology: the main cryptographic technique was homomorphic encryption. The protocols were designed in such a way that the sender encrypted its set and the receiver performed some operations on the ciphertexts using the homomorphic property; then, the sender decrypted them by using its private key and got the intersection. With small communication complexity, these protocols were suitable for scenarios where the participants had strong computing power but the communication bandwidth was a bottleneck. However, the protocols had a higher time complexity because of the public-key cryptography.

(2) PSI based on the generic circuit: the protocols transformed any function into a garbled Boolean circuit and then completed the generic secure computation. The circuit generator encrypted each circuit gate using a double symmetric cryptosystem and generated a garbled circuit; the evaluator computed keys for the output wires by decrypting the appropriate ciphertexts without learning any intermediate values. The key technique used in the protocols was the symmetric cryptosystem. The advantage of the generic circuit approach was that it made the protocol easier to design and implement. But as a general solution, the garbled circuit could not achieve scalability, and the protocols were inefficient.

(3) PSI based on the oblivious transfer (OT) scheme: protocols of this kind introduced some variants of OT. Elements were stored in some data structures, and the parties ran an OT for each bit of the inputs to get private outputs. Then, each party performed XOR operations with random values and its own elements. Lastly, the sender sent the results to the receiver, who locally checked the existence of its inputs. To improve efficiency, most OT variants were implemented by using the symmetric cryptosystem. Thus, these protocols had lower time complexity and communication complexity. Nevertheless, such protocols required additional key-related computations such as secret key negotiations.

From the above analysis, PSI protocols based on public-key cryptosystems suffer from two constraints: low efficiency and the need for a complicated system for private/public key management. On the other hand, PSI protocols based on symmetric cryptosystems have higher efficiency, but negotiating or securely transferring secret keys leads to additional computations and communications. Furthermore, the secure storage of keys burdens the system. In this paper, we transform the set intersection problem into the problem of finding the roots of polynomials by using the point-value polynomial representation and propose an efficient PSI protocol without any cryptosystem.

##### 1.1. Application Scenarios

Our work can be applied to the following several practical scenarios.

###### 1.1.1. Contact Tracing for Infection Detection

The COVID-19 pandemic has posed an unprecedented challenge for humans. Due to the highly contagious nature of the virus, social distancing is one fundamental measure that has already been adopted by many countries. Based on the matching of location information between infected patients and regular people, contact tracing for infection detection enables users to securely upload their data to the server, and later, in case one user got infected, other users can check if they had ever got in contact with the infected user in the past. To protect users’ private location information, PSI can be applied to securely compute shared location data.

###### 1.1.2. Suspects Detection

Two national law enforcement bodies have a list of suspected terrorists. Due to national laws, they may not be allowed to disclose their whole lists, even when collaborating. Using a PSI protocol, both agencies can find commonly suspected terrorists and share their information, while other relevant information will not be disclosed.

###### 1.1.3. Satellite Collisions

Different space agencies have their own orbiting satellites. To determine whether a pair of satellites in the same orbit may collide and to adjust the orbits appropriately, these agencies need to share detailed information. However, no agency wants to disclose anything other than whether its orbital information indicates a collision. Thus, PSI is needed to compute the probability of a collision among satellites without revealing their other private information.

##### 1.2. Contributions

We transform the problem of the intersection of sets into the problem of finding roots of polynomials by using point-value polynomial representation and propose a new approach to PSI protocol without any cryptosystem. Then, we optimize our protocol based on the permutation-based hashing technique that reduces the length of the stored elements and the degree of the polynomial. Eventually, our protocol and the related PSI protocols are implemented on the Linux platform. The main contributions are as follows.

###### 1.2.1. A New Approach to PSI Protocol

We propose a new approach for designing PSI protocol based on point-value polynomial representation and pseudorandom function. Firstly, we represent sets as polynomials’ point-value pairs. Each party denotes elements as a -degree polynomial and represents as distinct point-value pairs where . Secondly, we blind polynomials’ point-value pairs for secure transportation and computation. Each party blinds them as by using pseudorandom function and exchanges the blinded point-value pairs. Thirdly, we compute the sum of two blinded polynomials’ point-value pairs. Through computation and transportation, one party can get the sum of two blinded polynomials’ point-value pairs. Lastly, we can learn the polynomial by interpolation and get the intersection by computing the roots of the polynomial. With this representation, we could get the set intersection without any cryptosystem.

###### 1.2.2. Efficient Hashing PSI Protocol

We optimize the new PSI protocol using the permutation-based hashing method, which converts the hashed elements into shorter strings without collisions and reduces the degree of polynomials. The hashing is to create a two-dimensional table and map each element to its hashed bins, resulting in stored elements, which split an -degree polynomial into -degree polynomials. This approach improves efficiency remarkably.

###### 1.2.3. Implementation of Our Hashing PSI Protocol

We implement our hashing protocol and other related protocols in C/C++ on the Linux platform. We use Number Theory Library (NTL) [8] along with GNU Multiprecision (GMP) library [9] for polynomial arithmetic. Based on the detailed experimental data, we conclude that our protocol is more efficient than public-key-based and circuit-based PSI protocols and is more efficient than OT-based PSI protocols at set elements less than .

##### 1.3. Organizational Structure

The related works on PSI protocols are introduced in Section 2. In Section 3, polynomial representation, hash technique, and security definition are given. In Section 4, a new approach to PSI protocol without any cryptosystem is shown. In Section 5, an optimized PSI protocol with the permutation-based hashing method is proposed. Implementation and performance analysis are presented in Section 6. Finally, conclusion and future work are provided in Section 7.

#### 2. Related Work

According to the underlying cryptographic techniques, PSI protocols can be divided into the following three categories.

##### 2.1. PSI Based on the Public-Key

In 1986, Meadows [10] introduced a PSI protocol that could solve the problem of authentication of mutually suspicious parties. However, it revealed the cardinality of the sets during the authentication. To solve this problem, an improved PSI protocol [11] was proposed.

The PSI protocol based on oblivious polynomial evaluation [12] was proposed in 2004; it used homomorphic encryption, balanced hashing, and properties of polynomials. The client represented its elements as roots of a polynomial, used interpolation to find the coefficients of the polynomial, and sent the ciphertexts of the coefficients to the server by using ElGamal [13] or Paillier [14] encryption. In this protocol, a polynomial of large degree led to a high cost of exponentiations in the homomorphic encryption. An extended version [15] was presented, where the client and the server used the cuckoo hashing technique to reduce the computational complexity.

In 2009, Jarecki et al. [16] showed a PSI protocol based on the composite residuosity assumption. The protocol used additively homomorphic encryption and zero-knowledge proofs to realize the pseudorandom function and then performed the intersection operation on the random values of the set. The client and the server ran a parallel oblivious pseudorandom function (OPRF) to get the intersection. However, the protocol relied on the common reference string model.

In 2010, De Cristofaro et al. proposed PSI and Authorized PSI (APSI) protocols [17, 18]. However, these PSI protocols revealed the client’s set cardinality. To hide the client’s set cardinality, Ateniese et al. [19] presented a PSI protocol that batched the hash values of the client. In 2012, De Cristofaro et al. [20] used RSA and OPRF techniques to reduce the total cost of cryptographic operations based on the earlier constructions [17, 18].

In 2017, Chen et al. [21] gave a PSI protocol with low communication complexity based on fully homomorphic encryption. In 2018, Chen et al. [22] implemented an unbalanced labeled PSI protocol against malicious adversaries by using an OPRF in a preprocessing phase.

##### 2.2. PSI Based on the Generic Circuit

The two main approaches were Yao’s garbled circuits [23, 24] and the Goldreich protocol [25], which replaced arbitrary functions with Boolean circuit computations. The communication overhead and the number of cryptographic operations depended on the number of nonlinear gates in the circuit. Thus, compared with most special-purpose PSI protocols, running time and communication complexity were more prominent problems for PSI protocols based on generic secure computation.

In 2012, Huang et al. [26] proposed several Boolean circuits for PSI protocols and evaluated them based on Yao’s circuit, which used homomorphic encryption and adopted various circuit optimization techniques. In the main method, the client and the server sorted the elements in their sets locally, merged them in order through the garbled circuit, and tested adjacent elements in the merged set for equality. Equal adjacent elements were the elements of the intersection. In 2015, Pinkas et al. [27] presented a circuit-phasing PSI protocol, which was up to 5 times faster than [26].

In 2018, Pinkas et al. [28] used a two-dimensional cuckoo hashing technique to realize a PSI protocol based on the generic circuit, which was asymptotically more efficient and could be extended to multiple parties. Under the general assumption of linear communication, Hemenway et al. [29] presented a simple and generic circuit-based PSI protocol in 2019, based on Pinkas et al.’s construction [27].

##### 2.3. PSI Based on the Oblivious Transfer (OT) Scheme

In 2001, Naor et al. [30] proposed an OT protocol with asymmetric cryptographic operations, which required expensive public-key operations when performing OT. Huberman et al. [11] used the OT extensions (OTs) technology [31] to replace expensive public-key operations with more efficient symmetric cryptographic operations.

In 2013, Dong et al. [32] showed a PSI protocol that could process sets of up to 100 million elements. This protocol was based on the Bloom filter (BF), the garbled Bloom filter (GBF), secret sharing, and OTs. The linear complexity and high scalability of the protocol came from the effective symmetric cryptosystem and parallel processing, respectively. However, in the malicious setting, the server could cause a selective failure that terminated the protocol when the client used a specific input. Thus, Rindal et al. [33] brought up an efficient fix using the cut-and-choose approach. Based on the method [30], Pinkas et al. [34] optimized the protocol by replacing OTs with random OT, which did not need to save the GBF structure but let the server and the client generate the BF structure as the input of OT.

In 2015, Pinkas et al. [27] applied the phasing and permutation hashing methods, which resulted in a reduction of computation and memory. Kolesnikov et al. [35] improved Pinkas et al.’s construction based on an efficient OPRF. Subsequently, Kolesnikov et al. [36] proposed an extended version of [35], which gave a lightweight protocol. Rindal et al. [33] gave the first implementation of a PSI protocol against malicious adversaries. In 2018, Pinkas et al. [3] analyzed the existing protocols in detail and optimized the PSI protocol using OPRF and hashing techniques. A new PSI protocol was constructed by Pinkas et al. [37] in 2019, which used 2-choice hashing [38], sparse OT extension, and the polynomial slice and stream techniques to reduce the communication cost and improve the efficiency of the protocol. In 2020, Pinkas et al. [39] proposed a PSI protocol based on a probe-and-XOR of strings (PaXoS) data structure, which not only had linear communication and computational complexity but also resisted malicious adversaries in the nonprogrammable random oracle model.

#### 3. Preliminaries

##### 3.1. Representing Set with Polynomial Point-Value Pairs

We give the transformation from operations of sets to operations of polynomials. This representation allows us to represent a set using a random point evaluation polynomial.

*Definition 1.* Polynomial representation of a set. Given a set , whose set cardinality is ; then, we define its characteristic polynomial as

and thus every element for is a root of .

*Definition 2.* Polynomial in point-value pairs: distinct point-value pairs can represent a degree polynomial , where for . If is fixed, the vector represents the polynomial .

*Definition 3.* Set intersection: let and be two sets with the same degree and represented as polynomials and , respectively. Then, the intersection could be learnt by finding the roots of the following polynomial:

where and are *d*-degree polynomials, and , and and are random values picked uniformly.
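The three definitions can be written out in symbols as follows. This is only a sketch for orientation: the symbols $S$, $s_i$, $n$, $P$, $x_i$, $y_i$, $Q$, $\gamma_A$, and $\gamma_B$ are notation assumed here, and the formula given for Definition 3 is the usual randomized-sum form that matches the description above, not necessarily the exact expression used in the protocol.

```latex
% Definition 1 (assumed notation): characteristic polynomial of S = {s_1, ..., s_n}
P_S(x) = \prod_{i=1}^{n} (x - s_i)

% Definition 2: n + 1 distinct points determine a polynomial P of degree n
\{(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)\}, \qquad y_i = P(x_i), \quad 0 \le i \le n

% Definition 3: randomized sum of the two characteristic polynomials
Q(x) = \gamma_A(x)\, P_A(x) + \gamma_B(x)\, P_B(x), \qquad \deg \gamma_A = \deg \gamma_B = d
```

Under this reading, every $a \in A \cap B$ satisfies $P_A(a) = P_B(a) = 0$ and hence $Q(a) = 0$, while for $a \in A \setminus B$ the value $Q(a) = \gamma_B(a) P_B(a)$ is nonzero unless $a$ happens to be a root of the random polynomial $\gamma_B$. Since $Q$ has degree at most $2d$, $2d + 1$ point-value pairs suffice to interpolate it.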

##### 3.2. Hashing Techniques

###### 3.2.1. The Simple Hashing

The simple hashing maps each element to its hashed position. In particular, when hashing an element , it stores in the bin , where is a random function: . To hold multiple elements, will be denoted as a double array for . The insert function of simple hashing can be described as

where is introduced to represent the maximum number of elements in each bin of the hash table.
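The insert operation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `bin_index`, `simple_hash_insert`, `num_bins`, and `max_load` are hypothetical, and a seeded SHA-256 digest stands in for the random function.

```python
import hashlib


def bin_index(element: int, num_bins: int, seed: bytes = b"psi-demo") -> int:
    """Map an element to a bin; seeded SHA-256 stands in for the random function."""
    digest = hashlib.sha256(seed + element.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def simple_hash_insert(elements, num_bins: int, max_load: int):
    """Build a two-dimensional table: table[b] lists the elements hashed to bin b."""
    table = [[] for _ in range(num_bins)]
    for e in elements:
        b = bin_index(e, num_bins)
        if len(table[b]) >= max_load:
            # The maximum bin size must be chosen large enough to avoid this.
            raise OverflowError(f"bin {b} exceeds the maximum load {max_load}")
        table[b].append(e)
    return table
```

For example, `simple_hash_insert(range(100), num_bins=16, max_load=100)` distributes 100 elements over 16 bins; a smaller `max_load` makes the overflow case more likely, which is why the maximum bin size is a protocol parameter.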

###### 3.2.2. Permutation-Based Hashing

Permutation-based hashing, proposed by Arbitman et al. [40], converts the hashed elements into shorter strings that are stored in the hash table, which reduces storage space and computation complexity. Originally, an element is represented as bits, where , is the bins’ size in the hash table. Then, the element gets the index, , where is a random function: . Finally, the value stored in the bin is , . Thus, the length of the stored data is significantly reduced and efficiency is improved.
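A minimal Python sketch of this idea follows, assuming the usual construction in which an element is split into a left part of `index_bits` bits and a right part, the bin index is the left part XORed with a random function of the right part, and only the right part is stored. The function names, the bit widths, and the SHA-256 stand-in for the random function are illustrative assumptions.

```python
import hashlib


def f(x_right: int, index_bits: int) -> int:
    """Stand-in for the random function mapping the right part to a bin index."""
    digest = hashlib.sha256(b"perm-hash-demo" + x_right.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << index_bits) - 1)


def permutation_hash(x: int, total_bits: int, index_bits: int):
    """Return (bin index, stored value) for an element of total_bits bits."""
    right_bits = total_bits - index_bits
    x_left = x >> right_bits
    x_right = x & ((1 << right_bits) - 1)
    index = x_left ^ f(x_right, index_bits)
    return index, x_right  # only the shorter right part is stored in the bin


def recover(index: int, x_right: int, total_bits: int, index_bits: int) -> int:
    """Invert the mapping: the bin index and the stored value determine x."""
    right_bits = total_bits - index_bits
    x_left = index ^ f(x_right, index_bits)
    return (x_left << right_bits) | x_right
```

Because `recover` inverts `permutation_hash`, two different elements can never yield the same pair of bin index and stored value, so shortening the stored strings introduces no collisions. With 32-bit elements and 16 bins (`index_bits = 4`), each stored value has 28 bits.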

##### 3.3. Security Definitions

This section focuses on the security definition of PSI protocol.

###### 3.3.1. Adversary

We consider a semihonest adversary who follows the protocol specification while trying to obtain extra information from the exchanged messages.

###### 3.3.2. Functionality

The functionality being implemented in this paper is , that is, two parties; has a set and gets the intersection, and has a set but does not learn any output.

*Definition 4.* Semihonest security: in the semihonest model, a protocol is secure if each party does not get any information other than its input and output. This is formalized by the simulation paradigm. The view of the party during the execution of protocol on input tuple is denoted by that includes its input and output, internal random coins, and messages exchanged. We say that privately computes if there exist polynomial-time simulation algorithms, denoted as and , such that

where “” represents two views that are computationally indistinguishable.
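For reference, the simulation condition is commonly written as below for a deterministic two-party functionality. The notation ($\pi$ for the protocol, $f = (f_1, f_2)$ for the functionality, $S_1$ and $S_2$ for the simulators, $\mathrm{view}_i^{\pi}$ for the view of party $i$) is assumed here and follows the standard textbook formulation rather than symbols recovered from this section.

```latex
\{ S_1(x, f_1(x, y)) \}_{x, y} \;\stackrel{c}{\equiv}\; \{ \mathrm{view}_1^{\pi}(x, y) \}_{x, y}

\{ S_2(y, f_2(x, y)) \}_{x, y} \;\stackrel{c}{\equiv}\; \{ \mathrm{view}_2^{\pi}(x, y) \}_{x, y}
```

For the PSI functionality of Section 3.3.2, $f_1(x, y)$ would be the intersection of the two input sets and $f_2(x, y)$ would be empty, since the second party learns no output.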

#### 4. The New PSI Protocol Based on Point-Value Polynomial Representation

The functionality that is implemented in the new PSI protocol is . Let , , where . The new protocol is shown in Figure 1 and has the following four steps.

(a) Setup: party constructs a public finite field where is a large prime, a pseudorandom function that generates pseudorandom values in , and a vector with distinct nonzero values picked randomly from . Then, it publishes , , and .

(b) Initialization: each party performs the following steps:

(1) Select a dummy number and compute pseudorandom values for and then generate random polynomial

(2) Construct polynomial

(3) Compute vectors and with values and , respectively, for , which are used to represent polynomials and

(4) Pick another random number and generate pseudorandom values for that are used to blind polynomial values

(c) Intersection interaction: party tries to get the sum of two blinded polynomials point-value pairs whose roots are the intersection with party . To do so, the following computations will be performed.

(1) Party computes for and sends the vector to party .

(2) Receiving party message, party blinds the vector as follows: where . Then, it gets the blinded vectors and and sends them to party .

(3) Party computes the blinded vector , whose elements for are computed as follows:

(4) Party removes the blinding factors for as follows: Then, it sends the vector to party .

(d) Intersection result: party gets set intersection.

(1) Party unblinds the blinding factors for as follows:

(2) Party restores polynomials by using point-value pair interpolation for .

(3) Party checks each element for whether it is a root of . If it holds, it is an element of intersection; otherwise it is not.

##### 4.1. Correctness of the New Protocol

Because for ,

Next,

Then,

And we can get the polynomial by point-value pairs for , where . From Definition 3, we can get the intersection by computing the roots of the polynomial .

Thus, the new protocol is correct.
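The algebra behind this correctness argument can be checked with a short Python sketch. It is deliberately simplified and rests on several assumptions: it runs in a single process, it omits the pseudorandom blinding and the message exchange between the parties, and the prime `P`, the function names, and the choice of $2d + 1$ evaluation points are illustrative rather than taken from the protocol description.

```python
import random

P = 2**61 - 1  # a Mersenne prime, chosen here only for illustration


def char_poly_at(elements, x):
    """Evaluate the characteristic polynomial prod (x - s) mod P at x."""
    value = 1
    for s in elements:
        value = value * (x - s) % P
    return value


def poly_at(coeffs, x):
    """Evaluate a polynomial given by its coefficients (Horner's rule)."""
    value = 0
    for c in reversed(coeffs):
        value = (value * x + c) % P
    return value


def lagrange_eval(xs, ys, x):
    """Evaluate at x the unique polynomial through the points (xs[i], ys[i])."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def intersect(set_a, set_b, rng):
    """Return the elements of set_a that are roots of the summed polynomial."""
    d = len(set_a)
    assert len(set_b) == d, "both sets are assumed to have the same size"
    xs = rng.sample(range(1, P), 2 * d + 1)  # distinct nonzero points
    gamma_a = [rng.randrange(1, P) for _ in range(d + 1)]  # random degree-d
    gamma_b = [rng.randrange(1, P) for _ in range(d + 1)]  # polynomials
    ys = [(poly_at(gamma_a, x) * char_poly_at(set_a, x)
           + poly_at(gamma_b, x) * char_poly_at(set_b, x)) % P for x in xs]
    return {a for a in set_a if lagrange_eval(xs, ys, a) == 0}
```

Calling `intersect([3, 7, 11, 19], [7, 19, 23, 42], random.Random(1))` is expected to return the two common elements 7 and 19. A non-common element would be reported only if it happened to be a root of the random polynomial `gamma_b`, which occurs with probability about $d / P$.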

#### 5. Efficient PSI Protocol Using Hashing

We optimize the above protocol using the permutation-based hashing. At first, each party constructs a two-dimensional hash table , where the first dimension is the index of the hashed element and the second dimension stores the elements. Then, each party pads the second dimension with random values to the maximum load. The permutation-based hashing makes each party break down its original set into several small subsets. Thus, it will greatly reduce the degree of polynomials and then significantly improve the efficiency of the protocol. Let , , where . The hashing PSI protocol is shown in Figure 2, and the details include the following steps.

(a) Setup: party selects a permutation-based hashing function with the parameters and for the hash table, where is bins’ size in the hash table and denotes the maximum length in a bin. Next, it constructs a public finite field where is a large prime, a pseudorandom function that generates pseudorandom values in , and a vector with distinct nonzero values picked randomly from . Then, it publishes , , , , , and .

(b) Hashing: each party performs the following.

(1) Create a hash table by doing the following:

(2) For every bin , get an array that holds the size of the actual mapped elements. If its size is less than , pick dummy elements, , and pad them to the bin .

(c) Initialization: each party chooses a dummy quantity , and for each bin, does the following.

(1) Generate a pseudorandom value by using :

(2) Generate pseudorandom values and construct a random polynomial as follows:

(3) Construct a polynomial to represent the elements in the bin :

(4) Choose a random number and get pseudorandom values that are used to blind the polynomial values:

(5) Compute vectors and with values and , respectively, for , which are used to represent polynomials and .

(d) Intersection interaction: party tries to get the sum of two blinded polynomials point-value pairs whose roots are the intersection with party . To do so, the following computation will be performed.

(1) Party computes for and and sends the vector to party .

(2) Receiving party ’s message, party blinds every value as follows: where and . Then, it gets the blinded vectors and and sends them to party .

(3) Party computes the blinded vector as follows: where and . Then, it gets the blinded vectors and sends them to party .

(4) Party removes the blinding factors as follows: where and . Then, it gets the vectors and sends them to party .

(e) Intersection result: for each bin , party restores the subpolynomial by the interpolation and gets the intersection by computing the roots of the subpolynomial.

(1) Party removes the blinding factors for as follows:

(2) Party restores the subpolynomial by using the point-value pairs interpolation .

(3) Party finds the elements of the intersection by computing the roots of polynomial as follows:

where , denotes the size of the actual elements in bin .

##### 5.1. Correctness of the Hashing Protocol

Because for ,

Next,

Then,

And the polynomials are restored using the point-value pairs , where . From Definition 3, the intersection could be learnt by finding roots of the polynomials .

Thus, the hashing protocol is correct.
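The hashing and padding steps that make every bin polynomial have the same small degree can be sketched as follows. This is an illustrative Python sketch under stated assumptions: a plain SHA-256 bin mapping replaces the permutation-based hash function, real elements are assumed to be 32-bit values, and dummy elements are drawn from a disjoint range above $2^{32}$; the names `build_padded_table`, `num_bins`, and `max_load` are hypothetical.

```python
import hashlib
import random


def bin_index(element: int, num_bins: int) -> int:
    """Stand-in bin mapping (the protocol uses permutation-based hashing)."""
    digest = hashlib.sha256(element.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def build_padded_table(elements, num_bins: int, max_load: int, rng):
    """Hash elements into bins, then pad every bin with dummies to max_load.

    Returns the padded table and the number of real elements per bin, which
    is needed later so that only real elements are tested as roots.
    """
    table = [[] for _ in range(num_bins)]
    for e in elements:
        table[bin_index(e, num_bins)].append(e)
    real_sizes = [len(b) for b in table]
    if max(real_sizes) > max_load:
        raise OverflowError("a bin exceeds the maximum load")
    for b in table:
        while len(b) < max_load:
            # Dummies come from outside the assumed 32-bit element domain.
            b.append(rng.randrange(2**32, 2**33))
    return table, real_sizes
```

After padding, every bin holds exactly `max_load` values, so each bin is represented by a polynomial of degree `max_load` instead of one polynomial whose degree equals the size of the whole set. Keeping `real_sizes` mirrors the array of actual bin sizes used in the hashing step.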

##### 5.2. Security Proof

The above hashing PSI protocol securely computes the set intersection in the presence of a semihonest adversary.

Theorem 1. *If is a pseudorandom function, then the hashing PSI protocol is secure in the presence of a semihonest adversary.*

*Proof.* We prove it by considering the cases where each of the parties has been corrupted. In each case, we construct a simulator that is given only the corrupted party’s input and output and generates a simulated view that has to be computationally indistinguishable from the view in the real protocol.

*Case 1.* Corrupted party : in the real protocol, party view is

The simulator is constructed and performs the following steps:

(1) Create an empty view and then append and to it.

(2) Pick a set with elements at random such that .

(3) Construct polynomials and representing sets and , respectively. And generate random polynomials and for .

(4) Generate the random values and for

(5) Blind the polynomials’ values and get random vectors , , , and , where , , , and (, ) are computed as follows:

(6) Compute for , and get random vector .

(7) Insert vectors , , , , and to the view.

So, the view of the simulator ’s construction is

Note that, in both views, the input and the output are identical. Pseudorandom function is used to blind the elements in the real protocol. We can get blinded vectors from the real protocol. On the other hand, through the calculation of the simulator, , and are random vectors. So, and , and , and , and , and are computationally indistinguishable.

*Case 2.* Corrupted party : in the real protocol, party view is

We construct a simulator who has the input and performs the following steps:

(1) Create an empty view and then append to it.

(2) Pick a set with elements at random.

(3) Construct polynomials and , representing sets and , respectively. And generate random polynomials and for .

(4) Generate the random values and for

(5) Blind the polynomials’ values and get random vectors , , , and , where , , , and for , are computed as follows:

(6) Compute for , , and get random vector .

(7) Insert vector , , , , and to the view.

So, the view of the simulator ’s construction is

Note that, in both views, the input is identical and the output is empty. Pseudorandom function is used to blind the elements in the real protocol. We can get blinded vectors . On the other hand, through the calculation of the simulator , , and are random vectors. Thus, and , and , and , and , and and are computationally indistinguishable.

Combining the above, we can get that

Therefore, the hashing protocol is secure in the semihonest model.

#### 6. Evaluation

##### 6.1. Implementation

We ran our experiments on a 64-bit desktop PC running Ubuntu 18.04 with Linux 4.4.0.59. All protocols were implemented and executed on the same hardware, equipped with an Intel Core i7-7700K CPU at 3.6 GHz and 8 GB of RAM. We implemented our protocol and the related protocols [3, 18, 26, 32, 34] in the same environment setting. Our protocol and the related protocols had the same number of input elements, each of which was 32 bits long. Our protocol was implemented using the Number Theory Library (NTL) along with the GNU Multiprecision (GMP) library for polynomial arithmetic.

We give the running times of related protocols in Table 1 and Figure 3. From them, it can be seen that our protocol is more efficient than public-key-based and circuit-based PSI protocols, and it is more efficient than OT-based PSI protocols with the set size less than .

##### 6.2. Experimental Results

A detailed comparison with related PSI protocols is given in Table 2. We evaluate the performance in terms of four properties: whether a cryptosystem is needed, simulation-based security, computation complexity, and communication complexity. From Table 2, our protocol enjoys the following advantages. (1) Our protocol needs no complicated cryptographic system; it uses only hashing and a pseudorandom function and thus provides a lightweight system, whereas the other protocols need an asymmetric or symmetric encryption system. (2) Our protocol gives a detailed formal security proof by using the ideal/real simulation mechanism in the standard model, while [3, 26, 34] only show an informal security analysis. (3) Computation complexity and communication complexity of our protocol are , while both of [26] are .

#### 7. Conclusion and Future Work

In this paper, we proposed a new approach to PSI without any cryptosystem, based on the point-value polynomial representation and a pseudorandom function, and optimized it with hashing techniques. Our protocol had high performance with set elements less than . Our protocols have one constraint: both parties must have the same set degree. In the future, we will extend our approach and study PSI protocols with a lightweight client, where the server has a very large degree but the client’s degree is relatively small.

#### Data Availability

All the pseudocodes used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant nos. 61672010, 61702168, and 61701173 and the fund of Hubei Key Laboratory of Transportation Internet of Things (WHUTIOT-2017B001).