Abstract

Two-party private set intersection (PSI) plays a pivotal role in secure two-party computation protocols. The communication cost of a PSI protocol normally depends on the sizes of both parties' sets. For parties with unbalanced sets, however, the communication cost of existing protocols depends mainly on the size of the larger set, which makes it high. In this paper, we propose a low communication-cost PSI protocol designed specifically for unbalanced two-party private sets. For each item in the smaller set, the receiver queries whether it belongs to the larger set, so that the communication cost depends solely on the smaller set. The queries are implemented by private information retrieval, which is constructed from a trapdoor hash function. Our investigation indicates that each invocation of the trapdoor hash function requires the receiver to transmit both a hash key and an encoding key to the sender, which incurs significant communication cost. To address this concern, we introduce a seed hash key, a seed encoding key, and a Latin square. With these components, the sender can generate all the necessary hash keys and encoding keys on its own, which removes the need to transmit such keys repeatedly. The proposed protocol is provably secure against a semihonest adversary under the Decisional Diffie–Hellman assumption. Our implementation shows that when the sizes of the two sets are and , the communication cost of our protocol is only 3.3% of that of the state-of-the-art protocol, and under 100 Kbps bandwidth we achieve a 1.46x speedup over the state-of-the-art protocol. Our source code is available on GitHub: https://github.com/TAN-OpenLab/Unbanlanced-PSI.

1. Introduction

A private set intersection (PSI) protocol is a special case of secure two-party computation, which allows the receiver and the sender, holding the input sets X and Y, respectively, to compute the intersection $X \cap Y$ without revealing anything else (other than upper bounds on their sizes) [1, 2]. PSI has served many privacy-preserving applications including COVID-19 risk scoring [3], contact tracing [4], advertising conversion rate calculation [5], and mobile privacy contact discovery [6].

The first PSI protocol is based on Diffie–Hellman (DH) key exchange algorithms [7, 8]. Since then, PSI has been extensively studied, and many protocols have been proposed to improve both communication and computation performance. Current PSI protocols are largely based on two technologies, namely DH key exchange [2, 7–10] and oblivious transfer (OT) [3, 11–23]. Among OT-based protocols, some are based on cuckoo hashing [11–19], and some are based on oblivious key-value stores (OKVS). In addition, some protocols are based on RSA [3, 20–23] and homomorphic encryption [24, 25].

When measuring the efficiency of a PSI protocol, the communication cost and the computation cost are the two major aspects [22]. Recent evidence suggests that communication costs are far more important than computation costs [2]. Existing PSI protocols primarily focus on the intersection of two sets with similar sizes and have obtained low communication costs. However, the communication costs of these protocols are influenced by the sizes of both sets, because the communication cost grows linearly with the size of each party’s set. Consequently, for unbalanced set sizes, the larger set dominates the overall communication cost. Chen et al. [25] considered the effect of unbalanced sets on the communication cost and proposed a PSI protocol based on fully homomorphic encryption (FHE) such that the number of messages sent by each party had a linear relationship with the size of the smaller set. However, FHE requires a large ciphertext space, which leaves room for improvement.

PSI protocols for unbalanced sets have many application scenarios. One significant scenario is private contact discovery [6], where a user of a mobile application wishes to identify which of the friends in his or her address book are also users of the same application. However, the server is not allowed to reveal its users’ information, and the user does not want to submit his or her address book. In this case, the server has a large set with all the users, while the mobile side has a relatively small set. Another well-known application scenario of unbalanced PSI is advertising conversion rate calculation [5], where the ad supplier knows the users who have seen a particular ad, and the company knows who made a purchase. The two parties are unwilling to expose the underlying data, but both would still like to compute how many users both saw an ad and made a corresponding purchase. According to our observation, an unbalanced PSI protocol could also be utilized in authentication schemes [26–28] to match authentication information between two users who do not trust each other.

Therefore, we aim to construct a PSI protocol specifically tailored for unbalanced sets, eliminating the effect of the larger set size on the communication cost. By taking this approach, the communication cost is solely determined by the smaller set. The rationale for our research is derived from this motivation.

Focusing on the above motivation, we propose a PSI protocol based on trapdoor hash (TDH) function and Latin square for the intersection of a larger set (sender) and a smaller set (receiver).

Inspired by Döttling et al. [29] and Chase et al. [30], we construct private information retrieval (PIR) from TDH to achieve low communication cost. In this approach, the number of PIR invocations equals the size of the smaller set. However, a hash key and an encoding key would have to be sent from the receiver to the sender for each PIR invocation, resulting in large communication cost. To address this issue, we design a seed hash key, a seed encoding key, and a Latin square, which are sent by the receiver. The sender can then generate all the hash keys and encoding keys by itself from the seed keys. Intuitively, the seed hash key, the seed encoding key, the hash keys, and the encoding keys are vectors with the same dimension. We permute the items in the seed hash key to obtain each hash key and permute the items in the seed encoding key to obtain each encoding key, where the permutation rule is defined by the Latin square.

The main contributions of this paper are as follows:
(1) We design a permutation rule according to the Latin square, by which the sender can generate a range of hash keys and encoding keys for all rounds of the TDH function from only one seed hash key and one seed encoding key. The communication cost of transmitting multiple keys is reduced to that of transmitting two seed keys. The seed keys and the Latin square design do not involve the items of the two sets, so they can be prepared in the offline phase.
(2) In the process of PSI, the smaller set is taken as the ergodic source, and the larger set is taken as the verification source. Every time an item of the smaller set is processed, the TDH function is called once to verify whether the item belongs to the intersection by retrieving it from the larger set. Consequently, the communication cost depends only on the smaller set, and this work is suited to unbalanced sets.
(3) We implement our protocol and publish it on GitHub: https://github.com/TAN-OpenLab/Unbanlanced-PSI.

The proposed protocol is provably secure against a semihonest adversary under the Decisional Diffie–Hellman (DDH) assumption. Performance analysis demonstrates that the proposed protocol enhances the communication performance in PSI protocols in terms of unbalanced sets where the size ratio of two sets exceeds 8.

2. Related Work

Since its introduction, many techniques have been proposed to improve PSI’s performance. In this section, we discuss the state-of-the-art PSI protocols and focus on their communication cost. From here on, we assume that the receiver’s set has items, and the sender’s set has items , where each item has -bit length. We let and denote the statistical and computational security parameters, respectively.

Early PSI protocols based on DH have been around since 1986 [7, 8] and are proven secure against semihonest adversaries. Current PSI protocols can be divided into two categories. The first category is semihonest PSI protocols [11–13, 18–20, 22, 25], and the second category is malicious PSI protocols [2, 3, 14–16, 21, 23].

In semihonest protocols, the parties have to follow the exact prespecified protocol, which implies that they cannot change their inputs or outputs. PSZ14 [11] is based on private equality test, where the receiver and the sender, respectively, insert all their items in the bins by cuckoo hashing and all hash functions. The using of cuckoo hashing reduces the comparisons of equality test from to . The receiver compares bits for each comparison. PSSZ15 [13] is based on PSZ14 [11] and permutation-based hashing technique, which splits each item into left part with bits length and right part with bits length. Only the right part is compared, so bits are compared for each comparison. PSSZ15 [13] and PSZ14 [11] are based on the OT extension protocol proposed by KK13 [31]. KKRT [12] improved KK13 [31] by extending 1-out-of-256 OT to 1-out-of- OT and proposed batch, related-key oblivious pseudorandom function (OPRF). Their PSI protocol is 3.1–3.6× faster than PSSZ15 [13]. CLR17 [25] used FHE and improved it by batching, windowing, and modulus switching. They constructed a PSI protocol for unbalanced sets and achieved a communication overhead of . PRTY19 [20] improved the OT extension protocol proposed by IKNP03 [32], and proposed lightweight PSI based on sparse OT extension. The sender generates a polynomial P using its set and sends P to the receiver. The receiver computes the corresponding values of his items in P. For each item in the intersection, the results computed by the two parties will be the same. The communication cost of PRTY19 [20] is 40%–50% lower than that of KKRT [12], so it is targeted at low-bandwidth situations. However, the computation of PRTY19 [20] is not as efficient as KKRT [12], so KKRT is faster at high bandwidth. CM20 [22] achieves a better balance between KKRT [12] and PRTY19 [20]. Single-point OPRF of KKRT [12] is extended to multipoint OPRF, where the key of PRF is a matrix and a single PRF will compare all the items.

In malicious protocols, the parties may not follow the exact prespecified protocol, thus the inputs from both parties need to be verified. On the basis of KKRT [12], PRTY20 [21] added homomorphic function as linear error correcting code [16] and proposed a malicious security PSI protocol PaXoS, which was almost as fast as KKRT [12]. RT21 [2] presented a construction for a batched OPRF based on vector-OLE and the PaXoS data structure. GPR21 [23] considered that cuckoo hashing will lead to a failure probability p of OKVS structure. They therefore showed novel techniques to improve OKVS such that the failure probability was reduced to for a constant . RT21 [2] pointed out that OT-based PSI protocols required a certain number of base OTs first, which applies to large sets. On small sets, DH-PSI protocols are less costly, so they proposed a DH-PSI-based PSI for small sets, and reduced the communication cost by interpolating polynomials.

3. Preliminaries

3.1. PSI Functionality

We use the PSI functionality described in Chase and Miao’s [22] study. PSI allows two parties to compute the intersection of their data sets without revealing any additional information, as shown in Figure 1.

3.2. Security Model

We use the security model described in David et al.’s [1] and Goldreich’s [33] studies. PSI is a cryptographic protocol of secure two-party computation. Two adversarial models are usually considered, namely the semihonest model and the malicious model. A semihonest party follows the protocol properly, except that it keeps a record of all its intermediate computations and may try to learn as much as possible from the messages it receives from other parties. A malicious adversary may deviate arbitrarily from the prescribed protocol in an attempt to violate security. The semihonest model and the malicious model are designed for different application scenarios, so both have practical and research value. Our protocol is designed under the semihonest model.

We say the protocol is secure if we can construct simulators who can generate the outputs without the information of the private sets. The outputs should be indistinguishable in probabilistic polynomial time from those generated by the real sender and receiver, respectively. This means that even if a semihonest adversary corrupts the sender or the receiver, it cannot obtain any meaningful information about the private sets.

3.3. Decisional Diffie–Hellman Assumption

Our protocol relies on the DDH assumption [34], which we state in the following.

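Definition 1. Decisional Diffie–Hellman (DDH) assumption. A (prime-order) group generator is an algorithm $\mathcal{G}$ that takes as an input a security parameter $1^{\lambda}$ and outputs $(\mathbb{G}, p, g)$. $\mathbb{G}$ is a multiplicative cyclic group of order $p$ with generator $g$, where $p$ is always a prime number. We say that $\mathcal{G}$ satisfies the DDH assumption (or is DDH hard) if for any PPT adversary $\mathcal{A}$, it holds that:

$$\left|\Pr\left[\mathcal{A}\left((\mathbb{G}, p, g), (g^{a}, g^{b}, g^{ab})\right) = 1\right] - \Pr\left[\mathcal{A}\left((\mathbb{G}, p, g), (g^{a}, g^{b}, g^{c})\right) = 1\right]\right| \le \mathrm{negl}(\lambda), \tag{1}$$

where $(\mathbb{G}, p, g) \leftarrow \mathcal{G}(1^{\lambda})$ and $a, b, c \leftarrow \mathbb{Z}_p$.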

3.4. Latin Square

The definition of Latin square is similar to [35].

Definition 2. Latin square. Let be a positive integer, and let be the set of distinct elements. A Latin square of order on is an -by- matrix, where each element of belongs to and each element of occurs once in each row and once in each column of .
In this paper, the following Latin square design method is used to generate an -order Latin square . Let the element of the row and the column be .
Step 1. Randomly shuffle the elements in and let the result be row 0 of as .
Step 2. Generate other elements of . For each element in row and column :
where , .

According to the Latin square design above, we can obtain a Latin square of order . Each row can be seen as an arrangement of the elements of . Therefore, we regard and as the permutation rule by which we can permute a matrix to a new matrix . Let be an arbitrary matrix with columns. For each , let the column of be the column of if . Then, is a new matrix with columns
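The construction and the permutation rule can be illustrated with a minimal Python sketch. It rests on two assumptions that are not taken from this paper: row `i` is obtained from row 0 by a cyclic left shift of `i` positions (one standard way to build a Latin square from a shuffled first row, which may differ from the rule in Step 2), and a column moves to the position where its symbol in the target row appears in row 0. The function names `gen_latin_square`, `is_latin_square`, and `permute_columns` are hypothetical.

```python
import random

def gen_latin_square(domain, rng=random):
    """Row 0 is a random shuffle of the domain; row i is row 0
    shifted left by i positions (assumed cyclic rule)."""
    row0 = list(domain)
    rng.shuffle(row0)
    n = len(row0)
    return [[row0[(i + j) % n] for j in range(n)] for i in range(n)]

def is_latin_square(square):
    """Check that every symbol occurs exactly once per row and per column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[i][j] for i in range(n)} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok

def permute_columns(matrix, row0, row_c):
    """Move column m of `matrix` to column m2, where row_c[m] == row0[m2]
    (the direction of the move is an assumption of this sketch)."""
    pos_in_row0 = {sym: j for j, sym in enumerate(row0)}
    out = [[None] * len(row0) for _ in matrix]
    for m, sym in enumerate(row_c):
        m2 = pos_in_row0[sym]
        for r, row in enumerate(matrix):
            out[r][m2] = row[m]
    return out

square = gen_latin_square(range(8), random.Random(1))
labels = [list("abcdefgh")]
shifted = permute_columns(labels, square[0], square[3])
```

Under the cyclic rule, using row 3 as the target row moves every column three positions to the right (with wrap-around), so a single row number is enough to describe a full column permutation.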

3.5. Private Information Retrieval from Trapdoor Hash Function

TDH function was proposed by Döttling et al. [29]. In this section, we introduce a PIR [36] scheme using TDH.

In PIR, the sender has a private bitstring of length , and the receiver wants to know the bit . The sender will not reveal any information except . Let be a multiplicative cyclic group of prime order , and is a generator of the group.

Receiver samples the trapdoor and samples an -dimensional vector of random group elements as the seed hash key, as shown in Figure 2. Then, computes a corresponding encoding key as , where for every , is to the power . The only exception is which is set as times . Receiver sends , , and to sender. Sender samples and calculates the hash value and the encoding value , as follows:

Collision resistance of Function (3) can be routinely established from the discrete logarithm assumption in .

Then, the sender sends and to the receiver, who verifies whether or , where means and means . For each item in the set of the receiver, the two parties invoke PIR to compute whether it belongs to the set of the sender.
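The following Python sketch illustrates single-bit PIR from a trapdoor hash over a toy group. It is a simplified reading of the DDH-based trapdoor hash of Döttling et al. [29], not a reproduction of Equations (3) and (4). The assumptions are: a tiny safe-prime group (`P = 2039`, `Q = 1019`, generator `GEN = 4`) that is far too small for real use, a sender that multiplies in the key elements at the positions where its bitstring is 1, and a receiver that also hands over `GEN**s` so that the sender can randomize the encoding. All function names are hypothetical.

```python
import secrets

P = 2039   # toy safe prime, P = 2*Q + 1 (illustration only)
Q = 1019   # prime order of the subgroup of squares modulo P
GEN = 4    # generator of that subgroup

def keygen(n, i):
    """Receiver: trapdoor s, hash key, and an encoding key marked at index i."""
    s = secrets.randbelow(Q - 1) + 1
    hk = [pow(GEN, secrets.randbelow(Q - 1) + 1, P) for _ in range(n)]
    ek = [pow(g_j, s, P) for g_j in hk]      # u_j = g_j^s
    ek[i] = ek[i] * GEN % P                  # exception: u_i = g_i^s * g
    return s, hk, ek, pow(GEN, s, P)

def encode(hk, ek, gs, bits):
    """Sender: hash value h and encoding value e of its bitstring."""
    r = secrets.randbelow(Q)
    h = pow(GEN, r, P)
    e = pow(gs, r, P)
    for g_j, u_j, b in zip(hk, ek, bits):
        if b:
            h = h * g_j % P
            e = e * u_j % P
    return h, e

def decode(s, h, e):
    """Receiver: e == h^s means bit 0, e == h^s * g means bit 1."""
    hs = pow(h, s, P)
    if e == hs:
        return 0
    if e == hs * GEN % P:
        return 1
    raise ValueError("malformed response")

database = [0, 1, 1, 0, 1]           # sender's private bitstring
s, hk, ek, gs = keygen(len(database), 2)
h, e = encode(hk, ek, gs, database)
bit = decode(s, h, e)                # receiver learns database[2]
```

The response consists of two group elements regardless of the length of the bitstring, which is why the online communication does not grow with the sender's input.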

4. The Proposed PSI Protocol

The proposed PSI protocol contains offline phase and online phase. In the offline phase, the PSI preparation is performed which does not involve the set items of two parties. In the online phase, two parties complete PSI with their private sets. Receiver and sender hold the private sets and , respectively. Let be the input domain, containing all the possible items of and . The parameters of the proposed protocol are shown in Table 1.

The framework of the proposed protocol is described in Figure 3. In the offline phase, receiver obtains row 0 of Latin square by shuffling , samples the trapdoor , the initial column number and the seed matrix . Then receiver sends , , and to sender.

In the online phase, both parties employ PIR to determine whether each item in the receiver’s set belongs to the sender’s set . First, the receiver maps the set item to the specific row number of Latin square and sends it to the sender. Then, the sender generates the row of Latin square, which satisfies that . The sender takes and as the permutation rule to permute the seed key matrix to the key matrix of . Then, the sender encodes to obtain the hash value and the encoding value of , and send them to the receiver. Finally, the receiver decodes and obtains whether belongs to .

4.1. The Offline Phase

Let algorithm be a prime-order group generator that takes as an input a security parameter and outputs , where is a multiplicative cyclic group, order is a prime number, and is the generator. The algorithm Shuffle() takes a vector as an input and shuffles all the elements in the vector to form a new vector.

The offline phase consists of the following steps.
Step 1. Using the method described in Section 3.4, the receiver generates an n-dimensional vector and sends it to the sender.
Step 2. The receiver samples the trapdoor , and samples the initial column number .
Step 3. The receiver generates the seed key matrix . In detail, the receiver samples an n-dimensional vector of random group elements as the seed hash key . Then the receiver calculates the seed encoding key as follows:
where for every , . The only exception is which is set as . Let the seed hash key be the 0th row of matrix $G$, while the seed encoding key is taken as the first row of matrix $G$. We have the equation as follows:

Then, the receiver sends to the sender.

4.2. The Online Phase

In the online phase, the receiver and the sender calculate the intersection of their sets. There are items in receiver’s set , and items in sender’s set . For each item , both parties invoke PIR to determine whether it belongs to the sender’s set .

Recall that the receiver sends the seed key matrix to the sender in the offline phase, and is in the column of . To calculate , we should permute the columns of to obtain such that is in the column of . Each row in the Latin square contains all the items in in different order. Let row of Latin square be the target row such that . The process to obtain is shown in Algorithm 1.

Input:S0,k, xi, k
Output:ci
  a = 0
  While
   a = a + 1
   mod n
  EndWhile
  ci = a

Let be an algorithm which calculates row by row of the Latin square. For each column :

We can obtain a permutation rule from row and row . For each column of , let be the index of column such that . Then, set the row of equal to the column of , namely . The process is shown in Algorithm 2.

Input: G, S0,⋅, Sci,⋅
Output: Gi
  For m = 0 to n−1
   m′ = 0
   While Sci,m ≠ S0,m′
    m′ = m′ + 1
   EndWhile
   
   
  EndFor
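To make the bookkeeping of Map(), GenLS(), and GenKey() concrete, here is a minimal Python sketch that uses string labels in place of group elements. It is a sketch under stated assumptions rather than the paper's exact algorithms: the Latin square is assumed to be cyclic (`S[c][j] = S[0][(c + j) % n]`), the target row is assumed to satisfy `S[c][k] == x`, and the seed column `m` is assumed to move to the position `m2` with `S[c][m] == S[0][m2]`. The names `map_item`, `gen_ls_row`, and `gen_key` are hypothetical.

```python
import random

def map_item(row0, x, k):
    """Receiver: smallest row number a such that S[a][k] == x under the cyclic rule."""
    if x not in row0:
        raise ValueError("item is outside the input domain")
    n = len(row0)
    a = 0
    while row0[(k + a) % n] != x:
        a += 1
    return a

def gen_ls_row(row0, c):
    """Sender: rebuild row c of the Latin square from row 0 (assumed cyclic rule)."""
    n = len(row0)
    return [row0[(c + j) % n] for j in range(n)]

def gen_key(seed, row0, row_c):
    """Sender: permute the columns of the two-row seed key matrix."""
    n = len(row0)
    key = [[None] * n for _ in seed]
    for m in range(n):
        m2 = 0
        while row_c[m] != row0[m2]:      # find where the symbol sits in row 0
            m2 += 1
        for r in range(len(seed)):
            key[r][m2] = seed[r][m]      # assumed direction of the move
    return key

rng = random.Random(7)
row0 = list(range(10, 18))
rng.shuffle(row0)
k = 2                                     # initial column number (receiver's secret)
seed = [[f"g{j}" for j in range(8)],
        [f"u{j}" + ("*" if j == k else "") for j in range(8)]]  # "*" marks the special entry
x = 13
c = map_item(row0, x, k)
row_c = gen_ls_row(row0, c)
key = gen_key(seed, row0, row_c)
```

With these conventions, the marked entry of the seed encoding key ends up in the column where `x` appears in row 0, so the sender can index its own bitstring by row 0 without learning `k` or `x` from the key layout itself.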

The row of matrix is the hash key, while the first row of matrix takes as the encoding key. Let be the algorithm which takes the hash value and the encoding value as output. More specifically, the sender samples , and calculates the hash value and the encoding value, as follows:

Then, sender sends to receiver.

Observe that when , then is equal to , and that otherwise, it is equal to . Let be the algorithm which decodes by the trapdoor and outputs the result bit . Let denotes the three cases above:

In summary, the steps of the online phase are as follows. For each item $x_i \in X$:
Step 1. The receiver calculates $c_i$ from $x_i$ by algorithm Map(), and sends $c_i$ to the sender.
Step 2. The sender generates row $S_{c_i}$ by algorithm GenLS(), and generates the key matrix $G_i$ by algorithm GenKey(). The sender thereby obtains the hash key and the encoding key for $x_i$.
Step 3. The sender calculates the hash value and the encoding value by Encode() and sends them to the receiver.
Step 4. The receiver decodes them and obtains whether $x_i$ belongs to $Y$.

For every item in , the receiver and the sender repeat the steps above and obtain as follows:
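The whole flow can be sketched end to end in Python with a toy group. This is an illustration under assumptions, not the authors' implementation (which is in C++): the group is tiny (`P = 2039`, `Q = 1019`, `GEN = 4`), the Latin square is assumed cyclic, the sender is assumed to index its bitstring by row 0, the receiver is assumed to publish `GEN**s` alongside the seed keys, and `random.shuffle` stands in for a cryptographic shuffle. All names are hypothetical.

```python
import random
import secrets

P, Q, GEN = 2039, 1019, 4   # toy safe-prime group, illustration only

def offline(domain):
    """Receiver: row 0, trapdoor s, column k, and the two-row seed key matrix."""
    row0 = list(domain)
    random.shuffle(row0)                 # not a cryptographic shuffle
    n = len(row0)
    s = secrets.randbelow(Q - 1) + 1
    k = secrets.randbelow(n)
    hk = [pow(GEN, secrets.randbelow(Q - 1) + 1, P) for _ in range(n)]
    ek = [pow(g_j, s, P) for g_j in hk]
    ek[k] = ek[k] * GEN % P              # special entry in column k
    public = {"row0": row0, "seed": (hk, ek), "gs": pow(GEN, s, P)}
    return public, {"s": s, "k": k}

def map_item(row0, x, k):
    """Receiver: row number c with S[c][k] == x under the cyclic rule."""
    return (row0.index(x) - k) % len(row0)

def respond(public, c, sender_set):
    """Sender: GenLS + GenKey (a cyclic column shift by c here), then Encode."""
    row0, (hk, ek) = public["row0"], public["seed"]
    n = len(row0)
    hk_i = [hk[(j - c) % n] for j in range(n)]
    ek_i = [ek[(j - c) % n] for j in range(n)]
    r = secrets.randbelow(Q)
    h, e = pow(GEN, r, P), pow(public["gs"], r, P)
    for j in range(n):
        if row0[j] in sender_set:        # bit j of the sender's bitstring
            h = h * hk_i[j] % P
            e = e * ek_i[j] % P
    return h, e

def decode(s, h, e):
    """Receiver: 1 if the queried item is in the sender's set, else 0."""
    hs = pow(h, s, P)
    if e == hs:
        return 0
    if e == hs * GEN % P:
        return 1
    raise ValueError("malformed response")

def psi(receiver_set, sender_set, domain):
    public, secret = offline(domain)
    result = set()
    for x in receiver_set:
        c = map_item(public["row0"], x, secret["k"])
        h, e = respond(public, c, sender_set)
        if decode(secret["s"], h, e) == 1:
            result.add(x)
    return result

domain = range(100, 116)
intersection = psi({101, 105, 110}, {100, 105, 110, 115}, domain)
```

Each online round exchanges one row number and two group elements, and the seed keys are transferred only once, before any set item is involved.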

5. Example Analysis

In this section, we offer an illustrative instance of the proposed PSI protocol aimed at showcasing its practical feasibility.

Let be the input domain, and be the number of elements of . The receiver and the sender, respectively, hold sets and of sizes and . Let be a multiplicative cyclic group of order and with generator .

In the offline phase, receiver shuffles and obtains the row of Latin square .

After sampling the trapdoor and the initial column number , the receiver samples from as the row of the seed key matrix and calculates , where for every , . The only exception is which is set as . Let be the first row of the seed matrix . We have the equation as follows:

Finally, the receiver sends to sender.

In the online phase, the receiver determines the items of set that also belong to . For the item , the receiver finds out such that , and sends to the sender.

The sender calculates the fifth row of from the row using Equation (7) and permutes the column vectors of according to the row and the fifth. It is evident that , thus the second column of should be the same as the column of . Similarly, as , the seventh column of should match the first column of . In the same vein, we have the following equation:

Then, the sender calculates the hash value and the encoding value , and sends to receiver.

The receiver calculates and finds that , thus and . For the item , the receiver finds out such that and sends to sender.

Sender permutes the column vectors of according to the row and the first row of . We can see that , thus let the fifth column of be the column of . Similarly, since , let the column of be the first column of . In the same vein, we have the following equation:

Then, the sender calculates , , and sends to receiver.

The receiver calculates and finds that , thus and . Finally, the receiver obtains .

6. Proof of Security

Our protocol is secure against semihonest adversaries under the DDH assumption. Relying on the established theory of security proofs [37–39], this section proves the security of the proposed protocol against a corrupt sender and a corrupt receiver, respectively.

6.1. Security against the Corrupt Sender

Theorem 1. The proposed protocol is resistant against the corrupt sender under the DDH assumption. Formally, we construct a simulator that takes the inputs and the outputs is indistinguishable from the real receiver.

Proof. According to the proposed protocol, the messages that receiver sends to sender are the row of Latin square , seed key matrix , and the row number . is generated by Shuffle() and thus indistinguishable from random. Consequently, we focus on the security of seed key matrix and the row number in this section. We prove is indistinguishable from the real receiver via the following hybrid argument.

Hybrid 0: Hybrid 0 is the real interaction. In the offline phase, receiver generates and sends seed key matrix honestly. In the online phase, for each item of X, receiver performs Map() and sends the row number according to Section 4.

Hybrid 1: Same as Hybrid 0 except that is replaced with a random matrix .

Recall that the row of is randomly sampled and indistinguishable from random. The elements in the first row of are calculated by the elements in the row. In this hybrid, the elements in the first row of are replaced by random elements and have the following equation:

Let , and the matrix:

The row of equals to that of . For the first row, the element to the element are equal to the element to the element in row 1 of , and the element to the element are equal to the element to the element of . Obviously, . When , then:

The distinction between and is the element in the first row and the column. When , let and . Recall that and are only held by the receiver. Let:

We have and as the four values are generated randomly. It can be shown that as . Consequently, we have and . Since and are indistinguishable under DDH assumption, . Then, we have . When , just let , and the conclusion is well-supported in the same vein. Then, we can observe the following equation:

Consequently, , namely Hybrid 0 and Hybrid 1 are indistinguishable.

Hybrid 2: Same as Hybrid 1, except that we replace all the row numbers with random .

We prove Hybrid 1 and Hybrid 2 are indistinguishable in two aspects. When , namely there is only a single element in set . In this case, the relationship among different row numbers is not considered, and we focus on the security of a single row number. When , we focus on the relationship among different row numbers.

As described in Section 4.2, is the row number of Latin square , which denotes the permutation rule of the column vectors of seed key matrix . Now the seed key matrix is replaced with in Hybrid 1. Let the permutation result of be and according to and , respectively. Since all the elements of are random, and . Consequently, when , as well as and are indistinguishable.

When , for each , let vector , where the to the elements are random and the to the elements are true row numbers. Thus, , and the only difference between and is the element. Let be an arbitrary element of except . Let and . For an arbitrary column of , we have and . Every row of contains all the elements of , hence there exist elements equal to and , respectively, in the row . Let the column index of in row be , and let the column index of in row be , namely and . Hence:

and can be denoted by the elements of row 0 as and . Then plug them into Equations (21) and (22), and further we have the following equations:where and are column indexes. Since the row is sorted randomly, the distinction between and is random under any , resulting in the following equation:

It can be shown that:

Consequently, Hybrid 1 and Hybrid 2 are indistinguishable.

Taken together, simulator can be constructed to simulate the receiver, such that the simulation is indistinguishable from the real interaction. Consequently, the proposed protocol is resistant against the corrupt sender under the DDH assumption.

6.2. Security against the Corrupt Receiver

Theorem 2. The proposed protocol is resistant against the corrupt receiver. Formally, we construct a simulator that takes the inputs and the outputs is indistinguishable from the real sender.

Proof. To calculate each item in , the only message that the sender sends to the receiver is the hash value and the encoding value . In this section, we construct a simulator who holds , where denotes whether belongs to the intersection. denotes does not belong to the intersection, and denotes belongs to the intersection. We prove that the simulation is indistinguishable from the real via the following hybrid argument.

Hybrid 3: The real interaction. To respond each received from receiver, sender samples , generates and sends the verification information honestly as shown in Equations (8) and (9).

Hybrid 4: Simulator receives , and samples . Then calculates:

Simulator sends to the receiver. The corrupt receiver calculates:

When , , the receiver obtains . When , , the receiver obtains . We have , thus receiver cannot distinguish and from the relationship between and . Due to the collision resistance of TDH function, distinguishing and is the discrete logarithm problem. Consequently, we have .

Taken together, Hybrid 3 and Hybrid 4 are indistinguishable. The proposed protocol is resistant against the corrupt receiver.

7. Performance Evaluation

7.1. Comparison of Communication

To demonstrate the communication performance of the proposed protocol, we compare it with the state-of-the-art PSI protocols. The communication costs of different protocols are shown in Tables 2 and 3. Since [2, 3, 21–23] proposed both malicious and semihonest protocols, we compare with the semihonest versions only.

We set the computational security and statistical security . is the size of elliptic curve group elements (256 is used here). The costs of base OTs are independent of input size and equal to . denote the sizes of receiver’s set, sender’s set, and input field , and we set . are the parameters of FHE, where and . increases as get higher, and Table 2 shows the distinct values of under distinct according to CLR17 [25]. denotes the width of OT extension matrix. is the upper bound on the number of cycles in a cuckoo graph of PaXoS. is the maximum stash size for cuckoo hashing. When three hash functions are utilized to map elements to bins, the relationship between and is shown in Table 3 according to KKRT [12].

Table 4 shows the communication costs of different protocols when and ranges from to . It is apparent from this table that the communication cost of the proposed protocol is proportional to and is not related to , thus the communication cost decreases as get smaller. When equal to or , the communication cost of the proposed protocol is higher than some of the other protocols. However, when , the communication cost of the proposed protocol is the lowest. The reason is the communication costs of the other protocols are related to both and . When the sizes of the two sets are , the ratio of them is , and the communication cost required by our protocol is 55.14% of the state-of-the-art protocol. When the size of the two sets is , the communication cost required by our protocol is only 0.6% of the state-of-the-art protocol.

Table 5 shows the communication costs of different protocols when and ranges from to . It is shown that the communication cost of the proposed protocol is invariant while the communication costs of other protocols rapidly increase with . When , the communication costs of other protocols are higher than the proposed protocol. In addition, the advantage of the proposed protocol is increasingly apparent as get larger.
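The dependence on the receiver's set alone can be made explicit with a small back-of-the-envelope calculator in Python. It is a rough sketch, not the cost formula from Table 2: it assumes that each online round carries one row number and two group elements (as described in Section 4.2), uses the 256-bit group element size stated above, and takes `index_bits` as a hypothetical width for encoding a row number. Framing overhead and the one-time offline transfer are ignored.

```python
def online_cost_bits(num_receiver_items, group_bits=256, index_bits=32):
    """Online traffic: per item, one row number plus a hash value and an encoding value."""
    per_item = index_bits + 2 * group_bits
    return num_receiver_items * per_item

def per_query_key_cost_bits(num_receiver_items, domain_size, group_bits=256):
    """Traffic if a fresh hash key and encoding key (domain_size group
    elements each) were sent for every query instead of one seed pair."""
    return num_receiver_items * 2 * domain_size * group_bits

def seed_key_cost_bits(domain_size, group_bits=256):
    """One-time traffic for the seed hash key and seed encoding key."""
    return 2 * domain_size * group_bits

online = online_cost_bits(256)
naive_keys = per_query_key_cost_bits(256, 1024)
seed_keys = seed_key_cost_bits(1024)
```

None of the three functions takes the sender's set size as an argument, and sending per-query keys costs a factor of `num_receiver_items` more than sending the seed pair once.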

7.2. Experimental Results

In order to evaluate the performance of our PSI protocol, we built and evaluated an implementation. Our source code is available on GitHub: https://github.com/TAN-OpenLab/Unbanlanced-PSI.

We implement our protocol in C++, and run it on Ubuntu 16.04 with 8 GB RAM. We set , and other parameters are the same as in Section 7.1. We set the values of and , and record the communication cost and online time. As Table 6 shows, the communication cost of the proposed protocol is 17 KB when and is not related to . The advantage of the proposed protocol in communication cost is increasingly apparent as increases. In particular, when and , the communication cost of our protocol is 42.5% of that of the best existing protocol, RT21 [2]. When , our protocol requires only communication cost of RT21 [2]. However, the online times show that the computation cost still needs to be reduced.

Due to the low communication cost, the proposed protocol is more suitable for the scenarios with low network bandwidth. As shown in Table 7, for the specific set sizes, the online time changes little with network LAN, 1 Mbps and 100 Kbps bandwidths. Although our protocol is not the fastest with network LAN and 1 Mbps bandwidth, we gain an apparent advantage with 100 Kbps bandwidth. With the set sizes and 100 Kbps bandwidth, our protocol achieves a 7.8x speedup compared to KKRT [12], a 5.31x speedup compared to SpOT [20], a 12.78x speedup compared to PaXoS [21], a 5.37x speedup compared to CM20 [22] and 1.46x speedup compared to RT21 [2]. Thus, our protocol is applicable to low bandwidth networks. With unbalanced sets , , and , our protocol is faster than other protocols under 100 Kbps bandwidth, and we achieve 1.03x, 1.26x, and 1.46x speedup compared to RT21 [2], which proves our protocol is applicable to two sets with larger difference.

We present the computation cost intuitively in Figure 4. When we fix the value of to and set the bandwidth to 100 Kbps, it is evident that for all protocols, the computation cost rises as the set size increases. The relationship between the set size of and the online time is linear, and the online time of our protocol is the lowest compared to the other protocols.

8. Conclusion

We propose an efficient semihonest PSI protocol for unbalanced sets based on trapdoor hashing and a Latin square, which relies on the DDH assumption. By employing trapdoor hashing, the communication cost depends only on the smaller set, which eliminates the impact of the larger set size on communication cost. The use of the Latin square reduces the number of times encoding keys need to be sent, enhancing communication efficiency. The results of the performance analysis indicate that the proposed protocol reduces communication cost specifically for unbalanced sets over low-bandwidth networks. Furthermore, the advantage of our protocol becomes more prominent as the disparity between the sizes of $X$ and $Y$ increases. In future work, we will focus on reducing the computation time and storage cost of the proposed protocol.

Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the National Key Research and Development Program of China under Grant No. 2023YFC3306201, the Fundamental Research Funds for the Central Universities No. N2317004, and the National Natural Science Foundation of China No. 61772125.