Abstract
Contact tracing is a critical tool for containing epidemics such as COVID-19, and it has been studied extensively. However, almost all existing works assume that clients and authorities have ample storage and computing power, so that clients can run contact tracing on their own mobile devices such as mobile phones, tablet computers, and wearable computers. As outbreaks spread and datasets grow, these approaches scale poorly for resource-constrained clients. To address this limitation, we propose a publicly verifiable contact tracing algorithm in cloud computing (PvCT), which uses cloud services to provide the storage and computation that contact tracing requires. To guarantee the integrity and accuracy of contact tracing results, PvCT applies a novel set accumulator-based authentication data structure whose computation is outsourced, while the client can still check whether the returned results are valid. Furthermore, we provide a rigorous security proof of our algorithm based on the Strong Bilinear Diffie–Hellman assumption. A detailed experimental evaluation is conducted on three real-world datasets. The results show that our algorithm requires only milliseconds of client CPU time and reduces the client's storage overhead from the size of the dataset to a constant 128 bytes.
1. Introduction
In the public health domain, contact tracing is a critical approach for identifying people who may have come into contact with individuals diagnosed with an epidemic disease such as Ebola virus disease, H1N1 influenza, or coronavirus disease 2019 (COVID-19). By tracing the contacts of infected individuals and treating them appropriately based on their test results, public health departments can contain and mitigate the community transmission of infectious diseases. Historically, contact tracing has been an important tool in the fight against epidemics. For example, during the 2014–2016 outbreak of Ebola in West Africa, the World Health Organization issued guidelines on conducting contact tracing to break transmission chains of the Ebola virus [1]. Today, contact tracing also plays a critical role in efforts to contain the ongoing COVID-19 pandemic.
Researchers have carried out a great deal of work on contact tracing [2–8]. For instance, the authors of [4] put forward a privacy-preserving contact tracing system based on a secure two-party private set intersection cardinality technique. The authors of [6] develop a Bluetooth-based contact tracing system. The authors of [7] focus on privacy leakage and propose a blockchain-based privacy-preserving contact tracing algorithm. The authors of [8] propose a contact tracing algorithm with access control, which guarantees that only authorized people can execute the contact tracing process. However, almost all existing works assume that clients and authorities have ample storage and computing power, so that clients can run contact tracing on their own mobile devices such as mobile phones, tablet computers, and wearable computers. In this literature, authorities such as hospitals or the Centers for Disease Control and Prevention (CDC) are required to store all travel records of diagnosed people, and clients are required to store all of their own travel records locally.
However, as a disease spreads, the number of travel records of diagnosed people grows rapidly. Besides classical travel records, such as an accurate location and the relevant timestamp, there is a special category of travel records: transportation data. It contains information such as train or flight numbers and license plates, which is also quite useful in the contact tracing process. The more complete these travel records are, the more comprehensive the contact tracing can be. In practice, travel records and transportation data are collected through mobile crowdsensing (MCS) technology, which senses and collects data using the various mobile devices belonging to the clients and authorities.
These terminal devices have certain limitations. First, they have limited storage and computation resources, so they cannot bear the storage and computation burden incurred by the rapid growth of travel records. Second, with terminal devices alone, it is difficult to synchronize all the data and coordinate the computation among clients. If clients are organizations such as colleges, synchronization delay can adversely affect the accuracy and efficiency of disease control over all of their students and staff; that is, latency can directly lead to the failure of epidemic prevention. Hence, clients and authorities need other ways to cope with the constraints of limited mobile devices. To the best of our knowledge, no prior work can be applied to a scenario in which clients are resource-constrained and efficient management is required. With cloud computing services offered by cloud service providers (CSPs) such as AWS and Azure, a client can pay for the resources he/she lacks and release them once the work is finished. In this paper, we therefore resort to CSPs for storage and computation assistance.
Nevertheless, introducing CSPs brings up new issues:
(1) CSPs can be compromised by a malicious adversary [9].
(2) Even if CSPs honestly follow the clients' rules, problems such as program malfunction, data loss, or unintended mistakes can still occur.
Therefore, CSPs cannot be fully trusted, and there is no guarantee that the returned results are correct and complete. In the contact tracing scenario, however, the accuracy and integrity of the returned result are of vital importance and can affect tens of thousands of lives. Suppose that someone is a close contact of a person diagnosed with COVID-19, but the CSPs return a false negative contact tracing result indicating that he/she is not. In this case, he/she cannot obtain timely treatment, and his/her own close contacts can no longer be tracked. This hinders the containment of the pandemic.
In this paper, we aim to solve the problem of how to achieve both accuracy and integrity of contact tracing results in an untrusted cloud computing setting. Currently, the mainstream method (if not the only method) for satisfying these requirements is to use verifiable computation techniques. One possible solution is to use general verifiable computation schemes [10, 11] built on techniques such as succinct non-interactive arguments of knowledge (SNARKs). However, almost all existing SNARK algorithms are too complicated to be deployed in practice [12]. Thus, in this paper, we adopt another technique, which enables ad hoc verifiable computation by constructing an authentication data structure (ADS) [13, 14]. Briefly, an ADS is a data structure whose computation can be outsourced to an untrusted server while the client can still check whether the returned result is valid.
Several ADS-based query authentication techniques have been studied for outsourced databases [15–17]. However, several challenges make these conventional schemes inapplicable to contact tracing. First, the conventional schemes rely heavily on a single data owner who signs the ADS using a secret key. In contrast, the contact tracing scenario involves two data owners and two phases: the query phase and the matching phase. In the matching phase in particular, only the authority can append new records of diagnosed people to its database; a client cannot act as the authority in this phase, because he/she does not have the authority's secret key and cannot sign its ADS. Second, a traditional ADS is constructed on a fixed dataset and cannot be efficiently adapted to a contact tracing scenario in which the data grow without bound as the disease spreads. Third, in conventional outsourced databases, the ADS is regenerated to support more queries, which is difficult for clients with limited resources. Thus, a more generic ADS is preferable to support the different phases that may occur in the contact tracing scenario.
To address these issues, we propose a novel set accumulator-based ADS scheme that enables public verification of contact tracing and guarantees both accuracy and integrity. On that basis, we propose a novel framework called publicly verifiable contact tracing (PvCT), which employs publicly verifiable computation techniques to guarantee both the integrity and the accuracy of contact tracing results. More specifically, we provide each client and authority with an additional ADS. Based on this ADS, an untrusted CSP can construct and return a cryptographic proof, known as a verification object (VO), with which clients verify the result of contact tracing. The information flows among CSPs, clients, and authorities are illustrated in Figure 1.
To summarize, our contributions in this paper are as follows:
(i) We propose a novel PvCT framework and develop a set of verification algorithms that leverage our ADS to guarantee both the integrity and the accuracy of contact tracing results over both traditional travel records and transportation data.
(ii) We provide a rigorous security proof, based on the Strong Bilinear Diffie–Hellman assumption, for the proposed publicly verifiable contact tracing algorithms.
(iii) We perform a detailed performance evaluation on three real-world datasets. Experimental results demonstrate that our algorithm requires only milliseconds of client CPU time and reduces the storage overhead from the size of the dataset to a constant 128 bytes.
The rest of the paper is organized as follows. Section 2 reviews existing studies on contact tracing and verifiable query processing. Section 3 formally defines the problem and its security model followed by cryptographic primitives and assumptions in Section 4. Section 5 presents the detailed PvCT algorithms based on a family of verifiable set accumulators. Security proof and performance evaluation are given in Sections 6 and 7, respectively. Finally, we conclude our paper in Section 8.
2. Related Work
2.1. Contact Tracing Algorithms
Owing to the rapid spread of the COVID-19 pandemic and the importance of contact tracing, many research groups have proposed algorithms to improve contact tracing. Some of these algorithms, such as BlueTrace [6], rely on a trusted third party and expose records to it. Others adopt a decentralized or public-list approach: Private Kit [3] enables clients to log their own information (such as locations) and helps the authority contain an epidemic outbreak; Apple and Google [18, 19] have jointly developed support for privacy-preserving contact tracing by inferring linkages; Epione [4] provides end-to-end privacy-preserving contact tracing; and the privacy-sensitive protocols and mechanisms of [20] store all personal data locally on the phone and leave it to users to publish or upload the data voluntarily. Although contact tracing has been studied intensively, no existing work takes resource-constrained clients into consideration. In other words, none of these works can be applied in the cloud computing scenario. Table 1 compares different contact tracing algorithms with respect to accuracy/integrity properties, the client's storage cost (expressed in terms of the total number of contact tracing records), and verifiability, all of which are important for verifiable contact tracing in the cloud computing scenario.
2.2. Verifiable Query Processing
Many verifiable query processing algorithms have been studied to ensure the integrity of query results against an untrusted service provider (such as [15–17, 21–23]). Most existing works focus on outsourced databases and follow one of two basic approaches: enabling general queries using arithmetic/Boolean circuit-based verifiable computation schemes (SNARKs), or enabling ad hoc queries using an authenticated data structure (ADS). Constructing efficient SNARKs and optimizing their implementation is a very active area of research [10–12, 24–28]. Pinocchio [12] uses quadratic arithmetic programs to support arbitrary computation, but at a very high cost and with overhead that is sometimes impractical. Moreover, its preprocessing overhead is difficult to amortize because preprocessing must be redone for each program. Much work has been proposed to remedy this issue; for example, Xie et al. [11] recently proposed a zero-knowledge proof system in which the preprocessing time depends only on the size of the circuit and not on the circuit type.
ADS-based algorithms are more efficient than the above SNARKs because they are tailored to ad hoc queries. Our proposed algorithm belongs to this category. An ADS is a special data structure with additional authentication properties. In most cases, it has the form "additional authentication value + regular data structure", so that the computation over the regular data structure can be outsourced to an untrusted server while the client can check whether the returned result is valid. Two basic techniques are commonly used as an ADS: digital signatures and the Merkle Hash Tree (MHT). Digital signatures employ asymmetric cryptography to verify the authenticity of digital messages. To support verifiable queries, the data owner must sign every data record with his/her secret key, and the verifier (client) uses the owner's public key to verify the authenticity of a value against its signature. Because every record must be signed, this approach cannot scale up to large datasets [17]. An MHT, on the other hand, requires only one signature, on the root node of a hierarchical tree [29]. Each entry in a leaf node is assigned the hash digest of a data record, and each entry in an internal node is assigned a digest derived from its child nodes. The data owner signs the root of the MHT, which can then be used to verify any subset of data records. The MHT has been widely adapted to various index structures, such as the authenticated prefix tree for multi-resource datasets [16] and the Merkle B-tree for relational data [21]. However, so far, no work has considered the integrity issue for verifiable contact tracing over cloud computing.
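To make the MHT idea above concrete, the following is a minimal Python sketch of building a tree, producing a membership proof, and checking it against the root. It is an illustration only and not the ADS used by PvCT: the function names (`build_levels`, `prove`, `verify`), the use of SHA-256, and the convention of duplicating the last node on odd-sized levels are all assumptions made here, and the data owner's signature on the root is omitted.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest (hash choice is an assumption of this sketch)."""
    return hashlib.sha256(data).digest()

def build_levels(records):
    """Return all tree levels, leaves first; the root is levels[-1][0]."""
    if not records:
        raise ValueError("at least one record is required")
    level = [h(r) for r in records]
    levels = []
    while True:
        # Pad odd-sized levels by repeating the last node (assumed convention).
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def prove(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return path

def verify(record: bytes, path, root: bytes) -> bool:
    """Recompute the root from a record and its sibling path."""
    node = h(record)
    for sibling, sibling_is_left in path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

In this sketch the verifier needs only the root and a proof whose length is logarithmic in the number of records; a production design would additionally sign the root and separate leaf hashes from internal-node hashes.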
3. Problem Definition
Different from common settings such as [4, 5] in which the clients only require a returned result no matter whether it is authentic or not, we present the definition of the problem in publicly verifiable contact tracing setting as follows. The clients in our system not only submit their queries and expect to receive a result but also require that the correctness of the result can be verified. Besides, the service in our scenario is provided by two cloud service providers.
As shown in Figure 2, we assume that the complete data usable for contact tracing is collected from each individual of a client (for example, the students of a college) and stored in a database on a public cloud server (such as AWS or Azure). The records of the $i$-th individual can be modeled as a set of tuples. Each tuple consists of a location that the individual has been to, the timestamp of that visit, and the individual's identity number, which is unique among all individuals belonging to the client. The size of the set is the total number of records of the $i$-th individual.
Unlike a traditional query scenario, which keeps only one database, our scenario maintains a separate database for the data of diagnosed people in order to protect their privacy as much as possible. In practice, this means that the data of the diagnosed group is kept by, and obtained only through, authorities such as hospitals, the Centers for Disease Control and Prevention, and other government departments. To support public query and verification, once a person is diagnosed, the relevant authority uploads the patient's locations and related timestamps into this separate database, which can be hosted on a different cloud server. Note that, since the identity number is not needed for the final matching step, this database stores only the location and related timestamp of each diagnosed person. This also allows us to shuffle the records of all diagnosed people, which protects a diagnosed person's personal information without any additional computation-intensive overhead (such as encryption) and prevents an adversary from finding out to whom the records belong.
To the best of our knowledge, our contact tracing scheme is the only one that takes into consideration information beyond location, which differentiates it from existing schemes. If an individual uses transportation such as airplanes, trains, taxis, or buses, then information about that transportation is clearly of great importance. Therefore, the relevant transportation information of each individual (such as flight number, train number, and stations) is stored in both databases. To enable verifiable contact tracing, an authenticated data structure (ADS) is constructed and embedded into each set of records uploaded by the clients or authorities.
3.1. System Model
We now give a detailed description of the system model of our publicly verifiable contact tracing algorithm.
First, in our scenario, clients are organizations such as colleges, as described in the introduction, and they want to check whether their staff are close contacts of diagnosed people during a rapidly spreading epidemic. There are four main parties: the individual client, the authorities, Cloud Service Provider I, and Cloud Service Provider II. The whole contact tracing process consists of two phases: the query phase with Cloud Service Provider I and the matching phase with Cloud Service Provider II. We discuss the cases that may arise in these two phases separately.
Phase 1. In the query phase, a client may wish to retrieve the records that he/she uploaded and that fall within a time period of his/her choice. Specifically, a query Q consists of a time window selected by the client and an identity number belonging to the client. In response, Cloud Service Provider I returns all records whose timestamp lies in the time window and whose identity number matches the queried one.
Example 1. In a COVID-19 contact tracing process, the time period of the query in Phase 1 can be a 14-day window ending on the day the client issues the query. For instance, a client may issue a query to find all stored records dated from October 1 to October 14, 2020, that are associated with the person whose ID is Alice.
Phase 2. After the query phase, in the matching phase, the client uses the staff records obtained in the query phase to find out whether the staff member is a positive contact of diagnosed people. The client transfers the staff member's records to Cloud Service Provider II, which executes the matching process. If the intersection between the staff member's records and the diagnosed people's records is empty, Cloud Service Provider II returns a negative contact result to the client. Otherwise, it returns a positive contact result together with the intersection.
Example 2. Suppose that the target staff member's records and the records of diagnosed people have no element in common. After the client sends the staff member's records to Cloud Service Provider II for matching, the provider finds that none of the diagnosed people had been to the same place as the staff member; the intersection is empty and the staff member is a negative contact. If, instead, the records of diagnosed people share at least one element with the staff member's records, the provider finds that the staff member is a positive contact.
Additional examples can be found in Figure 2.
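The two phases can be sketched in a few lines of Python. This is a plain, unverified version of the data flow only: it shows what the two providers compute, not how the results are authenticated. The `Record` type, the function names, and the day-level granularity of timestamps are assumptions made for illustration.

```python
from datetime import date
from typing import NamedTuple

class Record(NamedTuple):
    location: str
    day: date        # timestamp, simplified to day granularity (assumption)
    person_id: str

def query_phase(database, person_id, start, end):
    """Phase 1: return the records of one person inside the time window."""
    return [r for r in database
            if r.person_id == person_id and start <= r.day <= end]

def matching_phase(staff_records, diagnosed_visits):
    """Phase 2: intersect (location, day) pairs with the diagnosed set.

    The identity number is dropped before matching, as the diagnosed
    database stores only locations and timestamps.
    """
    visits = {(r.location, r.day) for r in staff_records}
    hits = visits & diagnosed_visits
    return ("positive", hits) if hits else ("negative", set())

db = [
    Record("A laboratory", date(2020, 10, 2), "Alice"),
    Record("B market", date(2020, 10, 5), "Alice"),
    Record("C restaurant", date(2020, 9, 1), "Alice"),
    Record("B market", date(2020, 10, 5), "Bob"),
]
result = query_phase(db, "Alice", date(2020, 10, 1), date(2020, 10, 14))
status, overlap = matching_phase(result, {("B market", date(2020, 10, 5))})
```

Because both steps run at untrusted providers, each output has to be accompanied by a proof that the client can check; the rest of the paper is concerned with that verification.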
3.2. Threat Model
We consider the CSPs, the two untrusted cloud service providers in the contact tracing framework, to be the potential adversary. Owing to issues such as security vulnerabilities, program bugs, and commercial interests, the CSPs may execute the contact tracing process unfaithfully and thereby return incomplete or incorrect query and matching results. To address this threat, we introduce publicly verifiable contact tracing, which enables the CSPs to prove the integrity and accuracy of query and matching results. Specifically, in each phase, the CSP examines the ADS embedded in the records and constructs a verification object (VO) that includes the verification information for the related results. Using the VO, the client can establish the accuracy and integrity of the query and matching results under the following criteria:
(i) Accuracy: none of the records returned as results have been tampered with, and all of them satisfy the query conditions. In addition, no false positive or false negative matching result passes verification with non-negligible probability.
(ii) Integrity: no valid record is missing from the query result, and no positive result is missing in the matching phase.
Definition 1. (Accuracy). The result of public verifiable contact tracing is accurate if, for all PPT adversaries , there is a negligible function such that
The main challenge in this model is how to design an ADS that can be easily adapted to the contact tracing framework and from which VOs can be constructed efficiently, with small bandwidth overhead and fast verification time. We address this challenge in the next few sections.
4. Preliminaries
This section introduces the major notations, shown in Table 2, together with the cryptographic primitives and security assumptions used in the design of our algorithms.
4.1. Cryptographic Primitives and Security Assumptions
4.1.1. Bilinear Pairings
Let $\mathbb{G}$ be a cyclic multiplicative group of prime order $p$ and let $g$ be a random generator of $\mathbb{G}$. Let $\mathbb{G}_T$ also be a cyclic multiplicative group of prime order $p$. A bilinear pairing is a map $e : \mathbb{G} \times \mathbb{G} \to \mathbb{G}_T$ that satisfies the following conditions:
(i) Bilinearity: $e(P^a, Q^b) = e(P, Q)^{ab}$ for all $P, Q \in \mathbb{G}$ and $a, b \in \mathbb{Z}_p$.
(ii) Nondegeneracy: $e(g, g) \neq 1$, i.e., $e(g, g)$ generates $\mathbb{G}_T$.
(iii) Computability: the group operations of $\mathbb{G}$ and $\mathbb{G}_T$ and the bilinear map $e$ are all efficient, i.e., computable in polynomial time.
For clarity of presentation, we assume, for the rest of the paper, a symmetric (Type I) pairing $e : \mathbb{G} \times \mathbb{G} \to \mathbb{G}_T$. We note that our construction can be securely implemented in the (more efficient) asymmetric (Type III) pairing case with straightforward modifications (refer to [30] for a general discussion on pairings). Our security proof is based on the Strong Bilinear Diffie–Hellman (SBDH) assumption over groups with bilinear pairings presented in [31].
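Assumption 1. (Strong Bilinear Diffie–Hellman assumption). Let $\lambda$ be the security parameter and $(p, \mathbb{G}, \mathbb{G}_T, e, g)$ be a tuple of bilinear pairing parameters. For any probabilistic polynomial-time (PPT) adversary Adv and for $q$ being a parameter of size polynomial in $\lambda$, there exists a negligible probability $\mathrm{negl}(\lambda)$ such that the following holds:
$$\Pr\left[\, s \xleftarrow{\$} \mathbb{Z}_p^{*};\ (c, h) \leftarrow \mathrm{Adv}\left(g, g^{s}, \ldots, g^{s^{q}}\right) \,:\, c \in \mathbb{Z}_p^{*} \setminus \{-s\},\ h = e(g, g)^{1/(c+s)} \,\right] \le \mathrm{negl}(\lambda).$$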
Lemma 1. (see [32]). The intersection of two sets $X_1$ and $X_2$ is empty if and only if there exist polynomials $q_1(s)$ and $q_2(s)$ such that $q_1(s) P_1(s) + q_2(s) P_2(s) = 1$, where $P_i(s) = \prod_{x \in X_i} (s + x)$ for $i = 1, 2$.
The above result follows from the extended Euclidean algorithm over polynomials and gives our verification process the ability to check the correctness of an empty set intersection.
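The lemma can be checked numerically with a short sketch. For a set $X$, write $P_X(s) = \prod_{x \in X}(s + x)$; the code below runs the extended Euclidean algorithm on $P_{X_1}$ and $P_{X_2}$ over a prime field and returns Bézout coefficients exactly when the two polynomials are coprime, that is, when the sets are disjoint. The modulus `P`, the helper names, and the low-to-high coefficient lists are assumptions made for this illustration; in the paper's setting the arithmetic would be over the group order of the pairing.

```python
P = 2**61 - 1  # toy prime modulus (assumption, not a parameter from the paper)

def trim(a):
    """Drop trailing zero coefficients; the zero polynomial is []."""
    a = list(a)
    while a and a[-1] == 0:
        a.pop()
    return a

def add(a, b, sign=1):
    """Return a + sign * b, coefficients listed low to high."""
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x + sign * y) % P for x, y in zip(a, b)])

def mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return trim(out)

def divmod_poly(a, b):
    """Polynomial long division; b must be nonzero."""
    a = trim(a)
    q = [0] * max(len(a) - len(b) + 1, 0)
    inv_lead = pow(b[-1], -1, P)
    while len(a) >= len(b):
        shift = len(a) - len(b)
        coef = a[-1] * inv_lead % P
        q[shift] = coef
        for i, y in enumerate(b):
            a[shift + i] = (a[shift + i] - coef * y) % P
        a = trim(a)
    return trim(q), a

def xgcd(a, b):
    """Return (g, u, v) with u*a + v*b = g, where g is the monic gcd."""
    r0, r1, u0, u1, v0, v1 = a, b, [1], [], [], [1]
    while r1:
        q, r = divmod_poly(r0, r1)
        r0, r1 = r1, r
        u0, u1 = u1, add(u0, mul(q, u1), -1)
        v0, v1 = v1, add(v0, mul(q, v1), -1)
    inv = pow(r0[-1], -1, P)
    return tuple([c * inv % P for c in f] for f in (r0, u0, v0))

def char_poly(xs):
    """Coefficients of prod_{x in xs} (s + x)."""
    out = [1]
    for x in xs:
        out = mul(out, [x % P, 1])
    return out

def bezout_for_disjoint(x1, x2):
    """Return (q1, q2) with q1*P1 + q2*P2 = 1, or None if the sets intersect."""
    g, q1, q2 = xgcd(char_poly(x1), char_poly(x2))
    return (q1, q2) if g == [1] else None
```

For example, `bezout_for_disjoint({1, 2, 3}, {4, 5})` yields a pair of polynomials that satisfies the identity in the lemma, whereas `bezout_for_disjoint({1, 2}, {2, 3})` returns `None` because both polynomials share the factor $(s + 2)$.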
4.1.2. Cryptographic Hash Function
A cryptographic hash function $H$ is a mathematical algorithm that takes a string of arbitrary length as input and returns a fixed-length bit string. It is a one-way function, i.e., a function that is practically infeasible to invert. It is also collision resistant, meaning that it is computationally infeasible to find two different messages $m_1$ and $m_2$ such that $H(m_1) = H(m_2)$. Classic cryptographic hash functions include MD5, SHA-1, SHA-2, and SHA-3; the widely used hash function SHA-256 is a member of the SHA-2 family.
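As a small illustration of the fixed-length property, the sketch below hashes a travel record with SHA-256 from Python's standard `hashlib` module. The record encoding (fields joined by a separator byte) and the function name `hash_record` are assumptions made here; the paper does not specify how records are serialized before hashing.

```python
import hashlib

def hash_record(location: str, timestamp: str) -> bytes:
    """Map a (location, timestamp) record to a 32-byte SHA-256 digest."""
    # Hypothetical encoding: the unit separator avoids ambiguity between
    # ("ab", "c") and ("a", "bc").
    payload = f"{location}\x1f{timestamp}".encode("utf-8")
    return hashlib.sha256(payload).digest()

short = hash_record("A laboratory", "2020-10-02T09:00")
long_ = hash_record("C restaurant on a very long street name", "2020-10-05T19:30")
```

Both digests are 32 bytes long regardless of the input length, which is what allows records of varying size to be treated uniformly as set elements.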
Lemma 2. (polynomial interpolation with FFT). Let $\prod_{i=1}^{n} (x_i + s) = \sum_{i=0}^{n} b_i s^i$ be a degree-$n$ polynomial. Given $x_1, \ldots, x_n$, the coefficients $b_0, \ldots, b_n$ can be computed with $O(n \log n)$ complexity.
Lemma 2 gives an efficient procedure: given $x_1, \ldots, x_n$, the coefficients of the degree-$n$ polynomial can be computed quickly. The lemma relies on an FFT algorithm [33] that computes the DFT in a finite field, such as $\mathbb{Z}_p$, for arbitrary $n$ using $O(n \log n)$ field operations, and we use it in our constructions. A detailed proof is given in [32], so we omit it here.
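To show what the lemma computes, the sketch below expands $\prod_{i=1}^{n}(x_i + s)$ into its coefficients $b_0, \ldots, b_n$ modulo a prime. It deliberately uses the straightforward one-factor-at-a-time expansion, which takes $O(n^2)$ field operations; it is not the FFT-based method that the lemma refers to, and the modulus `P` and the function name are assumptions made for illustration.

```python
P = 2**61 - 1  # toy prime modulus (assumption)

def coefficients_from_roots(xs):
    """Return [b_0, ..., b_n] with prod_i (x_i + s) = sum_i b_i * s**i mod P.

    Naive O(n^2) expansion, multiplying in one factor (x + s) at a time.
    """
    coeffs = [1]
    for x in xs:
        nxt = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] = (nxt[i] + c * x) % P      # contribution of c * x
            nxt[i + 1] = (nxt[i + 1] + c) % P  # contribution of c * s
        coeffs = nxt
    return coeffs

def evaluate(coeffs, s):
    """Horner evaluation, used here only to cross-check the expansion."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * s + c) % P
    return acc
```

For instance, the elements 1, 2, and 3 give $(s+1)(s+2)(s+3) = s^3 + 6s^2 + 11s + 6$. Knowing the coefficients matters because an accumulation value can then be assembled from published powers of the secret without knowing the secret itself.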
4.2. Cryptographic Set Accumulators
Our set accumulator is parameterized by a set of operations. In our construction, these include:
(1) Subset ($\subseteq$) and intersection ($\cap$): these functions take two sets as input and output a set. For intersection, there are two situations: either the intersection is empty, or it is nonempty, in which case its completeness must also be taken into consideration.
(2) Functions that take a set as input and output a Boolean or integer value (the output can also be viewed as a set with one element).
Our set accumulators are all based on bilinear pairings and the SBDH assumption presented above.
Inspired by [32, 34, 35], we give a formal definition of our set accumulator, which consists of the following PPT algorithms:
(i) Key generation: on input the security parameter, it outputs a secret key and a public key.
(ii) Accumulation: for a given set, it computes the accumulation value of the set. In our construction, this value can be computed efficiently using only the public key, without knowing the secret key.
(iii) Query: on input a query Q, the sets involved, and the public key, it returns the result along with a proof.
(iv) Verification: on input the accumulation values of the sets, a result and a proof for the query, and the public key, it outputs a bit b. If b = 1, the query result is valid; otherwise, the returned result is invalid and the accuracy and integrity of this query cannot be guaranteed.
More elaborate constructions of the set accumulator are given in Section 5.3.
5. Constructions
In the following, Section 5.1 introduces the whole contact tracing process, covering Phase 1 and Phase 2 as presented in Section 3. We then enrich our algorithm by taking transportation data into consideration (Section 5.2). Finally, we give the detailed construction of our main cryptographic building block, the set accumulator, in Section 5.3.
5.1. ADS Construction and Verifiable Contact Tracing Process
For simplicity, this section considers only a client's contact tracing query over a single individual. We assume that the database held by Cloud Service Provider I stores the data of all people, and that the database held by Cloud Service Provider II stores the records of all diagnosed people.
Recall that, in the proposed framework, an ADS is generated for every individual of each client. Note that the length of the time period recorded for an individual depends on the type of contagious disease, i.e., the incubation period of the disease determines the length of each period; for COVID-19, the period should be at least 14 days.
The ADS is used by the cloud service providers to construct a verification object (VO) for each query. To this end, we extend the traditional data structure by adding an extra authentication field.
This field should have three properties to function as an ADS. First, it should summarize an individual's records in a way that can be used to prove whether a result matches a query. Second, it should support batch or aggregate verification over several devices of one individual or across different individuals. Third, it should have constant size rather than a size that grows in proportion to the number of records of an individual. Therefore, we propose to use the accumulation value of the target set, i.e., the set that we would like to aggregate, as this field.
For better readability, we defer the detailed constructions to Section 5.3.
5.1.1. Verifiable Contact Tracing
Given a query by a client and the two databases, the client ultimately needs to know whether the result is positive or negative. Recall from Section 3, however, that verifiable contact tracing actually consists of two phases, which we discuss separately.
5.1.2. Query Phase
The first phase is the query phase. Assume that the client holds all records of one staff member (for example, one student of a college). Before uploading these records to Cloud Service Provider I, the client applies a collision-resistant hash function to each record, which maps every record to a fixed-length value; this simplifies the generation of the accumulation value and protects privacy. The client then generates an accumulation value over the set of hashed records. In addition, the client keeps a counter of how many records the device collected on each date and stores these counts as additional information for verification. If the uploaded records span several days, one count is kept for each day.
After these setup steps, the client can issue a query to Cloud Service Provider I to retrieve his/her records. The query specifies the dates for which the client would like to retrieve records and the identity number of the individual concerned. The main challenge in this phase is how to verify both the correctness and the integrity of the returned result using the corresponding accumulation value.
For instance, suppose that the client issues a query for all records of one individual, identified by his/her ID, for the dates from 2020.09.01 to 2020.09.15. Our verification algorithm is independent of the retrieval algorithm; that is, any existing retrieval algorithm such as [36] can be used in our construction without affecting the accuracy or security of our algorithm. We therefore omit the details of retrieval and consider only the point at which Cloud Service Provider I has finished retrieval and obtained the corresponding result. The provider then generates a proof for the result and uses a counter to obtain the number of records in the result set; together these form the VO for the retrieval result. Accordingly, the client first checks whether the number of records in the returned result set equals the sum of the per-day counts stored on the client side (from the first day to the last day of the query). If the check holds, the client then uses the proof to verify the accuracy of the result. The whole process of this phase is specified in Algorithm 1. If verification in this phase fails, i.e., b = 0, the whole contact tracing execution aborts. Otherwise, verification in the query phase succeeds and we proceed to the matching phase.
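The client-side checks of the query phase can be sketched as follows. This is a rough illustration of the control flow described above, not a transcription of Algorithm 1: the names `count_check` and `verify_query` are hypothetical, and `verify_accumulator` is a placeholder callable that stands in for the accumulator verification algorithm, whose construction is deferred to Section 5.3.

```python
from datetime import date

def count_check(daily_counts, start, end, result):
    """Integrity check: the result size must equal the stored per-day counts."""
    expected = sum(n for day, n in daily_counts.items() if start <= day <= end)
    return expected == len(result)

def verify_query(daily_counts, start, end, result, proof, verify_accumulator):
    """Return b = 1 if both checks pass and b = 0 otherwise.

    `verify_accumulator(result, proof)` is a hypothetical stand-in for the
    accumulator-based accuracy check; it is not defined in this sketch.
    """
    if not count_check(daily_counts, start, end, result):
        return 0  # a record is missing or an extra one was injected
    return 1 if verify_accumulator(result, proof) else 0

# Counts kept on the client side: number of records collected per day.
counts = {date(2020, 9, 1): 2, date(2020, 9, 2): 1, date(2020, 9, 20): 4}
returned = ["h1", "h2", "h3"]  # hashed records returned by the provider
b = verify_query(counts, date(2020, 9, 1), date(2020, 9, 15),
                 returned, proof=None, verify_accumulator=lambda r, p: True)
```

The counter check is cheap and catches dropped records before the more expensive proof verification runs; the only state the client needs for it is one integer per day.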

5.1.3. Matching Phase
The second phase is the matching phase. In this phase, the client issues a matching query to the cloud server to find out whether he or she is a close contact of any diagnosed person. The query set is a variation of the result set obtained in the query phase: the identity number is dropped because it is not required in the matching process. Mathematically, the problem is to determine whether the client's set and the set of records over all the diagnosed people have a nonempty intersection. The main challenge in this phase is therefore how to verify whether the intersection is empty. Recall from Section 3 that the records of diagnosed people are owned by the authorities, such as hospitals or the government. Before uploading these records to the cloud server, the authorities generate an accumulation value and make it public to all the clients for later verification.
For example, suppose the result set that the client obtained records visits to A laboratory, B market, and C restaurant, and consider its intersection with the records of the diagnosed people. If the intersection is empty, none of the diagnosed people were at A laboratory, B market, or C restaurant at the same time as the client, so the client is a negative contact. The cloud server then generates a proof of emptiness and sends the result to the client. The client uses this proof to verify whether the negative judgment is trustworthy. If b = 1, verification in this phase succeeds and the client can be sure of the negative result. Otherwise, the client refuses to believe the negative result.
Conversely, if the intersection is not empty, at least one of the diagnosed people was at A lab, B park, or C restaurant at the same time as the client. In that case, the client is a positive contact with a high probability of being infected. The cloud server then generates a proof for the intersection and sends the result to the client. The client uses this proof to verify whether the positive judgment is trustworthy. If b = 1, verification in this phase succeeds; the client can be sure of the positive result and should seek medical care as soon as possible. Otherwise, the client refuses to believe the positive result. The whole process of this phase is detailed in Algorithm 2.
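The decision made in the matching phase can be sketched on plain sets. Records are modeled here as hypothetical (place, time slot) pairs, and the function only reports which kind of proof would accompany the result; generating and verifying the proof itself requires the accumulator of Section 5.3 and is not shown.

```python
def match(client_visits, diagnosed_visits):
    """Return the contact status, the kind of proof to attach, and the common records."""
    common = client_visits & diagnosed_visits
    if common:
        # Nonempty intersection: positive contact, backed by an intersection proof.
        return "positive", "intersection", common
    # Empty intersection: negative contact, backed by an emptiness proof.
    return "negative", "empty", common

# Hypothetical records: (place, hour-level time slot).
client = {
    ("A laboratory", "2020-09-02T10"),
    ("B market", "2020-09-05T18"),
    ("C restaurant", "2020-09-07T12"),
}
diagnosed = {
    ("C restaurant", "2020-09-07T12"),
    ("D cinema", "2020-09-08T20"),
}

status, proof_kind, common = match(client, diagnosed)
```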

5.2. Transportation Data
One special kind of data needs further discussion: transportation data. If clients or diagnosed people have used vehicles (such as airplanes, trains, buses, or taxis) for short or long trips, then vehicle data (such as flight or train number, bus plate, and departure and terminal stations) is significant for our contact tracing algorithm. For example, if a diagnosed person has taken an airplane, then besides the departure and terminal station information, the closest contacts of this person are the passengers who took the same flight; in other words, the flight number is essential for the matching process. Trains and other public transport such as buses or MRT may have multiple stations rather than only an initial station and a terminal station. Therefore, the information about middle stations is also important in our contact tracing process. To make this concrete, we analyze two possible cases of transportation data that may arise in the matching process, as shown in Figure 3.
Figure 3: Two possible cases of transportation data in the matching process: (a) Case 1; (b) Case 2.
In the first case, shown in Figure 3(a), the client gets on the bus at the initial station and gets off at middle station 1, while a diagnosed person gets on the same bus at a later middle station. Mathematically, there are two sets of station information: the first runs from the initial station to middle station 1, and the second runs from the later middle station to the terminal station of the bus. Under this circumstance, the two sets do not intersect; in other words, the client is not a positive contact.
In the second case, shown in Figure 3(b), the client gets on the bus at the initial station and gets off at a later middle station, while a diagnosed person gets on the same bus at middle station 1. As in Case 1, there are two sets: the first runs from the initial station to the client's off station, and the second runs from middle station 1 to the terminal station. These two sets have a nonempty intersection; that is, the client is a positive contact.
The above analysis of the cases that may arise for transportation data shows that some additional precomputation is required on both the client side and the authority side. The VO construction must also generate a corresponding proof for transportation data.
First, in all circumstances, both the client and the authorities need to generate the accumulation value of their transportation datasets for later verification. There is no need to verify the transportation data separately in the query phase, because the integrity check of the whole dataset also covers the transportation data.
Then, in the matching process, as analyzed above, the cloud server has to check whether the transportation datasets of the client and of the diagnosed people intersect. The first case is that the intersection is empty. Proving that the intersection of these two sets is empty is equivalent to proving that the client's off station is not a member of the transportation dataset of the diagnosed people. The server uses the nonmembership proof (Section 5.3.6) to generate a corresponding proof, which is sent to the client along with a negative contact tracing result. The client can then use this proof to verify the negative result.
The second case is that the intersection is not empty. Similarly, proving that the intersection of these two sets is not empty is equivalent to proving that the off station is a member of the station set of the diagnosed people. The server uses the membership proof (Section 5.3.5) to generate a corresponding proof, which is sent to the client along with a positive contact tracing result. The client can then use this proof to verify the positive result. All of these steps are shown in bold print in Algorithms 1 and 2.
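The reduction from set intersection to membership of the client's off station can be checked on a small example. The route, the station names, and the helper functions are assumptions made for illustration; as in the two cases of Figure 3, the client and the diagnosed person are taken to be on the same bus, and the diagnosed person rides to the terminal station.

```python
def ride_segment(route, on, off):
    """Stations covered by a ride from `on` to `off` (inclusive) along `route`."""
    i, j = route.index(on), route.index(off)
    return set(route[i:j + 1])

def is_positive(client_off, diagnosed_segment):
    """Membership of the client's off station decides the case."""
    return client_off in diagnosed_segment

route = ["initial", "middle 1", "middle 2", "middle 3", "terminal"]

# Case 1: the client leaves at middle 1; the diagnosed person boards at middle 3.
client_1 = ride_segment(route, "initial", "middle 1")
diagnosed_1 = ride_segment(route, "middle 3", "terminal")

# Case 2: the client leaves at middle 3; the diagnosed person boards at middle 1.
client_2 = ride_segment(route, "initial", "middle 3")
diagnosed_2 = ride_segment(route, "middle 1", "terminal")

case_1_positive = is_positive("middle 1", diagnosed_1)
case_2_positive = is_positive("middle 3", diagnosed_2)
```

In both cases the membership test agrees with the emptiness of the full intersection, which is why a single membership or nonmembership proof suffices for transportation data.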
5.3. Construction of Set Accumulators
We now discuss a possible construction of the accumulator used in Section 5.1.
Inspired by [32], we present a construction based on the SBDH assumption and bilinear pairings. It consists of the following algorithms.
(i) Key generation: given a bilinear pairing, randomly choose a secret value. The algorithm outputs the secret key and the corresponding public key.
(ii) Accumulation: for a set, compute its accumulation value. Owing to the property of polynomial interpolation with FFT, this value can be computed efficiently without knowing the secret key.
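The claim that the accumulation value can be computed without the secret key can be illustrated with a toy model. The sketch assumes the usual form of such accumulators: the public key holds the powers g^(s^i), and the accumulation value is g raised to the product of (x + s) over the set. Integers modulo a small prime stand in for the pairing group, coefficients are expanded by plain multiplication rather than FFT, and all names and parameters are hypothetical. It is not secure and is not the paper's implementation.

```python
P = 2_147_483_647  # toy prime modulus (2^31 - 1); NOT a secure group
G = 7              # toy base element
s = 123_456        # secret key, used only to build the public powers and to cross-check

def char_poly_coeffs(elements):
    """Coefficients a_0..a_n of prod (z + x) over the set, lowest degree first."""
    coeffs = [1]
    for x in elements:
        nxt = [0] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i] += a * x   # contribution of the constant term x
            nxt[i + 1] += a   # contribution of the z term
        coeffs = nxt
    return coeffs

def public_powers(q):
    """Public key: G^(s^i) mod P for i = 0..q."""
    return [pow(G, s ** i, P) for i in range(q + 1)]

def accumulate_public(elements, pk):
    """Accumulate using only the public powers and the polynomial coefficients."""
    acc = 1
    for a, g_si in zip(char_poly_coeffs(elements), pk):
        acc = acc * pow(g_si, a, P) % P
    return acc

def accumulate_secret(elements):
    """Reference value computed directly with the secret key."""
    exponent = 1
    for x in elements:
        exponent *= x + s
    return pow(G, exponent, P)

X = [3, 5, 11]
pk = public_powers(len(X))
same = accumulate_public(X, pk) == accumulate_secret(X)
```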
For clarity, we split the procedures for constructing the core proof and verify protocols, which meet different query requirements, into the following cases.
5.3.1. Subset
(i) Proof generation: given two sets and the public key, to show that the first set is a subset of the second, the prover computes the subset proof.
(ii) Verification: the client checks the subset verification equation.
This equation holds if and only if the first set is a subset of the second. In other words, if the proof is verified as correct, the client is assured of the subset relation and outputs 1; otherwise, it outputs 0.
5.3.2. Empty
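(i) Proof generation: given two sets and the public key, the prover shows that their intersection is empty. By the extended Euclidean algorithm over polynomials, the intersection is empty if and only if there exist two polynomials whose combination with the polynomials of the two sets equals 1. In that case, the emptiness proof is computed from these two polynomials.
(ii) Verification: the client checks the emptiness verification equation.
This equation holds if and only if the intersection of the two sets is empty. That is, if the proof is verified as correct, the client is assured that the intersection is empty and outputs 1; otherwise, it outputs 0.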
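The polynomial fact behind the Empty proof can be checked directly. The sketch below assumes each set is represented by the polynomial whose factors are (z + x) for every element x, runs the extended Euclidean algorithm over rational coefficients, and confirms that the combination equals 1 exactly when the sets are disjoint. Polynomials are plain coefficient lists, lowest degree first; all function names are hypothetical.

```python
from fractions import Fraction

def trim(p):
    """Drop zero leading coefficients (stored at the end of the list)."""
    while len(p) > 1 and p[-1] == 0:
        p = p[:-1]
    return p

def add(p, q):
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return trim([a + b for a, b in zip(p, q)])

def mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return trim(out)

def scale(p, c):
    return trim([a * c for a in p])

def divmod_poly(p, q):
    """Quotient and remainder of p divided by a nonzero polynomial q."""
    quo = [Fraction(0)] * max(len(p) - len(q) + 1, 1)
    rem = trim(list(p))
    while len(rem) >= len(q) and rem != [0]:
        shift = len(rem) - len(q)
        c = rem[-1] / q[-1]
        quo[shift] = c
        rem = add(rem, scale([Fraction(0)] * shift + q, -c))
    return trim(quo), rem

def bezout(p, q):
    """Return (g, u, v) with u*p + v*q = g, where g is the monic gcd of p and q."""
    r0, r1 = p, q
    u0, u1 = [Fraction(1)], [Fraction(0)]
    v0, v1 = [Fraction(0)], [Fraction(1)]
    while r1 != [0]:
        quo, rem = divmod_poly(r0, r1)
        r0, r1 = r1, rem
        u0, u1 = u1, add(u0, scale(mul(quo, u1), -1))
        v0, v1 = v1, add(v0, scale(mul(quo, v1), -1))
    lead = r0[-1]
    return scale(r0, 1 / lead), scale(u0, 1 / lead), scale(v0, 1 / lead)

def char_poly(elements):
    """Polynomial prod (z + x) over the set."""
    p = [Fraction(1)]
    for x in elements:
        p = mul(p, [Fraction(x), Fraction(1)])
    return p

p1, p2 = char_poly({1, 2}), char_poly({3, 4})      # disjoint sets
g, u, v = bezout(p1, p2)
combination = add(mul(u, p1), mul(v, p2))

g_shared, _, _ = bezout(char_poly({1, 2}), char_poly({2, 3}))  # sets share 2
```

For the disjoint pair the gcd is the constant 1, so the two cofactors exist; for the overlapping pair the gcd is the shared factor z + 2, and no such cofactors can exist.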
5.3.3. Completeness
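(i) Proof generation: consider a set claimed to be the intersection of the two given sets. As in the proof of emptiness, by the extended Euclidean algorithm over polynomials, the two sets have no common element left after the claimed intersection is removed from both if and only if there exist two polynomials whose combination with the polynomials of the two remaining sets equals 1. In that case, the completeness proof is computed from these two polynomials.
(ii) Verification: the client checks the completeness verification equation.
This equation holds if and only if the two remaining sets are disjoint. In other words, if the proof is verified as correct, the client is assured that the claimed set contains all the common elements of the two given sets and outputs 1; otherwise, it outputs 0.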
5.3.4. Intersection
(i) Proof generation: consider a set claimed to be the intersection of the two given sets. The correctness of the set intersection operation can be expressed as the combination of the subset condition and the completeness condition. That is, the claimed set is the intersection if and only if the following two conditions hold: (1) the claimed set is a subset of each of the two given sets; (2) the claimed set is complete, i.e., the two given sets have no common element outside it. The intersection proof therefore combines the subset proofs and the completeness proof.
(ii) Verification: first, the client verifies the subset condition by checking the subset verification equation.
If the check on the subset proof succeeds, the client verifies the completeness condition by checking the completeness verification equation.
If this equation holds, the client is assured that the claimed set is the correct intersection and outputs 1; otherwise, it outputs 0.
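For orientation, the checks of Sections 5.3.1 to 5.3.4 can be written out in the usual style of bilinear-map set accumulators. The notation below is assumed for illustration and is not recovered from the original formulas: s is the secret key, g a group generator, e the pairing, P(X) the product of (x + s) over the elements of X, and acc(X) = g^{P(X)}; Q_1 and Q_2 are the polynomials produced by the extended Euclidean algorithm.

```latex
\begin{align*}
\text{Subset } (X_1 \subseteq X_2):\quad
  & \pi = g^{P(X_2 \setminus X_1)}, \qquad
    e\bigl(\mathrm{acc}(X_1), \pi\bigr) = e\bigl(\mathrm{acc}(X_2), g\bigr) \\
\text{Empty } (X_1 \cap X_2 = \emptyset):\quad
  & Q_1(s)\,P(X_1) + Q_2(s)\,P(X_2) = 1, \qquad
    \pi = \bigl(g^{Q_1(s)}, g^{Q_2(s)}\bigr), \\
  & e\bigl(\mathrm{acc}(X_1), g^{Q_1(s)}\bigr)\,
    e\bigl(\mathrm{acc}(X_2), g^{Q_2(s)}\bigr) = e(g, g) \\
\text{Completeness } (I = X_1 \cap X_2):\quad
  & Q_1(s)\,P(X_1 \setminus I) + Q_2(s)\,P(X_2 \setminus I) = 1, \\
  & e\bigl(g^{P(X_1 \setminus I)}, g^{Q_1(s)}\bigr)\,
    e\bigl(g^{P(X_2 \setminus I)}, g^{Q_2(s)}\bigr) = e(g, g)
\end{align*}
```

In this form, the Intersection check consists of the Subset check applied to I against each of the two sets, followed by the Completeness check.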
5.3.5. Membership
(i) Proof generation: given an element of a set, the prover computes the membership proof.
(ii) Verification: the client checks the membership verification equation.
If the verification succeeds, the client outputs 1 and is assured that the element belongs to the set. Otherwise, it outputs 0.
5.3.6. Nonmembership
(i) Proof generation: given an element that does not belong to a set, the prover computes the nonmembership proof from the corresponding auxiliary values.
(ii) Verification: the client checks the nonmembership verification equation.
If the verification succeeds, the client outputs 1 and is assured that the element does not belong to the set. Otherwise, it outputs 0.
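A common way to obtain membership and nonmembership witnesses in polynomial-based accumulators is to divide the set polynomial by the linear factor of the queried element: the remainder is zero exactly when the element belongs to the set. The sketch below assumes that approach and shows only the polynomial arithmetic over the integers; whether the original construction computes its proofs exactly this way is an assumption, and the function names are hypothetical.

```python
def char_poly(elements):
    """Coefficients of prod (z + x) over the set, highest degree first."""
    coeffs = [1]
    for x in elements:
        nxt = coeffs + [0]          # multiply by z
        for i, a in enumerate(coeffs):
            nxt[i + 1] += a * x     # add x times the old polynomial
        coeffs = nxt
    return coeffs

def divide_by_linear(coeffs, y):
    """Synthetic division by (z + y): returns (quotient coefficients, remainder)."""
    acc = []
    carry = 0
    for a in coeffs:
        carry = a + carry * (-y)    # Horner step at the root z = -y
        acc.append(carry)
    return acc[:-1], acc[-1]

X = [3, 5]
poly = char_poly(X)                                    # z^2 + 8z + 15

quotient_in, remainder_in = divide_by_linear(poly, 3)   # 3 is in X
quotient_out, remainder_out = divide_by_linear(poly, 4)  # 4 is not in X
```

A zero remainder leaves a quotient that can serve as a membership witness, while a nonzero remainder together with the quotient plays the same role for nonmembership.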
6. Security Proof
In this section, we provide security proofs for our scheme; more specifically, for the six set-related operations in the accumulator setting: Subset, Empty, Completeness, Intersection, Membership, and Nonmembership. We first provide security proofs for two more fundamental set-related operations, namely Set Containment and Set Disjointness, and then reduce the security of the six set-related operations in our scheme to these two.
(1) Set Containment: this operation takes a set or an element of the universe as its first input and a set as its second input. It outputs “1” if the first input is a subset or an element of the second, and “0” otherwise. It generalizes Subset and Membership and provides a unified interface for the two. Informally, to check whether one set is a subset of another, or whether an element belongs to a set, one can use the set containment operation in both cases. If the inputs are two sets, it is equivalent to Subset; if the inputs are an element and a set, it is equivalent to Membership.
(2) Set Disjointness: this operation takes sets as inputs and outputs “1” if their intersection is empty and “0” otherwise.
We then prove that if there exists an adversary who is able to create a legal witness for an incorrect set operation result, an algorithm can be constructed to break the Strong Bilinear Diffie–Hellman assumption. We first define our security games.
6.1. Security Game 1
Strong Bilinear Diffie–Hellman Game: in this game, an adversary and a challenger engage in an interactive process:
(1) The challenger prepares an SBDH instance and sends it to the adversary.
(2) If the adversary can return a legal pair, we say that the adversary wins this game and thus breaks the SBDH assumption.
6.2. Security Game 2
Valid Witness for Incorrect Set Operation Result Game: this is an abstraction of the security games for all the related set operations in our scheme, so its description does not refer to any concrete set operation:
(1) The challenger prepares a group of system parameters. (Even if it sends the public part of the parameters to the adversary, there is no essential difference, except that the adversary then does not need to query the challenger for set witnesses; this is similar to security games in general public-key encryption, where the adversary can encrypt messages itself as well as query the challenger and get responses.)
(2) The adversary issues witness queries on arbitrary sets of her choice, subject to the cardinality of each queried set not exceeding the prescribed bound. The total number of queries is also bounded by a polynomial in the security parameter, to which we refer only implicitly.
(3) After the query phase, if the adversary can return a legal witness (or several legal witnesses) for an incorrect set operation result, we say that the adversary wins this game and thus breaks the security of our scheme.
Theorem 1. If there exists an adversary who can provide a valid witness for an incorrect set containment operation result, then there exists another algorithm that can break the Strong Bilinear Diffie–Hellman assumption.
Consider a tuple of bilinear pairing parameters and the public elements derived from a secret value chosen uniformly at random. Suppose there exists a polynomial-time algorithm that can find two sets and a legal witness such that the first set is not contained in the second and yet the witness passes verification. Then, we can use this algorithm to construct a polynomial-time algorithm that breaks the Strong Bilinear Diffie–Hellman assumption.
Proof. The main idea behind the proof is that the constructed algorithm takes part in two security games simultaneously: it sits between the challenger (in the Strong Bilinear Diffie–Hellman security game) and the adversary (in the Set Containment security game). It prepares parameters for the adversary from what it receives from the challenger, and then forms its own solution to the Strong Bilinear Diffie–Hellman instance after some calculation on the adversary's response:
(1) The constructed algorithm first interacts with the challenger and receives a Strong Bilinear Diffie–Hellman instance to be challenged upon. If it can successfully find a legal pair for this instance, we say that it succeeds in breaking the Strong Bilinear Diffie–Hellman assumption.
(2) The adversary can choose any set as a query, with the only restriction that the cardinality of the set cannot exceed the prescribed bound. Suppose it chooses a set, sends it to the constructed algorithm, and asks for its accumulation value.
(3) With the parameters in the instance, the constructed algorithm can easily respond to the request from the previous step. For example, to generate the accumulation value for a set, it first calculates all the coefficients of the polynomial associated with the set and then calculates the accumulation value from these coefficients and the elements of the instance. It then sends the value to the adversary as the answer.
(4) The adversary may issue other queries for the sets it wants, and the constructed algorithm responds accordingly, subject to an upper bound on the total number of queries. (Element update operations are covered by this case: for example, two queries, one on a set and one on the same set with a value inserted, are equivalent to one query for an update with value insertion on the first set, and also to a delete operation on the second.)
(5) After the query phase, the adversary generates two pairs, one for each of the two sets, and