Abstract

With the rapid and widespread adoption of cloud computing, individuals and enterprises increasingly outsource their databases to a cloud service provider (CSP) to reduce the cost of managing and maintaining data. The CSP hosts the outsourced databases and offers query services to clients, but it is not fully trusted, so the security of the data can be violated in several ways. Data privacy and query integrity are regarded as the two major obstacles that keep enterprises from outsourcing their databases. This paper proposes a novel scheme that supports k-nearest neighbors (kNN) queries and query authentication on an encrypted outsourced spatial database. We present an asymmetric scalar-product-preserving encryption scheme in which data points and query points are encrypted with different encryption keys, and the CSP can determine the distance relation between encrypted data points and query points. Furthermore, the similarity search tree is extended to build a novel verifiable SS-tree that supports efficient query processing and query verification. The security analysis and experimental results indicate that our scheme not only maintains the confidentiality of the outsourced data and query points but also has lower query processing and verification overhead than the MR-tree.

1. Introduction

As spatial data resources grow rapidly, enterprises must expand both their hardware and software resources and recruit professionals to manage and maintain the data. The overhead of data maintenance has therefore risen sharply. Meanwhile, cloud computing has become increasingly popular in recent years because it offers many benefits, such as quick deployment, on-demand service, high scalability, and cost reduction [1–4]. A growing number of companies are motivated to outsource their daily business, and even their core business, to a cloud service provider to eliminate investment in hardware and software and to reduce data maintenance costs. Moreover, with cloud computing, end users can use online software applications and access services at any time and from any place [5]. Outsourcing spatial databases is therefore becoming common practice. Spatial data has many practical applications, such as environmental monitoring, location-based services, and flow control. In the data outsourcing model, the cloud service provider (CSP) hosts the outsourced databases and provides query services for the clients, and the data owner loses management and control of the outsourced data. Consequently, the confidentiality and security of the data may be violated. Data privacy and security problems are among the major factors that keep data owners from outsourcing their databases to the CSP [6, 7].

Data encryption is the most frequently adopted approach to maintaining data confidentiality, since only authorized parties can decrypt the data. The purpose of outsourcing data, however, is to draw on the strong computing power and high bandwidth of the cloud service provider to offer rapid and efficient services to users. Traditional encryption approaches, such as DES and RSA, are designed primarily to protect confidential data, and the encrypted data cannot support queries and analysis as efficiently as the original data. Many effective schemes [8–11] have been proposed to execute queries on encrypted data.

Besides data privacy, query integrity, also known as query authentication, is another critical problem in data outsourcing. Since the CSP is not fully trusted, it may return incorrect or incomplete query results. Extra authentication information must therefore be provided to the client so that it can check the correctness and completeness of query results without having to trust the CSP. Correctness means that the records in the results really exist in the owner’s database and have not been modified. Completeness means that all the records that satisfy the query condition are included in the results. Query integrity is particularly crucial when the results form the basis of critical decisions.

The k-nearest neighbors (kNN) query is a crucial data analysis operation that can be used as an independent query or as a core module of data mining, and it has been applied in many practical applications, such as geospatial technology, location-based services, and pattern recognition. Recent studies [12–16] have proposed various techniques to support either queries on encrypted data or query authentication. However, both privacy protection and query authentication should be provided in an insecure cloud computing environment. Thus, we focus on query processing and query authentication on an encrypted spatial dataset. In this paper, we introduce an asymmetric scalar-product-preserving encryption to encrypt confidential data points and query points, and then we propose an authenticated spatial index structure based on the SS-tree [17], called the verifiable SS-tree (VSS-tree), for secure query processing and query authentication. Our main contributions are as follows:
(1) We introduce an asymmetric scalar-product-preserving encryption through which the data owner encrypts confidential data and query points with different encryption keys. The cloud server can perform a kNN query on the encrypted outsourced spatial database.
(2) We extend the SS-tree [17] and propose a novel verifiable SS-tree (VSS-tree) for query processing and query authentication.
(3) We perform a detailed security analysis and performance evaluation of our scheme.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the system model. Section 4 specifies the encryption scheme. Section 5 elaborates on the VSS-tree. Section 6 gives the security analysis of our scheme. Section 7 presents the performance evaluation and experimental results. Finally, Section 8 concludes the paper.

2. Related Work

The “bucket-based” encryption approach is proposed in [8, 9]. The domain of private data is divided into multiple disjoint ranges, and each range is identified by a unique identifier. The cloud server performs a range query according to the identifiers and returns a superset of the real result set, so the client has to do extra processing to obtain the real results. Agrawal et al. [10] proposed an order-preserving encryption to support one-dimensional range queries on encrypted data. The input data distribution is transformed into a user-specified target distribution, and the encrypted data keeps the same order as the original data, which simplifies range queries on encrypted data. Nevertheless, this scheme cannot resist known-plaintext attacks [18]. Oliveira and Zaiane [11] proposed a distance-preserving transformation (DPT) approach. DPT transforms an original data point $p$ into a new point $p' = Np + t$, where $N$ is a $d \times d$ matrix and $t$ is a $d$-dimensional vector. DPT ensures that the Euclidean distance between any two encrypted data points is equal to that between the corresponding original data points; that is, $d(p_1', p_2') = d(p_1, p_2)$. However, DPT cannot resist level 2 and level 3 attacks [19]. Man et al. [20] proposed a data transformation approach to maintain data confidentiality. Using the transforming function, the data owner and the user transform their original spatial data and query ranges into encrypted ones, and the cloud server performs range queries on the encrypted data. Chen et al. [21] proposed a random space encryption approach to support range queries on encrypted data. The outsourced data and queries are encrypted on a trust agent, and the cloud server indexes the encrypted data and executes queries on it. However, the trust agent may become a single point of failure and a network bottleneck of the system. Kalnis et al. [22] applied $k$-anonymity to the outsourced data, so that the cloud server cannot distinguish a record from at least $k - 1$ other records. Obviously, the cloud server cannot perform exact queries under this scheme. Similarly, Chow et al. [23] adopted location anonymity to hide the real location of query points. Asymmetric scalar-product-preserving encryption (ASPE) is proposed in [12] for secure kNN queries on encrypted data. The outsourced data and query points are encrypted with different encryption keys, and the cloud server can execute a kNN query on the encrypted data. However, ASPE assumes that the clients are fully trusted, which is unrealistic in real applications: a client can easily derive the encryption key from its legal inputs and outputs. Optimized ASPE is proposed in [13, 14], in which the clients are not trusted and only the data owner knows the encryption key. In [13], data points and query points are extended to higher-dimensional points, which requires more than double the computation overhead of computing on the original data. Paillier homomorphic encryption is used in [14] to keep the query points confidential from the data owner, so the client has to provide more computing resources to encrypt and decrypt the query points.
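To illustrate why a distance-preserving transformation leaks the geometry of the data, the following minimal sketch applies a transformation of the form p' = Np + t and checks that distances survive it. It is not taken from [11]: the concrete matrix (a rotation built from a 3-4-5 triangle, assumed orthonormal so that distances are preserved), the offset, and all function names are illustrative assumptions.

```python
from fractions import Fraction

def dpt(point, n_matrix, t_vector):
    """Apply p' = N p + t to one point (illustrative helper, not from [11])."""
    return [sum(n * x for n, x in zip(row, point)) + t
            for row, t in zip(n_matrix, t_vector)]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Assumed secret key: an exact rotation matrix (orthonormal) and an offset.
N = [[Fraction(3, 5), Fraction(-4, 5)],
     [Fraction(4, 5), Fraction(3, 5)]]
T = [Fraction(7), Fraction(-2)]

p1 = [Fraction(1), Fraction(2)]
p2 = [Fraction(4), Fraction(6)]
e1, e2 = dpt(p1, N, T), dpt(p2, N, T)

# The ciphertexts differ from the plaintexts, but their distance is unchanged,
# which is exactly what a distance-based inference attack exploits.
same_distance = squared_distance(e1, e2) == squared_distance(p1, p2)
```

Because the distances between ciphertexts equal the distances between plaintexts, an attacker who knows a few original points can try to align them with the encrypted points, which is the intuition behind the level 2 and level 3 weaknesses mentioned above.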

A digital signature chain mechanism is adopted in [24–27] for query authentication. Each record is signed together with its immediate predecessor or successor record (attribute). The records and their corresponding signatures are stored together on the CSP. When answering a query, the CSP returns the matched records along with their signatures, and the client then verifies the correctness and completeness of the query results against those signatures. However, the computational cost of digital signatures is high. Even with signature aggregation [28], the client has to spend considerable computing power on signature verification and modular multiplications.

Unlike the signature chain, the Merkle Hash Tree (MHT), first proposed by Merkle [29], is a memory-based binary tree with authentication information for one-dimensional equality queries. A digest rather than a signature, computed by a one-way and collision-resistant hash function, is associated with each node, and only one signature is computed, on the root of the MHT. Devanbu et al. [30] extended the MHT to support one-dimensional range queries. Pang and Tan [31] extended the MHT and proposed a verifiable B-tree (VB-tree). The data owner has to sign each record and node in the VB-tree, which results in high signature computation overhead, and the VB-tree only guarantees the correctness of the query results. Li et al. [32] proposed the Merkle B-Tree (MBT) to support disk-based query authentication. To respond to a range query, the CSP performs two depth-first traversals to find the leftmost and rightmost records of the query results and builds the verification object (VO). The VO includes the following information: the digests of the entries in each visited internal node that do not overlap with the range; the query results along with the digests of the residual entries in the corresponding leaves; and the leftmost and rightmost records of the query results for completeness verification.
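The Merkle Hash Tree idea can be sketched in a few lines. The code below is a minimal illustration, assuming SHA-256 as the hash function and duplication of the last digest on odd-sized levels; neither choice comes from the cited papers, and the owner's signature on the root is omitted. The function names are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    """One-way, collision-resistant digest (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()

def build_levels(records):
    """Return all tree levels, leaves first; the last level holds the root."""
    level = [h(r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:          # pad odd levels with the last digest
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def build_proof(levels, index):
    """Collect the sibling digests on the path from a leaf to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(record, proof, root):
    """Recompute the root from one record and its sibling digests."""
    digest = h(record)
    for sibling, sibling_is_left in proof:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root

records = [b"r1", b"r2", b"r3", b"r4", b"r5"]
levels = build_levels(records)
root = levels[-1][0]   # in an MHT, only this digest would be signed by the owner
```

A client that trusts the signed root can check any returned record with `verify(record, build_proof(levels, i), root)`; a modified record produces a different root and is rejected.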

For multidimensional database outsourcing, Cheng et al. [33] proposed the verifiable KD-tree (VKD-tree) and the verifiable R-tree (VR-tree), which apply the signature chain to the KD-tree and the R-tree, respectively, to ensure the integrity of query results. Yang et al. [34] proposed the Merkle R-tree (MR-tree) and the Merkle R*-tree (MR*-tree) for query verification of spatial data. The MR*-tree combines concepts from the MBT and the R*-tree: the authentication information is embedded in the R*-tree. A range query is performed by a depth-first traversal of the MR*-tree. The VO comprises all the data entries in the visited leaf nodes and the minimum bounding rectangles (MBRs), along with the corresponding digests, of the sibling nodes pruned in the visited internal nodes. Yiu et al. [15] presented a framework for authenticating moving kNN queries using the safe region approach.

Among other query authentication approaches, Xie et al. [35] proposed a probabilistic query integrity authentication scheme. The data owner inserts some fabricated records into the database and outsources them together with the real records to the CSP. The CSP cannot distinguish the fabricated records from the real ones. In response to a query, the CSP returns the query results (including the real and fabricated records) to the client. The client verifies the correctness and completeness of the query results by checking whether all the qualifying fabricated records are included in the results.
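The fabricated-record check can be sketched as follows. This is only an illustration of the idea, under the assumptions that records are identified by one-dimensional keys, that the query is a range query, and that the client knows the set of fabricated keys; the function names are hypothetical and not taken from [35].

```python
def expected_fakes(fake_keys, low, high):
    """Fabricated keys that an honest server must return for [low, high]."""
    return {key for key in fake_keys if low <= key <= high}

def verify_result(result_keys, fake_keys, low, high):
    """Client-side check: every qualifying fabricated record must be present."""
    return expected_fakes(fake_keys, low, high) <= set(result_keys)

fake_keys = {15, 42, 77}                 # known only to the owner and clients
honest = [12, 15, 30, 42]                # complete answer to the range [10, 50]
drops_fake = [12, 15, 30]                # server dropped fabricated record 42
drops_real = [12, 15, 42]                # server dropped real record 30

checks = (verify_result(honest, fake_keys, 10, 50),
          verify_result(drops_fake, fake_keys, 10, 50),
          verify_result(drops_real, fake_keys, 10, 50))
```

The last case passes even though a real record is missing, which shows why the guarantee is only probabilistic: a cheating server is caught only when the records it drops include a fabricated one.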

Both privacy protection and query authentication are realized in [36–38]. Wang et al. [36] applied duplicated encryption for query verification. Part of the original data is encrypted with two different encryption keys. The user transforms a query into two different queries according to the encryption schemes and probabilistically verifies query integrity by comparing the two result sets. Shamir secret sharing is used in [37] for privacy protection and multiparty computation: multiple service providers collaboratively compute the aggregation results without gaining knowledge of intermediate results. The integrity of the aggregate results is guaranteed by the MHT. The Pedersen commitment protocol and the MHT are used in [38] for aggregation computation and query verification.

3. System Framework and Assumption

3.1. System Model

Figure 1 illustrates our system framework. It consists of three parties: the data owner (DO), the client, and the cloud service provider (CSP). The data owner possesses two kinds of keys. One is obtained from a trusted key distribution center and comprises a private and a public key for signatures. The other is generated by the DO itself for encrypting sensitive data and query points. The DO outsources the encrypted database to the CSP. Whenever updates occur, the corresponding encrypted data and the new signature are forwarded to the CSP. The CSP hosts the outsourced database and provides query services for the client. To process spatial queries efficiently, the CSP maintains an authenticated spatial index structure. For each incoming query, it runs the search algorithm to find the query results and builds the corresponding verification object (VO) for query authentication. To maintain the confidentiality of query points, a client transmits the processed query points to the DO for encryption. After receiving an encrypted query point, the client transmits it to the CSP for a query. Once it receives the VO, the client extracts the query results and performs query authentication. In general, the clients are located at the edge of the network and have low network bandwidth and computing power. The clients trust only the signature information published by the DO.

3.2. System Assumption

We assume that the private data are numeric and represented by real numbers, such as navigation position coordinates, minimums, maximums, and amounts. Each multidimensional data item is denoted as a column vector. The CSP is semitrusted: it can directly access the outsourced database, fabricate or tamper with the data, and return only a subset of the real result set to save computation power for providing paid services to more users. At the same time, the CSP executes our scheme honestly. To maintain the confidentiality of query points, the client first transmits a processed query point to the DO for encryption and then extracts the encrypted query point and transmits it to the CSP for query processing. The client verifies the query results through the VO and the public key published by the DO. Furthermore, the client is semitrusted and may collude with the CSP or other clients to recover the original data. Therefore, the encryption key owned by the DO should not be revealed to the client or the CSP.

In summary, attacks can be divided into three levels according to the knowledge the attacker can obtain.

Level 1. The attacker observes only the encrypted database and the encrypted query points. This is known as the ciphertext-only attack [18].

Level 2. In addition to the encrypted data, the attacker also knows part of the original plain data and some encryption information, such as the maximum, minimum, and data distribution of the encrypted data. However, the attacker does not know the corresponding encrypted values of those plain data. This corresponds to the known-sample attack [39].

Level 3. In addition to the knowledge obtained in level 2, the attacker observes a set of plain data and knows the corresponding ciphertexts. This is known as the known-plaintext attack in cryptography [18].

Clearly, the knowledge available to a lower-level attacker is a subset of what a higher-level attacker knows. If an encryption scheme can resist higher-level attacks, it can also resist lower-level attacks. Since the known-sample attack is the one usually encountered in practical applications, we design our encryption scheme against known-sample attacks.

Based on these assumptions, we should preserve the confidentiality of the outsourced sensitive data and query points and provide query integrity authentication for kNN queries. The details are as follows:
(1) Data privacy: the confidential data should not be revealed to anyone else. Only encrypted data is outsourced to the CSP.
(2) Query privacy: a client’s query points should be kept private to the client. Neither the DO nor the CSP can obtain the plain query points.
(3) Key privacy: existing schemes usually share the key with the clients, so the CSP can easily obtain the key from a colluding or compromised client and recover the original data. These schemes therefore have to assume that the clients are fully trusted. In our system, every party is semitrusted, and the encryption key owned by the DO should not be disclosed to anyone else.
(4) Query authentication: based on the SS-tree, we propose a novel authenticated spatial index structure for kNN queries and query authentication.

The main symbols used are listed in Table 1.

4. EASPE

4.1. Preliminary of ASPE

The basic idea of ASPE [12] is the observation that the distance between database points is not necessary for a kNN query. According to (1), ASPE can determine the distance relation between any two data points $p_1$, $p_2$ and a query point $q$:

$$d(p_1, q) \ge d(p_2, q) \iff \|p_1\|^2 - 2\, p_1 \cdot q \ge \|p_2\|^2 - 2\, p_2 \cdot q, \tag{1}$$

where $\|p\|^2 = p \cdot p$ is the scalar product of point $p$ with itself, which can be computed in advance and stored together with the corresponding data for queries. Hence, ASPE does not need to preserve $p_1 \cdot p_2$. For any two points $p_1$, $p_2$ in the database, the distance between them can be computed by

$$d(p_1, p_2) = \sqrt{\|p_1\|^2 - 2\, p_1 \cdot p_2 + \|p_2\|^2}. \tag{2}$$

It is easy to see from (1) and (2) that ASPE does not keep the scalar product $p_1 \cdot p_2$, which ensures that the CSP cannot compute the distance between any database points through (2). Moreover, the CSP is able to determine which data point is nearer to the query point through (1).

Definition 1 (asymmetric scalar-product-preserving encryption (ASPE)). An encryption function $E$ is an ASPE if and only if it satisfies the following two conditions:
(1) For any data point $p$ and any query point $q$, $E(p) \cdot E(q) = p \cdot q$.
(2) For any $p_i$ and $p_j$ in the database $DB$, $E(p_i) \cdot E(p_j) \ne p_i \cdot p_j$.

As can be seen from Definition 1, data points and query points must be encrypted with different encryption keys to ensure that the encrypted value of any query point $q$ is different from that of any data point $p$ in $DB$, even if $p = q$.

When encrypting a data point, ASPE randomly generates a $(d+1) \times (d+1)$ invertible matrix $M$ as the encryption key and extends every data point $p$ to a new $(d+1)$-dimensional point $\hat{p} = (p^T, -0.5\|p\|^2)^T$, which is encrypted into $p' = M^T \hat{p}$. When encrypting a query point $q$, the client randomly selects a positive random number $r$ and extends the query point to a new $(d+1)$-dimensional point $\hat{q} = r(q^T, 1)^T$, and then encrypts $\hat{q}$ into $q' = M^{-1} \hat{q}$, where $M^{-1}$ is the encryption key of query points. To determine whether an encrypted data point $p_1'$ is nearer to a query point $q'$ than $p_2'$ is, the search algorithm checks whether $(p_1' - p_2') \cdot q' > 0$:

$$(p_1' - p_2') \cdot q' = (M^T \hat{p}_1 - M^T \hat{p}_2)^T M^{-1} \hat{q} = (\hat{p}_1 - \hat{p}_2)^T \hat{q} = 0.5\, r \left( d(p_2, q)^2 - d(p_1, q)^2 \right). \tag{3}$$

Since $r$ is a positive random number, we can determine that

$$(p_1' - p_2') \cdot q' > 0 \iff d(p_1, q) < d(p_2, q). \tag{4}$$
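The ASPE comparison can be checked numerically. The sketch below follows the standard ASPE construction from [12] as we understand it: a data point p is extended with -0.5 times its squared norm and multiplied by the transposed key, and a query point q is scaled by a positive random number r, extended with r, and multiplied by the inverse key. Exact fractions are used so that the result does not depend on rounding. The helper names and the toy key generation are assumptions made for illustration only.

```python
import random
from fractions import Fraction

def invert(matrix):
    """Gauss-Jordan inversion over exact fractions; ValueError if singular."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def random_invertible(n, rng):
    """Toy key generation: retry until a random integer matrix is invertible."""
    while True:
        m = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
        try:
            invert(m)
            return m
        except ValueError:
            continue

def encrypt_point(p, m):
    """p' = M^T (p, -0.5 * ||p||^2)."""
    extended = list(p) + [Fraction(-1, 2) * dot(p, p)]
    return mat_vec([list(col) for col in zip(*m)], extended)

def encrypt_query(q, m_inv, r):
    """q' = M^-1 * r * (q, 1), with r > 0 chosen at random."""
    return mat_vec(m_inv, [r * x for x in q] + [r])

def nearer(p1_enc, p2_enc, q_enc):
    """True if the first encrypted point is nearer to the query point."""
    return dot([a - b for a, b in zip(p1_enc, p2_enc)], q_enc) > 0

rng = random.Random(7)
M = random_invertible(3, rng)
M_INV = invert(M)
r = Fraction(rng.randint(1, 100))
p1, p2, q = [1, 2], [5, 5], [2, 2]
e1, e2 = encrypt_point(p1, M), encrypt_point(p2, M)
eq = encrypt_query(q, M_INV, r)
```

Here `nearer(e1, e2, eq)` holds because p1 is closer to q than p2 is, and the server reaches that conclusion from ciphertexts alone.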

4.2. kNN Query on ASPE

As described in Section 4.1, ASPE assumes that the client is fully trusted by the data owner, and the encryption key and configuration information are shared with the client. In a more practical scenario, however, a client may be compromised or may collude with the CSP, so that the CSP can easily obtain the key and the private configuration to decrypt the encrypted data. One plausible approach is for the DO to keep the encryption key private and to run a secure two-party computation protocol [29, 30] with the clients: the DO encrypts a processed query point $\hat{q}$ and transmits only the encrypted query point $q'$ to the client, without disclosing the encryption key $M^{-1}$. However, the combination of ASPE and secure two-party computation still cannot keep the key confidential [14], because the key leaks through legal outputs. The client can choose enough processed query points $\hat{q}_1, \ldots, \hat{q}_{d+1}$ and obtain the corresponding encrypted query points $q_1', \ldots, q_{d+1}'$, which gives $Q' = M^{-1} Q$ with $Q = (\hat{q}_1, \ldots, \hat{q}_{d+1})$ and $Q' = (q_1', \ldots, q_{d+1}')$. Obviously, if $Q$ is an invertible matrix, the client can obtain $M^{-1} = Q' Q^{-1}$, with which the client can encrypt a new query point. Therefore, the encryption key and the sensitive data are exposed to attackers.
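The key leakage described above is easy to reproduce. The sketch below models the DO as an oracle that multiplies a processed query point by a secret inverse matrix, which is our reading of the setting in this section. The concrete matrix, the chosen query points, and all names are illustrative assumptions.

```python
from fractions import Fraction

def invert(matrix):
    """Gauss-Jordan inversion over exact fractions; ValueError if singular."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def mat_mul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

# Secret of the DO (toy values): M and its inverse, never sent to the client.
M = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
M_INV = invert(M)

def do_encrypt_query(processed_query):
    """Oracle offered by the DO: returns M^-1 times the processed query point."""
    return [sum(a * b for a, b in zip(row, processed_query)) for row in M_INV]

# Client side: d + 1 = 3 linearly independent processed query points.
chosen = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
responses = [do_encrypt_query(c) for c in chosen]

# Q' = M^-1 Q with the chosen points as columns, hence M^-1 = Q' Q^-1.
recovered = mat_mul(transpose(responses), invert(transpose(chosen)))

# With the recovered key, the client can encrypt new queries without the DO.
forged = [sum(a * b for a, b in zip(row, [3, 4, 1])) for row in recovered]
```

After only d + 1 legal requests the client holds the full query key, which is why the enhanced scheme in the next subsection has the DO perturb each query before the matrix transformation.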

4.3. Enhanced ASPE

We propose an enhanced ASPE (EASPE) that keeps the encryption key confidential from the clients. Unlike ASPE, we assume that the three parties in our system model do not trust one another. Therefore, the DO must keep the encryption key confidential so that nobody else can obtain it, while the client should keep the query points secret from the DO and the CSP. Our encryption scheme is similar to the approach proposed in [14]. However, the scheme in [14] adopts Paillier homomorphic encryption to encrypt query points, which burdens the client with more computation overhead. In our scheme, we apply a 1-out-of-$n$ oblivious transfer protocol [40] for query processing. In such a protocol, one party, Bob, has $n$ inputs $x_1, \ldots, x_n$, and the other party, Alice, learns one input $x_i$ for some $1 \le i \le n$ of her choice, without learning anything about the other inputs and without allowing Bob to learn anything about $i$. The client generates a random matrix that contains the processed query point and sends it to the DO for encryption.

Before data points are encrypted, several artificial columns are added to each data point and filled with independently generated nonce random numbers, so that identical points are encrypted into different ciphertexts. Likewise, the client adds the same number of artificial columns to a query point and perturbs the query point with independently generated random numbers. The client sends a mixed matrix, which contains the extended query point and several randomly generated vectors, to the DO for encryption. Finally, the DO perturbs the matrix before the matrix transformation so that the encrypted query points cannot reveal the key.

The scheme has two phases. In the data processing phase, a data point $p$ and a query point $q$ are transformed into $\hat{p}$ and $\hat{q}$, respectively. In the encryption phase, the DO encrypts $\hat{p}$ and $\hat{q}$ and outputs the ciphertexts $p'$ and $q'$. Note that the DO cannot directly compute $\hat{q}$ while encrypting the query point, because nobody except the client knows the original query point. To simplify the description, we nevertheless use $\hat{q}$ to state our scheme in the first phase. The two phases are described next.

Data Processing. The DO first selects a positive integer $c$ as the system security parameter. For point perturbation, the DO generates two random vectors, which form part of the encryption key and are shared by all points in the database. A permutation function randomly changes the order of the dimensions of the extended vector. Applying this processing to each data point $p$ yields (5), where $p_i$ indicates the $i$th dimension of $p$.

For each query point $q$, the client first selects a positive random number and creates a random vector, extends $q$ to $\hat{q}$, and transmits $\hat{q}$ to the DO. The DO generates a random vector of $c$ dimensions to perturb the last $c$ dimensions of $\hat{q}$, which yields (6). Since the permutation function does not change the scalar product between a data point and a query point, the scalar product of $\hat{p}$ and $\hat{q}$ is unaffected by the permutation. For any two data points $p_1$, $p_2$ and a query point $q$, we have (7). Since the random number is positive, it does not affect the result of the comparison in (7); that is, (8) holds.

Encryption Phase. The DO generates an invertible matrix $M$ as the encryption key to encrypt $\hat{p}$ into $p'$. For each query point, the DO randomly generates a positive random number to compute $q'$.

The details of encryption process are as follows.

Key generation. The DO generates two random vectors, an invertible matrix $M$, and a permutation function. The DO sets this quadruple as the encryption key and keeps it private.

EncDB. Once it has obtained $\hat{p}$, the DO computes the encrypted point $p'$. The encrypted points together form the encrypted database.

EncQuery.
(1) After obtaining the extended point $\hat{q}$, the client follows the 1-out-of-$n$ oblivious transfer protocol [40] and generates a matrix that contains $\hat{q}$ and $n - 1$ other column vectors. The leading entries of the other column vectors are generated randomly, and these vectors are extended to the same dimension as the query point. The position $i$ of the column vector $\hat{q}$ is randomly selected from 1 to $n$ and is known only to the client. The client transmits the matrix to the DO for encryption.
(2) For each column, the DO generates a random vector of $c$ dimensions to perturb the last $c$ dimensions and then applies the permutation function. The DO randomly selects a positive number and computes the encrypted matrix.
(3) After obtaining the encrypted matrix, the client extracts the encrypted query point $q'$, that is, the $i$th column vector of the matrix.

To determine whether a data point $p_1$ is nearer to the query point $q$ than $p_2$ is, the algorithm compares the scalar products of $p_1'$ and $p_2'$ with $q'$ according to (9).
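The column-hiding step of EncQuery can be sketched as follows. This is not an implementation of the oblivious transfer protocol in [40] and carries none of its security guarantees; it only illustrates the data flow in which the client hides the query among random columns, the DO transforms every column alike, and the client keeps the one column it needs. The DO's perturbation and permutation steps are omitted, and the secret matrix, scale factor, and function names are assumptions.

```python
import random
from fractions import Fraction

def build_mixed_matrix(extended_query, n, rng):
    """Client: place the query at a secret position among n - 1 random columns."""
    dim = len(extended_query)
    position = rng.randrange(n)
    columns = [[Fraction(rng.randint(-50, 50)) for _ in range(dim)]
               for _ in range(n)]
    columns[position] = list(extended_query)
    return columns, position

def do_encrypt_columns(columns, secret_matrix, scale):
    """DO: apply the same secret linear map and positive scale to every column."""
    return [[scale * sum(a * b for a, b in zip(row, col)) for row in secret_matrix]
            for col in columns]

rng = random.Random(1)
q_hat = [Fraction(3), Fraction(4), Fraction(1)]     # extended query point
SECRET = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]          # stand-in for the DO's key

columns, position = build_mixed_matrix(q_hat, 5, rng)
encrypted = do_encrypt_columns(columns, SECRET, Fraction(7))
q_enc = encrypted[position]                         # client keeps this column only
```

The DO sees five columns and cannot tell from their position which one carries the real query, while the client still receives a correctly transformed query point.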

5. VSS-Tree

The simplest way to answer a kNN query is to scan the entire database. However, the query time then grows in proportion to the data size and the number of disk accesses, which usually cannot meet users’ needs. To improve the efficiency of spatial queries, researchers have built various spatial index structures, such as the R-tree [41] and the SS-tree [17]. In this section, we extend the SS-tree [17] with authentication information and build a verifiable SS-tree (VSS-tree) for query processing and query authentication.

5.1. VSS-Tree

Unlike the R-tree and the R*-tree, the similarity search tree (SS-tree) [17] uses bounding spheres rather than bounding rectangles as the region shape. The SS-tree divides multidimensional points into isotropic neighborhoods. Because bounding spheres are used, the overlap between regions is reduced, which improves query efficiency. The structure of the SS-tree is shown in Figure 2. A verifiable SS-tree (VSS-tree) is built by extending the SS-tree with authentication information, and its structure is shown in Figure 3. The center of a bounding sphere is the centroid of the underlying points of its children. Compared with the R-tree, the SS-tree needs only about half the storage: a bounding sphere is described by a center and a radius, so its storage cost is one multidimensional point plus one number, whereas a rectangle is determined by the two points at its lower left and upper right corners, so its storage cost is twice the number of dimensions. As a result, the SS-tree has a larger fanout and a lower height.

A leaf node stores its entries together with two values that denote the minimum and maximum values of the entries in the node. An entry of a leaf node is a triple $(p, S, h)$, where $p$ is a data point in the database, $S$ is the enclosing sphere of $p$, and $h$ is the hash value computed on the record that $p$ points to. An internal node of the VSS-tree consists of entries of the form $(S_i, ptr_i, w_i, h_i)$, where $S_i$ is the minimum bounding sphere that encompasses all the regions of the $i$th child and consists of a center and a radius. The pointer $ptr_i$ points to the $i$th child. The variable $w_i$ is the number of points contained in the subtree rooted at the $i$th child. The hash value $h_i$ summarizes all the bounding spheres and digests of the $i$th child; that is, it is the hash of the concatenation of the bounding spheres and digests stored in that child. The center of a bounding sphere is computed according to

$$x_j = \frac{\sum_{i=1}^{n} C_i.w \cdot C_i.x_j}{\sum_{i=1}^{n} C_i.w},$$

where $i$ is an index over the children, $j$ is an index over the dimensions, $C_i.x_j$ is the $j$th coordinate of the center of child $C_i$, and $C_i.w$ is the number of points under child $C_i$. The radius of a bounding sphere is computed according to

$$r = \max_{1 \le i \le n} \left( d(x, x_i) + r_i \right),$$

where $x$ is the center of the current node itself, $x_i$ and $r_i$ are the center and radius of the $i$th child node, respectively, and $d(x, x_i)$ is the distance between the centers $x$ and $x_i$.
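The bounding sphere of an internal node can be computed from its children as sketched below. The sketch assumes the usual SS-tree rule: the center is the centroid of the child centers weighted by the number of points under each child, and the radius is the largest distance from that center to the far side of any child sphere. The tuple layout and the function name are hypothetical.

```python
import math

def bounding_sphere(children):
    """children: list of (center, radius, weight); returns (center, radius)."""
    total = sum(weight for _, _, weight in children)
    dims = len(children[0][0])
    # Weighted centroid of the child centers, one coordinate at a time.
    center = [sum(weight * c[j] for c, _, weight in children) / total
              for j in range(dims)]
    # The sphere must reach the far side of every child sphere.
    radius = max(math.dist(center, c) + r for c, r, _ in children)
    return center, radius

children = [((0.0, 0.0), 1.0, 1),   # one point around the origin
            ((4.0, 0.0), 1.0, 3)]   # three points around (4, 0)
center, radius = bounding_sphere(children)
```

The heavier child pulls the center toward (4, 0), and the radius of 4 is what is needed to still cover the lighter child on the far side.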

The VSS-tree is built from the bottom up, and all the leaf nodes are on the same level. Finally, the digest of the root node is signed and published.

Like the MR*-tree, the VSS-tree also supports dynamic operations, including insertion, update, and deletion. An update can be treated as a deletion followed by an insertion. Insertion uses forced reinsertion; that is, we first add the new data object to the reinsert list and then insert the entries in the reinsert list until the list is empty. Deletion is the same as in other query authentication approaches. In our scheme, scalar product comparison rather than Euclidean distance comparison is used to determine where to insert a data point. When a node overflows, the split algorithm is invoked: the coordinate variance of the centroids of its children is computed on each dimension, and the dimension with the highest variance is selected for the split.
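The choice of the split dimension can be sketched as follows. Selecting the dimension with the highest variance comes from the description above; splitting the sorted centroids at the median is a simplifying assumption made here, not necessarily the split position used by the SS-tree. The function names are hypothetical.

```python
from statistics import pvariance

def split_dimension(centroids):
    """Index of the dimension on which the child centroids vary the most."""
    dims = len(centroids[0])
    variances = [pvariance([c[j] for c in centroids]) for j in range(dims)]
    return max(range(dims), key=variances.__getitem__)

def split_node(centroids):
    """Split an overflowing node along the chosen dimension (median split)."""
    dim = split_dimension(centroids)
    ordered = sorted(centroids, key=lambda c: c[dim])
    mid = len(ordered) // 2
    return dim, ordered[:mid], ordered[mid:]

centroids = [(0, 0), (10, 1), (20, 0), (30, 1)]
dim, left, right = split_node(centroids)
```

In this toy node the centroids are spread along the first axis, so the split separates the two low-x children from the two high-x children.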

5.2. kNN Query on VSS-Tree

Figure 4 shows a 3NN query based on the VSS-tree. Given a query point $q$, the search algorithm gradually increases the search distance around the query point until the search area just encompasses three data points. The search algorithm runs on the cloud server and builds the VO from the VSS-tree. Unlike distance-based comparison methods, we determine the distance relationship by comparing the scalar products between the data points and the query point. The search algorithm is shown in Algorithm 1. According to (9), we can compare the distances from any two encrypted data points to an encrypted query point to determine which point is closer to the query point. Furthermore, we introduce a sorted list $L$, which stores the entries of a visited internal node, to avoid unnecessary accesses. The list $R$ holds the $k$ candidate results, MaxDist denotes the maximum value in $R$, and if $R$ holds fewer than $k$ objects, MaxDist is defined as $\infty$. The search proceeds from top to bottom. If the currently visited node is an internal node, all its entries and their corresponding scalar products are inserted into $L$. The CSP then iterates through the ordered list and recursively invokes the search algorithm on the child nodes. Once a scalar product is greater than MaxDist, the remaining entries in $L$ are skipped, because they cannot contain any results. The minimum bounding spheres and digests of these unvisited entries are inserted into the VO. If the currently visited node is a leaf node and the length of $R$ is less than $k$, the visited entry and its scalar product are inserted into $R$; otherwise, the entry is inserted into $R$ only if its scalar product is less than MaxDist. The VO is constructed during the search and consists of all the multidimensional encrypted data points in the visited leaf nodes, together with the bounding spheres and corresponding digests of the pruned child nodes of the visited internal nodes. Finally, Algorithm 1 outputs the list $R$ and the VO.

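Algorithm 1: kNN search on the VSS-tree.
Require:
  node N, query point q, k
Ensure:
  R, VO
Append [ to VO
if N is a leaf node then
  for each entry e in N do
    VO.append(e)
    if |R| < k then
      insert e and its scalar product with q into R
    else
      Sort R in ascending order
      if the scalar product of e and q is less than MaxDist then
        replace the last entry of R with e
        update MaxDist
      end if
    end if
  end for
else
  for each entry e in N do
    compute the scalar product of e and q
    insert e and its scalar product into L
  end for
  Sort L in ascending order
  for each entry e in L do
    if the scalar product of e is not greater than MaxDist then
      Search(e.child, q, k)
    else
      VO.append((e.sphere, e.hash))
    end if
  end for
end if
Append ] to VO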
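The search and the construction of the verification object can be sketched in runnable form. The sketch below works on plaintext points and ordinary Euclidean distances instead of encrypted points and scalar products, and it uses placeholder strings as digests, so it only illustrates the traversal order, the MaxDist pruning, and the shape of the VO. The `Node` class and all names are assumptions.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: tuple
    radius: float
    digest: str                                     # placeholder for a real hash
    children: list = field(default_factory=list)    # internal node: child nodes
    points: list = field(default_factory=list)      # leaf node: data points

def min_dist(node, q):
    """Smallest possible distance from q to anything inside the node's sphere."""
    return max(0.0, math.dist(node.center, q) - node.radius)

def knn_search(node, q, k, result, vo):
    """Fill result with the k nearest (distance, point) pairs and build the VO."""
    vo.append("[")
    if not node.children:                           # leaf node
        for p in node.points:
            vo.append(("point", p))                 # every visited point enters the VO
            result.append((math.dist(p, q), p))
            result.sort()
            del result[k:]                          # keep only the k best candidates
    else:
        for child in sorted(node.children, key=lambda c: min_dist(c, q)):
            max_dist = result[-1][0] if len(result) == k else math.inf
            if min_dist(child, q) <= max_dist:
                knn_search(child, q, k, result, vo)
            else:                                   # pruned: sphere and digest only
                vo.append(("pruned", child.center, child.radius, child.digest))
    vo.append("]")

leaf_a = Node((0.0, 0.0), 1.5, "digest-A", points=[(0, 0), (1, 0), (0, 1)])
leaf_b = Node((10.0, 10.0), 1.5, "digest-B", points=[(10, 10), (11, 10)])
root = Node((5.0, 5.0), 9.0, "digest-root", children=[leaf_a, leaf_b])

result, vo = [], []
knn_search(root, (0.2, 0.1), 2, result, vo)
nearest = [p for _, p in result]
```

For this toy tree the second leaf is never opened: only its sphere and digest enter the VO, which is what later lets the client confirm that nothing inside it could have been closer than the returned points.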

Once it receives $R$ and the VO, the client extracts the encrypted query results from the VO and performs query verification. Unlike other approaches, the client obtains the maximum value MaxDist and verifies that it is less than the scalar products of the entries not in $R$ to check the completeness of the results. The verification process is as follows:
(1) The client obtains MaxDist from the list $R$ and verifies that every scalar product in $R$ is less than or equal to MaxDist, while the other scalar products are greater than MaxDist.
(2) The client verifies that the scalar product between the query point and the bounding sphere in every pair (bounding sphere, digest) is greater than MaxDist.
(3) The client checks whether the reconstructed root hash agrees with the signed root digest.

The verification algorithm is shown in Algorithm 2.

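Require:
  VO, the result list returned by the CSP, query point
Ensure:
  C, hash, result
for each entry E in VO do
  if E is a data object then
    // Enlarge the bounding sphere C to encompass E
    C.Enlarge(E)
  end if
  if E.getProduct ≤ MaxDist and E.id in the result list then
    result.add(E)
  end if
  if E.getProduct ≤ MaxDist and E.id not in the result list then
    Alarm the client
  end if
  if E is a symbol [ then
    recursively verify the child node that starts at this symbol
  end if
  if E is a pair (bounding sphere, digest) then
    if the scalar product between the bounding sphere and the query point ≤ MaxDist then
      Alarm the client
    end if
    enlarge C to encompass the bounding sphere
    append the bounding sphere and its digest to the hash input
  end if
  if E is a symbol ] then
    return C, the hash of the collected input, and result
  end if
end for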

The essence of the verification algorithm is to reconstruct the VSS-tree by scanning the VO. During verification, the bounding sphere is enlarged gradually to encompass the objects read from the VO. Finally, the algorithm reconstructs the bounding sphere and digest of the root node, and the client checks whether the reconstructed root digest agrees with the signed root digest to verify the query.
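The verification idea can be illustrated with a simplified Python sketch. Several things here are assumptions rather than the scheme's actual definitions: the token layout of the VO is hypothetical, a node digest is taken to be the SHA-256 hash of its concatenated child digests (bounding spheres are left out of the digest), scalar products are carried inside the tokens instead of being recomputed by the client, and `trusted_root` stands in for a root digest whose signature has already been checked.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def verify(vo, result_ids, max_dist, trusted_root):
    """Rebuild the root digest from the VO and check the product conditions."""
    stack = []   # one list of child digests per currently open node
    root = None
    for tok in vo:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            node_digest = h(b"".join(stack.pop()))
            if stack:
                stack[-1].append(node_digest)
            else:
                root = node_digest
        elif tok[0] == "data":
            _, ident, product, payload = tok
            if ident in result_ids and product > max_dist:
                return False   # a claimed result lies beyond MaxDist
            if ident not in result_ids and product < max_dist:
                return False   # a closer point was left out (ties tolerated)
            stack[-1].append(h(payload))
        else:  # ("pruned", product, digest)
            _, product, digest = tok
            if product <= max_dist:
                return False   # the pruned subtree may hide a result
            stack[-1].append(digest)
    return root == trusted_root


# Hypothetical honest tree: three leaves with two data points each.
d_a = h(h(b"a1") + h(b"a2"))
d_b = h(h(b"b1") + h(b"b2"))
d_c = h(h(b"c1") + h(b"c2"))
root = h(d_a + d_b + d_c)

vo = [
    "[",
    "[", ("data", "a1", 1.0, b"a1"), ("data", "a2", 4.0, b"a2"), "]",
    "[", ("data", "b1", 2.0, b"b1"), ("data", "b2", 3.0, b"b2"), "]",
    ("pruned", 9.0, d_c),
    "]",
]

print(verify(vo, {"a1", "b1", "b2"}, 3.0, root))
```

A result list that swaps a2 in for b2, or a VO in which a payload has been altered, is rejected by the same function: the first fails the product check and the second fails the root digest comparison.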

6. Security Analysis and Integrity Verification

6.1. Security Analysis

As described in Section 3, all three parties are semitrusted. Our scheme considers the privacy of the outsourced database, the query points, and the encryption key. The CSP can directly access the outsourced database, so we need to ensure their confidentiality against the CSP. We consider data privacy together with query privacy against the CSP under level 2 attacks.

Theorem 2. EASPE is not distance-recoverable.

Proof. EASPE is an enhanced ASPE whose encryption key contains an invertible matrix that plays the same role as the encryption key of ASPE. EASPE encrypts the data points in the database and the query points with different transformations derived from this key. Since ASPE has been proven not to be distance-recoverable, EASPE is not distance-recoverable either.
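The property behind this argument can be illustrated with a basic ASPE-style transformation in Python. This is a sketch of the general technique, not EASPE itself: it omits the artificial dimensions, the one-time random vectors, and the permutation. The matrix `M`, the scaling factor `r`, and the point encoding (chosen so that a smaller scalar product means a closer point) are assumptions made for the example.

```python
from fractions import Fraction


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def inverse(m):
    """Gauss-Jordan inverse over exact fractions (m must be invertible)."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def extend_point(p):
    # (-2p, |p|^2): its product with r(q, 1) equals r(|p - q|^2 - |q|^2).
    return [-2 * x for x in p] + [sum(x * x for x in p)]


def extend_query(q, r):
    return [r * x for x in q] + [r]


M = [[2, 1, 0], [0, 3, 1], [1, 0, 4]]   # secret invertible matrix (assumed)
p1, p2, q, r = [1, 2], [4, 6], [2, 2], 3

enc_p1 = mat_vec(transpose(M), extend_point(p1))   # data points use M^T
enc_p2 = mat_vec(transpose(M), extend_point(p2))
enc_q = mat_vec(inverse(M), extend_query(q, r))    # queries use M^-1

print(dot(enc_p1, enc_q), dot(enc_p2, enc_q))       # smaller means closer to q
print(sq_dist(p1, p2), sq_dist(enc_p1, enc_p2))     # distances are not preserved
```

Because the data point is multiplied by the transposed matrix and the query by the inverse, the matrix cancels in the scalar product, while the distance between two encrypted data points bears no fixed relation to their original distance.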

Theorem 3. EASPE is secure against level 2 attacks.

Proof. There are many types of level 2 attacks. According to the system security assumption, the following attacks are considered: distance-based inference attack, PCA, duplicate analysis, distribution analysis attack, and ICA-based attack. According to Theorem 2, EASPE is not distance-recoverable, so a distance-based inference attack is not feasible against our scheme.
Principal component analysis (PCA) has been proposed in [19] to match the correlations in the known data with the correlations in the encrypted data. Using the matched data, the attacker attempts to reconstruct the entire original database. However, in EASPE, the value on each dimension of an encrypted data point is a linear combination of the values on all dimensions of the original data point. EASPE also adds artificial columns and generates a random vector to obscure the original data. Furthermore, it uses a permutation function to randomly reorder the dimensions of the extended data point. EASPE therefore does not preserve the correlations among the original dimensions in the transformed space, and PCA is not applicable to EASPE.
Duplicate analysis [10] is applicable to attributes whose domain is small, such as the day of the week or the day of the month. By analyzing observations on the encrypted data, the attacker may determine the domain of the original attribute. Duplicate analysis targets value-based encryption, that is, schemes in which the values in each dimension are encrypted individually. However, EASPE is a tuple-based encryption, so duplicate analysis is not applicable to EASPE. Similarly, the distribution analysis attack estimates the plain data from the encrypted data: observations on the encrypted database may help an attacker determine the intervals into which the plain data fall. This attack also targets value-based encryption and is not applicable to EASPE.
The ICA-based attack [18, 19] tries to recover the plain data from the transformed data. The approach is based on the observation that the eigenvectors associated with the transformed data are obtained from those associated with the original data through left multiplication by the transformation matrix. Therefore, by estimating both sets of eigenvectors and matching them, the attacker can produce an estimate of the transformation matrix and then use it to estimate each data record. This attack assumes that the known samples follow the same distribution as the original data, and the matrix must be orthogonal or full rank. However, we introduce one-time random vectors for each data point and each query point. These random vectors are generated independently and kept private by the data owner, and the transformation matrix can be generated as an invertible but nonorthogonal matrix. Hence, EASPE impedes both ICA and the derivation of the transformation matrix. EASPE is therefore resilient to ICA-based attacks.
To keep the query points confidential from the data owner, a positive number is randomly selected and a random vector is generated to extend a query point to a higher-dimensional point. Then, a 1-out-of-n oblivious transfer protocol is used to generate a matrix that includes the processed query point and other random column vectors. The position of the column vector that holds the query point is randomly selected in the range from 1 to n and is known only to the client. The data owner cannot learn which column the client has chosen.

Theorem 4. The encryption key is kept confidential from the CSP and the clients.

Suppose that a client can transmit a small number of query points to the data owner for encryption and then tries to derive the encryption key from the correlation between the plaintexts and the corresponding ciphertexts. If we can keep the encryption key confidential from the clients, it follows that the key is also confidential from the CSP. Thus, we only need to prove that the encryption key is confidential from the clients.

Proof. A client transmits processed query points to the data owner and interacts with the data owner during the query encryption stage. We first consider the encryption of query points without the permutation function. The data owner encrypts a processed query point into an encrypted query point, each dimension of which is given by (14). In (14), all the values of the key are kept confidential from the client. The client only knows the original query point and its corresponding encrypted query point, from which the client can set up an equation. The client can obtain enough encrypted query points through his legal input or by colluding with other clients. However, the invertible matrix is generated randomly, and the one-time random parameters are selected independently for each query point, so the result is entirely random to the client. Moreover, the client can learn nothing about the key from the encrypted query points. Furthermore, EASPE applies a permutation function to the query points, so the client cannot learn the correspondence between the dimensions of the original and the encrypted query points. In addition, the permutation prevents the client from setting up equation (14), which further enhances the security of our scheme. In conclusion, the encryption key is kept private from the CSP and the clients.

6.2. Integrity Verification

Our scheme provides correctness and completeness verification for queries.

Theorem 5. The correctness of query results can be ensured by our scheme.

Proof. Suppose that there are one or more falsified or modified data points in the results. Note that the VSS-tree is built from bottom to top, so all data points in the database are involved in the construction of the root hash. Since the hash function is one-way and collision-resistant, the digest of any falsified or modified data point must differ from the original one. This change propagates from the leaf node to the root node, which makes the reconstructed root digest differ from the original one, so it does not agree with the signed root digest. Therefore, the client can detect any falsified or modified data in the results.

Theorem 6. The completeness of query results is ensured by our scheme.

Proof. Suppose that a data point in a leaf node is one of the query results but is not included in the returned results. To make the reconstructed root hash match the signed root digest, the VO must contain either all the data entries of that leaf node or the (bounding sphere, digest) pair of that leaf node. In the former case, the client can determine from the verification algorithm that the missing data point is one of the query results, because at least one point in the returned results is farther from the query point than the missing data point. In the latter case, the client can detect that the scalar product between the bounding sphere and the query point is less than MaxDist, which means that the leaf node contains one or more data points that are closer to the query point but was not visited by the search algorithm. This is detected during the verification process.

7. Experiment Evaluation

In this section, we mainly evaluate and compare the performance of DPT and our scheme. All programs are implemented in Java. Experiments are performed on an Intel Core i7-4790 3.6 GHz computer with 8 GB RAM running Windows 7. The block size is set to 2048 KB and the default value of the security parameter to 1. The experiments are conducted on both synthetic and real datasets. The random points generated in the synthetic database are uniformly distributed in a d-dimensional space. The real dataset is the “Shuttle” dataset from the UCI repository, which comprises 58 K points and 9 dimensions. We run each experiment 100 times and report the average to show the performance of the different schemes. We conduct two experiments under different data cardinalities and dimensions in the synthetic database. In the first experiment, the data cardinality is varied from 50 K to 500 K with the dimension fixed. In the second experiment, the dimension is varied from 3 to 100 with the data cardinality fixed. The performance is evaluated from the following aspects: (1) data encryption; (2) construction and storage of the VSS-tree; (3) query; (4) query verification.

7.1. Key Generation and Data Encryption

As described in Section 4, the transition matrix used in EASPE is an invertible matrix. In practical applications, the dimension of spatial data is usually less than 100. In our experiments, we generate the encryption key only once, which takes less than 1 ms for dimensions ranging from 3 to 100. Figure 5 illustrates the data encryption time for different data cardinalities. The encryption time includes generating the encryption key and encrypting all the data points. The encryption time of the Shuttle dataset is shown in Table 2.

As can be seen from Figure 5, the data encryption time is proportional to both data dimension and data cardinality. The encryption time of DPT is slightly shorter than that of EASPE because EASPE performs more multiplications and additions than DPT. As EASPE has more dimensions than ASPE, the encryption time of EASPE is slightly longer than that of ASPE.

7.2. Construction and Storage Cost of VSS-Tree

Figure 6 shows the storage cost of the VSS-tree. The storage costs under all schemes are proportional to data dimension and data cardinality. Due to the added dimensions of EASPE, the storage cost under EASPE is larger than that under DPT and ASPE. Furthermore, as described in Section 5.1, the SS-tree needs only about half the storage of the MR-tree, so the storage cost of the MR-tree is larger than that of the VSS-tree.

Figure 7 shows the build time of the VSS-tree. The build time of the VSS-tree under both encryption schemes is proportional to both data cardinality and dimension. The build time of the VSS-tree under EASPE is longer than that under DPT because a d-dimensional data point is extended to a higher-dimensional data point in EASPE, which makes the computation overhead greater than that under DPT. It should also be noted that the larger the security parameter is set, the longer it takes to build the VSS-tree. The build time of the MR-tree is shorter than that of the VSS-tree because a bounding rectangle requires only comparison operations on each dimension of a point, while a bounding sphere requires computing the center and radius.

Figure 8 shows the fanouts of internal nodes under the different encryption schemes. Since we add dimensions to each data point in EASPE, the fanout of the VSS-tree based on it is slightly smaller than that based on DPT and ASPE. The fanouts under all schemes decrease as the dimension increases, because the storage cost of a record increases with the dimension. Furthermore, the MR-tree uses bounding rectangles, whose storage is twice the number of dimensions, so the fanout of the MR-tree is smaller than that of the VSS-tree.
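The relationship between dimension, bounding shape, and fanout can be made concrete with a back-of-the-envelope Python sketch. All sizes below are assumptions for illustration only (8-byte coordinates, a 32-byte digest, an 8-byte child pointer, and a 4096-byte block) and are not the settings used in the experiments. A bounding sphere is taken to store a center plus a radius, and a bounding rectangle to store two corner points.

```python
def entry_bytes(coord_count, coord_bytes=8, digest_bytes=32, pointer_bytes=8):
    """Size of one internal-node entry under the assumed field sizes."""
    return coord_count * coord_bytes + digest_bytes + pointer_bytes


def fanout(block_bytes, dims, shape):
    """Number of entries that fit into one block for the given bounding shape."""
    # Sphere: center (dims values) plus radius; rectangle: two corners.
    coords = dims + 1 if shape == "sphere" else 2 * dims
    return block_bytes // entry_bytes(coords)


for dims in (10, 50):
    print(dims, fanout(4096, dims, "sphere"), fanout(4096, dims, "rectangle"))
```

Under these assumed sizes the sphere-based entry is always smaller than the rectangle-based one for more than one dimension, and both fanouts shrink as the dimension grows, which matches the qualitative trend described above.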

7.3. kNN Query Cost

We perform a kNN query on the VSS-tree with a fixed value of k. Figure 9 shows that the query processing time is proportional to both data dimension and data cardinality. The query efficiency under EASPE is higher than that under DPT. This is because, under EASPE, the search algorithm performs only multiplications and additions to compute the scalar product for each visited entry, whereas under DPT it computes the Euclidean distance, which requires multiplications, subtractions, and additions for each visited entry. As described in Section 5.1, the overlap area and regions in the MR-tree are larger than those in the VSS-tree, so more nodes need to be accessed for a query. Thus, the query processing time based on the MR-tree is longer than that based on the VSS-tree.
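The difference in per-entry work can be seen by counting arithmetic operations directly. The Python sketch below is illustrative only: the function names are hypothetical, and it compares two vectors of the same length, whereas in the scheme the encrypted vectors carry additional dimensions.

```python
def scalar_product(a, b):
    """Return the scalar product of a and b together with operation counts."""
    ops = {"mul": 1, "add": 0, "sub": 0}
    total = a[0] * b[0]
    for x, y in zip(a[1:], b[1:]):
        total += x * y
        ops["mul"] += 1
        ops["add"] += 1
    return total, ops


def squared_distance(a, b):
    """Return the squared Euclidean distance with operation counts."""
    ops = {"mul": 1, "add": 0, "sub": 1}
    diff = a[0] - b[0]
    total = diff * diff
    for x, y in zip(a[1:], b[1:]):
        diff = x - y
        total += diff * diff
        ops["mul"] += 1
        ops["add"] += 1
        ops["sub"] += 1
    return total, ops


u, v = [1, 2, 3, 4], [4, 3, 2, 1]
print(scalar_product(u, v))
print(squared_distance(u, v))
```

For vectors of length n, the scalar product needs n multiplications and n - 1 additions, while the squared distance needs n further subtractions on top of that.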

The size of the VO directly affects the server’s response speed and the network bandwidth consumed. In our experiment, the VO contains the multidimensional data points of the visited leaf nodes and the bounding spheres and corresponding digests of the pruned nodes. Figure 10 illustrates that the VO size increases with data cardinality. Because the MR-tree uses bounding rectangles, its VO size is larger than that under the VSS-tree.

Once it receives the VO, the client extracts the query results from it and validates the correctness and completeness of the query results. The verification cost includes the following: scanning the VO, hash computation, scalar product computation and comparison, and signature verification.

The verification time is shown in Figure 11. We can see that the verification time is proportional to the data cardinality. The verification time under EASPE is shorter than that under DPT. The reason is that the verification algorithm computes scalar products under EASPE, while it computes Euclidean distances under DPT.

8. Conclusion

In this paper, we first introduce EASPE to support secure kNN queries. EASPE is not distance-recoverable and preserves only the scalar products between data points in the database and query points. In addition, we propose a verifiable spatial index structure, the VSS-tree, to improve query efficiency and provide query verification. The security analysis and experiment results show that EASPE can resist level 2 attacks and that the cloud server can efficiently perform a kNN query on encrypted data points and query points. The encryption cost, query cost, and verification cost can meet practical requirements.

In the future, we will consider practical application scenarios in which there is more than one data source or the outsourced databases are distributed across different cloud service providers. The VSS-tree will be extended to support query authentication with multiple data sources or distributed databases.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (nos. 61772101, 61170169, 61170168, and 61602075).