Abstract

We discuss query optimization in a secure distributed database system, called the Secret Sharing Distributed DataBase System (SSDDBS). We have to consider not only subquery allocations to distributed servers and data transfer on the network but also decoding distributed shared data. At first, we formulated the subquery allocation problem as a constraints satisfaction problem. Since the subquery allocation problem is NP-complete in general, it is not easy to obtain the optimal solution in practical time. Secondly, we proposed a heuristic evaluation function for the best-first search. We constructed an optimization model on an available optimization software, and evaluated the proposed method. The results showed that feasible solutions could be obtained by using the proposed method in practical time, and that quality of the obtained solutions was good.

1. Introduction

The need for distributed databases (DDB) [1] is increasing as information technologies develop and the amount of data to be handled is growing exponentially. For databases such as those containing personal data, confidentiality, dependability, and robustness are becoming increasingly important. We have proposed a Secret Sharing Distributed Database (SSDDB) [2] that combines a secure distributed storage system [35] with a relational database system. The relations are divided into fragments in the SSDDB, and the fragments are managed by encrypting with a (𝑘,𝑛) threshold scheme [6]. Thus, the fragment must be both encrypted and decrypted every time there is a request to the database system to perform a search or some other functions. This makes it essential to have optimized parallel operation.

Optimization of the query process in an SSDDB proceeds in 3 stages: optimization of the query structure, optimization of subquery allocation to servers employed, one for each fragment request, and optimization of the combination of subquery results. This paper addresses the issue of subquery allocation to servers.

Optimization of the query process in an ordinary DDB is precisely described in [1]. A database management system must decide which server to use for processing when the same data are in distributed locations due to replication, even for ordinary distributed databases. Optimization of the query process for such a case is not sufficiently described in [1]. Evrendilek et al. proposed a query optimization problem which considers the data replication in [7]. However, with SSDDB, a server must be selected for decrypting fragments, and the procedures suggested for allocating subqueries to servers in ordinary DDBs are not very effective here. There exists no research on optimization of the query process for SSDDB as far as we know.

There are three factors to consider when allocating subqueries to servers: firstly, subquery allocation to servers; next, determination of the server storing the fragment needed to handle the subquery; finally, determination of the server for the necessary decrypting of the fragment. Subquery allocation to servers in an ordinal distributed database is known to be an NP-complete problem [7]. Since the subquery allocation problem in an ordinal distributed database is a special case of that of SSDDB, subquery allocation to servers becomes an NP-complete problem. This paper is an investigation of best-first search methods with a view to online use, and proposes heuristic methods for the best-first search.

Section 2 provides an overview of SSDDB and Section 3 demonstrates the formulation of the subquery allocation problem as a constraint satisfaction problem. A heuristic evaluation function for a best-first search is then proposed. Section 4 shows the execution of an experimental calculation and demonstrates the effectiveness of the proposed heuristic evaluation function.

2. Secret Sharing Distributed Database System

2.1. The (𝑘,𝑛) Threshold Scheme

The secret sharing distributed database system is software that safely performs distributed sharing of confidential information. The system described herein uses the (𝑘,𝑛) threshold scheme demonstrated by Shamir [6].

Let 𝐹 represent an algebraic number field and a confidential datum 𝑠 be an element of 𝐹. A (𝑘1)th degree polynomial equation is created using 𝑠 and random coefficients 𝑟𝑖𝐹 (𝑖=1,,𝑘1) 𝑓(𝑥)=𝑘1𝑖=0𝑟𝑖𝑥𝑖,(1) where 𝑟0=𝑠. The value 𝑤𝑗=𝑓(𝑥𝑗) is calculated for each of the 𝑛 distinct values of 𝑥𝑗𝐹 (𝑗=1,,𝑛,𝑥𝑗0). The value 𝑤𝑗 is called a share and 𝑥𝑗 is called the ID of share 𝑤𝑗.

Since 𝑓(𝑥) is a (𝑘1)th degree polynomial, it is uniquely defined by its values at 𝑘 different points (𝑥𝑗,𝑓(𝑥𝑗)). The confidential datum 𝑠 can be retrieved by evaluating 𝑠=𝑓(0).

In the (𝑘,𝑛) threshold scheme, it is impossible to decrypt 𝑠 using fewer than 𝑘 shares. There is no dependence on how shares are selected, so even if up to 𝑛𝑘 shares are lost, it remains possible to decrypt 𝑠.

2.2. SSDDB Structure

An outline of the structure of an SSDDB is presented in Figure 1. The SSDDB is composed of a database management system (DBMS) and a secret sharing storage system (SSSS). The DBM process on the DBMS and the SSS process on the SSSS are assumed to be connected to each other over the Internet and each is able to communicate with the other individually.

When the DBM receives a query from an external source, it creates a query processing plan. It carries out this processing while assigning subqueries to other DBMs for parts of the processing and sending requests to DBMs and SSSs to retrieve or decrypt the necessary data sets. Thus, the user can approach the DBMS in the same way as approaching an ordinary distributed database management system, but an SSSS is used for storage, so the specificity of an SSSS must be considered in the query optimization, as described below.

The data sets are divided into small sets (fragments) and encrypted using the (𝑘,𝑛) threshold scheme. This process is managed by the SSSS. A DBM is also capable of caching data sets. Only DBM3 in the figure is caching data.

The dashed lines in the figure indicate physical boundaries. DBM1, SSS1, DBM2, and SSS2 are each separate processes running on the same computer. In other words, no SSS process is running on the computer where DBM3 is running. A DBM can request an SSS to retrieve and decrypt a fragment on a different computer. However, this generates a cost due to transmission of data over the network.

3. Query Optimization for an SSDDB

3.1. Outline of Query Processing in an SSDDB

Optimization of query processing in an SSDDB consists of the three stages of firstly, query structure optimization, secondly, optimization of subquery allocation to multiple servers, one for each fragment request, and thirdly, optimization of the combination of the subquery results.

In query structure optimization, the query is subdivided by employing information in the fragments. The first task is to eliminate any redundant expressions sometimes included in user-generated queries. Queries are also rewritten or divided into subqueries in order to reduce the volume of intermediate data. At this point, SSDDB specificity is not yet required, so the ordinary procedure for a DDBS [1] can be used as is.

The generated subqueries can be processed by separate servers. In the second stage, a server for processing a subquery is selected so as to minimize the processing cost of the query, by considering fragment allocations and server loads. In the SSDDB, all fragments except for those in the cache are distributed as shares in the SSSS. Thus, no particular fragment is assigned to any particular server. The cost of recreating distributed fragments must be calculated. Sections 3.2 and 3.3 address the issues of allocating subqueries to servers while considering the specificity of the SSDDB.

Finally, the results of processing the subqueries assigned to the different servers are combined. As with the first step, SSDDB specificity is not required, so again the ordinary procedure for a DDBS [1, 7] can be applied as is.

3.2. Subquery Allocation Problem to Servers

Let 𝑆 be a finite set of servers with |𝑆|=𝑝. Here, the term server is assumed to mean a physical computer, at most one DBM process and at most one SSS process are running on each server. 𝐹 is a finite set of fragments with |𝐹|=𝑚, and 𝑆𝑄 is a finite set of subqueries with |𝑆𝑄|=𝑟.

Let 𝑑𝑗,+ be the cost required to decrypt some fragment 𝑓𝑗𝐹 on some server 𝑠𝑆. Here, + is the set of all nonnegative real numbers. Costs are set considerably higher in servers where an SSS process is not running, in order to avoid assigning decryption requests to such servers. When a fragment 𝑓𝑗 is decrypted on a server 𝑠, the cost for transmitting shares from other servers is assumed to be 𝑠𝑡𝑗,+.

The cost for processing a subquery 𝑠𝑞𝑖 on a server 𝑠 is assumed to be 𝑤𝑖,+. As was the case with decrypting costs, this should be set rather high for servers where a DBM process is not running. The cost necessary in order to transmit a fragment 𝑓𝑗 from a server 𝑠 to a 𝑠𝑒𝑟𝑣𝑒𝑟𝑠 is written 𝑡𝑓𝑗,,. The load cost of a server 𝑠 is assumed to be 𝑙+.

Some additional variables are employed here. 𝑣𝑖,𝑗 is set to the value of 1 when a subquery 𝑠𝑞𝑖 calls a fragment 𝑓𝑗; otherwise, to the value of 0. 𝑐𝑗, is set to 1 when a fragment 𝑓𝑗 is cached in a server 𝑠; otherwise, to 0. 𝑥𝑖, is set to 1 when a subquery 𝑠𝑞𝑖 is allocated to a server 𝑠; otherwise, to 0. 𝑦1,𝑗,2 is set to 1 when a fragment 𝑓𝑗 is transmitted to a server 𝑠1, where it is required, from a server 𝑠2; otherwise, it is 0. 𝑧𝑗, is set to 1 when a fragment 𝑓𝑗 is decrypted at a server 𝑠; otherwise, it is 0.

Normally, parameters of disk I/O, CPUs and the network can be used when constructing a query processing cost model in a distributed database. For an SSDDB, disk I/O can be considered a part of the share decryption cost; the following two costs are employed.

The calculation cost 𝑆𝐶 in a server 𝑠 is defined with the following expression: 𝑆𝐶=𝑙+𝑠𝑞𝑖𝑆𝑄𝑥𝑖,𝑤𝑖,+𝑓𝑗𝐹𝑧𝑗,𝑑𝑗,.(2)

The network cost 𝑁𝐶 for a server 𝑠 is defined thus: 𝑁𝐶=𝑓𝑗𝐹𝑧𝑗,𝑠𝑡𝑗,+𝑓𝑗𝐹𝑠2𝑆𝑦,𝑗,2𝑡𝑓𝑗,2,.(3)

The above expressions indicate that the subquery allocation problem to servers becomes a constraint satisfaction problem for the decision variables 𝑥𝑖,, 𝑦1,𝑗,2, and 𝑧𝑗,minmax𝑠𝑆𝑆𝐶+𝑁𝐶(4) subject to 𝑠𝑞𝑖𝑆𝑄,𝑠𝑆𝑥𝑖,=1,(5)𝑠𝑞𝑖𝑆𝑄,𝑓𝑗𝐹,𝑠1𝑆,𝑥𝑖,1𝑣𝑖,𝑗=𝑠2𝑆𝑦1,𝑗,2,(6)𝑠1𝑆,𝑓𝑗𝐹,𝑠2𝑆𝑦2,𝑗,1𝑧𝑗,1+𝑐𝑗,1=𝑠2𝑆𝑦2,𝑗,1.(7) Here, the object function prevents the subqueries from being concentrated on a single server. Equation (5) is to ensure that each subquery is processed only once by one of the servers. Equation (6) indicates that the fragment a server requires must always be transmitted from some server (please note that the 1=2 case is permitted). Equation (7) indicates that when a fragment 𝑓𝑗 is transmitted from a server 𝑠1 to a server 𝑠2, 𝑠1 must have either decrypted 𝑓𝑗 or been holding it in its cache.

3.3. Subquery Allocation to Servers Based on Best-First Searches

Subquery allocation to servers in an ordinal distributed database is known to be an NP-complete problem [7]. If 𝑘 is set to 1 in the (𝑘,𝑛) threshold scheme in an SSDDB, this describes an ordinal distributed database. Accordingly, the issue of subquery allocation to servers becomes an NP-complete problem. This paper is an investigation of best-first search methods with a view to online use.

In combinatorial optimization problem, searching the optimal solution is done by expanding states iteratively. The search process could be described by a tree structure, called the search tree. Figure 2 is an example of the search tree.

Based on preliminary experiments, we heuristically decided the order of decision variables as 𝑥𝑖,, 𝑧𝑗,, 𝑦1,𝑗,2 as shown in Figure 2. The depth of the search tree of each step is 𝑟, 𝑝𝑚, and 𝑟𝑚𝑝. But at the time when variables 𝑥𝑖, and 𝑧𝑗, are decided, most of variables 𝑦1,𝑗,2 have been decided by evaluating (5), (6), and (7).

In the best-first search, the most promising state is chosen by evaluating a function, called the heuristic evaluation function. Let us consider a situation expanding the state 𝑠 in Figure 2; we are going to choose a server to process the subquery 𝑠𝑞𝑖. The heuristic evaluation function 𝐺(𝑥𝑖,) shows an expected cost when choosing a server 𝑠, and the search process chooses the state whose 𝐺(𝑥𝑖,) is minimal.

The fragments in the SSDDB are encrypted using the (𝑘,𝑛) threshold scheme and the shares are distributed to the various servers. Therefore, in order to process a subquery, the shares are transmitted and the fragment must be decrypted. Let us assume here that a subquery 𝑠𝑞𝑖 is processed by a server 𝑠, and that in addition to the cost 𝑤𝑖, of processing 𝑠𝑞𝑖 on 𝑠, there is a cost for the decryption of the fragment 𝑓𝑗, which is needed for the processing in 𝑠𝑞𝑖, and transmitting 𝑓𝑗 to server 𝑠. Accordingly, the minimum possible value for the cost of processing a subquery 𝑠𝑞𝑖 on a server 𝑠 is as follows: 𝐺𝑥𝑖,=𝑙+𝑤𝑖,+min𝑠2𝑆𝑑𝑗,2+𝑠𝑡𝑗,2+𝑡𝑓𝑗,,2.(8)

When a subquery 𝑠𝑞𝑖 is supposed to be processed on a server 𝑠, a server to decrypt the corresponding fragment 𝑓𝑗 must be decided. The server 𝑠2 which gives the minimal value in (8) is the most promising one. Accordingly, a heuristic evaluation function for choosing a server to decrypt a fragment 𝑓𝑗 is given as follows: 𝐺𝑧𝑗,=0iftheserver𝑠givestheminimalvaluein(8)forsomesubquery,1,otherwise.(9)

Since most of variables 𝑦1,𝑗,2 could be decided by evaluating (5), (6), and (7), we have not prepared any heuristic evaluation function for the variable 𝑦1,𝑗,2.

4. Computational Experiment

This problem of subquery allocation to servers was solved with an ILOG solver (Ver. 5.1, ILOG). The problem was described using OPL. The operating system was Linux (CPU: Celeron 2.0 GHz, Memory: 1 GB).

The following 2 procedures were used for comparison.

No Heuristic
The ILOG Solver engine was used as is without heuristic evaluation functions. This solver determines variables in the order they are described. In the experimental model, they were described in the order 𝑥, 𝑧, 𝑦.

EDN
The heuristic evaluation function employed by Evrendilek et al. [7], which is referred to as EDN, was used in (10). 𝐺𝑥𝑖,=𝑙+𝑤𝑖,.(10) Equation (10) ensures that the processing for subqueries is distributed as widely as possible.

The results are presented in Table 1. Symbols 𝑝, 𝑚, and 𝑟 in the table denote the numbers of servers, fragments, and subqueries, respectively. 𝐷 denotes the dominant cost among the decryption cost (𝑑), subquery processing cost (𝑤) and transmission cost for shares and fragments (𝑡). As indicated by 𝑛, all of these costs are about the same. The dominant cost was selected at random from the range [100,999] and the other costs were randomly selected from [10,99]. Five problems were constructed using identical conditions.

The calculation time was limited to 1 hour and 𝑓 was the number of searches that were not completed within the time limit. Blank columns indicate that searches were completed for all problems. The value 𝑡𝑖 was the mean time required to find a feasible solution (initial feasible solution). The value 𝑡𝑜 was the mean time required to find the optimal solution. The value 𝑡𝑓 was the mean time required to complete the search. The means for 𝑡𝑜 and 𝑡𝑓 were taken over the searches that were completed. All times were stated in seconds.

The value 𝜖 is the ratio between the evaluated values of the initial feasible solution and the optimal solution and represents the mean over problems for which the optimal solution was identified.

The results when 𝑝, 𝑚, and 𝑟 were set to identical values and the scale of the problem was varied are listed in the upper part of Table 1. The time 𝑡𝑖 was within 0.1 s and the feasible solution was easily found with all the methods. Also, as 𝜖 shows, the proposed procedure proved to be better than the other procedures at finding the initial feasible solution.

The time required to complete the search grew exponentially with the scale of the problem.

Furthermore, the search was never completed within the time limit under the No Heuristic procedure, even when the problem scale was small. Thus, it is essential to develop a procedure for quickly deriving a useful feasible solution for the subquery allocation problem.

The proposed method was able to find optimal solutions for a greater quantity of the problems within the time limit than the EDN approach. Also, on average, the proposed method was superior from the perspective of times to find the optimal solution and complete the search. It should be noted that the values for 𝑡𝑜 and 𝑡𝑓 are the means of problems seeking optimal solutions within the time limit. For example, under 𝑝=8, 𝑚=8, 𝑟=8, and 𝐷=𝑤, for the proposed method 𝑡𝑜=452.08 s, while for EDN 𝑡𝑜=4.38 s. However, the proposed method successfully completed 3 problems, while EDN successfully completed only 1 problem. The given values for 𝑡𝑜 and 𝑡𝑓 for EDN represent those from that single problem, while the values from the proposed method are the means taken over the three problems.

Thus, the proposed method is superior to both EDN and the No Heuristic procedure in terms of both initial feasible solution and calculation time.

Let us turn to some observations about the relationship between problem characteristics and solution methods. First, when the transmission cost is dominant (𝐷=𝑡), the searches are completed more quickly than for other problems of similar scale. On the other hand, when the subquery processing cost or decryption cost is dominant (𝐷=𝑤 or 𝐷=𝑑), the time required for search completion increases with the scale of the problem. This seems to be because transmission cost depends greatly on server selection.

We observed changes in the calculation time when only one of the parameters 𝑝, 𝑚, or 𝑟 was changed (lower part of Table 1).

It takes more time to derive an initial feasible solution when the number of servers is increased; on the other hand, the time to identify an initial feasible solution remained within 1 s when either the number of fragments or subqueries was increased. The value of 𝜖 was raised relatively high by expanding the number of servers. Future studies must address the issue of developing procedures for identifying good feasible solutions at high speed in systems incorporating large numbers of servers.

5. Conclusions

It is essential to optimize query processing, as this has a great influence over database performance. This study provides a formulation of the subquery allocation problem in a secret sharing distributed database in terms of a constraint satisfaction problem. A heuristic evaluation function is proposed for solutions in best-first searches.

An experiment was conducted using an ILOG Solver to examine the effectiveness of the above function. Some issues remain for the proposed procedure when there are a large number of servers, and further research will be necessary. It will also be necessary to evaluate the performance of this method on an actual system.

Acknowledgment

The authors express their deep appreciation for the Grant-in-Aid for Scientific Research no. 19700060, which supported part of this study.