Abstract
We discuss query optimization in a secure distributed database system, called the Secret Sharing Distributed DataBase System (SSDDBS). We have to consider not only subquery allocations to distributed servers and data transfer on the network but also decoding distributed shared data. At first, we formulated the subquery allocation problem as a constraints satisfaction problem. Since the subquery allocation problem is NP-complete in general, it is not easy to obtain the optimal solution in practical time. Secondly, we proposed a heuristic evaluation function for the best-first search. We constructed an optimization model on an available optimization software, and evaluated the proposed method. The results showed that feasible solutions could be obtained by using the proposed method in practical time, and that quality of the obtained solutions was good.
1. Introduction
The need for distributed databases (DDB) [1] is increasing as information technologies develop and the amount of data to be handled is growing exponentially. For databases such as those containing personal data, confidentiality, dependability, and robustness are becoming increasingly important. We have proposed a Secret Sharing Distributed Database (SSDDB) [2] that combines a secure distributed storage system [3–5] with a relational database system. The relations are divided into fragments in the SSDDB, and the fragments are managed by encrypting with a threshold scheme [6]. Thus, the fragment must be both encrypted and decrypted every time there is a request to the database system to perform a search or some other functions. This makes it essential to have optimized parallel operation.
Optimization of the query process in an SSDDB proceeds in 3 stages: optimization of the query structure, optimization of subquery allocation to servers employed, one for each fragment request, and optimization of the combination of subquery results. This paper addresses the issue of subquery allocation to servers.
Optimization of the query process in an ordinary DDB is precisely described in [1]. A database management system must decide which server to use for processing when the same data are in distributed locations due to replication, even for ordinary distributed databases. Optimization of the query process for such a case is not sufficiently described in [1]. Evrendilek et al. proposed a query optimization problem which considers the data replication in [7]. However, with SSDDB, a server must be selected for decrypting fragments, and the procedures suggested for allocating subqueries to servers in ordinary DDBs are not very effective here. There exists no research on optimization of the query process for SSDDB as far as we know.
There are three factors to consider when allocating subqueries to servers: firstly, subquery allocation to servers; next, determination of the server storing the fragment needed to handle the subquery; finally, determination of the server for the necessary decrypting of the fragment. Subquery allocation to servers in an ordinal distributed database is known to be an NP-complete problem [7]. Since the subquery allocation problem in an ordinal distributed database is a special case of that of SSDDB, subquery allocation to servers becomes an NP-complete problem. This paper is an investigation of best-first search methods with a view to online use, and proposes heuristic methods for the best-first search.
Section 2 provides an overview of SSDDB and Section 3 demonstrates the formulation of the subquery allocation problem as a constraint satisfaction problem. A heuristic evaluation function for a best-first search is then proposed. Section 4 shows the execution of an experimental calculation and demonstrates the effectiveness of the proposed heuristic evaluation function.
2. Secret Sharing Distributed Database System
2.1. The Threshold Scheme
The secret sharing distributed database system is software that safely performs distributed sharing of confidential information. The system described herein uses the threshold scheme demonstrated by Shamir [6].
Let represent an algebraic number field and a confidential datum be an element of . A th degree polynomial equation is created using and random coefficients () where . The value is calculated for each of the distinct values of (). The value is called a share and is called the ID of share .
Since is a ()th degree polynomial, it is uniquely defined by its values at different points . The confidential datum can be retrieved by evaluating .
In the threshold scheme, it is impossible to decrypt using fewer than shares. There is no dependence on how shares are selected, so even if up to shares are lost, it remains possible to decrypt .
2.2. SSDDB Structure
An outline of the structure of an SSDDB is presented in Figure 1. The SSDDB is composed of a database management system (DBMS) and a secret sharing storage system (SSSS). The DBM process on the DBMS and the SSS process on the SSSS are assumed to be connected to each other over the Internet and each is able to communicate with the other individually.
When the DBM receives a query from an external source, it creates a query processing plan. It carries out this processing while assigning subqueries to other DBMs for parts of the processing and sending requests to DBMs and SSSs to retrieve or decrypt the necessary data sets. Thus, the user can approach the DBMS in the same way as approaching an ordinary distributed database management system, but an SSSS is used for storage, so the specificity of an SSSS must be considered in the query optimization, as described below.
The data sets are divided into small sets (fragments) and encrypted using the threshold scheme. This process is managed by the SSSS. A DBM is also capable of caching data sets. Only DBM3 in the figure is caching data.
The dashed lines in the figure indicate physical boundaries. DBM1, SSS1, DBM2, and SSS2 are each separate processes running on the same computer. In other words, no SSS process is running on the computer where DBM3 is running. A DBM can request an SSS to retrieve and decrypt a fragment on a different computer. However, this generates a cost due to transmission of data over the network.
3. Query Optimization for an SSDDB
3.1. Outline of Query Processing in an SSDDB
Optimization of query processing in an SSDDB consists of the three stages of firstly, query structure optimization, secondly, optimization of subquery allocation to multiple servers, one for each fragment request, and thirdly, optimization of the combination of the subquery results.
In query structure optimization, the query is subdivided by employing information in the fragments. The first task is to eliminate any redundant expressions sometimes included in user-generated queries. Queries are also rewritten or divided into subqueries in order to reduce the volume of intermediate data. At this point, SSDDB specificity is not yet required, so the ordinary procedure for a DDBS [1] can be used as is.
The generated subqueries can be processed by separate servers. In the second stage, a server for processing a subquery is selected so as to minimize the processing cost of the query, by considering fragment allocations and server loads. In the SSDDB, all fragments except for those in the cache are distributed as shares in the SSSS. Thus, no particular fragment is assigned to any particular server. The cost of recreating distributed fragments must be calculated. Sections 3.2 and 3.3 address the issues of allocating subqueries to servers while considering the specificity of the SSDDB.
Finally, the results of processing the subqueries assigned to the different servers are combined. As with the first step, SSDDB specificity is not required, so again the ordinary procedure for a DDBS [1, 7] can be applied as is.
3.2. Subquery Allocation Problem to Servers
Let be a finite set of servers with . Here, the term server is assumed to mean a physical computer, at most one DBM process and at most one SSS process are running on each server. is a finite set of fragments with , and is a finite set of subqueries with .
Let be the cost required to decrypt some fragment on some server . Here, is the set of all nonnegative real numbers. Costs are set considerably higher in servers where an SSS process is not running, in order to avoid assigning decryption requests to such servers. When a fragment is decrypted on a server , the cost for transmitting shares from other servers is assumed to be .
The cost for processing a subquery on a server is assumed to be . As was the case with decrypting costs, this should be set rather high for servers where a DBM process is not running. The cost necessary in order to transmit a fragment from a server to a is written . The load cost of a server is assumed to be .
Some additional variables are employed here. is set to the value of 1 when a subquery calls a fragment ; otherwise, to the value of 0. is set to 1 when a fragment is cached in a server ; otherwise, to 0. is set to 1 when a subquery is allocated to a server ; otherwise, to 0. is set to 1 when a fragment is transmitted to a server , where it is required, from a server ; otherwise, it is 0. is set to 1 when a fragment is decrypted at a server ; otherwise, it is 0.
Normally, parameters of disk I/O, CPUs and the network can be used when constructing a query processing cost model in a distributed database. For an SSDDB, disk I/O can be considered a part of the share decryption cost; the following two costs are employed.
The calculation cost in a server is defined with the following expression:
The network cost for a server is defined thus:
The above expressions indicate that the subquery allocation problem to servers becomes a constraint satisfaction problem for the decision variables , , and subject to Here, the object function prevents the subqueries from being concentrated on a single server. Equation (5) is to ensure that each subquery is processed only once by one of the servers. Equation (6) indicates that the fragment a server requires must always be transmitted from some server (please note that the case is permitted). Equation (7) indicates that when a fragment is transmitted from a server to a server , must have either decrypted or been holding it in its cache.
3.3. Subquery Allocation to Servers Based on Best-First Searches
Subquery allocation to servers in an ordinal distributed database is known to be an NP-complete problem [7]. If is set to 1 in the threshold scheme in an SSDDB, this describes an ordinal distributed database. Accordingly, the issue of subquery allocation to servers becomes an NP-complete problem. This paper is an investigation of best-first search methods with a view to online use.
In combinatorial optimization problem, searching the optimal solution is done by expanding states iteratively. The search process could be described by a tree structure, called the search tree. Figure 2 is an example of the search tree.
Based on preliminary experiments, we heuristically decided the order of decision variables as , , as shown in Figure 2. The depth of the search tree of each step is , , and . But at the time when variables and are decided, most of variables have been decided by evaluating (5), (6), and (7).
In the best-first search, the most promising state is chosen by evaluating a function, called the heuristic evaluation function. Let us consider a situation expanding the state in Figure 2; we are going to choose a server to process the subquery . The heuristic evaluation function shows an expected cost when choosing a server , and the search process chooses the state whose is minimal.
The fragments in the SSDDB are encrypted using the threshold scheme and the shares are distributed to the various servers. Therefore, in order to process a subquery, the shares are transmitted and the fragment must be decrypted. Let us assume here that a subquery is processed by a server , and that in addition to the cost of processing on , there is a cost for the decryption of the fragment , which is needed for the processing in , and transmitting to server . Accordingly, the minimum possible value for the cost of processing a subquery on a server is as follows:
When a subquery is supposed to be processed on a server , a server to decrypt the corresponding fragment must be decided. The server which gives the minimal value in (8) is the most promising one. Accordingly, a heuristic evaluation function for choosing a server to decrypt a fragment is given as follows:
Since most of variables could be decided by evaluating (5), (6), and (7), we have not prepared any heuristic evaluation function for the variable .
4. Computational Experiment
This problem of subquery allocation to servers was solved with an ILOG solver (Ver. 5.1, ILOG). The problem was described using OPL. The operating system was Linux (CPU: Celeron 2.0 GHz, Memory: 1 GB).
The following 2 procedures were used for comparison.
No Heuristic
The ILOG Solver engine was used as is without heuristic evaluation functions. This solver determines variables in the order they are described. In the experimental model, they were described in the order , , .
EDN
The heuristic evaluation function employed by Evrendilek et al. [7], which is referred to as EDN, was used in (10).
Equation (10) ensures that the processing for subqueries is distributed as widely as possible.
The results are presented in Table 1. Symbols , and in the table denote the numbers of servers, fragments, and subqueries, respectively. denotes the dominant cost among the decryption cost (), subquery processing cost () and transmission cost for shares and fragments (). As indicated by , all of these costs are about the same. The dominant cost was selected at random from the range and the other costs were randomly selected from . Five problems were constructed using identical conditions.
The calculation time was limited to 1 hour and was the number of searches that were not completed within the time limit. Blank columns indicate that searches were completed for all problems. The value was the mean time required to find a feasible solution (initial feasible solution). The value was the mean time required to find the optimal solution. The value was the mean time required to complete the search. The means for and were taken over the searches that were completed. All times were stated in seconds.
The value is the ratio between the evaluated values of the initial feasible solution and the optimal solution and represents the mean over problems for which the optimal solution was identified.
The results when , and were set to identical values and the scale of the problem was varied are listed in the upper part of Table 1. The time was within 0.1 s and the feasible solution was easily found with all the methods. Also, as shows, the proposed procedure proved to be better than the other procedures at finding the initial feasible solution.
The time required to complete the search grew exponentially with the scale of the problem.
Furthermore, the search was never completed within the time limit under the No Heuristic procedure, even when the problem scale was small. Thus, it is essential to develop a procedure for quickly deriving a useful feasible solution for the subquery allocation problem.
The proposed method was able to find optimal solutions for a greater quantity of the problems within the time limit than the EDN approach. Also, on average, the proposed method was superior from the perspective of times to find the optimal solution and complete the search. It should be noted that the values for and are the means of problems seeking optimal solutions within the time limit. For example, under , , and , for the proposed method s, while for EDN s. However, the proposed method successfully completed 3 problems, while EDN successfully completed only 1 problem. The given values for and for EDN represent those from that single problem, while the values from the proposed method are the means taken over the three problems.
Thus, the proposed method is superior to both EDN and the No Heuristic procedure in terms of both initial feasible solution and calculation time.
Let us turn to some observations about the relationship between problem characteristics and solution methods. First, when the transmission cost is dominant (), the searches are completed more quickly than for other problems of similar scale. On the other hand, when the subquery processing cost or decryption cost is dominant ( or ), the time required for search completion increases with the scale of the problem. This seems to be because transmission cost depends greatly on server selection.
We observed changes in the calculation time when only one of the parameters , or was changed (lower part of Table 1).
It takes more time to derive an initial feasible solution when the number of servers is increased; on the other hand, the time to identify an initial feasible solution remained within 1 s when either the number of fragments or subqueries was increased. The value of was raised relatively high by expanding the number of servers. Future studies must address the issue of developing procedures for identifying good feasible solutions at high speed in systems incorporating large numbers of servers.
5. Conclusions
It is essential to optimize query processing, as this has a great influence over database performance. This study provides a formulation of the subquery allocation problem in a secret sharing distributed database in terms of a constraint satisfaction problem. A heuristic evaluation function is proposed for solutions in best-first searches.
An experiment was conducted using an ILOG Solver to examine the effectiveness of the above function. Some issues remain for the proposed procedure when there are a large number of servers, and further research will be necessary. It will also be necessary to evaluate the performance of this method on an actual system.
Acknowledgment
The authors express their deep appreciation for the Grant-in-Aid for Scientific Research no. 19700060, which supported part of this study.