Journal of Control Science and Engineering

Volume 2008 (2008), Article ID 310652, 13 pages

http://dx.doi.org/10.1155/2008/310652

## Fault-Tolerant Control of a Distributed Database System

^{1}Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902-6000, USA

^{2}US Air Force Research Laboratories at Rome Research Site, Rome, NY 13441-4505, USA

Received 31 December 2006; Accepted 11 September 2007

Academic Editor: Kemin Zhou

Copyright © 2008 N. Eva Wu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

An optimal state-information-based control policy for a distributed database system subject to server failures is considered. Fault tolerance is made possible by the partitioned architecture of the system and the data redundancy therein. Control actions include restoration of lost data sets in a single server using redundant data sets in the remaining servers, routing of queries to intact servers, and overhaul of the entire system for renewal. Control policies are determined by solving Markov decision problems with cost criteria that penalize system unavailability and slow query response. Steady-state system availability and expected query response time of the controlled database are evaluated with the Markov model of the database. Robustness is addressed by introducing additional states into the database model to account for control action delays and decision errors. A robust control policy is obtained by solving the Markov decision problem described by the augmented state model.

#### 1. Introduction

A database, as described in [1], is a shared collection of related data and the description of this data, designed to meet the information needs of a client. A recent study by Wu et al. [2] on a distributed database system, as shown in Figure 1, revealed the benefits of a conscientious design of redundant architecture and the application of state information-based control. Such benefits were quantified in terms of mean time to system failure, steady-state availability, expected response time, and service overhead. The database system was viewed as a queuing network [3, 4] and mathematically modeled as a Markov chain [5]. The control authorities considered included the ability to restore the lost data sets in a single server and the ability to route service requests. In order to obtain an analytic model of manageable size for scrutinizing the effects of control, the queuing network was restricted to the closed type with a query population of three. In addition, all the event lifetime distributions were assumed to be exponential. A simulation study conducted by Metzler [6] using Arena [7, 8] with the above restrictions removed supported the conclusions in [2].

The first objective of this paper is to justify that the control policy applied in the aforementioned study [2] is optimal in a well-defined sense. To that end, a Markov decision problem [9, 10] is formulated and the solution that minimizes a total expected discounted cost is sought. For the purpose of illustration, a simple problem that disregards the query states is set up, for which the policy developed in [2] is confirmed to be optimal.

In reality, however, it is not practical to monitor every state variable in a network. As a result, knowledge of a certain set of states is inferred from the observables. In addition, a control action, in response to a state transition such as an occurrence of a server failure, must wait until a process of diagnosing the failure state [11] is complete. The time required for diagnosis is assumed to be a random variable, and the outcome of the diagnosis usually has some degree of uncertainty as well. If servers must communicate through wireless channels, the likelihood of an erroneous decision and a delayed action is drastically increased. Recognizing that the assumption of instantaneous accessibility of the state information in the database system could lead to overly optimistic conclusions on system performance, Wu et al. [12] took a further step to analyze the effects of control action delays and decision errors for the same database system. Their analysis concluded that delays and errors can significantly degrade the performance of the database system.

Therefore, the second objective of the paper is to seek a robust control policy that mitigates the effects of such control action delays and decision errors. A robust solution obviously has a strong dependence on how uncertainties are modeled. This paper establishes an uncertain database model following the basic principles presented in [12]. The new model also captures the effect of routing delays of queries from a failed server to remote intact servers. A new Markov decision problem is then formulated and solved. Due to the increased dimension of the problem, approximate solutions are sought via numerical means.

This paper presents a novel model of a replicated data store wherein a set of information is partitioned and each partition is stored on multiple servers. This work is motivated by the recognition of the need for greatly enhanced availability of information management systems in air operations [13]. It addresses the desirability of hardware replication and state-information-controlled restoration, whereas published works in the field of distributed database and replication have discussed specific protocols and software failures [14].

The paper is organized as follows: Section 2 describes the baseline model of the controlled database system shown in Figure 1; Section 3 formulates and solves a Markov decision problem that justifies the control policy applied to the baseline model; Section 4 presents an approach to modeling control action delays and decision errors; Section 5 formulates and solves, using dynamic programming, a Markov decision problem with an uncertain model containing delays and errors, and analyzes the robustly controlled system in terms of system availability and query response time in the presence of control action delays and decision errors.

#### 2. Baseline Model and Notation for a Controlled Database System

The description of a baseline model for a replicated data store follows to a large extent that of Wu et al. [2]. In particular, a system of three servers is studied, each storing two partitions out of a total of three. Each partition has one “primary” server and one “secondary” server.

The distributed database system in Figure 1 contains three servers in parallel to answer three classes, , and , of queries for which relevant information can be found in the partitioned data sets, , and , of the database, respectively. Server would contain the data set corresponding to class as the primary set and a reproduction of data set as the secondary set. Alternate secondary data sets are reproduced in order to automate restoration of failed servers within the database. The failure of a server implies the loss of two sets of data within the server. A system level failure is declared when two servers fail, in which case one set of data is completely lost. The queues preceding servers , , and are named , , and , respectively. All queues are of sufficient capacity in the baseline model. Service is provided on a first-come-first-served (FCFS) basis at each server.

The three delay elements of average delay imply that there are always three queries present in the system at any given time. A new query is generated at a delay element with rate upon the completion of the service to a query at one of the servers. The delay elements are also intended to be reflective of the response time to the querying customers by other service nodes in the system that are not explicitly modeled. Any new query is assumed to have a likelihood of to visit server , where can be , , or .

The use of a queuing network model for the database is based on its suitability to involve control actions and to capture their effects on the system performance. The model is built in this study with the premise that event life distributions have been established for the process of query generation, the process of service completion, the process of server failure, the process of data restoration, and the process of system overhaul when the failed database system is repaired. All such processes are independent. Standard statistical methods that involve data collection, parameter estimation, and goodness-of-fit tests exist [15] for identifying event life distributions. Alternative distributions and the goodness of these assumptions were investigated in [6]. Since all event lives are assumed to be exponentially distributed, the database system can be conveniently modeled as a Markov chain specified by a state space, an initial state probability mass function (pmf), and a set of state transition rates.

##### 2.1. Model Specification

*State Space *

A state name is coded with a 6-digit number indicative of all queue lengths and server
states in the system. With some abuse of notation, a valid state representation is given by
, where queue
length , , with total
length limited by the three entities available in the closed-queue system. The server states , , are further
defined as “2” data are lost in both the primary and the secondary sets in a server, “1” the data
in the primary set have been restored and data in the secondary set have not been restored, and “0” data in both the primary set and the secondary set in a server are intact. A server is said to
be in the down state if it is in either state “1” or state “2.” For example, state indicates that server is up with one customer
in its queue, server is down with both sets of data lost and one customer in its queue, and server is up and idle. Note that the queue length includes the customer being served. There are valid states in the baseline system. The total number of states is reduced to when all the states representing system level failures are aggregated into seven states memorizing
the possible queue length distributions and exploiting the symmetry of the three servers. A set of alternative state names are
assigned from with mapped to and the aggregated system
failure state mapped to . Although the symmetry of the system allows further reduction on the number of states to , the -state
model is retained for clarity of presentation.
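
The state counts quoted later in the paper (147 baseline states, and 60 states entered upon a total data loss in one server) can be cross-checked by brute-force enumeration. The sketch below is only an illustration under stated assumptions: queue lengths are nonnegative integers whose total does not exceed the population of three, a system-level failure means that two or more servers are in a down state, and the failure states are aggregated into seven states as described above. All function and variable names are hypothetical.

```python
from itertools import product

QUERIES = 3                    # closed-network population (from the text)
SERVERS = 3
AGGREGATED_FAILURE_STATES = 7  # number of aggregated failure states (from the text)

def queue_distributions(population=QUERIES, servers=SERVERS):
    """All queue-length tuples whose total does not exceed the population."""
    return [q for q in product(range(population + 1), repeat=servers)
            if sum(q) <= population]

def operational_states():
    """States with at most one server down (server state 1 or 2)."""
    states = []
    for q in queue_distributions():
        for s in product((0, 1, 2), repeat=SERVERS):
            if sum(1 for x in s if x != 0) <= 1:
                states.append(q + s)   # 6-digit state: queue lengths, then server states
    return states

ops = operational_states()
baseline_size = len(ops) + AGGREGATED_FAILURE_STATES
# States entered upon a total data loss: exactly one server in state 2.
total_loss_states = [st for st in ops if st[SERVERS:].count(2) == 1]

print(len(queue_distributions()), len(ops), baseline_size, len(total_loss_states))
```

Under these assumptions the enumeration gives 20 queue-length distributions and 7 operational server configurations, which is consistent with the 147-state baseline model and with the 60 states that are later augmented in Section 4.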

*Initial State PMF *

It is assumed that the database system starts operation from state , that is, the initial state probability is given by vector

*Set of State Transition Functions *

Events that trigger the transitions and the corresponding transition rates are
given as follows. A newly generated query enters one of the servers with rate A query is answered at a server
with rate A complete data loss
occurs at a server with rate Data in the primary data set of a server are restored with rate or repaired with
overhaul rate Data in the secondary data set of the server are restored with rate
following the restoration of the primary data set. The failed database system is always renewed with overhaul rate

Let denote the random
state variable at time .
The set of state transition functions is given by
The continuous-time Markov chain can be solved from the forward Chapman-Kolmogorov equation [5, 10]

and is called an infinitesimal generator or a rate transition matrix whose entry is given by the rate associated with the transition from current state to next state . (See [2] for the complete rate
transition.) Control variable will be defined shortly. State probability mass function at time ,

is computed by

At this point, a baseline Markov model for the database system of Figure 1 has been established. Since transition rate matrix is dependent on control actions,
the state transition functions are being controlled, as are the state probabilities.
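
To make the role of the forward Chapman-Kolmogorov equation concrete, the following minimal sketch integrates it numerically for a hypothetical two-state (up/down) chain and compares the result with the known closed-form solution for that chain. The notation is assumed, not taken from the paper: `pi` is the state pmf written as a row vector, `Q` is the rate transition matrix, and the equation is integrated in the form d(pi)/dt = pi Q by forward Euler steps. The rates are invented for illustration.

```python
import math

def step_forward(pi, Q, dt):
    """One forward-Euler step of d(pi)/dt = pi * Q (row vector times matrix)."""
    n = len(pi)
    return [pi[j] + dt * sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]

def transient_pmf(pi0, Q, t, steps=20000):
    """Approximate the state pmf at time t, starting from pmf pi0."""
    pi = list(pi0)
    dt = t / steps
    for _ in range(steps):
        pi = step_forward(pi, Q, dt)
    return pi

# Hypothetical rates: failure rate 0.1, repair rate 1.0 (not from the paper).
fail, repair = 0.1, 1.0
Q = [[-fail, fail],
     [repair, -repair]]
t_end = 2.0
pi_t = transient_pmf([1.0, 0.0], Q, t_end)

# Closed-form probability of being up at t_end for the two-state chain.
total = fail + repair
exact_up = repair / total + fail / total * math.exp(-total * t_end)
print(pi_t[0], exact_up)
```

For a model of the size considered here, a matrix exponential or uniformization routine would be preferable to Euler steps; the sketch only shows how the generator drives the state probabilities.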

##### 2.2. Control Policy

Our intention is to eliminate all single points of failure. Our approach is to base the control actions on the state information; these actions effectively alter the transition rates when loss of data occurs in a single server. The possible set of control actions includes restoration, overhaul, and no decision needed. There is one admissible set of control actions at each state. A state of no decision needed has an empty admissible set.

Taking into consideration the symmetry of the model, the control policy considered for this study is summarized as follows:

The presence of control in the transition rate matrix is seen via and The values of represent specific control actions associated with data restoration or system overhaul, respectively. Previously in [2], system overhaul was considered only at state through .

##### 2.3. Performance Measures

Two of the four performance measures defined in [2] are reintroduced: steady-state availability and expected response time . These will be used later to validate the control policies that are derived under cost criteria intended to improve both and .

*Steady-State Availability*

Suppose as soon as the database system reaches a system level failure, an overhaul process starts.
Suppose, with a rate , the system is repaired, and at the completion of the repair, the system immediately starts to operate again. In this
case, the Markov chain becomes irreducible, and a unique steady-state distribution exists [5, 10]. The steady-state
availability, which can be roughly thought of as the fraction of time the database system is up, is computed in [2]
by

where is
the sum of the system level failure state probabilities determined by solving
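
As an illustration of this steady-state computation, the sketch below solves the balance equations (pmf times rate matrix equals zero, with the pmf summing to one) for a hypothetical three-state aggregation: all servers up, one server down, and system-level failure followed by overhaul. Availability is then one minus the failure-state probability. The three-state chain, the rates, and the names are assumptions made for this example and are not the 147-state model of [2].

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    # Transpose Q so the unknowns form a column vector; augment with a zero column.
    A = [[Q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    # Replace the last (redundant) balance equation by the normalization condition.
    A[-1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Hypothetical rates (not from the paper): per-server failure, restoration, overhaul.
fail, restore, overhaul = 0.01, 1.0, 0.5
Q = [[-3 * fail, 3 * fail, 0.0],                  # all up -> one server down
     [restore, -(restore + 2 * fail), 2 * fail],  # restore, or second failure
     [overhaul, 0.0, -overhaul]]                  # system failure -> renewed
pi = steady_state(Q)
availability = 1.0 - pi[2]
print(pi, availability)
```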

*Expected Query Response Time*

Query response time is the amount of time elapsing from the instant a
query enters a queue until it completes service [10]. With server failures, the average response time is calculated as
the expectation of the ratio of the total amount of time that all queries spend waiting for service in queue plus their service times
to the number of queries that are serviced. Consider, again, the irreducible chain model of the system in Figure 1. Let be the indicator function associated
with transition from state to
state that indicates a query
arrival. Let be the total number
of queries in queue at state
Then the total expected number of queries in queue at the steady state is given by

and the arrival rate at steady-state is

The calculation of the response time at steady-state then follows Little’s Law [4, 10]
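
The three quantities involved (expected number in queue, steady-state arrival rate, and their ratio) can be illustrated on a much smaller closed system. The sketch below assumes a single server fed by three delay elements, which is a birth-death chain whose steady-state pmf has a product form; it is a stand-in chosen for brevity, not the three-server network of Figure 1. The rates and all names are hypothetical.

```python
def closed_single_server(population, think_rate, service_rate):
    """Steady-state pmf of the queue length in a closed single-server loop."""
    weights = [1.0]
    for n in range(population):
        arrival = (population - n) * think_rate  # queries still in delay elements
        weights.append(weights[-1] * arrival / service_rate)
    total = sum(weights)
    return [w / total for w in weights]

def response_time(pmf, population, think_rate):
    """Little's Law: expected response time = mean number in queue / arrival rate."""
    mean_in_queue = sum(n * p for n, p in enumerate(pmf))
    arrival_rate = sum((population - n) * think_rate * p for n, p in enumerate(pmf))
    return mean_in_queue / arrival_rate, mean_in_queue, arrival_rate

pmf = closed_single_server(population=3, think_rate=1.0, service_rate=2.0)
T, N, lam = response_time(pmf, population=3, think_rate=1.0)
print(T, N, lam)
```

A useful sanity check in a closed network is that the queries in queue plus the queries in the delay elements (arrival rate times mean delay) add up to the population.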

#### 3. Restoration as Solution to Markov Decision Problem

Intuition suggests that by restoring the lost data sets in a single failed server, overhaul can be avoided, and therefore, the stationary control policy given in (5) ought to render service more available. However, the restoration process occupies one of the remaining servers, and therefore, may prolong the average response time of the system to queries. This section formulates and solves a Markov decision problem (MDP) for the database system to justify the optimality of the restoration policy used in [2].

The Markov decision problem considered in this paper assumes that a cost is incurred at every state transition, where is the state entered and is a control action selected from a set of admissible actions [9, 10]. The solution amounts to determining a stationary policy that minimizes the following expected total discounted cost:

where is a discount factor.

To simplify the presentation, state information on representing service demand is ignored for the moment. In this case, the inherent symmetry of the database system leads to a very simple 4-state Markov model as shown in Figure 2. As a result, the finite population assumption can be relaxed, that is, the closed queuing network of Figure 1 can either remain closed or can be revised to an open queuing network. In addition, query handling in the event of a server failure becomes completely unrestricted. Two different methods of query handling are to be examined in this section.

(1) Each arriving query has equal likelihood to seek information in data sets , , or , but only the primary data set is available for query service in each server, and the secondary server is there to restore data in a failed server.

(2) Upon a server failure, queries are rerouted to the two remaining servers where the secondary data sets also participate in query service, though only one of the two intact servers can provide service to only two of the three classes of queries during restoration.

The distinction between these two cases is captured in the transition probabilities and in the transition cost. Fault-tolerant control policies are now developed for the two cases.

##### 3.1. Secondary Data Set Reserved for Lost Data Restoration

This subsection derives the optimal control policy with the first method for handling queries: each arriving query has equal likelihood to seek information in data sets , , or , but only the primary data set is available for query service in each server, and the secondary server is there to restore data in a failed server.

Figure 2 shows a discrete-time Markov chain model for this case. This model is obtained by the application of a uniformization procedure [10] with a uniform rate that is greater than the total outgoing transition rate at every state of the original continuous-time Markov process. All parameters in Figure 2 have been defined earlier.
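
The uniformization step can be sketched as follows. The code assumes the usual construction, in which the one-step transition matrix is the identity plus the rate matrix divided by the uniform rate, so that each state acquires a self-loop absorbing the unused rate. The symbol `gamma` for the uniform rate, the three-state example chain, and its rates are assumptions for illustration, not the chain of Figure 2.

```python
def uniformize(Q, gamma):
    """Return P = I + Q / gamma for a rate matrix Q and uniform rate gamma."""
    n = len(Q)
    max_out = max(-Q[i][i] for i in range(n))  # largest total outgoing rate
    if gamma < max_out:
        raise ValueError("uniform rate must be at least the largest total outgoing rate")
    return [[Q[i][j] / gamma + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Hypothetical rate matrix: all up, one server down, system failure.
Q = [[-0.03, 0.03, 0.0],
     [1.0, -1.02, 0.02],
     [0.5, 0.0, -0.5]]
P = uniformize(Q, gamma=2.0)
for row in P:
    print(row)
```

Choosing the uniform rate strictly larger than every total outgoing rate keeps all diagonal entries positive, which makes the resulting discrete-time chain aperiodic.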

A fault-tolerant control policy essentially determines whether to occupy one of the two working servers to restore the data in the failed server or to overhaul the entire system at the state of one server failure. It is determined by how the designer penalizes a control action at any given state. Table 1 specifies the one step cost at each state.

Let denote the random state variable at in the discrete-time Markov chain. Control action indicates the system's decision to restore a failed server, to overhaul the system, or not to act. in Table 1 is the cost incurred when control action is taken based on . It has been shown that under the condition for all and all that belong to some finite admissible set , the minimal cost satisfies the following optimality equation [9, 10]:

where have been marked in Figure 2. In addition, policy is optimal if and only if it yields for all . The four optimality equations can be expressed explicitly based on (11):

The above equations are solved for
for , using *Mathematica* [16].
Figure 3 is created with
and It can be seen that,
when the ratio of to is above the blue curve, (restoration) is optimal at all
states, whereas (overhaul) is
optimal when is below the red
curve. Between the two curves, is optimal, for transition from state “2” to state “0” implies restoration of primary data set, which cannot occur with control
action
Therefore, the mid-region optimal policy does not take place in the operation of the database system.

Note that in [2], which lies above the blue curve in Figure 3 for any Therefore, the always-restore policy implemented in [2] is optimal under the cost structure defined in Table 1.
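
The dependence of the optimal action on the restoration rate can be reproduced qualitatively with a small value-iteration sketch. Everything in it is an assumption made for illustration: the three-state aggregation (all up, one server down, system failure), the one-step costs (0.5 per step while restoring, 1.0 per step while overhauling or failed), the rates, and the discount factor. It is not the four-state model of Figure 2 or the cost structure of Table 1, and it uses successive approximation rather than the symbolic solution obtained with Mathematica.

```python
def build_mdp(restore_rate, fail=0.01, overhaul=0.5, gamma=2.0):
    """Return {state: {action: (one_step_cost, {next_state: probability})}}."""
    def probs(rates, state):
        p = {j: r / gamma for j, r in rates.items()}
        # Self-loop introduced by uniformization with uniform rate gamma.
        p[state] = p.get(state, 0.0) + 1.0 - sum(p.values())
        return p
    return {
        0: {"none": (0.0, probs({1: 3 * fail}, 0))},
        1: {"restore": (0.5, probs({0: restore_rate, 2: 2 * fail}, 1)),
            "overhaul": (1.0, probs({0: overhaul}, 1))},
        2: {"overhaul": (1.0, probs({0: overhaul}, 2))},
    }

def value_iteration(mdp, alpha=0.95, sweeps=2000):
    """Iterate J(i) = min_u [ g(i,u) + alpha * sum_j p_ij(u) J(j) ]."""
    def backup(J, cost, nxt):
        return cost + alpha * sum(p * J[j] for j, p in nxt.items())
    J = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        J = {s: min(backup(J, c, nxt) for c, nxt in acts.values())
             for s, acts in mdp.items()}
    policy = {s: min(acts, key=lambda a: backup(J, *acts[a]))
              for s, acts in mdp.items()}
    return J, policy

J_fast, policy_fast = value_iteration(build_mdp(restore_rate=1.0))
J_slow, policy_slow = value_iteration(build_mdp(restore_rate=0.001))
print(policy_fast[1], policy_slow[1])
```

With these invented numbers, a fast restoration process favors restoring at the single-failure state, whereas a very slow one favors overhaul, which mirrors the threshold behavior shown in Figure 3.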

##### 3.2. Secondary Data Set Available for Both Query Service and Data Restoration

This subsection considers the second method of query handling upon a server failure: overhaul can only occur at state “1,” which implies that queries of the failed server are rerouted to the two remaining servers where the secondary data sets also participate in query service though only one of the two intact servers can provide service to only two of the three classes of queries during restoration.

The uniformized Markov chain model is shown in Figure 4. In this case,

overhaul is held until a second server fails, and all classes of queries rely on the service of the two operating servers in the meantime.

Figures 5(a) and 5(b) compare the optimal cost-to-go's of the two methods of query handling as functions of restoration rate at fixed and Different line types specify different control actions. In Figure 5(b), for example, no control action is taken at state “0” unless where restoration takes place; the system is always overhauled at state “1;” no control action is taken at state “2” unless where restoration takes place; and no control action is ever taken at state “3.” It is seen that control policy change occurs at a higher ratio of with the second method (policy change at in Figure 5(b)) than that with the first method (policy change at in Figure 5(a)). Despite the slight favor toward overhaul, the optimality of the “always-restore” policy applied in [2] still holds with the second method at the nominal parameter values and where

#### 4. Augmented Model Including Control Delays and Decision Errors

This section establishes a full-state model to include the effects of decision errors and control action delays upon entering a state of a single server failure. The first two subsections follow [12], which treated these separately: the effect of decision errors when a control action is taken incorrectly but immediately upon entering a state, and the effect of delayed control actions when a correct control action is taken but after some time delay. There are deterministically diagnosable systems for which the only cost of diagnosis is time [11]. The third subsection presents a new model to be used in robust control policy design that combines the two augmented models and also introduces delays due to rerouting queries from a failed server to intact servers.

##### 4.1. Modeling the Effect of Erroneous Decisions

The control action considered in this study is state information based. Upon entering a state, for instance, , any information deficiency can result in uncertainty in decision making as to whether to take a control action or what control actions to take. In this case, every decision carries a risk [17].

A decision error in the database system could include the possibility that upon a server failure, the wrong server is identified as being failed. More specifically, for instance, has failed. However, is mistakenly observed as the failed server. Based on the false information, the control action would be for to restore data set in , whereas would be expected to continue to work. As a consequence, none of the servers can process queries for a period of time, and the database system is said to have entered an intermittent error state. It is assumed that, from this state, transitions representing service completion cannot occur. Figure 6 depicts a generic representation of such a case.

Without loss of generality, let be a state that is entered upon the loss of both data sets in a server. Let be the state entered upon the completion of primary data set restoration associated with the data loss. Let through be the states representing completion of services at other servers. Let be the state entered upon the arrival of a new query in one of the queues. ( are not shown explicitly in Figure 6.) Let through be the states entered upon data loss at other m servers. An intermittent state is introduced, as shown in Figure 6, to allow the representation of imperfect decision making upon entering . Therefore, there is an intermittent error state for each state that involves outgoing transitions with weakened control authorities due to some decision errors. In the database system of Figure 1, 60 states are added to the original 147 states of baseline model. It is assumed that once the primary data set restoration takes place for a particular server, the secondary data set restoration proceeds without a decision error.

Let denote the transition rate from state to state in the absence of decision error in the restoration of the primary database associated with the most recent data loss. Let be the probability of successful restoration, given that the event of restoration occurs. then is referred to as the thinning [5] of the Poisson arrival process associated with the restoration. The split of rate into rate and rate is sometimes also called a decomposition of a Poisson arrival process into type 1 with probability and type 2 with probability .

An imperfect decision corresponds to the value of being less than unity. As a consequence, the authority of control that is supposed to reinforce the restoration process is weakened. The smaller the value of , the weaker the control authority is.

The rate of recovery from decision error is denoted by . To state the fact that recovery from an intermittent error state to restoration cannot be faster than the error-free restoration process, is enforced. On the other hand, the outgoing transition rates from the intermittent error state to the states of data loss in other servers, that is, from to are bounded below by the corresponding rates going from to . These transitions further reduce the likelihood of reaching state .

It is now shown that decision errors always degrade the performance in terms of the state transition probability which is the probability that restoration to state occurs given current state . It turns out that this probability is readily obtained for a Markov chain

where

without decision error, in which case in (14), and

with decision error, in which case . Note that (15) and (16) are the same, and both enter (14). Therefore, (14) is proportional to , and is largest at when there is no decision error.
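
The proportionality argument can be illustrated numerically. The sketch below assumes the standard property of competing exponential events: the probability that a given transition fires first equals its rate divided by the total outgoing rate. Thinning the restoration rate with a success probability leaves the total outgoing rate unchanged, because the remainder of the rate goes to the intermittent error state. The function name, the symbol for the success probability, and the rates are hypothetical.

```python
def restoration_probability(restore_rate, competing_rates, success_prob=1.0):
    """Probability that the next transition is a successful restoration.

    The restoration rate is split into success_prob * restore_rate (restoration)
    and (1 - success_prob) * restore_rate (intermittent error state), so the
    total outgoing rate does not depend on success_prob.
    """
    if not 0.0 <= success_prob <= 1.0:
        raise ValueError("success_prob must lie in [0, 1]")
    total_rate = restore_rate + sum(competing_rates)
    return success_prob * restore_rate / total_rate

# Hypothetical competing rates: a second server failure and a query arrival.
competing = [0.02, 0.5]
perfect = restoration_probability(1.0, competing, success_prob=1.0)
degraded = restoration_probability(1.0, competing, success_prob=0.8)
print(perfect, degraded)
```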

##### 4.2. Modeling the Effect of Delayed Control Actions

Time required for diagnosis [11] can be regarded as the universal cause of a control action delay. An example of the control action delay in the database system shown in Figure 1 would be that a total loss of data in a server is not immediately observed. As a result, the action of data restoration is delayed.

As in the previous subsection, let be a state that is entered upon a total loss of data in a server. Let be the state entered upon the completion of primary database restoration associated with the data loss. States through and states through also follow the earlier definitions. Figure 7 depicts a proposed model capable of describing a delayed restoration action by an exponentially distributed random amount with average units of time upon entering state With a single-stage delay for each state entered upon a total loss of data in a server, another 60 states are added to the baseline model.

In a more general case, there can be an -phased delay implemented in the augmented model by inserting states through in series between states and . Each state retains outgoing transitions to all through and through in addition to transition to The total amount of delay before restoration action is bounded below by random variable with a generalized Erlang distribution [5];

One may use an -stage Erlang to approach a constant delay, an -state hyperexponential to approach a highly uncertain delay, or a mixture of the two to acquire more general properties [10] in its distribution.
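
The remark that a multi-stage Erlang delay approaches a constant delay can be checked with a short calculation. The sketch assumes identical stage rates (the ordinary Erlang case rather than the generalized one) and a fixed mean total delay, so the coefficient of variation falls as one over the square root of the number of stages. The names and numbers are illustrative only.

```python
import math

def erlang_delay_stats(mean_delay, stages):
    """Mean, variance, and coefficient of variation of an Erlang delay.

    Each of the `stages` exponential phases has rate stages / mean_delay,
    so the mean total delay stays fixed as the number of stages grows.
    """
    stage_rate = stages / mean_delay
    mean = stages / stage_rate
    variance = stages / stage_rate ** 2
    return mean, variance, math.sqrt(variance) / mean

for n in (1, 4, 100):
    print(n, erlang_delay_stats(2.0, n))
```

A single stage reproduces the exponential delay of Figure 7, while a large number of stages concentrates the delay around its mean.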

Note that there are two significant differences between the decision error model of Figure 6 and the control delay
model of Figure 7. First, the link to *restoration of primary database* is present in Figure 6 with a smaller likelihood
of transition, whereas the link to *restoration without delay* is absent in Figure 7. In addition, all links to *service
completion* are absent in Figure 6, but are present in Figure 7. Therefore, each case has its distinct
nature.

##### 4.3. Full-State Model of the Controlled Database System

Referring again to the closed queuing network view of the distributed database system in Figure 1, this section presents its augmented model that incorporates all three sources of uncertainties: decision errors (Section 4.1), control action delays (Section 4.2), and routing delays. Routing delays are incurred when queries at a failed server are rerouted to the remaining intact servers.

Rerouting of queries becomes desirable when the queries observe a server failure after they have entered the queue preceding the server. An exponentially distributed random routing time is introduced with rate for this purpose. A routing delay is assumed independent of a control action delay. The control action delay captures the random time of diagnosis, whereas the routing delay captures the random time of transmission of queries among servers. Model augmentation amounts to adding new transitions among existing states without the need for new states.

In order to establish a full-state model with all uncertainty types, the representation of the composite state variable is modified to where and as in the baseline model described in Section 2; newly introduced uncertainty variable with “1” = control delayed and “2” = wrong decision made. This results in a 267-state model. By exploiting symmetry, the 267 (147+60+60) state model can be reduced to a 96-state model. The binary control variables are defined as follows: to restore, to overhaul, and to reroute queries.

The states, the transitions, and the transition rates of the uncertain model are summarized in Figure 8, based on which transition matrix of a Markov chain can be built and used in the next section for robust control policy design. in Figure 8 is the newly introduced query transmission rate when the action for rerouting is called for. Error probability relates to in Figure 6 through Subscript “” denotes “primary” and “” denotes “secondary.” Use of symmetry is reflected in server state and arrival rates and

#### 5. Robust Control Policy Design

This section seeks robust control policies as solutions to the Markov decision problem:

where is the control policy sought, to restore, to overhaul, and to reroute queries, as defined in Section 4.3. Note that the full-state model enables the designer to consider service demand and to weigh availability against response time. Thus two cost criteria are established. The first criterion,

penalizes long queues that cannot effectively reduce in time due to server failure, and thus favors response time. The second criterion, shown in the following table, penalizes prolonged service time, again, due to server failure, and thus favors availability.

The size of the state space suggests numerical means for solutions. Mathematical programs will be applied to obtain the solutions. The steady-state availability and the expected query response time of the controlled database system with the optimal policy will then be examined under various conditions.

##### 5.1. Optimal Policy Design via Mathematical Programming

The rate transition matrix of the 96-state model can be obtained based on Figure 8 established in Section 4.3. This depends on State probability equation

originated from the forward Chapman-Kolmogorov equation (2) can now be uniformized to yield a discrete-time Markov chain

where uniform rate can be chosen to be

Recall optimality equation (11)

as an alternative characterization of the solution to Markov decision problem (18), which produces a system of 96 equations.

Dynamic programming is the most natural numerical approach to policy design (18) because (11) is derived by taking the limit of a finite horizon dynamic program [9, 10]

where and terminal cost In this case the optimal cost is given by More specifically, with taking values in a finite set, the minimal cost-to-go from of the 96-state Markov decision process satisfies

where is the minimal cost-to-go from of an -step finite horizon process.

The solution to a dynamic program results from an iterative calculation backwards along the horizon from to the first step For the dynamic programming calculation to converge to the true cost-to-go, must be sufficiently large, and must be less than 1.
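
How large the horizon needs to be can be estimated from the standard contraction bound for discounted problems: with a zero terminal cost and one-step costs between zero and some maximum, the gap between the finite-horizon cost-to-go and the infinite-horizon one is at most the discount factor raised to the horizon, times the maximum one-step cost, divided by one minus the discount factor. The sketch below turns that bound into a horizon estimate. The names `alpha`, `max_step_cost`, and `tol`, and the numbers used, are assumptions for illustration.

```python
import math

def error_bound(alpha, max_step_cost, horizon):
    """Upper bound on the truncation error after `horizon` backward steps."""
    return alpha ** horizon * max_step_cost / (1.0 - alpha)

def horizon_for_tolerance(alpha, max_step_cost, tol):
    """Smallest horizon whose truncation error bound does not exceed tol."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("discount factor must lie strictly between 0 and 1")
    if max_step_cost <= 0.0 or tol <= 0.0:
        raise ValueError("max_step_cost and tol must be positive")
    bound_at_zero = max_step_cost / (1.0 - alpha)
    if bound_at_zero <= tol:
        return 0
    return math.ceil(math.log(tol / bound_at_zero) / math.log(alpha))

N = horizon_for_tolerance(alpha=0.95, max_step_cost=1.0, tol=1e-6)
print(N, error_bound(0.95, 1.0, N))
```

The estimate grows quickly as the discount factor approaches 1, which is one way to read the requirement that the horizon be long and the discount factor be strictly less than 1.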

Linear programming [18] can be considered as an alternative numerical approach to the solution of the Markov decision problem. In this case, the set of optimality equations is turned into a set of affine constraints on the set of optimization variables and the problem can be formally stated as follows:

The equivalence of the linear program formulation (26)–(28) and the optimality equation formulation can be easily established. First, (27) is trivially satisfied for all in both formulations because the one-step cost is always nonnegative.

Suppose is the linear program solution. Then there must be one active (equality achieved) constraint for each of the affine inequality constraints of the form for each . Suppose for some the constraint(s) is not active. Then can be increased until one of the inequality constraints becomes active without violating the rest of the inequality constraints because as coefficient of on the right side of the inequality constraints (28). This, however, contradicts the assumption that is maximum. Therefore, is also the solution to the optimality equations (28).

Assume now that satisfies the optimality equations. It then automatically satisfies the inequality constraints (28), of which are active, one for each appearing on the left side. Suppose is not maximum. There is at least a for some that is smaller than the corresponding cost in max which implies that the corresponding constraint(s) for is (are) slack or inactive. This contradicts that satisfies the optimality equation. Therefore, must also be the solution of the linear program formulation (28). The equivalence is thus established.

The function linprog in MATLAB's Optimization Toolbox [19] solves the maximization problem above. The active constraints are checked with a MATLAB script to determine the optimal control policy.
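For readers without MATLAB, the same idea can be sketched with `scipy.optimize.linprog`. The sketch assumes the standard linear program for a discounted Markov decision problem: maximize the sum of the cost-to-go values subject to nonnegativity and to one inequality per state-action pair, with each value bounded by the one-step cost plus the discounted expected value of the next state. This form is an assumption consistent with the description of (26)–(28), not a transcription of those equations. The function name `solve_mdp_lp` and the demo data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def solve_mdp_lp(P, cost, alpha):
    """Solve a discounted MDP as a linear program.

    maximize   sum_i V[i]
    subject to V[i] - alpha * sum_j P[u, i, j] * V[j] <= cost[u, i]  for all u, i
               V[i] >= 0
    The policy is read off from the active (zero-slack) constraints.
    """
    P = np.asarray(P, dtype=float)
    cost = np.asarray(cost, dtype=float)
    n_actions, n_states, _ = P.shape
    # One block of rows per action, stacked in the same order as cost.reshape(-1).
    A_ub = np.vstack([np.eye(n_states) - alpha * P[u] for u in range(n_actions)])
    b_ub = cost.reshape(-1)
    # linprog minimizes, so the objective is negated to maximize sum(V).
    res = linprog(c=-np.ones(n_states), A_ub=A_ub, b_ub=b_ub,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    V = res.x
    slack = (b_ub - A_ub @ V).reshape(n_actions, n_states)
    policy = slack.argmin(axis=0)  # action whose constraint is active
    return V, policy


# Hypothetical data: action 0 ~ "restore", action 1 ~ "overhaul".
P_lp = np.array([
    [[0.9, 0.1], [0.6, 0.4]],
    [[1.0, 0.0], [1.0, 0.0]],
])
cost_lp = np.array([
    [0.0, 2.0],
    [1.0, 3.0],
])
V_lp, policy_lp = solve_mdp_lp(P_lp, cost_lp, alpha=0.9)
print(V_lp, policy_lp)
```

Reading the policy from the smallest slack mirrors the active-constraint check described above; when two actions are tied in a state, either one is optimal.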

The computational complexity of the dynamic program and that of the linear program are now compared. Finding the solution to a linear program generally requires a computation time proportional to [18] when , where is the number of optimization variables, and is the number of constraints. The computational complexity of an iterative dynamic programming solution can be approximated by assuming that each iteration is a series of linear programs. The linear programming solution to the set of optimality equations is of course a single linear program.

The number of control variables, , the number of states, , and the horizon length, , are critical to the computation time of these methods. First, consider the iterative method as a series of linear programs. Each individual iteration along the -step horizon consists of individual linear programs. Each individual linear program has variables and constraints. Therefore, the computation time is proportional to Now, consider the method of solving the optimality equations through linear programming. The single linear program has variables and constraints. Hence, its computation time is proportional to

Although the computation time grows faster for the linear program as the number of states increases, the horizon is typically much larger than for small discount factor in Therefore, the linear program is more efficient for moderate numbers of states and small discount factors.

##### 5.2. Availability and Response Time under Robust Control Policy

Selected results on the robust control policies obtained via mathematical programming are presented in this subsection, and the system availability and query response time under some of the optimal policies are examined.
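As a rough illustration of how a steady-state availability figure can be computed once a policy is fixed, the sketch below solves for the stationary distribution of a continuous-time Markov chain and sums the probability of the states in which the system can serve queries. The definition of availability as the stationary probability of a set of "up" states is an assumption made for this sketch, and the 3-state rate matrix, the state labels, and the function name `stationary_distribution` are hypothetical.

```python
import numpy as np


def stationary_distribution(Q):
    """Solve pi @ Q = 0 with sum(pi) = 1 for an irreducible CTMC rate matrix Q."""
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    # Stack the balance equations with the normalization condition.
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi


# Hypothetical closed-loop rate matrix under some fixed policy:
# state 0 = all servers intact, state 1 = one server failed (still serving),
# state 2 = system down and being overhauled.
Q_cl = np.array([
    [-0.10, 0.10, 0.00],
    [2.00, -2.05, 0.05],
    [0.50, 0.00, -0.50],
])
up_states = [0, 1]  # assumed set of states in which queries can be served

pi_cl = stationary_distribution(Q_cl)
availability = pi_cl[up_states].sum()
print(pi_cl, availability)
```

Comparing this number across the rate matrices induced by different policies gives the kind of availability comparison reported in the figures, although the actual model also tracks queue contents, delays, and decision errors.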

###### 5.2.1. Restoration-Overhaul Switching

Under the cost criterion (19) (minimum total discounted queue size), the optimal policy depends on the number of queries in the queue behind the failed server. No action is taken to overhaul the system until the two active queues are empty and the buildup of queries behind the failed server is significant. Figure 9(a) depicts a switching curve of the control policy between overhaul and restoration before (solid) and after (dotted) state in Figure 9(b) is reached. Policy switching is determined by the amount of control action delay, the decision error probability, and the number of queries in the failed server. It can be seen that, while the two active queues are occupied or after the primary data is successfully restored, restoration is performed on the failed server as long as the server performing the restoration does not have any queries waiting in its queue.

Under the cost criterion stated in Table 2 (minimally reduced service time), the optimal policy always attempts to restore the failed server as long as the server performing the restoration does not have any queries waiting in its queue. The only exception is when three queries are piled into any single queue. In this case, overhaul occurs when the uncertainties are significant.

###### 5.2.2. Performance under Nominal and Robust Policies, and Effect of Routing Delay

This subsection examines the system steady-state availability and the expected query response time under the robust policy, where random control delay and decision error are explicitly modeled, and under the nominal policy, where uncertainties are ignored. The results are similar for policies derived with either the queue size criterion (19) or the service time criterion (Table 2). The robust policy shows two distinct features in Figures 10(a) and 10(b): it switches control action when the uncertainties (delay and error) become significant, and it balances availability against response time in this situation.

The routing-only policy does not attempt to restore the single failed server. Instead, queries are routed to an empty queue whenever the subsequent server contains the data for the query. The system is overhauled upon a second server failure. This policy offers some advantage in response time over the always-restore policy when there is no routing delay, as shown in Figure 11(a). It is also seen that the robust optimal policy performs better when it has rerouting authority. However, a routing delay of about one second is significant enough to discourage the use of the routing-only policy, as shown in Figure 11(b).

#### 6. Conclusions

Uncertainties due to control delays, transmission delays, and decision errors in the distributed database system degrade its performance in terms of availability and response time. Restoration remains the optimal policy over a significant range of uncertainties. Beyond the boundaries of this range, however, the optimal control policy switches to overhaul. By formulating and solving a Markov decision problem, the robustness of the control policies is investigated. Boundaries at which the optimal actions change are shown to exist and are quantified. The robust policies are shown to provide the best compromise among competing interests.

The authors have also investigated the optimal control policy for the database under the open queuing network setting in the face of delays and errors. Simulations with SimEvents [20] show that response time further depends on the arrival rate of queries. Simulation results will be reported separately. A simulation study of larger networks is also planned.

#### Acknowledgment

This work was supported in part by AFOSR under Grants FA9550-06-0456 and FA9550-06-10249.

#### References

1. T. Connolly and C. Begg, *Database Solutions: A Step by Step Guide to Building Databases*, Pearson/Addison Wesley, New York, NY, USA, 2nd edition, 2004.
2. N. E. Wu, J. M. Metzler, and M. H. Linderman, “Supervisory control of a database unit,” in *Proceedings of the 44th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC '05)*, pp. 7615–7620, Seville, Spain, December 2005.
3. L. Kleinrock, *Queueing Systems: Volume 2: Computer Applications*, John Wiley & Sons, New York, NY, USA, 1976.
4. G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi, *Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications*, John Wiley & Sons, New York, NY, USA, 1998.
5. E. P. C. Kao, *An Introduction to Stochastic Processes*, Duxbury Press, New York, NY, USA, 1997.
6. J. M. Metzler, *The effect of supervisory control on a redundant database unit*, M.S. thesis, Binghamton University, Vestal, NY, USA, 2005.
7. Arena, 2004, Academic Version 7.01.00, Rockwell Software.
8. W. D. Kelton, R. P. Sadowski, and D. T. Sturrock, *Simulation with Arena*, McGraw-Hill, New York, NY, USA, 3rd edition, 2004.
9. D. P. Bertsekas, *Dynamic Programming and Optimal Control*, vol. 1, 2, Athena Scientific, Belmont, Mass, USA, 1995.
10. C. G. Cassandras and S. Lafortune, *Introduction to Discrete Event Systems*, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.
11. D. Thorsley and D. Teneketzis, “Diagnosability of stochastic discrete-event systems,” *IEEE Transactions on Automatic Control*, vol. 50, no. 4, pp. 476–492, 2005.
12. N. E. Wu, J. M. Metzler, and M. H. Linderman, “Controlled database unit subject to control delays and decision errors,” in *Proceedings of the 8th International Workshop on Discrete Event Systems (WODES '06)*, pp. 131–136, Michigan, Mich, USA, July 2006.
13. N. E. Wu and T. Busch, “Reconfiguration of C2 architecture for improved availability to support air operations,” *IEEE Transactions on Aerospace and Electronic Systems*, vol. 43, no. 2, pp. 795–805, 2007.
14. A. A. Helal, A. A. Heddaya, and B. K. Bhargava, *Replication Techniques in Distributed Systems*, Kluwer Academic Publishers, Norwell, Mass, USA, 1996.
15. S. Zacks, *Introduction to Reliability Analysis: Probability Models and Statistics Methods*, Springer, New York, NY, USA, 1992.
16. S. Wolfram, *Mathematica 5.2*, Wolfram Media, Champaign, Ill, USA, 3rd edition, 2005.
17. N. E. Wu, “Coverage in fault-tolerant control,” *Automatica*, vol. 40, no. 4, pp. 537–548, 2004.
18. S. Boyd and L. Vandenberghe, *Convex Optimization*, Cambridge University Press, New York, NY, USA, 2004.
19. MathWorks, “Optimization Toolbox User's Guide, for Use with MATLAB, Version 3,” 2006, The MathWorks.
20. MathWorks, “SimEvents User's Guide, for Use with Simulink,” 2006, The MathWorks.