An optimal state-information-based control policy for a distributed database system subject to server
failures is considered. Fault-tolerance is made possible by the partitioned architecture of the system and
data redundancy therein. Control actions include restoration of lost data sets in a single server using
redundant data sets in the remaining servers, routing of queries to intact servers, or overhaul of the
entire system for renewal. Control policies are determined by solving Markov decision problems with
cost criteria that penalize system unavailability and slow query response. Steady-state system availability
and expected query response time of the controlled database are evaluated with the Markov model of the
database. Robustness is addressed by introducing additional states into the database model to account
for control action delays and decision errors. A robust control policy is solved for the Markov decision
problem described by the augmented state model.
1. Introduction
A database, as described in [1], is a shared collection of related data and the description of this data, designed to
meet the information needs of a client. A recent study by Wu et al. [2] on a distributed
database system, as shown in Figure 1, revealed the benefits of a conscientious design of redundant architecture and
the application of state information-based control. Such benefits were quantified in terms of mean time to system
failure, steady-state availability, expected response time, and service overhead. The database system was viewed as a
queuing network [3, 4] and mathematically modeled as a Markov chain [5]. The control authorities considered
included the ability to restore the lost data sets in a single server and the ability to route service requests.
In order to obtain an analytic model of manageable size for scrutinizing the effects of control, the
queuing network was restricted to the closed type with a query population of three. In addition, all the
event lifetime distributions were assumed to be exponential. A simulation study conducted by Metzler [6] using Arena [7, 8] with the above restrictions removed supported the conclusions in
[2].
Figure 1: A queuing network representation of a partitioned database system with three servers.
The first objective of this paper is to provide justification that the control policy applied in the aforementioned
study [2] is optimal in a well defined sense. To that end, a Markov decision problem [9, 10] is formulated and the
solution that minimizes a total expected discounted cost is sought. For the purpose of illustration, a simple problem
that disregards the query states is set up, for which the policy developed in [2] is confirmed to be
optimal.
In reality, however, it is not practical to monitor every state variable in a network. As a result, knowledge on a
certain set of states is inferred based on the observables. On the other hand, a control action, in response to a state
transition such as an occurrence of a server failure, must wait until a process of diagnosing the failure state [11] is
complete. The time required for diagnosis is assumed to be a random variable and the outcome of the diagnosis
usually has some degree of uncertainty as well. If servers must communicate through wireless channels, the
likelihood of an erroneous decision and a delayed action is drastically increased. Recognizing that the assumption
of instantaneous accessibility of the state information in the database system could lead to overly
optimistic conclusions on system performance, Wu et al. [12] took a further step to
analyze the effects of control action delays and decision errors for the same database system. Their
analysis concluded that delays and errors can significantly degrade the performance of the database
system.
Therefore, the second objective of the paper is to seek a robust control policy that mitigates the effects of
such control action delays and decision errors. A robust solution obviously has a strong dependence
on how uncertainties are modeled. This paper establishes an uncertain database model following the
basic principles presented in [12]. The new model also captures the effect of routing delays of queries
from a failed server to remote intact servers. A new Markov decision problem is then formulated and
solved. Due to the increased dimension of the problem, approximate solutions are sought via numerical
means.
This paper presents a novel model of a replicated data store wherein a set of information is partitioned and each
partition is stored on multiple servers. This work is motivated by the recognition of the need for greatly enhanced
availability of information management systems in air operations [13]. It addresses the desirability of
hardware replication and state-information-controlled restoration, whereas published works in the
field of distributed database and replication have discussed specific protocols and software failures
[14].
The paper is organized as follows: Section 2 describes the baseline model of the controlled database system
shown in Figure 1; Section 3 formulates and solves a Markov decision problem that justifies the control policy
applied to the baseline model; Section 4 presents an approach to modeling control action delays and decision errors;
Section 5 formulates and solves, using dynamic programming, a Markov decision problem with an
uncertain model containing delays and errors, and analyzes the robustly controlled system in terms
of system availability and query response time in the presence of control action delays and decision
errors.
2. Baseline Model and Notation for a Controlled Database System
The description of a baseline model for a replicated data store follows to a large extent that of Wu et al. [2]. In particular, a system of three servers is studied, each storing two partitions out of a total of three.
Each partition has one “primary” server and one “secondary” server.
The distributed database system in Figure 1 contains three servers in parallel to answer three classes of queries, for which the relevant information can be found in three partitioned data sets of the database, respectively. Each server contains the data set corresponding to one query class as its primary set and a reproduction of another data set as its secondary set. Alternate secondary data sets are reproduced in order to automate restoration of failed servers
within the database. The failure of a server implies the loss of two sets of data within the server. A system level
failure is declared when two servers fail, in which case one set of data is completely lost. Each server is preceded by its own queue. All queues are of sufficient capacity in the baseline model. Service is provided on a first-come-first-served
(FCFS) basis at each server.
The three delay elements, each with the same average delay, imply that there are always three queries present in the system at any given time. A new query is generated at a delay element, at the reciprocal of the average delay, upon the completion of the service to a query at one of the servers. The delay elements are also
intended to reflect the response time to the querying customers by other service nodes in
the system that are not explicitly modeled. Any new query is assumed to visit each of the three servers with a specified likelihood.
The use of a queuing network model for the database is based on its suitability to involve control
actions and to capture their effects on the system performance. The model is built in this study with
the premise that event life distributions have been established for the process of query generation, the process of service completion, the process of server failure, the process of data restoration, and the process of system overhaul, by which the failed database system is repaired. All such processes are independent. Standard statistical
methods that involve data collection, parameter estimation, and goodness of fit tests exist [15] for
identifying event life distributions. Alternative distributions and the goodness of these assumptions
were investigated in [6]. Since all event lives are assumed to be exponentially distributed, the
database system can be conveniently modeled as a Markov chain specified by a state space, an initial state probability
mass function (pmf), and a set of state transition rates.
2.1. Model Specification
State Space
A state name is coded with a 6-digit number indicative of all queue lengths and server
states in the system. With some abuse of notation, a valid state representation consists of the three queue lengths and the three server states, with the total queue
length limited by the three entities available in the closed-queue system. The server states are further
defined as follows: “2” means that data are lost in both the primary and the secondary sets in a server; “1” means that the data
in the primary set have been restored and the data in the secondary set have not been restored; and “0” means that data in both the primary set and the secondary set in a server are intact. A server is said to
be in the down state if it is at either state “1” or state “2.” For example, a state may indicate that one server is up with one customer
in its queue, a second server is down with both sets of data lost and one customer in its queue, and the third server is up and idle. Note that the queue length includes the customer being served. The total number of states in the baseline system is reduced to 147 when all the states representing system level failures are aggregated into seven states memorizing
the possible queue length distributions and exploiting the symmetry of the three servers. A set of alternative state names is also
assigned to the states of the reduced model. Although the symmetry of the system allows further reduction of the number of states, the 147-state
model is retained for clarity of presentation.
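The size of the state space can be cross-checked by brute-force enumeration. The sketch below is illustrative only and rests on three assumptions that are not spelled out in this form in the text: the three queue lengths are nonnegative integers summing to at most three, each server state takes a value in {0, 1, 2}, and the system is operational as long as at most one server is down. The function and variable names are hypothetical.

```python
from itertools import product

POPULATION = 3               # queries circulating in the closed network
SERVER_STATES = (0, 1, 2)    # 0 = intact, 1 = primary restored only, 2 = both sets lost

def enumerate_states():
    """Return (all_states, operational_states) as lists of 6-tuples."""
    # Assumption: queue lengths are nonnegative and sum to at most POPULATION.
    queues = [q for q in product(range(POPULATION + 1), repeat=3)
              if sum(q) <= POPULATION]
    servers = list(product(SERVER_STATES, repeat=3))
    all_states = [q + s for q in queues for s in servers]
    # Assumption: the system is operational while at most one server is down.
    operational = [x for x in all_states
                   if sum(1 for s in x[3:] if s != 0) <= 1]
    return all_states, operational

all_states, operational = enumerate_states()
AGGREGATED_FAILURE_STATES = 7   # number of aggregated failure states stated in the text
reduced_size = len(operational) + AGGREGATED_FAILURE_STATES
print(len(all_states), len(operational), reduced_size)
```

Under these assumptions the operational states plus the seven aggregated failure states total 147, which agrees with the size of the baseline model quoted in Section 4.1.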
Initial State PMF
It is assumed that the database system starts operation from state , that is, the initial state probability is given by vector
Set of State Transition Functions
Events that trigger the transitions, each with its own transition rate, are
given as follows. A newly generated query enters one of the servers at the query generation rate. A query is answered at a server
at the service rate. A complete data loss
occurs at a server at the failure rate. Data in the primary data set of a server are either restored at the primary restoration rate or repaired at the
overhaul rate. Data in the secondary data set of the server are restored at the secondary restoration rate,
following the restoration of the primary data set. The failed database system is always renewed at the overhaul rate.
Let denote the random
state variable at time .
The set of state transition functions is given by
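The continuous-time Markov chain can be solved from the forward Chapman-Kolmogorov equation [5, 10],
in which the time derivative of the state probability vector equals the product of that vector and a matrix called the infinitesimal generator, or rate transition matrix, whose entry in a given row and column is the rate associated with the transition from the current state to the next state. (See [2] for the complete rate
transition matrix.) The control variable on which this matrix depends will be defined shortly. For fixed transition rates, the state probability mass function at any time
is computed as the product of the initial state probability vector and the matrix exponential of the rate transition matrix scaled by the elapsed time.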
At this point, a baseline Markov model for the database system of Figure 1 has been established. Since transition rate matrix is dependent on control actions,
the state transition functions are being controlled, as are the state probabilities.
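As a numerical illustration of how state probabilities evolve under a rate transition matrix, the following sketch integrates the forward equation, taken here in the standard row-vector form d(pi)/dt = pi Q, with a plain forward Euler scheme. The integration scheme, the two-state up/down chain, and the rate values are assumptions chosen for brevity and are not taken from the database model; the closed-form expression used for comparison is the standard result for a two-state chain.

```python
import math

def transient_pmf(pi0, Q, t, steps=20000):
    """Integrate d(pi)/dt = pi Q with forward Euler from time 0 to t."""
    n = len(pi0)
    h = t / steps
    pi = list(pi0)
    for _ in range(steps):
        # Net probability flow into each state j at the current instant.
        flow = [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]
        pi = [pi[j] + h * flow[j] for j in range(n)]
    return pi

# Hypothetical two-state chain: state 0 = up, state 1 = down.
fail_rate, repair_rate = 0.05, 1.0
Q = [[-fail_rate, fail_rate],
     [repair_rate, -repair_rate]]
t_end = 2.0
pi_t = transient_pmf([1.0, 0.0], Q, t_end)

# Closed-form probability of being up at time t_end, starting from up.
total = fail_rate + repair_rate
exact_up = repair_rate / total + fail_rate / total * math.exp(-total * t_end)
print(pi_t[0], exact_up)
```

Because each row of the rate matrix sums to zero, the Euler update preserves the total probability mass, which makes the scheme a convenient sanity check before a larger model is solved with a matrix exponential.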
2.2. Control Policy
Our intention is to eliminate all single point failures. Our approach is to base the control actions on the state
information, which effectively alters the transition rates when loss of data occurs in a single server. The
possible set of control actions includes restoration, overhaul, and no decision needed. There is one
admissible set of control actions at each state. A state of no decision needed has an empty admissible
set.
Taking into consideration the symmetry of the model, the control policy considered for this study is summarized
as follows:
The presence of control in the transition rate matrix is seen via and The values of represent specific control actions
associated with data restoration
or system overhaul
respectively. Previously in [2], system overhaul is considered only at state through .
2.3. Performance Measures
Two of the four performance measures defined in [2] are reintroduced: steady-state availability and expected
response time .
These will be used later to validate the control policies that are derived under cost criteria intended to improve
both and .
Steady-State Availability
Suppose as soon as the database system reaches a system level failure, an overhaul process starts.
Suppose, with a rate , the system is repaired, and at the completion of the repair, the system immediately starts to operate again. In this
case, the Markov chain becomes irreducible, and a unique steady-state distribution exists [5, 10]. The steady-state
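availability, which can be roughly thought of as the fraction of time the database system is up, is computed in [2]
as one minus the sum of the system level failure state probabilities. These probabilities are determined by solving the steady-state balance equations of the chain together with the normalization condition that all state probabilities sum to one.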
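For a chain small enough to solve by hand, the availability calculation can be sketched as follows. The three-state chain used here (all servers intact, one server down, system failure) ignores the queue states and the secondary restoration stage, and the rate values and names are hypothetical; it is meant only to show how the steady-state failure probability enters the availability.

```python
def steady_state_availability(fail, restore, overhaul):
    """Availability of a hypothetical 3-state chain.

    States: 0 = all intact, 1 = one server down, F = system failure.
    Assumed transitions: 0 -> 1 at 3*fail, 1 -> 0 at restore,
    1 -> F at 2*fail, F -> 0 at overhaul.
    """
    # Solve the balance equations relative to an unnormalized weight w0 = 1.
    w0 = 1.0
    w1 = w0 * 3 * fail / (restore + 2 * fail)   # flow balance at state 1
    wF = w1 * 2 * fail / overhaul               # flow balance at state F
    total = w0 + w1 + wF                        # normalization
    pi = (w0 / total, w1 / total, wF / total)
    return 1.0 - pi[2], pi

availability, pi = steady_state_availability(fail=0.01, restore=0.5, overhaul=0.1)
faster, _ = steady_state_availability(fail=0.01, restore=5.0, overhaul=0.1)
print(availability, faster)
```

In this simplified chain a faster restoration rate lowers the probability of reaching the failure state and therefore raises the availability, which is the qualitative effect the restoration policy is intended to have.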
Expected Query Response Time
Query response time is the amount of time elapsing from the instant a
query enters a queue until it completes service [10]. With server failures, the average response time is calculated as
the expectation of the ratio of total amount of time that all queries spend waiting for service in queue, plus their service times
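to the number of queries that are serviced. Consider, again, the irreducible chain model of the system in Figure 1. Associate with each transition from one state to another an indicator function that equals one when the transition represents a query arrival and zero otherwise, and associate with each state the total number of queries in queue at that state.
Then the total expected number of queries in queue at steady state is the sum over all states of the number of queries in queue weighted by the steady-state probability of the state,
and the arrival rate at steady state is the sum over all arrival transitions of the transition rate weighted by the steady-state probability of the originating state.
The response time at steady state then follows from Little’s Law [4, 10] as the ratio of the expected number of queries in queue to the arrival rate.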
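The response time calculation can be illustrated on a much smaller closed system. The sketch below assumes a single failure-free server fed by three delay elements, so that the number of queries at the server forms a birth-death chain; the rates and names are hypothetical and the model is far simpler than the three-server network of Figure 1.

```python
POPULATION = 3
gen_rate = 0.5       # hypothetical per-delay-element query generation rate
service_rate = 2.0   # hypothetical service rate

# Birth-death chain on n = number of queries at the server (0..POPULATION).
# Detailed balance: pi[n+1] * service_rate = pi[n] * (POPULATION - n) * gen_rate.
weights = [1.0]
for n in range(POPULATION):
    birth = (POPULATION - n) * gen_rate
    weights.append(weights[-1] * birth / service_rate)
norm = sum(weights)
pi = [w / norm for w in weights]

# Expected number in queue (including the query in service).
mean_in_queue = sum(n * p for n, p in enumerate(pi))
# Steady-state arrival rate: arrival transitions weighted by state probability.
arrival_rate = sum((POPULATION - n) * gen_rate * p for n, p in enumerate(pi))
# Little's Law: response time = expected number in queue / arrival rate.
response_time = mean_in_queue / arrival_rate
print(response_time)
```

At steady state the arrival rate computed this way equals the server throughput, which offers a quick consistency check on the steady-state distribution.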
3. Restoration as Solution to Markov Decision Problem
Intuition suggests that by restoring the lost data sets in a single failed server, overhaul can be avoided, and therefore, the stationary
control policy given in (5) ought to render service more available. However, the restoration process occupies one of the remaining
servers, and therefore, may prolong the average response time of the system to queries. This section formulates and
solves a Markov decision problem (MDP) for the database system to justify the optimality of the restoration policy
used in [2].
The Markov decision problem considered in this paper assumes that a cost
is incurred at every state transition, depending on the state entered and on the control action selected from a set of admissible actions [9, 10]. The solution amounts to determining a stationary
policy that minimizes the expected total discounted cost, that is, the expected sum of the one-step costs with the cost of each successive step weighted by an increasing power of a discount factor between zero and one.
To simplify the presentation, the state information representing service demand (the query states) is ignored for the moment. In this
case, the inherent symmetry of the database system leads to a very simple 4-state Markov model as shown in Figure 2. As a result, the finite population assumption can be relaxed, that is, the closed queuing network of Figure 1 can
either remain closed or can be revised to an open queuing network. In addition, query handling in the event of a
server failure becomes completely unrestricted. Two different methods of query handling are to be examined in this
section.
(1) Each arriving query is equally likely to seek information in any one of the three data sets, but only the primary data
set is available for query service in each server, and the secondary data set is there to restore data in a failed server.
(2) Upon a server failure, queries are rerouted to the two remaining servers, where the secondary data sets also
participate in query service, though only one of the two intact servers can provide service to only two of the three
classes of queries during restoration.
The distinction between these two cases is captured in the transition probabilities and in the transition costs.
Fault-tolerant control policies are now developed for the two cases.
Figure 2: Markov chain model of the database reflecting only the server states.
3.1. Secondary Data Set Reserved for Lost Data Restoration
This subsection derives the optimal control policy with the first method for handling queries: each arriving query
is equally likely to seek information in any one of the three data sets, but only the primary data set is
available for query service in each server, and the secondary data set is there to restore data in a failed
server.
Figure 2 shows a discrete time Markov chain model for this case. This model is
obtained by the application of a uniformization procedure [10] with a uniform rate that is
greater than the total outgoing transition rate at any state of the original continuous time Markov process. All
parameters in Figure 2 have been defined earlier.
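The uniformization step can be sketched in a few lines. The code below uses the standard construction P = I + Q / gamma, where gamma is a uniform rate no smaller than the largest total outgoing rate; the three-state rate matrix and its rate values are hypothetical and are not the chain of Figure 2.

```python
def uniformize(Q, gamma=None):
    """Return (P, gamma) with P = I + Q / gamma for a rate matrix Q."""
    n = len(Q)
    # Total outgoing rate at state i is the negative of the diagonal entry.
    max_out = max(-Q[i][i] for i in range(n))
    if gamma is None:
        gamma = 1.1 * max_out   # assumed safety margin above the largest rate
    if gamma < max_out:
        raise ValueError("uniform rate must be at least the largest outgoing rate")
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / gamma for j in range(n)]
         for i in range(n)]
    return P, gamma

# Hypothetical rates: server failure, data restoration, system overhaul.
fail, restore, overhaul = 0.01, 0.5, 0.1
Q = [[-3 * fail, 3 * fail, 0.0],
     [restore, -(restore + 2 * fail), 2 * fail],
     [overhaul, 0.0, -overhaul]]
P, gamma = uniformize(Q)
for row in P:
    print(row)
```

Each row of the resulting matrix is a probability distribution, and the self-loop probability on the diagonal accounts for the time spent in a state between uniformized steps.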
A fault-tolerant control policy essentially determines whether to occupy one of the two working servers to restore
the data in the failed server or to overhaul the entire system at the state of one server failure. It is determined by
how the designer penalizes a control action at any given state. Table 1 specifies the one step cost at each
state.
Table 1: One step cost
Consider the random state
variable at each step of the discrete time
Markov chain. The control action at each step
indicates the system's decision to restore a failed server, to overhaul the system, or not to act. Each entry in Table 1 is the cost incurred
when a control action is taken
based on the current state. It has been shown
that under the condition for all and all that belongs to some
finite admissible sets ,
the minimal cost satisfies the following optimality equation [9, 10]:
where have been marked in
Figure 2. In addition, policy is
optimal if and only if it yields for all .
The four optimality equations can be expressed explicitly based on (11):
The above equations are solved for
for , using Mathematica [16].
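A small numerical counterpart to the symbolic solution is value iteration on the discounted optimality equation, assumed here to take the standard form J(i) = min over u of [c(i, u) + alpha * sum over j of p_ij(u) J(j)]. The three-state problem below, its costs, and its transition probabilities are hypothetical stand-ins and do not reproduce Table 1 or Figure 2; value iteration is also a substitute for the closed-form solution obtained with Mathematica.

```python
DISCOUNT = 0.95   # hypothetical discount factor, must be less than 1

# MDP[state][action] = (one_step_cost, {next_state: probability})
# Hypothetical states: 0 = all intact, 1 = one server down, 2 = system failure.
MDP = {
    0: {"none": (0.0, {0: 0.97, 1: 0.03})},
    1: {"restore": (1.0, {0: 0.30, 1: 0.68, 2: 0.02}),
        "overhaul": (3.0, {0: 0.10, 1: 0.90})},
    2: {"overhaul": (3.0, {0: 0.10, 2: 0.90})},
}

def q_value(J, cost, probs):
    """One-step cost plus discounted expected cost-to-go."""
    return cost + DISCOUNT * sum(p * J[j] for j, p in probs.items())

def value_iteration(mdp, tol=1e-10, max_iter=100000):
    """Iterate the optimality equation until successive values agree."""
    J = {s: 0.0 for s in mdp}
    for _ in range(max_iter):
        J_new = {s: min(q_value(J, c, p) for c, p in acts.values())
                 for s, acts in mdp.items()}
        done = max(abs(J_new[s] - J[s]) for s in mdp) < tol
        J = J_new
        if done:
            break
    # Greedy policy with respect to the converged cost-to-go.
    policy = {s: min(acts, key=lambda a: q_value(J, *acts[a]))
              for s, acts in mdp.items()}
    return J, policy

J_opt, policy_opt = value_iteration(MDP)
print(policy_opt)
```

With these illustrative numbers restoration is preferred at the single-failure state; changing the ratio of the overhaul cost to the restoration cost shifts that choice, which is the kind of trade-off the optimal-policy graph describes.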
Figure 3 is created with
and It can be seen that,
when the ratio of to is above the blue curve, (restoration) is optimal at all
states, whereas (overhaul) is
optimal when is below the red
curve. Between the two curves, is optimal, for transition from state “2” to state “0” implies restoration of primary data set, which cannot occur with control
action
Therefore, the mid-region optimal policy does not take place in the operation of the database system.
Figure 3: Optimal policy on the
graph.
Note that in [2], which lies above
the blue curve in Figure 3 for any
Therefore, the always-restore policy implemented in [2] is optimal under the cost structure defined in Table 1.
3.2. Secondary Data Set Available for Both Query Service and Data Restoration
This subsection considers the second method of query handling upon a server failure: overhaul can only occur at
state “1,” which implies that queries of the failed server are rerouted to the two remaining servers where the
secondary data sets also participate in query service though only one of the two intact servers can provide service to
only two of the three classes of queries during restoration.
The uniformized Markov chain model is shown in Figure 4. In this case, overhaul is held until a second server fails, and all classes of queries rely on the service of the two operating servers in the meantime.
Figure 4: Markov chain model of the database where overhaul does not occur until a second server failure.
Figures 5(a) and 5(b) compare the optimal cost-to-go's of the two methods of query handling as functions of
restoration rate at fixed and Different
line types specify different control actions. In Figure 5(b), for example, no control action is taken at state “0” unless where
restoration takes place; the system is always overhauled at state “1;” no control action is taken at state “2” unless where restoration
takes place; and no control action is ever taken at state “3.” It is seen that control policy change occurs at a higher ratio of with the second method
(policy change at in Figure 5(b)) than that with the first method (policy change at in
Figure 5(a)). Despite the slight favor toward overhaul, the optimality of the “always-restore”
policy applied in [2] still holds with the second method at the nominal parameter values and where
Figure 5: Minimum cost-to-go versus restoration rate for the 4-state model with cost criteria of Table
1.
4. Augmented Model Including Control Delays and Decision Errors
This section establishes a full-state model to include the effects of decision errors and control action delays upon
entering a state of a single server failure. The first two subsections follow [12] that treated these separately as the
effect of decision errors when a control action is taken incorrectly but immediately upon entering a state, and
the effect of delayed control actions when a correct control action is taken but after some time delay.
There are deterministically diagnosable systems for which the only cost of diagnosis is time [11]. The
third subsection presents a new model to be used in robust control policy design that combines the
two augmented models and also introduces delays due to rerouting queries from a failed server to intact
servers.
4.1. Modeling the Effect of Erroneous Decisions
The control action considered in this study is state information based. Upon entering a state, for instance, , any
information deficiency can result in uncertainty in decision making as to whether to take a control action or what
control actions to take. In this case, every decision carries a risk [17].
A decision error in the database system could include the possibility that upon
a server failure, the wrong server is identified as being failed. More specifically,
for instance, has failed. However,
is mistakenly observed as the failed server. Based on the false information, the control action would be for
to restore data set
in
, whereas
would be expected to continue to work. As a consequence, none of the servers can process queries for a period of
time, and the database system is said to have entered an intermittent error state. It is assumed that, from this state,
transitions representing service completion cannot occur. Figure 6 depicts a generic representation of such a
case.
Figure 6: Decision error modeling with an intermittent error state.
Without loss of generality, let be a state that is entered upon the loss of both data sets in a server. Let
be the state entered upon the completion of primary data set restoration associated with the data loss. Let
through
be the states representing completion of services at other
servers. Let
be the state entered upon the arrival of a new query in one of the queues. ( are not shown explicitly in Figure 6.) Let
through
be the states entered upon data loss at other m servers. An intermittent state is introduced, as shown in Figure 6, to allow the representation of imperfect decision making upon entering .
Therefore, there is an intermittent error state for each state that involves outgoing transitions with weakened control
authorities due to some decision errors. In the database system of Figure 1, 60 states are added to
the original 147 states of baseline model. It is assumed that once the primary data set restoration
takes place for a particular server, the secondary data set restoration proceeds without a decision
error.
Let
denote the transition rate from state
to state
in the absence of decision error in the restoration of the primary database associated with the most recent data loss.
Let be the probability of successful restoration, given that the event of restoration occurs.
then is referred to as the thinning [5] of the Poisson arrival process associated with the restoration. The split of rate
into rate
and rate
is sometimes also called a decomposition of a Poisson arrival process into type 1 with probability and type 2 with probability .
An imperfect decision corresponds to the value of
being less than unity. As a consequence, the authority of control that is supposed to reinforce the restoration process is weakened. The smaller the value of , the weaker the control authority is.
The rate of recovery from decision error is denoted by
. To state the fact that recovery from an intermittent error state to restoration cannot be faster than the error-free
restoration process,
is
enforced. On the other hand, the outgoing transition rates from the intermittent error state to the states of data loss in other
servers, that is, from to
are bounded below by the corresponding rates going from
to
. These transitions further reduce
the likelihood of reaching state .
It is now shown that decision errors always degrade the performance in terms of the state transition probability
which is the probability
that restoration to state occurs given current state .
It turns out that this probability is readily obtained for a Markov chain
where
without decision error, in which case in (14), and
with decision error, in which case .
Note that (15) and (16) are the same, and both enter (14). Therefore, (14) is proportional to
, and is largest at
when there is no decision error.
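The proportionality argument can be checked numerically. The sketch below assumes that the one-step probability of moving to the restored state is the thinned restoration rate divided by the total outgoing rate, and that thinning leaves the total outgoing rate unchanged because the two thinned rates add up to the original restoration rate. The rate values and names are hypothetical.

```python
def restoration_probability(p, restore_rate, other_rates):
    """Probability that the next transition out of the data-loss state
    is a successful primary restoration, given success probability p."""
    # Thinning splits restore_rate into p * restore_rate (to the restored
    # state) and (1 - p) * restore_rate (to the intermittent error state),
    # so the total outgoing rate does not depend on p.
    total_out = restore_rate + sum(other_rates)
    to_restored = p * restore_rate
    return to_restored / total_out

# Hypothetical competing rates: service completion and a further data loss.
other = [2.0, 0.02]
probs = [restoration_probability(p, 0.5, other) for p in (1.0, 0.9, 0.5)]
print(probs)
```

The probability scales linearly with the success probability and is largest when that probability equals one, in line with the conclusion drawn above.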
4.2. Modeling the Effect of Delayed Control Actions
Time required for diagnosis [11] can be regarded as the universal cause of a control action delay. An example of the
control action delay in the database system shown in Figure 1 would be that a total loss of data in a server is not
immediately observed. As a result, the action of data restoration is delayed.
As in the previous subsection, let be a state that is entered upon a total loss of data in a server. Let
be the
state entered upon the completion of primary database restoration associated with the data loss. States through and
states through also follow the earlier definitions. Figure 7 depicts a proposed model capable of describing
a delayed restoration action by an exponentially distributed random amount with average units of time upon
entering state
With a single-stage delay for each state entered upon a total loss of data in a server, another 60 states are added to
the baseline model.
Figure 7: Control action delay modeling with a single-stage delay state.
In a more general case, there can be an -phased
delay implemented in the augmented model by inserting states through in series between
states and . Each state retains outgoing
transitions to all through
and through in addition to
transition to The total amount of delay before restoration action is bounded below by random variable with a
generalized Erlang distribution [5];
One may use an -stage Erlang to approach a constant delay, an -state
hyperexponential to approach a highly uncertain delay, or a mixture of the two to acquire more general properties
[10] in its distribution.
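The way a multi-stage delay approaches a constant delay can be seen from its coefficient of variation. The sketch below assumes an ordinary Erlang delay whose stages all share the same rate, which is a special case of the generalized Erlang distribution mentioned above, and holds the mean delay fixed while the number of stages grows. All names and numerical values are illustrative.

```python
import math
import random

def erlang_delay_stats(n_stages, mean_delay):
    """Mean, variance, and coefficient of variation of an n-stage Erlang
    delay with identical stage rates (an assumption of this sketch)."""
    stage_rate = n_stages / mean_delay
    mean = n_stages / stage_rate
    var = n_stages / stage_rate ** 2
    return mean, var, math.sqrt(var) / mean

def sample_delay(n_stages, mean_delay, rng):
    """Draw one delay as a sum of independent exponential stage times."""
    stage_rate = n_stages / mean_delay
    return sum(rng.expovariate(stage_rate) for _ in range(n_stages))

for n in (1, 4, 16):
    print(n, erlang_delay_stats(n, mean_delay=2.0))

rng = random.Random(0)
samples = [sample_delay(4, 2.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

The coefficient of variation falls as one over the square root of the number of stages, so a single stage reproduces the exponential delay of Figure 7 while many stages approximate a nearly deterministic delay.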
Note that there are two significant differences between the decision error model of Figure 6 and the control delay
model of Figure 7. First, the link to restoration of the primary database is present in Figure 6 with a smaller likelihood
of transition, whereas the link to restoration without delay is absent in Figure 7. In addition, all links to service
completion are absent in Figure 6, but are present in Figure 7. Therefore, each case has its distinct
nature.
4.3. Full-State Model of the Controlled Database System
Referring again to the closed queuing network view of the distributed database system in Figure 1, this section
presents its augmented model that incorporates all three sources of uncertainties: decision errors (Section 4.1),
control action delays (Section 4.2), and routing delays. Routing delays are incurred when queries at a failed server
are rerouted to the remaining intact servers.
Rerouting of queries becomes desirable when the queries observe a server failure after they have entered the
queue preceding the server. An exponentially distributed random routing time is introduced with rate for
this purpose. A routing delay is assumed independent of a control action delay. The former captures the random
time of transmission of queries among servers, whereas the latter captures the random time of diagnosis. Model
augmentation for routing delays amounts to adding new transitions among existing states without the need for new
states.
In order to establish a full state model with all uncertainty types, the representation of the composite state variable is
modified to
where and as in the baseline model described in Section 2; newly introduced uncertainty variable with “1” =
control delayed and “2” = wrong decision made. This results in a 267-state (147+60+60) model. By exploiting symmetry, the 267-state
model can be reduced to a 96-state model. The three binary control variables indicate, respectively, the decisions to restore, to overhaul,
and to reroute queries.
The states, the transitions, and the transition rates of the uncertain model are summarized in Figure 8, based on which
transition matrix of a Markov chain can be built and used in the next section for robust control policy design. in Figure 8
is the newly introduced query transmission rate when the action for rerouting is called for. Error probability relates
to in Figure 6 through
Subscript “” denotes “primary” and “” denotes “secondary.” Use of symmetry is reflected in server state and arrival
rates and
Figure 8: Transitions and transition rates of the uncertain database state model.
5. Robust Control Policy Design
This section seeks robust control policies as solutions to the Markov decision problem:
where is the control policy sought, to restore, to overhaul, and to reroute queries, as defined in Section 4.3. Note that the full-state model enables the designer to consider service
demand and to weigh availability against response time. Thus two cost criteria are established. The first criterion,
penalizes long queues that cannot effectively reduce in time due to server failure, and thus favors response time.
The second criterion, shown in the following table, penalizes prolonged service time, again, due to server failure, and
thus favors availability.
The size of the state space suggests numerical means for solutions. Mathematical programs will be applied to
obtain the solutions. The steady-state availability and the expected query response time of the controlled database
system with the optimal policy will then be examined under various conditions.
5.1. Optimal Policy Design via Mathematical Programming
The rate transition matrix of the 96-state model can be obtained based on Figure 8 established in Section 4.3. This depends
on State
probability equation
originated from the forward Chapman-Kolmogorov (2) can now be uniformized to yield a discrete time
Markov chain
where uniform rate can be chosen to be
Recall optimality equation (11)
as an alternative characterization of the solution to Markov decision problem (18), which produces a system of 96
equations.
Dynamic programming is the most natural numerical approach to policy design (18) because (11) is derived by taking the limit of a finite horizon dynamic program [9, 10]
where and terminal cost In this case the optimal cost is given by More specifically, with taking values in a finite set,
the minimal cost-to-go from of the 96-state Markov decision process satisfies
where is the minimal cost-to-go from of an -step finite horizon process.
The solution to a dynamic program results from an iterative calculation backwards along the horizon, from the final step to the first
step. For the dynamic programming calculation to converge to the true cost-to-go, the horizon length must be sufficiently
large, and the discount factor must be less than 1.
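The role of the horizon length and of the discount factor can be illustrated with a backward recursion on a toy problem. The sketch below assumes zero terminal cost and nonnegative one-step costs; the two-state, two-action costs and transition probabilities are hypothetical and bear no relation to the 96-state model.

```python
ALPHA = 0.9   # hypothetical discount factor, less than 1

# COST[i][u] and PROB[i][u][j] for a hypothetical 2-state, 2-action problem.
COST = [[0.0, 2.0], [5.0, 3.0]]
PROB = [[[0.9, 0.1], [1.0, 0.0]],
        [[0.0, 1.0], [0.6, 0.4]]]

def backward_dp(horizon):
    """Minimal discounted cost-to-go over `horizon` steps, zero terminal cost."""
    J = [0.0, 0.0]                    # terminal cost
    for _ in range(horizon):          # step backwards from the end of the horizon
        J = [min(COST[i][u] + ALPHA * sum(PROB[i][u][j] * J[j] for j in range(2))
                 for u in range(2))
             for i in range(2)]
    return J

J_short = backward_dp(10)
J_long = backward_dp(200)
J_longer = backward_dp(400)
print(J_short, J_long)
```

With nonnegative costs the finite-horizon cost-to-go grows monotonically with the horizon, and its distance from the infinite-horizon value shrinks geometrically in the discount factor, so a short horizon underestimates the cost while a long one is numerically indistinguishable from the limit.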
Linear programming [18] can be considered as an alternative numerical approach to the solution of the Markov decision
problem. In this case, the set of optimality equations is turned into a set of affine constraints on the set of optimization
variables and the problem can be formally stated as follows:
The equivalence of the linear program formulation (26)–(28) and the optimality equation formulation can be established as follows. First, (27) is trivially satisfied for all states in both formulations because the one-step cost is always nonnegative.
Suppose a set of costs-to-go is the linear program solution. Then, for each state, at least one of the affine inequality constraints (28) with that state's cost-to-go on the left side must be active (equality achieved). Suppose that, for some state, none of these constraints is active. Then the cost-to-go of that state can be increased until one of them becomes active without violating the rest of the inequality constraints, because its coefficient on the right side of the inequality constraints (28) is nonnegative and less than 1. This, however, contradicts the assumption that the solution is maximum. Therefore, the linear program solution is also the solution to the optimality equations.
Assume now that a set of costs-to-go satisfies the optimality equations. It then automatically satisfies the inequality constraints (28), of which at least one is active for each cost-to-go appearing on the left side. Suppose this set is not maximum. Then the cost-to-go of at least one state is smaller than the corresponding cost in the maximum solution, which implies that the corresponding constraint(s) for that state is (are) slack, or inactive. This contradicts the assumption that the set satisfies the optimality equations. Therefore, it must also be the solution of the linear program formulation (26)–(28). The equivalence is thus established.
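The function linprog in MATLAB's Optimization Toolbox [19] is used to solve the maximization problem above, which requires negating the objective because linprog minimizes. The active constraints are then checked with a MATLAB script to determine the optimal control policy: in each state, an action whose constraint holds with equality is optimal.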
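The active-constraint check can be sketched in Python on a toy problem. This is an illustration under stated assumptions, not the authors' MATLAB script: the names `q_value`, `solve_costs`, and `active_actions`, the two-state model, and the discount factor are hypothetical, and the costs-to-go are obtained here by repeated sweeps rather than by a linear programming solver.

```python
def q_value(V, P, c, alpha, i, u):
    """Right side of the inequality constraint for state i and action u."""
    return c[i][u] + alpha * sum(P[u][i][j] * V[j] for j in range(len(V)))


def solve_costs(P, c, alpha, sweeps=2000):
    """Stand-in for the solver: iterate the optimality equations to a fixed point."""
    V = [0.0] * len(c)
    for _ in range(sweeps):
        V = [min(q_value(V, P, c, alpha, i, u) for u in range(len(P)))
             for i in range(len(c))]
    return V


def active_actions(V, P, c, alpha, tol=1e-8):
    """For each state, list the actions whose constraint holds with equality."""
    active = []
    for i in range(len(V)):
        slacks = [q_value(V, P, c, alpha, i, u) - V[i] for u in range(len(P))]
        if min(slacks) < -tol:
            raise ValueError("V violates an inequality constraint")
        active.append([u for u, s in enumerate(slacks) if s <= tol])
    return active


# Hypothetical model: state 0 = intact, state 1 = one server failed;
# action 0 = restore, action 1 = overhaul. All numbers are made up.
P = [
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.9, 0.1], [0.95, 0.05]],
]
c = [[0.0, 0.0], [1.0, 3.0]]
alpha = 0.9
V = solve_costs(P, c, alpha)
active = active_actions(V, P, c, alpha)
```

In this toy model both actions are active in state 0 because they behave identically there, while only restoration is active in state 1, so restoration is the optimal action in the failed state.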
The computational complexity of the dynamic program and that of the linear program are now compared. The computation time needed to solve a linear program is generally governed by the number of optimization variables and the number of constraints [18]. The computational complexity of an iterative dynamic programming solution can be approximated by assuming that each iteration is a series of linear programs. The linear programming solution to the set of optimality equations is, of course, a single linear program.
The number of control variables, the number of states, and the horizon length are critical to the computation time of these methods. First, consider the iterative method as a series of linear programs. Each iteration along the finite horizon consists of a set of individual linear programs, each with its own variables and constraints. Therefore, the computation time is proportional to the horizon length multiplied by the time needed for the linear programs of one iteration. Now consider the method of solving the optimality equations through linear programming. The single linear program has one variable per state and one inequality constraint of the form (28) per state-action pair, so its computation time is set by the number of states and the number of control variables alone and does not depend on a horizon length.
Although the computation time grows faster for the linear program as the number of states increases, the horizon length is typically much larger than the number of states when the discount factor is small. Therefore, the linear program is more efficient for moderate numbers of states and small discount factors.
5.2. Availability and Response Time under Robust Control Policy
A selected set of results on the robust control policies solved via mathematical programming are presented in this
subsection, and the system availability and query response time under some of the optimal policies are examined.
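As a minimal sketch of how a steady-state availability figure can be read off a Markov model, the fragment below computes the stationary distribution of a small discrete-time chain and sums it over the states in which the system is available. The function name `stationary_distribution`, the 3-state chain, its probabilities, and the choice of "up" states are assumptions made for illustration; they are not taken from the 96-state model.

```python
def stationary_distribution(P, tol=1e-13, max_steps=100000):
    """Power iteration for pi = pi P on an irreducible, aperiodic chain."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_steps):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]  # guard against round-off drift
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical chain under a fixed policy:
# state 0 = all servers intact, state 1 = one server failed (still serving
# queries through redundant data), state 2 = system down for overhaul.
P = [
    [0.97, 0.03, 0.00],
    [0.50, 0.45, 0.05],
    [0.20, 0.00, 0.80],
]
up_states = (0, 1)

pi = stationary_distribution(P)
availability = sum(pi[s] for s in up_states)
```

For this toy chain the balance equations can be solved by hand and give an availability of 232/235, about 0.987, which the iteration reproduces.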
5.2.1. Restoration-Overhaul Switching
Under the cost criterion (19) (minimum total discounted queue size), the optimal policy depends on the number of
queries in the queue behind the failed server. No action is taken to overhaul the system until the two active queues
are empty and the buildup of queries behind the failed server is significant. Figure 9(a) depicts the switching curves of the control policy between overhaul and restoration before (solid) and after (dotted) the indicated state in Figure 9(b) is reached. Policy switching is determined by the amount of control action delay, the decision error probability, and the number of queries in the failed server. It can be seen that, while the two active queues are occupied or after the primary data is successfully restored, restoration is performed on the failed server as long as the server performing the restoration does not have any queries waiting in its queue.
Figure 9: (a) Switching curves of the optimal policy under discounted queue size. (b) Partial database model
containing both control action delay and decision error.
Under the cost criterion stated in Table 2 (minimally reduced service time), the optimal policy always attempts to
restore the failed server as long as the server performing the restoration does not have any queries waiting in its
queue. The only exception is when three queries are piled into any single queue. In this case, overhaul occurs when
the uncertainties are significant.
Table 2: Discounted service rate with service demand consideration.
5.2.2. Performance under Nominal and Robust Policies, and Effect of Routing Delay
This subsection examines the system steady-state availability and the expected query response time under the robust policy, where random control delay and decision error are explicitly modeled, and under the nominal policy, where uncertainties are ignored. The results are similar for policies derived with either the queue size criterion (19) or the service time criterion (Table 2). The robust policy shows two distinct features in Figures 10(a) and 10(b): it switches control action when the uncertainties (delay and error) become significant, and it balances availability against response time in this situation.
Figure 10: Response time (upper panel) and availability (lower panel) resulting from the robust control policy (solid black) and the nominal control policy (dotted red) versus (a) decision error and (b) control delay.
The routing-only policy does not attempt to restore the single failed server. Instead, queries are routed to an empty queue whenever the subsequent server contains the data for the query. The system is overhauled upon a second server failure. This policy offers some advantage in response time over the always-restore policy when there is no routing delay, as shown in Figure 11(a). It is also seen that the robust optimal policy performs better when it has rerouting authority. However, a routing delay of about one second is significant enough to discourage the use of the routing-only policy, as shown in Figure 11(b).
Figure 11: Response time (upper panel) and availability (lower panel) resulting from the robust control policy (solid black) and the routing-only policy (dotted red) versus (a) control delay and (b) routing delay.
6. Conclusions
Uncertainties due to control delays, transmission delays, and decision errors in the distributed database system degrade its performance in terms of availability and response time. Restoration remains the optimal policy over a significant range of uncertainties. Beyond the boundaries of this range, however, the optimal control policy switches to overhaul. The robustness of the control policies is investigated by formulating and solving a Markov decision problem. Boundaries at which the optimal action changes are shown to exist and are quantified. The robust policies are shown to provide the best compromise among competing interests.
The authors have also investigated the optimal control policy for the database under the open queuing network
setting in the face of delays and errors. Simulations with SimEvents [20] show that response time further depends on
the arrival rate of queries. Simulation results will be reported separately. A simulation study of larger networks is also planned.
Acknowledgment
This work was supported in part by AFOSR under Grants FA9550-06-0456 and FA9550-06-10249.