Abstract

An optimal state-information-based control policy for a distributed database system subject to server failures is considered. Fault tolerance is made possible by the partitioned architecture of the system and the data redundancy therein. Control actions include restoration of lost data sets in a single server using redundant data sets in the remaining servers, routing of queries to intact servers, and overhaul of the entire system for renewal. Control policies are determined by solving Markov decision problems with cost criteria that penalize system unavailability and slow query response. Steady-state system availability and expected query response time of the controlled database are evaluated with the Markov model of the database. Robustness is addressed by introducing additional states into the database model to account for control action delays and decision errors. A robust control policy is obtained by solving the Markov decision problem described by the augmented state model.

1. Introduction

A database, as described in [1], is a shared collection of related data and the description of this data, designed to meet the information needs of a client. A recent study by Wu et al. [2] on a distributed database system, as shown in Figure 1, revealed the benefits of a conscientious design of redundant architecture and the application of state information-based control. Such benefits were quantified in terms of mean time to system failure, steady-state availability, expected response time, and service overhead. The database system was viewed as a queuing network [3, 4] and mathematically modeled as a Markov chain [5]. The control authorities considered included the ability to restore the lost data sets in a single server and the ability to route service requests. In order to obtain an analytic model of manageable size for scrutinizing the effects of control, the queuing network was restricted to the closed type with a query population of three. In addition, all the event lifetime distributions were assumed to be exponential. A simulation study conducted by Metzler [6] using Arena [7, 8] with the above restrictions removed supported the conclusions in [2].

The first objective of this paper is to justify that the control policy applied in the aforementioned study [2] is optimal in a well-defined sense. To that end, a Markov decision problem [9, 10] is formulated and the solution that minimizes a total expected discounted cost is sought. For the purpose of illustration, a simple problem that disregards the query states is set up, for which the policy developed in [2] is confirmed to be optimal.

In reality, however, it is not practical to monitor every state variable in a network. As a result, knowledge of a certain set of states is inferred from the observables. Moreover, a control action in response to a state transition, such as the occurrence of a server failure, must wait until the process of diagnosing the failure state [11] is complete. The time required for diagnosis is assumed to be a random variable, and the outcome of the diagnosis usually has some degree of uncertainty as well. If servers must communicate through wireless channels, the likelihood of an erroneous decision and a delayed action is drastically increased. Recognizing that the assumption of instantaneous accessibility of the state information in the database system could lead to overly optimistic conclusions on system performance, Wu et al. [12] took a further step to analyze the effects of control action delays and decision errors for the same database system. Their analysis concluded that delays and errors can significantly degrade the performance of the database system.

Therefore, the second objective of the paper is to seek a robust control policy that mitigates the effects of such control action delays and decision errors. A robust solution obviously has a strong dependence on how uncertainties are modeled. This paper establishes an uncertain database model following the basic principles presented in [12]. The new model also captures the effect of routing delays of queries from a failed server to remote intact servers. A new Markov decision problem is then formulated and solved. Due to the increased dimension of the problem, approximate solutions are sought via numerical means.

This paper presents a novel model of a replicated data store wherein a set of information is partitioned and each partition is stored on multiple servers. This work is motivated by the recognition of the need for greatly enhanced availability of information management systems in air operations [13]. It addresses the desirability of hardware replication and state-information-controlled restoration, whereas published works in the field of distributed database and replication have discussed specific protocols and software failures [14].

The paper is organized as follows: Section 2 describes the baseline model of the controlled database system shown in Figure 1; Section 3 formulates and solves a Markov decision problem that justifies the control policy applied to the baseline model; Section 4 presents an approach to modeling control action delays and decision errors; Section 5 formulates and solves, using dynamic programming, a Markov decision problem with an uncertain model containing delays and errors, and analyzes the robustly controlled system in terms of system availability and query response time in the presence of control action delays and decision errors.

2. Baseline Model and Notation for a Controlled Database System

The description of a baseline model for a replicated data store follows to a large extent that of Wu et al. [2]. In particular, a system of three servers is studied, each storing two partitions out of a total of three. Each partition has one “primary” server and one “secondary” server.

The distributed database system in Figure 1 contains three servers in parallel to answer three classes, ??,??, and ??, of queries for which relevant information can be found in the partitioned data sets, ??,??, and ??, of the database, respectively. Server ?????? would contain the data set corresponding to class ?? as the primary set and a reproduction of data set ?? as the secondary set. Alternate secondary data sets are reproduced in order to automate restoration of failed servers within the database. The failure of a server implies the loss of two sets of data within the server. A system level failure is declared when two servers fail, in which case one set of data is completely lost. The queues preceding servers ??????, ??????, and ?????? are named ??????, ??????, and ??????, respectively. All queues are of sufficient capacity in the baseline model. Service is provided on a first-come-first-served (FCFS) basis at each server.

The three delay elements of average delay 1/?? imply that there are always three queries present in the system at any given time. A new query is generated at a delay element with rate ?? upon the completion of the service to a query at one of the servers. The delay elements are also intended to be reflective of the response time to the querying customers by other service nodes in the system that are not explicitly modeled. Any new query is assumed to have a likelihood of ?????? to visit server ??????, where ???? can be ????, ????, or ????.

The use of a queuing network model for the database is based on its suitability to involve control actions and to capture their effects on the system performance. The model is built in this study with the premise that event life distributions have been established for the process of query generation (exp(??)=1-??-????), the process of service completion (exp(??)), the process of server failure (exp(??)), the process of data restoration (exp(??)), and the process of system overhaul (exp(??)) when the failed database system is repaired. All such processes are independent. Standard statistical methods that involve data collection, parameter estimation, and goodness of fit tests exist [15] for identifying event life distributions. Alternative distributions and goodness of these assumptions were investigated in [6]. Since all event lives are assumed to be exponentially distributed, the database system can be conveniently modeled as a Markov chain specified by a state space ??, an initial state probability mass function (pmf) ????(0), and a set of state transition rates ?.

2.1. Model Specification

State Space ??
A state name is coded with a 6-digit number indicative of all queue lengths and server states in the system. With some abuse of notation, a valid state representation is given by ????????????????????????????????????, where queue lengths ??????, ??????, ?????? ∈ {0,1,2,3} with total length ??=??????+??????+?????? ≤ 3 limited by the three entities available in the closed-queue system. The server states ??????, ??????, ?????? ∈ {0,1,2} are further defined as “2” = data are lost in both the primary and the secondary sets in a server, “1” = the data in the primary set have been restored and data in the secondary set have not been restored, and “0” = data in both the primary set and the secondary set in a server are intact. A server is said to be in the down state if it is at either state “1” or state “2.” For example, state 110020 indicates that server ?????? is up with one customer in its queue, server ?????? is down with both sets of data lost and one customer in its queue, and server ?????? is up and idle. Note that the queue length includes the customer being served. There are 540 valid states in the baseline system (20 admissible queue-length combinations times 27 server-state combinations). The total number of states is reduced to 147 when all the states representing system level failures are aggregated into seven states that record the possible queue length distributions and exploit the symmetry of the three servers, leaving the 140 states with at most one server down plus the seven aggregated states. A set of alternative state names is assigned from ??={1,2,...,147} with 000000 mapped to ??=1 and the aggregated system failure states mapped to ??=141,…,147. Although the symmetry of the system allows further reduction of the number of states to 56, the 147-state model is retained for clarity of presentation.
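The state counts quoted above can be checked by brute-force enumeration. The Python sketch below is illustrative only. It assumes the tuple layout suggested by the 110020 example (three queue lengths followed by three server states), a total queue length anywhere from 0 to 3 (queries held in the delay elements are not queued), and a reading of the seven aggregated failure states as the seven unordered queue-length patterns. The function and variable names are hypothetical and are not taken from [2].

```python
from itertools import product

MAX_QUERIES = 3             # closed network with three queries in total
SERVER_STATES = (0, 1, 2)   # 0 = intact, 1 = primary restored only, 2 = both sets lost


def enumerate_states():
    """Return all (q1, q2, q3, s1, s2, s3) tuples with total queue length <= 3."""
    states = []
    for queues in product(range(MAX_QUERIES + 1), repeat=3):
        if sum(queues) > MAX_QUERIES:
            continue
        for servers in product(SERVER_STATES, repeat=3):
            states.append(queues + servers)
    return states


def is_system_failure(state):
    """A system level failure is declared when two or more servers are down."""
    return sum(1 for s in state[3:] if s != 0) >= 2


all_states = enumerate_states()
operational = [s for s in all_states if not is_system_failure(s)]

# Assumption: one aggregated failure state per unordered queue-length pattern.
failure_patterns = {tuple(sorted(s[:3])) for s in all_states if is_system_failure(s)}

# Operational states in which one server has just lost both data sets.
single_loss = [s for s in operational if 2 in s[3:]]

print(len(all_states))                            # 540
print(len(operational) + len(failure_patterns))   # 140 + 7 = 147
print(len(single_loss))                           # 60
```

Under these assumptions the enumeration reproduces the 540 valid states and the 147-state aggregated model, and it finds 60 states entered upon a total loss of data in one server.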

Initial State PMF {????(0),??=1,2,…,147}
It is assumed that the database system starts operation from state ??=1 (000000), that is, the initial state probability is given by vector ??(0)=[1 0 … 0].

Set of State Transition Functions p??,??(??)
Events that trigger the transitions and the corresponding transition rates are given as follows. A newly generated query enters one of the servers with rate (3-??)×??/3. A query is answered at a server with rate ??. A complete data loss occurs at a server with rate ??. Data in the primary data set of a server are restored with rate ?? or repaired with overhaul rate ??. Data in the secondary data set of the server are restored with rate ??, following the restoration of the primary data set. The failed database system is always renewed with overhaul rate ??.
Let ????? denote the random state variable at time ??. The set of state transition functions is given by (1). The continuous-time Markov chain can be solved from the forward Chapman-Kolmogorov equation (2) [5, 10], in which ??(??(??)) is called an infinitesimal generator, or rate transition matrix, whose (??,??)th entry is the rate associated with the transition from current state ?? to next state ??. (See [2] for the complete rate transition matrix.) Control variable ??(??) will be defined shortly. The state probability mass function at time ??,
??(??)=[??1(??) ??2(??) … ??147(??)], ??≥0, (3)
is computed by (4).
At this point, a baseline Markov model for the database system of Figure 1 has been established. Since transition rate matrix ?? is dependent on control actions, the state transition functions ????,??(??) are being controlled, as are the state probabilities.

2.2. Control Policy

Our intention is to eliminate all single point failures. Our approach is to base the control actions on the state information, which effectively alter the transition rates when loss of data occurs in a single server. The possible set of control actions includes restoration, overhaul, and no decision needed. There is one admissible set of control actions at each state. A state of no decision needed has an empty admissible set.

Taking into consideration the symmetry of the model, the control policy considered for this study is summarized as follows:

?????? = { 0, upon entering the state of one server failure, system overhauls; 1, upon entering the state of one server failure, system restores. (5)
The presence of control in the transition rate matrix is seen via ??(??) and ??(??)=1-??(??). The values of ??(??) represent specific control actions associated with data restoration (??(??)=1) or system overhaul (??(??)=1), respectively. Previously, in [2], system overhaul was considered only at states ??=141 through ??=147.

2.3. Performance Measures

Two of the four performance measures defined in [2] are reintroduced: steady-state availability ??sys and expected response time ??[??]. These will be used later to validate the control policies that are derived under cost criteria intended to improve both ??sys and ??[??].

Steady-State Availability
Suppose that, as soon as the database system reaches a system level failure, an overhaul process starts. Suppose also that the system is repaired with rate ?? and that, at the completion of the repair, it immediately starts to operate again. In this case, the Markov chain becomes irreducible, and a unique steady-state distribution exists [5, 10]. The steady-state availability, which can be roughly thought of as the fraction of time the database system is up, is computed in [2] by
??sys = 1-????(∞), (6)
where ????(∞) is the sum of the system level failure state probabilities determined by solving
??(∞)?? = 0, ∑_{??=1}^{147} ????(∞) = 1. (7)

Expected Query Response Time
Query response time is the amount of time elapsing from the instant a query enters a queue until it completes service [10]. With server failures, the average response time ??[??] is calculated as the expectation of the ratio of the total time that all queries spend waiting in queue plus their service times to the number of queries that are serviced. Consider, again, the irreducible chain model of the system in Figure 1. Let ????,?? be the indicator function associated with transition from state ?? to state ?? that indicates a query arrival. Let ???? be the total number of queries in queue at state ??. Then the total expected number of queries in queue at the steady state is given by
??[??] = ∑_{??=1}^{147} ????(∞)????, (8)
and the arrival rate at steady state is
???? = ∑_{??=1}^{147} ????(∞) ∑_{??=1}^{147} ????????????. (9)
The calculation of the response time at steady state then follows Little’s Law [4, 10]: ??[??]=??????[??].
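The chain of calculations in (7) through (9) can be illustrated on a much smaller closed system. The sketch below is a toy example, not the 147-state model: it assumes a single server fed by three queries, with hypothetical rates LAM (query generation per idle query) and MU (service), and it approximates the steady-state distribution by power iteration on a uniformized chain rather than by solving the linear system directly.

```python
def stationary_distribution(Q, iterations=5000):
    """Approximate pi with pi Q = 0, sum(pi) = 1, by iterating pi <- pi (I + Q/gamma)."""
    n = len(Q)
    gamma = 2.0 * max(-Q[i][i] for i in range(n))   # strictly above every exit rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / gamma for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


LAM, MU, K = 1.0, 12.0, 3     # hypothetical rates; K queries circulate in the system

# State n = number of queries at the server (in queue or in service).
Q = [[0.0] * (K + 1) for _ in range(K + 1)]
for n in range(K + 1):
    if n < K:
        Q[n][n + 1] = (K - n) * LAM   # one of the K - n idle queries arrives
    if n > 0:
        Q[n][n - 1] = MU              # service completion
    Q[n][n] = -sum(Q[n])              # diagonal makes the row sum to zero

pi = stationary_distribution(Q)
mean_in_queue = sum(pi[n] * n for n in range(K + 1))              # analogue of (8)
arrival_rate = sum(pi[n] * (K - n) * LAM for n in range(K + 1))   # analogue of (9)
response_time = mean_in_queue / arrival_rate                      # Little's Law

print(round(response_time, 4))
```

In the full model the same steady-state vector would also give the availability in (6), by summing the probabilities of the aggregated failure states and subtracting from one.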

3. Restoration as Solution to Markov Decision Problem

Intuition suggests that by restoring the lost data sets in a single failed server, overhaul can be avoided, and therefore, the stationary control policy ??(??) given in (5) ought to render service more available. However, the restoration process occupies one of the remaining servers, and therefore, may prolong the average response time of the system to queries. This section formulates and solves a Markov decision problem (MDP) for the database system to justify the optimality of the restoration policy used in [2].

The Markov decision problem considered in this paper assumes that a cost ??(??,??) is incurred at every state transition, where ?? is the state entered and ?? is a control action selected from a set of admissible actions [9, 10]. The solution amounts to determining a stationary policy ??={??0(??0),??1(??1),…} that minimizes the following expected total discounted cost:

????(??0) = ???? ∑_{??=0}^{∞} ??????(????,????), (10) where 0<??<1 is a discount factor.

To simplify the presentation, state information representing service demand is ignored for the moment. In this case, the inherent symmetry of the database system leads to a very simple 4-state Markov model, as shown in Figure 2. As a result, the finite population assumption can be relaxed; that is, the closed queuing network of Figure 1 can either remain closed or be revised to an open queuing network. In addition, query handling in the event of a server failure becomes completely unrestricted. Two different methods of query handling are examined in this section.
(1) Each arriving query has equal likelihood to seek information in data sets ??, ??, or ??, but only the primary data set is available for query service in each server, and the secondary server is there to restore data in a failed server.
(2) Upon a server failure, queries are rerouted to the two remaining servers, where the secondary data sets also participate in query service, though only one of the two intact servers can provide service to only two of the three classes of queries during restoration.
The distinction between these two cases is captured in the transition probabilities and in the transition cost ??(??,??). Fault-tolerant control policies are now developed for the two cases.

3.1. Secondary Data Set Reserved for Lost Data Restoration

This subsection derives the optimal control policy with the first method for handling queries; each arrival query has equal likelihood to seek information in data sets ??, ??, or ??, but only the primary data set is available for query service in each server, and the secondary server is there to restore data in a failed server.

Figure 2 shows a discrete time Markov chain model for this case. This model is obtained by applying a uniformization procedure [10] with a uniform rate ??=3??+??+?? that is greater than the total outgoing transition rate at any state of the original continuous time Markov process. All parameters in Figure 2 have been defined earlier.
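Uniformization replaces a rate transition matrix Q by the one-step probability matrix P = I + Q/γ, where γ is no smaller than the largest total exit rate. The following Python sketch shows the construction on a hypothetical three-state generator. The rate names NU, RHO, and OMEGA, their values, and the toy chain itself are assumptions chosen for illustration and do not reproduce Figure 2.

```python
def uniformize(Q, gamma):
    """Return P = I + Q / gamma for a generator matrix Q whose rows sum to zero."""
    n = len(Q)
    max_exit = max(-Q[i][i] for i in range(n))
    if gamma < max_exit:
        raise ValueError("gamma must be at least the largest total exit rate")
    return [[(1.0 if i == j else 0.0) + Q[i][j] / gamma for j in range(n)]
            for i in range(n)]


# Hypothetical rates: server failure, data restoration, system overhaul.
NU, RHO, OMEGA = 0.01, 0.05, 0.1

# Toy generator: 0 = all servers up, 1 = one server down, 2 = system failed.
Q = [
    [-3 * NU, 3 * NU, 0.0],
    [RHO, -(RHO + 2 * NU), 2 * NU],
    [OMEGA, 0.0, -OMEGA],
]

GAMMA = 3 * NU + RHO + OMEGA   # a sum of rates that dominates every exit rate here
P = uniformize(Q, GAMMA)

for row in P:
    print([round(p, 4) for p in row])
```

Each row of P sums to one, and the self-loop probabilities absorb the slack between γ and the actual exit rate of each state.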

A fault-tolerant control policy essentially determines whether to occupy one of the two working servers to restore the data in the failed server or to overhaul the entire system at the state of one server failure. It is determined by how the designer penalizes a control action at any given state. Table 1 specifies the one step cost at each state.

Let ???? ∈ {1,2,3,4} denote the random state variable at ??=??/?? in the discrete time Markov chain. Control action ??(????)=1 (or 0, or ??) indicates the system's decision to restore a failed server (or to overhaul instead, or not to act). ??(????,????) in Table 1 is the cost incurred when control action ???? is taken based on ????. It has been shown that under the condition 0 ≤ ??(??,??) < ∞ for all ?? and all ?? that belong to a finite admissible set ????, the minimal cost ??*(??) satisfies the following optimality equation [9, 10]:

??(??)=min???????{??(??,??)+?????????,????(??)},(11) where ????,?? have been marked in Figure 2. In addition, policy ??* is optimal if and only if it yields ??*(??) for all ??. The four optimality equations can be expressed explicitly based on (11):

??(0)=min???1??+??3??+??????(0)+????????(3)???????????????????????????????????????????????????????=0,1??+????+??????(0)+??2??????(1)+????????(3)?????????????????????????????????????????????????????????????????????????????=1?;??(1)=min???1??+??3??+??????(1)+????????(3),1??+??3??+??????(1)+????????(3)?;??(2)=min???1??+??2??+??????(2)+????????(3),1??+????????(0)+??2??????(1)+????+??????(2)?;??(3)=min?????3??????(2)+????+??????(3),??3??????(2)+????+??????(3)?.(12)

The above equations are solved for ??*(??), for ??=0,1,2,3, using Mathematica [16]. Figure 3 is created with ??=10?? and ?? ∈ [0,1). It can be seen that, when the ratio of ?? to ?? is above the blue curve, ??=1 (restoration) is optimal at all states, whereas ??=0 (overhaul) is optimal when ??/?? is below the red curve. Between the two curves, {??(2)=??,??(0)=??} is optimal. However, the transition from state “2” to state “0” implies restoration of the primary data set, which cannot occur with control action ??(2)=0. Therefore, the mid-region optimal policy does not take place in the operation of the database system.

Note that ??/??=5 in [2], which lies above the blue curve in Figure 3 for any ?? ∈ [0,1). Therefore, the always-restore policy implemented in [2] is optimal under the cost structure defined in Table 1.
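An optimality equation of the form (11) can also be solved numerically by successive approximation (value iteration). The sketch below applies it to a hypothetical three-state problem in which the only real choice, restore or overhaul, arises after a single server failure. The transition probabilities, costs, and discount factor are invented for illustration and do not reproduce Figure 2 or Table 1, so no conclusion about the policy in [2] should be drawn from its output.

```python
ALPHA = 0.95   # hypothetical discount factor, 0 < ALPHA < 1

# MODEL[state][action] = (one-step cost, {next_state: probability}); all values assumed.
MODEL = {
    0: {"none": (0.0, {0: 0.9, 1: 0.1})},                 # all servers up
    1: {"restore": (1.0, {0: 0.5, 1: 0.4, 2: 0.1}),       # one server failed
        "overhaul": (3.0, {0: 0.9, 1: 0.1})},
    2: {"overhaul": (5.0, {0: 0.8, 2: 0.2})},             # system failed
}


def bellman_backup(V, state):
    """Return (minimal one-step lookahead cost, minimizing action) at `state`."""
    best = None
    for action, (cost, transitions) in MODEL[state].items():
        value = cost + ALPHA * sum(p * V[j] for j, p in transitions.items())
        if best is None or value < best[0]:
            best = (value, action)
    return best


def value_iteration(tol=1e-10, max_iter=100000):
    """Iterate V <- T V from V = 0 until successive iterates differ by less than tol."""
    V = {s: 0.0 for s in MODEL}
    for _ in range(max_iter):
        V_new = {s: bellman_backup(V, s)[0] for s in MODEL}
        if max(abs(V_new[s] - V[s]) for s in MODEL) < tol:
            return V_new
        V = V_new
    return V


V_star = value_iteration()
policy = {s: bellman_backup(V_star, s)[1] for s in MODEL}
print(policy)
```

Sweeping one of the assumed parameters and recording where the minimizing action at state 1 switches would trace out a policy boundary analogous in spirit to the curves in Figure 3.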

3.2. Secondary Data Set Available for Both Query Service and Data Restoration

This subsection considers the second method of query handling upon a server failure: overhaul can only occur at state “1,” which implies that queries of the failed server are rerouted to the two remaining servers where the secondary data sets also participate in query service though only one of the two intact servers can provide service to only two of the three classes of queries during restoration.

The uniformized Markov chain model is shown in Figure 4. In this case,

?????? = { 0, upon entering the state of one server failure, system awaits; 1, upon entering the state of one server failure, system restores, (13)
overhaul is held until a second server fails, and all classes of queries rely on the service of the two operating servers in the meantime.

Figures 5(a) and 5(b) compare the optimal costs-to-go of the two methods of query handling as functions of restoration rate ?? at fixed ??=10?? and ??=0.001. Different line types specify different control actions. In Figure 5(b), for example, no control action is taken at state “0” unless ?? ≥ 1.8??, where restoration takes place; the system is always overhauled at state “1”; no control action is taken at state “2” unless ?? ≥ 2.6??, where restoration takes place; and no control action is ever taken at state “3.” It is seen that the control policy change occurs at a higher ratio of ??/?? with the second method (policy change at ??=0.026 in Figure 5(b)) than with the first method (policy change at ??=0.014 in Figure 5(a)). Despite the slight bias toward overhaul, the optimality of the “always-restore” policy applied in [2] still holds with the second method at the nominal parameter values ??=12, ??=0.05, and ??=0.01, where ??/??=5>2.6.

4. Augmented Model Including Control Delays and Decision Errors

This section establishes a full-state model to include the effects of decision errors and control action delays upon entering a state of a single server failure. The first two subsections follow [12], which treated these separately: the effect of decision errors when a control action is taken incorrectly but immediately upon entering a state, and the effect of delayed control actions when a correct control action is taken but after some time delay. There are deterministically diagnosable systems for which the only cost of diagnosis is time [11]. The third subsection presents a new model to be used in robust control policy design that combines the two augmented models and also introduces delays due to rerouting queries from a failed server to intact servers.

4.1. Modeling the Effect of Erroneous Decisions

The control action considered in this study is state information based. Upon entering a state, for instance, ??, any information deficiency can result in uncertainty in decision making as to whether to take a control action or what control actions to take. In this case, every decision carries a risk [17].

A decision error in the database system could include the possibility that, upon a server failure, the wrong server is identified as being failed. Suppose, for instance, that ?????? has failed but ?????? is mistakenly observed as the failed server. Based on the false information, the control action would be for ?????? to restore data set ?? in ??????, whereas ?????? would be expected to continue to work. As a consequence, none of the servers can process queries for a period of time, and the database system is said to have entered an intermittent error state. It is assumed that from this state, only transitions representing service completion can occur. Figure 6 depicts a generic representation of such a case.

Without loss of generality, let ?? be a state that is entered upon the loss of both data sets in a server. Let ?? be the state entered upon the completion of primary data set restoration associated with the data loss. Let ??1 through ???? be the states representing completion of services at other ?? servers. Let ??1,…,???? be the states entered upon the arrival of a new query in one of the queues. (???? are not shown explicitly in Figure 6.) Let ??1 through ???? be the states entered upon data loss at other m servers. An intermittent state ?? is introduced, as shown in Figure 6, to allow the representation of imperfect decision making upon entering ??. Therefore, there is an intermittent error state for each state that involves outgoing transitions with weakened control authorities due to some decision errors. In the database system of Figure 1, 60 states are added to the original 147 states of the baseline model. It is assumed that once the primary data set restoration takes place for a particular server, the secondary data set restoration proceeds without a decision error.

Let ????,?? denote the transition rate from state ?? to state ?? in the absence of decision error in the restoration of the primary database associated with the most recent data loss. Let ?? be the probability of successful restoration, given that the event of restoration occurs. (1-??) then is referred to as the thinning [5] of the Poisson arrival process associated with the restoration. The split of rate ????,?? into rate ??????,?? and rate (1-??)????,?? is sometimes also called a decomposition of a Poisson arrival process into type 1 with probability ?? and type 2 with probability (1-??).

An imperfect decision corresponds to the value of ?? being less than unity. As a consequence, the authority of control that is supposed to reinforce the restoration process is weakened. The smaller the value of ??, the weaker the control authority is.

The rate of recovery from decision error is denoted by ????. To state the fact that recovery from an intermittent error state to restoration cannot be faster than the error-free (??=1) restoration process, ???? ≤ ????,?? is enforced. On the other hand, the outgoing transition rates from the intermittent error state to the states of data loss in other servers, that is, from ?? to ????, ??=1,2,...,??, are bounded below by the corresponding rates going from ?? to ????. These transitions further reduce the likelihood of reaching state ??.

It is now shown that decision errors always degrade the performance in terms of the state transition probability ?????? which is the probability that restoration to state ?? occurs given current state ??. It turns out that this probability is readily obtained for a Markov chain

where

without decision error, in which case ??=1 in (14), and

?(??) = ??????1 + … + ???????? + ??????1 + … + ???????? + ???????? + (1-??)?????? (16) with decision error, in which case ??<1. Note that (15) and (16) are the same, and both enter (14). Therefore, (14) is proportional to ??, and is largest at ??=1 when there is no decision error.
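The proportionality argument can be checked numerically with a small sketch. It assumes the usual result for competing exponential transitions, namely that the probability of a given transition occurring next equals its rate divided by the total exit rate, and it looks only at the direct transition to the restored state. The function name and all numerical rates are hypothetical.

```python
def restoration_probability(p, rho, other_rates):
    """Probability that the next transition is the successful restoration.

    The restoration rate rho is split into p * rho (success) and (1 - p) * rho
    (decision error), so the total exit rate does not depend on p.
    """
    total_rate = sum(other_rates) + p * rho + (1.0 - p) * rho
    return p * rho / total_rate


RHO = 0.05                               # hypothetical restoration rate
OTHER_RATES = [12.0, 12.0, 0.01, 0.01]   # hypothetical service and failure rates

for p in (1.0, 0.9, 0.5):
    print(p, restoration_probability(p, RHO, OTHER_RATES))
```

Because the denominator is the same for every value of the success probability, halving that probability halves the chance that restoration is the next event.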

4.2. Modeling the Effect of Delayed Control Actions

Time required for diagnosis [11] can be regarded as the universal cause of a control action delay. An example of the control action delay in the database system shown in Figure 1 would be that a total loss of data in a server is not immediately observed. As a result, the action of data restoration is delayed.

As in the previous subsection, let ?? be a state that is entered upon a total loss of data in a server. Let ?? be the state entered upon the completion of primary database restoration associated with the data loss. States ??1 through ???? and states ??1 through ???? also follow the earlier definitions. Figure 7 depicts a proposed model capable of describing a delayed restoration action by an exponentially distributed random amount with average ??-1 units of time upon entering state ??. With a single-stage delay for each state entered upon a total loss of data in a server, another 60 states are added to the baseline model.

In a more general case, there can be an ??-phased delay implemented in the augmented model by inserting ?? states ??1 through ???? in series between states ?? and ??. Each state ???? retains outgoing transitions to all ??1 through ????, and ??1 through ????, in addition to the transition to ????+1. The total amount of delay before restoration action is bounded below by random variable ??=??1+…+????, with a generalized Erlang distribution [5].

One may use an ??-stage Erlang to approach a constant delay, an ??-stage hyperexponential to approach a highly uncertain delay, or a mixture of the two to acquire more general properties [10] in its distribution.
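The remark that a multi-stage Erlang approaches a constant delay can be made concrete through the coefficient of variation of a sum of independent exponential stages. The sketch below uses the standard moments of the exponential distribution (mean 1/rate, variance 1/rate squared). The target mean delay of 10 time units and the function names are assumptions made for illustration.

```python
import math


def series_delay_moments(rates):
    """Mean and variance of a sum of independent exponential stages with the given rates."""
    mean = sum(1.0 / r for r in rates)
    variance = sum(1.0 / r ** 2 for r in rates)
    return mean, variance


def coefficient_of_variation(rates):
    """Standard deviation of the total delay divided by its mean."""
    mean, variance = series_delay_moments(rates)
    return math.sqrt(variance) / mean


TARGET_MEAN = 10.0   # hypothetical mean delay before the restoration action

for stages in (1, 4, 100):
    rates = [stages / TARGET_MEAN] * stages   # equal stage rates keep the mean fixed
    mean, _ = series_delay_moments(rates)
    print(stages, round(mean, 6), round(coefficient_of_variation(rates), 6))
```

With equal stage rates the coefficient of variation is one over the square root of the number of stages, so more stages concentrate the delay around its mean.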

Note that there are two significant differences between the decision error model of Figure 6 and the control delay model of Figure 7. First, the link to restoration of the primary database is present in Figure 6 with a smaller likelihood of transition, whereas the link to restoration without delay is absent in Figure 7. In addition, all links to service completion are absent in Figure 6, but are present in Figure 7. Therefore, each case has its distinct nature.

4.3. Full-State Model of the Controlled Database System

Referring again to the closed queuing network view of the distributed database system in Figure 1, this section presents its augmented model that incorporates all three sources of uncertainties: decision errors (Section 4.1), control action delays (Section 4.2), and routing delays. Routing delays are incurred when queries at a failed server are rerouted to the remaining intact servers.

Rerouting of queries becomes desirable when the queries observe a server failure after they have entered the queue preceding the server. An exponentially distributed random routing time is introduced with rate ??/sec for this purpose. A routing delay is assumed independent of a control action delay. The latter captures the random time of diagnosis, whereas the former captures the random time of transmission of queries among servers. Model augmentation amounts to adding new transitions among existing states without the need for new states.

In order to establish a full state model with all uncertainty types, the representation of the composite state variable is modified to ??=??????????????????????????????????????, where ?????? ∈ {0,1,2,3} and ?????? ∈ {0,1,2} as in the baseline model described in Section 2, and the newly introduced uncertainty variable ?? ∈ {0,1,2} with “1” = control delayed and “2” = wrong decision made. This results in a 267-state model. By exploiting symmetry, the 267 (147+60+60) state model can be reduced to a 96-state model. The binary control variables are defined as follows: ??1=1 to restore, ??2=1 to overhaul, and ??3=1 to reroute queries.

The states, the transitions, and the transition rates of the uncertain model are summarized in Figure 8, based on which transition matrix ?? of a Markov chain can be built and used in the next section for robust control policy design. ?? in Figure 8 is the newly introduced query transmission rate when the action for rerouting is called for. Error probability ?? relates to ?? in Figure 6 through ??=1-??. Subscript “??” denotes “primary” and “??” denotes “secondary.” Use of symmetry is reflected in server state ???? and arrival rates ??1,??2, and ??3.

5. Robust Control Policy Design

This section seeks robust control policies as solutions to the Markov decision problem:

????*(??0) = min?????? ∑_{??=0}^{∞} ???????????,?????, ??0 ∈ ??={1,2,…,95,96}, (18) where 0<??<1, ??={??0,??1,…} is the control policy sought, ??=(??1,??2,??3), and ???? ∈ {0,1}. Here ??1=1 to restore, ??2=1 to overhaul, and ??3=1 to reroute queries, as defined in Section 4.3. Note that the full-state model enables the designer to consider service demand and to weigh availability against response time. Thus two cost criteria are established. The first criterion,

???????,?????=??1??????+??2??????+??3????????,(19) penalizes long queues that cannot effectively reduce in time due to server failure, and thus favors response time. The second criterion, shown in the following table, penalizes prolonged service time, again, due to server failure, and thus favors availability.

The size of the state space suggests numerical means for solutions. Mathematical programs will be applied to obtain the solutions. The steady-state availability and the expected query response time of the controlled database system with the optimal policy will then be examined under various conditions.

5.1. Optimal Policy Design via Mathematical Programming

The transition rate matrix $Q(u(x))$ of the 96-state model can be obtained from Figure 8, established in Section 4.3. This $Q(u(x))$ depends on $u(x)=(u_1(x),u_2(x),u_3(x))$, $u_i\in\{0,1\}$. The state probability equation

$$\dot{p}(t)=p(t)\,Q(u(x)),\qquad(20)$$

which originates from the forward Chapman-Kolmogorov equation (2), can now be uniformized to yield a discrete-time Markov chain

$$p(k+1)=p(k)\left(I+\frac{1}{\nu}Q(u(x))\right),\qquad(21)$$

where the uniform rate $\nu$ can be chosen to be

??=3??+3??+??+??+??+??+??.(22)
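As an illustration of the uniformization step, the following minimal Python sketch converts a small generator matrix into a discrete-time transition matrix. The three-state chain, the names `uniformize`, `Q_toy`, and `P_toy`, and the numerical rates are all hypothetical and are not taken from the 96-state model; the only assumption carried over is that the uniform rate is at least as large as the largest total exit rate of any state, which a sum of all rates satisfies.

```python
def uniformize(Q, nu=None):
    """Return P = I + Q / nu for a CTMC generator Q (each row sums to zero)."""
    n = len(Q)
    max_exit_rate = max(-Q[i][i] for i in range(n))
    if nu is None:
        nu = max_exit_rate
    if nu < max_exit_rate:
        raise ValueError("nu must be at least the largest exit rate")
    return [[(1.0 if i == j else 0.0) + Q[i][j] / nu for j in range(n)]
            for i in range(n)]


# Hypothetical rates, not taken from the paper.
fail_rate, repair_rate = 0.1, 2.0
Q_toy = [
    [-fail_rate, fail_rate, 0.0],                          # all servers intact
    [repair_rate, -(repair_rate + fail_rate), fail_rate],  # one server failed
    [repair_rate, 0.0, -repair_rate],                      # system down, overhaul
]
P_toy = uniformize(Q_toy, nu=fail_rate + repair_rate)
```

Each row of `P_toy` is a probability distribution, so the discrete-time chain can be iterated directly in place of integrating the continuous-time equation.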

Recall the optimality equation (11) as an alternative characterization of the solution to Markov decision problem (18), which produces a system of 96 equations.

Dynamic programming is the most natural numerical approach to policy design (18) because (11) is derived by taking the limit of a finite-horizon dynamic program [9, 10]

where $\alpha<1$, and the terminal cost $J_0(x)=0$ for all $x\in X$. In this case the optimal cost is given by $J_N(x_0)$, $x_0\in X$. More specifically, with $u$ taking values in a finite set, the minimal cost-to-go from $x_0$ of the 96-state Markov decision process satisfies

where $J_N(x_0)$ is the minimal cost-to-go from $x_0$ of an $N$-step finite-horizon process.

The solution to a dynamic program results from an iterative calculation backwards along the horizon from $J_0(x)$ to the first step $J_N(x)$. For the dynamic programming calculation to converge to the true cost-to-go, $N$ must be sufficiently large, and $\alpha$ must be less than 1.
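The backward iteration can be sketched in Python on a toy problem. Everything in the example is an assumption made for illustration: the two-state, two-action chain, the costs, the discount factor 0.9, and the names `value_iteration`, `P_toy`, and `g_toy` do not come from the 96-state model. The sketch starts from a zero terminal cost and applies the one-step minimization a fixed number of times.

```python
def value_iteration(P, g, alpha, horizon):
    """Finite-horizon dynamic program with zero terminal cost.

    P[u][i][j]: transition probability from state i to j under action u.
    g[i][u]:    one-step cost of action u in state i.
    Returns the cost-to-go after `horizon` steps and the last minimizing actions.
    """
    n = len(g)
    J = [0.0] * n
    policy = [0] * n
    for _ in range(horizon):
        J_next = []
        for i in range(n):
            costs = [g[i][u] + alpha * sum(P[u][i][j] * J[j] for j in range(n))
                     for u in range(len(P))]
            best = min(range(len(costs)), key=costs.__getitem__)
            policy[i] = best
            J_next.append(costs[best])
        J = J_next
    return J, policy


# Hypothetical toy problem: state 0 = healthy, state 1 = degraded;
# action 0 = wait, action 1 = repair.
P_toy = [
    [[0.9, 0.1], [0.0, 1.0]],  # wait
    [[0.9, 0.1], [0.8, 0.2]],  # repair
]
g_toy = [[0.0, 0.5], [2.0, 2.5]]
J_200, policy_200 = value_iteration(P_toy, g_toy, alpha=0.9, horizon=200)
J_400, policy_400 = value_iteration(P_toy, g_toy, alpha=0.9, horizon=400)
```

In this toy case, doubling the horizon from 200 to 400 steps changes the cost-to-go only negligibly, which is the practical sense in which the horizon has to be long enough for the iteration to settle.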

Linear programming [18] can be considered as an alternative numerical approach to the solution of the Markov decision problem. In this case, the set of optimality equations is turned into a set of affine constraints on the set of optimization variables $\{J(i)\}$, and the problem can be formally stated as follows:

$$\text{Maximize}\quad J(1)+J(2)+\cdots+J(95)+J(96)\qquad(26)$$

$$\text{subject to}\quad J(i)\ge 0,\quad i\in X=\{1,\ldots,96\},\qquad(27)$$

$$J(i)\le g(i,u)+\alpha\sum_{j}p_{i,j}(u)\,J(j)\quad\text{for every admissible } u,\quad i\in X.\qquad(28)$$

The equivalence of the linear program formulation (26)–(28) and the optimality equation formulation can be easily established. First, (27) is trivially satisfied for all $i$ in both formulations because the one-step cost $g(i,u)$ is always nonnegative.

Suppose $(J(1),\ldots,J(96))$ is the linear program solution. Then, for each $i$, at least one of the affine inequality constraints of the form $J(i)\le\cdot$ must be active (equality achieved). Suppose that, for some $i$, none of the constraints on $J(i)$ is active. Then $J(i)$ can be increased until one of these inequality constraints becomes active without violating the rest of the inequality constraints, because $\alpha p_{i,i}<1$ is the coefficient of $J(i)$ on the right side of the inequality constraints (28). This, however, contradicts the assumption that $\sum_i J(i)$ is maximum. Therefore, $(J(1),\ldots,J(96))$ is also the solution to the optimality equations.

Assume now that $(J(1),\ldots,J(96))$ satisfies the optimality equations. It then automatically satisfies the inequality constraints (28), of which 96 are active, one for each $J(i)$ appearing on the left side. Suppose $\sum_i J(i)$ is not maximum. Then there is at least one $J(i)$, for some $i$, that is smaller than the corresponding cost in the maximizer of $\sum_i J(i)$, which implies that the corresponding constraints on $J(i)$ are slack (inactive). This contradicts the assumption that $J(i)$ satisfies the optimality equation. Therefore, $(J(1),\ldots,J(96))$ must also be the solution of the linear program formulation (26)–(28). The equivalence is thus established.

The function linprog in MATLAB's Optimization Toolbox [19] solves the maximization problem above. The active constraints are checked with a MATLAB script to determine the optimal control policy.
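The active-constraint check can be illustrated with a short Python sketch. This is not the MATLAB script referred to above. It assumes a hypothetical two-state, two-action problem, and it takes the candidate cost vector from a closed-form solution of that toy problem rather than from a linear programming solver. The names `constraint_slacks`, `policy_from_active`, `P_toy`, and `g_toy` are illustrative only.

```python
def constraint_slacks(P, g, alpha, J):
    """Slack of each inequality: g(i,u) + alpha * sum_j p_ij(u) J(j) - J(i)."""
    n = len(J)
    return [[g[i][u] + alpha * sum(P[u][i][j] * J[j] for j in range(n)) - J[i]
             for u in range(len(P))]
            for i in range(n)]


def policy_from_active(P, g, alpha, J, tol=1e-9):
    """For each state, list the actions whose constraint holds with equality."""
    slacks = constraint_slacks(P, g, alpha, J)
    if any(s < -tol for row in slacks for s in row):
        raise ValueError("J violates an inequality constraint")
    return [[u for u, s in enumerate(row) if s <= tol] for row in slacks]


# Hypothetical toy problem: state 0 = healthy, state 1 = degraded;
# action 0 = wait, action 1 = repair.
P_toy = [
    [[0.9, 0.1], [0.0, 1.0]],  # wait
    [[0.9, 0.1], [0.8, 0.2]],  # repair
]
g_toy = [[0.0, 0.5], [2.0, 2.5]]
alpha = 0.9

# Closed-form solution of the toy optimality equations (wait in 0, repair in 1),
# standing in for the vector a linear programming solver would return.
J1 = 2.5 / (1 - 0.18 - 0.72 * 0.09 / 0.19)
J0 = 0.09 / 0.19 * J1
active = policy_from_active(P_toy, g_toy, alpha, [J0, J1])
```

Each state ends up with exactly one active constraint in this toy case, and the action attached to that constraint is the one the policy selects in that state.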

The computational complexity of the dynamic program and that of the linear program are now compared. Finding the solution to a linear program generally requires a computation time proportional to $p^2q$ [18] when $q\ge p$, where $p$ is the number of optimization variables, and $q$ is the number of constraints. The computational complexity of an iterative dynamic programming solution can be approximated by assuming that each iteration is a series of linear programs. The linear programming solution to the set of optimality equations is of course a single linear program.

The number of control variables, $m$, the number of states, $n$, and the horizon length, $N$, are critical to the computation time of these methods. First, consider the iterative method as a series of linear programs. Each individual iteration along the $N$-step horizon consists of $n$ individual linear programs. Each individual linear program has $m$ variables and $2m$ constraints. Therefore, the computation time is proportional to $Nn(m^2\cdot 2m)=Nn(2m^3)$. Now, consider the method of solving the optimality equations through linear programming. The single linear program has $n$ variables and $2mn$ constraints. Hence, its computation time is proportional to $n^2\cdot 2mn=n^3\cdot 2m$.

Although the computation time grows faster for the linear program as the number of states increases, the horizon $N$ is typically much larger than $n^2$ for a small discount factor $\beta$ in $\alpha=\nu/(\beta+\nu)$. Therefore, the linear program is more efficient for moderate numbers of states and small discount factors.
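The two operation-count estimates can be compared numerically. The Python sketch below simply evaluates the proportionality expressions given above for 96 states and three control variables. The function names are illustrative, and the results are relative counts rather than measured run times.

```python
def dp_cost(horizon, n_states, n_controls):
    """Iterative method: horizon * n_states small programs, each ~ 2 * n_controls**3."""
    return horizon * n_states * 2 * n_controls ** 3


def lp_cost(n_states, n_controls):
    """Single linear program: ~ n_states**3 * 2 * n_controls."""
    return n_states ** 3 * 2 * n_controls


def breakeven_horizon(n_states, n_controls):
    """Horizon length at which the two estimates coincide."""
    return n_states ** 2 / n_controls ** 2


n_states, n_controls = 96, 3
crossover = breakeven_horizon(n_states, n_controls)
single_lp = lp_cost(n_states, n_controls)
dp_at_crossover = dp_cost(1024, n_states, n_controls)
```

Under these estimates the two methods break even at a horizon of 1024 steps for the 96-state model, so the single linear program is the cheaper option whenever convergence of the iteration requires a longer horizon than that.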

5.2. Availability and Response Time under Robust Control Policy

A selected set of results on the robust control policies obtained via mathematical programming is presented in this subsection, and the system availability and query response time under some of the optimal policies are examined.

5.2.1. Restoration-Overhaul Switching

Under the cost criterion (19) (minimum total discounted queue size), the optimal policy depends on the number of queries in the queue behind the failed server. No action is taken to overhaul the system until the two active queues are empty and the buildup of queries behind the failed server is significant. Figure 9(a) depicts a switching curve of the control policy between overhaul and restoration before (solid) and after (dotted) state ?? in Figure 9(b) is reached. Policy switching is determined by the amount of control action delay, the decision error probability, and the number of queries in the failed server. It can be seen that, while the two active queues are occupied or after the primary data is successfully restored, restoration is performed on the failed server as long as the server performing the restoration does not have any queries waiting in its queue.

Under the cost criterion stated in Table 2 (minimally reduced service time), the optimal policy always attempts to restore the failed server as long as the server performing the restoration does not have any queries waiting in its queue. The only exception is when three queries are piled into any single queue. In this case, overhaul occurs when the uncertainties are significant.

5.2.2. Performance under Nominal and Robust Policies, and Effect of Routing Delay

This subsection examines the system steady-state availability and the expected query response time under the robust policy, where random control delay and decision error are explicitly modeled, and under the nominal policy, where uncertainties are ignored. The results are similar for policies derived with either the queue size criterion (19) or the service time criterion (Table 2). The robust policy shows two distinct features in Figures 10(a) and 10(b): it switches control action when the uncertainties (delay and error) become significant, and it balances availability against response time in this situation.

The routing-only policy does not attempt to restore the single failed server. Instead, queries are routed to an empty queue whenever the subsequent server contains the data for the query. The system is overhauled upon a second server failure. This policy offers some advantage in response time over the always-restore policy when there is no routing delay, as shown in Figure 11(a). It is also seen that the robust optimal policy experiences improved performance with rerouting authority. However, a routing delay of about one second is significant enough to discourage the use of the routing-only policy, as shown in Figure 11(b).

6. Conclusions

Uncertainties due to control delays, transmission delays, and decision errors in the distributed database system degrade the performance of the database system in terms of availability and response time. Restoration remains the optimal policy over a significant range of uncertainties. Beyond the boundaries of this range, however, the optimal control policy switches to overhaul. By formulating and solving a Markov decision problem, the robustness of the control policies is investigated. Boundaries at which the optimal actions change are shown to exist and are quantified. The robust policies are shown to provide the best compromise among competing interests.

The authors have also investigated the optimal control policy for the database under the open queuing network setting in the face of delays and errors. Simulations with SimEvents [20] show that response time further depends on the arrival rate of queries. Simulation results will be reported separately. A simulation study of larger networks is also planned.

Acknowledgment

This work was supported in part by AFOSR under Grants FA9550-06-0456 and FA9550-06-10249.