Abstract and Applied Analysis

Volume 2009, Article ID 103723, 17 pages

http://dx.doi.org/10.1155/2009/103723

## Policy Iteration for Continuous-Time Average Reward Markov Decision Processes in Polish Spaces

^{1}Department of Mathematics, Ningbo University, Ningbo 315211, China
^{2}Department of Mathematics, Honghe University, Mengzi 661100, China
^{3}The College of Mathematics and Computing Science, Changsha University of Science and Technology, Changsha 410076, China

Received 24 June 2009; Accepted 9 December 2009

Academic Editor: Nikolaos Papageorgiou

Copyright © 2009 Quanxin Zhu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We study the *policy iteration algorithm* (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be *unbounded*, and the reward rates may have *neither upper nor lower bounds*. The criterion that we are concerned with is *expected average reward*. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then under two *slightly* different sets of conditions we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.

#### 1. Introduction

In this paper we study the average reward optimality problem for continuous-time jump Markov decision processes (MDPs) in general state and action spaces. The corresponding transition rates are allowed to be *unbounded*, and the reward rates may have *neither upper nor lower bounds*. Here, the approach to deal with this problem is by means of the well-known policy iteration algorithm (PIA)—also known as Howard's policy improvement algorithm.

As is well known, the PIA was originally introduced by Howard (1960) in [1] for finite MDPs (i.e., MDPs whose state and action spaces are both finite). Using the monotonicity of the sequence of iterated average rewards, he showed that the PIA converges in a finite number of steps. When the state space is not *finite*, however, there are well-known counterexamples showing that the PIA may fail to converge even though the action space is compact (see, e.g., [2–4]). Thus, an interesting problem is to find conditions ensuring that the PIA converges, and an extensive literature has addressed it; see, for instance, [1, 5–14] and the references therein. However, most of these references concentrate on discrete-time MDPs: see [1, 5, 11] for finite discrete-time MDPs, [10, 15] for discrete-time MDPs with a finite state space and a compact action set, [13] for denumerable discrete-time MDPs, and [8, 9, 12] for discrete-time MDPs in Borel spaces. For continuous-time models, to the best of our knowledge, only Guo and Hernández-Lerma [6], Guo and Cao [7], and Zhu [14] have addressed this issue. In [6, 7, 14], the authors established the average reward optimality equation and the existence of average optimal stationary policies. However, the treatments in [6, 7] are restricted to a denumerable state space. In [14] we used the policy iteration approach to study the average reward optimality problem for continuous-time jump MDPs in general state and action spaces; one of the main contributions of [14] is the proof of the existence of the average reward optimality equation and of average optimal stationary policies. But the PIA is not stated explicitly in [14], and so neither the average optimal reward value function nor an average optimal stationary policy can be computed there.
In this paper we further study the average reward optimality problem for this class of continuous-time jump MDPs in general state and action spaces. Our main objective is to use the PIA to compute, or at least approximate (when the PIA takes infinitely many steps to converge), the average optimal reward value function and an average optimal stationary policy. To do this, we first use the so-called “drift” condition, the standard continuity-compactness hypotheses, and the irreducibility and uniform exponential *ergodicity* conditions to establish the average reward optimality equation and present the PIA. Then, under two slightly different extra conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. A key feature of this paper is that the PIA thereby provides a way to compute, or at least approximate, these quantities.

The remainder of this paper is organized as follows. In Section 2, we introduce the control model and the optimal control problem that we are concerned with. After our optimality conditions and some technical preliminaries as well as the PIA stated in Section 3, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation in Section 4. Finally, we conclude in Section 5 with some general remarks.

*Notation 1. *If is a Polish space (i.e., a complete and separable metric space), we denote by the corresponding Borel σ-algebra.

#### 2. The Optimal Control Problem

The material in this section is quite standard (see, e.g., [14, 16, 17]), and we state it only briefly. The control model that we are interested in is a continuous-time jump MDP of the following form:

where one has the following.

(i) is the state space, which is assumed to be a Polish space.

(ii) is the action space, which is also assumed to be a Polish space, and is a Borel set denoting the set of actions available at state . The set is assumed to be a Borel subset of .

(iii) denotes the *transition rates*, which are assumed to satisfy the following properties: for each and , is a signed measure on , and is Borel measurable on ; , for all ; ; for all .

It should be noted that the property shows that the model is *conservative*, and the property implies that the model is *stable*.

denotes the *reward rate*, and it is assumed to be measurable on . (As it is allowed to take positive and negative values, it can also be interpreted as a *cost rate*.)

To introduce the optimal control problem that we are interested in, we need to introduce the classes of admissible control policies.

Let be the family of functions such that:

(i) for each and , is a probability measure on ;

(ii) for each and , is a Borel measurable function on .

*Definition 2.1. *A family is said to be a *randomized Markov policy*. In particular, if there exists a measurable function on with for all , such that for all and , then is called a (deterministic) *stationary policy*, and it is identified with . The set of all stationary policies is denoted by .

For each , we define the associated transition rates and reward rates, respectively, as follows.

For each , and ,

In particular, we will write and as and , respectively, when .

*Definition 2.2. *A randomized Markov policy is said to be *admissible* if is continuous in , for all and .

The family of all such policies is denoted by . Obviously, , and so is nonempty. Moreover, for each , Lemma in [16] ensures that there exists a -process—that is, a possibly substochastic and nonhomogeneous transition function with transition rates . As is well known, such a -process is not necessarily regular; that is, we might have for some state and . To ensure the regularity of a -process, we use the following so-called “drift” condition, which is taken from [14, 16–18].

*Assumption A. *There exist a (measurable) function on and constants , , and such that:
(1) for all ;
(2) for all , with as in ;
(3) for all .

Remark in [16] gives a discussion of Assumption A. In fact, Assumption A() is similar to conditions in the previous literature (see, e.g., [19, equation (2.4)]), and, together with Assumption A(), it is used to ensure the finiteness of the expected average reward criterion (2.5) below. In particular, Assumption A() is not required when the transition rates are uniformly bounded, that is, .

For each initial state at time and , we denote by and the probability measure determined by and the corresponding expectation operator, respectively. Thus, for each , by [20, pages 107–109] there exists a Borel measurable Markov process (which we denote by for simplicity when there is no risk of confusion) with values in and transition function , which is completely determined by the transition rates . In particular, if , we write and as and , respectively.

If Assumption A holds, then from [17, Lemma ] we have the following facts.

Lemma 2.3. *Suppose that Assumption A holds. Then the following statements hold.*

(a) *For each , and ,
where the function and constants and are as in Assumption A.*

(b) *For each , and ,*

For each and , the *expected average reward* as well as the corresponding optimal reward value functions are defined as

As a consequence of Assumption A() and Lemma 2.3(a), the expected average reward is well defined.

*Definition 2.4. *A policy is said to be *average optimal* if for all .

The main goal of this paper is to give conditions for ensuring that the policy iteration algorithm converges.
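As a concrete illustration of the criterion (2.5): for an irreducible *finite* model, the expected average reward of a stationary policy is the integral of the reward rate against the invariant probability measure of the policy's transition rate matrix. Below is a minimal numerical sketch of this finite-state analogue; all rates and rewards are invented for illustration and are not part of the paper's model.

```python
import numpy as np

# Hypothetical 2-state chain under a fixed stationary policy f.
# Qf: transition rate matrix (conservative: rows sum to 0); rf: reward rates.
Qf = np.array([[-1.0, 1.0],
               [2.0, -2.0]])
rf = np.array([1.0, 0.0])

# Invariant probability measure mu_f: mu_f @ Qf = 0, entries sum to 1.
A = np.vstack([Qf.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

# Long-run expected average reward J(f): integral of rf against mu_f.
J = mu @ rf
```

Here mu works out to (2/3, 1/3), so J(f) = 2/3; in the general Polish-space setting this linear solve is replaced by the invariant probability measure of Assumption C below.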

#### 3. Optimality Conditions and Preliminaries

In this section we state conditions for ensuring that the policy iteration algorithm (PIA) converges and give some preliminary lemmas that are needed to prove our main results.

To guarantee that the PIA converges, we need to establish the average reward optimality equation. To do this, in addition to Assumption A, we need two more assumptions. The first is the following standard continuity-compactness hypothesis, which is taken from [14, 16–18]; it is similar to the version for discrete-time MDPs (see, for instance, [3, 8, 21–23] and their references). In particular, Assumption B() is not required when the transition rates are uniformly bounded, since it is only used to justify the application of the *Dynkin formula*.

*Assumption B. *For each :
(1) is compact;
(2) is continuous in , and the function is continuous in for each bounded measurable function on , and also for as in Assumption A;
(3) there exist a nonnegative measurable function on and constants and such that
for all .

The second is the irreducibility and uniform exponential *ergodicity* condition. To state this condition, we need the weighted norm used in [8, 14, 22]. For the function in Assumption A, we define the weighted supremum norm for real-valued functions on by

and the Banach space

*Definition 3.1. *For each , the Markov process , with transition rates , is said to be *uniform *-*exponentially ergodic* if there exists an invariant probability measure on such that
for all , and , where the positive constants and do not depend on , and where .
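For a finite-state chain, the property in Definition 3.1 can be checked numerically: the deviation of the transition function from its invariant measure should contract by a fixed exponential factor per unit time. The sketch below (rate matrix and test function invented for illustration, not taken from the paper) computes the transition function of a 2-state rate matrix by eigendecomposition; the nonzero eigenvalue −3 gives the contraction rate.

```python
import numpy as np

# Illustrative 2-state transition rate matrix Q (rows sum to 0).
Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])
mu = np.array([2/3, 1/3])            # invariant measure: mu @ Q = 0

# Transition function P_t = exp(t*Q), via eigendecomposition of Q.
w, V = np.linalg.eig(Q)
Vinv = np.linalg.inv(V)
def P(t):
    return (V @ np.diag(np.exp(w * t)) @ Vinv).real

h = np.array([1.0, -1.0])            # a test function on the state space

# Deviation of P_t h from the stationary expectation mu @ h at several times.
devs = [np.max(np.abs(P(t) @ h - mu @ h)) for t in (1.0, 2.0, 3.0)]

# The deviation contracts by the constant factor e^{-3} per unit time,
# since the nonzero eigenvalue of Q is -3: exponential ergodicity.
ratios = [devs[i + 1] / devs[i] for i in range(2)]
```

Successive deviations shrink by the factor e^{−3} ≈ 0.05, which is the finite-state counterpart of the exponential bound in Definition 3.1.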

*Assumption C. *For each , the Markov process , with transition rates , is uniform -exponentially *ergodic* and -*irreducible*, where is a nontrivial -finite measure on independent of .

*Remark 3.2. *(a) Assumption C is taken from [14] and it is used to establish the average reward optimality equation. (b) Assumption C is similar to the uniform -exponentially ergodic hypothesis for discrete-time MDPs; see [8, 22], for instance. (c) Some sufficient conditions as well as examples in [6, 16, 19] are given to verify Assumption C. (d) Under Assumptions A, B, and C, for each , the Markov process , with the transition rate , has a unique invariant probability measure such that
(e) As in [9], for any given stationary policy , we consider two functions in to be equivalent, and do not distinguish between equivalent functions, if they are equal -almost everywhere (a.e.). In particular, if -a.e. for all , then the function is taken to be identically zero.

Under Assumptions A, B, and C, we can obtain several lemmas, which are needed to prove our main results.

Lemma 3.3. *Suppose that Assumptions A, B, and C hold, and let be any stationary policy. Then one has the following facts.*

(a) *For each , the function
belongs to , where and is as in Assumption A.*

(b) * satisfies the Poisson equation
for which the -expectation of is zero, that is,*

(c) *For all , .*

(d) *For all , .*

*Proof. *Parts (a) and (b) follow from [14, Lemma ]. We now prove (c). In fact, from the definition of in (2.5), Assumption A(), and Lemma 2.3(a) we have
which gives (c). Finally, we verify part (d). Obviously, by Assumption A() and Assumption C we can easily obtain for all , which together with part (c) yields the desired result.
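In the finite-state case the Poisson equation of Lemma 3.3(b) is simply a linear system: the gain is the invariant-measure average of the reward rate, and the bias is determined up to an additive constant that is fixed by the zero-expectation normalization. A sketch with invented rates (not the paper's model):

```python
import numpy as np

# Hypothetical 2-state model under a fixed stationary policy (illustrative data).
Qf = np.array([[-1.0, 1.0],
               [2.0, -2.0]])          # conservative transition rate matrix
rf = np.array([1.0, 0.0])             # reward rates under the policy

# Invariant measure mu_f: mu_f @ Qf = 0, entries sum to 1.
mu, *_ = np.linalg.lstsq(np.vstack([Qf.T, np.ones(2)]),
                         np.array([0.0, 0.0, 1.0]), rcond=None)
g = mu @ rf                           # gain (average reward) of the policy

# Poisson equation rf - g + Qf @ h = 0, normalized so that mu @ h = 0.
h, *_ = np.linalg.lstsq(np.vstack([Qf, mu]),
                        np.concatenate([g - rf, [0.0]]), rcond=None)

assert np.allclose(rf - g + Qf @ h, 0)    # Poisson equation holds
assert abs(mu @ h) < 1e-12                # zero-expectation normalization
```

For these numbers the gain is g = 2/3 and the bias is h = (1/9, −2/9); shifting h by a constant changes neither the Poisson equation's residual nor the policy improvement step of the next section.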

The next result establishes the *average reward optimality equation*. For the proof, see [14, Theorem ].

Theorem 3.4. *Under Assumptions A, B, and C, the following statements hold.*

(a) *There exist a unique constant , a function , and a stationary policy satisfying the average reward optimality equation*

(b) * for all .*

(c) *Any stationary policy realizing the maximum of (3.10) is average optimal, and so in (3.11) is average optimal.*

Next, under Assumptions A, B, and C, we present the PIA that we are concerned with. To do this, we first give the following definition.

For any real-valued function on , we define the dynamic programming operator as follows:

*Algorithm A (policy iteration).*

*Step 1 (initialization). *Take and choose a stationary policy .

*Step 2 (policy evaluation). *Find a constant and a real-valued function on satisfying the Poisson equation (3.7), that is,
Obviously, by (3.12) and (3.13) we have

*Step 3 (policy improvement). *Set for all for which
otherwise (i.e., when (3.15) does not hold), choose such that

*Step 4. *If satisfies (3.15) for all , then stop (because, from Proposition 4.1 below, is average optimal); otherwise, replace with and return to Step 2.
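To make Steps 1–4 concrete, here is a minimal sketch of Algorithm A for a finite model with two states and two actions. All rates and rewards are invented for illustration; policy evaluation pins the bias at the first state rather than using the zero-mean normalization of Lemma 3.3(b), since either choice yields the same gain and the same improvement step.

```python
import numpy as np

# Hypothetical 2-state, 2-action continuous-time MDP (all data illustrative).
# q[a] is the transition rate matrix under action a (conservative: rows sum to 0);
# r[a] gives the reward rates r(x, a).
q = {0: np.array([[-1.0, 1.0], [2.0, -2.0]]),
     1: np.array([[-3.0, 3.0], [0.5, -0.5]])}
r = {0: np.array([1.0, 0.0]),
     1: np.array([2.0, -1.0])}
n, actions = 2, (0, 1)

def evaluate(policy):
    """Step 2: solve the Poisson equation rf - g + Qf @ h = 0, pinning h[0] = 0."""
    Qf = np.array([q[policy[x]][x] for x in range(n)])
    rf = np.array([r[policy[x]][x] for x in range(n)])
    A = np.zeros((n + 1, n + 1))          # unknowns: h[0], ..., h[n-1], g
    A[:n, :n] = Qf
    A[:n, n] = -1.0                       # the -g term
    A[n, 0] = 1.0                         # normalization h[0] = 0
    sol = np.linalg.solve(A, np.concatenate([-rf, [0.0]]))
    return sol[n], sol[:n]                # (gain g, bias h)

def improve(h):
    """Step 3: at each state pick an action maximizing r(x,a) + sum_y q(y|x,a) h(y)."""
    return [max(actions, key=lambda a: r[a][x] + q[a][x] @ h) for x in range(n)]

policy = [0, 0]                            # Step 1: initial stationary policy
while True:
    g, h = evaluate(policy)
    new_policy = improve(h)
    if new_policy == policy:               # Step 4: no improvement, so stop
        break
    policy = new_policy
```

On this toy model the algorithm stops after one improvement, at the policy taking action 1 in the first state and action 0 in the second, with optimal gain 0.8; in the general Polish-space setting, the stopping rule is justified by Proposition 4.1 below.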

*Definition 3.5. *The policy iteration Algorithm A is said to *converge* if the sequence converges to the average optimal reward value function in (2.5), that is,
where is as in Theorem 3.4.

Obviously, under Assumptions A, B, and C from Proposition 4.1 we see that the sequence is nondecreasing; that is, holds for all . On the other hand, by Lemma 3.3(d) we see that is bounded. Therefore, there exists a constant such that

Note that, in general, we have . In order to ensure that the policy iteration Algorithm A converges, that is, , in addition to Assumptions A, B, and C, we need an additional condition (Assumption D or D′ below).

*Assumption D. *There exist a subsequence of and a measurable function on such that

*Remark 3.6. *(a) Assumption D is the same as hypothesis H1 in [9], and Remark in [9] gives a detailed discussion of it. (b) In particular, Assumption D trivially holds when the state space is a *countable* set (with the discrete topology). (c) When the state space is not countable, Assumption D still holds provided the sequence is equicontinuous.

*Assumption D’. *There exists a stationary policy such that

*Remark 3.7. *Assumption is the same as the hypothesis H2 in [9]. Obviously, Assumption trivially holds when the state space is a *countable* set (with the discrete topology) and is compact for all .

Finally, we conclude this section with a lemma (Lemma 3.8) that is needed to prove Theorem 4.2. For a proof, see, for instance, [24, Proposition ].

Lemma 3.8. *Suppose that is compact for all , and let be a stationary policy sequence in . Then there exists a stationary policy such that is an accumulation point of for each .*

#### 4. Main Results

In this section we present our main results, Theorems 4.2 and 4.3. Before stating them, we first give the following proposition, which is needed in their proofs.

Proposition 4.1. *Suppose that Assumptions A, B, and C hold, and let be an arbitrary stationary policy. If is any policy such that*

*then*

(a) 

(b) *if , then*

(c) *if is average optimal, then
where is as in Theorem 3.4;*

(d) *if , then satisfies the average reward optimality equation (3.10), and so is average optimal.*

*Proof. *(a) Combining (3.7) and (4.1) we have
Obviously, integrating both sides of (4.5) with respect to and using Remark 3.2(d), we obtain the desired result.

(b) If , we may rewrite the Poisson equation for as

Then, combining (4.5) and (4.6) we obtain
Thus, from (4.7) and using the Dynkin formula we get
Letting in (4.8) and by Assumption C we have
Now take . Then take the supremum over in (4.9) to obtain
and so
which implies
Hence, from Remark 3.2(e) and (4.12) we obtain (4.3).

(c) Since is average optimal, by Definition 2.4 and Theorem 3.4(b) we have

Hence, the Poisson equation (3.7) for becomes
On the other hand, by (3.10) we obtain
which together with (4.14) gives
Thus, as in the proof of part (b), from (4.16) we see that (4.4) holds with .

(d) By (3.7), (4.1), (4.3), and we have

which gives
that is,
Thus, as in the proof of Theorem in [14], from Lemma 2.3(b), (3.7), and (4.19) we show that is average optimal, that is, . Hence, we may rewrite (4.19) as
Thus, from (4.20) and part (c) we obtain the desired conclusion.

Theorem 4.2. *Suppose that Assumptions A, B, C, and D hold. Then the policy iteration Algorithm A converges.*

*Proof. *From Lemma 3.3(a) we see that the function in (3.13) belongs to , and so the function in (3.19) also belongs to . Now let be as in Assumption D, and let be the corresponding subsequence of . Then by Assumption D we have
Moreover, from Lemma 3.8 there is a stationary policy such that is an accumulation point of for each ; that is, for each there exists a subsequence (depending on the state ) such that
Also, by (3.13) we get
On the other hand, take any real-valued measurable function on such that for all . Then, for each and , by the properties we can define as follows:
Obviously, is a probability measure on . Thus, combining (4.23) and (4.24) we have
Letting in (4.25), then by (3.18), (4.21), and (4.22), as well as the extension of Fatou's lemma [8, Lemma 8.3.7], we obtain
To complete the proof of Theorem 4.2, by Proposition 4.1(d) we only need to prove that and satisfy the average reward optimality equation (3.10) and (3.11), that is,
Obviously, from (4.26), and the definition of in (3.12) we obtain
The rest is to prove the reverse inequality, that is,
Obviously, by (3.19) we have
Moreover, from Lemma 3.3(a) again we see that there exists a constant such that
which gives
Thus, by (4.24), (4.31), (4.32), and the extension of Fatou's lemma [8, Lemma 8.3.7], we obtain
which implies
Also, from (3.7), (3.16), and the definition of in (3.12) we get
Letting in (4.35), then by (3.18), (4.21), (4.22), (4.34), and the extension of Fatou's lemma [8, Lemma 8.3.7], we obtain
which gives
This completes the proof of Theorem 4.2.

Theorem 4.3. *Suppose that Assumptions A, B, C, and hold. Then the policy iteration Algorithm A converges.*

*Proof. *To prove Theorem 4.3, from the proof of Theorem 4.2 we only need to verify that (4.26) and (4.27) hold true for as in Assumption and some function in . To do this, we first define two functions in as follows:
Then by (3.7) we get
which together with (4.24) yields
Applying the extension of Fatou's lemma [8, Lemma 8.3.7] and letting in (4.40), then by (3.18), (4.38), and Assumption we obtain
which implies
Thus, combining (4.42) and (4.43) we get
Then, from the proof of Proposition 4.1(b) and (4.44) we have
which together with (4.42), (4.43), and the definition of in (3.12) gives
The remainder is to prove the reverse inequality, that is,
Obviously, by (3.16) and (4.24) we get
Then, letting in (4.48), by (4.38), Assumption , and the extension of Fatou's lemma [8, Lemma 8.3.7], we obtain
which implies
and so
Thus, combining (4.46) and (4.51) we see that (4.47) holds, and so Theorem 4.3 follows.

#### 5. Concluding Remarks

In the previous sections we have studied the policy iteration algorithm (PIA) for average reward continuous-time jump MDPs in Polish spaces. Under two *slightly* different sets of conditions we have shown that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation. It should be mentioned that the approach presented here is different from the policy iteration approach used in [14] because the PIA in this paper provides an approach to compute or at least approximate (when the PIA takes infinitely many steps to converge) the value of the average optimal reward value function and an average optimal stationary policy.

#### Acknowledgments

The authors would like to thank the editor and the anonymous referees for their comments and valuable suggestions, which have helped us to improve the paper. This work was jointly supported by the National Natural Science Foundation of China (10801056), the Natural Science Foundation of Ningbo (201001A6011005), the Scientific Research Fund of Zhejiang Provincial Education Department, the K. C. Wong Magna Fund of Ningbo University, the Natural Science Foundation of Yunnan Provincial Education Department (07Y10085), the Natural Science Foundation of Yunnan Province (2008CD186), and the Foundation of the Chinese Society for Electrical Engineering (2008).

#### References

1. R. A. Howard, *Dynamic Programming and Markov Processes*, The Technology Press of M.I.T., Cambridge, Mass, USA, 1960.
2. R. Dekker, “Counter examples for compact action Markov decision chains with average reward criteria,” *Communications in Statistics*, vol. 3, no. 3, pp. 357–368, 1987.
3. M. L. Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley & Sons, New York, NY, USA, 1994.
4. P. J. Schweitzer, “On undiscounted Markovian decision processes with compact action spaces,” *RAIRO—Operations Research*, vol. 19, no. 1, pp. 71–86, 1985.
5. E. V. Denardo and B. L. Fox, “Multichain Markov renewal programs,” *SIAM Journal on Applied Mathematics*, vol. 16, pp. 468–487, 1968.
6. X. P. Guo and O. Hernández-Lerma, “Drift and monotonicity conditions for continuous-time controlled Markov chains with an average criterion,” *IEEE Transactions on Automatic Control*, vol. 48, no. 2, pp. 236–245, 2003.
7. X. P. Guo and X. R. Cao, “Optimal control of ergodic continuous-time Markov chains with average sample-path rewards,” *SIAM Journal on Control and Optimization*, vol. 44, no. 1, pp. 29–48, 2005.
8. O. Hernández-Lerma and J. B. Lasserre, *Further Topics on Discrete-Time Markov Control Processes*, vol. 42 of *Applications of Mathematics*, Springer, New York, NY, USA, 1999.
9. O. Hernández-Lerma and J. B. Lasserre, “Policy iteration for average cost Markov control processes on Borel spaces,” *Acta Applicandae Mathematicae*, vol. 47, no. 2, pp. 125–154, 1997.
10. A. Hordijk and M. L. Puterman, “On the convergence of policy iteration in finite state undiscounted Markov decision processes: the unichain case,” *Mathematics of Operations Research*, vol. 12, no. 1, pp. 163–176, 1987.
11. J. B. Lasserre, “A new policy iteration scheme for Markov decision processes using Schweitzer's formula,” *Journal of Applied Probability*, vol. 31, no. 1, pp. 268–273, 1994.
12. S. P. Meyn, “The policy iteration algorithm for average reward Markov decision processes with general state space,” *IEEE Transactions on Automatic Control*, vol. 42, no. 12, pp. 1663–1680, 1997.
13. M. S. Santos and J. Rust, “Convergence properties of policy iteration,” *SIAM Journal on Control and Optimization*, vol. 42, no. 6, pp. 2094–2115, 2004.
14. Q. X. Zhu, “Average optimality for continuous-time Markov decision processes with a policy iteration approach,” *Journal of Mathematical Analysis and Applications*, vol. 339, no. 1, pp. 691–704, 2008.
15. A. Y. Golubin, “A note on the convergence of policy iteration in Markov decision processes with compact action spaces,” *Mathematics of Operations Research*, vol. 28, no. 1, pp. 194–200, 2003.
16. X. P. Guo and U. Rieder, “Average optimality for continuous-time Markov decision processes in Polish spaces,” *The Annals of Applied Probability*, vol. 16, no. 2, pp. 730–756, 2006.
17. Q. X. Zhu, “Average optimality inequality for continuous-time Markov decision processes in Polish spaces,” *Mathematical Methods of Operations Research*, vol. 66, no. 2, pp. 299–313, 2007.
18. Q. X. Zhu and T. Prieto-Rumeau, “Bias and overtaking optimality for continuous-time jump Markov decision processes in Polish spaces,” *Journal of Applied Probability*, vol. 45, no. 2, pp. 417–429, 2008.
19. R. B. Lund, S. P. Meyn, and R. L. Tweedie, “Computable exponential convergence rates for stochastically ordered Markov processes,” *The Annals of Applied Probability*, vol. 6, no. 1, pp. 218–237, 1996.
20. I. I. Gīhman and A. V. Skorohod, *Controlled Stochastic Processes*, Springer, New York, NY, USA, 1979.
21. Q. X. Zhu and X. P. Guo, “Markov decision processes with variance minimization: a new condition and approach,” *Stochastic Analysis and Applications*, vol. 25, no. 3, pp. 577–592, 2007.
22. Q. X. Zhu and X. P. Guo, “Another set of conditions for Markov decision processes with average sample-path costs,” *Journal of Mathematical Analysis and Applications*, vol. 322, no. 2, pp. 1199–1214, 2006.
23. Q. X. Zhu and X. P. Guo, “Another set of conditions for strong $n(n=-1,0)$ discount optimality in Markov decision processes,” *Stochastic Analysis and Applications*, vol. 23, no. 5, pp. 953–974, 2005.
24. M. Schäl, “Conditions for optimality in dynamic programming and for the limit of $n$-stage optimal policies to be optimal,” *Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete*, vol. 32, no. 3, pp. 179–196, 1975.