Abstract

This paper is concerned with the asymptotic optimality of quantized stationary policies for continuous-time Markov decision processes (CTMDPs) in Polish spaces with state-dependent discount factors, where the transition rates and reward rates are allowed to be unbounded. Using the dynamic programming approach, we first establish the discounted optimal equation and the existence of its solutions. Then, under suitable conditions, we obtain the existence of optimal deterministic stationary policies by more concise proofs. Furthermore, we discretize the action space and construct a sequence of quantizer policies that approximate the optimal stationary policies of the CTMDPs, and we obtain the approximation result and the rates of convergence for the expected discounted rewards of the quantized stationary policies. We also give an iteration algorithm for approximately optimal policies. Finally, we give an example to illustrate the asymptotic optimality.

1. Introduction

This paper deals with infinite-horizon discounted continuous-time Markov decision processes (CTMDPs); it studies the asymptotic optimality of quantized stationary policies for CTMDPs and gives convergence rate results. The discount factors are state-dependent, and the transition rates and reward rates are allowed to be unbounded.

It is well known that discounted CTMDPs have been widely studied as an important class of stochastic control problems. Generally speaking, according to the form of the discount factor, the infinite-horizon discounted MDPs can be classified into the following three groups: (i) MDPs with a fixed constant discount factor α, see, for instance, Doshi [1], Dynkin and Yushkevich [2], Feinberg [3], Guo [4, 5], Guo and Song [6], Guo and Hernández-Lerma [7], Hernández-Lerma and Lasserre [8, 9], Puterman [10], and the references therein; (ii) MDPs with varying (state-dependent or state-action-dependent) discount factors, see, for instance, Feinberg and Shwartz [11], González-Hernández et al. [12], Wu and Guo [13], Wu and Zhang [14], and the references therein; (iii) MDPs whose discount factor is a function of the history, see Hinderer [15], for example. This paper studies infinite-horizon discounted CTMDPs in the case of group (ii).

For the discounted criterion of MDPs, there are many works on the existence of solutions to the discounted optimality equation and of discounted optimal stationary policies; see, for instance, [1, 4, 6, 7, 16] for CTMDPs and [8–10, 13–15] for discrete-time Markov decision processes (DTMDPs). These references, however, deal with discounted MDPs with a constant discount factor or with discounted DTMDPs with varying discount factors. Recently, discounted CTMDPs with state-dependent discount factors were studied in [16], in which the authors established the discounted reward optimality equation (DROE) and obtained the existence of discounted optimal stationary policies. However, in [16], the discussion is restricted to the class of all randomized stationary policies (i.e., the policies are time-independent). Following these ideas, still within discounted continuous-time MDPs, models with Polish spaces are studied in this paper. We extend some results in [16] to the case of all randomized Markov policies and obtain the existence of discounted optimal stationary policies by a more concise proof.

Although the existence of optimal policies is proved, it is difficult to compute an optimal policy, even within the class of stationary policies, for nonfinite Polish (i.e., complete and separable metric) state and action spaces. Furthermore, in applications to networked control, the transmission of such control actions to an actuator is not realistic when there is an information transmission constraint (imposed by the presence of a communication channel) between the plant, the controller, and the actuator. Thus, from a practical point of view, it is important to study the approximation of optimal stationary policies. Several approaches have been developed in the literature to solve this problem for finite or countable state spaces; see [17–20]. Recently, for infinite Borel state and action spaces, [21, 22] established the asymptotic optimality of quantized stationary policies in stochastic control for DTMDPs. Inspired by these works, in this paper we are concerned with the asymptotic optimality of quantized stationary policies for CTMDPs with Polish spaces. To the best of our knowledge, the corresponding asymptotic optimality for CTMDPs with varying (state-dependent) discount factors has not been studied.

Therefore, this paper makes the following three main contributions:
(a) For CTMDPs with state-dependent discount factors, we extend some results in [16] to the case of all randomized Markov policies, simplify the proof of the existence of discounted optimal stationary policies under mild conditions, and give an algorithm for computing ε-optimal policies.
(b) We show that deterministic stationary quantizer policies can approximate the optimal deterministic stationary policies under mild technical conditions, so that one can search for approximately optimal policies within the class of quantized control policies.
(c) For the asymptotic optimality, we give the corresponding convergence rate results.

This paper is organized as follows. In Section 2, we introduce the model of CTMDPs with the expected discounted reward criterion and state the discounted optimality problem. In Section 3, under suitable conditions, we prove the main results on the existence of solutions to the discounted optimal equation (DOE) and the existence of optimal stationary policies. In Section 4, we give an iteration algorithm for ε-optimal policies. In Section 5, we establish conditions under which quantized control policies are asymptotically optimal and give the corresponding rates of convergence for the expected discounted rewards of the quantized stationary policies. Finally, we illustrate the asymptotic optimality with an example in Section 6.

2. The Markov Decision Processes and Discounted Optimal Problem

Consider the following model of continuous-time Markov decision processes:
\[
\{S,\ (A(x)\subset A,\ x\in S),\ q(\cdot\mid x,a),\ r(x,a),\ \alpha(x)\}, \qquad (1)
\]
where $S$ is the state space, the $A(x)$ are the sets of admissible actions, and $A$ is a compact action space. $S$ and $A$ are assumed to be Polish spaces (i.e., complete and separable metric spaces) with Borel $\sigma$-fields $\mathcal{B}(S)$ and $\mathcal{B}(A)$, respectively. The sets $A(x)$ and $K:=\{(x,a): x\in S,\ a\in A(x)\}$ are Borel subsets of $A$ and $S\times A$, respectively. $q(\cdot\mid x,a)$ denotes the function of transition rates, which satisfies the following properties:
(P1) $q(\cdot\mid x,a)$ is a signed measure on $\mathcal{B}(S)$ for each fixed $(x,a)\in K$, and $q(D\mid\cdot,\cdot)$ is a Borel-measurable function on $K$ for each fixed $D\in\mathcal{B}(S)$;
(P2) $q(D\mid x,a)\ge 0$ for all $(x,a)\in K$ and $x\notin D\in\mathcal{B}(S)$;
(P3) $q(\cdot\mid x,a)$ is conservative, that is, $q(S\mid x,a)=0$ for all $(x,a)\in K$, and then, $0\le -q(\{x\}\mid x,a)=q(S\setminus\{x\}\mid x,a)$;
(P4) the model in (1) is supposed to be stable, that is, for each $x\in S$, it holds that $q^{*}(x):=\sup_{a\in A(x)}\bigl(-q(\{x\}\mid x,a)\bigr)<\infty$.

The discount factor $\alpha(\cdot)$ is a nonnegative measurable function on $S$. Finally, the reward rate function $r(x,a)$ is assumed to be Borel-measurable on $K$. Note that $r(x,a)$ is allowed to be unbounded from both above and below, so it can be regarded as a cost rate rather than a reward rate.

The definitions of the randomized Markov policy , randomized stationary policy , and (deterministic) stationary policy are given by [8] [Definitions 2.2.3 and 2.3.2]. The sets of all randomized Markov policies, randomized stationary policies, and (deterministic) stationary policies are denoted by , , and , respectively. It is clear that , and for each , , and , we define the associated functions of transition rates and reward rates by

In general, we also write as and , respectively. Furthermore, for each , we define the functions of transition rates and reward rates by

In particular, we write them as and , respectively, when , that is, and . Also, for each , we denote

For any fixed policy , is also called an infinitesimal generator (see Doshi [1]). As is well known, any transition function depending on such that for all and is called a Q-process with transition rates , where is the Dirac measure at . By Guo [4], there exists a minimal Q-process with transition rates , but such a Q-process might not be regular, that is, there may exist for some and . To ensure the regularity of the Q-process, we propose the following “drift conditions.”
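In commonly used (and here purely illustrative) notation, with $p_{\pi}$ a transition function under the policy $\pi$ and $\delta_{x}$ the Dirac measure at $x$, this local characterization of a Q-process reads
\[
\lim_{t \downarrow s}\ \frac{p_{\pi}(s,x,t,D)-\delta_{x}(D)}{t-s}
   \;=\; q\bigl(D \mid x,\pi_{s}\bigr),
\qquad x\in S,\ D\in\mathcal{B}(S),\ t>s\ge 0;
\]
the symbols $p_{\pi}$ and $\pi_{s}$ are generic and are used only to indicate the form of the condition.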

Assumption 1. There exists a measurable function on , and constants and such that(a)for all , (b)For each ,
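A typical way to write such drift conditions, shown here only as an illustration with generic symbols $w\ge 1$, $\rho$, $b$, and $L$ (cf. Guo [4] and Lund et al. [23]), is the following:
\[
\text{(a)}\quad \int_{S} w(y)\, q(dy \mid x,a) \;\le\; \rho\, w(x) + b
\quad\text{for all } (x,a)\in K;
\qquad
\text{(b)}\quad q^{*}(x) \;\le\; L\, w(x)
\quad\text{for all } x\in S .
\]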

Remark 1.
(a) The function in Assumption 1 (a) is used to guarantee the finiteness of the optimal value function below, and by [4] [Remark 2.2(b)], it is an extension of the “drift condition” in Lund et al. [23] for a time-homogeneous Q-process. Moreover, Assumption 1 (b) is used to guarantee the regularity of the Q-process, and it is not required when the transition rates are bounded (i.e., when $\sup_{x\in S}q^{*}(x)<\infty$).
(b) Under Assumption 1, by Guo [4] [Theorem 3.2], the Q-process with the given transition rates is regular and unique. Since it is time-homogeneous, we take the initial time to be 0 and simplify the notation accordingly.
As is well known (e.g., see Doshi [1] and Guo [5]), for each policy and initial state, there exists a unique probability space, whose probability measure is completely determined by the policy and the initial state (see Guo [6] [Section 2.3]), together with a state-and-action process and its transition probability function (see Guo [5] [Lemma 2.1]); the corresponding expectation operator is denoted accordingly, and the expected reward is then well defined for each policy, initial state, and time.
Now, we state the discounted optimality problem. For each policy and initial state, the expected discounted reward criterion and the corresponding optimal value function are defined in the usual way, and a policy is called optimal if it attains the optimal value function for every initial state. Our main aim in Section 3 is to give conditions for the existence of optimal deterministic stationary policies.
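In commonly used (and here purely illustrative) notation, with $x(t)$ denoting the state process, $\Pi$ the set of all randomized Markov policies, and $\mathbb{E}^{\pi}_{x}$ the expectation under the policy $\pi$ and the initial state $x$, the discounted criterion with a state-dependent discount factor typically takes the form (cf. [14, 16])
\[
V(x,\pi) \;:=\; \mathbb{E}^{\pi}_{x}\!\left[
   \int_{0}^{\infty}
   e^{-\int_{0}^{t}\alpha(x(s))\,ds}\;
   r\bigl(x(t),\pi_{t}\bigr)\,dt
\right],
\qquad
V^{*}(x) \;:=\; \sup_{\pi\in\Pi} V(x,\pi),\quad x\in S,
\]
and a policy $\pi^{*}$ is then optimal when $V(x,\pi^{*})=V^{*}(x)$ for every $x\in S$. All symbols in this display are generic stand-ins for the quantities described in the text above.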

3. The Existence of Optimal Stationary Policies

In this section, we establish the existence and uniqueness of solutions to the discounted optimal equation (DOE) and prove the existence of optimal policies for the CTMDPs defined in (1).

Note that, for any given measurable function $w\ge 1$ on $S$, a function $u$ on $S$ is called $w$-bounded if the weighted norm $\|u\|_{w}:=\sup_{x\in S}|u(x)|/w(x)$ is finite. Such a function $w$ is called a weight function. It is clear that the space $B_{w}(S)$ of all $w$-bounded real-valued measurable functions on $S$ is a Banach space. To guarantee the finiteness of the optimal value function, we need the following assumptions.

Assumption 2. Let and be as in Assumption 1. For each , suppose that the following conditions hold:
(a) $A(x)$ is a compact set;
(b) the function is continuous on , and for each , there exists a constant such that ;
(c) the discount factor $\alpha(\cdot)$ is continuous on $S$, and there is a constant such that ;
(d) for any bounded measurable function on , the functions and are continuous on ;
(e) there exists a nonnegative measurable function on , and constants and such that and for all and .
For each , let be any positive measurable function on such that , and
where is the Dirac measure (i.e., it is equal to 1 if and 0 otherwise). It is clear that is a probability measure on for each . For any , define an operator on as
and define a recursive sequence as
Now, we give the discounted optimal equation (DOE).
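Before doing so, we note that the construction above is of the familiar uniformization type; written with generic, purely illustrative symbols $m(\cdot)$ (the dominating function), $p$ (the induced kernel), $T$ (the operator), and $u_{n}$ (the iterates), it can be sketched as
\[
p(D \mid x,a) \;:=\; \frac{q(D \mid x,a)}{m(x)} + \delta_{x}(D),
\qquad m(x)\ge q^{*}(x),\ \ D\in\mathcal{B}(S),\ (x,a)\in K,
\]
\[
Tu(x) \;:=\; \sup_{a\in A(x)}
  \frac{r(x,a) + m(x)\int_{S} u(y)\, p(dy \mid x,a)}
       {\alpha(x) + m(x)},
\qquad
u_{n+1} \;:=\; T u_{n},\ \ n\ge 0 ,
\]
so that $p(\cdot\mid x,a)$ is indeed a probability measure. This display is only a sketch of the standard transformation and is not asserted to coincide symbol-for-symbol with the definitions used in the proofs below.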

Theorem 1. Under Assumptions 1 and 2 (b)-(c), the following assertions hold.
(a) for all and , and ;
(b) let ; then, we have , and it is the solution of the following discounted optimal equation (DOE):
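The DOE referred to in part (b) is of the type established in [16] for state-dependent discount factors; in the illustrative notation used above, it reads
\[
\alpha(x)\,u^{*}(x) \;=\; \sup_{a\in A(x)}
  \left\{\, r(x,a) + \int_{S} u^{*}(y)\, q(dy \mid x,a) \,\right\},
\qquad x\in S ,
\]
where $u^{*}$ is a generic symbol for the limit function in part (b).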

Proof.
(a) By the assumptions, we have
where the last inequality holds by Guo [4] [Theorem 3.2(b)]. Then, for each , , and part (a) holds.
(b) First, by a calculation similar to that in Ye and Guo [16] [Equation (15)], the sequence is monotone and nondecreasing. Furthermore, it is clear that the operator is monotone and nondecreasing. Hence, the iterates are monotone and nondecreasing, which yields that for all .
Next, we show that . Note that ≥1 by Assumption 1, which yields that
Then, by an induction argument, for all , we have
Thus, , that is, .
Last, we show . By the monotonicity of and , we have for all , and so .
On the other hand, by the definition of the operator , we have
Then, letting , by Hernández-Lerma and Lasserre [9] [Lemma 8.3.7], we obtain
which implies that . Thus, we have , that is, is a solution of the DOE in (14).

Remark 2. Theorem 1 generalizes not only the control model with a constant discount factor in Guo [4] [Theorem 3.3(a)-(b)] but also the model in Ye and Guo [16], in which the policies are restricted to the family of all randomized stationary policies.
The following Lemma 1 is a direct consequence of [16] [Theorem 3.2].

Lemma 1. Under Assumptions 1 and 2, for each and , the expected discounted reward criterion is the unique solution of the following equation:
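The equation in Lemma 1 is the policy evaluation analogue of the DOE, with the supremum over actions replaced by the action prescribed by the fixed policy; for a generic deterministic stationary policy, denoted here (purely for illustration) by $f$, it takes the form
\[
\alpha(x)\,V(x,f) \;=\; r\bigl(x,f(x)\bigr)
   + \int_{S} V(y,f)\, q\bigl(dy \mid x,f(x)\bigr),
\qquad x\in S ,
\]
with the obvious averaging over actions in the randomized stationary case.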

Lemma 2. Under Assumptions 1 and 2, for each , , and , the following assertions hold.
(a) If
then we have .
(b) If
then we have .

Proof. By (21), there exists a nonnegative measurable function on such that
Now, let , and consider the new Markov decision process:
in which only the reward rate function differs from that of the model in (1). Moreover, for each , , the expected discounted reward criterion is given by
By Lemma 1, we have , which gives part (a).
Similarly, we can prove (b).

Remark 3. Lemma 2 is the generalization of Ye and Guo [16] [Lemma 6.3].

Theorem 2. Under Assumptions 1 and 2, for each , the optimal value function is the solution of the DOE in (12), and there exists a (deterministic) stationary policy such that

Proof. By Theorem 1(b), for each and , we have
which together with Lemma 2(a) yields that , and then, . Note that is upper semicontinuous on ; then, by [9] [Lemma 8.3.8], we can obtain that there exists a policy such that
for all . Thus, by Lemma 1, we have .

Remark 4.
(a) Theorem 2 shows that the optimal value function is a solution to the DOE and ensures the existence of an optimal (deterministic) stationary policy.
(b) Owing to the construction of the new Markov decision process, the proof of Theorem 2 is more concise than that in [16] [Theorem 3.3].

4. An Iteration Algorithm for ε-Optimal Policies

In this section, we provide an iteration algorithm for ε-optimal policies.

Step 1. (Initialization). Choose any , let in Assumption 1, and for each , let
Step 2. (Iteration). For each , let
where .
Step 3. (Approximation value). If
where , go to Step 4; otherwise, increment the iteration index by 1 and return to Step 2.
Step 4. (ε-optimal policy). For each , choose
then the chosen policy is an ε-optimal policy.

In fact, for the operator on in Section 3, with , it holds that

Then, by Algorithm 1, we have
which yields that

By a similar argument, we have
and then,
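As a purely numerical companion to Steps 1–4 (and not part of the theory above), the following sketch runs a value iteration of the uniformization type on a finite discretization of the state and action spaces. The arrays `q`, `r`, `alpha`, the tolerance `eps`, and the simple sup-norm stopping test are hypothetical stand-ins, and the correspondence with Steps 1–4 indicated in the comments is only loose.

```python
import numpy as np

def iterate_eps_optimal(q, r, alpha, eps, max_iter=10_000):
    """Value iteration for a finite approximation of the CTMDP (illustrative only).

    q     : (S, A, S) array of transition rates with q[x, a, :].sum() == 0 (conservative)
    r     : (S, A) array of reward rates
    alpha : (S,) array of state-dependent discount factors (all positive)
    eps   : tolerance of the (simplified) stopping test
    """
    S, A, _ = q.shape
    diag = q[np.arange(S), :, np.arange(S)]          # diag[x, a] = q({x} | x, a)
    m = (-diag).max(axis=1) + 1.0                    # dominating function m(x) > q*(x)
    # Uniformized kernel p(y | x, a) = q(y | x, a) / m(x) + delta_x(y)
    p = q / m[:, None, None]
    p[np.arange(S), :, np.arange(S)] += 1.0
    u = np.zeros(S)                                  # illustrative initial guess (cf. Step 1)
    for _ in range(max_iter):
        # T u(x) = max_a [ r(x,a) + m(x) * sum_y p(y|x,a) u(y) ] / (alpha(x) + m(x))
        values = (r + m[:, None] * (p @ u)) / (alpha + m)[:, None]
        u_next = values.max(axis=1)                  # iterate (cf. Step 2)
        if np.max(np.abs(u_next - u)) < eps:         # simplified stand-in for Step 3
            u = u_next
            break
        u = u_next
    policy = values.argmax(axis=1)                   # greedy action choice (cf. Step 4)
    return u, policy
```

For small `eps`, `u` approximates the optimal value on the grid and `policy` records an approximately optimal action index for each discretized state.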

5. Asymptotic Optimality of Quantized Stationary Policies

5.1. Approximation of Deterministic Stationary Policies

In Section 3, we established the existence of optimal deterministic stationary policies for the CTMDPs in (1) under suitable conditions. In practice, however, the action space sometimes cannot satisfy the continuity conditions assumed in the theoretical analysis. Thus, in this section, we discretize the action space and construct a sequence of policies, namely “quantizer policies,” which approximate the optimal deterministic stationary policies of the CTMDPs in (1).

To this end, we first give the definitions of quantizers and deterministic stationary quantizer policies.

Definition 1. A measurable function is called a quantizer from to if its range is finite. Let denote the set of all quantizers from to .

Definition 2. A policy is called a deterministic stationary quantizer policy, if there exists a constant sequence of stochastic kernels on given such that for all for some , where is Dirac measure as in (11).
For any finite set , let denote the set of all quantizers having range , and let denote the set of all deterministic stationary quantizer policies induced by .
Denote the metric on by ; then, the action space is totally bounded by its compactness. For any fixed integer , there exists a finite point set such that, for all ,
where is called the net in . From this, for any deterministic stationary policy , we can construct a sequence of quantizer policies to approximate it by the following method.

Lemma 3 (The construction of quantizer policies). Let be the net in ; for each and each deterministic stationary policy , define
Then, is a sequence of deterministic stationary quantizer policies, and it converges uniformly to the given policy as .

Proof. Lemma 3 follows directly from [21] [Section 3].
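To make the construction in Lemma 3 concrete, here is a minimal sketch of the nearest-neighbor quantization of a stationary policy; the function `f`, the array `net_points`, and the Euclidean metric are hypothetical stand-ins for the objects in the lemma.

```python
import numpy as np

def quantize_policy(f, net_points):
    """Return the quantized approximation of a stationary policy f (illustrative only).

    f          : callable mapping a state x to an action (a point of a compact subset of R^d)
    net_points : (k, d) array, a finite net of the compact action space
    The quantized policy maps x to the net point nearest to f(x).
    """
    net_points = np.asarray(net_points, dtype=float)

    def f_quantized(x):
        a = np.atleast_1d(np.asarray(f(x), dtype=float))
        # nearest-neighbor projection of f(x) onto the finite net
        idx = np.argmin(np.linalg.norm(net_points - a, axis=1))
        return net_points[idx]

    return f_quantized

# Example with a hypothetical one-dimensional action space [0, 1]:
net = np.linspace(0.0, 1.0, 11).reshape(-1, 1)          # a finite net of [0, 1]
f = lambda x: np.array([min(max(0.3 * x, 0.0), 1.0)])   # a toy stationary policy
f_q = quantize_policy(f, net)
```

As the net becomes finer, the action chosen at each state changes by at most the mesh of the net, which is the uniform convergence asserted in Lemma 3.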
We also call these policies the quantized approximations of the given deterministic stationary policy. Next, we show that their expected discounted rewards also approximate the original one. For this purpose, we need the following condition.

Assumption 3. Let be as in Assumption 1. For each , suppose that and are setwise continuous in for each and , that is, if , then setwise.

Lemma 4. Suppose that Assumptions 2 and 3 hold. Let be a deterministic stationary policy of the control model in (1) and let be its quantized approximations as in Lemma 3. Then, for each , the strategic measures induced by the quantized approximations converge in the weak topology to the strategic measure induced by the original policy. Therefore, converges to .

Proof. The proof is similar to that of [21] [Proposition 3.1]; by Assumption 3 and the definition of the strategic measures as in [6] [Section 2.3] or [5] [Section II], we can conclude that Lemma 4 holds.
Now, we give the approximation result on the expected discounted rewards of the deterministic stationary quantizer policies.

Theorem 3. Suppose that Assumptions 1–3 hold. Let be a deterministic stationary policy of the control model in (1), and let be its quantized approximations as in Lemma 3. Then, for each , we have

Proof. By the definition of the expected discounted reward criterion, we can get
Note that, by Lemmas 3 and 4, we have
which yields that
On the other hand, we have
where the last inequality holds by [4] [Theorem 3.2(b)]. Then, we have
as . By (40), we can get

5.2. Rates of Convergence

Definition 3. Let denote the total variation distance between measures and on the probability space , which satisfies
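Under one common normalization, this distance can be expressed, for probability measures $\mu$ and $\nu$ on $(S,\mathcal{B}(S))$, as
\[
\|\mu-\nu\|_{TV}
  \;=\; 2\,\sup_{D\in\mathcal{B}(S)}\bigl|\mu(D)-\nu(D)\bigr|
  \;=\; \sup_{\|g\|_{\infty}\le 1}
        \left|\int_{S} g\,d\mu-\int_{S} g\,d\nu\right| ,
\]
where the factor 2 is a matter of convention and some authors omit it.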

Assumption 4. For each , suppose that the model in (1) satisfies the following conditions:
(a) is a compact subset of $\mathbb{R}^{d}$ for some $d$;
(b) for all , there exists a constant such that
(c) for all and the function of transition rates , there exists a constant such that

By Lemma 3 and Assumption 4, the following lemma holds.

Lemma 5. For any measurable function , we can construct a sequence of quantizers from to , and there exists some constant such that

Now, we give the convergence rate result.

Theorem 4. Suppose that Assumptions 1–4 hold. Let be a deterministic stationary policy of the control model in (1) and let be its quantized approximations as in Lemma 3. Then, for each , it holds that

Proof. By Lemma 1, we can get
Then, by Lemma 5 and Theorem 1, we have
which yields that

6. An Example

In this section, we give an example to illustrate our main results.

Consider a control problem of hypertension. As is well known, blood pressure can be described by a Gaussian distribution, and thus the blood pressure level may take values in . When the current blood pressure level is , a controlled amount is given by for each with . The rate of change of blood pressure is given as follows:
for and , where is the circumference ratio (the constant pi) and is the Dirac measure at . It is clear that is a transition rate function. We denote by the cost of taking the control when the current blood pressure level is , and we regard it as an action. The discount factor is defined by for with a constant . Suppose that the constants satisfy:
(i) and , where
(ii)

Let , and , and then, by Steps 1–4 of the iteration algorithm in Section 4, the approximate optimal value is
and the optimal stationary policy is
where , .

Now, we can construct a sequence of quantizer policies of as follows:where .

Now, we compute , , and by assigning values to the parameters , , , , and as follows:
Then, the optimal value is , and the optimal stationary policy is

The quantizer policies of are

Furthermore, the asymptotic approximation of the optimal policy is given by Figures 1 and 2 when and , respectively. This verifies for each , and by Theorem 3, it holds that .

7. Conclusions

In this paper, we are concerned with the asymptotic optimality of quantized stationary policies for CTMDPs with Polish spaces and varying (state-dependent) discount factors. First, we establish the discounted optimal equation (DOE) and prove the existence of its solutions. Then, by a relatively simple proof, we obtain the existence of optimal deterministic stationary policies under suitable conditions in Theorem 2. Meanwhile, we generalize the relevant conclusions of Ye and Guo [16] in Theorem 1 and Lemma 2. Next, we discretize the action space, construct a sequence of policies, namely “quantizer policies,” and obtain the approximation results and the rates of convergence for the optimal policies of the CTMDPs in (1), as in Theorems 3 and 4. Finally, we give an example to illustrate the asymptotic optimality.

Data Availability

No data were used to support the findings of this study.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 11961005) and the Opening Project of Guangdong Province Key Laboratory of Computational Science at Sun Yat-Sen University (Grant No. 2021021).