Abstract

This paper deals with Markov decision processes (MDPs) on Euclidean spaces with an infinite horizon. One approach to studying this kind of MDP is the dynamic programming (DP) technique, in which the optimal value function is characterized through the value iteration functions. The paper provides conditions that guarantee the convergence of the maximizers of the value iteration functions to the optimal policy. Then, using the Euler equation and an envelope formula, the optimal solution of the optimal control problem is obtained. Finally, this theory is applied to a linear-quadratic control problem in order to find its optimal policy.

1. Introduction

This paper deals with the discrete-time optimal control problem with an infinite horizon. The problem is presented with the help of the theory of Markov decision processes (MDPs). To describe an MDP, it is necessary to provide a Markov control model, whose components describe the dynamics of the system. In this way, at each time $t = 0, 1, \ldots$, the state of the system is affected by an admissible action. A sequence of such actions is called a policy. The optimal control problem consists of determining an optimal policy, which is characterized through a performance criterion. In this paper, the infinite-horizon expected total discounted reward is considered.

An approach to solving the optimal control problem is the dynamic programming (DP) technique (see [1–4]). DP characterizes the optimal solution of the optimal control problem by means of a functional equation, known as the dynamic programming equation (see [1–4]). In the literature there exist conditions that guarantee the convergence of the value iteration (VI) procedure, which is used to approximate the optimal value function of the optimal control problem. However, this technique runs into difficulties when the reward and/or the dynamics have a complicated functional form (see [5, page 93]).

An alternative for solving this problem is the Euler equation (EE), which is well known in the applications of MDPs to economic models. This equation is established and solved in that context (in some cases empirically) (see [6–13]).

An iterative method for deterministic MDPs is presented in [14]. In that case, the EE is obtained in terms of the VI functions. Taking up this idea, this article presents an iterative method for finding the solution of the EE in terms of the VI functions in stochastic MDPs.

In this paper, the Euler equation is obtained using an envelope formula (see [15–17]) under interiority conditions on the VI functions. The envelope formula characterizes the derivative of the performance criterion with respect to the initial state of the system. This derivative is important in analyzing the behavior of the Markov control process. Also, in [18], a general study of performance sensitivities in the policy space is presented. In that context, two performance sensitivity formulas are studied: one for performance derivatives at any policy in the policy space and the other for performance differences between any two policies in the policy space.

The technique proposed in this paper proceeds as follows. First, the EE is applied to obtain the VI functions. Second, applying the envelope formula, the maximizers of the VI functions are obtained. Then, using the convergence of the maximizers to the optimal policy, the optimal control problem is solved. The procedure is illustrated with a linear-quadratic problem.

The paper is organized as follows. In Section 2, the MDP theory needed in the subsequent sections is presented. In Section 3, conditions on the Markov control model are given that ensure the differentiability of both the VI functions and the optimal value function; these conditions guarantee the validity of a version of the EE for the VI functions. Finally, in Section 4, a linear-quadratic problem is presented to illustrate the theory.

2. Markov Decision Process

A discrete-time Markov control model is a quintuple $(X, A, \{A(x) : x \in X\}, Q, r)$, where $X$ is the state space, $A$ is the action space, $A(x)$ is the set of feasible actions in the state $x \in X$, $Q$ is the transition law, and $r$ is the one-step reward function (see [3]). $X$ and $A$ are (nonempty) Borel spaces with the Borel $\sigma$-algebras $\mathcal{B}(X)$ and $\mathcal{B}(A)$, respectively. $Q$ is a stochastic kernel on $X$ given $\mathbb{K}$, where $\mathbb{K} := \{(x, a) : x \in X,\ a \in A(x)\}$, and $r : \mathbb{K} \to \mathbb{R}$ is a measurable function.

Consider a Markov control model and, for each $t = 0, 1, \ldots$, define the space of admissible histories up to time $t$ as $\mathbb{H}_0 := X$ and $\mathbb{H}_t := \mathbb{K}^t \times X$, for $t \geq 1$.

A policy $\pi = \{\pi_t\}$ is a sequence of stochastic kernels $\pi_t$ on the action space $A$ given $\mathbb{H}_t$. The set of policies will be denoted by $\Pi$.

Let $\mathbb{F}$ be the set of decision functions or measurable selectors, that is, the set of all measurable functions $f : X \to A$ such that $f(x) \in A(x)$ for all $x \in X$.

A sequence $\{f_t\} \subset \mathbb{F}$ of decision functions is called a Markov policy. A stationary policy is a Markov policy such that $f_t = f$ for all $t = 0, 1, \ldots$, with $f \in \mathbb{F}$, and it will be denoted simply by $f$ (see [3]).

Given the initial state $x_0 = x \in X$ and any policy $\pi \in \Pi$, there is a probability measure $P_x^{\pi}$ on the space $(\Omega, \mathcal{F})$, with $\Omega := (X \times A)^{\infty}$ and $\mathcal{F}$ the corresponding product $\sigma$-algebra (see [3]). The corresponding expectation operator will be denoted by $E_x^{\pi}$. The stochastic process $\{x_t\}$ is called a discrete-time Markov decision process.

The total expected discounted reward is defined as
$$v(\pi, x) := E_x^{\pi}\left[\sum_{t=0}^{\infty} \alpha^{t}\, r(x_t, a_t)\right],$$
for $\pi \in \Pi$ and $x \in X$, where $\alpha \in (0, 1)$ is called the discount factor.
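
For illustration, the discounted criterion can be approximated by truncating the horizon and averaging simulated trajectories. The following sketch is illustrative only: the scalar linear dynamics, quadratic reward, Gaussian noise, and stationary policy used here are assumptions introduced for the example, not part of the model above.

```python
import numpy as np

def discounted_reward(policy, reward, step, x0, alpha=0.95, horizon=200, runs=2000, seed=0):
    """Monte Carlo estimate of v(pi, x0) = E[ sum_t alpha^t r(x_t, a_t) ],
    truncated at `horizon`; for bounded rewards the tail is of order alpha**horizon."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        x, acc, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(x)
            acc += disc * reward(x, a)
            x = step(x, a, rng)
            disc *= alpha
        total += acc
    return total / runs

# Illustrative (assumed) scalar example.
reward = lambda x, a: -(x**2 + a**2)                          # one-step reward r(x, a)
step = lambda x, a, rng: 0.9 * x + a + rng.normal(scale=0.1)  # system dynamics
policy = lambda x: -0.5 * x                                   # a stationary policy f(x)
print(discounted_reward(policy, reward, step, x0=1.0))
```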

Definition 2.1. A policy $\pi^{*} \in \Pi$ is optimal if, for each $x \in X$, $v(\pi^{*}, x) = \sup_{\pi \in \Pi} v(\pi, x)$. The function $V$ defined by $V(x) := \sup_{\pi \in \Pi} v(\pi, x)$, $x \in X$, will be called the optimal value function.

The optimal control problem consists of determining an optimal policy.

2.1. Dynamic Programming

Definition 2.2. A measurable function $v : X \to \mathbb{R}$ is said to be a solution to the optimality equation (OE) if it satisfies
$$v(x) = \sup_{a \in A(x)}\Big\{r(x, a) + \alpha \int_X v(y)\, Q(dy \mid x, a)\Big\}, \quad x \in X.$$

Assumption 2.3. (a) The one-step reward function $r$ is nonpositive, upper semicontinuous (u.s.c.), and sup-compact on $\mathbb{K}$. ($r$ is a sup-compact function if the set $\{a \in A(x) : r(x, a) \geq c\}$ is compact for every $x \in X$ and $c \in \mathbb{R}$.)
(b) The transition law $Q$ is strongly continuous.
(c) There exists a policy $\pi \in \Pi$ such that $v(\pi, x) > -\infty$ for each $x \in X$.

Definition 2.4. The value iteration (VI) functions are defined as follows:
$$v_n(x) := \sup_{a \in A(x)}\Big\{r(x, a) + \alpha \int_X v_{n-1}(y)\, Q(dy \mid x, a)\Big\},$$
for all $x \in X$ and $n = 1, 2, \ldots$, with $v_0 \equiv 0$.
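
The recursion in Definition 2.4 can be implemented directly on a finite approximation of the model. The sketch below is illustrative only: the state and action grids, the sampled noise, and the particular reward and dynamics are assumptions introduced here; it also records the maximizer of each VI function, anticipating Remark 2.6.

```python
import numpy as np

def value_iteration(states, actions, reward, step, noises, alpha=0.95, n_iter=30):
    """Compute the VI functions v_n on a state grid, together with their maximizers.
    v_0 = 0 and v_n(x) = max_a { r(x, a) + alpha * E[ v_{n-1}(F(x, a, xi)) ] }."""
    v = np.zeros(len(states))                   # v_0 = 0
    maximizers = np.zeros(len(states))
    for _ in range(n_iter):
        q = np.empty((len(states), len(actions)))
        for i, x in enumerate(states):
            for j, a in enumerate(actions):
                nxt = np.array([step(x, a, xi) for xi in noises])
                # nearest-grid-point interpolation of v_{n-1}
                idx = np.abs(states[None, :] - nxt[:, None]).argmin(axis=1)
                q[i, j] = reward(x, a) + alpha * v[idx].mean()
        v = q.max(axis=1)                       # v_n on the grid
        maximizers = actions[q.argmax(axis=1)]  # maximizer f_n on the grid
    return v, maximizers

# Illustrative (assumed) scalar instance.
rng = np.random.default_rng(0)
states = np.linspace(-3.0, 3.0, 41)
actions = np.linspace(-2.0, 2.0, 21)
noises = rng.normal(scale=0.1, size=100)
v, f = value_iteration(states, actions,
                       reward=lambda x, a: -(x**2 + a**2),
                       step=lambda x, a, xi: 0.9 * x + a + xi,
                       noises=noises)
```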

The following theorem is well known in the MDP literature (see [1–4]). The proof can be consulted in [3, page 46].

Theorem 2.5. Suppose that Assumption 2.3 holds. Then
(a) The optimal value function $V$ is a solution of the OE (see Definition 2.2).
(b) There exists $f^{*} \in \mathbb{F}$ such that $V(x) = r(x, f^{*}(x)) + \alpha \int_X V(y)\, Q(dy \mid x, f^{*}(x))$ for all $x \in X$, and the stationary policy $f^{*}$ is optimal.
(c) For every $x \in X$, $v_n(x) \to V(x)$ when $n \to \infty$.

Remark 2.6. Under Assumption 2.3, it is possible to show that, for each $n = 1, 2, \ldots$, there exists a stationary policy $f_n \in \mathbb{F}$ attaining the supremum in Definition 2.4, that is, $v_n(x) = r(x, f_n(x)) + \alpha \int_X v_{n-1}(y)\, Q(dy \mid x, f_n(x))$ for all $x \in X$ (see [3, pages 27-28]).

3. Differentiability in MDPs

3.1. Notation and Preliminaries

Let and be Euclidean spaces and consider the following notation: denotes the set of functions with a continuous second derivative (when , will be denoted by and in some cases it will be written only as ). Let be a measurable function such that . , and denote the partial derivative of for and , respectively. The notations for the second partial derivatives of are , , and .

For any set $D$, a point is called an interior point of $D$ if there exists an open set containing the point and contained in $D$. The interior of $D$ is the set of all interior points of $D$, denoted by $\operatorname{int} D$.

The set-valued mapping $x \mapsto A(x)$ from $X$ to $A$ is said to be
(a) nondecreasing, if $x_1, x_2 \in X$ with $x_1 \leq x_2$, then $A(x_1) \subseteq A(x_2)$;
(b) convex, if $a_1 \in A(x_1)$ and $a_2 \in A(x_2)$, then $\lambda a_1 + (1 - \lambda) a_2 \in A(\lambda x_1 + (1 - \lambda) x_2)$, with $\lambda \in [0, 1]$ and $x_1, x_2 \in X$.

Let be a measurable function. Define by .

The proof of the following lemma is similar to the proof of Theorem 1 in [16].

Lemma 3.1. Suppose that
(a) , and furthermore is negative definite, for every ;
(b) for each , argmax .
Then there exists a function such that , for every . Moreover, and .

Remark 3.2. Observe that (a) implies that is a strictly concave function, for each . Then the maximizer is unique.
The proof of the following lemma can be consulted in [19, Theorem 25.7, page 248].

Lemma 3.3. Let an open and convex set be given, and on it let a concave and differentiable function be the pointwise limit of a sequence of differentiable, concave, real-valued functions. Then the derivatives of the functions in the sequence converge pointwise to the derivative of the limit function.
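
A quick numerical check of this derivative-convergence property, with concave functions chosen here purely for illustration:

```python
import numpy as np

# Illustrative concave, differentiable functions f_n(x) = -(1 + 1/n) x**2 + ln(n)/n,
# which converge pointwise to f(x) = -x**2 on the real line.
xs = np.linspace(-2.0, 2.0, 9)
f_prime = -2.0 * xs                               # derivative of the limit
for n in (1, 10, 100, 1000):
    fn_prime = -(1.0 + 1.0 / n) * 2.0 * xs        # derivative of f_n
    print(n, np.max(np.abs(fn_prime - f_prime)))  # -> 0, as Lemma 3.3 asserts
```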

3.2. An Envelope Formula in MDPs

Let a Markov control model be fixed. Throughout this section it is assumed that Assumption 2.3 holds. Also, it is supposed that $X$ and $A$ are convex sets with nonempty interiors and that $X$ is partially ordered. It is assumed that the set-valued mapping $x \mapsto A(x)$ is nondecreasing and convex and that $A(x)$ has a nonempty interior for each $x \in X$. Also, it is assumed that the transition law is determined by a difference equation $x_{t+1} = F(x_t, a_t, \xi_t)$, $t = 0, 1, \ldots$, with a given initial state $x_0$ fixed, where $\{\xi_t\}$ is a sequence of independent and identically distributed (i.i.d.) random variables, independent of $x_0$ and taking values in a Borel space $S$. Let $\xi$ be a generic element of the sequence $\{\xi_t\}$. The density of $\xi$ is designated by $\Delta$; $\Delta$ is a measurable function with $\int_S \Delta(s)\, ds = 1$, and $F : \mathbb{K} \times S \to X$ is a measurable function too.

Since Assumption 2.3 holds, Theorem 2.5 applies. Therefore, the optimal value function $V$ (see Definition 2.1) satisfies the OE, and the VI functions (see Definition 2.4) satisfy the corresponding recursion for each $n = 1, 2, \ldots$, with $v_0 \equiv 0$. In addition, by Theorem 2.5, there exists an optimal policy, which will be denoted by $f^{*}$. Furthermore, for each $n$, there exists a maximizer $f_n$ of the $n$th VI function (see Remark 2.6).

Let be a function defined as , where

Define by for each , with and .

Assumption 3.4. (a) is a strictly concave function and is an increasing function on for each fixed;
(b) is a concave and increasing function, for each ; is a concave function, is an increasing function on , for each .

Lemma 3.5. Under Assumption 3.4, it follows that is a strictly concave function and is unique, for all . Also, is a strictly concave function and is unique.

Proof. By Assumption 3.4(a), it suffices to prove Condition C1 (see [20, Lemma 6.2]), which guarantees the result. Let be defined by
Then for each , the function is concave in by Assumption 3.4(b). Indeed, since is a concave function, then and . Furthermore, it is known that is a concave and increasing function, for each , then
By similar arguments, it can be shown that if , then , for each and . Then the result follows.

Assumption 3.6. (a) and is negative definite for each ;
(b) and is invertible, for each ;
(c) , for each . Besides, has an inverse in the second variable , such that , and , for all , where, in this case, denotes the derivative of with respect to the second variable, and the determinant of is denoted as ;
(d) and the interchange between derivatives and integrals is valid (see Remark 3.8).

Lemma 3.7. Under Assumption 3.6, it follows that , with defined in (3.7).

Proof. The proof is similar to the proof of Lemma 5 in [16]. Assumption 3.6 makes it possible to express the stochastic kernel (see (3.3)) in the following form: for each measurable subset of and , Then, by the change of variable theorem, it follows that It follows from (3.13) that can be expressed as
Now, using Assumption 3.6, the result follows.
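
The change-of-variables step can be checked numerically in the additive-noise case, in which the transition kernel has a density given by the noise density evaluated at the innovation. The dynamics, noise law, and state-action pair below are assumptions made only for this illustration.

```python
import numpy as np
from scipy import stats

# Assumed additive-noise dynamics: F(x, a, xi) = G(x, a) + xi, with xi ~ N(0, sigma^2),
# so that the transition kernel has density  q(y | x, a) = Delta(y - G(x, a)).
G = lambda x, a: 0.9 * x + a
sigma = 0.3
x, a = 1.0, -0.2

rng = np.random.default_rng(1)
samples = G(x, a) + rng.normal(scale=sigma, size=200_000)    # draws of the next state
hist, edges = np.histogram(samples, bins=200, density=True)  # empirical density
centers = 0.5 * (edges[:-1] + edges[1:])

for y in np.linspace(G(x, a) - sigma, G(x, a) + sigma, 5):
    empirical = hist[np.abs(centers - y).argmin()]
    exact = stats.norm.pdf(y - G(x, a), scale=sigma)          # Delta(y - G(x, a))
    print(f"y = {y:+.2f}   empirical = {empirical:.3f}   density = {exact:.3f}")
```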

Remark 3.8. In Lemma 3.7, Assumption 3.6(d) was used to guarantee the second-order differentiability of the integral with respect to the state and the action variables. This condition can be verified in practice when the first- and second-order derivatives of the integrand can be bounded, uniformly in the state and action variables, by integrable functions (see Remark 10 in [16]).

Assumption 3.9. (a) The optimal policy $f^{*}$ satisfies $f^{*}(x) \in \operatorname{int} A(x)$ for each $x \in X$;
(b) the sequence $\{f_n\}$ of maximizers of the VI functions satisfies $f_n(x) \in \operatorname{int} A(x)$ for each $x \in X$ and $n = 1, 2, \ldots$.
Define by .

Remark 3.10. Assumption 3.9 evidently holds if $A(x)$ is open for every $x \in X$, since then $f^{*}(x)$ and $f_n(x)$ belong to the interior of $A(x)$ for every $x \in X$ and $n = 1, 2, \ldots$. Also, in some particular cases (see [8, 16]), the interiority of the maximizers is guaranteed by the mean value theorem.

Theorem 3.11. Under Assumptions 3.4, 3.6, and 3.9(a), it follows that , and for each , where is defined in (3.17).

Proof. Let $x \in X$ be fixed. Note that Assumptions 3.4 and 3.6 imply that where is defined in (3.6). Indeed, since Assumptions 3.4(a) and 3.6(a) hold, it is known that and is negative definite. Moreover, Lemma 3.5 implies that is a concave function, and by Lemma 3.7, it follows that , so that is negative semidefinite (see [21, page 260]). Furthermore, by Assumption 3.9(a) and Lemma 3.1, it follows that and .
On the other hand, it is obtained that for each . Then the first-order condition and the invertibility of (see Assumption 3.6(b)) imply that , that is, Moreover, since satisfies (2.4) and is the optimal policy, then
Using the fact that , it is possible to obtain the following envelope formula: Equivalently, Finally, substituting (3.20) into (3.23), it follows that
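
The mechanism behind this envelope formula is that, at an interior maximizer, the indirect effect through the maximizer vanishes by the first-order condition, so the derivative of the maximal value with respect to the state equals the partial derivative of the objective. This can be checked numerically in a one-period instance; the objective below is an assumption used only for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed one-period objective h(x, a); its maximal value is g(x) = max_a h(x, a).
h = lambda x, a: -(x - a) ** 2 - 0.5 * a ** 2 - 0.1 * x ** 2

def maximize(x):
    res = minimize_scalar(lambda a: -h(x, a), bounds=(-10.0, 10.0), method="bounded")
    return -res.fun, res.x            # (g(x), argmax_a h(x, a))

x, eps = 1.3, 1e-5
g_plus, _ = maximize(x + eps)
g_minus, _ = maximize(x - eps)
total = (g_plus - g_minus) / (2 * eps)                           # g'(x), central differences
_, a_star = maximize(x)
partial = (h(x + eps, a_star) - h(x - eps, a_star)) / (2 * eps)  # h_x(x, a*(x))
print(total, partial)   # the two values coincide: the envelope property
```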

Theorem 3.12. Under Assumptions 3.4, 3.6, and 3.9(b), it follows that , , for each . Furthermore, for each and .

Proof. The proof proceeds by induction. Let $x \in X$ be fixed. Since where is defined in (3.8), and by Assumptions 3.4(a) and 3.6(a), it follows that and is negative definite. By Assumption 3.9(b), it follows that , and applying Lemma 3.1, it follows that , . Straightforward computations yield Moreover, by Lemma 3.5 it is known that is strictly concave; hence is negative semidefinite (see [21, page 260]).
Let ; then where is defined in (3.8).
Since , , , and , then too. Moreover, Lemmas 3.5 and 3.7 imply that is a concave function and . It follows that is negative semidefinite.
Consequently, is negative definite. Now, since (see Assumption 3.9(b)), applying Lemma 3.1 again, it follows that , . Furthermore, the first-order condition implies that . By the invertibility of (see Assumption 3.6(b)), it follows that
On the other hand, and substituting (3.30) into (3.31), it follows that where is defined in (3.17).
Now, suppose that with . Using arguments similar to those of the case , it can be shown that , and

Assumption 3.13. For each , the function has a continuous inverse function, denoted by .

Theorem 3.14. Under Assumptions 3.4, 3.6, 3.9 and 3.13, it follows that when , for each .

Proof. Let $x \in X$ be fixed. It is known by Lemma 3.5 and Theorem 3.11 that the optimal value function is concave and differentiable on . In addition, it is known that for each , is a concave and differentiable function on . Then, from Lemma 3.3, it follows that as the iteration index goes to infinity.
Now, by Assumption 3.13, it follows that for , where is a stationary policy of and is the optimal policy. Finally, the convergence is guaranteed by the continuity of (see Assumption 3.13).

3.3. Euler Equation

Theorem 3.15. Under Assumptions 3.4, 3.6, 3.9, and 3.13 it follows that for each and , where is the function given in Assumption 3.13.

Proof. Let be fixed. By Lemma 3.5 and Theorem 3.12, it is known that and that it is a concave function. Now, from the first-order condition and the invertibility of (see Assumption 3.6(b)), it follows that Since and using the invertibility of (see Assumption 3.13), it follows that Finally, substituting (3.40) into (3.38), (3.37) is obtained.

Corollary 3.16. The optimal value function satisfies for each .

Proof. Let be fixed. It is known that the VI functions satisfy the Euler equation (3.37), so applying Lemma 3.3, it follows that as . Also, from Assumption 3.13, is a continuous function. Then, letting the iteration index go to infinity in (3.37), it follows that the optimal value function satisfies (3.41).

4. A Linear-Quadratic Model

Consider that , for each . The dynamics of the system are given by , with given. and are invertible matrices of size , and is a sequence of i.i.d. column random vectors with values in . Let be a generic element of the sequence ; assume that has a density with , and that its expectation equals the zero vector. Furthermore, it is assumed that if is a symmetric negative definite matrix of size , then is finite. In addition, it is assumed that the interchange between derivatives and integrals is valid (see Remark 3.8). A particular case of this assumption can be found in [16, page 315].

The reward function is given by where and denote the transpose of vectors and ; and are symmetric matrices of size , and both of them are negative definite.

Lemma 4.1. The linear quadratic model satisfies Assumption 2.3.

Proof. Note that is a compact set, for each and . Indeed, let and . If a sequence of satisfied , there would be a contradiction; therefore is a set bounded below. Moreover, since and are negative definite, is a set bounded above. In addition, if is such that , then by the continuity of , it follows that , implying that is a closed set. Therefore, the reward function is sup-compact. Finally, note that is a nonpositive and continuous function on , so Assumption 2.3(a) holds.
On the other hand, let ; then , where denotes the indicator function of . Since the density is continuous, it follows that the transition law is strongly continuous, that is, Assumption 2.3(b) holds.
Finally, let be defined as ; then the dynamics of the system are given by for , with .
It follows that where .
Since is i.i.d., then , where is given by (2.1). Therefore, Assumption 2.3(c) holds.

Lemma 4.2. The linear quadratic problem satisfies Assumptions 3.4, 3.6, 3.9, and 3.13.

Proof. It is easy to see that is a concave function and that , implying Assumptions 3.4(a) and 3.6(a). Assumption 3.4(b) is satisfied using Condition C2 in [20]. Furthermore, observe that . Since is an invertible matrix, Assumption 3.6(b) holds.
In addition, note that . Then it follows that has an inverse in the second variable, which is Therefore, Assumption 3.6(c) holds. Furthermore, since it follows that Assumption 3.6(d) holds. On the other hand, Assumption 3.9 is satisfied since for each . Finally, it is obtained that where is defined in (3.17), implying that the inverse of is which is a continuous function. Therefore, Assumption 3.13 is satisfied.

Lemma 4.3. The VI functions for the linear-quadratic problem satisfy for each , where with .

Proof. Observe that the validity of Theorem 3.15 is guaranteed by Lemmas 4.1 and 4.2. Now, since and are negative definite, then
By Theorem 3.15, it is known that satisfies the Euler equation (3.37), and by (4.13), it follows that Since and equals the zero vector, then and by direct calculations, it is obtained that where
Now, suppose that for , with defined in (4.15). Then, by Theorem 3.15 and (4.13), it is known that
Then , and using matrix algebra, it follows that , where satisfies (4.15).
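
A numerical sketch of the recursion behind Lemma 4.3, written in an assumed notation that need not coincide with the paper's: dynamics $x_{t+1} = A x_t + B a_t + \xi_t$, reward $x^{\top} Q x + a^{\top} R a$ with $Q$ and $R$ symmetric negative definite, zero noise mean, noise covariance $\Sigma$, and discount factor $\alpha$. In this setting each VI function is quadratic in the state plus a constant, and each maximizer is linear in the state.

```python
import numpy as np

def lq_value_iteration(A, B, Q, R, Sigma, alpha, n_iter=100):
    """VI for the discounted LQ problem (assumed notation): reward x'Qx + a'Ra with
    Q, R negative definite, dynamics x_{t+1} = A x_t + B a_t + xi_t, E[xi] = 0,
    Cov[xi] = Sigma.  Returns the final quadratic coefficient P_n, the constant d_n,
    and the gains K_n of the maximizers f_n(x) = K_n x."""
    P, d = np.zeros_like(Q), 0.0        # v_0 = 0
    gains = []
    for _ in range(n_iter):
        M = R + alpha * B.T @ P @ B     # negative definite, hence invertible
        K = -alpha * np.linalg.solve(M, B.T @ P @ A)
        d = alpha * (np.trace(P @ Sigma) + d)
        P = Q + K.T @ R @ K + alpha * (A + B @ K).T @ P @ (A + B @ K)
        gains.append(K)
    return P, d, gains

# Illustrative (assumed) data.
A = np.array([[1.0, 0.2], [0.0, 0.9]]); B = np.eye(2)
Q = -np.eye(2); R = -0.5 * np.eye(2); Sigma = 0.01 * np.eye(2)
P, d, gains = lq_value_iteration(A, B, Q, R, Sigma, alpha=0.95)
print(np.max(np.abs(gains[-1] - gains[-2])))   # ~ 0: the maximizers stabilize
```

Along the iteration the matrix $R + \alpha B^{\top} P B$ stays negative definite (the quadratic coefficient remains negative semidefinite), so the maximizer at each step is well defined and unique, in line with Lemma 3.5.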

Lemma 4.4. The optimal policy for the linear-quadratic problem is , where satisfies

Proof. Lemma 4.3 and (4.13) yield for each and . Moreover, the validity of Theorem 3.14 is guaranteed by Lemma 4.2; that is, , implying the convergence of the sequence , which, according to its definition in (4.15), guarantees that its limit, denoted by , must satisfy (4.25). Finally, using matrix algebra, (4.24) is obtained.
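
Continuing with the assumed notation of the previous sketch, iterating the same recursion until the quadratic coefficient stabilizes gives a numerical counterpart of Lemma 4.4: the limiting matrix satisfies a Riccati-type fixed-point equation, and the corresponding stationary gain determines a policy of the form $f(x) = K x$.

```python
import numpy as np

def lq_stationary_gain(A, B, Q, R, alpha, tol=1e-12, max_iter=10_000):
    """Iterate the Riccati-type recursion (assumed LQ notation) to its fixed point P
    and return P together with the stationary gain K of f(x) = K x."""
    P = np.zeros_like(Q)
    for _ in range(max_iter):
        M = R + alpha * B.T @ P @ B
        K = -alpha * np.linalg.solve(M, B.T @ P @ A)
        P_new = Q + K.T @ R @ K + alpha * (A + B @ K).T @ P @ (A + B @ K)
        if np.max(np.abs(P_new - P)) < tol:
            return P_new, K
        P = P_new
    return P, K

alpha = 0.95
A = np.array([[1.0, 0.2], [0.0, 0.9]]); B = np.eye(2)
Q = -np.eye(2); R = -0.5 * np.eye(2)
P, K = lq_stationary_gain(A, B, Q, R, alpha)

# The limit satisfies the Riccati-type fixed-point equation
#   P = Q + alpha*A'PA - alpha^2*A'PB (R + alpha*B'PB)^{-1} B'PA .
resid = Q + alpha * A.T @ P @ A - alpha ** 2 * A.T @ P @ B @ np.linalg.solve(
    R + alpha * B.T @ P @ B, B.T @ P @ A) - P
print(np.max(np.abs(resid)))   # ~ 0
```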

5. Conclusion

In this paper, a method to solve the optimal control problem is presented. The method is based on the use of the Euler equation. The proposed procedure solves the optimal control problem by means of an envelope formula and the convergence of the maximizers of the value iteration functions to a stationary optimal policy. Future work will study error bounds for the approximation of the optimal policy by these maximizers.