Nonuniqueness versus Uniqueness of Optimal Policies in Convex Discounted Markov Decision Processes
Abstract
From the classical point of view, it is important to determine whether, in a Markov decision process (MDP), besides the existence of optimal policies, their uniqueness is guaranteed. It is well known that uniqueness does not always hold in optimization problems (for instance, in linear programming). On the other hand, in such problems a slight perturbation of the cost functional may restore uniqueness. In this paper, it is proved that, under adequate conditions, the value functions of an MDP and of its cost-perturbed version stay close, which is in some sense a prerequisite for the question that interests us: the stability of Markov decision processes with respect to perturbations of the cost-per-stage function.
1. Introduction
From the classical point of view (for instance, in Hadamard's concept of well-posedness [1]), in a mathematical modeling problem it is crucial that both the existence and the uniqueness of the solution are secured. In optimization, however, neither is guaranteed, and even if extra conditions ensure the existence of optimizers, their uniqueness will not automatically follow. For instance, in linear programming we even have the extreme case that, when there are two different optimal vectors, all of their convex combinations automatically become optimal. But a slight perturbation of the cost functional will "destroy" most of these optimizers. In this sense, nonuniqueness in linear programming is highly unstable. This question is of interest with respect to the standard discounted Markov decision model, as in [2], which presents conditions that guarantee the uniqueness of the optimal policies.
In this paper, we study a family of perturbations of the cost of an MDP and establish that, under convexity and adequate bounds, the value functions of the original and the cost-perturbed MDPs are uniformly close. This result will eventually help us determine whether both uniqueness and nonuniqueness are stable with respect to this kind of perturbation.
The structure of this paper is simple. Firstly, the preliminaries and assumptions of the model are outlined. Secondly, the main theorem is stated and proved, followed by the main example. A brief section with the concluding remarks closes the paper.
2. Preliminaries: Discounted MDPs and Convexity Assumptions
Let $(X, A, \{A(x) : x \in X\}, Q, c)$ be a Markov control model (see [3] for details and terminology), which consists of the state space $X$, the control (or action) set $A$, the transition law $Q$, and the cost-per-stage $c$. It is assumed that both $X$ and $A$ are subsets of $\mathbb{R}$ (this is supposed for simplicity, but it is also possible to present the theory of this paper considering that $X$ and $A$ are subsets of Euclidean spaces of dimension greater than one). For each $x \in X$, there is a nonempty measurable set $A(x) \subseteq A$ whose elements are the feasible actions when the state of the system is $x$. Define $\mathbb{K} = \{(x, a) : x \in X,\ a \in A(x)\}$. Finally, the cost-per-stage $c$ is a nonnegative and measurable function on $\mathbb{K}$.
Let $\Pi$ be the set of all (possibly randomized, history-dependent) admissible policies. By standard convention, a stationary policy is identified with a measurable function $f : X \to A$ such that $f(x) \in A(x)$ for all $x \in X$. The set of stationary policies is denoted by $\mathbb{F}$. For every $\pi \in \Pi$ and an initial state $x \in X$, let $V(\pi, x) = E_x^{\pi}\left[\sum_{t=0}^{\infty} \alpha^{t} c(x_t, a_t)\right]$ be the total expected discounted cost when using the policy $\pi$, given the initial state $x$. The number $\alpha \in (0, 1)$ is called the discount factor ($\alpha$ is assumed to be fixed). Here $\{x_t\}$ and $\{a_t\}$ denote the state and the control sequences, respectively, and $E_x^{\pi}$ is the expectation operator. A policy $\pi^*$ is said to be optimal if $V(\pi^*, x) = V^*(x)$ for all $x \in X$, where $V^*(x) = \inf_{\pi \in \Pi} V(\pi, x)$, $x \in X$. $V^*$ is called the optimal value function. The following assumption will also be taken into consideration.
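For a stationary policy in a finite-state sketch, the discounted cost defined above satisfies the linear fixed-point relation $V(f, x) = c(x, f(x)) + \alpha \sum_y P_f(x, y) V(f, y)$, so it can be computed by solving a linear system. The two-state chain below is invented purely for illustration; the paper's model is a general Borel one.

```python
import numpy as np

alpha = 0.9  # discount factor

# Hypothetical two-state chain induced by a fixed stationary policy f:
# P_f[x, y] is the transition probability under f, c_f[x] the stage cost under f.
P_f = np.array([[0.5, 0.5],
                [0.2, 0.8]])
c_f = np.array([1.0, 3.0])

# V(f, .) = c_f + alpha * P_f V(f, .)  =>  (I - alpha * P_f) V = c_f
V_f = np.linalg.solve(np.eye(2) - alpha * P_f, c_f)
```

The solve is exact here because, for $\alpha \in (0,1)$ and a stochastic matrix $P_f$, the matrix $I - \alpha P_f$ is always invertible.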
Assumption 1. (a) $c$ is lower semicontinuous and inf-compact on $\mathbb{K}$ (i.e., for every $x \in X$ and $r \in \mathbb{R}$, the set $\{a \in A(x) : c(x, a) \le r\}$ is compact).
(b) The transition law $Q$ is strongly continuous; that is, the mapping $(x, a) \mapsto \int_X u(y)\, Q(dy \mid x, a)$ is continuous and bounded on $\mathbb{K}$, for every measurable bounded function $u$ on $X$.
(c) There exists a policy $\pi$ such that $V(\pi, x) < \infty$ for each $x \in X$.
Remark 2. The following consequences of Assumption 1 are well known (see [3]). (a) The optimal value function $V^*$ is the solution of the optimality equation (OE); that is, for all $x \in X$,
$$V^*(x) = \min_{a \in A(x)} \left[ c(x, a) + \alpha \int_X V^*(y)\, Q(dy \mid x, a) \right]. \quad (3)$$
There is also $f^* \in \mathbb{F}$ such that $f^*(x)$ attains the minimum in (3) for each $x \in X$, and $f^*$ is optimal. (b) For every $n = 1, 2, \ldots$ and $x \in X$,
$$v_n(x) = \min_{a \in A(x)} \left[ c(x, a) + \alpha \int_X v_{n-1}(y)\, Q(dy \mid x, a) \right], \quad (5)$$
with $v_0$ defined as $v_0 \equiv 0$, and $v_n \uparrow V^*$. Moreover, for each $n = 1, 2, \ldots$, there is $f_n \in \mathbb{F}$ such that, for each $x \in X$, $f_n(x)$ attains the minimum in (5).
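Remark 2(b) is the basis of the value iteration scheme. As an illustration only, the following sketch runs value iteration on a hypothetical two-state, two-action MDP with discount factor $\alpha = 0.9$; all transition and cost data are invented for the example, and the finite sets stand in for the general Borel model.

```python
import numpy as np

def value_iteration(P, c, alpha, n_iter=500):
    """P[a][x, y]: transition probability from x to y under action a; c[x, a]: stage cost."""
    v = np.zeros(c.shape[0])
    for _ in range(n_iter):
        # Bellman (optimality-equation) operator: minimize immediate cost plus
        # discounted expected continuation cost over the feasible actions.
        q = np.array([[c[x, a] + alpha * P[a][x] @ v
                       for a in range(c.shape[1])]
                      for x in range(c.shape[0])])
        v = q.min(axis=1)
    return v, q.argmin(axis=1)

# Action 0 always moves to state 0; action 1 always moves to state 1.
P = [np.array([[1.0, 0.0], [1.0, 0.0]]),
     np.array([[0.0, 1.0], [0.0, 1.0]])]
c = np.array([[0.0, 2.0],   # state 0: staying (action 0) is free
              [1.0, 1.5]])  # state 1: action 0 escapes to the cheap state
v, f = value_iteration(P, c, alpha=0.9)
# v approximates the optimal value function, and f plays the role of a minimizer
```

Here the iterates stabilize at $v = (0, 1)$ with the stationary policy choosing action 0 in both states.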
Let $\mathcal{M} = (X, A, \{A(x) : x \in X\}, Q, c)$ be a fixed Markov control model. Take $M$ as the MDP with the Markov control model $\mathcal{M}$. The optimal value function, the optimal policy which comes from (3), and the minimizers in (5) will be denoted for $M$ by $V^*$, $f^*$, and $f_n$, $n = 1, 2, \ldots$, respectively. Also let $v_n$, $n = 1, 2, \ldots$, be the value iteration functions for $M$.
It will also be supposed that the MDPs taken into account satisfy one of the following Assumptions 3 or 4.
Assumption 3. (a) $X$ and $A$ are convex.
(b) for all , , , , and . Besides, it is assumed that if and , , then , and are convex for each .
(c) $Q$ is induced by a difference equation $x_{t+1} = G(x_t, a_t, \xi_t)$, $t = 0, 1, \ldots$, with $x_0$ given, where $G$ is a measurable function and $\{\xi_t\}$ is a sequence of independent and identically distributed (i.i.d.) random variables with values in $S \subseteq \mathbb{R}$ and with a common density $\rho$. In addition, we suppose that $G(\cdot, \cdot, s)$ is a convex function on $\mathbb{K}$, for each $s \in S$, and if and , , then for each and .
(d) is convex on , and if and , , then , for each .
Assumption 4. (a) The same as Assumption 3(a).
(b) for all , , , , and . Besides, is assumed to be convex for each .
(c) $Q$ is given by the relation $x_{t+1} = \beta x_t + \delta a_t + \xi_t$, $t = 0, 1, \ldots$, where $\{\xi_t\}$ are i.i.d. random variables taking values in $S \subseteq \mathbb{R}$ with the density $\rho$, and $\beta$ and $\delta$ are real numbers.
(d) $c$ is convex on $\mathbb{K}$.
Remark 5. Assumptions 3 and 4 are essentially the same as Conditions C1 and C2 on pages 419–420 of reference [2], with the difference that we are now able to assume that the cost function $c$ is convex and not necessarily strictly convex. (In fact, in [2], Conditions C1 and C2 take into account the more general situation in which both $X$ and $A$ are subsets of Euclidean spaces of dimension greater than one.) Also note that each of Assumptions 3 and 4 implies that, for each $n$, $v_n$ is convex, and hence that $V^*$ is convex but not necessarily strictly convex (so $M$ does not necessarily have a unique optimal policy). The proof of this fact is a direct consequence of the convexity of the cost function and of the proof of Lemma 6.2 in [2].
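To see concretely why convexity alone does not force uniqueness of the minimizing action, consider a purely illustrative one-dimensional cost (invented for this sketch, not taken from the paper): a convex function that is flat on an interval has a whole interval of minimizers.

```python
import numpy as np

def hinge(a):
    # Convex but NOT strictly convex: identically zero on [-1, 1].
    return np.maximum(np.abs(a) - 1.0, 0.0)

grid = np.linspace(-2.0, 2.0, 401)           # action grid with step 0.01
vals = hinge(grid)
minimizers = grid[np.isclose(vals, vals.min())]
# every grid point of [-1, 1] attains the minimum: the argmin is an interval,
# mirroring the nonuniqueness phenomenon described in Remark 5
```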
3. Main Result and an Example
For $\gamma > 0$, consider the following MDP, denoted by $M_\gamma$, with the Markov control model $(X, A, \{A(x) : x \in X\}, Q, c_\gamma)$, where $c_\gamma(x, a) = c(x, a) + \gamma a^2$, $(x, a) \in \mathbb{K}$, and $c$ is the cost function for $M$. Observe that the MDPs $M$ and $M_\gamma$ coincide in the components of the Markov control model except for the cost function; moreover, $\mathbb{K}$ is the same set in both models. Additionally, we suppose the following.
Assumption 6. There is a policy $\pi$ such that $V_\gamma(\pi, x) < \infty$ for each $x \in X$, where $V_\gamma(\pi, x)$ denotes the total expected discounted cost for $M_\gamma$.
Remark 7. Suppose that, for $M$, Assumption 1 holds. Then it is direct to verify that if $M_\gamma$ satisfies Assumption 6, then it also satisfies Assumption 1.
For $M_\gamma$, let $V_\gamma^*$, $f_\gamma^*$, and $f_{\gamma, n}$, $n = 1, 2, \ldots$, denote the optimal value function, the optimal policy which comes from (3), and the minimizers in (5), respectively. Moreover, let $v_{\gamma, n}$, $n = 1, 2, \ldots$, be the corresponding value iteration functions for $M_\gamma$.
Remark 8. Suppose that, for $M$, one of Assumptions 3 or 4 holds. Then, as $c$ is a convex function, it is trivial to prove that $c_\gamma$ is strictly convex. Then, under Assumption 6, it follows that $M_\gamma$ satisfies C1 or C2 in [2] and that $v_{\gamma, n}$ is strictly convex for each $n = 1, 2, \ldots$, so the minimizer $f_{\gamma, n}$ is unique.
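Remark 8 is the heart of the construction: adding a strictly convex term to a convex cost makes the minimizing action unique. A minimal numerical sketch, assuming a quadratic perturbation of the form $c(x, a) + \gamma a^2$ (the quadratic type discussed in the concluding remarks; the flat cost below is invented for illustration):

```python
import numpy as np

def hinge(a):
    # convex stage cost with a whole interval [-1, 1] of minimizers
    return np.maximum(np.abs(a) - 1.0, 0.0)

grid = np.linspace(-2.0, 2.0, 401)
for gamma in (0.1, 0.01):
    vals = hinge(grid) + gamma * grid**2     # perturbed cost: strictly convex
    minimizers = grid[np.isclose(vals, vals.min())]
    # the perturbation collapses the interval of minimizers to the single
    # point a = 0, however small gamma > 0 is
```

The same collapse happens for any convex base cost: strict convexity of the sum forbids two distinct minimizers.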
Let , so that and , , and take, for each , and .
Remark 9. It is easy to verify, using Assumption 1, that for each , and are nonempty and compact. Moreover, since and from Remark 2, , ; , for each and . It is also trivial to prove that, for each , ; hence , , , , for each and .
Condition 10. There exist a measurable function $W : X \to [0, \infty)$, which may depend on $\gamma$, and a constant $\lambda \ge 1$ with $\alpha\lambda < 1$, such that $a^2 \le W(x)$ and $\int_X W(y)\, Q(dy \mid x, a) \le \lambda W(x)$ for each $x \in X$ and $a \in A(x)$.
Remark 11. With respect to the existence of the function mentioned in Condition 10, it is important to note that this kind of requirement has been used previously in the literature on MDPs with unbounded costs (see, for instance, the remarks presented on page 578 of [4]).
Theorem 12. Suppose that Assumptions 1 and 6 hold and that, for $M$, one of Assumptions 3 or 4 holds. Let $\gamma$ be a positive number. Then (a) if $A$ is compact, $|V^*(x) - V_\gamma^*(x)| \le \gamma D^2 / (1 - \alpha)$ for all $x \in X$, where $D$ is the diameter of a compact set $\mathcal{A}$ such that $A \subseteq \mathcal{A}$ and $0 \in \mathcal{A}$; (b) under Condition 10, $|V^*(x) - V_\gamma^*(x)| \le \gamma W(x) / (1 - \alpha\lambda)$ for all $x \in X$, with $W$ and $\lambda$ as in Condition 10.
Proof. The proof of case (a) follows from the proof of case (b), given that, when $A$ is compact, Condition 10 holds with $W \equiv D^2$ and $\lambda = 1$ (observe that in this case, if $a \in A$, then $a^2 \le D^2$).
(b) Assume that $M$ satisfies Assumption 3 (the proof for the case in which $M$ satisfies Assumption 4 is similar).
Firstly, for each ,
Secondly, assume that for some positive integer and for each ,
Consequently, using Condition 10, for each ,
On the other hand, from (11) and the fact that , , for each ,
In conclusion, combining (10), (13), and (14), it is obtained that, for each $x \in X$, (11) holds for all $n = 1, 2, \ldots$. Now, letting $n \to \infty$ in (11), we get $|V^*(x) - V_\gamma^*(x)| \le \gamma W(x) / (1 - \alpha\lambda)$, $x \in X$.
The following corollary is immediate.
Corollary 13. Suppose that Assumptions 1 and 6 hold. Suppose that, for $M$, one of Assumptions 3 or 4 holds (hence, $M$ does not necessarily have a unique optimal policy). Let $\gamma$ be a positive number. If $A$ is compact or Condition 10 holds, then there exists an MDP $M_\gamma$ with a unique optimal policy $f_\gamma^*$ such that the inequalities in Theorem 12(a) or (b) hold, respectively.
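The closeness asserted by Theorem 12 can be checked numerically on a toy instance. The sketch below (all data invented; a finite two-state, two-action model stands in for the general one) perturbs the cost by $\gamma a^2$, the quadratic perturbation discussed in the concluding remarks, and compares the two optimal value functions: the sup-distance is bounded by $\gamma \max_a a^2 / (1 - \alpha)$ and vanishes as $\gamma \to 0$.

```python
import numpy as np

def value_iteration(P, c, alpha, n_iter=600):
    """Optimal discounted value function of a finite MDP by value iteration."""
    v = np.zeros(c.shape[0])
    for _ in range(n_iter):
        q = np.array([[c[x, a] + alpha * P[a][x] @ v
                       for a in range(c.shape[1])]
                      for x in range(c.shape[0])])
        v = q.min(axis=1)
    return v

# Action 0 always moves to state 0; action 1 always moves to state 1.
P = [np.array([[1.0, 0.0], [1.0, 0.0]]),
     np.array([[0.0, 1.0], [0.0, 1.0]])]
c = np.array([[0.0, 2.0],
              [5.0, 0.4]])          # in state 1 the optimal action is 1
alpha = 0.9
a_vals = np.array([0.0, 1.0])       # numeric values of the two actions
v_star = value_iteration(P, c, alpha)

gammas = (0.5, 0.05, 0.005)
gaps = []
for gamma in gammas:
    c_gamma = c + gamma * a_vals**2          # perturbed cost c + gamma * a^2
    v_gamma = value_iteration(P, c_gamma, alpha)
    gaps.append(np.max(np.abs(v_star - v_gamma)))
# each gap is at most gamma * max(a^2) / (1 - alpha), and the gaps shrink with gamma
```

Since the perturbed cost dominates the original one by at most $\gamma \max_a a^2$ per stage, the crude bound $\gamma \max_a a^2 / (1 - \alpha)$ follows by summing the discounted series, which is the compact-action flavor of Theorem 12(a).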
Example 14. Let , , for all . The dynamic of the system is given by . Here, are i.i.d. random variables with values in and with a common continuous bounded density denoted by . The cost function is given by , (observe that is convex but not strictly convex).
Lemma 15. Example 14 satisfies Assumptions 1, 3, and 6, and Condition 10.
Proof. Assumption 1(a) trivially holds. The proof of the strong continuity of $Q$ is as follows: if $u$ is a measurable and bounded function, then, using the change-of-variable theorem, a simple computation shows that $\int_X u(y)\, Q(dy \mid x, a)$ can be expressed as an integral of $u$ against the common density of the disturbances. As $u$ is a bounded function and this density is bounded and continuous, it follows directly, using the dominated convergence theorem, that $(x, a) \mapsto \int_X u(y)\, Q(dy \mid x, a)$ is a continuous function on $\mathbb{K}$; that is, $Q$ is strongly continuous.
By direct computations we get, for the stationary policy , , that both and are less than or equal to for all (observe that, in this case, and , ); consequently, Assumptions 1 and 6 hold.
On the other hand, Assumptions 3(a), (b), and (d) are immediate. Let . Clearly, is nondecreasing in the first variable.
Now, take and , , and . Then, considering that , , and are less than or equal to one, it follows that is convex; that is, Assumption 3(c) holds.
Now, for each ,
Hence, taking , , using (20) and, again, that , it is possible to obtain that for each and , and that .
4. Concluding Remarks
The specific form of the perturbation used in this paper is taken from [5, Exercise 28, page 81], where it is established that a convex function perturbed by a suitable positive quadratic function becomes strictly convex and coercive. In fact, this kind of perturbation is closely related to the one proposed by Tanaka et al. in [6], and further research in this direction is being conducted.
Both the state and action spaces are considered to be subsets of $\mathbb{R}$ just for simplicity of exposition. All the results hold in $\mathbb{R}^n$. In this case, if $X, A \subseteq \mathbb{R}^n$, then it is possible to take $c_\gamma(x, a) = c(x, a) + \gamma \|a\|^2$, where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^n$ (see [5, Exercise 28, page 81]), and all the results of this article remain valid.
Theorem 12, on the closeness of the value functions of the original and the perturbed MDPs, requires conditions that are all very common in the technical literature on MDPs. The importance of the result lies in the fact that it is a crucial step in the study of the stability, under cost perturbations, of the uniqueness or nonuniqueness of optimal policies.
Finally, we should mention that this research was motivated by our interest in understanding the relationship between nonuniqueness and robustness in several statistical procedures based on optimization.
References
[1] J. Hadamard, Sur les Problèmes aux Dérivées Partielles et Leur Signification Physique, Princeton University Bulletin, 1902.
[2] D. Cruz-Suárez, R. Montes-de-Oca, and F. Salem-Silva, "Conditions for the uniqueness of optimal policies of discounted Markov decision processes," Mathematical Methods of Operations Research, vol. 60, no. 3, pp. 415–436, 2004.
[3] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes, vol. 30, Springer, New York, NY, USA, 1996.
[4] J. A. E. E. van Nunen and J. Wessels, "A note on dynamic programming with unbounded rewards," Management Science, vol. 24, no. 5, pp. 576–580, 1978.
[5] A. L. Peressini, F. E. Sullivan, and J. J. Uhl, Jr., The Mathematics of Nonlinear Programming, Springer, New York, NY, USA, 1988.
[6] K. Tanaka, M. Hoshino, and D. Kuroiwa, "On an ε-optimal policy of discrete time stochastic control processes," Bulletin of Informatics and Cybernetics, vol. 27, no. 1, pp. 107–119, 1995.