Abstract

Conventional algorithms for solving Markov decision processes (MDPs) become intractable for large finite state and action spaces. Several studies have been devoted to this issue, but most of them treat only infinite-horizon MDPs. This paper is one of the first works to deal with non-stationary finite-horizon MDPs by proposing a new decomposition approach, which consists in partitioning the problem into smaller restricted finite-horizon MDPs; each restricted MDP is solved independently, in a specific order, within the proposed hierarchical backward induction (HBI) algorithm, which is based on the backward induction (BI) algorithm. Next, the local solutions are combined to obtain a global solution. An example of racetrack problems shows the performance of the proposed decomposition technique.

1. Introduction

Stochastic models have recently gained a lot of attention in the artificial intelligence (AI) community; they offer a suitable framework for solving problems with uncertainties. MDPs [1] are one such model and have achieved promising results in numerous applications [2-4]. Most real-world problems have very large state spaces that require many mathematical operations and substantial memory, so it is intractable to solve them with classical MDP algorithms [5]. Motivated by these considerations, several recent studies have used decomposition techniques to overcome this computational complexity. The decomposition approach introduced by Bather [6] divides the state space into strongly connected components (SCCs) organized into levels and solves the small problems, called restricted MDPs, separately at each level; the global solution of the original MDP is then obtained by combining the partial solutions. Subsequently, Ross and Varadarajan [7] proposed a similar decomposition technique for solving constrained limiting average MDPs. Such decompositions have been employed for various categories of MDPs (discounted, average, and weighted MDPs) in diverse studies [8, 9]. The weak point of these approaches is their polynomial execution complexity. To accelerate the execution time, Chafik and Daoui integrated the decomposition and parallelism schemes [10]; unfortunately, the decomposition algorithm remains polynomial at runtime. Later, to accelerate the convergence of the decomposition, Larach and Daoui [11] investigated the decomposition of the state space into SCCs organized into levels based on Tarjan's algorithm [12]. Subsequent work [13-15] developed approaches to solve MDPs with the factorization methods introduced by [16]. The goal of factoring a problem is to decompose it into smaller items. Factored MDPs produce compact representations of complex and uncertain systems, allowing for an exponential reduction in the complexity of the representation [17]. These factored approaches represent states as factored states with an internal structure and state transition matrices as dynamic Bayesian networks (DBNs). However, methods for solving representations based on factored DBNs do not exploit advances in tensor decomposition methods for representing large atomic MDPs. More recently, research such as [17, 18] has exploited an idea similar to that of the thesis of Smart [19], aiming to improve the efficiency of MDP solvers by using tensor decomposition methods to compact state transition matrices. The solver uses the value iteration and policy iteration algorithms to compute the solution compactly. The authors tried different ways to parallelize their proposed approaches, but no improvement in the execution time was observed. These methods are based on multiplications between small tensor components. More recently, work has been carried out to address this problem in the context of parallelism [20], presenting a way to decompose an MDP into SCCs and find dependency chains for these SCCs. The authors solve independent chains of SCCs with a proposed variant of the topological value iteration (TVI) algorithm, called parallel chained TVI, aimed at improving the execution time on GPUs. In this context, several research groups [21-23] have developed parallel versions of iterative algorithms to accelerate their convergence.

The literature mentioned above focuses only on solving different types of MDPs under the infinite-horizon criterion, and it is difficult to find papers that address large non-stationary finite-horizon MDPs. This paper is oriented in this underexplored direction: the main objective of this work is to propose a new decomposition technique tackling the challenges of reducing memory requirements and computational cost.

The proposed technique consists in partitioning the global problem into smaller restricted finite-horizon MDPs; each restricted MDP is solved independently, in a specific order, using the backward induction algorithm. Next, the local solutions are combined to obtain a global solution.

Several problems modeled as MDPs come with a given initial state i0; in that case, the optimal action f(i0) and the optimal value VT(i0) are computed by solving only the restricted MDPs corresponding to the classes accessible from the class containing i0 (one does not need to consider all states). This is another advantage of the method, which reduces memory consumption and speeds up the computation time.

The remainder of the article is organized as follows: the second section introduces the fundamentals of finite-horizon MDPs. The third section presents the backward induction algorithm. The fourth section describes the decomposition technique, the new restricted finite-horizon MDPs, and the proposed hierarchical backward induction algorithm. The last section illustrates the advantages of this decomposition technique through its application to a racetrack problem. The paper ends with conclusions and prospects for future work.

2. Markov Decision Process

Markov decision processes have been widely studied as an elegant mathematical formalism for many decision-making problems in a variety of fields of science and engineering [24]. The objective is to find the best decision policies (action selections) that achieve the maximum expected reward (or minimum cost) in a given stochastic dynamic environment satisfying the Markov property [1]. In this section, we present non-stationary finite-horizon MDPs with finite state and action spaces.

Formally, a non-stationary MDP with a finite horizon is defined by the five-tuple (S, A, T, P, R), where S and A are the state and action spaces; T is the time horizon; P denotes the state transition probability function, where P_t(j | i, a) is the probability of moving from state i to state j by taking action a at time t; and R is the reward function defined on state transitions, where R_t(i, a) indicates the reward gained if action a is executed in state i at period t. Let X_t and Y_t be the random variables representing, respectively, the state and the action at time t. Most MDP solvers attempt to find an optimal policy that specifies the (optimal) action to be taken in each state at each period. Over a finite planning horizon T, an optimal policy π is a policy that maximizes the expected total reward, i.e., it maximizes the value function of the Bellman equation [24]:

V_T^π(i) = E_π [ Σ_{t=1}^{T} R_t(X_t, Y_t) | X_1 = i ],

where V_T^π(i) is the total expected reward over T periods, given that the process starts from the initial state i and the policy π is used. Besides, we define the optimal value vector V_T by V_T(i) = max_π V_T^π(i) for all i ∈ S.
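
For readers who prefer code to notation, the following C++ sketch shows one possible in-memory representation of such a non-stationary finite-horizon MDP with sparse transitions; the type and field names are illustrative assumptions and are not taken from the implementation used in this paper.

#include <vector>

// One outgoing transition (j, P_t(j | i, a)) of a state-action pair.
struct Transition {
    int next;     // successor state j
    double prob;  // P_t(j | i, a)
};

// Sparse storage of a non-stationary finite-horizon MDP (S, A, T, P, R).
// actions[i] lists the actions available in state i (the set A(i));
// succ[t][i][a] and reward[t][i][a] are indexed by the position a of the
// action inside actions[i], for period t = 0, ..., T - 1.
struct FiniteHorizonMDP {
    int numStates;   // |S|
    int horizon;     // T
    std::vector<std::vector<int>> actions;
    std::vector<std::vector<std::vector<std::vector<Transition>>>> succ;
    std::vector<std::vector<std::vector<double>>> reward;   // R_t(i, a)
};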

It is well known that the backward induction algorithm is one of the most common iterative methods used to find an optimal policy. In the next section, we will discuss it in more detail.

3. Backward Induction Algorithm

In this section, we compute an optimal policy as well as the optimal value vector using the backward induction algorithm. Its iterative process starts at the end of the planning horizon T and computes the values for the preceding periods; after T iterations, an optimal policy is found.

The following theorem introduced in [25] demonstrates the validity of the BI algorithm:

Theorem 1. Let x_{T+1}(i) = 0 for all i ∈ S. Define recursively, for t = T, T − 1, …, 1, a deterministic decision rule f_t and the vector x_t as follows:
(i) f_t(i) ∈ argmax_{a ∈ A(i)} { R_t(i, a) + Σ_{j ∈ S} P_t(j | i, a) x_{t+1}(j) },
(ii) x_t(i) = max_{a ∈ A(i)} { R_t(i, a) + Σ_{j ∈ S} P_t(j | i, a) x_{t+1}(j) };
then R = (f_1, f_2, …, f_T) is an optimal policy and x_1 is the optimal value vector V_T.

Proof. (see [25]).
To accelerate the execution time of the classical BI algorithm, the authors use the proposal of [11]: for each state-action pair (i, a), they introduce the list of state-action successors succ(i, a) = { j ∈ S : P_t(j | i, a) > 0 for some t }. This allows the time complexity to be reduced from O(T × E² × |A|) to O(T × E × b × |A|) arithmetic operations, where b denotes the average number of state-action successors, T is the horizon, and E is the number of states. Algorithm 1 describes the ameliorated backward induction (ABI) algorithm as follows (an illustrative C++ sketch is given after the listing):

(1)ABI (In: MDP, T; Out: (f1, …, fT), x1)
(2)t ⟵ T + 1
(3)Take x_{T+1}(i) ⟵ 0 for all i ∈ S
(4)Repeat
(5) t ⟵ t − 1
(6) For each i ∈ S Do
(7)  f_t(i) ⟵ argmax_{a ∈ A(i)} { R_t(i, a) + Σ_{j ∈ succ(i, a)} P_t(j | i, a) x_{t+1}(j) } //The deterministic decision rule
(8)  x_t(i) ⟵ R_t(i, f_t(i)) + Σ_{j ∈ succ(i, f_t(i))} P_t(j | i, f_t(i)) x_{t+1}(j)
(9) End For
(10)Until t = 1; (f1, …, fT) is an optimal policy and x1 is the optimal value vector
(11)Return ((f1, …, fT), x1)
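
As an illustration of Algorithm 1, the following is a minimal C++ sketch of the ABI recursion over the sparse successor lists; it assumes the FiniteHorizonMDP structure sketched in Section 2 and is not the authors' implementation.

#include <limits>
#include <vector>

// Ameliorated backward induction over sparse successor lists (cf. Algorithm 1).
// f[t][i] is the optimal action at 0-based period t and x[t][i] the corresponding
// value; x[0] is then the optimal value vector V_T.
void abi(const FiniteHorizonMDP& mdp,
         std::vector<std::vector<int>>& f,
         std::vector<std::vector<double>>& x) {
    const int T = mdp.horizon, n = mdp.numStates;
    f.assign(T, std::vector<int>(n, -1));
    x.assign(T + 1, std::vector<double>(n, 0.0));      // x_{T+1} = 0
    for (int t = T - 1; t >= 0; --t) {                 // periods T, T-1, ..., 1
        for (int i = 0; i < n; ++i) {
            double best = -std::numeric_limits<double>::infinity();
            int bestA = -1;
            for (std::size_t a = 0; a < mdp.actions[i].size(); ++a) {
                // Q = R_t(i, a) + sum over the stored successors of (i, a)
                double q = mdp.reward[t][i][a];
                for (const Transition& tr : mdp.succ[t][i][a])
                    q += tr.prob * x[t + 1][tr.next];
                if (q > best) { best = q; bestA = static_cast<int>(a); }
            }
            f[t][i] = bestA;   // deterministic decision rule
            x[t][i] = best;    // optimal value at period t
        }
    }
}

Only the successors with non-zero probability are visited, which is what yields the O(T × E × b × |A|) complexity mentioned above.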

4. Hierarchical Backward Induction Algorithm

The BI algorithm becomes quite impractical for computing an optimal policy for finite-horizon MDPs with a large state space. For non-stationary finite-horizon MDPs, the computational load can increase further. To overcome this issue, we describe in this section a new decomposition technique for improving the performance and reducing the running time.

4.1. The Decomposition Technique

Let us consider the directed graph G = (S, U) associated with the original MDP, where S is the set of nodes representing the state space and U is the set of directed arcs, with (i, j) ∈ U if and only if P_t(j | i, a) > 0 for some action a ∈ A(i) and some period t. There exists a unique partition S = C1 ∪ C2 ∪ ⋯ ∪ Cp of the state space S into strongly connected classes. Note that the SCCs are the equivalence classes of the relation on G defined by: i is strongly connected to j if and only if i = j or there exist a directed path from i to j and a directed path from j to i. There are many good algorithms in graph theory for the computation of such a partition, e.g., see [11].

Now, we construct by induction the levels of the graph G. The level L0 is formed by all closed classes Ci, that is, the classes Ci for which Σ_{j ∈ Ci} P_t(j | i, a) = 1 for all i ∈ Ci, all a ∈ A(i), and all t (no arc of G leaves Ci). The level Lp is formed by all classes Ci such that the end of any arc emanating from Ci lies in one of the levels Lp−1, Lp−2, …, L0. After finding the SCCs using Tarjan's algorithm, the level of each class is determined using the following procedure (Algorithm 2), introduced in [26]; an illustrative code sketch is given after the listing.

(i)Ω ⟵ S; n ⟵ 0; Ln ⟵ { Ci : Ci is closed }
(ii)If L0 = S, Stop.
(iii)Otherwise, while Ω ≠ ∅, do
(iv)Delete Ln (i.e., Ω ⟵ Ω \ Ln and eliminate all arcs coming into Ln);
(v)Ln+1 ⟵ { Ci : Ci is closed in the MDP restricted to Ω };
(vi)n ⟵ n + 1.
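
Assuming the SCCs and the arcs between them (the condensation graph) have already been computed, e.g., with Tarjan's algorithm, the levels can also be obtained with a simple sweep such as the following C++ sketch; classArcs and the function name are illustrative assumptions, not the code of [26].

#include <algorithm>
#include <vector>

// classArcs[c] lists the classes directly reachable from class c in the
// condensation graph (one node per SCC). Returns level[c]: 0 for closed
// classes, otherwise 1 + the maximum level of the classes it points to, so
// every arc leaving a class of level p ends in a class of a lower level.
std::vector<int> classLevels(const std::vector<std::vector<int>>& classArcs) {
    const int K = static_cast<int>(classArcs.size());
    std::vector<int> level(K, -1);                 // -1: not yet assigned
    bool changed = true;
    while (changed) {                              // the condensation graph is acyclic,
        changed = false;                           // so this fixed point is always reached
        for (int c = 0; c < K; ++c) {
            if (level[c] >= 0) continue;
            int lvl = 0;
            bool ready = true;
            for (int d : classArcs[c]) {
                if (level[d] < 0) { ready = false; break; }
                lvl = std::max(lvl, level[d] + 1);
            }
            if (ready) { level[c] = lvl; changed = true; }
        }
    }
    return level;
}

With the levels in hand, the classes of each level can be grouped and processed in ascending order, as required by the decomposition.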

For each level Ln, n = 0, 1, 2, …, let Cnk, k ∈ {1, 2, …, K(n)}, be the strongly connected classes corresponding to the nodes of level Ln (see Figure 1). Each class Cnk leads to a restricted MDPnk that is solved independently; the global solution is obtained by combining these partial solutions.

The hierarchical method, used by several researchers for several categories of MDPs to address the "curse of dimensionality" of large MDPs, was described by [27] and later further developed in [11, 26]. It consists of breaking up the state space into small subsets, solving the restricted MDP problems corresponding to these subsets, and combining these solutions to determine the solution of the global problem. Based on the above decomposition technique, the authors propose a hierarchical backward induction (HBI) algorithm that decomposes the original finite-horizon MDP into restricted MDPs corresponding to each SCC. These restricted MDPs are solved independently and according to their level.

A further advantage of the proposed algorithm appears when the initial state is known: to find an optimal policy for a known initial state, the algorithm solves only the restricted MDPs corresponding to the classes reachable from the initial state. For example, in Figure 1, the initial state S0 is in class C10; only the restricted MDPs corresponding to the SCCs C10, C00, and C01 are solved.

In the next section, we will define new MDPs called the restricted finite-horizon MDPs.

4.2. The Restricted Finite-Horizon MDPs

In this work, we consider non-stationary finite-horizon MDPs with large finite state and action spaces. Now, we construct the restricted MDPnk corresponding to each strongly connected class Cnk, k ∈ {1, 2, …, K(n)}, in level Ln as follows:
(i)Let Dnk = { j ∉ Cnk : P_t(j | i, a) > 0 for some i ∈ Cnk, a ∈ A(i), and some t } be the set of boundary states of Cnk (a code sketch of this computation is given at the end of this subsection);
(ii)State space: Snk = Cnk ∪ Dnk;
(iii)Action space: for i ∈ Cnk, Ank(i) = A(i); each boundary state i ∈ Dnk keeps a single (dummy) action;
(iv)Transition probabilities: for t = 1, 2, …, T and i ∈ Cnk, P^{nk}_t(j | i, a) = P_t(j | i, a) for all j ∈ Snk and a ∈ Ank(i);
(v)for i ∈ Dnk, P^{nk}_t(i | i, a) = 1 (the boundary states are absorbing);
(vi)Reward function: for t = 1, 2, …, T,
(vii)if i ∈ Cnk, R^{nk}_t(i, a) = R_t(i, a);
(viii)if i ∈ Dnk, R^{nk}_t(i, a) is defined from V^{mh}_T so that the value of the absorbing state i in MDPnk coincides with its optimal value in the original MDP.

Here, T is the horizon and V^{mh}_T is the optimal value vector of MDPmh calculated at the previous levels.

According to the definition of the restricted finite-horizon MDPs, we remark that the state space Snk of each restricted MDPnk contains only the states of Cnk and its boundary states, so it is much smaller than the original state space S.

The restricted finite-horizon MDPs are solved in ascending order of the levels; within the same level Lp, the restricted finite-horizon MDPs are independent, so they can be solved in parallel.
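
As a small illustration of the construction above, the following C++ sketch collects the boundary set Dnk of a class Cnk from the sparse successor lists of the original MDP (the restricted state space is then Snk = Cnk ∪ Dnk); the helper names are assumptions, and the definition of rewards for boundary states is omitted.

#include <unordered_set>
#include <vector>

// Collects the boundary set D_nk of a class C_nk: the states outside the class
// that can be reached in one step from it. classStates lists the states of
// C_nk and inClass marks them; the restricted state space is C_nk plus the
// returned boundary states.
std::vector<int> boundaryStates(const FiniteHorizonMDP& mdp,
                                const std::vector<int>& classStates,
                                const std::vector<char>& inClass) {
    std::unordered_set<int> boundary;
    for (int i : classStates)
        for (int t = 0; t < mdp.horizon; ++t)
            for (std::size_t a = 0; a < mdp.actions[i].size(); ++a)
                for (const Transition& tr : mdp.succ[t][i][a])
                    if (!inClass[tr.next]) boundary.insert(tr.next);
    return std::vector<int>(boundary.begin(), boundary.end());
}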

4.3. Hierarchical Backward Induction Algorithm

Based on the above restricted finite-horizon MDPs, the authors present in this section a new algorithm called the hierarchical backward induction (HBI) algorithm (Algorithm 3). The main contribution is to show that the optimal value in the restricted MDPpk is equal to the optimal value in the original MDP (Theorem 2).

Now, the corresponding restricted finite-horizon MDPs are constructed and immediately solved using the following procedure (an illustrative code sketch is given after the listing):

(1)HBI (In: MDP, T; Out: (f1, …, fT), VT)
(2)Find the SCCs and their levels using Tarjan's algorithm
(3)For each level Lp, p = 0, 1, …, L, Do
(4) For each class Cpk, k = 1, …, K(p), in level Lp Do
(5)  Construct the restricted finite-horizon MDPpk
(6) End For
(7)End For
(8)For each level Lp, p = 0, 1, …, L, Do
(9) For each class Cpk, k = 1, …, K(p), in level Lp, over the planning horizon T, Do
(10)  ABI (MDPpk)
(11) End For
(12)End For
(13)Return ((f1, …, fT), VT)
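
For concreteness, here is a hedged C++ sketch of one way to organize the HBI loop of Algorithm 3: levels are processed in ascending order, and backward induction is run on the states of each class only, reusing the values already computed for lower-level classes instead of materializing the restricted MDPs explicitly; the helper containers (classesByLevel, classStates) are assumptions.

#include <limits>
#include <vector>

// Hierarchical backward induction driver (illustrative only). Because every
// arc leaving a class ends in a class of a lower level, the values V[t+1][j]
// of states outside the class are already final when they are read, which
// plays the role of the restricted MDPs of Section 4.2. V[0] finally holds
// the optimal value vector V_T.
void hbi(const FiniteHorizonMDP& mdp,
         const std::vector<std::vector<int>>& classesByLevel,  // classes of each level
         const std::vector<std::vector<int>>& classStates,     // states of each class
         std::vector<std::vector<double>>& V,
         std::vector<std::vector<int>>& policy) {
    const int T = mdp.horizon;
    V.assign(T + 1, std::vector<double>(mdp.numStates, 0.0));
    policy.assign(T, std::vector<int>(mdp.numStates, -1));
    for (const std::vector<int>& level : classesByLevel) {      // L0, L1, ...
        for (int c : level) {                                   // classes of this level
            for (int t = T - 1; t >= 0; --t) {
                for (int i : classStates[c]) {
                    double best = -std::numeric_limits<double>::infinity();
                    int bestA = -1;
                    for (std::size_t a = 0; a < mdp.actions[i].size(); ++a) {
                        double q = mdp.reward[t][i][a];
                        for (const Transition& tr : mdp.succ[t][i][a])
                            q += tr.prob * V[t + 1][tr.next];   // in-class or lower-level value
                        if (q > best) { best = q; bestA = static_cast<int>(a); }
                    }
                    V[t][i] = best;
                    policy[t][i] = bestA;
                }
            }
        }
    }
}

The classes of a same level never read each other's values, so the loop over the classes of a level could be parallelized, as noted in Section 4.2.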

The following theorem shows the validity of the HBI algorithm.

Theorem 2. Let R = (f1, f2, …, fT) and VT be, respectively, an optimal policy and the optimal value vector of the original MDP. If (f^{pk}_1, …, f^{pk}_T) and V^{pk}_T are, respectively, an optimal policy and the optimal value vector of the restricted MDPpk, then for all i ∈ Cpk and for t = 1, 2, …, T, f^{pk}_t(i) is an optimal action in the original MDP and V^{pk}_T(i) = VT(i).

Proof. The proof is by induction on the level. For level L0, let (f^{0k}_t)_{t=1,…,T} and x^{0k} be, respectively, an optimal policy and the optimal value vector of the restricted MDP0k.
According to Theorem 1, we have, for t = T, T − 1, …, 1,

x^{0k}_t(i) = max_{a ∈ A0k(i)} { R^{0k}_t(i, a) + Σ_{j ∈ S0k} P^{0k}_t(j | i, a) x^{0k}_{t+1}(j) }.    (3)

From the definition of the restricted MDP0k, the state space is S0k = C0k, the action space is A0k(i) = A(i) for i ∈ S0k, and, for t = T, T − 1, …, 1, the transition probabilities are P^{0k}_t(j | i, a) = P_t(j | i, a) for all i, j ∈ C0k and a ∈ A0k(i), and the rewards are R^{0k}_t(i, a) = R_t(i, a).
Furthermore, the class C0k is closed, so Σ_{j ∈ C0k} P_t(j | i, a) = 1 for t = 1, …, T. From (3), we have

x^{0k}_t(i) = max_{a ∈ A(i)} { R_t(i, a) + Σ_{j ∈ S} P_t(j | i, a) x_{t+1}(j) } = x_t(i);

therefore, for all i ∈ C0k and t = 1, …, T, f^{0k}_t(i) is an optimal action for the global MDP and x^{0k}_t(i) = x_t(i); in particular, V^{0k}_T(i) = VT(i).
Suppose that the result is true up to level p − 1. Now we show that the result still holds at level p.
The state space of the restricted MDPpk is Spk = Cpk ∪ Dpk, where Dpk = { j ∉ Cpk : P_t(j | i, a) > 0 for some i ∈ Cpk, a ∈ A(i), and some t }. Let (f^{pk}_t)_{t=1,…,T} and x^{pk} be, respectively, an optimal policy and the optimal value vector of the restricted MDPpk.
According to Theorem 1, we have, for t = T, T − 1, …, 1,

x^{pk}_t(i) = max_{a ∈ Apk(i)} { R^{pk}_t(i, a) + Σ_{j ∈ Spk} P^{pk}_t(j | i, a) x^{pk}_{t+1}(j) }.    (5)

Based on the definition of the restricted MDPpk, for i ∈ Spk with i ∈ Cpk, Apk(i) = A(i) and, for a ∈ Apk(i), the rewards are R^{pk}_t(i, a) = R_t(i, a) and the transition probabilities are P^{pk}_t(j | i, a) = P_t(j | i, a) for all j ∈ Spk.
In fact, all the successors of the states of Cpk belong to Spk and, for every boundary state j ∈ Dpk, x^{pk}_{t+1}(j) = x_{t+1}(j) (as shown below); hence, for t = 1, …, T, x^{pk}_t(i) satisfies the same recursion as x_t(i) on Cpk. Consequently, for i ∈ Cpk and t = 1, …, T, f^{pk}_t(i) is an optimal action for the global MDP and x^{pk}_t(i) = x_t(i).
Now, it remains to treat the case where i ∈ Dpk, i.e., i ∈ Cmh for some level m < p.
From the induction hypothesis, f^{mh}_t(i), calculated at a previous level, is an optimal action for the global MDP; it remains to verify that x^{pk}_t(i) = x_t(i).
Since i ∈ Dpk is absorbing in MDPpk, for t = 1, …, T we have P^{pk}_t(i | i, a) = 1, and R^{pk}_t(i, a) is defined from the optimal values of i computed at the previous levels. It follows from (5) that x^{pk}_t(i) = x_t(i), which completes the proof.

Remark 1. If the initial state i0 is known, its optimal action f(i0) and its optimal value VT(i0) are computed by solving just a few restricted MDPs; one does not need to consider all states. The following procedure (Algorithm 4) explains this issue.
It is clear that f(i0) and VT(i0) are then obtained by solving only MDPmk. ♦
To demonstrate the benefit of the proposed HBI algorithm, we consider a case study of the racetrack problem described in the following section.

Step 1. Determine the class Cmk such that i0 ∈ Cmk.
Step 2. Determine the classes Cnh, n ∈ {0, 1, …, m}, h ∈ {1, 2, …, K(n)}, accessible from Cmk, i.e., the classes containing the ends of the arcs emanating from Cmk and, recursively, from those classes (a reachability sketch is given after these steps).
Step 3. Solve the restricted MDPnh found in Step 2, in ascending order of their levels.
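
The classes required by Steps 1 and 2 can be found by a simple reachability search in the condensation graph, as in the following C++ sketch; classOf and classArcs are assumed to be available from the decomposition step.

#include <queue>
#include <vector>

// Returns the classes accessible from the class containing the initial state
// i0 (including that class): only the restricted MDPs of these classes have
// to be solved to obtain f(i0) and V_T(i0).
std::vector<int> reachableClasses(int i0,
                                  const std::vector<int>& classOf,            // classOf[i] = SCC of state i
                                  const std::vector<std::vector<int>>& classArcs) {
    std::vector<char> seen(classArcs.size(), 0);
    std::queue<int> pending;
    pending.push(classOf[i0]);
    seen[classOf[i0]] = 1;
    std::vector<int> result;
    while (!pending.empty()) {
        int c = pending.front(); pending.pop();
        result.push_back(c);
        for (int d : classArcs[c])
            if (!seen[d]) { seen[d] = 1; pending.push(d); }
    }
    return result;   // to be solved in ascending order of their levels
}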

5. Case Study and Experimental Results

To show the advantage of the proposed HBI algorithm, we consider a standard control problem, the racetrack problem, described by Martin Gardner [28] and Barto [29]. The goal is to control the movement of a race car along a predefined racetrack so that the racer gets from the starting line to the finish line in the minimum amount of time.

At each time step, the state of the racer is given by the tuple (Xt, Yt, Vx(t), Vy(t)) that represents the position and speed of the car in the x and y dimensions at time t. The actions are pairs a = (ax, ay) of instantaneous accelerations, where ax, ay ∈ {−1, 0, 1}. We assume that the road is "slippery" and the car may fail to accelerate. An action a = (ax, ay) has its intended effect 90% of the time; 10% of the time the action effects correspond to those of the action a0 = (0, 0). Also, when the car hits a wall, its velocity is set to zero and its position is left intact. When the car is in state (Xt, Yt, Vx(t), Vy(t)) and the action taken is a = (ax, ay), it moves with 90% probability to the state (Xt + Vx(t) + ax, Yt + Vy(t) + ay, Vx(t) + ax, Vy(t) + ay).

Let s = (Xt, Yt, Vx(t), Vy(t)), a = (ax, ay), s″ = (Xt + Vx(t) + ax, Yt + Vy(t) + ay, Vx(t) + ax, Vy(t) + ay), and let s′ = (Xt + Vx(t), Yt + Vy(t), Vx(t), Vy(t)) be the state reached when the acceleration fails (i.e., under a0 = (0, 0)). The transition probability is defined as follows: P(s″ | s, a) = 0.9 and P(s′ | s, a) = 0.1.

To complete the formulation of the finite-horizon MDP problem, we need to define the reward function and the horizon. Independently of the action taken, the immediate reward for all non-goal states is equal to −1, i.e., Ri=  − 1 and it is equal to zero for any goal state reached, i.e., Rg=0. The horizon is determined after the decomposition into levels; indeed, during this decomposition, the maximum level obtained will be considered as the horizon.

To keep the state space finite, we assume that the speeds of the car in the x and y dimensions are bounded to the range [−7, +7], i.e., Vx(t), Vy(t) ∈ [−7, +7]. The speed will not change if the agent attempts to accelerate beyond these limits.
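
To make the racetrack dynamics concrete, the following C++ sketch computes the intended (non-slipping) successor of a state under an acceleration action with the speed bounds above; wall collisions and goal detection depend on the track map and are only indicated in comments. The names are illustrative and not taken from the paper's implementation.

#include <algorithm>

struct CarState { int x, y, vx, vy; };   // (X_t, Y_t, Vx(t), Vy(t))

// Intended successor of a state under the acceleration a = (ax, ay), reached
// with probability 0.9; with probability 0.1 the car behaves as if a0 = (0, 0)
// had been applied (call the function with ax = ay = 0 for that case).
// Speeds are kept in [-7, +7]: since ax, ay are in {-1, 0, 1}, clamping is
// equivalent to leaving the speed unchanged when the limit would be exceeded.
CarState intendedSuccessor(const CarState& s, int ax, int ay) {
    CarState n = s;
    n.vx = std::clamp(s.vx + ax, -7, 7);
    n.vy = std::clamp(s.vy + ay, -7, 7);
    n.x = s.x + n.vx;
    n.y = s.y + n.vy;
    // If the move to (n.x, n.y) crosses a wall of the track, the velocity is
    // reset to zero and the position is left intact (track-map test omitted).
    return n;
}

The immediate reward attached to such a transition is −1 unless the reached cell is a goal cell, in which case it is 0, as defined above.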

The proposed algorithms are tested on an Intel(R) Core(TM) i7-6500U (2.6 GHz) machine, using a C++ implementation under the Windows 10 (64-bit) operating system.

Figure 2 presents the three racetracks considered; blue cells represent the initial states, and green cells represent the goal states.

Table 1 presents the horizon, the number of SCCs, and the number of possible states obtained with the decomposition into levels for the three considered racetrack problems. As can be seen, the number of states is reduced. Table 2 reports the execution times of the compared algorithms.

Figures 3-5 show the policies obtained with the VI, BI, and HBI algorithms for the three racetrack problems. As can be seen, we obtain the same policy with the three algorithms.

6. Conclusion

In this paper, we have presented a new hierarchical backward induction algorithm for finite-horizon non-stationary MDPs that scales to large state spaces. It consists in decomposing the original problem into smaller restricted MDPs; indeed, at each level and for each SCC, a restricted finite-horizon MDP is constructed and solved independently from the other restricted MDPs of the same level. The proposed method accelerates the computation and reduces the memory requirement.

To show the advantage of the proposed HBI algorithm, we applied it to racetrack problems. The experimental results show that the HBI algorithm outperforms the standard BI and value iteration algorithms. As a perspective, the use of parallelism techniques could further accelerate the convergence for hierarchical finite-horizon MDPs.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.