Abstract

As classical methods are intractable for Markov decision processes (MDPs) with large state spaces, decomposition and aggregation techniques are very useful for coping with large problems. These techniques are in general special cases of the classic Divide-and-Conquer framework: a large, unwieldy problem is split into smaller components, and the parts are solved in order to construct the global solution. This paper reviews most of the decomposition approaches encountered in the associated literature over the past two decades, weighing their pros and cons. We consider several categories of MDPs (average, discounted, and weighted MDPs), and we briefly present a variety of methodologies for finding or approximating optimal strategies.

1. Introduction

This survey focuses on decomposition techniques for solving large MDPs. However, in this section we begin by briefly discussing some approaches for treating large Markov chain models, because they contain the basic ideas behind methods for tackling large MDPs.

1.1. Large Markov Chain Models

Markov chain models are still the most general class of analytic models widely used for performance and dependability analysis. This class of problems is robust in terms of its ability to represent a very broad class of models of interest. Unfortunately, the characteristics of many models (complex interactions between components, sophisticated scheduling strategies, synchronization among parallel tasks, etc.) preclude the possibility of closed-form solutions, in general. Numerical solution methods are the most general solution methods [1]. The most pervasive practical limitation to the use of numerical solution techniques is the size of the state space of realistic models. One natural way to deal with this problem is via state space reduction techniques, that is, a transformation of the model to one with fewer states. Another is to modify the model to one with a structure which enables efficient solution. The transformed model in general is an approximation of the original model. Several classes of such transformations are described below.

There are several methods of state space reduction that are applicable to Markov chain models. One such method is combining “lumpable states” [2]. When lumping applies, it is exact, but only partial information is retrievable from the solution except in very special circumstances. Also, lumping is only valid for models that have certain very limited types of structure.
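
As a minimal illustration of ordinary lumpability, the following Python sketch (with a hypothetical four-state chain, not taken from [2]) checks whether a given partition of a transition matrix is lumpable and, when it is, builds the lumped chain:

```python
import numpy as np

def lump(P, partition, tol=1e-9):
    """Check ordinary lumpability of a Markov chain with transition matrix P
    under `partition` (a list of lists of state indices) and, if it applies,
    return the lumped transition matrix."""
    k = len(partition)
    Q = np.zeros((k, k))
    for a, block_a in enumerate(partition):
        for b, block_b in enumerate(partition):
            # total probability of jumping from each state of block_a into block_b
            mass = P[np.ix_(block_a, block_b)].sum(axis=1)
            if np.ptp(mass) > tol:          # rows disagree -> partition not lumpable
                raise ValueError("partition is not lumpable")
            Q[a, b] = mass[0]
    return Q

# hypothetical 4-state chain that is lumpable with respect to {0,1} and {2,3}
P = np.array([[0.1, 0.3, 0.4, 0.2],
              [0.2, 0.2, 0.1, 0.5],
              [0.3, 0.3, 0.2, 0.2],
              [0.5, 0.1, 0.3, 0.1]])
print(lump(P, [[0, 1], [2, 3]]))
```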

Another method to treat large state spaces is aggregation/disaggregation [3]. This applies particularly well to cases in which the system model can be viewed as an interacting set of tightly coupled subsystems. The solution method is generally an iterative one in which submodels are solved and the obtained results are used to adjust the submodels repetitively until a convergence criterion is met. This is an efficient procedure if (a) the model is decomposable into tightly coupled submodels, (b) the state space of each submodel is practically solvable, and (c) the number of such submodels is not too large.

There is an additional property of many models that is not capitalized on by any of the above methods. In many important cases, although the state space is extremely large, the stationary state probability distribution is highly skewed; that is, only a small subset of the states accounts for the vast majority of the probability mass. This is easily illustrated by considering the nature of several modeling application areas. This observation indicates that most of the probability mass is concentrated in a relatively small number of states in comparison to the total number of states in the model. During its lifetime, the system spends most of its time in this relatively small subset of states and only rarely reaches other states. This observation motivates truncation of the model state space, that is, choosing to represent only a subset of the more “popular” states and ignoring the remainder, deleting transitions from the represented states to the “discarded states”. Some form of heuristic is often used to decide which states to retain and which to discard. For dependability models, a simple heuristic might be to discard all states in which the system has more than a certain number of failed components. A more sophisticated heuristic for determining which states to retain is described in [4]. The most persistent practical constraint on the use of such state space truncation is that in certain models (health management, reliability, security problems) deleting some infrequently visited states may have serious consequences, since these rare states can correspond to critical events.

There is also the issue of how much error is introduced by state space truncation. For some transient measures, error bounds are easily obtained by introducing trap states [5]. For example, in dependability analysis we may be interested in interval reliability. For this purpose one can introduce a trap state and change all transitions to “discarded states” into transitions to the trap state. The magnitude of the error is then a direct function of the probability that the system is in the trap state at the end of the interval. For steady-state measures, the issue of error bounds can be more difficult. There are a number of approaches available for computing bounds or for proving that one model provides a bound compared with some other model. Among these we can quote the method of Courtois and Semal [6, 7], the methods based on sample-path analysis and stochastic ordering [8, 9], and the bias terms approach of Van Dijk [10].
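
The trap-state construction can be sketched as follows; the function names and the randomly generated example chain are ours, and the snippet only conveys the idea of bounding the truncation error for a transient measure by the probability mass absorbed in the trap:

```python
import numpy as np

def truncate_with_trap(P, keep):
    """Keep only the states in `keep` and redirect every transition that leaves
    this set into an absorbing trap state appended as the last state."""
    k = len(keep)
    Q = np.zeros((k + 1, k + 1))
    Q[:k, :k] = P[np.ix_(keep, keep)]
    Q[:k, k] = 1.0 - Q[:k, :k].sum(axis=1)   # probability mass that escaped -> trap
    Q[k, k] = 1.0                            # the trap is absorbing
    return Q

def trap_mass(Q, p0, n_steps):
    """Probability of having been absorbed in the trap after n_steps; this
    bounds the error made on a transient measure over that interval."""
    p = np.append(p0, 0.0)
    for _ in range(n_steps):
        p = p @ Q
    return p[-1]

# hypothetical example: retain states 0..2 of a random 5-state chain
P = np.random.dirichlet(np.ones(5), size=5)
Q = truncate_with_trap(P, keep=[0, 1, 2])
print(trap_mass(Q, p0=np.array([1.0, 0.0, 0.0]), n_steps=20))
```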

In [11], the authors explain with some examples how model modifications that affect only the less popular states have little effect on the performance measures. They also present some methods to compute bounds on performance measures.

1.2. Large Markov Decision Models

Over the past five decades there has been a lot of interest within the planning community in using stochastic models, which provide a good framework to cope with uncertainties. Among them, Markov decision processes [12–22], either fully observable (MDP or FOMDP) or partially observable (POMDP), have been the subject of several recent studies [23–30]. The optimal strategy is computed with respect to all the uncertainties, but unfortunately, the classical algorithms used to build these strategies are intractable with the twin drawbacks of large environments and lack of model information in most real-world systems [31–33]. Several recent studies aim to obtain a good approximation of the optimal strategy. Among them, aggregation and decomposition techniques have been the subject of a lot of attention. These techniques are really different flavors of the well-known Divide-and-Conquer framework: partitioning the state space into regions, that is, transforming the initial MDP into small local MDPs, independently solving the local MDPs, and combining the local solutions to obtain an optimal (or near-optimal) strategy for the global MDP.

In preparing this survey, we first attempt to summarize some general decomposition approaches introduced in [34–38]. These works propose algorithms that compute optimal strategies, as opposed to nearly all other methods for treating large MDPs, which compute only near-optimal strategies. Moreover, they are suitable for several categories of MDPs (average, discounted, and weighted MDPs) and can be applied to many practical planning problems. These algorithms are based on the graph associated with the original MDP and on a hierarchical structure introduced on this graph.

A lot of other recent works are concerned with autonomous exploration systems which require planning under uncertainty. In [39], the authors present a key decomposition technique to cope with the large state spaces of practical problems encountered in autonomous exploration systems. They assume that an expert has given a partition of the state space into regions; a strategy is independently computed for each region. These local solutions are pieced together to obtain a global solution. Subproblems correspond to local MDPs over regions associated with a certain parameter that provides an abstract summary of the interactions among regions. Next, two algorithms are presented for combining solutions to subproblems. The first one, called Hierarchical Policy Construction, solves an abstract MDP (each region becomes an abstract state). This algorithm finds for each region an optimal local strategy, that is, an optimal region to reach. The second algorithm, called Iterative Improvement Approach, iteratively approximates parameters of the local problems to converge to an optimal solution. In [40], the approach is analogous to the latter one but aims to show how domain characteristics are exploited to define the way regions communicate with each other. The authors represent the MDP as a directed graph, which makes it possible to precisely define the cost of going from one region to another. The approach in [41] is quite different. The authors present a particular structured representation of MDPs, using probabilistic STRIPS rules. This representation makes it possible to quickly generate clusters of states, which are used as states of an abstract MDP. The solution to this abstract MDP can be used as an approximate solution for the global MDP.

Other authors aim to use macro-actions to efficiently solve large MDPs [42, 43]. Given a partition of the state space, a set of strategies is computed for each region. A strategy in a region is called a macro-action. Once the transition and reward functions have been transformed to cope with macro-actions, an abstract MDP is solved using these macro-actions. The problem with this approach is that the quality of the solution is highly dependent on the quality of the set of strategies computed for each region. To ensure high quality, a large set of macro-actions has to be computed. As the transition and reward functions have to be computed for each macro-action, the time needed to find macro-actions can outweigh the speedup they provide during strategy computation. The approach in [44] is quite similar, but based on Reinforcement Learning (RL).

It is important to note that the above approaches differ essentially in the following respects: the choice of the regions, the manner in which the local solutions are combined, and the quality of the final solution (optimal or only near optimal). For instance, in [35, 36] the authors use graph theory and choose the communicating classes as regions, whereas in [39, 40, 45, 46] they use the geometry of the domain to determine the regions. Note also that most of these state space decomposition methods are worthwhile only if (a) each subproblem is practically solvable, (b) the number of such subproblems is not too large, and (c) combining the subproblems is not difficult.

On the other hand, most research in Reinforcement Learning (RL) is based on the theoretical discrete-time state and action formalism of the MDP. Unfortunately, as stated before, it suffers from the curse of dimensionality [47], where the explicit state and action space enumerations grow exponentially with the number of state variables and the number of agents, respectively. To deal with this issue, RL introduces Monte Carlo methods, stochastic approximation, trajectory sampling, temporal difference backups, and function approximation. However, even these methods have reached their limits. As a result, we briefly discuss in this survey broad categorizations of factored and hierarchical approaches which break up a large problem into smaller components and solve the parts [48–58]. The main idea of these approaches is to leverage the structure present in most real-world domains. For instance, the state of the environment is much better described in terms of the values of the state variables than by a monolithic number. This fact leads to the powerful concepts of state aggregation and abstraction.

In order to solve linear programs of very large size, decomposition principles [59] that divide a large linear program into many correlated linear programs of smaller size have been well studied, among which the Dantzig-Wolfe decomposition [60] may be the best known. Further, in view of [61], MDPs can be solved as linear programs using Dantzig-Wolfe decomposition. Thus, in this paper we briefly present the classical Dantzig-Wolfe decomposition procedure.

The paper is organized as follows. Section 2 presents the problem formulation. Section 3 treats extensively some decomposition approaches introduced in [34–36, 38]. Section 4 presents the decomposition technique proposed by Dean and Lin [39]. Section 5 reviews some Reinforcement Learning methods proposed in the associated literature to alleviate the curse of dimensionality. Section 6 briefly describes the classic Dantzig-Wolfe decomposition procedure. We conclude and discuss open issues in Section 7.

2. Markov Decision Processes

We consider a stochastic dynamic system which is observed at discrete time points $t = 0, 1, 2, \ldots$. At each time point $t$, the state of the system is denoted by $X_t$, where $X_t$ is a random variable whose values lie in a finite state space $S$. At each time point $t$, if the system is in state $i$, an action $a$ has to be chosen from a finite action set $A(i)$. In this case, two things happen: a reward $r(i, a)$ is earned immediately, and the system moves to a new state $j$ according to the transition probability $p(j \mid i, a)$. Let $A_t$ be the random variable which represents the action chosen at time $t$. We denote by $H_t$ the set of all histories up to time $t$ and by $P(A(i))$ the set of probability distributions over $A(i)$.

A strategy $\sigma$ is defined by a sequence $\sigma = (\sigma_0, \sigma_1, \ldots)$, where $\sigma_t : H_t \to P(A)$ is a decision rule. A Markov strategy is one in which $\sigma_t$ depends only on the current state at time $t$. A stationary strategy is a Markov strategy with identical decision rules. A deterministic (or pure) strategy is a stationary strategy whose single decision rule is nonrandomized. An ultimately deterministic strategy is a Markov strategy $\sigma$ for which there exist a deterministic strategy $f$ and an integer $T$ such that $\sigma_t = f$ for all $t \geq T$. Let $\Pi$, $\Pi_M$, $\Pi_S$, $\Pi_D$, and $\Pi_{UD}$ be the sets of all strategies, Markov strategies, stationary strategies, deterministic strategies, and ultimately deterministic strategies, respectively.

Let $P_\sigma(X_t = i, A_t = a \mid X_0 = i_0)$ be the conditional probability that at time $t$ the system is in state $i$ and the action taken is $a$, given that the initial state is $i_0$ and the decision maker uses a strategy $\sigma$. Now, if $r(X_t, A_t)$ denotes the reward earned at time $t$, then, for any strategy $\sigma$ and initial state $i_0$, the expectation of this reward is given by $E_{\sigma, i_0}[r(X_t, A_t)] = \sum_{i \in S} \sum_{a \in A(i)} P_\sigma(X_t = i, A_t = a \mid X_0 = i_0)\, r(i, a)$.

The manner in which the resulting stream of expected rewards is aggregated defines the MDPs discussed in the sequel.

In the discounted reward MDP, the corresponding overall reward criterion is defined by $V_\beta(\sigma, i_0) = \sum_{t=0}^{\infty} \beta^t E_{\sigma, i_0}[r(X_t, A_t)]$, where $\beta \in (0, 1)$ is a fixed discount rate. A strategy $\sigma^*$ is called discounted optimal if $V_\beta(\sigma^*, i_0) \geq V_\beta(\sigma, i_0)$ for all $\sigma \in \Pi$ and $i_0 \in S$. We will denote this MDP by $\mathrm{MDP}_\beta$.

In the average reward MDP, the overall reward criterion is defined by $V(\sigma, i_0) = \liminf_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} E_{\sigma, i_0}[r(X_t, A_t)]$. A strategy $\sigma^*$ is called average optimal if $V(\sigma^*, i_0) \geq V(\sigma, i_0)$ for all $\sigma \in \Pi$ and $i_0 \in S$. We will denote this MDP by $\mathrm{MDP}_{av}$.

In the weighted reward MDP, the overall reward criterion is defined by the convex combination $V_\lambda(\sigma, i_0) = \lambda V(\sigma, i_0) + (1 - \lambda) V_\beta(\sigma, i_0)$, where $\lambda \in [0, 1]$ is a fixed weight parameter and $\beta$ is the discount rate in the MDP $\mathrm{MDP}_\beta$. We denote this MDP by $\mathrm{MDP}_\lambda$. A strategy $\sigma^*$ is called optimal if $V_\lambda(\sigma^*, i_0) \geq V_\lambda(\sigma, i_0)$ for all $\sigma \in \Pi$ and $i_0 \in S$. Let $\epsilon > 0$; for any $i_0 \in S$, a strategy $\sigma^*$ is called $\epsilon$-optimal at $i_0$ if $V_\lambda(\sigma^*, i_0) \geq \sup_{\sigma} V_\lambda(\sigma, i_0) - \epsilon$. A strategy is called $\epsilon$-optimal if it is $\epsilon$-optimal at all $i_0 \in S$.
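
For a fixed stationary strategy $f$ inducing a transition matrix $P_f$ and a reward vector $r_f$, the three criteria can be evaluated directly; the sketch below assumes an irreducible induced chain and, for illustration only, takes the weighted criterion to be the convex combination of the average and discounted values with weight $\lambda$:

```python
import numpy as np

def discounted_value(P_f, r_f, beta):
    """v_beta = (I - beta P_f)^(-1) r_f for a fixed stationary strategy f."""
    n = len(r_f)
    return np.linalg.solve(np.eye(n) - beta * P_f, r_f)

def average_value(P_f, r_f):
    """Average (gain) value, assuming the chain induced by f is irreducible:
    g = pi . r_f, with pi the stationary distribution of P_f."""
    n = len(r_f)
    A = np.vstack([P_f.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.full(n, pi @ r_f)

def weighted_value(P_f, r_f, beta, lam):
    """Illustrative weighted criterion: lam * average + (1 - lam) * discounted."""
    return lam * average_value(P_f, r_f) + (1 - lam) * discounted_value(P_f, r_f, beta)

# toy two-state example under a fixed strategy
P_f = np.array([[0.9, 0.1], [0.2, 0.8]])
r_f = np.array([1.0, 5.0])
print(discounted_value(P_f, r_f, 0.95))
print(average_value(P_f, r_f))
print(weighted_value(P_f, r_f, 0.95, 0.5))
```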

Remark 2.1. (i) It is well known that each of the first two problems above possesses an optimal pure strategy, and there are a number of finite algorithms for its computation (e.g., see [18, 62–64]). (ii) In [65], the authors consider weighted MDPs, show that optimal strategies may not exist, and propose an algorithm to determine an $\epsilon$-optimal strategy.

3. Some Decomposition Techniques for MDPs

Classical algorithms for solving MDPs cannot cope with the size of the state spaces of typical problems encountered in practice [31–33]. In this section, we summarize two decomposition techniques for tackling this complexity: the first is proposed by Ross and Varadarajan [38] and the second is introduced by Abbad and Boustique [36]. These techniques provide algorithms that compute optimal strategies for several categories of MDPs (average, discounted, and weighted MDPs), as opposed to most solutions for large MDPs, which compute only near-optimal strategies and are suitable for only some types of planning problems. We also present some related works.

3.1. Ross-Varadarajan Decomposition

Considered are discrete-time MDPs with finite state and action spaces under the average reward optimality criterion [38]. We begin by introducing some notions of communication for MDPs.

Definition 3.1. A set of states $C \subseteq S$ communicates if, for any two states $i, j \in C$ with $i \neq j$, there exists a pure strategy $f$ such that $j$ is accessible from $i$ under $f$. An MDP is said to be communicating if the whole state space $S$ communicates.

Definition 3.2. A set of states $C \subseteq S$ strongly communicates if there exists a stationary strategy $f$ such that $C$ is a subset of a recurrent class associated with $f$.

Definition 3.3. A set of states $C$ is a communicating class (a strongly communicating class) if (i) $C$ communicates (strongly communicates), and (ii) if $C' \supseteq C$ and $C' \neq C$, then $C'$ does not communicate (strongly communicate).

Ross and Varadarajan [38] show that there is a unique natural partition of the state space into strongly communicating classes (SCCs) $C_1, \ldots, C_n$ and a set $T$ of states that are transient under any stationary strategy. This decomposition is inspired by Bather's decomposition algorithm [37]: the sets in the Bather decomposition are the strongly communicating classes, and the set of transient states is the union of the remaining sets in [37]. The decomposition was also later formalized by Kallenberg [66], who studied the irreducibility, communicating, weakly communicating, and unichain classification problems for MDPs. A polynomial algorithm is given to compute this partition. Further, Ross and Varadarajan propose an algorithm to solve large MDPs composed of the following steps: (i) solving small MDPs restricted to each SCC, (ii) aggregating each SCC into one state and finding an optimal strategy for the corresponding aggregated MDP, and (iii) combining the solutions found in the previous steps to obtain a solution to the entire MDP.

We now define precisely these two types of MDPs.

The Restricted MDPs
For each $m$, the MDP restricted to the class $C_m$ is denoted by $\mathrm{MDP}_m$ and is defined as follows: the state space is $C_m$; for each $i \in C_m$, the action space is $A_m(i) = \{a \in A(i) : p(j \mid i, a) = 0 \text{ for all } j \notin C_m\}$; the transition probabilities and the rewards are analogous to those in the original problem, restricted to the state space $C_m$ and the action spaces $A_m(i)$.
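
A minimal sketch of this restriction, assuming a dictionary-based encoding in which P[i][a] maps successor states to probabilities and r[i][a] is the immediate reward (the encoding and the toy example are ours):

```python
def restricted_mdp(actions, P, r, C):
    """Restriction of a dict-encoded MDP to a class C: P[i][a] maps successor
    states to probabilities and r[i][a] is the immediate reward.  Only the
    actions whose transitions stay inside C are kept."""
    C = set(C)
    A_C = {i: [a for a in actions[i]
               if all(j in C for j, p in P[i][a].items() if p > 0)]
           for i in C}
    P_C = {i: {a: P[i][a] for a in A_C[i]} for i in C}
    r_C = {i: {a: r[i][a] for a in A_C[i]} for i in C}
    return A_C, P_C, r_C

# hypothetical 3-state MDP in which C = {0, 1}; action "b" leaves the class
actions = {0: ["a", "b"], 1: ["a"], 2: ["a"]}
P = {0: {"a": {1: 1.0}, "b": {2: 1.0}}, 1: {"a": {0: 1.0}}, 2: {"a": {2: 1.0}}}
r = {0: {"a": 1.0, "b": 5.0}, 1: {"a": 0.0}, 2: {"a": 0.0}}
print(restricted_mdp(actions, P, r, C=[0, 1]))
```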

The Aggregated MDP
The aggregated MDP is defined by the following: (i) the state space $\bar{S} = \{1, \ldots, n\} \cup T$, where each strongly communicating class $C_m$ is replaced by a single aggregated state $m$; (ii) the action spaces: for an aggregated state $m$, the available actions are the original state-action pairs of $C_m$ whose transitions may leave the class, together with a "stay" action, and for a state $i \in T$, the original action set $A(i)$; (iii) the transition probabilities, obtained by summing the original probabilities over the states of each class; (iv) the rewards: the original rewards for actions taken in $T$ or leaving a class, and the optimal value of the corresponding restricted MDP for the "stay" action in an aggregated state.
Remark 3.4. For each $m$, the restricted MDP $\mathrm{MDP}_m$ is communicating and can therefore be solved by the simpler linear program given in [51]. As a consequence, all the states in $C_m$ have the same optimal value $v_m$.

The most important step in the above algorithm consists in solving the aggregated MDP, which is itself an MDP and can therefore be solved using classical algorithms [34]. Ross and Varadarajan [38] did not give any new method for solving the aggregated MDP. The aim of [34] is therefore to provide algorithms which exploit the particular structure of the aggregated MDP and improve on the classical ones. The authors consider deterministic MDPs and aggregated MDPs without cycles. In the remainder of this subsection, we briefly present these algorithms.

3.1.1. Deterministic MDPs

The work in [34] first considers deterministic MDPs and shows that the singletons are the only strongly communicating classes of the aggregated MDP. This result makes it possible to prove the correctness of the following simple algorithm, which constructs an optimal strategy for the aggregated MDP.

Algorithm 3.5. Step 1. One has , v for and for . Step 2. Compute such that and determine . For , set and . Step 3. While , , : do , , . Step 4. (i) If , stop: is an aggregated optimal strategy. (ii) If , find such that and determine . For , set , , and .

3.1.2. The Aggregated MDP and Cycles

Let $G$ be the graph associated with the aggregated MDP; that is, the state space of the aggregated MDP represents the set of nodes, and there is an arc from node $i$ to node $j$ if $p(j \mid i, a) > 0$ for some action $a$. We say that an aggregated MDP has no cycle if the associated graph $G$ has no cycle containing two or more nodes. The next algorithm extends Algorithm 3.5 to the case where the original MDP may not be deterministic and its aggregated MDP has no cycle [34].

Algorithm 3.6. Step 1. One has ; for ; for . Step 2. Compute and determine . For , set . Step 3. While there exist , such that do: , , . Step 4. While there exists such that for all , , do: , . Step 5. If , while there exists such that, for all , do: ; if set ; if set and ; . If , stop; is an aggregated optimal strategy.

Remark 3.7. (i) The previous algorithm is applicable whenever the decomposition yields $T = \emptyset$, because in this case the aggregated MDP has no cycle. (ii) In [34], the authors also consider an arbitrary MDP without any condition and present two algorithms for the computation of an aggregated optimal strategy. The latter yield significant simplifications of the classical policy improvement and linear programming algorithms. In their construction, they exploit the fact that the recurrent classes in the aggregated MDP are singletons.

3.2. Abbad-Boustique Decomposition

Abbad and Boustique [36] propose an algorithm to compute an average optimal strategy which is based on the graph associated with the original MDP and on a hierarchical structure introduced on this graph. The main contribution of their approach consists in constructing by induction the levels of the graph $G$ and solving the restricted MDPs corresponding to each level. The local solutions of the latter MDPs immediately provide an optimal strategy for the entire MDP. This state space decomposition into levels is inspired by the work in [67].

Let $G$ be the directed graph associated with the original MDP: the nodes are the states, and there is an arc from $i$ to $j$ if $p(j \mid i, a) > 0$ for some action $a$. A communicating class for the MDP corresponds to a strongly connected component of the graph $G$. Thus, there exists a unique partition of the state space into communicating classes $C_1, C_2, \ldots, C_n$, which can be determined via standard depth-first algorithms [68]. Level 0 is formed by all classes $C_j$ that are closed. The $k$th level is formed by all classes $C_j$ such that the end of any arc emanating from $C_j$ lies in some level $l$ with $l < k$.
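
A possible sketch of this construction, using the networkx library as a convenience (the graph encoding and function name are ours; the cited works use standard depth-first search directly):

```python
import networkx as nx

def classes_and_levels(P):
    """P[i][a] = {j: prob}.  Build the directed graph with an arc i -> j
    whenever some action moves i to j with positive probability, take its
    strongly connected components as the communicating classes, and assign
    level 0 to closed classes and level k to classes whose outgoing arcs all
    end in levels below k."""
    G = nx.DiGraph()
    G.add_nodes_from(P)
    G.add_edges_from((i, j) for i in P for a in P[i]
                     for j, p in P[i][a].items() if p > 0)
    C = nx.condensation(G)                       # DAG of communicating classes
    level = {}
    for c in reversed(list(nx.topological_sort(C))):
        succ = list(C.successors(c))
        level[c] = 0 if not succ else 1 + max(level[d] for d in succ)
    classes = {c: sorted(C.nodes[c]["members"]) for c in C}
    return classes, level

# toy example: {0, 1} is a non-closed class (level 1), {2} is closed (level 0)
P = {0: {"a": {1: 1.0}}, 1: {"a": {0: 0.5, 2: 0.5}}, 2: {"a": {2: 1.0}}}
print(classes_and_levels(P))
```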

Let $C_1^k, \ldots, C_{n_k}^k$ be the classes corresponding to the nodes in level $k$. The restricted MDPs corresponding to each level $k$, $k \geq 0$, are constructed, by induction, as follows.

3.2.1. Construction of the Restricted MDPs in Level 0

For each $m$, we denote by $\mathrm{MDP}(C_m^0)$ the restricted MDP corresponding to the class $C_m^0$, that is, the restricted MDP in which the state space is $C_m^0$. Note that any such restricted MDP is well defined, since any class in level 0 is closed, and it can be solved easily by a finite algorithm (see [69]).

3.2.2. Construction of the Restricted MDPs in Level $k$, $k \geq 1$

Let $k \geq 1$, and let $v(j)$ denote the optimal value in state $j$ computed in the restricted MDP of a lower level. For each class $C_m^k$ in level $k$, we denote by $\mathrm{MDP}(C_m^k)$ the MDP defined by the following. (i) State space: $C_m^k$ together with the states of lower levels reachable in one transition from $C_m^k$. (ii) Action spaces: for each state in $C_m^k$, the action space of the original MDP; each added lower-level state is absorbing. (iii) Transition probabilities: those of the original MDP, restricted to this state space. (iv) Rewards: for the states of $C_m^k$, the rewards of the original MDP; an absorbing lower-level state $j$ earns its previously computed value $v(j)$.
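
The following Python sketch reflects our reading of this construction (the encoding and helper name are ours; the exact definitions are in [36]): lower-level states reachable in one step are added as absorbing states that earn their previously computed optimal value:

```python
def level_restricted_mdp(P, r, C, lower_values):
    """Sketch of a level-k restricted MDP: the states of the class C keep their
    original actions, transitions, and rewards, and every lower-level state j
    reachable from C in one step is added as an absorbing state whose reward is
    its previously computed optimal value lower_values[j]."""
    C = set(C)
    border = {j for i in C for a in P[i] for j, p in P[i][a].items()
              if p > 0 and j not in C}
    P_R = {i: {a: dict(P[i][a]) for a in P[i]} for i in C}
    r_R = {i: {a: r[i][a] for a in P[i]} for i in C}
    for j in border:
        P_R[j] = {"stay": {j: 1.0}}            # absorbing lower-level state
        r_R[j] = {"stay": lower_values[j]}     # carries its already-known value
    return C | border, P_R, r_R
```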

The basic result of the Abbad-Boustique approach shows that the optimal value of a fixed state in any restricted MDP is equal to its optimal value in the original MDP. Consequently, optimal actions in the restricted MDPs are still optimal in the original MDP. Such an approach is advantageous because an optimal action and the optimal value for a fixed state can be computed from only a few restricted MDPs, without solving the entire MDP; however, there is still considerable overhead in determining the communicating classes.

The work in [35] is a major work related to the Abbad-Boustique decomposition. The authors consider the discounted and weighted optimality criteria with finite state and action spaces. Under these criteria the Ross-Varadarajan decomposition is not applicable. That is why the authors use the approach introduced in [36] and construct the levels and the restricted MDPs in a similar way as above. Under the discounted optimality criterion, they also show that optimal actions in the restricted MDPs are still optimal in the original MDP, and they propose an algorithm to find an optimal strategy. Under the weighted optimality criterion, they first propose an algorithm which constructs an $\epsilon$-optimal strategy corresponding to each restricted MDP, whereas in [65] an optimal strategy is constructed for each state by solving the entire MDP. Finally, by coalescing the restricted $\epsilon$-optimal strategies, they present an algorithm which determines an $\epsilon$-optimal strategy in the original MDP.

Remark 3.8. (i) The decomposition approaches presented in [34, 37, 57, 66] are mainly of interest if (a) the cardinalities of the strongly communicating classes (SCCs) $C_1, \ldots, C_n$ and of the set $T$ of transient states are small compared to the cardinality of the state space and (b) the number of SCCs is not too large. However, they are particularly suitable for solving constrained MDPs. (ii) The approaches proposed in [35, 36] alleviate the latter inconvenience; however, there is still considerable overhead in determining the communicating classes.

4. Dean-Lin Decomposition

The study by Dean and Lin [39] is one of the first works that introduces decomposition techniques for planning in stochastic domains. Their framework, as stated before, is also a special case of Divide-and-Conquer: given a partition of the state space into regions, (i) reformulate the problem in terms of smaller MDPs over the subspaces of the individual regions, (ii) solve each of these subproblems, and (iii) combine the solutions to obtain a solution to the original problem. In this section, we briefly discuss the Dean-Lin approach and some related works.

Let $\{S_1, \ldots, S_n\}$ be any partition of the state space $S$ such that $S = \bigcup_m S_m$ and $S_m \cap S_{m'} = \emptyset$ for all $m \neq m'$. We refer to a region $S_m$ as an aggregate (or macro) state. The periphery of an aggregate state $S_m$, denoted $\mathrm{Periphery}(S_m)$, is the set of states not in $S_m$ but reachable in a single transition from some state in $S_m$, that is, $\mathrm{Periphery}(S_m) = \{j \notin S_m : p(j \mid i, a) > 0 \text{ for some } i \in S_m \text{ and } a \in A(i)\}$.

To model interactions among regions, a set of parameters is introduced. For each $j \in \mathrm{Periphery}(S_m)$, let $\lambda_j$ denote a real-valued parameter. Let $\lambda$ denote the vector of all such parameters, and let $\lambda_m$ denote the subvector of $\lambda$ composed of the $\lambda_j$ for $j \in \mathrm{Periphery}(S_m)$. The parameter $\lambda_j$ serves as a measure of the expected cost of starting from the periphery state $j$ and provides an abstract summary of how the other regions affect $S_m$. Given a particular $\lambda$, the original MDP is decomposed into smaller MDPs. For a region $S_m$ and the subvector $\lambda_m$, a local MDP is defined by the following: (i) the state space $S_m \cup \mathrm{Periphery}(S_m)$; (ii) the state transition matrix: the original transition probabilities for states in $S_m$, while the periphery states are absorbing; (iii) the cost matrix: the original costs for states in $S_m$, and the cost $\lambda_j$ for each $j \in \mathrm{Periphery}(S_m)$.
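
A hedged sketch of such a local problem, using a dictionary encoding of the MDP and treating each periphery state as terminal with fixed value $\lambda_j$ (the function names and the discounted solver are ours, chosen for simplicity rather than taken from [39]):

```python
def local_mdp(P, cost, region):
    """Dean-Lin style local problem for one region: keep the region's own
    transitions and costs, and collect its periphery (the states outside the
    region reachable in a single transition)."""
    region = set(region)
    periphery = {j for i in region for a in P[i] for j, p in P[i][a].items()
                 if p > 0 and j not in region}
    P_loc = {i: {a: dict(P[i][a]) for a in P[i]} for i in region}
    c_loc = {i: {a: cost[i][a] for a in P[i]} for i in region}
    return periphery, P_loc, c_loc

def solve_local(P_loc, c_loc, periphery, lam, beta=0.95, iters=500):
    """Discounted, cost-minimizing value iteration in which each periphery
    state j is treated as terminal with fixed value lam[j], the abstract
    summary of what happens after the process leaves the region."""
    V = {i: 0.0 for i in P_loc}
    V.update({j: lam[j] for j in periphery})
    for _ in range(iters):
        for i in P_loc:
            V[i] = min(c_loc[i][a]
                       + beta * sum(p * V[j] for j, p in P_loc[i][a].items())
                       for a in P_loc[i])
    return V
```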

Let $\pi^*$ denote an optimal strategy for the original MDP. If each parameter $\lambda_j$ equals the expected cost of the periphery state $j$ under $\pi^*$, the authors show that the resulting local strategies for the local MDPs define an optimal strategy on the entire state space. They further propose two methods for either guessing or successively approximating $\lambda_j$ for all $j$: a hierarchical construction approach and an iterative improvement approach.

The former constructs an abstract MDP by considering individual regions as abstract states and their local strategies as abstract actions. The solution to this abstract MDP finds for each region an optimal region to reach, thus yielding a solution to the original MDP. Unfortunately, this approach is not guaranteed to produce an optimal strategy; however, it has an intuitive interpretation that makes it particularly suitable for robot navigation domains.

The second method iteratively approximates $\lambda$ in order to converge to an optimal solution. On each iteration, for each region, a specific estimate of the parameter values of that region is considered, and the resulting local MDP is solved to obtain a local strategy. By examining the resulting local strategies, we obtain information that can be used to generate a new estimate of the parameter values which is guaranteed to improve the global solution. This information about local strategies also tells us when the current solution is optimal or within some specified tolerance, and it is therefore appropriate to terminate the iterative procedure. The iterative approach computes several strategies for each region, which is costly, but it provides an optimal strategy. It is also important to note that this approach is based on a reduction to the method of Kushner and Chen [61], which demonstrates how to solve MDPs as linear programs using Dantzig-Wolfe decomposition [60]. For more details, we refer the reader to the longer version of the paper [70].
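
The loop below is a deliberately simplified sketch of this idea, reusing the local_mdp and solve_local helpers sketched above; Dean and Lin's actual procedure updates the parameters through a linear programming reduction rather than through this plain fixed-point sweep:

```python
def iterative_improvement(P, cost, regions, beta=0.95, sweeps=30):
    """Simplified sketch: repeatedly re-solve every local problem, then feed the
    values computed for each state inside its own region back in as the
    periphery parameters seen by the other regions."""
    lam = {i: 0.0 for i in P}                  # initial guesses for all states
    for _ in range(sweeps):
        new_values = {}
        for R in regions:
            periphery, P_loc, c_loc = local_mdp(P, cost, R)     # helpers above
            V = solve_local(P_loc, c_loc, periphery, lam, beta)
            new_values.update({i: V[i] for i in R})
        lam = new_values                       # refined parameter estimates
    return lam
```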

Closely related to the hierarchical construction approach, two methods have been proposed in [42, 43]. They also solve an abstract MDP composed of one abstract state per region, but many strategies, called macro-actions, are computed in each region. In [43], to compensate for this weakness, only a small set of strategies is calculated per region, without loss of optimality.

The approach in [40] also aims to solve weakly coupled MDPs, but it differs from the Dean-Lin approach in the following respects: (i) only one strategy is computed in each region, which reduces the time consumed on each iteration, and (ii) the MDP is represented as a directed graph, so a simple heuristic on this valued graph is used to estimate periphery state values. This approach constructs strategies that are only near optimal; however, they are computed quickly.

The related work on abstraction and decomposition is extensive. In the area of planning and search assuming deterministic action models, there is the work on macro-operators [71] and hierarchies of state space operators [72, 73]. Closely related is the work on decomposing discrete-event systems modeled as (deterministic) finite state machines [74]. In the area of reinforcement learning, there is work on deterministic action models and continuous state spaces [75] and on stochastic models and discrete state spaces [76]. Finally, the approach described in [46] represents a special case of the Dean-Lin framework, in which the partition consists of singleton sets for all of the states in the envelope and a single set for all the states in the complement of the envelope.

5. Hierarchical Reinforcement Learning

Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns a behavioral policy through direct interaction with an unknown, stochastic environment [77]. Most research in RL is based on the theoretical discrete-time state and action formalism of the MDP. Unfortunately, it suffers from the curse of dimensionality [47], where the explicit state and action space enumerations grow exponentially with the number of state variables and the number of agents, respectively. Thus, RL has introduced Monte Carlo methods, stochastic approximation, trajectory sampling, temporal difference backups, and function approximation. However, even these methods have reached their limits. As a result, we briefly discuss in this section broad categorizations of factored and hierarchical approaches which break up a large problem into smaller components and solve the parts.

5.1. Factored Approaches

When the state space of the MDP can be specified as a cross-product of sets of values of state variables, it is called a factored MDP (FMDP). The concepts of state abstraction and aggregation are strongly related to the idea of a factored state space. A factored formulation also allows the system dynamics to be specified using a more natural and intuitive representation instead of an $|S| \times |S|$ probability matrix per action. Representations that can describe such structure are 2-slice Dynamic Bayesian Networks (DBNs) [78] and probabilistic STRIPS operators, the former being more popular in the literature.
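
A small sketch of a 2-slice DBN-style factored transition model (the variable names, CPDs, and sampling helper are illustrative, not taken from [78]): each next-step variable depends only on a few parent variables of the current step:

```python
import random

# Illustrative 2-slice DBN-style factored transitions: for each action, each
# state variable has a conditional distribution over its next value that
# depends only on a few parent variables of the current step.
DBN = {
    "wait": {
        "battery": {"parents": ["battery"],
                    "cpd": lambda b: {b: 1.0} if b == 0 else {b: 0.9, b - 1: 0.1}},
        "location": {"parents": ["location"],
                     "cpd": lambda loc: {loc: 1.0}},
    },
    "move": {
        "battery": {"parents": ["battery"],
                    "cpd": lambda b: {max(b - 1, 0): 1.0}},
        "location": {"parents": ["location", "battery"],
                     "cpd": lambda loc, b: ({loc + 1: 0.8, loc: 0.2} if b > 0
                                            else {loc: 1.0})},
    },
}

def sample_next_state(state, action, dbn=DBN):
    """Sample each next-step variable independently from its own CPD."""
    nxt = {}
    for var, spec in dbn[action].items():
        dist = spec["cpd"](*(state[p] for p in spec["parents"]))
        outcomes, probs = zip(*dist.items())
        nxt[var] = random.choices(outcomes, weights=probs)[0]
    return nxt

print(sample_next_state({"battery": 3, "location": 0}, "move"))
```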

In [48], the authors exploit such a factored state space directly and achieve reductions in the computation and memory required to compute the optimal solution. The assumption is that the MDP is specified by a set of DBNs, one for each action, although the claim is that the approach is amenable to a probabilistic STRIPS specification too. In addition to using the network structure to elicit variable independence, they use decision-tree representations of the conditional probability distributions (CPDs) to further exploit propositional independence. They then construct the structured policy iteration (SPI) algorithm, which aggregates states for two distinct reasons: either the states are assigned the same action by the current strategy, or they have the same current estimated value. With the aggregation in place, the learning algorithm (based on modified policy iteration) operates only at the coarser level of these state partitions instead of that of the individual states. The algorithm itself is split into two phases, structured successive approximation and structured policy improvement, mirroring the two phases of classical policy iteration. It is important to note that SPI sees fewer advantages if the optimal strategy cannot be compactly represented by a tree structure, and that there is still considerable overhead in finding the state partitions.

In [53], Algebraic Decision Diagrams (ADDs) replace the decision trees of SPI for the value function and strategy representation. The paper deals with a very large MDP (63 million states) and shows that the learned ADD value function representation is considerably more compact than the corresponding learned decision tree in most cases. However, a big disadvantage of using ADDs is that the state variables must be boolean, which makes the modified state space larger than the original.

In order to solve large weakly coupled FMDPs, the state space of the original MDP is divided into regions that comprise sub-MDPs which run concurrently (the original MDP is a cross-product of the sub-MDPs) [79]. It is assumed that each state variable is associated with only one particular task and that the numbers of resources that can be allocated to the individual tasks are constrained; these global constraints are what cause the weak coupling between the sub-MDPs. Their approach contains two phases: an offline phase that computes the optimal solutions (value functions) for the individual sub-MDPs and an online phase that uses these local value functions to heuristically guide the search for a global resource allocation to the subtasks.

One class of methods for solving weakly coupled FMDPs involves the use of linear value function approximation. In [52], the authors present two solution algorithms (based on approximate linear and dynamic programming) that approximate the value functions using a linear combination of basis functions, each basis function depending only on a small subset of the state variables. In [80], a general framework is proposed that can select a suitable basis set and modify it based on the solution quality. Further, they use a piecewise linear combination of the subtask value functions to approximate the optimal value function for the original MDP. The above approaches to solving FMDPs are classified under decision-theoretic planning in that they need a perfect model (transition and reward) of the FMDP. The work in [50] proposes the SDYNA framework, which can learn in large FMDPs without initial knowledge of their structure. SDYNA incrementally builds structured representations using incremental decision-tree induction algorithms that learn from the observations made by the agent.
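
The flavor of linear value function approximation can be sketched as follows (a plain least-squares fit against supplied target values; the cited works use approximate linear or dynamic programming rather than this simple regression):

```python
import numpy as np

def fit_linear_value(samples, basis, V_target):
    """Least-squares fit of weights w so that sum_i w_i * h_i(s) matches the
    supplied target values on the sampled states; each basis function is meant
    to depend on only a few state variables."""
    Phi = np.array([[h(s) for h in basis] for s in samples])
    y = np.array([V_target(s) for s in samples])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# illustrative basis over a two-variable state s = (x1, x2)
basis = [lambda s: 1.0, lambda s: s[0], lambda s: s[1], lambda s: s[0] * s[1]]
samples = [(x1, x2) for x1 in range(5) for x2 in range(5)]
w = fit_linear_value(samples, basis, V_target=lambda s: 2.0 + 0.5 * s[0] - s[1])
print(w)   # recovers roughly [2.0, 0.5, -1.0, 0.0]
```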

5.2. Hierarchical Approaches

To deal with large-scale FMDPs, RL approaches aim to leverage the structure of the state space. However, they do not impart enough structure to the strategy space itself. Hierarchical reinforcement learning (HRL) is a broad and very active subfield of RL that imposes hierarchical structure on the state, action, and strategy spaces. To alleviate the curse of dimensionality, HRL applies principled methods of temporal abstraction to the problem: decision making need not occur at every step; instead, temporally extended activities, macro-operators, or subtasks can be selected to achieve subgoals.

The work in [49] proposes a new approach which relies on a programmer to design a hierarchy of abstract machines that limit the possible strategies to be considered. In this hierarchy, each subtask is defined in terms of goal states or termination conditions. Each subtask in the hierarchy corresponds to its own MDP, and the methods seek to compute a strategy that is locally optimal for each subtask.

Much research in RL allows the learner to work not just with primitive actions but with higher-level, temporally extended actions, called options [55, 57, 58, 81]. An option is a closed-loop policy that operates over a period of time and is defined by a tuple $(\pi, I, \beta)$, where $\pi$ is its strategy, $I$ is the initiation set of states, and $\beta(s)$ is the probability of termination in state $s$. The theory of options is based on the theories of MDPs and semi-Markov decision processes (SMDPs) but extends them in significant ways. Options and models of options can be learned for a wide variety of different subtasks and then rapidly combined to solve new tasks. Using options enables planning and learning simultaneously, at a wide variety of time scales, and toward a wide variety of subtasks. However, options augment rather than simplify the agent's action set, which increases the dimensionality of the action space.
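
A minimal sketch of the option abstraction (the environment interface env_step and the state and action types are assumptions made for illustration):

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option (pi, I, beta): a strategy, an initiation set, and a
    state-dependent termination probability."""
    pi: Callable[[int], str]        # state -> action
    I: Set[int]                     # states in which the option may start
    beta: Callable[[int], float]    # state -> probability of terminating

def run_option(env_step, state, option, gamma=0.95):
    """Execute an option until it terminates and return the final state, the
    discounted reward accumulated along the way, and the elapsed duration
    (the quantities an SMDP-level learner needs)."""
    assert state in option.I
    total, discount, k = 0.0, 1.0, 0
    while True:
        state, reward = env_step(state, option.pi(state))
        total += discount * reward
        discount *= gamma
        k += 1
        if random.random() < option.beta(state):
            return state, total, k
```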

A Hierarchical Abstract Machine (HAM) is a program that reduces the number of decisions by partially representing the strategy in the form of finite state machines (FSMs) with a few nondeterministic choice points [56]. HAMs also exploit the theory of SMDPs, but the emphasis is on restricting the policy space rather than augmenting the action space. A HAM is a collection of triples $(\mu, I, \delta)$, where $\mu$ is a finite set of machine states, $I$ is the initial state, and $\delta$ is the transition function determining the next state using either deterministic or stochastic transitions. The main types of machine states are start (execute the current machine), action (execute an action), call (execute another machine), choice (select the next machine state), and stop (halt execution and return control). Further, for any MDP and any HAM, there exists an induced MDP [56] that works with a reduced search space using single-step and multistep (or high-level) actions. As a consequence, the induced MDP is in fact an SMDP, because actions can take more than one time step to complete. A learning algorithm for the induced SMDP is a variation of $Q$-learning called SMDP $Q$-learning. This algorithm can be applied to the HAM framework using an extended $Q$-table $Q([s, m], a)$, indexed by an environment state $s$, a machine state $m$, and the action $a$ taken at a choice state. Just like options, HAMs have to be expertly designed because they place strict restrictions on the final strategy possible for the original MDP.
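
The SMDP Q-learning update over the extended table can be sketched as follows (a hedged reading of the update form, not Parr and Russell's code; tau is the number of elapsed primitive steps between two choice points):

```python
from collections import defaultdict

Q = defaultdict(float)   # extended table indexed by (env_state, machine_state, action)

def smdp_q_update(s, m, a, reward, tau, s2, m2, choices, alpha=0.1, gamma=0.95):
    """One SMDP Q-learning update at a choice point of a HAM: `reward` is the
    discounted reward accumulated over the tau primitive steps separating the
    two choice points, and `choices` are the actions available at the next
    choice point."""
    best_next = max((Q[(s2, m2, a2)] for a2 in choices), default=0.0)
    Q[(s, m, a)] += alpha * (reward + (gamma ** tau) * best_next - Q[(s, m, a)])
```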

In the MAXQ framework [51], the temporally extended actions or subtasks are organized hierarchically. Faster learning is facilitated by constraining and structuring the space of strategies, encouraging the reuse of subtasks, and enabling effective task-specific state abstraction and aggregation. Unlike options and HAMs, the MAXQ framework does not reduce the original MDP to a single SMDP. Instead, the original MDP is split into a set of sub-SMDPs, where each sub-SMDP represents a subtask.

The main contribution of [82–85] consists in extending the MAXQ framework to the average reward setting, with promising results. The work in [86] is a simple extension of the MAXQ framework to the multiagent setting, and it leverages the structure of the task hierarchy to communicate high-level coordination information among the agents. Learning the structure of the task hierarchy is a very promising area of HRL research. The work in [87] introduces HEX, an algorithm that uses the frequency of change in the state variables to partition the state space into subtasks: the faster a variable changes, the more likely it is to be part of the state abstraction of a lower-level subtask. Empirical results pit HEX against MAXQ (with a predefined hierarchy) and show that, though there is initial overhead in discovering the hierarchy, HEX ends up performing comparably. The work in [88] uses planning to automatically construct task hierarchies based on abstract models of the behaviors' purpose. It then applies RL to flesh out these abstractly defined behaviors and to learn the choices for ambiguous plans.

6. Dantzig-Wolfe Decomposition

Kushner and Chen [61] investigate the use of the Dantzig-Wolfe decomposition in solving large MDPs as linear programs. Thus, in this section we briefly describe this classic decomposition procedure. A more complete description can be found in [89].

We consider linear programming problems with the following block angular structure:
$$\min \sum_{k=1}^{K} c_k^{T} x_k \quad \text{subject to} \quad \sum_{k=1}^{K} A_k x_k = b_0, \qquad B_k x_k = b_k, \quad x_k \geq 0, \quad k = 1, \ldots, K, \tag{6.1}$$
where the $A_k$ are $m_0 \times n_k$ matrices and the $B_k$ are $m_k \times n_k$ matrices with full row rank. Problems of this type can be decomposed by using the Dantzig-Wolfe decomposition. We get $K$ subproblems, corresponding to the constraints $B_k x_k = b_k$, $x_k \geq 0$. Let $x_k^1, \ldots, x_k^{N_k}$ denote the extreme points of the subproblem related to $B_k$. Problem (6.1) can be reformulated as
$$\min \sum_{k=1}^{K} \sum_{j=1}^{N_k} \bigl(c_k^{T} x_k^{j}\bigr) \lambda_k^{j} \quad \text{subject to} \quad \sum_{k=1}^{K} \sum_{j=1}^{N_k} \bigl(A_k x_k^{j}\bigr) \lambda_k^{j} = b_0, \qquad \sum_{j=1}^{N_k} \lambda_k^{j} = 1, \quad \lambda_k^{j} \geq 0, \quad k = 1, \ldots, K. \tag{6.2}$$
Problem (6.2) is called the full master problem. Instead of looking at the full master problem, we only consider a subset of the columns and then generate new columns when they are needed. The reduced or restricted master problem has $m_0 + K$ rows, where the last $K$ rows correspond to the convexity constraints. We assign the dual variables $\pi$ to the first $m_0$ constraints of the restricted master problem and $\mu_k$ to the convexity constraints. In each master iteration, the restricted master problem is solved. Each of the subproblems is then solved with the cost vector $c_k^{T} - \pi^{T} A_k$. Hence, the subproblem to be solved is
$$\min \; \bigl(c_k^{T} - \pi^{T} A_k\bigr) x_k \quad \text{subject to} \quad B_k x_k = b_k, \quad x_k \geq 0.$$

If there exists a solution $x_k^{*}$ with $\bigl(c_k^{T} - \pi^{T} A_k\bigr) x_k^{*} - \mu_k < 0$, then a column with negative reduced cost has been found, and it is introduced into the restricted master. The new column is given as $\bigl(A_k x_k^{*},\, e_k\bigr)$, where $e_k$ is the $k$th unit vector. It has the cost $c_k^{T} x_k^{*}$.
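
The pricing step can be sketched with an off-the-shelf LP solver; the function below (names are ours) solves one block subproblem with the adjusted cost vector and reports the new column when it prices out, assuming the subproblem is bounded:

```python
import numpy as np
from scipy.optimize import linprog

def price_block(c_k, A_k, B_k, b_k, pi, mu_k):
    """Dantzig-Wolfe pricing for block k: solve the subproblem
    min (c_k - A_k^T pi) x  s.t.  B_k x = b_k, x >= 0, and report the new
    master column when the optimal extreme point has negative reduced cost
    with respect to the current restricted-master duals (pi, mu_k)."""
    res = linprog(c_k - A_k.T @ pi, A_eq=B_k, b_eq=b_k,
                  bounds=(0, None), method="highs")
    if not res.success:          # sketch: ignore unbounded/infeasible blocks
        return None
    if res.fun - mu_k < -1e-9:   # the column prices out
        # coupling part A_k x, plus a 1 in the k-th convexity row,
        # with objective coefficient c_k . x
        return {"column": A_k @ res.x, "cost": float(c_k @ res.x), "point": res.x}
    return None
```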

Both the master problem and the subproblems rely on the simplex method for solving the linear program in (6.2). We refer the reader to [90] for more details about (i) cycling prevention and (ii) the initialization of the simplex method using the big-M method. Using the Dantzig-Wolfe decomposition, the solution is improved iteratively and converges to an optimal solution in a finite number of iterations.

7. Conclusion

The benefit of decomposition techniques is that we are able to deal with subproblems of smaller size; the tradeoff is that often extra effort is required to combine the solutions to these subproblems into a solution to the original problem. Thus, some new methods are expected to cope with the difficulty of combining the subproblems.

Many of the approaches discussed in this survey are collections of ideas from separate mature fields coming together to deal with the twin drawbacks of the curse of dimensionality and lack of model information in most real-world systems. For instance, many Hierarchical Reinforcement Learning concepts draw on notions from programming languages such as subroutines, task stacks, and control threads. It remains to be seen which other domains will be imported next.