Computational Intelligence Approaches to Robotics, Automation, and Control
OL-DEC-MDP Model for Multiagent Online Scheduling with a Time-Dependent Probability of Success
Focusing on the on-line multiagent scheduling problem, this paper considers a time-dependent probability of success and processing duration and proposes an OL-DEC-MDP (opportunity loss-decentralized Markov Decision Processes) model that includes opportunity loss in the scheduling decision to improve overall performance. Both the success probability of job processing and the processing duration depend on the time at which processing starts. The probability that an agent completes its assigned job is higher when processing starts earlier, but the opportunity loss can also be high because the agent is engaged for longer. The OL-DEC-MDP model therefore introduces a reward function that accounts for the opportunity loss, which is estimated by predicting upcoming jobs with a sampling method on the job arrivals. Heuristic strategies are introduced to compute the best starting time of each agent for an incoming job, and an incoming job is always scheduled to the agent with the highest reward among all agents under their best starting policies. Simulation experiments show that the OL-DEC-MDP model improves overall scheduling performance in heavily loaded environments compared with models that do not consider opportunity loss.
Problems involving time-dependent success probability extensively exist in manufacturing, industrial, and military domains. One example is the scheduling of a procrastinator , whose speed and success probability of job processing will increase as the due date is approaching. Practice shows that a procrastinator’s performance varies under different time pressures when processing the same job. As higher time pressure is more likely to force a procrastinator to make mistakes when processing a sophisticated job, the success probability is consequentially dependent on the starting time. Another example is the antiship missile defense by SAM (surface-air-missile) systems shown in . SAM systems are scheduled to intercept the incoming antiship missiles within feasible interception time window. Killing probability of the interception is associated with the range at which the interception missile and the antiship missile meets, which in turn depends on the launching time of the interception. Usually, an earlier firing time means more flight time before engagement.
Both of the above examples imply that though early starting strategy for job processing guarantees maximal time window for processing, longer processing duration will be spent as a price. Compared with the classic on-line scheduling [3, 4], extra trade-offs should be considered by the agent between the current job and the possible incoming jobs. For example, Figure 1 gives the killing probability associated with the time at which the engagement occurs for a SAM system of a Halifax ship against the incoming antiship missile . It can be inferred that an antiship missile can be intercepted in the feasible time window . Hence the SAM system can choose the best engaging time to get the highest killing probability. If the interception fails, SAM system will have time window to take an immediate remedial interception. However, if the SAM system fires at earlier time and makes the engagement occur at , though the killing probability is lowered down, longer time window is left in case of interception fail. Therefore, the SAM system needs to make trade-off between a high killing probability of current interception and more feasible time left to take remedial action in case of interception fail. In addition, for an antiship missile and a SAM system, an earlier firing time always means more flight time before the engagement. Hence adopting the early firing strategy would cause SAM system to spend longer duration on the current interception, while losing more opportunities to intercept possible upcoming missiles. This is the second trade-off to be considered.
Similar trade-off also exists in the application of procrastinator on-line scheduling . To the best of our knowledge, multiagent on-line scheduling with the trade-off discussed above is not studied by previous researches such as time-dependent scheduling [6–9], on-line stochastic optimization [3, 4, 10, 11], and stochastic resource allocation in a multiagent environment [12–15]. Therefore, in this paper, we consider the above trade-offs in a multiagent scheduling process. There are several independent agents that can be scheduled to process the stochastically arriving jobs. Each job has a specific feasible time window, during which an agent can process it with time-dependent success probability. In case of fail, an agent will immediately make another try as long as the remaining time window allows. The objective is to complete all jobs with high probability. A general problem definition is introduced in Section 2, and Section 3 surveys closely related studies. Section 4 builds a DEC-MDP (decentralized Markov Decision Processes) to model the on-line multiagent scheduling process without considering the opportunity loss. An OL-DEC-MDP (opportunity loss-decentralized Markov Decision Processes) model is proposed in Section 5 to include the opportunity loss in the scheduling decision with proofs on its properties. Section 6 is the simulation evaluation of the OL-DEC-MDP, and Section 7 contains the conclusions and the future work.
2. Problem Definitions
There is a group of agents, denoted as , that should be scheduled to process a set of stochastically arriving jobs, which is denoted as . For each job , there is a time interval , during which the process of job is feasible. For example, and are the low bound and upper bound of the feasible time window for job to be processed. For each agent , there is a duration , which should be spent to process job for one time starting from time . The outcome of the process by the end of is either success or fail, and probability of success is denoted as .
Assumption 1. One agent can only be scheduled to process one job at a time.
Assumption 2. If the agent fails to complete the job by the end of the process, it will immediately start another try as long as the feasible processing time window of the job will not elapse before the next try can be finished.
Assumption 3. The agent will be released from the current job and be available for the next job, if either the current job is completed successfully or the current job is discarded because of insufficient time window left for another try.
According to the above assumptions, an agent will have several opportunities to complete a job before the feasible time window of job elapses depending on the process duration of each try. For example, If a process starting from time fails by the end of time , agent must try to reprocess the job immediately at time , as long as .
Assumption 4. For an assignment of job to agent , later a try of process starts; later the process will end, but the process duration will be shorter.
Assumption 4 is in accordance with the observations in the time-sensitive applications such as air defense. With the threat approaching, the time needed for an interception is diminishing. For example, the earlier a process starts, the earlier the effect can be observed, but a longer duration needs to be spent. According to Assumption 4, For any , , s.t. , if , then and .
Assumption 5. Each agent operates independently, and there is no resource competition or mutual influence between agents.
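To make Assumptions 2 to 4 concrete, the following minimal Python sketch enumerates the start times of successive tries for one agent and one job. The function name `retry_start_times` and the linear `example_duration` are hypothetical choices for illustration only; the model itself leaves the duration function application-specific. The window [12, 30] is reused from the example in Section 4.1.

```python
from typing import Callable, List


def retry_start_times(first_start: float,
                      window_end: float,
                      duration: Callable[[float], float]) -> List[float]:
    """Return the start time of every try that fits in the feasible window.

    Assumes duration(t) > 0 for all t, otherwise the loop would not terminate.
    """
    starts: List[float] = []
    t = first_start
    # A try is started only if it can finish before the window elapses.
    while t + duration(t) <= window_end:
        starts.append(t)
        # Assumption 2: after a failure the next try starts immediately.
        t = t + duration(t)
    return starts


def example_duration(t: float) -> float:
    """Hypothetical duration: a later start ends later but takes less time
    (Assumption 4), with a floor of one time unit."""
    return max(1.0, 0.5 * (30.0 - t))


starts = retry_start_times(12.0, 30.0, example_duration)
print(starts)
```

Under this hypothetical duration function, the list contains the worst-case sequence of tries, that is, the sequence followed when every try fails until the window no longer admits another one.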
The objective is to schedule the agents on-line to successfully complete all the arriving jobs with highest probability. In the off-line case, the objective of the problem can be modelled as (1) s.t. objective function (1) is to maximize the probability of successfully completing all the incoming jobs. Constraint (2) implies that an agent can process a job more than one time. Constraint (3) ensures that a try of job processing should not be started if the feasible time window of the job will elapse before a try can be finished.
Compared with a similar model in Karasakal et al. , the starting time of job processing in the above model is continuously distributed in a job’s feasible time window. Moreover, in the on-line version, the future arriving jobs could not be known in advance, which makes the model more difficult to solve.
As a result, trade-off should be made during the on-line scheduling of the problem to ensure good scheduling quality:(1)the trade-off between the probability of successful process of the current job and the probability of the successful reprocess of the job in case of fail;(2)the trade-off between the reward of successful process of the current job and the opportunity loss that the agent might have with other incoming jobs during the current job processing.
3. Related Works
Job scheduling is a classic domain for this problem, in which jobs need to be handled by one or more machines subject to constraints such as due dates, processing times, and priorities. There are many different models, such as single or parallel machine models, depending on the number of machines. If a job needs to be handled by a series of machines in order, the models are called flow shops, job shops, or open shops under different situations. The objective is to handle all the jobs with a minimum makespan or lateness [18, 19]. When time-independent uncertainties such as machine breakdowns, unexpected releases of jobs with high priority, processing duration, and execution uncertainty are introduced, the problem is called stochastic scheduling. Recently, models of time-dependent scheduling have been proposed, in which the parameters of the scheduling are time-dependent. For example, learning effect and processing time are defined as an increasing function [22, 23] or a decreasing function [1, 24, 25] of their start times. However, most of the above studies address the off-line case, where all of the jobs exist from the beginning. Moreover, the time-dependent parameters are mainly the processing times or the cost of processing, while a time-dependent success probability of job processing is not discussed.
3.2. On-Line Stochastic Optimization
On-line stochastic optimization, such as on-line packet scheduling, stochastic reservations, vehicle dispatching, or routing, has been studied [3, 4]. In these problems a job or requisition arrives stochastically in a queue and waits to be served by a certain machine or server. A job or requisition will be successfully processed once a machine is scheduled. Scheduling can be centralized or decentralized depending on whether the scheduling decision is made globally or by each agent. To model real-world problems, time-independent uncertainties such as action duration, resource consumption, and operation outcomes [12, 14] are introduced into the scheduling process. Sampling approaches are also introduced to estimate the future arriving jobs in order to approach a globally optimal solution [26–28]. Similar to our problem, each agent in the above problems is dynamically scheduled against the stochastically arriving jobs, and there is no resource competition or dependence among the agents. However, a time-dependent probability of success is not considered, and each job is successfully processed once an agent is scheduled to process it. Moreover, job discarding is allowed if a more important job arrives. In our problem, by contrast, each agent must retry after a failure as long as time allows.
3.3. Stochastic Resource Allocation in Multiagent Environment
The most related studies in the area of stochastic resource allocation in multiagent environment mainly focus on the following problems; each agent can execute a task independently while different agents may share the same resources. An agent consuming shared resources may decrease the reward of other agents. As the outcome of the job execution is uncertain, the resources are allocated to achieve the global optimal solution.  solves this type of problems by introducing dynamic constraint satisfaction problem (DCSP) model into MDP and constructing a Markovian CSP (MaCSP) model. The best action at each Markovian step depends on the resource availability. As the state space increases exponentially with the number of agent and the types of resource, some studies propose heuristic search  and decomposition approaches [14, 15] in solving Decentralized Markov Decision Processes (DEC-MDP). As the dependency between different agents is taken into account, starting an action too early or too late by an agent may jeopardize the operation of others. Hence, trade-off is introduced into DEC-MDP to estimate the cost that one agent may suffer due to the negative influence of others .
A comparison between our problem and closely related studies is given in Table 1.
4. Decentralized MDP (DEC-MDP)
In theory, models (1)–(3) can obtain the optimal scheduling solution for an off-line problem. However, in the on-line case, the scheduling decision should be made according to the state of each agent and the incoming job in real time. As a result, MDP provides a suitable approach to model the on-line scheduling by mapping the current state of agents and incoming jobs to an optimal scheduling decision. In order to construct the MDP model of a problem, the state space of the problem should be defined.
4.1. States of the Agent and the Job
The state of an agent is either busy when it is processing a job or unemployed when it is released from the current job. Let denote the state of agent at time : If a job is being processed by an agent, the state of the job is modelled as the ratio of the remaining time window feasible for the job to be completed. If it is waiting to be processed, its state is set to be 0; if it has been completed successfully, the state is set to be 1; otherwise the state is set to be −1. Let denote the state of job at time : For example, at time , there is an unemployed agent without any job coming. The state of the agent at is
At time , a job arrives with feasible time window [12, 30], and it is scheduled to agent which is due to start at time 12. Then By Assumption 4, there is . Suppose , and job processing ends successfully; then If job processing fails by the time , agent will start another try to process job immediately as long as holds. Suppose ; then If job processing fails again by the end of the second try (e.g., at time ), and the remaining time window of job (only 1 second is left, since ) is not enough for another try, then job will be discarded from time : In this case, agent will be available for other incoming jobs from time . As the different agents may be released or start to process a job at different times, it is hard to define a joint action, which is the set of actions for each agent in each decision step of the on-line scheduling process . Moreover, because of the time-dependent state space, the reward of a joint action is difficult to evaluate by a recursive approach as introduced in . Recently, in order to limit the set of state space in the multiagent environment, there is significant progress in extending the Markov Decision Processes (MDP) for optimizing decentralized control [13, 31]. In this paper, as there is no dependence or resource competition among agents, a decentralized MDP is adopted to model the decision process of each agent.
For an agent and its allocated job with time window , the corresponding DEC-MDP is defined as a tuple , where(i) is the state set regarding agent and job during the whole process period (e.g., from the time when the process is started for the first time, denoted as , to the time when agent is released from job );(ii) is the strategy set of agent , represented by the starting time that agent may choose to process job for the first time. ;(iii) is a function of time , which gives the success probability of completing job by agent by the time of , when the process or the reprocess starts at time . According to Assumptions 2 and 3, there is if or ;(iv) is the reward function as defined in (13).The initial state of a DEC-MDP is The absorbing state is Figure 2 shows the state transition process when agent starts to process job at time for the first time, in which the maximal retry times are , and is the smallest number that satisfies . The reward function is defined as in (13), which is represented by the probability that agent will successfully complete job when starting the first process at time . Consider s.t. when starting from time , agent can process job not more than times. With (13)–(15), the best time for agent to begin to process job is : Therefore, during the on-line scheduling, we prefer scheduling the incoming job to an agent with highest success probability: However, as stated in Section 4, this decision does not take the opportunity loss into account. Agent may lose higher rewards with upcoming jobs during its engagement with the current job. As a result, we introduce a opportunity loss decentralized MDP (OL-DEC-MDP) model in the next section.
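The DEC-MDP reward described above, namely the probability that the agent completes the job when the first try starts at a given time and every failure is followed by an immediate retry, can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the names `completion_probability` and `best_start` are hypothetical, the best start is searched over a finite candidate list instead of the continuous window, and the constant duration and success probability in the usage lines are chosen only so that the arithmetic can be checked by hand (they do not satisfy Assumption 4).

```python
from typing import Callable, Iterable, Tuple


def completion_probability(first_start: float,
                           window_end: float,
                           duration: Callable[[float], float],
                           success_prob: Callable[[float], float]) -> float:
    """Probability of completing the job when the first try starts at first_start.

    Sums, over the tries that fit in the window, the probability that all
    earlier tries failed and the current one succeeds. Assumes duration(t) > 0.
    """
    all_failed_so_far = 1.0
    total = 0.0
    t = first_start
    while t + duration(t) <= window_end:
        p = success_prob(t)
        total += all_failed_so_far * p
        all_failed_so_far *= 1.0 - p
        t += duration(t)
    return total


def best_start(candidates: Iterable[float],
               window_end: float,
               duration: Callable[[float], float],
               success_prob: Callable[[float], float]) -> Tuple[float, float]:
    """Return (start time, reward) with the highest completion probability."""
    scored = [(t, completion_probability(t, window_end, duration, success_prob))
              for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy instance: window ends at 6, each try takes 2, succeeds with 0.5.
toy_duration = lambda t: 2.0
toy_success = lambda t: 0.5
best_t, best_reward = best_start([0.0, 1.0, 5.0], 6.0, toy_duration, toy_success)
print(best_t, best_reward)
```

In the toy instance, starting at 0 allows three tries, starting at 1 allows two, and starting at 5 allows none, so the earliest start is preferred when opportunity loss is ignored.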
5. Opportunity Loss Decentralized MDP (OL-DEC-MDP)
An OL-DEC-MDP model has the same state space, strategy set, and transition probability as a DEC-MDP. However, the reward function of an OL-DEC-MDP is redefined to take the opportunity loss into account.
5.1. Opportunity Loss
As shown in Figure 2, the agent may try at most times before being released from the current job. It will not be available for other upcoming jobs during the period with the probability where and . As a result, agent will lose all upcoming jobs during with probability of . Hence the opportunity loss for agent to process job starting from can be defined as In the above equation, is the best starting time for agent to process job , which is decided by (16). is the set of all possible jobs that will arrive during period . The opportunity loss of the agent is defined to be the highest possible reward that the agent may lose during the period of engagement with its current job. Considering both the reward and the potential loss in scheduling decision, we now refine the reward function in OL-DEC-MDP as following: As a result, the best starting time of agent to process job is : Reward function (21) calculates the maximum reward when schedule agent to process job while taking the opportunity loss into account.
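The idea of this subsection can be illustrated with a short Python sketch. It rests on several assumptions that go beyond the text: the agent is taken to be engaged in a given try with the probability that all earlier tries failed, the loss is taken as the largest busy-probability-weighted reward over the sampled upcoming jobs, and the adjusted reward is taken as the completion probability minus that loss. The function names and the numbers in the usage lines are hypothetical.

```python
from typing import List, Sequence, Tuple

# One try is described by (start time, end time, success probability).
Try = Tuple[float, float, float]


def busy_probability(arrival: float, tries: Sequence[Try]) -> float:
    """Probability that the agent is still engaged when a job arrives."""
    still_engaged = 1.0  # probability that every earlier try has failed
    for start, end, p in tries:
        if start <= arrival < end:
            return still_engaged
        still_engaged *= 1.0 - p
    # Arrival falls outside every try: the agent is treated as free.
    return 0.0


def opportunity_loss(tries: Sequence[Try],
                     upcoming: Sequence[Tuple[float, float]]) -> float:
    """Highest reward the agent may lose; upcoming holds (arrival, reward) pairs."""
    return max((busy_probability(a, tries) * r for a, r in upcoming), default=0.0)


def adjusted_reward(completion_prob: float,
                    tries: Sequence[Try],
                    upcoming: Sequence[Tuple[float, float]]) -> float:
    """Assumed form of the refined reward: completion probability minus loss."""
    return completion_prob - opportunity_loss(tries, upcoming)


# Hypothetical example: two tries, three sampled upcoming jobs.
tries: List[Try] = [(0.0, 4.0, 0.5), (4.0, 6.0, 0.8)]
upcoming = [(1.0, 0.9), (5.0, 0.9), (7.0, 1.0)]
loss = opportunity_loss(tries, upcoming)
print(loss, adjusted_reward(0.9, tries, upcoming))
```

In this example the job arriving at time 1 is certainly missed, the job arriving at time 5 is missed only if the first try failed, and the job arriving at time 7 is never missed, so the first job determines the loss.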
5.2. Computation of the Reward with Opportunity Loss
For an OL-DEC-MDP , if job is allocated to agent , then agent should choose a starting time from the time window , while . If derivation of the reward function (21) exists within interval , the optimal starting time can be decided as following: However, if derivation of reward function (21) does not exist within interval , can be decided with the following heuristics.(1)Start as early as possible. For example, set if the agent is available earlier than , or set to be the earliest time when agent becomes available. We denote under this heuristic as , .(2)Start as late as possible, as long as agent will still have the same number of retrying opportunities () with starting as early as possible (e.g., at time ). With this heuristic, there are and . We denote under this heuristic as .(3)Start at the time with highest success probability to complete job within the first try, while still having the maximal retrying opportunities (). For example, is the time point between with the highest success probability of the first try. With this heuristic, there is . We denote under this heuristic as .(4)Start at the time with the highest success probability to complete the job within the first try. With this heuristic, there is , while . We denote under this heuristic as .As a result, the best starting time of agent to process job considering both rewards and opportunity loss can be computed as According to (19), to compute , we should know ; for example, the set of all possible jobs that will arrive during time . can be estimated on-line by the sampling approach, as described in [26, 27], which can forecast the possible events according to the job arriving distribution.
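The four heuristics above can be sketched as a routine that returns one candidate starting time per heuristic. The sketch below is only an approximation under stated assumptions: candidate times are searched on a discrete grid with a hypothetical `step`, and the duration and success-probability functions in the usage lines are invented for illustration. Choosing among the four candidates by the reward with opportunity loss is left out.

```python
from typing import Callable, Dict, List


def count_tries(start: float, window_end: float,
                duration: Callable[[float], float]) -> int:
    """Number of consecutive tries that fit in the window from a given start."""
    n, t = 0, start
    while t + duration(t) <= window_end:
        n += 1
        t += duration(t)
    return n


def heuristic_candidates(window_start: float, window_end: float,
                         available_at: float,
                         duration: Callable[[float], float],
                         success_prob: Callable[[float], float],
                         step: float = 1.0) -> Dict[str, float]:
    """One candidate start time per heuristic, searched on a grid."""
    earliest = max(window_start, available_at)
    # Keep only start times that leave room for at least one try.
    grid: List[float] = []
    t = earliest
    while t + duration(t) <= window_end:
        grid.append(t)
        t += step
    if not grid:
        return {}
    max_tries = count_tries(earliest, window_end, duration)
    same = [s for s in grid if count_tries(s, window_end, duration) == max_tries]
    return {
        "earliest": earliest,                                   # heuristic (1)
        "latest_same_tries": max(same),                         # heuristic (2)
        "best_first_try_same_tries": max(same, key=success_prob),  # heuristic (3)
        "best_first_try": max(grid, key=success_prob),          # heuristic (4)
    }


# Hypothetical functions: duration shrinks toward the window end,
# first-try success probability peaks at t = 24.
dur = lambda t: max(1.0, 0.5 * (30.0 - t))
prob = lambda t: 0.9 - 0.002 * (t - 24.0) ** 2
cands = heuristic_candidates(12.0, 30.0, 0.0, dur, prob)
print(cands)
```

With these hypothetical functions, starts from 12 to 14 all leave room for five tries, so heuristics (2) and (3) coincide at 14, while heuristic (4) picks the peak of the first-try success probability at 24.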
5.3. On-Line Scheduling Based on OL-DEC-MDP
The detailed scheduling algorithm is as follows.
(1) Queuing Up the Incoming Jobs. When a new job arrives, it is placed in a time-priority queue. A newly arrived job with a smaller lower bound of its feasible time window has higher priority.
(2) Observing the System State Change. Each agent has a job list with length of 1, which indicates its next job to be processed. System state changes when(a)agent is released from current job and starts to process the assigned job in its job list (the agent’s job list will be empty);(b)agent fails to complete the current job and begins to make another try (the job in the agent’s job list will be still waiting, which will be rescheduled);(c)a new job is coming, and there exists at least one agent with empty job list (the incoming job will be assigned to some agent by being pushed into its job list).
(3) Scheduling/Rescheduling When System State Changes. When system state changes, scheduling or rescheduling decision will be made to decide or adjust the best next job as well as the best starting time for each agent. denote the current set of next job of all agents before rescheduling.(a)If , then dequeue jobs from queue. Let be the set of these dequeued jobs. Then, is the job set to be scheduled/rescheduled, as denoted by .(b)Order jobs in according to time priority, and clear the job list of all agents.(c)Schedule each job by order in to an agent as following.(i)Given job , it will be scheduled to the agent with highest reward: In the above computation, is the set of all agents with empty job list; is decided by (23) with given and . For each agent, If it has not been released when job comes, its earliest available time is set to be the ending time of its current process cycle. For example, the scheduling decision is made based on the assumption that all agents will be released by the end of its current process cycle. If the assumption is violated according to the observation, it is thought to be a system state change, and rescheduling will be made as described in step .(ii)Push job into job list of .
(4) Job Processing. When available (being released from the current job), agent will begin to process the job in its job list at time , and its job list will be cleared. By Assumption 2, an agent retries its assigned job until it either succeeds or the time window of the job expires.
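The assignment part of steps (1) and (3) can be sketched as follows. This is a simplified illustration, not the full algorithm: it covers only one scheduling round, it treats the reward of an agent for a job as a given function (hypothetically named `reward`), and it omits state observation and rescheduling. The agent and job identifiers and the reward table are invented for the example.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def schedule_round(jobs: List[Tuple[float, str]],
                   free_agents: List[str],
                   reward: Callable[[str, str], float]) -> Dict[str, str]:
    """Assign queued jobs to agents with an empty job list.

    jobs holds (lower bound of feasible window, job id) pairs; the job with
    the smallest lower bound is served first. Each job goes to the free agent
    with the highest reward, and each agent takes at most one job.
    """
    queue = list(jobs)
    heapq.heapify(queue)  # time-priority queue
    free = set(free_agents)
    assignment: Dict[str, str] = {}
    while queue and free:
        _, job = heapq.heappop(queue)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(free), key=lambda agent: reward(agent, job))
        assignment[job] = best
        free.remove(best)  # the agent's job list now has length 1
    return assignment


# Hypothetical reward table for two agents and two jobs.
table = {("a1", "j1"): 0.6, ("a2", "j1"): 0.8,
         ("a1", "j2"): 0.7, ("a2", "j2"): 0.9}
result = schedule_round([(5.0, "j2"), (3.0, "j1")], ["a1", "a2"],
                        lambda agent, job: table[(agent, job)])
print(result)
```

Because job j1 has the earlier window, it is served first and takes agent a2, leaving a1 for j2 even though a2 would have scored higher on j2. The sketch therefore shows the greedy, priority-ordered nature of the assignment.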
5.4. Properties and Proofs
Property 1. The time complexity to compute the best starting time for an agent to process job according to (21) is .
Proof. As shown in Figure 2, there is one state for an OL-DEC-MDP in , and two possible states in , where . Hence, the maximum number of possible states during the job process is . As a result, the time complexity to calculate the expectation value of completing current job is . On the other hand, the time complexity to calculate the maximal possible opportunity loss is . Therefore, the time complexity to calculate the reward for a given agent , job , and start time is . As and are constant for a given instance, we can set a constant . Therefore, the time complexity to compute in (21) is
Property 2. The average time for agent on processing job is , while and .
Proof. As shown in (17), given the starting time of the first processing , the probability is for the agent to spend on processing job , where and . Thus, the average time for agent to spend on job is
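The expectation behind Property 2 can be checked numerically with a small Python sketch. It uses the standard form of the expectation, in which each try contributes its duration weighted by the probability that the try is actually started (all earlier tries failed); this form is an assumption about how the property is evaluated, and the name `expected_engagement_time` and the numbers are hypothetical.

```python
from typing import Sequence, Tuple


def expected_engagement_time(tries: Sequence[Tuple[float, float]]) -> float:
    """Expected time an agent spends on a job.

    tries holds (duration, success probability) for each possible try in order.
    """
    expected = 0.0
    reach = 1.0  # probability that this try is started at all
    for duration, p in tries:
        expected += reach * duration
        reach *= 1.0 - p  # the next try happens only if this one fails
    return expected


# Hypothetical job: a first try of 4 time units (success 0.5),
# then a second try of 2 time units (success 0.8).
avg_time = expected_engagement_time([(4.0, 0.5), (2.0, 0.8)])
print(avg_time)
```

Here the first try is always executed and the second one only half of the time, so the average engagement is 4 + 0.5 × 2 = 5 time units.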
6.1. Evaluation Setting
In the evaluation, a scenario of antiship missile defence by SAM systems is studied, which is introduced in .
Suppose there are four ship-borne SAM systems that can be scheduled to intercept the incoming antiship missiles, and each SAM system is capable of working independently and intercepting antiship missile coming from any direction (the modern ship-borne vertical launching missile system matches these features and is becoming very popular). The feasible interception time window of each incoming antiship missile is set to be , in which is the time when the antiship missile is detected. The length of interception time window is decided by the detection capability of SAM system as well as the speed of the antiship missile. As a result, based on Section 5, if a SAM system is available when the missile is detected, there are , , and based on scenario in . The killing probability associated with the starting time of each interception is shown in Figure 3, which is approximated by a cubic multinomial. The duration function is defined as , where is the range at time between the SAM missile and the antiship missile; and are the velocities of the SAM missile and the antiship missile.
For the incoming antiship missile, the total number is set to be , and its arrival follows uniform distribution during a time span . To compute the opportunity loss as described in Section 5, a sampling method [26, 27] is implemented. Simulations are run under each combination of and to compare the scheduling result under different circumstances.
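A minimal sketch of the arrival sampling used in this setting is given below, assuming only what is stated above: a fixed number of missiles whose arrival times are uniformly distributed over the time span. The function names, the sample count, the seeds, and the values 10 and 60 are hypothetical and are not the parameters of the experiments.

```python
import random
from typing import List


def sample_arrivals(num_jobs: int, time_span: float,
                    rng: random.Random) -> List[float]:
    """Draw one scenario: sorted arrival times, uniform over [0, time_span]."""
    return sorted(rng.uniform(0.0, time_span) for _ in range(num_jobs))


def expected_arrivals_in(period_start: float, period_end: float,
                         num_jobs: int, time_span: float,
                         num_samples: int = 200, seed: int = 0) -> float:
    """Estimate by sampling how many jobs arrive within a given period."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        arrivals = sample_arrivals(num_jobs, time_span, rng)
        total += sum(1 for a in arrivals if period_start <= a < period_end)
    return total / num_samples


arrivals = sample_arrivals(10, 60.0, random.Random(42))
estimate = expected_arrivals_in(0.0, 30.0, 10, 60.0)
print(arrivals)
print(estimate)
```

Each sampled scenario can serve as the set of possible upcoming jobs when the opportunity loss of an engagement period is evaluated; averaging over scenarios gives an estimate such as the expected number of arrivals in that period.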
6.2. Quality of Scheduling
In the air-defence scenario, failing to intercept even a single missile may result in severe damage. Hence the quality of scheduling is measured by the probability of successfully intercepting all incoming antiship missiles, which is denoted as P-interception. As shown in Figures 4 and 5, both the DEC-MDP and the OL-DEC-MDP based scheduling approaches show that the less intensive the attack (fewer antiship missiles within a fixed time span, or a longer time span for a fixed number of incoming antiship missiles), the higher the P-interception. The reason is that with fewer antiship missiles per unit time, more SAM systems are available to be scheduled, and hence the overall interception performance improves.
However, as shown in Figure 6, the OL-DEC-MDP based scheduling approach always has a higher probability of overall interception than the DEC-MDP model. It can be observed that, for the same time span, the improvement of OL-DEC-MDP becomes more significant as the number of incoming missiles increases. That is, the performance improvement is larger under a more intensive attack. On the other hand, the overall shape of the improvement along the time span (for a fixed number of incoming antiship missiles, a longer time span means a less intensive attack) turns out to be a "cap": the improvement rises sharply as the time span increases at first and then declines after reaching a peak. The reason is that when the time span is small, which means that the antiship missiles arrive very intensively, it is very hard for OL-DEC-MDP to improve the interception performance, since the SAM system reaches its saturation point under a very intensive attack. The decision space left for each SAM system to choose the best starting time of the interception is quite small; hence OL-DEC-MDP performs similarly to DEC-MDP. However, when the intensity falls below the saturation point of the SAM system, the improvement brought by OL-DEC-MDP gradually becomes significant, because opportunity loss is taken into account in on-line scheduling to achieve better overall performance. As the attack intensity continues to decrease with the increase of the time span, the whole system has more than enough capability (available SAM systems) to intercept the incoming missiles; hence the improvement brought by OL-DEC-MDP becomes less significant.
Figure 7 shows the best starting time of interception used in OL-DEC-MDP obtained by heuristic strategy introduced in Section 5. -axis is the time indicating how long after an antiship missile is detected the first interception is launched. It can be observed that when the antiship missiles arrive intensively (which means smaller time span with fixed total incoming missiles), OL-DEC-MDP prefers to postpone the first interception launch. Study on the simulation data shows that the optimal starting time for interception under this case is near the time , which means that the strategy to achieve the highest killing probability against the antiship missile by one shot is the superior strategy. This observation can be inferred from Property 2 that the superior strategy in this case is to release the SAM system as early as possible to treat the next incoming antiship missile. On the other hand, when the attack is less intensive (which means longer time span with fixed total incoming missiles), OL-DEC-MDP prefers to start the interception earlier as to leave more feasible time for retrying in case of interception fail.
This paper proposes an OL-DEC-MDP model for on-line multiagent stochastic scheduling, which considers the starting time-dependent probability of success and processing duration. The probability of completing the assigned job by an agent would be higher when the process is started earlier, but the opportunity loss could be also high due to the longer engaging duration. As a result, OL-DEC-MDP model introduces the reward function considering the opportunity loss and schedules the incoming job to the agent with the highest reward. In order to measure the opportunity loss, OL-DEC-MDP model uses sampling method to predict the upcoming jobs and introduces heuristic strategies to compute the best starting time of an agent against an incoming job. The simulation experiments show that the OL-DEC-MDP model will improve the overall scheduling performance compared with models without considering opportunity loss, such as DEC-MDP. The overall trend of performance improvement is studied under different scenarios, which shows that the performance improvement is most significant if the jobs are coming intensively but within the saturation point of the multiagent system.
For the future research, we should extend the model to more general cases.(1)Dependency Between Agents. In some cases, agents may interfere with other’s operation. For example, if soft weapons such as chaff rocket are used during the interception, there may be mutual interference between different air defence weapons: firing a chaff rocket may prevent the missile guiding radar of the SAM system from working normally. In future work, the mutual influence between agents will be considered in constructing available strategy set and computing action reward.(2)Partial Observation. For some real-world problem, the result of the action can only be partially observed. For example, the result of interception by a SAM system may not be totally observed by other agents due to the limitation of sensing capability. Hence the reward and the opportunity loss should be reevaluated, and POMDP (partial observation MDP) based approach could be a good candidate.(3)On-line Learning. The sampling approach implemented in OL-DEC-MDP is based on the prior knowledge of the arrival distribution of the incoming jobs. If the prior knowledge of the arrival distribution does not exist, the on-line learning method could be used to learn and predict the future incoming jobs.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This paper is supported by the NSF of China under Grant nos. 61273322 and 71001105.
P. Van Hentenryck and R. Bent, Online Stochastic Combinatorial Optimization, The MIT Press, Cambridge, Mass, USA, 2006.
M. Debczynski and S. Gawiejnowicz, “An exact algorithm and a heuristic for scheduling linearly deteriorating jobs with arbitrary precedence constraints and the maximum cost criterion,” in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '12), pp. 401–405, September 2012.
S. Gawiejnowicz, Time-Dependent Scheduling, Monographs in Theoretical Computer Science. An EATCS Series, Springer, Berlin, Germany, 2008.
C. Besse and B. Chaib-draa, “An efficient model for dynamic and constrained resource allocation problems,” in Proceedings of the 2nd International Workshop on Constraint Satisfaction Techniques for Planning and Scheduling Problems (COPLAS '07), 2007.
P. Plamondon and B. Chaib-Draa, “Stochastic resource allocation in multiagent environments: an approach based on distributed q-values and bounded real-time dynamic programming,” International Journal on Artificial Intelligence Tools, vol. 21, no. 1, Article ID 1250003, 25 pages, 2012.
E. Burns, J. Benton, W. Ruml, S. Yoon, and M. B. Do, “Anticipatory on-line planning,” in Proceedings of the 22nd International Conference on Automated Planning and Scheduling (ICAPS '12), pp. 333–337, June 2012.
H. S. Chang, R. Givan, and E. K. P. Chong, “On-line scheduling via sampling,” in Proceedings of the Conference on Artificial Intelligence Planning and Scheduling (AIPS '00), pp. 62–71, 2000.
N. Meuleau, E. Benazera, R. I. Brafman, E. A. Hansen, and M. Mausam, “A heuristic search approach to planning with continuous resources in stochastic domains,” Journal of Artificial Intelligence Research, vol. 34, no. 1, pp. 27–59, 2009.
R. Bellman, “A Markovian decision process,” DTIC Technical Document, DTIC, 1957.
A. Beynier and A. Mouaddib, “A polynomial algorithm for decentralized Markov decision processes with temporal constraints,” in Proceedings of the 4th International Conference on Autonomous Agents and Multiagent Systems (AAMAS '05), pp. 963–969, July 2005.