Input: learning rate , discount factor , exploration factor , individual original MDP , individual
optimal Q-values of agent i, threshold value proportion , integer for Monte Carlo sampling, time limit for
per-episode learning
() Initialize with individual optimal policy of agent , initialize to ; |
() Identify coordinated states for agent calling Algorithm 1; |
() Initialize local state for agent , check whether the initial state is in coordination;
() for do |
() observe current local state for agent ; |
() ; |
() if agent , agent is in coordination at time then |
() select according to using ; |
() else |
() select according to using ; |
() end if |
() receive reward and transition state for each agent ; |
() if agent , is part of an augmented coordinated state and is included in the new global |
state then |
() if is not in state space then |
() extend to include joint state and all available action pairs ;
() ; |
() end if |
() mark agent as in coordination at time and record the coordinated state for agent as ;
() end if |
() if agent , agent is in coordination at time then |
() if agent is in coordination at time then |
() Update according to (5); |
() else |
() Update according to (6); |
() end if |
() else |
() if agent is in coordination at time then |
() Update according to (7); |
() else |
() Update according to (8); |
() end if |
() end if |
() , , ; |
() if is a terminal state then return; |
() end for |
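
The branching in the loop above can be sketched in Python. This is a minimal illustration only, not the authors' implementation: the class `SparseAgent`, the helper `epsilon_greedy`, and the dictionary-based tables are hypothetical names introduced here. Two assumptions go beyond the listing: actions are chosen epsilon-greedily, and a newly added joint state inherits its values from the agent's individual table. The update rules (5) to (8) are not shown in this listing and are left out.

```python
import random


def epsilon_greedy(q_table, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Unseen (state, action) pairs default to 0.0; ties go to the first action.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


class SparseAgent:
    """Hypothetical per-agent container; all names are illustrative."""

    def __init__(self, actions, q_local, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.q_local = dict(q_local)  # (local_state, action) -> value
        self.q_joint = {}             # (joint_state, action) -> value
        self.joint_states = set()     # augmented states added so far
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select_action(self, local_state, joint_state, in_coordination):
        """Use the joint table when coordinating, else the individual table."""
        if in_coordination:
            return epsilon_greedy(self.q_joint, joint_state,
                                  self.actions, self.epsilon, self.rng)
        return epsilon_greedy(self.q_local, local_state,
                              self.actions, self.epsilon, self.rng)

    def augment(self, joint_state, local_state):
        """Add a joint state once; return True if it was newly added.

        Assumption: new entries copy the individual values of local_state.
        """
        if joint_state in self.joint_states:
            return False
        self.joint_states.add(joint_state)
        for action in self.actions:
            self.q_joint[(joint_state, action)] = self.q_local.get(
                (local_state, action), 0.0)
        return True
```

Keeping the individual and joint tables separate means an agent that never enters a coordinated state only ever reads its individual values, while entries for joint states are created lazily, the first time such a state is encountered.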