Research Article

UAV Swarm Confrontation Using Hierarchical Multiagent Reinforcement Learning

Algorithm 1

The high-level policy training in h-MADDPG.
Input: Pretrained low-level policies for all agents
Output: model
1: Randomly initialize the high-level networks and critic networks
2: for each episode do
3:  Get local observation and global state
4:  
5:  while the episode has not terminated do
6:   Select macro actions for all agents
7:   for each low-level step of the current macro action do
8:    Select primitive actions with the pretrained low-level policies, conditioned on the macro actions and the intrinsic observations
9:    Execute primitive actions
10:    Observe new intrinsic observations and receive extrinsic rewards
11:   end for
12:   Get new local observation and new global state
13:   
14:   Store the transition in the replay buffer
15:   Sample a random minibatch of M transitions from the replay buffer
16:   Update the parameters of the high-level networks and the critic networks according to Equations (6) and (7)
17:   
18:  end while
19: end for
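
The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative skeleton, not the authors' implementation: `ToyEnv`, `high_level_policy`, `low_level_policy`, `update_networks`, and the parameter `k_steps` (the number of primitive steps per macro action) are hypothetical placeholders. The network update is left as a stub because Equations (6) and (7) are not reproduced here, and summing the extrinsic rewards over the inner loop is an assumption about how the stored transition is formed.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in environment: each agent holds an integer position."""

    def __init__(self, n_agents, horizon):
        self.n_agents, self.horizon = n_agents, horizon

    def reset(self):
        self.t = 0
        self.pos = [0] * self.n_agents
        return list(self.pos), tuple(self.pos)  # local observations, global state

    def step(self, primitive_actions):
        self.pos = [p + a for p, a in zip(self.pos, primitive_actions)]
        self.t += 1
        reward = float(sum(self.pos))  # made-up extrinsic team reward
        done = self.t >= self.horizon
        return list(self.pos), tuple(self.pos), reward, done


def high_level_policy(obs, rng):
    # Placeholder for a high-level actor: picks a macro action in {-1, +1}.
    return rng.choice([-1, 1])


def low_level_policy(intrinsic_obs, macro_action):
    # Placeholder for a pretrained low-level policy: simply follows the macro action.
    return macro_action


def update_networks(minibatch):
    # Stub: the actor and critic updates of Equations (6) and (7) would go here.
    return len(minibatch)


def train(n_episodes=3, n_agents=2, horizon=12, k_steps=4, batch_size=4, seed=0):
    rng = random.Random(seed)
    env = ToyEnv(n_agents, horizon)
    buffer = deque(maxlen=1000)  # replay buffer
    n_updates = 0
    for _ in range(n_episodes):
        obs, state = env.reset()
        done = False
        while not done:
            # High level: one macro action per agent.
            macro = [high_level_policy(o, rng) for o in obs]
            reward_sum, intrinsic_obs = 0.0, obs
            # Low level: primitive actions conditioned on the macro actions.
            for _ in range(k_steps):
                prim = [low_level_policy(io, m) for io, m in zip(intrinsic_obs, macro)]
                intrinsic_obs, next_state, r, done = env.step(prim)
                reward_sum += r  # assumption: extrinsic rewards are summed
                if done:
                    break
            next_obs = intrinsic_obs
            buffer.append((obs, state, macro, reward_sum, next_obs, next_state, done))
            if len(buffer) >= batch_size:
                update_networks(rng.sample(list(buffer), batch_size))
                n_updates += 1
            obs, state = next_obs, next_state
    return buffer, n_updates
```

Calling `train()` with the defaults runs three short episodes and returns the replay buffer together with the number of update calls. In a real setup the placeholder policies would be replaced by neural networks and `update_networks` by the gradient steps defined in the paper.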