Research Article
UAV Swarm Confrontation Using Hierarchical Multiagent Reinforcement Learning
Algorithm 1: The high-level policy training in h-MADDPG.
Input: Pretrained low-level policies for all agents
Output: model
1: Randomly initialize the high-level networks and critic networks
2: for each episode do
3:     Get local observation and global state
4:     [...]
5:     while [...] do
6:         Select macro actions for all agents, where [...]
7:         for [...] do
8:             Select primitive actions conditioned on the macro actions, where [...] are the intrinsic observations
9:             Execute primitive actions
10:             Observe new intrinsic observations and receive extrinsic rewards
11:         end for
12:         Get new local observation and new global state
13:         [...]
14:         Store transition in [...]
15:         Sample a random minibatch of M transitions from [...]
16:         Update the parameters of [...] and [...] according to Equations (6) and (7)
17:         [...]
18:     end while
19: end for
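The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything named here (`ToyEnv`, `high_policy`, `low_policy`, `run_episode`, `k_steps`, `batch_size`) is hypothetical, the environment is a toy, and both policies are stand-in functions rather than networks. The sketch further assumes that each macro action is held for a fixed number of low-level steps, that extrinsic rewards are summed over those steps, and that transitions are stored in a replay buffer. The parameter update of Equations (6) and (7) is left as a placeholder counter.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in environment: each agent holds a scalar position."""

    def __init__(self, n_agents, horizon):
        self.n_agents = n_agents
        self.horizon = horizon
        self.t = 0
        self.pos = [0.0] * n_agents

    def reset(self):
        self.t = 0
        self.pos = [0.0] * self.n_agents
        return list(self.pos), tuple(self.pos)  # local observations, global state

    def step(self, primitive_actions):
        self.pos = [p + a for p, a in zip(self.pos, primitive_actions)]
        self.t += 1
        rewards = [-abs(p) for p in self.pos]  # toy extrinsic reward
        done = self.t >= self.horizon
        return list(self.pos), tuple(self.pos), rewards, done


def high_policy(obs, rng):
    """Stand-in for a high-level network: picks a macro action at random."""
    return rng.choice([-1, 1])


def low_policy(intrinsic_obs, macro_action):
    """Stand-in for a pretrained low-level policy conditioned on the macro action."""
    return 0.1 * macro_action


def run_episode(env, buffer, k_steps, batch_size, rng):
    obs, state = env.reset()
    done = False
    n_updates = 0
    while not done:
        # High level: one macro action per agent.
        macros = [high_policy(o, rng) for o in obs]
        cum_reward = [0.0] * env.n_agents
        intrinsic_obs = obs
        # Low level: primitive actions for up to k_steps (assumed fixed length).
        for _ in range(k_steps):
            prims = [low_policy(io, m) for io, m in zip(intrinsic_obs, macros)]
            intrinsic_obs, next_state, rewards, done = env.step(prims)
            # Assumption: extrinsic rewards are summed over the macro step.
            cum_reward = [c + r for c, r in zip(cum_reward, rewards)]
            if done:
                break
        next_obs = intrinsic_obs
        buffer.append((obs, state, macros, cum_reward, next_obs, next_state))
        if len(buffer) >= batch_size:
            batch = rng.sample(list(buffer), batch_size)
            # Placeholder: the actor and critic updates would consume `batch` here.
            n_updates += 1
        obs, state = next_obs, next_state
    return n_updates
```

With a horizon of 12 environment steps and `k_steps=4`, one episode produces three high-level transitions, which shows that the replay buffer operates on the macro-action time scale rather than on primitive steps.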