Research Article

A Study of Continuous Maximum Entropy Deep Inverse Reinforcement Learning

Algorithm 2

Continuous Maximum Entropy Deep Inverse Reinforcement Learning with Hot Start.

Input: replay buffer R initialized with the demonstration data set D (hot start).
Randomly initialize critic network Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ.
Initialize target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ.
for episode = 1, M do
    Initialize a random process N for action exploration.
    Receive initial observation state s_1.
    for t = 1, T do
        Select action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise.
        Execute action a_t; observe reward r_t and new state s_{t+1}.
        Use Algorithm 1 to obtain the reward-function weight θ and perform reward shaping on r_t.
        Store transition (s_t, a_t, r_t, s_{t+1}) in R.
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R.
        Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′}).
        Update the critic by minimizing the loss:
            L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
        Update the actor policy using the sampled policy gradient:
            ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
        Update the target networks:
            θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
            θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
end for
end for
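The loop above can be sketched in code. The following is a minimal illustration, not the paper's implementation: it uses linear stand-ins for the actor, critic, and target networks, a toy one-dimensional point-mass environment, and a fixed linear reward weight `theta` standing in for the output of Algorithm 1. All names (`phi`, `PointEnv`-style dynamics, hyperparameters) are hypothetical choices for the sketch.

```python
# Sketch of Algorithm 2 (hedged): DDPG-style actor-critic with a
# demonstration-seeded ("hot start") replay buffer and reward shaping from a
# learned linear reward weight theta. All names are illustrative.
import random
import numpy as np

rng = np.random.default_rng(0)

def phi(s, a):
    # Hypothetical state-action feature map shared by critic and reward.
    return np.array([s, a, s * a, 1.0])

# Linear "networks": critic Q(s,a) = w_Q . phi(s,a); actor mu(s) = w_mu . [s, 1].
w_Q  = rng.normal(size=4) * 0.1
w_mu = rng.normal(size=2) * 0.1
w_Q_targ, w_mu_targ = w_Q.copy(), w_mu.copy()   # target networks

theta = np.array([0.0, 0.0, 0.0, 1.0])          # reward weight (Algorithm 1 stand-in)

def mu(s, w):   return float(np.dot(w, [s, 1.0]))
def Q(s, a, w): return float(np.dot(w, phi(s, a)))
def shaped_reward(s, a):                        # reward shaped via theta
    return float(np.dot(theta, phi(s, a)))

# Hot start: seed the replay buffer with demonstration transitions.
buffer = [(0.5, -0.5, shaped_reward(0.5, -0.5), 0.0) for _ in range(20)]

gamma, tau, lr = 0.9, 0.05, 1e-2
M, T, batch = 5, 20, 8

for episode in range(M):
    s = float(rng.normal())                      # initial observation s_1
    for t in range(T):
        noise = float(rng.normal() * 0.1)        # exploration noise N_t
        a = float(np.clip(mu(s, w_mu) + noise, -1.0, 1.0))
        s_next = float(np.clip(s + a, -1.0, 1.0))  # toy point-mass dynamics
        r = shaped_reward(s, a)                  # reward shaping via theta
        buffer.append((s, a, r, s_next))
        for (si, ai, ri, si1) in random.sample(buffer, batch):
            # Critic target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
            y = ri + gamma * Q(si1, mu(si1, w_mu_targ), w_Q_targ)
            # Critic update: gradient step on (y_i - Q(s_i, a_i))^2
            w_Q += lr * (y - Q(si, ai, w_Q)) * phi(si, ai)
            # Actor update: deterministic policy gradient
            # (dQ/da for the linear critic; ignores action clipping for simplicity)
            dQ_da = w_Q[1] + w_Q[2] * si
            w_mu += lr * dQ_da * np.array([si, 1.0])
        # Soft target-network updates
        w_Q_targ  = tau * w_Q  + (1 - tau) * w_Q_targ
        w_mu_targ = tau * w_mu + (1 - tau) * w_mu_targ
        s = s_next

print(len(buffer))  # → 120 (20 demonstrations + M*T collected transitions)
```

The demonstration transitions guarantee the minibatch sampler has meaningful data from the first step, which is the point of the hot start; in a full implementation the linear maps would be neural networks and `theta` would be re-estimated by Algorithm 1.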