Research Article

Simultaneous Pickup and Delivery Traveling Salesman Problem considering the Express Lockers Using Attention Route Planning Network

Algorithm 1

Pseudocode of the routing deep Q-learning algorithm.
Input: replay memory capacity N, number of training episodes M, target network update interval C
Output: trained policy network parameter set
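(1)Initialize replay memory D to capacity N;
(2)Initialize policy network $Q$ with random parameter $\theta$;
(3)Initialize target network $\hat{Q}$ with parameter $\theta^{-} = \theta$;
(4)for episode ← 1 to M do
(5) Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$;
(6) Initialize turn number t = 1;
(7) while sequence not done do
(8)  if with probability $\varepsilon$ then
(9)   Select a random action $a_t$ from accessible points;
(10)  else
(11)   Select $a_t = \arg\max_{a} Q(\phi(s_t), a; \theta)$;
(12)  end
(13)  Execute action $a_t$ in emulator;
(14)  Observe reward $r_t$ and status set $x_{t+1}$;
(15)  Set $s_{t+1} = (s_t, a_t, x_{t+1})$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$;
(16)  Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in D;
(17)  Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from D;
(18)  Set $y_j = r_j$ if the sequence terminates at step $j+1$; otherwise set $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-})$;
(19)  Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ w.r.t. $\theta$;
(20)  if t % C = 0 then
(21)   Set $\hat{Q} = Q$, i.e., set $\theta^{-} = \theta$;
(22)  end
(23)  t ← t + 1;
(24) end
(25)end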
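
The bookkeeping steps of Algorithm 1 (replay memory, exploration over accessible points, the learning target, and the periodic target update) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names ReplayMemory, select_action, td_target, and should_sync_target are hypothetical, the Q-values are passed in as plain lists rather than produced by the attention network, and the symbols epsilon (exploration probability) and gamma (discount factor) follow the standard deep Q-learning formulation. Restricting the greedy choice and the target maximum to accessible points is an assumption; the listing states that restriction only for the random action.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory D; the oldest transition is dropped when full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, phi, action, reward, phi_next, done):
        # One transition (phi_t, a_t, r_t, phi_{t+1}) plus a terminal flag.
        self.buffer.append((phi, action, reward, phi_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform minibatch without replacement, capped at the current size.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def select_action(q_values, accessible, epsilon, rng=random):
    """Epsilon-greedy choice among the currently accessible points."""
    if rng.random() < epsilon:
        return rng.choice(accessible)  # explore
    # Exploit: best Q-value among accessible points (assumed masking).
    return max(accessible, key=lambda a: q_values[a])


def td_target(reward, next_q_values, next_accessible, done, gamma):
    """Target y_j: r_j if terminal, else r_j + gamma * max_a' Q_target."""
    if done or not next_accessible:
        return reward
    return reward + gamma * max(next_q_values[a] for a in next_accessible)


def should_sync_target(t, interval):
    """True on the turns where the target network copies the policy network."""
    return t % interval == 0


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3)
    for i in range(5):
        memory.store(i, 0, -1.0, i + 1, False)
    print(len(memory))  # capacity caps the number of stored transitions
    print(select_action([0.2, 0.9, 0.5, 0.1], [0, 2, 3], epsilon=0.0))
```

In a full training loop, q_values and next_q_values would come from the policy and target networks respectively, and the squared difference between td_target and the policy network's value for the stored action would be minimized by gradient descent.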