Research Article
Simultaneous Pickup and Delivery Traveling Salesman Problem considering the Express Lockers Using Attention Route Planning Network
Algorithm 1
Pseudocode of the routing deep Q-learning algorithm.
Input: replay memory capacity $N$, number of training episodes $M$, target network update interval $C$
Output: trained policy network parameter set $\theta$
(1) Initialize replay memory $D$ to capacity $N$;
(2) Initialize policy network $Q$ with random parameters $\theta$;
(3) Initialize target network $\hat{Q}$ with parameters $\theta^{-} = \theta$;
(4) for episode ← 1 to $M$ do
(5)     Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$;
(6)     Initialize turn number $t = 1$;
(7)     while sequence is not done do
(8)         if with probability $\varepsilon$ then
(9)             Select a random action $a_t$ from accessible points;
(10)        else
(11)            Select $a_t = \arg\max_{a} Q(\phi(s_t), a; \theta)$;
(12)        end
(13)        Execute action $a_t$ in emulator;
(14)        Observe reward $r_t$ and status set $x_{t+1}$;
(15)        Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$;
(16)        Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$;
(17)        Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$;
(18)        Set $y_j = r_j$ if the sequence terminates at step $j+1$, otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^{-})$;
(19)        Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ w.r.t. $\theta$;
(20)        if t % C = 0 then
(21)            Set $\hat{Q} = Q$, i.e., set $\theta^{-} = \theta$;
            end
            $t \leftarrow t + 1$;
        end
    end
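The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the attention route planning network is replaced by a plain lookup table of parameters, the emulator is a hypothetical toy environment in which the agent starts at point 0 and must visit every other point once, and the reward is assumed to be the negative Euclidean distance of each move. The function names `train_routing_dqn` and `greedy_route`, the hyperparameter defaults, and the state encoding are all assumptions made for illustration. Comments give the corresponding line numbers of Algorithm 1.

```python
import math
import random
from collections import deque


def train_routing_dqn(points, N=500, M=300, C=2, batch=16,
                      gamma=0.95, lr=0.1, eps=0.2, seed=0):
    """Toy, table-based stand-in for Algorithm 1; returns policy parameters."""
    rng = random.Random(seed)
    n = len(points)
    D = deque(maxlen=N)                 # (1) replay memory with capacity N
    theta = {}                          # (2) policy parameters (missing key = 0)
    theta_target = dict(theta)          # (3) target parameters, copy of theta

    def q(params, s, a):
        return params.get((s, a), 0.0)

    for _ in range(M):                  # (4)
        state = (0, frozenset([0]))     # (5) (current point, visited points)
        t = 1                           # (6)
        while len(state[1]) < n:        # (7) done once every point is visited
            cur, visited = state
            actions = [a for a in range(n) if a not in visited]
            if rng.random() < eps:      # (8)-(9) explore
                a = rng.choice(actions)
            else:                       # (10)-(11) exploit
                a = max(actions, key=lambda b: q(theta, state, b))
            r = -math.dist(points[cur], points[a])   # (13)-(14) assumed reward
            nxt = (a, visited | {a})                 # (15)
            D.append((state, a, r, nxt))             # (16)
            k = min(batch, len(D))
            for s_j, a_j, r_j, s_n in rng.sample(list(D), k):   # (17)
                rest = [b for b in range(n) if b not in s_n[1]]
                # (18) target: reward only at terminal states, else bootstrapped
                future = max((q(theta_target, s_n, b) for b in rest), default=0.0)
                y = r_j + gamma * future
                # (19) gradient step on 0.5 * (y - Q)^2 for one table entry
                old = q(theta, s_j, a_j)
                theta[(s_j, a_j)] = old + lr * (y - old)
            if t % C == 0:              # (20)-(21) synchronize target parameters
                theta_target = dict(theta)
            t += 1
            state = nxt
    return theta


def greedy_route(points, theta):
    """Follow the learned parameters greedily from point 0."""
    n = len(points)
    state, route = (0, frozenset([0])), [0]
    while len(route) < n:
        actions = [a for a in range(n) if a not in state[1]]
        a = max(actions, key=lambda b: theta.get((state, b), 0.0))
        route.append(a)
        state = (a, state[1] | {a})
    return route


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (2, 0), (2, 1), (0, 1)]
    print(greedy_route(pts, train_routing_dqn(pts)))
```

Because the turn number t is reset at the start of every episode, as in Algorithm 1, the update interval C in this sketch must be smaller than the episode length, otherwise the target parameters are never synchronized. The default C = 2 is chosen only to suit the short episodes of the toy problem.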