Research Article
Double Deep Recurrent Reinforcement Learning for Centralized Dynamic Multichannel Access
1) For time-slot t = 1, …, T do
2) Observe an input x(t) and feed it into the online network
3) Generate estimates of the Q-values Q(a) for all available actions with the online network
4) Take N actions with the ε-greedy method (according to (12)) and obtain an instantaneous reward rn(t) for each SU
5) Observe the next input x(t + 1)
6) Mark ah(t), rh(t) as the action and the reward with the highest score Q(a)
7) Store the tuple (x(t), ah(t), rh(t), x(t + 1)) in the replay memory
8) Sample a random minibatch of tuples (xj, aj, rj, xj+1) from the replay memory
9) Set yj = rj + γ Q(xj+1, argmax_a Q(xj+1, a; θ); θ−), where θ are the online-network parameters and θ− the target-network parameters
10) Perform a gradient descent step on (yj − Q(xj, aj; θ))^2 with respect to the online-network parameters θ
11) End for
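The training loop above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: the recurrent online and target networks are replaced by simple linear Q-functions, and the channel environment, dimensions (N_CHANNELS, STATE_DIM), rewards, and hyperparameters are all hypothetical stand-ins. What it does show faithfully is the double-DQN structure of steps 8–10: the online network selects the next action, the target network evaluates it, and a gradient step is taken on the squared TD error.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

N_CHANNELS = 4   # hypothetical number of channels (actions)
STATE_DIM = 8    # hypothetical observation size
GAMMA = 0.9      # discount factor
EPSILON = 0.1    # exploration rate for the epsilon-greedy policy

# Hypothetical linear Q-functions Q(x) = W @ x, standing in for the
# recurrent online and target networks of the paper.
W_online = rng.normal(size=(N_CHANNELS, STATE_DIM))
W_target = W_online.copy()

replay = deque(maxlen=1000)  # step 7: replay memory

def q_values(W, x):
    return W @ x

def epsilon_greedy(x, eps=EPSILON):
    """Step 4: choose a channel by the epsilon-greedy rule."""
    if rng.random() < eps:
        return int(rng.integers(N_CHANNELS))
    return int(np.argmax(q_values(W_online, x)))

def double_dqn_target(r, x_next):
    """Step 9: the online network picks the greedy action,
    the target network evaluates it."""
    a_star = int(np.argmax(q_values(W_online, x_next)))
    return r + GAMMA * q_values(W_target, x_next)[a_star]

def train_step(batch_size=8, lr=1e-3):
    """Steps 8-10: sample a minibatch and descend the squared TD error."""
    batch = random.sample(list(replay), min(batch_size, len(replay)))
    for x, a, r, x_next in batch:
        y = double_dqn_target(r, x_next)
        td = y - q_values(W_online, x)[a]
        # d/dW[a] of (y - Q(x, a))^2 is -2 * td * x, so descending
        # the loss moves W[a] in the +td*x direction (2 folded into lr).
        W_online[a] += lr * td * x

# Steps 1-7: interact with a toy environment and store transitions.
x = rng.normal(size=STATE_DIM)
for t in range(50):
    a = epsilon_greedy(x)
    r = 1.0 if a == 0 else 0.0  # toy reward: pretend channel 0 is idle
    x_next = rng.normal(size=STATE_DIM)
    replay.append((x, a, r, x_next))
    train_step()
    x = x_next
```

In a full implementation the target parameters θ− would also be periodically copied from the online network; that step is outside the loop shown in the pseudocode and is omitted here.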