Research Article
The Implementation of Deep Reinforcement Learning in E-Learning and Distance Learning: Remote Practical Work
Algorithm 2
The DQN algorithm and experience replay.
Initialize replay memory M to capacity C
Initialize the action-value function Q with random weights
For episode = 1, N do (for each episode N)
    Initialize state S1 = (X1) (the starting computer-screen pixels) at the beginning of each episode
    // Pre-process the screen and feed it to our DQN,
    // which will regress the Q-values of all possible actions in that state.
    For t = 1, T do
        Choose an action At using the epsilon-greedy policy.
        // With probability epsilon, we choose a random action a, and with probability 1 - epsilon,
        // we choose the action with the maximum Q-value, i.e., At = argmax_a Q(St, a)
        Execute action At in the emulator and observe reward Rt and image Xt+1
        Set St+1 = (St, At, Xt+1) and pre-process it
        Store the transition (St, At, Rt, St+1) in M
        Sample a random mini-batch of transitions (Sj, Aj, Rj, Sj+1) from M
        Set the goal Q-value Yj = Rj + gamma * max_a' Q_target(Sj+1, a')
        Compute the loss function,
        // which is just the squared difference between the goal Q and the predicted Q: (Yj - Q(Sj, Aj))^2
        Perform gradient descent with respect to our actual network parameters in order to reduce the loss function.
        After every fixed number of iterations, copy our actual network weights to the goal network weights.
    End for
End for
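The loop above can be summarized in code. The following is a minimal sketch, not the paper's implementation: it assumes PyTorch and the CartPole-v1 environment from Gym (version 0.26 or later) as a stand-in for the screen-pixel input described above, and the network size, replay capacity, epsilon, discount factor gamma, and target-update interval are all illustrative choices rather than values taken from the paper.

import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

# Hyperparameters (illustrative choices, not values from the paper)
CAPACITY = 10_000        # replay memory capacity C
BATCH_SIZE = 64
GAMMA = 0.99             # discount factor used in the goal Q-value
EPSILON = 0.1            # exploration probability for epsilon-greedy
TARGET_SYNC = 500        # iterations between target-network weight copies
EPISODES = 100           # N
MAX_STEPS = 200          # T

env = gym.make("CartPole-v1")
n_obs = env.observation_space.shape[0]
n_act = env.action_space.n

def make_q_net():
    # Q-network regressing the Q-values of all possible actions in a state
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))

q_net = make_q_net()                   # actual (online) network
target_net = make_q_net()              # goal (target) network
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

memory = deque(maxlen=CAPACITY)        # replay memory M
step_count = 0

for episode in range(EPISODES):
    state, _ = env.reset()
    for t in range(MAX_STEPS):
        # Epsilon-greedy: random action with probability epsilon, otherwise
        # the action with the maximum predicted Q-value
        if random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
            action = q_values.argmax().item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Store the transition in M
        memory.append((state, action, reward, next_state, done))
        state = next_state
        step_count += 1

        if len(memory) >= BATCH_SIZE:
            # Sample a random mini-batch of transitions from M
            batch = random.sample(memory, BATCH_SIZE)
            s, a, r, s2, d = zip(*batch)
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            # Goal Q: r + gamma * max_a' Q_target(s', a'), zero past terminal states
            with torch.no_grad():
                goal_q = r + GAMMA * (1 - d) * target_net(s2).max(dim=1).values
            predicted_q = q_net(s).gather(1, a).squeeze(1)

            # Loss: squared difference between the goal Q and the predicted Q
            loss = nn.functional.mse_loss(predicted_q, goal_q)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Periodically copy the actual network weights to the goal network
        if step_count % TARGET_SYNC == 0:
            target_net.load_state_dict(q_net.state_dict())

        if done:
            break

Two design points from Algorithm 2 carry over directly: sampling mini-batches from the replay memory breaks the correlation between consecutive transitions, and holding the goal (target) network fixed between periodic weight copies keeps the regression target stable while the online network is trained.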