Research Article

State Aware-Based Prioritized Experience Replay for Handover Decision in 5G Ultradense Networks

Algorithm 1

SA-PER handover decision algorithm.
Input: number of episodes NUM_EPISODES, number of steps per episode MAX_STEPS, number of nodes node_num, measurement information SINR, target Q-network update period D.
Output: Handover decision matrix A.
1: Initialize the action-value function Q, the replay buffer B, and the handover decision matrix A. The main Q-network and the target Q-network are initialized with identical parameters.
2: for i = 1 to NUM_EPISODES do
3:  for j = 1 to MAX_STEPS do
4:   for k = 1 to node_num do
5:   According to Eq. (6), the immediate reward rt is computed.
6:    According to Eq. (11), the dwell time is computed, and according to Eq. (14), the load coefficient Load is obtained. Using the state-aware method, the network state st in time slot t is constructed, and the state decision matrix Ms is normalized according to Eqs. (16) and (17).
7:    Using the ε-greedy method, the action at corresponding to state st is selected, and the handover decision matrix A is updated.
8:    The next state st+1 is observed, and the transition (st, at, rt, st+1) is stored in buffer B.
9:    Using the PER method, the sample priority and sampling probability are computed according to Eqs. (18) and (19), and the importance-sampling weight is computed according to Eq. (20) (see the sketch after the algorithm). The sampled transitions are fed into the main Q-network, and the action value Qm(st, at) is computed.
10:   According to Eq. (22), the action am that maximizes Qm is obtained and fed into the target Q-network Qt, and the action value Qt(st+1, am) is computed.
11:   Using stochastic gradient descent, the parameters θx of the main Q-network are updated according to Eq. (24).
12:   end for
13:  Every D steps, the parameters of the target Q-network are replaced with the parameters of the main Q-network.
14:  end for
15: end for
16: Return the handover decision matrix A.
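
To make steps 8 and 9 of Algorithm 1 concrete, the following is a minimal Python sketch of a proportional prioritized replay buffer. The paper's Eqs. (18)-(20) are not reproduced here, so the standard PER forms for priority, sampling probability, and importance-sampling weight are assumed, and the class and parameter names (PrioritizedReplayBuffer, alpha, beta, eps) are illustrative rather than the paper's implementation.

import numpy as np

class PrioritizedReplayBuffer:
    # Proportional prioritized experience replay (sketch).
    # Assumed forms, standing in for the paper's Eqs. (18)-(20):
    #   priority              p_i  = |delta_i| + eps
    #   sampling probability  P(i) = p_i^alpha / sum_k p_k^alpha
    #   IS weight             w_i  = (N * P(i))^(-beta) / max_j w_j
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.data = []                       # transitions (s, a, r, s_next)
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def store(self, transition):
        # New transitions receive the current maximum priority so that
        # they are sampled at least once before their TD error is known.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        n = len(self.data)
        scaled = self.priorities[:n] ** self.alpha
        probs = scaled / scaled.sum()                 # assumed Eq. (19)
        idx = np.random.choice(n, batch_size, p=probs)
        weights = (n * probs[idx]) ** (-beta)         # assumed Eq. (20)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority is the absolute TD error plus a small constant (assumed Eq. (18)).
        self.priorities[idx] = np.abs(td_errors) + self.eps

Likewise, a hedged sketch of one training step covering steps 9-11 and 13 is given below. The Q-network architecture, the double-Q-style target implied by step 10, and the importance-weighted squared TD-error loss are assumptions standing in for the unreproduced Eqs. (21)-(24); the helper names q_main, q_target, and train_step are hypothetical.

import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Small fully connected Q-network; the layer sizes are illustrative.
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def train_step(q_main, q_target, buffer, optimizer, batch_size=32,
               gamma=0.99, beta=0.4):
    batch, idx, weights = buffer.sample(batch_size, beta)
    s, a, r, s_next = (torch.as_tensor(np.array(x), dtype=torch.float32)
                       for x in zip(*batch))
    w = torch.as_tensor(weights, dtype=torch.float32)

    # Q_m(s_t, a_t): main-network value of the stored action (step 9).
    q_sa = q_main(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # a_m = argmax_a Q_m(s_{t+1}, a), then evaluated by the target
        # network as Q_t(s_{t+1}, a_m) (step 10).
        a_m = q_main(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * q_target(s_next).gather(1, a_m).squeeze(1)

    td_error = target - q_sa
    # Importance-sampling weights correct the bias introduced by the
    # non-uniform prioritized sampling (assumed weighted TD loss).
    loss = (w * td_error.pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()     # SGD update of the main-network parameters (step 11)

    buffer.update_priorities(idx, td_error.detach().numpy())
    return loss.item()

In this sketch, q_main and q_target would be two QNetwork instances initialized with identical parameters (step 1), the optimizer would be torch.optim.SGD over q_main.parameters(), and every D calls to train_step the synchronization q_target.load_state_dict(q_main.state_dict()) would implement step 13.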