Research Article

State Aware-Based Prioritized Experience Replay for Handover Decision in 5G Ultradense Networks

Algorithm 1

SA-PER handover decision algorithm.
Input: number of episodes NUM_EPISODES, number of steps per episode MAX_STEPS, number of nodes node_num, measurement information SINR, target Q-network update period D.
Output: Handover decision matrix A.
1: Initialize the action-value function Q, the replay buffer B, and the handover decision matrix A. The main Q-network and the target Q-network are initialized with identical parameters.
2: for i = 1 to NUM_EPISODES do
3:  for j = 1 to MAX_STEPS do
4:   for k = 1 to node_num do
5:   According to Eq. (6), the immediate reward rt is computed.
6:    According to Eq. (11), the dwell time is computed, and according to Eq. (14), the load coefficient Load is obtained. Using the state-aware method, the network state st in time slot t is constructed, and the state decision matrix Ms is normalized according to Eqs. (16) and (17).
7:    Using the ε-greedy method, the action at corresponding to state st is selected, and the handover decision matrix A is updated.
8:    The next state st+1 is observed, and the transition (st, at, rt, st+1) is stored in buffer B.
9:    Using the PER method, the sample priority and sampling probability are computed according to Eqs. (18) and (19), and the importance-sampling weight is computed according to Eq. (20) (see the sketch after the algorithm). The sampled transitions are fed into the main Q-network, and the action value Qm(st, at) is computed.
10:   According to Eq. (22), the action am that maximizes Qm is obtained and fed into the target Q-network Qt, and the action value Qt(st+1, am) is computed.
11:   Using stochastic gradient descent, the parameters θx of the main Q-network are updated according to Eq. (24).
12:   end for
13:  Every D steps, the parameters of the target Q-network are replaced with the parameters of the main Q-network.
14:  end for
15: end for
16: Return the handover decision matrix A.
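
To make steps 8 and 9 of Algorithm 1 concrete, the following is a minimal Python sketch of a proportional prioritized replay buffer. The paper's Eqs. (18)-(20) are not reproduced here, so the standard PER forms for priority, sampling probability, and importance-sampling weight are assumed, and the class and parameter names (PrioritizedReplayBuffer, alpha, beta, eps) are illustrative rather than the paper's implementation.

import numpy as np

class PrioritizedReplayBuffer:
    # Proportional prioritized experience replay (sketch).
    # Assumed forms, standing in for the paper's Eqs. (18)-(20):
    #   priority              p_i  = |delta_i| + eps
    #   sampling probability  P(i) = p_i^alpha / sum_k p_k^alpha
    #   IS weight             w_i  = (N * P(i))^(-beta) / max_j w_j
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.data = []                       # transitions (s, a, r, s_next)
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def store(self, transition):
        # New transitions receive the current maximum priority so that
        # they are sampled at least once before their TD error is known.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        n = len(self.data)
        scaled = self.priorities[:n] ** self.alpha
        probs = scaled / scaled.sum()                 # assumed Eq. (19)
        idx = np.random.choice(n, batch_size, p=probs)
        weights = (n * probs[idx]) ** (-beta)         # assumed Eq. (20)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority is the absolute TD error plus a small constant (assumed Eq. (18)).
        self.priorities[idx] = np.abs(td_errors) + self.eps

Likewise, a hedged sketch of one training step covering steps 9-11 and 13 is given below. The Q-network architecture, the double-Q-style target implied by step 10, and the importance-weighted squared TD-error loss are assumptions standing in for the unreproduced Eqs. (21)-(24); the helper names q_main, q_target, and train_step are hypothetical.

import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Small fully connected Q-network; the layer sizes are illustrative.
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def train_step(q_main, q_target, buffer, optimizer, batch_size=32,
               gamma=0.99, beta=0.4):
    batch, idx, weights = buffer.sample(batch_size, beta)
    s, a, r, s_next = (torch.as_tensor(np.array(x), dtype=torch.float32)
                       for x in zip(*batch))
    w = torch.as_tensor(weights, dtype=torch.float32)

    # Q_m(s_t, a_t): main-network value of the stored action (step 9).
    q_sa = q_main(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # a_m = argmax_a Q_m(s_{t+1}, a), then evaluated by the target
        # network as Q_t(s_{t+1}, a_m) (step 10).
        a_m = q_main(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * q_target(s_next).gather(1, a_m).squeeze(1)

    td_error = target - q_sa
    # Importance-sampling weights correct the bias introduced by the
    # non-uniform prioritized sampling (assumed weighted TD loss).
    loss = (w * td_error.pow(2)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()     # SGD update of the main-network parameters (step 11)

    buffer.update_priorities(idx, td_error.detach().numpy())
    return loss.item()

In this sketch, q_main and q_target would be two QNetwork instances initialized with identical parameters (step 1), the optimizer would be torch.optim.SGD over q_main.parameters(), and every D calls to train_step the synchronization q_target.load_state_dict(q_main.state_dict()) would implement step 13.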