Abstract

Frequent handover and handover failure significantly degrade the QoS of mobile users in the terrestrial segment (e.g., cellular networks) of satellite-terrestrial integrated networks (STINs). Traditional handover decision methods rely on historical data and incur training costs. To address these problems, deep reinforcement learning- (DRL-) based handover decision methods have been applied to handover management. However, the existing DQN-based handover decision method still suffers from the overestimation problem of DQN. Moreover, current handover decision methods adopt a greedy strategy, which leads to load imbalance among base stations. Considering both the handover decision and load imbalance problems, we propose a load balancing-based double deep Q-network (LB-DDQN) method for handover decision. In the proposed load balancing strategy, a load coefficient is defined to express the load condition of each base station, and a supplementary load balancing evaluation function evaluates the performance of the strategy. As the basic method, DDQN adopts a target Q-network and a main Q-network to mitigate the overestimation problem of DQN. Instead of a joint optimization formulation, we feed the load reward into the designed reward function, so that the load coefficient becomes one of the handover decision factors. In this way, the handover decision and load imbalance problems are solved effectively and jointly. Experimental results show that the proposed LB-DDQN handover decision method achieves good handover performance; in addition, the access of mobile users becomes more balanced and the network throughput is increased.

1. Introduction

As an important part of future wireless networks, the STIN [1] offers mobile users communication services with wide coverage, high reliability, and low delay. This integrated network consists of a terrestrial segment and a satellite segment, which provides the future smart city with ubiquitous connectivity. Because of the growth of user equipment and the mobility of users, mobile users need to connect to the optimal candidate cell to continue their communication. However, overlapped coverage areas, the access limitations of base stations, and the random mobility of users make handover more complex. How can mobile users decide on the optimal candidate cell? Which decision factor is the key factor? Meanwhile, mobile users also need to handle the handover decision problem during data transmission services. This problem directly influences the handover performance and Quality of Service (QoS) of mobile users in wireless networks. Handover management consists of three steps: (1) information collection, (2) handover decision, and (3) handover execution [2]. Aiming at frequent handover, handover failure, and load imbalance in the terrestrial segment of the STIN, and combining a load balancing strategy, a load balancing-based double deep Q-network (LB-DDQN) handover decision method is proposed.

The development of handover decision algorithms has attracted increasing attention from academia and industry. Traditional handover decision methods include the simple additive weighting method, TOPSIS, decision function methods, and Q-learning. These methods focus on the measurement report, handover threshold, and decision function; they rely on prior knowledge and incur training costs. The latest approach is deep reinforcement learning (DRL), which combines the feature extraction ability of deep learning with the decision-making ability of reinforcement learning. DRL effectively copes with the mobility of users and the dynamics of networks. Among existing DRL methods, value function-based DRL and policy gradient-based DRL are the fundamental approaches and research hot spots. Policy gradient-based DRL is widely used for Markov decision processes with continuous action spaces, whereas DQN-based methods are used for discrete, low-dimensional action spaces. The handover decision problem studied in the STIN is a typical deterministic discrete decision problem, and the DDQN method adopts a target Q-network and a main Q-network to mitigate the overestimation problem of DQN. Therefore, we select the improved DDQN method to train the handover decision process.

Figure 1 shows that users' mobility patterns and distribution differences lead to load imbalance in the overlapped area of MBS1 and MBS2. Current handover decision methods adopt a greedy strategy, which also causes load imbalance among base stations. In Figure 1, the arrow indicates the moving direction of LEO satellites along their periodic orbital trajectory. These vital decision factors affect the handover decision and lead to frequent handover and decreased network throughput. So far, few studies pay attention to both the handover decision and load balancing problems. Moreover, in handover-based load balancing methods, unevenly distributed mobile users are forced to switch to lightly loaded base stations. Unlike these methods, the LB-DDQN handover decision method selects the SINR, delay, and load coefficient as the handover decision factors. Combining the load balancing strategy, the improved DDQN method constructs a Markov decision model to realize the handover decision. In this way, the handover decision and load imbalance problems are solved effectively and jointly; the proposed method achieves good handover performance and meets the demands of load balancing. Our contributions are summarized as follows: (1) We propose the LB-DDQN handover decision method in the STIN to deal with frequent handover and load imbalance, resolving the handover decision problem in the terrestrial segment of the STIN. The feasibility of the LB-DDQN handover method is demonstrated by experiments and analysis. (2) We construct the load balancing strategy, including the load coefficient, the load reward, and the load balancing evaluation function. The load reward captures the influence of load on handover in the LB-DDQN method, and the load balancing evaluation function allows the load condition of the whole network to be evaluated intuitively. (3) We select the load coefficient as a handover decision factor, so that load balancing is also considered during the training of the handover decision. Handover decision and load balancing are thus realized jointly and effectively.

The rest of this paper is organized as follows. The related works of handover decision are surveyed in Section 2. The system model is described in Section 3. The LB-DDQN handover decision method is proposed in Section 4. Simulation setups and results of experiments are provided in Section 5. Finally, Section 6 concludes this paper.

2. Related Works

The existing handover decision methods can be divided into four categories [3]: decision function methods, multiattribute decision methods, context-aware decision methods, and artificial intelligence decision methods. In [4], a multiattribute decision method computed the weights of the decision factors; the simple additive weighting (SAW) method, the technique for order of preference by similarity to ideal solution (TOPSIS), and the grey relational analysis (GRA) method were adopted to select the optimal candidate base station. In [5], the analytic hierarchy process (AHP) was used to obtain the weights of the decision factors, and the candidate base stations were ranked by the SAW method. In [6], the AHP method obtained the weights of the decision factors, and the GRA method ranked the candidate base stations. In [7], combining property normalization and weight calculation, an improved multiattribute decision method ranked the candidate base stations. In [8], the designed decision maker performed network selection and handover decision. In [9], a received signal strength indicator- (RSSI-) based fuzzy logic method executed fast and seamless handover. Considering frequent handover and the ping-pong effect, a predicted signal to interference plus noise ratio- (SINR-) based handover decision method was proposed in [10]. In [11], a speed-aware handover decision method was proposed to deal with the influence of the handover process on network throughput in a two-tier cellular network; this method does not choose the best candidate base station but the base station that keeps the maximum service cycle, which ensures cooperation between base stations and eliminates interference. In [12], an improved fuzzy TOPSIS method reduced the set of candidate base stations, which increased network throughput. In [13], a handover decision method using fuzzy logic and combinatorial fusion was proposed; the handover was predicted according to the RSSI, data rate, and network delay. Combining TOPSIS, the handover decision method in [14] assisted mobile users in selecting the proper candidate base station. In [15], a fast game-based handover decision method was proposed by combining an improved competitive auction technique, the uplink and downlink quality, and the load factor. A stochastic geometry analysis method was proposed in [16]; with this method, the influence of different network topologies on handover decisions was analyzed, and the approximate handover number was estimated. In [17], an analysis framework of handover based on stochastic geometry was proposed to analyze the number of base stations, the triggering time, and the mobility patterns of users. The authors of [18] proposed a handover decision method based on fuzzy logic to save the energy of mobile devices in an integrated LTE and Wi-Fi network. Traditional handover decision methods rely on historical data and incur training costs. The decision function, multiattribute decision, and context-aware decision methods depend on collected information about networks and users. This information plays an essential role in the handover decision; however, collecting it takes considerable time. The delayed information and the dependence on prior knowledge lead to frequent handover and load imbalance in the STIN.

A test of a reinforcement learning method on satellites designed by NASA proved that the artificial intelligence decision method has good performance and that its deployment is feasible [19]. In [20], a Markov handover decision model was constructed, and a hybrid vertical handover decision method was proposed. In [21], a Markov decision process- (MDP-) based handover decision method was proposed to optimize the QoS of network communication. In [22], considering the channel quality and the QoS of communication, a reinforcement learning-based handover decision method was proposed from the point of view of user numbers. The Google DeepMind team proposed the deep reinforcement learning (DRL) method, which was evaluated on Atari 2600 games and achieved excellent performance [23]. This artificial intelligence method has been used in communications and networking to deal with dynamic network access, data rate control, wireless caching, data offloading, and resource management [24]. In [25], a multiagent DRL method was proposed to resolve the distributed handover management problem; considering the cost factor, each user was modelled as an agent and the handover decision was optimized. In [26], the mobility patterns of users were classified, and an asynchronous multiagent DRL method was used in the handover decision process. In [27], the convergence speed and accuracy of the Q-network were optimized by an evolution strategy. Reinforcement learning-based handover decision methods have good decision-making ability and handover performance. However, the state space, action space, and reward function need to be adjusted for different network scenarios, which leads to performance fluctuations. Moreover, reinforcement learning-based methods need to search the Q-table efficiently, so they are suitable for discrete state space problems. By replacing the Q-table with a neural network, DRL-based handover decision methods are better at dealing with continuous state spaces. Therefore, our research adopts an improved DRL method to train the handover decision process.

3. System Model

3.1. Network Model and Problem Formulation

The terrestrial segment of the STIN consists of macro cells and mobile users. In the satellite segment, there is always one LEO satellite that provides the satellite communication service. Mobile users select a base station or the LEO satellite to transmit data. Our research focuses on the handover decision problem in the terrestrial segment of the STIN. The network time $T$ is divided into many time slots. In each time slot $t$, every mobile user selects the optimal target base station from the candidate base stations. When a mobile user moves out of the network range or the network time ends, the state of the user is updated to the end state. The network time $T$ thus includes many discrete handover decision scenarios. After the construction of the Markov handover decision model, the handover decision process is optimized by the neural network.

The handover decision process of mobile users is modelled as a discrete Markov decision process, expressed by the tuple $(S, A, P, R, \gamma)$, where $S$ is the discrete state space of the network, composed of network parameters. $A$ is the action space, composed of the set of candidate base stations. The parameter $P$ is the state transition probability, which describes how the action $a$ taken in state $s$ determines the next state. The reward function $R$ computes the positive or negative reward returned by the network environment. The discount coefficient $\gamma$ describes the value of future rewards. Figure 2 shows that the agent obtains the current state $s_t$ in each time slot $t$. The action $a_t$ is determined by the strategy $\pi$. After receiving the action $a_t$, the network environment returns a reward $r_t$, which indicates whether the action is proper. Then, the state of the network environment is updated to $s_{t+1}$.

The proposed LB-DDQN handover decision method aims to maximize the total reward of the handover decision. In the interaction between the agent and the environment, the discounted total reward is defined as

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \quad (1)$$

where $r_t$ is the immediate reward of handover and $\gamma$ is the discount coefficient of future rewards. According to Equation (1), the Bellman operator updates the action-value function $Q(s, a)$:

$$Q(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \right] \quad (2)$$

where $s_t$ and $a_t$ are the state and action, respectively, in time slot $t$. According to Equation (2), the optimal Bellman operator is defined as

$$Q^{*}(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \right] \quad (3)$$

Equation (3) describes how the maximum value of the action-value function is computed. In the LB-DDQN handover decision method, a neural network is used to estimate the action-value function.
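For concreteness, the following minimal Python sketch (not the authors' implementation) shows how the discounted return of Equation (1) and the one-step optimal Bellman target of Equation (3) can be computed; the reward values and discount factor are illustrative.

```python
# Minimal sketch: discounted return (Eq. (1)) and one-step optimal Bellman
# target (Eq. (3)) with illustrative numbers.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * r_{t+k+1} over a finite reward trace."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

def bellman_target(reward, next_q_values, gamma=0.9):
    """Optimal Bellman target: r + gamma * max_a' Q(s', a')."""
    return reward + gamma * max(next_q_values)

if __name__ == "__main__":
    print(discounted_return([1.0, 0.5, 0.2]))    # 1.0 + 0.9*0.5 + 0.81*0.2
    print(bellman_target(0.7, [0.2, 0.9, 0.4]))  # 0.7 + 0.9*0.9
```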

3.2. State Space and Action Space

In each time slot $t$, the size of the candidate base station set for a mobile user is $N$. We select the SINR, delay, and load coefficient as the decision factors. For candidate base station $n$, the state is expressed as $s_{n,t} = \{\mathrm{SINR}_{n,t}, \mathrm{delay}_{n,t}, L_{n,t}\}$, and the total network state space is defined as $S_t = \{s_{1,t}, s_{2,t}, \ldots, s_{N,t}\}$. The public X2 interface shares the load information of each base station, and the SINR and delay information are obtained from the regular measurement reports. These selected decision factors support the handover decision.

The handover action of a mobile user is expressed by the parameter $a_t$, and the action space is made up of the indexes of all candidate base stations, expressed as $A = \{0, 1, \ldots, N-1\}$. According to the $\varepsilon$-greedy strategy, in each time slot $t$, the mobile user selects one proper base station to connect to. For example, when the action parameter $a_t = 0$, the base station whose index is 0 is selected.
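The $\varepsilon$-greedy selection over the candidate base-station indexes can be sketched as follows; this is an illustrative Python fragment, and the Q-values and exploration rate are placeholders rather than values from the paper.

```python
import random

# Minimal sketch of epsilon-greedy base-station selection over the action
# space A = {0, ..., N-1}; q_values and epsilon are illustrative inputs.

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon explore a random candidate base station,
    otherwise exploit the index with the largest estimated Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

if __name__ == "__main__":
    q = [0.31, 0.52, 0.48]            # one Q-value per candidate base station
    print(epsilon_greedy_action(q))   # usually 1, occasionally a random index
```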

3.3. Reward Function

Considering the network parameters selected for the state space, the reward function is defined as

$$r_{n,t} = \sum_{i} w_i \, r_{i,n,t} \quad (4)$$

where $r_{i,n,t}$ is the normalized reward of parameter $i$ for candidate base station $n$ in time slot $t$, and $w_i$ is the weight coefficient of parameter $i$ ($\sum_i w_i = 1$). These weights are computed by the AHP method [6]. The load reward is computed from the load coefficient, and the SINR is obtained from the measurement report of the base station. For positive parameters such as the SINR and the load reward, the normalized reward is defined as

$$r_{i,n,t} = \frac{x_{i,n,t} - x_{i}^{\min}}{x_{i}^{\max} - x_{i}^{\min}} \quad (5)$$

where $x_{i}^{\max}$ and $x_{i}^{\min}$ are the maximum and minimum values of network parameter $i$, and $x_{i,n,t}$ is the value of parameter $i$ for candidate base station $n$ in time slot $t$; the normalized reward lies in $[0, 1]$. For negative parameters such as the delay, the normalized reward is defined as

$$r_{i,n,t} = \frac{x_{i}^{\max} - x_{i,n,t}}{x_{i}^{\max} - x_{i}^{\min}} \quad (6)$$
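A minimal Python sketch of this reward computation is given below, assuming min-max normalization as in Equations (5)-(6) and illustrative AHP-style weights; the concrete parameter ranges are placeholders.

```python
# Minimal sketch of the reward in Eqs. (4)-(6): min-max normalisation of each
# decision factor (benefit factors such as SINR and load reward, cost factors
# such as delay) combined with AHP-style weights. All values are illustrative.

def normalise(value, vmin, vmax, benefit=True):
    """Map a raw factor into [0, 1]; invert the scale for cost factors."""
    if vmax == vmin:
        return 0.0
    x = (value - vmin) / (vmax - vmin)
    return x if benefit else 1.0 - x

def reward(factors, weights):
    """Weighted sum of normalised factor rewards for one candidate base station."""
    return sum(w * r for w, r in zip(weights, factors))

if __name__ == "__main__":
    sinr   = normalise(18.0, vmin=-5.0, vmax=30.0, benefit=True)
    delay  = normalise(35.0, vmin=10.0, vmax=100.0, benefit=False)
    load_r = normalise(0.8,  vmin=0.5,  vmax=1.0,  benefit=True)
    print(reward([sinr, delay, load_r], weights=[0.5, 0.3, 0.2]))
```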

4. LB-DDQN-Based Handover Decision Method

4.1. Traditional Handover Decision Methods

The traditional handover decision methods include SAW [4], TOPSIS [17], and Q-learning [28]. Compared with the DRL handover decision method, these traditional methods depend on the measurement report and prior knowledge, are unsuitable for dynamically changing network environments, and do not fully consider the load balancing problem. The SAW method ranks the candidate base stations by the sum of normalized parameters, defined as

$$V_n = \sum_{i} w_i \, x_{i,n} \quad (7)$$

where $V_n$ is the weighted sum of the normalized network parameters of candidate base station $n$, $w_i$ is the weight of parameter $i$, and $x_{i,n}$ is the normalized value of parameter $i$ for candidate base station $n$, with $x_{i,n} \in [0, 1]$ and $\sum_i w_i = 1$. TOPSIS [17] selects the optimal candidate base station. The Euclidean distances are defined as

$$D_n^{+} = \sqrt{\sum_{i} \left( x_{i,n} - x_i^{+} \right)^2}, \qquad D_n^{-} = \sqrt{\sum_{i} \left( x_{i,n} - x_i^{-} \right)^2} \quad (8)$$

where $D_n^{+}$ and $D_n^{-}$ are the Euclidean distances between candidate base station $n$ and the optimal solution set $\{x_i^{+}\}$ and the worst solution set $\{x_i^{-}\}$, respectively; $x_i^{+}$ and $x_i^{-}$ denote the maximum and minimum values of parameter $i$. The closeness coefficient is defined as

$$C_n = \frac{D_n^{-}}{D_n^{+} + D_n^{-}} \quad (9)$$

When $C_n$ is close to 1 and $D_n^{+}$ is small, the Euclidean distance between the candidate solution and the optimal solution is small. The update of the action-value function in the Q-learning method is defined as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (10)$$

where $\alpha$ is the learning rate.
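The following Python sketch illustrates the SAW score of Equation (7) and the TOPSIS closeness coefficient of Equations (8)-(9) on a small, already normalized decision matrix; the matrix values and weights are illustrative.

```python
import math

# Minimal sketch of the two classical baselines: SAW scores (Eq. (7)) and the
# TOPSIS closeness coefficient (Eqs. (8)-(9)). Rows are candidate base
# stations, columns are normalised decision factors; weights are illustrative.

def saw_scores(matrix, weights):
    """Eq. (7): weighted sum of normalised factors per candidate."""
    return [sum(w * x for w, x in zip(weights, row)) for row in matrix]

def topsis_closeness(matrix, weights):
    """Eqs. (8)-(9): distances to the ideal/worst solutions, then closeness."""
    weighted = [[w * x for w, x in zip(weights, row)] for row in matrix]
    best  = [max(col) for col in zip(*weighted)]
    worst = [min(col) for col in zip(*weighted)]
    closeness = []
    for row in weighted:
        d_plus  = math.sqrt(sum((x - b) ** 2 for x, b in zip(row, best)))
        d_minus = math.sqrt(sum((x - w) ** 2 for x, w in zip(row, worst)))
        closeness.append(d_minus / (d_plus + d_minus))
    return closeness

if __name__ == "__main__":
    matrix = [[0.9, 0.4, 0.7],   # candidate 0: SINR, delay, load (normalised)
              [0.6, 0.8, 0.5],
              [0.7, 0.6, 0.9]]
    weights = [0.5, 0.3, 0.2]
    print(saw_scores(matrix, weights))
    print(topsis_closeness(matrix, weights))
```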

4.2. Load Balancing Strategy

Each base station can serve a fixed number of mobile users at the same time. When the number of connected users exceeds the load upper limit, or the mobile user cannot connect to the target base station, the handover request fails. The proposed LB-DDQN handover decision method realizes handover decision and load balancing simultaneously through the load balancing strategy. Limited resource blocks and unevenly distributed users cause handover requests to fail; with the LB-DDQN method, the number of handover failures is effectively decreased. The load coefficient, the load reward, and the load balancing evaluation function constitute the designed load balancing strategy. In time slot $t$, the load coefficient of base station $n$, expressed by $L_{n,t}$, is defined as

$$L_{n,t} = \frac{U_{n,t}}{B_n} \quad (11)$$

The variable $U_{n,t}$ expresses the number of serving users of base station $n$ in time slot $t$. We assume that one mobile user connects to at most one base station and occupies one resource block. The variable $B_n$ expresses the total number of resource blocks of base station $n$. When the load coefficient of a base station increases, the number of serving users increases; the load reward of the base station then decreases, and the probability of that base station being selected for handover decreases. On the contrary, when the load coefficient of a base station becomes small, the available resource blocks and the load reward become large, and this base station is more likely to be selected as the optimal candidate base station. In time slot $t$, the load reward of base station $n$, expressed by $HO\_reward_{n,t}$, is defined as a decreasing function of the load coefficient $L_{n,t}$ whose value range is $[0.5, 1]$ (Equation (12)). In time slot $t$, the load balancing evaluation function $F_t$ (Equation (13)) measures the deviation of the per-station load coefficients $L_{n,t}$ from their network-wide average $\bar{L}_t$. When the value of $F_t$ is larger, the distribution of mobile users is more imbalanced; when $F_t$ is close to 0, the load of the base stations is balanced. The averaging operation obtains the mean load coefficient $\bar{L}_t$ of all base stations.
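A minimal Python sketch of these quantities follows. The load coefficient implements Equation (11); the linear map used for the load reward and the standard deviation used for the load balancing evaluation are illustrative assumptions that respect the ranges and monotonicity stated above, not necessarily the paper's exact Equations (12) and (13).

```python
import statistics

# Minimal sketch of the load-balancing quantities in Section 4.2. The load
# coefficient is connected users / resource blocks (Eq. (11)); the concrete
# mappings below for the load reward (linear map of the load coefficient onto
# [0.5, 1]) and the evaluation function (standard deviation of the load
# coefficients) are assumptions for illustration, not the paper's exact forms.

def load_coefficient(connected_users, resource_blocks):
    return connected_users / resource_blocks

def load_reward(load_coeff):
    """Assumed form: decreases from 1 to 0.5 as the load coefficient grows."""
    return 1.0 - 0.5 * load_coeff

def load_balance_value(load_coeffs):
    """Assumed form: spread of the per-station load around the network mean."""
    return statistics.pstdev(load_coeffs)

if __name__ == "__main__":
    loads = [load_coefficient(u, 50) for u in (10, 35, 48)]
    print([round(load_reward(l), 3) for l in loads])
    print(round(load_balance_value(loads), 4))   # closer to 0 means more balanced
```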

4.3. Implementation of LB-DDQN

The DRL handover decision method adopts the neural network to estimate the optimal value of the action-value function. In the training process of handover decision, the normalized parameters of state space are regarded as the input of the neural network, and the optimal value of the action-value function is output.

In [23], the update of the action-value function in the DQN method is defined as

$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a} Q_t\left(s_{t+1}, a; \theta^{-}\right)$$

where $Q_m(s, a; \theta)$ is the action-value function of the main Q-network and $Q_t(s, a; \theta^{-})$ is the action-value function of the target Q-network. We propose the improved LB-DDQN handover decision method: the main Q-network first determines the handover action corresponding to the maximum value of $Q_m$, and the target Q-network then evaluates this action. The update of the action-value function is defined as

$$Q(s_t, a_t) = r_{t+1} + \gamma\, Q_t\left(s_{t+1}, \arg\max_{a} Q_m(s_{t+1}, a; \theta); \theta^{-}\right)$$

The loss function of the DDQN method is the squared difference between the target value $y_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, \arg\max_{a} Q_m(s_{t+1}, a; \theta); \theta^{-})$ and the estimated action-value $Q_m(s_t, a_t; \theta)$. The loss function is defined as

$$L(\theta) = \mathbb{E}\left[\left(y_t - Q_m(s_t, a_t; \theta)\right)^{2}\right]$$

In the training process of the handover decision, the loss function provides the gradient used to update the parameters of the main Q-network at each iteration. As the parameters are updated, the value of the loss function decreases and the handover performance improves. The loss function of the DDQN method is optimized by the stochastic gradient descent method. The gradient of the loss function is defined as

$$\nabla_{\theta} L(\theta) = \mathbb{E}\left[-2\left(y_t - Q_m(s_t, a_t; \theta)\right)\nabla_{\theta} Q_m(s_t, a_t; \theta)\right]$$

As Figure 3 shows, the DDQN handover decision method adopts a main Q-network $Q_m$ and a target Q-network $Q_t$. These two neural networks are initialized with the same network parameters. The parameters $\theta$ of the main Q-network are updated at each iteration, and the main Q-network is used to estimate the value of the action-value function. After every $D$ steps, the parameters of the target Q-network are updated with the parameters of the main Q-network, i.e., $\theta^{-} \leftarrow \theta$. The target Q-network is used to compute the expected (target) value of the action-value function.
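The DDQN update described above can be sketched as follows in PyTorch (an assumption; the paper only states that the simulation platform is Python 3). The network sizes, learning rate, and batch contents are illustrative, not the authors' settings.

```python
import torch
import torch.nn as nn

# Minimal PyTorch sketch (not the authors' code) of the DDQN update in
# Section 4.3: the main network selects the greedy next action, the target
# network evaluates it, and the main network is regressed onto that target.

state_dim, n_actions, gamma = 3, 31, 0.9          # SINR, delay, load; 31 BSs

def make_q_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

q_main, q_target = make_q_net(), make_q_net()
q_target.load_state_dict(q_main.state_dict())      # theta^- <- theta
optimizer = torch.optim.SGD(q_main.parameters(), lr=1e-3)

def ddqn_step(states, actions, rewards, next_states, dones):
    """One stochastic-gradient step on the DDQN loss for a sampled mini-batch."""
    with torch.no_grad():
        next_actions = q_main(next_states).argmax(dim=1, keepdim=True)  # argmax_a Q_m
        next_values = q_target(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_values         # y = r + gamma * Q_t
    q_sa = q_main(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    batch = 8
    loss = ddqn_step(torch.rand(batch, state_dim),
                     torch.randint(0, n_actions, (batch,)),
                     torch.rand(batch),
                     torch.rand(batch, state_dim),
                     torch.zeros(batch))
    print(loss)
```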

In the training process of the handover decision, the experience data are saved in the replay buffer. Through the experience replay and small-batch sampling methods, randomly sampled data are used as input to train the parameters of the Q-network (a minimal sketch of such a replay buffer is given after Algorithm 1). Through the $\varepsilon$-greedy strategy, the exploration and exploitation of the optimal handover action are realized. The detailed steps of the LB-DDQN handover decision method are described in Algorithm 1.

Input: Iteration number NUM_EPISODES, step number MAX_STEPS, node number node_num, measurement information SINR and delay, length of update step D.
Output: Handover decision matrix A.
1: Initialize the action-value function Q, the replay buffer B, and the handover decision matrix A. The main Q-network and the target Q-network are initialized with the same parameters: θ⁻ ← θ.
2: for i=1, NUM_EPISODES do
3: for j=1, MAX_STEPS do
4:  for k=1, node_num do
5:   According to Eq. (4, 5, 6), the immediate reward rj,k is computed.
6:   According to Eq. (11, 12, 13), the load coefficient Li,t and the load reward HO_reward are obtained. Construct the state sj,k, which includes the SINR, delay, and HO_reward.
7:   By the ε-greedy strategy, the handover action aj,k corresponding to the state sj,k is determined. And the handover decision matrix A is updated.
8:   Construct the next state s'j,k, the experience data (sj,k, aj,k, rj,k, s'j,k) is saved in the replay buffer B.
9:   Using the experience replay and small-batch sampling methods, randomly sampled data from the replay buffer B are fed into the main Q-network Qm. Then the action-value function Qm(s,a) is obtained.
10:    According to Eq. (16, 17), the action am corresponding to the maximum value of Qm is obtained and fed into the target Q-network Qt, and the action-value Qt(s'j,k, am) is computed.
11:    Adopting the stochastic gradient descent method, according to Eq. (18), the parameters θk of the main Q-network are updated.
12:   end for
13:   Every D steps, the parameters of the target Q-network are updated with the parameters of the main Q-network: θ⁻ ← θ.
14: end for
15: end for
16: Return the handover decision matrix A.
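The experience replay used in steps 8-9 of Algorithm 1 can be sketched as a simple bounded buffer with small-batch sampling; the capacity, batch size, and stored transition format below are illustrative assumptions.

```python
import random
from collections import deque

# Minimal sketch of the experience replay in Algorithm 1: transitions
# (s, a, r, s') are stored in a bounded buffer and small random batches are
# drawn to train the main Q-network. Capacity and batch size are illustrative.

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

if __name__ == "__main__":
    buf = ReplayBuffer()
    for step in range(100):
        buf.push([step, 0.5, 0.1], step % 31, random.random(), [step + 1, 0.5, 0.1])
    states, actions, rewards, next_states = buf.sample()
    print(len(states), actions[:5])
```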

5. Simulation Results and Discussions

5.1. Simulation Environment

This research ensures the handover performance and the load balancing requirements simultaneously. The experiments are carried out on a PC with a 3.2 GHz quad-core i5-1570 CPU and 16 GB of RAM. The OS is 64-bit Windows 10, and the simulation platform is Python 3. Figure 4 shows the virtual town scenario, which is 2290 meters long and 1800 meters wide. In this network area, one LEO satellite provides 24-hour communication services. The area includes 31 base stations, each with a communication range of 500 meters. We assume that the base station bandwidth is 10 MHz and that the upper limit of connected users per base station is 50. Each user occupies at most one resource block, and the bandwidth of each subchannel is 180 kHz. The starting point of each mobile user is randomly selected from 11 crossings. The speed of each mobile user is randomly selected from 5 km/h, 25 km/h, 50 km/h, 70 km/h, and 120 km/h, and users move at constant speed in straight lines. The number of mobile users is 50, 100, 200, and 300, respectively.

5.2. Simulation Parameters

The handover rate, handover failure rate, and throughput of the network are used to evaluate the handover performance. The simulation parameters are illustrated in Table 1.

The handover rate (HOR) and handover failure rate (HOF) are defined as

$$\mathrm{HOR} = \frac{N_{\mathrm{succ}}}{N_{\mathrm{dec}}}, \qquad \mathrm{HOF} = \frac{N_{\mathrm{req}} - N_{\mathrm{succ}}}{N_{\mathrm{req}}} \quad (19)$$

where $N_{\mathrm{succ}}$ is the number of successful handovers, $N_{\mathrm{dec}}$ is the total number of handover decisions, and $N_{\mathrm{req}}$ is the number of handover requests. The range of HOR and HOF is $[0, 1]$. The network parameter SINR is defined as

$$\mathrm{SINR} = \frac{P_{s}}{P_{I} + P_{N}} \quad (20)$$

where $P_{s}$ is the effective signal power, $P_{I}$ is the interference power, and $P_{N}$ is the noise power. The throughput of the network, $Th$, is defined as

$$Th = B_{\mathrm{sub}} \log_{2}\left(1 + \mathrm{SINR}\right) \quad (21)$$

where $B_{\mathrm{sub}}$ is the bandwidth of the subchannel.
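These metrics can be computed as in the following Python sketch; the mapping of Equation (19) onto successful handovers, handover decisions, and handover requests follows the description above, and all numeric values are illustrative.

```python
import math

# Minimal sketch of the evaluation metrics in Eqs. (19)-(21): handover rate,
# handover failure rate, linear SINR, and Shannon throughput over one 180 kHz
# subchannel. All numbers below are illustrative.

def handover_rates(successful, decisions, requests):
    hor = successful / decisions              # handover rate
    hof = (requests - successful) / requests  # handover failure rate
    return hor, hof

def sinr_linear(signal_w, interference_w, noise_w):
    return signal_w / (interference_w + noise_w)

def throughput_bps(subchannel_hz, sinr):
    return subchannel_hz * math.log2(1.0 + sinr)

if __name__ == "__main__":
    print(handover_rates(successful=95, decisions=50_000, requests=100))
    s = sinr_linear(1e-9, 2e-10, 5e-11)
    print(throughput_bps(180e3, s) / 1e6, "Mbps")
```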

5.3. Simulation Results
5.3.1. Average Handover Number of User

As Figure 5 shows, the handover numbers of the LB-DDQN method are compared for user speeds of 5 km/h, 25 km/h, 50 km/h, 70 km/h, and 120 km/h and for 50, 100, 200, and 300 mobile users. This figure helps analyze the influence of user speed and user number on handover performance. As we can see, when the speed of the mobile user increases, the handover number of the user gradually decreases. This is because, within the network time $T$, a faster user reaches the final state earlier, and the decrease in the number of effective sampling points in the simulation leads to fewer handover decisions and handovers. In the virtual town scenario, when the user speed is 120 km/h and the length of the road is 1.8 km, the number of effective sampling points is 545; when the user speed is 5 km/h, the number of effective sampling points is 6000. Moreover, when the user speed is fixed, an increase in the number of users results in an increase in the handover number. The increase in mobile users enlarges the distribution differences, the handover number, and the interference. Meanwhile, handover management becomes more complex because the SINR is one of the handover decision factors, which further increases the handover number.

As Figure 6 shows, the SAW [4], TOPSIS [17], Q-learning [28], DQN [27], and ES-DQN [27] handover decision methods are compared with the proposed LB-DDQN handover decision method for different numbers of users. This figure shows the performance differences of these handover decision methods. As the number of users increases, the average handover number per user also increases. Among the traditional handover methods, the Q-learning-based method has the best handover performance. Among the artificial intelligence-based methods, by taking the load balancing factor as a decision factor, the proposed LB-DDQN method optimizes the handover decision process and decreases the average handover number per user.

As we can see from Table 2, when the number of users is 50, the best handover decision method is the SAW method, whose average handover number per user is 9.1; the average handover number of the LB-DDQN method is 10.9. When the number of users is 100, the best method is the LB-DDQN method, with an average handover number of 11.32. When the number of users is 200, the best method is Q-learning, with an average handover number of 16.53, while that of the LB-DDQN method is 19.07. When the number of users is 300, the best method is Q-learning, with an average handover number of 21.67, while that of the LB-DDQN method is 23.74. The proposed LB-DDQN method thus achieves handover performance comparable to Q-learning. Moreover, the LB-DDQN method ensures both the handover performance and the continuity of data services.

5.3.2. Handover Rate and Failure Rate

As Figure 7 shows, the handover rate and failure rate of the different handover decision methods are compared for 100 users. According to Equation (19), the handover rate and failure rate are computed. In terms of the handover rate, the best method is Q-learning, with a handover rate of 0.0019; the handover rate of LB-DDQN is 0.0021. In terms of the failure rate, the best method is ES-DQN, with a failure rate of 0.0067; the failure rate of LB-DDQN is 0.007, which is better than that of the Q-learning method. We find that the proposed LB-DDQN method performs well in both the handover rate and the failure rate. By considering the load factor, the access of mobile users is more balanced and the continuity of data services is enhanced. Compared with the DQN and ES-DQN methods, the LB-DDQN method decreases the handover rate, which satisfies the handover requirement; at the same time, the load balancing strategy also decreases the failure rate.

5.3.3. Throughput of Network

As Figure 8 shows, the network throughput of the different handover decision methods is compared for 100 users. The best method is the proposed LB-DDQN handover decision method, whose network throughput is 0.4221 Mbps; the network throughput of the Q-learning method is 0.4012 Mbps. We find that, by incorporating the load factor, the network throughput of the LB-DDQN method is higher than those of the other methods. The load balancing strategy mitigates the effects of frequent handover, handover failure, and load imbalance.

5.3.4. Load Balancing Function Value

As Figure 9 shows, the load balancing evaluation of these methods is compared for 50, 100, 200, and 300 mobile users. The smaller the load balancing function value, the more balanced the distribution of mobile users. As the number of users increases, the value of the load balancing function increases. Moreover, the best method is the LB-DDQN handover decision method, because our method incorporates the load balancing strategy and uses the load coefficient as a decision factor.

As Table 3 shows, the detailed values of the load balancing function are listed for 50, 100, 200, and 300 users. When the number of users is 50 and 100, the best method is the Q-learning method, with load balancing function values of 0.0724 and 0.1154, respectively. When the number of users is 200 and 300, the best method is the LB-DDQN handover method, with load balancing function values of 0.2388 and 0.3577, respectively. As the number of users increases, the distribution of mobile users becomes more complex, and the proposed load balancing strategy continues to perform well.

5.3.5. The Convergence of the LB-DDQN Method

As Figure 10 shows, as the number of generations increases, the average handover number of users converges quickly. The number of mobile users is 100. When the number of generations is 127, the average handover number of users is 19.91; when the number of generations is 1000, it is 11.32. Because of the random initialization of the Q-network, the initial average handover number of mobile users is very high. Through multiple iterations, experience replay, and small-batch sampling, the weights of the main Q-network and the target Q-network converge, and the results of our method converge rapidly.

6. Conclusions

This paper proposes the LB-DDQN handover decision method and the associated load balancing strategy, which are validated in a virtual simulation of the STIN. The designed load balancing strategy combines the load coefficient and the load reward to assist the training of the handover decision. Frequent handover and handover failure are reduced by the LB-DDQN handover decision method and the load balancing strategy. The distributions of different numbers of mobile users become more balanced, and the number of handover failures decreases. Furthermore, the LB-DDQN method adapts to different user speeds, movement routes, and user numbers, providing good handover decision performance at low cost.

Data Availability

The data used to support the findings of this study are available from Dong-Fang Wu (at [email protected]).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61772385).