Abstract

Enhanced licensed-assisted access (eLAA) is an operational mode that allows the use of the unlicensed band to support long-term evolution (LTE) service via carrier aggregation technology. The additional bandwidth helps meet the demands of growing mobile traffic. In uplink eLAA, which is prone to unexpected interference from WiFi access points, resource scheduling by the base station followed by a listen-before-talk (LBT) procedure at the users can seriously degrade resource utilization. In this paper, we present a decentralized deep reinforcement learning (DRL)-based approach in which each user independently learns a dynamic band selection strategy that maximizes its own rate. Through extensive simulations, we show that the proposed DRL-based band selection scheme improves resource utilization while supporting a certain minimum quality of service (QoS).

1. Introduction

The rapid growth of mobile traffic has resulted in a scarcity of available radio spectrum. To meet this ever-increasing demand, extending systems such as long-term evolution (LTE) to the unlicensed spectrum is one of the promising approaches to boost users’ quality of service by providing higher data rates [1]. In this regard, initiatives such as licensed-assisted access (LAA) [2], LTE-unlicensed (LTE-U) [3], and MulteFire (MF) [4] can be mentioned. The focus of this article, however, is on the LAA system, which 3GPP initially introduced and standardized in Rel-13 for downlink operations only [2]. With carrier aggregation (CA) technology, carriers on the licensed band are primarily used to carry control signals and critical data, while additional secondary carriers from the unlicensed band are used to opportunistically boost the data rates of the users [5]. To obey regional spectrum regulations, such as restrictions on the maximum transmit power and channel occupancy time [6], while coexisting fairly with existing systems such as WiFi, an LAA base station (BS) must perform a listen-before-talk (LBT) mechanism before transmitting over the unlicensed band [7–9]. The enhanced version of LAA, named enhanced licensed-assisted access (eLAA), which supports both uplink and downlink operations, was later approved in Rel-14 [10]. The uplink eLAA mode over the unlicensed band is designed to satisfy the channel access mechanisms of both bands: the BS performs LBT and allocates uplink resources to the scheduled users, and the scheduled users then perform a second round of LBT to check whether the channel is clear before uplink transmission [11]. The degradation of uplink channel access due to the two rounds of LBT is investigated in [12–14]. If a scheduled user senses an active WiFi access point (AP) that is hidden from the BS, the channel cannot be accessed, wasting the reserved uplink resources. A scheduling-based approach in uplink eLAA can therefore significantly degrade the utilization of uplink resources in the presence of unexpected interference sources.

To improve the utilization of unlicensed band resources, several approaches have been suggested. In [15–17], multi-subframe scheduling (MSS), a simple modification of conventional scheduling, is proposed. MSS enables a single uplink grant to indicate resource allocations across multiple subframes. Providing diverse transmission opportunities may enhance resource utilization; however, the resources can still be wasted if the user fails to access the channel. In [14, 18], schemes that switch between random access and scheduling are proposed, but their focus is limited to the unlicensed spectrum. A joint licensed and unlicensed band resource allocation that takes a hidden node into account is proposed in [19] for the downlink eLAA system. Furthermore, in [20], a scheme that does not require an uplink grant is proposed, along with the required enhancements to the existing LTE system.

In this paper, we pursue a new learning approach in which each user independently makes a dynamic band selection (licensed or unlicensed) for uplink transmission, without waiting for scheduling from the BS. To this end, we implement each user as a DRL agent that learns the optimal band selection strategy relying only on its own local observations, i.e., without any prior knowledge of the WiFi APs’ activities or the time-varying channel conditions. Through continuous interactions with the environment, the users that are potentially affected by hidden nodes learn the activities of the WiFi APs and exploit this knowledge in the band selection process. The learned policy not only guarantees channel access but also ensures a transmission rate above a certain threshold, despite the presence of unpredictable hidden nodes. Such a learning approach is a useful means of handling the underlying resource utilization problems in uplink eLAA.

The rest of the paper is organized as follows. Section 2 describes the system model considered in the paper. Section 3 gives a brief overview of deep reinforcement learning (DRL), followed by the DRL formulation of the band selection problem; the proposed deep neural network architecture and training algorithm are also discussed. Simulation results are presented in Section 4, and finally, conclusions are drawn in Section 5.

2. System Model

We consider a single-cell uplink eLAA system that consists of an eLAA base station (BS) and user equipment (UE) that can also operate in the unlicensed band through carrier aggregation technology. Let $\mathcal{N} = \{1, 2, \ldots, N\}$ denote the set of user indices, where the users are uniformly distributed within the cell, and let $\mathcal{M} = \{1, 2, \ldots, M\}$ designate the set of unlicensed band interference sources, such as WiFi access points (APs), which are located outside the coverage area of the cell within a certain distance. The system model is shown in Figure 1.

In order to get uplink access, each UE makes a scheduling request to the eLAA BS, which is responsible for allocating resources. Before granting uplink resources, the eLAA BS is required to undergo a carrier-sensing procedure within its coverage limit. Once the channel is clear, it reserves resources for uplink transmission. Then, the scheduled user performs another round of the listen-before-talk procedure before transmission. If the user detects a transmission from a hidden node, i.e., a nearby WiFi AP that is outside the carrier-sensing range of the eLAA BS, then the reserved uplink resources over the unlicensed band cannot be accessed.

We assume the channel between the BS and the $n$-th UE, denoted as $h_n(t)$, evolves according to the Gaussian Markov block fading autoregressive model [21] as follows:
$$h_n(t) = \rho\, h_n(t-1) + \sqrt{1-\rho^2}\, e_n(t),$$
where $\rho$ is the normalized channel correlation coefficient between slot $t-1$ and slot $t$. From Jakes’ fading spectrum, $\rho = J_0(2\pi f_d T_s)$, where $f_d$, $T_s$, and $J_0(\cdot)$ are the Doppler frequency, the slot duration, and the zeroth-order Bessel function of the first kind, respectively. The error term $e_n(t)$ is a circularly symmetric complex Gaussian variable, i.e., $e_n(t) \sim \mathcal{CN}\big(0, \beta_0 (d_n/d_0)^{-\alpha}\big)$, where $\beta_0$ is the path loss corresponding to the reference distance $d_0$ and $\alpha$ is the path loss exponent. The channel is initialized as $h_n(0) \sim \mathcal{CN}\big(0, \beta_0 (d_n/d_0)^{-\alpha}\big)$, where $d_n$ is the distance of the $n$-th user from the BS.
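For illustration, the following minimal Python sketch shows how such a Gauss-Markov fading channel could be simulated; the Doppler frequency, slot duration, and path loss parameters used here are assumed placeholder values rather than the settings of Table 3.

import numpy as np
from scipy.special import j0   # zeroth-order Bessel function of the first kind

# Assumed placeholder values (not necessarily the settings of Table 3).
f_d = 10.0       # Doppler frequency [Hz]
T_s = 1e-3       # slot duration [s]
beta_0 = 1e-3    # path loss at the reference distance d_0
d_0 = 1.0        # reference distance [m]
alpha = 3.0      # path loss exponent
d_n = 50.0       # distance of the n-th user from the BS [m]

rho = j0(2 * np.pi * f_d * T_s)              # channel correlation coefficient
beta_n = beta_0 * (d_n / d_0) ** (-alpha)    # large-scale path loss of user n

rng = np.random.default_rng(0)

def cn(var):
    """Draw a circularly symmetric complex Gaussian sample with variance var."""
    return np.sqrt(var / 2) * (rng.standard_normal() + 1j * rng.standard_normal())

h = cn(beta_n)                               # initial channel h_n(0)
for t in range(100):                         # Gauss-Markov block fading evolution
    h = rho * h + np.sqrt(1 - rho**2) * cn(beta_n)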

Let $W_U$ and $W_L$ be the total bandwidths in the unlicensed and licensed bands, respectively. At time slot $t$, let the number of users associated with the unlicensed and licensed bands be $N_U(t)$ and $N_L(t)$, respectively. If all UEs on the licensed band are uniformly allocated orthogonal uplink resources, then the bandwidth of each licensed band UE $n$ is constrained as
$$B_n^L(t) \le \frac{W_L}{N_L(t)}.$$

Similarly, expecting that the total unlicensed bandwidth is equally shared among the UEs in a virtual sense, the bandwidth of each UE $n$ on the unlicensed band can be constrained as
$$B_n^U(t) \le \frac{W_U}{N_U(t)}.$$

Denoting $P$ and $N_0$ as the uplink transmit power and the noise spectral density, respectively, we may compute the signal-to-noise ratio (SNR) of the received signal at the BS for an unlicensed band user $n$ (assuming it occupies the channel) as
$$\mathrm{SNR}_n^U(t) = \frac{P\,|h_n(t)|^2}{N_0\, B_n^U(t)}.$$

Likewise, the SNR for a licensed band user $n$ is given as
$$\mathrm{SNR}_n^L(t) = \frac{P\,|h_n(t)|^2}{N_0\, B_n^L(t)}.$$

The dynamics of each WiFi AP’s activity are modeled as a discrete-time two-state Markov chain, as shown in Figure 2. Each AP can be either in the active state ($s = 1$) or the inactive state ($s = 0$). The transition probability from state $j$ to state $k$ is denoted as $p_{jk} = \Pr\big(s(t+1) = k \mid s(t) = j\big)$, with $j, k \in \{0, 1\}$.

Note that the users do not have knowledge of the underlying dynamics of the WiFi APs’ activities, i.e., the transition probabilities.

Let $\tau$ represent the transmission probability of an active WiFi AP. In slot $t$, let $M_n(t)$ be the number of contending active APs within the sensing range of the $n$-th UE. Assuming that all WiFi AP activities are independent, the probability that UE $n$ has at least one hidden node is
$$p_n^{\mathrm{hid}}(t) = 1 - (1 - \tau)^{M_n(t)}.$$
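As an illustration, the WiFi AP activity and the resulting hidden-node probability could be simulated with the short Python sketch below; the transition matrix, transmission probability, and number of nearby APs are assumed placeholder values, not those of Section 4.

import numpy as np

rng = np.random.default_rng(1)

# Placeholder two-state Markov chain for each AP: state 0 = inactive, 1 = active.
# P[j, k] is the transition probability from state j to state k (assumed values).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

tau = 0.5      # assumed transmission probability of an active AP
num_aps = 4    # assumed number of APs within the sensing range of UE n

states = rng.integers(0, 2, size=num_aps)          # initial AP states
for t in range(100):
    # Each AP moves to its next state according to its current row of P.
    states = np.array([rng.choice(2, p=P[s]) for s in states])
    m_active = int(states.sum())                   # contending active APs in slot t
    # Probability that UE n experiences at least one hidden-node transmission.
    p_hidden = 1.0 - (1.0 - tau) ** m_active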

In order to calculate the uplink rate (throughput) of the users, we refer to the lookup table given in Table 1, which maps the received SNR to spectral efficiency (SE) [22]. Then, the uplink rate of UE $n$ using the unlicensed band is given as
$$R_n^U(t) = B_n^U(t)\,\mathrm{SE}\big(\mathrm{SNR}_n^U(t)\big).$$

Similarly, the uplink rate of UE $n$ using the licensed band is given as
$$R_n^L(t) = B_n^L(t)\,\mathrm{SE}\big(\mathrm{SNR}_n^L(t)\big).$$
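The rate computation can be sketched as follows in Python; the SNR-to-SE mapping shown is a hypothetical placeholder, since Table 1 is not reproduced here, and the bandwidth, power, and noise values are assumed for illustration only.

import numpy as np

# Hypothetical SNR-to-spectral-efficiency mapping (Table 1 is not reproduced here):
# each entry is (minimum SNR in dB, spectral efficiency in bit/s/Hz).
SNR_SE_TABLE = [(-6.0, 0.15), (0.0, 0.6), (6.0, 1.5), (12.0, 3.0), (18.0, 4.5)]

def spectral_efficiency(snr_db):
    """Map the received SNR to a spectral efficiency via the lookup table."""
    se = 0.0
    for snr_min, se_val in SNR_SE_TABLE:
        if snr_db >= snr_min:
            se = se_val
    return se

def uplink_rate(p_tx, h, bandwidth, n0):
    """Uplink rate = allocated bandwidth x SE(received SNR)."""
    snr = p_tx * abs(h) ** 2 / (n0 * bandwidth)
    return bandwidth * spectral_efficiency(10 * np.log10(snr))

# Example: a licensed band shared equally among N_L scheduled users (assumed values).
W_L, N_L = 20e6, 5
rate = uplink_rate(p_tx=0.2, h=1e-3, bandwidth=W_L / N_L, n0=4e-21)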

In each time slot $t$, the goal of each UE is to select the band that maximizes its uplink rate. Note that if a certain band, e.g., the licensed band, is overloaded by a large number of UEs, the individual rate of the users in that band will be significantly reduced. This constrains each UE to take advantage of the unlicensed band whenever the APs are inactive. Hence, learning the WiFi APs’ activities and the channel conditions is critical to effectively using the uplink resources while boosting the individual data rate.

3. DRL-Based Decentralized Dynamic Band Selection

3.1. Deep Reinforcement Learning (DRL): Overview

In reinforcement learning (RL), an agent learns how to behave by sequentially interacting with the environment. As shown in Figure 3, at each time $t$, the agent observes the state $s_t \in \mathcal{S}$, where $\mathcal{S}$ is the state space, and executes an action $a_t$ from the action space $\mathcal{A}$. The interaction with the environment produces the next state $s_{t+1}$ and a scalar reward $r_t$.

The goal of the agent is to learn an optimal policy $\pi^*$ that maximizes the discounted long-term cumulative reward, expressed as $\sum_{t=0}^{T} \gamma^{t} r_{t}$, where $\gamma \in [0, 1)$ is the discount factor and $T$ is the total number of time steps (horizon) [23].

One of the most widely used model-free RL methods is Q-learning, in which the agent learns a policy by iteratively evaluating the state-action value function $Q^{\pi}(s, a)$, defined as the expected return starting from the state $s$, taking the action $a$, and then following the policy $\pi$. In order to derive the optimal policy, at a given state $s$, the action that maximizes the state-action value function should be selected, i.e., $a^{*} = \arg\max_{a} Q^{*}(s, a)$, and the optimal actions should likewise be followed in the successor states.

In Q-learning, a lookup table is constructed that stores the action value for every state-action pair $(s, a)$. The entries of the table are updated by iteratively evaluating the Bellman optimality equation as
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big],$$
where $\alpha$ is the learning rate. However, the lookup table approach in Q-learning is not scalable to problems with large state and action spaces. DRL instead approximates the value function with a deep neural network (DNN). In a deep Q-network (DQN), the action-value function $Q(s, a; \theta)$ is estimated by a DNN, parametrized by $\theta$, which takes the state as input. An action is then selected according to the following $\epsilon$-greedy policy: with probability $1 - \epsilon$, the greedy action $a_t = \arg\max_{a} Q(s_t, a; \theta)$ is chosen, and with probability $\epsilon$, a random action is chosen from $\mathcal{A}$.
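A minimal Python sketch of the $\epsilon$-greedy selection is given below, assuming a hypothetical q_network callable that maps a batch of states to one Q-value per action.

import numpy as np

def epsilon_greedy_action(q_network, state, epsilon, num_actions, rng):
    """Select a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    q_values = q_network(state[np.newaxis, ...])   # add a batch dimension
    return int(np.argmax(np.asarray(q_values)[0]))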

To stabilize the learning process, it is common to use a replay buffer $\mathcal{D}$ that stores the transitions $(s_t, a_t, r_t, s_{t+1})$, from which mini-batches of samples are randomly drawn to train the network. Moreover, a separate quasi-static target network, parametrized by $\theta^-$, is used to estimate the target value of the next state. The loss function is computed as
$$L(\theta) = \mathbb{E}\Big[\big(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\big)^2\Big].$$

The parameter $\theta$ is updated by taking stochastic gradient descent steps on the loss $L(\theta)$, while the target parameter is updated according to $\theta^- \leftarrow \theta$ every $C$ steps [24]. The details of the DQN algorithm are summarized in Algorithm 1.

Initialize replay buffer $\mathcal{D}$
Initialize action-value function $Q$ with random parameters $\theta$
Initialize target action-value function $\hat{Q}$ with parameters $\theta^- = \theta$
Input the initial state $s_1$ to the DQN
for $t = 1, \ldots, T$ do
   Execute action $a_t$ selected from $Q(s_t, \cdot\,; \theta)$ using the $\epsilon$-greedy policy
   Observe $r_t$ and $s_{t+1}$ from the environment.
   Store the transition $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $\mathcal{D}$
   Sample a random minibatch of transitions $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{D}$
   Evaluate the target $y_i = r_i + \gamma \max_{a'} \hat{Q}(s_{i+1}, a'; \theta^-)$
   Perform a gradient descent step on $\big(y_i - Q(s_i, a_i; \theta)\big)^2$ with respect to $\theta$
   Every $C$ steps, update the target network according to $\theta^- \leftarrow \theta$
end for
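The core update of Algorithm 1 (replay buffer sampling, target evaluation with the quasi-static target network, and one gradient step) could be implemented along the following lines in TensorFlow; the discount factor, batch size, and optimizer settings are assumed values, and the sketch is illustrative rather than the exact implementation used in the paper. Network construction is deferred to Section 3.3.

import random
from collections import deque

import numpy as np
import tensorflow as tf

GAMMA = 0.9          # assumed discount factor
BATCH_SIZE = 32      # assumed minibatch size

replay_buffer = deque(maxlen=10000)          # example buffer; each agent keeps its own
optimizer = tf.keras.optimizers.Adam(1e-3)   # assumed learning rate
loss_fn = tf.keras.losses.MeanSquaredError()

def dqn_update(q_net, target_net, buffer):
    """One gradient descent step on a random minibatch drawn from the replay buffer."""
    if len(buffer) < BATCH_SIZE:
        return
    batch = random.sample(list(buffer), BATCH_SIZE)
    states, actions, rewards, next_states = map(np.array, zip(*batch))

    # Target: y = r + gamma * max_a' Q(s', a'; theta^-)
    next_q = target_net(next_states.astype(np.float32))
    targets = rewards.astype(np.float32) + GAMMA * tf.reduce_max(next_q, axis=1)

    with tf.GradientTape() as tape:
        q_values = q_net(states.astype(np.float32))
        # Pick out the Q-value of the action taken in each sampled transition.
        action_mask = tf.one_hot(actions, q_values.shape[-1])
        chosen_q = tf.reduce_sum(q_values * action_mask, axis=1)
        loss = loss_fn(targets, chosen_q)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))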
3.2. DRL Formulation for Dynamic Band Selection

Each user is implemented as a DRL agent, specifically as a deep Q-network (DQN) agent, that relies on the output of its deep neural network to make dynamic band selection decisions between the licensed and unlicensed bands. The DRL formulation is presented below.

(i) Action

In each time slot $t$, the $n$-th agent samples an action $a_n(t)$ from the action set $\mathcal{A} = \{\text{unlicensed}, \text{licensed}\}$.

(ii) State

After executing the action $a_n(t)$, the agent receives a binary observation $o_n(t)$ and a reward $r_n(t)$ from the environment. The observation is $o_n(t) = 1$ if the uplink rate in the selected band exceeds the minimum threshold rate and $o_n(t) = 0$ otherwise. The state of the agent is defined as the history of action-observation pairs of length $H$:
$$s_n(t) = \big(a_n(t-H), o_n(t-H), \ldots, a_n(t-1), o_n(t-1)\big).$$

(iii) Reward

Depending on the selected action and on whether the achieved uplink rate meets the corresponding minimum threshold, the agent receives a scalar reward $r_n(t)$, where the rates $R_n^U(t)$ and $R_n^L(t)$ are given according to Equations (8) and (9), while $\eta_U$ and $\eta_L$ are the uplink minimum threshold rates on the unlicensed and licensed bands, respectively.
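As a sketch, the observation, state history, and reward could be generated as follows; the history length and the choice of a zero reward when the threshold is not met are assumptions made for illustration, not the paper's exact reward expression.

from collections import deque

import numpy as np

HISTORY_LEN = 8                 # assumed history length H
UNLICENSED, LICENSED = 0, 1

# The state is a length-H history of (action, observation) pairs fed to the DQN.
history = deque([(0, 0)] * HISTORY_LEN, maxlen=HISTORY_LEN)

def step(action, rate_u, rate_l, eta_u, eta_l):
    """Return (reward, observation) for the chosen band and update the history."""
    rate, eta = (rate_u, eta_u) if action == UNLICENSED else (rate_l, eta_l)
    obs = 1 if rate >= eta else 0
    reward = rate if obs else 0.0   # assumed penalty of zero below the threshold
    history.append((action, obs))
    return reward, obs

def state_vector():
    """Stack the action-observation history into the DQN input."""
    return np.array(history, dtype=np.float32)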

3.3. Deep Neural Network Description

For dynamic band selection, each UE trains an independent DQN. The structure of the deep neural network is shown in Figure 4.

The deep neural network consists of a long short-term memory (LSTM) layer, fully connected layers, and rectified linear unit (ReLU) activation functions.

Long short-term memory (LSTM) is a class of recurrent neural networks (RNNs), which are designed to learn specific patterns in a sequence of data by taking time correlation into account. LSTMs were initially introduced to overcome the vanishing (or exploding) gradient problem of RNNs during backpropagation. Regulated by gate functions, the cell (internal memory) state of an LSTM learns how to aggregate inputs separated in time, i.e., which experiences to keep or throw away [25]. In our formulation, note that the states of the agents, which are histories of action-observation pairs, have long-term dependencies (correlations) emanating from the dynamics of the WiFi APs’ activities, which follow a two-state Markov property, and from the time-varying channel conditions of the Gaussian Markov block-fading autoregressive model. The LSTM is crucial for the learning process since it can capture the actual state by exploiting the underlying correlation in the history of action-observation pairs. Therefore, the state passes through this preprocessing step before being fed to the fully connected layers.

A deep neural network consists of multiple fully connected layers, each of which abstracts certain features of its input. Let $x$ be the input to a layer, and let $W$ and $b$ be its weight matrix and bias vector, respectively. The output vector of a fully connected layer, denoted as $y$, can be described by the operation $y = f(Wx + b)$, where $f(\cdot)$ is the element-wise excitation (activation) function that adds nonlinearity. In our simulations, we feed the states to an LSTM layer with 64 hidden units, whose output is fed to two fully connected hidden layers with 128 and 64 neurons, respectively. The output layer produces the action values for both actions. The ReLU activation function is used in the layers to avoid the vanishing gradient problem [26]. The target network adopts the same neural network structure.
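A minimal Keras sketch of this architecture (an LSTM layer with 64 hidden units, two fully connected layers with 128 and 64 neurons, and a two-unit output layer) is shown below; the input shape, i.e., the history length and per-step feature size, is an assumed placeholder, and the output layer is kept linear in this sketch, a common choice for Q-value outputs.

import tensorflow as tf

HISTORY_LEN = 8     # assumed history length H
FEATURES = 2        # one (action, observation) pair per time step

def build_q_network(num_actions=2):
    """LSTM(64) -> Dense(128) -> Dense(64) -> Dense(num_actions) Q-network."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(HISTORY_LEN, FEATURES)),
        tf.keras.layers.LSTM(64),                        # captures time correlation
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions),              # Q-value for each band
    ])

q_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(q_net.get_weights())              # theta^- <- theta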

3.4. Training Algorithm Description

The DQNs of the agents are trained individually according to Algorithm 2. The loss function given by Equation (14) is used to train each DQN. The hyperparameters are summarized in Table 2.

for each agent $n$ do
  Initialize replay buffer $\mathcal{D}_n$
  Initialize action-value function $Q_n$ with random parameters $\theta_n$
  Initialize target action-value function $\hat{Q}_n$ with parameters $\theta_n^- = \theta_n$
  Generate the initial state $s_n(1)$ from the environment simulator
end for
for $t = 1, \ldots, T$ do
  for each agent $n$ do
   Execute action $a_n(t)$ selected from $Q_n(s_n(t), \cdot\,; \theta_n)$ using the $\epsilon$-greedy policy
   Collect reward $r_n(t)$ and observation $o_n(t)$
   Observe the next state $s_n(t+1)$ from the environment simulator
   Store the transition $(s_n(t), a_n(t), r_n(t), s_n(t+1))$ into $\mathcal{D}_n$
   Sample a random minibatch of transitions from $\mathcal{D}_n$
   Evaluate the target $y = r_n + \gamma \max_{a'} \hat{Q}_n(s_n', a'; \theta_n^-)$
   Perform a gradient descent step on the loss with respect to $\theta_n$
   Every $C$ steps, update the target network according to $\theta_n^- \leftarrow \theta_n$
  end for
end for
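To make the decentralized structure of Algorithm 2 concrete, the following sketch wires several independent agents, each with its own DQN, target network, and replay buffer; build_q_network, epsilon_greedy_action, and dqn_update are the hypothetical helpers sketched earlier, and env_state and env_step stand in for the environment simulator.

from collections import deque

import numpy as np

NUM_AGENTS = 10        # assumed number of UEs
TOTAL_STEPS = 5000     # assumed training horizon
TARGET_UPDATE = 100    # assumed target network update period C
EPSILON = 0.1          # assumed exploration rate

rng = np.random.default_rng(2)

# Each agent keeps its own Q-network, target network, and replay buffer.
agents = []
for n in range(NUM_AGENTS):
    q_net = build_q_network()
    target_net = build_q_network()
    target_net.set_weights(q_net.get_weights())
    agents.append({"q": q_net, "target": target_net, "buffer": deque(maxlen=10000)})

for t in range(TOTAL_STEPS):
    for n, agent in enumerate(agents):
        s = env_state(n)                             # hypothetical simulator hook
        a = epsilon_greedy_action(agent["q"], s, EPSILON, 2, rng)
        r, s_next = env_step(n, a)                   # hypothetical simulator hook
        agent["buffer"].append((s, a, r, s_next))
        dqn_update(agent["q"], agent["target"], agent["buffer"])
        if t % TARGET_UPDATE == 0:
            agent["target"].set_weights(agent["q"].get_weights())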

Note that the agents do not have complete knowledge of the environment, such as the actions of the other agents, the underlying dynamics of the WiFi APs’ activities, and the varying channel conditions. Instead, through sequential interaction with the environment, each agent makes its band selection decisions solely based on local feedback (reward and observation) from the base station. This significantly reduces the training complexity (cost) at each user. Moreover, since the training can be conducted in an offline manner, the trained weights can be used in the deployment phase. Retraining the weights is done infrequently, for example, when the environment changes significantly.

4. Simulation Results

4.1. Simulation Setup

For each realization, we first distribute 10 users uniformly in a square cell area. WiFi APs are distributed according to a homogeneous Poisson point process (PPP) with rate $\lambda$ within a 30 m distance from the coverage area of the cell. Figure 5 illustrates the network model of one realization of the node deployment of the BS, users, and APs.

We set the dynamics of each WiFi AP’s activity according to the following transition matrix:

We further assume that the uplink transmission of a user over the unlicensed band can be interfered with by any active WiFi AP within a 30 m range. Table 3 summarizes the values of all simulation parameters used to evaluate the proposed algorithm.
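The node deployment could be generated along the following lines; the cell geometry, cell size, and PPP density used here are assumed placeholder values, since the actual simulation parameters are those listed in Table 3.

import numpy as np

rng = np.random.default_rng(3)

CELL_SIDE = 200.0     # assumed side length of the square cell area [m]
RING_WIDTH = 30.0     # APs are dropped within 30 m outside the cell
LAMBDA_AP = 1e-3      # assumed PPP density [APs per square meter]
NUM_USERS = 10

# Users: uniformly distributed inside the square cell centered at the BS (origin).
users = rng.uniform(-CELL_SIDE / 2, CELL_SIDE / 2, size=(NUM_USERS, 2))

# WiFi APs: a homogeneous PPP over the enlarged square, keeping only the points
# that fall outside the cell but within RING_WIDTH of its boundary.
outer_side = CELL_SIDE + 2 * RING_WIDTH
num_points = rng.poisson(LAMBDA_AP * outer_side ** 2)
candidates = rng.uniform(-outer_side / 2, outer_side / 2, size=(num_points, 2))
outside_cell = np.max(np.abs(candidates), axis=1) > CELL_SIDE / 2
aps = candidates[outside_cell]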

4.2. Performance Evaluation

We compare the policy learned by the DRL agents with two benchmark schemes: a random policy and a fixed distance-based policy. In the random policy, each user randomly decides which band to select, while in the fixed policy, the decision is made based on the location of the user. Assuming the BS knows the location of each user at each slot $t$, and hence its distance $d_n$ from the BS, only users within a threshold distance $d_{\mathrm{th}}$ from the base station transmit using unlicensed band resources, since they are less susceptible to interfering WiFi APs; the other users transmit using licensed band resources. Since we assumed that a transmission from a WiFi AP can affect the unlicensed band uplink transmission of any user within a 30 m distance, according to the node deployment in Figure 5, the fixed policy assigns users with $d_n \le d_{\mathrm{th}}$ to unlicensed band resources. The trained DRL policy of each agent should learn this distance, without any prior assumption, while selecting a band. Furthermore, by learning the activities of the APs, the agents should make dynamic selections.

Figure 6 compares the per-user average success rate for different thresholds under a given history length and simulation setting. The dynamic DRL agents achieve a success rate of around 90%, outperforming the users of the fixed distance-based policy for all of the thresholds we set. The gain over the fixed distance-based policy is attributed to two factors. The first is that the DRL agents, without any prior assumption, learn the optimal distance from the BS at which to switch the band selection decision. In other words, if a user is located outside this optimal distance, it transmits over the licensed band to avoid interference from nearby WiFi APs. The second factor is that the agents capture the dynamics of both the time-varying channel and the WiFi APs’ activities and exploit them when dynamically selecting a band. This implies that, during the absence of transmissions from nearby WiFi APs, a user beyond the optimal distance still exploits the opportunity to transmit over the unlicensed band, thereby avoiding overloading the other users on the licensed band.

To further investigate the gain coming from the dynamic band selection decisions, we evaluate the per-user average success rate for different throughput thresholds in Figure 7. As the threshold values (over both bands) increase from 3 to 5, the gap in performance (per-user average success rate) also increases. This indicates that the adaptability of the DRL agents is crucial for maintaining an appreciable success rate under stringent quality of service (QoS) requirements.

In Figure 8, the per-user average throughput obtained by the three policies for a given history length and simulation setting is compared. As depicted, the per-user average throughput achieved by the DRL agents exceeds that of the other two schemes. The ability of DRL to adapt to a changing environment and learn a robust policy enables the agents to outperform the fixed distance-based policy, which falls short when either of the bands is overloaded. In other words, even when there is an opportunity to transmit on the unlicensed band due to the inactivity of nearby WiFi APs, cell-edge users under the fixed distance-based policy fail to take advantage of it. Further gains can be obtained by tuning the hyperparameters.

The effect of the number of interfering WiFi APs on the performance of the DRL agents is investigated in Figure 9 for a given history length. As the number of WiFi APs increases (i.e., as the PPP rate $\lambda$ increases), the gain due to the dynamic band selection decisions is reduced, since the number of contenders for unlicensed band resources increases. However, the agents still retain the gain coming from learning the optimal distance for band selection. The performance of the fixed distance-based policy is unaffected by the number of WiFi APs.

Next, in Figure 10, we compare the effect of the history size on the performance of the DRL agents. We observe that shorter history sizes tend to converge relatively faster, although the variation in convergence time across history sizes is marginal. This implies that the convergence of the learned policy is generally insensitive to the history size. Note that all the results are averaged over three numerical simulations.

5. Conclusion and Future Works

To alleviate the underlying resource utilization problem in uplink eLAA, we presented a learning-based, fully decentralized dynamic band selection scheme. In particular, employing a deep reinforcement learning algorithm, we implemented each user as an agent that makes decisions based on the output of its DQN, without waiting for scheduling from the BS. It is shown that, despite lacking knowledge of the underlying dynamics of the WiFi APs’ activities, the DRL agents successfully learn a robust policy for making dynamic band selection decisions. Such a dynamic and decentralized learning approach can significantly alleviate the resource utilization problem associated with the unlicensed band, due to hidden nodes, in the uplink eLAA system. In a future study, we plan to extend this work to more complicated scenarios that involve joint resource allocation over the two bands. Moreover, to improve the gains presented in this paper, different architectures and hyperparameters should be investigated.

Data Availability

We have not used data from external sources for the simulations presented in this paper. The proposed algorithm is implemented in Python with the TensorFlow library.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Brain Korea 21 Plus Project in 2019.