Mobile Information Systems

Volume 2016, Article ID 4565203, 18 pages

http://dx.doi.org/10.1155/2016/4565203

## Dynamic Resource Allocation with Integrated Reinforcement Learning for a D2D-Enabled LTE-A Network with Access to Unlicensed Band

Laboratory of Information Communication Networks, School of Information Science and Technology, Hokkaido University, Sapporo, Japan

Received 30 May 2016; Revised 8 September 2016; Accepted 16 October 2016

Academic Editor: Juan C. Cano

Copyright © 2016 Alia Asheralieva and Yoshikazu Miyanaga. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We propose a dynamic resource allocation algorithm for device-to-device (D2D) communication underlying a Long Term Evolution Advanced (LTE-A) network, with reinforcement learning (RL) applied for unlicensed channel allocation. In the considered system, the inband and outband resources are assigned by the LTE evolved NodeB (eNB) to different device pairs to maximize the network utility subject to the target signal-to-interference-and-noise ratio (SINR) constraints. Because of the absence of an established control link between the unlicensed and cellular radio interfaces, the eNB cannot acquire any information about the quality and availability of unlicensed channels. As a result, the considered problem becomes a stochastic optimization problem that can be dealt with by applying learning theory (to estimate the random unlicensed channel environment). Consequently, we formulate the outband D2D access as a dynamic single-player game in which the player (eNB) estimates its possible strategy and expected utility for all of its actions based only on its own local observations, using a joint utility and strategy estimation based reinforcement learning (JUSTE-RL) with regret algorithm. The proposed approach for resource allocation demonstrates near-optimal performance after a small number of RL iterations and surpasses the other comparable methods in terms of energy efficiency and throughput maximization.

#### 1. Introduction

D2D communication is direct communication between users transmitting over the cellular spectrum (inband) or operating on an unlicensed band (outband). The main advantages of inband D2D communication are the increased spectrum efficiency and the possibility of quality of service (QoS) provisioning for different cellular/D2D users. The chief obstacles to the implementation of inband D2D access are (i) interference mitigation (between the users transmitting over the same frequency bands) and (ii) resource allocation [1]. Effective resource allocation and interference management strategies can significantly improve the performance of cellular networks. The objectives here could be different (such as improvement of spectrum efficiency, cellular coverage, network throughput, or user experience), but to achieve the optimal system performance, the problems of cellular/D2D mode selection, spectrum assignment, power allocation, and interference mitigation should be considered jointly in the algorithm design. Related contributions in this area are [2–10], which study the problem of interference mitigation for underlying D2D communication. It should be noted, however, that the majority of proposed formulations (except [2, 3]) do not deal with the issues of mode selection, spectrum assignment, and interference management in a joint fashion but rather split the original problem into smaller subproblems (see, e.g., [10]) or separate the time scales of these subproblems (e.g., [9]). Hence, although the complexity of such methods is lower than that of a joint resource allocation, they are less efficient in maximizing a given optimality criterion. Outband D2D communication (carried over Wi-Fi Direct [11], ZigBee [12], or Bluetooth [13]) eliminates the need for interference mitigation but can be distorted by the randomness of unlicensed channels. Existing works on outband D2D access focus on such issues as power consumption (e.g., [14–17]) and coordination between cellular and wireless interfaces ([18–21]). Some of these works ([14, 15, 21]) suggest control of the unlicensed band by the cellular network (which requires a certain amount of cooperation and information exchange between different radio interfaces). Other works (e.g., [17, 18, 20]) imply autonomous operation of D2D devices (based on stochastic modeling of unlicensed channels).

The main contributions of this work are as follows. We consider a network-controlled D2D communication in which the licensed and unlicensed spectrum resources, user modes, and transmission power levels are allocated to different device pairs by the LTE eNB to maximize the overall network utility. We consider a general network deployment scenario where the unlicensed band is assumed to be provided by one or more radio access technologies (RATs) based on orthogonal frequency division multiple access (OFDMA), carrier sense multiple access with collision avoidance (CSMA/CA), frequency-hopping code division multiple access (FH-CDMA), or any other multiple access method. It is assumed that all device pairs are equipped with different wireless interfaces allowing them to connect to the appropriate RAT and that they use CSMA/CA to avoid collisions when operating on the unlicensed band. Hence, each unlicensed channel becomes available to a D2D pair only when it is idle. Unlike many previous works, we jointly solve the problems of inband/outband access, mode selection, and spectrum/power assignment by combining them into one optimization problem, which allows the inband network resources to be allocated and the D2D traffic to be offloaded in the most effective way (in terms of maximizing the overall network utility). Note that the formulated problem can be solved to optimality only if the global channel and network knowledge (including the precise information on the operating conditions of the licensed and unlicensed channels) is available to the eNB. However, because of the absence of an established control link between the unlicensed and cellular radio interfaces, the eNB cannot get any information about the quality and availability of the unlicensed channels. As a result, the considered resource allocation problem becomes a stochastic optimization problem that can be dealt with by applying learning theory [22] (to estimate the random unlicensed channel environment).

Consequently, we formulate the outband D2D access as a dynamic single-player game in which the player (eNB) estimates its possible strategy and expected utility for all of its actions based only on its own local observations, using JUSTE-RL with regret (originally proposed in [23]). The main idea behind RL is that the actions leading to a higher network utility at the current stage should be assigned higher probabilities at the next stage [22]. In the simplest form of RL (described, e.g., in [24]), a learning agent estimates its best strategy based on its observed utility without any prior information about its operating environment. This form of RL requires only algebraic operations, but its convergence to the equilibrium state is not guaranteed [25]. In *Q*-learning [22], a utility is estimated using an action-value function. This RL method converges to a Nash equilibrium (NE) state. However, it requires maximization of the action-value at every stage, which can be computationally demanding [22]. In the JUSTE-RL algorithm (described in detail in [23]), a learning agent estimates not only its own strategy but also the expected utility for all of its actions. Unlike *Q*-learning, JUSTE-RL does not need to perform optimization of the action-value (since only algebraic operations are required to update the strategies) and, hence, it has a lower computational complexity. On the other hand, unlike a basic RL algorithm, JUSTE-RL converges to an $\varepsilon$-NE [23, 25].
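To make the idea of joint utility and strategy estimation concrete, the following is a minimal Python sketch of a generic single-agent learner of this kind. It is not the JUSTE-RL with regret algorithm of [23]: the function names, the Boltzmann-Gibbs (softmax) target, the temperature, and the step-size exponents are all illustrative assumptions. The agent keeps a running estimate of the expected utility of every action and nudges its mixed strategy toward the softmax of those estimates, using only algebraic updates.

```python
import math
import random


def softmax(values, temperature):
    """Boltzmann-Gibbs distribution: a higher estimate gets a higher probability."""
    peak = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def learn(observe_utility, num_actions, iterations, temperature=0.1, seed=0):
    """Generic joint utility/strategy estimation loop (illustrative only).

    observe_utility(action, rng) returns the utility observed after playing
    `action`; it stands in for the unknown unlicensed-channel environment.
    """
    rng = random.Random(seed)
    utility_est = [0.0] * num_actions               # estimated utility per action
    strategy = [1.0 / num_actions] * num_actions    # mixed strategy (probabilities)
    for t in range(1, iterations + 1):
        action = rng.choices(range(num_actions), weights=strategy)[0]
        reward = observe_utility(action, rng)
        step_u = 1.0 / t ** 0.6   # assumed step size for the utility estimate
        step_s = 1.0 / t ** 0.8   # assumed (smaller) step size for the strategy
        # Only the played action's utility estimate is updated.
        utility_est[action] += step_u * (reward - utility_est[action])
        # Move the strategy toward the softmax of the current estimates.
        target = softmax(utility_est, temperature)
        strategy = [s + step_s * (g - s) for s, g in zip(strategy, target)]
    return utility_est, strategy
```

Because each strategy update is a convex combination of two probability vectors, the strategy remains a valid distribution without any projection or maximization step, which is the computational advantage the text attributes to this family of methods.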

It is worth mentioning that, in wireless communications, RL has been studied in the context of various spectrum access problems. In [26, 27], the learning has been employed to minimize the interference (created by adjacent nodes) in partially overlapping channels. This problem has been formulated as the exact potential graphical game admitting a pure-strategy NE and, therefore, the proposed approach is not realizable in a broader range of problems. A cognitive network with multiple players has been analyzed in [28]. In this work, the learning and channel selection have been separated into two different procedures which increased the complexity of a proposed resource allocation approach. Besides, the stability of a final solution was not verified. A multi-player game for inband D2D access, where the players (D2D users) learn their optimal strategies based on the throughput performance in a stochastic environment, has been studied in [29]. It was assumed that each D2D user can transmit over the vacant cellular channels using a CSMA/CA implying that there are no channels with interfering users (i.e., each orthogonal channel can be occupied by at most one cellular/D2D user). Although the authors consider a scenario with two D2D users operating on the same channel, it is not clear how a D2D user can sense whether the user operating on the channel is cellular or D2D. An autonomous D2D access in heterogeneous cellular networks comprising multiple low-power and high-power BSs with (possibly) overlapping spectrum bands has been investigated in [30]. This problem has been modeled as a stochastic noncooperative game with multiple players (D2D pairs) admitting a mixed-strategy NE. The goal of each player was to jointly select the wireless channel and power level to maximize its reward, defined as the difference between the achieved throughput and the cost of power consumption constrained by the minimum tolerable SINR requirements of this D2D pair. 
To solve this problem, a fully autonomous multiagent *Q*-learning algorithm (which does not require any information exchange and/or cooperation among different users) is developed and implemented in an LTE-A network.

The rest of the paper is organized as follows. A general network model for inband and outband network operation is described in Section 2. A general problem and the algorithms for unlicensed and licensed resource allocation are formulated in Section 3. The algorithm implementation, including the proposed resource allocation procedure in an LTE-A network and performance evaluation, is presented in Section 4. The paper is concluded in the Conclusion section.

#### 2. Network Model

In this paper, the problem of resource allocation for D2D communication is investigated for both the uplink (UL) and downlink (DL) directions. Accordingly, the discussion throughout the rest of the paper is applicable (if not stated otherwise) to either direction. Consider a basic LTE-A network consisting of one eNB and $N$ user pairs, denoted $\mathrm{PU}_1, \ldots, \mathrm{PU}_N$, with $\mathcal{N} = \{1, \ldots, N\}$ being the set of user pairs’ indices. It is assumed that a fixed licensed spectrum band of the eNB spans $K$ resource blocks (RBs), numbered $\mathrm{RB}_1, \ldots, \mathrm{RB}_K$, with $\mathcal{K} = \{1, \ldots, K\}$ denoting the set of RBs’ indices comprising the bandwidth. The network runs on a slotted-time basis with the time axis partitioned into equal nonoverlapping time intervals (slots) of the length $T_s$, with *t* denoting an integer-valued slot index. Each pair of users can communicate with each other either by the traditional cellular mode (CM) via the eNB or in a D2D mode (DM) without traversing the eNB. Let $\mathcal{C} \subseteq \mathcal{N}$ be the set of the indices of device pairs that can operate only in CM and let $\mathcal{D} = \mathcal{N} \setminus \mathcal{C}$ denote the set of the indices of potential D2D pairs. The indices in $\mathcal{C}$ and $\mathcal{D}$ can be determined based on, e.g., the user application (such as video sharing, gaming, and proximity-aware social networking) in which the pair of devices could potentially be in range for direct communication. Such information can be acquired from a standard session initiation protocol (SIP) procedure (which handles the session setups and user arrivals in LTE networks). Interested readers are referred to [31] for a comprehensive description of an SIP procedure and its use in D2D access.

In our network, any potential D2D pair can be allocated cellular or D2D mode (based on the results of the resource allocation procedure). Consequently, we define a binary mode allocation variable $c_n^t$, $n \in \mathcal{N}$, equaling 1 if $\mathrm{PU}_n$ is allocated CM at slot *t*, and 0 otherwise. Note that $c_n^t = 1$, for all $n \in \mathcal{C}$ (the pairs that can operate only in CM). Further, we consider the following models of D2D access.

(i) Inband D2D: a D2D pair operates within the licensed LTE spectrum in an underlay to cellular communication.

(ii) Outband D2D: a D2D pair transmits over the unlicensed band by exploiting other RATs, such as Wi-Fi Direct [11], ZigBee [12], or Bluetooth [13]. It is assumed that all user devices are equipped with the corresponding wireless interfaces to be able to communicate using a suitable RAT. We assume that there is no coordination and/or information exchange between different wireless interfaces.

To differentiate the pairs according to their D2D access, we define a binary channel access variable $a_n^t$, $n \in \mathcal{N}$, equaling 1 if $\mathrm{PU}_n$ operates inband at slot *t*, and 0 otherwise. Note that all cellular users can access only the LTE bands. Hence, $a_n^t = 1$, for all $n \in \mathcal{C}$.

##### 2.1. Inband Network Operation

In LTE/LTE-A, RBs are allocated to cellular users by the eNBs using a standard packet scheduling procedure [32]. The use of packet scheduling in a D2D-enabled LTE-A network is described, in detail, in [33]. In short, a packet scheduling process can be explained as follows. In the UL direction, at the beginning of any slot* t*, each user is required to collect and transmit its buffer status information. After collecting this data, a user sends the scheduling request (SR) with its buffer status information to the eNB via a dedicated physical uplink control channel (PUCCH). After receiving all the SRs, the eNB allocates the RBs to the users (according to a certain scheduling algorithm) and responds to all the SRs by sending the scheduling grants (SGs) together with the allocation information to the corresponding users via dedicated physical downlink control channels (PDCCHs) [33]. In the DL, the eNB readily finds out the DL buffer status for each user, allocates the RBs, and sends the SGs with allocation information via PDCCHs [33]. In the framework used in this paper, the above scheduling process is applied for both the cellular and D2D communication with some modifications (the corresponding resource allocation procedure will be described in Section 4).

Let us further define a binary RB allocation variable $b_{n,k}^t$, $n \in \mathcal{N}$, $k \in \mathcal{K}$, equaling 1 if $\mathrm{PU}_n$ is allocated $\mathrm{RB}_k$ at slot *t*, and 0 otherwise. Each RB can be allocated to at most one cellular user. Hence,

$$\sum_{n \in \mathcal{N}} c_n^t \, b_{n,k}^t \le 1, \quad \forall k \in \mathcal{K},$$

where $c_n^t$ is the binary mode allocation variable (equal to 1 for a pair in CM). The number of D2D users operating on the same RBs is unlimited. Additionally, to maximize the network utilization, we enforce each RB to be allocated to at least one user. That is,

$$\sum_{n \in \mathcal{N}} b_{n,k}^t \ge 1, \quad \forall k \in \mathcal{K}.$$

Note that both the OFDMA used for DL transmissions and the single carrier frequency division multiple access (SC-FDMA) applied in the UL direction provide orthogonality of resource allocation to cellular communications. This allows achieving a minimal level of cochannel interference between the transmitter-receiver pairs located within one cell [34]. Thus, when information is transmitted by a cellular/D2D user, it will be distorted only by the users operating on the same RB(s).
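The two RB constraints stated above (at most one cellular user per RB, and at least one user of any kind per RB) are easy to verify for a candidate allocation. The sketch below is illustrative only; the function name and the list-of-lists data layout are assumptions, not part of the original formulation.

```python
def rb_allocation_is_feasible(b, c):
    """Check the inband RB constraints for a single slot.

    b[n][k] is 1 if pair n is allocated RB k, and 0 otherwise.
    c[n] is 1 if pair n operates in cellular mode (CM), and 0 in D2D mode.
    """
    num_pairs = len(b)
    num_rbs = len(b[0])
    for k in range(num_rbs):
        cellular_on_rb = sum(c[n] * b[n][k] for n in range(num_pairs))
        users_on_rb = sum(b[n][k] for n in range(num_pairs))
        if cellular_on_rb > 1:   # an RB may serve at most one cellular user
            return False
        if users_on_rb < 1:      # every RB must be used by at least one pair
            return False
    return True
```

For example, with three pairs and two RBs, `b = [[1, 0], [1, 1], [0, 1]]` and `c = [1, 0, 0]` is feasible: the single cellular pair shares RB 0 with a D2D pair, and RB 1 is reused by two D2D pairs.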

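Let $G_{n,m,k}^t$, $n \in \mathcal{N}$, $m \in \mathcal{N}$, and $k \in \mathcal{K}$, denote the channel gain coefficient between the transmitter of $\mathrm{PU}_n$ and the receiver of $\mathrm{PU}_m$ operating on $\mathrm{RB}_k$ (for a pair $\mathrm{PU}_m$ operating in CM, $G_{n,m,k}^t$ indicates the channel gain coefficient between $\mathrm{PU}_n$ operating on $\mathrm{RB}_k$ and the eNB). In an LTE system, the instantaneous values of $G_{n,m,k}^t$ can be obtained from the channel state information (CSI) through the use of special reference signals (RSs) [35] and, hence, they are known to the eNB and the users. Then, for any $\mathrm{PU}_n$ operating on $\mathrm{RB}_k$, the SINR at slot *t* in the UL direction is described by

$$\mathrm{SINR}_{n,k}^t = \frac{p_n^t \, G_{n,n,k}^t}{\sigma^2 + \sum_{m \in \mathcal{N},\, m \ne n} b_{m,k}^t \, p_m^t \, G_{m,n,k}^t},$$

where $\sigma^2$ is the variance of zero-mean additive white Gaussian noise (AWGN) power, $b_{m,k}^t$ is the binary RB allocation variable, and $p_n^t$ is the transmission power allocated to $\mathrm{PU}_n$ at slot *t*. Clearly, $p_n^t$ is nonnegative and cannot exceed some predefined maximal level $p_{\max}$; that is,

$$0 \le p_n^t \le p_{\max}, \quad \forall n \in \mathcal{N}.$$

At any *t*, the inband service rate of $\mathrm{PU}_n$ depends on the number of RBs allocated to this device pair and the SINR in each RB. That is,

$$R_n^{\mathrm{in},t} = T_s \, \omega \sum_{k \in \mathcal{K}} b_{n,k}^t \log_2\left(1 + \mathrm{SINR}_{n,k}^t\right),$$

where $R_n^{\mathrm{in},t}$ is the service rate of $\mathrm{PU}_n$ (in bits per slot or bps) over the licensed (inband) spectrum, $T_s$ is the slot length, and $\omega$ is the bandwidth of one LTE RB ($\omega$ = 180 kHz).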
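As a numerical illustration of the inband model, the sketch below computes a per-RB SINR (own received power divided by noise plus the power received from every other pair on the same RB) and a Shannon-type rate summed over the allocated RBs. The function names, the nested-list layout of the gains, and the scaling of the rate by the slot length to obtain bits per slot are assumptions made for this example.

```python
import math

RB_BANDWIDTH_HZ = 180e3  # bandwidth of one LTE RB, as stated in the text


def sinr(n, k, b, p, gain, noise_var):
    """SINR of pair n on RB k.

    b[m][k]  : 1 if pair m uses RB k, else 0.
    p[m]     : transmission power of pair m.
    gain[m][n][k]: gain from the transmitter of pair m to the receiver
                   serving pair n on RB k (assumed layout).
    """
    if not b[n][k]:
        return 0.0
    interference = sum(
        b[m][k] * p[m] * gain[m][n][k] for m in range(len(b)) if m != n
    )
    return p[n] * gain[n][n][k] / (noise_var + interference)


def inband_rate(n, b, p, gain, noise_var, slot_s):
    """Inband service rate of pair n in bits per slot (assumed scaling by slot_s)."""
    bits_per_second = sum(
        RB_BANDWIDTH_HZ * math.log2(1.0 + sinr(n, k, b, p, gain, noise_var))
        for k in range(len(b[n]))
        if b[n][k]
    )
    return slot_s * bits_per_second
```

With two pairs sharing a single RB, unit powers, a direct gain of 1, a cross gain of 0.5, and a noise variance of 0.5, the SINR of the first pair is 1, so its rate over a 1 ms slot is 180 bits.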

##### 2.2. Outband Network Operation

We consider *M* separate outband wireless channels numbered, for notation consistency, as $K+1, \ldots, K+M$ (in this paper, we consider a general scenario in which the unlicensed outband access can be based on OFDMA, CSMA/CA (in the case of Wi-Fi Direct), FH-CDMA (in the case of Bluetooth), or any other multiple access method). We denote by $\mathcal{M} = \{K+1, \ldots, K+M\}$ the set of channel indices within the unlicensed band and use a binary channel allocation variable $b_{n,k}^t$, $n \in \mathcal{N}$, $k \in \mathcal{M}$, to indicate if $\mathrm{PU}_n$ is allocated the unlicensed channel $k$ (in which case, $b_{n,k}^t = 1$) or not ($b_{n,k}^t = 0$). Note that $b_{n,k}^t = 0$, for all $n \in \mathcal{C}$ and $k \in \mathcal{M}$ (since cellular users can access only the LTE bands). For a potential D2D pair, $n \in \mathcal{D}$, the channel access variable $a_n^t$ equals 0, if $\sum_{k \in \mathcal{M}} b_{n,k}^t = 1$, and 1, otherwise (i.e., if $\sum_{k \in \mathcal{M}} b_{n,k}^t = 0$). Hence,

$$a_n^t = 1 - \sum_{k \in \mathcal{M}} b_{n,k}^t, \quad \forall n \in \mathcal{D}.$$

To avoid collisions, the D2D pairs use a CSMA/CA method when operating outband. As a result, each unlicensed channel is available to D2D communication only when it is idle. Additionally, to reduce the possibility of collisions between D2D users, we assume that, at any slot *t*, at most one device pair can transmit over each unlicensed channel $k \in \mathcal{M}$. That is,

$$\sum_{n \in \mathcal{D}} b_{n,k}^t \le 1, \quad \forall k \in \mathcal{M}.$$

The transmission procedure for a pair of D2D users operating outband is as follows. At the beginning of slot *t*, one of the users starts sensing the allocated unlicensed channel (for simplicity, we assume perfect sensing). If the channel is free, the transmission phase (of the length $\tau_n^t$, such that $\tau_n^t \le T_s$, where $T_s$ is the slot length) begins. Note that the duration $\tau_n^t$ is random. It depends on the availability of the channel and the applied CSMA/CA scheme. The probability density function (p.d.f.) of $\tau_n^t$ is not calculated here (since it has no impact on the further analysis in this paper). An example of such calculations can be found in [36].
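The relation between the unlicensed channel allocation and the channel access variable can be sketched as follows: a pair is outband exactly when it holds one unlicensed channel, each channel serves at most one pair, and pairs restricted to cellular mode hold no unlicensed channel. The function name, the zero-based channel indexing, and the use of `ValueError` to flag an infeasible allocation are assumptions made for this example.

```python
def outband_access(b_out, d2d_pairs):
    """Derive the channel access variable from an unlicensed allocation.

    b_out[n][j] is 1 if pair n is allocated unlicensed channel j, else 0.
    d2d_pairs is the set of indices of potential D2D pairs.
    Returns a list a with a[n] = 1 (inband) or 0 (outband).
    Raises ValueError if the allocation violates a stated constraint.
    """
    num_pairs = len(b_out)
    num_channels = len(b_out[0])
    # At most one pair may transmit over each unlicensed channel.
    for j in range(num_channels):
        if sum(b_out[n][j] for n in range(num_pairs)) > 1:
            raise ValueError(f"channel {j} is allocated to more than one pair")
    access = []
    for n in range(num_pairs):
        held = sum(b_out[n])
        if n not in d2d_pairs and held > 0:
            raise ValueError(f"cellular-only pair {n} holds an unlicensed channel")
        if held > 1:
            raise ValueError(f"pair {n} holds more than one unlicensed channel")
        access.append(1 - held)  # 1 = inband, 0 = outband
    return access
```

For instance, with four pairs of which pairs 1, 2, and 3 are potential D2D pairs, the allocation `[[0, 0], [1, 0], [0, 1], [0, 0]]` yields the access vector `[1, 0, 0, 1]`.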

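Let $G_{n,k}^t$, for all $n \in \mathcal{D}$, $k \in \mathcal{M}$, denote the channel gain coefficient between the transmitter and receiver of $\mathrm{PU}_n$ operating on unlicensed channel $k$. Then, the SINR of $\mathrm{PU}_n$ transmitting over the channel $k$ at slot *t* can be expressed by

$$\mathrm{SINR}_{n,k}^t = \frac{p_n^t \, G_{n,k}^t}{\sigma^2}, \quad k \in \mathcal{M},$$

and the service rate of $\mathrm{PU}_n$ over the unlicensed (outband) spectrum is described by

$$R_n^{\mathrm{out},t} = \tau_n^t \sum_{k \in \mathcal{M}} b_{n,k}^t \, \omega_k \log_2\left(1 + \mathrm{SINR}_{n,k}^t\right),$$

where $p_n^t$ is the transmission power of $\mathrm{PU}_n$, $\sigma^2$ is the noise variance, $\tau_n^t$ is the length of the transmission phase, and $\omega_k$ is the bandwidth (in Hz) of unlicensed channel $k$. Note that neither the eNB nor the D2D users have prior information about the quality and availability of unlicensed channels. Therefore, the exact values of $G_{n,k}^t$ and $\tau_n^t$ are unknown to the eNB and the D2D users.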
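The outband rate can be illustrated with a short sketch in which the number of bits delivered in a slot is the product of the (random) transmission time, the channel bandwidth, and a Shannon-type spectral efficiency. The function name, the argument names, and the noise-only denominator (no concurrent transmitter on a channel that was sensed idle) are assumptions made for this example.

```python
import math


def outband_rate(tx_time_s, bandwidth_hz, power, gain, noise_var):
    """Bits delivered over one unlicensed channel during a single slot.

    tx_time_s   : length of the transmission phase within the slot
                  (0 if the channel was sensed busy for the whole slot).
    bandwidth_hz: bandwidth of the unlicensed channel.
    power, gain : transmission power and channel gain of the D2D pair.
    noise_var   : noise variance at the receiver.
    """
    if tx_time_s <= 0.0:
        return 0.0
    snr = power * gain / noise_var
    return tx_time_s * bandwidth_hz * math.log2(1.0 + snr)
```

Since the transmission time and the gain are not known to the eNB in advance, a function of this kind could only be evaluated after the slot, which is why the allocation has to rely on learned estimates rather than on the rate itself.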

#### 3. Resource Allocation Problem

##### 3.1. Problem Statement

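We define a binary $N \times K$-dimensional RB allocation matrix and a binary $N \times M$-dimensional unlicensed channel allocation matrix as

$$\mathbf{B}_{\mathcal{K}}^t = \left[ b_{n,k}^t \right]_{n \in \mathcal{N},\, k \in \mathcal{K}}, \qquad \mathbf{B}_{\mathcal{M}}^t = \left[ b_{n,k}^t \right]_{n \in \mathcal{N},\, k \in \mathcal{M}},$$

respectively. We also define a binary *N*-dimensional D2D access allocation vector $\mathbf{a}^t = (a_1^t, \ldots, a_N^t)$, a binary *N*-dimensional mode allocation vector $\mathbf{c}^t = (c_1^t, \ldots, c_N^t)$, and a real-valued *N*-dimensional power allocation vector $\mathbf{p}^t = (p_1^t, \ldots, p_N^t)$. Then, the sets of all admissible values for $\mathbf{B}_{\mathcal{K}}^t$, $\mathbf{B}_{\mathcal{M}}^t$, $\mathbf{a}^t$, $\mathbf{c}^t$, and $\mathbf{p}^t$ are described by

$$
\begin{aligned}
\mathcal{A}_{\mathbf{B}_{\mathcal{K}}} &= \Big\{ \mathbf{B}_{\mathcal{K}}^t \in \{0,1\}^{N \times K} : \sum_{n \in \mathcal{N}} c_n^t b_{n,k}^t \le 1,\ \sum_{n \in \mathcal{N}} b_{n,k}^t \ge 1,\ \forall k \in \mathcal{K} \Big\}, \\
\mathcal{A}_{\mathbf{B}_{\mathcal{M}}} &= \Big\{ \mathbf{B}_{\mathcal{M}}^t \in \{0,1\}^{N \times M} : b_{n,k}^t = 0,\ \forall n \in \mathcal{C};\ \sum_{k \in \mathcal{M}} b_{n,k}^t \le 1,\ \forall n \in \mathcal{D};\ \sum_{n \in \mathcal{D}} b_{n,k}^t \le 1,\ \forall k \in \mathcal{M} \Big\}, \\
\mathcal{A}_{\mathbf{a}} &= \Big\{ \mathbf{a}^t \in \{0,1\}^{N} : a_n^t = 1,\ \forall n \in \mathcal{C};\ a_n^t = 1 - \sum_{k \in \mathcal{M}} b_{n,k}^t,\ \forall n \in \mathcal{D} \Big\}, \\
\mathcal{A}_{\mathbf{c}} &= \Big\{ \mathbf{c}^t \in \{0,1\}^{N} : c_n^t = 1,\ \forall n \in \mathcal{C} \Big\}, \\
\mathcal{A}_{\mathbf{p}} &= \Big\{ \mathbf{p}^t \in \mathbb{R}^{N} : 0 \le p_n^t \le p_{\max},\ \forall n \in \mathcal{N} \Big\},
\end{aligned}
$$

where $\mathcal{C}$ and $\mathcal{D}$ are the index sets of the pairs restricted to CM and of the potential D2D pairs, respectively. An example of a D2D-enabled network with all defined optimization variables is shown in Figure 1.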