Abstract

We propose a dynamic resource allocation algorithm for device-to-device (D2D) communication underlaying a Long Term Evolution Advanced (LTE-A) network, with reinforcement learning (RL) applied for unlicensed channel allocation. In the considered system, the inband and outband resources are assigned by the LTE evolved NodeB (eNB) to different device pairs to maximize the network utility subject to target signal-to-interference-and-noise ratio (SINR) constraints. Because of the absence of an established control link between the unlicensed and cellular radio interfaces, the eNB cannot acquire any information about the quality and availability of unlicensed channels. As a result, the considered problem becomes a stochastic optimization problem that can be dealt with by deploying learning theory (to estimate the random unlicensed channel environment). Consequently, we formulate the outband D2D access as a dynamic single-player game in which the player (eNB) estimates its possible strategy and expected utility for all of its actions based only on its own local observations, using a joint utility and strategy estimation based reinforcement learning (JUSTE-RL) with regret algorithm. The proposed approach to resource allocation demonstrates near-optimal performance after a small number of RL iterations and surpasses comparable methods in terms of energy efficiency and throughput maximization.

1. Introduction

D2D communication is direct communication between users transmitting over the cellular spectrum (inband) or operating on an unlicensed band (i.e., outband). The main advantages of inband D2D communication are the increased spectrum efficiency and the possibility of quality of service (QoS) provisioning for different cellular/D2D users. The chief obstacles to the implementation of inband D2D access are (i) interference mitigation (between the users transmitting over the same frequency bands) and (ii) resource allocation [1]. Effective resource allocation and interference management strategies can significantly improve the performance of cellular networks. The objectives here may differ (e.g., improvement of spectrum efficiency, cellular coverage, network throughput, or user experience), but to achieve optimal system performance, the problems of cellular/D2D mode selection, spectrum assignment, power allocation, and interference mitigation should be considered jointly in the algorithm design. Related contributions in this area are [2–10], studying the problem of interference mitigation for underlaying D2D communication. It should be noted, however, that most of the proposed formulations (except [2, 3]) do not deal with the issues of mode selection, spectrum assignment, and interference management in a joint fashion but rather split the original problem into smaller subproblems (see, e.g., [10]) or separate the time scales of these subproblems (e.g., [9]). Hence, although the complexity of such methods is lower than that of joint resource allocation, their efficiency in maximizing a given optimality criterion is clearly degraded. Outband D2D communication (carried over Wi-Fi Direct [11], ZigBee [12], or Bluetooth [13]) eliminates the need for interference mitigation but can be distorted by the randomness of unlicensed channels.
Existing works on outband D2D access focus on issues such as power consumption (e.g., [14–17]) and coordination between cellular and wireless interfaces ([18–21]). Some of these works ([14, 15, 21]) suggest control of the unlicensed band by the cellular network (which requires a certain amount of cooperation and information exchange between different radio interfaces). Other works (e.g., [17, 18, 20]) imply autonomous operation of D2D devices (based on stochastic modeling of unlicensed channels).

The main contributions of this work are as follows. We consider network-controlled D2D communication in which the licensed and unlicensed spectrum resources, user modes, and transmission power levels are allocated to different device pairs by the LTE eNB to maximize the overall network utility. We consider a general network deployment scenario where the unlicensed band is assumed to be provided by one or more radio access technologies (RATs) based on orthogonal frequency division multiple access (OFDMA), carrier sense multiple access with collision avoidance (CSMA/CA), frequency-hopping code division multiple access (FH-CDMA), or any other multiple access method. It is assumed that all device pairs are equipped with different wireless interfaces allowing them to connect to the appropriate RAT and use CSMA/CA to avoid collisions when operating on the unlicensed band. Hence, each unlicensed channel becomes available to a D2D pair only when it is idle. Unlike many previous works, we jointly solve the problems of inband/outband access, mode selection, and spectrum/power assignment by combining them into one optimization problem, which allows the eNB to allocate the inband network resources and offload the D2D traffic in the most effective way (in terms of maximizing the overall network utility). Note that the formulated problem can be solved to optimality only if global channel and network knowledge (including precise information on the operating conditions of the licensed and unlicensed channels) is available to the eNB. However, because of the absence of an established control link between the unlicensed and cellular radio interfaces, the eNB cannot obtain any information about the quality and availability of the unlicensed channels. As a result, the considered resource allocation problem becomes a stochastic optimization problem that can be dealt with by deploying learning theory [22] (to estimate the random unlicensed channel environment).

Consequently, we formulate the outband D2D access as a dynamic single-player game in which the player (eNB) estimates its possible strategy and expected utility for all of its actions, based only on its own local observations, using JUSTE-RL with regret (originally proposed in [23]). The main idea behind RL is that the actions leading to a higher network utility at the current stage should be granted higher probabilities at the next stage [22]. In the simplest form of RL (described, e.g., in [24]), a learning agent estimates its best strategy based on its observed utility without any prior information about its operating environment. This form of RL requires only algebraic operations, but its convergence to an equilibrium state is not guaranteed [25]. In Q-learning [22], the utility is estimated using an action-value function. This RL method converges to a Nash equilibrium (NE) state. However, it requires maximization of the action-value at every stage, which can be computationally demanding [22]. In the JUSTE-RL algorithm (described in detail in [23]), a learning agent estimates not only its own strategy but also the expected utility for all of its actions. Unlike Q-learning, JUSTE-RL does not need to perform optimization of the action-value (since only algebraic operations are required to update the strategies) and, hence, it has a lower computational complexity. On the other hand, compared to a basic RL algorithm, JUSTE-RL converges to an ε-NE [23, 25].

It is worth mentioning that, in wireless communications, RL has been studied in the context of various spectrum access problems. In [26, 27], learning has been employed to minimize the interference (created by adjacent nodes) in partially overlapping channels. This problem has been formulated as an exact potential graphical game admitting a pure-strategy NE and, therefore, the proposed approach is not applicable to a broader range of problems. A cognitive network with multiple players has been analyzed in [28]. In that work, the learning and channel selection have been separated into two different procedures, which increased the complexity of the proposed resource allocation approach. Besides, the stability of the final solution was not verified. A multiplayer game for inband D2D access, where the players (D2D users) learn their optimal strategies based on the throughput performance in a stochastic environment, has been studied in [29]. It was assumed that each D2D user can transmit over the vacant cellular channels using CSMA/CA, implying that there are no channels with interfering users (i.e., each orthogonal channel can be occupied by at most one cellular/D2D user). Although the authors consider a scenario with two D2D users operating on the same channel, it is not clear how a D2D user can sense whether the user operating on the channel is cellular or D2D. Autonomous D2D access in heterogeneous cellular networks comprising multiple low-power and high-power base stations (BSs) with (possibly) overlapping spectrum bands has been investigated in [30]. This problem has been modeled as a stochastic noncooperative game with multiple players (D2D pairs) admitting a mixed-strategy NE. The goal of each player was to jointly select the wireless channel and power level to maximize its reward, defined as the difference between the achieved throughput and the cost of power consumption, constrained by the minimum tolerable SINR requirements of this D2D pair. To solve this problem, a fully autonomous multiagent Q-learning algorithm (which does not require any information exchange and/or cooperation among different users) was developed and implemented in an LTE-A network.

The rest of the paper is organized as follows. A general network model for inband and outband network operation is described in Section 2. The general problem and the algorithms for unlicensed and licensed resource allocation are formulated in Section 3. The algorithm implementation, including the proposed resource allocation procedure in an LTE-A network and performance evaluation, is presented in Section 4. The paper is finalized in the Conclusion.

2. Network Model

In this paper, the problem of resource allocation for D2D communication is investigated for both the uplink (UL) and downlink (DL) directions. Unless stated otherwise, the discussion throughout the rest of the paper applies to either direction. Consider a basic LTE-A network consisting of one eNB and user pairs, denoted , with being the set of user pairs’ indices. It is assumed that a fixed licensed spectrum band of the eNB spans resource blocks (RBs), numbered , with denoting the set of RBs’ indices comprising the bandwidth. The network runs on a slotted-time basis with the time axis partitioned into equal nonoverlapping time intervals (slots) of length , with t denoting an integer-valued slot index. Each pair of users can communicate either in the traditional cellular mode (CM) via the eNB or in a D2D mode (DM) without traversing the eNB. Let be the set of the indices of device pairs that can operate only in CM and let denote the set of the indices of potential D2D pairs. (The indices in and can be determined based on, e.g., the user application (such as video sharing, gaming, and proximity-aware social networking) in which the pair of devices could potentially be in range for direct communication. Such information can be acquired from a standard session initiation protocol (SIP) procedure (which handles the session setups and user arrivals in LTE networks). Interested readers are referred to [31] for a comprehensive description of the SIP procedure and its use in D2D access.)

In our network, any potential D2D pair can be allocated cellular or D2D mode (based on the results of the resource allocation procedure). Consequently, we define a binary mode allocation variable , , equaling 1, if PUn is allocated CM at slot t, and 0, otherwise. Note that , for all . Further, we consider the following models of D2D access.
(i) Inband D2D: a D2D pair operates within the licensed LTE spectrum in an underlay to cellular communication.
(ii) Outband D2D: a D2D pair transmits over the unlicensed band by exploiting other RATs, such as Wi-Fi Direct [11], ZigBee [12], or Bluetooth [13] (it is assumed that all user devices are equipped with the corresponding wireless interfaces to be able to communicate using a suitable RAT). We assume that there is no coordination and/or information exchange between different wireless interfaces.
To differentiate the pairs according to their D2D access, we define a binary channel access variable , , equaling 1, if PUn operates inband at slot t, and 0, otherwise. Note that all cellular users can access only the LTE bands. Hence, , for all .

2.1. Inband Network Operation

In LTE/LTE-A, RBs are allocated to cellular users by the eNBs using a standard packet scheduling procedure [32]. The use of packet scheduling in a D2D-enabled LTE-A network is described, in detail, in [33]. In short, a packet scheduling process can be explained as follows. In the UL direction, at the beginning of any slot t, each user is required to collect and transmit its buffer status information. After collecting this data, a user sends the scheduling request (SR) with its buffer status information to the eNB via a dedicated physical uplink control channel (PUCCH). After receiving all the SRs, the eNB allocates the RBs to the users (according to a certain scheduling algorithm) and responds to all the SRs by sending the scheduling grants (SGs) together with the allocation information to the corresponding users via dedicated physical downlink control channels (PDCCHs) [33]. In the DL, the eNB readily finds out the DL buffer status for each user, allocates the RBs, and sends the SGs with allocation information via PDCCHs [33]. In the framework used in this paper, the above scheduling process is applied for both the cellular and D2D communication with some modifications (the corresponding resource allocation procedure will be described in Section 4).

Let us further define a binary RB allocation variable , , , equaling 1, if PUn is allocated RBk at slot t, and 0, otherwise. Each RB can be allocated to at most one cellular user. Hence,

The number of D2D users operating on the same RBs is unlimited. Additionally, to maximize the network utilization, we enforce each RB to be allocated to at least one user. That is,

Note that both the OFDMA used for DL transmissions and the single carrier frequency division multiple access (SC-FDMA) applied in the UL direction provide orthogonality of resource allocation to cellular communications. This allows achieving a minimal level of cochannel interference between the transmitter-receiver pairs located within one cell [34]. Thus, when information is transmitted by a cellular/D2D user, it will be distorted only by the users operating on the same RB(s).

Let , , , and , denote the channel gain coefficient between the transmitter and receiver of PUn and PUm operating on RBk (for , indicates the channel gain coefficient between PUn operating on RBk and the eNB). In the LTE system, the instantaneous values of can be obtained from the channel state information (CSI) through the use of special reference signals (RSs) [35] and, hence, they are known to the eNB and the users. Then, for any PUn operating on RBk, the SINR at slot t in the UL direction is described by

where is the variance of the zero-mean additive white Gaussian noise (AWGN) power and is the transmission power allocated to PUn at slot t. Clearly, is nonnegative and cannot exceed some predefined maximal level ; that is

At any t, the inband service rate of PUn depends on the number of RBs allocated to this device pair and the SINR in each RB. That is,

where is the service rate of PUn (in bits per slot or bps) over the licensed (inband) spectrum and is the bandwidth of one LTE RB (ω = 180 kHz).
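The per-RB SINR and the inband service rate defined above can be sketched as follows. This is an illustrative reading only: the function and variable names are not the paper's notation, and the rate is taken as the Shannon capacity summed over the allocated RBs with ω = 180 kHz.

```python
import math

RB_BANDWIDTH_HZ = 180e3  # bandwidth of one LTE RB (omega = 180 kHz)

def ul_sinr(p_tx, gain_to_enb, interferers, noise_var):
    """SINR on one RB: desired received power over AWGN plus co-RB interference.

    interferers: list of (power, gain) tuples for users sharing the same RB.
    All names here are illustrative, not the paper's notation.
    """
    interference = sum(p * g for p, g in interferers)
    return (p_tx * gain_to_enb) / (noise_var + interference)

def inband_rate_bps(sinrs_per_rb):
    """Inband service rate: Shannon rate summed over the RBs of one pair."""
    return sum(RB_BANDWIDTH_HZ * math.log2(1.0 + s) for s in sinrs_per_rb)
```

For example, a single RB with SINR 1 (0 dB) yields 180 kbps, matching one RB at one bit per second per Hz.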

2.2. Outband Network Operation

We consider M separate outband wireless channels numbered, for notation consistency, as (in this paper, we consider a general scenario where the unlicensed outband access can be based on OFDMA, CSMA/CA (in the case of Wi-Fi Direct), FH-CDMA (in the case of Bluetooth), or any other multiple access method). We denote by the set of channel indices within the unlicensed band and use a binary channel allocation variable , , , to indicate whether PUn is allocated the unlicensed channel (in which case ) or not (). Note that , for all and (since cellular users can access only the LTE bands). For , equals 0, if , and 1, otherwise (i.e., if ). Hence,

To avoid collisions, the D2D pairs use CSMA/CA when operating outband. As a result, each unlicensed channel is available to D2D communication only when it is idle. Additionally, to reduce the possibility of collisions between D2D users, we assume that, at any slot t, at most one device pair can transmit over each unlicensed channel . That is,

The transmission procedure for a pair of D2D users operating outband is as follows. At the beginning of slot t, one of the users starts sensing the allocated unlicensed channel (for simplicity, we assume perfect sensing). If the channel is free, the transmission phase (of length , such that ) begins. Note that the duration of is random; it depends on the availability of the channel and the applied CSMA/CA scheme. The probability density function (p.d.f.) of is not calculated here (since it has no impact on the further analysis in this paper). An example of such calculations can be found in [36].
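A minimal sketch of the outband transmission phase, under two simplifying assumptions that are ours rather than the paper's: the sensed channel is idle with a fixed probability, and the CSMA/CA deferral consumes a uniform fraction of the slot. The paper deliberately leaves the actual p.d.f. of the transmission phase length unspecified.

```python
import random

def outband_transmission_time(slot_len, p_idle, backoff_fraction=0.1, rng=random):
    """Toy model of the outband transmission phase length tau <= slot_len.

    p_idle and backoff_fraction are illustrative assumptions; the paper
    leaves the distribution of tau unspecified (see [36] for an example).
    """
    if rng.random() > p_idle:
        return 0.0  # channel busy for the whole slot, no transmission
    backoff = rng.uniform(0.0, backoff_fraction * slot_len)  # CSMA/CA deferral
    return slot_len - backoff
```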

Let , for all , denote the channel gain coefficient between the transmitter and receiver of PUn operating on unlicensed channel . Then, the SINR of PUn transmitting over the channel at slot t can be expressed by

and the service rate of PUn over the unlicensed (outband) spectrum is described by

where is the bandwidth (in Hz) of unlicensed channel . Note that neither the eNB nor the D2D users have prior information about the quality and availability of unlicensed channels. Therefore, the exact values of and are unknown to the eNB and the D2D users.

3. Resource Allocation Problem

3.1. Problem Statement

We define a binary -dimensional RB allocation matrix and a binary -dimensional unlicensed channel allocation matrix as

respectively. We also define a binary N-dimensional D2D access allocation vector , a binary N-dimensional mode allocation vector , and a real-valued N-dimensional power allocation vector . Then, the sets of all admissible values for , , , , and are described by

An example of a D2D-enabled network with all defined optimization variables is shown in Figure 1.

Ideally, at any slot t, the eNB should distribute the network resources among the users to maximize their aggregated service rate, that is, to maximize the sum

where represents the service rate of PUn (operating either inband or outband). However, when communicating over the unlicensed spectrum, each D2D pair should transmit at a maximal power level to achieve the high-SINR regime (and, consequently, service rate), which, in turn, results in increased power consumption of mobile terminals. Therefore, when formulating the utility of each device pair, we should also consider the cost of power consumption, to quantify the trade-off between the achieved rate and the power level (as in [37]). Accordingly, we can define the utility of PUn at slot , as the difference between its instantaneous service rate and the cost of power consumption:

where is the cost per unit (W) level of power for PUn.
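The per-pair utility (rate minus power cost) and the resulting network objective can be sketched as follows; all names are illustrative stand-ins for the paper's symbols.

```python
def pair_utility(rate_bps, power_w, cost_per_watt):
    """Utility of one pair: instantaneous service rate minus power cost."""
    return rate_bps - cost_per_watt * power_w

def network_utility(rates, powers, costs):
    """Aggregate objective: sum of per-pair utilities over all device pairs."""
    return sum(pair_utility(r, p, c) for r, p, c in zip(rates, powers, costs))
```

The cost coefficient lets the eNB penalize pairs that would otherwise always transmit at maximal power on the unlicensed band.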

Using the above definition, we can express our resource allocation problem as follows:

where the constraint (11f) is necessary to protect the users from heavy interference (here stands for the minimal SINR level acceptable by PUn). Note that information on the sets and is readily available at the eNB. The values of for , and , are obtained by the eNB from the CSI carried by the RSs. The only missing information is related to , which depends on the parameters (representing the availability of the unlicensed channel in our model) and (which defines the quality of unlicensed channel ), for all . The latter parameter is determined by the unlicensed channel allocations and, hence, the eNB can adapt to the changes of in time and space. Since there is no coordination (and no information exchange) between the LTE and outband RAT interfaces, solving (11a)–(11f) to optimality might be impossible, which is a rather strong argument in favor of applying the well-known reinforcement learning (RL) to resource allocation.

The main idea behind RL is that the actions (unlicensed channel allocations) leading to a higher network utility at slot should be granted higher probabilities at slot and vice versa [22]. In the simplest form of RL (presented in [24]), the learning agent estimates its possible strategies based on the locally observed utility without any prior information about the operating environment. This form of RL requires only algebraic operations but does not guarantee convergence to an equilibrium [25]. In Q-learning [22], the agent’s utility is estimated using an action-value function. Given certain (easy to follow) conditions, this algorithm converges (with probability 1) to an NE state. However, it requires maximization of the action-value at every slot t (which can be computationally demanding depending on the structure of the chosen action-value function) [20]. In the JUSTE-RL algorithm [23], the learning agent estimates not only its own strategy but also the expected utility for all of its actions. Unlike Q-learning, JUSTE-RL does not need to perform optimization of the action-value (since only algebraic operations are required to update the strategies) and, hence, it has a lower computational complexity. On the other hand, compared to a basic RL algorithm, JUSTE-RL converges to an ε-NE [23, 25]. We now show how JUSTE-RL with regret can be applied to our problem.

3.2. Unlicensed Channel Allocation

To apply JUSTE-RL with regret to our problem, we represent it as a game with one player (the eNB) having no information about the operating environment. A finite set of the eNB’s actions represents the set of all admissible unlicensed channel allocation decisions. The objective of the eNB is to select, at any slot t, an action that maximizes the eNB’s utility . In the following, we use notation , to specify the eNB’s decision regarding the allocation of an unlicensed channel to a pair PUn and to describe all unlicensed channel allocations by the eNB when selecting a particular action at slot t. We also use to denote the D2D access allocation vector and to indicate the outband service rate achieved by playing the action . After taking an action at slot t, the eNB observes the (random) service rate and estimates the network utility by solving the following problem:

where

and

Note that, unlike problem (11a)–(11f), problem (12a)–(12e) can be solved to optimality (since is known). It has three optimization variables , , and and, hence, its complexity is lower than that of (11a)–(11f) (the method for solving (12a)–(12e) is presented in the next subsection).

We also define a mixed-strategy probability of playing an action at slot t as

and a regret for not playing this action at slot t as

In JUSTE-RL, the probability distribution of the regret over all possible actions follows the Boltzmann–Gibbs distribution (a.k.a. canonical ensemble), given by [22]

where k = 1.38064852 × 10−23 J/K is the Boltzmann constant and is the system temperature (in K). High temperatures make all actions almost equiprobable, whereas low temperatures result in greedy action selection [22].

Using the above definitions, the dynamics of JUSTE-RL with regret can be described as [23]

for all , where , , and are the learning rates, such that [23]

Typically, the learning rates are set as [22]

where , , , and . The initializations , , and should be sufficiently close to zero, for all . The dynamics (15) converge to the ε-Nash equilibrium. Note that a Nash equilibrium point for (15) is given by [23]

The corresponding learning algorithm for unlicensed channel allocation is presented in Algorithm 1 (where T indicates the total simulation length in slots). Note that this algorithm converges when , for all . The complexity of JUSTE-RL with regret is mainly determined by the size of the action set , since, at any slot t, we have to select an action that maximizes (the dynamics in (15) require only algebraic operations and, thus, their computational complexity is negligible). Consequently, the worst-case time complexity of Algorithm 1 is , where is the size of our action set.

Initialization:
(1) Input , , , ;
(2) For all , set , , ;
(3) For all , set ;
Main Loop:
(4) While () do
 (5) Select and set ;
 (6) For all , set ;
 (7) For all , set ;
 (8) Execute   and observe , for all ;
 (9) Solve (12a)–(12e) to find an optimal ;
 (10) For all , update , , using (15);
(11) End.
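One iteration of the learning loop above can be sketched in Python. The exact update equations follow [23]; the rules below (exponential smoothing of the utility estimate, regret tracking, and a pull toward the Boltzmann–Gibbs distribution of positive regrets) are an illustrative reading, with assumed learning rates and a temperature that absorbs the Boltzmann constant.

```python
import math

def boltzmann_gibbs(regrets, temperature):
    """Boltzmann-Gibbs distribution over the positive parts of the regrets."""
    weights = [math.exp(max(r, 0.0) / temperature) for r in regrets]
    total = sum(weights)
    return [w / total for w in weights]

def juste_rl_step(action, observed_utility, u_est, regrets, strategy,
                  lr_u=0.1, lr_r=0.05, lr_s=0.01, temperature=1.0):
    """One illustrative JUSTE-RL update; learning rates are assumptions.

    u_est[a]   : estimated expected utility of action a
    regrets[a] : regret for not having played action a
    strategy[a]: mixed-strategy probability of playing a
    """
    # move the utility estimate of the played action toward the observation
    u_est[action] += lr_u * (observed_utility - u_est[action])
    # regret of each action: its estimated utility vs. what was just obtained
    for a in range(len(regrets)):
        regrets[a] += lr_r * (u_est[a] - observed_utility - regrets[a])
    # pull the strategy toward the Boltzmann-Gibbs distribution of regrets
    bg = boltzmann_gibbs(regrets, temperature)
    for a in range(len(strategy)):
        strategy[a] += lr_s * (bg[a] - strategy[a])
    return u_est, regrets, strategy
```

Because the strategy is a convex combination of two probability distributions, it remains a valid distribution after every update; only algebraic operations are needed, matching the complexity claim for Algorithm 1.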
3.3. Inband Resource Allocation

Consider (12a)–(12e), which represents a joint mode, RB, and power level allocation problem. This problem has two binary optimization variables and , one real-valued variable , a nonlinear objective (12a), and nonlinear constraints (12c) and (12e). Hence, it belongs to the family of mixed-integer nonlinear programming (MINLP) problems. It has been well established (see, e.g., [38]) that all MINLP problems involving binary variables (such as (12a)–(12e)) are Nondeterministic Polynomial-time- (NP-) hard. For an immediate NP-hardness proof for the considered problem, note that, given that can be either 0 or 1, any feasible solution to (12a)–(12e) is a subset of vertices. The constraint (12d) also implies that at least one end point of every edge is included in this subset. Hence, the solution to this problem describes a vertex cover, for which finding a minimum is NP-hard.

Most MINLP solution techniques involve the construction of the following relaxations to the considered problem: a nonlinear programming (NLP) relaxation (the original problem without integer restrictions) and a mixed-integer linear programming (MILP) relaxation (the original problem where the nonlinearities are replaced by supporting hyperplanes). To form the MILP and NLP relaxations to (12a)–(12e), let us first represent it in the following equivalent form:

where objective (18a) and constraints (18b) and (18c) are linear, while constraints (18d)–(18f) are nonlinear. The MILP relaxation to (18a)–(18f) in a given point (, , ) is given by

The NLP relaxation to (18a)–(18f) is given by

where

In general, all MINLP problems can be solved using either exact techniques (e.g., branch-and-bound [39]) or heuristic methods (such as local branching [40], large neighborhood search [41], and feasibility pump [42]). Since we are interested in a reasonably simple and fast algorithm, it is more convenient to use heuristics to solve (18a)–(18f). Among numerous heuristic techniques, the feasibility pump (FP) [43] is perhaps the simplest and most effective method for producing more and better solutions in a shorter average running time (the local convergence properties of FP for nonconvex problems have been proved in [44]). The fundamental idea of an FP heuristic is to decompose the MINLP problem into two parts: integer feasibility and constraint feasibility. The former is achieved by rounding (solving the MILP relaxation to the original problem), the latter by projection (solving the NLP relaxation). The algorithm generates two sequences of integral and rounding points. The first sequence of integral points, , contains the solutions that may violate the nonlinear constraints; the second sequence, , comprises the rounding points that are feasible for the MILP relaxation but might not be integral.

Specifically, with the input being a solution to the NLP relaxation (20a)–(20f), FP generates two sequences by solving the following problems, for ,

where and are the l1-norm and l2-norm, respectively. The rounding is carried out by solving the problem (21a)–(21f) and the projection is the solution to (22a)–(22f). Consequently, the FP algorithm alternates between the rounding and projection steps until (which implies feasibility) or until the number of iterations i has reached its predefined limit I. The workflow of the algorithm is presented in Algorithm 2. Note that, to retain local convergence, the problems (21a)–(21f) and (22a)–(22f) have to be solved exactly in the FP algorithm. The problem (22a)–(22f) (and, hence, (20a)–(20f)) can be solved using any standard NLP method. In this paper, an interior point algorithm (described, e.g., in [45]), which has polynomial time complexity, is applied to solve (20a)–(20f) and (22a)–(22f). The MILP problem (21a)–(21f) is relatively simple and, therefore, can be solved to optimality by any technique from the family of branch-and-bound methods (e.g., [46]).

Initialization:
(1) Input , ;
(2) While () do
 (3) Input , , ;
 (4) Solve (20a)–(20f) to find the optimal ;
Main Loop:
 (5) While () do
  Rounding:
  (6) Solve (21a)–(21f) to find the optimal ;
  (7) If () then break;
  Projection:
  (8) Solve (22a)–(22f) to find the optimal ;
  (9) Set ;
(10) End.
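The rounding/projection alternation of Algorithm 2 can be sketched generically. This is a toy skeleton, not the exact FP of [43]: `project` stands in for solving the NLP relaxation (the projection step (22a)–(22f)), and plain coordinate rounding stands in for the MILP step (21a)–(21f).

```python
def feasibility_pump(x0, project, max_iter=1000, tol=1e-9):
    """Skeleton of the FP alternation: round, then project back onto the
    relaxed feasible set, until the rounded point is itself feasible.

    project: maps a point to a nearby point satisfying the continuous
    (relaxed) constraints -- a stand-in for solving the NLP relaxation.
    """
    x = project(x0)
    for _ in range(max_iter):
        x_int = [round(v) for v in x]   # rounding step (MILP surrogate)
        x = project(x_int)              # projection step (NLP surrogate)
        if max(abs(a - b) for a, b in zip(x, x_int)) < tol:
            return x_int                # integral and feasible
    return None                         # iteration limit I reached
```

For instance, with a feasible box [0.6, 2.4] per coordinate (projection = clipping), the start point [0.1, 2.9] is projected to [0.6, 2.4], rounded to [1, 2], and [1, 2] already lies in the box, so FP terminates with an integral feasible point.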

Note that, in general, finding an optimal solution to any joint resource allocation problem with integrality constraints is NP-hard (as shown in [47]). Consequently, most of the recent approaches to such problems focus on finding high-quality suboptimal solutions using, for example, relaxation (by removing all the integer restrictions, as has been done in [47, 48]) or iterative two-stage algorithms that determine the optimal integral solutions given fixed power levels and then find the optimal power allocation with fixed integral points (e.g., [49]). In this paper, instead of relaxation or iteration, we directly apply a heuristic FP algorithm that has a polynomial time complexity in the size n of the problem (with c being some real constant) [43]. (Note that, in our case, the size n of the problem (18a)–(18f) is proportional to . The numerical results showing the complexity of the proposed algorithm will be presented in Section 4.) Hence, the presented heuristic approach has moderate complexity compared to the previously proposed algorithms for resource allocation with integrality constraints, whose complexity ranges from linear [21, 30, 47, 48, 50] to polynomial [2, 3, 49, 51–53].

4. Algorithm Implementation

4.1. Resource Allocation Procedure

We now discuss the implementation of the proposed algorithms (presented in Section 3) in an LTE-A network. The following scheduling procedure is repeated at the beginning of each slot t.
(i) All users send their SRs to the eNB via dedicated PUCCHs. Note that the SRs may contain some useful control information, such as an updated target SINR level or the observed throughput on the unlicensed channel .
(ii) After receiving the SRs from all of the users, the eNB performs resource allocation (by assigning the modes, RBs and unlicensed channels, and power levels to user pairs according to Algorithms 1 and 2) and sends the SGs with optimal allocations to the corresponding users via PDCCHs.
(iii) After receiving the SGs, the users start their data transmissions over the allocated RBs/unlicensed channels with the assigned mode and power levels.

As already mentioned, we deploy CSMA/CA for outband D2D access using the procedure described in IEEE 802.11 [54]. As dictated by [54], if a certain D2D pair PUn, , is allocated one or more unlicensed channels, then, prior to transmission, one of the users must first sense the channel (to determine whether it is idle) for the duration of a distributed coordination function interframe space (DIFS). DIFS (which is 34 μs long) consists of a short interframe space (SIFS) equaling 16 μs and 2 Wi-Fi slots (each equal to 9 μs). After DIFS, a user must typically defer its transmission for a random number of slots, generated from 0 to CW − 1 (the contention window size), to allow the other devices to share a channel in a fair manner. Given that the minimum CW value is , the device will, on average, wait for about 7.5 Wi-Fi slots before transmission. Thus, the average channel access delay is 16 μs + 9.5 × 9 μs = 101.5 μs (independent of the service rate). Since the slot duration in the LTE system ( ms) is much longer than the average channel access delay (101.5 μs), it is expected that (on average) a D2D pair will be able to exchange the data within the scheduled period. In this case, each of the users in a D2D pair should observe the achieved throughput and report this value to the eNB when sending its SR. Otherwise (if a D2D pair is not able to exchange the data within one slot), the D2D users send the value to the eNB. Note that CSMA/CA does not allow two-way data transmission. Hence, the second device in a D2D pair can start its data transmission only after the first user has finished transmitting.
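The access-delay arithmetic above can be checked directly: DIFS contributes 2 Wi-Fi slots after the SIFS, and a uniform backoff over 0..CW−1 slots averages (CW−1)/2 = 7.5 slots for the minimum CW of 16.

```python
SIFS_US = 16        # short interframe space (microseconds)
WIFI_SLOT_US = 9    # one Wi-Fi slot (microseconds)
DIFS_US = SIFS_US + 2 * WIFI_SLOT_US  # 34 us

def mean_access_delay_us(cw_min=16):
    """Average CSMA/CA channel access delay: DIFS sensing plus the mean
    uniform backoff of (cw_min - 1) / 2 = 7.5 Wi-Fi slots for cw_min = 16,
    i.e. 16 us + (2 + 7.5) * 9 us = 101.5 us."""
    mean_backoff_slots = (cw_min - 1) / 2
    return SIFS_US + (2 + mean_backoff_slots) * WIFI_SLOT_US
```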

It is worth mentioning that, at some point in time, JUSTE-RL will reach its equilibrium state. However, even after the equilibrium has been reached, the eNB continues the learning process, because the network environment (channel quality, network traffic, and the number of active users) is likely to change over time, resulting in different optimal mode, RB/unlicensed channel, and power allocations.

4.2. Simulation Model

A simulation model of the network has been implemented on top of a standard LTE-A platform using the OPNET simulation and development package [55]. The model consists of one eNB and N user pairs randomly positioned inside a three-sector hexagonal cell (with the antenna pattern specified in [56]). It is assumed that the users operate outdoors in a typical urban environment and are stationary throughout all simulation runs. Each user device has its own traffic generator, enabling a variety of traffic patterns. For simplicity, in the examples below, the user traffic is modeled as a full buffer with a load of 10 packets per second and a packet size of 1500 bytes. In all simulations, the simulation length is T = 10^6 slots with I = 1000 iterations, and the target SINR levels for each device pair are set to 0 dB. The licensed band of the eNB comprises K = 100 RBs (equivalent to 20 MHz). The unlicensed band comprises M = 4 nonoverlapping OFDM channels of 10 MHz each. The main simulation parameters of our model are listed in Table 1. Other parameters are set in accordance with 3GPP specifications [56].

In this paper, the evaluation of the proposed approach for inband and outband resource allocation, referred to as JRA (JUSTE-RL based resource allocation), is divided into two parts. In the first part, we analyze the performance of JUSTE-RL with regret for unlicensed channel allocation (Algorithm 1). In the second part, we examine the efficiency of the proposed joint inband/outband resource allocation (Algorithms 1 and 2). In the following, the performance of JRA is compared with that of the following resource allocation techniques.
(i) First is joint inband/outband resource allocation with ε-greedy Q-learning (GQL) [57] based on formulations (12a)–(12e) and (18a)–(18f), where the unlicensed channels are allocated to the users by the LTE eNB. In GQL, at any slot t, an action with the largest Q-value is selected with probability 1 − ε and the other actions are selected uniformly at random with probability ε. In all simulation experiments, the value of ε is set in accordance with the most common suggestions (provided, e.g., in [22]), as ε = 0.1.
(ii) Second is the centralized optimal strategy (COS), where the inband and outband network resources are allocated to the users by solving (11a)–(11f) directly based on global channel and network knowledge. Note that COS corresponds to the most efficient (in terms of network utility maximization) strategy, although it is not practically realizable (since, in real network deployment scenarios, precise information about the quality and availability of unlicensed channels is not available). (In this paper, we use an FP algorithm to find the optimal solution to (11a)–(11f) or (18a)–(18f) in GQL and COS.)
(iii) Third is the social heuristic for multimode D2D communication (SMD) in an LTE-A network proposed in [21] to reduce the complexity of the original optimization problem for joint inband/outband resource allocation. This algorithm assigns user modes and resources to maximize the social welfare based on global channel and network knowledge. The eNB creates a randomly ordered list of the D2D pairs. Then, it computes the aggregated network utility for each mode of the first user in the list and assigns this user the mode that provides the highest aggregated utility. This process is repeated for all D2D pairs.
(iv) Fourth is the greedy heuristic for multimode D2D communication (GMD) in LTE-A networks [21], where the modes and inband/outband network resources are allocated to maximize the individual users' welfare based on global channel and network knowledge. Similar to SMD, the eNB creates a randomly ordered list of the D2D pairs. After this, it computes the utility for each mode of the first user in the list and assigns this user the mode assuring the highest individual utility. This process is repeated for all D2D pairs.
(v) Fifth is the ranked heuristic for multimode D2D communication (RMD) in LTE-A networks [21]. Here the eNB evaluates the utility of each user in each mode (based on global channel and network knowledge) and sorts the D2D pairs according to their utilities in descending order. Next, the eNB allocates the first user in the list the mode that guarantees the highest aggregated network utility. This process is repeated for all D2D pairs.
Note that all algorithms used in our performance evaluation are simulated with identical system parameters.
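The ε-greedy selection rule used by the GQL baseline can be sketched as follows. This is a generic illustration of the rule as described in (i) above (greedy action with probability 1 − ε, one of the other actions uniformly at random with probability ε), not the paper's exact implementation:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Select an action index from a list of Q-values: the action with
    the largest Q-value with probability 1 - epsilon, and one of the
    remaining actions uniformly at random with probability epsilon."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    if len(q_values) > 1 and rng.random() < epsilon:
        others = [a for a in range(len(q_values)) if a != best]
        return rng.choice(others)
    return best
```

With ε = 0.1 (the value used in the experiments), roughly one slot in ten is spent exploring non-greedy unlicensed channel allocations.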

4.3. Performance of a Learning Algorithm

We start with the performance evaluation of JUSTE-RL with regret for unlicensed channel allocation (outlined in Algorithm 1). Figures 2 and 3 demonstrate the learning speed of JRA. Figure 2 shows the average number of RL iterations (slots) necessary for the convergence of strategies in JRA (the point at which the strategy estimates stop changing for all actions) with different values of the learning-rate exponents and a varying number of user pairs N. The average number of RL iterations (slots) necessary for the convergence of utilities in JRA (the point at which the utility estimates stop changing for all actions) is plotted in Figure 3. The accuracy of estimation in JRA is presented in Figures 4 and 5. Figure 4 shows the absolute error of strategy estimation in JRA, denoted Δ_π, defined as the sum of the absolute differences between the actual optimal strategies and the estimated strategies upon algorithm termination; that is,

Δ_π = Σ_{a ∈ A} |π̂_a(t_end) − π*_a|,

where π̂_a(t_end) is the optimal strategy estimated in JRA upon algorithm termination (at slot t_end) and π*_a is the actual optimal strategy obtained by playing action a. Figure 5 demonstrates the absolute error of utility estimation in JRA, denoted Δ_u, defined as the sum of the absolute differences between the actual and the estimated optimal network utilities upon algorithm termination; that is,

Δ_u = Σ_{a ∈ A} |û_a(t_end) − u*_a|,

where û_a(t_end) is the optimal network utility estimated in JRA upon algorithm termination (at slot t_end) and u*_a is the actual optimal network utility obtained by playing action a. The observations in Figures 2–5 show that the rates of convergence of strategies and utilities and the accuracy of strategy and utility estimation are almost the same. Furthermore, we find that the number of iterations necessary for algorithm convergence and the absolute estimation error strongly depend on the settings of the learning-rate exponents, with the worst and the best performance attained under different exponent settings. Such results are rather predictable since these exponents determine the learning-rate sequences (see (16b)), which have a direct influence on the learning rate of JRA [22, 23].
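The role of the exponents can be illustrated with a polynomially decaying learning-rate sequence of the form 1/t^κ, a common choice in stochastic-approximation schemes such as JUSTE-RL. This is a hypothetical sketch; the exact sequences in (16b) may differ:

```python
def decaying_rate(t: int, exponent: float) -> float:
    """Polynomially decaying learning rate 1 / t**exponent.
    Larger exponents decay faster: estimates stabilize sooner but
    adapt more slowly to changes in the channel environment."""
    if t < 1:
        raise ValueError("slot index t must be >= 1")
    return 1.0 / (t ** exponent)

# Example: the utility, strategy, and regret estimates could each use
# their own exponent (e.g. 0.5, 0.61, 0.67 as in the JRA experiments).
rates_slow = [decaying_rate(t, 0.5) for t in (1, 100, 10000)]
rates_fast = [decaying_rate(t, 0.67) for t in (1, 100, 10000)]
```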

In Figures 6 and 7, the instantaneous network utility is presented as a function of time in scenarios with low and high network load and a fixed number of user pairs. Here the proposed JRA technique is simulated with the learning-rate exponents set to 0.5, 0.61, and 0.67. The graphs in these figures show that the efficiency of JRA and GQL improves gradually over time. After about 300 slots (which is the average time necessary for the convergence of strategies and utilities in Algorithm 1), JRA demonstrates near-optimal results. GQL needs a slightly longer time (≈400 slots) to converge, after which its performance also becomes very close to that of COS. Unlike JRA and GQL, the performance of COS, SMD, GMD, and RMD is constant over time (since these algorithms do not involve any learning process). We also observe that the network utility attained by SMD, GMD, and RMD is much smaller than that of COS. To understand this poor performance of SMD, GMD, and RMD, note that, in these algorithms, the original resource allocation problem is divided into two separate problems: (i) mode selection and (ii) packet scheduling. The mode selection problem is then solved using very simple heuristics (social, greedy, or ranked), which reduces the complexity of the original optimization problem (from exponential to linear) but has a negative impact on the performance of these techniques in terms of network utility maximization [21].

4.4. Performance of Joint Inband/Outband Allocation

We now evaluate the efficiency of the proposed inband/outband resource allocation (Algorithms 1 and 2). The graphs in Figures 8–10 demonstrate the computational complexity, solution time, and solution accuracy of different resource allocation techniques in the experiments with the learning-rate exponents set to 0.55, 0.61, and 0.67 (in JRA), collected during the entire simulation period T. In particular, in Figures 8 and 9, the average number of algorithm iterations (per slot) and the solution time (in μs) are presented as functions of the number of user pairs N. Figure 10 shows the average relative deviation from the optimal solution, denoted as Δ and calculated as

Δ = |U* − Û| / U*

in GQL, JRA, and COS and as

Δ = |W* − Ŵ| / W*

in SMD, GMD, and RMD. In the above equations, Û is the utility of the solution found by GQL, JRA, or COS, U* is the utility of the actual optimal solution to the original resource allocation problem (11a)–(11f), Ŵ is the utility of the mode allocation in SMD, GMD, or RMD, and W* is the utility of the actual optimal allocation (a solution to the optimization problem originally stated in [21]). It follows from these figures that all simulated strategies have moderate computational complexity. Predictably, COS has the highest complexity because the number of optimization variables in this algorithm is larger than in JRA, GQL, SMD, GMD, and RMD. The lowest complexity and solution accuracy are achieved by SMD, GMD, and RMD (which have linear time complexity but are based on very rough approximations and simple heuristic assumptions).
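The deviation metric can be computed with a one-line helper. Note that the exact form of Δ in the paper's (garbled) equations is reconstructed here as a normalized absolute difference, which is one plausible reading of "average relative deviation from the optimal solution":

```python
def relative_deviation(achieved: float, optimal: float) -> float:
    """Relative deviation of an achieved utility from the optimal one:
    Delta = |optimal - achieved| / optimal. Zero means the algorithm
    found the optimum; 0.05 means it is within 5% of the optimum."""
    if optimal <= 0:
        raise ValueError("optimal utility must be positive")
    return abs(optimal - achieved) / optimal

# e.g. an achieved utility of 95 against an optimum of 100 deviates by 5%.
dev = relative_deviation(95.0, 100.0)
```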

Figures 11–13 present the observations collected at slot t = 500 with the learning-rate exponents set to 0.55, 0.61, and 0.67 (in JRA). The average user throughput (in kbit/s) and the average transmission power per user (in dBm) in the different algorithms, estimated by averaging the corresponding per-user values over all users and slots, are shown in Figures 11 and 12, respectively. The instantaneous network utility of the different algorithms as a function of the target SINR level, with a fixed number of user pairs, is plotted in Figure 13. The obtained results demonstrate that the average user throughput decreases with the number of user pairs N (Figure 11). This is rather predictable because, when the network load increases, the number of RBs or unlicensed channels available to each user decreases, resulting in reduced throughput. Besides, to achieve the desired SINR levels, the users tend to transmit at higher power levels (see Figure 12) when the total number of user pairs in the network increases. The graphs in Figure 13 show that the network utilities of the different resource allocation schemes are concave functions of the target SINR. To understand these results, note that with too low a target SINR setting, the total user throughput is reduced because of the bad channel conditions, leading to a decreased network utility. On the other hand, when the target SINR is too high (> 4 dB), the throughput (and, consequently, the network utility) degrades due to the shortage of available bandwidth, since the number of channels with suitable data transmission conditions becomes very small (because not all of them satisfy the SINR requirements of the users). We also observe that, in all simulated scenarios, the performance of JRA is very close to optimal (i.e., to that achieved by COS). GQL performs slightly worse than JRA but still better than the heuristic algorithms (SMD, GMD, and RMD).
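The per-slot, per-user averaging used for Figures 11 and 12 can be sketched as below. This is an illustrative helper only; note that power reported in dBm should be averaged in linear units (mW) and converted back, so the helper works on linear-scale samples:

```python
import math

def time_user_average(samples):
    """Average a per-slot, per-user metric over all slots and users.
    `samples` is a list of slots, each a list of per-user values
    (e.g. throughput in kbit/s, or transmit power in mW)."""
    total = sum(sum(slot) for slot in samples)
    count = sum(len(slot) for slot in samples)
    return total / count

def mw_to_dbm(power_mw: float) -> float:
    """Convert an average power in mW to dBm for reporting."""
    return 10.0 * math.log10(power_mw)

# Two slots, two users each: average throughput = 250.0 kbit/s.
avg_tput = time_user_average([[100.0, 200.0], [300.0, 400.0]])
```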

5. Conclusion

This paper introduces a JRA algorithm for a D2D-enabled LTE-A network with access to the unlicensed band provided by one or more RATs based on different channel access methods (OFDMA, CSMA/CA, FH-CDMA, etc.). In the presented framework, the inband/outband network resources (cellular/D2D modes, spectrum, and power) are allocated jointly by the LTE eNB to maximize the total network utility. Unlike most previously proposed techniques for outband D2D communication (which presume a certain level of coordination and information exchange between the licensed and unlicensed systems), our JUSTE-RL based approach to unlicensed channel assignment is fully autonomous and has demonstrated relatively fast (≈300 RL iterations) convergence to an ε-Nash equilibrium (given appropriate settings of the learning rates). Simulation results also show that the proposed joint inband/outband resource allocation strategy outperforms other relevant spectrum and power management schemes in terms of energy efficiency and throughput maximization.

Competing Interests

The authors declare that they have no competing interests.