#### Abstract

Energy efficiency has recently become a major concern in wireless communications, and coordinated multipoint (CoMP) communication is a promising method for improving it. However, because downlink performance also matters to users, energy efficiency should be improved while downlink performance is maintained. This paper presents a control-theoretic approach to studying energy efficiency and downlink performance in cooperative wireless cellular networks with CoMP communications. Specifically, to decide the optimal base station grouping for energy-efficient CoMP transmission, we develop a Reinforcement Learning (RL) algorithm. We apply Q-learning to obtain the optimal policy for base station grouping, and we introduce variations at the beginning of the Q-learning process to keep it from being trapped at a local maximum. Simulation results show the process and effectiveness of the proposed scheme.

#### 1. Introduction

The continuously growing demand for ubiquitous wireless access has led to the rapid development of wireless cellular networks during the last decade [1–3]. This tremendous growth has made the wireless industry one of the leading sources of world energy consumption, and its consumption is expected to grow dramatically in the future. Rapidly rising energy costs and increasingly rigid environmental standards have led to an emerging trend of addressing the “energy efficiency” aspect of wireless communication technologies.

With worsening air pollution and an intensifying greenhouse effect, optimizing energy utilization and sustainable development have become hot topics in academia. Base stations consume a large share of the energy in wireless cellular networks, and much of it is wasted when the number of users drops to a low level. This runs counter to energy optimization and environmental protection. Therefore, it is necessary to reduce the energy consumption of base stations in order to optimize the energy efficiency of wireless cellular networks.

In a typical wireless cellular network, base stations account for about 70% of the total energy consumption [4]. In addition, a base station consumes more than 90% of its peak energy even when there is little or no traffic [5]. To improve energy efficiency, a base station sleeping strategy has been proposed, in which a base station is turned off when its traffic is low and the surrounding base stations cooperate to serve its users. Our approach combines coordinated multipoint communication with the base station sleeping strategy in order to find an optimal CoMP grouping to serve the sleeping cell. Coordinated multipoint communication is a method for dynamic base station cooperation in which signals transmitted or received by spatially separated antenna sites are jointly processed [6].

Reinforcement learning (RL) is one of the important methods in machine learning. In this paper, an RL algorithm is used to optimize base station cooperation in CoMP communication. RL is learning through direct experimentation. It does not assume the existence of a teacher that provides examples upon which learning of a task takes place; instead, in RL experience is the only teacher. With historical roots in the study of conditioned reflexes, RL has found use in fields as diverse as operations research and robotics because of its theoretical relevance and potential applications. In this paper, the RL algorithm is applied to the optimization problem of CoMP communication.

The rest of this paper is organized as follows. Section 2 presents the system models. Section 3 presents the RL algorithm in CoMP communication. Simulation results are presented and discussed in Section 4. Finally, we conclude this study in Section 5.

#### 2. System Models

##### 2.1. Coordinated Multipoint Communications

Coordinated multipoint communication is a key feature of LTE-Advanced. To make CoMP easier to apply in practice, we build small-scale model units to study the CoMP scheme, which offers higher energy efficiency for base stations and better communication performance for users. In this way, we can “green” the field of communications.

In the wireless cellular network, one small-scale model unit consists of 19 cells, in which the most lightly loaded cell is chosen as the center cell. The base stations in the second tier are too far away to serve the users in the center cell, so they are treated as noise sources. The center cell and the first-tier hexagonal cells are numbered from 0 to 6, with a base station located at the center of each cell and a distance of 500 m between neighboring base stations, as shown in Figure 1. The center base station is switched off, and its users are served by dynamic CoMP cooperation among base stations 1~6. The number and locations of the users in cell 0 vary, so the CoMP scheme must be adjusted accordingly. CoMP was initially defined in 3GPP 36.814 [7] as dynamic coordination among multiple geographically separated points, referred to as the CoMP cooperating set, for downlink transmission and uplink reception. The task is therefore to find the number and locations of the base stations participating in CoMP that provide the best communications as the environment changes. The CoMP scheme is considered in the situation where the downlink payload and data are available at each point in the CoMP cooperating set and the downlink payload is transmitted on the Physical Downlink Shared Channel (PDSCH) from multiple points in the CoMP set [8]. In this paper, we focus mainly on downlink transmission with perfect channel state information.

##### 2.2. Large Scale Propagation and Pathloss Models

Pathloss in wireless communication is defined as the difference in dB between the transmitted and the received signal powers due to attenuation during propagation [9]. The traditional log-normal shadowing model for large-scale pathloss is formulated as

$$P_r = P_t - PL(d),$$

where $P_r$ represents the received signal power at the user equipment (UE), $P_t$ represents the transmitted signal power at the serving base station, and the pathloss $PL(d)$ is formulated as

$$PL(d) = PL(d_0) + 10\,n\,\log_{10}\!\left(\frac{d}{d_0}\right) + X_\sigma,$$

where $PL(d_0)$ denotes the pathloss at the reference distance $d_0$, $d$ represents the propagation distance, $n$ is the path loss exponent, and $X_\sigma$ is a zero-mean Gaussian random variable with standard deviation $\sigma$ modeling the shadowing effect of the medium [8].
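As a minimal sketch of the log-distance pathloss model with log-normal shadowing described above, the following Python functions compute the pathloss and the received power in dB units. The function names, the default exponent, and the default reference distance are illustrative assumptions; they are not the UMa parameters of Table 1.

```python
import math
import random


def pathloss_db(d_m, pl_d0_db, d0_m=1.0, n=3.5, sigma_db=0.0, rng=None):
    """Log-distance pathloss in dB with optional log-normal shadowing.

    d_m       propagation distance in meters
    pl_d0_db  pathloss at the reference distance d0_m, in dB
    n         path loss exponent (assumed default, not from Table 1)
    sigma_db  shadowing standard deviation in dB (0 disables shadowing)
    """
    shadowing = 0.0
    if sigma_db > 0:
        # Zero-mean Gaussian term modeling the shadowing effect.
        shadowing = (rng or random).gauss(0.0, sigma_db)
    return pl_d0_db + 10.0 * n * math.log10(d_m / d0_m) + shadowing


def received_power_dbm(pt_dbm, pl_db):
    """Received power: transmitted power minus pathloss (dB domain)."""
    return pt_dbm - pl_db


# Deterministic example: exponent 3, 40 dB loss at 1 m, user at 100 m.
example_pl = pathloss_db(100.0, 40.0, d0_m=1.0, n=3.0)
example_pr = received_power_dbm(46.0, example_pl)
```

With shadowing disabled, the example gives a pathloss of 40 + 30·log10(100) = 100 dB, so a 46 dBm transmitter is received at −54 dBm.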

The Urban Macro (UMa) pathloss model is used according to the equations below and the parameters specified in Table 1, following ITU, for line-of-sight (LoS) and non-line-of-sight (NLoS) scenarios. The LoS probability is modeled as a Bernoulli random variable [10]:

##### 2.3. Downlink Performance

When deciding on a CoMP scheme, energy efficiency is not the only index that we take into consideration. We should also pay attention to the downlink performance to guarantee the quality of communications.

The initial motivation for CoMP was to increase the cell-edge user throughput and spectral efficiency by making use of intercell orthogonal resource assignments [11]. Without CoMP, the received SINR of user $u$ is calculated as

$$\mathrm{SINR}_u = \frac{P_{r,s}}{\sum_{i \neq s} P_{r,i} + P_N},$$

where $P_{r,s}$ is the received signal power from the serving base station $s$ using the UMa large-scale pathloss model, while the remaining base stations $i \neq s$ act as interferers, and $P_N$ is the receiver noise power, formulated in dB as

$$P_N = N_0 + 10 \log_{10}(B_u),$$

where $N_0$ (in dB/Hz) is the noise spectral density and $B_u$ is the frequency bandwidth assigned to user $u$.

In the CoMP cellular system, however, the members of the CoMP set perform joint scheduling on the PDSCH to transfer user-plane data using TM-9. The received downlink SINR in joint transmission systems is therefore formulated according to [8] as

$$\mathrm{SINR}_u = \frac{\sum_{i \in \mathcal{C}_u} P_{r,i}}{\sum_{j \notin \mathcal{C}_u} P_{r,j} + P_N},$$

where $\mathcal{C}_u$ represents the CoMP set of user $u$, while $P_{r,j}$ with $j \notin \mathcal{C}_u$ represents the power received from outside the CoMP set. In this way, both the SINR and the communication performance of cell-edge users are improved.

The downlink capacity received by each user is formulated as

$$C_u = B_u \log_2\left(1 + \mathrm{SINR}_u\right).$$

Depending on the user's location and mobility, each user has a distinct CoMP transmission set $\mathcal{C}_u$. Since downlink capacity is also an important index in CoMP communication, it is necessary to coordinate all the $\mathcal{C}_u$ to find the best CoMP scheme.
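The effect of joint transmission on SINR and capacity can be illustrated with a short Python sketch. It assumes linear received powers (in mW) per base station, treats every base station outside the CoMP set as an interferer, and uses the Shannon form for capacity; the function names and the toy power values are hypothetical and only serve the illustration.

```python
import math


def jt_sinr(rx_mw, comp_set, noise_mw):
    """Linear SINR under joint transmission.

    rx_mw     dict mapping base station id -> received power in mW
    comp_set  ids of the base stations in the user's CoMP set
    noise_mw  receiver noise power in mW
    """
    signal = sum(p for bs, p in rx_mw.items() if bs in comp_set)
    interference = sum(p for bs, p in rx_mw.items() if bs not in comp_set)
    return signal / (interference + noise_mw)


def capacity_bps(bandwidth_hz, sinr_linear):
    """Shannon capacity in bit/s for a given bandwidth and linear SINR."""
    return bandwidth_hz * math.log2(1.0 + sinr_linear)


# Toy received powers from three base stations (illustrative values only).
rx = {1: 4.0, 2: 2.0, 3: 1.0}
sinr_single = jt_sinr(rx, {1}, noise_mw=1.0)     # only BS 1 serves the user
sinr_comp = jt_sinr(rx, {1, 2}, noise_mw=1.0)    # BS 1 and BS 2 cooperate
cap_comp = capacity_bps(1e6, sinr_comp)
```

In this toy case, moving base station 2 from the interferer side into the CoMP set raises the SINR from 1.0 to 3.0, which doubles the capacity over 1 MHz from 1 Mbit/s to 2 Mbit/s.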

##### 2.4. Bits/Joule Energy Efficiency

One common way to measure energy efficiency is in bits per Joule. For a CoMP scheme $\pi$, energy efficiency can be defined as the ratio of transmission capacity to transmission power:

$$EE_\pi = \frac{C_\pi}{P_\pi},$$

where $C_\pi$ is the total downlink capacity and $P_\pi$ is the total transmission power in CoMP scheme $\pi$, which is expressed as

$$P_\pi = \sum_{i \in \pi} P_i,$$

where $P_i$ is the total power consumption of base station $i$ in CoMP scheme $\pi$, which is calculated using the assumptions from [12] and [13] as

$$P_i = \left(\frac{P_{TX}}{\eta_{PA}} + P_{SP}\right)\left(1 + C_C\right)\left(1 + C_{BB}\right) + P_{BH},$$

where $P_{TX}$ is the radiated power per base station, $P_{SP}$ is the signal processing power per base station, and $P_{BH}$ is the power due to backhauling. $\eta_{PA}$ is the power amplifier efficiency, and $C_C$ and $C_{BB}$ are the cooling and battery backup losses in the system.

The signal processing power per sector, $P_{SP}$, scales with the cooperation size $N_c$, the number of base stations participating in the CoMP scheme $\pi$.

The backhauling power consumption $P_{BH}$ for base stations using CoMP is modeled in [12] as a function of $C_{BH}$, a given average backhaul requirement per base station. $C_{BH}$ in turn depends on $p$, the additional pilot density, $q$, the CSI signaling under the CoMP network, and $T_s = 66.7$ μs, the sample period, which is the reciprocal of the assumed OFDM subcarrier spacing of 15 kHz [8].
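The bits-per-Joule metric can be sketched in Python as follows. The per-base-station power expression used here, in which the radiated power divided by the amplifier efficiency is added to the signal processing power, scaled by the cooling and battery backup losses, and then increased by the backhaul power, is an assumed structure chosen to match the quantities named above; the numeric inputs are placeholders, not values from Table 1.

```python
def bs_power_w(p_tx_w, pa_eff, p_sp_w, p_bh_w, c_cool, c_bb):
    """Total power of one base station in watts (assumed model structure).

    p_tx_w  radiated power          pa_eff  power amplifier efficiency
    p_sp_w  signal processing power p_bh_w  backhaul power
    c_cool  cooling loss factor     c_bb    battery backup loss factor
    """
    return (p_tx_w / pa_eff + p_sp_w) * (1.0 + c_cool) * (1.0 + c_bb) + p_bh_w


def energy_efficiency(total_capacity_bps, bs_powers_w):
    """Energy efficiency in bit/Joule: total capacity over total power."""
    return total_capacity_bps / sum(bs_powers_w)


# Sample period as the reciprocal of the 15 kHz OFDM subcarrier spacing.
sample_period_s = 1.0 / 15e3

# Placeholder numbers for two cooperating base stations.
p_one = bs_power_w(p_tx_w=40.0, pa_eff=0.5, p_sp_w=20.0, p_bh_w=10.0,
                   c_cool=0.0, c_bb=0.0)
ee = energy_efficiency(2.2e6, [p_one, p_one])
```

Because every additional cooperating base station adds its full power to the denominator, the capacity gain from a larger CoMP set has to outweigh that cost for the energy efficiency to improve.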

Using the models above, we seek the best CoMP scheme, one that both ensures the quality of communication and maximizes the energy efficiency.

#### 3. Reinforcement Learning

In this section, we apply a reinforcement learning (RL) algorithm to find the best CoMP scheme $\pi^*$. We start with an introduction to the RL algorithm and its branch, Q-learning. Then we build an appropriate value function and apply RL to obtain the optimal CoMP scheme $\pi^*$.

##### 3.1. RL Algorithm

Generally speaking, an RL algorithm tries to find an optimal action policy for solving a given task in an unknown environment. During RL, a learner observes the state $s$ of the environment and chooses its action $a$ accordingly. The initial policy is chosen arbitrarily, so it is typical of the learning process that the learner takes wrong actions at first. The learner then receives a reward feedback signal $r$ from the environment and improves its policy based on that reward.

The learner chooses its action $a$ depending on the environment state $s$ and the reward $r$, and the policy $\pi$ is adjusted along with the action $a$, as shown in Figure 2.

During RL, a learner looks for an optimal policy $\pi^*$ under which it always receives the best rewards from the environment. For such a policy, the value function $V^{\pi^*}(s)$ is always greater than or equal to the value function $V^{\pi}(s)$ of any policy $\pi$.

At present, RL algorithms fall mainly into two categories: value function estimation and policy space search. Concretely, RL algorithms include the Monte Carlo algorithm, the Temporal Difference (TD) algorithm, the Dynamic Programming algorithm, the Q-learning algorithm, and the Sarsa learning algorithm.

##### 3.2. Q-Learning Algorithm

The Q-learning algorithm was proposed by Watkins and Dayan in 1989 [14] and is a milestone in the development of reinforcement learning. Q-learning is currently also one of the most widely applied reinforcement learning algorithms.

Compared to other RL algorithms, Q-learning obtains a $Q$ value for each *state-action* pair, and the value function is expressed as $Q(s, a)$. The value of $Q(s, a)$ is the accumulated reward obtained by repeatedly choosing action $a$ in state $s$ according to the related policy. The optimal policy of Q-learning maximizes $Q(s, a)$; thus the optimal policy can be expressed as

$$\pi^*(s) = \arg\max_{a} Q(s, a).$$

As a result, the learner only needs to consider the current state and the available actions and then choose the action that maximizes $Q(s, a)$ according to the policy. When choosing an action, we only have to select the maximum value from the $Q$ list, which greatly simplifies decision-making. The values in the $Q$ list are the result of step-by-step iterative learning. The learner needs constant interaction with the environment to enrich the $Q$ list so that it covers all possible situations. The Q-learning iterative formula for the value function is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha$ is the learning coefficient and $\gamma$ is the discount factor.

The steps of the Q-learning algorithm are as follows. (1) Initialize the value function $Q(s, a)$ and the factors $\alpha$ and $\gamma$. (2) Observe the current state $s$. (3) According to the current state, select the action $a$ following the policy, and observe the next state $s'$. (4) According to the value function of the new *state-action* pair, update the estimate of $Q(s, a)$. (5) Check whether learning has ended; if not, set $s \leftarrow s'$ and go back to (2).
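The steps above can be sketched as a generic tabular Q-learning loop in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the epsilon-greedy action selection, the default values of the learning coefficient and discount factor, and the four-state chain environment (`chain_step`, `chain_actions`) are all hypothetical choices made to keep the example self-contained.

```python
import random
from collections import defaultdict


def q_learning(step, actions, start, n_steps,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; returns a dict mapping (state, action) -> Q value."""
    rng = random.Random(seed)
    q = defaultdict(float)               # (1) initialize Q(s, a) to zero
    s = start                            # (2) observe the current state
    for _ in range(n_steps):
        avail = actions(s)
        if rng.random() < epsilon:       # assumed exploration rule
            a = rng.choice(avail)
        else:                            # (3) greedy choice from the Q list
            a = max(avail, key=lambda act: q[(s, act)])
        s_next, r = step(s, a)           # observe next state and reward
        best_next = max(q[(s_next, b)] for b in actions(s_next))
        # (4) update the estimate with the iterative formula
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = s_next                       # (5) continue from the new state
    return q


def chain_actions(s):
    """Toy environment: move right (+1) or left (-1) on states 0..3."""
    return [+1, -1]


def chain_step(s, a):
    """Reward 1 whenever the move ends in state 3, otherwise 0."""
    s_next = min(max(s + a, 0), 3)
    return s_next, (1.0 if s_next == 3 else 0.0)


q_table = q_learning(chain_step, chain_actions, start=0, n_steps=500)
```

After training on the toy chain, moving right from state 2 carries a larger Q value than moving left, because only the rightward move reaches the rewarding state directly.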

##### 3.3. RL Algorithm In CoMP Communication

In CoMP communication, our purpose is to find the best CoMP scheme $\pi^*$, which corresponds to the optimal policy in the RL algorithm. In the model built in Section 2, six base stations can participate in the cooperation, numbered from 1 to 6 as Figure 1 shows. The cooperation policy can then be written as a 1 × 6 matrix. For example, if the optimal policy is the cooperation of base stations 1, 3, and 6, then the policy is $\pi^* = [1\ 0\ 1\ 0\ 0\ 1]$.

The environment state $s$ is the number and the locations of the users. The value function is the energy efficiency. The rewards are related to the downlink capacity and the received SINR. We set the downlink capacity standard to 1 Mbit/s and the received SINR standard to 21 dB, and we compute the fractions of users who reach each of the two standards. Table 2 shows the relations, obtained after different trials, between these two fractions and the rewards.

Because the policy is a matrix, each action changes a single entry of it, which gives twelve possible actions. Choosing one of the actions 1~6 adds a base station to the CoMP set, while choosing one of the actions 7~12 removes a base station from the CoMP set. Not all actions are available at each step; for example, a base station that is already in the CoMP set cannot be added again.
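The add and remove actions on the 1 × 6 policy can be sketched in Python as below. The mapping of action k in 7~12 to base station k − 6, and the rule that the last remaining base station may not be removed, are assumptions made for this illustration; the text only states that not all actions are available at each step.

```python
def available_actions(policy):
    """List the actions that are valid for a 6-entry 0/1 policy tuple."""
    # Actions 1..6: add a base station that is not yet cooperating.
    adds = [i + 1 for i, on in enumerate(policy) if not on]
    # Actions 7..12: remove a base station that is cooperating.
    removes = [i + 7 for i, on in enumerate(policy) if on]
    if sum(policy) <= 1:
        # Assumption: the sleeping cell must keep at least one serving BS.
        removes = []
    return adds + removes


def apply_action(policy, action):
    """Return the new policy after applying one add or remove action."""
    new_policy = list(policy)
    if 1 <= action <= 6:
        new_policy[action - 1] = 1      # add base station `action`
    elif 7 <= action <= 12:
        new_policy[action - 7] = 0      # remove base station `action - 6`
    else:
        raise ValueError("action must be in 1..12")
    return tuple(new_policy)


policy = (1, 0, 1, 0, 0, 1)             # base stations 1, 3, and 6 cooperate
actions_now = available_actions(policy)
after_add = apply_action(policy, 2)     # add base station 2
after_remove = apply_action(policy, 9)  # remove base station 3
```

For the policy with base stations 1, 3, and 6, the valid actions under these assumptions are adding 2, 4, or 5 and removing 1, 3, or 6.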

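The Q-learning iterative formula for CoMP then follows the update in Section 3.2, with the energy efficiency as the value function and the rewards given by Table 2.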

In practice, the number and locations of the users change much more slowly than the search for the optimal policy $\pi^*$ proceeds, so the optimal policy stays the same for a period of time. To avoid being trapped at a local optimum, we introduce a variation every six steps. After sixty steps, we remove the variations; the system then operates under the optimal policy with the maximal $Q$ value while the state changes only slightly. When the state changes, Q-learning repeats the process above automatically. In this way, the scheme leads to a cellular network that is green in terms of both energy efficiency and downlink performance.
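The variation schedule can be sketched in Python as follows. The sketch assumes that steps are counted from 1 within each state, that a "variation" means taking a uniformly random available action, and that all other steps are greedy with respect to the current Q values; these details, like the helper names, are illustrative assumptions rather than the paper's exact procedure.

```python
import random


def is_variation_step(step, period=6, horizon=60):
    """True on steps 6, 12, ..., 60 of the search for the current state."""
    return step <= horizon and step % period == 0


def choose_action(q_row, available, step, rng):
    """Pick a random action on variation steps, the greedy one otherwise.

    q_row      dict mapping action -> current Q value for this policy
    available  list of actions that are valid right now
    """
    if is_variation_step(step):
        return rng.choice(available)
    return max(available, key=lambda a: q_row.get(a, 0.0))


variation_steps = [t for t in range(1, 121) if is_variation_step(t)]
greedy_pick = choose_action({1: 0.2, 2: 0.9}, [1, 2], step=61,
                            rng=random.Random(0))
```

Under these assumptions there are ten variation steps per state, after which the learner always takes the action with the largest Q value until the state changes and the step counter is reset.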

#### 4. Simulation and Discussions

In this section, computer simulations are carried out to evaluate the performance of the proposed RL-based CoMP scheme. The simulation parameters are shown in Table 1. We assume that the channel state information is perfect. The CoMP scheme changes with the communication standards: under different standards, the number of base stations participating in CoMP communication differs.

Figure 3(a) shows how the $Q$ value changes with the SINR standard. When the SINR standard is low, most of the users in the center cell can reach it, so the SINR reward of each set is positive and the energy efficiency mainly determines the $Q$ value. As the SINR standard increases, fewer users in the center cell can reach it and the reward turns negative, so the $Q$ value decreases. Figure 3(b) shows the best CoMP set under different SINR standards. As the SINR standard increases, the best CoMP set grows to meet the standard. When the standard reaches 30 dB, even CoMP set 6 cannot meet it, so the best set drops back to CoMP set 1 under the influence of energy efficiency.

Figure 4(a) shows how the $Q$ value changes with the Capacity standard. When the Capacity standard is low, most of the users in the center cell can reach it, so the Capacity reward of each set is positive and the energy efficiency mainly determines the $Q$ value. As the Capacity standard increases, fewer users in the center cell can reach it and the reward turns negative, so the $Q$ value decreases. Figure 4(b) shows the best CoMP set under different Capacity standards. As the Capacity standard increases, the best CoMP set grows to meet the standard. When the standard reaches 11 Mbit/s, even CoMP set 6 cannot meet it, so the best set drops back to CoMP set 2 under the influence of energy efficiency.

Figures 5–9 show simulations in which the Capacity standard is 1 Mbit/s and the SINR standard is 21 dB. Figure 5 shows that energy efficiency decreases as the CoMP set grows. Figure 6 shows that both downlink capacity and SINR increase as the CoMP set grows. Figure 7 shows that, in this situation, CoMP set 4 usually obtains the largest $Q$ value.

In Figures 8(a) and 9(a), the red circles represent the base stations in the CoMP grouping obtained by Q-learning, while the yellow ones represent the best base station grouping found by enumeration search. Figure 8 shows the result and process of Q-learning without variations. It is easily trapped at a local optimum, which leads to a wrong base station grouping, as verified by enumeration search. To overcome this, a variation is introduced into Q-learning every six steps, as Figure 9 shows. In this way, the local optimum is escaped and the best base station grouping can be obtained by Q-learning.

Enumeration search is an auxiliary method used to check the Q-learning result. Compared to enumeration search, Q-learning can optimize the base station grouping automatically in a short time, as Figure 10 shows.
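For reference, an enumeration search over the six candidate base stations can be sketched in Python as follows. It assumes that every nonempty subset is a valid grouping, which gives 63 candidates, and that a caller-supplied `score` function evaluates a grouping; the toy score used here is purely illustrative and stands in for the energy efficiency and reward evaluation.

```python
from itertools import product


def enumerate_groupings(n_bs=6):
    """All nonempty 0/1 grouping vectors over n_bs base stations."""
    return [p for p in product((0, 1), repeat=n_bs) if any(p)]


def best_grouping(score, n_bs=6):
    """Exhaustively evaluate every grouping and return the best one."""
    return max(enumerate_groupings(n_bs), key=score)


# Toy score: the closer a grouping is to (1, 0, 1, 0, 0, 1), the better.
target = (1, 0, 1, 0, 0, 1)


def toy_score(policy):
    return -sum((x - t) ** 2 for x, t in zip(policy, target))


all_groupings = enumerate_groupings()
best = best_grouping(toy_score)
```

The exhaustive search must score all 63 groupings for every state, which is why it serves here only as a check on the grouping that Q-learning finds.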

Figure 11 shows the process of Q-learning with a variation every six steps. For each state, we allow sixty steps to adjust the strategy. Within these sixty steps, a variation is added every six steps, which makes the $Q$ value oscillate. After sixty steps, the variations are removed. Figure 12 shows that we find the maximal $Q$ value in the first sixty steps and then follow the optimal policy $\pi^*$ with that value. As Figures 11 and 12 show, the state changes little. At steps 121 and 272 the state changes, so Q-learning returns to the variation phase to find the new optimal policy $\pi^*$. The relevant energy efficiency and rewards are shown in Figures 8 and 12. The relevant states and the optimal policy are shown in Table 3.

#### 5. Conclusion

In wireless cellular networks, it is very important to increase the energy efficiency of radio access networks. It is also important to increase the downlink performance of the wireless network to meet users' requirements. In this paper, we proposed an RL algorithm to derive the optimal base station grouping decisions for efficient transmission in CoMP. In addition, we introduced variations to avoid being trapped at a local maximum. Simulation results have been presented to show the process of Q-learning and the optimal policy it finds.

More research is in progress to make the communication model more realistic by considering imperfect channel state information. In addition, two or more base stations in the small-scale model unit could be dynamically switched off to further “green” wireless communications.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The work was partly supported by the Natural Science Foundation of Hebei Province of China under the Project no. F2014203095 and the Young Teacher of Yanshan University under the Project no. 13LGA007.