Review Article | Open Access
Kok-Lim Alvin Yau, Geong-Sen Poh, Su Fong Chien, Hasan A. A. Al-Rawi, "Application of Reinforcement Learning in Cognitive Radio Networks: Models and Algorithms", The Scientific World Journal, vol. 2014, Article ID 209810, 23 pages, 2014. https://doi.org/10.1155/2014/209810
Application of Reinforcement Learning in Cognitive Radio Networks: Models and Algorithms
Cognitive radio (CR) enables unlicensed users to exploit underutilized portions of the licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), which is an artificial intelligence approach, has been applied to enable each unlicensed user to observe and carry out optimal actions for performance enhancement in a wide range of schemes in CR, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review on how most schemes have been approached using the traditional and enhanced RL algorithms through state, action, and reward representations. Examples of the enhancements on RL, which do not appear in the traditional RL approach, are rules and cooperative learning. This paper also reviews performance enhancements brought about by the RL algorithms and open issues. This paper aims to establish a foundation in order to spark new research interests in this area. Our discussion has been presented in a tutorial manner so that it is comprehensible to readers outside the specialty of RL and CR.
Cognitive radio (CR) is the next-generation wireless communication system that enables unlicensed or Secondary Users (SUs) to explore and use underutilized licensed spectrum (or white spaces) owned by the licensed or Primary Users (PUs) in order to improve the overall spectrum utilization. The CR technology improves the availability of bandwidth at each SU, and so it enhances the SU network performance. Reinforcement learning (RL) has been applied in CR so that the SUs can observe, learn, and take optimal actions on their respective local operating environment. For example, a SU observes its spectrum to identify white spaces, learns the best possible channels for data transmissions, and takes actions such as transmitting data in the best possible channel. Examples of schemes in which RL has been applied are dynamic channel selection, channel sensing, and routing. To the best of our knowledge, the discussion on the application of RL in CR networks is new, despite the importance of RL in achieving the fundamental concept of CR, namely, the cognition cycle (see Section 2.2.1). This paper provides an extensive review on various aspects of the application of RL in CR networks, particularly, the components, features, and enhancements of RL. Most importantly, we present how the traditional and enhanced RL algorithms have been applied to approach most schemes in CR networks. Specifically, for each new RL model and algorithm which is our focus, we present the purpose(s) of a CR scheme, followed by an in-depth discussion on its associated RL model (i.e., state, action, and reward representations) which characterizes the purposes, and finally the RL algorithm which aims to achieve the purpose. Hence, this paper serves as a solid foundation for further research in this area, particularly, for the enhancement of RL in various schemes in the context of CR, which can be achieved using new extensions in existing schemes, and for the application of RL in new schemes.
The rest of this paper is organized as follows. Section 2 presents RL and CR networks. Section 3 presents various components, features, and enhancements of RL in the context of CR networks. Section 4 presents various RL algorithms in the context of CR networks. Section 5 presents performance enhancements brought about by the RL algorithms in various schemes in CR networks. Section 6 presents open issues. Section 7 presents conclusions.
2. Reinforcement Learning and Cognitive Radio Networks
This section presents an overview of RL and CR networks.
2.1. Reinforcement Learning
Reinforcement learning is an unsupervised and online artificial intelligence technique that improves system performance using simple modeling. Through unsupervised learning, there is no external teacher or critic to oversee the learning process, and so an agent learns knowledge about the operating environment by itself. Through online learning, an agent learns knowledge on the fly while carrying out its normal operation, rather than using empirical data or experimental results from the laboratory.
Figure 1 shows a simplified version of a RL model. At a particular time instant, a learning agent or a decision maker observes state and reward from its operating environment, learns, decides, and carries out its action. The important representations in the RL model for an agent are as follows.
(i) State represents the decision-making factors, which affect the reward (or network performance), observed by an agent from the operating environment. Examples of states are the channel utilization level by PUs and channel quality.
(ii) Action represents an agent's action, which may change or affect the state (or operating environment) and reward (or network performance), and so the agent learns to take optimal actions most of the time.
(iii) Reward represents the positive or negative effects of an agent's action on its operating environment in the previous time instant. In other words, it is the consequence of the previous action on the operating environment in the form of network performance (e.g., throughput).
At any time instant, an agent observes its state and carries out a proper action so that the state and reward, which are the consequences of the action, improve in the next time instant. Generally speaking, RL estimates the reward of each state-action pair, and this constitutes knowledge. The most important component in Figure 1 is the learning engine that provides knowledge to the agent. We briefly describe how an agent learns. At any time instant, an agent's action may affect the state and reward for better or for worse, or maintain the status quo; this in turn affects the agent's next choice of action. As time progresses, the agent learns to carry out a proper action given a particular state. As an example of the application of the RL model in CR networks, the learning mechanism is used to learn channel conditions in a dynamic channel selection scheme. The state represents the channel utilization level by PUs and channel quality. The action represents a channel selection. Depending on the application, the reward represents distinctive performance metrics such as throughput and successful data packet transmission rate. A lower channel utilization level by PUs and a higher channel quality indicate a better communication link, and hence the agent may achieve better throughput performance (reward). Therefore, maximizing reward provides network performance enhancement.
Q-learning is a popular technique in RL, and it has been applied in CR networks. Denote decision epochs by $t \in T$; the knowledge possessed by an agent for a particular state-action pair at time $t$ is represented by the Q-function, which is updated as follows:

$Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha) Q_t(s_t, a_t) + \alpha \big[ r_{t+1}(s_{t+1}) + \gamma \max_{a \in A} Q_t(s_{t+1}, a) \big] \quad (1)$

where
(i) $s_t \in S$ represents the state,
(ii) $a_t \in A$ represents the action,
(iii) $r_{t+1}(s_{t+1})$ represents the delayed reward, which is received at time $t+1$ for an action taken at time $t$,
(iv) $0 \le \gamma \le 1$ represents the discount factor. The higher the value of $\gamma$, the more the agent relies on the discounted future reward $\gamma \max_{a \in A} Q_t(s_{t+1}, a)$ compared to the delayed reward $r_{t+1}(s_{t+1})$,
(v) $0 \le \alpha \le 1$ represents the learning rate. The higher the value of $\alpha$, the more the agent relies on the delayed reward $r_{t+1}(s_{t+1})$ and the discounted future reward $\gamma \max_{a \in A} Q_t(s_{t+1}, a)$, compared to the Q-value $Q_t(s_t, a_t)$ at time $t$.
At decision epoch $t$, the agent observes its operating environment to determine its current state $s_t$. Based on $s_t$, the agent chooses an action $a_t$. Next, at decision epoch $t+1$, the state changes to $s_{t+1}$ as a consequence of the action $a_t$, and the agent receives the delayed reward $r_{t+1}(s_{t+1})$. Subsequently, the Q-value $Q_t(s_t, a_t)$ is updated using (1). Note that, in the remaining decision epochs, the agent is expected to take optimal actions with regard to the states; hence, the Q-value is updated using a maximized discounted future reward $\gamma \max_{a \in A} Q_t(s_{t+1}, a)$. As this procedure evolves through time, the agent receives a sequence of rewards and the Q-value converges. Q-learning searches for an optimal policy at all time instants through maximizing the value function as shown below:

$V_t(s_t) = \max_{a \in A} Q_t(s_t, a) \quad (2)$
Hence, the policy (or action selection) for the agent is as follows:

$\pi_t(s_t) = \arg\max_{a \in A} Q_t(s_t, a) \quad (3)$
The update of the Q-value in (1) does not cater for actions that are never chosen. Exploitation chooses the best-known action, or the greedy action, at all time instants for performance enhancement. Exploration chooses the other, nonoptimal actions once in a while to improve the estimates of all Q-values in order to discover better actions. While Figure 1 shows a single agent, the presence of multiple agents is feasible. In the context of CR networks, a rigorous proof of the convergence of Q-values in the presence of multiple SUs has been shown in the literature.
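To make the Q-learning cycle above concrete, the following Python sketch implements the Q-function update of (1) and an epsilon-greedy policy. It is a minimal illustrative fragment, not code from any surveyed scheme; the dictionary-based Q-table, the channel names, and the parameter values are assumptions.

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.8):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]

def select_action(Q, s, actions, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

For example, repeatedly rewarding a channel with $r = 1$ drives its Q-value toward 1, so the greedy policy increasingly selects that channel.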
The advantages of RL are as follows:
(i) instead of tackling every single factor that affects the system performance, RL models the overall system performance (e.g., throughput), which covers a wide range of factors affecting the throughput performance, including the channel utilization level by PUs and channel quality; hence its modeling approach is simple;
(ii) prior knowledge of the operating environment is not necessary, and so a SU can learn the operating environment (e.g., channel quality) as time goes by.
2.2. Cognitive Radio Networks
Traditionally, spectrum allocation policy has partitioned the radio spectrum into smaller ranges of licensed and unlicensed frequency bands (also called channels). The licensed channels provide exclusive channel access to licensed users or PUs. Unlicensed users or SUs, such as devices in the popular IEEE 802.11 wireless communication systems, access unlicensed channels without incurring any monetary cost, and they are forbidden to access any of the licensed channels. Examples of unlicensed channels are the Industrial, Scientific, and Medical (ISM) and Unlicensed National Information Infrastructure (UNII) bands. While the licensed channels have been underutilized, the opposite phenomenon has been observed among the unlicensed channels.
Cognitive radio enables SUs to explore radio spectrum and use white spaces whilst minimizing interference to PUs. The purpose is to improve the availability of bandwidth at each SU, hence improving the overall utilization of radio spectrum. CR helps the SUs to establish a “friendly” environment, in which the PUs and SUs coexist without causing interference with each other as shown in Figure 2. In Figure 2, a SU switches its operating channel across various channels from time to time in order to utilize white spaces in the licensed channels. Note that each SU may observe different white spaces, which are location dependent. The SUs must sense the channels and detect the PUs’ activities whenever they reappear in white spaces. Subsequently, the SUs must vacate and switch their respective operating channel immediately in order to minimize interference to PUs. For a successful communication, a particular white space must be available at both SUs in a communication node pair.
The rest of this subsection is organized as follows. Section 2.2.1 presents the cognition cycle, which is an essential component in CR. Section 2.2.2 presents various application schemes in which RL has been applied to provide performance enhancement.
2.2.1. Cognition Cycle
The cognition cycle, which is a well-known concept in CR, is embedded in each SU to achieve context awareness and intelligence in CR networks. Context awareness enables a SU to sense and be aware of its operating environment, while intelligence enables the SU to observe, learn, and use the white spaces opportunistically so that a static predefined policy is not required while providing network performance enhancement.
The cognition cycle can be represented by a RL model as shown in Figure 1. The RL model can be tailored to fit well with a wide range of applications in CR networks. A SU can be modeled as a learning agent. At a particular time instant, the SU agent observes state and reward from its operating environment, learns, decides, and carries out action on the operating environment in order to maximize network performance. Further description on RL-based cognition cycle is presented in Section 2.1.
2.2.2. Application Schemes
Reinforcement learning has been applied in a wide range of schemes in CR networks for SU performance enhancement, whilst minimizing interference to PUs. The schemes are listed as follows, and the nomenclatures (e.g., (A1) and (A2)) are used to represent the respective application schemes throughout the paper.
(A1) Dynamic Channel Selection (DCS). The DCS scheme selects operating channel(s) with white spaces for data transmission whilst minimizing interference to PUs. Yau et al. [8, 9] propose a DCS scheme that enables SUs to learn and select channels with low packet error rate and a low level of channel utilization by PUs in order to enhance QoS, particularly throughput and delay performances.
(A2) Channel Sensing. Channel sensing senses for white spaces and detects the presence of PU activities. In [10], the SU reduces the number of sensing channels and may even turn off its channel sensing function if its operating channel has achieved the required successful transmission rate, in order to enhance throughput performance. In [11], the SU determines the durations of channel sensing, channel switching, and data transmission, respectively, in order to enhance QoS, particularly throughput, delay, and packet delivery rate performances. Both [10, 11] incorporate DCS (A1) into channel sensing in order to select operating channels. Due to the environmental factors that can deteriorate transmissions (e.g., multipath fading and shadowing), Lo and Akyildiz propose a cooperative channel sensing scheme, which combines sensing outcomes from cooperating one-hop SUs, to improve the accuracy of PU detection.
(A3) Security Enhancement. A security enhancement scheme aims to ameliorate the effects of attacks from malicious SUs. Vucevic et al. propose a security enhancement scheme to minimize the effect of inaccurate sensing outcomes received from neighboring SUs in channel sensing (A2). A SU becomes malicious whenever it sends inaccurate sensing outcomes, intentionally (e.g., Byzantine attacks) or unintentionally (e.g., unreliable devices). Wang et al. propose an antijamming scheme to minimize the effects of jamming attacks from malicious SUs, which constantly transmit packets to keep the channels busy at all times so that SUs are deprived of any opportunities to transmit.
(A4) Energy Efficiency Enhancement. An energy efficiency enhancement scheme aims to minimize energy consumption. Zheng and Li propose an energy-efficient channel sensing scheme to minimize energy consumption in channel sensing. Energy consumption varies with activity, and it increases from sleep, through idle, to channel sensing. The scheme takes into account the PU and SU traffic patterns and determines whether a SU should enter the sleep, idle, or channel sensing mode. Switching between modes should be minimized because each transition between modes incurs time delays.
(A5) Channel Auction. Channel auction provides a bidding platform for SUs to compete for white spaces. Chen and Qiu propose a channel auction scheme that enables the SUs to learn the policy (or action selection) of their respective SU competitors and place bids for white spaces. This helps to allocate white spaces among the SUs efficiently and fairly.
(A6) Medium Access Control (MAC). A MAC protocol aims to minimize packet collision and maximize channel utilization in CR networks. Li et al. propose a collision reduction scheme that reduces the probability of packet collision among PUs and SUs, and it has been shown to increase throughput and to decrease packet loss rate among the SUs. Li et al. also propose a retransmission policy that enables a SU to determine how long it should wait before transmission in order to minimize channel contention.
(A7) Routing. Routing enables each SU source or intermediate node to select its next hop for transmission in order to search for the best route(s), which normally incur the least cost or provide the highest amount of reward, to the SU destination node. Each link within a route has different types and levels of costs, such as queuing delay, available bandwidth or congestion level, packet loss rate, energy consumption level, and link reliability, as well as costs arising from changes in network topology caused by irregular node movement speeds and directions.
(A8) Power Control. Yao and Feng propose a power selection scheme that selects an available channel and a power level for data transmission. The purpose is to improve the SU's Signal-to-Noise Ratio (SNR) in order to improve the packet delivery rate.
3. Reinforcement Learning in the Context of Cognitive Radio Networks: Components, Features, and Enhancements
This section presents the components of RL, namely, state, action, reward, discounted reward, and Q-function, as well as the features of RL, namely, exploration and exploitation, updates of learning rate, rules, and cooperative learning. The components and features of RL (see Section 2.1) are presented in the context of CR. For each component and feature, we show the traditional approach and subsequently the alternative or enhanced approaches with regard to modeling, representing, and applying them in CR networks. This section serves as a foundation for further research in this area, particularly, the application of existing features and enhancements in current schemes in RL models for either existing or new schemes.
Note that, for improved readability, the notations (e.g., $s_t$ for state and $a_t$ for action) used in this paper carry the same meaning throughout the entire paper, although different references in the literature may use different notations for the same purpose.
3.1. State
Traditionally, each state is comprised of a single type of information. For instance, each state may represent a single channel out of the set of channels available for data transmission. The state may be omitted in some cases; for instance, where the state and action representations are similar, the state is not represented. The traditional state representation can be enhanced in the context of CR as described next.
Each state can be comprised of several types of information. For instance, Yao and Feng propose a joint DCS (A1) and power allocation (A8) scheme in which each state is comprised of three-tuple information; specifically, $s_t = (s_t^1, s_t^2, s_t^3)$. The substate $s_t^1$ represents the number of SU agents, $s_t^2$ represents the number of communicating SU agents, and $s_t^3$ represents the received power on each channel.
The value of a state may deteriorate as time goes by. For instance, Lundén et al. propose a channel sensing (A2) scheme in which each state represents a SU agent's belief (or probability) that channel $k$ is idle (or the absence of PU activity). Note that the belief value of channel $k$ deteriorates whenever the channel has not been sensed recently, and this indicates the diminishing confidence in the belief that channel $k$ remains idle. Denote a small step size by $\delta$ (i.e., $0 < \delta < 1$); the state value of channel $k$ deteriorates if it is not updated at a time instant, specifically by the step size $\delta$.
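The deterioration of unsensed channel beliefs can be sketched as follows. This is an illustrative Python fragment, not code from Lundén et al.; the channel names, the subtractive decay, and the clamping at zero are assumptions.

```python
def decay_beliefs(beliefs, sensed, delta=0.05):
    # beliefs: channel -> belief (probability) that the channel is idle
    # channels not sensed at this time instant lose confidence by step delta
    return {ch: b if ch in sensed else max(0.0, b - delta)
            for ch, b in beliefs.items()}
```

A belief is left untouched for channels sensed in the current time instant and decays toward zero otherwise, mirroring the diminishing confidence described above.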
3.2. Action
Traditionally, each action $a_t$ represents a single action out of a set of possible actions $A$. For instance, each action may represent a single channel out of the set of channels available for data transmission. The traditional action representation can be enhanced in the context of CR as described next.
Each action can be further divided into various levels. As an example, Yao and Feng propose a joint DCS (A1) and power allocation (A8) scheme in which each action is comprised of a channel selection and a power level allocation $p \in \{1, \ldots, L\}$, with $L$ being the number of power levels. As another example, Zheng and Li propose an energy efficiency enhancement (A4) scheme in which there are four kinds of actions, namely, transmit, idle, sleep, and sense channel. The sleep action represents a sleep level $l \in \{1, \ldots, L_s\}$, with $L_s$ being the number of sleep levels. Note that different sleep levels incur different amounts of energy consumption.
3.3. Delayed Reward
Traditionally, each delayed reward represents the amount of performance enhancement achieved by a state-action pair. A single reward computation approach is applicable to all state-action pairs. As an example, the delayed reward may take the reward and cost values of $1$ and $-1$ for each successful and unsuccessful transmission, respectively. As another example, it may represent the amount of throughput achieved within a time window. The traditional reward representation can be enhanced in the context of CR as described next.
The delayed reward can be computed differently for distinctive actions. As an example, in a joint DCS (A1) and channel sensing (A2) scheme, Felice et al. compute the delayed rewards in two different ways based on the types of actions: channel sensing and data transmission. Firstly, a SU agent calculates the delayed reward $r^{s}_{t+1}$ for a channel sensing action at time instant $t+1$. The reward $r^{s}_{t+1}$ indicates the likelihood of the existence of PU activities in channel $k$ whenever the sensing action is taken. Specifically,

$r^{s}_{t+1} = \frac{1}{N}\sum_{j=1}^{N} b_{j}$

where $N$ indicates the number of neighboring SU agents, while $b_{j}$, which is a binary value, indicates the existence of PU activities as reported by SU neighbor agent $j$. Secondly, a SU agent calculates the delayed reward $r^{x}_{t+1}$ for a data transmission action at time instant $t+1$. The reward $r^{x}_{t+1}$ indicates the successful transmission rate, which takes into account the aggregated effect of interference from PU activities whenever the transmission action is taken. Specifically,

$r^{x}_{t+1} = \frac{n^{\mathrm{ack}}}{n^{\mathrm{data}}}$

where $n^{\mathrm{data}}$ indicates the number of data packets sent (i.e., being transmitted) by the SU agent and $n^{\mathrm{ack}}$ indicates the number of acknowledgment packets received by the SU agent.
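Under the assumption that the two rewards are the simple ratios described above (an assumption about the exact form, since the original only names the ingredients), they can be sketched as:

```python
def sensing_reward(neighbor_reports):
    # fraction of one-hop neighbors reporting PU activity on the sensed channel
    return sum(neighbor_reports) / len(neighbor_reports)

def transmission_reward(n_ack, n_sent):
    # successful transmission rate: ACKs received over data packets sent
    return n_ack / n_sent if n_sent else 0.0
```

The first function takes a list of binary neighbor reports; the second guards against the case in which no packets were sent during the window.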
Jouini et al. apply an Upper Confidence Bound (UCB) algorithm to compute delayed rewards in a dynamic and uncertain operating environment (e.g., an operating environment with inaccurate sensing outcomes), and it has been shown to improve throughput performance in DCS (A1). The main objective of this algorithm is to determine the upper confidence bounds of all rewards and subsequently use them to make decisions on action selection. The rewards are uncertain, and the uncertainty is caused by the dynamicity and uncertainty of the operating environment. Let $n_t(a)$ represent the number of times an action $a$ has been taken on the operating environment up to time $t$; an agent calculates the upper confidence bounds of all delayed rewards as follows:

$B_t(a) = \bar{r}_t(a) + A_t(a)$

where $\bar{r}_t(a)$ is the mean reward and $A_t(a)$ is the upper confidence bias being added to the mean. Note that $n_{t+1}(a) = n_t(a)$ if $a$ is not chosen at time instant $t$. The bias $A_t(a)$ is calculated as follows:

$A_t(a) = \sqrt{\frac{\beta \ln t}{n_t(a)}}$

where the exploration coefficient $\beta$ is a constant empirical factor (for instance, see [22, 23]).
The UCB algorithm selects the action with the highest upper confidence bound, and so (3) is rewritten as follows:

$\pi_t = \arg\max_{a \in A} B_t(a)$
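A minimal Python sketch of UCB-based channel selection follows. The exploration coefficient and the handling of untried channels (selected first) are illustrative assumptions, not details from the cited schemes.

```python
import math

def ucb_bound(mean_reward, n_pulls, t, beta=2.0):
    # B_t(a) = mean reward + sqrt(beta * ln t / n_t(a)); untried actions come first
    if n_pulls == 0:
        return float('inf')
    return mean_reward + math.sqrt(beta * math.log(t) / n_pulls)

def ucb_select(means, counts, t, beta=2.0):
    # choose the channel (index) with the highest upper confidence bound
    return max(range(len(means)),
               key=lambda a: ucb_bound(means[a], counts[a], t, beta))
```

Note how a channel tried only rarely can outrank a channel with a higher mean reward, which is exactly the exploration behavior the bias term is designed to produce.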
3.4. Discounted Reward
Traditionally, the discounted reward has been applied to indicate the dependency of the Q-value on future rewards. Depending on the application, the discounted reward may be omitted by setting $\gamma = 0$ to show the lack of dependency on future rewards, and this approach is generally called the myopic approach. As an example, Li and Chen et al. apply Q-learning in DCS (A1), and the Q-function in (1) is rewritten as follows:

$Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha) Q_t(s_t, a_t) + \alpha \, r_{t+1}(s_{t+1})$
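With $\gamma = 0$, the update reduces to an exponentially weighted average of past rewards, as this short sketch shows (illustrative Python, not code from the cited schemes):

```python
def myopic_q_update(Q, a, r, alpha=0.2):
    # gamma = 0: the Q-value is an exponentially weighted average of past rewards
    Q[a] = (1 - alpha) * Q.get(a, 0.0) + alpha * r
    return Q[a]
```

Repeated identical rewards drive the Q-value toward that reward, so the myopic estimate tracks the recent average performance of each action.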
3.5. Q-Function
The traditional Q-function (see (1)) has been widely applied to update Q-values in CR networks. The traditional Q-function can be enhanced in the context of CR as described next.
Lundén et al. apply a linear function approximation-based approach to reduce the dimensionality of the large state-action space (or reduce the number of state-action pairs) in a collaborative channel sensing (A2) scheme. A linear function $\phi(s, a)$ provides a matching value for a state-action pair. The matching value, which shows the appropriateness of a state-action pair, is subsequently applied in the Q-value computation. The linear function is normally fixed (or hard-coded), and various kinds of linear functions are possible to indicate the appropriateness of a state-action pair based on prior knowledge. For instance, $\phi(s, a)$ may yield a value that represents the level of desirability of a certain number of SU agents sensing a particular channel $k$; a higher value indicates that the number of SU agents sensing channel $k$ is closer to a desirable number. Using a fixed linear function, the learning problem is transformed into learning a parameter $\theta$ such that the Q-value is approximated as follows:

$Q_t(s_t, a_t) = \theta_t \, \phi(s_t, a_t)$
The parameter $\theta$ is updated as follows:

$\theta_{t+1} = \theta_t + \alpha \big[ r_{t+1}(s_{t+1}) + \gamma \max_{a \in A} \theta_t \, \phi(s_{t+1}, a) - \theta_t \, \phi(s_t, a_t) \big] \phi(s_t, a_t)$
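The scalar-parameter case can be sketched as follows. This is illustrative Python assuming a single parameter $\theta$ and a scalar feature $\phi$ for simplicity; practical schemes may use feature vectors.

```python
def q_approx(theta, phi):
    # Q(s, a) is approximated by theta * phi(s, a)
    return theta * phi

def theta_update(theta, phi, r, best_next_q, alpha=0.1, gamma=0.8):
    # temporal-difference update of the single parameter theta
    td_error = r + gamma * best_next_q - theta * phi
    return theta + alpha * td_error * phi
```

With a constant reward and $\gamma = 0$, repeated updates drive the approximated Q-value toward the reward, just as the tabular update does.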
3.6. Exploration and Exploitation
Traditionally, there are two popular approaches to achieve a balanced trade-off between exploration and exploitation, namely, softmax and $\varepsilon$-greedy. For instance, Yau et al. use the $\varepsilon$-greedy approach, in which an agent explores with a small probability $\varepsilon$ (i.e., $0 < \varepsilon \ll 1$) and exploits with probability $1 - \varepsilon$. Essentially, these approaches aim to control the frequency of exploration so that the best-known action is taken most of the time. The traditional exploration and exploitation approach can be enhanced in the context of CR as described next.
In [3, 25], using the softmax approach, an agent selects actions based on a Boltzmann distribution; specifically, the probability of selecting action $a$ in state $s$ is as follows:

$P(a \mid s) = \frac{e^{Q_t(s, a)/\tau}}{\sum_{a' \in A} e^{Q_t(s, a')/\tau}}$

where $\tau$ is a time-varying parameter called the temperature. A higher temperature value indicates more exploration, while a smaller temperature value indicates more exploitation. Denote the time duration during which exploration actions are being chosen by $\xi$; the temperature is decreased as time goes by so that the agent performs more exploitation, for instance as follows:

$\tau_t = \tau_f + (\tau_0 - \tau_f) \, e^{-t/\xi}$

where $\tau_0$ and $\tau_f$ are the initial and final values of the temperature, respectively. Note that, due to the dynamicity of the operating environment, exploration is necessary at all times, and so $\tau_f > 0$.
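The Boltzmann action-selection rule can be sketched as follows (illustrative Python; the Q-values and temperature settings are assumed):

```python
import math

def boltzmann_probs(q_values, temperature):
    # P(a) = exp(Q(a)/tau) / sum_a' exp(Q(a')/tau)
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

At a high temperature the probabilities are nearly uniform (exploration); at a low temperature almost all probability mass falls on the best-known action (exploitation).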
Using the $\varepsilon$-greedy approach, an agent can apply a simple rule to decrease the exploration probability as time goes by, as follows:

$\varepsilon_{t+1} = \max(\gamma_\varepsilon \, \varepsilon_t, \; \varepsilon_{\min})$

where $\gamma_\varepsilon$ is a discount factor and $\varepsilon_{\min}$ is the minimum exploration probability.
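The decreasing $\varepsilon$-greedy schedule can be sketched as follows (illustrative Python; the decay factor and floor value are assumptions):

```python
def decay_epsilon(epsilon, decay=0.99, eps_min=0.01):
    # geometric decay of the exploration probability, floored at eps_min
    return max(eps_min, epsilon * decay)
```

Applied once per decision epoch, the exploration probability shrinks geometrically until it reaches the floor, which keeps a small amount of exploration active at all times.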
3.7. Other Features and Enhancements
This section presents other features and enhancements on the traditional RL approach found in various schemes for CR networks, including updates of learning rate, rules, and cooperative learning.
3.7.1. Updates of Learning Rate
Traditionally, the learning rate is a constant value $\alpha$. The learning rate may be adjusted as time goes by because a higher value of $\alpha$ may compromise the RL algorithm's ability to converge to a correct action in a finite number of steps. In one scheme, the learning rate is reduced as time goes by using $\alpha_{t+1} = \alpha_t - \delta_\alpha$, where $\delta_\alpha$ is a small value that provides a smooth transition between steps. In another scheme, the learning rate is likewise updated as a decreasing function of time.
3.7.2. Rules
Rules determine a feasible set of actions for each state. The traditional RL algorithm does not apply rules, although rules are an important component in CR networks. For instance, in order to minimize interference to PUs, the SUs must comply with the timing requirements set by the PUs, such as the time interval within which a SU must vacate its operating channel after any detection of PU activities.
As an example, Zheng and Li  propose an energy efficiency enhancement scheme in which there are four kinds of actions, namely, transmit, idle, sleep, and sense channel. Rules are applied so that the feasible set of actions is comprised of idle and sleep whenever the state indicates that there is no packet in the buffer. As another example, Peng et al.  propose a routing scheme, specifically, a next hop selection scheme in which the action represents the selection of a next hop out of a set of SU next hops. Rules are applied so that the feasible set of actions is limited to SU next hops with a certain level of SNR, as well as with shorter distance between next hop and the hop after next. The purposes of the rules are to reduce transmission delays and to ensure high-quality reception. Further description about [4, 15] is found in Table 1.
3.7.3. Cooperative Learning
Cooperative learning enables neighbor agents to share information among themselves in order to expedite the learning process. The exchanged information can be applied in the computation of the Q-function. The traditional RL algorithm does not apply cooperative learning, although it has been investigated in multiagent reinforcement learning (MARL).
Felice et al. propose a cooperative learning approach to reduce exploration. Q-values are exchanged among the SU agents and used in the Q-function computation to update the Q-value. Each SU agent $i$ keeps track of its own Q-value $Q_t^i(s_t, a_t)$, which is updated in a way similar to the myopic approach (see Section 3.4). At any time instant, agent $i$ may receive a Q-value $Q_t^j(s_t, a_t)$ from a neighbor agent $j$, and it keeps a vector of such Q-values. For the case $j \ne i$, the Q-value is updated as follows:

$Q_{t+1}^i(s_t, a_t) \leftarrow (1 - w_{i,j}) \, Q_t^i(s_t, a_t) + w_{i,j} \, Q_t^j(s_t, a_t)$

where $w_{i,j}$ defines the weight assigned to cooperation with neighbor agent $j$. A similar approach has been applied elsewhere, in which the Q-value is updated based on the weights as follows:

$Q_{t+1}^i(s_t, a_t) \leftarrow \sum_{j} w_{i,j} \, Q_t^j(s_t, a_t), \qquad \sum_{j} w_{i,j} = 1$
In one scheme, the weight depends on how much a neighbor agent can contribute to the accurate estimation of the value function, such as the physical distance between agents $i$ and $j$. In another, the weight depends on the accuracy of the exchanged Q-value (or the expert value, as described next) and the physical distance between agents $i$ and $j$.
In the latter scheme, an agent exchanges its Q-value with its neighboring agents only if the expert value for the Q-value is greater than a particular threshold. The expert value indicates the accuracy of the Q-value. For instance, where the Q-value indicates the availability of white spaces in channel $k$, a greater deviation in the measured signal strengths reduces the expert value. By reducing exchanges of Q-values with low accuracy, this approach reduces control overhead, and hence it reduces interference to PUs.
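The weighted blending of local and neighbor Q-values described in this subsection can be sketched as follows (illustrative Python; the weight handling is an assumption consistent with the weighted updates above):

```python
def cooperative_q(own_q, neighbor_qs, weights):
    # blend the local Q-value with neighbors' reported Q-values;
    # weights[j] is the cooperation weight for neighbor j, and the
    # remaining weight stays on the local estimate
    w_total = sum(weights)
    return (1 - w_total) * own_q + sum(w * q for w, q in zip(weights, neighbor_qs))
```

Setting all weights to zero recovers purely local learning, while larger weights shift the estimate toward the neighbors' knowledge and thus reduce the exploration each agent must perform on its own.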
Application of cooperative learning in the CR context has been very limited. More description on cooperative learning is found in Section 4.8. Further research could be pursued to investigate how to improve network performance using this approach in existing and new schemes.
4. Reinforcement Learning in the Context of Cognitive Radio Networks: Models and Algorithms
Direct application of the traditional RL approach (see Section 2.1) has been shown to provide performance enhancement in CR networks. Reddy presents a preliminary investigation into the application of RL to detecting PU signals in channel sensing (A2). Table 1 presents a summary of the schemes that apply the traditional RL approach. For each scheme, we present the purpose(s) of the CR scheme, followed by its associated RL model.
Most importantly, this section presents a number of new additions to the RL algorithms, which have been applied to various schemes in CR networks. A summary of the new algorithms, their purposes, and references is shown in Table 2. Each new algorithm has been designed to suit and to achieve the objectives of the respective schemes. For instance, the collaborative model (see Table 2) aims to achieve an optimal global reward in the presence of multiple agents, while the traditional RL approach achieves an optimal local reward in the presence of a single agent only. The following subsections (i.e., Sections 4.1-4.9) provide further details on each new algorithm, including the purpose(s) of the CR scheme(s), followed by its associated RL model (i.e., state, action, and reward representations) which characterizes the purposes, and finally the enhanced algorithm which aims to achieve the purpose. Hence, these subsections serve as a foundation for further research in this area, particularly, the application of existing RL models and algorithms found in current schemes, either to apply them in new schemes or to extend the RL models in existing schemes to further enhance network performance.
4.1. Model 1: Model with $\gamma = 0$ in the Q-Function
This is a myopic RL-based approach (see Section 3.4) that uses $\gamma = 0$ so that there is no dependency on future rewards, and it has been applied in [10, 17, 18]. Li et al. propose a joint DCS (A1) and channel sensing (A2) scheme, which has been shown to increase throughput, as well as to decrease the number of sensing channels (see performance metric (P4) in Section 5) and the packet retransmission rate. The purposes of this scheme are to select operating channels with a successful transmission rate greater than a certain threshold into a sensing channel set and subsequently to select a single operating channel for data transmission.
Table 3 shows the RL model for the scheme. The action is to select whether to remain at the current operating channel or to switch to another operating channel with a higher successful transmission rate. A preferred channel set is composed of actions with Q-values greater than a fixed threshold. Since the state and action are similar in this model, the state representation is not shown in Table 3, and the Q-function is written as a function of the action only. Note that the action is unchanged if there is no channel switch. The reward represents different kinds of events, specifically, a positive value in case of successful transmission and a negative value in case of unsuccessful transmission or when the channel is sensed busy. The RL model is embedded in a centralized entity such as a base station.
Algorithm 1 presents the RL algorithm for the scheme. The action is chosen from a preferred channel set, and the update of the Q-value is self-explanatory. A similar approach has been applied in DCS (A1) [30, 31].
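As an illustrative sketch (not the authors' exact implementation), the myopic update with γ = 0 reduces Q-learning to an exponentially weighted average of per-channel rewards; the learning rate, threshold, and reward values below are assumed for illustration:

```python
class MyopicChannelSelector:
    """Sketch of Model 1: Q-learning with discount factor gamma = 0,
    so each channel's Q-value tracks only immediate rewards.
    alpha and threshold are assumed values."""

    def __init__(self, num_channels, alpha=0.1, threshold=0.0):
        self.q = [0.0] * num_channels   # one Q-value per channel (state == action)
        self.alpha = alpha              # learning rate
        self.threshold = threshold      # admission threshold for the preferred set

    def preferred_set(self):
        # Channels whose Q-value exceeds a fixed threshold form the preferred set.
        return [a for a, q in enumerate(self.q) if q > self.threshold]

    def select(self):
        # Choose the best channel from the preferred set (fall back to all channels).
        candidates = self.preferred_set() or range(len(self.q))
        return max(candidates, key=lambda a: self.q[a])

    def update(self, a, reward):
        # With gamma = 0 there is no max-over-next-actions term:
        # Q(a) <- (1 - alpha) * Q(a) + alpha * r
        self.q[a] = (1 - self.alpha) * self.q[a] + self.alpha * reward
```

For example, after `update(2, 1.0)` with a positive reward for a successful transmission, channel 2 enters the preferred set and is selected.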
Li et al. propose a MAC protocol, which includes both DCS (A1) and a retransmission policy (A6), to minimize channel contention. The DCS scheme enables the SU agents to minimize the possibility of operating in the same channel. This scheme uses the RL algorithm in Algorithm 1, and the reward representation is extended to cover more than a single performance enhancement. Specifically, the reward represents both the successful transmission rate and the transmission delay: a higher reward indicates a higher successful transmission rate and a lower transmission delay, and vice versa. To accommodate both metrics, the reward is represented as the sum of two components, and the Q-function is updated with this combined reward. The retransmission policy determines the probability that a SU agent transmits at time t. The delay component of the reward is positive, zero, or negative if the transmission delay at time t is smaller than, equal to, or greater than the average transmission delay, respectively. The transmission component represents different kinds of events; specifically, it is positive, zero, or negative in case of successful transmission, idle transmission, or unsuccessful transmission, respectively; note that idle indicates that the channel is sensed busy, and so there is no transmission.
Li et al. propose a MAC protocol (A6) to reduce the probability of packet collision among PUs and SUs, and it has been shown to increase throughput and to decrease the packet loss rate. Since both the successful transmission rate and the presence of idle channels are important factors, the scheme keeps track of separate Q-functions for channel sensing and transmission using the RL algorithm in Algorithm 1. Hence, similar to Algorithm 2 in Section 4.2, there is a set of two Q-functions. The action is to select whether to remain at the current operating channel or to switch to another operating channel. The sensing reward is positive or negative if the channel is sensed idle or busy, respectively, and the transmission reward is positive or negative if the transmission is successful or unsuccessful, respectively. Action selection is based on the maximum average Q-value over the two Q-functions.
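A minimal sketch of this two-Q-function bookkeeping, assuming unit reward magnitudes and an illustrative learning rate (neither is specified above):

```python
class DualMetricChannelSelector:
    """Sketch of a scheme that maintains two Q-functions per channel,
    one for sensing outcomes and one for transmission outcomes, and
    selects the channel with the maximum average Q-value.
    Reward magnitudes and alpha are assumed values."""

    def __init__(self, num_channels, alpha=0.1):
        self.q_sense = [0.0] * num_channels
        self.q_tx = [0.0] * num_channels
        self.alpha = alpha

    def update_sense(self, ch, idle):
        r = 1.0 if idle else -1.0       # assumed sensing reward values
        self.q_sense[ch] = (1 - self.alpha) * self.q_sense[ch] + self.alpha * r

    def update_tx(self, ch, success):
        r = 1.0 if success else -1.0    # assumed transmission reward values
        self.q_tx[ch] = (1 - self.alpha) * self.q_tx[ch] + self.alpha * r

    def select(self):
        # Stay or switch: pick the channel with the highest average Q-value.
        return max(range(len(self.q_tx)),
                   key=lambda ch: (self.q_sense[ch] + self.q_tx[ch]) / 2)
```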
4.2. Model 2: Model with a Set of Q-Functions
A set of distinct Q-functions can be applied to keep track of the Q-values of different actions, and this has been applied in [11, 21]. Di Felice et al. propose a joint DCS (A1) and channel sensing (A2) scheme, and it has been shown to increase goodput and the packet delivery rate, as well as to decrease end-to-end delay and the interference level to PUs. The purposes of this scheme are threefold: (i) firstly, it selects an operating channel that has the lowest channel utilization level by PUs; (ii) secondly, it achieves a balanced trade-off between the time durations for data transmission and channel sensing; (iii) thirdly, it reduces the exploration probability using a knowledge sharing mechanism.
Table 4 shows the RL model for the scheme. The state represents a channel for data transmission. The actions are to sense the channel, to transmit data, or to switch the operating channel. The reward represents the difference between two types of delays, namely, the maximum allowable single-hop transmission delay and a successful single-hop transmission delay. A single-hop transmission delay covers four kinds of delays: backoff, packet transmission, packet retransmission, and propagation delays. A higher reward level indicates a shorter delay incurred by a successful single-hop transmission. The RL model is embedded in a centralized entity such as a base station.
Algorithm 2 presents the RL algorithm for the scheme. It maintains a learning rate, an eligible trace, the amount of time during which the SU agent is involved in successful transmissions or is idle (i.e., has no packets to transmit), and two temporal differences, one each for the sensing and transmission Q-functions. A single type of Q-function is chosen to update the Q-value based on the current action being taken. The temporal difference indicates the difference between the actual outcome and the estimated Q-value.
In step (b), the eligible trace represents the temporal validity of a state. Specifically, the eligible trace represents the existence of PU activities in a channel, and so it is only updated when the channel sensing operation is taken. A higher eligible trace indicates a greater presence of PU activities, and vice versa. The eligible trace enters the updates of the sensing and transmission Q-values with opposite effects in Algorithm 2; therefore, a higher eligible trace results in a higher sensing Q-value and a lower transmission Q-value, which indicates more channel sensing tasks and less data transmission in channels with a greater presence of PU activities. The channel switching action moves the agent from the current state to the next state, and the ε-greedy approach is applied to choose the next channel. The eligible trace, which represents the temporal validity or freshness of the sensing outcome, is only updated when the channel sensing operation is taken, as shown in Algorithm 2; it is discounted whenever the channel is not sensed. Specifically, the eligible trace of a state is set to the maximum value of 1 whenever the channel sensing action is taken in that state; otherwise, it is decreased by a discount factor for the eligible trace (see (15)).
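The reset-and-decay rule for the eligible trace can be sketched as follows; the decay factor is an assumed value standing in for the scheme's discount factor:

```python
class EligibilityTraces:
    """Sketch of the eligible-trace bookkeeping described for Model 2:
    a channel's trace is reset to 1 when that channel is sensed and is
    decayed by a discount factor otherwise. The decay value is assumed."""

    def __init__(self, num_channels, decay=0.9):
        self.e = [0.0] * num_channels
        self.decay = decay

    def on_sense(self, sensed_channel):
        # Set the sensed channel's trace to the maximum value of 1;
        # decay the freshness of every other channel's sensing outcome.
        for ch in range(len(self.e)):
            if ch == sensed_channel:
                self.e[ch] = 1.0
            else:
                self.e[ch] *= self.decay
```

After sensing channel 0 and then channel 1, channel 0's trace has decayed by one factor while channel 1's is fresh.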
In step (c), the agent keeps track of the channel that provides the best-known lowest estimated average transmission delay. In other words, the channel must provide the maximum amount of reward that can be achieved after accounting for the cost of a channel switch. Hence, the agent can keep track of a channel that provides a better state value than the average state value obtained by switching its current operating channel to that channel. Note that the state value is exchanged among the SU agents to reduce exploration through cooperative learning (see Section 3.7.3).
In step (d), the policy is applied at the next time instant. The policy provides probability distributions over the three possible types of actions using a modified Boltzmann distribution (see Section 3.6). Next, the policy is applied to select the next action in step (a).
4.3. Model 3: Dual Q-Function Model
The dual Q-function model has been applied to expedite the learning process. The traditional Q-function (see (1)) updates a single Q-value at a time, whereas the dual Q-function updates two Q-values simultaneously. For instance, the traditional Q-function updates the Q-value for the next state only (e.g., the SU destination node), whereas the dual Q-function updates the Q-values for both the next and previous states (e.g., the SU destination and source nodes, respectively). The dual Q-function model updates a SU agent’s Q-values in both directions (i.e., towards the source and destination nodes) and speeds up the learning process in order to make more accurate decisions on action selection, albeit at the expense of higher network overhead incurred by more Q-value exchanges among the SU neighbor nodes.
Xia et al. propose a routing (A7) scheme, and it has been shown to reduce SU end-to-end delay. Generally speaking, the availability of channels in CR networks is dynamic, and it depends on the channel utilization level by PUs. The purpose of this scheme is to enable a SU node to select a next-hop SU node with a higher number of available channels. A higher number of available channels reduces the time incurred in seeking an available common channel for data transmission between a SU node pair, and hence it reduces the MAC layer delay.
Table 5 shows the RL model for the scheme. The state represents a SU destination node. The action represents the selection of a next-hop SU neighbor node. The reward represents the number of available common channels between the SU node and its next-hop SU neighbor node. The RL model is embedded in each SU agent.
This scheme applies the traditional Q-function (see (1)), rewritten so that the Q-value of a next-hop SU neighbor node incorporates the estimate received from that node; each SU neighbor node must estimate its own Q-value towards the destination and send this information to the upstream SU node.
The dual Q-function model in this scheme is applied to update the Q-values for the SU source and destination nodes. While the traditional Q-function enables a SU intermediate node to update the Q-value for the SU destination node only (or next state), which is called forward exploration, the dual Q-function model enables an intermediate SU node to achieve backward exploration as well by updating the Q-value for the SU source node (or previous state). Forward exploration is achieved by updating a SU node's Q-value for the SU destination node whenever it receives an estimate from its next-hop SU node, while backward exploration is achieved by updating its Q-value for the SU source node whenever it receives a data packet from an upstream node. Note that, in the backward exploration case, the upstream node's packets are piggybacked with its Q-value so that the receiving node is able to update the Q-value for the respective SU source node. Although the dual Q-function approach increases the network overhead, it expedites the learning process since SU nodes along a route update the Q-values of the route in both directions.
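The forward/backward structure can be sketched as follows. This is an illustrative Q-routing-style table, not the scheme's exact update; node names, the learning rate, and the additive reward form are assumptions:

```python
class DualQRouter:
    """Sketch of the dual Q-function idea in Model 3: a node updates its
    Q-value toward the destination (forward exploration) when a neighbor
    returns an estimate, and toward the source (backward exploration)
    when a data packet arrives piggybacked with the sender's Q-value.
    The reward is the number of available common channels to the neighbor."""

    def __init__(self, alpha=0.1):
        self.q = {}          # (end_node, neighbor) -> Q-value
        self.alpha = alpha   # assumed learning rate

    def _update(self, end_node, neighbor, estimate, reward):
        key = (end_node, neighbor)
        old = self.q.get(key, 0.0)
        # Q <- (1 - alpha) * Q + alpha * (reward + neighbor's estimate)
        self.q[key] = (1 - self.alpha) * old + self.alpha * (reward + estimate)

    def on_estimate(self, dest, next_hop, estimate, common_channels):
        # Forward exploration: next_hop reports its estimate toward dest.
        self._update(dest, next_hop, estimate, common_channels)

    def on_data_packet(self, src, prev_hop, piggybacked_q, common_channels):
        # Backward exploration: a data packet from prev_hop carries its
        # Q-value toward src, updating the reverse direction for free.
        self._update(src, prev_hop, piggybacked_q, common_channels)
```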
4.4. Model 4: Partial Observable Model
The partial observable model has been applied in dynamic and uncertain operating environments. The uniqueness of the partial observable model is that the SU agents are uncertain about their respective states, and so each of them computes a belief, which is the probability that the environment is operating in a particular state.
Bkassiny et al. propose a joint DCS (A1) and channel sensing (A2) scheme, and it has been shown to improve the overall spectrum utilization. The purpose of this scheme is to enable the SU agents to select their respective operating channels for sensing and data transmission while minimizing collisions among the SUs and PUs.
Table 6 shows the RL model for the scheme. The state represents the availability of a set of channels for data transmission. The action represents a single channel, out of the available channels, for data transmission. The reward represents fixed positive (negative) values to be rewarded (punished) for successful (unsuccessful) transmissions. The RL model is embedded in each SU agent so that it can make decisions in a distributed manner.
Algorithm 3 presents the RL algorithm for the scheme. The action is chosen from a preferred channel set. The chosen action has the maximum belief-state Q-value, which is calculated using the belief vector as a weighting factor. The belief vector gives the probability of each possible set of states being idle at time t. Upon receiving the reward, the SU agent updates the entire set of belief vectors using Bayes' formula. Next, the SU agent updates the Q-value.
It shall be noted that Bkassiny et al. apply the belief vector as a weighting vector in the computation of the Q-value, while most of the other approaches use the belief vector as the actual state. This approach has been shown to achieve a near-optimal solution with very low complexity.
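A minimal sketch of belief-weighted action selection follows. The per-channel Bayes update below is a simplified stand-in (with an assumed sensing accuracy), not the scheme's exact formula over joint channel states:

```python
class BeliefWeightedQ:
    """Sketch of Model 4: the agent weights each channel's Q-value by its
    belief (probability) that the channel is idle, then picks the channel
    with the maximum weighted value. p_correct and alpha are assumed."""

    def __init__(self, num_channels, alpha=0.1, p_correct=0.9):
        self.belief = [0.5] * num_channels  # P(channel is idle), uninformed prior
        self.q = [0.0] * num_channels
        self.alpha = alpha
        self.p_correct = p_correct          # assumed sensing accuracy

    def select(self):
        # Maximize the belief-weighted Q-value.
        return max(range(len(self.q)), key=lambda c: self.belief[c] * self.q[c])

    def observe(self, ch, sensed_idle):
        # Bayes' rule on a single channel's idle probability.
        b = self.belief[ch]
        like_idle = self.p_correct if sensed_idle else 1 - self.p_correct
        like_busy = 1 - self.p_correct if sensed_idle else self.p_correct
        self.belief[ch] = b * like_idle / (b * like_idle + (1 - b) * like_busy)

    def update(self, ch, reward):
        self.q[ch] = (1 - self.alpha) * self.q[ch] + self.alpha * reward
```

With equal Q-values, the channel believed more likely to be idle wins the selection, which is the intended effect of the weighting.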
4.5. Model 5: Actor-Critic Model
Traditionally, the delayed reward has been applied directly to update the Q-value. The actor-critic model adjusts the delayed reward value using reward corrections, and this approach has been shown to expedite the learning process. In this model, an actor selects actions using a suitability value, while a critic keeps track of the temporal difference, which takes into account reward corrections in delayed rewards.
Vucevic et al. propose a collaborative channel sensing (A2) scheme, and it has been shown to minimize the error detection probability in the presence of inaccurate sensing outcomes. The purpose of this scheme is to select neighboring SU agents that provide accurate channel sensing outcomes, for security enhancement purposes (A3). Table 7 shows the RL model for the scheme. The state is not represented. An action represents a neighboring SU chosen by the SU agent for channel sensing purposes. The reward represents fixed positive (negative) values to be rewarded (punished) for correct (incorrect) sensing outcomes compared to the final decision, which is the fusion of the sensing outcomes. The RL model is embedded in each SU agent.
The critic keeps track of a suitability value for each action, which is updated using the temporal difference scaled by a constant. The temporal difference depends on the difference between the delayed reward and the long-term delayed reward, the number of incorrect sensing outcomes, and the suitability value. Next, the actor selects actions using the suitability values given by the critic; the probability of selecting an action is based on its suitability value.
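The actor-critic split can be sketched as follows. This is a simplified form: the temporal difference here is just reward minus a running average (the scheme's full correction also involves the count of incorrect sensing outcomes), and all constants are assumed values:

```python
import math
import random

class ActorCriticSensorSelector:
    """Sketch of Model 5: the critic updates per-neighbor suitability
    values with a (simplified) temporal difference, and the actor selects
    a neighbor with probability increasing in its suitability value
    (softmax keeps the probabilities positive and normalized)."""

    def __init__(self, num_neighbors, beta=0.2):
        self.suitability = [0.0] * num_neighbors
        self.avg_reward = 0.0
        self.beta = beta                  # critic step size (assumed)

    def critic_update(self, neighbor, reward):
        # Temporal difference: deviation of the delayed reward from the
        # long-term (running average) reward.
        td = reward - self.avg_reward
        self.suitability[neighbor] += self.beta * td
        self.avg_reward += 0.1 * td      # slow running-average update (assumed)

    def actor_select(self):
        # Probability of choosing a neighbor grows with its suitability.
        weights = [math.exp(s) for s in self.suitability]
        total = sum(weights)
        return random.choices(range(len(weights)),
                              weights=[w / total for w in weights])[0]
```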
4.6. Model 6: Auction Model
The auction model has been applied in centralized CR networks. In the auction model, a centralized entity, such as a base station, conducts auctions and allows SU hosts to place bids so that the winning SU hosts receive rewards. The centralized entity may perform simple tasks, such as allocating white spaces to SU hosts with winning bids, or it may learn using RL to maximize its utility. The RL model may be embedded in each SU host in a centralized network [16, 36–38], or in the centralized entity only.
Chen and Qiu propose a channel auction scheme (A5), and it has been shown to allocate white spaces among SU hosts (or agents) efficiently and fairly. The purpose of this scheme is to enable the SU agents to select the amount to bid, during an auction conducted by a centralized entity, for white spaces. The SU agents place the right amount of bids in order to secure white spaces for data transmission while saving their credits. The RL model is embedded in each SU host.
Table 8 shows the RL model for the scheme. The state indicates a SU agent’s information, specifically, the amount of data for transmission in its buffer and the amount of credits (or “wealth”) it owns. The action is the amount of a bid for white spaces. The reward indicates the amount of data sent. This scheme applies the traditional Q-learning approach (see (1)) to update the Q-values.
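A minimal tabular sketch of this bidder follows; the discrete bid levels, learning rate, and discount factor are assumed values, and the state is the (buffered data, credits) pair described above:

```python
class AuctionBidder:
    """Sketch of Model 6: a SU agent whose state is its (buffered data,
    credits) pair, whose action is a bid amount, and whose reward is the
    amount of data sent, learned with a standard tabular Q-update."""

    def __init__(self, bids=(0, 1, 2, 4), alpha=0.1, gamma=0.9):
        self.bids = bids                 # assumed discrete bid levels
        self.q = {}                      # (state, bid) -> Q-value
        self.alpha, self.gamma = alpha, gamma

    def best_bid(self, state):
        # Greedy action: the bid with the highest learned Q-value.
        return max(self.bids, key=lambda b: self.q.get((state, b), 0.0))

    def update(self, state, bid, data_sent, next_state):
        # Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma * max_a' Q(s', a'))
        best_next = max(self.q.get((next_state, b), 0.0) for b in self.bids)
        old = self.q.get((state, bid), 0.0)
        self.q[(state, bid)] = (1 - self.alpha) * old + self.alpha * (
            data_sent + self.gamma * best_next)
```

After a round in which bidding 2 credits wins white space and sends data, the agent's Q-value for that (state, bid) pair rises, steering future bids toward amounts that trade credits for throughput effectively.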