Abstract

Q-value initialization significantly influences the efficiency of Q-learning. However, there are as yet no precise rules for correctly choosing the initial Q-values, which are usually set to a default value. This paper proposes a novel Q-value initialization framework for cellular network applications, factorization Q-learning initialization (FQI). The proposed method works as an add-on to Q-learning that automatically and efficiently initializes the non-updated Q-values by utilizing a correlation model of the visited experiences built on factorization machines. In an open-source VoLTE network, FQI was introduced into Q-learning and four improved variants (Dyna Q-learning, Q(λ)-learning, double Q-learning, and speedy Q-learning) for performance comparison. The experimental results demonstrate that the factorized algorithms based on FQI substantially outperform the original algorithms, often learning policies that attain 1.5–8 times higher final performance as measured by the episode reward and the convergence episodes.

1. Introduction

The complexity and large scale of radio access technologies in future cellular networks impose significant operational challenges [1]. In a cellular network, there is a multitude of tunable parameters in every base station (BS) to maintain and optimize numerous performance indicators. The parameters have a significant impact on the performance of BSs and should be cognitively adapted to the dynamically changing network environments [2]. However, this is by no means an easy task. First, with more advanced features being deployed in the networks, the number of such parameters increases significantly, and the dependencies among these parameters become more intricate [3]. Moreover, the correlation between different parameters and performance indicators is beyond the capability of available analytical models, as cellular networks evolve to be extremely dynamic and complex due to their scale, density, and heterogeneity.

Recently, leveraging reinforcement learning (RL) to obtain the optimal control policy has emerged as a promising solution [4–6], enabling an autonomous agent to learn from its actions and their consequences in the interactive environment. Q-learning [7] is a well-known model-free RL algorithm that finds an estimate of the optimal action-value function. For finite state-action problems, it has been shown that Q-learning converges to the optimal action-value function [8]. However, it suffers from slow convergence, mainly because of the combination of sample-based stochastic approximation and the fact that the Bellman operator propagates information throughout the whole space. Many methods have been proposed to improve and speed up Q-learning, such as reducing the state space [9–11], modifying the Q-value update [12–20], or specifying the initial Q-values [21–23].

In this paper, we present a novel Q-value initialization framework, factorization Q-learning initialization (FQI), to enhance the convergence of Q-learning for parameter optimization in cellular networks. We test the proposed framework on Q-learning and four improved variants (Dyna Q-learning [12], Q(λ)-learning [13], double Q-learning [14], and speedy Q-learning [15]) in an open-source cellular environment: voice over LTE (VoLTE) power control. The experimental results demonstrate that the factorized Q-learning and its variants based on FQI outperform the original algorithms by 1.5–8 times on the valid actions and convergence episodes.

The remainder of this article is organized as follows: Section 2 reviews previous works and relates them to the current research. Section 3 introduces the preliminaries of Q-learning, the cellular network model, and the motivation of this work. Section 4 presents the FQI framework, its main design motivation, and the algorithm used to build it. Section 5 conducts extensive experiments to demonstrate the effectiveness of the proposed framework in an open-source simulated VoLTE network. Finally, the conclusion of the entire work is documented in Section 6, and further research possibilities in the area are outlined in Section 7.

2. Related Work

Q-learning has been widely applied in cellular networks as a tool for parameter optimization problems [24], such as backhaul optimization [25], handover optimization [26], resource optimization [27], and power control [28]. Many approaches have been studied to improve and speed up Q-learning, which involve three aspects: reducing the state space, improving the Q-value update, and specifying the initial Q-values.

The first type of method mainly reduces the searchable size of the state space. A hierarchical approach was proposed to decompose the RL problem into subproblems, each of which is easier to solve than the entire problem [9]. The Kanerva coding approach was proposed to reduce the number of states of Q-learning for TCP congestion control [10]. The work in [11] relaxed the constraint of the action space for 5G caching to reduce the learning complexity of Q-learning from an exponential to a linear space size.

The second type of method mainly modifies the Q-value update. Dyna Q-learning [12] assigns each state-action pair a bonus inversely proportional to the number of times the pair has been visited. Q(λ)-learning [13] extends Q-learning with eligibility traces, combining Q-learning with the TD(λ) return estimation process. Double Q-learning [14] overcomes the overestimation problem by applying double estimators to Q-learning. Speedy Q-learning [15] addresses the slow convergence of the standard form of the Q-learning algorithm. A faster Q-table update method was proposed that not only updates the Q-value of a single state-action pair but also adds cost-function estimates for all other possible actions of the current state when that state is visited for the first time [16]. A matrix-gain approach was designed to accelerate the convergence of Q-learning by optimizing its asymptotic variance [17]. Linear function approximation was used to update Q-values for 5G caching, offering faster convergence and reducing the complexity of Q-learning [18]. An acceleration scheme for Q-learning was proposed by incorporating the historical iterates of the Q-function [19]. A new Q-value updating mechanism was introduced in which the values of similar state-action pairs are updated synchronously [20].

The last type of method appropriately specifies the initial Q-values. It has been shown that the initial Q-values have a significant influence on the efficiency of RL for goal-directed tasks [21]. A method was proposed in which the Q-table is initialized to some maximum value and carefully lowered towards the empirical estimates [22]. A neural network-based Q-learning algorithm was proposed that appropriately specifies the initial Q-values [23]. Clearly, a good Q-value initialization method can further strengthen the two types of methods above to improve Q-learning performance. Nevertheless, there are as yet no precise rules for correctly choosing the initial Q-values, which are usually initialized to 0 or a random value.

To enhance the convergence of Q-learning by initializing Q-values automatically and efficiently for parameter optimization in cellular networks, this paper makes the following specific contributions:
(1) Formulate Q-value initialization for the parameter optimization problem as a collaborative filtering problem that builds a correlation model between the visited experiences.
(2) Propose a novel Q-value initialization framework based on factorization machines, factorization Q-learning initialization (FQI), which continuously predicts the Q-values that still hold the default value based on the correlation model built on the visited experiences.
(3) Conduct a set of experiments in an open-source simulated VoLTE network. The results show that the proposed framework significantly improves the performance of Q-learning and four improved variants as measured by the valid actions and convergence episodes.

3. Preliminaries and Motivation

RL is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. The agent is not told which actions to take but instead must discover which actions yield the most reward by trying them. In this section, we briefly describe the essential ingredients of RL.

3.1. Markov Decision Processes

In order to formalize the RL problem, a Markov decision process (MDP) is used to describe the environment in most RL settings. An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$:
(i) $\mathcal{S}$ is a finite set of possible states.
(ii) $\mathcal{A}$ is a finite set of possible actions.
(iii) $\mathcal{R}$ is a distribution of the reward given a (state, action) pair.
(iv) $\mathcal{P}$ is the state transition probability matrix.
(v) $\gamma$ is a discount factor.

All states in an MDP have the Markov property, meaning that the current state captures all relevant information from the history: $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \dots, s_t]$. A policy $\pi$ is the behaviour function of the agent, a mapping from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state. The objective of RL is to find the optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward:
$$\pi^* = \arg\max_{\pi} \mathbb{E}\Big[\sum_{t \geq 0} \gamma^t r_t \,\Big|\, \pi\Big]. \tag{1}$$

3.2. Q-Learning

In Q-learning, the Q-value function measures how good a particular state-action pair is. The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action $a$ in state $s$ and then following the policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \geq 0} \gamma^t r_t \,\Big|\, s_0 = s,\, a_0 = a,\, \pi\Big]. \tag{2}$$

The optimal Q-value function $Q^*$ is the maximum expected cumulative reward achievable from a given (state, action) pair:
$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a). \tag{3}$$

$Q^*$ satisfies the following Bellman equation:
$$Q^*(s, a) = \mathbb{E}_{s'}\Big[r + \gamma \max_{a'} Q^*(s', a') \,\Big|\, s, a\Big]. \tag{4}$$

The optimal policy $\pi^*$ corresponds to taking the best action in any state as specified by $Q^*$. Q-learning uses the following value iteration algorithm to solve for the optimal policy:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\Big]. \tag{5}$$

$Q(s_t, a_t)$ is the cumulative reward expected to be obtained after taking $a_t$ in state $s_t$, which is updated according to the learning rate $\alpha$, the discount factor $\gamma$, and the expected maximum value of the next state, $\max_{a} Q(s_{t+1}, a)$. Q-learning converges to the optimal state-action value function if all state-action pairs are explored infinitely often [8]. During the learning phase, the agent needs to decide which action to choose, either to find out more about the environment or to take one step closer to the goal. Techniques for selecting actions in RL are called exploration strategies. The most widely used exploration strategies are ε-greedy and Boltzmann.
(i) ε-greedy. The agent explores randomly with probability ε and takes the optimal action most of the time, with probability 1 − ε.
(ii) Boltzmann. Boltzmann is an exponential weighting scheme broadly used for balancing exploration and exploitation. The probability of choosing an action is an exponential function of its empirical value, denoted as follows:
$$P(a \mid s) = \frac{\exp\big(Q(s, a)/\tau\big)}{\sum_{a'} \exp\big(Q(s, a')/\tau\big)}. \tag{6}$$

$P(a \mid s)$ denotes the probability that the agent selects action $a$ in state $s$, and $\tau$ is the temperature that controls the degree of exploration.
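
As an illustration of the update rule Eq. (5) and the two exploration strategies, the following NumPy sketch uses generic tabular code; the function names and hyperparameter defaults are illustrative and not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # One tabular Q-learning step:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, epsilon=0.1):
    # Explore uniformly with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, tau=1.0):
    # Sample an action with probability proportional to exp(Q(s,a)/tau), cf. Eq. (6).
    prefs = Q[s] / tau
    prefs = prefs - prefs.max()              # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))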

3.3. Motivation

Q-values are usually set to default values in Q-learning and represented by a Q-table, which has a row for each state and a column for each action. The update of the Q-table depends on the interaction with the environment, and the Q-values of the unvisited state-action pairs remain at the default value. In the early stages of learning, the visited state-action pairs are sparse, and most Q-values are still default values. Therefore, Q-learning often performs extremely poorly in the early stages of learning because, with little information about the environment, the agent is forced to act more or less randomly. It is precisely these challenges that constitute the prime motivation of this article. We are motivated to investigate the following issue: how can the visited experiences be utilized to automatically and efficiently initialize the Q-values that have never been updated, so as to bootstrap Q-learning exploration?

Assuming a cellular network with a number of UEs and BSs, we define states, actions, and rewards to construct the RL process that learns the best control strategy. In the following, we describe them one by one (a short construction sketch follows the list).
(i) States. Let $\mathcal{S} = \mathcal{S}_1 \times \dots \times \mathcal{S}_n$ be the set of states of the cellular network, with size $|\mathcal{S}| = \prod_{i=1}^{n} |\mathcal{S}_i|$, where $\mathcal{S}_i$ is the discretization value set of the $i$-th network performance indicator or attribute.
(ii) Actions. Let $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_m$ be the set of parameter combinations of the cellular network, with size $|\mathcal{A}| = \prod_{j=1}^{m} |\mathcal{A}_j|$, where $\mathcal{A}_j$ is the valid value set of the $j$-th network parameter.
(iii) Rewards. The reward signal $r$ is obtained from the cellular network after the agent applies a parameter setting $a$ when it is in state $s$ and moves to the next state $s'$.
(iv) Q-value function. The state-action value function is denoted $Q(s, a)$. It is the expected cumulative reward when starting in state $s$ and selecting action $a$.
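
The sketch below illustrates this construction with Python's itertools.product; the indicator names, discretization grids, and parameter value sets are purely hypothetical and much smaller than those of the VoLTE environment in Section 5.

from itertools import product

# Hypothetical discretization grids for network indicators (states)
# and valid value sets for tunable parameters (actions).
indicator_grids = {
    "ue_position_bin": range(6),            # illustrative
    "serving_tx_power_dBm": [40, 42, 44],   # illustrative
}
parameter_grids = {
    "serving_power_step_dB": [-1, 0, 1],
    "interfering_power_step_dB": [-1, 0, 1],
}

# The state and action sets are Cartesian products of the per-feature sets.
states = list(product(*indicator_grids.values()))
actions = list(product(*parameter_grids.values()))

# |S| and |A| are the products of the per-feature cardinalities.
print(len(states), len(actions))   # 18, 9 for these illustrative grids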

A motivating example is shown in Figure 1 to better convey the idea of FQI. The example Q-table has five states and four actions. Initially, the default value of the Q-table is set to 0, which means that the agent has no information about the environment. Some state-action pairs are explored as the agent interacts with the environment, and the corresponding Q-values are updated. The explored experiences are illustrated by the blue entries in the visited Q-table. The problem we study is thus transformed into precisely predicting the Q-values that still hold default values (white entries) based on the visited experiences. Once the unvisited entries are accurately predicted, the risk of randomly exploring these unknown low Q-values (red numbers) can be reduced according to the factorized Q-table.

4. Factorization Q-Learning Initialization

This section describes the motivation behind the proposed framework and presents it in detail. The key idea is to continuously capture the correlation between states and actions from the visited experiences in order to predict the Q-values that have never been updated, thereby mitigating the possibility of selecting poor parameter settings.

4.1. Proposed Framework

This section presents the novel Q-value initialization framework, factorization Q-learning initialization (FQI). The principal structure and the main building modules of FQI are shown in Figure 2; it is based on two main modules: (1) Q-table monitoring and (2) Q-table factorization. Q-table monitoring determines whether to factorize the Q-table, and Q-table factorization predicts the Q-values that have never been updated. The original Q-learning algorithm consists of the modules in the left box, and the direction of the solid black and red lines is the workflow. FQI (dotted lines) replaces the original action selection process (solid red line) with Q-table monitoring and Q-table factorization, depicted in green and yellow blocks, respectively, and described in the subsequent subsections.

4.1.1. Q-Table Monitoring

To determine when to factorize the Q-table, the percentage of visited states is calculated after each interaction with the environment:
$$\rho = \frac{\sum_{s \in \mathcal{S}} g(s)}{|\mathcal{S}|}, \qquad g(s) = \mathbb{1}\Big[\sum_{a \in \mathcal{A}} v(s, a) > 0\Big], \tag{7}$$
where $v(s, a)$ is the state-action indicator, which is equal to 1 if action $a$ at state $s$ has been visited and 0 otherwise, and $g(s)$ is the visited state indicator function, which is equal to 1 if at least one action has been visited at state $s$ and 0 otherwise.

Then, the increment of the percentage of visited states is also obtained after every interaction:
$$\Delta\rho = \rho - \rho_{\mathrm{last}}, \tag{8}$$
where $\rho_{\mathrm{last}}$ is the percentage of visited states at the last Q-table factorization, and its initial value equals 0. If $\Delta\rho$ is greater than or equal to the introduced factorization threshold $\delta$, the Q-table is replaced with the factorized Q-table produced by Q-table factorization, and the agent then selects actions based on the factorized Q-table. Otherwise, the agent selects actions according to the current Q-table.
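
A minimal NumPy sketch of the monitoring step under the notation above; the function names and the Boolean visited mask are our own illustrative choices, not the paper's implementation.

import numpy as np

def visited_state_fraction(visited_mask):
    # rho: fraction of states with at least one visited action, cf. Eq. (7);
    # visited_mask[s, a] is True if (s, a) has been updated, else False.
    return float(visited_mask.any(axis=1).mean())

def should_factorize(visited_mask, rho_last, delta):
    # Trigger Q-table factorization when rho - rho_last >= delta, cf. Eq. (8).
    rho = visited_state_fraction(visited_mask)
    return (rho - rho_last) >= delta, rho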

4.1.2. Q-Table Factorization

The core idea of Q-table factorization is to capture the potential correlation between states and actions from the visited experiences in order to predict the Q-values that are still default values. Inspired by [29], we build the Q-table's correlation model between states and actions via a factorization machine [30]. Factorization machines can model the interactions between different variables using factorized parameters even in problems with huge sparsity, combining the advantages of support vector machines with a factorization model. The predicted Q-values of the factorization machine model equation are defined as
$$\hat{Q}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j, \qquad \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f}\, v_{j,f}, \tag{9}$$
where $\mathbf{x} \in \mathbb{R}^{n}$ is an $n$-dimensional feature vector, which is composed of the one-hot encodings of state $s$ and action $a$. An example of a feature vector from the motivating example is shown in Figure 3. $k$ is a hyperparameter that decides the dimensionality of the factorization, and the model parameters are
$$w_0 \in \mathbb{R}, \qquad \mathbf{w} \in \mathbb{R}^{n}, \qquad \mathbf{V} \in \mathbb{R}^{n \times k}. \tag{10}$$

$w_0$ is the global bias, and $w_i$ models the strength of the $i$-th feature variable. $v_{i,f}$ and $v_{j,f}$ are the $f$-th values of the factor vectors of the $i$-th and $j$-th feature variables, respectively. $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ models the interaction between the $i$-th and $j$-th variables by factorizing it.

For the sake of simplicity, we denote the model parameters collectively as $\Theta = (w_0, \mathbf{w}, \mathbf{V})$. The parameters of the factorization machine model Eq. (9) are estimated by solving the following least-squares minimization problem over the visited experiences $\mathcal{D} = \{(\mathbf{x}, Q(s, a))\}$:
$$\min_{\Theta} \sum_{(\mathbf{x}, Q) \in \mathcal{D}} \big(\hat{Q}(\mathbf{x}) - Q\big)^{2} + \lambda_{w} \lVert \mathbf{w} \rVert^{2} + \lambda_{v} \lVert \mathbf{V} \rVert^{2}, \tag{11}$$
where $\lambda_{w}$ is the regularization parameter for the linear terms and $\lambda_{v}$ is the regularization parameter for the factorized interaction terms; both are used to prevent overfitting. The gradient of the factorization machine model is
$$\frac{\partial \hat{Q}(\mathbf{x})}{\partial \theta} =
\begin{cases}
1, & \text{if } \theta = w_0, \\
x_i, & \text{if } \theta = w_i, \\
x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^{2}, & \text{if } \theta = v_{i,f}.
\end{cases} \tag{12}$$
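
The following NumPy sketch implements a second-order factorization machine with the linear-time prediction identity from [30] and plain stochastic gradient descent on the squared loss of Eqs. (11)–(12); the class and variable names are ours, and the hyperparameter defaults are illustrative rather than those of the paper.

import numpy as np

class FM:
    # Minimal second-order factorization machine (squared loss, SGD).
    def __init__(self, n_features, k=8, lr=0.01, reg=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        self.w0 = 0.0
        self.w = np.zeros(n_features)
        self.V = 0.01 * rng.standard_normal((n_features, k))
        self.lr, self.reg = lr, reg

    def predict(self, x):
        # O(kn) form of Eq. (9): pairwise term = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2].
        vx = self.V.T @ x                                   # shape (k,)
        pair = 0.5 * np.sum(vx ** 2 - (self.V ** 2).T @ (x ** 2))
        return self.w0 + self.w @ x + pair

    def sgd_step(self, x, target):
        # Gradient of the squared error for one (x, Q) pair, cf. Eqs. (11)-(12).
        err = self.predict(x) - target
        vx = self.V.T @ x
        self.w0 -= self.lr * err
        self.w -= self.lr * (err * x + self.reg * self.w)
        grad_V = np.outer(x, vx) - self.V * (x ** 2)[:, None]
        self.V -= self.lr * (err * grad_V + self.reg * self.V)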

Algorithm 1 sketches how FQI works. First, the value of the factorization threshold $\delta$ is set. After every interaction with the environment, the increment of the percentage of visited states is calculated (line 1). Q-table factorization is activated if $\Delta\rho$ is above $\delta$ (line 2). Then, the visited entries in the Q-table are extracted as experience triples (line 3). After the factorization machine model parameters have been estimated (lines 4 and 5), the Q-table is completed via Eq. (9) (line 6). The factorized Q-table is then used to bootstrap further exploration (line 7). In this way, the Q-learning agent is continually exposed to the factorized Q-table from the early stages of the learning process, thereby mitigating the possibility of selecting poor parameter settings in random exploration to discover the unvisited Q-values.

Input: Factorization threshold $\delta$, latent factor dimensionality $k$, max iterations $T$
1: Calculate the percentage of visited states $\rho$ through Eq. (7) and the increment of the percentage of visited states $\Delta\rho$ by Eq. (8).
2: If $\Delta\rho \geq \delta$ then
3:  Extract the visited entries from the Q-table as experience triples $(s, a, Q(s, a))$.
4:  Concatenate the one-hot encodings of state $s$ and action $a$ in each experience triple to form the feature vector $\mathbf{x}$, and take $Q(s, a)$ as the corresponding target.
5:  Estimate the model parameters for at most $T$ iterations by Eq. (12).
6:  Complete the Q-table via Eq. (9).
7:  Replace the original Q-table with the factorized Q-table.
8: End if
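
A compact Python rendering of Algorithm 1, reusing the FM class and the monitoring helpers sketched above; it is a sketch under our assumed data structures (a NumPy Q-table plus a Boolean visited mask), not the released implementation at [32].

import numpy as np
# FM, visited_state_fraction, and should_factorize are the sketches from Section 4.1 above.

def factorize_q_table(Q, visited_mask, k=8, max_iter=50):
    # Fit an FM on the visited (s, a, Q) triples and fill in the unvisited entries.
    n_states, n_actions = Q.shape
    n_features = n_states + n_actions            # one-hot(state) ++ one-hot(action)

    def feature(s, a):
        x = np.zeros(n_features)
        x[s] = 1.0
        x[n_states + a] = 1.0
        return x

    fm = FM(n_features, k=k)
    visited = np.argwhere(visited_mask)
    for _ in range(max_iter):                    # line 5: estimate the parameters
        for s, a in visited:
            fm.sgd_step(feature(s, a), Q[s, a])

    Q_fact = Q.copy()
    for s in range(n_states):                    # line 6: complete the Q-table
        for a in range(n_actions):
            if not visited_mask[s, a]:
                Q_fact[s, a] = fm.predict(feature(s, a))
    return Q_fact                                # line 7: replaces the original Q-table

# Usage inside the training loop (lines 1-2), assuming `visited_mask` is maintained alongside Q:
#   trigger, rho = should_factorize(visited_mask, rho_last, delta)
#   if trigger:
#       Q, rho_last = factorize_q_table(Q, visited_mask), rho
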
4.2. Complexity

The main computation of Algorithm 1 lies in the execution of Q-table factorization. The main computation for each Q-table factorization is to evaluate the loss function Eq. (11) and its gradients with respect to the variables, which can be computed in linear time $O(kn)$ [30], where $k$ is the number of latent factors and $n$ is the dimensionality of the feature vector. Therefore, the main additional computational complexity compared with standard Q-learning scales with $O(kn)$ per factorization and with the number of factorizations, which is bounded by $1/\delta$, where $\delta$ is the factorization threshold.
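
As a concrete illustration using the VoLTE environment of Section 5.1, the one-hot feature vector has dimensionality $n = |\mathcal{S}| + |\mathcal{A}| = 46656 + 16 = 46672$, so each FM prediction or gradient evaluation costs on the order of $kn \approx 46672\,k$ operations for $k$ latent factors.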

5. Experimental Results

In this section, we evaluate the performance of factorization Q-learning initialization (FQI) in an open-source simulated VoLTE network [31]. We deployed FQI on Q-learning (QL), double Q-learning (DQL), Q(λ)-learning (Q(λ)), Dyna Q-learning (DynaQ), and speedy Q-learning (SQL) to evaluate the performance improvements on the measures of the valid actions and convergence episodes. First, we describe the adopted setup in Section 5.1 before delving into the experimental results in Section 5.2.

5.1. Simulation Setting

We consider an orthogonal frequency division multiplexing (OFDM) multiaccess downlink cellular network of base stations (BSs), consisting of one serving BS and at least one interfering BS. The user equipment (UEs) are randomly scattered and move within the BS service area while engaged in VoLTE calls, as shown in Figure 4. Further system details can be found in [31].

There are two BSs and two UEs, where the UEs move under both log-normal shadow fading and small-scale fading. The target of QL applied in this environment is to jointly optimize the transmit powers of the two BSs so that the UEs meet the target SINR. There are 16 parameter settings (actions) to choose from, which concurrently increase or decrease the transmit powers of the serving and interfering BSs. The environment has 46656 states, which are discretized by the positions of the two UEs and the transmit powers of the serving and interfering BSs. ε-greedy and Boltzmann are used to select actions, respectively.

An episode has a duration of 20 timesteps, in which the agent selects an action to interact with the environment and receives a reward at every timestep. If the SINR target of the UEs is fulfilled, the agent receives a reward of 100, and the episode continues. Otherwise, the episode is terminated prematurely, and the reward is -20. An episode is counted as converged if the target objective is fulfilled within ten timesteps. The main hyperparameters for QL and FQI in the experiment are shown in Table 1. All algorithms are implemented and available at [32].
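
A sketch of the episode logic described above; the environment and agent interfaces (env.reset, env.step returning an SINR-met flag, agent.select_action, agent.update) are hypothetical stand-ins for the simulator in [31], and the reading of the convergence criterion is one plausible interpretation.

def run_episode(env, agent, max_steps=20, converge_within=10):
    # One training episode under the reward scheme described above (sketch).
    s = env.reset()
    total_reward, success_steps = 0.0, 0
    for t in range(max_steps):
        a = agent.select_action(s)             # epsilon-greedy or Boltzmann
        s_next, sinr_met = env.step(a)         # hypothetical simulator interface
        r = 100.0 if sinr_met else -20.0       # reward scheme from the text
        agent.update(s, a, r, s_next)
        total_reward += r
        if not sinr_met:
            break                              # SINR target missed: terminate early
        success_steps += 1
        s = s_next
    # One plausible reading of the convergence criterion: the SINR target is
    # fulfilled for the first `converge_within` timesteps of the episode.
    return total_reward, success_steps >= converge_within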

5.2. Results and Observations

We ran the original and the factorized (FQI-based) algorithms several times with random seeds. The aggregated results for the convergence episodes and the average episode reward across five different factorization thresholds are reported in Table 2. The average episode reward directly reflects the SINR quality of the UEs, while the convergence episodes reflect the continuous fulfillment of the SINR target. The larger these two indicators, the better the RL-based power parameter tuning allows the UEs to meet the SINR target. We observed significant improvements for the factorized algorithms over the original algorithms, with average performance gains of 1.5–8 times higher final results compared with the original QL, DynaQ, Q(λ), DQL, and SQL. Because they are exposed to the factorized Q-table from the early stages of the learning process, the factorized algorithms reach higher final results. Moreover, the factorized algorithms improve the performance under both ε-greedy and Boltzmann, and the algorithms with Boltzmann obtain better performance than those with ε-greedy. This is because Boltzmann selects an action with a probability based on the Q-values through Eq. (6), rather than blindly accepting any random action, as ε-greedy does, when it comes time for the agent to explore the environment.

More detailed results are provided in Figures 5 and 6 to distinguish the differences in performance at different factorization thresholds $\delta$. Figure 5 shows the cumulative convergence episodes of the original and factorized algorithms for different factorization thresholds $\delta$. In general, the factorized algorithms (solid markers) outperform the original algorithms (blank markers). For the factorized algorithms, adopting Boltzmann (solid circle markers) performs better than ε-greedy (solid square markers). However, when the parameter $\delta$ is 0.05, the performance curve of the factorized algorithms with ε-greedy partially coincides with the curve of the original algorithms with Boltzmann, indicating that the performance of the two is similar.

The results for the average episode reward at different $\delta$ are shown in Figure 6. The main point to note is that the factorized algorithms (solid markers) are significantly better than the original algorithms, and the original algorithms with ε-greedy (blank square markers) are the worst. This means that the factorized algorithms can adjust the transmit power of the BSs more efficiently than the original algorithms, allowing the UEs to meet the SINR target more often. At the beginning of training, the original and factorized algorithms perform similarly. However, the performance of the factorized algorithms soon becomes superior as training progresses. Furthermore, we can observe that the learning curves of the factorized algorithms are steeper than those of the original algorithms at the initial stage of learning and reach the plateau faster.

To further study the impact of the factorization threshold $\delta$, the average performance of the original algorithms and the factorized algorithms with different $\delta$ is shown in Figure 7. A key point to note is that the performance of the factorized algorithms improves as the parameter $\delta$ decreases. It should also be noted that a lower $\delta$ leads to faster learning, but even a high threshold captures most of the benefits of Q-table factorization. This is because the smaller the parameter $\delta$, the earlier and more frequently the factorized algorithms perform Q-table factorization to predict the non-updated Q-values, which bootstraps agent exploration more.

6. Conclusion

In this article, we sought to enhance the convergence of Q-learning for parameter optimization in cellular networks. A Q-value initialization framework based on factorization machines, factorization Q-learning initialization (FQI), was proposed to continuously predict the non-updated Q-values from the visited experiences and thereby bootstrap exploration. We described the details of FQI and showed its effectiveness on Q-learning and several of its improved variants (Dyna Q-learning, Q(λ)-learning, double Q-learning, and speedy Q-learning) with two widely used exploration strategies, ε-greedy and Boltzmann. The experimental results in an open-source simulated VoLTE network show that the factorized algorithms based on our proposed framework are substantially better than the original algorithms, exceeding their final performance by 1.5–8 times. In addition, earlier and more frequent Q-table factorization improves the performance of the algorithms by providing more guidance for exploration, which is most evident in the early stage of learning, when the agent has too little information to explore efficiently.

A major limitation of this work is that it cannot be directly used in environments with continuous states and actions. However, we note that the network parameters in cellular networks are always discrete. Moreover, a continuous state space can be discretized feature by feature, and continuous values can also be modeled by treating them as additional features.

7. Future Work

Interesting future work includes research to gain more insight into the merits of the FQI framework. For instance, in a multiagent Q-learning scenario, FQI could be used to model the state-action interaction between multiple agents to maintain a joint Q-table. Possibly, to learn more sophisticated feature interactions behind agents' behaviors, replacing the factorization machines in the proposed algorithm with DeepFM [33] could yield better results. More analysis of the performance of Q-learning and related algorithms, such as zap Q-learning and delayed Q-learning, is also desirable. Furthermore, it would be interesting to see how factorization zap Q-learning, factorization delayed Q-learning, and other extensions of Q-learning perform in practice when combined with FQI.

Data Availability

https://github.com/bszeng/Factorization_Q-learning_Initialization

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62171387) and the China Postdoctoral Science Foundation (2019M663475).