Abstract

To solve the problems of poor quality of service and low energy efficiency of nodes in underwater multinode communication networks, a distributed power allocation algorithm based on reinforcement learning is proposed. Each transmitter, endowed with reinforcement learning capability, autonomously selects its power level so as to obtain a higher quality of experience with lower power consumption. Firstly, we propose a distributed power optimization model based on the Markov decision process. Secondly, we further give a reward function suitable for multiobjective optimization. Finally, we present a distributed power allocation algorithm based on Q-learning and use it as an adaptive mechanism that enables each transmitter in the network to adjust its transmit power according to its own environment. The simulation results show that the proposed algorithm not only increases the total channel capacity of the system but also improves the energy efficiency of each transmitter.

1. Introduction

Marine information technology not only plays an important role in marine environment monitoring, exploration and resource development, marine disaster warning, and underwater target location and tracking but is also a hot direction for information science research [1, 2]. The primary problems to be solved in the development of marine information technology are the construction of underwater sensor networks and the allocation of resources for network communication; without them, marine information technology is not possible [3–5]. With the increasing exploitation of underwater resources, the variety and number of communication nodes deployed underwater are growing, and multiple types of underwater communication networks may even be deployed in the same sea area. For example, in Ref. [6], a two-dimensional underwater sensing network structure was developed in which the sensor nodes were anchored to the seafloor. This means that the sensors can only detect data within a range of the seafloor, whereas many other important 3D data, such as the flow rate and salinity of seawater, which are crucial for studying the characteristics of the marine environment, cannot be detected. Correspondingly, this paper employs an autonomous underwater vehicle (AUV) to monitor and collect such important 3D data and uses different types of sensors to detect data on the seafloor.

Unlike wireless electromagnetic wave communication networks, most acoustic modems in underwater acoustic communication networks (UACNs) are battery-powered, and in an underwater environment battery replacement and charging are extremely difficult [7]. Meanwhile, many types of nodes are deployed in UACNs, including master nodes, sub-nodes, AUVs, and so on. Normally, each type of node wants to transmit data with greater power to obtain a higher quality of service [8]. In this case, without proper interference control, the interference between nodes will increase and a large amount of transmit power will be wasted. It follows that, because of the complex and dynamic underwater acoustic communication environment, the resource allocation algorithm needs strongly adaptive characteristics. Orthogonal frequency-division multiplexing (OFDM) has clear advantages in combating this environment: its low per-subcarrier symbol rate effectively reduces multipath interference [9], and it is also highly resistant to inter-symbol interference [10]. Motivated by the previous analysis, based on the modeling of OFDM underwater heterogeneous communication networks, we consider how to find a balance between power consumption and interference level to achieve optimal system performance.

To summarize, we consider the issue of energy efficiency optimization in cooperative UACNs. Since the resource allocation process can be regarded as a Markov decision process (MDP), reinforcement learning (RL) is applied to solve the above problems [11]. Specifically, RL methods are used to find the equilibrium between power consumption and interference level, i.e., to select the appropriate transmit power for each node so as to obtain a high quality of service within the allowable interference range. To this end, this paper seeks the globally optimal strategy by constructing a global MDP. The main contributions of this work are summarized as follows:
(i) We propose a learning framework suitable for communication nodes. The framework casts the resource allocation problem as a Markov decision model, defining the state space and action set of the environment according to the actual problem to be solved.
(ii) We propose a systematic reward function design method based on the multiobjective optimization problem and the nature of RL, which is used to guide the training of each transmitter. The designed reward function takes into account the network environment and node energy, which are uncontrollable factors, and maximizes the quality of service (QoS) of the communication nodes with relatively small energy consumption. We further show that the proposed reward function achieves significant improvements in energy efficiency.
(iii) We propose a resource allocation strategy for underwater transmitters based on Q-learning, which is distributed and scalable. The simulation results show that, compared with the greedy algorithm, the Q-learning-based resource allocation strategy achieves a higher system capacity and a longer life cycle.

The rest of this paper is organized as follows. Section 2 reviews the work related to resource allocation in UACNs. Section 3 introduces the multinode cooperative communication network model and formulates the resource allocation problem. Section 4 proposes a resource allocation strategy based on Q-learning and proves the effectiveness of the designed scheme theoretically, and Section 5 compares the proposed algorithm with the greedy algorithm. Finally, Section 6 concludes the paper.

2. Related Work

Compared to the channel bandwidth on land, the available bandwidth underwater is very narrow, only a few kilohertz. When there are many underwater communication nodes, many of them communicate in similar frequency bands, which generates large interference between nodes and degrades the communication quality of underwater nodes. Facing this complicated underwater communication environment, many scholars have improved the communication quality of underwater sensor networks by rationally allocating resources such as channels and power.

The problem of resource allocation has been extensively studied in UACNs. Aiming at the energy limitation and throughput problems in UACNs, the linear Gaussian relay channel (LGRC) model is used in Ref. [12] to optimize the power spectral density of the input power, effectively expanding the transmission capacity of UACNs. In a similar study, a joint power-rate allocation algorithm is proposed in Ref. [13] for the MQAM-OFDM underwater acoustic communication system, which optimizes the transmit power of the nodes and improves the transmission rate of the system. In Ref. [14], the authors proposed an efficient spectrum management scheme, receiver-initiated spectrum management (RISM), for underwater acoustic cognitive networks and performed power allocation aimed at maximizing the node channel capacity, which effectively avoids conflicts in data transmission and improves the data transmission rate. However, the centralized optimization algorithms proposed in the abovementioned studies only optimize the transmission rate of the nodes and do not consider the quality of service of the network. To improve its own throughput, each transmitting node usually chooses a larger transmit power, which causes more serious network interference and further shortens the life cycle of the nodes. In Ref. [15], a joint frequency-power allocation algorithm is proposed for UACNs, which effectively extends the life cycle of nodes by setting the power level according to the distance between nodes. The disadvantage is that this algorithm is only suitable for environments with dense network nodes. Meanwhile, considering the complex underwater communication environment, it is difficult to deploy a centralized control center underwater, so the abovementioned centralized power allocation algorithms cannot meet the strongly distributed application requirements of UACNs.

RL allows an agent to continuously optimize its own strategy through interaction with an unknown environment and can be applied in a distributed manner, achieving good results in many scenarios [16, 17]. For example, to solve the multinode interference problem in UACNs, the authors of Ref. [5] converted the resource allocation problem into a Markov decision model and proposed a cooperative Q-learning optimization scheme. However, Ref. [5] did not consider node energy consumption. Furthermore, an anti-interference relay selection scheme based on a deep Q network (DQN) is proposed in Ref. [18], which selects the relay position based on the interference level of the node on the one hand and adjusts the node transmit power according to the magnitude of the BER on the other hand. The disadvantage is that the algorithm only considers a network composed of a few nodes and lacks scalability. Therefore, in order to balance node energy consumption and network interference level while keeping the algorithm scalable, this paper regards each communication node as an agent and transforms the resource allocation problem into a Q-learning model to obtain an optimized strategy.

3. System Model and Problem Formulation

3.1. System Model

In this paper, we consider a UACN OFDM system composed of multiple transmitter-receiver pairs. In UACNs, the transmitting nodes collect environmental information, and the receiving nodes are relay nodes or data fusion centers. Depending on application needs, there are many types of transmitting nodes, including sensor nodes, autonomous underwater vehicles (AUVs), unmanned underwater vehicles (UUVs), and many others. Different types of transmitter-receiver pairs have different communication requirements and priority levels. The bandwidth of the OFDM system is equally divided into orthogonal sub-channels. For convenience, we assume that the bandwidth of each sub-channel is the unit bandwidth. All orthogonal channels are shared channels that can be freely accessed by all transmitter-receiver pairs. Meanwhile, suppose that the network contains multiple sensor-node pairs and one AUV pair. The overall network configuration is shown in Figure 1. Please note that, although we consider each transmitter to serve a single receiver, the proposed method can easily be adapted to serve more transmitter-receiver pairs.

As described above, the signal received at a node includes interference from the other transmitting nodes and ambient noise; the signal-to-interference-plus-noise ratio (SINR) at a node can therefore be expressed as in Ref. [19], in terms of the transmit power of the AUV, the channel gain from the AUV to the node, the transmit powers of the interfering nodes, the channel gains between the nodes, and the noise power of the underwater acoustic channel. Underwater acoustic channel noise is an important topic in the application of UACNs, since hydrodynamic effects (tides and waves caused by wind, rain, and seismic disturbances) and industrial activity (e.g., surface shipping) remain among the main obstacles to the development of underwater acoustic communication [20–22]. Calculating the noise power is a very complex challenge because of the significant time-space-frequency variability of underwater acoustic channel noise [23, 24]. Fortunately, it can be calculated from the corresponding power spectral density [15, 25], whose components account for ocean turbulence, ship activity, wind and waves, and the thermal motion of water molecules, respectively, together with factors for the sea-surface wind speed and the level of shipping activity.
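For concreteness, the following Python sketch implements the widely used empirical ambient-noise model (turbulence, shipping, wind/waves, and thermal noise) on which such power-spectral-density formulas are commonly based; the coefficients follow the standard formulas from the underwater acoustics literature and may differ in detail from those used in [15, 25].

```python
import numpy as np

def ambient_noise_psd(f_khz, shipping=0.5, wind_speed=10.0):
    """Total ambient-noise PSD in dB re uPa^2/Hz at frequency f_khz (kHz).

    shipping in [0, 1] is the shipping-activity factor and wind_speed (m/s)
    is the sea-surface wind speed; both defaults are illustrative.
    """
    f = np.asarray(f_khz, dtype=float)
    n_turb = 17.0 - 30.0 * np.log10(f)                              # ocean turbulence
    n_ship = (40.0 + 20.0 * (shipping - 0.5) + 26.0 * np.log10(f)
              - 60.0 * np.log10(f + 0.03))                          # shipping activity
    n_wind = (50.0 + 7.5 * np.sqrt(wind_speed) + 20.0 * np.log10(f)
              - 40.0 * np.log10(f + 0.4))                           # wind and waves
    n_thermal = -15.0 + 20.0 * np.log10(f)                          # thermal noise
    # Combine the four sources in linear (power) units and convert back to dB.
    total_linear = sum(10 ** (x / 10.0) for x in (n_turb, n_ship, n_wind, n_thermal))
    return 10.0 * np.log10(total_linear)
```

Integrating this PSD over the (unit) sub-channel bandwidth then yields the noise power used in the SINR expression.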

In the underwater acoustic communication system, the channel gain can be expressed as in Ref. [25] in terms of a normalization coefficient, the transmission distance (km), the communication frequency, the spreading loss, which describes the channel characteristics of underwater acoustic propagation, the spreading factor, taken here as 1.5 (practical spreading), and the absorption coefficient, which can be expressed by the Thorp empirical formula [26].
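As an illustration, the following Python sketch computes the Thorp absorption coefficient and the corresponding attenuation-based channel gain under the standard model 10 log A(l, f) = A0 + 10 k log l + l·α(f), with the distance l in km and the frequency in kHz; the function names and the default normalization are ours, not the paper's.

```python
import numpy as np

def thorp_absorption_db_per_km(f_khz):
    """Thorp empirical absorption coefficient alpha(f) in dB/km, f in kHz."""
    f2 = f_khz ** 2
    return (0.11 * f2 / (1.0 + f2) + 44.0 * f2 / (4100.0 + f2)
            + 2.75e-4 * f2 + 0.003)

def channel_gain(dist_km, f_khz, k=1.5, a0_db=0.0):
    """Linear channel gain 1 / A(l, f) for spreading factor k and
    normalization constant a0_db (dB). k = 1.5 models practical spreading."""
    path_loss_db = (a0_db + 10.0 * k * np.log10(dist_km)
                    + dist_km * thorp_absorption_db_per_km(f_khz))
    return 10.0 ** (-path_loss_db / 10.0)
```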

Assume that all channel parameters are known by the transmitting node, which is consistent with previous work such as Refs. [3, 5]. In fact, this is reasonable, because the channel information can be fed back to each transmitting node through the backhaul network. Thus, the normalized capacity of any receiver can be expressed as follows:
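Although the display itself is not reproduced here, under the stated unit-bandwidth assumption the normalized capacity referred to is the per-sub-channel Shannon capacity; in our (not the paper's) notation, with $\gamma_j$ the SINR of the $j$-th receiver,

```latex
C_j = \log_2\!\left(1 + \gamma_j\right) \quad [\text{bit/s/Hz}].
```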

3.2. Problem Formulation

During the operation of UACNs, when the noise conditions of the underwater acoustic channel are given, each transmitter would like to transmit data with a larger power in order to obtain a higher quality of service. However, excessive transmit power increases the level of network interference, which greatly degrades the communication quality. Besides, transmitters usually run on battery power when working underwater, and excessive transmit power accelerates their energy depletion. Therefore, the main goal of our work is to solve this energy optimization problem, i.e., to maximize the service quality of the receivers with a smaller energy consumption.

As mentioned previously, given the transmit powers of the transmitting nodes, the optimization goal can be expressed as follows, where the objective (7) maximizes the network capacity with relatively small energy consumption, in terms of the information transmission capacity of the j-th transmitter-receiver pair and the transmit power of the j-th transmitter node. Constraint (8) denotes the power limit of each transmitting node, while constraints (9) and (10) specify the minimum SINR required by the sensor nodes and by the AUV, respectively, to meet application requirements. In other words, constraints (9) and (10) ensure that all receivers have sufficient quality of service. Considering (8)–(10), it can be concluded that the optimization in (7) is not only a multiobjective optimization problem but also a nonconvex problem for UACNs, mainly because of the SINR expression in (1) and the form of the objective (7). In the next section, a method based on reinforcement learning is proposed to solve this problem.
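Since the displays (7)–(10) are not reproduced above, one plausible formulation consistent with the verbal description (an energy-efficiency objective, a per-node power limit, and SINR floors for the sensor nodes and the AUV) is the following; the notation is ours and purely illustrative.

```latex
\max_{\{p_j\}} \ \sum_{j} \frac{C_j(p_1,\dots,p_J)}{p_j}
\quad \text{s.t.} \quad
0 \le p_j \le p_{\max}, \qquad
\gamma_j \ge \gamma_{\min}^{\mathrm{node}}, \qquad
\gamma_{\mathrm{AUV}} \ge \gamma_{\min}^{\mathrm{AUV}}.
```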

4. Resource Allocation Based on Reinforcement Learning

4.1. Markov Decision Process

The environment that the agent interacts with is usually modeled as a Markov decision process (MDP) with a finite state space. The MDP is described by the discrete set of environmental states, the discrete set of actions that the agent can perform, the reward obtained by the agent for performing an action in a given state, and the state transition function. At each time step, the agent observes the current state of the environment and selects an action from the action set to execute. According to the state transition probability distribution, the environment then shifts from the current state to the next state and generates feedback on the agent's choice of action, namely, the reward value. The whole process is iterated and optimized until convergence.

The goal of the RL method is to continuously optimize the agent's decision strategy during this iterative process. Formally, a strategy describes the mapping from environmental states to action selections. The task of the agent is to obtain the optimal policy during the learning process so that the total expected discounted return is maximized within a finite number of steps; the return is defined in terms of the reward discount factor, the initial state of the system, and the immediate rewards obtained by executing the chosen actions, and its expectation is often referred to as the value function of the agent at a given state.

The process of RL can be described as an MDP, which has the Markov property. In other words, the next state of the environment depends only on the current state, not on earlier states. Therefore, the value function can be simplified to

Therefore, the optimal strategy satisfies the Bellman equation as [9]

However, in practical systems, the state transition function is generally unknown, so the agent cannot construct a complete model of the reinforcement-learning quadruple (states, actions, rewards, and transitions). Therefore, it is necessary to use model-free RL algorithms, of which Q-learning is the most representative. The Q-function denotes the cumulative discounted reward obtained by selecting a given action in a given state and then following the optimal policy throughout the subsequent selection process. Combining equations (12) and (13), the relationship between the value function and the state-action value function can be obtained as follows:

Therefore, the optimal value function can be obtained from . Then, (14) can be expressed as follows:

From the above equation, the update rule of the estimated Q function is given as in Ref. [5], where the Q values before and after the update appear on the two sides of the assignment and the learning rate determines how strongly the update relies on the immediate reward rather than on the accumulation of past experience. The Q value is thus updated using the immediate reward and the optimal Q value of the next state, and the basic idea is to estimate the Q function by incrementally refining the values of previously visited state-action pairs.
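Since the displays in this subsection are not reproduced above, the standard textbook forms they correspond to are, in our notation,

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s, \pi\right], \qquad
V^{*}(s) = \max_{a \in A}\Big[r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s')\Big],

Q^{*}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a'), \qquad
V^{*}(s) = \max_{a} Q^{*}(s,a),

Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha\Big[r(s,a) + \gamma \max_{a'} Q(s',a')\Big],
```

where $\gamma$ is the discount factor and $\alpha$ the learning rate.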

4.2. Reinforcement Learning-Based Power Allocation Approach

In this paper, each transmitter is treated as an agent with RL capability. The key step is to transform the resource optimization problem in UACNs into an RL model and use it to obtain optimal decision results. The problem scenario is modeled below in terms of the elements of reinforcement learning.

4.2.1. Action Space

According to the optimization goal described in (7), the action of the agent is its choice of transmit power. Generally speaking, the Q function is stored in a look-up table, so we first discretize the power selection: the transmit power of each agent is restricted to a finite set of levels spanning its allowed range, and the number of discretized powers determines the granularity, as sketched below.
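A minimal sketch of this discretization (the function name and granularity are ours) is:

```python
import numpy as np

def power_levels(p_max, num_levels):
    """Discretize the transmit-power range [0, p_max] into a finite action set.

    Example: p_max = 8.0 (W) and num_levels = 5 yield [0, 2, 4, 6, 8] W.
    """
    return np.linspace(0.0, p_max, num_levels)
```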

4.2.2. State Space

The state of the environment should be defined based on local observations. The key to the UACN resource allocation problem is to determine the level of interference around each receiver and the energy consumption of the transmitter. Therefore, at each time step, the state observed by a transmitter is defined through a binary indicator of whether the SINR at its receiver is above or below the corresponding threshold, which in turn depends on the actions chosen by the other transmitters; the discrete set of environmental states associated with each receiver is built from these indicators, as illustrated below.
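A hedged illustration of such a local observation (the exact composition of the paper's state vector is not fully recoverable here, so this encoding is only an assumption) is:

```python
def observe_state(sinr, sinr_threshold, energy_level):
    """Local observation of one transmitter: a binary QoS indicator plus a
    coarse (discretized) residual-energy level; both components are illustrative."""
    qos_ok = 1 if sinr >= sinr_threshold else 0
    return (qos_ok, energy_level)
```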

4.2.3. Reward Function

The reward value of the agent’s RL indicates the degree of satisfaction of the agent with the strategy choice. In the current scenario, the optimization goal is to maximize the QoS of the receiver device with less power consumption, which is essentially a multiobjective optimization problem. In this paper, we transform the multiobjective problem into a single-objective problem by the weight coefficient method, and transform the optimization goal setting into the reward value, denoted as follows:

This design is based on the following points. In (21), the first terms denote the capacities of the AUV and of the sensor node at the current time. If the receiver enjoys a higher SINR, a lower bit error rate and hence a higher throughput are usually obtained. However, an excessively high SINR requires the transmitter to operate at a high power level, which in turn consumes more energy and increases the interference to other users. To avoid this, we consider energy efficiency, i.e., (21) uses the number of correctly received bits per unit of energy consumption as part of the reward function. Simultaneously, (21) also considers the deviations of the AUV and of the sensor node from their required capacity thresholds; these deviations are subtracted in (21) to decrease the reward. In addition, a fairness parameter is introduced, defined from the distance between the node and the AUV normalized to a constant reference distance that indicates whether the node is near the AUV. If the distance between a node and the AUV is smaller than this reference, the node is affected by the AUV more strongly than any transmitter farther away; such a node should then receive less reward, which is realized by scaling the first and third terms in (21) by the inverse of the corresponding distance factors.

Because the devices select their power levels independently, a device may interfere strongly with other devices in order to maximize its own payoff. In other words, an incorrect action selection may cause the SINR of some receivers to fall below their thresholds, so the reward value is redefined as

Specifically, if the SINR in the current channel is greater than the predefined threshold (see (9)), i.e., the QoS exceeds the minimum requirement, the reward value is calculated from (21); otherwise, the reward value is 0. Overall, (22) is the payoff for choosing a power level in a given state that ensures the quality of service of the transmission while achieving energy efficiency; a hedged sketch of this reward structure is given below.
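The following Python sketch mirrors the structure just described (energy-efficiency term, capacity-deviation penalties, distance-based fairness weight, and the QoS gate of (22)); the exact weights and functional form of (21) are not reproduced above, so every coefficient and name here is an illustrative assumption.

```python
def reward(c_auv, c_node, p_tx, c_auv_min, c_node_min,
           sinr, sinr_min, dist_norm, d0=0.5):
    """Illustrative reward: energy efficiency minus capacity-shortfall
    penalties, damped for nodes close to the AUV, gated by the SINR threshold."""
    if sinr < sinr_min:                                   # QoS gate of (22)
        return 0.0
    efficiency = (c_auv + c_node) / max(p_tx, 1e-9)       # bits per unit energy
    shortfall = max(c_auv_min - c_auv, 0.0) + max(c_node_min - c_node, 0.0)
    fairness = min(dist_norm / d0, 1.0)                   # smaller reward near the AUV
    return fairness * efficiency - shortfall
```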

The convergence of the Q-learning algorithm mainly depends on the convergence of the Q-value function [27]. Next, we will analyze the convergence of the proposed algorithm.

Theorem 1. The value of the reward function formulated according to formula (22) is bounded in different system states.

Proof. From (22), we need to prove that the reward function is bounded in the different system states when the SINR constraint is satisfied.
From (21), the reward consists of three components: the energy efficiency, the deviation of the AUV's communication capacity from its capacity threshold, and the deviation of the sensor node's communication capacity from its capacity threshold; the corresponding weighting coefficients are constants.
Since the action space defined by the power discretization is a finite discrete set, the communication capacity of the AUV and the communication capacity of the sensor node are bounded in any state.
Furthermore, the energy-efficiency term formed from these capacity values and the power value also takes only finitely many values, i.e., it is bounded. Meanwhile, the two capacity deviations are bounded. Therefore, the reward is bounded.

Theorem 2. Consider the iteration of the Q-values under a bounded reward function r(s, a), with a learning rate in (0, 1] that satisfies the standard stochastic-approximation conditions (the learning rates sum to infinity while the sum of their squares is finite). If the optimal Q-value function is denoted accordingly, then the Q-values converge to the optimal Q-values with probability 1 as the number of iterations tends to infinity. The conclusion stated in Theorem 2 has a detailed proof in Ref. [28] and is not repeated here.
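In our notation, the conditions and conclusion presumably intended in Theorem 2 take the standard form

```latex
0 < \alpha_t \le 1, \qquad \sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^{2} < \infty
\quad \Longrightarrow \quad
\lim_{t \to \infty} Q_t(s,a) = Q^{*}(s,a) \ \text{with probability } 1.
```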

4.3. Algorithm Description

Based on the above preparatory work, the Q-learning-based resource allocation algorithm for UACNs can be described as follows. Algorithm 1 first initializes the relevant parameters, then uses the ε-greedy method [29] to guide the action selection of each Q-agent, updates the Q-value function based on equation (17), and iterates until the Q-value function converges, at which point the resource allocation decision for the UACNs is made. A code sketch of this loop is given after the listing.

Initialization:
(1) Set the algorithm parameters (e.g., the learning rate and discount factor).
(2) Initialize the Q-table.
Repeated Learning: (for each episode)
(3) Look up the Q-table and select the current state.
(4) Execute the ε-greedy method [29] to select the action.
(5) Calculate the reward based on equation (22).
(6) Calculate the current Q-value function.
(7) Update the Q-table according to equation (17).
(8) Update the state.
(9) Return to step (3) until the final state is reached.
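As a minimal sketch of Algorithm 1 for a single transmitter, assuming the discretized action set, the binary-indicator states, and a reward function like the ones sketched above, the main loop can be written as follows; the environment feedback is abstracted behind a placeholder env_step callback, the step numbers in the comments refer to the listing above, and the hyperparameter values are illustrative rather than those used in the paper.

```python
import numpy as np

def q_learning(env_step, num_states, num_actions,
               alpha=0.5, gamma=0.9, eps=0.1, num_iterations=50_000):
    """Tabular Q-learning loop (episodes flattened into one loop for brevity).

    env_step(state, action) -> (next_state, reward) is a placeholder for the
    receiver feedback (SINR/capacity measurements) described in the paper.
    """
    q_table = np.zeros((num_states, num_actions))        # (2) initialize the Q-table
    state = 0                                            # (3) start from an initial state
    for _ in range(num_iterations):
        if np.random.rand() < eps:                       # (4) epsilon-greedy selection
            action = np.random.randint(num_actions)
        else:
            action = int(np.argmax(q_table[state]))
        next_state, r = env_step(state, action)          # (5) reward from equation (22)
        # (6)-(7) Q-value update according to equation (17)
        q_table[state, action] = ((1.0 - alpha) * q_table[state, action]
                                  + alpha * (r + gamma * np.max(q_table[next_state])))
        state = next_state                               # (8) advance the state
    return q_table
```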

5. Numerical Results

To verify the effectiveness of the proposed algorithm, this section evaluates its performance in two different scenarios, i.e., a sparse network consisting of four transmitter-receiver pairs and a dense network with dynamic access consisting of multiple transmitter-receiver pairs. The network model is shown in Figure 1, and the simulation parameters, including the maximum transmit power of the transmitters, the system bandwidth, the propagation coefficient, the carrier frequency, and the noise power, are set according to Refs. [19, 30]. In addition, we consider the random nonstationary characteristics of the underwater signal and multiply the deterministic channel gain by a random factor, which obeys a Rayleigh distribution with a mean value of 0.1, to reflect the influence of underwater uncertainty on the underwater acoustic channel; the resulting product is used as the underwater acoustic channel gain in the simulation.

The minimum SINR requirements of the sensor nodes and the AUV are defined in terms of the rates required to support their corresponding receivers. In the simulation, we assume that the minimum transmission rate required to satisfy the QoS of a sensor node is 0.4 b/s/Hz, and the minimum rate required by the AUV is set to 1 b/s/Hz. It is important to note that, given the media access control (MAC) layer parameters, the channel transmission rate can be calculated using equations (20) and (21) of Ref. [21]. The parameters associated with Q-learning, namely the learning rate and the discount factor, are set accordingly; an ε-greedy policy is used for the first 80% of the iterations, and the maximum number of iterations is set to 50,000. Besides, in order to achieve noncooperative power allocation in UACNs, one of the most important issues is the definition of the received reward; in this paper, the concept of energy efficiency introduced in the reward design is also used as one of the metrics for the numerical evaluation.

We first consider a sparse network consisting of four transmitter-receiver pairs. Assume that the four transmitters and four receivers are randomly distributed in a region that is 1.5 km deep, 1.5 km long, and 1 km wide; the coordinates of the nodes are given in Table 1. Figure 2 shows the effect of the transmit power of the AUV on the other three nodes. Overall, the SINR of the three nodes gradually decreases as the transmit power of the AUV increases and the interference in the network environment grows, which makes the transmission capacity of the three nodes decrease continuously. Further, when the AUV transmit power is fixed, the node closest to the AUV suffers the strongest interference, i.e., has the smallest SINR, and thus its acquired capacity is the smallest among the three links. Conversely, the node farthest from the AUV acquires the largest capacity.

Figure 3 compares the proposed learning algorithm with the greedy algorithm. To make a fair comparison between the two algorithms, we choose energy efficiency as the evaluation metric. The results in Figure 3 indicate that, as the power of the AUV increases, the network energy efficiency of the proposed learning algorithm gradually decreases but remains significantly better than that of the greedy algorithm. It should be noted that, as shown in Figure 2, the decrease in network energy efficiency is expected. In fact, under the greedy algorithm, each transmitting node always chooses the maximum power for transmission, which keeps energy consumption high while the transmission capacity does not increase significantly.

Figure 4 illustrates how the AUV transmission capacity varies with its transmit power. It can be seen that the proposed algorithm achieves a higher AUV transmission capacity than the greedy algorithm. This is mainly because the proposed algorithm better balances energy consumption and the network interference level, so that the transmit power of each node in the network can be adjusted adaptively to achieve a win-win situation.

Next, we further consider a dense network with dynamic access consisting of multiple transmitter-receiver pairs. Assume that the transmit power of the AUV is 8 W, while the number of randomly distributed sensor nodes in the network increases from 1 to 20. The simulation starts with one transmitter-receiver pair; after convergence, the next transmitter-receiver pair is added to the network, and so on. Figure 5 shows the distribution of node capacities as the number of nodes in the network increases. As can be seen from the figure, under the same conditions and compared with the greedy algorithm, the learning algorithm proposed in this paper maintains a better network quality of service by adaptively adjusting the node transmit power according to changes in the network environment. At the same time, it should be noted that as the number of nodes increases, the level of network interference rises, which makes the overall energy efficiency of the nodes show a decreasing trend.

Figure 6 plots the network energy efficiency as the number of nodes increases. It is evident from the figure that the proposed algorithm balances the network transmission capacity and energy consumption well, which greatly improves the network service quality. In the greedy algorithm, all nodes choose the maximum transmit power in pursuit of higher transmission capacity, which not only wastes energy but also increases the interference within the network, ultimately keeping the network energy efficiency at a low level.

Finally, we analyze the convergence and complexity of the algorithm. The maximum number of iterations of the proposed learning algorithm is set to 50,000, and the average number of iterations required for convergence in the two scenarios is shown in Figure 7. From the figure, it can be seen that the proposed algorithm requires an approximately equal number of iterations in the two different scenarios. Specifically, the mathematical expectation and the variance of the number of iterations required for convergence are 41,200 and 35.6, respectively, in the underwater sparse scenario when the transmit power of the heterogeneous nodes varies between 0 and 15, and 41,236 and 49.1, respectively, in the underwater dense scenario when the number of nodes varies between 1 and 20. The stability of the proposed algorithm is thus demonstrated.

To better understand the running time of the proposed algorithm, Figure 8 shows the actual running time of the proposed algorithm on a conventional processor. Specifically, in the underwater sparse scenario, when the transmit power of the heterogeneous nodes varies between 0 and 15, the mathematical expectation and variance of the running time required for the proposed algorithm to converge are 5.65 and 0.51, respectively. In the underwater dense scenario, when the number of nodes varies between 1 and 20, the running time required for the proposed algorithm to converge gradually increases. This is mainly because when the number of nodes increases, a lot of time is needed to find the equilibrium between communication capacity and energy consumption.

6. Conclusion

This paper proposes a power allocation scheme based on Q-learning. The scheme considers the interference problem in UACNs composed of multiple transmitter-receiver pairs and the energy efficiency of each transmitter, while each transmitter (sensor node or AUV) trains itself to select the appropriate transmit power to serve its own receiver while protecting the other nodes in the network. In addition, the learning algorithm proposed in this paper, as a distributed method, can solve the power optimization problem for networks with dynamically accessing sensor nodes while maintaining low complexity. The scheme is scalable and has a clear advantage in energy efficiency compared with the greedy algorithm. In future work, we will design neural-network function approximators to handle large state and action spaces.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the National Natural Science Foundation of China, under grant no. 62001199, the Fujian Province Natural Science Foundation of China, under grant no. 2019J01842, the President’s Fund of Minnan Normal University, under grant no. KJ2020003, the Scientific Research Fund of Fujian Provincial Education Department, under grant no. JT180596, and the Ningde Science and Technology Project, under grant nos. 20140157 and 20160044.