In this paper, a multiantenna wireless transmitter communicates with an information receiver while radiating RF energy to surrounding energy harvesters. The channel between the transceivers is known to the transmitter, but the channels between the transmitter and the energy harvesters are unknown to the transmitter. By designing its transmit covariance matrix, the transmitter fully charges the energy buffers of all energy harvesters in the shortest amount of time while maintaining the target information rate toward the receiver. At the beginning of each time slot, the transmitter determines the particular beam pattern to transmit with. Throughout the whole charging process, the transmitter does not estimate the energy harvesting channel vectors. Because of the high complexity of the system, we propose a novel deep Q-network algorithm to determine the optimal transmission strategy. Simulation results show that the deep Q-network outperforms the existing algorithms in terms of the time required to complete the wireless charging process.

1. Introduction

For a wireless transceiver pair with multiple antennas, optimizing the transmit covariance matrix can achieve high data-rate communication over the multiple-input multiple-output (MIMO) channel. Meanwhile, the radiated radio frequency (RF) energy can be acquired by the nearby RF energy harvesters to charge the electronic devices [1].

The problem of simultaneous wireless information and power transfer (SWIPT) has been widely discussed in recent years. SWIPT systems are divided into two categories: (1) the receiver splits the received signals for information decoding and energy harvesting [2, 3]; (2) separate, dedicated information decoders (ID) and RF energy harvesters (EH) exist in the system [4]. For the second type of system, different transmission strategies have been proposed to achieve good operating points in the rate-energy region [1, 2, 5]. For multiple RF energy harvesters in the vicinity of the wireless transmitter, the covariance matrix at the transmitter is designed either to maximize the net energy harvesting rate or to distribute the radiated RF energy fairly among the harvesters [6, 7]. The achievable information rate of the wireless transmitter-receiver pair must remain above a minimum requirement for reliable communication. Most of the existing works assume that the channel state information (CSI) is completely known. Given the complete CSI, the transmitter designs the transmit covariance matrix to achieve the maximum information rate while satisfying the RF energy harvesting requirement [4, 8].

However, in practice, it is difficult for the transmitter to obtain the channel state information to the nearby RF energy harvesters because the scattered deployment of the hardware-limited energy harvesters makes channel estimation at the RF energy harvesters challenging [9, 10]. The analytic center cutting plane method (ACCPM) was proposed for the transmitter to iteratively approximate the channel information with a few bits of feedback from the RF energy receiver [10]. Since this method requires solving a convex optimization problem, it incurs high computational complexity. To reduce complexity, channel estimation based on Kalman filtering was proposed [11]. Nevertheless, the disadvantage of this approach is its slow convergence rate. In order to deal effectively with the CSI acquisition problem, in this paper we use a deep learning algorithm to solve the optimization problem in the SWIPT system with only partial channel information. The partial CSI is easy to acquire and is already enough to achieve superior system performance using the deep Q-network. To the best of our knowledge, we are the first to use the deep Q-network to optimize the SWIPT system performance and validate its superiority.

In our model, the transmitter intends to fully charge the energy buffers of all surrounding energy harvesters in the shortest time while maintaining a target information rate toward the receiver. The communication link is a strong line-of-sight (LOS) transmission, which is assumed to be invariant, whereas the energy harvesting channel conditions vary over time. Due to current hardware limitations, we assume that the energy harvesting channel vectors cannot be estimated under the fast-varying channel conditions. As a result, the wireless charging problem can be modeled as a high-complexity discrete-time stochastic control process with unknown system dynamics [12]. A similar problem has been explored in [13], where a multiarmed bandit algorithm is used to determine the optimal transmission strategy. In our paper, we apply a deep Q-network to solve the optimization problem, and the simulation results demonstrate that the deep Q-network algorithm outperforms the multiarmed bandit algorithm. Historically, the deep Q-network has a proven record of attaining mastery over complex games with a very large number of system states and unknown state transition probabilities [12]. More recently, deep Q-networks have been applied to complex communication problems and have been shown to achieve good performance [14-16]. For this reason, we consider the deep Q-network a good fit for our model. In our model, we consider the accumulated energy of the energy harvesters as the system state, while we define the action as the transmit power allocation. At the beginning of each time slot, each energy harvester sends feedback about its accumulated energy level to the wireless transmitter, and the transmitter collects all the information in order to generate the system state and inputs it into a well-trained deep Q-network. The deep Q-network outputs the Q values corresponding to all possible actions. The action with the maximum Q value is selected as the beam pattern to be used for the transmission during the current time slot.

Building on the traditional deep Q-network, the double deep Q-network and dueling deep Q-network algorithms are applied in order to reduce the observed overestimations [17] and improve the learning efficiency [18]. Hence, we apply the dueling double deep Q-network to solve the wireless charging problem with varying channels and multiple energy harvesters.

The novelties of this paper are summarized as follows:
(i) The simultaneous wireless information and power transfer problem is formulated as a Markov decision process (MDP) in an unknown varying channel condition for the first time.
(ii) The deep Q-network algorithm is applied to solve the proposed optimization problem for the first time. We demonstrate that, compared to the other existing algorithms, the deep Q-network achieves more efficient and stable wireless power transfer.
(iii) Multiple experimental scenarios are explored. By varying the number of transmission antennas and the number of energy harvesters in the system, the performance of both the deep Q-network and the other algorithms is compared and analyzed.
(iv) The evaluation of the algorithms is based on real experimental data, which validates the effectiveness of the proposed deep Q-network in real-time simultaneous wireless information and power transfer systems.

The rest of the paper is organized as follows. In Section 2, we describe the simultaneous wireless information and power transfer system model. In Section 3, we model the optimization problem as a Markov decision process and present a deep Q-network algorithm to determine the optimal transmission strategy. In Section 4, we present our simulation results for different experimental environments. Section 5 concludes the paper.

2. System Model

As shown in Figure 1, an information transmitter communicates with its receiver while its transmission is also received by nearby RF energy harvesters [8]. Both the transmitter and the receiver are equipped with $M$ antennas, while each RF energy harvester is equipped with one receive antenna. The baseband received signal at the receiver can be represented as

$$y = Hx + z,$$

where $H$ denotes the normalized baseband equivalent channel from the information transmitter to its receiver, $x$ represents the transmitted signal, and $z$ is the zero-mean circularly symmetric complex Gaussian noise with $z \sim \mathcal{CN}(0, \rho^2 I)$.

The transmit covariance matrix is denoted by , i.e., . The covariance matrix is Hermitian positive semidefinite, i.e., . The transmit power is restricted by the transmitter’s power constraint , i.e., . For the information transmission, we assume that a Gaussian codebook with infinitely many code words is used for the symbols and the expectation of the transmit covariance matrix is taken over the entire codebook. Therefore, is the zero-mean circularly symmetric complex Gaussian with . With transmitter precoding and receiver filtering, the capacity of the MIMO channel is the sum of the capacities of the parallel noninterfering single-input single-output (SISO) channels (eigenmodes of channel ) [19]. We convert the MIMO channel to eigenchannels for information and energy transfer [20, 21]. A singular value decomposition (SVD) on gives , where contains the singular values of . Since the MIMO channel is decomposed into parallel SISO channels, the information rate can be given bywhere are the diagonal elements of with .
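The eigenchannel decomposition described above can be illustrated numerically. The following is a minimal sketch, assuming the rate is the sum over eigenchannels of log2(1 + gain² × power / noise variance); the helper name `eigenchannel_rate`, the argument names, and the example channel are hypothetical and not taken from the paper.

```python
import numpy as np

def eigenchannel_rate(H, p, noise_var=1.0):
    """Sum rate in bits/s/Hz over the eigenchannels of H (hypothetical helper)."""
    gains = np.linalg.svd(H, compute_uv=False)  # singular values, largest first
    p = np.asarray(p, dtype=float)              # power allocated to each eigenchannel
    snr = gains ** 2 * p / noise_var            # per-eigenchannel SNR (assumed form)
    return float(np.sum(np.log2(1.0 + snr)))

# Toy 2x2 channel whose singular values are 2 and 1.
H = np.array([[2.0, 0.0], [0.0, 1.0]])
rate = eigenchannel_rate(H, p=[1.0, 1.0])
```

A candidate power allocation would then be kept as an admissible beam pattern only if this rate meets the information rate requirement and the total power respects the power constraint.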

The RF energy harvester received power specifies the harvested energy normalized by the baseband symbol period and scaled by the energy conversion efficiency. The received power at the th energy harvester iswhere is the channel vector from the transmitter to the th energy harvester. With MIMO channel decomposition, the received power at energy harvester is denoted aswhere are the elements of vector with .
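The decomposition of the received power can be checked numerically. The sketch below assumes that the transmit covariance matrix is aligned with the right singular vectors of the communication channel, so that the power received over a harvester channel `g` reduces to a weighted sum of the per-eigenchannel powers. The variable names (`g`, `p`, `c`, `eta`) and the random test channel are assumptions for illustration only; the value 0.1 for the conversion efficiency is the one quoted in the simulation section.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 3
H = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))  # communication channel
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)            # harvester channel
p = np.array([0.5, 0.3, 0.2])                                       # per-eigenchannel power
eta = 0.1                                                            # energy conversion efficiency

_, _, Vh = np.linalg.svd(H)          # Vh is the conjugate transpose of V
V = Vh.conj().T
Q = V @ np.diag(p) @ Vh              # covariance aligned with the eigenmodes (assumption)

direct = float(np.real(g.conj() @ Q @ g))  # quadratic form g^H Q g
c = np.abs(Vh @ g) ** 2                    # squared magnitudes only, phase discarded
via_c = float(c @ p)                       # same power as a weighted sum of p

harvested = eta * via_c                    # scaled by the conversion efficiency
```

Because only the squared magnitudes in `c` enter the result, a channel description without phase information is sufficient to predict the harvested power for any power allocation.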

We define the simplified channel vector from the transmitter to the th RF energy harvester asfor each . The simplified channel vector contains no phase information. The simplified channel vectors compose matrix as

In what follows, we assume that time is slotted, each time slot has a duration , and that each energy harvester is equipped with an energy buffer of size , . Without loss of generality, we assume that, at , all harvesters’ buffers are empty, which corresponds to system state . At a generic time slot , the transmitter transmits with one of the designed beam patterns. Each harvester can harvest the specific amount of power , and its energy buffer value increases to . Therefore, each state of the system includes the accumulated harvested energy information of all harvesters, i.e.,where denotes the th energy harvester’s accumulated energy up to time slot .

Once all harvesters are fully charged, we assume that the system arrives at a final goal state denoted as . We note that the energy buffer level also accounts for situations in which .
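The state evolution can be sketched as follows. This is a minimal illustration that assumes every buffer has the same capacity and that a buffer level is clipped at the capacity when a harvester overshoots; the function name `charge_step` and the numbers are hypothetical.

```python
def charge_step(levels, powers, slot_s, capacity):
    """Advance every buffer by one time slot; return (new_levels, all_full)."""
    # Energy gained in one slot is harvested power times slot duration,
    # clipped at the buffer capacity (assumed overshoot handling).
    new_levels = [min(b + q * slot_s, capacity) for b, q in zip(levels, powers)]
    all_full = all(b >= capacity for b in new_levels)
    return new_levels, all_full

levels = [0.0, 0.0]                      # empty buffers at the start
for _ in range(3):
    levels, done = charge_step(levels, powers=[1.0, 0.5], slot_s=1.0, capacity=2.0)
```

In this toy run the first harvester fills after two slots while the second still needs one more, so the goal state is reached only when the slower harvester is full.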

3. Problem Formulation for Time-Varying Channel Conditions

In this section, we suppose that the communication link is characterized by strong LOS transmission, which results in an invariant channel matrix , while the energy harvesting channel vector varies over time slots. We model the wireless charging problem as a Markov decision process (MDP) and show how to solve the optimization problem using reinforcement learning (RL). When the number of system states is very large, we apply a deep Q-network algorithm to acquire the optimal strategy at each particular system state.

3.1. Problem Formulation

In order to model our optimization problem as an RL problem, we define the beam pattern chosen in a particular time slot as the action . The set of possible actions is determined by equally generating different beam patterns with power allocation vector that satisfies the power and information rate constraints, i.e., , . Each beam pattern corresponds to a particular power level , which depends not only on the action but also on the channel condition experienced by the harvester during time slot .

Given the above, the simultaneous wireless information and energy transfer problem for a time-varying channel can be formulated as minimizing the time-consumption n to fully charge all the energy harvesters while maintaining the information rate between the information transceivers:

In general, the action selected at each time slot differs in order to adapt to the current channel conditions and the current energy buffer state of the harvesters. Therefore, the evolution of our system can be described by a Markov chain, where the generic state is identified by the current buffer levels of the harvesters, i.e., . The set of all states is denoted by . Among all states, we are interested in the state in which all harvesters’ buffers are empty, namely, , and the state in which all the harvesters are fully charged, i.e., . If we suppose that we know all the channel coefficients at each time slot, problem can be seen as a stochastic shortest path (SSP) problem from state to state . At each time slot, the system is in a generic state , the transmitter selects a beam pattern or action , and the system moves to a new state . The dynamics of the system are captured by transition probabilities , , and , describing the probability that the harvesters’ energy buffers reach the levels in after a transmission with beam pattern . We note that the goal state is absorbing, i.e., .

Each transition also has an associated reward, , that denotes the reward when the current state is , action is selected, and the system moves to state . Since we aim at reaching in the fewest transmission time slots, we consider that the action entails a positive reward related to the difference between the current energy buffer level and the full energy buffer level of all harvesters. When the system reaches state , we set the reward as 0. In this way, the system not only tries to fully charge all harvesters in the shortest time but will also uniformly charge all the harvesters. In detail, we define the reward function aswhereand denotes the unit price of the harvested energy.

It is noted that different reward functions can also be selected. As an example, it is also possible to set a constant negative reward (e.g., a unitary cost) for each transmission that the system does not reach the goal state and a big positive reward only for the states and actions that bring the system to the goal state . This can be expressed as follows:

We note that the reward formulation of equation (11) is actually equivalent to minimizing the number of time slots required to reach state starting from state .
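The alternative reward described above (a unit cost per transmission and a large bonus on reaching the goal state) can be sketched as follows. The bonus value of 100 and the function names are assumptions chosen for illustration.

```python
def step_cost_reward(next_levels, capacity, goal_bonus=100.0):
    """Reward for one transition: bonus when all buffers are full, unit cost otherwise."""
    if all(b >= capacity for b in next_levels):
        return goal_bonus
    return -1.0

def episode_return(num_slots, goal_bonus=100.0):
    """Undiscounted return of an episode that reaches the goal in num_slots slots."""
    # Every slot except the last one pays the unit cost; the last one earns the bonus.
    return -(num_slots - 1) + goal_bonus
```

Since the bonus is a constant, the return shrinks by one for every extra slot, which is why maximizing this return is the same as minimizing the number of time slots.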

Using the above formulation, the optimization problem can then be seen as a stochastic shortest path search from state to state on the Markov chain with states and probabilities , actions , and rewards . Our objective is to find, for each possible state , an optimal action so that the system will reach the goal state following the path with maximum average reward. A generic policy can be written as .

Different techniques can be applied to solve problem , as it represents a particular class of MDPs. In this paper, however, we assume that the channel conditions at each time slot are unknown, which corresponds to not knowing the transition probabilities . Therefore, in the next section, we describe how to solve the above problem using reinforcement learning.

3.2. Optimal Power Allocation with Reinforcement Learning

Reinforcement learning is suitable for solving optimization problems in which the system dynamics follow a particular transition probability function whose probabilities are unknown. In what follows, we first show how to apply the Q-learning algorithm [22] to solve the optimization problem and then show how we can combine the reinforcement learning approach with a neural network to approximate the system model in the case of large state and action sets, using the deep Q-network [12].

3.2.1. Q-Learning Method

If the number of system states is small, we can depend on the traditional Q-learning method to find the optimal strategy at each system state, as defined in the previous section.

To this end, we define the cost function of action on system state as , with . The algorithm initializes with and then updates the values using the following equation:whereand denotes the learning rate. In each time slot, only one Q value is updated, and hence, all the other Q values remain the same.

At the beginning of the learning iterations, since the Q-table does not have enough information to choose the best action at each system state, the algorithm randomly explores new actions. Hence, we first define threshold , and we then randomly generate a probability . In the case that , we choose the action as

On the contrary, if , we randomly select one action from the action set .

When converges, the optimal strategy at each state is determined aswhich corresponds to finding the optimal beam pattern for each system state during the charging process.
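The exploration rule and the table update can be sketched in a few lines. This is the textbook Q-learning update with an explicit discount factor, given here as an assumption since the exact form of the update is not reproduced in this text; `explore_prob` denotes the probability of choosing a random action, and all names are hypothetical.

```python
import random

def select_action(q_row, explore_prob, rng):
    """Pick a random action with probability explore_prob, else the greedy one."""
    if rng.random() < explore_prob:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Update only the entry Q[s][a]; every other entry is left unchanged."""
    target = r + gamma * max(Q[s_next])       # bootstrapped estimate of the return
    Q[s][a] += alpha * (target - Q[s][a])     # move toward the target at rate alpha
    return Q[s][a]

Q = [[0.0, 0.0], [0.0, 0.0]]                  # table initialized to zero
updated = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
greedy = select_action([0.1, 0.7, 0.3], explore_prob=0.0, rng=random.Random(0))
```

Once the table has converged, calling `select_action` with `explore_prob=0.0` returns the beam pattern with the largest Q value for the given state.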

3.2.2. Deep Q-Network

When considering a complex system with multiple harvesters, large energy buffers, and time-varying channel conditions, the number of system states dramatically increases. In order to learn the optimal transmit strategy at each system state, the Q-learning algorithm described before requires a Q-table with a large number of elements, making it very difficult for all the values in the Q-table to converge. Therefore, in what follows, we describe how to apply the deep Q-network (DQN) approach to find the optimal transmission policy.

The main idea of DQN is to train a neural network to find the Q function of a particular system state and action combination. When the system is in state , and action is selected, the Q function is denoted as . denotes the parameters of the Q-network. The purpose of training the neural network is to make

According to the DQN algorithm [17], two neural networks are used to solve the problem: the evaluation network and the target network, which are denoted as and , respectively. Both the and the are set up with several hidden layers. The input of the and the are denoted as and , which describe the current system state and the next system state , respectively. The output of and are denoted as and , respectively. The evaluation network is continuously trained to update the value of ; however, the target network only copies the weight parameters from the evaluation network intermittently (i.e., ). In each neural network learning epoch, the loss function is defined aswhere represents the real Q value and is calculated aswhere is the learning rate. As the loss function updates, the values are backpropagated to the neural network to update the weight of the .
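The roles of the two networks in the loss can be illustrated without a deep learning framework. The sketch below assumes the usual DQN form, in which the target network supplies the maximum next-state Q value and the evaluation network output is regressed toward it with a mean squared error; `gamma` is treated as a discount factor, and the function names are hypothetical.

```python
def td_targets(rewards, next_q_target, dones, gamma):
    """Targets built from the target network outputs for the next states."""
    return [r if d else r + gamma * max(q)
            for r, q, d in zip(rewards, next_q_target, dones)]

def mse_loss(q_taken, targets):
    """Mean squared error between evaluation network outputs and the targets."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)

targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_target=[[0.5, 2.0], [3.0, 1.0]],   # target network outputs per action
    dones=[False, True],                      # second transition ends in the goal state
    gamma=0.9,
)
loss = mse_loss(q_taken=[2.0, 0.0], targets=targets)
```

Only the evaluation network receives gradients from this loss; the target network outputs are treated as constants until the next weight copy.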

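Algorithm 1: DQN training process.
(1) Randomly generate the weight parameters for the evaluation network. The target network clones these weight parameters. Initialize the experience pool, the iteration counters, and the system state.
(2) At the beginning of the time slot, randomly generate a probability and compare it with the exploration threshold:
    (a) In the exploitation case, choose the action with the maximum Q value output by the evaluation network for the current system state.
    (b) Otherwise, randomly choose the action from the action set.
    The transmitter transmits with the selected beam pattern.
(3) Throughout the whole time slot, the RF energy is accumulated in the harvesters’ energy buffers. At the end of each time slot, each harvester feeds its energy level back to the transmitter, and the system state is updated.
(4) Store the experience (current state, action, reward, and next state) in the experience pool. If the number of stored experiences reaches the maximum size of the experience pool, the pool size remains constant; otherwise, the pool keeps growing.
(5) After the experience pool accumulates enough data, randomly select experiences from the pool to train the evaluation network. The backpropagation method is applied to minimize the loss function. Clone the weight parameters from the evaluation network to the target network after several time intervals.
(6) Update the current state and the iteration counters. If the termination condition is met, the algorithm terminates; otherwise, go back to step 2.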

In order to better train the neural network, we apply the experience replay method to remove the correlation between different training data. Each experience consists of the current system state , the action , the next system state , and the corresponding reward . The experience is denoted by the set . The algorithm records experiences and randomly selects (with ) experiences from for training. After the training is finished, the target network clones all the weight parameters from the evaluation network (i.e., ).
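A bounded experience pool with uniform random sampling can be sketched with the standard library. The class name `ReplayPool` is hypothetical, and dropping the oldest experience once the pool is full is an assumption about how the pool size is kept constant.

```python
import random
from collections import deque

class ReplayPool:
    """Fixed-size experience pool with uniform random mini-batch sampling."""

    def __init__(self, capacity, seed=None):
        self.pool = deque(maxlen=capacity)   # oldest entries are dropped when full
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        size = min(batch_size, len(self.pool))
        return self.rng.sample(list(self.pool), size)

    def __len__(self):
        return len(self.pool)

pool = ReplayPool(capacity=3, seed=0)
for t in range(5):
    pool.store(state=t, action=0, reward=-1.0, next_state=t + 1)
batch = pool.sample(2)
```

Each sampled tuple would then be turned into one training target for the evaluation network.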

The algorithm used for the DQN training process is presented in Algorithm 1. In each training iteration, we generate usable experiences and select of all for training the evaluation network. In total, we suppose there are training iterations. We consider that, for both the evaluation network and the target network, there are layers in the neural network. In the learning process, we use to denote all energy harvesters’ channel condition in a particular time slot.

3.2.3. Dueling Double Deep Q-Network

Since more harvesters and time-varying channel conditions incur more system states, it is hard to learn the transmit rules for the transmitter even with the original DQN. Therefore, we apply the dueling double DQN in order to deal with the overestimation problem during the training process and to improve the learning efficiency of the neural network. Double DQN is a technique that strengthens the traditional DQN algorithm by preventing overestimation [17]. In traditional DQN, as shown in equation (18), we utilize the target network to predict the maximum Q value of the next state. However, the target network is not updated at every training episode, which may lead to an increase in the training error and therefore complicate the training process. In double DQN, we utilize both the target network and the evaluation network to predict the Q value. The evaluation network is used to determine the optimal action to be taken for the system state as follows:

It can be shown that, following this approach, the training error considerably decreases [17].
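The difference between the two targets can be shown on a single transition. This sketch follows the usual double DQN rule, in which the evaluation network picks the next action and the target network scores it; the function name and the numbers are hypothetical.

```python
def double_dqn_target(reward, next_q_eval, next_q_target, gamma, done=False):
    """Target value: action chosen by the evaluation network, scored by the target network."""
    if done:
        return reward
    best = max(range(len(next_q_eval)), key=next_q_eval.__getitem__)
    return reward + gamma * next_q_target[best]

next_q_eval = [0.2, 0.9, 0.1]      # evaluation network outputs for the next state
next_q_target = [5.0, 1.0, 3.0]    # target network outputs for the next state

double_target = double_dqn_target(1.0, next_q_eval, next_q_target, gamma=0.5)
vanilla_target = 1.0 + 0.5 * max(next_q_target)   # traditional DQN takes the maximum directly
```

Here the traditional target is driven by the largest target network output, while the double DQN target uses the value of the action that the evaluation network actually prefers, which is what limits the overestimation.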

In traditional DQN, the neural network only has the value as the output. In order to speed up the convergence, we apply dueling DQN by setting up two output streams from the neural network. The first stream is represented by the output value results of the neural network, which represents the value of each system state. The second stream is called advantage output and describes the advantage of applying each particular action to the current system state [18]. and are parameters that relate the two streams and the neural network output, which is denoted as

Dueling DQN can efficiently eliminate the extra training freedom, which speeds up the training [18].
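One common way to combine the two streams is to subtract the mean advantage before adding the state value. The sketch below uses that mean-subtraction form as an assumption, since the exact aggregation is not reproduced in this text; the function name is hypothetical.

```python
def dueling_q(state_value, advantages):
    """Combine the value stream and the advantage stream into Q values."""
    mean_adv = sum(advantages) / len(advantages)
    # Subtracting the mean removes the freedom to shift value and advantage
    # by opposite constants without changing the Q values.
    return [state_value + a - mean_adv for a in advantages]

q_values = dueling_q(state_value=2.0, advantages=[1.0, 0.0, -1.0])
shifted = dueling_q(state_value=2.0, advantages=[11.0, 10.0, 9.0])
```

Both calls give the same Q values, which illustrates that a constant offset in the advantage stream has no effect on the output.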

4. Simulation Results

We simulate a MIMO wireless communication system with nearby RF energy harvesters. The wireless transmitter has at most antennas. The communication MIMO channel matrix is measured by two Wireless Open-Access Research Platform (WARP) v3 boards. Both WARP boards are mounted with the FMC-RF-2X245 dual-radio module, which operates in the 5.805 GHz frequency band. The Xilinx Virtex-6 FPGA operates as the central processing system and the WARPLab is used for rapid physical layer prototyping which is compiled by MATLAB [23]. The two transceivers are deployed with a line-of-sight link between them. The maximum transmitted power is W.  dBm. The information rate requirement is 53 bps/Hz. The average channel gain from the transmitter to the energy harvester is  dB. The energy conversion efficiency is 0.1. The duration of one time slot is defined as  ms.

DQN is trained to solve for the optimal transmit strategies for each system state. The simulation parameters used for DQN are presented in Table 1.

As described in Section 3.2, the exploration rate determines the probability that the network selects an action randomly or follows the Q values. Initially, we set because the experience pool has to accumulate a reasonable amount of data to train the neural network. decreases by 0.001 at each training interval and finally stops at , since by then the experience pool has collected enough training data.
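The exploration schedule can be written as a one-line function. The step of 0.001 per training interval comes from the text; the starting value of 1.0 and the floor of 0.1 are placeholders chosen for illustration, because the actual values are not reproduced here.

```python
def exploration_rate(interval, start=1.0, floor=0.1, step=0.001):
    """Linearly decaying exploration rate that stops at a fixed floor."""
    # start and floor are hypothetical values; only step is taken from the text.
    return max(floor, start - step * interval)

early = exploration_rate(0)
middle = exploration_rate(100)
late = exploration_rate(10 ** 6)
```

With these placeholder values the rate reaches its floor after 900 training intervals and stays there for the rest of the training.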

Following [24], the dueling double DQN used in our paper is shown in Figure 2. The software environment for the simulation is TensorFlow 0.12.1 with Python 3.6 in Jupyter Notebook 5.6.0.

For the energy harvesters’ channel, to show an example of the performance achievable by the proposed algorithm, we consider the Rician channel fading model [25]. We suppose that the channel is invariant within each time slot and varies across time slots [26]. At the end of each time slot, the energy harvester feeds its current energy level back to the transmitter. For the Rician fading channel model, the total gain of the signal is denoted as , where is the invariant LOS component and denotes a zero-mean Gaussian diffuse component. The channel between the transmit antenna and the energy harvester can be denoted as . The magnitude of the faded envelope can be modeled using the Rice factor such that , where denotes the average power of the main LOS component between the transmit antenna and energy harvester and denotes the variance of the scatter component. We can derive the magnitude of the main LOS component as since . The mean and the variance of are denoted as and , respectively. In polar coordinates, .
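A Rician channel gain can be sampled as a fixed LOS term plus a complex Gaussian diffuse term. The sketch below uses the textbook parameterization with unit average power, where the Rice factor is the ratio of the LOS power to the total scatter power; this normalization, the function name, and the chosen Rice factor of 5 are assumptions and may differ from the parameter values used in the paper.

```python
import cmath
import math
import random

def rician_gain(k_factor, los_phase, rng):
    """One complex channel gain with Rice factor k_factor and unit average power."""
    # LOS amplitude and per-dimension scatter deviation under unit total power.
    los = math.sqrt(k_factor / (k_factor + 1.0)) * cmath.exp(1j * los_phase)
    sigma = math.sqrt(1.0 / (2.0 * (k_factor + 1.0)))
    diffuse = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return los + diffuse

rng = random.Random(0)
samples = [rician_gain(5.0, 0.0, rng) for _ in range(20000)]
mean_power = sum(abs(h) ** 2 for h in samples) / len(samples)
```

Drawing a fresh sample at the start of every time slot gives a channel that is constant within a slot and changes from slot to slot.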

First, we explore the optimal deep Q-network structure under fading channels. We suppose the number of antennas is and the number of energy harvesters is . The channel between each antenna of the transmitter and each harvester is individually Rician distributed. The action set contains 13 actions satisfying the information rate requirement: , , , , , , , , , , , , and .

The LOS amplitude components of all channel links are defined as , with and . The LOS phase components of all channel links are defined as , , , , , and . The standard deviation of the amplitude and phase is denoted as and , respectively. We suppose . Hence, .

Using the fading channel model above, in Figure 3, we show how the structure of the neural network together with the learning rate can affect the performance of the DQN, for a fixed number of training episodes (i.e., 40000). The performance of DQN is measured by the average number of time slots required to fully charge two harvesters. The average time consumption is obtained over 1000 testing data. Figure 3 shows that if the deep Q-network has multiple hidden layers, a smaller learning rate is necessary to achieve better performance. When the learning rate is 0.1, the DQN with 4 hidden layers performs worse than a neural network with 2 or 3 hidden layers. On the other hand, when the learning rate decreases, we can see that the neural network with 4 hidden layers and a learning rate of 0.00005 achieves the best overall performance. We do not see a monotonic decrease in the average number of time slots because the stochastic nature of the channel causes some fluctuations in the DQN optimization. After an initial improvement, decreasing the learning rate results in a slight increase in the average number of charging steps for all three neural network structures. This is due to the fixed number of training episodes. As a result, for all the simulations presented in this section, we consider a DQN algorithm using a deep neural network with 4 hidden layers, 100 nodes in each layer, and a learning rate of 0.00005.

In Figure 4, we can observe that the size of the experience pool also affects the performance of DQN (40000 training episodes). To eliminate the correlation between the training data, we select part of the experience pool for training. In our simulation, this parameter, called the mini-batch, is set to 10. A larger experience pool contains more training data; hence, selecting the mini-batch from it for training can eliminate the correlation between the training data. However, we need to balance the size of the experience pool and the weight replacement interval. If the experience pool is large but the replacement iteration interval is small, even if we address the correlation problem between the training data, the neural network does not have enough training episodes to reduce the training error before the weights of the target network are replaced. From Figure 4, we can observe that a large replacement iteration interval may not be the best choice either. Therefore, we determine that, for our problem, DQN achieves the best performance when the size of the experience pool and the neural network replacement iteration interval are 60000 and 1000, respectively.

Figure 5 shows the impact of the reward function (see Section 3.1) on the DQN performance. In this figure, we consider the following reward functions: : if and otherwise; : if and otherwise; : if and otherwise. Here, and . All three reward functions are designed to minimize the number of time slots required to fully charge all the harvesters. However, from Figure 5, we can observe that the best performance can be obtained using . In this case, the energy level accumulated by each harvester increases uniformly, which makes the DQN converge faster to the optimal policy. Both and , instead, do not penalize states that unevenly charge the harvesters and therefore require more iterations to converge to the optimal solution (not shown in the figure) due to the large number of system states to explore. Therefore, in the following simulations, we use the reward function . In both Figures 5 and 6, we average the 40000 training steps every 100 steps in order to better show the convergence of the algorithm.

Figure 6 shows that when each energy harvester in the system is equipped with a larger energy buffer, the number of system states increases, and therefore, DQN requires a longer training period to converge to the steady transmit strategy for each system state. We can observe that when  mJ, the system needs fewer than 5000 training episodes to converge to the optimal strategy, and when  mJ, the system needs around 12000 training episodes to converge to the optimal policy. However, for  mJ, the system needs as many as 20000 training episodes to converge to the optimal strategy.

In the following simulations, we explore the impact of the channel model on optimization problem . For the Rician fading channel model, we consider and to be the same for all . In this way, we can approximate the Rician distribution as a Gaussian distribution. We fix , but allow the standard deviation of both the amplitude and the phase of the channel to change to evaluate the performance on the system under different channel conditions. Since and , . We define to guarantee .

In Figure 7, we vary the standard deviation of the phase and amplitude of the channel and compare the performance attained by the optimal policy with that of several other algorithms. The multiarmed bandit (MAB) algorithm is implemented for comparison with the DQN. In MAB, each bandit arm represents a particular transmission pattern. The upper confidence bound (UCB) algorithm [27] is used to maximize the reward and determine the optimal action. Once the action is selected from the action space , it is used for transmission continuously. The myopic algorithm is another machine learning algorithm that can be compared with the DQN. The myopic solution has the same structure as the DQN; however, the reward discount is defined as . As a result, the strategy is determined only according to the current observation, without considering future consequences. The myopic solution has been widely used to solve complex optimization problems in wireless communication and achieves good system performance [28]. Besides these two machine learning algorithms, two heuristic algorithms are also used for performance comparison. In even power allocation, the transmit power is evenly allocated across the parallel channels for transmission. Random action selection is also applied for performance comparison.

Random action selection has the worst performance, while DQN performs best. Compared to the best existing algorithm, the multiarmed bandit, the DQN consumes fewer time slots to complete charging. Under some channel conditions, the myopic solution achieves performance similar to that of the DQN. However, the myopic solution does not perform stably. For example, when the standard deviation of the channel amplitude is , DQN outperforms the myopic solution by . This instability arises because the myopic solution makes its decision based only on the current system state and the current reward, without considering future consequences. Hence, the training effects cannot be guaranteed. Overall, the DQN is superior in both charging time and stability across different channel conditions.
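As a rough illustration of the multiarmed bandit baseline described above, the following minimal sketch applies the standard UCB1 rule to a set of arms and then commits to the arm with the highest estimated reward. It is not the implementation used in the simulations: the function names, the exploration constant `c`, and the reward model in the usage example are all assumptions made for illustration.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Return the arm index chosen by UCB1 at round t (t >= 1).

    counts[i] is how often arm i has been played, means[i] its average reward.
    The exploration constant c is an assumption of this sketch.
    """
    # Play every arm once before the confidence bound is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb1(reward_fn, num_arms, num_rounds, seed=0):
    """Run UCB1 and return (best_arm, counts, means).

    reward_fn(arm, rng) is a hypothetical callback returning the reward
    observed when the transmission pattern `arm` is used for one slot.
    """
    rng = random.Random(seed)
    counts = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(1, num_rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        # Incremental update of the running average reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    # The fixed action that would then be used continuously.
    best_arm = max(range(num_arms), key=means.__getitem__)
    return best_arm, counts, means


if __name__ == "__main__":
    # Hypothetical Bernoulli rewards, one success probability per arm.
    probs = [0.2, 0.5, 0.8, 0.4]
    best, counts, means = run_ucb1(
        lambda arm, rng: 1.0 if rng.random() < probs[arm] else 0.0,
        num_arms=len(probs),
        num_rounds=2000,
    )
    print(best, counts)
```

Because the bandit commits to a single arm, it cannot adapt its beam pattern to the energy already stored in each harvester, which is the limitation the state-dependent DQN policy is meant to overcome.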

To better explain the performance of the optimal policy, Figure 8 plots the action selected by DQN at a particular system state when . When , the optimal action selected by the multiarmed bandit is the third action , which finishes charging both harvesters in around 60 time slots. In contrast, the optimal policy determined by DQN finishes charging in around 43 time slots. Figure 8 shows that the charging process can be divided into two parts: before harvester 1 accumulates 1.2 mJ of energy and harvester 2 accumulates 0.8 mJ of energy, action 4 is mostly selected; after that, action 1 is mostly selected. As defined above, both the amplitude and the phase of the channel are Gaussian distributed with zero standard deviation, and . So when both the amplitude and the phase of the channel change, the simplified channel state information will be distributed around and . As a result, a policy that selects either action 1 or action 4 with different probabilities can perform better than a policy that only selects action 3. Hence, the DQN consumes fewer time slots to fully charge the two energy harvesters.
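The intuition that a state-dependent policy can beat any single fixed action can be checked on a toy example. The sketch below is not the paper's channel model: the per-slot harvested energies, the buffer capacities, and the three actions are made-up integers (in microjoules) chosen only to show the mechanism, and the deterministic switching rule stands in for the learned policy.

```python
def slots_to_charge(policy, harvest, capacity, max_slots=10_000):
    """Count time slots until every harvester's buffer is full.

    policy(energy) returns an action index given the stored energies.
    harvest[action] is the energy delivered per slot to each harvester.
    Returns None if charging does not finish within max_slots.
    """
    energy = [0] * len(capacity)
    for slot in range(1, max_slots + 1):
        gains = harvest[policy(energy)]
        # Buffers saturate at their capacity.
        energy = [min(e + g, cap) for e, g, cap in zip(energy, gains, capacity)]
        if all(e >= cap for e, cap in zip(energy, capacity)):
            return slot
    return None


# Hypothetical numbers: two harvesters, three beam patterns.
CAPACITY = [1600, 1600]   # microjoules, assumed
HARVEST = [
    (50, 20),             # action 0 favors harvester 1
    (20, 50),             # action 1 favors harvester 2
    (25, 25),             # action 2 is a balanced compromise
]


def fixed_balanced(energy):
    """Always use the balanced action, like a fixed-arm strategy."""
    return 2


def switching(energy):
    """Favor harvester 1 until it is full, then favor harvester 2."""
    return 0 if energy[0] < CAPACITY[0] else 1


if __name__ == "__main__":
    print(slots_to_charge(fixed_balanced, HARVEST, CAPACITY))
    print(slots_to_charge(switching, HARVEST, CAPACITY))
```

Under these assumed numbers the balanced fixed action needs 64 slots, while the switching policy needs 52, mirroring qualitatively the two-phase behavior seen in Figure 8.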

In Figure 9, the performance of the DQN is compared with that of the other four algorithms as the number of energy harvesters in the system varies. In general, as the number of energy harvesters increases, all algorithms consume more time slots to complete the wireless charging process. Compared to random action selection, DQN consumes at least fewer time slots to complete the charging. The multiarmed bandit and even power allocation perform very similarly, because the optimal action determined by the multiarmed bandit algorithm is close to the even power allocation strategy. Compared with the two fixed action selection strategies (multiarmed bandit and even power allocation), DQN reduces the time consumption by up to (when the number of energy harvesters is ). The myopic solution is still not the optimal strategy, although the figure shows that it outperforms the two fixed action selection algorithms. Even though the performance difference between DQN and the myopic solution is very small in some conditions (), the myopic solution on average exceeds the time-slot consumption of DQN by more than . Overall, DQN is the best-performing algorithm, consuming the fewest time slots to fully charge all the energy harvesters regardless of their number.

In Figure 10, the number of transmit antennas is increased from to . The number of energy harvesters varies from to . Although the channel conditions between the transmitter and the energy harvesters become more complicated as the number of antennas increases, DQN still outperforms the other four algorithms. Compared with the myopic solution, multiarmed bandit, even power allocation, and random action selection, DQN consumes up to , , , and fewer time slots to complete the wireless charging, respectively. As the number of energy harvesters increases, the advantage of the DQN over the two fixed action selection algorithms becomes more pronounced, because selecting one fixed action is increasingly inefficient in a more complicated, varying channel environment. Even though the myopic solution and DQN perform similarly in some conditions, the myopic solution is not stable across different energy harvester conditions. The results in Figures 9 and 10 demonstrate the superiority of the DQN in reducing the time consumed by wireless power transfer.

5. Conclusions

In this paper, we design the optimal wireless power transfer system for multiple RF energy harvesters. Deep learning methods are used to enable the wireless transmitter to fully charge the energy buffers of all energy harvesters in the shortest time while meeting the information rate requirement of the communication system.

As the channel conditions between the transmitter and the energy harvesters are time-varying and unknown, we model the problem as a Markov decision process. Due to the large number of system states in the model and the difficulty of training, we adopt a deep Q-network approach to find the best transmit strategy for each system state. In the simulation section, multiple experimental environments are explored, and measured real-time data are used to run the simulations. The deep Q-network is compared with four other existing algorithms. The simulation results validate that the deep Q-network is superior to all the other algorithms in terms of the time consumed to complete wireless power transfer.

Data Availability

The simulation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.