#### Abstract

Variational quantum circuits have been proposed for applications in supervised learning and reinforcement learning to harness potential quantum advantage. However, many practical applications in robotics and time-series analysis involve partially observable environments. In this work, we propose an algorithm based on variational quantum circuits for reinforcement learning in partially observable environments. Simulations suggest a learning advantage over several classical counterparts. The learned parameters are then tested on IBMQ systems to demonstrate the applicability of our approach to real-machine-based predictions.

#### 1. Introduction

The combination of deep neural networks and reinforcement learning has been demonstrated to be an effective way to tackle computational problems that are difficult for traditional approaches [1, 2]. In the usual reinforcement learning setting, the underlying model is the Markov decision process (MDP) [3, 4], in which the learning agent can obtain complete information about the environment. However, in many real-world applications such as robotics [5], the observations are obtained from the sensors of mobile robots and hence are limited. In such cases, it is necessary to model the problem as a partially observable Markov decision process (POMDP) [6, 7], a framework for environments in which complete information cannot be obtained. One difficulty that arises for POMDPs in robotics settings is the perceptual aliasing problem, in which the learning agent cannot distinguish one state from another because of its limited observation ability. To make proper decisions under limited observation, the agent has to memorize its history to distinguish one state from another.

One traditional method for POMDPs is belief value iteration [5–7], in which the agent maintains a belief distribution over possible states. The value function then becomes a functional over continuous functions and hence is computationally expensive. To deal with these computational difficulties, other methods using Monte Carlo [5, 8] and recurrent neural networks [9, 10] have been proposed. These methods are difficult to execute without sufficient computing capacity and memory. Complex-valued reinforcement learning has been proposed as a POMDP algorithm that can be executed with fewer computational resources [11]. In complex-valued reinforcement learning, the action value function gives a complex number, and there is another complex number, the internal reference vector, that represents time series information. Tables [11, 12] and neural networks [13, 14] have been used to express complex action value functions. However, expressing correlated complex numbers on a classical computer is not considered a good choice from the viewpoint of memory efficiency. On the other hand, since quantum computers perform calculations over a Hilbert space, it is reasonable to expect that quantum processors can represent complex-valued functions efficiently. If the abovementioned complex action value function can be expressed by a quantum computer, the result may be a more memory-efficient POMDP learning method.

In this work, we propose a method for performing complex-valued reinforcement learning by expressing a complex action value function using a quantum computer. Previous works have applied current quantum hardware [15] to supervised machine learning using the variational quantum circuit method [16–18]. Applications of variational quantum circuits to the value function in reinforcement learning problems have also been demonstrated [19–22]. Some works use classical-quantum hybrid models to solve large problems [23, 24]. Other works use variational quantum circuits to represent the policy in reinforcement learning [25, 26]. Variational quantum circuits have also been applied to represent both the policy and the value function in the actor-critic method [27]. Quantum algorithms have also been applied to sampling from the policy in deep energy-based reinforcement learning [28]. These quantum algorithms for reinforcement learning are based on the MDP. We implement a variational quantum circuit for complex-valued action value function approximation and compare its performance against other methods such as complex neural networks. The learning performance shows an advantage over some classical methods. We further use the parameters learned in simulation for predictions on IBMQ systems. The agent is able to reach the goal state with the predictions made by IBM machines. This result suggests that the use of variational quantum circuits for POMDPs may provide an advantage.

This paper is organized as follows. The Background section introduces the concept of POMDP. In the Methods section, a variational quantum circuit and neural networks are applied to complex-valued Q-learning. In the Results and Discussion section, the results of the partially observable maze environment experiments are shown. The Conclusions section presents conclusions and future work.

##### 1.1. Background

POMDP is a general framework for planning in environments where perfect information cannot be obtained. In a POMDP, the agent cannot fully observe the state but instead receives an observation from the environment, and the observation does not satisfy the Markov property. A POMDP is defined as a tuple $(S, A, T, R, \Omega, O, \gamma)$. $S$ is the set of states. $A$ is the set of actions. $T$ is the state transition probability. $R$ is the reward function. $\Omega$ is the set of observations. $O$ is the observation probability. $\gamma$ is the discount rate.

When the agent executes an action $a$ in the environment, the state transitions from a state $s$ to a next state $s'$ according to the state transition probability $T(s' \mid s, a)$, and the agent receives an observation $o$ from the environment according to the observation probability $O(o \mid s', a)$ and a reward according to the reward function $R(s, a)$. The history is the time series of actions and observations and is expressed as $h_t = (a_0, o_1, \ldots, a_{t-1}, o_t)$. Because the agent cannot fully observe the state, it uses the belief state, which is the probability distribution over the states the agent may be in. The belief state is denoted by $b(s)$, and it can be updated by the following formula:

$$b'(s') = \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}{\sum_{s'' \in S} O(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)}. \tag{1}$$
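The belief update can be sketched in NumPy as follows. This is a minimal illustration of the standard Bayesian filter for a discrete POMDP, not code from this work; the function name `update_belief`, the array layouts of `T` and `O`, and the two-state example model are assumptions introduced here.

```python
import numpy as np


def update_belief(belief, action, observation, T, O):
    """Bayesian belief update for a discrete POMDP.

    belief: shape (S,), current distribution over states.
    T:      shape (A, S, S), T[a, s, s2] = P(s2 | s, a)   (assumed layout).
    O:      shape (A, S, Z), O[a, s2, z] = P(z | s2, a)   (assumed layout).
    """
    # Prediction step: sum over s of T(s' | s, a) * b(s).
    predicted = belief @ T[action]
    # Correction step: weight each next state by the observation likelihood.
    unnormalized = O[action][:, observation] * predicted
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total


# Hypothetical model: two states, one action, two observations.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.7, 0.3],
               [0.1, 0.9]]])
b0 = np.array([0.5, 0.5])
b1 = update_belief(b0, action=0, observation=1, T=T, O=O)
```

The cost of each update grows with the square of the number of states, which is one reason belief-based planning becomes expensive for larger problems.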

The agent's policy is denoted by or . The purpose of the agent is to get a policy that maximizes the expected total reward .

There are two types of methods for planning in POMDPs. One is based on the value iteration method [7, 29]. This type of method updates the belief state and the value function using a model. The belief state is updated by equation (1), and the value function is updated by updating the set of alpha vectors. The other uses a black box simulator [8], a model that receives a state and an action and returns the next state, observation, and reward. This method executes Monte Carlo tree search and Monte Carlo updates of the belief state using the simulator. Both planning methods rely on a model or a black box simulator. Because a model is normally not given to the agent, algorithms that learn the model or that do not use a model are needed. The former are model-based methods, and the latter are model-free methods.

Model-based methods infer the model and then execute planning using the learned model. To learn the model, Bayesian methods [30, 31] and a nonparametric approach [32] have been proposed. Model-free methods include RNN-based methods [9, 10, 33] and complex-valued reinforcement learning [11–13]. These methods do not use the belief state but incorporate time series information directly into the value function. In this paper, we focus on model-free methods, especially complex-valued reinforcement learning.

#### 2. Methods

##### 2.1. Complex-Valued Q-Learning for POMDP

A POMDP problem that we are interested in here can be described by a tuple $(S, A, T, R, \Omega, O, \gamma)$. $S$ is a discrete set of states. $A$ is a discrete set of actions. $T(s' \mid s, a)$ is a state transition probability matrix for $s, s' \in S$ and $a \in A$. $R(s, a)$ is a reward function for $s \in S$ and $a \in A$. $\Omega$ is a discrete set of observations. $O(o \mid s', a)$ is an observation probability matrix for $o \in \Omega$, $s' \in S$, and $a \in A$. $\gamma$ is the discount rate. In the examples in this work, both the state transition and the observation are deterministic. We look for a policy that maximizes the expected future cumulative reward. The policy can be derived from the action value Q-function in Q-learning.

Complex-valued Q-learning is based on the iteration:where is the complex-valued observation-action value function. The dot notation for some quantity is used throughout the manuscript to remind us that the quantity is complex-valued. and are the observation and action at time , respectively. The learning rate is a real number. The reward is a real number. is the discounted rate, which is a real number. is a complex hyperparameter for some real constant . is defined as

Here, means the complex conjugation of . denotes taking the real part of a complex number. The learned time series is reproduced by updating each complex number in the opposite phase direction. An eligibility trace method is further implemented such that the action value function is updated according to

where for and is the trace length. This update rule is exact for the table method, where the complete Q-table must be stored in memory. In the variational approaches, the function is variationally optimized by minimizing the functional:

with gradient descent . Finally, the policy is the Boltzmann stochastic policy:

where is a temperature hyperparameter.
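For orientation, the tabular form of this scheme can be sketched as below. The sketch follows the formulation commonly given in the complex-valued reinforcement learning literature [11] and makes several assumptions that are not confirmed by the text above: the phase factor is taken as `beta = exp(i * omega)`, the trace weight for an offset `k` is `beta ** (k + 1)`, and the reference value used for action selection is the previous action value divided by `beta`. The class name `ComplexQTable` and all hyperparameter defaults are hypothetical.

```python
import cmath
import math
from collections import defaultdict


class ComplexQTable:
    """Tabular complex-valued Q-learning with a short eligibility trace."""

    def __init__(self, n_actions, alpha=0.25, gamma=0.9,
                 omega=math.pi / 6, trace_len=3, temperature=20.0):
        self.q = defaultdict(complex)       # (observation, action) -> complex value
        self.n_actions = n_actions
        self.alpha = alpha
        self.gamma = gamma
        self.beta = cmath.exp(1j * omega)   # unit-modulus phase factor (assumed form)
        self.trace_len = trace_len
        self.temperature = temperature

    def effective_value(self, obs, action, reference):
        # Real part of Q(o, a) times the complex conjugate of the reference value.
        return (self.q[(obs, action)] * reference.conjugate()).real

    def policy(self, obs, reference):
        # Boltzmann distribution over the real-valued effective values.
        prefs = [self.effective_value(obs, a, reference) / self.temperature
                 for a in range(self.n_actions)]
        top = max(prefs)
        weights = [math.exp(p - top) for p in prefs]
        total = sum(weights)
        return [w / total for w in weights]

    def update(self, history, reward, next_obs):
        """history: list of (observation, action) pairs, most recent last."""
        obs, action = history[-1]
        reference = self.q[(obs, action)] / self.beta    # assumed reference value
        best = max(range(self.n_actions),
                   key=lambda a: self.effective_value(next_obs, a, reference))
        target = reward + self.gamma * self.q[(next_obs, best)]
        recent = history[-self.trace_len:]
        for k, (o, a) in enumerate(reversed(recent)):
            u = self.beta ** (k + 1)        # phase shift grows with the offset k
            self.q[(o, a)] = ((1 - self.alpha) * self.q[(o, a)]
                              + self.alpha * target * u)


agent = ComplexQTable(n_actions=2, alpha=0.5, omega=math.pi / 2, trace_len=2)
agent.update(history=[(0, 1), (1, 0)], reward=100.0, next_obs=2)
probs = agent.policy(obs=1, reference=1j)
```

Because the same observation can carry different phases at different points of the history, two aliased states can prefer different actions even though they index the same table row.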

##### 2.2. Variational Quantum Circuit for Q-Function

The function is constructed by the expectation value:where is a scaling constant. is the set of trainable variational parameters for trainable unitary . The quantity will be called the circuit depth. is a set of input parameters which encode the observation and action . parameterizes the input layer local unitaries. is an output unitary to be measured by Hadamard test [34] against the output wave function . is parameterized by some trainable and the action . The circuit structure is depicted in Figure 1. We implement three types of encoding schemes, which are summarized in the following paragraphs.
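A minimal state-vector sketch of how an expectation value of this kind can be evaluated classically is given below. The two-qubit size, the RX input rotations, the RZ/RX trainable layer followed by a CNOT, and the RZ output unitary are all assumptions chosen for illustration; they are not the circuits of Figure 1 or Figure 2, and `q_value` and its arguments are hypothetical names.

```python
import numpy as np


def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def rz(theta):
    return np.array([[np.exp(-1j * theta / 2), 0],
                     [0, np.exp(1j * theta / 2)]])


CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)


def q_value(input_angles, thetas, out_angles, scale=1.0):
    """Return scale * <psi| U_out |psi> for a two-qubit toy circuit.

    input_angles: two RX angles encoding the input.
    thetas:       array of shape (depth, 2, 2); per layer and qubit (RZ, RX) angles.
    out_angles:   two RZ angles defining the output unitary.
    """
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0                                        # start in |00>
    state = np.kron(rx(input_angles[0]), rx(input_angles[1])) @ state
    for layer in thetas:
        local = np.kron(rx(layer[0][1]) @ rz(layer[0][0]),
                        rx(layer[1][1]) @ rz(layer[1][0]))
        state = CNOT @ (local @ state)                    # rotations, then entangle
    u_out = np.kron(rz(out_angles[0]), rz(out_angles[1]))
    # np.vdot conjugates its first argument, giving <psi| U_out |psi>.
    return scale * np.vdot(state, u_out @ state)


q_trivial = q_value([0.0, 0.0], np.zeros((1, 2, 2)), [0.4, 0.6])
```

The result is a complex number whose modulus is bounded by the scaling constant, since the overlap of a normalized state with a unitary image of itself has modulus at most one.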

In Type 1 and Type 3 quantum circuits, the input layer is , where the index means the local rotation is acting on -th qubit. The parameters and are uniform for all the qubits (independent of qubit index ). For Type 1 circuit, the encoding is and , where is the rescaled observation.

For Type 3 circuit, the encoding is and , where . In Type 2 quantum circuit, the input layer is , where the subscript means the local rotation is acting on -th qubit. The encoding is done by binary representation of observation for binary numbers . The observation is then encoded as state with some phase factors by choosing and .
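As a rough illustration of the binary encoding idea, the sketch below writes an integer observation in binary and flips the qubits whose bit is 1. Using an RX rotation by $\pi$ times the bit value is an assumption made here for illustration (it produces the basis state up to a phase factor); the actual rotation parameters of the Type 2 circuit are not reproduced. The helper names `observation_bits` and `encode_observation` are hypothetical.

```python
import numpy as np


def observation_bits(observation, n_qubits):
    """Binary digits of the observation, most significant bit first."""
    if not 0 <= observation < 2 ** n_qubits:
        raise ValueError("observation does not fit in the given number of qubits")
    return [(observation >> (n_qubits - 1 - j)) & 1 for j in range(n_qubits)]


def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def encode_observation(observation, n_qubits):
    """State vector after applying RX(pi * bit) to each qubit of |0...0>."""
    zero = np.array([1.0, 0.0], dtype=complex)
    state = np.array([1.0 + 0.0j])
    for bit in observation_bits(observation, n_qubits):
        state = np.kron(state, rx(np.pi * bit) @ zero)
    return state


encoded = encode_observation(5, n_qubits=3)   # 5 is 101 in binary
```

With this choice, each flipped qubit contributes a factor of $-i$, so the encoded state equals the computational basis state of the observation up to a global phase.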

For the output unitary, Type 3 uses , where the subscript means the local rotation is acting on -th qubit. All the , , and are trainable. For Type 1 and Type 2, we use for an action . Figure 2 depicts the quantum circuits encoding used in this work.

All circuit parameters are updated by gradient descent with the loss function (5). The gradient of this loss function with respect to each parameter is calculated by back propagation on a classical simulator. Because a quantum circuit computes its output state as the product of the gate matrices and the input state, this calculation is a special case of a complex-valued neural network. Therefore, the back propagation method of complex-valued neural networks [35] can be used for the quantum circuit.

The gradient of loss function (5) with respect to function is calculated bywhere .

Using the definition , the gradient calculation of expectation value (7) is as follows:

Here, the gradient is obtained from equation (8). Figure 3(a) shows this gradient calculation. The gradients of the outputs and are obtained. As the output is obtained by applying some gates with parameters to the input , the gradients of the parameters are obtained by back propagating . Similarly, as the output is obtained by applying some gates with parameters to the input , the gradients of the parameters are calculated by back propagating . These back propagations are almost the same as neural network back propagation; the difference is that quantum gates are used. When a gate with parameter is applied to the input , the output is obtained by , where is the matrix transformed from the gate matrix to calculate the inner product. The gradient is obtained by

where is the gradient of the loss function with respect to the output . The gradient with respect to the parameter is calculated by

where represents the element of the matrix . Figure 3(b) shows the back propagation of the gate calculation. The gradient of the loss with respect to the last output is back propagated by equation (10), and the parameters are updated using the gradient of equation (11).

**Figure 3:** (a) Gradient calculation of the expectation value. (b) Back propagation of the gate calculation.

##### 2.3. Neural Networks

For experimental comparison, we also implement complex-valued neural networks. The details of their architecture are presented in this section. We use two types of neural networks in the experiments. Both Type 1 and Type 2 neural networks have 3 layers: one input layer, one hidden layer with 30 neurons, and one output layer. In the Type 1 neural network, there are two input layer neurons and one output layer neuron. Both the action and the observation are encoded in the input layer neurons. The output neuron then gives the complex action value for that observation-action pair. In the Type 2 neural network, there is one input neuron and one output neuron per action. Only the observation is encoded in the input layer neuron. Each output neuron then gives the complex action value for one action. The networks are depicted in Figure 4. We use a learning rate of 0.0001 for the hidden layer and 0.001 for the output layer. We defined the Type 1 architecture based on [13] and the Type 2 architecture for comparison. We update the parameters by gradient descent using back propagation [35]. In complex-valued neural networks, a fully connected layer is calculated by

where is the input vector. is the weight matrix. is the bias vector. is the weighted sum of the input. is the output vector. They are all complex-valued. is the activation function for complex numbers and is defined by

where is an activation function for real numbers.

The back propagation equation to calculate the gradient of the loss function is

where means the complex conjugation of . is the gradient of the loss with respect to the output vector of this layer. The parameters are updated by gradient descent using these gradients.
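The forward pass of such a network can be sketched in NumPy as follows. The sketch assumes a split-type activation, in which a real activation is applied separately to the real and imaginary parts, and uses `tanh` as that real activation; both choices, the linear output layer, the random weights, and the names `split_activation` and `complex_dense` are assumptions for illustration. The sizes follow the Type 2 network: one input neuron, 30 hidden neurons, and one output per action for four actions.

```python
import numpy as np


def split_activation(z, real_fn=np.tanh):
    """Apply a real activation separately to the real and imaginary parts."""
    return real_fn(z.real) + 1j * real_fn(z.imag)


def complex_dense(x, W, b, real_fn=np.tanh):
    """Fully connected complex layer: weighted sum, then split activation."""
    z = W @ x + b
    return split_activation(z, real_fn)


rng = np.random.default_rng(0)
n_hidden, n_actions = 30, 4

# Randomly initialized complex weights; the initialization scheme is arbitrary here.
W1 = rng.normal(size=(n_hidden, 1)) + 1j * rng.normal(size=(n_hidden, 1))
b1 = np.zeros(n_hidden, dtype=complex)
W2 = rng.normal(size=(n_actions, n_hidden)) + 1j * rng.normal(size=(n_actions, n_hidden))
b2 = np.zeros(n_actions, dtype=complex)

x = np.array([0.3 + 0.0j])          # a rescaled observation (hypothetical value)
hidden = complex_dense(x, W1, b1)
q_values = W2 @ hidden + b2         # linear output layer (an assumption)
```

Each entry of `q_values` plays the role of one complex action value, so a single forward pass scores all actions for the given observation.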

##### 2.4. The Maze Environment

The partially observable maze environment used in our experiments is depicted in Figure 5. We defined the environment with reference to the maze environments of [11, 12]. In particular, the structure is the same as that of the maze in [12]. Each cell in the maze represents a possible state. The agent starts from the start state and aims for the goal state. There are four possible actions . The walls of the maze are represented by cells in dark gray. If the agent executes an action that would move it into a wall, its state does not change. The numbers in the cells are the observations . Cells painted in the same color are different states with the same observation (except for white cells). The reward is +100 if the agent reaches the goal state and 0 otherwise. Each episode ends when the agent reaches the goal state or when the number of time steps reaches a maximum value.
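The rules above can be sketched as a small environment class. The layout and the observation map below are a hypothetical three-cell corridor, not the maze of Figure 5, and the class name `AliasedMaze`, the action labels, and the `step` return convention are assumptions introduced for illustration.

```python
class AliasedMaze:
    """Grid maze in which several cells can share the same observation."""

    # Row and column offsets of the four moves; the labels are assumed.
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, layout, observations, max_steps=1000, goal_reward=100.0):
        # The layout is assumed to be fully enclosed by wall cells ("#").
        self.layout = layout
        self.observations = observations
        self.max_steps = max_steps
        self.goal_reward = goal_reward
        self.start = self._find("S")
        self.goal = self._find("G")
        self.reset()

    def _find(self, symbol):
        for r, row in enumerate(self.layout):
            c = row.find(symbol)
            if c >= 0:
                return (r, c)
        raise ValueError(f"layout has no {symbol!r} cell")

    def reset(self):
        self.position = self.start
        self.steps = 0
        return self.observations[self.position]

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.position[0] + dr, self.position[1] + dc
        if self.layout[r][c] != "#":    # moving into a wall leaves the state unchanged
            self.position = (r, c)
        self.steps += 1
        reached = self.position == self.goal
        reward = self.goal_reward if reached else 0.0
        done = reached or self.steps >= self.max_steps
        return self.observations[self.position], reward, done


# Hypothetical corridor: the start cell and its neighbor share observation 0.
LAYOUT = ["#####",
          "#S.G#",
          "#####"]
OBSERVATIONS = {(1, 1): 0, (1, 2): 0, (1, 3): 1}
env = AliasedMaze(LAYOUT, OBSERVATIONS)
```

In this toy corridor the agent sees observation 0 in two different cells, which is exactly the perceptual aliasing that a memoryless policy cannot resolve.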

#### 3. Results and Discussion

We perform numerical simulations to compare the learning curves of different approaches. Figure 6(a) shows the learning curves for the Q-table (blue solid line with circle marker), complex neural networks (orange solid line and green dashed line), quantum circuits (red dotted line, purple dashed-dotted line, and brown solid line with “+” cross marker), and long short-term memory (LSTM, pink solid line with “*x*” cross marker). The vertical axis is the number of steps to the goal, so a lower value means a better policy. The data are the average of 50 independent runs. Each episode has a maximum of 1000 steps. The total number of episodes is 5000, and the data are further smoothed by averaging over 500 sequential episodes. In all the learning experiments, the trace length is . For LSTM, the history sequence length is 6. All the quantum circuits use 6 qubits with depth = 3. All the neural networks have one hidden layer consisting of 30 neurons. We observe that the Q-table method has the most stable learning curve, and its learned policy reaches the goal in fewer steps. This is expected since the Q-table does not use any approximation in the representation of the Q-function. The Type 3 quantum circuit performs poorly: the learned policy does not noticeably reduce the number of steps to the goal. The performance of the Type 1 quantum circuit is not significantly different from that of the classical complex neural network approaches. We note that the Type 2 quantum circuit provides the best result among all the approximation schemes. It is even better than the LSTM approach in our experiments.
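The averaging and smoothing of the learning curves can be sketched as follows. The sketch assumes that "averaging over 500 sequential episodes" means a sliding window; a non-overlapping block average would be an equally plausible reading. The function name `smooth_learning_curve` and the synthetic data are hypothetical.

```python
import numpy as np


def smooth_learning_curve(steps_per_episode, window=500):
    """Average over runs, then apply a sliding-window mean over episodes.

    steps_per_episode: array of shape (n_runs, n_episodes) holding the number
    of steps to the goal in each episode of each independent run.
    """
    mean_curve = np.asarray(steps_per_episode, dtype=float).mean(axis=0)
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the window fits entirely in the data.
    return np.convolve(mean_curve, kernel, mode="valid")


# Synthetic stand-in for 50 runs of 5000 episodes, capped at 1000 steps.
rng = np.random.default_rng(0)
fake_runs = rng.integers(1, 1001, size=(50, 5000))
curve = smooth_learning_curve(fake_runs)
```

With 5000 episodes and a window of 500, the smoothed curve has 4501 points.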

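**Figure 6:** (a) Learning curves of the different methods. (b) Learning curves of the Type 2 quantum circuit for different trace numbers.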

We compare the results for various hyperparameters. Figure 6(b) shows the learning curves for the Type 2 quantum circuit with different trace numbers. We observe that the performance is improved by using trace number 4 (green dashed line) rather than trace number 1 (blue solid line with circle marker) or 2 (orange solid line). However, the performance for trace number 10 (brown solid line with cross marker) is also poor. We observe the best learning curves for trace numbers around 6 (red dotted line).

Since the best performance could be obtained by Type 2 circuit with trace number around 6, we then fix the circuit type to Type 2 and trace number to 6 and compare the learning curves for different widths and depths. Figure 7(a) shows the learning curves for various circuit depths. It is observed that the learning curve could be improved by increasing circuit depth from 1 (blue solid line), 2 (orange dashed line), 3 (green dotted line), to 4 (red solid line with circle marker) for . Figure 7(b) shows the learning curves for various circuit widths with fixed circuit depth = 4. We observe that higher circuit width makes the learning task more difficult. The learning curve for (blue solid line) is better than that of (orange dashed line) and (green dotted line).

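**Figure 7:** (a) Learning curves for various circuit depths. (b) Learning curves for various circuit widths at fixed circuit depth 4.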

We then take an offline learning scheme to compare the prediction performance of different machines. That is, the parameters are obtained from a training process based on a state-vector simulator, and the predictions are made using several different backends. The test results are depicted in Figure 8. We test the predictions by using a NumPy-based state-vector simulator [36], the QASM simulator provided by Qiskit [37], and the IBMQ system ibmq_guadalupe [38]. The horizontal axis “episode” indicates the number of episodes used for training. Hence, episodes = 1000 means that 1000 episodes of training are performed, and then the learned parameters are used for the corresponding prediction experiment. Each data point is the average of 5000 runs. The 5000 runs consist of 50 independently learned parameter sets with 100 prediction experiments for each set. The number of shots for the Hadamard-test estimation is 4096 for the QASM simulator. The quantum circuit type is Type 2. The number of qubits is 5, and the circuit depth is 4. The trace number is 6. For the IBM Q system, a set of parameters trained with 5000 episodes is used for the prediction experiments. The number of shots is 1024. Five experiments are executed, and the numbers of steps to reach the target state are 67, 1000, 502, 1000, and 1000, respectively. Since the maximum number of steps is 1000, “1000 steps” means that the agent does not reach the target in the experiment.
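The effect of a finite number of shots on a Hadamard-test estimate can be illustrated with the sketch below. It does not simulate the ancilla circuit gate by gate; it only samples the ancilla outcome from the standard Hadamard-test probabilities, $(1 + \mathrm{Re}\langle\psi|U|\psi\rangle)/2$ for the real part and $(1 + \mathrm{Im}\langle\psi|U|\psi\rangle)/2$ for the imaginary part when an extra $S^\dagger$ gate is applied to the ancilla. The function name `hadamard_test_estimate` and the single-qubit example are assumptions, and hardware noise is not modeled.

```python
import numpy as np


def hadamard_test_estimate(state, unitary, shots, rng):
    """Shot-based estimate of <psi| U |psi> from simulated ancilla statistics."""
    overlap = np.vdot(state, unitary @ state)       # exact value, used for sampling
    # Probability of measuring the ancilla in 0 for the two circuit variants.
    p0_real = np.clip((1.0 + overlap.real) / 2.0, 0.0, 1.0)
    p0_imag = np.clip((1.0 + overlap.imag) / 2.0, 0.0, 1.0)
    n0_real = rng.binomial(shots, p0_real)
    n0_imag = rng.binomial(shots, p0_imag)
    # Invert p0 = (1 + x) / 2 to recover each component from the counts.
    return complex(2.0 * n0_real / shots - 1.0, 2.0 * n0_imag / shots - 1.0)


rng = np.random.default_rng(1)
state = np.array([1.0, 0.0], dtype=complex)
unitary = np.diag([np.exp(-0.35j), np.exp(0.35j)])   # a single-qubit Z rotation
exact = np.vdot(state, unitary @ state)
estimate = hadamard_test_estimate(state, unitary, shots=4096, rng=rng)
```

The statistical error of each component scales as the inverse square root of the number of shots, so an estimate with 1024 shots is roughly twice as noisy as one with 4096 shots.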

#### 4. Conclusions

In this work, we propose a quantum circuit algorithm for POMDPs based on complex-valued Q-learning. We implement several encoding schemes and compare them to classical approaches by numerical simulations. The observed learning curves suggest that, with a suitable encoding, the learning efficiency of quantum complex-valued Q-learning can be better than that of classical methods such as complex-valued neural networks. The performance of the quantum circuit can be further improved by a suitable choice of hyperparameters. The parameters learned on simulators are then tested on IBMQ systems. The agent is able to reach the goal state with predictions made by real quantum machines. Our results provide a new method for POMDP problems with potential quantum advantage. The partially observable maze environment used in this work is a discrete state environment. In future work, our proposed method could be applied to other simulation problems, such as the partially observable continuous space environment Mountain Car [4, 39]. Since the encoding scheme that shows the best result in this work applies only to discrete spaces, solving continuous problems without discretization will require a new encoding scheme for continuous environments.

#### Data Availability

The code used to support the findings of this study has been deposited in the GitHub repository https://github.com/tomo920/QC-ComplexValuedRL.

#### Disclosure

The views expressed are those of the authors and do not reflect the official policy or position of IBM or the IBM Q team.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors gratefully acknowledge the funding from New Energy and Industrial Technology Development Organization (NEDO) under grant number NEDO P20015. The authors acknowledge use of the IBM Q for this work.