Abstract
Autonomous driving is a popular and promising field in artificial intelligence. Rapidly deciding the next action from the latest few actions and states, such as acceleration, braking, and steering angle, is a major concern for autonomous driving. Learning methods such as reinforcement learning can learn this decision automatically; however, they usually require a large volume of samples. In this paper, to reduce the sample size, we exploit the deep Gaussian process, in which a regression model is trained on small datasets and captures the most significant features correctly. In addition, to realize real-time, closed-loop control, we incorporate feedback control into the process. Experimental results on the Torcs simulation engine show that smooth driving on a virtual road can be achieved. Our method uses only 0.34% of the training data required by deep reinforcement learning and obtains similar simulation results. It may be useful for real road tests in the future.
1. Introduction
Autonomous driving is one of the most promising fields of artificial intelligence [1, 2]. To drive safely on real roads, an ego-vehicle needs to recognize and track objects with its perceptual equipment [3] and to act properly according to the current road conditions through its decision-making module. The decision-making module is the most important part of self-driving, yet also the most challenging to achieve. Its core mission mainly includes obstacle avoidance, trajectory planning, and action prediction [4, 5]. Decision-making models are commonly built with rule-based [6] or statistical [7] methods. Rule-based methods can implement functionality quickly, but they are limited by incomplete sets of states and by their inability to capture uncertainty. These shortcomings can be overcome by combining them with statistical methods. In addition, with the advent of simulation engines such as Torcs [8] and Carla [9], various methods based on reinforcement learning [10] have been proposed in decision-making research and have achieved satisfactory performance. In typical reinforcement learning models, the agent begins without prior knowledge about the world, knowing only which actions are possible, and it is expected to learn the skill solely by interacting with the environment and receiving rewards after taking actions. Because of this, it requires huge amounts of time and data to learn specific tasks, such as board games [11] or self-driving [10]. However, real-world datasets are often unsatisfactory because of insufficient diversity. Moreover, building a precisely labeled dataset from the real world is labor-intensive and time-consuming, especially in large-scale complex transportation systems [12], not to mention building a dataset with specific features that meet our needs.
To address this problem, increasing the diversity of virtual datasets is a practical way to improve the performance of trained models when training data are insufficient [12, 13]. In this paper, we propose another feasible way to solve the decision-making problem in autonomous driving, which directly utilizes small datasets without the aid of visual samples.
In recent years, the Gaussian process (GP) has become one of the prevailing regression techniques [14]. To be precise, a GP is a distribution over functions such that any finite set of function values has a joint Gaussian distribution. The predicted mean function and covariance function are used for regression and uncertainty estimation, respectively. The strength of GP regression lies in avoiding overfitting while still being able to find functions complex enough to describe any observed behavior, even in unstructured or noisy data. GP is commonly used when observations are expensive or rare and methods such as deep neural networks perform poorly. It has been applied in a wide range of fields, from engineering [15], optimization [16], robotics [17], and physics [18] to biology [19]. Nevertheless, the kinds of phenomena that can be easily expressed by using a GP directly are limited. For example, in a sparse data scenario, the constructed probability distribution is often far from the true posterior distribution. Recognizing this problem, many research efforts have attempted to represent new properties via the hierarchical cascading of Gaussians. Inspired by the widespread success of deep neural network architectures, Damianou and Lawrence proposed composing a GP directly with another GP; applying this idea recursively leads to the so-called deep Gaussian process (deep GP) [20]. A deep GP consists of a cascade of hidden layers of latent variables, where each node acts as output for the layer above and as input for the layer below. GPs govern the mappings between the layers, each with its own kernel. Therefore, the deep GP retains valuable properties of the GP, such as well-calibrated predictive uncertainty estimation and nonparametric modeling power.
In addition, it employs a hierarchical structure of GP mappings which makes it more flexible, has a greater capacity to generalize, and provides better predictive performance. This model is fascinating because it can potentially discover layers of increasingly abstract data representations, while handling and propagating uncertainty in the hierarchy at the same time [21].
Undoubtedly, in keeping with the nature of Bayesian statistics, the deep GP model makes predictions based on a statistical average. However, because of model uncertainty, we cannot ensure that these statistical averages are reasonable with such a small training set. Indeed, our results show that the deep GP model alone is not enough to solve the decision-making problem in our setting. We therefore introduce a feedback control method to compensate for this shortcoming.
Based on the analysis above, in this paper, we propose a decision-making framework that combines the deep GP with a feedback control method. The deep GP model makes action prediction possible, and the feedback control method corrects errors caused by action uncertainty. When a state is fed into the framework, the action is obtained immediately. It is a kind of end-to-end learning method [22]. The method is trained with a small dataset collected in Torcs and tested in Torcs. According to our results, in terms of time consumption and data volume, our method is superior to deep reinforcement learning trained by the deep deterministic policy gradient (DDPG) method [23].
2. Related Work
2.1. Deep Reinforcement Learning
As mentioned above, a self-driving vehicle is a decision-making system that processes information from various sources, such as cameras, radars, LiDARs, GPS units, and inertial sensors. This information is used by the vehicle’s system to make driving decisions. The architecture can be implemented either as a sequential perception-planning-action pipeline or as an end-to-end system. Recent works mainly focus on the deep reinforcement learning paradigm to achieve self-driving. Existing reinforcement learning algorithms mainly comprise value-based and policy-based methods. Vanilla Q-learning was proposed first and has become one of the popular value-based methods. Karavolos applies the vanilla Q-learning algorithm to the Torcs simulator and evaluates the effectiveness of using heuristics during exploration [24]. Since the resurgence of deep neural networks, many variants of the Q-learning algorithm, such as DQN [25], Double DQN, and Dueling DQN [26], have been successfully applied to a variety of games and have outperformed humans.
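As a minimal sketch of the value-based idea, a single tabular Q-learning update can be written as follows. The state and action labels, the learning rate `alpha`, and the discount factor `gamma` are illustrative assumptions and are not taken from [24].

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Greedy value of the next state over all candidate actions.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the TD target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical discrete steering actions and states, for illustration only.
q_table = defaultdict(float)
actions = ["left", "straight", "right"]
new_value = q_learning_update(q_table, "s0", "straight", 1.0, "s1", actions)
```

With an all-zero table, the first update moves the estimate by `alpha` times the immediate reward.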
Unlike value-based methods, policy-based methods learn the policy directly. In other words, policy-based methods output an action given the current state. Silver et al. [27] propose the DPG algorithm to handle continuous action spaces efficiently without losing adequate exploration. By combining ideas from DQN and actor-critic, Lillicrap et al. [23] then propose a model-free deep DPG (DDPG) approach and achieve end-to-end policy learning. In 2016, a new technique that combines policy gradient and off-policy Q-learning (PGQL) was proposed and achieved performance exceeding that of both asynchronous advantage actor-critic and Q-learning on the full suite of Atari games [28]. All these policy-gradient methods can naturally handle continuous action spaces. Despite the validity and practicability of reinforcement learning, its training time is long, and the volume of training data it needs is a weak point when enough data cannot be obtained.
2.2. Feedback Control Method
In addition, traditional control methods also play an important role in solving the self-driving problem. Automatic control is almost the last stage in the pipeline of an autonomous vehicle and one of the most critical tasks, since it is responsible for the vehicle's movement. Well-known controllers mainly include the PID (proportional integral derivative) controller and the MPC (model predictive control) controller [29]. A PID controller is a practical component used in industrial control applications to regulate pressure, speed, temperature, and other core variables [30]. The PID controller uses a control-loop feedback mechanism to control process variables, and it is widely regarded as an accurate and stable controller. It is so named because its output is the sum of three terms (proportional, integral, and derivative). Each of these terms depends on the error between the desired input and the measured output.
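The three-term structure of a PID controller can be illustrated with a short discrete-time sketch. The gains, the time step, and the class name below are assumptions chosen for illustration; they are not parameters used in this paper.

```python
class PIDController:
    """Discrete-time PID controller: output = P term + I term + D term."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement, dt):
        error = setpoint - measurement
        # Integral term accumulates the error over time.
        self.integral += error * dt
        # Derivative term uses a finite difference; zero on the first call.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains, for illustration only.
pid = PIDController(kp=2.0, ki=0.5, kd=0.1)
u1 = pid.step(setpoint=1.0, measurement=0.0, dt=0.1)
u2 = pid.step(setpoint=1.0, measurement=0.5, dt=0.1)
```

The second call shows the derivative term acting against a rapidly shrinking error, which damps the response.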
In contrast, the MPC controller relies on dynamic models of the process, most commonly linear empirical models obtained through system identification. The main advantage of MPC is that it optimizes the current time step while also taking future time steps into account. This is achieved by optimizing over a finite time horizon, implementing only the control for the current time step, and then repeating the optimization at the next step.
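The receding-horizon principle can be sketched with a brute-force search over a small set of candidate controls. The one-dimensional model `x_next = x + u`, the quadratic cost, and the candidate control set are assumptions made only to keep the example small; a practical MPC would use an identified model and a numerical optimizer.

```python
from itertools import product


def mpc_step(x, horizon=3, controls=(-1.0, 0.0, 1.0)):
    """Return the first control of the cheapest sequence over the horizon."""
    best_cost, best_first = float("inf"), 0.0
    for seq in product(controls, repeat=horizon):
        state, cost = x, 0.0
        for u in seq:
            state = state + u                  # assumed toy model
            cost += state ** 2 + 0.1 * u ** 2  # assumed quadratic cost
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


# Closed loop: optimize, apply only the first control, then optimize again.
x = 3.0
trajectory = [x]
for _ in range(5):
    u = mpc_step(x)
    x = x + u
    trajectory.append(x)
```

Only the first element of each optimized sequence is applied, which is what distinguishes receding-horizon control from open-loop planning.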
2.3. Gaussian Process
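GP is a Bayesian nonparametric machine learning framework for regression, classification, and unsupervised learning [14]. A GP is a collection of random variables $\{f(\mathbf{x})\}$, any finite combination of which follows a multivariate normal distribution. Suppose that a set $\mathbf{y}$ of noisy observed outputs $y_n = f(\mathbf{x}_n) + \epsilon_n$ (where $\epsilon_n$ is independent and identically distributed (i.i.d.) Gaussian noise with variance $\sigma^2$) is available for a set $\mathbf{X}$ of $N$ training inputs. Then, the set $\mathbf{f}$ of latent outputs is assumed to follow a Gaussian prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}_{XX})$, where $\mathbf{K}_{XX}$ is a covariance matrix with components $k(\mathbf{x}_n, \mathbf{x}_{n'})$ for $\mathbf{x}_n, \mathbf{x}_{n'} \in \mathbf{X}$. Since the data likelihood can be written as $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 \mathbf{I})$, the GP predictive distribution of the latent outputs $\mathbf{f}_*$ for any test inputs $\mathbf{X}_*$ can be computed in closed form by integrating $p(\mathbf{f}_* \mid \mathbf{y}) = \int p(\mathbf{f}_* \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{y})\, d\mathbf{f}$, where $p(\mathbf{f} \mid \mathbf{y})$ is the posterior distribution. Because the covariance matrix must be inverted, training the GP model needs $\mathcal{O}(N^3)$ operations, which prevents it from scaling well to massive datasets. To improve scalability, sparse GP (SGP) models exploit a set $\mathbf{u}$ of inducing output variables for some small set $\mathbf{Z}$ of inducing inputs (i.e., $|\mathbf{Z}| \ll |\mathbf{X}|$). Then, the joint probability of $\mathbf{y}$, $\mathbf{f}$, and $\mathbf{u}$ is as follows:

$$p(\mathbf{y}, \mathbf{f}, \mathbf{u}) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{u})\, p(\mathbf{u}),$$

where $p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}(\mathbf{f} \mid \mathbf{K}_{XZ}\mathbf{K}_{ZZ}^{-1}\mathbf{u},\ \mathbf{K}_{XX} - \mathbf{K}_{XZ}\mathbf{K}_{ZZ}^{-1}\mathbf{K}_{ZX})$, $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \mathbf{K}_{ZZ})$, $\mathbf{K}_{XZ} = \mathbf{K}_{ZX}^{\top}$, and $\mathbf{u}$ is treated as a column vector here. $\mathbf{K}_{ZX}$ and $\mathbf{K}_{ZZ}$ represent covariance matrices with components $k(\mathbf{z}, \mathbf{x})$ for $\mathbf{z} \in \mathbf{Z}$ and $\mathbf{x} \in \mathbf{X}$ and $k(\mathbf{z}, \mathbf{z}')$ for $\mathbf{z}, \mathbf{z}' \in \mathbf{Z}$, respectively. The SGP predictive belief can also be computed in closed form by marginalizing out $\mathbf{u}$: $p(\mathbf{f}_* \mid \mathbf{y}) = \int p(\mathbf{f}_* \mid \mathbf{u})\, p(\mathbf{u} \mid \mathbf{y})\, d\mathbf{u}$. A unifying view of the SGP model can be found in [31, 32].
Inference for the GP model is analytically tractable when the likelihood is Gaussian. For non-Gaussian likelihoods, an approximation is required. Titsias [33] proposed a seminal variational inference (VI) framework that approximates the joint posterior distribution $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$ with a variational posterior $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})$ by minimizing the Kullback–Leibler (KL) divergence between them: $\mathrm{KL}[q(\mathbf{f}, \mathbf{u}) \,\|\, p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})]$. This procedure is equivalent to maximizing the evidence lower bound (ELBO) of the log-marginal likelihood [32, 34]:

$$\mathrm{ELBO} = \mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y} \mid \mathbf{f})] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})].$$

A common choice in VI is the Gaussian variational posterior $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{S})$, which results in a Gaussian marginal $q(\mathbf{f}) = \int p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})\, d\mathbf{u} = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu} = \mathbf{K}_{XZ}\mathbf{K}_{ZZ}^{-1}\mathbf{m}$ and $\boldsymbol{\Sigma} = \mathbf{K}_{XX} - \mathbf{K}_{XZ}\mathbf{K}_{ZZ}^{-1}(\mathbf{K}_{ZZ} - \mathbf{S})\mathbf{K}_{ZZ}^{-1}\mathbf{K}_{ZX}$. A gradient-based algorithm can be employed to maximize the ELBO with respect to the inducing points and the hyperparameters of the chosen kernel function. Several commonly used kernel functions can be found in Table 1 and are discussed in [35].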
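For the Gaussian-likelihood case, the closed-form GP predictive distribution can be sketched in a few lines of NumPy. The RBF kernel with unit lengthscale, the noise variance, and the synthetic sine data are assumptions for illustration; they are not the kernels or data used in this paper.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)


def gp_predict(x_train, y_train, x_test, noise_var=1e-6):
    """Exact GP regression: predictive mean and covariance at x_test."""
    k_xx = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_sx = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)
    # The Cholesky factorization is the O(N^3) step mentioned in the text.
    chol = np.linalg.cholesky(k_xx)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_sx @ alpha
    v = np.linalg.solve(chol, k_sx.T)
    cov = k_ss - v.T @ v
    return mean, cov


# Synthetic data, for illustration only.
x_train = np.linspace(0.0, 5.0, 6)
y_train = np.sin(x_train)
mean_train, cov_train = gp_predict(x_train, y_train, x_train)
mean_new, cov_new = gp_predict(x_train, y_train, np.array([2.5]))
```

With nearly noise-free data the predictive mean reproduces the training outputs, while the predictive variance grows between training inputs.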
3. Materials and Methods
3.1. Problem Statement
To formulate the autonomous driving task mathematically, we refer to the basic theory of deep reinforcement learning. Let $\mathcal{S}$, $\mathcal{A}$, and $r$ be the state space, action space, and reward function, respectively. In the standard reinforcement learning setting, an agent interacts with the environment at discrete time steps. At each time step $t$, the agent observes the state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$ according to its policy $\pi$, which maps a state to a deterministic action or a probability distribution over the actions ($\pi: \mathcal{S} \to \mathcal{A}$ or $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$). Then, it receives an immediate reward $r_t$ from the environment. The goal of a reinforcement learning task is to learn an optimal policy by maximizing the expected accumulated reward from the beginning. The DDPG setup adopts deep neural networks to approximate the deterministic policy and the action value function. However, training the deep neural networks takes a long time and needs a lot of data.
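The agent-environment interaction and the accumulated reward can be sketched as follows. The toy state (a scalar lateral offset), the toy policy, and the toy reward are assumptions for illustration; they are not the Torcs state, action, or reward defined in this paper.

```python
def accumulated_reward(rewards, gamma=1.0):
    """Return the (optionally discounted) sum of a reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def rollout(policy, step_fn, initial_state, n_steps):
    """Run the observe-act-reward loop for n_steps and collect rewards."""
    state, rewards = initial_state, []
    for _ in range(n_steps):
        action = policy(state)                  # deterministic policy
        state, reward = step_fn(state, action)  # environment transition
        rewards.append(reward)
    return rewards


def toy_policy(state):
    # Steer halfway back toward zero offset (assumed toy rule).
    return -0.5 * state


def toy_step(state, action):
    next_state = state + action
    return next_state, 1.0 - abs(next_state)  # assumed toy reward


rewards = rollout(toy_policy, toy_step, initial_state=1.0, n_steps=3)
total = accumulated_reward(rewards)
```

Setting `gamma` below 1 gives the discounted return that is common in reinforcement learning.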
In our setting, we treat the deep GP model as the policy. To get the optimal deep GP model, the training data collected from the interaction between a well-trained neural network and the Torcs engine, which consist of the state set regarding the sensor’s states and the action set from the controller in Torcs, are used to train the model . Each state and action , as well as and , are represented by several variables presented in Tables 2 and 3, respectively. The reward function we define is as follows:
The reward function can be constructed more effectively by including other related variables [36]. Although the reward function does not contain the variables of action $a_t$ explicitly, each state is observed at time step after taking the action at time step . The action therefore influences the reward value indirectly.
To get the best policy, the evidence lower bound, denoted as , for the deep GP model , which is more complex than that of a standard GP, should also be maximized using the training data, and . According to Bayesian theory, this means obtaining the statistical mean of the action mapped from state , denoted as . After optimization, the model is fixed, denoted as , which yields . Although the trained deep GP can fit the training data well, this does not guarantee the optimal reward value in the testing period. For this reason, we introduce the feedback control method to refine the output of the deep GP model, i.e., . In this method, we consider data and , which represent collection of and in each state , in the training data . To achieve a better reward, it is designed to optimize action according to the difference between state and the training state . Our solution is presented in the following expression:
All the details will be presented in the next section. To compare our method with deep reinforcement learning, we performed an autonomous driving simulation of the lane keeping task in the Torcs engine.
3.2. Proposed Solution
In this section, the details of our autonomous driving decision-making method for the lane keeping task are given. The whole framework is presented in Figure 1. After training, we obtain a deep GP model that fits the training data fairly well. The trained deep GP model is used to predict the action according to the state feedback from Torcs in each step. All the predicted actions are further refined by the feedback control method for feasibility and safety. The final actions are then sent to Torcs to demonstrate visually the performance on running a successful lap. The overall algorithm flow is shown in Algorithm 1. In the following, we discuss the core methods in our framework in detail.

3.2.1. Deep GP Model
For this problem with multidimensional inputs and outputs, we use the deep GP method to fit the training data, given its advantages over a standard GP [20]. A multilayer GP model is a hierarchical composition of GPs. Considering a deep GP with a depth of $L$, each GP layer is associated with a set of inputs $\mathbf{h}_{l-1}$ and a set of outputs $\mathbf{h}_l$ for $l = 1, \ldots, L$ and $\mathbf{h}_0 = \mathbf{x}$. An example of deep GP is as follows:

$$\mathbf{y} = f_L(f_{L-1}(\cdots f_1(\mathbf{x}))),$$

where $f_l \sim \mathcal{GP}(0, k_l)$ for each layer. Each layer has its own kernel. In a deep GP, each layer is governed by a GP; however, the overall prior is no longer a GP, which makes training a deep GP model intractable. It is reasonable to introduce Gaussian noise in each layer. In this case, we get the following recursive definition:

$$\mathbf{h}_l = f_l(\mathbf{h}_{l-1}) + \boldsymbol{\epsilon}_l, \quad \boldsymbol{\epsilon}_l \sim \mathcal{N}(\mathbf{0}, \sigma_l^2 \mathbf{I}), \quad l = 1, \ldots, L.$$
A graphical model for deep Gaussian process with one hidden node is illustrated in Figure 2.
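The layered composition with per-layer Gaussian noise can be illustrated by drawing one sample from a two-layer deep GP prior. The RBF kernel, the one-dimensional hidden layer, the noise level, and the jitter term are assumptions made to keep the sketch small; they do not reproduce the model configuration used in this paper.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)


def sample_gp_layer(inputs, rng, noise_std=0.01, jitter=1e-6):
    """Draw one noisy GP function sample evaluated at the given inputs."""
    n = len(inputs)
    # Jitter keeps the covariance numerically positive definite.
    cov = rbf_kernel(inputs, inputs) + jitter * np.eye(n)
    chol = np.linalg.cholesky(cov)
    f = chol @ rng.standard_normal(n)
    # Per-layer Gaussian noise, as in the recursive definition.
    return f + noise_std * rng.standard_normal(n)


rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
h1 = sample_gp_layer(x, rng)   # hidden layer: h1 = f1(x) + noise
y = sample_gp_layer(h1, rng)   # output layer: y = f2(h1) + noise
```

Because the second layer takes the first layer's output as its input, the composed sample is generally not a draw from any single GP, which is the source of both the added flexibility and the intractability.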
Let ; for supervised learning case, the distribution of a deep GP model with hidden layers can be written as follows:
As for the conditional probabilities, they can be expanded as follows:
The nonlinearities introduced by the GP covariance functions make the Bayesian treatment of this model challenging. Inspired by the core idea of the SGP model, it is practical to introduce the inducing inputs and corresponding inducing output variables for GP layers, denoted by the respective sets and . Now, similar to equation (9), we could write that
In this way, we can obtain the logarithm of the augmented joint distribution: where and is the lower bound for : where and [21]. We can see that the latent variables are integrated out within each layer. Our aim is to approximate the logarithm of the marginal likelihood: where . To bound the marginal likelihood, we apply Jensen’s inequality and get that where is the introduced approximate variational distribution.
Generally, the can be further simplified by the mean-field approximation (i.e., in each layer), and the final form of is tractable because of these conjugate distributions, provided that the covariance functions selected in each layer can be feasibly convolved with the Gaussian density [20, 21]. A gradient-based algorithm, such as the L-BFGS-B algorithm [37], can be employed to maximize the variational lower bound ELBOM with respect to the model parameters (i.e., the kernel hyperparameters and noise variance in each layer) and the variational parameters introduced:
The trained deep GP model can fit the training data well. In recent years, several other approximation methods have been put forward to train deep GPs, such as importance-weighted variational inference [38], stochastic gradient Hamiltonian Monte Carlo [39], and approximate expectation propagation [40].
3.2.2. Feedback Control Model
After training the deep GP model, all the parameters in the model are fixed. We tried this model in Torcs and found that it can only finish a little more than half of a loop trip on the CG road. After analysing the failed runs and the input data, we found that the essential cause is that the data from the DDPG well-trained network only contain states with small and , which are close to the center line of the lane. For an input state with a highly deviated or value, the trained deep GP may generate an action with an improper , , or value, which indirectly affects the value of the reward function. We assume that if the values of and stay in a reasonable range, a successful loop trip can be achieved regardless of the values of the other state variables. With this in mind, we design an extra feedback control method , for reward optimization, to amend the action predicted by the deep GP model.
In this method, we borrow the idea of the PID controller. For simplicity, unlike the PID controller, we add only a proportional error term, composed of two different positive items, to the predicted steer value in the action . In addition, instead of using integral error terms, we take the past state into consideration by adding the error between the current state and a reference state in the past. Firstly, several critical values () need to be set in our method. The reason is that the feedback control is only needed for improper feedback state variables, and . Secondly, the error is calculated by the difference between th or th number of variable, when is smaller or larger than , and in , and current feedback state variable, and . Finally, the parameters of the linear error term need to be tuned to achieve a loop trip. The detailed algorithm flow is presented in Algorithm 2.

As we can see, there are many adjustable parameters in our feedback control method. However, it turns out that all the parameters can be easily determined:
(i) According to our experience, and can be set immediately after analysing the domain of and , because the logical judgment in should only be needed when the vehicle deviates too much from the center line or the steering angle is too large.
(ii) The values of and can be set the same as and . Consequently, only four parameters in the feedback control method are left to be considered seriously.
(iii) There are two logical judgment statements after the iteration number check in our method. These correspond to the cases in which the visual vehicle is on the left or the right side of the center line. We should consider these two situations separately. Moreover, we use the absolute value of the error so that the sign of the steer angle is always in the correct direction, with its value larger (left side of center line) or smaller (right side of center line) than the value predicted by the deep GP model .
(iv) In other words, the absolute value of the output action from model is not large enough to drag the vehicle back onto the safe road in some extremely dangerous situations. From this perspective, the signs of the remaining four parameters to be determined are obvious.
Unlike the usual optimization routine, the optimization of the reward value in each step is not carried out by a gradient-based algorithm. For the lane keeping task, if the vehicle can complete the lap successfully, the obtained reward value may not be the best, but it must be one of the local optimal values. After enough trial and error, we can get the relatively optimal parameters introduced in . In this way, after the parameters in method are determined, the vehicle can immediately respond to the Torcs engine through the refined action .
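The idea of a threshold-gated proportional correction can be sketched as below. This is only a hypothetical illustration and is not Algorithm 2: the function name, the variable names (`offset`, `heading`, and their references), the thresholds, the gains, and the clipping of the steer value to [-1, 1] are all assumptions. Which side of the center line the vehicle is on is passed in as a flag so that no sign convention of the simulator is assumed.

```python
def refine_steer(steer_pred, offset, heading, ref_offset, ref_heading,
                 on_left_side, offset_limit=0.5, heading_limit=0.3,
                 gain_offset=0.2, gain_heading=0.2):
    """Correct a predicted steer value with two positive proportional terms."""
    # Inside the safe range, the predicted action is kept unchanged.
    if abs(offset) <= offset_limit and abs(heading) <= heading_limit:
        return steer_pred
    # Absolute errors against reference values taken from the training data.
    correction = (gain_offset * abs(ref_offset - offset)
                  + gain_heading * abs(ref_heading - heading))
    # Larger steer on the left side, smaller on the right side, as in the text.
    steer = steer_pred + correction if on_left_side else steer_pred - correction
    # Clipping range is an assumption about the valid steer interval.
    return max(-1.0, min(1.0, steer))


safe = refine_steer(0.1, 0.1, 0.0, 0.0, 0.0, on_left_side=True)
left = refine_steer(0.1, 0.8, 0.0, 0.0, 0.0, on_left_side=True)
right = refine_steer(0.1, 0.8, 0.0, 0.0, 0.0, on_left_side=False)
```

Using absolute errors with a side-dependent sign keeps the correction pointed in one direction per side, so only the magnitudes of the gains need tuning.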
4. Results and Discussion
In this section, we conduct extensive simulations to validate our method and compare it with the reinforcement learning approaches that are typically used in a similar setting. We start with the experiment setup and data preparation, then show how well our model fits the training data, and finally provide a comparison by examining the performance of lane keeping in a simulation environment.
4.1. Experiment Setup
DDPG is a variant of the deterministic policy gradient algorithm [23], which adopts deep neural networks to approximate the deterministic policy and the action value function. It is an off-policy algorithm that utilizes the experience replay technique introduced in DQN [25] to break the correlation between samples and keep them i.i.d. In addition, the learning method of the Q function is similar to that in DQN. In our case, we train a deep neural network by DDPG to achieve a successful loop trip. It takes about 16 hours and 4000 episodes to obtain a high-performance deep neural network, and tens of thousands of samples are updated in the centralized experience replay buffer during the training period.
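The experience replay technique can be sketched with a fixed-capacity buffer and uniform random sampling. The class name, capacity, and transition layout are illustrative assumptions, not the buffer implementation used to train the DDPG network in this paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy transitions, for illustration only.
buffer = ReplayBuffer(capacity=5)
for t in range(8):
    buffer.add(t, 0.0, 1.0, t + 1, False)
batch = buffer.sample(3)
```

Sampling mini-batches from stored transitions, instead of learning from consecutive steps, is what lets the off-policy update treat samples as approximately independent.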
We collect the training data with the DDPG well-trained network on the CG road in Torcs. In total, 338 records () are collected during the loop trip simulation. They contain the state set and the action set of the visual vehicle. The details of the state and action are shown in Tables 2 and 3, respectively. To train the deep GP network, the state set is the input and the action set is the output. The raw data are fed into the deep GP network without any additional processing before training.
4.2. Experimental Results
In our case, we use GPy [41], an open-source framework developed by the Sheffield machine learning group, to conduct the simulation. We use a two-layer GP to fit the training data. The kernels we used per layer are as follows:
All the function expressions corresponding to these function names can be found in Table 1 or on the GPy documentation web page, and they can be easily extended to their automatic relevance determination (ARD) [35] versions. The number of inducing points used in each layer is 200. After optimization, the output action values are shown in Figures 3–5. In these figures, the legends "true data" and "predicted mean value" denote the action values in the training data and the values predicted by the trained deep GP, respectively. The green zone, with its margin depicted by the green dashed line, represents the credible interval of the predicted value. The horizontal axis represents the time steps of self-driving. We can see that the model captures the main features of the training data except for a few zones of strong vibration.
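A credible interval of a Gaussian prediction can be computed from the predictive mean and variance as sketched below. The 95% level (z = 1.96) and the example numbers are assumptions for illustration; the level used for the green zone in the figures is not stated in the text.

```python
import math


def credible_interval(mean, variance, z=1.96):
    """Return the (lower, upper) bounds of a Gaussian credible interval."""
    std = math.sqrt(variance)
    return mean - z * std, mean + z * std


# Hypothetical predictive mean and variance for one steer value.
lower, upper = credible_interval(mean=0.2, variance=0.01)
```

A wider band therefore indicates a larger predictive variance, that is, a less certain prediction at that time step.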
Recall that we stated at the beginning that the deep GP model alone is not enough to solve the lane keeping task in our setting. We therefore compare the cumulative reward of the deep GP method with and without the feedback control method. Figure 6 demonstrates that the deep GP model alone can only finish about half of a loop trip on the CG road, whereas after combining it with the feedback control method, the accumulated reward increases to about 2.7 times that obtained using the deep GP model only. This demonstrates the effectiveness of the feedback control method. In Section 3.2.2, we already explained the main reasons why the lane keeping task cannot be completed using the deep GP model only. In addition, although the deep GP model can capture the uncertainty very well, it does not have the ability to correct wrongly predicted actions. In such a rapidly interactive environment, these unreasonable actions are so harmful that the vehicle is much more likely to rush off the track.
4.3. Experimental Comparison
In Section 4.2, we showed how our model fits the training data and why the feedback control method is necessary in our case. We now compare their performance with the DDPG method. Compared with the DDPG method, the other two methods take more steps to achieve a loop trip, and their total rewards are a little lower than that of DDPG. Consequently, their curves of accumulated reward per step have flatter slopes than that of the DDPG method. Table 4 lists several properties, such as Total Rewards, Training Time, and Training Data, of the three methods. Despite its advantages in iteration times and total rewards, the DDPG well-trained network, which is used to obtain the training data, takes about 16 hours to train, whereas the deep GP model takes only about 1.5 hours. This is also much less than the training time of the well-trained network with the AMDDPG method reported in [36]. The DDPG and AMDDPG methods need to interact with the environment in each episode to update the training data, and this procedure is repeated many times for exploration and exploitation. Thus, these data-hungry approaches need tens of thousands of samples, whereas the training data we used contain only about 340 items, far fewer than those approaches require. All the simulations are conducted on the CG track in Torcs (the overview map in the upper-right corner of Figure 1). Other complex tracks, shown in Figure 7, can be found in the Torcs engine or generated by an online tool named TrackGen [42].
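The data and time ratios quoted in the text can be checked with simple arithmetic. The DDPG sample count is not stated exactly in the text ("tens of thousands"), so the value below is only the count implied by the 0.34% figure and should be read as an estimate.

```python
gp_records = 338          # records used to train the deep GP
data_fraction = 0.0034    # 0.34% of the DDPG data, as stated in the abstract

# Sample count implied by the stated fraction (an estimate, not a reported value).
implied_ddpg_samples = gp_records / data_fraction

gp_hours, ddpg_hours = 1.5, 16.0   # approximate training times from the text
time_fraction = gp_hours / ddpg_hours
```

The implied count is just under 100,000 samples, which is consistent with "tens of thousands", and the training time of the deep GP is a little under one tenth of that of DDPG.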
With these benefits, we believe that the proposed framework is a promising way to make decisions in simulation environments and under actual road conditions. However, many technical problems remain to be tackled before real road tests, and the shortcomings of the proposed method should be acknowledged. In this paper, we only test the method for the lane keeping task on a relatively simple road. On a complex road or for a complex task, more data would have to be fed into the deep GP model, and the feedback control method would also need to be specifically designed and validated. For example, for a simulation on the Curuzu track in Figure 7, we expect that more training data would have to be recorded from a well-performing reinforcement learning model. For this road with many irregular turns, the feedback control method would have to be tested with extensive trial and error. To check the effectiveness of the proposed framework in real road tests similar to the simulation setting, we can record the training data with the aid of perceptual equipment during manual driving. After training the deep GP model offline, we can test its validity in both autonomous driving and manual driving modes. In addition, the parameters in the feedback control method should also be tuned to complete the self-driving task. For more complex tasks, such as car-following or overtaking, since the feedback control method in our framework does not take the motion characteristics of the vehicle into consideration, we plan to combine our framework with other motion control methods, such as pure pursuit [43] or Stanley [44], in future work.
5. Conclusions
In conclusion, we presented an end-to-end learning method that combines the deep GP and a feedback control method to solve the decision-making problem of the lane keeping task in self-driving simulation. The proposed method achieved almost the same performance with only 0.34% of the training data used by deep reinforcement learning, and training takes only about 1.5 hours, compared with about 16 hours for DDPG. We believe this method is promising for complex self-driving tasks with small training data.
Data Availability
The raw data are available online in our GitHub repository (https://github.com/Fangwq/Traningdatafordecisionmakingresearch).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors would like to thank Junta Wu, the author of AMDDPG, for help in training the DDPG network, supplying the data, and providing many helpful clarifications and suggestions. This work was supported by Shenzhen Engineering Laboratory on Autonomous Vehicles, NSFC (61672512 and 61702493), Shenzhen Basic Research Program (JCYJ20170818164527303 and JCYJ20180507182619669), Science and Technology Development Fund, Macao S.A.R. (FDCT) (No. 0015/2019/AKP), and CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology. The work was also funded by Shenzhen Institute of Artificial Intelligence and Robotics for Society.