Abstract
A global optimal control strategy not only requires the driving cycle to be known in advance but is also difficult to implement online because of its large calculation volume. As an artificial intelligence-based control strategy, reinforcement learning (RL) is applied to the energy management strategy of a super-mild hybrid electric vehicle. From time-speed datasets of sample driving cycles, a stochastic model of the driver’s power demand is developed. Based on Markov decision process theory, a mathematical model of an RL-based energy management strategy is established, which takes the minimum cumulative return expectation as its optimization objective. A policy iteration algorithm is adopted to obtain the optimum control policy that takes the vehicle speed, driver’s power demand, and state of charge (SOC) as the input and the engine power as the output. Using a MATLAB/Simulink platform, a CYC_WVUCITY simulation model is established. The results show that, compared with dynamic programming, this method can not only adapt to random driving cycles and reduce fuel consumption by 2.4%, but also be implemented online because of its small calculation volume.
1. Introduction
With the increasing problems of global warming, air pollution, and energy shortage, hybrid electric vehicles (HEVs) have been extensively studied because of their potential to significantly improve fuel economy and reduce emissions.
Energy management strategies are critical for HEVs to achieve optimal performance and energy efficiency through power split control. According to the way they are implemented, energy management strategies can be divided into rule-based strategies and optimization-based strategies.
Optimization-based strategies form an important group of energy management strategies and can be divided into strategies based on instantaneous optimization, global optimization, model predictive control (MPC), and artificial intelligence.
Rule-based strategies are primarily based on human experience. The torque distribution of the engine and motor follows preset control rules, which are formulated from the steady-state MAP charts of the engine and motor. Reference [1] proposed an energy management strategy based on a logic threshold and a fuzzy algorithm for improving fuel economy. Wang et al. [2] developed algorithms for On/Off control, load tracking control, and bus voltage control and conducted a simulation. A rule-based control strategy is simple and can be easily implemented online; however, a static control strategy is not optimal in theory, and it does not consider dynamic changes in working conditions.
An instantaneous optimal control strategy can ensure an optimum objective at every time step; however, it cannot guarantee an optimum objective over the whole driving cycle. Its calculation volume is larger than that of the logic threshold strategy, but it can still be implemented online. Reference [3] formulated the energy management strategy of an HEV based on driving style recognition. Reference [4] combined the K-means clustering algorithm with the equivalent consumption minimization strategy to realize the energy management of the whole vehicle.
An MPC strategy can ensure an optimum objective within the prediction domain and can be implemented online. Reference [5] proposed a stochastic MPC-based energy management strategy that uses the vehicle location, traveling direction, and terrain information of the area for HEVs operating in hilly regions with light traffic.
A global optimal control strategy can guarantee an optimum objective over a given driving cycle by distributing the power of the engine and motor. However, it can only be implemented offline because of the large volume of calculations involved. Reference [6] proposed an improved dynamic programming (DP) control strategy for hybrid electric buses based on the state transition probability. DP was applied to an EVT-based HEV powertrain to realize its optimal control in [7]. A DP algorithm was also used for global optimization of the performance of a speed coupling ISG HEV in [8].
Intelligent energy management strategies include fuzzy logic, neural network, genetic algorithm, and machine learning-based control strategies. Energy management strategies based on machine learning include those based on supervised learning [9, 10], unsupervised learning [11, 12], and reinforcement learning.
RL is a data-driven approach that treats the system as a black box, regardless of whether it is linear or nonlinear. As a type of self-adaptive optimal control method based on machine learning, RL has been widely applied to the learning control of nonlinear systems.
Reference [13] proposed an RL-based energy management strategy for a plug-in hybrid electric bus (PHEB). Liu et al. [14] proposed an RL-enabled energy management strategy that uses the speedy Q-learning algorithm to accelerate convergence in Markov chain-based control policy computation. Reference [15] developed an online energy management controller for a plug-in HEV based on driving condition recognition and a genetic algorithm. In [16], a deep Q-learning-based energy management strategy was proposed and verified. RL was shown to derive model-free and adaptive control for energy management in [17]. Liu et al. [18] proposed a bi-level control framework that combined predictive learning with RL to formulate an energy management strategy.
RL can ensure a global optimum over the driving cycle without requiring the driving cycle to be known in advance. Compared with complex dynamic models, data-driven models can be implemented online because of their small calculation volume.
For super-mild HEVs, [19] studied rule-based and instantaneous optimization methods for energy management. However, the application of RL to the energy management strategy of super-mild HEVs has not been reported.
This study establishes an energy management strategy for super-mild HEVs based on known-model RL. The optimal control results are obtained by a policy iteration algorithm. Based on the MATLAB/Simulink simulation platform, a simulation is conducted on the economic performance of the vehicle.
2. Super-Mild Hybrid Electric Vehicle Model
2.1. Structure and Main Parameters
A super-mild HEV is primarily composed of an engine, a motor, and a continuously variable transmission with reflux power. The continuously variable transmission with reflux power includes a metal belt continuously variable transmission, a fixed speed ratio gear transmission device, a planetary gear transmission device, wet clutches, a one-way clutch, and a brake. Its structure is shown in Figure 1. The main parameters are shown in Table 1.
2.2. Working Modes
A super-mild HEV has four working modes: only-motor mode, only-engine mode, engine-charging mode, and regenerative braking mode, as shown in Figure 2.
2.3. Vehicle Dynamics Model
2.3.1. Power Demand Model
The power demand at the wheel is the sum of the power needed to overcome rolling resistance, air resistance, and acceleration resistance:

$$P_{req} = (F_f + F_w + F_a)\,v \tag{1}$$

where $F_f$ is the rolling resistance, $F_w$ is the air resistance, $F_a$ is the acceleration resistance, $v$ is the vehicle speed, $f$ is the rolling resistance coefficient, $m$ is the vehicle mass, $C_D$ is the air resistance coefficient, $A$ is the frontal area, and $\delta$ is the conversion coefficient of rotating mass.
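As an illustration of how the power demand can be evaluated from a time-speed trace, the following is a minimal sketch. It assumes the standard longitudinal dynamics expressions in SI units (rolling resistance $mgf$, air resistance $\tfrac{1}{2}\rho C_D A v^2$, acceleration resistance $\delta m\,\mathrm{d}v/\mathrm{d}t$); the air density value, the function name `power_demand`, and the finite-difference scheme are assumptions made for this example and are not taken from the study.

```python
G = 9.81        # gravitational acceleration, m/s^2
RHO_AIR = 1.2   # air density, kg/m^3 (assumed value)


def power_demand(speeds, dt, mass, f_roll, c_d, area, delta):
    """Wheel power demand in W for a speed trace (m/s) sampled every dt seconds."""
    powers = []
    for k, v in enumerate(speeds):
        # Forward-difference acceleration; the last sample uses a backward difference.
        if k + 1 < len(speeds):
            accel = (speeds[k + 1] - v) / dt
        elif k > 0:
            accel = (v - speeds[k - 1]) / dt
        else:
            accel = 0.0
        f_rolling = mass * G * f_roll                  # rolling resistance, N
        f_air = 0.5 * RHO_AIR * c_d * area * v ** 2    # air resistance, N
        f_accel = delta * mass * accel                 # acceleration resistance, N
        powers.append((f_rolling + f_air + f_accel) * v)
    return powers
```

A negative result indicates braking, which is the regime in which regenerative braking can recover energy.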
2.3.2. Engine Model
The engine is a highly nonlinear system and its working process is very complex. Therefore, engine data for steady-state conditions are obtained through experimental testing. Based on these data, an engine torque model and an effective fuel consumption model are established, as shown in Figures 3 and 4, respectively.
2.3.3. Battery Model
Based on a NiMH battery performance experiment, the electromotive force and internal resistance models are obtained as shown in Formulas (2) and (3):

$$E = E_0 + \sum_{i=1}^{n} k_i\,SOC^{i} \tag{2}$$

where $E_0$ is the electromotive constant of the battery, $k_i$ are the fitting coefficients, SOC is the state of charge, and $E$ is the electromotive force under the current state;

$$R = \delta_R \left( R_0 + \sum_{i=1}^{n} \lambda_i\,SOC^{i} \right) \tag{3}$$

where $R_0$ is the internal resistance constant of the battery, $\lambda_i$ are the fitting coefficients, $\delta_R$ is the compensation coefficient of the internal resistance with the change in current, SOC is the state of charge, and $R$ is the internal resistance under the current state.
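The process to calculate the change in SOC is shown by Formulas (4)–(6):

$$P_b = E I_b - I_b^{2} R \tag{4}$$

$$\frac{\mathrm{d}SOC}{\mathrm{d}t} = -\frac{I_b}{Q_b} \tag{5}$$

Therefore,

$$\frac{\mathrm{d}SOC}{\mathrm{d}t} = -\frac{E - \sqrt{E^{2} - 4 R P_b}}{2 R Q_b} \tag{6}$$

where $P_b$ is the battery power, $I_b$ is the battery current, and $Q_b$ is the battery capacity.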
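The SOC calculation can be sketched numerically as follows. This is only an illustration, assuming the common internal-resistance equivalent circuit in which the terminal power satisfies $P_b = E I_b - I_b^2 R$ and discharge current is positive; the function name `soc_step` and the explicit Euler time step are choices made for this example.

```python
import math


def soc_step(soc, p_batt, emf, r_int, capacity_as, dt):
    """Advance the SOC by one time step of dt seconds.

    p_batt: battery terminal power in W (positive = discharge, assumed convention)
    emf: electromotive force in V; r_int: internal resistance in ohm
    capacity_as: battery capacity in ampere-seconds
    """
    disc = emf ** 2 - 4.0 * r_int * p_batt
    if disc < 0:
        raise ValueError("requested power exceeds what the battery can deliver")
    # Smaller root of P = E*I - I^2*R, i.e. the physically meaningful current.
    current = (emf - math.sqrt(disc)) / (2.0 * r_int)
    # Explicit Euler update of dSOC/dt = -I / Q.
    return soc - current * dt / capacity_as
```

For example, with an electromotive force of 100 V and an internal resistance of 0.1 Ω, a 990 W discharge corresponds to a current of 10 A.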
3. Stochastic Modeling of Driver’s Power Demand
Traditionally, the driver’s power demand is obtained from a given driving cycle; in reality, however, the driving cycle is random. The power demand is therefore treated as a discrete-time stochastic process and modeled as a Markov decision process (MDP). In other words, the transition probability from the current state to the next state depends only on the current state and the selected action and is independent of historical states. Accordingly, the power demand at the next state depends only on the current power demand and is independent of any previous state. Establishing the transition probability of the power demand requires a large volume of data. In this study, time-speed datasets of the UDDS and ECE_EUDC driving cycles are adopted to calculate the power demand at each moment (Figure 5). The transition probability matrix of the driver’s power demand is obtained using the maximum likelihood estimation method.
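Once the power demand has been calculated from the driving cycles, the transition probability $p_{ij}$ is defined as the probability of moving from the current power demand state $P_{req}^{i}$ to the next state $P_{req}^{j}$ at a given vehicle speed $v$:

$$p_{ij} = \Pr\left( P_{req}(k+1) = P_{req}^{j} \mid P_{req}(k) = P_{req}^{i},\ v(k) = v \right)$$

According to the maximum likelihood estimation method, the power transition probability can be obtained by

$$p_{ij} = \frac{N_{ij}}{N_i}$$

where $N_{ij}$ represents the number of times that the transition from $P_{req}^{i}$ to $P_{req}^{j}$ has occurred at the given vehicle speed and $N_i$ represents the total number of times that $P_{req}^{i}$ has occurred at that vehicle speed; it is given by

$$N_i = \sum_{j} N_{ij}$$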
Based on the data of the ECE_EUDC and UDDS driving cycles, transition probability matrices for the power demand are obtained at given speeds of 10 km/h, 20 km/h, 30 km/h, and 40 km/h, as shown in Figure 6.
Figure 6. Transition probability matrices of the power demand at (a) v = 10 km/h, (b) v = 20 km/h, (c) v = 30 km/h, and (d) v = 40 km/h.
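The maximum likelihood estimate described above amounts to counting observed transitions and normalizing each row. The following is a minimal sketch of that counting procedure; the bin widths, the floor-based discretization, and the function name `estimate_transitions` are assumptions made for illustration rather than details given in the study.

```python
from collections import defaultdict


def estimate_transitions(speeds, powers, speed_bin, power_bin):
    """Maximum likelihood estimate of P(next power bin | current power bin, speed bin).

    speeds and powers are equally long sequences sampled at the same instants.
    Returns {(speed_index, i): {j: probability}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for k in range(len(powers) - 1):
        v_idx = int(speeds[k] // speed_bin)      # speed bin of the current sample
        i = int(powers[k] // power_bin)          # current power bin
        j = int(powers[k + 1] // power_bin)      # next power bin
        counts[(v_idx, i)][j] += 1               # N_ij at this speed
    probs = {}
    for key, row in counts.items():
        total = sum(row.values())                # N_i = sum over j of N_ij
        probs[key] = {j: n / total for j, n in row.items()}
    return probs
```

Each row of the resulting table sums to one by construction. State pairs that never occur in the data have no row at all, which is one reason a large volume of driving data is needed.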
4. Known-Model RL Energy Management Strategy
4.1. Addressing Energy Management by RL
From a macroscopic perspective, the energy management strategy of an HEV involves determining the driver's power demand from the driver's operation (accelerator pedal or brake pedal) and determining the optimal power split between the two power sources (engine and motor) on the premise of guaranteeing dynamic performance. From a microscopic perspective, the energy management strategy can be abstracted as a sequential optimization decision problem. RL is a machine learning method based on the Markov decision process that can solve such problems.
The process of solving a sequential decision problem by RL is as follows. First, a Markov decision process is represented by the tuple $(S, A, P, R, \gamma)$, where $S$ is the state set, $A$ is the action set, $P$ is the transition probability, $R$ is the return function, and $\gamma$ is the discount factor (Figure 7). Second, based on the Markov decision process, the vehicle controller of an HEV is regarded as the agent, the distribution of its engine power is regarded as the action $a$, and the hybrid electric system excluding the vehicle controller is regarded as the environment. To achieve minimum cumulative fuel consumption, the agent takes an action to interact with the environment. After the environment accepts the action, the state changes and an immediate return $r$ is generated and fed back to the agent. The agent chooses the next action based on the immediate return and the current state of the environment and then interacts with the environment again. The agent interacts with the environment continually, thus generating a considerable amount of data. RL uses the generated data to modify the action variable. The agent then interacts with the environment again and generates new data, which are used to further optimize the action variable. After several iterations, the agent learns the optimal action that completes the corresponding task; in other words, it determines the decision sequence (Figure 8), thereby solving the sequential decision problem.
4.2. Mathematical Model of an Energy Management Strategy Based on RL
Mathematical models of the action variable $a$, state variable $s$, and immediate return $r$ are established according to RL theory. The SOC, vehicle speed $v$, and power demand $P_{req}$ are taken as state variables; the engine power $P_e$ is taken as the action variable; and the sum of the equivalent fuel consumption of a one-step state transition and the SOC penalty is taken as the immediate return $r$:

$$s = (SOC,\ v,\ P_{req}), \qquad a = P_e, \qquad r(s, a) = m_f + \lambda\, m_e + \beta\,(SOC - SOC_{ref})^{2}$$

where $\lambda$ is the factor of the equivalent fuel consumption, $m_f$ is the fuel consumption, $m_e$ is the electricity consumption, $\beta$ is the penalty factor of the battery, and $SOC_{ref}$ is the reference value of the SOC.
In an infinite time domain, solving the Markov decision process means determining a sequence of decision policies that attains the minimum cumulative return expectation of the random process, which is called the optimal state value function $V^{*}(s)$:

$$V^{*}(s) = \min_{\pi} E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s \right] \tag{12}$$

where $\gamma$ is the discount factor, $r_t$ represents the return value at time $t$, and $\pi$ represents the control policy.
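Meanwhile, the control and state variables must also meet the following constraints:

$$SOC_{min} \le SOC \le SOC_{max}, \qquad v_{min} \le v \le v_{max}, \qquad P_{e,min} \le P_e \le P_{e,max}$$

where the subscripts min and max represent the minimum and maximum threshold values for the state of charge, speed, and power, respectively.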
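To make the immediate return and the constraints concrete, the following is a minimal sketch. The text specifies only that the return is the sum of an equivalent fuel consumption and an SOC penalty, so the linear electricity-to-fuel conversion and the quadratic penalty used here are assumptions, as are the function names and the dictionary layout of `limits`.

```python
def immediate_return(fuel, electricity, soc, equiv_factor, penalty, soc_ref):
    """One-step return: equivalent fuel consumption plus an SOC deviation penalty.

    The quadratic penalty form is an assumption made for this illustration.
    """
    equivalent_fuel = fuel + equiv_factor * electricity
    return equivalent_fuel + penalty * (soc - soc_ref) ** 2


def is_feasible(soc, speed, engine_power, limits):
    """Check the box constraints; limits maps each name to a (min, max) pair."""
    values = {"soc": soc, "speed": speed, "engine_power": engine_power}
    return all(limits[name][0] <= values[name] <= limits[name][1] for name in values)
```

In a tabular solution, infeasible state-action pairs can simply be excluded from the minimization, or assigned a very large return so that the policy never selects them.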
4.3. Policy Iterative Algorithm
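Applying the Bellman equation to Formula (12) gives

$$V^{*}(s) = \min_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]$$

where $s'$ represents the state at the next time step.
The purpose of the solution is to determine the optimal policy $\pi^{*}$:

$$\pi^{*}(s) = \arg\min_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]$$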
A policy iteration (PI) algorithm is used to solve the problem of the random process. It involves a policy evaluation step and a policy improvement step.
The calculation process is shown in Algorithm 1.

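In the policy evaluation step, for a control policy $\pi_k(s)$ (the subscript $k$ represents the number of iterations), the corresponding state value function $V^{\pi_k}(s)$ is calculated, as shown in

$$V^{\pi_k}(s) = r(s, \pi_k(s)) + \gamma \sum_{s'} P(s' \mid s, \pi_k(s))\, V^{\pi_k}(s')$$

In the policy improvement step, the improved policy $\pi_{k+1}(s)$ is determined through a greedy strategy, as shown in

$$\pi_{k+1}(s) = \arg\min_{a} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi_k}(s') \right]$$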
During policy iteration, policy evaluation and policy improvement are performed alternately until the state value function and the policy converge.
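The alternation of policy evaluation and policy improvement can be sketched for a small tabular problem as follows. This is a generic illustration of policy iteration with cost minimization, not the study's implementation: the data layout (`trans[s][a]` as a list of `(next_state, probability)` pairs, `cost[s][a]` as the immediate return), the iterative evaluation scheme, and the toy two-state example are all assumptions.

```python
def policy_iteration(n_states, n_actions, trans, cost, gamma,
                     eval_sweeps=500, tol=1e-9):
    """Return (policy, value) minimizing the expected discounted cumulative cost."""
    policy = [0] * n_states
    value = [0.0] * n_states

    def q(s, a):
        # One-step lookahead: immediate cost plus discounted expected value.
        return cost[s][a] + gamma * sum(p * value[s2] for s2, p in trans[s][a])

    while True:
        # Policy evaluation: repeat the backup for the fixed policy until it settles.
        for _ in range(eval_sweeps):
            delta = 0.0
            for s in range(n_states):
                new_v = q(s, policy[s])
                delta = max(delta, abs(new_v - value[s]))
                value[s] = new_v
            if delta < tol:
                break
        # Policy improvement: switch to the greedy (minimum-cost) action.
        stable = True
        for s in range(n_states):
            best = min(range(n_actions), key=lambda a: q(s, a))
            if q(s, best) < q(s, policy[s]) - 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, value


# Toy example: in state 0, staying costs 1 per step, moving to state 1 costs 2 once;
# in state 1, staying is free and moving back to state 0 costs 0.5.
trans = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)], [(0, 1.0)]]]
cost = [[1.0, 2.0], [0.0, 0.5]]
policy, value = policy_iteration(2, 2, trans, cost, gamma=0.9)
```

In the toy example, the returned policy moves from state 0 to state 1 and then stays there, because the one-off cost of 2 is lower than the discounted cost of staying in state 0 indefinitely (1 / (1 − 0.9) = 10).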
The policy iteration algorithm is adopted, obtaining the optimum control policy that takes the vehicle speed, driver’s power demand, and SOC as the input and the engine power as the output. Figure 9 shows the optimized engine powers at vehicle speeds of 10 km/h, 20 km/h, 30 km/h, and 40 km/h.
Figure 9. Optimized engine power at (a) v = 10 km/h, (b) v = 20 km/h, (c) v = 30 km/h, and (d) v = 40 km/h.
It can be seen from Figure 9 that only the motor works when the SOC is high and the power demand is low; when both the SOC and the power demand are high, only the engine works, and, when the SOC is low, the engine is in the charging mode.
Figure 10 illustrates the offline and online implementation frameworks of the energy management strategy. In the offline part, the power demand transition probability matrix is obtained from the Markov chain. Mathematical models of the state variable $s$, action variable $a$, and immediate return $r$ are derived according to RL theory. The policy iteration algorithm is employed to determine the engine power optimization tables. In the online part, the driver's power demand is obtained from the positions of the accelerator pedal and brake pedal. Then, the power of the engine and motor is distributed by looking up the offline engine power optimization tables. Finally, the power is transferred to the wheels by the transmission system.
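The online part reduces to a table lookup, which is why its computational cost is small. The following is a minimal sketch of such a lookup using nearest-grid-point selection; the table layout `table[i][j][k]`, the grid dictionary keys, the nearest-neighbor rule, and the simplification that the motor supplies the remainder of the power demand without losses are all assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_index(grid, x):
    """Index of the grid point closest to x; grid must be sorted in ascending order."""
    pos = bisect_left(grid, x)
    if pos == 0:
        return 0
    if pos == len(grid):
        return len(grid) - 1
    return pos if grid[pos] - x < x - grid[pos - 1] else pos - 1


def lookup_engine_power(table, grids, soc, speed, p_req):
    """Return (engine power, motor power) for the current state.

    table[i][j][k] holds the offline-optimized engine power at grid point
    (grids["soc"][i], grids["speed"][j], grids["power"][k]).
    """
    i = nearest_index(grids["soc"], soc)
    j = nearest_index(grids["speed"], speed)
    k = nearest_index(grids["power"], p_req)
    p_engine = table[i][j][k]
    # Assumption: the motor covers the remaining demand, ignoring driveline losses.
    return p_engine, p_req - p_engine
```

A negative motor power in this sketch corresponds to the motor operating as a generator. Interpolating between neighboring grid points would give a smoother command than nearest-point selection.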
5. Simulation Experiment and Results
A simulation is conducted on the MATLAB/Simulink platform, taking the ECE_EUDC driving cycle as the simulation driving cycle and setting the initial SOC to 0.6. The energy management strategy based on known-model RL is adopted to simulate the vehicle operation status online. The results are shown in Figure 11.
Figure 11 shows how the following parameters change with time: the gear ratio of the transmission, the SOC of the battery, the power demand, the motor power, the engine torque, the motor torque, and the instantaneous equivalent fuel consumption. From Figure 11(b), it can be seen that the gear ratio varies continuously from 0.498 to 4.04; this is primarily because of the use of a CVT with reflux power, which, unlike an AMT, can transmit power without interruption. In Figure 11(c), the SOC of the battery changes from the initial value of 0.6 to the final value of 0.5845, that is, ∆SOC = 0.0155; this meets the HEV SOC balance requirement before and after the cycle. The power demand curve of the vehicle is obtained based on ECE_EUDC, as shown in Figure 11(d). The power distribution curves of the motor and the engine are obtained from the engine power optimization control tables, as shown in Figures 11(e) and 11(f). Where the motor power is negative, the motor is in the generating state.
To validate the optimal performance of the RL control strategy, vehicle simulation tests based on DP and RL were carried out. Figure 12 compares the optimization results of the two control methods on the ECE_EUDC driving cycle. The solid line indicates the optimization results of RL and the dotted line indicates the results of DP. Figure 12(a) shows the SOC optimization trajectories of the DP-based and RL-based strategies. The trend of the two curves is essentially the same. The difference between the final SOC and the initial SOC is within 0.02, and the SOC is relatively stable. Compared with RL, the SOC curve of the DP-based strategy fluctuates more strongly; this is primarily related to the distribution of the power source torque. Figures 12(b) and 12(c) show the engine and motor torque distribution curves of the DP and RL strategies. The engine torque curves essentially coincide; however, the motor torque curves are somewhat different. The difference appears primarily in the torque distribution when the motor is in the generating state.
Table 2 shows the equivalent fuel consumption obtained through DP and RL optimization. Compared with that obtained by DP (4.952 L), the fuel consumption obtained by RL is 2.3% higher. The reason is that DP ensures a global optimum only for a given driving cycle (ECE_EUDC), whereas RL optimizes the result for a series of driving cycles in an average sense, thereby realizing an optimum of the cumulative expected value. Compared with the rule-based (5.368 L) and instantaneous optimization (5.256 L) strategies proposed in the literature [19], RL decreased the fuel consumption by 5.6% and 3.6%, respectively.
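The percentages quoted above can be cross-checked with a few lines of arithmetic. In the sketch below, the RL consumption is not read from Table 2 but derived from the stated DP value and the stated 2.3% difference, so it is an inferred quantity; the variable and function names are chosen for this example.

```python
dp_fuel = 4.952                    # L, DP result quoted in the text
rl_fuel = dp_fuel * (1 + 0.023)    # L, inferred from "2.3% higher" (not read from Table 2)
rule_based_fuel = 5.368            # L, rule-based strategy from [19]
instantaneous_fuel = 5.256         # L, instantaneous optimization from [19]


def reduction_percent(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


saving_vs_rule_based = reduction_percent(rule_based_fuel, rl_fuel)
saving_vs_instantaneous = reduction_percent(instantaneous_fuel, rl_fuel)
```

Rounded to one decimal place, the two savings evaluate to 5.6% and 3.6%, which agrees with the figures reported in the text.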
To validate the adaptability of RL to a random driving cycle, CYC_WVUCITY is also selected as a simulation cycle. The optimization results are shown in Figure 13. Figure 13(b) shows the SOC curves of the battery under the DP and RL control strategies. The solid line indicates the optimization results of RL and the dotted line indicates the results of DP. The final SOC values for the DP and RL strategies are 0.59 and 0.58, respectively, with the same initial value of 0.6. The SOC remains essentially stable. Figures 13(c) and 13(d) show the optimal torque curves of the motor and the engine under the DP and RL strategies. The trends of the torque curves of the two strategies are largely the same, although the torque obtained by DP fluctuates more strongly. Table 3 lists the equivalent fuel consumption of the two control strategies. Compared with DP, RL reduces fuel consumption by 2.4%, which indicates that it adapts well to a random driving cycle. The computation times of DP and RL, including offline and online computation, were also recorded. The offline computation time of DP is 5280 s and that of RL is 4300 s. Because of its large calculation volume and because the driving cycle is not known in advance, DP cannot be realized online. RL is not limited to a particular cycle and can be realized online by embedding the engine power optimization tables into the vehicle controller. The online operation time of RL is 35 s for the CYC_WVUCITY simulation cycle.
6. Conclusion
We established a stochastic Markov model of the driver’s power demand based on datasets of the UDDS and ECE_EUDC driving cycles.
An RL energy management strategy was proposed that takes the SOC, vehicle speed, and power demand as state variables, the engine power as the action variable, and the minimum cumulative return expectation as the optimization objective.
Using the MATLAB/Simulink platform, a CYC_WVUCITY simulation model was established. The results show that, compared with DP, RL could reduce fuel consumption by 2.4% and be implemented online.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research is supported by the Scientific and Technology Research Program of Chongqing Municipal Education Commission (Grant no. KJQN201800718) and the Chongqing Research Program of Basic Research and Frontier Technology (Grant no. cstc2017jcyjAX0053).