Mathematical Modeling and Models for Optimal Decision-Making in Health Care
Control of Blood Glucose for Type-1 Diabetes by Using Reinforcement Learning with Feedforward Algorithm
Abstract
Background. Type-1 diabetes is a condition caused by the lack of the insulin hormone, which leads to an excessive increase in blood glucose level. The glucose kinetics process is difficult to control due to its complex, nonlinear nature and state variables that are difficult to measure. Methods. This paper proposes a method for automatically calculating the basal and bolus insulin doses for patients with type-1 diabetes using reinforcement learning with a feedforward controller. The algorithm is designed to keep the blood glucose stable and to directly compensate for external events such as food intake. Its performance was assessed through simulation on a blood glucose model. The use of the Kalman filter with the controller was demonstrated to estimate unmeasurable state variables. Results. Comparison simulations between the proposed controller, the optimal reinforcement learning controller, and the proportional-integral-derivative controller show that the proposed methodology has the best performance in regulating the fluctuation of the blood glucose. The proposed controller also improved the blood glucose responses and prevented hypoglycemia. Simulation of the control system under different uncertain conditions provided insights into how inaccuracies in carbohydrate counting and meal-time reporting affect the performance of the control system. Conclusion. The proposed controller is an effective tool for reducing the postmeal blood glucose rise, countering the effects of known external events such as meal intake, and maintaining blood glucose at a healthy level under uncertainties.
1. Introduction
Type-1 diabetes is a chronic condition characterized by an excessive increase in blood glucose level because the pancreas does not produce the insulin hormone, due to the autoimmune destruction of pancreatic beta cells. High blood glucose can lead to both acute and chronic complications and can eventually result in the failure of various organs.
To date, there remain many challenges in the control of blood glucose in type-1 diabetes. One of them is that the glucose kinetics process is complex, nonlinear, and only approximately known [1]. There are also many external known and unknown factors that affect the blood glucose level, such as food intake, physical activity, stress, and hormone changes. Generally, it is difficult to predict and quantify those factors and disturbances.
By using control theory, various studies have designed control systems for patients with type-1 diabetes. For example, Marchetti et al. [2] derived an improved proportional-integral-derivative controller for blood glucose control. Soylu et al. [3] proposed a Mamdani-type fuzzy control strategy for exogenous insulin infusion. Model predictive control has also been widely used in type-1 diabetes and artificial pancreas development [4, 5]. Recently, together with the development of artificial intelligence and machine learning, reinforcement learning (RL) has emerged as a data-driven method for controlling unknown nonlinear systems [6, 7] and has been used as a long-term management tool for chronic diseases [8, 9]. The biggest advantage of RL compared to other methods is that the algorithm depends only on interactions with the system and does not require a well-represented model of the environment. This makes RL especially well suited for type-1 diabetes, since modelling the insulin-kinetic dynamics is complex and requires either invasive measurements on the patient or fitting to a large dataset. Hence, by using RL as the control algorithm, the modelling process can be bypassed, making the algorithm insusceptible to modelling errors.
In diabetes, controlling blood glucose requires actions made at specific instances throughout the day in the form of insulin doses or food intake. The actions are based on the currently observable states of the patient (e.g., blood glucose measurement and heart rate). The effectiveness of an action is judged by how far the measured blood glucose value is from the healthy level. In RL, an agent makes decisions based on the current state of the environment, and the task of the algorithm is to maximize a cumulative reward function or to minimize a cumulative cost function. Based on these similarities in the decision-making process between a human being and an RL agent, RL may be key to the development of an artificial pancreas system.
When dealing with meal disturbances, modelling of glucose ingestion is the norm as well as the first step in designing a controller for disturbance rejection [10]. Feedforward control has proven to be an effective tool for improving disturbance rejection performance [11, 12]. In control system theory, feedforward describes a controller that acts on a measured or predicted disturbance signal before the disturbance drives the output away from its setpoint. Compared to feedback control, where action is taken only after the output has moved away from the setpoint, the feedforward architecture is more proactive since it uses the disturbance model to suggest the time and size of the control action. Furthermore, building a meal disturbance model is simpler and requires less data to fit than identifying the insulin-glucose kinetics. Based on the model, the necessary changes in insulin actions can be calculated to compensate for the effects of carbohydrate on the blood glucose level.
A challenge in the control of blood glucose is the lack of real-time measurement techniques. With the development of continuous glucose measurement sensors, the blood glucose level can be measured and provided to the controller at minute intervals. However, the blood glucose value alone is usually not enough to describe the states of the system for control purposes. Therefore, an observer is needed to estimate the other state variables from the blood glucose measurement. In this paper, the Kalman filter was chosen for that purpose since it provides an optimal estimation of the state variables when the system is subjected to process and measurement noise [13, 14].
Vrabie et al. [15] established methodologies for obtaining optimal adaptive control algorithms for dynamical systems with unknown mathematical models by using reinforcement learning. Based on that work, Ngo et al. [16] proposed a reinforcement learning algorithm for updating basal rates in patients with type-1 diabetes. This paper completes the framework for blood glucose control with both basal and bolus insulin doses. The framework includes the reinforcement learning algorithm, the feedforward controller for compensating food intake, and the Kalman filter for estimating unmeasurable state variables during the control process. This paper also conducts simulations under uncertain information to evaluate the robustness of the proposed controller.
2. Methods
2.1. Problem Formulation
The purpose of our study is to design an algorithm to control the blood glucose in patients with type-1 diabetes by means of changing the insulin concentration. The blood glucose metabolism is a dynamic system in which the blood glucose changes over time as the result of many factors such as food intake, insulin doses, physical activity, and stress level. The learning process of RL is based on the interaction between a decision-making agent and its environment, which leads to an optimal action policy that results in desirable states [17]. The RL framework for type-1 diabetes includes the following elements:
(i) The state vector x_k at time instance k consists of the states of the patient:
x_k = [G_k − G_d, X_k]^T,
where G_k and G_d are the measured and desired blood glucose levels, respectively, and X_k is the interstitial insulin activity (defined in the appendix).
(ii) The control variable (insulin action) u_k, which is part of the total insulin I_k (a combination of the basal and the bolus insulin (Figure 1)):
I_k = I_{basal,k} + I_{bolus,k},
where I_{basal,k} and I_{bolus,k} are the basal and bolus insulin at time instance k, respectively.
(iii) The cost c_{k+1} received one time step later as a consequence of the action. In this paper, the cost was calculated by the following quadratic function:
c_{k+1} = x_k^T Q x_k + r u_k^2,
where Q is a symmetric weighting matrix and r is a scalar weight. Each element in matrix Q and the value of r indicate the weighting factors of the cost function. The element in the first row and first column of Q has the highest value, which corresponds to the weighting of the difference between the measured blood glucose and the prescribed healthy value. Since our ultimate goal is to reduce this difference, this factor should have the largest value in the cost function. The element in the second row and second column of Q corresponds to the weighting of the interstitial insulin activity. The value of r indicates the weighting factor of the action (basal update).
Minimizing the cost function, therefore, becomes the problem of minimizing the difference between the measured blood glucose and the desired value, the interstitial insulin activity, and the change in basal insulin.
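The quadratic cost of equation (3) can be sketched in a few lines of Python. The weights below are illustrative assumptions, not the paper's values; the only structural requirement, as described above, is that the glucose-deviation weight dominates.

```python
import numpy as np

def stage_cost(x, u, Q, r):
    """Quadratic stage cost c = x^T Q x + r u^2, as in equation (3)."""
    x = np.asarray(x, dtype=float)
    return float(x @ Q @ x + r * u ** 2)

# Illustrative weights (assumed): the (1,1) entry of Q, which penalizes
# the glucose deviation, is deliberately the largest.
Q = np.array([[10.0, 0.0],
              [0.0,  0.1]])
r = 0.01

x = np.array([15.0, 0.2])  # [glucose deviation (mg/dL), interstitial insulin activity]
u = 0.5                    # basal-update action
c = stage_cost(x, u, Q, r)
```

With these weights, a 15 mg/dL glucose deviation dominates the cost, so the learned policy is pushed primarily toward closing that gap.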
At time instance k, a sequence of observations would be x_k, u_k, c_{k+1}, and x_{k+1}. Based on this observation, the agent receives information about the state of the patient and chooses an insulin action. The body reacts to this action and transitions to a new state, which determines the cost of the action.
For the control design purpose, the blood glucose model (Appendix) was divided into three submodels: the meal model (G_{meal}), the insulin model (G_{ins}), and the glucose kinetics (G_{glucose}). The controller has three main components: the actor, the critic, and the feedforward algorithm. The critic is used to estimate the action-value function, the actor's task is to obtain the optimal basal insulin, and the feedforward algorithm is used to propose the bolus insulin profile for disturbance compensation (food intake). The purpose of the Kalman filter is to estimate unmeasurable states of the patient.
2.2. Basal Update by Actor and Critic
When the patient is in a fasting condition, the controller only needs to change the basal insulin level through the actor and the critic. Based on the current state x_k, the actor proposes an insulin action u_k through the policy u_k = π(x_k). The updated basal rate is obtained from u_k as
I_{basal,k} = I_b + u_k,
where I_b is the equilibrium basal plasma insulin concentration.
After each action, the patient transitions to a new state, and the cost associated with the previous action can be calculated using equation (3). The action-value function (Q-function) of action u_k is defined as the accumulated cost when the controller takes action u_k at time instance k and then continues following policy π:
Q^π(x_k, u_k) = Σ_{i=k}^{∞} γ^{i−k} c_{i+1},
where γ (with 0 < γ < 1) is the discount factor that indicates the weighting of future cost in the action-value function.
The action-value function depends on the current state and the next action. It was shown that the action-value function satisfies the following recursive equation (Bellman equation) [15, 17]:
Q^π(x_k, u_k) = c_{k+1} + γ Q^π(x_{k+1}, π(x_{k+1})).
Since the state space and action space are infinite, function approximation was used in this paper for estimation of the Q-function. In this case, the Q-function was approximated as a quadratic function of the vectors x_k and u_k:
Q(x_k, u_k) ≈ z_k^T H z_k,
where the symmetric and positive definite matrix H is called the kernel matrix and contains the parameters that need to be estimated. Vector z_k is the combined vector of x_k and u_k:
z_k = [x_k^T, u_k]^T.
With the Kronecker operation, the approximated Q-function can be expressed as a linear combination of the basis function φ(z_k):
Q(x_k, u_k) ≈ w^T φ(z_k), with φ(z_k) = z_k ⊗ z_k,
where w is the vector that contains the elements of H and ⊗ is the Kronecker product.
By substituting Q in equation (6) with w^T φ(z_k) and using the policy iteration method with the least-squares algorithm [15], the elements of vector w can be estimated. Matrix H can then be obtained from w using the Kronecker transformation.
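As a concrete illustration of this parameterization, the NumPy sketch below (illustrative numbers only, not values learned from the glucose model) shows that writing the basis as φ(z) = z ⊗ z makes the quadratic form z^T H z linear in the stacked entries of H, which is what allows the least-squares step to recover w:

```python
import numpy as np

def basis(z):
    """Quadratic basis phi(z) = z (x) z (Kronecker product), so that
    Q(z) = z^T H z = w^T phi(z) with w the stacked entries of H."""
    z = np.asarray(z, dtype=float)
    return np.kron(z, z)

def kernel_from_weights(w, n):
    """Rebuild the kernel matrix H from the estimated weight vector w,
    symmetrizing because z^T H z depends only on the symmetric part of H."""
    H = np.asarray(w, dtype=float).reshape(n, n)
    return 0.5 * (H + H.T)

# Round trip with an arbitrary symmetric kernel (illustrative values)
H = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 0.5]])
z = np.array([1.0, -2.0, 0.5])   # z = [x; u] for a 2-state, 1-action system
w = H.reshape(-1)                # weight vector: stacked entries of H
q_linear = w @ basis(z)          # linear-in-parameters evaluation
q_quadratic = z @ H @ z          # direct quadratic form; the two agree
```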
By decomposing the kernel matrix H into the smaller matrices H_xx, H_xu, H_ux, and H_uu, the approximated Q-function can be written as follows:
Q(x_k, u_k) = [x_k; u_k]^T [H_xx, H_xu; H_ux, H_uu] [x_k; u_k].
The current policy is improved with actions that minimize the Q-function. This can be done by first taking the partial derivative of the Q-function with respect to u_k and then solving ∂Q/∂u_k = 0. The optimal action can thereafter be obtained as follows [15]:
u_k^* = −H_uu^{−1} H_ux x_k.
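A minimal sketch of this policy-improvement step, assuming two states and a scalar action with an illustrative kernel matrix H (assumed values, not learned from the glucose model):

```python
import numpy as np

def greedy_policy_gain(H, n_x):
    """Partition the kernel matrix H into [[H_xx, H_xu], [H_ux, H_uu]] and
    return the gain K of the minimizing action u = K x = -H_uu^{-1} H_ux x."""
    H_ux = H[n_x:, :n_x]
    H_uu = H[n_x:, n_x:]
    return -np.linalg.solve(H_uu, H_ux)  # avoids forming an explicit inverse

# Illustrative symmetric positive definite kernel for n_x = 2 states, 1 action
H = np.array([[4.0, 0.5, 1.0],
              [0.5, 2.0, 0.2],
              [1.0, 0.2, 0.5]])
K = greedy_policy_gain(H, n_x=2)
x = np.array([10.0, 0.1])   # [glucose deviation, insulin activity]
u_star = (K @ x).item()     # optimal basal-update action for this state
```

Because the Q-function is quadratic and H_uu is positive definite, the stationary point is the unique minimizer, so the improved policy is linear in the state.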
With that, the update of basal insulin is
I_{basal,k} = I_b + u_k^*,
where I_b is the equilibrium basal plasma insulin concentration.
2.3. Bolus Update by Feedforward Algorithm
When the patient consumes meals, in addition to the basal insulin, the controller calculates and applies boluses to compensate for the rise of blood glucose caused by the carbohydrate in the food. The feedforward algorithm first predicts how much the blood glucose level will rise and then suggests a bolus profile to counter the effects of the meal. The starting time of the bolus doses was also calculated by the algorithm based on the meal intake model.
Since the meal intake model (equations (A.1) and (A.2)) and the insulin model (equation (A.4)) are linear time-invariant (LTI) models, they can be transformed from state-space equations into the transfer functions G_{meal}(s) and G_{ins}(s). Descriptions and values of the model parameters are given in Tables 1 and 2. The transfer function from the meal intake to the blood glucose level can then be calculated from these submodels.


In order to compensate for the meal, the gain of the open-loop path from the meal to the blood glucose must be made as small as possible. Hence, the feedforward transfer function G_{ff}(s) was chosen such that G_{meal}(s) + G_{ff}(s) G_{ins}(s) = 0, which leads to
G_{ff}(s) = −G_{meal}(s)/G_{ins}(s).
The meal compensation bolus in the s-domain can be calculated from the feedforward transfer function:
I_{bolus}(s) = G_{ff}(s) D(s),
where D(s) is the announced meal intake.
Hence, the feedforward action becomes the output of the following dynamic system, which can be solved easily using any ordinary differential equation solver:
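To make this structure concrete, here is a small simulation sketch that assumes first-order meal and insulin submodels, G_{meal}(s) = k_m/(τ_m s + 1) and G_{ins}(s) = −k_i/(τ_i s + 1) (simplifications with illustrative parameters, not the appendix model's fitted values). The resulting compensator −G_{meal}/G_{ins} is a realizable lead-lag filter, integrated here with forward Euler:

```python
import numpy as np

def feedforward_bolus(d, dt, k_m=1.0, tau_m=40.0, k_i=1.0, tau_i=20.0):
    """Simulate G_ff(s) = (k_m/k_i) * (tau_i s + 1) / (tau_m s + 1), i.e.
    -G_meal/G_ins for assumed first-order meal and insulin submodels.
    d: announced meal disturbance sampled every dt minutes."""
    g = k_m / k_i        # static gain of the compensator
    a = tau_i / tau_m    # direct-feedthrough fraction of the lead-lag filter
    x = 0.0              # state of the first-order lag
    u_ff = np.empty(len(d))
    for k, dk in enumerate(d):
        u_ff[k] = g * (a * dk + (1.0 - a) * x)  # feedthrough + lagged part
        x += dt * (dk - x) / tau_m              # forward-Euler lag update
    return u_ff

# A 50 g CHO meal announced at t = 0 and held constant for 4 simulated hours
dt = 1.0                 # minutes
d = np.full(240, 50.0)
u_ff = feedforward_bolus(d, dt)
# The compensator reacts immediately (u_ff[0] > 0), then settles near g * 50
```

The immediate jump at t = 0 reflects the proactive nature of feedforward control discussed in the introduction: the bolus begins before the meal has raised the measured glucose.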
2.4. Kalman Filter for the Type-1 Diabetes System
Since the interstitial insulin activity and the amounts of glucose in compartments 1 and 2 cannot be measured directly during implementation, a Kalman filter was used to provide an estimation of the state variables from the blood glucose level. The discretized version of the type-1 diabetes system can be written in the following form:
x_{k+1} = A x_k + B u_k + G w_k,
y_k = C x_k + v_k,
where A, B, and C are the linearized coefficient matrices of the model, G is the noise input matrix, the output y_k is the measured blood glucose deviation from the desired level, w_k is the insulin input noise, and v_k is the blood glucose measurement noise with a zero-mean Gaussian distribution. The variances of w_k and v_k are denoted Q_w and R_v, respectively.
Based on the discretized model, a Kalman filter was implemented through the following equation:
x̂_{k+1|k} = A x̂_{k|k−1} + B u_k + K (y_k − C x̂_{k|k−1}),
where x̂_{k|k−1} denotes the estimation of x_k based on measurements available at time k − 1. The gain K is the steady-state Kalman filter gain, which can be calculated by
K = A P C^T (C P C^T + R_v)^{−1},
where P is the solution of the corresponding algebraic Riccati equation [13, 14, 18]:
P = A P A^T − A P C^T (C P C^T + R_v)^{−1} C P A^T + G Q_w G^T.
By assigning values to the noise variances Q_w and R_v, the steady-state Kalman filter gain was calculated from equation (23).
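Numerically, the steady-state gain can be obtained by iterating the Riccati recursion to its fixed point. The sketch below uses plain NumPy with an assumed two-state linearization; the A, C, Q_w, and R_v values are illustrative, not the paper's, and G Q_w G^T is lumped into a single process-noise covariance:

```python
import numpy as np

def steady_state_kalman_gain(A, C, Qw, Rv, iters=500):
    """Iterate the discrete-time Riccati recursion until it converges and
    return the steady-state predictor gain K = A P C^T (C P C^T + Rv)^{-1}."""
    P = np.eye(A.shape[0])
    for _ in range(iters):
        S = C @ P @ C.T + Rv                  # innovation covariance
        K = A @ P @ C.T @ np.linalg.inv(S)    # Kalman gain at this iterate
        P = A @ P @ A.T - K @ C @ P @ A.T + Qw
    return K, P

# Illustrative two-state linearized glucose model (assumed values)
A = np.array([[0.98, -0.05],
              [0.00,  0.95]])
C = np.array([[1.0, 0.0]])    # only the glucose deviation is measured
Qw = np.diag([0.01, 0.01])    # lumped process-noise covariance G Qw G^T
Rv = np.array([[1.0]])        # measurement-noise variance
K, P = steady_state_kalman_gain(A, C, Qw, Rv)
```

The same fixed point could also be obtained from a dedicated discrete algebraic Riccati equation solver; the plain iteration is shown here to keep the sketch self-contained.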
2.5. Simulation Setup
First, pretraining of the algorithm was conducted on the type-1 diabetes model in the scenario where the patient is in a fasting condition (no food intake). The purpose of the pretraining simulation is to obtain an initial estimate of the action-value function. The learning process was conducted by repeating the experiment over multiple episodes. Each episode starts with an initial blood glucose of 90 mg/dL and ends after 30 minutes. The objective of the algorithm is to search for and explore actions that can drive the blood glucose to its target level of 80 mg/dL.
By using the initial estimate of the action-value function, the controller was then tested in a daily scenario with food intake. Comparisons were made between the proposed reinforcement learning with feedforward (RLFF) controller, the optimal RL (ORL) controller [15], and the proportional-integral-derivative (PID) controller. The ORL was designed with the same parameters and pretrained in the same scenario as the RLFF. The PID controller gains were chosen to produce a blood glucose settling time similar to that of the RLFF.
In order to understand the effects of different food types on the controlled system, two sets of simulations were conducted for food with slow and fast glucose absorption rates but containing a similar amount of carbohydrate. Absorption rates in the simulations are characterized by the absorption-rate parameter of the meal model, with one value corresponding to food with a slow absorption rate and another corresponding to food with a fast absorption rate. The amount of carbohydrate (CHO) per meal can be found in Figure 2.
Next, the performance of the proposed controller was evaluated under uncertain meal information. Two cases of uncertainty were considered: uncertain CHO estimation and uncertain meal-recording time. In the uncertain CHO estimation case, the estimated CHO information provided to the controller was assumed to be normally distributed with a standard deviation of 46% of the correct carbohydrate value shown in Figure 2. This standard deviation was based on the difference between average adult estimates and computerized evaluations by a dietitian [19]. For the uncertain meal-recording time, the estimated meal starting time was assumed to be normally distributed with a standard deviation of two minutes from the real starting time. This standard deviation was chosen arbitrarily because systematic research on the accuracy of meal-time recording for patients with type-1 diabetes could not be found. For each case, multiple simulations were conducted in the same closed-loop system with its corresponding random variables. From the obtained results, the mean and standard deviation of the blood glucose responses at each time point were calculated and analyzed.
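The uncertainty scenarios above amount to Monte Carlo sampling of the meal announcements. A minimal sketch of that sampling step is shown below (NumPy, with a fixed seed and a hypothetical 50 g meal; the closed-loop simulation each sample would drive is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducible draws

def noisy_meal_announcements(cho_true, t_true, n_runs,
                             cho_sd_frac=0.46, t_sd=2.0):
    """Sample the uncertain meal information used in the robustness study:
    announced CHO ~ N(true, (0.46 * true)^2) per [19], and announced meal
    time ~ N(true, (2 min)^2). Negative CHO draws are clipped to zero."""
    cho = rng.normal(cho_true, cho_sd_frac * cho_true, size=n_runs)
    t = rng.normal(t_true, t_sd, size=n_runs)
    return np.clip(cho, 0.0, None), t

# Hypothetical meal: 50 g CHO announced at t = 480 min, 1000 Monte Carlo runs
cho, t = noisy_meal_announcements(cho_true=50.0, t_true=480.0, n_runs=1000)
# Each (cho[i], t[i]) pair would drive one closed-loop simulation; the mean
# and standard deviation of the resulting glucose traces give the shaded
# bands of the robustness figures.
```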
3. Results
After pretraining in the no-meal scenario, an initial estimate of the Q-function was obtained.
The initial basal policy was then derived from the initial Q-function and equation (12).
The initial estimate of the Q-function and the initial basal policy were used for the subsequent testing simulations of the control algorithm.
During the simulation with correct meal information, the blood glucose responses of the RLFF, the ORL, and the PID are shown in Figures 3 and 4. The insulin concentrations during the process can be found in Figures 5 and 6. With slow-absorption food, the blood glucose fluctuated within approximately ±30 mg/dL of the desired value for all three controllers (Figure 3). With fast-absorption meals, however, the postmeal blood glucose fluctuation stayed within ±40 mg/dL with the RLFF, compared to ±60 mg/dL with the ORL, and was significantly smaller than the ±80 mg/dL range of the PID (Figure 4).
Figures 7 and 8 show the blood glucose variation under uncertain meal time and CHO counting. The upper and lower bounds of the shaded areas show the mean blood glucose value plus and minus the standard deviation at each instance. Under uncertain meal information, the upper bound stayed within 40 mg/dL of the desired blood glucose value for fast-absorption food and within 15 mg/dL for slow-absorption food. The lower bound stayed within 15 mg/dL of the desired value for fast-absorption food and within 5 mg/dL for slow-absorption food.
4. Discussion
The controller has shown its capability to reduce the rise of postmeal blood glucose in our simulations. As can be seen in Figures 3 and 4, all three controllers were able to stabilize the blood glucose. However, with the RLFF, the added bolus makes the insulin response much faster when the blood glucose level changes, which reduces the peak of the postmeal glucose rise by approximately 30 percent compared to the ORL and 50 percent compared to the PID in the fast-absorption case. The blood glucose undershoot (the distance between the lowest blood glucose and the desired blood glucose value) of the PID controller is also much larger than that of the RLFF and the ORL, with the RLFF having the smallest undershoot of the three controllers. Low blood glucose (hypoglycemia) can be very dangerous for patients with type-1 diabetes, so the simulation results demonstrate the advantage of the RLFF in improving patient safety. In general, with the feedforward algorithm, the proposed method is an effective tool for countering the effects of known external events such as meal intake.
Among the uncertainties, carbohydrate counting had a greater effect on the variation of the blood glucose than meal-time recording, especially with slow-absorbing food. The uncertainty in recording meal time may also lead to a larger undershoot of blood glucose below the desired level, as can be seen in Figure 7. Following the same trend as the previous simulations, the fluctuation range of the blood glucose with slow-absorbing food is smaller than with fast-absorbing food. In general, the control algorithm kept the blood glucose at a healthy level even though the uncertainties affected the variation of the responses. However, accurate carbohydrate counting and meal-time recording remain important for blood glucose control in order to completely avoid the risk of hypoglycemia.
5. Conclusion
This paper proposes a blood glucose controller based on reinforcement learning and a feedforward algorithm for type-1 diabetes. The controller regulates the patient's glucose level using both basal and bolus insulin. Simulation results of the proposed controller, the optimal reinforcement learning controller, and the PID controller on a type-1 diabetes model show that the proposed algorithm is the most effective of the three. The basal updates can stabilize the blood glucose, and the boluses can reduce the glucose undershoot and prevent hypoglycemia. Comparison of the blood glucose variation under different uncertainties provides insight into how the accuracy of carbohydrate estimation and meal-recording time affects the closed-loop responses. The results show that the control algorithm was able to keep the blood glucose at a healthy level even though the uncertainties created variations in the blood glucose responses.
Appendix
Blood Glucose Model
In this paper, the insulin-glucose process was used as the subject of our simulations. The model is described by the following equations [20–23], where variable descriptions and parameter values are given in Tables 1 and 2. In this model, the inputs are the amount of CHO intake D and the insulin concentration, and the output of the model is the blood glucose concentration. It is assumed that the blood glucose is controlled by using an insulin pump and that there is no delay between the administered insulin and the plasma insulin concentration.
Abbreviations
RL:  Reinforcement learning 
RLFF:  Reinforcement learning with feedforward algorithm 
ORL:  Optimal reinforcement learning 
PID:  Proportional-integral-derivative 
LTI:  Linear timeinvariant 
CHO:  Carbohydrate. 
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
The research has been funded by financial support from Tromsø Forskningsstiftelse. The publication charges for this article have been funded by a grant from the publication fund of UiT, the Arctic University of Norway.
References
 Q. Wang, P. Molenaar, S. Harsh et al., “Personalized state-space modeling of glucose dynamics for type 1 diabetes using continuously monitored glucose, insulin dose, and meal intake,” Journal of Diabetes Science and Technology, vol. 8, no. 2, pp. 331–345, 2014.
 G. Marchetti, M. Barolo, L. Jovanovic, H. Zisser, and D. E. Seborg, “An improved PID switching control strategy for type 1 diabetes,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 3, pp. 857–865, 2008.
 S. Soylu, K. Danisman, I. E. Sacu, and M. Alci, “Closed-loop control of blood glucose level in type-1 diabetics: a simulation study,” in Proceedings of the International Conference on Electrical and Electronics Engineering (ELECO), pp. 371–375, Bursa, Turkey, November 2013.
 D. Boiroux, A. K. Duun-Henriksen, S. Schmidt et al., “Overnight glucose control in people with type 1 diabetes,” Biomedical Signal Processing and Control, vol. 39, pp. 503–512, 2018.
 H. Lee and B. W. Bequette, “A closed-loop artificial pancreas based on model predictive control: human-friendly identification and automatic meal disturbance rejection,” Biomedical Signal Processing and Control, vol. 4, no. 4, pp. 347–354, 2009.
 M. K. Bothe, L. Dickens, K. Reichel et al., “The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas,” Expert Review of Medical Devices, vol. 10, no. 5, pp. 661–673, 2014.
 M. De Paula, L. O. Ávila, and E. C. Martínez, “Controlling blood glucose variability under uncertainty using reinforcement learning and Gaussian processes,” Applied Soft Computing, vol. 35, pp. 310–332, 2015.
 C. J. C. H. Watkins and P. Dayan, “Technical note: Q-learning,” in Reinforcement Learning, vol. 292, pp. 55–68, Springer US, Boston, MA, USA, 1992.
 J. Pineau, M. G. Bellemare, A. J. Rush, A. Ghizaru, and S. A. Murphy, “Constructing evidence-based treatment strategies using methods from computer science,” Drug and Alcohol Dependence, vol. 88, no. S2, pp. S52–S60, 2007.
 K. Lunze, T. Singh, M. Walter, M. D. Brendel, and S. Leonhardt, “Blood glucose control algorithms for type 1 diabetic patients: a methodological review,” Biomedical Signal Processing and Control, vol. 8, no. 2, pp. 107–119, 2013.
 S. P. Bhattacharyya, “Disturbance rejection in linear systems,” International Journal of Systems Science, vol. 5, no. 7, pp. 633–637, 1974.
 H. Zhong, L. Pao, and R. de Callafon, “Feedforward control for disturbance rejection: model matching and other methods,” in Proceedings of the 24th Chinese Control and Decision Conference (CCDC), pp. 3528–3533, Taiyuan, China, May 2012.
 F. Lewis, Optimal Estimation, John Wiley & Sons, Hoboken, NJ, USA, 1986.
 G. F. Franklin, J. D. Powell, and M. L. Workman, Digital Control of Dynamic Systems, Addison-Wesley, Boston, MA, USA, 2nd edition, 1990.
 D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, vol. 81, Institution of Engineering and Technology, London, UK, 1st edition, 2012.
 P. D. Ngo, S. Wei, A. Holubova, J. Muzik, and F. Godtliebsen, “Reinforcement-learning optimal control for type-1 diabetes,” in Proceedings of the 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 333–336, Las Vegas, NV, USA, March 2018.
 R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 1st edition, 1998.
 MathWorks, MATLAB Optimization Toolbox: User’s Guide (R2018a), MathWorks, Natick, MA, USA, 2018.
 A. S. Brazeau, H. Mircescu, K. Desjardins et al., “Carbohydrate counting accuracy and blood glucose variability in adults with type 1 diabetes,” Diabetes Research and Clinical Practice, vol. 99, no. 1, pp. 19–23, 2013.
 R. N. Bergman, Y. Z. Ider, C. R. Bowden, and C. Cobelli, “Quantitative estimation of insulin sensitivity,” American Journal of Physiology-Endocrinology and Metabolism, vol. 236, no. 6, p. E667, 1979.
 R. Hovorka, V. Canonico, L. J. Chassin et al., “Nonlinear model predictive control of glucose concentration in subjects with type 1 diabetes,” Physiological Measurement, vol. 25, no. 4, pp. 905–920, 2004.
 M. E. Wilinska, L. J. Chassin, H. C. Schaller, L. Schaupp, T. R. Pieber, and R. Hovorka, “Insulin kinetics in type-1 diabetes: continuous and bolus delivery of rapid acting insulin,” IEEE Transactions on Biomedical Engineering, vol. 52, no. 1, pp. 3–12, 2005.
 A. Mösching, Reinforcement Learning Methods for Glucose Regulation in Type 1 Diabetes, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2016.
Copyright
Copyright © 2018 Phuong D. Ngo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.