Abstract

The large scale, time-varying nature, and diversity of physically coupled networked infrastructures, such as the power grid and transportation systems, make their controller design, implementation, and expansion complex. To tackle these challenges, we propose an online distributed reinforcement learning control algorithm that equips each subsystem (agent) with a one-layer neural network so that it can adapt to variations of the networked infrastructure. Each controller includes a critic network and an action network, which approximate the strategy utility function and the desired control law, respectively. To avoid a large number of trials and to improve stability, the training of the action network introduces a supervised learning mechanism into the reduction of the long-term cost. The stability of the control system under the learning algorithm is analyzed, and upper bounds on the tracking error and the neural network weights are estimated. Simulation results illustrate the effectiveness of the proposed controller and indicate that it remains stable under communication delay and disturbances.

1. Introduction

The increasing interconnection of physical systems through cybernetworks or physical networks has been observed in many infrastructures, such as the power grid [1, 2], transportation networks, and unmanned systems. One critical issue in these so-called cyberphysical systems is the complexity that arises as the system grows large, especially for the control problem. Consequently, distributed schemes have been suggested to reduce communication and computational cost compared with centralized control schemes [3]. However, the coupling of subsystems and the nonstatic environment in both the cybernetwork and the physical network bring many challenges, such as physical interference among subsystems, time-varying plant parameters, communication delay, and the expansibility of the cyberphysical system.

To increase the expansibility of the cyberphysical system, the multiagent concept is usually introduced. The cyberphysical system is divided into many agents, each of which has its own control policy and a unified framework for pursuing its target [4]. Expanding the cyberphysical system then reduces to duplicating agents without redesigning the control policy. To deal with the physical coupling of the networked system, one common approach is to decouple the subsystems in the control design [5–8]. Each subsystem may utilize state information from neighboring subsystems to mitigate their physical interference, or the designer treats the interference as a random disturbance [9, 10]. On the other hand, to address a nonstatic environment with time-varying plants, online supervised learning, adaptive control, and reinforcement learning algorithms have been suggested; all of them adjust their control parameters adaptively online, and the combination of neural networks and reinforcement learning usually achieves better control performance than conventional supervised learning and adaptive control schemes [11]. Reinforcement learning constructs a long-run cost-to-go function to predict future cost, so each control action takes the estimated future consequences into account [12]. In contrast, the adaptive ability of adaptive control is limited by the number of time-varying parameters, which may be very large in practice.

Recently, much research has focused on reinforcement learning with neural networks. This research falls into two categories. The first category simply utilizes neural networks to approximate unknown parts of the system model or control strategy, such as the cost-to-go function and the optimal control law. Prokhorov and Wunsch discussed three families of reinforcement learning control design [13], heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized dual heuristic programming (GDHP), and their application in optimal control. Xu et al. focused on experimental studies of real-time online learning control for nonlinear systems using kernel-based ADP methods [14]. Lee et al. focused on a class of reinforcement learning (RL) algorithms, named integral RL (I-RL), that solve continuous-time (CT) nonlinear optimal control problems with input-affine system dynamics [15]. The second category combines the first approach with supervised learning to guarantee convergence of the learning system; supervised reinforcement learning also reduces the number of trials by employing an error signal built from domain knowledge [16–18], which generates instinctive feedback for correcting the control actions. Xu et al. proposed a novel adaptive-critic-based neural network (NN) controller for nonlinear pure-feedback systems [19]. Liu et al. developed a reinforcement learning-based adaptive tracking control technique to tolerate faults for a class of unknown multiple-input multiple-output nonlinear discrete-time systems with fewer learning parameters [20]. Besides these, researchers have tried to employ multilayer/deep neural networks to approximate the functions used in control, so that model precision is enhanced and performance improves as a consequence [21, 22]. However, it is hard to analyze the stability of such learning algorithms. Moreover, learning may be slow because the number of tuned parameters in a deep neural network is very large [23].

In this paper, we suggest a distributed neural controller for physically coupled networked discrete-time systems via online reinforcement learning. We model each subsystem as an agent; each agent uses its own state and the state information of some physically neighboring subsystems to compute the optimal control action. A one-layer adaptive critic neural network and an action neural network are proposed to model the cost function and the optimal control law. With a deterministic learning algorithm, we incorporate supervised learning into the reinforcement learning algorithm to accelerate the convergence rate. The stability of the learning algorithm is analyzed and the bound of each parameter is estimated. The contribution of this paper is twofold.

(1) We propose a distributed online reinforcement learning algorithm for controlling physically coupled networked discrete-time systems.

(2) Sufficient conditions for guaranteeing learning algorithm stability and system stability are derived, and the upper bounds of the parameters are estimated.

The rest of the paper is organized as follows. Section 2 models the physically coupled networked system and the control system with mathematical dynamic equations and introduces some assumptions that simplify the analysis. Section 3 describes the control system design via the online reinforcement learning algorithm. Section 4 presents the stability analysis in detail. Section 5 elaborates simulation results that illustrate the effectiveness and advantages of our algorithm. Section 6 concludes the paper.

2. Physically Coupled Networked Control System and Problem Statement

In a physically coupled networked system, each subsystem may physically interfere with its neighboring subsystems and change their state trajectories or dynamics. The structure is shown in Figure 1. In order to improve the control system performance, cyberconnections over communication infrastructures are installed for exchanging the states of neighboring subsystems [3]. The topology of the cyberconnections and of the physical connections may differ because of practical constraints on cyberresources.

2.1. System Dynamic Equation

For a physically coupled networked system, consider that it consists of nonlinear dynamic subsystems given in the discrete-time form, where the state vector, control input vector, and disturbance vector of each subsystem appear together with three smooth vector functions describing the local system dynamics, the neighbor interference, and the control input interference, all of which are unknown. The physically connected neighbor set of a subsystem contains the subsystems that can interfere with its state trajectory. In order to simplify the analysis, the following reasonable assumptions are made [11].

Assumption 1. The disturbances are bounded.

Assumption 2. The control input matrix of each subsystem is invertible.

The bound in Assumption 1 is a positive real number, which means that the magnitude of the disturbances is bounded. Assumption 2 is made to simplify the analysis of the action network, which will be discussed in the next section.

The control objective is to track the state target vector; subtracting the target from the state gives the error equation, and the subsystem dynamics can therefore be rewritten in the error form (3).
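
To make this setup concrete, the following minimal Python sketch steps one subsystem under an assumed additive coupled form and computes the tracking error; the function names and the additive structure are illustrative assumptions, since the paper leaves the dynamics, coupling, and input functions unknown.

```python
import numpy as np

def step_subsystem(x_i, u_i, neighbor_states, f_i, g_ij, h_i, d_i):
    """One step of an assumed additive coupled form
    x_i(k+1) = f_i(x_i) + sum_j g_ij(j, x_j) + h_i(x_i) @ u_i + d_i(k);
    the paper leaves f, g, h unknown, so this structure is illustrative."""
    coupling = sum(g_ij(j, x_j) for j, x_j in neighbor_states.items())
    return f_i(x_i) + coupling + h_i(x_i) @ u_i + d_i()

def tracking_error(x_i, x_i_target):
    """Tracking error between the subsystem state and its target vector."""
    return x_i - x_i_target
```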

2.2. Distributed Control System and Control Objective

A distributed control system is more flexible and scalable than centralized control. Moreover, it divides one large system controller into many small subsystem controllers, which reduces the state dimension handled by each controller, so that considerable computational resources and time can be saved [24].

The control objective is to decrease the error vector as fast as possible and to keep it bounded in a small region for a given bounded disturbance. For each subsystem controller, an exponential damping rate of the error is usually expected, with a damping factor between 0 and 1. Therefore, the desired control input of the subsystem takes the form of (5), in which the cyberconnected neighbor set of the subsystem appears; that is, the controller of the subsystem utilizes the state information received from neighboring subsystems via the communication network.
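
As a rough illustration of the desired control law (5), the sketch below solves for the input that produces an exponentially damped error under the assumed additive model above and Assumption 2 (invertible input matrix); the damping factor `gamma_i` and the exact knowledge of the model functions are assumptions used only to define the supervision target.

```python
import numpy as np

def desired_control(x_i, x_i_target_next, e_i, neighbor_states,
                    f_i, g_ij, h_i, gamma_i=0.5):
    """Illustrative desired control law: pick u_i so that, ignoring the
    disturbance, the next error equals gamma_i * e_i (exponential damping),
    assuming the additive model above and an invertible h_i (Assumption 2)."""
    coupling = sum(g_ij(j, x_j) for j, x_j in neighbor_states.items())
    desired_next_state = x_i_target_next + gamma_i * e_i
    return np.linalg.solve(h_i(x_i), desired_next_state - f_i(x_i) - coupling)
```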

However, the local dynamics, the neighbor interference function, and the control input function are unknown. A reinforcement learning scheme with neural networks is therefore proposed to approximate the desired control strategy and the strategy utility function of the long-term cost.

3. Control System Design by Reinforcement Learning and Neural Network

The proposed distributed control scheme with reinforcement learning consists of three parts: the first part introduces a strategy utility function (also called the long-term cost function); the second part describes the critic neural network and its online training algorithm; the last part of this section elaborates the action neural network and its parameter updating algorithm.

3.1. Strategy Utility Function

The utility function defined for each subsystem is based on the current filtered state error and uses a given constant positive scalar threshold for the lth element of the state error vector of the subsystem. The utility function is also an indicator of the current tracking performance: if it equals 1, the control system is in a bad state and the state deviates considerably from the desired value; if it equals 0, the tracking performance is good and the lth state error lies in a small bounded region.

The long-term cost is the sum of the utility function over the sampling instants. Based on the utility function, the strategic utility function is defined as a weighted sum of the utility over N stages, where N is the stage number. If N is infinite or very large, the strategy utility function is defined over a rolling horizon with a fixed number of stages. It is obvious that minimizing the strategy utility function is the control objective that improves the control performance.
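
A minimal sketch of these two quantities is given below, assuming a binary per-step utility built from element-wise thresholds and a weighted (discounted) rolling-horizon sum; the aggregation with a maximum and the discount factor `alpha` are assumptions.

```python
import numpy as np

def utility(e_i, thresholds):
    """Per-step utility: 1 when some component of the state error exceeds its
    positive threshold (poor tracking), 0 otherwise; aggregating the
    per-element indicators with a maximum is an assumption."""
    return float(np.any(np.abs(e_i) > thresholds))

def strategy_utility(utilities, alpha=0.9):
    """Strategic utility over a rolling horizon of N stages: a weighted
    (discounted) sum of the per-step utilities (alpha is an assumed factor)."""
    return sum(alpha ** n * u for n, u in enumerate(utilities))
```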

3.2. Critic Network Design

In our proposed scheme, a one-layer neural network is considered for approximating the strategy utility function. To simplify the stability analysis, only the output-layer weights of the neural network are adjustable during online training. The one-layer network approximates the strategy utility function as a weighted sum of basis functions. The basis function is a Gaussian vector function, defined in (9) in terms of the communication latency, the Gaussian center vectors, and the Gaussian width; the centers should cover the system operating region as much as possible. The approximation error will be very small if the dimension of the basis function is large enough [11]. The relation between the optimal control actions of the subsystem at consecutive sampling instants is given in (10). We then estimate the strategy utility function with the trainable output weights, which defines the prediction error of the critic NN, and the objective function of the critic NN to be minimized at the kth sampling instant is defined in (13). One common way to decrease the objective function is to update the critic NN parameters along the negative gradient direction. Applying the chain rule, the partial derivative of the objective function (13) with respect to the critic weights is obtained, and the updating law for the critic NN of the subsystem follows as (15), in which a given scalar represents the updating step size. The choice of this step size is very important: if it is too large, the online learning may diverge.
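
The following sketch implements a critic of this kind, assuming a temporal-difference form of the prediction error and a squared-error objective; the class name, the discount factor, and the exact error form are assumptions rather than the paper's displayed equations.

```python
import numpy as np

class CriticNN:
    """One-layer critic with Gaussian (RBF) basis and adjustable output weights."""

    def __init__(self, centers, width, lr=0.01, alpha=0.9):
        self.centers = centers            # rows: Gaussian centers covering the operating region
        self.width = width                # Gaussian width
        self.lr = lr                      # updating step size (kept small to avoid divergence)
        self.alpha = alpha                # assumed discount factor
        self.w = np.zeros(len(centers))   # output-layer weights (the only trained parameters)

    def basis(self, z):
        d2 = np.sum((self.centers - z) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def value(self, z):
        return self.w @ self.basis(z)

    def update(self, z_prev, z_now, utility_now):
        """Gradient step on 0.5 * e_c^2 with an assumed TD-style prediction error
        e_c = Q_hat(k-1) - alpha * Q_hat(k) - r(k)."""
        phi_prev = self.basis(z_prev)
        e_c = self.w @ phi_prev - self.alpha * self.value(z_now) - utility_now
        self.w -= self.lr * e_c * phi_prev   # move along the negative gradient
        return e_c
```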

3.3. Action Neural Network Design

Our control objective is to minimize the tracking error and also to minimize the long-term cost function (strategy utility function); both depend on the control action at each step. The desired control action (5) is an expected strategy for approaching this objective, and an action neural network is suggested for approximating it. The desired control action can be written as the product of an optimal output weighting matrix, which minimizes the approximation residual, and a basis function of the same form as (9); the residual will be very small if the dimension of the basis function is large enough. However, the optimal weighting matrix and the residual are unknown, so the desired control action is estimated with an estimated weighting matrix. This yields an estimation error for the desired control action and causes the dynamics (3) to take a form that involves both the physically connected and the cyberconnected neighbor sets of the subsystem. In our proposed scheme, supervised learning is incorporated into the action neural network training to accelerate the convergence rate of the online updating. The objective of the policy is not only to minimize the long-term cost but also to approximate the desired control output through supervised learning. Thus, the error vector of the action network combines the supervised control error with the deviation of the estimated strategy utility function from its desired value, which can be set to 0 [20], through the principal square root. The cost function (21) is then defined for each step, and its partial derivative with respect to the action NN weights is obtained by the chain rule. Therefore, with the gradient descent principle, the action NN weight matrix is updated by (23), in which the updating step size for the online learning of the action neural network appears. The choice of this step size will be discussed in the next section, as it is associated with the stability of the online learning algorithm.
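
A corresponding sketch of the action network is given below; the way the supervised error and the critic-based term are combined (a simple weighted sum with coefficient `beta`) is an assumption, as is the squared-error objective behind the gradient step.

```python
import numpy as np

class ActionNN:
    """One-layer action network sharing the RBF basis form of the critic;
    trained with a combined supervised + reinforcement error (assumed form)."""

    def __init__(self, centers, width, n_inputs, lr=0.01, beta=0.5):
        self.centers = centers
        self.width = width
        self.lr = lr                                  # updating step size
        self.beta = beta                              # assumed weight of the RL term
        self.W = np.zeros((n_inputs, len(centers)))   # output weighting matrix

    def basis(self, z):
        d2 = np.sum((self.centers - z) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def act(self, z):
        return self.W @ self.basis(z)

    def update(self, z, u_desired, critic_value, desired_value=0.0):
        phi = self.basis(z)
        # Supervised term: deviation from the (approximate) desired control law.
        e_sup = self.W @ phi - u_desired
        # Reinforcement term: deviation of the critic output from its desired value (0).
        e_rl = critic_value - desired_value
        e_a = e_sup + self.beta * e_rl                # combined error (assumed form)
        self.W -= self.lr * np.outer(e_a, phi)        # gradient step on 0.5 * ||e_a||^2
        return e_a
```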

4. Stability Analysis

This section discusses the stability of the online learning algorithm and the tracking performance, which is necessary for the control design. The upper bounds of the tracking error and of the neural network weight parameters are analyzed. First, a theorem about the stability of the scheme is presented.

Theorem 3. For the networked control system described in (3) with the parameter updating algorithms in (15) and (23), suppose that the updating step sizes of the critic and action networks and the remaining control parameters satisfy the stated sufficient conditions, in which the auxiliary constants are determined by the system functions and controller parameters. Then the tracking error, the action network weight error, and the critic network weight error admit upper bounds as k tends to infinity, and the system is stable.

Proof of Theorem 3. For the dynamic system described in (3), (15), and (23), we first define a Lyapunov function that consists of the quadratic of the tracking error, the action network weight error, and the critic neural network weight error. For a subsystem, the first difference of the tracking-error term is evaluated along the error dynamics. For the strategy utility function, (10) leads to an expression involving the utility function under the optimal strategy, and the updating equation (15) yields the first difference of the critic weight error term; the last part of the variation comes from the action network weights updated by (23). Summing all of the above variations gives the total first difference of the Lyapunov function, whose coefficients are those given in Theorem 3. Therefore, under the conditions of the theorem, the first difference is negative whenever the errors exceed certain bounds, so the upper bound can be estimated as k tends to infinity; the residual approximation terms are very small and can be neglected if the dimension of the basis function is large enough. The upper bounds stated in the theorem then follow.
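
To make the structure of the argument easier to follow, a Lyapunov candidate of the kind described above can be written for subsystem i roughly as follows; the tilde quantities denote the weight estimation errors of the action and critic networks, and the normalization by the learning rates is an assumed choice, not the paper's exact expression. Stability then follows from showing that the first difference is negative whenever the errors exceed the bounds of Theorem 3.

```latex
% Assumed structure of the per-subsystem Lyapunov candidate.
L_i(k) = e_i^{\mathsf{T}}(k)\, e_i(k)
       + \tfrac{1}{\eta_{a,i}}\,\mathrm{tr}\!\left(\widetilde{W}_{a,i}^{\mathsf{T}}(k)\,\widetilde{W}_{a,i}(k)\right)
       + \tfrac{1}{\eta_{c,i}}\,\widetilde{w}_{c,i}^{\mathsf{T}}(k)\,\widetilde{w}_{c,i}(k),
\qquad
\Delta L_i(k) = L_i(k+1) - L_i(k).
```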

Remark 4. The stability of the system depends on the control parameters (the updating step sizes and the damping factor), the system functions, and the communication network, which affects one of the coupling-related parameters. It is obvious that, if a subsystem can obtain all the state information from its physically connected neighbors, this parameter becomes smaller, which improves the system performance because the absolute values of the corresponding coefficients become larger and the upper bounds of the tracking error and the weight errors decrease. Moreover, the signs of these coefficients are not necessarily definite, as they are not the coefficients of the estimated variables in the Lyapunov function variation expression (34).

5. Simulation Results

These simulations illustrate the effectiveness and advantages of our proposed control scheme in four aspects: (1) the effectiveness of the proposed scheme for a physically coupled networked control system tracking a sine wave signal under disturbances; (2) its effectiveness under communication delay; (3) its advantages compared with conventional reinforcement learning; (4) its effectiveness for a multi-control-input system.

The first simulation considers a networked system, called system I, as shown in Figure 2. System I consists of four nonlinear subsystems, each physically coupled with the other subsystems; their dynamic equations, the initial state values of the four subsystems, and the sinusoidal target signals are specified for the simulation, and the details of the other functions and variables are listed in Table 1.

Figure 2 illustrates both the physical connections and the cyberconnections of system I. The communication network sends state information from subsystem 1 to 2, from 2 to 3, from 3 to 4, and from 4 to 1. The parameters of the proposed controller are listed in Table 2.
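
For completeness, a minimal closed-loop sketch for a single subsystem is shown below. It reuses the hypothetical helpers sketched in Sections 2 and 3 (`step_subsystem`, `desired_control`, `tracking_error`, `utility`, `CriticNN`, `ActionNN`); the dynamics, coupling, target signal, and numerical settings are illustrative and are not the values of Tables 1 and 2.

```python
import numpy as np

# Illustrative subsystem functions (not the equations of system I).
f_i = lambda x: 0.9 * x                      # local dynamics
g_ij = lambda j, xj: 0.05 * xj               # coupling from neighbor j
h_i = lambda x: np.eye(2)                    # invertible input matrix (Assumption 2)

rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(50, 2))   # RBF centers over the error region
critic = CriticNN(centers, width=0.5, lr=0.01)
actor = ActionNN(centers, width=0.5, n_inputs=2, lr=0.05)

x = np.zeros(2)
z_prev = None
for k in range(500):
    x_target = np.array([np.sin(0.05 * k), 0.0])       # illustrative sine target
    e = tracking_error(x, x_target)
    z = e                                               # assumed NN input: the tracking error
    r = utility(e, thresholds=np.array([0.1, 0.1]))
    if z_prev is not None:
        critic.update(z_prev, z, r)                     # (15)-style critic step
        u_star = desired_control(x, x_target, e, {}, f_i, g_ij, h_i)
        actor.update(z, u_star, critic.value(z))        # (23)-style action step
    u = actor.act(z)
    d_i = lambda: 0.01 * rng.standard_normal(2)         # bounded disturbance (Assumption 1)
    x = step_subsystem(x, u, {}, f_i, g_ij, h_i, d_i)
    z_prev = z
```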

The simulation results are shown in Figures 3 and 4. From Figure 3, it is obvious that all of the subsystems converge to the target states with small errors. The curves converge to the target curves after about 125 control actions, which means that the online learning algorithm successfully obtained the desired action network and critic network. From Figure 4, it can be seen that the fluctuation of the control output decreases over time during the online learning process. These results also illustrate the effectiveness of our proposed control scheme.

In order to present the advantage of our suggested control scheme, we also apply conventional reinforcement learning without the supervised learning scheme, in which the updating of the action network depends solely on the backpropagation of the critic network with the objective of minimizing the critic network output [12]. The result is shown in Figure 5. The results explicitly indicate the divergence of the learning algorithm because of the fast-changing target signal; conventional reinforcement learning may therefore need offline learning in advance. The comparison shows that our proposed control scheme is more stable and has a stronger online learning ability than the conventional method.

In practice, the controller usually encounters action delay or communication delay, which is also captured in our model. To illustrate the effectiveness of our proposed control scheme under communication delay, we chose three communication delay values for the simulation. The simulation results are shown in Figures 6–8. These results show that our proposed control scheme remains stable under communication delay; however, the static error increases with the communication delay, so the error obtained with the smallest delay is the smallest and the error obtained with the largest delay is the largest.
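
One simple way to model such a communication delay in a simulation is a fixed-length buffer that releases each neighbor state only after a given number of sampling steps; the class below is an illustrative sketch, not the mechanism used to produce Figures 6–8.

```python
from collections import deque

class DelayedChannel:
    """Fixed-delay channel: a state sent at step k becomes available to the
    receiving controller only at step k + delay (delay in sampling steps)."""

    def __init__(self, delay, initial_state):
        self.buffer = deque([initial_state] * (delay + 1), maxlen=delay + 1)

    def send(self, state):
        self.buffer.append(state)     # newest state enters on the right

    def receive(self):
        return self.buffer[0]         # oldest buffered state = delayed neighbor state
```

In the closed-loop sketch above, `receive()` would supply the neighbor states passed to the controller, which corresponds to the communication latency that enters the Gaussian basis function in (9).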

To further demonstrate the effectiveness of our suggested scheme with multiple control inputs, we choose another system, called system II, for simulation. Its model equations and target signals are specified for the simulation, the system structure is the same as that shown in Figure 2, and the other model parameters are listed in Table 3.

The controller parameters are set as shown in Table 4.

The simulation results are shown in Figure 9. They show that all the subsystem states converge to the target signals within a small number of time steps (about 120). The tracking errors are small, and each state variable converges to its corresponding target signal. This illustrates the effectiveness of our suggested control scheme for multi-control-input systems with a relatively larger dimension compared with the previous simulation.

6. Conclusion

This paper suggests an online reinforcement learning scheme with one-layer neural networks for controlling physically coupled networked systems. It is a distributed learning control scheme: the networked system is divided into many subsystems, and each subsystem is an individual agent with its own controller and reinforcement learning algorithm. The reinforcement learning algorithm consists of the learning of a critic network and an action network. The critic network approximates the strategy utility function, and the action network approximates the defined desired optimal controller. The action network weight update decreases the long-term cost with a supervised learning mechanism by incorporating the desired control error together with the long-term cost function. The effectiveness of the proposed controller is illustrated in the simulation part. The simulation results also indicate that the proposed control scheme improves the tracking performance compared with conventional reinforcement learning, whose only objective is the long-term cost (critic network). In future research, we will investigate the application of the proposed control scheme to large cyberphysical systems such as the smart grid.

Conflicts of Interest

The authors declare that they have no conflicts of interest with regard to the publication of this manuscript.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61703347. This research was also supported by the Fundamental Research Funds for the Central Universities under Grant XDJK2017C071 and the Chongqing Natural Science Foundation under Grant cstc2016jcyjA0428.