Abstract

We study the use of feedforward neural networks (FNN) to develop models of nonlinear dynamical systems from data. Emphasis is placed on predictions at long times, with limited data availability. Inspired by global stability analysis, and the observation of strong correlation between the local error and the maximal singular value of the Jacobian of the ANN, we introduce Jacobian regularization in the loss function. This regularization suppresses the sensitivity of the prediction to the local error and is shown to improve accuracy and robustness. Comparison between the proposed approach and sparse polynomial regression is presented in numerical examples ranging from simple ODE systems to nonlinear PDE systems including vortex shedding behind a cylinder and instability-driven buoyant mixing flow. Furthermore, limitations of feedforward neural networks are highlighted, especially when the training data does not include a low dimensional attractor. Strategies of data augmentation are presented as remedies to address these issues to a certain extent.

1. Introduction

The need to model dynamical behavior from data is pervasive across science and engineering. Applications are found in diverse fields such as in control systems [1], time series modeling [2], and describing the evolution of coherent structures [3]. While data-driven modeling of dynamical systems can be broadly classified as a special case of system identification [4], it is important to note certain distinguishing qualities: the learning process may be performed off-line, physical systems may involve very high dimensions, and the goal may involve the prediction of long-time behavior from limited training data.

Artificial neural networks (ANN) have attracted considerable attention in recent years in domains such as image recognition in computer vision [5, 6] and in control applications [7]. The success of ANNs arises from their ability to effectively learn low-dimensional representations from complex data and to build relationships between features and outputs. Neural networks with a single hidden layer and a nonlinear activation function are guaranteed to be able to approximate any Borel measurable function to any degree of accuracy on a compact domain [8].

The idea of leveraging neural networks to model dynamical systems has been explored since the 1990s. ANNs are prevalent in the system identification and time series modeling community [9–12], where the mapping between inputs and outputs is of prime interest. Billings et al. [13] explored connections between neural networks and the nonlinear autoregressive moving average model (NARMAX) with exogenous inputs. It was shown that neural networks with one hidden layer and sigmoid activation function represent an infinite series consisting of polynomials of the input and state units. Elanayar and Shin [14] proposed the approximation of nonlinear stochastic dynamical systems using radial basis feedforward neural networks. Early work using neural networks to forecast multivariate time series of commodity prices [15] demonstrated its ability to model stochastic systems without knowledge of the underlying governing equations. Tsung and Cottrell [16] proposed learning the dynamics in phase space using a feedforward neural network with time-delayed coordinates.

Paez and Urbina [17–19] modeled a nonlinear hardening oscillator using a neural network-based model combined with dimension reduction using canonical variate analysis (CVA). Smaoui [20–22] pioneered the use of neural networks to predict fluid dynamic systems such as the unstable manifold model for bursting behavior in the 2-D Navier-Stokes and the Kuramoto-Sivashinsky equations. The dimensionality of the original PDE system is reduced by considering a small number of proper orthogonal decomposition (POD) coefficients [23]. Interestingly, similar ideas of using principal component analysis for dimension reduction can be traced back to work in cognitive science by Elman [24]. Elman also showed that knowledge of the intrinsic dimensions of the system can be very helpful in determining the structure of the neural network. However, in the majority of the results [20–22], the neural network model is only evaluated a few time steps from the training set, which might not be a stringent performance test if longer time predictions are of interest.

ANNs have also been applied to chaotic nonlinear systems that are challenging from a data-driven modeling perspective, especially if long time predictions are desired. Instead of minimizing the pointwise prediction error, Bakker et al. [25] satisfied the Diks’ criterion in learning the chaotic attractor. Later, Lin et al. [26] demonstrated that even the simplest feedforward neural network for nonlinear chaotic hydrodynamics can show consistency in the time-averaged characteristics, power spectra, and Lyapunov exponent between the measurements and the model.

A major difficulty in modeling dynamical systems is the issue of memory. It is known that even for a Markovian system, the corresponding reduced-dimensional system could be non-Markovian [27, 28]. In general, there are two main ways of introducing memory effects in neural networks. First, a simple workaround for feedforward neural networks (FNN) is to introduce time delayed states in the inputs [11]. However, the drawback is that this could potentially lead to an unnecessarily large number of parameters [29]. To mitigate this, Bakker [25] considered following Broomhead and King [30] in reducing the dimension of the delay vector using weighted principal component analysis (PCA). The second approach uses output or hidden units as additional feedback. As an example, Elman’s network [29] is a recurrent neural network (RNN) that incorporates memory in a dynamic fashion.

Miyoshi et al. [31] demonstrated that recurrent RBF networks have the ability to reconstruct simple chaotic dynamics. Sato and Nagaya [32] showed that evolutionary algorithms can be used to train recurrent neural networks to capture the Lorenz system. Bailer-Jones et al. [33] used a standard RNN to predict the time derivative in discrete or continuous form for simple dynamical systems; this can be considered an RNN extension to Tsung’s phase space learning [16]. Wang et al. [34] proposed a framework combining POD for dimension reduction and long-short-term memory (LSTM) recurrent neural networks and applied it to a fluid dynamic system.

We limit ourselves to feedforward neural networks, since there are still many unanswered questions about modeling dynamical systems even in this simplest form. It is known that time delayed FNNs closely resemble simple RNNs trained with teacher forcing [35]. Further, RNNs are not easy to train since standard training algorithms (e.g., back propagation through time [36]) are likely to introduce stronger overfitting than FNN due to vanishing gradients [35]. Recently, sparse regression (SINDy) [3, 4] has gained popularity as a tool for data-driven modeling. The idea is to search for a sparse representation of a linear combination of functions selected from a library. In this work, we will compare it with FNN-based models and highlight some differences.

The paper is organized as follows: the problem description is provided in Section 2 and the mathematical formulation of standard and Jacobian-regularized FNNs is presented in Section 3. Results and discussion are presented in Section 4. We first present a comparison with SINDy for simple dynamical systems. Then, we highlight the importance of stabilization to control the global error of the predicted trajectory and the impact of Jacobian regularization. Finally, we apply the model to a nonlinear PDE system in which a low dimensional attractor is not realized, discuss the limitations of black-box modeling of dynamical systems, and propose data augmentation as a remedy. Conclusions are drawn in Section 5.

2. Problem Description

Consider a dynamical system in Euclidean space R^M described by a continuously differentiable flow map Φ^t: R^M → R^M, where t ∈ R. The state x(t) = Φ^t(x(0)) satisfies the composition relation Φ^(t+s) = Φ^t ∘ Φ^s for t, s ∈ R, and Φ^0 is the identity function. Φ^t is the map of the flow described by a vector function F_c as

dx/dt = F_c(x).   (1)

Similarly, one can define a discrete dynamical system induced by the above smooth dynamical system by considering a constant time step ∆t and a state transition map F_d such that

x_(i+1) = F_d(x_i) = Φ^(∆t)(x_i).

Equivalently, one can rewrite the above system as

x_(i+1) = x_i + ∆t F_r(x_i),   (3)

where F_r resembles a first order solution [33] to F_c.

Our goal is to find an approximation f to the dynamics, either (i) in a discrete sense, f ≈ F_d (or F_r), given the data x_1, …, x_N uniformly sampled from a trajectory with a given initial condition, or (ii) in a continuous sense, f ≈ F_c, given the data pairs (x_i, dx_i/dt), i = 1, …, N, where N is the number of data points. It must be mentioned that, as highlighted in the results section, the data does not have to be collected on the same trajectory.

Depending on the way one defines the training and testing sets, two types of problems are considered in the current work:
(1) Prediction of a certain trajectory starting from an initial condition that is different from the training trajectories
(2) Prediction of the future trajectory given the past information of the trajectory as training data

Conservatively speaking, success on the first of the above problems requires the trajectories in the training data to be representative of the distribution in the region of interest, which may or may not be feasible depending on how informative the data is. In the context of modeling dynamical systems, it is often implied in previous literature [22] that the initial condition of the unseen testing data is not far from the training data. The second problem can also be difficult, since it challenges the effectiveness of the model: past information might not be sufficient for the model to be predictive on unseen data. Again, it is often implied in previous works [20, 34, 37] that successful predictions are accompanied by an underlying low dimensional attractor, so that past states collected as training data eventually become representative of the future.

3. Mathematical Modeling Framework

In this section, we first define performance metrics for the approximation f to the dynamics; we then introduce the standard FNN model and the Jacobian-regularized FNN model. Finally, techniques to mitigate overfitting are described.

3.1. Definitions of Error Metrics

To measure the prediction error for each sample in an a priori sense (i.e., given the exact state x_i), we define the local error vector for the i-th sample as

d_i = y_i − f(x_i),

where y_i is the i-th target to learn from the i-th feature x_i. For example, the feature is the state vector x_i at the i-th step, and the target can be the next state x_(i+1) (or equivalently the increment (x_(i+1) − x_i)/∆t) for a discrete dynamical system, or the time derivative dx/dt at x_i for a continuous dynamical system.

Then, we can define the local error at the i-th sample by

e_local,i = ‖d_i‖ = ‖y_i − f(x_i)‖,

where ‖·‖ is the vector 2-norm, i.e., the L2 norm, and |·| is the absolute value.

We can further define the local error of the i-th sample for the j-th component as

e_local,i,j = |y_(i,j) − f_j(x_i)|.

The local error assumes that the i-th input feature x_i is exact. On the other hand, the global error vector is defined by equation (7), in which the prediction x̂_i is obtained by iterative prediction, i.e., a posteriori evaluation, at the i-th step from an initial condition x_0 through either time integration or application of the transition function as a discrete map:

ε_i = x_i − x̂_i,   (7)

where x̂_0 = x_0 and x̂_(i+1) is obtained from x̂_i in a recursive sense, either as x̂_(i+1) = f(x̂_i) for a discrete map or by numerically integrating dx̂/dt = f(x̂) over one time step ∆t in the continuous case.

Similarly, the global error is defined by e_global,i = ‖x_i − x̂_i‖, and for the j-th component specifically by e_global,i,j = |x_(i,j) − x̂_(i,j)|.

Further, to obtain a holistic view of the model performance in feature space, if F_d or F_c is known, we can define the stepwise error at an arbitrary state x, in the discrete or continuous case respectively, as

e_step(x) = ‖F_d(x) − f(x)‖  or  e_step(x) = ‖F_c(x) − f(x)‖.

Note that e_step is not restricted to the training or testing trajectory; it can be evaluated at arbitrary points in the region of interest.

Finally, we consider the uniformly averaged coefficient of determination R² as a scalar metric of regression performance,

R² = (1/M) Σ_j [ 1 − Σ_i (y_(i,j) − f_j(x_i))² / Σ_i (y_(i,j) − ȳ_j)² ],

where the sums over i run over the N_val samples in the validation data, f_j(x_i) is the prediction based on the i-th feature x_i, and ȳ_j is the mean of the j-th target component.
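As a concrete illustration of the metrics above, the following NumPy sketch computes the local error, the global (a posteriori) error from a recursive rollout, and the uniformly averaged R². The function f, the data arrays, and the forward-Euler rollout used for the continuous case are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def local_errors(f, X, Y):
    """Per-sample local error e_local,i = ||y_i - f(x_i)|| (a priori)."""
    D = Y - np.stack([f(x) for x in X])          # local error vectors d_i
    return np.linalg.norm(D, axis=1)

def global_errors(f, X, dt=None):
    """Global error from an iterative (a posteriori) rollout started at X[0].
    If dt is given, f is treated as a continuous-time model dx/dt = f(x) and
    advanced by forward Euler; otherwise f is treated as a discrete map."""
    X_hat = np.empty_like(X)
    X_hat[0] = X[0]
    for i in range(len(X) - 1):
        step = f(X_hat[i])
        X_hat[i + 1] = X_hat[i] + dt * step if dt else step
    return np.linalg.norm(X - X_hat, axis=1)

def r2_uniform_average(f, X, Y):
    """Uniformly averaged coefficient of determination over components."""
    Y_pred = np.stack([f(x) for x in X])
    ss_res = np.sum((Y - Y_pred) ** 2, axis=0)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2, axis=0)
    return np.mean(1.0 - ss_res / ss_tot)
```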

3.2. Feedforward Neural Network Model
3.2.1. Basic Model: Densely Connected Feedforward Neural Network

The basic model approximates F_c in equation (1) for the continuous case and F_r in equation (3) for the discrete case using a feedforward neural network. The existence of an arbitrarily accurate feedforward neural network approximation to any Borel measurable function, given a sufficient number of hidden units, is guaranteed by the universal approximation property [38]. It should be noted that our basic model is related to Tsung's phase-space-learning model [16]. If the Markovian assumption is adopted, the training feature matrix X ∈ R^(N×M) stacks the snapshots x_i row by row, and the training target matrix Y ∈ R^(N×M) stacks the corresponding targets, where M is the dimension of the state, N is the total number of training snapshots, the learning target Y is the time derivative, and the subscript j stands for the index of the component. Note that each component of the features and targets is normalized to zero mean and unit variance for better training performance of the neural network.
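The sketch below shows one way the feature and target matrices can be assembled and standardized for the continuous case, assuming a forward-difference approximation of the time derivative target; the array `traj` and the normalization choices are illustrative.

```python
import numpy as np

def build_dataset(traj, dt):
    """Assemble (N x M) feature and target matrices from a sampled trajectory.

    traj: array of shape (N + 1, M) holding uniformly sampled states.
    The target approximates dx/dt at x_i by a forward difference.
    """
    X = traj[:-1]                               # states x_i
    Y = (traj[1:] - traj[:-1]) / dt             # learning target ~ dx/dt at x_i
    # standardize each component to zero mean and unit variance
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    y_mean, y_std = Y.mean(axis=0), Y.std(axis=0)
    X_n = (X - x_mean) / x_std
    Y_n = (Y - y_mean) / y_std
    return X_n, Y_n, (x_mean, x_std, y_mean, y_std)
```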

Constructing a densely connected feedforward neural network with L − 1 hidden layers and a linear output layer, the following recursive expression is defined for each hidden layer:

h^(l) = σ_l(W_l h^(l−1) + b_l), l = 1, …, L − 1,

where h^(0) = x stands for the input of the neural network, n_l is the number of hidden units in layer l, W_l ∈ R^(n_l × n_(l−1)), b_l ∈ R^(n_l), and σ_l is the activation function of layer l. Note that the output layer is linear, i.e., σ_L is the identity:

f(x) = W_L h^(L−1) + b_L,

where the parameters of the neural network are the weights W_l and biases b_l, l = 1, …, L.

For example, if we consider two hidden layers (L = 3) with the same number of hidden units n_h in each, the full expression for the neural network model is

f(x) = W_3 σ(W_2 σ(W_1 x + b_1) + b_2) + b_3,

where x ∈ R^M is the state of the dynamical system, i.e., the input to the neural network, and f(x) ∈ R^M is the modeling target, i.e., the output of the neural network. σ is a nonlinear activation function applied elementwise. W_1 ∈ R^(n_h × M), W_2 ∈ R^(n_h × n_h), W_3 ∈ R^(M × n_h), b_1, b_2 ∈ R^(n_h), and b_3 ∈ R^M. The sets of weights and biases are {W_1, W_2, W_3} and {b_1, b_2, b_3}. The problem is to find the set of parameters θ = {W_l, b_l} that results in the best approximation of the underlying ground truth (F_c, F_d, or F_r). Under the framework of statistical learning, it is standard to perform empirical risk minimization (ERM) with a mean-squared-error loss. The desired solution can be written as

θ* = argmin_θ (1/|S|) Σ_(i∈S) ‖y_i − f(x_i; θ)‖²,   (18)

where S is the index set of the training data, and x_i and y_i correspond to the i-th feature-target pair.
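The two-hidden-layer expression above can be transcribed directly into code. The sketch below is a plain NumPy version, with Swish chosen as the activation for concreteness (the activation is a hyperparameter in this work, so this is an illustrative choice).

```python
import numpy as np

def swish(z, beta=1.0):
    """Swish activation: sigma(z) = z * sigmoid(beta * z)."""
    return z / (1.0 + np.exp(-beta * z))

def fnn_two_hidden(x, W1, b1, W2, b2, W3, b3):
    """f(x) = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3, with a linear output layer."""
    h1 = swish(W1 @ x + b1)
    h2 = swish(W2 @ h1 + b2)
    return W3 @ h2 + b3

# Shapes for a state of dimension M and n_h hidden units per layer:
# W1: (n_h, M), W2: (n_h, n_h), W3: (M, n_h); b1, b2: (n_h,), b3: (M,)
```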

To deal with the high dimensionality of the optimization problem, we employ Adam [39], a gradient-based algorithm that essentially combines momentum acceleration with per-parameter rescaling of the learning rate. The weights are initialized using a truncated normal distribution to help avoid saturation, and gradients are computed using the automatic differentiation (AD) provided by TensorFlow [40]. The neural network model is implemented in Python using the TensorFlow library [40]. Due to the nonconvex nature of equation (18) and the large number of parameters, one can only afford to find a local minimum. In practice, however, a good local minimum is usually satisfactory [35]. Hyperparameters considered in the current work for the basic model are the number of units in each hidden layer n_h and the activation function σ(·).
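A hedged Keras sketch of the basic model and its training is given below, reflecting the ingredients described above (truncated normal initialization, Adam, minibatch training with a validation split). The layer sizes, learning rate, epoch count, and random placeholder data are illustrative and do not reproduce the exact settings of any table in the paper.

```python
import numpy as np
import tensorflow as tf

M, nh = 2, 8                                        # state dimension, hidden units (placeholders)
init = tf.keras.initializers.TruncatedNormal(stddev=0.1)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(M,)),
    tf.keras.layers.Dense(nh, activation="swish", kernel_initializer=init),
    tf.keras.layers.Dense(nh, activation="swish", kernel_initializer=init),
    tf.keras.layers.Dense(M, kernel_initializer=init),   # linear output layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

# X, Y as built in the earlier dataset sketch; random placeholders here.
X = np.random.randn(399, M).astype("float32")
Y = np.random.randn(399, M).astype("float32")

# Note: Keras holds out the *last* 20% with validation_split; a random split
# could be performed beforehand to match the random split described in the text.
model.fit(X, Y, batch_size=64, epochs=100, validation_split=0.2, verbose=0)
```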

Model selection for neural networks is an active research area [41–43]. Well-known methods involve grid search, random search [41], the Tree of Parzen Estimators (TPE) [42], and Bayesian optimization [43] with cross validation. We pursue the following trial-and-error strategy:
(1) Given the number of training points, compute the number of equations to be satisfied if the network were to overfit all the training data
(2) Pick a neural network with a uniform hidden layer structure that overfits the training data, with the number of parameters in the network no more than 10% to 50% of the number in step 1
(3) Keep reducing the size of the neural network, by decreasing either the number of hidden units or the number of layers, until the training and validation errors are roughly of the same order
(4) For the choice of other hyperparameters, simply perform a grid search

3.2.2. Jacobian Regularized Model

In standard FNNs, minimizing the mean-squared error on the training data only guarantees model performance in terms of the local training error. It does not guarantee reconstruction of even the training trajectory in an a posteriori sense.

Here, we take a closer look at the error propagation in a dynamical system for the FNN model when evaluated in an iterative fashion, i.e., in an a posteriori sense. Without loss of generality, consider the discrete case: after we obtain the model f, we can predict x̂_(i+1) given x̂_i as

x̂_(i+1) = f(x̂_i).

Moreover, given F_d, we can find the exact state x_(i+1) given x_i as

x_(i+1) = F_d(x_i).

Considering a Taylor expansion of f about x̂_i, we have

f(x_i) = f(x̂_i) + J(x̂_i)(x_i − x̂_i) + (1/2)(x_i − x̂_i)ᵀ H (x_i − x̂_i),

where J is the Jacobian of the model and H is the Hessian evaluated at some point between x_i and x̂_i.

Assuming H is bounded, and the higher order terms are negligible compared to the Jacobian term, we have

ε_(i+1) = x_(i+1) − x̂_(i+1) ≈ J(x̂_i) ε_i + d_i,   (23)

where ε_i is the global error and d_i = F_d(x_i) − f(x_i) is the local error.

Similarly, in the continuous case, we have

dε/dt ≈ J(x̂(t)) ε(t) + d(t),   (25)

where d(t) = F_c(x(t)) − f(x(t)).

The right hand sides of equations (23) and (25) contain contributions from the global error and the accumulation of local error. Optimization as in equation (18) can minimize the latter term, but not necessarily the former. This suggests that manipulating the eigenspectrum of the Jacobian might be beneficial for stabilization by suppressing the growth of the error. Due to the simplicity of computing the Frobenius norm compared to the 2-norm, we consider penalizing the Frobenius norm of the Jacobian of the neural network model. In the context of improving the generalization performance of input-output neural network models, similar regularization has also been proposed by Rifai et al. [44]. It should be noted that our purpose is to achieve better error dynamics in a temporal sense, which differs from the generalization goal in deep learning. Thus, one may seek a locally optimal solution that can suppress the growth in global error while minimizing the local error.

The regularized loss function inspired by the above discussion is thus

L(θ) = (1/|S|) Σ_(i∈S) [ ‖y_i − f(x_i; θ)‖² + λ ‖J(x_i; θ)‖_F² ],   (26)

where J is the Jacobian of the neural network output with respect to the input and λ is a hyperparameter. On one hand, it should be noted that regularizing the Frobenius norm of the Jacobian indirectly suppresses the magnitudes of its eigenvalues. On the other hand, excessive weighting on the Jacobian term reduces the weight on the local error, which might result in an undesirably large local error. Thus, λ should be set to a relatively small value that does not strongly impact the model performance in an a priori sense.
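The sketch below shows one way to implement a Jacobian-regularized loss in the spirit of equation (26) in TensorFlow, using nested gradient tapes so that the per-sample Jacobian of the output with respect to the input enters the training objective. The model architecture, λ value, and learning rate are placeholders, not the settings used in the paper.

```python
import tensorflow as tf

M, nh, lam = 2, 20, 5e-5                         # placeholder sizes and lambda
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(M,)),
    tf.keras.layers.Dense(nh, activation="tanh"),
    tf.keras.layers.Dense(nh, activation="tanh"),
    tf.keras.layers.Dense(M),                    # linear output layer
])
opt = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(x, y):
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as outer:             # gradients w.r.t. the weights
        with tf.GradientTape() as inner:         # gradients w.r.t. the inputs
            inner.watch(x)
            y_pred = model(x)
        J = inner.batch_jacobian(y_pred, x)      # per-sample Jacobian df/dx, shape (B, M, M)
        mse = tf.reduce_mean(tf.reduce_sum(tf.square(y - y_pred), axis=-1))
        frob = tf.reduce_mean(tf.reduce_sum(tf.square(J), axis=[1, 2]))
        loss = mse + lam * frob                  # MSE + lambda * ||J||_F^2
    grads = outer.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Calling train_step on each minibatch inside the usual epoch loop recovers the regularized training described here.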

3.3. Reducing Overfitting

Overfitting is a common issue in the training of machine learning models; it arises when models tend to memorize the training data instead of generalizing true functional relations. In neural networks, overfitting can arise from poor local minima and is partially due to the unavoidable nonconvexity of the training problem. Overfitting cannot be completely eliminated for most problems, given the NP-hard nature of the problem. Generally, overfitting can be controlled by three kinds of regularization techniques. The first follows Occam's razor, e.g., L1 sparsity regularization [3]. However, there is no guarantee that Occam's razor is appropriate for all cases, and finding the optimal sparsity level is often iterative. The second is to smooth the function, e.g., using weight decay [35]. The third type is especially suitable for iterative learning, e.g., early stopping, which is a widely used strategy in the deep learning community [35]. In this work, we found validation-based early stopping to be sufficient. We split the data further into pure training and validation sets, and then monitor overfitting by measuring R².
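A minimal sketch of validation-based early stopping with a Keras callback is given below, assuming the Keras training setup sketched earlier; the patience value is an illustrative choice.

```python
import tensorflow as tf

# Stop training when the validation loss stops improving and restore the
# best weights seen so far; the patience value is illustrative.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=500, restore_best_weights=True)

# model.fit(X, Y, validation_split=0.2, epochs=80000,
#           batch_size=64, callbacks=[early_stop])
```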

4. Results and Discussion

Given sequential training data, the capability of the basic FNN is first evaluated in two-dimensional dynamical systems with polynomial nonlinearities in Section 4.1 and nonpolynomial nonrational dynamics in Section 4.2. The basic model is compared with SINDy [3], which is a method that directly aims at learning functional models using L1 sparse regression on a dictionary of candidate basis functions. In Section 4.3, we demonstrate that the basic model performs better than SINDy on the problem of incompressible flow behind a cylinder, in spite of the explicit addition of quadratic terms to the dictionary. In addition, the local error is found to be strongly correlated with the maximal singular value of the Jacobian, thus serving as an inspiration for Jacobian regularization. In Section 4.4, we demonstrate the stabilizing aspect of Jacobian regularization for the problem of laminar wake behind a cylinder, where the system exhibits a low dimensional attractor. In Section 4.5, we assess the ability of our regularized FNN model to approximate a dynamically evolving high-dimensional buoyancy-driven mixing flow system that is a characteristic of flow physics driven by instabilities. The results show that, for systems that do not exhibit a low dimensional attractor, it is difficult for a black-box model to have satisfactory long-time prediction capabilities. In Section 4.6, we show that predictive properties can be improved by data augmentation in the state space of interest.

4.1. 2D Polynomial System: Van der Pol Oscillator

The first-order forward discretized scheme of the Van der Pol (VDP) system is given by

x_(1,i+1) = x_(1,i) + ∆t x_(2,i),
x_(2,i+1) = x_(2,i) + ∆t (μ(1 − x_(1,i)²) x_(2,i) − x_(1,i)),

where ∆t = 0.1 and μ = 2.0. The modeling target is F_r(x_i) = (x_(i+1) − x_i)/∆t.
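A data-generation sketch for the discretized VDP system as written above is given below; the initial condition and trajectory length are illustrative, and the target is formed as F_r(x_i) = (x_(i+1) − x_i)/∆t.

```python
import numpy as np

def vdp_map(x, dt=0.1, mu=2.0):
    """One step of the forward-Euler discretized Van der Pol system."""
    x1, x2 = x
    return np.array([x1 + dt * x2,
                     x2 + dt * (mu * (1.0 - x1 ** 2) * x2 - x1)])

def generate_trajectory(x0, n_steps, dt=0.1, mu=2.0):
    """Iterate the discrete map and form feature/target pairs (x_i, F_r(x_i))."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        traj.append(vdp_map(traj[-1], dt, mu))
    traj = np.array(traj)
    X = traj[:-1]                               # features x_i
    Y = (traj[1:] - traj[:-1]) / dt             # targets F_r(x_i)
    return X, Y

# illustrative initial condition and length (399 training pairs)
X_train, Y_train = generate_trajectory(x0=[1.0, 0.0], n_steps=399)
```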

Our goal is to reproduce the dynamics governed by F_r from data collected from a single trajectory. The loss function of the basic model in equation (18) is optimized using training data from a single trajectory containing 399 data points. Test data containing 599 points is generated from a different initial condition. The data distribution of the training and testing features is shown in Figure 1. Note that, initially, a few test points (orange) lie away from the training data (blue), which requires the model to extrapolate. The configuration of hyperparameters is shown in Table 1. Data is normalized to zero mean and unit standard deviation for each component. We use minibatch training with a batch size of 64 and 80,000 epochs. The basic model consists of two hidden layers, each containing 8 hidden units. Both hidden layers use the Swish nonlinear activation σ(z) = z · sigmoid(βz), where, in practice, β is fixed at unity [45]. The output layer is linear. A random 20% of the training data is used as a validation set, and we monitor the performance on the validation set as a warning of overfitting. In Figure 1, the learning curve suggests that the model is well trained and overfitting is not observed.

Results of a priori and a posteriori prediction are shown in Figure 2. The basic model predicts the Fr at each training point very well a priori, but slight phase lag is observed a posteriori in testing, which originates from the extrapolation of the testing data initially.

The variation of the local and global errors, together with the maximal singular value of the Jacobian, is shown in Figure 3. For training data, e_local is observed to be relatively uniform, as expected, since the MSE objective is optimized uniformly across all training data points. For testing data, e_local exhibits peak values near the beginning of the trajectory, as expected, since the first few points are far away from the training data shown in Figure 1. Moreover, it is interesting to observe in Figure 3 that the peaks of the temporal history of the local/global error show a strong correlation with the maximal singular value of the Jacobian.

Stepwise error contours are displayed in Figure 4. The region of large error close to red (implying that the difference between the stepwise vectors of the neural network prediction and the ground truth is large) is located near the corners of the figure, where there is a dearth of training points. The model performs well near the training points, as expected. In this case, since the testing data is not very far from the training data, good extrapolation performance can be expected. However, we note that there is a moderate amount of error associated with the vector direction in Figure 4, not only at the corners but also near the origin. This implies that a feedforward neural network can generalize to some extent, but with no guarantees, even in regions enclosed by training data. The results also confirm the known result that, for a dynamical system with an attractor, the neural network can reproduce the dynamics near the attractor [13, 25, 33, 37, 46].

With the prior knowledge that the system is polynomial in nature, one can use polynomial basis functions to extract the ground truth. To illustrate this, results obtained from SINDy [3] are shown in Figure 5, with the threshold parameter set to 2 × 10−4, a maximal polynomial order of 3, and no validation data set. As displayed in Figure 6, the excellent result of SINDy illustrates the advantage of identifying global features: since the ground truth is governed by sparse polynomials, the identified parameters are not restricted to the scope of the training data.

4.2. 2D Nonpolynomial System: A Nonrational Nonpolynomial Oscillator

The success of SINDy is a consequence of the fact that the underlying system can be represented as a sparse vector in a predefined basis library, such as one consisting of polynomial or rational functions [4]. Here, we choose a different case: a nonrational nonpolynomial oscillator with ∆t = 0.004.

Here, the basic model in equation (18) is optimized using 1199 data points from a single trajectory. The testing data contains 1799 points. A random 20% of the training data is taken as the validation set but is also included in later evaluations. The feature distribution in phase space is shown in Figure 7. Hyperparameters are listed in Table 2; a minibatch size of 128 and 20,000 epochs are used. The training and validation errors are also shown in Figure 7.

Results for a priori and a posteriori performance on training and testing data are shown in Figure 8. The training trajectory is perfectly reconstructed while the predictions show slight deviation.

The distribution of the local and global error is shown in Figure 9. Again, we observe that maximal local/global error correlates with the peaks of the maximal singular value of the Jacobian. It is interesting to note that the highest local testing error occurs at the peak of the maximal singular value of the Jacobian, instead of at the points close to the initial condition.

The error contour in Figure 10 shows that the stepwise error around the training trajectory is below 0.1. It is important to note that model performance deteriorates at places far away from the training trajectory, especially at the right corner shown in Figure 10.

For SINDy, the polynomial order is set to three and the threshold to 2 × 10−4. A priori and a posteriori validation for training and testing is shown in Figure 11. Correspondingly, the stepwise error contour displayed in Figure 12 shows the misfit over the region of interest, ranging from −1 to 3 in both components. Because the dynamics has no sparse representation in a polynomial basis in this case, it is expected that SINDy cannot reconstruct the dynamics correctly and performs worse than the basic FNN model. The implication is that, for strongly nonpolynomial systems, neural networks are far more flexible than SINDy.

4.3. Nonlinear PDE System: Flow behind a Cylinder

In this section, we compare the basic model with SINDy in reconstructing the flow in a cylinder wake. The data, taken from Brunton et al. [3], comes from an immersed boundary method solution [47] of the 2D incompressible Navier-Stokes equations at Re = 100 based on the cylinder diameter. The computational domain consists of a nonuniform grid with near-wall refinement. The inlet condition is uniform flow, and the outlet is a convective boundary condition that allows the vorticity to exit the domain freely. Testing data is generated as a temporal extension of states that lie on a limit cycle at the boundary of the training data, which indicates that this is not an extrapolation task. To work with such a high-dimensional nonlinear PDE system, we reduce the spatial dimension using the coefficients of two POD modes [23] and one "shift mode", which represents the shift of the short-term averaged flow away from the POD space of the first two harmonic modes. More details on POD and "shift modes" are provided in references [23, 48]. Training and testing data are the same as in Brunton et al. [3], where the first 2999 snapshots in time are used for training and a later 2994 snapshots for testing. A random 10% of the training snapshots is used as a validation set but is also included in later evaluations. The distribution of training and testing data is shown in Figure 13.

Hyperparameters of the basic model are shown in Table 3, with 40,000 epochs. For SINDy, the hyperparameters are the same as in the previous work [3]. As shown in Figure 14, for training data, SINDy reconstructs a smaller growth rate of the oscillating behavior, while the basic model accurately reconstructs both the shift mode and the two POD modes. For testing data, SINDy shows an observable phase lag over the time period considered, while the basic model achieves an almost perfect match. This implies that the model obtained from SINDy, although much easier to interpret than a neural network, is not the most accurate model for this dynamical system. However, we note from the data distribution in Figure 13 that the basic model performs as expected, as the training data covers the attractor well.

4.4. Stabilizing the Neural Network with Jacobian Regularization

Due to the nonconvexity of the optimization problem that arises in the solution of the basic model in equation (18), employing a stochastic gradient-descent type method might lead to a solution corresponding to a poor local minimum, which is often undesirable and difficult to avoid. Most work in the field of deep learning for feedforward neural networks focuses on decreasing the impact of poor local minima to promote generalizability. However, in the context of modeling a dynamical system, as it is often assumed that the trajectory of interest is stable with respect to small disturbances [25], the model should be able to approximately reconstruct the training trajectory in the presence of the local errors that arise at each step. This requires regularizing instabilities that could arise in a posteriori prediction. To enable meaningful comparisons, random number seeds are fixed for the initialization of weights and the shuffling of training data. Nevertheless, we observe that in some cases, for example, the previous case of the cylinder wake, an inappropriate choice of neural network configuration for the basic model, e.g., the number of hidden units and the type of activation function, can lead to instability in a posteriori evaluation. Such instabilities may materialize even while reconstructing the training trajectory, even though the corresponding a priori prediction is almost perfect. Previous work [16] explicitly ensured stability by simply adding more adjacent trajectories. Here, we take a different approach by adding a Jacobian regularization term to the cost function, as in equation (26).

In our numerical experiments with a fixed random seed, it is observed that, when the layer structure is 2-20-20-2 with tanh as the activation function instead of elu, the basic model becomes numerically unstable after 2000 steps for the training data, as displayed in Figure 15. Similar numerical instability is also observed in the testing evaluations. However, for the same fixed random seed, the regularized model with λ = 5 × 10−5 is numerically stable with the same neural network configuration for both training and testing data.

The effectiveness of Jacobian regularization may be attributed to finding a balance between lowering the prediction error, i.e., MSE, and suppressing the sensitivity of the prediction of the future state to the current local error. As shown in Figures 16 and 17, on average, the maximal eigenvalue of the Jacobian is smaller for the regularized model than for the basic model. Furthermore, the distribution of the eigenvalues of the Jacobian is shown in Figure 18 in the form of a linear stability diagram with explicit 5th order Runge-Kutta time integration. It is clear that the model with Jacobian regularization has significantly smaller positive real eigenvalues. Note that, due to the Frobenius norm, negative real eigenvalues are also decreased in magnitude.
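The quantities used in this comparison, the per-sample maximal singular value of the Jacobian and the eigenvalues scaled by ∆t for the linear stability diagram, can be extracted from a trained model as sketched below, assuming a Keras model acting on normalized states; function names are illustrative.

```python
import numpy as np
import tensorflow as tf

def batch_jacobians(model, X):
    """Per-sample Jacobians df/dx of a trained Keras model, shape (N, M, M)."""
    x = tf.convert_to_tensor(X, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    return tape.batch_jacobian(y, x).numpy()

def max_singular_values(model, X):
    """sigma_max of J(x_i), the quantity correlated with the local error."""
    return np.array([np.linalg.norm(J, ord=2) for J in batch_jacobians(model, X)])

def scaled_eigenvalues(model, X, dt):
    """Eigenvalues of dt * J(x_i), for plotting against a Runge-Kutta stability region."""
    return np.concatenate([np.linalg.eigvals(dt * J) for J in batch_jacobians(model, X)])
```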

4.5. Nonlinear PDE System: Instability-Driven Buoyant Mixing Flow

The test problems thus far have served to assess the performance of the basic and Jacobian-regularized models on nonlinear dynamical systems that evolve on or towards an attractor. Such systems, even if high-dimensional, are amenable to projection onto a lower dimensional subspace using, for instance, POD techniques. In this section, we consider the Boussinesq buoyant mixing flow [49, 50], also known as the unsteady lock-exchange problem [51], which exhibits strong shear and Kelvin-Helmholtz instability phenomena driven by the temperature gradient. Compared to the cylinder flow, which evolves on a low-dimensional attractor approaching a limit cycle, the Boussinesq flow is highly convective and instability driven. Consequently, the system state cannot be represented by a compact set of POD modes extracted from the spatiotemporal field of nondimensionalized velocity and temperature. Rather, the low-dimensional manifold itself evolves with time. Further, any noise in the initial data can produce unexpected deviations that make such systems challenging to model, even using equation-driven reduced order models such as POD-Galerkin [51].

The data set is generated by solving the dimensionless form of the two-dimensional incompressible Boussinesq equations [51], given in equation (30), on a rectangular domain with 0 < x < 8, where u, v, and θ are the horizontal velocity, vertical velocity, and temperature, respectively. The dimensionless parameters Re, Ri, and Pr are the Reynolds number, Richardson number, and Prandtl number, respectively, with Pr = 1.0. These equations are discretized on a 256 × 33 grid. Initially, fluids at two different temperatures are separated by a vertical interface. The bounding walls are treated as adiabatic with the no-slip condition. A fourth-order compact finite difference scheme is used to compute the derivatives in equation (30). The evolution of the thermal field over the simulation time interval of 32 seconds is shown in Figure 19 and illustrates the highly transient nature of the dynamics. To reduce the dimensionality of the system, POD modes are extracted from the entire data set consisting of 1600 snapshots. The reduced feature set, consisting of ten POD coefficients that capture nearly 97% of the total energy, is used to train the model and predict the trajectory.
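For reference, a generic sketch of extracting POD coefficients from a snapshot matrix via the SVD is given below; the snapshot layout, mean subtraction, and energy criterion are common choices and not necessarily the exact procedure used to produce the ten-mode feature set.

```python
import numpy as np

def pod_coefficients(snapshots, n_modes=10):
    """POD of a snapshot matrix (n_snapshots x n_dof) via the SVD.

    Returns the temporal coefficients (used as features), the spatial modes,
    and the fraction of energy captured by the retained modes.
    """
    mean = snapshots.mean(axis=0)
    fluct = snapshots - mean                    # subtract the temporal mean field
    U, S, Vt = np.linalg.svd(fluct, full_matrices=False)
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    coeffs = U[:, :n_modes] * S[:n_modes]       # POD weights per snapshot
    return coeffs, Vt[:n_modes], energy[n_modes - 1]
```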

For the setup of training and testing, future state prediction is pursued with the first 70% of the states of the trajectory treated as training data and the rest as testing data. For such a system in 10 dimensions, the problem of a posteriori instability in the basic model becomes more pronounced and difficult to avoid. Numerical instability was observed even for reconstruction over a wide range of network configurations; thus, results from the basic model are not reported.

The Jacobian-regularized model is employed with the hyperparameters shown in Table 4, with Figure 20 showing the a posteriori evaluation on training data. The reconstruction is successful, but the performance deteriorates on testing data because the trajectory of the system does not exhibit a low dimensional attractor as in the cylinder case. Therefore, the training data is not informative for predictions on the test set. For a black-box machine learning model, this phenomenon can be expected to be more pronounced in high dimensional spaces due to data scarcity. We discuss this problem in the following section.

4.6. Improving Model Predictability by Data Augmentation

In this section, we consider two scenarios of data augmentation: (i) augmenting the information in the data by spreading the training locations randomly following a uniform distribution, provided that one has access to F_d or F_c at any desired location; (ii) augmenting the data by assembling several trajectories generated from different initial conditions.

4.6.1. Random Uniform Sampling in Phase Space

Recall that, in the two-dimensional problems in Section 4.1 and Section 4.2, the stepwise error contours show that the local error increases for testing scenarios located far away from the training data, which was highly concentrated in a compact region of phase space. Without any knowledge of the system behavior, it is sensible to start with training data drawn from a uniform random distribution over a compact region of phase space corresponding to interesting dynamics. To conduct a thorough stepwise error contour evaluation of the training target in phase space, the VDP system is chosen to illustrate this idea.

Determining the most informative data samples would potentially involve specific knowledge of the underlying system and the models used and is beyond the scope of the current work. Here, we simply consider uniform random sampling in the phase space in a finite domain: [−3,3] for the first component and [−5,5] for the second component. We obtained a new set of 399 training data points using random uniform sampling in phase space while retaining the same testing data as in Section 4.1.
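A sketch of this uniform random sampling strategy is given below, assuming access to the discrete VDP map so that targets can be evaluated at arbitrary sampled states; the sampling box and sample count follow the description above, while the random seed is illustrative.

```python
import numpy as np

def vdp_map(x, dt=0.1, mu=2.0):
    """Forward-Euler discretized Van der Pol map, as in Section 4.1."""
    x1, x2 = x
    return np.array([x1 + dt * x2, x2 + dt * (mu * (1.0 - x1 ** 2) * x2 - x1)])

rng = np.random.default_rng(0)
n_samples = 399
# uniform sampling in the phase-space box [-3, 3] x [-5, 5]
X_train = np.column_stack([rng.uniform(-3.0, 3.0, n_samples),
                           rng.uniform(-5.0, 5.0, n_samples)])
# targets F_r(x) = (F_d(x) - x) / dt evaluated at the sampled locations
Y_train = np.array([(vdp_map(x) - x) / 0.1 for x in X_train])
```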

The performance of the basic model, with the same hyperparameter settings as in Section 4.1, on the randomly distributed training data is shown in Figure 21. While the number of data points has not changed, the stepwise error contour of the resulting model decreases significantly compared to training with the same number of data points from a single trajectory, indicating improved generalizability with the same amount of training data.

4.6.2. Training with Multiple Trajectories with Random Initialization

Training data can also be augmented with multiple trajectories started from different initial conditions. Here, we take the one-dimensional viscous Burgers equation,

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²,   (31)

as an example.

The initial conditions are generated following a specific energy spectrum [28, 52], shown in equation (32), on a periodic domain discretized using 2048 uniformly distributed grid points, where, for each wavenumber, a random number drawn from a uniform distribution enters the definition of the corresponding coefficient. Multiple trajectories are generated using different random number seeds to obtain trajectories of the full-order system. To fully resolve the system as a DNS, equation (31) is solved using a standard pseudospectral method with SSP-RK3 [53] for time stepping. A discrete cosine transform (DCT) is used to reduce the full system to the first 4 cosine modes, which preserve around 97% of the kinetic energy. For simplicity, we seek a closed Markovian reduced-order system, whereas the underlying dynamics is clearly non-Markovian [28, 54].
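A sketch of the DCT-based dimension reduction is given below, assuming snapshots stored row-wise; the DCT type and normalization are illustrative choices, since the text does not specify them.

```python
import numpy as np
from scipy.fft import dct, idct

def reduce_with_dct(snapshots, n_modes=4):
    """Project snapshots (n_t x n_x) onto the first few cosine modes.

    Returns the low-dimensional features and the fraction of energy retained.
    """
    coeffs = dct(snapshots, type=2, norm="ortho", axis=1)
    reduced = coeffs[:, :n_modes]               # low-dimensional features
    energy = np.sum(reduced ** 2) / np.sum(coeffs ** 2)
    return reduced, energy

def reconstruct_from_dct(reduced, n_x):
    """Map reduced coefficients back to physical space for visualization."""
    full = np.zeros((reduced.shape[0], n_x))
    full[:, :reduced.shape[1]] = reduced
    return idct(full, type=2, norm="ortho", axis=1)
```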

Since the first component of the DCT is constant, the remaining components of the feature space are shown in Figure 22. The training data is far away from the testing data initially, whereas the data converges at a later stage. This is because of the presence of a spiral fixed-point attractor resulting from the viscous dissipative nature of the system. Therefore, if the model is trained on only a single trajectory, it will be very difficult for it to generalize well in phase space, especially where the state of the system is not near the attractor.

Many dynamical systems in nature exhibit attractors in the asymptotic sense. From the viewpoint of data-driven modeling of such dynamics, data scarcity is encountered at the start of a trajectory, where the number of trajectories required to provide enough information to cover the region of interest grows exponentially with the dimension of the system. Much research applying neural network-based models to dynamical systems [16, 20, 34] considers problems starting on limit cycles or chaotic attractors in a low-dimensional feature space, where the issue of initial data scarcity is not significant or can be easily alleviated by a small increase in available data. However, for the purpose of modeling phenomena such as turbulent fluid flow, which can be high dimensional even after dimension reduction, the model would likely fail in long-time prediction due to data scarcity. Such a situation may be realized in regions of phase space where the state has not yet arrived at the low-dimensional attractor. Therefore, the training data might not be representative of the testing data, which violates the fundamental assumption of a well-posed machine learning problem [55]. Moreover, data scarcity shrinks the region of generalizability of the model as the dimension of the system increases.

A key benefit of using a neural network model is its linear growth in complexity with dimension of the system, in contrast to traditional polynomial regression methods [35]. However, initial data scarcity would limit the generalizability of an ANN in modeling a high dimensional dynamical system that does not exhibit a low dimensional attractor. We believe that this phenomenon of data scarcity observed from this simple nonlinear PDE example also applies to other nonlinear dynamical systems.

To alleviate the initial data scarcity issue, one solution is to augment the training data with additional trajectories generated from different random number seeds for the initial condition, while keeping the energy spectrum the same across all cases. In this case, we choose 18 such trajectories. Each trajectory contains 1000 snapshots equally spaced in time. For testing data, we simply consider one DNS result with an initial condition different from all training trajectories. The corresponding training and testing trajectories are visualized in phase space in Figure 22.
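A sketch of assembling training data from multiple trajectories is given below; `reduced_trajectory` is a hypothetical placeholder standing in for the DNS-plus-DCT pipeline described above, and the time step is illustrative.

```python
import numpy as np

def reduced_trajectory(seed):
    """Hypothetical helper: the reduced (n_snapshots x 4) trajectory for one
    random initial condition; placeholder data stands in for the DNS + DCT pipeline."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((1000, 4))

dt = 0.01                                       # illustrative sampling interval
X_list, Y_list = [], []
for seed in range(18):                          # 18 training trajectories
    traj = reduced_trajectory(seed)
    X_list.append(traj[:-1])                    # features x_i
    Y_list.append((traj[1:] - traj[:-1]) / dt)  # finite-difference targets
X_train = np.vstack(X_list)
Y_train = np.vstack(Y_list)
```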

The basic model is trained with the hyperparameters in Table 5 and 1000 epochs. The resulting learning curve and a posteriori evaluation are shown in Figure 23. A relatively large discrepancy is observed near the initial condition, as the initial data scarcity is not completely eliminated owing to the limited number of additional trajectories. Increasing the number of additional trajectories may be unaffordable for very high dimensional systems. Moreover, the results also show that the error decreases once the trajectory falls onto the fixed-point attractor. Thus, if the model starts on the low dimensional attractor, where the information is well represented in the training data, better performance can be expected. This hypothesis is consistent with previous work [34], where successful prediction of future states starts at the time when the states converge to a low dimensional attractor.

5. Conclusions

This work investigated the modeling of dynamical systems using feedforward neural networks (FNN), with a focus on long-time prediction. It was shown that neural networks have advantages over sparse polynomial regression in terms of adaptability, but with a trade-off in training cost and difficulty in extrapolation, which is a natural barrier for almost all supervised learning. From the perspective of global error analysis, and the observation of a strong correlation between the local error and the maximal singular value of the Jacobian, we proposed the suppression of the Frobenius norm of the Jacobian as regularization. This showed promise in improving the robustness of the basic FNN model given limited data, a nonideal architecture, or an otherwise unstable model. The effectiveness of Jacobian regularization is attributed to finding a balance between lowering the prediction error and suppressing the sensitivity of the prediction of the future state to the current local error. In terms of modeling dynamical systems that do not involve low-dimensional attractors, limitations of FNNs, and perhaps of all local ML methods, were demonstrated in a buoyant mixing flow. Challenges were also noted in the example of the reduced-order viscous Burgers system, where significant initial data scarcity is present. Augmenting the data, either by altering the distribution of training data in phase space or by simply adding multiple trajectories from different initial conditions, improved the performance of the FNN model to some extent. However, these remedies require a significant amount of additional sampling in phase space, especially for high dimensional systems during the period of time without an apparent low-dimensional attractor, where data scarcity suffers from the curse of dimensionality.

Data Availability

Most of the data used in the current study are synthetic data which can be easily generated through the configurations described. In addition, the cylinder flow data is directly downloadable from the public repository of SINDy [1]: faculty.washington.edu/sbrunton/sparsedynamics.zip. The instability-driven buoyant mixing flow data were supplied by Professor Balaji Jayaraman under license and so cannot be made freely available. Requests for access to these data should be made to [email protected]. Furthermore, all of the data used to support the findings of this study are available from the corresponding author upon request. Ref: [1] Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz, "Discovering governing equations from data by sparse identification of nonlinear dynamical systems," Proceedings of the National Academy of Sciences 113.15 (2016): 3932–3937.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge Professor Balaji Jayaraman for his helpful discussions and comments and for providing the data for the two-dimensional instability-driven buoyant mixing flow. This work was supported by AFOSR and AFRL under grants FA9550-16-1-0309 and FA9550-17-1-0195.