Building Recurrent Neural Networks to Implement Multiple Attractor Dynamics Using the Gradient Descent Method
The present paper proposes a recurrent neural network model and learning algorithm that can acquire the ability to generate desired multiple sequences. The network model is a dynamical system in which the transition function is a contraction mapping, and the learning algorithm is based on the gradient descent method. We show a numerical simulation in which a recurrent neural network obtains a multiple periodic attractor consisting of five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. The present analysis clarifies that the model contains many stable regions as attractors, and multiple time series can be embedded into these regions by using the present learning method.
Recurrent neural networks (RNNs) have been successfully applied to the modeling of various types of dynamical systems. Since the universal approximation ability of multilayer neural networks has been proved, RNNs can model arbitrary dynamical systems and Turing machines [1–3]. However, training an RNN to realize a desired model may be very difficult even if such an RNN exists. For example, building RNNs to implement required multiple attractor dynamics is a difficult problem for standard training, such as the gradient descent method. Doya and Yoshizawa demonstrated that RNNs can acquire two limit cycles by the gradient descent method using initialization with small connection weights, whereas learning more than three limit cycles is difficult. This is because learning several time series causes conflicts in the changes of the connection weights. How to form RNN models that can learn several temporal sequence patterns has proved to be a challenging problem.
There have been several approaches to this problem. In order to avoid conflicts in the change of parameters, the mixture-of-experts-type architecture has been investigated [7, 8]. The mixture-of-experts model consists of RNNs as experts and a hierarchical gating mechanism. At the end of successful learning, each expert implements attractor dynamics as locally represented knowledge, and a gating mechanism chooses only one expert at any time. The system can acquire many attractor patterns, although it has the disadvantage that it cannot generalize across the attractor patterns. As another approach to implementing multiple patterns, the parametric bias (PB) method has been developed to improve the learning capability of RNNs [9, 10]. In an RNN that employs the PB method (RNNPB), PB values provide the information needed to individualize each sequence. It has been reported that the number of time series that RNNPBs can learn is greater than the number that RNNs without PB can learn. However, the PB method cannot avoid the conflict caused by the learning of each attractor. Therefore, learning multiple time series by an RNNPB tends to fail when the number of time series increases.
In the present study, we focus on the training method for RNNs to learn multiple attractor dynamics. Furthermore, we show that the present research is related to research into RNNs with contraction transition functions. In recent years, RNNs with contraction transition mapping have been investigated with respect to the performance of time series learning [11–13], generalization ability, and memory capacity. Jaeger [11, 12] demonstrated that an “echo state network,” which is an RNN with contraction mapping, successfully learns the Mackey-Glass chaotic time series, a well-known benchmark system for time series prediction. In order to formally express the generalization ability, Hammer and Tiňo proved that RNNs with contraction are distribution-independent learnable in the probably approximately correct (PAC) sense. From the above results, RNNs with contraction might be regarded as powerful tools for modeling dynamical systems. However, RNNs with contraction have difficulty in representing multiple attractor dynamics because dynamic states governed by the contraction transition function are globally attracted to one point. In this paper, the representation capability of RNNs with contraction mapping is improved such that the RNNs can obtain multiple attractor dynamics.
We start by defining the concepts of the RNN and the training method for multiple attractor dynamics. The RNN has the Elman net-type architecture, and the training method for RNNs is based on the backpropagation through time (BPTT) algorithm. We then show in numerical simulation that the RNNs can acquire multiple periodic attractors constituted by five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. Moreover, we consider why the RNNs successfully learn multiple attractors and how learnability depends on the parameters of the RNNs. Finally, we link the results obtained herein to other learning strategies, and consider other advanced research topics.
2.1. Recurrent Neural Network
We first consider a neural network model with recurrent connection, such as the Elman net  (see Figure 1). The RNN contains I/O units, orthogonal units, and internal units. We denote the dynamic states of I/O units, orthogonal units, and internal units at time step by , and , respectively. The RNN is defined by functions and with a parameter , where and are of the forms where , and are matrices, and are vectors, is a time constant that satisfies , and denotes a componentwise application such as .
Dynamic states of the RNN at time step are updated according to From these equations, the RNN can be represented by an -dimensional dynamical system.
We now define bistability for the RNN.
Definition 1. Assume is as above. The function is bistable with respect to the third variable if a real value and an integer exist such that for every element of the matrix .
The bistability of a function is a key concept of our learning method. We will show in Section 4.1 that the bistable function plays an important role in the learning of multiple attractor dynamics.
2.2. Learning Method
We present a formulation of the training procedure for the RNN with a multiple teacher I/O time series. For every and , we assume that is a sequence of teacher I/O of length .
Initialization of Parameters
We initialize every element of matrices , , and and vectors and randomly from the uniform distribution in the interval . A matrix is randomly assigned such that is bistable. For all is randomly initialized in the interval .
Assume that is an -tuple of vectors for and , and that the dimension of is equivalent to that of if . We initialize such that
Run Network with Teacher I/O and Compute Error Function
For every , the sequence of I/O units of the RNN at learning step is defined by The error function of the RNN at learning step with the th teacher I/O time series is defined by where denotes the mean square error function. Finally, the error function at learning step is defined by
Let be a parameter of the RNN at learning step . We determine the parameter by where and are the constants of the learning rate and momentum, respectively. On the other hand, a connection matrix is not changed as in order to hold the bistability condition. We compute the initial state of the internal units at learning step such that where , and is the constant of the learning rate of the initial state. Assume that is a vector as a component of the orthogonal units , such as . The vector is defined by where , and is the constant of the learning rate of the orthogonal units.
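The parameter update with a learning rate and a momentum term can be sketched as follows. This is a minimal illustration, assuming the common form in which the new change equals the momentum constant times the previous change minus the learning rate times the gradient; the toy quadratic error, the function names, and the values of `eta` and `alpha` are hypothetical and are not taken from the paper, where the gradient would come from BPTT.

```python
def momentum_step(theta, velocity, grad, eta=0.1, alpha=0.9):
    """One gradient-descent-with-momentum update on a list of parameters.

    Assumed form: delta(n+1) = alpha * delta(n) - eta * dE/dtheta.
    """
    new_velocity = [alpha * v - eta * g for v, g in zip(velocity, grad)]
    new_theta = [t + v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


def toy_grad(theta):
    # Gradient of the toy error E(theta) = sum(theta_i ** 2); a stand-in for BPTT.
    return [2.0 * t for t in theta]


def train(theta, steps=200, eta=0.1, alpha=0.9):
    """Repeat the momentum update, starting from a zero parameter change."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        theta, velocity = momentum_step(theta, velocity, toy_grad(theta), eta, alpha)
    return theta


final_theta = train([1.0, -2.0])
```

A matrix that must stay fixed to preserve bistability would simply be left out of the list passed to `momentum_step`.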
Note that the maximum value of the error function depends on the number of units and the length of the teacher I/O time series. Thus, we should scale the learning rates , and with the number of units and length of sequences. In the present paper, we consider parameters , and such that , and .
3. Numerical Experiments
In this section, we conduct two types of experiments as examples of using the training method for RNNs proposed in Section 2. The first experiment shows the learning of five Lissajous curves. The second experiment shows the training of multiple attractors of a Van der Pol oscillator with 12 different parameters.
3.1. Experiment 1: Lissajous Curves
3.1.1. Teacher I/O Time Series
Our first task is to learn the five Lissajous curves defined by and we consider constants and for all (see Figure 2).
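As a rough illustration of how such teacher sequences could be produced, the sketch below samples generic Lissajous curves of the form x = cx + A sin(a w t + delta), y = cy + A sin(b w t). The frequency ratios, phases, amplitude, period, and centers are hypothetical placeholders; they are not the constants used for the five curves in this experiment.

```python
import math


def lissajous(a, b, delta, length, period=50.0, cx=0.0, cy=0.0, amp=0.4):
    """Sample a Lissajous curve at integer time steps 0 .. length - 1."""
    w = 2.0 * math.pi / period  # base angular frequency per time step
    return [
        (cx + amp * math.sin(a * w * t + delta), cy + amp * math.sin(b * w * t))
        for t in range(length)
    ]


# Hypothetical (a, b, delta) triples; not the paper's five curves.
settings = [(1, 1, math.pi / 2), (1, 2, 0.0), (2, 3, math.pi / 4)]
curves = [lissajous(a, b, d, length=100) for a, b, d in settings]
```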
3.1.2. Learning and Testing
We now describe the specific conditions applied to RNN training. The time constant is set to . The number of orthogonal units is , and the dimension of a vector is for all . Suppose that is bistable with , and . The learning rates and momentum are given by , and , respectively.
Figure 3 shows the error function for 20 000 learning steps. We also show the Kullback-Leibler divergence between the teacher I/O time series and a sequence of I/O units in the RNN computed by (3), which does not use external perturbation by the teaching sequences. We use the Kullback-Leibler divergence as a measure of the discrepancy between two sequences. Formally, the Kullback-Leibler divergence between two probability distributions $P$ and $Q$ is defined as
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$
By definition, in order to compute the Kullback-Leibler divergence, it is necessary to obtain probability distributions of the teacher I/O time series and a sequence of I/O units. However, obtaining the probability distribution of a sequence of I/O units is very difficult. Therefore, we quantize a time series of real-valued vectors into a symbolic sequence such that if the real value is less than , then the symbol is assigned, and otherwise the symbol is assigned. In addition, we use the probability distribution with which sub-blocks with a block length of appear in the symbolic sequence given by the above quantization.
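The quantize-then-compare procedure described above can be sketched as follows. This is an illustrative version only: it handles a scalar series rather than real-valued vectors, and the threshold `0.0`, the symbols `0` and `1`, the block length `3`, and the `eps` guard for blocks that never occur in the second sequence are all assumptions, not values from the paper.

```python
import math
from collections import Counter


def quantize(series, threshold=0.0):
    """Map each real value to symbol 0 (below threshold) or 1 (otherwise)."""
    return [0 if x < threshold else 1 for x in series]


def block_distribution(symbols, block_len):
    """Empirical distribution of length-block_len sub-blocks of a symbol sequence."""
    blocks = [tuple(symbols[i:i + block_len])
              for i in range(len(symbols) - block_len + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return {b: c / total for b, c in counts.items()}


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); eps guards blocks absent from Q."""
    return sum(px * math.log(px / max(q.get(x, 0.0), eps)) for x, px in p.items())


# Toy stand-ins for a teacher sequence and a generated sequence.
teacher = quantize([math.sin(0.3 * t) for t in range(200)])
generated = quantize([math.sin(0.3 * t + 0.2) for t in range(200)])
d = kl_divergence(block_distribution(teacher, 3), block_distribution(generated, 3))
```

Comparing block distributions rather than raw values makes the measure insensitive to small amplitude errors while still penalizing sequences whose symbolic patterns differ.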
Figure 4 shows attractors of the trained RNN computed by (3), where the initial state of the internal units is for each . By comparing the attractors with the teacher I/O time series displayed in Figure 2, we can see that the RNN can generate sequences similar to the training data.
In Figure 5, examples of attractors for the RNN with random initial states are displayed. This shows that, in addition to the attractors corresponding to teacher I/O time series, there exist many attractors of the RNN.
3.2. Experiment 2: Van der Pol Attractors
3.2.1. Teacher I/O Time Series
Our second task is to learn multiple attractors given by the Van der Pol oscillator with different parameters. The Van der Pol oscillator defined by is a model of an electronic circuit that appeared in very early radios. It is well known that there exists a limit cycle for the Van der Pol oscillator. In this experiment, we consider twelve teacher I/O time series, where the th teacher I/O time series is given by for and , where and are constant parameters representing the center position of the limit cycle, and is a time constant of the oscillator. We assume that the parameters , and are given by combining the values of , , and . Figure 6 shows the teacher I/O time series given by (16). The length of training data is for .
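A teacher sequence of this kind could be generated along the following lines. The sketch assumes the textbook form of the Van der Pol oscillator, dx/dt = y, dy/dt = mu (1 - x^2) y - x, integrated with a fourth-order Runge-Kutta step; the values of `mu`, `dt`, `center`, `scale`, and `warmup` are hypothetical and do not correspond to the twelve parameter sets of this experiment.

```python
def van_der_pol(state, mu=1.0):
    """Right-hand side of the Van der Pol oscillator in first-order form."""
    x, y = state
    return (y, mu * (1.0 - x * x) * y - x)


def rk4_step(f, state, dt):
    """Advance state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def teacher_series(length, dt=0.05, center=(0.0, 0.0), scale=0.2, warmup=2000):
    """Sample the limit cycle after a warm-up, then shift and scale it."""
    state = (0.1, 0.0)
    for _ in range(warmup):  # discard the transient approach to the limit cycle
        state = rk4_step(van_der_pol, state, dt)
    series = []
    for _ in range(length):
        state = rk4_step(van_der_pol, state, dt)
        series.append((center[0] + scale * state[0], center[1] + scale * state[1]))
    return series


raw_cycle = teacher_series(400, scale=1.0)
shifted_cycle = teacher_series(400, center=(0.5, 0.5))
```

Shifting the center and rescaling time are the two knobs the text describes for producing distinct teacher sequences from the same oscillator; only the center shift is shown here.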
3.2.2. Learning and Testing
The parameters for learning are set as follows. Let be bistable with and . The dimension of the vector is for every so that . Other parameters are the same as in experiment 1.
The error function and the Kullback-Leibler divergence for 200 000 learning steps are displayed in Figure 7. Figure 8 shows attractors of the trained RNN, for which the initial state of the internal units is set to for every .
This result allows us to consider that the RNN acquires multiple periodic attractors constituted by the teacher I/O time series.
4. Numerical Analysis
4.1. Contraction and Bistability
Assume that and are sets and that is equipped with a metric structure. A function is a contraction with respect to if a real value exists such that the inequality holds for all and .
Lemma 2. Consider a dynamical system on defined by the transition function , where and are defined in (1) and (2), respectively. Assume that each element of the matrix satisfies (4), and is the maximum absolute value of elements in , , If there exist three solutions of then
(1) there are invariant sets of a dynamical system ;
(2) suppose that is an invariant set of the dynamical system; then, the restriction of to is a contraction with respect to and the maximum norm , where .
Proof. We suppose that (18) has three solutions, such as (see Figure 9). In general, and .
(1) Assume and . Then, the expression is satisfied for all , and , where is the th element of the vector . Hence, if . Furthermore, if , then because Therefore, the region is a stable set of the th element of vector satisfying the fact that if then for any .
Similarly, we can easily show that if , then . Thus, there are two stable regions of the th element of vector for each . Then, there are invariant sets.
(2) Let be the invariant set presented above. Assume that .
For any , the inequality holds because if , then .
On the other hand, for every ,
Then, is obtained for any . Accordingly, the restriction of to is a contraction with respect to and the maximum norm .
For any and , there is a real number such that if , then (18) has three solutions. Thus, if is large enough and matrices and represent small connection weights, then contains invariant sets, and each restriction of to an invariant set is a contraction with respect to a third input. Moreover, the integer is the effective degree of freedom for each contraction mapping restricted to an invariant set. If is a large value, then RNN can acquire a more complex time sequence. In Figures 10 and 11, we plot the Kullback-Leibler divergence of the trained RNN for parameters and , in which the training data are the same as those for experiment 1. These results imply that it is necessary that , and be large values in order to learn multiple attractor dynamics.
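The role of a large self-connection in producing two stable regions per unit can be illustrated on a single scalar unit. Since equation (18) is not reproduced here, the sketch below uses a stand-in: it assumes a tanh activation and the fixed-point equation u = w tanh(u) + c, where `w` is a hypothetical self-connection weight and `c` a hypothetical bounded input from the other units, together with a leaky update with time constant `tau`. None of these names or values come from the paper.

```python
import math


def count_fixed_points(w, c, lo=-10.0, hi=10.0, n=20001):
    """Count roots of g(u) = w * tanh(u) + c - u in [lo, hi] via sign changes on a grid."""
    def g(u):
        return w * math.tanh(u) + c - u

    count = 0
    prev = g(lo)
    for i in range(1, n):
        u = lo + (hi - lo) * i / (n - 1)
        cur = g(u)
        # A root lies exactly on the previous grid point or between the two points.
        if prev == 0.0 or prev * cur < 0.0:
            count += 1
        prev = cur
    return count


def settle(u0, w, c, tau=5.0, steps=500):
    """Iterate the assumed leaky update u <- (1 - 1/tau) u + (1/tau)(w tanh(u) + c)."""
    u = u0
    for _ in range(steps):
        u = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (w * math.tanh(u) + c)
    return u


strong = count_fixed_points(w=3.0, c=0.2)   # large self-connection, small input
weak = count_fixed_points(w=0.5, c=0.2)     # small self-connection
upper = settle(1.0, w=3.0, c=0.2)
lower = settle(-1.0, w=3.0, c=0.2)
```

In this toy setting a large `w` with a small `c` gives three solutions, the outer two of which attract nearby states, whereas a small `w` leaves a single solution. This mirrors the qualitative statement above that a sufficiently large self-connection combined with small remaining weights yields separate stable regions.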
In the last paragraph of the previous section, we have shown that RNNs have many stable regions, and the existence of the stable regions plays an important role in the learning of multiple sequences. However, the existence of multiple stable regions is not sufficient for success in the multiple attractor learning because if the change of parameters corresponding to each time series influences other changes, each time series cannot necessarily be embedded into each region. Similarly, this problem appears in the method of RNNPB.
In the training algorithm defined in Section 2, each state of orthogonal units is trained by (5) and (11). Thus, firing of only occurs in the generation of the th teaching sequence. This implies that orthogonal units allow the conflict of parameter changes caused by multiple time series learning to be avoided because orbits corresponding to each teaching I/O time series run around the orthogonal state space of the trained RNN.
In order to show the effect of the orthogonal units on the conflict among teaching sequences, we consider the th learning ratio defined by where is an element of the matrix . If is nearly equal to , then the change in is approximately independent of teaching sequences other than the th sequence. In Figure 12, we plot the value determined by where is a set of indices corresponding to the elements of the vector . The value represents the average of the th learning ratio for connections between internal units and orthogonal units . In this numerical experiment, for each learning step, is clearly larger than , where is the number of teaching sequences. Thus, the th learning ratio is dominant for connection weights between internal units and orthogonal units . Therefore, in changing matrix , there is no conflict generated by multiple teaching sequences. However, we could not find a strong bias of the learning ratio for the matrices and and every element of with . Thus, we consider that connection weights between internal units and orthogonal units encode information on an individual time series, and other connection weights encode information common to all of the time series.
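The idea of a per-sequence learning ratio can be illustrated with a small sketch. Because the defining equation is not reproduced here, the form below is an assumption: for one connection weight, the ratio for sequence s is taken as the magnitude of that sequence's gradient contribution divided by the summed magnitudes over all sequences. The function name and the example gradients are hypothetical.

```python
def learning_ratios(grads):
    """Return each sequence's share of the total gradient magnitude for one weight.

    grads[s] is the (assumed) gradient of the s-th sequence's error with respect
    to a single connection weight.
    """
    total = sum(abs(g) for g in grads)
    if total == 0.0:
        # No sequence changes the weight; report a uniform share.
        return [1.0 / len(grads)] * len(grads)
    return [abs(g) / total for g in grads]


# A weight driven almost entirely by the first of three sequences.
dominated = learning_ratios([0.9, -0.05, 0.05])
# A weight shared evenly by three sequences: every ratio equals 1/3.
shared = learning_ratios([0.2, -0.2, 0.2])
```

Under this assumed definition, a ratio close to 1 means that the weight is shaped by one sequence alone, while ratios near one over the number of sequences mean that all sequences compete for the same weight.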
In this report, we have investigated a method of embedding multiple time series into a single RNN. In order to clarify the characteristics of the proposed approach, we compare the proposed approach with other approaches with respect to information representation of multiple sequences in the models. The mixture-of-RNN-experts-type model composes local representation in an RNN for each sequence. The local representation provides robustness against changing the parameters in learning, but it lacks the ability to extract common patterns included in the sequences because of the independency of the local representation. In the proposed model, the local representation is constructed into orthogonal units, while the global representation is also constructed into internal units using the connection weights between I/O units and internal units. Since each sequence generated by the proposed model shares the state space and connection weights, the model can extract common patterns of the sequences as well as conventional neural networks.
Another characteristic, which clarifies the difference between our model and other models, is whether the classification of each time series is self-organized into the state space. For example, in the mixture-of-RNN-experts-type model, the allocation of time series to each RNN is determined automatically. As another example, in the RNNPB model, PB values are self-organized such that the PB can individualize each time series. On the other hand, the proposed model needs the information of orthogonalization for each time series. Since the sparse firing patterns which appear in orthogonal units, corresponding to time series, are given as teaching information externally, the classification of sequences is not self-organized. The characteristic whereby the time series cannot be automatically classified is a disadvantage of the proposed model. However, the time series can be classified using other clustering techniques before applying the proposed method. Thus, by combining the proposed method and other clustering techniques, an algorithm that automatically classifies and generates multiple time series can be constructed.
In this paper, we have presented an RNN model and a learning algorithm that can acquire the ability to generate multiple sequences. The RNN model is characterized by two distinct properties called bistability and orthogonality. Bistability guarantees the existence of multiple attractor structures in RNNs, and provides the RNNs with contraction transition mapping. Orthogonality, which is given as a function of the orthogonal vectors of RNNs, helps prevent conflicts with respect to parameter changes caused by multiple training sequences. In the numerical experiments, RNNs that have bistability and orthogonality can learn multiple periodic attractors constituted by five Lissajous curves or 12 Van der Pol oscillators. Based on these results, the proposed model can be applied to the modeling of various types of dynamical systems that include multiple attractors.
F.-S. Tsung, Modeling dynamical systems with recurrent neural networks, Ph.D. thesis, Department of Computer Science, University of California, San Diego, Calif, USA, 1994.
H. Jaeger, “Short term memory in echo state networks,” National Research Center for Information Technology, Bremen, Germany, 2001.
W. Maass, T. Natschläger, and H. Markram, “A fresh look at real-time computation in generic recurrent neural circuits,” Institute for Theoretical Computer Science, TU Graz, Graz, Austria, 2002.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland, Eds., pp. 318–362, MIT Press, Cambridge, Mass, USA, 1986.