Advances in Artificial Neural Systems

Volume 2009 (2009), Article ID 846040, 11 pages

http://dx.doi.org/10.1155/2009/846040

## Building Recurrent Neural Networks to Implement Multiple Attractor Dynamics Using the Gradient Descent Method

Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan

Received 31 March 2008; Accepted 22 August 2008

Academic Editor: Akira Imada

Copyright © 2009 Jun Namikawa and Jun Tani. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The present paper proposes a recurrent neural network model and learning algorithm that can acquire the ability to generate desired multiple sequences. The network model is a dynamical system in which the transition function is a contraction mapping, and the learning algorithm is based on the gradient descent method. We show a numerical simulation in which a recurrent neural network obtains a multiple periodic attractor consisting of five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. The present analysis clarifies that the model contains many stable regions as attractors, and multiple time series can be embedded into these regions by using the present learning method.

#### 1. Introduction

Recurrent neural networks (RNNs) have been successfully applied to the modeling of various types of dynamical systems. Since the universal approximation ability of multilayer neural networks has been proved, RNNs can model arbitrary dynamical systems and Turing machines [1–3]. However, training an RNN to realize a desired model may be very difficult even if such an RNN exists [4]. For example, building RNNs to implement required multiple attractor dynamics is a difficult problem for standard training, such as the gradient descent method. Doya and Yoshizawa [5] demonstrated that RNNs can acquire two limit cycles by the gradient descent method using initialization with small connection weights, whereas learning for more than three limit cycles is difficult [6]. This is because the learning of several time series causes conflicts in the changes of the connection weights. How to form RNN models that can learn several temporal sequence patterns has proved to be a challenging problem.

There have been some approaches to this problem. In order to avoid conflicts in the change of parameters, the mixture-of-experts-type architecture has been investigated [7, 8]. The mixture-of-experts model consists of RNNs as experts and a hierarchical gating mechanism. At the end of successful learning, each expert implements attractor dynamics as locally represented knowledge, and a gating mechanism chooses only one expert at any time. The system can acquire many attractor patterns, although it has the disadvantage that it cannot generalize across the attractor patterns. As another approach to implementing multiple patterns, the parametric bias (PB) method has been developed to improve the learning capability of RNNs [9, 10]. In an RNN that employs the PB method (RNNPB), PB values provide the information needed in order to individualize each sequence. It has been reported that the number of time series that RNNPBs can learn is greater than that which RNNs without PB can learn. However, the PB method cannot avoid the conflicts that arise between the attractors being learned. Therefore, learning multiple time series by an RNNPB tends to fail when the number of time series increases.

In the present study, we will focus on the training method for RNNs to learn multiple attractor dynamics. Furthermore, we will show that the present research is related to research into RNNs with contraction transition functions. In recent years, RNNs with contraction transition mapping have been investigated with respect to the performance of time series learning [11–13], generalization ability [14], and memory capacity [15]. Jaeger [11, 12] demonstrated that an “echo state network,” which is an RNN with contraction mapping, successfully learns the Mackey-Glass chaotic time series, a well-known benchmark system for time series prediction. In order to formally express the generalization ability, Hammer and Tiňo proved that RNNs with contraction are distribution-independent learnable in the probably approximately correct (PAC) sense [14]. From the above results, RNNs with contraction might be regarded as powerful tools for modeling dynamical systems. However, RNNs with contraction have difficulty in representing multiple attractor dynamics because dynamic states governed by the contraction transition function are globally attracted to one point. In this paper, the representation capability of RNNs with contraction mapping will be improved such that the RNNs can obtain multiple attractor dynamics.

We start by defining the concepts of the RNN and the training method for multiple attractor dynamics. The RNN has the Elman net-type architecture, and the training method for RNNs is based on the backpropagation through time (BPTT) algorithm [16]. We then show in numerical simulation that the RNNs can acquire multiple periodic attractors constituted by five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. Moreover, we consider why the RNNs successfully learn multiple attractors and how learnability depends on the parameters of the RNNs. Finally, we link the results obtained herein to other learning strategies, and consider other advanced research topics.

#### 2. Model

##### 2.1. Recurrent Neural Network

We first consider a neural network model with recurrent connection, such as the Elman net [17] (see Figure 1). The RNN contains I/O units, orthogonal units, and internal units. We denote the dynamic states of I/O units, orthogonal units, and internal units at time step by , and , respectively. The RNN is defined by functions and with a parameter , where and are of the forms where , and are matrices, and are vectors, is a time constant that satisfies , and denotes a componentwise application such as .

Dynamic states of the RNN at time step are updated according to From these equations, the RNN can be represented by an -dimensional dynamical system.
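As a rough illustration of the kind of update described above, the following minimal sketch implements a leaky-integrator, Elman-style network in plain Python. The exact update form, the `tanh` activation, the weight ranges, and all names (`rnn_step`, `readout`, `run_closed_loop`, `W_rec`, `W_in`, `W_out`) are assumptions made for this sketch, and the orthogonal units are omitted for brevity; it is not the paper's exact model.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(u, x, W_rec, W_in, b, tau):
    """One leaky-integrator update of the internal potentials u.

    Assumed form (not taken from the paper):
        u' = (1 - 1/tau) * u + (1/tau) * (W_rec tanh(u) + W_in x + b)
    """
    c = [math.tanh(ui) for ui in u]  # componentwise activation
    drive = [r + i + bi
             for r, i, bi in zip(matvec(W_rec, c), matvec(W_in, x), b)]
    return [(1.0 - 1.0 / tau) * ui + di / tau for ui, di in zip(u, drive)]

def readout(u, W_out, b_out):
    """Map internal potentials to I/O unit activations in (-1, 1)."""
    c = [math.tanh(ui) for ui in u]
    return [math.tanh(o + bo) for o, bo in zip(matvec(W_out, c), b_out)]

def run_closed_loop(u0, steps, params):
    """Feed the network's own output back as input (no teacher signal)."""
    W_rec, W_in, b, W_out, b_out, tau = params
    u, outputs = list(u0), []
    x = readout(u, W_out, b_out)
    for _ in range(steps):
        outputs.append(x)
        u = rnn_step(u, x, W_rec, W_in, b, tau)
        x = readout(u, W_out, b_out)
    return outputs

rng = random.Random(0)
n_int, n_io = 8, 2  # hypothetical layer sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

params = (rand_matrix(n_int, n_int), rand_matrix(n_int, n_io), [0.0] * n_int,
          rand_matrix(n_io, n_int), [0.0] * n_io, 5.0)
u0 = [rng.uniform(-1.0, 1.0) for _ in range(n_int)]
trajectory = run_closed_loop(u0, steps=50, params=params)
```

In this sketch the time constant `tau` controls how slowly the internal potentials change: with `tau = 1` the state is overwritten at every step, while larger values blend the new drive into the previous state.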

We now define bistability for the RNN.

*Definition 1.* Assume is as above. The function is *bistable* with respect to the third variable if a real value and an integer exist such that for every element of the matrix .

The bistability of a function is a key concept of our learning method. We will show in Section 4.1 that the bistable function plays an important role in the learning of multiple attractor dynamics.

##### 2.2. Learning Method

We present a formulation of the training procedure for the RNN with a multiple teacher I/O time series. For every and , we assume that is a sequence of teacher I/O of length .

*Initialization of Parameters*

We initialize every element of matrices , ,
and and vectors and randomly from the uniform distribution in the
interval .
A matrix is randomly assigned such that is bistable. For all is randomly initialized in the interval .

Assume that is an -tuple of vectors for and ,
and that the dimension of is equivalent to that of if .
We initialize such that

*Run Network with Teacher I/O and Compute Error Function*

For every , the sequence of I/O units of the RNN at learning step is defined by The error function of the RNN at learning step with the th teacher I/O time series is defined by where denotes the mean square error function . Finally, the error function at learning step is defined by

*Update Parameters*

Let be a parameter of the RNN at learning step .
We determine the parameter by
where and are the constants of the learning rate and
momentum, respectively. On the other hand, a connection matrix is not changed as in order to hold the bistability condition. We
compute the initial state of the internal units at learning step such that
where ,
and is the constant of the learning rate of the
initial state. Assume that is a vector as a component of the orthogonal
units ,
such as .
The vector is defined by
where ,
and is the constant of the learning rate of the
orthogonal units.
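The parameter update with a learning rate and a momentum term can be sketched as follows. This is a generic gradient descent with momentum step applied to a toy quadratic error that sums one mean square error per teacher vector; the toy model, the function names, and the numerical values of `lr` and `momentum` are assumptions for illustration only and do not reproduce the paper's BPTT gradients or constants.

```python
def momentum_step(theta, velocity, grad, lr, momentum):
    """Return updated (theta, velocity) for one gradient step with momentum.

    velocity' = momentum * velocity - lr * grad
    theta'    = theta + velocity'
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_theta = [t + v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

def total_error(theta, targets):
    """Sum over teacher vectors of the mean square error for a toy model y = theta."""
    return sum(sum((t - y) ** 2 for t, y in zip(theta, target)) / len(target)
               for target in targets)

def total_error_grad(theta, targets):
    """Analytic gradient of total_error with respect to theta."""
    n = len(theta)
    return [sum(2.0 * (theta[k] - target[k]) / n for target in targets)
            for k in range(n)]

targets = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical teacher vectors
theta, velocity = [0.0, 0.0], [0.0, 0.0]
initial_error = total_error(theta, targets)
for _ in range(200):
    grad = total_error_grad(theta, targets)
    theta, velocity = momentum_step(theta, velocity, grad, lr=0.05, momentum=0.9)
final_error = total_error(theta, targets)
```

Because the two toy targets pull the shared parameters in different directions, the minimum is a compromise between them. This is a small-scale picture of the conflict between teaching sequences that the orthogonal units are intended to reduce.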

Note that the maximum value of the error function depends on the number of units and the length
of the teacher I/O time series. Thus, we should scale the learning rates ,
and with the number of units and length of
sequences. In the present paper, we consider parameters ,
and such that ,
and .

#### 3. Numerical Experiments

In this section, we conduct two types of experiments as examples of using the training method for RNNs proposed in Section 2. The first experiment shows the learning of five Lissajous curves. The second experiment shows the training of multiple attractors of a Van der Pol oscillator with 12 different parameters.

##### 3.1. Experiment 1: Lissajous Curves

###### 3.1.1. Teacher I/O Time Series

Our first task is to learn the five Lissajous curves defined by and we consider constants and for all (see Figure 2).
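For readers who wish to generate comparable training data, the sketch below samples two-dimensional Lissajous curves at integer time steps. The frequency ratios, phases, period, and sequence length are hypothetical placeholders chosen for this sketch; the constants used in the experiment are not reproduced here.

```python
import math

def lissajous(freq_x, freq_y, phase, length, period):
    """Sample a 2-D Lissajous curve (x, y) at time steps t = 0..length-1."""
    omega = 2.0 * math.pi / period
    return [(math.sin(freq_x * omega * t + phase), math.sin(freq_y * omega * t))
            for t in range(length)]

# Hypothetical (freq_x, freq_y, phase) triples, one per teacher sequence.
specs = [(1, 1, math.pi / 2), (1, 2, 0.0), (2, 1, 0.0),
         (1, 3, math.pi / 2), (3, 2, 0.0)]
teachers = [lissajous(fx, fy, ph, length=200, period=50) for fx, fy, ph in specs]
```

Each curve is periodic with the chosen `period`, so a network that reproduces it in closed loop must realize a periodic attractor.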

###### 3.1.2. Learning and Testing

We now describe the specific conditions applied to RNN training. The time constant is set to . The number of orthogonal units is , and the dimension of a vector is for all . Suppose that is bistable with , and . The learning rates and momentum are given by , and , respectively.

Figure 3 shows the error function for 20 000 learning steps. We also show the Kullback-Leibler divergence between the teacher I/O time series and a sequence of I/O units in the RNN computed by (3), which does not use external perturbation by the teaching sequences. We use the Kullback-Leibler divergence as a measure of the discrepancy between two sequences. Formally, the Kullback-Leibler divergence between two probability distributions $P$ and $Q$ is defined as

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$

By definition, in order to compute the Kullback-Leibler divergence, it is necessary to obtain probability distributions of the teacher I/O time series and a sequence of I/O units. However, obtaining the probability distribution of a sequence of I/O units is very difficult. Therefore, we quantize a time series of real-valued vectors into a symbolic sequence such that if the real value is less than , then the symbol is assigned, and otherwise the symbol is assigned. In addition, we use the probability distribution whereby sub-blocks with a block length of appear in the symbolic sequence given by the above quantization.
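The quantize-then-count procedure described above can be sketched as follows. The threshold of `0.0`, the block length of `3`, the binary symbols `0` and `1`, and the `epsilon` smoothing for blocks that never occur in the second sequence are all assumptions of this sketch; the values used in the experiment are not reproduced here.

```python
import math
from collections import Counter

def quantize(series, threshold):
    """Map each real-valued vector to a tuple of 0/1 symbols, one per component."""
    return [tuple(0 if value < threshold else 1 for value in vec) for vec in series]

def block_distribution(symbols, block_length):
    """Empirical distribution of consecutive sub-blocks of the given length."""
    blocks = [tuple(symbols[i:i + block_length])
              for i in range(len(symbols) - block_length + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return {block: n / total for block, n in counts.items()}

def kl_divergence(p, q, epsilon=1e-6):
    """Compute sum_x P(x) log(P(x) / Q(x)).

    Blocks missing from Q get probability epsilon so the sum stays finite;
    this smoothing is an assumption of the sketch.
    """
    return sum(px * math.log(px / q.get(x, epsilon)) for x, px in p.items())

def sequence_divergence(teacher, generated, threshold=0.0, block_length=3):
    """Divergence between the block distributions of two vector sequences."""
    p = block_distribution(quantize(teacher, threshold), block_length)
    q = block_distribution(quantize(generated, threshold), block_length)
    return kl_divergence(p, q)

teacher = [(math.sin(0.3 * t), math.cos(0.3 * t)) for t in range(200)]
faster = [(math.sin(0.5 * t), math.cos(0.5 * t)) for t in range(200)]
same = sequence_divergence(teacher, teacher)
different = sequence_divergence(teacher, faster)
```

A sequence compared with itself gives zero divergence, while a sequence with a different rotation speed gives a positive value because its symbol blocks change at a different rate.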

Figure 4 describes attractors of the trained RNN computed by (3) of which the initial state of internal units is for each . By comparing the attractors with the teacher I/O time series displayed in Figure 2, we can see that the RNN can generate sequences similar to training data.

In Figure 5, examples of attractors for the RNN with random initial states are displayed. This shows that, in addition to the attractors corresponding to teacher I/O time series, there exist many attractors of the RNN.

##### 3.2. Experiment 2: Van der Pol Attractors

###### 3.2.1. Teacher I/O Time Series

Our second task is to learn multiple attractors given by the Van der Pol oscillator with different parameters. The Van der Pol oscillator defined by is a model of an electronic circuit that appeared in very early radios. It is well known that there exists a limit cycle for the Van der Pol oscillator. In this experiment, we consider twelve teacher I/O time series, where the th teacher I/O time series is given by for and , where and are constant parameters representing the center position of the limit cycle, and is a time constant of the oscillator. We assume that the parameters , and are given by combining the values of , , and . Figure 6 shows the teacher I/O time series given by (16). The length of training data is for .
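The existence of a limit cycle can be checked numerically with a short simulation. The sketch below integrates the textbook form of the Van der Pol equation, x'' - mu (1 - x^2) x' + x = 0, with a fourth-order Runge-Kutta step. This textbook form, the value `mu = 1.0`, the step size, and the function names are assumptions of the sketch; the shifted and time-scaled variant used to build the twelve teacher sequences is not reproduced here.

```python
def van_der_pol(state, mu):
    """Right-hand side of x'' - mu * (1 - x^2) * x' + x = 0 as a first-order system."""
    x, v = state
    return (v, mu * (1.0 - x * x) * v - x)

def rk4_step(state, mu, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))
    k1 = van_der_pol(state, mu)
    k2 = van_der_pol(shifted(state, k1, dt / 2), mu)
    k3 = van_der_pol(shifted(state, k2, dt / 2), mu)
    k4 = van_der_pol(shifted(state, k3, dt), mu)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(initial, mu=1.0, dt=0.01, steps=5000):
    """Return the trajectory [(x, v), ...] starting from the given state."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], mu, dt))
    return trajectory

# Start inside and outside the cycle; both runs should settle on the same orbit.
inner = simulate((0.1, 0.0))
outer = simulate((4.0, 0.0))
amp_inner = max(abs(x) for x, _ in inner[-1000:])
amp_outer = max(abs(x) for x, _ in outer[-1000:])
```

Both trajectories end with an oscillation amplitude close to 2, which is the behavior expected of a single stable limit cycle.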

###### 3.2.2. Learning and Testing

The parameters for learning are set as follows. Let be bistable with and . The dimension of the vector is for every so that . Other parameters are the same as in experiment 1.

The error function and the Kullback-Leibler divergence for 200 000 learning steps are displayed in Figure 7. Figure 8 shows attractors of the trained RNN, where the initial state of the internal units is set to for every .

This result suggests that the RNN acquires multiple periodic attractors constituted by the teacher I/O time series.

#### 4. Numerical Analysis

##### 4.1. Contraction and Bistability

Assume that and are sets and that is equipped with a metric structure. A function is a contraction with respect to if a real value exists such that the inequality holds for all and .
*Lemma 2.* *Let one consider a dynamical system on defined by the transition function , where and are defined in (1) and (2), respectively. Assume that each element of the matrix satisfies (4), and is the maximum absolute value of elements in , , and . If there exist three solutions of then*

(1) *there are invariant sets of a dynamical system ;*

(2) *suppose that is an invariant set of the dynamical system; then, the restriction of to is a contraction with respect to and the maximum norm , where .*

*Proof.* We suppose that (18) has three solutions, such as (see Figure 9). In general, and .

(1) Assume and . Then, the expression is satisfied for all , and , where is the th element of the vector . Hence, if . Furthermore, if , then because Therefore, the region is a stable set of the th element of vector satisfying the fact that if then for any .

Similarly, we can easily show that if , then . Thus, there are two stable regions of the th element of vector for each . Then, there are invariant sets.

(2) Let be the invariant set presented above. Assume that .

For any , the inequality holds because if , then .

On the other hand, for every ,

Then, is obtained for any . Accordingly, the restriction of to is a contraction with respect to and the maximum norm .

For any and , there is a real number such that if , then (18) has three solutions. Thus, if is large enough and matrices and represent small connection weights, then contains invariant sets, and each restriction of to an invariant set is a contraction with respect to a third input. Moreover, the integer is the effective degree of freedom for each contraction mapping restricted to an invariant set. If is a large value, then RNN can acquire a more complex time sequence. In Figures 10 and 11, we plot the Kullback-Leibler divergence of the trained RNN for parameters and , in which the training data are the same as those for experiment 1. These results imply that it is necessary that , and be large values in order to learn multiple attractor dynamics.
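The role of a large self-connection can be illustrated with a single unit. The sketch below iterates an assumed scalar update, u' = (1 - 1/tau) u + (1/tau) w tanh(u), whose fixed points solve u = w tanh(u). The scalar form, the `tanh` activation, the names, and the values of `weight` and `tau` are assumptions of this sketch; the unit is isolated from the rest of the network and does not reproduce equation (18).

```python
import math

def self_connected_unit(u, weight, tau):
    """Assumed scalar update: u' = (1 - 1/tau) * u + (1/tau) * weight * tanh(u)."""
    return (1.0 - 1.0 / tau) * u + weight * math.tanh(u) / tau

def settle(u0, weight, tau=5.0, steps=2000):
    """Iterate the map from u0 and return the state it settles to."""
    u = u0
    for _ in range(steps):
        u = self_connected_unit(u, weight, tau)
    return u

# Weak self-connection: a single fixed point at the origin.
weak_pos, weak_neg = settle(0.5, weight=0.5), settle(-0.5, weight=0.5)
# Strong self-connection: two stable fixed points of opposite sign.
strong_pos, strong_neg = settle(0.5, weight=3.0), settle(-0.5, weight=3.0)
```

With the weak weight both runs decay to zero, whereas with the strong weight the sign of the initial state selects one of two stable values. Near each of those values the map has slope below one, so it is locally contracting. In a network with several such units, each unit can rest in either region, which gives a combinatorial number of stable regions.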

##### 4.2. Orthogonality

In the last paragraph of the previous section, we have shown that RNNs have many stable regions, and the existence of the stable regions plays an important role in the learning of multiple sequences. However, the existence of multiple stable regions is not sufficient for success in the multiple attractor learning because if the change of parameters corresponding to each time series influences other changes, each time series cannot necessarily be embedded into each region. Similarly, this problem appears in the method of RNNPB.

In the training algorithm defined in Section 2, each state of orthogonal units is trained by (5) and (11). Thus, firing of only occurs in the generation of the th teaching sequence. This implies that orthogonal units allow the conflict of parameter changes caused by multiple time series learning to be avoided because orbits corresponding to each teaching I/O time series run around the orthogonal state space of the trained RNN.

In order to show the effect of the orthogonal units on the conflict among teaching sequences, we consider the th learning ratio defined by where is an element of the matrix . If is nearly equal to , then the change in is approximately independent of teaching sequences other than the th sequence. In Figure 12, we plot the value determined by where is a set of indices corresponding to the elements of the vector . The value represents the average of the th learning ratio for connections between internal units and orthogonal units . In this numerical experiment, for each learning step, is clearly larger than , where is the number of teaching sequences. Then, the sum of the th learning ratios of connection weights between internal units and orthogonal units is dominant. Therefore, in changing matrix , there is no conflict generated by multiple teaching sequences. However, we could not find a strong bias of the learning ratio for the matrices and and every element of with . Thus, we consider that connection weights between internal units and orthogonal units encode information on an individual time series, and other connection weights encode information shared by all time series.

#### 5. Discussion

In this report, we have investigated a method of embedding multiple time series into a single RNN. In order to clarify the characteristics of the proposed approach, we compare it with other approaches with respect to how information on multiple sequences is represented in the models. The mixture-of-RNN-experts-type model forms a local representation in an RNN for each sequence. The local representation provides robustness against parameter changes during learning, but it lacks the ability to extract common patterns included in the sequences because the local representations are independent of one another. In the proposed model, the local representation is constructed in the orthogonal units, while the global representation is constructed in the internal units using the connection weights between I/O units and internal units. Since each sequence generated by the proposed model shares the state space and connection weights, the model can extract common patterns of the sequences, as conventional neural networks do.

Another characteristic, which clarifies the difference between our model and other models, is whether the classification of each time series is self-organized into the state space. For example, in the mixture-of-RNN-experts-type model, the allocation of time series to each RNN is determined automatically. As another example, in the RNNPB model, PB values are self-organized such that the PB can individualize each time series. On the other hand, the proposed model needs the information of orthogonalization for each time series. Since the sparse firing patterns which appear in orthogonal units, corresponding to time series, are given as teaching information externally, the classification of sequences is not self-organized. The characteristic whereby the time series cannot be automatically classified is a disadvantage of the proposed model. However, the time series can be classified using other clustering techniques before applying the proposed method. Thus, by combining the proposed method and other clustering techniques, an algorithm that automatically classifies and generates multiple time series can be constructed.

#### 6. Conclusion

In this paper, we have presented an RNN model and a learning algorithm that can acquire the ability to generate multiple sequences. The RNN model has two distinct properties called bistability and orthogonality. Bistability guarantees the existence of multiple attractor structures in RNNs, and provides the RNNs with contraction transition mapping. Orthogonality, which is given as a function of the orthogonal vectors of RNNs, helps prevent conflicts with respect to parameter changes caused by multiple training sequences. In the numerical experiments, RNNs which have bistability and orthogonality can learn multiple periodic attractors constituted by five Lissajous curves or 12 Van der Pol oscillators. Based on these results, the proposed model can be applied to the modeling of various types of dynamical systems that include multiple attractors.

#### References

1. K. Funahashi and Y. Nakamura, “Approximation of dynamical systems by continuous time recurrent neural networks,” *Neural Networks*, vol. 6, no. 6, pp. 801–806, 1993.
2. H. T. Siegelmann and E. D. Sontag, “Analog computation via neural networks,” *Theoretical Computer Science*, vol. 131, no. 2, pp. 331–360, 1994.
3. H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” *Journal of Computer and System Sciences*, vol. 50, no. 1, pp. 132–150, 1995.
4. Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” *IEEE Transactions on Neural Networks*, vol. 5, no. 2, pp. 157–166, 1994.
5. K. Doya and S. Yoshizawa, “Memorizing oscillatory patterns in the analog neuron network,” in *Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '89)*, vol. 1, pp. 27–32, Washington, DC, USA, June 1989.
6. F.-S. Tsung, *Modeling dynamical systems with recurrent neural networks*, Ph.D. thesis, Department of Computer Science, University of California, San Diego, Calif, USA, 1994.
7. D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,” *Neural Networks*, vol. 11, no. 7-8, pp. 1317–1329, 1998.
8. J. Tani and S. Nolfi, “Learning to perceive the world as articulated: an approach for hierarchical learning in sensory-motor systems,” *Neural Networks*, vol. 12, no. 7-8, pp. 1131–1141, 1999.
9. J. Tani, “Learning to generate articulated behavior through the bottom-up and the top-down interaction processes,” *Neural Networks*, vol. 16, no. 1, pp. 11–23, 2003.
10. J. Tani and M. Ito, “Self-organization of behavioral primitives as multiple attractor dynamics: a robot experiment,” *IEEE Transactions on Systems, Man and Cybernetics Part A*, vol. 33, no. 4, pp. 481–488, 2003.
11. H. Jaeger, “Short term memory in echo state networks,” National Research Center for Information Technology, Bremen, Germany, 2001.
12. H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” *Science*, vol. 304, no. 5667, pp. 78–80, 2004.
13. W. Maass, T. Natschläger, and H. Markram, “A fresh look at real-time computation in generic recurrent neural circuits,” Institute for Theoretical Computer Science, TU Graz, Graz, Austria, 2002.
14. B. Hammer and P. Tiňo, “Recurrent neural networks with small weights implement definite memory machines,” *Neural Computation*, vol. 15, no. 8, pp. 1897–1929, 2003.
15. O. L. White, D. D. Lee, and H. Sompolinsky, “Short-term memory in orthogonal neural networks,” *Physical Review Letters*, vol. 92, no. 14, Article ID 148102, 4 pages, 2004.
16. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in *Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations*, D. E. Rumelhart and J. L. McClelland, Eds., pp. 318–362, MIT Press, Cambridge, Mass, USA, 1986.
17. J. L. Elman, “Finding structure in time,” *Cognitive Science*, vol. 14, no. 2, pp. 179–211, 1990.