Abstract
The present paper proposes a recurrent neural network model and learning algorithm that can acquire the ability to generate desired multiple sequences. The network model is a dynamical system in which the transition function is a contraction mapping, and the learning algorithm is based on the gradient descent method. We show a numerical simulation in which a recurrent neural network obtains a multiple periodic attractor consisting of five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. The present analysis clarifies that the model contains many stable regions as attractors, and multiple time series can be embedded into these regions by using the present learning method.
1. Introduction
Recurrent neural networks (RNNs) have been successfully applied to the modeling of various types of dynamical systems. Since the universal approximation ability
of multilayer neural networks has been proved, RNNs can model arbitrary
dynamical systems and Turing machines [1–3]. However, training an RNN to realize a
desired model may be very difficult even if such an RNN exists [4]. For example, building RNNs
to implement required multiple attractor dynamics is a difficult problem for
standard training, such as the gradient descent method. Doya and Yoshizawa
[5] demonstrated that
RNNs can acquire two limit cycles by the gradient descent method when
initialized with small connection weights, whereas learning more than
three limit cycles is difficult [6]. This is because learning several
time series causes conflicts in the changes to the connection weights.
How to form RNN models that can learn several temporal sequence patterns has
proved to be a challenging problem.
There have been some approaches to this problem. In
order to avoid conflicts in the change of parameters, the
mixture-of-experts-type architecture has been
investigated [7, 8].
The mixture-of-experts model consists of RNNs as experts and a hierarchical
gating mechanism. At the end of successful learning, each expert implements
attractor dynamics as locally represented knowledge, and a gating mechanism
chooses only one expert at any time. The system can acquire many attractor
patterns, but it has the disadvantage that it cannot
generalize across the attractor patterns. As another approach to
implementing multiple patterns, the parametric bias (PB) method has been developed
to improve the learning capability of RNNs [9, 10]. In an RNN that employs the PB method (RNNPB), PB
values provide the information needed in order to individualize each sequence.
It has been reported that the number of time series that RNNPBs can learn is
greater than that which RNNs without PB can learn. However, the PB method
cannot avoid the conflicts that arise between the attractors being learned. Therefore,
learning multiple time series by an RNNPB tends to fail when the number of time
series increases.
In the present study, we will focus on the training
method for RNNs to learn multiple attractor dynamics. Furthermore, we will show
that the present research is related to research into RNNs with contraction
transition functions. In recent years, RNNs with contraction transition mapping
have been investigated with respect to the performance of time series learning
[11–13], generalization ability [14], and memory capacity [15]. Jaeger [11, 12] demonstrated that an “echo
state network,” which is an RNN with contraction mapping, successfully learns
the Mackey-Glass chaotic time series, a well-known benchmark system for time
series prediction. In order to formally express the generalization ability,
Hammer and Tiňo proved that RNNs with contraction are
distribution-independent learnable in the probably approximately correct (PAC)
sense [14]. From the
above results, RNNs with contraction might be regarded as powerful tools for
modeling dynamical systems. However, RNNs with contraction have difficulty in
representing multiple attractor dynamics because dynamic states governed by the
contraction transition function are globally attracted to one point. In this
paper, the representation capability of RNNs with contraction mapping will be
improved such that the RNNs can obtain multiple attractor dynamics.
We start by defining the concepts of the RNN and the
training method for multiple attractor dynamics. The RNN has the Elman net-type
architecture, and the training method for RNNs is based on the
backpropagation through time (BPTT) algorithm [16]. We then show in numerical
simulation that the RNNs can acquire multiple periodic attractors constituted
by five Lissajous curves, or a Van der Pol oscillator with twelve different
parameters. Moreover, we consider why the RNNs successfully learn multiple
attractors and how the performance of learnability depends on parameters of the
RNNs. Finally, we link the results obtained herein to other learning strategies,
and consider other advanced research topics.
2. Model
2.1. Recurrent Neural Network
We first consider a neural network model with
recurrent connection, such as the Elman net [17] (see Figure 1). The RNN contains I/O units, orthogonal
units, and internal units. We denote the dynamic states of I/O units,
orthogonal units, and internal units at time step
by
,
and
,
respectively. The RNN is defined by functions
and
with a parameter
,
where
and
are of the forms
(1)
(2)
where
,
and
are matrices,
and
are vectors,
is a time constant that satisfies
,
and
denotes a componentwise application such as
.
Figure 1: Architecture of the recurrent neural
network. Solid arrows, dotted arrows, and boxes represent fixed connections,
adjustable connections, and network states, respectively.
Dynamic states of the RNN at time step
are updated according to
(3)
From these equations, the RNN
can be represented by an
-dimensional dynamical system.
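As a rough illustration of how a state update such as (3) is iterated, the sketch below steps a generic Elman-type network with a leaky-integrator time constant. The exact forms of (1) and (2) are not reproduced here: the tanh nonlinearity, the leaky update, the omission of the orthogonal units, and all names (`step`, `W_rec`, `W_in`, `W_out`, `tau`) are assumptions made for illustration only.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def step(u, x, W_rec, W_in, W_out, tau):
    """One assumed update: leaky internal potentials u, tanh outputs.

    u is the vector of internal potentials and x is the current I/O vector,
    which is fed back as input. Returns the new potentials and I/O vector.
    """
    h = [math.tanh(ui) for ui in u]  # componentwise activation
    drive = [a + b for a, b in zip(matvec(W_rec, h), matvec(W_in, x))]
    u_new = [(1 - 1 / tau) * ui + di / tau for ui, di in zip(u, drive)]
    h_new = [math.tanh(ui) for ui in u_new]
    x_new = [math.tanh(v) for v in matvec(W_out, h_new)]
    return u_new, x_new

def rand_matrix(rows, cols, scale, rng):
    """Matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
n_internal, n_io = 4, 2  # hypothetical sizes
W_rec = rand_matrix(n_internal, n_internal, 0.1, rng)
W_in = rand_matrix(n_internal, n_io, 0.1, rng)
W_out = rand_matrix(n_io, n_internal, 0.1, rng)

u, x = [0.0] * n_internal, [0.5, -0.5]
trajectory = []
for _ in range(20):
    u, x = step(u, x, W_rec, W_in, W_out, tau=5.0)
    trajectory.append(x)
```

Because the output is fed back as the next input, the loop runs the network as an autonomous dynamical system, which is the mode used when attractors are inspected after training.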
We now define bistability for the RNN.
Definition 1. Assume
is as above. The function
is bistable with respect to the third
variable
if a real value
and an integer
exist such that
(4)
for every element
of the matrix
.
The bistability of a function
is a key concept of our learning method. We
will show in Section 4.1 that the bistable function
plays an important role in the learning of
multiple attractor dynamics.
2.2. Learning Method
We present a formulation of the training procedure for
the RNN with a multiple teacher I/O time series. For every
and
,
we assume that
is a sequence of teacher I/O of length
.
Initialization of Parameters
We initialize every element of matrices
,
,
and
and vectors
and
randomly from the uniform distribution in the
interval
.
A matrix
is randomly assigned such that
is bistable. For all
is randomly initialized in the interval
.
Assume that
is an
-tuple of vectors
for
and
,
and that the dimension of
is equivalent to that of
if
.
We initialize
such that
(5)
Run Network with Teacher I/O and Compute Error Function
For every
,
the sequence
of I/O units of the RNN at learning step
is defined by
(6)
The error function
of the RNN at learning step
with the
th teacher I/O time series is defined
by
(7)
where
denotes the mean square error function
.
Finally, the error function
at learning step
is defined by
(8)
Update Parameters
Let 
be a parameter of the RNN at learning step
.
We determine the parameter
by
(9)
where
and
are the constants of the learning rate and
momentum, respectively. On the other hand, a connection matrix
is not changed as
in order to hold the bistability condition. We
compute the initial state
of the internal units at learning step
such that
(10)
where
,
and
is the constant of the learning rate of the
initial state. Assume that
is a vector as a component of the orthogonal
units
,
such as
.
The vector
is defined by
(11)
(12)
where
,
and
is the constant of the learning rate of the
orthogonal units.
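The parameter update in (9) combines a learning rate with a momentum term while one connection matrix is held fixed. The following sketch shows a standard gradient-descent-with-momentum step under such a constraint. The function name `momentum_step`, the dictionary layout, the key `W_fixed`, and the toy quadratic error are hypothetical and are not taken from the present model.

```python
def momentum_step(params, grads, velocity, lr, momentum, frozen=()):
    """Apply delta(n) = -lr * grad + momentum * delta(n-1) to each parameter.

    Parameters named in `frozen` are left unchanged, which mimics keeping a
    connection matrix fixed so that a structural condition is preserved.
    """
    for name, values in params.items():
        if name in frozen:
            continue
        for i, g in enumerate(grads[name]):
            velocity[name][i] = -lr * g + momentum * velocity[name][i]
            values[i] += velocity[name][i]
    return params, velocity

# Toy error E = sum of squares of all parameters, so dE/dp = 2p.
params = {"W_free": [1.0, -2.0], "W_fixed": [3.0, 3.0]}
velocity = {k: [0.0] * len(v) for k, v in params.items()}
for _ in range(200):
    grads = {k: [2.0 * p for p in v] for k, v in params.items()}
    momentum_step(params, grads, velocity, lr=0.05, momentum=0.9,
                  frozen=("W_fixed",))
```

In this toy run the free parameters decay toward the minimum of the quadratic error, while the frozen entries keep their initial values.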
Note that the maximum value of the error function
depends on the number of units and the length
of the teacher I/O time series. Thus, we should scale the learning rates
,
and
with the number of units and length of
sequences. In the present paper, we consider parameters
,
and
such that
,
and
.
3. Numerical Experiments
In this section, we conduct two types of experiments
as examples of using the training method for RNNs proposed in Section 2. The
first experiment shows the learning of five Lissajous curves. The second
experiment shows the training of multiple attractors of a Van der Pol
oscillator with 12 different parameters.
3.1. Experiment 1: Lissajous Curves
3.1.1. Teacher I/O Time Series
Our first task is to learn the five Lissajous curves
defined by
(13)
and we consider constants
and
for all
(see Figure 2).
Figure 2: Trajectories of the teacher I/O time
series in experiment 1.
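For readers who wish to generate teacher signals of this kind, the sketch below samples a generic two-dimensional Lissajous curve. The functional form (sin(a t), sin(b t + delta)), the frequency pairs, the phase, and the sampling step are illustrative assumptions; they are not the constants used in (13).

```python
import math

def lissajous(a, b, delta, n_steps, dt):
    """Sample (sin(a t), sin(b t + delta)) at t = 0, dt, 2 dt, ..."""
    return [(math.sin(a * k * dt), math.sin(b * k * dt + delta))
            for k in range(n_steps)]

# Hypothetical integer frequency pairs; each pair yields one closed curve.
ratios = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 2)]
curves = [lissajous(a, b, math.pi / 2, n_steps=200, dt=2 * math.pi / 200)
          for a, b in ratios]
```

Each entry of `curves` is one periodic two-dimensional sequence that could serve as a teacher I/O time series.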
3.1.2. Learning and Testing
We now describe the specific conditions applied to RNN
training. The time constant
is set to
.
The number
of orthogonal units is
,
and the dimension of a vector
is
for all
.
Suppose that
is bistable with
,
and
.
The learning rates and momentum are given by
,
and
,
respectively.
Figure 3 shows the error function
for 20 000 learning steps. We also show the
Kullback-Leibler divergence between the teacher I/O time series and a sequence
of I/O units in the RNN computed by (3) without external
perturbation by the teaching sequences. We use the Kullback-Leibler divergence
as a measure of the discrepancy between two sequences. Formally, the
Kullback-Leibler divergence between two probability distributions
and
is defined as
(14)
By definition, in order to
compute the Kullback-Leibler divergence, it is necessary to obtain probability
distributions of the teacher I/O time series and a sequence of I/O units.
However, obtaining the probability distribution of a sequence of I/O units is
very difficult. Therefore, we quantize a time series of real-valued vectors
into a symbolic sequence such that if the real value is less than
,
then the symbol
is assigned, and otherwise the symbol
is assigned. In addition, we use the
probability distribution whereby sub-blocks with a block length of
appear in the symbolic sequence given by the
above quantization.
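The quantization and block-counting procedure described above can be sketched as follows for a scalar sequence. The threshold value, the block length, the shifted sine test signals, and the small constant `eps` that guards against blocks absent from the second distribution are all assumptions chosen for illustration, not the settings used in the experiments.

```python
import math
from collections import Counter

def quantize(series, threshold):
    """Map each real value to symbol 0 if it is below threshold, else 1."""
    return [0 if v < threshold else 1 for v in series]

def block_distribution(symbols, block_len):
    """Relative frequency of each sub-block of length block_len."""
    blocks = [tuple(symbols[i:i + block_len])
              for i in range(len(symbols) - block_len + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return {b: c / total for b, c in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """Sum over blocks of p * log(p / q); eps stands in for blocks missing from q."""
    return sum(pv * math.log(pv / q.get(b, eps)) for b, pv in p.items())

teacher = [math.sin(0.3 * t) for t in range(500)]
output = [math.sin(0.3 * t + 0.1) for t in range(500)]  # slightly shifted copy
p = block_distribution(quantize(teacher, 0.0), block_len=3)
q = block_distribution(quantize(output, 0.0), block_len=3)
d_same = kl_divergence(p, p)
d_diff = kl_divergence(p, q)
```

A vector-valued sequence could be handled in the same way by quantizing each component and treating the resulting tuple as one symbol.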
Figure 3: Error and Kullback-Leibler
divergence between the teaching sequences and output generated by the RNN for 20 000 learning steps in experiment 1.
Figure 4 describes attractors of the trained RNN
computed by (3) of which the initial state of internal units is
for each
.
By comparing the attractors with the teacher I/O time series displayed in
Figure 2, we can see that the RNN can generate sequences similar to training
data.
Figure 4: Time series

generated by the trained RNN in experiment 1.
For each time series, only the initial state

is different.
In Figure 5, examples of attractors for the RNN with
random initial states are displayed. This shows that, in addition to the
attractors corresponding to teacher I/O time series, there exist many
attractors of the RNN.
Figure 5: Time series

generated by the trained RNN with random
initial state

in experiment 1.
3.2. Experiment 2: Van der Pol Attractors
3.2.1. Teacher I/O Time Series
Our second task is to learn multiple attractors given
by the Van der Pol oscillator with different parameters. The Van der Pol
oscillator defined by
(15)
is a model of an electronic
circuit that appeared in very early radios. It is well known that there exists
a limit cycle for the Van der Pol oscillator. In this experiment, we consider
twelve teacher I/O time series, where the
th teacher I/O time series
is given by
(16)
for
and
,
where
and
are constant parameters representing the
center position of the limit cycle, and
is a time constant of the oscillator. We assume
that the parameters
,
and
are given by combining the values of
,
,
and
.
Figure 6 shows the teacher I/O time series given by (16). The length of
training data is
for
.
Figure 6: Teaching sequences of experiment 2. (a) Trajectories on

.
(b) Temporal trajectories of teaching sequences.
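The Van der Pol oscillator is commonly written as x'' - mu (1 - x^2) x' + x = 0. The sketch below integrates this textbook form with a fourth-order Runge-Kutta scheme, then shifts the orbit by a center position and rescales time by a time constant, in the spirit of the teacher series described above. The exact form of (16) is not reproduced: the value of `mu`, the step size, the initial state, and the way `center` and `tau` enter are assumptions.

```python
def van_der_pol(state, mu):
    """Right-hand side of x' = y, y' = mu * (1 - x**2) * y - x."""
    x, y = state
    return (y, mu * (1.0 - x * x) * y - x)

def rk4_step(f, state, dt, mu):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = f(state, mu)
    k2 = f(shifted(state, k1, dt / 2), mu)
    k3 = f(shifted(state, k2, dt / 2), mu)
    k4 = f(shifted(state, k3, dt), mu)
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def teacher_series(center, tau, n_steps, mu=1.0, dt=0.05):
    """Integrate with time scaled by 1/tau, then shift the orbit by center."""
    state, out = (0.5, 0.0), []
    for _ in range(n_steps):
        state = rk4_step(van_der_pol, state, dt / tau, mu)
        out.append((state[0] + center[0], state[1] + center[1]))
    return out

series = teacher_series(center=(0.3, -0.3), tau=2.0, n_steps=4000)
```

After the initial transient, the trajectory settles onto a closed orbit around `center`, which is the limit cycle that a teacher series of this type would present to the network.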
3.2.2. Learning and Testing
The parameters for learning are set as follows. Let
be bistable with
and
.
The dimension of the vector
is
for every
so that
.
Other parameters are the same as in experiment 1.
The error function and the Kullback-Leibler divergence
for 200 000 learning steps are displayed in Figure 7.
Figure 8 shows attractors of the trained RNN, for which the initial state of the
internal units is set to
for every
.
Figure 7: Error and Kullback-Leibler divergence between
the teaching sequences and output generated by the RNN for 200 000 learning steps in experiment 2.
Figure 8: Time
series

generated by the trained RNN in experiment 2.
For each time series, the initial state

is the same as the training phase. (a)
Trajectories on

.
(b) Temporal trajectories.
This result suggests that the RNN acquires multiple periodic
attractors constituted by the teacher I/O time series.
4. Numerical Analysis
4.1. Contraction and Bistability
Assume that
and
are sets and that
is equipped with a metric structure. A
function
is a contraction with respect to
if a real value
exists such that the
inequality
(17)
holds for all
and
.
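A contraction in the sense of (17) shrinks the distance between any two states by a fixed factor for every admissible input. As an informal numerical check of this idea, the sketch below estimates the largest observed distance ratio under the maximum norm for the map x -> tanh(W x + u) with small weights. The map itself, the weight scale, the sampling ranges, and all function names are hypothetical choices for illustration and are not the transition function of the present model.

```python
import math
import random

def f(x, u, W):
    """Assumed transition map: componentwise tanh of W x + u."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + ui)
            for row, ui in zip(W, u)]

def max_norm(v):
    """Maximum absolute component of v."""
    return max(abs(c) for c in v)

def estimate_ratio(W, n_trials, rng):
    """Largest observed |f(x,u) - f(y,u)| / |x - y| over random triples."""
    n, worst = len(W), 0.0
    for _ in range(n_trials):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        y = [rng.uniform(-1, 1) for _ in range(n)]
        u = [rng.uniform(-1, 1) for _ in range(n)]
        dist = max_norm([a - b for a, b in zip(x, y)])
        if dist == 0.0:
            continue
        fx, fy = f(x, u, W), f(y, u, W)
        worst = max(worst, max_norm([a - b for a, b in zip(fx, fy)]) / dist)
    return worst

rng = random.Random(1)
W = [[rng.uniform(-0.2, 0.2) for _ in range(3)] for _ in range(3)]
row_sum_bound = max(sum(abs(w) for w in row) for row in W)
ratio = estimate_ratio(W, 2000, rng)
```

Since tanh has slope at most one, the largest absolute row sum of `W` bounds the contraction factor of this map in the maximum norm, so `ratio` should not exceed `row_sum_bound`, and the map is a contraction whenever that bound is below one.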
Lemma 2. Consider a dynamical system on
defined by the transition function
,
where
and
are defined in (1) and (2), respectively.
Assume that each element
of the matrix
satisfies (4), and
is the maximum absolute value of elements in
,
,
and
.
If there exist three solutions of
(18)
then
(1) there are
invariant sets of a dynamical system
;
(2) suppose that
is an invariant set of the dynamical system;
then, the restriction of
to
is a contraction with respect to
and the maximum norm
,
where
.
Proof. Suppose that (18) has three
solutions, such as
(see Figure 9). In general,
and
.
(1) Assume
and
.
Then, the expression
(19)
is satisfied for all
,
and
,
where
is the
th element of the vector
.
Hence,
if
.
Furthermore, if
,
then
because
(20)
Therefore, the region
is a stable set of the
th element of vector
satisfying the fact
that if
then
for any
.
Similarly, we can easily show that if
,
then
.
Thus, there are two stable regions of the
th element of vector
for each
.
Then, there are
invariant sets.
(2) Let
be the invariant set presented above. Assume
that
.
For any
,
the inequality
(21)
holds because if
,
then
.
On the other hand, for every
,
(22)
Then,
is obtained for any
.
Accordingly, the restriction of
to
is a contraction with respect to
and the maximum norm
.
Figure 9: Schematic diagram of (18).
For any
and
,
there is a real number
such that if
,
then (18) has three solutions. Thus, if
is large enough and matrices
and
represent small connection weights, then
contains
invariant sets, and each restriction of
to an invariant set is a contraction with
respect to a third input. Moreover, the integer
is the effective degree of freedom for each
contraction mapping restricted to an invariant set. If
is a large value, then the RNN can acquire a more
complex time sequence. In Figures 10 and 11, we plot the Kullback-Leibler
divergence of the trained RNN for parameters
and
,
in which the training data are the same as those for experiment 1. These
results imply that it is necessary that
,
and
be large values in order to learn multiple
attractor dynamics.
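The role of a large self-connection in producing three solutions can be explored numerically for a single unit. The sketch below counts the sign changes of g(x) = tanh(w x + c) - x on a grid. This scalar equation is a simplified stand-in chosen for illustration, not a transcription of (18), and the values of `w` and `c` and the grid resolution are hypothetical.

```python
import math

def count_fixed_points(w, c, n_grid=20001):
    """Count sign changes of tanh(w x + c) - x for x in [-1.5, 1.5]."""
    xs = [-1.5 + 3.0 * k / (n_grid - 1) for k in range(n_grid)]
    g = [math.tanh(w * x + c) - x for x in xs]
    return sum(1 for a, b in zip(g, g[1:]) if a * b < 0)

weak = count_fixed_points(w=0.5, c=0.1)    # small self-connection
strong = count_fixed_points(w=3.0, c=0.1)  # large self-connection
```

With a small self-connection the stand-in equation has a single solution, whereas a sufficiently large self-connection yields three: two outer solutions that are stable under iteration of x -> tanh(w x + c) and one unstable solution between them. This matches the qualitative picture of two stable regions per unit.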
Figure 10: Kullback-Leibler divergence between
the teaching sequences and output generated by the trained RNN with

(

),

,
and

.
Figure 11: Kullback-Leibler divergence between the
teaching sequences and output generated by the trained RNN with

,
and

(

). (a)

,
(b)

,
and (c)

.
4.2. Orthogonality
In the last paragraph of the previous section, we have
shown that RNNs have many stable regions, and the existence of the stable
regions plays an important role in the learning of multiple sequences. However,
the existence of multiple stable regions is not sufficient for successful
multiple attractor learning, because if the parameter changes driven by
one time series interfere with those driven by the others, the time series cannot
necessarily be embedded into separate regions. The same problem appears in
the RNNPB method.
In the training algorithm defined in Section 2, each
state of orthogonal units
is trained by (5) and (11). Thus, firing of
only occurs in the generation
of the
th teaching sequence. This implies that
orthogonal units allow the conflict of parameter changes caused by multiple
time series learning to be avoided because orbits corresponding to each
teaching I/O time series run around the orthogonal state space of the trained
RNN.
In order to show the effect of the orthogonal units on
the conflict among teaching sequences, we consider the
th learning ratio
defined by
(23)
where
is an element of the matrix
.
If
is nearly equal to
,
then the change in
is approximately independent of teaching
sequences other than the
th sequence. In Figure 12, we plot the value
determined by
(24)
where
is a set of indices corresponding to the
elements of the vector
.
The value
represents the average of the
th learning ratio for connections between
internal units and orthogonal units
.
In this numerical experiment, for each learning step,
is clearly larger than
,
where
is the number of teaching sequences. Then, the
sum of the
th learning ratios of connection weights
between internal units and orthogonal units
is dominant. Therefore, in changing matrix
,
there is no conflict generated by multiple teaching sequences. However, we
could not find a strong bias of the learning ratio for the matrices
and
and every element
of
with
.
Thus, we consider that connection weights between internal units and orthogonal
units encode information on an individual time series, and other connection
weights encode whole information.
Figure 12: Average

of the

th learning ratio for the connections between
internal units and orthogonal units

for 200 000 learning steps in experiment 1.
5. Discussion
In this report, we have investigated a method of
embedding multiple time series into a single RNN. In order to clarify the
characteristics of the proposed approach, we compare the proposed approach with
other approaches with respect to information representation of multiple
sequences in the models. The mixture-of-RNN-experts-type model forms a local
representation in a separate RNN for each sequence. The local representation provides
robustness against parameter changes during learning, but it lacks the
ability to extract common patterns included in the sequences because the
local representations are independent. In the proposed model, the local
representation is constructed in the orthogonal units, while the global
representation is constructed in the internal units using the connection
weights between I/O units and internal units. Since each sequence generated by
the proposed model shares the state space and connection weights, the model can
extract common patterns of the sequences as well as conventional neural
networks.
Another characteristic, which clarifies the difference
between our model and other models, is whether the classification of each time
series is self-organized into the state space. For example, in the
mixture-of-RNN-experts-type model, the allocation of time series to each RNN is
determined automatically. As another example, in the RNNPB model, PB values are
self-organized such that the PB can individualize each time series. On the
other hand, the proposed model needs the information of orthogonalization for
each time series. Since the sparse firing patterns which appear in orthogonal
units, corresponding to time series, are given as teaching information
externally, the classification of sequences is not self-organized. The
characteristic whereby the time series cannot be automatically classified is a
disadvantage of the proposed model. However, the time series can be classified
using other clustering techniques before applying the proposed method. Thus, by
combining the proposed method and other clustering techniques, an algorithm
that automatically classifies and generates multiple time series can be
constructed.
6. Conclusion
In this paper, we have presented an RNN model and a
learning algorithm that can acquire the ability to generate multiple sequences.
The RNN model has two distinct properties, bistability and
orthogonality. Bistability guarantees the existence of multiple attractor
structures in RNNs, and provides the RNNs with contraction transition mapping.
Orthogonality, which is given as a function of the orthogonal vectors of RNNs,
helps prevent conflicts with respect to parameter changes caused by multiple
training sequences. In the numerical experiments, RNNs which have bistability
and orthogonality can learn multiple periodic attractors constituted by five
Lissajous curves or 12 Van der Pol oscillators. Based on these results, the
proposed model can be applied to the modeling of various types of dynamical
systems that include multiple attractors.
References
- K. Funahashi and Y. Nakamura, “Approximation of dynamical systems by continuous time recurrent neural networks,” Neural Networks, vol. 6, no. 6, pp. 801–806, 1993.
- H. T. Siegelmann and E. D. Sontag, “Analog computation via neural networks,” Theoretical Computer Science, vol. 131, no. 2, pp. 331–360, 1994.
- H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” Journal of Computer and System Sciences, vol. 50, no. 1, pp. 132–150, 1995.
- Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
- K. Doya and S. Yoshizawa, “Memorizing oscillatory patterns in the analog neuron network,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '89), vol. 1, pp. 27–32, Washington, DC, USA, June 1989.
- F.-S. Tsung, Modeling dynamical systems with recurrent neural networks, Ph.D. thesis, Department of Computer Science, University of California, San Diego, Calif, USA, 1994.
- D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,” Neural Networks, vol. 11, no. 7-8, pp. 1317–1329, 1998.
- J. Tani and S. Nolfi, “Learning to perceive the world as articulated: an approach for hierarchical learning in sensory-motor systems,” Neural Networks, vol. 12, no. 7-8, pp. 1131–1141, 1999.
- J. Tani, “Learning to generate articulated behavior through the bottom-up and the top-down interaction processes,” Neural Networks, vol. 16, no. 1, pp. 11–23, 2003.
- J. Tani and M. Ito, “Self-organization of behavioral primitives as multiple attractor dynamics: a robot experiment,” IEEE Transactions on Systems, Man and Cybernetics Part A, vol. 33, no. 4, pp. 481–488, 2003.
- H. Jaeger, “Short term memory in echo state networks,” National Research Center for Information Technology, Bremen, Germany, 2001.
- H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” Science, vol. 304, no. 5667, pp. 78–80, 2004.
- W. Maass, T. Natschläger, and H. Markram, “A fresh look at real-time computation in generic recurrent neural circuits,” Institute for Theoretical Computer Science, TU Graz, Graz, Austria, 2002.
- B. Hammer and P. Tiňo, “Recurrent neural networks with small weights implement definite memory machines,” Neural Computation, vol. 15, no. 8, pp. 1897–1929, 2003.
- O. L. White, D. D. Lee, and H. Sompolinsky, “Short-term memory in orthogonal neural networks,” Physical Review Letters, vol. 92, no. 14, Article ID 148102, 4 pages, 2004.
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland, Eds., pp. 318–362, MIT Press, Cambridge, Mass, USA, 1986.
- J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.