Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
The present paper proposes a recurrent neural network model and learning algorithm that can acquire the ability to generate desired multiple sequences. The network model is a dynamical system in which the transition function is a contraction mapping, and the learning algorithm is based on the gradient descent method. We show a numerical simulation in which a recurrent neural network obtains a multiple periodic attractor consisting of five Lissajous curves, or a Van der Pol oscillator with twelve different parameters. The present analysis clarifies that the model contains many stable regions as attractors, and multiple time series can be embedded into these regions by using the present learning method.
1. Introduction
Recurrent neural networks (RNNs) have been successfully applied to the modeling of various types of dynamical systems. Since the universal approximation ability
of multilayer neural networks has been proved, RNNs can model arbitrary
dynamical systems and Turing machines [1–3]. However, training an RNN to realize a
desired model may be very difficult even if such an RNN exists [4]. For example, building RNNs
to implement required multiple attractor dynamics is a difficult problem for
standard training methods, such as gradient descent. Doya and Yoshizawa
[5] demonstrated that
RNNs can acquire two limit cycles by the gradient descent method when
initialized with small connection weights, whereas learning more than
three limit cycles is difficult [6]. This is because learning several
time series at once produces conflicting changes to the connection weights.
How to form RNN models that can learn several temporal sequence patterns has
proved to be a challenging problem.
There have been some approaches to this problem. In
order to avoid conflicts in the change of parameters, the
mixture-of-experts-type architecture has been
investigated [7, 8].
The mixture-of-experts model consists of RNNs as experts and a hierarchical
gating mechanism. At the end of successful learning, each expert implements
attractor dynamics as locally represented knowledge, and a gating mechanism
chooses only one expert at any time. The system can acquire many attractor
patterns, but it has the disadvantage that it cannot generalize across the
attractor patterns. As another approach to
implementing multiple patterns, the parametric bias (PB) method has been developed
to improve the learning capability of RNNs [9, 10]. In an RNN that employs the PB method (RNNPB), PB
values provide the information needed to distinguish each sequence.
It has been reported that the number of time series that RNNPBs can learn is
greater than that which RNNs without PB can learn. However, the PB method
cannot avoid the conflict among the weight changes required by the individual
attractors. Therefore, learning multiple time series by an RNNPB tends to fail
when the number of time series increases.
In the present study, we will focus on the training
method for RNNs to learn multiple attractor dynamics. Furthermore, we will show
that the present research is related to research into RNNs with contraction
transition functions. In recent years, RNNs with contraction transition mapping
have been investigated with respect to the performance of time series learning
[11–13], generalization ability [14], and memory capacity [15]. Jaeger [11, 12] demonstrated that an “echo
state network,” which is an RNN with contraction mapping, successfully learns
the Mackey-Glass chaotic time series, a well-known benchmark system for time
series prediction. In order to formally express the generalization ability,
Hammer and Tiňo proved that RNNs with contraction are
distribution-independent learnable in the probably approximately correct (PAC)
sense [14]. From the
above results, RNNs with contraction might be regarded as powerful tools for
modeling dynamical systems. However, RNNs with contraction have difficulty in
representing multiple attractor dynamics because dynamic states governed by the
contraction transition function are globally attracted to one point. In this
paper, the representation capability of RNNs with contraction mapping will be
improved such that the RNNs can obtain multiple attractor dynamics.
We start by defining the concepts of the RNN and the
training method for multiple attractor dynamics. The RNN has an Elman-net-type
architecture, and the training method is based on the
backpropagation through time (BPTT) algorithm [16]. We then show in numerical
simulation that the RNNs can acquire multiple periodic attractors constituted
by five Lissajous curves, or a Van der Pol oscillator with twelve different
parameters. Moreover, we consider why the RNNs successfully learn multiple
attractors and how the performance of learnability depends on parameters of the
RNNs. Finally, we link the results obtained herein to other learning strategies,
and consider other advanced research topics.
2. Model
2.1. Recurrent Neural Network
We first consider a neural network model with
recurrent connection, such as the Elman net [17] (see Figure 1). The RNN contains I/O units, orthogonal
units, and internal units. We denote the dynamic states of I/O units,
orthogonal units, and internal units at time step by ,
and ,
respectively. The RNN is defined by functions and with a parameter ,
where and are of the forms
where ,
and are matrices, and are vectors, is a time constant that satisfies ,
and denotes a componentwise application such as .
Figure 1: Architecture of the recurrent neural
network. Solid arrows, dotted arrows, and boxes represent fixed connections,
adjustable connections, and network states, respectively.
Dynamic states of the RNN at time step are updated according to
From these equations, the RNN
can be represented by an -dimensional dynamical system.
We now define bistability for the RNN.
Definition 1. Assume is as above. The function is bistable with respect to the third
variable if a real value and an integer exist such that
for every element of the matrix .
The bistability of a function is a key concept of our learning method. We
will show in Section 4.1 that the bistable function plays an important role in the learning of
multiple attractor dynamics.
2.2. Learning Method
We present a formulation of the training procedure for
the RNN with a multiple teacher I/O time series. For every and ,
we assume that is a sequence of teacher I/O of length .
Initialization of Parameters
We initialize every element of matrices , ,
and and vectors and randomly from the uniform distribution in the
interval .
A matrix is randomly assigned such that is bistable. For all is randomly initialized in the interval .
Assume that is an -tuple of vectors for and ,
and that the dimension of is equivalent to that of if .
We initialize such that
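As an illustration of the orthogonal-unit targets described here, the following minimal sketch builds one block-structured code per teacher sequence. It assumes that the block assigned to a sequence is set to an "on" value and all other blocks to an "off" value. The exact initialization formula is not available in this text, so the function name `orthogonal_targets` and the on/off values are hypothetical.

```python
def orthogonal_targets(n_sequences, block_dim, on_value=1.0, off_value=0.0):
    """Return one target vector per sequence.

    Assumption: the orthogonal units are split into n_sequences blocks of
    equal size, and only block i is active for sequence i.
    """
    targets = []
    for i in range(n_sequences):
        vec = [off_value] * (n_sequences * block_dim)
        for k in range(block_dim):
            vec[i * block_dim + k] = on_value
        targets.append(vec)
    return targets


def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


# Five sequences with two orthogonal units each (illustrative sizes).
codes = orthogonal_targets(n_sequences=5, block_dim=2)
```

With `off_value=0.0`, codes for different sequences have zero inner product, which is the sense in which the units are orthogonal in this sketch.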
Run Network with Teacher I/O and Compute Error Function
For every ,
the sequence of I/O units of the RNN at learning step is defined by
The error function of the RNN at learning step with the th teacher I/O time series is defined
by
where denotes the mean square error function .
Finally, the error function at learning step is defined by
Update Parameters
Let be a parameter of the RNN at learning step .
We determine the parameter by
where and are the constants of the learning rate and
momentum, respectively. In contrast, the connection matrix is kept fixed as so that the bistability condition continues to hold. We
compute the initial state of the internal units at learning step such that
where ,
and is the constant of the learning rate of the
initial state. Assume that is a vector as a component of the orthogonal
units ,
such as .
The vector is defined by
where ,
and is the constant of the learning rate of the
orthogonal units.
Note that the maximum value of the error function depends on the number of units and the length
of the teacher I/O time series. Thus, we should scale the learning rates ,
and with the number of units and length of
sequences. In the present paper, we consider parameters ,
and such that ,
and .
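The parameter update with a learning rate and a momentum term can be sketched as follows. This is the standard gradient descent with momentum applied to a toy quadratic error, not the update rule of the present model, whose exact form is not reproduced in this text. The names `momentum_step` and `toy_grad` and the constants 0.1 and 0.9 are illustrative assumptions.

```python
def momentum_step(theta, velocity, grad, lr=0.1, momentum=0.9):
    """One gradient-descent-with-momentum update on a list of parameters.

    velocity <- momentum * velocity - lr * grad
    theta    <- theta + velocity
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_theta = [p + v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


def toy_grad(theta):
    """Gradient of the toy error E(theta) = sum_k (theta_k - 1)^2."""
    return [2.0 * (p - 1.0) for p in theta]


theta = [0.0, 3.0]
velocity = [0.0, 0.0]
for _ in range(500):
    theta, velocity = momentum_step(theta, velocity, toy_grad(theta))
```

In the training procedure of this section, the gradient would come from BPTT, and parameters that must keep the bistability condition would simply be excluded from the update.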
3. Numerical Experiments
In this section, we conduct two types of experiments
as examples of using the training method for RNNs proposed in Section 2. The
first experiment shows the learning of five Lissajous curves. The second
experiment shows the training of multiple attractors of a Van der Pol
oscillator with 12 different parameters.
3.1. Experiment 1: Lissajous Curves
3.1.1. Teacher I/O Time Series
Our first task is to learn the five Lissajous curves
defined by
and we consider constants and for all (see Figure 2).
Figure 2: Trajectories of the teacher I/O time
series in experiment 1.
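For readers who wish to generate comparable training data, a minimal sketch of sampling Lissajous curves is given below. The frequency pairs, phase, center, radius, and period are hypothetical choices for illustration. The constants used in experiment 1 are not reproduced in this text.

```python
import math


def lissajous(n_steps, freq_x, freq_y, phase=0.0, center=(0.5, 0.5), radius=0.4):
    """Sample a two-dimensional Lissajous curve as a list of (x, y) pairs."""
    cx, cy = center
    points = []
    for t in range(n_steps):
        x = cx + radius * math.sin(freq_x * t + phase)
        y = cy + radius * math.sin(freq_y * t)
        points.append((x, y))
    return points


# Five hypothetical curves; the frequency ratios are illustrative only.
period = 32
base = 2.0 * math.pi / period
curves = [
    lissajous(period * 4, base * a, base * b, phase=math.pi / 2)
    for a, b in [(1, 1), (1, 2), (2, 1), (1, 3), (3, 1)]
]
```

The center and radius keep every sample inside the unit interval, which is convenient when the output units are bounded.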
3.1.2. Learning and Testing
We now describe the specific conditions applied to RNN
training. The time constant is set to .
The number of orthogonal units is ,
and the dimension of a vector is for all .
Suppose that is bistable with ,
and .
The learning rates and momentum are given by ,
and ,
respectively.
Figure 3 shows the error function for 20 000 learning steps. We also show the
Kullback-Leibler divergence between the teacher I/O time series and a sequence
of I/O units in the RNN computed by (3), which does not use external
perturbation by the teaching sequences. We use the Kullback-Leibler divergence
as a measure of the discrepancy between two sequences. Formally, the
Kullback-Leibler divergence between two probability distributions and is defined as
By definition, in order to
compute the Kullback-Leibler divergence, it is necessary to obtain probability
distributions of the teacher I/O time series and a sequence of I/O units.
However, obtaining the probability distribution of a sequence of I/O units is
very difficult. Therefore, we quantize a time series of real-valued vectors
into a symbolic sequence such that if the real value is less than ,
then the symbol is assigned, and otherwise the symbol is assigned. In addition, we use the
probability distribution with which sub-blocks with a block length of appear in the symbolic sequence given by the
above quantization.
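The quantization and block-frequency procedure can be sketched as follows for a scalar series. The threshold 0.5, the block length 4, the symbols 0 and 1, and the floor `eps` used when a block never appears in the second sequence are assumptions made for this sketch. The values used in the experiments are not reproduced in this text, and the handling of unseen blocks is not specified.

```python
import math
from collections import Counter


def quantize(series, threshold=0.5):
    """Map each real value to the symbol 0 or 1 by thresholding."""
    return [0 if value < threshold else 1 for value in series]


def block_distribution(symbols, block_len):
    """Relative frequency of every sub-block of length block_len."""
    blocks = [tuple(symbols[i:i + block_len])
              for i in range(len(symbols) - block_len + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return {block: n / total for block, n in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x)), with Q floored at eps."""
    return sum(px * math.log(px / max(q.get(x, 0.0), eps))
               for x, px in p.items())


# Toy data: a teacher sine wave, an identical output, and a slower output.
teacher = [0.5 + 0.4 * math.sin(2 * math.pi * t / 20) for t in range(200)]
output_same = list(teacher)
output_slow = [0.5 + 0.4 * math.sin(2 * math.pi * t / 35) for t in range(200)]

p = block_distribution(quantize(teacher), block_len=4)
kl_same = kl_divergence(p, block_distribution(quantize(output_same), 4))
kl_slow = kl_divergence(p, block_distribution(quantize(output_slow), 4))
```

An identical output gives a divergence of zero, while an output with a different period gives a positive value. For vector-valued I/O, each component would be quantized and the symbols combined before the blocks are counted.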
Figure 3: Error and Kullback-Leibler
divergence between the teaching sequences and output generated by the RNN for 20 000 learning steps in experiment 1.
Figure 4 shows attractors of the trained RNN
computed by (3), where the initial state of the internal units is for each .
By comparing the attractors with the teacher I/O time series displayed in
Figure 2, we can see that the RNN can generate sequences similar to training
data.
Figure 4: Time series generated by the trained RNN in experiment 1.
For each time series, only the initial state is different.
In Figure 5, examples of attractors for the RNN with
random initial states are displayed. This shows that, in addition to the
attractors corresponding to teacher I/O time series, there exist many
attractors of the RNN.
Figure 5: Time series generated by the trained RNN with random initial state in experiment 1.
3.2. Experiment 2: Van der Pol Attractors
3.2.1. Teacher I/O Time Series
Our second task is to learn multiple attractors given
by the Van der Pol oscillator with different parameters. The Van der Pol
oscillator defined by
is a model of an electronic
circuit that appeared in very early radios. It is well known that there exists
a limit cycle for the Van der Pol oscillator. In this experiment, we consider
twelve teacher I/O time series, where the th teacher I/O time series is given by
for and ,
where and are constant parameters representing the
center position of the limit cycle, and is a time constant of the oscillator. We assume
that the parameters ,
and are given by combining the values of , ,
and .
Figure 6 shows the teacher I/O time series given by (16). The length of
training data is for .
Figure 6: Teaching sequences of experiment 2. (a) Trajectories on .
(b) Temporal trajectories of teaching sequences.
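A teacher series of this kind could be generated along the following lines. The sketch integrates the standard Van der Pol equations x' = y, y' = mu (1 - x^2) y - x with a fourth-order Runge-Kutta step, then shifts and scales the orbit. The damping `mu`, step size `dt`, `center`, `scale`, and burn-in length are hypothetical. The parameter combinations of experiment 2 are not reproduced in this text.

```python
def van_der_pol_series(n_steps, mu=1.0, dt=0.05, center=(0.0, 0.0), scale=1.0,
                       state=(0.1, 0.0), burn_in=2000):
    """Sample a shifted and scaled Van der Pol limit cycle as (x, y) pairs."""

    def deriv(x, y):
        # Standard Van der Pol vector field.
        return y, mu * (1.0 - x * x) * y - x

    def rk4(x, y):
        # One classical fourth-order Runge-Kutta step of size dt.
        k1x, k1y = deriv(x, y)
        k2x, k2y = deriv(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
        k3x, k3y = deriv(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
        k4x, k4y = deriv(x + dt * k3x, y + dt * k3y)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        y += dt * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0
        return x, y

    x, y = state
    for _ in range(burn_in):  # discard the transient before the limit cycle
        x, y = rk4(x, y)
    series = []
    for _ in range(n_steps):
        x, y = rk4(x, y)
        series.append((center[0] + scale * x, center[1] + scale * y))
    return series


series = van_der_pol_series(400, center=(0.5, 0.5), scale=0.1)
```

Changing `center` moves the limit cycle in the I/O plane and changing `dt` changes how fast it is traversed per step, which is one way to obtain several distinct attractors from the same oscillator.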
3.2.2. Learning and Testing
The parameters for learning are set as follows. Let be bistable with and .
The dimension of the vector is for every so that .
Other parameters are the same as in experiment 1.
The error function and the Kullback-Leibler divergence
for 200 000 learning steps are displayed in Figure 7.
Figure 8 shows attractors of the trained RNN, for which the initial state of the
internal units is set to for every .
Figure 7: Error and Kullback-Leibler divergence between
the teaching sequences and output generated by the RNN for 200 000 learning steps in experiment 2.
Figure 8: Time series generated by the trained RNN in experiment 2.
For each time series, the initial state is the same as in the training phase. (a)
Trajectories on .
(b) Temporal trajectories.
This result indicates that the RNN acquires multiple periodic
attractors constituted by the teacher I/O time series.
4. Numerical Analysis
4.1. Contraction and Bistability
Assume that and are sets and that is equipped with a metric structure. A
function is a contraction with respect to if a real value exists such that the
inequality
holds for all and .
Lemma 2. Consider a dynamical system on defined by the transition function ,
where and are defined in (1) and (2), respectively.
Assume that each element of the matrix satisfies (4), and is the maximum absolute value of elements in , ,
and .
If there exist three solutions of then
(1) there are invariant sets of a dynamical system ;
(2) suppose that is an invariant set of the dynamical system;
then, the restriction of to is a contraction with respect to and the maximum norm ,
where .
Proof. We suppose that (18) has three
solutions, such as (see Figure 9). In general, and .
(1) Assume and .
Then, the expression
is satisfied for all ,
and ,
where is the th element of the vector .
Hence, if .
Furthermore, if ,
then because
Therefore, the region is a stable set of the th element of vector satisfying the fact
that if then for any .
Similarly, we can easily show that if ,
then .
Thus, there are two stable regions of the th element of vector for each .
Then, there are invariant sets.
(2) Let be the invariant set presented above. Assume
that .
For any ,
the inequality
holds because if ,
then .
On the other hand, for every ,
Then, is obtained for any .
Accordingly, the restriction of to is a contraction with respect to and the maximum norm .
Figure 9: Schematic diagram of (18).
For any and ,
there is a real number such that if ,
then (18) has three solutions. Thus, if is large enough and matrices and represent small connection weights, then contains invariant sets, and each restriction of to an invariant set is a contraction with
respect to a third input. Moreover, the integer is the effective degree of freedom for each
contraction mapping restricted to an invariant set. If is a large value, then the RNN can acquire a more
complex time sequence. In Figures 10 and 11, we plot the Kullback-Leibler
divergence of the trained RNN for parameters and ,
in which the training data are the same as those for experiment 1. These
results imply that it is necessary that ,
and be large values in order to learn multiple
attractor dynamics.
Figure 10: Kullback-Leibler divergence between
the teaching sequences and output generated by the trained RNN with (), ,
and .
Figure 11: Kullback-Leibler divergence between the
teaching sequences and output generated by the trained RNN with ,
and (). (a) ,
(b) ,
and (c) .
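The role of the three-solution condition can be illustrated with a scalar caricature. The sketch below assumes a single unit with a tanh activation and self-connection weight `w`, so that its fixed points solve u = tanh(w u + b). This scalar map, the choice of tanh, and the helper names are assumptions for illustration and do not reproduce (18) or the transition function of the model.

```python
import math


def fixed_points(w, b=0.0, n_grid=20000):
    """Approximate the solutions of u = tanh(w * u + b) in [-1, 1].

    Roots are located by sign changes of g(u) = tanh(w * u + b) - u on a
    grid. An even grid size keeps u = 0 off the grid.
    """
    def g(u):
        return math.tanh(w * u + b) - u

    us = [-1.0 + 2.0 * k / (n_grid - 1) for k in range(n_grid)]
    roots = []
    for lo, hi in zip(us, us[1:]):
        if g(lo) * g(hi) < 0.0:
            roots.append(0.5 * (lo + hi))
    return roots


def slope(w, u, b=0.0):
    """Derivative of the map u -> tanh(w * u + b) at the point u."""
    return w * (1.0 - math.tanh(w * u + b) ** 2)


roots_strong = fixed_points(w=2.0)  # strong self-connection
roots_weak = fixed_points(w=0.5)    # weak self-connection
```

For the strong self-connection there are three fixed points. The map has slope below one at the two outer ones, so it contracts locally there, and slope above one at the middle one, which separates the two stable regions. For the weak self-connection only a single fixed point remains.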
4.2. Orthogonality
In the last paragraph of the previous section, we have
shown that RNNs have many stable regions, and the existence of the stable
regions plays an important role in the learning of multiple sequences. However,
the existence of multiple stable regions is not sufficient for successful
multiple attractor learning: if the parameter changes required by
one time series interfere with those required by the others, the time series cannot
necessarily be embedded into separate regions. The same problem appears in
the RNNPB method.
In the training algorithm defined in Section 2, each
state of orthogonal units is trained by (5) and (11). Thus, firing of only occurs in the generation
of the th teaching sequence. This implies that
orthogonal units allow the conflict of parameter changes caused by multiple
time series learning to be avoided because orbits corresponding to each
teaching I/O time series run around the orthogonal state space of the trained
RNN.
In order to show the effect of the orthogonal units on
the conflict among teaching sequences, we consider the th learning ratio defined by
where is an element of the matrix .
If is nearly equal to ,
then the change in is approximately independent of teaching
sequences other than the th sequence. In Figure 12, we plot the value determined by
where is a set of indices corresponding to the
elements of the vector .
The value represents the average of the th learning ratio for connections between
internal units and orthogonal units .
In this numerical experiment, for each learning step, is clearly larger than ,
where is the number of teaching sequences. Then, the
sum of the th learning ratios of connection weights
between internal units and orthogonal units is dominant. Therefore, in changing matrix ,
there is no conflict generated by multiple teaching sequences. However, we
could not find a strong bias of the learning ratio for the matrices and and every element of with .
Thus, we consider that connection weights between internal units and orthogonal
units encode information on individual time series, and the other connection
weights encode information common to all the time series.
Figure 12: Average of the th learning ratio for the connections between
internal units and orthogonal units for 200 000 learning steps in experiment 1.
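One plausible reading of the learning ratio is the share of a weight's total gradient magnitude that comes from a single teaching sequence. The sketch below implements that reading. The formula |g_i| / sum_j |g_j| and the name `learning_ratios` are assumptions, since the exact definition is not reproduced in this text. The reading is chosen because such a ratio equals 1 when only one sequence drives the weight and equals one over the number of sequences when all sequences contribute equally.

```python
def learning_ratios(grads_per_sequence):
    """Share of the total gradient magnitude contributed by each sequence.

    grads_per_sequence[i] is the error gradient of one connection weight
    with respect to the i-th teaching sequence (assumed definition).
    """
    total = sum(abs(g) for g in grads_per_sequence)
    if total == 0.0:
        return [0.0 for _ in grads_per_sequence]
    return [abs(g) / total for g in grads_per_sequence]


# A weight pulled equally, in opposite directions, by five sequences.
conflicting = learning_ratios([0.2, -0.2, 0.2, -0.2, 0.2])
# A weight driven by the third sequence only.
dedicated = learning_ratios([0.0, 0.0, 0.9, 0.0, 0.0])
```

Averaging the ratio of the matching sequence over the weights attached to its block of orthogonal units would give a curve comparable in spirit to the one plotted in Figure 12.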
5. Discussion
In this report, we have investigated a method of
embedding multiple time series into a single RNN. In order to clarify the
characteristics of the proposed approach, we compare the proposed approach with
other approaches with respect to information representation of multiple
sequences in the models. The mixture-of-RNN-experts-type model forms a local
representation in a separate RNN for each sequence. The local representation provides
robustness against parameter changes during learning, but it lacks the
ability to extract common patterns shared by the sequences because the
local representations are independent. In the proposed model, the local
representation is constructed in the orthogonal units, while the global
representation is constructed in the internal units using the connection
weights between I/O units and internal units. Since the sequences generated by
the proposed model share the state space and connection weights, the model can
extract common patterns of the sequences, as conventional neural
networks do.
Another characteristic, which clarifies the difference
between our model and other models, is whether the classification of each time
series is self-organized into the state space. For example, in the
mixture-of-RNN-experts-type model, the allocation of time series to each RNN is
determined automatically. As another example, in the RNNPB model, PB values are
self-organized such that the PB can individualize each time series. On the
other hand, the proposed model needs the information of orthogonalization for
each time series. Since the sparse firing patterns which appear in orthogonal
units, corresponding to time series, are given as teaching information
externally, the classification of sequences is not self-organized. The
characteristic whereby the time series cannot be automatically classified is a
disadvantage of the proposed model. However, the time series can be classified
using other clustering techniques before applying the proposed method. Thus, by
combining the proposed method and other clustering techniques, an algorithm
that automatically classifies and generates multiple time series can be
constructed.
6. Conclusion
In this paper, we have presented an RNN model and a
learning algorithm that can acquire the ability to generate multiple sequences.
The RNN model is characterized by two distinct properties called bistability and
orthogonality. Bistability guarantees the existence of multiple attractor
structures in RNNs, and provides the RNNs with contraction transition mapping.
Orthogonality, which is given as a function of the orthogonal vectors of RNNs,
helps prevent conflicts with respect to parameter changes caused by multiple
training sequences. In the numerical experiments, RNNs with bistability
and orthogonality learned multiple periodic attractors constituted by five
Lissajous curves or 12 Van der Pol oscillators. Based on these results, the
proposed model can be applied to the modeling of various types of dynamical
systems that include multiple attractors.