#### Abstract

Adaptive natural gradient learning avoids singularities in the parameter space of multilayer perceptrons. However, it requires a larger number of additional parameters than ordinary backpropagation in the form of the Fisher information matrix. This paper describes a new approach to natural gradient learning that uses a smaller Fisher information matrix. It also uses a prior distribution on the neural network parameters and an annealed learning rate. While this new approach is computationally simpler, its performance is comparable to that of adaptive natural gradient learning.

#### 1. Introduction

Amari et al. developed the adaptive natural gradient learning (ANGL) algorithm for multilayer perceptrons [1–3]. It is similar to Newton's method in which the gradient of the error function is multiplied by the inverse of the Hessian matrix of the error function. However, Amari’s algorithm replaces the inverse of the Hessian matrix with the inverse of the Fisher information matrix. The simplified natural gradient learning (SNGL) algorithm introduced in this paper uses a new formulation of the Fisher information matrix. SNGL is based on the backpropagation algorithm [4]. In addition, the SNGL algorithm also uses regularization [5] to penalize solutions with large connection weights. This regularization also prevents the Fisher information matrix from approaching a singular matrix. The learning rate is also annealed [6] to improve the general performance of SNGL.

The remainder of this section will review the natural gradient and discuss the details of Amari's ANGL algorithm in [2, 7, 8]. The next section will introduce the SNGL algorithm and the theory behind it. The third section will describe two improvements used in the SNGL algorithm. These improvements are regularization and annealed learning rates. The following section will show the experimental results and compare the performance of these two algorithms with that of ordinary backpropagation. The final section will conclude the ideas discussed in this paper.

##### 1.1. Mathematical Preliminaries and Notation

The natural gradient learning algorithm is “natural” in the sense that it follows the manifold of the underlying parameter space of multilayer perceptrons. This parameter space has Riemannian geometry. Calculating the direction of steepest descent in a Riemannian manifold is more complicated than it is in a Euclidean space.

The parameter space of multilayer perceptrons is an affine space in which probability distributions are the points and random variables are vectors [9, 10]. Given a probability density function of a random variable , the translation operation is .

###### 1.1.1. Exponential Families

An exponential family is a set of distributions with probability density functions of the form Thus, the distributions in the exponential family are parameterized by , and these parameters are the canonical coordinates of the distribution. The number is chosen such that . Given an open set in the event space of the random variable , the probability that the result of the random variable lies within is

The sufficient statistics for the exponential family are the random variables in (1). These statistics can be used to estimate the coordinates from the values of samples of .

The log-likelihood function of a probability distribution is the logarithm of its probability density function . The score functions are the derivatives of the log-likelihood function with respect to the coordinates . The score for parameter is

The Fisher information of two score functions is the expected value of their product. For the score functions and , the Fisher information is where is the expectation operator for the distribution with probability density function . The Fisher information matrix is a real symmetric matrix whose elements are the Fisher information for each pair of score functions.

###### 1.1.2. Direction of Steepest Descent

When finding the minimum of a function in a Euclidean space, a descent direction [11] is a unit norm tangent vector such that The minimum value of (5) is where is an arbitrary scalar and which occurs when . Thus, in a Euclidean space, the direction of steepest descent is the unit tangent vector .

Finding the direction of steepest descent turns out to be harder than it first appears to be. In a Euclidean geometry, it is relatively straightforward. In a Riemannian geometry, the coordinate system is curved, and a gradient descent algorithm should follow the curvature of the manifold. The direction of steepest descent should be compensated for this curvature.

In a Riemannian space, a metric exists at every point . If the tangent vectors form a basis for the tangent space at , then is the Riemannian metric where is an inner product defined on the tangent space .

Given a tangent vector and the gradient , their inner product will be If is the steepest descent direction, then, according to the analysis above, . This occurs when where is the matrix whose elements are equal to the inner products of the basis vectors. In other words, the coordinate system matters when doing gradient descent, and it may change at each point on the manifold. The natural gradient learning algorithm uses this “natural” descent direction at each step.

###### 1.1.3. A Probability Model for Multilayer Perceptrons

In order for this analysis to be applied to neural networks, there must be a probability density function associated with a neural network. This section describes an exponential family in which each member of the family has a probability density function of network output errors. This exponential family will then be used to determine the direction of steepest descent and the natural gradient to be used in a learning algorithm.

Let the vectors and , respectively, be the inputs and outputs of a training set. A convenient model for training a multilayer perceptron is where represents the output of the multilayer perceptron for training input , and each represents the error on each training pair. A probability distribution can then be defined with the density function where is the dimension of the vector space to which belong. In this model we assume that each error vector is independent and identically distributed so that the joint density of all the training patterns is

With the training set and the network architecture (e.g., number of hidden nodes, number of hidden layers, transfer functions) specified, the probability distributions in (8) can be points in a manifold [9]. The network parameters are the coordinates of the points. This is called a neuromanifold [10]. Random variables are the tangent vectors at the point . If is a random variable, then the point is (by definition) the distribution with the density function . The score functions in (3) of the network parameters form a basis for the tangent space at a point [12]. If the inner product of this tangent space is defined to be , where is the expectation operator with density function , then the Riemannian metric is the Fisher information matrix because the Fisher information between two score functions is [9, 10, 12]. These definitions are illustrated in the following example.

###### 1.1.4. Natural Gradient with a Single Neuron

This is an example of a neuromanifold. Consider a neural network that consists of a single neuron that is described by a weight vector of dimension and a bias . This example will show how the Fisher information matrix is determined for a neural network. It also shows some of the pitfalls that are encountered when computing its inverse.

The log-likelihood of a distribution in a neuromanifold is If the network is a single neuron where is a -dimensional input vector, and is a sigmoid function such as or the logistic function, then the score functions are The gradient of is The Fisher information matrix is . In this case, the straightforward calculation yields assuming that .

One property of sigmoid functions is that their derivatives are small outside of the region near the origin. Therefore, as the norm of or the bias increases, the derivative of the output tends toward zero. Since the derivative is squared in (14), the Fisher information matrix could become nearly singular.

The matrix will also be singular unless there are linearly independent input vectors. For example, if the input vectors of the training set were three-dimensional vectors, then there must be three linearly independent vectors in order to successfully invert . Care must be taken when computing the inverse of . Calculating becomes much more complicated with the complex network architectures of multilayer perceptrons.

##### 1.2. Adaptive Natural Gradient

Multilayer perceptrons can have many parameters. If the dimension of the tangent space is , then will be an matrix. Inverting the matrix has computational complexity . In the calculation of the natural gradient, the inverse of the Fisher information matrix needs to be known at each step of the algorithm. However, the Fisher information matrix itself does not need to be calculated directly for each step.

Amari et al. developed an algorithm called adaptive natural gradient learning (ANGL) [2]. It uses the matrix inversion lemma [12] to perform a rank 1 update on the inverse of the Fisher information matrix at each step. The parameters are arranged and indexed as a column vector. The score function then has the same form and is used in the updated formula The variable is an integer representing the current iteration number of the learning algorithm. At each value of , new values of the network parameters are used to compute the training error vectors. These errors are used to calculate the changes to the network parameters. The inverse of the Fisher information matrix is first initialized as .

As mentioned in the previous paragraph, the ANGL algorithm stacks all the network parameters into one tall column vector. This is typically done when using a nonlinear least squares algorithm like Levenberg-Marquardt [11]. For networks with multiple layers, this leads to tall vectors and a large Fisher information matrix. Amari avoids inverting the matrix by using the matrix inversion lemma. However, the matrix is still usually large enough to be an issue with memory.

The ANGL algorithm performs well, but it needs sufficient memory to store a large matrix and is sensitive to initial conditions and learning rates.

#### 2. A Simplified Natural Gradient Learning Algorithm

This section describes the learning rule used in the SNGL algorithm. First, the directional derivative for a single-layer perceptron is determined. Then the Fisher information matrix for this network is described in detail. These two results are then generalized to find the directional derivative and the Fisher information matrix for a multilayer perceptron. Finally, the section will conclude with a discussion of the computational complexity of SNGL compared to that of ANGL.

The derivation and analysis of the simplified natural gradient learning algorithm begin with the observation that the neural network is parameterized by in the affine transformation . There is one affine transformation for each layer of the network. The formula in (11) was for a single neuron that has only one output. In general, the affine transformations in each layer of a multilayer perceptron can have more than one output. The objective of the algorithm is to minimize the sum of squares of the error vector norms. These norms are calculated with the standard norm of a Cartesian coordinate system. However, that does not mean that the matrix parameters need to be column-ordered into tall vectors. While the homomorphism as done in [2] preserves the vector space operations such as addition and multiplication by a scalar, it does not preserve multiplying the matrix by a vector. Therefore, it makes sense that the score function of a matrix parameter should itself be a matrix. The Frobenius norm of a Matrix is . The inner product associated with this norm is . The following analysis shows how the derivative of the norm of an error vector can be equivalent to an inner product between matrices.

##### 2.1. The Directional Derivative

Assume that the network is a single-layer perceptron with the vector-valued sigmoidal transfer function . Thus, the log-likelihood function with training examples that depends on the parameters and is Furthermore, assume that if the affine transformation is added to as , then the minimum point, according to (16), is reached. Thus, and represent a descent direction.

Let be a curve through the manifold for . The affine transformation is the initial point, and is the final point. The derivative of this curve with respect to is and is constant with respect to . The log-likelihood along this curve is The directional derivative of the log-likelihood with respect to is where is the Jacobian matrix of . Here, we have isolated by using the adjoint of in the second argument of the inner product. Given that and , the directional derivative of the log-likelihood function at is

Let . Then, by using the linearity property of the inner product, (19) can be rewritten as The matrix can be isolated because the matrix inner product is being used on the right-hand side. Thus, .

Since , then, by the chain rule, . Thus, and the components of the score function are The components in (21) would point in the direction of steepest descent if the space of affine transformations was a Euclidean space. To find the direction of steepest descent, then (21) must be corrected for the curvature of the parameter space using the Riemannian metric .

##### 2.2. The Fisher Information Matrix

The Fisher information matrix is the Riemannian metric in this manifold [1]. For the affine transformation , the Fisher information matrix is The score function with the components in (21) can be arranged in an augmented matrix form as

Thus, in terms of these components, the Fisher information matrix is The partial derivative in (21) is a sum of rank 1 matrices. Therefore, the outer product of with itself is the sum where . When then because . When then However, since is not known, it must be estimated. The unbiased estimator is The Fisher information of the weight matrix is estimated as and the Fisher information of the bias vector is estimated as The Fisher information matrix can be estimated as the sum of these two matrices. Combining like terms gives The direction of steepest descent in (19) is now represented as

These equations are for a perceptron with only one layer. Please note that by using the matrix forms, the Fisher information matrix is only a matrix. If we had stacked the weight matrix and bias vector into a single column vector, then the Fisher information matrix would be matrix.

##### 2.3. Extension to Multilayer Perceptrons

A multilayer perceptron has layers each with its own nonlinear sigmoidal function and affine transformation . Each layer feeds into the one above it with index . Thus, the output of layer is . The input layer is defined to be the identity transformation and . For convenience, let the vector be the output of layer when given input . The output can be written recursively, with .

Two vector spaces and can be joined together by means of a direct sum in which stacking a vector on top of a vector results in the vector The analog to the direct sum with matrices is to create a block matrix. Given two matrices and , their direct sum is The direct sum of the score functions of the layers in a multilayer perceptron is used as the analog for stacking the layers. The Fisher information matrix is then a block-diagonal matrix where each square matrix along the main diagonal is the Fisher information matrix for a specific layer.

The score functions for layer are where is the backpropagation matrix that is determined with the recurrence relation where is the number of layers in the network. Thus, for the output layer, the layer index is and the backpropagation matrix is the Jacobian matrix of the output layer's transfer function.

Let be the error for pattern . The estimate of the Fisher information matrix for layer is Thus, the updates for the parameters in layer are

##### 2.4. Computational Complexity

The new formulation of the Fisher information matrix is significantly smaller than what was used by Amari et al. [2]. Suppose that a two-layer perceptron has inputs, outputs, and hidden nodes. The weight matrices have dimensions and . The addition of 1 to both and is due to the extra bias vectors of dimensions and . If the parameters are column ordered into a single vector, then that vector has dimension . In the ANGL algorithm's Fisher information matrix, there are elements. The SNGL algorithm's block-diagonal Fisher information matrix has two blocks on the main diagonal that together have elements. Clearly, if , , and are positive integers.

It is also less computationally intensive to invert the block matrix used in the SNGL algorithm since the matrix inversion algorithms would have complexity . The complexity to update Amari's recursion in (15) is . This takes into account that ANGL exploits the matrix inversion lemma. Table 1 shows the computational complexity in both space and time of the ANGL algorithm and the SNGL algorithm for three problems described by Park et al. [3]. The last two columns labelled “Ratio” in Table 1 are the ratios of the space and time complexities of the SNGL algorithm compared to that of the ANGL written as percentages.

##### 2.5. SNGL Improvements

There are two more elements of the simplified natural gradient learning algorithm. The first is the regularization of the gradient descent algorithm by adding a prior distribution to the probability density function of the network errors [5]. The second is annealing the learning rate of the algorithm [6]. Neither has any significant impact on the computational complexity. The complexity of the regularization is for a two-layer network. The learning rate is recalculated once per learning epoch.

###### 2.5.1. Regularization of Network Parameters

The goal of regularization is to penalize larger connection weights. This helps to discourage overtraining. The penalty of the weights is governed by the hyperparameter . For a neuron with parameters and , the probability density function of these parameters in the prior distribution is where is the dimension of the vector space to which belongs. Thus, the weight vector is distributed normally as and the bias as . The log-likelihood function is The score functions of and with respect to are, respectively The probability density function of this prior distribution can be added to the probability density function of the network error distribution. Then the updates for the parameters in a multilayer perceptron are

When the algorithm is overtraining, the parameters are still increasing. Subtracting times the current parameter from the update effectively counteracts the update when the norm of that parameter is large enough. The algorithm will eventually reach a steady state where the parameters are longer increasing.

Since , , , and , the Fisher information matrix for the regularized algorithm is

Adding to the Fisher information matrix helps the algorithm avoid inverting an almost singular matrix. Thus, if all the errors are zero, then the Fisher information matrix will be . The effect will be multiplying all the weight update elements by . This is the largest value the inverse will reach during the SNGL algorithm's execution.

In [13] Park adds regularization to the ANGL algorithm with an Akaike Information Criterion.

###### 2.5.2. Annealed Learning Rates

Annealing the learning rate [6] is a technique in which a learning algorithm will slowly decay its learning rate after a specific epoch during the execution of the algorithm. Let the learning rate of the SNGL algorithm be , where is the current learning epoch. It is defined as where is the initial learning rate and is a factor that determines which learning epoch begins the annealing process. The epoch in which the decay begins is .

#### 3. Experimental Results

The exclusive OR function was the first Boolean function to be learned by a multilayer perceptron that was not linearly separable [4]. Encoding data with a low density parity check (LDPC) code [14] could be considered the exclusive OR problem's big brother. An LDPC is computed by multiplying (0,1)-vectors with a (0,1)-matrix over to get a longer (0,1)-vector. Remember that in addition is the XOR operation and multiplication is the AND operation. Formally, a low density parity check code is a code in which extra bits that are parity checks are added to the message. A matrix is used to represent the transformation of the bits over . The encoding function is the matrix This code has a rate of 1/2 because the code words are twice as long as the message words.

The training set consists of 32 training patterns. The inputs are the 32 possible messages, and the outputs are the 32 10-bit code words corresponding to each message. Thus, the network has 5 input units and 10 output units. The multilayer perceptron will also have 10 hidden units. During the training the bits are changed according to a bit error rate of 0.01. This will help the network deal with bit errors.

First, the performance of the three algorithms will be compared. This will be followed by a comparison of their effective learning rates. Finally, an experiment in which the effective learning rate of the SNGL and ANGL algorithms when these learning rates are used in the ordinary gradient learning (OGL) algorithm as described in [4] will be discussed.

##### 3.1. Performance Comparison

Figure 1 shows the sum-squared error of the three algorithms when learning this LDPC encoding function. As can be seen, the performance of the SNGL algorithm is comparable to that of the OGL algorithm, but it reaches convergence in less than 1% of the time.

Table 2 lists the parameters used in all three algorithms and some of the performance metrics. The ANGL algorithm achieves a much lower MSSE. Using the prior probability by adding the current parameters multiplied by the hyperparameter caused the components of the Fisher information matrix to approach infinity. Therefore, it was omitted, and the weights of the network were allowed to grow without constraint. Thus, the ANGL overtrained the network significantly.

Table 2 may at first seem to be an unfair comparison. However, each of the algorithms were run with and without the regularization and annealed learning rates. The algorithm that performed the best is listed above. This was done to conserve space and simplify the reports. Thus, OGL performs better without an annealed learning rate on this problem. There were several difficulties implementing ANGL with the regularization. It did not perform well with regularization.

##### 3.2. The Effective Learning Rate

The effective learning rate is defined to be the product of the baseline learning rate and the smallest eigenvalue of the matrix chosen to represent the inverse of the Fisher information matrix. For OGL the effective learning rate is just the learning rate chosen for the algorithm. For ANGL it is the learning rate multiplied by the smallest eigenvalue of the estimated Fisher information matrix. For SNGL it is the inverse of the largest eigenvalue of the Fisher information matrix multiplied by the learning rate. Figure 2 shows the effective learning rates for these three algorithms. It is a simplified number to give the reader a sense of the step size used by the algorithm during each learning epoch.

The effective learning rate for the SNGL algorithm is small for the first 100 learning epochs. Between the hundredth and the first thousandth epoch, the learning rate becomes quite large, for a step size, becoming larger than one for a few epochs. When the algorithm reaches convergence, then the learning rate begins to decrease due to annealing. The LDPC function is more complicated than XOR, and this is reflected in the number of learning epochs required to reach the three phases of learning for SNGL.

##### 3.3. Substituting Learning Rates

One might argue that SNGL and ANGL perform better than OGL because they both use higher learning rates. However, as Figure 2 shows, both of these algorithms have smaller learning rates at first, and the learning rates do not increase until progress is made towards good parameter values.

Figure 3 shows the performance of OGL with the effective learning rates of ANGL and SNGL in addition to OGL with a flat learning rate along with ANGL and SNGL. The plot shows that using the effective learning rate of ANGL improves the performance of OGL. However, the performance of OGL does not improve with the SNGL algorithm's effective learning rate. For this problem, the Fisher information matrix used in SNGL is a block-diagonal matrix with two 10-by-10 blocks on the main diagonal. In the ANGL algorithm, it is a 61-by-61 matrix. These matrices are positive definite. The eigenvalues represent how much information is associated with each direction. The eigenvectors associated with these eigenvalues are these directions. Using the maximum eigenvalue to determine an effective learning rate is good for demonstration purposes. However, it will not improve performance of the algorithm because the directional information is missing.

#### 4. Conclusion

This paper developed a simplified natural gradient learning algorithm. Using an algebraic justification to form the gradient of a multilayer perceptron as a block matrix, a smaller block-diagonal Fisher information matrix was discovered. For the three problems studied in the experiments, the Fisher information matrix was much smaller than that in ANGL. The minimum sum-squared error performance of this algorithm is comparable to Amari et al. [2] but uses less memory and computational resources.