Mathematical Problems in Engineering

Volume 2016, Article ID 3015087, 17 pages

http://dx.doi.org/10.1155/2016/3015087

## Simplified Information Maximization for Improving Generalization Performance in Multilayered Neural Networks

IT Education Center and School of Science and Technology, Tokai University, 1117 Kitakaname, Hiratsuka, Kanagawa 259-1292, Japan

Received 30 July 2015; Accepted 21 February 2016

Academic Editor: Antonino Laudani

Copyright © 2016 Ryotaro Kamimura. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A new type of information-theoretic method is proposed to improve prediction performance in supervised learning. The method has two main technical features. First, the complicated procedures used to increase information content are replaced by the direct use of hidden neuron outputs: information is controlled by directly changing the outputs of the hidden neurons. Second, to simultaneously increase information content and decrease errors between targets and outputs, the information acquisition and use phases are separated. In the information acquisition phase, an autoencoder tries to acquire as much information content on input patterns as possible. In the information use phase, the information obtained in the acquisition phase is used for supervised learning. The method is a simplified version of actual information maximization and deals directly with the outputs from neurons. It was applied to three data sets, namely, the Iris, bankruptcy, and rebel participation data sets. Experimental results showed that the proposed simplified information acquisition method was effective in increasing the real information content. In addition, by using the information content, generalization performance was greatly improved.

#### 1. Introduction

*(1) Information-Theoretic Methods.* Information-theoretic methods in neural networks have received due attention ever since Linsker stated the so-called “Infomax” principle in living systems [1–4]. The Infomax principle holds that living systems try to maximize information content at every stage of information processing; in other words, living systems should acquire as much information as possible in order to maintain their existence. Following this principle, there have been many attempts to use information-theoretic methods in neural networks [5–9]. These were followed by the development of information-theoretic methods to control hidden neuron activation, with the aim of interpreting internal representations as fully as possible and of examining relations between information and generalization [10–15]. These methods were successful in increasing information content while keeping training errors between targets and outputs relatively small. However, they had several limitations, of which the inability to increase information, computational complexity, and the compromise between information maximization and error minimization were the most serious.

First, several cases were observed where the information-theoretic methods did not necessarily succeed in increasing information content. For example, when the number of neurons increases, the adjustment among neurons becomes difficult, preventing the neural networks from increasing information content. Second, there is the problem of computational complexity. Information or entropy functions require complex learning formulas, which suggests that information-theoretic methods can be effective only for relatively small-sized neural networks. Third is the problem of the compromise between information maximization and error minimization. From an information-theoretic point of view, information on input patterns should be increased as much as possible; on the other hand, neural networks should minimize errors between targets and outputs. Because information maximization and error minimization are sometimes contradictory, it can be difficult to reconcile the two in one framework.

*(2) Simplified Methods.* To solve the above-mentioned problems, a new information-theoretic method is here proposed to simplify the procedure of information maximization. The proposed procedure is composed of two steps, namely, realization of information maximization by directly controlling hidden neuron outputs and the separation of the information acquisition and use phases.

First, information maximization can be realized by simulating its actual effect. In the information-theoretic method, as information increases, only a small number of hidden neurons remain activated; this number should be decreased as much as possible in the course of learning in order to increase the information. For this purpose, hidden neurons are ranked according to the magnitude of their variance, and neurons with larger variance are more strongly activated; that is, more importance is placed on neurons with larger variances. This direct use of outputs facilitates the process of information maximization and reduces computational complexity.
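The variance-based ranking can be sketched as follows. The scheme below, which scales each neuron's output by its normalized variance, is an illustrative assumption, not the paper's exact update rule:

```python
import numpy as np

def variance_ranked_outputs(V):
    """Emphasize high-variance hidden neurons.

    V: (S, M) matrix of hidden outputs for S input patterns and M
    hidden neurons. Scaling by normalized variance is an assumed,
    illustrative importance scheme.
    """
    variances = V.var(axis=0)            # per-neuron output variance
    order = np.argsort(-variances)       # neurons ranked by variance
    importance = variances / variances.sum()
    return V * importance, order
```

Neurons whose outputs barely vary across patterns carry little discriminative information, so their contribution is suppressed toward zero.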

Second, the information acquisition and use phases are separated, because it has proven difficult to achieve information maximization and error minimization at the same time. First, information content in input patterns is acquired; this information content is then used to train supervised neural networks. This separation eliminates the contradiction between information maximization and error minimization within the same learning process. The effectiveness of such separation has been demonstrated in the field of deep learning [16–19].
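The two-phase scheme can be sketched as follows: an autoencoder first learns hidden weights from the inputs alone (acquisition), and a supervised output layer is then trained on the resulting hidden representation (use). The network sizes, learning rates, and plain gradient-descent updates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Phase 1: information acquisition (autoencoder on inputs only).
def train_autoencoder(X, n_hidden, epochs=200, lr=0.5):
    """Learn encoder weights W that reconstruct the inputs."""
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
    U = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
    for _ in range(epochs):
        H = sigmoid(X @ W)                     # hidden outputs
        Xhat = sigmoid(H @ U)                  # reconstruction
        delta = (Xhat - X) * Xhat * (1 - Xhat)
        dU = H.T @ delta
        dW = X.T @ ((delta @ U.T) * H * (1 - H))
        U -= lr * dU / len(X)
        W -= lr * dW / len(X)
    return W

# Phase 2: information use (supervised layer on frozen hidden outputs).
def train_classifier(X, y, W, epochs=200, lr=0.5):
    """Train an output layer on the representation learned in phase 1."""
    H = sigmoid(X @ W)                         # frozen hidden outputs
    Wout = rng.normal(0, 0.1, (H.shape[1], 1))
    for _ in range(epochs):
        out = sigmoid(H @ Wout)
        Wout -= lr * (H.T @ ((out - y) * out * (1 - out))) / len(X)
    return Wout
```

Because the encoder weights are fixed during phase 2, error minimization never has to compete with information acquisition inside a single objective.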

Finally, relations between the present method and sparse coding should be noted as well. In deep learning, sparse representations play an important role when only a small number of components are nonzero and the majority is forced to be zero. Sparse coding is said to be related to improved separability and interpretation and is biologically motivated [20–26]. One of the main differences between the present method and sparse coding methods is that sparse coding usually aims to suppress the majority of components and eventually realize a small number of nonzero components. On the other hand, the present method aims only to find a small number of important components and, eventually, make the majority of components zero. In terms of detecting important components, the present method is an active one, while the others are passive ones.

*(3) Outline.* In Section 2, the information content in hidden neurons is introduced. Then, the procedure of information maximization is simplified by directly controlling hidden neuron outputs. In Section 3, three experimental results of the Iris, bankruptcy, and rebel participation data sets are discussed. In all experimental results, it is shown that information could be increased using the present simplified method. This information increase is shown to be in direct proportion to generalization performance for higher layers in particular. Though abrupt decreases and increases in information can be observed, the simplified method can increase information for higher-layered neural networks.

#### 2. Theory and Computational Methods

##### 2.1. Simplified Information Maximization

Information-theoretic methods were originally developed to increase the information content in hidden neurons on input patterns. Various methods have been successfully applied to increase information content to a certain quantity [27–29]. However, these methods have typically been limited to networks with a relatively small number of hidden neurons because of the computational complexity involved. In addition, it has been observed that the obtained information content did not necessarily contribute to improved prediction performance. The present paper proposes a method that directly controls the outputs from the neurons in order to reduce the computational complexity of the information-theoretic methods. The procedure of information maximization can be approximated in a concrete way by producing a smaller number of activated hidden neurons.

###### 2.1.1. Information in Hidden Neurons

Though multilayered neural networks are supposed, the learning procedures are explained using a simple layered network, because the same procedures are repeated in the multilayered networks. Let $x_k^s$ denote the $k$th element of the $s$th input pattern and $w_{jk}$ the connection weight from the $k$th input neuron to the $j$th hidden neuron in Figure 1; then the net input is computed by

$$u_j^s = \sum_{k=1}^{L} w_{jk} x_k^s,$$

where $L$ is the number of input neurons. The output from the $j$th hidden neuron for the $s$th input pattern is computed by

$$v_j^s = f(u_j^s) = \frac{1}{1 + \exp(-u_j^s)},$$

where the sigmoid activation function $f$ is here used. The averaged output of the $j$th hidden neuron is defined by

$$\bar{v}_j = \frac{1}{S} \sum_{s=1}^{S} v_j^s,$$

where $S$ is the number of input patterns. In addition, the variance is computed by

$$\sigma_j^2 = \frac{1}{S} \sum_{s=1}^{S} \left( v_j^s - \bar{v}_j \right)^2.$$

The firing probability of the $j$th hidden neuron is obtained by

$$p(j) = \frac{\sigma_j^2}{\sum_{m=1}^{M} \sigma_m^2}.$$

The entropy is defined by

$$H = - \sum_{j=1}^{M} p(j) \log p(j),$$

where $M$ is the number of hidden neurons. The information is defined as the decrease of entropy from its maximum value:

$$I = \log M - H.$$
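These quantities can be computed directly from a matrix of hidden outputs. In the sketch below, the firing probability of a neuron is taken to be its output variance normalized over all hidden neurons, consistent with the emphasis on variance in the simplified method; the small constant guarding the logarithm is an implementation assumption:

```python
import numpy as np

def hidden_information(V, eps=1e-12):
    """Entropy-based information of hidden neurons.

    V: (S, M) matrix of hidden outputs for S input patterns and M
    hidden neurons. The firing probability p(j) is assumed to be the
    per-neuron variance normalized over all hidden neurons.
    """
    M = V.shape[1]
    var = V.var(axis=0)                  # sigma_j^2 for each neuron
    p = var / (var.sum() + eps)          # firing probability p(j)
    H = -np.sum(p * np.log(p + eps))     # entropy of the firing pattern
    return np.log(M) - H                 # information I = log M - H
```

When all neurons fire with equal variance the entropy is maximal and the information is zero; when a single neuron dominates, the information approaches its maximum, $\log M$.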