Computational Intelligence and Neuroscience

Volume 2015 (2015), Article ID 818243, 13 pages

http://dx.doi.org/10.1155/2015/818243

## On Training Efficiency and Computational Costs of a Feed Forward Neural Network: A Review

Department of Engineering, Roma Tre University, Via Vito Volterra 62, 00146 Rome, Italy

Received 7 May 2015; Revised 16 August 2015; Accepted 17 August 2015

Academic Editor: Saeid Sanei

Copyright © 2015 Antonino Laudani et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A comprehensive review on the problem of choosing a suitable activation function for the hidden layer of a feed forward neural network has been widely investigated. Since the nonlinear component of a neural network is the main contributor to the network mapping capabilities, the different choices that may lead to enhanced performances, in terms of training, generalization, or computational costs, are analyzed, both in general-purpose and in embedded computing environments. Finally, a strategy to convert a network configuration between different activation functions without altering the network mapping capabilities will be presented.

#### 1. Introduction

Neural networks (NNs) are generally accepted in literature as a versatile and powerful tool for nonlinear mapping of a generic -dimensional nonlinear function. The mapping capabilities of a NN are strictly related to the nonlinear component found in the activation function (AF) of the neurons. Indeed, without the presence of a nonlinear activation function, the NN would be a simple linear interpolator. The most generic representation of a NN is a group of elementary processing units (neurons) characterized by a weighted connection to other input units. The processing of the unit consists of a linear part, where the inputs are linearly combined through the weights values, and a nonlinear part, where the weighted combination of the inputs is passed through an activation function, which is usually a threshold/squashing function. The nonlinear part of a NN is completely separated from the linear combination of the weighted inputs, thus opening a large number of possibilities for the choice of an activation function. Given the representation of the elementary unit, the inner architecture of the NN expresses the way those units are connected between themselves and to the inputs and outputs of the NN itself. Numerous authors studied the mapping capabilities of a NN, according to the inner architecture. In particular, it has been proved that the simple feed forward architecture with a single layer [1] and multiple layer [2–6] can be used as universal approximator given mild assumptions on hidden layer. A feed forward neural network (FFNN) is a NN where the inner architecture is organized in subsequent layers of neurons, and the connections are made according to the following rules: every neuron of a layer is connected to all (and only) the neurons of the subsequent layer. This topology rule excludes backward connections, found in many recurrent NNs [7–11], and layer-skipping, found in particular NN architectures like Fully Connected Cascade (FCC) [12]. Another focal point is the a-dynamicity of the architecture: in a FFNN no memory or delay is allowed, thus making the network useful only to represent static models. On this particular matter, several studies showed that, even by lacking dynamic capabilities, a FFNN can be used to represent both the function mapping and its derivatives [13]. Nevertheless, the choice of a suitable activation function for a FFNN, and in general, for a NN, is subject to different criterions. The most common considered criterions are training efficiency and computational cost. The former is especially important in the occurrence that a NN is trained in a general-purpose computing environment (e.g., using Matlab); the latter is critical in embedded systems (e.g., microcontrollers and FPGA (Field Programmable Gate Array)) where computational resources are inherently limited. The core of this work will be to give a comprehensive overview of the possibilities available in literature concerning the activation function for FFNN. Extending some of the concepts shown in this work could be possible to dynamic NNs as well; however training algorithms for dynamic NNs are completely different from the one used for FFNN, and thus the reader is advised to exert a critical analysis on the matter. The paper will be structured as follows. In the first part of this survey, different works focusing on the training efficiency of a FFNN with a specific AF will be presented. In particular, three subareas will be investigated: first, the analytic AFs, which enclose all the variants on the squashing functions proposed in the classic literature [14, 15]; then, the fuzzy AFs, which exploit complex membership functions to achieve faster convergence during the training procedure; last, the adaptive AFs, which focus on shaping the NN nonlinear response to mimic as much as possible the mapping function properties. In the second part of this survey, the topic of computational efficiency will be explored through different works, focusing on different order approximations found in literature. In the third and last part of this survey, a method to transform FFNN weights and biases to change the AF of the hidden layer without need to retrain the network is presented. Conclusions will follow in the fourth part. An appendix, containing the relevant figures of merit reported by the authors in their work, closes this paper.

#### 2. Activation Functions for Easy Training

##### 2.1. Analytic Activation Functions

The commonly used backpropagation algorithm for FFNN training suffers from slow learning speed. One of the reasons for this drawback lies in the rule for the computation of the FFNN’s weights correction matrix, which is calculated using the derivative of the activation function for the FFNN’s neurons. The universal approximation theorem [1] states that one of the conditions for the FFNN to be a universal approximator is for the activation function to be bounded. For these reasons, most of the activation functions show a high derivative near the origin and a progressive flattening moving towards infinity. This means that, for neurons having a sum of weighted inputs very large in magnitude, learning rate will be very slow. A detailed comparison between different simple activation functions based on exponentials and logarithms can be found in [16], where the authors investigate the learning rate and convergence speed on a character recognition problem and the classic XOR classification problem, proposing the use of the inverse tangent as a fast-learning activation function. The authors compare the training performance, in terms of Epochs required to learn the task, of the proposed inverse tangent function, against the classic sigmoid and hyperbolic tangent functions, and the novel logarithmic activation function found in [17], finding a considerable performance gain. In [18], the sigmoid activation function is modified by introducing the square of the argument, enhancing the mapping capabilities of the NN. In [19], two activation functions, one based on integration of the triangular function and one on the difference between two sigmoids (log-exponential), are proposed and compared through a barycentric plotting technique, which projects the mapping capabilities of the network in a hyper dimensional cube. The study has shown that log-exponential function has been slowly accelerated but it was effective in MLP network with backpropagation learning. In [20] a piecewise interpolation by means of cubic splines is used as an activation function, providing performances comparable to the sigmoid function with reduced computational costs. In [21], the proposed activation function is derived by Hermite orthonormal polynomials. The criterion is that every neuron in the hidden layer is characterized by a different AF, which is more complex for every neuron added. Through extensive simulations, the authors prove that such network shows great performance in comparison to analogous FFNN with identical sigmoid AFs. In [22], the authors propose a performance comparison between eight different AFs, including the stochastic AF and the novel “neural” activation function, obtained by the combination of a sinusoidal and sigmoid activation function. Two tests sets are used for comparison: breast cancer and thyroid diseases related data. The work shows that the sigmoid AFs yield, overall, the worst accuracy, and the hyperbolic tangent and the neural AF perform better on breast cancer dataset and thyroid disease dataset, respectively, pointing out the dataset dependence of the AF capabilities. The “neural” AF is investigated in [23] as well (in this work, it is referred to as “periodic”), where the NN is trained by the extended Kalman filter algorithm. The network is tested, against classic sigmoid and sinusoidal networks, in handwriting recognition, time series prediction, parity generation, and XOR mapping. The authors prove that the periodic function proposed outperforms both classic AFs in terms of training convergence. In [24], the authors suggest the combination of sigmoid and sinusoidal and Gaussian activation function, to exploit their independent space division properties. The authors compare the hybrid structure in a multifrequency signal classification problem, concluding that even if the combination of the three activation functions performs better than the sigmoid (in terms of convergence speed) and the Gaussian (in terms of noise rejection), the sinusoidal activation function by itself still achieves better results. Another work investigating an activation function based on sinusoidal modulation can be found in [25], where the authors propose a cosine modulated Gaussian function. The use of sinusoidal activation function is deeply investigated in [26], where the authors present a comprehensive comparison between eight different activation functions on eight different problems. Among other results, the Sinc activation function is proved as a valid alternative to the hyperbolic tangent, and the sinusoidal activation function has good training performance on small FFNNs. In Table 1, a summary of the different analytical AFs proposed in this paragraph is shown.