Abstract

The radial basis function (RBF) network has its foundation in the conventional approximation theory. It has the capability of universal approximation. The RBF network is a popular alternative to the well-known multilayer perceptron (MLP), since it has a simpler structure and a much faster training process. In this paper, we give a comprehensive survey on the RBF network and its learning. Many aspects associated with the RBF network, such as network structure, universal approximation capability, radial basis functions, RBF network learning, structure optimization, normalized RBF networks, application to dynamic system modeling, and nonlinear complex-valued signal processing, are described. We also compare the features and capabilities of the RBF network and the MLP.

1. Introduction

The multilayer perceptron (MLP) trained with backpropagation (BP) rule [1] is one of the most important neural network models. Due to its universal function approximation capability, the MLP is widely used in system identification, prediction, regression, classification, control, feature extraction, and associative memory [2]. The RBF network model was proposed by Broomhead and Lowe in 1988 [3]. It has its foundation in the conventional approximation techniques, such as generalized splines and regularization techniques. The RBF network has equivalent capabilities as the MLP model, but with a much faster training speed, and thus it has become a good alternative to the MLP.

The RBF network has its origin in performing exact interpolation of a set of data points in a multidimensional space [4]. It can be considered one type of functional link nets [5]. It has a network architecture similar to the classical regularization network [6], where the basis functions are the Green’s functions of the Gram’s operator associated with the stabilizer. If the stabilizer exhibits radial symmetry, an RBF network is obtained. From the viewpoint of approximation theory, the regularization network has three desirable properties [6, 7]. It can approximate any multivariate continuous function on a compact domain to an arbitrary accuracy, given a sufficient number of units; it has the best approximation property since the unknown coefficients are linear. The solution is optimal by minimizing a functional containing a regularization term.

1.1. Network Architecture

The RBF network is a three-layer (𝐽1-𝐽2-𝐽3) feedforward neural network, as shown in Figure 1. Each node in the hidden layer uses a radial basis function (RBF), denoted 𝜙(𝑟), as its nonlinear activation function. The hidden layer performs a nonlinear transform of the input, and the output layer is a linear combiner mapping the nonlinearity into a new space. Usually, the same RBF is applied on all nodes; that is, the RBF nodes have the nonlinearity 𝜙𝑖(⃗𝑥)=𝜙(⃗𝑥−⃗𝑐𝑖), 𝑖=1,…,𝐽2, where ⃗𝑐𝑖 is the prototype or center of the 𝑖th node and 𝜙(⃗𝑥) is an RBF. The biases of the output layer neurons can be modeled by an additional neuron in the hidden layer, which has a constant activation function 𝜙0(𝑟)=1. The RBF network achieves a global optimal solution to the adjustable weights in the minimum mean square error (MSE) sense by using the linear optimization method.

For input $\vec{x}$, the output of the RBF network is given by
$$y_i(\vec{x})=\sum_{k=1}^{J_2} w_{ki}\,\phi\left(\left\|\vec{x}-\vec{c}_k\right\|\right),\quad i=1,\ldots,J_3, \tag{1.1}$$
where $y_i(\vec{x})$ is the $i$th output, $w_{ki}$ is the connection weight from the $k$th hidden unit to the $i$th output unit, and $\|\cdot\|$ denotes the Euclidean norm. The RBF $\phi(\cdot)$ is typically selected as the Gaussian function, and such an RBF network is usually termed the Gaussian RBF network.

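For a set of $N$ pattern pairs $\{(\vec{x}_p,\vec{y}_p)\mid p=1,\ldots,N\}$, (1.1) can be expressed in the matrix form
$$\mathbf{Y}=\mathbf{W}^T\boldsymbol{\Phi}, \tag{1.2}$$
where $\mathbf{W}=[\vec{w}_1,\ldots,\vec{w}_{J_3}]$ is a $J_2\times J_3$ matrix, $\vec{w}_i=(w_{1i},\ldots,w_{J_2 i})^T$, $\boldsymbol{\Phi}=[\vec{\phi}_1,\ldots,\vec{\phi}_N]$ is a $J_2\times N$ matrix, $\vec{\phi}_p=(\phi_{p,1},\ldots,\phi_{p,J_2})^T$ is the output of the hidden layer for the $p$th sample, that is, $\phi_{p,k}=\phi(\|\vec{x}_p-\vec{c}_k\|)$, $\mathbf{Y}=[\vec{y}_1\ \vec{y}_2\ \ldots\ \vec{y}_N]$ is a $J_3\times N$ matrix, and $\vec{y}_p=(y_{p,1},\ldots,y_{p,J_3})^T$.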
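The matrix form (1.2) maps directly onto array operations. The following is a minimal sketch for the Gaussian RBF network with a single common width; the function names `gaussian_design_matrix` and `rbf_forward`, the random demo data, and the choice of NumPy are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def gaussian_design_matrix(X, centers, sigma):
    """Return Phi (J2 x N) with Phi[k, p] = exp(-||x_p - c_k||^2 / (2 sigma^2))."""
    # Squared Euclidean distance between every center (rows) and sample (columns).
    d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_forward(X, centers, sigma, W):
    """Evaluate Y = W^T Phi as in (1.2); the result has shape (J3, N)."""
    return W.T @ gaussian_design_matrix(X, centers, sigma)

# Hypothetical demo sizes: N = 5 samples, J1 = 2 inputs, J2 = 3 centers, J3 = 4 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
centers = rng.normal(size=(3, 2))
W = rng.normal(size=(3, 4))
Y = rbf_forward(X, centers, sigma=1.0, W=W)
```

Each column of `Y` holds the $J_3$ outputs for one sample, matching the column convention of $\mathbf{Y}$ and $\boldsymbol{\Phi}$ in (1.2).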

The RBF network with a localized RBF such as the Gaussian RBF network is a receptive-field or localized network. The localized approximation method provides the strongest output when the input is near the prototype of a node. For a suitably trained localized RBF network, input vectors that are close to each other always generate similar outputs, while distant input vectors produce nearly independent outputs. This is the intrinsic local generalization property. A receptive-field network is an associative neural network in that only a small subspace is determined by the input to the network. This property is particularly attractive since the modification of the receptive-field function produces local effect. Thus receptive-field networks can be conveniently constructed by adjusting the parameters of the receptive-field functions and/or adding or removing neurons. Another well-known receptive-field network is the cerebellar model articulation controller (CMAC) [8, 9]. The CMAC is a distributed LUT system suitable for VLSI realization. It can approximate slow-varying functions, but may fail in approximating highly nonlinear or rapidly oscillating functions [10, 11].

1.2. Universal Approximation

The RBF network has universal approximation and regularization capabilities. Theoretically, the RBF network can approximate any continuous function arbitrarily well, if the RBF is suitably chosen [6, 12, 13]. A condition for suitable $\phi(\cdot)$ is given by Micchelli’s interpolation theorem [14]. A less restrictive condition is given in [15], where $\phi(\cdot)$ is continuous on $(0,\infty)$ and its derivatives satisfy $(-1)^l\phi^{(l)}(x)>0$ for all $x\in(0,\infty)$ and $l=0,1,2$. The choice of RBF is not crucial to the performance of the RBF network [4, 16]. The Gaussian RBF network can approximate, to any degree of accuracy, any continuous function by a sufficient number of centers $\vec{c}_i$, $i=1,\ldots,J_2$, and a common standard deviation $\sigma>0$ in the $L^p$ norm, $p\in[1,\infty]$ [12]. A class of RBF networks can achieve universal approximation when the RBF is continuous and integrable [12]. The requirement of the integrability of the RBF is relaxed in [13]. For an RBF which is continuous almost everywhere, locally essentially bounded and nonpolynomial, the RBF network can approximate any continuous function with respect to the uniform norm [13]. Based on this result, such RBFs as $\phi(r)=e^{-r/\sigma^2}$ and $\phi(r)=e^{r/\sigma^2}$ also lead to universal approximation capability [13].

In [17], in an incremental constructive method three-layer feedforward networks with randomly generated hidden nodes are proved to be universal approximators when any bounded nonlinear piecewise continuous activation function is used, and only the weights linking the hidden layer and the output layer need to be adjusted. The proof itself gives an efficient incremental construction of the network. Theoretically the learning algorithms thus derived can be applied to a wide variety of activation functions no matter whether they are sigmoidal or nonsigmoidal, continuous or noncontinuous, or differentiable or nondifferentiable; it can be used to train threshold networks directly. The network learning process is fully automatic, and no user intervention is needed.

This paper is organized as follows. In Section 2, we give a general description to a variety of RBFs. Learning of the RBF network is treated in Section 3. Optimization of the RBF network structure and model selection is described in Section 4. In Section 5, we introduce the normalized RBF network. Section 6 describes the applications of the RBF network to dynamic systems and complex RBF networks to complex-valued signal processing. A comparison between the RBF network and the MLP is made in Section 7. A brief summary is given in Section 8, where topics such as generalizations of the RBF network, robust learning against outliers, and hardware implementation of the RBF network are also mentioned.

2. Radial Basis Functions

A number of functions can be used as the RBF [6, 13, 14]
$$\phi(r)=e^{-r^2/2\sigma^2},\quad\text{Gaussian}, \tag{2.1}$$
$$\phi(r)=\frac{1}{\left(\sigma^2+r^2\right)^{\alpha}},\quad\alpha>0, \tag{2.2}$$
$$\phi(r)=\left(\sigma^2+r^2\right)^{\beta},\quad 0<\beta<1, \tag{2.3}$$
$$\phi(r)=r,\quad\text{linear}, \tag{2.4}$$
$$\phi(r)=r^2\ln(r),\quad\text{thin-plate spline}, \tag{2.5}$$
$$\phi(r)=\frac{1}{1+e^{(r/\sigma^2)-\theta}},\quad\text{logistic function}, \tag{2.6}$$
where $r>0$ denotes the distance from a data point $\vec{x}$ to a center $\vec{c}$, $\sigma$ in (2.1), (2.2), (2.3), and (2.6) is used to control the smoothness of the interpolating function, and $\theta$ in (2.6) is an adjustable bias. When $\beta$ in (2.3) takes the value of 1/2, the RBF becomes Hardy’s multiquadric function, which is extensively used in surface interpolation with very good results [6]. When $\alpha$ in (2.2) is unity, $\phi(r)$ is suitable for DSP implementation [18].
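For readers who want to experiment with these basis functions, the six definitions (2.1) to (2.6) can be written as small NumPy functions. This is only a sketch: the function names and default parameter values are assumptions made for illustration, and the thin-plate spline is assigned its limiting value 0 at $r=0$.

```python
import numpy as np

def gaussian(r, sigma=1.0):                          # (2.1)
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

def reciprocal_power(r, sigma=1.0, alpha=1.0):       # (2.2), alpha > 0
    return 1.0 / (sigma ** 2 + r ** 2) ** alpha

def generalized_multiquadric(r, sigma=1.0, beta=0.5):  # (2.3), 0 < beta < 1
    # beta = 1/2 gives Hardy's multiquadric.
    return (sigma ** 2 + r ** 2) ** beta

def linear(r):                                       # (2.4)
    return r

def thin_plate_spline(r):                            # (2.5)
    r = np.asarray(r, dtype=float)
    safe = np.where(r > 0, r, 1.0)                   # avoid log(0); the limit at r = 0 is 0
    return r ** 2 * np.log(safe)

def logistic(r, sigma=1.0, theta=0.0):               # (2.6)
    return 1.0 / (1.0 + np.exp(r / sigma ** 2 - theta))
```

Evaluating the functions at a large $r$ shows which ones are localized: (2.1), (2.2), and (2.6) decay toward zero, while (2.3), (2.4), and (2.5) grow without bound.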

Among these RBFs, (2.1), (2.2), and (2.6) are localized RBFs with the property that 𝜙(𝑟)→0 as 𝑟→∞. Physiologically, there exist Gaussian-like receptive fields in cortical cells [6]. As a result, the RBF is typically selected as the Gaussian. The Gaussian is compact and positive. It is motivated from the point of view of kernel regression and kernel density estimation. In fitting data in which there is normally distributed noise with the inputs, the Gaussian is the optimal basis function in the least-squares (LS) sense [19]. The Gaussian is the only factorizable RBF, and this property is desirable for hardware implementation of the RBF network.

Another popular RBF for universal approximation is the thin-plate spline function (2.5), which is selected from a curve-fitting perspective [20]. The thin-plate spline is the solution when fitting a surface through a set of points and by using a roughness penalty [21]. It diverges at infinity and is negative over the region of 𝑟∈(0,1). However, for training purpose, the approximated function needs to be defined only over a specified range. There is some limited empirical evidence to suggest that the thin-plate spline better fits the data in high-dimensional settings [20]. The Gaussian and the thin-plate spline functions are illustrated in Figure 2.

A pseudo-Gaussian function in the one-dimensional space is introduced by selecting the standard deviation 𝜎 in the Gaussian (2.1) as two different positive values, namely, 𝜎− for 𝑥<0 and 𝜎+ for 𝑥>0 [22]. This function is extended to the multiple dimensional space by multiplying the pseudo-Gaussian function in each dimension. The pseudo-Gaussian function is not strictly an RBF due to its radial asymmetry, and this, however, provides the hidden units with a greater flexibility with respect to function approximation.
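A short sketch may make the pseudo-Gaussian construction concrete. It assumes that the two widths apply to the left and right of the center $c$ (that is, to the sign of $x-c$) and that the multidimensional version shares the same pair of widths across dimensions; both are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def pseudo_gaussian_1d(x, c, sigma_minus, sigma_plus):
    """Gaussian with width sigma_minus left of the center and sigma_plus right of it."""
    d = np.asarray(x, dtype=float) - c
    sigma = np.where(d < 0, sigma_minus, sigma_plus)
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def pseudo_gaussian(x, c, sigma_minus, sigma_plus):
    """Multidimensional version: product of the one-dimensional functions."""
    return np.prod(pseudo_gaussian_1d(x, c, sigma_minus, sigma_plus), axis=-1)
```

With `sigma_minus=0.5` and `sigma_plus=2.0`, the function falls off much faster to the left of the center than to the right, which is the asymmetry described above.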

Approximating functions with nearly constant-valued segments using localized RBFs is most difficult, and the approximation is inefficient. The sigmoidal RBF, as a composite of a set of sigmoidal functions, can be used to deal with this problem [23]
$$\phi(x)=\frac{1}{1+e^{-\beta[(x-c)+\theta]}}-\frac{1}{1+e^{-\beta[(x-c)-\theta]}}, \tag{2.7}$$
where $\theta>0$ and $\beta>0$. $\phi(x)$ is radially symmetric with the maximum at $c$. $\beta$ controls the steepness, and $\theta$ controls the width of the function. The shape of $\phi(x)$ is approximately rectangular or more exactly soft trapezoidal if $\beta\times\theta$ is large. For small $\beta$ and $\theta$ it is bell shaped. $\phi(x)$ can be extended for $n$-dimensional approximation by multiplying the corresponding function in each dimension. To accommodate constant values of the desired output and to avoid diminishing the kernel functions, $\phi(\vec{x})$ can be modified by adding a compensating term to the product term $\prod_i\phi_i(x_i)$ [24]. An alternative approach is to use the raised-cosine function as a one-dimensional RBF [25]. The raised-cosine RBF can represent a constant function exactly using two terms. This RBF can be generalized to $n$ dimensions [25]. Some popular fuzzy membership functions can serve the same purpose by suitably constraining some parameters [2].
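The effect of $\beta$ and $\theta$ on the shape of the sigmoidal RBF (2.7) is easy to check numerically. The sketch below is illustrative only; the function name and default values are assumptions.

```python
import numpy as np

def sigmoidal_rbf(x, c=0.0, beta=1.0, theta=1.0):
    """Difference of two shifted logistic functions, as in (2.7)."""
    upper = 1.0 / (1.0 + np.exp(-beta * ((x - c) + theta)))
    lower = 1.0 / (1.0 + np.exp(-beta * ((x - c) - theta)))
    return upper - lower

# Large beta * theta: nearly flat plateau of height close to 1 on (c - theta, c + theta).
plateau = sigmoidal_rbf(np.array([0.0, 1.0, 4.0]), beta=20.0, theta=2.0)
# Small beta and theta: bell-shaped, with the maximum at c.
bell = sigmoidal_rbf(np.array([0.0, 0.5]), beta=1.0, theta=1.0)
```

In the first case the values at $x=0$ and $x=1$ are both close to 1 while the value at $x=4$ is close to 0, which is the soft-trapezoidal shape that suits nearly constant-valued segments.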

The popular Gaussian RBF is circular shaped. Many RBF nodes may be required for approximating a functional behavior with sharp noncircular features. In order to reduce the size of the RBF network, direction-dependent scaling, shaping, and rotation of Gaussian RBFs are introduced in [26] for maximal trend sensing with minimal parameter representations for function approximation, by using a directed graph-based algorithm.

3. RBF Network Learning

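RBF network learning can be formulated as the minimization of the MSE function
$$E=\frac{1}{N}\sum_{p=1}^{N}\left\|\vec{y}_p-\mathbf{W}^T\vec{\phi}_p\right\|^2=\frac{1}{N}\left\|\mathbf{Y}-\mathbf{W}^T\boldsymbol{\Phi}\right\|_F^2, \tag{3.1}$$
where $\mathbf{Y}=[\vec{y}_1,\vec{y}_2,\ldots,\vec{y}_N]$, $\vec{y}_p$ is the target output for the $p$th sample in the training set, and $\|\cdot\|_F^2$ is the squared Frobenius norm defined as $\|\mathbf{A}\|_F^2=\operatorname{tr}(\mathbf{A}^T\mathbf{A})$.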

RBF network learning requires the determination of the RBF centers and the weights. Selection of the RBF centers is most critical to RBF network implementation. The centers can be placed on a random subset or all of the training examples, or determined by clustering or via a learning procedure. One can also use all the data points as centers in the beginning and then selectively remove centers using the 𝑘-NN classification scheme [27]. For some RBFs such as the Gaussian, it is also necessary to determine the smoothness parameter 𝜎. Existing RBF network learning algorithms are mainly derived for the Gaussian RBF network and can be modified accordingly when other RBFs are used.

3.1. Learning RBF Centers

RBF network learning is usually performed using a two-phase strategy: the first phase specifies suitable centers ⃗𝑐𝑖 and their respective standard deviations, also known as widths or radii, 𝜎𝑖, and the second phase adjusts the network weights 𝐖.

3.1.1. Selecting RBF Centers Randomly from Training Sets

A simple method to specify the RBF centers is to randomly select a subset of the input patterns from the training set if the training set is representative of the learning problem. Each RBF center is exactly situated at an input pattern. The training method based on a random selection of centers from a large training set of fixed size is found to be relatively insensitive to the use of pseudoinverse; hence the method itself may be a regularization method [28]. However, if the training set is not sufficiently large or the training set is not representative of the learning problem, learning based on the randomly selected RBF centers may lead to undesirable performance. If the subsequent learning using a selection of random centers is not satisfactory, another set of random centers has to be selected until a desired performance is achieved.

For function approximation, one heuristic is to place the RBF centers at the extrema of the second-order derivative of a function and to place the RBF centers more densely in areas of higher absolute second-order derivative than in areas of lower absolute second-order derivative [29]. As the second-order derivative of a function is associated with its curvature, this achieves a better function approximation than uniformly distributed center placement.

The Gaussian RBF network using the same $\sigma$ for all RBF centers has universal approximation capability [12]. This global width can be selected as the average of all the Euclidean distances between the $i$th RBF center $\vec{c}_i$ and its nearest neighbor $\vec{c}_j$, $\sigma=\langle\|\vec{c}_i-\vec{c}_j\|\rangle$. Another simple method for selecting $\sigma$ is given by $\sigma=d_{\max}/\sqrt{2J_2}$, where $d_{\max}$ is the maximum distance between the selected centers [3]. This choice makes the Gaussian RBF neither too steep nor too flat. The width of each RBF $\sigma_i$ can be determined according to the data distribution in the region of the corresponding RBF center. A heuristic for selecting $\sigma_i$ is to average the distances between the $i$th RBF center and its $L$ nearest neighbors, or, alternatively, $\sigma_i$ is selected according to the distance of unit $i$ to its nearest neighbor unit $j$, $\sigma_i=a\|\vec{c}_i-\vec{c}_j\|$, where $a$ is chosen between 1.0 and 1.5.
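The three width heuristics in this paragraph can be sketched in a few lines. The function names, the value `a=1.25`, and the three example centers are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def pairwise_center_distances(centers):
    """Matrix of Euclidean distances between all pairs of centers."""
    diff = centers[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def global_width_mean_nn(centers):
    """sigma = average distance from each center to its nearest neighbor."""
    d = pairwise_center_distances(centers)
    np.fill_diagonal(d, np.inf)          # exclude the zero self-distance
    return d.min(axis=1).mean()

def global_width_dmax(centers):
    """sigma = d_max / sqrt(2 * J2), with d_max the largest center distance."""
    d = pairwise_center_distances(centers)
    return d.max() / np.sqrt(2.0 * len(centers))

def local_widths(centers, a=1.25):
    """sigma_i = a * distance to the nearest neighboring center, 1.0 <= a <= 1.5."""
    d = pairwise_center_distances(centers)
    np.fill_diagonal(d, np.inf)
    return a * d.min(axis=1)

centers = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
sigma_nn = global_width_mean_nn(centers)     # (1 + 1 + 3) / 3
sigma_dmax = global_width_dmax(centers)      # 4 / sqrt(6)
sigma_local = local_widths(centers, a=1.0)   # [1, 1, 3]
```

The isolated third center receives a larger local width than the two close centers, which is how the per-unit heuristic adapts to the local density of centers.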

3.1.2. Selecting RBF Centers by Clustering

Clustering is a data analysis tool for characterizing the distribution of a data set and is usually used for determining the RBF centers. The training set is grouped into appropriate clusters whose prototypes are used as RBF centers. The number of clusters can be specified or determined automatically depending on the clustering algorithm. The performance of the clustering algorithm is important to the efficiency of RBF network learning.

Unsupervised clustering such as the 𝐶-means is popular for clustering RBF centers [30]. RBF centers determined by supervised clustering are usually more efficient for RBF network learning than those determined by unsupervised clustering [31], since the distribution of the output patterns is also considered. When the RBF network is trained for classification, the LVQ1 algorithm [32] is popular for clustering the RBF centers. Any unsupervised or supervised clustering algorithm can be used for clustering RBF centers. There are many papers that use clustering to select RBF centers, and these are described in [2]. A survey of clustering algorithms is given in [33].
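As an illustration of unsupervised center selection, the sketch below implements the basic batch $C$-means iteration (alternating nearest-prototype assignment and prototype averaging). It is a minimal version under stated assumptions: the function name `c_means` is hypothetical, initial centers are passed in explicitly, empty clusters simply keep their previous prototype, and the two-blob data set is synthetic.

```python
import numpy as np

def c_means(X, init_centers, n_iter=100):
    """Batch C-means: returns the cluster prototypes and the final labels."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Assign every sample to its nearest prototype.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each prototype to the mean of its members (keep it if the cluster is empty).
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

# Synthetic demo: two well-separated blobs, one initial center taken from each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])
centers, labels = c_means(X, init_centers=X[[0, 50]])
```

The resulting prototypes would then serve as the RBF centers $\vec{c}_i$; in practice the initial centers are often a random subset of the training set, as in Section 3.1.1.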

After the RBF centers are determined, the covariance matrices of the RBFs are set to the covariances of the input patterns in each cluster. In this case, the Gaussian RBF network is extended to the generalized RBF network using the Mahalanobis distance, defined by the weighted norm [6]
$$\phi\left(\left\|\vec{x}-\vec{c}_k\right\|_{\mathbf{A}}\right)=e^{-(1/2)(\vec{x}-\vec{c}_k)^T\boldsymbol{\Sigma}^{-1}(\vec{x}-\vec{c}_k)}, \tag{3.2}$$
where the squared weighted norm $\|\vec{x}\|_{\mathbf{A}}^2=(\mathbf{A}\vec{x})^T(\mathbf{A}\vec{x})=\vec{x}^T\mathbf{A}^T\mathbf{A}\vec{x}$ and $\boldsymbol{\Sigma}^{-1}=2\mathbf{A}^T\mathbf{A}$. When the Euclidean distance is employed, one can also select the width of the Gaussian RBF network according to Section 3.1.1.
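A brief sketch of (3.2) follows, assuming that the members of one cluster are already available as an array. The function names are hypothetical, and `np.cov` is used with its default normalization by $N-1$, which is one possible convention for the cluster covariance.

```python
import numpy as np

def cluster_covariance(members):
    """Sample covariance of the input patterns in one cluster (rows are patterns)."""
    return np.cov(members, rowvar=False)

def mahalanobis_rbf(X, center, cov):
    """Evaluate exp(-(1/2) (x - c)^T Sigma^{-1} (x - c)) for every row x of X."""
    diff = X - center
    solved = np.linalg.solve(cov, diff.T).T      # rows are Sigma^{-1} (x - c)
    return np.exp(-0.5 * (diff * solved).sum(axis=1))

# Synthetic cluster used only for demonstration.
rng = np.random.default_rng(0)
members = rng.normal(size=(100, 2))
center = members.mean(axis=0)
cov = cluster_covariance(members)
phi = mahalanobis_rbf(members, center, cov)
```

Solving the linear system instead of forming $\boldsymbol{\Sigma}^{-1}$ explicitly is a design choice made here for numerical robustness; with $\boldsymbol{\Sigma}=\sigma^2\mathbf{I}$ the function reduces to the ordinary Gaussian RBF.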

3.2. Learning the Weights

After RBF centers and their widths or covariance matrices are determined, learning of the weights 𝐖 is reduced to a linear optimization problem, which can be solved using the LS method or a gradient-descent method.

After the parameters related to the RBF centers are determined, $\mathbf{W}$ is then trained to minimize the MSE (3.1). This LS problem requires a complexity of $O(NJ_2^2)$ flops for $N>J_2$ when the popular orthogonalization techniques such as SVD and QR decomposition are applied [34]. A simple representation of the solution is given explicitly by [3]
$$\mathbf{W}=\left[\boldsymbol{\Phi}^T\right]^{\dagger}\mathbf{Y}^T=\left(\boldsymbol{\Phi}\boldsymbol{\Phi}^T\right)^{-1}\boldsymbol{\Phi}\mathbf{Y}^T, \tag{3.3}$$
where $[\cdot]^{\dagger}$ is the pseudoinverse of the matrix within. The over- or underdetermined linear LS system is an ill-conditioned problem. SVD is an efficient and numerically robust technique for dealing with such an ill-conditioned problem and is preferred. For regularly sampled inputs and exact interpolation, $\mathbf{W}$ can be computed by using the Fourier transform of the RBF network [35], which reduces the complexity to $O(N\ln N)$.
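The LS solution (3.3) can be computed without forming an explicit inverse. The sketch below uses `np.linalg.lstsq`, which relies on an SVD-based LAPACK routine, as one possible realization of the SVD approach recommended above; the function name `solve_weights` and the random stand-in for $\boldsymbol{\Phi}$ are assumptions.

```python
import numpy as np

def solve_weights(Phi, Y):
    """Solve Phi^T W = Y^T in the least-squares sense; returns W of shape (J2, J3)."""
    W, *_ = np.linalg.lstsq(Phi.T, Y.T, rcond=None)
    return W

# Random stand-ins for the hidden-layer outputs (J2 = 3, N = 20) and targets (J3 = 2).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(3, 20))
Y = rng.normal(size=(2, 20))
W = solve_weights(Phi, Y)
```

For a well-conditioned $\boldsymbol{\Phi}$ the result agrees with both expressions in (3.3); when $\boldsymbol{\Phi}\boldsymbol{\Phi}^T$ is nearly singular, the SVD-based route is the safer of the two.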

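When the full data set is not available and samples are obtained on-line, the RLS method can be used to train the weights on-line [36]
$$\vec{w}_i(t)=\vec{w}_i(t-1)+\vec{k}(t)\,e_i(t),$$
$$\vec{k}(t)=\frac{\mathbf{P}(t-1)\vec{\phi}_t}{\vec{\phi}_t^T\mathbf{P}(t-1)\vec{\phi}_t+\mu},$$
$$e_i(t)=y_{t,i}-\vec{\phi}_t^T\vec{w}_i(t-1),$$
$$\mathbf{P}(t)=\frac{1}{\mu}\left[\mathbf{P}(t-1)-\vec{k}(t)\vec{\phi}_t^T\mathbf{P}(t-1)\right] \tag{3.4}$$
for $i=1,\ldots,J_3$, where $0<\mu\le 1$ is the forgetting factor. Typically, $\mathbf{P}(0)=a_0\mathbf{I}_{J_2}$, $a_0$ being a sufficiently large number and $\mathbf{I}_{J_2}$ the $J_2\times J_2$ identity matrix, and $\vec{w}_i(0)$ is selected as a small random vector.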
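One RLS update of (3.4) can be sketched as follows, applied to all $J_3$ weight vectors at once because they share the same gain $\vec{k}(t)$ and matrix $\mathbf{P}(t)$. The function name `rls_step`, the noise-free synthetic data stream, and the zero initial weights (used here instead of small random values for reproducibility) are assumptions.

```python
import numpy as np

def rls_step(W, P, phi, y, mu=1.0):
    """One RLS update; W is J2 x J3, P is J2 x J2, phi has length J2, y has length J3."""
    Pphi = P @ phi
    k = Pphi / (phi @ Pphi + mu)             # gain vector k(t)
    e = y - W.T @ phi                        # a priori errors e_i(t), one per output
    W = W + np.outer(k, e)                   # w_i(t) = w_i(t-1) + k(t) e_i(t)
    P = (P - np.outer(k, phi @ P)) / mu      # P(t) = [P(t-1) - k phi^T P(t-1)] / mu
    return W, P

# Synthetic stream generated by a hypothetical "true" weight matrix.
rng = np.random.default_rng(0)
J2, J3, N = 3, 2, 200
W_true = rng.normal(size=(J2, J3))
W = np.zeros((J2, J3))
P = 1e6 * np.eye(J2)                         # P(0) = a0 * I with a large a0
for t in range(N):
    phi = rng.normal(size=J2)                # hidden-layer output for sample t
    y = W_true.T @ phi                       # target for sample t
    W, P = rls_step(W, P, phi, y, mu=1.0)
```

With $\mu=1$ and noise-free targets, the estimate approaches `W_true` after a few samples; a value $\mu<1$ would discount old samples and suit slowly varying systems.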

In order to eliminate the inversion operation given in (3.3), an efficient, noniterative weight learning technique has been introduced by applying the Gram-Schmidt orthogonalization (GSO) of RBFs [37]. The RBFs are first transformed into a set of orthonormal RBFs for which the optimum weights are computed. These weights are then recomputed in such a way that their values can be fitted back into the original RBF network structure, that is, with kernel functions unchanged. The requirement for computing the off-diagonal terms in the solution of the linear set of weight equations is thus eliminated. In addition, the method has low storage requirements since the weights can be computed recursively, and the computation can be organized in a parallel manner. Incorporation of new hidden nodes does not require recomputation of the network weights already calculated. This allows for a very efficient network training procedure, where network hidden nodes are added one at a time until an adequate error goal is reached. The contribution of each RBF to the overall network output can be evaluated.

3.3. RBF Network Learning Using Orthogonal Least Squares

The orthogonal least-squares (OLS) method [16, 38, 39] is an efficient way for subset model selection. The approach chooses and adds RBF centers one by one until an adequate network is constructed. All the training examples are considered as candidates for the centers, and the one that reduces the MSE the most is selected as a new hidden unit. The GSO is first used to construct a set of orthogonal vectors in the space spanned by the vectors of the hidden unit activation ⃗𝜙𝑝, and a new RBF center is then selected by minimizing the residual MSE. Model selection criteria are used to determine the size of the network.

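The batch OLS method can not only determine the weights, but also choose the number and the positions of the RBF centers. The batch OLS can employ the forward [38–40] and the backward [41] center selection approaches. When the RBF centers are distinct, $\boldsymbol{\Phi}^T$ is of full rank. The orthogonal decomposition of $\boldsymbol{\Phi}^T$ is performed using QR decomposition
$$\boldsymbol{\Phi}^T=\mathbf{Q}\begin{bmatrix}\mathbf{R}\\ \mathbf{0}\end{bmatrix}, \tag{3.5}$$
where $\mathbf{Q}=[\vec{q}_1,\ldots,\vec{q}_N]$ is an $N\times N$ orthogonal matrix and $\mathbf{R}$ is a $J_2\times J_2$ upper triangular matrix. By minimizing the MSE given by (3.1), one can make use of the invariant property of the Frobenius norm
$$E=\frac{1}{N}\left\|\mathbf{Q}^T\mathbf{Y}^T-\mathbf{Q}^T\boldsymbol{\Phi}^T\mathbf{W}\right\|_F^2. \tag{3.6}$$
Let $\mathbf{Q}^T\mathbf{Y}^T=\begin{bmatrix}\tilde{\mathbf{B}}\\ \bar{\mathbf{B}}\end{bmatrix}$, where $\tilde{\mathbf{B}}=[\tilde{b}_{ij}]$ and $\bar{\mathbf{B}}=[\bar{b}_{ij}]$ are, respectively, a $J_2\times J_3$ and an $(N-J_2)\times J_3$ matrix. We then have
$$E=\frac{1}{N}\left\|\begin{bmatrix}\tilde{\mathbf{B}}-\mathbf{R}\mathbf{W}\\ \bar{\mathbf{B}}\end{bmatrix}\right\|_F^2. \tag{3.7}$$
Thus, the optimal $\mathbf{W}$ is derived from
$$\mathbf{R}\mathbf{W}=\tilde{\mathbf{B}}. \tag{3.8}$$
In this case, the residual $E=(1/N)\|\bar{\mathbf{B}}\|_F^2$.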
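The chain (3.5) to (3.8) can be followed step by step in code. The sketch below uses the full QR decomposition from NumPy so that both blocks $\tilde{\mathbf{B}}$ and $\bar{\mathbf{B}}$ are available; the function name `ols_weights_qr` and the random stand-in data are assumptions, and a dedicated triangular solver could replace `np.linalg.solve`.

```python
import numpy as np

def ols_weights_qr(Phi, Y):
    """Solve for W via QR of Phi^T and return (W, residual MSE)."""
    J2, N = Phi.shape
    Q, R_full = np.linalg.qr(Phi.T, mode="complete")   # Q: N x N, R_full: N x J2
    R = R_full[:J2]                                    # upper triangular J2 x J2 block
    B = Q.T @ Y.T                                      # stacked [B_tilde; B_bar]
    B_tilde, B_bar = B[:J2], B[J2:]
    W = np.linalg.solve(R, B_tilde)                    # R W = B_tilde, as in (3.8)
    E = (B_bar ** 2).sum() / N                         # residual (1/N) ||B_bar||_F^2
    return W, E

# Random stand-ins: J2 = 3 hidden units, N = 20 samples, J3 = 2 outputs.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(3, 20))
Y = rng.normal(size=(2, 20))
W, E = ols_weights_qr(Phi, Y)
```

The returned residual equals the MSE (3.1) evaluated at the optimal weights, so no second pass over the data is needed to obtain the training error.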

Due to the orthogonalization procedure, it is very convenient to implement the forward and backward center selection approaches. The forward selection approach is to build up a network by adding, one at a time, centers at the data points that result in the largest decrease in the network output error at each stage. Alternatively, the backward selection algorithm sequentially removes from the network, one at a time, those centers that cause the smallest increase in the residual.

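The error reduction ratio (ERR) due to the $k$th RBF neuron is defined by [39]
$$\mathrm{ERR}_k=\frac{\left(\sum_{i=1}^{J_3}\tilde{b}_{ki}^2\right)\vec{q}_k^T\vec{q}_k}{\operatorname{tr}\left(\mathbf{Y}\mathbf{Y}^T\right)},\quad k=1,\ldots,N. \tag{3.9}$$
RBF network training can be carried out constructively, and the centers with the largest ERR values are recruited until
$$1-\sum_{k=1}^{J_2}\mathrm{ERR}_k<\rho, \tag{3.10}$$
where $\rho\in(0,1)$ is a tolerance.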
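The forward selection driven by (3.9) and (3.10) can be sketched with a Gram-Schmidt loop. This is a simplified illustration in the spirit of the forward OLS procedure, not a reproduction of any cited implementation: candidate regressors are the columns of an $N\times M$ array, the orthogonal vectors $\vec{q}_k$ are left unnormalized so that $\tilde{b}_{ki}=\vec{q}_k^T\vec{y}_i/\vec{q}_k^T\vec{q}_k$, and the names `ols_forward_select`, `candidates`, and `targets` are hypothetical.

```python
import numpy as np

def ols_forward_select(candidates, targets, rho=0.01, max_centers=None):
    """Greedy selection of columns of `candidates` (N x M) by largest ERR.

    `targets` is N x J3 (that is, Y^T). Returns selected column indices and their ERR values.
    """
    N, M = candidates.shape
    max_centers = M if max_centers is None else max_centers
    total = (targets ** 2).sum()                     # tr(Y Y^T)
    resid = candidates.astype(float).copy()          # candidates orthogonalized so far
    selected, errs = [], []
    while len(selected) < max_centers and 1.0 - sum(errs) >= rho:
        best, best_err = None, 0.0
        for j in range(M):
            if j in selected:
                continue
            q = resid[:, j]
            qq = q @ q
            if qq < 1e-12:                           # numerically dependent candidate
                continue
            b = (q @ targets) / qq                   # row of B_tilde for this candidate
            err = (b ** 2).sum() * qq / total        # ERR as in (3.9)
            if err > best_err:
                best, best_err = j, err
        if best is None:
            break
        q = resid[:, best].copy()
        selected.append(best)
        errs.append(best_err)
        # Gram-Schmidt step: remove the component along q from all candidates.
        resid -= np.outer(q, (q @ resid) / (q @ q))
    return selected, errs

# Synthetic check: the targets depend on candidate columns 3 and 1 only.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(200, 6))
targets = 2.0 * candidates[:, [3]] - 1.0 * candidates[:, [1]]
selected, errs = ols_forward_select(candidates, targets, rho=1e-8)
```

In this synthetic case the loop stops after two selections because the accumulated ERR reaches 1; in an RBF setting each candidate column would be the response of one prospective center over the training set.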

ERR is a performance-oriented criterion. An alternative terminating criterion can be based on the Akaike information criterion (AIC) [39, 42], which balances between the performance and the complexity. The weights are determined at the same time. The criterion used to stop center selection is a simple threshold on the ERR. If the chosen threshold results in very large variances for Gaussian functions, poor generalization performance may occur. To improve generalization, regularized forward OLS methods can be implemented by penalizing large weights [43, 44]. In [45], the training objective is defined by $E+\sum_{i=1}^{M}\lambda_i w_i^2$, where the $\lambda_i$ are the local regularization parameters and $M$ is the number of weights.

The computational complexity of the orthogonal decomposition of $\boldsymbol{\Phi}^T$ is $O(NJ_2^2)$. When the size of a training data set $N$ is large, the batch OLS is computationally demanding and also needs a large amount of computer memory.

The RBF center clustering method based on the Fisher ratio class separability measure [46] is similar to the forward selection OLS algorithm [38, 39]. Both the methods employ the QR decomposition-based orthogonal transform to decorrelate the responses of the prototype neurons as well as the forward center selection procedure. The OLS evaluates candidate centers based on the approximation error reduction in the context of nonlinear approximation, while the Fisher ratio-based forward selection algorithm evaluates candidate centers using the Fisher ratio class separability measure for the purpose of classification. The two algorithms have similar computational cost.

Recursive OLS (ROLS) algorithms are proposed for updating the weights of single-input single-output [47] and multi-input multioutput systems [48, 49]. In [48], the ROLS algorithm determines the increment of the weight matrix. In [49], the full weight matrix is determined at each iteration, and this reduces the accumulated error in the weight matrix, and the ROLS has been extended for the selection of the RBF centers. After training with the ROLS, the final triangular system of equations in a form similar to (3.8) contains important information about the learned network and can be used to sequentially select the centers to minimize the network output error. Forward and backward center selection methods are developed from this information, and Akaike’s FPE criterion [50] is used in the model selection [49]. The ROLS selection algorithms sort the selected centers in the order of their significance in reducing the MSE [48, 49].

3.4. Supervised Learning of All Parameters

The gradient-descent method provides the simplest solution. We now apply the gradient-descent method to supervised learning of the RBF network.

3.4.1. Supervised Learning for General RBF Networks

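To derive the supervised learning algorithm for the RBF network with any useful RBF, we rewrite the error function (3.1) as
$$E=\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{J_3}\left(e_{n,i}\right)^2, \tag{3.11}$$
where $e_{n,i}$ is the approximation error at the $i$th output node for the $n$th example
$$e_{n,i}=y_{n,i}-\sum_{m=1}^{J_2}w_{mi}\,\phi\left(\left\|\vec{x}_n-\vec{c}_m\right\|\right)=y_{n,i}-\vec{w}_i^T\vec{\phi}_n. \tag{3.12}$$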

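Taking the first-order derivative of $E$ with respect to $w_{mi}$ and $\vec{c}_m$, respectively, we have
$$\frac{\partial E}{\partial w_{mi}}=-\frac{2}{N}\sum_{n=1}^{N}e_{n,i}\,\phi\left(\left\|\vec{x}_n-\vec{c}_m\right\|\right),\quad m=1,\ldots,J_2,\ i=1,\ldots,J_3, \tag{3.13}$$
$$\frac{\partial E}{\partial\vec{c}_m}=\frac{2}{N}\sum_{i=1}^{J_3}w_{mi}\sum_{n=1}^{N}e_{n,i}\,\dot{\phi}\left(\left\|\vec{x}_n-\vec{c}_m\right\|\right)\frac{\vec{x}_n-\vec{c}_m}{\left\|\vec{x}_n-\vec{c}_m\right\|},\quad m=1,\ldots,J_2, \tag{3.14}$$
where $\dot{\phi}(\cdot)$ is the first-order derivative of $\phi(\cdot)$. The sum over $i$ in (3.14) collects the contributions of all $J_3$ outputs, since every term of (3.11) depends on $\vec{c}_m$.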

The gradient-descent method is defined by the update equations
$$\Delta w_{mi}=-\eta_1\frac{\partial E}{\partial w_{mi}},\qquad\Delta\vec{c}_m=-\eta_2\frac{\partial E}{\partial\vec{c}_m}, \tag{3.15}$$
where $\eta_1$ and $\eta_2$ are learning rates. To prevent the situation that two or more centers are too close or coincide with one another during the learning process, one can add a term such as $\sum_{\alpha\neq\beta}\psi(\|\vec{c}_\alpha-\vec{c}_\beta\|)$ to $E$, where $\psi(\cdot)$ is an appropriate repulsive potential. The gradient-descent method given by (3.15) needs to be modified accordingly.

Initialization can be based on a random selection of the RBF centers from the examples and 𝐖 as a matrix with small random components. One can also use clustering to find the initial RBF centers and the LS to find the initial weights, and the gradient-descent procedure is then applied to refine the learning result. When the gradients given above are set to zero, the optimal solutions to the weights and centers can be derived. The gradient-descent procedure is the iterative approximation to the optimal solutions. For each sample 𝑛, if we set 𝑒𝑛,𝑖=0, then the right-hand side of (3.13) is zero, we then achieve the global optimum and accordingly get 𝑦𝑛=𝐖𝑇⃗𝜙𝑛. For all samples, the result is exactly the same as (1.2). The optimum solution to weights is given by (3.3). Equating (3.14) to zero leads to a formulation showing that the optimal centers are weighted sums of the data points, corresponding to a task-dependent clustering problem.
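The statement that the optimal centers are weighted sums of the data points can be made explicit with a short derivation. It is a sketch that assumes the center gradient is summed over all $J_3$ outputs and that the normalizing sum is nonzero; the symbols $r_{n,m}$ and $\beta_{n,m}$ are introduced here only for this purpose.

```latex
% Notation introduced for this sketch:
%   r_{n,m}     = \|\vec{x}_n - \vec{c}_m\|
%   \beta_{n,m} = \frac{\dot{\phi}(r_{n,m})}{r_{n,m}} \sum_{i=1}^{J_3} e_{n,i}\, w_{mi}
\begin{align*}
\frac{\partial E}{\partial \vec{c}_m}
  = \frac{2}{N}\sum_{n=1}^{N} \beta_{n,m}\,(\vec{x}_n - \vec{c}_m) = \vec{0}
\quad\Longrightarrow\quad
\vec{c}_m = \frac{\sum_{n=1}^{N} \beta_{n,m}\,\vec{x}_n}{\sum_{n=1}^{N} \beta_{n,m}},
\qquad \sum_{n=1}^{N}\beta_{n,m} \neq 0 .
\end{align*}
% For the Gaussian RBF, \dot{\phi}(r) = -(r/\sigma^2)\,\phi(r), so that
% \beta_{n,m} = -\phi_m(\vec{x}_n)\,\sigma_m^{-2} \sum_i e_{n,i}\, w_{mi}.
```

Because $\beta_{n,m}$ itself depends on $\vec{c}_m$ and on the errors $e_{n,i}$, this is a fixed-point condition rather than a closed-form solution, which is why the clustering it describes is task dependent.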

3.4.2. Supervised Learning for Gaussian RBF Networks

For the Gaussian RBF network, the RBF at each node can be assigned a different width $\sigma_i$. The RBFs can be further generalized to allow for arbitrary covariance matrices $\boldsymbol{\Sigma}_i$
$$\phi_i(\vec{x})=e^{-(1/2)(\vec{x}-\vec{c}_i)^T\boldsymbol{\Sigma}_i^{-1}(\vec{x}-\vec{c}_i)}, \tag{3.16}$$
where $\boldsymbol{\Sigma}_i\in R^{J_1\times J_1}$ is a positive definite, symmetric covariance matrix. When $\boldsymbol{\Sigma}_i^{-1}$ is in general form, the shape and orientation of the axes of the hyperellipsoid are arbitrary in the feature space. If $\boldsymbol{\Sigma}_i^{-1}$ is a diagonal matrix with nonconstant diagonal elements, it is completely defined by a vector $\vec{\sigma}_i\in R^{J_1}$, and each $\phi_i$ is a hyperellipsoid whose axes are along the axes of the feature space, $\boldsymbol{\Sigma}_i^{-1}=\operatorname{diag}(1/\sigma_{i,1}^2,\ldots,1/\sigma_{i,J_1}^2)$. For the $J_1$-dimensional input space, each RBF using a general $\boldsymbol{\Sigma}_i^{-1}$ has a total of $J_1(J_1+3)/2$ independent adjustable parameters ($J_1$ for the center and $J_1(J_1+1)/2$ for the symmetric matrix), while each RBF using the same $\sigma$ in all directions and each RBF using diagonal $\boldsymbol{\Sigma}_i^{-1}$ have only $J_1+1$ and $2J_1$ independent parameters, respectively. There is a trade-off between using a small network with many adjustable parameters and using a large network with fewer adjustable parameters.

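When each RBF uses the same $\sigma$ in all directions, we get the gradients as
$$\frac{\partial E}{\partial\vec{c}_m}=-\frac{2}{N}\sum_{n=1}^{N}\phi_m(\vec{x}_n)\frac{\vec{x}_n-\vec{c}_m}{\sigma_m^2}\sum_{i=1}^{J_3}e_{n,i}w_{mi},\qquad\frac{\partial E}{\partial\sigma_m}=-\frac{2}{N}\sum_{n=1}^{N}\phi_m(\vec{x}_n)\frac{\left\|\vec{x}_n-\vec{c}_m\right\|^2}{\sigma_m^3}\sum_{i=1}^{J_3}e_{n,i}w_{mi}. \tag{3.17}$$
Similarly, for the RBF using diagonal $\boldsymbol{\Sigma}_i^{-1}$, the gradients are given by
$$\frac{\partial E}{\partial c_{m,j}}=-\frac{2}{N}\sum_{n=1}^{N}\phi_m(\vec{x}_n)\frac{x_{n,j}-c_{m,j}}{\sigma_{m,j}^2}\sum_{i=1}^{J_3}e_{n,i}w_{mi},\qquad\frac{\partial E}{\partial\sigma_{m,j}}=-\frac{2}{N}\sum_{n=1}^{N}\phi_m(\vec{x}_n)\frac{\left(x_{n,j}-c_{m,j}\right)^2}{\sigma_{m,j}^3}\sum_{i=1}^{J_3}e_{n,i}w_{mi}. \tag{3.18}$$
Adaptations for $\vec{c}_i$ and $\boldsymbol{\Sigma}_i$ are along the negative gradient directions. $\mathbf{W}$ is updated by (3.13) and (3.15). To prevent unreasonable radii, the updating algorithms can also be derived by adding to the MSE $E$ a constraint term that penalizes small radii, $E_c=\sum_i 1/\sigma_i$ or $E_c=\sum_{i,j}1/\sigma_{i,j}$.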
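The gradients (3.13) and (3.17) for the Gaussian RBF network with one width per unit can be sketched and checked against finite differences as follows. The function names, the random demo data, and the learning rate are assumptions; note also that the code stores targets and outputs with samples in rows (an $N\times J_3$ array), which is the transpose of $\mathbf{Y}$ in the text.

```python
import numpy as np

def forward(X, C, sigma, W):
    """Return hidden outputs Phi (N x J2), squared distances (N x J2), outputs (N x J3)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma[None, :] ** 2))
    return Phi, d2, Phi @ W

def mse(X, Yt, C, sigma, W):
    """Error function (3.11); Yt is N x J3."""
    return ((Yt - forward(X, C, sigma, W)[2]) ** 2).sum() / len(X)

def gradients(X, Yt, C, sigma, W):
    """Gradients of E with respect to W (3.13), centers and widths (3.17)."""
    N = len(X)
    Phi, d2, out = forward(X, C, sigma, W)
    Err = Yt - out                            # e_{n,i}
    G = Phi * (Err @ W.T)                     # phi_m(x_n) * sum_i e_{n,i} w_{mi}
    gW = -(2.0 / N) * Phi.T @ Err
    diff = X[:, None, :] - C[None, :, :]      # x_n - c_m, shape N x J2 x J1
    gC = -(2.0 / N) * (G[:, :, None] * diff).sum(axis=0) / sigma[:, None] ** 2
    gS = -(2.0 / N) * (G * d2).sum(axis=0) / sigma ** 3
    return gW, gC, gS

# Random demo problem: N = 30, J1 = 2, J2 = 4, J3 = 2.
rng = np.random.default_rng(0)
X, Yt = rng.normal(size=(30, 2)), rng.normal(size=(30, 2))
C, W = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
sigma = rng.uniform(0.8, 1.2, size=4)
gW, gC, gS = gradients(X, Yt, C, sigma, W)

# One small batch gradient-descent step along the negative gradients.
eta = 1e-3
E_before = mse(X, Yt, C, sigma, W)
E_after = mse(X, Yt, C - eta * gC, sigma - eta * gS, W - eta * gW)
```

A single learning rate is used for all three parameter groups here for brevity; separate rates, as in (3.15), are a straightforward variation.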

3.4.3. Remarks

The gradient-descent algorithms introduced so far are batch learning algorithms. By optimizing the error function $E_p$ for each example $(\vec{x}_p,\vec{y}_p)$, one can update the parameters in the incremental learning mode, which is typically much faster than its batch counterpart for suitably selected learning parameters.

Although the RBF network trained by the gradient-descent method is capable of providing equivalent or better performance compared to that of the MLP trained with the BP, the training times for the two methods are comparable [51]. The gradient-descent method is slow in convergence since it cannot efficiently use the locally tuned representation of the hidden-layer units. When the hidden-unit receptive fields, controlled by the widths $\sigma_i$, are narrow, for a given input only a few of the total number of hidden units will be activated and hence only these units need to be updated. However, the gradient-descent method may lead to large widths, and then the original idea of using a number of local tuning units to approximate the target function cannot be maintained. Besides, the computational advantage of locality cannot be utilized anymore [30].

The gradient-descent method is prone to finding local minima of the error function. For reasonably well-localized RBF, an input will generate a significant activation in a small region, and the opportunity of getting stuck at a local minimum is small. Unsupervised methods can be used to determine 𝜎𝑖. Unsupervised learning is used to initialize the network parameters, and supervised learning is usually used for fine-tuning the network parameters. The ultimate RBF network learning algorithm is typically a blend of unsupervised and supervised algorithms. Usually, the centers are selected by using a random subset of the training set or obtained by using clustering, the variances are selected using a heuristic, and the weights are solved by using a linear LS method or the gradient-descent method. This combination may yield a fast learning procedure with a sufficient accuracy.

3.5. Other Learning Methods

Actually, all general-purpose unconstrained optimization methods are applicable for RBF network learning by minimization of 𝐸, with no or very little modification. These include popular second-order approaches like the Levenberg-Marquardt (LM), conjugate gradient, BFGS, and extended Kalman filtering (EKF), and heuristic-based global optimization methods like evolutionary algorithms, simulated annealing, and Tabu search. These algorithms are described in detail in [2].

The objective is to find suitable network structure and the corresponding network parameters. Some complexity criteria such as the AIC [42] are used to control a trade-off between the learning error and network complexity. Heuristic-based global optimization is also widely used for neural network learning. The selection of the network structure and parameters can be performed simultaneously or separately. For implementation using evolutionary algorithm, the parameters associated with each node are usually coded together in the chromosomes. Some heuristic-based RBF network learning algorithms are described in [2].

The LM method is used for RBF network learning [52–54]. In [53, 54], the LM method is used for estimating nonlinear parameters, and the LS method is used for weight estimation at each iteration. All model parameters are optimized simultaneously. In [54], at each iteration the weights are updated many times during the process of looking for the search direction to update the nonlinear parameters. This further accelerates the convergence of the search process. RBF network learning can be viewed as a system identification problem. After the number of centers is chosen, the EKF simultaneously solves for the prototype vectors and the weight matrix [55]. A decoupled EKF further decreases the complexity of the training algorithm [55]. EKF training provides almost the same performance as gradient-descent training, but with only a fraction of the computational cost. In [56], a pair of parallel running extended Kalman filters are used to sequentially update both the output weights and the RBF centers.

In [57], BP with selective training [58] is applied to RBF network learning. The method improves the performance of the RBF network substantially compared to the gradient-descent method, in terms of convergence speed and accuracy. The method is quite effective when the dataset is error-free and nonoverlapping. In [15], the RBF network is reformulated by using RBFs formed in terms of admissible generator functions, and a fully supervised gradient-descent training method is provided. A learning algorithm is proposed in [59] for training a special class of reformulated RBF networks, known as cosine RBF networks. It trains reformulated RBF networks by updating selected adjustable parameters to minimize the class-conditional variances at the outputs of their RBFs, so that the networks are capable of identifying uncertainty in data classification.

Linear programming models with polynomial time complexity are also employed to train the RBF network [60]. A multiplication-free Gaussian RBF network with a gradient-based nonlinear learning algorithm [61] is described for adaptive function approximation.

The expectation-maximization (EM) method [62] is an efficient maximum likelihood-based method for parameter estimation; it splits a complex problem into many separate small-scale subproblems. The EM method has also been applied for RBF network learning [63–65]. The shadow targets algorithm [66], which employs a philosophy similar to that of the EM method, is an efficient RBF network training algorithm for topographic feature extraction.

The RBF network using regression weights can significantly reduce the number of hidden units and is effectively used for approximating nonlinear dynamic systems [22, 25, 64]. For a $J_1$-$J_2$-1 RBF network, the weight from the $i$th hidden unit to the output unit, $w_i$, is defined by the linear regression [64] $w_i = \vec{a}_i^T \tilde{\vec{x}} + \xi_i$, where $\vec{a}_i = (a_{i,0}, a_{i,1}, \ldots, a_{i,J_1})^T$ is the regression parameter vector, $\tilde{\vec{x}} = (1, x_1, \ldots, x_{J_1})^T$ is the augmented input vector, and $\xi_i$ is zero-mean Gaussian noise. For the Gaussian RBF network, the RBF centers $\vec{c}_i$ and their widths $\sigma_i$ can be selected by the $C$-means and the nearest-neighbor heuristic, while the parameters of the regression weights are estimated by the EM method [64]. The RBF network with linear regression weights has also been studied [25], where a simple but fast computational procedure is achieved by using a high-dimensional raised-cosine RBF.
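
As a minimal sketch of how regression weights enter the forward pass, the following Python fragment evaluates a single-output network whose hidden-to-output weights are linear functions of the input. The function names, the Gaussian form exp(-r^2 / (2 sigma^2)), and the numerical values are assumptions made for illustration; the noise term is omitted, and the EM estimation of [64] is not shown.

```python
import math

def gaussian(x, c, sigma):
    """Gaussian RBF response exp(-||x - c||^2 / (2 sigma^2)) (assumed form)."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def regression_weight(a, x):
    """Noise-free part of w_i = a_i^T x_tilde, with x_tilde = (1, x_1, ..., x_J1)."""
    x_tilde = [1.0] + list(x)
    return sum(ak * xk for ak, xk in zip(a, x_tilde))

def rbf_regression_output(x, centers, sigmas, A):
    """Single-output network: sum over hidden units of w_i(x) * phi_i(x)."""
    return sum(regression_weight(a, x) * gaussian(x, c, s)
               for a, c, s in zip(A, centers, sigmas))

# Hypothetical two-input, two-hidden-unit network.
centers = [[0.0, 0.0], [1.0, 1.0]]
sigmas = [1.0, 1.0]
A = [[0.5, 1.0, -1.0], [2.0, 0.0, 0.0]]   # one regression vector a_i per hidden unit
y = rbf_regression_output([0.0, 0.0], centers, sigmas, A)
```

Setting all but the first entry of each regression vector to zero recovers the ordinary constant-weight RBF network.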

When approximating a given function 𝑓(𝑥), a parsimonious design of the Gaussian RBF network can be achieved based on the Gaussian spectrum of 𝑓(𝑥), 𝛾G(𝑓;⃗𝑐,𝜎) [67]. According to the Gaussian spectrum, one can estimate the necessary number of RBF units and evaluate how appropriate the use of the Gaussian RBF network is. Gaussian RBFs are selected according to the peaks (negative as well as positive) of the Gaussian spectrum. Only the weights of the RBF network need to be tuned. Analogous to principal component analysis (PCA) of the data sets, the principal Gaussian components of 𝑓(𝑥) are extracted. If there are a few sharp peaks on the spectrum surface, the Gaussian RBF network is suitable for approximation with a parsimonious architecture. However, if there are many peaks with similar importance or small peaks situated in large flat regions, this method will be inefficient.

The Gaussian RBF network can be regarded as an improved alternative to the four-layer probabilistic neural network (PNN) [68]. In a PNN, a Gaussian RBF node is placed at the position of each training pattern so that the unknown density can be well interpolated and approximated. This technique yields optimal decision surfaces in the Bayes sense. Training consists of associating each node with its target class. This approach, however, suffers severely from the curse of dimensionality and results in poor generalization. The probabilistic RBF network [69] constitutes a probabilistic version of the RBF network for classification that extends the typical mixture model approach to classification by allowing the sharing of mixture components among all classes. The probabilistic RBF network is an alternative approach for class-conditional density estimation. It provides output values corresponding to the class-conditional densities 𝑝(⃗𝑥∣𝑘) (for class 𝑘). The typical learning method of the probabilistic RBF network for a classification task employs the EM algorithm, whose result depends highly on the initial parameter values. In [70], a technique for incremental training of the probabilistic RBF network for classification is proposed, based on criteria for detecting a region that is crucial for the classification task. After the addition of all components, the algorithm splits every component of the network into subcomponents, each one corresponding to a different class.

Extreme learning machine (ELM) [71] is a learning algorithm for single-hidden-layer feedforward neural networks (SLFNs). The incremental version of ELM (I-ELM) has been proved to be a universal approximator [17]. ELM randomly chooses the hidden nodes and analytically determines the output weights of SLFNs. In theory, this algorithm tends to provide good generalization performance at extremely fast learning speed, compared to gradient-based algorithms and SVM. ELM is a simple and efficient three-step learning method, which does not require BP or iterative techniques: once the hidden-node parameters have been assigned at random, the remaining parameters of the SLFN are determined analytically. Unlike the gradient-based algorithms, the ELM algorithm can be used to train SLFNs with many nondifferentiable activation functions, such as the threshold function, directly, instead of approximating them with sigmoid functions [72]. In [73], an online sequential ELM (OS-ELM) algorithm is developed to learn data one by one or chunk by chunk with fixed or varying chunk size. In OS-ELM, the parameters of hidden nodes are randomly selected, and the output weights are analytically determined based on the sequentially arriving data. Apart from selecting the number of hidden nodes, no other control parameters have to be manually chosen.
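
The batch ELM procedure can be sketched in a few lines of NumPy. The sketch below assumes Gaussian RBF hidden nodes, uniform sampling ranges for the centers and widths, and a pseudoinverse solve for the output weights; these choices, the function names, and the toy sine-fitting task are illustrative assumptions rather than the exact settings of [71].

```python
import numpy as np

def hidden_output(X, centers, widths):
    """Hidden-layer output matrix H with Gaussian RBF nodes (assumed node type)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths ** 2))

def elm_train(X, T, n_hidden, rng):
    """Step 1: draw hidden-node parameters at random.
    Step 2: compute the hidden-layer output matrix H.
    Step 3: solve for the output weights in the least-squares sense."""
    centers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_hidden, X.shape[1]))
    widths = rng.uniform(0.5, 1.5, size=n_hidden)    # sampling range is an assumption
    H = hidden_output(X, centers, widths)
    beta = np.linalg.pinv(H) @ T                     # Moore-Penrose pseudoinverse
    return centers, widths, beta

def elm_predict(X, centers, widths, beta):
    return hidden_output(X, centers, widths) @ beta

# Toy regression task: fit sin(x) on [-3, 3].
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X)
centers, widths, beta = elm_train(X, T, n_hidden=20, rng=rng)
mse = float(np.mean((elm_predict(X, centers, widths, beta) - T) ** 2))
```

No gradient is computed at any point, which is why nondifferentiable hidden-node functions pose no difficulty for this style of training.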

4. Optimizing Network Structure

In order to achieve the optimum structure of an RBF network, learning can be performed by determining the number and locations of the RBF centers automatically using constructive and pruning methods.

4.1. Constructive Approach

The constructive approach gradually increases the number of RBF centers until a criterion is satisfied. The forward OLS algorithm [16] described in Section 3.3 is a well-known constructive algorithm. Based on the OLS algorithm, a constructive algorithm for the generalized Gaussian RBF network is given in [74]. RBF network learning based on a modification to the cascade-correlation algorithm [75] works in a way similar to the OLS method, but with a significantly faster convergence [76]. The OLS incorporated with the sensitivity analysis is also employed to search for the optimal RBF centers [77]. A review on constructive approaches to structural learning of feedforward networks is given in [78].

In [79], a new prototype is created in a region of the input space by splitting an existing prototype ⃗𝑐𝑗 selected by a splitting criterion, and splitting is performed by adding the perturbation vectors ±⃗𝜖𝑗 to ⃗𝑐𝑗. The resulting vectors ⃗𝑐𝑗±⃗𝜖𝑗 together with the existing centers form the initial set of centers for the next growing cycle. ⃗𝜖𝑗 can be obtained by a deviation measure computed from ⃗𝑐𝑗 and the input vectors represented by ⃗𝑐𝑗, and ‖⃗𝜖𝑗‖≪‖⃗𝑐𝑗‖. Feedback splitting and purity splitting are two criteria for the selection of splitting centers. Existing algorithms for updating the centers ⃗𝑐𝑗, widths 𝜎𝑗, and weights can be used. The process continues until a stopping criterion is satisfied.

In a heuristic incremental algorithm [80], the training phase is an iterative process that adds a hidden node $\vec{c}_t$ at each epoch $t$ by an error-driven rule. Each epoch $t$ consists of three phases. The data point with the worst approximation, denoted $x_s$, is first recruited as a new hidden node $\vec{c}_t = x_s$, and its weight to each output node $j$, $w_{tj}$, is fixed at the error produced at the $j$th output node by the network at the $(t-1)$th epoch on $x_s$. The next two phases are, respectively, local tuning and fine tuning of the variances of the RBFs.

The incremental RBF network architecture using hierarchical gridding of the input space [81] allows for a uniform approximation without wasting resources. The centers and variances of the added nodes are fixed through heuristic considerations. Additional layers of Gaussians at lower scales are added where the residual error is higher. The number of Gaussians of each layer and their variances are computed from considerations based on linear filtering theory. The weight of each Gaussian is estimated through a maximum a posteriori estimate carried out locally on a subset of the data points. The method shows a high accuracy in the reconstruction, and it can deal with nonevenly spaced data points and is fully parallelizable. Similarly, the hierarchical RBF network [82] is a multiscale version of the RBF network. It is constituted by hierarchical layers, each containing a Gaussian grid at a decreasing scale. The grids are not completely filled, but units are inserted only where the local error is over a threshold. The constructive approach is based only on the local operations, which do not require any iteration on the data. Like traditional wavelet-based multiresolution analysis (MRA), the hierarchical RBF network employs Riesz bases and enjoys asymptotic approximation properties for a very large class of functions.

The dynamic decay adjustment (DDA) algorithm is a fast constructive training method for the RBF network when used for classification [83]. It is motivated by the probabilistic nature of the PNN [68] and the constructive nature of the restricted Coulomb energy (RCE) networks [84], as well as by the independent adjustment of the decay factor or width 𝜎𝑖 of each prototype. The DDA method is faster and also achieves a higher classification accuracy than the conventional RBF network [30], the MLP trained with the Rprop, and the RCE.

Incremental RBF network learning is also derived based on the growing cell structures model [85] and based on a Hebbian learning rule adapted from the neural gas model [86]. The insertion strategy is based on the accumulated error of a subset of the data set. Another example of the constructive approach is the competitive RBF algorithm based on maximum-likelihood classification [87]. The resource-allocating network (RAN) [88] is a well-known RBF network construction method and is introduced in Section 4.2.

4.2. Resource-Allocating Networks

The RAN is a sequential learning method for the localized RBF network, which is suitable for online modeling of nonstationary processes. The network begins with no hidden units. As the pattern pairs are received during the training, a new hidden unit may be recruited according to the novelty in the data. The novelty in the data is decided by two conditions:

$$\|\vec{x}_t - \vec{c}_i\| > \varepsilon(t), \tag{4.1}$$

$$\|\vec{e}(t)\| = \|\vec{y}_t - f(\vec{x}_t)\| > e_{\min}, \tag{4.2}$$

where $\vec{c}_i$ is the center nearest to $\vec{x}_t$, the prediction error is $\vec{e} = (e_1, \ldots, e_{J_3})^T$, and $\varepsilon(t)$ and $e_{\min}$ are two thresholds. The algorithm starts with $\varepsilon(t) = \varepsilon_{\max}$, where $\varepsilon_{\max}$ is chosen as the largest scale in the input space, typically the entire input space of nonzero probability. $\varepsilon(t)$ shrinks exponentially by $\varepsilon(t) = \max\{\varepsilon_{\max} e^{-t/\tau}, \varepsilon_{\min}\}$, where $\tau$ is a decay constant. $\varepsilon(t)$ is decayed until it reaches $\varepsilon_{\min}$.
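
The novelty test of (4.1) and (4.2), together with the exponentially decaying distance threshold, can be sketched as follows. The function names, the `predict` callback, and the treatment of an empty network (any pattern is accepted as novel) are assumptions of this sketch and are not taken from [88].

```python
import math

def epsilon(t, eps_max, eps_min, tau):
    """Distance threshold eps(t) = max(eps_max * exp(-t / tau), eps_min)."""
    return max(eps_max * math.exp(-t / tau), eps_min)

def is_novel(x, y, centers, predict, t, eps_max, eps_min, tau, e_min):
    """Return True when the pattern pair (x, y) satisfies both (4.1) and (4.2)."""
    if not centers:                 # assumption: an empty network accepts any pattern
        return True
    nearest = min(math.dist(x, c) for c in centers)   # distance to the nearest center
    error = math.dist(y, predict(x))                  # norm of the prediction error
    return nearest > epsilon(t, eps_max, eps_min, tau) and error > e_min

# Hypothetical network with one center and a constant-zero output.
centers = [[0.0, 0.0]]
predict = lambda x: [0.0]
far = is_novel([3.0, 4.0], [1.0], centers, predict, t=0,
               eps_max=2.0, eps_min=0.1, tau=10.0, e_min=0.5)
near = is_novel([0.1, 0.0], [1.0], centers, predict, t=0,
                eps_max=2.0, eps_min=0.1, tau=10.0, e_min=0.5)
```

Here `far` is novel because the input is far from every center and the error is large, whereas `near` fails the distance condition even though its error is equally large.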

Assuming that there are $k$ nodes at time $t-1$, for the Gaussian RBF network, the newly added hidden unit at time $t$ can be initialized as

$$\vec{c}_{k+1} = \vec{x}_t, \qquad w_{(k+1)j} = e_j(t), \quad j = 1, \ldots, J_3, \qquad \sigma_{k+1} = \alpha \|\vec{x}_t - \vec{c}_i\|, \tag{4.3}$$

where the value for $\sigma_{k+1}$ is based on the nearest-neighbor heuristic and $\alpha$ is a parameter defining the size of neighborhood. If a pattern pair $(\vec{x}_t, \vec{y}_t)$ does not pass the novelty criteria, no hidden unit is added and the existing network parameters are adapted using the LMS method [89].
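
A minimal sketch of the allocation step (4.3) is given below, together with a simplified LMS step. The dictionary-based unit representation, the Gaussian form, and the restriction of the LMS update to the output weights are assumptions made for brevity; the sketch also assumes that at least one unit already exists, so that a nearest center is defined.

```python
import math

def gaussian(x, c, sigma):
    return math.exp(-math.dist(x, c) ** 2 / (2.0 * sigma ** 2))

def predict(x, units, n_out):
    """units: list of dicts with keys 'c', 'sigma', 'w' (one weight per output)."""
    out = [0.0] * n_out
    for u in units:
        phi = gaussian(x, u["c"], u["sigma"])
        for j in range(n_out):
            out[j] += u["w"][j] * phi
    return out

def allocate_unit(x, error, units, alpha):
    """Add a unit as in (4.3): center at x, weights equal to the current error,
    width equal to alpha times the distance to the nearest existing center."""
    nearest = min(math.dist(x, u["c"]) for u in units)
    units.append({"c": list(x), "sigma": alpha * nearest, "w": list(error)})

def lms_update(x, error, units, eta):
    """Simplified LMS step on the output weights only (assumption of this sketch)."""
    for u in units:
        phi = gaussian(x, u["c"], u["sigma"])
        u["w"] = [w + eta * e * phi for w, e in zip(u["w"], error)]

# Hypothetical one-unit network receiving a novel pattern (x, y).
units = [{"c": [0.0], "sigma": 1.0, "w": [0.0]}]
x, y = [2.0], [1.0]
error = [yj - fj for yj, fj in zip(y, predict(x, units, 1))]
allocate_unit(x, error, units, alpha=0.5)
```

Because the new unit responds with 1 at its own center and its weight equals the previous error, the network output at the novel input matches the target immediately after allocation.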

The RAN method performs much better than the RBF network learning algorithm using random centers and that using the centers clustered by the 𝐶-means [30] in terms of network size and MSE. The RAN method achieves roughly the same performance as the MLP trained with the BP, but with much less computation.

In [90], an agglomerative clustering algorithm is used for RAN initialization, instead of starting from zero hidden nodes. In [91], the LMS method is replaced by the EKF method for the network parameter adaptation so as to generate a more parsimonious network. Two criteria, namely, the prediction error criterion, which is the same as (4.2), and the angle criterion, are also derived from a geometric viewpoint. The angle criterion assigns RBFs that are nearly orthogonal to all the other existing RBFs. These criteria are proved to be equivalent to Platt’s criteria [88]. In [92], the statistical novelty criterion is defined by using the result of the EKF method. By using the EKF method and using this criterion to replace the criteria (4.1) and (4.2), more compact networks and smaller MSEs are achieved than with the RAN [88] and the EKF-based RAN [91].

Numerous improvements on the RAN have been made by integrating node-pruning procedure [93–99]. The minimal RAN [93, 94] is based on the EKF-based RAN [91] and achieves a more compact network with equivalent or better accuracy by incorporating a pruning strategy to remove inactive nodes and augmenting the basic growth criterion of the RAN. The output of each RBF unit is linearly scaled to ̂𝑜𝑖(⃗𝑥)∈(0,1]. If ̂𝑜𝑖(⃗𝑥) is below a predefined threshold 𝛿 for a given number of iterations, this node is idle and can be removed. For a given accuracy, the minimal RAN achieves a smaller complexity than the MLP trained with RProp [100]. In [95], the RAN is improved by using Givens QR decomposition-based RLS for the adaptation of the weights and integrating a node-pruning strategy. The ERR criterion in [38] is used to select the most important regressors. In [96], the RAN is improved by using in each iteration the combination of the SVD and QR-cp methods for determining the structure as well as for pruning the network. In the early phase of learning, the addition of RBFs is in small groups, and this leads to an increased rate of convergence. If a particular RBF is not considered for a given number of iterations, it is removed. The size of the network is more compact than the RAN.
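
The idle-node test of the minimal RAN can be illustrated with the sketch below. It assumes that the outputs of the hidden units are positive and are scaled by the largest output for the current input, and that a node is pruned once it has stayed below the threshold for `window` consecutive inputs; the function names and the numerical values are hypothetical.

```python
def update_idle_counts(outputs, idle_counts, delta):
    """outputs: positive hidden-unit outputs for the current input.
    Each output is scaled by the largest one, so scaled values lie in (0, 1]."""
    peak = max(outputs)
    for i, o in enumerate(outputs):
        scaled = o / peak
        # Count consecutive inputs for which the unit stays below the threshold.
        idle_counts[i] = idle_counts[i] + 1 if scaled < delta else 0
    return idle_counts

def units_to_prune(idle_counts, window):
    """Indices of units that have been idle for at least `window` consecutive inputs."""
    return [i for i, n in enumerate(idle_counts) if n >= window]

# Three hidden units observed over three inputs; the third unit is always nearly silent.
counts = [0, 0, 0]
for outputs in ([1.0, 0.5, 0.001], [0.8, 0.9, 0.002], [0.7, 0.2, 0.0005]):
    counts = update_idle_counts(outputs, counts, delta=0.01)
idle = units_to_prune(counts, window=3)
```

Resetting the counter whenever a unit becomes active again ensures that only persistently inactive nodes are removed.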

In [97], the EKF and statistical novelty criterion-based method [92] is extended by incorporating an online pruning procedure, which is derived using the parameters and innovation statistics estimated from the EKF. The online pruning method is analogous to the saliency-based optimal brain surgeon (OBS) [101] and optimal brain damage (OBD) [102]. The IncNet and IncNet Pro [103] are RAN-EKF networks with statistically controlled growth criterion. The pruning method is similar to the OBS, but based on the result of the EKF algorithm.

The growing and pruning algorithm for RBF (GAP-RBF) [98] and the generalized GAP-RBF (GGAP-RBF) [99] are RAN-based sequential learning algorithms. These algorithms make use of the notion of significance of a hidden neuron, which is defined as a neuron’s statistical contribution over all the inputs seen so far to the overall performance of the network. In addition to the two growing criteria of the RAN, a new neuron is added only when its significance is also above a chosen learning accuracy. If during the training the significance of a neuron becomes less than the learning accuracy, that neuron will be pruned. For each new pattern, only its nearest neuron is checked for growing, pruning, or updating using the EKF. The GGAP-RBF enhances the significance criterion such that it is applicable for training samples with arbitrary sampling density. Both the GAP-RBF and the GGAP-RBF outperform the RAN [88], the EKF-based RAN [91], and the minimal RAN [94] in terms of learning speed, network size, and generalization performance.

4.3. Constructive Methods with Pruning

In addition to the RAN algorithms with pruning strategy [93–99], there are some other constructive methods with pruning.

The normalized RBF network [22] can be sequentially constructed with pruning strategy based on the novelty of the data and the overall behaviour of the network using the gradient-descent method. The network starts from one neuron and adds a new neuron if an example passes two novelty criteria. The first criterion is the same as (4.2), and the second one deals with the activation of the nonlinear neurons, max𝑖𝜙𝑖(⃗𝑥𝑡)<𝜁, where 𝜁 is a threshold. The pseudo-Gaussian RBF is used, and RBF weights are linear regression functions of the input variables. After the whole pattern set is presented at an epoch, the algorithm starts to remove those neurons that meet any of the three cases, namely, neurons with a very small mean activation for the whole pattern set, neurons with a very small activation region, or neurons having an activation very similar to that of other neurons.

In [104], training starts with zero hidden nodes and progressively builds the model as new data become available. A fuzzy partition of the input space defines a multidimensional grid, from which the RBF centers are selected, so that at least one selected RBF center is close enough to each input example. The method is capable of adapting the structure of the RBF network online, by adding new units when an input example does not belong to any of the subspaces that have been selected, or deleting old ones when no data have been assigned to the respective fuzzy subspaces for a long period of time. The weights are updated using the RLS algorithm. The method avoids selecting the centers only among the available data.

Although the DDA [83] is an efficient and fast-growing RBF network algorithm, its constructive nature may result in too many neurons. The DDA with temporary neurons improves the DDA by introducing online pruning of neurons after each DDA training epoch [105]. After each training epoch, if the individual neurons cover a sufficient number of samples, they are marked as permanent; otherwise, they are deleted. This mechanism results in a significant reduction in the number of neurons. The DDA with selective pruning and model selection is another extension to the DDA [106], where only a portion of the neurons that cover only one training sample are pruned, and pruning is carried out only after the last epoch of the DDA training. The method improves the generalization performance of the DDA [83] and the DDA with temporary neurons [105], but yields a larger network size than the DDA with temporary neurons. The pruning strategy proposed in [107] aims to detect and remove redundant neurons to improve generalization: when the dot product of two nodes exceeds a threshold, one of the two nodes can be pruned.

4.4. Pruning Methods

Various pruning methods for feedforward networks have been discussed in [2]. These methods are applicable to the RBF network since the RBF network is a kind of feedforward network. Well-known pruning methods are the weight-decay technique [43, 45], the OBD [102], and OBS [101]. Pruning algorithms based on the regularization technique are also popular since additional terms that penalize the complexity of the network are incorporated into the MSE criterion.

With the flavor of the weight-decay technique, some regularization techniques for improving the generalization capability of the MLP and the RBF network are also discussed in [108]. As in the MLP, the favored penalty term $\sum w_{ij}^2$ is also appropriate for the RBF network. The widths of the RBFs are a major source of ill-conditioning in RBF network learning, and large width parameters are desirable for better generalization. Some suitable penalty terms for widths are given in [108]. The fault-tolerance ability of RBF networks has also been addressed by means of regularization. In [109], a Kullback-Leibler divergence-based objective function is defined for improving the fault-tolerance of RBF networks. The learning method achieves a better fault-tolerant ability, compared with weight-decay-based regularizers. In [110], a regularization-based objective function for training a functional link network to tolerate multiplicative weight noise is defined, and a simple learning algorithm is derived. The functional link network is somewhat similar to the RBF network. Under some mild conditions the derived regularizer is essentially the same as a weight decay regularizer. This explains why applying weight decay can also improve the fault-tolerant ability of an RBF network with multiplicative weight noise.

In [111], the pruning method starts from a large RBF network and achieves a compact network through an iterative procedure of training and selection. The training procedure adaptively changes the centers and the width of the RBFs and trains the linear weights. The selection procedure performs the elimination of the redundant RBFs using an objective function based on the MDL principle [112]. In [113], all the data vectors are initially selected as centers. Redundant centers in the RBF network are eliminated by merging two centers at each adaptation cycle by using an iterative clustering method. The technique is superior to the traditional RBF network algorithms, particularly in terms of the processing speed and solvability of nonlinear patterns.

4.5. Model Selection

A theoretically well-motivated criterion for describing the generalization error is developed by using Stein’s unbiased risk estimator (SURE) [114]:

$$\mathrm{Err}(J_2) = \mathrm{err}(J_2) - N\sigma_n^2 + 2\sigma_n^2 (J_2 + 1), \tag{4.4}$$

where Err is the generalization error on the new data, err denotes the training error for each model, $J_2$ is the number of RBF nodes, $N$ is the size of the pattern set, and $\sigma_n^2$ is the noise variance, which can be estimated from the MSE of the model. An empirical comparison among the SURE-based method, cross-validation, and the Bayesian information criterion (BIC) [115] is made in [114]. The generalization error of the models by the SURE-based method can be less than that of the models selected by cross-validation, but with much less computation. The SURE-based method has a behavior similar to the BIC. However, the BIC generally gives preference to simpler models since it penalizes complex models more harshly.
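
Model selection with (4.4) amounts to evaluating the criterion for each candidate number of RBF nodes and keeping the minimizer. The sketch below assumes that the training error is expressed as a sum of squared residuals, so that it is on the same scale as the term involving N, and that the noise variance is already known; the candidate values are made up for illustration.

```python
def sure_estimate(train_err, n_nodes, n_samples, noise_var):
    """Err(J2) = err(J2) - N * sigma_n^2 + 2 * sigma_n^2 * (J2 + 1)."""
    return train_err - n_samples * noise_var + 2.0 * noise_var * (n_nodes + 1)

def select_by_sure(train_errs, n_samples, noise_var):
    """train_errs maps each candidate J2 to its training error; returns the best J2."""
    return min(train_errs,
               key=lambda j: sure_estimate(train_errs[j], j, n_samples, noise_var))

# Hypothetical training errors (sums of squared residuals) for four network sizes.
train_errs = {2: 40.0, 5: 12.0, 10: 10.5, 20: 10.0}
best = select_by_sure(train_errs, n_samples=100, noise_var=0.1)
```

In this example the largest network has the smallest training error, but the penalty on the number of nodes makes the 10-node network the preferred choice.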

The generalization error of a trained network can be decomposed into two parts, namely, an approximation error that is due to the finite number of parameters of the approximation scheme and an estimation error that is due to the finite number of data available [116, 117]. For a feedforward network with $J_1$ input nodes and a single output node, a bound for the generalization error is given by [116, 117]

$$O\left(\frac{1}{P}\right) + O\left(\left[\frac{P J_1 \ln(PN) - \ln\delta}{N}\right]^{1/2}\right), \tag{4.5}$$

with a probability greater than $1-\delta$, where $N$ is the number of examples, $\delta \in (0,1)$ is the confidence parameter, and $P$ is proportional to the number of parameters, such as $P$ hidden units in an RBF network. The first term in (4.5) corresponds to the bound on the approximation error, and the second to the bound on the estimation error. As $P$ increases, the approximation error decreases since we are using a larger model; however, the estimation error increases due to overfitting (or, equivalently, more data are needed). The trade-off between the approximation and estimation errors is best maintained when $P$ is selected as $P \propto N^{1/3}$ [116]. After suitably selecting $P$ and $N$, the generalization error for feedforward networks should be $O(1/P)$. This result is similar to that for an MLP with sigmoidal functions [118]. The bound given by (4.5) has been considerably improved to $O((\ln N / N)^{1/2})$ in [119] for RBF network learning with the MSE function.
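
The scaling of P with N can be motivated by a rough balancing of the two terms in (4.5). The following is only a heuristic sketch: it treats the constants hidden in the O-notation as fixed, holds the input dimension and the confidence parameter constant, and ignores the logarithmic factors; it is not the derivation given in [116].

```latex
\frac{1}{P} \;\sim\; \left(\frac{P J_1}{N}\right)^{1/2}
\quad\Longrightarrow\quad
P^{3} \;\sim\; \frac{N}{J_1}
\quad\Longrightarrow\quad
P \;\propto\; N^{1/3},
\qquad\text{so that}\qquad
O\!\left(\frac{1}{P}\right) = O\!\left(N^{-1/3}\right).
```

Under these simplifications, choosing P smaller than this balance point leaves the approximation term dominant, whereas choosing it larger lets the estimation term dominate.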

5. Normalized RBF Networks

The normalized RBF network is defined by normalizing the vector composed of the responses of all the RBF units [30]:

$$y_i(\vec{x}) = \sum_{k=1}^{J_2} w_{ki}\,\phi_k(\vec{x}), \quad i = 1, \ldots, J_3, \tag{5.1}$$

where

$$\phi_k(\vec{x}) = \frac{\phi(\vec{x} - \vec{c}_k)}{\sum_{j=1}^{J_2} \phi(\vec{x} - \vec{c}_j)}. \tag{5.2}$$

Since the normalization operation is nonlocal, the convergence process is computationally costly.

A simple algorithm, called weighted averaging (WAV) [120], is inspired by the functional equivalence of the normalized RBF network of the form (5.1) and fuzzy inference systems (FISs) [121].

The normalized RBF network given by (5.1) can be reformulated such that normalization is performed in the output layer [64, 122]:

$$y_i(\vec{x}) = \frac{\sum_{j=1}^{J_2} w_{ji}\,\phi(\vec{x} - \vec{c}_j)}{\sum_{j=1}^{J_2} \phi(\vec{x} - \vec{c}_j)}. \tag{5.3}$$

As it already receives information from all the hidden units, the locality of the computational processes is preserved. The two forms of the normalized RBF network (5.1) and (5.3) are equivalent, and their similarity with FISs has been pointed out in [121].
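
The output-layer form (5.3) is straightforward to evaluate, as the following sketch shows. It assumes Gaussian RBFs of the form exp(-r^2 / (2 sigma^2)) and a weight matrix stored as one row per hidden unit; both the storage layout and the example values are illustrative assumptions.

```python
import math

def normalized_rbf_output(x, centers, sigmas, W):
    """y_i = sum_j w_ji * phi_j / sum_j phi_j with Gaussian phi_j, as in (5.3).
    W[j][i] is the weight from hidden unit j to output i."""
    phi = [math.exp(-math.dist(x, c) ** 2 / (2.0 * s ** 2))
           for c, s in zip(centers, sigmas)]
    total = sum(phi)          # may underflow to zero very far from all centers
    n_out = len(W[0])
    return [sum(W[j][i] * phi[j] for j in range(len(phi))) / total
            for i in range(n_out)]

# Hypothetical one-input network with two hidden units and one output.
centers = [[0.0], [1.0]]
sigmas = [0.5, 0.5]
W = [[2.0], [4.0]]
y_mid = normalized_rbf_output([0.5], centers, sigmas, W)
y_left = normalized_rbf_output([0.0], centers, sigmas, W)
```

Since the normalized activations are nonnegative and sum to one, each output is a convex combination of the corresponding weights, which is one way to see the smoothing effect of normalization.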

In the normalized RBF network of the form (5.3), the traditional roles of the weights and activities in the hidden layer are exchanged. In the RBF network, the weights determine how much each hidden node contributes to the output, while in the normalized RBF network the activities of the hidden nodes determine which weights contribute the most to the output. The normalized RBF network provides better smoothness than the RBF network. Due to the localized property of the receptive fields, for most data points, there is usually only one hidden node that contributes significantly to (5.3). The normalized RBF network can be trained using a procedure similar to that for the RBF network. The normalized Gaussian RBF network exhibits superiority in supervised classification due to its soft modification rule [123]. It is also a universal approximator in the space of continuous functions with compact support in the space $L^p(R^p, d\vec{x})$ [124].

The normalized RBF network loses the localized characteristics of the localized RBF network and exhibits excellent generalization properties, to the extent that hidden nodes need to be recruited only for training data at the boundaries of the class domains. This obviates the need for a dense coverage of the class domains, in contrast to the RBF network. Thus, the normalized RBF network softens the curse of dimensionality associated with the localized RBF network [122]. The normalized Gaussian RBF network outperforms the Gaussian RBF network in terms of the training and generalization errors, exhibits a more uniform error over the training domain, and is not sensitive to the RBF widths.

The normalized RBF network is an RBF network with a quasilinear activation function with a squashing coefficient decided by the activations of all the hidden units. The output units can also employ the sigmoidal activation function. The RBF network with the sigmoidal function at the output nodes outperforms the case of linear or quasilinear function at the output nodes in terms of sensitivity to learning parameters, convergence speed, and accuracy [57].

The normalized RBF network is found functionally equivalent to a class of Takagi-Sugeno-Kang (TSK) systems [121]. According to the output of the normalized RBF network given by (5.3), when the 𝑡-norm in the TSK model is selected as algebraic product and the membership functions (MFs) are selected the same as RBFs of the RBF network, the two models are mathematically equivalent [121, 125]. Note that each hidden unit corresponds to a fuzzy rule. In the normalized RBF network, 𝑤𝑖𝑗’s typically take constant values; thus the normalized RBF network corresponds to the zero-order TSK model. When the RBF weights are linear regression functions of the input variables [22, 64], the model is functionally equivalent to the first-order TSK model.
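
The equivalence with the zero-order TSK model can be checked numerically. The sketch below assumes one-dimensional Gaussian membership functions with the same width in every input dimension of a rule, the algebraic product as the t-norm, and weighted-average defuzzification; the function names and the numerical values are hypothetical.

```python
import math

def tsk_zero_order(x, rule_centers, rule_sigmas, consequents):
    """Zero-order TSK output with Gaussian MFs and the algebraic-product t-norm."""
    strengths = []
    for c, s in zip(rule_centers, rule_sigmas):
        mu = [math.exp(-(xd - cd) ** 2 / (2.0 * s ** 2)) for xd, cd in zip(x, c)]
        strengths.append(math.prod(mu))     # firing strength of the rule
    return sum(f * w for f, w in zip(strengths, consequents)) / sum(strengths)

def normalized_rbf(x, centers, sigmas, weights):
    """Single-output normalized Gaussian RBF network of the form (5.3)."""
    phi = [math.exp(-math.dist(x, c) ** 2 / (2.0 * s ** 2))
           for c, s in zip(centers, sigmas)]
    return sum(p * w for p, w in zip(phi, weights)) / sum(phi)

# Two rules (equivalently, two hidden units) on a two-dimensional input.
centers = [[0.0, 0.0], [1.0, 2.0]]
sigmas = [1.0, 0.5]
weights = [1.0, -3.0]
x = [0.4, 1.1]
gap = abs(tsk_zero_order(x, centers, sigmas, weights)
          - normalized_rbf(x, centers, sigmas, weights))
```

The two outputs agree up to round-off because the product of the per-dimension Gaussian memberships equals a single Gaussian of the Euclidean distance.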

6. Applications of RBF Networks

Traditionally, RBF networks are used for function approximation and classification. They are trained to approximate a nonlinear function, and the trained RBF networks are then used to generalize. All applications of the RBF network are based on its universal approximation capability.

RBF networks are now used in a vast variety of applications, such as face tracking and face recognition [126], robotic control [127], antenna design [128], channel equalization [129, 130], computer vision and graphics [131–133], and solving partial differential equations with boundary conditions (Dirichlet or Neumann) [134].

Special RBFs are customized to match the data characteristics of some problems. For instance, in channel equalization [129, 130], the received signal is complex-valued, and hence complex-valued RBFs are considered.

6.1. Vision Applications

In the graphics or vision applications, the input domain is spherical. Hence, the spherical RBFs [131–133, 135–137] are required.

In a spherical RBF network, the kernel function of the $i$th RBF node, $\phi_i(\vec{s}) = \exp\{-d(\vec{s}, \vec{c}_i)^2 / (2\Delta^2)\}$, is a Gaussian function, where $\vec{s}$ is a unit input vector and $\Delta$ is the RBF angular width. The function $d(\vec{s}, \vec{c}_i) = \cos^{-1}(\vec{s} \circ \vec{c}_i)$, where “$\circ$” is the dot product operator, is the distance between two unit vectors on the unit sphere.
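
A spherical RBF node can be sketched directly from these definitions. The function names below are hypothetical, and the clamping of the dot product to [-1, 1] is an added safeguard against round-off that is not part of the definition.

```python
import math

def geodesic_distance(s, c):
    """Angle between two unit vectors: the arccosine of their dot product."""
    dot = sum(a * b for a, b in zip(s, c))
    return math.acos(max(-1.0, min(1.0, dot)))    # clamp guards against round-off

def spherical_rbf(s, c, delta):
    """Gaussian response exp(-d(s, c)^2 / (2 delta^2)) with angular width delta."""
    return math.exp(-geodesic_distance(s, c) ** 2 / (2.0 * delta ** 2))

# Responses at the center itself and at an orthogonal direction.
at_center = spherical_rbf([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], delta=0.5)
orthogonal = spherical_rbf([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], delta=math.pi / 2)
```

The inputs are assumed to be already normalized to unit length; the sketch does not renormalize them.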

6.2. Modeling Dynamic Systems

The sequential RBF network learning algorithms, such as the RAN family and the works in [22, 104], are capable of modifying both the network structure and the output weights online; thus, these algorithms are particularly suitable for modeling dynamical time-varying systems, where not only the dynamics but also the operating region changes with time.

The state-dependent autoregressive (AR) model with functional coefficients is often used to model complex nonlinear dynamical systems. The RBF network can be used as a nonlinear AR time-series model for forecasting [138]. The RBF network can also be used to approximate the coefficients of a state-dependent AR model, yielding the RBF-AR model [139]. The RBF-AR model has the advantages of both the state-dependent AR model for describing nonlinear dynamics and the RBF network for function approximation. The RBF-ARX model is an RBF-AR model with an exogenous variable [54]; it usually uses far fewer RBF centers when compared with the RBF network. The time-delayed neural network (TDNN) model can be an MLP-based or an RBF network-based temporal neural network for nonlinear dynamics and time-series learning [2]. The RBF network-based TDNN [140] uses the same spatial representation of time as the MLP-based TDNN [141]. Learning of the RBF network-based TDNN uses the RBF network learning algorithms.

For time-series applications, the input to the network is $\vec{x}(t) = (y(t-1), \ldots, y(t-n_y))^T$ and the network output is $y(t)$. There are some problems with the RBF network when used as a time-series predictor [142]. First, the Euclidean distance measure is not always appropriate for measuring the similarity between the input vector $\vec{x}_t$ and the prototype $\vec{c}_i$ since $\vec{x}_t$ is itself highly autocorrelated. Second, the node response is radially symmetrical, whereas the data may be distributed differently in each dimension. Third, when the minimum lag $n_y$ is a large number, if all lagged versions of the output $y(t)$ are concatenated as the input vector, the network is too complex, and the performance deteriorates due to irrelevant inputs and an oversized structure. The dual-orthogonal RBF network algorithm overcomes most of these limitations for nonlinear time-series prediction [142]. Motivated by the linear discriminant analysis (LDA) technique, a distance metric is defined based on a classification function of the set of input vectors in order to achieve improved clustering. The forward OLS is used first to determine the significant lags and then to select the RBF centers. In both steps, the ERR is used for the selection of significant nodes [143].
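
The lagged input vectors used for time-series prediction can be built as in the following sketch, which pairs each target y(t) with the n_y preceding values ordered from the most recent to the oldest. The function name is hypothetical, and no lag selection of the kind performed by the dual-orthogonal algorithm is attempted.

```python
def make_lagged_dataset(series, n_y):
    """Build pairs (x(t), y(t)) with x(t) = (y(t-1), ..., y(t-n_y))."""
    inputs, targets = [], []
    for t in range(n_y, len(series)):
        inputs.append([series[t - k] for k in range(1, n_y + 1)])
        targets.append(series[t])
    return inputs, targets

# A short series with two lags.
inputs, targets = make_lagged_dataset([1, 2, 3, 4, 5], n_y=2)
```

Each additional lag adds one input dimension, which illustrates why a large n_y quickly leads to an oversized network.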

For online adaptation of nonlinear systems, a constant exponential forgetting factor is commonly applied to all the past data uniformly. This is undesirable for nonlinear systems whose dynamics are different in different operating regions. In [144], online adaptation of the Gaussian RBF network is implemented using a localized forgetting method, which sets different forgetting factors in different regions according to the response of the local prototypes to the current input vector. The method is applied in conjunction with the ROLS [48], and the computing is very efficient.

Recurrent RBF networks, which combine features from the RNN and the RBF network, are suitable for the modeling of nonlinear dynamic systems [145–147]. Time is represented internally and implicitly through the recurrent connections. Training of recurrent RBF networks can be based on the RCE algorithm [147], gradient descent, or the EKF [146]. Some techniques for injecting finite state automata into the network have also been proposed in [145].

6.3. Complex RBF Networks

Complex RBF networks are more efficient than the real-valued RBF network for nonlinear signal processing involving complex-valued signals, such as equalization and modeling of nonlinear channels in communication systems. Digital channel equalization can be treated as a classification problem.

In the complex RBF network [130], the input, the output, and the output weights are complex values, whereas the activation function of the hidden nodes is the same as that for the RBF network. The Euclidean distance in the complex domain is defined by [130]

$$d\left(\vec{x}_t, \vec{c}_i\right) = \left[\left(\vec{x}_t - \vec{c}_i\right)^H \left(\vec{x}_t - \vec{c}_i\right)\right]^{1/2}, \tag{6.1}$$

where $\vec{c}_i$ is a $J_1$-dimensional complex center vector. The Mahalanobis distance (3.16) defined for the Gaussian RBF can be extended to the complex domain [148] by changing the transpose $T$ in (3.16) into the Hermitian transpose $H$. Most existing RBF network learning algorithms can be easily extended for training various versions of the complex RBF network [129, 130, 148, 149]. When using clustering techniques to determine the RBF centers, the similarity measure can be based on the distance defined by (6.1). The Gaussian RBF is usually used in the complex RBF network. In [129], the minimal RAN algorithm [93] is extended to its complex-valued version.
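The distance (6.1) and the resulting real-valued node response can be sketched in a few lines of Python. The function names are illustrative, and the Gaussian form $\exp(-d^2/(2\sigma^2))$ with a scalar width `sigma` is assumed here for concreteness.

```python
import math


def complex_distance(x, c):
    """Distance (6.1): sqrt((x - c)^H (x - c)) for complex vectors x and c."""
    return math.sqrt(sum(abs(xi - ci) ** 2 for xi, ci in zip(x, c)))


def gaussian_response(x, c, sigma):
    """Real-valued Gaussian response of a hidden node with complex center c."""
    d = complex_distance(x, c)
    return math.exp(-d * d / (2.0 * sigma * sigma))


# Hypothetical two-dimensional complex input and center.
x = [1 + 1j, 0 + 0j]
c = [1 - 2j, 4 + 0j]
dist = complex_distance(x, c)
resp = gaussian_response(x, c, sigma=5.0)
```

Because $(\vec{x}_t - \vec{c}_i)^H(\vec{x}_t - \vec{c}_i)$ is a sum of squared moduli, the distance and the node response are real even though the inputs and centers are complex.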

Although the input and centers of the complex RBF network [129, 130, 148, 149] are complex valued, each RBF node has a real-valued response that can be interpreted as a conditional probability density function. This interpretation makes such a network particularly useful in the equalization application of communication channels with complex-valued signals. This complex RBF network is essentially two separate real-valued RBF networks.

Learning of the complex Gaussian RBF network can be performed in two phases, where the RBF centers are first selected by using the incremental $C$-means algorithm [33] and the weights are then solved by fixing the RBF parameters [148]. At each iteration $t$, the $C$-means first finds the winning node with index $w$ by using the nearest-neighbor rule and then updates both the center and the variance of the winning node by

$$\vec{c}_w(t) = \vec{c}_w(t-1) + \eta\left[\vec{x}_t - \vec{c}_w(t-1)\right], \qquad \boldsymbol{\Sigma}_w(t) = \boldsymbol{\Sigma}_w(t-1) + \eta\left[\vec{x}_t - \vec{c}_w(t-1)\right]\left[\vec{x}_t - \vec{c}_w(t-1)\right]^H, \tag{6.2}$$

where $\eta$ is the learning rate. The $C$-means is repeated until the changes in all $\vec{c}_i(t)$ and $\boldsymbol{\Sigma}_i(t)$ are within a specified accuracy. After the complex RBF centers are determined, the weight matrix $\mathbf{W}$ is determined using the LS or RLS algorithm.
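A single iteration of the update (6.2) can be sketched as follows. This is an illustrative reading of the printed equation rather than the implementation of [148]: the function name `cmeans_step`, the list-of-lists representation of $\boldsymbol{\Sigma}_w$, and the toy data are assumptions.

```python
def cmeans_step(centers, covs, x, eta):
    """Apply one incremental C-means update (6.2) in place; return the winner index."""
    # Nearest-neighbor rule, using the squared complex Euclidean distance of (6.1).
    def sq_dist(c):
        return sum(abs(xi - ci) ** 2 for xi, ci in zip(x, c))

    w = min(range(len(centers)), key=lambda i: sq_dist(centers[i]))
    # Error vector x_t - c_w(t-1), computed before the center is moved.
    e = [xi - ci for xi, ci in zip(x, centers[w])]
    centers[w] = [ci + eta * ei for ci, ei in zip(centers[w], e)]
    # Rank-one update with the Hermitian outer product e e^H.
    n = len(e)
    covs[w] = [[covs[w][r][s] + eta * e[r] * e[s].conjugate() for s in range(n)]
               for r in range(n)]
    return w


# Hypothetical setup: two centers in a two-dimensional complex input space.
centers = [[0j, 0j], [10 + 0j, 10 + 0j]]
covs = [[[0j, 0j], [0j, 0j]] for _ in centers]
winner = cmeans_step(centers, covs, x=[1 + 1j, 0j], eta=0.5)
```

Only the winning node is modified at each step; in practice the step is repeated over the data until the changes in the centers and variances fall below a chosen tolerance.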

In [150], the ELM algorithm is extended to the complex domain, yielding the fully complex ELM (C-ELM). For channel equalization, the C-ELM algorithm significantly outperforms the complex minimal RAN [129], complex RBF network [149], and complex backpropagation, in terms of symbol error rate (SER) and learning speed. In [151], a fully complex-valued RBF network is proposed for regression and classification applications, which is based on the locally regularised OLS (LROLS) algorithm aided with the D-optimality experimental design. These models [150, 151] have a complex-valued response at each RBF node.

7. RBF Networks versus MLPs

Both the MLP and the RBF network are used for supervised learning. In the RBF network, the activation of an RBF unit is determined by the distance between the input vector and the prototype vector. For classification problems, RBF units map input patterns from a nonlinearly separable space to a linearly separable space, and the responses of the RBF units form new feature vectors. Each RBF prototype is a cluster serving mainly a certain class. When the MLP with a linear output layer is applied to classification problems, minimizing the error at the output of the network is equivalent to maximizing the so-called network discriminant function at the output of the hidden units [152]. A comparison between the MLP and the localized RBF network (assuming that all units have the RBF with the same width) is as follows.

7.1. Global Method versus Local Method

The MLP is a global method; for an input pattern, many hidden units will contribute to the network output. The localized RBF network is a local method; it satisfies the minimal disturbance principle [153]; that is, the adaptation not only reduces the output error for the current example, but also minimizes disturbance to those already learned. The localized RBF network is biologically plausible.

7.2. Local Minima

The MLP has a very complex error surface, resulting in the problem of local minima or nearly flat regions. In contrast, the RBF network has a simple architecture with linear weights, and the LMS adaptation rule is equivalent to a gradient search of a quadratic surface, thus having a unique solution to the weights.
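The uniqueness of the weight solution can be made explicit with a short derivation. It is a sketch under stated assumptions: the centers and widths are held fixed, the network has a single output, and the symbols $\boldsymbol{\Phi}$ (the $N \times K$ matrix of hidden-unit responses for $N$ training examples and $K$ hidden units), $\vec{w}$ (the output weights), and $\vec{d}$ (the targets) are notation introduced here for illustration.

```latex
E(\vec{w}) = \left\| \vec{d} - \boldsymbol{\Phi}\vec{w} \right\|^2, \qquad
\nabla E(\vec{w}) = -2\,\boldsymbol{\Phi}^T \left( \vec{d} - \boldsymbol{\Phi}\vec{w} \right), \qquad
\nabla^2 E(\vec{w}) = 2\,\boldsymbol{\Phi}^T \boldsymbol{\Phi} \succeq 0 .
```

Setting the gradient to zero gives the normal equations $\boldsymbol{\Phi}^T\boldsymbol{\Phi}\vec{w} = \boldsymbol{\Phi}^T\vec{d}$. When $\boldsymbol{\Phi}$ has full column rank, the Hessian is positive definite and $\vec{w}^{*} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\vec{d}$ is the unique minimizer. The surface is quadratic only in the output weights; if the centers and widths are also adapted by gradient descent, the error is no longer quadratic in the full parameter set.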

7.3. Approximation and Generalization

The MLP has greater generalization for each training example and is a good candidate for extrapolation. The extension of a localized RBF to its neighborhood is, however, determined by its variance. This localized property prevents the RBF network from extrapolation beyond the training data.
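The effect of locality on extrapolation can be seen in a small numerical sketch. The one-dimensional Gaussian network below, with unit weights, no bias term, and a shared width `sigma`, is a hypothetical configuration chosen only for illustration.

```python
import math


def rbf_output(x, centers, weights, sigma):
    """Output of a 1-D Gaussian RBF network with a shared width and no bias."""
    return sum(w * math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
               for c, w in zip(centers, weights))


# Hypothetical network whose centers cover the interval [0, 2].
centers = [0.0, 1.0, 2.0]
weights = [1.0, 1.0, 1.0]
sigma = 0.5

inside = rbf_output(1.0, centers, weights, sigma)    # within the covered region
outside = rbf_output(10.0, centers, weights, sigma)  # far from every center
```

Far from all centers every Gaussian response decays toward zero, so the output of this sketch collapses to zero regardless of the weights, which is why such a network carries no information beyond the region covered by its centers.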

7.4. Network Resources and Curse of Dimensionality

The localized RBF network suffers from the curse of dimensionality. To achieve a specified accuracy, it needs much more data and more hidden units than the MLP. In order to approximate a wide class of smooth functions, the number of hidden units required for the three-layer MLP is polynomial with respect to the input dimensions, while the counterpart for the localized RBF network is exponential [118]. The curse of dimensionality can be alleviated by using smaller networks with more adaptive parameters [6] or by progressive learning [154].
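The exponential growth can be illustrated by counting the centers needed to cover a unit hypercube with a regular grid. This grid placement is an assumption made for illustration and is not the construction behind the bound in [118]; the function name `grid_centers` is likewise hypothetical.

```python
import itertools


def grid_centers(points_per_axis, dim):
    """Place centers on a regular grid over the unit hypercube [0, 1]^dim."""
    if points_per_axis < 2:
        raise ValueError("points_per_axis must be at least 2")
    axis = [k / (points_per_axis - 1) for k in range(points_per_axis)]
    return list(itertools.product(axis, repeat=dim))


# Number of grid centers for a fixed resolution of 5 points per axis.
counts = {dim: len(grid_centers(5, dim)) for dim in (1, 2, 3, 4)}
```

At a fixed resolution per axis the number of localized units grows as a power of the input dimension, whereas a hidden unit of the MLP responds along a whole direction of the input space.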

7.5. Hyperplanes versus Hyperellipsoids

For the MLP, the response of a hidden unit is constant on a surface which consists of parallel $(J_1-1)$-dimensional hyperplanes in the $J_1$-dimensional input space. As a result, the MLP is preferable for linearly separable problems. In the RBF network, the activation of the hidden units is constant on concentric $(J_1-1)$-dimensional hyperspheres or hyperellipsoids. Thus, it may be more efficient for linearly inseparable classification problems.
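The two kinds of level sets can be checked numerically. In the sketch below, the logistic activation for the MLP unit, the isotropic Gaussian for the RBF unit, and all parameter values are assumptions chosen for illustration.

```python
import math


def sigmoid_unit(x, w, b):
    """MLP hidden unit: logistic function of the projection w.x + b."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))


def gaussian_unit(x, c, sigma):
    """RBF hidden unit: Gaussian of the Euclidean distance to the center c."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-r2 / (2.0 * sigma ** 2))


# Two points on the same hyperplane w.x = 1 give the same MLP response.
w, b = [1.0, 1.0], 0.0
plane_a = sigmoid_unit([1.0, 0.0], w, b)
plane_b = sigmoid_unit([0.0, 1.0], w, b)

# Two points on the same circle of radius 1 around c give the same RBF response.
c = [0.0, 0.0]
sphere_a = gaussian_unit([1.0, 0.0], c, sigma=1.0)
sphere_b = gaussian_unit([0.0, -1.0], c, sigma=1.0)
farther = gaussian_unit([2.0, 0.0], c, sigma=1.0)
```

The sigmoid unit cannot tell apart points that share a projection onto `w`, while the Gaussian unit cannot tell apart points at the same distance from `c`; a hyperellipsoidal level set would result from replacing the Euclidean distance with a Mahalanobis distance.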

7.6. Training and Performing Speeds

The error surface of the MLP has many local minima or large flat regions called plateaus, which lead to slow convergence of the training process for gradient search. For the localized RBF network, only a few hidden units have significant activations for a given input; thus the network modifies the weights only in the vicinity of the sample point and retains constant weights in the other regions. The RBF network requires orders of magnitude less training time for convergence than the MLP trained with the BP rule for comparable performance [3, 30, 155]. For equivalent generalization performance, a trained MLP typically has far fewer hidden units than a trained localized RBF network and thus runs much faster once trained.

Remark 7.1. Generally speaking, the MLP is a better choice if the training data is expensive. However, when the training data is cheap and plentiful or online training is required, the RBF network is very desirable. In addition, the RBF network is insensitive to the order of the presentation of the adjusted signals and hence more suitable for online or subsequent adaptive adjustment [156]. Some properties of the MLP and the RBF network are combined for improving the efficiency of modeling such as the centroid-based MLP [157], the conic section function network [158], a hybrid perceptron node/RBF node scheme [159], and a hybrid RBF sigmoid neural network [160].

8. Concluding Remarks

The RBF network is a good alternative to the MLP. It has a much faster training process compared to the MLP. In this paper, we have given a comprehensive survey of the RBF network. Various aspects of the RBF network have been described, with emphasis placed on RBF network learning and network structure optimization. Topics on normalized RBF networks, RBF networks in dynamic systems modeling, and complex RBF networks for handling nonlinear complex-valued signals are also described. The comparison of the RBF network and the MLP addresses the advantages of each of the two models.

In the support vector machine (SVM) and support vector regression (SVR) approaches, when RBFs are used as the kernel function, SVM/SVR training automatically finds the important support vectors (RBF centers) and the weights. The training objective, however, is not the MSE.

Before we close this paper, we would also like to mention in passing some topics associated with the RBF network. Due to length restrictions, we refer the readers to [2] for a detailed exposition.

8.1. Two Generalizations of the RBF Network

The generalized single-layer network (GSLN) [161, 162] and the wavelet neural network (WNN) [163–165] are two generalizations of the RBF network and use the same three-layer architecture as the RBF network. The GSLN is also known as the generalized linear discriminant. Depending on the set of kernel functions used, a GSLN, such as the RBF network or the Volterra network [166], may have broad approximation capabilities. The WNN uses wavelet functions for the hidden units, and it is a universal approximator. Because wavelet functions are localized in both the time and frequency domains, they act as locally receptive field functions, and their multiresolution property allows them to approximate discontinuous or rapidly changing functions.

8.2. Robust Learning of RBF Networks

When a training set contains outliers, robust statistics [167] can be applied for RBF network learning. Robust learning algorithms are usually derived from the $M$-estimator method [2]. The $M$-estimator replaces the conventional squared error terms by the so-called loss functions. The loss function is a subquadratic function and reduces the influence of outliers on learning. The derivation of robust RBF network learning algorithms is typically based on the gradient-descent procedure [168, 169].
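As one concrete example of a subquadratic loss, the sketch below uses the Huber function. The choice of Huber and the threshold `delta` are assumptions made for illustration; the text does not specify which loss functions are used in [168, 169].

```python
def squared_loss(e):
    """Conventional squared-error term."""
    return 0.5 * e * e


def huber_loss(e, delta):
    """Huber loss: quadratic for small errors, linear for large ones."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)


def huber_influence(e, delta):
    """Derivative of the Huber loss, i.e. the error term used in a gradient step."""
    if abs(e) <= delta:
        return e
    return delta if e > 0 else -delta


# An outlier with error 3.0 compared against a typical error of 0.5.
typical = (squared_loss(0.5), huber_loss(0.5, delta=1.0))
outlier = (squared_loss(3.0), huber_loss(3.0, delta=1.0))
```

For small errors the two losses coincide, while for the outlier the Huber loss grows only linearly and its derivative is clipped at `delta`, so a single large residual cannot dominate a gradient-descent update.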

8.3. Hardware Implementations of RBF Networks

Hardware implementations of neural networks are commonly based on building blocks and thus allow for the inherent parallelism of neural networks. The properties of the MOS transistor are desirable for analog designs of the Gaussian RBF network. In the subthreshold or weak-inversion region, the drain current of the MOS transistor has an exponential dependence on the gate bias and dissipates very low power, and this is usually exploited for designing the Gaussian function [170]. On the other hand, the MOS transistor has a square-law dependence on the bias voltages in its strong-inversion or saturation region. The circuits for the Euclidean distance measure are usually based on the square-law property of the strong-inversion region [171, 172]. There are examples of analog circuits using building blocks [173], pulsed VLSI RBF network chips [172], direct digital VLSI implementations [174, 175], and hybrid VLSI/digital designs [170].

Acknowledgments

The authors acknowledge Professor Chi Sing Leung (Department of Electronic Engineering, City University of Hong Kong) and Professor M. N. S. Swamy (Department of Electrical and Computer Engineering, Concordia University) for their help in improving the quality of this paper. This work was supported in part by NSERC.