Abstract

Extreme learning machine (ELM) was proposed for single hidden layer feedforward networks. Because of its powerful modeling ability and the fact that it needs little human intervention, the ELM algorithm has been used widely in both regression and classification experiments. However, in order to achieve the required accuracy, it needs many more hidden nodes than are typically needed by conventional neural networks. This paper considers a new efficient learning algorithm for ELM with smoothing $L_0$ regularization. The novel algorithm updates the weights in the direction along which the overall squared error decreases most rapidly, and it can therefore sparsify the network structure very efficiently. The numerical experiments show that the ELM algorithm with smoothing $L_0$ regularization uses fewer hidden nodes but has better generalization performance than the original ELM and the ELM with $L_1$ regularization algorithms.

1. Introduction

Recently, research on and applications of artificial intelligence have seen a new surge of interest. Artificial neural networks, also known simply as neural networks, are a common artificial intelligence learning model that can autonomously learn the characteristics of data, thus avoiding the process of manual feature selection.

Among them, the most common network type is the feedforward neural network, such as the perceptron, back-propagation (BP), and radial basis function (RBF) networks. Generally, it includes an input layer, several hidden layers, and an output layer. The neurons are arranged in layers and connected only to the preceding layer: each layer takes the output of the previous layer and delivers its own output to the next layer, and there is no feedback between layers. As a result of these properties, feedforward networks have been widely used in various fields [1–4]. However, a drawback of the feedforward neural network is its inefficiency, especially when it deals with complex data.

Lately, the extreme learning machine (ELM) was proposed by Huang et al. [5–7]. Unlike conventional neural networks, ELM is a new type of single hidden layer feedforward network (SLFN) in which the input weights and the thresholds of the hidden layer can be assigned arbitrarily, provided that the activation function of the hidden layer is infinitely differentiable. The output weights can then be determined by the Moore–Penrose generalized inverse. In some simulations, training of the ELM algorithm can be completed in seconds [8]. It is extremely fast and has better generalization performance than earlier gradient-based learning algorithms.

However, in order to guarantee the accuracy of regression and classification experiments, the ELM method needs a huge number of hidden nodes. This leads to a particularly complex network structure, which inevitably increases the model size and testing time. Therefore, in order to improve the predictive power and generalization performance of ELM, choosing the appropriate number of hidden layer nodes is a hot topic in ELM research.

Many scholars have studied the optimization of the hidden layer structure of ELM and achieved many remarkable results. Rong et al. [9] proposed a fast pruned ELM (P-ELM) and successfully applied it to classification problems. This learning mechanism mainly achieves the effect of "pruning" the network structure by deleting hidden layer nodes irrelevant to the class labels. Then, Miche et al. [10] proposed the optimally pruned extreme learning machine (OP-ELM) and extended the pruning method. The incremental extreme learning machine (I-ELM) was put forward by Huang et al. [11]; it can increase the number of hidden layer nodes adaptively, so that learning efficiency is improved and network performance is optimized by adding nodes during the training process. In [12], Yu and Deng proposed a series of new efficient algorithms for training ELM, which exploit both the structure of SLFNs and the gradient information over all training epochs.

In conclusion, the number of neurons is an important factor that determines the structure of the network. There are two methods to adjust the size of the network: one is the growing method, and the other is the pruning method. For the growing method, a common strategy is to begin with a smaller network and then add new hidden nodes during the process of training the network [13, 14]. The pruning method starts with a large network structure and then prunes unimportant weights or nodes [15–19]. There are also other network structure optimization methods, such as regularization, cross validation, sensitivity analysis, and mutual information-based and magnitude-based methods [20–23].

In this paper, based on the regularization method, we propose a new efficient algorithm to train ELM. The novel algorithm updates the weights first in the direction along which the overall squared error is reduced the most and then in the direction that enforces sparsity of the weights effectively. Our strategy is to combine the regularization term with the standard error function. According to regularization theory, when $0 \le p < 1$, the $L_p$ regularization method tends to produce sparser results. However, the $L_0$ regularizer is not differentiable at the origin of coordinates and leads to an NP-hard problem [24], so $L_0$ regularization cannot be used with standard optimization algorithms directly [25]. There are also some other sparsification strategies for optimizing the structure of neural networks [26, 27]. Inspired by the work on smoothing $L_0$ regularization for feedforward neural networks [28, 29] and other optimization strategies, this paper draws on smoothing techniques to overcome the nondifferentiability at the origin and the phenomenon of oscillation during training; the related details are given in the next section. The main contributions are as follows:

(1) It is shown how the smoothing $L_0$ regularization is used to train ELM. It can discriminate important weights from unnecessary weights and drives the unnecessary weights to zero, thus effectively simplifying the structure of the network.

(2) An ideal approximate solution is obtained by using smoothing approximation techniques. The nonsmoothness and nondifferentiability of the usual $L_0$ regularization term are thereby addressed, which effectively prevents oscillation during training.

(3) Numerical simulations show that the sparsity effect and the generalization performance of ELMSL0 are better than those of the original ELM and ELML1. For example, the results in Tables 1–3 clearly show that the novel ELMSL0 algorithm uses fewer neurons but has higher testing accuracy for most of the data sets.

The remainder of this paper is organized as follows. Section 2 describes the network structure of the traditional ELM algorithm. Section 3 presents the ELM algorithm with smoothing $L_0$ regularization (ELMSL0). Section 4 shows how the smoothing $L_0$ regularization helps the gradient-based method to produce sparse results. The performance of the ELMSL0 algorithm is compared with the ELM and ELM with $L_1$ regularization (ELML1) algorithms on regression and classification applications in Section 5. Finally, conclusions are given in Section 6.

2. Extreme Learning Machine (ELM)

Next, we give a description of the traditional ELM. The numbers of neurons in the input, hidden, and output layers are $n$, $L$, and $m$, respectively (see Figure 1). The ELM algorithm randomly generates the connection weights between the input layer and the hidden layer as well as the thresholds of the hidden layer neurons, and these need not be adjusted in the process of training the network. As long as the number of hidden layer neurons is set, the unique optimal solution can be acquired. The input matrix $X$ and the output matrix $Y$ of the training set with $Q$ samples are, respectively,
$$X = [x_1, x_2, \ldots, x_Q] \in \mathbb{R}^{n \times Q}, \qquad Y = [y_1, y_2, \ldots, y_Q] \in \mathbb{R}^{m \times Q}.$$

On the basis of the structure of ELM, the actual output $T = [t_1, t_2, \ldots, t_Q]$ of the network is given as follows. The standard ELM with $L$ hidden nodes and activation function $g(\cdot)$ can be expressed as
$$t_j = \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i), \quad j = 1, 2, \ldots, Q,$$
where $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the $i$th hidden node and the output nodes, $b_i$ is the threshold of the $i$th hidden node, and $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$.

The above system of equations can be written compactly as $H\beta = T$, where $H$ is called the hidden layer output matrix of the ELM:
$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_Q + b_1) & \cdots & g(w_L \cdot x_Q + b_L) \end{bmatrix}_{Q \times L},$$
with $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^T$ and $T = [t_1, t_2, \ldots, t_Q]^T$.

By using the Moore–Penrose generalized inverse $H^{\dagger}$ of $H$, the minimum-norm least-squares solution of $H\beta = T$ can be written as
$$\hat{\beta} = H^{\dagger} T.$$
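To make the procedure above concrete, the following minimal NumPy sketch carries out the ELM steps described in this section: random input weights and hidden thresholds, the hidden layer output matrix $H$, and the output weights obtained from the Moore–Penrose pseudoinverse. The function names, the sigmoid activation, and the uniform initialization range are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def elm_train(X, T, L, rng=None):
    """Minimal ELM training sketch: X is (Q, n) inputs, T is (Q, m) targets,
    L is the number of hidden nodes. Returns the random hidden parameters
    and the output weights computed with the Moore-Penrose pseudoinverse."""
    rng = np.random.default_rng(rng)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(L, n))   # input-to-hidden weights w_i
    b = rng.uniform(-1.0, 1.0, size=L)        # hidden-node thresholds b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # sigmoid hidden layer output, shape (Q, L)
    beta = np.linalg.pinv(H) @ T              # output weights: beta = H^+ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

In the regression experiments of Section 5 an RBF activation is used instead of the sigmoid shown here; swapping the activation only changes how $H$ is computed.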

3. Extreme Learning Machine with Smoothing $L_0$ Regularization (ELMSL0)

In the general case, the weights with large absolute values play a more important role in training. In order to prune the network effectively, we need to identify the unimportant weights first and then remove them. Therefore, the aim of network training is to find appropriate weights that minimize the training error as well as the regularization term:
$$E(\beta) = \|H\beta - T\|^2 + \lambda \|\beta\|_p,$$
where $\lambda > 0$ is the regularization coefficient that balances the training accuracy and the complexity of the network. $\|\beta\|_p$ is called the $L_p$ regularizer, and it shows different properties for different values of $p$. $L_2$ regularization can effectively prevent the weights from growing too large, but it does not yield sparse solutions. $L_1$ regularization can generate sparse solutions, and $L_0$ regularization produces even sparser solutions than $L_1$ regularization. So, we focus on the $L_0$ regularization in this paper. Here, the $L_0$ regularizer is defined by
$$\|\beta\|_0 = \sum_{i=1}^{L} |\beta_i|^0,$$
for $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^T$, with the convention $|0|^0 = 0$; that is, $\|\beta\|_0$ equals the number of nonzero elements of the vector $\beta$. However, according to combinatorial optimization theory, minimizing $\|\beta\|_0$ is an NP-hard problem. In order to overcome this drawback, the following continuous function is employed to approximate the $L_0$ regularizer at the origin:
$$\|\beta\|_0 \approx \sum_{i=1}^{L} f_{\sigma}(\beta_i).$$

Here, $f_{\sigma}(t)$ is continuously differentiable on $\mathbb{R}$ and satisfies $\lim_{\sigma \to 0} f_{\sigma}(t) = |t|^0$, where $\sigma > 0$ is a positive number used to control how closely $f_{\sigma}$ approximates the $L_0$ regularizer. One representative choice for $f_{\sigma}$, consistent with these properties, is the Gaussian-type function
$$f_{\sigma}(t) = 1 - e^{-t^2/(2\sigma^2)}.$$
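As an illustration, the short sketch below implements the Gaussian-type smoothing function given above together with its derivative, which is the quantity that enters the gradient in the next step. Recall that this particular form of $f_{\sigma}$ is a representative choice and may differ from the one used in the original experiments.

```python
import numpy as np

def f_sigma(t, sigma):
    """Smooth approximation of |t|^0: close to 0 near t = 0 and close to 1
    for |t| well above sigma (Gaussian-type choice, assumed here)."""
    t = np.asarray(t, dtype=float)
    return 1.0 - np.exp(-t**2 / (2.0 * sigma**2))

def f_sigma_grad(t, sigma):
    """Derivative of f_sigma: large only for |t| on the order of sigma and
    nearly zero for |t| >> sigma, so large weights are barely penalized."""
    t = np.asarray(t, dtype=float)
    return (t / sigma**2) * np.exp(-t**2 / (2.0 * sigma**2))
```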

To sum up, the error function with the smoothing $L_0$ regularization has the following form:
$$E(\beta) = \|H\beta - T\|^2 + \lambda \sum_{i=1}^{L} f_{\sigma}(\beta_i),$$
where $\beta_i$ represents the $i$th row of the matrix $\beta$. We use the gradient descent method to minimize the error function, and the gradient of the error function is given by
$$\nabla E(\beta) = 2H^{T}(H\beta - T) + \lambda \big(f'_{\sigma}(\beta_1), f'_{\sigma}(\beta_2), \ldots, f'_{\sigma}(\beta_L)\big)^{T},$$
where $f'_{\sigma}$ denotes the derivative of $f_{\sigma}$. For any initial value $\beta^{0}$, the batch gradient method with the smoothing $L_0$ regularization term updates the weights by
$$\beta^{k+1} = \beta^{k} - \eta \nabla E(\beta^{k}), \quad k = 0, 1, 2, \ldots,$$
where $\eta > 0$ is the learning rate.
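A minimal sketch of the resulting batch gradient training loop is given below, assuming a single output node (so $\beta$ is a vector), the Gaussian-type $f_{\sigma}$ above, and a geometrically decreasing $\sigma$ bounded from below as suggested in Section 4. The default step size, regularization coefficient, and schedule values are illustrative, not the ones used in the experiments.

```python
import numpy as np

def elmsl0_fit(H, T, lam=1e-3, eta=0.05, sigma0=1.0, sigma_min=0.08,
               decay=0.99, epochs=2000):
    """Batch gradient descent on ||H beta - T||^2 + lam * sum_i f_sigma(beta_i).
    H: (Q, L) hidden layer output matrix, T: (Q,) targets (single output node).
    sigma decays each epoch but is bounded below by sigma_min."""
    def f_sigma_grad(t, sigma):
        # derivative of the Gaussian-type f_sigma(t) = 1 - exp(-t^2 / (2 sigma^2))
        return (t / sigma**2) * np.exp(-t**2 / (2.0 * sigma**2))

    beta = np.zeros(H.shape[1])
    sigma = sigma0
    for _ in range(epochs):
        residual = H @ beta - T                                  # H beta - T
        grad = 2.0 * H.T @ residual + lam * f_sigma_grad(beta, sigma)
        beta -= eta * grad                                       # beta^{k+1} = beta^k - eta * grad
        sigma = max(sigma * decay, sigma_min)                    # decreasing sigma with a lower bound
    return beta
```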

4. Description of Sparsity

Regularized sparse models play an increasingly important role in machine learning and image processing. They remove a large number of redundant variables and retain only the explanatory variables that are most relevant to the response variables, which simplifies the model while keeping the most important information in the data sets and effectively solves many practical problems.

Next, we show how the ELMSL0 algorithm differentiates between important and unimportant weights; here, the weights with large absolute values are the more important ones. The curves of $f_{\sigma}$ and $f'_{\sigma}$ for different $\sigma$ are shown in Figures 2 and 3. Notice that the only minimum point of $f_{\sigma}$ is $t = 0$. On the grounds of the properties of $f'_{\sigma}$, for a sufficiently small $\sigma$ there exists a positive number $c$ such that $f'_{\sigma}(t)$ is close to zero when $|t| > c$, and $f'_{\sigma}(t)$ is quite large when $0 < |t| < c$. Therefore, when the ELMSL0 algorithm is used to train the network, the weights whose absolute values are greater than $c$ are not easily affected by the regularization term, while the unimportant weights whose absolute values are less than $c$ are driven to zero during training. This makes clear why the smoothing $L_0$ regularization term can help the gradient-based method achieve sparse results.

In the light of the above discussion, we would like more weights to fall into the interval $(-c, c)$ in order to achieve sparser results. One way to make this happen is to set very small initial weights, but this leads to slow convergence at the beginning of the training procedure because the gradient of the squared error function will be very small [30]. The weight decay method is another choice, in which the magnitudes of the network weights are reduced compulsorily during the training process. As shown in Figure 3, if the parameter $\sigma$ is set too small, training becomes unstable and the performance of the algorithm is affected. Therefore, the parameter $\sigma$ should be set to a decreasing sequence with a lower bound.
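The sketch below illustrates the two practical points of this discussion: a decreasing $\sigma$ sequence with a lower bound, and the pruning of hidden nodes whose output weights have effectively been driven to zero. The lower bound 0.08 matches the value used in the classification experiments of Section 5; the pruning tolerance is an illustrative assumption.

```python
import numpy as np

def sigma_schedule(sigma0=1.0, decay=0.99, sigma_min=0.08, epochs=2000):
    """Decreasing sigma sequence bounded below, as recommended above."""
    sigmas, sigma = [], sigma0
    for _ in range(epochs):
        sigmas.append(sigma)
        sigma = max(sigma * decay, sigma_min)
    return np.array(sigmas)

def prune_hidden_nodes(W, b, beta, tol=1e-3):
    """Keep only the hidden nodes whose output weights exceed a small tolerance;
    the others have effectively been driven to zero by the smoothed L0 penalty."""
    keep = np.abs(beta) > tol
    return W[keep], b[keep], beta[keep]
```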

5. Simulation Results

In order to substantiate the reliability of the proposed ELMSL0 algorithm, we conduct experiments on both regression and classification applications. In Section 5.1, the ELM, ELML1, and ELMSL0 algorithms are used to approximate the SinC function and a multidimensional Gabor function. Several real-world classification data sets are used to test the performance of the three algorithms in Section 5.2.

5.1. Function Regression Problem

The SinC function is defined by
$$y(x) = \begin{cases} \dfrac{\sin(x)}{x}, & x \neq 0, \\ 1, & x = 0, \end{cases}$$
and it has a training set and a testing set with 5000 data points each, where $x$ is uniformly distributed on the interval $(-10, 10)$. To make the regression problem more realistic, uniform noise distributed in $[-0.2, 0.2]$ is added to all the training samples, while the testing samples remain noise-free. There are 50 hidden nodes, and the activation function is the RBF for the ELM, ELML1, and ELMSL0 algorithms. We choose the learning rate $\eta$, the regularization coefficient $\lambda$, and the smoothing parameter $\sigma$, where $\sigma$ is set to be a decreasing sequence. As demonstrated by Figure 3, $f'_{\sigma}$ may become too large when the parameter $\sigma$ is too small, which may cause instability during the training procedure; thus, we set a lower bound for $\sigma$. Figures 4–6 exhibit the prediction results: Figure 4 shows the prediction values of the ELM algorithm for the SinC function, Figure 5 shows the prediction values of the ELML1 algorithm, and Figure 6 shows the prediction values of the ELMSL0 algorithm. It is obvious that the ELMSL0 algorithm has better prediction performance than the other two algorithms.
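A data set of this kind can be generated as in the following sketch, which follows the description above (5000 points uniformly distributed on $(-10, 10)$, with uniform noise in $[-0.2, 0.2]$ added only to the training targets); the random seed is arbitrary.

```python
import numpy as np

def sinc(x):
    # SinC target: sin(x)/x, with the value 1 at x = 0
    return np.where(x == 0.0, 1.0, np.sin(x) / np.where(x == 0.0, 1.0, x))

rng = np.random.default_rng(0)
x_train = rng.uniform(-10.0, 10.0, size=5000)
x_test = rng.uniform(-10.0, 10.0, size=5000)

y_train = sinc(x_train) + rng.uniform(-0.2, 0.2, size=5000)  # noisy training targets
y_test = sinc(x_test)                                        # noise-free testing targets
```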

The root mean square error (RMSE) is usually used as the error function for evaluating the performance of an algorithm: the smaller the RMSE, the more accurately the algorithm describes the experimental data. The RMSE is calculated by
$$\mathrm{RMSE} = \sqrt{\frac{1}{Q}\sum_{j=1}^{Q} \big(\hat{y}_j - y_j\big)^2},$$
where $\hat{y}_j$ denotes the predicted data and $y_j$ denotes the actual data. We run 50 experiments with each of the three algorithms and report the average training and testing RMSE values in Table 1. Comparing the numbers of hidden nodes of the three algorithms, the ELMSL0 algorithm needs fewer hidden nodes while achieving the highest accuracy on the testing set.
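For reference, the RMSE defined above can be computed with a small helper such as the following (the function name is ours).

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root mean square error between predicted and actual values."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```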

Next, we consider using the three algorithms individually to approximate a multidimensional Gabor function.

We select 2601 training samples from an evenly spaced grid, and 2601 testing samples are selected similarly. In order to prevent overfitting in the training process, noise evenly distributed in $[-0.4, 0.4]$ is added to the training samples, and there is no noise in the testing samples. The original ELM algorithm has 100 hidden nodes, and the RBF is selected as the activation function. Then, we use the ELML1 and ELMSL0 algorithms to prune the network, respectively, choosing the learning rate $\eta$, the regularization coefficient $\lambda$, and the smoothing parameter $\sigma$ to approximate the Gabor function (Figure 7). We perform 50 experiments with each of the three algorithms. As shown in Figures 8–10, it is clear that the ELMSL0 algorithm has better prediction performance than the conventional ELM and ELML1 algorithms. Table 2 gives the average training and testing RMSE values and the number of hidden nodes required by the three algorithms. The accuracy of the ELMSL0 algorithm on the testing sets is higher than that of the other algorithms.

5.2. Real-World Classification Problems

In this section, we compare the generalization performance of the ELM, ELML1, and ELMSL0 algorithms on some real-world classification problems, which include seven binary classification problems and seven multiclass classification problems. Tables 4 and 5 describe these classification data sets, including the numbers of training and testing data and the attributes of each data set.

To explain Tables 4 and 5 clearly, we take the Diabetes and Iris data sets as examples. Each Diabetes sample belongs to either the positive or the negative class. The data, from the "Pima Indians Diabetes Database" created by the Applied Physics Laboratory, consist of 768 women over the age of 21 who live near Phoenix, Arizona. The Iris data include four features: sepal length, sepal width, petal length, and petal width. They contain three classes: Setosa, Versicolor, and Virginica.

Different numbers of hidden nodes are used for each data set, and the activation function is the sigmoid function for all three algorithms. We choose the learning rate $\eta$, the regularization coefficient $\lambda$, and the smoothing parameter $\sigma$, where $\sigma$ is a decreasing sequence with a lower bound of 0.08. We run 50 experiments with each data set, and Table 3 shows the average training and testing accuracy. It can be found that the ELMSL0 algorithm requires fewer hidden nodes without reducing the testing accuracy. So, the ELMSL0 algorithm not only produces sparse results to prune the network but also has better generalization performance than the other two algorithms on most of the classification data sets.

It is well known that the regularization coefficient and the number of hidden nodes affect the accuracy of the algorithms. Therefore, several experiments are needed to select appropriate regularization coefficients. The binary Sonar data set and the multiclass Iris data set are selected here. We take the Sonar signal classification data set from the UCI machine learning repository as an example. It is a typical benchmark problem in the field of neural networks. All samples are divided into two categories: sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. Here, the number of hidden nodes is set to 500 in ELM, and the network is then pruned by the ELML1 and ELMSL0 algorithms. Figure 11 shows the testing accuracy of the two algorithms with different regularization coefficients. Figure 12 shows the number of hidden nodes required by the two algorithms as the regularization coefficient increases. From Figures 11 and 12, the ELMSL0 algorithm requires fewer hidden nodes but achieves better generalization performance than the ELML1 algorithm across different regularization coefficients.

6. Conclusions

In this paper, we use a smoothing function to approximate the $L_0$ regularizer and propose a pruning method with smoothing $L_0$ regularization (ELMSL0) for training and pruning the extreme learning machine. It is shown that the ELMSL0 algorithm can produce sparse results to prune networks effectively. Both regression and classification experiments show that the ELMSL0 algorithm has better generalization performance and a simpler network structure. In the future, we will consider applying intelligent optimization algorithms to ELM to find the most appropriate weights and thresholds.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Special Science Research Plan of the Education Bureau of Shaanxi Province of China (No. 18JK0344), the Doctoral Scientific Research Foundation of Xi'an Polytechnic University (No. BS1432), and the 65th China Postdoctoral Science Foundation (No. 2019M652837).