Abstract

This paper analyzes the combined effect of novel activation functions and loss functions based on M-estimation applied to the extreme learning machine (ELM), a feed-forward neural network. Owing to their computational efficiency and classification/prediction accuracy, ELM and its variants have been widely exploited in the development of new technologies and applications. However, in real applications, the performance of classical ELMs deteriorates in the presence of outliers, which negatively affects the precision and accuracy of the system. To further enhance the performance of ELM and its variants, we propose novel activation functions based on the psi functions of M-estimation and redescending M-estimation methods, together with smooth 2-norm weight-loss functions, to reduce the negative impact of outliers. The proposed psi functions of several M and redescending M-estimation methods are more flexible and produce a more distinct feature space. To the best of our knowledge, the idea of using the psi function as an activation function in a neural network is introduced in the literature for the first time to ensure accurate prediction. In addition, new robust 2-norm loss functions based on M and redescending M-estimation are proposed to deal with outliers efficiently in ELM. To evaluate the performance of the proposed methodology against other state-of-the-art techniques, experiments have been performed in diverse environments, which show promising improvements in application to regression and classification problems.

1. Introduction

Neural networks (NNs) are biologically inspired predictive techniques that mimic the behavior and neural processing of the biological nervous system. NNs have been extensively and successfully applied to pattern recognition, time series prediction and modeling, adaptive control, classification, and other areas of artificial intelligence (AI). Advancements in AI applications depend upon robust machine learning algorithms. The shortcomings of traditional NNs were overcome by Zhu et al. [1], who developed a single-layer feed-forward neural network called the ELM, noted for its high speed and accuracy; Huang et al. [2] further examined its performance against the popular back-propagation neural network (BPNN) and the support vector machine (SVM) in regression and classification problems. ELM has been used widely in numerous application domains, such as biomedical engineering, system identification, computer vision, control, and robotics [3]. Harikumar et al. [4] developed an ELM-based classifier for epilepsy identification from EEG signals and compared its computational performance with BPNN. In reference [5], Li et al. deployed ELM for daily streamflow forecasting and showed better performance than random forest; ELM performed considerably faster without a significant loss in accuracy. Bhatia et al. [6] used ELM for plant disease prediction on a highly imbalanced dataset. A fabric wrinkle evaluation model with regularized ELM based on improved Harris hawks optimization was discussed in [7]. In short, further applications of ELM found in the literature include mine reclamation based on remote sensing information and error compensation, cooperative spectrum sensing for cognitive radio networks, detection of total iron content, evaluation of the impact of shape factor on the discharge coefficient of side orifices, coal exploration based on a multilayer extreme learning machine and satellite images, e-mail spam filtering, and emotion recognition in Election Day tweets, where the performance was compared with existing classification approaches and the highest accuracy was noted [8–14]. In machine learning, ELM has been introduced as a better alternative to existing algorithms used in several supervised and unsupervised learning tasks. There is no need to iteratively tune the input, hidden, and output layer weights and biases as in BPNN [1, 15–19]. Because of this, ELM is capable of low-cost, high-speed learning with good generalization accuracy and performance. ELM can introduce nonlinearity, when required, using different differentiable or nondifferentiable activation functions in its training/testing phase, and it possesses a unique solution to many complex problems in practice [20]. Furthermore, thanks to its analytical solution, ELM avoids the problems of overfitting and local minima [1, 20]. ELM requires fewer hyperparameters to be optimized, such as the activation function and the hidden layer size, than other techniques such as conventional neural networks, SVM, and least-squares SVM, at a similar computational cost. ELM fills the gap between biological learning and conventional learning machines [21]. Despite the great merits of classical ELM, it has several deficiencies, such as sensitivity to data contamination and an ill-posed structure, due to which the analytical solution of the output weights is sometimes not possible because of noninvertibility and sensitivity to hyperparameters. Deng et al. [22] and Horata et al. [23] introduced regularized and weighted regularized ELM (WRELM) to solve the problems of overfitting and noninvertibility.
Following Deng et al. [22] and Horata et al. [23], Barreto et al. [24] used robust M-estimator-based cost functions to downweigh outliers and avoid their negative effects on the computational process in image classification with salt-and-pepper noise. Zhang and Luo [25] developed the outlier-robust extreme learning machine, a novel variant of ELM for regression and classification based on a robust norm criterion and the augmented Lagrange multiplier (ALM) method, and compared its performance with the weighted regularized extreme learning machine on real benchmark data from the UCI machine learning repository. Chen et al. [26] applied some popular M-estimator-based weight-loss functions to regression problems in the presence of outliers instead of the Huber loss function, since the latter grows linearly with the error and has no smoothing criterion to downweigh outliers properly. Recently, more robust M-estimators have been developed by many researchers in statistical regression analysis to properly filter and smoothly reduce the negative effects of outliers. Detailed information about existing and recently developed M-estimation-based objective, psi, and weight functions is given in Table 1 [27–29]. Almost all researchers, including [1, 2, 6–26], used sigmoid, sine, cosine, Gaussian, tan-sigmoid, ReLU, radial basis function (RBF), and their modified versions as activation functions in ELM, and also used their variants to introduce nonlinearity in the hidden layer space of the neural network. Unlike traditional gradient-based learning algorithms, which work only with differentiable activation functions, ELM can readily be trained with many nondifferentiable activation functions. Huang et al. [2] discussed the limitations of popular S-shaped activation functions bounded between 0 and 1 or between −1 and 1, which suffer from a diminishing gradient at the extreme edges, making it difficult to differentiate between good and bad observations during training; because of this, significant information may be lost. Liu et al. [30] introduced a robust activation function (RAF) to keep the activation function output away from zero as much as possible and make the inputs fully informative. The very same problem may occur with the tan-sigmoid function as well as with the RAF introduced by Liu et al. [30] in ELM, which still lacks robustness against outliers. Sibi et al. [31] studied the effects of different activation functions while training BPNN to extract useful information by transforming inputs into output signals; they concluded that no significant difference was found among them that would justify preferring one over another. Gomes et al. [32] analyzed the performance of different activation functions in NNs for accurately forecasting time series data. Later, Essai and Ellah [33] performed experiments using robust M-estimator objective functions as activation functions, which outperformed the activation functions used earlier in the literature. Freire and Barreto [34] used the idea of batch intrinsic plasticity (BIP) to maximize hidden layer information combined with robust estimation of the output weights. This paper proposes several redescending M-estimator psi functions as activation functions in ELM and its variants, complemented by weight-loss functions, to smoothly avoid the negative impact of contamination. A redescending psi function $\psi(r)$ is piecewise continuous and redescends towards zero for $|r| > c$, where the constant $c$ is often called the rejection point beyond which observations are treated as real outliers.
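To make the role of a redescending psi function as an activation concrete, the following minimal sketch builds an ELM hidden layer with Tukey's bisquare psi; the bisquare and its conventional tuning constant c = 4.685 are illustrative assumptions only, while the psi functions actually proposed in this paper are those listed in Tables 1 and 2.

```python
import numpy as np

# A minimal sketch of a redescending psi function used as a hidden-layer
# activation in ELM. Tukey's bisquare psi and c = 4.685 are illustrative
# assumptions, not the paper's own choices.

def bisquare_psi(z, c=4.685):
    """Tukey bisquare psi: redescends to zero for |z| > c (the rejection point)."""
    out = z * (1.0 - (z / c) ** 2) ** 2
    return np.where(np.abs(z) <= c, out, 0.0)

def hidden_layer(X, W_in, b, activation=bisquare_psi):
    """Hidden-layer output H = activation(X W_in^T + b) of a single-layer ELM."""
    return activation(X @ W_in.T + b)

# Random input weights and biases, as in standard ELM.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 samples, 3 features
W_in = rng.normal(size=(10, 3))  # 10 hidden nodes
b = rng.normal(size=10)
H = hidden_layer(X, W_in, b)     # shape (5, 10)
```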
This study aims to extend the applicability of high-breakdown M-estimator psi functions as activation functions, against other competitors, complemented by robust loss functions in ELM and its variants. To evaluate the performance of the proposed methodology against other state-of-the-art techniques, experiments have been performed in diverse environments, which show promising improvements in application to regression and classification problems. The remainder of the paper is organized as follows. Section 2 first reviews related work on ELM and its variants and extends it to the proposed methodology; Section 3 defines the experimental design of the simulation study; Section 4 presents the results and discussion; and the last section gives the conclusions and future work. Figure 1 shows all the loss functions.

1.1. Extended ELM Based on Convex and Nonconvex 2-Norm Loss Functions

The overfitting problem in ELMs that use the 2-norm loss function is largely caused by outliers in the data; in real scenarios, far-away observations can bias the statistical analysis. Wang et al. [35] exploited a strict nonconvex loss function to mitigate the effect of wild observations, if any, while training the desired network. Although the outcomes are satisfactory in specific applications, such a strict cut-off can negatively impact overall accuracy and stability in general applications.

In our approach, inspired by Wang et al. [35], a smooth nonconvex 2-norm loss function based on M and redescending M-estimation theory is incorporated into ELM, because a strict nonconvex loss function sometimes discards valuable information. The three alternatives are shown graphically in Figure 1: (a) the 2-norm loss function, which uses all the data, even outliers, while training the model; (b) the strict nonconvex loss function, which keeps the data in their original form up to a specific point and excludes observations beyond it; and (c) the smooth nonconvex 2-norm loss function, which assigns weights that decrease as the residuals increase.

Case 1. Without truncation, the loss reduces to the convex 2-norm loss $\ell(r) = r^{2}$, which minimizes the training error while keeping all observations, outliers included.

Case 2. When $c$ is a predefined constant, the loss reduces to the strict nonconvex 2-norm loss $\ell(r) = \min(r^{2}, c^{2})$, applied by Wang et al. [35] in ELM to reduce the negative impact of outliers during model training.

Case 3. The proposed loss function $\ell(r) = \rho(r)$ is based on M and redescending M-estimation theory, where outliers are smoothly downweighted to normalize the training data while training the classifier.
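As an illustration of these three shapes from Figure 1, the short sketch below uses assumed standard forms: the squared loss, a truncated squared loss of the kind used by Wang et al. [35], and Welsch's smooth redescending loss as a stand-in for the proposed Table 2 losses.

```python
import numpy as np

# Illustrative shapes of the three losses in Figure 1, under assumed standard
# forms; Welsch's rho is only a stand-in for the paper's proposed losses.

def square_loss(r):
    return r ** 2                                            # (a) counts every residual, outliers included

def truncated_square_loss(r, c=2.0):
    return np.minimum(r ** 2, c ** 2)                        # (b) flat (constant) beyond the cut-off c

def welsch_loss(r, c=2.985):
    return (c ** 2 / 2.0) * (1.0 - np.exp(-((r / c) ** 2)))  # (c) bounded and smooth, downweights large r

r = np.linspace(-6.0, 6.0, 13)
for loss in (square_loss, truncated_square_loss, welsch_loss):
    print(loss.__name__, loss(r)[-1])                        # value at r = 6 shows how each treats a large residual
```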

For supervised learning data $\{(\mathbf{x}_j, t_j)\}_{j=1}^{N}$ in its conventional form, the ELM model with $L$ hidden nodes is
$f(\mathbf{x}_j)=\sum_{i=1}^{L}\beta_i\, g(\mathbf{w}_i\cdot\mathbf{x}_j+b_i)=t_j,\quad j=1,\dots,N,$ (1)
or, in matrix form, $\mathbf{H}\boldsymbol{\beta}=\mathbf{T}$, (2)
and the general form of the ELM objective function is
$\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}}\ \frac{1}{2}\lVert\boldsymbol{\beta}\rVert^{2}+\frac{C}{2}\sum_{j=1}^{N}\ell(\xi_j)$ (3)
subject to $\mathbf{h}(\mathbf{x}_j)\boldsymbol{\beta}=t_j-\xi_j,\quad j=1,\dots,N,$ (4)
where $\mathbf{w}_i$ is the input weight vector and $b_i$ the bias term, randomly generated from any continuous distribution and connecting the input nodes to the hidden nodes, and $\boldsymbol{\beta}$ is the weight vector connecting the hidden nodes to the output neurons.

The Lagrange function for the optimization of (3) and (4) becomes
$\mathcal{L}(\boldsymbol{\beta},\boldsymbol{\xi},\boldsymbol{\alpha})=\frac{1}{2}\lVert\boldsymbol{\beta}\rVert^{2}+\frac{C}{2}\sum_{j=1}^{N}\ell(\xi_j)-\sum_{j=1}^{N}\alpha_j\big(\mathbf{h}(\mathbf{x}_j)\boldsymbol{\beta}-t_j+\xi_j\big),$ (5)
where $\boldsymbol{\beta}$ is the output weight vector and $\ell(\cdot)$ is the loss function. The parameter $C$ acts in the network as a regularization agent to maintain the bias-variance trade-off. Traditional ELM uses a simple squared loss function, which is highly sensitive to outliers. Therefore, M-estimator and redescending M-estimator loss functions are used to enhance the robustness of ELM against outliers: $\rho(r_j)$ denotes the robust loss function and $r_j$ the standardized residual. The psi function of $\rho$ is $\psi(r)=\partial\rho(r)/\partial r$, and the corresponding weight function is $w(r)=\psi(r)/r$. In the present work, efficient M-estimation-based loss functions are studied, with their psi functions used as activation functions and complemented by their loss functions to gain maximum accuracy. Detailed information on the proposed M-estimators, with their objective, psi, and weight functions, is given in Table 2.

After simplification, the output weight estimate minimizing the objective function (5) with the 2-norm smooth loss function and regularization term can be written as
$\hat{\boldsymbol{\beta}}=\Big(\frac{\mathbf{I}}{C}+\mathbf{H}^{T}\mathbf{W}\mathbf{H}\Big)^{-1}\mathbf{H}^{T}\mathbf{W}\mathbf{T},$ (6)
where $\mathbf{W}=\operatorname{diag}\big(w(r_1),\dots,w(r_N)\big)$ is the diagonal matrix of weights assigned by the weight function to the training data. Cases 1 and 2 are special cases of the solution given in (6).
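A minimal sketch of evaluating this weighted, regularized solution follows, assuming the standard weighted regularized ELM form written above; the parameter C and the weights come from whichever Table 2 weight function is chosen.

```python
import numpy as np

# Sketch of the assumed closed form beta = (I/C + H^T W H)^{-1} H^T W T
# with W = diag(w(r_1), ..., w(r_N)); single-output (1-D targets) case.

def weighted_output_weights(H, T, weights, C=1.0):
    """Solve (I/C + H^T W H) beta = H^T W T for beta."""
    L = H.shape[1]
    A = np.eye(L) / C + H.T @ (weights[:, None] * H)
    return np.linalg.solve(A, H.T @ (weights * T))

# With unit weights this reduces to the ordinary regularized ELM solution
# (Case 1); 0/1 weights reproduce the strict truncated loss (Case 2).
```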

1.2. Proposed Iterative Reweighted Algorithm for Robust ELM

Input: training data $\{(\mathbf{x}_j, t_j)\}_{j=1}^{N}$, number of hidden nodes $L$, maximum number of iterations, tolerance $\varepsilon$, regularization parameter $C$, and activation function $g(\cdot)$ given in Table 2 (only the psi functions of Table 2 are used as activation functions). Calculate the hidden layer output matrix $\mathbf{H}$ and initialize the weight matrix $\mathbf{W}=\mathbf{I}$.

Step 1. Compute the initial output weights $\hat{\boldsymbol{\beta}}^{(0)}$ by equation (6) with $\mathbf{W}=\mathbf{I}$; the corresponding estimated function is given in (2).

Step 2. Obtain the residuals, standardize them using robust location and scale estimates, and assign weights using the existing and proposed M-estimation-based weight functions given in Table 2 to update $\mathbf{W}$.

Step 3. Update the output weights computed in Step 1 by re-solving equation (6) with the updated $\mathbf{W}$.

Step 4. If the change in the output weights is below the tolerance $\varepsilon$ or the maximum number of iterations is reached, stop and proceed to Step 5; otherwise, return to Step 2.

Step 5. Finally, the estimated function is obtained by substituting $\hat{\boldsymbol{\beta}}$ into equation (1).
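The following sketch puts Steps 1-5 together, under the same assumed forms as above: solution (6) for the output weights, Tukey's bisquare weight as a stand-in for the Table 2 weight functions, and median/MAD as one common choice of robust location and scale (the paper's exact choices may differ).

```python
import numpy as np

# A minimal sketch of the iteratively reweighted robust ELM, Steps 1-5.
# The bisquare weight and median/MAD standardization are illustrative assumptions.

def bisquare_weight(r, c=4.685):
    w = (1.0 - (r / c) ** 2) ** 2
    return np.where(np.abs(r) <= c, w, 0.0)

def solve_beta(H, T, weights, C):
    L = H.shape[1]
    A = np.eye(L) / C + H.T @ (weights[:, None] * H)
    return np.linalg.solve(A, H.T @ (weights * T))

def robust_elm_fit(H, T, C=1.0, max_iter=50, tol=1e-6, weight_fn=bisquare_weight):
    weights = np.ones(H.shape[0])                     # initialize W = I
    beta = solve_beta(H, T, weights, C)               # Step 1: initial output weights by (6)
    for _ in range(max_iter):
        e = T - H @ beta                              # Step 2: residuals
        scale = 1.4826 * np.median(np.abs(e - np.median(e))) + 1e-12
        r = (e - np.median(e)) / scale                # robust standardization (median/MAD)
        weights = weight_fn(r)                        # update W with the chosen weight function
        beta_new = solve_beta(H, T, weights, C)       # Step 3: recompute output weights
        if np.linalg.norm(beta_new - beta) < tol:     # Step 4: convergence check
            beta = beta_new
            break
        beta = beta_new
    return beta                                       # Step 5: plug into the ELM model (1) for prediction
```

Predictions for new inputs are then obtained by building the hidden-layer matrix for the new data with the same random input weights and multiplying it by the returned output weights.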

1.3. Experimental Design and Simulation Studies

This section elaborates the mechanism used to assess the performance of the proposed method against RELM and the existing weighted RELM. Several redescending M-estimator psi functions are considered as activation functions to build the nonlinear hidden layer space from the input space. In the ELM-related literature, including [1–25], common activation functions such as logistic sigmoid, tan-sigmoid, ReLU, softsign, sine, cosine, leaky ReLU, bent identity, and arctan are used in ELM and its variants. Furthermore, we use redescending M-estimator-based nonconvex loss functions to reduce the effect of outliers in our proposed studies. The details of the nonconvex loss functions based on M-estimation are given in Table 2, and the proposed psi functions are listed in Table 3 for convenience. We used different numbers of hidden layer neurons to assess the performance of the proposed strategy, but the results are reported here for a single size only, as our objective is not to optimize the hidden layer size. All experiments were carried out in an RStudio environment running on an Intel Core m3 7th Gen PC. In each experiment, the dataset is split with a ratio of 70:30 into training (70%) and testing (30%) sets. We checked the performance of the proposed methods on two benchmark regression datasets, the Boston housing price data and the abalone age prediction data; however, due to space limitations, only the results for the abalone dataset, with and without artificial outliers, are kept to assess the performance of each method. Different scaling techniques have been used in the literature to rescale the data, such as linear min-max scaling to (0, 1) or (−1, 1) and statistical standardization. We used min-max scaling to map all attributes and the response variable to the range (0, 1) before training the proposed and existing networks. In each trial, the training dataset is contaminated with 20% outliers generated from a uniform distribution but highly distant from the remaining data; both the existing RELM and the proposed RELM are trained on this set and their performance is checked on the test set. The proposed RELM and the existing methodologies are each run 50 times, and the training and testing root mean square errors (RMSE) with their standard deviations (SD) are recorded for the regression case. A step-by-step block diagram of the proposed strategy is given in Figure 2. To assess the performance of the proposed strategy in classification, we consider the benchmark Iris, satellite image (Satimage), and e-mail spam filtering datasets from the UCI machine learning repository. Illustrative images of the classification applications are shown in Figure 3 to convey the relevance of the proposed work. We used three random choices for weight initialization, drawn from the standard normal, uniform (−1, 1), and exponential distributions; however, due to space limitations, only the results for the standard normal distribution are reported, as no significant impact of the initial weights was found.
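A brief sketch of this pipeline follows: min-max scaling to (0, 1), a 70:30 train/test split, and contamination of 20% of the training targets with distant values from a uniform distribution. The placeholder dataset shapes and the contamination interval are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Sketch of the experimental pipeline; only the scaling range, split ratio,
# and 20% contamination level follow the text, the rest is illustrative.

def minmax_scale(a):
    lo, hi = a.min(axis=0), a.max(axis=0)
    return (a - lo) / (hi - lo + 1e-12)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(500, 8)), rng.normal(size=500)      # placeholder data
X, y = minmax_scale(X), minmax_scale(y)

idx = rng.permutation(len(y))
n_train = int(0.7 * len(y))
train_idx, test_idx = idx[:n_train], idx[n_train:]

n_out = int(0.2 * n_train)                                   # 20% contamination of the training targets
y_train = y[train_idx].copy()
outlier_pos = rng.choice(n_train, size=n_out, replace=False)
y_train[outlier_pos] = rng.uniform(5.0, 10.0, size=n_out)    # far outside the scaled (0, 1) range
```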

1.4. Performance Metrics

Root mean square error: $\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(t_i-\hat{t}_i\big)^{2}}$.

Accuracy = (number of cases correctly classified by the classifier / total number of cases) × 100.
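For concreteness, the two metrics as commonly defined can be computed as follows:

```python
import numpy as np

# Root mean square error for regression and percentage accuracy for classification.

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def accuracy(labels_true, labels_pred):
    """Percentage of correctly classified cases."""
    return 100.0 * np.mean(np.asarray(labels_true) == np.asarray(labels_pred))
```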

2. Results and Discussions

The results of the simulation study, presented in Tables 4 and 5, show the performance of each activation function complemented with its corresponding loss function in terms of RMSE and SD for the regression application.

In Tables 4 and 5, the proposed methods are applied to the well-known abalone age prediction data, where they outperform other state-of-the-art techniques in terms of RMSE and SD on both clean and contaminated data. For further confirmation, the robust ELM of Wang et al. [35] with the sigmoid activation function, called iteratively reweighted ELM (IRRELM), is compared with ELM, WELM, ORELM, and IRWELM; these results are reported in Table 6 for different contamination levels, and the performance of the proposed methods is further shown in Figure 4. Comparing these results with our proposed methods clearly shows improved efficiency at all levels of outliers. To further check the performance of the proposed methods, we extended their application to classification problems and considered three popular benchmark datasets to show their classification accuracy. The following table demonstrates the efficiencies of the existing and proposed methods.

Table 7, along with Figure 5, describes the performance of the proposed methods in terms of percentage testing accuracy with SD on well-known low- and high-dimensional data, as also depicted in Figure 4. To clarify the results given in Table 7: for each simulation trial on the Iris, Satimage, and e-mail spam filtering data, the training and testing datasets are randomly drawn from the corresponding database. Fifty trials were run for the ELM algorithm using different activation functions trained with a fixed hidden layer size, and at the end the average testing accuracy with its standard deviation was measured. In the case of the Satimage dataset, one of our proposed activation functions reaches an average accuracy of 89.86% with a standard deviation of 0.8644. These results are compared with Huang et al. [2], whose ELM classifier with the sigmoid activation function obtained 89.04% accuracy with a standard deviation of 1.57 using 500 hidden nodes, whereas the proposed activation function achieved higher accuracy with only 300 hidden nodes, clearly showing that ELM with the proposed activation function yields a classifier with lower computational complexity and higher accuracy. The remaining proposed activation functions showed performance similar to all competitors in the same experiment. For the other high-dimensional dataset, the e-mail spam data, proposed activation functions 8 and 9 outperform all existing and other proposed activation functions with a higher average accuracy of 93.66%. On the Iris data, the classification accuracies of the proposed and existing activation functions are nearly the same.

3. Conclusion

This paper proposed new robust activation functions complemented by weight-loss functions based on M and redescending M-estimation in ELM with a 2-norm regularization criterion for solving regression and classification problems. The focus of this work was to introduce the psi functions of different redescending M-estimators as activation functions in ELM, complemented by existing and some new weight-loss functions. In the prediction task, the proposed methods show improvements in accuracy and precision over existing methods for predicting the age of abalone, with and without outliers added to the training set. Several combinations of activation and loss functions were studied, and their performance was compared with that of the proposed combinations; the proposed combinations of activation and weight-loss functions outperformed existing methods in regression problems in the presence of contamination. Moreover, the application of the proposed activation functions in ELM was extended to assess classification accuracy on low- and high-dimensional datasets, and in almost all classification applications the predictive performance of the proposed activation functions in ELM was superior. For instance, in the regression application with the abalone dataset, the proposed activation function along with the weighted loss function performed better than the existing combinations of activation and weight-loss functions in the extreme learning machine in the presence of outliers. Furthermore, the proposed activation functions were applied to classification problems using the well-known Iris, Satimage, and e-mail spam datasets, where some of them outperform their existing competitors. In the future, the role of the proposed activation functions in ELM can be studied with different convolutional neural networks (CNNs), such as GoogLeNet, AlexNet, VGG-16, and ResNet, using feature selection techniques for image data along with well-known robust statistical feature selection methods. Moreover, the applications of the proposed activation functions in extreme learning machines can be extended to analyze their performance in epilepsy identification from EEG signals, emotion recognition in Election Day tweets, total iron content detection, and fault diagnosis of electric impact drills using thermal imaging.

Data Availability

Data and programming codes are available on request by contacting the first author. Moreover, the datasets used are available in the Kaggle repository.

Conflicts of Interest

The authors declare that they have no conflicts of interest.