Special Issue: Data Analysis and Intelligent Decision Technology in the Energy Industry
Fractional Rectified Linear Unit Activation Function and Its Variants
This paper focuses on deriving and validating the fractional-order form of rectified linear unit activation function and its linear and nonlinear variants. The linear variants include the leaky and parametric, whereas the nonlinear variants include the exponential, sigmoid-weighted, and Gaussian error functions. Besides, a standard formula has been created and used while developing the fractional form of linear variants. Moreover, different expansion series such as Maclaurin and Taylor have been used while designing the fractional version of nonlinear variants. A simulation study has been conducted to validate the performance of all the developed fractional activation functions utilizing a single and multilayer neural network model and to compare them with their conventional counterparts. In this simulation study, a neural network model has been created to predict the system-generated power of a Texas wind turbine. The performance has been evaluated by varying the activation function in the hidden and output layers with the developed functions for single and multilayer networks.
An activation function in a neural network uses simple mathematical operations to determine whether a neuron’s inputs are relevant. Thus, it decides whether a neuron should be activated or deactivated. Over the years, many activation functions have been proposed. Among them, the prominently used functions are sigmoid, Hermite, mish, binary step, rectified linear unit (ReLU), softplus, adaptive spline, and hyperbolic tangent (tanh) [2–6]. The activation functions widely considered in the hidden layers of neural networks are ReLU, sigmoid, linear, and tanh [7–10]. Depending on the type of prediction the model requires, the output layer of a neural network will often use a different activation function than the hidden layers. Typical choices are a linear function for regression-type output layers and softmax for multi-class classification [7, 9].
The sigmoid function is smooth and continuously differentiable. However, it is not symmetric around zero; therefore, the outputs of all neurons have the same sign. The tanh function is steeper than the sigmoid and symmetric around the origin. Compared to the sigmoid and tanh functions, the ReLU function is computationally efficient. Nevertheless, the gradient of the ReLU function is 0 on the negative axis, which deactivates the neurons in that region [7, 9, 12]. Thus, a leaky ReLU (LReLU) has been proposed to overcome this issue by slightly modifying the function for negative inputs so that it returns a nonzero value on the negative plane. Hence, there are no dead neurons in that region when a leaky ReLU is used [4, 7, 12]. The parametric ReLU (PReLU) is another choice if leaky ReLU fails to solve the issue. The nonlinear variants of ReLU change the slope of the negative plane of the function by using nonlinear functions [4, 7, 12].
Fractional calculus extends derivatives and integrals to noninteger (fractional) orders [13, 14]. Over the years, modeling with fractional differential equations has often been reported to give more accurate results than conventional differentiation. Precise modeling is particularly necessary for systems that need precise damping. Recently, various researchers have applied fractional calculus in neural networks. For instance, robust stability and synchronization have been studied for fractional-order delayed neural networks. In a similar vein, an approach has been introduced that allows the neural network to search for and improve its activation functions throughout the training phase by defining the fractional-order derivative of given primitive activation functions. Further, fractional activation functions have been infused into feedforward neural networks, providing more tunable hyper-parameters. Besides, a method based on Caputo’s and Grünwald-Letnikov’s fractional derivatives has been considered for feedforward neural networks that solve fractional-order delay optimal control problems. Another reported work has used Grünwald-Letnikov’s derivative in the learning algorithm to reduce the convergence error and improve the convergence speed. A reported single-point search algorithm has also produced an efficient global learning machine for determining the optimal search path.
Motivated by the above literature, this paper derives the fractional-order form of ReLU and its variants. First, the fractional derivative of ReLU is calculated, followed by those of its variants. Two types of variants are derived: the linear variants, which consist of LReLU and PReLU, and the nonlinear variants, which comprise the exponential linear unit (ELU), sigmoid-weighted linear unit (SiLU), and Gaussian error linear unit (GELU). A standard formula is employed to calculate the fractional-order derivative of the linear variants, whereas series expansions are applied first to calculate the fractional-order derivative of the nonlinear variants.
The remaining contents of the manuscript are organized as follows: Section 2 defines the fractional derivative of noninteger order. Section 3 presents the fractional form of ReLU and its variants. The simulation study on the performance of the developed fractional ReLU functions is given in Section 4. Finally, Section 5 summarizes the essential findings.
2. Fractional Derivative
The derivative in (3) can be rewritten using the factorial function.
It is to be noted that (4) is valid for non-negative integers only. Therefore, the fractional derivative for a noninteger order can be obtained by replacing the factorial with the gamma function, which extends the factorial to noninteger arguments.
It is worth noting that the fractional derivative derived in (5) is valid for .
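The displayed equations of this section are not reproduced in the text above. As a hedged sketch only, assuming the usual power-rule route and writing the power as $k$, the integer order as $n$, and the fractional order as $\alpha$ (symbol names chosen here for illustration, not confirmed by the source), the two steps described would read:

```latex
% Integer-order derivative of a power function, written with factorials
% (assumed form; valid for integers k >= n >= 0)
\[
  \frac{d^{n}}{dx^{n}}\, x^{k} = \frac{k!}{(k-n)!}\, x^{k-n}
\]

% Replacing each factorial by the gamma function, with \Gamma(m+1) = m!,
% gives a noninteger-order version of the same rule
\[
  D^{\alpha} x^{k} = \frac{\Gamma(k+1)}{\Gamma(k+1-\alpha)}\, x^{k-\alpha}
\]
```

Setting $\alpha = n$ in the second expression recovers the first, which is a quick consistency check on the generalization.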
3. Development of Fractional Rectified Linear Unit Activation Function
The ReLU function has become one of the default activation functions for many neural networks, including convolutional neural networks, because models with ReLU train more quickly and generally deliver higher performance. The performance of ReLU and its variants may be improved further using fractional calculus. Therefore, this section develops the fractional form of ReLU and its variants.
3.1. Fractional ReLU (FReLU)
ReLU is a piecewise linear function: if the input is positive, the output equals the input; otherwise, the output is zero. The ReLU formula is therefore [15, 17, 18] $\mathrm{ReLU}(x) = \max(0, x)$, where $x$ is the input. Equivalently, $\mathrm{ReLU}(x) = x$ for $x > 0$ and $\mathrm{ReLU}(x) = 0$ otherwise.
The fractional ReLU (FReLU) function of a given order is defined as the fractional derivative of ReLU of that order. Applying the fractional derivative derived in (5) to the positive branch of ReLU and simplifying the resulting gamma-function factor gives the FReLU function.
The response of FReLU for various fractional orders, in comparison with the response of conventional ReLU, is shown in Figure 1. The figure shows that FReLU provides more flexibility than ReLU: for smaller orders the function behaves like a ReLU, whereas for larger orders it behaves like a step.
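The limiting behavior described above can be checked numerically. The following is a minimal sketch, assuming that FReLU follows from the gamma-function power rule applied to the positive branch of ReLU (power 1) and is zero elsewhere; the function names `frac_power_derivative` and `frelu` and the symbol `alpha` for the order are hypothetical and not taken from the paper.

```python
import math


def frac_power_derivative(x, k, alpha):
    """Order-alpha derivative of x**k for x > 0, using the gamma-function power rule."""
    return math.gamma(k + 1) / math.gamma(k + 1 - alpha) * x ** (k - alpha)


def frelu(x, alpha):
    """Hypothetical FReLU: power rule with k = 1 for x > 0, zero otherwise (assumed form)."""
    if x <= 0:
        return 0.0
    return frac_power_derivative(x, 1, alpha)


# Compare a small, an intermediate, and a large order on a few inputs.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, [round(frelu(x, alpha), 4) for x in (-1.0, 0.5, 2.0)])
```

Under this assumed form, an order of 0 returns the input unchanged for positive inputs (ReLU), and an order of 1 returns a constant for positive inputs (a step), which matches the trend described for Figure 1.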
3.2. Linear Variants
3.2.1. Fractional Leaky ReLU
The function in (12) can also be rewritten as follows:
Therefore, the fractional LReLU (FLReLU) function of a given order is obtained from (13).
The response of FLReLU for various fractional orders, in comparison with the response of conventional LReLU, is shown in Figure 1(b). As shown in the figure, in the positive plane, the response of FLReLU is equivalent to the response of FReLU. However, in the negative plane, the response of conventional LReLU starts at 0.1, whereas the starting point of FLReLU lies within a range. Thus, it provides more flexible behavior than conventional LReLU.
3.2.2. Fractional Parametric ReLU
The PReLU is another linear variant of the ReLU activation function. In PReLU, the leakage coefficient becomes a parameter that is learned together with the other neural network parameters. Its mathematical representation for a given input is provided in [22, 23], with the leakage coefficient as the hyper-parameter.
Using (16), the fractional PReLU (FPReLU) function of a given order is then defined.
The response of FPReLU for various fractional orders, in comparison with the response of conventional PReLU, is shown in Figure 1(c). The figure shows that, in the positive plane, FPReLU responds in the same way as FReLU and FLReLU. However, in the negative plane, the response of PReLU starts at a fixed value, whereas the starting point of FPReLU lies in a range that depends on the fractional order. In all these cases, the hyper-parameter has been set to 1.5.
3.3. Nonlinear Variants
3.3.1. Fractional Exponential Linear Unit
The ELU function is a nonlinear variant of ReLU. ELU bends smoothly on the negative side until its output saturates at a negative value set by its hyper-parameter, whereas ReLU has a sharp corner at the origin. ELU strives to bring the mean activations close to zero, which accelerates learning. Its mathematical representation is provided in [22, 23] and involves one hyper-parameter to be tuned.
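For reference, the conventional (non-fractional) ELU can be sketched as follows, using its standard textbook definition. The name `elu` and the parameter name `a` are chosen here for illustration; the default of 1.5 mirrors the hyper-parameter value that the paper reports using for its response plots.

```python
import math


def elu(x, a=1.5):
    """Conventional ELU: identity for x > 0, a * (exp(x) - 1) otherwise."""
    return x if x > 0 else a * math.expm1(x)


# The negative side flattens out towards -a as the input becomes more negative.
for x in (-10.0, -2.0, -0.5, 0.0, 1.0):
    print(x, elu(x))
```

The negative branch approaches `-a` for large negative inputs, which is the saturation behavior that distinguishes ELU from ReLU and LReLU.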
From (19), the fractional ELU (FELU) of a given order is computed piecewise. For positive inputs, it coincides with FReLU, whereas for the remaining inputs, the fractional derivative of the exponential branch is calculated through its series expansion.
The response of FELU for various fractional orders, in comparison with the response of conventional ELU, is shown in Figure 2. For the analysis, the hyper-parameter has been set to 1.5. As shown in the figure, for positive inputs, the response of FELU is equivalent to the response of FReLU. However, for negative inputs, the exponential branch varies with the fractional order. Thus, it can be concluded that FELU provides more flexible behavior than conventional ELU.
3.3.2. Fractional Sigmoid-Weighted Linear Unit
The SiLU function is another nonlinear variant of the ReLU activation function. It is calculated by multiplying its input by the sigmoid function and thus resembles a continuous and “undershooting” variant of ReLU. Mathematically [22, 23], $\mathrm{SiLU}(x) = x\,\sigma(x)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.
However, the procedure for computing the fractional derivative over the remaining input range is explained below. A Maclaurin series expansion, whose coefficients involve Bernoulli numbers, is written first, and a further expansion is then built from it.
The response of FSiLU for various fractional orders, in comparison with the response of conventional SiLU, is shown in Figure 2(b). The figure shows that, for positive inputs, FSiLU provides a response equivalent to FReLU. Moreover, for negative inputs, a clear variation in the continuous, undershooting region can be seen as the order varies from 1 to 0. This illustrates the flexibility of FSiLU compared to conventional SiLU.
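The series step can be illustrated numerically. The sketch below assumes that the function being expanded is the sigmoid and uses the first few terms of its standard Maclaurin series; the truncation order and the names `sigmoid_maclaurin` and `silu_maclaurin` are choices made here, not taken from the paper.

```python
import math


def sigmoid(x):
    """Exact sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))


# Leading Maclaurin terms of the sigmoid as (power, coefficient) pairs:
# 1/2 + x/4 - x^3/48 + x^5/480 - 17 x^7/80640
SIGMOID_TERMS = [(0, 1 / 2), (1, 1 / 4), (3, -1 / 48), (5, 1 / 480), (7, -17 / 80640)]


def sigmoid_maclaurin(x):
    """Truncated Maclaurin approximation of the sigmoid."""
    return sum(c * x ** p for p, c in SIGMOID_TERMS)


def silu_maclaurin(x):
    """Truncated series for SiLU: the input times the sigmoid series."""
    return x * sigmoid_maclaurin(x)


# Compare the exact and truncated values near the origin.
for x in (-1.0, -0.5, 0.5, 1.0):
    print(x, sigmoid(x), sigmoid_maclaurin(x))
```

Because every term is a power of the input, a power-rule fractional derivative can be applied term by term. The series converges only for inputs of magnitude below pi, so a truncated version is reliable only near the origin.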
3.3.3. Fractional Gaussian Error Linear Unit
GELU is another nonlinear variant of ReLU, and it is a smooth approximation of ReLU. In contrast to ReLU, the negative plane of GELU has a nonmonotonic bump, and its nonlinearity weights inputs by their value. GELU is the default activation for several natural language processing models. The mathematical model of GELU is provided in [22, 23] in terms of a Gaussian function, which is defined using Euler’s number together with the position of the center of the peak and the standard deviation.
Therefore, from (34), the fractional GELU (FGELU) of a given order is computed piecewise. Over the remaining input range, the relevant function is first expanded using Taylor’s series, the expansion in (38) is then multiplied through on both sides, and the fractional derivative is computed from the result.
Figure 2(c) shows the response of FGELU for various fractional orders in comparison with the response of conventional GELU. The response shows that FGELU behaves similarly to FReLU for positive values of the input. In contrast, for negative values of the input, the variation in the nonmonotonic bump and nonlinearity can be seen as the order varies from 1 to 0. Thus, similar to the other variants, FGELU achieves more flexibility than conventional GELU.
The summary of all the proposed activation functions is provided in Table 1.
4. Simulation Study
In this section, the performance of all the developed fractional activation functions is compared with that of their conventional forms. The comparison has been made in three cases. A single hidden layer feedforward neural network has been used in the first and second cases. In the third case, a multilayer feedforward neural network with two hidden layers is used to evaluate the performance. In the first case, the activation function in the hidden layer is varied between the conventional and proposed activation functions, and the output layer activation function is kept constant at purelin. In the second case, the activation function in the output layer is varied between the conventional and proposed activation functions, while the hidden layer activation function is kept constant at tansig. In the third case, the activation functions in both hidden layers are varied between the conventional and proposed activation functions, and the output layer activation function is kept constant at purelin.
For all the case studies, a Texas wind turbine dataset has been chosen, with two input parameters: wind speed and wind direction. The dataset consists of 2000 samples for each parameter, as shown in Figure 3, and the system-generated power is the output parameter. Figure 3 shows that the data are highly nonlinear and chaotic. The dataset is divided into training and testing sets of 80% and 20%, respectively. Three case studies with different neural network architectures have been employed for the performance comparison, and each is discussed in detail below. The Levenberg-Marquardt algorithm is used for training in all cases. The performance measures used in this simulation study are the mean square error (MSE) and the coefficient of determination ($R^2$), which are computed as $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $R^2 = 1 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / \sum_{i=1}^{n}(y_i - \bar{y})^2$, where $n$ is the number of samples, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the average of $y_i$.
Case 1. As mentioned earlier, this case study uses a single hidden layer feedforward neural network. The neural network used to predict the system-generated power from wind speed and wind direction is shown in Figure 4. The architecture shows that the numbers of nodes in the input, hidden, and output layers are 2 : 10 : 1. The figure also indicates the biases at the hidden and output layers [10, 25]. The activation function chosen at the output layer is purelin, while the activation function “F” at the hidden layer is varied to evaluate the performance.
The performance of the neural network with various activation functions at the hidden layer during training and testing is shown in Table 2. The table shows that all the activation functions achieved $R^2$ values of around 0.9 and MSE values of approximately 0.02. Thus, all activation functions performed well, with $R^2$ and MSE values close to 1 and 0, respectively. Further, all the fractional linear variants have performed better than the conventional ones, achieving the highest $R^2$ and lowest MSE values.
Overall, FLReLU and FSiLU have achieved the best performance among all the activation functions. Among the three nonlinear variants, FSiLU has performed better than SiLU, whereas the conventional ELU and GELU have achieved better performance than FELU and FGELU. The performance of the neural network with the FLReLU activation function for predicting the system-generated power during training and testing is shown in Figure 5. The response in the figure shows good tracking ability during both the training and testing phases.
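The two performance measures can be sketched in a few lines. This is a minimal illustration, assuming that the second measure is the coefficient of determination; the function names and the sample values are hypothetical and are not drawn from the wind turbine dataset.

```python
def mse(actual, predicted):
    """Mean square error between actual and predicted values."""
    n = len(actual)
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values, used only to exercise the two functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```

A perfect prediction gives an MSE of 0 and a coefficient of determination of 1, which is why values close to those limits are read as good performance.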
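The 2 : 10 : 1 architecture can be sketched as a plain forward pass. This is an illustration only: the weights are random rather than trained (Levenberg-Marquardt training is not reproduced here), the input values are hypothetical, and `frelu` is an assumed form of the fractional ReLU based on the gamma-function power rule rather than the paper's exact expression.

```python
import math
import random


def relu(z):
    """Conventional ReLU."""
    return max(0.0, z)


def frelu(z, alpha=0.5):
    """Hypothetical fractional ReLU: z**(1 - alpha) / gamma(2 - alpha) for z > 0, else 0."""
    return z ** (1.0 - alpha) / math.gamma(2.0 - alpha) if z > 0 else 0.0


def forward(inputs, w_hidden, b_hidden, w_out, b_out, activation):
    """One hidden layer with the given activation, followed by a linear output node."""
    hidden = [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


rng = random.Random(0)  # fixed seed so the untrained weights are reproducible
w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(10)]  # 10 nodes, 2 inputs
b_hidden = [rng.uniform(-1, 1) for _ in range(10)]
w_out = [rng.uniform(-1, 1) for _ in range(10)]
b_out = rng.uniform(-1, 1)

sample = [0.4, 0.7]  # hypothetical scaled (wind speed, wind direction) pair
for name, act in (("ReLU", relu), ("FReLU", frelu)):
    print(name, forward(sample, w_hidden, b_hidden, w_out, b_out, act))
```

Passing the activation as an argument mirrors the experimental design, in which only the hidden layer function “F” changes between runs while the linear output stays fixed.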
Case 2. Similarly, the neural network architecture to predict the system-generated power using wind speed and wind direction as inputs is developed in this case, as shown in Figure 6. The architecture shows that the number of nodes in the input, hidden, and output layers is 2 : 10 : 1 [10, 25]. In this case, the activation function chosen at the hidden layer is tansig, while the activation function at the output layer “F” is varied to evaluate the performance.
The performance of the neural network model with various activation functions at the output layer during training and testing is shown in Table 3. The table shows that the fractional-order activation functions have performed better than the conventional ones in terms of $R^2$ values. However, that is not the case for the MSE values. Comparing the functions by their $R^2$ values, LReLU and FLReLU performed better than the other linear variants. The performance of the neural network with the FLReLU activation function in the output layer for predicting the system-generated power during training and testing is shown in Figure 7. The response in the figure shows good tracking ability during both the training and testing phases.
Besides, among the three nonlinear variants, FELU performs better than ELU. However, conventional SiLU and GELU performed better than FSiLU and FGELU. Overall, it can be concluded that neither the conventional activation functions nor their fractional forms have performed well enough in the output layer. This supports the common preference for a linear activation function in the output layer.
Case 3. A multilayer feedforward neural network consisting of two hidden layers and one output layer has been used for this case. The neural network architecture to predict the system-generated power using wind speed and wind direction as inputs is shown in Figure 8. The architecture shows that the numbers of nodes in the input layer, hidden layer 1, hidden layer 2, and output layer are 2 : 10 : 10 : 1. The activation function chosen at the output layer is purelin, while the activation functions at the two hidden layers are varied to evaluate the performance.
Table 4 indicates that, with ReLU in hidden layer 1, PReLU and FPReLU in hidden layer 2 perform better than the other linear variants, whereas ELU and FELU perform better than the other nonlinear variants. With LReLU in hidden layer 1, FLReLU and PReLU have performed better than the conventional ones. Similarly, with PReLU in hidden layer 1, FReLU and FLReLU have performed better than the conventional ones. Considering the nonlinear variants, all conventional functions performed better than the fractional ones with ELU and SiLU in the 1st hidden layer. However, the fractional-order activation functions FReLU, FPReLU, and FLReLU have achieved better performance than the conventional ones with GELU in the 1st hidden layer. Among the fractional-order activation functions, combinations such as FReLU + FPReLU, FLReLU + FLReLU, FLReLU + FELU, FSiLU + FReLU, FSiLU + FLReLU, FSiLU + FPReLU, FGELU + FReLU, FGELU + FLReLU, and FGELU + FPReLU have performed better than the conventional ones. As an example, the performance of the neural network model with FReLU in the 1st hidden layer and FPReLU in the 2nd hidden layer during training and testing is shown in Figure 9. The response in the figure shows good tracking performance during both the training and testing phases.
5. Conclusion
This paper has developed the fractional form of ReLU and its variants for use in the layers of a neural network to improve its performance. The responses of all the derived fractional activation functions have been analyzed and compared with those of their conventional counterparts. The analysis shows that the fractional activation functions provide more flexible behavior than the conventional ones. Further, the simulation study on predicting the system-generated wind power using a neural network with the developed activation functions in the hidden and output layers has shown good performance. The case studies show that the fractional-order forms of the linear activation functions have performed well, which could be attributed to the hyper-parameter they possess in the negative plane. In Case 1, PReLU and FPReLU performed better than every other activation function. In Case 2, LReLU and FLReLU performed better than the other activation functions. In Case 3, regardless of the activation function kept in hidden layer 1, FLReLU performed better in all instances than its conventional counterpart LReLU and the other functions. However, the best result was obtained with PReLU in hidden layer 2 and FReLU in hidden layer 1.
The limited performance of the proposed activation functions at the output layer will be addressed in future work. Also, the fractional form of the widely used activation function tansig will be developed, and its performance will be analyzed in a real-time case study.
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
B. Karlik and A. Vehbi Olgac, “Performance analysis of various activation functions in generalized MLP architectures of neural networks,” International Journal of Artificial Intelligence and Expert Systems, vol. 1, no. 4, pp. 111–122, 2011.
D. Stursa and P. Dolezel, “Comparison of ReLU and linear saturated activation functions in neural network for universal approximation,” in Proceedings of the 2019 22nd International Conference on Process Control (PC19), pp. 146–151, IEEE, Strbske Pleso, Slovakia, June 2019.
B. Kishore and B. R. Prusty, “Chaotic time series prediction model for fractional-order duffing’s oscillator,” in Proceedings of the 2021 8th International Conference on Smart Computing and Communications (ICSCC), pp. 357–361, IEEE, Kochi, Kerala, India, July 2021.
B. Kishore, P. A. Mozhi Devan, and F. Azmadi Hussin, “Reconstruction of chaotic attractor for fractional-order tamaševičius system using recurrent neural networks,” in Proceedings of the 2021 Australian & New Zealand Control Conference (ANZCC), pp. 1–6, IEEE, Gold Coast, Australia, November 2021.
M. Kaloev and G. Krastev, “Comparative analysis of activation functions used in the hidden layers of deep neural networks,” in Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pp. 1–5, IEEE, Ankara, Turkey, June 2021.
B. Kishore, B. R. Prusty, A. Kumra, and A. Chawla, “Torque and temperature prediction for permanent magnet synchronous motor using neural networks,” in Proceedings of the 2020 3rd International Conference on Energy, Power and Environment: Towards Clean Energy Technologies, pp. 1–6, IEEE, Shillong, Meghalaya, India, March 2021.
B. Kishore, R. Ibrahim, M. N. Karsiti, S. M. Hassan, and V. R Harindran, Fractional-order Systems and PID Controllers, Springer, NY, USA, 2020.
B. Kishore, R. Ibrahim, M. N. Karsiti, S. M. Hassan, and V. Rajah Harindran, “Fractional order PI controllers for real-time control of pressure plant,” in Proceedings of the 2018 5th international conference on control, decision and information technologies (CoDIT), pp. 972–977, IEEE, Thessaloniki, Greece, April 2018.
J. Zamora Esquivel, A. C. Vargas, R. Camacho Perez, P. L. Meyer, H. Cordourier, and O. Tickoo, “Adaptive activation functions using fractional calculus,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea (South), October 2019.
Kaggle, “Texas Wind Turbine Dataset,” pp. 01–14, 2022, https://www.kaggle.com/datasets/pravdomirdobrev/texas-wind-turbine-dataset-simulated.