Abstract

Small samples are prone to overfitting during neural network training. This paper proposes an optimization approach based on L2 and dropout regularization, called the hybrid improved neural network algorithm, to overcome this issue. The proposed model was evaluated on the Modified National Institute of Standards and Technology (MNIST, grayscale, 28 × 28 × 1) and Canadian Institute for Advanced Research 10 (CIFAR10, RGB, 32 × 32 × 3) datasets, applied to the LeNet-5 and autoencoder neural network architectures. The evaluation is based on cross-validation, and the model's prediction on the test set is used as the final measure of model quality. The results show that the proposed hybrid algorithm performs more effectively, avoids overfitting, improves the prediction accuracy of the network model in classification tasks, and reduces the reconstruction error in the unsupervised setting. In addition, the proposed algorithm reduces the effect of noisy data and bias and improves the training time of neural network models without increasing the time complexity. Quantitative and qualitative experimental results show that, on the MNIST test set, the accuracy of the proposed algorithm improves by 2.3% and 0.9% compared with L2 regularization and dropout regularization, respectively, and on the CIFAR10 dataset, the accuracy improves by 0.92% compared with L2 regularization and by 1.31% compared with dropout regularization. On the MNIST dataset, the reconstruction error of the proposed algorithm improves by 0.00174 and 0.00398 compared with L2 regularization and dropout regularization, respectively, and on the CIFAR10 dataset, the reconstruction error improves by 0.00078 compared with L2 regularization and by 0.00174 compared with dropout regularization.

1. Introduction

Generally, convolutional neural networks are used in image classification to extract feature information from images, and fully connected layers are used to build the classifier. However, because of the large number of weight parameters in the fully connected layers, overfitting occurs easily when samples are scarce, leaving the network overparameterized. A review of the available literature shows that, to address overfitting during training, Hinton et al. [1] first proposed the dropout regularization method. Dropout is introduced during network training and randomly suppresses the activation values of some neurons, so the network structure obtained in each training iteration is different and can be regarded as a subset of the neurons. The working principle of dropout can be understood from the perspective of model averaging: in the final prediction, all neurons are retained to participate in the test, which approximates averaging over all trained sub-models, prevents overfitting, and improves prediction accuracy. Because of its remarkable effect, dropout regularization is widely used in various types of neural networks. With the same goal, Zhou and Luo [2] proposed a method that selects the probability of deleting a node according to the size of its activation value: nodes with lower activation values are deleted with higher probability, more nodes with higher activation values are retained, and the model's feature extraction ability is enhanced. During testing, all deleted nodes are retained and all trained parameters are restored, which achieves the purpose of combining multiple networks and embodies the idea of model averaging. Zhong et al. [3] alleviated model overfitting by proposing a multi-scale fusion method to optimize dropout. The method uses a genetic algorithm to find the optimal scale, further updates the network parameters according to that scale to obtain prediction submodels, and finally fuses these submodels with certain weights into the final prediction model, which is again the idea of model averaging. Ghiasi et al. [4] proposed the DropBlock method to alleviate model overfitting. During training, units in contiguous regions of a feature map are discarded together, and the number of discarded units gradually increases with the number of iterations, which improves the accuracy and robustness of the network's predictions. Cheng et al. [5] proposed an improved model-averaging method to prevent overfitting. Whereas early dropout was applied to the fully connected layer, this method introduces dropout in the pooling layer during training, making the pooling-layer unit values sparse. In the test phase, the selection probability used by the pooling-layer dropout during training is multiplied by the probability of each unit value in the pooling region as a double probability, so the sparsity effect of the training-stage pooling-layer dropout is better reflected in the test-stage pooling layer. Inspired by DropBlock, Pham and Le [6] proposed the auto dropout method.
Unlike DropBlock, this method automatically learns the dropout pattern on each channel and layer during training, automates the design of the dropout pattern, and divides the original contiguous dropout region into small regions, which leads to better model generalization performance [7]. Gomez et al. [8] proposed a new method, targeted dropout, to sparsify the network. The method is based on dropout, in which the network is sparsified by randomly discarding some neurons; targeted dropout, however, sparsifies the network purposefully: the idea is to rank weights or units according to an approximate measure of importance (such as magnitude) and then apply dropout to the sets of units considered least useful, achieving a better effect. Many researchers have studied the problem of preventing overfitting and have made great contributions.

In addition to the methods mentioned in the above literature, overfitting can also be prevented by adding an L2 regularization term to the loss function: the trainable weight parameters are attenuated (weight decay) to reduce the dependence on any single feature and improve the model's generalization ability [9]. Starting from the small-sample problem [10], a powerful regularization technique is particularly important for avoiding overfitting during model training to the greatest extent and improving the generalization ability of the network model [7]. As the above survey shows, to improve the model's generalization in terms of performance, accuracy, and training time, the effects of noisy data and of reducing the complexity of the neural network on model training deserve more attention and investigation. With respect to these considerations, this paper proposes an improved hybrid algorithm based on L2 and dropout regularization.

In the proposed hybrid algorithm, the dropout method is first used to reduce the number of updatable parameters, which speeds up the training of the network model. Then, L2 constraints are applied to these parameters to attenuate the weight parameters, reduce the dependence on any single feature, and enhance the robustness of feature selection. Finally, the loss function is reconstructed to optimize the parameter updates during network training and prevent overfitting. For convenience, this paper refers to this method as the hybrid algorithm. The hybrid algorithm can be applied to various classification tasks requiring fully connected layers. The present study contributes to the existing knowledge in several ways, as follows:
(1) This study proposes a new hybrid algorithm to solve the overfitting problem caused by the small number of samples in the neural network training process. The large dataset required for training a reliable model is one of the crucial obstacles to implementing a neural network, and such datasets are not easily obtained because of the time, cost, and energy required to prepare them. Therefore, an alternative approach is to make the best use of the data at hand rather than straightforwardly increasing the costly dataset size. With this in mind, the results of this study can be used in different disciplines, including engineering, medicine, and geosciences, where neural networks might potentially be applied.
(2) The proposed algorithm acts on the fully connected layers and concentrates on improving prediction accuracy and reducing the reconstruction error, which can effectively reduce data noise and error. This is crucial considering the recent trend of translating developed networks into user-friendly apps for the convenient use of fellow researchers [11].
(3) The proposed algorithm is evaluated on two datasets, MNIST and CIFAR10, with different dimensions and image types. Considering the wide use of these two well-known datasets, the comparison results demonstrate the efficiency of the introduced method.
(4) The proposed algorithm is compared on the supervised (LeNet-5) and unsupervised (RAE) neural network architectures to investigate the prediction accuracy and the reconstruction error, respectively.

In general, based on the assessments of the results presented in the following sections and the observed satisfactory performance, the authors believe that the introduced method can efficiently contribute to solving various problems in different fields of science.

The paper is organized as follows: first, the L2 and dropout algorithms and the proposed hybrid algorithm are described in Section 2; then, in Section 3, the hyperparameter estimation process with LeNet-5 and the Rough Auto Encoder (RAE) on the two datasets (MNIST and CIFAR10) is described, and the performance of each approach (without regularization, L2 regularization, dropout regularization, and the hybrid technique) is tested and analyzed. In Section 4, the obtained results are compared and summarized. Finally, the paper is concluded in Section 5.

2. Algorithm Introduction

This section provides detailed descriptions of L2 regularization [12] and dropout regularization [13]. The core idea of each algorithm is described according to the mathematical theory behind it, and pseudocode is given to illustrate its working principle. Finally, the improved algorithm is described in detail in the third subsection, together with its pseudocode.

2.1. L2 Algorithm

The L2 regularization works on the principle of adding a regularization term (also known as a penalty term) to the loss function, which then participates in the model's training process. The L2 constraint [14, 15] imposes a greater penalty on a larger weight: the larger the weight parameter value, the greater the attenuation, thereby reducing the dependence on any single feature. Therefore, the absolute values of the weight parameters in the network layers tend to decrease, and no particularly large values appear, which avoids overfitting and improves the model's generalization ability. The general expression of the loss function of a neural network that introduces the regularization term during training is shown in the following:

$$L_{reg} = L_0 + \frac{\lambda}{2}\sum_{l=1}^{m}\sum_{j=1}^{n_l}\left(w_j^{(l)}\right)^2 \tag{1}$$

As (1) shows, $L_{reg}$ denotes the loss function after adding the regularization term to the original loss function $L_0$, where the parameter $\lambda$ (its value is between 0 and 1) is used to control the penalty on the weight parameters; $\lambda$ does not participate in the training of the network model and is a hyperparameter. $w_j^{(l)}$ represents a weight parameter in the network, $m$ means that the network has $m$ fully connected layers, and $n_l$ means that layer $l$ has $n_l$ neurons.

In the process of error backpropagation (BP) [16], the partial derivative of the loss with respect to a weight parameter in the network model is calculated as follows:

$$\frac{\partial L_{reg}}{\partial w_j^{(l)}} = \frac{\partial L_0}{\partial w_j^{(l)}} + \lambda w_j^{(l)} \tag{2}$$

Taking the weight parameters of layer $l$ as an example, the gradient vector of the weight parameters in that layer of the network model is

$$\nabla_{W^{(l)}} L_{reg} = \nabla_{W^{(l)}} L_0 + \lambda W^{(l)} \tag{3}$$

The update of a single parameter in the network model is represented by the following equation:

$$w_j^{(l)} \leftarrow w_j^{(l)} - \eta\left(\frac{\partial L_0}{\partial w_j^{(l)}} + \lambda w_j^{(l)}\right) \tag{4}$$

The parameter $\eta$ in (4) is the step size of the parameter update when using the gradient descent algorithm, also called the learning rate. In summary, the update of all weight parameters of a certain layer in the network model can be written in vector form, as shown in (5):

$$W^{(l)} \leftarrow (1 - \eta\lambda)\,W^{(l)} - \eta\,\nabla_{W^{(l)}} L_0 \tag{5}$$

It can be seen from formulas (1)–(5) that, in the training process of the model, the L2 regularization term is introduced on top of the loss function $L_0$, and this term controls the weight parameters (weight decay) to prevent overfitting, reduce the dependence on individual features, improve the generalization ability of the model, and improve the accuracy of the model's predictions. Using L2 regularization can reduce the dependence on some weight parameters to prevent overfitting, but it cannot optimize the network structure and offers little help for complex networks with many weight parameters. The pseudocode of the L2 regularization method is given in Algorithm 1.

Initialization:
    Initialize the neural network's weight parameters W,
    regularization parameter λ,
    regularization term R = 0,
    and parameter learning rate η.
# Iterate over the parameters of each layer,
    l represents a certain layer,
    L represents all layers.
For l in L:
  # Get the number of neurons in this layer: l_nums_of_neurons = len(layer l).
  # Get the number of neurons in the next layer: l_next_nums_of_neurons = len(layer l + 1).
  for i in range(l_nums_of_neurons):
    for j in range(l_next_nums_of_neurons):
      R ← R + (w_ij^(l))²
    End for.
  End for.
End for.
# R is the regularization term.
Suppose L0 is our loss function; then L_reg = L0 + (λ/2)·R is the final cost function.
Repeat do:
    for epoch in epochs:
      # Parameter update in units of mini_batch.
          for mini_batch in range(mini_batches):
           # Repeat do, parameter update process.
           gradient ← ∂L0/∂W + λ·W
           W ← W − η·gradient
          End for
End for
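
As a concrete illustration of Algorithm 1, the following is a minimal sketch of adding the L2 penalty to the loss of a fully connected classifier. It assumes a PyTorch implementation (the paper does not specify a framework), and the layer sizes, the toy mini-batch of random tensors, and the hyperparameter values are illustrative placeholders rather than the settings used in the experiments.

import torch
import torch.nn as nn

# Illustrative fully connected classifier (layer sizes are placeholders).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 120), nn.ReLU(),
                      nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, 10))
criterion = nn.CrossEntropyLoss()                         # L0
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # learning rate η
lam = 5e-4                                                # regularization parameter λ (hyperparameter)

def l2_regularized_loss(outputs, targets):
    # L_reg = L0 + (λ/2) · Σ w², summed over the weights of every fully connected layer.
    l2_term = sum((p ** 2).sum() for name, p in model.named_parameters() if "weight" in name)
    return criterion(outputs, targets) + 0.5 * lam * l2_term

# One training step on a toy mini-batch of eight 32 × 32 grayscale images.
x, y = torch.randn(8, 1, 32, 32), torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = l2_regularized_loss(model(x), y)
loss.backward()
optimizer.step()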
2.2. Dropout Algorithm

The dropout [16] regularization method was first proposed by Hinton et al. [1] to solve the overfitting problem of neural networks during training. Dropout is added to the neural network in the training process, and some neurons are inhibited by randomly generating a probability vector of 0s and 1s (p_vector) that acts on the activation units [17]. The network structure of each training iteration is therefore different, while all neurons are preserved at test time. In each training iteration, the activation value of an inhibited neuron is output as zero, and the weight parameters connected to the inhibited neuron do not participate in the update. Each neuron in the neural network is inhibited with a certain probability. This mechanism optimizes the structure of the network model and reduces the number of weight parameters that are updated during training. Because of this mechanism, a different network structure is produced in each training pass, so each structure can be regarded as a subset of the neurons. If training is iterated n times, n different network model structures are randomly generated, and these n models jointly determine the final prediction of the model. This technique is also called model averaging [18, 19]. In this way, the overfitting problem is prevented, which is how the dropout regularization technique works. The core formulas of the dropout algorithm are shown in the following formulas (6)–(9).

$$r^{(l)} \sim \mathrm{Bernoulli}(p) \tag{6}$$
$$\tilde{y}^{(l)} = r^{(l)} \odot y^{(l)} \tag{7}$$
$$z^{(l+1)} = W^{(l+1)}\,\tilde{y}^{(l)} + b^{(l+1)} \tag{8}$$
$$y^{(l+1)} = \mathrm{ReLU}\!\left(z^{(l+1)}\right) \tag{9}$$

In equation (6), the function Bernoulli(p) generates a probability vector $r^{(l)}$ whose elements are 0 or 1. Equation (7) indicates that the probability vector is applied to the activation units of the $l$-th layer to obtain new activation values $\tilde{y}^{(l)}$, (8) computes the input of the next layer from the masked activations, and (9) uses the ReLU activation function [20] to enhance the nonlinear expression ability. The dropout algorithm can optimize the neural network's structure, suppress the activation values of some neurons, and reduce the number of weight parameters updated during training. However, the dependence on particular features cannot be alleviated, because in each iteration only some activation units are suppressed. In addition, owing to the huge number of parameters, the effect of using dropout alone is not obvious for large networks. Algorithm 2 shows the pseudocode of the dropout algorithm.

Input: activation values, noted as a.
Initialization:
    Initialize the neural network's weight parameters W,
    bias parameters b,
    and neuron drop rate p.
for epoch in epochs:
# Generate probability vectors for each layer of neurons used to simulate neuronal inactivation.
  for layer in layers:
    # According to drop rate p, generate r^(layer).
    r^(layer) ← Bernoulli(p).
  End for.
  x ← random mini-batch from Dataset D.
  Repeat do:
   for x in D: # Use x to represent a mini_batch, use D to represent all mini_batches.
      for layer in layers-1:
        ã^(layer) ← r^(layer) ⊙ a^(layer).
        z^(layer+1) ← W^(layer+1)·ã^(layer) + b^(layer+1).
      if the layer is not the output layer:
          a^(layer+1) ← ReLU(z^(layer+1)).
      else:
        # Output layer; note the prediction as ŷ.
        ŷ ← Softmax(z^(layer+1)).
      End for.
   End for.
   Pass: parameter update using gradient descent.
End for.
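
To make the procedure concrete, the following is a minimal sketch of dropout applied to the fully connected layers, again assuming a PyTorch implementation; the layer sizes and the drop rate are placeholders. Note that PyTorch's nn.Dropout implements the inverted-dropout variant, which rescales the retained activations during training instead of averaging at test time, but the behavior described above (random suppression during training, all neurons kept at test time) is the same.

import torch
import torch.nn as nn

p = 0.3                                  # drop rate (illustrative)
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(32 * 32, 120), nn.ReLU(), nn.Dropout(p),
                      nn.Linear(120, 84), nn.ReLU(), nn.Dropout(p),
                      nn.Linear(84, 10))

x = torch.randn(8, 1, 32, 32)            # toy mini-batch
model.train()                            # training: each hidden unit is zeroed with probability p
out_train = model(x)
model.eval()                             # testing: all neurons are retained (model averaging)
out_test = model(x)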
2.3. Hybrid Algorithm

The hybrid algorithm is an improved algorithm derived from L2 regularization and dropout regularization. While L2 regularization is used for weight decay, dropout is introduced to optimize the network structure during model training. Using dropout inhibits some neurons, reduces the dependence between neurons, thins the connections between neurons, and prevents excessive reliance on particular features, which better prevents overfitting at the structural level. For example, a neuron responsible for a key feature may be inhibited during an iteration; the network must then find other important features, which improves the generalization ability of the network model from another perspective. L2 regularization prevents overfitting by attenuating the updatable parameters, which is an optimization at the parameter level. Therefore, the hybrid algorithm is proposed to solve the overfitting problem better, so that the neural network can learn more robust features, improve the accuracy of the network model's predictions, and improve the stability of the training process. The new algorithm for preventing overfitting combines optimization at the parameter level and at the network structure level. The general regularization term is denoted as $R(W)$ and defined as follows:

$$R(W) = \frac{\lambda}{2}\sum_{l=1}^{m}\sum_{j=1}^{n_l}\left(w_j^{(l)}\right)^2 \tag{10}$$

The parameter $\lambda$ (its value is between 0 and 1) is used to control the penalty on the weight parameters; $\lambda$ does not participate in the training of the network model and is a hyperparameter, and $w_j^{(l)}$ represents a weight parameter in the network. In (10), $m$ means that the network has $m$ fully connected layers, and $n_l$ shows that layer $l$ has $n_l$ neurons. We know that the last step of the network model is to calculate the error between the predicted value $\hat{y}$ and the true value $y$. The output layer of the network is computed as follows:

$$z^{(L)} = W^{(L)}\,a^{(L-1)} + b^{(L)} \tag{11}$$

In equation (11), $a^{(L-1)}$ represents the activation value vector of the previous layer, $W^{(L)}$ represents the weight parameters of the layer, and $b^{(L)}$ represents the bias vector composed of all neurons in the layer. Finally, the Softmax function is used to obtain the output value $\hat{y}$ of the model. The loss function used in this paper is the multi-class cross-entropy loss function, denoted as $L_0$:

$$L_0 = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k} y_{ik}\,\log \hat{y}_{ik} \tag{12}$$

where $n$ represents the number of samples in each training iteration and is used to find the average loss over each batch. The loss function of the network model is then shown in the following:

$$L = L_0 + R(W) \tag{13}$$

In each epoch, a set of probability vectors containing only 0s and 1s is randomly generated to suppress some neurons:

$$r^{(l)} \sim \mathrm{Bernoulli}(p) \tag{14}$$

In equation (14), $p$ is the drop rate hyperparameter. The loss function of the final network model using the hybrid algorithm is represented in the following:

$$L = L_0 + \frac{\lambda}{2}\sum_{l=1}^{m}\sum_{j=1}^{n_l} r_j^{(l)}\left(w_j^{(l)}\right)^2 \tag{15}$$

where $r_j^{(l)}$ represents the element of the probability vector corresponding to neuron $j$ in layer $l$, acting on the weight parameters connected to that neuron. The core formulas involved in forward propagation are shown in the following equations:

$$\tilde{a}^{(l)} = r^{(l)} \odot a^{(l)} \tag{16}$$
$$a^{(l+1)} = \mathrm{ReLU}\!\left(W^{(l+1)}\,\tilde{a}^{(l)} + b^{(l+1)}\right) \tag{17}$$
$$\hat{y} = \mathrm{Softmax}\!\left(W^{(L)}\,\tilde{a}^{(L-1)} + b^{(L)}\right) \tag{18}$$

In (16), the probability vector acts on the activation units of a certain layer: the activation values undergo an element-wise (dot) product operation to achieve partial neuron inhibition. In (17), the activation function used in implementing this algorithm is the rectified linear unit (ReLU) [21]. Based on (18), the Softmax [21] function is used to obtain the final output value of the network model. Finally, the parameters are updated using the gradient descent algorithm [22] with Adam (adaptive learning rate) as the optimizer [23]. It can be seen from the above formulas that the hybrid algorithm attenuates the weight parameters of the network model through L2 regularization and, at the same time, reduces the complexity of the network model in each training iteration by introducing dropout to optimize the network structure. In this way, the generalization ability of the network model is enhanced, and overfitting is prevented more effectively. The pseudocode of the hybrid algorithm is presented in Algorithm 3.

Input: the feature map extracted by the convolutional neural network is flattened
   into a one-dimensional feature vector, noted as a.
Initialization: initialize the neural network's weight parameters W, bias parameters b, regularization parameter λ, and neuron drop rate p.
for epoch in epochs:
   # Generate probability vectors for each layer of neurons
    used to simulate neuronal inactivation.
   for layer in layers:
    # According to drop rate p, generate r^(layer).
    r^(layer) ← Bernoulli(p).
   End for.
   X ← random mini-batch from Dataset D.
   Repeat do:
   for X in D: # Use X to represent a mini_batch, and use D to represent all mini_batches.
     for layer in layers-1:
      if the layer is the output layer; note the prediction as ŷ:
        z^(layer+1) ← W^(layer+1)·ã^(layer) + b^(layer+1).
        ŷ ← Softmax(z^(layer+1)).
      else:
        ã^(layer) ← r^(layer) ⊙ a^(layer).
        a^(layer+1) ← ReLU(W^(layer+1)·ã^(layer) + b^(layer+1)).
     End for.
     # The layer consists of n_l neurons.
     # Turn r^(layer) into a mask on the weights connected to each neuron.
     # L2 regularization term, noted as R.
     R ← (λ/2)·Σ_layer Σ_j r_j^(layer)·(w_j^(layer))².
     L ← L0(ŷ, y) + R.
   End for.
   Pass: update parameters by backpropagation.
End for.
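
As with the previous algorithms, the following is a minimal sketch of the hybrid idea, assuming a PyTorch implementation: dropout is applied to the fully connected activations, and an L2 penalty over the fully connected weights is added to the cross-entropy loss. For simplicity, the penalty here covers all fully connected weights rather than only those connected to retained neurons, so it is a simplification of the hybrid loss above; the class name HybridFC and the feature size are placeholders, while λ = 0.00037, p = 0.3, and learning rate 0.0012 are the values selected for the MNIST/LeNet-5 experiments in Section 3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFC(nn.Module):
    # Fully connected classifier head combining dropout and an L2 penalty (hybrid sketch).
    def __init__(self, in_features=400, p=0.3, lam=0.00037):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.drop = nn.Dropout(p)        # suppresses neurons during training only
        self.lam = lam                   # L2 regularization parameter λ

    def forward(self, x):
        x = self.drop(F.relu(self.fc1(x)))
        x = self.drop(F.relu(self.fc2(x)))
        return self.fc3(x)               # Softmax is folded into the cross-entropy loss below

    def loss(self, logits, targets):
        # L = L0 + (λ/2) · Σ w² over the fully connected weights.
        l2 = sum((m.weight ** 2).sum() for m in (self.fc1, self.fc2, self.fc3))
        return F.cross_entropy(logits, targets) + 0.5 * self.lam * l2

model = HybridFC()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0012)  # Adam optimizer, as in the paper
features = torch.randn(8, 400)       # e.g., a flattened LeNet-5 feature map (placeholder data)
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = model.loss(model(features), labels)
loss.backward()
optimizer.step()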

3. Experimental Results and Analysis

The experiments are performed on the small-sample problem to investigate the overfitting phenomenon and compare the efficiency of the algorithms. The PyCharm IDE is used for coding all algorithms [23]. To verify the effectiveness of the proposed algorithm, the Modified National Institute of Standards and Technology (MNIST) dataset [24] and the Canadian Institute for Advanced Research, 10 classes (CIFAR10) dataset [25, 26] are used, owing to their open-source accessibility. The MNIST dataset contains images of handwritten digits from 0 to 9 in 10 categories in grayscale format with dimension 28 × 28 × 1, and the CIFAR10 dataset (collected by Alex et al.) consists of 10 categories of 32 × 32 × 3 RGB images. For both datasets (MNIST and CIFAR10), the data were first randomly mixed so as to be randomly distributed, and then a total of 1000 samples were selected, of which 800 samples were used for training and 200 samples for testing and validation. The L2, dropout, and proposed hybrid algorithms mainly act on the fully connected layers. Accordingly, the LeNet-5 neural network [27], which contains convolutional and fully connected layers, and the rough auto encoder (RAE) [28] neural network, whose architecture consists of fully connected layers, were selected, and their hyperparameters were estimated [29]. Afterwards, for both the LeNet-5 and RAE networks, the 5-fold cross-validation method is used to evaluate the performance of each model (without regularization, L2, dropout, and the proposed hybrid algorithm), and the average value is taken as the standard measure of the model's performance [30].
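
As a concrete illustration of this setup, the following sketch builds the 1000-sample subset and the 800/200 split for MNIST with torchvision; the data path and random seed are placeholders, and the folding used later for cross-validation is only indicative.

import torch
from torch.utils.data import Subset, random_split
from torchvision import datasets, transforms

torch.manual_seed(0)                                   # placeholder seed for reproducibility
full = datasets.MNIST("./data", train=True, download=True, transform=transforms.ToTensor())
indices = torch.randperm(len(full))[:1000]             # random mixing, then keep 1000 samples
small = Subset(full, indices.tolist())
train_set, test_set = random_split(small, [800, 200])  # 800 for training, 200 for testing/validation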

To ensure comparability between experiments, the following experiments are set up in four steps for each dataset, and the one-factor variable control method is used in each test. In every experiment, the accuracy of the training set (average accuracy), the accuracy of the validation set (average validation accuracy), the loss value of the training set (average loss), and the loss value of the validation set (average validation loss) are calculated and compared. The tests on both datasets comprise Step 1, a test without any overfitting-prevention method or regularization; Step 2, a test using L2 regularization; Step 3, a test based on the dropout regularization method, investigating the effect of suppressing some neurons in the network model; and Step 4, a test using the hybrid algorithm, observing the changes in the model during training. The results are reported in the following:

3.1. Experiment on the MNIST Dataset

As mentioned before, the MNIST dataset is first used, with the LeNet-5 and RAE neural networks, to determine the appropriate hyperparameters, as described in the following:

3.1.1. LeNet-5 Network Architecture vs. MNIST Dataset

Since the dimensions of the MNIST images are 28 × 28 × 1 and the input of the LeNet-5 network is 32 × 32, the image data need to be padded to 32 × 32 before being applied to the LeNet-5 network for training. Therefore, two layers of zeros are added around the outermost edges of the image data [31].
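
A minimal sketch of this zero-padding step, assuming a PyTorch implementation with a toy batch in place of the actual MNIST tensors:

import torch
import torch.nn.functional as F

images = torch.rand(8, 1, 28, 28)     # toy batch of MNIST-sized grayscale images
padded = F.pad(images, (2, 2, 2, 2))  # two zero layers on each edge → shape (8, 1, 32, 32)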

(1) LeNet-5 hyperparameters calculation. Based on the LeNet-5 network architecture, experiments are carried out using the MNIST dataset. The hyperparameters are determined through the 5-fold cross-validation method, and the average value is finally taken as the suitable hyperparameter. The neural network's performance with different values of the learning rate and weight decay hyperparameters is shown in Table 1, and with different values of the drop rate (p) in Table 2. Table 1 reports the average train accuracy (AT_Acc), average train loss (AT_Loss), average validation accuracy (AV_Acc), and average validation loss (AV_Loss) for both the learning rate and the weight decay based on LeNet-5. It should be mentioned that the ideal hyperparameter values are those for which AT_Acc and AV_Acc are close to 1 while AT_Loss and AV_Loss are close to zero.

Table 1 covers learning rate and weight decay values from 0.0001 to 0.01, with steps of 0.001 and 0.0001, respectively, for the LeNet-5 neural network architecture. To select better hyperparameters, the three top-performing values (with AT_Acc and AV_Acc near 1 and AT_Loss and AV_Loss close to zero) were selected [29], and their average was taken as the final hyperparameter. As shown, the learning rate values 0.002, 0.001, and 0.0006 have the best performance for LeNet-5; their average, 0.0012, is taken as the final learning rate hyperparameter. Similarly, the weight decay values 0.0002, 0.0004, and 0.0005, with an average of 0.00037, give the final weight decay hyperparameter. The drop rate (p) hyperparameter, varied from 0.1 to 0.9 with a step of 0.1 for the LeNet-5 neural network, is presented in Table 2.

As illustrated in Table 2, the values 0.2, 0.3, and 0.4, with an average of 0.3, are selected as the best-performing drop rate (p) hyperparameter for the LeNet-5 neural network.
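
The selection rule described above can be summarized by the small helper below; the validation accuracies in the example are hypothetical and only illustrate how the rule reproduces the reported learning rate of 0.0012.

def top3_average(candidates, scores):
    # Average the three candidate values with the highest scores (e.g., average validation accuracy).
    best = sorted(zip(scores, candidates), reverse=True)[:3]
    return sum(value for _, value in best) / 3

lrs = [0.0001, 0.0006, 0.001, 0.002, 0.01]   # illustrative subset of the searched grid
val_acc = [0.90, 0.93, 0.94, 0.94, 0.88]     # hypothetical validation accuracies
print(top3_average(lrs, val_acc))            # ≈ 0.0012 for these hypothetical scores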

(2) LeNet-5 performance and five-fold cross-validation. With the selected hyperparameters and the LeNet-5 architecture, the performance of each model (without regularization, L2 regularization, dropout regularization, and the proposed hybrid algorithm) is verified by five-fold cross-validation [32]; the average train accuracy, average train loss, average validation accuracy, and average validation loss are reported in Tables 3 and 4.
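
The evaluation protocol can be sketched as follows; train_and_evaluate is a hypothetical stand-in for one training run of a given model, and the metrics it returns here are placeholders.

import numpy as np
from sklearn.model_selection import KFold

def train_and_evaluate(train_idx, val_idx):
    # Hypothetical helper: train one model on train_idx and evaluate it on val_idx.
    return {"val_acc": 0.0, "val_loss": 0.0}   # placeholder metrics

kf = KFold(n_splits=5, shuffle=True, random_state=0)
indices = np.arange(800)                       # the 800 training samples described above
results = [train_and_evaluate(tr, va) for tr, va in kf.split(indices)]
avg_val_acc = np.mean([r["val_acc"] for r in results])   # averaged, as in Tables 3 and 4
avg_val_loss = np.mean([r["val_loss"] for r in results])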

According to the extracted average values (Tables 3 and 4), the performance of each algorithm is shown in Figures 1–4 for the four steps: without any overfitting-prevention method or regularization, with L2 regularization, with the dropout regularization method, and with the hybrid algorithm, as follows:

The Step 1 results (without any overfitting-prevention method or regularization) in Figure 1 show that severe overfitting occurs: the accuracy on the training set reaches 0.9773, much higher than the accuracy on the validation set of about 0.903. The difference between the training and validation set accuracies is about 0.0743, indicating that the training data are fitted almost perfectly. The loss value of the validation set is also higher than that of the training set: the training set loss is 1.4843 and the validation set loss is 1.5597, a difference of about 0.0754.

The Step 2 results (L2 regularization) in Figure 2 still show overfitting, but it is alleviated to a certain extent compared with Step 1. The accuracy on the training set is 0.9778 and the accuracy on the validation set is about 0.92, a difference of about 0.0578. The loss value on the training set (1.4831) differs from that on the validation set (1.5407) by about 0.0576.

From the Step 3 results in Figure 3 (dropout regularization method), it can be observed that the overfitting problem is greatly improved compared with Steps 1 and 2. Although there is a gap between the accuracy on the training set (0.9631) and the accuracy on the validation set (0.934), it is not as large as in Steps 1 and 2; the difference is about 0.0291. A little overfitting remains, as can be seen from the training set loss (1.5027) and the validation set loss (1.5276), which differ by about 0.0249, showing that overfitting gradually becomes more pronounced as the number of iterations increases.

The Step 4 results (hybrid algorithm) in Figure 4 and Table 4 are the best among all the steps, with large improvements in both accuracy and loss. The difference between the training set accuracy (0.971) and the validation set accuracy (0.943) is about 0.028, and the loss value on the training set (1.4918) differs from that on the validation set (1.5127) by about 0.0209. Moreover, the training process is stable, which ensures the stability of the model during retraining and largely prevents the overfitting problem.

(3) Time complexity. The cross-validation method is used to compare the time complexity of the tested methods (without regularization, L2 regularization, dropout regularization, and the hybrid algorithm) on the LeNet-5 neural network, and the average time is calculated. The time complexity of the models is measured by comparing the average time spent on model training with the LeNet-5 neural network. The training time of each model on the MNIST dataset is listed in Table 5.

Table 5 shows the 5-fold cross-validation results based on LeNet-5; the average time differences are small, with L2 regularization taking the longest and the hybrid algorithm (26.27 seconds) the shortest. The results show that although the proposed hybrid algorithm complicates the loss function, the time performance does not deteriorate; compared with L2 regularization, the training time is improved by 0.99 seconds.

3.1.2. Rough Auto Encoder (RAE) vs. MNIST Dataset

Along with LeNet-5, another neural network architecture, the rough auto encoder (RAE), is used to evaluate the proposed hybrid algorithm on the same MNIST dataset, and the results are compared in terms of model performance.

(1) Rough Auto Encoder (RAE) hyperparameters calculation. In a similar way to the hyperparameter extraction with LeNet-5, hyperparameter extraction tests were conducted with the Rough Auto Encoder (RAE) on the MNIST data, and the average of the three best-performing values is taken as the final hyperparameter. The RAE neural network's performance with different values of the learning rate, weight decay, and drop rate (p) hyperparameters is given in Tables 6 and 7, respectively. Table 6 shows the learning rate and weight decay hyperparameters of the RAE neural network architecture (from 0.01 to 0.0001 with a step of 0.0001, and from 0.0001 to 1.00E-07 with steps of 1.00E-05 and 1.00E-06) trained on the MNIST dataset. It should be mentioned that a lower reconstruction error indicates better performance.

As depicted in Table 6, the three best-performing learning rate values for the RAE neural network architecture are 0.003, 0.004, and 0.005, averaged to 0.004; for the weight decay hyperparameter, the three values 4.00E-07, 3.00E-07, and 2.00E-07, with an average of 3.00E-07, are selected with respect to the best reconstruction error on the MNIST dataset. The drop rate hyperparameter (p) and the corresponding reconstruction error, varied from 0.1 to 0.9 with a step of 0.1 and trained on the MNIST dataset, are given in Table 7.

Table 7 shows that the reconstruction error grows as the drop rate hyperparameter p increases; the value p = 0.1 performs best, with a reconstruction error of 0.038906.

(2) Rough Auto Encoder (RAE) performance vs. MNIST dataset. Since the input of the RAE fully connected network is one-dimensional (1D) data, the image data must be processed into a 1D vector. Then, based on the RAE, the reconstruction errors of each algorithm on the MNIST dataset are compared; a smaller reconstruction error indicates better algorithm performance. Figure 5 and Table 8 show the comparison of the RAE reconstruction errors on the MNIST dataset.
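
The following is a minimal sketch of this evaluation, with a plain fully connected autoencoder standing in for the RAE of [28], a toy batch standing in for the MNIST data, and illustrative layer sizes; it only shows how the images are flattened and how the reconstruction error is measured.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 28 * 28), nn.Sigmoid())

images = torch.rand(8, 1, 28, 28)                    # toy batch in place of MNIST images
recon = decoder(encoder(images)).view(8, 1, 28, 28)
recon_error = F.mse_loss(recon, images)              # reconstruction error: smaller is better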

As Figure 5 shows, the proposed hybrid algorithm can more effectively reduce the impact of data noise and bias on model training in the unsupervised neural network and minimize the reconstruction error. As can be seen from Table 8, the reconstruction error on the MNIST dataset using the proposed hybrid algorithm is the smallest (0.02438), while the largest belongs to dropout regularization (0.02836).

(3) Time complexity. As the last stage of the algorithm comparison, the training time of each algorithm on the MNIST dataset with the RAE neural network architecture is calculated and listed in Table 9.

As depicted in Table 9, L2 regularization takes the longest time, while the hybrid algorithm takes longer than dropout regularization and no regularization. These data show that although the proposed algorithm complicates the loss function, the time performance does not deteriorate; compared with L2 regularization, the training time is improved by 64.57 seconds.

3.2. Experiment on the CIFAR10 Dataset

CIFAR10 is used as the second dataset to compare and verify the aforementioned algorithms. As with the previous dataset, the LeNet-5 and RAE neural network architectures are used to extract the hyperparameters, and then the performance of each algorithm is examined as follows:

3.2.1. LeNet-5 Network Architecture vs. CIFAR10 Dataset

Since the CIFAR10 images and the LeNet-5 input have the same spatial dimension (32 × 32), the data can be used directly. To speed up the convergence of the gradient descent algorithm, a normalization operation is performed on the dataset [30].
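
A minimal sketch of such a normalization step with torchvision; the per-channel statistics of 0.5 are common placeholder values rather than settings taken from the paper.

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                                   # scale pixel values to [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # per-channel normalization (placeholder stats)
])
cifar_train = datasets.CIFAR10("./data", train=True, download=True, transform=transform)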

(1) LeNet-5 hyperparameters calculation. Based on the LeNet-5 network architecture, experiments are carried out on the CIFAR10 dataset to find the hyperparameters (learning rate, weight decay, and drop rate p) through the 5-fold cross-validation method, and the average value is finally taken as the fitted hyperparameter. The obtained hyperparameter values are represented in Tables 10 and 11. As with the previous dataset, the three top-performing values are averaged to give the final hyperparameter.

As displayed in Table 10, the three best-performing learning rate values for the LeNet-5 neural network architecture are 0.001, 0.0009, and 0.0007, with an average of 0.00087 taken as the final learning rate; for the weight decay hyperparameter, the three best-performing values 0.001, 0.0005, and 0.0001 are averaged to 0.00053 and taken as the weight decay value. The same process was followed for the drop rate hyperparameter (p), and the results are shown in Table 11.

As demonstrated in Table 11, the values 0.3, 0.4, and 0.5 perform best; their average, 0.4, is defined as the hyperparameter p for the next steps.

(2) LeNet-5 performance and five-fold cross-validation. To verify the proposed hybrid algorithm's effectiveness, each model's performance is measured by cross-validation on the second dataset (CIFAR10) according to the LeNet-5 architecture tests [22]. The average train accuracy, average validation accuracy, average train loss, and average validation loss are calculated and compared. The cross-validation results for all the aforementioned algorithms are presented in Tables 12 and 13.

Based on the results obtained in the cross-validation tables (Tables 12 and 13), the results of all four steps are illustrated in Figures 6–9.

As Figure 6 shows, in Step 1 (without any overfitting-prevention method or regularization), as the number of iterations increases, the difference between the training set loss (1.6456) and the validation set loss (2.0868) is large; the difference between the training set accuracy (0.815) and the validation set accuracy (0.3716) is about 0.4434, and the loss value on the training set differs from that on the validation set by about 0.4412, which also reflects that overfitting is severe.

Figure 7 (Step 2: L2 regularization) illustrates that the difference between the training set loss (1.611) and the validation set loss (2.0764) is also large, at about 0.4654. From the perspective of average accuracy, the difference between the training set accuracy (0.8515) and the validation set accuracy (0.3791) is about 0.4724. The overfitting problem is also serious.

As can be seen from Figure 8 (Step 3: dropout regularization method), as the number of iterations increases, the difference between the training set loss (1.6635) and the validation set loss (2.081) becomes larger and larger; the difference between the training set accuracy (0.807) and the validation set accuracy (0.3752) is about 0.4318, and from the perspective of loss value, the loss on the training set differs from that on the validation set by about 0.4175, which also reflects that overfitting is becoming more and more serious.

The Step 4 results (hybrid algorithm) shown in Figure 9 and Table 13 are the best compared with the previous steps. Regarding average accuracy, the difference between the training set accuracy (0.751) and the validation set accuracy (0.3883) is about 0.3627. From the perspective of loss value, the stability during training is greatly improved compared with the other steps; the loss value on the training set (1.710) differs from that on the validation set (2.0692) by about 0.3592, which also shows that the proposed hybrid algorithm is more effective than the other regularization methods.

(3) Time complexity. To compare the performance of each model in terms of time complexity, the cross-validation method is used, and the average time is calculated based on the LeNet-5 neural network architecture and CIFAR10 dataset, as indicated in Table 14.

As can be seen from Table 14, the average training time differences between the models on the CIFAR10 dataset are very small: the model without regularization has the smallest value (227.27 seconds), L2 regularization the largest (235.45 seconds), and the proposed hybrid algorithm the second smallest (228.09 seconds). The result shows that although the hybrid algorithm complicates the loss function, the time performance does not deteriorate; compared with L2 regularization, the training time is improved by 7.36 seconds, and the difference from the model without regularization is only 0.82 seconds.

3.2.2. Rough Auto Encoder (RAE) vs. CIFAR10 Dataset

Similar to the previous dataset, the RAE-based neural network architecture is also used for the CIFAR10 dataset. For the following tests, fully connected layers are used to build the RAE's encoder and decoder, and the hyperparameters are determined. Then, all the mentioned algorithms are compared based on the reconstruction error to measure each model's performance.

(1) Rough Auto Encoder (RAE) hyperparameters calculation. Based on the RAE autoencoder network architecture, several experiments are carried out on the CIFAR10 dataset. The hyperparameters are determined from the results of many experiments, and the average value is finally taken as the appropriate hyperparameter. The neural network's performance with different values of the learning rate, weight decay, and drop rate (p) hyperparameters is displayed in Tables 15 and 16, respectively. To select more suitable hyperparameters, this paper selects the three hyperparameter values with the best performance and takes their average as the final hyperparameter.

Table 15 reports the RAE neural network architecture trained on the CIFAR10 dataset. The three best-performing learning rate values are 0.0003, 0.0004, and 0.0005, averaged to 0.0004, and the three best-performing weight decay values are 9.00E-07, 3.00E-07, and 1.00E-07, averaged to 4.33E-07. The drop rate (p) hyperparameter, varied from 0.1 to 0.9 with a step of 0.1, is given in Table 16.

As Table 16 shows, the reconstruction error on the CIFAR10 dataset increases as the drop rate hyperparameter p increases, and the best drop rate hyperparameter is 0.1.

(2) Rough Auto Encoder (RAE) performance vs. CIFAR10 dataset. Since the input of the fully connected network is one-dimensional data and the CIFAR10 dataset contains 32 × 32 × 3 RGB images, the image data must be converted into a one-dimensional vector. Then, based on the RAE, the reconstruction errors of each algorithm on the CIFAR10 dataset are compared; a smaller reconstruction error indicates better algorithm performance. Figure 10 and Table 17 offer the comparison of the RAE reconstruction errors on the CIFAR10 dataset.

As Figure 10 indicates, the proposed hybrid algorithm can more effectively reduce the impact of data noise and bias on model training in the RAE unsupervised neural network and minimize the reconstruction error. The reconstruction errors of each algorithm are presented in Table 17. The table shows that the reconstruction error on the CIFAR10 dataset is the smallest with the proposed hybrid algorithm (0.02808) and the largest with dropout; compared with dropout, the reconstruction error is reduced by 0.00174.

(3) Time complexity. To compare the algorithm’s performance in terms of time complexity based on the CIFAR10 and RAE neural network architecture, the obtained results are displayed in Table 18.

As Table 18 demonstrates, the average time of the proposed hybrid algorithm is 4560 seconds, the second largest value; nevertheless, although the proposed algorithm complicates the loss function, the time performance does not deteriorate, and compared with L2 regularization, the training time is improved by 945.74 seconds.

4. Results

The trained models were applied to the test sets of the MNIST and CIFAR10 datasets. For the different network architectures, cross-validation was adopted to evaluate the performance of the models without regularization and with L2, dropout, and the hybrid algorithm, in terms of accuracy and training time. The results obtained with the different algorithms are given in Table 19.

As Table 19 shows, the highest accuracy belongs to the proposed hybrid algorithm and the lowest to the model without regularization, while the highest reconstruction error belongs to dropout regularization and the lowest to the hybrid algorithm, for both the MNIST and CIFAR10 datasets. The accuracy results for MNIST show that the hybrid algorithm yields improvements of 4.0%, 2.3%, and 0.9%, and for CIFAR10 improvements of 1.67%, 0.92%, and 1.31%, compared with the models without regularization, with L2, and with dropout, respectively. For LeNet-5, the hybrid algorithm's training time is among the lowest, whereas for the RAE neural network it is the second highest on both datasets. Nevertheless, compared with L2 regularization on LeNet-5, its time performance improves by 0.99 seconds on MNIST and by 7.36 seconds on CIFAR10. Likewise, when the RAE neural network is trained, the hybrid algorithm improves the time performance by 64.57 seconds on the MNIST dataset and by 945.74 seconds on CIFAR10 compared with L2 regularization.

5. Conclusion

This study introduces the hybrid algorithm, an improved algorithm for the neural network training process based on L2 and dropout regularization. The proposed algorithm combines the advantages of both: the L2 regularization term is added to the loss function so that the network model attenuates the weight parameters during training and prevents overfitting, while dropout optimizes the network structure during training by randomly suppressing some neurons to obtain different network structures; finally, a model-averaging strategy is used in the testing phase to further prevent overfitting. Using the cross-validation method, comparative experiments were conducted on different datasets with different neural network architectures: the supervised (LeNet-5) and unsupervised (RAE) architectures were verified on the MNIST (grayscale, 28 × 28 × 1) and CIFAR10 (RGB, 32 × 32 × 3) datasets. The obtained results show that the hybrid algorithm can effectively improve the model's prediction performance without much increase in training time, and its results even improve on those of L2 regularization.

In addition, the proposed algorithm can reduce the reconstruction error to a certain extent. The experimental results on small samples show that although the loss function of the proposed hybrid algorithm becomes more complicated, the algorithm can effectively improve the model's prediction performance and reduce the reconstruction error without degrading the time performance. However, there is also a small drawback: because of the hybrid algorithm, the loss function becomes more complex, and in each iteration the forward and backward propagation calculations require additional time, so training the network model takes longer than with no regularization or with dropout regularization alone. Given this potential shortcoming, we suggest that the algorithm be optimized in future work to reduce its time complexity and speed up the convergence of the network model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Xiaoyun Xie, Ming Xie, and Ata Jahangir Moshayedi developed the methodology. Xiaoyun Xie, Ming Xie, and Mohammad Hadi Noori Skandari conceptualized this study. Xiaoyun Xie, Ata Jahangir Moshayedi, and Mohammad Hadi Noori Skandari investigated this study. Xiaoyun Xie found the resources for this study. Xiaoyun Xie, Ming Xie, and Ata Jahangir Moshayedi wrote the original draft. Ata Jahangir Moshayedi and Mohammad Hadi Noori Skandari reviewed and edited the study. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

This work was supported by the 03 Special Project and 5G Project of Jiangxi Province (Grant no. 20204ABC03A18) and the National Key Research and Development Program of China (Grant no. 2020YFB1713700).