#### Abstract

Convolutional neural networks (CNNs) are becoming more and more popular today. CNNs now have become a popular feature extractor applying to image processing, big data processing, fog computing, etc. CNNs usually consist of several basic units like convolutional unit, pooling unit, activation unit, and so on. In CNNs, conventional pooling methods refer to 2×2 max-pooling and average-pooling, which are applied after the convolutional or ReLU layers. In this paper, we propose a Multiactivation Pooling (MAP) Method to make the CNNs more accurate on classification tasks without increasing depth and trainable parameters. We add more convolutional layers before one pooling layer and expand the pooling region to 4×4, 8×8, 16×16, and even larger. When doing large-scale subsampling, we pick top-k activation, sum up them, and constrain them by a hyperparameter . We pick VGG, ALL-CNN, and DenseNets as our baseline models and evaluate our proposed MAP method on benchmark datasets: CIFAR-10, CIFAR-100, SVHN, and ImageNet. The classification results are competitive.

#### 1. Introduction

Convolutional neural networks (CNNs) have excellent performance on image classification and many other visual tasks [1–10] in recent years since AlexNet [11] achieved great success in ImageNet Challenge. With the rapid development of hardware capacity, wider and deeper networks can be trained smoothly. Nowadays, increasing the depth of networks is the main trend of improving networks’ overall performance. The first proposed convolutional neural network, LeNet5 [12], has 5 layers. AlexNet contains 8 layers which consist of 5 convolutional layers and 3 fully connected layers. VGG [13] networks are designed even deeper. The deepest number of layers reaches 19, while GoogLeNet [14] achieves 22. Residual Networks (ResNets) [16, 17] and Dense Convolutional Networks (DenseNets) [15] which have been proposed in the last two years start to use shortcuts structure to allow the networks to surpass 100, even 1000-layer barrier.

When networks are deep, gradient vanishing is a serious problem during training process. ResNets [16, 17] and DenseNets [15] explicitly take advantage of the shortcuts structure between each two blocks. Highway Networks [18] integrate nonlinear transform and transform gates to create shortcuts implicitly. FractalNets [19, 20] apply drop paths, which can achieve the similar effect as ResNets. All of the methods can effectively solve the gradient vanishing.

Research [21] shows that classification accuracy will degrade rapidly when applying backpropagation [22] to deeper plain networks. Normalization method [23], to some degree, can help ease the problem. Despite lacking the potential to become deeper, plain networks still have many advantages, especially when dealing with some embedded vision tasks. The ALL Convolution Net (ALL-CNN) [24] uses only convolutional layers with small amount of parameters instead of alternating convolutional and max-pooling layers, which has outstanding performance. Based on plain network structure, many small networks perform better on limited resource. Similarity Networks (SimNets) [25] use the similarity operator and global MEX pooling method to improve the capacity of small-size networks as much as possible. SqueezeNet [26] reduces the channels of 3×3 filters and partially substitutes them by 1×1 in order to simplify the networks without hurting network capability. MobileNets [27] use depth-wise separable convolutions to build lightweight networks and work well on embedded devices. Moreover, some notable researches such as ShuffleNet [28] and SEP-Nets [29] have already applied residual structure on mobile device and can reduce computational complexity effectively. With concise structure and less parameters, many researches [10] apply different kinds of CNNs as part of their model to process the big data, which improves the whole performance.

In this paper, we propose a new pooling method that picks top- activation in every pooling region (Figure 1) to make sure that the maximum information can pass through subsampling gates. To enable this method suitable for plain networks, we do some necessary adjustment on networks architecture.** First, **we use small 3×3 receptive fields (stride = 1) throughout the whole net and add more convolutional layers before certain pooling layer. Nonlinear transform unit: ReLU [30] is followed after each convolutional layer. As input feature maps pass through more convolutional and ReLU layers, output feature maps become sparse. Therefore, expanding the pooling region and picking the max activation surely have effect on reducing noise even if some information might be ignored.** Second, **max-pooling method shows its strong performance in sparse features representation, but more key features will certainly be dropped when pooling regions are larger. To further improve the representation capability, we pick top- activation, sum them up, and constrain the summation by a constant , which ranges from zero to one. If is set equivalent to 1/, the output value of pooling layer is the average of the picked top- activation. Generally, we make a little bigger than 1/ to prevent activation from weakening too much when there are only few features active.

In plain networks, such as VGG and ALL-CNN, our MAP method is competitive in achieving higher accuracy on classification tasks with depth and trainable parameters not increased. When reducing the number of layers, the cost of graphics memories or internal memories will decrease a lot during training process. Therefore, networks with MAP method have huge advantage against both traditional networks and deep networks with shortcuts structure. Figure 2 shows the plain networks used different pooling strategies. In Networks with shortcuts structure, like DenseNets, we proposed plain structure (Figure 3) with MAP method into the Transition layers to extract features. Although a little more trainable parameters are added, the classification results improve.

We evaluate MAP method on 4 notable benchmarks datasets (CIFAR-10, CIFAR-100 [31], SVHN, and ImageNet). We mainly choose VGG [13] as our baseline models and it turns out that our enhanced models outperform them. We also apply our method to ALL-CNN [24] and DenseNets [15] and also get better results.

#### 2. Related Works

In visual task, most of the networks insert pooling layers between convolutional layers for features abstraction and dimension reduction. The idea of pooling originates in Hubel and Wiesel’s seminars [32] on complex cells in the visual cortex. In 1982, Fukuhshima and Miyake used feature pooling in neocognitron [33]. Pooling method has been widely used since the concept of convolutional neural networks structure was first put forward. LeNet-5 [12] has two convolutional layers and a pooling layer closely follows each of them. At that time, max-pooling and average-pooling both performed well in LeNet-5. Although the development of CNN slowed down for a period, the researches of pooling method never stop. In many famous traditional algorithms like SIFT [34] and HOG [35], pooling method plays an important role.

Two most popular pooling methods, max-pooling and average-pooling, have already been deeply researched in theory [37]. When choosing pooling strategies, there is always a trade-off between preserving more information which may include noise and decreasing more noise as well as useful information. Through research, it is found that max-pooling is more suitable for sparse coding than average-pooling [38].

Recently, CNNs have achieved great success on various visual tasks. Pooling layers have already become a fixed part in almost every network. Moreover, many new pooling methods are proposed one after another, which inspire us a lot. Different from conventional deterministic pooling operations, Stochastic Pooling [39] method picks activation randomly from each pooling region according to multinomial distribution. This notable research highly improves the generalization of networks. When applying Dropout to max-pooling layers [40] on training datasets, a new pooling method called scaled max-pooling [41] is designed to be applied on testing datasets. This idea of adopting different pooling methods on training and testing datasets is novel and effective. Combined with Dropout, probability weighted pooling [39] method also performs well. Another creative method, Spatial Pyramid Pooling [42] method, uses different size of pooling windows before the fully connected layers to solve the problem of feature loss caused by arbitrary input image size. This method has huge advantage on extracting different active features, which is similar to our idea. In addition, the idea of expanding pooling regions has already been implemented in some networks such as Value Iteration Network (VIN) [43]. VIN uses the large-scale channel-wise max-pooling method to improve the generalization of networks. The idea of mixing max-pooling and average-pooling method [44] has been proven effective. It is similar to our proposed method.

At the bottom of networks, almost every high-level feature in feature maps is useful. Global average-pooling (GAP) [45] method enables all of them to contribute to the final representation. Global max-pooling [46], which focuses on the most active feature in every feature map, is a universal method in object detection tasks. These two existing global pooling methods share a key characteristic: they both concentrate on large-scale pooling size.

#### 3. Multiactivation Pooling

##### 3.1. Conventional Pooling Methods

Formally, we denote a single input feature map of a pooling layer as an assemblage . Before entering the pooling gate, can be regarded as a combination of several small local regions [39], , where n is constrained by both the size of input feature maps and the size of pooling regions.

Consider an arbitrary local region where is an index between 0 and . We denote the size of pooling region as and each element as . Here, we define

The conventional pooling methods, max-pooling and average-pooling, make the use of different strategies dealing with the elements in each pooling region. We denote the output after pooling as . Equation (2) describes max-pooling:

Equation (3) describes average-pooling:

Max-pooling method only picks the most active feature in a pooling region. On the contrary, average-pooling method takes all of the features into consideration. Thus, max-pooling method detects more texture information, while average-pooling method preserves more background information.

##### 3.2. Multiactivation Pooling Method

In most cases, 2×2 pooling method without overlapping is frequently used. This means in (1) is always set to 2. Large-scale pooling means is set bigger such as 4, 8, and 16, which depends on the input image size. Figure 2 (**middle)** shows the 8×8 max-pooling and 4×4 average-pooling in VGG-11 architecture.

Based on the large pooling region, we propose Multiactivation Pooling method which allows top- activation to pass through the pooling gate, where indicates the total number of the picked activation. If = , it means every activation will contribute to the final output; if = , only the max activation can pass through the gate. To avoid mixing too much useless features, is set small and constrained by the size of pooling region; e.g., is set to 4, while pooling region is 8×8.

In an arbitrary pooling region , as defined in (1), we denote the -picked activation as , where . Here we have

where the symbol means removing elements from an assemblage. The summation symbol in (4) indicates a small set of elements, which contains top1~ activation but not adding up all the activation numerically.

After picking out the top- activation, we do not simply add them together as an output or just compute the average of them. We propose a hyperparameter as a constraint factor to multiply the sum of the top- activation.

The final output refers to

Here, the summation symbol means sum operation, where . Particularly, if , the output is the average value.

The constraint factor, , is used to adjust the output. When few features remain active after ReLU gate, i.e., few positive values can move forward, if is set too small, positive values will be heavily weakened. On the contrary, when more features remain active, if is set more close to 1, output values will be big and easy to distort. In practice, the value of is influenced by the value of . We carry out a lot of experiments and finally get a group of proper combination of and shown in Table 2.

##### 3.3. Network Architectures

We investigate two different plain network architectures in popularity: VGG and ALL-CNN series. We mainly do research on VGG-11 and VGG-13 architectures shown in Table 1. We also pick one model from ALL-CNN series as our baseline model, which is shown in Table 5. For network with shortcuts structure, we pick one model from DenseNets series as our baseline model, which is shown in Table 7.

*VGG. *The series of VGG architectures first adopt small size (33) receptive field and use large number of filters (512) with much more trainable parameters in each convolutional layers. VGG architectures also put more convolutional layers together before pooling layers, which better extract features.

*ALL-CNN. *ALL-CNN first uses convolutional layers to subsample the feature maps. Compared to VGG, ALL-CNN are designed with less trainable parameters.

*DenseNets. *DenseNets are typical deep convolutional neural networks with shortcuts structure. The architecture alternatively uses Dense Blocks and Transition layers to extract features [15]. In every Transition layer, structures such as “BN-ReLU-conv1-avgpool (2×2)” are adopted to subsample the features.

##### 3.4. Implementation

On VGG nets, all convolutional layers have equal kernel size 3×3 together with the same padding strategy, zero-padded by one pixel. On ALL-CNN, 1×1 convolutional layers are also included. When applying large-scale pooling method or MAP to certain architectures, total number of layers and trainable parameters are kept the same as baseline models. Before the fully connected layers and softmax classifier, average-pooling method is adopted in every model. When dealing with two key hyperparameters, and , we use different combinations according to different pooling size. Table 2 shows the optimal combinations in experiment.

#### 4. Experimental Results

##### 4.1. Datasets

*CIFAR*. The two benchmark datasets, CIFAR-10 and CIFAR-100 [31], are consisted of tiny RGB images with 32×32 pixels. Both of the two datasets contain 50000 training images and 10000 testing images. Before training, we adopt a standard data normalization method [23] using the channel means and standard deviations. For training data augmentation, we randomly apply horizontal flip and crop on 4 pixels padded images. Finally, we evaluate the classification result by the test error rate.

*SVHN. *The Street View House Number datasets is a real-world image dataset obtained from house numbers in Google Street View images, which contains 10 classes, one for each digit. The dataset contains 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data. The images in SVHN are in two formats: original images and MNIST-like images with 32×32 pixels, and we choose the latter in our experiments.

*ImageNet. *The ILSVRC 2012 classification dataset [47, 48] consists of 1.2 million images for training and 50,000 for validation, 1000 classes in total. We adopt the same data augmentation scheme for training images as in [16, 17] and apply a single-crop or 10-crop with size 224×224 at test. Our experiment is only based on a few models due to the limit of computing speed because of the lack of GPU device.

##### 4.2. Training Details

For every network, we use stochastic gradient descent (SGD) with momentum of 0.9 and weight decay of 0.0005 in backpropagation process. The initial learning rate is set to 0.05. The batch size is set to 128 and the images in every batch are shuffled every epoch. We do not use BN [23] and LRN [11] in every layer of our networks in order to keep the sparsity of feature maps. In fully connected layers, 50% dropout rate is adopted. For small-size datasets (CIFAR and SVHN), the total number of epochs is set to 150 and learning rate will be divided by 10 every 45 epochs.

For ImageNet, we train models for 100 epochs with the batch size of 128. The learning rate is set to 0.1 initially and is lowered by 10 times at epoch 40 and 80. We only experiment this large-scale datasets on a few model of VGG-11 series (Table 1). Due to different baseline models, some training details will be slightly adjusted.

##### 4.3. Classification on VGG

To evaluate the efficiency of our MAP method, we train the networks shown in Table 1 on VGG architectures. We evaluate top-1 error rates and show the result in Table 3.

*Large-Scale Max-Pooling Method. *When the depth of layers and the number of parameters keep unchanged, piling up more convolutional layers together with large-scale pooling area extract features better. Based on this theory, we first reconstruct the two VGG architectures and then replace the original 2×2 pooling layers with larger-scale pooling layers. As is shown in Table 1, model A and model E are two baseline models. Model B and model F use 4×4 pooling layers. Model C and Model G use 8×8 pooling layers, while model D and model H apply even larger pooling size 16×16. Increasing the size of pooling layers leads the number of pooling layers to decrease. The result in second row of Table 3 shows that large-scale max-pooling method decreases error rate.

*MAP Method. *Table 3 shows that large-scale max-pooling method helps improve the classification accuracy. Then, we keep using the large pooling size, but instead of using the max-pooling method, we replace it with MAP method. The results from the last row of Table 3 (both** top **and** bottom**) are the most noticeable. First, with MAP method, the lowest error rate based on VGG-11 architecture is 6.94%, dropping around 20% compared to the original error rate of baseline model. We get similar result when the baseline model is VGG-13. The error rate also drops when integrating the MAP method into models. Second, if we analyze the result by column, it is clear that models with MAP layers perform better than models with max-pooling layers. We also try average-pooling method. Most of the results are even poorer than the results of baseline model.

Noticeably, the relationship between pooling size and error rate is not monotonic. Through experiment, we find that blindly increasing the pooling size is of no use.

On CIFAR-100 datasets, models with MAP method work much better than baseline models. For the reason that 88 MAP method is the strongest (Table 3), we pick 2 models with this method as comparison. Results are shown in Table 4. When integrating MAP method in VGG-11, the error rate decreases around 2.5%. In VGG-13, the error rate comes to 1.5%. With MAP method, model C performs even better than model E that has two more layers and more trainable parameters. On SVHN datasets, model with 88 MAP method is also the strongest as Table 3 shows. The classification results prove that MAP method with large pooling kernel size can still work well, while the number of classes are larger and the training images of each class are fewer.

##### 4.4. Classification on ALL-CNN

Through changing the stride from 1 to 2, convolutional layers are able to subsample as well as pooling layers. Based on ALL-CNN, we first rearranged the convolutional layers and then integrated large-scale max-pooling and MAP method into the model (Table 5). We only experimented on CIFAR-10 datasets and got exciting classification results in Table 6. It shows that the architecture with 88 MAP method lowers the error rate to 7.11%, which outperforms the result (8.07%) of our baseline model and even a little better than the result in their paper (7.25%).

##### 4.5. Classification on DenseNets

To integrate our MAP method into DenseNets, we change the original structure into “BN-ReLU-[conv3-ReLU] ×3-MAP (4×4)”. As is shown in Table 7, DenseNet-40_MAP is the model with MAP method Based on DenseNet-40 with L = 40, k = 12, = 1 [15].

We used the code implemented on Pytorch at https://github.com/liuzhuang13/DenseNet. When training the two models, the initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of total 200 epochs [15]. The batch size is 64. The classification results are shown in Table 8. The model with plain structure (Figure 3,** right**) achieves lower error rate on every small-size datasets (4.99% for CIFAR-10, 24.03% for CIFAR-100, 1.68% for SVHN) than baseline model. CIFAR-10 testing error rate curves show the strong performance of DenseNet-40_MAP (Figure 4) more clearly.

In plain networks, the basic structure like “conv3-ReLU-MAP” (Figure 3) is proved very effective on extracting features. In DenseNets, we integrate the structure into Transition layers to deal with output feature maps from Dense Blocks and classification results are encouraging.

##### 4.6. Classification on ImageNet

We only used this large-scale datasets on model A and model C in our experiment. The results are shown in Table 4. Compared to VGG-11 baseline model, VGG-11 model with 88 MAP method lowers the top-5 error rate from 10.86% to 10.19%. Although limited results are listed, our results on large-scale datasets are still encouraging.

##### 4.7. Classification on Compact Models

*Model Complexity Analysis*. We evaluate the model complexity mainly in several perspectives. First, we compare the number of parameters among VGG models. From Table 10, it can be seen that the two series model B-D and model F-H both have fewer trainable parameters than their baseline models. Compared to baseline model A, model B-D series cut trainable parameters from 9.8M to 5.4M. To achieve that, we simply reduce the number of filters in some convolutional layers. It is obvious that models with large-scale max-pooling method or MAP method show strong performance even when the trainable parameters are fewer. Second, each of the three models such as model B-D has the same number of trainable parameters and depth. MAP method is better than large-scale max-pooling method under the same condition.

All of our test models are simply composed of only three kinds of different layers: convolutional layers, pooling layers, and fully connected layers without adding any additional layers like BN [23] and LRN [11]. In addition, all of our searched networks are not that deep. It saves a lot of memory during training and it is easy to implement.

*Compact Model Classification. *Based on models with same number of trainable parameters and depth, we have proven our proposed method more effective. For further research, we keep reducing the number of layers. Based on model A, we drop two more convolutional layers and name the model series VGG-9 which is shown in Table 9. By comparing the classification results of model A, A0, and B on CIFAR-10 datasets (Figure 5), we conclude that the test error rate of model B0 that has MAP layers achieves 8.59%, even lower than the result of model A which has more layers and trainable parameters.

Compared to model A, model B0 cuts down nearly 70% of the trainable parameters from 9.8M to 2.9M. However, 2.9M is still a large number. Therefore, more compact models with smaller amount of parameters are proposed in Table 10. Inspired by bottleneck structure [15], we choose to use 1×1 filters instead of 3×3 on the first convolutional layer locating after MAP layer in model C0. Compared to model B0, model C0 achieves better result (Figure 5) with more than 50% decline of trainable parameter (Table 10). Based on this encouraging result, we continue reducing one fully connected layer and achieve close result to model C0. Both model C0 and D0 perform better than baseline model A with only 12% of the trainable parameters.

However, using the MAP method without increasing the number of parameters and layers cannot avoid increasing the computational complexity during training process, because more filters are forced to slide on bigger features maps.

#### 5. Discussion

##### 5.1. MAP Method Analysis

*Sparsity Analysis*. In CNNs, smaller receptive field size is proved more effective in most cases. Researches also show that piling up more convolutional layers extract features better. If basic blocks like “conv3-ReLU” are put together, the networks will be able to do multilevel template matching where negative values are set to zero by ReLU. Output feature maps are very sparse and activation left is quite important.

Figure 6 shows the output feature maps in different convolutional layers. Model A-C adopts different pooling size and the arrangement of the convolutional layers is different either. From left to right, the feature maps become more and more sparse. Theoretically, average-pooling is not suitable in sparse feature maps because useful activation will be heavily diluted. From top to bottom, we find that activation distributes densely in small local parts. So simply using max-pooling method may cause the loss of some good activation. Our MAP method finds a way to balance feature lost and feature dilution, which performs very well.

*Visualization with Deconvolution. *Deconvolution can be thought as a model that uses the same components (filtering, pooling) but in reverse. Here, deconvolution is used on our pretrained CNN model C and we use this method to map multiactivation back to the input pixel space, showing what input pattern eventually cause given activation in the feature maps. [36].

In this paper, we use basic structure (Figure 3 right) in our CNN models to thinning the feature maps. Every feature map in different convolutional layers can be regarded as a kind of representation of the original image. In other words, every feature map reflects characteristics of an original image. Through deconvolution, the reflection maps are visualized. In Figure 7, the deconvolution results on CIFAR-10 are shown. We use pretrain model C (Table 1), which contains an 8×8 pooling layers. There are 5 convolutional layers piling up before the large kernel-size-pooling layers. The channel numbers of each layer are 64, 128, 128, 128, and 256. As is shown in Figure 7, features maps become more and more sparse from the first convolutional layer to the fifth convolutional layer. Sparse feature maps often reflect the remarkable local characteristics of an image. In this way, our models can focus on different local parts specifically owned by different class of images, which represent raw images better.

##### 5.2. Model Structure Analysis

On plain networks and networks with shortcuts structure, MAP layers are always combined with plain structure shown in Figure 3. Through the analysis of feature maps sparsity and deconvolution, we believe that the plain structure can provide good activation and MAP method can make good use of this activation. In addition, the size of feature maps after 8×8 MAP layer becomes small (4×4 on CIFAR-10; 28×28 on ImageNet). We think that each activated feature (positive value in feature maps) on small feature maps is relatively independent. Therefore, trying to find the relationship between activated features in a 3×3 area may be unnecessary. Through experiment, we prove that 1×1 convolution can better represent the features picking out from MAP layer. At the same time, the trainable parameters are reduced a lot.

Since the MAP method can provide such good activation, we find a good way to take better advantage of this activation. Through mixing the plain structure and shortcuts structure, the classification results improve again! In Figure 8, the shortcuts structure is used to sum the output from MAP layer and the output from the 1×1 convolutional layer. We apply this structure to model C0 (Table 9) and name the new model “C0+shortcuts”. All the parameters are keeping the same. The lowest error rate decreases from 8.16% to 8.07%. Noticeably, in Figure 5, the error rate curve of the new model lies beneath all the other curves, indicating that the overall capability of the new model is superior when fully trained.

##### 5.3. Comparison between MAP and Mixing Pooling

Mixing pooling method [44] mainly contributes to doing a trade-off between max-pooling and average-pooling. Both our method and theirs are based on a hypothesis that only using max-pooling method or average-pooling method cannot produce optimal results. However, MAP and mix pooling are quite different in the following aspects.

First, the novelty of mixing pooling method is bringing learning into the pooling operation. Two main expressions are

where denotes the values in the region being pooled; is a scalar specifying a certain combination of max and average; denotes the values in a “gating mask”. is a sigmoid function. and are trainable.

Obviously, mixing pooling method combines max-pooling and average-pooling and constrains them with trainable factors. However, our MAP method only chooses top-k activation in a pooling region, which means most of the features are ignored.

Second, our MAP method is working on large pooling kernel size and takes good use of the sparsity of feature maps, while mixing pooling method does not take feature maps pattern and pooling kernel size into consideration.

Third, MAP method does not need more layers and trainable parameters. It can even achieve higher performance with fewer trainable parameters. On the contrary, mixing pooling method introduces trainable parameters into pooling layers to help improve model capacity.

##### 5.4. Hyperparameter Analysis

In Table 2, the optimal combinations of and under different pooling size are listed. Instead of arbitrarily choosing these two hyperparameters, we picked the best combination of and through a great amount of experiments. The CIFAR-10 classification results with different hyperparameters based on VGG-11 are shown in Table 11. The first row of Table 11 corresponds to the max-pooling result and the fourth row corresponds to the average-pooling result. On large sparse feature maps, only choosing one or two features ignores quite big number of active features. On the contrary, choosing too much features adds a lot of useless closed features, which leads to feature dilution.

#### 6. Conclusion

In this paper, we propose a new pooling method, which refers to Multiactivation Pooling (MAP) method. It provides a new subsampling idea to better pick features. We apply our method on convolutional neural networks. In our experiment, MAP method can highly improve the classification accuracy in CNNs. Actually, the plain structure “conv-ReLU-MAP” is a better feature-extractor in CNNs. Based on this structure, we also prove the feasibility of applying MAP method on DenseNets. Moreover, we carry out additional experiments such as sparsity analysis and deconvolution to illustrate that the plain structure can provide “good” activation.

We hope to study further and try to explore the potential of the basic layers. MAP method still has huge space for development. It may perform better by carefully fine-tuning hyperparameters, and . In the future, we plan to study further on MAP method and test it on different datasets and different networks.

#### Data Availability

All the datasets in this paper are available on the Internet. Previously reported CIFAR (CIFAR-10 and CIFAR-100) data were used to support this study and are available at http://www.cs.toronto.edu/~kriz/cifar.html. Previously reported SVHN data were used to support this study and are available at http://ufldl.stanford.edu/housenumbers/. The ImageNet (ILSVRC2012) data used to support the findings of this study were supplied by http://image-net.org/download-images under license and so cannot be made freely available. Signing up at http://image-net.org/download-images should be made for data availability.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

All the authors are supported by the National Natural Science Foundation of China (Grant 61772052).