Research Article  Open Access
Multiactivation Pooling Method in Convolutional Neural Networks for Image Recognition
Abstract
Convolutional neural networks (CNNs) are becoming more and more popular today. CNNs now have become a popular feature extractor applying to image processing, big data processing, fog computing, etc. CNNs usually consist of several basic units like convolutional unit, pooling unit, activation unit, and so on. In CNNs, conventional pooling methods refer to 2×2 maxpooling and averagepooling, which are applied after the convolutional or ReLU layers. In this paper, we propose a Multiactivation Pooling (MAP) Method to make the CNNs more accurate on classification tasks without increasing depth and trainable parameters. We add more convolutional layers before one pooling layer and expand the pooling region to 4×4, 8×8, 16×16, and even larger. When doing largescale subsampling, we pick topk activation, sum up them, and constrain them by a hyperparameter . We pick VGG, ALLCNN, and DenseNets as our baseline models and evaluate our proposed MAP method on benchmark datasets: CIFAR10, CIFAR100, SVHN, and ImageNet. The classification results are competitive.
1. Introduction
Convolutional neural networks (CNNs) have excellent performance on image classification and many other visual tasks [1–10] in recent years since AlexNet [11] achieved great success in ImageNet Challenge. With the rapid development of hardware capacity, wider and deeper networks can be trained smoothly. Nowadays, increasing the depth of networks is the main trend of improving networks’ overall performance. The first proposed convolutional neural network, LeNet5 [12], has 5 layers. AlexNet contains 8 layers which consist of 5 convolutional layers and 3 fully connected layers. VGG [13] networks are designed even deeper. The deepest number of layers reaches 19, while GoogLeNet [14] achieves 22. Residual Networks (ResNets) [16, 17] and Dense Convolutional Networks (DenseNets) [15] which have been proposed in the last two years start to use shortcuts structure to allow the networks to surpass 100, even 1000layer barrier.
When networks are deep, gradient vanishing is a serious problem during training process. ResNets [16, 17] and DenseNets [15] explicitly take advantage of the shortcuts structure between each two blocks. Highway Networks [18] integrate nonlinear transform and transform gates to create shortcuts implicitly. FractalNets [19, 20] apply drop paths, which can achieve the similar effect as ResNets. All of the methods can effectively solve the gradient vanishing.
Research [21] shows that classification accuracy will degrade rapidly when applying backpropagation [22] to deeper plain networks. Normalization method [23], to some degree, can help ease the problem. Despite lacking the potential to become deeper, plain networks still have many advantages, especially when dealing with some embedded vision tasks. The ALL Convolution Net (ALLCNN) [24] uses only convolutional layers with small amount of parameters instead of alternating convolutional and maxpooling layers, which has outstanding performance. Based on plain network structure, many small networks perform better on limited resource. Similarity Networks (SimNets) [25] use the similarity operator and global MEX pooling method to improve the capacity of smallsize networks as much as possible. SqueezeNet [26] reduces the channels of 3×3 filters and partially substitutes them by 1×1 in order to simplify the networks without hurting network capability. MobileNets [27] use depthwise separable convolutions to build lightweight networks and work well on embedded devices. Moreover, some notable researches such as ShuffleNet [28] and SEPNets [29] have already applied residual structure on mobile device and can reduce computational complexity effectively. With concise structure and less parameters, many researches [10] apply different kinds of CNNs as part of their model to process the big data, which improves the whole performance.
In this paper, we propose a new pooling method that picks top activation in every pooling region (Figure 1) to make sure that the maximum information can pass through subsampling gates. To enable this method suitable for plain networks, we do some necessary adjustment on networks architecture. First, we use small 3×3 receptive fields (stride = 1) throughout the whole net and add more convolutional layers before certain pooling layer. Nonlinear transform unit: ReLU [30] is followed after each convolutional layer. As input feature maps pass through more convolutional and ReLU layers, output feature maps become sparse. Therefore, expanding the pooling region and picking the max activation surely have effect on reducing noise even if some information might be ignored. Second, maxpooling method shows its strong performance in sparse features representation, but more key features will certainly be dropped when pooling regions are larger. To further improve the representation capability, we pick top activation, sum them up, and constrain the summation by a constant , which ranges from zero to one. If is set equivalent to 1/, the output value of pooling layer is the average of the picked top activation. Generally, we make a little bigger than 1/ to prevent activation from weakening too much when there are only few features active.
In plain networks, such as VGG and ALLCNN, our MAP method is competitive in achieving higher accuracy on classification tasks with depth and trainable parameters not increased. When reducing the number of layers, the cost of graphics memories or internal memories will decrease a lot during training process. Therefore, networks with MAP method have huge advantage against both traditional networks and deep networks with shortcuts structure. Figure 2 shows the plain networks used different pooling strategies. In Networks with shortcuts structure, like DenseNets, we proposed plain structure (Figure 3) with MAP method into the Transition layers to extract features. Although a little more trainable parameters are added, the classification results improve.
We evaluate MAP method on 4 notable benchmarks datasets (CIFAR10, CIFAR100 [31], SVHN, and ImageNet). We mainly choose VGG [13] as our baseline models and it turns out that our enhanced models outperform them. We also apply our method to ALLCNN [24] and DenseNets [15] and also get better results.
2. Related Works
In visual task, most of the networks insert pooling layers between convolutional layers for features abstraction and dimension reduction. The idea of pooling originates in Hubel and Wiesel’s seminars [32] on complex cells in the visual cortex. In 1982, Fukuhshima and Miyake used feature pooling in neocognitron [33]. Pooling method has been widely used since the concept of convolutional neural networks structure was first put forward. LeNet5 [12] has two convolutional layers and a pooling layer closely follows each of them. At that time, maxpooling and averagepooling both performed well in LeNet5. Although the development of CNN slowed down for a period, the researches of pooling method never stop. In many famous traditional algorithms like SIFT [34] and HOG [35], pooling method plays an important role.
Two most popular pooling methods, maxpooling and averagepooling, have already been deeply researched in theory [37]. When choosing pooling strategies, there is always a tradeoff between preserving more information which may include noise and decreasing more noise as well as useful information. Through research, it is found that maxpooling is more suitable for sparse coding than averagepooling [38].
Recently, CNNs have achieved great success on various visual tasks. Pooling layers have already become a fixed part in almost every network. Moreover, many new pooling methods are proposed one after another, which inspire us a lot. Different from conventional deterministic pooling operations, Stochastic Pooling [39] method picks activation randomly from each pooling region according to multinomial distribution. This notable research highly improves the generalization of networks. When applying Dropout to maxpooling layers [40] on training datasets, a new pooling method called scaled maxpooling [41] is designed to be applied on testing datasets. This idea of adopting different pooling methods on training and testing datasets is novel and effective. Combined with Dropout, probability weighted pooling [39] method also performs well. Another creative method, Spatial Pyramid Pooling [42] method, uses different size of pooling windows before the fully connected layers to solve the problem of feature loss caused by arbitrary input image size. This method has huge advantage on extracting different active features, which is similar to our idea. In addition, the idea of expanding pooling regions has already been implemented in some networks such as Value Iteration Network (VIN) [43]. VIN uses the largescale channelwise maxpooling method to improve the generalization of networks. The idea of mixing maxpooling and averagepooling method [44] has been proven effective. It is similar to our proposed method.
At the bottom of networks, almost every highlevel feature in feature maps is useful. Global averagepooling (GAP) [45] method enables all of them to contribute to the final representation. Global maxpooling [46], which focuses on the most active feature in every feature map, is a universal method in object detection tasks. These two existing global pooling methods share a key characteristic: they both concentrate on largescale pooling size.
3. Multiactivation Pooling
3.1. Conventional Pooling Methods
Formally, we denote a single input feature map of a pooling layer as an assemblage . Before entering the pooling gate, can be regarded as a combination of several small local regions [39], , where n is constrained by both the size of input feature maps and the size of pooling regions.
Consider an arbitrary local region where is an index between 0 and . We denote the size of pooling region as and each element as . Here, we define
The conventional pooling methods, maxpooling and averagepooling, make the use of different strategies dealing with the elements in each pooling region. We denote the output after pooling as . Equation (2) describes maxpooling:
Equation (3) describes averagepooling:
Maxpooling method only picks the most active feature in a pooling region. On the contrary, averagepooling method takes all of the features into consideration. Thus, maxpooling method detects more texture information, while averagepooling method preserves more background information.
3.2. Multiactivation Pooling Method
In most cases, 2×2 pooling method without overlapping is frequently used. This means in (1) is always set to 2. Largescale pooling means is set bigger such as 4, 8, and 16, which depends on the input image size. Figure 2 (middle) shows the 8×8 maxpooling and 4×4 averagepooling in VGG11 architecture.
Based on the large pooling region, we propose Multiactivation Pooling method which allows top activation to pass through the pooling gate, where indicates the total number of the picked activation. If = , it means every activation will contribute to the final output; if = , only the max activation can pass through the gate. To avoid mixing too much useless features, is set small and constrained by the size of pooling region; e.g., is set to 4, while pooling region is 8×8.
In an arbitrary pooling region , as defined in (1), we denote the picked activation as , where . Here we have
where the symbol means removing elements from an assemblage. The summation symbol in (4) indicates a small set of elements, which contains top1~ activation but not adding up all the activation numerically.
After picking out the top activation, we do not simply add them together as an output or just compute the average of them. We propose a hyperparameter as a constraint factor to multiply the sum of the top activation.
The final output refers to
Here, the summation symbol means sum operation, where . Particularly, if , the output is the average value.
The constraint factor, , is used to adjust the output. When few features remain active after ReLU gate, i.e., few positive values can move forward, if is set too small, positive values will be heavily weakened. On the contrary, when more features remain active, if is set more close to 1, output values will be big and easy to distort. In practice, the value of is influenced by the value of . We carry out a lot of experiments and finally get a group of proper combination of and shown in Table 2.
3.3. Network Architectures
We investigate two different plain network architectures in popularity: VGG and ALLCNN series. We mainly do research on VGG11 and VGG13 architectures shown in Table 1. We also pick one model from ALLCNN series as our baseline model, which is shown in Table 5. For network with shortcuts structure, we pick one model from DenseNets series as our baseline model, which is shown in Table 7.


VGG. The series of VGG architectures first adopt small size (33) receptive field and use large number of filters (512) with much more trainable parameters in each convolutional layers. VGG architectures also put more convolutional layers together before pooling layers, which better extract features.
ALLCNN. ALLCNN first uses convolutional layers to subsample the feature maps. Compared to VGG, ALLCNN are designed with less trainable parameters.
DenseNets. DenseNets are typical deep convolutional neural networks with shortcuts structure. The architecture alternatively uses Dense Blocks and Transition layers to extract features [15]. In every Transition layer, structures such as “BNReLUconv1avgpool (2×2)” are adopted to subsample the features.
3.4. Implementation
On VGG nets, all convolutional layers have equal kernel size 3×3 together with the same padding strategy, zeropadded by one pixel. On ALLCNN, 1×1 convolutional layers are also included. When applying largescale pooling method or MAP to certain architectures, total number of layers and trainable parameters are kept the same as baseline models. Before the fully connected layers and softmax classifier, averagepooling method is adopted in every model. When dealing with two key hyperparameters, and , we use different combinations according to different pooling size. Table 2 shows the optimal combinations in experiment.
4. Experimental Results
4.1. Datasets
CIFAR. The two benchmark datasets, CIFAR10 and CIFAR100 [31], are consisted of tiny RGB images with 32×32 pixels. Both of the two datasets contain 50000 training images and 10000 testing images. Before training, we adopt a standard data normalization method [23] using the channel means and standard deviations. For training data augmentation, we randomly apply horizontal flip and crop on 4 pixels padded images. Finally, we evaluate the classification result by the test error rate.
SVHN. The Street View House Number datasets is a realworld image dataset obtained from house numbers in Google Street View images, which contains 10 classes, one for each digit. The dataset contains 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data. The images in SVHN are in two formats: original images and MNISTlike images with 32×32 pixels, and we choose the latter in our experiments.
ImageNet. The ILSVRC 2012 classification dataset [47, 48] consists of 1.2 million images for training and 50,000 for validation, 1000 classes in total. We adopt the same data augmentation scheme for training images as in [16, 17] and apply a singlecrop or 10crop with size 224×224 at test. Our experiment is only based on a few models due to the limit of computing speed because of the lack of GPU device.
4.2. Training Details
For every network, we use stochastic gradient descent (SGD) with momentum of 0.9 and weight decay of 0.0005 in backpropagation process. The initial learning rate is set to 0.05. The batch size is set to 128 and the images in every batch are shuffled every epoch. We do not use BN [23] and LRN [11] in every layer of our networks in order to keep the sparsity of feature maps. In fully connected layers, 50% dropout rate is adopted. For smallsize datasets (CIFAR and SVHN), the total number of epochs is set to 150 and learning rate will be divided by 10 every 45 epochs.
For ImageNet, we train models for 100 epochs with the batch size of 128. The learning rate is set to 0.1 initially and is lowered by 10 times at epoch 40 and 80. We only experiment this largescale datasets on a few model of VGG11 series (Table 1). Due to different baseline models, some training details will be slightly adjusted.
4.3. Classification on VGG
To evaluate the efficiency of our MAP method, we train the networks shown in Table 1 on VGG architectures. We evaluate top1 error rates and show the result in Table 3.

LargeScale MaxPooling Method. When the depth of layers and the number of parameters keep unchanged, piling up more convolutional layers together with largescale pooling area extract features better. Based on this theory, we first reconstruct the two VGG architectures and then replace the original 2×2 pooling layers with largerscale pooling layers. As is shown in Table 1, model A and model E are two baseline models. Model B and model F use 4×4 pooling layers. Model C and Model G use 8×8 pooling layers, while model D and model H apply even larger pooling size 16×16. Increasing the size of pooling layers leads the number of pooling layers to decrease. The result in second row of Table 3 shows that largescale maxpooling method decreases error rate.
MAP Method. Table 3 shows that largescale maxpooling method helps improve the classification accuracy. Then, we keep using the large pooling size, but instead of using the maxpooling method, we replace it with MAP method. The results from the last row of Table 3 (both top and bottom) are the most noticeable. First, with MAP method, the lowest error rate based on VGG11 architecture is 6.94%, dropping around 20% compared to the original error rate of baseline model. We get similar result when the baseline model is VGG13. The error rate also drops when integrating the MAP method into models. Second, if we analyze the result by column, it is clear that models with MAP layers perform better than models with maxpooling layers. We also try averagepooling method. Most of the results are even poorer than the results of baseline model.
Noticeably, the relationship between pooling size and error rate is not monotonic. Through experiment, we find that blindly increasing the pooling size is of no use.
On CIFAR100 datasets, models with MAP method work much better than baseline models. For the reason that 88 MAP method is the strongest (Table 3), we pick 2 models with this method as comparison. Results are shown in Table 4. When integrating MAP method in VGG11, the error rate decreases around 2.5%. In VGG13, the error rate comes to 1.5%. With MAP method, model C performs even better than model E that has two more layers and more trainable parameters. On SVHN datasets, model with 88 MAP method is also the strongest as Table 3 shows. The classification results prove that MAP method with large pooling kernel size can still work well, while the number of classes are larger and the training images of each class are fewer.


4.4. Classification on ALLCNN
Through changing the stride from 1 to 2, convolutional layers are able to subsample as well as pooling layers. Based on ALLCNN, we first rearranged the convolutional layers and then integrated largescale maxpooling and MAP method into the model (Table 5). We only experimented on CIFAR10 datasets and got exciting classification results in Table 6. It shows that the architecture with 88 MAP method lowers the error rate to 7.11%, which outperforms the result (8.07%) of our baseline model and even a little better than the result in their paper (7.25%).


4.5. Classification on DenseNets
To integrate our MAP method into DenseNets, we change the original structure into “BNReLU[conv3ReLU] ×3MAP (4×4)”. As is shown in Table 7, DenseNet40_MAP is the model with MAP method Based on DenseNet40 with L = 40, k = 12, = 1 [15].
We used the code implemented on Pytorch at https://github.com/liuzhuang13/DenseNet. When training the two models, the initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of total 200 epochs [15]. The batch size is 64. The classification results are shown in Table 8. The model with plain structure (Figure 3, right) achieves lower error rate on every smallsize datasets (4.99% for CIFAR10, 24.03% for CIFAR100, 1.68% for SVHN) than baseline model. CIFAR10 testing error rate curves show the strong performance of DenseNet40_MAP (Figure 4) more clearly.

In plain networks, the basic structure like “conv3ReLUMAP” (Figure 3) is proved very effective on extracting features. In DenseNets, we integrate the structure into Transition layers to deal with output feature maps from Dense Blocks and classification results are encouraging.
4.6. Classification on ImageNet
We only used this largescale datasets on model A and model C in our experiment. The results are shown in Table 4. Compared to VGG11 baseline model, VGG11 model with 88 MAP method lowers the top5 error rate from 10.86% to 10.19%. Although limited results are listed, our results on largescale datasets are still encouraging.
4.7. Classification on Compact Models
Model Complexity Analysis. We evaluate the model complexity mainly in several perspectives. First, we compare the number of parameters among VGG models. From Table 10, it can be seen that the two series model BD and model FH both have fewer trainable parameters than their baseline models. Compared to baseline model A, model BD series cut trainable parameters from 9.8M to 5.4M. To achieve that, we simply reduce the number of filters in some convolutional layers. It is obvious that models with largescale maxpooling method or MAP method show strong performance even when the trainable parameters are fewer. Second, each of the three models such as model BD has the same number of trainable parameters and depth. MAP method is better than largescale maxpooling method under the same condition.
All of our test models are simply composed of only three kinds of different layers: convolutional layers, pooling layers, and fully connected layers without adding any additional layers like BN [23] and LRN [11]. In addition, all of our searched networks are not that deep. It saves a lot of memory during training and it is easy to implement.
Compact Model Classification. Based on models with same number of trainable parameters and depth, we have proven our proposed method more effective. For further research, we keep reducing the number of layers. Based on model A, we drop two more convolutional layers and name the model series VGG9 which is shown in Table 9. By comparing the classification results of model A, A0, and B on CIFAR10 datasets (Figure 5), we conclude that the test error rate of model B0 that has MAP layers achieves 8.59%, even lower than the result of model A which has more layers and trainable parameters.


Compared to model A, model B0 cuts down nearly 70% of the trainable parameters from 9.8M to 2.9M. However, 2.9M is still a large number. Therefore, more compact models with smaller amount of parameters are proposed in Table 10. Inspired by bottleneck structure [15], we choose to use 1×1 filters instead of 3×3 on the first convolutional layer locating after MAP layer in model C0. Compared to model B0, model C0 achieves better result (Figure 5) with more than 50% decline of trainable parameter (Table 10). Based on this encouraging result, we continue reducing one fully connected layer and achieve close result to model C0. Both model C0 and D0 perform better than baseline model A with only 12% of the trainable parameters.
However, using the MAP method without increasing the number of parameters and layers cannot avoid increasing the computational complexity during training process, because more filters are forced to slide on bigger features maps.
5. Discussion
5.1. MAP Method Analysis
Sparsity Analysis. In CNNs, smaller receptive field size is proved more effective in most cases. Researches also show that piling up more convolutional layers extract features better. If basic blocks like “conv3ReLU” are put together, the networks will be able to do multilevel template matching where negative values are set to zero by ReLU. Output feature maps are very sparse and activation left is quite important.
Figure 6 shows the output feature maps in different convolutional layers. Model AC adopts different pooling size and the arrangement of the convolutional layers is different either. From left to right, the feature maps become more and more sparse. Theoretically, averagepooling is not suitable in sparse feature maps because useful activation will be heavily diluted. From top to bottom, we find that activation distributes densely in small local parts. So simply using maxpooling method may cause the loss of some good activation. Our MAP method finds a way to balance feature lost and feature dilution, which performs very well.
Visualization with Deconvolution. Deconvolution can be thought as a model that uses the same components (filtering, pooling) but in reverse. Here, deconvolution is used on our pretrained CNN model C and we use this method to map multiactivation back to the input pixel space, showing what input pattern eventually cause given activation in the feature maps. [36].
In this paper, we use basic structure (Figure 3 right) in our CNN models to thinning the feature maps. Every feature map in different convolutional layers can be regarded as a kind of representation of the original image. In other words, every feature map reflects characteristics of an original image. Through deconvolution, the reflection maps are visualized. In Figure 7, the deconvolution results on CIFAR10 are shown. We use pretrain model C (Table 1), which contains an 8×8 pooling layers. There are 5 convolutional layers piling up before the large kernelsizepooling layers. The channel numbers of each layer are 64, 128, 128, 128, and 256. As is shown in Figure 7, features maps become more and more sparse from the first convolutional layer to the fifth convolutional layer. Sparse feature maps often reflect the remarkable local characteristics of an image. In this way, our models can focus on different local parts specifically owned by different class of images, which represent raw images better.
5.2. Model Structure Analysis
On plain networks and networks with shortcuts structure, MAP layers are always combined with plain structure shown in Figure 3. Through the analysis of feature maps sparsity and deconvolution, we believe that the plain structure can provide good activation and MAP method can make good use of this activation. In addition, the size of feature maps after 8×8 MAP layer becomes small (4×4 on CIFAR10; 28×28 on ImageNet). We think that each activated feature (positive value in feature maps) on small feature maps is relatively independent. Therefore, trying to find the relationship between activated features in a 3×3 area may be unnecessary. Through experiment, we prove that 1×1 convolution can better represent the features picking out from MAP layer. At the same time, the trainable parameters are reduced a lot.
Since the MAP method can provide such good activation, we find a good way to take better advantage of this activation. Through mixing the plain structure and shortcuts structure, the classification results improve again! In Figure 8, the shortcuts structure is used to sum the output from MAP layer and the output from the 1×1 convolutional layer. We apply this structure to model C0 (Table 9) and name the new model “C0+shortcuts”. All the parameters are keeping the same. The lowest error rate decreases from 8.16% to 8.07%. Noticeably, in Figure 5, the error rate curve of the new model lies beneath all the other curves, indicating that the overall capability of the new model is superior when fully trained.
5.3. Comparison between MAP and Mixing Pooling
Mixing pooling method [44] mainly contributes to doing a tradeoff between maxpooling and averagepooling. Both our method and theirs are based on a hypothesis that only using maxpooling method or averagepooling method cannot produce optimal results. However, MAP and mix pooling are quite different in the following aspects.
First, the novelty of mixing pooling method is bringing learning into the pooling operation. Two main expressions are
where denotes the values in the region being pooled; is a scalar specifying a certain combination of max and average; denotes the values in a “gating mask”. is a sigmoid function. and are trainable.
Obviously, mixing pooling method combines maxpooling and averagepooling and constrains them with trainable factors. However, our MAP method only chooses topk activation in a pooling region, which means most of the features are ignored.
Second, our MAP method is working on large pooling kernel size and takes good use of the sparsity of feature maps, while mixing pooling method does not take feature maps pattern and pooling kernel size into consideration.
Third, MAP method does not need more layers and trainable parameters. It can even achieve higher performance with fewer trainable parameters. On the contrary, mixing pooling method introduces trainable parameters into pooling layers to help improve model capacity.
5.4. Hyperparameter Analysis
In Table 2, the optimal combinations of and under different pooling size are listed. Instead of arbitrarily choosing these two hyperparameters, we picked the best combination of and through a great amount of experiments. The CIFAR10 classification results with different hyperparameters based on VGG11 are shown in Table 11. The first row of Table 11 corresponds to the maxpooling result and the fourth row corresponds to the averagepooling result. On large sparse feature maps, only choosing one or two features ignores quite big number of active features. On the contrary, choosing too much features adds a lot of useless closed features, which leads to feature dilution.

6. Conclusion
In this paper, we propose a new pooling method, which refers to Multiactivation Pooling (MAP) method. It provides a new subsampling idea to better pick features. We apply our method on convolutional neural networks. In our experiment, MAP method can highly improve the classification accuracy in CNNs. Actually, the plain structure “convReLUMAP” is a better featureextractor in CNNs. Based on this structure, we also prove the feasibility of applying MAP method on DenseNets. Moreover, we carry out additional experiments such as sparsity analysis and deconvolution to illustrate that the plain structure can provide “good” activation.
We hope to study further and try to explore the potential of the basic layers. MAP method still has huge space for development. It may perform better by carefully finetuning hyperparameters, and . In the future, we plan to study further on MAP method and test it on different datasets and different networks.
Data Availability
All the datasets in this paper are available on the Internet. Previously reported CIFAR (CIFAR10 and CIFAR100) data were used to support this study and are available at http://www.cs.toronto.edu/~kriz/cifar.html. Previously reported SVHN data were used to support this study and are available at http://ufldl.stanford.edu/housenumbers/. The ImageNet (ILSVRC2012) data used to support the findings of this study were supplied by http://imagenet.org/downloadimages under license and so cannot be made freely available. Signing up at http://imagenet.org/downloadimages should be made for data availability.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
All the authors are supported by the National Natural Science Foundation of China (Grant 61772052).
References
 S. Ren, K. He, R. Girshick, and J. Sun, “Faster RCNN: Towards RealTime Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017. View at: Publisher Site  Google Scholar
 L. Zhao and K. Jia, “Multiscale CNNs for Brain Tumor Segmentation and Diagnosis,” Computational and Mathematical Methods in Medicine, vol. 2016, Article ID 8356294, 2016. View at: Publisher Site  Google Scholar
 Y. Xu, G. Yu, Y. Wang, X. Wu, and Y. Ma, “Car detection from lowaltitude UAV imagery with the faster RCNN,” Journal of Advanced Transportation, vol. 2017, Article ID 2823617, 11 pages, 2017. View at: Publisher Site  Google Scholar
 J. Shotton, J. Winn, C. Rother, and A. Criminisi, “TextonBoost for image understanding: multiclass object recognition and segmentation by jointly modeling texture, layout, and context,” International Journal of Computer Vision, vol. 81, no. 1, pp. 2–23, 2009. View at: Publisher Site  Google Scholar
 P. Sermanet, D. Eigen, X. Zhang et al., Overfeat: Integrated recognition, localization and detection using convolutional networks, 2013, arXiv:1312.6229.
 J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, realtime object detection,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 779–788, July 2016. View at: Google Scholar
 T. N. Sainath, B. Kingsbury, G. Saon et al., “Deep Convolutional Neural Networks for Largescale Speech Tasks,” Neural Networks, vol. 64, pp. 39–48, 2015. View at: Publisher Site  Google Scholar
 E. Shelhamer, J. Long, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017. View at: Publisher Site  Google Scholar
 E. Ribeiro, A. Uhl, G. Wimmer, and M. Häfner, “Exploring deep learning and transfer learning for colonic polyp classification,” Computational and Mathematical Methods in Medicine, vol. 2016, Article ID 6584725, 16 pages, 2016. View at: Publisher Site  Google Scholar
 Z. Lv, H. Song, P. BasantaVal, A. Steed, and M. Jo, “NextGeneration Big Data Analytics: State of the Art, Challenges, and Future Research Topics,” IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1891–1899, 2017. View at: Publisher Site  Google Scholar
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012. View at: Google Scholar
 Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998. View at: Publisher Site  Google Scholar
 K. Simonyan and A. Zisserman, Very deep convolutional networks for largescale image recognition, 2014, https://arxiv.org/abs/1409.1556.
 C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 1–9, Boston, Mass, USA, June 2015. View at: Publisher Site  Google Scholar
 G. Huang, Z. Liu, L. v. Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, Honolulu, HI, July 2017. View at: Publisher Site  Google Scholar
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778, July 2016. View at: Google Scholar
 K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, pp. 630–645, 2016. View at: Publisher Site  Google Scholar
 R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proceedings of the 29th Annual Conference on Neural Information Processing Systems, NIPS 2015, pp. 2377–2385, can, December 2015. View at: Google Scholar
 G. Larsson, M. Maire, and G. Shakhnarovich, Fractalnet: Ultradeep neural networks without residuals, 2016, https://arxiv.org/abs/1605.07648.
 Y. Yang, H. Li, Y. Han, and H. Gu, “High resolution remote sensing image segmentation based on graph theory and fractal net evolution approach,” in Proceedings of the 2015 ISPRS International Workshop on Image and Data Fusion, IWIDF 2015, pp. 197–201, usa, July 2015. View at: Publisher Site  Google Scholar
 H. Lei, T. Han, F. Zhou et al., “A deeply supervised residual network for HEp2 cell classification via crossmodal transfer learning,” Pattern Recognition, vol. 79, pp. 290–302, 2018. View at: Publisher Site  Google Scholar
 Y. LeCun, B. Boser, J. S. Denker et al., “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989. View at: Publisher Site  Google Scholar
 S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML '15), pp. 448–456, July 2015. View at: Google Scholar
 J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for simplicity: The all convolutional net, 2014.
 N. Cohen, O. Sharir, and A. Shashua, “Deep SimNets,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 4782–4791, usa, July 2016. View at: Google Scholar
 F. N. Iandola, M. W. Moskewicz, K. Ashraf et al., Squeezenet: Alexnetlevel accuracy with 50x fewer parameters and< 0.5 MB model size, 2016, arXiv:1602.07360.
 A. G. Howard, M. Zhu, B. Chen et al., Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017, arXiv:1704.04861.
 X. Zhang, X. Zhou, M. Lin, and J. Sun, ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, 2017, arXiv:1707.01083.
 Z. Li, X. Wang, X. Lv, and T. Yang, SEPNets: Small and Effective Pattern Networks, 2017, 1706.03912.
 V. Nair and G. E. Hinton, “Rectified linear units improve Restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML '10), pp. 807–814, Haifa, Israel, June 2010. View at: Google Scholar
 A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009. View at: Google Scholar
 D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex,” The Journal of Physiology, vol. 160, pp. 106–154, 1962. View at: Publisher Site  Google Scholar
 K. Fukushima and S. Miyake, “Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position,” Pattern Recognition, vol. 15, no. 6, pp. 455–469, 1982. View at: Publisher Site  Google Scholar
 D. G. Lowe, “Distinctive image features from scaleinvariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. View at: Publisher Site  Google Scholar
 N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 1, pp. 886–893, June 2005. View at: Publisher Site  Google Scholar
 M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I, vol. 8689 of Lecture Notes in Computer Science, pp. 818–833, Springer, 2014. View at: Publisher Site  Google Scholar
 Y.L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning midlevel features for recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 2559–2566, San Francisco, Calif, USA, June 2010. View at: Publisher Site  Google Scholar
 J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1794–1801, June 2009. View at: Publisher Site  Google Scholar
 M. D. Zeiler and R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks, 2013, arXiv:1301.3557.
 N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014. View at: Google Scholar  MathSciNet
 H. Wu and X. Gu, “Towards dropout training for convolutional neural networks,” Neural Networks, vol. 71, pp. 1–10, 2015. View at: Publisher Site  Google Scholar
 K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015. View at: Publisher Site  Google Scholar
 A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value iteration networks,” in Proceedings of the 30th Annual Conference on Neural Information Processing Systems, NIPS 2016, pp. 2154–2162, esp, December 2016. View at: Google Scholar
 C. Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in Artificial Intelligence and Statistics, pp. 464–472, 2016. View at: Google Scholar
 M. Lin, Q. Chen, and S. Yan, Network in network, 2013, arXiv:1312.4400.
 M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring midlevel image representations using convolutional neural networks,” in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pp. 1717–1724, IEEE, Columbus, Ohio, USA, June 2014. View at: Publisher Site  Google Scholar
 J. Deng, W. Dong, and R. Socher, “ImageNet: a largescale hierarchical image database,” in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255, Miami, Fla, USA, June 2009. View at: Publisher Site  Google Scholar
 O. Russakovsky, J. Deng, H. Su et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015. View at: Publisher Site  Google Scholar  MathSciNet
Copyright
Copyright © 2018 Qi Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.