Crowd management is critical to preventing stampedes and directing crowds, especially in India and China, each of which has a population of more than one billion. As populations continue to grow, crowded events arising from rallies, parades, tourism, and other causes occur regularly. Crowd count estimation is the linchpin of a crowd management system and has become an increasingly important and challenging research direction. This work proposes an optimized encoder-decoder architecture with the squeeze-and-excitation block for crowd counting, called SENetCount, which includes SE-ResNetCount and SE-ResNeXtCount. The deeper and stronger backbone network improves the quality of feature representations. The squeeze-and-excitation block uses global information to selectively emphasize informative feature representations and suppress less useful ones. The encoder-decoder architecture with the dense atrous spatial pyramid pooling module recovers spatial information and captures contextual information at multiple scales. The modified loss function adds a local consistency measure to the conventional Euclidean loss. Experiments on challenging datasets show that our approach is competitive with existing approaches, and analyses show that our architecture is extensible and robust.

1. Introduction

Crowd counting is the task of accurately estimating the number of people in congested scenes [1]. It plays a crucial role in public security and crowd control at sites such as crossroads, stadiums, and scenic spots. With continuing population growth and urbanization, crowd counting is a meaningful yet challenging task. Crowd counting techniques have also been applied to other object-counting problems, such as traffic monitoring [2] and yield estimation [3].

Numerous approaches have been proposed to tackle the diverse challenges of crowd counting, such as scale variation, perspective distortion, heavy occlusion, illumination variation, background complexity in the wild, and the nonuniform distribution of the crowd itself. These approaches can be grouped into detection-based, regression-based, and density map estimation-based methods. The earliest crowd counting studies usually relied on detection-based methods [4], applying a head or person detector via a sliding window over a frame. Existing object detectors were adopted directly and can achieve good detection accuracy in sparse scenes. Unfortunately, these methods give unsatisfactory results under heavy occlusion and complex backgrounds in extremely dense settings. Some studies introduced regression-based methods [5] to overcome these problems by mapping extracted features to the crowd count. They usually first extract global or local features and then use regression techniques such as linear regression and Gaussian mixture regression [6] to learn the mapping to the count. These methods cope well with occlusion and background clutter, but they ignore spatial information. Other studies developed density map estimation-based methods that learn a mapping between image features and the corresponding density maps. However, conventional handcrafted features are not capable of producing density maps of sufficient quality for accurate counting. More recently, researchers have improved density map estimation by exploiting the powerful feature representations of convolutional neural networks (CNNs) [1, 7, 8].

Building on the strong feature representation capability of CNNs, this work targets density map estimation. DeepLabV3+, an encoder-decoder architecture, is widely used in pixel-wise prediction tasks, primarily semantic segmentation. The squeeze-and-excitation (SE) block adaptively recalibrates channel-wise feature responses by explicitly modeling the dependencies between channels, and it brings considerable improvements to existing CNNs at a negligible extra computational cost. We build an optimized encoder-decoder architecture with the squeeze-and-excitation block to produce better density map estimates. To handle the continuous scale variation caused by camera perspective, we propose the dense atrous spatial pyramid pooling (DASPP) module, which contains three atrous convolutional layers with atrous rates increasing from 1 to 3. In addition, we combine the structural similarity index or the multiscale structural similarity index with the Euclidean loss to remedy some of its disadvantages and to obtain better density map estimation.

To summarize, our contributions are as follows:
(1) We propose a novel encoder-decoder architecture and choose the residual networks SE-ResNet and SE-ResNeXt as the backbone network to produce better density map estimates.
(2) We design the DASPP module to overcome gridding artifacts and capture continuous contextual information in dense crowd scenes.
(3) Combining consistency loss and Euclidean loss, we design weighted loss functions to improve the density map estimation capability of the models.

The remainder of this paper is organized as follows. Section 2 reviews related research, and Section 3 describes the methodology in detail. Section 4 presents the experimental details and result analysis on mainstream open datasets, and Section 5 concludes the paper.

2. Related Work

This section divides our review of related research into three groups: crowd counting based on CNNs, deeper networks, and encoder-decoder architectures.

2.1. Crowd Counting Based on CNNs

Crowd counting is an essential and challenging task for understanding congested scenes, estimating crowd flows, and preventing overcrowding accidents. Given the specific emphases and various obstacles of crowd counting, researchers have progressively developed more sophisticated approaches to the problem. Based on the network architecture of the convolutional neural networks (CNNs), we subdivide crowd counting methods into three categories: multicolumn-based methods, single-column-based methods, and detection-based methods, as shown in Table 1.

2.1.1. Multicolumn-Based Methods

Such model architectures commonly use several columns with different receptive fields to capture multiscale information, which brings notable gains in crowd counting.

Zhang et al. proposed a simple but effective multicolumn convolutional neural network (MCNN) [9] in 2016. The MCNN uses a multicolumn design whose columns have filters of different kernel sizes (small, medium, and large), and hence different receptive fields, to capture the features of people at the corresponding scales. Each column adapts to scale changes, and a convolution layer instead of a fully connected layer merges the feature maps into the density map estimate, so that the input image can be of any size, which avoids distortion. Ranjan et al. proposed the iterative counting convolutional neural network (ic-CNN) [10] in 2018. The ic-CNN has two CNN branches for producing high-resolution density maps: the first branch generates a low-resolution density map, and the second incorporates the low-resolution prediction to generate a high-resolution density map. Hossain et al. presented the scale-aware attention network (SAAN) for crowd counting [11] in 2019. The SAAN uses the attention mechanism to select the appropriate scales at both global and local levels. The model consists of three subnetworks and a fusion network: the multiscale feature extractor (MFE) extracts feature maps at three different scales, and the global scale attention (GSA) and local scale attention (LSA) subnetworks produce three global and three pixel-wise attention maps, respectively. The fusion network integrates the attention-weighted features from the GSA and LSA outputs to predict the density map.

Although these multicolumn networks have achieved great progress, they still suffer from several disadvantages. They are difficult to train because of their longer training time and bloated architecture. Using different branches with nearly identical network structures inevitably causes information redundancy and degrades the quality of the generated density maps.

2.1.2. Single-Column-Based Methods

Single-column model architectures generally use a single, deeper convolutional neural network rather than the bloated architecture of multicolumn models.

Xu et al. proposed a simple yet effective method [12] named the scale preserving network (SPN) in 2019. The SPN generates an initial density map from stacked features and significantly alleviates the density pattern shift issue that results from the large density variation between sparse and dense regions. Zhou et al. proposed a locality-aware approach (LA-Batch) to tackle the unbalanced data distribution problem in crowd counting [13] in 2022. The approach comprises locality-aware data partition (LADP) and locality-aware data augmentation (LADA). LADP constructs more balanced data batches by grouping the training data into different bins via locality-sensitive hashing. LADA adaptively augments the image patches to further reduce training bias and enhance collaboration. Li et al. proposed a multiscale aggregation network for crowd counting [14] in 2021, named MANet. The method includes a feature extraction encoder (FEE) module and a density map decoder (DMD) module. The FEE extracts multiscale features through a cascaded scale pyramid network and obtains contextual features through dense connections. The DMD generates feature information via deconvolution and fusion operations.

Single-column model architectures meet the requirements of more complicated situations in crowd counting. Owing to their training efficiency and architectural simplicity, single-column architectures have received increasing attention in recent years.

2.1.3. Detection-Based Methods

Although CNN-based density map estimation is the predominant paradigm in crowd counting and delivers accurate counts, certain serious limitations remain. Ideally, the model would provide the actual location of each person in the scene with a bounding box. With such bounding boxes, researchers could build applications such as person recognition and tracking [17]. Detection-based methods have therefore come back into view.

Sam et al. introduced a dense detection algorithm for crowd counting with three functional parts [15] in 2021, named LSC-CNN. Experiments show that LSC-CNN counts crowds better than existing regression methods and also provides remarkable localization through detection. Xu et al. leveraged a feature pyramid network (FPN) model to unify regression-based and detection-based methods with the learning to scale module (L2SM) [16] in 2022, termed AutoScale. AutoScale achieves very competitive performance on two sparse datasets and demonstrates noticeable transferability under cross-dataset validation.

These models achieve better counting performance and superior localization than previous regression methods. Switching from regression to dense detection also supports downstream tasks such as person recognition and tracking.

2.2. Deeper Networks

Deep convolutional neural networks (DCNNs) such as VGGNet and GoogLeNet [18, 19] significantly increase the capability of feature representation and have led to a series of breakthroughs in visual recognition tasks. Network depth (the number of successively stacked layers) has played a crucial role in these breakthroughs. However, building deeper networks is not as simple as stacking more layers because of the notorious vanishing or exploding gradient problem. Although normalized initialization and intermediate normalization layers largely address this problem and enable networks with tens of layers to start converging, the optimization of deeper networks has proven considerably more difficult. As network depth grows, accuracy saturates and then degrades rapidly.

Srivastava et al. proposed the Highway Network architecture [20]. Highway Networks exploit deep CNNs to learn enhanced feature representations and introduce a new cross-layer connectivity mechanism that uses gating units to regulate the flow of information without considerably increasing the total network size. The Highway Network architecture opened up a new way of studying extremely deep and efficient architectures. He et al. introduced a deep residual learning framework with short-cut connections [21], named ResNet, to ease the degradation problem as network depth increases without adding extra parameters or computational complexity. These deep residual networks are easy to optimize and gain accuracy from significantly increased depth, achieving substantially better results than earlier structures. He et al. also analyzed the propagation formulations behind the residual building blocks [22]. They showed that the forward and backward signals can be propagated directly from one block to any other block when identity mappings are used as the short-cut connections and as the after-addition activation. Extensive ablation experiments and derivations support the claim that these identity mappings are fundamental for smooth information propagation. They also designed 1000-layer deep networks that can be trained without difficulty and improve accuracy. Xie et al. presented ResNeXt, a simple and highly modularized network architecture that adopts the layer-repeating strategy of VGGNet and ResNet while exploiting the split-transform-merge strategy of Inception models [23]. In this strategy, the input is split into several lower-dimensional embeddings by convolutions, transformed by a set of specially designed filters, and merged by concatenation. The split-transform-merge strategy is expected to approach the representational power of large and dense layers at a considerably lower computational cost. Hu et al. focused instead on the channel relationship. They proposed an architectural unit [24] named the squeeze-and-excitation (SE) block. The SE block adaptively recalibrates channel-wise feature responses by explicitly modeling the dependencies between channels, and it brings significant improvements to existing CNNs at a small additional computational cost.

2.3. Encoder-Decoder Architectures

Encoder-decoder architectures are widely used in pixel-wise prediction tasks, primarily semantic segmentation, to exploit the feature representations of fully convolutional networks (FCNs) and produce high-resolution output. As a pixel-wise task, density map estimation-based crowd counting also needs an encoder-decoder architecture to produce a high-quality density map of the same size as the input. Atrous convolutions have been popular in this field, and many recent publications report using this technique.

Chen et al. proposed a series of algorithms named the DeepLab family from 2014 to 2018. DeepLabV1 uses fully connected conditional random fields (CRFs) to mitigate the poor localization caused by the spatial invariance of DCNNs [25]. CRFs have been widely used in semantic segmentation to combine class scores computed by multiway classifiers with the low-level information from local pixels and edges. Combining ideas from DCNNs and CRFs, DeepLabV1 produces accurate semantic predictions and detailed segmentation maps while remaining computationally efficient. DeepLabV2 addresses semantic image segmentation with deep learning [26]. Its atrous spatial pyramid pooling (ASPP) module probes an incoming convolutional feature layer with filters at multiple sampling rates to capture objects and image context at multiple scales. To handle the problem of segmenting objects at multiple scales, DeepLabV3 employs atrous convolution in parallel or in cascade to capture multiscale context by adopting multiple atrous rates [27]. DeepLabV3 also elaborates on implementation details and shares training experience, including a simple yet effective bootstrapping method, and discusses how atrous convolution with very large rates fails to capture long-range information because of image boundary effects. DeepLabV3+ extends DeepLabV3 with an encoder-decoder framework [28], in which the encoder module provides rich semantic information and a simple yet effective decoder module recovers precise object boundaries. The encoder module extracts features at an arbitrary resolution by applying atrous convolution, which allows a trade-off between precision and runtime and results in a more robust encoder-decoder network.

Building on previous research, we propose an end-to-end FCN architecture, SENetCount, for crowd counting. This work chooses SE-ResNet or SE-ResNeXt as the backbone network and adopts the optimized encoder-decoder with dense atrous spatial pyramid pooling and the squeeze-and-excitation block.

3. Methodology

This section describes the proposed SENetCount approach, including its overall architecture, objective loss function, and ground truth generation.

3.1. SENetCount Architecture

As shown in Figure 1, SENetCount is an encoder-decoder architecture to estimate density maps accurately. SENetCount comprises the SE-ResNet or SE-ResNeXt backbone network as feature extractor, the dense atrous spatial pyramid pooling (DASPP) module for capturing dense scale diversity, and the fused feature map (FFM) for fusing features.

3.1.1. SE-ResNet or SE-ResNeXt Backbone Network

We use SE-ResNet or SE-ResNeXt as the backbone network and tailor it to crowd counting. SE-ResNet and SE-ResNeXt additionally attend to the relationships between channels by incorporating the squeeze-and-excitation (SE) block, as shown in Figure 2. The SE block improves the quality of feature representations by explicitly modeling the dependencies between channels.

In this squeeze-and-excitation block, the transformation operator $F_{tr}$ uses convolution to map the input $X$ to the feature map $U$; the squeeze operator $F_{sq}$ uses global average pooling to aggregate global spatial information into a channel statistic $z$; the excitation operator $F_{ex}$ uses a dimensionality-reduction fully-connected layer with reduction ratio $r$, a ReLU, and then a dimensionality-increasing fully-connected layer to map the channel statistic $z$ to a set of channel weights $s$; and the scale operator $F_{scale}$ multiplies the feature map $U$ by the channel weights $s$ to carry out feature recalibration dynamically.
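The squeeze, excitation, and scale steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in this work: the function name `se_block`, the nested-list layout `[C][H][W]`, the omission of biases, and the toy weights are all assumptions, and the final sigmoid gate follows the original SE design of Hu et al. [24] rather than anything stated in the paragraph above.

```python
import math

def se_block(feature_map, w_reduce, w_expand):
    """Recalibrate a feature map given as nested lists [C][H][W].

    w_reduce: (C/r) x C weights of the dimensionality-reduction layer.
    w_expand: C x (C/r) weights of the dimensionality-increasing layer.
    Biases are omitted to keep the sketch short (an assumption).
    """
    # Squeeze: global average pooling gives one statistic per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation, part 1: reduce to C/r dimensions, then apply ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w_reduce]
    # Excitation, part 2: expand back to C dimensions, then apply a sigmoid gate.
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w_expand]
    # Scale: multiply every value of channel c by its weight s[c].
    scaled = [[[s[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature_map)]
    return s, scaled

# Toy input: C = 2 channels of size 2 x 2, reduction ratio r = 2 (hypothetical values).
features = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
weights, recalibrated = se_block(features, w_reduce=[[0.5, 0.5]], w_expand=[[1.0], [-1.0]])
```

In this toy case the first channel receives a weight close to 1 and the second a weight close to 0, which is the selective emphasis and suppression the block is meant to provide.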

The SE block uses global information to selectively emphasize informative features and suppress less useful ones. The SE block can also be integrated flexibly into the aforementioned residual networks, as in the SE-ResNet or SE-ResNeXt module shown in Figure 3.

The detailed description of ResNet50, SE-ResNet50, and SE-ResNeXt50/101 is given in Table 2. The fundamental experiments in Section 4.4.1 show that using only the first three bottlenecks performs better at a lower computational cost.

3.1.2. Dense Atrous Spatial Pyramid Pooling Module

Both the atrous spatial pyramid pooling module and the encoder-decoder architecture are used in deep convolutional neural networks for semantic segmentation. The former encodes multiscale contextual information by probing the incoming features with filter operations at multiple scales and pooling operations at various receptive fields, whereas the latter captures sharper object boundaries by progressively recovering spatial information. To combine the advantages of both, and inspired by the encoder-decoder architecture, we apply depthwise separable convolution in the decoder module and propose the dense atrous spatial pyramid pooling (DASPP) module, resulting in a stronger encoder-decoder architecture.

To cope with scale variation, we want an end-to-end FCN architecture to cover a large range of scales as densely as possible. The classic method uses stacked atrous convolution to build multiscale context and adopts multiple atrous rates to handle the difficulty of segmenting objects at various scales [28]. However, the large atrous rates in that method leave wide gaps, i.e., 6 pixels, between the sizes of the different receptive fields. This seemingly effective setting of atrous rates is unsuitable for crowd estimation. Because the scale variation of crowd scenes caused by camera perspective is virtually continuous, a more densely sampled range of scales is required. We therefore design the DASPP module with three atrous convolutions with atrous rates of 1, 2, and 3. As illustrated in Figure 4(a), with the large atrous rates of 6 and 12, the final pixel views the original information only in a gridding fashion and loses a large amount of pixel information. This gridding artifact is harmful to crowd estimation, which needs detailed features, because part of the information in the original feature maps is missing. As illustrated in Figure 4(b), with the atrous rates of 1, 2, and 3, the top layer covers all pixel information of the original feature map. This dense setting of atrous rates is critical for producing accurate density maps in crowd estimation because information from distant positions may be irrelevant.
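The gridding effect can be checked with a small one-dimensional calculation. The sketch below assumes stacked 3-tap atrous convolutions (a simplification of the two-dimensional 3 x 3 case) and enumerates which input offsets can reach a single output pixel; the helper names `covered_offsets` and `coverage_ratio` are hypothetical.

```python
def covered_offsets(rates, kernel_size=3):
    """Input offsets (1-D) that reach one output pixel after stacking atrous convolutions."""
    half = kernel_size // 2
    offsets = {0}
    for rate in rates:
        # Each layer adds a tap offset of -rate, 0, or +rate (for a 3-tap kernel).
        taps = [k * rate for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

def coverage_ratio(rates, kernel_size=3):
    """Fraction of positions inside the receptive field that are actually sampled."""
    offsets = covered_offsets(rates, kernel_size)
    span = offsets[-1] - offsets[0] + 1
    return len(offsets) / span

sparse = covered_offsets([6, 12])    # only multiples of 6: 7 of 37 positions
dense = covered_offsets([1, 2, 3])   # every offset from -6 to 6: 13 of 13 positions
```

With rates 6 and 12 the output pixel sees only every sixth position of its receptive field, whereas rates 1, 2, and 3 leave no holes, which matches the contrast between the two panels of Figure 4.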

3.1.3. Fused Feature Map Module

The low-resolution feature maps output by the encoder architecture are upsampled by bilinear interpolation and then concatenated with the high-resolution feature maps output by the DASPP module in the decoder architecture. The fused feature map (FFM) module then recovers high-resolution feature maps through a convolution unit to produce the final refined density map.

3.2. Objective Loss Function

Designing a tailored objective function is a fundamental step in training effective models. As a regression task, crowd estimation usually employs the Euclidean distance as the objective function to measure the discrepancy between the estimated density map and the ground truth. Although universally used, the Euclidean loss has weaknesses for density maps, such as sensitivity to outliers and a lack of local coherence [1]. To overcome these issues, we combine the structural similarity (SSIM) index [29] or the multiscale structural similarity (MS-SSIM) index [30] with the Euclidean loss. The SSIM or MS-SSIM index measures local pattern consistency in terms of luminance, contrast, and structure, which is helpful for density map estimation. Designing appropriate objective loss functions is of great benefit to the capability of the models.

3.2.1. Euclidean Loss

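The Euclidean loss measures the pixel-wise mean squared error, as in most previous studies [5–7, 31], and is defined as follows:

$$L_E = \frac{1}{N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - D_i \right\|_2^2, \tag{1}$$

where $N$ is the image number, $F(X_i; \Theta)$ is the generated density map for image $X_i$ with parameter $\Theta$, and $D_i$ is the ground truth for image $X_i$.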
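As a minimal sketch of this loss, the function below sums the squared pixel differences of each image and averages over the images. The name `euclidean_loss` and the nested-list representation are assumptions, and normalization conventions vary between implementations (some divide by twice the image number or by the number of pixels), so the constant here should be read as one possible choice.

```python
def euclidean_loss(predicted_maps, ground_truth_maps):
    """Average over images of the summed squared pixel differences.

    Each map is a list of rows; both arguments hold one map per image.
    """
    total = 0.0
    for pred, truth in zip(predicted_maps, ground_truth_maps):
        total += sum((p - t) ** 2
                     for pred_row, truth_row in zip(pred, truth)
                     for p, t in zip(pred_row, truth_row))
    return total / len(predicted_maps)

# One 2 x 2 image: the prediction spreads one person over two pixels,
# while the ground truth places the person in a single pixel.
predicted = [[[0.5, 0.0], [0.0, 0.5]]]
ground_truth = [[[1.0, 0.0], [0.0, 0.0]]]
loss = euclidean_loss(predicted, ground_truth)  # 0.25 + 0.25 = 0.5
```

Both maps in this example sum to one person, yet the loss is nonzero, which illustrates that the Euclidean loss penalizes where the density is placed and not only the total count.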

3.2.2. Consistency Loss

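The consistency loss measures the local pattern consistency from three aspects, i.e., luminance, contrast, and structure, and is defined as follows:

$$L_C = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathrm{SSIM}\left(F(X_i; \Theta), D_i\right), \tag{2}$$

where $N$ is the image number, and $F(X_i; \Theta)$ and $D_i$ denote the generated density map and the ground truth for image $X_i$. The structural similarity (SSIM) index [29] is the local structural similarity between generated density maps and ground truths and is defined as follows:

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma}. \tag{3}$$

Here, the luminance function, contrast function, and structure comparison function are defined as $l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$, $c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$, and $s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$, respectively; $\alpha$, $\beta$, and $\gamma$ adjust the relative importance of the three comparisons.

In addition, $\mu_x$ and $\mu_y$ are the local means, $\sigma_x^2$ and $\sigma_y^2$ are the local variance estimates, $\sigma_{xy}$ is the local covariance estimate, and $C_1$, $C_2$, and $C_3$ are small positive constants that avoid division by zero. For simplicity, $\alpha = \beta = \gamma = 1$ and $C_3 = C_2 / 2$ are set in this paper, so the local structural similarity reduces to

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}. \tag{4}$$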
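The simplified structural similarity can be computed directly from the local means, variances, and covariance of two patches. The sketch below assumes the common simplification from the original SSIM formulation [29] (all three exponents equal to 1 and the third constant equal to half the second), treats each patch as a flat list of values, and turns the index into a loss as one minus the mean similarity; the constants `c1` and `c2`, the function names, and the loss form are illustrative assumptions rather than values taken from this paper.

```python
def ssim_patch(x, y, c1=1e-4, c2=9e-4):
    """Simplified SSIM between two equally sized patches given as flat lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Local variance and covariance estimates over the patch.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def consistency_loss(pred_patches, truth_patches):
    """One minus the mean SSIM over paired patches (assumed loss form)."""
    scores = [ssim_patch(p, t) for p, t in zip(pred_patches, truth_patches)]
    return 1.0 - sum(scores) / len(scores)

patch = [0.0, 0.1, 0.2, 0.3]
reversed_patch = [0.3, 0.2, 0.1, 0.0]
same = ssim_patch(patch, patch)             # identical patches score 1 (up to rounding)
different = ssim_patch(patch, reversed_patch)
```

The two patches in the example have the same mean and variance, so a pixel-count or intensity statistic alone cannot separate them, while the covariance term in the similarity drops sharply when the local pattern is reversed.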

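In practice, the subjective evaluation of a given image varies with the sampling density of the image, the distance from the image to the observer, the perception of the observer, and other factors. The single-scale SSIM may therefore be appropriate only for specific settings. A multiscale approach is a convenient way to incorporate image details at various resolutions. The multiscale structural similarity (MS-SSIM) index [30] combines the three aspects at multiple scales and is defined as follows:

$$\mathrm{MS\text{-}SSIM}(x, y) = [l_M(x, y)]^{\alpha_M} \cdot \prod_{j=1}^{M} [c_j(x, y)]^{\beta_j} [s_j(x, y)]^{\gamma_j}, \tag{5}$$

where $M$ is the number of scales. Similar to (3), $\alpha_M$, $\beta_j$, and $\gamma_j$ adjust the relative importance of the respective components and are given nonzero values. For simplicity, they are set to fixed values in this paper. So, the consistency loss becomes

$$L_C = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathrm{MS\text{-}SSIM}\left(F(X_i; \Theta), D_i\right), \tag{6}$$

with the image number $N$, the generated density map $F(X_i; \Theta)$, and the ground truth $D_i$ as in the Euclidean loss.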

3.2.3. Final Objective Loss

The final objective function combines the Euclidean loss and the consistency loss and is defined as follows:

$$L = L_E + \lambda L_C, \tag{7}$$

where $\lambda$ is the weight that balances the pixel-wise Euclidean loss $L_E$ and the local pattern consistency loss $L_C$. In experiments, $\lambda$ is empirically set to 0.1 or 0.01.

3.3. Ground Truth Generation

As the cornerstone of density map estimation-based methods, the generation of high-fidelity ground truth density maps is crucial to data preparation. Zhang et al. observed that head size is related to the distance between two neighboring people [7]. Their geometry-adaptive kernel-based density map generation method has inspired numerous studies to adopt this technique for preparing training data.

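The ground truth density maps are generated by blurring head annotations with the geometry-adaptive Gaussian kernel, which is defined as follows:

$$D(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i, \tag{8}$$

where $N$ is the count of head annotations in the image and $\beta$ is a scaling coefficient. For each object $x_i$ in the ground truth $\delta$, $\bar{d}_i$ indicates the average distance to its $k$ nearest neighbors. To generate the density map, we convolve $\delta(x - x_i)$ with a geometry-adaptive Gaussian kernel with the parameter $\sigma_i$.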
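The generation procedure can be sketched as follows: each head receives a Gaussian whose width is proportional to the average distance to its nearest neighbors, and the Gaussians are summed. The function name, the defaults `beta=0.3` and `k=3` (values commonly quoted for the geometry-adaptive kernel, not confirmed by this paper), the `min_sigma` floor, and the per-head normalization on the image grid are all assumptions made for illustration.

```python
import math

def adaptive_density_map(heads, height, width, beta=0.3, k=3, min_sigma=0.5):
    """Build a density map from (x, y) head annotations lying inside the image.

    beta, k, and min_sigma are illustrative assumptions, not values from this paper.
    """
    density = [[0.0] * width for _ in range(height)]
    for i, (hx, hy) in enumerate(heads):
        # Average distance from this head to its k nearest neighbours.
        dists = sorted(math.hypot(hx - ox, hy - oy)
                       for j, (ox, oy) in enumerate(heads) if j != i)[:k]
        mean_dist = sum(dists) / len(dists) if dists else 0.0
        # Kernel width grows with neighbour distance; the floor keeps it positive.
        sigma = max(beta * mean_dist, min_sigma)
        weights = [[math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2.0 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        total = sum(sum(row) for row in weights)
        # Normalise on the grid so that each head contributes exactly one person.
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density

heads = [(4, 4), (6, 4), (12, 10)]
density = adaptive_density_map(heads, height=16, width=16)
count = sum(sum(row) for row in density)  # integrates to the number of heads
```

Because every kernel is normalized, summing the density map recovers the number of annotated heads, which is the property that lets a network trained on such maps be used for counting.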

4. Experiments

This section first summarizes the implementation details and introduces the evaluated datasets and evaluation metrics. Second, the fundamental experiments verify the importance of the pretraining strategy and the backbone network selection. Then, the ablation experiments demonstrate the improvements brought by the different modules of our method. Finally, the comparison and transfer experiments evaluate the performance of our SENetCount models against previous crowd counting approaches. The experiments on challenging datasets show that our method is competitive with existing approaches, and analyses show that our architecture is extensible and robust.

4.1. Implementation Details

The experiments are conducted on an NVIDIA RTX 3090 using the PyTorch framework. An efficient open-source crowd counting code framework built on PyTorch [32] facilitates our experimental comparison. We fix the initial learning rate and the weight decay rate and train for 300 iterations. We report the results of the proposed SENetCount on four mainstream datasets. ResNetCount50, SE-ResNetCount50, and SE-ResNeXtCount50/101 use ResNet, SE-ResNet, and SE-ResNeXt, respectively, as the backbone network and adopt the DASPP module and FFM module given in Table 2 and Figure 1. According to network architecture, the crowd counting methods compared in the experiments are divided into the multicolumn-based methods MCNN, ic-CNN, and SAAN [9–11], the single-column-based methods SPN, LA-Batch, and MANet [12–14], and the detection-based methods LSC-CNN and AutoScale [15, 16].

4.2. Evaluated Datasets

We present our results on four mainstream public datasets, namely, ShanghaiTech Part_A and Part_B [9], UCF_QNRF [33], and Mall [34]. Table 3 gives a statistical comparison of these datasets.

The ShanghaiTech dataset consists of 1,198 pictures with 330,165 annotated people [9]. It has been a challenging benchmark in recent years and comprises both low-density and high-density crowds. According to the density distribution, the dataset is divided into Part_A with 482 pictures and Part_B with 716 pictures. The pictures in Part_A are randomly selected from the Internet, whereas those in Part_B were taken on busy streets of metropolitan areas in Shanghai. Part_A is composed of high-density crowds of 33 to 3,139 people per picture, and Part_B is made up of low-density crowds of 9 to 578 people per picture. The scale change and perspective distortion in the dataset bring new challenges and opportunities for designing better architectures.

UCF_QNRF, released in 2018, is collected from web searches and covers a wider variety of scenes, consisting of 1,535 complex photos with about 1.25 million annotations [33]. The number of head annotations per image ranges from 49 to 12,865. The UCF_QNRF dataset poses the challenges of scale variation, perspective distortion, and uneven distribution. Moreover, some high-resolution photos may lead to GPU memory issues during model training.

Mall is an indoor crowd counting dataset gathered from a surveillance camera installed in a shopping mall [34]. The dataset records 2,000 sequential frames with a resolution of $640 \times 480$, containing 62,325 pedestrians. Mall covers diverse crowd densities and nonuniform walking speeds under more challenging illumination conditions. Additionally, the Mall dataset poses the challenges of perspective distortion, scale variation, and severe occlusion.

4.3. Evaluation Metrics

Following existing works, we evaluate our method with both the mean absolute error (MAE) and the root mean squared error (RMSE) [9–14]. The MAE indicates accuracy while the RMSE reflects robustness, and lower values mean better performance. MAE and RMSE are defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \tag{9}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2}, \tag{10}$$

where $N$ is the image number, $C_i$ represents the estimated count, and $C_i^{GT}$ represents the ground truth count.
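Both metrics operate on per-image counts and are straightforward to compute. The following sketch uses hypothetical function names and made-up counts purely to show the calculation.

```python
import math

def mae(estimated_counts, true_counts):
    """Mean absolute error between estimated and ground truth counts."""
    return sum(abs(e - t) for e, t in zip(estimated_counts, true_counts)) / len(true_counts)

def rmse(estimated_counts, true_counts):
    """Root mean squared error between estimated and ground truth counts."""
    squared = sum((e - t) ** 2 for e, t in zip(estimated_counts, true_counts))
    return math.sqrt(squared / len(true_counts))

# Hypothetical counts for three test images.
estimated = [100.0, 250.0, 40.0]
truth = [110.0, 240.0, 40.0]
mae_value = mae(estimated, truth)    # (10 + 10 + 0) / 3
rmse_value = rmse(estimated, truth)  # sqrt((100 + 100 + 0) / 3)
```

Because the errors are squared before averaging, the RMSE weights large per-image errors more heavily than the MAE does, which is why it is read as an indicator of robustness.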

4.4. Results and Analysis

The experiments include pretraining strategy experiments, backbone network selection experiments, ablation experiments, comparison experiments, and transfer experiments.

4.4.1. Fundamental Experiments

Pretraining has long been an effective strategy for learning the parameters of DCNNs, and bottleneck restructuring is a practical method for improving accuracy. As early as 2006, Hinton and Salakhutdinov pointed out that initial weights close to the optimal model parameters can be obtained through pretraining [35]. Pretraining is typically used to initialize the backbones of object detection and image segmentation models. Zeiler and Fergus confirmed that the front bottleneck modules extract low-level features such as color and edges, the middle bottleneck modules extract relatively complex texture or local features, and the back bottleneck modules extract relatively complete, discriminative features or pose variation [36]. The experiments shown in Table 4 demonstrate the advantages of the pretraining strategy and bottleneck restructuring for crowd counting.

Experiments show that the pretraining strategy improves evaluation metrics such as MAE and RMSE by 30-40% for the same 300 training iterations. When time and resources are constrained, the pretraining strategy reduces the number of training iterations needed. Pretraining provides more reliable convergence than random initialization, so the conventional wisdom in favor of pretraining still does more good than harm. Experiments also show that the last bottleneck does not significantly improve capability but increases the computational cost. For pixel-level prediction tasks like crowd counting, especially in heavily occluded or congested scenes, extracting completely identifiable features may be difficult or misleading. Both performance and computational cost are critical for estimating density maps efficiently in crowd counting.

4.4.2. Ablation Experiments

To understand the effectiveness of the various modules in the network, we perform ablation experiments with the different settings in Table 5 and draw the following conclusions. When the backbone network changes from SE-ResNet to SE-ResNeXt, the crowd counting ability of the corresponding model improves significantly. The experiments show that the multibranch split-transform-merge architecture in SE-ResNeXt approaches the representational ability of large and dense layers at a considerably lower computational complexity. From SE-ResNeXt50 to SE-ResNeXt101, as the network depth increases, the improved feature representation ability leads to better crowd counting results. Network depth therefore plays an important role in these improvements. However, building a deeper network is not as simple as stacking more layers. Taking UCF_QNRF as an example, crowd-counting accuracy saturates and then degrades. The SSIM or the MS-SSIM index considers the consistency of local patterns. The weighted loss function may improve performance in some cases by reducing the sensitivity to statistical shift and spatial variation. This targeted design of the loss function is a necessary step in training more effective models.

4.4.3. Comparison Experiments

To demonstrate the efficacy of the proposed approach, we conduct comparison experiments on four challenging public crowd counting datasets. The experimental results are shown in Table 6 and Figure 5. Our method performs very well compared with previous techniques on most datasets and evaluation metrics. The comparison experiments show that our approach can perform well in both high-density and low-density crowd scenes.

Our model SE-ResNeXtCount101 achieves better MAE and RMSE than the other methods on the ShanghaiTech Part_B dataset, and SE-ResNeXtCount50 improves MAE by 5.5 and RMSE by 1.7 over the second-best approach, SPN, on the UCF_QNRF dataset. These results show that our approach achieves superior performance on both congested and sparse crowd datasets. On the Mall dataset, SE-ResNeXtCount101 improves MAE by 0.19 and RMSE by 0.11 over the prominent method LA-Batch. This result demonstrates that our approach copes well with large scale variation and extracts complex feature representations. In sum, the comparison experiments indicate that our model has clear advantages across different application scenarios.

4.4.4. Transfer Experiments

Since scenario variation usually leads to a noticeable performance drop, cross-dataset evaluation is receiving increasing attention in crowd counting [12, 15]. In practice, a crowd-counting method with strong generalizability is usually expected. The cross-dataset transfer experiments are designed to demonstrate the transferability of the proposed method SE-ResNeXtCount; the results are listed in Table 7.

Our model SE-ResNeXtCount performs slightly better than LSC-CNN when transferring models trained on ShanghaiTech Part_A, ShanghaiTech Part_B, or UCF_QNRF. SE-ResNeXtCount50 improves MAE by 5.5 and RMSE by 4.6 when transferring from ShanghaiTech Part_A to Part_B. SE-ResNeXtCount101 improves MAE by 2.9 and RMSE by 15.1 when transferring from ShanghaiTech Part_B to Part_A. Our method exceeds MCNN by a large margin and stays ahead of existing models. The transfer experiments confirm the effectiveness and generalizability of our method.

5. Conclusion

This paper proposes an optimized encoder-decoder architecture with the squeeze-and-excitation block for crowd counting, called SENetCount, which includes SE-ResNetCount and SE-ResNeXtCount. The dense atrous spatial pyramid pooling with an incremental atrous rate may overcome gridding artifacts and obtain satisfactory results under the continuous variation of crowd scenes. Fundamental experiments indicate the importance of the pretraining strategy and bottleneck restructuring. Ablation experiments confirm the benefits of the backbone network, the increase in network depth, and the final objective loss function with a structural similarity index. Comparison experiments on challenging datasets demonstrate that our strategy is competitive with prominent approaches. Transfer experiments confirm the effectiveness and generalizability of our method.

In the future, we will investigate the effectiveness of SENetCount in other tasks, such as vehicle counting or remote sensing object counting. We will also explore more extensible and robust solutions that handle both low-density and high-density regions, in order to improve performance across various scenes.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no competing interests.

Acknowledgments

This work was supported by grants from the Talent Construction Project of Shanghai Tourism College (Project no. A4-0252-21-15-CY04) and the Scientific Research Project of Shanghai Tourism College (Project no. E3-0250-20-001-030).