Abstract

Accurate counting in dense scenes can effectively prevent the occurrence of abnormal events, which is crucial for flow management, traffic control, and urban safety. In recent years, deep learning has significantly improved the performance of counting models, but the task still faces many challenges, including the diverse distribution of targets and background within images, drastic changes in target scale, and serious occlusion. To address these problems, this paper proposes a spatial context feature fusion network, abbreviated SCFFNet, to understand highly congested scenes, perform accurate counting, and produce high-quality estimated density maps. SCFFNet first uses rich convolutions with different scales to compute scale-aware features and adaptively encodes the scale of the contextual information needed to accurately estimate density maps. It then calibrates and re-fuses the fused feature maps through a channel spatial attention-aware module, which improves the model's ability to suppress background and focus on salient features. Finally, the estimated density map is generated by a dilated convolution module. We conduct experiments on five public crowd datasets, UCF_CC_50, WorldExpo’10, ShanghaiTech, Mall, and Beijing BRT, and the results show that our method achieves lower counting errors than existing state-of-the-art methods. In addition, we extend SCFFNet to count other objects, such as vehicles in the vehicle dataset HBR_YD, and the experimental results show that our method significantly improves output quality with higher accuracy than previous methods.

1. Introduction

Crowd research is an important topic in video surveillance and intelligent image analysis [1–3]. Among its subproblems, crowd counting [4–6] is a key research point, whose purpose is to automatically estimate the number of people in images or videos or to predict density maps for scenes of different densities. With the continuous growth of the world population and the increase in diversified social activities, large gatherings appear frequently in daily life. Accurate counting and density prediction for dense crowds have been widely used in public safety management [7], urban planning [8], and video surveillance [9]. Moreover, dense-crowd statistical methods can be extended to similar counting work, such as cell counting in medical research, vehicle counting in traffic jams, and extended sample surveys in biology. However, various real-world conditions, such as severe scale changes, heavy occlusions, and cluttered backgrounds, pose great challenges to practical crowd-counting solutions. As shown in the red box in Figure 1, due to the different shooting angles of surveillance cameras, the size of a human head inevitably changes greatly across an image. Faced with complex backgrounds, some methods based on convolutional neural networks (CNNs) often mistake background clutter for crowd areas, as analyzed in [10, 11]. In particular, some grid regions (such as occluders) are especially prone to errors in the prediction map, because their appearance closely resembles dense crowd regions, as shown by the yellow box in Figure 1.

Scale change is a serious challenge in crowd statistics. For this reason, researchers have proposed many methods based on multicolumn network structures to deal with scale changes. For example, MCNN [12] uses a multibranch, multicolumn network structure to handle scale changes, where each branch uses convolution kernels of a different size. McML [13] added a statistical network to the multicolumn network and guided each column to learn multiscale feature information through the information exchanged between columns. These multiscale feature fusion methods improve the model's adaptability to crowd scale changes to a certain extent, but their counting accuracy remains unsatisfactory in practical tests. The main reasons these methods struggle with scale change are that, although the multicolumn CNN uses convolution kernels of different sizes, the features extracted by each column differ little, and the structure is redundant and complicated. Moreover, training multicolumn convolutional networks is difficult and time-consuming, which cannot meet the requirements of practical applications, as analyzed in CSRNet [5]. CSRNet uses the deep convolutional network VGG-16 [14] with the fully connected layers removed as the feature extractor, followed by a 7-layer dilated convolution as the regressor, which enlarges the receptive field of the network to obtain sufficient spatial context information. However, CSRNet cannot ensure that the counting network obtains the appropriate spatial context, because it does not assess the contribution of different receptive fields to the final count. In addition, it does not encode attention information, which makes it easy for the network to mistakenly predict the background as a crowd region.

To address the shortcomings of the above methods, this paper designs a new network, SCFFNet, which combines a multiscale context feature fusion module and a channel spatial attention-aware module to understand highly crowded scenes and count accurately. SCFFNet is a trainable, end-to-end deep network architecture that mainly consists of a front-end feature extraction network, a multiscale context feature fusion module (MCFFM), a channel spatial attention-aware module (CSAM), and a back-end network. The front-end feature extraction network is the VGG-16 net [14] with the fully connected layers removed, and the back-end network consists of 2D dilated convolution layers [5]. SCFFNet can adaptively fuse multiscale context features to accommodate rapid scale changes and suppress background noise, generating high-quality density maps. Specifically, MCFFM, an improvement on the fully convolutional network (FCN), combines features extracted from multiple receptive fields of different sizes and computes the importance of the corresponding features at each spatial position of the image, capturing rich context information and scale diversity. SCFFNet then uses CSAM to encode spatial correlations across the entire feature map for more accurate pixel-level density prediction and to model the dependencies between any two channel maps, allowing the network to selectively enhance informative features and suppress unnecessary ones. This helps the model focus on the core of the crowd scene, the head regions, and avoid wrongly estimating background clutter.

The contributions of this paper are summarized as follows:
(1) We propose a new network, SCFFNet, to accurately count objects of different scales and densities. The proposed MCFFM divides low-level features into parallel blocks of different sizes to encode rich context information and learns the importance of the corresponding features at each image location, thereby addressing potentially rapid scale changes.
(2) We propose a CSAM that integrates feature-map correlations along the channel and spatial dimensions, improving the focusing ability of the feature maps and reducing the influence of background and noise, so as to produce high-quality density maps.
(3) Experimental results on the UCF_CC_50, WorldExpo’10, ShanghaiTech, Mall, and Beijing BRT datasets show that the proposed method achieves a lower mean absolute error (MAE) than many existing methods.
(4) We present a new vehicle dataset, HBR_YD, and compare several previous methods with SCFFNet on it. The results show that our method has higher counting accuracy.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 details the structure of SCFFNet. Section 4 analyzes the comparative experimental results and ablation studies. In Section 5, we give conclusions and future research directions.

2. Related Work

In recent years, deep learning has been widely applied in computer vision, for example, in surveillance image analysis [15, 16], self-driving [17], stud pose detection [18], and stock value prediction [19]. As a result, neural network techniques have become the dominant approach for many artificial intelligence tasks. Early methods [20] solved the counting problem by detecting pedestrians in crowds, but in scenes with dramatic scale changes and dense crowds, small objects and occlusions greatly reduced detection accuracy. With their powerful feature representation ability, CNNs now dominate counting tasks, and CNN methods [21] using attention mechanisms have also achieved good results in dense crowd counting. Therefore, in this section, we review mainstream CNN-based crowd counting algorithms and attention modules.

2.1. Crowd Counting

In recent years, the application of deep learning to counting tasks has enabled most CNN-based models [22–24] to achieve good results. We review some previous models, dividing them by feature type into scale-aware and context-aware approaches.

2.1.1. Scale-Aware Feature

Scale variation is a difficult challenge in crowd statistics and density estimation, and extracting scale-aware features can effectively improve the performance of counting networks, so it has received extensive study. Zhang et al. [12] designed a three-branch CNN model to deal with scale changes, composed of three columns of fully convolutional networks with different kernel sizes. However, [5, 25] showed experimentally that the counting accuracy of multicolumn networks is not better than that of single-column networks. Thus, some works use a single-column structure to extract feature information together with a well-designed regression module to generate the final crowd density map. For example, Kang and Chan [26] used an image pyramid to deal with scale changes. Cao et al. [22] used a new encoder-decoder counting model, SANet, whose encoder uses a scale aggregation module to encode multiscale features and whose decoder uses transposed convolutions to produce a high-resolution density map. Guo et al. [27] designed DADNet, a counting network that handles scale changes by extracting different visual information from crowd regions of interest through scale-aware attention fusion.

In addition to these methods, some researchers also proposed some novel models to deal with scale variations. For example, Wang et al. [28] designed a multiscale dilated convolution module that can fuse multiscale information so that the network can integrate low-level information into high-level semantic information and improve the scale awareness of the model. Liu et al. [29] designed a simple and efficient detection-estimation network (DENet) that enhances the model’s scale representation and generates high-quality density maps. Huang et al. [30] designed a scale-aware representation learning model, which improved the overall counting accuracy by modeling multiscale features at various levels and expanding spatial resolution. Zhou et al. [31] designed a multiscale feature enhancement learning network that combines a semantic enhancement module, diversity enhancement module, and context enhancement module.

2.1.2. Context-Aware Feature

Some counting methods incorporate image context information into the counting process, which enables the network to produce high-quality density maps and effectively improves counting performance. For example, Sindagi and Patel [23] proposed a context pyramid network that combines local and global context information of crowd images to obtain high-quality density maps. Sam and Babu [32] designed a top-down feedback structure that uses high-level context information to suppress false detections through density regression.

In addition, some researchers expand the receptive field of the network to obtain sufficient contextual information. For example, Li et al. [5] designed CSRNet, a network for dense crowd counting that uses dilated convolution kernels to enlarge the receptive field and provide accurate counts and density maps. Ilyas et al. [33] designed a CNN-based context-aware scale aggregation network that can capture deep-level crowd features under perspective and scale changes. Wang et al. [24] designed SCLNet, a network for counting in complex scenes that obtains spatial context information for count estimation by adaptively selecting features from different receptive fields.

2.2. Attention Module

In recent years, attention modules have found many applications in computer vision work. In dense crowd counting tasks, the attention mechanism focuses the model on useful information, which can significantly improve the counting performance of the network.

Mnih et al. [34] designed a visual attention mechanism based on a recurrent neural network model. Vaswani et al. [35] designed a simple network architecture based solely on attention, eliminating recurrence and convolution. Chen et al. [36] proposed SCA-CNN, which combines spatial and channel-wise attention in a CNN. Liu et al. [37] designed DecideNet, a detection and density estimation network that adaptively chooses the appropriate counting mode for each image position through attention guidance according to the actual density. Hu et al. [38] focused on channel relationships and designed the "Squeeze-and-Excitation" block, which strengthens important features by modeling the correlations between feature channels; building on [38], they later proposed the gather-excite operators [39], which exploit convolutional feature context. Liu et al. [40] designed an attention-injective deformable CNN to overcome the accuracy loss in crowded, noisy scenes. Miao et al. [41] proposed a densely connected hierarchical structure that reduces background interference and captures multiscale feature information. Zou et al. [42] proposed a dual attention module that produces two masks, enabling the network to better attend to regions of interest in complex backgrounds.

However, most existing scale-aware methods [5, 43] fuse multiscale features indiscriminately without considering continuous scale changes, so the network cannot extract the appropriate spatial context information. In addition, some methods that adopt attention modules [38, 44] cannot effectively learn channel and spatial feature dependencies simultaneously, leaving the model unable to suppress useless features while retaining detailed information and thus reducing recognition accuracy. We therefore construct SCFFNet to address severe occlusion and scale changes in dense scenes.

Compared with previous methods, our proposed method has two main differences. First, our MCFFM uses four parallel blocks to process the features extracted by the front-end network, adaptively encoding the contextual information needed to estimate the density map, instead of processing the entire image uniformly or directly fusing multiple columns of features as previous methods do. Second, we use CSAM to model feature correlations along both the channel and spatial dimensions, improving the focusing ability of the feature maps by calibrating and re-fusing them, thereby reducing the error between the model's predictions and the ground truth.

3. Methodology

In this section, we first introduce the network structure of SCFFNet and then describe the loss function.

3.1. Spatial Context Feature Fusion Network

Our proposed SCFFNet is mainly composed of a front-end feature extraction network, the MCFFM, the CSAM, and a back-end network; the overall architecture is shown in Figure 2. Compared with existing CNN-based crowd counting algorithms, SCFFNet has two main novelties. First, to handle scale changes and occlusion, SCFFNet does not process the entire input image uniformly; instead, MCFFM divides the low-level features into four parallel blocks of different sizes to obtain features of different scales and learns the importance of the corresponding features at each image position. This helps exploit the appropriate context at each location in the image and makes the model more robust to changes in crowd scale. Second, to reduce the influence of background and noise, SCFFNet uses a CSAM composed of a channel attention module and a spatial attention module, which encodes the spatial correlations within the feature map and models the dependencies between channels. This helps the model accurately locate the head positions in the image, improves its regression performance, and yields high-quality crowd density maps. We treat crowd counting as a regression task and learn the mapping between image content and crowd density distribution. Next, we show how the modules of SCFFNet regress from the input image to the final crowd density map.

3.1.1. Front-End Feature Extraction Network

Similar to some previous research work on crowd counting [5, 40, 45], we use the first 10 convolutional layers of the VGG-16 network [14] as the front-end feature extraction network because of its strong transfer learning ability. Given an image $I$, the feature $f_v$ extracted by the VGG-16 backbone is calculated as

$$f_v = F_{vgg}(I), \tag{1}$$

where $F_{vgg}$ represents the first 10 convolutional layers of the pretrained VGG-16 model.
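For concreteness, the following is a minimal PyTorch sketch of equation (1); it is not the authors' released code, and the torchvision slice index and dummy input size are illustrative assumptions.

```python
import torch
import torchvision

# First 10 conv layers of VGG-16 (conv1_1 ... conv4_3 with their ReLUs and
# the first three max-pools), which downsample the input by a factor of 8.
vgg = torchvision.models.vgg16(pretrained=True)
frontend = torch.nn.Sequential(*list(vgg.features.children())[:23])

I = torch.randn(1, 3, 768, 1024)   # dummy input image
f_v = frontend(I)                  # equation (1): shape (1, 512, 96, 128)
```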

3.1.2. Multiscale Context Feature Fusion Module

To address the limitation that the front-end feature extraction network encodes the same receptive field over the whole image, MCFFM first uses rich convolutions with different scales to capture contextual information from the features output by equation (1) and computes scale-aware features as

$$s_j = U_{bi}\big(F_j\big(P_{ave}(f_v, k_j)\big)\big), \quad j = 1, \dots, S, \tag{2}$$

where $k_j$ represents the $j$-th scale and $P_{ave}(\cdot, k_j)$ represents adaptive average pooling. In practice, we use pooling operations at different scales to obtain feature information of different depths: the pooling averages the original features extracted by the VGG-16 backbone into $k_j \times k_j$ blocks, a setting that performed better than the alternatives we tried. $F_j$ is a convolutional layer with kernel size 1 that incorporates cross-channel contextual features without changing the dimension; without it, each feature channel would remain independent, limiting the representation ability. Our experiments show that removing this convolution degrades performance, in contrast to the effect of early convolutional dimension-reduction structures [46]. $U_{bi}$ denotes bilinear upsampling, so that the scale-aware feature map has the same dimensions as $f_v$.

Then, we learn prediction weight maps to better account for scale variation across the image. Each spatial position of these learned weight maps defines the relative influence of the corresponding scale-aware feature. To model this, we feed the contrast features, which capture the information needed to understand the local scale of an image region, into an auxiliary network $F_w^j$ that computes the weight for each scale:

$$w_j = F_w^j(c_j), \quad c_j = s_j - f_v, \tag{3}$$

where the scale-aware feature $s_j$ is obtained by downsampling and then upsampling and therefore loses detail compared with $f_v$; the contrast feature $c_j$ thus captures the differences between features at a specific spatial location and its neighborhood. $F_w^j$ is a $1 \times 1$ convolutional network followed by a sigmoid function.

Finally, we use these weights to compute the final context feature as

$$f_c = \left[ f_v \,\middle|\, \frac{\sum_{j=1}^{S} w_j \odot s_j}{\sum_{j=1}^{S} w_j + \epsilon} \right], \tag{4}$$

where $\odot$ represents the element-wise product between a weight map $w_j$ and a feature map $s_j$, $[\cdot \mid \cdot]$ represents channel-wise concatenation, and $\epsilon$ is a small constant that avoids division by zero.
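The following PyTorch sketch implements equations (2)–(4) under stated assumptions: the block sizes (1, 2, 3, 6) and the per-channel weight maps are common choices for this style of pyramid pooling, not values confirmed by the text, and the code is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCFFM(nn.Module):
    """Sketch of the multiscale context feature fusion module, eqs. (2)-(4).
    Assumed block sizes k_j = (1, 2, 3, 6); per-channel weight maps."""
    def __init__(self, channels=512, scales=(1, 2, 3, 6)):
        super().__init__()
        self.scales = scales
        # F_j in eq. (2): one 1x1 conv per scale, dimension unchanged
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 1, bias=False) for _ in scales])
        # F_w^j in eq. (3): 1x1 conv followed by a sigmoid
        self.weight_nets = nn.ModuleList(
            [nn.Conv2d(channels, channels, 1) for _ in scales])

    def forward(self, f_v):
        h, w = f_v.shape[2:]
        weighted, weights = [], []
        for k, conv, wnet in zip(self.scales, self.convs, self.weight_nets):
            s = F.interpolate(conv(F.adaptive_avg_pool2d(f_v, k)),
                              size=(h, w), mode='bilinear',
                              align_corners=False)      # s_j, eq. (2)
            c = s - f_v                                 # contrast feature c_j
            w_j = torch.sigmoid(wnet(c))                # w_j, eq. (3)
            weighted.append(w_j * s)
            weights.append(w_j)
        fused = sum(weighted) / (sum(weights) + 1e-12)  # eps: no div-by-zero
        return torch.cat([f_v, fused], dim=1)           # f_c, eq. (4)
```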

3.1.3. Channel Spatial Attention-Aware Module

Most of the previous crowd statistics methods only encode the local spatial information of the image region, while ignoring the crowd attention information and large-scale pixel-level contextual information, which affects the accuracy of the model prediction. For this reason, we propose a CSAM that consists of a channel attention mechanism and a spatial attention mechanism. The structure of the CSAM is shown in Figure 3.

The channel attention mechanism extracts the most discriminative features across channels, making the network model more robust to noisy backgrounds and thereby effectively reducing estimation errors in dense scenes.

Figure 4 shows the structure of the channel attention mechanism, which uses only one convolutional layer with kernel size 1 to process the input feature map $A \in \mathbb{R}^{C \times H \times W}$. To obtain the channel attention map, the mechanism first reshapes the processed features to $A' \in \mathbb{R}^{C \times N}$ (with $N = H \times W$), performs a matrix multiplication between $A'$ and $A'^{\top}$ of sizes $C \times N$ and $N \times C$, respectively, and then applies a Softmax to obtain the channel attention map $X$ of size $C \times C$, which encodes the relationship between any two channel maps. $X$ is defined as

$$x_{ji} = \frac{\exp\left(A'_i \cdot A'_j\right)}{\sum_{i=1}^{C} \exp\left(A'_i \cdot A'_j\right)}, \tag{5}$$

where $x_{ji}$ denotes the influence of the $i$-th channel on the $j$-th channel. To obtain a feature map containing global context features and channel attention information, matrix multiplication is applied to $X$ and $A'$ of sizes $C \times C$ and $C \times N$, respectively, and the final output feature map $E$ of size $C \times H \times W$ is computed as

$$E_j = \beta \sum_{i=1}^{C} x_{ji} A'_i + A_j, \tag{6}$$

where $\beta$ represents a learnable parameter.
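A minimal PyTorch sketch of equations (5) and (6) follows; the zero initialization of $\beta$ is an assumption borrowed from similar attention designs.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention mechanism, eqs. (5)-(6)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 1)  # the single 1x1 conv
        self.beta = nn.Parameter(torch.zeros(1))      # learnable beta

    def forward(self, A):
        b, c, h, w = A.shape
        a = self.conv(A).view(b, c, -1)               # A': C x N, N = H*W
        X = torch.softmax(a @ a.transpose(1, 2), -1)  # eq. (5): C x C map
        out = (X @ a).view(b, c, h, w)                # re-weighted channels
        return self.beta * out + A                    # eq. (6)
```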

Local and global density changes in crowded scenes follow certain trends; to let the model better handle the spatial correlation between features and predict density maps more accurately, we propose a spatial attention mechanism with a structure similar to the channel attention mechanism, shown in Figure 5. Compared with the channel attention mechanism in Figure 4, it differs in two respects. First, the spatial attention mechanism uses three convolutional layers of kernel size $1 \times 1$ to process the input feature map, whereas the channel attention mechanism uses only one. Second, the intermediate attention maps generated by the two mechanisms have different sizes: the spatial attention map $S$ has dimension $N \times N$, while the channel attention map $X$ has dimension $C \times C$.

Specifically, the spatial attention mechanism first feeds the feature map $A$ of size $C \times H \times W$ output by the multiscale feature fusion structure into three different $1 \times 1$ convolutional layers and then reshapes or transposes the results to obtain three feature maps $B$, $C$, and $D$. To generate the spatial attention map, we perform matrix multiplication and a Softmax on $B^{\top}$ and $C$ of sizes $N \times C'$ and $C' \times N$, respectively, to obtain the spatial attention map $S$ of size $N \times N$, which encodes the spatial dependencies within the feature map:

$$s_{ji} = \frac{\exp\left(B_i \cdot C_j\right)}{\sum_{i=1}^{N} \exp\left(B_i \cdot C_j\right)}, \tag{7}$$

where $s_{ji}$ represents the influence of the $i$-th location on the $j$-th location; the more similar the features of two locations, the greater the correlation between them. Then, to obtain a feature map containing global context features and spatial attention information, matrix multiplication is applied to $D$ and $S^{\top}$ of sizes $C \times N$ and $N \times N$, respectively, and the result is reshaped to size $C \times H \times W$. Finally, it is summed with $A$ to generate the final feature map $E$:

$$E_j = \alpha \sum_{i=1}^{N} s_{ji} D_i + A_j, \tag{8}$$

where $\alpha$ is a learnable parameter used to scale the output, and $B$, $C$, and $D$ are learned with networks of kernel size $1 \times 1$.
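The sketch below mirrors equations (7) and (8); the channel reduction $C' = C/8$ for the $B$ and $C$ branches and the zero initialization of $\alpha$ are assumptions, not values given in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention mechanism, eqs. (7)-(8)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        inter = max(channels // reduction, 1)     # assumed C' = C/8
        self.conv_b = nn.Conv2d(channels, inter, 1)   # three 1x1 convs
        self.conv_c = nn.Conv2d(channels, inter, 1)
        self.conv_d = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.zeros(1))     # learnable alpha

    def forward(self, A):
        b, c, h, w = A.shape
        n = h * w
        B = self.conv_b(A).view(b, -1, n).transpose(1, 2)  # N x C'
        C = self.conv_c(A).view(b, -1, n)                  # C' x N
        S = torch.softmax(B @ C, -1)                       # eq. (7): N x N
        D = self.conv_d(A).view(b, c, n)                   # C x N
        out = (D @ S.transpose(1, 2)).view(b, c, h, w)
        return self.alpha * out + A                        # eq. (8)
```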

3.1.4. Back-End Network

The output of CSAM is finally fed to the back-end network for the final crowd density estimation. The back-end dilated convolution module (DCM) consists of 2D dilated convolutional layers [5]. Dilated convolution avoids downsampling and provides a larger receptive field at the same computational cost, and it has been shown to significantly improve counting performance in crowd counting tasks [5].
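The back end can be sketched as follows, assuming the dilated-layer configuration of CSRNet [5] (six 3×3 convolutions with dilation rate 2 followed by a 1×1 output layer); the input width of 1024 channels assumes the concatenation in equation (4) doubles the 512 VGG channels.

```python
import torch.nn as nn

def make_backend(in_channels=1024):
    """Sketch of the dilated back-end; layer widths follow CSRNet [5]."""
    cfg = [512, 512, 512, 256, 128, 64]
    layers = []
    for out_channels in cfg:
        # padding = dilation keeps the spatial size for 3x3 kernels
        layers += [nn.Conv2d(in_channels, out_channels, 3,
                             padding=2, dilation=2),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.Conv2d(in_channels, 1, 1))  # 1-channel density map
    return nn.Sequential(*layers)
```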

3.2. Loss Function

In SCFFNet, to obtain an accurate density map, we follow previous works [5, 47, 48] and use the L2 loss, which constrains the pixel-level error between the real density map and the predicted density map:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D(X_i; \Theta) - D_i^{GT} \right\|_2^2, \tag{9}$$

where $D_i^{GT}$ is the real density map, $D(X_i; \Theta)$ is the density map estimated for input image $X_i$ with network parameters $\Theta$, and $N$ is the training batch size.
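Equation (9) translates directly into a few lines of PyTorch; this sketch assumes density maps stored as (B, 1, H, W) tensors.

```python
def density_loss(pred, gt):
    """Pixel-wise L2 loss of eq. (9), averaged over the batch size."""
    batch = pred.size(0)
    return ((pred - gt) ** 2).sum() / (2 * batch)
```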

4. Experiments

In this section, we first introduce the generation of the ground truth density maps and then describe the evaluation metrics, implementation details, and datasets. After that, we compare SCFFNet with many other crowd counting methods on five public crowd datasets. Additionally, we test the method on the vehicle dataset HBR_YD and compare it with some previous methods. Finally, we perform ablation experiments on the UCF_CC_50 and ShanghaiTech Part B datasets to verify the effectiveness of the two key components of SCFFNet.

4.1. Ground Truth Density Map Generation

Ground truth density maps are an essential part of crowd counting tasks. Similar to existing works [5, 22, 45], we convolve a Gaussian kernel with each head annotation point to generate the real density map. Each crowd image is annotated with a point at the center of every pedestrian's head; an annotation at pixel $x_i$ can be represented by a dot map $\delta(x - x_i)$. The ground truth density map $D^{GT}$ is defined as

$$D^{GT}(x) = \sum_{i=1}^{M} \delta(x - x_i) * G_{\sigma}(x), \tag{10}$$

where $x_i$ denotes the pixel location of the $i$-th head in the ground truth, $x$ represents a pixel location in the input image, $M$ is the number of annotated heads, and $G_{\sigma}$ denotes the Gaussian kernel with standard deviation $\sigma$. In our experiments, we did not use the geometry-adaptive kernel [12] but instead fixed the spread parameter of the Gaussian kernel to $\sigma = 15$.
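A minimal NumPy/SciPy sketch of equation (10) with the fixed $\sigma = 15$ follows; clipping out-of-bounds annotations to the image border is a practical assumption on our part.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(points, height, width, sigma=15.0):
    """Eq. (10): unit impulses at head annotations, blurred by G_sigma."""
    dot_map = np.zeros((height, width), dtype=np.float32)
    for x, y in points:                       # (x, y) head coordinates
        col = min(max(int(x), 0), width - 1)  # clip to image bounds
        row = min(max(int(y), 0), height - 1)
        dot_map[row, col] += 1.0
    # the blurred map still sums (approximately) to the head count
    return gaussian_filter(dot_map, sigma)
```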

4.2. Evaluation Metrics

We choose two metrics to evaluate our proposed method, the mean absolute error (MAE) and the mean squared error (MSE), calculated as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i^{est} - C_i^{gt} \right|, \tag{11}$$

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i^{est} - C_i^{gt} \right)^2}, \tag{12}$$

where $N$ represents the number of images in the test set, $C_i^{gt}$ represents the real count value, and $C_i^{est}$ represents the estimated count value.
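The two metrics of equations (11) and (12) can be computed over per-image counts as below; note that, as is conventional in crowd counting, the MSE here is a root-mean-squared error.

```python
import numpy as np

def mae_mse(est_counts, gt_counts):
    """MAE (eq. (11)) and MSE (eq. (12)) over per-image counts."""
    est = np.asarray(est_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(est - gt).mean()
    mse = np.sqrt(((est - gt) ** 2).mean())
    return mae, mse
```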

4.3. Experiment Details

We train the proposed SCFFNet end-to-end in the same way as CSRNet [5], with the front-end network loading the pretrained parameters of the VGG-16 net [14] to speed up training. For the WorldExpo’10, UCF_CC_50, ShanghaiTech Part B, Mall, Beijing BRT, and HBR_YD datasets, we fix the learning rate to $10^{-4}$ and use the Adam optimizer with a batch size of 8. For the ShanghaiTech Part A dataset, we resize all images to a fixed size, the learning rate is initially set to $10^{-5}$ and decays by a factor of 0.995 per epoch, the optimizer is Adam with a batch size of 8, and the best results are obtained after 800 epochs. In addition, during training, we randomly crop the input image at different locations into blocks whose size equals 1/4 of the original image, doubling the training set to reduce overfitting; a sketch of this augmentation follows.
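This illustrative helper, not the authors' code, reads "1/4 of the original image" as quarter area (half of each side length), which is the common convention in CSRNet-style training; that reading is our assumption.

```python
import random

def random_quarter_crop(image, density):
    """Random patch of quarter area (half height, half width); the
    density map is cropped at the same location so counts stay aligned."""
    h, w = image.shape[-2:]
    ch, cw = h // 2, w // 2
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    return (image[..., top:top + ch, left:left + cw],
            density[..., top:top + ch, left:left + cw])
```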

All experiments are run under Ubuntu 18.04.1, implemented in the PyTorch framework with Python 3.6, and trained on an NVIDIA GeForce RTX 2080 Ti GPU.

4.4. Datasets

We conduct experiments on five public standard crowd datasets, including the most widely used and challenging ones, ShanghaiTech [12], WorldExpo’10 [49], and UCF_CC_50 [50]. In addition to these three large-scale crowd datasets, we also experiment on two single-scene datasets of low-density crowd images captured by surveillance cameras, Mall [51] and Beijing BRT [52], to evaluate the generalization performance of our model in daily life. To further verify the performance of our model on vehicle counting, and because existing vehicle counting datasets suffer from poor image quality and limited size, we build a new vehicle counting dataset, HBR_YD, for our experiments. The statistics of the datasets are shown in Table 1 and described below. Additionally, we select one representative image from each of the five public crowd datasets and the vehicle dataset HBR_YD, as shown in Figure 6.

ShanghaiTech [12]. This dataset includes 330,165 head annotations across 1198 crowd images divided into Part A and Part B. Part A consists of 482 dense crowd images collected from the Internet, with 300 training and 182 test images; Part B consists of 716 sparse crowd images taken in the bustling streets of Shanghai, with 400 training and 316 test images.

UCF_CC_50 [50]. This dataset has 50 very dense crowd images, in which the number of people per image varies from 94 to 4543. The images cover many different crowd scenes, including square gatherings, concert halls, and gymnasiums, which makes this a very challenging dataset for crowd counting. We adopt the same fivefold cross-validation protocol as [50]: divide the dataset into 5 equal subsets, use 4 groups for training and the remaining group for testing in each round, and finally average the results of the 5 rounds; a sketch of this protocol follows.
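The fivefold protocol can be written as a short helper; the interleaved split below is only one reasonable way to form five equal subsets and is purely illustrative.

```python
def five_fold_splits(samples):
    """Yield (train, test) pairs: 4 subsets train, 1 tests, 5 rotations."""
    folds = [samples[i::5] for i in range(5)]
    for k in range(5):
        train = [s for i, fold in enumerate(folds) if i != k for s in fold]
        yield train, folds[k]

# Final score: average the 5 per-fold MAE/MSE values.
```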

WorldExpo’10 [49]. This dataset contains 3980 annotated sequence images with a size of 576 × 720, of which the test dataset contains 600 images, and the training dataset includes 3380 images. The entire dataset comes from 1132 video sequences of 108 scenes; each image contains a region of interest. The dataset is annotated with a total of 199,923 human heads, and the number of pedestrians per image varies from 1 to 253.

Mall [51]. The Mall dataset is a relatively sparse crowd counting dataset of 2000 annotated images selected from surveillance equipment in a shopping mall. Each image contains between 13 and 53 people, with a total of 62,325 annotated heads.

Beijing BRT [52]. This dataset contains 1280 annotated images selected from monitoring equipment at Beijing BRT stations. Each image contains between 1 and 64 people, with a total of 16,795 annotated heads. These high-angle images were captured from morning to evening and contain glare, shadow, and sunlight interference. Such factors often appear in daily life, so the Beijing BRT dataset is well suited to verifying the effectiveness of a counting network in real scenarios.

HBR_YD. The images in this dataset come from surveillance cameras installed on different highway sections, with flexible and highly accurate annotations. The HBR_YD dataset contains 2000 images of vehicle congestion scenes of varying density, covering cars, buses, trucks, and other vehicle types. For network training, 1600 samples are randomly selected as the training set, and the remaining samples form the test set.

4.5. Evaluation and Analysis

We experiment with the proposed method on the six datasets mentioned above and compare it with many existing methods.

4.5.1. ShanghaiTech

The experimental results of our method on the ShanghaiTech dataset are shown in Table 2. Our method obtains the best MAE of 66.2 on Part A and 7.3 on Part B. In particular, compared with CSRNet [5], our method improves MAE/MSE by 2.0/11.9 on Part A and by 3.3/5.1 on Part B. On Part B, the MAE and MSE of our method are 20.7% and 21.0% lower, respectively, even compared with the recently proposed MGANet [57]. Figure 7 shows visualization results for some samples from ShanghaiTech Part A and Part B, giving a visual sense of the network's performance: the first three columns are samples from Part A, and the last three columns are samples from Part B; the first, second, and third rows show the test image, the real density map, and the estimated density map, respectively. SCFFNet performs well for crowd scenes with different levels of congestion; the prediction maps reflect the density of different crowd areas, and the predicted counts are very close to the annotated counts.

In addition, we randomly select 50 image samples from the test sets of Part A and Part B, respectively; the corresponding ground truth and predicted values are shown in Figures 8 and 9. Our method performs better on Part B than on Part A: the error fluctuation range is smaller, and the errors of many samples are close to zero. These results show that SCFFNet enables the counting network to capture appropriate multiscale spatial context information and effectively handles scale variation.

4.5.2. UCF_CC_50

We compare SCFFNet with existing methods on the UCF_CC_50 dataset; as Table 3 shows, SCFFNet achieves the lowest MAE and MSE. Our MAE and MSE are 101.8 and 170.5 lower, respectively, than those of CSRNet [5]. Even compared with the recently proposed MGANet [57], the MAE and MSE of SCFFNet decrease by 31.8% and 27.1%. This shows that, in extremely congested crowd scenes, applying the multiscale context feature fusion module to model multiscale contextual features and introducing an attention mechanism that processes the extracted features and gathers attention information from the channel and spatial dimensions effectively addresses complex backgrounds, scale variation, and occlusion. Figure 10 visualizes some samples from the UCF_CC_50 dataset, from which we can see that SCFFNet performs well in crowd scenarios with different levels of congestion.

4.5.3. WorldExpo’10

We compare the proposed method with other methods on this dataset, with the results shown in Table 4; our method achieves the lowest MAE in all five scenes. In particular, compared with CSRNet, SCFFNet improves MAE by 1.3, 1.2, 0.5, 7.9, and 0.8 in the five scenes, respectively, a major improvement in counting accuracy on the WorldExpo’10 dataset. Even compared with the recently proposed MGANet, our method ranks first in all five test scenes, and its average MAE is 20.3% lower. Figure 11 shows qualitative results for some samples from this dataset, further demonstrating that SCFFNet counts accurately in complex crowded scenes.

4.5.4. Mall

The comparison of our method with others on the Mall dataset is shown in Table 5, including CNN-Boosting [59], MoCNN [60], Weighted VLAD [61], SAMC-Net [54], and ED-CNN [56]. Among them, ED-CNN is a new encoder-decoder CNN proposed by Ding et al.; the other experimental data in the table are taken from [56]. We also run CSRNet [5] on this dataset to verify the effectiveness of our method. As shown in Table 5, our method achieves the best MAE of 1.56, demonstrating that SCFFNet can estimate density maps in a relatively sparse single scene. Compared with CSRNet, MAE drops by 37.1%, verifying that multiscale context awareness and attention mechanisms improve the performance of the counting network. Qualitative results of SCFFNet on the Mall dataset are shown in the first three columns of Figure 12.

4.5.5. Beijing BRT

The Beijing BRT dataset has sparse crowd density and a relatively uniform scene; like the Mall dataset, it is less challenging, so most existing methods have not been evaluated on it. However, its images come from high-angle surveillance cameras at bus stops and contain shadow and sunlight interference, so experiments on this dataset can verify the generalization performance of our method in real life. The comparison of our method with MCNN [12], DR-ResNet [52], FCN [53], and ED-CNN [56] is shown in Table 6; the other experimental results in the table are taken from [52]. We also run CSRNet [5] on this dataset for comparison, with results in Table 6. SCFFNet achieves the lowest MAE of 1.14 and MSE of 1.69, improving performance by 18.6%/15.5% over ED-CNN, the best-performing recent method on this dataset. The last three columns of Figure 12 visualize some samples from this dataset, and the experimental results verify that SCFFNet can be applied well in daily life.

4.5.6. HBR_YD

Besides understanding congested crowd scenes, we also evaluate our method on our proposed vehicle dataset, HBR_YD, and compare it with some previous methods. Since the dataset is not yet publicly available, existing methods have not been evaluated on it; to better verify the effectiveness of our method, we therefore train two classical models, MCNN and CSRNet, on this dataset, with the comparison shown in Table 7. As the table shows, our method improves on CSRNet by 6.43/9.99 in MAE and MSE. To better evaluate our method, Figure 13 visualizes the estimation errors of MCNN, CSRNet, and the proposed method on the 400 test samples of HBR_YD; our method has small estimation errors on the test samples. Figure 14 visualizes some samples from the HBR_YD dataset, and the results verify the generalization ability and robustness of our proposed method.

4.6. Ablation Experiment

We conduct ablation experiments on the UCF_CC_50 and ShanghaiTech Part B datasets to show the effect of each module (MCFFM and CSAM). The performance of the model under four different settings is shown in Table 8. Compared with crowd counting algorithms based on multicolumn network structures, CSRNet [5] extracts sufficient spatial context information by enlarging the receptive field of the network; its structure is simple and easy to train, which better meets the requirements of practical applications, and our method builds on this network. Therefore, we take CSRNet as the baseline and conduct ablation experiments to verify whether our designed modules help improve the performance of crowd counting networks in crowded scenes.

CSRNet + MCFFM: only MCFFM is added between the front-end feature extraction network and the back-end network. CSRNet + CSAM: only CSAM is added between the front-end feature extraction network and the back-end network. SCFFNet: our proposed method, which adds CSAM after MCFFM. As shown in Table 8, the performance difference between adding MCFFM or CSAM alone is not significant, while SCFFNet outperforms either single addition on both MAE and MSE. This means that injecting spatial and channel dependency information into multiscale contextual features via CSAM lets the network adapt to rapid scale changes, helps it estimate crowd density maps more accurately, and reduces erroneous estimates in background regions, which is consistent with our original motivation.

5. Conclusions

In this paper, we proposed a spatial context feature fusion network (SCFFNet) for understanding highly congested scenes. It adaptively encodes multiscale context through MCFFM, expanding scale-aware diversity and the receptive range of features, and uses CSAM to calibrate and re-fuse feature maps, enhancing the model's ability to suppress background while retaining more detailed information. Our method achieves higher counting accuracy than many other methods on five public crowd datasets. In addition, experiments on the proposed vehicle dataset HBR_YD verify that SCFFNet has good robustness and generalization ability, and the visualizations show that our method produces better density estimates in high-density regions. In future work, we will apply the proposed model to highway traffic flow statistics and video people-flow statistics, especially in scenes with complex backgrounds and highly crowded targets. Furthermore, we will continue to explore better ways to combine multiscale perception and attention mechanisms to further enhance the generalization ability of counting networks in transfer learning.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 62067002, 61967006, and 62062033), in part by the Natural Science Foundation of Jiangxi Province, China (No. 20212BAB202008), and in part by the Science and Technology Project of Transportation Department of Jiangxi Province, China (Nos. 2021X0011 and 2022X0040).