Abstract

In the past ten years, crowd detection and counting have been applied in many fields such as station crowd statistics, urban safety prevention, and people flow statistics. However, obtaining accurate positions and improving the performance of crowd counting in dense scenes still face challenges, and it is worthwhile devoting much effort to this. In this paper, a new framework is proposed to resolve the problem. The proposed framework includes two parts. The first part is a fully convolutional neural network (CNN) consisting of backend and upsampling. In the first part, backend uses the residual network (ResNet) to encode the features of the input picture, and upsampling uses the deconvolution layer to decode the feature information. The first part processes the input image, and the processed image is input to the second part. The second part is a peak confidence map (PCM), which is proposed based on an improvement over the density map (DM). Compared with DM, PCM can not only solve the problem of crowd counting but also accurately predict the location of the person. The experimental results on several datasets (Beijing-BRT, Mall, Shanghai Tech, and UCF_CC_50 datasets) show that the proposed framework can achieve higher crowd counting performance in dense scenarios and can accurately predict the location of crowds.

1. Introduction

Crowd counting methods are applied to videos and pictures to predict the number of people. Such predictions are especially beneficial in emergencies, such as the coronavirus disease 2019 (COVID-19) outbreak. The same methods can also be used to perform similar tasks, such as vehicle counting and cell counting under a microscope. Like other computer vision tasks, crowd counting faces enormous challenges in terms of occlusion, background interference, and image distortion.

Many excellent models and algorithms have been proposed to solve these problems in crowd counting. They can be classified into two categories: traditional methods and methods based on convolutional neural networks (CNNs). The traditional methods rely on carefully designed feature extraction algorithms. However, they have difficulty handling dense scenes. Owing to the good performance of deep learning in various fields in recent years, crowd counting is increasingly being solved with CNNs. CNN-based methods are easy to use and perform better.

Crowd counting methods based on CNN fall into two categories: DM-based methods and detection-based methods. The DM-based method [1] first uses a normalized Gaussian kernel to represent each person, then predicts the DM through the CNN, and finally sums the DM to obtain the number of people. The detection-based method detects the number and location of people by training a crowd detector. Compared with the detection-based methods, the DM-based methods are more robust to highly occluded scenes [2]. However, the DM-based methods lead to the following problems [3]: (1) a higher proportion of false positives and (2) loss of crowd location information.

As the crowd density increases, it is particularly important to study methods for dense scenes. However, most of the current research methods only focus on the design of the network structure and ignore the fundamental problem brought by DM: “location information loss.” Location information and the number of people are complementary to each other. Therefore, a new crowd detection and counting framework is proposed to solve this problem.

Our main contributions are as follows.

We propose a new network structure called ResNet-DC. It uses ResNet [4], which performs well on classification problems, as the backbone and deconvolution layers for upsampling. It is compatible with other powerful network structures, so other backbones can be migrated to it, and the structure can be applied to both DM and PCM.

We propose a new PCM that links the crowd counting problem with the crowd detection problem. In dense scenes, PCM shows better performance than DM in the same network.

2. Related Work

Many powerful methods and algorithms have been proposed for crowd counting. This section briefly describes two groups of them: traditional methods and CNN-based methods.

2.1. Traditional Methods

Among traditional crowd detection and counting methods, Chan and Vasconcelos [5] and Ryan et al. [6] proposed regression-based methods that predict the number of people by first separating the background and then extracting features from the foreground. Lin and Davis [7] and Wang and Wang [8] proposed detection-based methods, which use two consecutive video frames. Idrees et al. [9] proposed an approach based on a carefully designed set of features: HOG-based head detection, Fourier analysis, and points of interest are integrated to avoid the disadvantages of a single feature. Most traditional research work focuses on such carefully designed features. However, these methods struggle to handle dense scenes or images severely disturbed by the background.

2.2. Methods Based on CNN

With the development and application of deep learning, more and more research work uses CNNs to solve crowd counting problems. Deep learning has been applied in many fields, such as traffic sign recognition [10], vehicle speed estimation [11], object tracking [12], and bus arrival prediction [13]. Compared with carefully designed feature extraction solutions, CNN-based methods are easy to use and have outstanding performance. CNN-based methods fall into two categories: DM-based methods and detection-based methods.

In DM-based methods, Zhang et al. [14] proposed a DM-based strategy for cross-scene scenarios, which randomly crops the image, divides the obtained features into two subtasks, and obtains the DM and the number of people through fully connected layers. Ding et al. [15] proposed a deeply recursive network (DR-ResNet). Unlike the previous ResNet, the ResNet block in DR-ResNet arranges convolution, batch normalization (BN) [16], and rectified linear unit (ReLU) [17] in a different order and then adds the result to the input to adapt to scene changes. When processing video data, CNN-based methods usually consider each video frame separately and ignore the temporal correlation of adjacent frames. Xiong et al. [18] proposed a new variant of CNN, called CNN LSTM, which captures spatial and temporal dependencies. To obtain a high-resolution DM, Liu et al. [19] proposed a method that optimizes the multicolumn convolutional neural network by learning global features and recovers the details lost in downsampling through deconvolution. To adapt to the characteristics of multiscale crowds, Zhang et al. [1] first proposed a method that solves the scale problem through different convolution kernel sizes. Sam et al. [20] proposed a switching convolutional neural network, which maps image patches to specific CNN columns. Sang et al. [21] optimized the geometry-adaptive Gaussian kernel function of SaCNN to generate a higher-quality ground-truth DM. Kong et al. [22] proposed an adaptive attention mechanism that automatically adjusts the network structure according to the crowd size.

In the detection-based methods, [2, 23, 24] all use Faster R-CNN [25] as the crowd detector. To overcome the limitations of pedestrian detectors, Saqib et al. [23] proposed a motion-guided filter (MGF), which uses temporal and spatial information among successive video frames to recover lost details. The performance of the detector in dense scenes is improved, but this scheme is only applicable to video stream data. Because occlusion is severe in dense scenes, Vora [2] and Kong et al. [22] detected crowd heads, which increased the detection accuracy. Vora [2] applied Faster R-CNN directly to a binary classification task, determining whether a detection box is a human head, and reduced the number of anchor boxes according to the human head scale, which speeds up the detection process. Basalamah et al. [24] proposed a Faster R-CNN-based scale-driven convolutional neural network (SD-CNN) model that detects crowd heads and uses a scale map to handle the different head sizes in video streams.

3. A New Framework for Crowd Detection and Counting Combining ResNet-DC and PCM

The framework includes two parts. (1) The first part is a fully convolutional network, namely, ResNet-DC, which consists of backend and upsampling. (2) The second part is PCM, which contains information about the location. In this section, the proposed framework is introduced first. Then, the two critical parts of the framework are described in detail. Finally, some training details are given.

3.1. Framework Structure

As shown in Figure 1, there are three steps in the structure of the proposed framework for crowd detection and counting. The first step aims to extract input image features based on a CNN consisting of backend and upsampling. Backend shown in Figure 2 uses the ResNet to extract the features, and upsampling shown in Figure 3 uses the deconvolution layers to restore the feature map scale. The second step aims to predict high-quality PCM. The last step is to analyze the estimated position set P to get the number of people and location. To obtain the location information of the crowd, it is only necessary to perform nonmaximum suppression on PCM to get the location set. Therefore, we only need to count the location of the crowd to get the number of people.

3.2. ResNet-DC

The first part of the proposed framework is named ResNet-DC. In ResNet-DC, backend extracts the features of the input image and reduces the input size by a factor of eight, and upsampling restores the size of the feature map to obtain a high-quality PCM.

3.2.1. Backend

In this work, ResNet-18 [4], which has outstanding performance in classification problems, is used as the backbone network. The deeper the backbone network, the greater the memory consumption and the longer the training and inference time. Because crowd detection needs to run in real time, only the first to third layers of ResNet are used. The larger the stride, the stronger the downsampling of the feature map. Following the crowd counting framework in [26], the stride of residual block 1 in the third layer of ResNet is changed from two to one to avoid severe loss of location information due to downsampling. Figure 2 shows the modified structure of the first residual block in layer three of ResNet. The detailed configuration is shown in Table 1. The subsequent residual blocks retain the original design of ResNet. Under this setting, backend extracts the feature information of the original image and downsamples it to a feature map whose height and width are one-eighth of the original.
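The factor of eight can be checked with simple stride arithmetic. The sketch below is only an illustration: it assumes the standard ResNet-18 stem (a stride-2 convolution followed by a stride-2 max pooling) and the standard strides of layers one and two, with layer three set to stride one as described above. The names `BACKEND_STRIDES`, `downsampling_factor`, and `feature_map_size` are hypothetical and do not come from the released code.

```python
# Strides of the stages kept in the backend (assumed standard ResNet-18
# layout); only the last entry reflects the modification described in the text.
BACKEND_STRIDES = [
    ("conv1", 2),
    ("maxpool", 2),
    ("layer1", 1),
    ("layer2", 2),
    ("layer3 (modified)", 1),  # the original ResNet uses stride 2 here
]


def downsampling_factor(stages):
    """Multiply the strides of all stages to get the overall reduction."""
    factor = 1
    for _name, stride in stages:
        factor *= stride
    return factor


def feature_map_size(height, width, stages=BACKEND_STRIDES):
    """Spatial size after the backend, assuming sizes divisible by the factor."""
    factor = downsampling_factor(stages)
    return height // factor, width // factor


print(downsampling_factor(BACKEND_STRIDES))
print(feature_map_size(768, 1024))
```

With the original stride of two in layer three, the same arithmetic would give a factor of sixteen, which is the additional loss of resolution the modification avoids.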

3.2.2. Upsampling

In crowded scenes, excessive downsampling causes loss of feature information (especially location information). Using deconvolution layers is a feasible way to recover the feature information and obtain a high-quality PCM. Deconvolution can be regarded as the inverse process of convolution and pooling. Long et al. [27] show that the deconvolution layer can recover more feature information than convolution combined with bilinear interpolation. The structure of upsampling is shown in Figure 3. It consists of two convolutional layers and three deconvolution layers. The first convolutional layer compresses the channels of the feature map. The three deconvolution layers in the middle upsample the feature map to the original image size. The last convolutional layer maps the feature map to PCM. Table 2 shows the configuration of upsampling.

Under the above structure, ResNet-DC can restore the feature map reduced by backbone to the same size as the input. In this way, the predicted feature map will not ignore some peaks due to overlapping peaks.
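As a rough check that three deconvolution layers can undo an eightfold reduction, the sketch below applies the usual output-size rule for a transposed convolution (with dilation one), out = (in - 1) * stride - 2 * padding + kernel + output_padding. The kernel size 4, stride 2, and padding 1 are assumptions chosen so that each layer doubles the resolution; the actual values are those listed in Table 2, which is not reproduced here.

```python
def deconv_output_size(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical configuration: three identical layers, each doubling the size.
DECONV_LAYERS = [{"kernel": 4, "stride": 2, "padding": 1} for _ in range(3)]


def upsampled_size(size, layers=DECONV_LAYERS):
    """Apply the layers in sequence and return the final length."""
    for layer in layers:
        size = deconv_output_size(size, **layer)
    return size


# A 96 x 128 feature map (one-eighth of 768 x 1024) is restored to full size.
print(upsampled_size(96), upsampled_size(128))
```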

3.3. Peak Confidence Map

In this section, PCM, an improvement over DM, is designed and compared with DM. Then, a nonmaximum suppression algorithm is introduced to obtain crowd information from PCM.

3.3.1. Density Map

The density map design is based on [1, 28]. For a head position $(x_i, y_i)$ in an image, a normalized Gaussian kernel function $G_i$ is generated in its ksize × ksize neighborhood. $G_i$ can be expressed as follows:

$$G_i(x, y) = \frac{1}{Z_i} \exp\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma_i^2}\right),$$

where $Z_i$ is the normalization factor so that $\sum_{x, y} G_i(x, y) = 1$ and $\sigma_i^2$ is the variance of the Gaussian kernel of the ith head. In traditional DM, it is designed as a constant. To convert the marked points into a density function, the normalized Gaussian kernel functions at different positions need to be summed. The density function F(x, y) can be expressed as follows:

$$F_i(x, y) = F_{i-1}(x, y) + G_i(x, y), \quad i = 1, \dots, N, \quad F_0(x, y) = 0,$$

where $F_i$ represents a density function that already contains i head positions and N represents the number of people in the image.

However, each head position is sampled in a 3D scene, and perspective distortion causes different head sizes in the image. Zhang et al. [1] found that the denser the crowd, the smaller the head size. To solve the problem of perspective distortion, [1] proposed a DM using an adaptive geometric Gaussian kernel based on this finding; that is, $\sigma_i = \beta \bar{d}_i$. Here $\bar{d}_i$ represents the average of the distances between the ith head position and its k nearest heads, and $\beta$ is obtained through experiments. Since each kernel $G_i$ is normalized, each position corresponds to a Gaussian kernel function or adaptive geometric Gaussian kernel function with a sum of 1. By summing the pixels of the density function, the number of people can be obtained. However, due to the addition operation, false peaks may occur, which leads to the loss of position information. For example, consider the situation shown on the left of Figure 4 (represented in one dimension), where the red and blue curves represent the Gaussian kernel functions generated from the positions of two different people. Here x1 and x3 represent the two head positions, and the black curve is obtained after the addition. Since a false peak x2 is generated, it is impossible to determine which peak is a head position.
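The adaptive kernel width can be sketched in a few lines. This is only an illustration of the rule "sigma equals beta times the mean distance to the k nearest heads"; the function name `adaptive_sigmas` and the values k = 2 and beta = 0.3 used in the example are placeholders, not the settings of this paper.

```python
import math


def adaptive_sigmas(heads, k, beta):
    """Return one sigma per head: beta times the mean distance to its k nearest heads."""
    sigmas = []
    for i, (xi, yi) in enumerate(heads):
        distances = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(heads)
            if j != i
        )
        nearest = distances[:k]
        if not nearest:
            raise ValueError("at least two heads are needed")
        sigmas.append(beta * sum(nearest) / len(nearest))
    return sigmas


# Three annotated heads on a line; k and beta are hypothetical values.
heads = [(0, 0), (3, 4), (6, 8)]
sigmas = adaptive_sigmas(heads, k=2, beta=0.3)
print(sigmas)
```

Heads that sit closer to their neighbors receive a smaller sigma, which matches the observation that denser crowds have smaller heads.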

3.3.2. Peak Confidence Map

The peaks of the individual Gaussian kernels correspond to the marked head positions. In this paper, a design scheme for PCM that overcomes the loss of location information is proposed. Unlike the previous DM, the peak confidence function performs a maximum operation. PCM is defined in this paper as follows:

$$G_i(x, y) = \exp\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma_i^2}\right), \qquad C_i(x, y) = \max\left(C_{i-1}(x, y),\, G_i(x, y)\right), \quad i = 1, \dots, N, \quad C_0(x, y) = 0,$$

where $G_i$ represents the Gaussian kernel corresponding to the ith head position $(x_i, y_i)$, $C_i$ represents a confidence function that already includes i head positions, N represents the number of persons in the image, and $\sigma_i^2$ is the variance of the Gaussian kernel corresponding to the ith head. Compared with DM, PCM no longer normalizes the Gaussian kernel because it uses the number of peaks to count the number of people and does not need to be summed like DM. It is named PCM because (1) the peaks represent the number and location of the crowd and (2) the closer a pixel is to a head position, the higher its value. To some extent, the value reflects the confidence that a head exists at that position. Figure 4 shows the difference between PCM and DM. As shown on the right of Figure 4 (represented in one dimension), the red and blue curves represent the Gaussian kernel functions corresponding to different head positions. The black curve shows the result obtained by taking the maximum of the Gaussian kernel functions. As can be seen from the black curve, the two peaks exactly represent the head positions of the two people. During the experiment, the peak confidence function was regressed so that the network produces distinct peaks at different people's head positions. By extracting the extreme points of PCM to obtain the peak positions, the number of peaks directly gives the number of people.

According to the design method of PCM and DM, PCM and DM on Beijing-BRT [15], Mall [29], Shanghai Tech [1], and UCF_CC_50 [9] can be calculated. Figure 5 shows that there is not much difference between PCM and DM when the crowd is scattered. When the crowd is dense, the maximum value of PCM is at the head position of each person, and the location information and the crowd distribution can be calculated more precisely. But in DM, the denser the crowd, the greater the value, so the position information is lost.

In general, PCM and DM have the following differences. (1) DM takes the sum between Gaussian kernels, while PCM takes the maximum value between Gaussian kernels. (2) DM needs to normalize the Gaussian kernel, but PCM does not. (3) DM calculates the number of people by calculating the sum, and PCM calculates the position and the number of people by calculating the peak value.
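The contrast between summing and taking the maximum can be reproduced in one dimension, in the spirit of Figure 4. The sketch below is a toy example under assumed values: two heads at positions 8 and 12 on a 21-pixel line with a constant sigma of 3, none of which come from the paper.

```python
import math


def gaussian(x, center, sigma):
    """Unnormalized Gaussian kernel value at x."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


def local_peaks(values):
    """Indices whose value is strictly greater than both neighbors."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


xs = list(range(21))
centers = [8, 12]  # hypothetical head positions (x1 and x3 in Figure 4)
sigma = 3.0        # hypothetical constant sigma

kernels = [[gaussian(x, c, sigma) for x in xs] for c in centers]

# DM: normalize each kernel so that it sums to 1, then add the kernels.
normalized = [[v / sum(kernel) for v in kernel] for kernel in kernels]
dm = [sum(values) for values in zip(*normalized)]

# PCM: keep the kernels unnormalized and take the pointwise maximum.
pcm = [max(values) for values in zip(*kernels)]

dm_count = sum(dm)             # the count comes from the integral of DM
dm_peaks = local_peaks(dm)     # a single false peak between the two heads
pcm_peaks = local_peaks(pcm)   # one peak per head, at the head positions

print(round(dm_count, 6), dm_peaks, pcm_peaks)
```

In this toy case the DM still integrates to two people, but its only peak lies between the heads, while the PCM keeps one peak at each head position.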

3.3.3. Nonmaximum Suppression

Nonmaximum suppression performs a local maximum search, that is, it finds extreme points. In DM, due to the interference of false peaks, many incorrect positions would be detected by nonmaximum suppression. DM therefore relies on the normalized Gaussian kernel to calculate the number of people, which leads to the loss of location information. In PCM, since each person's head corresponds to a peak, nonmaximum suppression becomes possible. The extreme point set P is calculated as follows:

$$P = \left\{ (i, j) \;\middle|\; (i, j) = \operatorname*{argmax}_{(m, n) \in \mathcal{N}_4(i, j) \cup \{(i, j)\}} x_{m, n},\ x_{i, j} > T,\ 1 \le i \le W,\ 1 \le j \le H \right\},$$

where $x_{i, j}$ denotes the (i, j)th pixel in PCM with the size of (W, H), $\mathcal{N}_4(i, j)$ represents the four neighbors of the pixel, $T$ is the confidence threshold, and argmax denotes the subscript at which the maximum value is attained. For each pixel of PCM, (7) compares it with its four neighbors. If the pixel is the maximum among them, it is a local maximum, that is, an extreme point. In other words, the head position set P contains the pixels that are local maxima and whose values are greater than the confidence threshold.
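A direct, unoptimized reading of this rule is sketched below in plain Python. The function name `find_peaks`, the strict comparison used for ties, and the treatment of image borders (missing neighbors are ignored) are assumptions made for the illustration; the toy map and the threshold 0.5 are example values.

```python
def find_peaks(pcm, threshold):
    """Return (row, col) positions that exceed the threshold and are strictly
    greater than their existing 4-neighbors."""
    height = len(pcm)
    width = len(pcm[0]) if height else 0
    peaks = []
    for r in range(height):
        for c in range(width):
            value = pcm[r][c]
            if value <= threshold:
                continue  # low-confidence activations are discarded
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if all(
                value > pcm[nr][nc]
                for nr, nc in neighbors
                if 0 <= nr < height and 0 <= nc < width
            ):
                peaks.append((r, c))
    return peaks


# Toy confidence map with two clear heads and some weak background response.
toy_pcm = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.4, 0.7],
    [0.0, 0.1, 0.3, 0.2],
]
positions = find_peaks(toy_pcm, threshold=0.5)
count = len(positions)  # the number of people is the number of peaks
print(positions, count)
```

Because of the strict comparison, a flat plateau of equal values would yield no peak in this sketch; a practical implementation would need a tie-breaking rule.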

3.4. Training Details

This section gives detailed training information on ResNet-DC. By using pretrained ResNet, ResNet-DC can quickly converge.

3.4.1. Label Normalization

The work in [26] points out that regression values that are too small in DM degrade the performance of the network. Considering that the same effect applies to PCM, we multiply PCM by an amplification factor, which is set to 10 in this paper. The factor must be chosen with care: if the values of PCM are too small, the difference between adjacent values is small, and the network tends to predict wrong peaks; if the values of PCM are too large, the loss becomes excessively large, and the network has difficulty converging.

3.4.2. Data Augmentation

The work in [1] obtains nine times as many images by cropping at different positions. Since cropping may cause the loss of global information, in our experiments, we only flip the original image horizontally, which doubles the number of images.
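When an image is flipped, the head annotations have to be mirrored as well so that the ground-truth PCM still matches the image. The sketch below shows this for an image stored as a list of pixel rows and heads stored as (x, y) pixel coordinates; both representations and the name `flip_horizontal` are assumptions for illustration.

```python
def flip_horizontal(image, heads):
    """Mirror an image (list of rows) and its head annotations left to right."""
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    # A head in column x moves to column (width - 1 - x); rows are unchanged.
    flipped_heads = [(width - 1 - x, y) for x, y in heads]
    return flipped_image, flipped_heads


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
heads = [(0, 0), (3, 1)]  # (x, y) positions of two annotated heads
flipped_image, flipped_heads = flip_horizontal(image, heads)
print(flipped_image, flipped_heads)
```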

3.4.3. Loss Function

Most research work [1, 15, 20] uses the mean square loss to evaluate the error. In this paper, the mean square loss is also used. The MSE loss function is defined as follows:

$$L_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{Y}(I_i; \theta) - Y_i \right\|_2^2,$$

where θ represents the parameters that ResNet-DC needs to learn, N represents the number of pictures, $\hat{Y}(I_i; \theta)$ represents the PCM predicted for the ith input image $I_i$, and $Y_i$ represents the ground-truth PCM of the ith input image. However, when only the mean square loss is used, the network is biased towards predicting extra peaks. Although the mean square loss penalizes the error between the ground-truth PCM and the estimated PCM, it ignores the relationship between adjacent pixels. Compared with DM, PCM depends more strictly on the relationship between neighboring pixels, and ignoring this relationship is what causes the extra peaks.

Considering the importance of the relationship between adjacent pixels in PCM, a feasible solution is to calculate the difference between adjacent pixels. The relationship between adjacent pixels expresses important information. For example, pixel values that are close to each other represent the same element, and pixel values that differ considerably represent the boundaries between different elements. To express this information, we convolve PCM with a 3 × 3 kernel K = [[−1, −1, −1], [−1, 9, −1], [−1, −1, −1]]. The specific form of the kernel is not important: K = [[0, −1, 0], [−1, 5, −1], [0, −1, 0]] achieves the same effect, the only difference being that the former kernel also takes the four corner values into account. The loss is defined as follows:

$$L_{K}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| K * \hat{Y}(I_i; \theta) - K * Y_i \right\|_2^2,$$

where $*$ denotes convolution. That is, we convolve the kernel with PCM to obtain the difference between each center point and its eight neighbors and then calculate the mean square error of these values. The total loss combines the mean square loss $L_{\mathrm{MSE}}$ with the neighborhood loss $L_{K}$.
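The following sketch shows one possible reading of this loss on plain nested lists. Several details are assumptions rather than facts from the paper: the filtering is done without padding, the neighborhood term is the mean square error between the filtered prediction and the filtered ground truth, and the two terms are added with a hypothetical `weight` of 1. The helper names are likewise illustrative.

```python
KERNEL = [[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]]


def convolve3x3(image, kernel=KERNEL):
    """3x3 filtering without padding. The kernel is symmetric, so
    correlation and convolution give the same result."""
    height, width = len(image), len(image[0])
    output = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            row.append(acc)
        output.append(row)
    return output


def mean_squared_error(a, b):
    """Mean of squared element-wise differences between two 2D maps."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def total_loss(predicted, target, weight=1.0):
    """Pixel-wise MSE plus a neighborhood term (weight is an assumption)."""
    pixel_term = mean_squared_error(predicted, target)
    neighbor_term = mean_squared_error(convolve3x3(predicted), convolve3x3(target))
    return pixel_term + weight * neighbor_term


target = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 1.0, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
shifted = [[v + 0.1 for v in row] for row in target]
print(total_loss(target, target), total_loss(shifted, target))
```

Since the kernel entries sum to one, a constant offset passes through the filter unchanged, whereas a spurious extra peak is amplified by the center weight of nine, which is how the neighborhood term penalizes extra peaks more strongly than the plain mean square loss.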

3.4.4. Learning Setting

Following the transfer learning approach in [30] to accelerate model convergence, ResNet-DC is trained in a straightforward way as an end-to-end structure. Backend is fine-tuned from a well-trained ResNet-18 [4]. For upsampling, the initial values come from a Gaussian initialization with a standard deviation of 0.01. The Adam optimization algorithm is used with a learning rate of 5e-5 and a weight decay rate of 1e-4. The input image is normalized (with the mean and variance of the ImageNet dataset), and the network is then trained on the dataset to predict PCM. After each iteration over the training set, the model is evaluated on the validation set, and the model that performs best on the validation set is retained.

4. Performance Evaluation

In this section, several datasets are used to evaluate performance. The crowd count evaluation metric and location evaluation metric are proposed. Based on the datasets and the metrics, the performance of different methods is compared and analyzed.

4.1. Dataset

Currently, the mainstream crowd counting datasets include Beijing-BRT [15], Mall [29], Shanghai Tech [1], and UCF_CC_50 [9]. We performed experiments on these four datasets, each of which is described in Table 3. In Beijing-BRT, we divided the training set and test set according to the criteria of [15]. In Mall, we divided the training set and test set according to the criteria of [18]. In Shanghai Tech, we divided the training set and test set according to the criteria of [14]. In UCF_CC_50, we used 5-fold cross-validation according to the standard of [9]. Because the images in the Shanghai Tech Part A and UCF_CC_50 datasets have different resolutions, we computed the average resolution of each dataset and resized the images to the size that is closest to the average resolution and divisible by eight.
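The resizing rule can be sketched as follows. The image sizes in the example are made up, and the helper names `nearest_multiple` and `target_size` are hypothetical; ties between two equally close multiples are rounded up in this sketch, which is an arbitrary choice.

```python
def nearest_multiple(value, base=8):
    """Round a positive length to the closest multiple of base (at least base)."""
    return max(base, int(value + base / 2) // base * base)


def target_size(sizes, base=8):
    """Average the (width, height) pairs and snap both to multiples of base."""
    average_width = sum(w for w, _ in sizes) / len(sizes)
    average_height = sum(h for _, h in sizes) / len(sizes)
    return nearest_multiple(average_width, base), nearest_multiple(average_height, base)


# Hypothetical image resolutions (width, height); not taken from any dataset.
sizes = [(1024, 768), (640, 480), (900, 600)]
print(target_size(sizes))
```

Divisibility by eight matters because the backend reduces each side by a factor of eight, so the upsampled PCM only matches the input size exactly when both sides are multiples of eight.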

4.2. Evaluation Metric

Following the existing methods [1, 14], the mean absolute error (MAE) and mean squared error (MSE) are used to evaluate the performance of crowd counting, which are defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( z_i - \hat{z}_i \right)^2},$$

where N is the number of pictures, $z_i$ is the number of people in the ith picture, and $\hat{z}_i$ is the number of people predicted in the ith picture. To some extent, MAE can be regarded as the accuracy of the prediction, and MSE can be regarded as the generalization ability of the model. These two indicators are equally important. For both MAE and MSE, the lower the value, the better the performance.
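Both metrics are straightforward to compute from per-image counts. The sketch below follows the convention of [1], in which the quantity called MSE includes a square root; the counts in the example are invented for illustration.

```python
import math


def mean_absolute_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


def root_mean_squared_error(true_counts, predicted_counts):
    """Square root of the average squared count error (reported as MSE)."""
    errors = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical ground-truth and predicted counts for three images.
true_counts = [10, 20, 30]
predicted_counts = [12, 18, 33]
mae = mean_absolute_error(true_counts, predicted_counts)
mse = root_mean_squared_error(true_counts, predicted_counts)
print(round(mae, 4), round(mse, 4))
```

Because the errors are squared before averaging, a single image with a large miscount raises MSE much more than MAE.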

To quantitatively analyze the position performance, we use a method similar to object detection evaluation. (1) If a ground-truth position lies within the S × S neighborhood of a predicted position, the prediction is classified as a true positive. (2) If no ground-truth position lies within the S × S neighborhood of a predicted position, the prediction is classified as a false positive. Then, the standard Average Precision (AP) and Average Recall (AR) scores are calculated. Here S represents the allowed position error. Because manual annotations differ, not all positions are marked exactly at the center of the human head, and some error is unavoidable. Therefore, S is set to eight, and a prediction within this tolerance is reasonably treated as a true positive.
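A possible implementation of this matching rule is sketched below. It makes assumptions that the text does not state: each ground-truth head can be matched by at most one prediction (greedy, in prediction order), the S × S neighborhood is a square centered on the prediction, and precision and recall are computed per image. The function names and points are illustrative only.

```python
def match_points(predicted, ground_truth, s=8):
    """Greedy one-to-one matching inside an S x S window around each prediction.
    Returns (true positives, false positives, false negatives)."""
    half = s / 2
    unmatched = list(ground_truth)
    true_positives = 0
    for px, py in predicted:
        hit = next(
            (g for g in unmatched
             if abs(g[0] - px) <= half and abs(g[1] - py) <= half),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each head can be claimed only once
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0 when the denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


ground_truth = [(10, 10), (30, 30), (50, 50)]
predicted = [(12, 9), (31, 33), (80, 80), (11, 11)]
tp, fp, fn = match_points(predicted, ground_truth, s=8)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, precision, round(recall, 4))
```

In the example the last prediction falls next to an already matched head and is counted as a false positive, which is a consequence of the one-to-one assumption.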

4.3. Experimental Results and Analysis

In the experiment, we use the framework proposed in this paper to solve the crowd counting problem and crowd location prediction simultaneously. The experimental results on the above four datasets show that the proposed framework is not only suitable for dense scenes but also can predict the position of the crowd.

4.3.1. Counting Performance

In the experiment, we compared the crowd counting performance of DM and PCM and also compared both with other powerful algorithms. Tables 4 to 7 show the crowd counting results on the four datasets. With DM, ResNet-DC makes slight progress over other excellent algorithms on the Shanghai Tech Part A dataset (0.2 MAE) and achieves good performance on the other datasets. The results are acceptable because we used the simplest ResNet-18 as the backend network. Deeper networks such as ResNet-32, ResNet-50, and ResNet-101 can also be used. With PCM, ResNet-DC achieves excellent performance on Shanghai Tech Part A (2.33 MAE, 6.8 MSE) and good performance on the other datasets.

4.3.2. Localization Performance

Because few localization experiments have been reported on crowd counting datasets, we only compare the AP and AR of different methods on the UCF_CC_50 dataset, as shown in Table 8. Compared with the current best algorithm SD-CNN [24], our approach is slightly worse on AP. But we have reached the best level in AR, and the improvement of 1.48 AP is better than the current best algorithm. We believe that this is because we only accept positions with higher confidence (the confidence is 0.5) as locations. Higher confidence leads to higher AP, but also lower AR. And because AP is soft, this leads to the degradation of MAE performance. Table 9 shows the position performance of our algorithm on the other three datasets. Although AP and AR gradually decrease as the crowd density increases, even on the worst-performing Shanghai Tech Part A, both AP and AR reach 59%. The results on the four datasets show that our algorithm can detect reliable locations even in dense scenes.

4.3.3. Result Analysis

Why is the performance of DM slightly better than PCM in sparse scenarios? Under the same network structure, the results show that in sparse crowd scenes (Beijing-BRT, Mall, and Shanghai Tech Part B), DM is slightly better than PCM for crowd counting. We attribute this to the robustness of the loss in (5) for DM. DM ignores the relationship between adjacent pixels, so even if a small number of values are predicted incorrectly, the impact on the crowd count is relatively small. PCM depends more on the comparison between adjacent pixels. Although the loss in (6) can mitigate such errors, it has not reached the optimal performance. Besides, we visualized the prediction results of PCM and DM under the same network structure. As shown in Figure 6, due to picture distortion, ResNet-DC loses information about people in the distance, and affected by the shooting environment, it also loses information about nearby people. For the missing information, PCM shows a low confidence (lower than the confidence value 0.5), so PCM directly discards these values. In contrast, DM adds these values to the number of people. As a result, the number predicted by DM is closer to the true value than that predicted by PCM.

Why is the performance of PCM better than DM in dense scenarios? Under the same network structure, the results show that in crowded scenes (Shanghai Tech Part A and UCF_CC_50), the crowd counting performance of PCM is significantly improved compared to the DM method. We also visualized the prediction results of PCM and DM under the same network structure. As shown in Figure 7, due to the defects of the convolutional network, the predicted picture in the dense scene is disturbed by the background (red rectangle). In PCM, since we require a peak value to be greater than the threshold value, nonmaximum suppression can filter out small activation values. In DM, these interference values are usually added to the number of people, so the DM method predicts a larger number of people. Besides, the results show that the network is generally affected by occlusion in dense scenes, resulting in incorrect predictions (black rectangles) in dense areas. Because PCM incorporates position information, it is only sensitive to peaks. DM directly adds these false values to the number of people, which makes its predictions unstable.

Why is PCM better than DM? First of all, due to design differences, PCM naturally contains location information, but DM does not. Secondly, since the peak value indicates the location of the crowd, PCM can ignore small activation values, thereby significantly reducing background interference. Conversely, DM adds these interference values to the crowd count. Finally, because PCM focuses on local maximums, it can ignore the second largest activation values generated in crowded places. Conversely, DM will also add the false activation values to the crowd count. We also visualized some of the results on the test set in Figure 8. Figure 8 shows that the predicted PCM generally has a high confidence level for the predicted crowd location on a dataset with low crowd density (Beijing-BRT, Mall, and Shanghai Tech Part B). On a dense crowd dataset (Shanghai Tech Part A and UCF_CC_50), the confidence level of the predicted PCM for the predicted crowd location is generally lower. As the crowd density increases, the peak confidence decreases. This phenomenon is consistent with people's intuitive feelings. At the same time, Figure 8 also shows that PCM can accurately predict the location of the crowd, which DM cannot do.

In general, PCM shows better performance than DM when faced with occlusion, background interference, and image distortion. Specifically, for occlusion and image distortion, PCM only considers the peak value. That is to say, even if heads overlap or differ in size, PCM only needs to consider whether there is a peak in the prediction result and does not need to consider the global information of the heads as DM does. As for background interference, PCM can also filter out the interference information.

5. Conclusion

In this paper, a new framework is proposed to solve the problems of crowd detection and counting at the same time. The framework combines ResNet-DC with PCM to predict the number of people and the position of each person. ResNet-DC is a fully convolutional network consisting of backend and upsampling. Backend is used as a feature extractor, and upsampling maps the extracted features into a high-quality PCM. The entire network is an end-to-end structure, and it is easy to migrate other excellent models to ResNet-DC. PCM retains the crowd distribution and location information. It yields position information through nonmaximum suppression and is also an effective way to suppress background interference. Experimental results on four public datasets show that the proposed framework has good crowd counting performance and can also obtain accurate location information.

Data Availability

The related codes and data in the literature are released at https://github.com/Yuesheng321/RestNet-DC.git.

Conflicts of Interest

The authors declare that there are no conflicts of interest in the submission of this manuscript.

Authors’ Contributions

The manuscript was approved by all authors for publication.

Acknowledgments

This work was supported by the education and research projects of Hunan Provincial Education Department (JG2018A012, XiangJiaoTong [2019] no. 291-410, no. 248-27, no. 370, [2020], no. 9, no. 90, and no. 233 HNKCSZ-2020-0122), the projects of the Ministry of Education of the People's Republic of China (201901051021), and the Science and Technology Progress and Innovation Project of Hunan Provincial Department of Transportation (no. 201927).