Abstract

The automatic detection and tracking of pedestrians under high-density conditions is a challenging task for both the computer vision field and pedestrian flow studies. Collecting pedestrian data is a fundamental task for the modeling and practical implementation of crowd management. Although there are many methods for detecting pedestrians, they may not be easily adopted in high-density situations. Therefore, we utilize an emerging method based on deep learning. Based on the top-view video data of several pedestrian flow experiments recorded by an unmanned aerial vehicle (UAV), we produce our own training datasets. We train the detection model with Yolo v3, a very popular deep learning model among the many detection models available in recent years. We find that the detection results are good; e.g., the precisions, recalls, and F1 scores can be larger than 0.95 even when the pedestrian density is as high as 9 ped/m². We think this approach can be used for other pedestrian flow experiments or field data with similar configurations and can also be useful for automatic crowd density estimation.

1. Introduction

The study of pedestrian flow dynamics dates back to the 1960s [1–3]. Ever since, pedestrian data collection has been one of the fundamental tasks in this field. Generally speaking, there are two major ways to obtain the data: the first approach is to gather empirical data, which come from real life; the second is to conduct experiments and ask recruited participants to move according to instructions [4]. Since the parameters and conditions can be controlled in experiments, they are a great help for theoretical studies [5, 6]. In recent years the second approach has become more and more popular in this field [7, 8].

In most pedestrian flow experiments, for the convenience of measuring positions and velocities, the participants are usually required to wear markers such as caps [6, 9–14]. At an earlier stage, pedestrian motion data were extracted manually or with semi-automatic tracking techniques [10, 15–17]. Some researchers used simple software, e.g., Tracker (http://physlets.org/tracker/), to extract the data of these caps manually [18–21], but this takes a lot of time and the efficiency is quite low. Meanwhile, some researchers adopted unsupervised methods from the computer vision field to detect and track pedestrians automatically [10, 15, 17, 22]. Boltes et al. developed the automatic detection and tracking software PeTrack [9, 23]. This software has been used in many recent studies of pedestrian flow experiments [24–28].

Nevertheless, the limitations of PeTrack are also clear. The preparation work for PeTrack is heavy in terms of camera views and parameters. Although its detection results are good when the pedestrian density is not high, in some high-density experiments [20, 21] its performance is not good, especially when the density reaches 8∼9 ped/m². Besides, the operation of PeTrack is difficult, since many important parameters are not easy to determine, and sometimes the adjustment of these parameters is very complex. Therefore, we think a better method is needed.

In recent years, the development of deep learning techniques has given us an easier way to solve this problem. Many new object detection algorithms have been proposed, e.g., RCNN [29], Fast RCNN [30], Faster RCNN [31], R-FCN [32], SSD [33], the Yolo series [34–36], etc. Among them, Yolo (“You Only Look Once”) is a balanced object detection model in terms of speed and accuracy. It is a one-stage detection model, which skips the region proposal stage and runs detection directly over a dense sampling of possible locations. Many previous studies found that Yolo v3 [36] is faster than the other models. Its accuracy is much higher than that of the two previous versions, Yolo v1 [34] and Yolo v2 (also named Yolo 9000) [35], and its performance in detecting small objects is significantly improved. Therefore, in this paper we choose Yolo v3 to do the job.

The pretrained version of Yolo v3 on open datasets (such as ImageNet and COCO) has good performance when detecting multiple types of objects, including pedestrians [37, 38]. However, since the pretrained labels are primarily annotated on real-life images from side-view cameras, the performance is not good when the model is directly used for many pedestrian flow experiments, especially when the cameras are perpendicular to the ground. The features of pedestrians in top-view cameras are quite different from those shot from the side view. Therefore, we have to train a new model with the samples of various caps found in the video data. We find the training results are satisfactory: most pedestrians can be immediately detected by Yolo v3, and both the precision and the recall are high enough for extracting the required pedestrian flow parameters.

The focus of this paper is not to develop state-of-the-art pedestrian detection models or algorithms, but rather to explore the capability of applying new detection approaches to the pedestrian traffic domain. We will demonstrate that, through a simple training process, the deep learning model Yolo v3 can achieve good detection results for pedestrian flow analysis. We also open-source a series of training datasets of pedestrians wearing caps, viewed from the top in UAV videos. The reproducibility of our methods can be verified by using another dataset for validation.

The rest of this paper is organized as follows. In Section 2, a brief introduction to Yolo v3 is given. Section 3 introduces the characteristics of our pedestrian flow experiments under high-density conditions and also shows the training process. Section 4 discusses the testing results of the detection model. Section 5 presents some notes about the application of this model, and the conclusion is given in Section 6.

2. A Brief Introduction to Yolo v3

Since the main topic of this paper is the detection of pedestrians by training, here we only give a brief introduction to the improvements in Yolo v3, which are illustrated in Figure 1.

Firstly, Yolo v3 introduces a new feature extraction network named Darknet-53, replacing the Darknet-19 used in Yolo v2. The accuracy of Darknet-53 is close to that of ResNet-101 and ResNet-152, but it is much faster.
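For readers unfamiliar with the architecture, the following is a minimal sketch (not the authors' code) of the residual unit that Darknet-53 stacks repeatedly: a 1 × 1 convolution that halves the channels, a 3 × 3 convolution that restores them, and a shortcut connection. The Keras layer choices are illustrative assumptions.

# A sketch of one Darknet-53 residual unit, assuming a Keras/TensorFlow setup.
from tensorflow.keras import layers

def darknet_residual_block(x, filters):
    """One residual unit of the kind stacked repeatedly in Darknet-53."""
    shortcut = x
    y = layers.Conv2D(filters // 2, 1, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(alpha=0.1)(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(alpha=0.1)(y)
    return layers.Add()([shortcut, y])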

And then, Yolo v3 introduces predictions across scales by using the concept of feature pyramid networks. It predicts boxes at 3 different scales and extracts features from those scales. In scale 1, the objects are sampled from the penultimate convolutional layer. In scale 2, a 16 × 16 feature map is added, which improves the accuracy of detecting medium objects. In scale 3, a 32 × 32 feature map is used, which makes the detection accuracy for small objects similar to that for medium objects.
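As a small illustration of these scales, the sketch below computes the grid sizes produced by the strides 32, 16, and 8 used in Yolo v3; the 512 × 512 input size is our assumption, chosen to be consistent with the 16 × 16 and 32 × 32 feature maps mentioned above.

# Grid sizes and candidate boxes per scale for an assumed 512 x 512 input.
input_size = 512
strides = [32, 16, 8]          # scale 1, scale 2, scale 3
anchors_per_cell = 3           # each grid cell predicts 3 anchor boxes

for i, s in enumerate(strides, start=1):
    grid = input_size // s
    boxes = grid * grid * anchors_per_cell
    print(f"scale {i}: {grid} x {grid} grid, {boxes} candidate boxes")
# scale 1: 16 x 16 grid, 768 candidate boxes
# scale 2: 32 x 32 grid, 3072 candidate boxes
# scale 3: 64 x 64 grid, 12288 candidate boxes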

Besides, Yolo v3 no longer uses the Softmax function to classify each box, in order to avoid overlapping category labels for the objects. Instead, independent logistic classifiers are used, and the classification loss is represented by the binary cross-entropy loss. Due to the above improvements, on the COCO dataset Yolo v3 finally achieves an accuracy similar to that of RetinaNet, while being nearly four times faster.
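The difference between the two classification heads can be illustrated with a purely numerical sketch: softmax forces the class probabilities to sum to 1, whereas independent sigmoids with a binary cross-entropy loss allow one box to carry several labels at once. The numbers below are illustrative only.

# Independent per-class sigmoids and binary cross-entropy, as used by Yolo v3.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, logits):
    p = sigmoid(logits)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

logits = np.array([2.1, 1.8, -3.0])   # raw scores for 3 classes
y_true = np.array([1.0, 1.0, 0.0])    # two labels can be "on" at the same time

print(sigmoid(logits))                 # independent per-class probabilities
print(binary_cross_entropy(y_true, logits))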

3. The Video Data and Training Process

There are many implementations of Yolo v3 written in different programming languages. Since Python is very popular and easy to use for deep learning, we choose a Python version of Yolo v3 based on the Keras framework (https://github.com/qqwweee/keras-yolo3). Figure 2 shows the detection results of the pretrained Yolo v3. Figure 2(a) is a snapshot of one simple pedestrian flow experiment conducted about 5 years ago, in which the camera is not high. We can see that nearly all the pedestrians are easily detected in Figure 2(b). For such a situation, training is not necessary.
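A hedged usage sketch of this implementation is given below; the YOLO class and the detect_image method follow the repository's yolo.py at the time of writing, while the file names are placeholders.

# Running the pretrained keras-yolo3 detector on a single snapshot (sketch).
from PIL import Image
from yolo import YOLO  # provided by the keras-yolo3 repository

detector = YOLO()                      # loads the pretrained weights by default
frame = Image.open("snapshot_from_experiment.jpg")   # placeholder file name
result = detector.detect_image(frame)  # returns the image with drawn boxes
result.save("detections.jpg")
detector.close_session()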

However, if we record the experimental video with a UAV and the camera is far above the ground, the situation is significantly different. In the winters of 2016, 2017, and 2018, we conducted three large-scale pedestrian flow experiments on the campus of Southeast University [20, 21, 39]. We asked the recruited students to move along a circular road whose boundaries were built with plastic stools. The width of the road was 1.5 m in the 2016 experiment, 2.5 m in the 2017 experiment, and 2 m in the 2018 experiment; these differences have no influence on the detection of pedestrians. At the same time, the colors of the plastic stools were also different: they were blue in 2016 and 2017 and white in 2018. These differences can slightly influence the detection results, which will be discussed later.

In these experiments, the pedestrians wore two types of caps. The first part of each experiment was unidirectional movement, and all the pedestrians moved together. The second part was bidirectional movement: the ones with red caps kept moving forward, while the ones with blue caps turned around and moved in the opposite direction. In each experiment, the proportions of red caps and blue caps were about 50 : 50. In the 2016 experiment, we asked one pedestrian to wear a yellow cap and treated this pedestrian as a special one, but in the 2017 and 2018 experiments, we thought this was not necessary and no longer used such a configuration. Therefore, we do not have enough samples of yellow caps, and we only detect the red and blue ones.

In the experiments, we used different numbers of pedestrians in different runs, in order to study the flow-density and velocity-density relationships of pedestrian movement. Here, we show 4 typical images with different global densities (the local density in some areas could be a little larger or smaller than the global value, but the calculation and discussion of this topic are out of the scope of this paper) in Figure 3. For example, in Figure 3(c), there are 175 red caps and 182 blue caps, and the area of the circular road is about 51.1 m², so the averaged global density is about 7.0 ped/m². In this paper, we name all the experimental runs as “year-predetermined density ρ (in some runs, the predetermined density is a little different from the actual value, but these differences have no influence on the topic discussed in this paper)-order of run”; e.g., “16-8-2” means that the experiment was conducted in 2016, the predetermined density is ρ = 8 ped/m², and it is the second run.
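The global density quoted for Figure 3(c) can be checked directly from the numbers above: the pedestrian count divided by the area of the circular road.

# Quick check of the global density for Figure 3(c).
n_red, n_blue = 175, 182
area_m2 = 51.1
density = (n_red + n_blue) / area_m2
print(f"{density:.1f} ped/m^2")   # -> 7.0 ped/m^2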

We find that, in all of these images, no pedestrian can be detected by the pretrained Yolo v3. We think the reason is that, in the training datasets of the original Yolo v3, most samples of pedestrians are shot from the side rather than from the top. Therefore, we need to help Yolo v3 recognize them, and we use the red caps and blue caps as the training targets. These caps can be considered as small objects, and thanks to the recent improvements of Yolo v3, the detection of such small objects can perform well.

In order to check how many images are enough for the training process, we try different training sets, and the 4 images in Figure 3 are used to test the differences between them. The training images are extracted from the videos of pedestrian flow experiments with low densities, with a time interval of 1 s, as shown in Table 1. For example, in dataset T2, we use two videos. The first one (16-1-1) is 110 s long, and 25 pedestrians with red and blue caps participated in this run (this number does not change during the run, since the road is circular). The second one (16-2-1) is 128 s long, with 55 participants. We then use the software LabelImg (https://github.com/tzutalin/labelImg) to manually label the red caps and blue caps in these images. In each dataset, 25% of the images are used as the validation set. For the training, we use a fixed learning rate. We have also tried other hyperparameters, e.g., a different learning rate, but we find that the training results remain nearly unchanged.
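The frame extraction step can be done with a few lines of OpenCV. The following is a minimal sketch (not the authors' script) of saving one image per second from an experiment video; the file names are placeholders.

# Extract training images from a video at a fixed time interval (sketch).
import cv2

def extract_frames(video_path, out_prefix, interval_s=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    if fps <= 0:
        fps = 30.0                         # fall back if metadata is missing
    step = int(round(fps * interval_s))    # frames between two saved images
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_prefix}_{saved:04d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(extract_frames("16-1-1.mp4", "train/16-1-1"))   # placeholder paths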

Here, we show some typical training processes in Figure 4. The GPU used for training is a Tesla T4 with 16 GB of memory. Due to this resource limitation, the program cannot run with a larger batch size, e.g., 16 or 32. Since batch size = 8 is feasible, in this paper we set batch size = 10, which lies between 8 and 16.

We also show the results for batch sizes 4 and 8 for comparison. It is clear that, in Figures 4(a) and 4(c), all the losses gradually decrease with time, although the validation loss sometimes increases a little. No obvious overfitting is observed during the training. Figure 4(d) further confirms that the influence of the batch size on the training is very small. Besides, we adopt an early stopping mechanism to determine the end of training: when the validation loss drops below 120, the training stops. The numbers of epochs needed for batch sizes 4, 8, and 10 are 447, 424, and 441, respectively, which are also close to each other.
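The absolute-threshold early stopping described above can be written as a small custom Keras callback (the standard EarlyStopping callback monitors improvement rather than an absolute value). The threshold of 120 follows the text; everything else is an illustrative sketch rather than the authors' code.

# Stop training once the validation loss falls below a fixed threshold.
from tensorflow.keras.callbacks import Callback

class StopBelowLoss(Callback):
    def __init__(self, monitor="val_loss", threshold=120.0):
        super().__init__()
        self.monitor = monitor
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        value = (logs or {}).get(self.monitor)
        if value is not None and value < self.threshold:
            print(f"Stopping at epoch {epoch + 1}: {self.monitor} = {value:.1f}")
            self.model.stop_training = True

# Usage sketch: model.fit(..., batch_size=10, callbacks=[StopBelowLoss()])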

4. Results and Evaluations

In this section, we evaluate the training results. We show the statistics of the different datasets in Table 2, including precisions, recalls, and F1 scores. Here, we set the score threshold to 0.4 and the IOU threshold to 0.2, since we find these values suitable for the video data after tuning the parameters. We can find the following:
(1) When the training set is small (e.g., T1), the testing results are not good.
(2) For the low-density situation (e.g., Figure 3(a)), starting from T2, all three metrics become much better.
(3) But if the densities become higher, T2 is not enough, and a larger sample size is necessary. Only when T4 is used do the results seem satisfactory, as shown in Figure 5(d). Thus, in the following tests, we will use the model trained on T4.
(4) For the fixed parameters (score = 0.4, IOU = 0.2), the precisions are always very high in our detection: most of them are larger than 0.95. In Table 2, the differences between the datasets mainly come from the increase of the recalls.
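The way precision, recall, and F1 can be obtained from the two thresholds above is sketched below: a predicted box is kept only if its confidence score exceeds 0.4, and it counts as a true positive when its IoU with an unmatched ground-truth box exceeds 0.2. The box format and helper names are our assumptions, not the authors' evaluation script.

# Precision, recall, and F1 from score-thresholded, IoU-matched detections.
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def evaluate(preds, truths, score_thr=0.4, iou_thr=0.2):
    """preds: list of (box, score); truths: list of box."""
    kept = [box for box, score in preds if score >= score_thr]
    unmatched = list(truths)
    tp = 0
    for box in kept:
        hit = next((g for g in unmatched if iou(box, g) >= iou_thr), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp, fn = len(kept) - tp, len(unmatched)
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1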

We then check the performance of T4 on the testing sets; i.e., the model trained on low-density situations is used for detection under high-density conditions. In Table 3, the data of 32 runs are used for testing. For each experimental video, we take snapshots at an interval of 15 s (e.g., if the total running time is 263 s, the number of images is 263/15 + 1 ≈ 19). The values of precisions, recalls, and F1 scores are the averaged results over these images. It is clear that for all the runs the precisions, recalls, and F1 scores are greater than 0.95, which means the results are good enough.

Although the performance of our model is quite good, there is a small number of errors in the detection results. After careful checking, we find they mainly result from the following factors:
(1) High densities, when the pedestrians are too close to each other. As shown by the yellow circles in Figure 6(a), some red caps and some blue caps are not detected. Sometimes the caps overlap, which makes the detection more difficult. Actually, many of the lower recalls come from this factor, e.g., in 16-9-1, 16-9-2, etc.
(2) The color of the clothes. Some pedestrians' clothes (or shoes) are red or blue, which is similar to the color of the caps. This also makes the detection difficult: some errors are marked by the two yellow circles in Figure 6(b).
(3) The interference of the stools. Sometimes the stools are recognized as blue caps, as marked by the two yellow circles in Figure 6(c). This type of error only occasionally occurs in our results.
(4) Some special cases. For example, in Figure 6(d), one pedestrian raises his head, so his cap is not detected at this moment. Such a situation is not a failure of our model.

5. Some Notes about the Applications

Since this program can be of great help for extracting data from the videos of pedestrian flow experiments, we have uploaded the code and the trained model to GitHub: https://github.com/chengjie-jin/detection-model-for-pedestrian-experiments. We hope this program will be beneficial to other researchers in the field of pedestrian flow dynamics.

Finally, some notes about the application of this model should be given:
(1) In our detection results, the precision values are usually close to 1.0, while the recall values are a little smaller. In such a situation, the pedestrians that are not detected can be added by hand. The simplest way is to use Microsoft Paint in Windows: move the mouse pointer to the center point of a cap, and the coordinate is shown immediately (the position of the only yellow cap in the 2016 experiments can also be recorded in this way). Then input the coordinates and the color in the csv file to finish the work. Even for the hardest case (16-9-1), usually only 5–10 pedestrians need to be added manually, which is quite fast. After that, we can transform the position data according to the perspective of the camera and set up the coordinate system (a sketch of this step is given after this list). Then, the related analysis of pedestrian dynamics in the experiments can be performed, which can be found in another paper [39].
(2) Although the training datasets in this paper are obtained from UAV videos, the trained model can also be used for other videos of pedestrian flow experiments, as long as the participants wear similar caps. For example, Figure 7(a) shows an image of one experiment under open boundary conditions (the details of this bidirectional experiment can be found in [5]), where the height of the camera is much smaller than that of the UAV. Most of the pedestrians with red and blue caps are correctly detected in Figure 7(b). Besides, if the pedestrians in other experiments wear caps of different colors (e.g., black or white), the method introduced in this paper can be reused: use Yolo v3 to do similar training if possible.
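The perspective correction mentioned in note (1) can be sketched as follows: four reference points whose real-world positions are known (e.g., corners of the experimental area) define a homography that maps pixel coordinates of the detected caps to ground coordinates in metres. All numbers below are placeholders, not values from our experiments.

# Map detected cap centres from pixel coordinates to ground coordinates (sketch).
import cv2
import numpy as np

pixel_pts = np.float32([[102, 88], [1810, 95], [1835, 1010], [95, 1020]])   # placeholders
world_pts = np.float32([[0.0, 0.0], [12.0, 0.0], [12.0, 8.0], [0.0, 8.0]])  # metres, placeholders

H = cv2.getPerspectiveTransform(pixel_pts, world_pts)

caps_px = np.float32([[[640, 410]], [[901, 512]]])   # detected cap centres (placeholders)
caps_m = cv2.perspectiveTransform(caps_px, H)         # ground coordinates in metres
print(caps_m.reshape(-1, 2))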

6. Conclusions

In this paper, we test the usability of an object detection model based on deep learning techniques for detecting pedestrians recorded by top-view cameras. Since previous software (e.g., PeTrack) does not perform well in high-density situations, it is necessary to use a new approach. Yolo v3 is chosen due to its fast speed and decent accuracy, but its pretrained version does not work for our experimental videos. Therefore, training is necessary. We choose some images under low-density conditions as training datasets, and situations with various densities are chosen as testing datasets. We find that when the datasets become larger, the recalls significantly increase, while the precision is always close to 1.0. For the final model, the performance is good: all the precisions, recalls, and F1 scores are larger than 0.95, even when the pedestrian density is as high as 9 ped/m². This model can be used in other pedestrian flow experiments, as long as they have similar configurations.

Although we have made some progress in pedestrian detection under high-density conditions, one deficiency of this program is that it does not include the tracking of pedestrians' trajectories. However, tracking objects is not a function of Yolo v3 and is out of the scope of this paper. In recent years, some tracking algorithms based on deep learning have been proposed, e.g., SORT and DeepSORT. It is possible to merge the detection model and the tracking model into an integrated one, which may bring more convenience to scholars in the field of pedestrian flow studies. In the future, we will try to solve this technical problem and contribute more to this field.
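As a rough indication of how such an integration could look, the sketch below passes per-frame detections to the SORT tracker (https://github.com/abewley/sort); the Sort.update() interface follows that repository's documentation, while read_video_frames and detect_caps are hypothetical placeholders standing in for a frame reader and for running our trained Yolo v3 model.

# Feeding per-frame cap detections into SORT to obtain trajectories (sketch).
import numpy as np
from sort import Sort  # SORT implementation from the repository above

tracker = Sort()
trajectories = {}                          # track_id -> list of (frame, x, y)

for frame_idx, frame in enumerate(read_video_frames("16-9-1.mp4")):  # placeholder reader
    dets = detect_caps(frame)              # placeholder: list of [x1, y1, x2, y2, score]
    tracks = tracker.update(np.asarray(dets, dtype=float).reshape(-1, 5))
    for x1, y1, x2, y2, track_id in tracks:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        trajectories.setdefault(int(track_id), []).append((frame_idx, cx, cy))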

Data Availability

The code and the trained model are available on GitHub: https://github.com/chengjie-jin/detection-model-for-pedestrian-experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The authors are very grateful to Ting Hui, Yi-Xiao Zou, Fei Xie, and Hong-Feng Liang for their help in processing the video data and labeling the objects in the images. This work was funded by the National Natural Science Foundation of China (Nos. 71801036, 71901060, and 71971056), the Fundamental Research Funds for the Central Universities, and the Science and Technology Project of Jiangsu Province, China (BZ2020016).