Abstract

A pedestrian counting method based on Haar-like detection and a template-matching algorithm is presented. The aim of the method is to automatically count pedestrians in a metro station using a video surveillance camera. The most challenging problem is to count pedestrians accurately without changing the position of the surveillance camera, because the view of a surveillance camera in a metro station is always a short-shot, nondirect downward view. In this view, traditional methods find it difficult to count pedestrians accurately; hence, we propose this novel method. In addition, to further improve counting accuracy, we present a method that sets each parameter value with a threshold-curve instead of a fixed threshold. Experimental results show the high accuracy of our method.

1. Introduction

The metro, as a fast and comfortable mode of passenger transport, has been favored by the public in many cities, and more and more urban residents choose it as their means of transportation. Because of the large number of passengers, metro stations in many cities are often crowded, especially during rush hours. If the crowd cannot be dispersed promptly, the station's operation and management are affected. What is more, a stampede is very likely to occur if the crowding lasts for long or becomes severe. Therefore, to avoid congestion and prevent accidents, metro station staff need passenger flow information in real time so that they can disperse passengers promptly. Pedestrian counting information is the most important part of this passenger flow information.

However, at present, the main way to obtain pedestrian counting information in a metro station is for monitoring personnel to estimate the passenger flow volume from surveillance video. This approach is simple and easy to implement, but it has some disadvantages that are difficult to overcome. Firstly, because the information is collected manually, the pedestrian counting information that metro station staff receive is qualitative rather than quantitative, so it is inaccurate and easily affected by the subjective factors of the monitoring personnel. Secondly, it is difficult for monitoring personnel to provide the metro station staff with full and correct information when monitoring points are added.

Collecting pedestrian counting information automatically has clear advantages over collecting it manually. First, the automatic way improves the precision of pedestrian counting data and turns qualitative data into quantitative data, so it can provide accurate, real-time data that help metro station staff analyse the passenger flow and handle anomalous events. Second, it decreases the workload of the monitoring personnel and reduces the influence of subjective factors.

Counting pedestrians in a metro station automatically by computer vision has a large cost advantage over other approaches, because it can use the existing surveillance video equipment that is common in metro stations. However, the computer vision approach is not yet mature because of complicated occlusions and cluttered scenes. For that reason, this paper studies how to count pedestrians in a metro station accurately using image processing and pattern recognition algorithms. Its purpose is to present a method that counts pedestrians in a metro station accurately and in real time without changing the position of the existing surveillance camera. The method consists of three parts: the first detects pedestrians using Haar-like features and the MHI algorithm, the second tracks pedestrians using a template-matching algorithm, and the third applies the proposed threshold-curve method to improve counting accuracy.

Recently, many image-processing based methods of counting people have been proposed. Because they are used in different scenes, these methods differ considerably.

Mittal et al. have proposed a traffic management system using both audio and video data [1]. Hou and Pang have developed an effective method for estimating the number of people in a low-resolution image with complicated scenes in real time [2]. Chan and Vasconcelos have presented an approach to the problem of estimating the size of inhomogeneous crowds which are composed of pedestrians that travel in different directions [3]. Xiong et al. have proposed a potential energy-based model to estimate the number of people in public scenes as the fundamental research to detect the abnormal crowd behavior [4]. Sacchi et al. have exploited the image-processing tools for moving-object detection and classification in the context of an actual application involving the remote monitoring of a tourist site [5]. Schofield et al. have designed a system to distinguish between parts of the background scene and nonbackground objects (people) [6]. Hashimoto et al. have developed a people-counting system with human information sensors [7]. Amin et al. have presented a system for counting people in a scene using a combination of low cost, low-resolution visual and infrared cameras [8]. Huang and Chow have described a people-counting system using hybrid RBF neural network [9]. Schofield et al. have described a method for counting the number of people in any predefined scene using RAM-based neural network classifiers [10]. Kopaczewski et al. have presented an algorithm for people counting in crowded scenes based on the idea of virtual gate which uses optical flow method [11]. Vicente et al. have shown the algorithm implementation for a field-programmable gate array- (FPGA-) based design for people counting using a low-level head-detection method [12]. Conte et al. have presented a novel method by establishing a mapping between some scene feature and the number of people to provide an estimate of people count [13].

Most of the previous works can be classified into two classes by camera view. One captures video with a long-shot, nondirect downward view camera [1–5] and the other with a short-shot, direct downward view camera [6–13] (see Figure 1). Methods of the first class first detect and track the foreground group by body shape features and then segment the group into individuals for counting. Methods of the second class first detect the foreground group and segment it into individuals by head-shoulder shape features and then track the individuals for counting. The second class always has lower computational complexity than the first because it has no need to process occlusions; however, its field of view is smaller. Both classes count people accurately in their suitable views, but they are not suitable for counting pedestrians in a metro station, where the camera is always set in a short-shot, nondirect downward view (see Figure 2). In this view, it is difficult to segment a group into individuals because of severe body shape deformation. To solve this problem, this paper proposes a novel method that uses Haar-like features to detect pedestrians' heads and a template-matching approach to track them for counting. We select this method because the head of a pedestrian is seldom occluded in this view. Although this method is prone to false detections, the threshold-curve approach can compensate for them.

3. Detecting and Tracking Pedestrian

3.1. Haar-Like Based Pedestrian Detection

In this paper, pedestrians are detected using a Haar-like feature detector and an AdaBoost classifier. The detection process has two steps: training and recognition. The flowchart of this process is shown in Figure 3.

The first step is training. We first use the mouse to crop head samples from video clips with the Pedestrian Detection Software (see Figure 4). A total of 4324 samples are collected: 1000 positive samples of pedestrian heads and 3324 negative samples. Each positive sample is resized to a common size, and the negative samples are collected from the Internet and include images of mountains, rivers, cartoons, and animals. With this sample set, a classifier is trained to judge whether an object is a head or not. At first glance, the raw pixel values of the samples are an ideal feature for training a classifier, but the computational complexity makes them impractical. Thus, Haar-like features are chosen instead of raw pixel values. Each Haar-like feature is composed of two or three “black” and “white” rectangles joined together; these rectangles can be upright or rotated by 45 degrees (see Figure 5). The value of a Haar-like feature is calculated as a weighted sum of two components: the sum of pixel gray level values over the black rectangle and the sum over the whole feature area (all black and white areas) [14]. A sample image contains millions of Haar-like features, so an effective feature selection method is necessary. AdaBoost has been proven effective at selecting classifiers from millions of weak classifiers, so it is adopted as the learning algorithm to train the cascade classifier. After training, a pedestrian head classifier is obtained.
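The weighted-sum definition of a Haar-like feature can be illustrated with a small sketch. It assumes an upright two-rectangle (edge) feature whose right half is the black rectangle, and it uses an integral image so that each rectangle sum costs four lookups. The function names and the weights (+1 for the whole area, -2 for the black rectangle) are illustrative assumptions, not values taken from this paper.

```python
def integral_image(gray):
    """Return a summed-area table with one extra zero row and column."""
    height, width = len(gray), len(gray[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return table[y + h][x + w] - table[y][x + w] - table[y + h][x] + table[y][x]


def two_rect_feature(table, x, y, w, h, w_whole=1.0, w_black=-2.0):
    """Weighted sum of the whole feature area and the black (right) half.

    The weights are assumed; with +1 and -2 and an even width, the value
    equals (white sum - black sum).
    """
    half = w // 2
    whole = rect_sum(table, x, y, w, h)
    black = rect_sum(table, x + half, y, w - half, h)
    return w_whole * whole + w_black * black
```

A bright-to-dark vertical edge gives a large response, while a uniform patch gives zero, which is why such features are cheap but useful inputs to the weak classifiers selected by AdaBoost.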

The second step is to recognize pedestrians with the pedestrian head classifier. To speed up recognition, a region of interest (ROI) is set in the video clips, and the head classifier is applied within the ROI in each frame (see Figure 6). The ROI is the range between the upper and lower red lines; in Figure 6, two pedestrian heads are detected and highlighted by green rectangles in the ROI.

3.2. MHI Based Orientation Detection

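To represent how a pedestrian moves, a motion history image (MHI) is formed. In an MHI, pixel intensity is a function of the temporal history of motion at that point. For the results presented here, a simple replacement and decay operator is used:

$$H_\tau(x, y, t) = \begin{cases} \tau, & \text{if } D(x, y, t) = 1, \\ \max\bigl(0,\, H_\tau(x, y, t-1) - 1\bigr), & \text{otherwise,} \end{cases} \tag{1}$$

where $D(x, y, t)$ is a binary image marking the pixels that move at time $t$ and $\tau$ is the duration over which motion is remembered.

The result is a scalar-valued image in which more recently moving pixels are brighter [15]. Examples of MHI are presented in Figure 7(b), where the green circle represents the orientation of the pedestrian.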
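A replacement and decay update of this kind can be sketched in a few lines. The sketch assumes that a binary motion mask is already available for each frame (for example from frame differencing) and that the history is stored as a list of rows; the names `update_mhi`, `motion_mask`, and `tau` are illustrative.

```python
def update_mhi(mhi, motion_mask, tau):
    """Apply one replacement-and-decay step to a motion history image.

    Pixels that move in the current frame are set to tau; all other
    pixels decay by one, but never below zero.
    """
    return [
        [tau if moving else max(0, previous - 1)
         for previous, moving in zip(mhi_row, mask_row)]
        for mhi_row, mask_row in zip(mhi, motion_mask)
    ]


# A blob moving left to right across a one-row image.
history = [[0, 0, 0]]
for mask in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    history = update_mhi(history, mask, tau=3)
```

After the three steps the row holds increasing values from left to right, so the intensity gradient points along the direction of motion, which is what makes the MHI usable for orientation estimation.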

3.3. Template-Matching Based Pedestrian Tracking

Object association is of great importance in multiple-target tracking. To associate a single recognized rectangle with a specific template, the template-matching method is adopted; in other words, we use a template-matching algorithm to judge whether a recognized rectangle matches a specific template [16]. A recognized rectangle is considered to match a template if the association between them satisfies the following two criteria.

First, the center of the recognized rectangle is located in the searching area of the template (see Figure 8(a)). The formula is listed as follows:

In formula (2), Template_x stands for the x position of the template in the frame image and Detect_x stands for the x position of the detection rectangle in the frame image, and likewise for Template_y and Detect_y. Search represents the searching threshold of the template.

Second, the orientation of the detection rectangle is consistent with the deviation angle of the template (see Figure 7(b)). The formula is presented as follows:

In formula (3), one symbol refers to the detection rectangle and another to the template; cur stands for the current frame, first stands for the first frame of the template, last stands for the last frame of the template, and AT stands for the threshold of the angle; the remaining symbols refer to the x and y positions.
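The two matching criteria can be sketched as follows. The exact inequalities are assumptions: the search test is taken to be a per-axis distance bound, and the direction test compares the template's overall direction (first to last position) with the step from the last template position to the detection. The paper derives the detection's orientation from the MHI, so the displacement-based angle here is a stand-in, and all function names are hypothetical.

```python
import math


def in_search_area(template_xy, detect_xy, search):
    """Criterion 1 (assumed form): detection center within `search` pixels on both axes."""
    return (abs(detect_xy[0] - template_xy[0]) <= search
            and abs(detect_xy[1] - template_xy[1]) <= search)


def direction_consistent(first_xy, last_xy, detect_xy, angle_threshold_deg):
    """Criterion 2 (assumed form): the new step deviates from the template's
    overall direction by at most the angle threshold AT."""
    if first_xy == last_xy or detect_xy == last_xy:
        return True  # no direction to compare yet
    track = math.atan2(last_xy[1] - first_xy[1], last_xy[0] - first_xy[0])
    step = math.atan2(detect_xy[1] - last_xy[1], detect_xy[0] - last_xy[0])
    diff = abs(track - step) % (2 * math.pi)
    diff = min(diff, 2 * math.pi - diff)  # wrap into [0, pi]
    return math.degrees(diff) <= angle_threshold_deg


def matches(first_xy, last_xy, detect_xy, search, angle_threshold_deg):
    """A detection matches a template only if both criteria hold."""
    return (in_search_area(last_xy, detect_xy, search)
            and direction_consistent(first_xy, last_xy, detect_xy, angle_threshold_deg))
```

For a template that has moved from (100, 200) to (100, 180), a detection at (101, 170) continues the same direction and matches, whereas one at (100, 195) reverses direction and one at (160, 170) falls outside a 20-pixel search area.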

If a detection rectangle matches a template, the template is updated as in Table 1.

To update the template's position quickly and its size slowly, the specific values alpha1 = 0.95 and alpha2 = 0.35 are set in this paper.
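The effect of the two coefficients can be illustrated with a small sketch. It assumes that the update is an exponential blend of the old template value and the matched detection, with alpha1 applied to position and alpha2 to size; the blending rule and the dictionary layout are assumptions made for illustration, since only the coefficient values are given in the text.

```python
ALPHA_POSITION = 0.95  # alpha1 in the text: position follows the detection quickly
ALPHA_SIZE = 0.35      # alpha2 in the text: size changes slowly


def blend(old, new, alpha):
    """Exponential blend: alpha weights the new observation."""
    return alpha * new + (1.0 - alpha) * old


def update_template(template, detection):
    """Return an updated copy of the template after a successful match."""
    return {
        "x": blend(template["x"], detection["x"], ALPHA_POSITION),
        "y": blend(template["y"], detection["y"], ALPHA_POSITION),
        "size": blend(template["size"], detection["size"], ALPHA_SIZE),
    }


updated = update_template(
    {"x": 100.0, "y": 200.0, "size": 1600.0},
    {"x": 120.0, "y": 180.0, "size": 2000.0},
)
```

Under these assumptions the position moves almost all the way to the detection in one step, while the size moves only about a third of the way, which damps size jitter from the detector.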

If a detection rectangle matches no template in the template list, a new template is created and added to the list. This new template is initialized as in Table 2.

When a template fails to match any new detection rectangle for a while (5 frames in this paper), it is deleted from the template list and then judged as to whether it is a pedestrian. In this paper, the template is considered a pedestrian if it meets the following conditions:

In formula (4), the parameters MIN_SIZE and MAX_SIZE are set to delete templates whose size is too big or too small. The parameters GOOD_SCORE, GOOD_SCORE_RATIO, and GOOD_SCORE_MAX are set because templates created by image noise are more difficult to match than other templates. The parameters GOOD_DIR_AVER and Y_AXIS_MIN are set because the positions of templates produced by image noise change only a little. How to set the thresholds of these parameters is elaborated in the next section.
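The deletion rule that precedes this judgment can be sketched as a per-frame bookkeeping step. The sketch assumes that each template carries an identifier and a counter of consecutive unmatched frames; the field names and the function `step_template_list` are hypothetical, and the pedestrian test itself is left out because its conditions are not reproduced here.

```python
MAX_MISSED_FRAMES = 5  # from the text: a template unmatched for 5 frames is removed


def step_template_list(templates, matched_ids):
    """Advance the miss counters by one frame.

    Returns (kept, expired): templates still being tracked, and templates
    that have just reached the miss limit and are ready to be judged.
    """
    kept, expired = [], []
    for template in templates:
        missed = 0 if template["id"] in matched_ids else template["missed"] + 1
        updated = dict(template, missed=missed)
        if missed >= MAX_MISSED_FRAMES:
            expired.append(updated)
        else:
            kept.append(updated)
    return kept, expired
```

Each expired template would then be checked against the thresholds of the seven parameters to decide whether it is counted as a pedestrian.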

4. Threshold-Curve Set

These parameters, which decide whether a template is regarded as a pedestrian template or not, have great influence on counting accuracy. Hence, to guarantee counting accuracy, a novel method of setting their thresholds is presented. Its novelty is that the thresholds are set not by fixed values but by curves: whether a template is regarded as a pedestrian is decided by different threshold values depending on its position in the frame image. The method is implemented in two steps. The first is to collect sample data, and the second is to analyse the data and calculate the threshold-curve.

4.1. Data Collection

In order to find out the different characteristics between the pedestrian template and other noise templates, a piece of software to collect pedestrian templates is designed. The piece of software is shown in Figure 9.

In this software, the rectangular template and its identification code are displayed on the video clip in real time (see Figure 9). The software automatically pops up a dialogue box named “dataForm” when a template fails to match a new detection rectangle for 5 frames. The dialogue box has two buttons, “Yes” and “No.” If the “Yes” button is pressed, the relevant characteristics of the template are added to the pedestrian template database. In this database, each field stands for a specific feature of the template and each record represents a pedestrian template. All fields are listed in Table 3.

In this paper, 908 pedestrian templates have been collected to make the analysis more comprehensive.

4.2. Data Analysis

As presented above, we judge whether a template is a pedestrian using the parameters MIN_SIZE, MAX_SIZE, GOOD_SCORE, GOOD_SCORE_RATIO, GOOD_DIR_AVER, GOOD_SCORE_MAX, and Y_AXIS_MIN.

We find that the counting error is bigger than expected if these parameters are set to fixed values, because the detection and tracking of templates are sensitive to the surroundings. Therefore, a novel method that sets each parameter with different values is proposed, where the value is decided by the last position of the template. In other words, the threshold of a parameter is not a fixed value but a curve, so we call it a threshold-curve.

Figure 10 is the scatter gram of the frequency of the template samples. In this figure, the colored dots stand for the last positions (the positions where the templates are deleted) of the pedestrian templates collected in Section 4.1, and the different colors represent the frequencies of their occurrence (red for 3 times, green for 2 times, and so on). The dots are close together along one axis and dispersed along the other, so we use the coordinate along the dispersed axis as the independent variable of the threshold-curve. To explain the method, we take the threshold of MIN_SIZE as an example.

4.2.1. MIN_SIZE Threshold-Curve

As mentioned above, the MIN_SIZE threshold-curve is obtained from the coordinate values of the template samples along the chosen axis. The coordinate values are divided into 704 groups because the video resolution is 704 × 576 pixels. Each group includes 576 pixels, from (1, yn) to (576, yn). Figure 11 is the scatter gram of the frequency of the template samples along this axis. The figure shows that this classification easily leads to mistakes, as in Figure 11(b): no pedestrian template has a coordinate value of 55, yet this does not mean that a pedestrian will never appear at position (55, yn). Events with low frequency may account for this error.

The most direct way to avoid this mistake is to increase the number of samples. However, this solution is not feasible because more samples would increase the amount of manual work. The solution we adopt is to merge every four small groups into one big group, each of which includes 2304 pixels (4 × 576). The details are presented in Table 4.

After this grouping, a new scatter gram of the frequency of the template samples is obtained (see Figure 12).
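The regrouping step can be sketched as simple integer binning. The sketch assumes 1-based pixel coordinates, so that coordinates 1 to 4 form the first big group and coordinates 145 to 148 form one group, in line with the 145–148 range cited from Table 5; the function names are illustrative.

```python
from collections import Counter

GROUP_WIDTH = 4  # four one-pixel groups are merged into one big group


def big_group_index(coord):
    """Map a 1-based pixel coordinate to a 0-based big-group index."""
    return (coord - 1) // GROUP_WIDTH


def big_group_range(index):
    """Return the inclusive 1-based coordinate range covered by a big group."""
    start = index * GROUP_WIDTH + 1
    return start, start + GROUP_WIDTH - 1


def group_frequencies(coords):
    """Count how many template samples fall into each big group."""
    return Counter(big_group_index(c) for c in coords)


frequencies = group_frequencies([53, 54, 56, 145, 148, 149])
```

With 704 one-pixel groups this yields 176 big groups, and a coordinate such as 55 that has no sample of its own now shares a group with its neighbors 53, 54, and 56.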

The templates in a big group form a template set, and each template has a size property, so each big group yields a set of template sizes. To get the threshold of the MIN_SIZE parameter, we sort this set of sizes from small to large and take its 10th percentile value (see Figure 13(a)). The 10th percentile is chosen instead of the smallest value in order to reduce the impact of events with low frequency.
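The percentile step can be sketched as below. The nearest-rank definition of a percentile is an assumption, since the text does not say which definition is used, and the sample sizes are made up for illustration.

```python
def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer pct, 1..100) by the nearest-rank rule."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100 in integer arithmetic, at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical template sizes in one big group, including one tiny outlier.
sizes = [10] + list(range(200, 2100, 100))
min_size_value = percentile_nearest_rank(sizes, 10)
```

With these 20 values the 10th percentile is the second smallest size, so the single outlier of 10 does not drag the MIN_SIZE value down as the plain minimum would.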

Both the size values and the frequency distribution of the pedestrian templates affect the MIN_SIZE threshold. For example, in Table 5, when the coordinate value equals 145–148, we consider the MIN_SIZE value a little larger than its actual value because its frequency is only 2 and the sizes nearby (1849, 1764, 2809, and 1521) are smaller. To guarantee the accuracy of the threshold-curve, the following is used to calculate the MIN_SIZE value of the templates:

The formula combines the frequency value and the size value of each group, using two weight values, to give a new size value for the group. The MIN_SIZE thresholds obtained in this way are shown in Figure 14.

The discrete function of the MIN_SIZE thresholds has now been obtained, but it is not enough. To simplify program design and speed up calculation, we want a continuous function (threshold-curve) of MIN_SIZE, which is obtained in two steps.

Firstly, the discrete MIN_SIZE threshold data are fitted to a sum of sine functions by the least squares principle. The larger the number of sine terms, the smaller the SSE (sum of squares for error) becomes, but the longer the calculation takes. Therefore, to find a number of terms that balances SSE against calculation speed, the number of terms is increased one by one and the SSE of the fitting function is calculated each time. The number of terms is fixed once the SSE falls below a threshold defined in terms of the maximum value of MIN_SIZE.
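The fitting target can be written out explicitly. The sketch below assumes the common sum-of-sines form with an amplitude, a frequency, and a phase per term; the symbols $a_i$, $b_i$, $c_i$, $n$, $u_j$, $s_j$, and $m$ are introduced here for illustration and are not taken from the original formula.

```latex
f_n(u) = \sum_{i=1}^{n} a_i \sin\left(b_i u + c_i\right),
\qquad
\mathrm{SSE}(n) = \sum_{j=1}^{m} \bigl(s_j - f_n(u_j)\bigr)^2
```

Here $u_j$ is the coordinate of big group $j$, $s_j$ is its discrete MIN_SIZE value, and $m$ is the number of big groups. For each $n = 1, 2, \ldots$ the coefficients are chosen to minimize $\mathrm{SSE}(n)$, and the search stops at the first $n$ whose SSE is below the chosen bound.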

Secondly, to make the threshold-curve more fault-tolerant, the fitting function is multiplied by a coefficient: 0.7 if a lower limit is needed and 1.3 if an upper limit is needed. The threshold-curve of MIN_SIZE is obtained in this way.
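How a fitted curve and the two coefficients turn into a position-dependent test can be sketched as follows. The sum-of-sines form with (amplitude, frequency, phase) triples is an assumption, the coefficient triple used in the example is made up, and the function names are hypothetical; only the factors 0.7 and 1.3 come from the text.

```python
import math

LOWER_COEFF = 0.7  # from the text: factor for a lower-limit curve
UPPER_COEFF = 1.3  # from the text: factor for an upper-limit curve


def sum_of_sines(coord, terms):
    """Evaluate a fitted curve given (amplitude, frequency, phase) triples."""
    return sum(a * math.sin(b * coord + c) for a, b, c in terms)


def lower_threshold(coord, terms):
    return LOWER_COEFF * sum_of_sines(coord, terms)


def upper_threshold(coord, terms):
    return UPPER_COEFF * sum_of_sines(coord, terms)


def passes_min_size(size, coord, terms):
    """A template passes if its size is at least the lower-limit curve at its position."""
    return size >= lower_threshold(coord, terms)


# Hypothetical one-term fit that evaluates to a constant 2000.
example_terms = [(2000.0, 0.0, math.pi / 2)]
```

With this example curve the lower limit is 1400 and the upper limit is 2600 at every position; a real fit would make both limits vary with the template's last position.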

The threshold-curve of the MIN_SIZE is shown in Figure 15 and the following formula:

4.2.2. The Other Parameters’ Threshold-Curve

The threshold-curves of the other parameters (MAX_SIZE, GOOD_SCORE, GOOD_SCORE_RATIO, GOOD_DIR_AVER, GOOD_SCORE_MAX, and Y_AXIS_MIN) are obtained in the same way as that of MIN_SIZE. They are shown in Figures 16, 17, 18, 19, 20, and 21 and the following:

5. Results

To test our method, we designed a piece of software named Pedestrian Counting Software V1.0. It is developed in C# with the EmguCV library. EmguCV is a .NET wrapper for OpenCV, an open source computer vision library originally developed by Intel that includes many general algorithms for image processing and computer vision.

The first experiment tests the accuracy of counting pedestrians with fixed thresholds. The threshold values are decided by the pedestrian template set collected in Section 4.1; for example, the MIN_SIZE threshold is the minimum value in the template set. The values used in this paper are listed in Table 6.

The accuracy results of counting pedestrians with fixed thresholds are listed in Table 7.

We then test the accuracy of counting pedestrians with the threshold-curve, using the threshold values from Section 4.2. The accuracy results are listed in Table 8.

Compared with a fixed threshold, the threshold-curve improves counting accuracy. As seen from Tables 6, 7, and 8, the method with threshold-curve improves counting accuracy by 13.9 percent and actual accuracy by 17.7 percent compared to the method with fixed threshold; hence the method using threshold-curve is much better. For the down passenger flow, however, the accuracy is the same whether we use a fixed threshold or the threshold-curve, because the down passenger flow is very small and it is difficult to calculate a threshold-curve for it. Therefore, we use a fixed threshold instead of the threshold-curve to count the down passenger flow.

Although the method performs well at counting pedestrians, counting errors such as false counts and missed counts still exist. Missed counts have two causes: pedestrians wearing hats of a color similar to the surroundings, and pedestrians running fast. False counts occur when a pedestrian's luggage looks similar to a head.

The test video clip was taken at Xizhimen metro station (see Figure 22). It is composed of 90000 frames, one hour in total. Its frame size is 704 × 576 pixels and its frame rate is 25 frames per second (40 ms per frame). The CPU used in the experiment is a 3.0 GHz Intel(R) Core(TM) 2 Duo, and the software needs about 27 ms to process one frame. Hence, this method can meet the demands of real-time processing.
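The real-time claim follows from a short calculation, sketched here with the figures quoted above; the helper names are illustrative.

```python
def frame_interval_ms(frames_per_second):
    """Time available per frame, in milliseconds."""
    return 1000.0 / frames_per_second


def is_real_time(processing_ms, frames_per_second):
    """Processing keeps up if one frame is handled within one frame interval."""
    return processing_ms <= frame_interval_ms(frames_per_second)


def clip_duration_s(frame_count, frames_per_second):
    """Duration of a clip in seconds."""
    return frame_count / frames_per_second


budget_ms = frame_interval_ms(25)          # 40 ms per frame at 25 frames per second
headroom_ms = budget_ms - 27               # time left after about 27 ms of processing
duration_s = clip_duration_s(90000, 25)    # 3600 s, that is, one hour
```

At about 27 ms per frame the software uses roughly two thirds of the 40 ms budget, which leaves about 13 ms of headroom per frame on the stated hardware.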

6. Conclusion

This paper proposes a novel method to count pedestrians in a metro station based on Haar-like detection and a template-matching algorithm. The method avoids the disadvantages of other computer vision methods in a short-shot, nondirect view. It differs from traditional methods in the following two points.

To avoid counting errors due to occlusions and body-shape deformation in a short-shot, nondirect view, this method detects pedestrians with the Haar-like and AdaBoost algorithms, which are mature techniques in face detection.

A template-matching algorithm with seven parameters (MIN_SIZE, MAX_SIZE, GOOD_SCORE, GOOD_SCORE_RATIO, GOOD_DIR_AVER, GOOD_SCORE_MAX, and Y_AXIS_MIN) is used to track and count pedestrians. The values of these parameters are the key to judging whether a template is a pedestrian, so they have a great influence on counting accuracy. We found that the accuracy cannot meet our demands if these parameters are set to fixed values. Therefore, a novel method that sets these parameters with threshold-curves is proposed; in other words, a threshold function is built for every parameter so as to improve the accuracy of pedestrian counting.

The experiments show that our method performs well in counting applications. The accuracy of the method with threshold-curve is nearly ninety percent, about seventeen percent higher than that of the fixed-threshold method, so the method with threshold-curve is shown to be the better one.

The method presented in this paper is of practical value for metro stations, but it has two disadvantages. One is that collecting template samples requires a lot of work, so we expect to replace manual collection with automatic collection methods in the future. The other is that we must judge manually whether a template is a pedestrian when collecting template samples; an automatic method of judging the template's property is expected to be proposed in the future.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This study is supported by the Science and Technology Project of Ministry of Transport: The Study on Key Technology of Transfer Information Sensing and Co-Scheduling, no. 2012-364-220-109; and Science and Technology Planning Project of Beijing Science and Technology Commission: The Study on the Mechanism of Travel Behavior and Guiding Strategy, no. Z1211000003120100.