Abstract
Video target tracking is a critical problem in the field of computer vision. Particle filters have proven very useful in target tracking for nonlinear and non-Gaussian estimation problems. Although most existing algorithms track targets well in controlled environments, automated and robust tracking of pedestrians in video sequences remains difficult when target appearance or surrounding illumination changes. To overcome these difficulties, this paper presents a method for multitarget tracking of pedestrians in video sequences based on particle filters. To improve the efficiency and accuracy of detection, the algorithm first obtains target regions in the training frames by combining background subtraction with Histogram of Oriented Gradients (HOG) detection, and then establishes a discriminative appearance model by generating patches and constructing codebooks from superpixel and Local Binary Pattern (LBP) features in those target regions. During tracking, the algorithm uses the similarity between candidates and codebooks as the observation likelihood function and handles severe occlusion to prevent the drift and target loss that occlusion causes. Experimental results demonstrate that our algorithm improves tracking performance in complicated real scenarios.
1. Introduction
Video target tracking is an important research field in computer vision because of its wide range of applications in many industries, such as military guidance, visual surveillance, visual navigation of robots, human-computer interaction, and medical diagnosis [1–3]. The main task of target tracking is to follow one or more moving targets through a video sequence so that the position, velocity, trajectory, and other parameters of each target can be obtained. Moving target tracking must complete two main tasks: the first is target detection and classification, which locates the relevant targets in each image frame; the second is the association of target locations across consecutive frames, which identifies the target points in the image and determines their coordinates, and thereby the trajectory of the target over time. However, automated detection and tracking of pedestrians in video sequences is still a challenging task for the following reasons [4]. (1) Large intraclass variability, which refers to the changes in the appearance of pedestrians caused by different poses, clothing, viewpoints, illumination, and articulation. (2) Interclass similarity, which is the resemblance between pedestrians and background objects in heavily cluttered environments. (3) Partial occlusion of pedestrians by other interclass or intraclass targets, which may change frequently in a dynamic scene.
Given these difficulties, pedestrian tracking has been studied intensively, and a number of elegant algorithms have been established. One popular tracking method is the mean shift procedure [5], which finds the local maximum of a probability distribution by moving in the direction of the gradient. Comaniciu and Ramesh [6] gave a strict proof of the convergence of the algorithm and proposed a mean shift-based tracking method. As a deterministic method, mean shift keeps a single hypothesis and is thus computationally efficient, but it may run into trouble when similar targets are present in the background or when occlusion occurs. Another common approach is the Kalman filter [7]. This approach assumes that the probability distribution of the target state is Gaussian, so that the mean and covariance, computed recursively by the Kalman filter equations, fully characterize the behavior of the tracked target. In real-world video, however, targets rarely satisfy the Gaussian assumption required by the Kalman filter, because background clutter may resemble part of the foreground features. A promising alternative is the sequential Monte Carlo approach, also known as the particle filter [8], which recursively estimates the target posterior with discrete sample-weight pairs in a dynamic Bayesian framework. Because particle filters handle non-Gaussian, nonlinear models and maintain multiple hypotheses, they have been successfully applied to video target tracking [9].
2. Previous Work
Various researchers have attempted to extend particle filters to target tracking. One of the most successful features used in target tracking is color. Nummiaro et al. [10] proposed a tracking algorithm that uses color histograms as the feature and tracks them with the particle filter. Although the algorithm is more robust to partial occlusion and changes in target shape, it is highly sensitive to illumination changes, which may cause the tracker to fail. Vermaak et al. [11] introduced a mixture particle filter (MPF), in which each component of the mixture is modeled with an individual particle filter. The filters in the mixture interact only through the computation of the importance weights. By distributing the resampling step to the individual filters, the MPF avoids the problem of sample depletion. Okuma et al. [12] extended the approach of Vermaak et al. and proposed a boosted particle filter, which combines the strengths of two successful algorithms: mixture particle filters and AdaBoost. It is a simple and automatic multiple target tracking system, but it tends to fail when the background is complex.
Therefore, a more effective method for target recognition is needed. The superpixel has been one of the most promising representations, with demonstrated success in image segmentation and target recognition [13–15]. For this reason, Ren and Malik [16] proposed a superpixel-based tracking method that regards the tracking task as figure/ground segmentation across frames. However, because it processes every frame in its entirety with Delaunay triangulation and a conditional random field (CRF) for region matching, its computational complexity is rather high. Furthermore, it is not designed to handle complex scenes that include heavy occlusion, cluttered backgrounds, or large lighting changes. Wang et al. [17] proposed a tracking method from the perspective of mid-level vision, with structural information captured in superpixels. The method is able to handle heavy occlusion and recover from drift. In this paper, therefore, the observation model adopts superpixels combined with LBP to extract the target features.
In recent years, the bag of features (BoF) representation has been successfully applied to object and natural scene classification owing to its simplicity, robustness, and good practical performance. Yang et al. [18] proposed a visual tracking approach based on BoF. The algorithm randomly samples image patches within the object region in the training frames and constructs two codebooks, using RGB and LBP features, instead of the single codebook of traditional BoF. It is more robust in handling occlusion, scaling, and rotation, but it can track only one target. Building on the advantages of BoF in target tracking, this paper employs BoF to establish the discriminative appearance model, which converts the comparison of high-dimensional feature vectors into a comparison of low-dimensional histograms and thus offsets the high computational cost that superpixels introduce into the observation model.
Therefore, to achieve automated and robust tracking of pedestrians in complex scenarios, we present a method for multitarget tracking of pedestrians in video sequences based on particle filters. The algorithm uses BoF to create a discriminative appearance model, which is then combined with the particle filter to track the targets. To improve the efficiency and accuracy of detection, background subtraction and HOG detection are first combined to obtain the target motion regions in the training frames. The discriminative appearance model built from these target regions is then used to discriminate among the candidate targets. During tracking, severe occlusion is handled explicitly to prevent the drift and target loss caused by mutual occlusion of pedestrians. Figure 1 shows the flowchart of the entire algorithm.
The paper is organized as follows: Section 3 introduces the detection of pedestrians; Section 4 describes our particle filter algorithm; Section 5 presents the experimental results and the performance evaluation; and Section 6 concludes the paper.
3. Detection of Pedestrians
This section has two main parts: target region extraction and the construction of the discriminative appearance model. The former determines the target regions in the first frames of the video sequence. The latter treats these target regions as a training set, samples patches and extracts features within them, and finally establishes the discriminative appearance model.
3.1. Target Regions Extraction
Before tracking, we need to detect the targets in the first frames and obtain the target regions in each of those frames for later training. Figure 2 shows the flow diagram of target region extraction in the first frames.
As Figure 2 shows, the first step is to obtain the motion regions. A simple and fast approach is background subtraction, which identifies moving targets as the portions of a video frame that differ significantly from a background model, as shown in Figure 3. We then use HOG descriptors [19] and Support Vector Machines (SVMs) to build a pedestrian detector. Since this method has proved capable but time-consuming, we apply it only to the motion regions acquired by background subtraction in those frames. This not only reduces the region that HOG detection must scan but also improves the efficiency and accuracy of detection. Figure 4 shows that applying HOG detection after background subtraction improves the accuracy of pedestrian detection, whereas applying HOG directly can lead to false detections.
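The idea of restricting the expensive detector to the motion region can be sketched as follows. This is only an illustrative sketch under stated assumptions: frames are plain nested lists of grayscale values, the background model is a single static image, and `detector` is a hypothetical stand-in for the HOG and SVM pedestrian detector, which is not reproduced here. The function names and the threshold value are likewise assumptions.

```python
def motion_bounding_box(frame, background, threshold=30):
    """Return (top, left, bottom, right) of the pixels that differ from the
    background model by more than `threshold`, or None if nothing moved."""
    moving = [(r, c)
              for r, row in enumerate(frame)
              for c, value in enumerate(row)
              if abs(value - background[r][c]) > threshold]
    if not moving:
        return None
    rows = [r for r, _ in moving]
    cols = [c for _, c in moving]
    return min(rows), min(cols), max(rows), max(cols)


def detect_in_motion_region(frame, background, detector, threshold=30):
    """Run `detector` (hypothetical stand-in for HOG + SVM) only on the
    motion region and map its boxes back to full-frame coordinates."""
    box = motion_bounding_box(frame, background, threshold)
    if box is None:
        return []  # static frame: the detector is never called
    top, left, bottom, right = box
    crop = [row[left:right + 1] for row in frame[top:bottom + 1]]
    # The detector reports boxes relative to the crop; shift them back.
    return [(t + top, l + left, b + top, r + left)
            for (t, l, b, r) in detector(crop)]
```

Because the detector sees only the cropped region, its cost scales with the size of the moving area rather than with the whole frame, which is the efficiency argument made above.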
3.2. Discriminative Appearance Model
During this stage, the discriminative appearance model is created from the target regions extracted in the first frames, so that targets can be distinguished from cluttered backgrounds. Each pedestrian in each training frame has a target region, indexed by the pedestrian and by the frame. From all the frame regions in which a given pedestrian appears, we build that pedestrian's discriminative appearance model (we assume that the number of targets in the training frames is constant); we therefore need one discriminative appearance model per target pedestrian.
3.2.1. Patch Generation
In the training stage, patches of a constant scale are randomly sampled within the region of each pedestrian. For each pedestrian, image patches are collected in every training frame and represented by a superpixel descriptor and an LBP descriptor, respectively. The extraction of the superpixel and LBP descriptors in the training frames is illustrated in Figure 5.
The superpixel segmentation method adopted in this paper is SLIC (Simple Linear Iterative Clustering) [15], which clusters pixels in the combined five-dimensional color and image plane space to efficiently generate compact, nearly uniform superpixels. For the superpixel descriptor, we segment the target region in each training frame into superpixels, as shown in Figure 5. Because a superpixel has no fixed shape and its distribution is often irregular, it is unsuitable for extracting local template information. On the other hand, because the pixels inside a superpixel are similar in both texture and color, more stable superpixel information can be obtained by extracting a color space histogram. The RGB color space does not match the distribution of human vision and is not robust enough to illumination changes, so we use only the normalized histogram in HSV color space, which is simple and accords with human vision, as the feature for all superpixels.
LBP is widely used for texture description and performs well in texture classification, fabric defect detection, and moving region detection. LBP is an illumination-invariant descriptor that is insensitive to intensity changes caused by changes in lighting: the descriptor is stable as long as the differences among the image pixel values do not change much. In addition, LBP and color features are complementary to a degree, so we adopt the LBP descriptor as a feature. The LBP descriptor is defined by comparing the intensity value of the center pixel with the intensity of each neighboring pixel.
The image histogram is then obtained from the computed LBP values. Its definition involves the length of the bit code generated by the LBP operator, the number of pixels in the neighborhood, and the LBP value at each pixel position; each bin of the histogram holds the number of pixels that have the corresponding LBP value, so the histogram reflects the distribution of the LBP values.
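The LBP code and its histogram can be illustrated with a short sketch. It assumes the basic 8-neighbor operator on a 3x3 window, a clockwise bit order starting at the top-left neighbor, and the convention that a neighbor at least as bright as the center sets its bit; none of these choices is stated in the text, and the function names are illustrative.

```python
def lbp_code(window):
    """LBP code of the center pixel of a 3x3 window (assumed 8 neighbors)."""
    center = window[1][1]
    # Neighbor positions, clockwise from the top-left corner (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if window[r][c] >= center:  # threshold the neighbor against the center
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over the interior pixels."""
    hist = [0] * 256
    height, width = len(image), len(image[0])
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            window = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(window)] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist
```

Adding the same constant to every pixel leaves each comparison, and hence each code, unchanged, which is the sense in which the descriptor tolerates uniform lighting changes.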
3.2.2. Codebook Construction
As the frames advance, patches accumulate. The extracted sample features are grouped into a number of clusters by mean shift clustering, and the cluster centers compose the codebook; the number of cluster centers is therefore also the size of the codebook. The cluster centers, which represent the most typical features, are regarded as the keywords of the codebook and are used to create bags. In this way, a large collection of sample features is converted into a comparatively small codebook. Figure 6 shows the process of codebook construction.
After codebook construction, each feature in the feature set of a training sample image is assigned to the codeword with the smallest Euclidean distance to it, and the number of features assigned to each codeword is counted to obtain the final histogram. Repeating these steps for every training sample image converts the set of training images into a set of histograms called bags. A bag thus records the occurrence frequency of the codewords in an image and can be represented as a histogram. The training images are converted to a set of bags by raw counts.
At this point, the discriminative appearance model has been established for the subsequent classification decisions.
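The conversion of an image's features into a bag can be sketched as below. The sketch assumes that the codebook is already available as a list of cluster centers (the mean shift clustering step is not shown) and that features are plain lists of floats; the function names are illustrative.

```python
import math


def nearest_codeword(feature, codebook):
    """Index of the codeword with the smallest Euclidean distance to `feature`."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(feature, codebook[k]))


def bag_of_features(features, codebook):
    """Raw-count histogram of nearest codewords for one image's features."""
    bag = [0] * len(codebook)
    for feature in features:
        bag[nearest_codeword(feature, codebook)] += 1
    return bag
```

For example, with the two codewords `[0, 0]` and `[1, 1]`, three features lying near the first, second, and second codeword respectively produce the bag `[1, 2]`.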
3.2.3. Updating
Since the appearance and pose of a target change all the time, updating is necessary and even crucial. After a given number of frames, a new collection of patches is obtained. We then perform mean shift clustering again on the new patches together with the old codebook to obtain the new codebook. A forgetting factor is imposed on the old codebook to reduce its importance gradually, so that the newly constructed codebook pays more attention to the latest patches.
4. Particle Filter Tracking
The particle filter [8] is a Bayesian sequential importance sampling technique that recursively approximates the posterior distribution using a finite set of weighted samples. It consists of two essential steps: prediction and update. The target system at each time step is described by a set of states, which records the number of target states at that time step and the state of each individual target.
Given all observations available up to the previous time step, the prediction stage uses the probabilistic system transition model to predict the posterior at the current time step. When the observation at the current time step becomes available, the state is updated using Bayes' rule, in which the likelihood is described by the observation equation.
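For reference, the prediction and update steps of a Bayesian filter are commonly written in the following standard form. The symbols are assumptions chosen for illustration: $X_t$ stands for the set of target states at time $t$, $Z_t$ for the observation at time $t$, and $Z_{1:t}$ for all observations up to time $t$.

```latex
% Prediction: propagate the previous posterior through the transition model
p(X_t \mid Z_{1:t-1}) = \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Z_{1:t-1})\, dX_{t-1}

% Update: correct the prediction with the new observation (Bayes' rule)
p(X_t \mid Z_{1:t}) = \frac{p(Z_t \mid X_t)\, p(X_t \mid Z_{1:t-1})}{p(Z_t \mid Z_{1:t-1})}
```

Here $p(X_t \mid X_{t-1})$ plays the role of the system transition model and $p(Z_t \mid X_t)$ that of the observation likelihood; the particle filter approximates these integrals with weighted samples.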
4.1. State-Space Model
In the video scene, the movement of each target can be considered an independent process, and therefore the state-space model can be written as the product of the single-target motion models.
Suppose that the number of target states is the same at two consecutive time steps. The state of each video target at a given time step consists of the position of the rectangle center along the horizontal and vertical directions of the image, together with the length and width of the rectangle.
To obtain the state transition density function of each target, a random perturbation model is used to describe the transition of the target's state from one time step to the next: the transition density is a normal density function whose covariance is a diagonal matrix, with the diagonal elements holding the variances of the four state parameters. The random perturbation model is chosen because the tracked targets are pedestrians, whose movement is random, so that it is difficult to predict the state of motion at the next time step with a constant-velocity or constant-acceleration model.
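A random perturbation model with a diagonal covariance amounts to adding independent zero-mean Gaussian noise to each state parameter. The sketch below assumes a four-component state (center position in two directions, length, width) stored as a tuple and one standard deviation per component; the function names and the example values are illustrative only.

```python
import random


def propagate(state, sigmas, rng):
    """Perturb each state component with independent Gaussian noise
    (diagonal covariance, one standard deviation per component)."""
    return tuple(rng.gauss(value, sigma) for value, sigma in zip(state, sigmas))


def sample_candidates(prev_state, sigmas, n_particles, seed=None):
    """Draw candidate states around the target's previous state."""
    rng = random.Random(seed)
    return [propagate(prev_state, sigmas, rng) for _ in range(n_particles)]


# Example: 100 candidates around a rectangle centered at (120, 80),
# with larger uncertainty in position than in size (assumed values).
candidates = sample_candidates((120.0, 80.0, 30.0, 60.0),
                               (5.0, 5.0, 1.0, 2.0), 100, seed=0)
```

Because the model carries no velocity term, the spread of the candidates is controlled entirely by the chosen variances.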
4.2. Observation Model
When a new frame arrives, the state-space model is first used, for each target, to randomly sample candidate targets around the target's location in the previous frame, as illustrated in Figure 7.
Secondly, each candidate target is handled as follows.
(1) Extract superpixel patches. We apply superpixel segmentation to each candidate target and obtain its superpixels. We then extract the HSV color histogram of each superpixel and normalize it.
(2) Extract LBP patches. We extract patches from each candidate target, and then calculate the LBP histogram of each patch and normalize it.
We then calculate the color histogram and the LBP histogram of the patches (each superpixel is also referred to as a patch) separately according to the following process.
We calculate the similarity between the patches and the codewords with a similarity function. The similarity between a patch and each codeword in the codebook is computed from the feature vector of the test patch and the feature vector of the codeword, using the histogram intersection distance between the two histograms.
Thus, every patch in a candidate target has a most similar codeword. Counting the occurrence frequency of the codewords in each candidate target yields its bag of features.
We then compute the similarity of the bags to obtain the weight of each candidate target, using the bag of features intersection distance between the feature vector of the test sample and the feature vector of the template.
The observation likelihood function is defined from this similarity. In this way, we obtain the corresponding quantities for the superpixel feature and the LBP feature, respectively. Given the target state, the total observation likelihood function of the target fuses the observation likelihood functions of the superpixel and LBP features, each with its own weight. The feature weights can be calculated dynamically from the weight distribution of the particle sets.
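The histogram comparison and the fusion of the two feature likelihoods can be sketched as follows. Histogram intersection is shown in its usual form, the sum of bin-wise minima. The fusion is written as a convex combination of the two likelihoods, which is an assumption made for illustration: the text states only that the two likelihoods are weighted, not how they are combined. All function names are hypothetical.

```python
def normalize(hist):
    """Scale a histogram so that its bins sum to 1 (no-op for an empty one)."""
    total = sum(hist)
    return [value / total for value in hist] if total else list(hist)


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; equals 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def fused_likelihood(superpixel_likelihood, lbp_likelihood, superpixel_weight):
    """Assumed fusion rule: convex combination of the two feature likelihoods."""
    lbp_weight = 1.0 - superpixel_weight
    return superpixel_weight * superpixel_likelihood + lbp_weight * lbp_likelihood


# Example: compare a candidate's bag with a template bag.
candidate_bag = normalize([2, 2, 0])
template_bag = normalize([1, 1, 2])
similarity = histogram_intersection(candidate_bag, template_bag)
```

Shifting weight toward the LBP likelihood when the color feature becomes unreliable, for instance under occlusion or lighting change, is the behavior the dynamic feature weights are meant to provide.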
4.3. Occlusion Handling
The above procedure can handle partial occlusion of the target. However, under severe or complete occlusion, the total observation likelihood of the target becomes extremely small. In that situation, when the total observation likelihood falls below a certain threshold, we keep the target's last tracking state unchanged while the particles continue their state transitions. The tracking result and the particles' movements under severe occlusion are illustrated in Figures 8 and 9, respectively.
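The thresholding rule for severe occlusion can be sketched as below. The sketch assumes that the state estimate in the normal case is the weighted mean of the particles, which is a common choice but is not stated in the text; the threshold value and the function name are likewise illustrative.

```python
def update_track(last_state, particles, weights, target_likelihood, threshold):
    """Return (state, occluded).

    If the target's total observation likelihood is below `threshold`, the
    last tracking state is kept; otherwise the state is re-estimated as the
    weighted mean of the particles (an assumed estimator).
    """
    if target_likelihood < threshold:
        return last_state, True
    total = sum(weights)
    estimate = tuple(
        sum(w * p[d] for w, p in zip(weights, particles)) / total
        for d in range(len(last_state))
    )
    return estimate, False
```

In both branches the particles themselves would still be propagated by the state transition model, so that tracking can resume once the target reappears.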
4.4. The Algorithmic Process
The entire algorithmic process can be summarized as in Algorithm 1.

5. Experimental Verification and Analysis
To verify the performance of our algorithm, we evaluate it on several video sequences. The sequences are taken from our own dataset, the PETS 2012 Benchmark data, and the CAVIAR database, in which the target pedestrians move under different conditions, including complex backgrounds, severe occlusion, illumination changes, and changes in walking speed.
The parameter settings of our algorithm are shown in Table 1. These parameters are fixed for all video sequences.
5.1. Comparison with Other Trackers
For comparison, these sequences are used to evaluate the performance of superpixel tracking, the boosted particle filter (BPF), and our algorithm under occlusion.
The video parameters in the evaluation are shown in Table 2.
First, the sequence “three pedestrians in the hall” from our own dataset is tested, in which three pedestrians walk through a hall. In Figure 10, the first and second rows compare the results of our algorithm with those of superpixel tracking and BPF, respectively. These frames show that the BPF tracker drifts when a pedestrian is occluded or when another pedestrian acts as a distractor. The BPF tracker constructs its proposal distribution using a mixture model that incorporates information from the dynamic models of each pedestrian and the detection hypotheses generated by AdaBoost; when partial occlusion occurs, it cannot obtain enough of the pedestrian's feature description, which leads to failure. By contrast, both superpixel tracking and our algorithm require only part of the features to track a target, and they are able to handle severe occlusion and recover from drift. Both therefore track the targets, but our algorithm achieves better tracking accuracy and robustness than superpixel tracking.
Figure 11 illustrates how the superpixel weight and the LBP weight of each pedestrian vary during tracking. Because pedestrian 3 is never occluded during tracking, its superpixel and LBP weights show no obvious fluctuation. Where occlusion does occur, the superpixel weight begins to decline and the LBP weight begins to increase after the 107th frame, when the occlusion between pedestrians begins. As the targets move, the occlusion ends after the 123rd frame, and the superpixel weight again becomes higher than the LBP weight.
Figure 12 shows the position error of each of the three pedestrians during tracking. For each pedestrian, the position error at each time step is the root-mean-square error between the estimated target position and the real position at that time step.
We can see that our algorithm is more accurate than both superpixel tracking and BPF tracking, and that it improves the robustness of tracking.
Figure 13 shows target motion trajectories from the first frame to the last by using our algorithm. The different colors represent different pedestrian trajectories. The points in the graph constitute target motion trajectory, and each point represents the target location of each frame.
Secondly, the sequence “five pedestrians in the corridor” from the CAVIAR database is tested, in which severe occlusion occurs twice. Figure 14 shows that our algorithm achieves better tracking accuracy and robustness even though severe mutual occlusion of the pedestrians occurs. Figure 15 shows target motion trajectories from the first frame to the last by using our algorithm.
Thirdly, the sequence “sparse crowd” from the PETS 2012 Benchmark data is tested. Figure 16 shows that both superpixel tracking and BPF tracking suffer tracking failures. Our algorithm, however, tracks all the targets under severe occlusion, pose variation, and changes in walking speed. Figure 17 shows target motion trajectories from the first frame to the last by using our algorithm.
Finally, the sequence “two pedestrians in the square” is tested, in which one pedestrian is at one point severely occluded by another. It differs from the first sequence in that the illumination of the pedestrians' surroundings changes, from strong to weak illumination. Figure 18 shows that our algorithm achieves better tracking accuracy and robustness: although the illumination changes and severe mutual occlusion occurs, the pedestrians are tracked at accurate locations. Figure 19 shows target motion trajectories from the first frame to the last by using our algorithm.
The quantitative evaluations of superpixel tracking, BPF, and our algorithm are presented in Table 3. The table shows that our algorithm has a smaller average center location error, in pixels, than the other two algorithms and thus better tracking accuracy. For each pedestrian, the average position error is the root-mean-square position error averaged over the total number of frames in the tracked video sequence; it measures the error of the experimental results, and the smaller it is, the better the tracking.
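The two error measures can be computed as in the following sketch. It assumes that positions are (x, y) center coordinates in pixels and that the per-frame error is the Euclidean distance between the estimated and the real position; the exact formula is not reproduced in the text, so this form and the function names are assumptions.

```python
import math


def position_errors(estimated, truth):
    """Per-frame distance between estimated and real center positions."""
    return [math.dist(e, g) for e, g in zip(estimated, truth)]


def average_position_error(estimated, truth):
    """Mean of the per-frame position errors over the whole sequence."""
    errors = position_errors(estimated, truth)
    return sum(errors) / len(errors)
```

For a two-frame sequence with estimates `(3, 4)` and `(0, 0)` against a real position of `(0, 0)` in both frames, the per-frame errors are 5.0 and 0.0 pixels and the average error is 2.5 pixels.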
5.2. More Tracking Results
Our algorithm is tested on further sequences taken from our own dataset, the PETS 2012 Benchmark data, and the CAVIAR database. The tracking results are shown in Figure 20.
The test results on the above three groups of video sequences show that our algorithm performs better in complex situations such as target translation, severe occlusion, illumination changes, changes in walking speed, and interference from similar targets.
6. Conclusions
In this paper, we propose a method for multitarget tracking of pedestrians in video sequences based on particle filters. The contributions of our work are as follows. (1) We apply background subtraction and HOG to obtain the target regions in the training frames rapidly and accurately. (2) Our algorithm builds a discriminative appearance model by collecting training samples and constructing two codebooks from superpixel and LBP features. (3) We integrate BoF into the particle filter to obtain better observation results, and then automatically adjust the weight of each feature according to the current tracking environment. Our algorithm was tested on a pedestrian tracking application in a campus environment. In that setting, the algorithm reliably tracks multiple targets and their motion trajectories in difficult sequences with dramatic illumination changes, partial or severe occlusion, and background clutter. Experimental results demonstrate the effectiveness and robustness of our algorithm.
Acknowledgments
This work was supported in part by the National Science Foundation of China under Grant no. 61170202 and Wuhan Municipality Programs for Science and Technology Development under Grant no. 201210121029.