Automatic Moving Object Segmentation for Freely Moving Cameras
This paper proposes a new moving object segmentation algorithm for freely moving cameras, which are very common in outdoor surveillance systems, in-car surveillance systems, and robot navigation systems. A two-layer affine transformation model optimization method is proposed for camera motion compensation, where the outer layer iteration filters out the non-background feature points and the inner layer iteration estimates a refined affine model with the RANSAC method. The feature points are then classified into foreground and background according to the detected motion information. A geodesic graph cut algorithm is then employed to extract the moving foreground based on the classified features. Unlike existing methods based on global optimization or long-term feature point tracking, our algorithm operates on only two successive frames to segment the moving foreground, which makes it suitable for online video processing applications. The experimental results demonstrate that our algorithm is both accurate and fast.
1. Introduction
Moving object detection and segmentation is a basic technique for many applications, such as intelligent video surveillance, intelligent transportation systems, video content analysis, video event detection, and video semantic annotation. In all these applications, the cameras capturing the videos may not be static. For example, the camera of an outdoor surveillance system may shake slightly in strong wind, and the video used for content analysis or event detection may be captured by a hand-held camera. A moving object detection and segmentation algorithm that can handle freely moving cameras is therefore necessary in these cases. However, most existing moving object detection and segmentation algorithms are designed only for static cameras, such as the Gaussian mixture models proposed by Stauffer and Grimson and the kernel density estimation (KDE) used by Elgammal et al. Many methods have been proposed to improve these kinds of algorithms: Sun et al. employed the graph cut algorithm to improve the accuracy of the segmentation results, and Patwardhan et al. constructed a layer model of the scene to improve the robustness of foreground detection and segmentation. Nevertheless, none of these methods can be directly extended to freely moving cameras.
In recent years, several moving object detection and segmentation algorithms for freely moving cameras have been proposed [6–12]. Liu and Gleicher proposed to learn a moving object model by collecting the sparse and insufficient motion information throughout the video. They first detect the moving patches of the foreground object and then combine the moving patches of many frames to learn a color model of the foreground object, which is used for segmentation. However, this kind of method can only process video sequences offline and cannot be applied to online cameras. Kundu et al. proposed a motion detection framework based on multiview geometric constraints such as the epipolar constraint. However, this method requires calibrating the robot camera with a chessboard and can only detect rough moving regions rather than an accurate object segmentation, which restricts its application. Zhang et al. proposed to use a structure-from-motion method to detect and segment the moving foreground object. This method first estimates a dense depth map for each frame and then, in the segmentation step, applies a global optimization over multiple frames to extract the moving object. The depth map estimation and object segmentation steps are iterated several times to obtain accurate results. This method is very time consuming and can only be used for offline video sequences. Several algorithms [9–11, 13] that employ point trajectories to segment the moving objects have been proposed in recent years. The intuition behind these methods is that the motion caused by the camera movement is restricted by geometric constraints, while the motion caused by the object movement is not. The moving object can thus be detected and segmented by analyzing the long-term trajectories of the key points. However, these methods usually need to compute dense optical flow over many frames, which may be too time consuming to run in real time. Moreover, these methods cannot be used in an online scenario because they do not process the video frame by frame. Elqursh and Elgammal improved the point trajectory based method by adding a Bayesian filtering framework to estimate the motion and appearance models. Their method also updates the point trajectories and the motion and appearance models online, so it can be used for online video segmentation. However, the high computational cost is still a problem.
In this paper, we propose a novel moving object detection and segmentation algorithm for freely moving cameras.
Compared to the existing moving object segmentation algorithms for freely moving cameras, our algorithm has the following characteristics.
(1) Unlike most of the existing algorithms, our algorithm does not employ global optimization or long-term feature point tracking. It uses only two successive frames to extract the moving object, which makes it suitable for online video processing.
(2) A two-layer iteration based camera motion compensation method is proposed. The outer layer iteration updates the foreground and background feature sets according to the current parameters of the camera motion compensation model, and the inner layer iteration employs the RANSAC method to estimate the parameters of the camera motion compensation model from the current background feature set. This two-layer iteration makes the camera motion compensation more robust and accurate.
(3) A feature classification and filtering algorithm based on GMM color models is proposed. The classified feature points are used as the input of the geodesic distance based graph cut algorithm, which returns an accurate segmentation result.
The rest of the paper is arranged as follows. Section 2 is an overview of our algorithm, and Section 3 describes the details of our algorithm. After the experiments and discussions in Section 4, the conclusions are presented in Section 5.
2. Algorithm Overview
Figure 1 shows a flow chart of our algorithm. As described above, our algorithm is based on only two successive frames, so its input is the former and current frames of a video. The algorithm has three steps.
(1) Camera Motion Compensation. Since the camera movement between two successive frames is very small in most cases, we can simply assume that the background undergoes only translation and rotation between the former frame and the current frame. An affine transformation model can therefore be employed to describe the movement of the background. To estimate the affine transformation parameters, the corresponding feature points are first found by a forward and backward optical flow algorithm, and then a two-layer iteration based method is used to estimate the parameters.
(2) Feature Extraction and Classification. The edge and corner features are extracted and then classified into moving foreground features (shown as red points) and background features (shown as blue points) according to the detected motion regions. The foreground and background feature sets are then filtered by GMM color models.
(3) Foreground Extraction with Geodesic Distance Based Graph Cut. After the foreground and background feature sets are obtained, the geodesic distances from the other pixels to the feature points are calculated, and a geodesic confidence map is generated. By incorporating the geodesic distance and the geodesic confidence map into the graph cut algorithm, an accurate foreground object can be segmented.
3. Details of Our Algorithm
3.1. Camera Motion Compensation
For most videos, the camera moves only slightly between two successive frames; thus it is assumed that the camera undergoes only translation and rotation in such a short interval, which can be modeled by an affine transformation. In this model, the displacement vector $\mathbf{d}(\mathbf{x})$ of a pixel is assumed to be an affine function of its coordinate $\mathbf{x} = (x, y)^T$:
$$\mathbf{d}(\mathbf{x}) = A\mathbf{x} + \mathbf{t}, \tag{1}$$
where $A = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix}$ is the rotation matrix with parameters $a_1, a_2, a_3, a_4$ and $\mathbf{t} = (t_x, t_y)^T$ is the translation vector. The matrix $A$ also contains the scale change parameters $a_1$ and $a_4$; thus this model can handle scale changes of the background scene, such as those in a video captured by a forward or backward moving camera.
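As a minimal sketch of how such a displacement model acts on a pixel coordinate, the following Python snippet applies six affine parameters to a point. The parameter order `(a1, a2, a3, a4, tx, ty)` and the function names are assumptions made for illustration, not notation taken from the paper.

```python
def affine_displacement(point, params):
    """Displacement d = A x + t of one pixel coordinate (x, y).

    params = (a1, a2, a3, a4, tx, ty) is an assumed ordering:
    A = [[a1, a2], [a3, a4]] and t = (tx, ty).
    """
    a1, a2, a3, a4, tx, ty = params
    x, y = point
    return (a1 * x + a2 * y + tx, a3 * x + a4 * y + ty)


def move_point(point, params):
    """Position of the pixel in the other frame: x + d(x)."""
    dx, dy = affine_displacement(point, params)
    return (point[0] + dx, point[1] + dy)


# A pure translation: every pixel moves by (2, -1).
shift = (0.0, 0.0, 0.0, 0.0, 2.0, -1.0)
# A 10% zoom about the origin, as a forward moving camera might produce.
zoom = (0.1, 0.0, 0.0, 0.1, 0.0, 0.0)

shifted = move_point((10.0, 20.0), shift)
zoomed = move_point((10.0, 20.0), zoom)
print(shifted, zoomed)
```

With all four matrix entries set to zero the model reduces to a pure translation, while nonzero diagonal entries make the displacement grow with the distance from the origin, which is how the model absorbs scale changes.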
Since the camera motion and the foreground motion are distinct, the foreground motion cannot be appropriately modeled by the affine transformation model. Ideally, therefore, only background pixels should be used to estimate the affine parameters. This is achieved by our two-layer iteration based method, as shown in Figure 2. The outer layer iteration updates the foreground and background feature points according to the motion regions detected with the current affine parameters. The inner layer RANSAC process estimates the affine parameters from the updated background features.
The feature points used in our paper are the edge and corner points, which can be detected using the method of Shi and Tomasi. In order to estimate the affine model parameters, the corresponding feature points of the two successive frames should be detected. We employ forward and backward optical flow estimation to achieve this goal. For the current frame $I_t$, we first extract its feature points (denoted as $P_t = \{\mathbf{p}_i\}_{i=1}^{N}$, where $N$ is the number of feature points) and then use the pyramidal Lucas-Kanade optical flow to track these features to the next frame $I_{t+1}$. We thus obtain a set of feature points on frame $I_{t+1}$ by this forward optical flow, denoted as $P_{t+1} = \{\mathbf{q}_i\}_{i=1}^{N}$. We then track the features back from $I_{t+1}$ to $I_t$ using the backward optical flow and obtain a new set of features on $I_t$, denoted as $P'_t = \{\mathbf{p}'_i\}_{i=1}^{N}$. In the ideal case, $P_t$ and $P'_t$ should be the same. However, due to the errors of the optical flow estimation, they are not identical. By comparing $P_t$, $P_{t+1}$, and $P'_t$, we can remove the feature points that have erroneous optical flow and so find the correct corresponding feature points between the two successive frames. We use two criteria to filter the optical flow errors. The first is the ZNCC (zero-mean normalized cross correlation), which is defined as
$$\mathrm{ZNCC}(\mathbf{p}, \mathbf{q}) = \frac{\sum_{\mathbf{w}} \left(I_t(\mathbf{p}+\mathbf{w}) - \bar{I}_t(\mathbf{p})\right)\left(I_{t+1}(\mathbf{q}+\mathbf{w}) - \bar{I}_{t+1}(\mathbf{q})\right)}{\sqrt{\sum_{\mathbf{w}} \left(I_t(\mathbf{p}+\mathbf{w}) - \bar{I}_t(\mathbf{p})\right)^2 \sum_{\mathbf{w}} \left(I_{t+1}(\mathbf{q}+\mathbf{w}) - \bar{I}_{t+1}(\mathbf{q})\right)^2}}, \tag{2}$$
where $\mathbf{p}$ and $\mathbf{q}$ are the coordinates of the corresponding feature points in $I_t$ and $I_{t+1}$, respectively, $\mathbf{w}$ ranges over the pixel offsets of a fixed-size window, and $\bar{I}_t(\mathbf{p})$ and $\bar{I}_{t+1}(\mathbf{q})$ are the mean pixel intensities of the windows centered at $\mathbf{p}$ and $\mathbf{q}$, respectively. The ZNCC score is calculated for each pair of feature points in $P_t$ and $P_{t+1}$, and part of the feature points with erroneous optical flow can then be filtered out by setting a threshold $T_z$; that is, if $\mathrm{ZNCC}(\mathbf{p}_i, \mathbf{q}_i) < T_z$, then the optical flow from $\mathbf{p}_i$ to $\mathbf{q}_i$ is considered an error. In our experiments, we find that setting $T_z$ to the median of the ZNCC scores gives good enough results.
The second criterion for filtering erroneous optical flow is the displacement between the corresponding points of $P_t$ and $P'_t$, which is defined as the Euclidean distance between their coordinates and denoted as $d_i = \lVert \mathbf{p}_i - \mathbf{p}'_i \rVert$. Similarly, if $d_i > T_d$, the forward optical flow from $I_t$ to $I_{t+1}$ and the backward optical flow from $I_{t+1}$ to $I_t$ are considered errors. $T_d$ is also set to the median of the displacements of the corresponding feature points. After filtering the erroneous optical flow, we obtain the feature point matching results shown in Figure 2. The matching feature points are denoted as a feature set $M$.
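The two filtering criteria can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: patches are passed as flat lists of intensities, the tracked and back-tracked coordinates are assumed to come from an optical flow routine that is not shown, and the names `zncc` and `filter_matches` are hypothetical.

```python
import math
from statistics import median


def zncc(patch_a, patch_b):
    """Zero-mean normalized cross correlation of two equal-size flat patches."""
    mean_a = sum(patch_a) / len(patch_a)
    mean_b = sum(patch_b) / len(patch_b)
    da = [v - mean_a for v in patch_a]
    db = [v - mean_b for v in patch_b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation undefined, treat as no match
    return sum(x * y for x, y in zip(da, db)) / denom


def filter_matches(points, back_tracked, zncc_scores):
    """Return indices that pass both median-threshold tests.

    points       : feature coordinates in the current frame
    back_tracked : the same features after forward then backward tracking
    zncc_scores  : ZNCC of each feature with its forward-tracked match
    """
    fb_error = [math.dist(p, pb) for p, pb in zip(points, back_tracked)]
    t_zncc = median(zncc_scores)  # threshold on the correlation score
    t_dist = median(fb_error)     # threshold on the forward-backward error
    return [i for i in range(len(points))
            if zncc_scores[i] >= t_zncc and fb_error[i] <= t_dist]


# Toy data: feature 2 has a poor correlation, feature 3 drifts on the way back.
points = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
back_tracked = [(0, 0), (10, 0.1), (0, 10.2), (13, 10), (5, 5.05)]
scores = [0.99, 0.95, 0.2, 0.9, 0.97]
kept = filter_matches(points, back_tracked, scores)
print(kept)
```

Because both thresholds are medians, each test alone discards roughly half of the candidates, so the surviving set is small but reliable, which suits a model that has only six parameters to estimate.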
Once the matching feature points are obtained, the two-layer iteration is performed. The details are described in Algorithm 1. In the inner layer iteration, the RANSAC algorithm requires 3 pairs of corresponding feature points to estimate the affine parameters $a_1$, $a_2$, $a_3$, $a_4$, $t_x$, $t_y$. Since we use these 6 parameters to estimate the global motion of the whole image, the 3 pairs of feature points sampled from $M$ should be distributed over the whole image instead of a local area. For the moving region detection, we use the estimated affine parameters to compensate the camera motion and calculate the frame difference to find the moving regions. The background feature set can then be updated by classifying the feature set $M$ into foreground and background according to the frame difference:
$$f(\mathbf{x}) \in \begin{cases} F, & D(\mathbf{x}) > T, \\ B, & \text{otherwise}, \end{cases} \tag{3}$$
where $f(\mathbf{x})$ denotes the feature point at location $\mathbf{x}$, $F$ and $B$ are the sets of foreground and background feature points, respectively, and $T$ is a threshold value. $D(\mathbf{x})$ is the frame difference value at pixel $\mathbf{x}$ and is calculated as
$$D(\mathbf{x}) = \left| I_{t+1}(\mathbf{x}) - \tilde{I}_t(\mathbf{x}) \right|, \tag{4}$$
where $\tilde{I}_t$ is the affine warped current frame.
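Since Algorithm 1 is only referenced here, the following Python sketch shows one way the two-layer iteration could be organized. It is an illustration under stated assumptions: matches are `((x, y), (u, v))` coordinate pairs, the helper names are hypothetical, the check that samples are spread over the whole image is omitted, and the demo replaces the frame difference test with a reprojection residual test so that the example needs no image data.

```python
import random


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_affine(pairs):
    """Exact displacement model (a1, a2, a3, a4, tx, ty) from 3 point pairs."""
    rows = [[p[0], p[1], 1.0] for p, _ in pairs]
    d = det3(rows)
    if abs(d) < 1e-9:
        return None  # the three sample points are collinear
    solved = []
    for axis in (0, 1):  # solve the x and y displacement equations separately
        rhs = [q[axis] - p[axis] for p, q in pairs]
        sol = []
        for col in range(3):  # Cramer's rule, one unknown per column
            m = [row[:] for row in rows]
            for r in range(3):
                m[r][col] = rhs[r]
            sol.append(det3(m) / d)
        solved.append(sol)
    (a1, a2, tx), (a3, a4, ty) = solved
    return (a1, a2, a3, a4, tx, ty)


def residual(pair, params):
    """Distance between the predicted and the observed matched position."""
    (x, y), (u, v) = pair
    a1, a2, a3, a4, tx, ty = params
    return ((x + a1 * x + a2 * y + tx - u) ** 2
            + (y + a3 * x + a4 * y + ty - v) ** 2) ** 0.5


def ransac_affine(pairs, rng, trials=200, tol=1.0):
    """Inner layer: keep the 3-pair model supported by the most matches."""
    best, best_count = None, -1
    for _ in range(trials):
        params = fit_affine(rng.sample(pairs, 3))
        if params is None:
            continue
        count = sum(residual(pr, params) < tol for pr in pairs)
        if count > best_count:
            best, best_count = params, count
    return best


def two_layer_estimate(matches, is_background, rng, outer_iters=5):
    """Outer layer: re-select the background set, then re-run RANSAC."""
    background = list(matches)
    params = None
    for _ in range(outer_iters):
        params = ransac_affine(background, rng)
        if params is None:
            break
        updated = [m for m in matches if is_background(m, params)]
        if len(updated) < 3 or updated == background:
            break  # too few points left, or the background set is stable
        background = updated
    return params, background


# Toy scene: 12 background matches shifted by (2, -1), 4 foreground by (8, 5).
bg = [((x, y), (x + 2, y - 1)) for x in (0, 20, 40, 60) for y in (0, 20, 40)]
fg = [((25, 25), (33, 30)), ((27, 31), (35, 36)),
      ((33, 26), (41, 31)), ((31, 33), (39, 38))]
# Stand-in for the frame difference test (an assumption of this sketch).
params, background = two_layer_estimate(
    bg + fg, lambda m, p: residual(m, p) < 1.0, random.Random(0))
print(params, len(background))
```

In a real pipeline the `is_background` callback would warp the frame with `params`, compute the frame difference, and apply the threshold test to each feature location.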
3.2. Feature Extraction and Classification
After obtaining the final affine parameters, we compute the frame difference using (4) and then classify the feature points of the current frame into the foreground and background feature sets $F$ and $B$ using (3), as shown in Figure 1.
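The classification step can be sketched as follows. The snippet assumes that the two frames are already aligned and stored as nested lists of grayscale values; the threshold of 30 and the function names are hypothetical choices for illustration only.

```python
def frame_difference(warped, reference):
    """Per-pixel absolute difference of two equal-size grayscale images."""
    return [[abs(a - b) for a, b in zip(row_w, row_r)]
            for row_w, row_r in zip(warped, reference)]


def classify_features(features, diff, threshold):
    """Split (row, col) feature locations into foreground and background."""
    foreground, background = [], []
    for r, c in features:
        if diff[r][c] > threshold:
            foreground.append((r, c))
        else:
            background.append((r, c))
    return foreground, background


# A 4x4 toy pair: a bright 2x2 object appears in the reference frame.
warped = [[10] * 4 for _ in range(4)]
reference = [[10, 10, 10, 10],
             [10, 200, 200, 10],
             [10, 200, 200, 10],
             [10, 10, 10, 10]]
diff = frame_difference(warped, reference)
fg_pts, bg_pts = classify_features([(0, 0), (1, 1), (2, 2), (3, 3)], diff, 30)
print(fg_pts, bg_pts)
```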
$F$ and $B$ cannot be directly used by the graph cut algorithm in the following step to extract the foreground object, because they usually contain some classification errors. As indicated by the green ellipses in Figure 3(a), some foreground feature points are misclassified into background. This is because the moving regions detected by the frame difference, as shown in Figure 3(c), are composed of both real foreground regions and false foreground regions. These false foreground regions are actually background regions occluded by the moving object. In order to eliminate these misclassifications, we further perform a refining process in our algorithm. Since we already have an initial classification of the feature points, we can build two Gaussian mixture models (GMMs) for $F$ and $B$ and then use these two models to reestimate the probability of each feature point belonging to the foreground. The feature points in $F$ and in $B$ are first grouped into $K$ clusters each by a farthest-point clustering algorithm, and then the mean and variance of each cluster are calculated to construct the GMMs. The probability of a feature point belonging to the foreground can be estimated as
$$P_F(\mathbf{c}) = \frac{p(\mathbf{c} \mid F)}{p(\mathbf{c} \mid F) + p(\mathbf{c} \mid B)}, \qquad p(\mathbf{c} \mid S) = \sum_{k=1}^{K} w_k^{S} \prod_{j} G\left(c_j;\, \mu_{k,j}^{S}, \sigma_{k,j}^{S}\right), \quad S \in \{F, B\}, \tag{5}$$
where $\mathbf{c}$ is the color vector of the feature point to be estimated, $w_k^{F}$ and $w_k^{B}$ are the prior probabilities of the $k$th Gaussian component, calculated as the ratio between the number of feature points in this component and the number of feature points in the whole GMM, $j$ denotes the color channel, and $G$ denotes the Gaussian kernel. Then, for a feature point in $F$, if $P_F(\mathbf{c}) < T_F$, the point is removed from $F$. Similarly, for a feature point in $B$, if $P_F(\mathbf{c}) > T_B$, the point is removed from $B$. It should be noted that the feature points removed from $F$ (or $B$) are not added to $B$ (or $F$); they are all marked as unknown and will be assigned a label by the graph cut algorithm. The feature classification after eliminating these errors becomes much better, as shown in Figure 3(b).
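A minimal sketch of this refinement is given below. It assumes that the mixture components are already available as `(weight, per-channel means, per-channel variances)` tuples, so the farthest-point clustering step is not shown, and it assumes that the foreground probability is the foreground likelihood normalized by the sum of both likelihoods. The two thresholds of 0.5 and all names are hypothetical.

```python
import math


def gaussian(value, mean, var):
    """One-dimensional Gaussian kernel."""
    return math.exp(-(value - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_likelihood(color, gmm):
    """Weighted sum over components of the per-channel Gaussian product."""
    total = 0.0
    for weight, means, variances in gmm:
        prob = weight
        for c, m, v in zip(color, means, variances):
            prob *= gaussian(c, m, v)
        total += prob
    return total


def foreground_probability(color, fg_gmm, bg_gmm):
    pf = gmm_likelihood(color, fg_gmm)
    pb = gmm_likelihood(color, bg_gmm)
    return pf / (pf + pb) if pf + pb > 0.0 else 0.5


def refine(fg_colors, bg_colors, fg_gmm, bg_gmm, t_fg=0.5, t_bg=0.5):
    """Drop doubtful points from both sets; dropped points become unknown."""
    keep_fg, keep_bg, unknown = [], [], []
    for c in fg_colors:
        ok = foreground_probability(c, fg_gmm, bg_gmm) >= t_fg
        (keep_fg if ok else unknown).append(c)
    for c in bg_colors:
        ok = foreground_probability(c, fg_gmm, bg_gmm) <= t_bg
        (keep_bg if ok else unknown).append(c)
    return keep_fg, keep_bg, unknown


# Toy models: a reddish object in front of a greenish and gray background.
fg_gmm = [(1.0, (200, 50, 50), (100, 100, 100))]
bg_gmm = [(0.6, (60, 120, 60), (100, 100, 100)),
          (0.4, (90, 90, 90), (100, 100, 100))]
keep_fg, keep_bg, unknown = refine(
    [(198, 52, 49), (62, 118, 61)],   # second point looks like background
    [(88, 91, 90), (201, 48, 51)],    # second point looks like foreground
    fg_gmm, bg_gmm)
print(keep_fg, keep_bg, unknown)
```

Points that fail the test are returned in `unknown` rather than moved to the opposite set, which mirrors the statement above that their labels are left to the graph cut step.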
3.3. Foreground Extraction with Geodesic Graph Cut
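So far, we have obtained the foreground and background key points, which means that part of the pixels have been labeled as foreground or background. Starting from this initial labeling, we can obtain a complete foreground segmentation by employing a geodesic graph cut algorithm, where we use the geodesic distance and color models to calculate the energy function of the graph cut algorithm, which is defined as
$$E(L) = \sum_{i} R_{l_i}(x_i) + \lambda \sum_{(i,j) \in \mathcal{N}} B(x_i, x_j)\, \lvert l_i - l_j \rvert, \tag{6}$$
where $L = (l_1, \ldots, l_n)$ is a binary vector and $l_i$ is the label $F$ or $B$ for pixel $x_i$. $R_{l_i}(x_i)$ is the unary term and $B(x_i, x_j)$ is the pairwise term of the energy function, summed over pairs of neighboring pixels $\mathcal{N}$. $\lambda$ is a weight that balances the unary and pairwise terms. The unary term combines a constraint $s_l(x_i)$ for the foreground and background feature points, a normalized geodesic distance term $G_l(x_i)$, and a geodesic confidence $\alpha(x_i)$. The constraint is
$$s_l(x_i) = \begin{cases} \infty, & x_i \in \Omega_{\bar{l}}, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$
where $\Omega_l$ indicates the foreground and background features and $\bar{l}$ denotes the label opposite to $l$ (i.e., if $l = F$, then $\bar{l} = B$).
$G_l(x_i)$ is computed by normalizing the relative foreground/background geodesic distances:
$$G_l(x_i) = \frac{D_l(x_i)}{D_F(x_i) + D_B(x_i)}, \tag{8}$$
where the geodesic distances $D_F(x_i)$ and $D_B(x_i)$ from each pixel to the foreground and background feature points are computed with an efficient existing geodesic distance method. $\alpha(x_i)$ is the geodesic confidence, which is defined as
$$\alpha(x_i) = \frac{\lvert D_F(x_i) - D_B(x_i) \rvert}{D_F(x_i) + D_B(x_i)}. \tag{9}$$
The pairwise term is defined as
$$B(x_i, x_j) = \frac{1}{1 + \lVert C(x_i) - C(x_j) \rVert^2}, \tag{10}$$
where $C(x_i)$ and $C(x_j)$ are pixel colors.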
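To make the geodesic quantities concrete, the sketch below computes geodesic distance maps with Dijkstra's algorithm on a 4-connected pixel grid and derives a normalized distance term and a confidence map from them. Several choices here are assumptions made for illustration: the step cost is the absolute intensity difference between neighbors, the normalization follows the commonly used form D_F / (D_F + D_B), the confidence is |D_F - D_B| / (D_F + D_B), and the graph cut optimization itself is not shown.

```python
import heapq


def geodesic_distance(image, seeds):
    """Dijkstra over the 4-connected grid; step cost = |intensity difference|."""
    h, w = len(image), len(image[0])
    dist = [[float("inf")] * w for _ in range(h)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heap.append((0.0, r, c))
    heapq.heapify(heap)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + abs(image[nr][nc] - image[r][c])
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist


def geodesic_terms(d_fg, d_bg, eps=1e-9):
    """Normalized foreground distance and confidence for every pixel."""
    g_f, conf = [], []
    for row_f, row_b in zip(d_fg, d_bg):
        g_f.append([f / (f + b + eps) for f, b in zip(row_f, row_b)])
        conf.append([abs(f - b) / (f + b + eps) for f, b in zip(row_f, row_b)])
    return g_f, conf


# Toy image: dark background on the left, bright object on the right.
image = [[10, 10, 10, 200, 200] for _ in range(3)]
d_fg = geodesic_distance(image, [(1, 4)])  # one foreground feature point
d_bg = geodesic_distance(image, [(1, 0)])  # one background feature point
g_f, conf = geodesic_terms(d_fg, d_bg)
labels = [["F" if g < 0.5 else "B" for g in row] for row in g_f]
print(labels[0])
```

A pixel that is geodesically close to the foreground seeds gets a small foreground term and a confidence near one, so two sparse seeds are enough to label this toy image; in the full method these terms enter the energy function and the final labels come from the graph cut.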
4. Experiment and Discussion
4.1. Test of Our Algorithm
We tested our algorithm on many videos captured by freely moving cameras. Some results are shown in Figure 4.
It can be seen that the image alignment employed in our algorithm is effective: the moving object regions are well detected by the frame difference, as shown in Figure 4(b), and the feature points are classified accurately, as shown in Figure 4(c). From Figure 4(d), we can see that although only a small part of the pixels (the feature points) is constrained, the labels are correctly propagated to the other nonfeature pixels and an accurate segmentation is obtained.
We tested our algorithm on a laptop with a four-core 2.1 GHz CPU and 8 GB of RAM. The image alignment step takes most of the computation time. However, we speed it up by using the affine transformation parameters obtained from the former pair of frames to initialize the parameters of the current pair of frames. The whole system can thus run at about 15 fps. For the image alignment step, we suggest downsampling the image to a relatively small scale before estimating the affine transformation parameters. This improves not only the speed but also the accuracy, because in a large scale image the background displacement may be very large and may not be well estimated by the affine transformation model, even though the estimation already addresses this problem by employing a pyramid model. Figure 5 shows a comparison of the image alignment results for the same pair of frames at different scales. When the images are aligned at the larger scale, the frame difference in the background regions is somewhat high, so that some background feature points are misclassified as foreground, as shown in Figure 5(b). By comparison, aligning the images at the smaller scale gives much better results, as shown in Figure 5(c).
Our algorithm also works for static camera videos and usually runs at a very high speed in that case because the image alignment step can be removed. According to our test, our system runs at more than 25 fps on the laptop mentioned above. Figure 6 shows some results of applying our algorithm to videos captured by static cameras.
4.2. Comparison with Existing Algorithms
We compare our algorithm with three recently proposed algorithms [20–22]. The comparison results are shown in Figure 7. As can be seen, the results obtained by our method are clearly more accurate than those obtained by Zhang et al. and Zhou et al. Both Zhang’s and Zhou’s methods may mislabel large foreground or background areas. More concretely, our algorithm is more robust to topology changes of the object, while Zhang’s and Zhou’s methods tend to segment the object erroneously when its topology changes greatly. As shown in the last rows of Figure 7, our algorithm obtains comparable or even better segmentation results than the method of Papazoglou and Ferrari, which is reported to outperform the state-of-the-art algorithms. The comparison can be observed more clearly through the error rate of these algorithms. The error rate is a criterion commonly used to compare the accuracy of object segmentation algorithms; it is defined as a ratio of pixel counts, where “#” means “the number of”. The error rates for each frame of the two test image sequences are shown in Figure 8. It can be seen that our results are better than those obtained by Zhang et al. and Zhou et al. Although the results obtained by our algorithm are comparable with those of Papazoglou and Ferrari, our algorithm runs much faster, as discussed in the following section.
Figure 8: Error rate comparison. (a) Image sequence 1. (b) Image sequence 2.
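As an illustration of a per-frame error rate, the sketch below uses one common form of the measure: the number of pixels whose label differs from the ground truth divided by the number of pixels in the frame. The exact counts used in the paper's definition are not reproduced here, so this choice of numerator and denominator is an assumption, as is the function name.

```python
def error_rate(segmentation, ground_truth):
    """Fraction of pixels whose label differs from the ground truth."""
    total = 0
    wrong = 0
    for row_s, row_g in zip(segmentation, ground_truth):
        for s, g in zip(row_s, row_g):
            total += 1
            if s != g:
                wrong += 1
    return wrong / total


# 1 marks foreground, 0 marks background; two of twelve pixels disagree.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
result = [[0, 1, 1, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 0]]
rate = error_rate(result, truth)
print(round(rate, 4))
```

Computing this value for every frame of a sequence gives one curve per algorithm, which is the kind of comparison plotted in Figure 8.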
4.3. Run Time
As described above, our algorithm runs at a high speed: about 15 fps for videos captured by nonstatic cameras and 25 fps for videos captured by static cameras. The three existing algorithms [20–22] are all relatively slow compared to our method. According to our test, Zhou’s algorithm takes more than 2 s on average to process an image, so it needs at least 30 times longer than our method to process the same sequence. Papazoglou and Ferrari have reported that, given the optical flow and superpixels, their fast segmentation method takes only 0.5 s/frame, which is still slower than our algorithm. Furthermore, the optical flow and superpixels must also be calculated, which takes a long time. For example, Papazoglou and Ferrari employ Brox’s optical flow estimator, which takes more than 7 s and is thus very time consuming. Zhang et al. employ an even more complex method that includes optical flow estimation, a GMM with the EM algorithm, and a graph cut optimization. This takes even longer, more than 8.5 s/frame in our test.
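The speed ratios quoted above follow directly from the reported timings. A small sketch of the arithmetic, using only the figures stated in this section (which are lower bounds for the other methods, since they are reported as "more than"), is:

```python
OUR_FPS = 15.0  # reported speed of our method on moving-camera videos

# Reported per-frame processing times in seconds.
seconds_per_frame = {
    "Zhou et al.": 2.0,
    "Papazoglou and Ferrari (segmentation only)": 0.5,
    "Zhang et al.": 8.5,
}


def slowdown(spf, fps=OUR_FPS):
    """How many times longer a method needs per frame than 1 / fps."""
    return spf * fps


ratios = {name: slowdown(spf) for name, spf in seconds_per_frame.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

At 15 fps our method spends about 0.067 s on each frame, so a method that needs 2 s per frame is 30 times slower, which matches the factor stated in the text.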
5. Conclusions
This paper proposed a real-time online moving object detection and segmentation algorithm for videos captured by freely moving cameras, which uses only two successive frames to segment the moving object. A two-layer iteration algorithm is proposed to accurately estimate the affine transformation parameters between two successive frames. A feature point detection and filtering algorithm is proposed to remove erroneous foreground and background feature points. The object is finally extracted by a geodesic graph cut algorithm. The algorithm is demonstrated to be effective on many videos. Compared to the existing algorithms based on long-term key point trajectories, our algorithm not only works in an online processing mode but also runs at a high speed. This makes our algorithm practical in many applications.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the China Postdoctoral Science Foundation (no. 2013M530020).
References
C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), pp. 246–252, June 1999.
A. Elgammal, D. Harwood, and L. S. Davis, “Nonparametric model for background subtraction,” in European Conference on Computer Vision, pp. 751–767, 2000.
A. Kundu, K. M. Krishna, and J. Sivaswamy, “Moving object detection by multi-view geometric techniques from a single camera mounted robot,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '09), pp. 4306–4312, October 2009.
T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in European Conference on Computer Vision, pp. 282–295, 2010.
A. Elqursh and A. Elgammal, “Online moving camera background subtraction,” in European Conference on Computer Vision, pp. 228–241, 2012.
J. Shi and C. Tomasi, “Good features to track,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593–600, June 1994.
J. Y. Bouguet, Pyramidal Implementation of the Lucas Kanade Feature Tracker Description, Technical Report for Intel Corporation Microprocessor Research Lab, Santa Clara, Calif, USA, 2000.
D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 628–635, June 2013.
A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV '13), pp. 1777–1784, December 2013.