Abstract
Video background modeling is an important preprocessing stage for various applications, and principal component pursuit (PCP) is among the state-of-the-art algorithms for this task. One of the main drawbacks of PCP is its sensitivity to jitter and camera movement. This problem has only been partially solved by a few methods devised for jitter or small transformations. However, such methods cannot handle the case of moving or panning cameras in an incremental fashion. In this paper, we greatly expand the results of our earlier work, in which we presented a novel, fully incremental PCP algorithm, named incPCPPTI, which was able to cope with panning scenarios and jitter by continuously aligning the low-rank component to the current reference frame of the camera. To the best of our knowledge, incPCPPTI is the first low-rank plus additive matrix method capable of handling these scenarios in an incremental way. The results on synthetic videos and the Moseg, DAVIS, and CDnet2014 datasets show that incPCPPTI is able to maintain good performance in the detection of moving objects even when panning and jitter are present in a video. Additionally, in most videos, incPCPPTI obtains competitive or superior results compared to state-of-the-art batch methods.
1. Introduction
Video background modeling consists of segmenting the “foreground” or moving objects from the static “background.” It is an important first step in various computer vision applications [1] such as abnormal event identification [2] and surveillance [3].
Several video background modeling methods, using different approaches such as Gaussian mixture models [4], kernel density estimations [5], or neural networks [6], exist in the literature. More comprehensive surveys of other methods are presented in [1, 7]. Principal component pursuit (PCP) is currently considered to be one of the leading algorithms for video background modeling [8]. Formally, PCP was introduced in [9] as the nonconvex optimization problem:

$$\arg\min_{L,S}\ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad D = L + S, \tag{1}$$

where the matrix $D \in \mathbb{R}^{m \times n}$ is formed by the n observed frames, each of size $m = N_r \times N_c \times N_d$ (rows, columns, and number of channels, respectively); $L \in \mathbb{R}^{m \times n}$ is a low-rank matrix representing the background; $S \in \mathbb{R}^{m \times n}$ is a sparse matrix representing the foreground; $\lambda$ is a fixed global regularization parameter; and $\operatorname{rank}(L)$ is the rank of $L$ and $\|S\|_0$ is the $\ell_0$ norm of $S$.
The convex relaxation of (1) is given by

$$\arg\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad D = L + S, \tag{2}$$

where $\|L\|_*$ is the nuclear norm of matrix L (i.e., $\sum_k |\sigma_k(L)|$, the sum of the singular values of L) and $\|S\|_1$ is the $\ell_1$ norm of S. Although (2) is at the core of most PCP algorithms (including the augmented Lagrange multiplier (ALM) and inexact ALM (iALM) algorithms [10, 11]), several other formulations exist (for a complete list, see [12], Table 4). In particular, we point out

$$\arg\min_{L,S}\ \frac{1}{2}\|L + S - D\|_F^2 + \lambda \|S\|_1 \quad \text{s.t.} \quad \operatorname{rank}(L) \le r, \tag{3}$$

where $\|\cdot\|_F$ is the Frobenius norm. Formulation (3) was originally proposed in [13], and we will use it as the starting point of our proposed method (Sections 2.2.2 and 3.1).
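To make the rank-constrained formulation concrete, the following is a minimal batch sketch, assuming the problem takes the form $\frac{1}{2}\|L+S-D\|_F^2+\lambda\|S\|_1$ subject to $\operatorname{rank}(L)\le r$ and is solved by plain alternating minimization (a truncated SVD for L, element-wise soft thresholding for S). The function names `pcp_alt_min`, `soft_threshold`, and `objective` are illustrative and do not come from any of the cited implementations.

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise shrinkage: sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def pcp_alt_min(D, rank, lam, n_iter=10):
    """Alternate between a rank-constrained fit of D - S and a sparse fit of D - L.

    Illustrative batch solver (assumption), not the incremental algorithm
    discussed in this paper.
    """
    D = np.asarray(D, dtype=float)
    S = np.zeros_like(D)
    L = np.zeros_like(D)
    for _ in range(n_iter):
        # L-step: best rank-`rank` approximation of D - S (truncated SVD).
        U, sig, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * sig[:rank]) @ Vt[:rank, :]
        # S-step: closed-form minimizer of the l1-regularized least squares.
        S = soft_threshold(D - L, lam)
    return L, S

def objective(D, L, S, lam):
    """Value of 0.5 * ||L + S - D||_F^2 + lam * ||S||_1."""
    return 0.5 * np.linalg.norm(L + S - D, "fro") ** 2 + lam * np.abs(S).sum()
```

Because each step exactly minimizes the objective over one block of variables, the objective value cannot increase from one iteration to the next, which is a convenient sanity check when experimenting with `rank` and `lam`.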
Bouwmans and Zahzah [8] showed that PCP provides state-of-the-art performance in video background modeling problems but also stated some of its limitations.
First, PCP is inherently a batch method with high computational and memory requirements. This problem has been addressed in the past by means of solutions based on rank-1 updates for thin SVD [14, 15] (applied to (3)), by low-rank subspace tracking [16] (applied to (2)), stochastic optimization [17] (which applies the maximum-margin matrix factorization (M3F) method [18] to (2)), or random sampling [19] (also applied to (2)).
The second shortcoming of PCP, which is particularly relevant to the present work, is its sensitivity to jitter and its inability to cope with panning video frames. For a general review and classification of methods for motion segmentation able to cope with different degrees of camera motion, we recommend [20] and the many references therein. Among the methods based on the low-rank plus additive matrices model, we highlight the robust alignment by sparse and low-rank decomposition (RASL) method [21]. This method used (2) as its starting point and addressed the problem of jitter in PCP using a series of geometric transformations on the observed frames, but as originally cast, it is a batch method. In contrast, tGRASTA [22] and incPCPTI [23], which used (2) and (3), respectively, as their starting points, addressed the problem of jitter in a semi-incremental or fully incremental way by applying geometric transformations to the observed frames or to the low-rank component, respectively. Other proposed methods are robust against moving cameras and panning [24–26], but all of them are batch or semibatch methods; furthermore, all of them used (2) as their starting point and also used the same general ideas (4) as RASL. Recently, Gao et al. [27] presented a new batch PCP method that produces a panoramic low-rank component that spans the entire field of view, which gives much better results in long panning sequences. We notice, nonetheless, that a fully online PCP algorithm able to cope with both jitter and panning is still an open problem. This problem is of particular importance in some applications such as surveillance systems that use moving traffic or air cameras.
In the present study, we expand our previous work [28], where we proposed to address the panning problem by modifying the optimization problem solved by incPCPTI [14], which in turn uses (3) as its starting point, and by applying a set of transformations to the low-rank component that are updated with each incoming new frame. We substantially expand [28] by
(1) expanding the theoretical basis of our algorithm so that the present manuscript is self-contained
(2) testing our algorithm in an additional real-life dataset with moving cameras
(3) comparing our algorithm with two previously proposed batch methods
Our computational experiments on synthetically created datasets and publicly available videos of the Moseg [29], DAVIS [30], and CDnet2014 [31] datasets show that the proposed algorithm, henceforth referred to as panning and transformation invariant incPCP (incPCPPTI), is able to correctly handle video background modeling in panning and basic jitter conditions.
2. Previous Related Work
2.1. Batch Methods
In this section, two previous motion segmentation batch methods that work under jitter/panning conditions are reviewed. For a more complete review of all available methods, refer to [20]. The two algorithms described here work in batch fashion but were chosen as a comparison benchmark in this paper due to the public availability of their codes and/or binary executables. It is noted that although [27] recently published a new PCP method for moving cameras, the algorithm was not chosen for comparison due to the unavailability of public code or executables.
2.1.1. Segmentation by Long-Term Video Analysis
In [29], the authors proposed to use a dense point tracker based on variational optical flow in which, instead of the classical two-frame approach of optical flow, long-term analysis is used. It is worth mentioning that, following the general classification of methods for motion segmentation proposed in [12], [29] was catalogued as a trajectory classification method (see Table 3 of [12] for a summary of the aforementioned classification, along with their associated properties). After the initial tracking of points, spectral clustering with a spatial regularity constraint is utilized to form groups of point trajectories corresponding to different objects in the image. Finally, an energy minimization model is used to transform the clusters into a dense segmentation of moving objects. Throughout this paper, this method will be referred to as LTVA.
2.1.2. DECOLOR
The detecting contiguous outliers in the low-rank representation (DECOLOR) method [25] uses a nonconvex penalty and a Markov random field [32] model to detect outliers that correspond to moving objects. Bouwmans et al. [12] classified this algorithm as a low-rank and sparse representation method [21]. For moving cameras, the method uses a transformation obtained from a prealignment to the middle frame. The prealignment is performed using the robust multiresolution method proposed in [33], and DECOLOR then iteratively refines this transformation.
2.2. Online or Semi-online Methods
In this section, two previous online or partially online PCP methods that work under jitter conditions are reviewed. It should be noted that, without modification, these two methods are not directly applicable to panning scenarios.
2.2.1. tGRASTA
The Grassmannian Robust Adaptive Subspace Tracking Algorithm (GRASTA) [16] is a semi-online method for low-rank subspace tracking that has been applied to the foreground-background separation problem. GRASTA is not a fully online algorithm as it requires an initialization stage to obtain an initial low-rank subspace from the first p frames. A modification called tGRASTA was presented in [22], and it is based on the Robust Alignment by Sparse and Low-Rank decomposition (RASL) algorithm [21]. RASL tries to handle the misalignment in the video frames by solving

$$\arg\min_{L,S,\tau}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad \tau(D) = L + S, \tag{4}$$

where $\tau = \{\tau_k\}$ are a series of per-frame transformations that align all the observed frames; it is straightforward to note that (4) is an extension of (2).
The nonlinearity in the transformations τ of (4) is handled via a linearization using the Jacobian. The main drawback of tGRASTA is that, aside from the required low-rank subspace initialization, the initial transformation τ is estimated by using a similarity transformation obtained from a series of three points manually chosen from each of the p initial frames. This initialization stage severely constrains its application in automatic processes and reduces its applicability in panning scenarios, as the feature points in the initial frames may not be present in subsequent frames.
2.2.2. incPCPTI
The incPCPTI [23] considers the optimization problem

$$\arg\min_{L,S,T}\ \frac{1}{2}\|T(L) + S - D\|_F^2 + \lambda \|S\|_1 \quad \text{s.t.} \quad \operatorname{rank}(L) \le r, \tag{5}$$

where $D$ is the observed video sequence that suffers from jitter, $L$ is the properly aligned low-rank representation, and $T = \{T_k\}$ is a set of transformations that compensate translational and rotational jitter; that is,

$$D = T(U) = R(H * U, \alpha), \tag{6}$$

where $U$ represents the unobserved jitter-free video sequence, $H = \{h_k\}$ is a set of filters that independently models translation for each frame, $*$ represents the convolution, and $R$ is a set of independent rotations applied to each frame with angle $\alpha_k$. It is interesting to note that $T = \tau^{-1}$; that is, the transformation used in (5) can be understood as the inverse of the transformation used in RASL or tGRASTA (4).
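As a small illustration of how a filter can model a per-frame translation, the sketch below convolves a frame with a shifted unit impulse. Circular (FFT-based) boundary handling and the helper names `shift_filter` and `circular_convolve` are assumptions made for this example; the rotational part of the jitter model is omitted.

```python
import numpy as np

def shift_filter(shape, dy, dx):
    """Unit impulse placed at (dy, dx); convolving with it translates a frame."""
    h = np.zeros(shape)
    h[dy % shape[0], dx % shape[1]] = 1.0
    return h

def circular_convolve(img, h):
    """2-D circular convolution computed in the Fourier domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(h)))
```

With these definitions, `circular_convolve(frame, shift_filter(frame.shape, 2, -3))` moves the frame content 2 pixels down and 3 pixels left, wrapping around at the borders.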
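In [14, 15], a computationally efficient and fully incremental algorithm, based on rank-1 updates for thin SVD [34–37] (also see Section 2.3), was proposed to solve (3); in [23], it was shown that, since (5) is based on (3), such an incremental solution can also be used: letting $d_k$ represent each frame of the observed video D, and using similar relationships for $l_k$ and $s_k$ w.r.t. $L$ and $S$, respectively, then indeed the solution of

$$\arg\min_{L,S,T}\ \frac{1}{2}\sum_k \|T_k(l_k) + s_k - d_k\|_2^2 + \lambda \|S\|_1 \quad \text{s.t.} \quad \operatorname{rank}(L) \le r \tag{7}$$

can be efficiently computed in an incremental fashion (see [23], Section 3.3 for details).
2.3. Incremental and Rank-1 Modifications for Thin SVD
Given a matrix $D \in \mathbb{R}^{m \times l}$ with thin SVD $D = U_0 \Sigma_0 V_0^T$, where $\Sigma_0 \in \mathbb{R}^{r \times r}$, and column vectors $a$ and $b$ (with m and l elements, respectively), note that

$$D + a b^T = \begin{bmatrix} U_0 & a \end{bmatrix} \begin{bmatrix} \Sigma_0 & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} V_0 & b \end{bmatrix}^T, \tag{8}$$

where $\mathbf{0}$ is a zero column vector of the appropriate size. Based on [35, 36], as well as on [37], we will briefly describe an incremental (thin) SVD and rank-1 modifications (update, downdate, and replace) for thin SVD.
The generic operation consisting of the Gram–Schmidt orthonormalization of $a$ and $b$ w.r.t. $U_0$ and $V_0$, i.e., $x = U_0^T a$, $z_x = a - U_0 x$, $\rho_x = \|z_x\|_2$, and $p = z_x / \rho_x$, and $y = V_0^T b$, $z_y = b - V_0 y$, $\rho_y = \|z_y\|_2$, and $q = z_y / \rho_y$, is used as a first step for all the cases described below.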
2.3.1. Incremental or Update Thin SVD
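Given $d \in \mathbb{R}^{m \times 1}$, we want to compute thin $\operatorname{SVD}([D\ d]) = U_1 \Sigma_1 V_1^T$, with (i) $\Sigma_1 \in \mathbb{R}^{(r+1) \times (r+1)}$ or (ii) $\Sigma_1 \in \mathbb{R}^{r \times r}$. In this case, we note that $[D\ \mathbf{0}] = U_0 \Sigma_0 [V_0^T\ \mathbf{0}]$ and that $[D\ d] = [D\ \mathbf{0}] + d\,e^T$, where $e$ is a unit vector (with $l+1$ elements in this case); then, (8) is equivalent to (9) and (10), where $\hat{\Sigma} \in \mathbb{R}^{(r+1) \times (r+1)}$ and where $x = U_0^T d$, $z_x = d - U_0 x$, $\rho_x = \|z_x\|_2$, and $p = z_x / \rho_x$ come from the Gram–Schmidt step with $a = d$:

$$[D\ d] = \begin{bmatrix} U_0 & p \end{bmatrix} \begin{bmatrix} \Sigma_0 & x \\ \mathbf{0}^T & \rho_x \end{bmatrix} \begin{bmatrix} V_0 & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}^T, \tag{9}$$

$$\operatorname{SVD}\left( \begin{bmatrix} \Sigma_0 & x \\ \mathbf{0}^T & \rho_x \end{bmatrix} \right) = G \hat{\Sigma} H^T. \tag{10}$$

Using (11), we get $\operatorname{SVD}([D\ d])$ with (i) $\Sigma_1 \in \mathbb{R}^{(r+1) \times (r+1)}$; similarly, using (12), we get $\operatorname{SVD}([D\ d])$ with (ii) $\Sigma_1 \in \mathbb{R}^{r \times r}$ (Matlab notation is used to indicate array slicing operations):

$$U_1 = \begin{bmatrix} U_0 & p \end{bmatrix} G, \quad \Sigma_1 = \hat{\Sigma}, \quad V_1 = \begin{bmatrix} V_0 & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix} H, \tag{11}$$

$$U_1 = U_0\, G(1{:}r, 1{:}r) + p\, G(r{+}1, 1{:}r), \quad \Sigma_1 = \hat{\Sigma}(1{:}r, 1{:}r), \quad V_1 = \begin{bmatrix} V_0\, H(1{:}r, 1{:}r) \\ H(r{+}1, 1{:}r) \end{bmatrix}. \tag{12}$$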
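The update step can be sketched in NumPy as follows, assuming the standard construction in which the new column is orthogonalized against the current left singular vectors and a small $(r+1) \times (r+1)$ matrix is re-diagonalized. The function name `inc_svd` and the `keep_rank` flag are illustrative and are not the authors' incSVD operator.

```python
import numpy as np

def inc_svd(U0, s0, V0, d, keep_rank=False):
    """Thin SVD of [D d] from the thin SVD D = U0 @ diag(s0) @ V0.T.

    U0: (m, r), s0: (r,), V0: (l, r), d: (m,).
    keep_rank=True truncates the result back to r singular values.
    """
    r = s0.size
    # Gram-Schmidt step: split d into its part inside and outside span(U0).
    x = U0.T @ d
    z = d - U0 @ x
    rho = np.linalg.norm(z)
    p = z / rho if rho > 1e-12 else np.zeros_like(z)
    # Small (r+1) x (r+1) core matrix [[diag(s0), x], [0, rho]].
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s0)
    K[:r, r] = x
    K[r, r] = rho
    G, s_hat, Ht = np.linalg.svd(K)
    # Rotate the enlarged bases by the factors of the core matrix.
    U1 = np.hstack([U0, p[:, None]]) @ G
    V_ext = np.zeros((V0.shape[0] + 1, r + 1))
    V_ext[:-1, :r] = V0
    V_ext[-1, r] = 1.0
    V1 = V_ext @ Ht.T
    if keep_rank:
        return U1[:, :r], s_hat[:r], V1[:, :r]
    return U1, s_hat, V1
```

The only dense SVD computed is that of the small core matrix, which is what keeps the per-frame cost low when r is much smaller than the frame size.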
2.3.2. Downdate Thin SVD
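Given $[D\ d] = U_0 \Sigma_0 V_0^T$, with $\Sigma_0 \in \mathbb{R}^{r \times r}$, we want to compute thin $\operatorname{SVD}(D) = U_1 \Sigma_1 V_1^T$ with r singular values. Noting that $[D\ \mathbf{0}] = [D\ d] + (-d)\,e^T$, then, with $y = V_0^T e$, $z_y = e - V_0 y$, $\rho_y = \|z_y\|_2$, and $q = z_y / \rho_y$, the rank-1 modification (8) is equivalent to

$$[D\ \mathbf{0}] = \begin{bmatrix} U_0 & \mathbf{0} \end{bmatrix} \begin{bmatrix} \Sigma_0 - \Sigma_0 y y^T & -\rho_y \Sigma_0 y \\ \mathbf{0}^T & 0 \end{bmatrix} \begin{bmatrix} V_0 & q \end{bmatrix}^T, \tag{13}$$

from which, writing the SVD of the middle matrix in (13) as $G \hat{\Sigma} H^T$, we can compute thin $\operatorname{SVD}(D)$ via the following equation:

$$U_1 = U_0\, G(1{:}r, 1{:}r), \quad \Sigma_1 = \hat{\Sigma}(1{:}r, 1{:}r), \quad V_1 = V_0\, H(1{:}r, 1{:}r) + q\, H(r{+}1, 1{:}r), \tag{14}$$

where the last row of $V_1$ (which corresponds to the removed column) is discarded.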
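A matching sketch for the downdate, under the assumption that the given thin SVD reproduces $[D\ d]$ exactly (so that $d$ lies in the span of the left singular vectors), is shown below. The name `down_svd` is illustrative and is not the authors' downSVD operator.

```python
import numpy as np

def down_svd(U0, s0, V0):
    """Thin SVD of D from the thin SVD [D d] = U0 @ diag(s0) @ V0.T.

    U0: (m, r), s0: (r,), V0: (l + 1, r). Returns r singular values.
    """
    r = s0.size
    # y = V0^T e, with e the unit vector selecting the last column of [D d].
    y = V0[-1, :].copy()
    e = np.zeros(V0.shape[0])
    e[-1] = 1.0
    z = e - V0 @ y
    rho = np.linalg.norm(z)
    q = z / rho if rho > 1e-12 else np.zeros_like(z)
    # r x (r+1) core matrix [diag(s0) - diag(s0) y y^T, -rho * diag(s0) y].
    K = np.zeros((r, r + 1))
    K[:, :r] = np.diag(s0) - np.outer(s0 * y, y)
    K[:, r] = -rho * s0 * y
    G, s_hat, Ht = np.linalg.svd(K, full_matrices=False)
    U1 = U0 @ G
    V1 = np.hstack([V0, q[:, None]]) @ Ht.T
    # Drop the row associated with the removed column.
    return U1, s_hat, V1[:-1, :]
```

In a sliding-window setting, a step of this kind is what allows the oldest background frame to be removed without recomputing the decomposition from scratch.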
2.3.3. Replace Thin SVD
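Given $[D\ d] = U_0 \Sigma_0 V_0^T$, with $\Sigma_0 \in \mathbb{R}^{r \times r}$, we want to compute thin $\operatorname{SVD}([D\ \hat{d}]) = U_1 \Sigma_1 V_1^T$ with r singular values. This case can be understood as a mixture of the previous cases and can be easily derived noticing that $[D\ \hat{d}] = [D\ d] + (\hat{d} - d)\,e^T$.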
Finally, we point out that the computational complexity of any of the above procedures ([35], Section 3 and [37], Section 4) is upper bounded by . If holds, then the complexity is dominated by .
3. Methods
3.1. Proposed incPCPPTI Method
The proposed algorithm (named incPCPPTI) is a modification of the previously proposed incPCPTI [23] so that it is able to handle panning and camera motion. It was briefly presented in [28] and is more thoroughly explained and evaluated in this work. The method continuously estimates the alignment transformation $P_k$ so that $P_k(l_{k-1}) \approx d_k$, i.e., the transformation that aligns the previous low-rank representation with the observed current frame. Thus, incPCPPTI effectively uses $L$ as a local estimation of a composite panoramic background image. After applying such a transformation to $L$, the PCP problem can be solved in the reference frame of $d_k$. After this initial alignment, it is considered that only minor jitter remains in the image, and so a procedure similar to incPCPTI is utilized by estimating a transformation $T_k$ for the kth frame. However, instead of solving the Affinely Constrained Matrix Rank Minimization [38] as in the original incPCPTI [14], the low-rank approximation problem is solved in the reference frame of $L$ by applying $T_k^{-1}$ to the residual $d_k - s_k$. The whole procedure is presented in Algorithm 1. This algorithm makes use of the incSVD, repSVD, and downSVD operators, which correspond to the thin SVD update, replacement, and downdate operators, respectively (Section 2.3).

In line 3 of Algorithm 1, the latest low-rank frame $l_{k-1}$ is aligned to the current frame $d_k$. The alignment transformation, denoted here by $P_k$, is estimated as the composition of a translation and a rotation. The transformation thus found is then used to update the whole low-rank matrix representation L to the current reference axis (lines 4 and 5 of Algorithm 1) in order to obtain the aligned representation $P_k(L)$. After this initial alignment transformation is performed, it is assumed that only minor misalignments, modeled by $T_k$, due to jitter remain (line 10 of Algorithm 1).
The ghosting suppression mentioned in line 16 is detailed in Section 3.3. The shrinkage in line 9 of Algorithm 1 can be performed by either soft thresholding or projection onto the $\ell_1$ ball. Soft thresholding is performed with a simple element-wise shrinkage operator ($\operatorname{shrink}(x, \epsilon) = \operatorname{sign}(x) \max\{0, |x| - \epsilon\}$). Projection onto the $\ell_1$ ball is detailed in Section 3.2. For all our experiments, the latter was chosen.
3.2. Projection on the $\ell_1$ Ball
Although theoretical guidance is available for selecting a minimax optimal regularization parameter λ in (2) [39], practical problems do not fully satisfy the idealized assumptions, and thus λ often has to be heuristically tuned. This problem is also observed if (3) is used instead of (2).
To tackle this problem, Rodríguez and Wohlberg [40] introduced the alternative relaxation of (1) given by

$$\arg\min_{L,S}\ \frac{1}{2}\|L + S - D\|_F^2 \quad \text{s.t.} \quad \|S\|_1 \le \mu,\ \operatorname{rank}(L) \le r, \tag{15}$$

which can also be incrementally solved via rank-1 updates for thin SVD (as is the case of the incPCP and related algorithms [14, 15, 23]); however, (15) has the advantage that a simple heuristic can be derived for the adaptive selection of μ for each frame. Furthermore, μ can be spatially adapted in order to reduce ghosting effects. The algorithm they propose is very similar to incPCP, save for the shrinkage step, which is calculated as $s_k = \operatorname{proj}_{\ell_1}(d_k - l_k, \mu)$, where

$$\operatorname{proj}_{\ell_1}(u, \mu) = \arg\min_{x}\ \frac{1}{2}\|x - u\|_2^2 \quad \text{s.t.} \quad \|x\|_1 \le \mu. \tag{16}$$

Thus, for the shrinkage step, the solution is given by projections onto the $\ell_1$ ball of radius μ.
While there are several well-known and efficient algorithms that solve (16), studies [40–43] used the algorithm in [44], a recently published algorithm for solving (16) that has better computational performance than either that in [41] or [42].
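For illustration, the projection onto the $\ell_1$ ball can be computed with the classical sort-based method, sketched below. This is not the algorithm of [44]; it is a simpler and widely known alternative, and the function name `project_l1_ball` is an assumption of this example.

```python
import numpy as np

def project_l1_ball(u, mu):
    """Euclidean projection of u onto {x : ||x||_1 <= mu}, with mu > 0."""
    u = np.asarray(u, dtype=float)
    a = np.abs(u).ravel()
    if a.sum() <= mu:
        return u.copy()  # already inside the ball
    # Sort magnitudes in decreasing order and find the active-set size k.
    srt = np.sort(a)[::-1]
    css = np.cumsum(srt)
    ks = np.arange(1, a.size + 1)
    k = ks[srt - (css - mu) / ks > 0][-1]
    # The projection is a soft threshold with a data-dependent level theta.
    theta = (css[k - 1] - mu) / k
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)
```

The last line makes the relation to soft thresholding explicit: the projection is a shrinkage whose threshold is chosen so that the result has $\ell_1$ norm exactly μ whenever the input lies outside the ball.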
Furthermore, Rodrıguez and Wohlberg [40] also proposed a simple scheme for adapting with every frame, which is given bywhere is a value between 0.5 and 0.75.
3.3. Ghosting Suppression
Ghosting refers to when the foreground estimates include phantoms or smear replicas from actual moving objects. Rodríguez and Wohlberg [45] proposed a procedure for ghosting suppression in the incPCP algorithm which consists using binary masks obtained from different frames in order to remove the ghosts from the lowrank component. In this approach, two sparse components at different time steps and are used to compute respective binary masks and . These masks will include the moving objects as well as ghosts. A new binary mask , i.e., the complement of the intersection of binary masks obtained from the aforementioned two frames, will include, with high probability, all pixels of the background that are not occluded by a moving object. can then be used to generate a modified input frame , where represents an Hadamard product, which is used to update the lowrank component. Additionally, if the procedure with the ball projection described in Section 3.2 is used for the shrinkage step, can be spatially adapted in order to reduce ghosting [40]. Based on the difference between current and previous sparse approximation , a binary mask can be computed and then the sparse component is modified aswhere and and , where is defined in (16), and β is suggested to take values between 0.1 and 0.3.
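The mask construction described above can be sketched as follows. The threshold used to binarize the sparse components, and the way the mask is combined with the current frame and the current background estimate, are assumptions made for this illustration; the helper names `ghost_free_mask` and `masked_frame` are hypothetical.

```python
import numpy as np

def ghost_free_mask(s_a, s_b, thresh=0.1):
    """Complement of the intersection of two binarized sparse components.

    `thresh` is an assumed binarization level, not a value from the paper.
    """
    m_a = np.abs(s_a) > thresh
    m_b = np.abs(s_b) > thresh
    return ~(m_a & m_b)

def masked_frame(d_k, l_k, mask):
    """Keep d_k where mask is True and fall back to l_k elsewhere.

    Assumed blending rule, written with element-wise (Hadamard) products.
    """
    m = mask.astype(float)
    return m * d_k + (1.0 - m) * l_k
```

Pixels flagged in both sparse components are excluded from the frame used to update the low-rank component, so that they do not leak into the background estimate.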
4. Description of Datasets and Computational Experiments
For the evaluation of the proposed incPCPPTI algorithm, four datasets were considered. The first dataset consists of synthetic jitter and panning videos. The second one consists of videos of real panning taken from the Moseg dataset [29]. The third dataset consists of videos of the recently published DAVIS dataset [30, 46]. The last one was obtained from the CDnet2014 dataset [31]. All datasets are detailed in this section. All tests were carried out using GPU-enabled Matlab code running on an Intel i7-2600K CPU (8 cores, 3.40 GHz, 8 MB cache, and 32 GB RAM) with a 12 GB NVIDIA Tesla K40C GPU card. To the best of our knowledge, no other incremental low-rank plus additive matrix video background modeling technique capable of handling panning has been reported in the literature. This situation puts some constraints on our evaluations, and for most of our tests, we compare against batch methods. Specific details for each of the datasets are described in their corresponding sections. Furthermore, to test the stability of incPCPPTI to jitter, we also use stab + incPCPPTI, which consists of a preprocessing stage using a recent state-of-the-art video stabilization technique [47] followed by incPCPPTI.
4.1. Synthetic Datasets
A dataset with synthetic panning and jitter was generated from the 3rd Tower video of the USC Neovision2 dataset [48], which consists of 900 frames of size 1920 × 1088 pixel at 25 fps. For this purpose, a subregion of 720 × 480 pixel was selected from each frame and the centroid of the subregion was translated with each new frame in order to simulate an aerial panning scenario using the piecewise linear trajectory given bywhere is the initial point (in this case, it is chosen as pixel) and is the point of slope change in the curve (chosen as pixel). This process is depicted in Figure 1. The panning velocity was taken as 1, 3, and 5 pixels per frame. A fourth case in which the velocity changed randomly between 1 and 7 pixels per frame was also considered. This dataset will be referred as SP (“synthetic panning”) dataset. Additionally, this same procedure was used to construct a dataset on jittered versions of the original frames. Each frame of the 3rd Tower video was jittered with random uniformly distributed translations on the pixels range and random uniformly distributed rotations on the degrees range. The same trajectory and subregion selection of the SP dataset was used. This second synthetic dataset will be referred as SPJ (“synthetic panning and jitter dataset”) dataset. For both SP and SPJ datasets, the sparse approximation via the batch iALM method [10] using 20 outer iterations was used as a proxy ground truth by selecting the same regions that were selected from the original frames. The iALM was chosen as the proxy ground truth since, as reported in [8], Tables 6 and 7, its segmentation is considered to be reliable. For these synthetic datasets, the performance of the proposed algorithm was measured in terms of the normalized distance:where and are the ground truth and computed sparse components for frame k, respectively, and N is the number of pixels of the frame. 
Considering images normalized between 0 and 1, the value of the distance (20) varies from 0 (perfect match with the ground truth) to 1.
For the SP dataset, only the incPCPPTI method was evaluated. For the SPJ dataset, we evaluated incPCPPTI and, as mentioned in Section 4, stab + incPCPPTI, which consists of a preprocessing stage using a recent state-of-the-art video stabilization technique [47] followed by incPCPPTI. The objective of this comparison was to determine whether jitter is handled correctly by incPCPPTI [23] alone. Additionally, we include a baseline comparison with the sparse components obtained with incPCP on the full Neovision2 Tower video and then segmented using the same procedure described in Section 4.1.
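A minimal sketch of the per-frame metric follows, assuming the normalized distance is the $\ell_1$ norm of the difference between the proxy ground truth and the computed sparse component, divided by the number of pixels N. The function name `normalized_l1_distance` is illustrative.

```python
import numpy as np

def normalized_l1_distance(s_gt, s_p):
    """||s_gt - s_p||_1 / N for one frame, with N the number of pixels."""
    s_gt = np.asarray(s_gt, dtype=float)
    s_p = np.asarray(s_p, dtype=float)
    return np.abs(s_gt - s_p).sum() / s_gt.size
```

With inputs scaled to the range from 0 to 1, each pixel contributes at most 1/N, which is why the metric stays between 0 and 1.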
4.2. Moseg Dataset
We used 15 video sequences of the Freiburg-Berkeley Motion Segmentation (Moseg) Dataset [29, 49, 50]. We selected sequences that contained panning or camera movement. For all incPCPPTI variants, three inner loops and a window size of 30 background frames were used. For the ball projection, α was set to 0.75 and the ghosting suppression was set to 20 frames. α controls the adaptation of τ, with lower α forcing a sparser solution, whereas the difference controls the number of frames used for ghosting suppression. The binary mask obtained with incPCPPTI was postprocessed using the computation of the convex hull of the connected objects [51]. It is noted that this postprocessing was not applied to the other methods as it tended to reduce their performance.
For comparison, both the LTVA and DECOLOR algorithms were used as a reference. The LTVA code was obtained from [52], and the default parameter of a spatial subsampling of 8 for the point tracking was used. The tracking component of the algorithm runs in single-threaded C, whereas the dense clustering component runs in CUDA C 5.5. The single-threaded Matlab DECOLOR code was obtained from [53], and a tolerance of 1E-4 was used. All other parameters were left at their default values. For reporting the results, we further subdivided the selected videos into two categories: short panning (comprising nine cars sequences and one people sequence) and long panning (comprising five marple videos). In the former category, the final and first frames in the panning motion still share some common area, while in the latter, these two frames do not. This subdivision was necessary as DECOLOR performs a preregistration and was not able to work properly on the long panning sequences.
For all videos, the binary masks of the methods were compared to the ground truth provided in the dataset in order to obtain an F measure, defined as

$$F = \frac{2 \cdot P \cdot R}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \tag{21}$$

where P and R stand for precision and recall, respectively, and TP, FN, and FP are the number of true positive, false negative, and false positive pixels, respectively. It is noted that only approximately one in ten frames possessed ground truth information.
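The F measure can be computed from a pair of binary masks as in the sketch below. Returning 0 when there are no true positives is a convention adopted for this example, and the function name `f_measure` is illustrative.

```python
import numpy as np

def f_measure(pred, gt):
    """F measure (harmonic mean of precision and recall) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.count_nonzero(pred & gt)    # foreground in both masks
    fp = np.count_nonzero(pred & ~gt)   # detected but not in ground truth
    fn = np.count_nonzero(~pred & gt)   # missed ground-truth pixels
    if tp == 0:
        return 0.0  # assumed convention: avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)
```

Since only a subset of frames carries ground truth, a per-video score would be obtained by averaging this value over the annotated frames.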
4.3. DAVIS Dataset
We used 10 video sequences of the DAVIS dataset [30, 46]. We selected some sequences that contained panning or camera movement and that contained at least 50 frames. The algorithms were configured with the same parameters specified in Section 4.2, and the same F measure comparison was performed. However, as mentioned before, DECOLOR is not able to run on all sequences, and only the F measures for sequences where its prealignment phase runs correctly are reported. Additionally, in this dataset, LTVA presented oversegmentation of some objects and, accordingly, two methods to obtain the final binary mask of moving objects are considered (this problem was not present in the Moseg sequences). The first, reported as LTVAaPP (automatic postprocessing), considers an automatic selection of the mask by designating the object label with the largest area as the background and considering all other labels as foreground objects. The second, reported as LTVAmPP (manual postprocessing), entails the manual selection of the labels corresponding to moving objects. Although the latter method is more accurate, for an automatic pipeline, LTVAaPP would be the more realistic option.
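The automatic postprocessing rule can be sketched as follows, assuming the segmentation is available as an integer label image with one label per segmented region. The function name `auto_foreground_mask` is illustrative and not part of the LTVA code.

```python
import numpy as np

def auto_foreground_mask(labels):
    """Treat the label with the largest area as background, the rest as foreground."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    background = values[np.argmax(counts)]
    return labels != background
```

This rule requires no user input, but it fails whenever a moving object covers more pixels than the background, which is one reason the manual variant can be more accurate.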
4.4. Change Detection 2014 (CDnet2014) Dataset
Two videos from the PTZ category of the CDnet2014 dataset [31] were chosen:
(i) continuousPan (CP): 704 × 480 pixel, 1700 frame color video containing a continuous panning of a PTZ camera. The video is almost jitter free.
(ii) intermittentPan (IP): 560 × 368 pixel, 3500 frame color video of a PTZ camera that changes between two fixed positions. The video contains intermittent panning and additional real jitter.
As mentioned in the previous section, DECOLOR is unable to work on this type of long panning sequences and accordingly, only incPCPPTI and LTVA are compared. The F measure was evaluated only on frames that contained ground truth motion. LTVA presented the same segmentation problems as in the DAVIS dataset. However, in this case, the mask did not provide a segmentation good enough to produce a manual postprocessing, and thus, only the results of LTVA + aPP are presented. All the parameters for incPCPPTI and LTVA are the same as in previous sections. We also included a comparison with the edge based foreground background segmentation and interior classification (EFIC) [54] and its color version, CEFIC [55]. These methods were chosen as they obtained the second and third best F measure in the PTZ category of the CDnet2014 dataset results [56]. The top performer in the category was not selected as it corresponded to a supervised convolutional neural network that needs proper training before classification. Unfortunately, no open code is available for EFIC and CEFIC, and we only had access to the segmented binary masks submitted to the challenge [57, 58]. Due to this limitation, only a referential F measure could be computed. The absence of open code makes it difficult to ascertain if EFIC and CEFIC can be implemented in a fully incremental way and to compare them in terms of computational performance. Additionally, as EFIC and CEFIC already include a postprocessing step on the binary mask, we did not apply the convex hull postprocessing of the connected objects [51] that was used for incPCPPTI.
5. Results
5.1. Synthetic Datasets
5.1.1. SP Dataset
The distance (20) computed for each frame of the different videos of the SP dataset is shown in Figure 2. Table 1 shows the average distance and average time for processing one frame along with a baseline metric, described in Section 4.1. It can be noticed that the distance tends to increase as the panning velocity increases, but in all cases it remains relatively small (below 0.01).
5.1.2. SPJ Dataset
Representative frames of the SPJ video with changing velocity and the segmented sparse components with incPCPPTI and stab + incPCPPTI are shown in Figure 3. The distance computed for each frame of the different videos of the SPJ dataset is shown in Figures 4 and 5 for incPCPPTI and stab + incPCPPTI, respectively.
5.2. Moseg Dataset
Representative frames of the video and the segmented sparse components for the cars8, people1, and marple13 videos of the Moseg dataset are shown in Figure 6. Tables 2 and 3 show the F measure obtained in the short and long panning subsets, respectively. As noted in the previous sections, DECOLOR did not properly work on the long panning sequences, and so it is excluded from the comparisons in this subset.
5.3. DAVIS Dataset
Representative frames of the video and the segmented sparse components for the tennis, horsejump-high, swing, and dog-gooses videos of the DAVIS dataset are shown in Figure 7. Table 4 shows the F measure obtained in the short and long panning subsets, respectively. The sequences in which DECOLOR did not run properly were left unreported. As described in Section 4.3, LTVAaPP and LTVAmPP refer to automatic and manual postprocessing of the binary masks generated by the LTVA method, respectively.
5.4. CDnet2014 Dataset
Representative frames of the video and the segmented sparse components for the CP and IP videos are shown in Figures 8 and 9, respectively. Figure 10 shows the F measure (with no postprocessing) for incPCPPTI (grayscale and color versions) and EFIC and CEFIC on the frames of the CP video, while Figure 11 shows the same metric for all methods on the frames of the IP video. Tables 5 and 6 show the average F measure and computational time obtained over all frames. For stab + incPCPPTI, the computational time is shown as (total stabilization time) + (incPCPPTI time per frame). For LTVA, the total time of the batch execution was divided by the total number of frames in order to obtain an average time per frame.
6. Discussion
It is observed in the results of Section 5.1.1 that, as expected, the distance increased; that is, the sparse approximation was worse, as the panning velocity increased. On the contrary, incPCPPTI is able to maintain an adequate performance even when the panning velocity changes. Also expected is the fact that adding jitter to the panning scenario (Section 5.1.2) increased the distance for all panning velocities with respect to their jitterfree counterparts. The overall stability of the estimated distance also decreased, as evidenced in the higher variability of the curves in Figure 4. The inclusion of a video stabilization preprocessing technique (stab + incPCPPTI) seemed to decrease such variability, as evidenced in Figure 5. Nevertheless, even with jitter, standalone incPCPPTI maintained a low average distance and its performance is comparable with stab + incPCPPTI, as can be observed in Table 7. Furthermore, although incPCPPTI obtained higher distances than baseline incPCP, values tend to be close to each other and, for all tested velocities, incPCPPTI managed to maintain a very small distance from the ground truth (below 0.01 for all cases).
The results of Table 2 (related to the Moseg dataset, Section 4.2) suggest that incPCPPTI is able to perform comparably to DECOLOR, even though the latter is a batch method and our proposed method is incremental. LTVA has substantially higher average F measure for this particular dataset, although working in a batch fashion. In Table 3, the same trend is observed. As mentioned above, DECOLOR has problems working in these sequences due to its prealignment phase failing to find a suitable unique frame for reference. The low performance of incPCPPTI in some of the Moseg sequences might stem from the short number of video frames that cause the initial lowrank estimation of PCP to be less precise.
The results of Sections 5.3 and 5.4 (related to the DAVIS and CDNet datasets, respectively, Sections 4.3 and 4.4) suggest that incPCPPTI can perform adequately in longer real panning videos with more complex scenarios. Regarding the DAVIS dataset, we observe that the highest performance is obtained with LTVAmPP. Nevertheless, this is a batch method and the final binary segmentation required human interactions. On the contrary, incPCPPTI shows an average performance superior to LTVAaPP and comparable to DECOLOR, though the latter did not run on all tested sequences.
The representative frames of Figures 8 and 9 exhibit different positions of the PTZ camera and thus evidence the ability of incPCPPTI of handling the panning movements in the scene. IncPCPPTI presents a relatively good F measure for both videos. This metric tended to be higher for the color version of the algorithm. In Figure 11, it can be observed that the F measure suffers decays at specific intervals of the video that coincide with sudden movements of the PTZ camera. However, after these sudden movements, the algorithm is able to restabilize and perform correctly. On the contrary, LTVA fails to track moving objects in a large number of frames. The lower performance on the CDNet dataset might be caused by the higher speed moving objects and panning movements that complicate the optical flow tracking using by LTVA. Additionally, the higher complexity of the objects and panning movements causes the clustering stage to produce a large number of false positives.
For both CP and IP videos (described in Section 4.4), incPCPPTI showed a higher F measure than stab + incPCPPTI; a possible explanation is the misalignment between the ground truth reference frame and the reference frame of the stabilization algorithm. Nevertheless, the visual inspection of the frames and the results from the SPJ dataset suggest that incPCPPTI is able to handle the presence of jitter in a panning scenario and that it does not need a stabilization preprocessing step. Compared to EFIC, incPCPPTI showed a superior F measure in the CP videos, even without the postprocessing stage. In the IP video, the F measure of incPCPPTI is comparable or superior to that of EFIC. As mentioned, the absence of open code for EFIC makes it difficult to carry out a more thorough comparison and to draw further conclusions from these results. Compared to LTVA, incPCPPTI shows a much higher F measure in both cases. These results suggest that incPCPPTI might be better suited than LTVA to faster panning and more complex scenarios. It is also worth noting that incPCPPTI attains these results incrementally and with a comparable or lower average computational time per frame, despite the fact that the public LTVA code is implemented in C and CUDA.
7. Conclusion
We have presented a novel algorithm, incPCPPTI, and have shown with artificial datasets and real videos from the Moseg, DAVIS, and CDnet2014 datasets that it can adequately detect moving objects in scenarios with simultaneous panning and jitter. To the best of our knowledge, this is the first incremental PCP-like method able to handle panning conditions. For the synthetic datasets, the algorithm maintained a low distance with respect to a proxy ground truth, and for the real videos, it maintained an adequate F measure and was able to stabilize after sudden panning of the camera. Additionally, the comparisons with stab + incPCPPTI (independent video stabilization followed by incPCPPTI) suggest that a stabilization stage preceding incPCPPTI is not needed, as the algorithm is able to handle the jitter present in the camera motion. The evaluations on real videos show that incPCPPTI might be comparable or superior, depending on the case, to state-of-the-art batch PCP (e.g., DECOLOR) and non-PCP-like (e.g., LTVA, EFIC) foreground separation methods.
Further improvements of the algorithm might focus on (i) making it able to handle other types of distortion, such as perspective changes or zooming in/out of the camera, and (ii) reducing the time it takes per frame in order to make it more suitable for high-frame-rate real-time applications.
Data Availability
The video data used to support the findings of this study are included within the article. The datasets used in the article are referenced and can be found at publicly available sites, namely, (1) synthetic datasets (Section 4.1), constructed from the USC Neovision2 Project, https://goo.gl/5Si2Nm; (2) Moseg dataset (Section 4.2): “Freiburg-Berkeley motion segmentation dataset,” https://goo.gl/bzEvvi; (3) DAVIS dataset (Section 4.3): “DAVIS: Densely Annotated VIdeo Segmentation,” https://goo.gl/G8Hb7o; (4) CDnet2014 dataset (Section 4.4): http://www.changedetection.net/. Additionally, our implementation of the method can be found at http://goo.gl/4jEvck.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the “Programa Nacional de Innovación para la Competitividad y Productividad” (Innóvate Perú) Program, 169Fondecyt2015.