Abstract

Video background modeling is an important preprocessing stage for various applications, and principal component pursuit (PCP) is among the state-of-the-art algorithms for this task. One of the main drawbacks of PCP is its sensitivity to jitter and camera movement. This problem has only been partially solved by a few methods devised for jitter or small transformations. However, such methods cannot handle the case of moving or panning cameras in an incremental fashion. In this paper, we greatly expand the results of our earlier work, in which we presented a novel, fully incremental PCP algorithm, named incPCP-PTI, which was able to cope with panning scenarios and jitter by continuously aligning the low-rank component to the current reference frame of the camera. To the best of our knowledge, incPCP-PTI is the first incremental low-rank plus additive matrix method capable of handling these scenarios. The results on synthetic videos and on the Moseg, DAVIS, and CDnet2014 datasets show that incPCP-PTI is able to maintain good performance in the detection of moving objects even when panning and jitter are present in a video. Additionally, in most videos, incPCP-PTI obtains competitive or superior results compared to state-of-the-art batch methods.

1. Introduction

Video background modeling consists of segmenting the “foreground” or moving objects from the static “background.” It is an important first step in various computer vision applications [1] such as abnormal event identification [2] and surveillance [3].

Several video background modeling methods, using different approaches such as Gaussian mixture models [4], kernel density estimations [5], or neural networks [6], exist in the literature. More comprehensive surveys of other methods are presented in [1, 7]. Principal component pursuit (PCP) is currently considered to be one of the leading algorithms for video background modeling [8]. Formally, PCP was introduced in [9] as the nonconvex optimization problem

$$\min_{L,S}\ \mathrm{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad D = L + S, \tag{1}$$

where the matrix $D \in \mathbb{R}^{m \times n}$ is formed by the $n$ observed frames, each of size $m = N_r \cdot N_c \cdot N_d$ (rows, columns, and number of channels, respectively); $L$ is a low-rank matrix representing the background; $S$ is a sparse matrix representing the foreground; $\lambda$ is a fixed global regularization parameter; $\mathrm{rank}(L)$ is the rank of $L$; and $\|S\|_0$ is the $\ell_0$ norm of $S$.

Although the convex relaxation of (1), given by

$$\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad D = L + S, \tag{2}$$

where $\|L\|_*$ is the nuclear norm of matrix $L$ (i.e., the sum of the singular values of $L$) and $\|S\|_1$ is the $\ell_1$ norm of $S$, is at the core of most PCP algorithms (including the augmented Lagrange multiplier (ALM) and inexact ALM (iALM) algorithms [10, 11]), there exist several others (for a complete list, see [12], Table 4). In particular, we point out

$$\min_{L,S}\ \tfrac{1}{2}\|L + S - D\|_F^2 + \lambda \|S\|_1 \quad \text{s.t.} \quad \mathrm{rank}(L) = r, \tag{3}$$

where $\|\cdot\|_F$ is the Frobenius norm; this formulation was originally proposed in [13], and we will use it as the starting point of our proposed method (Sections 2.2.2 and 3.1).
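For concreteness, the following minimal NumPy sketch solves (3) by alternating a truncated SVD step for $L$ with elementwise soft thresholding for $S$. It illustrates the structure of the problem only; the function name and parameter values are ours and do not correspond to any released PCP implementation.

```python
import numpy as np

def pcp_frobenius(D, rank=2, lam=0.01, n_iter=20):
    """Alternating-minimization sketch of (3): fixed-rank L, sparse S."""
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # L-step: best rank-`rank` approximation of D - S (truncated SVD)
        U, sv, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * sv[:rank]) @ Vt[:rank]
        # S-step: soft thresholding of the residual (prox of lam * ||.||_1)
        R = D - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)
    return L, S
```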

Bouwmans and Zahzah [8] showed that PCP provides state-of-the-art performance in video background modeling problems, but they also pointed out some of its limitations.

First, PCP is inherently a batch method with high computational and memory requirements. This problem has been addressed in the past by means of solutions based on rank-1 updates for thin SVD [14, 15] (applied to (3)), on low-rank subspace tracking [16] (applied to (2)), on stochastic optimization [17] (which applies the maximum-margin matrix factorization (M3F) method [18] to (2)), or on random sampling [19] (also applied to (2)).

The second shortcoming of PCP, which is particularly relevant to the present work, is its sensitivity to jitter and its inability to cope with panning video frames. For a general review and classification of methods for motion segmentation able to cope with different degrees of camera motion, we recommend [20] and the many references therein. Among the methods based on the low-rank plus additive matrices model, we highlight the robust alignment by sparse and low-rank decomposition (RASL) method [21]. This method used (2) as its starting point and addressed the problem of jitter in PCP using a series of geometric transformations on the observed frames, but as originally cast, it is a batch method. In contrast, t-GRASTA [22] and incPCP-TI [23], which used (2) and (3), respectively, as their starting points, addressed the problem of jitter in a semi-incremental or fully incremental way by applying geometric transformations to the observed frames or to the low-rank component, respectively. Other proposed methods are robust against moving cameras and panning [24–26], but all of them are batch or semibatch methods; furthermore, all of them used (2) as their starting point along with the same general idea as RASL, cf. (4). Recently, Gao et al. [27] presented a new batch PCP method that produces a panoramic low-rank component spanning the entire field of view, which gives much better results in long panning sequences. We note, nonetheless, that a fully online PCP algorithm able to cope with both jitter and panning is still an open problem. This scenario is of particular importance in applications such as surveillance systems that use moving traffic or aerial cameras.

In the present study, we expand our previous work [28], where we proposed to address the panning problem by modifying the optimization problem solved by incPCP-TI [14], which in turn uses (3) as its starting point, and by applying a set of transformations to the low-rank component that are updated with each new incoming frame. We substantially expand [28] by

(1) expanding the theoretical basis of our algorithm so that the present manuscript is self-contained;
(2) testing our algorithm on an additional real-life dataset with moving cameras;
(3) comparing our algorithm with two previously proposed batch methods.

Our computational experiments on synthetically created datasets and on publicly available videos from the Moseg [29], DAVIS [30], and CDnet2014 [31] datasets show that the proposed algorithm, henceforth referred to as panning and transformation invariant incPCP (incPCP-PTI), is able to correctly handle video background modeling under panning and basic jitter conditions.

2.1. Batch Methods

In this section, two previous motion segmentation batch methods that work under jitter/panning conditions are reviewed. For a more complete review of all available methods, refer to [20]. The two algorithms described here work in a batch fashion but were chosen as comparison benchmarks in this paper due to the public availability of their code and/or binary executables. It is noted that although [27] recently published a new PCP method for moving cameras, that algorithm was not chosen for comparison due to the unavailability of public code or executables.

2.1.1. Segmentation by Long-Term Video Analysis

In [29], the authors proposed to use a dense point tracker based on variational optical flow in which, instead of the classical two-frame approach of optical flow, long-term analysis is used. It is worth mentioning that, following the general classification of methods for motion segmentation proposed in [12], the method of [29] was catalogued as a trajectory classification method (see Table 3 of [12] for a summary of the aforementioned classification, along with the associated properties). After the initial tracking of points, spectral clustering with a spatial regularity constraint is utilized to form groups of point trajectories corresponding to different objects in the image. Finally, an energy minimization model is used to transform the clusters into a dense segmentation of moving objects. Throughout this paper, this method will be referred to as LTVA.

2.1.2. DECOLOR

The detecting contiguous outliers in the low-rank representation (DECOLOR) method [25] uses a nonconvex penalty and a Markov random field [32] model to detect outliers that correspond to moving objects. Bouwmans et al. [12] classified this algorithm as a low-rank and sparse representation method [21]. For moving cameras, the method uses a transformation obtained from a prealignment to the middle frame. The prealignment is performed using the robust multiresolution method proposed in [33], and DECOLOR then iteratively refines this transformation.

2.2. Online or Semionline Methods

In this section, two previous online or partially online PCP methods that work under jitter conditions are reviewed. It should be noted that, without modification, these two methods are not directly applicable to panning scenarios.

2.2.1. t-GRASTA

The Grassmannian Robust Adaptive Subspace Tracking Algorithm (GRASTA) [16] is a semionline method for low-rank subspace tracking that has been applied to the foreground-background separation problem. GRASTA is not a fully online algorithm, as it requires an initialization stage to obtain an initial low-rank subspace from the first p frames. A modification called t-GRASTA was presented in [22]; it is based on the Robust Alignment by Sparse and Low-Rank decomposition (RASL) algorithm [21]. RASL tries to handle the misalignment in the video frames by solving

$$\min_{L,S,\tau}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad D \circ \tau = L + S, \tag{4}$$

where $\tau = \{\tau_k\}_{k=1}^{n}$ is a series of per-frame transformations that align all the observed frames; it is straightforward to note that (4) is an extension of (2).

The nonlinearity in the transformations τ of (4) is handled via a linearization using the Jacobian. The main drawback of t-GRASTA is that, aside from the required low-rank subspace initialization, the initial transformation τ is estimated by using a similarity transformation obtained from a series of three points manually chosen from each of the p initial frames. This initialization stage severely constrains its application in automatic processes and reduces its applicability in panning scenarios, as the feature points in the initial frames may not be present in subsequent frames.

2.2.2. incPCP-TI

The incPCP-TI algorithm [23] considers the optimization problem

$$\min_{L,S,\mathcal{T}}\ \tfrac{1}{2}\|\mathcal{T}(L) + S - D\|_F^2 + \lambda \|S\|_1 \quad \text{s.t.} \quad \mathrm{rank}(L) = r, \tag{5}$$

where $D$ is the observed video sequence that suffers from jitter, $L$ is the properly aligned low-rank representation, and $\mathcal{T} = \{T_k\}_{k=1}^{n}$ is a set of transformations that compensate translational and rotational jitter; that is,

$$D = \mathcal{T}(D_0), \quad T_k(d_k) = R_{\theta_k}(h_k * d_k), \tag{6}$$

where $D_0$ represents the unobserved jitter-free video sequence, $\{h_k\}$ is a set of filters that independently models the translation of each frame, $*$ represents the convolution, and $\{R_{\theta_k}\}$ is a set of independent rotations applied to each frame with angle $\theta_k$. It is interesting to note that $\mathcal{T} \approx \tau^{-1}$; that is, the transformation used in (5) can be understood as the inverse of the transformation used in RASL or t-GRASTA (4).

In [14, 15], a computationally efficient and fully incremental algorithm, based on rank-1 updates for thin SVD [34–37] (also see Section 2.3), was proposed to solve (3); in [23], it was shown that, since (5) is based on (3), such an incremental solution can also be used: letting $d_k$ represent each frame of the observed video $D$, and using similar relationships for $l_k$ and $s_k$ w.r.t. $L$ and $S$, respectively, the solution of

$$\min_{l_k, s_k}\ \tfrac{1}{2}\|T_k(l_k) + s_k - d_k\|_F^2 + \lambda \|s_k\|_1 \tag{7}$$

can indeed be efficiently computed in an incremental fashion (see [23], Section 3.3 for details).

2.3. Incremental and Rank-1 Modifications for Thin SVD

Given a matrix $A \in \mathbb{R}^{m \times l}$ with thin SVD $A = U \Sigma V^T$, where $\Sigma$ holds the $r$ leading singular values, and column vectors $a$ and $b$ (with $m$ and $l$ elements, respectively), note that

$$A + ab^T = \begin{bmatrix} U & a \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} V & b \end{bmatrix}^T, \tag{8}$$

where $0$ is a zero column vector of the appropriate size. Based on [35, 36], as well as on [37], we will briefly describe an incremental (thin) SVD and rank-1 modifications (update, downdate, and replace) for thin SVD.

The generic operation consisting of the Gram–Schmidt orthonormalization of $a$ and $b$ w.r.t. $U$ and $V$, i.e., $\hat{a} = U^T a$, $\tilde{a} = a - U\hat{a}$, $\alpha = \|\tilde{a}\|_2$, and $\hat{b} = V^T b$, $\tilde{b} = b - V\hat{b}$, $\beta = \|\tilde{b}\|_2$, is used as a first step for all the cases described below.

2.3.1. Incremental or Update Thin SVD

Given $A = U\Sigma V^T$, we want to compute the thin $\mathrm{SVD}([A\ \ a])$, with (i) $r$ or (ii) $r + 1$ singular values. In this case, we note that $[A\ \ a] = \hat{A} + ab^T$, where $\hat{A} = [A\ \ 0]$ and $b$ is a unit vector (with $l + 1$ elements in this case); then, (8) is equivalent to (9) and (10), where $K$ is a small $(r + 1) \times (r + 1)$ matrix:

$$[A\ \ a] = \begin{bmatrix} U & \tfrac{\tilde{a}}{\alpha} \end{bmatrix} K \begin{bmatrix} V & 0 \\ 0^T & 1 \end{bmatrix}^T, \tag{9}$$

$$K = \begin{bmatrix} \Sigma & \hat{a} \\ 0^T & \alpha \end{bmatrix} = G \Sigma' H^T. \tag{10}$$

Using (11), we get the thin SVD with (i) $r$ singular values; similarly, using (12), we get the thin SVD with (ii) $r + 1$ singular values (Matlab notation is used to indicate array slicing operations):

$$U' = \left(\begin{bmatrix} U & \tfrac{\tilde{a}}{\alpha} \end{bmatrix} G\right)(:, 1\!:\!r), \quad \Sigma'(1\!:\!r, 1\!:\!r), \quad V' = \left(\begin{bmatrix} V & 0 \\ 0^T & 1 \end{bmatrix} H\right)(:, 1\!:\!r), \tag{11}$$

$$U' = \begin{bmatrix} U & \tfrac{\tilde{a}}{\alpha} \end{bmatrix} G, \quad \Sigma', \quad V' = \begin{bmatrix} V & 0 \\ 0^T & 1 \end{bmatrix} H. \tag{12}$$
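As a concrete illustration of (9)–(12), the following NumPy sketch appends one column to a thin SVD via the small $(r + 1) \times (r + 1)$ matrix $K$. The function and variable names are ours; this is an illustration, not the released incPCP code.

```python
import numpy as np

def incsvd_append(U, s, V, a, r=None):
    """Thin SVD of [A, a] given A = U @ diag(s) @ V.T; truncate to rank r if given."""
    ahat = U.T @ a                       # projection of a onto span(U)
    a_perp = a - U @ ahat                # orthogonal component (Gram-Schmidt step)
    alpha = np.linalg.norm(a_perp)
    q = a_perp / alpha if alpha > 1e-12 else np.zeros_like(a)

    k = s.size                           # build K as in (10)
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(s)
    K[:k, -1] = ahat
    K[-1, -1] = alpha

    G, s_new, Ht = np.linalg.svd(K)      # small SVD: O(k^3)
    U_new = np.hstack([U, q[:, None]]) @ G
    V_ext = np.zeros((V.shape[0] + 1, k + 1))
    V_ext[:-1, :k] = V                   # [[V, 0], [0, 1]] as in (9)
    V_ext[-1, -1] = 1.0
    V_new = V_ext @ Ht.T

    if r is not None:                    # case (i): keep r singular values, cf. (11)
        return U_new[:, :r], s_new[:r], V_new[:, :r]
    return U_new, s_new, V_new           # case (ii): r + 1 singular values, cf. (12)
```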

2.3.2. Downdate Thin SVD

Given $A = U\Sigma V^T$, with $A = [a_1\ \ \bar{A}]$, we want to compute the thin $\mathrm{SVD}(\bar{A})$ with $r$ singular values. Noting that $[0\ \ \bar{A}] = A - a_1 e_1^T$, where $e_1$ is the first canonical basis vector, the rank-1 modification (8) is equivalent to

$$[0\ \ \bar{A}] = \begin{bmatrix} U & \tfrac{\tilde{a}_1}{\alpha} \end{bmatrix} K \begin{bmatrix} V & \tfrac{\tilde{e}_1}{\beta} \end{bmatrix}^T, \quad K = \begin{bmatrix} \Sigma & 0 \\ 0^T & 0 \end{bmatrix} - \begin{bmatrix} \hat{a}_1 \\ \alpha \end{bmatrix} \begin{bmatrix} \hat{e}_1 \\ \beta \end{bmatrix}^T = G \Sigma' H^T, \tag{13}$$

from which we can compute the thin $\mathrm{SVD}(\bar{A})$ via the following equation:

$$U' = \left(\begin{bmatrix} U & \tfrac{\tilde{a}_1}{\alpha} \end{bmatrix} G\right)(:, 1\!:\!r), \quad \Sigma'(1\!:\!r, 1\!:\!r), \quad V' = \left(\begin{bmatrix} V & \tfrac{\tilde{e}_1}{\beta} \end{bmatrix} H\right)(2\!:\!\mathrm{end}, 1\!:\!r). \tag{14}$$

2.3.3. Replace Thin SVD

Given $A = U\Sigma V^T$, with $A = [a_1\ \ \bar{A}]$, we want to compute the thin $\mathrm{SVD}([a\ \ \bar{A}])$ with $r$ singular values. This case can be understood as a mixture of the previous cases and can be easily derived by noticing that $[a\ \ \bar{A}] = A + (a - a_1)e_1^T$.
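The downdate and replace operations are both instances of the generic rank-1 modification (8). The following NumPy sketch of that generic modification, with the two special cases noted in the trailing comments, is illustrative only; the names are ours.

```python
import numpy as np

def rank1_update(U, s, V, b, c, r=None):
    """Thin SVD of A + b @ c.T given A = U @ diag(s) @ V.T, cf. (8)."""
    bhat = U.T @ b; b_perp = b - U @ bhat; beta = np.linalg.norm(b_perp)
    chat = V.T @ c; c_perp = c - V @ chat; gamma = np.linalg.norm(c_perp)
    p = b_perp / beta if beta > 1e-12 else np.zeros_like(b)
    q = c_perp / gamma if gamma > 1e-12 else np.zeros_like(c)

    k = s.size                           # small (k + 1) x (k + 1) middle matrix
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(s)
    K += np.concatenate([bhat, [beta]])[:, None] * np.concatenate([chat, [gamma]])[None, :]

    G, s_new, Ht = np.linalg.svd(K)
    U_new = np.hstack([U, p[:, None]]) @ G
    V_new = np.hstack([V, q[:, None]]) @ Ht.T
    if r is not None:
        return U_new[:, :r], s_new[:r], V_new[:, :r]
    return U_new, s_new, V_new

# Downdate: rank1_update(U, s, V, -a1, e1, r) yields the thin SVD of [0, A_bar];
# dropping the first row of the returned V then gives the thin SVD of A_bar.
# Replace:  rank1_update(U, s, V, a - a1, e1, r) yields the thin SVD of [a, A_bar],
# where e1 is the first canonical basis vector and a1 is the first column of A.
```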

Finally, we point out that the computational complexity of any of the above procedures ([35], Section 3 and [37], Section 4) is upper bounded by $O\big((m + l)r^2 + r^3\big)$. If $r \ll \min(m, l)$ holds, then the complexity is dominated by $O\big((m + l)r^2\big)$.

3. Methods

3.1. Proposed incPCP-PTI Method

The proposed algorithm (named incPCP-PTI) is a modification of the previously proposed incPCP-TI [23] that makes it able to handle panning and camera motion. It was briefly presented in [28] and is more thoroughly explained and evaluated in this work. The method continuously estimates the alignment transformation $\hat{T}_k$ such that $\hat{T}_k(l_{k-1}) \approx d_k$, i.e., the transformation that aligns the previous low-rank representation with the observed current frame. Thus, incPCP-PTI effectively uses $l_{k-1}$ as a local estimation of a composite panoramic background image. After applying such a transformation to $L$, the PCP problem can be solved in the reference frame of $d_k$. After this initial alignment, it is considered that only minor jitter remains in the image, and so a procedure similar to incPCP-TI is utilized by estimating a transformation $T_k$ for the $k$-th frame. However, instead of solving the affinely constrained matrix rank minimization [38] as in the original incPCP-TI [14], the low-rank approximation problem is solved in the reference frame of $d_k$ by applying $T_k^{-1}$ to the residual $d_k - s_k$. The whole procedure is presented in Algorithm 1. This algorithm makes use of the incSVD, repSVD, and downSVD operators, which correspond to the thin SVD update, replacement, and downdate operators, respectively (Section 2.3).

Input: observed video $D$; internal parameters for shrinkage; internal parameters for transformation estimation; number of inner loops $iL$; number of background frames $b_0$; tolerance $\epsilon$
Initialization: thin SVD $[U, \Sigma, V]$ of the initial frames; initial rank $r$
(1) for $k = 1, \ldots, N$ do
(2)  $d_k \leftarrow$ current observed frame;
(3)  find $\hat{T}_k$ such that $\|\hat{T}_k(l_{k-1}) - d_k\|_2$ is minimized
(4)  obtain $\hat{L} = \hat{T}_k(L)$
(5)  update $[U, \Sigma, V]$ to represent $\hat{L}$
(6)  $[U, \Sigma, V] \leftarrow \mathrm{incSVD}(d_k, U, \Sigma, V)$
(7)  for $j = 1, \ldots, iL$ do
(8)   $l_k \leftarrow (U \Sigma V^T)(:, \mathrm{end})$
(9)   $s_k \leftarrow \mathrm{shrink}(d_k - l_k)$
(10)   estimate the jitter transformation $T_k$ from the residual $d_k - s_k$
(11)   if the estimates have converged (change below $\epsilon$) then
(12)    break
(13)   $[U, \Sigma, V] \leftarrow \mathrm{repSVD}(T_k^{-1}(d_k - s_k), U, \Sigma, V)$
(14)   $l_k \leftarrow (U \Sigma V^T)(:, \mathrm{end})$
(15)  end
(16)  Apply ghosting suppression (Section 3.3)
(17)  if $k \geq b_0$ then
(18)   $[U, \Sigma, V] \leftarrow \mathrm{downSVD}(U, \Sigma, V)$
(19)  Update the rank $r$ if necessary
(20) end

In line 3 of Algorithm 1, the latest low-rank frame $l_{k-1}$ is aligned to the current frame $d_k$. The transformation $\hat{T}_k$ is estimated as the composition of a translation and a rotation. The alignment transformation found is then used to update the whole low-rank matrix representation $L$ to the current reference axes (lines 4 and 5 of Algorithm 1) in order to obtain $\hat{L}$. After this initial alignment transformation is performed, it is assumed that only minor misalignments, modeled by $T_k$ and due to jitter, remain (line 10 of Algorithm 1).
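To make the per-frame flow of Algorithm 1 tangible, the following simplified sketch follows lines 6–15 while omitting the alignment transformations $\hat{T}_k$ and $T_k$ and the ghosting suppression. It reuses the incsvd_append and rank1_update sketches of Section 2.3, uses soft thresholding for the shrinkage step, and illustrates the structure only; it is not the released implementation.

```python
import numpy as np

def incpcp_step(U, s, V, d_k, lam=0.05, inner_loops=3, rank=10):
    """One transformation-free frame update (Algorithm 1, lines 6-15, simplified).
    Frames are assumed vectorized, i.e., d_k has m elements."""
    # line 6: append the new frame to the low-rank model (incSVD)
    U, s, V = incsvd_append(U, s, V, d_k, r=rank)
    e_last = np.zeros(V.shape[0]); e_last[-1] = 1.0
    for _ in range(inner_loops):                    # lines 7-15
        l_k = U @ (s * V[-1])                       # line 8: current low-rank frame
        res = d_k - l_k
        s_k = np.sign(res) * np.maximum(np.abs(res) - lam, 0.0)  # line 9: shrink
        # line 13: replace the model's last column by the sparse-free frame (repSVD)
        U, s, V = rank1_update(U, s, V, (d_k - s_k) - l_k, e_last, r=rank)
    return U, s, V, l_k, s_k
```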

The ghosting suppression mentioned in line 16 is detailed in Section 3.3. The shrinkage in line 9 of Algorithm 1 can be performed either by soft thresholding or by projection onto the $\ell_1$ ball. Soft thresholding is performed with the simple element-wise shrinkage operator $\mathrm{shrink}(x, \lambda) = \mathrm{sign}(x) \cdot \max(0, |x| - \lambda)$. Projection onto the $\ell_1$ ball is detailed in Section 3.2. For all our experiments, the latter was chosen.

3.2. Projection onto the $\ell_1$ Ball

Although theoretical guidance is available for selecting a minimax optimal regularization parameter λ in (2) [39], practical problems do not fully satisfy the idealized assumptions, and thus λ often has to be heuristically tuned. This problem is also observed if (3) is used instead of (2).

To tackle this problem, Rodríguez and Wohlberg [40] introduced the alternative relaxation of (1) given by

$$\min_{L,S}\ \tfrac{1}{2}\|L + S - D\|_F^2 \quad \text{s.t.} \quad \mathrm{rank}(L) = r,\ \|S\|_1 \le \mu, \tag{15}$$

which can also be incrementally solved via rank-1 updates for thin SVD (as is the case for incPCP and related algorithms [14, 15, 23]); however, (15) has the advantage that a simple heuristic can be derived for the adaptive selection of $\mu$ for each frame. Furthermore, $\mu$ can be spatially adapted in order to reduce ghosting effects. The algorithm they propose is very similar to incPCP, save for the shrinkage step, which is calculated as $s_k = \mathrm{proj}(d_k - l_k, \mu)$, where

$$\mathrm{proj}(x, \mu) = \underset{s}{\arg\min}\ \|s - x\|_2^2 \quad \text{s.t.} \quad \|s\|_1 \le \mu. \tag{16}$$

Thus, for the shrinkage step, the solution is given by the projection onto the $\ell_1$ ball of radius $\mu$.

While there are several well-known and efficient algorithms that solve (16), the studies in [40, 43] used the recently published algorithm of [44] for solving (16), which has a better computational performance than either that in [41] or that in [42].
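For reference, a standard $O(n \log n)$ sort-and-threshold projection onto the $\ell_1$ ball is sketched below; it solves (16) exactly, but it is not necessarily the faster algorithm of [44] used in the cited studies.

```python
import numpy as np

def proj_l1_ball(x, mu):
    """Euclidean projection of x onto the l1 ball of radius mu, cf. (16)."""
    v = np.abs(x).ravel()
    if v.sum() <= mu:
        return x.copy()                   # already feasible
    u = np.sort(v)[::-1]                  # magnitudes in decreasing order
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (cssv - mu))[0][-1]
    theta = (cssv[rho] - mu) / (rho + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```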

Furthermore, Rodríguez and Wohlberg [40] also proposed a simple scheme for adapting $\mu$ with every frame, given by

$$\mu_{k+1} = \alpha\, \|s_k\|_1, \tag{17}$$

where $\alpha$ is a value between 0.5 and 0.75.
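Under our reading of (17), the per-frame adaptation combines with the projection above as follows; the function name and the commented usage are illustrative assumptions, not the released code.

```python
import numpy as np

def adapt_mu(s_k, alpha=0.75):
    """Next-frame radius under our reading of (17); alpha in [0.5, 0.75]."""
    return alpha * np.abs(s_k).sum()

# e.g., within the per-frame loop of Section 3.1:
#   s_k = proj_l1_ball(d_k - l_k, mu)   # shrinkage step
#   mu = adapt_mu(s_k)                  # radius for the next frame
```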

3.3. Ghosting Suppression

Ghosting refers to the situation in which the foreground estimates include phantoms or smeared replicas of actual moving objects. Rodríguez and Wohlberg [45] proposed a procedure for ghosting suppression in the incPCP algorithm, which consists of using binary masks obtained from different frames in order to remove the ghosts from the low-rank component. In this approach, the sparse components at two different time steps, $k$ and $k - \delta$, are used to compute respective binary masks $M_k$ and $M_{k-\delta}$. These masks will include the moving objects as well as ghosts. A new binary mask $M = \overline{M_k \cap M_{k-\delta}}$, i.e., the complement of the intersection of the binary masks obtained from the aforementioned two frames, will include, with high probability, all pixels of the background that are not occluded by a moving object. $M$ can then be used to generate a modified input frame $\tilde{d}_k = M \odot d_k$, where $\odot$ represents the Hadamard product, which is used to update the low-rank component. Additionally, if the procedure with the $\ell_1$-ball projection described in Section 3.2 is used for the shrinkage step, $\mu$ can be spatially adapted in order to reduce ghosting [40]. Based on the difference between the current and previous sparse approximations, $s_k - s_{k-1}$, a binary mask $W$ can be computed, and the sparse component is then modified using the projection defined in (16) with a radius that is spatially adapted through $W$ and a parameter $\beta$, where $\beta$ is suggested to take values between 0.1 and 0.3.
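A sketch of the mask construction just described is given below; the binarization threshold and the offset $\delta$ are assumptions on our part.

```python
import numpy as np

def ghost_free_frame(d_k, s_k, s_k_delta, thresh=0.05):
    """Modified input frame M (Hadamard) d_k for updating the low-rank component."""
    M_k = np.abs(s_k) > thresh          # moving objects + ghosts at frame k
    M_kd = np.abs(s_k_delta) > thresh   # moving objects + ghosts at frame k - delta
    M = ~(M_k & M_kd)                   # complement of the intersection: background
    return M * d_k
```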

4. Description of Datasets and Computational Experiments

For the evaluation of the proposed incPCP-PTI algorithm, four datasets were considered. The first dataset consists of synthetic jitter and panning videos. The second one consists of videos with real panning taken from the Moseg dataset [29]. The third one consists of videos from the recently published DAVIS dataset [30, 46]. The last one was obtained from the CDnet2014 dataset [31]. All datasets are detailed in this section. All tests were carried out using GPU-enabled Matlab code running on an Intel i7-2600K CPU (8 cores, 3.40 GHz, 8 MB cache, and 32 GB RAM) with a 12 GB NVIDIA Tesla K40C GPU card. To the best of our knowledge, no other incremental low-rank plus additive matrix video background modeling technique capable of handling panning has been reported in the literature. This situation places some constraints on our evaluations, and for most of our tests, we make comparisons with batch methods. Specific details for each of the datasets are described in their corresponding sections. Furthermore, to test the stability of incPCP-PTI to jitter, we also use stab + incPCP-PTI, which consists of a preprocessing stage using a recent state-of-the-art video stabilization technique [47] followed by incPCP-PTI.

4.1. Synthetic Datasets

A dataset with synthetic panning and jitter was generated from the 3rd Tower video of the USC Neovision2 dataset [48], which consists of 900 frames of size 1920 × 1088 pixels at 25 fps. For this purpose, a subregion of 720 × 480 pixels was selected from each frame, and the centroid of the subregion was translated with each new frame in order to simulate an aerial panning scenario, following a piecewise linear trajectory defined by an initial point $p_0$ and a point $p_1$ at which the slope changes (both fixed in pixel coordinates). This process is depicted in Figure 1. The panning velocity was taken as 1, 3, and 5 pixels per frame. A fourth case, in which the velocity changed randomly between 1 and 7 pixels per frame, was also considered. This dataset will be referred to as the SP ("synthetic panning") dataset. Additionally, the same procedure was used to construct a dataset based on jittered versions of the original frames. Each frame of the 3rd Tower video was jittered with random uniformly distributed translations (over a fixed range of pixels) and random uniformly distributed rotations (over a fixed range of degrees). The same trajectory and subregion selection as in the SP dataset were used. This second synthetic dataset will be referred to as the SPJ ("synthetic panning and jitter") dataset. For both the SP and SPJ datasets, the sparse approximation via the batch iALM method [10] using 20 outer iterations was used as a proxy ground truth by selecting the same regions that were selected from the original frames. The iALM was chosen as the proxy ground truth since, as reported in [8], Tables 6 and 7, its segmentation is considered to be reliable. For these synthetic datasets, the performance of the proposed algorithm was measured in terms of the normalized distance

$$\xi_k = \frac{\|g_k - s_k\|_1}{N}, \tag{20}$$

where $g_k$ and $s_k$ are the ground truth and computed sparse components for frame $k$, respectively, and $N$ is the number of pixels of the frame. Considering images normalized between 0 and 1, the value of $\xi_k$ varies from 0 (perfect match with the ground truth) to 1.
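Under our reading that (20) is the mean absolute per-pixel difference, it can be computed directly as:

```python
import numpy as np

def norm_dist(g_k, s_k):
    """Normalized distance (20); frames assumed scaled to [0, 1]."""
    return np.abs(g_k - s_k).sum() / g_k.size
```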

For the SP dataset, only the incPCP-PTI method was evaluated. For the SPJ dataset, we evaluated incPCP-PTI and, as mentioned in Section 4, stab + incPCP-PTI, i.e., a preprocessing stage using a recent state-of-the-art video stabilization technique [47] followed by incPCP-PTI. The objective of this comparison was to determine whether jitter is handled correctly by incPCP-PTI [23] alone. Additionally, we include a baseline comparison with the sparse components obtained by running incPCP on the full Neovision2 Tower video, which were then segmented using the same procedure described in Section 4.1.

4.2. Moseg Dataset

We used 15 video sequences of the Freiburg-Berkeley motion segmentation (Moseg) dataset [29, 49, 50], selecting sequences that contain panning or camera movement. For all incPCP-PTI variants, three inner loops and a window size of 30 background frames were used. For the $\ell_1$-ball projection, $\alpha$ was set to 0.75, and the ghosting suppression offset was set to 20 frames; $\alpha$ controls the adaptation of $\mu$ (cf. (17)), with lower $\alpha$ forcing a sparser solution, whereas the frame offset $\delta$ controls the number of frames used for ghosting suppression (Section 3.3). The binary mask obtained with incPCP-PTI was postprocessed by computing the convex hull of the connected objects [51]. It is noted that this postprocessing was not applied to the other methods, as it tended to reduce their performance.
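One way to realize this postprocessing, using scikit-image here for illustration (the exact routine of [51] may differ), is:

```python
import numpy as np
from skimage.morphology import convex_hull_object

def postprocess_mask(mask):
    """Replace each connected foreground object by its convex hull."""
    return convex_hull_object(mask.astype(bool))
```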

For comparison, both the LTVA and DECOLOR algorithms were used as references. The LTVA code was obtained from [52], and the default parameter of a spatial subsampling of 8 for the point tracking was used. The tracking component of the algorithm runs in single-threaded C, whereas the dense clustering component runs in CUDA C 5.5. The single-threaded Matlab DECOLOR code was obtained from [53], and a tolerance of $10^{-4}$ was used; all other parameters were left at their default values. For reporting the results, we further subdivided the selected videos into two categories: short panning (comprising nine cars sequences and one people sequence) and long panning (comprising five marple sequences). In the former category, the final and first frames of the panning motion still share some common area, while in the latter, these two frames do not. This subdivision was necessary because DECOLOR performs a preregistration and was not able to work properly on the long panning sequences.

For all videos, the binary masks produced by the methods were compared to the ground truth provided in the dataset in order to obtain the F measure, defined as

$$F = \frac{2 P R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \tag{21}$$

where $P$ and $R$ stand for precision and recall, respectively, and $TP$, $FN$, and $FP$ are the numbers of true positive, false negative, and false positive pixels, respectively. It is noted that only approximately one in ten frames possessed ground truth information.
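A minimal implementation of (21) over boolean masks:

```python
import numpy as np

def f_measure(mask, gt):
    """F measure (21); `mask` and `gt` are boolean foreground masks."""
    tp = np.sum(mask & gt)
    fp = np.sum(mask & ~gt)
    fn = np.sum(~mask & gt)
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```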

4.3. DAVIS Dataset

We used 10 video sequences of the DAVIS dataset [30, 46], selecting sequences that contain panning or camera movement and at least 50 frames. The algorithms were configured with the same parameters specified in Section 4.2, and the same F measure comparison was performed. However, as mentioned before, DECOLOR is not able to run on all sequences, and only the F measures for the sequences where its prealignment phase ran correctly are reported. Additionally, on this dataset, LTVA oversegmented some objects (a problem not present in the Moseg sequences), and accordingly, two methods to obtain the final binary mask of moving objects are considered. The first, reported as LTVA-aPP (automatic postprocessing), considers an automatic selection of the mask by designating the object label with the largest area as the background and considering all other labels as foreground objects. The second, reported as LTVA-mPP (manual postprocessing), entails the manual selection of the labels corresponding to moving objects. Although the latter method is more accurate, for an automatic pipeline, LTVA-aPP would be the more realistic option.

4.4. Change Detection 2014 (CDnet2014) Dataset

Two videos from the PTZ category of the CDnet2014 dataset [31] were chosen:

(i) continuousPan (CP): a 704 × 480 pixel, 1700-frame color video containing a continuous panning of a PTZ camera. The video is almost jitter free.
(ii) intermittentPan (IP): a 560 × 368 pixel, 3500-frame color video of a PTZ camera that switches between two fixed positions. The video contains intermittent panning and additional real jitter.

As mentioned in the previous section, DECOLOR is unable to work on this type of long panning sequence, and accordingly, only incPCP-PTI and LTVA are compared. The F measure was evaluated only on frames that contained ground truth motion. LTVA presented the same segmentation problems as in the DAVIS dataset; however, in this case, the masks were not good enough for a manual postprocessing, and thus only the results of LTVA-aPP are presented. All the parameters for incPCP-PTI and LTVA are the same as in the previous sections. We also included a comparison with the edge-based foreground-background segmentation and interior classification (EFIC) method [54] and its color version, C-EFIC [55]. These methods were chosen because they obtained the second and third best F measures in the PTZ category of the CDnet2014 dataset results [56]. The top performer in the category was not selected, as it corresponds to a supervised convolutional neural network that needs proper training before classification. Unfortunately, no open code is available for EFIC and C-EFIC, and we only had access to the segmented binary masks submitted to the challenge [57, 58]. Due to this limitation, only a referential F measure could be computed. The absence of open code also makes it difficult to ascertain whether EFIC and C-EFIC can be implemented in a fully incremental way and to compare them in terms of computational performance. Additionally, as EFIC and C-EFIC already include a postprocessing step on the binary mask, we did not apply the convex hull postprocessing of the connected objects [51] that was used for incPCP-PTI.

5. Results

5.1. Synthetic Datasets
5.1.1. SP Dataset

The distance (20) computed for each frame of the different videos of the SP dataset is shown in Figure 2. Table 1 shows the average distance and the average time for processing one frame, along with a baseline metric described in Section 4.1. It can be noticed that the distance tends to increase as the panning velocity increases, but it remains relatively small (below 0.01) in all cases.

5.1.2. SPJ Dataset

Representative frames of the SPJ video with changing velocity and the segmented sparse components obtained with incPCP-PTI and stab + incPCP-PTI are shown in Figure 3. The distance computed for each frame of the different videos of the SPJ dataset is shown in Figures 4 and 5 for incPCP-PTI and stab + incPCP-PTI, respectively.

5.2. Moseg Dataset

Representative frames of the video and the segmented sparse components for the cars8, people1, and marple13 videos of the Moseg dataset are shown in Figure 6. Tables 2 and 3 show the F measure obtained on the short and long panning subsets, respectively. As noted in the previous sections, DECOLOR did not work properly on the long panning sequences, and so it is excluded from the comparisons in this subset.

5.3. DAVIS Dataset

Representative frames of the video and the segmented sparse components for the tennis, horsejump-high, swing, and dog-gooses videos of the DAVIS dataset are shown in Figure 7. Table 4 shows the F measures obtained on the tested sequences. The sequences on which DECOLOR did not run properly are left unreported. As described in Section 4.3, LTVA-aPP and LTVA-mPP refer to automatic and manual postprocessing of the binary masks generated by the LTVA method, respectively.

5.4. CDnet2014 Dataset

Representative frames of the video and the segmented sparse components for the CP and IP videos are shown in Figures 8 and 9, respectively. Figure 10 shows the F measure (with no postprocessing) for incPCP-PTI (grayscale and color versions), EFIC, and C-EFIC on the frames of the CP video, while Figure 11 shows the same metric for all methods on the frames of the IP video. Tables 5 and 6 show the average F measure and computational time obtained over all frames. For stab + incPCP-PTI, the computational time is shown as (total stabilization time) + (incPCP-PTI time per frame). For LTVA, the total time of the batch execution was divided by the total number of frames in order to obtain an average time per frame.

6. Discussion

As expected, the results of Section 5.1.1 show that the distance increased, that is, the sparse approximation became worse, as the panning velocity increased. Nevertheless, incPCP-PTI is able to maintain an adequate performance even when the panning velocity changes. Also expected is the fact that adding jitter to the panning scenario (Section 5.1.2) increased the distance for all panning velocities with respect to their jitter-free counterparts. The overall stability of the estimated distance also decreased, as evidenced by the higher variability of the curves in Figure 4. The inclusion of a video stabilization preprocessing stage (stab + incPCP-PTI) seemed to decrease such variability, as evidenced in Figure 5. Nevertheless, even with jitter, standalone incPCP-PTI maintained a low average distance, and its performance is comparable with that of stab + incPCP-PTI, as can be observed in Table 7. Furthermore, although incPCP-PTI obtained higher distances than baseline incPCP, the values tend to be close to each other, and for all tested velocities, incPCP-PTI managed to maintain a very small distance from the ground truth (below 0.01 in all cases).

The results of Table 2 (related to the Moseg dataset, Section 4.2) suggest that incPCP-PTI performs comparably to DECOLOR, even though the latter is a batch method and our proposed method is incremental. LTVA has a substantially higher average F measure for this particular dataset, albeit working in a batch fashion. In Table 3, the same trend is observed. As mentioned above, DECOLOR has problems working on these sequences because its prealignment phase fails to find a suitable unique reference frame. The low performance of incPCP-PTI on some of the Moseg sequences might stem from the small number of video frames, which causes the initial low-rank estimation of PCP to be less precise.

The results of Sections 5.3 and 5.4 (related to the DAVIS and CDnet datasets, respectively; Sections 4.3 and 4.4) suggest that incPCP-PTI can perform adequately on longer real panning videos with more complex scenarios. Regarding the DAVIS dataset, we observe that the highest performance is obtained with LTVA-mPP. Nevertheless, this is a batch method, and its final binary segmentation requires human interaction. In contrast, incPCP-PTI shows an average performance superior to that of LTVA-aPP and comparable to that of DECOLOR, although the latter did not run on all tested sequences.

The representative frames of Figures 8 and 9 exhibit different positions of the PTZ camera and thus evidence the ability of incPCP-PTI to handle the panning movements in the scene. IncPCP-PTI presents a relatively good F measure for both videos, and this metric tended to be higher for the color version of the algorithm. In Figure 11, it can be observed that the F measure drops at specific intervals of the video that coincide with sudden movements of the PTZ camera. However, after these sudden movements, the algorithm is able to restabilize and perform correctly. In contrast, LTVA fails to track moving objects in a large number of frames. The lower performance of LTVA on the CDnet dataset might be caused by the faster moving objects and panning movements, which complicate the optical flow tracking used by LTVA. Additionally, the higher complexity of the objects and panning movements causes the clustering stage to produce a large number of false positives.

For both the CP and IP videos (described in Section 4.4), incPCP-PTI showed a higher F measure than stab + incPCP-PTI, although a possible explanation is the misalignment between the ground truth reference frame and the reference frame of the stabilization algorithm. Nevertheless, the visual inspection of the frames and the results from the SPJ dataset suggest that incPCP-PTI is able to handle the presence of jitter in a panning scenario and that it does not need a stabilization preprocessing step. Compared to EFIC, incPCP-PTI showed a superior F measure on the CP video, even without the postprocessing stage. On the IP video, incPCP-PTI is comparable or superior in F measure to EFIC. As mentioned, the absence of open code for EFIC makes it difficult to carry out a more thorough comparison and to draw further conclusions. Compared to LTVA, incPCP-PTI shows a much higher F measure in both cases. These results suggest that incPCP-PTI might be more adequate than LTVA for tracking under faster panning and more complex scenarios. It is also noticed that incPCP-PTI attains these results in an incremental manner and with comparable or lower average computational time per frame, despite the fact that the LTVA public code is implemented in C and CUDA.

7. Conclusion

We have presented a novel algorithm, incPCP-PTI, and have shown on artificial datasets and on real videos from the Moseg, DAVIS, and CDnet2014 datasets that it can adequately detect moving objects in scenarios with simultaneous panning and jitter. To the best of our knowledge, this is the first incremental PCP-like method able to handle panning conditions. For the synthetic datasets, the algorithm maintained a low distance with respect to a proxy ground truth, and for the real videos, it maintained an adequate F measure and was able to restabilize after sudden panning of the camera. Additionally, the comparisons with stab + incPCP-PTI (independent video stabilization followed by incPCP-PTI) suggest that a stabilization stage preceding incPCP-PTI is not needed, as the algorithm is able to handle the jitter present in the camera motion. The evaluations on real videos show that incPCP-PTI may be comparable or superior, depending on the case, to state-of-the-art batch PCP (e.g., DECOLOR) and non-PCP (e.g., LTVA, EFIC) foreground separation methods.

Further improvements of the algorithm might focus on (i) making it able to handle other types of distortion, such as perspective changes or zooming in/out of the camera, and (ii) reducing the time it takes per frame in order to make it more readily usable in high frame rate real-time applications.

Data Availability

The video data used to support the findings of this study are included within the article. The datasets used in the article are referenced and can be found at publicly available sites, namely: (1) the synthetic datasets (Section 4.1) were constructed from the USC Neovision2 Project, https://goo.gl/5Si2Nm; (2) Moseg dataset (Section 4.2): "Freiburg-Berkeley motion segmentation dataset," https://goo.gl/bzEvvi; (3) DAVIS dataset (Section 4.3): "DAVIS: Densely Annotated VIdeo Segmentation," https://goo.gl/G8Hb7o; (4) CDnet2014 dataset (Section 4.4): http://www.changedetection.net/. Additionally, our implemented method can be found at http://goo.gl/4jEvck.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the “Programa Nacional de Innovación para la Competitividad y Productividad” (Innóvate Perú) Program, 169-Fondecyt-2015.