Journal of Electrical and Computer Engineering

Volume 2019, Article ID 7675805, 15 pages

https://doi.org/10.1155/2019/7675805

## Panning and Jitter Invariant Incremental Principal Component Pursuit for Video Background Modeling

Department of Electrical Engineering, Pontificia Universidad Católica del Perú, Lima, Peru

Correspondence should be addressed to Gustavo Chau; gustavo.chau@pucp.edu.pe

Received 20 July 2018; Accepted 2 September 2018; Published 3 February 2019

Academic Editor: Nicolas Younan

Copyright © 2019 Gustavo Chau and Paul Rodríguez. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Video background modeling is an important preprocessing stage for various applications, and principal component pursuit (PCP) is among the state-of-the-art algorithms for this task. One of the main drawbacks of PCP is its sensitivity to jitter and camera movement. This problem has only been partially solved by a few methods devised for jitter or small transformations. However, such methods cannot handle the case of moving or panning cameras in an incremental fashion. In this paper, we greatly expand the results of our earlier work, in which we presented a novel, fully incremental PCP algorithm, named incPCP-PTI, which was able to cope with panning scenarios and jitter by continuously aligning the low-rank component to the current reference frame of the camera. To the best of our knowledge, incPCP-PTI is the first low-rank plus additive incremental matrix method capable of handling these scenarios in an incremental way. The results on synthetic videos and Moseg, DAVIS, and CDnet2014 datasets show that incPCP-PTI is able to maintain a good performance in the detection of moving objects even when panning and jitter are present in a video. Additionally, in most videos, incPCP-PTI obtains competitive or superior results compared to state-of-the-art batch methods.

#### 1. Introduction

Video background modeling consists of segmenting the “foreground” or moving objects from the static “background.” It is an important first step in various computer vision applications [1] such as abnormal event identification [2] and surveillance [3].

Several video background modeling methods, using different approaches such as Gaussian mixture models [4], kernel density estimations [5], or neural networks [6], exist in the literature. More comprehensive surveys of other methods are presented in [1, 7]. Principal component pursuit (PCP) is currently considered to be one of the leading algorithms for video background modeling [8]. Formally, PCP was introduced in [9] as the nonconvex optimization problem

$$\min_{L,S} \; \operatorname{rank}(L) + \lambda \|S\|_{0} \quad \text{s.t.} \quad D = L + S, \tag{1}$$

where the matrix $D \in \mathbb{R}^{N \times n}$ is formed by the *n* observed frames, each of size $N = N_r \times N_c \times N_d$ (rows, columns, and number of channels, respectively); $L$ is a low-rank matrix representing the background; $S$ is a sparse matrix representing the foreground; $\lambda$ is a fixed global regularization parameter; $\operatorname{rank}(L)$ is the rank of $L$; and $\|S\|_{0}$ is the $\ell_0$ norm of $S$.

The convex relaxation of (1) is given by

$$\min_{L,S} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{s.t.} \quad D = L + S, \tag{2}$$

where $\|L\|_{*}$ is the nuclear norm of matrix *L* (i.e., the sum of the singular values of *L*) and $\|S\|_{1}$ is the $\ell_1$ norm of *S*. Although (2) is at the core of most PCP algorithms (including the augmented Lagrange multiplier (ALM) and inexact ALM (iALM) algorithms [10, 11]), there exist several others (for a complete list, see [12], Table 4). In particular, we point out

$$\min_{L,S} \; \frac{1}{2}\|L + S - D\|_{F}^{2} + \lambda \|S\|_{1} \quad \text{s.t.} \quad \operatorname{rank}(L) = r, \tag{3}$$

where $\|\cdot\|_{F}$ is the Frobenius norm, which was originally proposed in [13]; we will use it as the starting point of our proposed method (Sections 2.2.2 and 3.1).
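The Frobenius-based formulation (3) lends itself to a simple alternating minimization: a rank-$r$ truncated SVD step for $L$ followed by elementwise soft-thresholding for $S$. The sketch below is a minimal NumPy illustration of this scheme; the function names `fast_pcp` and `shrink` are ours and this is not the implementation of [13], only a toy version of the idea.

```python
import numpy as np

def shrink(x, t):
    # soft-thresholding: proximal operator of t * l1-norm
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fast_pcp(D, rank=1, lam=0.5, iters=30):
    """Alternating minimization for min 0.5||L+S-D||_F^2 + lam||S||_1, rank(L)=r."""
    S = np.zeros_like(D)
    for _ in range(iters):
        # L-update: best rank-r approximation of D - S
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # S-update: elementwise soft-thresholding of the residual D - L
        S = shrink(D - L, lam)
    return L, S
```

In a batch setting each iteration costs a full (or partial) SVD; the incremental variants discussed below replace this with rank-1 thin-SVD updates.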

Bouwmans and Zahzah [8] showed that PCP provides state-of-the-art performance in video background modeling problems but also pointed out some of its limitations.

First, PCP is inherently a batch method with high computational and memory requirements. This problem has been addressed in the past by means of solutions based on rank-1 updates for thin SVD [14, 15] (applied to (3)), by low-rank subspace tracking [16] (applied to (2)), stochastic optimization [17] (which applies the maximum-margin matrix factorization (M3F) method [18] to (2)), or random sampling [19] (also applied to (2)).

The second shortcoming of PCP, which is particularly relevant to the present work, is its sensitivity to jitter and its inability to cope with panning video frames. For a general review and classification of methods for motion segmentation able to cope with different degrees of camera motion, we recommend [20] and the many references therein. Among the methods based on the low-rank plus additive matrices model, we highlight the robust alignment by sparse and low-rank decomposition (RASL) method [21]. This method used (2) as its starting point and addressed the problem of jitter in PCP using a series of geometric transformations on the observed frame, but as originally cast, it is a batch method. In contrast, t-GRASTA [22] and incPCP-TI [23], which used (2) and (3), respectively, as their starting points, addressed the problem of jitter in a semi-incremental or fully incremental way by applying geometric transformations to the observed frames or low-rank component, respectively. Other proposed methods are robust against moving cameras and panning [24–26], but all of them are batch or semibatch methods; furthermore, all of them used (2) as their starting point and the same general idea (4) as RASL. Recently, Gao et al. [27] presented a new batch PCP method that produces a panoramic low-rank component spanning the entire field of view, which gives much better results in long panning sequences. We note, nonetheless, that a fully online PCP algorithm able to cope with both jitter and panning is still an open problem. This problem is of particular importance in applications such as surveillance systems that use moving traffic or aerial cameras.

In the present study, we expand our previous work [28], where we proposed to address the panning problem by modifying the optimization problem solved by incPCP-TI [14], which in turn uses (3) as its starting point, and by applying a set of transformations to the low-rank component that are updated with each new incoming frame. We substantially expand [28] by

1. expanding the theoretical basis of our algorithm so that the present manuscript is self-contained,
2. testing our algorithm on an additional real-life dataset with moving cameras, and
3. comparing our algorithm with two previously proposed batch methods.

Our computational experiments on synthetically created datasets and publicly available videos of the Moseg [29], DAVIS [30], and CDnet2014 [31] datasets show that the proposed algorithm, henceforth referred to as panning and transformation invariant incPCP (incPCP-PTI), is able to correctly handle video background modeling in panning and basic jitter conditions.

#### 2. Previous Related Work

##### 2.1. Batch Methods

In this section, two previous motion segmentation batch methods that work under jitter/panning conditions are reviewed. For a more complete review of all available methods, refer to [20]. The two algorithms described here work in batch fashion but were chosen as comparison benchmarks in this paper due to the public availability of their code and/or binary executables. Note that although [27] recently published a new PCP method for moving cameras, that algorithm was not chosen for comparison due to the unavailability of public code or executables.

###### 2.1.1. Segmentation by Long-Term Video Analysis

In [29], the authors proposed to use a dense point tracker based on variational optical flow in which, instead of the classical two-frame approach of optical flow, long-term analysis is used. It is worth mentioning that, following the general classification of motion segmentation methods proposed in [12], the method of [29] was catalogued as a trajectory classification method (see Table 3 of [12] for a summary of the aforementioned classification, along with the associated properties). After the initial tracking of points, spectral clustering with a spatial regularity constraint is utilized to form groups of point trajectories corresponding to different objects in the image. Finally, an energy minimization model is used to transform the clusters into a dense segmentation of moving objects. Throughout this paper, this method will be referred to as LTVA.

###### 2.1.2. DECOLOR

The detecting contiguous outliers in the low-rank representation (DECOLOR) method [25] uses a nonconvex penalty and a Markov random field [32] model to detect outliers that correspond to moving objects. Bouwmans et al. [12] classified this algorithm as a low-rank plus sparse representation method [21]. For moving cameras, the method uses a transformation obtained from a prealignment to the middle frame. The prealignment is performed using the robust multiresolution method proposed in [33], and DECOLOR then iteratively refines this transformation.

##### 2.2. Online or Semionline Methods

In this section, two previous online or partially online PCP methods that work under jitter conditions are reviewed. It should be noted that, without modification, these two methods are not directly applicable to panning scenarios.

###### 2.2.1. t-GRASTA

The Grassmannian Robust Adaptive Subspace Tracking Algorithm (GRASTA) [16] is a semionline method for low-rank subspace tracking that has been applied to the foreground-background separation problem. GRASTA is not a fully online algorithm, as it requires an initialization stage to obtain an initial low-rank subspace from the first *p* frames. A modification called t-GRASTA was presented in [22]; it is based on the robust alignment by sparse and low-rank decomposition (RASL) algorithm [21]. RASL tries to handle the misalignment in the video frames by solving

$$\min_{L,S,\tau} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{s.t.} \quad D \circ \tau = L + S, \tag{4}$$

where $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ is a series of per-frame transformations that align all the observed frames; it is straightforward to note that (4) is an extension of (2).

The nonlinearity of the transformations *τ* in (4) is handled via a linearization using the Jacobian. The main drawback of t-GRASTA is that, aside from the required low-rank subspace initialization, the initial transformation *τ* is estimated using a similarity transformation obtained from a series of three points manually chosen from each of the *p* initial frames. This initialization stage severely constrains its application in automatic processes and reduces its applicability in panning scenarios, as the feature points in the initial frames may not be present in subsequent frames.

###### 2.2.2. incPCP-TI

The incPCP-TI method [23] considers the optimization problem

$$\min_{L,S,\mathcal{T}} \; \frac{1}{2}\|\mathcal{T}(L) + S - D\|_{F}^{2} + \lambda \|S\|_{1} \quad \text{s.t.} \quad \operatorname{rank}(L) = r, \tag{5}$$

where $D$ is the observed video sequence that suffers from jitter, $L$ is the properly aligned low-rank representation, and $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_n\}$ is a set of transformations that compensate translational and rotational jitter; that is,

$$D = \mathcal{T}(\bar{D}), \quad \text{with} \quad \mathcal{T}_k(\cdot) = \mathcal{R}_{\theta_k}(h_k \ast \cdot), \tag{6}$$

where $\bar{D}$ represents the unobserved jitter-free video sequence, $\{h_k\}$ is a set of filters that independently models translation for each frame, $\ast$ represents convolution, and $\{\mathcal{R}_{\theta_k}\}$ is a set of independent rotations applied to each frame with angle $\theta_k$. It is interesting to note that $\mathcal{T} = \tau^{-1}$; that is, the transformation used in (5) can be understood as the inverse of the transformation used in RASL or t-GRASTA (4).
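The translational part of this model, a per-frame filter $h_k$ acting by convolution, can be illustrated for integer shifts: convolving with a delta filter placed at offset $(dy, dx)$ reproduces a (circular) translation. The snippet below is a toy demonstration with periodic boundaries; the helper name `translate` is ours, not from [23].

```python
import numpy as np

def translate(frame, dy, dx):
    # Integer translation realized as circular convolution with a shifted
    # delta filter h (a toy stand-in for the per-frame translation filters h_k).
    h = np.zeros_like(frame, dtype=float)
    h[dy % frame.shape[0], dx % frame.shape[1]] = 1.0
    # convolution theorem: conv(frame, h) = ifft2(fft2(frame) * fft2(h))
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * np.fft.fft2(h)))
```

For integer shifts this agrees exactly with `np.roll`; subpixel shifts and rotations would require interpolating filters, which is where the full model (6) comes in.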

In [14, 15], a computationally efficient and fully incremental algorithm, based on rank-1 updates for thin SVD [34–37] (see also Section 2.3), was proposed to solve (3); in [23], it was shown that, since (5) is based on (3), such an incremental solution can also be used: letting $d_k$ represent each frame of the observed video *D*, and using similar relationships for $l_k$ and $s_k$ w.r.t. $L$ and $S$, respectively, then indeed the solution of

$$\min_{l_k, s_k} \; \frac{1}{2}\|\mathcal{T}_k(l_k) + s_k - d_k\|_{2}^{2} + \lambda \|s_k\|_{1} \tag{7}$$

can be efficiently computed in an incremental fashion (see [23], Section 3.3 for details).

##### 2.3. Incremental and Rank-1 Modifications for Thin SVD

Given a matrix $A \in \mathbb{R}^{m \times l}$ with thin SVD $A = U \Sigma V^{T}$, where $\operatorname{rank}(A) = r$, and column vectors $a$ and $b$ (with *m* and *l* elements, respectively), note that

$$A + ab^{T} = \begin{bmatrix} U & p \end{bmatrix} K \begin{bmatrix} V & q \end{bmatrix}^{T}, \quad K = \begin{bmatrix} \Sigma & \mathbf{0} \\ \mathbf{0}^{T} & 0 \end{bmatrix} + \begin{bmatrix} x \\ \rho_a \end{bmatrix} \begin{bmatrix} y \\ \rho_b \end{bmatrix}^{T}, \tag{8}$$

where $\mathbf{0}$ is a zero column vector of the appropriate size. Based on [35, 36], as well as on [37], we will briefly describe an incremental (thin) SVD and rank-1 modifications (update, downdate, and replace) for thin SVD.

The generic operation consisting of the Gram–Schmidt orthonormalization of $a$ and $b$ w.r.t. $U$ and $V$, i.e., $x = U^{T}a$, $e = a - Ux$, $\rho_a = \|e\|_2$, and $p = e/\rho_a$, and $y = V^{T}b$, $f = b - Vy$, $\rho_b = \|f\|_2$, and $q = f/\rho_b$, is used as a first step for all the cases described below.
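This Gram–Schmidt step can be written in a few lines of NumPy; the sketch below (with the hypothetical helper name `orth_step`) returns the coefficients $x$, the residual norm $\rho$, and the normalized residual $p$ for a vector $a$ against an orthonormal basis $U$, exactly as used in the cases that follow.

```python
import numpy as np

def orth_step(U, a):
    # Gram-Schmidt step: coefficients of a in span(U), plus the normalized
    # residual p orthogonal to span(U); notation (x, rho, p) as in the text.
    x = U.T @ a          # projection coefficients
    e = a - U @ x        # residual orthogonal to span(U)
    rho = np.linalg.norm(e)
    p = e / rho if rho > 1e-12 else np.zeros_like(e)
    return x, rho, p
```

The identity $a = Ux + \rho p$ with $U^{T}p = 0$ is what makes the rank-1 factorization (8) exact.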

###### 2.3.1. Incremental or Update Thin SVD

Given $A_1 = [A \;\; d]$, we want to compute the thin SVD $A_1 = U_1 \Sigma_1 V_1^{T}$, with (i) $r$ or (ii) $r+1$ singular values. In this case, we note that $[A \;\; d] = [A \;\; \mathbf{0}] + d\,e_{l+1}^{T}$ and that $a = d$ and $b = e_{l+1}$, where $e_{l+1}$ is a unit vector (with $l+1$ elements in this case); then, (8) is equivalent to (9) and (10), where $x$, $\rho_a$, and $p$ are obtained from the Gram–Schmidt step applied to $a = d$:

$$[A \;\; d] = \begin{bmatrix} U & p \end{bmatrix} K \begin{bmatrix} V & \mathbf{0} \\ \mathbf{0}^{T} & 1 \end{bmatrix}^{T}, \tag{9}$$

$$K = \begin{bmatrix} \Sigma & x \\ \mathbf{0}^{T} & \rho_a \end{bmatrix}. \tag{10}$$

Let $K = G \Sigma' H^{T}$ be the SVD of the small matrix $K$. Using (11), we get $A_1 = U_1 \Sigma_1 V_1^{T}$ with (i) $r$ singular values; similarly, using (12), we get $A_1 = U_1 \Sigma_1 V_1^{T}$ with (ii) $r+1$ singular values (Matlab notation is used to indicate array slicing operations):

$$U_1 = ([U \;\; p]\,G)(:,1{:}r), \quad \Sigma_1 = \Sigma'(1{:}r,1{:}r), \quad V_1 = \left(\begin{bmatrix} V & \mathbf{0} \\ \mathbf{0}^{T} & 1 \end{bmatrix} H\right)(:,1{:}r), \tag{11}$$

$$U_1 = [U \;\; p]\,G, \quad \Sigma_1 = \Sigma', \quad V_1 = \begin{bmatrix} V & \mathbf{0} \\ \mathbf{0}^{T} & 1 \end{bmatrix} H. \tag{12}$$
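The column-append update can be verified numerically. The NumPy sketch below (function name `svd_append_column` is ours) follows the scheme above: Gram–Schmidt against $U$, an SVD of the small $(r{+}1)\times(r{+}1)$ core matrix, and rotation of the extended singular vector bases; pass `keep_rank=r` to truncate as in case (i).

```python
import numpy as np

def svd_append_column(U, s, Vt, d, keep_rank=None):
    """Thin-SVD update for A1 = [A, d], given the thin SVD A = U diag(s) Vt."""
    m, r = U.shape
    # Gram-Schmidt of d against U
    x = U.T @ d
    e = d - U @ x
    rho = np.linalg.norm(e)
    p = e / rho if rho > 1e-12 else np.zeros_like(e)
    # small core matrix K = [[diag(s), x], [0, rho]]
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = x
    K[r, r] = rho
    G, s1, Ht = np.linalg.svd(K)
    # rotate the extended left/right bases by G and H
    U1 = np.hstack([U, p[:, None]]) @ G
    l = Vt.shape[1]
    Vext = np.zeros((l + 1, r + 1))
    Vext[:l, :r] = Vt.T
    Vext[l, r] = 1.0
    V1 = Vext @ Ht.T
    if keep_rank is not None:  # case (i): keep only r singular values
        U1, s1, V1 = U1[:, :keep_rank], s1[:keep_rank], V1[:, :keep_rank]
    return U1, s1, V1.T
```

The dominant costs are the two basis rotations, $O((m+l)r^{2})$, plus the $O(r^{3})$ SVD of the small core matrix; no SVD of the full matrix is ever recomputed.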

###### 2.3.2. Downdate Thin SVD

Given $A = [A_1 \;\; d]$, with thin SVD $A = U \Sigma V^{T}$, we want to compute the thin SVD $A_1 = U_1 \Sigma_1 V_1^{T}$ with *r* singular values. Noting that $[A_1 \;\; \mathbf{0}] = A - d\,e_{l}^{T}$, with $d = A e_{l}$, so that $a = -d \in \operatorname{span}(U)$ (hence $x = -\Sigma y$ and $\rho_a = 0$) and $b = e_{l}$ (with $y = V^{T} e_{l}$), then the rank-1 modification (8) is equivalent to

$$[A_1 \;\; \mathbf{0}] = \begin{bmatrix} U & p \end{bmatrix} K \begin{bmatrix} V & q \end{bmatrix}^{T}, \quad K = \begin{bmatrix} \Sigma & \mathbf{0} \\ \mathbf{0}^{T} & 0 \end{bmatrix} + \begin{bmatrix} -\Sigma y \\ 0 \end{bmatrix} \begin{bmatrix} y \\ \rho_b \end{bmatrix}^{T}, \tag{13}$$

from which, with $K = G \Sigma' H^{T}$, we can compute the thin SVD of $A_1$ via the following equation:

$$U_1 = ([U \;\; p]\,G)(:,1{:}r), \quad \Sigma_1 = \Sigma'(1{:}r,1{:}r), \quad V_1 = ([V \;\; q]\,H)(1{:}l{-}1, 1{:}r). \tag{14}$$
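A minimal NumPy sketch of this column-removal downdate is given below, assuming the removed column is the last one; the function name `svd_remove_last_column` is ours. Since the removed column $d = Ae_l$ already lies in $\operatorname{span}(U)$, the left residual vanishes and only the right basis needs a Gram–Schmidt extension.

```python
import numpy as np

def svd_remove_last_column(U, s, Vt):
    """Thin-SVD downdate: from A = U diag(s) Vt with A = [A1, d], recover A1."""
    r, l = Vt.shape
    y = Vt[:, -1]                       # y = V^T e_l (last row of V)
    rho_b = np.sqrt(max(1.0 - y @ y, 0.0))
    # core matrix: [[diag(s), 0], [0, 0]] + [-diag(s) y; 0] [y; rho_b]^T
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(s) - np.outer(s * y, y)
    K[:r, r] = -rho_b * (s * y)
    G, s1, Ht = np.linalg.svd(K)
    U1 = U @ G[:r, :r]                  # [U 0] G, truncated to r singular values
    f = -(Vt.T @ y)
    f[-1] += 1.0                        # f = e_l - V y
    q = f / rho_b if rho_b > 1e-12 else np.zeros(l)
    V1 = (np.hstack([Vt.T, q[:, None]]) @ Ht.T)[:l - 1, :r]
    return U1, s1[:r], V1.T
```

As in the update case, the cost is dominated by the basis rotations rather than by any full-matrix SVD.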

###### 2.3.3. Replace Thin SVD

Given $A = [A_1 \;\; d]$, with thin SVD $A = U \Sigma V^{T}$, we want to compute the thin SVD $[A_1 \;\; \hat{d}\,] = U_1 \Sigma_1 V_1^{T}$ with *r* singular values. This case can be understood as a mixture of the previous cases and can be easily derived by noticing that $[A_1 \;\; \hat{d}\,] = A + (\hat{d} - d)\,e_{l}^{T}$.

Finally, we point out that the computational complexity of any of the above procedures ([35], Section 3 and [37], Section 4) is upper bounded by $O((m+l)r^{2} + r^{3})$. If $r \ll \min(m,l)$ holds, then the complexity is dominated by $O((m+l)r^{2})$.

#### 3. Methods

##### 3.1. Proposed incPCP-PTI Method

The proposed algorithm (named incPCP-PTI) is a modification of the previously proposed incPCP-TI [23] so that it is able to handle panning and camera motion. It was briefly presented in [28] and is more thoroughly explained and evaluated in this work. The method continuously estimates the alignment transformation $\mathcal{T}_k$ such that $\mathcal{T}_k(l_{k-1}) \approx d_k$, i.e., the transformation that aligns the previous low-rank representation $l_{k-1}$ with the observed current frame $d_k$. Thus, incPCP-PTI effectively uses $l_{k-1}$ as a local estimation of a composite panoramic background image. After applying such a transformation to $l_{k-1}$, the PCP problem can be solved in the reference frame of $d_k$. After this initial alignment, it is considered that only minor jitter remains in the image, and so a procedure similar to incPCP-TI is utilized by estimating a transformation $\hat{\mathcal{T}}_k$ for the *k*-th frame. However, instead of solving the Affinely Constrained Matrix Rank Minimization [38] as in the original incPCP-TI [14], the low-rank approximation problem is solved in the reference frame of $d_k$ by applying $\hat{\mathcal{T}}_k$ to the residual $d_k - s_k$. The whole procedure is presented in Algorithm 1. This algorithm makes use of the incSVD, repSVD, and downSVD operators, which correspond to the thin SVD update, replacement, and downdate operators, respectively (Section 2.3).
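The align-then-separate idea can be illustrated with a deliberately simplified toy loop. The sketch below is **not** Algorithm 1: it restricts alignment to integer translations estimated by phase correlation (with periodic boundaries), and replaces the incremental-SVD background model with a running average; the names `estimate_shift` and `panning_background_sketch` are ours. It only shows the control flow: re-align the previous background to each incoming frame, extract the sparse (foreground) part, and then update the background in the current reference frame.

```python
import numpy as np

def estimate_shift(ref, frame):
    # Phase correlation: a simple stand-in for the continuous alignment step.
    # Returns (dy, dx) such that frame ~ np.roll(ref, (dy, dx), axis=(0, 1)).
    R = np.fft.fft2(frame) * np.conj(np.fft.fft2(ref))
    corr = np.real(np.fft.ifft2(R / (np.abs(R) + 1e-12)))
    return np.unravel_index(np.argmax(corr), corr.shape)

def panning_background_sketch(frames, alpha=0.2, thresh=0.25):
    # Toy loop in the spirit of incPCP-PTI (illustration only, not [28]).
    bg = frames[0].astype(float)
    masks = [np.zeros(bg.shape, dtype=bool)]
    for f in frames[1:]:
        dy, dx = estimate_shift(bg, f)
        bg = np.roll(bg, (dy, dx), axis=(0, 1))  # align previous background
        masks.append(np.abs(f - bg) > thresh)    # sparse (foreground) part
        bg = (1 - alpha) * bg + alpha * f        # update background estimate
    return bg, masks
```

In the actual algorithm, the alignment handles richer transformations, and the background update is the incremental rank-1 thin-SVD machinery of Section 2.3 rather than a running mean.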