Abstract
As an alternative to the tracking-based approaches that heavily
depend on accurate detection of moving objects, which often fail
for crowded scenarios, we present a pixelwise method that employs
dual foregrounds to extract temporally static image regions.
Depending on the application, these regions indicate objects that
do not constitute the original background but were brought into
the scene at a subsequent time, such as abandoned and removed
items and illegally parked vehicles. We construct separate long- and
short-term backgrounds that are implemented as pixelwise
multivariate Gaussian models. Background parameters are adapted
online using a Bayesian update mechanism imposed at different
learning rates. By comparing each frame with these models, we
estimate two foregrounds. We infer an evidence score at each pixel
by applying a set of hypotheses on the foreground responses, and
then aggregate the evidence in time to provide temporal
consistency. Unlike optical flow-based approaches that smear
boundaries, our method can accurately segment out objects even if
they are fully occluded. It does not require on-site training to
compensate for particular imaging conditions. While having a
low computational load, it readily lends itself to parallelization
if further speed improvement is necessary.
1. Introduction
Conventional approaches to abandoned item
detection can be grouped as motion detectors [1–3], object classifiers [4], and
tracking-based analytics approaches [5–10].
In [2], a dense optical flow map is estimated to infer, by predetermined rules,
which foreground objects are moving in opposite directions, moving in a group,
or staying stationary. In [3], a pixel-based method is described that
characterizes objects introduced into a static scene by comparing the
background image estimated from the current frame with the previous ones. This
approach requires storing as many backgrounds as the minimum detection duration
in memory and causes ghost detections even after the abandoned item is removed
from the scene.
Recently, an online classifier [4] that incorporates boosting-based feature
selection to label image blocks as background, valid objects, and unidentified
regions was presented. This method adapts itself to the depicted scene;
however, it falls short of discriminating moving objects from stationary ones.
Classifier-based methods face the challenge of dealing with unknown object
types, as such objects can vary from small luggage to ski bags.
A considerable amount of effort has been devoted to
hypothesizing abandoned items by analyzing object trajectories [5–7, 9, 10] in multicamera setups. In principle, these methods
require solving a harder problem of object initialization and tracking as an
intermediate step in order to identify the parts of the video frames
corresponding to an abandoned object. It is often assumed that the background
scene is nearly static or periodically varying, while the foreground comprises
groups of pixels that are different from the background. However, object
detection in crowded scenes, especially for uncontrolled real-life situations,
is problematic due to the partial occlusions, heavy shadows, people entering
the scene together, and so forth. Moreover, object appearance is often indiscriminative as people tend to dress in
similar colors, which leads to inaccurate tracking results.
For static camera setups, background subtraction
provides strong cues for apparent motion statistics. Various background
generation methods have been employed in a quest for a system that is robust to
changing illumination conditions, appearance variations, shadows, camera
jitter, and severe noise. Parametric mixture models are employed to handle such
variations. Stauffer and Grimson [11] propose an expectation maximization- (EM-) based
adaptation method to learn a mixture of Gaussians with a predetermined number of
models at each pixel using fixed learning parameters. The online EM update
causes a weak model, which has a larger variance, to be dissolved into a
dominant model, which has a smaller variance, in case the mean value of the weak
model is close to the mean of the dominant one. To address this issue, Porikli
and Tuzel [12] develop an online Bayesian update mechanism for
adapting multivariate Gaussian distributions. This method estimates the
number of necessary layers for each pixel and the posterior distributions of
mean and covariance of each layer by assuming the data to be normally
distributed with mean and covariance as random variables.
There are other variants of the mixture of models that
use modified feature spaces, image gradients, optical flow, and region
segmentation [13–15]. Instead of iteratively updating models as mixture
methods do, nonparametric kernel density estimation [16] stores a large number of previous frames and
estimates weights of multiple kernel functions. Since both memory and
computational complexity increase proportionally with the number of stored
frames, kernel methods are usually impractical for real-time applications.
There exists a class of problems that cannot be solved
by the traditional foreground-background detection methods. For instance,
objects deliberately abandoned in public places, such as suitcases and packages,
do not fall into either of these two categories. They are static; therefore, they
should be labeled as background. On the other hand, they should not be ignored
as they do not belong to the original scene background. Depending on the
learning rate, the pixels corresponding to the temporarily static objects can be
mistaken as a part of the scene background (in case of a high learning rate),
or grouped with the moving regions (low learning rate). A single background is
not sufficient to separate the temporarily static pixels from the scene
background.
In this paper, we propose a pixel-based method that
employs dual foregrounds. Our motivation is that by changing the background
learning rate, we can adjust how soon a static object should be blended into
the background. Therefore, temporarily static image regions can be
distinguished from the longer term background and moving regions by analyzing
multiple foregrounds of different learning rates. This simple idea is wrapped
into our adaptive background estimation algorithm, where the slowly adapting
background and the fast adapting foreground are aggregated into an evidence
image. We impose different learning rates by processing video at different
temporal resolutions. The background models have identical initial parameters,
thus they require minimal fine tuning in the setup stage. The evidence
statistics are used to extract temporarily static image areas, which may
correspond to abandoned items, illegally parked vehicles, objects removed from
the scene, and so forth, depending on the application.
Our method does not require object initialization,
tracking, or offline training. It accurately segments objects even if they are
fully occluded. It has a very low-computational load and readily lends itself
to parallelization if further speed improvements are necessary. In the
subsequent sections, we give details of the dual foregrounds, describe the
Bayesian adaptation method, and present results on real-world data.
2. Dual Foregrounds
To detect an abandoned item (or an illegally parked
vehicle, removed article, etc.), we need to know how it alters the temporal and
spatial statistics of the video data. We build our method on the fact that an
abandoned item is not a part of the original scene, was brought into the
scene not long ago, and has remained still since it was left. In other
words, it is a temporarily static object that was not there before. This means
that by learning the prolonged static scene and the moving foreground regions,
we can hypothesize on whether a pixel corresponds to an abandoned item or not.
A scene background can be determined by maintaining a
statistical model that captures the most consistent modes of the color
distribution of each pixel in extended durations of time. From this background,
the changed pixels that do not fit into the statistical models are obtained.
However, depending on the learning rate, the pixels corresponding to the
temporarily static objects can be mistaken as a part of the scene background
(higher learning rates), or grouped with the moving regions (lower learning
rates). A single background is not sufficient to separate the temporarily
static pixels from the scene background.
As opposed to single background approaches, we use two
backgrounds to obtain both the prolonged (long-term) background $B_L$
and the temporarily static (short-term) background $B_S$. Note that it is
possible to improve the temporal
granularity by employing more than two backgrounds at different learning rates.
Each of these backgrounds is defined as a mixture of Gaussian models. We
represent a pixel as layers of 3D multivariate Gaussians where each dimension
corresponds to a color channel. Each layer models a different appearance of
the pixel. We perform our operations in the RGB color space. We apply a
Bayesian update mechanism. At each update, at most one layer is updated with
the current observation. This ensures minimum overlap among the layers. We
also determine how many layers are necessary for each pixel and use only those
layers during the foreground segmentation phase. This is performed with an
embedded confidence score. Both backgrounds have identical initial
parameters, such as the initial mean and variance of the marginal posterior
distribution, the degrees of freedom, and the scale matrix, except the number
of prior measurements, which is used as a learning parameter.
At every frame, we estimate the long- and short-term
foregrounds by comparing the current frame $I$ with the long- and short-term
background models $B_L$ and $B_S$. We obtain two binary foreground masks $F_L$
and $F_S$, where $F(x) = 1$ indicates that the pixel $x$ is changed. The
long-term foreground mask $F_L$ shows the color
variations in the scene that were not there before, including moving objects,
temporarily static objects, as well as moving cast shadows and illumination
changes that the background models fail to adapt to. The short-term foreground
mask $F_S$ contains the moving objects, noise, and so forth. Depending on the
foreground mask values, we
postulate the following hypotheses as shown in Figure 1.
Figure 1: Hypotheses on long- and short-term
foregrounds.
(1) $F_L(x) = 1$ and $F_S(x) = 1$, where $x$ is a pixel that
may correspond to a moving object since $I(x)$ does not fit
any backgrounds.
(2) $F_L(x) = 1$ and $F_S(x) = 0$, where $x$ is a pixel that
may correspond to a temporarily static object.
(3) $F_L(x) = 0$ and $F_S(x) = 1$, where $x$ is a scene
background pixel that was occluded before.
(4) $F_L(x) = 0$ and $F_S(x) = 0$, where $x$ is a scene
background pixel since its value $I(x)$ fits both
backgrounds $B_L$ and $B_S$.
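The four hypotheses amount to a lookup on the pair of mask values at each pixel. The following is a minimal Python sketch of that lookup, assuming the long- and short-term masks are available as equally sized nested lists of 0/1 values. The label strings and the function name `label_pixels` are illustrative choices, not part of the described method.

```python
# Labels for the four hypotheses; the names are illustrative only.
MOVING, STATIC, UNCOVERED, BACKGROUND = "moving", "static", "uncovered", "background"

# Lookup keyed by (long-term mask value, short-term mask value).
HYPOTHESES = {
    (1, 1): MOVING,      # (1) fits neither background
    (1, 0): STATIC,      # (2) blended into the short-term background only
    (0, 1): UNCOVERED,   # (3) scene background that was occluded before
    (0, 0): BACKGROUND,  # (4) fits both backgrounds
}


def label_pixels(long_mask, short_mask):
    """Return a label image from two binary masks of identical shape."""
    return [
        [HYPOTHESES[(fl, fs)] for fl, fs in zip(row_l, row_s)]
        for row_l, row_s in zip(long_mask, short_mask)
    ]
```

Only the pixels labeled as temporarily static contribute positive evidence toward an abandoned item; the other three labels count against it.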
The short-term background is updated at a
higher learning rate than the long-term background. Thus, the short-term
background adapts to the underlying distribution faster and the changes in the
scene are blended more rapidly. In contrast, the long-term background is more
resistant against the changes.
In case a scene background pixel changes temporarily
and then returns to its original value, the long-term foreground mask will be
zero; $F_L(x) = 0$. The short-term background is pliant and adapts
itself during this time, which causes the short-term foreground mask to be
one; $F_S(x) = 1$. We assume it takes more time to adapt the long-term
background to the newly observed color than the change period. A changed pixel
will be blended into the short-term background, that is,
$F_S(x) = 0$, if it keeps its new color long enough. If this
duration is not long enough to blend it into the long-term background as well,
the long-term foreground mask will be one; $F_L(x) = 1$. This is the common
case for the abandoned items. If
no change is observed in either of the backgrounds, $F_L(x) = 0$
and $F_S(x) = 0$, the pixel is considered as a part of the static
scene background as the pixel has the same value for much longer periods of
time.
The dual
foreground mechanism is illustrated in Figure 2. In this simplified drawing, the horizontal axis
corresponds to time and the vertical axis to the confidence of the background
model. Action indicates that the
pixel color has significantly changed. Label represents the result of the above
hypotheses. For pixels with relatively short duration of change, the
confidences of the long- or short-term models do not increase enough to make
them valid backgrounds. Thus, such pixels are labeled as moving object.
Whenever the short-term model blends the pixel in the background but the
long-term model still marks it as foreground, the pixel is considered to belong
to the abandoned item. Finally, if the pixel change takes even longer, the pixel
is labeled as a scene background. Sample foregrounds that show these cases are
given in Figure 3.
Figure 2: The confidence of the long-term and
short-term background models (vertical axis) changes differently for ordinary
objects (moving or temporarily stationary ones), abandoned items, and scene
background.
Figure 3: The long-term foreground $F_L$ captures moving
objects and temporarily static regions. The short-term foreground $F_S$
captures only moving objects. The evidence $E$ gets greater as
the object stays longer.
We aggregate the framewise detection results into an
evidence image $E(x)$ by updating the
pixelwise values at each frame as
$$E(x) = \begin{cases} E(x) + 1, & F_L(x) = 1 \wedge F_S(x) = 0, \\ E(x) - k, & F_L(x) \neq 1 \vee F_S(x) \neq 0, \\ \max\nolimits_e, & E(x) > \max\nolimits_e, \\ 0, & E(x) < 0, \end{cases} \tag{1}$$
where $F_L$ and $F_S$ are the long- and short-term foreground masks, and
$\max_e$ and $k$ are positive
numbers. The evidence image enables removing noise in the detection process. It
also controls the minimum time required to assign a static pixel as an
abandoned item. For each pixel, the evidence image collects the motion
statistics. Whenever it rises to the preset level $\max_e$, we mark the pixel
as an abandoned item pixel and
raise an alarm flag. The evidence threshold $\max_e$ is defined in
terms of the number of frames and it can be chosen depending on the desired
responsiveness and noise characteristics of the system. In case the foreground
detection process produces noisy results, higher values of $\max_e$ should be
preferred. High values of $\max_e$ lower the false
alarm rate. On the other hand, the higher the preset level gets, the longer the
minimum duration a pixel takes to be classified as a part of an abandoned item.
A typical range of the evidence threshold
is
frames.
The decay constant $k$ determines how
fast the evidence should decrease. In other words, it decides what should
happen in case a pixel that is marked as an abandoned item is blended into the
scene background or regains the value it had before the marking. To set the
alarm flag off immediately after the removal of the object, the value of the
decay should be large, for example, equal to the evidence threshold. This means
that there is only a single parameter to
set for the evidence image. In our experiments, we observed that larger
values of the decay constant generate satisfying results.
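The evidence aggregation can be sketched per pixel as follows. This is a minimal Python illustration, assuming that the evidence grows by one per frame while a pixel satisfies the temporarily static hypothesis (long-term mask 1, short-term mask 0), drops by the decay constant otherwise, and is clamped between zero and the evidence threshold. The names `update_evidence`, `max_e`, and `k` are chosen here for illustration.

```python
def update_evidence(evidence, long_mask, short_mask, max_e, k):
    """Update the evidence image in place; return (row, col) of alarm pixels."""
    alarms = []
    for r, (row_l, row_s) in enumerate(zip(long_mask, short_mask)):
        for c, (fl, fs) in enumerate(zip(row_l, row_s)):
            if fl == 1 and fs == 0:
                e = evidence[r][c] + 1   # temporarily static: accumulate
            else:
                e = evidence[r][c] - k   # any other hypothesis: decay
            e = min(max(e, 0), max_e)    # clamp to [0, max_e]
            evidence[r][c] = e
            if e >= max_e:
                alarms.append((r, c))    # evidence reached the threshold
    return alarms
```

With `k` equal to `max_e`, a single frame in which the pixel no longer looks temporarily static resets its evidence to zero, which is why the alarm can clear as soon as the item is removed.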
In the following section, we describe the adaptation
of the long- and short-term background models by a Bayesian update mechanism.
3. Bayesian Update
Our background model [12] is similar to adaptive mixture models [11], but instead of a mixture of Gaussian distributions, we
define each pixel as layers of 3D multivariate Gaussians. Each layer
corresponds to a different appearance of the pixel. Using a Bayesian approach,
we estimate not the mean and variance of the layer but the probability
distributions of the mean and variance. We can extract statistical information
regarding these parameters from the distribution functions. For now, we are
using expectations of mean and variance for change detection, and variance of
the mean for confidence.
3.1. Layer Model
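Data is assumed to be normally distributed with mean $\mu$
and covariance $\Sigma$. The mean and covariance are assumed unknown and
modeled as random variables. Using Bayes' theorem, the joint posterior density
can be written as
$$p(\mu, \Sigma \mid X) \propto p(X \mid \mu, \Sigma)\, p(\mu, \Sigma). \tag{2}$$
To perform recursive Bayesian estimation with the new observations, the joint
prior density $p(\mu, \Sigma)$ should have the
same form as the joint posterior density $p(\mu, \Sigma \mid X)$. Conditioning
on the covariance, the joint prior density is written as
$$p(\mu, \Sigma) = p(\mu \mid \Sigma)\, p(\Sigma). \tag{3}$$
The above condition is realized if we assume an inverse Wishart distribution
for the covariance and, conditioned on the covariance, a multivariate normal
distribution for the mean. The inverse Wishart distribution is a multivariate
generalization of the scaled inverse-$\chi^2$ distribution.
The parametrization is
$$\Sigma \sim \text{Inv-Wishart}_{\upsilon_{t-1}}\!\left(\Lambda_{t-1}^{-1}\right), \qquad \mu \mid \Sigma \sim N\!\left(\theta_{t-1}, \Sigma/\kappa_{t-1}\right), \tag{4}$$
where $\upsilon_{t-1}$ and $\Lambda_{t-1}$ are the degrees
of freedom and scale matrix for the inverse Wishart distribution,
$\theta_{t-1}$ is the prior mean, and $\kappa_{t-1}$ is the number
of prior measurements. With these assumptions, the joint prior density becomes
$$p(\mu, \Sigma) \propto |\Sigma|^{-((\upsilon_{t-1}+3)/2+1)} \exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(\Lambda_{t-1}\Sigma^{-1}\right) - \tfrac{\kappa_{t-1}}{2}(\mu-\theta_{t-1})^{T}\Sigma^{-1}(\mu-\theta_{t-1})\right) \tag{5}$$
for three-dimensional feature space. Let this density be labeled as normal
inverse Wishart$(\theta_{t-1}, \Lambda_{t-1}/\kappa_{t-1}; \upsilon_{t-1}, \Lambda_{t-1})$.
Multiplying the prior density with the normal likelihood
and arranging the terms, the joint posterior density becomes normal inverse
Wishart$(\theta_t, \Lambda_t/\kappa_t; \upsilon_t, \Lambda_t)$ with the
parameters updated:
$$\begin{aligned} \upsilon_t &= \upsilon_{t-1} + n, \qquad \kappa_t = \kappa_{t-1} + n, \\ \theta_t &= \theta_{t-1}\frac{\kappa_{t-1}}{\kappa_{t-1}+n} + \bar{x}\,\frac{n}{\kappa_{t-1}+n}, \\ \Lambda_t &= \Lambda_{t-1} + \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{T} + n\,\frac{\kappa_{t-1}}{\kappa_t}(\bar{x}-\theta_{t-1})(\bar{x}-\theta_{t-1})^{T}, \end{aligned} \tag{6}$$
where $\bar{x}$ is the mean of
new samples and $n$ is the number
of samples used to update the model. If the update is performed at each time
frame, $n$ becomes one. To
speed up the system, the update can be performed at regular time intervals by
storing the observed samples. During our tests, we update one quarter of the
background at each time frame, therefore $n$ becomes four.
The new parameters combine the prior information with the observed samples.
The posterior mean $\theta_t$ is a weighted
average of the prior mean and the sample mean. The posterior degrees of freedom
is equal to the prior degrees of freedom plus the sample size. The system is
started with the following initial parameters: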
(7) where
is the
three-dimensional identity matrix.
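The parameter update in (6) can be sketched for a single layer as follows. This is a minimal Python illustration of the standard normal inverse Wishart conjugate update, assuming parameters named `theta` (prior mean), `kappa` (number of prior measurements), `upsilon` (degrees of freedom), and `lam` (scale matrix). These names and the helper `niw_update` are illustrative, and the values used with it are not the initial parameters of the system.

```python
def outer(a, b):
    """Outer product of two vectors given as lists."""
    return [[ai * bj for bj in b] for ai in a]


def niw_update(theta, kappa, upsilon, lam, samples):
    """Return posterior (theta, kappa, upsilon, lam) after observing samples."""
    n = len(samples)
    dim = len(theta)
    xbar = [sum(x[d] for x in samples) / n for d in range(dim)]

    kappa_new = kappa + n        # more measurements seen
    upsilon_new = upsilon + n    # degrees of freedom grow with sample size
    # Posterior mean: weighted average of prior mean and sample mean.
    theta_new = [(kappa * theta[d] + n * xbar[d]) / kappa_new for d in range(dim)]

    # Scale matrix: prior scale + sample scatter + prior-to-sample-mean shift.
    lam_new = [row[:] for row in lam]
    for x in samples:
        dev = [x[d] - xbar[d] for d in range(dim)]
        scatter = outer(dev, dev)
        for i in range(dim):
            for j in range(dim):
                lam_new[i][j] += scatter[i][j]
    shift = [xbar[d] - theta[d] for d in range(dim)]
    weight = kappa * n / kappa_new
    shift_term = outer(shift, shift)
    for i in range(dim):
        for j in range(dim):
            lam_new[i][j] += weight * shift_term[i][j]

    return theta_new, kappa_new, upsilon_new, lam_new
```

A smaller `kappa` gives the new samples more weight in the posterior mean, which is how the number of prior measurements acts as a learning rate.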
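Integrating the joint posterior density with respect to
$\Sigma$, we get the marginal posterior density for the
mean
$$p(\mu \mid X) \propto t_{\upsilon_t-2}\!\left(\mu \mid \theta_t, \Lambda_t/(\kappa_t(\upsilon_t-2))\right), \tag{8}$$
where $t_{\upsilon_t-2}$ is a
multivariate $t$-distribution
with $\upsilon_t-2$ degrees of
freedom, and $\theta_t$, $\Lambda_t$, $\kappa_t$, and $\upsilon_t$ are the
updated mean, scale matrix, number of prior measurements, and degrees of
freedom.
We use the expectations of the marginal posterior
distributions for mean and covariance as our model parameters at time
$t$. The expectation for the marginal posterior mean (expectation
of the multivariate $t$-distribution)
becomes
$$\mu_t = E(\mu \mid X) = \theta_t, \tag{9}$$
whereas the
expectation of the marginal posterior covariance (expectation of the inverse
Wishart distribution) becomes
$$\Sigma_t = E(\Sigma \mid X) = (\upsilon_t-4)^{-1}\Lambda_t. \tag{10}$$
Our confidence measure for the layer is equal to one
over the determinant of the covariance of $\mu \mid X$:
$$C = \frac{1}{|\Sigma_{\mu \mid X}|}. \tag{11}$$
If our marginal posterior mean has larger variance,
our model becomes less confident. Note that the variance of a multivariate
$t$-distribution
with scale matrix $\Sigma$ and degrees of
freedom $\upsilon$
is equal to $\frac{\upsilon}{\upsilon-2}\Sigma$
for $\upsilon > 2$.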
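The model parameters and the confidence of a layer can be computed from the posterior parameters as sketched below. This minimal Python example assumes a three-dimensional feature space, the inverse Wishart expectation (scale matrix divided by the degrees of freedom minus four), and a covariance of the mean equal to the scale matrix divided by `kappa * (upsilon - 4)`. The last expression is derived here from the stated variance of the t-distribution and should be treated as an assumption, as should the names `det3` and `layer_statistics`.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def layer_statistics(theta, kappa, upsilon, lam):
    """Return (expected mean, expected covariance, confidence) of a layer."""
    if upsilon <= 4:
        raise ValueError("expected covariance requires upsilon > 4")
    mean = list(theta)
    # Expectation of the inverse Wishart distribution in three dimensions.
    cov = [[v / (upsilon - 4) for v in row] for row in lam]
    # Assumed covariance of the mean: expected covariance divided by kappa.
    mean_cov = [[v / kappa for v in row] for row in cov]
    confidence = 1.0 / det3(mean_cov)
    return mean, cov, confidence
```

The confidence grows as more samples are absorbed, because both `kappa` and `upsilon` increase with every matched observation.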
The system can be sped up further by making an
independence assumption on the color channels. The update of the full
covariance matrix requires computation of nine parameters. Moreover, during
distance computation,
we need to invert the full covariance matrix. To speed up the system, we use
three univariate Gaussians corresponding to each color channel. After updating
each color channel independently, we join the variances and create a diagonal
covariance matrix
$$\Sigma_t = \operatorname{diag}\!\left(\sigma_{t,r}^2, \sigma_{t,g}^2, \sigma_{t,b}^2\right). \tag{12}$$
In this case,
for each univariate Gaussian, we assume a scaled inverse-$\chi^2$ distribution
for the variance and, conditioned on the variance, a univariate normal
distribution for the mean.
3.2. Background Update
We initialize our system with multiple layers for
each pixel. Usually, we select three to five layers. In more dynamic scenes,
more layers are required. As we observe new samples for each pixel, we update
the parameters for our background model. We start our update mechanism from the
most confident layer in our model. If the observed sample is inside the
confidence
interval of the current model, the parameters of the model are updated as
explained in (6). Lower confidence models are not updated.
For background modeling, it is useful to have a
forgetting mechanism so that the earlier observations have less effect on the
model. Forgetting is performed by reducing the number of prior observations
parameter of an unmatched model. If the current sample is not inside the
confidence interval, we update the number of prior measurements parameter,
$$\kappa_t = \kappa_{t-1} - n, \tag{13}$$
where $n$ is the number of samples used in the update, and proceed
with the update of the next confident layer. We do not let $\kappa_t$
become less
than its initial value. If none of the models is updated, we delete the
least confident layer and initialize a new model having the current sample as
the mean and an initial variance (7). The update algorithm for a single pixel
can be summarized as shown in Algorithm 1.
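The per-pixel update loop described above can be sketched as follows. This is a minimal Python illustration under several assumptions that are not stated in the text: color channels are treated independently with a diagonal scale (`lam` holds one entry per channel), one sample is processed per update, forgetting lowers `kappa` by one, and a sample matches a layer when its squared Mahalanobis distance is at most a caller-supplied `max_dist2`. All names and the `init` values used with it are hypothetical.

```python
def confidence(layer):
    """One over the determinant of the (diagonal) covariance of the mean."""
    c = 1.0
    for lam_d in layer["lam"]:
        c *= layer["kappa"] * (layer["upsilon"] - 4) / lam_d
    return c


def matches(layer, x, max_dist2):
    """Squared Mahalanobis distance test against expected mean and variance."""
    d2 = 0.0
    for xd, th, lam_d in zip(x, layer["theta"], layer["lam"]):
        var = lam_d / (layer["upsilon"] - 4)
        d2 += (xd - th) ** 2 / var
    return d2 <= max_dist2


def new_layer(x, init):
    """Create a layer centered on sample x with the initial parameters."""
    return {"theta": list(x), "kappa": init["kappa"],
            "upsilon": init["upsilon"], "lam": list(init["lam"])}


def update_pixel(layers, x, init, max_dist2):
    """Update at most one layer with sample x; return the layer that absorbed it."""
    layers.sort(key=confidence, reverse=True)      # most confident first
    for layer in layers:
        if matches(layer, x, max_dist2):
            k = layer["kappa"]
            for d in range(len(x)):
                diff = x[d] - layer["theta"][d]
                layer["lam"][d] += k / (k + 1) * diff * diff
                layer["theta"][d] += diff / (k + 1)
            layer["kappa"] = k + 1
            layer["upsilon"] += 1
            return layer                           # lower layers stay untouched
        # Unmatched layer: forget, but never below the initial value.
        layer["kappa"] = max(layer["kappa"] - 1, init["kappa"])
    layers[-1] = new_layer(x, init)                # replace least confident layer
    return layers[-1]
```

Because only one layer absorbs each sample, a brief foreground occlusion ends up in a low-confidence layer instead of distorting the dominant background layer.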
With this mechanism, we do not deform our models with
noise or foreground pixels, but easily adapt to smooth intensity changes like
lighting effects. Embedded confidence score determines the number of layers to
be used and prevents unnecessary layers. During our tests, usually secondary
layers correspond to shadowed form of the background pixel or different colors
of the moving regions of the scene. If the scene is unimodal, confidence scores
of layers other than first layer become very low.
3.3. Foreground Segmentation
Learned background statistics are used to detect the
changed regions of the scene. We determine how many layers are necessary for
each pixel and use only those layers during foreground segmentation phase. The
number of layers required to represent a pixel is not known beforehand, so
background is initialized with more layers than needed. Usually, we select three
to five layers. In more dynamic scenes, more layers are required. Using the
confidence scores, we determine how many layers are significant for each pixel.
As we observe new samples for each pixel, we update the parameters for our
background model. At each update, at most one layer is updated with the current
observation. This assures the minimum overlap over layers. We order the layers
according to confidence score and select the layers having confidence value
greater than the layer threshold. We refer to these layers as confident layers.
We start the update mechanism from the most confident layer. If the observed
sample is inside the
of the layer
mean, which corresponds to
confidence
interval of the current model, parameters of the model are updated. Lower
confidence models are not updated.
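The foreground decision for one pixel can be sketched as follows. This minimal Python example assumes that a pixel is foreground when it matches none of the confident layers, that matching is a squared Mahalanobis distance test with a caller-supplied bound `max_dist2`, and that each layer is summarized as a `(confidence, mean, variance)` tuple with per-channel variances. The function name and both thresholds are illustrative.

```python
def is_foreground(x, layers, layer_threshold, max_dist2):
    """Return 1 if sample x matches no confident layer, else 0.

    layers: list of (confidence, mean, variance) tuples, where mean and
    variance are per-channel lists of equal length.
    """
    # Keep only the layers whose confidence exceeds the layer threshold.
    confident = [layer for layer in layers if layer[0] > layer_threshold]
    for _, mean, var in confident:
        d2 = sum((xd - m) ** 2 / v for xd, m, v in zip(x, mean, var))
        if d2 <= max_dist2:
            return 0   # explained by a confident background layer
    return 1           # changed pixel
```

Running this test against the long-term and the short-term models separately yields the two binary foreground masks.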
4. Experimental Results
To evaluate the dual foreground method, we used
several public datasets from PETS 2006, i-LIDS 2007, and Advanced Technology
Center. We tested a total of 32 sequences grouped into 10 sets. The videos have
assorted resolutions; 
, and
. The scenarios ranged from lunch rooms to underground
train stations. Half of these sequences depict scenes that are not crowded.
Other sequences contain complex scenarios with multiple people sitting,
standing, and walking at variable speeds. Some sequences show vehicles parked.
The abandoned items are left for durations ranging from 10 seconds to 2
minutes. Some sequences contain small abandoned items. A few sequences have
multiple abandoned items.
The sets AB-Easy, AB-Medium, and AB-Hard, which are
included in i-LIDS challenge, are recorded in an underground train station. Set
PETS is a large closed space platform with restaurants. Sets ATC-1 and ATC-2
are recorded from a wide angle camera of a cafeteria. Sets ATC-3 and ATC-4 are
different cameras from a lunch room. Set ATC-5 is a waiting lounge. Since the
proposed method is a pixelwise scheme, it is not difficult to set detection
areas in the initialization time. We manually marked the platform in AB-easy,
AB-medium, and AB-hard sets, the waiting area in PETS 2006 set, and the illegal
parking spots in PV-easy, PV-medium, and PV-hard sets. For the ATC sets, all of
the image area is used as the detection area. For i-LIDS sets, we replaced the
beginning parts of the video sequences with 4 frames of the empty platform.
For all results, we set the learning rate of the
short-term background at 30 times the learning rate of the long-term
background. We assigned the evidence threshold
in the range
depending on
the desired responsiveness time that controls how soon an abandoned item is
detected as an alarm. We used
as the decay
parameter.
Figure 4 shows the detection results for the i-LIDS datasets.
We reported the performance scores of all sets in Table 1, where
is the total
number of frames in a set and
is the duration
of the event in terms of the number of frames. We measure the duration right
after an item has been left behind. It is also possible to measure the duration
after the person moved away or after some preset waiting time in case
additional tracking information is incorporated. Events indicates the number of abandoned
objects (for PV-medium, the number of the illegally parked vehicles). TD means
the correctly detected objects. A detection event is considered to be both
spatially and temporally continuous. In other words, there might be multiple
detections for a frame if the objects are spatially disconnected. FA shows the
falsely detected objects.
and
are the
duration of the correct and false detections.
is the duration
that an abandoned item could not be detected. Since we start an event as soon
as an object is left, this score does not consider any waiting time. This means
that we overestimate our miss rate.
Table 1: Detection results.
Figure 4: Detected events for i-LIDS datasets.
As our results show, we successfully detected almost
all abandoned items while achieving a very low false alarm rate. Our method
performed satisfactorily when the initial frame showed the actual static
background. The detection areas did not include any people at the
initialization time in the ATC sets, thus the uncontaminated backgrounds are
easily learned. This is also true for the PV and AB-easy sets. However, the
AB-medium and AB-hard sets contained several stationary people in the initial
frames. This resulted in false detections when those people moved away. Since
the background models eventually learn the statistically dominant color values,
such false alarms should not occur in the long run due to the fact that the
background will be more visible than the people. In other words, the ratio of
the false alarms should decrease in time. We do not learn the color
distribution of the abandoned items (or parked vehicles), thus the proposed
method can detect them even if they are occluded. As long as the occluding
object, for example, a passing by person, has different color than the
long-term background, our method still shows the boundary of the abandoned
item.
Representative detection results are given in Figures 5–12. As visible, none of the moving objects, moving
shadows, or people that are stationary for shorter durations were falsely
detected. Besides, there are no ghost false
detections due to inaccurate blending of the abandoned items into the long-term
background. Thanks to the Bayesian update, the changing illumination conditions,
as in PV-medium, are properly adapted in the backgrounds.
Figure 5: Test sequence AB-easy (Courtesy of
i-LIDS). The alarm sets off immediately when the item is removed even
though the luggage was stationary for 2000 frames.
Figure 6: In sequence ATC-2.2 (Courtesy of
Advanced Technology Center, Amagasaki), one person brings a bag, puts it on
the ground, another person comes and picks it up. As visible, the object is
detected accurately, and the alarm immediately sets off when the bag is
removed.
Figure 7: In sequence ATC-2.3 (Courtesy of
Advanced Technology Center, Amagasaki), one person brings a bag and leaves it
on the floor. As visible, after it was detected as an abandoned item, temporary
occlusions due to the moving people do not cause the system to fail.
Figure 8: In sequence ATC-2.6 (Courtesy of
Advanced Technology Center, Amagasaki), one person hides the bag under a
shadowed area of the table and runs away. Another person comes, wanders around,
takes the bag and leaves the scene.
Figure 9: In sequence ATC-3.1 (Courtesy of
Advanced Technology Center, Amagasaki), two people sit at a table. One
person leaves a backpack, another a bottle. They leave both items behind when
they depart.
Figure 10: In sequence ATC-5.3 (Courtesy of
Advanced Technology Center, Amagasaki), one person sits on a couch and puts
a bag next to him. After a while, he leaves but the bag stays on the couch.
Another person comes, sits on the couch, puts his briefcase next to him, and
takes away the bag. The briefcase is also removed later.
Figure 11: A test sequence from PETS 2006 datasets (Courtesy of PETS). There is significant motion all around the scene. To
make things more challenging, the person who leaves his backpack stays
still afterward for an extended period of time.
Figure 12: Test sequence PV-medium from AVSS 2007 (Courtesy of i-LIDS). A challenge in this video is the rapidly changing
illumination conditions that cause dark shadows.
Another advantage of this method is that the alarm is
immediately set off as soon as the abandoned item is removed from its previous
position. Although we do not know whether the person who left the object has
moved away from the object or not, we consider this property an advantage
over the tracking-based approaches that require a decision net of heuristic
rules and context-dependent priors to detect such an event.
One shortcoming is that it cannot discriminate between
different types of objects; for example, a person who is stationary for a long
time can be detected as an abandoned item. This can be, however, an indication
of another suspicious behavior as it is not common. To determine object types
and reduce the false alarm rate, object classifiers, such as a human or a
vehicle detector, can be used. Since such classifiers are only for verification
purposes, their computation time should be negligible. Since no tracking is
integrated, trajectory-based semantics, for example, who left the item or how
long the item was left before the person moved away, cannot be extracted. Still,
our method can be used as a preprocessing stage to improve the tracking-based
video analytics.
The computational load of the proposed method is low.
Since we only employ pixelwise operations and make pixelwise decisions, we can
take advantage of the parallel processing architectures. By assigning each
image pixel to a processor on the GPU using CUDA programming, since each
processor can execute in parallel, the speed improves more than
in comparison
to the corresponding CPU implementation. For instance, full background update
for
images takes
74.32 milliseconds on CPU (P4 DualCore
), however on CUDA, it only needs
6.38 milliseconds. We observed that the detection can be comfortably employed
in quarter spatial resolution by processing the short-term background at
while updating the long term at every 5 seconds (
) with the same
learning rates.
5. Conclusions
We present a robust method that uses dual foregrounds
to find abandoned items, stopped objects, and illegally parked vehicles in
static camera setups. At every frame, we adapt the dual background models using
Bayesian update, and aggregate evidence obtained from dual foregrounds to
achieve temporal consistency.
This method does not depend on object initialization
and tracking of every single object; hence its performance is not bounded
by these error-prone tasks that usually fail for crowded scenes. It accurately
outlines the boundary of items even if they are fully occluded. Since it
executes pixelwise operations, it can be implemented on parallel processors.
Acknowledgment
The authors thank their
colleagues Jay Thornton and Keisuke Kojima for their constructive comments.
References
- J. D. Courtney, “Automatic video indexing via object motion analysis,” Pattern Recognition, vol. 30, no. 4, pp. 607–625, 1997.
- S. Velastin and A. Davies, “Intelligent CCTV surveillance: advances and limitations,” in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, Wageningen, The Netherlands, August-September 2005.
- A. E. Cetin, M. B. Akhan, B. U. Toreyin, and A. Aksay, “Characterization of motion of moving objects in video,” 2004, US patent no. 20040223652.
- H. Grabner and H. Bischof, “On-line boosting and vision,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 260–267, New York, NY, USA, June 2006.
- E. Auvinet, E. Grossmann, C. Rougier, M. Dahmane, and J. Meunier, “Left-luggage detection using homographies and simple heuristics,” in Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS '06), pp. 51–58, New York, NY, USA, June 2006.
- J. Martínez-del-Rincón, J. E. Herrero-Jaraba, J. R. Gómez, and C. Orrite-Uruñuela, “Automatic left luggage detection and tracking using multi-camera UKF,” in Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS '06), pp. 59–66, New York, NY, USA, June 2006.
- N. Krahnstoever, P. Tu, T. Sebastian, A. Perera, and R. Collins, “Multi-view detection and tracking of travelers and luggage in mass transit environments,” in Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS '06), pp. 67–74, New York, NY, USA, June 2006.
- F. Lv, X. Song, B. Wu, V. K. Singh, and R. Nevatia, “Left luggage detection using bayesian inference,” in Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS '06), pp. 83–90, New York, NY, USA, June 2006.
- K. Smith, P. Quelhas, and D. Gatica-Perez, “Detecting abandoned luggage items in a public space,” in Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS '06), pp. 75–82, New York, NY, USA, June 2006.
- S. Guler and M. K. Farrow, “Abandoned object detection in crowded places,” in Proceedings of the 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS '06), pp. 99–106, New York, NY, USA, June 2006.
- C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '99), vol. 2, pp. 246–252, Fort Collins, Colo, USA, June 1999.
- F. Porikli and O. Tuzel, “Bayesian background modeling for foreground detection,” in Proceedings of the 3rd ACM International Workshop on Video Surveillance & Sensor Networks (VSSN '05), pp. 55–58, Singapore, November 2005.
- K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, “Wallflower: principles and practice of background maintenance,” in Proceedings of the 17th IEEE International Conference on Computer Vision (ICCV '99), vol. 1, pp. 255–261, Kerkyra, Greece, September 1999.
- O. Javed, K. Shafique, and M. Shah, “A hierarchical approach to robust background subtraction using color and gradient information,” in Proceedings of the Workshop on Motion and Video Computing (MOTION '02), pp. 22–27, Orlando, Fla, USA, December 2002.
- A. Mittal and N. Paragios, “Motion-based background subtraction using adaptive kernel density estimation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 302–309, Washington, DC, USA, June-July 2004.
- A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background subtraction,” in Proceedings of the 6th European Conference on Computer Vision-Part II (ECCV '00), vol. 2, pp. 751–767, Dublin, Ireland, June-July 2000.