Mathematical Problems in Engineering

Volume 2016, Article ID 1879489, 11 pages

http://dx.doi.org/10.1155/2016/1879489

## Adaptive Randomized Ensemble Tracking Using Appearance Variation and Occlusion Estimation

Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Received 30 October 2015; Revised 30 December 2015; Accepted 4 January 2016

Academic Editor: Daniel Zaldivar

Copyright © 2016 Weisheng Li and Yanjun Lin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Tracking-by-detection methods have been widely studied and have produced promising results. These methods usually train a classifier or a pool of classifiers online, use previous tracking results to generate a new training set for the object appearance, and update the current model to predict the object location in subsequent frames. However, the updating process can easily cause drift under appearance variation and occlusion. Previous methods update the classifier(s) with a fixed learning rate parameter in all scenarios. The learning rate has a great influence on the tracker’s performance and should be adjusted dynamically as the scene changes during tracking. In this paper, we propose a novel method to model the time-varying appearance of an object that takes the appearance variation and occlusion of local patches into consideration. In contrast with existing methods, the learning rate for updating the classifier ensemble is adjusted adaptively by estimating the appearance variation with sparse optical flow and the possible occlusion of the object between consecutive frames. Experiments on challenging video sequences demonstrate that the proposed method is more robust against appearance variation and occlusion than state-of-the-art approaches.

#### 1. Introduction

Visual object tracking, one of the cardinal problems in computer vision, has a wide range of applications including video surveillance, human-computer interaction, video retrieval, and autonomous navigation. Although numerous single-object tracking methods have been proposed, many of them achieve favorable performance only in simple environments with slow motion and slight occlusion. Hence, developing a robust algorithm for complex and dynamic scenes, with distractions such as heavy occlusion, appearance variation, and cluttered background, remains a challenging task.

The purpose of object tracking is to estimate the states of a moving target in a video. A tracking system usually has three components: an appearance model, a motion model, and a model update scheme. The appearance model represents the object with suitable features and verifies predictions against the object representation. The motion model predicts the most likely state of the target. The model update scheme lets the tracker adapt to appearance variation and occlusion of the target object. Existing trackers can be classified as either generative or discriminative. For generative methods [1–6], tracking is simplified to searching for the position most similar to the target object among the neighboring positions of the current location. The target is often represented by a set of templates. The fragment-based (Frag) tracker [1] addresses the partial occlusion problem by modeling object appearance with histograms of local patches. The incremental visual tracker (IVT) [2] utilizes an incremental subspace model to adapt to appearance variation. The visual tracking decomposition (VTD) approach [3] extends the conventional particle filter framework with multiple motion and observation models to account for appearance variation. The $\ell_1$ tracker [4] was the first to apply sparse representation to visual tracking; it uses designed trivial templates to handle occlusions and treats tracking as finding the image region with minimal reconstruction error using $\ell_1$ minimization. The distribution field tracker (DFT) [5] builds an image descriptor using distribution fields (DFs), a representation that allows smoothing the objective function without destroying information about pixel values. Li et al. [6] construct an appearance model using the 3D discrete cosine transform, propose an incremental 3D-DCT algorithm, and then embed the discriminative criterion into a particle filtering framework for object state inference. Most of these methods use holistic representations to describe objects and hence do not handle occlusions or distracters well.

For discriminative methods [7–14], tracking is treated as a binary classification problem that aims at designing an ensemble classifier to distinguish the target object from the background. These methods utilize both the target and the background information. Avidan [7] combines a set of weak classifiers into a strong one and develops an ensemble tracking method. The OAB tracker [8] uses an online boosting method to update discriminative features and thereby handle the drifting problem in object tracking. Bai et al. [9] propose the randomized ensemble tracker, which extends the online boosting algorithm [8] and the ensemble tracker [7] by characterizing the ensemble weight vector as a random variable and evolving its distribution with recursive Bayesian estimation. Babenko et al. [10] introduce multiple instance learning into online object tracking, where samples are grouped into positive and negative bags or sets. Kalal et al. [11] propose the P-N learning algorithm to exploit the underlying structure of positive and negative samples when learning classifiers for object tracking. Zhang et al. [12] propose a real-time compressive tracking algorithm that uses random projection to map a datum in a high-dimensional space to a low-dimensional vector. The structured output tracking (Struck) method [13] adopts a kernelized structured output support vector machine to avoid labeling ambiguity when updating the classifier during tracking. Wang et al. [14] incorporate online distance metric learning into visual tracking based on a particle filter framework, so that the appearance variations of an object are learned effectively via an online metric learning mechanism. Furthermore, several algorithms have been proposed to exploit the advantages of both generative and discriminative models [15, 16].

In this paper, we focus mainly on the model update scheme because it has a great impact on the performance of a tracker. Existing strategies usually concentrate on updating the weights of the classifiers and ignore updating their learning rate. To address this problem, and motivated by [15, 17–19], we introduce appearance variation estimation and occlusion estimation to control the learning rate of the classifiers. We estimate the appearance variation using sparse optical flow and the possible occlusion using the reconstruction error of local patches. In summary, we propose a robust tracking method with an adaptive appearance model. During tracking, we exploit local patches to handle occlusion and appearance variation. The model is updated adaptively, taking occlusions into account, in order to follow appearance variations and alleviate drift.

The remainder of this paper is organized as follows. The work related to visual tracking and sparse representation is reviewed in Section 2. In Section 3, the proposed method with appearance variation and occlusion estimation is introduced. Experimental results and demonstrations are reported and analyzed in Section 4 and the conclusion is given in Section 5.

#### 2. Related Work

In this section, we discuss related online tracking algorithms that handle appearance variation and occlusion of the target object. Although much progress has been made in visual tracking, designing an effective and robust tracker remains challenging because of distractions such as appearance variation, varying illumination, occlusion, and background clutter.

Tracking-by-detection [7, 9] has become increasingly popular because of its strong performance. These methods treat tracking as a detection and classification problem in order to avoid modeling object dynamics, especially when abrupt motion and occlusions occur. They require training a classifier that distinguishes the object from the background so that the object can be detected in each frame. One common approach is to use an ensemble classifier that linearly combines many weak classifiers with different associated weights, for example, [7, 9]. A strong classifier is first constructed from a pool of weak classifiers trained on the initial frame. At each subsequent time step, the weights of the classifiers are updated, some poorly performing weak classifiers are discarded, and new weak classifiers trained on the next frame are added.
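The ensemble scheme described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of [7] or [9]: the function names, the use of per-classifier error estimates to decide which weak classifiers are discarded, and the renormalization of the weights are all assumptions made for the example.

```python
import numpy as np

def ensemble_predict(weights, weak_outputs, threshold=0.0):
    """Strong label in {-1, +1} from a weighted sum of weak outputs."""
    score = float(np.dot(weights, weak_outputs))
    return 1 if score > threshold else -1

def refresh_pool(weights, errors, new_weights, n_replace):
    """Discard the n_replace weak classifiers with the highest error
    (hypothetical criterion) and append the same number of newly
    trained ones, then renormalize the weights to sum to one."""
    keep = np.argsort(errors)[: len(weights) - n_replace]  # lowest error first
    keep.sort()                                            # preserve pool order
    merged = np.concatenate([weights[keep], new_weights[:n_replace]])
    return merged / merged.sum(), keep
```

In this sketch `refresh_pool` only manipulates weights and indices; training the new weak classifiers on the next frame is left outside the example.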

Appearance variation and occlusion are two major problems in visual tracking. Several existing algorithms use holistic or local representation schemes to handle them. The IVT tracker [2] uses an adaptive appearance model that aims to account for appearance variation or limited deformable motion. Because it adopts a holistic appearance model, it is less effective in handling heavy occlusion. The ensemble tracker [7] formulates tracking as a binary classification problem. Although this method is able to distinguish the target from the background, its ability to handle heavy occlusion is rather limited. The Frag tracker [1] aims to solve partial occlusion with a representation based on histograms of local patches. Its template, which combines the votes of matching local patches, is not updated, so it is not expected to handle appearance variation involving large changes in scale or shape deformation. The MIL tracker [10] is able to reduce drift, but it cannot handle large shape deformation. The $\ell_1$ tracker [4] does not exploit appearance information from the background and is therefore ineffective in handling heavy occlusion.

As mentioned above, handling appearance variation and handling occlusion are two major problems, and two active research topics, in object tracking. Motivated by [15, 17–19], we introduce methods to estimate appearance variation and possible occlusion during the model update. An optical flow algorithm calculates the motion of the moving target from one frame to the next using the intensity values of neighboring pixels, so we can use it to estimate the rate of appearance variation between consecutive frames. Motivated by the successful application of sparse representation in many tasks, we also design a new sparsity-based occlusion estimation method.

#### 3. The Proposed Method

In this section, we introduce the two estimation methods used by our tracker: appearance variation estimation based on sparse optical flow and sparsity-based occlusion estimation. Together they address the problem that the learning rate of the classifier could not previously be adapted dynamically as the scene changes during tracking. Our tracker is a variant of RET [9].
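To make the idea of a scene-dependent learning rate concrete, the following is a purely illustrative sketch and not the update rule of the proposed tracker. It assumes that both estimates have been normalized to scores in [0, 1], that a larger appearance variation should raise the learning rate, and that a larger occlusion score should suppress the update; the function names and the `base_rate` and `max_rate` defaults are hypothetical.

```python
def adaptive_learning_rate(variation, occlusion, base_rate=0.1, max_rate=0.5):
    """Hypothetical mapping from two scores in [0, 1] to a learning rate."""
    variation = min(max(variation, 0.0), 1.0)   # clamp to [0, 1]
    occlusion = min(max(occlusion, 0.0), 1.0)
    # Faster appearance change -> learn faster (assumption).
    rate = base_rate + (max_rate - base_rate) * variation
    # Heavier occlusion -> update less, to avoid learning the occluder (assumption).
    return rate * (1.0 - occlusion)

def blend(old_model, new_model, rate):
    """Convex combination of an old and a new model vector."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old_model, new_model)]
```

With a fixed learning rate, `blend` would always be called with the same `rate`; the point of the sketch is only that `rate` becomes a function of the two per-frame estimates.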

##### 3.1. The RET Tracker

We give a brief introduction to the RET tracking method. The tracker puts emphasis on estimating the state of the classifier rather than the state of the object. It characterizes the ensemble weight vector that combines the weak classifiers as a random variable and evolves its distribution with recursive Bayesian estimation. The object bounding box is divided into local patches of a fixed size. For each patch, the tracker extracts a histogram of oriented gradients (HOG) descriptor. At each time step, the method starts with the pool of weak classifiers $C = \{c_1, \ldots, c_N\}$, a distribution over the weight vector $\mathbf{w}$, and input data $\mathbf{x}$. It samples the distribution to get $K$ instantiations $\mathbf{w}_1, \ldots, \mathbf{w}_K$ of the weight vector and combines them with the outputs of the weak classifiers to yield $K$ ensembles of weak classifiers $f_{\mathbf{w}_1}, \ldots, f_{\mathbf{w}_K}$. These ensembles can be interpreted as instantiations of the randomized classifier $f_{\mathbf{w}}$ and are used to compute an approximation of the expected output of the randomized classifier. The approximation is taken as the output of the strong classifier for input data $\mathbf{x}$.

The classification method of the tracker is described as follows. Given a weight vector $\mathbf{w}$ and input data $\mathbf{x}$, we obtain an ensemble binary classifier $f_{\mathbf{w}}$ of the pool by thresholding the linear combination of the outputs of all weak classifiers:

$$f_{\mathbf{w}}(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N} w_i\, c_i(\mathbf{x}) - \tau\right),$$

where $w_i$ is the component of the weight vector $\mathbf{w}$ associated with $c_i$, $N$ is the dimensionality of the weight vector $\mathbf{w}$, and $\tau$ is a model parameter.

Let $D_0, D_1, \ldots, D_t$ denote the series of sequentially arriving datasets. At time step $t$, given input data $\mathbf{x}$, the probabilistic label value is computed as the expected output of the randomized classifier:

$$y(\mathbf{x}) = \mathbb{E}\left[f_{\mathbf{w}}(\mathbf{x}) \mid D_0, \ldots, D_t\right],$$

where the weight vector $\mathbf{w}$ follows the Dirichlet distribution $\operatorname{Dir}(\alpha, H)$, $H$ is the base distribution, which is the expectation of the vector $\mathbf{w}$, and $\alpha$ is the concentration parameter.

The final strong classifier for input data $\mathbf{x}$ is obtained by voting over the $K$ sampled ensembles and thresholding the result.
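The classification procedure of this subsection can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions rather than the RET implementation of [9]: weak classifier outputs are taken to be in {-1, +1}, the Dirichlet distribution is parameterized by the product of a concentration value `alpha` and a base vector `base` (so that `base` is its mean), and the averaged vote is thresholded at zero. All names and default values are hypothetical.

```python
import numpy as np

def ensemble_label(w, c, tau):
    """One ensemble: threshold the weighted sum of weak outputs (+1 or -1)."""
    return 1 if float(np.dot(w, c)) > tau else -1

def randomized_ensemble_label(c, base, alpha, tau, n_samples=200, seed=0):
    """Approximate the expected ensemble output by sampling weight vectors.

    c     : weak classifier outputs in {-1, +1}, shape (N,)
    base  : mean of the weight distribution, positive entries summing to 1
    alpha : concentration parameter (larger = samples closer to base)
    """
    rng = np.random.default_rng(seed)
    # Dirichlet with parameter alpha * base has expectation equal to base.
    ws = rng.dirichlet(alpha * np.asarray(base, dtype=float), size=n_samples)
    votes = np.where(ws @ np.asarray(c, dtype=float) > tau, 1, -1)
    mean_vote = float(votes.mean())          # Monte Carlo estimate of the expectation
    # Thresholding the mean vote at zero is an assumption of this sketch.
    return (1 if mean_vote > 0 else -1), mean_vote
```

A small `alpha` spreads the sampled weight vectors widely, so the votes of the sampled ensembles disagree more often; a large `alpha` makes the result approach that of the single ensemble with weights `base`.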

##### 3.2. Appearance Variation Estimation Based on Sparse Optical Flow

In visual tracking, appearance variation is an important factor that affects performance. Most existing methods address it through the choice of feature descriptors and do not estimate the rate or degree of the variation. The methods proposed in [17, 20] have shown that appearance variation can be detected by observing changes in the induced optical flow. The magnitude of the induced optical flow vectors depends on the distance to the object as well as on the velocity difference between the object and the video capture device. The sampling rate of a given video sequence is constant. If the velocity of the video capture device is constant, a change in the magnitude of the optical flow vectors reflects a change in the distance to the object, so the rate of variation can be computed. If the velocity difference between the object and the video capture device is small, a change in the magnitude of the optical flow vectors reflects the degree of appearance variation. We use sparse optical flow because traditional dense optical flow algorithms cannot meet real-time requirements owing to their high computational cost. Many methods, such as Harris corners, Canny edges, SIFT, and SURF, can be used to extract points of interest for the sparse optical flow calculation.

Motivated by the demonstrated success of optical flow [17, 21], we propose a method based on sparse optical flow to estimate the appearance variation of the object between consecutive frames. The workflow of our method is described in Figure 1. For each frame, we first employ the Canny edge detector to extract suitable and accurate key points in real time. Second, because displacements may be large, we calculate the optical flow using image pyramids and the Lucas-Kanade algorithm and estimate the image velocities with subpixel precision. Third, we build an optical flow descriptor composed of the positions of the key points and the magnitudes of the corresponding flow vectors. Finally, we obtain the estimation result as a Euclidean distance.
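The last three steps of this workflow can be illustrated with a small NumPy sketch. It is a simplified stand-in, not our implementation: it uses a single-level Lucas-Kanade solve without image pyramids (so it only suits small displacements), it assumes the key points are already given as integer pixel coordinates, and it assumes that the descriptor stacks each key point position with its flow magnitude and that descriptors of equal shape are compared directly. The function names are hypothetical.

```python
import numpy as np

def lucas_kanade_at(prev, curr, point, radius=7):
    """Single-level Lucas-Kanade flow (u, v) in pixels at one key point.

    point is (x, y) in integer pixel coordinates, at least `radius`
    pixels away from the image border.
    """
    x, y = point
    iy, ix = np.gradient(prev)      # derivatives along rows (y) and columns (x)
    it = curr - prev                # temporal derivative
    win = (slice(y - radius, y + radius + 1), slice(x - radius, x + radius + 1))
    # Brightness constancy: ix * u + iy * v = -it, solved in least squares.
    a = np.stack([ix[win].ravel(), iy[win].ravel()], axis=1)
    b = -it[win].ravel()
    flow, *_ = np.linalg.lstsq(a, b, rcond=None)
    return flow                     # array([u, v])

def flow_descriptor(points, flows):
    """Rows of (x, y, |flow|): key point positions with flow magnitudes."""
    points = np.asarray(points, dtype=float)
    mags = np.linalg.norm(np.asarray(flows, dtype=float), axis=1)
    return np.column_stack([points, mags])

def appearance_variation(desc_a, desc_b):
    """Euclidean distance between two descriptors of equal shape."""
    return float(np.linalg.norm(np.asarray(desc_a) - np.asarray(desc_b)))
```

In practice the pyramidal Lucas-Kanade step would normally come from an optimized library routine; the explicit least-squares solve is shown here only to make the subpixel velocity estimate visible.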