Advances in Multimedia

Volume 2018, Article ID 7357284, 10 pages

https://doi.org/10.1155/2018/7357284

## Robust Visual Tracking with Discrimination Dictionary Learning

^{1}Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Nanchang Institute of Technology, Nanchang 330099, China

^{2}School of Information Engineering, Nanchang Institute of Technology, Nanchang 330099, China

Correspondence should be addressed to Chengzhi Deng; dengchengzhi@126.com

Received 3 June 2018; Revised 3 August 2018; Accepted 10 August 2018; Published 2 September 2018

Academic Editor: Shih-Chia Huang

Copyright © 2018 Yuanyun Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Handling various appearance variations is a challenging issue in visual tracking. Existing tracking algorithms build appearance models upon target templates. Those models are not robust to significant appearance variations caused by factors such as illumination change, partial occlusion, and scale variation. In this paper, we propose a robust tracking algorithm that represents target candidates with a learnt dictionary. With the learnt dictionary, a target candidate is represented as a linear combination of dictionary atoms, and the discriminative information in the learning samples is exploited. In the meantime, the dictionary learning process can capture appearance variations. Based on the learnt dictionary, a more stable representation of target candidates is obtained. Additionally, the observation likelihood is evaluated based on both the reconstruction error and the constrained dictionary coefficients. Comprehensive experiments demonstrate the superiority of the proposed tracking algorithm over several state-of-the-art tracking algorithms.

#### 1. Introduction

Visual tracking is a fundamental task in computer vision with a wide range of applications, such as intelligent transportation, video surveillance, human-computer interaction, and video editing. The goal of visual tracking is to estimate the state of a tracked target in each frame. Although many tracking algorithms have been proposed in recent decades [1], designing a robust tracking algorithm remains challenging due to factors such as fast motion, out-of-plane rotation, nonrigid deformation, and background clutter.

Based on the types of target observations, visual tracking algorithms can be classified as either generative [2–6] or discriminative [7–14]. Generative tracking algorithms search for the image patch most similar to the tracked target model and take it as the tracking result in the current frame. For a generative algorithm, the prime problem is to build an effective appearance model that is robust to complicated appearance variations.

Discriminative tracking algorithms treat visual tracking as a binary classification problem. The tracked target is distinguished from the surrounding background by learnt classifiers, which compute a confidence value for each target candidate and label it as a foreground target or a background block. In this work, we propose a generative algorithm. Next, we briefly review works related to our tracking algorithm as well as some recent tracking algorithms.

Kwon et al. [3] decompose the observation model and motion model into multiple basic observation models and multiple basic motion models, respectively. Each basic observation model covers a target appearance variation, and each basic motion model covers a specific motion pattern. A basic observation model and a basic motion model are combined into a basic tracker, making the tracking algorithm robust to drastic appearance changes. He et al. [4] propose an appearance model based on locality sensitive histograms computed at each pixel location, which is robust to drastic illumination variations. In [2], a target candidate is represented by a set of intensity histograms of multiple image patches, each of which casts a vote on the corresponding position; however, the candidate is represented by fixed target templates, which are not robust to drastic appearance variations. Wang et al. [6] represent a target candidate based on target templates with an affine constraint, and the observation likelihood is computed with a learnt distance metric.

The representation technique with a sparse constraint has been applied to visual tracking [15–19]. Sparse target representations are robust to outliers and occlusions. In [15], Mei et al. use a set of target templates to represent target candidates and model partial occlusions with trivial templates. The algorithm in [15] is robust to partial occlusion, but it is less effective when severe occlusions occur. Zhong et al. [18] propose a collaborative model with a sparsity constraint that combines a generative model and a discriminative model to improve tracking performance. Zhang et al. [16] exploit the spatial layout structure of a target candidate and represent the target appearance based on local information and spatial structure. In [19], a target candidate is represented by an underlying low-rank structure with sparse constraints, in which temporal consistency is exploited.

Recently, correlation filter [21–24] and deep network [25–28] techniques have been applied to visual tracking. In [23], the tracking algorithm uses different features to learn correlation filters. The proposed appearance model is robust to large-scale variations and maintains multiple modes in a particle filter tracking framework. Liu et al. [22] exploit part-based structure information for correlation filter learning; the learnt filters can accurately distinguish foreground parts from the background. In [28], Ma et al. exploit object features from deep convolutional neural networks. The output of the convolutional layers includes semantic information and hierarchies, which is robust to appearance variations. Huang et al. [27] propose a tracking algorithm based on deep feature cascades, which treats visual tracking as a decision-making process.

Motivated by the above-mentioned work, we propose a learnt-dictionary-based appearance model. A target candidate is represented by a linear combination of the learnt dictionary atoms. The dictionary learning process can capture appearance variations: the dictionary atoms cover recent appearance variations, so a stable target representation is obtained. The observation likelihood is evaluated based on the reconstruction error with a sparse constraint on the dictionary coefficients. Extensive experimental results on challenging video sequences show the robustness and effectiveness of the proposed appearance model and tracking algorithm.

The remainder of this paper is organized as follows. Section 2 proposes the novel tracking algorithm, which includes the appearance model, the dictionary learning, the observation likelihood evaluation, and the dictionary update. Section 3 compares the tracking performance of the proposed tracking algorithm with some state-of-the-art algorithms. Section 4 concludes this work.

#### 2. Proposed Tracking Algorithm

In this section, we detail the proposed tracking algorithm, including an appearance model based on a learnt dictionary, discriminative dictionary learning for target representation, and a novel likelihood function. We develop the tracking algorithm within a particle filter tracking framework [29], which is widely used in visual tracking due to its effectiveness and simplicity.

In our tracking algorithm, the target state in the first frame is given as $x_1$, and $y_1$ denotes the corresponding target observation of $x_1$. In the first frame, a set of $N$ particles (i.e., target candidates) is extracted and denoted as $\{x_1^i\}_{i=1}^{N}$. These particles are collected by cropping out image regions surrounding the location of $x_1$. They have the same size as $y_1$ and equal importance weights $w_1^i = 1/N$, $i = 1, \ldots, N$. The particles in frame $t$ are denoted as $\{x_t^i\}_{i=1}^{N}$. The state and the corresponding observation of particle $i$ are denoted as $x_t^i$ and $y_t^i$ with importance weight $w_t^i$, respectively.

The particles in frame $t$ are propagated from frame $t-1$ according to the state transition model $p(x_t \mid x_{t-1})$, which is assumed to be a Gaussian distribution:

$$p\left(x_t \mid x_{t-1}\right) = \mathcal{N}\left(x_t;\, x_{t-1},\, \Sigma\right), \tag{1}$$

where the covariance $\Sigma$ is a diagonal matrix whose diagonal entries are the variances of the 2D position and the scale of a target candidate.

The target state and the corresponding observation in the $t$-th frame are denoted as $x_t$ and $y_t$, respectively. In the particle filter framework, the state $x_t$ in frame $t$ is approximately estimated as

$$\hat{x}_t = \sum_{i=1}^{N} w_t^i x_t^i, \tag{2}$$

where $w_t^i$ is the weight of particle $i$. During tracking, the particle weights are dynamically updated according to the likelihood of each particle as

$$w_t^i \propto w_{t-1}^i \, p\left(y_t^i \mid x_t^i\right), \tag{3}$$

where $p(y_t^i \mid x_t^i)$ is the likelihood function of particle $i$, which is introduced in (15).
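The particle filter steps above can be sketched as follows; the function and parameter names are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(particles, sigma):
    """Draw new particle states from the Gaussian transition model
    p(x_t | x_{t-1}) = N(x_t; x_{t-1}, diag(sigma^2))."""
    return particles + rng.normal(0.0, sigma, particles.shape)

def update_weights(weights, likelihoods):
    """w_t^i proportional to w_{t-1}^i * p(y_t^i | x_t^i), renormalized."""
    w = weights * likelihoods
    return w / w.sum()

def estimate_state(particles, weights):
    """Weighted-mean state estimate: x_hat = sum_i w^i x^i."""
    return weights @ particles
```

Each particle row here is a state vector (2D position and scale), matching the diagonal covariance described above.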

##### 2.1. Target Representations

In existing algorithms, a target candidate is represented by a linear combination of a set of target templates, usually generated from tracking results in previous frames. These templates contain noise and uncertain information due to complicated appearance variations, so such tracking algorithms are not robust to drastic variations. Thus, in our tracking algorithm, a target candidate is approximately represented by the atoms of a learnt dictionary.

Based on a learnt dictionary $D$, a target candidate $y$ is approximately represented as

$$y \approx D\alpha = \sum_{j=1}^{K} \alpha_j d_j, \tag{4}$$

where $D = [d_1, \ldots, d_K]$ is the learnt dictionary, $d_j$ is an atom of $D$, $K$ is the number of atoms, and $\alpha_j$ is the coefficient of atom $d_j$. The dictionary coefficient is evaluated by solving

$$\min_{\alpha} \; \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_1, \tag{5}$$

where $\alpha = [\alpha_1, \ldots, \alpha_K]^{\top}$ is the coefficient vector for the target candidate $y$ associated with the learnt dictionary $D$.
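A common way to solve this ℓ1-regularized coding problem is iterative soft-thresholding (ISTA); the sketch below is one such solver under assumed parameter names, not necessarily the solver the authors used:

```python
import numpy as np

def sparse_code(y, D, lam=0.01, n_iter=200):
    """Solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 by ISTA
    (gradient step on the quadratic term, then soft-thresholding)."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)         # gradient of the data-fit term
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a
```

With a small `lam`, the recovered coefficients reconstruct `y` closely while remaining sparse.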

##### 2.2. Dictionary Learning

In existing tracking algorithms, target candidates are represented by target templates, which are some tracking results from previous frames. To improve the tracking performance, we use a learnt dictionary to approximately represent target candidates.

Denote by $Y = [y_1, \ldots, y_n]$ the set of training samples, and denote by $A = [\alpha_1, \ldots, \alpha_n]$ the coding vector matrix of the training samples over $D$, i.e., $Y \approx DA$. The learnt dictionary should have discriminative capability and be able to learn appearance variations such as partial occlusions, nonrigid deformation, and illumination variations. Based on the learnt dictionary, a stable target representation model is obtained. Motivated by the dictionary learning method in [30], we use a discriminative dictionary learning model:

$$\min_{D, A} \; r\left(Y, D, A\right) + \lambda_1 \|A\|_1 + \lambda_2 f\left(A\right), \tag{6}$$

where $r(Y, D, A)$ is the discriminative fidelity term, $\|A\|_1$ is the sparsity constraint on the coefficient matrix $A$, $f(A)$ is a discriminative constraint on the coding coefficient matrix, and $\lambda_1$ and $\lambda_2$ are parameters balancing these constraint terms. In [30], the dictionary includes a set of subdictionaries for all classes. Different from the dictionary learning in [30], in our tracking algorithm $D$ is a one-class dictionary and is learnt from only a set of positive training samples.

In the dictionary learning process, based on the reconstruction error between the training samples $Y$ and the dictionary $D$, the discriminative fidelity term is defined as

$$r\left(Y, D, A\right) = \|Y - DA\|_F^2. \tag{7}$$

To improve the discriminative performance of the learnt dictionary, we add the Fisher discriminative criterion to minimize the within-class scatter of the coefficient matrix $A$. Denote by $S_W(A)$ the within-class scatter, which is defined as

$$S_W\left(A\right) = \sum_{i=1}^{n} \left(\alpha_i - m\right)\left(\alpha_i - m\right)^{\top}, \tag{8}$$

where $\alpha_i$ is a column vector of the coefficient matrix $A$ and $m$ is the mean of the coefficient vectors. In the learning process, we use the trace of $S_W(A)$ as the constraint term in (6). To prevent some coefficients in the constraint term from becoming too large, a regularized term is added, giving

$$f\left(A\right) = \mathrm{tr}\left(S_W\left(A\right)\right) + \eta \|A\|_F^2, \tag{9}$$

where $\eta$ is a balancing parameter.

Based on (7), (8), and (9), the dictionary learning model can be rewritten as

$$\min_{D, A} \; \|Y - DA\|_F^2 + \lambda_1 \|A\|_1 + \lambda_2 \left( \mathrm{tr}\left(S_W\left(A\right)\right) + \eta \|A\|_F^2 \right), \tag{10}$$

where $\lambda_1$, $\lambda_2$, and $\eta$ are positive scalar parameters.

In (10), we update the dictionary $D$ and the corresponding coefficient matrix $A$ iteratively: one is updated while the other is fixed. When the dictionary $D$ is fixed, the optimization function reduces to

$$\min_{A} \; \|Y - DA\|_F^2 + \lambda_1 \|A\|_1 + \lambda_2 f\left(A\right), \tag{11}$$

and

$$f\left(A\right) = \sum_{i=1}^{n} \|\alpha_i - m\|_2^2 + \eta \|A\|_F^2, \tag{12}$$

where $\alpha_i$ is a column vector of the coefficient matrix $A$ and $m$ is the mean of the coefficient vectors.

When $A$ is fixed, the dictionary $D$ in (10) is updated as

$$\min_{D} \; \|Y - DA\|_F^2, \quad \text{s.t.} \; \|d_j\|_2 = 1, \; j = 1, \ldots, K. \tag{13}$$
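The alternating scheme above can be sketched as follows. The solver choices (proximal gradient for the coefficients, ridge least squares with atom renormalization for the dictionary) and all parameter names are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def learn_dictionary(Y, k, lam1=0.01, lam2=0.01, eta=1.0, n_outer=5, n_inner=30):
    """Alternate between the coefficient matrix A and the dictionary D for
    min_{D,A} ||Y - DA||_F^2 + lam1*||A||_1
              + lam2*(sum_i ||a_i - m||_2^2 + eta*||A||_F^2),
    where m is the mean coefficient vector (a sketch, not the exact solver)."""
    rng = np.random.default_rng(0)
    d, n = Y.shape
    D = rng.normal(size=(d, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))

    def update_A(A):
        # proximal gradient (ISTA) on the smooth terms, soft-threshold for l1
        L = 2.0 * np.linalg.norm(D, 2) ** 2 + 2.0 * lam2 * (1.0 + eta)
        for _ in range(n_inner):
            m = A.mean(axis=1, keepdims=True)
            grad = 2.0 * D.T @ (D @ A - Y) + 2.0 * lam2 * ((A - m) + eta * A)
            Z = A - grad / L
            A = np.sign(Z) * np.maximum(np.abs(Z) - lam1 / L, 0.0)
        return A

    for _ in range(n_outer):
        A = update_A(A)
        # dictionary step with A fixed: ridge least squares, then keep atoms
        # at unit norm (re-drawing any atom that went unused)
        D = Y @ A.T @ np.linalg.inv(A @ A.T + 1e-8 * np.eye(k))
        dead = np.linalg.norm(D, axis=0) < 1e-10
        D[:, dead] = rng.normal(size=(d, int(dead.sum())))
        D /= np.linalg.norm(D, axis=0, keepdims=True)
    return D, update_A(A)   # final coding pass so A matches the returned D
```

The final coefficient update ensures the returned `(D, A)` pair is consistent after the last atom renormalization.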

In the learning process, the training samples are of prime importance: they should reflect the recent variations of the tracked target and retain enough diversity to adapt to appearance variations. In the first frame, a set of training samples is collected. First, the initialized target is selected as a training sample; the other training samples are then generated by perturbing the target by a few pixels around its center location.

In order to keep the diversity of the learnt dictionary with respect to appearance variations, we set the size of the training set to 25. In the subsequent frames, we update the training samples and relearn the dictionary to adapt to target appearance variations. At the current frame, once the tracked target state is computed and the target located, we crop the corresponding image region and extract its feature vector as a new training sample. The new sample is then added to the current training set and the oldest training sample is swapped out.
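This fixed-size training set with oldest-sample replacement is exactly a FIFO buffer; a minimal sketch (feature extraction omitted):

```python
from collections import deque

MAX_SAMPLES = 25  # size of the training set, as stated in the text

def make_sample_buffer(initial_samples):
    """FIFO buffer of training samples; appending beyond maxlen
    automatically drops the oldest sample."""
    return deque(initial_samples, maxlen=MAX_SAMPLES)

def add_sample(buf, feature_vec):
    buf.append(feature_vec)  # oldest entry is swapped out when full
    return buf
```

After each frame, the dictionary would be relearnt from the current contents of the buffer.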

##### 2.3. Likelihood Evaluation

The similarity metric between a target candidate and the learnt dictionary is an important issue in visual tracking. In this work, the similarity is measured as

$$d\left(y, D\right) = \|y - D\alpha\|_2^2, \tag{14}$$

where $D$ is the learnt dictionary and $\alpha$ is the coefficient vector over the learnt dictionary computed in (6).

Based on the distance between a target candidate and the corresponding template dictionary, the target observation likelihood is computed as

$$p\left(y_t^i \mid x_t^i\right) = \Gamma \exp\left(-\frac{d\left(y_t^i, D\right)}{2\sigma^2}\right), \tag{15}$$

where $d(y_t^i, D)$ is the distance between the target candidate and the corresponding dictionary $D$, $\sigma$ is the standard deviation of the Gaussian, and $\Gamma$ is a positive parameter.
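A direct sketch of this likelihood, with `sigma` and `gamma` as assumed names for the Gaussian standard deviation and the positive scaling parameter in (15):

```python
import numpy as np

def likelihood(y, D, a, sigma=0.1, gamma=1.0):
    """Observation likelihood from the reconstruction error
    d = ||y - D a||_2^2; sigma and gamma are illustrative parameter names."""
    d = float(np.sum((y - D @ a) ** 2))
    return gamma * np.exp(-d / (2.0 * sigma ** 2))
```

A candidate that the dictionary reconstructs perfectly attains the maximum value `gamma`; the likelihood decays as the reconstruction error grows.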

##### 2.4. Visual Tracking with Dictionary Based Representation

By integrating the proposed target representation, the online dictionary learning and updating, and the observation likelihood evaluation, the proposed visual tracking algorithm is outlined in Algorithm 1. The particle filter framework is used for all video sequences. For a video sequence, the tracked target is manually selected by a bounding box in the first frame. A set of particles (i.e., target candidates) is selected with equal weights in the particle filter framework. The training samples are collected and a dictionary is learnt in the first frame. In the subsequent tracking process, once the current target state is estimated, the current tracking result is added to the training samples and the dictionary is relearnt from the updated training set.
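To make the overall flow of Algorithm 1 concrete, here is a toy, self-contained version on synthetic 1-D data. The learnt dictionary is reduced to a buffer of recent target observations and the "image patch" is a scalar intensity, so only the structure of the loop (propagate particles, evaluate likelihoods, estimate the state, update the training set) mirrors the paper; everything else is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def patch(x, t):
    """Synthetic 1-D 'frame': an intensity bump centred on the true
    target position s_t = 0.5 * t (stand-in for cropping an image patch)."""
    return np.exp(-(x - 0.5 * t) ** 2)

def track(n_frames=20, n_particles=300, sigma_trans=1.0, sigma_lik=0.1):
    """Structural sketch of the tracking loop on synthetic data."""
    estimate = 0.5                    # given initial target state (frame 1)
    model = [patch(estimate, 1)]      # 'training samples' from frame 1
    for t in range(2, n_frames + 1):
        # propagate particles with the Gaussian transition model
        particles = estimate + rng.normal(0.0, sigma_trans, n_particles)
        # evaluate likelihoods of candidates against the appearance model
        obs = patch(particles, t)
        d = (obs - np.mean(model)) ** 2
        w = np.exp(-d / (2.0 * sigma_lik ** 2))
        w /= w.sum()
        # weighted-mean state estimate
        estimate = float(w @ particles)
        # update the training set with the new tracking result (max 25)
        model.append(patch(estimate, t))
        model = model[-25:]
    return estimate
```

Despite the degenerate appearance model, the loop follows the drifting synthetic target, which is the point of the sketch: estimation, likelihood evaluation, and model update feed into one another frame by frame.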