Abstract

In visual tracking scenarios with multiple objects, interference from similar objects can cause tracking to fail during the transition from occlusion to separation. To address this problem, this paper proposes a discriminative visual tracking algorithm based on multimanifold learning. A color-gradient-based feature tensor is used to describe object appearance so that partial occlusion can be accommodated. A prior multimanifold tensor dataset is established through the template matching tracking algorithm. For the purpose of discrimination, a tensor distance is defined to determine the intramanifold and intermanifold neighborhood relationships in the multimanifold space. Multimanifold discriminate analysis is then employed to construct the multilinear projection matrices of the submanifolds. Finally, object states are obtained by combining these with sequence inference. Meanwhile, the multimanifold dataset and the embedded projection obtained by manifold learning are updated online. Experiments were conducted on two real visual surveillance sequences to compare the proposed algorithm with three state-of-the-art tracking methods, both qualitatively and quantitatively. The experimental results show that the proposed algorithm tracks effectively and robustly in scenarios where multiple similar objects occlude one another.

1. Introduction

Visual tracking is an important research area in computer vision and pattern recognition that can be applied to many domains, such as visual surveillance, traffic monitoring, human-computer interaction, image compression, three-dimensional reconstruction, and automatic weapon tracking. To make these applications viable, the results of visual tracking must be robust and precise.

Visual tracking is a challenging problem due to object appearance variations. Many issues can cause object appearance variations, including camera motion, camera viewpoint changes, environmental illumination changes, noise disturbance, background clutter, pose variation, object shape deformation, and occlusion [1].

1.1. Related Works

In recent years, a wide range of tracking algorithms have been proposed to deal with these object appearance variations. These algorithms can be roughly classified into two categories according to their model-construction mechanism: generative and discriminative methods.

Generative methods focus on how to describe the appearance model robustly and then find the image patch whose appearance best matches that of the object. The classical template matching tracking algorithm can be viewed as a generative model. The earliest template-based tracking method dates back to the Lucas-Kanade algorithm. The eigen-tracking algorithm [2] demonstrated that tracking can be considered as finding the minimum distance from the appearance of the tracked object to the subspace that represents it. Matthews et al. [3] showed how to update the template so as to avoid the “drifting” inherent in the naive method. The IVT tracking algorithm [4] uses subspace learning to generate a low-dimensional object appearance representation and updates it incrementally. Hu et al. [5] proposed a visual object tracking algorithm that models appearance changes by incrementally learning a tensor subspace representation; during tracking, the sample mean and an eigenbasis for each unfolding matrix of the tensor are adaptively updated. The classical mean-shift tracker [6] uses a histogram as the appearance model and then applies the mean-shift procedure to locate the object. FragTrack [7] uses several fragments to build the appearance model, which can handle pose change and partial occlusion. The L1-tracker [8] casts tracking as a sparse approximation problem in which the object is modeled by a sparse linear combination of target templates and a set of trivial templates. The sparse representation is obtained by solving an $\ell_1$-regularized least-squares problem, and the posterior probability that a candidate image patch belongs to the object class is inversely proportional to the residual between the candidate image patch and its reconstruction. The L1-APG tracker [9] extends the L1-tracker so that it not only runs in real time but also improves tracking accuracy. The S-MTT algorithm [10] regularizes the appearance model representation problem with sparsity-inducing mixed norms, which can handle particles independently.

Discriminative methods treat visual tracking as a binary classification problem that aims to separate the object from its complex surrounding background within a small local region. Many recently proposed visual tracking algorithms are based on boosting classifiers because of their powerful discriminative learning capabilities, and online boosting is widely applied in object detection and visual tracking. Grabner et al. [11] proposed an online boosting tracker that first evaluates the discriminative power of each feature in a candidate feature pool. An online semisupervised boosting method [12] was then proposed to alleviate the object drifting problem in visual tracking. Ensemble tracking [13] uses weak classifiers to construct a confidence map by pixel classification in order to distinguish between the foreground and the background. The MIL tracker [14] represents an object by a set of samples; the samples, which correspond to image patches, are grouped into positive and negative bags. Multiple instance boosting is then used to overcome the problem that slight inaccuracies in labeled training examples can cause object drift. However, tracking may still fail when the training samples are imprecise. To address this problem, the WMIL tracker [15], which integrates sample importance into multiple instance learning, was proposed. The SVM tracker [16] combines a support vector machine with optical flow to achieve visual tracking. A visual tracking algorithm that uses an online feature selection mechanism to evaluate multiple object features is proposed in [17]. The VTD algorithm [18] designs the observation and motion models based on a visual tracking decomposition scheme. The TLD tracker [19] explicitly decomposes the long-term tracking problem into three components: tracking, learning, and detection. The CT tracker [20] extracts sparse image features and combines them with a naive Bayes classifier to separate the object from the background.

In scenarios with multiple moving objects, the movement of one object may block the light reflected from other objects toward the camera lens, so that the projections of those objects on the imaging plane are incomplete or even completely invisible. When occlusion occurs and the tracked object is similar to the occluding object, the tracker is vulnerable to the influence of the similar object during the transition from occlusion to separation, which can cause drift. Thus, it is necessary to distinguish the tracked object from potentially similar objects in the scenarios. Meanwhile, when the object is partially occluded, the information from the unoccluded part is highly valuable for determining the object state. Therefore, the object feature must maintain the structural relationships of the original space. This paper proposes a visual object tracking algorithm for the mutual occlusion of multiple similar objects that combines these two ideas. First, a feature function is designed to extract a tensor feature that maintains the spatial structure of the object. The multimanifold tensor data set is collected by the template matching tracking algorithm in the initial few frames. A tensor distance is defined to determine the intramanifold and intermanifold neighborhood relationships. The object feature tensor is embedded into a low-dimensional space by multimanifold discriminate analysis. The object state in the next frame is then obtained by Bayesian sequence inference. Because the object appearance changes over time, an update strategy for the multimanifold data set is also required.

1.2. Plan of the Paper

This paper is organized as follows. Section 2 introduces the notation of tensor algebra and the feature tensor. Multimanifold discriminate analysis is reviewed in Section 3. Section 4 details the visual tracking framework. Section 5 presents comparative experimental results and analysis, and conclusions are drawn in Section 6.

2. Feature Tensor

A tensor is a high-order array that can preserve the original spatial structure of an object. Constructing a feature tensor from an object appearance image can increase tracking accuracy.

2.1. Tensor Algebra

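A tensor can be viewed as a multiorder array that exists in multiple vector spaces; tensor algebra is the mathematical foundation of multilinear analysis [21]. An $N$-order tensor is denoted as $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$; each element of this tensor is represented as $a_{i_1 i_2 \cdots i_N}$ for $1 \le i_n \le I_n$.

The mode-$n$ unfolding matrix $A_{(n)} \in \mathbb{R}^{I_n \times \prod_{j \ne n} I_j}$ of a tensor $\mathcal{A}$ consists of all the mode-$n$ column vectors.

The mode-$n$ product of a tensor $\mathcal{A}$ and a matrix $U \in \mathbb{R}^{J_n \times I_n}$ is $\mathcal{B} = \mathcal{A} \times_n U$, which is a new tensor $\mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N}$. The elements of this tensor are
$$b_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n = 1}^{I_n} a_{i_1 \cdots i_n \cdots i_N}\, u_{j i_n},$$
where $a_{i_1 \cdots i_N}$, $u_{j i_n}$ are the elements of tensor $\mathcal{A}$ and matrix $U$.

The inner product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is
$$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1 = 1}^{I_1} \cdots \sum_{i_N = 1}^{I_N} a_{i_1 \cdots i_N}\, b_{i_1 \cdots i_N}.$$
The Frobenius norm of a tensor is
$$\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}.$$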
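The operations of this subsection can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, modes are indexed from 0, and the column ordering of the unfolding follows NumPy's row-major layout, which may differ from the ordering used in [21].

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: the mode-n fibres become the columns of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Mode-n product of a tensor with a matrix of shape (J, I_mode)."""
    # Contract the chosen tensor axis with the matrix columns, then move the
    # new axis (length J) back to the position of the contracted axis.
    result = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(result, 0, mode)

def inner(a, b):
    """Inner product: sum of elementwise products of two same-shape tensors."""
    return float(np.sum(a * b))

def frobenius_norm(a):
    """Frobenius norm: square root of the inner product of a tensor with itself."""
    return float(np.sqrt(inner(a, a)))
```

A quick consistency check is that unfolding a mode-n product equals the matrix times the unfolding of the original tensor, that is, `unfold(mode_product(A, U, n), n)` matches `U @ unfold(A, n)`.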

2.2. Feature Tensor

An object appearance image from an RGB color video sequence is three-dimensional data, which forms a natural tensor structure. The color and edge information of the object discriminate well between object classes, and the gradient feature describes the object edge information. For a detailed description of object information, the feature function of an object appearance image is defined as follows: where $R_x$, $R_y$, $G_x$, $G_y$, $B_x$, $B_y$ are the $x$-direction and $y$-direction gradients on the $R$, $G$, and $B$ color channels.

Each pixel of the object appearance image corresponds to a twelve-dimensional feature vector; an object appearance image of size $m \times n$ therefore corresponds to an $m \times n \times 12$ feature tensor.
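The construction of a per-pixel color-gradient feature tensor can be sketched as below. The exact twelve components of the feature function are not recoverable from the text above, so the channel layout here is an assumption made only for illustration: the three color values, the absolute $x$- and $y$-gradients of each channel, and the gradient magnitude of each channel. The function name `feature_tensor` is hypothetical.

```python
import numpy as np

def feature_tensor(image):
    """Map an H x W x 3 RGB image to an H x W x 12 feature tensor.

    Assumed channel layout (illustrative, not taken from the paper):
    0-2 : R, G, B values
    3-5 : |dR/dx|, |dR/dy|, gradient magnitude of R
    6-8 : the same three quantities for G
    9-11: the same three quantities for B
    """
    image = np.asarray(image, dtype=float)
    channels = [image[..., c] for c in range(3)]
    extra = []
    for channel in channels:
        # np.gradient returns the derivative along rows (y) first, then columns (x).
        gy, gx = np.gradient(channel)
        extra += [np.abs(gx), np.abs(gy), np.hypot(gx, gy)]
    return np.stack(channels + extra, axis=-1)
```

Because every pixel keeps its position in the first two modes, an occluded region only corrupts the corresponding slice of the tensor, which is the property the text relies on for partial occlusion.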

3. Multimanifold Discriminate Analysis

The basic assumption of manifold learning is that high-dimensional data can be considered as geometrically correlated points that lie on a low-dimensional smooth manifold. A single object class usually corresponds to one submanifold structure, and different objects lie on different submanifolds. Multimanifold discriminate analysis projects the tensor data from a submanifold into a low-dimensional space.

3.1. Multimanifold Neighborhood Relationship of Feature Tensor

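The appearances of each object under different poses usually form a submanifold, and the appearance spaces of multiple different objects form the multimanifold. A feature tensor $\mathcal{X}$ can be extracted from each moving object appearance image in the video sequence. The set of feature tensors calculated from the appearance images of the $i$th object in the first $K$ frames is denoted as $M_i = \{\mathcal{X}_{i1}, \ldots, \mathcal{X}_{iK}\}$; then $M_i$ can be seen as a submanifold. Because multiple moving objects are present in the scenarios, the set of all submanifolds $M = \{M_1, \ldots, M_c\}$ is a multimanifold dataset [22].

The entry $\mathcal{X}_{i_1 i_2 \cdots i_N}$ of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ corresponds to the $l$th element $x_l$ of its vectorization $x$, where
$$l = i_1 + \sum_{j=2}^{N} (i_j - 1) \prod_{o=1}^{j-1} I_o.$$
The distance between two tensors $\mathcal{X}$ and $\mathcal{Y}$ (the order and dimensions of $\mathcal{X}$ and $\mathcal{Y}$ are the same) is
$$d_{TD}(\mathcal{X}, \mathcal{Y}) = \sqrt{\sum_{l,m} g_{lm} (x_l - y_l)(x_m - y_m)},$$
where $g_{lm}$ is the measurement coefficient. Since there are too many entries in the tensor data, the measurement coefficient is defined by the distance between points that have a spatial neighborhood relationship:
$$g_{lm} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{\|p_l - p_m\|_2^2}{2\sigma^2}\right),$$
where $\sigma$ is the regularization parameter and $\|p_l - p_m\|_2$ is the location distance between $x_l$ and $x_m$. If $x_l$ and $x_m$, respectively, correspond to $\mathcal{X}_{i_1 \cdots i_N}$ and $\mathcal{X}_{i'_1 \cdots i'_N}$ in tensor $\mathcal{X}$, then
$$\|p_l - p_m\|_2 = \sqrt{(i_1 - i'_1)^2 + \cdots + (i_N - i'_N)^2}.$$

The intramanifold neighborhood of a tensor is obtained as follows: calculate the tensor distance $d_{TD}$ between the tensor $\mathcal{X}_{ij}$ in submanifold $M_i$ and every other tensor in this submanifold; the $k_1$ nearest intramanifold neighbors of $\mathcal{X}_{ij}$ are then selected according to the tensor distance $d_{TD}$.

The intermanifold neighborhood of a tensor is obtained as follows: calculate the tensor distance $d_{TD}$ between the tensor $\mathcal{X}_{ij}$ in submanifold $M_i$ and every tensor in each other submanifold $M_k$ ($k \ne i$); the $k_2$ nearest intermanifold neighbors of $\mathcal{X}_{ij}$ are then selected according to the tensor distance $d_{TD}$.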
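A minimal sketch of a tensor distance with a location-dependent measurement coefficient, and of the neighbor search built on it, is given below. The Gaussian form of the coefficient, the default `sigma`, and all function names are assumptions made for illustration. The dense coefficient matrix grows with the square of the number of tensor entries, so this sketch is only practical for small tensors.

```python
import numpy as np

def metric_matrix(shape, sigma=1.0):
    """Coefficients g[l, m] that decay with the distance between entry positions."""
    # One row of integer coordinates per tensor entry, in the same (row-major)
    # order that ravel() uses in tensor_distance below.
    positions = np.indices(shape).reshape(len(shape), -1).T.astype(float)
    diff = positions[:, None, :] - positions[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

def tensor_distance(x, y, g):
    """sqrt of sum over l, m of g[l, m] * (x_l - y_l) * (x_m - y_m)."""
    d = (np.asarray(x, dtype=float) - np.asarray(y, dtype=float)).ravel()
    # Clamp at zero to guard against tiny negative values from round-off.
    return float(np.sqrt(max(d @ g @ d, 0.0)))

def nearest_neighbours(query, candidates, g, k):
    """Indices of the k candidates closest to `query` under the tensor distance."""
    dists = [tensor_distance(query, c, g) for c in candidates]
    return [int(i) for i in np.argsort(dists)[:k]]
```

Passing the other tensors of the same submanifold as `candidates` gives an intramanifold neighborhood, while passing the tensors of all other submanifolds gives an intermanifold neighborhood.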

The multimanifold dataset and its neighborhood relationship are shown in Figure 1.

As can be seen from Figure 1, there are four initial moving objects in the scenarios, thus constructing four submanifolds which are , , , ; these four submanifolds formed a multimanifold. The intramanifold neighborhood relationship of tensor in submanifold is , , ; the intermanifold neighborhood relationship of this tensor is , , , , .

3.2. Multimanifold Discriminate Analysis

The objective of manifold learning is to recover the low-dimensional structure from the high-dimensional datum space and find a low-dimensional embedding map. In the multiple similar objects scenarios, it is hoped that the extracted object feature can distinguish the object and the potential similar objects in the scenarios. The objective of multimanifold learning is that the difference between a tensor and intramanifold neighborhood points decreases and the difference between the tensor and intermanifold neighborhood points increases in the embedded space. Considering these, the objective function of multimanifold discriminate analysis is where , , are the multilinear projection matrices in the first-order, second-order, and third-order which are corresponding to the tensor in the submanifold . Consider where is the number of submanifold points; are the number of intramanifold and intermanifold neighborhood. and are the intramanifold and intermanifold weight matrices; the size is ; the elements are separately as follows: where is the tensor distance; is bandwidth, which is the weighted coefficient of tensor in the submanifold . Consider Then can be viewed as the weighted center of submanifold .

Due to the fact that there is no closed optimal solution of the optimization problem in (9), for the purpose of computing , recursively solve the projection matrix in every order of the tensor feature. Consider where Then To maximize the by solving the eigen-value equation, obtain .

The eigen-values are ; the corresponding eigen-vector of eigen-value is , where is the dimension th order in the original feature tensor from submanifold . The directional projection positive along the eigen-vector which is corresponding to the eigen-value of is positive; that is, intermanifold neighborhood distance of tensors is bigger than the intramanifold neighborhood distance which are projected along this direction. Therefore, the projection matrix consists of all of the eigen-vectors which are corresponding to the positive eigen-values. Thus, the tensor data which are in submanifold can be embedded in a low-dimensional space via multilinear projection matrix , , . In this lower-dimensional space, the difference between tensor data and its intramanifold neighborhood points decreases and the difference between it and its intermanifold neighborhood points increases, so that the distinguishing ability between the object and the similar ones is greater.
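The eigenvector selection described above can be illustrated for a single mode. The sketch assumes a simplified, unweighted criterion in which the mode-$n$ scatter of intermanifold differences minus the mode-$n$ scatter of intramanifold differences is eigendecomposed; the weight matrices and the alternating update over the three modes are omitted. The function names and this simplified criterion are assumptions, not the paper's exact objective.

```python
import numpy as np

def mode_scatter(diffs, mode):
    """Sum of D_(n) D_(n)^T over difference tensors D, with D_(n) the mode-n unfolding."""
    size = diffs[0].shape[mode]
    scatter = np.zeros((size, size))
    for d in diffs:
        unfolded = np.moveaxis(d, mode, 0).reshape(size, -1)
        scatter += unfolded @ unfolded.T
    return scatter

def positive_eigen_projection(inter_diffs, intra_diffs, mode):
    """Eigenvectors of (S_inter - S_intra) whose eigenvalues are positive.

    inter_diffs / intra_diffs: lists of difference tensors between a sample and
    its intermanifold / intramanifold neighbours (all of the same shape).
    Returns (projection matrix with eigenvectors as columns, kept eigenvalues).
    """
    m = mode_scatter(inter_diffs, mode) - mode_scatter(intra_diffs, mode)
    values, vectors = np.linalg.eigh(m)   # m is symmetric, so eigh applies
    keep = values > 0
    return vectors[:, keep], values[keep]
```

Along each kept direction the intermanifold scatter exceeds the intramanifold scatter, which matches the reason given in the text for retaining only the positive eigenvalues.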

4. Visual Tracking Framework

In order to track an object in the scenarios, Bayesian sequence inference is used to obtain the final object state. Meanwhile, the multimanifold datasets and the multilinear projection matrices calculated by multimanifold discriminate analysis should be updated.

4.1. Sequence Inference

In the visual tracking problem, the movement of the object cannot be predicted, and the object state in the current frame is related only to that in the prior frame; the visual tracking process is therefore a Markov process [23]. A bounding box $X_t = (x_t, y_t, w_t, h_t)$ is used to describe the object state at the $t$th frame, where $(x_t, y_t)$, $w_t$, $h_t$ denote the upper left corner coordinate, the width, and the height of the bounding box.

Given a set of observed object appearance images $O_{1:t} = \{O_1, \ldots, O_t\}$, the objective of visual tracking is to obtain the optimal estimate of the hidden state variable $X_t$. According to Bayes’ theorem, the posterior of the object state satisfies
$$p(X_t \mid O_{1:t}) \propto p(O_t \mid X_t) \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid O_{1:t-1})\, dX_{t-1},$$
where $p(X_t \mid X_{t-1})$ refers to the state transition model and $p(O_t \mid X_t)$ refers to the observation model. According to the observation model $p(O_t \mid X_t)$, we can obtain the tracking results.

State Transition Model. This models the movement of the object between consecutive frames. Because the object moves irregularly, its state is difficult to predict; however, the object does not move very fast, so the object state in the current frame is assumed to be near that in the prior frame. The object state $X_t$ is therefore modeled by an independent Gaussian distribution around its counterpart in state $X_{t-1}$, described as
$$p(X_t \mid X_{t-1}) = \mathcal{N}(X_t; X_{t-1}, \Sigma),$$
where $\Sigma$ is the diagonal covariance matrix corresponding to the variables $x$, $y$, $w$, $h$, whose elements are $\sigma_x^2$, $\sigma_y^2$, $\sigma_w^2$, $\sigma_h^2$. $N_p$ particles can be randomly generated from this Gaussian distribution. Each particle corresponds to an object state, so the $N_p$ particles yield multiple candidate states $\{X_t^1, \ldots, X_t^{N_p}\}$. The more particles are generated, the more accurate the object state estimate is, but the lower the computational efficiency. A balance between these factors must be struck for the tracking algorithm to be both efficient and effective.
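Drawing particles from an independent Gaussian around the previous state can be sketched with the standard library alone. The function name and the optional `rng` parameter are hypothetical; note that `random.gauss` expects standard deviations, so variances such as those reported in the experiments would need a square root first.

```python
import random

def draw_particles(prev_state, sigmas, count, rng=None):
    """Sample `count` candidate states (x, y, w, h) around the previous state.

    prev_state: (x, y, w, h) of the bounding box in the prior frame.
    sigmas:     standard deviation of each of the four state variables.
    rng:        optional random.Random instance for reproducible sampling.
    """
    rng = rng or random.Random()
    return [
        tuple(rng.gauss(value, sigma) for value, sigma in zip(prev_state, sigmas))
        for _ in range(count)
    ]
```

For example, `draw_particles((120, 80, 32, 64), (4, 4, 1, 1), 200)` would produce 200 candidate boxes that stay close to the prior one; these numbers are illustrative only.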

Observation Model. This measures the difference between the appearance observation and the object appearance model. Given a drawn particle state and the corresponding image patch cropped from the frame image , the probability that the image patch is generated from the submanifold space is inversely proportional to the difference between the image patch and the appearance model; it can be calculated as the negative exponential of the distance between the projected data and the weighted center of the submanifold. Consider where indicates the bandwidth, is the Frobenius norm, and , , are the multilinear projection matrix of the th object in submanifold .

The state corresponding to the maximum is the optimal object state at the th frame. Let represent the error between feature tensor which is calculated by observation and the weighted center of submanifold .
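The scoring of particles can be sketched as follows. The exact observation model is not recoverable from the text above, so the form used here is an assumption: the difference between a candidate feature tensor and the weighted center is projected with one matrix per mode, and the likelihood is the negative exponential of the squared Frobenius norm divided by the bandwidth. All names are hypothetical.

```python
import numpy as np

def project(tensor, matrices):
    """Multilinear projection: apply U_n^T along mode n for each matrix U_n."""
    out = tensor
    for mode, u in enumerate(matrices):
        # u has shape (I_n, d_n); contracting with u.T maps mode n to length d_n.
        out = np.moveaxis(np.tensordot(u.T, out, axes=(1, mode)), 0, mode)
    return out

def likelihood(candidate, centre, matrices, bandwidth=1.0):
    """Return (assumed likelihood, projection error) for one candidate tensor."""
    error = float(np.linalg.norm(project(candidate - centre, matrices)))
    return float(np.exp(-error ** 2 / bandwidth)), error

def best_particle(candidates, centre, matrices, bandwidth=1.0):
    """Index of the candidate feature tensor with the highest likelihood."""
    scores = [likelihood(c, centre, matrices, bandwidth)[0] for c in candidates]
    return int(np.argmax(scores))
```

The error returned by `likelihood` plays the role of the distance to the weighted center that the update strategy later compares against its running mean.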

4.2. Multimanifold Data Sets Update

The appearance image of an object changes as it moves in the scenarios, so the submanifold of the object should contain appearance feature tensors for its different poses. Therefore, the multimanifold data set should be updated during tracking. However, factors such as occlusion also influence the object appearance: the appearance images of the tracked object then contain non-object information, and the resulting feature tensor does not lie on the submanifold. An update strategy is therefore necessary. Intuitively, the appearance of an object changes during occlusion: the change between consecutive frames is large, or the object feature tensor lies far from the center of the submanifold in the embedded space. Conversely, when the change between consecutive frames is small, or the feature tensor lies near the center of the submanifold in the embedded space, the object state is well determined.

The first-order image entropy describes the gray value distribution of the object image but does not consider its spatial distribution, whereas the second-order image entropy uses a 2-tuple feature that takes the spatial distribution into account. The second-order image entropy can therefore describe the changes of the object. The 2-tuple feature is $(i, j)$, where $i$ is the gray value of a pixel and $j$ is its neighborhood gray value. $P_{ij} = f(i, j)/N^2$ denotes the joint distribution of the gray value and the neighborhood gray value, where $f(i, j)$ is the number of occurrences of the 2-tuple feature $(i, j)$ and $N$ is the size of the image. The second-order entropy is defined as
$$H = -\sum_{i} \sum_{j} P_{ij} \log P_{ij}.$$
The difference of the object in consecutive frames is described by the second-order entropy. When the second-order entropy difference of the object image between consecutive frames is large, the object may be occluded. At the same time, the feature tensor of the appearance image is far from the weighted center of the submanifold; that is, the error is large. As shown in Figure 2, the object is largely occluded in frames 33–46 and 48–63 and partially occluded in frames 69–77.
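The second-order entropy of a gray image can be sketched in plain Python. Two details are assumptions made for illustration: the neighborhood gray value is taken as the rounded mean of the surrounding pixels (up to eight, fewer at the borders), and the logarithm is base 2. The function name is hypothetical.

```python
import math
from collections import Counter

def second_order_entropy(gray):
    """Entropy of (pixel value, neighbourhood mean) pairs of a 2-D gray image.

    gray: list of rows, each a list of integer gray values.
    """
    rows, cols = len(gray), len(gray[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            # Neighbouring pixels inside the image, excluding the pixel itself.
            window = [
                gray[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            neighbour = round(sum(window) / len(window)) if window else gray[r][c]
            counts[(gray[r][c], neighbour)] += 1
    total = rows * cols
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

The absolute difference between the values returned for the object patch in two consecutive frames then serves as the change measure discussed above.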

For a newly obtained best object state, when the difference of second-order entropy with respect to the prior frame satisfies $\Delta H < \lambda\, \overline{\Delta H}$ and the error in the embedded low-dimensional tensor space satisfies $e < \lambda\, \bar{e}$, the feature tensor calculated from the newly obtained object state is added to the submanifold, where $\overline{\Delta H}$ is the mean of the differences of second-order entropy, $\bar{e}$ is the mean of the errors, and $\lambda$ is the adjustment factor, which is set to 1.2 in this experiment.

When the number of tensors in a submanifold reaches a multiple of the initial number, multimanifold discriminate analysis is recomputed on the new multimanifold datasets, and the weighted centers of the submanifolds and the multilinear projection matrices are updated. A small portion of correctly determined object data will be discarded, but the tensors that are added to the data set are essentially feature tensors of the object appearance.
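The update decision can be summarized in two small predicates. This is a sketch under stated assumptions: the new sample is accepted when both the entropy difference and the embedding error are below the adjustment factor times their respective means, and the means are taken over whatever history the caller supplies. The function names and the choice of history are hypothetical.

```python
def should_add_to_submanifold(entropy_diff, error, past_entropy_diffs, past_errors, factor=1.2):
    """Accept a new feature tensor only if the object appears unoccluded."""
    mean_diff = sum(past_entropy_diffs) / len(past_entropy_diffs)
    mean_error = sum(past_errors) / len(past_errors)
    return entropy_diff < factor * mean_diff and error < factor * mean_error

def needs_relearning(tensor_count, initial_count):
    """Re-run the discriminate analysis when the set has grown to a multiple of its initial size."""
    return tensor_count > initial_count and tensor_count % initial_count == 0
```

With the default factor of 1.2, a frame whose entropy difference or error exceeds its mean by more than 20 percent is treated as possibly occluded and is left out of the submanifold.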

The whole tracking algorithm works as follows.
(1) Locate the object state in the first frame, either manually or by using an automated detector.
(2) Track the objects with the template matching tracking algorithm in the initial frames.
(3) Extract the feature tensors from the object appearance images, which are cropped according to the obtained object states.
(4) Construct the multimanifold dataset from the obtained feature tensors.
(5) Determine the neighborhood relationships in the multimanifold dataset using the tensor distance.
(6) Calculate the weighted center of each submanifold and the multilinear embedding matrices through multimanifold discriminate analysis.
(7) Advance to the next frame. Draw particles according to the prior object state and crop the appearance image corresponding to each particle. Extract the feature tensor of each appearance image. The best object state in the current frame is calculated by Bayesian sequence inference.
(8) Calculate the difference of second-order entropy with respect to the prior frame and the error in the embedded low-dimensional tensor space; if both are below the adjustment factor times their respective means, add the feature tensor calculated from the newly obtained object state to the submanifold.
(9) When the number of tensors in a submanifold reaches a multiple of the initial number, go to step (3).

5. Comparative Experiments and Analysis

To verify the effectiveness of the proposed algorithm, experiments are conducted on the CAVIAR data sets and the PETS outdoor multiperson data sets. The initial state of a moving object is determined by an automatic detector [24] or by manual annotation. The initial multimanifold data set is calculated from the object states produced by the template matching tracking algorithm. The proposed algorithm is compared with three state-of-the-art trackers: IVT [4], L1-APG [9], and MIL [14]. The number of particles used in Bayesian sequence inference affects the overall efficiency of the algorithm; as a compromise, it is set to 200. Each object appearance image is resized to a patch.

5.1. CAVIAR Data Sets

In this experiment, the scenario comes from the Portugal mall surveillance video data sets. The three objects undergo scale change, pose variation, and occlusion as they walk away from the camera. Testing video sequences are color images of resolutions. The Gaussian variances of the three objects are , , . The results are shown in Figure 3.

As can be seen from the results, the three main objects do not occlude one another during the first 57 frames, and the three comparison algorithms track them successfully. From the 57th frame, object 2 gradually occludes object 3 until object 3 can no longer be seen; the IVT and L1-APG algorithms both lose object 3 and drift to object 2, which leads to wrong tracking. From the 87th frame, object 1 gradually occludes object 3; the IVT tracker cannot distinguish them because object 1 is similar to object 3, so object 3 is mistaken for object 1 and the tracking goes wrong. Meanwhile, because the color of object 2 differs greatly from that of object 1 and object 3, the IVT and L1-APG trackers achieve better results when tracking object 2. The MIL tracker does not track any of the three objects accurately because of background interference. The proposed algorithm tracks all three objects completely and is not subject to interference from similar objects during tracking.

5.2. PETS Outdoor Multiperson Data Sets

In this experiment, the scenario comes from the PETS2009 surveillance video data sets. Multiple human objects that are similar to each other move around in multiple directions. The objects occlude one another, and their scale and pose vary as they walk. Testing video sequences are color images of resolutions. The Gaussian variances of the four objects are , , , . The results are shown in Figure 4.

As can be seen from the results, object 2 gradually and then completely occludes object 1 from the 26th frame, so that object 1 loses most of its information. The IVT and L1-APG trackers then lose object 1, while they successfully track object 2, which is not occluded. The MIL tracker roughly tracks objects 1 and 2. Object 1 occludes object 3 in the 36th frame; the IVT, L1-APG, and MIL trackers are then disturbed by object 1 when tracking object 3, and all three drift to object 1 because object 1 and object 3 are very similar. From the 56th frame, object 1 is occluded by object 4, and the IVT and L1-APG trackers are disturbed by object 1 when tracking object 4: the two trackers lose object 4 and drift to object 1, while the MIL tracker tracks object 4 successfully. Object 4 and object 2 occlude each other from the 64th frame; the MIL tracker fails to track object 4, while the IVT and L1-APG trackers track wrongly altogether. In this video sequence, one object frequently occludes another, which makes tracking very difficult; nevertheless, the proposed algorithm tracks successfully without excessive interference from similar objects and achieves complete tracking of the four objects.

5.3. Quantitative Evaluation

Aside from the qualitative comparison, we used two metrics to compare the tracking algorithms quantitatively: tracking success ratio and center location error [20]. We first manually labeled the “ground truth” locations in each experimental scenario.

The tracking success ratio is
$$\text{score} = \frac{\operatorname{area}(R_T \cap R_G)}{\operatorname{area}(R_T \cup R_G)},$$
where $R_T$ is the experiment tracking bounding box, $R_G$ is the ground truth bounding box, and $\operatorname{area}(\cdot)$ means the area of the region. The tracking result in one frame is considered a success when the tracking success ratio is above 0.5. The tracking success ratios of the four trackers in the two scenarios are shown in Figure 5.
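The overlap criterion can be computed directly from two axis-aligned boxes. The sketch below assumes boxes are given as (x, y, width, height) with (x, y) the upper left corner, matching the state description in Section 4.1; the function name is hypothetical.

```python
def success_ratio(box_a, box_b):
    """Intersection area divided by union area for two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A frame counts as a success when the returned value is above 0.5.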

As can be seen from Figure 5, the IVT and L1-APG trackers successfully track object 2 in the first scenario; the three comparison trackers do not completely track the other objects in either scenario because of disturbance from background information or similar objects. The tracking success ratios of the proposed algorithm for all seven objects in the two scenarios are greater than 0.5, which means that the algorithm tracks accurately and performs better than the other three trackers.

The center location error between the experiment bounding box and the ground truth bounding box is
$$\text{err} = \sqrt{(x_T - x_G)^2 + (y_T - y_G)^2},$$
where $x_T$, $y_T$, $x_G$, $y_G$ are the $x$-axis and $y$-axis coordinates of the centers of the experiment tracking bounding box and the ground truth bounding box.
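The center location error follows from the same (x, y, width, height) box representation, which is an assumption carried over from the state description in Section 4.1; the function name is hypothetical.

```python
import math

def center_location_error(box_t, box_g):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    cx_t, cy_t = box_t[0] + box_t[2] / 2, box_t[1] + box_t[3] / 2
    cx_g, cy_g = box_g[0] + box_g[2] / 2, box_g[1] + box_g[3] / 2
    return math.hypot(cx_t - cx_g, cy_t - cy_g)
```

Averaging this value over all frames of a sequence gives one per-object entry of the kind reported in Table 1, under the assumption that the table reports mean errors.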

The errors of the four trackers in the two scenarios are shown in Table 1. S2-O2-err represents the center location error of the second object in scenario 2. The data in bold indicate the best results.

As can be seen from Table 1, the other three trackers rarely achieve complete tracking, so their center location errors are large. The errors of the proposed method are significantly lower than those of the other three trackers and are within the acceptable range.

Our tracker is implemented in MATLAB 2012a and runs at 1.1 and 0.8 frames per second on an Intel Xeon 2.4 GHz CPU with 8 GB RAM, which falls short of real-time performance.

6. Conclusions

In this paper, we proposed a visual object tracking algorithm based on feature tensor multimanifold discriminate analysis, motivated by the observation that tracking is vulnerable to interference from similar objects. The object appearance model described by the feature tensor maintains the spatial structure of the object, which helps to deal with partial occlusion, and multimanifold discriminate analysis helps to distinguish the object from similar ones in the embedded low-dimensional subspace. In addition, the update strategy, designed from the perspective of object appearance change, determines whether the multimanifold datasets need to be updated. As the comparison experiments show, the proposed algorithm adapts to object pose variation and scale change, tracks without being disturbed by similar objects in the scenarios, and can achieve complete tracking even if the object is completely occluded. The proposed algorithm still has limitations: when the object is continuously occluded in scenarios with dense moving objects, the object appearance is incomplete, so an accurate multimanifold dataset cannot be constructed and tracking fails.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (10771043) and the National Natural Science Foundation of Inner-Mongolia Autonomous Region China under Grant (2012MS0931).