Abstract

Discriminative tracking methods use binary classification to discriminate between foreground and background and have achieved useful results. However, the labeled training samples available to them are insufficient for accurate tracking. Hence, discriminative classifiers must use their own classification results to update themselves, which may lead to feedback-induced tracking drift. To overcome these problems, we propose a semisupervised tracking algorithm that uses deep representation and transfer learning. Firstly, a 2D multilayer deep belief network is trained with a large number of unlabeled samples. The nonlinear mapping points at the top of this network are extracted as the feature dictionary. Then, this feature dictionary is utilized to transfer-train and update a deep tracker. The positive samples for training are the tracked vehicles, and the negative samples are the background images. Finally, a particle filter is used to estimate the vehicle position. We demonstrate experimentally that our proposed vehicle tracking algorithm can effectively restrain drift while also adapting to changes in vehicle appearance. Compared with similar algorithms, our method achieves a better tracking success rate and smaller average center-pixel errors.

1. Introduction

Visual vehicle tracking is one of the key technologies used in active vehicle safety applications, such as advanced driver assistance systems (ADAS) and intelligent vehicles. However, in real traffic situations, both the camera and the vehicles to be tracked are in motion. The visual tracking of target vehicles is often affected by complex backgrounds, changes in illumination, and occlusion by other vehicles or objects. These factors make vehicle tracking a challenging task in real traffic scenarios.

Existing tracking algorithms generally fall into one of two categories: generative models and discriminative models. Discriminative models convert tracking problems into binary classification problems of target and background. They have attracted much research interest in recent years. Many representative methods have been proposed, such as the online AdaBoost algorithm [1], the multiple instance learning algorithm [2], the TLD (tracking-learning-detection) algorithm [3], and the SVT (support vector tracking) algorithm [4].

The biggest challenge for discriminative-model-based tracking methods is the tracker drifting problem. This is because only a small number of labeled samples (in most cases, only one positive sample) can be used to train the classifier. Additionally, the tracking of subsequent image sequences depends on the classification results of the previous frame. So, if the tracked area of the previous frame is not locked onto the optimal target, this self-training approach can lead to classifier drift, which causes accumulated errors and further tracking failures.

To solve this shortcoming of the self-training approach, a new class of methods, known as semisupervised learning-based tracking, has been proposed [5–7]. In these methods, a large number of auxiliary images is used to maintain a feature dictionary that is able to describe images. The dictionary is then used to update the tracker online. However, these methods tend to use pixels from the original image or handcrafted features (such as Haar, HOG, or SIFT features) to generate the feature dictionary. Such features cannot satisfy the requirements of the image classification needed for robust tracking.

The newly developed deep learning framework is a bioinspired architecture that describes data such as images, voice, and text by mimicking the human brain's mechanisms of learning and analysis. Through deep learning, features are transformed from their original space in a lower layer to a new space in a higher layer [8]. Compared with handcrafted features, the automatic features generated by deep learning are more capable of expressing the internal properties of the data. Inspired by this, a semisupervised vehicle tracking algorithm that uses deep representation is proposed. The overall structure of the proposed method is shown in Figure 1. Firstly, a large number of unlabeled images is selected as training samples, and a 2D deep belief network (2D-DBN) is used as the classifier structure. Then, a multilayer deep network is trained with those unlabeled samples, and the nonlinear mapping nodes in the top layer are taken as the feature dictionary. In the tracking process, the subimage containing the vehicle to be tracked in the initial frame is set as the positive sample, and the surrounding background subimages are set as negative samples. After that, an online deep tracker is learned from the generated samples and the feature dictionary. Finally, a particle filter is used to estimate the vehicle position and restrict the tracker's search to a small area. The deep learning-based tracking algorithm is able to fully exploit the deep-level feature information embedded in the image. This effectively suppresses drift while keeping the vehicle tracking system updated.

Generally, compared with other semisupervised learning-based tracking methods that use handcrafted features [5–7], this work introduces the deep learning framework into traditional semisupervised learning. It applies a 2D-DBN deep model to generate features automatically during unsupervised offline training and then uses a small number of labeled samples to adjust the tracker online.

The remainder of this article is organized as follows. Section 2 introduces in detail the establishment of the deep-model-based semisupervised tracking framework, which contains both offline and online training steps. The vehicle position prediction method based on particle filtering is given in Section 3. Section 4 presents the experimental results and their analysis. Finally, a brief conclusion of this work is given in Section 5.

2. Establishment of Vehicle Tracking Based on Deep Modeling

The establishment of the deep tracker requires two steps (Figure 2). In the offline training session, a large number of unlabeled images is selected as training samples and the 2D-DBN is trained unsupervised. In the online training session, the parameters of the 2D-DBN are refined with labeled samples using a backpropagation algorithm.

2.1. Unsupervised Offline Training

In the unsupervised offline training process, many road images containing vehicles are selected, and a large number of unlabeled sample patches is generated from them. Then, all sample images are converted to grayscale and their pixel intensities are normalized. After that, the entire normalized matrix is input to the 2D-DBN to perform feature dictionary extraction.
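As a concrete illustration, the following Python sketch prepares such unlabeled patches for the visible layer; the patch size and the normalization to $[0,1]$ are assumptions, since the recovered text does not state the exact values.

```python
import numpy as np
from PIL import Image

PATCH_SIZE = (32, 32)  # assumed patch size; the paper's exact dimensions are not given

def preprocess(patch_img):
    """Convert an image patch to a normalized grayscale vector for the 2D-DBN."""
    gray = patch_img.convert("L").resize(PATCH_SIZE)   # grayscale conversion
    x = np.asarray(gray, dtype=np.float64) / 255.0     # normalize pixel intensities
    return x.reshape(-1)                               # flatten to the visible layer

# unlabeled_patches: an iterable of PIL.Image subimages cropped from road images
# X = np.stack([preprocess(p) for p in unlabeled_patches])
```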

The structure of the proposed 2D-DBN is shown in Figure 3. It is constructed with one visible layer, $v$, and two hidden layers, $h^1$ and $h^2$. The visible layer contains $n_v$ units, which is equal to the dimension of the input samples. The hidden layers $h^1$ and $h^2$ contain $n_{h^1}$ and $n_{h^2}$ units, respectively. The visible layer $v$ and hidden layer $h^1$ are connected with a group of weights, $W^1$. After unsupervised training to adjust the weights, the units in hidden layer $h^2$ can be considered as the feature dictionary.

In unsupervised training, the greedy layer-wise reconstruction algorithm is used to adjust the weights between each pair of layers [9]. Taking the visible layer $v$ and hidden layer $h^1$ as an example, $(v, h^1)$ can be seen as a restricted Boltzmann machine (RBM). The state energy $E(v, h^1; \theta)$ of the units of $v$ and $h^1$ can be written as

$$E(v, h^1; \theta) = -\sum_{i}\sum_{j} w_{ij} v_i h^1_j - \sum_{i} b_i v_i - \sum_{j} c_j h^1_j,$$

where $\theta = \{W, b, c\}$ are the weight parameters of the units between visible layer $v$ and hidden layer $h^1$, $w_{ij}$ is the connecting weight of unit $i$ in visible layer $v$ and unit $j$ in hidden layer $h^1$, and $b_i$ and $c_j$ represent the biases of the corresponding layers.

So, the RBM can be considered as a joint probability distribution:

$$P(v, h^1; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(v, h^1; \theta)\big), \qquad Z(\theta) = \sum_{v, h^1} \exp\big(-E(v, h^1; \theta)\big),$$

where $Z(\theta)$ is a normalization parameter.

Then, the conditional probability distributions of the input state and hidden state can be expressed with logistic functions:

$$P(h^1_j = 1 \mid v) = \sigma\Big(c_j + \sum_i w_{ij} v_i\Big), \qquad P(v_i = 1 \mid h^1) = \sigma\Big(b_i + \sum_j w_{ij} h^1_j\Big),$$

where $\sigma(x) = 1/(1 + e^{-x})$.

Based on the analysis above, the connecting weights and biases are updated with the contrastive divergence algorithm [9]:

$$\Delta w_{ij} = \eta\big(\langle v_i h^1_j \rangle_{\mathrm{data}} - \langle v_i h^1_j \rangle_{\mathrm{recon}}\big),$$

where $\langle \cdot \rangle_{\mathrm{data}}$ is the expectation under the data distribution, $\langle \cdot \rangle_{\mathrm{recon}}$ is the expectation under the reconstruction distribution, and $\eta$ is the update step size.
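The following NumPy sketch implements one CD-1 update following the equations above; the step size and mini-batch handling are placeholder choices, not the paper's reported settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, eta=0.01, rng=np.random):
    """One contrastive-divergence (CD-1) update for an RBM.

    v0 : (batch, n_v) visible vectors    W : (n_v, n_h) connection weights
    b  : (n_v,) visible bias             c : (n_h,) hidden bias
    eta: update step size (a placeholder value)
    """
    # positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(v0 @ W + c)                       # P(h = 1 | v0)
    h0 = (rng.random_sample(ph0.shape) < ph0) * 1.0
    # negative phase: one-step reconstruction of the visible layer
    pv1 = sigmoid(h0 @ W.T + b)                     # P(v = 1 | h0)
    ph1 = sigmoid(pv1 @ W + c)                      # P(h = 1 | v1)
    # contrastive divergence: <v h>_data - <v h>_recon
    n = v0.shape[0]
    W += eta * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += eta * (v0 - pv1).mean(axis=0)
    c += eta * (ph0 - ph1).mean(axis=0)
    return W, b, c
```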

By repeating these steps from lower layers to higher layers, the weights of $(v, h^1)$ and $(h^1, h^2)$ can be obtained. The weights of every unit in $h^2$ can then be considered as the nonlinear feature dictionary.
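Reusing `sigmoid` and `cd1_step` from the sketch above, the greedy layer-by-layer procedure can be outlined as follows; the hidden-layer sizes are illustrative placeholders, not the paper's values.

```python
def train_rbm(X, n_hidden, epochs=10, batch=100, eta=0.01, rng=np.random):
    """Train one RBM layer greedily; return its parameters and hidden activations."""
    n_vis = X.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))  # small Gaussian initialization
    b, c = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        for i in range(0, X.shape[0], batch):
            W, b, c = cd1_step(X[i:i + batch], W, b, c, eta, rng)
    return W, b, c, sigmoid(X @ W + c)

# Stack two layers, v -> h1 -> h2 (hidden sizes are illustrative placeholders):
# W1, b1, c1, H1 = train_rbm(X, n_hidden=512)
# W2, b2, c2, H2 = train_rbm(H1, n_hidden=128)  # units of h2 form the feature dictionary
```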

In real practice, many vehicle subimages captured from road videos are input to the visible layer one by one. Then, with the unsupervised training process given above, all the weights of the 2D-DBN are adjusted; if we graphically display the weights in the top layer, the learned features can be viewed. Some of the features generated by the proposed unsupervised learning method are shown in Figure 4. It can be seen that the features are most sensitive to shapes such as edges and corners, which occur frequently in vehicles.

2.2. Tracker Online Training

When the tracked area is identified in the first frame of a video, positive and negative samples are generated. Firstly, the tracked area is chosen as the positive sample and is then rotated from −5 to 5 degrees in steps of 1 degree to generate more positive samples. Secondly, negative subimages are generated in a neighborhood ring area around the tracked area. The neighborhood area is defined as $\{c \mid r < \|c - c_0\| < R\}$, in which $c$ is the center of a negative sample, $c_0$ is the center of the tracked area, and $r$ and $R$ are the inner and outer radii of the ring area.
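A minimal sketch of this sample-generation scheme, assuming `scipy.ndimage.rotate` for the rotations; the ring radii and the number of negative samples are left as parameters because the recovered text does not report them.

```python
import numpy as np
from scipy.ndimage import rotate

def positive_samples(patch):
    """Rotate the tracked patch from -5 to +5 degrees in 1-degree steps."""
    return [rotate(patch, angle, reshape=False, mode="nearest")
            for angle in range(-5, 6)]

def negative_centers(c0, r_in, r_out, n, rng=np.random):
    """Sample n negative-patch centers in the ring r_in < ||c - c0|| < r_out."""
    radii = rng.uniform(r_in, r_out, n)
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([c0[0] + radii * np.cos(angles),
                     c0[1] + radii * np.sin(angles)], axis=1)
```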

Based on the offline-trained DBN, a label layer with a sigmoid function is introduced to generate a two-dimensional feedforward artificial neural network (Figure 5). In the online training process, the positive and negative samples and their labels are input to this neural network. The backpropagation algorithm is used to perform supervised training and adjust all the network weights. Finally, all the subimages in subsequent frames are loaded into the newly updated network for judgment.
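Reusing `sigmoid` from the earlier RBM sketch, one supervised backpropagation step for this network might look as follows; a single sigmoid label unit with a cross-entropy loss is assumed, and the learning rate is a placeholder.

```python
def forward(x, params):
    """Forward pass: two DBN layers plus the sigmoid label layer."""
    W1, c1, W2, c2, W3, c3 = params
    h1 = sigmoid(x @ W1 + c1)    # weights initialized from the offline-trained RBMs
    h2 = sigmoid(h1 @ W2 + c2)
    y = sigmoid(h2 @ W3 + c3)    # label layer: P(vehicle | patch)
    return h1, h2, y

def backprop_step(x, t, params, eta=0.1):
    """One SGD step with cross-entropy loss; t is (batch, 1), 1 = vehicle, 0 = background."""
    W1, c1, W2, c2, W3, c3 = params
    h1, h2, y = forward(x, params)
    d3 = y - t                           # output error (sigmoid + cross-entropy)
    d2 = (d3 @ W3.T) * h2 * (1 - h2)     # backpropagate through the sigmoid layers
    d1 = (d2 @ W2.T) * h1 * (1 - h1)
    n = x.shape[0]
    W3 -= eta * h2.T @ d3 / n; c3 -= eta * d3.mean(axis=0)
    W2 -= eta * h1.T @ d2 / n; c2 -= eta * d2.mean(axis=0)
    W1 -= eta * x.T @ d1 / n;  c1 -= eta * d1.mean(axis=0)
    return params
```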

3. Position Prediction Based on Particle Filtering

In the tracking process, the online-trained classifier needs to verify a large number of subimages in subsequent frames. To reduce processing time, a particle filter is used to estimate the target position.

The particle filter uses a Monte Carlo algorithm with Bayesian filtering and demonstrates good object tracking performance. Set $x_t$ and $z_t$ as the state and observation values of the target at time $t$. Then, the vehicle position estimation can be described as follows: given the observation values $z_{1:t}$ up to time $t$, iteratively estimate the state $x_t$ of the system at time $t$.

Set the state space model of the system as

$$x_t = f(x_{t-1}, u_t), \qquad z_t = h(x_t, n_t),$$

where $f$ and $h$ are the state transfer function and observation function, respectively, and $u_t$ and $n_t$ represent system noise and observation noise, respectively.
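The paper does not specify the form of $f$; a common assumption in visual tracking is a random-walk transition over the box center and scale, as in this sketch.

```python
import numpy as np

def transition(particles, sigma=(4.0, 4.0, 0.02), rng=np.random):
    """Random-walk state-transfer model over states (x, y, scale).

    An assumed form of f: x_t = x_{t-1} + u_t, with u_t zero-mean Gaussian noise
    whose per-dimension standard deviations are placeholder values.
    """
    noise = rng.standard_normal(particles.shape) * np.asarray(sigma)
    return particles + noise
```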

The filtering process contains two main steps: forecasting and updating. In the forecasting process, the posterior probability density at time $t-1$ is used to obtain the prior probability density at time $t$:

$$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}.$$

In the updating process, the newest system state observation value $z_t$ at time $t$ and the prior probability density are used to calculate the posterior probability density at time $t$:

$$p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}.$$

Assume that $\{x_{0:t}^i, w_t^i\}_{i=1}^{N}$ is a group of samples with weights from the posterior probability density, where $x_{0:t}^i$ is the sample set from time 0 to time $t$. Based on the Monte Carlo principle, the posterior probability density at time $t$ can be approximated with a discrete weighting formula:

$$p(x_{0:t} \mid z_{1:t}) \approx \sum_{i=1}^{N} w_t^i\, \delta\big(x_{0:t} - x_{0:t}^i\big),$$

in which $\delta(\cdot)$ is a Dirac function.

Further, the estimated value of the system state at time $t$ is

$$\hat{x}_t = \sum_{i=1}^{N} w_t^i x_t^i.$$
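Putting the forecasting, updating, and estimation formulas together, one filtering cycle might look like the following sketch; the observation likelihood is assumed to be the deep tracker's confidence score for the patch at each particle state, and the resampling threshold is a conventional choice rather than a value from the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, z, transition, likelihood, rng=np.random):
    """One predict/update cycle of the particle filter.

    particles : (N, d) state samples x_t^i      weights : (N,) importance weights
    transition(particles)    -> propagated samples (state-transfer function plus noise)
    likelihood(particles, z) -> p(z_t | x_t^i), e.g., deep-tracker confidence scores
    """
    particles = transition(particles)             # forecasting: sample p(x_t | x_{t-1})
    weights = weights * likelihood(particles, z)  # updating: reweight by the observation
    weights /= weights.sum()
    estimate = weights @ particles                # \hat{x}_t = sum_i w_t^i x_t^i
    # multinomial resampling when the effective sample size degenerates
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights, estimate
```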

4. Experiment and Analysis

The key parameters of the deep network and the particle filter are set as follows: the number of units $n_v$ in the visible layer is equal to the dimension of the input samples; the two hidden layers contain $n_{h^1}$ and $n_{h^2}$ units, respectively; the initial weights for offline training are drawn from a Gaussian distribution; and the number of particles in the particle filter is $N$. For the unsupervised offline training, around 5000 subimages that contain only vehicles are selected for tracker training. Some typical training images are shown in Figure 6.

In the experiments, different road scenarios were selected, including daytime, nighttime, and rainy conditions. To evaluate performance, two criteria were used: the tracking success rate and the center pixel offset. Tracking success is determined by the overlap ratio $\mathrm{area}(R_t \cap R_g)/\mathrm{area}(R_t \cup R_g)$, and the center pixel offset is defined as the Euclidean distance between the centers of $R_t$ and $R_g$, in which $R_t$ is the outline of the tracking box and $R_g$ is the ground-truth outline box of the real target.
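A sketch of these two criteria for axis-aligned boxes given as (x, y, w, h); the success threshold applied to the overlap ratio is not recoverable from the text, so it is left to the caller.

```python
def overlap_ratio(rt, rg):
    """Area overlap between tracking box rt and ground-truth box rg."""
    x1, y1 = max(rt[0], rg[0]), max(rt[1], rg[1])
    x2 = min(rt[0] + rt[2], rg[0] + rg[2])
    y2 = min(rt[1] + rt[3], rg[1] + rg[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union

def center_error(rt, rg):
    """Euclidean distance between the centers of the two boxes."""
    dx = (rt[0] + rt[2] / 2.0) - (rg[0] + rg[2] / 2.0)
    dy = (rt[1] + rt[3] / 2.0) - (rg[1] + rg[3] / 2.0)
    return (dx * dx + dy * dy) ** 0.5
```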

The proposed algorithm is compared with several state-of-the-art algorithms, namely, MTT [10], CT [11], KCF [12], MIL [2], L1T [13], and DSST [14].

The experimental results are shown in Tables 1 and 2. The four typical scenarios chosen are daytime with good illumination, nighttime with road lamps, daytime with heavy rain, and twilight. It can be seen from the tables that, since the proposed tracking algorithm is able to generate richer deep features, it performs better than the existing state-of-the-art algorithms under all four test conditions. Examples of real tracking results are shown in Figure 7.

It should be mentioned that although the tracking performance of our method is improved, its processing speed is relatively low. The average processing time for one image is around 75 ms, which is the second slowest of all the methods used in the experiments (Table 3). A possible reason for this large time cost is that the deep model contains a large number of neurons and weighted connections, and computing the weight of each connection requires considerable processing time. The experimental platform was an Intel 2.67 GHz CPU with 4 GB RAM running the Windows 7 operating system.

5. Conclusion

To solve the problem that traditional tracking methods are not robust enough for vehicle tracking in ADAS, a deep representation and semisupervised onboard vehicle tracking algorithm was proposed. Relying on the strong feature extraction ability of deep modeling, this method dramatically inhibits drifting. Generally, the proposed semisupervised, deep-model-based tracking algorithm performs better than most existing methods in terms of tracking success rate and average center pixel offset. However, the real-time performance of this work still needs to be improved. There are two possible ways to solve this problem. First, a more concise deep model may be designed to reduce the computational burden. Second, parallel computing technology may be applied to speed up the calculation process.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61601203, 61403172, and U1564201), China Postdoctoral Science Foundation (2015T80511 and 2014M561592), Key Research and Development Program of Jiangsu Province (BE2016149), Natural Science Foundation of Jiangsu Province (BK20140555), and Six Talent Peaks Project of Jiangsu Province (2015-JXQC-012 and 2014-DZXX-040).