Abstract

Vision based vehicle detection is a critical technology that plays an important role not only in vehicle active safety but also in road video surveillance applications. In this work, a multistep framework for vision based vehicle detection is proposed. In the first step, for vehicle candidate generation, a novel geometrical and coarse depth information based method is proposed. In the second step, for candidate verification, a deep belief network (DBN) architecture for vehicle classification is trained. In the last step, a temporal analysis method based on complexity and spatial information is used to further reduce missed and false detections. Experiments demonstrate that this framework achieves a high true positive (TP) rate as well as a low false positive (FP) rate. On-road experimental results demonstrate that the algorithm outperforms state-of-the-art vehicle detection algorithms on the testing data sets.

1. Introduction

Advanced driver assistance systems (ADAS) are developed to improve driver safety and comfort, which requires comprehensive perception and understanding of the on-road environment. For example, static and dynamic obstacles on the road need to be detected accurately and in real time so that the safe driving space of the ego vehicle can be determined. Recently, with the fast development of optical devices and hardware, vision based environment sensing has drawn much attention from researchers. Vehicle detection with an in-vehicle front-mounted vision system is one of the most important tasks for vision based ADAS.

Robust on-road vehicle detection with both a low miss rate and a low false detection rate faces many challenges, since highways and urban roads are dynamic environments in which the background and illumination are time varying. Besides, the shape, color, size, and appearance of vehicles are highly variable. Making this task even more difficult, the ego vehicle and target vehicles are generally in motion, so the size and location of target vehicles captured in the image vary widely.

For robust vehicle detection, the two-step strategy proposed by Sun et al. [1], consisting of candidate generation (CG) and candidate verification (CV), is adopted by many researchers. In the CG step, image regions in which vehicles might exist are generated with simple, computationally cheap algorithms. In the CV step, the candidates generated in the CG step are verified with a relatively complex machine learning algorithm.

In this work, following this two-step strategy, a multistep framework for vision based vehicle detection is proposed. In the first step, for vehicle candidate generation, a novel geometrical and coarse depth information based method is proposed. In the second step, for candidate verification, a DBN architecture for vehicle classification is trained. In the last step, to further reduce missed and false detections, a temporal analysis method based on complexity and spatial information is used. The overall framework of our work is shown in Figure 1.

Our main contributions are in three aspects: (1) a novel geometrical and coarse depth information based vehicle candidate generation method is proposed, which dramatically reduces the number of candidates to be verified and thus speeds up the whole detection process; (2) a DBN based deep vehicle classifier is designed; (3) to deal with occasional miss detections and false detections, a temporal analysis method based on complexity and spatial information is proposed.

2. Vehicle Candidate Generation

In the vehicle candidate generation step, traversal search such as the sliding window method is traditionally used. Since traversal search generates a huge number of candidates, it cannot meet real-time requirements. Another CG method, based on the ground plane assumption, is proposed by Kim and Cho, which dramatically decreases the number of candidates [2]. However, this method performs poorly when the camera orientation changes or the road surface is curved. In [3], Felzenszwalb and Huttenlocher use a symmetry and edge based candidate generation method, which cannot handle partial vehicle occlusion.

In our work, a novel geometrical and coarse depth information based vehicle candidate generation method is proposed, which integrates bottom layer and intermediate layer information of the image. In bottom layer information extraction, the original image is transformed into super-pixel form according to pixel similarity. In intermediate layer information extraction, the road scene is divided into three parts, the horizontal plane, vertical planes, and the sky, by geometrical information extraction; meanwhile, a coarse depth estimation method is applied to obtain approximate depth information from the single image. Finally, by combining the acquired geometrical and coarse depth information, super pixels are clustered so that vehicle candidates can be generated. The vehicle candidate generation process is shown in Figure 2.

2.1. Bottom Layer Information Extraction

Bottom layer information is image information that can be obtained directly without further processing, such as pixel values and color. The super pixel is an important representation of this underlying image information. Super-pixel generation is essentially an over-segmentation method that aggregates neighboring pixels with similar features into groups, each named a super pixel. Because super-pixel segmentation better reflects human perceptual significance, it has unique advantages in object recognition tasks. Since it was first proposed in [3], a large number of machine vision approaches have used super pixels instead of pixels in a wide range of applications such as image segmentation [4], image analysis [5], image targeting [6], and other areas. In our application, the SLIC method described in [3] is applied to segment the road image into super pixels (Figure 2(b)).
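For illustration, a minimal Python sketch using scikit-image's SLIC implementation is given below; the input file name, segment count, and compactness value are assumed for the example and are not the paper's settings:

```python
# Minimal SLIC super-pixel sketch with scikit-image; "road_frame.png" and the
# parameter values are illustrative assumptions, not the paper's settings.
import numpy as np
from skimage import io
from skimage.segmentation import slic, mark_boundaries

image = io.imread("road_frame.png")                    # H x W x 3 road frame
labels = slic(image, n_segments=400, compactness=10)   # over-segmentation

# Summarize each super pixel by its mean color; per-segment features like
# this feed the later geometric and coarse-depth classification.
mean_colors = np.array([image[labels == k].mean(axis=0)
                        for k in np.unique(labels)])
overlay = mark_boundaries(image, labels)               # segment borders for display
```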

2.2. Intermediate Layer Information Extraction
2.2.1. Geometrical Information Extraction

Through statistical analysis of a large number of road pictures, it is found that flat planes in an image usually belong to the road surface, vertical planes tend to be the surfaces of on-road objects such as vehicles, trees, and fences, and the sky usually appears in the upper part of the image. If each image pixel can be identified as flat plane, vertical plane, or sky, this provides rich information for vehicle candidate generation. Fortunately, by characterizing each super pixel with color, position, and perspective cues and feeding these features into a pretrained regression AdaBoost classifier, Hoiem et al. successfully obtain the category of each super pixel [7]. Following Hoiem's approach, in our work images are divided into the aforementioned road pavement, vertical objects, and sky. As shown in the bottom-left picture of Figure 2(c), road pavement is marked in green, sky in blue, and vertical objects in brown as well as with black marks.
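A hedged sketch of this per-super-pixel labeling stage is shown below, in the spirit of [7]; scikit-learn's AdaBoost stands in for the regression AdaBoost used there, and the feature layout, file names, and label encoding are assumptions rather than the paper's implementation:

```python
# Sketch of geometric labeling per super pixel (Hoiem-style [7]); sklearn's
# AdaBoostClassifier is a stand-in, and the feature/label layout is assumed.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# One row per super pixel, e.g. [mean color, normalized centroid, region
# height in image, texture statistics, ...]; labels: 0 = ground (horizontal
# plane), 1 = vertical object, 2 = sky.
X_sp = np.load("superpixel_features.npy")   # hypothetical training features
y_sp = np.load("superpixel_labels.npy")     # hypothetical training labels

clf = AdaBoostClassifier(n_estimators=100)
clf.fit(X_sp, y_sp)
geo_labels = clf.predict(X_sp)              # per-super-pixel plane category
```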

2.2.2. Coarse Depth Information Extraction

To obtain depth information from images, there are many mature methods based on stereo vision, but they are not suitable for our setup with a single front-mounted camera. Recently, depth estimation from a single image has received widespread attention. The methods in [8, 9] obtain accurate depth information from a single image, but both need several seconds per image, which makes it difficult to meet real-time requirements.

Unlike applications that need accurate depth information, the vehicle candidate generation task only needs coarse depth information. For this purpose, an SVM classifier that estimates the coarse depth of image objects is proposed. A large number of training images are collected with stereo vision and labeled with depth information, and the depth information is roughly divided into three groups.

For training, all sample images are divided into small grids, and the following three features are extracted from each grid to form the feature vector: (1) the mean value and histogram of the three channels in HSV color space, (2) Gabor filter based texture gradient features, and (3) the geometric category of the grid (sky, ground, etc.). The classifier is trained with the libsvm toolbox [10], and its output is the coarse depth category of the grid. As shown in the bottom-right picture of Figure 2(c), the three coarse depth levels are marked with different colors.
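A sketch of how such a grid classifier might be assembled is given below; the feature dimensions, Gabor parameters, and the encoding of the grid category are assumptions, and scikit-learn's SVC is used as a wrapper around the same libsvm library [10]:

```python
# Sketch of the coarse-depth grid classifier; feature details, parameter
# values, and the depth grouping (e.g., near / middle / far) are assumed.
import cv2
import numpy as np
from sklearn.svm import SVC

def grid_features(bgr_patch, geo_category):
    """HSV statistics + Gabor texture energy + geometric category (one-hot)."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    means = hsv.reshape(-1, 3).mean(axis=0)                  # per-channel mean
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [8, 8, 8],
                        [0, 180, 0, 256, 0, 256]).flatten()
    hist /= hist.sum() + 1e-6                                # normalized histogram
    gray = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gabor_energy = []
    for theta in np.linspace(0, np.pi, 4, endpoint=False):   # 4 orientations
        kern = cv2.getGaborKernel((15, 15), 4.0, theta, 10.0, 0.5)
        gabor_energy.append(np.abs(cv2.filter2D(gray, cv2.CV_32F, kern)).mean())
    return np.hstack([means, hist, gabor_energy, geo_category])

# X: stacked grid feature vectors; y: coarse depth label in {0, 1, 2},
# both built from stereo-labeled training images.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
# clf.fit(X, y); depth = clf.predict(X_new)
```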

Based on the coarse depth information, the pixel-level size range of vehicle candidates at each depth is listed in Table 1.

2.3. Vehicle Candidate Generation Strategy

In Sections 2.1 and 2.2, images are segmented into super pixels, and the geometrical and coarse depth information of the image is extracted. In this section, the super pixels that satisfy the requirements of a vehicle are first picked out; then vehicle candidates are generated by clustering the selected super pixels.

The super pixel selection is mainly based on the following prior knowledge:
(1) Vertical constraint: a super pixel that might belong to a vehicle should lie in or be adjacent to a vertical plane.
(2) Ground constraint: a super pixel that might belong to a vehicle should connect to the ground area.
(3) Depth constraint: super pixels within the same depth range can be clustered together.
(4) Size constraint: a super pixel that might belong to a vehicle should not exceed the range of vehicle sizes in the image.

By applying these four rules, a clustering strategy is designed so that proper super pixels can be selected and grouped into vehicle candidates. The clustering method is as follows.

Algorithm 1 (Clustering Method for Vehicle Candidate Generation). Consider the following steps:
Step 1. Start from a super pixel $s_0$ that belongs to a vertical plane and connects to the ground area; initialize the cluster $C = \{s_0\}$.
Step 2. Let $N(C)$ be the set of super pixels that are adjacent to $C$ and belong to a vertical plane. If $N(C)$ is empty or all of its members have been processed, STOP. Otherwise, find the pair $(s, c)$ with $s \in N(C)$ and $c \in C$ that has the minimal Euclidean distance $d(s, c)$.
Step 3. If $s$ satisfies (a) $s$ is within the same depth range as $c$ and (b) merging $s$ does not take $C$ out of the range of vehicle sizes in the image, then merge $s$ into $C$ as a new super pixel group.
Step 4. Remove $s$ from $N(C)$.
Step 5. If $C$ still satisfies the constraint of the image vehicle pixel size range shown in Table 1, jump to Step 2. If $C$ no longer satisfies the constraint, the group before the last merge is considered a vehicle candidate; jump to Step 2.
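To make the growth rule concrete, the following Python sketch implements the steps above under assumed data structures (a dictionary per super pixel carrying centroid, coarse depth class, plane type, and bounding box; the adjacency map and the Table 1 size range are inputs). It is an illustrative reading of Algorithm 1, not the authors' code:

```python
# Hypothetical sketch of Algorithm 1; the super-pixel dictionary layout,
# adjacency map, and size-range format are assumed, not the paper's code.
import math

def union(a, b):                       # merge two (xmin, ymin, xmax, ymax) boxes
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def within_upper_bound(bbox, size_range):
    (wmin, wmax), (hmin, hmax) = size_range
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    return w <= wmax and h <= hmax     # cluster may still grow

def in_vehicle_range(bbox, size_range):
    (wmin, wmax), (hmin, hmax) = size_range
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    return wmin <= w <= wmax and hmin <= h <= hmax

def grow_candidate(seed, sp, adjacent, size_range):
    """Steps 1-5: grow a cluster from a vertical, ground-touching seed."""
    cluster, rejected = {seed}, set()
    bbox = sp[seed]["bbox"]
    while True:
        # Step 2: unprocessed vertical-plane neighbors of the cluster.
        frontier = [n for s in cluster for n in adjacent[s]
                    if n not in cluster and n not in rejected
                    and sp[n]["vertical"]]
        if not frontier:
            break
        # Closest (neighbor, member) pair by centroid distance.
        s, c = min(((f, m) for f in frontier for m in cluster),
                   key=lambda p: math.dist(sp[p[0]]["centroid"],
                                           sp[p[1]]["centroid"]))
        merged = union(bbox, sp[s]["bbox"])
        # Step 3: same coarse depth and still a plausible vehicle size.
        if sp[s]["depth"] == sp[c]["depth"] and within_upper_bound(merged, size_range):
            cluster.add(s)
            bbox = merged
        else:
            rejected.add(s)            # Step 4: drop this neighbor
    # Step 5: accept the cluster if its bounding box falls inside the
    # vehicle pixel size range of Table 1.
    return (cluster, bbox) if in_vehicle_range(bbox, size_range) else None
```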

Figure 3 shows an example of the proposed clustering method for vehicle candidate generation. Figure 3(a) is the original image, and Figures 3(b)-3(e) show the clustering process.

3. Vehicle Candidate Verification

In this section, a deep belief network (DBN) based vehicle candidate verification algorithm is proposed.

Machine learning has proved to be the most effective approach for the vehicle candidate verification task. Support vector machines (SVM) and AdaBoost are the two most common classifiers used for training vehicle detectors [11-20]. However, classifiers such as SVM and AdaBoost belong to shallow learning models, because both can be modeled as structures with one input layer, one hidden layer, and one output layer. Deep learning, on the other hand, is a class of machine learning techniques in which hierarchical architectures are exploited for representation learning and pattern classification. Superior to shallow models, deep learning is able to learn multiple levels of representation and abstraction from image data.

There are many subclasses of deep architectures, among which the deep belief network (DBN) is a typical deep learning structure, first proposed by Hinton et al. [21]. The DBN has demonstrated its success in MNIST classification. In [22], a modified DBN is developed in which a Boltzmann machine is used on the top layer; it is applied to a 3D object recognition task. In our work, a DBN classifier is trained for the vehicle candidate verification task.

In Section 3.1, the overall architecture of the DBN classifier for vehicle candidate verification is introduced. In Sections 3.2 and 3.3, the training method of the whole DBN for vehicle candidate verification is derived.

3.1. Deep Belief Network (DBN) for Vehicle Candidate Verification

Let $X$ be the set of data samples, including vehicle images and nonvehicle images, and assume that $X$ consists of $N$ samples:
$$X = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}.$$

In $X$, $x^{(i)}$ is a training sample in the image space. Meanwhile, $Y$ denotes the labels corresponding to $X$, which can be written as
$$Y = \{y^{(1)}, y^{(2)}, \dots, y^{(N)}\}.$$

In $Y$, $y^{(i)} = (y^{(i)}_1, y^{(i)}_2)^T$ is the label vector of $x^{(i)}$. If $x^{(i)}$ belongs to the vehicle class, $y^{(i)} = (1, 0)^T$; on the contrary, $y^{(i)} = (0, 1)^T$.

The ultimate purpose of the vehicle candidate verification task is to learn a mapping function from the training data to the label data based on the given training set, so that this mapping function is able to classify unknown images as vehicle or nonvehicle.

Based on the task described above, a DBN architecture is applied to address this problem. Figure 4 shows the overall architecture of the DBN: a fully interconnected directed belief network with one visible input layer $v^0$, $L$ hidden layers $h^1, \dots, h^L$, and one visible label layer at the top. The number of neurons in the visible input layer equals the dimension of the training feature, which in this application is the raw 2D pixel values of the training samples. At the top, the label layer has just two units, equal to the number of classes to be distinguished. The problem is now formulated as searching for the optimum parameters of this DBN.

The main learning process of the proposed DBN has two steps:
(1) The parameters of each pair of adjacent layers are learned with the greedy layer-wise reconstruction method; this is repeated until the parameters of all hidden layers are fixed. Step 1 is the so-called pretraining process.
(2) The whole DBN is fine-tuned with the label layer information using back propagation. Step 2 can be viewed as the supervised training step.

3.2. Pretraining with Greedy Layer-Wise Reconstruction Method

In this subsection, the parameters of two adjacent layers are refined with the greedy layer-wise reconstruction method proposed by Hinton et al. [21]. To illustrate the pretraining process, we take the visible input layer $v^0$ and the first hidden layer $h^1$ as an example.

The visible input layer $v^0$ and the first hidden layer $h^1$ constitute a restricted Boltzmann machine (RBM). Let $n$ be the number of neurons in $v^0$ and $m$ that of $h^1$. The energy of a state $(v^0, h^1)$ of this RBM is
$$E(v^0, h^1; \theta) = -\sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij} v^0_i h^1_j - \sum_{i=1}^{n} b_i v^0_i - \sum_{j=1}^{m} a_j h^1_j,$$
in which $\theta = (w, b, a)$ are the parameters between the visible input layer $v^0$ and the first hidden layer $h^1$, $w_{ij}$ is the symmetric weight from input neuron $i$ in $v^0$ to hidden neuron $j$ in $h^1$, and $b_i$ and $a_j$ are the biases of $v^0$ and $h^1$, respectively. This RBM has the joint distribution
$$P(v^0, h^1; \theta) = \frac{\exp(-E(v^0, h^1; \theta))}{Z},$$
where $Z = \sum_{v^0, h^1} \exp(-E(v^0, h^1; \theta))$ is the normalization constant, and the probability that the model assigns to $v^0$ is
$$P(v^0; \theta) = \frac{1}{Z} \sum_{h^1} \exp(-E(v^0, h^1; \theta)).$$

After that, the conditional distributions over the visible input states in layer $v^0$ and the hidden states in $h^1$ are given by the logistic function:
$$P(h^1_j = 1 \mid v^0; \theta) = \sigma\Big(\sum_{i=1}^{n} w_{ij} v^0_i + a_j\Big), \qquad P(v^0_i = 1 \mid h^1; \theta) = \sigma\Big(\sum_{j=1}^{m} w_{ij} h^1_j + b_i\Big),$$
where $\sigma(x) = 1/(1 + e^{-x})$.

At last, starting from random Gaussian initial values, the weights and biases are updated step by step with the contrastive divergence algorithm [23], and the updating formulations are
$$\Delta w_{ij} = \epsilon \big(\langle v^0_i h^1_j \rangle_{\text{data}} - \langle v^0_i h^1_j \rangle_{\text{recon}}\big),$$
$$\Delta b_i = \epsilon \big(\langle v^0_i \rangle_{\text{data}} - \langle v^0_i \rangle_{\text{recon}}\big), \qquad \Delta a_j = \epsilon \big(\langle h^1_j \rangle_{\text{data}} - \langle h^1_j \rangle_{\text{recon}}\big),$$
in which $\langle \cdot \rangle_{\text{data}}$ denotes the expectation with respect to the data distribution, $\langle \cdot \rangle_{\text{recon}}$ denotes the reconstruction distribution after one step, and $\epsilon$ is the step size.
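A minimal numpy sketch of one CD-1 update is given below, assuming binary units, a batch of flattened image vectors, and an illustrative learning rate and layer sizes (the paper's step size value is not reproduced here):

```python
# Self-contained CD-1 sketch for one RBM; layer sizes, the learning rate,
# and the Bernoulli sampling scheme are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, a, lr=0.1):
    """One contrastive-divergence update for a batch v0 (batch x n)."""
    # Up pass: P(h=1|v), then sample binary hidden states.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)
    # Down pass: one-step reconstruction of the visible layer.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + a)
    # <vh>_data - <vh>_recon gradients, averaged over the batch.
    batch = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
    b += lr * (v0 - pv1).mean(axis=0)
    a += lr * (ph0 - ph1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()      # reconstruction error for monitoring

# Example: a 1024-visible / 256-hidden RBM (sizes are illustrative).
n, m = 1024, 256
W = rng.normal(0, 0.01, (n, m)); b = np.zeros(n); a = np.zeros(m)
# for epoch in range(10): err = cd1_step(batch_of_images, W, b, a)
```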

Above, the pretraining process is demonstrated with the visible input layer $v^0$ and the first hidden layer $h^1$ as an example. In fact, the whole pretraining proceeds from the lower layer pair $(v^0, h^1)$ up to the top layer pair $(h^{L-1}, h^L)$ one by one.

3.3. Global Fine-Tuning

In the unsupervised pretraining process above, the greedy layer-wise algorithm is used to learn the DBN parameters. In this subsection, the traditional back propagation algorithm is used to fine-tune these parameters with the information of the label layer.

Since a good parameter initialization has already been obtained in the pretraining process, back propagation only needs to adjust the parameters finely so that a local optimum can be reached. In this stage, the learning objective is to minimize the classification error $\|y - \hat{y}\|^2$, where $y$ and $\hat{y}$ are the real label and the output label of a training sample at the label layer.
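The following self-contained sketch illustrates one fine-tuning step under these assumptions (sigmoid units throughout, a squared-error objective, and an illustrative learning rate); it is a generic backprop pass over the pretrained stack, not the authors' implementation:

```python
# Hedged sketch of supervised fine-tuning: the pretrained weights are stacked
# into a feed-forward net and adjusted by plain back propagation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def finetune_step(x, y, weights, biases, lr=0.01):
    """x: batch x n inputs; y: batch x 2 one-hot labels (vehicle / nonvehicle)."""
    # Forward pass, keeping every layer's activation for the backward pass.
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(acts[-1] @ W + b))
    # Squared-error objective ||y_hat - y||^2 and its output-layer delta.
    delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    for i in reversed(range(len(weights))):
        grad_W = acts[i].T @ delta / x.shape[0]
        grad_b = delta.mean(axis=0)
        # Propagate the error downward before updating this layer.
        delta = (delta @ weights[i].T) * acts[i] * (1 - acts[i])
        weights[i] -= lr * grad_W
        biases[i] -= lr * grad_b
    return ((acts[-1] - y) ** 2).sum(axis=1).mean()   # batch loss
```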

4. Temporal Analysis Using Complexity and Spatial Information

After the vehicle candidate verification step proposed in the last section, most vehicles in the frames are detected. However, due to the scale sparsity of the candidate windows and imperfect detector performance, there may still be miss detections and false detections in some frames. It is observed that most miss detections and false detections appear only occasionally, so a temporal analysis using complexity and spatial information is proposed to refine the detection results.

The temporal analysis has two components. First, for a detected target in frame $t$, if the similarity function indicates that the target also appeared in frames $t-1$ and $t-2$, the target is considered a true positive; otherwise it is considered a false detection and is eliminated. Second, for a verified target in frame $t-1$, the similarity function is used again to determine whether there is a corresponding target in frame $t$. If there is not, a miss detection is assumed, and the area with the highest similarity, provided this similarity is above a threshold, is regarded as the detected target.

The similarity between any two targets is defined as
$$S(i, j) = S_c(i, j)\, S_s(i, j),$$
in which $S_c(i, j)$ is the complexity similarity between target $i$ in frame $t$ and target $j$ in frame $t-1$, and $S_s(i, j)$ is the spatial similarity. The two terms are defined as follows.

(a) Complexity Similarity. There are multiple ways to calculate image complexity, among which the edge proportion is a good measurement. Let $N$ be the number of pixels in an image and $N_e$ the number of pixels belonging to edges. The image complexity is then defined as
$$C = \frac{N_e}{N}.$$

The ratio of the complexities of two images is taken as the similarity function, so the complexity similarity between target $i$ in frame $t$ and target $j$ in frame $t-1$ is
$$S_c(i, j) = \frac{\min(C_i, C_j)}{\max(C_i, C_j)}.$$

(b) Spatial Similarity. Vehicle movement in video is a continuous process, and the time interval between two consecutive frames is very short (around 40 ms), so vehicle motion does not usually change dramatically within such a period; vehicles in two consecutive frames therefore show very small displacement and deformation. Hence the detection window size and centroid coordinates are used to build the spatial similarity measurement
$$S_s(i, j) = \exp\big(-\alpha \|p_i - p_j\|^2 - \beta\, |A_i - A_j|\big),$$
in which $p$ denotes the centroid coordinates of a detection window, $A$ denotes the detection window size, and $\alpha$ and $\beta$ are constant factors.
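Under the reconstructions above, the two measures and the temporal accept rule can be sketched as follows; the Canny thresholds, the weighting constants, the decision threshold, and the Detection container are assumed values and structures, not the paper's:

```python
# Sketch of Section 4's similarity measures and temporal rule; the product
# combination, min/max complexity ratio, exponential spatial kernel, and all
# numeric parameters follow the hedged reconstruction above.
import cv2
import numpy as np
from collections import namedtuple

Detection = namedtuple("Detection", ["patch", "box"])  # gray patch, (cx, cy, w, h)

def complexity(patch_gray):
    """Edge proportion C = N_e / N of a detection window."""
    edges = cv2.Canny(patch_gray, 100, 200)
    return np.count_nonzero(edges) / edges.size

def complexity_similarity(patch_a, patch_b):
    ca, cb = complexity(patch_a), complexity(patch_b)
    return min(ca, cb) / (max(ca, cb) + 1e-6)

def spatial_similarity(box_a, box_b, alpha=1e-3, beta=1e-3):
    """Nearby, similar-sized windows score near 1."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = box_a, box_b
    dist2 = (xa - xb) ** 2 + (ya - yb) ** 2
    dsize = abs(wa * ha - wb * hb)
    return float(np.exp(-alpha * dist2 - beta * dsize))

def similarity(a, b):
    return complexity_similarity(a.patch, b.patch) * spatial_similarity(a.box, b.box)

def confirmed(det, prev_dets, prev2_dets, thresh=0.5):
    """Keep a frame-t detection only if matched in frames t-1 and t-2."""
    ok1 = any(similarity(det, p) > thresh for p in prev_dets)
    ok2 = any(similarity(det, p) > thresh for p in prev2_dets)
    return ok1 and ok2
```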

5. Experiments and Analysis

The proposed multistep vehicle detection framework is tested on the PETS2001 dataset and on many video clips captured on the highway by our project group. In the experiments, the number of hidden layers of the DBN is set to 2, and the training step size is set to 0.8. Besides, the image samples for training are all resized to a uniform size, and the test frames all share the same resolution. The experiments are carried out on our Advantech industrial computer.

Some vehicle candidate generation results are shown in Figure 5, where (a) is the original image, (b) shows the geometrical information (sky, vertical plane, and ground plane) extracted from the original image, (c) shows the coarse depth information obtained with the pretrained classifier, and (d) shows the generated candidates.

Based on the vehicle candidate generation results, the DBN based vehicle candidate verification method is then applied. The verification results are shown in Figure 6: the left column shows the generated vehicle candidates marked in yellow, while the right column shows the verified vehicles marked with blue rectangles.

Finally, by applying the temporal analysis, some missed detections can be recovered. As shown in Figure 7, the blue windows mark vehicles detected by the DBN vehicle detector, while the dashed green windows mark vehicles that were missed by the detector but corrected by the temporal analysis.

Our method is tested on more than 10 road videos covering different times of day and different weather conditions. The overall vehicle detection results are shown in Table 2, together with those of several state-of-the-art vehicle detection methods.

From the results in Table 2, it can be seen that the proposed vehicle detection framework achieves the lowest false positive (FP) rate while reaching the second highest true positive (TP) rate, only 0.24% lower than that of Southall's stereo vision based method. Meanwhile, by using monocular vision, our method is much faster than Southall's.

6. Conclusion

In this work, a multistep framework for vision based vehicle detection is proposed. In the first step, for vehicle candidate generation, a novel geometrical and coarse depth information based method is proposed. In the second step, for candidate verification, a deep DBN architecture for vehicle classification is trained. In the last step, a temporal analysis method based on complexity and spatial information is used to further reduce missed and false detections. Experiments demonstrate that this framework achieves a high TP rate as well as a low FP rate.

Conflict of Interests

The authors declare that they have no conflict of interests.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China under Grants 51305167, 61203244, and 61403172; the Information Technology Research Program of Transport, Ministry of China, under Grant 2013364836900; the Natural Science Foundation of Jiangsu Province under Grant BK20140555; the Jiangsu Province Colleges and Universities Natural Science Foundation under Grant 13KJD520003; and the Jiangsu University Scientific Research Foundation for Senior Professionals under Grants 12JDG010 and 1291120026.