Detection of moving vehicles in aerial video sequences is of great importance, with many promising applications in surveillance, intelligent transportation, and public services such as emergency evacuation and policy security. However, vehicle detection is a challenging task due to global camera motion, the low resolution of vehicles, and the low contrast between vehicles and background. In this paper, we present a hybrid method to efficiently detect moving vehicles in aerial videos. First, local feature extraction and matching were performed to estimate the global motion. It was demonstrated that Speeded Up Robust Feature (SURF) key points were more suitable for the stabilization task than SIFT key points. Then, a list of dynamic pixels was obtained and grouped into different moving vehicles by comparing their optical flow normals. To enhance the precision of detection, scene context and complementary features, such as road extraction and the size of candidate regions, were applied to the surveillance system. A quantitative evaluation on real video sequences indicated that the proposed method improved the detection performance significantly.

1. Introduction

In recent years, analysis of aerial videos has become an important topic [1] with various applications such as intelligence, surveillance, and reconnaissance (ISR), intelligent transportation, and military fields [2, 3]. As an excellent supplement to ground-plane surveillance systems, airborne surveillance is more suitable for monitoring fast-moving targets and covers a larger area [4]. Mobile vehicles in aerial videos need to be detected for event observation, summarization, indexing, and high-level aerial video understanding [5]. This paper focuses on vehicle detection from a low-altitude aerial platform (about 120 m above ground).

Detection of objects has traditionally been a very important research topic in classical computer vision [6, 7]. However, there are still some challenges related to detection in low-resolution aerial videos. Firstly, vehicles in aerial video have small size and low resolution. Lack of color, low contrast between vehicles and backgrounds, and small and variable vehicle sizes (400~550 pixels) make the appearance and size of vehicles too indistinct to establish reliable correspondences. Secondly, frame and background modeling usually assume a static background and consistent global illumination. In practice, however, changes of background and global illumination are common in aerial videos due to the global motion of the camera. Moreover, UAV video analysis requires real-time processing. Therefore, a fast and robust detection algorithm is strongly desired. So far, detection of moving vehicles is still a big challenge.

In this work, a vehicle detection method was proposed based on the VSAM method by Cohen and Medioni [8]. The similarities and differences between these two methods were discussed in detail. We used Speeded Up Robust Feature (SURF) for video stabilization and demonstrated its validity. Scene context such as the road was introduced into mobile vehicle detection, and good results were obtained. Also, complementary features such as shape were used to improve the detection performance.

This paper is organized as follows. Section 2 enumerates related work on vehicle detection from aerial videos. Section 3 describes the details about the proposed approach. Section 4 presents our experimental results. Conclusions of this work are summarized in Section 5.

2. Related Work

In the literature, some approaches have been proposed to deal with vehicle detection in airborne videos. However, they mostly tackle stationary camera scenarios [9–11]. Recently, there has been an increasing interest in studying mobile vehicle detection from moving cameras [12]. Background subtraction is one of the most successful approaches to extract moving objects [13, 14]. However, it is only applicable to stationary cameras with fixed fields of view. Detection of moving objects with moving cameras has been researched to overcome this limitation.

The most typical method for detecting moving objects in video captured by a moving camera is the extension of the background subtraction method [15, 16]. In these methods, panoramic background models are constructed by applying various image registration techniques [17] to input frames, and the position of the current frame in the panorama is found by image matching algorithms. Then, moving objects are segmented in a similar way to the fixed camera case. Cucchiara et al. [15] built a background mosaic considering internal parameters of cameras. However, camera internal parameters are not always available. Shastry and Schowengerdt [18] proposed a frame-by-frame video registration technique using a feature tracker to automatically determine control-point correspondences. This converts the spatiotemporal video into temporal information, thereby correcting for airborne platform motion and attitude errors. However, a digital elevation map (DEM) is not always available. Different types of motion models are used in these works, but none of them considers the registration error caused by the parallax effect.

The second method to detect moving objects with a moving camera is optical flow [2, 19, 20]. The main concept proposed in [2] is to create an artificial optical flow field by estimating the camera motion between two subsequent video frames. Then, this artificial flow is compared with the real optical flow directly calculated from the video feed. Finally, a list of dynamic pixels is obtained and then grouped into dynamic objects. Yalcin et al. [19] propose a Bayesian framework for detecting and segmenting moving objects from the background, based on statistical analysis of optical flow. In [20] the authors obtain the motion model of the background by computing the optical flow between two adjacent frames in order to get motion information for each pixel. Optical flow methods first need to compute the optical flow field, which is sensitive to noise and does not give a precise result; moreover, they are not suitable for detecting moving vehicles in real time.

Recently, appearance feature based classification has been used widely in vehicle detection [3, 4]. Shi et al. [3] proposed a moving vehicle detection method based on a cascade of support vector machine (SVM) classifiers. Shape and histogram of oriented gradient (HOG) features are fused to train the SVM for classifying vehicles and nonvehicles. Cheng et al. [4] proposed a pixelwise feature classification method for vehicle detection using a dynamic Bayesian network (DBN). These approaches are promising. However, their effectiveness depends on the selected features. For example, the color feature of each pixel in [4] is extracted by the new color transformation in [21]. However, the new color transformation only considers the difference between vehicle color and road color and does not account for similar colors among vehicles, buildings, and roads (Figures 9(a2) and 9(b1)). Moreover, a large number of positive and negative training samples need to be collected to train the SVM for vehicle classification, which is another concern.

In this paper, we designed a new vehicle detection framework that preserves the advantages of the existing works and avoids their drawbacks. The modules of the proposed system framework are illustrated in Figure 1. It is a two-stage object detection scheme: initial vehicle detection, followed by refined vehicle detection with scene context and complementary features. The whole framework can be roughly divided into three parts, which are video stabilization, initial vehicle detection, and refined vehicle detection. Video stabilization is used to eliminate camera vibration and noise with SURF feature extraction. Initial vehicle detection is used to find the candidate motion regions with the optical flow normal. Performing background color removal can not only reduce false alarms and speed up the detection process but also facilitate the road extraction. The initial vehicle detections are refined by using the road context and complementary features such as the size of the candidate region. The whole process proceeds online and iteratively.

3. Hybrid Method for Moving Vehicle Detection

Here, we elaborate each module of the proposed system framework in detail. We compensated the ego motion of airborne vehicle by SURF [22] feature point based image alignment on consecutive frames and then applied an optical flow normal method to detect the pixels with motion. Pixels with high optical flow normal value were grouped as candidates of mobile vehicles. Meanwhile, the features such as size were used to improve the detection accuracy.

3.1. SURF Based Video Stabilization

Registration is the process of establishing correspondences between images, so that the images are in a common reference frame. Aerial images are acquired from a moving airborne platform, and large camera motion exists between consecutive frames; thus sequence stabilization is essential for motion detection. Global camera motion is eliminated or reduced by the process of image registration. For registration, descriptors such as SURF or SIFT (scale invariant feature transform) [23] can be used. In particular, SURF features were exploited due to their efficiency.

3.1.1. SURF Feature Extracting and Matching

The selection of features for motion estimation is very important, since unstable features may produce unreliable estimations under variations in rotation, scaling, or illumination. SURF is a robust image interest point detector and descriptor, first presented by Bay et al. [22]. The SURF descriptor is similar to the gradient information extracted by SIFT. The SURF algorithm includes two main parts: feature point detection and feature point description. Its efficiency comes from using the fast Hessian matrix to detect feature points and from introducing the integral image and box filters to approximate the second-order Gaussian derivatives. SURF has similar performance to SIFT; however, it is faster. An example of SURF is shown in Figures 4(a) and 4(b).

3.1.2. Feature Point Detection

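The integral images allow for fast computation of box filters. The entry of an integral image $I_\Sigma(\mathbf{x})$ at a location $\mathbf{x} = (x, y)$ represents the sum of all pixels in the input image $I$ within the rectangular region formed by the origin and $\mathbf{x}$:

$$I_\Sigma(\mathbf{x}) = \sum_{i=0}^{i \le x} \sum_{j=0}^{j \le y} I(i, j). \tag{1}$$

Once the integral image has been computed, it takes three additions to calculate the sum of the intensities over any upright rectangular area. Let $A$, $B$, $C$, and $D$ be the integral image values at the four corner points of the rectangular area shown in Figure 2, with $A$ and $D$ on one diagonal and $B$ and $C$ on the other. Hence, the sum of all pixels in the black rectangular area can be expressed by $\Sigma = A - B - C + D$. The calculation time is independent of the size of the area. This is important in the SURF algorithm.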
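As an illustration of why the box sum costs the same for any rectangle size, the following minimal Python sketch builds an integral image and evaluates a rectangle sum from four table lookups. The function names `integral_image` and `box_sum`, and the zero-padded table layout, are assumptions made for this example rather than part of the original implementation.

```python
def integral_image(img):
    """Return a summed-area table with one extra row and column of zeros.

    table[r + 1][c + 1] holds the sum of img[0..r][0..c].
    """
    h, w = len(img), len(img[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def box_sum(table, top, left, height, width):
    """Sum of the height x width rectangle whose top-left pixel is (top, left).

    Uses four lookups (three additions/subtractions), independent of size.
    """
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
print(box_sum(table, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

The zero padding avoids special cases for rectangles that touch the image border.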

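Then SURF uses the Hessian matrix to detect feature points. For a point $\mathbf{x} = (x, y)$ in the image $I$, the Hessian matrix at scale $\sigma$ is defined as

$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}. \tag{2}$$

In formula (2), $L_{xx}(\mathbf{x}, \sigma)$ is the convolution of the image at the point $\mathbf{x}$ with the Gaussian second-order partial derivative $\partial^2 g(\sigma) / \partial x^2$, and $L_{xy}(\mathbf{x}, \sigma)$ and $L_{yy}(\mathbf{x}, \sigma)$ are calculated similarly.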

In order to reduce the workload of calculation, SURF replaces $L_{xx}$, $L_{xy}$, and $L_{yy}$ with $D_{xx}$, $D_{xy}$, and $D_{yy}$, respectively, which are the convolutions of the original input image with box filters. The calculations are shown in Figure 3 and formula (3).

In Figure 3, the weight of a black pixel is −2 and that of a white pixel is 1. $D_{xx}$, $D_{yy}$, and $D_{xy}$ are calculated from the integral image according to formula (3). In formula (3), the pixel is addressed by its row and column in the image, one length parameter is 1/3 of the size of the box filter, and a second parameter is obtained by a rounding operation.

The determinant of the approximated Hessian matrix, $\det(H_{\text{approx}})$, is calculated as follows:

$$\det(H_{\text{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2. \tag{4}$$

By using a nonmaxima suppression method in the $3 \times 3 \times 3$ neighborhood, the image feature points can be found in different scales.
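The detection step can be sketched in a few lines of Python. The sketch assumes the standard SURF blob response $D_{xx} D_{yy} - (0.9 D_{xy})^2$ from Bay et al. [22] and a $3 \times 3 \times 3$ scale-space neighborhood for nonmaxima suppression; the function names and the nested-list layout `stack[scale][row][col]` are hypothetical and chosen only for illustration.

```python
def hessian_response(dxx, dyy, dxy, weight=0.9):
    """Approximate Hessian determinant from the three box-filter responses.

    weight=0.9 is the relative weight used in the SURF paper (assumed here).
    """
    return dxx * dyy - (weight * dxy) ** 2


def is_local_maximum(stack, s, r, c):
    """True if stack[s][r][c] strictly exceeds all 26 scale-space neighbours.

    The caller must pass an interior index (not on the border of the stack).
    """
    centre = stack[s][r][c]
    for ds in (-1, 0, 1):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (ds, dr, dc) == (0, 0, 0):
                    continue
                if stack[s + ds][r + dr][c + dc] >= centre:
                    return False
    return True


# Three 3x3 response maps (adjacent scales) with a single peak in the middle.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = hessian_response(2.0, 3.0, 1.0)
print(is_local_maximum(stack, 1, 1, 1))  # True
```

A practical detector would also threshold the response and interpolate the peak location, which this sketch omits.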

3.1.3. Feature Point Description

In order to be invariant to image rotation, a dominant orientation for each key point is identified first in feature point description. For a key point, Haar wavelet responses in the $x$ and $y$ directions are calculated within a circular neighborhood of radius $6s$ around it, where $s$ is the corresponding scale of the detected key point. The Haar wavelet responses can be computed using Haar wavelet filters and integral images. The wavelet responses are then weighted with a Gaussian ($\sigma = 2s$) centered at the key point. The dominant orientation can be estimated by rotating a sliding fan-shaped window of size $\pi/3$. At each position, the horizontal and vertical responses within the sliding window are summed and used to form a new vector. The longest such vector over all windows is assigned as the orientation of the key point.

Then, the SURF descriptor is generated in a square region of size $20s$ centered at the key point and oriented along its dominant orientation. The region is divided into $4 \times 4$ square subregions. For each subregion, the Haar wavelet responses $d_x$ in the horizontal direction and $d_y$ in the vertical direction are computed from $5 \times 5$ sample points. Then the wavelet responses $d_x$ and $d_y$ are weighted with a Gaussian ($\sigma = 3.3s$) centered at the key point. The responses and their absolute values are summed up over each subregion and form a 4D feature vector ($\mathbf{v} = (\sum d_x, \sum d_y, \sum |d_x|, \sum |d_y|)$). Thus for each key point, this results in a descriptor vector of length $4 \times 4 \times 4 = 64$. Finally, the SURF descriptor is normalized to make it invariant to illumination changes.
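The assembly of the descriptor from subregion sums can be sketched as follows. This is a simplified illustration that assumes the Haar responses have already been computed and Gaussian weighted; `subregion_vector` and `build_descriptor` are hypothetical names, and the input is taken to be 16 lists of $(d_x, d_y)$ pairs, one list per subregion.

```python
import math


def subregion_vector(responses):
    """4D vector (sum dx, sum dy, sum |dx|, sum |dy|) for one subregion."""
    return [
        sum(dx for dx, _ in responses),
        sum(dy for _, dy in responses),
        sum(abs(dx) for dx, _ in responses),
        sum(abs(dy) for _, dy in responses),
    ]


def build_descriptor(subregions):
    """Concatenate the subregion vectors and scale the result to unit length."""
    vec = []
    for responses in subregions:
        vec.extend(subregion_vector(responses))
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# 4 x 4 = 16 subregions, each with 5 x 5 = 25 sample responses (toy values).
subregions = [[(1.0, -1.0)] * 25 for _ in range(16)]
descriptor = build_descriptor(subregions)
print(len(descriptor))  # 64
```

Normalizing to unit length is what removes the dependence on a uniform contrast change, since such a change scales all wavelet responses by the same factor.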

After the feature extraction process, it is necessary to match feature points between two successive frames. For this process, we adopt the matching process proposed by Lowe [23]. This process is based on finding a match between the features of two consecutive images using the Euclidean distance. The Euclidean distance between SURF descriptors is employed to determine the initial corresponding feature point pairs in different images. We used RANSAC to filter outliers that come from the imprecision of the SURF model. The example is shown in Figures 4(c) and 4(d).
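A minimal sketch of the initial matching step is given below. It assumes nearest-neighbor matching with a distance ratio test in the spirit of Lowe [23]; the ratio threshold of 0.8 and the function name `match_descriptors` are assumptions for illustration, and the RANSAC outlier filtering described above is not included.

```python
import math


def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Return (index_a, index_b) pairs that pass the distance ratio test.

    A match is kept only if the nearest descriptor in desc_b is clearly
    closer (Euclidean distance) than the second nearest one.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches


frame1 = [[0.0, 0.0], [10.0, 10.0]]
frame2 = [[0.1, 0.0], [10.0, 10.1], [5.0, 5.0]]
print(match_descriptors(frame1, frame2))  # [(0, 0), (1, 1)]
```

The brute-force search is quadratic in the number of key points; it is used here only to keep the example short.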

3.1.4. Motion Detection and Compensation

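The temporally and spatially changing video can be modeled as a function $I(\mathbf{x}, t)$, where $\mathbf{x} = (x, y)$ is the spatial location of a pixel and $t$ is the temporal index within the sequence. The function can be thought of as representing the pixel intensity at location $\mathbf{x}$ and time $t$. This function satisfies the following property:

$$I(\mathbf{x}, t + \tau) = I(\mathbf{x} - \mathbf{d}, t). \tag{5}$$

This means that an image taken at time $t + \tau$ is considered to be shifted from the earlier image by $\mathbf{d}$, called the displacement in time $\tau$. If the pixel is obscured by noise, or if there is an abnormal intensity change due to light reflection by objects, (5) can be redefined as

$$I(\mathbf{x}, t + \tau) = I(\mathbf{x} - \mathbf{d}, t) + n(\mathbf{x}, t), \tag{6}$$

where $n(\mathbf{x}, t)$ is the noise term.

Using feature matching, we can get the geometric transformation between the frames $I_t$ and $I_{t+1}$, where $I_t(\mathbf{x}) = I(\mathbf{x}, t)$. Let $T$ denote the warping of the image $I_{t+1}$ to the reference frame $I_t$. The stabilized image sequence is then defined by $I_{t+1}(T(\mathbf{x}))$. The parameter estimation of the geometric transform is done by the minimum mean square error criterion:

$$\hat{T} = \arg\min_{T} \sum_{\mathbf{x}} \left[ I_t(\mathbf{x}) - I_{t+1}(T(\mathbf{x})) \right]^2. \tag{7}$$

Generally, the geometric transformation between two images can be described by a 2D or 3D homography model. We adopted a four-parameter 2D affine motion model to describe the geometric transformation between two consecutive frames. If $(x, y)$ is a point in frame $t$ and $(x', y')$ is the same point in the successive frame, then the transformation from $(x, y)$ to $(x', y')$ can be represented as

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} s\cos\theta & -s\sin\theta \\ s\sin\theta & s\cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}, \tag{8}$$

or in the form of $\mathbf{x}' = A\mathbf{x} + \mathbf{b}$. The affine matrix can describe accurately pure rotation, panning, and small translations of the camera in a scene with small relative depth variations and zooming effects. $s$ is the scaling factor, $\theta$ is the rotation, and $t_x$ and $t_y$ are the translations in the horizontal and vertical direction, respectively.

Corresponding pairs of feature points were used to determine the transform matrix in (8) from two consecutive image frames. Since four unknowns exist in (8) and each pair of corresponding points contributes two equations, at least two pairs are needed to determine a unique solution. Nevertheless, more matches can be added under the least-squares criterion to make the results more robust:

$$(\hat{s}, \hat{\theta}, \hat{t}_x, \hat{t}_y) = \arg\min_{s, \theta, t_x, t_y} \sum_{k=1}^{N} \left\| \mathbf{x}'_k - (A\mathbf{x}_k + \mathbf{b}) \right\|^2, \tag{9}$$

where $N$ is the number of matched pairs. Then we can compensate the current frame to obtain stable images. Compensation of the video is calculated directly using the warping operation. The example is shown in Figure 4(e).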
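To make the least-squares estimation of the four-parameter model concrete, the following Python sketch fits a scale, a rotation, and a translation to matched point pairs in closed form. It relies on the substitution $a = s\cos\theta$, $b = s\sin\theta$, which makes the model linear in $(a, b, t_x, t_y)$; the function name `estimate_similarity` and the closed-form solution via centered coordinates are choices made for this example, not necessarily the procedure used in the experiments.

```python
import math


def estimate_similarity(src, dst):
    """Least-squares fit of x' = s*R(theta)*x + (tx, ty) to point pairs.

    src, dst: equal-length lists of (x, y) tuples. Returns (s, theta, tx, ty).
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two point pairs")
    n = len(src)
    mx = sum(x for x, _ in src) / n
    my = sum(y for _, y in src) / n
    mu = sum(u for u, _ in dst) / n
    mv = sum(v for _, v in dst) / n
    num_a = num_b = denom = 0.0
    for (x, y), (u, v) in zip(src, dst):
        xc, yc, uc, vc = x - mx, y - my, u - mu, v - mv
        num_a += xc * uc + yc * vc   # numerator for a = s*cos(theta)
        num_b += xc * vc - yc * uc   # numerator for b = s*sin(theta)
        denom += xc * xc + yc * yc
    if denom == 0.0:
        raise ValueError("source points must not all coincide")
    a, b = num_a / denom, num_b / denom
    tx = mu - (a * mx - b * my)
    ty = mv - (b * mx + a * my)
    return math.hypot(a, b), math.atan2(b, a), tx, ty


# Synthetic check: transform known points and recover the parameters.
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 5.0), (7.0, 3.0)]
a0, b0 = 1.1 * math.cos(0.1), 1.1 * math.sin(0.1)
moved = [(a0 * x - b0 * y + 3.0, b0 * x + a0 * y - 2.0) for x, y in points]
s, theta, tx, ty = estimate_similarity(points, moved)
print(round(s, 6), round(theta, 6), round(tx, 6), round(ty, 6))
```

In practice the fit would be run only on the inlier matches that survive RANSAC, since a plain least-squares solution is sensitive to mismatched pairs.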

3.2. Vehicle Detection

After removing the undesired motion of camera, the first step of mobile vehicle detection was the initial vehicle detection, which produces the vehicle candidates, including many false alarms.

3.2.1. Normal Flow

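The reference frame and the warped one do not, in general, have the same metric since, in most cases, the mapping function $T$ is not a translation but a 2D affine transform. This change in metric can be incorporated into the optical flow equation associated with the image sequence $I(T(\mathbf{x}), t)$, in order to detect candidate mobile vehicle regions more accurately. From the image brightness constancy assumption [24, 25], the gradient constraint equation derived by Horn and Schunck [24] is

$$I_x u + I_y v + I_t = 0, \tag{10}$$

where $u$ and $v$ are the optical flow velocity components and $I_x$, $I_y$, and $I_t$ are the spatial gradients and temporal gradient of image intensity. Equation (10) is written in matrix form:

$$\nabla I^{\top} \mathbf{w} = -I_t, \quad \nabla I = (I_x, I_y)^{\top}, \quad \mathbf{w} = (u, v)^{\top}. \tag{11}$$

The optical flow associated with the image sequence $I(T(\mathbf{x}), t)$ is

$$\left[ \nabla \big( I(T(\mathbf{x})) \big) \right]^{\top} \mathbf{w} = -I_t. \tag{12}$$

Expanding the previous equation we obtain

$$\frac{\partial I(T(\mathbf{x}))}{\partial x} u + \frac{\partial I(T(\mathbf{x}))}{\partial y} v = -I_t. \tag{13}$$

According to composite function derivation rules,

$$\nabla \big( I(T(\mathbf{x})) \big) = J_T(\mathbf{x})^{\top}\, \nabla I(T(\mathbf{x})), \tag{14}$$

where $J_T(\mathbf{x})$ is the Jacobian matrix of $T$ at $\mathbf{x}$. Expanding (13) we obtain

$$\left[ J_T(\mathbf{x})^{\top}\, \nabla I(T(\mathbf{x})) \right]^{\top} \mathbf{w} = -I_t. \tag{15}$$

And therefore, the normal flow $\mathbf{w}_{\perp}$ is characterized by

$$\mathbf{w}_{\perp} = -I_t\, \frac{J_T(\mathbf{x})^{\top}\, \nabla I(T(\mathbf{x}))}{\left\| J_T(\mathbf{x})^{\top}\, \nabla I(T(\mathbf{x})) \right\|^2}. \tag{16}$$

Although $\mathbf{w}_{\perp}$ does not always characterize image motion, due to the aperture problem, it allows accurate detection of moving points. The amplitude of $\mathbf{w}_{\perp}$ is larger near moving regions and becomes null near stationary regions. The relation of normal flow and optical flow is shown in Figure 5(a) and the candidate mobile region detection is shown in Figure 5(b).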
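The amplitude of the normal flow can be sketched with simple finite differences. The example below assumes that the current frame has already been warped into the reference frame, so that spatial gradients taken on the warped frame already contain the metric change of the transform (by the chain rule); the function name `normal_flow_magnitude`, the central-difference gradients, and the `eps` guard against flat regions are assumptions made for illustration.

```python
import math


def normal_flow_magnitude(prev, curr, eps=1e-6):
    """Per-pixel |I_t| / ||grad I|| for two stabilized grayscale frames.

    prev, curr: equal-sized 2D lists of intensities. Border pixels and
    pixels with a (near) zero spatial gradient are left at 0.0.
    """
    h, w = len(curr), len(curr[0])
    mag = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (curr[r][c + 1] - curr[r][c - 1]) / 2.0  # horizontal gradient
            iy = (curr[r + 1][c] - curr[r - 1][c]) / 2.0  # vertical gradient
            it = curr[r][c] - prev[r][c]                  # temporal gradient
            grad = math.hypot(ix, iy)
            if grad > eps:
                mag[r][c] = abs(it) / grad
    return mag


# A horizontal intensity ramp that moves one pixel to the right.
prev_frame = [[float(c) for c in range(5)] for _ in range(5)]
curr_frame = [[float(c - 1) for c in range(5)] for _ in range(5)]
flow = normal_flow_magnitude(prev_frame, curr_frame)
print(flow[2][2])  # 1.0 (one pixel per frame along the gradient)
```

Thresholding this map and grouping the surviving pixels into connected regions would give the candidate motion regions; both steps are outside the scope of the sketch.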

3.2.2. Context Extraction

Context is especially useful in aerial video analysis, because most of the vehicles move in specific areas. The road is an effective piece of context information for robust mobile vehicle detection. Many methods estimate the road network using scene classification, which needs complicated training and considerable preparation in advance. Based on general human knowledge, we can make the following brief description of the road.
(i) The road has constant width along all its length.
(ii) The road is always vertical or horizontal in the airborne videos.
(iii) There are two distinct parallel edges of the road.
(iv) The road is always a connected region.

Based on the above assumptions, we use Canny edge detection and the Hough transform to extract the road area. The results are shown in Figure 6.

3.2.3. Complementary Features

Initial vehicle detection produces candidate mobile vehicle regions, including many false alarms, as shown in Figure 7. We use the shape (size) [3] of the candidate motion regions to improve the detection performance. The size feature is a four-dimensional vector, given in (17), which is built from the length and width of the object.

4. Experiment Results and Analysis

We tested our method with three surveillance videos. The first two, named 2.avi and gs.avi, were obtained from our own hardware platform and are shown in Figure 9(a). The third, named TucsonBlvd_origin.avi, is from the paper by Shastry and Schowengerdt [18] and is shown in Figure 9(b). The first two were taken at 25 frames per second with resolution of pixels on the airship at a height of 120 m above the ground, where the speed of the airship is 30 km/h, as shown in Figure 8.

In terms of the number of vehicles and the background complexity in Figure 9, (a1) contains the fewest vehicles, and its background is simple and includes no buildings; therefore it does not cause visual errors. The number of vehicles increases in (a2), and the background includes buildings, which cause visual errors. The most complex video is (b), which not only includes more vehicles and buildings but also has the lowest resolution. The experimental results show that different videos have different detection performance.

The hardware platform of the simulation has a 2.1 GHz CPU and 2 GB of RAM. The software used in the experiments is OpenCV 1.0 and VC++ 6.0.

4.1. Image Stabilization Comparison between SURF and SIFT

Our first experiment compares our video stabilization system with that of [5], which is based on SIFT feature extraction. We demonstrated that the Speeded Up Robust Feature (SURF) key points are more suitable for the stabilization task. Figure 10 shows five frames of the unstable input sequence, corresponding to frames 1, 2, 5, 10, and 15, taken from 2.avi.

Next, we compute the global motion vector, shown in Table 1. Table 1 shows that the airship moves mostly in the vertical direction and that the accuracy of the vector is almost the same for the two video stabilization methods.

Figures 11(a) and 11(b) show the stabilization results of SIFT and SURF, respectively. Subjectively, our video stabilization system produces results comparable to those of SIFT.

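Then, we used Peak Signal-to-Noise Ratio (PSNR), an error measure, to evaluate the quality of the video stabilization. PSNR between frame 1 and the stabilized frame $k$ is defined as

$$\mathrm{PSNR}(I_1, I_k) = 10 \log_{10} \frac{255^2}{\mathrm{MSE}(I_1, I_k)}, \tag{18}$$

where $\mathrm{MSE}(I_1, I_k)$ is the mean square error between frames $I_1$ and $I_k$ and $M \times N$ is the frame dimensions:

$$\mathrm{MSE}(I_1, I_k) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ I_1(i, j) - I_k(i, j) \right]^2. \tag{19}$$

We found that our stabilization system using SURF features works well compared to the stabilization system using SIFT features, as shown in Figure 12. Because of the parallax effect of the warping operation and the presence of multiple moving vehicles, the PSNR is low. Therefore, in the mobile vehicle detection, we use the normal optical flow.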
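The PSNR measure can be computed as in the following minimal Python sketch. It assumes 8-bit grayscale frames, so the peak value is 255; the function names `mse` and `psnr` and the convention of returning infinity for identical frames are choices made for this example.

```python
import math


def mse(frame_a, frame_b):
    """Mean square error between two equal-sized 2D intensity arrays."""
    rows, cols = len(frame_a), len(frame_a[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            diff = frame_a[r][c] - frame_b[r][c]
            total += diff * diff
    return total / (rows * cols)


def psnr(frame_a, frame_b, peak=255.0):
    """PSNR in dB; identical frames give infinity (assumed convention)."""
    err = mse(frame_a, frame_b)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)


reference = [[100, 100], [100, 100]]
stabilized = [[101, 99], [100, 100]]
print(mse(reference, stabilized))  # 0.5
```

A higher PSNR between the reference frame and a stabilized frame indicates a better alignment of the static background.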

Objectively, Table 1 and Figure 12 show that our video stabilization system achieves better results than the SIFT-based system.

Last, we compare the performance of the two video stabilization methods, shown in Figure 13.

The experiments show that the image stabilization accuracy of the two methods is comparable in both subjective and objective evaluation, and that image stabilization with SURF is more efficient than with SIFT. We find that our stabilization system works well.

4.2. Mobile Vehicle Detection Comparison between Proposed Method and Existing Methods

To evaluate the performance of mobile vehicle detection, our tests were run on a number of real aerial video sequences with various contents. The aerial videos include cars and buildings. Figure 14 shows the results under different conditions in the videos. Each detected mobile vehicle is marked with a red rectangle. From the results, we can see that moving objects can be successfully detected against different backgrounds, although a detection failure was also observed.

For the quantitative analysis of our results, we used two metrics, detection ratio (DR) and false alarm ratio (FAR):

$$\mathrm{DR} = \frac{TP}{TP + FN}, \qquad \mathrm{FAR} = \frac{FP}{TP + FP}. \tag{20}$$

In (20), TP is the number of true positives of mobile vehicles, FP is the number of false positives of mobile vehicles, and FN is the number of false negatives (vehicles not detected). Results are shown in Table 2, and Figure 15 compares the vehicle detection results on 2.avi obtained with GMM, LK, and the proposed method. Table 2 and Figures 14 and 15 illustrate the performance of our system. Because the resolution and complexity of the videos are different, the detection performance is different. Our system has the highest DR and the lowest FAR.
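The two metrics can be computed from the detection counts as in the sketch below. It assumes the common definitions DR = TP / (TP + FN) and FAR = FP / (TP + FP), which use only the three counts named in the text; the function name `detection_metrics` and the example counts are hypothetical and are not results from the experiments.

```python
def detection_metrics(tp, fp, fn):
    """Return (DR, FAR) from true positive, false positive, false negative counts.

    Assumed definitions: DR = TP / (TP + FN), FAR = FP / (TP + FP).
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("metrics are undefined without vehicles or detections")
    detection_ratio = tp / (tp + fn)
    false_alarm_ratio = fp / (tp + fp)
    return detection_ratio, false_alarm_ratio


# Hypothetical counts, for illustration only.
dr, far = detection_metrics(tp=90, fp=10, fn=5)
print(f"DR = {dr:.3f}, FAR = {far:.3f}")  # DR = 0.947, FAR = 0.100
```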

5. Conclusion

In this paper, we present a hybrid method to detect mobile vehicles efficiently in aerial videos. We also demonstrate that SURF features are robust for video stabilization and mobile vehicle detection and are more efficient than SIFT features. A quantitative evaluation on real video sequences demonstrates that the proposed method improves the detection performance. Our future work will focus on the following aspects to improve our method.
(i) To increase the accuracy of mobile vehicle detection, more local and global features, such as color information and gradient distribution, can be applied in the method.
(ii) We have to balance the processing speed against the complexity and robustness of the algorithm.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to express their sincere thanks to the anonymous referees and editors for their time and patience devoted to the review of this paper. This work is partially supported by NSFC Grant (no. 41101355).