In order to safely navigate populated environments, an autonomous vehicle must be able to detect human shapes using its sensory systems, so that it can properly avoid a collision. In this paper, we introduce a Bayesian approach to the Viola-Jones algorithm, as a method to automatically detect pedestrians in image sequences. We present a probabilistic interpretation of the basic execution of the original tool and develop a technique to produce approximate convolutions of probability matrices with multiple local maxima.

1. Introduction

Being able to detect and avoid pedestrians is an essential feature of autonomous vehicles if they are to guarantee safe behavior in populated environments. However, automatically detecting human shapes in images is a very complex task for a computer vision system, and it has been widely studied.

One of the most common frameworks in the literature is Viola-Jones [1], which is based on feature training and classifier cascades and is explained in detail in Section 2.1. This technique has been improved by its authors by considering object motion [2, 3], and also by applying several classifiers simultaneously [4] or using RealBoost to improve weak classifiers [5].

The main contribution of this paper is a Bayesian approach to pedestrian detection methods (exemplified by, but not limited to, the Viola-Jones framework). It is built by creating a statistical interpretation of the basic execution of the original algorithm and by developing a technique to produce approximate convolutions of probabilistic matrices with multiple local maxima. The aim is to increase the precision of the framework when it is used on autonomous vehicles, so that obstacles and pedestrians in image sequences are detected and avoided more efficiently.

Furthermore, the method we present can be used with both preprocessed binary results and unaltered probabilistic elements. As the latter are commonly returned by the sensors of a robot, this allows for greater flexibility and a more accurate management of the uncertainty of the available data.

1.1. Related Work

Another important algorithm for detecting pedestrians uses Histograms of Oriented Gradients (HOG) to define the features of an image [6]. This algorithm has been implemented for FPGA-based accelerators [7] and GPUs [8] and combined with Support Vector Machine (SVM) classifiers [9, 10]. Variations of histogram-based detection methods, such as Co-occurrence HOG [11] and combinations with wavelet methods [12], also exist. Bayesian methods have also been applied to the problem of pedestrian detection [13].

Both the HOG and Viola-Jones algorithms are included in the official release of OpenCV [14]. Although the former usually provides very precise detection results, as studied in [15], it has been shown to run slightly slower than the latter and is therefore less suitable for a real-time operation such as pedestrian detection for a moving vehicle.

2. Materials and Methods

2.1. Viola-Jones Framework

The Viola-Jones object detection framework uses object features which, similarly to Haar-like features [16], are defined by additions and subtractions of the sums of pixel values within rectangular, nonrotated areas of an image. The different types of features used by Viola-Jones are shown in Figure 1.

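Thanks to the usage of integral images, such that

$$II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y'),$$

where $II$ is the integral of image $I$, these operations can be done in constant time. For example, the sum of all the pixels of a rectangle such as the one in Figure 2, spanning $x_1 < x \le x_2$ and $y_1 < y \le y_2$, would be calculated as

$$II(x_2, y_2) - II(x_1, y_2) - II(x_2, y_1) + II(x_1, y_1),$$

since each value $II(x, y)$ is the sum of all the pixels in the rectangle defined by the opposite corners $(0, 0)$ and $(x, y)$.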
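The constant-time rectangle sum can be illustrated with a minimal sketch in plain Python. The function names `integral_image` and `rect_sum` are illustrative choices, and the rectangle corners are taken as inclusive pixel coordinates, which is one of several equivalent indexing conventions.

```python
def integral_image(img):
    """Return ii where ii[y][x] is the sum of img over rows 0..y and columns 0..x."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the pixels inside the inclusive rectangle (x0, y0)-(x1, y1).

    Uses at most four lookups, regardless of the rectangle size.
    """
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)

# A two-rectangle feature: left column minus middle column.
two_rect_feature = rect_sum(ii, 0, 0, 0, 2) - rect_sum(ii, 1, 0, 1, 2)
```

A feature made of two or three rectangles therefore costs only a handful of table lookups, which is what makes evaluating many features per window affordable.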

A set of classifiers is then trained using AdaBoost [17], and a cascade architecture allows the result to be used in real time, by immediately discarding a sample as soon as one classifier rejects it, as shown in Figure 3.
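The early-rejection behavior of a cascade can be sketched as follows. The stage predicates below are hypothetical placeholders rather than trained AdaBoost classifiers; they only serve to show that a rejected sample stops being evaluated as soon as one stage fails.

```python
def run_cascade(sample, stages):
    """Return (accepted, stages_evaluated) for one candidate sample.

    Each stage is a callable returning True (pass) or False (reject).
    """
    for index, stage in enumerate(stages, start=1):
        if not stage(sample):
            return False, index  # rejected: later stages are never run
    return True, len(stages)


# Hypothetical stages, ordered from cheapest to most selective.
stages = [
    lambda s: sum(s) > 10,
    lambda s: max(s) - min(s) > 3,
    lambda s: s[0] < s[-1],
]

rejected_early = run_cascade([1, 2, 3], stages)
rejected_late = run_cascade([5, 5, 5], stages)
accepted = run_cascade([2, 5, 9], stages)
```

Since most windows in an image contain no object, most of them are discarded by the first stages, which keeps the average cost per window low.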

2.2. Bayesian Model

Let $O$ and $D$ be two random variables.

(i) $O$ expresses the existence or absence of objects of interest (in our case, pedestrians) within an image, for each pixel location.
(ii) $D$ shows an equivalent value, as returned by the Viola-Jones detection when applied to an image.

It is possible to use $D$ as evidence to evaluate the degree of belief of proposition $O$ (i.e., $P(O \mid D)$), by applying Bayes’ theorem:

$$P(O \mid D) = \frac{P(D \mid O)\, P(O)}{P(D)}.$$

The common use of a Bayesian model is to weed out wrong positive detections by comparing them to previous observations. However, when detecting pedestrians this decision could be damaging to the procedure, since false positives are preferable to false negatives: a missed detection involves immediate danger, whereas a false detection would only cause a less efficient route.

Therefore, we propose a reverse application of Bayes’ theorem, which filters absences of objects rather than detections, by considering the reverse values of the presented variables:

$$P(\neg O \mid \neg D) = \frac{P(\neg D \mid \neg O)\, P(\neg O)}{P(\neg D)},$$

where $P(\neg D \mid \neg O)$ and $P(\neg O)$ are calculated as explained in the following subsections.
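For a single pixel, the reverse update can be worked through numerically. In the sketch below, O stands for the presence of an object and D for a detection (both symbol names are assumptions), and the evidence term is expanded with the law of total probability. The miss rate P(not D | O) is an assumed input introduced only for this illustration; it is not specified in the text.

```python
def posterior_absence(like_absence, prior_absence, miss_rate):
    """Per-cell P(not O | not D) obtained from Bayes' theorem.

    like_absence  -- P(not D | not O): no detection, given that no object is there
    prior_absence -- P(not O): prior belief that the cell holds no object
    miss_rate     -- P(not D | O): assumed chance of missing a present object
    """
    evidence = like_absence * prior_absence + miss_rate * (1.0 - prior_absence)
    if evidence == 0.0:
        # A non-detection is impossible under both hypotheses; return 0.0 so
        # that absence is never asserted without support.
        return 0.0
    return like_absence * prior_absence / evidence


# With an uninformative prior, a reliable non-detection raises the belief
# in absence from 0.5 to 0.45 / 0.55.
example = posterior_absence(0.9, 0.5, 0.2)
```

A low posterior of absence keeps a cell flagged as potentially occupied, which matches the stated preference for false positives over false negatives.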

2.2.1. Likelihood

The default behavior of the Viola-Jones detection method, for a given image, is to return a set of rectangles within which objects of interest have been found.

A binary matrix can be produced from these areas, such that each cell is set to 1 if it belongs to one of them, and 0 otherwise. In our work, the binary matrix corresponding to the $i$th rectangle is named $R_i$.

Some of these marked areas may be superfluous (false positives), and others may overlap. The more rectangles overlap over a group of pixels, the more likely that group is to contain an actual object of interest.

The original Viola-Jones algorithm allows for a minimum overlap restriction: a rectangle would only be valid if it can be computed as the intersection of a given number of overlapping detections.

Instead, we suggest producing a detection matrix, such that the value of each of its cells is equal to the number of rectangles that overlap over its corresponding pixel (Figure 4). This matrix is equal to the sum of the binary matrices of all the observed detections.
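A minimal sketch of this accumulation is shown below. Rectangles are assumed to be given as `(x, y, w, h)` tuples, which is the layout returned by OpenCV's `detectMultiScale`; the function name `detection_matrix` is an illustrative choice.

```python
def detection_matrix(width, height, rectangles):
    """Count, for every pixel, how many (x, y, w, h) rectangles cover it.

    Equivalent to summing one binary mask per rectangle.
    """
    counts = [[0] * width for _ in range(height)]
    for x, y, w, h in rectangles:
        # Clip each rectangle to the image bounds before accumulating.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                counts[row][col] += 1
    return counts


# Two overlapping 3x2 detections on a 5x4 image.
overlaps = detection_matrix(5, 4, [(0, 0, 3, 2), (1, 1, 3, 2)])
```

Cells covered by both detections hold the value 2, so the overlap count is preserved instead of being collapsed into a single accepted rectangle.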

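The likelihood matrix for the probability of absence of objects of interest within an image is proportional to the opposite value of the detection matrix; for $n$ detections with binary matrices $R_1, \ldots, R_n$, this would be

$$P(\neg D \mid \neg O) \propto n - \sum_{i=1}^{n} R_i.$$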
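One way to turn a detection matrix into an absence likelihood is sketched below. Dividing the complement of the overlap count by the number of detections is an assumed normalization, chosen so that values fall in [0, 1]; any positive rescaling would respect the same proportionality.

```python
def absence_likelihood(counts, n):
    """Map overlap counts to a likelihood of absence in [0, 1].

    counts -- detection matrix (number of overlapping rectangles per cell)
    n      -- total number of detections in the image
    """
    if n == 0:
        # No detections at all: every cell is fully compatible with absence.
        return [[1.0 for _ in row] for row in counts]
    return [[(n - value) / n for value in row] for row in counts]


likelihood = absence_likelihood([[0, 1], [2, 4]], 4)
```

Cells covered by every detection get an absence likelihood of 0, while cells covered by none keep a value of 1.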

The concept of associating a weight value to each detection was also presented in the Soft Cascade method [18]. Its results are returned as rectangular areas, but unlike Viola-Jones, these are isolated and as such cannot be processed into probabilistic matrices. Preliminary tests showed that, because of this restriction, the accuracy of this technique is noticeably inferior to that of the probabilistic interpretation of Viola-Jones that we present in this work. Therefore, we chose not to use Soft Cascade in our experiments.

2.2.2. Prior

Applying Bayes’ theorem iteratively requires evolving the resulting posterior probability function in order to produce the prior probability function for the following iteration of the algorithm (typically, a convolution is applied).

Ideally, at each time step $t$, the location of an object is determined by a certain probability distribution. The distribution of the appearance of objects of interest in our experiments is extracted from the normalized addition of overlapping binary rectangular distributions, which is asymmetrical and has a flat top. A new probability distribution was developed to approximate this behavior.

Consider the set of detections returned by the Viola-Jones method for a particular object of interest. The object can be represented as a tuple of three elements:

(i) the number of detections in the set,
(ii) the minimal rectangular area that holds the intersection of all the detections in the set, and
(iii) the minimal rectangular area that holds the union of all the detections in the set.
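The three elements of such a tuple can be computed directly from the corner coordinates of the detections. The sketch below assumes rectangles given as `(left, top, right, bottom)` and a nonempty common intersection; the function name `summarize_detections` is illustrative.

```python
def summarize_detections(detections):
    """Return (count, inner, outer) for a list of (left, top, right, bottom) boxes.

    inner -- intersection of all boxes (assumed nonempty)
    outer -- smallest box that holds the union of all boxes
    """
    lefts, tops, rights, bottoms = zip(*detections)
    inner = (max(lefts), max(tops), min(rights), min(bottoms))
    outer = (min(lefts), min(tops), max(rights), max(bottoms))
    return len(detections), inner, outer


count, inner, outer = summarize_detections([
    (10, 20, 50, 100),
    (15, 25, 55, 110),
    (12, 18, 48, 105),
])
```

The inner rectangle marks where all detections agree, and the outer rectangle bounds everything that any detection touched.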

Using these data, a two-dimensional function which simulates the summation of all the elements in was modeled:

Considering a single dimension, the intersection and union rectangles can be seen as two nested segments, the former contained in the latter (Figure 5).

Consider the following function:

The shape of suits our needs, but its height is scaled down so that, for two dimensions, the summation of the detections of a single object can be calculated asfor , and where , , , and are, respectively, the leftmost, rightmost, upper, and lower limits of area , and , , , and are the corresponding limits of area .
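As a rough illustration of a flat-topped, asymmetrical profile built from an inner and an outer rectangle, the sketch below uses a piecewise-linear (trapezoidal) ramp along each axis and scales the product by the number of detections. The trapezoid is an assumption chosen for simplicity and is not necessarily the function defined above; all names are hypothetical.

```python
def flat_top_profile(x, outer_lo, inner_lo, inner_hi, outer_hi):
    """1 inside [inner_lo, inner_hi], 0 outside [outer_lo, outer_hi], linear between."""
    if x < outer_lo or x > outer_hi:
        return 0.0
    if x < inner_lo:
        return (x - outer_lo) / (inner_lo - outer_lo)
    if x > inner_hi:
        return (outer_hi - x) / (outer_hi - inner_hi)
    return 1.0


def object_height(x, y, count, inner, outer):
    """Approximate overlap count at (x, y) for one object.

    inner and outer are (left, top, right, bottom) rectangles, inner inside outer.
    """
    fx = flat_top_profile(x, outer[0], inner[0], inner[2], outer[2])
    fy = flat_top_profile(y, outer[1], inner[1], inner[3], outer[3])
    return count * fx * fy


inner_box = (4, 2, 6, 6)
outer_box = (0, 0, 10, 8)
center = object_height(5, 4, 3, inner_box, outer_box)
slope = object_height(2, 4, 3, inner_box, outer_box)
corner = object_height(8, 7, 3, inner_box, outer_box)
outside = object_height(11, 4, 3, inner_box, outer_box)
```

Because the left and right ramps have independent widths, the profile is asymmetrical whenever the inner rectangle is not centered within the outer one.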

A probability matrix can therefore be generated, using the tuples which define the detected objects of interest. For objects

In order to isolate each object of interest among the added distributions of all the detections in an image, we locate the maximum value in the probability matrix and analyze its adjacent cells to define a tuple, such that

(i) the inner area contains all the cells that share the maximum probability value, caused by the overlapping of all the involved detection rectangles, and
(ii) the outer area contains all the cells that are delimited by local minima and zero values, so that we can assume that all nonzero cells that are not contained in it belong to unrelated detections.

After an object is located, its data are stored and it is removed from the probability matrix. This procedure is repeated until the matrix is empty.
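The peeling procedure can be sketched with a flood fill that starts at the maximum and only moves to neighbors with equal or lower nonzero values. This is a simplified reading of "delimited by local minima and zero values", using 4-connectivity and bounding boxes; both choices, and the function names, are assumptions of this sketch.

```python
def bounding_box(cells):
    """Smallest (left, top, right, bottom) box holding all (row, col) cells."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(cols), min(rows), max(cols), max(rows)


def extract_objects(matrix):
    """Repeatedly remove the highest peak; return (peak, inner, outer) tuples."""
    grid = [row[:] for row in matrix]  # work on a copy
    height, width = len(grid), len(grid[0])
    objects = []
    while True:
        peak = max(max(row) for row in grid)
        if peak <= 0:
            return objects  # matrix is empty
        start = next((r, c) for r in range(height) for c in range(width)
                     if grid[r][c] == peak)
        # Descend from the peak without ever climbing back up.
        region, stack = {start}, [start]
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < height and 0 <= nc < width
                        and (nr, nc) not in region
                        and 0 < grid[nr][nc] <= grid[r][c]):
                    region.add((nr, nc))
                    stack.append((nr, nc))
        plateau = [cell for cell in region if grid[cell[0]][cell[1]] == peak]
        objects.append((peak, bounding_box(plateau), bounding_box(region)))
        for r, c in region:
            grid[r][c] = 0  # remove the object before the next pass


found = extract_objects([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0, 1, 1, 0],
    [0, 1, 2, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
])
```

In the example, the first object has a peak of 2 with a narrow inner area and a wider outer area, while the second is a single flat detection whose inner and outer areas coincide.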

Once all objects are extracted, they are matched to those of previous time steps to study their relative movement. When the objects involved are clearly individual, their movements can be analyzed and predicted separately. In our case, however, their number and their correspondences between frames are unknown.

Using a minimum mean square error estimation, each object is then added to a previously stored trajectory, which is used to predict new values for the following time step, using a linear regression over the tuple values.
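The matching and prediction steps can be sketched as follows, assuming each object is flattened into a list of numbers and each trajectory is a list of such lists, one per time step. The ordinary least-squares line fit and the helper names are assumptions of this sketch.

```python
def predict_next(values):
    """Fit a line to values observed at t = 0..n-1 and evaluate it at t = n."""
    n = len(values)
    if n == 1:
        return float(values[0])
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return v_mean + slope * (n - t_mean)


def predict_object(trajectory):
    """Predict every component of the object for the following time step."""
    return [predict_next(component) for component in zip(*trajectory)]


def mean_square_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def closest_trajectory(obj, trajectories):
    """Index of the trajectory whose prediction is nearest to obj."""
    return min(range(len(trajectories)),
               key=lambda i: mean_square_error(obj, predict_object(trajectories[i])))


trajectories = [
    [[10, 20], [12, 20], [14, 20]],  # moving right at constant speed
    [[50, 5], [50, 6]],              # moving down slowly
]
predictions = [predict_object(t) for t in trajectories]
match = closest_trajectory([17, 21], trajectories)
```

A new object is appended to the trajectory it matches, and the updated trajectory then provides the predicted values for the next frame.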

The prediction values are finally used to generate the prior probability matrix using (9) (Figure 6).

3. Results and Discussion

Our method was tested over twelve image sequences, described in Table 1 and exemplified by Figure 7. Dataset ETSII was recorded in the parking lot of the Computer Engineering School of Universidad de La Laguna. Datasets ITER1 and ITER2 were filmed in the outer limits and in the parking lot of the Institute of Technology and Renewable Energy (ITER) facilities in Tenerife (Spain), respectively.

These three image sequences were captured by the visual sensors of the VERDINO prototype (Figure 8), a modified EZ-GO TXT-2 golf cart equipped with computerized steering, braking, and traction control systems. Its sensor system consists of a differential GPS, an Inertial Measurement Unit (IMU), an odometer, three Sick LMS221-30206 laser range finders, two thermal stereo cameras, and two Santachi DSP220x optical cameras.

Datasets BAHNHOF, JELMOLI, and SUNNY DAY were downloaded from Andreas Ess’ Robust Multi-Person Tracking from Mobile Platforms website at the Swiss Federal Institute of Technology. These image sequences were recorded using a pair of AVT Marlins F033C and have been used in publications [19–22].

Datasets CAVIAR1 to CAVIAR4 belong to the Context Aware Vision using Image-based Active Recognition (CAVIAR) project [23] and were recorded in a shopping center in Portugal using a static camera. The selected image sequences correspond to the corridor views of clips WalkByShop1 (CAVIAR1), OneShopOneWait1 (CAVIAR2), OneShopOneWait2 (CAVIAR3), and ThreePastShop1 (CAVIAR4).

Dataset DAIMLER corresponds to the Daimler pedestrian detection benchmark dataset, introduced in [24], and dataset CALTECH corresponds to sequence V002 from testing set seq06 of the Caltech pedestrian detection benchmark [15, 25]. Both datasets were recorded from a vehicle driving through regular traffic in an urban environment.

Ten tests were conducted over each image dataset; the average results are shown in Figures 9 and 10. As explained in Section 2.2, the main goal of our detection enhancement method is to reduce the number of false negatives returned by the Viola-Jones framework. As such, classic analysis techniques such as receiver operating characteristic (ROC) and detection error tradeoff (DET) curves, which depend on the number of false positives in the results, do not properly display the improvement introduced by our approach. We instead present the average ratio between the number of false negatives returned by both the original and the enhanced detection methods, and the number of true positives found in the input frames.

We observed that our Bayesian approach always provides more conservative detection rates than Viola-Jones, successfully lowering the rate of false negatives for all datasets. Results were especially good for the ETSII, ITER, CAVIAR, and DAIMLER datasets. The sequences for these sets have good visibility, which results in more accurate detections by the original method and, consequently, a higher improvement introduced by our approach.

The rest of the datasets have higher occlusion rates and feature pedestrians in poses and locations that complicate their detection, thus reducing the enhancement provided by the Bayesian processing. This effect was especially noticeable for the CALTECH dataset, which features very few clearly visible pedestrians.

4. Conclusions

We have developed a Bayesian approach to the Viola-Jones detection method and applied it to a real case where pedestrians must be located and avoided by a self-guided device. Our method describes a statistical modification of the original tool, which is combined with a form of approximate convolution of two-dimensional probability matrices with multiple local maxima.

Our algorithm has been shown to improve the precision of the results, by restricting a probabilistic matrix returned by the original method to the area where objects are expected to appear, according to their previously observed movements.

It was found that our method behaves best when pedestrians are clearly visible, so that the detections by the original method can be properly enhanced by a Bayesian processing. More accurate detection algorithms are expected to improve the results of our approach in situations of high visual occlusion. This proposal serves as grounds for further research.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors gratefully acknowledge the contribution of the Spanish Ministry of Economy and Competitiveness (http://www.mineco.gob.es/) under Project STIRPE DPI2013-46897-C2-1-R. Javier Hernández-Aceituno’s research is supported by an FPU grant (Formación de Profesorado Universitario), FPU2012-3568, from the Spanish Ministry of Science and Innovation (http://www.micinn.es/). The authors also gratefully acknowledge the funding granted to the Universidad de La Laguna by the Agencia Canaria de Investigación, Innovación y Sociedad de la Información; 85% was cofunded by the European Social Fund.