#### Abstract

Even though several promising approaches have been proposed in the literature, generic category-level object detection is still challenging due to high intraclass variability and ambiguity in the appearance among different object instances. From the view of constructing object models, the balance between flexibility and discrimination must be taken into consideration. Motivated by these demands, we propose a novel contextual hierarchical part-driven conditional random field (CRF) model, which is based on not only individual object part appearance but also model contextual interactions of the parts simultaneously. By using a latent two-layer hierarchical formulation of labels and a weighted neighborhood structure, the model can effectively encode the dependencies among object parts. Meanwhile, beta-stable local features are introduced as observed data to ensure the discriminative and robustness of part description. The object category detection problem can be solved in a probabilistic framework using a supervised learning method based on maximum a posteriori (MAP) estimation. The benefits of the proposed model are demonstrated on the standard dataset and satellite images.

#### 1. Introduction

Object category detection is one of the most important problems in computer vision and is still full of challenges because of various factors such as object deformation, occlusion, and viewpoint change. To address these challenges, successful object detection methods need to strike the balance between being flexible enough to model intraclass variability and being discriminative enough to find objects with ambiguity appearance in complicate scenes [1–3].

Part-based object model, firstly proposed by Fischler and Elschlager [4] in 1973, has been proved as a powerful paradigm for object category detection and recognition in numerous researches [5–10], due to its advantages of intuitive interpretation and semantic expression. In such models, each part is generally represented by small templates or local image feature information, and the whole object is modeled as a collection of parts with or without geometric and cooccurrence constraints. The final discriminate of object is achieved by solving the probability density function or using a Hough vote mechanism. In the early researches on part-based approaches, parts are learned purely on the basis of their appearance by clustering visually similar image patches in the training images and do not exploit any spatial layout of the parts. Obviously, since the part appearance only reflects local image characteristics, these models cannot get enough spatial information support. The neglected contextual interactions that are used to capture geometric relationships between parts of an object should play a more crucial role in the part-based model to enhance the representational power of model.

On the other hand, most current part-based approaches can be roughly divided into two separate groups: generative and discriminative. Generative part-based models [5–9] have shown high flexibility because of their advantage of handling missing data (i.e., the correspondence between local features and parts) in a principled manner. So, each part can be interpreted in a semantically meaningful way. The most popular generative approach for part-based object detection was proposed by Fergus et al. [7] in 2003, in which objects are modeled as flexible constellations of parts and the appearance, spatial relations, and cooccurrence of local parts are learned in an unsupervised manner. Felzenszwalb and Huttenlocher [8] proposed a pictorial structure model, in which deformable configuration is represented by spring-like connections between pairs of parts. By integrating spatial relationships with “bag of features,” Sudderth et al. [9] developed a hierarchical probabilistic model to capture the complex structure in multiple object scenes. However, generative approaches often cannot compete with discriminative manner in the field of object category detection. The generative framework has natural drawback that it has to assume the independence of the observed data to make the model computationally. In contrast to the discriminative model, the generative model may be quite complex even though the class posterior is simple. Moreover, learning the class density models may become even harder when the training data is limited [10].

In this paper, we focus on the discriminative random field model, called conditional random field (CRF), which is originally proposed by Lafferty et al. [11] in 2001. Kumar and Herbert [12, 13] first introduced the extension of 1D CRFs to 2D graphs over image and applied it to object detection. By treating object detection problem as a labeling problem, CRF model cannot only flexibly utilize various heuristic image features, but also get the contextual interactions among image parts through its classic graphical structure. In order to deal with multiple labels for object parts, Kumar and Hebert presented a multiclass extension of CRF [14], and utilized fully labeled data where each object part is assigned a part label to train the model. By contrast, Quattoni et al. presented an expansion graph structure of CRF framework [15] that uses hidden variables, which are not observed during training, to represent the assignment of parts. Moreover, located CRF model [16], proposed by Kapoor and Winn, introduces global positions to the hidden variables and can model the long-range spatial configuration and local interactions simultaneously.

The goal of this paper is to introduce a novel contextual hierarchical part-driven CRF model for object category detection. The main novelty of our approaches lies in the use of a latent two-layer hierarchical formulation of labels and a weighted minimum spanning tree neighborhood structure. The model can effectively encode latent label-level context, as well as observation-level context. Meanwhile, beta-stable local features are also introduced as observed data to ensure the discriminative and robustness of part description. Such features provide a sparse and repeatable manner to express object parts and actually reduce the computation complexity of the model.

The remainder of this paper is organized as follows. Section 2 gives detailed introduction on the proposed contextual hierarchical part-driven CRF Model. The parameter learning and inference algorithms are introduced in Section 3. Experimental results are presented in Section 4. Finally, in Section 5 we draw the conclusions.

#### 2. Contextual Hierarchical Part-Driven CRF Model

The conditional random field is simply a Markov random field (MRF) [17] globally conditioned on the observation. It is a discriminative model that relaxes conditional independence assumption by directly estimating the conditional probability of labels [18, 19].

In other words, let be the observed data from an input image, where , is the data from the th site, and is the set of sites. The corresponding labels at image sites are given by . For labeling problems, the general form of a CRF can be written as
where partition function is a constant normalization with respect to all possible values of with parameters denotes the set of edges, and and are the unary and pairwise potentials, respectively. Here, encodes compatibility of the label with the observed image and encodes the pairwise label compatibility for all that conditioned on *. *

##### 2.1. Problem Formulation

For our object category detection problem, assume that we are given a training set of images , which contains objects from a particular class and background images. The corresponding labels can be denoted as , each is a member of a set of possible image labels. Since in object detection we only focus on presence or absence of objects, the possible labels should be limited to binary data, that is, or . Now, our task is to learn a mapping from images to labels . For simplicity of notation, we drop the superscript indicating training instance.

According to the theory of part-based model, assume that each image can be seen as a collection of parts , each part corresponds to a local observation or local feature. In order to describe the relationship between these parts, similar to hidden random field approach [15], we introduce latent labels , where corresponds to the “part-label” of part , and corresponds to the actually object parts, for example, for airplane objects.

Now, we can model the posterior directly by marginalizing out the latent labels , and the model can be defined as where is the set of parameters.

Here, we assume that is conditional independence of given . This means that the final object label only relies on the latent middle-level labels rather than on the original observations. The hypothesis makes sense because it is theoretically possible that we estimate the object or background occurrence by the spatial distribution of meaningful object parts in real world. Also, by doing this, we can build distinct two-layer structure of our proposed model. The hierarchical graphical structure of our contextual hierarchical part-driven CRF model is shown as Figure 1.

Obviously, both and can be modeled as CRFs. Thus, the whole model can be seen as the combination of two single-layer CRFs. By modeling the contextual interactions of these two layers, respectively, our model may have a high level of ability to describe different levels of context. Note that the latent label cannot be observed during training (i.e., unlabelled), so we must learn the models in a unified framework to avoid the direct use of . Detailed modeling approach and potential definitions for the two layers will be described below.

##### 2.2. Model of Layer 1

Without considering the object label , the distribution over the latent part labels given the observations may be modeled as a multiclass CRF. In such a model, observations are linked to local features located at certain spatial positions of the image. Therefore, the distribution of may be arbitrary and disorganized due to the uncertainty of local feature extraction. Meanwhile, different observations may be associated with the same part label, which corresponded to the meaningful object component. Due to the fact that adjacent or relevant observations are more likely to have the same label, we should consider label-level context in our model in addition to observation-level context.

Furthermore, although we cannot use the part labels explicitly, we can theoretically use them to define the posterior to capture the context structure of layer 1. Considering only unary and pairwise potentials, the posterior distribution can be modeled as where the set of parameters is given by , as shown in Figures 2(a) and 2(b), denote the unary potentials and are responsible for modeling part occurrences based on a single image feature, denote the pairwise potentials and are responsible for modeling the cooccurrences of and based on the corresponding pairwise image feature. The connectivity of nodes , that is, neighborhood structure of the observations, is defined in Section 2.4.

**(a)**

**(b)**

**(c)**

**(d)**

Note that, different from multiclass CRF [14], the definitions of potentials here must consider the missing label data . By using the parameter vector, the potentials can be denoted as where and refer to the unary feature vector and the pairwise feature vector, respectively. Parameter vectors of and have the same dimensions with the corresponding feature vectors.

In comparison with hidden CRF [15], our approach introduces the pairwise potentials to the model and can effectively capture the part label-level context by measuring the compatibility of different part labels.

##### 2.3. Model of Layer 2

According to the conditional independence assumption mentioned in Section 2.1, the final image label only depends on part labels . In other words, the occurrence of an object can be estimated by the spatial distribution of object parts.

Particularly, part labels should be regarded as observations in this layer, and the posterior distribution can be easily defined as where is the set of parameters, unary potentials describe the compatibility between image label and part label , pairwise potentials describe the compatibility between image label , part label , and part label , as in Figures 2(c) and 2(d).

Note that there is only one image label for an instance, so we do not need to model label-level context like in layer 1, and the potentials can be defined as where parameter vectors and .

Now, we can give the complete expression of our part-driven CRF model with the specific potentials, which can be denoted as where the set of parameters is given by .

##### 2.4. Neighborhood Structure

For probabilistic graphical models, the neighborhood structure is an important factor affecting the model capability. Moreover, as mentioned above, the observations in our model are distributed over the image plane in an arbitrary layout. So, how to define the neighborhood structure becomes a question we have to consider during the model design phase.

In [15], Quattoni et al. evaluated a range of different neighborhood structure and come to the conclusion that the minimum spanning tree (MST) shows better performances than many other complex connected graph structures. Following this, we adopt MST as the basic structure and extend it to a novel weighted neighborhood structure (WNS). The basic idea is to exploit the edge cost, which is discarded in previous work, as heuristic information to reflect the degree of correlation between two nodes.

In other words, different edges should have different weights during calculating pairwise potentials. We denote the edge cost between node and node as and modify (2.5) to

Similarly, (2.8) is changed by

By doing this, the weighted neighborhood structure can not only describe the connectivity of nodes, but also encode the assumption that parts that are spatially close are more likely to be dependent.

#### 3. Parameter Learning and Inference

Given -labeled training images, the parameters can be learnt by using maximum A posteriori (MAP) estimation. Gaussian prior is introduced to prevent overfitting. So, parameter learning can be achieved by maximizing the following objective function:

We use gradient ascent to search for the optimal parameter values . In our model, the derivatives of the log-likelihood with respect to the model parameters can be written in terms of local feature vectors, marginal distributions over individual part label , and marginal distributions over pairwise labels and :

Note that, all the terms in the derivatives can be calculated using Belief Propagation (BP) algorithm [17], provided the graphical structure does not contain cycles. Otherwise, approximate methods, such as loopy BP could be considered. Here, BP is suitable for our case due to the tree-like neighborhood structure.

For the final class inference, we need to find the image label that maximizes the conditional distribution , given parameters . For this work, we can also use the max-product version of BP to find the MAP estimate .

#### 4. Experiments

In this section, we demonstrate the capability of the proposed model on two different datasets: Caltech-4 standard dataset and airplane images collected from Google Earth. The aim of these experiments is to illustrate the performance of this detection framework using contextual hierarchical part-driven CRF model and compare with state-of-the-art models.

##### 4.1. Image Features

For the object category detection task, the robustness and distinctiveness are basic requirements for local features in order to provide powerful expression ability for objects. On the other hand, the quantity of features corresponds to the quantity of observations and has an enormous influence on computational complexity. Taking these aspects into consideration, we should try to use the local features which have the characteristics of sparse, robust, and discriminative simultaneously.

In our experiments, we use the beta-stable feature extracting method [20] to locate local features and SIFT descriptor [21] to construct feature vectors. Rather than selecting features that persist over a wide interval of scales, beta-stable features are chosen at a scale so that the number of convex and concave regions of the image brightness function remains constant within a scale interval of length beta. As a result, the beta-stable features have stronger robustness than SIFT-like features and are better anchored to visually significant parts. The comparative feature-points detecting results are shown in Figures 3(a) and 3(b).

**(a)**

**(b)**

**(c)**

The unary feature vector, , used in this work is represented by the combination of SIFT descriptor and relative location features. The pairwise feature vector, , is just the joint of unary feature vector and .

Given the locations of local observations, we construct graphical models using MST, as shown in Figure 3(c). The edge cost used in MST construction between two observations was computed by where and are balance factors depending on the actual object, and . If the object has richer shape information than appearance information, we think that the 2D distance might be more useful for discrimination, so we will take a bigger than .

##### 4.2. Object Detection on Standard Database

The first dataset that we used to test our model is a subset of the Caltech-4 standard dataset, which contains images for two object categories, car (rear view) and airplane (side view), and one background category. Each image contains at most a single instance of the objects in diverse natural background and, therefore, is suitable for our 2-class detection task. We randomly split the images into two equal separate subsets for training and testing.

Figure 4 shows the examples of the assignment of parts to local features for two object categories. It is apparent that the proposed model can effectively associate the mass of scattered and unordered observations with their corresponding object parts. Note that multiple observations may be assigned to the same part label with the premise that they physically belong to the same part. The number of parts can be empirically set according to the complexity of objects.

**(a)**

**(b)**

##### 4.3. Airplane Detection in Satellite Images

In this section, we verify our model by 170 airplane (top view) satellite images taken from Google Earth. To gather a sufficiently large learning dataset, we acquire images from different heights and different directions. Furthermore, a few synthetic images with simulation airplane models are also used for testing. All images are resized to pixels. The balance factor used in the weighted neighborhood structure is set to 0.7 to encourage the use of shape information.

Due to space constraints, we provide a few examples of the detection results (as shown in Figure 5). We use a simple bounding box located at the center of the efficient observations to roughly label the detected objects in the test images.

##### 4.4. Performance Comparison

We compare the detection performance of our model with those of three existing models: hidden CRF [15] model, located HRF [16] model, and multiclass CRF [14] model. For fairness of comparison, the local features in these three models are also computed by SIFT descriptors. In order to measure the influence of neighborhood structures, we also investigate the performance of an equivalent model without weighted neighborhood structure. The object categories are car (rear), airplane (side), and airplane (top), which have been mentioned in previous sections. The equal error rate (EER) defined in [7] is adopted as evaluation criterion, in which higher EER values means better classification performance. The comparative results are summarized in Table 1.

As can be seen, our model consistently gives the best results for these three object categories for the car rear dataset and airplane side dataset. Note that the airplanes (side) are easier to be discriminated than cars due to their distinct shape structure. On the airplane top dataset, our model is exceeded slightly in accuracy only by the located HRF model. This may be caused by overfitting since our model has to use more parameters to encode more contextual dependencies.

From the results in the last row of Table 1, we can see that incorporating the weights of neighborhood structures is important since the performance of such a model dropped. Rather than hypothesizing that all the edges in MST are equally important, the weighted neighborhood structure uses weights to measure the degree of correlation between connected nodes and inherently have higher representational power.

Note that, since the local features are extracted automatically during training and testing, the quantity and structure (constructed by MST) of observations should be unpredictable. As a result, the computing time of our model is influenced by the object complexity and image quality. In this experiment, we use about 3 hours for training, and 1.2 second per image on the average for testing on a 2.8 GHz computer.

#### 5. Conclusion

In this paper we presented a contextual hierarchical part-driven CRF model for object category detection. By incorporating two single-level models, the proposed model can effectively represent latent label-level context and observation-level context simultaneously. A weighted neighborhood structure is also introduced to capture the degree of correlation between connected nodes. Experimental results on challenging datasets with high intraclass variability have demonstrated that the proposed model can effectively represent multiple context information and give competitive detection performance. Our future researches will focus on the following directions: introducing more sparse and robust local features to reduce the computational complexity and utilizing high-order clique potentials to investigate more contextual dependencies in images.

#### Acknowledgments

This work is supported by a special financial grant from the China Postdoctoral Science Foundation (Grant no. 20100481512, 201104765). The authors would like to thank Kevin Murphy for publishing CRF2D toolbox, which is very helpful for this work. They also thank X. J. Xiang for data collection.