Abstract

With the rapid development and application of CRFs (conditional random fields) in computer vision, researchers have made notable progress in this domain, in part because CRFs overcome the classical label bias problem exhibited by MEMMs (maximum entropy Markov models) and avoid the strong independence assumptions of HMMs (hidden Markov models). This paper reviews the research development and status of object recognition with CRFs and, in particular, introduces two main discrete optimization methods for image labeling with CRFs: graph cut and mean field approximation. Graph cut is described briefly, while mean field approximation, which offers substantially faster inference and has been widely studied in recent years, is introduced in more detail.

1. Introduction

Recognizing and labeling objects and properties in a given image is an important task in computer vision. The goal of image labeling is to label every pixel, or group of pixels, in the image with one of several predetermined semantic object or property categories, for example, “dog,” “building,” and “car.” Human beings perform object recognition effortlessly, but it is not straightforward for a computer to do so. Researchers [14] are still trying to improve image labeling techniques in terms of both speed and accuracy. Figure 1 shows an example of image labeling.

Image labeling usually involves several steps: first, a model is set up and trained; then inference is performed to label a new image. Current algorithmic solutions to image labeling are still not fully satisfactory, especially in the inference stage. The graph cut method [5–8] was popular previously, but its speed is slow, especially when there are many labels. In [1], Vineet et al. achieve remarkable speed-ups and improvements in accuracy over graph-cut-based inference baselines in both joint stereo-object labeling and object class segmentation. However, their method [9] has two limitations: the first is that the mean field approximation assumes complete factorization over the individual variables; the second relates to the form of the pairwise weights, which are a linear combination of Gaussian kernels. See Section 3.2 for more details on these two limitations.

Naturally, human beings understand a scene mainly by using the spatial and visual information gathered through their eyes. Conversely, given one or several images, such information, for example, boundaries or objects, is essential for scene interpretation. What we hope to capture is the full interaction between pixels. Because of sensor noise and the complexity of the real world, exact interpretation is unreachable for computers, so researchers instead cast vision problems as equivalent optimization problems.

In the early history of computer vision, the Markov random field (MRF) was widely used in both low-level and high-level vision perception after it was first introduced into vision by S. Geman and D. Geman in 1984 [10]. The MRF provides a mathematical framework for finding optimal solutions by using the contextual visual information in images. Recently, the MRF model regained attention in the field of computer vision thanks to progress in powerful energy minimization algorithms [3] such as graph cut [6], belief propagation [11], dual decomposition [12], fusion move [13], and iterated conditional modes. The MRF has been applied to image problems such as restoration, matting [14], segmentation, optical flow, object classification [15, 16], face recognition [17], and text recognition [18].

Object classification can be formulated as a pixel labeling problem; that is, the correct label is to be assigned to each pixel or clique, where the label of a pixel represents some property of the real scene, such as belonging to the same object or having a particular disparity. In [3], Chen et al. introduced the background, basic concepts, and fundamental formulation of image labeling with MRFs. They discussed two distinct types of discrete optimization methods, belief propagation and graph cut, and further applied them to two classical vision problems, stereo and binary image segmentation, using the MRF model. Figure 2 shows some examples of labeling problems in computer vision.

It was later recognized that the image labeling problem can be naturally described with a conditional random field (CRF) model [1]. The CRF model was first proposed by John Lafferty et al. [19] in 2001. In their work they presented iterative parameter estimation algorithms for conditional random fields and compared the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data. The CRF model was brought to image labeling by Shotton et al., Peng and McCallum, and Kristjansson et al. [20–22].

The use of CRFs was originally restricted to the area of information extraction [22–25], in which, given a dataset, the problem is to extract relevant information belonging to some predefined types. Since the datasets are mostly linguistic, imposing a chain structure on the texts is both effective in capturing temporal relations and efficient for inference and learning, because text is inherently sequential. Therefore, CRFs were quickly adopted in a wide range of text processing applications, such as part-of-speech (POS) tagging, chunking [26, 27], and semantic role labeling [28]. Later on, the application of CRFs expanded to word alignment [29], question answering [30], and document summarization [31].

Recently, research on the CRF model in computer vision has been very active, as CRFs can be solved by efficient energy minimization algorithms. The efficiency of inference is a critical issue for CRFs, both in training and in predicting the labels of new inputs. During training, marginal distributions over subsets of labels are computed in order to estimate the parameters of the model; the trained model can then predict the labels of a new input, such as a new image, by choosing the most likely labels. Many inference algorithms have been deployed to solve CRF optimization problems, such as iterated conditional modes [32], Monte Carlo methods [33], graph cut methods [5–8], and message passing methods, among which mean field inference [1, 34] and belief propagation [35] are the two most popular; many extensions of these methods have also been developed.

Local information is well captured by the standard form of a CRF [6, 36]. However, the standard form is not effective for modeling global information, as it often fails to capture global consistency in image recognition, so research on how to capture global information of images in CRFs with different forms [5–7, 37] has become a hot area. Capturing both local and global information of images makes learning and inference very hard; we must focus not only on the accuracy of a method but also on its efficiency, which degrades rapidly as the input grows, for example, with the dimensionality of the captured features or the number of input images. Therefore, many methods [38–41] have been proposed to address this problem. Recently, a number of cross bilateral Gaussian filter-based methods have been proposed for problems such as object class segmentation [34], denoising [42], and stereo and optical flow [2]; all of these permit substantially faster inference while maintaining or improving accuracy. On the basis of [6], Vineet et al. [1] show how higher-order terms can be formulated such that filter-based inference remains possible and demonstrate their techniques on joint stereo and object labeling problems, as well as object class segmentation. In fact, they show that they are able to speed up inference in these models by around 10–30 times with respect to competing graph cut methods.

In this paper, we review the progress in inference for image labeling with CRF models. As mentioned above, a good inference algorithm is critical both for predicting the labels of a new input and for learning the parameters of the model, and it must satisfy the two main goals we pursue: accuracy and efficiency.

Section 2 presents the CRF model and its extensions. In Section 3, we introduce two inference methods, graph cut and mean field approximation, which have been widely used in recent years. We conclude the paper in Section 4.

2. The Model of CRFs

A CRF is a discriminative undirected probabilistic graphical model that can represent relationships between different variables [20, 43]. The structure of a CRF model helps to estimate the unobserved variables given the observed ones. The classical CRF model is described as follows [34].

Denote by x the input variable and by y the joint output variable. The input variable represents our knowledge about the domain, such as color and texture. The output can be continuous or discrete but, in most cases, all the labels we use are discrete.

We would like to model the mapping from x to y via the conditional distribution P(y | x). As a result, we are only interested in the output structure conditioned on the input. CRFs approach the modeling of P(y | x) by representing y as a Markov random field conditioned on x. More precisely, let G = (V, E) be an undirected graph, where V is the set of nodes in the graph, each node corresponding to a variable y_i, and E is the set of edges. Let n denote the number of nodes in the graph. Define X = {x_1, ..., x_n} as the set of input random variables and Y = {y_1, ..., y_n} as the set of output random variables, where each y_i takes a value from a range of possible discrete labels. In a conditional random field, we assume that each random variable y_i obeys the Markov property when conditioned on X, such that the conditional probability distribution of y_i given its adjacent nodes is independent of the rest of the nodes in the graph. That is, if G is a graphical model such that P(y_i | X, y_{V∖{i}}) = P(y_i | X, y_{N_i}), where N_i is the set of nodes adjacent to node i, then (X, Y) is a conditional random field (CRF). Let N = {N_i} represent the neighbor system, which indicates the interrelationship between nodes, or the order of the CRF. Edges are added between each node i and its neighbors N_i. Usually, the neighbor system should satisfy the following:
(1) A site does not neighbor itself: i ∉ N_i.
(2) The neighboring relationship is mutual: i ∈ N_j ⇔ j ∈ N_i.
The definition of the neighbor system is important because it reflects how far the contextual constraint reaches. For regular data, as in Figure 3, the neighbors of site i are defined as the set of sites within a radius of √r from i, where r is the order of the neighbor system: N_i = {j ∈ V | dist(i, j)² ≤ r, j ≠ i}, where dist(i, j) measures the Euclidean distance between sites i and j.
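To make the neighbor system concrete, the sketch below (our own illustration, not code from any of the cited papers) builds the first-order (4-connected) and second-order (8-connected) neighborhoods for a regular image grid, following the radius-based definition above.

```python
import numpy as np

def grid_neighbors(height, width, order=1):
    """Return a dict mapping each site (row, col) to its list of neighbors.

    order=1 gives the 4-connected neighborhood (squared radius 1),
    order=2 gives the 8-connected neighborhood (squared radius 2),
    following the radius-based definition of the neighbor system.
    """
    radius_sq = order  # squared Euclidean radius allowed for the given order
    offsets = [(dr, dc)
               for dr in range(-1, 2) for dc in range(-1, 2)
               if (dr, dc) != (0, 0) and dr * dr + dc * dc <= radius_sq]
    neighbors = {}
    for r in range(height):
        for c in range(width):
            neighbors[(r, c)] = [(r + dr, c + dc)
                                 for dr, dc in offsets
                                 if 0 <= r + dr < height and 0 <= c + dc < width]
    return neighbors

# The relationship is mutual: j in N_i implies i in N_j.
N = grid_neighbors(3, 3, order=2)
print(N[(1, 1)])  # the 8 neighbors of the center site
```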

In object recognition problems, the observations x are often the image data themselves or extracted visual features, and the outputs y correspond to the outputs of the vision system, for example, possible class labels of the image to be classified, as shown in Figure 4.

To keep the discussion clear, we only consider the case in which each variable in y takes a value from a range of possible discrete labels, although the variables can be either continuous or discrete in the general case. We describe the model from two perspectives: the probabilistic view and the energy-function view.

Under the probabilistic view, let C be the set of all maximal cliques of G, and use y_c and x_c to denote the values assigned to the variables in clique c. The conditional probability distribution of a CRF can be written as

P(y | x) = (1 / Z(x)) Π_{c ∈ C} ψ_c(y_c, x),

where the so-called potential function or compatibility function ψ_c(y_c, x) is a nonnegative function defined over the maximal clique c in G. Z(x) is a normalization factor, also called the partition function, which depends on the observed values of the input variable x and is defined as

Z(x) = Σ_y Π_{c ∈ C} ψ_c(y_c, x).

We also assume that the conditional distribution over the graph is an exponential family [44]; thus we require each potential function to have the form

ψ_c(y_c, x) = exp( Σ_k θ_{c,k} f_{c,k}(y_c, x) ),

where θ_c is a real-valued parameter vector and {f_{c,k}} is a set of feature functions defined on the clique c.

To simplify the solution to the energy function (see (2)), one can take the negative logarithm of both sides of (2), so that the problem of maximizing the conditional probability becomes an energy minimization problem. In practice, we usually model structures using pairwise constraints, since inference is easier in this case and the model parameters are easier to learn. For example, in computer vision problems, we often see CRFs with maximal cliques of size 2. In this case we can write down the energy as

E(y | x) = Σ_i ψ_u(y_i, x) + Σ_{(i,j) ∈ E} ψ_p(y_i, y_j, x),

where we call ψ_u the unary potential and ψ_p the pairwise potential. Occasionally we also use higher-order cliques, and there are special types of higher-order clique potentials that are useful in a few applications.
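As a simple illustration of the pairwise energy above, the following sketch (our own example, not from the cited papers) evaluates E(y) for a labeling on a 4-connected grid, using arbitrary unary costs and a Potts pairwise penalty; minimizing this quantity over all labelings is the inference problem discussed in Section 3.

```python
import numpy as np

def labeling_energy(unary, labels, pairwise_weight=1.0):
    """Energy of a labeling on a 4-connected grid.

    unary  : array (H, W, L) of unary costs psi_u(y_i = l)
    labels : array (H, W) of integer labels y_i
    The pairwise term is a Potts penalty: pairwise_weight if two
    4-neighbors take different labels, 0 otherwise.
    """
    H, W, _ = unary.shape
    rows, cols = np.mgrid[0:H, 0:W]
    energy = unary[rows, cols, labels].sum()                              # sum of unary potentials
    energy += pairwise_weight * (labels[:, 1:] != labels[:, :-1]).sum()   # horizontal edges
    energy += pairwise_weight * (labels[1:, :] != labels[:-1, :]).sum()   # vertical edges
    return energy

rng = np.random.default_rng(0)
unary = rng.random((4, 5, 3))     # a 4x5 image with 3 possible labels
labels = unary.argmin(axis=2)     # independent per-pixel minimum (ignores smoothness)
print(labeling_energy(unary, labels, pairwise_weight=0.5))
```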

Probabilistic models need to be normalized properly and in many cases require evaluating intractable integrals over the space of all possible variable configurations. Energy functions, in contrast, have no such normalization requirement and thus provide more flexibility in designing the architecture of the underlying graphical model.

The standard form [1, 25] of a CRF is good for modeling local information. We can write the standard CRF as

E(y | x) = Σ_{i ∈ S} ψ_u(y_i, x) + λ Σ_{i ∈ S} Σ_{j ∈ N_i} ψ_p(y_i, y_j, x),

where x is an input image, y represents a labeling, and y_i is a category label at site i. S is the set of sites in the image, N_i is the set of neighbors of site i, and λ is a coefficient that modulates the effects of the potentials.

In fact, the unary potential ψ_u represents relations between labels and local image features: it predicts the label y_i based on the local features at site i. The pairwise potential ψ_p represents relationships between the labels of neighboring sites: if neighboring sites have similar image features, it favors the same category label for them; if not, they may be assigned different category labels. The pairwise potential therefore performs data-dependent smoothing. Importantly, both potentials represent only local information; as a result, global information is lost and counterintuitive mistakes can occur, for example, a “dog” appearing in the water [43]. Using global information, some classification mistakes in image labeling can be avoided, as shown in Figure 5.
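A commonly used data-dependent pairwise term of this kind is the contrast-sensitive Potts potential, which reduces the smoothing penalty across strong intensity edges. The sketch below is a generic illustration of that idea (the exact functional form and constants vary across the cited papers).

```python
import numpy as np

def contrast_sensitive_penalty(intensity_i, intensity_j, lam=1.0, beta=0.01):
    """Penalty applied when two neighboring sites take DIFFERENT labels.

    The penalty decays with the squared intensity difference, so label
    changes are cheap across strong image edges and expensive in smooth
    regions (data-dependent smoothing). In practice beta is often set
    from the mean squared difference over the whole image.
    """
    diff2 = np.sum((np.asarray(intensity_i, float) - np.asarray(intensity_j, float)) ** 2)
    return lam * np.exp(-beta * diff2)

def pairwise_potential(label_i, label_j, intensity_i, intensity_j, lam=1.0, beta=0.01):
    # Potts structure: zero cost for equal labels, contrast-sensitive cost otherwise.
    if label_i == label_j:
        return 0.0
    return contrast_sensitive_penalty(intensity_i, intensity_j, lam, beta)

print(pairwise_potential(0, 1, [10, 10, 10], [12, 11, 9]))      # smooth region: high penalty
print(pairwise_potential(0, 1, [10, 10, 10], [200, 180, 190]))  # strong edge: low penalty
```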

Later on, the multiscale CRF [43] (mCRF) was invented to use regional and global label features that encode particular label patterns at local and global scales. The mCRF multiplicatively combines component conditional distributions that capture statistical structure at different spatial scales s:

P(y | x) ∝ Π_s P_s(y | x).

Although the mCRF uses regional and global label features, it has a massive number of variables and parameters to estimate, and it involves inefficient stochastic sampling for learning and label inference. Overwhelmingly large dataset sizes and numbers of classes therefore limit its practical application.

The boosted random field [37] models long-range interactions learned with a boosting algorithm [45]. The hierarchical CRF [23] (hCRF) uses a hierarchy of CRFs to model long-range interactions (e.g., relative configurations of objects or regions) and short-range interactions (e.g., pixel-wise label smoothing) in a tractable manner. Its two-layer formulation, which exploits different levels of contextual information in images for robust classification, is general enough to be applied to different domains ranging from pixel-wise image labeling to contextual object detection. Neither of these two methods incorporates global information of the image, and thus the labeling remains highly dependent on local information.

The random field model proposed by Toyoda and Hasegawa [46] explicitly models local information and global information in a conditional random field. The method extracts global image features as well as local ones and uses them to predict the scene of the input image. In their formulation, a global unary potential and a global pairwise potential are added to the local terms, coefficients modulate the effects of the potentials, and a partition function normalizes the distribution. The global unary potential represents relationships between labels and global image features: it predicts the spatial configuration of labels according to the scene of the input image. The global pairwise potential represents the compatibility of all pairs of labels. This method not only incorporates local and global information but also enables rapid processing by using the global image features. However, it does not classify well when there are too many classes (only 7 classes were used in their experiments) because the relationships between classes become substantially more complex.

Some researchers [47–49] have moved their attention to higher-order cliques. In fact, most energy minimization methods for solving computer vision problems assume that the energy can be represented in terms of unary and pairwise clique potentials. This assumption severely restricts the representational power of these models, making them unable to capture the rich statistics of natural scenes [50], while higher-order clique potentials have the capability to model complex interactions of random variables and thus could overcome this problem. The initial work with higher-order potentials [36, 50–52] was quite promising, but their use has been limited by the unavailability of efficient algorithms for minimizing the resulting energy functions. Kohli et al. [49] extend the class of energy functions for which the optimal α-expansion and α-β swap moves can be computed in polynomial time. In their paper, they propose the P^n Potts model, for which the optimal move can be found by solving an st-mincut problem. They define the P^n Potts model potential for cliques of size n as

ψ_c(y_c) = γ_k  if y_i = l_k for all i ∈ c,  and  γ_max otherwise,

where γ_k ≤ γ_max for every label l_k. For a pairwise clique this reduces to the Potts model potential, which takes one cost when the two labels agree and a higher cost otherwise. The Gibbs energy of the CRF with higher-order cliques in this paper is

E(y) = Σ_i ψ_u(y_i) + Σ_{(i,j)} ψ_p(y_i, y_j) + Σ_{c ∈ C} ψ_c(y_c),

where c is a clique representing a patch (segment) of the image and C is the set of all such cliques. The example in the paper demonstrates the importance of enforcing label consistency over homogeneous regions for object class segmentation. However, the inference speed is low compared with the mean field inference method.

The P^n Potts model potential is a particular case of the pattern-based potentials [48], which are defined as

ψ_c(y_c) = γ_p  if y_c = p for some pattern p ∈ P_c,  and  γ_max otherwise,

where P_c is a set of recognized patterns (i.e., label configurations for the clique), each associated with an individual cost γ_p, while a common cost γ_max is applied to all other patterns. If we set P_c to be the configurations with constant labels, we recover the P^n Potts model described above.
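The following sketch (an illustrative implementation of the definition above, not code from [48] or [49]) evaluates a pattern-based clique potential; restricting the pattern set to the constant labelings recovers the P^n Potts model.

```python
def pattern_potential(clique_labels, patterns, gamma_max):
    """Pattern-based higher-order potential.

    clique_labels : tuple of labels assigned to the clique's variables
    patterns      : dict mapping recognized label configurations (tuples) to costs gamma_p
    gamma_max     : common cost charged to every other configuration
    """
    return patterns.get(tuple(clique_labels), gamma_max)

def pn_potts_patterns(num_labels, clique_size, gamma_k):
    """Constant-label patterns: this choice turns the pattern potential into the P^n Potts model."""
    return {(l,) * clique_size: gamma_k[l] for l in range(num_labels)}

patterns = pn_potts_patterns(num_labels=3, clique_size=4, gamma_k=[0.0, 0.5, 0.2])
print(pattern_potential((1, 1, 1, 1), patterns, gamma_max=2.0))  # consistent clique: 0.5
print(pattern_potential((1, 1, 2, 1), patterns, gamma_max=2.0))  # inconsistent clique: 2.0
```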

Cooccurrence relations capture global information about which classes tend to appear together in an image and which do not. To model object class cooccurrence statistics, a new term C(L(y)) is added to the energy, where L(y) denotes the set of labels present in the labeling y.

Torralba et al. [53] proposed the use of additional unary potentials to capture scene-based occurrence priors. However, the complexity of inference over such potentials scales linearly with the size of the graph, they are prone to overcounting costs, and they require an initial hard decision on the scene type before inference.

Rabinovich et al. [54, 55] proposed cooccurrence as a soft constraint that sums a potential over all pairs of labels present in the image, where the potential penalizes labels that should not occur together. It can capture global information; however, because it is based on a fully connected graph, the memory requirements of inference scale badly: they grow with the quadratic complexity of the fully connected graph rather than with the size of the graph.

To improve on these methods, Ladicky et al. [40] proposed a new form of the cooccurrence cost C(L(y)) that depends only on the set of labels present in the image. This guarantees invariance to the size of an object and can be seen as a particular higher-order potential defined over a clique that includes the whole of the image, that is, all of its pixels. The restriction placed on C is that it should be nondecreasing with respect to the inclusion relation; that is, L1 ⊆ L2 implies C(L1) ≤ C(L2). By incorporating these potentials, they obtained quantitatively better and visually more coherent labelings, but at a comparatively higher computational cost than mean field inference.

Similar to Ladicky et al.'s form of the cooccurrence cost, Vineet et al. [47] proposed a form built from indicator functions that are 1 for a true condition and 0 otherwise. They used filter-based mean field inference to solve the energy with higher-order terms and showed that they are able to speed up inference in these models by about 10–30 times with respect to competing graph cut methods [43].

Joint optimization for object class segmentation is another important area of research in image labeling, for example, combining objects and attributes for image segmentation [56] or joint optimization for object class segmentation and dense stereo reconstruction [4]. In [57], Farhadi et al. proposed a method to shift the goal of recognition from naming to description; for example, we not only recognize a basketball as a basketball but also describe its attributes, such as being round. The method therefore allows them not only to name a familiar object but also to report unusual aspects of a familiar object and to learn how to recognize new objects with few or no visual examples. The attributes in the paper consist of two aspects: semantic and discriminative. Since the concepts of objects and attributes are both important for describing images precisely, in [56] the problem of joint visual attribute and object class image segmentation was formulated as a dense multilabeling problem, in which each pixel in an image should be associated with both an object class and a set of visual attribute labels. The authors proposed a factorial multilabel CRF model which combines the multiclass CRF model and the multilabel model.

The multiclass CRF for objects can be defined in terms of an energy function over the object labels composed of unary and pairwise potential functions.

The multilabel CRF for attributes is defined analogously over a set of random attribute variables. Rather than taking values directly in the attribute label set, though, these variables take values in its power set.

They also defined a joint CRF in terms of a pairwise energy over the object-attribute pairs at each pixel, which couples the two label fields.

Using a two-level hierarchical model, in which object classes and attributes are labeled not only at the pixel level but also at a regional level, they extended this energy with region-level terms.

It was recognized that the problems of dense stereo reconstruction and object class segmentation can both be cast as CRF-based labeling problems, in which every pixel in the image is assigned a label corresponding to either its disparity or an object class. This inspired [4, 46] to provide an energy minimization framework that unifies the two problems. In their paper, the energy of object class segmentation and the energy of dense stereo reconstruction are each written as a CRF, and the energy of the CRF for joint estimation combines the two. Using the fact that certain objects occupy a certain range of real-world heights, they coupled the unary potentials through a joint term in which the height of a pixel above the ground plane (derived from its disparity) is scored by a histogram-based measure of the naïve probability that a pixel taking a given object label has that height in the training set. The combined unary potential can then be written as a weighted sum of the object unary potential, the disparity unary potential, and this joint term, with corresponding weights.
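As a rough sketch of this coupling (our own simplified illustration; the exact potentials, weights, and histogram model are those of [4]), the combined unary cost for assigning an object label and a disparity to a pixel can be computed as a weighted sum of the object cost, the disparity cost, and a joint height-prior term.

```python
import numpy as np

def joint_unary(obj_cost, disp_cost, height_of_disp, height_hist, w_o=1.0, w_d=1.0, w_j=1.0):
    """Combined unary cost for one pixel over all (object, disparity) pairs.

    obj_cost       : array (L,)  object-class unary costs
    disp_cost      : array (D,)  disparity unary costs
    height_of_disp : array (D,)  height above the ground plane implied by each disparity
    height_hist    : callable (label, height) -> probability of that height for that class
                     (a histogram learned from training data in [4]; here just a stand-in)
    """
    L, D = len(obj_cost), len(disp_cost)
    joint = np.empty((L, D))
    for l in range(L):
        for d in range(D):
            p = max(height_hist(l, height_of_disp[d]), 1e-6)   # avoid log(0)
            joint[l, d] = w_o * obj_cost[l] + w_d * disp_cost[d] - w_j * np.log(p)
    return joint

# Toy stand-in: "sky" (label 0) prefers large heights, "road" (label 1) prefers height near 0.
hist = lambda l, h: np.exp(-(h - (10.0 if l == 0 else 0.0)) ** 2 / 5.0)
costs = joint_unary(np.array([0.2, 0.9]), np.array([0.1, 0.4, 0.8]),
                    height_of_disp=np.array([12.0, 3.0, 0.5]), height_hist=hist)
print(costs.round(2))
```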

For pairwise interactions, we know that an object class boundary is more likely to occur where the disparities of two neighboring pixels differ significantly. Taking this into account, they chose tractable pairwise potentials that combine object and disparity smoothness terms with corresponding weights.

Although the two models described above have more parameters to learn, which makes learning and inference more complicated, they achieve better scene understanding than earlier models.

3. Inference Methods

Over the years, a large number of inference algorithms have been developed. Although exact inference in such CRFs is intractable, much attention has been paid to developing fast approximation algorithms, including graph cut approaches [6], variants of belief propagation [11, 35, 50], and a number of Gaussian filter-based methods [1, 39]. In this section, we briefly introduce two inference methods for approximate energy minimization: the classical graph cut method and the mean field approximation, which has become popular in recent years.

3.1. Graph Cut

Greig and Porteous [59] first applied graph cut in computer vision; the term describes a large family of MRF inference algorithms based on solving a min-cut/max-flow problem. If a computer vision problem can be formulated in terms of an energy function, then we can use graph cut to find the minimum energy configuration, which corresponds to the MAP (maximum a posteriori) solution. Figure 6 shows an example of a minimum cut.

In this method, we set up a directed weighted graph G = (V, E) consisting of a set of nodes V and a set of directed edges E with nonnegative edge weights. The nodes correspond to pixels in the image labeling problem. There are two additional nodes, called terminals: the source s and the sink t. In computer vision, terminals correspond to the set of labels that can be assigned to pixels. All edges in the graph are assigned some weight or cost; in fact, assigning edge weights appropriately is very important for many graph-based applications in vision. There are two types of edges in the graph: n-links and t-links. The former connect pairs of neighboring pixels, so they represent a neighborhood system in the image. The latter connect pixels with terminals, so a t-link connecting a pixel and a terminal corresponds to a penalty for assigning the corresponding label to that pixel. A cut C ⊂ E is a set of edges such that the terminals are separated in the induced graph G(C) = (V, E ∖ C) and no proper subset of C separates the terminals in G(C). The weight of a cut is the sum of its edge weights, and the minimum cut problem is to find the cut with the smallest cost.
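To make the construction concrete, the sketch below (a toy illustration only; real systems use specialized max-flow solvers such as the one in [6]) builds t-links and n-links for a tiny binary labeling problem and extracts the minimum cut with a simple Edmonds-Karp max-flow; the pixels still reachable from the source after the flow saturates form one label region.

```python
import numpy as np
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp max-flow; returns the set of nodes on the source side of the min cut."""
    n = len(capacity)
    flow = np.zeros_like(capacity, dtype=float)
    while True:
        parent = {s: None}                      # BFS for an augmenting path in the residual graph
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in range(n):
                if v not in parent and capacity[u][v] - flow[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        bottleneck, v = float("inf"), t         # find the bottleneck capacity along the path
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v] - flow[u][v])
            v = u
        v = t                                   # augment the flow along the path
        while parent[v] is not None:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
    return set(parent)                          # nodes reachable from s = source side of the cut

# Tiny example: 4 pixels in a row (nodes 0..3); source = 4, sink = 5.
unary_src = [5.0, 4.0, 1.0, 0.5]   # t-link weights from the source (cost of taking the sink label)
unary_snk = [0.5, 1.0, 4.0, 5.0]   # t-link weights to the sink (cost of taking the source label)
smooth = 2.0                        # n-link weight between neighboring pixels
cap = np.zeros((6, 6))
for i in range(4):
    cap[4][i] = unary_src[i]        # source t-links
    cap[i][5] = unary_snk[i]        # sink t-links
for i in range(3):                  # n-links in both directions
    cap[i][i + 1] = cap[i + 1][i] = smooth
source_side = max_flow_min_cut(cap, s=4, t=5)
print([0 if i in source_side else 1 for i in range(4)])   # expected: [0, 0, 1, 1]
```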

Boykov et al. [6] proposed the two most popular graph cut algorithms: α-expansion and α-β swap. α-β swap is described as follows: for a pair of labels α, β, it exchanges the labels between an arbitrary set of pixels labeled α and another arbitrary set labeled β; the algorithm produces a labeling such that there is no swap move that decreases the energy. As for α-expansion: for a label α, this move assigns an arbitrary set of pixels the label α, and the algorithm ends when there is no expansion move that decreases the energy. In their paper they define two concepts, semimetric and metric. Suppose V(·, ·) is the interaction potential of the energy. V is called a semimetric on the space of labels if, for any pair of labels α, β, it satisfies two properties: V(α, β) = V(β, α) ≥ 0 and V(α, β) = 0 ⇔ α = β. If V also satisfies the triangle inequality V(α, β) ≤ V(α, γ) + V(γ, β) for all labels α, β, γ, then V is called a metric. Although α-expansion is more accurate and efficient and can produce a result with lower energy, the interaction potential must be a metric when using α-expansion, whereas for α-β swap it only needs to be a semimetric.

The main idea of the α-expansion algorithm is to successively segment all α and non-α pixels with graph cuts, changing the value of α at each iteration. The algorithm iterates through each possible label α until it converges. At each iteration, the α region can only expand, which slightly changes how the graph weights are set. Also, when two neighboring nodes do not currently have the same label, an intermediate node is inserted and the links are weighted according to the distance between the labels.

The main idea of the α-β swap algorithm is to successively segment all α pixels from β pixels with graph cuts, changing the (α, β) combination at each iteration. The algorithm iterates through each possible combination until it converges. Within an iteration the graph is constructed in the usual way, so that it can segment efficiently between the α region and the β region. Special care must be taken with nodes that are in neither the α nor the β region: they do not participate in the cut, and for a participating pixel the terminal link weight is the data term plus the sum of all links to neighbors that are in neither the α region nor the β region.

In [6], the energy is described as E(f) = E_data(f) + E_smooth(f) = Σ_p D_p(f_p) + Σ_{(p,q) ∈ N} V_{p,q}(f_p, f_q). The first term is known as the data term; it ensures that the current labeling f is coherent with the observed data and penalizes assigning label f_p to pixel p if it is too different from the observation I_p. The second term is the smoothness term. To make the algorithms used in [6] clear, a quick application of the α-expansion algorithm to image restoration is shown in Figure 7. Here an image with embedded squares has intensity values 255, 191, 128, and 64, and noise was added to the original image. Possible labels are all integers between 0 and 255. The algorithm performs segmentations until it converges; note that ᾱ denotes the non-α labels. The data term used here is a simple squared difference D_p(f_p) = (f_p − I_p)², and the smoothness term is the Potts model V(f_p, f_q) = K · T(f_p ≠ f_q), where T(·) is 1 if its argument is true and zero otherwise.
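The outer loop of α-expansion can be sketched as follows (a schematic illustration only; solve_binary_expansion_move is a hypothetical stand-in for the graph construction and min-cut step described above, not a real library function).

```python
def alpha_expansion(labels, label_set, energy, solve_binary_expansion_move):
    """Schematic alpha-expansion loop.

    labels : current labeling (e.g., a numpy array of label ids)
    energy : callable returning E(labels) for the full energy
    solve_binary_expansion_move(labels, alpha) : returns the best labeling in which
        every pixel either keeps its current label or switches to alpha
        (found with one min cut on the expansion graph).
    """
    improved = True
    while improved:                                   # cycle over the labels until convergence
        improved = False
        for alpha in label_set:
            proposal = solve_binary_expansion_move(labels, alpha)
            if energy(proposal) < energy(labels):     # accept the move only if it lowers the energy
                labels, improved = proposal, True
    return labels
```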

For more details about α-β swap and α-expansion, see [6]. In addition, Kolmogorov and Rother [60] wrote a survey about graph cut and pointed out that graph cut can be applied to both submodular and nonsubmodular functions. Other more recent developments in graph cut include order-preserving graph cut [61] and combination graph cut [3, 62].

3.2. The Mean Field Approximation

Recently, a number of mean field approximations have been proposed in computer vision, for example, for object class segmentation [8, 9, 11, 34]. The mean field algorithm finds the distribution Q that is closest to the exact distribution P by minimizing the KL-divergence KL(Q ‖ P) within the class of distributions representable as a product of independent marginals, Q(y) = Π_i Q_i(y_i) [63]. Although approximating P as a fully factored distribution is likely to lose a lot of the information in the distribution, this approximation is computationally attractive. The mean field approximation can be formulated as minimizing KL(Q ‖ P), or equivalently maximizing the energy functional F[P, Q] = E_Q[ln P̃(y)] + H(Q), where P̃ is the unnormalized distribution and H(Q) is the entropy of Q. See [63] for more details.

The approach of [34] provides a filter-based method for performing fast approximate maximum posterior marginal (MPM) inference, that is, assigning to each pixel the label that maximizes its (approximate) marginal, in multilabel CRF models with fully connected pairwise terms, where the pairwise terms have the form of a weighted mixture of Gaussian kernels. We can express the fully connected pairwise CRF as

P(y | x) = (1 / Z(x)) exp(−E(y | x)),  E(y | x) = Σ_i ψ_u(y_i) + Σ_{i < j} ψ_p(y_i, y_j),

where E(y | x) is the energy associated with a configuration y conditioned on x and ψ_u and ψ_p are unary and pairwise potential functions, respectively. In [34], the pairwise potentials take the form of a weighted mixture of Gaussian kernels:

ψ_p(y_i, y_j) = μ(y_i, y_j) Σ_{m=1}^{K} w_m k_m(f_i, f_j),

where μ is a label compatibility function, the k_m are Gaussian kernels over feature vectors f_i and f_j (such as position and color), and the w_m are the corresponding weights of the kernels. We briefly derive the iterative update equation below.

First, we can write the KL-divergence as

KL(Q ‖ P) = E_Q[ln Q(y)] − E_Q[ln P(y | x)],

where E_Q[·] refers to the expected value under the distribution Q. Since P(y | x) = (1/Z) exp(−E(y | x)) and Q factorizes as Q(y) = Π_i Q_i(y_i), by the linearity of expectation one has

KL(Q ‖ P) = Σ_i E_{Q_i}[ln Q_i(y_i)] + Σ_i E_{Q_i}[ψ_u(y_i)] + Σ_{i < j} E_{Q_i Q_j}[ψ_p(y_i, y_j)] + ln Z,

where each pairwise expectation factorizes because Q is fully factored.

The marginal Q_i that we need is found by minimizing a Lagrangian that consists of all terms in the KL-divergence involving Q_i, plus a Lagrange multiplier ensuring that Q_i is a probability distribution. Differentiating this Lagrangian with respect to Q_i(y_i), setting the derivative to 0, and rearranging terms, we get

Q_i(y_i) = (1 / Z_i) exp{ −ψ_u(y_i) − Σ_{j ≠ i} E_{Q_j}[ψ_p(y_i, y_j)] },

where Z_i is the corresponding partition function.

Substituting the definition of the pairwise potential above into the mean field update in (38) yields the following formulation of the update equation:

Q_i(y_i = l) = (1 / Z_i) exp{ −ψ_u(y_i = l) − Σ_{l'} μ(l, l') Σ_m w_m Σ_{j ≠ i} k_m(f_i, f_j) Q_j(y_j = l') }.

In fact, the general form of the mean field update equations (see [52]) is

Q_i(y_i = v) = (1 / Z_i) exp{ −Σ_{c: i ∈ c} Σ_{y_c: y_i = v} Q_{c∖i}(y_{c∖i}) ψ_c(y_c) },

where v is a value in the domain of the random variable y_i, y_c denotes an assignment of all variables in clique c, y_{c∖i} is an assignment of all variables in c apart from y_i, and Q_{c∖i} denotes the marginal distribution of all variables in c apart from y_i derived from the joint distribution Q. Thus the inner sum evaluates the expected value of ψ_c over Q given the condition that y_i takes the value v. When we set the cliques to be the individual variables and the pixel pairs and evaluate (40) across the unary and pairwise potentials defined in [34], we directly obtain (39).

In [34], it is shown that parallel updates for (39) can be evaluated by convolution with a high-dimensional Gaussian kernel using any efficient bilateral filter, for example, the permutohedral lattice method of [39]. This is achieved by the following transformation:

Σ_{j ≠ i} k_m(f_i, f_j) Q_j(l') = [G_m ⊗ Q(l')](f_i) − Q_i(l'),

where G_m is a Gaussian kernel corresponding to the m-th component of (30) and ⊗ is the convolution operator; the subtraction of Q_i(l') removes the contribution of pixel i to its own update. The following is the algorithm used in [34].

Algorithm 1 (mean field in fully connected CRFs).
Initialize Q_i(l) ∝ exp(−ψ_u(y_i = l)) for all pixels i and labels l.
while not converged do
  Message passing: for each kernel m and label l', compute Q̃_i^m(l') = Σ_{j ≠ i} k_m(f_i, f_j) Q_j(l') by Gaussian filtering.
  Compatibility transform: Q̂_i(l) = Σ_{l'} μ(l, l') Σ_m w_m Q̃_i^m(l').
  Local update and normalization: Q_i(l) ∝ exp(−ψ_u(y_i = l) − Q̂_i(l)).
end while
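For clarity, the sketch below implements the update of Algorithm 1 naïvely, with an explicit O(N²) message-passing step instead of the permutohedral-lattice filtering used in [34]; it follows the update equation above and is only meant to illustrate the structure of the algorithm, not its efficient implementation.

```python
import numpy as np

def dense_crf_mean_field(unary, features, weights, bandwidths, mu, iters=5):
    """Naive mean field for a fully connected pairwise CRF (O(N^2) per iteration).

    unary      : (N, L) unary potentials psi_u
    features   : (N, F) feature vector f_i per pixel (e.g., position and color)
    weights    : (K,)   mixture weights w_m
    bandwidths : (K,)   Gaussian kernel bandwidths
    mu         : (L, L) label compatibility function mu(l, l')
    """
    Q = np.exp(-unary)
    Q /= Q.sum(axis=1, keepdims=True)                  # initialize with normalized unaries
    # Precompute the Gaussian kernels k_m(f_i, f_j); the filtering of [34] avoids this step.
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    kernels = [np.exp(-d2 / (2.0 * b ** 2)) for b in bandwidths]
    for k in kernels:
        np.fill_diagonal(k, 0.0)                       # exclude j == i from the message passing
    for _ in range(iters):
        msg = sum(w * (k @ Q) for w, k in zip(weights, kernels))   # message passing
        pairwise = msg @ mu.T                          # compatibility transform
        Q = np.exp(-unary - pairwise)                  # local update
        Q /= Q.sum(axis=1, keepdims=True)              # normalize
    return Q

rng = np.random.default_rng(1)
N, L = 50, 3
unary = rng.random((N, L))
feats = rng.random((N, 5))
mu = 1.0 - np.eye(L)                                   # Potts label compatibility
Q = dense_crf_mean_field(unary, feats, weights=[1.0], bandwidths=[0.5], mu=mu)
print(Q.argmax(axis=1)[:10])                           # MPM labels for the first 10 pixels
```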

In [34], the permutohedral lattice [39] was used for the filter-based inference; the recently proposed domain transform filtering approach [58] has certain advantages over the permutohedral lattice. Since the domain transform filtering approach does not subsample the original signal, its complexity is independent of the filter size, whereas with the permutohedral lattice the complexity and the filter size are inversely related. In [47], it was demonstrated that the domain transform approach achieves even faster inference times than the permutohedral lattice for accurate object/stereo labeling. On the basis of [34, 47], the mean field approximation was further applied to the inference of models with higher-order terms.

In [47] the pattern-based potentials described in Section 2 were added to the energy function; the required expectation for the mean field updates (39) can then be calculated by summing only over the subset of patterns in the pattern set that are consistent with y_i taking the given value.

A particular case of the pattern-based potential is the P^n Potts model, for which the required expectations can be expressed in closed form as products of the marginals Q_j of the other variables in the clique.

The paper [1] also added cooccurrence potentials (see [47] for more details), which are defined over a clique containing the entire image, and tested the approach on object class segmentation. As a result, the authors showed substantial improvements in inference speed with respect to graph cut based methods, particularly by using recent domain transform filtering techniques, while also observing similar or better accuracies. Figures 8 and 9 show the results of [1] for both stereo and image labeling. All the experiments in [1] are based on an Intel® Xeon® 3.33 GHz processor, and the number of full mean field update iterations was fixed to 5 for all models.

In Figure 8, [1] applied their model to the Leuven dataset, consisting of stereo images of street scenes, with ground truth labeling for 7 object classes and manually annotated ground truth stereo labeling quantized into 100 disparity labels. In their model, JointBoost classifier responses are used to form the object unary potentials, and a truncated norm of the intensity differences is used to form the disparity potentials. For the densely connected pairwise terms, identical kernels and weightings and an Ising (Potts) model for the label compatibility function were used. For the P^n Potts potentials, one parameter was fixed for all cliques and the other was set by cross-validation. Figure 9 shows the results of [1] on the PascalVOC-10 dataset.

From Figure 8 and Table 1, we note that the densely connected CRF with higher-order terms (Dense + HO) achieves accuracies comparable to [4] and that the use of domain transform filtering methods [58] permits an extra speed-up, with inference being almost 12 times faster than the least accurate setting of [4] and over 35 times faster than the most accurate. The Dense + HO + CostVol approach achieves the best overall stereo accuracies. Although the improved stereo performance appears to cause a small decrease in the object labeling accuracy of [1]'s full model, that accuracy remains at an almost saturated level.

Figure 9 and Table 2 compare the timing and performance of [1]'s approach (final 2 lines) against two baselines. The importance of higher-order information is confirmed by the better performance of all algorithms compared to the basic dense CRF of [34]. Further, the filter-based inference is able to improve substantially on the inference time and class-average performance of the AHCRF [40], with P^n Potts and cooccurrence potentials each giving notable gains.

Although the mean field algorithm is a simple approximation method, it still has several limitations. As mentioned in [9], the first limitation is related to the fact that the mean field approximation assumes complete factorization over the individual variables. As a result, mean field inference methods are usually sensitive to initialization, although the simplified model leads to efficient and tractable learning and inference. Another limitation relates to the form of the pairwise weights in (30), which are a linear combination of Gaussian kernels: each Gaussian component is allowed only a zero mean, and the same combination of Gaussian kernels is used for each label pair. Although these issues are improved in [9], the results can still be unsatisfactory. Therefore, in the future, we hope to find other methods that offer not only fast inference but also high accuracy.

4. Conclusion

Recently, the CRF has become accepted as one of the most popular approaches for solving the image labeling problem in computer vision and image analysis. An important issue in CRF models is to develop an efficient inference algorithm to find the most appropriate labels, especially when considering the global information of an image.

In this paper we reviewed the research development and status of object recognition with CRFs, especially the two main discrete optimization methods for image labeling with CRFs: graph cut and mean field approximation. We described graph cut briefly, while we introduced mean field approximation, which offers fast inference and has been popular in recent years, in more detail. Compared with the graph cut method, mean field inference improves speed substantially because of its simplified model.

One typical difficulty in applying CRFs to image labeling in computer vision is that there are too many nodes. For example, for an image of size N × N, supposing each node can take k possible labels, the space of label configurations has size k^(N×N); thus the configuration space expands exponentially with the growth of the image size. It is therefore clear that the inference algorithm plays a very important role in these problems. Another key issue is to construct reasonable CRF models, as Section 2 describes. Learning the parameters of a CRF model efficiently from images, instead of choosing them manually or empirically, is also an important issue, though it is not the focus of this paper.

Nowadays, many tasks in computer vision and image analysis can be formulated as a labeling problem in which the correct label has to be assigned to each pixel or clique. However, the computational expense of training is still a burden because inference must be performed repeatedly during the training process. In the future, we hope to improve the accuracy of mean field inference for image labeling while maintaining its efficiency; solving these problems will greatly influence technologies such as driverless cars. On the other hand, with the development of techniques for capturing image depth information, such as Kinect, depth information is now as easy to obtain as color features, so it is worth combining these properties with CRF models and efficient inference approaches for image labeling and stereo reconstruction in 3-dimensional space. Moreover, applying these methods to facial action labeling may be another promising direction.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work was supported by the Special Fund Project of “Industry-Education-Academy” Cooperation in Guangdong Province in 2013 (2013A090100002), the National High Technology Research and Development Program (“863” Program) of China (2015AA043302), and the Key Scientific Projects of Guangzhou Huadu in 2014 (HD14ZD004).