Abstract

The Markov random field (MRF) is a widely used probabilistic model for expressing the interaction of different events. One of its most successful applications is solving image labeling problems in computer vision. This paper provides a survey of recent advances in this field. We give the background, basic concepts, and fundamental formulation of the MRF. Two distinct kinds of discrete optimization methods, belief propagation and graph cut, are discussed. We then focus on the solutions of two classical vision problems, stereo and binary image segmentation, using the MRF model.

1. Introduction

Many tasks in computer vision and image analysis can be formulated as a labeling problem in which the correct label has to be assigned to each pixel or clique. The label of a pixel represents some property of the real scene, such as the object to which the pixel belongs or its disparity. Such problems can be naturally represented with the Markov random field (MRF) model. The MRF was first introduced into vision by S. Geman and D. Geman [1] in 1984 and has been widely used in both low-level and high-level vision in recent years.

Humans understand a scene mainly through the spatial and visual information assimilated by their eyes. Conversely, given one or more images, information such as boundaries or objects, derived mainly from contextual constraints, is essential for scene interpretation. We therefore hope to model the vision problem so as to capture the full interaction between pixels. On the other hand, due to sensor noise and the complexity of the real world, exact interpretation is rather difficult for computers. As a result, researchers have realized that vision problems should be solved with optimization methods. As one of the most popular models for gridded, image-like data, the MRF provides a mathematical framework for finding such optimal solutions under the contextual visual information in the images. Context-dependent entities in digital images can be modeled in a convenient and consistent way through MRF theory, which characterizes the mutual influences among such entities using conditional MRF distributions [2]. Besides, captured images are typically piecewise smooth, which can be encoded as a prior distribution. Thus, we can use an MRF model whose negative log-likelihood is proportional to a robustified measure of image smoothness [3]. Moreover, we may have external knowledge of the scene, for example, that a specific object might exist in the environment. With these priors, we can obtain a more reliable understanding of the images.

In the last two decades, the MRF model has enjoyed a renaissance in computer vision owing to powerful energy minimization algorithms. Many inference algorithms have been developed to solve MRF optimization problems, such as graph cut [4], belief propagation [5], tree-reweighted message passing [6], dual decomposition [7], fusion move [8], iterated conditional modes, and their extensions. Szeliski et al. [9] gave a set of energy minimization benchmarks and used them to evaluate the solution quality and runtime of several energy minimization algorithms. Felzenszwalb and Zabih [10] reviewed dynamic programming and graph algorithms and then discussed their applications in computer vision. A review of linear programming approaches to the max-sum problem was given in [11]. On the other hand, a framework for learning image priors for MRFs was introduced by Roth and Black [12]. Schmidt et al. [13] revisited the generative aspects of MRFs and analyzed the quality of common image priors in a fully application-neutral setting. New models based on the MRF, such as the MPF, have also been proposed; it was proved that the convex-energy MPF can be used to encourage arbitrary marginal statistics [14]. Some excellent books about MRF models in image analysis, such as [2], are also available.

The MRF has been successfully applied to image analysis tasks such as restoration, matting [15], and segmentation, as well as to two-dimensional (2D) fields such as stereo matching, super resolution [16], optical flow, image inpainting [17], motion estimation, and 2D-3D registration [18]. The MRF has also been used to solve high-level vision problems such as object classification [19, 20], face analysis [21], face recognition [22], and text recognition [23]. Many other optimization problems can be formulated with the MRF, for example, color-to-gray transformation [24] and feature detection scale selection [25]. Additionally, Boykov and Funka-Lea [26] presented a survey of various energy-based techniques for binary object segmentation. S. Geman and D. Geman [1] first applied the MRF to image restoration. Sun et al. [27] used the belief propagation algorithm and combined it with occlusion handling to solve the stereo problem. Detry et al. [28] proposed an object representation framework that encodes probabilistic spatial relations between 3D features and organized this framework as an MRF.

In the remainder of the paper, Section 2 gives a sketch of the MRF and related concepts. Section 3 presents the two most frequently used inference algorithms for the MRF. Section 4 briefly introduces two labeling applications of the MRF in low-level vision. Section 5 summarizes the contribution and outlines future work on the topic.

2. Problem Formulation with MRF

As a branch of probability theory [2], the MRF is an undirected graphical model in which a set of random variables has a Markov property. To cast a specific computer vision problem involving pixel interaction and partially observed information as an optimization problem using the MRF model, we first go over graphical models, which visualize the structure of probabilistic models using diagrammatic representations. A graph consists of nodes and edges. Each node represents an event, and each edge represents the relationship between two events. The MRF is used to find the optimal label configuration.

For a labeling problem, we need to specify a set of nodes, labels, and edges. Without loss of generality, let $\mathcal{S}$ be a set of indexes $\mathcal{S} = \{1, \ldots, m\}$, and let $X = \{x_i \mid i \in \mathcal{S}\}$ be a set of observed nodes. In vision problems, a node often represents the pixel intensity or some other image feature. Let $\mathcal{L}$ be a set of labels. $\mathcal{L}$ can be continuous or discrete, but in most cases, all the labels we set are discrete: $\mathcal{L} = \{l_1, \ldots, l_M\}$.

As stated above, a label represents some quantity of the real scene. The simplest case is the binary form, where $\mathcal{L} = \{0, 1\}$. Such a black-and-white model is often used to classify the foreground and background regions in the image. In general, the label values carry more meaning. For example, in stereo and image restoration problems, a larger label value indicates depth information or a lighter pixel intensity. Additionally, $\mathcal{L}$ can also consist of unordered labels whose values have no semantic meaning, as in object classification.

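$\mathcal{N} = \{\mathcal{N}_i \mid i \in \mathcal{S}\}$ represents the neighbor system, which indicates the interrelationship between nodes, or the order of the MRF. Edges are added between each node $i$ and its neighbors $\mathcal{N}_i$. Usually, the neighbor system should satisfy [2] the following: (1) a site does not neighbor with itself: $i \notin \mathcal{N}_i$; (2) the neighboring relationship is mutual: $i \in \mathcal{N}_j \Leftrightarrow j \in \mathcal{N}_i$.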

The definition of the neighbor system is important because it reflects how far the contextual constraint reaches. For data on a regular array, as in Figure 1, the neighbors of $i$ are defined as the set of sites within a radius of $\sqrt{r}$ from $i$, where $r$ is the order of the neighbor system:

$$\mathcal{N}_i = \{\, j \in \mathcal{S} \mid \mathrm{dist}(i, j)^2 \leq r,\ j \neq i \,\},$$

where $\mathrm{dist}(i, j)$ measures the Euclidean distance between $i$ and $j$.
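As a minimal sketch of this definition, assuming pixels on a regular grid indexed by (row, column), the following Python function enumerates the sites whose squared Euclidean distance from a given site is at most r. The function name `neighbors` and its parameters are illustrative only and do not come from the cited works.

```python
import math


def neighbors(row, col, height, width, r=1):
    """Sites (excluding the site itself) within squared distance r of (row, col)."""
    reach = math.isqrt(r)  # largest integer offset that can satisfy the bound
    result = []
    for d_row in range(-reach, reach + 1):
        for d_col in range(-reach, reach + 1):
            if d_row == 0 and d_col == 0:
                continue  # a site does not neighbor itself
            n_row, n_col = row + d_row, col + d_col
            inside = 0 <= n_row < height and 0 <= n_col < width
            if inside and d_row ** 2 + d_col ** 2 <= r:
                result.append((n_row, n_col))
    return result


four = neighbors(1, 1, 3, 3, r=1)   # 4-neighborhood of the center pixel
eight = neighbors(1, 1, 3, 3, r=2)  # 8-neighborhood of the center pixel
```

With r = 1 the function yields the 4-neighborhood and with r = 2 the 8-neighborhood; sites on the image border simply have fewer neighbors, and the relationship stays mutual by construction.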

Another concept here is the “clique.” A clique $c$ is a subset of $\mathcal{S}$ that is closely tied to the neighbor system: every pair of distinct sites in a clique are neighbors. The nodes in a clique are ordered, which means that $\{i, j\}$ is a different clique from $\{j, i\}$. Figure 2 shows some examples of clique types.

Though we can obtain more statistical information about the problem domain with a larger neighbor system, the computational complexity of the problem also increases exponentially with the size of the neighborhood. In most cases, a 4-neighborhood system is used for simplicity and efficiency.

The MRF is an undirected graph in which a set of random variables has a Markov property. In the random field, each random variable in the set can take a label from $\mathcal{L}$. Usually, a mapping function $f\colon \mathcal{S} \to \mathcal{L}$, in which $f_i \in \mathcal{L}$ is the label of site $i$, represents this assignment. $f = \{f_1, \ldots, f_m\}$ is also called a configuration. Denote by $P(f_i)$ the probability of pixel $i$ taking the label $f_i$. Then the probability of the configuration is a joint probability: $P(f) = P(f_1, \ldots, f_m)$. Note that $P(f) > 0$. The random field has the Markov property if the conditional probability satisfies

$$P(f_i \mid f_{\mathcal{S} \setminus \{i\}}) = P(f_i \mid f_{\mathcal{N}_i}),$$

where $f_{\mathcal{S} \setminus \{i\}}$ means all elements of $f$ other than $f_i$ and $\mathcal{N}_i$ is the neighbor system of $i$. The Markov property means that the state probability of one node depends only on its neighbors rather than on the remaining nodes.

A Gibbs random field is a random field in which the probability obeys the Gibbs distribution in the form of

$$P(f) = Z^{-1} \exp\left(-\frac{U(f)}{T}\right), \tag{2.1}$$

where $Z = \sum_{f} \exp(-U(f)/T)$ is a normalizing constant called the partition function, and $T$ is a constant which shall be assumed to be 1 unless otherwise stated. $U(f) = \sum_{c \in C} V_c(f)$ is the energy function, $C$ is the set of cliques defined on the graph, and $V_c(f)$ is the clique potential function.
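To make the Gibbs distribution concrete, the following sketch enumerates every configuration of a very small field and normalizes the exponentiated negative energies by the partition function. The toy energy `chain_energy`, which charges one unit for each pair of adjacent sites with different labels, is an assumption chosen for illustration; brute-force enumeration is only feasible for a handful of sites.

```python
import itertools
import math


def gibbs_distribution(num_sites, labels, energy, temperature=1.0):
    """Return {configuration: probability} under P(f) = exp(-U(f)/T) / Z."""
    configurations = list(itertools.product(labels, repeat=num_sites))
    weights = [math.exp(-energy(f) / temperature) for f in configurations]
    partition = sum(weights)  # the partition function Z
    return {f: w / partition for f, w in zip(configurations, weights)}


def chain_energy(f):
    """Toy energy U(f): one unit for every adjacent pair with different labels."""
    return sum(1.0 for a, b in zip(f, f[1:]) if a != b)


probabilities = gibbs_distribution(3, (0, 1), chain_energy)
```

Configurations with fewer label changes receive lower energy and therefore higher probability, which is the smoothness preference that MRF priors typically encode.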

The Hammersley-Clifford theorem states that if a probability distribution is positive and satisfies the Markov property on an undirected graph $G$, the distribution is a Gibbs random field. That is, its probability can be factorized over the cliques of the graph. This theorem provides a simple way to calculate the joint probability using the clique potentials. According to Bayes' rule, the posterior distribution of the unknowns $f$ given the evidence $X$, which combines the likelihood $P(X \mid f)$ with a prior $P(f)$ over the unknowns, is given by

$$P(f \mid X) = \frac{P(X \mid f)\, P(f)}{P(X)}.$$

If we do not know the prior information, the maximum likelihood (ML) criterion may be used, where $f^{*} = \arg\max_{f} P(X \mid f)$. However, we can often still obtain knowledge about the prior distribution of $f$. In that case, maximum a posteriori (MAP) estimation is the preferred way to obtain the optimum, where $f^{*} = \arg\max_{f} P(f \mid X)$. Figure 3 illustrates the difference between the ML criterion and the MAP criterion. The MAP probability is one of the most popular statistical criteria for optimization, and in fact, it is the first choice in MRF vision modeling.
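The difference between the two criteria can be seen on a single pixel with two candidate labels. The sketch below uses hypothetical likelihood and prior values (not taken from any cited experiment) for which the ML and MAP decisions disagree; the function names are illustrative.

```python
def ml_estimate(likelihood):
    """Label maximizing the likelihood P(X | f)."""
    return max(likelihood, key=likelihood.get)


def map_estimate(likelihood, prior):
    """Label maximizing P(X | f) * P(f), which is proportional to P(f | X)."""
    return max(likelihood, key=lambda label: likelihood[label] * prior[label])


# Hypothetical numbers for a single pixel with two candidate labels.
likelihood = {"background": 0.4, "foreground": 0.6}
prior = {"background": 0.8, "foreground": 0.2}

ml_label = ml_estimate(likelihood)
map_label = map_estimate(likelihood, prior)
```

Here the observation alone slightly favors the foreground, but the strong prior toward the background reverses the decision under the MAP criterion.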

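Taking the logarithm of both sides of Bayes' rule, we obtain the negative posterior log-likelihood

$$-\log P(f \mid X) = -\log P(X \mid f) - \log P(f) + C, \tag{2.4}$$

where $C = \log P(X)$ is a constant that makes $P(f \mid X)$ integrate to 1. To find the MAP solution, we simply minimize (2.4). Writing the likelihood term as a sum of single-site potentials $V_1(f_i)$ and the prior term as a sum of the remaining clique potentials $V_c(f)$, (2.4) becomes, up to an additive constant, an energy function

$$E(f) = \sum_{i \in \mathcal{S}} V_1(f_i) + \sum_{c \in C,\ |c| > 1} V_c(f), \tag{2.5}$$

where $V_1(f_i)$ can be treated as the clique potential whose clique size is 1, and $V_c(f)$ is the remaining clique potential, which encodes the prior distribution of the image. In most vision problems, the single-site clique potential $V_1(f_i)$ is also called the unary potential or data energy. Similarly, $V_c(f)$ is called the smooth potential or smooth energy. With the Gibbs distribution (2.1), (2.5) can be rewritten as

$$P(f \mid X) = Z^{-1} \exp(-E(f)),$$

where $Z = \sum_{f} \exp(-E(f))$. Therefore,

$$f^{*} = \arg\max_{f} P(f \mid X) = \arg\min_{f} E(f). \tag{2.7}$$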

Most vision problems can thus be mapped to the minimization of an energy function over an MRF. To some degree, the energy function can be seen as a mathematical representation of the scene; it should measure the global quality of a solution precisely while remaining easy to minimize globally. When the energy function (2.7) is minimized, the corresponding posterior probability attains its maximum.

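To solve a specific problem, we need to determine the form of the energy and the parameters involved. Though there are many types of clique potential functions, there exists a unique normalized potential, called the canonical potential. In the literature, the energy function is expressed in either a parametric form or a nonparametric form [2]. Here, we take the second-order clique potential, also called the pairwise model, as an example. The pairwise MRF is the most commonly used model, in which each node interacts with its adjacent nodes. It is the lowest-order constraint that conveys contextual information and is widely used due to its simple form and low computational cost [2]. The pairwise MRF models the statistics of the first derivative in the image structure (Figure 4). The corresponding energy function is

$$E(f) = \sum_{i \in \mathcal{S}} V_1(f_i) + \sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{N}_i} V_2(f_i, f_j), \tag{2.8}$$

where $\mathcal{S}$ is the set of sites and $\mathcal{N}_i$ denotes the neighbors of site $i$.

Usually, $V_1(f_i)$ is the local evidence for node $i$ taking the label $f_i$, such as the intensity or the color value. Writing the unary term as a data cost $D_i(f_i)$ and the pairwise term as a smooth cost $V(f_i, f_j)$, Equation (2.8) can be rewritten as

$$E(f) = \sum_{i \in \mathcal{S}} D_i(f_i) + \sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{N}_i} V(f_i, f_j). \tag{2.9}$$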

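In the binary MRF case, the pairwise potential is defined over two-valued labels, where $f_i \in \{0, 1\}$. In the multilabel case, the Potts model is the most widely used one, which can prevent the edges of objects from oversmoothing. Usually, the Potts model takes the form

$$V(f_i, f_j) = \begin{cases} 0, & f_i = f_j, \\ \alpha, & f_i \neq f_j, \end{cases}$$

where $\alpha$ may be a constant or a data-dependent weight.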
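As an illustration of how a pairwise energy with a Potts smooth cost is evaluated, the sketch below sums a data cost over all pixels of a small grid and a Potts penalty over all 4-connected neighboring pairs. The names `potts`, `pairwise_energy`, and `absolute_data_cost` are hypothetical, and each unordered neighboring pair is counted once here, which differs from a double sum over ordered pairs only by a factor of two in the smoothness weight.

```python
def potts(label_a, label_b, alpha=1.0):
    """Potts smooth cost: zero for equal labels, alpha otherwise."""
    return 0.0 if label_a == label_b else alpha


def pairwise_energy(labels, data_cost, alpha=1.0):
    """Energy of a 2D labeling; data_cost(row, col, label) returns a float."""
    height, width = len(labels), len(labels[0])
    energy = 0.0
    for row in range(height):
        for col in range(width):
            energy += data_cost(row, col, labels[row][col])
            if row + 1 < height:  # vertical neighbor, counted once
                energy += potts(labels[row][col], labels[row + 1][col], alpha)
            if col + 1 < width:   # horizontal neighbor, counted once
                energy += potts(labels[row][col], labels[row][col + 1], alpha)
    return energy


observed = [[0, 0, 1],
            [0, 0, 1]]
labeling = [[0, 0, 1],
            [0, 1, 1]]


def absolute_data_cost(row, col, label):
    """Assumed data cost: absolute difference between observation and label."""
    return abs(observed[row][col] - label)


energy = pairwise_energy(labeling, absolute_data_cost, alpha=0.5)
```

In this example one pixel disagrees with its observation and three neighboring pairs carry different labels, so the energy is the sum of one unit of data cost and three Potts penalties.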

As illustrated in Figure 5, in the pairwise MRF model, a node is attached to each pixel in the image, and edges are constructed between the node and its four neighbors. With such a model, the corresponding energy function can be efficiently minimized using many inference algorithms. Other graph structures are also used. For example, in image segmentation, an image is partitioned into several regions. Each region can be regarded as a node, and edges may be constructed between adjacent regions. To make the optimization more efficient, a hierarchical MRF model can be used. It mainly relies on a pyramid structure and proceeds in a coarse-to-fine scheme, in which a coarser solution initializes a finer one. Hierarchical methods are known to improve the convergence rate and reduce the execution time significantly. In [29-31], a regular pyramid downsampling method was applied, while Zitnick and Kang [32] used an irregular pyramid. Figure 6 illustrates an example.

Although most MRFs use the pairwise model due to its simplicity, schemes with more complex interactions, for example, an 8-neighborhood or a larger number of pairwise terms, are also used sometimes. A 26-neighborhood is commonly used in 3D volumetric image or video analysis. Higher-order clique potentials can capture more complex interactions of random variables. For example, calculating the curvature of an object requires the interaction of at least three nodes. However, the computational time for a clique potential increases exponentially with the size of the clique, which makes energy minimization difficult. Recently, there have been many attempts to go beyond the pairwise MRF. One approach is to transform the higher-order problem into a pairwise problem by adding auxiliary variables. For instance, Kohli et al. [33] proposed an efficient graph cut method based on a special class of higher-order potentials, that is, the robust Potts model. Rother et al. [34] transformed the minimization of a sparse higher-order energy function into an equivalent quadratic minimization problem. Potetz and Lee [35] introduced an efficient belief propagation algorithm whose computational complexity increases linearly with the clique size. Kwon et al. [36] decomposed high-order cliques into hierarchical auxiliary nodes and used hierarchical gradient nodes to reduce the computational complexity. Another way is to compute directly using a factor graph representation [37]. Kwon et al. [38] proposed a nonrigid registration method using the MRF with a higher-order spatial prior. Experiments show that using high-order potentials significantly improves the performance of image denoising, as shown in Figure 7.

3. Inference Methods

Over the years, a large number of inference algorithms have been developed. They can be broadly classified into two categories: message passing algorithms such as loopy belief propagation and move-making algorithms such as graph cuts. In this section, we briefly introduce two classic inference methods for approximate energy minimization, that is, graph cut and belief propagation.

3.1. Graph Cut

Graph cut (GC), first applied in computer vision by Greig et al. [40], describes a large family of MRF inference algorithms based on solving min-cut/max-flow problems. Given a computer vision problem that can be formulated in terms of an energy function, GC finds the minimum-energy configuration, which corresponds to the MAP estimate.

Suppose that $G = (\mathcal{V}, \mathcal{E})$ is a directed graph with nonnegative edge weights, where $\mathcal{V}$ represents the vertices and $\mathcal{E}$ denotes the edges. The graph has two special terminals (vertices), that is, the source $s$ and the sink $t$. A cut is a partition of $\mathcal{V}$ into two disjoint subsets $\mathcal{V}_s$ and $\mathcal{V}_t$. An s-t cut is a cut that places the source and the sink in different subsets, where $s \in \mathcal{V}_s$ and $t \in \mathcal{V}_t$. According to graph theory, the cost of a cut is measured by the sum of the weights of the edges crossing the cut from $\mathcal{V}_s$ to $\mathcal{V}_t$. Finding the minimum s-t cut is equivalent to computing the maximum flow from the source $s$ to the sink $t$. The maximum flow is the maximum “amount of water” that can be sent from the source to the sink by interpreting graph edges as directed “pipes” with capacities equal to the edge weights. As illustrated in Figure 8, the GC algorithm is designed to solve this max-flow problem.
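The equivalence between the minimum s-t cut and the maximum flow can be checked on a small graph. The sketch below uses the Edmonds-Karp augmenting-path algorithm, chosen here for brevity rather than because the surveyed graph cut methods use it; the function name `max_flow`, the dictionary-based graph format, and the example capacities are all assumptions.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow. Returns (flow value, source side of a min cut)."""
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edges
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No path left: the reachable vertices form the source side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


capacity = {
    "s": {"a": 4, "b": 3},
    "a": {"t": 2, "b": 1},
    "b": {"t": 2},
    "t": {},
}
flow_value, source_side = max_flow(capacity, "s", "t")
```

In this example the two edges entering the sink are the bottleneck, so the minimum cut separates the sink from all other vertices and its cost equals the maximum flow.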

It was reported that GC can obtain the exact solution in the binary label case. In the multilabel case, GC solves a series of related binary inference problems and obtains an approximate solution. Two of the most popular GC algorithms are α-β swap and α-expansion. In the α-β swap algorithm, a swap move takes some subset of the nodes currently labeled α and assigns them the label β, and vice versa. In the α-expansion algorithm, an expansion move enlarges the set of nodes labeled α by assigning α to nodes that currently carry other labels. When no swap or expansion move lowers the energy any further, a local minimum is found. Comparing the two algorithms, α-expansion is more accurate and efficient and can produce a result with lower energy. However, the condition for α-expansion is stricter. When using α-expansion, the interaction potential $V$ must be a metric, that is, for any labels $\alpha$, $\beta$, and $\gamma$,

$$V(\alpha, \beta) = 0 \Leftrightarrow \alpha = \beta, \qquad V(\alpha, \beta) = V(\beta, \alpha) \geq 0, \qquad V(\alpha, \beta) \leq V(\alpha, \gamma) + V(\gamma, \beta).$$

For α-β swap, it need only be a semimetric, that is,

$$V(\alpha, \beta) = 0 \Leftrightarrow \alpha = \beta, \qquad V(\alpha, \beta) = V(\beta, \alpha) \geq 0.$$

More details about α-β swap and α-expansion can be found in [4]. In addition, Kolmogorov and Rother [41] wrote a survey about graph cut and pointed out that GC can be applied to both submodular and nonsubmodular functions. Other more recent developments in GC include order-preserving GC [42] and combination GC [43].
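The conditions under which the two move algorithms apply can be checked numerically for a given interaction potential. The sketch below tests the semimetric properties (symmetry, nonnegativity, and zero cost exactly for equal labels) and, for the metric case, the triangle inequality over a finite label set. The helper names and the two example potentials are illustrative assumptions.

```python
import itertools


def is_semimetric(potential, labels):
    """Symmetric, nonnegative, and zero exactly when the labels are equal."""
    for a, b in itertools.product(labels, repeat=2):
        if potential(a, b) != potential(b, a) or potential(a, b) < 0:
            return False
        if (potential(a, b) == 0) != (a == b):
            return False
    return True


def is_metric(potential, labels):
    """A semimetric that also satisfies the triangle inequality."""
    return is_semimetric(potential, labels) and all(
        potential(a, b) <= potential(a, c) + potential(c, b)
        for a, b, c in itertools.product(labels, repeat=3)
    )


def truncated_linear(a, b):
    return min(abs(a - b), 2)


def truncated_quadratic(a, b):
    return min((a - b) ** 2, 4)


labels = range(4)
linear_ok_for_expansion = is_metric(truncated_linear, labels)
quadratic_ok_for_expansion = is_metric(truncated_quadratic, labels)
quadratic_ok_for_swap = is_semimetric(truncated_quadratic, labels)
```

The truncated linear potential passes both checks, whereas the truncated quadratic potential violates the triangle inequality (for example, labels 0, 1, 2) and therefore qualifies only for swap moves under these conditions.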

3.2. Belief Propagation

Belief propagation (BP) is a powerful inference tool originally developed for tree-structured Bayesian networks [45]. It has more recently been extended to graphs with cycles, such as the MRF. Although on such graphs BP is not guaranteed to converge, and its fixed points correspond only to stationary points of the Bethe free energy [46], it obtains reasonable results in practice. In standard BP on a pairwise MRF, a variable $m_{ij}$ can be treated as a “message” from a node $i$ to its neighbor $j$, which contains information about what state node $j$ should be in. The message is a vector with one entry per possible label. Each entry indicates how likely the corresponding label is for the receiving node.

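Let $m_{ij}(f_j)$ denote the message from node $i$ to its neighbor $j$. Let $\psi_{ij}(f_i, f_j)$ be the pairwise interaction potential of $f_i$ with $f_j$, and let $\phi_i(f_i)$ be the “local evidence” of $f_i$. The message must be nonnegative. A large value of $m_{ij}(f_j)$ means that node $i$ “believes” the posterior probability of $f_j$ is high. The message updating rule is

$$m_{ij}^{t+1}(f_j) = \sum_{f_i} \psi_{ij}(f_i, f_j)\, \phi_i(f_i) \prod_{k \in \mathcal{N}_i \setminus \{j\}} m_{ki}^{t}(f_i), \tag{3.3}$$

where $\mathcal{N}_i \setminus \{j\}$ denotes the neighbors of $i$ other than $j$ and $t$ represents the iteration number, as shown in Figure 9.

The belief is the product of the “local evidence” of the node and all messages sent to it:

$$b_i(f_i) \propto \phi_i(f_i) \prod_{k \in \mathcal{N}_i} m_{ki}(f_i). \tag{3.4}$$

The standard BP described above is also called sum-product BP. There is another variant of BP that is simpler to use, that is, max-product (or max-sum in the log domain). In max-product BP, (3.3) and (3.4) become

$$m_{ij}^{t+1}(f_j) = \max_{f_i} \psi_{ij}(f_i, f_j)\, \phi_i(f_i) \prod_{k \in \mathcal{N}_i \setminus \{j\}} m_{ki}^{t}(f_i), \qquad b_i(f_i) \propto \phi_i(f_i) \prod_{k \in \mathcal{N}_i} m_{ki}(f_i),$$

and the label of each node is chosen as $f_i^{*} = \arg\max_{f_i} b_i(f_i)$.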

This process is sketched in Figure 10. Several speed-up techniques have been proposed, for example, distance transforms, checkerboard updating, and multiscale BP [5], so that belief propagation converges efficiently. Alternatively, Yu et al. [47] used predictive coding, linear transform coding, and the envelope point transform to improve the efficiency of BP.
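The message update and belief computation of sum-product and max-product BP can be sketched in a few lines for a pairwise MRF. The following is a minimal synchronous implementation under stated assumptions: a single symmetric compatibility function shared by all edges, messages normalized after every update for numerical stability, and none of the speed-up techniques mentioned above. The function name `belief_propagation` and the example numbers are hypothetical.

```python
import math


def belief_propagation(evidence, compat, edges, reduce_fn=sum, iterations=10):
    """Synchronous BP on a pairwise MRF.

    evidence[i][a] : local evidence of node i for label a
    compat(a, b)   : symmetric pairwise compatibility of neighboring labels
    edges          : undirected (i, j) pairs
    reduce_fn      : sum for sum-product, max for max-product
    """
    num_labels = len(evidence[0])
    nbrs = {i: [] for i in range(len(evidence))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # One message per directed edge, initialized uniformly.
    msgs = {(i, j): [1.0] * num_labels for i in nbrs for j in nbrs[i]}
    for _ in range(iterations):
        new_msgs = {}
        for i, j in msgs:
            out = [
                reduce_fn(
                    compat(a, b) * evidence[i][a]
                    * math.prod(msgs[k, i][a] for k in nbrs[i] if k != j)
                    for a in range(num_labels)
                )
                for b in range(num_labels)
            ]
            total = sum(out)
            new_msgs[i, j] = [m / total for m in out]
        msgs = new_msgs
    beliefs = []
    for i in nbrs:
        belief = [evidence[i][a] * math.prod(msgs[k, i][a] for k in nbrs[i])
                  for a in range(num_labels)]
        total = sum(belief)
        beliefs.append([b / total for b in belief])
    return beliefs


def compat(a, b):
    """Assumed smoothness preference: equal labels are twice as compatible."""
    return 2.0 if a == b else 1.0


evidence = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
edges = [(0, 1), (1, 2)]  # a three-node chain, so BP is exact here

marginals = belief_propagation(evidence, compat, edges, reduce_fn=sum)
max_beliefs = belief_propagation(evidence, compat, edges, reduce_fn=max)
map_labels = [max(range(2), key=b.__getitem__) for b in max_beliefs]
```

On a tree-structured graph such as this chain, the sum-product beliefs coincide with the exact marginals and the max-product labels with the MAP configuration once the messages have traversed the graph; on graphs with loops the same updates yield only approximations.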

Although BP is an efficient inference algorithm for MRFs with loops, it can only converge to a stationary point of the Bethe approximation of the free energy. Recently, the generalized belief propagation (GBP) algorithm proposed in [48] has received more attention due to its better convergence properties compared with BP. It can converge to a more accurate stationary point of the Kikuchi free energy [46]. More details about the GBP algorithm can be found in [48].

BP and graph cut are both effective optimization techniques that can find “global” minima over cliques and produce plausible results in practice. A comparison between the two approaches for stereo vision was described in [49]. GC obtains lower energy, but the performance of BP is comparable to that of GC with respect to the ground truth.

In addition to these two typical methods, many other inference algorithms have been proposed in the last few years. Fusion move [8] was proposed for multilabel MRFs. By employing QPBO graph cut, the fusion move can efficiently combine two proposal labelings in a theoretically sound way, and the combination is often globally optimal in practice. Alahari et al. [50] improved the computational and memory efficiency of algorithms for solving multilabel energy functions arising from discrete MRFs by recycling, reducing, and reusing. Kumar et al. [51] provided an analysis of the linear programming relaxation, the quadratic programming relaxation, and the second-order cone programming relaxation for obtaining the maximum a posteriori estimate of a general discrete MRF. Komodakis and Tziritas [52] proposed an exemplar-based framework and used priority BP to find MRF solutions. Ishikawa [53] introduced a method to solve a first-order MRF optimization problem exactly in more generality than previous ones. Cho et al. [54] used the patch transform representation to manipulate images in the patch domain. The patch transform is posed as a patch assignment problem on an MRF, where each patch should be used only once, and neighboring patches should fit together to form a plausible image.

4. Applications

Here, we provide MRF solutions for two typical problems in computer vision, that is, stereo matching and image segmentation. These problems require labeling each pixel with a value that represents either the disparity or the foreground/background membership. They can be easily modeled using the MRF and solved by energy minimization.

4.1. Stereo Matching

Stereo matching has long been one of the most challenging and fundamental problems in computer vision. Extensive research has been done in the last decade [32, 55-58]. A recent evaluation of these methods can be found in [59]. In the last few years, as shown in [44], global methods based on the MRF have reached the top performance.

For MAP estimation, let $\mathcal{P}$ be the set of image pixels in the image pair, and let $\mathcal{D}$ be the set of disparities, so that each pixel $p \in \mathcal{P}$ takes a disparity $d_p \in \mathcal{D}$. The initial data cost, computed with a truncated linear function that is robust to noise and outliers, is defined as

$$D_p(d_p) = \lambda \cdot \min\Big( \sum_{c} \big| I_c^{L}(p) - I_c^{R}(p - d_p) \big|,\ T \Big), \tag{4.1}$$

where $\lambda$ is the cost weight, which determines the share of the data cost in the whole energy, and $T$ is the truncation value. The parameters can be set to empirical values from experiments. $I_c^{L}(p)$ represents the intensity of pixel $p$ in channel $c$ of the left image; $I_c^{R}$ is similarly defined for the right image, and $p - d_p$ denotes the pixel shifted horizontally by the disparity. Birchfield and Tomasi’s pixel dissimilarity is used to improve the robustness against image sampling noise. The smooth cost, which expresses the compatibility between neighboring variables and follows the truncated linear model, is defined as

$$V(d_p, d_q) = \min\big( |d_p - d_q|,\ T_s \big), \tag{4.2}$$

where $T_s$ is the truncation value. The smooth cost based on the truncated linear model is also referred to as a discontinuity-preserving cost since it can prevent the edges of objects from oversmoothing. The corresponding energy function, which is the most prominent one, is defined as

$$E(d) = \sum_{p \in \mathcal{P}} D_p(d_p) + \sum_{(p, q) \in \mathcal{N}} V(d_p, d_q), \tag{4.3}$$

where $\mathcal{N}$ contains the edges in the four-connected neighborhood set.
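The truncated data cost, the truncated linear smooth cost, and the resulting stereo energy can be sketched as follows. This is an illustration under several assumptions that the text does not fix: absolute color differences are summed over the channels, a pixel whose match would fall outside the right image is charged the truncation value, Birchfield and Tomasi's dissimilarity is not included, and all function names and default parameter values are hypothetical.

```python
def data_cost(left, right, x, y, d, lam=1.0, trunc=30):
    """Truncated absolute color difference between left (x, y) and right (x - d, y)."""
    if x - d < 0:
        return lam * trunc  # assumed handling of matches outside the image
    diff = sum(abs(a - b) for a, b in zip(left[y][x], right[y][x - d]))
    return lam * min(diff, trunc)


def smooth_cost(d_p, d_q, trunc=2):
    """Truncated linear smooth cost between two neighboring disparities."""
    return min(abs(d_p - d_q), trunc)


def stereo_energy(left, right, disparity, lam=1.0):
    """Data cost over all pixels plus smooth cost over 4-connected pairs."""
    height, width = len(disparity), len(disparity[0])
    energy = 0.0
    for y in range(height):
        for x in range(width):
            energy += data_cost(left, right, x, y, disparity[y][x], lam)
            if x + 1 < width:
                energy += smooth_cost(disparity[y][x], disparity[y][x + 1])
            if y + 1 < height:
                energy += smooth_cost(disparity[y][x], disparity[y + 1][x])
    return energy


# One-row RGB images; the left view is the right view shifted by one pixel.
right = [[(10, 10, 10), (50, 50, 50), (90, 90, 90), (130, 130, 130)]]
left = [[(0, 0, 0), (10, 10, 10), (50, 50, 50), (90, 90, 90)]]

energy_shift_one = stereo_energy(left, right, [[1, 1, 1, 1]])
energy_shift_zero = stereo_energy(left, right, [[0, 0, 0, 0]])
```

The disparity map that matches the true one-pixel shift has a much lower energy than the zero-disparity map, which is the behavior an energy minimization method exploits.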

The objective is to find a solution that minimizes (4.3). The solution corresponds to the depth information of the scene. Figure 11 shows the results on the “Tsukuba” data set using different energy minimization methods available in [44]. In the past decades, segment-based stereo methods [32] have flourished, as they perform well in reducing the ambiguity associated with textureless regions and in enhancing noise tolerance by aggregating over pixels with homogeneous properties. Usually, these algorithms first segment the source image. Then the matching cost is computed over each entire segment, and a plane fitting method is applied to refine the result.

4.2. Binary Image Segmentation

Binary image segmentation is widely used in medical image analysis and object recognition. Here, each pixel $i$ is assigned a label $f_i$ with $f_i \in \mathcal{L}$. In the simplest case, we have $\mathcal{L} = \{0, 1\}$, where 0 represents a pixel belonging to the background and 1 a pixel belonging to the foreground. The segmentation result should be accurate and fine enough for applications such as object categorization, photo editing, and image retrieval. Although segmentation is regarded as one of the most difficult problems due to the complexity of real scenes and noise corruption, the MRF model can often deal with this challenging problem successfully.

The corresponding energy function has the same form as (2.9). The data cost measures whether the pixel property is consistent with the statistical distribution of the candidate region. A simple choice is the absolute difference between the pixel intensity and the mean gray level of the region. A more complex data term often leads to better results. For example, in [60], the data cost is based on a color data model, the log-likelihood of a pixel, which is modeled with two separate Gaussian mixture models. The smoothness term is a simple Potts model

$$V(f_m, f_n) = \mathrm{dist}(m, n)^{-1}\, [f_m \neq f_n]\, \exp\big(-K (I_m - I_n)^2\big), \tag{4.4}$$

where $I_m$ denotes the intensity of pixel $m$.

In (4.4), $\mathrm{dist}(m, n)$ is the Euclidean distance between pixel $m$ and pixel $n$, and $[\cdot]$ denotes the indicator function taking the values 0 and 1. $K$ is a constant. If $K = 0$, the smoothness term reduces to the Ising model, which encourages smoothness everywhere. $K$ determines how coherent similar gray levels within a region are. Recently, user interaction was proposed to refine the results in [60-62]. Usually, the user first marks some pixels to indicate the background and foreground. With those labeled pixels, we can obtain the corresponding region statistics.
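A contrast-sensitive smoothness term of this kind can be sketched for a single pair of neighboring pixels as follows. The exact functional form is an assumption made for illustration: the penalty applies only when the two labels differ, is scaled by the inverse Euclidean distance between the pixels, and decays exponentially with the squared intensity difference at a rate given by the constant `k`. The function name `smoothness` is hypothetical.

```python
import math


def smoothness(label_m, label_n, intensity_m, intensity_n, pos_m, pos_n, k=0.0):
    """Contrast-sensitive Potts term for one pair of neighboring pixels."""
    if label_m == label_n:
        return 0.0  # the indicator of differing labels is zero
    distance = math.dist(pos_m, pos_n)
    return math.exp(-k * (intensity_m - intensity_n) ** 2) / distance


# With k = 0 the term ignores contrast and penalizes every label change.
ising_like = smoothness(0, 1, 20, 200, (0, 0), (0, 1), k=0.0)
# With k > 0 a label change across a strong edge is penalized far less.
across_edge = smoothness(0, 1, 20, 200, (0, 0), (0, 1), k=0.001)
inside_region = smoothness(0, 1, 20, 22, (0, 0), (0, 1), k=0.001)
```

Setting `k` to zero makes the penalty independent of the image content, whereas a positive `k` makes label boundaries cheap where the intensity changes sharply and expensive inside homogeneous regions.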

GC is the most common optimization tool for binary MRFs that combine color (texture) information and edge information. Further, the marked pixels can be used as seeds in the cut-based algorithm. An extension of graph cut, GrabCut [60], was proposed for iterative minimization of the energy. Figure 12 shows the results of binary segmentation with different methods using identical parameters [44].
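To illustrate how a binary MRF energy is minimized exactly with a single minimum cut, the sketch below builds an s-t graph from per-pixel data costs and a constant Potts weight, then solves it with a simple Edmonds-Karp max-flow routine. It is a minimal construction under stated assumptions: pixels that stay on the source side receive label 0, the weight is constant rather than contrast dependent, and the names `min_cut` and `segment` as well as the example costs are hypothetical. It is not the GrabCut procedure.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """Edmonds-Karp max flow. Returns (cut value, source side of the cut)."""
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow, set(parent)
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def segment(data_cost, pairs, weight):
    """data_cost[i] = (cost of label 0, cost of label 1); pairs are neighbors."""
    capacity = {"s": {}, "t": {}}
    for i, (cost0, cost1) in enumerate(data_cost):
        capacity["s"][i] = cost1    # cut when pixel i takes label 1
        capacity[i] = {"t": cost0}  # cut when pixel i takes label 0
    for i, j in pairs:
        capacity[i][j] = capacity[i].get(j, 0) + weight  # cut when labels differ
        capacity[j][i] = capacity[j].get(i, 0) + weight
    energy, source_side = min_cut(capacity, "s", "t")
    labels = [0 if i in source_side else 1 for i in range(len(data_cost))]
    return labels, energy


# Four pixels in a row: the first two look like background, the last two like foreground.
data_cost = [(0, 5), (1, 4), (4, 2), (5, 0)]
pairs = [(0, 1), (1, 2), (2, 3)]
labels, energy = segment(data_cost, pairs, weight=2)
```

The value of the minimum cut equals the energy of the returned labeling, so in the binary case with a Potts-type smooth cost the global minimum is obtained without iterating over moves.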

For multilabel segmentation, Micusik and Pajdla [63] formulated single-image multilabel segmentation into regions coherent in texture and color as a max-sum problem. As a region merging method, Mignotte [64] used an MRF fusion model that combines several segmentation results to achieve a more reliable and accurate result.

More recently, Panda and Nanda [65] proposed an unsupervised color image segmentation scheme using the homotopy continuation method and a compound MRF model. Chen et al. [66] proposed an image segmentation method based on MAP or ML estimation. Li [67] introduced a multiresolution MRF approach to texture segmentation problems. Rivera et al. [68] presented a new MRF model for parametric image segmentation. Some other works [69-71] were carried out on learning the prior distribution. The MRF is also widely used in medical image segmentation. Zhang et al. [72] proposed the segmentation of brain MR images through a hidden MRF. Scherrer et al. [73] used expectation maximization to segment images in an MRF model. Anguelov et al. [74] segmented 3D scanned data into objects using GC. Hower et al. [75] investigated the problem in the context of neuroimaging segmentation. As a low-level vision problem, segmentation is often applied to object classification. Honghui et al. [76] proposed a robust supervised label transfer method for the semantic segmentation of street scenes. Feng et al. [77] recently proposed a method to optimize the MRF that automatically determines the number of labels, balancing accuracy and efficiency.

5. Conclusion

It is now acknowledged that the MRF is one of the most successful approaches for solving labeling problems in computer vision and image analysis. The greatest challenge for MRF models is to develop efficient inference algorithms that find low-energy configurations, because vision problems involve a very large number of nodes. For example, consider two frame images with the size of $W \times H$. If each node takes $N$ possible labels, the search space contains $N^{W \times H}$ configurations. Clearly, the inference algorithm should be efficient enough to overcome this dilemma. Secondly, constructing a reasonable MRF also plays a key role, especially for new vision applications. For instance, there are many different grid topologies and nonlocal topologies. Thirdly, the parameters of the MRF model should be learned efficiently from images instead of being chosen manually or empirically. Finally, further studies can focus on energy functions that cannot be minimized efficiently by state-of-the-art methods.

Acknowledgments

This work was supported by the National Natural Science Foundation of China and Microsoft Research Asia (NSFC-60870002, 60802087), NCET, Zhejiang Provincial S&T Department (2010R10006, 2010C33095), and Zhejiang Provincial Natural Science Foundation (R1110679).