Abstract

Generalized belief propagation (GBP) is a region-based belief propagation algorithm which can get good convergence in Markov random fields. However, the computation time is too heavy to use in practical engineering applications. This paper proposes a method to accelerate the efficiency of GBP. A caching technique and chessboard passing strategy are used to speed up algorithm. Then, the direction set method which is used to reduce the complexity of computing clique messages from quadric to cubic. With such a strategy the processing speed can be greatly increased. Besides, it is the first attempt to apply GBP for solving the stereomatching problem. Experiments show that the proposed algorithm can speed up by 15+ times for typical stereo matching problem and infer a more plausible result.

1. Introduction

Many engineering problems related to computer vision, statistical physics, signal processing, and artificial intelligence can be formulated as an inference problem in probabilistic graphical models such as Bayesian networks or Markov Random Fields (MRF).The goal is to find the maximum a posteriori (MAP) configuration [1]. However, it is an NP hard problem to get the exhaustive solution, and thus we may get the approximate inference by graph cuts or message passing algorithm and so on. The most popular variant of message passing algorithm is Belief Propagation (BP). Recently, because of its flexibility and efficiency, BP and its variants are boomed especially in image restoration, optical flow, and stereo.

BP is an optimization tool which is firstly proposed by Pearl for singl Bayesian network [2] and extended to loopy graphs such as MRF in last decade. The virtue of it is that we can use it to compute marginal probabilities for graphical models, at least approximately, in a time that grows only linearly with the number of nodes in the system [3]. In BP algorithm, each variance starts with the same initial message and iteratively updates all the messages passing from its neighbor variances, and calculates messages for its every neighbor, then passes new messages back until converged. In factor graph or Bayesian network, BP can be used to perform exact inference for every variance. However, when it refers to highly connected graphs with massive conflicting interactions such as the MRF of stereo matching, the convergence problem becomes a tricky issue anyway. The precision of configuration will vary with the cyclicity of graph. To impulse and accelerate variances to converge, many works have done and also achieved some plausible progress. However, being enslaved to the absence of the convergence property of BP in graph models with loops, the development of BP seems slow. On the other hand, generalized belief propagation (GBP) proposed by Yedidia et al. [4] with its better convergence property against BP has received more attentions recently.

GBP can be considered as a variant of standard BP. It is also an instance of cluster variation methods. In the literature, BP can only converge to a stationary point of Bethe free energy, while GBP can converge to a more accurate stationary point of Kikuchi free energy [5]. Therefore, it leads GBP to take the advantage of better convergence than BP. Despite of the characteristic of good convergence, as a toll, it is really computationally expensive. When considering the temporal complexity with the optimal version in [6], BP as an approximate method reaches linear complexity, while the canonical GBP takes quartic complexity. This has limited its applicability in some small-scale problems, for example, image denoising and image restoration [1], and obviously prevents GBP away from some more complicated problems, for example, stereo matching even in a small size image pair.

To accelerate GBP algorithm, some optimization methods have proposed recently. Petersen et al. proposed two strategies of fast GBP for map estimation on 2D and 3D grid-like MRF [7]. One is to use a caching method that significantly reduces the number of multiplications during GBP inference. The other is to introduce a speed-up for computing the map estimate of GBP cluster messages by presorting its factors and limiting the number of possible combinations. Pawan and Torr also provides a method of fast memory efficient GBP [8].

However, for solving the stereo matching problem, it is still a fraction of the need. This paper proposes a new method named direction set method which is introduced into the pairwise message computation stage to make GBP more efficiently. With the proposed method, the temporal complexity can be decreased from quartic to cubic. Furthermore, this is the first attempt to apply GBP for solving the stereo matching problem. For completeness, we will briefly introduce the MRF and BP in the next section.

The remainder of this paper is organized as follows. Section 2 gives a sketch of basic theory. Section 3 provides the definition of the GBP with min-sum messaging and its caching structure. Section 4 represents a detailed description of the proposed strategies for GBP optimization. Section 5 gives the experiments and results of stereo matching and Section 6 summarizes the findings.

2. The Basic Principle

Human understand a scene mainly using the spatial and visual information which is assimilated through our eyes. These information such as region or object, mainly based on the contextual constraints, are extremely necessary for interpretation. The context-dependent object such as image can be modeled in a convenient and consistent way through MRF theory. It is achieved through characterizing mutual influences among such entities using conditional MRF distributions [9].

MRF is firstly introduced into computer vision in [10] and have dominated the fields of image processing and computer vision since the early 1980s. As the most popular type of prior models for gridded image-like data, which include not only regular natural images but also two-dimensional fields such as motion or depth maps, as well as binary fields such as and image restoration and segmentations, MRF provides a mathematical foundation for the characterization of contextual constraints and the derivation of the probability distribution of interacting features [9].

Without loss of generality, let 𝑀 be a set of indexes 𝑀={1,…,π‘š}, 𝑃={𝑝1,…,π‘π‘š} be a set of observed nodes, 𝐿={𝑙1,…,π‘™π‘›βˆ£π‘›β‰€π‘š} be a set of labels. Here we set all the labels are discrete.

𝑁={π‘π‘–βˆ£π‘–βˆˆπ‘€} represents the neighbor system to indicate the interrelationship between nodes or the order of MRF. Recently, a learning high-order MRF model named Fields of Excepts has been proposed which could get more sufficient priors. However, we use one-order MRF (also called pairwise MRF) for simplification. Figure 1 shows a sample of MRF used in this paper.

Many computer vision problems can be formulated as a labeling problem in which the solution is assigning a label from the set 𝐿 to each of the nodes in 𝑃. In the literature, a mapping function πΉβˆΆπ‘ƒβ†’πΏ which 𝐹={𝑓1,…,π‘“π‘š} can be represented in this processing. It has been proved that the joint probability Pr(π‘“π‘π‘š) of an MRF is a Gibbs distribution. Besides, according to the Hammersley-Cliffod theorem, the posterior probability Pr(π‘“π‘π‘š) only depends on its neighborhood π‘π‘π‘š, which means that𝑓Prπ‘π‘šβˆ£π‘ƒβˆ’π‘π‘šξ€Έξ€·π‘“=Prπ‘π‘šβˆ£π‘π‘π‘šξ€Έ,βˆ€π‘π‘šβˆˆπ‘ƒ.(2.1)

According to Bayes’ Rule, the posterior distribution for a given set 𝑦 and their evidence 𝑝(π‘¦βˆ£π‘₯), combined with a prior 𝑝(π‘₯) over the unknowns π‘₯, is given by 𝑝(π‘₯βˆ£π‘¦)=𝑝(π‘¦βˆ£π‘₯)𝑝(π‘₯)𝑝(𝑦).(2.2)

If we take the negative logarithm of both sides, we get βˆ’log𝑝(π‘₯βˆ£π‘¦)=βˆ’log𝑝(π‘¦βˆ£π‘₯)βˆ’log𝑝(π‘₯)+log𝑝(𝑦).(2.3) Here log𝑝(𝑦) is a constant which is used to make the 𝑝(π‘₯βˆ£π‘¦) integrate to 1. To find the MAP solution, we simply minimize (2.3), which can also be treated as an energy function:𝐸(π‘₯,𝑦)=𝐸𝑑(π‘₯,𝑦)+𝐸𝑠(π‘₯),(2.4) where the 𝐸𝑑(π‘₯,𝑦) is the data penalty and 𝐸𝑠(π‘₯) is smoothness penalty.

Recall (2.1) and rewrite(2.4)𝐸𝑓𝑝,𝑁𝑝=𝐸𝑑𝑓𝑝,𝑁𝑝+𝐸𝑠𝑓𝑝,(2.5) where 𝐸𝑠𝑓𝑝=πΈπ‘ π‘žβˆˆπ‘π‘β§΅π‘žξ€·π‘“π‘,π‘“π‘žξ€Έ.(2.6)

Therefore, the energy function is 𝐸(𝐹)=π‘βˆˆπ‘ƒπΈπ‘ π‘žβˆˆπ‘π‘β§΅π‘žξ€·π‘“π‘,π‘“π‘žξ€Έ+ξ“π‘βˆˆπ‘ƒπΈπ‘‘ξ€·π‘“π‘,𝑁𝑝.(2.7)

In the large label space, because of massive variances and various uncertainties, it becomes a nontrivial task [11] to make a global inference using local information. For this reason, many approximated inference algorithms are proposed to find the MAP estimation against the exact answer. In this case, the inference problem usually can be mapped into an energy minimization problem which has a profound mathematic foundation in the literature. In the last few years, two approximate algorithms have been developed in MRF approximated inference problem with their efficiency and comparatively high accuracy, for example, graph cut (GC) [12] and BP [6, 13].

In standard belief propagation with pairwise MRF, a variable π‘šπ‘–π‘—(π‘₯𝑗) can be vividly treated as a β€œmessage” from a node 𝑖 to its neighbor node 𝑗 which contain the information about what state node 𝑗 should be in. The message is a vector of same dimensionality as the number of possible label. The value of each dimension manifested that how this label might be corresponding to the node.

Recall the function of (2.1) and write Pr(π‘“π‘π‘š) as Pr(𝑝𝑖), then𝑝Pr𝑖=Ξ (𝑖,𝑗)βˆˆπ‘πœ™π‘–π‘—ξ€·π‘π‘–,π‘π‘—ξ€Έξ‘π‘–πœ™π‘–ξ€·π‘₯𝑖,𝑦𝑖,(2.8) where πœ™π‘–π‘—(𝑝𝑖,𝑝𝑗) is the pairwise interaction potential and πœ™π‘–(π‘₯𝑖,𝑦𝑖) is the β€œlocal evidence”.

Usually, the message must be nonnegative. A high value of message show that the node β€œbelieves” the posterior probability of 𝑋𝑗 is very high. The message update rule isπ‘šπ‘–π‘—ξ€·π‘₯𝑖𝑑=π‘₯π‘–πœ™π‘–ξ€·π‘₯π‘–ξ€Έπœ‘π‘–π‘—ξ€·π‘₯𝑖,π‘₯π‘—ξ€Έξ‘π‘˜βˆˆπ‘π‘–β§΅π‘—π‘šπ‘˜π‘–ξ€·π‘₯π‘–ξ€Έπ‘‘βˆ’1,(2.9) where 𝑑 represents the number of iterations 𝑇 as shown in Figure 2.

The belief is the product of β€œlocal evidence” of the node and all messages send to this node𝑏𝑖π‘₯𝑖=π‘˜πœ™π‘–ξ€·π‘₯π‘–ξ€Έξ“π‘—βˆˆπ‘π‘–π‘šπ‘–π‘—ξ€·π‘₯𝑖.(2.10)

The standard BP we have described above is also named sum-product BP. There is another variant BP which is more simple and easy to use: max-product (or max-sum in log domain). In max-product BP, (2.9) is rewritten π‘šπ‘–π‘—ξ€·π‘₯𝑖𝑑=maxπ‘₯π‘–ξƒ©πœ™π‘–ξ€·π‘₯𝑖+πœ‘π‘–π‘—ξ€·π‘₯𝑖,π‘₯𝑗+βˆ‘π‘˜βˆˆπ‘π‘–β§΅π‘—π‘šπ‘˜π‘–ξ€·π‘₯π‘–ξ€Έπ‘‘βˆ’1ξƒͺ,𝑏𝑖π‘₯π‘–ξ€ΈβŽ›βŽœβŽœβŽξ“=π‘˜π‘—βˆˆπ‘π‘–π‘šπ‘–π‘—ξ€·π‘₯𝑖+πœ™π‘–ξ€·π‘₯π‘–ξ€ΈβŽžβŽŸβŽŸβŽ .(2.11) It indicates that which states should the node most likely be in. Though BP is an efficient implicit inference algorithm for MRF with loops. It can only converge to the stationary points of the Bethe approximation of the free energy where the node number of regions is at most two. As has discussed above, GBP can get a more accurate inference than BP. In next section, we extend BP to GBP.

3. Message Passing

The GBP which was firstly proposed by Yedidia et al. can be considered as a region based BP method [4]. Specifically, the basic intuitive idea behind GBP is to compute more useful message between regions other than nodes. As a Kikuchi free energy approximation method, GBP in general allows an arbitrary number of nodes to gather as a clique and involves the clique information to the whole passing process, which yields better approximation to the posterior probability, while BP only do node-to-node message passing around.

As another source of information, that is, the clique information, involved in the passing process, the search capability for the minimum of an energy function is extensively upgraded. The update rules of the canonical GBP are defined as below: π‘šπ‘Ÿπ‘ βˆ‘βŸ΅π‘˜π‘₯π‘Ÿβ§΅π‘ πœ‘π‘Ÿβ§΅π‘ ξ€·π‘₯π‘Ÿβ§΅π‘ ξ€Έβˆπ‘šπ‘Ÿπ‘†β€²β€²β€²β€²βˆˆπ‘€(π‘Ÿ)⧡𝑀(𝑠)π‘šπ‘Ÿβ€²β€²π‘ β€²β€²βˆπ‘šπ‘Ÿβ€²π‘ β€²βˆˆπ‘€(π‘Ÿ,𝑠)π‘šπ‘Ÿβ€²π‘ β€²π‘π‘ŸβŸ΅π‘˜πœ‘π‘Ÿξ€·π‘₯π‘Ÿξ€Έξ‘π‘šπ‘Ÿβ€²π‘ β€²βˆˆπ‘€(π‘Ÿ)π‘šπ‘Ÿβ€²π‘ β€²,(3.1) where π‘Ÿ is the regions and 𝑠 is their correspondent subregion, π‘šπ‘Ÿπ‘  is the message sending from region π‘Ÿ to its subregion 𝑠, πœ‘(π‘₯) is the local β€œevidence” of node π‘₯, 𝑀(π‘Ÿ) is the set of messages sending from out side of region π‘Ÿ to some nodes inside region π‘Ÿ, 𝑀(𝑠) is defined similarly, 𝑀(π‘Ÿ,𝑠) is the set of messages sending from some nodes in region π‘Ÿ but not in region 𝑠 to some nodes in region 𝑠, and π‘π‘Ÿ is the belief of region π‘Ÿ.

The definitions of the regions, for example, π‘Ÿ and 𝑠, in (3.1) directly determine the performance of GBP. It is very hard to choose the reasonable size of region. Though the basic clusters should encompass as many cycles as possible, the complexity will grows exponentially with the number of size. To some degree, they are somewhat contradictory, the more lager size is, the less efficiency the algorithm is. In practice, it is infeasible to set the cluster sizes larger than four.

In this paper, we concern the implementation instance introduced in [14]. This instance of GBP is comprised of two types of regions definition, that is, single node region and double node region, and the correspondent messages are named edge message and cluster message, respectively. The message update rules are defined in (3.2) and (3.3). The sketch map of the message passing process can be seen in Figure 3,π‘šπ‘ β†’π‘’ξ€·π‘₯𝑒=maxπ‘₯π‘ ξ€·πœ™π‘ πœ‘π‘ π‘’π‘šπ‘Žβ†’π‘ π‘šπ‘β†’π‘ π‘šπ‘β†’π‘ π‘šπ‘π‘‘β†’π‘ π‘’π‘šπ‘π‘’β†’π‘ π‘’ξ€Έ,π‘š(3.2)𝑠𝑑→𝑒𝑣π‘₯𝑒,π‘₯𝑣=maxπ‘₯𝑠,π‘₯π‘‘ξ€·πœ™π‘ πœ™π‘‘πœ‘π‘ π‘‘πœ‘π‘ π‘’πœ‘π‘‘π‘£π‘šπ‘Žβ†’π‘ π‘šπ‘β†’π‘ π‘šπ‘β†’π‘‘π‘šπ‘‘β†’π‘‘π‘šπ‘Žπ‘β†’π‘ π‘‘π‘šπ‘π‘’β†’π‘ π‘’π‘šπ‘‘π‘“β†’π‘‘π‘£ξ€Έπ‘šπ‘ β†’π‘’π‘šπ‘‘β†’π‘£.(3.3) Equation (3.2) describes the edge message sending from a specific node 𝑠 to node 𝑒. Equation (3.3) describes the cluster message sending from pair-node 𝑠 and 𝑑 to pair-node 𝑒 and 𝑣.

4. Efficiency Improvement

In order to improve the efficiency of GBP, the direction set method is proposed which can reduce the computation complexity of cluster message. Considering (11), when π‘₯𝑒 and π‘₯𝑣 are given, the temporal complexity to compute a specific item in the cluster message is 𝑂(𝑛2). It is almost contributed from the first term in the equation, which can be regarded as finding the minimum value in a grid-like dataset which size is 𝑠 and 𝑑, respectively. Usually all the elements in the lattice are traversed to find a minimization. And the temporal complexity is 𝑂(𝑛2). Petersen et al. proposed a method to reduce the search space [7, 15], but it relies very much on the traverse order. The method which we suggested in this paper is very straightforward. The temporal complexity becomes 𝑂(𝑛) when the direction set method is applied. Thus the total complexity for computing the cluster message becomes 𝑂(𝑛3).

The direction set method adopted here is also called Powell’s method [16]. It is a classical numerical algorithm in function minimization or maximization. It decomposes an 𝑁-dimensional (𝑁-D) search problem into several one-dimensional (1D) search processes. Take an example in 2D lattice where a node 𝑃 with a random initial position, and two orthogonal directions are given. First, 𝑃 moves to the extreme value position which is found by searching along the first direction among the two initial directions. Second, 𝑃 moves to another extreme value position by searching along the second direction given by initialization. Third, the first direction is substituted by the second, and the second direction is set to be a new direction which is determined by the initial position and the final position after two rounds of searching. Meanwhile, the final position is set to be the new initial position. The three steps are performed in an iterative way until 𝑃 no longer moves. In another word, by searching along the two directions, there is no other position where its value is less than 𝑃. Thus, the final position is where 𝑃 stops.

The general idea of the direction set or Powell’s method has a challenging problem that the two directions will β€œfold up on each other” in some cases. Once this happens, the search capability in this iteration will be weakened, and the process has a high risk of getting a subspace minimization instead of full N-D case. On the other hand, in practice it is hard for computer to search along an arbitrary direction where it needs more computation to determine which nodes are occupied. This paper adopts the method suggested in [16]. We set the two directions to be static and parallel along each axis. This setting not only keeps the orthogonal condition from the beginning to the end, but also makes the implementation easier because every search process is along one of the axes.

Although there is no special requirement for the start position, it is more useful to place the initial position close to the extreme value position. When π‘₯𝑒 and π‘₯𝑣 are given, the initial position at the s-t lattice is a tricky issue. To place it near the minimum value position, we assume that the combination of the independent minimum value positions of 𝑠 and 𝑑 is close to the actual minimum value position.

Through this optimization, the number of accessed positions is decreased from 𝑛2 to 2π‘˜π‘› where π‘˜ is the number of iterations which in our practice is about 2 to 3 in average, 𝑛 is the search range, for example, the disparity range for stereo matching. Since the comparison operation takes main computation time, the general complexity becomes 2π‘˜π‘›3 while the complexity of brute force search is 𝑛4. The efficiency rate is 𝑛/(2π‘˜). When 𝑛 is larger, the rate of computation time is higher.

5. Experiments

Stereo matching has been one of the most challenging and fundamental problems in computer vision. A comprehensive research has been done in the last decade [17–22]. A latest evaluation of these various methods can be found in [23]. In the last few years, as is showed in [24], the global methods based on MRF have reached the top-performing.

In this section, stereo matching is formulated as a MRF inference problem. To achieve the MAP estimation, which can be yielded as an energy minimization problem, let 𝑃 be the set of the image pixels in image pair and 𝐿 be the disparity. The initial data cost calculated by the truncated linear transform which is robust to noise or outlier is defined as π·ξ€·π‘“π‘ξ€ΈβŽ›βŽœβŽœβŽξƒŽ=πœ†β‹…minξ“π‘βˆˆ{𝐿,π‘Ž,𝑏}ξ€·πˆπΏπ‘(𝑝)βˆ’πˆπ‘…π‘ξ€·π‘βˆ’π‘“π‘ξ€Έξ€Έ2⎞⎟⎟⎠,𝑇,(5.1) where 𝛾 is the cost weight which determines the portion of energy that data cost possesses in the whole energy, 𝑇 represents the truncating value. Both of them are set experimentally. 𝐼𝐿𝑐(𝑝) represents 𝑝’s intensity in the left image of channel 𝑐. 𝐼𝑅𝑐(𝑝) is defined similarly. The Birchfield and Tomasi’s pixel dissimilarity is used to improve the robustness against the image sampling noise. It is noticeable that we calculate the data cost in the CIELAB (the πΏβˆ—π‘Žβˆ—π‘βˆ— standard of Commission Internationale de L’Eclairage) color space, and the Euclidean distance is used as the measure. Practical experiments show that it can improve the final results at some degree.

The smooth cost which expresses the compatibility between neighboring variables embedded in the truncated linear model is defined as𝑉𝑓𝑝,π‘“π‘žξ€Έξ€·||𝑓=minπ‘βˆ’π‘“π‘ž||ξ€Έ,𝐾,(5.2) where 𝐾 is the truncating value. The smooth cost based on the truncated linear model is also referred to as discontinuity preserving cost, since it can prevent the edges of objects from over smoothing.

The corresponding energy function used here is the most conspicuous one which is defined as 𝐸(𝑓)=π‘βˆˆππ·ξ€·π‘“π‘ξ€Έ+(𝑝,π‘ž)βˆˆππ‘‰ξ€·π‘“π‘,π‘“π‘žξ€Έ,(5.3) where 𝐍 are the edges in the four-connected neighborhood set.

The energy function defined in (5.3) can be considered as a description of the scene. The objective is to find a solution which can minimize (5.3), which means the correct depth information in the scene. Generally, a rather complex energy function can get the solution more correct. However, to simplify the presentation and to be consistent and comparable with other methods, the dualistic energy function as (5.3) is used in this paper.

The proposed method is evaluated on MiddleBury test. We compared our results with efficient BP [6] and canonical GBP to show the improved efficiency as well as the accuracy of the proposed method. The same set of certain typical parameters were used, where specifically, 𝑇=30.0 and πœ†=0.87 in the data cost term, 𝐾 = 10.0 in the smooth cost term, 𝑑 = 16. All experiments were tested on a personal computer with 1.6 GHz CPU and 2 G DRAM.

Apparently, efficient BP is much faster than others. However, it is less accurate. The ultimate purpose of the proposed method is to improve the efficiency of canonical GBP while keeping it in a good accuracy. As shown in Figure 4, the execution time which combines the two strategies can be extensively reduced, while the convergence energy rises a little because direction set may cause loss of accuracy. The canonical GBP with caching and direction set can achieve about 15+ times of the speed rate. The experiments were tested with the image β€œTsukuba” (384 × 288 size).

The error of result is calculated with the ground truth, respectively. From the evaluated accuracy listed in Table 1, the accuracy of proposed method is obviously better than that of canonical GBP. Comparing it with the accuracy of efficient BP, the proposed method yields a similar level. On the other hand, through the comparison between the proposed method and efficient BP, it is noticeable that efficient BP tends to get a frontoparallel result which makes the surface oversmooth and results in a layered effect. In the contrary, the proposed method does not have the drawback of layered effects like that caused by efficient BP, but the 3D map becomes blurred at the boundaries and some noises cannot be eliminated. In fact, although a layered result can reach a lower energy, it cannot always be a better description of the real scenes (Figure 5).

6. Conclusion

This paper studied the challenging issues in both physics and computer vision, that is, the efficient optimization for GBP and stereocorrespondence for 3D vision. A min-sum scheme is invented for the message computing process in GBP, and this new method is applied to solve the stereo matching problem. Direction set is proposed for improving the efficiency. For a typical image pair, it can speed up the matching process to about 15+ times. Besides this improved speed in each single thread, with a parallel computing architecture, it can further catch up or take over most contemporary global algorithms due to its message-based passing process. Furthermore, with the proposed method we can get more plausible results in visual favorite because its better convergence can outperform most of other global algorithms. The practical experiments also prove these conclusions beyond both efficient BP and canonical GBP.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC-60870002, 60802087, 60873264, 61070214), the 973 Plan [2011CB302800], NCET, and the Science and Technology Department of Zhejiang Province (2009C21008, 2010R10006, 2010C33095, Y1090592).