EURASIP Journal on Advances in Signal Processing
Volume 2009 (2009), Article ID 647262, 21 pages
doi:10.1155/2009/647262
Research Article

Joint Motion Estimation and Layer Segmentation in Transparent Image Sequences—Application to Noise Reduction in X-Ray Image Sequences

1INRIA Centre Rennes-Bretagne-Atlantique, Campus universitaire de Beaulieu, 35042 Rennes Cedex, France
2General Electric Healthcare, 283 rue de la Miniere, 78530 Buc, France

Received 27 November 2008; Accepted 6 April 2009

Academic Editor: Lisimachos P. Kondi

Copyright © 2009 Vincent Auvray et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper is concerned with the estimation of the motions and the segmentation of the spatial supports of the different layers involved in transparent X-ray image sequences. Classical motion estimation methods fail on sequences involving transparent effects since they do not explicitly model this phenomenon. We propose a method that comprises three main steps: initial block-matching for two-layer transparent motion estimation, motion clustering with 3D Hough transform, and joint transparent layer segmentation and parametric motion estimation. It is validated on synthetic and real clinical X-ray image sequences. Secondly, we derive an original transparent motion compensation method compatible with any spatiotemporal filtering technique. A direct transparent motion compensation method is proposed. To overcome its limitations, a novel hybrid filter is introduced which locally selects which type of motion compensation is to be carried out for optimal denoising. Convincing experiments on synthetic and real clinical images are also reported.

1. Introduction

Most image sequence processing and analysis tasks require an accurate computation of image motion. However, classical motion estimation methods fail in the case of image sequences involving transparent layers. Situations of transparency arise in videos for instance when an object is reflected in a surface, or when an object lies behind a translucent one. Transparency may also be involved in special effects in movies such as the representation of phantoms as transparent beings. Finally, let us mention progressive transition effects such as dissolve , often used in video editing. Some of these situations are illustrated on Figure 1.

fig1
Figure 1: Examples of transparency configuration in videos. (a) Different reflections are shown, (b) three examples of phantom effects, (c) and one example of a dissolve effect for a gradual shot change.

In this paper, we are particularly concerned with the transparency phenomenon occuring in X-ray image sequences (even if the developed techniques can also be successfully applied to video sequences [1]). Since the radiation is successively attenuated by different organs, the resulting image is ruled by a multiplicative transparency law (i.e., turned into an additive one by a l o g operator). (The physics of the X-Ray resulting in additively transparent images are detailed in Appendix A.). For instance, the heart can be seen over the spine, the ribs and the lungs on Figure 2.

fig2
Figure 2: (a) One image of X-ray exam yielding a situation of bidistributed transparency. (b) The segmentation of the image in its different regions: three two-layer regions (the orange region corresponds to the “heart and lungs”, the light blue one to “heart and diaphragm” and the pink one to “heart and spine”), two single-layer regions (the lungs in green and spine in yellow), and a small three-layer region (“heart and diaphragm and spine” in grey). (c) Its spatial segmentation into four transparent layers (i.e., their spatial supports, spine in orange, diaphragm in blue, heart in red and lungs in white). By definition, the spatial supports of the transparent layers overlap. The colors have been independently chosen in these two maps.

When additive transparency is involved, the gray values of the different objects superimpose and the brightness constancy of points along their image trajectories, exploited for motion estimation [2], is no longer valid. Moreover, two different motion vectors may exist at the same spatial position. Therefore, motion estimation methods that explicitly tackle the transparency issue have to be developed.

In this paper, we deal both with transparent motion estimation and spatial segmentation of the transparent layers in the images. We mean that we aim at recovering both the motion and the spatial support of each transparent layer. Transparent layer segmentation is an original topic to be distinguished from the transparent layer separation task: a spatial segmentation aims at directly delimiting the spatial support of the different transparent objects based on their motions, whereas a separation framework [35] leads to recover the grayvalue images of the different transparent objects. The latter can be handled so far in restricted situations only (e.g., specific motion must be assumed for at least one layer, the image globally includes only two layers), while we consider any type of motions and any number of layers. We aim at defining a general and robust method since we will apply it to noisy and low-contrasted X-ray image sequences.

We do not assume that the number of transparent layers in the image is known or limited. In contrast, we will determine it. We only assume a local two-layer configuration, that is, the image can be divided into regions where at most two transparent layers are simultaneously present. We will call such a situation bidistributed transparency. This is not a strong assumption since this is the most commonly encountered configuration in real image sequences.

Finally, we derive from the proposed transparent motion estimation method a general transparent motion compensation method compatible with any spatio-temporal filtering technique. In particular, we propose a novel method for the temporal filtering of X-ray image sequences that avoids the appearance of severe artifacts (such as blurring), while taking advantage of the large temporal redundancy involved by the high acquisition frame rate.

The remainder of the paper is organized as follows. Section 2 includes a state-of-the art on transparent motion estimation and introduces the fundamental transparent motion constraint. In Section 3, we present and discuss the different assumptions involved in the motion estimation problem statement. Section 4 details the MRF-based framework that we propose, while Section 5 deals with the practical development of our joint transparent motion estimation and spatial layer segmentation method. In Section 6, we present the proposed filtering method, involving a novel transparent motion compensation procedure. We report in Section 7 experimental results for transparent motion estimation on realistic test images as well as on numerous real clinical image sequences. Section 8 presents denoising results on realistic test images and real clinical image sequences. Finally, Section 9 contains concluding remarks and possible extensions.

2. Related Work on Transparent Motion Estimation

A first category of transparent motion estimation method attempts to directly extend usual motion estimation strategies to the transparency case [6, 7]. Approaches that are particularly robust to deviations from the brightness assumption are adopted, but the weak point is that transparency is not explicitly taken into account. The method [8] focuses on the problem of transparent motion estimation in angiograms to improve stenosis quantification accuracy. The motion fields are iteratively estimated by maximizing a phase correlation metric after removing the (estimated) contribution of the previously processed layer. However, it leads to interesting results only when one layer dominates the other one (which is not necessarily the case in interventional X-ray images).

Among the methods which explicitly tackle the transparency issue in the motion estimation process, we can distinguish two main classes of approaches. The first one works in the frequency domain [911], but it must be assumed that the motions are constant over a large time interval (dozen of frames). These methods are therefore unapplicable to image sequences involving time-varying movements, such as cardiac motions in X-ray image sequences.

The second class of methods formulates the problem in the spatial image domain using the fundamental Transparent Motion Constraint (TMC) introduced by Shizawa and Mase [12], or its discrete version developed in [13]. The latter states that, if one considers the image sequence 𝐼 as the addition of two layers 𝐼 1 and 𝐼 2 ( 𝐼 = 𝐼 1 + 𝐼 2 ), respectively, moving with velocity fields 𝐰 1 = ( 𝑢 1 , 𝑣 1 ) and 𝐰 2 = ( 𝑢 2 , 𝑣 2 ) , the following holds: 𝑟 𝑥 , 𝑦 , 𝐰 1 , 𝐰 2 = 𝐼 𝑥 + 𝑢 1 + 𝑢 2 , 𝑦 + 𝑣 1 + 𝑣 2 , 𝑡 1 + 𝐼 ( 𝑥 , 𝑦 , 𝑡 + 1 ) 𝐼 𝑥 + 𝑢 1 , 𝑦 + 𝑣 1 , 𝑡 𝐼 𝑥 + 𝑢 2 , 𝑦 + 𝑣 2 , 𝑡 = 0 , ( 1 ) where ( 𝑥 , 𝑦 ) are the coordinates of point 𝐩 in the image. For sake of clarity, we do not make explicit that 𝐰 1 and 𝐰 2 may depend on the image position. Expression (1) implicitly assumes that 𝐰 1 and 𝐰 2 are constant over time interval [ 𝑡 1 , 𝑡 + 1 ] . Even if the hypothesis of constant velocity can be problematic at a few specific time instants of the heart cycle, (1) offers us with a reasonable and effective Transparent Motion Constraint (TMC) since the temporal velocity variations are usually smooth. This constraint can be extended to 𝑛 layers by considering 𝑛 + 1 images while extending the motion invariance assumption accordingly [13].

To compute the velocity fields using the TMC given by (1), a global function 𝐽 is usually minimized: 𝐽 𝐰 1 , 𝐰 2 = ( 𝑥 , 𝑦 ) 𝑟 𝑥 , 𝑦 , 𝐰 1 ( 𝑥 , 𝑦 ) , 𝐰 2 ( 𝑥 , 𝑦 ) 2 , ( 2 ) where 𝑟 ( 𝑥 , 𝑦 , 𝐰 1 ( 𝑥 , 𝑦 ) , 𝐰 2 ( 𝑥 , 𝑦 ) ) is given by (1) and denotes the image grid.

Several methods have been proposed to minimize expression (2), making different assumptions on the motions. The more flexible the hypothesis, the more accurate the estimation, but also the more complex the algorithm. A compromise must be reached between measurement accuracy on one hand and robustness to noise, computational load and sensitivity to parameter tuning on the other hand.

Dense velocity fields are computed in [14] by adding a regularization term to (2), and in [15] by resorting to a Markovian formalism. It enables to estimate nontranslational motions at the cost of higher sensitivity to noise and of high algorithm complexity. In contrast, stronger assumptions on the velocity fields are introduced in [16, 17] by considering that 𝐰 1 and 𝐰 2 are constant on blocks of the image, which allows fast but less accurate motion estimation. In [13], the velocity fields are decomposed on a B-spline basis, so that this method can account for complex motions, while remaining relatively tractable. However, the structure of the basis has to be carefully adapted to particular situations and the computational load becomes high if fine measurement accuracy is needed.

3. Transparent Motion Estimation Problem Statement

We consider the general problem of motion estimation in bidistributed transparency . It refers to transparent configurations including any number of layers globally, but at most two locally. This new concept, which suffices to handle any transparent image sequence in practice, is discussed in Section 3.1.

To handle this problem, we resort to a joint segmentation and motion estimation framework. Because of transparency, we need to introduce a specific segmentation mechanism that allows distinct regions to superimpose, and to derive an original transparent joint segmentation and motion estimation framework.

Finally, to allow for a reasonably fast and robust method (able to handle X-Ray images), we consider transparencies involving parametric motion models as explained in Section 3.2.

3.1. Bi-Distributed Transparency

We do not impose any limitation on the number of transparent layers globally involved in the image. Nevertheless, we assume that the images contain at most two layers at every spatial position 𝐩 , which is acceptable since three layers rarely superimpose in real transparent image sequences. We will refer to this configuration as the bidistributed transparency.

Even in the full transparency case encountered in X-ray exams, where acquired images result from cumulative absorption by X-ray tissues, the image can be nearly always divided into regions including at most two moving transparent layers, as illustrated on Figure 2. The only region involving three layers in this example is insignificant since the three corresponding organs are homogeneous in this area.

Unlike existing methods, we aim at explicitly extracting the segmentation of the image in its transparent layers, which is an interesting and exploitable output per se and is also required for the motion-estimation stage based on the two-layer TMC.

3.2. Transparent Motion Constraint with Parametric Models

We decide to represent the velocity field of each layer by a 2D polynomial model. Such a parametric motion model accounts for a large range of motions, while involving few parameters for each layer. We believe that affine motion models offer a proper compromise since they can describe a large category of motions (translation, rotation, divergence, shear), while keeping the model simple enough to handle the transparency issue in a fast and tractable way. Our method could consider higher-order polynomial models as well, such as quadratic ones, if needed. Let us point out that in case of a more complex motion, the method is able to over-segment the corresponding layer in regions having a motion compatible with the considered type of parametric model. Complex transparent motions can still be accurately estimated at the cost of oversegmentation.

The velocity vector at point ( 𝑥 , 𝑦 ) for layer 𝑘 is now defined by 𝐰 𝜽 𝐤 ( 𝑥 , 𝑦 ) = ( 𝑢 𝜃 𝑘 ( 𝑥 , 𝑦 ) , 𝑣 𝜃 𝑘 ( 𝑥 , 𝑦 ) ) : 𝑢 𝜃 𝑘 ( 𝑥 , 𝑦 ) = 𝑎 1 , 𝑘 + 𝑎 2 , 𝑘 𝑥 + 𝑎 3 , 𝑘 𝑣 𝑦 , 𝜃 𝑘 ( 𝑥 , 𝑦 ) = 𝑎 4 , 𝑘 + 𝑎 5 , 𝑘 𝑥 + 𝑎 6 , 𝑘 𝑦 , ( 3 ) with 𝜃 𝑘 = [ 𝑎 1 , , 𝑎 6 ] 𝑇 the parameter vector. We then introduce a new version of the TMC (1) that we call the Parametric Transparent Motion Constraint (PTMC): 𝑟 𝑥 , 𝑦 , 𝐰 𝜽 1 , 𝐰 𝜽 2 = 𝐼 𝑥 + 𝑢 𝜃 1 + 𝑢 𝜃 2 , 𝑦 + 𝑣 𝜃 1 + 𝑣 𝜃 2 , 𝑡 1 + 𝐼 ( 𝑥 , 𝑦 , 𝑡 + 1 ) 𝐼 𝑥 + 𝑢 𝜃 1 , 𝑦 + 𝑣 𝜃 1 , 𝑡 𝐼 𝑥 + 𝑢 𝜃 2 , 𝑦 + 𝑣 𝜃 2 , 𝑡 = 0 ( 4 ) with 𝐰 𝜽 1 and 𝐰 𝜽 2 given in (3).

The next section introduces the MRF-based framework that concludes the problem statement.

4. MRF-Based Framework

4.1. Observations and Remarks

We have to segment the image into regions including at most two layers to estimate the motion models associated to the layers from the PTMC (4). Conversely, the image segmentation will rely on the estimation of the different transparent motions. Therefore, we have designed a joint segmentation and estimation framework based on a Markov Random Field (MRF) modeling. A joint approach is more reliable than a sequential one (as in [18]) since estimated motions can be improved using a proper segmentation and vice versa.

Joint motion estimation and segmentation frameworks have been developed for “classical” image sequences [1924], but have never been studied in the case of transparent images. In particular, we have to introduce a novel segmentation allowing regions to superimpose. Moreover, the bidistributed assumption implies to control the number of layers simultaneously present at a given spatial location.

The proposed method will result in an alternate minimization scheme between segmentation and estimation stages. To maintain a reasonable computational load, the segmentation is carried out at the level of blocks. Typically, the 1 0 2 4 × 1 0 2 4 images are divided in 3 2 × 3 2 blocks (for a total number of blocks 𝑆 = 1 0 2 4 ). We will see in Section 5.2 that this block structure will also be exploited in the initialization step. According to the targeted application, the block size could be fixed smaller in a second phase of the algorithm. The pixel-level could even be progressively reached, if needed.

The blocks are taken as the sites 𝑠 of the MRF model (Figure 3). We aim at labeling the blocks 𝑠 according to the pair of transparent layers they are belonging to. Let 𝑒 = { 𝑒 ( 𝑠 ) , 𝑠 = 1 , , 𝑆 } denote the label field with 𝑒 ( 𝑠 ) = ( 𝑒 1 ( 𝑠 ) , 𝑒 2 ( 𝑠 ) ) , where 𝑒 1 ( 𝑠 ) and 𝑒 2 ( 𝑠 ) designate the two layers present at site 𝑠 . 𝑒 1 ( 𝑠 ) and 𝑒 2 ( 𝑠 ) are given the same value when the site 𝑠 involves one layer only. The spatial supports of the transparent layers can be straightforwardly inferred from the labeling of the two-layer regions (i.e., from the elements of each pair that forms the label).

fig3
Figure 3: MRF framework. (a) A processed image divided in blocks (36 blocks for the sake of clarity of the figure). (b) The graph associated with the introduced Markov model. The sites are plotted in blue and their neighbouring relations are drawn in orange.

Let us assume that the image comprises a total of 𝐾 transparent layers, where 𝐾 is to be determined. To each layer is attached an affine motion model of parameters 𝜃 𝑘 (six parameters). Let Θ = { 𝜃 𝑘 , 𝑘 = 1 , , 𝐾 } .

4.2. Global Energy Functional

We need to estimate the segmentation defined by the labels 𝑒 ( 𝑠 ) , and the corresponding transparent motions defined by the parameters Θ . The estimates will minimize the following global energy functional: 𝐹 ( 𝑒 , Θ ) = 𝑠 𝑆 0 𝑥 0 2 0 0 𝑑 ( 𝑥 , 𝑦 ) 𝑠 𝜌 0 𝑥 0 2 0 0 𝑑 𝐶 𝑟 𝑥 , 𝑦 , 𝜃 𝑒 1 ( 𝑠 ) , 𝜃 𝑒 2 ( 𝑠 ) [ ] 𝜇 𝜂 𝑠 , 𝑒 ( 𝑠 ) + 𝜇 𝑠 , 𝑡 𝑒 0 𝑥 0 2 0 0 𝑑 1 𝛿 1 ( 𝑠 ) , 𝑒 1 𝑒 ( 𝑡 ) 1 𝛿 1 ( 𝑠 ) , 𝑒 2 + 𝑒 ( 𝑡 ) 1 𝛿 2 ( 𝑠 ) , 𝑒 1 𝑒 ( 𝑡 ) 1 𝛿 2 ( 𝑠 ) , 𝑒 2 . ( 𝑡 ) ( 5 ) The first term of (5) is the data-driven term based on the PTMC defined in (4). Instead of a quadratic constraint, we resort to a robust function 𝜌 𝐶 ( ) in order to discard outliers, that is, points where the PTMC is not valid [25]. We consider the Tukey function as robust estimator. It is defined by: 𝜌 𝐶 ( 𝑟 𝑟 ) = 6 6 𝐶 2 𝑟 4 2 + 𝐶 4 𝑟 2 2 𝐶 i f | 𝑟 | < 𝐶 , 6 6 o t h e r w i s e . ( 6 ) It depends on a scale parameter 𝐶 which defines the threshold above which the corresponding point will be considered as an outlier. To be able to handle any kind of images, we will determine 𝐶 on-line as explained in Section 5.3.

The additional functional 𝜂 ( , ) is introduced in (5) to detect single layer configurations. It is a binary function which is 1 when 𝑠 is likely to be a single layer site. It will be discussed in Section 4.3.

The last term of the global energy functional 𝐹 ( 𝑒 , Θ ) enforces the segmentation map to be reasonably smooth. We have to consider the four possible label transitions between two sites (involving two labels each). 𝛿 ( , ) is equal to 1 if the two considered elements are the same and equals 0 otherwise.

The 𝜇 parameter weights the relative influence of the two terms. In other words, a penalty 𝜇 is added when introducing between two sites a region border involving a change in one layer only, and a penalty 2 𝜇 when both layers are different. A transition between a mono-layer site 𝑠 1 and a bilayer site 𝑠 2 will also imply a penalty 𝜇 (as long as the layer present in 𝑠 1 is also present in 𝑠 2 ). 𝜇 is determined in a content-adaptive way, as explained in Section 5.3.

4.3. Detection of a Single Layer Configuration

Over single layer regions, (1) is satisfied provided one of the two estimated velocities (for instance 𝐰 𝜽 𝐞 1 ( 𝐬 ) ) is close to the real motion of this single layer whatever the value of the other velocity ( 𝐰 𝜽 𝐞 2 ( 𝐬 ) ). The minimization of (5) without the 𝜂 ( , ) term would therefore not allow to detect single layer regions because a “imaginary” second layer would be introduced over these sites. Thus, we propose an original criterion to detect these areas.

We define the residual value: 𝜈 ̂ 𝜃 𝑒 1 ( 𝑠 ) , 𝜃 𝑒 2 ( 𝑠 ) = , 𝑠 ( 𝑥 , 𝑦 ) 𝑠 ̂ 𝜃 0 𝑥 0 2 0 0 𝑑 𝑟 𝑥 , 𝑦 , 𝑒 1 ( 𝑠 ) , 𝜃 𝑒 2 ( 𝑠 ) 2 . ( 7 ) If it varies only slightly for different values of 𝜃 𝑒 2 ( 𝑠 ) (while keeping 𝜃 𝑒 1 ( 𝑠 ) constant and equal to its estimate ̂ 𝜃 𝑒 1 ( 𝑠 ) ), it is likely that the block 𝑠 contains one single layer only, corresponding to 𝑒 1 ( 𝑠 ) . 𝜂 ( , ) would be set to 1 in this case to favour the label ( 𝑒 1 ( 𝑠 ) , 𝑒 1 ( 𝑠 ) ) over this site (and to 0 in the other cases).

Formally, to detect a single layer corresponding to 𝜃 𝑒 1 ( 𝑠 ) , we compute the mean value ̂ 𝜃 𝜈 ( 𝑒 1 ( 𝑠 ) , 𝑠 ) of the residual 𝜈 ( 𝜃 𝑒 1 ( 𝑠 ) , , 𝑠 ) by applying 𝑛 motions (defined by 𝜃 𝑗 , 𝑗 = 1 , , 𝑛 , ) to the second layer. We want to decide if ̂ 𝜃 𝜈 ( 𝑒 1 ( 𝑠 ) , 𝑠 ) is significantly different from the minimal residual on this block, ̂ 𝜃 𝜈 ( 𝑒 1 ( 𝑠 ) , ̂ 𝜃 𝑒 2 ( 𝑠 ) , 𝑠 ) , where ( 𝑒 1 ( 𝑠 ) , 𝑒 2 ( 𝑠 ) ) are the current labels at site 𝑠 . This minimal residual is in practice coming from the iterative minimization of (5) presented in Section 5.1.

To meet this decision independently of the image texture, we first compute a representative value for the residual of the image, given by 𝜈 m e d = m e d 𝑠 𝑆 𝜈 ̂ 𝜃 𝑒 1 ( 𝑠 ) , ̂ 𝜃 𝑒 2 ( 𝑠 ) , 𝑠 , ( 8 ) and its median deviation Δ 𝜈 m e d = m e d 𝑠 𝑆 | | | 𝜈 ̂ 𝜃 𝑒 1 ( 𝑠 ) , ̂ 𝜃 𝑒 2 ( 𝑠 ) , 𝑠 𝜈 m e d | | | . ( 9 ) (This assumes that the motion models have been well estimated and the current labeling is correct on at least half the sites). Then, we set 𝜂 𝑠 , 𝑒 1 ( 𝑠 ) , 𝑒 2 | | | ( 𝑠 ) = 1 i f 𝜈 𝜃 𝑒 1 ( 𝑠 ) 𝜃 , 𝑠 𝜈 𝑒 1 ( 𝑠 ) , 𝜃 𝑒 2 ( 𝑠 ) | | | , 𝑠 < 𝛼 Δ 𝜈 m e d , 𝑒 1 ( 𝑠 ) = 𝑒 2 𝜂 ( 𝑠 ) , ( 1 0 ) 𝑠 , 𝑒 1 ( 𝑠 ) , 𝑒 2 ( 𝑠 ) = 0 o t h e r w i s e , ( 1 1 ) where 𝜂 ( , ) is the functional introduced in (5). This way, we favour the single layer label ( 𝑒 1 ( 𝑠 ) , 𝑒 1 ( 𝑠 ) ) at site 𝑠 when the condition (10) is satisfied. The same process is repeated to test for 𝜃 𝑒 2 ( 𝑠 ) as the motion parameters of a (possible) single layer. In practice, we fix 𝛼 = 2 .

5. Joint Parametric Motion Estimation and Segmentation of Transparent Layers

This section describes the minimization of the energy functional (5) along with its initialization. We also explain how the parameters are set on-line, and how the number of layers globally present is estimated. The overall joint transparent motion estimation and layer segmentation algorithm is summarized in Algorithm 1.

alg1
Algorithm 1: Joint transparent motion estimation and layer segmentation algorithm.

5.1. Minimization of the Energy Functional 𝐅

The energy functional 𝐹 defined in (5) is minimized iteratively. When the motion parameters are fixed, we use the Iterative Conditional Mode (ICM) technique to update the labels of the blocks: the sites are visited randomly, and for each site the label that minimizes the energy functional (5) is selected.

Once the labels are fixed, we have to minimize the first term of (5), which involves a robust estimation. It can be solved using an Iteratively Reweighted Least Square (IRLS) technique which leads to minimize the equivalent functional [26]: 𝐹 1 ( Θ ) = 𝑠 𝑆 ( 𝑥 , 𝑦 ) 𝑠 𝛼 ( 𝑥 , 𝑦 ) 𝑟 𝑥 , 𝑦 , 𝜃 𝑒 1 ( 𝑠 ) , 𝜃 𝑒 2 ( 𝑠 ) 2 0 𝑥 0 2 0 0 𝑑 0 𝑥 0 2 0 0 𝑑 , ( 1 2 ) where 𝛼 ( 𝑥 , 𝑦 ) denotes the weights. Their expression at the iteration 𝑗 of the minimization is given by: 𝛼 𝑗 𝜌 ( 𝑥 , 𝑦 ) = 𝐶 𝑟 ̂ 𝜃 𝑥 , 𝑦 , 𝑒 𝑗 1 1 ( 𝑠 ) , ̂ 𝜃 𝑒 𝑗 1 2 ( 𝑠 ) ̂ 𝜃 2 𝑟 𝑥 , 𝑦 , 𝑒 𝑗 1 1 ( 𝑠 ) , ̂ 𝜃 𝑒 𝑗 1 2 ( 𝑠 ) ( 1 3 ) with ̂ 𝜃 𝑗 1 the estimate of 𝜃 computed at iteration 𝑗 1 , and 𝜌 𝐶 the derivative of 𝜌 𝐶 .

Even if each PTMC involves two models only, their addition over the entire image allows us to simultaneously estimate the 𝐾 motion models globally present in the image by minimizing the functional 𝐹 1 ( Θ ) of (12) (which is defined in a space of dimension 6 𝐾 ). If the velocity magnitudes were small, we could consider a linearized version of expression (12) (i.e., by relying on a linearized version of the expression 𝑟 ). Since large motions can occur in practice, we introduce a multiresolution incremental scheme exploiting Gaussian pyramids of the three consecutive images. At its coarsest level 𝐿 , motions are small enough to resort to a linearized version of functional 𝐹 1 ( Θ ) (12). The minimization is then achieved using the conjugate gradient algorithm. Hence, first estimates of the motions parameters are provided, they are denoted ̂ 𝜃 𝐿 𝑘 , 𝑘 = 1 , , 𝐾 .

At the level 𝐿 1 , we initialize 𝜃 𝑖 𝐿 1 with ̃ 𝜃 𝑖 𝐿 1 , where ̃ 𝑎 𝐿 1 𝑖 , 𝑘 = 2 ̂ 𝑎 𝐿 𝑖 , 𝑘 ( 𝑖 = 1 , 4 ) and ̃ 𝑎 𝐿 1 𝑖 , 𝑙 = ̂ 𝑎 𝐿 𝑖 , 𝑙 ( 𝑙 = 2 , 3 , 5 , 6 ). We then write 𝜃 𝑘 𝐿 1 = ̃ 𝜃 𝑘 𝐿 1 + Δ 𝜃 𝑘 𝐿 1 , and we minimize 𝐹 1 ( Θ ) with respect to the Δ 𝜃 𝑘 𝐿 1 , 𝑘 = 1 , 𝐾 , once 𝑟 is linearized around the ̃ 𝜃 𝑘 𝐿 1 , using the IRLS technique. This Gauss-Newton method, iterated through the successive resolution levels until the finest one, allows us to simultaneously estimate the affine motion models of the 𝐾 transparent layers.

5.2. Initialization of the Overall Scheme

Such an alternate iterative minimization scheme converges if properly initialized. To initialize the motion estimation stage, we resort to a transparent block-matching technique that tests every possible pair of displacements in a given range [17]. More specifically, for each block 𝑠 , we compute 𝜁 𝐰 1 , 𝐰 2 = , 𝑠 ( 𝑥 , 𝑦 ) 𝑠 𝑟 𝑥 , 𝑦 , 𝐰 1 , 𝐰 2 2 0 𝑥 0 2 0 0 𝑑 ( 1 4 ) for a set of possible displacements 𝐰 1 × 𝐰 2 , where 𝑟 is given by (1). The pair of displacements ( 𝐰 1 , 𝐰 2 ) is the one which minimizes (14). This scheme is applied on a multiresolution representation of the images to reduce the computation time (it would be higher than in the case of nontransparent motions since the explored space is of dimension 4 ).

To extract the underlying layer motion models from the set of computed pairs of displacements, we apply the Hough transform on a three-dimension parameter space (i.e., a simplified affine motion model): 𝑢 = 𝑎 1 + 𝑎 2 𝑥 , 𝑣 = 𝑎 4 + 𝑎 2 𝑦 . ( 1 5 ) Indeed, restricting the Hough space to a 3D space obviously limits the computational complexity and improves the transform efficiency, while being sufficient to determine the number of layers and to initialize their motion models. Each displacement 𝐰 = ( 𝑢 , 𝑣 ) votes for the parameters: 𝑎 1 = 𝑎 2 𝑎 𝑥 𝑢 , 4 = 𝑎 2 𝑦 𝑣 , ( 1 6 ) defining a straight line. The Hough space has to be discretized in its three directions. Practically, we have chosen a one pixel step for the translational dimensions 𝑎 1 and 𝑎 4 , and for the divergence term 𝑎 2 a step corresponding to a one pixel displacement in the center of the image. An example of computed Hough accumulation matrix is given on Figure 4.

647262.fig.004
Figure 4: Accumulation matrix in the space ( 𝑎 1 , 𝑎 4 , 𝑎 2 ) , built from the displacements computed by a transparent block-matching technique. These displacements are presented on the left of Figure 5. The ground truth of the two motion models present in the image sequences are plotted in green and blue.

If the layers include large homogeneous areas (which is the case in X-ray images), the initial block-matching is likely to produce a relatively large number of erroneous displacement estimates. To improve further the initialization stage, we adopt a continuous increment mechanism of the accumulation matrix based on a confidence value depending on the block texture.

To compute the confidence value associated to a block 𝑠 and a displacement 𝐰 1 (the other displacement being fixed to 𝐰 2 ), we analyse the behavior of 𝐰 𝜁 ( , 2 , 𝑠 ) . If it remains close to its minimal value 𝐰 𝜁 ( 1 , 𝐰 2 , 𝑠 ) , then the layer associated to 𝐰 1 is homogeneous and 𝐰 1 should be assigned a low confidence value. Conversely, if 𝐰 𝜁 ( , 2 , 𝑠 ) has a clear minimum in 𝐰 1 , the corresponding layer is likely to be textured, and 𝐰 1 can be considered as reliable.

More precisely, we compute in each block 𝑠 : 𝑐 1 | | | | | 1 ( 𝑠 ) = 𝑛 𝚫 𝐰 𝐰 0 𝑥 0 2 0 0 𝑑 𝜁 1 𝐰 + 𝚫 𝐰 , 2 𝐰 , 𝑠 𝜁 1 , 𝐰 2 | | | | | , 𝑐 , 𝑠 2 | | | | | 1 ( 𝑠 ) = 𝑛 𝚫 𝐰 𝐰 0 𝑥 0 2 0 0 𝑑 𝜁 1 , 𝐰 2 𝐰 + 𝚫 𝐰 , 𝑠 𝜁 1 , 𝐰 2 | | | | | , , 𝑠 ( 1 7 ) where 𝑛 is the number of tested displacements Δ 𝐰 . To normalize these coefficients, we compute their first quartile ̃ 𝑐 over the image, and then assign to each block 𝑠 and computed displacement 𝐰 𝑖 ( 𝑖 = 1 , 2 ) the value 𝑐 𝑖 ( 𝑠 ) / ̃ 𝑐 (or 1 if 𝑐 𝑖 ( 𝑠 ) / ̃ 𝑐 > 1 ). Then, the 2 5 % more reliable computed displacements are assigned the value 1 , whereas those that are less informative, or which are not correctly computed, are given a small confidence value.

The Hough transform allows us to cluster the reliable displacement vectors. We successively look for the dominant peaks in the accumulation matrix, and we decide that the corresponding motion models are relevant if they “originate” from at least five computed displacements that have not been considered so far. Conversely, a displacement computed by the transparent block-match technique is considered as “explained” by a given motion model if it is close enough to the mean velocity induced by this motion model over the considered block (in practice, distant by less than two pixels).

This method yields a first evaluation of the number of layers 𝐾 and an initialization of the affine motion models. Then, the label field is initialized by minimizing the first term of (5) only (i.e., we consider a maximum likelihood criterion). Figure 5 illustrates the initialization stage.

fig5
Figure 5: Example if the initialization stage for a symbolic example. (a) The displacements computed by the transparent block-matching. (b) The velocity fields corresponding to the affine models extracted by the Hough transform. Three layers are detected; they are plotted in red, green and blue. The erroneous displacements are plotted in black. (c) The true displacements.
5.3. Content Adaptive Parameter Setting

Two parameters have to be set for the functional 𝐹 (5) to be defined: the scale parameter 𝐶 of the robust functional, and the parameter 𝜇 weighting the relative influence of the data-driven and the smoothing term. 𝐶 is determined as follows: 𝑟 = m e d 𝐩 𝑟 ̂ 𝜃 𝐩 , 𝑒 1 ( 𝑠 ) , ̂ 𝜃 𝑒 2 ( 𝑠 ) , Δ 𝑟 = 1 . 4 8 × m e d 𝐩 | | 𝑟 ̂ 𝜃 𝐩 , 𝑒 1 ( 𝑠 ) , ̂ 𝜃 𝑒 2 ( 𝑠 ) 𝑟 | | , 𝐶 = 2 . 7 9 5 × Δ 𝑟 ( 1 8 ) when 𝐩 is a pixel position, refers to the image grid and where ̂ 𝜃 is the estimate of 𝜃 from the previous iteration of the minimization.

The use of the medians allows to evaluate representative values 𝑟 and Δ 𝑟 of the “mean” and “deviation” residual values without being disturbed by the outliers. The factor 1 . 4 8 enables to unbiase the estimator of Δ 𝑟 , and the factor 2 . 7 9 5 has been proposed by Tukey to correctly estimate 𝐶 [27].

The 𝜇 parameter is determined in a content-adaptive way: 𝜇 = 𝜆 m e d 𝑠 𝑆 ( 𝑥 , 𝑦 ) 𝑠 𝜌 𝐶 𝑟 ̂ 𝜃 𝑥 , 𝑦 , 𝑒 1 ( 𝑠 ) , ̂ 𝜃 𝑒 2 ( 𝑠 ) . ( 1 9 ) According to the targeted application, 𝜆 can be set to favour the data-driven velocity estimates (small 𝜆 ), or to favour smooth segmentation (higher 𝜆 ). In practice, the value 𝜆 = 0 . 5 has proven to be a good tradeoff between regularization and oversegmentation.

5.4. Update of the Number of Transparent Layers

To update the number 𝐾 of transparent layers, we have designed two criteria. On one hand, two layers, the motion models of which are too close (typically, difference of one pixel on average over the corresponding velocity fields), are merged. Furthermore, a layer attributed to less than five blocks is discarded, and the corresponding blocks relabeled. On the other hand, we propose means to add a new layer if required, based on the maps of weights generated by the robust affine motion estimation stage.

The blocks where the current labels and/or the associated estimated motion models are not coherent with every pixel they contain should include low weight values delivered by the robust estimation stage for the outlier pixels. It then becomes necessary to add a new layer if a sufficient number of blocks containing a large number of pixels with low weights are detected. More formally, we use as indicator the number of weights smaller than a given threshold. The corresponding points will be referred to as outliers . To learn which number of outliers per block is significant, we compute the median value of outliers 𝑁 0 over the blocks, along with its median deviation Δ 𝑁 𝑜 . A block 𝑠 is considered as mislabeled if its number 𝑁 𝑜 ( 𝑠 ) of outliers verifies: 𝑁 𝑜 ( 𝑠 ) > 𝑁 𝑜 + 𝛾 Δ 𝑁 𝑜 ( 2 0 ) w i t h 𝑁 𝑜 = m e d 𝑠 𝑆 𝑁 𝑜 ( 𝑠 ) , ( 2 1 ) Δ 𝑁 𝑜 = m e d 𝑠 𝑆 | | 𝑁 𝑜 ( 𝑠 ) 𝑁 𝑜 | | . ( 2 2 ) In practice, we set 𝛾 = 2 . 5 . If more than five blocks are considered as mis-labeled, we add a new layer. We estimate its motion model by estimating an affine model from the displacement vectors supplied by the initial block-matching step in these blocks (using a least-square estimation), and we run the joint segmentation and estimation scheme on the whole image again.

6. Motion-Compensated Denoising Filter for Transparent Image Sequences

In this section, we exploit the estimated transparent motions for a denoising application. To do so, we propose a way to compensate for the transparent motions, without having to separate the transparent layers.

6.1. Transparent Motion Compensation
6.1.1. Principle

A first way of tackling the problem of transparent motion compensation is to separate the transparent layers and compensate the individual motion of each layer, layer per layer. However, the transparent layer separation problem has been solved in very restricted conditions only [5, 8]. As a result, this cannot be applied in general situations as those encountered in medical image sequences.

Instead, we propose to globally compensate the transparent motions in the image sequence without prior layer separation. To do so, we propose to rearrange the PTMC (4) to form a prediction of the image 𝐼 at time 𝑡 + 1 , based on the images at time instants 𝑡 1 and 𝑡 and exploiting the two estimated affine motion models ̂ 𝜃 1 and ̂ 𝜃 2 : 𝜽 𝐼 ( 𝐩 , 𝑡 + 1 ) = 𝐼 𝐩 + 𝐰 1 𝜽 ( 𝐩 ) , 𝑡 + 𝐼 𝐩 + 𝐰 2 𝜽 ( 𝐩 ) , 𝑡 𝐼 𝐩 + 𝐰 1 𝜽 ( 𝐩 ) + 𝐰 2 . ( 𝐩 ) , 𝑡 1 ( 2 3 ) Equation (23) allows us to globally compensate for the transparent image motions. It enables to handle X-ray images that satisfy the bidistributed transparency hypothesis, that is, involving locally two layers, without limiting the total number of layers globally present in the image.

Any denoising temporal filter can be made transparent-motion-compensated by considering, instead of past images, transparent-motion-compensated images 𝐼 given by (23). As a consequence, details can be preserved in the images, and no blurring introduced if the transparent motions are correctly estimated.

However, relation (23) implies an increase of the noise level of the predicted image since three previous images are added. The variance of the noise corrupting 𝐼 is the sum of the noise variances of the three considered images. This has adverse effects as demonstrated in the next subsection, if a simple temporal filter is considered.

6.1.2. Limitation

Transparent motion compensation can be added to any spatiotemporal filter. We will illustrate its limitation in the case of a pure temporal filter. More precisely, we consider the following temporal recursive filter [28]: 𝐼 ( 𝐩 , 𝑡 + 1 ) = ( 1 𝑐 ( 𝐩 , 𝑡 + 1 ) ) 𝐼 ( 𝐩 , 𝑡 + 1 ) + 𝑐 ( 𝐩 , 𝑡 + 1 ) 𝐼 ( 𝐩 , 𝑡 + 1 ) , ( 2 4 ) where 𝐼 ( 𝐩 , 𝑡 + 1 ) is the output of the filter, that is, the denoised image, 𝐼 ( 𝐩 , 𝑡 + 1 ) is the predicted image and 𝑐 ( 𝐩 , 𝑡 + 1 ) the filter weight. This simple temporal filter is frequently used since its implementation is straightforward and its behavior well-known. Spatial filtering tends to introduce correlated effects that are quite disturbing for the observer (especially when medical image sequences are played at high frame rates). This filter is usually applied in an adaptative way to account for incorrect prediction, which can be evaluated by the expression | 𝐼 ( 𝐩 , 𝑡 + 1 ) 𝐼 ( 𝐩 , 𝑡 + 1 ) | . More specifically, the gain is defined as a decreasing function of the prediction error.

To illustrate the intrinsic limitation of such a transparent-motion compensated filter, we study its behavior under ideal conditions: the transparent motions are known as well as the level of noise in the different images. Furthermore, we ignore the low-pass effect of interpolations. The noise variances 𝜎 2 𝐼 ( 𝑡 + 1 ) , 𝜎 2 𝐼 ( 𝑡 + 1 ) , and 𝜎 2 (constant in time) of the images 𝐼 ( 𝑡 + 1 ) , 𝐼 ( 𝑡 + 1 ) , and 𝐼 ( 𝑡 ) , respectively, are related as follows (from (24): 𝜎 2 𝐼 ( 𝑡 + 1 ) = ( 1 𝑐 ( 𝑡 + 1 ) ) 2 𝜎 2 + 𝑐 ( 𝑡 + 1 ) 2 𝜎 2 𝐼 ( 𝑡 + 1 ) ( 2 5 ) under the assumption that the different noises are independent. On the other hand, (23) implies (for a recursive implementation of this filter): 𝜎 2 𝐼 ( 𝑡 + 1 ) = 2 𝜎 2 𝐼 ( 𝑡 ) + 𝜎 2 𝐼 ( 𝑡 1 ) . ( 2 6 ) For an optimal noise filtering, one should choose 𝑐 ( 𝑡 + 1 ) so that 𝜎 2 ( 𝑡 + 1 ) is minimized: 𝑐 ( 𝑡 + 1 ) = 2 𝜎 2 𝐼 ( 𝑡 ) + 𝜎 2 𝐼 ( 𝑡 1 ) 2 𝜎 2 𝐼 ( 𝑡 ) + 𝜎 2 𝐼 ( 𝑡 1 ) + 𝜎 2 . ( 2 7 ) Equations (25) and (27) define a sequence ( 𝜎 2 𝐼 ( 𝑡 ) ) 𝑡 𝑁 . We show in Appendix C that it asymptotically reaches a limit: l i m 𝑡 𝜎 𝐼 ( 𝑡 ) = 2 3 𝜎 0 . 8 1 6 𝜎 . ( 2 8 ) Even if we assume that the motions were known, transparent motion-compensated recursive temporal filter cannot allow for a significant denoising rate . Similarly, even if transparent motion-compensated spatiotemporal filters do not exhibit the exact same behavior, they denoise less efficiently that their noncompensated counterparts.

6.2. Hybrid Filter
6.2.1. Problem Statement

Transparent motion compensation allows for a better contrast preservation since it avoids blurring. However, it affects the noise reduction efficiency by increasing the noise of the predicted image. We therefore propose to exploit the transparent motion compensation when appropriate only, to offer a better tradeoff between denoising power and information preservation. We distinguish four local configurations:

( 𝐂 0 ) Both layers are textured around pixel 𝐩 . The global transparent motion compensation is needed to preserve details. The filter output will rely on 𝐼 ( 𝐩 , 𝑡 + 1 ) and 𝐼 ( 𝐩 , 𝑡 + 1 ) only (instead of 𝐼 ( 𝐩 , 𝑡 ) and 𝐼 ( 𝐩 , 𝑡 + 1 ) for the case without motion compensation). ( 𝐂 1 ) The first layer only is textured around pixel 𝐩 . We will just perform the motion compensation of this layer but still applied to the compound intensity. The filter will then exploit ̂ 𝜃 𝐼 ( 𝐩 , 𝑡 + 1 ) , 𝐼 ( 𝐩 + 𝐰 1 ( 𝐩 ) , 𝑡 ) , and 𝐼 ( 𝐩 , 𝑡 + 1 ) (which still carries pertinent information here, but will be assigned a small weight because of its noise level): 𝐼 ̂ 𝜃 𝐼 ( 𝐩 , 𝑡 + 1 ) = 𝛼 ( 𝐩 , 𝑡 + 1 ) 𝐼 ( 𝐩 , 𝑡 + 1 ) + 𝛽 ( 𝐩 , 𝑡 + 1 ) 𝐩 + 𝐰 1 ( 𝐩 ) , 𝑡 + ( 1 𝛼 ( 𝐩 , 𝑡 + 1 ) 𝛽 ( 𝐩 , 𝑡 + 1 ) ) 𝐼 ( 𝐩 , 𝑡 + 1 ) . ( 2 9 ) Like in Section 6.1.2, explicit expressions can be computed for the optimal weights (see Table 1 for their expression in the case of a temporal hybrid filter). ( 𝐂 2 ) The second layer only is textured around pixel 𝐩 . We use a combination of ̂ 𝜃 𝐼 ( 𝐩 , 𝑡 + 1 ) , 𝐼 ( 𝐩 + 𝐰 2 ( 𝐩 ) , 𝑡 ) , and 𝐼 ( 𝐩 , 𝑡 + 1 ) . ( 𝐂 3 ) Both layers are homogeneous around pixel 𝐩 . The four intensities can be used: ̂ 𝜃 𝐼 ( 𝐩 , 𝑡 + 1 ) , 𝐼 ( 𝐩 + 𝐰 1 ̂ 𝜃 ( 𝐩 ) , 𝑡 ) , 𝐼 ( 𝐩 + 𝐰 2 ( 𝐩 ) , 𝑡 ) , and 𝐼 ( 𝐩 , 𝑡 + 1 ) .
tab1
Table 1: Optimal filter weights for the five possible configurations. The noise standard deviation noise of the acquired image is denoted 𝜎 , the one of the previous denoised image 𝜎 𝐼 and the one of the predicted image 𝜎 𝐼 .

A fifth configuration is added w.r.t. the motion estimation output.

( 𝐂 4 ) The motion estimates are erroneous . In this case, we duplicate 𝐼 ( 𝐩 , 𝑡 + 1 ) only. This fifth configuration makes the hybrid filter adaptive, in the sense that it will keep displaying coherent images even if erroneous motion estimates are supplied.
6.2.2. Configuration Selection and Designed Hybrid Filtering

This subsection deals with the detection of the local configuration among the five listed above. Let us assume that 𝐼 1 only is textured around pixel 𝐩 . Then, we can write (for convenience, we will write 𝐰 1 and 𝐰 2 instead of 𝐰 ̂ 𝜃 1 ( 𝐩 ) and 𝐰 ̂ 𝜃 2 ( 𝐩 ) ): 𝐼 ( 𝐩 , 𝑡 + 1 ) = 𝐼 1 ( 𝐩 , 𝑡 + 1 ) + 𝐼 2 ( 𝐩 , 𝑡 + 1 ) = 𝐼 1 𝐩 + 𝐰 1 , 𝑡 + 𝐼 2 𝐩 + 𝐰 2 , 𝑡 𝐼 1 𝐩 + 𝐰 1 , 𝑡 + 𝐼 2 𝐩 + 𝐰 1 , 𝑡 𝐼 𝐩 + 𝐰 1 . , 𝑡 ( 3 0 ) We have exploited in (30) the local lack of contrast of the layer 𝐼 2 . As a result, we can compare 𝐼 ( 𝐩 , 𝑡 + 1 ) and 𝐼 ( 𝐩 + 𝐰 1 , 𝑡 ) to decide whether 𝐼 2 is uniform around 𝐩 . To do so, we have to establish if these two values differ only because of the presence of noise, or if they actually correspond to different physical points. This is precisely the problem handled by adaptive filters.

We resort to the same mechanism. Rather than adopting a binary decision to select one given configuration 𝐂 𝐢 , that would be visually disastrous since neighboring pixels would be processed differently,

(i) we first compute for each pixel 𝐩 two factors: 𝑓 1 ( 𝐩 ) associated to “the layer 1 is uniform ” and 𝑓 2 ( 𝐩 ) associated to “the layer 2 is uniform ”. They are defined as decreasing functions of | 𝐼 ( 𝐩 , 𝑡 + 1 ) 𝐼 ( 𝐩 + 𝐰 2 , 𝑡 ) | (resp., | 𝐼 ( 𝐩 , 𝑡 + 1 ) 𝐼 ( 𝐩 + 𝐰 1 , 𝑡 ) | ). A third factor 𝑓 1 2 ( 𝐩 ) is associated to “ 𝐼 ( 𝐩 , 𝑡 + 1 ) is a good prediction of 𝐼 ( 𝐩 , 𝑡 + 1 ) ”. It is a decreasing function of | 𝐼 ( 𝐩 , 𝑡 + 1 ) 𝐼 ( 𝐩 , 𝑡 ) | . This enables to associate each configuration ( 𝐂 𝐢 ) , 𝑖 = 0 4 , an appropriate weighting factor, as shown in (31). (ii) we filter the image using relation (32) by considering in turn each configuration 𝐂 𝐢 , 𝑖 = 0 4 , and we get the output images 𝐼 ( 𝐂 𝐢 ) ( 𝐩 , 𝑡 ) . (iii) we combine linearly these five output images as follows to yield the final denoised image: 𝐼 ( 𝐩 , 𝑡 ) = 𝑓 1 2 ( 𝐩 ) 1 𝑓 1 ( 𝐩 ) 1 𝑓 2 𝐼 ( 𝐩 ) 𝐂 0 ( 𝐩 , 𝑡 ) + 𝑓 1 2 ( 𝐩 ) 1 𝑓 1 𝑓 ( 𝐩 ) 2 𝐼 ( 𝐩 ) ( 𝐂 1 ) ( 𝐩 , 𝑡 ) + 𝑓 1 2 ( 𝐩 ) 𝑓 1 ( 𝐩 ) 1 𝑓 2 ( 𝐼 𝐩 ) ( 𝐂 2 ) ( 𝐩 , 𝑡 ) + 𝑓 1 2 ( 𝐩 ) 𝑓 1 ( 𝐩 ) 𝑓 2 𝐼 ( 𝐩 ) ( 𝐂 3 ) + ( 𝐩 , 𝑡 ) 1 𝑓 1 2 𝐼 ( 𝐩 ) 𝐂 4 ( 𝐩 , 𝑡 ) . ( 3 1 )

To summarize, the overall scheme comprises two modules:

(i)the first one filters the images based on different (transparent or nontransparent) motion compensation schemes (Section 6.2.1). (ii)the second module locally weights the five intermediate images according to the probability of the considered configuration (Section 6.2.2).
6.2.3. Temporal Hybrid Filter

In the case of a purely temporal hybrid filter, the expression for a given configuration is defined by: 𝐼 𝐼 ( 𝐩 , 𝑡 + 1 ) = 𝛼 ( 𝐩 , 𝑡 ) 𝐼 ( 𝐩 , 𝑡 + 1 ) + 𝛽 ( 𝐩 , 𝑡 ) 𝐩 + 𝐰 1 𝐼 , 𝑡 + 𝛿 ( 𝐩 , 𝑡 ) 𝐩 + 𝐰 2 𝐼 , 𝑡 + 𝛾 ( 𝐩 , 𝑡 ) ( 𝐩 , 𝑡 + 1 ) , ( 3 2 ) where 𝛼 , 𝛽 , 𝛿 , and 𝛾 are filter weights locally specified. 𝛽 = 0 and 𝛿 = 0 for 𝐂 0 ; 𝛿 = 0 for 𝐂 1 ; 𝛽 = 0 for 𝐂 2 ; 𝛽 = 0 , 𝛿 = 0 and 𝛾 = 0 for 𝐂 4 . When the noise level of the input images involved in (32) is known or estimated, one can analytically set the other weights for an optimal filtering (Table 1).

7. Transparent Motion Estimation Results

7.1. Results on Realistic Generated Image Sequences

We have tested our transparent motion estimation scheme on realistic image sequences generated as described in Appendix B.2. It supplies a meaningful quantitative assessment of the performance of our method under realistic conditions. It also allows us to compare the performance of different settings of our algorithm in order to choose the optimal one (Each parameter is either computed online (in particular the crucial ones like the 𝐶 parameter of the Tukey function (6), or the 𝜇 parameter of the MRF function (5), or set once for all based on tests on the realistic generated image sequence (prepocessing, interpolation type, Block-Matching search range, accumulation matrix structure ). The method is therefore fully automatic.).

In this subsection, we focus on images containing two layers only, each one spread over the full image. It is indeed difficult to simultaneously assess the quality of the motion segmentation and of the layer segmentation (An erroneous segmentation that mislabels one block will dramatically impact the global estimation error (33), even if the considered block is low textured and little informative. The residual error would be a better error metric, yet it is much less intuitive.). The overall performance of the global method is discussed over real experiments in Section 7.2.

More specifically, we have applied our method on 2 5 0 three-frame sequences, the first layer (abdomen image) undergoing a translation and the second layer (heart image) an affine motion. To generate the affine motion of the second layer, we proceed in two steps. First, we randomly choose the two translational and the scaling (denoted ) parameters so that the resulting displacement magnitude lies in the range of −8 to 8 pixels. Then, we convert the obtained transformation into a set of affine motion models by allowing the two pairs of affine parameters 𝑎 2 , 𝑎 6 on one hand, and 𝑎 3 , 𝑎 5 on the other hand, to vary from their reference value (resp., and 0 ), in a range of respectively ± 0 . 2 and ± 0 . 2 . Consequently, the generated motions are similar to anatomic motions, while not perfectly following the model assumed by the Hough transform in the initialization step. The two generated motions are also required to sufficiently differ from each other, that is, from 2 pixels in average over the image grid (An observer would not perceive two distinct transparent layers otherwise!)

We have considered image sequences representative of diagnostic (high dose) and fluoroscopic (low dose) exams (with a noise of standard deviation 𝜎 = 1 0 (SNR: 34 dB) and 𝜎 = 2 0 (SNR: 28 dB) resp.), at different scatter rates (a real typical value being 2 0 % ). The images are coded on 1 2 bits, and their mean value is typically 5 0 0 . Running the overall framework takes about 3 0 seconds for 2 8 8 × 2 8 8 images on a Pentium IV (2.4 GHz and 1 Go). The global estimation error is formally estimated below (33).

Table 2 contains the the mean value (in pixels) of the global estimation error 𝜖 computed from 2 5 0 tests, as well as its standard deviation and its median value: 1 𝜖 = | | | | 𝐩 0 𝑥 0 2 0 0 𝑑 𝐰 𝜃 t r u e 1 ̂ 𝜃 ( 𝐩 ) 𝐰 1 ( 𝐩 ) 2 + 𝐰 𝜃 t r u e 2 ̂ 𝜃 ( 𝐩 ) 𝐰 2 ( 𝐩 ) 2 ( 3 3 ) where 𝐰 𝜃 t r u e 𝑖 (resp., 𝐰 ̂ 𝜃 𝑖 ) refers to the velocity vectors (given by the true (resp., estimated) models. We can observe that very satisfactory results are obtained. The average error raises to 0 . 3 6 pixels only for the most difficult diagnostic case. For comparison, the best method from the state of the art [8] reached a precision of about 2 pixels on similar data (involving quadratic motion models though). The estimation accuracy remains very good on the difficult fluoroscopic image sequences ( 𝜎 = 2 0 ), where subpixel precision is maintained if the scatter rate is not too high. In this last case ( 5 0 % scatter rate), the motion estimation remains interesting but is less accurate. The other indicators demonstrate the repeatability of the method over the different experiments.

tab2
Table 2: Performance evaluation of the proposed method for different noise levels and scatter rates: average, standard deviation and median value (in pixels) of the global error computed over 250 generated image sequences.

As for every method based on (1), our framework assumes temporal motion constancy over two successive time intervals. This hypothesis may be critical for clinical image sequences at some time instants of the heart cycle. To test the violation of this assumption, we have carried out the following experiment.

We have randomly chosen two affine models ( 𝜃 1 1 , 𝜃 1 2 ) as explained above, and applied them between time instants 𝑡 1 and 𝑡 . We have then computed a second transparent motion field ( 𝜃 2 1 , 𝜃 2 2 ) , allowing each coefficient to vary from 10, 20, or 3 0 % around ( 𝜃 1 1 , 𝜃 1 2 ) , and applied it between time instants 𝑡 and 𝑡 + 1 . This way, a sequence of three images with temporal motion variation is generated. We have evaluated the global errors between the estimated motion field and ( 𝜃 1 1 , 𝜃 1 2 ) on one hand, and ( 𝜃 2 1 , 𝜃 2 2 ) on the other hand. We report its mean value computed over 2 5 0 generated sequences in Table 3.

tab3
Table 3: Average of the global estimation error for different noise levels and different temporal motion variations (with 20% scatter rate).

We can note a progressive degradation of the estimation accuracy with the amount of temporal motion change. Then, it is not that critical that the temporal motion constancy over two successive time intervals is not strictly verified. The transparent motion estimation for fluoroscopic images remains accurate, even if the two successive motions vary in a range of 3 0 % .

7.2. Results on Real Clinical Image Sequences

The previous experiments are useful to study the behaviour of the proposed method, to fix the options and the parameters, and to quantitavely compare it to other motion estimation methods. However, it does not validate the relevance of the two-layer model per se, since the generated images themselves rely on this model. In this section, we present results obtained on real image sequences that demonstrate the bitransparency model validity.

We present motion estimation results out of three real clinical image sequences and one video. We display several frames along with the estimated motion fields in Figures 68, at some interesting time instants of the sequences. The velocity fields are plotted with colored vectors, the length of which is twice the real displacement magnitude for sake of visibility.

fig6
Figure 6: Four-estimated two-layer motion fields, along with the corresponding fluoroscopic image at four different time instants. One layer corresponds to the static background (ribs) and its estimated affine model is plotted in red. The other layer involves the heart and the lung tissues and its estimated affine model is plotted in green. (a), (c) instants are within in the diastole phase, (b), (d) ones in the systole phase.
fig7
Figure 7: Second example of a X-ray interventional cardiac image sequence, (a)–(c): Images acquired at three different time instants, (d), (e), (f): the corresponding velocity fields supplied by the estimated affine motion models, plotted in different colours according to the transparent layer they are belonging to. (a) Illustration of the method ability to detect single layer situations; (b) correct segmentation and estimation of the motions of small objects, even included in noisy images; (c) handling of a transparency situation with three simultaneous transparent layers in some areas (see main text).
fig8
Figure 8: Example of a X-ray interventional cardiac image sequence, (a)–(c): image acquired at three different time instants, (d), (e), (f): the corresponding velocity fields supplied by the estimated affine motion models, plotted in different colours according to the segmented layer they are belonging to. (a) presents an example of global bitransparency; (b) illustrates the ability of the method to handle dominant motions in case of three simultaneous transparent layers in some areas; (c) refers to a complex configuration in which a trade-off has to be met between accurate motion estimates and clean segmentation maps.

The motion estimation quality is evaluated by visual inspection since no ground truth is available, and since the resulting displaced image differences are difficult to interpret due to the lack of contrast. Anyway, the reliability of the estimated motions is objectively demonstrated by the convincing results of transparent-motion-compensated denoising given in Section 8.

The image of Figure 6 corresponds to an area of about 5 c m × 5 c m size, located between the heart (dark mass on the right) and the lungs (bright tissues). It also contains a static background where ribs are perceptible. In the considered region, the heart carries the lungs tissues along, so that they have the same apparent motion. The motions of the two layers are correctly estimated: the red arrow field corresponds to the static background (it is not plotted when it is exactly equal to 0 ), and the green one to the estimated affine model for the layer formed by the pair “heart and lungs”. Its motion is coherent with the observation, both during the diastole (first and third presented images) and the systole (second and last images).

The sequence shown on Figures 7(a)7(c) is a cardiac interventional image sequence. It globally involves three distinct transparent layers:

(i)the static background, which includes the contrasted surgical clips. (ii)the set “diaphragm and lungs”. The diaphragm is the dark mass in the bottom left corner of the image, and the lungs form the bright tissues in the other half of the image. Their motions are close, so that they can be considered as forming a single moving layer. (iii)the heart is also visible, even if its layer is less textured: it is the convex light-grey area on the right of the image. It can be easily seen on the first displayed image. A catheter (interventional tube) is inserted in one of its coronary, which has an visible motion different from the projected global motion of the heart (i.e., mainly inferred from the cardiac boundary perceptible in the image).

We first report results obtained at a time instant where the three layers are static (Figures 7(a)7(d)). Only one region is detected, which is correct: our method is still effective when no transparent motion is involved.

At a second time instant, the group “diaphragm and lungs” is still static. The velocity field supplied by the corresponding estimated motion model is plotted in red and the estimated motion of the heart in green (Figure 7(e)). Both motion models are correctly estimated. Interestingly, the movement of the catheter in the upper part of the image is properly segmented as well, even if it forms a thin structure in a noisy image sequence.

The image content of the third considered time instant (Figure 7(f)) is also of interest, since the three layers now appears to be undergoing independent visible motions. In this borderline situation (since we have three moving layers in some blocks), the method again proves to perform well: it manages to focus on the two dominant layers in the different regions. As a result, the red velocity field corresponds to the static background layer, the green one to the lungs layer, and the blue one to the heart motion layer.

The sequence presented on Figures 8(a)8(c) is a cardiac interventional image sequence. It depicts an about 5 c m × 5 c m area of the anatomy, where the heart (dark mass filling three quarters of the image, nearly static under the considered acquisition angulation) superimposes on the lungs (bright tissues in the upper right of the image). We give results for three distant instants of this sequence. The velocity fields plotted on Figures 8(d)8(f) are supplied by the affine motion models estimated at the three considered time instants.

A global two-layer transparency correctly explains the observed motions at the first time instant (Figure 8(d)). The green velocity vectors correspond to the group “lungs and diaphragm”, animated by the breathing, and the red field refers to the transparent layer of the heart (it is present all over the image but is not plotted when it is perfectly null). Let us also point out that the static background is merged with the heart layer.

It is necessary to introduce a bidistributed transparency configuration to explain the motions observed at the second considered time instant (Figure 8(e)). The red velocity field still refers to the (almost) static background, which now includes the mass of the heart and the diaphragm (motionless at this time). The blue velocity field corresponds to the upward motion of breathing carrying along the lungs. The green velocity field accounts for a supplementary layer corresponding to the set of coronary arteries taken as a whole, the motion of which becomes perceptible. It is properly handled and correctly estimated. This demonstrates the ability of the method to focus on the two dominating motions even in situations of three-layer transparency (here, static layer, lungs layer and coronary arteries layer).

The last reported result (Figure 8(f)) highlights the performance of the method when situations of high complexity are encountered. All the different motions are indeed correctly estimated (by observing the sequence) even if oversegmentation is noticeable. Let us mention that a less fragmented spatial segmentation could be obtained by increasing the value of the regularization factor 𝜆 (19), but at the cost of a less accurate match between estimated motion models and observed motions. The trade-off has to be met according to the targeted application.

Finally, Figure 9 reports experiments conducted on a sequence extracted from a movie, picturing a couple reflected on an apartment window. To our knowledge, it is the first time a real transparent video sequence is processed (we mean a sequence which has not been constructed for that purpose). The reflection superimposes to a panning view of the city behind. The camera is undergoing a smooth rotation, making the reflected faces and the city undergo two apparent translations with different velocities in the image. At some time instant, the real face of a character appears in the foreground but does not affect our method because of its robustness. The obtained segmentation and motion estimation are satisfying.

647262.fig.009
Figure 9: Example of a movie depicting two people reflected on an apartment window. From left to right and top to bottom: the first frame of the sequence; one of the three images corresponding to the reported results later in the sequence; the obtained segmentation into the transparent layer supports (the green polygonal line in the middle roughly encloses the reflected people); the velocity fields supplied by the estimated affine motion models; displaced frame differences computed by compensating the motion of one of the two layers.

More results on video sequences can be found in [1].

8. Denoising Results

We have tested the proposed denoising method in the case of purely temporal filters because of their practical interest (explained in Section 6.1.2). Three denoising filters are compared: the adaptive recursive filter [28] without motion compensation (ANMCR) acting as a reference, the transparent-motion-compensated recursive filter (MCR) described in Section 6.1, and the proposed hybrid recursive filter (HR) developed in Section 6.2. The MCR and HR filters exploit transparent motions estimated by the method of Section 5.

The adaptive function of the ANMCR and MCR filters, taking into account the relation between filter gain and prediction error, is pictured on Figure 10. It has been designed heuristically to provide efficient noise reduction without introducing artifacts. It has three parts, defined by two thresholds ( 𝑠 1 = 𝜎 and 𝑠 2 = 2 𝜎 in practice): a constant part for the low prediction errors (where the coefficient is set to the optimal value for noise reduction 𝑐 m a x ), a linear decreasing one in a transition area, and a vanishing one for large prediction errors. We have specified the three factors 𝑓 1 , 𝑓 2 , and 𝑓 1 2 of the hybrid filter in a similar way. 𝑐 m a x is set to 1, 𝑠 1 to 1 . 5 𝜎 and 𝑠 2 to 2 𝜎 for that filter.

647262.fig.0010
Figure 10: Decreasing function used as adaptive function in the different filters. It has three parts: a constant one for small prediction errors, a linear one in a transition area, and a vanishing one for large prediction errors.
8.1. Results on Realistic Generated Image Sequences

We have tested the proposed denoising method on realistic synthetic image sequences simulating the X-ray imaging process and the transparency phenomenon (Appendix B.2). The obtained image sequence is corrupted by a strong noise typical of fluoroscopic exams ( 𝜎 = 2 0 ).

Table 4 contains the evolution of the residual noise level of the filtered images. The transparent motion compensated filter soon reaches a denoising limit, as predicted by the theory. The hybrid filter performs slightly better than the ANMCR filter, as far as residual noise level is concerned.

tab4
Table 4: Normalized residual noise evolution given by the rate 𝜎 ( 𝑡 ) / 𝜎 for a realistic synthetic image sequence typical of X-ray exams, processed by the adaptive temporal filter without motion compensation (ANMC), with transparent motion compensation (MC) and by the proposed hybrid filter (HR).

The residual noise maps are given in Figure 11. They show that the hybrid filter preserves better the image details, and that the MCR filter also outperforms the ANMCR filter. However, the residual noise is much more perceptible in the case of MCR filter than for the other two filters.

647262.fig.0011
Figure 11: Residual noise of the eighth image of the generated sequence respectively obtained with the ANMCR filter, the MCR filter and the proposed HR filter (see main text).

Combining the different performance criteria, we can claim that the HR filter appears as the best choice among the three filters for that set of experiments.

8.2. Results on Real Clinical Images

It is difficult to illustrate denoising results by means of static printed images, when the considered images are meant to be observed dynamically on a specific screen in a dark cath-lab. However, the major interest of our framework being its ability to improve the quality of real interventional images, we present three typical denoising results in this subsection.

Since the MCR performs noticeably worse than the two other filters, we will compare the performance of the ANMCR and HR filters only. The images processed with the former will be presented on the right of the figures, and those with the latter on the left, at different time instants. Both are heuristically parameterized to provide a visually equivalent global denoising effect, so that the difference of performance will be mainly assessed based on the quality of contrast preservation and on the presence of artifacts. We have drawn arrows on the figures to highlight the regions of interest (that appear immediately on a dynamic display).

Results on a cardiac fluoroscopic exam are reported on Figure 12 at different time instants (It can be observed at the address http://www.irisa.fr/vista/Equipe/People/Auvray/Travaux_Vincent.Auvray.English.html.). The dark mass of the heart (on the right) can be distinguished from the bright tissues of the lungs (on the left). These two organs are superimposed to the background, where spine disks can be seen. The comparison of the output images obtained with the HR filter (on the left) and the ANMCR filter (on the right) reveals a much better contrast preservation of the heart with the HR filter (even if the printed figures do not give the immediate improvement impression that an observer has in ideal observation conditions). This is confirmed by the observation of the lungs.

fig12
Figure 12: (a) Two time instants of a fluoroscopic sequence processed with the HR, (b) the ANMCR filter, (c), (d) one detail of each image is shown. (c) Highlights the better cardiac border contrast, and (d) the better lungs detail preservation.

The second image sequence (Figure 13) corresponds to a cardiac exam where the catheter motion has been correctly handled by the transparent motion estimation module. We indeed observe that the catheter is more contrasted on the images processed by the HR filter than the ANMCR filter.

fig13
Figure 13: (a) Four-time instants of a fluoroscopic sequence processed with the HR, and (b) the ANMCR filter. We observe a better contrast preservation of the catheter with the hybrid filter.

The last experiment exhibits the “noise tail” artifact induced by the ANMCR filter. When a moving textured object is detected by this filter, the corresponding area is kept without filtering in the output image. As a result, a region of the output image is more noisy than its neighborhood, which can be disturbing. In this situation, the hybrid filter is able to denoise the whole image, and thus does not introduce such artifacts. This phenomenon is pictured on Figure 14. We have added on the right of the figure a zoom on the region of interest. We observe that the curve corresponding to the moving border of the heart remains corrupted on the image denoised with the ANMCR filter. This artifact disappears on the image processed by the HR filter.

fig14
Figure 14: (a) Fluoroscopic sequence