Abstract

Motion estimation techniques are widely used in today's video processing systems. The frequently used techniques are frequency-domain motion estimation methods, most notably phase correlation (PC). If the image frames are corrupted by Gaussian noises, then cross-correlation and related techniques do not work well. In this paper, however, we have studied this topic from a viewpoint different from the above. Our scheme is based on the bispectrum method for sub-pixel motion estimation of noisy image sequences. Experimental results show that our proposed method performs significantly better than PC technique.

1. Introduction

Image frames are generated by scanning a scene several times a second where each frame, generally, consists of two regions. One region, referred to as stationary background, is virtually the same as the previous frame. The other region, referred to as moving data, has moved with respect to the previous frame. Estimating the motion between image frames has long been a problem of interest in areas such as video compression, robot vision, and biomedical engineering [1].

Many motion estimation schemes have been developed. They can be classified into spatial-domain and frequency-domain approaches. The spatial-domain algorithms consist of matching algorithms and gradient-based algorithms. The frequency domain algorithms consist of phase correlation algorithms, wavelet transform-based algorithms, and DCT-based algorithms [2]. The vast majority of these algorithms consider noise-free data, although in [3] the displacement vector is estimated from noisy data using the generalized maximum likelihood criterion. If the image frames are corrupted by Gaussian noises, then cross-correlation and related techniques do not work well. In this circumstance, higher-order spectra (HOS) in general and the bispectrum in particular have recently been widely used as an important tool for signal processing. The classical methods based on the power spectrum are now being effectively superseded by the bispectral ones due to some definite disadvantages of the former. These include the inability to identify systems fed by non-Gaussian noise (NGN) inputs and nonminimum phase (NMP) systems, and by identification of system nonlinearity. In these cases, the autocorrelation-based methods offer no answer. Out of all these, the identifiability of NMP systems has received the maximum attention from researchers.

HOS-based methods have already been proposed to estimate motion between image frames [4–9]. In [6], the displacement vector is obtained by maximizing a third-order statistics criterion. In [8], the global motion parameters obtained by a new region recursive algorithm. In [4, 5], several algorithms are developed based on a parametric cumulant method, a cumulant-matching method, and a mean-kurtosis error criterion. In this correspondence, a novel algorithm for the detection of motion vectors in video sequences is proposed. The algorithm uses bispectrum method for subpixel motion estimation of noisy image sequences to obtain a measure of content similarity for temporally adjacent frames and responds very well to scene motion vectors.

2. Bispectrum-Based Image Motion Estimation

The problem of motion estimation can be stated as follows: β€œgiven an image sequence, compute a representation of the motion field that best aligns pixels in one frame of the sequence with those in the next” [9]. This is formulated as π‘”π‘˜βˆ’1(π‘₯,𝑦)=π‘“π‘˜βˆ’1(π‘₯,𝑦)+π‘›π‘˜βˆ’1𝑔(π‘₯,𝑦),π‘˜(π‘₯,𝑦)=π‘“π‘˜(π‘₯,𝑦)+π‘›π‘˜(π‘₯,𝑦)=π‘“π‘˜βˆ’1ξ‚€π‘₯βˆ’π‘‘π‘˜π‘₯,π‘¦βˆ’π‘‘π‘˜π‘¦ξ‚+π‘›π‘˜(π‘₯,𝑦)=π‘”π‘˜βˆ’1ξ‚€π‘₯βˆ’π‘‘π‘˜π‘₯,π‘¦βˆ’π‘‘π‘˜π‘¦ξ‚+π‘€π‘˜(π‘₯,𝑦),(1)where π‘€π‘˜(π‘₯,𝑦)=π‘›π‘˜(π‘₯,𝑦)βˆ’π‘›π‘˜βˆ’1(π‘₯βˆ’π‘‘π‘˜π‘₯,π‘¦βˆ’π‘‘π‘˜π‘¦); (π‘₯,𝑦) denotes spatial image position of a point; π‘”π‘˜(π‘₯,𝑦) and π‘”π‘˜βˆ’1(π‘₯,𝑦) are observed image intensities at instant π‘˜ and π‘˜βˆ’1, respectively; π‘“π‘˜(π‘₯,𝑦) and π‘“π‘˜βˆ’1(π‘₯,𝑦) are the noise-free frames; π‘›π‘˜(π‘₯,𝑦) and π‘›π‘˜βˆ’1(π‘₯,𝑦) are assumed to be spatially and temporally stationary, zero-mean image Gaussian noise sequences with unknown covariance; (π‘‘π‘˜π‘₯,π‘‘π‘˜π‘¦) is the displacement vector of the object during the time interval [π‘˜βˆ’1,π‘˜].

The third-order autocumulants and cross-cumulants of a zero-mean from 2D random field π‘”π‘˜(π‘₯,𝑦) are defined, respectively, as follows:πΆπ‘”π‘˜π‘”π‘˜π‘”π‘˜3ξ‚€π‘Ÿ1,π‘Ÿ2;𝑠1,𝑠2𝑔=πΈπ‘˜(π‘₯,𝑦)π‘”π‘˜ξ‚€π‘₯+π‘Ÿ1,𝑦+π‘Ÿ2ξ‚β‹…π‘”π‘˜ξ‚€π‘₯+𝑠1,𝑦+𝑠2,πΆξ‚ξ‚„π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3ξ‚€π‘Ÿ1,π‘Ÿ2;𝑠1,𝑠2𝑔=πΈπ‘˜(π‘₯,𝑦)π‘”π‘˜βˆ’1ξ‚€π‘₯+π‘Ÿ1,𝑦+π‘Ÿ2ξ‚β‹…π‘”π‘˜ξ‚€π‘₯+𝑠1,𝑦+𝑠2,(2)where 𝐸{β‹…} represents the expectation operator.

The discrete bispectrum of the frame π‘”π‘˜(π‘₯,𝑦) is defined as follows: π΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2𝐢=β„±π‘”π‘˜π‘”π‘˜π‘”π‘˜3ξ‚€π‘Ÿ1,π‘Ÿ2;𝑠1,𝑠2=πΊπ‘”π‘˜ξ‚€π‘’1,𝑒2ξ‚πΊπ‘”π‘˜ξ‚€π‘£1,𝑣2ξ‚πΊβˆ—π‘”π‘˜ξ‚€π‘’1+𝑣1,𝑒2+𝑣2,(3)where β„± denotes the Fourier transform operation; πΊπ‘”π‘˜(𝑒) corresponds to the DFT of the frame π‘”π‘˜(π‘₯,𝑦); βˆ— indicates the complex conjugate; (𝑒1,𝑒2) and (𝑣1,𝑣2) are the frequency co-ordinates for the 2D Fourier transform.

Due to the shift-invariance of the bispectrumπ΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2=π΅π‘”π‘˜βˆ’1π‘”π‘˜βˆ’1π‘”π‘˜βˆ’13𝑒1,𝑒2;𝑣1,𝑣2+π΅π‘€π‘˜π‘€π‘˜π‘€π‘˜3𝑒1,𝑒2;𝑣1,𝑣2,(4)where π΅π‘”π‘˜βˆ’1π‘”π‘˜βˆ’1π‘”π‘˜βˆ’13(𝑒1,𝑒2;𝑣1,𝑣2) and π΅π‘€π‘˜π‘€π‘˜π‘€π‘˜3(𝑒1,𝑒2;𝑣1,𝑣2) denote bispectrum of the frame π‘”π‘˜βˆ’1(π‘₯,𝑦) and noise, respectively. If the probability density function of the noise is symmetrical, that is, 𝑝(𝑛)=𝑝(βˆ’π‘›), or at least not skewed, that is, βˆ«π‘›3𝑝(𝑛)𝑑𝑛=0, then the term π΅π‘€π‘˜π‘€π‘˜π‘€π‘˜3(𝑒1,𝑒2;𝑣1,𝑣2) is negligible which renders the triple-correlation very effective in detecting a signal embedded in noise [10]. Thenπ΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2=π΅π‘”π‘˜βˆ’1π‘”π‘˜βˆ’1π‘”π‘˜βˆ’13𝑒1,𝑒2;𝑣1,𝑣2.(5)The cross-bispectrum is defined as follows:π΅π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2𝐢=β„±π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3ξ‚€π‘Ÿ1,π‘Ÿ2;𝑠1,𝑠2=πΊπ‘”π‘˜ξ‚€π‘’1,𝑒2ξ‚πΊπ‘”π‘˜βˆ’1𝑣1,𝑣2ξ‚πΊβˆ—π‘”π‘˜ξ‚€π‘’1+𝑣1,𝑒2+𝑣2.(6)Then,π΅π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2=π΅π‘”π‘˜βˆ’1π‘”π‘˜βˆ’1π‘”π‘˜βˆ’13𝑒1,𝑒2;𝑣1,𝑣2ξ‚β‹…π‘’βˆ’π‘—2πœ‹(𝑣1π‘‘π‘˜π‘₯+𝑣2π‘‘π‘˜π‘¦).(7)Using the relation in (5), we can transform (7) asπ΅π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2=π΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3𝑒1,𝑒2;𝑣1,𝑣2ξ‚β‹…π‘’βˆ’π‘—2πœ‹(𝑣1π‘‘π‘˜π‘₯+𝑣2π‘‘π‘˜π‘¦).(8)

As we can see from (3), the bispectrum has two vector arguments containing totally four scalar frequency variables. Assuming that πΊπ‘”π‘˜(𝑒1,𝑒2) is an N-by-N discrete Fourier transform of π‘”π‘˜(π‘₯,𝑦), the bispectrum becomes a four-dimensional N-by-N-by-N-by-N matrix. It is therefore not practical to evaluate the whole bispectrum. A better solution is to take 2D slices of the 4D spectrum. There are basically various ways of defining these slices, but we will only consider the case whereπ΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3,𝑙𝑒1,𝑒2=π΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3𝑒1,𝑒2;𝑙𝑒1,𝑙𝑒2ξ‚π΅βˆ€π‘™βˆˆπ‘…,π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3,𝑙𝑒1,𝑒2=π΅π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3𝑒1,𝑒2;𝑙𝑒1,𝑙𝑒2ξ‚βˆ€π‘™βˆˆπ‘….(9)

Although we have now taken only a small portion of the whole spectrum [11], it can be shown that the motion vector is still possible and no essential information has been lost.

Thus, the third-order hologram, β„Žπ‘™(π‘Ÿ1,π‘Ÿ2), is defined byβ„Žπ‘™ξ‚€π‘Ÿ1,π‘Ÿ2=β„±βˆ’1[π΅π‘”π‘˜π‘”π‘˜βˆ’1π‘”π‘˜3,𝑙𝑒1,𝑒2|||π΅π‘”π‘˜π‘”π‘˜π‘”π‘˜3,𝑙𝑒1,𝑒2|||ξ‚€π‘Ÿ]=𝛿1βˆ’π‘‘π‘˜π‘₯,π‘Ÿ2βˆ’π‘‘π‘˜π‘¦ξ‚.(10)

As a result, by finding the location of the pulse in (10) we are able to tell the displacement, which is the motion vector. Since third-order statistics are used, the method is insensitive (in theory) to both spatially and temporally corrupted by noise which is symmetrically distributed (e.g., Gaussian). In practice, the motion vector is not an impulse; hence, we estimate (π‘‘π‘˜π‘₯,π‘‘π‘˜π‘¦) as the index (π‘Ÿ1,π‘Ÿ2), which maximizes |β„Žπ‘™(π‘Ÿ1,π‘Ÿ2)|.

The co-ordinates (π‘Ÿ1π‘š,π‘Ÿ2π‘š) of the maximum of the real-valued array β„Žπ‘™(π‘Ÿ1,π‘Ÿ2) can be used as an estimate of the horizontal and vertical components of motion between π‘”π‘˜(π‘₯,𝑦) and π‘”π‘˜βˆ’1(π‘₯,𝑦) as follows:ξ‚€π‘Ÿ1π‘š,π‘Ÿ2π‘šξ‚ξ‚€β„Ž=argmaxπ‘™ξ‚€π‘Ÿ1,π‘Ÿ2.(11)

3. Subpixel Accuracy

Subpixel performance is a critical element of the proposed algorithm. With reference to our previously published work [12, 13], we are introducing a number of important new features, which improve the accuracy of the motion estimates.

Subpixel accuracy of motion measurements is obtained dy variable-separable fitting performed in the neighborhood of the maximum using one-dimensional quadratic function. Using the notation in (11) above, prototype functions are fitted to the tripletsξ‚†β„Žπ‘™ξ‚€π‘Ÿ1π‘šβˆ’1,π‘Ÿ2π‘šξ‚,β„Žπ‘™ξ‚€π‘Ÿ1π‘š,π‘Ÿ2π‘šξ‚,β„Žπ‘™ξ‚€π‘Ÿ1π‘š+1,π‘Ÿ2π‘šξ‚†β„Žξ‚ξ‚‡,(12)π‘™ξ‚€π‘Ÿ1π‘š,π‘Ÿ2π‘šξ‚βˆ’1,β„Žπ‘™ξ‚€π‘Ÿ1π‘š,π‘Ÿ2π‘šξ‚,β„Žπ‘™ξ‚€π‘Ÿ1π‘š,π‘Ÿ2π‘š+1,(13)(ξπ‘‘π‘˜ξπ‘‘π‘₯,π‘˜π‘¦)that is, the maximum peak of the phase correlation surface and its two neighboring values on either side, vertically and horizontally.

The location of the maximum of the fitted function provides the required subpixel motion estimate ξπ‘‘π‘˜π‘₯. Fitting a parabolic function horizontally to the data triplet (12) yields a closed-from solution for the horizontal component of the motion estimate ξπ‘‘π‘˜β„Žπ‘₯=π‘™ξ‚€π‘Ÿ1π‘š+1,π‘Ÿ2π‘šξ‚βˆ’β„Žπ‘™ξ‚€π‘Ÿ1π‘šβˆ’1,π‘Ÿ2π‘šξ‚2𝐻,(14) as follows:ξ‚ƒβ„Žπ»=π‘™ξ‚€π‘Ÿ1π‘š+1,π‘Ÿ2π‘šξ‚βˆ’2β„Žπ‘™ξ‚€π‘Ÿ1π‘š,π‘Ÿ2π‘šξ‚+β„Žπ‘™ξ‚€π‘Ÿ1π‘šβˆ’1,π‘Ÿ2π‘š.where ξπ‘‘π‘˜π‘¦

The fractional part of the vertical component can be obtained in a similar way using (13) instead of (12).

Finally, the horizontal and vertical components of the subpixel accurate motion estimate are obtained by computing the location of the maxima of each of the above fitted quadratics.

In [14], it is shown that half-pixel accuracy motion vectors leads to a very significant improvement when compared to one-pixel accuracy, where as a higher precision results in negligible changes. Therefore, a half-pixel accuracy was chosen in our simulations.

4. Experimental Results

To prove the feasibility of the proposed method, we compared it to a PC technique implemented in a similar manner as our approach. In this section, we examine a few examples and compare the performance, efficiency, and complexity of the two methods. In our experiments, we used the well-known test sequences: foreman (176 pixels by 144 lines), mother-daughter and Stefan (352 pixels by 288 lines), table tennis (352 pixels by 240 lines). Although the original sequences are in color, only the luminance (brightness) component is used to estimate the motion vectors.

To assess the performances of the different motion estimation techniques, the following comparisons were made. First, the subjective quality of the estimated motion field was evaluated, showing the capability of the algorithm to estimate the true motion in the scene. Second, the PSNR of motion compensated was measured, giving insight about the quality of the prediction. Results obtained using the foreman, mother-daughter and Stefan sequences are shown in Figure 1. All image sequences are degraded with additive zero-mean Gaussian noise to a signal-to-noise ratio (SNR) of 10 dB. Our results demonstrate that the proposed method outperforms PC, achieving higher precision and a significantly smaller corresponding measurement error. This confirms the motion that the proposed technique of an image is a superior feature selector utilizing the portions of the image spectrum most likely to contribute to reliable motion estimation.

The ability of the bispectrum method to accurately estimate the displacement vector field from a degraded sequence is demonstrated in Figure 2. In this Figure, the estimated motion vector fields for the mother-daughter sequence using the two aforementioned motion estimation methods. The motion vectors estimated between the frames 126 and 127 are shown for the mother-daughter sequence. We can see that the estimates from the PC seem very random, but the bispectrum technique gives better results, producing the same motion vectors. Thus, the motion fields estimated by the β€œour approach” tend to be very smooth due to the smoothness constraint. Because of the noise-resistant property of the bispectrum, it produces more reliable estimates. Therefore, the proposed method motion estimation results globally in motion fields more representative of the true motion in the scene.

In terms of motion compensated images, from mother-daughter sequence, we observe better compensated images by the proposed method. We also observe that the motion compensated images for the β€œour method” are much closer to the original images. Thus, the β€œour scheme” is able to measure the motion vector more accurately and is more robust in general. Overall, the bispectrum typically offers better visual quality images than the PC method. Figure 3 gives examples of motion compensated images.

Comparisons of the PC and bispectrum methods indicate that the bispectrum is a robust technique for motion vector. Results of these comparisons are shown for different noise levels and video sequences. Additive Gaussian noise (AGN) was added to image sequences with an input SNR varying from 15 dB to 28 dB. Figure 4 shows the PSNR of the motion compensated prediction error for noise power of 15 dB for table tennis sequence, confirming that our scheme is consistently more immune to noise. Experiments with other levels of noise power demonstrated that similar performance gains are achievable. Similar experiments were performed for the other sequences. The results further confirm that the β€œour method” consistently outperforms PC. In terms of complexity, this is measured by the computation time. All the computations are performed on Intel(R) Pentium(R)D CPU 3.4 GHz with Windows XP. The two algorithms have been implemented using a prototype written in MATLAB 6.5 R13. The comparison between the β€œour method” and the PC confirmed that the two methods have the complexity on the same order. This is shown in Table 1.

5. Conclusion

In this paper, the bispectrum method for subpixel motion estimation of noisy image sequences in frequency-domain was presented. The β€œour” proposed method provides an advantage over the PC algorithm in the presence of AGN. With β€œour method,” the displacement vector field is smoother, providing a more accurate measure of object motion. At relatively low noise levels, the bispectrum performance is comparable to its performance in the noise-free environment. At high noise levels SNR around 10 dB, the PC fails, yet even under these extreme conditions, the bispectrum provides improvement in performance over the PC algorithm. In addition to its PSNR performance, the bispectrum also yields smooth motion fields.