Abstract

Multiview video consists of multiple views of the same scene. Achieving high image quality requires an enormous amount of data, so efficient compression of multiview video is indispensable. In this paper, we explore an efficient fractal video codec for multiview compression. The proposed scheme first compresses a view-dependent geometry of the base view using a fractal video encoder with a homogeneous region condition. With an extended fractional pel motion estimation algorithm and a fast disparity estimation algorithm, it then generates prediction images of the other views, using image-based rendering techniques applied to the decoded video, and the residual signals are obtained from the prediction images and the original images. Finally, the residual signals are encoded by the fractal video encoder. The scheme also exploits the statistical dependencies of both temporal and interview reference pictures for motion-compensated prediction. Experimental results show that the proposed algorithm is consistently better than JMVC8.5, with a 62.25% bit rate decrease and a 0.37 dB PSNR increase based on the Bjontegaard metric, while the total encoding time (TET) is reduced by 92%.

1. Introduction

There are several widely used image/video compression standards, for example, JPEG, MPEG, and H.26X; the video standards are all based on motion estimation/compensation (ME/MC) [1]. Fractal compression, by contrast, is based on iterated function system (IFS) theory rooted in the fractal geometry of Mandelbrot [2] and offers a different approach to image coding. After Jacquin [3] proposed the first automatic algorithm for fractal coding of still images, much effort was devoted to fractal still image coding techniques; for example, Fisher used a quadtree partition to improve the method [4].

With the development of fractal image compression, fractal coding has also been applied to video sequence compression [5]. Wang and Lai proposed a hybrid compression algorithm [6] that merges the advantages of a cube-based fractal compression method and a frame-based fractal compression method, and discussed an adaptive partition instead of a fixed-size partition. The adaptive partition and the hybrid algorithm achieve a relatively high compression ratio for videoconference sequences, but the computational complexity is high.

With the advantages of high compression ratio and resolution independence, fractal coding has attracted further research. Distasi et al. tried to overcome the long encoding time [7]. Wu et al. proposed a schema genetic algorithm to find the best self-similarity in fractal image compression [8]. Zhu et al. proposed an improved fractal compression method that uses a more effective macroblock partition scheme instead of the classical quadtree partition and provides promising performance at low bit rates [9]. Giusto et al. used IFS and wavelet decomposition to achieve spatial zoom and slow-motion replay of digital color video sequences [10]; active scene detection, wavelet subband analysis, and color fractal coding based on the earth mover's distance (EMD) measure were used to reduce the computational load and improve visual quality. A new classification scheme was proposed for high speed and low power consumption [11]; the classification is devised with the intention of being hardware realizable, and a scheme is also laid out to store the classified domain and range data in order to accommodate easy hardware implementation. Recently, Shiping Zhu improved traditional fractal video coding with a new hexagon block-matching motion estimation technique, which improves the compression efficiency [12].

Multiview video consists of multiple views of the same scene, with a high degree of correlation both between views (interview) and within each view (intraview). Multiview video coding (MVC) is the key technology behind the features and applications of three-dimensional television (3DTV). 3DTV displays present the left and right views of the same scene, captured from slightly different viewing angles, and rely on the human brain to fuse them into a 3D representation. It will undoubtedly be an attractive and effective technique for realizing 3D visual communication in the near future [13]. A novel intermediate view synthesis method based on disparity estimation was presented in [14]: the left and right parts of the image were divided into three kinds of regions using strong and weak consistency constraints separately, and different strategies were then applied according to the characteristics of each region. Ryu et al. proposed an adaptive competition method to increase the accuracy of motion vector (MV) prediction and to save bit rate in MVC [15]; motion vector predictors for INTER mode and SKIP (DIRECT) mode are optimally selected from a given adaptive set of predictors by a slightly modified rate-distortion criterion. An interview motion model in terms of a global geometric transformation, representing the motion correlation between two adjacent views, was proposed in [16]. Compared with a traditional single-channel video system, the raw data volume of multiview video is much larger, which makes it indispensable to compress multiview video.

The rest of the paper is organized as follows. The theory of fractal coding is summarized in Section 2. The proposed efficient fractal video sequences codec with multiviews is presented in Section 3. The experimental results are presented in Section 4. Finally, the conclusions are outlined in Section 5.

2. The Fractal Compression Theory and Mathematical Background

Let $\mu(x, y)$ be the image luminance of a pixel at position $(x, y)$, and let $\mathcal{R} = \{R_1, \dots, R_N\}$ be the set of $N$ nonoverlapping range blocks (i.e., collections of pixel coordinates) partitioning the image. Similarly, let $\mathcal{D} = \{D_1, \dots, D_M\}$ be the set of $M$, possibly overlapping, domain blocks covering the image. Finally, let $r_i = \{\mu(x, y) : (x, y) \in R_i\}$ and $d_j = \{\mu(x, y) : (x, y) \in D_j\}$ denote the luminance values of range block $R_i$ and domain block $D_j$, respectively.

For each range block $R_i$ the goal is to find a domain block $D_j$ and a contractive mapping $w$ that jointly minimize a dissimilarity (distortion) criterion $E\big(R_i, w(D_j)\big)$. The contractive affine mapping $w$ consists of three submappings.

(1) Contraction $C$. The contraction is usually preceded by lowpass antialias filtering; for example, a 2-fold contraction with four-neighbor averaging: $C[\mu](x, y) = \frac{1}{4}\sum_{(u, v) \in \mathcal{N}(x, y)} \mu(u, v)$, where $\mathcal{N}(x, y)$ is the first-order neighborhood (north-east-west-south).

(2) Photometric transformation $P$ (accounts for different dynamic ranges of pixels in the range and domain blocks). $P(\mu) = s\mu + o$, where $s$ is a scaling factor (gain) and $o$ is an offset.

(3) Geometric transformation $G$ (inverse mapping: range $\to$ domain). $G(x, y) = A\,(x, y)^{T} + t$, where $(x, y) \in R_i$, $A$ is a $2 \times 2$ matrix, and $t$ is a translation vector (this mapping must be 1-to-1 between pixels of the range and domain blocks).

The overall transformation $w$ that maps a domain-block pixel into the range-block pixel at $(x, y)$ is defined by
$$w[\mu](x, y) = (P \circ C)[\mu]\big(G(x, y)\big) = s\,C[\mu]\big(G(x, y)\big) + o,$$
where $\circ$ is the composition operator. The above general expression can be simplified by constraining the geometric transformation to eight cases: four rotations (0°, 90°, −90°, 180°) and four mirror reflections (mid-horizontal, mid-vertical, first diagonal, second diagonal). We denote the set of possible isometric transformations by $\mathcal{T} = \{T_0, \dots, T_7\}$. Furthermore, by expressing the geometric mapping implicitly as the index $j$ of the domain block $D_j$ in $\mathcal{D}$, we can write $w$ as follows:
$$w[\mu](x, y) = s\,T_k\big(C[\mu_{D_j}]\big)(x, y) + o, \qquad T_k \in \mathcal{T}.$$
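To make the eight candidate isometries concrete, the following sketch enumerates them for a square block stored as a NumPy array, together with the 2-fold contraction by pixel averaging. It is a minimal sketch only; the function names (contract, isometries) are illustrative and not taken from the paper.

```python
import numpy as np

def contract(domain_block: np.ndarray) -> np.ndarray:
    """2-fold contraction: average each 2x2 group of pixels
    (e.g., a 16x16 domain block becomes an 8x8 block)."""
    h, w = domain_block.shape
    return domain_block.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def isometries(block: np.ndarray):
    """Yield the eight isometric versions of a square block:
    four rotations (0, 90, -90, 180 degrees) and four mirror
    reflections (mid-horizontal, mid-vertical, two diagonals)."""
    yield block                   # identity (0 degrees)
    yield np.rot90(block, 1)      # 90 degrees
    yield np.rot90(block, -1)     # -90 degrees
    yield np.rot90(block, 2)      # 180 degrees
    yield np.flipud(block)        # mid-horizontal reflection
    yield np.fliplr(block)        # mid-vertical reflection
    yield block.T                 # first-diagonal reflection
    yield np.rot90(block, 2).T    # second-diagonal reflection
```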

In order to encode range block $R_i$, a search for the index $j$ (domain block $D_j$) and for an isometric transformation $T_k$ must be executed, jointly with the computation of the photometric parameters $s$ and $o$. This can be performed by minimizing the following mean-squared error:
$$E_i = \min_{j, k, s, o}\;\frac{1}{|R_i|}\sum_{(x, y) \in R_i}\Big(\mu(x, y) - s\,T_k\big(C[\mu_{D_j}]\big)(x, y) - o\Big)^2, \qquad i = 1, \dots, N,$$
where $N$ denotes the number of range blocks. While the isometric transformation $T_k$ and the index $j$ (equivalent to the translation $t$) are usually found by exhaustive search, the scaling $s$ and offset $o$ are computed in closed form as
$$s = \frac{\sum_{(x, y) \in R_i}\big(\mu(x, y) - \bar{r}_i\big)\big(\hat{d}_j(x, y) - \bar{d}_j\big)}{\sum_{(x, y) \in R_i}\big(\hat{d}_j(x, y) - \bar{d}_j\big)^2}, \qquad o = \bar{r}_i - s\,\bar{d}_j,$$
where $\hat{d}_j = T_k\big(C[\mu_{D_j}]\big)$ is the contracted and isometrically transformed domain block and $\bar{r}_i$ and $\bar{d}_j$ represent the mean of the luminance values in the range and domain blocks, respectively. Instead of transmitting the photometric offset $o$ (in addition to $s$, $j$, and $T_k$), the mean value $\bar{r}_i$ of the range block can be transmitted. This permits a precise representation of the mean local luminance, but to assure convergence at the decoder without a constraint on the luminance scaling coefficient it requires a modification of the photometric transformation. This can be considered as orthogonalization with respect to the constant blocks and has been treated in detail in [17].
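As an illustration of the closed-form photometric fit and the exhaustive search over domain blocks and isometries, the sketch below computes $s$, $o$, and the resulting per-block MSE for one candidate pairing and then scans all candidates. It reuses contract() and isometries() from the previous sketch; all names are illustrative, not code from the paper.

```python
import numpy as np

def photometric_fit(range_block: np.ndarray, domain_block: np.ndarray):
    """Least-squares scaling s and offset o mapping the (already contracted
    and isometrically transformed) domain block onto the range block,
    plus the resulting mean-squared error."""
    r = range_block.astype(np.float64).ravel()
    d = domain_block.astype(np.float64).ravel()
    d0 = d - d.mean()
    denom = float(np.dot(d0, d0))
    s = float(np.dot(r - r.mean(), d0)) / denom if denom > 0 else 0.0
    o = float(r.mean() - s * d.mean())
    mse = float(np.mean((s * d + o - r) ** 2))
    return s, o, mse

def best_match(range_block, domain_blocks):
    """Exhaustive search over domain candidates and the eight isometries;
    contract() and isometries() are the helpers sketched above."""
    best = None
    for j, dom in enumerate(domain_blocks):
        for k, iso in enumerate(isometries(contract(dom))):
            s, o, mse = photometric_fit(range_block, iso)
            if best is None or mse < best[0]:
                best = (mse, j, k, s, o)
    return best  # (mse, domain index, isometry index, s, o)
```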

3. The Proposed Scheme

We encode the multiview sequences with the fractal codec. Compared with the previous fractal encoder, the monocular video codec adds a homogeneous region condition and an improved motion estimation, as shown in Figure 1. A fast disparity estimation algorithm and a temporal-spatial prediction structure are also proposed in this paper.

3.1. Extended Fractional Pel Motion Estimation Algorithm

The traditional diamond pattern search uses a large diamond search pattern (LDSP) and a small diamond search pattern (SDSP). The LDSP consists of nine checking points, eight of which surround the center one to compose a diamond shape; the SDSP comprises five checking points forming a smaller diamond. Several specific features have been incorporated into motion estimation (ME) to improve coding efficiency, but they result in a very high computational load. The difference between the integer pel and the fractional pel matching error surfaces is that, within the search window, the former is far from unimodal, which easily results in trapping in a local minimum. Because of the smaller search range of subpel ME and the high correlation between fractional pels, the starting search point can be predicted precisely and the fractional pel MV search can be terminated early. The proposed extended fractional pel motion estimation algorithm employs the SDSP and a square pattern and is driven by the predicted motion vector: the fractional candidate used as the starting MV of the current macroblock is obtained by fast motion vector prediction (FMVP) from the neighboring blocks. Instead of iterating the diamond pattern, the fractional pel search is confined to the small neighborhood of the FMVP, which contains 50%–90% of the accurate MVs and greatly reduces the time spent searching for the best-matching block. We use the best and second best sum of absolute differences (SAD) among the four diamond points at fractional pel positions. If the best and second best matching points lie at opposite positions, the four surrounding points of the square pattern are compared; if the best matching point is adjacent to the second best matching point, one additional point of the square pattern is checked. For illustration, Figure 2 shows a simple search path example of the extended diamond pattern search algorithm.
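The decision logic described above can be summarized as follows. This is only one possible reading of the search, expressed in fractional-pel units around the FMVP; sad_at() is a hypothetical callback returning the SAD of the current macroblock at a given fractional-pel motion vector, not an interface from the paper.

```python
DIAMOND = [(1, 0), (-1, 0), (0, 1), (0, -1)]    # SDSP points around the centre
SQUARE = [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # square-pattern (corner) points

def fractional_pel_search(fmvp, sad_at):
    """Refine the FMVP at fractional-pel accuracy without iterating the
    diamond pattern; fmvp is an (x, y) tuple in fractional-pel units."""
    candidates = {fmvp: sad_at(fmvp)}
    for dx, dy in DIAMOND:
        mv = (fmvp[0] + dx, fmvp[1] + dy)
        candidates[mv] = sad_at(mv)

    ranked = sorted(candidates.items(), key=lambda kv: kv[1])
    best, second = ranked[0][0], ranked[1][0]
    b = (best[0] - fmvp[0], best[1] - fmvp[1])
    s2 = (second[0] - fmvp[0], second[1] - fmvp[1])

    if b[0] == -s2[0] and b[1] == -s2[1]:
        # Best and second best lie on opposite sides: check all four
        # square-pattern points before deciding.
        extra = SQUARE
    else:
        # Best and second best are adjacent: one corner between them suffices.
        extra = [(b[0] + s2[0], b[1] + s2[1])]

    for dx, dy in extra:
        mv = (fmvp[0] + dx, fmvp[1] + dy)
        if mv not in candidates:
            candidates[mv] = sad_at(mv)

    return min(candidates, key=candidates.get)
```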

With acceptable PSNR and compression ratio, the proposed extended fractional pel motion estimation algorithm speeds up the coding process greatly, which is very important for fractal video coding.

3.2. Homogeneous Region Condition

In general, the decision of an optimum matching block for motion estimation depends on the spatial and temporal homogeneity of the video sequence. In fractal coding, the root mean square (RMS) error is used as the matching error between a D block and an R block. Given two vectors $D = (d_1, \dots, d_n)$ (from the D block) and $R = (r_1, \dots, r_n)$ (from the R block) and appropriate values $s$ and $o$, the minimum matching error is obtained from
$$\mathrm{RMS}^2 = \frac{1}{n}\sum_{i=1}^{n}\big(s\,d_i + o - r_i\big)^2. \qquad (5)$$

Setting the appropriate contrast (scaling) factor $s$ and brightness (offset) factor $o$, the transformed value $s\,d_i + o$ has the minimum squared distance from $r_i$. Setting the partial derivatives with respect to $s$ and $o$ to zero, the parameters that minimize the RMS error are
$$s = \frac{n\sum_{i=1}^{n} d_i r_i - \sum_{i=1}^{n} d_i \sum_{i=1}^{n} r_i}{n\sum_{i=1}^{n} d_i^2 - \big(\sum_{i=1}^{n} d_i\big)^2}, \qquad (6)$$
$$o = \bar{r} - s\,\bar{d}, \qquad (7)$$
where $\bar{d}$ denotes the average value of the vector $D$ and $\bar{r}$ denotes the average value of the vector $R$. Due to (6) and (7), (5) can be inferred as follows:
$$\mathrm{RMS}^2 = \frac{1}{n}\Big[\sum_{i=1}^{n}\big(r_i - \bar{r}\big)^2 - s^2\sum_{i=1}^{n}\big(d_i - \bar{d}\big)^2\Big]. \qquad (8)$$

Let $x_i = d_i - \bar{d}$ and $y_i = r_i - \bar{r}$; obviously, $\sum_{i=1}^{n} x_i = 0$ and $\sum_{i=1}^{n} y_i = 0$. Then (8) can be inferred as follows:
$$\mathrm{RMS}^2 = \frac{1}{n}\Big[\sum_{i=1}^{n} y_i^2 - \frac{\big(\sum_{i=1}^{n} x_i y_i\big)^2}{\sum_{i=1}^{n} x_i^2}\Big]. \qquad (9)$$
Let
$$\gamma = \frac{\big(\sum_{i=1}^{n} x_i y_i\big)^2}{\sum_{i=1}^{n} x_i^2}.$$

For each R block, $\sum_{i=1}^{n} y_i^2$ is already known, so the value $\gamma$ is used in the rough optimum matching block decision: searching for the candidate block with the least RMS error is equivalent to searching for the one with the largest $\gamma$.

For each R block in the matching process, the block statistic is compared against a threshold $T$; if the homogeneity condition is met, the macroblock is classified into a spatially homogeneous region, and no further comparison is necessary to determine the best RMS error. $T$ is determined experimentally in advance and is set to 0.1 in the experiments.
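The following sketch shows how the quantities in (9) can be precomputed per R block and reused while scanning candidate D blocks. Since the exact form of the homogeneity test is not fully reproduced above, the comparison of the normalized R-block variance against the threshold 0.1, and the scaling of the intensities it implies, are assumptions of this sketch.

```python
import numpy as np

T_HOMOGENEOUS = 0.1  # threshold from the text; the statistic it is compared
                     # against (and its scaling) is an assumption here

def block_stats(r_block: np.ndarray):
    """Zero-mean R-block vector y and its energy sum(y_i^2), which stay
    fixed while candidate D blocks are scanned."""
    y = r_block.astype(np.float64).ravel()
    y -= y.mean()
    return y, float(np.dot(y, y))

def matching_gain(y: np.ndarray, d_block: np.ndarray) -> float:
    """gamma = (sum x_i y_i)^2 / sum x_i^2 from (9); maximizing gamma over
    candidates minimizes the RMS error because sum(y_i^2) is fixed."""
    x = d_block.astype(np.float64).ravel()
    x -= x.mean()
    xx = float(np.dot(x, x))
    return float(np.dot(x, y)) ** 2 / xx if xx > 0 else 0.0

def search_r_block(r_block, candidate_d_blocks):
    y, sum_y2 = block_stats(r_block)
    n = y.size
    # Assumed homogeneity test: a nearly flat R block is matched trivially,
    # so the candidate scan is skipped.
    if sum_y2 / n < T_HOMOGENEOUS:
        return None  # homogeneous region: no domain search needed
    best = max(candidate_d_blocks, key=lambda d: matching_gain(y, d))
    rms = np.sqrt(max(sum_y2 - matching_gain(y, best), 0.0) / n)
    return best, rms
```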

3.3. Fast Disparity Estimation Algorithm

Geometric constraints between neighboring frames in multiview video sequences can be used to eliminate spatial redundancy. Motion estimation and disparity estimation are performed jointly, and the prediction with the minimum error is chosen. Three parallax distribution constraints are used in this paper: the epipolar constraint, the directional constraint, and the spatial-temporal correlation.

Epipolar constraint: the epipolar geometry is the intrinsic projective geometry between two views. For a point in the left image, an epipolar line can be found in the right image on which the corresponding point must lie, so the search can be restricted to that line. For a parallel camera setup, the search is needed only along the horizontal scan line.

Directional constraint: for the same scene, the left perspective projection image is shifted to the right relative to the right image. Therefore, searching in only one direction is necessary.

Spatial-temporal correlation: the disparity field varies continuously, so disparity vectors within the same frame are strongly correlated. Between two adjacent frames, only a few pixels move and the positions of most pixels remain unchanged.

Therefore, the disparity vector of the corresponding block in the previous frame can be used as the starting point of a small-area search to find the actual disparity vector quickly. From the three constraints above, the best matching block of a given block is mainly located in a narrow area along the epipolar line. This characteristic helps to reduce the number of search candidates and further decrease the estimation time. The fast disparity estimation algorithm (FDEA) makes full use of the correlation between the left and right views, finds the minimum matching error more quickly, and thus greatly improves the coding efficiency.
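A minimal sketch of a disparity search under these three constraints is given below. block_sad() is a hypothetical cost function for the current block shifted by an integer disparity along the scan line (epipolar constraint, parallel setup); only non-negative disparities are tried (directional constraint); the search window around the previous frame's disparity vector (local_range) is an assumed parameter, not a value from the paper.

```python
def disparity_search(block_sad, prev_dv: int, max_disp: int, local_range: int = 8) -> int:
    """Return the integer disparity with the lowest cost in a small window
    centred on the disparity of the co-located block in the previous frame."""
    start = min(max(0, prev_dv), max_disp)   # temporal correlation: reuse prev_dv
    lo = max(0, start - local_range)         # directional constraint: d >= 0
    hi = min(max_disp, start + local_range)  # epipolar constraint: scan-line only
    return min(range(lo, hi + 1), key=block_sad)
```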

3.4. Temporal-Spatial Prediction Structure

Multiview video sequences are captured by several cameras at the same time, so there is a high degree of correlation both between views and within each view. The prediction structure in the spatial (interview) domain is different from the temporal case. In the temporal domain, data correlation is maximized when the time separation is small, so the next or previous frame is used as the reference picture in the temporal prediction of a single video sequence. For multiview video, however, the interview correlation is not necessarily proportional to the distance between cameras: even if two camera positions are far apart, the pair is not necessarily less correlated than other pairs, because other cameras may be rotated so that the captured images are very different. Thus the camera position alone should not determine the interview prediction structure. The higher the interview similarity, the greater the data redundancy. We therefore propose an algorithm that builds a geometric prediction structure based on the view center and the view distance. When processing the multiview video signal, disparity prediction and motion prediction based on a multireference mode are combined to reduce the number of intraframe coded frames and improve the view compensation efficiency.

The proposed prediction structure is shown in Figure 3; it contains 5 views, and each view applies an IPP prediction structure. The centre view channel starts from a homo-I-frame coded with the discrete cosine transform (DCT), and the other channels are coded based on disparity/motion compensation: the first frame of each of the four side views is coded by the proposed disparity prediction (FDEA) from the centre view, and the remaining frames are coded by the proposed fractal video coding.
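As an illustration, the sketch below encodes one possible reading of this temporal-spatial structure (5 views, IPP within each view, the centre view anchored by the homo-I-frame). The exact reference arrangement of Figure 3 may differ, so treat the mapping as an assumption.

```python
def reference_for(view: int, frame: int, centre_view: int = 2):
    """Reference selection for one possible reading of Figure 3:
    - centre view, frame 0: intra (homo-I) frame;
    - side views, frame 0: disparity prediction (FDEA) from the centre view;
    - all later frames: fractal motion compensation from the previous frame
      of the same view (IPP structure)."""
    if frame == 0:
        if view == centre_view:
            return ("intra", None)
        return ("disparity", (centre_view, 0))
    return ("motion", (view, frame - 1))

# Example: list the assumed references for 5 views and 3 frames.
for t in range(3):
    for v in range(5):
        print(v, t, reference_for(v, t))
```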

Figure 4 is the geometric schematic diagram of temporal-spatial prediction. Disparity and motion estimation are used for prediction to reduce data redundancy adequately.

4. Experimental Results

To verify the performance of the proposed method, Figures 5 and 6 illustrate the comparison of the average performance among the proposed method, JMVC8.5, and the different methods in [18] on the standard multiview sequences "Ballroom" and "Race." In [18], segmentation is applied first: each frame is divided into a flat background region, a complex background region, and a foreground region by an adaptive threshold. Second, predictive modes are filtered further according to the texture direction of the macroblocks, which eliminates some unnecessary modes and speeds up the mode search [18]. Third, the prediction structure adjusts the selection of the I-view and removes some interview predictions [18]. The maximum and minimum partition block sizes are pixels and pixels, respectively. The experiment is performed on a PC (CPU: Intel Core2 E6300, 1.98 GHz; RAM: 2 GB DDR2). The test conditions are shown in Table 1.

From Figures 5 and 6 and Table 2, one can see that the proposed fractal multiview video codec reduces the encoding time and improves the coding efficiency greatly, which makes more real-time applications possible. As the QP decreases, the performance of the proposed algorithm remains superior to that of JMVC8.5 and the algorithms in [18] on "Ballroom" and "Race." The more data that need to be compressed, the better the performance obtained, owing to the precise fractional pel prediction, while disparity and motion estimation further reduce the data redundancy. With better decoded image quality, the proposed method uses a lower bit rate and less coding time.

Figure 7 shows the original image and the decoded images of the 3rd frame obtained with JMVC8.5 and with the proposed method.

To verify the performance at small QP values, the proposed method is compared with the state-of-the-art JMVC8.5 [19]. The configuration of the JMVC8.5 simulation environment is shown in Table 3. The public multiview sequences "Flamenco2" and "Race" are used for the performance comparison.

Figure 8 illustrates that the coding efficiency of the proposed algorithm improves considerably compared with JMVC8.5. The gain reaches a 33% bit rate saving for "Flamenco2" and a 35% bit rate saving for "Race," and the coding performance is about 0.49 dB higher for "Flamenco2" and 0.38 dB higher for "Race" than that of JMVC8.5. For the comparison of computational complexity, we use the total encoding time (TET). As shown in Figure 9, the TET of the proposed algorithm is reduced by 92% on average. The proposed method is therefore very effective.

In order to facilitate the comparison, the average performances computed with the Bjontegaard metric [20] are shown in Table 4. We note that the proposed method is consistently better than JMVC8.5, with 62.25% bit rate decrease and 0.37 dB PSNR increase.
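For reference, the Bjontegaard comparison follows the standard procedure of fitting each rate-distortion curve with a third-order polynomial and comparing the integrals over the common quality range [20]. The sketch below is a common implementation of the BD-rate part of that procedure (given at least four rate/PSNR points per codec); it is not code taken from the paper, and BD-PSNR is obtained analogously by swapping the roles of rate and PSNR.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Average bit-rate difference (%) of the test codec relative to the
    anchor over the overlapping PSNR range: fit log10(rate) as a cubic
    polynomial of PSNR for each codec and compare the integrals."""
    lr_a = np.log10(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log10(np.asarray(rate_test, dtype=float))
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100  # negative values mean a bit-rate saving
```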

5. Conclusion

In this paper, an efficient fractal video sequences codec with multiviews is presented to improve the encoding performance. It makes full use of the characteristics of fractal video coding and of the nature of multiview video. Compared with JMVC8.5, better results are obtained with the proposed method.

We first improved the motion estimation and proposed the homogeneous region condition to obtain better performance for the base view. In addition, the fast disparity estimation algorithm and the temporal-spatial prediction structure are applied to further raise the compression efficiency. The proposed algorithm spends less encoding time and achieves a higher compression ratio with better decoded image quality. Experimental results show that the proposed algorithm is consistently better than JMVC8.5, with a 62.25% bit rate decrease and a 0.37 dB PSNR increase under the Bjontegaard metric. The method exploits the features of fractal coding and of video motion, achieves a substantial improvement, and builds a good foundation for further research on multiview fractal video coding and related coding methods.

Acknowledgments

The project is sponsored by the National Natural Science Foundation of China (NSFC) under Grants no. 61375025, no. 61075011, and no. 60675018 and by the Scientific Research Foundation for the Returned Overseas Chinese Scholars from the State Education Ministry of China. The authors are grateful for this financial support. The authors would also like to express their appreciation to the anonymous reviewers for their insightful comments, which helped improve this paper.