Abstract

We propose an innovative scheme for multiple description coding (MDC) with region of interest (ROI) support, intended for high-quality television. The scheme splits the stream into two separate descriptors and preserves the quality of the region of interest even when one descriptor is completely lost. The residual part of the frame (the background) is instead modeled through a checkerboard pattern, alternating the strength of the quantization. The decoder is provided with the side information necessary to reconstruct the frame properly, namely the ROI parameters and location, via a suitable data hiding procedure. Using data hiding, the reconstruction parameters are embedded in the transform coefficients, thus allowing an improvement in the PSNR of the single descriptions at the cost of a negligible overhead. To demonstrate its effectiveness, the algorithm has been implemented in two different scenarios, using the reference H.264/AVC codec and an MJPEG framework, the latter to evaluate the performance in the absence of motion-compensated frames, on 720p video sequences.

1. Introduction

The widespread adoption of digital TV and high-quality broadcasting requires the definition of innovative schemes that guarantee improved robustness and error resilience of video streams. The robustness requirement is particularly important in areas where the digital signal is not strong enough and frequent errors that damage slices or even entire frames are common. In fact, one of the most significant drawbacks of current technologies for terrestrial video broadcasting is that, if the signal-to-noise ratio drops below a certain threshold, the program is obscured until the quality of the signal returns to the standard operational range. This is one of the main concerns that arose with Digital Video Broadcasting (DVB), and it is one of the pending issues that need to be addressed. In this respect, scalable coding can be regarded as a potential solution to the problem, allowing the stream to be split into a number of substreams. The substreams can provide reduced quality at a lower bit rate, and the full quality can be achieved by suitably combining all layers. In such a framework, either layered coding or MDC could be adopted as a viable approach [1].

MDC [2] has emerged as a promising approach and a valid alternative to more common layered coding standards for improving the error resilience of a video. Starting from a video source, the core idea consists of splitting the original stream into a predefined number of substreams, called descriptors. Descriptors are typically balanced, meaning that each of them has a comparable bit rate and provides similar quality when individually decoded. This is a major advantage compared to layered coding (from MPEG-2 to H.264/SVC), where the hierarchical organization of the stream strongly relies on the correct decoding of the base layer. In MDC, instead, each descriptor can be independently decoded at low but acceptable quality. The higher the number of received descriptors, the better the quality. If all substreams are received and properly merged, the original quality can be guaranteed.

In our work, we propose to spatially differentiate the quality of each description by defining a suitable region of interest (ROI) where the quantization strength is reduced, in order to improve the perceived quality within the desired area. Since this information (location, dimension, quality factor, etc.) must be shared with the decoder to correctly display the visual content, we exploit a watermarking algorithm, which allows improving the PSNR of each single description without significantly affecting the resulting overhead. The algorithm is tested on both MJPEG and H.264/AVC. The choice of performing the validation on two different coding schemes is motivated by recent developments in HDTV broadcasting research. In this area, bandwidth is no longer as critical an issue as it used to be; therefore, the transmission of static images instead of motion-compensated video could provide better perceptual quality and higher robustness to error propagation. The validation phase has been carried out taking into account standard metrics like PSNR, together with VQM, in order to determine the perceptual impact of the ROI and the watermarking scheme.

The paper is structured as follows. Section 2 reviews the most relevant works in the field of MDC and discusses the adoption of data hiding tools for nonconventional applications. The proposed MDC scheme is described in detail in Section 3, while ROI embedding is presented in Section 4. Experimental results are reported in Section 5, and Section 6 presents some concluding remarks.

2. Related Work

2.1. Multiple Description Coding

The generation of multiple streams imposes a variable amount of overhead (also referred to as redundancy), which is related to the number of descriptors and to the robustness requirements of the specific application. Given $N$ descriptors with rates $R_1, \ldots, R_N$, the overhead is in general calculated as in (1), where $R_{SD}$ refers to the rate of the video encoded in single-descriptor mode:

$$\rho = \frac{\sum_{i=1}^{N} R_i - R_{SD}}{R_{SD}} \qquad (1)$$

The overhead is caused by the need of storing in each descriptor some auxiliary information, such as

(i) headers, trailers, synchronization markers, and presentation data that must be replicated in order to make each single stream visible;
(ii) side information to facilitate the reconstruction in the presence of losses.
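As a minimal illustration of (1), the following Python sketch (function and variable names are ours, not part of any reference implementation) computes the redundancy from measured descriptor bitrates:

```python
def mdc_overhead(descriptor_rates, r_sd):
    """Redundancy of an MDC split, as in equation (1).

    descriptor_rates: bitrates R_1..R_N of the individual descriptors.
    r_sd: bitrate R_SD of the same video encoded in single-descriptor mode.
    """
    return (sum(descriptor_rates) - r_sd) / r_sd

# Example: two balanced 6 Mbit/s descriptors versus a 10 Mbit/s
# single-descriptor encoding yield 20% overhead.
print(mdc_overhead([6e6, 6e6], 10e6))  # 0.2
```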

One of the first effective applications of MDC to image coding, which is still a reference method, was proposed by Vaishampayan in [3]. His idea of a multiple description scalar quantizer was demonstrated, for the sake of generality, on a memoryless Gaussian source; however, it turns out to be effective also in image and video coding applications.

In the literature we find a large number of algorithms based on MDC, and one of the expected results has always been the ability to deploy effective implementations capable of maximizing the quality of the single-descriptor reconstruction while minimizing the overhead. This can be achieved through different techniques. The most common approaches are, on the one hand, related to the improvement of the single-descriptor quality by sharing relevant data among descriptors and, on the other hand, focused on the maximization of the number of substreams to provide diversity. In fact, the intrinsic nature of MDC does not impose any limitation on the number of descriptors, and it has been demonstrated that this goal can be achieved by applying, for example, wavelet-based approaches. Several implementations deal with these problems, as can be seen in [4–6], where the different strategies have been tested with up to sixteen descriptors. The methods proposed in these articles mostly deal with static images, since wavelet-based tools for video coding have not received much attention yet. However, the implementation of a high number of descriptors often introduces a non-negligible redundancy.

Otherwise, the achieved quality of the single-descriptor reconstruction is very poor, and the streaming provider usually prefers other alternatives, such as FEC, as discussed in [7]. In addition, a large number of descriptors would significantly increase the complexity at both the encoder and the decoder, making it almost impossible (with current technologies) to perform live video transmission. Therefore, the practical number of descriptors can be limited to a few units, according to the channel capacity and the network setup, even though most of the works in the literature adopt two descriptors only.

Even though several MDC techniques have been proposed in the video framework, both for MJPEG contents and for motion-compensated coding standards, little research has been conducted on the combination of MDC and H.264. In [8] the authors focus on a coding scheme designed for a MIMO architecture. In [9] an innovative standard-compliant approach is presented: descriptors are composed by alternating primary and redundant slices in a complementary manner, so that losses can be mitigated by replacing each missing portion of data with its redundant slice from the complementary descriptor. More recently, another MDC system was proposed in [10]. Here, the smoothness and edge features of DCT blocks are modeled in such a way that their perceptual tolerance against visual distortion is measured; the key components identified via this method are duplicated, while the remaining ones are split among descriptors.

2.2. Data Hiding

As far as data hiding is concerned, we apply the concept in a slightly different manner with respect to standard applications. Data hiding schemes are usually applied to digital rights management [11]. Such methods invisibly embed a watermark into cover data (the portion of the image in which the mark is inserted), allowing the identification of copyright violations, illegal copies, or illegal distributors by recovering the embedded information. In particular, classical watermarking applications include copy control, broadcast monitoring, fingerprinting, authentication, copyright protection, and access control [12]. Since the early 1990s, digital watermarking has played a key role in the security of multimedia communication and digital rights management. In the last few years, watermarking schemes have evolved significantly, and the classical spread spectrum techniques have been outperformed by so-called informed watermarking, such as Quantization Index Modulation (QIM) algorithms [13, 14].

As far as video watermarking techniques are concerned, H.264 requires the definition of new appropriate tools for both copyright protection and authentication. Few works have addressed this problem so far. A hybrid approach covering both aspects is presented in [15], embedding a robust mark in the transform coefficients and a fragile mark in the motion vectors. Authentication is instead achieved in [16] via a fragile watermark inserted in the skipped macroblocks of H.264 compressed sequences. Recently, the work in [17] proposed an H.264 video watermarking method that exploits a perceptual model to select the coefficients for the mark insertion. The goal consists, as usual, in increasing the size of the payload while guaranteeing a reduced impact on the visual quality, thus focusing on the standard problem of robustness to common signal processing attacks. The same issue is faced in [18].

In the last few years, data hiding has received considerable attention also in other research areas and has been adopted in non-security-oriented applications as well. As far as error concealment is concerned, some works exploit watermarking, where embedded information is extracted to detect and conceal errors. In [19], important data for each macroblock (MB) are extracted and inserted into the next frame with suitable MB-interleaving slice-based data hiding techniques for I and P frames; in the decoding phase, they can be exploited to conceal the corrupted MBs. Hidden information about the original (high-quality) data can also be used to estimate the quality degradation of the received data. Indeed, the extraction of the embedded information can support reduced-reference methods for quality assessment [20]. Recently, a simple data hiding technique has also been adopted in video surveillance applications to enhance the visual quality of faces [21].

A first proof of concept demonstrating the suitability of data hiding tools also in the MDC framework was proposed in our previous work [22], where we underlined the advantages of data hiding in improving MDC schemes while reducing the overhead. That approach embeds the DC coefficient of each block of the JPEG frame into the AC coefficients, thus avoiding its transmission.

3. The Proposed MDC Coding Scheme

According to the typical constraints of MDC, each descriptor should be

(i) independent,
(ii) nonhierarchical,
(iii) complementary,
(iv) balanced.

Following these principles, we propose to modify the quality of each MB on the basis of a predefined checkerboard pattern in which high- and low-quality MBs are alternated. In case only one descriptor out of two is correctly received and decoded, the achievable perceptual quality is reduced, due to the checkerboard pattern. Figure 1 (left) shows an example of a checkerboard pattern: white MBs are coarsely quantized, being the least expensive in terms of bit allocation, while a finer quantization is applied to black MBs.
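The pattern can be sketched as follows; this is an illustrative Python fragment under our own naming, not the actual encoder code. Each descriptor assigns the fine or coarse quantizer to a macroblock according to the parity of its checkerboard cell, and the two descriptors use complementary parities:

```python
def checkerboard_qp(mb_row, mb_col, grid, descriptor, qp_fine, qp_coarse):
    """QP of one macroblock in one descriptor (descriptor is 0 or 1).

    grid: side of a checkerboard cell in macroblocks (e.g., 2 for 2x2 cells).
    Complementary parities guarantee that every MB is finely quantized in
    exactly one of the two descriptors.
    """
    parity = (mb_row // grid + mb_col // grid) % 2
    return qp_fine if parity == descriptor else qp_coarse
```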

Since the amount of overhead and, therefore, the final bitrate are strongly related to the checkerboard structure and to the quantization factors of the MBs, we show in Figure 2 the relationship between the redundancy and the size of the grid when adopting H.264/AVC as the coding algorithm. In H.264/AVC, the quality factor is given by the quantization parameter (QP), whose values span the range [0, 51], with QP = 0 yielding the best achievable quality. In the presented configuration, we have chosen QP_L = 20 for high-quality macroblocks, while the coarsely quantized MBs have QP_H = 50. The same evaluation could be carried out using the MJPEG encoder. However, since only H.264 applies prediction also within intra frames, the MJPEG curve would be almost flat: the number of MBs with high and low quality is constant regardless of the grid size and, in particular, each block is independently coded with no relationship to the adjacent ones. For the MJPEG configuration, we have instead set the corresponding high and low quality factors, QF_H and QF_L.

The choice of a simple checkerboard pattern and QP (QF) alternation is reasonable in the presence of limited packet losses, but it may generate annoying artefacts when losses are significant. However, since users typically focus their attention on a specific area of the image when watching a video [23], we have introduced a mechanism to provide better perceptual quality of the single-description reconstruction in a ROI, by (i) reducing the gap between the two quantization levels in the ROI and (ii) decreasing the size of the grid in that area to the minimum size of one MB. Figure 1 (right) illustrates an example where the grid of the background is coarser, while the ROI grid is set to the minimum size of one MB. A new parameter (QP_ROI) for the H.264/AVC configuration and (QF_ROI) for the MJPEG configuration is introduced to replace QP_H and QF_H within the ROI; it is represented by the gray MBs in the figure. This parameter reduces the gap between the two categories of blocks, smoothing the quality differences. In order to make the scheme more flexible, we keep all parameters (Table 1) configurable at every intra frame, so as to adapt the ROI to the visual content. Furthermore, the size of the checkerboard can be chosen a priori and changed every GOP (in H.264) or every frame (in MJPEG), making it possible to adjust the quality and the corresponding overhead to the video source.
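A possible extension of the earlier checkerboard sketch to the ROI case is shown below; again, this is hypothetical code consistent with the description above, with QP_L, QP_H, and QP_ROI following the notation of this section:

```python
def qp_with_roi(mb_row, mb_col, descriptor, roi, grid_bg, qp_l, qp_h, qp_roi):
    """QP assignment with ROI refinement.

    roi: (top, left, bottom, right) rectangle in macroblock units.
    Inside the ROI the checkerboard shrinks to 1x1 MB cells and the coarse
    quantizer QP_H is replaced by the milder QP_ROI, reducing the gap.
    """
    top, left, bottom, right = roi
    if top <= mb_row < bottom and left <= mb_col < right:
        parity = (mb_row + mb_col) % 2       # minimum-size (1 MB) grid
        return qp_l if parity == descriptor else qp_roi
    parity = (mb_row // grid_bg + mb_col // grid_bg) % 2
    return qp_l if parity == descriptor else qp_h
```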

In order to allow the correct decoding of the video sequence, the ROI parameters must be transmitted to the receiver/decoder. This is usually achieved either by delivering the data out-of-band or by introducing new syntax elements in the stream. In our approach we have preferred to exploit data hiding techniques due to (i) the negligible impact on the resulting bitrate and (ii) the intrinsic nature of data hiding in terms of data protection, allowing only authorized users to recover the embedded information. The procedure of data embedding and recovery is described in detail in the next section.

4. ROI Parameters Embedding

The choice of a suitable data hiding scheme requires the estimation of the payload to be transmitted together with the video. For the sake of demonstration, we have roughly calculated a reasonable number of bits that need to be embedded as a watermark. According to our design, 48 bits are sufficient to encode the ROI information, including the size of the grid, the quantization parameters for the background and the ROI, and the ROI coordinates in the image plane (see Table 1). In order to allow a one-to-one comparison between the two coding schemes, QP and QF values are both represented with 6 bits; in the JPEG case, this means that only every second QF value can be chosen, since QF ∈ [0, 100] would otherwise require 7 bits. This approximation is reasonable and does not affect the achieved results significantly, since the resulting variation in visual appearance is negligible.
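As an illustration of the payload budget, a hypothetical packing routine is sketched below; the field layout and widths are our assumption (Table 1 defines the actual format), with quantizer values on 6 bits as stated above:

```python
def pack_roi_params(fields):
    """Pack (value, width) pairs MSB-first into one integer payload."""
    payload, total = 0, 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field out of range"
        payload = (payload << width) | value
        total += width
    return payload, total

# Illustrative 48-bit layout: grid size, ROI position/size, three quantizers.
payload, nbits = pack_roi_params([
    (4, 6),    # background grid size (MBs per checkerboard cell)
    (40, 9),   # ROI top-left column (MBs)
    (22, 9),   # ROI top-left row (MBs)
    (16, 6),   # ROI size code
    (20, 6),   # QP_L
    (50, 6),   # QP_H
    (35, 6),   # QP_ROI
])
assert nbits == 48
```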

In order to limit the additional computational burden of the algorithm, the complexity of the adopted data hiding technique must be kept as low as possible. Therefore, we have chosen a simple yet flexible watermarking scheme in the class of quantization index modulation (QIM) techniques [13]. The adopted embedding technique is the Binary Scalar Costa Scheme (BSCS) [24], whose embedding rule is

$$y = x + \alpha \left[ Q_{\Delta}\!\left( x - \Delta \left( \frac{d}{D} + k \right) \right) + \Delta \left( \frac{d}{D} + k \right) - x \right],$$

where $d$ is the watermark symbol, $x$ is the coefficient to be marked, $D$ is the cardinality of the alphabet of the watermark (2 in our case), $k$ is the watermarking key, $\alpha$ is the embedding strength, and $Q_{\Delta}(\cdot)$ represents scalar uniform quantization with step $\Delta$. The parameters are selected to minimize the impact of the inserted information, thus leading to perfect mark invisibility and very low complexity. Since the embedding of the ROI parameters has a negligible impact on the redundancy, it provides a powerful tool to improve the perceptual quality of the single description, as experimentally verified in Section 5.
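A minimal sketch of BSCS embedding and hard-decision extraction, directly implementing the formula above (the parameter values in the example are illustrative, not those used in our system):

```python
import math

def bscs_embed(x, d, key, delta, alpha=1.0, D=2):
    """Embed symbol d in coefficient x (key in [0, 1), quantizer step delta)."""
    offset = delta * (d / D + key)
    # Nearest point of the lattice {n*delta + offset} to the host coefficient.
    q = math.floor((x - offset) / delta + 0.5) * delta + offset
    return x + alpha * (q - x)

def bscs_extract(y, key, delta, D=2):
    """Hard decision: return the symbol whose lattice lies closest to y."""
    def dist(d):
        offset = delta * (d / D + key)
        q = math.floor((y - offset) / delta + 0.5) * delta + offset
        return abs(y - q)
    return min(range(D), key=dist)

# Round trip: the embedded bit is recovered from the marked coefficient.
y = bscs_embed(x=10.0, d=1, key=0.3, delta=4.0)
assert bscs_extract(y, key=0.3, delta=4.0) == 1
```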

In our embedding scheme, we have decided to replicate the mark in three different locations to improve robustness in case a portion of the image is damaged: we have chosen the first seven MBs of the first, middle, and last MB rows as the cover for the watermark insertion. In each MB, we embed 8 payload bits by modifying the first two AC coefficients of each block composing the MB after quantization. At the decoder side, the knowledge of the watermark extraction key (i.e., its location) allows the extraction of the embedded mark (i.e., the ROI parameters). Notice that the first parameters of Table 1 (the grid size, N, M, the ROI coordinates, and QP_H — or QF_H) are fundamental to correctly start the decoding process. Therefore, to ensure their reconstruction, we set the QP (QF) of the watermarked MBs to a fixed, predefined value in both descriptions. The proposed data hiding exploitation adds a significant level of protection to the MDC scheme: it is impossible to reconstruct an acceptable version of the video without knowing the watermark extraction key. If the embedded ROI parameters cannot be extracted (for example, in the case of nonauthorized users), neither descriptor can be correctly decoded. Details about this concept are provided in Section 5. The overall encoding scheme for H.264/AVC is shown in Figure 3. As can be noticed, the proposed architecture does not alter the coding process and returns a completely standard-compliant stream, which is another advantage of this architecture with respect to other solutions. Indeed, the information related to the ROI is included in the transform coefficients and no additional syntax elements need to be defined. The corresponding scheme for the MJPEG implementation is similar to the one in Figure 3, but clearly without the motion compensation loop.
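Our reading of this placement rule can be summarized by the following sketch (the helper name and frame dimensions are illustrative):

```python
def cover_macroblocks(mb_rows, mb_cols, mbs_per_replica=7):
    """Yield (row, col) of the MBs hosting the three watermark replicas."""
    for row in (0, mb_rows // 2, mb_rows - 1):
        for col in range(min(mbs_per_replica, mb_cols)):
            yield (row, col)

# A 720p frame is 80x45 MBs: 3 replicas x 7 MBs = 21 cover MBs. Each MB
# carries 8 payload bits (two AC coefficients in each of its four 8x8
# blocks), i.e., 56 bits per replica, enough for the 48-bit ROI payload.
assert len(list(cover_macroblocks(45, 80))) == 21
```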

5. Experimental Results

This section refers to the validation of the system using three different test sequences in the 720p format retrieved from the European Broadcasting Union (EBU) website [25], including Crowd Run and Ducks take off. We will first review the results obtained in the MJPEG framework and then those obtained with H.264. In both cases, the encoders have been configured to achieve a PSNR higher than 40 dB on the luminance component when both descriptors are correctly received.

As a common achievement in both implementations, we demonstrate that the data hiding procedure also serves as a protection tool that prevents the visual content from being enjoyed properly: if the user does not know how to interpret the embedded key, the exact quantization values are known only for the black MBs in Figure 1, while the others are hidden in the cover. The effects on the resulting image depend on the adopted codec, but in general the visual content is not preserved. An example is presented in Figure 4 without the ROI, for MJPEG and H.264, respectively. The MJPEG implementation shows an evident grid: during the process of inverse quantization, values are not restored to their original magnitude, and the details of the picture are lost. In the case of H.264, instead, the content appears more scrambled. This is due to the intra prediction algorithm of the codec, which tends to compensate possible errors and thereby propagates the mismatch to intra blocks as well. For the tests, we have set a small background grid, in order to visually reduce the blocking artifacts due to the alternation of high- and low-quality blocks; in the ROI, the grid is set to the minimum size of one MB. The location of the ROI for each test sequence is expressed in terms of percentage as shown in Table 2.

It is clear that the ROI size has an impact on the total overhead. In our system, the increase of the redundancy is proportional to the ROI size, as demonstrated in Figure 5. Here we have chosen the central portion of the Ducks take off video sequence and we have progressively increased the area occupied by the ROI at regular intervals.

5.1. MJPEG Experiments

This first set of results aims at demonstrating the viability of the proposed approach in the absence of motion compensation. It is worth noting that, although we have chosen a simple MJPEG codec, there is no practical constraint in terms of compression standard, since the data hiding procedure can be applied to any transform-based coding scheme. Focusing on the target quality of 40 dB in case of full reconstruction, we have set QF_H and QF_L so as to achieve a reasonable compromise between perceived quality and overhead. This configuration returns an average PSNR of around 30 dB for the single-description reconstruction. In Figure 6, a comparison between the full reconstruction and the single-descriptor decoding is shown for the Crowd Run sequence. In Figure 7, a detail of the picture is shown, in which the improvements given by the ROI are highlighted for different values of QF_ROI. The corresponding numerical values are reported in Table 3.

5.2. H.264 Experiments

Also in this case, the encoder has been configured to ensure a PSNR in the range of 40 dB when both descriptors are received. In our experiments we have set QP_L = 20 and QP_H = 50. All the tests presented in this section have been performed using the JM reference software (v12.4) [26]. In Figure 8, the relationship between redundancy and distortion for different values of QP_ROI is reported, which demonstrates the versatility of the proposed method in modulating the overhead. We can notice that, on the one hand, very low redundancy values can be achieved at the cost of low single-description quality, while, on the other hand, it is possible to obtain a higher-quality reconstruction by increasing the bitrate. In any case, even when the quality is increased significantly, the overhead remains below 40%, with an average distortion of the whole frame of around 30 dB.

Figure 9(a) shows a single-description reconstruction (QP_H = 50) with the ROI disabled, Figure 9(b) being a detail of it, while in Figure 9(c) the ROI is inserted. The same experiment is presented in Figure 10 for the Crowd Run sequence. In Figure 11, we illustrate details of the ROI obtained by applying different values of QP_ROI. As can be noticed from the picture, the quality of the ROI can be significantly improved by closing the gap in terms of quantization parameter.

Figures 12 and 13 highlight the performance improvement due to the ROI introduction: although this results in an unavoidable bitrate increase, the PSNR in the ROI can be significantly improved by varying QP_ROI, letting the user perceive a better quality in the significant area of the video. Both Figures 12 and 13 refer to a configuration in which QP_L is set to 20, QP_H is set to 50, and QP_ROI ranges between 45 and 25.

To assess the perceived quality improvement introduced by the ROI, we employed the NTIA General Video Quality Metric (VQM) [27], which returns values in the 0–1 range, with 1 representing the worst quality. This metric has been proved to be highly correlated with subjective ratings of processed HD video sequences [28]. The results are reported in Table 4.

As a final comment on the achieved results, we underline that the proposed method embeds the ROI data within a reduced number of MBs, altering at most two transform coefficients per block. Therefore, the quality degradation due to data hiding is completely negligible. Simulation results show that the average loss in a single watermarked macroblock is around 0.45 dB in both MJPEG and H.264/AVC, thus minimally affecting the visual appearance of the image. We have replicated the watermark three times in different areas of the frame, thereby reducing the probability of the ROI details being lost due to channel errors. Even in this case, the impact on stream quality and bitrate is minimal: our experiments show negligible additional distortion (0.005 dB) and a marginal increase in redundancy when applying three fully redundant watermarks to the Ducks take off sequence.

6. Conclusions

In this paper, we presented a novel MDC technique for HDTV video sequences, using MJPEG and H.264/AVC. The proposed method exploits data hiding to improve the perceptual quality of each single description within specific regions of interest. ROIs are defined a priori and coded with finer quantization, and the corresponding information required at the decoder side is transmitted via a suitable watermarking algorithm. We reported extensive experimental results demonstrating the effectiveness of the proposed method, which allows increasing the PSNR of single descriptions at the cost of a slightly higher redundancy.

Acknowledgment

This work is partially funded by the Province of Trento (Italy) in the framework of the TRITON Project.