Support for region of interest (ROI) browsing, which allows the background part of a video bitstream to be dropped, is a desirable feature for video applications. With the help of the slice group technique provided by H.264/SVC, rectangular ROI areas can be encoded into separate ROI slices. Additionally, by imposing certain constraints on motion estimation, the ROI part of the bitstream can be decoded without the background slices of the same layer. However, the additional spatial and temporal constraints applied to the encoder significantly decrease the overall coding efficiency. In this paper, a rate-distortion optimized (RDO) encoding scheme is proposed to improve the coding efficiency of ROI slices. When background slices are discarded, the proposed method uses base layer information to generate the prediction signal of the enhancement layer. Thus, the temporal constraints can be loosened during the encoding process. To this end, the possible mismatch between the generated reference frames and the original ones is also considered during rate-distortion optimization so that a reasonable trade-off between coding efficiency and decoding drift can be made. In addition, a new Lagrange multiplier derivation method is developed to further improve coding performance. Experimental results demonstrate that the proposed method achieves significant bitrate savings compared to existing methods.

1. Introduction

With the rapid development and continuous expansion of mobile communications, mobile internet services are becoming more and more popular. As a result, mobile video applications, such as mobile video broadcasting [1], mobile video conferencing [2], and mobile video surveillance [3], have become an active research area in recent years. However, because mobile devices typically have limited communication bandwidth, constrained power capacity, and various display capabilities, there are several fundamental difficulties in deploying high-quality video services for mobile devices over wireless networks. Among them, one crucial problem for mobile video applications is how to browse high-resolution (HR) videos on mobile devices with small screens. Traditional approaches typically downsize the video to achieve the required resolution, which inevitably causes a loss of perceptual information and a waste of network bandwidth. Visual perception experiments show that a small display screen is a critical factor affecting the browsing experience. In fact, different parts of a picture do not attract people’s attention equally. People are likely to pay more attention to a certain area, called the region of interest (ROI), than to other areas of a picture. Thus, it is beneficial to optimize multimedia systems according to the ROIs of the video content, for example, by giving ROI areas better video quality [4] and by making background areas droppable when needed [5].

Scalable video coding extension of the H.264/MPEG-4 AVC video compression standard (H.264/SVC or SVC for short) [6] enables various functionalities to make encoded bitstreams more adaptive to dramatic variation of resource constraints, such as bandwidth, display capability, and power consumption. The base layer of SVC is compatible with H.264/AVC, so that it is easy to meet the requirement for the compatibility when upgrading video broadcasting infrastructures [7]. SVC is capable of encoding original video into different layers. Typically, it generates a high-quality video bitstream that contains one or more substreams, each of which corresponds to a degraded version of the original video signal with lower spatial resolution or lower temporal resolution or lower picture fidelity. Substreams can be extracted at media gateways according to network bandwidth and end device capability.

In addition to spatial, temporal, and quality scalabilities, SVC also supports ROI scalability [5]. With the help of the slice group technique (also known as Flexible Macroblock Ordering (FMO) [8]), macroblocks (MBs) in ROI and background areas can be coded into ROI slices and background slices, respectively. A low-bitrate substream that contains the ROI slices can be extracted from the high-quality bitstream without any transcoding operation [6]; therefore, SVC can provide ROI browsing functionality for multimedia communication systems. As illustrated in Figure 1, the base layer of a bitstream can be coded with relatively low spatial resolution or modest fidelity to provide basic video quality for devices with small screens or low bandwidth. The enhancement layer can be coded with high resolution or high fidelity to provide higher video quality. The enhancement layer may be composed of ROI slices and background slices to provide ROI scalability with FMO, which is allowed in the scalable baseline profile of H.264/SVC. With the help of spatial scalability, one may choose either the QCIF (Quarter Common Intermediate Format) base layer or the CIF (Common Intermediate Format) enhancement layer according to resource constraints such as network bandwidth, screen size, and power consumption, as illustrated in scenarios (a) and (c). Additionally, ROI scalability introduces a new scenario (scenario (b)), in which both the base layer and the ROI of the enhancement layer are delivered, and one may choose between the whole scene at low resolution and the zoomed ROI area at high resolution for a better experience.
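The three delivery scenarios can be pictured as a simple filter over coded slices. The sketch below is illustrative only: the `Slice` record, its fields, and the reading of scenario (a) as "base layer only" and scenario (c) as "full enhancement layer" are assumptions made for this example; they are not part of the SVC bitstream syntax or of any extractor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """Hypothetical record for one coded slice (not an SVC NAL unit)."""
    frame: int
    layer: int      # 0 = base layer, 1 = enhancement layer
    is_roi: bool    # meaningful only for enhancement-layer slices


def extract_substream(slices, scenario):
    """Return the slices that would be delivered in scenario 'a', 'b', or 'c'."""
    if scenario == "a":      # assumed: base layer only
        return [s for s in slices if s.layer == 0]
    if scenario == "b":      # base layer plus enhancement-layer ROI slices
        return [s for s in slices if s.layer == 0 or (s.layer == 1 and s.is_roi)]
    if scenario == "c":      # assumed: everything, background slices included
        return list(slices)
    raise ValueError("scenario must be 'a', 'b', or 'c'")
```

Because the selection only drops whole slices, no re-encoding is involved, which mirrors the "no transcoding" property described above.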

To develop a video application system with ROI browsing functionality based on SVC (like the application in Figure 1), ROI areas should first be detected in each frame, and then they can be coded into ROI slices using FMO. The detection of the ROI of a picture has been widely studied with various attention models [9–11]. Given that the ROI of a certain picture may be quite different for different people, it is practical and reasonable for a video application system to allow users to freely select some initial objects of interest and then track these objects in subsequent video frames to locate ROI areas. Existing tracking methods can be used to perform the tracking operation. These methods can be categorized into pixel domain (Pel-domain) [12–14] and compressed domain (Com-domain) [15–17] approaches. Generally speaking, the Pel-domain approaches achieve better tracking accuracy than the Com-domain ones, but with higher complexity. For the simulations in this paper, the color-based Monte Carlo tracking technique introduced by Pérez et al. [14] is applied for ROI tracking.

To support ROI scalability, ROI slices should be self-contained, in other words, decodable in the absence of other slices of the same picture [18]. Thus, during the encoding process, dependencies between different slices, such as dependencies introduced by intraprediction and motion vector prediction, should be prohibited [8]. Besides, in order to provide acceptable visual quality in case background slices are all discarded, several additional temporal constraints are suggested to avoid using background slices in reference pictures to predict current ROI slice. The constrained motion vector method [19] presents an H.264/SVC compatible way where the MBs belonging to current ROI slice should use corresponding ROI slices in reference pictures only as reference. Besides, since the reference frames should be upsampled with a 6-tap interpolation filter for quarter-pel motion estimation (ME), the fractional pixel located within two pixels of the ROI slice boundary must also be ignored during motion estimation. However, this may significantly decrease the coding efficiency. Thus, Bae et al. [20] suggest using half-sample interpolation method for fractional pixel interpolation, where the slice boundary is treated as picture boundary. However, this method has little improvement compared with the constrained motion vector method; therefore, it has not been adopted into the H.264/SVC standard. Generally speaking, the above approaches all make strict truncation of temporal prediction, which leads to significant degradation of coding performance. Fortunately, for SVC application in Figure 1, since the base layer information is available when ROI of enhancement layer is being encoded, it is better to adopt a more flexible method to improve the coding performance of the enhancement layer ROI slice.

In this paper, an efficient ROI coding algorithm under the SVC scalable baseline profile is proposed. When the enhancement layer contains ROI slices, information from the base layer is used to improve their coding efficiency. The framework of the proposed algorithm is illustrated in Figure 2. The ROI area of an input picture is coded as an ROI slice by the proposed enhancement layer encoder. Instead of directly restricting the motion vectors, the encoder improves the coding efficiency by using a rate-distortion optimized (RDO) mode decision that takes into consideration the error propagation caused by the loss of the background slices. A new Lagrange multiplier derivation method, which is associated with the proportion of the ROI area, is also derived and used in the RDO model to further improve the ROI slice encoding performance.

The remainder of this paper is organized as follows. The proposed ROI coding method is introduced in Section 2. Then Section 3 shows some experimental results to verify the benefits of the proposed RDO method and is followed by the conclusion in Section 4.

2. ROI Coding for H.264/SVC

In H.264/SVC, a rectangular ROI area of a picture can be coded into a separate slice using FMO technique. However, to support ROI browsing functionality, additional effort should be made to ensure that ROI slices are independent of background data. Several approaches can be used to encode ROI slices. Most of them apply constraints to temporal prediction to enable fully independent decoding of ROI slices, for example, the constrained motion vector method [10, 19] and the half-sample interpolation method [20]. However, in H.264/SVC, such strict constraints may severely degrade the coding efficiency for enhancement layer users. Fortunately, since the corresponding base layer data is always contained in an enhancement layer bitstream, the base layer information is available to be adopted to further improve the coding efficiency of the ROI area of the enhancement layer. In this section, the existing ROI coding approaches are introduced first, and then the proposed RDO based framework for coding ROI slices is presented.

2.1. Existing ROI Coding Methods
2.1.1. The Constrained Motion Vector Method

In H.264/SVC, motion estimation (ME) and motion compensation (MC) are performed using motion vectors with the accuracy of quarter-pixel luma samples. If the motion vector represents a fractional pixel position, interpolation is performed to generate predicted signal value.

As shown in Figure 3, the half-pixel samples are interpolated first from neighboring integer-pixel samples using a 6-tap finite impulse response (FIR) filter [21]. This means that each half-pixel sample is a weighted sum of 6 neighboring integer samples. Once all the half-pixel samples are available, the quarter-pixel samples are interpolated with neighboring half- and/or integer-pixel samples using bilinear interpolation.
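The two-stage interpolation described above can be sketched as follows. This is a minimal illustration that assumes 8-bit samples and the filter taps (1, -5, 20, 20, -5, 1) with rounding and a 5-bit shift, as used for luma half-samples in H.264/AVC; the function names are chosen for this example, and the special handling of the centre half-sample (which is filtered from unrounded intermediate values) is left out.

```python
def clip_pixel(value):
    """Clip to the 8-bit sample range (an assumption of this sketch)."""
    return max(0, min(255, value))


def half_pel(p):
    """Half-pixel sample between p[2] and p[3].

    p holds six consecutive integer-pixel samples along a row or column.
    """
    if len(p) != 6:
        raise ValueError("the 6-tap filter needs exactly six samples")
    acc = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5]
    return clip_pixel((acc + 16) >> 5)   # round, then divide by 32


def quarter_pel(s0, s1):
    """Quarter-pixel sample: rounded average of two neighbouring
    integer- or half-pixel samples (bilinear interpolation)."""
    return (s0 + s1 + 1) >> 1
```

For instance, across a step edge `[0, 0, 0, 255, 255, 255]` the half-pixel sample evaluates to 128, and the quarter-pixel sample between the integer sample 0 and that half-pixel sample evaluates to 64.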

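For six consecutive integer-pixel samples $p_{-2}, \ldots, p_{3}$ along a row or column, the half-pixel sample $h$ located between $p_{0}$ and $p_{1}$ is interpolated by

$$h = \mathrm{Clip}\big((p_{-2} - 5p_{-1} + 20p_{0} + 20p_{1} - 5p_{2} + p_{3} + 16) \gg 5\big). \tag{1}$$

And the quarter-pixel sample $q$ located between two neighboring integer- or half-pixel samples $s_{0}$ and $s_{1}$ is interpolated by

$$q = (s_{0} + s_{1} + 1) \gg 1. \tag{2}$$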

Since the derivation process of some fractional pixels within two pixels inside the ROI boundary (labeled as “unavailable fractional pixel” in Figure 3) depends on pixels out of the ROI region, Hannuksela et al. [19] proposed not to use them as reference during ME/MC; therefore, the dependency between ROI and background can be removed.
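The restriction can be expressed as a test on the integer samples that a candidate reference block needs. The sketch below is an assumption-laden illustration, not the reference encoder's code: positions are given in quarter-pixel units, the ROI is an axis-aligned rectangle `(x0, y0, x1, y1)` with exclusive upper bounds, and any fractional position along an axis is taken to require the full 6-tap support (two integer samples before and three after).

```python
def support_range(pos_q, size):
    """Inclusive range of integer samples needed along one axis.

    pos_q is the block origin in quarter-pixel units; size is the block
    extent in pixels along that axis.
    """
    base, frac = pos_q >> 2, pos_q & 3
    if frac == 0:
        return base, base + size - 1
    # Fractional position: the 6-tap filter reaches 2 samples back
    # and 3 samples forward around each interpolated position.
    return base - 2, base + size - 1 + 3


def mv_allowed(x_q, y_q, width, height, roi):
    """True if the reference block depends only on samples inside the ROI."""
    x0, y0, x1, y1 = roi
    lo_x, hi_x = support_range(x_q, width)
    lo_y, hi_y = support_range(y_q, height)
    return x0 <= lo_x and hi_x < x1 and y0 <= lo_y and hi_y < y1
```

With an ROI of `(16, 16, 64, 64)`, a 16x16 block at integer position (16, 16) is allowed, whereas the same block shifted by half a pixel horizontally is rejected because its filter support reaches columns 14 and 15 in the background.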

2.1.2. The Half-Sample Interpolation Method

To loosen the restriction imposed on the ME process, the half-sample interpolation method [20] modifies the fractional pixel interpolation process to extend the pixels on slice boundaries using the 6-tap interpolation; for example, the half-sample “” in Figure 3 is generated by In the actual implementation, half-sample interpolation method is only adopted when generating reference signal for ROI areas. Original interpolation method in H.264/SVC is still used to generate reference frames of background slices so that better coding efficiency can be achieved.
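One way to picture this boundary extension is to clamp every sample index to the ROI before filtering, in the same way that sample positions are clamped at picture boundaries. The sketch below makes that assumption explicitly; the function name, the one-dimensional layout, and the 8-bit clipping are choices made for this illustration and are not taken from [20].

```python
def half_pel_in_roi(row, i, roi_lo, roi_hi):
    """Half-pixel sample between row[i] and row[i + 1].

    Only samples with indices in [roi_lo, roi_hi] are read; indices that
    fall outside are replaced by the nearest ROI boundary sample.
    """
    def sample(k):
        return row[max(roi_lo, min(roi_hi, k))]

    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * sample(i - 2 + n) for n, t in enumerate(taps))
    return max(0, min(255, (acc + 16) >> 5))
```

For a row whose background samples (indices 0 to 3) are 0 and whose ROI samples (indices 4 to 11) are 80, the half-pixel sample next to the ROI edge stays at 80 when indices are clamped to the ROI, but becomes 90 if the background samples are allowed into the filter.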

The above two methods both prohibit the use of samples in the background area of the reference frame during ME/MC, which also prevents many MBs from being coded with their most suitable prediction block when that block lies in or overlaps the background area. Therefore, the coding efficiency is severely reduced compared with the original coding method.

2.2. Proposed RDO Based ROI Coding Framework

In application scenario (b) illustrated in Figure 1, when the background slices of the enhancement layer are dropped, their pixels cannot be used as reference to decode the ROI area of the enhancement layer, but the base layer information is still available. So the background pixels of the base layer can be reconstructed and used to generate reference frames for the ROI areas of the enhancement layer using error concealment techniques. However, the mismatch between the original and error-concealed reference frames may cause severe error propagation; therefore, the coding modes of MBs that may use error-concealed blocks as reference should be selected carefully.

Let ORF and VRF denote the original reference frame and the error-concealed reference frame (named the virtual reference frame), respectively, where the VRF may be generated using base layer information through error concealment techniques. In the proposed RDO framework, the mismatch between the ORF and the VRF, together with the source error introduced by quantization in the encoding loop, is considered as the total distortion. The RDO evaluation for mode decision is based on this total distortion. Furthermore, the Lagrange multiplier is modified to take into account the proportion of the ROI area for better performance.

2.2.1. Generation of Virtual Reference Frame (VRF)

Figure 4 shows the generation of intercoded VRF at encoder side. Since the background slice is assumed to be discarded, the pixels belonging to the background slice are estimated using the base layer information with the same error concealment method that the decoder uses (in this paper, the well-known BL-skip method [22] is adopted); the pixels belonging to the ROI slice are generated by motion compensation using their own motion vectors and residuals while taking the former VRF as reference. Then, the VRF also serves as the reference frame of the following VRFs. The generation of intracoded VRF is similar to that of intercoded VRF except that upsampled textures are directly used to imitate the background slice. Notice that, in actual implementation, the upsampled motion vectors (MVs), residuals, and textures can be easily obtained when calculating the cost of “base layer mode.” Thus, only one additional MC operation for each MB is needed to generate VRF.
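The construction of an intercoded VRF can be sketched in a few lines. This is a deliberately simplified illustration under several assumptions that are not in the text: frames are 2-D lists of 8-bit samples, the concealed background (the output of the BL-skip concealment) is passed in ready-made as `concealed_background`, and a single integer-pixel motion vector is applied to the whole ROI instead of per-partition quarter-pixel vectors.

```python
def build_vrf(prev_vrf, concealed_background, residual, roi, motion):
    """Return the next virtual reference frame.

    prev_vrf, concealed_background, residual: 2-D lists of equal size.
    roi: (x0, y0, x1, y1) with exclusive upper bounds.
    motion: (dx, dy) integer-pixel displacement (simplification).
    """
    height, width = len(prev_vrf), len(prev_vrf[0])
    x0, y0, x1, y1 = roi
    dx, dy = motion

    # Background: copied from the error-concealed (base layer) estimate.
    vrf = [row[:] for row in concealed_background]

    # ROI: motion compensation from the PREVIOUS VRF plus the residual,
    # so that any mismatch propagates exactly as it would in a decoder
    # that has lost the background slices.
    for y in range(y0, y1):
        for x in range(x0, x1):
            ref_y = max(0, min(height - 1, y + dy))   # clamp at picture edge
            ref_x = max(0, min(width - 1, x + dx))
            value = prev_vrf[ref_y][ref_x] + residual[y][x]
            vrf[y][x] = max(0, min(255, value))
    return vrf
```

Feeding each returned frame back in as `prev_vrf` for the next picture reproduces the chaining described above, where one VRF serves as the reference of the following VRFs.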

2.2.2. Proposed RDO Mode Decision

In the mode decision process of a macroblock, the coding mode with the minimum RD cost is selected:

$$J = D + \lambda R, \tag{4}$$

where $D$ and $R$ are the distortion introduced and the bits consumed by the coding mode under consideration, respectively, and $\lambda$ is the Lagrange multiplier.

For an MB in an ROI slice, the proposed mode decision scheme considers both the distortion introduced by the difference between the reconstructed MB and the original MB (termed source distortion) and the mismatch between the reference MB in the ORF and that in the VRF (termed mismatch distortion). So the RD cost function for mode decision becomes

$$J = D_s + D_m + \lambda R, \tag{5}$$

where $D_s$ stands for the source distortion and $D_m$ is the mismatch distortion.

For an MB in a background slice, a basic assumption is that users who receive enhancement layer background slices also receive the ROI slices. This assumption is reasonable considering that ROI slices are more important and are therefore better protected than background slices. Thus, the mismatch distortion $D_m$ becomes zero, and the cost function for mode decision reduces to its original form, with only the source distortion $D_s$ remaining:

$$J = D_s + \lambda R. \tag{6}$$

Figure 5 depicts in detail the implementation of the proposed RDO mode decision of an MB.
(1) Firstly, given a mode $m$, the best motion vector $v$ for each partition is selected using the original reference frame. Let the corresponding predictor be $P_o$.
(2) Then, calculate the source distortion $D_s$ and the cost bits $R$ through the encoding process of mode $m$.
(3) If the current MB belongs to the ROI area, find the new predictor $P_v$ from the previous VRF with motion vector $v$ and calculate the mismatch error $D_m$. Note the following distinction: for distortion calculation in mode decision, the previous VRF is used as reference, while for distortion calculation in ME, the previous ORF is still used.
(4) Calculate the RD cost $J$ using (5) or (6), and return to step (1) for the next mode.
(5) Finally, the mode with the minimum $J$ among all candidate modes is selected as the best mode.
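The selection loop in the steps above can be sketched as follows. The sketch assumes that the source distortion, mismatch distortion, and bit count of each candidate mode have already been measured, and that the two distortions are combined additively in the cost; the function name, the tuple layout, and the mode labels in the usage note are hypothetical.

```python
def choose_mode(candidates, lam, in_roi):
    """Pick the candidate with the smallest rate-distortion cost.

    candidates: iterable of (mode, source_distortion, mismatch_distortion, bits).
    lam: Lagrange multiplier.
    in_roi: True for an MB in an ROI slice; for a background MB the
            mismatch term is ignored, as in the original cost function.
    """
    best_mode, best_cost = None, float("inf")
    for mode, d_src, d_mis, bits in candidates:
        distortion = d_src + (d_mis if in_roi else 0.0)
        cost = distortion + lam * bits
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost
```

With the hypothetical candidates `("inter", 100, 400, 20)`, `("intra", 150, 0, 60)`, and `("base_layer", 180, 0, 30)` and a multiplier of 2, a background MB picks `"inter"`, while an ROI MB picks `"base_layer"` because the inter predictor would drift once the background slices are dropped.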

The benefit of the proposed RDO method is illustrated through the rate-distortion (RD) performance comparison in Figures 6 and 7 for both spatial and quality SVC. In Figure 6, the spatial SVC bitstream contains a QCIF base layer and a CIF enhancement layer. In Figure 7, the quality SVC bitstream contains a CIF base layer and a CIF enhancement layer. The intraperiod is set to 30. Four pairs of quantization parameter (QP) are chosen for the test: for spatial SVC, QP pairs for QCIF base layer and CIF enhancement layer are (22, 26), (26, 30), (30, 34), and (34, 38), respectively, and, for quality SVC, the QP pairs are (30, 26), (34, 30), (38, 34), and (42, 38), respectively. The original method in H.264/SVC, which uses the ORFs without any constraints on temporal prediction, is simulated as the anchor. Three collections of data are presented in each figure, where “mdrdo” stands for the proposed RDO based mode decision method, and “mv_constrain” and “half-interpolation” stand for the ROI coding methods mentioned in Section 2.1.

Coding efficiency in the following two scenarios is considered. The first is that the enhancement layer slices are received completely, which means that the quality of the full-resolution enhancement layer (labeled “Enc full”) should be considered. The second is that the background slices are all discarded, which means that only the quality of the ROI areas (labeled “Dec ROI”) affects the user experience. Average bitrate savings of the above three coding methods compared with the “orig.” method, calculated via the Excel add-in proposed in VCEG-AE07 [23], are presented (a larger bitrate-difference value means worse performance).

From Figures 6(a) and 7(a), we can see that the coding efficiency of all three methods (for the whole SVC bitstream) is lower than that of the “orig.” method, which uses the perfectly decoded reference frames for prediction. However, the proposed method achieves significant improvement compared with the other two methods because it uses a better reference for coding ROI slices. The average performance gain over the “half-interpolation” method across the tested sequences is about 5% (Figure 6(a)) and 7% (Figure 7(a)) for spatial and quality SVC, respectively.

Although these methods perform worse than the original method when all enhancement layer slices are received, in the most important ROI browsing scenario, in which all the background slices may be discarded, their RD performance is much better than that of the “orig.” method. As illustrated in Figures 6(b) and 7(b), the bitrate saving of the proposed method is up to 50% (about 30% and 4% on average for spatial and quality SVC, respectively) compared with the “orig.” method. The lower gain for quality SVC compared with spatial SVC is consistent with the common knowledge that better concealment quality is obtained when the base layer and enhancement layer have the same resolution; thus, acceptable quality may be obtained even with the “orig.” method for the quality coding configuration (CIF-CIF). Still, the proposed method outperforms the “mv_constrain” and “half-interpolation” methods, and the average gain compared with “half-interpolation” is 3% (Figure 6(b)) and 2.5% (Figure 7(b)) for spatial and quality SVC, respectively.

2.2.3. The Selection of Lagrange Multiplier

In rate-distortion optimization, the Lagrange multiplier should be carefully selected to ensure that the most suitable modes are chosen. In this paper, a refined Lagrange multiplier selection method for background and ROI slices is proposed to further improve the RD performance of ROI slices. In H.264/AVC, the Lagrange multiplier can be calculated as follows.

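Supposing $D$ and $R$ in (4) to be differentiable everywhere, the minimum cost is given by setting its derivative to zero, thus leading to

$$\lambda = -\frac{dD}{dR}. \tag{7}$$

Then, the Lagrange multiplier for single layer video coding can be solved through the rate model (8) and distortion model (9) [24]:

$$R(D) = a \log_2\left(\frac{b}{D}\right), \tag{8}$$

$$D = \frac{Q^2}{3}, \tag{9}$$

where $a$ and $b$ are two constants and $Q$ is the quantization step. According to (8) and (9), the derivatives of $R$ and $D$ can be calculated by

$$\frac{dR}{dQ} = -\frac{2a}{Q \ln 2}, \qquad \frac{dD}{dQ} = \frac{2Q}{3}. \tag{10}$$

Putting (10) into (7) and letting $c = \ln 2 / (3a)$, the $\lambda$ for single layer is finally derived as

$$\lambda = c \cdot Q^2, \tag{11}$$

where $c$ is a constant, which is experimentally suggested to be 0.85 [24], though others proposed 0.68 [25].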
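As a numerical illustration of the single-layer result, the sketch below evaluates a multiplier proportional to the squared quantization step, using the constant 0.85 quoted above. The mapping from quantization parameter to quantization step is an assumption of this sketch: it uses the common approximation in which the step doubles for every increase of 6 in QP and equals about 1 at QP 4, rather than the exact table of the standard, and the function names are chosen for this example.

```python
def qstep(qp):
    """Approximate H.264/AVC quantization step for a given QP
    (assumed closed form: doubles every 6 QP, about 1.0 at QP 4)."""
    return 2.0 ** ((qp - 4) / 6.0)


def single_layer_lambda(qp, c=0.85):
    """Single-layer Lagrange multiplier, taken as c times the squared step."""
    return c * qstep(qp) ** 2
```

Under this approximation, raising QP by 6 doubles the step and therefore quadruples the multiplier, which is why coarser quantization makes the encoder weigh bits more heavily against distortion.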

During the development of H.264/SVC, such is directly used in H.264/SVC reference software Joint Scalable Video Model (JSVM) [26]. However, applying such , which is derived for single layer Lagrange multiplier, into multilayer scenario is inappropriate, since the correlation between layers is not considered in the Lagrange multiplier selection. To improve the overall coding performance, an encoder-only optimization contribution for RDO in SVC is presented by Li et al. [27], which is adopted into later JSVM in an optional way. According to this method, the Lagrange multiplier is derived as where denotes the resolution ratio of the two layers and and are the quantization steps for the base layer and enhancement layer, respectively.

Similarly, for a specific user who requires base layer slices together with ROI slices, the best Lagrange multiplier for ROI slices can be derived as follows.

Let the joint cost be where and are the RD cost functions for base layer and enhancement layer with ROI area, respectively, and are the contribution weights of base layer cost and ROI area cost, respectively, and . Similar to [27], the term denotes the resolution ratio between enhancement layer ROI area and base layer and is introduced as an approximation for bitrate ratio between the ROI slice and base layer slice.

The term is the mismatch distortion (as described in (5)) introduced by difference between ORF and VRF. is much larger than and thus can be regarded as independent with . Therefore, put (8) and (9) into (13) and then set the derivative of to zero; the Lagrange multiplier can be solved as Considering the base layer Lagrange multiplier is determined by single layer selection method (11); put into (14) to derive as

Note that the derived is similar to (13). However, the rate ratio has a different form since the area of ROI has been taken into consideration.

Let and denote the quantization parameters for base layer and enhancement layer, respectively. Then, according to the relationship between quantization steps and quantization parameters, which has been defined in H.264/AVC standard [28], , , and conform to

Equation (14) can be simplified according to the quantization parameter difference between base and enhancement layers; namely, . And finally, the modified Lagrange multiplier can be obtained through the following equation for SVC enhancement layer when ROI is enabled:

Similar to the RDO mode decision part, the performance of the proposed RDO method with modified Lagrange multiplier is shown in Figures 8 and 9 for spatial and quality SVC, respectively. The same coding parameters are used. “Enc full” and “Dec ROI” denote the scenarios when background slices are received and discarded, respectively. The proposed “mdrdo” method with Lagrange multiplier [27] and with the proposed   (17) is simulated separately. And the “mdrdo” method with original single layer Lagrange multiplier (11) is performed as the anchor.

All the values in Figures 8 and 9 are negative, which means that both Lagrange multiplier modification methods have brought benefits to ROI and enhancement layer coding compared with the original Lagrange multiplier method. Compared with , the proposed achieves a better performance for ROI, and the average gain is about 6% for spatial SVC (Figure 8(b)) and 7% for quality SVC (Figure 9(b)). Since is always smaller than , the proposed shifts bits from a background slice to an ROI slice. Thus, for the whole enhancement layer, the proposed leads to trifling loss of coding efficiency compared with . The average loss of “mdrdo + ” compared with “mdrdo + ” is about 1% for spatial SVC (Figure 8(a)) and 1.5% for quality SVC (Figure 9(a)), respectively. Although the performance of the background area is sacrificed, it is worthwhile given that the performance of semantic-important ROI area is improved.

3. Simulation Results

To illustrate the overall improvement of the proposed framework, the benefits of the proposed RDO method together with modified Lagrange multiplier are shown in Figures 10 and 11 for spatial and quality SVC, respectively. The same coding parameters, as described in Section 2.2, are still used. Now, thanks to the modification of Lagrange multiplier, the proposed “mdrdo + ” method achieves significant bitrate saving compared with the coding scheme using the original Lagrange multiplier method (denoted as “orig.”), regardless of the background slices being lost or not. Accordingly, the gain of “mdrdo + ” method, which uses the modified Lagrange multiplier, is also increased compared with “mv_constrain” and “half-interpolation” methods which use the original Lagrange multiplier method. For example, the average gain compared with “half-interpolation” method under “Dec ROI” scenario is now 15% (Figure 10(b)) and 12% (Figure 11(b)) for spatial and quality SVC, respectively.

When all background slices of the enhancement layer are discarded, decoding drift occurs in both the “orig.” method and the proposed method. To show the impact of such drift in both cases, a visual quality comparison of the “mdrdo,” “mdrdo + ,” and “orig.” methods is given for the decoded ROI. In Figures 12 and 13, the decoded QCIF-sized ROI areas under the “Dec ROI” scenario for spatial SVC with a QCIF base layer (QP 22) and a CIF enhancement layer (QP 26) are illustrated. The “input” shows the original QCIF ROI area of the input sequence. It can be seen that the “orig.” method, which applies no constraints to temporal prediction, suffers severe quality degradation, especially at the boundaries of ROI areas (Figures 12(b) and 13(b)). By taking into consideration the error propagation during the mode decision, the proposed “mdrdo” and “mdrdo + ” methods can effectively restrict the errors. The error propagation is unnoticeable in Figures 12(c), 12(d), 13(c), and 13(d).

4. Conclusions

In this paper, an ROI coding framework based on H.264/SVC is proposed. Firstly, a rate-distortion optimized mode decision method is proposed, which loosens the temporal constraints when coding ROI slices in the SVC enhancement layer. The mismatch between reference frames is considered during the RDO process, whether the background slices are discarded or kept. Then, a new Lagrange multiplier estimation algorithm is derived to improve the coding efficiency of ROI slices. Compared with the existing constraint-based methods, such as the constrained motion vector method and the half-sample interpolation method, experimental results show that the proposed method achieves significant bitrate savings while maintaining higher objective and subjective video quality.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


This work is supported by the National Natural Science Foundation of China (NSFC) General Program (Contract no. 61272316) and 863 Project (Contract no. 2011AA01A102).