Abstract

The recent scalable video coding (SVC) extension to the H.264/AVC video coding standard has unprecedented compression efficiency while supporting a wide range of scalability modes, including temporal, spatial, and quality (SNR) scalability, as well as combined spatiotemporal SNR scalability. The traffic characteristics, especially the bit rate variabilities, of the individual layer streams critically affect their network transport. We study the SVC traffic statistics, including the bit rate-distortion and bit rate variability-distortion characteristics, with long CIF resolution video sequences and compare them with the corresponding MPEG-4 Part 2 traffic statistics. We consider (i) temporal scalability with three temporal layers, (ii) spatial scalability with a QCIF base layer and a CIF enhancement layer, as well as (iii) the quality scalability modes FGS and MGS. We find that the significant improvement in RD efficiency of SVC is accompanied by substantially higher traffic variabilities as compared to the equivalent MPEG-4 Part 2 streams. We find that separately analyzing the traffic of temporal-scalability-only encodings gives reasonable estimates of the traffic statistics of the temporal layers embedded in combined spatiotemporal encodings and in the base layer of combined FGS-temporal encodings. Overall, we find that SVC achieves significantly higher compression ratios than MPEG-4 Part 2, but produces unprecedented levels of traffic variability, thus presenting new challenges for the network transport of scalable video.

1. Introduction

We study the video traffic generated by the scalable video coding (SVC) extension [1, 2] of the H.264/MPEG-4 advanced video coding standard [3] (H.264 SVC for brevity). This extension is expected to have a broad application domain for heterogeneous wired and wireless video transmission to various terminals. Indications of the growing acceptance of H.264/AVC are its adoption in application standards and industry consortia specifications, such as DVB, ATSC, 3GPP, 3GPP2, MediaFLO, DMB, DVD Forum (HD-DVD), and Blu-Ray Disc Association (BD-ROM). At the same time, mobile TV technologies are made widely available. IPTV, mobile TV, satellite TV, and video surveillance are considered key applications that can make H.264/AVC and its SVC extension the dominant video encoder in the professional and consumer markets.

In order to examine the fundamental traffic characteristics of H.264 SVC's scalability modes, we focus on encodings with fixed quantization scales, that is, with variable bit rate (VBR). An additional motivation for the focus on VBR video is that the VBR streams allow for statistical multiplexing gains that have the potential to improve the efficiency of video transport over communication networks [4–9]. The development of video network transport mechanisms that meet the strict playout deadlines of the video frames and efficiently accommodate the variability of the video traffic is a challenging problem. Based primarily on the characteristics of MPEG-4 Part 2 single-layer and scalable video, transport mechanisms have been developed for a wide range of network transport scenarios, including video transport over the Internet (see, e.g., [10–16]), over wireless networks (see, e.g., [17–24]), over peer-to-peer networks (see, e.g., [25–32]), and over sensor networks [33–35]. The widespread adoption of the new H.264/AVC video standard necessitates the careful study of the traffic characteristics of video coded with the new H.264/AVC codec and its extensions. Recent traffic studies [36] indicate that despite the lower average bit rate of H.264/AVC and H.264 SVC single-layer video, elementary bufferless multiplexing of a small number of video streams can be more efficient with MPEG-4 Part 2 encoding than with H.264/AVC or H.264 SVC encoding due to the significantly higher traffic variability of H.264/AVC and H.264 SVC. Therefore, it is necessary to thoroughly examine the new SVC extension's statistical traffic characteristics from a communication network perspective.

The traffic characterizations and network transport mechanisms for scalable video encoded with MPEG-4 Part 2 and older codecs have received significant attention in the literature (see, e.g., [37–54]). The traffic characterization of H.264/AVC and H.264 SVC nonscalable (single-layer) traffic is studied in [36, 55, 56]. The study of network transport mechanisms in the context of H.264/AVC (see, e.g., [57–59]) and H.264 SVC (see, e.g., [60–64]) has begun to attract interest. To the best of our knowledge, the traffic of H.264 SVC-encoded scalable video is examined for the first time in the present study. Existing studies of the H.264/AVC codec and its SVC extension, such as [3, 65, 66], focus primarily on the bit rate-distortion (RD) performance, that is, the video quality (PSNR) as a function of the average bit rate, and typically consider only short video sequences up to a few hundred frames. In contrast, for the transport over communication networks, the traffic variability is also a key concern [5, 9, 14]. Therefore, we examine in the present study the joint characterization of bit rate-distortion and higher order bit rate statistics, such as the variability of the bit rate, as a function of the distortion. We perform a detailed analysis of elementary statistics of the scalable video traffic. We study statistics of frame sizes, group of picture (GoP) sizes, as well as frame and GoP qualities. We use bit rate-distortion (RD) and bit rate variability-distortion (VD) curves to compare the H.264 SVC-layered traffic to the equivalent traffic of MPEG-4 Part 2 [67], which is the predecessor of H.264/AVC and which supports temporal, spatial, and FGS scalability. In order to obtain reliable and meaningful statistical estimates of the traffic variability and other properties, it is necessary to examine long video sequences with several thousand frames, as we do in this study.

All encodings of this study are publicly available as video traces at http://trace.eas.asu.edu/. Video traces [47] are files mainly containing video frame time stamps, frame types (e.g., I, P, or B), encoded frame sizes (in bits), and frame qualities (PSNR). Video traces are employed in simulation studies of transport of scalable video over communication networks (see, e.g., [37–41, 44, 46, 52–54]). Key advantages of simulating with video traces over experiments with actual video are that only very basic knowledge of video encoding is required for simulations with video traces and that video traces are freely available without copyright protection. Also, network simulations with video traces can be conducted with standard network simulation programs and integrated in network simulation modules (see, e.g., [68]), whereas experiments with actual video require in-depth video coding expertise and large computational resources for the encoding of many long video sequences.

The paper is organized as follows. We provide a brief overview of the scalability modes of the H.264 SVC extension in Section 2. In Section 3, we describe the video test sequences, encoding tools, and video traffic metrics employed in our study. In Section 4, we analyze the traffic characteristics of the individual temporal scalability layers of long CIF videos. In Section 5, we study spatial scalability mode traffic with the same long CIF sequences and their QCIF subsampled versions. In Sections 6 and 7, we examine SVC's fine granularity scalability (FGS) traffic and medium granularity scalability (MGS) traffic, respectively. In Section 8, we consider the combined spatiotemporal and FGS-temporal scalabilities, which permit us to examine the separability of the combined scalability modes into the basic modes from a video traffic analysis perspective. We summarize our conclusions in Section 9.

2. Overview of H.264 Scalable Video Coding (SVC)

In this section, we briefly introduce the scalable video coding (SVC) extension of H.264/AVC. For a detailed discussion of the video technologies in the MPEG-4 family, such as MPEG-4 Part 2 [67] and H.264/AVC [3], we refer to [69]. At the end of 2007, the SVC scalability extension was added to the H.264/AVC standard. The SVC extension provides temporal scalability, spatial scalability, coarse (CGS) and medium (MGS) granularity scalability, as well as combined spatiotemporal SNR scalability (a restricted set of spatiotemporal-SNR points can be extracted from a global scalable bit stream). The fine granularity scalability (FGS) mode was initially intended to be part of the SVC extension; however, FGS was not included in the initial version of SVC. Presently, investigations are ongoing to include FGS in a follow-up of the SVC extension.

While earlier scalable video encoders and receivers, such as MPEG-4 Part 2, did not gain wide market deployment, the H.264 SVC scalability extension is expected to play a major role in providing video services over heterogeneous networks due to the significantly improved rate-distortion efficiency of the H.264 SVC scalability encoding tools (with respect to MPEG-4 Part 2) and the growing industrial acceptance of H.264/AVC as the successor of the pervasive MPEG-2 standard.

In the following subsections, we briefly discuss the main scalability modes of this new H.264 SVC scalability amendment and refer to [2] for detailed information.

2.1. Temporal Scalability with Hierarchical B Frames

The introduction of hierarchical B frames has allowed the H.264 SVC encoder to achieve temporal scalability while at the same time improving RD efficiency as compared to the classical B frame prediction method, employed by the older MPEG standards (MPEG-1/2/4 Part 2) and used by default in H.264/AVC. Figure 1(a) depicts the classical B frame prediction structure, where each B frame is predicted only from the preceding I or P frame and from the subsequent I or P frame. Figure 1(b) depicts the hierarchical B frame structure [70], which uses B frames to predict B frames. The illustrated case is the dyadic hierarchy of B frames, meaning that the number of B frames in between the key pictures (I or P frames) must equal $2^k - 1$ for a positive integer $k$ (i.e., 1, 3, 7, 15, and so on). (We do not consider low-delay or constrained delay B frame prediction structures, for which we refer to [2].)

We depict the hierarchy with 3 B frames (I frame period is 16) in Figure 1(b). Temporal layer 0 consists of I and P key pictures, which are used to predict the B frames of temporal layer 1 (the temporal layer is indicated by the subscript of the I, P, and B symbols). The B frames of temporal layer 1 together with the key pictures predict the B frames of the second temporal layer. This halving of the prediction distance between frames in each prediction step is called dyadic hierarchy, with each splitting step resulting in one temporal layer, that is, the hierarchy with 15 B frames supports 5 temporal layers.
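The dyadic layer assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `temporal_layer` and the convention of indexing frames by their display-order offset from the preceding key picture are ours, not part of the SVC specification or the reference software.

```python
def temporal_layer(pos: int, gop_size: int) -> int:
    """Temporal layer of the frame at display-order offset `pos`.

    `gop_size` is the number of hierarchical B frames plus one key
    picture and must be a power of two (dyadic hierarchy).
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    pos %= gop_size
    if pos == 0:
        return 0  # key picture (I or P frame)
    layer = gop_size.bit_length() - 1  # highest layer index, log2(gop_size)
    # Each factor of two in the offset moves the frame one layer down.
    while pos % 2 == 0:
        pos //= 2
        layer -= 1
    return layer


# 3 B frames between key pictures: layers 0, 2, 1, 2
print([temporal_layer(p, 4) for p in range(4)])
# 15 B frames between key pictures: layers 0..4, i.e., 5 temporal layers
print(sorted({temporal_layer(p, 16) for p in range(16)}))
```

With 3 B frames the middle B frame lands in layer 1 and its two neighbors in layer 2, matching the three temporal layers of the depicted hierarchy.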

Underneath Figures 1(a) and 1(b), we provide for each frame the preferred encoding order with the smallest decoding delay. We observe that the encoding orders are identical for temporal layer 0, since the prediction dependencies of the key pictures are identical in both cases. With hierarchical B frames, the middle B frame is predicted first, while in the classical approach, the first B frame is predicted first.

The coding efficiency of hierarchical B frames depends on the choice of the quantization parameters for each B frame. H.264 SVC introduces cascading quantization scales which assign a higher quantization parameter value (lower quality) to B frames belonging to higher temporal layers.

2.2. Spatial Scalability

A spatial scalable bit stream implies that streams with different frame resolutions, such as QCIF (176 × 144 pixels), CIF (352 × 288), and 4CIF (704 × 576), are extractable from a single bit stream. In this example, the QCIF layer would be the spatial base layer, and the CIF and 4CIF layers the spatial enhancement layers. An important new property of H.264 SVC is that a spatial layer is decodable with a single motion-compensation loop.

Besides the encoding mechanisms that we described in Section 2.1, the tools that exploit the interlayer redundancies between spatial layers are interlayer motion prediction, interlayer residual prediction, and interlayer intra prediction [2]. Figure 2 depicts the intra- and interlayer prediction dependencies for two spatial layers (base and enhancement), illustrating that the interlayer prediction mechanisms operate in a bottom-up fashion, that is, the base layer is used for the prediction of the spatial enhancement layer.

2.3. SNR Scalability, Including Fine and Medium Granularity Scalability

With SNR (quality) scalability, the quality of the video frames is improved for a given spatial resolution and frame rate. The main quality scalability modes, although not all are part of the SVC amendment, are coarse granularity scalability (CGS), medium granularity scalability (MGS), and fine granularity scalability (FGS). In our traffic study, we focus on MGS (included in first SVC) and FGS (not included in first SVC), which we now briefly review.

H.264 FGS supports single-loop decoding. The I/P key pictures of the quality base layer are predicted from one another as in Figure 1(b), but the B frames can be predicted using all quality refinements available in the higher quality layers, as illustrated in Figure 3(a). This prediction using the quality refinements of the enhancement layer improves the coding efficiency since the highest quality representation is used for prediction, but results in a decoding drift error, which is only stopped at the next I/P key picture [71]. Alternatively, the quality base layer prediction structure can be based on the hierarchical B frames of the quality base layer only, with identical dependencies in the quality refinement layer, as illustrated in Figure 3(b). This prediction structure is also known as closed-loop motion compensated prediction at low and high bit rates, and we consider this structure in our traffic study.

In MPEG-4 Part 2 FGS, closed-loop motion compensation is adopted only for the quality base layer while for the quality enhancement layer, a bit-plane technique is used to code the difference between the original picture and the picture reconstructed from the quality base layer, as illustrated in Figure 3(c). However, not exploiting the temporal redundancies between the adjacent pictures in the FGS enhancement layer incurs a considerable loss in coding efficiency, which schemes, such as PFGS [72], tried to alleviate.

In H.264 FGS, hierarchical B frames are used to efficiently exploit the temporal redundancy among adjacent pictures in the FGS enhancement layer. Using a different coding technique (requantization of quantization error) instead of bit-plane coding in MPEG-4 Part 2 FGS, H.264 FGS codes the enhancement layer information in progressive refinement (PR) slices that can be truncated with byte granularity. Furthermore, motion refinement is allowed in the FGS enhancement layer, as detailed in [1].

SVC MGS similarly encodes additional quality layers that each consist of disposable quantities that are coarser than the byte truncation offered by FGS. One MGS quality enhancement layer, for example, increases the base layer quality corresponding to quantization parameter QP to the quality of an encoding with a smaller quantization parameter. The information in each MGS enhancement layer can additionally be represented with a maximum granularity of 1/16 or, equivalently, with up to 16 refinements included in the enhancement layer. This medium granularity enables network mechanisms to drop MGS enhancement packets in a simplified manner compared to FGS, which requires truncation.

2.4. Combined Scalability

H.264 SVC supports spatiotemporal-SNR scalability, also referred to as combined scalability. This means that one global bit stream supports spatial, temporal, and SNR scalability. Depending on the encoding configuration, several individual bit streams with different spatial resolutions, frame rates, and SNR enhancement layers are extractable from the global bit stream. The SNR enhancement can be provided by CGS, MGS, or FGS. Note that not all scalability modes are necessarily supported by a combined scalable bit stream.

3. Study Setup: Video Sequences, Encoding Tools, and Video Traffic Metrics

In this section, we introduce the setup used for obtaining the video traffic and quality characterizations presented in the subsequent sections.

3.1. Video Sequences

The Common Intermediate Format (CIF, 352 × 288 pixels) video sequences used for the statistics presented in this study are the ten-minute Sony Digital Video Camera Recorder demo sequence (17 682 frames at 30 frames/sec), which we refer to as Sony Demo sequence, the first half hour of the Silence of the Lambs movie (54 000 frames at 30 frames/sec), the Star Wars IV movie (54 000 frames at 30 frames/sec), and the first hour of the Tokyo Olympics video (133 128 frames at 30 frames/sec). We also use about 30 minutes of the NBC 12 News (49 523 frames at 30 frames/sec), including the commercials. The video sequences Silence of the Lambs, Star Wars IV, Tokyo Olympics, and NBC 12 News can, respectively, be described as drama/thriller, science fiction/action, sports, and news video. The Sony Demo sequence consists of 29 scenes with varying texture and motion complexities. Due to space constraints, we present in this paper only illustrative plots for encodings with Silence of the Lambs and Star Wars IV. The corresponding plots for the other video sequences are available in [73, 74].

3.2. Encoding Tools

We used the MEncoder tool to decode the sequences into uncompressed YUV format and to subsample the originally higher resolution sequences to CIF resolution. We used the MPEG-4 Part 2 Microsoft v2.3.0 software, and the SVC reference software, named JSVM, version 5.9 for the temporal layer evaluations, and versions 7.10 and 7.13, respectively, for studying FGS and spatial scalability.

3.3. Encoding Setup

We employ four GoP structures in our study of temporal scalability layers, namely, IBPBPBPBPBPBPBPB (16 frames, with 1 B frame per I/P frame), which we denote by G16-B1, IBBBPBBBPBBBPBBB (16 frames, with 3 B frames per I/P frame) denoted by G16-B3, IBBBBBBBPBBBBBBB (16 frames, with 7 B frames per I/P frame) denoted by G16-B7, and IBBBBBBBBBBBBBBB (16 frames, with 15 B frames per I frame) denoted by G16-B15. In the context of SVC, these four GoP structures are, respectively, designated by their “GoP size” which is the number of hierarchical B frames plus one key picture, either of type I or P. Hence, G16-B1 has GoP size 2, G16-B3 has GoP size 4, G16-B7 has GoP size 8, and G16-B15 has GoP size 16. In the following, we employ our own GoP structure notation to emphasize the repetitive I-P-B frame type patterns in the encodings. These four GoP structures are natural structures for hierarchical B frames and allow us to compare temporal layer statistics across encoders based on identical underlying GoP patterns.
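As a small illustration of this notation, the following Python sketch generates the frame type pattern of a G16-Bx structure and the corresponding SVC "GoP size". The helper names `gop_pattern` and `svc_gop_size` are hypothetical and introduced only for this example.

```python
def gop_pattern(num_b: int, length: int = 16) -> str:
    """Frame type pattern with `num_b` B frames per I/P frame.

    The pattern starts with an I frame and repeats every `length` frames.
    """
    step = num_b + 1  # distance between consecutive key pictures
    if num_b < 0 or length % step:
        raise ValueError("length must be a multiple of num_b + 1")
    return "".join(
        ("I" if n == 0 else "P") if n % step == 0 else "B"
        for n in range(length)
    )


def svc_gop_size(num_b: int) -> int:
    """SVC 'GoP size': hierarchical B frames plus one key picture."""
    return num_b + 1


for b in (1, 3, 7, 15):
    print(f"G16-B{b}: {gop_pattern(b)}  (SVC GoP size {svc_gop_size(b)})")
```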

Due to space constraints, we primarily focus in this paper on the temporal scalability layers for the G16-B3 GoP structure, which supports three temporal layers. The other GoP structures are presented in [73]. In our study of the spatial and FGS scalability layers, we focus on the GoP structure G16-B3 since the RD efficiency of MPEG-4 Part 2 deteriorates for more B frames, making a comparison across encoders less useful.

3.4. Video Traffic Metrics

We summarize the video traffic and quality metrics, which are all defined with respect to a given video sequence encoded with a fixed quantization scale, in Table 1. We remark that the coefficient of variation of the frame sizes is widely employed as a measure of the variability of the frame sizes, that is, the bit rate variability of the encoded video. Plotting the CoV as a function of the quantization scale (or equivalently, the PSNR video quality) gives the rate variability-distortion (VD) curve [48]. Alternatively, the peak-to-mean (Peak/Mean or PtM) ratio of the frame sizes is commonly used to express the traffic variability.

Regarding the bit rate metrics, we note that if each video frame is transmitted during one frame period (e.g., 33 milliseconds for 30 frames/s), then the bit rate [bits/s] required to transmit frame $n$ is $X_n/T$, where $X_n$ denotes the size of frame $n$ in bits and $T$ denotes the frame period in seconds. The corresponding mean bit rate and peak bit rate [bits/s] are defined in Table 1.

We define a Group of Pictures (GoP) of an encoded video stream as one I frame and all subsequent P and B frames before the next I frame in the stream. The size [bits] of GoP $m$ equals the sum of the sizes of the frames that belong to GoP $m$.
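The frame-level and GoP-level metrics above can be sketched in Python as follows. This is a minimal sketch, not the exact definitions of Table 1: the function names are hypothetical, and the use of the population standard deviation for the CoV is our assumption.

```python
from statistics import mean, pstdev


def traffic_metrics(sizes_bits, frame_period_s=1 / 30):
    """Mean, CoV, peak-to-mean ratio, and mean/peak bit rates of a size trace.

    Assumption: CoV is the population standard deviation over the mean.
    """
    m = mean(sizes_bits)
    peak = max(sizes_bits)
    return {
        "mean_bits": m,
        "cov": pstdev(sizes_bits) / m,
        "ptm": peak / m,
        "mean_rate_bps": m / frame_period_s,   # one frame per frame period
        "peak_rate_bps": peak / frame_period_s,
    }


def gop_sizes(frame_types, sizes_bits):
    """Sum frame sizes from each I frame up to (not including) the next one."""
    totals = []
    for ftype, size in zip(frame_types, sizes_bits):
        if ftype == "I" or not totals:
            totals.append(0)  # a new GoP starts at every I frame
        totals[-1] += size
    return totals


frame_stats = traffic_metrics([1000, 3000])
gops = gop_sizes("IBPBIBPB", [8, 1, 4, 1, 6, 1, 3, 1])
print(frame_stats["cov"], frame_stats["ptm"], gops)
```

Applying `traffic_metrics` to the output of `gop_sizes` (with the GoP duration as the period) gives the corresponding GoP-level statistics.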

We use the peak signal-to-noise ratio (PSNR) as the objective measure of the quality of a reconstructed video frame with respect to the corresponding uncompressed video frame. The larger the difference between the two frames, or equivalently, the lower the quality of the reconstructed frame, the lower the PSNR value. The PSNR is expressed in decibels [dB] to accommodate the logarithmic sensitivity of the human visual system. The PSNR is typically obtained for the luminance video frame and in case of a frame consisting of 8-bit pixel values, it is computed as a function of the mean squared error (MSE) as $\mathrm{PSNR} = 10 \log_{10}\left(255^2/\mathrm{MSE}\right)$. We denote the PSNR quality of video frame $n$ by $Q_n$. For a detailed definition of all statistics used in this study, we refer to [75].
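For concreteness, the standard PSNR computation for 8-bit luminance samples can be sketched as below. Representing a frame as a flat list of pixel values and the function names `mse` and `psnr_db` are assumptions made only for this example.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized lists of pixel values."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("frames must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr_db(mse_value, peak=255):
    """PSNR in dB for a given MSE; identical frames give infinite PSNR."""
    if mse_value == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse_value)


original = [100, 120, 140, 160]
reconstructed = [101, 118, 143, 160]
print(round(psnr_db(mse(original, reconstructed)), 2))
```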

4. Temporal Scalability Traffic Analysis

We examine the traffic characteristics of the temporal layers embedded in video streams encoded with H.264 SVC and MPEG-4 Part 2. We demonstrate that the traffic variability of H.264 SVC temporal layers is significantly higher than the variability of the corresponding MPEG-4 Part 2 temporal layers. For a fair comparison, we assume that the same temporal layers as for H.264 SVC can be extracted from the MPEG-4 Part 2 traffic. Although the bitstream syntax of the latter does not support this extraction, it is in principle feasible for an intelligent media gateway or decoder to drop the B frames belonging to the respective temporal layers according to the H.264 SVC dyadic layer principle.

4.1. Temporal Layer Basics

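The G16-B3 GoP structure implies the repetition of the frame type pattern $I_0 B_2 B_1 B_2 P_0 B_2 B_1 B_2 P_0 B_2 B_1 B_2 P_0 B_2 B_1 B_2$, whereby the subscripts denote the temporal layers (0, 1, 2) to which a frame belongs. The temporal base layer (0) is therefore $I_0\,0\,0\,0\,P_0\,0\,0\,0\,P_0\,0\,0\,0\,P_0\,0\,0\,0$, with zeroes replacing the dropped B frames of temporal enhancement layers 1 and 2. The first temporal enhancement layer is $0\,0\,B_1\,0\,0\,0\,B_1\,0\,0\,0\,B_1\,0\,0\,0\,B_1\,0$ and the second enhancement layer is $0\,B_2\,0\,B_2\,0\,B_2\,0\,B_2\,0\,B_2\,0\,B_2\,0\,B_2\,0\,B_2$.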

In case of our CIF sequences at a frame rate of 30 frames per second (fps), the temporal base layer represents a stream with a frame rate of 7.5 fps, the combination (aggregation) of the base layer and the first enhancement layer increases the frame rate to 15 fps, and the reception of the second enhancement layer results in the full frame rate of 30 fps. We note that the temporal base layer frames are required for decoding enhancement layer 1 frames, and that enhancement layer 2 frames need both lower layers to be decoded.
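The separation of a frame size trace into temporal layer traces, with zeroes for the frames that belong to other layers, can be sketched as follows. The names `G16_B3_LAYERS`, `layer_trace`, and `frame_rate_up_to` are hypothetical; the layer pattern 0, 2, 1, 2 per key picture interval follows the G16-B3 structure described above.

```python
# Temporal layer of each frame position in one G16-B3 GoP (16 frames).
G16_B3_LAYERS = [0, 2, 1, 2] * 4


def layer_trace(frame_sizes, layer, pattern=G16_B3_LAYERS):
    """Frame sizes of one temporal layer; other frames are replaced by 0."""
    return [
        size if pattern[n % len(pattern)] == layer else 0
        for n, size in enumerate(frame_sizes)
    ]


def frame_rate_up_to(layer, full_rate=30.0, pattern=G16_B3_LAYERS):
    """Frame rate when all layers up to and including `layer` are received."""
    kept = sum(1 for l in pattern if l <= layer)
    return full_rate * kept / len(pattern)


sizes = [10, 1, 4, 1, 8, 1, 4, 1]
for layer in (0, 1, 2):
    print(layer, layer_trace(sizes, layer), frame_rate_up_to(layer))
```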

Let us examine the video quality associated with receiving certain temporal layers. Clearly, the average PSNR video quality of the combination of all temporal layers, that is, of the aggregated traffic, is equal to the average quality of the single-layer video stream. However, if we simply averaged the PSNR values (denoted $Q_n$ for frame $n$) of the base layer frames ($I_0$ and $P_0$), then this average would be unrealistically high compared to the average of the corresponding single-layer (30 fps) stream, since the subjective quality impression of human observers is much lower for the frame rate of 7.5 fps. In order to include this perceptual quality degradation in the PSNR measurement, we assume that the decoder duplicates a received base layer frame until the next frame is received and decoded. The result is the duplicated base layer frame sequence $I_0\,I_0^1\,I_0^2\,I_0^3\,P_0\,P_0^1\,P_0^2\,P_0^3 \ldots$, where the upper index represents the duplicated frame number. This sequence has a frame rate of 30 fps.

The PSNR value of a duplicated frame located at frame number $n$ is calculated based on the MSE between this duplicated frame and the original frame $n$ from the original (uncompressed) sequence. This PSNR value reflects the subjective distortion that occurs when jerky sequences consisting of duplicated frames are viewed by human observers. In general, the perceived video quality of a sequence is high if the average PSNR is high and the quality variation is low [5]. When there is low motion activity in the successive frames, that is, when frames are alike (low MSE), then duplication of frames results in barely noticeable jerkiness. The variation of the PSNR values is therefore also small. On the other hand, when high motion activity is present, then successive frames differ substantially and the MSE between successive frames is large, as is the quality variation. The computed overall PSNR average therefore sufficiently incorporates the perceptual video quality reduction due to the reduced frame rate (jerkiness).

We apply the same principle to the computation of the average quality when the temporal base and first enhancement layer are received and decoded. This means that the following frame sequence is displayed: $I_0\,I_0^1\,B_1\,B_1^1\,P_0\,P_0^1\,B_1\,B_1^1 \ldots$. The combination of all temporal layers results in displaying the sequence $I_0\,B_2\,B_1\,B_2\,P_0\,B_2\,B_1\,B_2 \ldots$, which is the single-layer sequence.
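The frame duplication rule and the resulting per-frame PSNR can be sketched as follows. This is an illustrative sketch under our own assumptions: frames are flat lists of 8-bit luminance values, the layer pattern 0, 2, 1, 2 repeats per key picture interval (G16-B3), and the function names are hypothetical.

```python
import math


def displayed_indices(num_frames, max_layer, pattern=(0, 2, 1, 2)):
    """Index of the decoded frame shown at each frame position.

    A frame is shown itself if its layer was received; otherwise the most
    recently received frame is repeated.
    """
    shown, last = [], 0
    for n in range(num_frames):
        if pattern[n % len(pattern)] <= max_layer:
            last = n
        shown.append(last)
    return shown


def frame_psnr_db(original, shown, peak=255):
    """PSNR between an original frame and the frame actually displayed."""
    err = sum((a - b) ** 2 for a, b in zip(original, shown)) / len(original)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


def psnr_with_duplication(originals, decoded, max_layer, pattern=(0, 2, 1, 2)):
    """Per-frame PSNR at full frame rate when only layers <= max_layer arrive."""
    idx = displayed_indices(len(originals), max_layer, pattern)
    return [frame_psnr_db(originals[n], decoded[i]) for n, i in enumerate(idx)]


print(displayed_indices(8, 0))  # base layer only
print(displayed_indices(8, 1))  # base layer plus first enhancement layer
```

In the example, each duplicated frame is compared against the original frame at its own position, so high-motion content yields a larger MSE and a lower PSNR for the repeated frames.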

Before we analyze the temporal layer traffic statistics, we describe the simple smoothing that we apply to the temporal base and enhancement layers to decrease the traffic variability. Let $X_n$ denote the frame size (bytes) of frame $n$. Since there are large transmission gaps between frames of the base layer, we can redistribute the frame sizes over these gaps by dividing the frame size by four, and hence sending a quarter of each base layer frame during one frame period: $X_n/4$ is sent in each of the frame periods $n, n+1, n+2, n+3$, for $n = 0, 4, 8, \ldots$. Equivalently, we say that we have smoothed the temporal base layer traffic over $a = 4$ frames. Analogously, the first enhancement layer traffic is smoothed over $a = 4$ frames: a quarter $X_n/4$ of each $B_1$ frame ($n = 2, 6, 10, \ldots$) is sent per frame period. The second layer is smoothed over $a = 2$ frames since only one frame is missing in between the B frames of this layer: one half $X_n/2$ of each $B_2$ frame ($n = 1, 3, 5, \ldots$) is sent per frame period. This basic smoothing introduces extra decoding delays, but mitigates to some extent the high rate variability as we demonstrate in the next section.
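The basic smoothing can be sketched in Python as below. The function name `smooth_layer` is hypothetical, and spreading each frame over the frame periods starting at its own position is our assumption about the timing; the size statistics do not depend on this choice.

```python
def smooth_layer(layer_sizes, a):
    """Spread every nonzero frame evenly over `a` consecutive frame periods.

    `layer_sizes` is a layer trace with zeroes for frames of other layers.
    The result is a - 1 entries longer so that the last frame fits.
    """
    if a < 1:
        raise ValueError("a must be at least 1")
    out = [0.0] * (len(layer_sizes) + a - 1)
    for n, size in enumerate(layer_sizes):
        if size:
            for k in range(a):
                out[n + k] += size / a
    return out


base = [8, 0, 0, 0, 4, 0, 0, 0]      # key pictures every 4th frame
enh2 = [0, 6, 0, 2, 0, 6, 0, 2]      # B frames every 2nd frame
print(smooth_layer(base, 4)[:8])     # quarter of each frame per period
print(smooth_layer(enh2, 2))         # half of each frame per period
```

The total number of bytes is unchanged by the smoothing; only the per-period peaks and the transmission gaps are reduced.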

4.2. Results and Discussion

We treat each temporal layer separately in the following analysis, except for the layer quality where we assume the reception of all lower layers. The aggregation of all layers is equivalent to the single-layer case, which is analyzed in detail in [36]. The main reason for treating each layer separately is that streaming protocols, such as the Real Time Protocol [76, 77], typically packetize and stream each layer separately to allow for different treatment of the layers in the network.

In Table 2, we summarize traffic and quality statistics of the temporal base layer and the two temporal enhancement layers included in the G16-B3 GoP structure. The table includes frame size, bit rate, smoothed frame size, GoP size, and video quality statistics. We estimate these statistics based on the five long CIF sequences that we encode with H.264 SVC and MPEG-4 Part 2. In the first column of Table 2, the encoding mode is specified by a code representing the encoder (SV for H.264 SVC and Mp for MPEG-4 Part 2) and the quantization scale. For each encoder, we present min/mean/max values (computed across the five sequences) for two selected quantization scales that result in approximately equal PSNR quality (max-to-min) ranges. For example, the base layer quantization scale 28 for H.264 SVC results in the PSNR quality range 29.3–36.6 dB, and the quantization scale 4 for MPEG-4 Part 2 results in the quality range 29.2–36.5 dB. We compare the various statistical quantities in Table 2 based on matching quality ranges between encoders. Detailed results for the full range of studied quantization scales, which gives the RD and VD curves presented in this paper, are available in [73, 74].

First, we can confirm the improved RD efficiency of the H.264 SVC temporal layers as compared to the MPEG-4 Part 2 layers based on the smaller mean frame size ranges (for corresponding quality ranges) or, equivalently, the lower mean bit rate ranges for H.264 SVC. Second, the mean bit rates are significantly lower for the H.264 SVC temporal enhancement layers as compared to the base layer rates. This is also the case for the MPEG-4 Part 2 enhancement layer rates as compared to the base layer, but to a lesser extent. The reason is that the base layer consists of large I and P frames (for both encoders). The assignment of cascading quantizers to the H.264 SVC B frames is responsible for the enhancement layer differences between the encoders. As opposed to MPEG-4 Part 2, H.264 SVC introduces cascading quantization scales that assign larger quantization parameters (lower quality and equivalently lower bit rate) to B frames belonging to higher temporal layers. This concept is based on the insight that the temporal base layer requires higher quality than the next temporal layer, since all other predictions depend on it. The quality (and bit rate) of each subsequent temporal layer can be gradually reduced, since fewer layers depend on it. The quality fluctuation that is introduced within a GoP is not subjectively noticeable by human observers, as studied by the standardization committee. Hence, the bit rates of the H.264 SVC temporal enhancement layers are significantly lower relative to the base layer than those of the MPEG-4 Part 2 enhancement layers.

The quality analysis, and in particular the CoQV, demonstrates that the quality of the base layer is more variable than the quality when the first enhancement layer is additionally received by the decoder. When all layers are received, the CoQV is the lowest. We observe this quality variability decrease for H.264 SVC and MPEG-4 Part 2.

Next, we discuss the frame and GoP size coefficients of variation CoV and peak-to-mean ratios PtM of the temporal layers. Table 2 illustrates that the CoV and PtM of the unsmoothed frame size traffic, that is, traffic including zeroes (transmission gaps) for missing frames, are high for all temporal layers and all encoders. The zero frame sizes are the main reason behind this high variability. The H.264 SVC values are typically considerably higher than the MPEG-4 Part 2 values. For example, the H.264 SVC CoV values for the first enhancement layer are as high as 3.79, while the MPEG-4 Part 2 CoV reaches 2.85. With basic smoothing, the maximum CoV and PtM values decrease. For example, the maximum CoV of the first enhancement layer of H.264 SVC decreases to 1.68, while MPEG-4 Part 2's maximum CoV drops to 1.13. The GoP size CoV and PtM values exhibit similar trends. However, the differences between H.264 SVC and MPEG-4 Part 2 values are smaller, because the CoV and PtM of the GoP size are equal to the CoV and PtM of layers smoothed over the entire GoP (16 frames). Nevertheless, a fairly significant increase of H.264 SVC layer variability remains over MPEG-4 Part 2, with the mean CoVs of the H.264 SVC temporal enhancement layers being typically 1.5 times larger than the mean CoVs of the MPEG-4 Part 2 layers.
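A short calculation, added here as an illustrative sketch rather than a result of the study, shows why the zero frame sizes dominate the unsmoothed variability. Assume (our notation) that a layer carries one frame every $a$ frame periods, that the nonzero frame sizes have mean $\mu$, variance $\sigma^2$, and coefficient of variation $c = \sigma/\mu$, and that the remaining $a - 1$ periods carry zero. The sequence including the zeroes then has mean $\mu/a$ and second moment $(\sigma^2 + \mu^2)/a$, so that

\[
\mathrm{CoV}_{\mathrm{unsmoothed}}^2
= \frac{(\sigma^2 + \mu^2)/a - \mu^2/a^2}{\mu^2/a^2}
= a\,(1 + c^2) - 1 .
\]

Even for perfectly constant frame sizes ($c = 0$), a layer with $a = 4$ thus has $\mathrm{CoV} = \sqrt{3} \approx 1.73$ and a layer with $a = 2$ has $\mathrm{CoV} = 1$. Spreading each frame evenly over its $a$ periods removes the zeroes, and the CoV of the smoothed sequence reduces to $c$.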

In Table 3, we provide an overview of the maximum CoV and PtM values for each temporal layer. The table includes the maximum of the maximum values, such as the largest per-scale maximum frame size CoV, and the maximum of the mean values, such as the largest per-scale mean frame size CoV. In every instance, the overall maximum is over all studied quantization scales (not only the selected quantization scales included in Table 2), while the inner maximum or mean is over all sequences for a given quantization scale.
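The two-level aggregation used for Table 3 can be sketched as follows. The function name `max_of_max_and_mean` is hypothetical and the numbers in the usage example are made-up placeholders, not values from the study.

```python
from statistics import mean


def max_of_max_and_mean(values_by_scale):
    """Aggregate a per-sequence statistic over sequences, then over scales.

    `values_by_scale` maps each quantization scale to the list of values
    (e.g., frame size CoVs) obtained for the individual video sequences.
    Returns (maximum of the per-scale maxima, maximum of the per-scale means).
    """
    max_of_max = max(max(vals) for vals in values_by_scale.values())
    max_of_mean = max(mean(vals) for vals in values_by_scale.values())
    return max_of_max, max_of_mean


# Placeholder CoV values for three sequences at two quantization scales.
example = {28: [1.0, 2.0, 3.0], 38: [2.5, 2.5, 2.5]}
print(max_of_max_and_mean(example))
```

Note that the two summary values can stem from different quantization scales, as in the placeholder example.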

Table 3 clearly demonstrates the higher CoV and PtM values of the H.264 SVC layer traffic as compared to the MPEG-4 Part 2 traffic. We observe that the first H.264 SVC enhancement layer has the highest CoV and PtM values among all unsmoothed layers. When smoothing is applied, the values of the second enhancement layer are highest, mainly because this layer is smoothed over only two frames, as compared to four frames for the other layers. Nevertheless, the advantage of traffic smoothing to reduce traffic variability is clear when comparing smoothed to unsmoothed values. After smoothing is applied, the H.264 SVC layers (especially enhancement layer 1 and even more so enhancement layer 2) still exhibit higher variability than the MPEG-4 Part 2 layers, making network transport of H.264 SVC temporal layers more challenging. The increased variability of the H.264 SVC temporal layer traffic is mainly attributable to the improved compression tools (e.g., motion compensated prediction), which exploit redundancies more efficiently and therefore produce frame sizes that are more sensitive to frame content variations.

In Figure 4, VD curves are depicted for each temporal layer and the aggregated traffic (single-layer) of the Silence of the Lambs and Star Wars IV sequences encoded with H.264 SVC and MPEG-4 Part 2 (G16-B3 GoP structure). We provide VD curves for unsmoothed and smoothed layer traffic. The VD curves for each temporal layer represent CoV values as a function of the average PSNR quality, obtained after decoding the current temporal layer and all lower layers, as we explained in Section 4.1. The average quality range increases from the temporal base layer VD curve with a quality range up to approximately 39 dB, to about 46 dB when the decoder additionally receives the first temporal enhancement layer, and to roughly 52 dB when the decoder receives all temporal layers. The figure also includes the VD curve of the aggregated traffic with values that lie between the individual unsmoothed and smoothed temporal layer VD curves. When comparing VD curves for the Silence of the Lambs sequence in Figures 4(a) and 4(c), respectively, for H.264 SVC and MPEG-4 Part 2, the higher traffic variability (CoV) of H.264 SVC is pronounced. The same applies to the Star Wars IV sequence VD curves in Figures 4(b) and 4(d). Additionally, we depict the VD curves for five temporal layers of the G16-B15 GoP structure in Figure 5. We observe even higher CoV values for the unsmoothed layers as compared to G16-B3.

5. Spatial Scalability Traffic Analysis

In this section, we focus on the spatial scalability layers of H.264 SVC and MPEG-4 Part 2, employing GoP structure G16-B3. All five CIF sequences are downsampled to QCIF (176 × 144 pixels) resolution, which forms the spatial base layer of the encodings. The CIF layer forms the spatial enhancement layer. The statistical analysis treats each spatial layer separately, similar to the temporal layer analysis. We do not consider the temporal scalable layers that are present in each spatial layer, since they are the subject of the combined scalability analysis in Section 8. We compare the spatial layer traffic generated by both encoders, and we compare with single-layer QCIF and CIF traffic. The latter is warranted by the lower rate-distortion efficiency of spatial scalable encoding based on interlayer prediction, as compared to single-layer encoding, even though the H.264 SVC encoding tools represent an improvement over MPEG-4 Part 2.

5.1. Spatial Layer Basics

Since we do not consider temporal layer issues in this spatial layer analysis, the statistical processing of each spatial layer and the aggregated traffic follows the single-layer analysis approach. However, the average quality (PSNR) assigned to the QCIF layer does not represent the subjective quality perception, as compared to the CIF layer, if the lower resolution effect is not taken into account. Therefore, we upsample the decoded spatial QCIF base layer to CIF resolution and compute the average quality based on the MSE between the upsampled QCIF and the original (uncompressed) CIF sequence. The decoded CIF sequence is directly compared to the original sequence. This approach is warranted for receivers with CIF resolution displays, requiring upsampling of QCIF video streams to fit the display size. We realize that the applied upsampling technique plays a role in the subjective quality of the upsampled QCIF sequence. However, for our practical traffic study, this is of lesser concern. We also clarify that the quality that we associate with the spatial enhancement layer is identical to the quality of the aggregated traffic (base and enhancement layers), since the enhancement layer is only decodable if the spatial base layer has been received.
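
The quality computation for the base layer can be sketched on a toy frame as follows. Nearest-neighbour 2x upsampling, the 8-bit peak value of 255, and the pixel values are assumptions made for illustration; the upsampling filter actually used in the study is not specified here, and the helper names (upsample_2x, mse, psnr) are hypothetical.

```python
import math

def upsample_2x(frame):
    """Nearest-neighbour 2x upsampling of a 2-D list of luminance values."""
    out = []
    for row in frame:
        wide = [pixel for pixel in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def mse(a, b):
    """Mean squared error between two equally sized 2-D lists."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def psnr(a, b, peak=255):
    """PSNR in dB; infinite when the frames are identical."""
    error = mse(a, b)
    return math.inf if error == 0 else 10 * math.log10(peak ** 2 / error)

# Toy 2x2 "QCIF" frame and 4x4 "CIF" original (hypothetical pixel values).
decoded_small = [[10, 20], [30, 40]]
original_large = [[10, 10, 20, 20],
                  [10, 14, 20, 20],
                  [30, 30, 40, 40],
                  [30, 30, 40, 40]]

quality_db = psnr(upsample_2x(decoded_small), original_large)
```

The factor of two in each dimension matches the QCIF to CIF resolution ratio; a real evaluation would average the per-frame result over the whole sequence.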

5.2. Results and Discussion

In Table 4, we provide example H.264 SVC and MPEG-4 Part 2 traffic statistics (min/mean/max values across sequences as in Section 4) of the spatial base layer, spatial enhancement layer, the aggregated traffic, and single-layer QCIF and CIF traffic for comparison with the spatial layers. In the first column of the table, we specify the encoding mode by an encoder code (SVS for spatially scalable H.264 SVC and Mp4S for spatially scalable MPEG-4 Part 2) and the quantization scale.

We first analyze the spatial base layer traffic, comparing the mean frame sizes and mean bit rates of the H.264 SVC spatial base layer with the MPEG-4 Part 2 base layer for approximately the same quality ranges. We confirm the improved RD efficiency of H.264 SVC. The average qualities are overall quite low, since we used spatial upsampling to compute CIF resolution qualities, as explained in Section 5.1. The coefficient of quality variation CoQV is in the range of 0.11–0.19 for both encoders. For all spatial layers, Table 5 provides maximum-of-maximum and maximum-of-mean values for the CoV and PtM across all quantization scales and sequences. From the spatial base layer values, we observe overall significantly larger CoV and PtM values for H.264 SVC as compared to MPEG-4 Part 2, making the network transport of the H.264 SVC spatial base layer challenging.

In Table 4, we additionally summarize statistics of single-layer QCIF encodings for comparison with the spatial base layer statistics. Inspection of the values reveals that they are almost perfectly identical, which confirms that the spatial base layer is encoded independently from the spatial enhancement layers, and identical to single-layer encoding. The reason is that the interlayer tools predict the spatial enhancement layer employing the base layer and the latter is not predicted from the enhancement layer information. Therefore, the spatial base layer statistics follow single-layer trends that are extensively studied in [36].

Examples of H.264 SVC and MPEG-4 Part 2 traffic statistics of the spatial enhancement layer are summarized in Table 4. We also provide single-layer CIF statistics for comparison. For H.264 SVC, the average enhancement layer bit rate is more than twice the bit rate of the base layer for the highest qualities and converges to about the same bit rate for the lowest qualities, see [73]. For MPEG-4 Part 2, the enhancement layer bit rate is always significantly larger than the base layer rate. This is explained by the enhanced coding efficiency of H.264 SVC's interlayer prediction tools.

The enhancement layer average qualities extend to high qualities since the complete CIF resolution is decodable by receivers. The CoQV values are about 0.04–0.13, which is lower than the base layer quality variability. Table 5 provides maximum-of-maximum and maximum-of-mean CoV and PtM enhancement layer values, which are typically twice as large or larger for H.264 SVC than for MPEG-4 Part 2. Furthermore, the H.264 SVC spatial enhancement layer has larger CoV and PtM values than the SVC base layer, while MPEG-4 Part 2 enhancement values are comparable to or lower than the base layer values. Finally, the CoV and PtM enhancement layer values are only slightly larger than or comparable to single-layer CIF values in Table 5, for both H.264 SVC and MPEG-4 Part 2.

Next, we discuss the aggregated traffic statistics provided in Table 4, and compare with the enhancement layer and single-layer values. The mean frame sizes and bit rates are equal to the sum of the corresponding base and enhancement layer values. The quality statistics are identical to those of the enhancement layer, as discussed in Section 5.1. From Table 5, we again observe significantly larger maximum CoV and PtM values for the H.264 SVC aggregated traffic as compared to MPEG-4 Part 2. Compared to the SVC enhancement layer, the CoV and PtM values of the aggregated traffic are generally somewhat lower. Comparing the aggregated traffic statistics to the single-layer values reveals that the variabilities of the aggregate traffic are somewhat lower than the variabilities of the single-layer traffic.
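
The relation between layer and aggregate statistics can be illustrated with a small sketch. The per-frame sizes below are hypothetical and the population standard deviation is an assumption; the example only shows that the means add up and that, for this toy trace, the aggregate CoV lies below the enhancement layer CoV.

```python
from statistics import mean, pstdev

def cov(sizes):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(sizes) / mean(sizes)

# Hypothetical per-frame sizes (bytes) of the two spatial layers.
base = [900, 300, 250, 300, 800, 280, 260, 310]
enhancement = [4000, 400, 300, 500, 3500, 450, 350, 500]

# The aggregated stream carries both layers of every frame.
aggregated = [b + e for b, e in zip(base, enhancement)]

mean_sizes = (mean(base), mean(enhancement), mean(aggregated))
cov_values = (cov(base), cov(enhancement), cov(aggregated))
```

Since the standard deviation of a sum is at most the sum of the standard deviations, the aggregate CoV cannot exceed the larger of the two layer CoVs, which is consistent with the somewhat lower aggregate values noted above.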

In Figure 6, we depict VD curves of the spatial layers (QCIF and CIF) and the aggregated traffic, alongside the single-layer VD curves, for the Silence of the Lambs and Star Wars IV sequences encoded with H.264 SVC and MPEG-4 Part 2. We observe that the base layer and corresponding QCIF single-layer VD curves are identical for all sequences and encoders, as expected. Comparing Figures 6(a) and 6(c) for Silence of the Lambs encoded with H.264 SVC and MPEG-4 Part 2 clearly reveals the higher variability of the H.264 SVC base layer traffic. This is also observable in Figures 6(b) and 6(d) for the Star Wars IV sequence. The enhancement layer VD curves for H.264 SVC are above the MPEG-4 Part 2 curves in all cases. The VD curves of the aggregated traffic are the combined result of the base and enhancement layer variabilities, and as such, they are generally positioned between these two VD curves. An interesting distinction between H.264 SVC and MPEG-4 Part 2 is that the MPEG-4 layer 0 QCIF streams have higher traffic variabilities than the corresponding MPEG-4 layer 1 CIF streams. With H.264 SVC, this relationship is reversed, that is, the layer 1 CIF H.264 SVC streams have higher variability than the corresponding H.264 SVC layer 0 QCIF streams, further underscoring the high traffic variability of the spatial enhancement layer of H.264 SVC.

6. Fine Granular Scalability Traffic Analysis

We compare H.264 SVC fine granularity scalability (SVC FGS) with MPEG-4 Part 2 FGS (MPEG-4 FGS) traffic based on GoP structure G16-B3. We analyze the base and enhancement layers separately and do not consider the temporal layers in this section, since they are the subject of our combined FGS-temporal analysis in Section 8.

6.1. FGS Layer Basics

For MPEG-4 FGS, many possible FGS structures can be used such as basic FGS, FGS temporal (FGST), combined FGS-FGST, and multilayer FGST, which are detailed in [71]. In this study, we use the basic FGS structure, depicted in Figure 3(c), with one FGS enhancement layer frame for every base layer frame. We employ the H.264 FGS prediction loop illustrated in Figure 3(b), which is closed with respect to both the highest and lowest quality points.

The subsequent FGS analysis is based on the CIF video sequences Silence of the Lambs, Star Wars IV, NBC 12 News, and Sony Demo. We configured both encoders with one FGS enhancement layer and specified the base layer quantization scale. We study the traffic characteristics of the FGS base layer, the untruncated and the truncated enhancement layer, as well as the aggregated (base + enhancement) traffic.

6.2. Results and Discussion

We analyze the statistics of base, enhancement, and aggregated traffic separately, in correspondence with the various possibilities of reception at the decoder. For selected base layer quantization scales, we present values of SVC FGS and MPEG-4 FGS traffic statistics for overlapping quality ranges in Table 6. We provide minimum, mean, and maximum (across the five video sequences) values of the traffic statistics. In the first column of the table, the encoder quantization scales are specified for MPEG-4 FGS and SVC FGS. In Table 7, we present the maximum values across quantization scales and sequences. We observe from Table 6 a significant compression efficiency improvement in the base layer due to the improved tools in SVC FGS. These improved compression tools result in very high traffic variabilities for the SVC FGS base layer, with maximum CoV and PtM values up to 2.5 and 39.9, as compared to up to 1.5 and 22.14 for MPEG-4 FGS, as observed in Table 7. The maximum-of-mean values are similarly higher for SVC FGS. From these values, we conclude that significant traffic variability is introduced in the SVC FGS base layer as compared to MPEG-4 FGS. When comparing with single-layer H.264 SVC, see [74], we find that the base layer of SVC FGS (Table 6) is nearly identical, since both use a closed-loop prediction structure.

Table 6 also gives selected examples to compare the untruncated FGS enhancement layers of both encoders. From Table 7, CoV and PtM have maxima up to 2.11 and 20.28, respectively, for SVC FGS, compared to up to 0.6 and 4.0 for MPEG-4 FGS. The SVC FGS enhancement layer has been subject to improved compression tools, resulting in increased variability at the frame level. Analogously, for the aggregated traffic with untruncated enhancement layer (Table 6), we have a CoV of 1.97 and a PtM of 25.5 for SVC FGS, as compared to 0.92 and 8.54 for MPEG-4 FGS.

Next, we examine the RD graphs of the SVC FGS and MPEG-4 FGS layers. Figure 7 depicts the base, untruncated enhancement, and aggregated traffic (base + untruncated enhancement) RD graphs for SVC FGS and MPEG-4 FGS encodings of the Silence of the Lambs sequence. The FGS base layer RD graphs are typical (quality increases monotonically as a function of the bit rate) and demonstrate the improved RD efficiency of SVC FGS in the base layer. The untruncated enhancement layer for MPEG-4 FGS contains refinement information allowing high-quality reconstruction of the frames, resulting in the near-flat RD curve. The aggregated traffic RD graphs are the summation of the base and untruncated enhancement layer rates (per quality value).

To compare MPEG-4 FGS and SVC FGS with various truncations of the enhancement layer, we use average base layer PSNR qualities that are approximately equal. For Star Wars IV, we select quantization scales 34 and 8, respectively, for SVC FGS and MPEG-4 FGS, corresponding to an average base layer PSNR of approximately 34 dB. We further choose quantization scales 38 and 16 for SVC FGS and MPEG-4 FGS corresponding to a PSNR of approximately 37 dB for the Silence of the Lambs sequence. We truncate the enhancement layer progressively with 10% increments of the enhancement layer bit rate.

The RD graphs obtained for the aggregated (base + truncated enhancement) traffic for both sequences are depicted in Figure 8(a). The steep rise of the SVC FGS enhancement layer RD curve for every 10% increment in bit rate is in clear contrast to MPEG-4 FGS, which has a much lower RD performance with more gradual increments. This lower RD performance is explained by ignoring the enhancement layer in the prediction loop of MPEG-4 FGS. This also clearly demonstrates the substantial coding improvements made to the enhancement layer of SVC FGS, without significantly increasing the computational complexity (a major concern for portable devices). We also observe from Figure 8 that the Star Wars IV sequence has a better RD performance, which is consistent with earlier results. We note that the truncation of the MPEG-4 FGS enhancement layer resulted in outliers that are included in Figure 8 as disconnected tick marks.

The VD curves in Figure 8(b) illustrate the significant contrast in variability between SVC FGS and MPEG-4 FGS. These VD curve points correspond to the RD curve points and represent the variability of the progressively truncated enhancement layer. For SVC FGS, we observe a marginal decrease in variability for increasing bit rate. The plots also include the VD curves of the smoothed traffic (denoted by sm), which show that the high variability of the SVC FGS stream can be significantly reduced by smoothing. However, the unsmoothed MPEG-4 FGS curves lie well below the smoothed SVC FGS stream curves, pointing to the inherently high variability introduced by the SVC FGS encoder. (The Star Wars IV VD curves for MPEG-4 FGS are above the Silence of the Lambs VD curves in Figure 8(b) due to the higher base layer CoV of Star Wars IV for the considered quantization scale 8.)

Although we consider a basic truncation strategy, which truncates each enhancement layer's progressive refinement (PR) slice by the same percentage, the traffic variability is still high. This is because the truncation of each PR slice results in widely variable truncated PR slice sizes (bytes). The SVC FGS traffic variability is consistently high across the range of percentages of enhancement layer added to the base layer, an important characteristic to take into account in the design of transport protocols, as the enhancement layer is typically sent over a more error-prone path than the base layer.
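
A minimal sketch of this truncation strategy shows why a uniform percentage does not reduce the relative variability of the enhancement layer: scaling every PR slice by the same factor leaves the CoV unchanged. The frame sizes, the helper names (cov, truncate), and the population standard deviation are illustrative assumptions, not data or tools from this study.

```python
from statistics import mean, pstdev

def cov(sizes):
    """Coefficient of variation: standard deviation divided by the mean."""
    return pstdev(sizes) / mean(sizes)

def truncate(pr_slice_sizes, tenths):
    """Keep tenths/10 of every PR slice (whole bytes)."""
    return [size * tenths // 10 for size in pr_slice_sizes]

# Hypothetical per-frame sizes (bytes) of the base layer and the PR slices.
base = [3000, 500, 400, 600]
enhancement = [9000, 2500, 1500, 4000]

# CoV of the truncated enhancement layer and of base + truncated enhancement,
# for truncation points from 10% to 100% in 10% steps.
cov_by_percent = {}
for tenths in range(1, 11):
    kept = truncate(enhancement, tenths)
    aggregated = [b + e for b, e in zip(base, kept)]
    cov_by_percent[tenths * 10] = (cov(kept), cov(aggregated))
```

In this sketch the enhancement layer CoV is the same at every truncation point, while the aggregate CoV shifts only gradually as more of the enhancement layer is added to the base layer.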

7. Medium Grain Scalability Traffic Analysis

In this section, we focus on the medium grain scalability (MGS) mode of H.264 SVC, employing GoP structure G16-B0, which signifies 15 P frames in between I frames and no B frames. The resulting MGS base layer with CIF resolution conforms to the restricted Baseline profile of H.264/AVC. The MGS enhancement layer adds information that improves the quality of each video frame type up to the maximum quality encoded in the enhancement layer. Similar to the previous sections, the statistical analysis treats each layer separately and also aggregates the traffic in both layers. We compare the layer traffic generated by H.264 SVC MGS; however, we are not able to compare it with equivalent traffic of MPEG-4 Part 2, since this standard does not include a similar quality scalability mode.

7.1. MGS Layer Basics

The statistical processing of the base layer, MGS enhancement layer, and the aggregated traffic follow the single-layer analysis approach. As for spatial scalability and FGS, the quality that we associate with the MGS enhancement layer is identical to the quality of the aggregated traffic (base and enhancement layers).

The MGS enhancement layer studied in this analysis supports one quality enhancement with a quantization parameter decrease of 6 (increased quality). We leave the study of multiple MGS quality extraction points (up to 16) within this enhancement layer for future research as well as the statistical analysis of the G16-B3 GoP structure.

7.2. Results and Discussion

Table 8 enumerates example H.264 SVC traffic statistics (min/mean/max values across sequences) of the base layer, MGS enhancement layer, and the aggregated traffic. In the first column of the table, we specify the encoding mode by the encoder code SVM followed by the quantization scale.

Comparing the mean bit rates of the base layer and the corresponding MGS enhancement layer (same quantization scale), it is evident that the enhancement layer adds a large increase in bit rate to the base layer across the entire range of studied quantization scales and sequences. The spanned decrease in quantization scale of 6, which halves the quantization step size, is encoded less efficiently by the MGS tools, resulting in the much larger required bit rates.
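
The halving of the step size follows from the H.264/AVC quantizer design, in which the step size doubles for every increase of 6 in the quantization parameter (QP). The sketch below assumes that the quantization scale referred to here is the H.264 QP in the range 0 to 51; the function name qstep and the example base layer QP of 30 are hypothetical.

```python
# Quantizer step sizes for QP 0..5 in H.264/AVC; the pattern doubles every 6 QP values.
BASE_QSTEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    """Quantizer step size for an H.264/AVC quantization parameter in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    return BASE_QSTEP[qp % 6] * 2 ** (qp // 6)

base_qp = 30                      # hypothetical base layer QP
enhancement_qp = base_qp - 6      # MGS enhancement layer: QP decreased by 6
ratio = qstep(enhancement_qp) / qstep(base_qp)
```

Here the base layer step size is 20 and the enhancement layer step size is 10, so the enhancement layer quantizes the residual twice as finely.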

The CoV values of the MGS enhancement layer are considerably lower than the CoV values of the base layer (G16-B0). From Table 9, the maximum of the maximum CoV and PtM values are, respectively, 2.10 and 36.12 for the base layer, while the corresponding enhancement layer values are 0.98 and 10.72. The differences are similarly large for the maximum of the means of CoV and PtM. The aggregated traffic has maximum values that are comparable to or slightly larger than the values of the enhancement layer; hence, the CoV and PtM values of the base layer are greatly reduced if it is transported in conjunction with the enhancement layer.

The statistics on the GoP level have similar trends; however, the difference between the CoV values of the base and enhancement layers is less pronounced, while significant differences still exist between the PtM values.

8. Combined Scalability Traffic Analysis

The H.264 SVC encoder supports combined scalability, which allows the extraction of temporal, spatial, and SNR layers from one bitstream. From a video traffic analysis viewpoint, this flexibility means that analyzing all possible temporal-spatial-SNR combinations of layers is prohibitive. Therefore, we focus on two case studies: spatiotemporal and FGS-temporal scalability. We compare the base and enhancement layers to the traffic characteristics obtained and discussed in the preceding sections that analyzed each scalability mode in isolation.

First, we explore the combined spatiotemporal scalability case, which is based on the spatial scalable encodings used in Section 5, that is, we employ GoP structure G16-B3 supporting three temporal layers in each spatial QCIF and CIF layer. Secondly, we analyze combined FGS-temporal scalability based on the encodings used in Section 6, supporting three temporal layers in the FGS base and enhancement layers.

8.1. Combined Spatiotemporal Scalability

Figures 9 and 10 depict the VD curves of the temporal layers in each spatial layer and in the aggregated traffic (base + enhancement) for the Silence of the Lambs and Star Wars IV sequences. Each complete spatial layer has been individually analyzed in Section 5. In the following, we focus on the temporal layers embedded in each spatial layer and compare with the corresponding single-layers.

First, we recall from Section 5 that the spatial base layer (QCIF) statistics are identical to the single-layer (QCIF) statistics, because these layers are identically encoded. Therefore, the temporal layer statistics of the spatial base layer are also identical to the statistics of the temporal layers embedded in the single-layer QCIF stream. Secondly, we compare the VD curves of the temporal layers embedded in the aggregated spatial stream in Figures 9(c) and 10(c) to the VD curves of the layers of the temporal-scalability only CIF encodings in Figures 4(a) and 4(b). Visual inspection reveals that these temporal layer VD curves have comparable values, albeit with somewhat lower CoV values in the low-quality range of Figures 4(a) and 4(b). Thirdly, the spatial enhancement layer's temporal layers in Figures 9(b) and 10(b) cannot be directly compared with any prior results. However, visual inspection reveals that the VD curves in Figures 9(b) and 10(b) are very similar to the temporal layer VD curves of the aggregated spatial traffic in Figures 9(c) and 10(c). These VD curves have the same shapes, but the VD curves of the spatial enhancement layer are offset slightly upward (somewhat higher CoV) relative to the VD curves of the aggregate streams. This indicates that the variability of the aggregated traffic is dominated by the spatial enhancement layer.

From (i) the similarity of the temporal layer VD curves of the spatial base and aggregate streams with the corresponding VD curves of the temporal-scalability only encodings, and (ii) the similarity of the temporal layers embedded in the spatial enhancement layer with the temporal layers in the aggregated spatial stream, we conclude that it suffices to separately analyze the layers of temporal-scalability only encodings at the individual spatial resolution (QCIF and CIF) to obtain good estimates of the traffic variabilities of the layers in the combined spatiotemporal encoding.

8.2. Combined FGS-Temporal Scalability

The SVC FGS encoder supports FGS-temporal scalability, which adds progressive refinement (PR) information to each temporal layer embedded in the base layer. This PR information is provided by the FGS enhancement layer. In this section, the three temporal layers included in the base and enhancement layer are separately examined. Figures 11 and 12 depict the temporal layers for base, untruncated enhancement, and aggregated (base + untruncated enhancement) traffic for the Silence of the Lambs and Star Wars IV sequences.

We compare the VD curves of the temporal layers embedded in the FGS base layers to the VD curves of the temporal-scalability only encodings in Figures 4(a) and 4(b). First, we observe that the temporal layers embedded in the FGS base layers in Figures 11(a) and 12(a) have comparable variability to the layers of the temporal-scalability only encodings in Figures 4(a) and 4(b). Direct comparison of the VD curves in Figures 11(a) and 12(a) with the VD curves in Figures 4(a) and 4(b) is difficult, because the qualities associated with the temporal layers are computed differently (a constant low PSNR value is used for missing frames in Figures 11 and 12, whereas the PSNR between the duplicated and original frames is used in Figures 4(a) and 4(b)). Nevertheless, the maximum CoV values and the CoV values at the low- and high-quality ends of corresponding curves are very close. Given this similarity between the VD curves of the temporal layers embedded in the FGS base layer and the VD curves of the layers in the temporal-scalability only streams in Figures 4(a) and 4(b), we conclude that it suffices to study the traffic statistics of the layers of temporal-scalability only encodings to obtain reasonable estimates of the traffic variabilities of the temporal layers embedded in the FGS base layer. On the other hand, the FGS enhancement layer traffic, the aggregated FGS traffic, and their embedded temporal layers cannot be meaningfully compared to any previously obtained results. However, the unprecedentedly high variabilities of these streams are indicative of the high variability the network path encounters when different layers are transmitted independently.

9. Conclusion

We examined the video traffic characteristics of the temporal, spatial, and FGS scalability modes of the scalable video coding (SVC) extension of the H.264/AVC standard and compared with equivalent MPEG-4 Part 2 scalable video traffic. We also analyzed SVC's combined spatiotemporal and combined FGS-temporal scalability. Our traffic study focused on the joint characterization of the average bit rate and the bit rate variability as a function of the video quality. We employed long CIF resolution video sequences with a wide variety of texture and motion features. We summarize our findings for each scalability mode as follows.

(i) For the temporal scalability mode of SVC with three temporal layers, which we examined separately, we have found that the maximum coefficient of variation CoV of the frame sizes over all sequences and all unsmoothed SVC temporal layers is above 3.3, with the CoV of temporal layer 1 being as high as 3.8. For MPEG-4 Part 2, the maximum CoV stays below 2.9. Across temporal layers, we have found that temporal layer 1 has the highest variability. When basic smoothing is applied to SVC layers, we have found that the maximum CoV falls to 1.4 and 1.7 for the base layer and temporal layer 1, respectively, while the CoV of temporal layer 2 falls to 2.27. For MPEG-4 Part 2, the smoothed CoV does not exceed 1.25. These figures point to the significant increase in bit rate variability of temporal scalable SVC over MPEG-4 Part 2. From the bit rate and quality analysis, we find that the mean bit rates for the SVC temporal enhancement layers are significantly lower than for the base layer due to the presence of large I and P frames and the cascading quantizer assignment for SVC B frames. We also confirm that the coefficient of quality variation decreases as each layer is cumulatively added, thus increasing the subjective quality at the receiver.

(ii) The spatial scalability traffic analysis first focused on the separate analysis of the QCIF base layer, the CIF enhancement layer, and the aggregated CIF stream, without considering the temporal scalability present in each spatial layer. We have found that SVC's spatial enhancement layer (CoV up to 2.6) has larger traffic variability than its base layer (CoV up to 2.3), whereas the traffic variability of the MPEG-4 Part 2 enhancement layer (CoV up to 1.4) is lower than or comparable to that of its base layer (CoV up to 1.6). We have also found that the spatial base layer statistics are identical to the single-layer QCIF statistics, confirming that the spatial base layer is encoded independently of the enhancement layer. The traffic variabilities of both SVC and MPEG-4 Part 2 for the enhancement layer (CIF) are comparable to or slightly higher than for single-layer CIF. For the aggregated traffic (CIF), we have found significantly higher traffic variability for SVC as compared to MPEG-4 Part 2. The CoV of the aggregated traffic is generally lower than that of the CIF enhancement layer.

(iii) We analyzed FGS by treating base layer, enhancement layer, and aggregated traffic separately. There has been a significant effort in the SVC extension to improve the RD efficiency over MPEG-4 FGS, the success of which can be clearly seen in up to 50% improvement made in many cases. We have studied the simple truncation of the enhancement layer of both encoders in progressive steps of 10% of the total enhancement layer and have found that the variability of SVC for each point can be over 2.5 times that of MPEG-4 FGS, which has CoV values less than or equal to 1. Smoothing the truncated bitstream lowers the SVC CoV to the range 1–1.5, while for MPEG-4 FGS, smoothing reduces the traffic variability to the range 0.4–0.6. Compared with single-layer encodings, we have found that the base layer statistics are quite similar, given that both use a closed loop prediction structure. We have observed that the untruncated enhancement layer of MPEG-4 FGS contains almost the full refinement information for the entire bit rate range (for all quantizers), resulting in an almost flat RD curve; in contrast, SVC provides significant quality increases for increases in the untruncated enhancement layer bit rate.

(iv) We examined combined spatiotemporal scalability by analyzing the temporal layers embedded in each spatial layer and compared with the layers in temporal-scalability only encodings. We have observed comparable values except in the low-quality range where somewhat lower traffic variability is exhibited by the temporal-scalability only encodings. We have also observed that the variability of the aggregated traffic is mainly determined by the spatial enhancement layer. From the fact that the VD curves of the temporal layers embedded in each spatial layer are similar to the VD curves of the corresponding temporal-scalability only encodings, and that the spatial enhancement layer is similar to that of the aggregated spatial traffic, we conclude that it suffices to analyze the video traffic of each resolution separately to obtain a good estimate of the traffic variabilities of all embedded layers. We also examined combined FGS-temporal scalability of SVC. Given the similarity of the temporal VD curves in the FGS base layer to the temporal layer curves embedded in the single layer, a reasonable estimate of the traffic variabilities of all layers embedded in the FGS base layer can be obtained from the single-layer equivalent.

Overall, these results clearly point to unprecedented levels of compression efficiency as well as traffic variability for SVC coding, a factor which should be taken into consideration for the design of efficient network transport protocols and mechanisms for H.264 SVC scalable-encoded video.

Acknowledgments

The authors are grateful to Dr. Patrick Seeling for contributing to the statistical processing. This work was supported in part by the National Science Foundation through Grants nos. ANI-0136774 and CRI-0750927.