Abstract

The aim of this paper is to present video quality prediction models for objective, non-intrusive prediction of H.264 encoded video for all content types, combining parameters in both the physical and application layers, over Universal Mobile Telecommunication System (UMTS) networks. In order to characterize the Quality of Service (QoS) level, a learning model based on the Adaptive Neural Fuzzy Inference System (ANFIS) and a second model based on non-linear regression analysis are proposed to predict video quality in terms of the Mean Opinion Score (MOS). The objective of the paper is two-fold: first, to find the impact of QoS parameters on end-to-end video quality for H.264 encoded video; second, to develop learning models based on ANFIS and non-linear regression analysis to predict video quality over UMTS networks, considering the impact of radio link loss models. The loss models considered are 2-state Markov models. Both models are trained with a combination of physical and application layer parameters and validated with an unseen dataset. Preliminary results show that good prediction accuracy is obtained from both models. The work should help in the development of reference-free video quality prediction models and QoS control methods for video over UMTS networks.

1. Introduction

Universal Mobile Telecommunication System (UMTS) is a third generation (3G) wireless cellular network based on Wideband Code Division Multiple Access technology, designed for multimedia communication. UMTS is among the first 3G mobile systems to offer wireless wideband multimedia communications over the Internet Protocol [1]. Multimedia content on the Internet can be accessed by mobile Internet users at data rates from 384 kbps up to 2 Mbps in a wide coverage area under ideal static reception conditions.

Video streaming is a multimedia service that has recently gained popularity and is expected to unlock new revenue streams for mobile network operators. Significant business potential has been opened up by the convergence of the communications, media, and broadcast industries towards common technologies, offering entertainment media and broadcast content to mobile users. However, for such services to be successful, the user's Quality of Service (QoS) is likely to be the major determining factor. The QoS of multimedia communication is affected by parameters in both the application and physical layers. In the application layer, QoS is driven by factors such as resolution, frame rate, sender bitrate, and video codec type. In the physical layer, impairments such as block error rate, jitter, delay, and latency are introduced. Video quality can be evaluated either subjectively or based on objective parameters. Subjective quality is the users' perception of service quality (ITU-T P.800) [2]; the most widely used metric is the Mean Opinion Score (MOS). Subjective assessment is the most reliable method; however, it is time consuming and expensive, and hence there is a need for an objective method that produces results comparable with those of subjective testing. Objective measurements can be performed in an intrusive or nonintrusive way. Intrusive measurements require access to the source: they compare the impaired videos to the original ones. Full reference and reduced reference video quality measurements are both intrusive [3]. Quality metrics such as Peak Signal-to-Noise Ratio (PSNR), SSIM [4], VQM [5], and PEVQ [6] are full reference metrics; VQM and PEVQ are used commercially and are not publicly available. Nonintrusive (reference-free) methods, on the other hand, do not require access to the source video and are either signal- or parameter-based. Nonintrusive methods are preferred to intrusive analysis as they are more suitable for online quality prediction and control.

Recently, there has been substantial work on video quality prediction. The authors in [7-9] predicted video quality for mobile/wireless networks taking into account application level parameters only, whereas the authors in [10] used network statistics to predict video quality. In [11] the authors proposed a model to measure the effect of temporal artifacts on perceived video quality in mobile video broadcasting services. In [12] we proposed video quality prediction models over wireless local area networks that combined both application and network level parameters. In UMTS, Radio Link Control (RLC) losses severely affect the QoS due to the high error probability of the radio channel. The RLC layer is placed on top of the Medium Access Control layer and provides flow control and error recovery after processing from the physical layer. Therefore, for any video quality prediction model, it is important to model the RLC loss behaviour appropriately. In this paper only the RLC Acknowledged Mode (AM) is considered, as it offers reliable data delivery and can recover frame losses in the radio access network. Recent work in [13-16] has focused on the impact of UMTS link layer errors on the quality of H.264/MPEG4 encoded videos. In [17] the impact of the H.264 video slice size on end-to-end video quality is investigated. In [18] the authors showed that the RLC AM mode outperforms the unacknowledged mode and proposed a self-adaptive RLC AM protocol. In [19] a performance evaluation of video telephony over UMTS is presented. Most of the current work is either limited to improving the radio channel or to evaluating parameters that impact the QoS of video transmission over UMTS networks. However, very little work has been done on predicting end-to-end video quality over UMTS networks considering both the different content types and the impact of RLC loss models.

As the convergence of broadcast/multicast and the Internet becomes a reality, delivery of multimedia content to large audiences will become very cost-effective using wireless access networks such as UMTS, WiFi, WiMax, or DVB-H. Therefore, provisioning of multimedia services can easily be offered over several access technologies. Hence, there is a need for an efficient, nonintrusive video quality prediction model for technical and commercial reasons. The model should predict perceptual video quality to account for interactivity. In this paper, we have looked at the UMTS access network. The error rate simulated in the physical layer is employed to generate losses at the link layer, modelled with a 2-state Markov model [20-22] with variable Mean Burst Lengths (MBLs) [23]. Hence, we evaluate the impact of different loss models on end-to-end video quality, as it was shown in [24] that the second-order error characteristics of the channel have a strong impact on the performance of higher layer protocols. Furthermore, based on the content types, we seek an objective measure of video quality simple enough to be calculated in real-time at the receiver side. We present two new reference-free approaches for quality estimation for all content types. The contributions of the paper are twofold: (1) an investigation of the combined effects of physical and application layer parameters on end-to-end perceived video quality over UMTS networks for all content types, and (2) the development of learning models for video quality prediction, namely, (a) a hybrid video quality prediction model based on an Adaptive Neural Fuzzy Inference System (ANFIS), which combines the advantages of a neural network and a fuzzy system [25], for all content types, and (b) a regression-based model for all content types.

The models predict quality from a combination of parameters in the application layer, that is, Content Type (CT), video Sender Bitrate (SBR), and Frame Rate (FR), and in the physical layer, that is, Block Error Rate (BLER) and Mean Burst Length (MBL). The video codec used was H.264/AVC [26], as it is the recommended codec for video transmission over UMTS 3G networks. All simulations were carried out in the OPNET Modeler [27] simulation platform.

The rest of the paper is organised as follows. The video quality assessment problem is formulated in Section 2. Section 3 presents the background to content-based video quality prediction models. In Section 4, the proposed content-based video quality models are presented, whereas Section 5 outlines the simulation set-up. Section 6 describes the impact of QoS parameters on end-to-end video quality. The evaluation of the proposed models is presented in Section 7. Conclusions and areas of future work are given in Section 8.

2. Problem Statement

In multimedia streaming services, several parameters affect the visual quality as perceived by the end users of the multimedia content. These QoS parameters can be grouped into application layer and physical layer QoS parameters. In the application layer, the perceptual QoS of the video bitstream can therefore be characterized as

Perceptual QoS = f(content type, SBR, frame rate, codec type, resolution, etc.),

whereas in the physical layer it is given by

Perceptual QoS = f(PER, delay, latency, jitter, etc.).

It should be noted that the encoder and content dimensions are highly conceptual. In this research we chose H.264 as the encoder type, as it is the recommended codec for low bitrates. We used our previously defined classification function [12] to classify the video contents based on their spatial and temporal features. In the application layer we chose Sender Bitrate (SBR), Frame Rate (FR), and Content Type (CT), and in the physical layer we chose Block Error Rate (BLER) and Mean Burst Length (MBL) as the QoS parameters. A single Mean Opinion Score (MOS) value is used to describe the perceptual quality. Therefore, the application layer contribution is given as $MOS_A = f(CT, SBR, FR)$, whereas the physical layer contribution is given as $MOS_P = f(BLER, MBL)$. The overall quality is given by $MOS = f(MOS_A, MOS_P)$.

In this paper we evaluated the impact of QoS parameters in both the application and physical layers, and hence confirmed the choice of parameters for the development of the learning models. Video quality is affected by parameters in both layers; therefore, a video quality prediction model should take parameters from both into account. The relationships between the QoS parameters and perceived quality are thought to be nonlinear. Therefore, an ANFIS-based neural network model is chosen for video quality prediction, because it combines the advantages of fuzzy systems (based on human reasoning) and neural networks. In addition to the ANFIS-based prediction model, we have also predicted video quality based on nonlinear regression. This method is chosen as it is easy to implement in QoS control, for example, video SBR adaptation. ANFIS-based models are more complex, and implementing them in real-time for QoS control is not as straightforward as a regression-based model, which is lightweight and easily implementable. The purpose of this paper is to present both methods for video quality prediction.

3. Background to Content-Based Video Quality Prediction

In this section we present the background literature on content classification and its impact on video quality prediction.

3.1. Two-Dimensional Content Classification

The content of each video clip may differ substantially depending on its dynamics (i.e., the spatial complexity and/or the temporal activity of the depicted visual signal). The quantification of this diversity is of high interest to video coding experts, because the spatiotemporal content dynamics of a video signal determine the efficiency of a coding procedure.

From the perceptual aspect, the quality of a video sequence depends on the spatiotemporal dynamics of the content. More specifically, it is known from the fundamental principles of video coding theory that, subject to identical encoding procedures, action clips with highly dynamic content are perceived as more degraded than slow-moving sequences.

Thus, the classification of video signals according to their spatiotemporal characteristics will provide the video research community with the ability to quantify the perceptual impact of various content dynamics on the perceptual efficiency of modern encoding standards.

Towards this classification, a spatiotemporal plane is proposed, where each video signal (of short duration and homogeneous content) is depicted as a Cartesian point; the horizontal axis refers to the spatial component of its content dynamics and the vertical axis to the temporal one. The respective plane is depicted in Figure 1.

Therefore, according to this approach, each video clip can be classified into four categories depending on its content dynamics, namely:
(i) Low Spatial Activity-Low Temporal Activity (upper left),
(ii) High Spatial Activity-Low Temporal Activity (upper right),
(iii) Low Spatial Activity-High Temporal Activity (lower left),
(iv) High Spatial Activity-High Temporal Activity (lower right).

The accuracy of the proposed spatiotemporal content plane is subject to the duration of the video signal and the homogeneity of the content. For short, homogeneous video clips, the classification is representative and efficient. However, for video clips of longer duration and heterogeneous content, spatiotemporal classification becomes difficult.

We propose the use of two discrete metrics, one for the spatial component and one for the temporal component, in order to cover the spatiotemporal plane for the needs of this paper. The averaged frame variance is proposed for the spatial component of the video signal. This objective metric permits the quantification of the spatial dynamics of a video signal that is short in duration and homogeneous. Considering that frame $k$ is composed of $M \times N$ luminance pixels $x_k(i,j)$, the variance of the frame is defined in (1):

$\sigma_k^2 = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( x_k(i,j) - \bar{x}_k \right)^2,$   (1)

where $\bar{x}_k$ denotes the mean luminance value of frame $k$.

Derived from (1), (2) presents the averaged frame variance over the whole video duration, where $F$ represents the number of frames in the video:

$\bar{\sigma}^2 = \frac{1}{F} \sum_{k=1}^{F} \sigma_k^2.$   (2)

The averaged variance of the luminance difference of successive frames is proposed as the metric for the quantification of the temporal dynamics of a video sequence. Considering that each frame contains $M \times N$ pixels and the video contains $F$ frames, the luminance difference of the successive frame pair $(k, k+1)$ is defined in (3):

$d_k(i,j) = x_{k+1}(i,j) - x_k(i,j).$   (3)

Therefore, the averaged variance of these differences over the overall duration of the test signal is defined in (4):

$\bar{\sigma}_d^2 = \frac{1}{F-1} \sum_{k=1}^{F-1} \operatorname{Var}(d_k).$   (4)

The scale on both axes refers to normalized measurements (on a scale from 0 to 1) of the spatial and temporal components, according to the aforementioned metrics. The normalization procedure applied in this paper sets the test signal with the highest spatiotemporal content in the lower right quarter, specifically at the Cartesian (Spatial, Temporal) coordinates (0.75, 0.75). This choice, without any loss of generality, allows the classification grid to accommodate test signals that may have higher spatiotemporal content than the ones tested here.
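For illustration, the following Python sketch computes the two metrics defined in (1)-(4) and the normalization step for clips supplied as lists of luminance frames (NumPy arrays). It is a minimal sketch, not the authors' exact implementation, and the function names are illustrative.

    import numpy as np

    def spatial_metric(frames):
        """Averaged frame variance over the clip, as in (1)-(2)."""
        return float(np.mean([np.var(f.astype(np.float64)) for f in frames]))

    def temporal_metric(frames):
        """Averaged variance of successive-frame luminance differences, (3)-(4)."""
        diffs = [frames[k + 1].astype(np.int32) - frames[k].astype(np.int32)
                 for k in range(len(frames) - 1)]
        return float(np.mean([np.var(d) for d in diffs]))

    def normalise(values, anchor=0.75):
        """Scale raw metric values so the largest maps to 0.75, leaving
        headroom for clips with higher dynamics than those tested."""
        peak = max(values)
        return [anchor * v / peak for v in values]

Each clip is then placed on the grid of Figure 1 by its normalized (spatial, temporal) pair.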

For the needs of this paper, six short sequences (three for training and three for validation) are used. Snapshots of these sequences are depicted in Figure 2. All sequences are available for download from [28].

Applying the described spatial and temporal metrics to the sequences used, their classification on the proposed spatiotemporal grid is depicted in Figure 3. It can be observed that the spatiotemporal dynamics of the selected sequences are distributed across three quarters of the grid, indicating the diverse nature of their content dynamics. Moreover, the validity of the proposed metrics is supported by these experimental results, which show that they provide adequate differentiation among the dynamics of the signals under test.

3.2. Video Quality Prediction Method

Figure 4 illustrates how the video quality is predicted nonintrusively. At the top of Figure 4, the intrusive video quality measurement block is used to measure video quality at different network QoS conditions (e.g., different packet loss, jitter, and delay) or different application QoS settings (e.g., different codec type, content type, sender bitrate, frame rate, and resolution). The measurement is based on comparing the reference and the degraded video signals. The Peak Signal-to-Noise Ratio (PSNR) is used for measuring video quality in this paper as a proof of concept, and MOS values are obtained through a PSNR-to-MOS conversion [29]. The video quality measurements based on MOS values are used to derive nonintrusive prediction models based on artificial neural networks and nonlinear regression methods. The derived prediction models can predict video quality (in terms of MOS) from the physical layer QoS parameters of block error rate and mean burst length, and the application layer QoS parameters of content type, SBR, and frame rate. In Figure 4 the video content classification is carried out on the raw videos at the sender side by extracting their spatial and temporal features. The spatio-temporal metrics have quite low complexity and thus can be extracted from videos in real-time. Video contents are classified on a continuous scale from 0 to 1, with 0 as content with no movement, for example, still pictures, and 1 as very fast moving sports-type content. The content features reflecting the spatiotemporal complexity of the video go through a statistical classification function (cluster analysis), and the content type is decided based on the Euclidean distance of the data [12]. Therefore, video clips in one cluster have similar content complexity. Hence, our content classifier takes the content features as input observations and produces the content category as output. For longer video clips or movies, the input will be a segment-by-segment analysis of the extracted content features; therefore, within one movie clip there can be a combination of all content types.
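As a concrete illustration of the measurement block, the following Python sketch computes the PSNR between a reference and a degraded luminance frame and maps it to a MOS band in the style of the Evalvid conversion [29]. The band thresholds shown are the commonly quoted ones and should be verified against the tool itself.

    import numpy as np

    def psnr(ref, deg, peak=255.0):
        """Peak Signal-to-Noise Ratio between reference and degraded frames."""
        mse = np.mean((ref.astype(np.float64) - deg.astype(np.float64)) ** 2)
        return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

    def psnr_to_mos(p):
        """PSNR-to-MOS band conversion in the style of Evalvid [29]."""
        if p > 37: return 5   # excellent
        if p > 31: return 4   # good
        if p > 25: return 3   # fair
        if p > 20: return 2   # poor
        return 1              # bad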

3.3. Content Dynamics and Video Quality Prediction

In this subsection, we discuss the impact of the spatiotemporal content dynamics on (i) the video quality acceptance threshold (i.e., the perceptual quality level below which the user considers an encoded video to be of unacceptable quality), (ii) the highest achievable video quality level, and (iii) the pattern of video quality versus sender bitrate.

In order to examine the impact of the content dynamics on the video quality versus sender bitrate pattern, respective curves of PQoS versus sender bitrate and PQoS versus frame rate must be derived. Such curves can be derived using an audience of people who watch the video (e.g., a short video clip) and score its quality as perceived by them. Such curves are shown in Figures 5(a) and 5(b). Figure 5(a) represents PQoS versus sender bitrate curves, which follow the typical logarithmic/exponential pattern found in the literature. More specifically, curve A represents a video clip with low temporal and spatial dynamics, that is, whose content has little movement and low picture complexity; such a curve can be derived, for example, from a talk show. Curve C represents a short video clip with high dynamics, such as a football match. Curve B represents an intermediate case. Each curve, and therefore each video clip, can be characterized by (a) the low sender bitrate $SBR_{low}$, which corresponds to the lowest PQoS value $PQoS_{low}$ accepted by the audience, (b) the high sender bitrate $SBR_{high}$, which corresponds to the minimum sender bitrate for which the PQoS reaches its maximum value $PQoS_{max}$ (see curve A in Figure 5(a)), and (c) the mean inclination of the curve, which can be defined as $(PQoS_{max} - PQoS_{low})/(SBR_{high} - SBR_{low})$. From the curves of Figure 5(a), it can be deduced that video clips with low dynamics have a lower $SBR_{high}$ than clips with high dynamics.

In comparison to Figure 5(a), the curves in Figure 5(b) represent PQoS versus frame rate for the three types of video clips. As mentioned before, curve A represents a video clip with low spatio-temporal activity, curve B an intermediate case, and curve C a video with high spatio-temporal activity. We observe from Figure 5(b) that for video with low spatio-temporal activity, the frame rate does not have any impact on quality. However, as the spatio-temporal activity increases, for example, from intermediate to high, quality at low frame rates degrades significantly, depending on the spatio-temporal complexity.

In the literature, these curves are characterized as Benefit Functions, because they depict the perceptual efficiency of an encoded signal in relation to the encoding bitrate. The differentiation among these curves comes from their slope and position on the benefit-resource plane, which depend on the S-T activity of the video content. Thus, for audiovisual content of high S-T activity, the curve has a low slope and is transposed to the lower right area of the benefit-resource plane. On the contrary, for content of low S-T activity, the curve has a high slope and is transposed to the upper left area.

Practically, the transposition of the curve to the upper left area means that content with low S-T activity (e.g., a talk show) reaches a better PQoS level at a relatively lower sender bitrate than video content with high S-T activity. In addition, when the encoding bitrate decreases below a threshold, which depends on the video content, the PQoS practically “collapses”. On the other hand, the transposition of the curve to the lower right area means that content with high S-T activity (e.g., a football match) requires a higher sender bitrate in order to reach a satisfactory PQoS level. Nevertheless, it reaches its maximum PQoS value more smoothly than in the low S-T activity case.

Practically, it can be observed from Figure 5(a) that at low sender bitrates curve A reaches a higher perceptual level than curve B, which depicts a sequence with higher spatiotemporal content. On the other hand, curve C requires a higher sender bitrate in order to reach a satisfactory PQoS level. Nevertheless, curve C reaches its maximum PQoS value more smoothly than in the low-activity case.

Following the general pattern in Figures 5(a) and 5(b), the impact of the spatiotemporal activity on the sender bitrate pattern can be observed very clearly. Two further important outcomes emerge:
(i) For video signals with low spatiotemporal activity, a saturation point appears, above which the perceptual enhancement is negligible even for very high encoding bitrates. Moreover, the frame rate does not have an impact on quality for these videos.
(ii) As the spatiotemporal activity of the content becomes higher, the respective perceptual saturation point (i.e., the highest perceptual quality level) becomes lower, which practically means that video of high dynamics never reaches a very high perceptual level. Low frame rates reduce the perceptual quality for these videos.

4. Proposed Video Quality Prediction Models

4.1. Introduction to the Models

The aim is to develop learning models to predict video quality for all content types from both application and physical layer parameters for video streaming over UMTS networks, as shown in Figure 6. For the tests we selected three different video sequences, representing slow moving to fast moving content, as classified in our previous work [12]. The video sequences were of QCIF resolution (176 x 144) and encoded in H.264 format with the open source JM software [26] encoder/decoder. The three video clips were transmitted over a simulated UMTS network using the OPNET simulator. The application layer parameters considered are CT, FR, and SBR. The physical layer parameters are BLER and MBL, modelled with a 2-state Markov model.

4.2. ANFIS-Based Video Quality Prediction Model

ANFIS uses a hybrid learning procedure and can construct an input-output mapping based on both human knowledge (in the form of fuzzy if-then rules) and stipulated input-output data pairs. A two-input ANFIS [25] architecture, as shown in Figure 7(a), is an adaptive multilayer feedforward network in which each node performs a particular function on incoming signals and carries a set of parameters pertaining to that node.

The entire system architecture in Figure 7(a) consists of five layers, namely, a fuzzy layer, a product layer, a normalized layer, a defuzzy layer, and a total output layer. The two inputs are $x$ and $y$, and the output is $f$. For a first-order Sugeno fuzzy model, a typical rule set with two fuzzy if-then rules can be expressed as

Rule 1: If $x$ is $A_1$ and $y$ is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$,
Rule 2: If $x$ is $A_2$ and $y$ is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$,

where $p_i$, $q_i$, and $r_i$ are the linear (consequent) parameters, and $A_i$ and $B_i$ are the nonlinear (premise) fuzzy sets.

The corresponding equivalent ANFIS architecture for our model is shown in Figure 7(b).

The five chosen inputs in Figure 7(b) are Frame Rate (FR), Sender Bitrate (SBR), Content Type (CT), Block Error Rate (BLER), and Mean Burst Length (MBL). Output is the MOS score. The degree of membership of all five inputs is shown in Figure 8.

The number of membership functions is two for each of the five inputs, and their operating ranges depend on the respective input; for example, for the SBR input the operating range is 50-250 kbps.
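To make the inference concrete, the following Python sketch implements the two-input, two-rule first-order Sugeno forward pass that ANFIS tunes (Figure 7(a)), using Gaussian membership functions. All numeric parameters in the example call are illustrative placeholders; in ANFIS they are learned from the training data.

    import numpy as np

    def gauss(v, c, sigma):
        """Gaussian membership function."""
        return np.exp(-0.5 * ((v - c) / sigma) ** 2)

    def sugeno_two_rules(x, y, prem, cons):
        # Layer 1: fuzzification; prem[i] = (cx, sx, cy, sy) for rule i.
        w = [gauss(x, cx, sx) * gauss(y, cy, sy)       # Layer 2: firing strengths
             for (cx, sx, cy, sy) in prem]
        wbar = np.array(w) / np.sum(w)                 # Layer 3: normalisation
        f = [p * x + q * y + r for (p, q, r) in cons]  # Layer 4: rule consequents
        return float(np.dot(wbar, f))                  # Layer 5: overall output

    # Illustrative call with x = SBR (kbps) and y = BLER (%):
    mos = sugeno_two_rules(128.0, 5.0,
                           prem=[(64, 40, 1, 4), (192, 40, 10, 4)],
                           cons=[(0.01, -0.1, 3.5), (0.005, -0.15, 3.0)])

The full five-input model of Figure 7(b) follows the same pattern, with two membership functions per input and one linear consequent per rule.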

4.3. Regression-Based Video Quality Prediction Model

The relationship of MOS and the 5 selected inputs is shown in Figure 9. From Figure 9 we can obtain the range of MOS values for each input; for example, from MOS versus FR, MOS ranges from 1 to 4.5 for FRs of 5, 10, and 15. Once the relationship between MOS and the five selected inputs was found for the three content types, representing low to high spatio-temporal (S-T) features, we carried out nonlinear regression analysis with the MATLAB function nlintool to find the nonlinear regression model that best fitted our data.
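Since nlintool is a MATLAB tool, the following Python sketch shows an analogous nonlinear fit with scipy.optimize.curve_fit on synthetic data. The functional form below is illustrative only; the actual fitted model is the one given later in (6).

    import numpy as np
    from scipy.optimize import curve_fit

    def model(X, a0, a1, a2, a3, a4, a5):
        # Illustrative nonlinear form, not equation (6) itself.
        ct, sbr, fr, bler, mbl = X
        return (a0 + a1 * ct + a2 * np.log(sbr) + a3 * fr
                + a4 * bler * mbl + a5 * ct * bler)

    # X: 5 x n array of (CT, SBR, FR, BLER, MBL) samples; y: measured MOS.
    rng = np.random.default_rng(0)
    X = rng.uniform([0.1, 50, 5, 0, 1], [0.9, 250, 15, 20, 2.5], (200, 5)).T
    y = model(X, 1.0, -1.5, 0.8, 0.05, -0.05, -0.1) + rng.normal(0, 0.1, 200)
    coeffs, _ = curve_fit(model, X, y, p0=np.ones(6))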

4.4. Training and Validation of the Proposed Models

For artificial neural networks, it is not a challenge to predict patterns in a sequence on which they were trained; the real challenge is to predict sequences that the network did not see during training. However, the video sequences used for training should be rich enough to equip the network with the power to extrapolate to patterns that may exist in other sequences. The three content types used for training the models were "akiyo", "foreman", and "stefan", whereas the models were validated with three different content types, "suzie", "carphone", and "football", reflecting similar spatio-temporal activity [12]. Snapshots of the training and validation sequences are given in Figure 2. The amount of data selected for validation was about one third of that used for training. The parameter values are given in Table 1. In total, there were around 600 sequences for training and around 250 unseen sequences for validating the proposed models.

5. Simulation Set-Up

5.1. Network Topology

The UMTS network topology is modeled in OPNET Modeler and is shown in Figure 10. It consists of a video server, connected through an IP connection to the UMTS network, which delivers the video to the mobile user.

With regard to the UMTS configuration, the video transmission is supported over a Background Packet Data Protocol (PDP) Context with a typical mobile wide area configuration as defined in 3GPP TR 25.993 [1] for the “Interactive or Background/UL:64 DL:384 kbps/PS RAB”. The transmission channel supports maximum bitrates of 384 kbps downlink/64 kbps uplink over a Dedicated Channel (DCH). Since the analyzed video transmission is unidirectional, the uplink characteristics are not a bottleneck in this case. Table 1 shows the most relevant parameters configured in the simulation environment.

The RLC layer is configured in Acknowledged Mode (AM) and without requesting in-order delivery of Service Data Units (SDUs) to upper layers. Additionally, the Radio Network Controller (RNC) supports the concatenation of RLC SDUs, and the SDU Discard Timer for the RLC AM recovery function is set to 500 ms. As a result of this configuration, the behavior of the UTRAN is as follows (a toy sketch of this recovery behaviour follows the list):
(i) The RNC keeps sending RLC SDUs to the UE at the reception rate.
(ii) When an RLC PDU is lost, the RNC retransmits this PDU.
(iii) When all the PDUs of an RLC SDU are correctly received, the UE delivers it to the upper layer regardless of the status of the previous RLC SDUs.
(iv) If a retransmitted RLC PDU is lost again, the RNC retries the retransmission until the SDU Discard Timer expires.
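The following Python sketch estimates the SDU loss rate when every lost PDU is retransmitted until the 500 ms SDU Discard Timer expires. The 10 ms interval per transmission attempt and the independent per-PDU loss draws are simplifying assumptions for illustration, not the OPNET model itself.

    import random

    def deliver_sdu(n_pdus, bler, discard_timer_ms=500, tti_ms=10):
        elapsed = 0
        for _ in range(n_pdus):
            while True:
                elapsed += tti_ms                 # one transmission attempt per TTI
                if elapsed > discard_timer_ms:
                    return False                  # timer expired: SDU discarded
                if random.random() >= bler:
                    break                         # PDU received; move to next PDU
        return True                               # all PDUs received: SDU delivered

    # e.g. fraction of 8-PDU SDUs lost at 10% BLER:
    loss = sum(not deliver_sdu(8, 0.1) for _ in range(10000)) / 10000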

Although the considered UMTS service supports the recovery of radio errors in the UTRAN, the quality of the video reception may still be impacted in several ways. The local recoveries introduce additional delays, which may lead to frame losses in the application buffer. In addition, the local recoveries are limited by a counter, so under severe radio degradation some frames may actually be lost in the UTRAN. These recoveries also increase the required bitrate in the radio channel, which at high usage ratios may further degrade the video transmission. As a result of all these considerations, we can state the great relevance of the combined impact of the video encoding parameters and the specific UMTS error conditions.

The implemented UMTS link layer model is based on the results presented in [23], which analyzes error traces from currently deployed 3G UMTS connections. Specifically, the error model at the RLC layer indicates that, for mobile users, the radio errors can be aggregated at the Transmission Time Interval (TTI) level. This error model leads to possible losses of RLC SDUs, which in turn lead to losses at the RTP layer and finally to frame losses at the video layer.

5.2. Transmission of H.264 Encoded Video

The transmission of H.264 encoded video over the UMTS network is illustrated in Figure 11. The original YUV sequences are encoded with the H.264/AVC JM Reference Software with varying SBR and FR values. H.264 is chosen as it is the recommended codec for achieving suitable quality at low bitrates. The resulting .264 video track becomes the input of the next step, which emulates the streaming of the MP4 video over the network based on the RTP/UDP/IP protocol stack. The maximum packet size is set to 1024 bytes in this case. The resulting trace file feeds the OPNET simulation model. For the aims of this paper, the video application model has been modified to support the incoming trace file (st) and to generate the RTP packet traces in the sender module (sd) and in the receiver module (rd). Finally, the last step analyzes the quality of the received video sequences against the original, and the resulting PSNR values are calculated with the ldecod tool included in the H.264/AVC JM Reference Software. MOS scores are obtained with the PSNR-to-MOS conversion from Evalvid [29].
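The packetisation step can be illustrated with the following Python sketch, which splits one encoded frame into RTP payloads of at most 1024 bytes, as configured above; the trace-file format of the st/sd/rd tool chain is not reproduced here.

    MAX_PAYLOAD = 1024  # maximum RTP payload size in bytes, as configured

    def packetise(frame_bytes, max_payload=MAX_PAYLOAD):
        """Yield successive RTP payloads for one encoded video frame."""
        for off in range(0, len(frame_bytes), max_payload):
            yield frame_bytes[off:off + max_payload]

    # e.g. a 3000-byte frame becomes payloads of 1024, 1024 and 952 bytes:
    sizes = [len(p) for p in packetise(b'\x00' * 3000)]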

Instead of setting up a target BLER value for the PDP Context, the UE model is modified in order to support the desired error characteristics. The implemented link loss model is a special case of a 2-state Markov model [20, 21], and its behaviour is governed by two parameters: the BLER and the MBL. The 2-state Markov model is depicted in Figure 12. According to this model, the network is either in the good (G) state, where all packets are correctly delivered, or in the bad (B) state, where all packets are lost. Transitions from G to B and from B to G occur with probabilities $p_{GB}$ and $p_{BG}$, respectively. The average block error rate and mean burst length can then be expressed as $BLER = p_{GB}/(p_{GB} + p_{BG})$ and $MBL = 1/p_{BG}$. If $p_{BG} = 1$ (i.e., MBL = 1), the model reduces to a random error model, with the only difference that the loss of two consecutive packets is not allowed.

The MBL of 1.75 is selected based on the mean error burst length found in [23] from real-world UMTS measurements. An MBL of 2.5 depicts a scenario where more bursty errors occur, while an MBL of 1 corresponds to the random uniform error model.
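The loss model can be reproduced in a few lines of Python. The sketch below draws a per-block loss trace from the 2-state Markov model of Figure 12, deriving the transition probabilities from the target BLER and MBL as $p_{BG} = 1/MBL$ and $p_{GB} = p_{BG} \cdot BLER/(1 - BLER)$.

    import random

    def markov_losses(n_blocks, bler, mbl):
        """Return one loss indicator per block (True = block lost)."""
        p_bg = 1.0 / mbl                        # probability of leaving the bad state
        p_gb = p_bg * bler / (1.0 - bler)       # probability of entering the bad state
        bad, trace = False, []
        for _ in range(n_blocks):
            bad = (random.random() < p_gb) if not bad else (random.random() >= p_bg)
            trace.append(bad)
        return trace

    trace = markov_losses(100000, bler=0.1, mbl=1.75)
    print(sum(trace) / len(trace))              # empirical BLER, approximately 0.1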

5.3. Test Sequences and Variable Test Parameters

The video encoding process was carried out using the H.264 codec, the most prominent alternative for low bandwidth connections. The 3GPP recommendations state that, for video streaming services such as VoD or unicast IPTV, a client should support the H.264 (AVC) Baseline Profile up to Level 1.2 [26]. As the transmission of video was targeted at mobile handsets, all video sequences were encoded at QCIF resolution. The considered frame structure is IPP for all the sequences, since the extensive use of I frames could saturate the available data channel. Based on these considerations, we set up the encoding features as shown in Table 2.

The variable encoding parameters of the simulations are the video sequence, the encoding frame rate, and the target bitrate at the Video Coding Layer (VCL). The experiment takes into account six test sequences, divided into two groups: akiyo, foreman, and stefan are used for training the model, while carphone, suzie, and football are devoted to the validation of results. The selected frame rates and bitrates are appropriate for low resolution videos, targeted at a mobile environment with handset reproduction.

For the UTRAN configuration, the 2-state Markov model is set up with different BLER values and different burst patterns in order to study their effect on application layer performance. The video sequences, along with the combination of parameters chosen, are given in Table 3.

6. Impact of QoS Parameters on End-to-End Video Quality

In this section, we study the effects of the five chosen QoS parameters on video quality. We use three-dimensional figures in which two parameters are varied while the other three are kept fixed. The MOS scores are computed as a function of the values of all five QoS parameters.

6.1. Impact of BLER and MBL on Content Type (CT)

The impact of MBL and BLER on our chosen content types is given in Figures 13(a) and 13(b).

The content type is defined in the range [0, 1], from slow moving to fast moving sports-type content. From Figure 13(a), we observe that as the activity of the content increases, the impact of BLER becomes much higher; for example, at 20% BLER, content of slow-to-medium type still gives very good MOS, whereas as the content activity increases, MOS drops to around 3. From Figure 13(b) we observe that MBL, similar to BLER, has a greater impact on content types with higher S-T activity.

Similarly, the impact of SBR and FR on CT is given in Figures 14(a) and 14(b). Again we observe that as the activity of the content increases, very low SBRs (20 kb/s) and low FRs (5 f/s) yield very low MOS. However, for slow-to-medium content activity the impact of SBR and FR is less obvious. The lower MOS values at higher SBRs are due to network congestion.

6.2. Impact of BLER and MBL on SBR

The combined impact of SBR and BLER/MBL is given in Figure 15. As expected, the quality reduces with increasing BLER and MBL. However, with increasing SBR the quality improves only up to a point, beyond which further increasing the SBR results in a bigger drop of quality due to network congestion. From Figure 15(b) we observe that the best quality in terms of MOS was obtained for an MBL of 1 (the random uniform scenario). This is to be expected, as the losses are then spread most predictably. The worst quality was for an MBL of 2.5 (the very bursty scenario). Again, this substantiates the previous findings on the 2-state Markov model. It was interesting to observe how the MBL impacts quality, although much of this effect is already captured by the BLER parameter. Similar to Figure 15(a), at high SBR the quality collapses for all values of MBL due to network congestion.

6.3. Impact of BLER and MBL on FR

Figures 16(a) and 16(b) show the impact of BLER and MBL on FR for all content types. We observe that for faster moving content, very low frame rates of 7.5 fps impair quality. Again, we observe that both BLER and MBL impact the overall quality. The impact of the frame rate is more obvious for low FRs and high BLER; however, when the BLER is low, quality is still acceptable, as shown in Figure 16(a). Figure 16(b) shows that for low FRs quality is acceptable for an MBL of 1.75; however, for an MBL of 1 it starts to deteriorate, mainly for high spatio-temporal contents, and quality completely collapses for an MBL of 2.5 (the very bursty scenario). Again, the impact is much greater on content with high spatio-temporal activity than on content with low S-T activity.

6.4. Analysis of Results

In order to thoroughly study the influence of the different QoS parameters on MOS, we perform an ANOVA (analysis of variance) [30] on the MOS data set. Table 4 shows the results of the ANOVA analysis.

We performed a 5-way ANOVA to determine whether the means in the MOS data set given by the 5 QoS parameters differ when grouped by multiple factors (i.e., the impact of all the factors combined). Table 4 shows the results, where the first column is the Sum of Squares, the second column is the Degrees of Freedom associated with the model, and the third column is the Mean Squares, that is, the ratio of Sum of Squares to Degrees of Freedom. The fourth column shows the F statistic and the fifth column gives the p-value, which is derived from the cumulative distribution function (cdf) of F [30]. The small p-values indicate that the MOS is substantially affected by at least four parameters. Furthermore, based on the magnitudes of the p-values, we can make the further claim that CT and SBR (p-value = 0) impact the MOS results the most, followed by FR and then BLER, while MBL has the least influence. As the MOS is found to be mostly affected by CT and SBR, we further categorize CT and SBR using the multiple comparison test based on the Tukey-Kramer Honestly Significant Difference (HSD) criterion [31]. The results of the comparison test for CT and SBR are shown in Figures 17(a) and 17(b), where the centre and span of each horizontal bar indicate the mean and the 95% confidence interval, respectively. The different colours in Figure 17 highlight similar characteristics and are very useful in grouping similar attributes together. In Figure 17(a) (CT versus MOS), CT is classified in the range [0, 1], where 0.1 is slow moving content (e.g., akiyo) and 0.9 is fast moving content (e.g., stefan). Therefore, from Figure 17(a), we can see that MOS is between 2.5 and 2.7 for stefan, compared to MOS between 4.3 and 4.6 for slow moving content (akiyo) and 3.8-4.0 for medium S-T activity (foreman). We thus observe that content types with medium-to-high S-T activity show similar attributes compared to those with low S-T activity. Similarly, in Figure 17(b), the higher SBRs (i.e., 128 and 256) have a similar impact on quality, due to network congestion, compared to the lower SBR values.
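For reproducibility, the analysis can be sketched in Python with statsmodels, assuming a DataFrame df with one row per simulated sequence and columns CT, SBR, FR, BLER, MBL, and MOS (these column names are illustrative).

    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def mos_anova(df):
        """N-way ANOVA of MOS against the five QoS factors."""
        fit = ols('MOS ~ C(CT) + C(SBR) + C(FR) + C(BLER) + C(MBL)', data=df).fit()
        return sm.stats.anova_lm(fit)   # sum of squares, df, mean squares, F, p-values

    # Multiple comparison for the dominant factor, as in Figure 17(a):
    # print(pairwise_tukeyhsd(df['MOS'], df['CT']))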

Our studies (Figures 13-17) numerically substantiate the following observations on video quality assessment:
(i) The most important QoS parameter in the application layer is the content type. Therefore, an accurate video quality prediction model must consider all content types; the application layer parameters SBR and FR alone are not sufficient for predicting video quality.
(ii) The optimal combination of SBR and FR that gives the best quality is very much content dependent and varies from sequence to sequence. We found that for slow moving content, FR = 7.5 fps and SBR = 48 kbps gave acceptable quality; however, as the spatio-temporal activity of the content increased, this combination gave unacceptable quality even with no network impairment.
(iii) The most important QoS parameter in the physical layer is BLER. Therefore, an accurate video quality prediction model must consider the impact of the physical layer in addition to application layer parameters.
(iv) The impact of the physical layer parameters MBL and BLER varies depending on the type of content. For slow moving content, a BLER of 20% gives acceptable quality; however, for fast moving content at the same BLER, the quality is completely unacceptable. Therefore, the impact of physical layer QoS parameters is very much content dependent.

7. Evaluation of the Proposed Video Quality Prediction Models

The aim was to develop learning models to predict video quality considering all content types and RLC loss models (2-state Markov) with variable MBLs of 1, 1.75, and 2.5 for H.264 video streaming over UMTS networks. The models were trained with three distinct video clips (akiyo, foreman, and stefan) and validated with the video clips suzie, carphone, and football. The application layer parameters were FR, SBR, and CT, and the physical layer parameters were BLER and MBL. The accuracy of the proposed video quality prediction models is determined by the correlation coefficient and the RMSE of the validation results. MATLAB's nlintool is used for the nonlinear regression modeling.

7.1. ANFIS-Based

The accuracy of the proposed ANFIS-based video quality prediction model is determined by the correlation coefficient and the RMSE of the validation results. The model is trained with three distinct content types, using parameters from both the application and physical layers over UMTS networks, and predicts quality in terms of the Mean Opinion Score (MOS). The predicted versus measured MOS for the proposed ANFIS-based prediction model is depicted in Figure 18.

7.2. Regression-Based

The procedure for developing the regression-based model is outlined below.

Step 1 (Select content types). We selected three video sequences with different impacts on user perception for training and three different video sequences for validation, as shown in Figure 2. The video sequences ranged from very little movement to fast moving sports-type clips, reflecting different spatio-temporal features. Hence, the proposed model covers all content types.

Step 2 (Obtain MOS versus BLER versus MBL). The impact of BLER and MBL on MOS is shown in Figure 19. From Figure 19, we observe that the higher the BLER and MBL values, the greater the loss in overall quality. However, an MBL of 1 gives the best quality for all values of BLER; introducing burstiness reduces quality, as would be expected.

Step 3 (Obtain MOS versus SBR versus FR). The relationship between MOS, SBR, and FR is shown in Figure 20. From Figure 20, we observe that at higher SBRs the video quality degrades rapidly due to UMTS network congestion on the limited downlink bandwidth. Similarly, at lower FRs, video quality degrades. The impact is greater on videos with higher spatio-temporal activity.

Step 4 (Surface fitting for nonlinear mapping from BLER, MBL, CT, SBR, and FR to MOS). A nonlinear regression analysis was carried out with the MATLAB function nlintool. We obtained the nonlinear equation given in (6) with reasonable goodness of fit. The coefficients of the proposed model in (6) are given in Table 5. Figure 21 shows the measured versus predicted MOS for the proposed model.

7.3. Comparison of the Models

The models proposed in this paper are reference-free. The comparison of the two models in terms of the correlation coefficient (R) and the Root Mean Squared Error (RMSE) is given in Table 6.
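Both figures of merit are straightforward to compute; the following Python sketch evaluates them on the validation set, where measured and predicted are the MOS vectors.

    import numpy as np

    def correlation_and_rmse(measured, predicted):
        m, p = np.asarray(measured, float), np.asarray(predicted, float)
        r = np.corrcoef(m, p)[0, 1]             # Pearson correlation coefficient
        rmse = np.sqrt(np.mean((m - p) ** 2))   # root mean squared error
        return r, rmse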

The performance of the ANFIS-based and regression-based models over the UMTS network is very similar in terms of correlation coefficient and RMSE, as shown in Table 6. The models also perform well compared to recent work in [32], where the authors used a tool called Pseudo-Subjective Quality Assessment (PSQA) based on random neural networks. In [32], the authors train the random neural networks with network parameters, for example, packet loss and bandwidth; in addition, they used their tool to assess the quality of multimedia (voice and video) over home networks, mainly WLAN. Our proposed tool could be modified in the future to assess voice quality. However, our choice of parameters includes a combination of application and physical layer parameters, and our access network is UMTS, where bandwidth is much more restricted. Compared to our previous work [12], where we proposed three separate models for the three content types, both models also perform very well.

We feel that the choice of parameters is crucial in achieving good prediction accuracy. Parameters such as the MBL at the link layer allowed us to consider both less bursty and more bursty cases under different BLER conditions. At the application level, the content type has a bigger impact on quality than sender bitrate and frame rate. However, if the frame rate is reduced too far, for example, to 7.5 f/s, then the frame rate has a bigger impact on quality than the sender bitrate for faster moving content. Similarly, if the sender bitrate is too high, then quality practically collapses; this is due to the bandwidth restriction over the UMTS network causing congestion. Also, content with little movement requires a lower sender bitrate than content with high movement. Finally, the content type is very important for predicting video quality.

8. Conclusions

This paper presented learning models based on ANFIS and nonlinear regression analysis to predict video quality over UMTS networks nonintrusively. We further investigated the combined effects of application and physical layer parameters on end-to-end perceived video quality, and analyzed the behaviour of video quality over wide-range variations of a set of selected parameters over UMTS networks. The perceived video quality is evaluated in terms of MOS. Three distinct video clips were chosen to train the models, which were then validated with unseen datasets.

The results demonstrate that it is possible to predict video quality if the appropriate parameters are chosen. Our results confirm that both proposed models, the ANFIS-based and the regression-based learning model, are suitable tools for video quality prediction for the most significant video content types.

Our future work will focus on extensive subjective testing to validate the models and on implementing them in our IP Multimedia Subsystem-based test bed, further applying our results to adapt the video sender bitrate and hence optimize bandwidth for specific content types.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 under grant agreement no. 214751 (ICT-ADAMANTIUM).