Abstract

We propose a modular no-reference video quality prediction model for videos encoded with the H.265/HEVC and VP9 codecs and viewed on mobile devices. The impairments that can affect video transmission are classified into two broad types depending upon the layer of the TCP/IP model in which they originate. Impairments from the network layer are called network QoS factors, while those from the application layer are called application/payload QoS factors. Initially, we treat the network and application QoS factors separately and establish the one-to-one relationship between each QoS factor and the corresponding perceived video quality, or QoE. The mapping from the QoS to the QoE domain is based upon a decision variable that gives an optimal performance. Next, within each group we choose multiple QoS factors and find the QoE for such multifactor impaired videos by using additive, multiplicative, and regressive approaches. We refer to these as the integrated network and application QoE, respectively. Finally, we use a multiple regression approach to combine the network and application QoE into the final model. We also build the model with an Artificial Neural Network approach and compare its performance with that of the regressive approach.

1. Introduction

There has been rapid advancement in video services and their applications, such as video telephony, High Definition (HD) and Ultra High Definition (UHD) television, Internet Protocol television (IPTV), and mobile multimedia streaming, in recent years. Thus, quality assessment of videos streamed and watched online has become an area of active research. As per a report published in [1], video streaming over the Internet is becoming increasingly popular on devices with small form factor screens (4 to 6 inches). Most mobile phones today have a screen resolution of at least 1280 × 720 pixels (HD) or 1920 × 1080 pixels (Full HD); some even offer 2560 × 1440 pixels (2K resolution). The price of mobile phones has also fallen considerably, which makes them a perfect candidate for watching videos on the go. Advancements in mobile hardware coupled with decreasing costs have resulted in a greater demand for high resolution video content that can be watched anytime. A report published in [2] confirms this, stating that video traffic presently constitutes more than 55% of overall Internet traffic. In order to mitigate the increased load on the existing network infrastructure, sophisticated video compression techniques have been developed that provide excellent viewing quality without consuming a large network bandwidth during streaming sessions. H.265/HEVC (High Efficiency Video Coding), developed by the International Telecommunication Union's (ITU) Video Coding Experts Group (VCEG), and VP9, developed by Google Inc., are prime examples of such modern codecs. Both provide an excellent quality-to-compression ratio.

Quality of Service (QoS) has been defined by the ITU as "the totality of characteristics of a telecommunications service that bear on its ability to satisfy stated and implied needs of the users of the service" [3]. In contrast, Quality of Experience (QoE) is a multidimensional concept that is influenced by a number of system, user, and other contextual factors [4]. For successful QoE management by Internet service providers (ISPs), it is extremely important to understand the relationships between QoE and the underlying network and application layer QoS parameters. In fact, QoS parameters are the most business-relevant parameters for the ISPs [5]. Therefore, in order to measure user satisfaction, a mapping from the QoS to the QoE domain is needed.

In this paper, a video quality prediction model is presented for videos encoded with the H.265/HEVC and VP9 codecs and viewed on mobile devices connected to a Wireless Local Area Network (WLAN, 802.11ac standard). Only the infrastructure mode is considered. A total of seven QoS parameters are considered (four representing the network QoS and the remaining three the application QoS). Large scale subjective tests are carried out for the purpose of model building. Packet loss, jitter, throughput, and auto resolution scaling are the network QoS factors taken into consideration, while bit rate, frame rate, and video resolution are the application QoS factors. In order to introduce modularity, the video model is built in three stages. In stages one and two, the video quality models for the network and application QoS factors are built separately and independently of each other. In stage three, the models from stages one and two are combined to obtain the final comprehensive model. This modular approach provides more flexibility since it treats the network and application video models independently. The modularity should be particularly useful to ISPs because if any change or modification to the model is required later, they can work on only the specific factors concerned without disturbing the remaining ones. The detailed methodology is provided in a later section.

The rest of the paper is organized as follows. Section 2 reviews the related literature. Section 3 presents the detailed methodology for building the video model. Section 4 illustrates the subjective tests that have been conducted along with the relevant data analysis. Sections 5 and 6 present the video models for the network and application QoS factors, respectively, while Section 7 presents the final integrated video model. Section 8 introduces the Artificial Neural Network (ANN) approach for model building along with the relevant statistics. Finally, in Section 9, we provide the conclusion and the scope for future work.

2. Literature Review

In this section, we present a brief review of all the relevant works.

2.1. Video Service Quality Estimation Methods

There are two main techniques for video quality assessment: subjective and objective methods. A concise report on both techniques is provided in the following paragraphs.

To date, subjective tests are considered the most accurate video quality assessment method. Typically, in a subjective test, users are gathered in a room to view some video samples. They are then asked to rate those samples (typically on a scale of 1 to 5, where 1 denotes the worst quality and 5 the best). The rating given by the users is commonly referred to as the Mean Opinion Score (MOS), also known as the QoE. ITU has several recommendations where the procedure for conducting subjective tests is laid down in detail. Different techniques are available for conducting subjective tests depending upon the application requirement. Absolute Category Rating (ACR), Absolute Category Rating with Hidden Reference (ACR-HR), Degradation Category Rating (DCR), and Pair Comparison (PC) are some of the most frequently employed techniques. Both ACR and ACR-HR are examples of single stimulus methods where only one video sequence is shown at a time. DCR and PC are examples of double stimulus methods where the original and the degraded video sequences are shown to the users in pairs. Further details about these techniques can be obtained from the relevant ITU recommendations [6–8]. The recommendations suggest using video sequences of at least 10-second duration. However, the effect of using videos shorter or longer than 10 seconds on subjective quality assessment has not been accounted for [9]. Papers [10, 11] provide a detailed comparison among the different subjective techniques. Although subjective methods are very accurate, they are time consuming and expensive to carry out. Hence, there is a need for objective approaches.

Objective techniques are based upon algorithms or mathematical formulae that try to predict the video quality as it would be perceived by a human observer. An objective approach can be of the intrusive or the nonintrusive type. Intrusive methods are also known as Full Reference (FR) techniques because the evaluation process requires both the original and the degraded video sequences to be present. Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Video Quality Metric (VQM) are examples of such schemes [12–15]. There is a variation of the intrusive method in which only a subset of the original video sequence is used for quality evaluation; this is referred to as the Reduced Reference (RR) method [16]. Nonintrusive methods do not require the presence of any original video sequence; hence, they are also called no-reference (NR) methods. The models presented in [17–19] represent such techniques. For a video streaming scenario, an NR model is preferable since it involves minimal overhead, as it does not require the presence of the original video sequence. Further details about these methods can be obtained from [20].

A third technique has become increasingly popular in recent years. It involves a combination of the subjective and objective approaches mentioned above. The recently published ITU-T Recommendation P.1203 is an example of such an approach [21].

2.2. Studies of Correlation between QoS and QoE

Due to the high cost of subjective tests and the relatively low accuracy of objective algorithms, researchers have tried to estimate the QoE from various QoS factors. ITU-T Recommendation P.861 estimates the listening quality from various voice transmission factors and establishes a nonlinear relationship between the two [22]. Similar work has been done by the authors in [23], who discuss in detail how human satisfaction with the HTTP service (web browsing) is affected by two network QoS parameters, namely, network bandwidth and latency. A nonlinear relationship between QoS and QoE for web user satisfaction has been proposed by the authors. However, both works consider only network QoS parameters, and it remains to be seen whether the same type of relationship holds true for video traffic as well.

A generic formula in which QoE and QoS parameters are connected through an exponential relationship, called the IQX hypothesis, is presented by the authors in [24]. The IQX hypothesis is tested for two different services, voice over IP (VoIP) and web browsing, under different packet loss, jitter, and reordering conditions. However, the validity of the IQX hypothesis for other interactive and streaming applications like video is a matter of further investigation. Similarly, the authors in [25] explain the relationship between QoE and QoS in terms of the Weber-Fechner Law (WFL), an important principle in psychophysics. The testing environment is limited to a VoIP system and a mobile broadband application scenario involving web browsing, e-mails, and downloads only. Both the IQX hypothesis and the WFL have been tested for VoIP applications and web services only, and they are found to be just the inverse of each other. With respect to video streaming, other factors like video resolution, the type of codec used for compression, and the nature of the video content are very important in determining the video QoE. However, they cannot be taken into account by the IQX hypothesis or the WFL, since these cover only the network QoS factors.

An adaptive QoE measurement scheme for IPTV services has been presented in [26]. The authors propose a Video Intelligent Probe (VIP) that integrates the analysis of video processing and network parameters. The assessment is based upon the quality of the images contained in the video signal, packet loss, and packet arrival time. Similar work has been done in [27] for an IPTV service, where the effects of delay, jitter, packet loss rate, error rate, bandwidth, and signal success rate are considered. Both works primarily take network QoS factors into consideration and are targeted towards watching videos on a big screen such as a television. However, there is a considerable difference between the viewing experience on a television and on small form factor mobile devices [28, 29].

The authors in [30] evaluate video quality on a mobile platform considering the impact of spatial resolution, temporal resolution, and quantization step size. All the videos considered have a resolution of 4CIF (704 × 576) only. QoE modeling for VP9 and H.265 encoded videos on mobile devices has been investigated by the authors in [31]. Although application QoS factors like bit rate and video resolution are taken into consideration, no network QoS factors are included; the effect of frame rate is also not taken into account. A content based video quality prediction model over wireless local area networks that combines both application and network level parameters has been proposed in [32]. Bit rate and frame rate are the application QoS factors considered, whereas only the effect of packet loss is taken into account as the network QoS factor across a variety of video content. Similar models have also been proposed in [33–36].

A Dynamic Adaptive Streaming over HTTP (DASH) based multiview video streaming system that can minimize the view-switching delay by employing proper buffer control, parallel streaming, and server push schemes has been presented by the authors in [37]. Similar HTTP based video streaming for long-term evolution (LTE) cellular networks has been proposed in [38]. The authors in [39–41] try to predict the video QoE for a DASH based video streaming scenario. Papers [42–44] provide an excellent survey of the QoE estimation techniques available for a video streaming scenario in general.

From this section, we conclude that a lot of research has been done on the quality estimation of videos. The factors taken into consideration belong either to the application layer or to the network layer of the TCP/IP protocol suite. Some of the authors who have considered the effect of both use low resolution videos encoded with older generation codecs like MPEG-2 and H.264/AVC. Also, very little work has been done on video quality estimation for mobile devices. Considering the tremendous and ever increasing popularity of online video streaming services, there is a pressing need for a comprehensive quality prediction model that takes into account factors from both the network and application layers for videos encoded with the current generation H.265/HEVC and VP9 codecs and viewed on a mobile device.

3. Methodology

3.1. Problem Statement

In a video streaming service, there are several factors that affect the visual quality as perceived by the end users. These QoS factors can be grouped under the categories of network and application QoS factors. Figure 1 gives a detailed classification of the factors that we have considered in this paper. Therefore, for our case, the network layer perceptual QoS/video model will be a function of

$$QoE_{N} = f_{N}(PL, J, T, ARS). \tag{1}$$

Similarly, in the application layer, it will be a function of

$$QoE_{A} = f_{A}(BR, FR, R). \tag{2}$$

As both $QoE_{N}$ and $QoE_{A}$ have the same scale (equivalent to the MOS scale of 1 to 5), the overall/final video model can be expressed as

$$QoE_{overall} = f(QoE_{N}, QoE_{A}). \tag{3}$$

Equations (1) and (2) are absolutely independent of each other, while (3) integrates (1) and (2) together. This is the reason that our proposed model is modular in nature. Depending upon the requirement, either $QoE_{N}$, $QoE_{A}$, or $QoE_{overall}$ can be obtained. Figure 2 depicts the overall methodology that is followed for building the video model.

Eleven reference videos are chosen from the SVT High Definition Multiformat Test Set database maintained by the Video Quality Experts Group (VQEG) [45]. These reference videos are then subjected to various types of network and application level impairments. The degraded video sequences are then shown to the users, who rate them on a scale of 1 to 5. The results from the subjective test are used for creating the objective model.

We begin by mapping the individual network and application QoS factors to their corresponding QoE. Various types of fitting functions are considered, but we choose the optimal one based upon a decision variable (DV) that is discussed in a later part of the paper. After this, we find the perceived video quality due to multiple impairment factors, that is, multiple network and application QoS factors, and refer to these as $QoE_{N}$ and $QoE_{A}$, respectively. This is done in three steps, as discussed below:

In step 1, we use a weighted sum (additive) approach to find the network and application QoE by using the Analytic Hierarchy Process (AHP) algorithm and refer to them as $QoE_{N,add}$ and $QoE_{A,add}$, respectively. However, due to a drawback in this approach, we next use a multiplicative technique.

In step 2, we use a multiplicative approach to find the network and application QoE and refer to them as $QoE_{N,mult}$ and $QoE_{A,mult}$, respectively.

In step 3, we take into account the interaction of the additive and multiplicative approaches to find the final network and application QoE, denoted by $QoE_{N}$ and $QoE_{A}$, respectively.

The final video model $QoE_{overall}$ is found from $QoE_{N}$ and $QoE_{A}$ by using a multiple regression approach. Due to the widespread use of different machine learning algorithms, we also find $QoE_{overall}$ using an Artificial Neural Network (ANN) approach and compare the results across the two methods.

Next we discuss in detail the various QoS factors that have been considered in this paper.

3.2. Network QoS Factors

Here we provide the detail of the considered network QoS factors.

(1) Packet Loss (PL). IP packets may be discarded during their transit over the network or dropped at intermediate nodes due to network congestion or buffer overflow. Here, we consider a random packet loss pattern, as it has a significantly more detrimental effect on video stream quality than other types of packet loss [46]. The packet loss levels considered are taken from the range of values recommended by ITU-T in Recommendation Y.1541 [47] and are presented in Table 4.

(2) Jitter (J). It is defined as the variable delay in receiving packets at the receiver end. It can occur due to network congestion, improper queuing, or several other factors.

(3) Throughput (T). It refers to the amount of data successfully transferred from one place to another in a given time period. Its influence on video QoE is well accepted by the research community.

(4) Auto Resolution Scaling (ARS). In an adaptive video streaming scenario, videos are encoded at multiple discrete bitrates, that is, at different resolutions. For example, the video resolutions most commonly used by YouTube are 144p, 240p, 360p, 480p, 720p, and 1080p. Depending upon the available network bandwidth and other factors, a particular bitrate stream is broken into multiple segments or chunks, with each segment lasting between 2 and 10 seconds. For this research, the resolution combinations we choose are (360p + 480p), (720p + 360p), (720p + 480p), (360p + 1080p), and (1080p + 720p). The video sequences used in our experiment are 10 seconds each. Considering that the duration of each fragmented segment should be between 2 and 10 seconds in case of a resolution switch, we take into account only two resolutions (i.e., a single mid-playback switch) for a particular video playback. A higher number of resolution switches has not been considered, keeping in mind the total length of the original video sequences. For the purpose of data analysis, we express the ARS factor as the ratio of the pixel count of a particular resolution combination to that of the minimum resolution combination used. For example, the ARS factor for the (720p + 360p) combination is (1280 × 720 + 640 × 360)/(640 × 360 + 854 × 480) = 1.8. Similarly, for (360p + 480p), (720p + 480p), (360p + 1080p), and (1080p + 720p), the ARS factors are 1, 2.1, 3.6, and 4.7, respectively; these values can be reproduced with the short sketch below.
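As a quick illustration, the following Python sketch computes the ARS factors quoted above. The pixel counts are the standard ones for each resolution; the helper name ars_factor is ours, not from the paper.

```python
# Pixel counts (width x height) for the resolutions used in the experiment.
RES = {"360p": 640 * 360, "480p": 854 * 480,
       "720p": 1280 * 720, "1080p": 1920 * 1080}

def ars_factor(combo, min_combo=("360p", "480p")):
    """ARS factor: total pixel count of a resolution pair divided by
    that of the minimum pair used in the experiment (360p + 480p)."""
    return sum(RES[r] for r in combo) / sum(RES[r] for r in min_combo)

for combo in [("360p", "480p"), ("720p", "360p"), ("720p", "480p"),
              ("360p", "1080p"), ("1080p", "720p")]:
    print(combo, round(ars_factor(combo), 1))
# -> 1.0, 1.8, 2.1, 3.6, 4.7, matching the values quoted above
```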

Now we explain how the secondary ARS factor is related to the primary ones. Auto resolution scaling is a type of adaptive bitrate streaming technique used by video content providers with the aim of improving the viewing QoE. The provider stores the same video content at multiple resolutions and then, depending on various network factors like the available network bandwidth, the extent of jitter, and the overall network loading conditions, selects a particular resolution for showing to the users. Automatic switching to a lower or higher resolution than what is currently being played happens depending upon the network conditions and factors like the amount of playout buffer left, the video rendering capability of the viewer's device, and so on. Hence, the ARS factor that we have considered is a consequence of the primary ones.

3.3. Application QoS Factors

Bitrate, frame rate, and resolution of the source videos are the application QoS factors considered. The videos used in the experiment vary over a wide range of video content. The bitrate factor is different from the throughput factor (although they are measured in the same units): bitrate is a codec related feature, while throughput is a network property that refers to the available bandwidth at any point in time.

The perceived video quality depends to a great extent on the type of video content, as discussed by the authors in [32, 48–50]. To define the different types of video content, we have considered the Spatial Information (SI) and Temporal Information (TI) of the source videos. SI gives an indication of the amount of spatial detail in each frame and has a higher value for more spatially complex scenes. The SI value for every video frame is calculated by filtering each frame with the Sobel filter and then computing the standard deviation over the pixels; the maximum value over all frames represents the SI content of the scene. Similarly, TI gives an indication of the amount of temporal change in a particular video sequence and has a higher value for sequences with a greater amount of motion. Equations (4) and (5) show the calculation of the SI and TI values, respectively:

$$SI = \max_{time}\{\sigma_{space}[Sobel(F_n)]\}, \tag{4}$$

$$TI = \max_{time}\{\sigma_{space}[F_n - F_{n-1}]\}, \tag{5}$$

where $F_n$ is the video frame at time $n$, $\sigma_{space}$ is the standard deviation across all the pixels of each filtered or difference frame, and $\max_{time}$ is the corresponding maximum value in the considered time interval. The SI and TI values are multiplied in order to arrive at the overall content complexity of any video sequence.

The Sobel filter is implemented by convolving two 3 × 3 kernels over the video frame and taking the square root of the sum of the squares of the results of these convolutions. For $1 < i < M$ and $1 < j < N$, if $p(i,j)$ denotes the pixel of the input frame at the $i$th row and $j$th column, then the result of the first convolution, denoted by $G_x(i,j)$, is given by

$$G_x(i,j) = [p(i-1,j+1) + 2p(i,j+1) + p(i+1,j+1)] - [p(i-1,j-1) + 2p(i,j-1) + p(i+1,j-1)]. \tag{6}$$

Similarly, $G_y(i,j)$, which is the result of the second convolution, is given by

$$G_y(i,j) = [p(i+1,j-1) + 2p(i+1,j) + p(i+1,j+1)] - [p(i-1,j-1) + 2p(i-1,j) + p(i-1,j+1)]. \tag{7}$$

Hence, the output of the Sobel filtered image at the $i$th row and $j$th column is given by

$$G(i,j) = \sqrt{G_x(i,j)^2 + G_y(i,j)^2}. \tag{8}$$

The calculations are performed for all $1 < i < M$ and $1 < j < N$, where $M$ denotes the number of rows and $N$ the number of columns.
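As an illustration, the SI/TI computation of (4)–(8) can be sketched compactly in Python. This is a minimal sketch assuming the luma plane of each frame is supplied as a 2-D NumPy array; the function name si_ti is ours.

```python
import numpy as np
from scipy import ndimage

def si_ti(frames):
    """Compute SI and TI as per (4) and (5) above.

    frames: iterable of 2-D numpy arrays (luma plane of each frame).
    """
    si_vals, ti_vals, prev = [], [], None
    for f in frames:
        f = f.astype(np.float64)
        gx = ndimage.sobel(f, axis=1)                  # horizontal gradient, (6)
        gy = ndimage.sobel(f, axis=0)                  # vertical gradient, (7)
        si_vals.append(np.sqrt(gx**2 + gy**2).std())   # spatial std of (8)
        if prev is not None:
            ti_vals.append((f - prev).std())           # std of frame difference
        prev = f
    return max(si_vals), max(ti_vals)                  # max over time
```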

Figure 3 shows the SI and TI values for the eleven video sequences that have been used in this paper.

For each video sequence, we have taken four different resolutions (VGA, qHD, HD, and Full HD). The resolution factor $R$ considered here is totally different from the ARS factor discussed previously. $R$ refers solely to the resolution of videos that have not been subjected to any sort of network impairment, whereas the ARS factor has been introduced as a network QoS factor in order to take into consideration the effects of adaptive bitrate streaming. For the sake of data analysis, we express the resolution of a particular video sample, denoted by $R$, in a ratio format given by

$$R = \frac{Res_{actual}}{Res_{min}}, \tag{9}$$

where $Res_{actual}$ refers to the actual resolution of the video under consideration and $Res_{min}$ refers to its corresponding minimum resolution (VGA in our case). For example, the resolution value for any Full HD content will be $(1920 \times 1080)/(640 \times 480) = 6.75$. Similarly, the resolution value for any VGA content will be 1. Thus, a video having a higher $R$ value will be at a higher resolution. Next, we discuss in detail the experiment that has been carried out and the subsequent data analysis.

4. Experiment Details

First, we present the video sequences that have been used in this research.

4.1. Video Selection

The publicly available video database of VQEG has been used for selecting our reference videos. A total of 11 sequences are taken, the details of which are shown in Table 1. All the sequences are roughly of 10-second duration and in native YUV 4:2:0 format. The raw videos are encoded using the H.265 and VP9 codecs. Tables 2 and 3 show the encoder configurations used for the two codecs, respectively.

Figures 4(a)–4(k) show snapshots of the videos that are considered.

4.2. Simulation Test-Bed

The simulation test-bed is shown in Figure 5. We have created 2 sending nodes, namely, a constant bitrate (CBR) background traffic source node and a streaming server containing all the video sequences used, encoded with the H.265 and VP9 codecs. The bitrate of the CBR source has been fixed at 2 Mbps in order to simulate a realistic scenario. Both sending nodes are connected to router A over the Internet across a 20 Mbps link. Router A is in turn connected to router B over a variable link. Router B is connected to a wireless access point at 20 Mbps, which further transmits the traffic to a mobile node at transmission speeds of up to 600 Mbps, typical of 802.11ac WLANs. No packets are dropped in the wired portion of the video delivery path. The maximum transmitted packet size is 1024 bytes. We use a random pattern for packet loss that takes six values (0.1, 0.5, 1, 3, 5, and 10%). The effect of jitter has been added by introducing a fixed delay of 100 ms plus five variable delays (1, 2, 3, 4, and 5 ms). The network throughput is varied by changing the bandwidth of the variable link between routers A and B, fixed at 500, 1000, 2000, 4000, and 8000 kbps. As previously mentioned, the ranges of all the values considered are based upon the ITU and ETSI recommendations provided in [47, 51, 52]. The values of all the parameters used in the experiment are provided in Table 4. For videos impaired by the single ARS factor only or by any particular application QoS factor, the simulation test-bed has not been used. In order to simulate the ARS factor, a particular video is segmented, with each segment played back at a different resolution. For example, for a video having 300 frames in total, the first 150 frames are played back at one resolution and the remaining 150 frames at a different resolution.

The experiment has been conducted with Evalvid framework [53] and network simulator toolkit NS2 [54]. Integrating NS2 with the Evalvid platform gives us a lot of flexibility in choosing the parameters.

Next, the subjective evaluation process has been described in detail.

4.3. Subjective Evaluation

Fifty-nine participants of mixed gender are involved in the subjective test. Figure 6 shows the percentage breakdown of the participants' ages. Before recruiting the participants, an Ishihara color vision test has been conducted on them in order to ensure that none of them suffers from color blindness [55]. The test has been conducted in a controlled laboratory environment and took 16 weeks to complete. Table 5 gives the details of the viewing conditions under which the test has been performed.

The subjects had to evaluate 462 video sequences impaired by exactly 1 network QoS factor. The total number of network impairment conditions is 21 (6 for PL + 5 for J + 5 for T + 5 for ARS). Considering the 11 video sequences across 2 codecs (21 impairment conditions × 11 video sequences × 2 codecs), we arrive at the number 462. In order to assess the quality of videos impaired by multiple network QoS factors, we limit the number of test sequences to 176. Since carrying out a subjective test consumes a considerable amount of time and effort, it was not feasible to include all possible values of the different impairment combinations while presenting the test video sequences. Instead, we choose 176 specific combinations, the details of which are shown in Table 6.

32 video sequences are impaired by all the network QoS factors, while for each of the remaining conditions we use 16 sequences. For both the single and multifactor impaired videos, exactly half of the samples are used for model building and the rest for validation.

Similarly, for creating the application video model, we have 308 video sequences impaired by exactly 1 application QoS factor. Five BR levels and five FR levels, along with 4 resolution values, across 2 codecs and 11 sequences give a total of 308 combinations. For creating the multifactor impaired videos, as explained previously, we have used a subset of all possible combinations. In particular, 140 sequences are used, the details of which are provided in Table 7. As before, the 140 sequences are split evenly between model creation and validation.

The final model is created by combining all the network and application QoS factors together. Thus, we have a total of 7 different factors. Since it is not possible to let the users watch such a huge number of videos, we limit the number of impaired videos to 156. Table 8 shows the relevant details. For this case, while creating the video sequences, care has been taken to include the effect of both the network and application QoS factors for every condition. 78 sequences are being used for model creation and the rest for validation.

All the videos are presented on a Samsung Galaxy Note for the purpose of evaluation. We chose this device because it has hardware level decoding capability for the H.265 and VP9 codecs. Hardware decoding has certain advantages over software decoding: software decoding sometimes results in jittery or distorted playback for certain video formats, especially those encoded with newer codecs, and hardware acceleration is very useful in such cases. With hardware acceleration, manufacturers implement dedicated multimedia chipsets as part of the motherboard to assist with the video decoding process, whereas software decoders use only the CPU to play videos. Hence, the choice is between something specific (hardware decoders) and something generic (software decoders), and the Samsung Galaxy Note has a dedicated decoder chip for the H.265 and VP9 codecs.

The single stimulus ACR technique, as outlined in ITU-T Recommendation P.910, has been used for designing the experiment. The total number of test videos that the participants have to watch is quite large (1242 sequences), and each subject needs approximately 4 hours to complete the entire assessment procedure. We therefore divided the test into 9 subsessions: five sessions were completed on the first day and the remaining four on the next day for each subject. Each session lasted about 30 minutes and was followed by a 15-minute break in order to remove any tiredness and fatigue that might arise from the extended viewing period. The videos were presented to the subjects in a random order.

Next, we discuss the data processing method used.

4.4. Outlier Detection and Score Estimation

In the case of subjects whose scores deviate to a certain extent from the mean score, outlier detection has to be carried out in order to remove the subject bias. We first check whether the score distribution is roughly normal using the $\beta_2$ test (by calculating the kurtosis coefficient of the function, i.e., the ratio of the fourth-order moment to the square of the second-order moment). For a particular test video sequence $j$, we calculate the mean $\bar{u}_j$, standard deviation $\sigma_j$, and kurtosis coefficient $\beta_{2j}$ as

$$\bar{u}_j = \frac{1}{N}\sum_{i=1}^{N} u_{ij}, \qquad \sigma_j = \sqrt{\frac{\sum_{i=1}^{N}(u_{ij}-\bar{u}_j)^2}{N-1}}, \qquad \beta_{2j} = \frac{m_4}{(m_2)^2} \ \text{with}\ m_x = \frac{\sum_{i=1}^{N}(u_{ij}-\bar{u}_j)^x}{N}, \tag{10}$$

where $N$ is the total number of subjects and $u_{ij}$ is the score given by the $i$th user for the $j$th test video.

For each observer $i$, we find $P_i$ and $Q_i$ as given below.

If $2 \le \beta_{2j} \le 4$ (scores roughly normally distributed), then:
  if $u_{ij} \ge \bar{u}_j + 2\sigma_j$, then $P_i = P_i + 1$;
  if $u_{ij} \le \bar{u}_j - 2\sigma_j$, then $Q_i = Q_i + 1$.
Else:
  if $u_{ij} \ge \bar{u}_j + \sqrt{20}\,\sigma_j$, then $P_i = P_i + 1$;
  if $u_{ij} \le \bar{u}_j - \sqrt{20}\,\sigma_j$, then $Q_i = Q_i + 1$.

Following the above procedure, any subject is removed from the analysis if $(P_i + Q_i)/J > 0.05$ and $\left|(P_i - Q_i)/(P_i + Q_i)\right| < 0.3$, where $J$ is the total number of test videos. Consequently, the ratings from 4 subjects for the packet loss factor, 5 subjects for the jitter factor, 7 subjects for the ARS factor, and 3 subjects for the frame rate factor have been removed from further analysis. Based upon the user ratings, the QoE or MOS for the $j$th video is calculated as

$$MOS_j = \frac{1}{N}\sum_{i=1}^{N} u_{ij}, \tag{11}$$

where $N$ is the number of subjects and $u_{ij}$ is the score given by the $i$th user for the $j$th video.
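A compact sketch of this screening procedure, assuming a score matrix with one row per subject and one column per test video (the function name screen_subjects is ours; the thresholds follow the criterion stated above):

```python
import numpy as np

def screen_subjects(scores):
    """Kurtosis-based outlier screening as described above.

    scores: (N subjects x J videos) array of opinion scores.
    Returns the indices of the subjects to reject.
    """
    N, J = scores.shape
    mean = scores.mean(axis=0)
    sd = scores.std(axis=0, ddof=1)
    m2 = ((scores - mean) ** 2).mean(axis=0)
    m4 = ((scores - mean) ** 4).mean(axis=0)
    beta2 = m4 / m2**2                      # kurtosis coefficient, (10)
    # 2*sigma bounds where scores are roughly normal, sqrt(20)*sigma otherwise
    k = np.where((beta2 >= 2) & (beta2 <= 4), 2.0, np.sqrt(20))
    P = (scores >= mean + k * sd).sum(axis=1)
    Q = (scores <= mean - k * sd).sum(axis=1)
    reject = ((P + Q) / J > 0.05) & \
             (np.abs((P - Q) / np.maximum(P + Q, 1)) < 0.3)
    return np.where(reject)[0]
```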

Next, we present the network video model.

5. Network Video Model

To build the network video model first we consider the effect of single network QoS factors on the user QoE. Thereafter, we find the joint effect of all the network QoS factors considered.

5.1. Mapping for Individual QoS Factors to User QoE

We perform nonlinear curve fitting on the subjective dataset to arrive at the relationships between the QoS factors and their corresponding QoE. An optimal fitting is chosen based upon a decision variable (DV) that is introduced here. The overall goodness-of-fit statistics are generally expressed in terms of the sum of squared errors (SSE), the root mean square error (RMSE), the $R^2$, or the adjusted $R^2$ ($R^2_{adj}$) values. For SSE and RMSE, values closer to 0 indicate that the model has a smaller random error component and that the fit will be more useful for prediction. Similarly, $R^2$ and $R^2_{adj}$ values close to 1 indicate that a greater proportion of the variance is accounted for by the model. $R^2$ and $R^2_{adj}$ are given as

$$R^2 = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k}, \tag{12}$$

where SST is the total sum of squares, $n$ is the number of observations, and $k$ is the number of regression coefficients including the intercept. Based upon the above discussion, we propose the decision variable DV, which grows with $R^2$ and $R^2_{adj}$ and shrinks with SSE and RMSE:

$$DV = \frac{R^2 + R^2_{adj}}{SSE + RMSE}. \tag{13}$$

Equation (13) suggests that a higher value of DV is always desirable. We considered various types of fitting models and chose the one optimized to give the highest possible value of DV. The goodness-of-fit statistics for each individual mapping are shown in Table 9.
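These statistics are straightforward to compute from a fitted model's predictions. A minimal sketch, assuming the DV form reconstructed in (13) (the function name decision_variable is ours):

```python
import numpy as np

def decision_variable(y, y_hat, k):
    """Goodness-of-fit decision variable per (12) and (13).

    y: observed MOS values; y_hat: model predictions;
    k: number of regression coefficients including the intercept.
    """
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)              # sum of squared errors
    sst = np.sum((y - np.mean(y)) ** 2)         # total sum of squares
    rmse = np.sqrt(sse / n)
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return (r2 + r2_adj) / (sse + rmse)
```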

Equations (14)–(17) give the mapping from the QoS to the QoE domain for the packet loss, jitter, throughput, and auto resolution scaling factors, respectively, where $a$, $b$, $c$, and $d$ are the coefficients that are found from curve fitting and presented in Table 10.
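Although the exact fitted forms are those of (14)–(17) and Table 10, the fitting procedure itself can be illustrated. The sketch below assumes, purely for illustration, an IQX-style exponential decay for the packet loss mapping together with hypothetical MOS values; the paper's actual functional forms are the ones selected by the DV criterion.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative only: an IQX-style exponential QoS-to-QoE mapping for
# packet loss; the coefficients a, b, c play the role of Table 10 entries.
def qoe_pl(pl, a, b, c):
    return a * np.exp(-b * pl) + c

pl = np.array([0.1, 0.5, 1, 3, 5, 10])           # packet loss levels (%)
mos = np.array([4.3, 4.0, 3.5, 2.4, 1.8, 1.2])   # hypothetical MOS values

coeffs, _ = curve_fit(qoe_pl, pl, mos, p0=(4.0, 0.5, 1.0))
print(coeffs)   # fitted a, b, c
```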

Figures 7–10 show the relationship between each QoS factor and the corresponding QoE/MOS.

Generally, we observe that, for all the factors, videos encoded with the VP9 codec have a slightly better QoE than those encoded with the H.265 codec.

The PCC (Pearson Correlation Coefficient) has also been calculated for the set of equations obtained above for the individual factors. This has been shown in Table 11. Results show that the QoE values which the equations predict have a high degree of correlation with the actual subjective scores.

Next we present the integrated QoE measurement technique from the individual QoS factors.

5.2. Integrated QoE Measurement for Network Factors

An additive and multiplicative approach is used for finding out the integrated QoE. The final network video model is obtained by carrying out a regression across both the approaches.

In an additive form, the QoE is generally represented as

$$QoE_{add} = \sum_{i=1}^{n} w_i \, QoE_i, \tag{18}$$

where $w_i$ is the weight that needs to be found for QoS factor $i$. Not all the network QoS factors considered here have the same impact on the perceived video quality: a factor with a larger effect should be given a higher weight than a less important one. Before applying the additive approach, in order to explicitly find the effect of the different network QoS parameters on the QoE, we perform an ANOVA (Analysis of Variance) on the subjective dataset containing the scores collected from the 176 video sequences impaired by all the network factors. Table 12 shows the results of the ANOVA analysis.

The small $p$ values suggest that all the parameters considered are significant. Based upon the magnitudes of the $F$ values, we can make the further claim that jitter impacts the MOS results the most, followed by packet loss and auto resolution scaling; throughput has the least influence. This observation is extremely important in assigning proper weights to the different factors in the additive approach.

For assigning the weights, Analytic Hierarchy Process (AHP) algorithm has been used [56, 57]. It is a well-known structured technique that is often used in multicriteria decision making systems. As the first step we obtain the criteria comparison matrix that has been shown in Table 13.

The next step is to build the normalized matrix from which we can get the weight of every factor considered. This normalized matrix has been shown in Table 14.
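The AHP computation itself is simple to reproduce. The sketch below uses an illustrative pairwise comparison matrix on the Saaty scale, ordered (J, PL, ARS, T) per the ANOVA ranking above; the actual entries are those of Table 13. The weights are obtained by normalizing each column and averaging across rows, which yields the normalized matrix of Table 14.

```python
import numpy as np

# Hypothetical pairwise comparison matrix (Saaty scale) for (J, PL, ARS, T);
# the real entries are those of Table 13.
C = np.array([
    [1.0, 2.0, 3.0, 5.0],
    [1/2, 1.0, 2.0, 4.0],
    [1/3, 1/2, 1.0, 3.0],
    [1/5, 1/4, 1/3, 1.0],
])

# AHP: normalize each column, then average across rows to get the weights.
weights = (C / C.sum(axis=0)).mean(axis=1)
print(dict(zip(["J", "PL", "ARS", "T"], weights.round(3))))
```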

Thus, for the case of the network QoE in additive form, (18) reduces to

$$QoE_{N,add} = w_{PL}\,QoE_{PL} + w_{J}\,QoE_{J} + w_{T}\,QoE_{T} + w_{ARS}\,QoE_{ARS}, \tag{19}$$

where the weights $w_{PL}$, $w_{J}$, $w_{T}$, and $w_{ARS}$ are the normalized AHP weights of Table 14.

From (19), it is evident that the weight associated with the jitter factor is maximum, while it is minimum for the throughput factor. The QoE which is calculated by the additive method has a disadvantage that is now explained.

A video that has been distorted by two QoS metrics should not have a better QoE than the video which has been distorted by only one of the two QoS metrics. For example, we refer to Table 15 that shows a sample calculation.

The QoE value for each network factor is calculated from the individual QoS to QoE mapping functions presented previously in (14)–(17). The additive contribution of each factor is calculated by multiplying its individual QoE value by its weight. Finally, $QoE_{N,add}$ is obtained by adding the contributions of the corresponding impairment terms. For this particular case, the QoE values for the individual factors range from 0.62 to 4.07. The additive QoE value obtained is 1.4, which lies within this range. However, this contradicts the fact that the overall QoE should not be greater than 0.62 (the minimum individual QoE obtained). Thus, there is clearly an anomaly in calculating the QoE using the additive approach.

Hence, we consider an alternative multiplicative approach. As the subjects give their opinion on a scale of 1 to 5, we present the QoE equation in multiplicative form as

$$QoE_{N,mult} = 5 \prod_{i} \frac{QoE_i}{5}, \tag{20}$$

where each individual QoS factor is weighed on a scale of 5 while evaluating its contribution towards the final multiplicative QoE. Table 16 shows a sample calculation using (20). For the purpose of illustration, the same set of QoS values has been taken for both approaches. The value obtained is 0.08 (less than the minimum QoE value of 0.62, corresponding to jitter).
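A small numerical sketch of the two integration schemes, using hypothetical per-factor QoE values in the 0.62–4.07 range quoted above and illustrative weights (the actual values are those of Tables 14–16):

```python
# Hypothetical per-factor QoE values and AHP weights for (J, PL, ARS, T);
# illustrative stand-ins for the entries of Tables 14-16.
qoe = {"J": 0.62, "PL": 2.0, "ARS": 2.0, "T": 4.07}
w = {"J": 0.45, "PL": 0.30, "ARS": 0.15, "T": 0.10}

additive = sum(w[f] * qoe[f] for f in qoe)       # as in (19)
multiplicative = 5.0
for f in qoe:                                    # as in (20)
    multiplicative *= qoe[f] / 5.0

print(round(additive, 2), round(multiplicative, 2))
# The additive value lands above min(qoe) = 0.62 (overprediction),
# while the multiplicative value falls below it (underprediction).
```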

A comparison of the QoE values obtained from the two approaches for the same set of network QoS conditions reveals that the additive approach tends to overpredict the actual viewing quality, while the multiplicative approach tends to underpredict it. Hence, for building the final network video model $QoE_{N}$, we use a regression based approach that combines the additive and multiplicative techniques just presented.

The regression model is built based upon (19) and (20), along with the results of the subjective dataset containing 88 video sequences impaired by multiple network QoS factors. Equation (21) represents the network video model, which is further shown in Figure 11:

$$QoE_{N} = \alpha_{0} + \alpha_{1}\,QoE_{N,add} + \alpha_{2}\,QoE_{N,mult}, \tag{21}$$

where the regression coefficients $\alpha_{0}$, $\alpha_{1}$, and $\alpha_{2}$ are estimated from the subjective data.

Equation (21) suggests that, for the network video model, the contribution of the multiplicative part is greater than that of the additive one. The accuracy of the network model is shown in Figure 12, while Table 17 reports the accuracy of each stage in the model building phase. While creating Figure 12, we have used unseen subjective data that was not employed for model building. We note that at each stage there is a gradual increase in modeling accuracy.
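The combination step in (21) is ordinary least squares over the two intermediate predictors. A minimal sketch with scikit-learn, using random placeholders in lieu of the real subjective data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# qoe_add, qoe_mult: per-video integrated QoE from (19) and (20); mos: the
# subjective scores of the 88 multifactor-impaired videos. Random
# placeholders stand in for the real data here.
rng = np.random.default_rng(0)
qoe_add = rng.uniform(1, 5, 88)
qoe_mult = rng.uniform(0, 4, 88)
mos = 0.3 * qoe_add + 0.6 * qoe_mult + rng.normal(0, 0.2, 88)

X = np.column_stack([qoe_add, qoe_mult])
model = LinearRegression().fit(X, mos)
print(model.intercept_, model.coef_)   # alpha_0 and (alpha_1, alpha_2) in (21)
```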

Next, we present the application video model.

6. Application Video Model

An approach similar to that used for the network video model is followed to build the application video model. First, the effects of the individual application QoS factors on the viewing quality are examined, followed by an integrated application QoE estimation using the same three techniques presented previously.

6.1. Mapping for Individual Application QoS Factors to User QoE

Equations (22)–(24) show the variation of QoE with respect to the bitrate, frame rate, and resolution of the impaired videos, respectively. All the mappings have been done with respect to the decision variable already introduced in the previous section of the paper. The relevant coefficients are found from the experiment and presented in Table 18. The corresponding graphs are shown in Figures 13–15. For all the factors, the VP9 codec offers a better viewing experience. Figure 14 suggests that for every video sequence there is an optimal frame rate beyond which the viewing quality does not improve and enters a saturation stage. Similarly, the effect of resolution on the perceived quality follows a Gaussian shape, as evident from (24) and Figure 15. We attribute this observation to the limitations of the human visual system and the size of the screen on which the video is watched. In our experiment, the videos are viewed on a mobile device. The results clearly indicate that, for small sized screens, there will be no substantial improvement in viewing quality from increasing the resolution of the videos.

The model fitting statistics for the individual application factors are shown in Table 19. PCC coefficients presented in Table 20 show a relatively high correlation between the subjective scores and the calculated MOS.

Next, we present the integrated approach towards finding the application level QoE.

6.2. Integrated QoE Measurement for Application Factors

The application video model is also built in three stages comprising the additive, multiplicative, and the regressive approach, respectively. As before, an ANOVA analysis is carried out in the beginning over the subjective dataset containing 140 videos that have been impaired by all the application QoS factors. The result of this analysis is used to choose the relative importance of the factors and assign proper weights to them based upon the AHP algorithm. The ANOVA report has been presented in Table 21. The viewing quality is most impacted by frame rate followed by bitrate and resolution, respectively.

The additive form for the application factors is shown in (25). The intermediate criteria comparison matrix and the final normalized weight matrix obtained from the AHP algorithm are presented in Tables 22 and 23, respectively.

The additive approach suffers from the same type of anomaly that has already been discussed in the previous section. Hence, we present the multiplicative form in

$$QoE_{A,mult} = 5 \prod_{i \in \{BR,\,FR,\,R\}} \frac{QoE_{i}}{5}. \tag{26}$$

As before, the additive approach tends to overpredict the viewing quality, whereas the multiplicative approach tends to underpredict it. Therefore, a regression based model that integrates both approaches for finding the final video quality due to the application factors is presented in

$$QoE_{A} = \beta_{0} + \beta_{1}\,QoE_{A,add} + \beta_{2}\,QoE_{A,mult}. \tag{27}$$

The regression model is built based on (25) and (26), along with the subjective scores obtained from the 70 video sequences impaired by all the concerned application QoS factors.

The application video model and its accuracy are shown in Figures 16 and 17, respectively. Table 24 presents the modeling accuracy across all the three stages.

Next, we find the final integrated video model by combining the network and application video models just presented.

7. Final Integrated Video Model

Till now, separate models have been built for the network and application QoS factors. With an aim to build a cross-layer model, we now combine these two models into one single entity.

For creating the final video model, 78 video sequences are taken. All these sequences are impaired by multiple network and application QoS factors considered here. Details about the video sequences are provided in Table 8. Based on the MOS scores obtained across these 78 sequences, together with (21) and (27), a regression approach is used to build the final video model. A stepwise variable entering scheme is used; if a nonsignificant result is obtained during any step, the corresponding parameter is removed. Equation (28) represents the final video model, which is depicted in Figure 18:

$$QoE_{overall} = \gamma_{0} + \gamma_{1}\,QoE_{N} + \gamma_{2}\,QoE_{A}. \tag{28}$$

The coefficients of the contributing factors suggest that, while calculating the overall video quality, the effect of the network QoE is greater than that of the application QoE. $R^2$, adjusted $R^2$, and PCC values of 0.953, 0.952, and 0.976, respectively, are obtained for our final model. The modeling accuracy is shown in Figure 19. For finding the final model accuracy, we have used the remaining 78 sequences from the subjective dataset that were not used for model building.
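The stepwise entering scheme can be sketched with statsmodels, which exposes per-coefficient p values; a newly entered predictor is kept only if its coefficient is significant. The 0.05 threshold, the variable names, and the placeholder data are our assumptions:

```python
import numpy as np
import statsmodels.api as sm

def stepwise_fit(X, y, names, alpha=0.05):
    """Forward-entry regression: enter predictors one at a time and keep
    each only if its coefficient is significant (p < alpha)."""
    kept = []
    for i, name in enumerate(names):
        trial = kept + [i]
        fit = sm.OLS(y, sm.add_constant(X[:, trial])).fit()
        if fit.pvalues[-1] < alpha:     # p value of the newly entered term
            kept = trial
    final = sm.OLS(y, sm.add_constant(X[:, kept])).fit()
    return final, [names[i] for i in kept]

# Columns: QoE_N and QoE_A for the 78 model-building sequences;
# random placeholders stand in for the real subjective data.
rng = np.random.default_rng(1)
X = rng.uniform(1, 5, (78, 2))
y = 0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 0.2, 78)
model, kept = stepwise_fit(X, y, ["QoE_N", "QoE_A"])
print(kept, model.params.round(3))
```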

Next, we present the same video model using an Artificial Neural Network (ANN) based approach.

8. ANN Based Video Model

So far, we have used a regression based technique for building the video quality prediction model, and the model is able to predict the perceived video quality with reasonable accuracy. However, given the recent widespread use of machine learning techniques for data analysis, we decided to also build the same model, limited to the same parameters considered before, using an Artificial Neural Network and to evaluate the performance of both. The same subjective data consisting of 78 impaired video sequences used previously is taken in this case also.

Video quality estimation using different types of neural networks has been attempted by several researchers in the past. Probabilistic Neural Networks (PNN), Backpropagation Neural Networks (BPNN), Adaptive Neurofuzzy Inference Systems (ANFIS), and Random Neural Networks (RNN) are some of the techniques commonly used. However, as pointed out in the literature review section, video quality assessment on mobile devices has been done with low resolution videos and using only the H.264 and MPEG-2/4 video codecs. In order to estimate video quality from subjective metrics like MOS, feedforward ANNs are most commonly used [58–61]. This is why we decided to use an ANN technique in this paper, keeping in mind the current research gaps and trying to overcome them.

The ANN used in our work is a multilayer perceptron model with one hidden layer. Considering the number of inputs that we have, that is, 7, adding more hidden layers would have unnecessarily increased the overall complexity of the system and also caused overfitting problems; hence, we opted for the one hidden layer architecture. Training of the neural network has been done using the Levenberg-Marquardt (LM) algorithm by issuing the trainlm command in MATLAB. The trainlm command is a network training function that updates the weights and biases of the different nodes according to the LM optimization technique. It is considered one of the fastest backpropagation algorithms and is highly recommended as a first-choice supervised algorithm, although it consumes more memory than other algorithms. For the hidden layer, we have used a hyperbolic tangent sigmoid transfer function (the tansig command); for the output layer, a pure linear transfer function (the purelin command). The neural network has as inputs the same 7 parameters discussed previously, plus one extra factor for the type of codec used. As the output, we have the score that predicts the quality of the video.

We use a 70 : 30 split ratio for the input data, with 70% used for training and the remaining 30% shared between the testing and validation sets. To find the network configuration that achieves the best performance, several rounds of tests are conducted by varying the number of neurons in the hidden layer and observing the output. Since we have 8 inputs and 1 output, we varied the number of hidden neurons from 4 to 15; optimal performance was observed with 10 hidden nodes. The system architecture showing the best configuration is given in Figure 20, where the symbols w and b stand for the weight and bias factors of each node, respectively. The values of w and b for our configuration for the input and hidden layers are provided in Tables 25 and 26, respectively.
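An equivalent network can be sketched outside MATLAB. The code below uses scikit-learn's MLPRegressor with one tanh hidden layer of 10 units and a linear output, mirroring the tansig/purelin architecture described above; since scikit-learn does not provide Levenberg-Marquardt training, the quasi-Newton 'lbfgs' solver stands in for it, and random placeholders stand in for the subjective data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# 8 inputs: PL, J, T, ARS, BR, FR, R, and a codec flag; output: MOS.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, (78, 8))
y = rng.uniform(1, 5, 78)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# One hidden layer of 10 tanh units; MLPRegressor's output is linear by
# default. 'lbfgs' replaces Levenberg-Marquardt, which sklearn lacks.
ann = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   solver="lbfgs", max_iter=2000, random_state=0)
ann.fit(X_tr, y_tr)
print("R^2 on held-out data:", ann.score(X_te, y_te))
```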

The performance of our model across the training, testing, and validation sets is shown in Figure 21, with the best validation performance (obtained at epoch 5) marked in the figure. We also find that as the model learns during the training phase, the mean squared error (MSE) across all three sets decreases and then becomes almost constant. Figure 22 shows the regression plots across all three sets. The overall $R^2$ value for all the video sequences is 0.964, which is quite high, and the PCC value obtained is 0.984 at a significance level of less than 0.01. Compared to the regressive approach, the ANN model gives a slightly better performance.

9. Conclusion and Future Work

In this paper, we have proposed a no-reference video quality prediction model covering the relevant network and application QoS factors for a video streaming scenario. Our proposed model is a cross-layer one, as it takes into account factors from multiple layers of the TCP/IP protocol stack. At the same time, it has the unique characteristic of being modular: depending upon the requirement, the model can be tuned to predict the quality due to the network factors, the application factors, or a combination of both. At each stage, proper subjective tests have been conducted for the purpose of model building and validation. The ANN approach provides slightly better prediction accuracy than the regression based approach.

All the videos used have Full HD resolution and are encoded using the latest generation H.265/HEVC and VP9 codecs. Even though these codecs are capable of compressing videos at much higher resolutions, up to 4K, we did not consider such videos in this paper due to the limited availability of 4K video content. With improved network speeds and wider availability of 4K content, we plan to investigate the effect of higher resolutions in future work. Also, all the video sequences used were of roughly 10-second duration; the effect of longer video sequences has not been investigated in this research and will be considered in the future.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.