Abstract

Over the past few years, the amount of multimedia data streamed over the Internet has increased sharply. At the same time, the way service quality is interpreted is changing, with more emphasis being placed on the end-users. Thus, there has been a quiet paradigm shift from the traditional Quality of Service (QoS) approach towards a Quality of Experience (QoE) model for evaluating service quality. Much work has been done on evaluating the quality of audio, video, and multimedia services over the Internet, and research is also ongoing on mapping between the two domains of quality metrics, i.e., QoS and QoE. Apart from the work done by individual researchers, the International Telecommunication Union (ITU) has been quite active in the area of quality assessment, as is evident from the large number of ITU standards available for different application types. The sheer variety of techniques employed by ITU and other researchers can make the field complex and difficult to navigate. Although there are survey papers that present the state-of-the-art methodologies for video quality evaluation, none has focused on the ITU perspective. In this work, we fill this void by presenting up-to-date information on the measurement methods currently employed by ITU for a video streaming scenario. We outline each method in sufficient detail and analyze the challenges being faced along with the direction of future research.

1. Introduction

Recent years have seen rapid advances in video services and their applications, such as video telephony, High-Definition (HD) and Ultrahigh-Definition (UHD) television, Internet protocol television (IPTV), and mobile multimedia streaming. Consequently, quality assessment of videos that are streamed and watched online has become an area of active research. According to the report in [13], video streaming over the Internet is becoming increasingly popular and accounts for more than 55% of the overall traffic. Several researchers have worked on the quality assessment of streaming multimedia services [48]. At the same time, organizations like the International Telecommunication Union (ITU) maintain models and standardization efforts for perceived video quality evaluation under a variety of application scenarios. The main objective of this paper is to provide an up-to-date review of this research field from a standard ITU perspective.

Figure 1 shows a typical video streaming scenario over the Internet. Broadly, three distinct regions are identified: the production network (head-end), the distribution network (carrier), and the consumer network (tail end). Relevant contents are created, edited, encoded, and stored in suitable multimedia databases, ready to be transported to the end-users (consumer network) over the Internet with the help of streaming servers. This multimedia traffic has to pass through the unreliable Internet (distribution network), where it is fragmented into IP packets and ultimately delivered to the consumer end, where it is displayed on a variety of devices like televisions, computers, or mobile phones. The inherently unreliable service provided by the Internet necessitates the use of perceptual quality evaluation schemes for such video traffic.

Based on who owns and controls the underlying network, we divide the multimedia streaming scenario of Figure 1 into two types: Internet protocol television (IPTV) services and over-the-top (OTT) streaming services. YouTube, Netflix, Hulu, etc. are prime examples of OTT services. IPTV runs on a private, fully controlled network and hence has the advantage of tight control and guaranteed (overprovisioned) bandwidth [9]. IPTV typically uses the User Datagram Protocol (UDP) at the transport layer, and hence lost packets are not retransmitted. Still, the reliability of IPTV service is generally high because the video traffic is carried over a fully controlled (usually private) network. In contrast, OTT contents are streamed over the open and unmanaged public Internet. Thus, IPTV services utilize a network that guarantees a Quality of Service (QoS), which differentiates them from OTT services. Quality of Experience (QoE) provisioning for OTT services is far more challenging than for IPTV services. Hence, in this work we focus only on those ITU standards that do not include IPTV services. More specifically, we focus on video streaming over the public Internet only.

The main goal of this article is to summarize the current and other emerging approaches of video quality evaluation of a streaming service within the scope of ITU. Often due to the sheer variety of the different ITU standards, it becomes difficult for a new researcher to select a suitable method. This work aims to bridge the aforementioned gap by carefully analyzing the relevant ITU standards in detail and giving suitable recommendations as to which standard to choose for a specific context.

We begin by presenting the concepts related to QoS and QoE in Section 2 along with the interrelationship between them. Sections 3 and 4 present the review of subjective and objective methods, respectively. In Section 5, we discuss the current challenges in video quality measurement and the future trends. Finally, Section 6 provides the conclusion.

2. QoS and QoE

We begin the survey process by explaining the key concepts of QoS and QoE explicitly highlighting their differences.

2.1. QoS Concepts
2.1.1. QoS Definition

QoS has been defined by ITU-T as “totality of characteristics of a telecommunications service that bear on its ability to satisfy stated and implied needs of the user of the service” [15]. This definition of QoS is extremely generic in nature and needs to be reapplied in a specific application context. Figure 2 shows the concept of end-to-end QoS that is commonly prevalent in almost all scenarios. Terminal equipment refers to the devices that are used either by the service provider or by the consumer in order to provide/avail a particular service. Access network is a combination of the access medium and technology used for a particular service (e.g., wireless, cable, ADSL). Access network generally belongs to a specific service provider. Core network refers to the IP backbone network, which is usually controlled by different stakeholders. The QoS contribution of the core network is governed by the technology used (digital multiplexing, IP, etc.) and transmission media (air, cable, optical, etc.) along with other factors. While specifying the end-to-end QoS, it is necessary to state the specified operating conditions in which a service is supported over a connection (connectionless or connection-oriented) scheme. QoS is also affected by factors like traffic and routing [16]. Each of the elements presented in Figure 2 affects the QoS in its own way. In addition, it is evident that QoS comprises both network performance (NP) and non-network-related factors. Bit-error rate, latency, and jitter are some of the NP related factors, while tariff levels, service-repair time, etc. are the non-network parameters. Four different angles from which QoS can be viewed are discussed next.

2.1.2. Viewpoints of QoS and Their Interrelationship

We can classify the different perspectives of QoS into four different types as shown in Figure 3.
(i) Customer’s QoS requirement refers to the quality level of any application that is expected by the end-users and expressed in nontechnical terms. The customer is not bothered about how a service is offered or about the internals of the network/application design; rather the focus is on the overall end-to-end quality.
(ii) QoS offered by the provider refers to the level of service quality that the provider is expected to provide to the customers. The level of quality is expressed by values assigned to QoS parameters. Primarily this is used for planning purpose and framing of Service Level Agreements (SLA) between the provider and the customer.
(iii) QoS achieved by the provider refers to the quality level of the service that the provider actually delivers to the customer, which ideally should be the same as the QoS offered by the provider. In reality, the values are different and the performance is compared across the two groups over a certain period.
(iv) QoS perceived by the customer refers to the satisfaction level that the customer “believes” to have experienced. This is usually assessed from data gathered through customer surveys or individual assessment by a customer for the service.

The four viewpoints are interconnected as shown in Figure 3. Logically, the process starts at the customer’s QoS requirement stage. These requirements act as input suggestions to the service provider who plans to offer the desired level of quality. Most of the time, the planned level of service quality is not met due to several factors. As discussed before, these factors are primarily NP related ones like packet loss, jitter, latency, and throughput. A tradeoff between the cost incurred to deliver the ideal quality and the viability of the overall business model has to be done, which affects the service quality in general. The service is ultimately delivered to the customers who perceive the real quality that is achieved by the provider.

From the above discussion, it is clear that the customer viewpoint is the most important one for any service to be successful. This is exactly the reason why ITU has a separate recommendation in [11] that defines a model for multimedia QoS categories from an end-user viewpoint. Next, a brief overview of this recommendation is provided.

2.1.3. QoS Requirements of Different Application Types

Different types of applications are identified like voice, video, and web browsing, with each having different performance requirements for achieving a good perceived quality. Figure 4 shows a classification based upon the overall requirements of the applications in terms of two important QoS parameters, namely, packet loss and one-way delay.

The applications have been classified into eight distinct groups. Some applications such as conversational voice and video are sensitive to delay, but can tolerate a certain extent of packet loss. On the other hand, applications like Fax are sensitive to packet loss, but can withstand delay to a certain extent. Other interactive applications like online gaming are extremely sensitive to both packet loss and delay. These facts are presented in a more clear fashion in Figure 5. The figure shows four distinct delay types depending upon the extent of user interaction involved.

The recommended ranges of QoS values for some important applications are provided in Table 1 [11]. The target values of certain applications like audio streaming, videophone, and video streaming are outdated as of 2018. For example, in the case of video streaming, typical data rates can easily reach tens of Mbps instead of 384 kbps, owing to increases in both network throughput and video resolution [17]. Similarly, with the advent of modern techniques like dynamic adaptive streaming over HTTP (DASH), the upper and lower bounds of the other QoS parameters like jitter, one-way delay, and packet loss also need to be updated.

2.2. QoE Concepts
2.2.1. QoE Definition

QoE is defined as the degree of delight or annoyance of the user of an application or service [18, 19]. The concept of QoE is closely related to the human auditory and visual system (HAS and HVS, respectively) and the overall satisfaction that the end-user has in using such a service. Thus, QoE also refers to a complete end-to-end experience that has been shown previously in Figure 2. It is obvious that for any service to succeed, it must provide a good experience to the end-users. A lot of work is being done by ITU towards the quality assessment of various application types. For this article, however, we concentrate only on the video streaming applications. Next, a general overview of the different QoE assessment methodologies being employed by ITU has been provided.

2.2.2. QoE Assessment Methodologies

Confining the scope of this work to video streaming only, Figure 6 shows an overview of the different QoE assessment methodologies being currently used by ITU. Irrespective of the methodology used, the QoE assessment technique must be valid and reliable. The concept of validity versus reliability has been shown in Figure 7. Validity describes how well a method measures what it is intended to measure, while reliability refers to the accuracy of a method in terms of scattering of results (for example, when a test assessment is repeated) [20].

The end-user experience can be measured using two broad techniques: subjective and objective tests [12]. Subjective tests, which involve human subjects, are considered the most accurate means of quality estimation. Objective tests, on the other hand, use mathematical formulae or algorithms to predict the quality. Although objective methods are less accurate than subjective ones, they are preferred in many situations because they are automatic, i.e., easier and faster to carry out, and much cheaper than subjective tests.

One way to categorize the objective methods is by a general approach, which lists down the different application scenarios in which a particular objective model can be used. There are three specific use-cases as mentioned below:
(i) Monitoring: in which a particular objective model is used for live quality assessment of a video application. This is a real time usage scenario that assesses the video quality, e.g., ITU-T P.1202.
(ii) Planning: in which an objective model can be used for network planning before an actual service startup. Mainly these models are used as network planning tools in which they help in selecting IP-network transmission settings such as the video format, video codec, and video bitrates with the assumption that the underlying network is subjected to packet loss, e.g., ITU-T G.1071.
(iii) Lab testing: in which quality assessment is done in a typical laboratory setup. This type of approach is used when commercially it is not feasible to assess the quality or in certain situations that require the presence of the original source signal for the purpose of quality measurement or during the development and testing of particular equipment, e.g., ITU-T J.341.

In the second approach, the objective methods are classified based upon the type of measurement used as follows:
(i) Media layer model: this uses the actual audio/video signals as its input. It also takes into account codec compression and the channel characteristics. These types of models can further be subdivided into three different types depending upon the extent of the original reference signal that they have for quality assessment:
(1) Full reference (FR) methods, in which a reference video is compared frame-by-frame with a distorted video sequence in order to obtain the quality. The comparison can be from many aspects like color processing, spatial and temporal features, contrast features, etc. These methods are generally used in lab-testing environments, e.g., ITU-T J.247.
(2) Reduced reference (RR) methods, in which certain characteristics/features of the reference signal are extracted and used for the quality evaluation of the distorted signal. Hence, instead of the entire reference signal, only subsets of its features are used for quality assessment, e.g., ITU-T J.246.
(3) No reference (NR) methods, which do not require the reference video to be present while assessing the quality of the distorted video sequences. These methods are generally used for real time quality assessment of videos, e.g., ITU-T P.1201. Both the RR and NR methods can be applied to either the mid-points or the end-points of the network.
Figures 8(a), 8(b), and 8(c) show the conceptual view of the FR, RR, and NR type media layer models just discussed.
(ii) Packet layer model: this utilizes only the packet header information for the purpose of QoE prediction. These models do not have the ability to check the payload information. Therefore, they are not suitable for situations that require the presence of media contents. Generally, such model types are used as network-probes at the mid-points or end-points of the network. Figure 9 shows the conceptual view of a packet layer model, e.g., ITU-T P.1201.
(iii) Bitstream layer model: this type takes into account not only the encoded bitstream information, but also the packet header information while assessing the video quality. They are actually a combination of the media layer and packet layer models. Figure 10 shows the conceptual view of a bitstream layer model. These are ideal for live quality monitoring purpose, e.g., ITU-T P.1202.

A third approach to QoE assessment known as the hybrid method uses a combination of the subjective and objective techniques [21, 22]. In this method, typically at the beginning a subjective test is carried out to gather the opinion from the people regarding the quality of the test video sequences under consideration. These test-videos are impaired by one or more QoS factors (NP or non-NP related) depending upon the experimental scenario and requirements. Thereafter, mathematical techniques like linear or nonlinear regression, different types of neural networks, or other machine learning algorithms are used for creating a quality prediction model based upon the subjective scores. This approach tries to take into account the advantages of both the subjective and objective techniques [23], e.g., ITU-T G.1070.
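The model-building step of the hybrid approach can be illustrated with a minimal sketch. It assumes, purely for illustration, that a single QoS factor (packet loss percentage) was varied in the subjective test and that MOS follows the form MOS = 1 + a·exp(−b·loss). Neither the functional form nor the function names `fit_exponential_mos` and `predict_mos` come from any ITU recommendation; they are placeholders showing how subjective scores can be turned into a predictive model by regression.

```python
import numpy as np

def fit_exponential_mos(loss_pct, mos):
    """Fit MOS ~ 1 + a * exp(-b * loss) to subjective scores.

    The model is linearised as ln(MOS - 1) = ln(a) - b * loss and solved
    with ordinary least squares. All MOS values must be greater than 1.
    The functional form is an assumption of this sketch.
    """
    loss = np.asarray(loss_pct, dtype=float)
    y = np.log(np.asarray(mos, dtype=float) - 1.0)
    slope, intercept = np.polyfit(loss, y, 1)
    return float(np.exp(intercept)), float(-slope)

def predict_mos(loss_pct, a, b):
    """Predict MOS for new packet loss values using the fitted parameters."""
    return 1.0 + a * np.exp(-b * np.asarray(loss_pct, dtype=float))
```

In practice the inputs would be the per-condition MOS values collected in the subjective test; a neural network or another nonlinear regressor could replace the least-squares fit without changing the overall workflow.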

2.3. The Relationship between QoS and QoE

Having explained QoS and QoE from an ITU perspective, we now present the interdependence between them. A possible relationship between the two is shown in Figure 11. The QoS-QoE relationship is separated into three distinct zones. Zone I (marked in green) shows the ideal region where the perceived video QoE should be; here the users experience an excellent viewing quality. A certain QoS level needs to be maintained (corresponding to point “a” on the graph) in order to achieve this QoE. Point “a” represents the ideal threshold QoS level (in terms of packet loss, jitter, network throughput, or other factors) that should theoretically be maintained by all the concerned stakeholders. Zone II shows a diminishing QoE region where further deterioration in the QoS values results in a sharp drop in QoE. Point “b” on the graph represents the actual threshold value below which the user will probably stop using the service. There is no exact relationship that models this region of diminishing QoE [24, 25]. However, a number of ITU recommendations like ITU-T G.1070, ITU-T G.1071, and ITU-T P.1201 attempt to model this scenario. Zone III (marked in red) shows the region where the QoE is extremely poor and should be avoided under all circumstances.

The taxonomy of all the ITU recommendations related to video streaming that have been covered in this survey is shown in Figure 12.

3. Subjective Methodologies

In this section, we present the relevant subjective methods that are used for video streaming applications.

3.1. ITU-T Recommendation P.910

Noninteractive subjective assessment methods for evaluating the one-way overall video quality of multimedia applications such as videoconferencing, storage, and retrieval applications have been covered in ITU Recommendation P.910 [26]. The number of subjects in the tests varies from 4 to 40.

3.1.1. Overall Experiment Design

The test is usually carried out in a recording environment that has sufficient lighting. The lighting conditions should be representative of a typical office scenario rather than studio lighting. Specifically, the ambient lighting of the room should be between 100 lux and 10,000 lux.

The reference video sequences shown to the human subjects are extremely important. Perceived video quality depends largely on the type of video content [27–30]. Hence, while selecting the reference sequences, spatial information (SI) and temporal information (TI) are two critical factors that must be taken into account. SI indicates the amount of spatial detail in each frame and takes a higher value for more spatially complex scenes. The SI value of every video frame is calculated by filtering the frame with the Sobel filter and then computing the standard deviation over its pixels. The maximum value over all frames of the sequence represents the SI content of the scene. Similarly, TI indicates the amount of temporal change in a video sequence and takes a higher value for sequences with more motion. Equations (1) and (2) show the calculation of the SI and TI values, respectively:

$$SI = \max_{\mathrm{time}} \left\{ \mathrm{std}_{\mathrm{space}} \left[ \mathrm{Sobel}(F_n) \right] \right\}, \tag{1}$$

$$TI = \max_{\mathrm{time}} \left\{ \mathrm{std}_{\mathrm{space}} \left[ F_n - F_{n-1} \right] \right\}, \tag{2}$$

where $F_n$ is the video frame at time $n$, $\mathrm{std}_{\mathrm{space}}$ the standard deviation across all the pixels for each filtered frame (for TI, each frame difference), and $\max_{\mathrm{time}}$ the corresponding maximum value in the considered time interval.
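The SI and TI computation can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes each frame is a 2-D luminance array, applies the 3×3 Sobel kernels with plain NumPy slicing, and simply drops the one-pixel border. The function names are hypothetical.

```python
import numpy as np

def sobel_magnitude(frame):
    """Gradient magnitude of a 2-D luminance frame using 3x3 Sobel kernels.

    Only interior pixels are returned; the one-pixel border is dropped
    (an assumption of this sketch).
    """
    f = np.asarray(frame, dtype=np.float64)
    gx = (f[:-2, 2:] + 2 * f[1:-1, 2:] + f[2:, 2:]) \
       - (f[:-2, :-2] + 2 * f[1:-1, :-2] + f[2:, :-2])
    gy = (f[2:, :-2] + 2 * f[2:, 1:-1] + f[2:, 2:]) \
       - (f[:-2, :-2] + 2 * f[:-2, 1:-1] + f[:-2, 2:])
    return np.hypot(gx, gy)

def spatial_information(frames):
    """SI: maximum over time of the spatial std of the Sobel-filtered frame."""
    return max(float(np.std(sobel_magnitude(f))) for f in frames)

def temporal_information(frames):
    """TI: maximum over time of the spatial std of successive frame differences."""
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    diffs = (cur - prev for prev, cur in zip(frames, frames[1:]))
    # A single-frame sequence has no differences; report 0.0 in that case.
    return max((float(np.std(d)) for d in diffs), default=0.0)
```

Plotting the resulting (SI, TI) pair for each candidate sequence gives a quick check that the selected reference videos span a range of spatial and temporal complexity.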

Figure 13 shows the SI and TI values of some commonly used video sequences [14]. The publicly available video database of VQEG is used most frequently while selecting the reference videos [31]. The relevant video details are given in Table 2. Table 3 summarizes the viewing conditions that must be satisfied. Normally, at least four different types of video sequences should be used in a particular test.

Next, we present a brief overview of the different methods that are used by this recommendation.

3.1.2. Different Test Methods

Four different types of methods are used in this recommendation and they are classified as Absolute Category Rating (ACR), Absolute Category Rating with Hidden Reference (ACR-HR), Degradation Category Rating (DCR), and Pair Comparison (PC) method. Each of these techniques is discussed next.
(i) ACR method: here the distorted test sequences are presented one at a time and the users give opinion scores (typically on a scale of 1 to 5), which are averaged into a Mean Opinion Score (MOS) [32]. Table 4 shows the MOS scale. The timing diagram of the stimulus presentation has been shown in Figure 14(a). The users are shown video sequences, which typically last for 10 seconds followed by a voting time interval of 10 seconds approximately, wherein the subjects need to enter their opinion in the form of MOS scores. The video presentation time can be increased or decreased depending on the test sequences.
(ii) ACR-HR method: it is similar to the ACR method, with the exception that the reference version of each presented distorted test sequence is also shown to the subjects. This is referred to as the hidden reference condition. The subjects give their opinion in the form of MOS scores. However, for final quality assessment a differential quality score (DMOS) is computed for each distorted sequence and its corresponding reference one as per the following equation:

$$\mathrm{DMOS} = \mathrm{MOS}_{\mathrm{dist}} - \mathrm{MOS}_{\mathrm{ref}} + 5, \tag{3}$$

where $\mathrm{MOS}_{\mathrm{dist}}$ represents the MOS of a particular distorted video sequence and $\mathrm{MOS}_{\mathrm{ref}}$ represents the MOS of its corresponding reference sequence. DMOS is nominally measured on the same scale of 1 to 5 as MOS. When the values of $\mathrm{MOS}_{\mathrm{dist}}$ and $\mathrm{MOS}_{\mathrm{ref}}$ are the same, the DMOS value is 5, the top of the nominal scale, indicating no perceptual difference in quality between the distorted and the reference video sequences. If the distorted video sequence is rated better than its corresponding reference one, the DMOS value will be greater than 5, which is valid and indicative of an excellent quality (better than the reference one). The timing diagram is the same as the ACR method.
(iii) DCR method: in this type, the test sequences are presented in pairs. In a pair, the reference sequence is always shown first followed by the distorted sequence. The timing diagram for this type of method has been shown in Figure 14(b). The two sequences should be perfectly synchronized; i.e., both of them must start and stop at the same frame. In this case, the subjects are asked to rate the distorted sequences with respect to the reference on a 5-point scale. Table 5 presents the 5-level opinion scale.
(iv) PC method: in this method, the test sequences are presented in pairs like DCR. However, none of the sequences in the pair is a reference sequence. Instead, all the distorted sequences are combined in all possible combinations and then presented in pairs to the subjects. After each presentation, a judgment is made by the subject on which is the preferred sequence in the pair. The timing diagram has been shown in Figure 14(c).
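The score processing for ACR and ACR-HR, and the pair generation for PC, can be sketched as follows. This is a minimal illustration under stated assumptions: opinion scores are integers on the 1-to-5 scale, DMOS is computed as the distorted MOS minus the reference MOS plus 5, and PC pairs are taken as unordered combinations (a test that presents both orders of every pair would use twice as many). The function names are hypothetical.

```python
from itertools import combinations
from statistics import mean

def mos(scores):
    """Mean Opinion Score: the average of the subjects' ratings (1 to 5)."""
    return mean(scores)

def dmos(distorted_scores, reference_scores):
    """Differential MOS for ACR-HR: MOS(distorted) - MOS(reference) + 5.

    A value of 5 means no perceived difference; values above 5 occur when
    the distorted sequence is rated higher than its hidden reference.
    """
    return mos(distorted_scores) - mos(reference_scores) + 5

def pc_pairs(sequence_ids):
    """All unordered pairs of distorted sequences for a Pair Comparison test."""
    return list(combinations(sequence_ids, 2))
```

For example, four distorted sequences yield six pairs, which shows how quickly the duration of a PC test grows with the number of conditions.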

3.1.3. Comparison of the Test Methods

The most crucial decision is to choose the right technique for a particular application. Normally, the choice is between applications that require or do not require the presence of the reference sequences. The DCR method should be chosen when testing the fidelity of transmission with respect to the reference signal. ACR is easy, fast to implement, and hence commonly used. The basic advantage of ACR-HR over ACR is that the memory effect of the reference sequences can be removed from the subjective scores. PC method should be used when a high discriminatory power is required on the subjective scores.

3.2. ITU-R Recommendation BT.500-13

This recommendation describes methodologies for assessing picture and video quality in any generic application scenario, not only video streaming [33]. Considering the popularity of the methods outlined in this recommendation, we chose to include them as a part of this survey. The subjects can be experts or nonexperts depending upon the objectives of the assessment. At least 15 observers must take part, with no upper limit. Next, the different test methodologies enumerated in this recommendation are presented.

Different Test Methods. Five different types of test procedures are described. They are the Single Stimulus Continuous Quality Evaluation (SSCQE) method, Double Stimulus Continuous Quality Scale (DSCQS) method, Double Stimulus Impairment Scale (DSIS) method, Simultaneous Double Stimulus for Continuous Evaluation (SDSCE) method, and the Stimulus Comparison Adjectival Categorical Judgment (SCACJ) method. The first one is an example of a single stimulus technique, while the remaining four present two video sequences to the subject in each trial.
(i) SSCQE method: this is a single stimulus method that enables a continuous evaluation of the distorted video sequences on a scale that has been shown in Table 6. The items are normalized in a range of 0 to 100. Generally, each video sequence lasts for at least 5 minutes.
(ii) DSIS method: this is a type of cyclic method in the sense that the subject is at first presented with the original sequence and then with the same impaired sequence. Each sequence is generally reproduced either one (variant 1) or two times (variant 2), after which the subject evaluates the distorted video sequence using an opinion scale that has been shown in Table 5. Interpretations for both the DCR and DSIS methods are the same. The timing diagrams for variants 1 and 2 are shown in Figures 15 and 16, respectively. For both the variants, the subjects need to watch the video sequence during the time slots $T_1$ and $T_3$, and voting is permitted only in $T_4$. Time slots $T_1$ and $T_3$ are approximately of 10 seconds duration each, with $T_2$ being around 3-second pause/gap period and $T_4$ lasting for 5–11 seconds. The $T_1$ time slot shows the reference sequence, followed by the distorted sequence in $T_3$.
(iii) DSCQS method: this is also a type of cyclic method in which the subject is asked to view a pair of video sequences consecutively, with both of them being from the same source, but one being the original reference sequence and the other one the distorted version of the same source. The subjects assess the quality of both the sequences on a continuous scale that has been shown in Table 6. In this case, the subjects do not know whether a particular sequence is the reference or the distorted version. The general timing diagram of the stimulus presentation for the DSCQS method is the same as the second variant of the DSIS method (shown in Figure 16). However, the interpretations of the time slots $T_1$, $T_2$, $T_3$, and $T_4$ are different. In time slots $T_1$ and $T_3$ the test sequences are presented (in no particular order and generally changed across different sequences in a pseudorandom fashion), while in slot $T_4$ the voting is done. $T_2$ represents a short gap period between $T_1$ and $T_3$. Recommended values for the four time slots are the same as those in the case of the DSIS method.
(iv) SDSCE method: in this procedure, the subjects are allowed to watch two video sequences simultaneously, where one sequence is the reference and the other one its distorted counterpart. Generally, both the sequences are shown side by side and the subjects know which is the reference sequence and which is its distorted version. This method is generally used for judging the fidelity of the video information. It is recommended when the video sequences are of longer duration (at least 5 minutes) and uses the same scale that has been presented in Table 6.
(v) SCACJ method: this is an example of a stimulus comparison method and is similar to the double stimulus methods discussed above. The only difference is that the reference sequences are not shown in this case and only the distorted sequences are presented to the subjects. The subject has to rate the quality of the second video in comparison to the first one based upon the scale which has been shown in Table 7.

3.3. ITU-T Recommendation P.911

This recommendation presents the different subjective quality assessment methods for multimedia applications [34]. The number of subjects varies from 6 to 40. It uses four different techniques, namely, ACR, DCR, PC, and SSCQE. All these techniques have already been discussed in the previous sections. The only difference is in the stimulus type that is shown to the users. In this case video sequences are shown which have an audio counterpart. Therefore, the subjects evaluate the overall multimedia quality. However, in case of the previous recommendations, the videos normally do not have any audio portion. Next, a brief summary of the subjective methods discussed above and their shortcomings is presented.

3.4. Summary of Subjective Methods

Subjective methods are more accurate in gauging the user opinion when compared to the objective ones. A variety of techniques is available and a proper one should be chosen based upon the time available and application requirement. If time is not a constraint, then any of the methods discussed above can be used. For time critical conditions, generally ACR or ACR-HR method is preferred. Similarly, presence or absence of reference content also affects the choice of a particular technique. Sometimes, the duration of the video sequence that needs to be evaluated also plays a judgmental role in deciding which technique is to be chosen. For longer video sequences, normally SSCQE or SDSCE is used. Requirements related to certain specific quality aspects can also sometimes dictate a specific choice.

Reliability of subjects is one of the crucial factors that affect the quality of the results obtained from these subjective techniques. Human perception is often influenced by factors such as ambient room conditions, the emotional and mental state of the subjects, and their personal profile (age, gender, etc.), all of which can affect the results obtained [35, 36].

It is obvious from the above discussion that a number of different subjective techniques are available. Hence, for a new researcher it becomes rather confusing which method to select out of the numerous alternatives. In Table 8 we try to provide a guideline to the best subjective technique that should be considered depending upon certain requirements like video duration, presence/absence of reference videos, and need for video repetition.

4. Objective Methodologies

In this section, we provide an overview of the objective models that are used for video streaming and listed in Figure 12. For each model, the overall methodology is discussed along with the mathematical relationships and algorithms wherever necessary.

4.1. ITU-T Recommendation G.1070

This recommendation proposes an algorithm that estimates the videophone quality and is specifically useful for the QoS/QoE planners [37]. This multimedia model takes input from the network and application layers of the TCP/IP protocol stack.

4.1.1. Overall Model Framework

The overall framework of the model is shown in Figure 17. Certain video and speech quality parameters are given as inputs to the model and there are three main outputs: $V_q(S_q)$, $S_q(V_q)$, and $MM_q$. $V_q(S_q)$ refers to the video quality influenced by the speech quality, $S_q(V_q)$ refers to the speech quality influenced by the video quality, and $MM_q$ refers to the overall multimedia quality output by the model. In this survey, for every recommendation that produces a multimedia quality as output, we concentrate only on the video quality evaluation part. Therefore, our discussion will focus only on the video quality $V_q$. Packet loss rate and jitter are the factors considered from the network layer, while bitrate, frame rate, codec type, and video format are the application layer factors.

4.1.2. General Model Equations

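The overall video quality $V_q$ predicted by the model is given by

$$V_q = 1 + I_{coding} \exp\left(-\frac{P_{plV}}{D_{PplV}}\right), \tag{4}$$

where $I_{coding}$ represents the basic video quality affected by the coding distortion, $D_{PplV}$ expresses the degree of video quality robustness due to packet loss, and $P_{plV}$ denotes the packet loss percentage. $I_{coding}$ is further expressed as

$$I_{coding} = I_{Ofr} \exp\left(-\frac{\left(\ln Fr_V - \ln O_{fr}\right)^2}{2 D_{FrV}^2}\right), \tag{5}$$

where $Fr_V$ is the video frame rate and $O_{fr}$ is an optimal frame rate that maximizes the video quality at each video bitrate $Br_V$ and is expressed as

$$O_{fr} = v_1 + v_2 Br_V, \quad 1 \le O_{fr} \le 30. \tag{6}$$

$I_{Ofr}$ represents the maximum video quality at each video bitrate and is expressed as

$$I_{Ofr} = v_3 - \frac{v_3}{1 + \left(Br_V / v_4\right)^{v_5}}, \quad 0 \le I_{Ofr} \le 4. \tag{7}$$

$D_{FrV}$ represents the degree of video quality robustness due to frame rate and is expressed as

$$D_{FrV} = v_6 + v_7 Br_V, \quad D_{FrV} > 0. \tag{8}$$

The packet loss robustness factor $D_{PplV}$ introduced in (4) is expressed as

$$D_{PplV} = v_{10} + v_{11} \exp\left(-\frac{Fr_V}{v_8}\right) + v_{12} \exp\left(-\frac{Br_V}{v_9}\right), \quad D_{PplV} > 0. \tag{9}$$

All the coefficients $v_1$ to $v_{12}$ are dependent on the codec type, the video format, and the video display size and need to be found by carrying out suitable subjective tests.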
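As a rough sketch of how this video quality estimate could be evaluated in practice, the following Python function assumes the functional forms of ITU-T G.1070 as we understand them (they are written out in the comments). The twelve coefficient values are illustrative placeholders chosen only to make the example run; they are not taken from the recommendation, and real values must be obtained from subjective tests for the codec, format, and display size of interest.

```python
import math

# Placeholder coefficients v1..v12 (hypothetical, NOT from ITU-T G.1070).
V = {1: 1.4, 2: 0.022, 3: 3.8, 4: 180.0, 5: 1.2, 6: 1.4,
     7: 0.0004, 8: 2.1, 9: 470.0, 10: 2.7, 11: 15.0, 12: 4.2}


def video_quality(bitrate_kbps, frame_rate_fps, packet_loss_pct, v=V):
    """Estimate the video quality on a 1-5 scale from planning parameters."""
    # Optimal frame rate: O_fr = v1 + v2*Br, clamped to [1, 30] fps.
    o_fr = min(max(v[1] + v[2] * bitrate_kbps, 1.0), 30.0)
    # Maximum quality at this bitrate: I_Ofr = v3 - v3/(1 + (Br/v4)^v5), in [0, 4].
    i_ofr = v[3] - v[3] / (1.0 + (bitrate_kbps / v[4]) ** v[5])
    i_ofr = min(max(i_ofr, 0.0), 4.0)
    # Robustness to frame rate: D_FrV = v6 + v7*Br (must be positive).
    d_frv = v[6] + v[7] * bitrate_kbps
    # Basic quality under coding distortion:
    # I_coding = I_Ofr * exp(-(ln Fr - ln O_fr)^2 / (2 * D_FrV^2)).
    log_gap = math.log(frame_rate_fps) - math.log(o_fr)
    i_coding = i_ofr * math.exp(-(log_gap ** 2) / (2.0 * d_frv ** 2))
    # Robustness to packet loss:
    # D_PplV = v10 + v11*exp(-Fr/v8) + v12*exp(-Br/v9) (must be positive).
    d_pplv = (v[10] + v[11] * math.exp(-frame_rate_fps / v[8])
              + v[12] * math.exp(-bitrate_kbps / v[9]))
    # Overall quality: V_q = 1 + I_coding * exp(-Ppl / D_PplV).
    return 1.0 + i_coding * math.exp(-packet_loss_pct / d_pplv)


q_clean = video_quality(bitrate_kbps=512, frame_rate_fps=15, packet_loss_pct=0.0)
q_lossy = video_quality(bitrate_kbps=512, frame_rate_fps=15, packet_loss_pct=2.0)
print(f"no loss: {q_clean:.2f}, 2% loss: {q_lossy:.2f}")
```

Porting the model to a new codec then amounts to replacing the placeholder dictionary with coefficients fitted to subjective scores, while the structure of the computation stays unchanged.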

Equation (4) highlights the fact that ITU-T G.1070 takes into account factors from the network as well as the application layer when evaluating the video quality. Therefore, this method is suitable when the performance of a new codec is to be judged. All the equations from (4) to (9) are generic in nature and show how this technique can be ported to a specific context (like evaluating the performance of a new codec along with the network QoS factors) by evaluating the coefficients $v_1$ to $v_{12}$. ITU has validated this model only for a limited number of codecs (MPEG-2 and MPEG-4) across VGA, QVGA, and QQVGA resolutions [38, 39]. However, following the procedure outlined through (4)–(9), this model has also been extended to more recent codecs like H.265/HEVC and VP9 [40, 41].

4.2. ITU-T Recommendation G.1071

This recommendation provides an opinion model for network planning of video and audio streaming applications [42]. Two application areas are addressed by this objective technique: a high-resolution area including IPTV and a low-resolution area including services like mobile TV. For reasons that we discussed previously, this survey presents only the mobile streaming application, which is an IP based service. This algorithmic model tries to estimate the impact of typical IP layer impairments on the end-user QoE over transport formats such as Real-time Transport Protocol (RTP) over User Datagram Protocol (UDP), Moving Picture Experts Group-2 Transport Stream (MPEG2-TS) over UDP or RTP/UDP, and 3rd Generation Partnership Project Packet-Switched Streaming Service (3GPP-PSS) over RTP. Dynamic Adaptive Streaming over HTTP (DASH), which is currently used by commercial services like YouTube and Netflix, is not taken into account by this model.

4.2.1. Overall Model Framework

The overall model framework is shown in Figure 18. The general way in which this model works is similar to [43], except for the input that it takes. This model takes network planning parameters such as the video bitrate, video codec type, video resolution, and the packet loss rate as input, whereas the one described in [43] uses the IP packet header information to extract relevant parameters for predicting the video quality. Since the primary video quality estimation block is the same for both models, a conversion rule is applied to those planning parameters that are not taken into account by [43] in order to make them compatible. As output, this model provides three parameters:
(i) audiovisual quality on a scale of 1–5,
(ii) video-only MOS on a scale of 1–5 (without audio stream),
(iii) audio-only MOS on a scale of 1–5 (without video stream).

Here we discuss only the video-only MOS. When compared against similar subjective tests, this model attains a Root Mean Square Error (RMSE) value of 0.60 and a Pearson Correlation Coefficient (PCC) value of 0.78 across 1430 different sample video types.
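For readers unfamiliar with these two accuracy figures, the following minimal sketch shows how RMSE and PCC can be computed between model predictions and subjective MOS values. The score lists are hypothetical, and the plain RMSE used here (dividing by the number of samples, without any degrees-of-freedom correction) is an assumption rather than the exact variant used in the ITU validation.

```python
import math


def rmse(predicted, subjective):
    """Root mean square error between predicted and subjective scores."""
    n = len(predicted)
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(predicted, subjective)) / n)


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical objective predictions and subjective MOS for five clips.
predicted_mos = [2.1, 3.4, 4.0, 1.8, 4.6]
subjective_mos = [2.5, 3.0, 4.2, 1.5, 4.4]
print(f"RMSE = {rmse(predicted_mos, subjective_mos):.3f}")
print(f"PCC  = {pearson(predicted_mos, subjective_mos):.3f}")
```

A lower RMSE indicates smaller prediction errors, while a PCC closer to 1 indicates that the model ranks and scales the clips in a way that follows the subjective scores more linearly.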

4.2.2. General Model Algorithm

The overall video quality can be classified into three different types, $V_{MOSC}$, $V_{MOSP}$, and $V_{MOSR}$, where $V_{MOSC}$ is the video MOS in case of no packet loss and no rebuffering (video quality due to compression), $V_{MOSP}$ is the video MOS in case of packet loss but no rebuffering (video quality due to packet loss), and $V_{MOSR}$ is the video MOS in case of no packet loss but only rebuffering (video quality due to rebuffering).

An elaborate methodology to calculate the three different types of video MOS has been provided in [42]. However, in order to highlight the factors that this model takes into consideration and to motivate the readers to port it to codecs that have not been tested by ITU yet, we present a snapshot of the calculation process by introducing three different algorithms. The same procedure can be applied to any other codec for evaluating the video quality. This ITU model is primarily used for planning purposes only. Since it does not take into account any reference video, it is an example of an NR scheme.

$V_{MOSC}$ is calculated as per Algorithm 1. For this algorithm, $V_{CCF}$ represents the video content complexity factor, i.e., the spatiotemporal complexity of the video sequence, and it can vary from an initial default value of 0.5 to a maximum value of 1. It has to be calculated for every sequence used. $V_{NBR}$ represents the normalized video bitrate in kbps and depends upon the video frame rate. The coefficients used in the algorithm are provided by ITU for H.264 and MPEG4 encoded video sequences at QCIF, QVGA, and HVGA resolutions only.

(1) set
(2) set
(3) set
(4) if ()
(5) compute
(6) compute
(7) else
(8) compute
(9) compute
(10) end if

The procedure for calculating $V_{MOSP}$ is given in Algorithm 2. $V_{DP}$ represents the video quality distortion due to packet loss, which can lead to either a slicing or a video freezing scenario. Depending upon the scenario, $V_{DP}$ is calculated appropriately. $V_{AIRF}$ represents the average impairment rate of the video frames, whereas $V_{IR}$ represents the impairment rate of the entire video stream itself. Both of these values lie between 0 and 1, with 0 depicting the best and 1 the worst case. $V_{PLEF}$ represents the video packet loss event frequency, which is incremented by 1 each time a slicing or freezing event occurs. The coefficients used in the algorithm are provided by ITU for the same set of conditions as discussed before.

(1) set
(2) set
(3) denote
(4) if ()
(5) compute
(6) else
(7) compute
(8) end if
(9) compute

Algorithm 3 summarizes the procedure for the calculation of $V_{MOSR}$. $NRE$ represents the number of rebuffering events, $ARL$ represents the average rebuffering length, and $MREEF$ represents the multiple rebuffering events effect factor. The coefficients used in the algorithm are obtained in the same fashion as discussed before for the other coefficients.

(1) set
(2) set
(3) set
(4) denote
(5) if (
(6) set
(7) else
(8) set
(9) end if
(10) compute
(11) compute

4.3. ITU-T Recommendation P.1201/P.1201.1

This recommendation provides a parametric nonintrusive assessment of audiovisual media streaming quality [43]. The model is based upon packet header information and provides algorithms for evaluating the audiovisual quality of IP based video services. The packet header information is fed to the algorithm in the Packet Capture (PCAP) format.

This model has 2 subparts: ITU-T P.1201.1 and ITU-T P.1201.2 [44, 45]. While the first one is intended for low-resolution application areas like mobile TV, the second one targets a high-resolution IPTV service. As output, the algorithm estimates the audio, video, and combined audiovisual quality in terms of the 5-point MOS scale.

Primarily, these models are used for in-service monitoring of perceived transmission quality or for maintenance purposes. As such, they can be deployed either at the end-points of the transmission system, i.e., the service provider's or customer's premises, or in the middle of the network as monitoring points. This model works only for a UDP based streaming service. An alternative version has been proposed in [46] that uses TCP for nonadaptive and progressive download type media streaming. Table 9 summarizes the main input types and scope of this model. For any other specific factor or technology that has not been mentioned in Table 9, the model needs to be retrained and revalidated. An overview of the model inputs and outputs is shown in Figure 19.

The packet header information is obtained dynamically from the transport layer in a PCAP file format (interface I.2). Since this model is used for monitoring the video quality in real time, the transport layer input information (in the form of transport headers) is dynamic by nature. Relevant information from this PCAP file is filtered out by interface I.3 and fed to the core MOS estimation module. Additional information about the media stream and the decoder behavior is taken out of band in a static manner with certain predefined values. This is the function of interface I.1. Interface I.4 provides the rebuffering information, which is extracted and measured at the end-points and provided as an input to the core MOS estimation module.
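To give a feel for the kind of parameter that can be derived from packet headers alone, the sketch below reads the sequence number from raw RTP packets and estimates the packet loss rate as one minus the ratio of received to expected packets. This is only an illustration and not the parameter extraction procedure of the recommendation. It assumes that the RTP packets have already been pulled out of the PCAP capture as byte strings, that they follow the 12-byte fixed RTP header layout, and that they arrive in order without duplicates; the helper names are hypothetical.

```python
import struct


def rtp_sequence_number(packet: bytes) -> int:
    """Return the 16-bit sequence number from a raw RTP packet."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes long")
    first_byte, _marker_and_type, seq = struct.unpack("!BBH", packet[:4])
    if first_byte >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    return seq


def packet_loss_rate(seqs):
    """Estimate loss as 1 - received/expected, with 16-bit wraparound.

    Assumes in-order arrival and no duplicates, so a smaller sequence
    number than the previous one is interpreted as a counter wrap.
    """
    if not seqs:
        return 0.0
    cycles, prev = 0, seqs[0]
    for s in seqs[1:]:
        if s < prev:
            cycles += 1
        prev = s
    expected = cycles * 65536 + seqs[-1] - seqs[0] + 1
    return 1.0 - len(seqs) / expected


def make_packet(seq):
    """Build a dummy RTP packet (version 2, payload type 96) for the demo."""
    return struct.pack("!BBHII", 0x80, 96, seq, 0, 0x1234) + b"payload"


# Sequence number 1 is missing; the counter wraps from 65535 to 0.
packets = [make_packet(s) for s in (65534, 65535, 0, 2, 3)]
seqs = [rtp_sequence_number(p) for p in packets]
loss = packet_loss_rate(seqs)
print(f"estimated packet loss rate: {loss:.3f}")
```

Because only the header is inspected, such an estimate remains possible even when the media payload is encrypted, which is one reason header-based models are attractive for in-service monitoring.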

Three model outputs are provided, referring to the audio-only, video-only, and combined audiovisual quality, all on a MOS scale of 1–5. The overall block diagram of the ITU-T P.1201.1 model is shown in Figure 20.

The parameter extraction modules for audio, video, and audiovisual scenarios are labeled as PEA, PEV, and PER, respectively. The procedure for calculating the overall video quality is the same as that presented previously in Algorithms 1, 2, and 3. For the video-only output, the model attains an RMSE value of 0.535 (based on 1430 samples) and a PCC value of 0.830.

4.4. ITU-T Recommendation P.1202/P.1202.1

This recommendation is similar to ITU-T P.1201 discussed above. However, in order to evaluate the perceived quality, this algorithm takes into account the bitstream information in addition to the packet header information used in the previous case [47]. As before, the model can be subdivided into two parts: ITU-T P.1202.1, which targets low-resolution areas like mobile video streaming, and ITU-T P.1202.2, which targets high-resolution IPTV applications [48, 49]. Since this model parses information from both the IP header and the payload, it is more accurate than the previous algorithm but requires more computational effort. For this model to work, the payload data must also be unencrypted. Another striking difference between this model and ITU-T P.1201 is the number of outputs: P.1202 provides only 1 video MOS as the output, whereas P.1201 provides 3 outputs (audio only, video only, and audiovisual MOS). A summary of the application areas, test factors, and technology used by this model is presented in Table 10. An overview of the model interfaces is shown in Figure 21. Interface I.1 provides the static information about the media stream and the decoder. These parameters have certain predefined values and are obtained from packet information or the player application programming interface (API). Interface I.2 provides the detailed packet layer header and payload data information in the form of a PCAP file. Relevant parameters are extracted from the PCAP file by interface I.3. The model outputs a video-only MOS.

The model description in block diagram form is shown in Figure 22. The H.264 encoded video bitstream, along with other side information (error concealment type, rebuffering, etc.), is taken as input; relevant parameters are extracted and aggregated, and the aggregated parameters are then used to predict the video QoE.

Compression, slicing, freezing, and rebuffering are the four different types of artifacts considered by this model and included in the final video MOS. Each of them is calculated separately, and they are finally aligned to the same level (MOS) by using suitable mapping functions. This model attains an RMSE value of 0.357 (across 982 sequences) and a PCC value of 0.918.

4.5. ITU-T Recommendation J.247

This recommendation provides guidelines on the selection of an appropriate video quality measurement method when a full reference is available [50]. Presently this model has 4 different flavors: Video Quality Experts Group (VQEG) Proponent A (NTT, Japan), VQEG Proponent B (OPTICOM, Germany), VQEG Proponent C (Psytechnics, UK), and VQEG Proponent D (Yonsei University, South Korea). All 4 models have been tested across video sequences having a resolution of VGA, CIF, and QCIF only. All of them take the same inputs and provide the same output in terms of the video MOS (outperforming the commonly used Peak Signal to Noise Ratio (PSNR) model) [51]. Depending upon the operational requirement, these models can predict the quality of videos that have been impaired by codec related factors only, network transmission related factors, or a combination of both. Table 11 lists the factors for which this model has been evaluated.

The performance overview for the 4 different models across the 3 different resolutions is shown in Table 12. The PCC values are obtained by comparing the objective scores across the three different resolutions against the subjective data from 984 end-users. Figure 23 shows the comparison of the model performances (in terms of PCC values only). The outlier ratio ($OR$) is obtained by using the standard error of the mean as per

$$OR = \frac{N_{outliers}}{N}, \tag{10}$$

where $N_{outliers}$ is the number of outliers, $N$ is the total number of video clips, and an outlier is a point for which

$$\left|P_{error}(i)\right| > z \, \frac{\sigma(MOS(i))}{\sqrt{N_{subjs}}}. \tag{11}$$

In (11), $P_{error}(i)$ is the difference between the subjective and the predicted score of the $i$th video clip, $z$ is a constant that depends on the nature of the score distribution (Gaussian, exponential, etc.), $\sigma(MOS(i))$ represents the standard deviation of the individual scores associated with the $i$th video clip, and $N_{subjs}$ is the number of viewers per video clip $i$.
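The outlier criterion described above can be sketched in a few lines of Python: a clip counts as an outlier when the absolute prediction error exceeds a constant times the standard error of the mean of its subjective scores. The data are hypothetical, and the default constant z = 2.0 is an assumption that is commonly associated with approximately Gaussian scores; it should be replaced by whatever value suits the actual score distribution.

```python
import math


def outlier_ratio(predicted, mos, std_dev, n_viewers, z=2.0):
    """Fraction of clips whose prediction error exceeds z standard errors.

    predicted : objective model scores, one per clip
    mos       : subjective mean scores, one per clip
    std_dev   : standard deviation of the individual scores of each clip
    n_viewers : number of viewers who rated each clip
    z         : distribution-dependent constant (2.0 is an assumed default)
    """
    outliers = 0
    for p, m, s, n in zip(predicted, mos, std_dev, n_viewers):
        if abs(m - p) > z * s / math.sqrt(n):
            outliers += 1
    return outliers / len(mos)


# Hypothetical data for four clips, each rated by 24 viewers.
predicted = [3.0, 4.0, 2.0, 4.5]
mos = [3.1, 3.2, 2.1, 4.4]
std_dev = [0.8, 0.8, 0.8, 0.8]
n_viewers = [24, 24, 24, 24]
ratio = outlier_ratio(predicted, mos, std_dev, n_viewers)
print(f"outlier ratio = {ratio:.2f}")
```

In this example only the second clip, with an error of 0.8 against a threshold of roughly 0.33, is counted as an outlier.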

4.6. ITU-T Recommendation J.343

This recommendation specifies objective methods that use bitstream data in addition to the processed video sequences [52]. As this is a bitstream model, it has additional information about the payload data like codec type, bitrate, frame rate, and spatial and temporal shifts, apart from transmission errors like delay and packet loss. Six different application areas are addressed by it through [53–58]. This model can work in FR, RR, and NR modes for both encrypted and unencrypted video payload data. Table 13 shows a summary of the inputs that this model can take across its different variants.

Figures 24–26 show the hybrid NR, RR, and FR models (for both encrypted and nonencrypted video data). While the NR models have access to the bitstream and the PVS data, the RR models have access to the bitstream data and the source video sequences having some reduced set of features, and the FR models have full access to the bitstream data along with the entire source video sequences. For all the versions, the encrypted model does not have access to the video payload data and operates without parsing the packet payload.

Table 14 lists the various parameters for which the models have been tested. The model performance summary is shown in Table 15. PCC and RMSE values have been used for calculating the model performance statistics. For each of the models, relevant subjective tests are carried out, the results of which are fitted using a third-order monotonic polynomial function. In case of the NR models, MOS values are used (obtained from the ACR subjective technique), while for the RR and FR models DMOS values are used (obtained from the ACR-HR subjective technique) for evaluating the model accuracy.

From the above discussion it is clear that a variety of objective techniques are available for a number of different scenarios. For evaluating the video quality, some models take into account the presence of reference video signals (FR and RR methods), while others do not have this requirement. Similarly, each of the models has been tested only for specific codecs at specific resolutions. In order to generalize them for different codecs and other factors like resolution and content complexity, we have provided a snapshot of the relevant methodologies. The video sequences that are used for testing the models also vary in terms of the video duration, content complexity, etc. Some of the ITU models are best suited for network monitoring (ITU-T P.1201 series), some are used for network or QoS/QoE planning (ITU-T G.1070, ITU-T G.1071), and others are used for laboratory testing purposes (ITU-T J.247).

Due to this wide variety of objective ITU techniques, it becomes confusing for a researcher to select an appropriate method depending upon the requirements. Therefore, in order to make the model selection process easier, we list certain factors in Table 16 that can serve as the baseline for selecting the most appropriate model under a specific circumstance. Once a particular model is selected, the necessary changes can be made depending upon the research context. Table 16 highlights the network and application parameters that each of the models takes into account along with their intended purpose. Therefore, based upon the parameters of interest for quality prediction and the application scenario, it will be easy to choose a particular reference model.
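The selection logic sketched in this discussion can be expressed as a small lookup. The following Python example encodes only the purposes and reference modes that are explicitly stated in the text of this section; where the text does not state a purpose or a mode, the field is left empty. The data structure and function names are hypothetical, and the table referenced above remains the authoritative guideline.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    purpose: Optional[str]           # None where no purpose is stated in the text
    reference_modes: FrozenSet[str]  # empty where no mode is stated in the text


MODELS = [
    Model("ITU-T G.1070", "planning", frozenset()),
    Model("ITU-T G.1071", "planning", frozenset({"NR"})),
    Model("ITU-T P.1201", "monitoring", frozenset()),
    Model("ITU-T J.247", "laboratory", frozenset({"FR"})),
    Model("ITU-T J.343", None, frozenset({"FR", "RR", "NR"})),
]


def candidates(purpose: Optional[str] = None,
               reference_mode: Optional[str] = None) -> List[str]:
    """Return the names of the models matching the given requirements."""
    result = []
    for model in MODELS:
        if purpose is not None and model.purpose != purpose:
            continue
        if reference_mode is not None and reference_mode not in model.reference_modes:
            continue
        result.append(model.name)
    return result


print(candidates(purpose="planning"))
print(candidates(reference_mode="FR"))
```

A fuller version would add the network and application parameters accepted by each model, so that a researcher could filter on the inputs that are actually available in a given experiment.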

5. Current Limitations and Challenges

A lot of work is going on within ITU to assess the quality of video streaming services. However, a number of shortcomings exist, especially for the quality evaluation of videos that are streamed to mobile devices. We list here the challenges that are being faced and should be addressed.

The primary dilemma lies in the existence of numerous models, all of which aim to measure the video QoE, and in the varied types of inputs that they take for predicting the quality. Each model takes a different input based upon either network parameters (packet loss, delay, jitter, etc.) or video characteristics (bitrate, frame rate, resolution, content type, etc.) or a combination of both. There can be variations among the network parameters themselves. For example, the packet loss pattern may be random or bursty by nature. Similar situations can also arise in case of delay.
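The difference between random and bursty loss can be made concrete with a small simulation. The sketch below compares independent (Bernoulli) losses with a simple two-state Gilbert model in which every packet is lost while the channel is in the bad state. The Gilbert model and the transition probabilities are assumptions chosen for illustration; none of the ITU models discussed in this survey is claimed to use them.

```python
import random


def bernoulli_losses(n, loss_prob, rng):
    """Independent (random) losses: each packet is lost with loss_prob."""
    return [rng.random() < loss_prob for _ in range(n)]


def gilbert_losses(n, p_good_to_bad, p_bad_to_good, rng):
    """Bursty losses: packets are lost whenever the channel is in the bad state."""
    lost, bad = [], False
    for _ in range(n):
        if bad:
            bad = rng.random() >= p_bad_to_good  # remain bad unless recovering
        else:
            bad = rng.random() < p_good_to_bad
        lost.append(bad)
    return lost


def mean_burst_length(lost):
    """Average number of consecutive lost packets per loss event."""
    bursts, run = [], 0
    for is_lost in lost:
        if is_lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return sum(bursts) / len(bursts) if bursts else 0.0


rng = random.Random(42)
n = 100_000
p, r = 0.01, 0.25                 # assumed good->bad and bad->good probabilities
target_rate = p / (p + r)         # long-run loss rate of the Gilbert model
random_trace = bernoulli_losses(n, target_rate, rng)
bursty_trace = gilbert_losses(n, p, r, rng)

for name, trace in (("random", random_trace), ("bursty", bursty_trace)):
    rate = sum(trace) / n
    print(f"{name}: loss rate = {rate:.3f}, "
          f"mean burst length = {mean_burst_length(trace):.2f}")
```

Both traces have a similar average loss rate, yet the bursty trace concentrates the losses into longer runs, which is why two streams with the same packet loss percentage can lead to quite different perceived quality.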

The assessment methodologies are also different in terms of the subjective, objective, and hybrid methods. To make the situation even more complex, the different QoS factors (network or application level) outlined in this survey are not sufficient for predicting the QoE accurately. QoE is strongly influenced by external factors like the type of device used for viewing, the surrounding environmental conditions, and other factors. For the majority of the models, the video sequences that are selected from the VQEG database are very short in duration (roughly 10 s only), and hence their ability to portray a real-life streaming scenario is questionable. In addition, the effect of using videos shorter or longer than 10 seconds on the subjective quality assessment has not been accounted for [59, 60].

When streaming is done on mobile devices, the characteristics of the device itself should be taken into account because the viewing experience is quite different on small form factor mobile screens and conventional televisions [61, 62]. There are several limitations to the mobile devices in terms of the variety of screen sizes, display resolution, limited battery backup, limited storage, and other connectivity problems [63]. Currently, none of the existing ITU models considers the peculiarities that are unique to a mobile streaming environment. Despite the fact that more than 55% of the overall Internet traffic is generated by some form of multimedia streaming over a mobile device, lack of a model that particularly addresses this scenario leaves a great void and a lot of scope for further research into this aspect [3].

In a mobile video streaming environment, the inherently unreliable nature of wireless networks should also be kept in mind. A detailed analysis of the video QoE over WiFi and other mobile networks like 2G, 3G, and 4G should be carried out. The low speeds that are often associated with mobile networks result in a poor video QoE, which has prompted companies like Google to release a new version of the popular YouTube application, named YouTube Go, that is supposed to work in low speed networks [64]. Thus, ITU should have in place models that simulate these wireless environments in detail and are targeted towards mobile devices, considering the recent trend of watching videos online.

Most of the ITU models use videos having low resolutions of VGA, HVGA, CIF, and QCIF only. Practically, only the J.343 series takes HD resolution into account. This is in sharp contrast to the current trend where 4K is gaining in popularity. In fact, streaming services like YouTube and Netflix have content that can be streamed in 4K. However, ITU does not provide any model that is dedicated to such high-resolution videos. Recent advances in virtual reality (VR) and augmented reality (AR) platforms, coupled with the availability of mobile devices that can support them, have carved out a new way in which videos are being watched by the users. These recent trends and changing viewing habits should be incorporated into future ITU models.

6. Conclusion

Video streaming has become extremely popular these days, which allows the users to watch videos anytime and anywhere. However, for the success of such a service, the quality provided to the end-users must be excellent. There are a number of challenges being faced particularly in a mobile streaming environment. The QoE should be calculated keeping in mind not only the network QoS factors like packet loss, jitter, delay, and throughput and application QoS factors like bitrate, frame rate, and content complexity, but also the nature and characteristics of the mobile devices being used together with the surrounding environment.

In this article, we have presented an in-depth review of the standardized approaches being followed by ITU towards the video quality evaluation. Proper definitions of QoS and QoE have been provided along with the interrelationship between the two. Taxonomy of all the ITU models has been provided based on a general approach and the measurement methodology used. The basic overview and working of all the objective models are provided with suitable diagrams and algorithms/mathematical formulae. Finally, the current drawbacks are discussed along with the scope of future work.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank Dr. Borworn Papasratorn from the School of Information Technology, KMUTT, for sharing his long expertise in telecommunications research and providing the guidelines for writing an effective review paper.