#### Abstract

The measurement and evaluation of QoE (Quality of Experience) have become one of the main focuses in telecommunications, where providers aim to deliver services with the quality their users expect. However, factors such as network parameters and video coding can affect video quality and limit the correlation between objective and subjective metrics, which makes it harder to evaluate the real quality of video perceived by users. In this paper, a model based on artificial neural networks, namely BPNNs (Backpropagation Neural Networks) and RNNs (Random Neural Networks), is applied to evaluate the subjective quality metric MOS (Mean Opinion Score) and the objective metrics PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index Metric), VQM (Video Quality Metric), and QIBF (Quality Index Based Frame). The proposed model allows establishing the QoS (Quality of Service) based on the *Diffserv* strategy. The metrics were analyzed through Pearson’s and Spearman’s correlation coefficients, RMSE (Root Mean Square Error), and the outlier ratio. Correlation values greater than 90% were obtained for all the evaluated metrics.

#### 1. Introduction

The assessment of quality in digital video systems is a topic of great interest to telecommunications companies that hope to increase the quality delivered to their users. The QoE (Quality of Experience) is the degree of the user’s satisfaction with any kind of multimedia service. This concept has been defined in different ways by various authors. Liu et al. [1] suggested that QoE involves two aspects: (i) the monitoring of the user’s experience online and (ii) the service control to ensure that QoS (Quality of Service) can satisfy the user’s requirements.

QoE is an extension of QoS, since the former provides information about service delivery from the point of view of the end users. QoE refers to the personal preferences of users and so seeks to assess the subjective perception of the received service [2, 3]. However, this perception is influenced by the network performance in terms of QoS and by the video encoding parameters. Different methodologies proposed in the literature aim at estimating subjective Quality of Experience through the assessment of different video quality metrics, generally using objective methods.

In the implementation of digital television platforms (e.g., IPTV and DVB), some important restrictions can affect the management and proper operation of the network. Some of these are (i) the large amount of bandwidth the user should contract; (ii) the limitation of internal buffers in routers and the STB (Set Top Box), which can generate problems such as packet loss that are critical for video or audio transmission; (iii) the type of video compression format, which should reduce the channel use without affecting the quality; and (iv) other items that should be installed and properly configured, such as the last mile link used and the admission control class used.

Several studies have proposed quality assessment models for video streams and measurement strategies in order to identify the optimal values of each metric that guarantee the experience of the viewer.

According to Winkler [4], there are several projects for QoE assessment: VQEG (Video Quality Experts Group), QoSM (Quality of Service Metrics) from the ATIS IPTV Interoperability Forum, and specific metrics such as packet-oriented, bitstream, hybrid, and image metrics [5–8]; however, these are more complex, and the correlation methods have not yet been applied to the assessment of real-time services. Other works propose a new relationship between KPIs (Key Performance Indicators) and QoE assessment in mobile environments, which can also be applied in other scenarios such as fixed broadband networks, focused particularly on telecommunications providers [9].

One of the main problems on the estimation of the video subjective quality is the lack of proper estimation or correlation models. These must guarantee results with accuracy but in many cases heavily depend on objective metrics [10–12]. Also, the correlation models in QoE metrics are not accurate and reliable.

According to [13], three strategies were developed to perform the estimation of video quality. The first one is to apply a subjective assessment with a selected group of people; the main drawback of these evaluations is their cost and time. The second one is the assessment of objective quality metrics for video, whose principal disadvantage is the low correlation with subjective quality metrics [14]; in addition, such metrics ignore the network and content parameters. The last one is to use machine learning methods to analyse the objective and subjective assessments; the major drawback is the difficulty of configuring and testing them, and some methods may fail if a suitable design and an optimal parameter selection are not performed.

According to Kuipers et al. [15], the minimum threshold of accepted quality is a MOS (Mean Opinion Score) value of 3.5. Subjective tests (such as MOS) are widely used for assessing user satisfaction because of their accuracy; however, applying them is still very complex due to their high consumption of time and money. Therefore, they are impractical for testing tasks in network devices, and a controlled environment is required (in some cases, its implementation is complex).

Also, if we want to develop traffic management techniques in real time, it is necessary to find a relationship between the subjective metrics and the objective metrics that are measurable by the network equipment.

Objective methods are based on algorithms for the assessment of video quality, which makes them less complex; furthermore, they can be performed in controlled simulation environments. The objective metrics are supported by mathematical models that approximate the behavior of the Human Vision System (HVS) and, therefore, try to estimate the true QoE as accurately as possible. However, the perception of each viewer is highly influenced by the quality of the data network, expressed by the QoS parameters [26]. Many proposals for assessing QoE metrics have been created to define the user’s experience, and ITU-T has issued standards for some of them [27, 28].

Due to complex factors such as the HVS, different kinds of solutions such as the implementation of machine learning techniques have been proposed. Some of the most used methods are Artificial Neural Network (ANN), fuzzy logic based on rules, neural-fuzzy networks (e.g., ANFIS), support vector machines (SVM), Gaussian processes, and genetic algorithm, among others. Nonintrusive QoE estimation methods for video are mainly based on application layer and network parameters.

Models such as artificial neural networks have been little studied for the estimation and prediction of video quality. For the evaluation of nonintrusive methods, we decided to employ two machine learning methods: a feedforward BPNN and an RNN (Random Neural Network). We assess the output of each system by comparing the estimated MOS with the expected MOS. In each case, the fit of the correlation was determined through the Pearson and Spearman rank order correlation coefficients, the RMSE (Root Mean Square Error), and the Outlier Ratio. The expected MOS was calculated from the VQM (Video Quality Metric), SSIM (Structural Similarity Index Metric), PSNR (Peak Signal-to-Noise Ratio), and QIBF (Quality Index Based Frame) metrics.

This paper presents the key objective and subjective quality metrics and introduces a nonintrusive QoE assessment methodology based on machine learning techniques, showing its main features and functionality. Afterwards, the developed testbed is explained, and the results obtained and their analysis are presented. Finally, the main conclusions and further work are discussed.

#### 2. Related Works

##### 2.1. Methods for Quality of Experience Assessment

In spite of all the proposals for objective metrics, they are not always close to human perception, because that perception is highly influenced by the performance of the network, defined in terms of QoS parameters. According to [29, 30], the objective metrics are computational models that predict the image quality perceived by any person and can be classified as intrusive and nonintrusive methods, as shown in Figure 1.

The PSQA (Pseudo Subjective Quality Assessment) model is an example of a nonintrusive method (NR). This model uses an RNN (Random Neural Network) [31–33] to learn and recognize the relationship between the characteristics of the video and the network and the quality perceived by users.

Initially, for the training process of the RNN, a database is required which contains different sequences to assess the distortions generated by several QoS and coding parameters. Afterwards, the training of the RNN with any video sequence is evaluated in order to validate the MOS measure.

##### 2.2. QoE Metrics

###### 2.2.1. PSNR and MSE Metrics

The PSNR (Peak Signal-to-Noise Ratio) and MSE (Mean Square Error) metrics are the most frequently applied ones [11]. They assess the quality of the received video sequence and can be mapped onto a subjective scale. PSNR compares, pixel by pixel and frame by frame, the quality of the received image with that of the source image, and it is the best-known FR (Full Reference) metric. If we consider frames with a size of $M \times N$ pixels and 8 bits/sample, the PSNR can be calculated using (1) according to [34, 35] as follows:

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{255^2}{\dfrac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ Y_S(i,j) - Y_D(i,j) \right]^2} \right) \tag{1}$$

where $Y_S(i,j)$ denotes the pixel at position $(i,j)$ of the original frame and $Y_D(i,j)$ refers to the pixel located at position $(i,j)$ of the frame reconstructed on the receiver side. The value 255 is the maximum value that a pixel can take (for 8-bit images). The denominator is known as MSE or Mean Square Error, which is the mean of the squared differences between the grey level values of the pixels of the pictures or sequences $Y_S$ and $Y_D$.
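As an illustration of the PSNR and MSE definitions above, the following minimal Python sketch computes both values for a pair of 8-bit luminance frames. The function names and the tiny 2 × 2 frames are hypothetical and serve only as an example; they are not part of the Evalvid or MSU VQMT tools used in this work.

```python
import math


def mse(reference, received):
    """Mean square error between two equally sized 2-D luminance frames."""
    rows = len(reference)
    cols = len(reference[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            diff = reference[i][j] - received[i][j]
            total += diff * diff
    return total / (rows * cols)


def psnr(reference, received, peak=255.0):
    """PSNR in dB for 8-bit samples; infinite when the frames are identical."""
    error = mse(reference, received)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / error)


# Hypothetical 2x2 frames, for illustration only.
original = [[52, 55], [61, 59]]
degraded = [[54, 55], [60, 57]]
print(mse(original, degraded))   # squared differences 4, 0, 1, 4 -> 2.25
print(round(psnr(original, degraded), 2))
```

For a video sequence, the per-frame values would typically be averaged over all frames.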

Although several studies have used this mapping, PSNR is limited by the image content and is not able to identify artifacts due to packet loss. In addition, the measure does not always correlate with the real user’s perception, because a pixel-by-pixel comparison is carried out without analysing the structural elements of the image (e.g., contours, specific distortions introduced by either encoders or transmitting devices on the network, or spatial and temporal artifacts). Therefore, some metrics have been proposed to extract and analyse the features and artifacts in video sequences, such as the SSIM metric [36].

###### 2.2.2. SSIM Metric

Assuming that the human visual perception is highly adapted for extracting structures from a scene, SSIM (Structural Similarity Index) calculates the mean, variance, and covariance between the transmitted and received frames [37]. To apply SSIM, three components (luminance, contrast, and structural similarities) are measured and combined into one value called SSIM index ranging between −1 and 1, where a negative or 0 value indicates zero correlation with the original image and 1 means that it is the same image [35].

According to Wang et al. [38], SSIM gives a good approximation of the image distortion due to changes in the measurement of structural information. On video sequences, this metric considers a wide range of scene complexity, in terms of movement, spatial details, and color. According to Wang et al. [37], this metric uses a structural distortion measure instead of the error. It relies on the ability of the human vision system to extract structural information from the visual field and ignores the extraction of errors. Therefore, a calculation of the structural distortion should give a better correlation with subjective metrics. In [39, 40], a simple and effective algorithm was proposed to calculate the SSIM index.

Let $x = \{x_i \mid i = 1, 2, \ldots, N\}$ be the original signal and let $y = \{y_i \mid i = 1, 2, \ldots, N\}$ be the distorted signal; the structural similarity index is given by

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{2}$$

where $\mu_x$ is the mean of $x$, $\mu_y$ is the mean of $y$, $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$, $\sigma_{xy}$ is the covariance of $x$ and $y$, and $C_1$ and $C_2$ are constant values. The SSIM value can be defined as

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \, [c(x, y)]^{\beta} \, [s(x, y)]^{\gamma} \tag{3}$$

where $l$, $c$, and $s$ are comparison functions of the luminance, contrast, and structure components, and the parameters $\alpha$, $\beta$, and $\gamma$ are constants. $\mathrm{SSIM}(x, y)$ is the quality assessment index. For more details see [39].
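The SSIM index built from means, variances, and covariance can be sketched in a few lines of Python. This is a simplified, single-window version computed over the whole signal; practical tools usually evaluate the index on local windows and average the results. The stabilising constants are assumed here to follow the commonly used choice $C_1 = (0.01 \cdot 255)^2$ and $C_2 = (0.03 \cdot 255)^2$, which is an assumption and not a value stated in this paper.

```python
def ssim_global(x, y, peak=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally long sample sequences.

    k1 and k2 are assumed default constants; peak is the maximum sample value.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Unbiased variance and covariance estimates.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical pixel rows, for illustration only.
reference = [10, 20, 30, 40]
print(ssim_global(reference, reference))          # identical signals
print(ssim_global(reference, [12, 19, 33, 38]))   # mild distortion
print(ssim_global(reference, [40, 30, 20, 10]))   # inverted structure
```

Identical signals give an index of 1, while a signal with inverted structure yields a negative covariance and therefore a negative index.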

###### 2.2.3. VQM Metric

The NTIA VQM metric (Video Quality Metric) [39] considers two image inputs, the original and processed video, in which the quality levels are verified through the human vision system and some subjective aspects. This metric divides the image into sequences of space-temporal blocks, measuring elements like blurring, general noise, block distortion, color distortion, and mix among them into a single metric. The score closer to 0 is considered the best possible value. According to Wang [41], this metric shows a good correlation with subjective methods and has also been adopted by ANSI as a standard for the assessment of video quality.

Table 1 summarizes and compares the main QoE metrics used in this work.

###### 2.2.4. New Mapping QoE Metrics

Zinner et al. [42] proposed a framework for the assessment of the QoE by using streaming video systems. On the other hand, Botia et al. [16] proposed a new mapping among the PSNR, SSIM, VQM, and MOS metrics, shown in Table 2. In our simulations, we calculated the average of each MOS for all video sequences with the value of each FR metric from Table 2 and then we obtained the expected MOS.

###### 2.2.5. QIBF Metric

Serral-Garcia et al. [43] propose a framework called PBQAF (Profile Based QoE Assessment Framework). This framework defines three states for frames (correct, altered, and lost) through the analysis of the payload. Moreover, a mapping is performed to generate and associate the quality index. This metric is calculated from the payload of the received packets, associated with a particular PLR (Packet Loss Rate). Equation (4) gives the quality function used to generate the mapping function; it depends on the packet loss rate of each frame. The mapping function, given by (5), depends on the ratio of lost frames; therefore, when a high packet loss rate is measured, the quality index is lower and tends to 0. The final quality of the video stream is then obtained in (6) from the set of quality values of the particular frames, taken over the set of all frames in the video stream.

Botia et al. [44] propose a mapping between the QIBF (Quality Index Based Frame) metric and the MOS metric for each frame class; that is, each frame is equivalent to one class. In (7), the QIBF is expressed in terms of the maximum number of frames, the actual video sequence to be evaluated (seq), the network class applied to the transmitted sequence (*BestEffort* or *Diffserv*), the number of frames lost (NFL), determined by the set of MPEG images, and an adjustment factor set to 0.05, obtained from several tests.

The factor 1/3 in (7) is defined through 3 classes of frames used in the tests. As , the value of NFL is divided by 100 in order to define and, therefore, to calculate . The demonstration is shown in Appendix A.

Table 3 shows the proposed mapping between quality index and MOS metric, where and, as observed in Table 3, for an index value , a good value for the MOS metric is obtained, indicating a video sequence with a minimum of artifacts.

#### 3. Simulation Testbed

For our tests, a testbed was built using the simulation software tool *NS-2* and the *Evalvid* framework for assessing a set of video sequences. The results obtained from the FR/RR metrics are evaluated to find the possible correlation with subjective metrics using machine learning techniques.

In the simulation, we used a selection of raw uncompressed video sequences in *YUV* format with 4 : 2 : 0 color sampling, encoded with the ffmpeg and MainConcept software tools to adapt them to different bitrates and GOP (Group of Pictures) lengths. Four video sequences with different levels of movement were initially assessed, encoded in the MPEG-4 format and adapted to be transmitted over a simulated IP network.

Figure 2 illustrates some screenshots of the evaluated videos (News of the Spanish Public *TV*, Mass, Highway, and Winter Tree), converted and encoded at a resolution of 720 × 480 pixels (standard definition) under the NTSC standard, with a frame rate of 30 fps. For each video stream, several parameters were combined: the length of the GOP (10, 15, and 30), the bit rates recommended by the DSL Forum [45] (1.5, 2, 2.5, and 3 Mbps), and the packet loss rate for both networks (*BestEffort* and *Diffserv* using the congestion control algorithm *WRED*), which produced 385 different video sequences for testing [16].

**Figure 2:** Screenshots of the evaluated videos: (a) News TVE, (b) Winter tree, (c) Mass, (d) Highway.

The generated video traces were adapted to be sent to the data network through the encapsulation of each packet with an MTU (Maximum Transmission Unit) of 1024 bits, using the RTP protocol (Real Time Transport Protocol) with the *MP4trace* software tool. Using the simulation tool *NS-2* and the *Evalvid* framework [46], the sender and receiver trace files were created in order to calculate the sent, received, and lost frames and packets, as well as delays and jitter. This facilitates the analysis of each video sequence for both implemented scenarios (*BestEffort* and *Diffserv* data networks).

The Evalvid framework also supports PSNR and MOS metrics and has a modular structure, making it easily adaptable to any simulation environment. MSU VQMT software tool [47] allows getting the Y-PSNR, SSIM, and VQM metrics values through the original reference video and received video with distortion.

The simulation scenario is composed of a video sender (server for video on demand) and 9 cross-traffic sources, which consist of *CBR* and *On-Off* traffic sources. The network is based on a dumbbell topology [48] (see Figure 3). In our test, we send several video packets over a congested network, which allows testing the defined QoS scheme (*Diffserv with WRED*). The MPEG-4 video flow competes with background *On-Off* traffic flows, which have an exponential distribution with an average packet size of 1500 bytes, burst time of 50 ms, idle time of 0.01 ms, and sending rate of 1 Mbps [16, 44]. The access network is represented by a video receiver (simulating a last mile with ADSL2), with a link bandwidth of 10 Mbps, and several receiving nodes (sinks) for cross traffic, with a bandwidth of 10 Mbps each. Transmission distortions were simulated at different PLRs (Packet Loss Rates). The traffic behavior and QoE metrics were tested with several error rates over a link established between the core and edge routers, using a loss model with uniform distribution at rates of 0%, 1%, 5%, and 10% and a delay of 5 ms. Tables 4 and 5 show the main parameters used in the simulation and encoding.

#### 4. Implementation of Nonintrusive Methods to Estimate Video Quality by Objective and Subjective Metrics

For the evaluation of nonintrusive methods, we decided to use two machine learning methods (a feedforward BPNN and a Random Neural Network). These methods have been used in different environments described in the next section. To assess the output of each system, we compared the estimated MOS with the expected MOS, where the latter is computed through the mapping of the PSNR, VQM, SSIM, and QIBF metrics (see Tables 1 and 2).

In each case, the correlation adjustment was calculated, determined by the Pearson correlation coefficient and RMSE to estimate the error. The general methodology is presented in Figure 4 [26, 49].

Initially, the video sequences in RAW format were obtained and encoded in MPEG-4/AVC format. Each sequence was sent through a simulated IP network based on “*BestEffort*” and “*Diffserv*”, where each FR/RR metric is calculated and its value is mapped. In this manner, the average MOS value is obtained by using Tables 2 and 3. The coding values, the kind of QoS network, and the obtained FR/RR metrics allow building the input database for training the two neural networks.

The MATLAB software suite was used for the BPNN, through the Neural Networks Toolbox. For the RNN case, we used the QoE-RNN software tool [50]. From the 385 different sequences with different configured parameters, around 70% were used as training data and the remaining 30% as test and validation data. Finally, once the estimated MOS value is obtained, its correlation with the average MOS value is calculated. The results given by the machine learning methods are presented and discussed below.
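The 70%/30% partition of the 385 sequences can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB workflow used in this work; the function name, the fixed seed, and the use of integer indices to stand for sequences are assumptions made for the example.

```python
import random


def split_dataset(samples, train_tenths=7, seed=42):
    """Shuffle the samples and split them into training and test/validation sets.

    train_tenths=7 keeps 70% for training; the seed is an arbitrary choice
    that only makes the shuffle reproducible.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = (len(shuffled) * train_tenths) // 10
    return shuffled[:n_train], shuffled[n_train:]


# Each index stands for one of the 385 encoded and transmitted sequences.
train_set, test_set = split_dataset(range(385))
print(len(train_set), len(test_set))
```

Shuffling before splitting avoids placing all sequences of one video or one network class in the same subset.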

Table 7 shows a brief state of the art in which several papers assess QoE using neural networks such as PNN (Probabilistic Neural Network), BPNN (Backpropagation Neural Network), RNN (Random Neural Network), and ANFIS (Adaptive Neurofuzzy Inference System). Many of these works use only a few video sequences, do not report resolutions or codec information, or apply low resolutions (for mobile devices). The most used codec formats are H.264 and MPEG-4 AVC. We found that the PSNR, SSIM, and VQM metrics are frequently applied and that few works use 2 or more metrics. In our case, we used 5 different metrics and 4 video sequences with different motion levels in each scene. The results of these works show a good performance of the MOS estimation. However, they are defined by few input parameters and assess only 1 or 2 video sequences in low resolution(s). They do not compare with other kinds of ANNs and, in most cases, the Pearson correlation coefficient was <0.90.

##### 4.1. Case 1. Implementation of a Feedforward Artificial Neural Network with Backpropagation

The ANNs are a paradigm for processing information inspired by the human neural system. Usually, ANNs are composed of a large number of highly interconnected processing elements called neurons, which work together to solve problems [13]. The base is the creation of a neural network also called MLP (Multilayer Perceptron), usually divided into three or more layers.

The first layer contains neurons connected to the input data vector; the second layer is called the hidden layer and includes a set of synapses, a number of weights, and some activation functions defined for exciting or inhibiting each neuron, generating a response. According to Rubino et al. [51], if the number of hidden neurons is low, the network can have large training and generalization errors due to underfitting. Conversely, if there are many neurons in the hidden layer, low training errors may occur, but high generalization errors may appear, causing the undesired effect of overtraining (overfitting) and high variance. The third layer is the output layer, directly connected to the hidden layer, where the data vectors of each estimated output are obtained. Multiple layers of neurons with nonlinear transfer functions (such as the tangent-sigmoid) allow the ANNs to learn the linear and nonlinear relationships between the inputs (PLR, GOP, bitrate, QoS class, QIBF, PSNR, SSIM, and VQM) and the desired output vectors (MOS). The structure of the Backpropagation Neural Network is shown in Figure 5.

Feedforward ANNs are the most commonly used ones to perform the estimation of subjective metrics. Several works propose different models based on ANNs [52–58]. These approaches are generally applied to mobile systems or to low resolutions (QCIF or CIF); furthermore, these works ignore the network parameters or the objective video metrics. In most cases, only one or two metrics are used as input to the network, and the resulting Pearson correlations are almost always below 0.90.

According to Ding et al. [52], the neural network can be used to obtain mapping functions between objective quality assessment and subjective quality assessment indexes. This affirmation allows the understanding of the usefulness of the ANN to analyse the estimated MOS and the proposed model presented in this research.

In this case, the network was trained with several parameters, which become the input variables. As stated, the objective parameters may affect the video quality. After training, an evaluation with a set of test data (96 sequences) is performed in order to validate the network. The idea is to reach the lowest error and to be able to correlate the MOS estimated by the ANN with the average MOS computed from the objective metrics defined by Tables 1 and 2.

For the case of study, we want to build the estimated MOS function, defined by

$$\mathrm{MOS}_{\mathrm{est}} = f(x_1, x_2, \ldots, x_8)$$

where $x_1, \ldots, x_8$ are the input parameters given by bitrate, packet loss rate, GOP length, QoS class (1 for *BestEffort* and 2 for *Diffserv*), SSIM, PSNR, VQM, and QIBF.

Like a human being, the ANN system needs a learning phase and another one for validation and testing to establish when the neural network generalized its learning for any data set. For this process, the network is trained through a training algorithm and the lowest possible error is computed through the cost function MSE (Mean Square Error). In that case, we used an iterative gradient descent algorithm or other learning algorithms to achieve convergence to a target value (target), that is, to calculate the minimum training error of the network.

According to Ries et al. [59], in an MLP with a wide variety of nonlinear continuous activation functions on the hidden layers, a single hidden layer containing an arbitrarily large number of neurons is sufficient to satisfy the property called universal approximation. This allowed defining the neural network with a single hidden layer (with 20 neurons) satisfying the outputs (estimated MOS and desired MOS), calculated from the mapping of objective metrics. One of the major difficulties is finding the right number of hidden neurons, which depends on factors such as the number of input/output neurons, the number of training cases, the amount of noise in the output and input, the architecture used, the learning algorithm, and the kind of activation functions in the hidden layer neurons.

An empirical methodology was followed, starting with 16 neurons in the hidden layer (twice the number of input neurons) and adding one neuron in each training and testing cycle until the objective function was reached with the Levenberg-Marquardt training algorithm. This is a widely used algorithm for solving least squares problems.

The proposed BPNN feedforward architecture is shown in Figure 6. In the input and output layer, the neurons have a linear activation function (*purelin*). In the hidden layer after various tests, the lowest training error was obtained with 20 neurons with function tangent-sigmoid activation (*tansig*).
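The 8-20-1 feedforward structure described above (identity activation at the input and output, tangent-sigmoid in the hidden layer) can be sketched in plain Python. Note that this sketch trains with simple stochastic gradient descent on the squared error, not with the Levenberg-Marquardt algorithm of the MATLAB toolbox used in this work; the class name, learning rate, weight initialisation, and the synthetic training data are all assumptions made for illustration.

```python
import math
import random


class TinyMLP:
    """One hidden tanh layer and one linear output neuron."""

    def __init__(self, n_in=8, n_hidden=20, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.01):
        """One gradient descent step on the squared error of a single sample."""
        hidden, output = self.forward(x)
        err = output - target
        for k, h in enumerate(hidden):
            # Backpropagate through the output weight and the tanh derivative.
            grad_hidden = err * self.w2[k] * (1.0 - h * h)
            self.w2[k] -= lr * err * h
            self.b1[k] -= lr * grad_hidden
            for i, xi in enumerate(x):
                self.w1[k][i] -= lr * grad_hidden * xi
        self.b2 -= lr * err


def mean_squared_error(net, samples):
    return sum((net.predict(x) - t) ** 2 for x, t in samples) / len(samples)


# Synthetic data: 8 normalised inputs and a made-up MOS-like target in [1, 5].
rng = random.Random(1)
samples = []
for _ in range(40):
    x = [rng.random() for _ in range(8)]
    samples.append((x, 1.0 + 4.0 * sum(x) / 8.0))

net = TinyMLP()
mse_before = mean_squared_error(net, samples)
for _ in range(100):
    for x, target in samples:
        net.train_step(x, target)
mse_after = mean_squared_error(net, samples)
print(mse_before, mse_after)
```

In practice the inputs (bitrate, PLR, GOP length, QoS class, and the objective metrics) would be normalised to comparable ranges before training, as the synthetic inputs are here.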

After performing several iterations and resetting the weights in the training stage, the best performance is obtained as shown in Figure 7.

In Figure 8, the validation stage is illustrated. As shown, a good fit between the output data of the network and the desired MOS is obtained. An analysis of the linear regression between them is performed. The relationship is established between the value estimated by the BPNN ($y$ variable) and the desired data ($x$ variable). The representation is given by the following classical linear equation:

$$y = mx + b$$

According to the Pearson correlation coefficient between the MOS estimated by the BPNN and the expected MOS, a linear fit of 96.72% and an RMSE of 0.1977 were obtained. Figure 9 shows the correlation obtained. The results establish that the feedforward neural network generalized well on the validation data and that the relationship is strongly linear.
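A linear regression between desired and estimated MOS values, as used in this validation stage, can be obtained with ordinary least squares. The sketch below is a minimal Python illustration; the function name and the sample values are hypothetical and are not results from this work.

```python
def linear_fit(desired, estimated):
    """Least squares slope and intercept of estimated = slope * desired + intercept."""
    n = len(desired)
    mean_x = sum(desired) / n
    mean_y = sum(estimated) / n
    sxx = sum((x - mean_x) ** 2 for x in desired)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(desired, estimated))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical MOS pairs, for illustration only.
desired_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
estimated_mos = [1.1, 2.1, 3.1, 4.1, 5.1]
slope, intercept = linear_fit(desired_mos, estimated_mos)
print(round(slope, 3), round(intercept, 3))
```

A slope close to 1 and an intercept close to 0 indicate that the network output follows the desired values without systematic bias.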

##### 4.2. Case 2. Implementation of a Random Neural Network (RNN) through PSQA Methodology

This kind of network captures mapping functions that involve several parameters with great accuracy and robustness. According to Casas et al. [60], such networks have been used in multiple engineering fields, including the solution of NP-complete optimization problems, texture generation in images, video and image compression algorithms, and classification problems for the perceived quality of voice and video over IP, which makes them well suited to QoE assessment. Appendix B explains in detail the RNNs and the main parameters used in these networks.

The RNNs are a cross between neural networks and queuing networks. By definition, an RNN is an ANN formed by a set of interconnected neurons. These neurons instantly exchange signals, which travel from one neuron to another and to and from the environment. Each neuron is associated with an integer random variable, its potential, which evolves over time. If the potential of a neuron is positive, the neuron is excited and randomly sends signals to other neurons or to the environment according to a Poisson process with a neuron-specific rate. The signals can be positive (+) or negative (−).

The RNN has a three-tier architecture. Thus, the set of neurons is split into 3 subsets: the input neurons, the hidden neurons, and the output neurons. For a given input, the network generates an output that depends on the set of weights at the current training step.
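To make the three-layer random neural network more concrete, the following Python sketch evaluates a feedforward RNN in steady state. It assumes the standard result for this model: the probability that a neuron is excited equals its total arrival rate of positive signals divided by the sum of its firing rate and its arrival rate of negative signals. The function name, the weight matrices, the output firing rate, and the input values are hypothetical; this is not the QoE-RNN tool used in this work.

```python
def rnn_forward(inputs, w_plus_ih, w_minus_ih, w_plus_ho, w_minus_ho, r_out=1.0):
    """Steady-state output of a 3-layer feedforward random neural network.

    inputs      : positive external arrival rates of the input neurons
    w_plus_ih   : excitatory weights, input i -> hidden h (rows indexed by i)
    w_minus_ih  : inhibitory weights, input i -> hidden h
    w_plus_ho   : excitatory weights, hidden h -> output
    w_minus_ho  : inhibitory weights, hidden h -> output
    r_out       : firing rate of the output neuron (assumed value)
    """
    n_in = len(inputs)
    n_hid = len(w_plus_ho)

    # Firing rate of each input neuron: sum of its outgoing weights.
    q_in = []
    for i in range(n_in):
        rate = sum(w_plus_ih[i]) + sum(w_minus_ih[i])
        q_in.append(min(inputs[i] / rate, 1.0))  # saturate at probability 1

    q_hid = []
    for h in range(n_hid):
        rate = w_plus_ho[h] + w_minus_ho[h]
        excitation = sum(q_in[i] * w_plus_ih[i][h] for i in range(n_in))
        inhibition = sum(q_in[i] * w_minus_ih[i][h] for i in range(n_in))
        q_hid.append(min(excitation / (rate + inhibition), 1.0))

    excitation = sum(q_hid[h] * w_plus_ho[h] for h in range(n_hid))
    inhibition = sum(q_hid[h] * w_minus_ho[h] for h in range(n_hid))
    return min(excitation / (r_out + inhibition), 1.0)


# Hypothetical 2-2-1 network with all weights equal to 0.5.
half = [[0.5, 0.5], [0.5, 0.5]]
output = rnn_forward([1.0, 1.0], half, half, [0.5, 0.5], [0.5, 0.5])
print(output)
```

The output lies between 0 and 1 and would be rescaled to the MOS range; training consists of adjusting the non-negative weights so that this output matches the target value.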

According to Casas et al. [60], it is necessary to apply a methodology to assess the Quality of Experience based on the use of network parameters (probability of packet loss, delay, jitter, etc.) and video parameters (encoding, bitrate, frame rate, GOP length, etc.), which will become input parameters. Based on these criteria, a mapping function between these parameters and subjective quality value defined by the MOS metric can be generated. To perform this task, the author proposes the methodology PSQA (Pseudo Subjective Quality Assessment), which uses the RNNs to learn the mapping between the parameters and perceived quality [51].

This methodology is characterized by its accuracy; it allows generating automatic evaluations in real time which is efficient and can be applied with many kinds of media codec and under different parameters and network conditions. Also, it can be extended for comparison with objective metrics, generating more accurate correlations [61].

The RNN is considered a supervised learning machine, which uses multimedia and network characteristics together with the expected MOS values. If, in the training stage, a relationship between the input parameters (through the objective metrics) and the expected output is found, it is possible to estimate and/or predict the subjective values with a higher level of accuracy. Owing to their features, RNNs are well suited to generating good assessments over a wide variation of all the parameters that affect the quality [51]. Therefore, the model is accurate and fast and has a low computational cost. To develop the proposed study case, the 3-layer feedforward RNN architecture proposed by Mohamed and Rubino [61] is implemented. The QoE-RNN software tool was used to estimate the MOS value through the use of an RNN. This tool has an LGPL license and was developed in the C programming language [50].

The whole process is presented in Figure 10, where video sequences transmitted over the network are assessed by comparison with the target average MOS. Depending on the QoS strategy implemented, the MOS values associated with each objective metric (PSNR, VQM, SSIM, and QIBF) are set and denoted by $\mathrm{MOS}_i$, and from them the average MOS ($\overline{\mathrm{MOS}}$), which is the target value of the RNN, is calculated. $\overline{\mathrm{MOS}}$ is expressed as shown in the following equation [26]:

$$\overline{\mathrm{MOS}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MOS}_i$$

where $\mathrm{MOS}_i$ refers to the MOS for each objective metric and $n$ is the total of samples.

The input parameters are chosen and, with the obtained from the network, the corresponding correlation analysis is performed.

The number of boundary iterations is 2000 and the network topology is defined by 9 input neurons, 10 neurons in the hidden layer, and one output neuron. In different conducted tests, the best fit is achieved with 10 neurons in the hidden layer.

Using MATLAB, an analysis of the linear correlation between the expected MOS and the MOS estimated by the RNN is performed. A good linear fit is found, with a Pearson coefficient of 0.9812 and an RMSE of 0.1412. This correlation is shown in Figure 11.

The general summary of all correlations obtained from each study case is presented in Table 6.

The performance of perceptual quality metrics depends on their correlation with the results of objective metrics. Thus, the accuracy of the estimation of subjective metrics is evaluated with respect to prediction accuracy, monotonicity, and consistency. This guarantees high reliability in the subjective assessment of video quality over a range of video test sequences with different artifacts. The assessment methods are those proposed by the Video Quality Experts Group (VQEG) [62]. Four statistical measurements are applied to evaluate the performance of the video quality metrics: Pearson correlation coefficient (PCC), Spearman's rank order correlation coefficient (SROCC), Root Mean Square Error (RMSE), and Outlier Ratio (OR) [63, 64].
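The four measurements can be sketched in plain Python as follows. This is an illustrative sketch, not the MATLAB analysis used in the study: the function names and sample scores are hypothetical, and the outlier tolerance is left as a per-sample parameter because the threshold applied in the study is not stated here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(predicted, subjective):
    """Root mean square error between predicted and subjective scores."""
    n = len(predicted)
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(predicted, subjective)) / n)


def outlier_ratio(predicted, subjective, tolerance):
    """Fraction of samples whose absolute error exceeds the per-sample tolerance."""
    outliers = sum(
        1 for p, s, t in zip(predicted, subjective, tolerance) if abs(p - s) > t
    )
    return outliers / len(predicted)


# Hypothetical scores (illustration only, not data from the study).
subjective = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.1, 3.8, 4.9]
pcc = pearson(predicted, subjective)
srocc = spearman(predicted, subjective)
error = rmse(predicted, subjective)
ratio = outlier_ratio(predicted, subjective, [0.15] * 5)
```

PCC and RMSE relate to prediction accuracy, SROCC to monotonicity, and OR to consistency, which matches the three attributes named above.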

As shown in Table 6, the correlations with the objective metrics were good, with high Spearman coefficients indicating a strong monotonic trend in all cases. The generalization was very high, and the results were consistent, accurate, and monotonic. In the simulations performed, we observed that the VQM, SSIM, and QIBF metrics are highly correlated and are close to the users' perception (established by the MOS metric).

In all cases, the correlation reached values higher than 90%, and the correlation between the objective and subjective metrics shows an excellent linear behavior. Accordingly, metrics such as VQM, SSIM, and QIBF are closely related to the subjective quality established by MOS. It was also observed that the results of the feedforward BPNN indicate a good generalization of the estimated MOS against every assessed objective metric, although the correlation was slightly higher when the RNN was applied. The RNN with PSQA provides a better correlation with the predicted values for MOS. The strategy generated by the nonintrusive model based on machine learning methods was shown to be accurate and highly flexible. It allows the estimation of subjective MOS values by relating them to the FR (Full Reference) and RR (Reduced Reference) metrics chosen for the research.

#### 5. Conclusion

In this work, we obtained excellent correlation values between the objective and subjective QoE metrics through the use of nonintrusive methods. Accuracy, consistency, and monotonicity were validated through the analysis of Pearson's and Spearman's correlations, the outlier ratio, and the RMSE (Root Mean Square Error). One of the main limitations of the objective and subjective methods is the lack of complete methodologies for analyzing the accuracy of QoE. This problem was therefore addressed by proposing a new methodology that allows finding new correlations between objective and subjective metrics. To improve the analysis, machine learning techniques based on backpropagation neural networks and Random Neural Networks were applied to better approximate human perception. In the different simulations performed, we observed that the VQM, SSIM, and QIBF metrics are highly correlated and are close to the user perception (determined by the MOS metric). Unlike previous works, we developed a general correlation model that uses network and coding parameters applied to several video sequences. In the results from the learning machines, the BPNNs and RNNs generated high correlations with the objective metrics, obtaining PCC values higher than 90% and low error rates. The consistency of the correlations between the metrics, measured through the outliers, yielded low values. We conclude that the application of nonintrusive methods allows us to generate more accurate approximations of human perception. In addition, telecommunications providers can use this methodology to estimate the QoE of users, improve their data network architectures and/or the global settings of their platforms, optimize the QoS, and employ better encoding mechanisms. The development of new models for the assessment of QoE is a leading research topic according to the state of the art.
The topics currently being worked on include
(i) the development of new objective metrics that are easy to apply and can be mapped accurately to the true perception of the viewer,
(ii) the close relationship among all the QoS factors that affect the QoE,
(iii) new transmission strategies over highly congested data networks, which can have a greater impact on streaming video over the Internet and on the user experience of multimedia content on next-generation mobile devices,
(iv) the application of new kinds of video codecs specifically aimed at *HD* (*H.265/MPEG-DASH*, Dynamic Adaptive Streaming over *HTTP*) [65],
(v) the application of different methodologies based on machine learning techniques, especially those related to artificial neural networks, neurofuzzy networks, support vector machines, and genetic algorithms.

#### Appendices

#### A. Proof of QIBF Metric

Let be a QIBF metric ; (7) fulfills the following axioms:

[Maximum] If for all , with 1 being equivalent to , 2 equivalent to , and 3 equivalent to , the index is excellent; .

[Minimum] If for all , with 1 being equivalent to , 2 equivalent to , and 3 equivalent to , the index is bad; .

[Resolution] for all , with 1 being equivalent to , 2 equivalent to , and 3 equivalent to , if ; ; and .

[Symmetry] for all , with 1 being equivalent to , 2 equivalent to , and 3 equivalent to , if .

*Proof. *(P.1) Considering for all and supposing that , then . If as real adjustment, then but this value can be approached at 1 in order to get a better evaluation of MOS. Therefore, as , the MOS score is around 5 as maximum value.

(P.2) If (maximum lost) for all and supposing that , then . If as real adjustment, then but this value can be approached at 0 in order to get a better evaluation of MOS. Therefore, as , the MOS score is around 1 as minimum value.

(P.3) Assuming and , these relations are rewritten asTaking into account and ,Then,If , it is obtained thatTherefore, it is found that in which if is monotonically decreased, the loss is also decreased and is monotonically decreased if the loss is also increased. Thus, is a sufficient condition.

(P.4) If , it is obvious that where the quantity of loss is the same.

#### B. Random Neural Network

The RNNs are a cross between neural networks and queuing networks. An RNN is an ANN made up of a set of interconnected neurons. These neurons exchange signals, which travel instantaneously from one neuron to another, and they also send signals to and receive signals from the environment. Each neuron has an associated potential, which is an integer random variable. The potential of neuron $i$ at time $t$ is denoted by $q_i(t)$. If the potential of neuron $i$ is positive, the neuron is excited and randomly sends signals to other neurons or to the environment according to a Poisson process with rate $r_i$. The signals may be positive (+) or negative (−). The probability that a signal sent from neuron $i$ to neuron $j$ is positive is denoted by $p_{ij}^{+}$, and the probability that the signal is negative is denoted by $p_{ij}^{-}$.

The probability that the signal goes to the environment is denoted by $d_i$. If $N$ is the number of neurons, then for all $i = 1, \ldots, N$ these probabilities satisfy

$$d_i + \sum_{j=1}^{N} \left( p_{ij}^{+} + p_{ij}^{-} \right) = 1. \tag{B.1}$$

When a neuron receives a positive signal from another neuron or from the environment, its potential increases by 1. If a negative signal is received, its potential decreases by 1. When a neuron sends a positive or negative signal, its potential decreases by one unit.

The flow of positive (or negative) signals arriving from the environment at neuron $i$ is a Poisson process with rate $\lambda_i^{+}$ (or $\lambda_i^{-}$). It is thus possible to have $\lambda_i^{+} = 0$ and $\lambda_i^{-} = 0$ for any given neuron $i$. To have an active network, (B.2) is needed:

$$\sum_{i=1}^{N} \lambda_i^{+} > 0. \tag{B.2}$$

If $\rho_i$ is defined as the equilibrium probability that neuron $i$ is in the excited state, then (B.3) is considered:

$$\rho_i = \frac{T_i^{+}}{r_i + T_i^{-}}, \qquad T_i^{+} = \lambda_i^{+} + \sum_{j=1}^{N} \rho_j r_j p_{ji}^{+}, \qquad T_i^{-} = \lambda_i^{-} + \sum_{j=1}^{N} \rho_j r_j p_{ji}^{-}. \tag{B.3}$$

In (B.3), if the stochastic process describing the potentials of the neurons is ergodic, the network is said to be stable and the probabilities $\rho_i$ satisfy this nonlinear system.

In RNNs, the purpose of the learning process is to obtain the values of the rates $r_i$ and the probabilities $p_{ij}^{+}$ and $p_{ij}^{-}$. These determine the weights of the connections between neurons $i$ and $j$, as shown in

$$w_{ij}^{+} = r_i p_{ij}^{+}, \qquad w_{ij}^{-} = r_i p_{ij}^{-}. \tag{B.4}$$

From (B.4), the set of weights on the network topology is initialized with arbitrary positive values and iterations that are performed to modify the weights. For , the set of weights for the step is calculated from the set of weights at step . Let be the network obtained after step defined by weights and ; then, the set of inputs rates (positive external signals) on for will get a network that allows generating an output when the input is ; therefore, .

The RNN has a three-tier architecture. Thus, the set of neurons is split into three subsets: the set of input neuron, the set of hidden neurons, and the set of output neurons. The input neurons receive positive signals from the outside. For each node , . For the output nodes , . The intermediate nodes are not directly connected to the environment, for any hidden neuron , .
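As an illustration of how such a three-layer network is evaluated, the sketch below propagates the equilibrium activities layer by layer using the standard random neural network relation, activity = (excitatory arrivals) / (firing rate + inhibitory arrivals). It is not the QoE-RNN tool. It rests on several assumptions: the names `rnn_forward`, `layer_activity`, and `row_rates` are hypothetical, input neurons are assumed to receive only positive external signals, input and hidden neurons are assumed to send nothing to the environment (so each firing rate equals the sum of the outgoing weights), and the output firing rate `r_out` is a free parameter.

```python
import random


def layer_activity(rho_prev, w_plus, w_minus, rates):
    """Activity of each neuron j in a layer fed only by the previous layer."""
    out = []
    for j, r_j in enumerate(rates):
        t_plus = sum(rho * w_plus[i][j] for i, rho in enumerate(rho_prev))
        t_minus = sum(rho * w_minus[i][j] for i, rho in enumerate(rho_prev))
        out.append(t_plus / (r_j + t_minus))
    return out


def row_rates(w_plus, w_minus):
    """Firing rate of each sending neuron: sum of its outgoing weights (assumes d_i = 0)."""
    return [sum(p) + sum(m) for p, m in zip(w_plus, w_minus)]


def rnn_forward(inputs, w1_plus, w1_minus, w2_plus, w2_minus, r_out=1.0):
    """Output activity of a feedforward input-hidden-output random neural network."""
    r_in = row_rates(w1_plus, w1_minus)
    r_hid = row_rates(w2_plus, w2_minus)
    # Input neurons: positive external rate only, no negative external signals (assumption).
    rho_in = [lam / r for lam, r in zip(inputs, r_in)]
    rho_hid = layer_activity(rho_in, w1_plus, w1_minus, r_hid)
    rho_out = layer_activity(rho_hid, w2_plus, w2_minus, [r_out])
    return rho_out[0]


def random_weights(rows, cols, rng):
    """Arbitrary positive initial weights."""
    return [[rng.uniform(0.1, 1.0) for _ in range(cols)] for _ in range(rows)]


# 9-10-1 topology with arbitrary positive weights and hypothetical normalized inputs.
rng = random.Random(0)
w1p, w1m = random_weights(9, 10, rng), random_weights(9, 10, rng)
w2p, w2m = random_weights(10, 1, rng), random_weights(10, 1, rng)
features = [0.5] * 9
activity = rnn_forward(features, w1p, w1m, w2p, w2m)
```

The sketch does not check the stability condition (all activities below 1) and does not include the training step that adjusts the weights. Both would be needed before the output could be mapped to a MOS estimate.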

There are several video sequences with different parameters for the case study, where a set of training data and another set of test data are selected. In (B.5), denotes the set of training sequences, where each sequence is defined by and refers to the set of validation sequences. Moreover, is the set of parameters that affects each sequence, where

The value of the parameter in the sequence is defined by where , where is a matrix . Each sequence receives a score . Moreover, for the sequences , a function is obtained and the training process is completed. Otherwise, we can try with more data or change some parameters of the RNN and proceed to build a new function .

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgment

This research was developed as part of the macroproject “System of Experimental Interactive Television” for the Research and Innovation Center (Regional Alliance of Applied ICT-Artica) with code 1115-470-22055 and project no. RC584 funded by Colciencias and MinTIC.