Multimedia streaming on mobile devices has grown considerably in importance. Dynamic adaptive streaming over HTTP (DASH) is an efficient scheme for bitrate adaptation in which video is segmented and stored at different quality levels. For mobile users, limited bandwidth and a varying network environment degrade the quality of experience. We propose an adaptive rate control scheme based on an enhanced Double Deep Q-Learning approach that improves multimedia content delivery by switching the quality level according to network, device, and environment conditions. The proposed algorithm is evaluated against state-of-the-art heuristic and learning-based algorithms using PSNR, SSIM, quality of experience, rebuffering frequency, and quality variations as performance metrics. Results obtained with real network traces show that the proposed algorithm outperforms the other schemes in all considered quality metrics and converges to the optimal solution faster than the other algorithms considered in our work.

1. Introduction

The use of tablets and smartphones is growing with the rapid spread of mobile devices. Advanced mobile capabilities, together with Wi-Fi and Long Term Evolution (LTE) networks, have improved the user experience for mobile clients [1]. The Cisco report [2] states that media traffic will account for 78% of the world's mobile data traffic by 2021 and that 86% of this traffic will be produced by smartphones. With the rising popularity of media technology, people increasingly prefer mobile devices for streaming video. Clients expect stable media streaming services irrespective of the service provider.

Content providers use adaptive bitrate (ABR) algorithms to optimize video quality. These algorithms run on the client side and dynamically choose a bitrate for each video segment, using throughput estimation and playback buffer occupancy. The aim of ABR algorithms is to increase user quality of experience (QoE) by adapting the video bitrate to the underlying network conditions [3]. However, selecting the appropriate bitrate is challenging because of variations in network throughput [4–9] and conflicting QoE requirements such as high bitrate, minimal rebuffering, and smoothness.

The Moving Picture Experts Group (MPEG) introduced the dynamic adaptive streaming over HTTP (DASH) standard to assist deployment of this technology [10, 11]. Commercial platforms such as Adobe Dynamic HTTP Streaming [11], Apple HTTP Live Streaming [12], Microsoft Smooth Streaming [13], and Move Networks use HTTP-based adaptive video streaming because of its popularity.

DASH runs over the existing HTTP/TCP protocol stack for video transmission. In the DASH architecture, video content is encoded at different resolutions and bitrates and stored at the server. The client selects each segment according to bandwidth and resource availability, switching the quality level to keep video streaming smooth. This adaptation is applied independently by each client for a better quality of experience.

Research on DASH adaptation methods is still in progress. The majority of systems use simple heuristic techniques that result in disturbing quality fluctuations [14] and poor utilization of network resources. The main cause of QoE degradation is rebuffering, which temporarily freezes video playout [15]. Current DASH-based techniques are not effective under the dynamic network conditions that mobility and bandwidth variations create for mobile devices. The result is frequent quality switches and video freezing, both of which decrease user QoE. It is better to deploy a rate switching approach for continuous bandwidth variations and to use relatively aggressive rate switching for sudden bandwidth hopping. Recent research concentrates on overcoming these adaptive streaming issues under dynamic network conditions.

Reinforcement learning (RL) is a powerful method because it learns autonomously without feature crafting, and it offers an effective and viable basis for rate adaptation. RL selects the best strategy on the basis of previous experience through trial and error [16]. RL for DASH is based on the media presentation file, the dynamic wireless bandwidth, and the buffer size of the user's player. When the rate adaptation agent is trained well, RL-based rate adaptation techniques show better QoE than existing DASH-based methods [17]. The major problem with this approach is the large state space of RL-based methods: small state spaces are required to ensure fast convergence and to develop online techniques that respond instantly to variations in the environment.

Rate adaptation algorithms provide better video streaming services. However, designing and implementing an ABR algorithm faces the following challenges. (i) Network bandwidth changes over time and varies considerably across network environments. Bitrate selection therefore becomes complex because different network conditions need different values for the input parameters. (ii) The rate adaptation algorithm must balance several QoE objectives, such as enhancing multimedia quality, reducing rebuffering events, and minimizing quality level variations. (iii) The bitrate selected for a segment significantly affects the state of the media player. Selecting a higher bitrate can drain the buffer, so that the next segment is downloaded at a lower bitrate in order to reduce rebuffering events.

RL has been applied in many disciplines, but it has limitations in nonstationary environments and with high-dimensional data [18]. The capability of deep-learning methods to learn complex patterns also leads to classification issues [19]. Recently, RL methods have been combined with deep neural networks in order to mitigate these issues [20]. In this integration, a deep neural network is used to train or approximate the RL functions. The Deep Q-Network (DQN) [21] is a notable example that combines a deep neural network with Q-learning. A DQN agent can learn policies using RL even when the inputs are high dimensional. The deep Convolutional Neural Network (CNN) overcomes divergence and stability issues by using a target network and experience replay.

We propose an adaptive rate control enhanced Double Deep Q-Network (ADQ) method that employs deep reinforcement learning to choose the best quality level based on prior experience. We carried out our simulations using dash.js to compare our framework with existing techniques. Our algorithm maximizes the quality of experience by limiting rebuffering events and quality variations, and it converges quickly to optimal strategies, which results in fewer quality variations and better QoE for the client.

The major contributions of this research work are as follows:
(i) Enhancing video streaming quality using double Deep Q-learning that includes reward function, target network, and replay memory.
(ii) Employing HSDPA real network traces and QoE Waterloo III video dataset to evaluate proposed ADQ and other rate adaptation methods.
(iii) Reducing rebuffering events and quality variations for the smooth playback of the videos.
(iv) Improving stability and convergence time for varying network conditions.

This paper is organized as follows. Related work for adaptive multimedia content delivery schemes is discussed in Section 2. Video quality assessment methods are also given in related work. Section 3 defines the system model and problem formulation. Section 4 covers the proposed methodology and algorithms. The evaluation results are discussed in Section 5. Finally, the conclusion is given in Section 6.

2. Related Work

The recent research focuses on HTTP-based adaptive video streaming to enhance the user quality of experience. The sender-driven rate adaptation is one of the frequently explored strategies for video streaming using HTTP that uses existing CDN infrastructure. The videos are stored in the form of multiple segments at the server in DASH-based systems and the duration of a segment is a few seconds. The segments are encoded at different bitrates and a higher bitrate means a larger segment size and a higher video quality.

An interesting hybrid between standard algorithms and reinforcement learning is introduced by van der Hooft et al. [22]. They use learning to adapt the parameters of the Microsoft Smooth Streaming (MSS) heuristic in order to raise the performance of the system. The algorithm is not QoE aware, but a similar hybrid solution could be a useful way to improve QoE while overcoming the drawbacks of RL.

The Buffer-Based Approach (BBA) [23] uses buffer occupancy estimation to decide the bitrate of the next segment. The algorithm begins with the lowest bitrate and selects a larger bitrate as buffer occupancy rises, which smooths the fluctuations caused by changing network capacity. BBA is not suitable for short videos because it relies on a large buffer, and it usually shows the lowest QoE even under good network conditions. The buffer-based technique BOLA [24] uses a Lyapunov optimization formulation to optimize a specified QoE metric.

Additive increase and multiplicative decrease (AIMD) [25] identifies bandwidth variability from the HTTP throughput of the multimedia stream, using the segment fetch time to estimate throughput. The feedback Linearization Adaptive Streaming Controller (ELASTIC) [26] reduces network traffic fluctuations by applying feedback control theory. The QoE-aware DASH system (QDASH) [27] is a probing method that measures the network bandwidth under varying network conditions and provides reliable values for video quality. The rate-based method [28] employs an ABR controller for the MPEG-DASH standard: it selects the highest available bitrate that is below the throughput predicted as the mean over the last 5 segments.

The FESTIVE [29] algorithm balances fairness, efficiency, and stability across video players. FESTIVE is implemented without randomized scheduling, and consecutive segments are downloaded under the assumption of no wait time. It predicts throughput as the harmonic mean over the last 5 segments and uses this prediction to calculate the efficiency score. The stability score is calculated from the bitrate switches over the last 5 segments. The performance of FESTIVE decreases under ramp-up network conditions.
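The two throughput predictors mentioned above (arithmetic mean for the rate-based method, harmonic mean for FESTIVE) can be compared with a minimal sketch. The function names, the six-level bitrate ladder, and the throughput history are hypothetical values chosen for illustration; they are not taken from the evaluated implementations.

```python
from statistics import harmonic_mean, mean

def predict_throughput(samples_kbps, window=5, method="harmonic"):
    """Predict throughput (kbps) from the last `window` per-segment samples."""
    recent = samples_kbps[-window:]
    if not recent:
        raise ValueError("need at least one throughput sample")
    return harmonic_mean(recent) if method == "harmonic" else mean(recent)

def select_bitrate(ladder_kbps, predicted_kbps):
    """Highest bitrate below the prediction; fall back to the lowest level."""
    feasible = [b for b in ladder_kbps if b < predicted_kbps]
    return max(feasible) if feasible else min(ladder_kbps)

# Hypothetical ladder and history: one slow segment among four fast ones.
ladder = [300, 750, 1200, 1850, 2850, 4300]
history = [3000, 3000, 600, 3000, 3000]

mean_pred = predict_throughput(history, method="mean")      # 2520
harm_pred = predict_throughput(history, method="harmonic")  # about 1666.7

mean_choice = select_bitrate(ladder, mean_pred)  # 1850
harm_choice = select_bitrate(ladder, harm_pred)  # 1200
```

In this example the harmonic mean is pulled down by the single slow segment, so the harmonic predictor selects a lower and more conservative bitrate than the arithmetic mean.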

Bokani et al. [30] applied a Markov Decision Process solved by dynamic programming to obtain an optimal adaptation logic that balances stability against high QoE. The drawback of the approach is its computational complexity. The authors provide several ways to decrease the computational complexity, but these increase rebuffering events and require retaining a large quantity of computed results in memory. The authors in [31, 32] formulate DASH control as an optimization problem rather than a heuristic and address it with RL-based strategies. A Q-learning-based approach learns the best policy using a reward function designed to reduce rebuffering events. Although this algorithm is efficient compared with heuristic-based approaches, the parameters of the model are designed for slow-varying channels, so the algorithm is not suitable for dynamic network conditions. The purpose of our research is to mitigate these flaws by using a learning technique and an optimal policy.

QDASH [27] overcomes sudden quality degradation, which originates from a sudden drop in bandwidth, by choosing an average bitrate. Throughput prediction using the harmonic mean of the last 5 segments is employed instead of a proxy service for estimating bandwidth, without affecting performance.

AppATP [33] presents an energy-efficient scheme for mobile devices that selects when to prefetch frequently required data and defers delay-tolerant data until network conditions improve. In [34], the authors introduced a mathematical model that captures the relationship between time and scale for peer-to-peer multimedia streaming under a flash crowd scenario. They designed a flexible population control scheme for the flash crowd that alleviates the need for costly server deployment.

Yin et al. [35] introduced a Model Predictive Control (MPC) framework that takes decisions using a look-ahead approach. A throughput predictor is used to make an optimal selection of future segments. The method is comparable to dynamic programming and depends on the quality of the throughput predictor: if the predicted values are not accurate, the decisions for the next state will be suboptimal. Heuristics that lead to more conservative throughput predictions are applied to address this issue, but these heuristics do not perform well across different environments. The authors also state that a real-time implementation consumes additional memory to store precomputed tables.

RL-based adaptive video streaming has been proposed in [22, 31, 32, 36]. These schemes store and learn the value function for each state and action in tabular form instead of applying function approximation, and they do not perform well because the tabular form limits the state space. Mao et al. [37] proposed using A3C, an RL algorithm, for DASH bitrate selection to achieve better performance.

The goal of decision-making and Q-learning in segment selection is to enhance the quality of experience (QoE). Various factors affect the QoE, each to a different degree, and optimizing QoE in DASH-based techniques remains an open research problem. The authors in [38–40] observed that quality level variations and the frequency of video freezing are the major factors affecting the QoE. In this research work, we consider the variations in quality level and the frequency of video freezing to enhance the QoE for mobile users.

2.1. Video Quality Assessment

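PSNR (Peak Signal-to-Noise Ratio) is a media quality assessment method. It is computed from the highest bitrate and the difference between the test and reference image [41]. Lee et al. [42] define the PSNR in

$$\mathrm{PSNR} = 20 \log_{10}\left(\frac{\mathit{MAX\_Bitrate}}{\sqrt{(\mathit{EXP\_Thr} - \mathit{CRT\_Thr})^{2}}}\right). \quad (1)$$

Here $\mathit{MAX\_Bitrate}$ represents the bitrate after video encoding, $\mathit{EXP\_Thr}$ is the expected throughput of the multimedia stream over the network, and $\mathit{CRT\_Thr}$ is the current throughput of the stream.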

Structural similarity (SSIM) captures structural information by modeling the human visual system (HVS). It is computed from the original and the observed image [43]. The SSIM index is shown in

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma}. \quad (2)$$

The α, β, and γ parameters in (2) adjust the three comparison functions, which compare the luminance $l$, contrast $c$, and structure $s$ of the two images $x$ and $y$.

The general QoE metric for video streaming used by MPC [35] is defined in (3). Here N denotes the number of segments and R is the set of all possible bitrates. $R_n \in R$ denotes the bitrate of segment n, and $q(R_n)$ represents the quality perceived by a user. The media player can select a video segment for downloading at bitrate $R_n$. When a higher bitrate is chosen, a higher quality is perceived by the user. $q(R_n)$ is quantified using Table 4. The rebuffering time that results from downloading segment n at bitrate $R_n$ is represented by $T_n$.
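As an illustration of how a session-level QoE score of this kind can be evaluated, the sketch below assumes the linear form that is common in the ABR literature: total perceived quality, minus a weighted rebuffering time, minus the sum of quality changes between consecutive segments. The rebuffering weight `mu` and the identity quality mapping are assumptions made for this example; the paper maps bitrate to quality through Table 4.

```python
def qoe(bitrates, rebuffer_s, q=lambda r: r, mu=1.0):
    """Session QoE under an assumed linear model.

    bitrates   -- bitrate chosen for each segment (R_n)
    rebuffer_s -- rebuffering time caused by each segment (T_n), in seconds
    q          -- maps a bitrate to perceived quality (identity is an assumption)
    mu         -- rebuffering penalty weight (an assumption)
    """
    if len(bitrates) != len(rebuffer_s):
        raise ValueError("one rebuffering time per segment is required")
    quality = [q(r) for r in bitrates]
    total_quality = sum(quality)
    rebuffer_penalty = mu * sum(rebuffer_s)
    smoothness_penalty = sum(abs(b - a) for a, b in zip(quality, quality[1:]))
    return total_quality - rebuffer_penalty - smoothness_penalty

# Four segments (bitrates in Mbit/s), one stall of 0.5 s, two quality switches.
score = qoe([1.0, 2.0, 2.0, 1.0], [0.0, 0.0, 0.5, 0.0], mu=4.0)
# 6.0 (quality) - 2.0 (rebuffering) - 2.0 (switches) = 2.0
```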

Rebuffering is a stalling event caused by buffer underrun during the video playback. Rebuffering frequency is determined as the rebuffering counts per minute while playing a video. Switch frequency is the number of bitrate switches during a video playback session.
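Both session metrics follow directly from these definitions. The helper names below are hypothetical, and the example assumes a session of 300 segments of 2 seconds each.

```python
def rebuffering_frequency(rebuffer_events, playback_seconds):
    """Rebuffering events per minute of playback."""
    if playback_seconds <= 0:
        raise ValueError("playback duration must be positive")
    return rebuffer_events / (playback_seconds / 60.0)

def switch_count(bitrates):
    """Number of bitrate changes between consecutive segments."""
    return sum(1 for a, b in zip(bitrates, bitrates[1:]) if a != b)

# 300 segments of 2 s each give 600 s of playback; 5 stalls -> 0.5 per minute.
freq = rebuffering_frequency(5, 300 * 2)
switches = switch_count([300, 300, 750, 750, 300])  # two switches
```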

3. System Model and Problem Formulation

In this section, we discuss rate adaptation for DASH-based algorithms using reinforcement learning. Deep reinforcement learning (DRL) is a method in which an agent and an environment interact through a specified set of actions. The decision-maker and learner is known as the agent, and everything the agent interacts with is called the environment. The state contains the basic information about the environmental conditions. The agent evaluates its actions according to assigned numerical rewards and does not need any prior information about the environment. The aim of the agent is to maximize the cumulative numerical reward by learning the optimal action in a given environment [16]. The mapping from states to actions is called the policy π, and the ultimate aim of the agent is to determine the optimal policy. The agent selects an action $a_t$ at time t after observing state $s_t$. When the action takes place, the environment transitions to state $s_{t+1}$ and the agent receives a reward $r_t$. The goal of learning is to keep updating the decision-making policy so as to increase the expected discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$; here $\gamma \in (0, 1]$ is a discount factor.
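For a finite episode, the discounted reward can be accumulated backwards over the reward sequence. This is a minimal sketch; the function name and the example rewards are illustrative only.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# With gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

A smaller discount factor makes the agent weigh the reward of the current segment more heavily than rewards from later segments.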

Moreover, we will discuss the state, environment, agent, and reward function for DASH-based multimedia streaming.

Table 1 illustrates the notations used in the paper.

3.1. Video Streaming

Video streaming can be viewed as an RL task, in which the agent learns the best action from environment feedback through trial and error. Segments are indexed by m, $q_m$ is the quality of segment m, and we define the segment size as $S_m(q_m)$. The value of $S_m(q_m)$ is known to the client from the MPD file before segment m is downloaded. The mean channel capacity experienced while downloading segment m is represented by $C_m$. The total downloading time [44] is then

$$\tau_m = \frac{S_m(q_m)}{C_m}.$$

Let $\tau_m$ denote the total downloading time of segment m. The playout time and the buffer time for video segment m are denoted by T and $B_m$, respectively. When $\tau_m > B_m$, the playout buffer empties before the next segment has been downloaded completely, which results in a rebuffering event. When $\tau_m < B_m$, segment m is downloaded before its playout time, downloading of the succeeding segment starts immediately, and the spare time is added to the buffer for segment m+1. The rebuffering time for segment m is defined in

$$\phi_m = \max(0, \tau_m - B_m).$$

The buffer for the next segment is calculated using

$$B_{m+1} = T + \max(0, B_m - \tau_m).$$

We have limited the buffer size in our simulations to 60 seconds.
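The buffer dynamics described above can be sketched as a single simulation step. The function and parameter names are hypothetical, the 2-second segment duration matches the evaluation setup, and clamping the buffer at 60 seconds is a simplifying assumption about how the limit is enforced.

```python
def download_segment(size_bits, capacity_bps, buffer_s,
                     segment_s=2.0, max_buffer_s=60.0):
    """Return (download time, rebuffering time, next buffer level) in seconds."""
    download_s = size_bits / capacity_bps
    # The player stalls only if the download outlasts the buffered video.
    rebuffer_s = max(0.0, download_s - buffer_s)
    # The new segment adds its playout time to whatever buffer is left.
    next_buffer_s = segment_s + max(0.0, buffer_s - download_s)
    return download_s, rebuffer_s, min(max_buffer_s, next_buffer_s)

# Slow link: a 4 Mbit segment at 1 Mbit/s with 3 s buffered stalls for 1 s.
slow = download_segment(4e6, 1e6, buffer_s=3.0)    # (4.0, 1.0, 2.0)
# Fast link: a 2 Mbit segment at 4 Mbit/s with 10 s buffered grows the buffer.
fast = download_segment(2e6, 4e6, buffer_s=10.0)   # (0.5, 0.0, 11.5)
```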

3.2. State, Action, and Environment

The state $s$ contains the basic information about the environmental conditions. The agent takes the state as input from the environment and determines the discounted reward as output in order to take an action in the given environment. We therefore need to determine the state that is fed to the agent's network.

We also need to formulate the state transition model for the ADQ-based multimedia content delivery system. The rate adaptation is selected at stage m and the next segment is decided at stage m+1. The state vector for the segment at stage m is denoted by $s_m$; it contains all the information about the network after segment m has been downloaded completely. The state vector consists of three parameters: rebuffering events, quality level variation, and available bandwidth. The quality level of the next segment to deliver is indicated by the action $a_m$ and is measured as the encoding bitrate of the video segment representation. The environment in the DASH system comprises the video player, the video source, and the network bandwidth.

3.3. Reward and Policy

The reward is a subjective score of a video segment, and it is computed when the agent chooses the bitrate of the segment. The RL agent performs an action after receiving the state, and this action is selected on the basis of a policy. The policy is a probability distribution over actions and is denoted by $\pi$. The policy for a given state $s$ and action $a$ is defined as $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$. In RL methods, the reward function is a composite of the video quality, bitrate, rebuffering events, and weighting coefficients, and it drives the policy toward higher QoE for the user. In the reward function [44] for a segment, the first term on the right-hand side accounts for the quality of the video. The two succeeding negative terms are penalty factors that account for the frame sequence and rebuffering events. A further penalty term is applied when the buffer level is less than a defined threshold; it further reduces the occurrence of rebuffering events. Three weighting coefficients set the importance of the penalty terms.
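A reward with this structure can be sketched as follows. The exact functional form is an assumption: a quality term, an absolute quality-change penalty, a linear rebuffering penalty, and a squared penalty that is active only when the buffer falls below the threshold. All weights, the threshold, and the parameter names are illustrative and are not the values used in our experiments.

```python
def reward(quality, prev_quality, rebuffer_s, next_buffer_s,
           w_switch=2.0, w_rebuffer=50.0, w_buffer=0.001,
           buffer_threshold_s=10.0):
    """Per-segment reward: quality minus three weighted penalties (assumed form)."""
    switch_penalty = w_switch * abs(quality - prev_quality)
    rebuffer_penalty = w_rebuffer * rebuffer_s
    # Zero when the buffer is at or above the threshold.
    shortfall = max(0.0, buffer_threshold_s - next_buffer_s)
    low_buffer_penalty = w_buffer * shortfall ** 2
    return quality - switch_penalty - rebuffer_penalty - low_buffer_penalty

# Stable quality, no stall, comfortable buffer: the reward equals the quality.
steady = reward(0.9, 0.9, rebuffer_s=0.0, next_buffer_s=20.0)
# Quality switch, 0.5 s stall, buffer 4 s below the threshold:
# 4 - 1*1 - 2*0.5 - 0.5*4**2 = -6
stalled = reward(4.0, 3.0, rebuffer_s=0.5, next_buffer_s=6.0,
                 w_switch=1.0, w_rebuffer=2.0, w_buffer=0.5)
```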

3.4. Q-Learning

Q-learning is an RL algorithm presented by Watkins [45]. In the Q-table, rows represent the states s and columns represent the actions a. A Q-value Q(s, a) is stored for every state-action pair (s, a) and represents the quality of taking action a in environmental state s. Q-values are updated after an action takes place in a state, which results in a reward R and a new state $s'$. Equation (9) shows the updated Q-value after taking the action:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]. \quad (9)$$

In Equation (9), α is the learning rate, which determines how much the agent learns from newly acquired information. The discount factor γ indicates the value of future rewards.
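The tabular update can be illustrated with a dictionary that stands in for the Q-table. The state labels, action set, and parameter values are hypothetical.

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q(state, action)."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

q = defaultdict(float)      # unseen state-action pairs start at 0
actions = [0, 1, 2]         # e.g. three quality levels
q[("s1", 2)] = 10.0         # pretend the next state already has a good action

# 0 + 0.5 * (1.0 + 0.9 * 10.0 - 0) = 5.0
new_value = q_update(q, "s0", 1, reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
```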

We will discuss the methodology of our proposed adaptive rate control using enhanced Double Deep Q-Learning (ADQ) method in Section 4. The QoE score for the entire episode of viewing video depends upon available bandwidth, resolution, buffer occupancy, and bitrate selection for next segment.

4. Adaptive Rate Control Using Enhanced Double Deep Q-Learning

The presented adaptive rate control enhanced Double DQN (ADQ) algorithm builds on Q-learning and deep learning to achieve the best policies for the DASH protocol. Extensions of DQN have been presented as deep reinforcement learning has grown in popularity. Here, we propose the rate adaptation algorithm and discuss the enhancements of ADQ in comparison with deep Q-learning methods.

4.1. System Architecture

We apply the ADQ algorithm to learn the best policy. The DASH-based client and server communication of our scheme is shown in Figure 1. RL converges to the best solution efficiently and improves the reward after a short training period. The client initiates a connection with the server to select and play the video. After selecting a multimedia file, the client transmits an HTTP GET request to the server. A multimedia file consists of small segments, which are delivered to the client. The Media Presentation Description (MPD) file holds the adaptive streaming information for the mobile client. The server stores the video segments in their different encodings together with the MPD file. The MPD file includes the bitrates, resolutions, timing, and URLs for the video player. The client parses the MPD file and regenerates the URLs of the video levels encoded at different bitrates. The Request Handling Module (RHM) obtains and analyzes the data received from the mobile client, processes the media requests, and delivers the requested segment to the mobile client. The HTTP Manager handles HTTP communication between the server and the mobile client. The ADQ algorithm uses parameters such as rebuffering events, quality level variation, resolution, and available bandwidth to determine the optimal bitrate. On the basis of these parameters, the rate adaptation algorithm at the client side selects a suitable video level. The requested video segment is then delivered to the client, and the process continues until the video has been downloaded completely or the user terminates it.

A DQN is a multilayered neural network that, for a given state s, outputs a vector of action values $Q(s, \cdot\,; \theta)$, where θ are the network parameters. For an n-dimensional state space and an action space A, the neural network is a function from $\mathbb{R}^{n}$ to $\mathbb{R}^{|A|}$. Mnih et al. [46] proposed the use of experience replay and a target network in the DQN algorithm. The target network, with parameters $\theta^{-}$, is the same as the online network except that its parameters are copied from the online network every τ steps, so that $\theta_t^{-} = \theta_t$, and are kept fixed on all other steps. The target used by DQN is given in

$$Y_t^{\mathrm{DQN}} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta_t^{-}). \quad (10)$$

Double DQN [47] is an extension of DQN and is also a deep RL algorithm. DQN employs the same values both to select and to evaluate an action, which usually results in overestimated values. We have modified the Double DQN (DDQN) [48] method, in which selection is separated from evaluation.

DDQN estimates and updates two Q-values for every state-action combination. The Deep Neural Network (DNN) is updated from these observed Q-values after each execution sequence, and the updated Q-values are used in the next execution sequence. In Double Q-learning, each experience is assigned randomly to update one of two value functions, which results in two sets of weights, $\theta$ and $\theta'$. For each update, the first set of weights determines the greedy policy and the second set determines its value. Selection and evaluation can be separated in the Q-learning target, as shown in

$$Y_t^{\mathrm{Q}} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a; \theta_t); \theta_t\big). \quad (11)$$

The error for Double Q-learning is given in

$$Y_t^{\mathrm{DoubleQ}} = R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a; \theta_t); \theta'_t\big). \quad (12)$$

The action in the argmax is still selected using the online weights $\theta_t$. This means that, as in Q-learning, the greedy policy is still estimated according to the first set of weights. The second set of weights $\theta'_t$ is used to evaluate the value of this policy. The roles of $\theta$ and $\theta'$ can be switched to update the second set of weights.
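The difference between the two targets can be seen in a small numerical sketch. Plain lists stand in for the network outputs at the next state; the values are invented for illustration.

```python
def dqn_target(reward, gamma, q_target_next):
    """DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(q_target_next)

def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Double DQN: the online values select, the second set of values evaluates."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]

q_online_next = [1.0, 2.5, 2.0]   # online network picks action 1
q_target_next = [1.2, 1.8, 3.0]   # target network would have picked action 2

y_dqn = dqn_target(1.0, 0.5, q_target_next)                         # 1 + 0.5 * 3.0
y_ddqn = double_dqn_target(1.0, 0.5, q_online_next, q_target_next)  # 1 + 0.5 * 1.8
```

Because the evaluating weights did not choose the action, a single overestimated entry (3.0 here) no longer feeds directly into the target.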

The expected QoE score depends on the player buffer occupancy, network bandwidth, and bitrate selection for the next video segment. The agent using better adaptation scheme can make use of network resources efficiently and select the optimal bitrate to achieve highest QoE for the user. Moreover, the quality variation and rebuffering events can be reduced with the agent’s optimal policy.

The ADQ approach reduces quality variations and rebuffering events and thereby improves performance. We propose improvements to the learning mechanism in the reward function and in the prioritized experience replay. The reward is calculated over the last k steps, where k is the number of times the video quality varied in the last 1 minute. Using k steps for the reward calculation reduces quality variations because it provides a more stable estimate of the target bitrate for the next video segment. The improved target network and experience replay significantly increase the performance of the algorithm. The target is calculated using (13).

The experience replay memory is employed to mitigate the correlation between data samples and the nonstationary distribution of the data. The agent samples a minibatch of tuples randomly from the memory of prior experience. Using distributed prioritized experience replay, each experience is assigned a loss factor so that samples with greater loss are selected. The loss factor [49] is calculated using (14), in which the multistep return is given by (15).

Figure 2 shows the state and action update using the ADQ method. The state of the system consists of available bandwidth, rebuffering events, and quality level variation. Initially, the current state $s$ is input to the neural network, which estimates the Q-value for all the actions in the environment. The action $a$ is chosen on the basis of the ε-greedy policy. When the action $a$ is taken, the state is updated to a new state $s'$ and a new reward is calculated accordingly. This updated information is then stored in the replay memory D. The system then randomly extracts M samples from the replay memory and updates the network weights by using an optimization method. These weights are updated every K steps. The updated weights of the target network minimize the loss function value and hence select the optimal policy.
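Two of the steps above, ε-greedy action selection and loss-based sampling from the replay memory, can be sketched in a few lines. This is a simplification: sampling is proportional to the stored loss and done with replacement, which is an assumption and not the distributed prioritized replay of [49]. The function names are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def sample_prioritized(memory, losses, batch_size, rng=random):
    """Draw transitions with probability proportional to their loss."""
    # A small constant keeps zero-loss transitions reachable.
    weights = [loss + 1e-6 for loss in losses]
    return rng.choices(memory, weights=weights, k=batch_size)

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)  # always 1
batch = sample_prioritized(["a", "b", "c"], [0.0, 0.0, 1e6], 50, rng)
```

With one transition carrying almost all of the loss, nearly every draw in `batch` returns that transition.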

4.2. ADQ Training Algorithm

The ADQ method finds the best policy according to Algorithm 1, in which $r_m$ is the reward for segment m. We use two deep neural networks for training. The first, called the online network, is updated for every new segment, with weights $\theta_t$ at each time step t. It is used for mapping Q-values. The second, the target network, is employed to improve the stability of the system. The target network weights $\theta^{-}$ are updated every k steps by copying the online network weights and remain the same for the following k-1 steps. The target network and the online network use the value of the next state to compute the optimal value. The target value is calculated from the reward R and the discount factor γ. The weights θ are updated by backpropagating the loss function values through the online network. To mitigate stability issues, we use the experience replay memory D with ADQ. A minibatch of transitions is selected from the distributed prioritized replay memory to train the Q-network instead of using only the most recent transitions.

Input: State $s$
Output: Optimal policy to select action $a$
Initialization: Experience Replay Memory D, Online Network Weights $\theta$, Target Network Weights $\theta^{-}$, online action value function Q, target action value function $\hat{Q}$, k=i=0
(1) for video-episode i=1 to E do
(2)   Initialize state sequence $s$ for the selected video episode
(3)   for m=1 to M do
(4)     With probability ε select a random action $a$; otherwise select $a$ according to the greedy policy from Q
(5)     Execute action $a$ and observe reward $r$
(6)     Set the next state $s'$ and preprocess the state
(8)     Store transition $(s, a, r, s')$ in D
(9)     Sample a minibatch of tuples from distributed prioritized replay memory D
(10)    $a^{*} = \arg\max_{a \in A} Q(s', a; \theta)$ // A is the set of all possible actions
(11)    Determine the target value using (13)
(13)    Every K steps reset $\hat{Q} = Q$
(14)    $s = s'$
(15)  end for
(16) end for

The algorithm runs over all video episodes, and for each episode the state sequence is initialized to a default state. The inner loop downloads all segments sequentially. The action is selected from Q according to the ε-greedy policy. The action is executed and the reward for the current action is observed. The next state is derived from the current state and action. The state transition is stored in the distributed prioritized replay memory.
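The control flow of the training loop can be illustrated with a heavily simplified sketch. Several substitutions are made and should be read as assumptions: dictionaries stand in for the online and target networks, the minibatch is sampled uniformly instead of by priority, the update is a tabular step instead of backpropagation, and `toy_env` is a hypothetical environment in which choosing a quality level above the current bandwidth level causes a stall.

```python
import random
from collections import defaultdict, deque

def train(env_step, initial_state, n_actions, episodes=50, steps=30,
          alpha=0.1, gamma=0.5, epsilon=0.1, sync_every=10,
          batch_size=8, seed=0):
    rng = random.Random(seed)
    online = defaultdict(float)     # stand-in for Q(s, a; theta)
    target = defaultdict(float)     # stand-in for the target network
    memory = deque(maxlen=1000)     # replay memory D
    updates = 0
    for _ in range(episodes):
        s = initial_state
        for _ in range(steps):
            # epsilon-greedy action from the online values
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: online[(s, x)])
            r, s_next = env_step(s, a, rng)
            memory.append((s, a, r, s_next))
            # uniform minibatch (prioritization omitted in this sketch)
            k = min(batch_size, len(memory))
            for bs, ba, br, bn in rng.sample(list(memory), k):
                best = max(range(n_actions), key=lambda x: online[(bn, x)])
                y = br + gamma * target[(bn, best)]   # double-Q style target
                online[(bs, ba)] += alpha * (y - online[(bs, ba)])
            updates += 1
            if updates % sync_every == 0:
                target.clear()
                target.update(online)                 # copy online -> target
            s = s_next
    return online

def toy_env(state, action, rng):
    """Hypothetical: state is a bandwidth level; exceeding it stalls playback."""
    reward = float(action) if action <= state else -5.0
    return reward, rng.choice([0, 1, 2])

q = train(toy_env, initial_state=1, n_actions=3)
```

In this toy setting the stalling actions can only accumulate non-positive values and the feasible actions non-negative ones, so the learned table separates them regardless of the random seed.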

4.3. Bandwidth Estimation and ADQ Testing Algorithm

We use measurement-based prediction [50] to estimate the available bandwidth by employing the Exponentially Weighted Moving Average (EWMA). It combines the most recently observed data with the weighted previous data and adjusts the weights dynamically. The EWMA filter estimates the network bandwidth as shown in (16), which combines the bandwidth of time interval t-1 with an estimation difference to give the estimated bandwidth for time interval t. The estimation difference is calculated by adjusting two weights, as shown in (17): one is the weight of the moving average and the other is the weight of the standard deviation.
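For orientation, the sketch below shows the standard single-weight EWMA recursion. It is not the two-weight variant of (16) and (17), whose weights adapt dynamically; the function name and the fixed weight are assumptions made for this example.

```python
def ewma_bandwidth(samples, weight=0.8):
    """Running EWMA estimate after each throughput sample.

    weight -- share of the previous estimate kept at each step (assumed fixed)
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    estimates = []
    estimate = None
    for sample in samples:
        if estimate is None:
            estimate = sample                     # first sample seeds the filter
        else:
            estimate = weight * estimate + (1.0 - weight) * sample
        estimates.append(estimate)
    return estimates

# Throughput drops from 4 to 2 Mbit/s; the estimate follows gradually.
est = ewma_bandwidth([4.0, 2.0, 2.0], weight=0.5)   # [4.0, 3.0, 2.5]
```

A larger weight smooths out short throughput spikes at the cost of reacting more slowly to a sustained bandwidth change.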

The testing process of ADQ is described in Algorithm 2. In the testing phase, the ADQ agent selects the optimal bitrate for the next segment of a video using the policy learned in training. The algorithm runs until all video segments have been downloaded. First, the device resolution is determined and stored in memory, and the bandwidth trace is started to emulate a real network scenario with a varying environment. When the video is selected for playback, the loop processes all segments sequentially. For each segment, the bandwidth is estimated using (16), the desired bitrate is determined according to the current state of the system, and the segment download is initiated according to the device resolution and the desired bitrate. The algorithm counts the rebuffering events and quality variations during the video playback session. The QoE score is calculated using (3). The objective quality assessment metrics PSNR and SSIM are employed to measure video quality.

(1) Get the device resolution
(2) Start the bandwidth trace
(3) While not all video segments have been downloaded
(4)   Find the estimated bandwidth using equation (16)
(5)   Choose the desired bitrate based on the current state
(6)   Request and download the video segment according to the device resolution and desired bitrate
(7)   Determine the rebuffering events and quality variations
(8)   Compute the QoE score for the user using equation (3)
(9)   Calculate the objective quality assessment metric scores for PSNR and SSIM
(10) End while

We compared the proposed ADQ method with FESTIVE, BBA, QDASH, MPC, Rate-based, and A3C in the testing phase. The dominant QoE factors are video quality, stability, and smoothness.

5. Results and Discussion

dash.js is an open-source JavaScript reference player for MPEG-DASH that plays video through the HTML5 video element. We modified dash.js (version 2.9) [51] to evaluate ADQ and the existing ABR algorithms. It is configured to receive the video stream according to the selected bitrate. The buffer size of the media player is set to 60 seconds. The Google Chrome browser (version 71) and an Apache server are used for testing. The server machine has a 3.7 GHz Intel quad-core processor and 16 GB of RAM.

We have used the Waterloo Streaming QoE Database III (SQoE-III) [52] for our implementation and result evaluation. Five video clips are used to verify the ADQ method. Table 2 shows the frame rate (FPS), temporal information (TI), and spatial information (SI) of the chosen videos. Each video is played for 300 segments, which are encoded at six distinct bitrate levels. The duration of each segment is 2 seconds.

The mobile device classification according to screen resolution is shown in Table 3.

FFmpeg [53] is used to produce the different encodings of each original video. This command-line tool is a fast encoder and decoder for converting videos to different sizes, formats, and bitrates.
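A minimal sketch of how such an encoding ladder could be generated is given below. It only builds the FFmpeg command lines and does not run them. The helper name `ffmpeg_ladder_commands`, the output naming scheme, and the ladder entries in the usage note are hypothetical; the actual resolutions and bitrates of the six quality levels are not reproduced here.

```python
def ffmpeg_ladder_commands(source, ladder):
    """Build one FFmpeg command per (width, height, kbps) ladder entry.

    The ladder values are supplied by the caller; nothing here reflects
    the encoding settings actually used in the experiments.
    """
    stem = source.rsplit(".", 1)[0]
    commands = []
    for width, height, kbps in ladder:
        output = f"{stem}_{height}p_{kbps}k.mp4"
        commands.append([
            "ffmpeg", "-i", source,
            "-c:v", "libx264",                 # H.264 video encoder
            "-b:v", f"{kbps}k",                # target video bitrate
            "-vf", f"scale={width}:{height}",  # resize to the target resolution
            "-an",                             # drop audio for this rendition
            output,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run` on a machine where FFmpeg is installed, for example with an assumed ladder such as `[(640, 360, 800), (1280, 720, 2500)]`.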

The video is ranked using the mean opinion score as illustrated in Table 4. The quality scale for subjective testing is modified to relate six quality levels with mean opinion score.

5.1. Training Phase

Network traces from public datasets are used to evaluate the algorithms under real network conditions. The datasets include a 3G/HSDPA mobile dataset [54] and a 4G trace dataset for different mobility patterns [55]. We use 135 throughput traces with an average duration of 10 minutes each. The throughput ranges from 0 to 173 Mbit/s with a granularity of one sample per second. During the training phase, ABR streaming is applied to the RushHour video over the 135 real network traces.

We have compared the training phase of our proposed ADQ method with existing RL-based ABR algorithms. The basic parameters of the ABR algorithms are given in Table 5. The new state, old state, reward, and action are collected in the training phase and are used to update the Q-value network weights regularly.

Deep Q-learning is employed to select an action and to update the Q-value in the online phase. Using the DNN, the DRL agent computes the value Q(s, a) of every action a for the system state s observed at the decision time. Under the ε-greedy policy, the agent selects the action with the highest estimated value of Q(s, a) with probability 1 - ε and a random action otherwise. The total reward observed after the action is taken during the time interval is then used to update the Q-value.
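The action selection and value update described above can be sketched as follows. This is a generic illustration of ε-greedy selection and of the standard double Q-learning target, not the exact ADQ update: the function names, the discount factor `gamma`, and the use of plain Python lists in place of DNN outputs are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard double Q-learning target for one transition.

    The online network selects the next action and the target network
    evaluates it, which reduces the overestimation of plain Q-learning.
    """
    if done:
        return reward
    best = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best]
```

Here each entry of `q_values` would correspond to one bitrate level, and the target returned by `double_q_target` is the value toward which the online network's estimate for the taken action is regressed.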

The ε-greedy policy is used for random bitrate selection in the early stages of the training phase, so that the agent is trained on all possible states of the environment. The training phase of our proposed method is illustrated in Algorithm 1. During training, the algorithm uses multistep rate selection to gain experience, which is stored in a distributed prioritized replay memory.
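Two ingredients named above, multistep experience and prioritized replay, can be illustrated with the minimal sketch below. It uses the usual textbook definitions (an n-step discounted return and sampling in proportion to priority raised to a power); the function names and the exponent `alpha` are assumptions, and the distributed aspect of the replay memory is not modelled.

```python
import random


def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus the discounted bootstrap value."""
    total = bootstrap_value
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Sample transition indices with probability proportional to p ** alpha.

    Priorities are typically the absolute TD errors of stored transitions,
    so transitions with a larger error are replayed more often.
    """
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)
```

With `alpha` set to 0 the sampling becomes uniform, which recovers an ordinary replay memory.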

A well-trained agent can adapt to dynamic variations in network throughput when selecting the optimal bitrate. The main priority of the ADQ method is continuous video playback that uses the maximum available bandwidth.

The learning model automatically derives, from raw data, the representation best suited to the classification task. Deep-learning techniques use several layers of artificial neurons. Each layer transforms its input into a more abstract representation, which serves as the input of the next layer.

The parameters and values used for FESTIVE, MPC, QDASH, A3C, and the proposed ADQ algorithm are given in Table 5.

The convergence speed is evaluated for the RL-based ABR methods. Figure 3 shows the convergence, with the number of video episodes experienced by the agent on the x-axis and the average reward on the y-axis. The graph shows that ADQ achieves a higher QoE score and converges faster than the other methods.

The A3C and QDASH algorithms are used as benchmarks for comparison with the proposed technique. The simulations are carried out using a greedy policy to obtain the highest reward after each video episode is completed. The existing QDASH algorithm takes about 105 episodes to approach the reward level of A3C. QDASH obtains the lowest reward at convergence, whereas ADQ achieves a high reward after fewer video episodes.

In the initial stage of the training phase, the bitrate is selected randomly using the ε-greedy policy, which allows the agent to explore all feasible states. The ε-greedy policy is employed throughout the training phase to balance exploration and exploitation. The training mechanism is described in Algorithm 1. The algorithm selects the bitrate of a video segment, accumulates the multistep experience, and stores it in the distributed replay memory. The network weights are updated according to the error of the loaded experience. This improves both performance and convergence speed. Figure 6 shows the comparison of the RL-based approaches with the heuristic methods on the data set. The mean QoE score of ADQ on the validation set is greater than that of the other methods considered. The better performance of ADQ is mainly due to fewer rebuffering events and a higher bitrate level.

5.2. Testing Phase

We have used the BigBuckBunny, Ski, TallBuildings, and TrafficAndBuilding videos for the testing phase. Figure 4 shows the real bandwidth trace used during the experimental evaluation of our proposed ADQ method and the existing algorithms. A bandwidth throttling module [54] is employed to shape the bandwidth so that testing reproduces a real-time scenario. The testing of the proposed ADQ method for video playback is presented in Algorithm 2. The algorithm chooses the appropriate bitrate for downloading each video segment, and its loop runs until all video segments have been downloaded. The throughput traces cover different network scenarios for testing the ADQ algorithm. The trained ADQ agent can adjust to dynamic network variations. The ADQ method maintains playback fluency by utilizing the entire available bandwidth.

The results are evaluated for each of the FESTIVE, BBA, QDASH, MPC, Rate-based, A3C, and ADQ algorithms. The PSNR, SSIM, rebuffering frequency, total switch frequency, and QoE are measured for each method to assess the video quality. The average quality metrics are shown in Table 6.

The average values of PSNR and SSIM obtained in the testing process using dynamic traces are shown in Figure 5. The PSNR of ADQ is higher than that of the other techniques. ADQ and A3C maintain an average SSIM higher than 0.80 for each video under consideration. The ADQ algorithm achieves larger SSIM values than FESTIVE and the rate-based heuristic, and FESTIVE shows a better SSIM than the rate-based heuristic. Overall, ADQ outperforms the FESTIVE, MPC, Rate-based, BBA, A3C, and QDASH algorithms in terms of PSNR and SSIM.
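For reference, PSNR follows the standard definition 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and the decoded frame. The sketch below applies this definition to flat sequences of pixel values; the function name `psnr` and the 8-bit peak value of 255 are assumptions, and the per-frame averaging used to obtain the reported values is not reproduced.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR indicates that the decoded frame is closer to the reference. SSIM requires local luminance, contrast, and structure statistics and is therefore usually computed with an image-processing library rather than by hand.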

Figure 6(a) shows the rebuffering frequency for each technique. MPC shows stable video quality, but it experiences rebuffering events because of its optimistic throughput prediction. Compared with the remaining baselines, MPC and FESTIVE nevertheless show better video stability and fewer rebuffering events. MPC is computationally intensive, and a real-time implementation needs precomputed data, which increases memory consumption. ADQ performs better than MPC and FESTIVE, with far fewer rebuffering events. Figure 6(b) shows the total switch frequency for ADQ and the existing ABR algorithms. The switch frequency of FESTIVE is low, so it shows stable video quality. The rate-based method experiences large quality fluctuations that have a significant effect on user QoE. ADQ performs significantly better than the FESTIVE and rate-based methods under high bandwidth fluctuations. ADQ performs better because of its low rebuffering and quick convergence on real capacity traces.
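The two quantities compared above can be computed from a playback log in a straightforward way. The following sketch is only illustrative: the function names and the log format (one bitrate per segment and one buffer level per sampling instant) are assumptions rather than the instrumentation used in dash.js.

```python
def count_switches(bitrates):
    """Number of segment boundaries at which the selected bitrate changes."""
    return sum(1 for prev, cur in zip(bitrates, bitrates[1:]) if prev != cur)


def count_rebuffer_events(buffer_levels):
    """Number of times the playback buffer (in seconds) becomes empty.

    Consecutive empty samples belong to the same stall and are counted once.
    """
    events = 0
    stalled = False
    for level in buffer_levels:
        empty = level <= 0.0
        if empty and not stalled:
            events += 1
        stalled = empty
    return events
```

For instance, the bitrate sequence 1, 1, 2, 2, 1 contains two switches, and a buffer trace that empties twice with playback resuming in between contains two rebuffering events.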

Figure 7 shows the QoE performance, where the QoE is measured using (3). The QoE of the ADQ method is greater than that of the existing methods on the real-time network. The evaluation covers different mobility patterns, including patterns with low network bandwidth. The proposed method adapts flexibly to different network conditions and gains better QoE while taking the available bandwidth into account. The existing heuristic algorithms employ fixed control laws and do not adapt to varying network conditions. The performance of A3C is comparable to that of ADQ in terms of the QoE metric, but the average QoE value of the ADQ method is higher than that of both the RL-based and heuristic methods. The main reasons for the improvement in QoE are fewer rebuffering events, infrequent quality variations, and a higher bitrate level.
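To make the interplay of the three factors concrete, the sketch below evaluates a linear QoE model of the kind commonly used in the ABR literature: total bitrate, minus a penalty for rebuffering time, minus a penalty for bitrate changes. This is an assumption for illustration and is not necessarily the form of (3); the function name and both penalty weights are hypothetical placeholders.

```python
def linear_qoe(bitrates, rebuffer_seconds, rebuffer_penalty=1.0, smooth_penalty=1.0):
    """Linear QoE: quality reward minus rebuffering and smoothness penalties.

    bitrates: selected bitrate per segment (e.g. in Mbit/s).
    rebuffer_seconds: total stall time over the session.
    Both penalty weights are placeholders, not values from the paper.
    """
    quality = sum(bitrates)
    variation = sum(abs(cur - prev) for prev, cur in zip(bitrates, bitrates[1:]))
    return quality - rebuffer_penalty * rebuffer_seconds - smooth_penalty * variation
```

Under such a model, a policy raises its score by keeping the bitrate high while avoiding both stalls and frequent quality switches, which matches the three reasons given above for the improvement of ADQ.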

The ADQ method introduces double Q-learning, which increases the computational complexity. However, this additional cost is confined to the training phase. Once ADQ is well trained, it can provide a better user experience by performing adaptive rate selection at low cost.

We have used multiple bitrate levels and real HSDPA datasets to compare the proposed algorithm with existing algorithms. The main advantage of rate-based algorithms is low quality variation, and that of buffer-based algorithms is fewer rebuffering events. MPC can trade off rebuffering frequency against total switch frequency to some extent, but it is hard to implement because of its computational complexity, which results in poor QoE performance. The performance of the FESTIVE and QDASH algorithms is closer to that of our proposed method, but a significant gap remains. ADQ keeps the rebuffering events and quality variations to a minimum throughout video playback and maintains a higher bitrate level with little bitrate switching. Furthermore, it achieves a higher QoE on real network traces. The results show that our proposed ADQ algorithm outperforms the BBA, QDASH, A3C, and Rate-based algorithms. Moreover, it is better than heuristic methods such as MPC and FESTIVE.

Intelligent QoE-aware adaptation approaches have a better buffer control policy. Our algorithm plays back the buffered segments when the network bandwidth drops and refills the buffer according to the network capacity. ADQ avoids sudden fluctuations in playback video quality and reduces rebuffering events. The reward function obtained from the HSDPA dataset ensures that the QoE-oriented policy is a well-trained rate adaptation policy; however, the circumstances that affect QoE are very complex. The QoE considered here includes three factors: bitrate variation, rebuffering frequency, and average bitrate. In future work, stalling events and initial delay can be integrated into learning-based schemes so that the learned policy yields a higher QoE. Reinforcement learning can also be applied across multiple DASH clients during network resource allocation, so that decisions are made according to the available network resources. Bitrate adaptation over HTTP based on network and device parameters is an attractive subject for multimedia content delivery.

6. Conclusion

We have proposed an ADQ method based on enhanced Double Deep Q-Learning. The ADQ method introduces improvements in the experience replay and the double DQN network architecture so that the client agent learns efficiently from previous experience and converges quickly to the optimal policy. We use the HSDPA dataset to evaluate our algorithm. The proposed algorithm is implemented in dash.js and compared with the other algorithms considered in the study. The ADQ method converges faster than FESTIVE, A3C, MPC, and QDASH during the training phase. The ADQ agent converges to a better bitrate adaptation policy after experiencing only a few video segments during the training phase. These improvements result from the changes in the Q-value network architecture and the learning process. The evaluation results show that ADQ performs better than the considered rate adaptation techniques in terms of video quality, QoE, rebuffering frequency, and total switch frequency.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.