#### Abstract

In the resource scheduling of a streaming Media Edge Cloud (MEC), migration cost must be balanced against load. To this end, this paper proposes a video stream session migration method based on deep reinforcement learning in a cloud computing environment. First, building on OpenFlow technology, a novel MEC architecture is designed that separates streaming media service processing in the application layer from forwarding path optimization in the network layer. Second, taking the state information of the system as the attribute feature, the session migration is computed, and deep learning is combined with deterministic strategy gradient reinforcement learning to migrate video stream sessions and solve the user request access problem. The experimental results show that the method has a better request access effect, can effectively improve the request acceptance rate, and can reduce the migration cost while shortening the running time.

#### 1. Introduction

In recent years, with the maturity of cloud computing technology, streaming media services have gradually been moving to the cloud, giving rise to the streaming media cloud. A streaming media cloud places media edge clouds in different geographical locations and pushes the content requested by users to the edge of the network, thereby reducing the user response delay and the traffic load on the main network [1]. At the same time, each subcloud can adapt to changes in system load and in the volume of user requests, which effectively addresses the problems of traditional streaming media services [2].

In a streaming Media Edge Cloud (MEC), system resources are virtualized into resource pools to ensure service transparency. Cloud resource allocation is adjusted automatically by the cloud platform according to the scale of actual demand, so the key question is how to allocate system resources in real time to meet user needs. When resources are limited, the fluctuation and randomness of user request patterns unbalance the system load and degrade the access of user requests [3, 4].

To solve the above problems, researchers have proposed migration-based task scheduling methods for streaming media. Ref. [5] proposed a session migration strategy based on dynamic threshold allocation (SMS-DTA). According to the popularity distribution, the session allocation thresholds of each kind of video on each server are determined, and user request access is guided by these thresholds. Ref. [6] proposed the resource dispatch based on data priority (RDDP) algorithm. Although it considers the impact of the urgency and scarcity of data blocks on priority, it gives no quantitative calculation: only a balance factor is used to measure the relationship between them, and the influence of the time factor on quantifying urgency is omitted. Ref. [7] proposed a direct access storage device (DASD) hopping algorithm that migrates sessions between nodes with different loads in order to maintain the load balance of the hard disks. However, because it lacks self-adaptability, it is difficult to adjust the strategy to the system operation scenario. Moreover, the mathematical model is relatively complex and computationally expensive, so it cannot solve large-scale resource allocation problems.

Ref. [8] explored how to make the streaming media edge cloud admit more requests via online session migration and proposed an adaptive strategy of online session migration. Besides the load information, the video popularity is used to obtain the allocation thresholds of different videos on each server, and a new request is admitted under the guidance of the resulting threshold distribution. In particular, when the video popularity varies, the allocation thresholds are recalculated. Ref. [9] proposed a joint optimization algorithm of session migration and video deployment. The proposed strategy adapts better to dynamic fluctuations in video popularity and thus achieves a flexible balance between service cost and quality. A trace-driven experiment verified the effectiveness of the method.

To balance migration cost against load in the resource allocation of the streaming media edge cloud, and taking migration cost, load balancing, and other constraints into account, this paper proposes a video stream session migration method based on deep reinforcement learning. Building on OpenFlow technology, a novel MEC architecture is designed that separates streaming media service processing in the application layer from forwarding path optimization in the network layer to ensure service transparency. The main innovations are as follows:

(1) This paper improves resource utilization by making effective use of the state information of the MEC system and by combining deep learning with a deterministic strategy for video stream session migration.

(2) This paper proposes a session migration computing model that processes user requests in a more principled way, maximizes the access rate of user requests, keeps the migration cost under control, and at the same time balances the system load as far as possible.

#### 2. Streaming Media Edge Cloud Architecture

The streaming Media Edge Cloud is located at the edge of the network and is responsible for local video services. As shown in Figure 1, this paper designs a novel MEC architecture based on OpenFlow technology. The whole MEC is composed of streaming media servers, a business management server, and an OpenFlow controller and switches. The streaming media servers are responsible for providing media streams to users. The business management server is mainly responsible for the access scheduling of user requests and for generating migration strategies and sending them to the OpenFlow controller. The OpenFlow controller and switches form the media stream distribution network and also carry out the actual session migration: the OpenFlow controller generates flow tables according to the migration strategy and sends them to the switches, and each OpenFlow switch modifies and forwards data packets according to its flow tables.

With this MEC architecture, streaming media service processing in the application layer is separated from forwarding path optimization in the network layer, which makes the video service transparent.

#### 3. Session Scheduling Strategy Based on Deep Reinforcement Learning

Assuming that the video content provided by the MEC system has kind of video content, the kind of video is represented by . Each kind of video is encoded at a constant bit rate and serves at the same bit rate [10, 11]. Assuming that the total number of MEC streaming media servers is , the server is represented by , and is defined as the video deployment matrix with the size of , and the element represents whether or not a copy of is deployed on . Assuming that all servers are homogeneous, and a single server can provide up to streaming sessions at the same time, as well as up to videos [12].

is defined as a session distribution matrix with the size of . A single element represents the ratio of all sessions of video on server to the total service capacity (JC) of the system. is defined as a server adjacency matrix with the size of . Element denotes whether there are sessions on server that can be migrated to the server, where .

Define as the load of streaming media server , that is, the total number of access sessions, then:

Define as the average load of all streaming media servers in the system, then:
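As an illustration of the two load definitions above, the following minimal Python sketch computes the per-server load and the system average. It assumes a hypothetical matrix `sessions[i][j]` that holds raw session counts of video `i` on server `j`; the session distribution matrix in the text stores ratios to the total service capacity instead, so the names and layout here are assumptions for illustration only.

```python
def server_loads(sessions):
    """Return the load of each server: the total number of sessions it serves.

    sessions[i][j] is the (hypothetical) number of sessions of video i
    currently served by server j.
    """
    num_servers = len(sessions[0])
    return [sum(row[j] for row in sessions) for j in range(num_servers)]


def average_load(loads):
    """Return the mean load over all streaming media servers."""
    return sum(loads) / len(loads)


# Two videos deployed across three servers.
sessions = [
    [3, 0, 2],
    [1, 4, 0],
]
loads = server_loads(sessions)
mean_load = average_load(loads)
```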

In this paper, the state information of the MEC system is taken as an attribute feature, and the decision-maker and value function are fitted by a deep convolutional neural network combined with reinforcement learning elements such as the state space, action set, and return function. To improve the efficiency of the algorithm, the deterministic strategy gradient is used to train the neural network.

##### 3.1. Session Scheduling Model

For the streaming media edge cloud system, the goal of reinforcement learning is to autonomously admit each video request to the most suitable server, based on the current MEC system state, the video request, and the learned strategy. Then, according to the load state of that server, the migration-out strategy decides whether to perform direct request access or a one-step session migration, which yields the migration decision for the incoming user request [13, 14].

In this paper, the deep reinforcement learning method is applied to session scheduling in streaming media edge cloud, and its session migration method is shown in Figure 2.

In Figure 2, for the current step video request , the decision-making action of the decision-maker is to connect the video request to a server, assuming that the server accessed is , then the strategy of moving out video is: if is not full, no video needs to be moved out, set the number of moving out video to be 0, corresponding to the request access; if is full, it needs to move out the video and move out the video. The set of numbers is , corresponding to one-step session migration.
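The decision rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `plan_access`, the tuple it returns, the choice to move exactly one session out of a full server, and the rule of sending it to the least-loaded adjacent server are all assumptions made for the example.

```python
def plan_access(target, loads, capacity, adjacency):
    """Decide how a request admitted to server `target` is handled.

    Returns (kind, moved_out, destination):
      - ("access", 0, None) if the target server still has spare capacity,
      - ("migrate", 1, j) if it is full and one session can move to server j,
      - ("reject", 0, None) if it is full and no adjacent server has room.
    adjacency[a][b] is truthy when sessions on server a may migrate to b.
    """
    if loads[target] < capacity:
        return ("access", 0, None)

    # Full server: look for adjacent servers that can absorb one session.
    candidates = [
        j for j, allowed in enumerate(adjacency[target])
        if allowed and j != target and loads[j] < capacity
    ]
    if not candidates:
        return ("reject", 0, None)

    # Assumption: prefer the least-loaded candidate as the destination.
    destination = min(candidates, key=lambda j: loads[j])
    return ("migrate", 1, destination)


adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
direct = plan_access(1, [4, 2, 4], capacity=4, adjacency=adjacency)
migrated = plan_access(0, [4, 2, 4], capacity=4, adjacency=adjacency)
rejected = plan_access(0, [4, 4, 4], capacity=4, adjacency=adjacency)
```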

##### 3.2. Reinforcement Learning Model of Session Migration

According to the characteristics of the problem, the state of time step in the MEC system is defined as follows:

Among them, is the server adjacency matrix and the size is , which indicates whether session migration can be carried out among servers, is the video request matrix of time step, the size is , the elements of one row in the video request matrix are all 1, the other elements are all 0, and the corresponding video number of the row with the elements all 1 is the appropriate one. The former video request is the video deployment matrix and the size is , which reflects the deployment of video copies in the MEC system. is the session distribution matrix and the size is , which will reflect the distribution of video sessions in the MEC system. Every time a new video request is processed, the system will undergo a state transition [15–17].

Since the task is to decide which server to access or reject the request based on the current MEC system status and video request, the action is defined as the server number to which the video request is accessed, where . For the current step video request , the optional action set is shown in Formula (4). When accesses MEC directly or through session migration, the set of optional actions is the set of servers deployed with video ; when rejecting video request , the corresponding action is 0.
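A minimal sketch of how the optional action set could be built is shown below. It assumes a hypothetical 0/1 matrix `deployment[i][j]` with 0-based indices, 1-based server numbers as actions, and that the reject action 0 is always available; these conventions are assumptions for the example rather than details confirmed by the text.

```python
REJECT = 0  # action number reserved for rejecting the request


def optional_actions(video, deployment):
    """Return the actions available for a request for `video`.

    deployment[i][j] == 1 means a copy of video i is deployed on server j.
    Servers are numbered from 1 so that 0 can denote rejection.
    """
    servers = [j + 1 for j, has_copy in enumerate(deployment[video]) if has_copy]
    return [REJECT] + servers


deployment = [
    [1, 0, 1],  # video 0 is on servers 1 and 3
    [0, 1, 0],  # video 1 is on server 2 only
]
actions_for_video_0 = optional_actions(0, deployment)
actions_for_video_1 = optional_actions(1, deployment)
```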

If video request is accessed to server according to the decision-making action, this paper chooses the deployment of video on server , the load of server , and the variance of load balance of MEC system after executing the action as the immediate return function. The video deployment on the server, the load value of the server, and the load balancing variance of the system are different, so the load balancing variance of the system is normalized. The load balancing variance function is defined as:

Since the load balancing variance function is the inverse of the load variance, the formula above shows that the larger its value, the more balanced the load of the system.

If the video requested in time step is , the reward value returned by action of that step is

where , , when video is deployed on the server corresponding to decision action , there will be corresponding reward value . If video is not deployed on server , the action is not a reasonable access action. It is not in the optional action set, reward value 0, and reward value represents the remaining service capability of the server. When the server is full, that is, the residual service capacity is 0, the reward value is 0. When migration occurs, because session migration has a certain cost, the reward value is reduced by 1 as the corresponding penalty. When the action is to reject the video request, the return value is set to -1. , , and represent the weights of the returns from the three optimization objectives, respectively. The weights can be set according to the importance of the optimization objectives, but the sum of the three weights must satisfy .
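The structure of this return function can be illustrated with a short sketch. Several details are assumptions made only so that the example runs: the deployment reward is taken as 1, the remaining capacity is normalized by the server capacity, the migration penalty of 1 is applied to the capacity term, the balance term uses the hypothetical stand-in `1 / (1 + variance)`, and the weights in the usage example are arbitrary values that sum to 1.

```python
from statistics import pvariance


def load_balance_score(loads):
    """Hypothetical normalized balance term: 1.0 when all loads are equal."""
    return 1.0 / (1.0 + pvariance(loads))


def immediate_reward(server, video, deployment, loads_after, capacity,
                     migrated, weights):
    """Weighted return for admitting `video` to `server` (0-based index).

    server is None when the request is rejected, which returns -1.
    loads_after are the server loads after the action has been executed.
    """
    if server is None:
        return -1.0

    w_deploy, w_capacity, w_balance = weights
    # Reward for choosing a server that actually holds the video.
    r_deploy = 1.0 if deployment[video][server] else 0.0
    # Remaining service capability of the chosen server (0 when full).
    r_capacity = (capacity - loads_after[server]) / capacity
    if migrated:
        r_capacity -= 1.0  # penalty for the cost of a session migration
    r_balance = load_balance_score(loads_after)
    return w_deploy * r_deploy + w_capacity * r_capacity + w_balance * r_balance


weights = (0.5, 0.25, 0.25)  # illustrative weights, not the paper's values
plain = immediate_reward(0, 0, [[1, 1]], [2, 2], 4, False, weights)
with_migration = immediate_reward(0, 0, [[1, 1]], [2, 2], 4, True, weights)
rejected = immediate_reward(None, 0, [[1, 1]], [2, 2], 4, False, weights)
```

In this sketch a migration lowers the return by exactly the capacity weight, so the learned strategy is only pushed toward migration when it improves the other terms.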

Defined in MEC system state , after taking action , if strategy is continuously implemented, the expected value of immediate return is action-value function. Defined Bellman equation as follows:

where is the immediate return value after taking action under the state of the MEC system. In the whole session scheduling process, the above equation is the final solution of the equation, and the optimal scheduling strategy is obtained by solving the equation.
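For reference, the action-value function of a deterministic strategy is commonly written in the recursive form below. The symbols ($Q^{\mu}$, $\mu$, $s_t$, $a_t$, $r$, $\gamma$) follow the usual deterministic policy gradient notation and are assumptions here, not necessarily the notation of the equation in this section; $\gamma$ denotes the discount coefficient.

```latex
Q^{\mu}(s_t, a_t) = \mathbb{E}\left[\, r(s_t, a_t) + \gamma \, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \right]
```

Read this way, the value of taking action $a_t$ in state $s_t$ is the immediate return plus the discounted value of following the strategy $\mu$ from the next state onward.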

##### 3.3. Migration Computing Model

In MEC architecture, the migration cost can be expressed as by the number of migrated sessions. In addition, this paper specifies the maximum threshold , i.e., , for a single migration cost, where the value of is determined by OpenFlow’s flow processing capability.

Under an unbalanced load distribution, fully loaded servers can continue to accept new requests only if some sessions are moved out. Therefore, whether the load is balanced indirectly affects the cost of migration. In practice, due to the fluctuation of the request distribution, video requests do not arrive strictly according to popularity. The scheme of optimizing the acceptance rate mentioned above can easily lead to an unbalanced load and increase the cost. Therefore, a load balancing maintenance goal is introduced.

(1) For new requests arriving immediately , due to the directive allocation threshold, can only be connected to nodes that have not yet reached , in order to minimize the load imbalance. Therefore, the following new optimization objectives have been added

where is a constant matrix, the calculation method is: for , assuming that is the node deployed with the smallest load, the corresponding element ; of course, the corresponding element on the other nodes. In addition, for the rest of the video , , the corresponding element is .

(2) For all subsequent arrival requests, in order to connect them to the minimum load node, it is necessary to ensure that each allocation threshold is larger than the number of sessions [18]. In addition, considering the continuity and randomness of request arrival, the difference between the allocation threshold and the number of sessions should also be related to the arrival of requests and other factors [19, 20]. Therefore, the following new optimization objectives have been added

where is a constant matrix, assuming that represents the number of requests that has not yet arrived, it can be approximately expressed as . For the node set of deployment , the corresponding element , and for the remaining nodes, the corresponding element . and are weight vectors, considering that popular video is more likely to affect the load distribution, the weight is desirable ; considering that lightweight nodes should allocate larger thresholds, the weight is desirable .

From the effect point of view, the smaller the load of nodes, the larger the allocation threshold to undertake more requests for access. However, since this optimization strategy is adopted after the start of MEC, the load of each node is basically balanced, so the abovementioned average allocation processing can still achieve the desired effect.

In addition to the limitation of the cost of a single migration, the following constraints should be considered: the service capacity limitation of each server; the value range limitation of ; and the principle of “no reduction in the number of actual sessions.”

In summary, with session assignment matrix as a decision variable, the migration computation model can be expressed as follows:

where and are constant matrices.

Subject to:

##### 3.4. Scheduling Algorithm Based on Reinforcement Learning

###### 3.4.1. Choice of Behavior Strategies

This paper chooses deterministic behavior strategies and defines a function , which is expressed as:

The behavior of each step can be obtained by calculating function . Function is simulated by using convolutional neural network. The network is a strategy network with a parameter of . A function is used to measure the performance of strategy , which is defined as:

where is the state of the system, is in each state, if the action is selected according to policy , the value can be generated, that is, is the expected value of when policy . Therefore, the optimal behavior strategy is the strategy which maximizes , that is,

The network input is the MEC system state, that is, the video request matrix, video deployment matrix, session distribution matrix, and server adjacency matrix with size . The feature vectors of the video request matrix and the video deployment matrix represent the deployment information of the current video on the server. The size of the feature vectors is . The three feature vectors are connected through a concat layer. Finally, the probability distribution of the server number is obtained by using a Softmax classifier. The dimension is , and the decision-making action is the server number corresponding to the maximum probability.

In order to make it more exploratory, on the basis of the deterministic strategy, behavior search is added, that is, 30% of the actions are randomly selected in the optional action space, and the remaining actions are the output of the strategy network.
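The exploration rule above can be sketched in a few lines. The function name `choose_action` and the injectable `rng` argument are hypothetical conveniences for the example; only the 30% exploration rate comes from the text.

```python
import random

EXPLORATION_RATE = 0.3  # share of actions drawn at random, as stated above


def choose_action(policy_action, optional_actions, rng=random):
    """Return a random optional action 30% of the time, else the policy output.

    policy_action is the server number produced by the strategy network;
    optional_actions is the list of actions allowed for the current request.
    """
    if rng.random() < EXPLORATION_RATE:
        return rng.choice(optional_actions)
    return policy_action


# Example: the strategy network suggests server 2 out of actions {0, 2, 3}.
action = choose_action(2, [0, 2, 3], rng=random.Random(0))
```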

###### 3.4.2. Iterative Value Calculation

In this paper, the convolution neural network is used to simulate function. The network is called network. Its parameter is . The model of network is shown in Figure 3.

The input of network is the MEC system state and action vector, and the action vector is the result of transforming the probability distribution vector output by the policy network into a one-hot vector, the size of which is . In network training, the input sample data is highly correlated in time, so direct training does not converge easily. To break the correlation between samples, the "experience replay" method is used: the generated sample data is saved into a buffer, and the samples used in training are drawn at random from that buffer.
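A minimal sketch of such a buffer is given below, assuming transitions are stored as `(state, action, reward, next_state)` tuples and that the oldest samples are discarded once the capacity is reached; the class name `ReplayBuffer` and its methods are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when it is full.
        self._items = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._items.append((state, action, reward, next_state))

    def __len__(self):
        return len(self._items)

    def sample(self, batch_size, rng=random):
        """Draw batch_size distinct transitions uniformly at random."""
        return rng.sample(list(self._items), batch_size)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, step % 2, float(step), step + 1)
batch = buffer.sample(2, rng=random.Random(0))
```

Sampling uniformly from the buffer, rather than training on consecutive steps, is what breaks the temporal correlation between samples.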

In the process of network training, this paper uses the target network method to establish the copy and of the policy network and network to calculate the target value, and then update the original network slowly in the proportion of . Through this network learning method, the learning process will be more stable and convergence will be more guaranteed. The flow chart of reinforcement learning algorithm based on deterministic strategy gradient is shown in Figure 4.
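The slow update can be illustrated with a short sketch. It follows the usual target-network convention, in which the copies gradually track the trained networks; the function name `soft_update`, the flat list of parameters, and the value of `tau` in the example are assumptions for illustration.

```python
def soft_update(target_params, source_params, tau):
    """Blend trained parameters into the target copy.

    Each target parameter moves a fraction tau of the way toward the
    corresponding trained parameter: tau * source + (1 - tau) * target.
    """
    return [
        tau * source + (1.0 - tau) * target
        for target, source in zip(target_params, source_params)
    ]


# Example with an illustrative update proportion of 0.1.
target = [0.0, 1.0]
trained = [1.0, 1.0]
target = soft_update(target, trained, tau=0.1)
```

A small update proportion keeps the target values nearly fixed between training steps, which is the source of the more stable learning described above.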

#### 4. Experiment and Analysis

##### 4.1. Parameter Setting

In the environment of the MEC system simulation, the environment parameters are as follows: the total number of streaming media servers , the capacity of each streaming media server, the maximum number of service sessions , and the number of video types . At the same time, assuming that the arrival rate of users’ requests obeys Poisson distribution of requests per minute, the value of ranges from 58 to 65. The average playback time is set to 30 minutes, and the system can support 2000 (JC) user video requests concurrently in one playback time. Therefore, when , the system reaches full load. The video content requested by users obeys Zipf distribution and random uniform distribution, respectively.
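The two request-content distributions used in the experiments can be generated as in the following sketch. The Zipf exponent is not given in this section, so the default `exponent=1.0` is an assumption, as are the function names and the fixed seed; setting the exponent to 0 yields the uniform case.

```python
import random


def zipf_weights(num_videos, exponent=1.0):
    """Return normalized popularity weights proportional to 1 / rank**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_videos + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]


def sample_requests(num_requests, num_videos, exponent=1.0, seed=0):
    """Draw video indices (0 = most popular) for a batch of requests."""
    rng = random.Random(seed)
    weights = zipf_weights(num_videos, exponent)
    return rng.choices(range(num_videos), weights=weights, k=num_requests)


zipf_popularity = zipf_weights(4)
uniform_popularity = zipf_weights(4, exponent=0.0)
requests = sample_requests(num_requests=100, num_videos=4)
```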

In order to analyze the effectiveness and practicability of the deep reinforcement learning algorithm, it was implemented on the TensorFlow platform and applied to the MEC system session scheduling strategy. The parameters of the algorithm are as follows: the learning rate of policy network is 0.0001, the learning rate of network is 0.001, the discount coefficient is 0.95, the capacity of buffer is 100,000, the preheating coefficient of buffer is 1,000, the number of iterations is 100,000, the upper limit of time step is 60 steps, the number of samples for each iteration is 600, and the weight coefficient in the return function is , , .

##### 4.2. Result Analysis

The deep neural network is trained with the parameters set in Section 4.1. The trained network model is then used in the MEC system simulation experiment, with the simulation time set to 300 minutes. To better reflect the effect of the optimization, the proposed algorithm is compared with the algorithms of Ref. [8] and Ref. [9] under the same experimental conditions. The user request reception rate, the total number of migrated sessions, and the running time are used as performance evaluation indicators.

First, the video content requested by users is set to follow a Zipf distribution. Figures 5–7 show how the user request reception rate, the total number of migrated sessions, and the running time of the algorithm, respectively, vary with the system load under this condition.

As can be seen from Figures 5 and 6, in the case of low load (), the user request reception rate and the total number of migration sessions of this method are basically the same as Ref. [8] algorithm and Ref. [9] algorithm. In the case of high load (), the receiving rate of user requests and the total number of migrating sessions of this method are lower than Ref. [8] algorithm and Ref. [9] algorithm.

Compared with the algorithms of Ref. [8] and Ref. [9], the average reception rates of user requests in this method are reduced by 0.85% and 1.72%, respectively, and the total number of migrated sessions is reduced by 3.55% and 5.29%, respectively. The result shows the advantage of reinforcement learning. Because session migration incurs a cost, the decision-maker constantly adjusts its decision-making actions in order to obtain greater returns and ultimately reduces the cost of migration, while guaranteeing a higher request reception rate.

As can be seen from Figure 7, for both low load and high load, the running time of the proposed algorithm is shorter than that of the algorithms of Ref. [8] and Ref. [9], by 39.98% and 54.54% on average, respectively. The reason is that, while user requests are being admitted, the algorithm of Ref. [8] must constantly update the session allocation thresholds, which leads to a large amount of computation, and the algorithm of Ref. [9] must iterate repeatedly to find the optimal solution. In contrast, the deep reinforcement learning method used in this paper only needs to make scheduling decisions through the trained strategy network, which has lower computational complexity and improves efficiency.

To evaluate the adaptability of the proposed algorithm, the video content requested by users is then set to follow a random uniform distribution. Figures 8–10 show how the request reception rate, the total number of migrated sessions, and the running time of the algorithm, respectively, vary with the system load under this condition. Compared with the algorithms of Ref. [8] and Ref. [9], the average reception rate of user requests in this algorithm is reduced by 0.41% and 1.19%, the total number of migrated sessions is reduced by 3.64% and 6.57%, and the running time is reduced by 45.28% and 56.03%, respectively. The experimental results show that the proposed algorithm has a certain degree of self-adaptability. When the distribution of user requests changes, the scheduling strategy can still be adjusted in the training process, resulting in a lower migration cost and a higher user request reception rate.

In summary, the proposed deep reinforcement learning-based session scheduling strategy for streaming media edge cloud not only achieves better request access effect but also has lower migration cost. More importantly, it has a great speed advantage, that is, shorter running time. At the same time, it has strong adaptability in an uncertain MEC system environment.

#### 5. Conclusion

To achieve efficient and smooth resource scheduling for streaming media service systems in cloud mode, this paper proposes a video stream session migration method based on deep reinforcement learning. The method transforms the session migration problem into a reinforcement learning problem; defines the state space, action set, and return function; calculates the session volume according to load; and uses a convolutional neural network to fit the behavior selection strategy function and the action-value function. The experimental results show that, compared with the algorithms of Ref. [8] and Ref. [9], this strategy can reduce the migration cost and shorten the running time.

This paper only considers the server to which a video session request is admitted as the output of network. Future research will take both the server to which the request is admitted and the video session moved out of that server as the output of network, in order to improve the session migration method of the streaming media edge cloud and extend its application to dynamic video sessions.

#### Data Availability

The data included in this paper are available without any restriction.

#### Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

#### Acknowledgments

We wish to express our appreciation to the reviewers for their helpful suggestions, which greatly improved the presentation of this paper. This work was supported by the Henan Province Science and Technology Project (142102210366), Henan Public Security Think Tank Project (No. 2020-25), and Henan Police College Education and Teaching Reform Research and Practice Project (No. 2020-9).