Edge caching is a promising method to cope with the traffic explosion in future networks. To satisfy user requests, contents can be proactively cached in proximity to users (e.g., at base stations or on user devices). Recently, several learning-based edge caching optimizations have been discussed. However, most previous studies struggle with the dynamic and constantly expanding action and caching spaces, which makes them impractical and inefficient. In this paper, we study the edge caching optimization problem by utilizing the Double Deep Q-network (Double DQN) learning framework to maximize the hit rate of user requests. Firstly, we build the Device-to-Device (D2D) sharing model by considering both online and offline factors, and then formulate the optimization problem, which is proved to be NP-hard. Then the edge caching replacement problem is modelled as a Markov decision process (MDP). Finally, an edge caching strategy based on Double DQN is proposed. Experimental results based on large-scale real traces show the effectiveness of the proposed framework.

1. Introduction

With the development of network services and the sharp increase in mobile devices, severe traffic pressure poses an urgent demand on network operators to explore effective paradigms towards 5G. Related work shows that requests for the top 10% of videos account for 80% of all traffic, i.e., repeated downloads of the same content [1]. Device-to-Device (D2D) content sharing is an effective method to reduce mobile network traffic. In this way, users can download required content from nearby devices and enjoy data services with low access latency [2], which improves their quality of service (QoS).

To design an efficient caching strategy in mobile networks, we need to obtain statistical information about user requests and sharing activities by learning from the huge volume of mobile traffic. In previous work, some important factors in mobile networks (such as content popularity, mobility models, user preferences, and user behaviour) are assumed to be perfectly known, which is not rigorous [3]. Recently, learning-based methods have been proposed to jointly optimize mobile content sharing and caching [4, 5]. The authors of [6] calculated the minimum offloading loss according to users' request intervals and explored content caching at small base stations (SBSs). Srinivasan et al. [7] used Q-learning to determine load-based spectrum allocation, optimizing spectrum sharing. However, traditional RL techniques are not feasible for mobile network environments with large state spaces.

Motivated by this, we study the D2D edge caching strategy in hierarchical wireless networks in order to maximize offloaded traffic and reduce backhaul pressure through D2D communication. The cache replacement process is modelled as a Markov decision process (MDP). Finally, a Double Deep Q-network (Double DQN) based edge caching strategy is proposed. The contributions of this paper are summarized as follows:

(i) We model the D2D sharing activities by considering both an online factor (users' social behaviours) and an offline factor (user mobility). The resulting optimization problem is proved to be NP-hard.

(ii) The cache replacement problem is formulated as a Markov decision process (MDP) to address the sequential nature of edge caching, and we propose a Double DQN-based edge caching strategy to deal with the explosion of the action/state spaces.

(iii) Through a combination of the theoretical model, real-trace evaluation, and a simulation platform, the proposed Double DQN-based edge caching strategy achieves better performance than several existing caching algorithms, including least recently used (LRU), least frequently used (LFU), and first-in first-out (FIFO).

The rest of this article is organized as follows. Section 2 reviews related work. Section 3 introduces the system model. Section 4 formulates the cache optimization problem. Section 5 details the optimization of the caching strategy. Section 6 presents large-scale experiments based on real traces, and Section 7 concludes the paper.

2. Related Work

There has been much research on edge caching in mobile networks. For example, [8–10] argue that adding caching to mobile networks is very promising. Femtocaching, proposed in [11, 12], and AMVS-NDN, proposed in [13], both add caches at BSs for the purpose of offloading traffic. The authors of [14–16] proposed collaborative caching strategies among BSs, which greatly improve users' QoS. In recent years, the application of machine intelligence in wireless networks has attracted increasing attention. Research in [17, 18] shows that reinforcement learning (RL) has great potential in the design of BS content caching schemes. In particular, the authors proposed a base-station cache replacement strategy based on Q-learning and used a multiarmed bandit (MAB) formulation to place contents through RL techniques [17]. However, considering the extreme complexity of real network environments and the enormous size of the state space, traditional RL techniques are not feasible. Besides, all of the works mentioned above focus on single-level caching without considering multilevel caching.

Multitier caching is widely used to exploit the potential of system infrastructure, especially in web caching systems [19–21] and IPTV systems [22]. Reference [23] focused on the theoretical performance analysis of content caching in HetNets, under the assumption that all contents are of the same size. However, [22, 23] do not involve the design of caching policies, which require practical considerations in terms of constraints (for instance, limited fronthaul/backhaul capacity and diversity of content sizes) and the specific characteristics of network topologies.

3. System Model

As shown in Figure 1, we consider a hierarchical network architecture. The core network communicates with base stations via backhaul links, and each base station communicates with users via cellular links. N mobile users, each with a local buffer of size C, are uniformly distributed; users can establish direct communications with each other via D2D links, and they can also be served by the BSs via cellular links. M files are stored in the content library F = {1, 2, ..., M}, and s_f denotes the size of the requested content f. The cache state is described by x_{u,f}. Here, x_{u,f} is binary, where x_{u,f} = 1 denotes that user u caches content f while x_{u,f} = 0 means no caching.

3.1. Content Popularity and User Preference

The popularity of a content is often described as the probability that this content from the library is requested, taken over all users. Denote the popularity matrix by P, where the component p_{u,f} is the probability that user u requests content f. In related studies, the content popularity is usually described by the Zipf distribution [24]: p_{u,f} = r_{u,f}^{-γ} / Σ_{j=1}^{M} j^{-γ}, where r_{u,f} is the popularity rank (in descending order) that user u gives to content f and γ is the Zipf exponent.
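As an illustration, the Zipf popularity model above can be sketched in a few lines of Python (the library size and exponent below are arbitrary example values, not the paper's settings):

```python
def zipf_popularity(num_contents, gamma):
    # The content ranked r is requested with probability proportional to r**(-gamma).
    weights = [r ** (-gamma) for r in range(1, num_contents + 1)]
    total = sum(weights)
    return [w / total for w in weights]

popularity = zipf_popularity(1000, 0.8)
```

A larger exponent γ concentrates more of the request probability mass on the top-ranked contents.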

We measured users' sharing activities through a large-scale trace of D2D sharing based on Xender. As shown in Figure 3 [25], in the real world the matrix P changes over time (we introduce the trace in detail in Section 6). We assume that the matrix remains constant within each period, and our caching strategy refreshes as the popularity matrix P changes across periods. Each period of user sharing activity can be divided into peak hours and off-peak hours, and the cache replacement action occurs during the off-peak hours of each period.

User preference: the user preference, denoted as P_u, is the probability distribution of a user's requests over the contents. In the content popularity matrix P, each row is the preference vector of one user, which reflects, in a statistical way, the preference of that user for each content. Assuming that content popularity and user preference are stochastic, we can relate them through the probability of a user sending requests for the various contents, given the user request distribution, which reflects the request activity level of each user.

3.2. D2D Sharing Model

In D2D-aided cellular networks, users can select either the D2D link mode or the cellular link mode. In the D2D link mode, users request and receive content from other users via D2D links (e.g., Wi-Fi or Bluetooth); otherwise they request the content from the BSs directly in the cellular mode. In our model, users try the D2D link mode first. If the requested content is neither in their own buffers nor in their neighbours', the cellular mode is chosen.

To model the D2D sharing activities among mobile users, two important factors must be considered: opportunistic encounters (e.g., user mobility, meeting probability, and geographical distance) and social relationships (e.g., online relations and user preference).

(1) Opportunistic Encounter. When users communicate via a D2D link, the distance between the two users must be less than a critical value. Since devices are carried by humans or vehicles, we use the meeting probability to describe user mobility.

Similar to prior work [26], we regard λ_{u,v} as the contact rate of users u and v, which follows a Poisson distribution, and the contact events are independent of user preference. We can thus model the opportunistic delivery as a Poisson process with rate λ_{u,v}. If user v caches content f in its buffer, the probability that user u receives content f from user v before the content expires at time T is, for a node pair, 1 − e^{−λ_{u,v}T}. However, if content f is not cached at user v, this probability is 0. Combined with the definition of x_{v,f}, we can rewrite (3) as x_{v,f}(1 − e^{−λ_{u,v}T}). Hence, the probability that user u cannot receive content f from any other user is the product of the complements over all users v, and the probability of user u receiving content f via D2D is one minus this product.
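A minimal numerical sketch of this delivery model, assuming a hypothetical contact rate λ and content lifetime T:

```python
import math

def d2d_delivery_prob(contact_rate, ttl, cached):
    # Poisson contacts at rate lambda: at least one contact occurs before the
    # content expires at time T with probability 1 - exp(-lambda * T);
    # if the neighbour does not cache the content, the probability is 0.
    return (1.0 - math.exp(-contact_rate * ttl)) if cached else 0.0

def d2d_receive_prob(neighbours, ttl):
    # Probability the requester receives the content from at least one
    # neighbour; `neighbours` is a list of (contact_rate, caches_content).
    miss = 1.0
    for rate, cached in neighbours:
        miss *= 1.0 - d2d_delivery_prob(rate, ttl, cached)
    return 1.0 - miss
```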

(2) Social Relationship. Regarding social relationships among users, mobile users with weak social ties may not be willing to share content with others owing to security/privacy concerns. On the other hand, users sometimes have spare resources and are willing to share content. However, sharing activities may still fail because of hardware/bandwidth restrictions (the content may be too large or the transmission speed too slow). Thus, we consider that the social relationship mainly depends on user preference and the content transmission rate.

We employ the notion of cosine similarity to measure the preference similarity between two users; the preference similarity factor is defined as the cosine similarity of their preference vectors. Finally, based on the opportunistic encounter and the social relationship, we can obtain the probability of D2D sharing between users u and v. The sum of the probabilities of D2D sharing between each user and all other users is less than 1.
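The preference similarity factor can be computed as a plain cosine similarity between two preference vectors (the vectors below are made-up examples):

```python
import math

def preference_similarity(pu, pv):
    # Cosine similarity between the preference vectors of users u and v.
    dot = sum(a * b for a, b in zip(pu, pv))
    norm = math.sqrt(sum(a * a for a in pu)) * math.sqrt(sum(b * b for b in pv))
    return dot / norm if norm > 0 else 0.0
```

Values near 1 indicate nearly identical tastes; 0 indicates no overlap in preferred contents.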

3.3. Association of Users and BSs

Users can request content directly from the associated local BS when the requested content cannot be satisfied by D2D sharing. We define the cellular serving ratio as the average probability that the requests of user u have to be served by the local BS via the backhaul link rather than by D2D communications. In this paper, we consider that the content transmission process can be finished within the user mobility tolerance time, e.g., before the user moves out of the communication range of the local BS. The requested content can be served from the buffer of the local BS, obtained from neighbour BSs via BS-BS links, or fetched from the Internet via the backhaul link. The probability of a BS serving user u is computed from the time periods of the cellular servings from the BS to user u during the total sample time, normalized so that these probabilities sum to 1.

3.4. Communication Model

We model the wireless transmission delay between a user and a BS as the ratio between the content size and the downlink data rate. Similar to [27], the downlink data rate from BS b to user u can be expressed as r_{b,u} = B log₂(1 + P_b g_{b,u} / σ²), where B is the channel bandwidth, σ² is the background noise power, P_b is the transmission power of BS b towards user u, and g_{b,u} is the channel gain, determined by the distance between user u and BS b.
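The rate formula above is the standard Shannon capacity; a small sketch of the rate and the resulting transmission delay, with made-up link parameters:

```python
import math

def downlink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    # Shannon capacity: r = B * log2(1 + P * g / sigma^2), in bits/s.
    snr = tx_power_w * channel_gain / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)

def transmission_delay(content_size_bits, rate_bps):
    # Delay modelled as content size over downlink rate.
    return content_size_bits / rate_bps
```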

3.5. Optimization for D2D-Enabled Edge Caching Problem

Mobile users can share content via D2D communications. For a user pair u and v, user u can get the requested content f from v if v has the content (x_{v,f} = 1) while u does not, with the D2D sharing probability derived above. Thus, the content offloaded from the BSs or the Internet via the D2D link between u and v can be obtained accordingly, and summing over all user pairs and contents gives the total content offloaded via D2D sharing. Our aim is to maximize the total size of content offloaded at users via D2D sharing while satisfying the buffer size constraints of all mobile users. Formally, the optimization problem is defined subject to the buffer size constraint of each mobile user's device, with the caching state of each mobile device as the decision variable.

Theorem 1. The optimization problem (11) is NP-hard.

Proof. By a change of variables, we can rewrite Problem (11) in an equivalent form with a cardinality constraint on the cache placement. It is easy to observe that Problem (11) has the same structure as the problem formulated in [28], which has been proved to be NP-hard.

3.6. Cache Replacement Model

We model the cache replacement process as an MDP. Besides, we discuss the details of the related state space, action space, and reward function as follows.

State Space. We define the content caching state during each decision epoch with respect to content f, which independently takes a value from the state space {0, 1}: value 1 means content f is cached at the user and 0 means the opposite. In addition, we introduce a variable to denote the content currently requested by the other users v in the decision epoch. The state of an available user during each decision epoch is then represented by the caching states together with the current request.

Action Space. The system action with respect to the state determines the cache control. All users share the same action space. Namely, the system action can be divided into two parts according to their different characteristics, as follows.

(a) Requests Handled via the D2D Link. The available cache control at adjacent users indicates whether, and which, content of the local user should be replaced by the currently requested content; a reserved action value means that the local user makes no replacement, i.e., the content request is handled by the user itself.

(b) Requests Handled by BSs. Each user can get content directly from the BSs when the D2D link fails to meet the requirements. A dedicated action is introduced to represent this case, meaning that the request is chosen to be handled directly by the BSs, namely, the user fetches the content from the BSs.

Reward Function. The reward (utility) function, which determines the reward fed back to the user when performing an action upon a state, shall be determined by the interactive wireless environment so as to lead the DRL agent in each user (introduced later) towards the desired performance. Among the QoS metrics, the most important is the hit rate of user-requested content, and our goal is to maximize it. Therefore, in our edge caching architecture, we design the reward function with an exponential function of the offloaded traffic, adopted to guide the objective of maximizing the traffic.
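To make the MDP concrete, the sketch below wires the three pieces together for a single user. The paper's exact reward is an exponential of the offloaded traffic (its formula is elided above), so a simple cache-hit indicator is used as a stand-in here; the capacity, library size, and popularity are illustrative values:

```python
import random

class CacheReplacementEnv:
    # State: (tuple of cached content ids, currently requested content id).
    # Actions 0..capacity-1: evict that slot and cache the request;
    # action == capacity: serve the request from the BS without caching.
    def __init__(self, capacity, num_contents, popularity, seed=0):
        self.capacity = capacity
        self.num_contents = num_contents
        self.popularity = popularity
        self.rng = random.Random(seed)
        self.cache = list(range(capacity))  # initially cache contents 0..C-1
        self.request = self._draw_request()

    def _draw_request(self):
        return self.rng.choices(range(self.num_contents),
                                weights=self.popularity)[0]

    def state(self):
        return tuple(self.cache), self.request

    def step(self, action):
        hit = self.request in self.cache
        # Stand-in reward: 1 for a local/D2D hit, 0 for a BS fetch.
        reward = 1.0 if hit else 0.0
        if not hit and action < self.capacity:
            self.cache[action] = self.request  # replace the chosen slot
        self.request = self._draw_request()
        return self.state(), reward
```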

4. Edge Caching Policy Discussion

In hierarchical wireless networks with cache-enabled D2D communications, we explore the maximum capacity of the network based on the mobility and social behaviours of users. The goal is to optimize network edge caching by offloading contents to users via D2D communications, thereby reducing the system cost of content exchange between the BSs and the core network via cellular links.

4.1. Problem Formulation

Based on the above analysis and combined with (15), the optimization objective is defined as maximizing the expected long-term reward value conditioned on any initial state.

Nevertheless, in general, a single-agent infinite-horizon MDP with the discounted utility (17) can be used to approximate the expected infinite-horizon undiscounted value as the discount factor approaches 1. Further, we can obtain the optimal state value function for any initial state. In conclusion, each user is expected to learn an optimal control policy that maximizes this value function for any initial state, and the optimal control policy follows accordingly.

5. Double DQN-Based Edge Cache Strategy

5.1. Reinforcement Learning

Reinforcement learning (RL) is a machine learning paradigm: an agent keeps trying, learns from its mistakes, and eventually discovers a good policy. RL problems can be described as optimal control decision-making problems in an MDP. RL takes many forms, among which the tabular Q-learning algorithm is commonly used. Q-learning is an off-policy learning algorithm that allows an agent to learn from current or past experiences.

In our D2D caching architecture, the agent residing in a user senses and obtains its current cache state. Then, the agent selects and carries out an action. Meanwhile, the environment transitions to a new state and returns a reward.

According to the Bellman equation, the optimal Q-value function can be expressed as (20), where s_i is the state at the current decision epoch i and s_{i+1} is the next state after taking action a_i. The iterative update of the Q-function can be obtained as Q(s_i, a_i) ← Q(s_i, a_i) + α[r_i + γ max_{a'} Q(s_{i+1}, a') − Q(s_i, a_i)], where α is the learning rate; the state turns into s_{i+1} when the agent chooses action a_i and receives the corresponding reward r_i. Based on (21), a Q-table can be used to store the Q value of each state-action pair when the state and action space dimensions are low, as in the Q-learning algorithm. We summarize the training algorithm based on Q-learning in Algorithm 1. The complexity of the Q-learning algorithm depends primarily on the scale of the problem: updating the Q value in a given state requires finding the maximum Q value over all possible actions in that state. If there are |A| possible actions in a given state, finding the maximum Q value requires |A| − 1 comparisons; with |S| states, a full update of the Q-table requires |S|(|A| − 1) comparisons. Hence, the learning process in Q-learning becomes extremely difficult in scenarios with huge network state and action spaces. Therefore, using a neural network to generate Q values becomes a potential solution.

Initialization: Q-Table Q(s, a)
1: for each episode
2: Initialize state s
3: for each step of episode
4: Generate ξ at random
5: if ξ < ε
6: randomly select an action a
7: else
8: choose a using the greedy policy derived from Q
9: Take action a
10: Obtain reward r and next state s′
11: Update Q-Table: Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
12: end for
13: end for
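
A runnable sketch of Algorithm 1 on a toy two-action environment (the environment and hyperparameters are illustrative, not the paper's caching setup):

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, episodes, steps, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q-table: (state, action) -> value
    for _ in range(episodes):
        state = 0  # initialize state
        for _ in range(steps):
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward = env_step(state, action)
            # Q-table update
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q

# Toy environment: single state, action 1 yields reward 1, action 0 yields 0.
Q = q_learning(lambda s, a: (0, float(a)), actions=[0, 1], episodes=10, steps=50)
```

After training, the learned Q values prefer the rewarding action, as expected.
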
5.2. Double Deep Q-Learning

DQN was the first model to successfully combine deep learning with reinforcement learning. It replaces the Q-table with a neural network, which effectively handles complicated, high-dimensional RL problems. It comes in many variations, the most famous of which is Double DQN [29]. In our model, we use Double DQN to train the DRL agents in the users, as shown in Figure 2. The Q-function is driven towards the optimal Q value by updating the parameters of the neural network. Experience replay is a core component of DQN: it is a memory that stores transitions up to a finite size, with stored entries overwritten cyclically, and it effectively removes the correlation between training data. Each transition sample represents one state transition, and the whole experience pool is the set of stored transitions. Note that each DRL agent maintains two Q networks, namely Q and Q̂, with network Q used to choose actions and network Q̂ used to evaluate them. Besides, the target network Q̂ periodically copies the weight parameters of network Q.

Throughout the training process, the DRL agent randomly samples a minibatch from the experience replay. Then, at each epoch, network Q is trained in the direction that minimizes the loss function. With (23), the gradient guiding the updates of the network parameters can be calculated, and Stochastic Gradient Descent (SGD) is performed until the Q networks converge to approximate the optimal state-action Q-function. We summarize the training algorithm based on Double DQN in Algorithm 2.

Initialization: Experience replay memory D, main network Q with random weights θ, target
network Q̂ with θ⁻ = θ, and the period C of replacing the target Q network.
1: for each episode
2: Initialize state s
3: i ← 0
4: for each step of episode
5: i ← i + 1
6: Randomly generate ξ
7: if ξ < ε
8: randomly select an action a
9: else
10: choose a = argmax_a Q(s, a; θ)
11: Take action a
12: Obtain reward r and next state s′
13: Store (s, a, r, s′) into D
14: Randomly sample a mini-batch of transitions from D
15: Update θ with a gradient step on the loss (23)
16: if i mod C = 0
17: Update θ⁻ ← θ
18: end for
19: end for
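The key difference from vanilla DQN lies in the loss (23): the main network selects the best next action, while the target network evaluates it. A small sketch of the target computation with made-up Q-value batches:

```python
def double_dqn_targets(rewards, next_q_main, next_q_target, gamma=0.9):
    # For each transition: the main network chooses the best next action,
    # the target network evaluates that action (Double DQN decoupling).
    targets = []
    for r, q_main, q_target in zip(rewards, next_q_main, next_q_target):
        best_action = max(range(len(q_main)), key=lambda a: q_main[a])
        targets.append(r + gamma * q_target[best_action])
    return targets

# One transition: the main net prefers action 1, the target net scores it 0.3.
t = double_dqn_targets([1.0], [[0.2, 0.9]], [[0.5, 0.3]])
```

This decoupling mitigates the overestimation bias of vanilla DQN, which bootstraps from the maximum of the target network's own estimates.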

Regarding algorithm complexity, it mainly comprises collecting transitions and executing backpropagation to train the parameters. Since collecting one transition requires constant computational cost, the total computational complexity of collecting T transitions into the replay memory is O(T). Let L and n denote the number of layers and the maximum number of units per layer, respectively. Training the parameters with backpropagation and gradient descent requires a computational complexity on the order of O(i·m·L·n²), where m and i denote the number of transitions randomly sampled from the replay memory and the number of iterations, respectively. Furthermore, the replay memory and the parameters of the double deep Q-learning model dominate the storage complexity. Specifically, storing T transitions needs about O(T) space, while the parameters need about O(L·n²) space.

6. Experiment

In this section, we evaluate the proposed caching policy using real traces collected from the mobile application Xender.

6.1. DataSet

Xender is a mobile app that enables offline D2D sharing. It provides a new way to share the diversified content files users are interested in without accessing 3G/4G cellular networks, largely reducing repeated traffic load and the waste of network resources, and thereby achieving resource sharing. Currently Xender has around 10 million daily and 100 million monthly active users, as well as about 110 million content deliveries per day.

We captured Xender's trace for one month (from 01/08/2016 to 31/08/2016), covering 450,786 active mobile users, 153,482 content files, and 271,785,952 content requests [30]. As shown in Figure 4, the content popularity distribution in Xender's trace can be fitted by a Mandelbrot-Zipf (MZipf) distribution with a plateau factor of −0.88 and a skewness factor of 0.35.

6.2. Parameter Settings

In our simulations, four BSs are employed, each with a maximum coverage range of 250 m; a distance-based path-loss model in dB [31] is taken as the channel gain model, and the channel bandwidth of each BS is set to 20 MHz. The delays of the D2D link, BS to MNO, and MNO to Internet are 5 ms, 20 ms, and 100 ms, respectively. Besides, the total transmit power of a BS is 40 W, serving at most 500 users. With respect to the parameter settings of Double DQN, a single-hidden-layer fully connected feedforward neural network with 200 neurons serves as both the target and the evaluation network. Other parameter values are given in Table 1.

6.3. Evaluation Results

In order to evaluate the performance of our caching strategy, we compare it with three classic cache replacement algorithms.

LRU: replace the least recently used content first.

LFU: replace the least frequently used content first.

FIFO: replace the earliest cached content first.
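
For reference, minimal sketches of the three baseline replacement policies (`request` returns True on a cache hit):

```python
from collections import Counter, OrderedDict, deque

class LRUCache:
    # Evict the least recently used content on overflow.
    def __init__(self, capacity):
        self.capacity, self.items = capacity, OrderedDict()
    def request(self, content):
        hit = content in self.items
        if hit:
            self.items.move_to_end(content)  # mark as most recently used
        else:
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)  # drop least recently used
            self.items[content] = True
        return hit

class LFUCache:
    # Evict the least frequently used content on overflow.
    def __init__(self, capacity):
        self.capacity, self.counts, self.items = capacity, Counter(), set()
    def request(self, content):
        self.counts[content] += 1
        hit = content in self.items
        if not hit:
            if len(self.items) >= self.capacity:
                self.items.discard(min(self.items, key=self.counts.__getitem__))
            self.items.add(content)
        return hit

class FIFOCache:
    # Evict the earliest cached content on overflow.
    def __init__(self, capacity):
        self.capacity, self.queue = capacity, deque()
    def request(self, content):
        hit = content in self.queue
        if not hit:
            if len(self.queue) >= self.capacity:
                self.queue.popleft()
            self.queue.append(content)
        return hit
```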

Figure 5 shows the performance comparison in terms of cache hit rate, delay, and traffic at F = 1000 and C = 100M. As we can see, at the beginning of the simulation our proposed caching strategy is at a clear disadvantage in all three aspects, but the hit rate soon increases and eventually stabilizes. This is because our reward function rewards cache hits, so the DRL agent is dedicated to maximizing the system hit rate. Our caching strategy achieves a hit rate 9%, 12%, and 14% higher than LRU, LFU, and FIFO, respectively. The improvement of the hit rate in turn benefits the delay and traffic indicators: the delay of our strategy is 12%, 17%, and 21% lower than that of LRU, LFU, and FIFO, respectively, and the traffic saved is 8%, 10%, and 14% higher, respectively.

In addition, we explored the effect of the number of contents on the comparison results, considering 1000 and 2000 contents. As shown in Figure 6, when the number of contents increases, the convergence of the algorithm slows and the hit rate decreases. However, this does not change the overall trend: our caching strategy still performs best among the four algorithms.

Finally, we explored the effects of the learning rate and the exploration probability on the performance of our algorithm. As shown in Figure 7, the learning rate is set to 0.5 and 0.05, and the exploration probability to 0.1 and 0.5, respectively. Both factors have a great impact on the caching strategy, mainly in terms of convergence and performance. Thus, a large number of experiments were performed to find an appropriate learning rate and exploration probability for the proposed edge caching scenarios, and the better-performing values are selected in our setting.

7. Conclusions

In this paper, we study the edge caching strategy of hierarchical wireless networks. Specifically, we use a Markov decision process and Deep Reinforcement Learning in the proposed edge cache replacement strategy. Experimental results based on real traces show that our proposed strategy is superior to LRU, LFU, and FIFO in terms of hit rate, delay, and traffic offload. Finally, we also explored the impact of the learning rate and exploration probability on algorithm performance.

In the future, we will focus more on the user layer's impact on cache replacement. (1) In the existing D2D model, the transmission process of files is not persistent, and complex user movement can lead to the interruption of content delivery; we will consider this factor in the reward function. (2) The cache replacement process incurs additional costs, such as latency and energy consumption, all of which should be considered, but how to quantify these factors in simulation experiments still needs to be explored. (3) The computing resources of user devices are limited; although Deep Reinforcement Learning can solve the problem of dimensional explosion, it still requires substantial computing resources. Therefore, we will explore the application of more lightweight learning algorithms in D2D-aided cellular networks.

Data Availability

The data used to support the findings of this study have not been made available for commercial reasons.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

The conference version of this manuscript was first presented at the 2018 IEEE 15th International Conference on Mobile Ad Hoc and Sensor Systems (MASS). The authors have significantly extended that work in this journal version by exploiting the edge caching problem with a deep reinforcement learning framework. This work was supported in part by the National Key Research and Development Program of China under grant 2018YFC0809803 and in part by the Natural Science Foundation of China under grant 61702364.