Recently, the application of deep reinforcement learning (DRL) in recommender systems has flourished, standing out by overcoming the drawbacks of traditional methods and achieving high recommendation quality. The dynamics, long-term returns, and sparse data issues of recommender systems have been effectively addressed. However, the application of deep reinforcement learning brings problems of interpretability, overfitting, complex reward function design, and user cold start. This study proposes a tag-aware recommender system based on deep reinforcement learning that requires no complex reward function design and uses tags to compensate for the lack of interpretability in the recommender system. Our experiment is carried out on the MovieLens dataset. The results show that the DRL-based recommender system is superior to traditional algorithms in minimum error and that the use of tags has little effect on accuracy while improving interpretability. In addition, the DRL-based recommender system performs well on the user cold start problem.

1. Introduction

With the increasing amount of information and ever easier access to it, the range of goods, movies, and restaurants that users can choose from has grown significantly. On the one hand, mass information brings more convenience; on the other hand, information overload brings the trouble of overchoice. The recommender system is an information filtering tool that deals with this problem by providing users with useful guidance in a highly personalized manner [1].

As a subarea of machine learning, deep reinforcement learning-based recommender systems have gained significant attention by overcoming the drawbacks of traditional methods and achieving high recommendation quality. Traditional algorithms regard recommendation as a static process, while DRL-based recommender algorithms handle the dynamic changes in the interests and distributions of users and items. Owing to MDP modeling and cumulative rewards, long-term returns are taken into account to improve user stickiness. The use of deep neural networks effectively alleviates sparse data issues in the recommender system.

Two main DRL algorithms are applied in recommender systems, DQN and actor-critic, which are elaborated in the related work. Currently, most successfully applied methods are based on DQN. However, Q value-based deep reinforcement learning algorithms are only suitable for low-dimensional, discrete action spaces; DQN was first proposed for Atari games, which have only a small number of discrete actions. For the rating prediction problem studied here, which has ten specific ratings, the Q value-based method is no longer appropriate. In contrast, actor-critic-based deep reinforcement learning algorithms are not limited to discrete action spaces and can even handle continuous action spaces, so the algorithm in this study is built on the actor-critic framework. This trend and the importance of the topic motivated us to prepare this study. Specifically, we apply DDPG as our basic algorithm.

However, every coin has two sides. Deep neural networks lack interpretability and are prone to overfitting. Reinforcement learning needs complex reward function design and can hardly handle the user cold start problem. To handle the problems caused by combining these methods, we propose a tag-aware recommender system based on deep reinforcement learning.

First, we model the recommendation problem (rating prediction) with deep reinforcement learning. The deep reinforcement learning approach models recommendation as a dynamic process. In space, it scales as the number of users and items increases. In time, it adapts to the dynamic changes in user interests, taking into account not only short-term returns but also long-term returns. Besides, complex preprocessing of the datasets is not necessary, and the algorithm can automatically learn feature representations from scratch [2].

Second, tag information is used to compensate for the lack of interpretability in the recommender system. By Wikipedia's definition, a tag (also called a label) is a nonhierarchical keyword used to describe information, which can be used to describe the semantics of an item. Depending on who tags the item, there are generally two types of tagging applications: one asks the author or an expert to tag the item, and the other allows regular users to tag the item, the latter also being called user-generated content. When a user tags an item, the tag describes the user's interests on the one hand and the semantics of the item on the other. Users apply tags to describe their views on items, so tags are an important link between users and items. Tags also reflect the interests of users, and as an important data source, their effective use greatly helps to improve the quality of personalized recommendation results [3]. Douban, for example, makes good use of tag data, increasing both the diversity and the interpretability of recommendations.

Finally, we address the user cold start problem [4]. The recommender system needs to predict a user's future behavior and interests according to the user's historical behavior and interests, so a large amount of user behavior data is an important prerequisite. Designing a personalized recommender system without a large amount of user data is the cold start problem. This study focuses on user cold start, that is, how to make personalized recommendations for new users. Deep reinforcement learning can uncover the potential connections between user characteristics and item characteristics. Hence, it has potential advantages in solving cold start problems.

Our contributions are listed as follows:
(1) We apply deep reinforcement learning to rating prediction, adapting to the dynamics of users and items and taking long-term returns into account. Specifically, we use the DDPG algorithm, which has seldom been used for this task.
(2) Complex user tag data are cleaned and filtered to improve the interpretability of the recommender system.
(3) The trained neural network learns the nonlinear relationship between users and items; meanwhile, we use a double network architecture to prevent overfitting, resulting in better adaptability to new users. The user cold start issue is therefore relieved to some degree.

The rest of this study is organized as follows. Section 2 briefly reviews the work related to the combination of deep reinforcement learning and recommender systems. Section 3 first defines the recommendation problem in terms of deep reinforcement learning and then uses the DDPG algorithm to predict ratings. Section 4 carries out experiments on the MovieLens 20M dataset, and the results show that our algorithm is superior to traditional algorithms in minimum error and performs well on the user cold start problem. Section 5 concludes this study and puts forward directions for future work.

2. Related Work

The recommender system aims to predict users' preferences for items and to automatically recommend items that users may be interested in [5, 6]. Recommendation algorithms are usually classified into three categories [5, 7]: collaborative filtering, content-based, and hybrid recommender systems. Collaborative filtering makes recommendations according to users' or items' historical records, either explicit or implicit. Content-based recommendation relies on auxiliary information about items and users, such as voice, images, and videos. Hybrid recommendation integrates at least two different recommendation algorithms [7, 8].

Since the Netflix Prize competition, researchers from many countries have proposed numerous rating prediction algorithms. Traditional algorithms include average-based prediction, in which predictions are made by calculating the average of the ratings, and the neighborhood-based approach, in which predictions are calculated from the similarity of users or items [9]. With the development of machine learning, the latent factor model [10] and matrix factorization [11] were proposed, the essence of which is to complete the rating matrix by means of dimensionality reduction. Representative algorithms include SVD, LFM, and SVD++ [12], which incorporates implicit feedback on the basis of SVD.

Reinforcement learning operates on a trial-and-error paradigm [13]. The basic model is composed of the following components: agents, environments, states, actions, and rewards. The combination of deep neural networks and reinforcement learning forms DRL, which has achieved human-level performance across multiple domains such as games and self-driving cars. Deep neural networks enable the agent to learn from scratch. Common DRL algorithms include DQN, DDQN, dueling DQN, and DDPG.

Recently, DRL has obtained good results [14–16] in recommender systems. Zhao et al. [17] explored the page-wise recommendation scenario with DRL; the proposed framework, DeepPage, is able to adaptively optimize a page of items based on the user's real-time actions. On this basis, a list-wise method was further proposed [18]; these two articles mainly address the ranking problem in the recommender system and apply the DDPG framework. Zhao et al. [19] proposed a DRL framework, DEERS, for recommendation with both negative and positive feedback in a sequential interaction setting, especially highlighting the importance of negative feedback. That study also demonstrates the effectiveness of the proposed framework in a real-world e-commerce setting. Zheng et al. [20] proposed a news recommender system, DRN, based on DRL to tackle the following three challenges: (1) dynamic changes of news content and user preference, (2) reliance on a single type of feedback, and (3) diversity of recommendations. That study takes into account not only click labels or ratings but also user stickiness. Chen et al. [21] proposed a robust deep Q-learning algorithm to address the instability issue with two strategies: stratified sampling replay and approximate regretted reward. The former addresses the problem from the sample aspect, while the latter addresses it from the reward aspect. DQN-based algorithms alleviate the problem of distribution shift in dynamic environments but need complex reward functions. Chen et al. [22] obtained a more predictive user model and learned the reward function in a way consistent with the user model. The learned reward function benefits reinforcement learning in a more principled way than relying on hand-designed rewards. The user model enables model-based RL and online adaptation to new users, which addresses the user cold start problem.
Although complex reward functions no longer need to be built when using user models, the design of reward functions is still required during the user model building phase. Choi et al. [14] proposed solving the cold start problem with RL and biclustering. That study uses biclustering to alleviate the cold start problem and provides interpretability for the recommender system. Munemasa et al. [15] proposed using DRL for store recommendation. A state-of-the-art survey divides DRL-based recommender systems into three categories according to the DRL algorithm used: DQN, actor-critic, and REINFORCE [23].

3. Methods

The behavior of users rating movies is a typical sequential decision process, which accords with the delayed feedback in reinforcement learning, so we apply reinforcement learning to model the recommendation problem. In this study, the dataset of user rating records is viewed as the environment, and the agent needs to perceive the environment when predicting ratings. Reinforcement learning is usually modeled as a Markov decision process (MDP), which is a tuple $(S, A, P, R, \gamma)$, so our model is defined as follows.

3.1. Problem Definition
3.1.1. State Space

The state should be able to represent the explicit features of users and movies, respectively, as well as the implicit features between users and movies. Based on the MovieLens dataset, each state has 28 dimensions, and the states are sorted in chronological order by rating timestamp. Suppose ,

3.1.2. Action Space

Our goal is to predict users' ratings of movies, so we directly regard ratings as actions. Ratings range from 0.5 to 5 in half-star increments; thus, there are 10 discrete ratings in total, and the action space has 10 actions.
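As a minimal sketch of how a real-valued actor output could be mapped onto this action space, the snippet below snaps a continuous prediction to the nearest half-star rating. The function name `to_rating` and the clipping to the range 0.5 to 5 are illustrative assumptions, not details given in this study.

```python
# The ten valid ratings: 0.5, 1.0, ..., 5.0.
RATINGS = [0.5 * k for k in range(1, 11)]


def to_rating(raw_action: float) -> float:
    """Snap a continuous action to the nearest valid half-star rating (assumed mapping)."""
    # Keep the value inside the rating scale first.
    clipped = min(max(raw_action, RATINGS[0]), RATINGS[-1])
    # Ratings lie on a 0.5 grid, so rounding 2 * x selects the nearest grid point.
    return round(clipped * 2) / 2
```

For instance, a raw output of 3.7 would be mapped to 3.5, and any output above 5 would be mapped to 5.0.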

3.1.3. Reward Function

The key to rating prediction is accuracy. The larger the difference between the predicted rating and the actual rating, the smaller the reward; on the contrary, the smaller the difference, the larger the reward. The reward function in this article is therefore the negative of the difference between the predicted rating and the true rating. Since user ratings are the only feedback used, this article does not require complex reward function design or reward shaping.
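One way to read this description is sketched below. The use of the absolute difference is an assumption made for illustration; the study only states that the reward shrinks as the gap between predicted and true rating grows.

```python
def reward(predicted: float, actual: float) -> float:
    """Reward is the negated gap between predicted and true rating (assumed absolute gap)."""
    return -abs(predicted - actual)
```

Under this reading, a perfect prediction receives the maximum reward of 0, and a prediction that is two stars off receives a reward of -2.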

3.1.4. Discount Factor

The discount factor is $\gamma \in [0, 1]$. When $\gamma = 0$, the recommender system only takes the immediate reward into consideration, and when $\gamma = 1$, all future rewards are fully counted.
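The effect of the discount factor can be illustrated with a short sketch that accumulates a sequence of rewards. The function name `discounted_return` is hypothetical and the reward sequence is made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, computed backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With rewards of -1 at three consecutive steps, a discount factor of 0 counts only the first reward, while a discount factor of 1 counts all three equally.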

3.2. DDPG-Based Rating Prediction Algorithm

DDPG [24] stands for deep deterministic policy gradient, a combination of the actor-critic and DQN algorithms [25]. “Deep” refers to the experience replay pool and double network structure adopted from DQN to promote effective neural network learning. “Deterministic” means that the actor no longer outputs the probability of each action but rather a specific action, which helps learning in continuous action spaces. The theoretical basis of our proposed algorithm is DDPG, as follows.

Figure 1 shows the network structure of DDPG.

We call the two networks in the actor the action estimation network (rating prediction network) and the action reality network, and the two networks in the critic the state estimation network and the state reality network.

DDPG applies a double network structure similar to DQN, and both the actor and the critic have a TargetNet and an EvalNet. It is important to emphasize that we only train the parameters of the action estimation network (rating prediction network) and the state estimation network, while the parameters of the action reality network and the state reality network are copied from the first two networks at certain times.

First, on the critic's side, the learning process is similar to that of DQN. The networks in DQN are learned with the following loss function, namely, the squared loss between the real Q value and the estimated Q value:

$$L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2,$$

where $Q(s_i, a_i \mid \theta^{Q})$ is obtained from the state estimation network, and $a_i$ is the action passed over by the action estimation network (rating prediction network).

$y_i$ is the real Q value:

$$y_i = r_i + \gamma Q'(s_{i+1}, a' \mid \theta^{Q'}), \quad a' = \mu'(s_{i+1} \mid \theta^{\mu'}).$$

Instead of using a greedy strategy to select the action $a'$, we directly get the action $a'$ through the action reality network.

In general, the critic's state estimation network is trained on the squared loss between the real Q value and the estimated Q value. The estimated Q value is obtained by feeding the current state $s_i$ and the action $a_i$, which is output by the action estimation network (rating prediction network), into the state estimation network. The real Q value is obtained by feeding the next state $s_{i+1}$ and the action $a'$ of the action reality network into the state reality network, discounting the result, and adding the actual reward $r_i$.
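The critic update described above can be sketched in a framework-free way, with the networks replaced by plain Python callables. The names `critic_targets` and `critic_loss`, and the toy callables used to exercise them, are assumptions for illustration only; a real implementation would use neural networks and automatic differentiation.

```python
def critic_targets(batch, target_actor, target_critic, gamma=0.9):
    """Real Q value per transition: y = r + gamma * Q'(s_next, mu'(s_next))."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for (_s, _a, r, s_next) in batch
    ]


def critic_loss(batch, targets, critic):
    """Mean squared error between the real Q values and the estimated Q values."""
    errors = [
        (y - critic(s, a)) ** 2
        for (s, a, _r, _s_next), y in zip(batch, targets)
    ]
    return sum(errors) / len(errors)
```

Each element of `batch` is a transition `(state, action, reward, next_state)`. The action fed into the target critic comes from the target actor rather than from a greedy maximization, which is the difference from DQN noted in the text.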

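Second, on the actor's side, we estimate the parameters of the action estimation network (rating prediction network) according to the following formula [24]:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s_i}.$$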

Let us use an example to explain this formula. Suppose that, for the same state, the action estimation network (rating prediction network) predicts two different ratings a1 and a2 and gets two feedback Q values from the state estimation network, Q1 and Q2. Assume Q1 > Q2, that is, rating 1 is closer to the true value; then, following the idea of policy gradient, the probability of action 1 is increased and that of action 2 is decreased. In other words, the actor wants to obtain as large a Q value as possible. Therefore, the loss of the actor can be simply understood as follows: the greater the feedback Q value, the smaller the loss, and the smaller the feedback Q value, the greater the loss.

In addition, the traditional DQN updates the TargetNet parameters in a “hard” mode, that is, the network parameters in EvalNet are assigned to TargetNet every fixed number of steps. In contrast, DDPG updates the TargetNet parameters in a “soft” mode, that is, the parameters in TargetNet are updated slightly at every step. This method of parameter updating has been shown in experiments to greatly improve the stability of learning.
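The difference between the two update modes can be sketched with parameters stored as plain lists of floats. The function names are hypothetical; the soft rule shown is the one used in DDPG [24], where the target parameters move a fraction tau toward the trained parameters.

```python
def hard_update(source):
    """DQN-style update: the target becomes an exact copy of the trained parameters."""
    return list(source)


def soft_update(target, source, tau=0.001):
    """DDPG-style update: target <- tau * source + (1 - tau) * target, applied every step."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(source, target)]
```

With a small tau such as 0.001, the target network changes only slightly at each step, which keeps the real Q values used for training from shifting abruptly.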

Algorithm 1 is the DDPG-based rating prediction algorithm.

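Algorithm 1. DDPG-based rating prediction algorithm.
(1) Randomly initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$
(2) Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
(3) Initialize replay buffer $R$
(4) For episode $= 1, \ldots, M$ do
(5) Initialize a random process $\mathcal{N}$ for action exploration
(6) Receive initial record $s_1$
(7) For $t = 1, \ldots, T$ do
(8) Predict rating $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ according to the current policy and exploration noise
(9) Apply rating $a_t$ and observe reward $r_t$ and next record $s_{t+1}$
(10) Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
(11) Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
(12) Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
(13) Update the critic by minimizing the loss: $L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$
(14) Update the actor policy using the sampled gradient: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s_i}$
(15) Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$
(16) End for
(17) End for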
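Steps (3), (10), and (11) of Algorithm 1 rely on a replay buffer. A minimal sketch is given below; the class name `ReplayBuffer` and its method names are illustrative assumptions, while the default capacity of 10000 and batch size of 32 follow the parameter setting reported in Section 4.4.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped when full."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        # Step (10): keep the transition for later training.
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Step (11): uniform random minibatch, which breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling should only start once the buffer holds at least `batch_size` transitions, since `random.sample` raises an error otherwise.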

4. Experiment

4.1. Dataset

MovieLens datasets are widely used in recommendation research. Our experiment employs the MovieLens 20M dataset, which contains 20 million ratings and tag applications applied to 27278 movies by 138493 users. Only users with at least 20 ratings are included.

Unlike earlier datasets, the 20M dataset does not include any demographic information (age, gender, occupation, and zip code), which is no longer collected on the site, but it does include tag applications [26]. Besides, 20M includes a table mapping MovieLens movie IDs to movie IDs on two external sites to allow dataset users to build more complete content-based representations of the items.

Table 1 gives an overview of the ML-20M dataset.

Before the experiment, we preprocessed the dataset:
(1) Tags are words or short phrases applied by users to movies. This study does not use word2vec or any other NLP method; instead, it directly selects the 1127 most commonly used tags, assigns ID numbers according to the tags' initial letters, and uses tagId directly as a feature.
(2) This study only selects users who have both rated and tagged a movie.
(3) All features are normalized.
(4) Each record includes both a tag and a rating.

Specifically, our dataset is sorted chronologically by rating timestamp; the first 80% of the data are used for training, and the last 20% are used for testing.

To test the user cold start problem, users in the test set are divided into two parts: 502 users are old users (present in the training set), and the remaining 960 users are new users (absent from the training set). There are 21227 records for old users and 21902 records for new users.
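The chronological split and the separation of old and new users could be carried out as in the following sketch. The record layout (dictionaries with `user` and `timestamp` keys) and both function names are assumptions for illustration, not the preprocessing code used in this study.

```python
def chronological_split(records, train_fraction=0.8):
    """Sort records by timestamp and cut them into an earlier and a later part."""
    ordered = sorted(records, key=lambda rec: rec["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def split_test_users(train, test):
    """Separate test records of users seen in training from those of unseen users."""
    seen = {rec["user"] for rec in train}
    old = [rec for rec in test if rec["user"] in seen]
    new = [rec for rec in test if rec["user"] not in seen]
    return old, new
```

The records of new users form the test set used for the cold start evaluation, since none of those users contribute any training data.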

Our experiment datasets are given in Table 2.

The preprocessing of experimental data in this study follows TRSDL [27]; although the evaluation measures are the same, the experimental results are not comparable because of the different data preprocessing methods.

4.2. Evaluation Measures
4.2.1. MAE (Mean Absolute Error)

MAE is used to measure the difference between the true rating and the estimated rating of recommendation algorithms.

4.2.2. RMSE (Root Mean Squared Error)

RMSE is the evaluation criterion used by the Netflix Prize. The smaller the RMSE, the more accurate the algorithm.
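Both measures can be computed directly from paired lists of true and predicted ratings. The sketch below uses the standard definitions, $\mathrm{MAE} = \frac{1}{n}\sum |r - \hat{r}|$ and root mean squared error $= \sqrt{\frac{1}{n}\sum (r - \hat{r})^2}$; the function names are chosen for illustration.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more heavily than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Because the errors are squared before averaging, the root mean squared error is never smaller than the MAE on the same predictions.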

4.3. Compared Methods

This study selects some classic algorithms in the recommender system for comparative analysis.

4.3.1. Normal Predictor

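An algorithm that predicts a random rating based on the distribution of the training set, which is assumed to be normal.

The prediction is generated from a normal distribution $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$, where $\hat{\mu}$ and $\hat{\sigma}$ are estimated from the training data using maximum likelihood estimation:

$$\hat{\mu} = \frac{1}{|R_{\text{train}}|} \sum_{r_{ui} \in R_{\text{train}}} r_{ui}, \qquad \hat{\sigma} = \sqrt{\sum_{r_{ui} \in R_{\text{train}}} \frac{(r_{ui} - \hat{\mu})^2}{|R_{\text{train}}|}}.$$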
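A small sketch of this baseline follows. The function names are hypothetical, and clipping the sampled value to the rating scale is an assumption added so that the output is always a plausible rating.

```python
import math
import random


def fit_normal(train_ratings):
    """Maximum likelihood estimates of the mean and standard deviation."""
    n = len(train_ratings)
    mu = sum(train_ratings) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in train_ratings) / n)
    return mu, sigma


def predict_normal(mu, sigma, rng, low=0.5, high=5.0):
    """Draw a rating from N(mu, sigma^2), clipped to the rating scale (assumed)."""
    return min(max(rng.gauss(mu, sigma), low), high)
```

Passing a seeded `random.Random` instance as `rng` makes the random predictions reproducible across runs.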

4.3.2. Coclustering

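A collaborative filtering algorithm based on coclustering [28]. Basically, users and items are assigned some clusters $C_u$, $C_i$, and some coclusters $C_{ui}$. The prediction is set as

$$\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i}),$$

where $\overline{C_{ui}}$ is the average rating of cocluster $C_{ui}$, $\overline{C_u}$ is the average rating of $u$'s cluster, $\overline{C_i}$ is the average rating of $i$'s cluster, and $\mu_u$ and $\mu_i$ are the mean ratings of user $u$ and item $i$.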

4.3.3. KNN Basic

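A basic collaborative filtering algorithm. In its user-based form, the prediction is set as

$$\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)},$$

where $N_i^k(u)$ is the set of the $k$ nearest neighbors of user $u$ that have rated item $i$, and $\operatorname{sim}(u, v)$ is the similarity between users $u$ and $v$.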

4.3.4. KNN with Means

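A basic collaborative filtering algorithm, taking into account the mean ratings of each user. In its user-based form, the prediction is set as

$$\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot (r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)},$$

where $\mu_u$ is the mean rating of user $u$, $N_i^k(u)$ is the set of the $k$ nearest neighbors of $u$ that have rated item $i$, and $\operatorname{sim}(u, v)$ is the similarity between users $u$ and $v$.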

4.3.5. KNN with Baseline

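A basic collaborative filtering algorithm taking a baseline rating into account. In its user-based form, the prediction is set as

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)},$$

where $b_{ui}$ is the baseline rating of user $u$ for item $i$, $N_i^k(u)$ is the set of the $k$ nearest neighbors of $u$ that have rated item $i$, and $\operatorname{sim}(u, v)$ is the similarity between users $u$ and $v$.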

4.3.6. Slope One

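A simple yet accurate collaborative filtering algorithm [29]. The prediction is set as

$$\hat{r}_{ui} = \mu_u + \frac{1}{|R_i(u)|} \sum_{j \in R_i(u)} \operatorname{dev}(i, j),$$

where $R_i(u)$ is the set of relevant items, i.e., the set of items $j$ rated by $u$ that also have at least one common user with $i$. $\operatorname{dev}(i, j)$ is defined as the average difference between the ratings of $i$ and those of $j$:

$$\operatorname{dev}(i, j) = \frac{1}{|U_{ij}|} \sum_{u \in U_{ij}} \left( r_{ui} - r_{uj} \right),$$

where $U_{ij}$ is the set of users who have rated both $i$ and $j$.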
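The Slope One prediction can be sketched in a few lines. The nested-dictionary layout of `ratings` and the fallback to the user's mean rating when no relevant item exists are assumptions made for this illustration.

```python
def slope_one_predict(ratings, user, item):
    """ratings maps user -> {item: rating}; returns the user's mean plus the mean deviation."""
    user_ratings = ratings[user]
    mu_u = sum(user_ratings.values()) / len(user_ratings)
    deviations = []
    for j in user_ratings:
        if j == item:
            continue
        # Users who rated both the target item and item j.
        common = [r for r in ratings.values() if item in r and j in r]
        if common:
            # Average difference between ratings of the target item and item j.
            deviations.append(sum(r[item] - r[j] for r in common) / len(common))
    if not deviations:
        return mu_u  # no relevant items: fall back to the user's mean (assumed)
    return mu_u + sum(deviations) / len(deviations)
```

For example, if two other users rated the target item 2 and 1 stars higher than an item the current user rated, the prediction is the current user's mean rating plus 1.5.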

4.3.7. SVD++

The famous SVD algorithm was popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to probabilistic matrix factorization [30]. The prediction is set as $\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$. If user $u$ is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item $i$ with $b_i$ and $q_i$ [31, 32].
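The treatment of unknown users and items can be made concrete with a short sketch. The dictionary-based `model` layout and the function name `svd_predict` are hypothetical; learning the biases and factors is outside the scope of this example.

```python
def svd_predict(model, user, item):
    """Global mean + user bias + item bias + dot product of the latent factors."""
    zeros = [0.0] * model["n_factors"]
    # Unknown users or items contribute zero bias and zero factors.
    b_u = model["user_bias"].get(user, 0.0)
    p_u = model["user_factors"].get(user, zeros)
    b_i = model["item_bias"].get(item, 0.0)
    q_i = model["item_factors"].get(item, zeros)
    return model["mu"] + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))
```

For a user that never appeared in training, the prediction reduces to the global mean plus the item bias, which is why such models carry little personalized information for new users.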

The SVD++ algorithm is an extension of SVD taking into account implicit ratings.

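The prediction is set as

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} \left( p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j \right),$$

where the $y_j$ terms are a new set of item factors that capture implicit ratings and $I_u$ is the set of items rated by user $u$. Here, an implicit rating describes the fact that a user $u$ rated an item $j$, regardless of the rating value.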

4.3.8. NMF

A collaborative filtering algorithm based on nonnegative matrix factorization. This algorithm is very similar to SVD. The prediction is set as $\hat{r}_{ui} = q_i^{T} p_u$, where user and item factors are kept positive. Our implementation follows that suggested in [33], which is equivalent to [34] in its nonregularized form. Both are direct applications of NMF for dense matrices.

4.3.9. DQN

The neural network (supervised learning) used by DQN is trained using a variant of the Q-learning algorithm, using SGD to update the weights and an experience replay mechanism that eliminates the correlation between data by randomly sampling from past transitions.

The update formula for the Q value is as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right].$$
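The standard Q-learning update can be illustrated with a table in place of the neural network. This tabular form is a deliberate simplification: DQN approximates the same quantity with a network trained by SGD. The function name `q_update` and the dictionary-based table are assumptions for the example.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.001, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

The maximization over `actions` is what restricts this family of methods to small discrete action sets, in contrast to the actor-critic approach used by DDPG.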

4.4. Parameter Setting

We determine the parameter settings according to experience, grid search, and random search. The replay memory size is 10000, the batch size is 32, gamma is 0.9, the learning rate is 0.001, tau is 0.001, and the number of episodes is 10000.
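For reference, these settings can be collected in a single configuration object, as in the sketch below. The class name `DDPGConfig` and the field names are hypothetical; only the values come from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPGConfig:
    """Hyperparameters reported in this section (field names are illustrative)."""
    replay_memory_size: int = 10000
    batch_size: int = 32
    gamma: float = 0.9          # discount factor
    learning_rate: float = 0.001
    tau: float = 0.001          # soft update rate for the target networks
    episodes: int = 10000
```

Freezing the dataclass prevents accidental changes to the settings during a training run.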

4.5. Experimental Results

Our experiment is carried out on the processed ML-20M dataset. First, we use the traditional algorithms and DQN as baselines for comparison. Then, the tag-aware DDPG algorithm proposed in this study is employed to calculate the error; meanwhile, to verify whether tags have an effect on the error, this study also calculates the error without tags for comparative analysis. Finally, we select a test set that only contains new users to independently verify the user cold start issue.

Table 3 provides the minimum errors of various algorithms. SVD++ performs best among all traditional algorithms, with an MAE and RMSE of 0.3816 and 0.5620. The error of our algorithm is slightly lower than that of SVD: the MAE and RMSE of tag-free reach 0.3720 and 0.5432, and the MAE and RMSE of tag-aware reach 0.3900 and 0.5577. The comparison between DQN and DDPG also shows the superiority of the latter.

It is clear that the best results of our algorithm are superior to those of traditional methods. In addition to the advantage in reducing error, the DDPG-based recommender system is more scalable as the number and features of users and items grow and can adapt to the dynamic changes of users and items as well. What is more, deep learning is good at uncovering potential connections between users and items, which provides a better basis for optimizing the long-term user experience.

With RMSE as the evaluation indicator, DDPG's performance is given in Table 4.

Judging from the performance of DDPG, although it outperforms the traditional algorithms in the minimum value, its robustness is poor. Specifically, the range between the minimum value and the maximum value is large. Besides, the use of tag applications has little effect on the error but adds interpretability to the recommender system. What is more, when all users in the test set are new users, DDPG performs well. More details are analyzed below.

4.5.1. Tag-Free and Tag-Aware

Table 5 and Figure 2 show the comparison between tag-free and tag-aware in tabular and graphical form, respectively.

Since reinforcement learning learns from scratch, the initial predicted ratings are rather random, making the training error very large at first. However, as training time increases, reinforcement learning gradually learns the correct strategy, and the error decreases and stabilizes.

Figures 3 and 4 show the RMSE of tag-free and tag-aware, respectively. The average RMSE of tag-free is 0.9448, and the best RMSE is 0.5274. The average RMSE of tag-aware is 0.9456, and the best RMSE is 0.5577. The results show that the use of tags has little effect on accuracy; however, it compensates for the lack of interpretability in the recommender system.

Figures 5 and 6 show the distribution of the RMSE of tag-free and tag-aware. The error distribution approximately follows a normal distribution, and the number of errors on the left side of the mean is greater than that on the right side; that is, most errors are concentrated in the interval with smaller errors. It can be seen that the DDPG algorithm tends to produce smaller errors, which means that it is more accurate.

4.5.2. Cold Start

Table 6 and Figure 7 show the results of cold start in tabular and graphical form, respectively.

Figure 8 shows the distribution of the RMSE for cold start. When the test set only contains new users, the accuracy of DDPG is still very high. It can be speculated that deep reinforcement learning is a good solution to the problem of user cold start, which the earlier algorithms cannot solve.

DDPG shows a lower error in dealing with the user cold start problem, which suggests that the method adopted in our study is effective against overfitting.

The error distribution approximately follows a normal distribution, and the DDPG algorithm also tends to produce smaller errors.

To sum up, the DDPG-based recommender system has three advantages:
(a) Enhanced accuracy and reduced errors
(b) Increased interpretability
(c) Reduced overfitting and relief of the cold start problem

5. Conclusion

The combination of deep reinforcement learning and recommender systems has become a popular trend, and Internet giants such as Google and Alibaba have both done a great deal of theoretical exploration and engineering practice in this area. In this study, the DDPG algorithm is applied to predict ratings in the recommender system. Since the basic DDPG algorithm is generally used to deal with large-scale continuous actions, this study first discretizes the continuous action, which is the rating of movies. Although the average error is higher than that of traditional algorithms, the minimum error is smaller than that of the existing recommendation algorithms, and the results of this experiment tend to have smaller errors. Then, without a notable increase in error, tags are used to compensate for the lack of interpretability in the recommender system. Finally, on the issue of user cold start, the experiment shows that the recommendation algorithm used in this study has smaller errors, and it is also effective against overfitting.

For future work, we have the following directions. (1) Scalability. This study uses the MovieLens 20M dataset, and we can continue our research on the 25M and latest datasets to explore scalability problems. (2) Robustness. Although the error of the DDPG algorithm converges to a good result, the error range is large; hence, there is room to improve robustness. (3) Parameters. The DDPG algorithm requires a lot of tuning, which is a common problem in machine learning. We want to propose more adaptable recommendation algorithms.

Data Availability

The data used to support the findings of this study are available in MovieLens datasets (https://grouplens.org/datasets/movielens/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (61806221).