Collective Behavior Analysis and Graph Mining in Social Networks
Zhao Li, Long Zhang, Chenyi Lei, Xia Chen, Jianliang Gao, Jun Gao, "Attention with Long-Term Interval-Based Deep Sequential Learning for Recommendation", Complexity, vol. 2020, Article ID 6136095, 13 pages, 2020. https://doi.org/10.1155/2020/6136095
Attention with Long-Term Interval-Based Deep Sequential Learning for Recommendation
Abstract
Modeling user behaviors as sequential learning provides key advantages in predicting future user actions, such as predicting the next product to purchase or the next song to listen to, for the purpose of personalized search and recommendation. Traditional methods for modeling sequential user behaviors usually depend on the premise of Markov processes, while recently recurrent neural networks (RNNs) have been adopted to leverage their power in modeling sequences. In this paper, we propose integrating an attention mechanism into RNNs for better modeling sequential user behaviors. Specifically, we design a network featuring Attention with Long-term Interval-based Gated Recurrent Units (ALI-GRU) to model temporal sequences of user actions. Compared to previous works, our network can exploit temporal information extracted by time interval-based GRUs in addition to normal GRUs when encoding user actions, and it has a specially designed matrix-form attention function to characterize both long-term preferences and short-term intents of users; the attention-weighted features are finally decoded to predict the next user action. We have performed experiments on two well-known public datasets as well as a huge dataset built from real-world data of one of the largest online shopping websites. Experimental results show that the proposed ALI-GRU achieves significant improvement compared to state-of-the-art RNN-based methods. ALI-GRU is also adopted in a real-world application, and the results of an online A/B test further demonstrate its practical value.
1. Introduction
Due to the increasing abundance of information on the Web, helping users filter information according to their preferences is increasingly required, which emphasizes the importance of personalized search and recommendation [42–45]. Traditional methods for providing personalized content, such as item-item collaborative filtering [33], did not take into account the dynamics of user behaviors, which have recently been recognized as important factors. For example, to predict the user's next action, such as the next product to purchase, profiling both the long-term preferences and short-term intents of the user is required, where modeling the user's behaviors as sequences provides key advantages. Nonetheless, modeling sequential user behaviors with the temporal dimension raises even more challenges than modeling them without it. How to identify the correlation and dependence among actions is one of the difficult issues. This problem has been studied extensively, and many methods based on the Markov process assumption, such as the Factorizing Personalized Markov Chain [32] and the Hierarchical Representation Model [41], have been designed and adopted in different tasks [11, 15]. These methods usually focus on factor models, which decompose the sparse user-item interaction matrix into low-dimensional matrices with latent factors. However, for modeling sequential information, it is often not clear how to integrate the dynamics of user intents into the framework of factor models.
Recently, neural-network-based algorithms have received much attention from researchers [6, 9, 18, 30, 48]. For instance, many different kinds of graph neural networks (GNNs), instead of matrix factorization (MF) based algorithms [36], have been proposed to learn graph embeddings due to their ability to learn from non-Euclidean spaces [30]. Many different kinds of recurrent neural networks (RNNs) have been proposed to model user behaviors due to their powerful descriptive ability for sequential data [14, 29, 39, 47, 52]. For example, Hidasi et al. [14] propose an approach based on a number of parallel RNNs with rich features to model sequential user behaviors. Wu et al. [47] endow both users and movies with a long short-term memory (LSTM) autoregressive model to predict future user behaviors. Furthermore, to better utilize the temporal information, Zhu et al. [52], Neil et al. [29], and Vassøy et al. [39] introduce time intervals between sequential actions into RNN cells to update and forget information, rather than only considering the order of actions.
Despite the success of the above-mentioned RNN-based methods, there are several limitations that make it difficult to apply these methods to the wide variety of applications in the real world. One inherent assumption of these methods is that the importance of historical behaviors decreases over time (e.g., equation (15) in [52]), which is also an intrinsic property of RNN cells such as gated recurrent units (GRU) and long short-term memory (LSTM). However, this assumption does not always apply in practice, where the sequences may have complex cross-dependence [46]. For example, a user's online actions are not straightforward but contain much noise and randomness. See Figure 1 for an illustration, which shows a real sequence of clicked items of a user in one of the largest online shopping websites. We can conjecture that the user intended to buy a T-shirt in honor of LeBron James and finally bought one, but the user also viewed items of a different kind (shoes). Clearly, the earlier T-shirt items are more important than the later shoe items for predicting the final deal, even though the former precede the latter in the temporal dimension. This example displays the difficulty in analyzing sequential user behaviors, where the simple assumption of time interval-based correlation between actions is not enough to cope with such cases.
In this paper, we are inspired by the attention mechanism proposed for natural language processing [1, 49], which has achieved remarkable progress in the past few years. An attention mechanism introduced into deep networks provides the functionality to focus on portions of the input data or features to fulfill the given task. Similarly, we expect that a trained attention mechanism helps to identify the important correlated actions from sequential user behaviors to make predictions. However, the existing attention mechanism is inefficient in modeling sequential user behaviors. Hence, we design a new attention mechanism specifically for our purpose.
Specifically, we propose a network featuring Attention with Long-term Interval-based Gated Recurrent Units (ALI-GRU) for modeling sequential user behaviors to predict the user's next action. The network is depicted in Figure 2. We adopt a series of bidirectional GRUs to process the sequence of items that the user has accessed. The GRU cells in our network consist of not only normal GRUs but also time interval-based GRUs, where the latter reflect the short-term information of time intervals. In addition, the features extracted by the bidirectional GRUs are used as the input of the attention model, and the attention distribution is calculated at each timestamp rather than as a single vector as in the Seq2Seq model [1, 46]. Therefore, this attention mechanism is able to consider the long-term correlation along with short-term intervals. Our designed attention mechanism is detailed in Section 4.
We have performed a series of experiments using both well-known public datasets (LastFM and CiteULike [52]) and a dataset collected by ourselves, built from real-world data. Extensive results show that our proposed ALI-GRU outperforms the state-of-the-art methods by a significant margin on these datasets. Moreover, ALI-GRU has been adopted online, and we have performed an online A/B test; the test results further demonstrate the practical value of ALI-GRU in comparison with a well-optimized baseline in a real-world e-commerce search engine.
This paper makes the following contributions:
(i) First, we propose a bidirectional time interval-based GRU to model the long- and short-term information of user actions for better capturing temporal dynamics between actions. The time interval-based GRU is able to effectively extract short-term dynamics of user intents as driven signals of the attention function and to refine the long-term memory by contextual information.
(ii) Second, we design a new attention mechanism to encode long- and short-term information and identify complex correlations between actions, which attends to the driven signals at each time step along with the embedding of contextual information. This mechanism is less affected by the noise in the historical actions and is robust in extracting the important correlated information between sequential user behaviors to make a better prediction.
(iii) Third, we conduct a series of experiments on two well-known public datasets and a large-scale dataset constructed from a real-world e-commerce platform. Extensive experimental results show that our proposed ALI-GRU obtains significant improvement compared to state-of-the-art RNN methods. In addition, ALI-GRU has been adopted online and we conducted an online A/B test, and the results further demonstrate its practical value in a real-world e-commerce search engine.
The remainder of this paper is organized as follows. Section 2 discusses related work. The problem of modeling sequential user behaviors is formulated in Section 3, followed by a detailed description of our proposed ALI-GRU in Section 4. Experimental results are reported in Section 5, and concluding remarks are given in Section 6.
2. Related Work
We give a brief overview of related work in two areas: the modeling of sequential user behaviors and the attention mechanism.
2.1. Modeling Sequential User Behaviors
Due to the significance of user-centric tasks such as personalized search and recommendation, modeling sequential user behaviors has attracted great attention in both industry and academia. Most of the pioneering work relies on model-based Collaborative Filtering (CF) to analyze the user-item interaction matrix. There have been a variety of such algorithms, including Bayesian methods [28] and matrix factorization (MF) methods [31, 50]. Due to the characteristics of sequential information, several CF works take the temporal dynamics into account, often based on the assumption of Markov processes [11, 21, 32]. For the task of sequential recommendation, Rendle et al. [32] propose the Factorizing Personalized Markov Chain (FPMC) to combine matrix factorization of the user-item matrix with Markov chains. He and McAuley [11] further integrate similarity-based methods [20] into FPMC to tackle the problem of sequential dynamics.
The major problems of the above-mentioned work are that these methods independently combine several components, rely on low-level hand-crafted features of users or items, and have difficulty handling long-term behaviors. On the contrary, along with the development of deep neural networks, Lei et al. [22] and Zheng et al. [51] employ deep learning to learn effective representations of users/items automatically. Furthermore, with the success of recurrent neural networks (RNNs) in the past few years, a number of works have attempted to utilize RNNs [14, 25, 47]. For example, Liu et al. [25] jointly consider contextual information such as weather in the RNN architecture to improve modeling performance. The insight behind the success of RNN-based solutions in modeling sequential user behaviors is that RNNs have a well-demonstrated ability to capture patterns in sequential data. Recent studies [10, 29, 39, 52] also indicate that time intervals within the sequential signal are a very important clue for updating and forgetting information in the RNN architecture. Zhu et al. [52] design several time gates in LSTM units to improve modeling performance. He et al. [10] embed items into a "transition space," where users are modeled as translation vectors operating on item sequences. Liu et al. [26] employ adaptive context-specific input matrices and adaptive context-specific transition matrices to capture, respectively, external situations and how the lengths of time intervals between adjacent behaviors in historical sequences affect the transition of global sequential features. In practice, however, there is complex dependence and correlation between sequential user behaviors, which requires deeper analysis of the relations among behaviors rather than simply modeling their presence, order, and time intervals. To summarize, how to design an RNN architecture that models sequential user behaviors effectively is still a challenging open problem.
2.2. Attention Mechanism
The attention mechanism is now a commonly adopted ingredient in various deep learning tasks such as machine translation [1, 27], image captioning [24], question answering [38], and speech recognition [5], and it has been shown to be effective in capturing the contribution and correlation between different components in a network. The success of the attention mechanism is mainly due to the reasonable assumption that human beings do not tend to process the entire signal at once; instead, they only focus on selected portions of the entire perception space when and where needed [17]. To avoid the limitation of normal networks that the entire source must be encoded into one hidden layer, an attention-based network contains a set of hidden representations that scales with the size of the source. The network learns to assign attention weights to perform a soft selection of these representations.
With the development of the attention mechanism, recent research has started to leverage different attention architectures to improve the performance of related tasks [1, 3, 34, 40, 46, 49]. For example, Bahdanau et al. [1] conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture; therefore, they design a model to automatically search for the parts of a source sentence that are relevant to predicting a target word. Yang et al. [49] propose a hierarchical attention network at the word and sentence levels, respectively, to capture the contributions of different parts of a document. Vaswani et al. [40] utilize a multi-head attention mechanism to improve performance. Wang et al. [46] propose a coverage strategy to combat the misallocation of attention caused by the memorylessness of the traditional attention mechanism. Nevertheless, most of the previous work calculates the attention distribution according to the interaction of every source vector with a single embedding vector of contextual or historical information (such as translated words in the sentence), which may lead to information loss caused by early summarization and noise caused by incorrect previous attention. In particular, Shen et al. [35] propose an attention-based language understanding method without any other network structure (e.g., RNN). In [35], the input sequence is processed by directional (forward and backward) self-attention to model context dependency and produce context-aware representations for all tokens. Then, a multidimensional attention computes a vector representation of the entire sequence.
Indeed, the attention mechanism is very important to the task of modeling sequential user behaviors. However, to the best of our knowledge, only a few works concentrate on this paradigm. Chen et al. [2] incorporate the attention mechanism into a multimedia recommendation task with a multilayer perceptron. Song et al. [37] propose a recommender system for online communities based on a dynamic-graph-attention neural network. They model dynamic user behaviors with a recurrent neural network and context-dependent social influence [23] with a graph-attention neural network, which dynamically infers the influencers based on users' current interests. In this paper, we investigate an effective solution with an attention mechanism for better modeling sequential user behaviors.
3. Problem Formulation
We start our discussion with the definitions of some notations. Let U be a set of users and let I be a set of items in a specific service, such as products in online shopping websites. For each user u ∈ U, his/her historical behaviors are given by Hu = {(i1, t1), (i2, t2), …, (in, tn)}, where (ik, tk) denotes the user's k-th action, i.e., the interaction between user u and item ik ∈ I at time tk; the interaction has different forms in different services, such as clicking, browsing, and adding to favorites. The objective of modeling sequential user behaviors is to predict the conditional probability p(in+1 | Hu) of the next item in+1 for a certain given user u.
We take RNN as the basic model, which generates the conditional probability in multiple steps sequentially. At step k, the k-th item ik is vectorized into xk and then fed into the RNN units by nonlinear transformation, e.g., a multilayer perceptron. Then, it updates the hidden state of the RNN units, i.e., hk = f(hk−1, xk), as well as the output of the RNN units. The representations of the hidden state and the output are trained to predict the next item, vectorized as xk+1, given x1, …, xk. To train the RNN, we aim to maximize the likelihood of the historical behaviors of a set of users U:

  max ∏u∈U ∏k p(ik+1 | i1, …, ik; Θ),

where ik+1 is the target item for the given user. In other words, we aim to minimize the negative logarithmic likelihood, that is, the objective function

  L(Θ) = −Σu∈U Σk log p(ik+1 | i1, …, ik; Θ),

where Θ is the set of parameters in the RNN model.
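To make the objective concrete, the following sketch (illustrative names and shapes, not the paper's implementation) computes the negative log-likelihood of next-item targets from per-step softmax outputs:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def next_item_nll(logits, targets):
    """Negative log-likelihood of next-item targets.

    logits  : (T, num_items) unnormalized scores, one row per step
    targets : (T,) index of the true next item at each step
    """
    probs = softmax(logits)
    steps = np.arange(len(targets))
    return -np.log(probs[steps, targets] + 1e-12).sum()

# Toy check: a model that scores the true item highly has low loss.
logits = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
loss = next_item_nll(logits, np.array([0, 1]))
```

Minimizing this quantity over Θ is what the training procedure above does, summed over all users and steps.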
To fulfill this learning, we need to design an effective RNN architecture, including the inner functions of the RNN cells and the overall network structure, which together approximate a highly nonlinear function for obtaining the probability distribution of the next item. In this process, the RNN usually suffers from the complex dependency problem, especially when we deal with user actions that contain much noise and randomness. The attention mechanism is a possible solution, which constructs a pooling layer on top of the RNN cells at each step to characterize the dependence between the current intent and all of the historical actions. We will describe our designed network architecture with the attention mechanism in the next section.
4. ALI-GRU
As illustrated in the left part of Figure 2, our designed network features an attention mechanism with long-term interval-based gated recurrent units for modeling sequential user behaviors. This network architecture takes the sequence of items as the raw signal. There are four stages in our network. The embedding layer maps items to a vector space to extract their basic features. The bidirectional GRU layer is designed to capture the information of both long-term preferences and short-term intents of the user; it consists of normal GRUs and time interval-based GRUs (see Figure 3). The attention function layer reflects our carefully designed attention mechanism, which is illustrated in the right part of Figure 2. Finally, there is an output layer to integrate the attention distribution and the extracted sequential features and to utilize normal GRUs to predict the conditional probability of the next item.
4.1. Embedding Layer
The purpose of the embedding layer is to map the raw data of items into a rectified vector space, where the vectorized representations of items still keep the semantics of the items; e.g., semantically relevant items have a small distance in the vector space. Usually, items can first be represented as one-hot vectors and then processed by several fully connected layers [52]. If, however, the number of items is too large, a pretrained encoding network is useful to process the items, which encodes not only basic properties such as the category of items but also crowdsourcing properties such as the sales of items [4]. In this paper, we adopt these two strategies for the different datasets, respectively.
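As an illustrative sketch (all sizes are hypothetical), the one-hot-plus-fully-connected strategy reduces to a row lookup in the layer's weight matrix, which is why it scales poorly when the item vocabulary is huge:

```python
import numpy as np

num_items, embed_dim = 1000, 8
rng = np.random.default_rng(0)

# A fully connected layer applied to a one-hot input selects one row of
# the weight matrix, so the two views below must agree.
W = rng.normal(size=(num_items, embed_dim))
b = np.zeros(embed_dim)

def embed_onehot(item_id):
    onehot = np.zeros(num_items)
    onehot[item_id] = 1.0
    return onehot @ W + b          # explicit one-hot times weights

def embed_lookup(item_id):
    return W[item_id] + b          # equivalent direct lookup

x = embed_onehot(42)
```

The lookup view is what makes a pretrained encoding network preferable once `num_items` reaches the scale of a large e-commerce catalog.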
4.2. Bidirectional GRU Layer with TimeGRU
This layer is designed to extract driven signals from input sequence and to refine the longterm memory by contextual information. We now detail our method for these two targets.
In the previous work for natural language processing tasks, the attention function is driven by a single vector of input [1, 27, 40]. That model works well because of the relatively stable syntax and semantics of input words. However, sequential user behaviors contain much noise and randomness that make the simple model problematic. We propose a new network structure with timeGRU to extract shortterm dynamics of user intents as driven signal of the attention function.
The structure of timeGRU in comparison with normal GRU is shown in Figure 3, where the black lines denote the network links of normal GRU and the red lines denote the new links of timeGRU. The normal GRU equations are as follows:

  zN = σ(Wz xN + Uz hN−1 + bz),
  rN = σ(Wr xN + Ur hN−1 + br),
  h̃N = tanh(Wh xN + Uh (rN ⊙ hN−1) + bh),
  hN = (1 − zN) ⊙ hN−1 + zN ⊙ h̃N,

where xN denotes the N-th sequence item vector, hN−1 denotes the (N − 1)-th hidden state vector, and h̃N is the candidate activation. zN represents the update gate, which decides how much the unit updates its activation. rN is the reset gate, which controls how much the last state contributes to the current activation. σ represents the sigmoid function, tanh represents the hyperbolic tangent function, and ⊙ is an element-wise multiplication. Weight parameters W and U connect the different inputs and gates; parameters b are biases.
The above equations imply that normal GRU is good at capturing general sequential information. Since GRU was originally designed for NLP tasks, there is no consideration of the time intervals within inputs, which are very important for modeling sequential user behaviors. To include the short-term information, we augment the normal GRU with a time gate TN:

  TN = σ(Wt xN + σ(wt ΔtN) + bt), subject to wt ≤ 0,

where ΔtN is the time interval between adjacent actions. The constraint wt ≤ 0 is to utilize the simple assumption that a smaller time interval indicates larger correlation. Moreover, we generate a time-dependent hidden state ĥN in addition to the normal hidden state hN; that is,

  ĥN = (1 − zN ⊙ TN) ⊙ hN−1 + zN ⊙ TN ⊙ h̃N,

where we utilize the time gate TN as a filter to modify the update gate zN so as to capture short-term information more effectively.
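The GRU and timeGRU updates can be sketched as follows. This is an illustrative reconstruction: the weight names, the inner sigmoid on the interval term, and the exact placement of the time gate are assumptions based on the text, not the authors' released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4                                        # hidden size (illustrative)
rng = np.random.default_rng(1)
W = {k: rng.normal(scale=0.1, size=(d, d)) for k in ("z", "r", "h", "t")}
U = {k: rng.normal(scale=0.1, size=(d, d)) for k in ("z", "r", "h")}
b = {k: np.zeros(d) for k in ("z", "r", "h", "t")}
w_t = -abs(rng.normal(scale=0.1, size=d))    # constraint: non-positive, so a
                                             # larger interval closes the gate

def time_gru_step(x, h_prev, dt):
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])   # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])   # reset gate
    h_cand = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])
    h = (1 - z) * h_prev + z * h_cand                    # normal GRU state
    # Time gate: filters the update gate according to the interval dt.
    t_gate = sigmoid(W["t"] @ x + sigmoid(w_t * dt) + b["t"])
    h_time = (1 - z * t_gate) * h_prev + z * t_gate * h_cand
    return h, h_time

h = np.zeros(d)
h, h_time = time_gru_step(rng.normal(size=d), h, dt=3.0)
```

With `w_t` held non-positive, a long gap between actions shrinks the time gate, so the time-dependent state leans toward the previous state rather than the new candidate.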
In addition, we want to utilize contextual information to extract long-term information with as little information loss as possible. Recent methods usually construct a bidirectional RNN and add or concatenate the two output vectors (forward and backward) of the bidirectional RNN. A bidirectional RNN outperforms a unidirectional one but still suffers from embedding loss, since the temporal dynamics are not considered enough. On the contrary, we propose to combine the output of the forward normal GRU (ĥN in equation (6)) with all the outputs of the backward GRU at different steps (the output of the backward GRU at step k is denoted by h̄k in Figure 2). Specifically, we produce the concatenated vectors c1 = [ĥN, h̄1], c2 = [ĥN, h̄2], …, cM = [ĥN, h̄M], as shown in the right part of Figure 2, where [,] stands for the concatenation of vectors. This design captures as much of the contextual information as possible.
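Under this design, each step's context vector concatenates the forward GRU's final output with that step's backward output; a toy sketch (notation and sizes assumed):

```python
import numpy as np

M, d = 5, 4                         # sequence length and state size (toy)
rng = np.random.default_rng(2)
h_fwd_last = rng.normal(size=d)     # final output of the forward GRU
h_bwd = rng.normal(size=(M, d))     # backward-GRU output at each step

# c_k = [h_fwd_last, h_bwd_k]; stacking gives an (M, 2d) context matrix,
# one row per step, instead of a single summary vector.
C = np.stack([np.concatenate([h_fwd_last, h_bwd[k]]) for k in range(M)])
```

Keeping one context row per step is what lets the attention layer below score every step of the long-term context separately, rather than attending to a single pooled vector.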
4.3. Attention Function Layer
The attention function layer is responsible for linking and analyzing the dependence and contribution between the driven signals and the contextual long-term information provided by the previous layers. Unlike previous attention mechanisms, we do not simply summarize the contextual long-term information into individual feature vectors for calculating the attention weight of an item to the hidden state at each step [46]. Instead, we design the layer to attend to the driven signals at each time step along with the embedding of contextual information.
Specifically, as shown in the right part of Figure 2 and as already discussed in the last subsection, we use the concatenated vectors c1, …, cM ∈ R^(2d), where d is the dimension of the GRU states, to represent the contextual long-term information. Let sk denote the short-term intent reflected by the k-th item. We then construct an attention matrix A ∈ R^(M×M), whose elements are calculated by

  ajk = f(cj, sk),

where the attention weight function,

  f(c, s) = v⊤ tanh(Wc c + Ws s),

is adopted to encode the two input vectors. Wc, Ws, and v are the weight parameters. There is a pooling layer, for example, average or max pooling, along the direction of the long-term information, and then a Softmax layer normalizes the attention weights of each driven signal. Let αk be the normalized weight on sk; then the attended short-term intent vector is s̃ = Σk αk sk. At last, we use [s̃, xN] as the output to the next layer, where xN is the embedded vector of the item at the N-th step.
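A sketch of this matrix-form attention follows; the additive scoring function, the average-pooling choice, and the weight shapes are assumptions filled in for illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

M, d = 5, 4
rng = np.random.default_rng(3)
C = rng.normal(size=(M, 2 * d))     # contextual long-term vectors c_j
S = rng.normal(size=(M, d))         # short-term driven signals s_k
Wc = rng.normal(scale=0.1, size=(2 * d, d))
Ws = rng.normal(scale=0.1, size=(d, d))
v = rng.normal(scale=0.1, size=d)

# Attention matrix: a[j, k] scores context c_j against driven signal s_k.
A = np.array([[v @ np.tanh(Wc.T @ C[j] + Ws @ S[k]) for k in range(M)]
              for j in range(M)])

alpha = softmax(A.mean(axis=0))     # average-pool over the context axis,
                                    # then normalize across driven signals
s_att = alpha @ S                   # attended short-term intent vector
```

Because every context row cj contributes a full row of scores before pooling, no single summary vector has to stand in for the whole history, which is the point of the matrix form.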
We want to emphasize the insight of our carefully designed attention mechanism described above, which differs from existing methods in that it reduces the loss of contextual information caused by early summarization. Furthermore, since the driven signals are attended to the long-term information at different steps, the attention can capture the trending change of the user's preferences, being more robust and less affected by the noise in the historical actions.
4.4. Output Layer
Given the vector produced by the attention function layer, we use a layer of normal GRUs to produce an embedding vector (shown in Figure 2), which is expected to contain the contextual long-term information about all of the user's historical actions with respect to the single item, together with the short-term intents. The embedding vector is then decoded to produce the final result. For example, we use a Softmax function after a fully connected layer to obtain the probability distribution over the different items in the next action:

  p(iN+1 | i1, …, iN) = Softmax(W h + b),

where h is the embedding vector and W and b are the parameters of the fully connected layer.
If the number of candidate items is too big, we shall use a slightly different decoding function, which will be detailed in Section 5.3.
5. Experiments
In this section, we first describe the datasets used and several state-of-the-art approaches compared as baselines in this paper. Then, we report and discuss the experimental results on the different datasets.
5.1. Datasets
To verify our proposed ALI-GRU, we conduct a series of experiments on two well-known public datasets (LastFM (http://www.dtic.upf.edu/∼ocelma/MusicRecommendationDataset/lastfm1K.html) and CiteULike (http://www.citeulike.org/faq/data.adp)). Additionally, we also perform offline and online experiments on real data from one of the largest online shopping websites. Table 1 shows the statistics of LastFM and CiteULike:
(i) LastFM contains <user_id, timestamp, artist_id, song_id> tuples collected from the Last.fm API (https://www.last.fm/api/). It represents the whole listening habits (till 5 May 2009) of 1000 users. We extract tuples <user_id, song_id, timestamp> from the original dataset to conduct experiments, where each song represents an item and each tuple represents the action or behavior that the user listens to the song at the given time.
(ii) CiteULike consists of tuples <user_id, paper_id, timestamp, tag>, where each tuple represents that the user annotates the paper with a tag at the given time. One user annotating one research paper (i.e., item) at a certain time may have several records, in order to distinguish different tags. We merge them as one record and extract tuples <user_id, paper_id, timestamp> to construct the dataset as in [52].

5.2. Compared Approaches
We compare ALI-GRU with the following state-of-the-art approaches for performance evaluation:
(i) Factorized Sequential Prediction with Item Similarity Models (Fossil) [11]. This is a state-of-the-art factorized sequential prediction method based on Markov processes. Fossil also considers the similarity of explored items to those already consumed/liked by the user, which achieves a certain success in handling the long-tail problem. We have used the implementation provided by the authors (https://drive.google.com/file/d/0B9Ck8jwTZUEeEhSWXU2WWloc0k/view).
(ii) Basic GRU/Basic LSTM [7]. This method directly uses normal GRU/LSTM as the primary network. For fair comparison, we set the network to use the same embedding layer and the same decoding function as our method.
(iii) Session RNN [13]. Hidasi et al. propose an RNN-based method to capture the contextual information according to sessions of user behaviors. In our experiments, we use a commonly adopted method described in [16] to identify sessions so as to adopt this baseline.
(iv) Time-LSTM [52]. This method utilizes LSTM to model the pattern of sequential user behaviors. Compared to normal LSTM units, Time-LSTM considers the time intervals within the sequential signal and designs several time gates in LSTM units, similarly to our timeGRU.
(v) Simplified Version 1 (SV1). This approach is designed to verify the effectiveness of our designed attention mechanism. The SV1 approach is identical to ALI-GRU, with the only difference being that SV1 uses the attention mechanism provided in [1, 46], which simply summarizes the contextual long-term information into individual feature vectors. Specifically, user contextual behaviors are modeled as a single vector, and all driven signals are attended to this vector.
(vi) Simplified Version 2 (SV2). This approach is designed to verify the effectiveness of our proposed timeGRU for generating driven signals according to short-term information. Compared to ALI-GRU, the only difference is that SV2 uses the single item at each step (the embedded vector) to attend to the contextual information.
All RNN-based models are implemented with the open-source deep learning platform TensorFlow (https://www.tensorflow.org/). Training was done on a single Tesla P40 GPU with 8 GB of graphical memory.
5.3. Experiments on LastFM and CiteULike
We first evaluate our method on two wellknown public datasets for the task of sequential recommendation.
5.3.1. Datasets
In this experiment, we use the same datasets as those adopted in [52], i.e., LastFM and CiteULike. Table 1 presents the statistics of these two datasets. Both datasets can be formulated as a series of tuples <user_id, item_id, timestamp>. Our target is to recommend songs in LastFM and papers in CiteULike for users according to their historical behaviors.
For fair comparison, we follow the segmentation of the training set and test set as described in [52]. Specifically, a portion of the users is randomly selected for training, and the remaining users are used for testing. For each test user with n historical behaviors, there are n − 1 test cases, where the k-th test case is to perform recommendation at time tk+1 given the user's previous k actions, and the ground truth is ik+1. The recommendation can also be regarded as a multiclass classification problem. For more details, please refer to [52].
5.3.2. Implementation
Following the method in [52], we use one-hot representations of items as inputs to the network and one fully connected layer with 8 nodes for embedding. The length of the hidden states of the GRU-related layers, including both normal GRU and timeGRU, is 16. A Softmax function is used to generate the probability prediction of the next items. For training, we use the AdaGrad [8] optimizer, which is a variant of Stochastic Gradient Descent (SGD). The parameters for training are a mini-batch size of 16 and an initial learning rate of 0.001 for all layers. The training process takes about 8 hours.
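For reference, AdaGrad adapts a per-parameter learning rate by accumulating squared gradients; a minimal sketch on a toy quadratic loss (hyperparameters illustrative, matching the initial learning rate above):

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.001, eps=1e-8):
    """One AdaGrad step: the effective learning rate of each parameter
    shrinks as its accumulated squared gradient grows."""
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum

w = np.array([1.0, -1.0])
acc = np.zeros_like(w)
for _ in range(3):
    g = 2 * w                       # gradient of the toy loss ||w||^2
    w, acc = adagrad_update(w, g, acc)
```

Parameters that receive frequent large gradients, such as embeddings of popular items, are thus updated more conservatively over time than rarely touched ones.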
5.3.3. Evaluations
In the test stage, following the evaluation method in [52], we select the 10 items with the top probabilities as the final recommendations. We use Recall@10 to measure whether the ground-truth item is in the recommendation list. Recall@10 is defined as

  Recall@10 = n_hit / N,

where n_hit is the number of test cases in which the ground-truth item is in the recommendation list and N is the number of all test cases. We further use MRR@10 (Mean Reciprocal Rank) to consider the rank of the ground truth in the recommendation list. This is the average of the reciprocal ranks of the ground-truth items in the recommendation list, where the reciprocal rank is set to 0 if the rank is above 10.
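Both metrics can be sketched as follows (the helper and its inputs are hypothetical; the top-10 cutoff is from the text):

```python
def recall_and_mrr_at_10(ranked_lists, ground_truth):
    """ranked_lists[i] is the top-10 recommendation list for test case i;
    ground_truth[i] is the true next item for that case."""
    hits, rr_sum = 0, 0.0
    for recs, gt in zip(ranked_lists, ground_truth):
        if gt in recs:
            hits += 1
            rr_sum += 1.0 / (recs.index(gt) + 1)  # reciprocal rank, 1-based
        # a rank above 10 (i.e., not in the list) contributes 0 to MRR
    n = len(ground_truth)
    return hits / n, rr_sum / n

# Toy example: the first case hits at rank 2, the second misses.
recall, mrr = recall_and_mrr_at_10([[1, 2, 3], [4, 5, 6]], [2, 9])
```

Recall@10 only asks whether the ground truth appears at all, while MRR@10 additionally rewards placing it near the top of the list.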
5.3.4. Overall Performance
The results of the sequential recommendation tasks on LastFM and CiteULike are shown in Table 2. It can be observed that our approach performs the best on both LastFM and CiteULike for all metrics, which demonstrates the effectiveness of our proposed ALI-GRU. Specifically, ALI-GRU obtains significant improvement over Time-LSTM, the best baseline, in terms of both Recall@10 and MRR@10. This owes to the superiority of introducing the attention mechanism into RNN-based methods, especially in capturing the contribution of each historical action.

5.3.5. Performance of ColdStart
Cold-start refers to the lack of enough historical data for a specific user, which often decreases the quality of recommendations. We analyze the influence of cold-start on the LastFM dataset, and the results are given in Figure 4. In this figure, test cases are counted separately for different numbers of historical actions, where a small number corresponds to cold-start. We can observe that, for cold users with only 5 actions, ALI-GRU performs slightly worse than the state-of-the-art methods. This is because ALI-GRU considers short-term information as driven signals, which averages the source signal to some extent and leads to less accurate modeling for cold users. As the number of historical actions increases, ALI-GRU achieves significantly better performance than the baselines, which indicates that the bidirectional GRU and the attention mechanism can better model long-term preferences for making recommendations.
5.4. Offline Experiments
We have collected a large-scale dataset from a real-world e-commerce website for further performance evaluation. ALI-GRU has also been deployed online, and the results of the online A/B test are reported in the next section.
5.4.1. Dataset
User behaviors in this dataset are randomly sampled from the click and purchase logs of a real-world e-commerce website over seven days (a week at the beginning of July 2017). The dataset is again formulated as a series of tuples <user_id, item_id, timestamp>.
We focus on the task of personalized search for e-commerce websites. We therefore define positive cases as purchasing behaviors led by the e-commerce search engine mentioned above, and negative cases as clicks without purchases (i.e., no purchase occurs around the click within 5 actions). Finally, we have positive cases, negative cases, users, and items. We randomly select 80% of the users for training and use the remaining users for testing. For each positive or negative case in a sequence, the target is to predict whether the user will purchase, given his/her historical behaviors, which is a typical binary classification problem.
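The labeling rule can be sketched as follows. This is a hypothetical reconstruction: the text does not specify whether "a purchase around the click" must be of the same item, so this sketch assumes a same-item look-ahead window, and the tuple layout is illustrative only:

```python
def label_cases(actions, window=5):
    """Label each click in a user's time-ordered action sequence.

    `actions` is a list of (item_id, action_type) tuples, where
    action_type is 'click' or 'purchase'. A click is labeled positive
    if a purchase of the same item occurs within the next `window`
    actions, and negative otherwise. (Assumed reading of the paper's
    "no purchase around the click within 5 actions" rule.)
    """
    labels = []
    for i, (item, kind) in enumerate(actions):
        if kind != 'click':
            continue
        lookahead = actions[i + 1:i + 1 + window]
        purchased = any(a == item and k == 'purchase' for a, k in lookahead)
        labels.append((item, 1 if purchased else 0))
    return labels
```

Under this reading, a click on item A followed within five actions by a purchase of A becomes a positive case, while a click on item B with no such purchase becomes a negative case.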
5.4.2. Implementation
Since the number of items in this dataset is very large, it is impractical to use one-hot representations as inputs to the RNN-based models. Instead, we use pretrained embedding vectors of items as inputs and additionally use two fully connected layers, both with 128 nodes, to re-embed the item vectors. We also follow the wide & deep learning approach in [4] to convert the outputs of the final fully connected layer, whose size is 48, into representations of the corresponding items. For a fair comparison, all RNN-based approaches use the pretrained item representations as inputs. The hidden state size of the GRU-related layers is 128. Finally, we use the sigmoid function to predict whether the user will purchase the item. For training, the loss function is cross-entropy, and the AdaGrad optimizer is employed with a mini-batch size of 256 and an initial learning rate of 0.001 for all layers. The entire training process takes about 50 hours.
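The re-embedding and prediction head described above can be sketched as follows. This is an illustration only, not the paper's implementation: the pretrained embedding dimension of 64 and the Gaussian weight initialization are assumptions, and the GRU layers that sit between the re-embedding and the final prediction are omitted:

```python
import math
import random

random.seed(0)

# Assumed dimensions: 64-d pretrained embeddings (not stated in the text),
# two 128-node FC layers, and a final layer of size 48 as described.
EMB_DIM, FC_DIM, REP_DIM = 64, 128, 48

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W1 = rand_matrix(FC_DIM, EMB_DIM)
W2 = rand_matrix(FC_DIM, FC_DIM)
W3 = rand_matrix(REP_DIM, FC_DIM)
w_out = [random.gauss(0.0, 0.1) for _ in range(REP_DIM)]

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def reembed(item_emb):
    """Two 128-node FC layers re-embed a pretrained item vector,
    then a final layer projects it to a size-48 representation."""
    h = relu(matvec(W1, item_emb))
    h = relu(matvec(W2, h))
    return matvec(W3, h)

def purchase_prob(item_emb):
    """Sigmoid head predicting purchase probability from the representation."""
    z = sum(w * x for w, x in zip(w_out, reembed(item_emb)))
    return 1.0 / (1.0 + math.exp(-z))
```

In training, the sigmoid output would be fed into a cross-entropy loss and optimized with AdaGrad, as the text describes.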
5.4.3. Evaluations
In the test stage, we use the Precision-Recall curve of positive cases to measure performance. We also adopt AUC (Area Under the ROC Curve), which is widely used in imbalanced classification tasks [12]; a larger AUC indicates better performance.
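For concreteness, both measures can be computed directly from labels and predicted scores. The sketch below uses the rank-statistic (Mann-Whitney) formulation of AUC, the probability that a random positive case scores higher than a random negative one, with ties counted as half:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs the scores order
    correctly, counting ties as 0.5. O(P*N); fine for illustration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def precision_recall(labels, scores, threshold):
    """Precision and recall of the positive class at a score threshold;
    sweeping the threshold traces out the Precision-Recall curve."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, labels [1, 0, 1, 0] with scores [0.9, 0.8, 0.7, 0.1] order three of the four positive-negative pairs correctly, giving an AUC of 0.75.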
5.4.4. Overall Performance
Table 3 shows the AUC results measuring overall performance. All RNN-based methods except Basic GRU outperform Fossil, which is based on matrix factorization with Markov processes; this indicates the advantage of RNNs for modeling sequential data. Furthermore, to compare the capabilities of the different RNN-based approaches from additional perspectives, we report Precision-Recall curves in Figure 5 and summarize our findings as follows.

5.4.5. Basic GRU versus Session RNN versus Time-LSTM
Session RNN and Time-LSTM achieve significant improvements over Basic GRU, which is consistent with the previous results on the public datasets. This is due to the limitation of Basic GRU/Basic LSTM in modeling complex long-term sequential data. Compared to Session RNN, Time-LSTM performs better in the high-precision range (precision above about 0.73), owing to the advantage of short-term intents for predicting highly confident items. Conversely, Session RNN outperforms Time-LSTM in the low-precision range (precision below about 0.73), since Session RNN introduces a session view to better model contextual information and thus benefits when recalling items based on users' long-term preferences.
5.4.6. Session RNN versus Time-LSTM versus ALI-GRU
By adopting a time gate, which is strong at modeling short-term dynamics, and a bidirectional RNN, which is advantageous for modeling long-term information, ALI-GRU better analyzes the complex dependencies among items and user intents, together with a novel matrix-form attention mechanism that further enhances performance. ALI-GRU outperforms Session RNN and Time-LSTM by as much as 10.96% and 8.53% in AUC (Table 3), respectively. Observing the Precision-Recall curves, we find that ALI-GRU beats Session RNN and Time-LSTM over the entire range, with more significant improvement in the high-precision range. The superior performance of ALI-GRU across datasets and evaluation views demonstrates its efficacy in handling long-term sequential user behaviors with dynamic short-term intents.
5.4.7. SV1 and SV2 versus Others
We also show the results of SV1 and SV2 for ablation analysis (Figure 5). From the curve of SV1, we find that the previous attention mechanism with bidirectional GRU achieves only a slight improvement over Time-LSTM, which indicates the limitation of previously studied attention mechanisms in capturing the dynamic importance of items in sequential user behaviors. In contrast, SV2 outperforms both Session RNN and Time-LSTM consistently, especially in the low-precision range. This suggests that our proposed matrix-form attention mechanism with bidirectional GRU has a superior capacity for distinguishing items' importance when modeling users' long-term preferences. Nevertheless, the curve of SV2 drops considerably when precision exceeds about 0.82, where it is merely comparable to SV1 and Time-LSTM. This is because user behaviors and intents are dynamic with a certain randomness, and a single item is not robust enough for calculating the attention distribution and capturing short-term intents. Finally, ALI-GRU consistently boosts performance over both SV1 and SV2, which demonstrates the advantage of our carefully designed matrix-form attention with the long-term interval-based GRU framework for modeling sequential user behaviors.
5.4.8. Case Study and Insights
We present three cases in Figure 6 for a comprehensive study that gives some insights into our proposed approach. Each case consists of one user's historical items ordered by click time and shows the attention heat map for each item and the final item (click or purchase) with the prediction and ground truth.
Case A. The user clicked items of several classes, such as watches, handbags, and dresses, and finally purchased a watch that had been clicked long before. We make the following observations: (1) ALI-GRU assigns higher weights to most watches than to the other items, which is consistent with the user's contextual intent (purchasing a watch). This suggests that our proposed approach can capture a user's real intent from historical behaviors. (2) The 1st, 3rd, and 5th watches, which are the same as or similar to the finally purchased watch, have higher weights than the other watches, especially the 1st watch, even though it was clicked earliest, a long time ago. More interestingly, the 6th watch is for women, and the user is probably a woman (judging by the dresses he/she clicked), yet the 6th watch has the lowest weight among all watches. These observations indicate that ALI-GRU successfully distinguishes the user's current intent of purchasing a men's watch.
Case B. If items were inherently important, repeated, or low-frequency, models without an attention mechanism might work well, since such models could automatically assign low weights to irrelevant items and vice versa. However, the importance of items and user intents is highly context-dependent and only consistent to a certain degree. In Case B, the user finally purchased a coat hanger belonging to a class he/she had never clicked. Nevertheless, ALI-GRU looks at the context of his/her recent behaviors, conjectures that the intent is possibly something related to laundry, and correctly identifies this as a positive case.
Case C. This prediction is incorrect according to the ground truth. Observing the user's historical behaviors and attention distribution, we find that ALI-GRU chooses to ignore the various items before the first suit; those actions had a long time interval (about two days) from the later actions. ALI-GRU thus conjectures that the user wants to buy something for formal wear and predicts a negative case for purchasing a USB cable, which the user in fact finally purchased. In such cases, the user's intents are choppy and abruptly shifting, which is a great challenge left for future exploration.
5.5. Online Test
An online test with real-world e-commerce users was carried out to study the effectiveness of our proposed method. In particular, we integrated ALI-GRU into the e-commerce search engine mentioned above, which serves billions of clicks per day. A standard A/B test was conducted online. Users of the search engine are randomly divided into multiple buckets, and we randomly selected two buckets for the experiments. For users in bucket A, we use the existing highly optimized ranking solution of the search engine, which performs Learning to Rank (LTR) and Reinforcement Learning (RL) with several effective algorithms such as wide & deep learning and CF prediction. For users in bucket B, we further integrate the results produced by ALI-GRU. Specifically, for a given user, his/her sequential behaviors (clicked items and timestamps) are collected from the entire service, and the user's intent vector is predicted by ALI-GRU in real time. When the user issues a query, we combine the computed user intent vector with all the retrieved items to calculate purchasing probabilities, similar to the method in the offline experiments. Finally, we integrate the purchasing probability into the existing ranking strategy.
Measures for the online A/B test include Gross Merchandise Volume (GMV), user Click-Through Rate (uCTR), Click Conversion Rate (CVR), Per Customer Transaction (PCT), and Unique Visitor Value (UV_Value), all of which are frequently used metrics in e-commerce [19].
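For the reader's convenience, these metrics are commonly defined as follows; these are standard industry definitions, and the precise formulations in [19] may differ in detail:

```latex
\[
\mathrm{uCTR} = \frac{\#\,\text{clicks}}{\#\,\text{impressions}}, \qquad
\mathrm{CVR} = \frac{\#\,\text{purchases}}{\#\,\text{clicks}},
\]
\[
\mathrm{PCT} = \frac{\mathrm{GMV}}{\#\,\text{purchasing customers}}, \qquad
\mathrm{UV\_Value} = \frac{\mathrm{GMV}}{\#\,\text{unique visitors}},
\]
```

where GMV denotes the total value of merchandise sold during the test period.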
The test was performed over one week in July 2017. Comparative results are given in Table 4, where the absolute values are omitted for business confidentiality. The results show that ALI-GRU achieves better performance on all the metrics. As expected, uCTR and CVR improve, meaning that users are more likely to click the re-ranked items and have a higher probability of purchasing them. More interesting is the improvement in PCT and UV_Value, which is due to the increase in the number of transactions per purchasing user. This result suggests that our model brings a kind of recommendation functionality into the search engine, as in Case B of Figure 6. In summary, our proposed ALI-GRU consistently improves the high-quality baseline of one of the largest online e-commerce platforms, a baseline that has been optimized for several years. Such improvements are very important for e-commerce search engine systems and have significant business value. ALI-GRU had been adopted into the search engine by the time this paper was prepared.

6. Conclusions
Modeling user behaviors as sequential learning plays an important role in predicting future user actions for applications such as personalized search and recommendation. However, most RNN-based methods assume that the importance of historical behaviors decreases over time and fail to consider cross-dependence within sequences, which makes them difficult to apply in real-world scenarios. To address these problems, we propose a novel and efficient approach called Attention with Long-term Interval-based Gated Recurrent Units (ALI-GRU) for better modeling of sequential user behaviors. We first propose a bidirectional time interval-based GRU to identify complex correlations between actions and to capture both long-term preferences and short-term intents of users as driven signals. We then design a new attention mechanism to attend to the driven signals at each time step for predicting the next user action. Empirical evaluations on two public datasets for the sequential recommendation task show that ALI-GRU achieves better performance than state-of-the-art solutions. Specifically, ALI-GRU outperforms Session RNN and Time-LSTM by as much as 10.96% and 8.53% in terms of AUC. In addition, online A/B tests in a real-world e-commerce search engine further demonstrate its practical value. Since GRUs cannot be computed in parallel, training the model is time-consuming; in the future, we will explore parallel approaches to address this limitation.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61873288), AlibabaPKU Joint Program, and Zhejiang Lab (nos. 2019KE0AB01 and 2019KB0AB06).
References
 D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014, https://arxiv.org/abs/1409.0473. View at: Google Scholar
 J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T.-S. Chua, “Attentive collaborative filtering: multimedia recommendation with item- and component-level attention,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–344, Tokyo, Japan, August 2017. View at: Google Scholar
 H. Cheng, H. Fang, X. He, J. Gao, and L. Deng, “Bi-directional attention with agreement for dependency parsing,” 2016, https://arxiv.org/abs/1608.02076. View at: Google Scholar
 H.-T. Cheng, L. Koc, J. Harmsen et al., “Wide & deep learning for recommender systems,” 2016, https://arxiv.org/abs/1606.07792. View at: Google Scholar
 J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” 2015, https://arxiv.org/abs/1506.07503. View at: Google Scholar
 C. Chu, L. Zhao, B. Xin et al., “Deep graph embedding for ranking optimization in e-commerce,” in Proceedings of the ACM International Conference on Information and Knowledge Management, Turin, Italy, 2018. View at: Google Scholar
 J. Chung, C. Gulcehre, K. H. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014, https://arxiv.org/abs/1412.3555. View at: Google Scholar
 J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011. View at: Google Scholar
 L. Guo, Q. Zhang, W. Hu, Z. Sun, and Y. Qu, “Learning to complete knowledge graphs with deep sequential models,” Data Intelligence, vol. 1, no. 3, pp. 224–2243, 2019. View at: Publisher Site  Google Scholar
 R. He, W.-C. Kang, and J. McAuley, “Translation-based recommendation: a scalable method for modeling sequential behavior,” in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 5264–5268, Stockholm, Sweden, 2018. View at: Google Scholar
 R. He and J. McAuley, “Fusing similarity models with Markov chains for sparse sequential recommendation,” in Proceedings of the International Conference on Data Mining, pp. 191–200, Barcelona, Spain, 2016. View at: Google Scholar
 R. He and J. McAuley, “VBPR: visual bayesian personalized ranking from implicit feedback,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 144–150, Phoenix, AZ, USA, 2016. View at: Google Scholar
 B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” 2015, https://arxiv.org/abs/1511.06939. View at: Google Scholar
 B. Hidasi, M. Quadrana, A. Karatzoglou, and D. Tikk, “Parallel recurrent neural network architectures for feature-rich session-based recommendations,” in Proceedings of the ACM Conference on Recommender Systems, pp. 241–248, Boston, MA, USA, 2016. View at: Google Scholar
 B. Hidasi and D. Tikk, “Fast ALS-based tensor factorization for context-aware recommendation from implicit feedback,” in Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 67–82, Bristol, UK, September 2012. View at: Google Scholar
 X. Huang, F. Peng, A. An, and S. Dale, “Dynamic web log session identification with statistical language models,” Journal of the American Society for Information Science and Technology, vol. 55, no. 14, pp. 1290–1303, 2004. View at: Publisher Site  Google Scholar
 R. Hübner, M. Steinhauser, and C. Lehle, “A dualstage twophase model of selective attention,” Psychological Review, vol. 117, no. 3, pp. 759–784, 2010. View at: Publisher Site  Google Scholar
 S. Ji, S. Pan, E. Cambria, P. Marttinen, and S. Y. Philip, “A survey on knowledge graphs: representation, acquisition and applications,” 2020, https://arxiv.org/abs/2002.00388. View at: Google Scholar
 P. Jiang, Y. Zhu, Y. Zhang, and Y. Quan, “Life-stage prediction for product recommendation in e-commerce,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2015. View at: Google Scholar
 S. Kabbur, X. Ning, and G. Karypis, “FISM: factored item similarity models for top-n recommender systems,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 659–667, Chicago, IL, USA, 2013. View at: Google Scholar
 Y. Koren, “Collaborative filtering with temporal dynamics,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 447–456, Paris, France, June 2009. View at: Google Scholar
 C. Lei, L. Dong, W. Li, Z.J. Zha, and H. Li, “Comparative deep learning of hybrid representations for image recommendations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2545–2553, Las Vegas, NV, USA, 2016. View at: Google Scholar
 S. Li, T. Cai, K. Deng, X. Wang, T. Sellis, and F. Xia, “Community-diversified influence maximization in social networks,” Information Systems, Amsterdam, The Netherlands, vol. 92, Article ID 101522, 2020. View at: Publisher Site  Google Scholar
 C. Liu, J. Mao, F. Sha, and A. Yuille, “Attention correctness in neural image captioning,” in Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4176–4182, San Francisco, CA, USA, 2017. View at: Google Scholar
 Q. Liu, S. Wu, D. Wang, Z. Li, and L. Wang, “Context-aware sequential recommendation,” in Proceedings of the IEEE 16th International Conference on Data Mining (ICDM), pp. 1053–1058, Barcelona, Spain, 2016. View at: Google Scholar
 Q. Liu, S. Wu, D. Wang, Z. Li, and L. Wang, “Context-aware sequential recommendation,” in Proceedings of the IEEE International Conference on Data Mining, pp. 1053–1058, 2016. View at: Google Scholar
 M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” 2015, https://arxiv.org/abs/1508.04025. View at: Google Scholar
 K. Miyahara and M. J. Pazzani, “Collaborative filtering with the simple Bayesian classifier,” in Proceedings of the Pacific Rim International Conference on Artificial Intelligence, pp. 679–689, Melbourne, Australia, August 2000. View at: Google Scholar
 D. Neil, M. Pfeiffer, and S.-C. Liu, “Phased LSTM: accelerating recurrent network training for long or event-based sequences,” 2016, https://arxiv.org/abs/1610.09513. View at: Google Scholar
 S. Pan, R. Hu, S.F. Fung, G. Long, J. Jiang, and C. Zhang, “Learning graph embedding with adversarial training methods,” IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2475–2487, 2020. View at: Publisher Site  Google Scholar
 A. Paterek, “Improving regularized singular value decomposition for collaborative filtering,” Proceedings of KDD Cup and Workshop, vol. 2007, pp. 5–8, 2007. View at: Google Scholar
 S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized Markov chains for next-basket recommendation,” in Proceedings of the International Conference on World Wide Web, pp. 811–820, Raleigh, NC, USA, 2010. View at: Google Scholar
 B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the Tenth International Conference on World Wide Web, pp. 285–295, Hong Kong, China, 2001. View at: Google Scholar
 M. Seo, A. Kembhavi, F. Ali, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” 2016, https://arxiv.org/abs/1611.01603. View at: Google Scholar
 T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang, “DiSAN: directional self-attention network for RNN/CNN-free language understanding,” 2018, https://arxiv.org/abs/1709.04696. View at: Google Scholar
 X. Shen, S. Pan, W. Liu, Y.-S. Ong, and Q.-S. Sun, “Discrete network embedding,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 3549–3555, Stockholm, Sweden, July 2018. View at: Google Scholar
 W. Song, Z. Xiao, Y. Wang, L. Charlin, M. Zhang, and J. Tang, “Session-based social recommendation via dynamic graph attention networks,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 555–563, Melbourne, Australia, 2019. View at: Google Scholar
 S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 2440–2448, Montreal, Canada, 2015. View at: Google Scholar
 B. Vassøy, M. Ruocco, E. d. S. d. Silva, and E. Aune, “Time is of the essence: a joint hierarchical RNN and point process model for time and item predictions,” in Proceedings of the ACM International Conference on Web Search and Data Mining, pp. 591–599, Los Angeles, CA, USA, 2019. View at: Google Scholar
 A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008, Long Beach, CA, USA, 2017. View at: Google Scholar
 P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, and X. Cheng, “Learning hierarchical representation model for next-basket recommendation,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 403–412, Santiago, Chile, 2015. View at: Google Scholar
 P. Wang, L. Zhao, X. Pan, D. Ding, X. Chen, and Y. Hou, “Density matrix based preference evolution networks for e-commerce recommendation,” in Proceedings of the International Conference on Database Systems for Advanced Applications, pp. 366–383, Chiang Mai, Thailand, 2019. View at: Google Scholar
 P. Wang, L. Zhao, Y. Zhang, Y. Hou, and L. Ge, “QPIN: a quantum-inspired preference interactive network for e-commerce recommendation,” in Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 2329–2332, Beijing, China, 2019. View at: Google Scholar
 X. Wang, Y. Liu, and F. Xiong, “Improved personalized recommendation based on a similarity network,” Physica A: Statistical Mechanics and Its Applications, vol. 456, pp. 271–280, 2016. View at: Publisher Site  Google Scholar
 X. Wang, Y. Liu, G. Zhang, Y. Zhang, H. Chen, and J. Lu, “Mixed similarity diffusion for recommendation on bipartite networks,” IEEE Access, vol. 5, pp. 21029–21038, 2017. View at: Publisher Site  Google Scholar
 Y. Wang, H. Shen, S. Liu, J. Gao, and X. Cheng, “Cascade dynamics modeling with attentionbased recurrent neural network,” in Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, pp. 2985–2991, Melbourne, Australia, 2017. View at: Google Scholar
 C.-Y. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing, “Recurrent recommender networks,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 495–503, Cambridge, UK, 2017. View at: Google Scholar
 Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, “Connecting the dots: multivariate time series forecasting with graph neural networks,” 2020, https://arxiv.org/abs/2005.11650. View at: Google Scholar
 Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489, San Diego, CA, USA, 2016. View at: Google Scholar
 H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in Proceedings of the IEEE 12th International Conference on Data Mining, pp. 765–774, Brussels, Belgium, 2012. View at: Google Scholar
 L. Zheng, V. Noroozi, and S. Y. Philip, “Joint deep modeling of users and items using reviews for recommendation,” in Proceedings of the ACM International Conference on Web Search and Data Mining, pp. 425–434, Cambridge, UK, 2017. View at: Google Scholar
 Y. Zhu, H. Li, Y. Liao et al., “What to do next: modeling user behaviors by time-LSTM,” in Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3602–3608, Melbourne, Australia, August 2017. View at: Google Scholar
Copyright
Copyright © 2020 Zhao Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.