Abstract

A session-based recommendation system predicts the user's next click based on an ongoing session. Existing session-based recommendation systems usually model a session as a sequence and extract sequence features with a recurrent neural network. Although this greatly improves performance, these approaches ignore the relationships between items, which contain rich information. To obtain rich item embeddings, we propose a novel Recommendation Model based on Multi-channel Convolutional Neural Network for session-based recommendation, RMMCNN for brevity. Specifically, we first capture the items' internal features from three dimensions through a multi-channel convolutional neural network. Next, we obtain external features with a GRU unit. Then, the internal and external features are merged by an attention mechanism and used as the input of the transformation function. Finally, the probability distribution is obtained as the output of the softmax function. Experiments on two real-world datasets show that our method's precision and recommendation performance are better than those of other state-of-the-art approaches.

1. Introduction

With the explosive growth of information in the Internet era, recommendation systems have become an effective way for users to deal with large amounts of information [1]. To provide a better user experience, personalized recommendation systems have been applied to many scenarios, including movie recommendation [2, 3], music recommendation [4, 5], online shopping [6, 7], and other settings. In the recommendation scenario, user behavior is modelled as a session, which consists of the sequence of clicks performed by the user on items. A session begins with the user's first click on an item and ends with the last item of the user's continuous browsing. Thus, the session contains the time series of user behavior and the information between users and items [8, 9].

Traditional recommendation systems are mainly divided into collaborative filtering (CF) based systems, content-based (CB) systems, and hybrid recommendation systems (HRS) [10]. CF-based recommendation systems build user preference models from the similarity of users and/or items, whereas CB recommendation systems make recommendations based on the content of item characteristics [11]. The former do not require contextual features and only need to train a matrix factorization model. The latter have good interpretability. To combine the advantages of both, HRS extract information from item attributes [12], users' social networks [13], and item comments [14].

In recent years, deep learning technology has also been widely used in recommendation systems [15]. At the same time, powerful cloud computing capabilities have laid the cornerstone for the development of deep learning [16]. For example, edge computing technology has made it possible to use machine learning to achieve intelligent network optimization [17]. Among the many neural models, the recurrent neural network [18] was the first to be used. Afterwards, the community took the rich features of the data into account, using the user's temporal behavior for data augmentation [19]. More recently, STAMP [20] and SRGNN [21], the latter built on a graph neural network, capture users' long-term and short-term interests, treating the former as the global interest and the user's last clicked item as the current interest for recommendation.

Although the aforementioned methods achieve substantial improvements, they still have some limitations. Firstly, a large number of session recommendation systems rely on users' historical behavior information; without a large amount of user information, they are not able to make proper recommendations. Secondly, the sequential features derived from time stamps are fully captured, but the information between items is ignored.

To overcome the limitations mentioned above, we propose a Recommendation Model based on Multichannel Convolutional Neural Network (RMMCNN). The main contributions are as follows:
(i) We introduce a multichannel convolutional neural network to extract item information in the context of a session.
(ii) To embed richer features, we use a graph neural network to extract sequence features and internal features and then combine them as the final embedding vector representation through an adaptive mechanism.
(iii) Experiments are performed to compare our model with the baseline models. The results indicate that Precision and Mean Reciprocal Rank increase by at least 0.37% and 0.52%, respectively.

2. Related Work

Conventional recommendation methods include CF, CB, and HRS systems. In recent years, neural networks have greatly improved the performance of recommendation systems, including recurrent neural networks, convolutional neural networks, and graph neural networks. Recurrent neural networks [21-23] can extract the features of users' historical click sequences. Convolutional neural networks [24, 25] can extract different local features of items and generate the corresponding item vectors. Graph neural networks [21, 26, 27] can learn from graph-structured data and capture vector embeddings of different nodes. These characteristics enable neural networks to learn more features and therefore to achieve better recommendation performance than conventional recommendation methods.

2.1. Conventional Methods
2.1.1. Collaborative Filtering (CF)

Sarwar et al. [28] consider the impact of items on recommendation performance. This work analyzes the user-item matrix to identify different relational items and then uses relational items for indirect calculations. Cheng et al. [29] propose a collaborative filtering method based on user interest sequences. They introduce the similarity of users in the sequence dimension and extract the length of the user’s longest common subinterest sequence and the total number of users. The number of common subinterest sequences is used to extract the information hidden in the sequence.

2.1.2. Content Based Recommendation (CB)

Putri et al. [30] use a supervised learning method to represent and learn the data of 3700 articles in a vector space and apply a K-nearest neighbor algorithm for metrics. To alleviate the cold start problem, Zhang et al. [31] build a learned feature relationship matrix to extract user preference information hidden in content features. Trinh et al. [32] construct an association matrix between events and the content of user characteristics. They also combine temporal and spatial relationships with user interests to make recommendations to friends of key users.

2.1.3. Hybrid Recommendation

Hybrid recommendation algorithms aim to inherit the advantages of CF and CB recommendation algorithms. Rojsattarat and Soonthornphisaj [33] improve the recommendation performance using support vector machines, which map the information to the Euclidean space in order to extract features. Kiewra [34] utilizes the similarity between search and recommendation and uses positive and negative feedback to enhance the range of recommendations. Kolahkaj et al. [35] emphasize certain features of the data such as time, location, the user's hidden rating, and geographic location information. They then combine this information with collaborative filtering, context awareness, and other methods to make recommendations dynamically.

2.2. Deep Learning Methods
2.2.1. Recurrent Neural Network (RNN)

Based on the RNN [22], the user's next clicked item can be predicted through similarity. Taking into account the essential characteristics of the item sequence, Long Short-Term Memory (LSTM) is used to capture the similarity between sequences. Xia et al. [23] combine an RNN with an attention mechanism to learn the session, its sequence characteristics, and the session context information, fully mining the sequence characteristics of user sessions through the recurrent neural network. Wu et al. [21] use the GRU unit to capture the sequence features and combine them with the last item in a session; both are then turned into latent vectors for recommendation.

2.2.2. Convolutional Neural Network (CNN)

Cai et al. [24] propose a multi-domain recommendation method based on CNN. It uses the generated user and item preference vectors to predict product ratings through a factorization machine. Gao et al. [25] build a CNN to capture the user's sequential features as positive feedback information and obtain negative feedback information through adversarial training. Afterwards, they combine both kinds of feedback to generate action value functions for recommendation.

2.2.3. Graph Neural Network (GNN)

GNNs excel at extracting node information and graph structure information. Fan et al. [26] propose a GNN framework for user-item graphs and their interactions to model two graphs and heterogeneous intensities. Xian et al. [27] construct a framework that combines the GNN with the repetitive exploration mechanism. It dynamically processes the sequence in a session through the graph structure and captures the complexity between items through the graph neural network. Wu et al. [21] use a GNN and the GRU unit to generate a latent vector representation of the session sequence information and apply an attention mechanism to combine global and local user preferences.

3. The Proposed Model

In this section, we first present the proposed RMMCNN model. Then, we formulate the problem and introduce the process of our proposed model. Our model includes five steps: (1) knowledge distillation, (2) external feature extraction, (3) internal feature extraction, (4) joint feature extraction, and (5) probability prediction.

3.1. RMMCNN Framework

Figure 1 shows the framework of the proposed RMMCNN method. First, all sessions are mapped into a vector space via a directed graph. The items clicked by the user are the nodes of the knowledge graph, and the directed edges record the order in which users clicked two adjacent items consecutively. Because several users may click the same items in the very same order, we normalize each edge. We learn the latent vector representation of items through a graph neural network, so that each session generates a corresponding embedding vector. After the embedding vector is generated from the knowledge graph, the embedding vector containing the sequence features is produced by the GRU unit. More precisely, the session node embedding vector is propagated among the nodes through the GRU, which not only extracts the features of neighboring nodes but also combines these neighboring features as the input of the graph neural network. The reset gate in the GRU then determines whether information is kept or dropped, whereas the update gate refreshes all nodes to ensure the convergence of the results.

Next, we extract external and internal features separately from these embedding vectors. Internal features are extracted through a multi-channel convolutional neural network with three channels; each channel has its own weight parameters and extracts the internal features of a different dimension. The results of all channels are combined through the attention mechanism into the final internal embedding. Finally, the click probability of each node (i.e., item) is generated through linear and softmax transformations.

3.2. Problem Formulation

The goal of a session-based recommendation system is to predict, for each session, which item will be clicked next. Consider the set of all unique items that appear in the sessions; every session is composed of a series of these items. A history session is a list of items sorted by time, in which each entry is an item that the user clicked in that session. Given such a session, our goal is to predict the item that the user will click next in the context of the session.

Next, we are going to describe the steps that we follow to obtain the next clicked item in a session.

3.3. Knowledge Distillation

Recall that, within each session, each item is represented as a node of a directed graph. A directed edge between two nodes indicates that the user clicked the two items consecutively, in the direction of the edge: the user first clicks one item and, immediately after that, clicks the next. In the directed graph, every item is embedded into a unified embedding space, so each item has an embedding vector of fixed dimensionality. For each session graph, the proposed RMMCNN outputs probabilities for all the items, and the top-K items with the highest probabilities are chosen as the candidate items.
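The graph construction described above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `build_session_graph` is hypothetical, and dividing each edge count by the total number of outgoing transitions of its source node is an assumed normalization scheme, since the text only states that edges are normalized.

```python
from collections import Counter, defaultdict


def build_session_graph(session):
    """Build a directed, weighted graph from one click session.

    `session` is a list of item IDs ordered by click time.
    Returns {source_item: {target_item: normalized_weight}}.
    """
    # Count how often each consecutive pair (source -> target) occurs.
    transitions = Counter(zip(session, session[1:]))

    # Total number of outgoing transitions per source node.
    out_total = Counter()
    for (src, _dst), count in transitions.items():
        out_total[src] += count

    # Assumed normalization: edge count divided by the out-total of its source.
    graph = defaultdict(dict)
    for (src, dst), count in transitions.items():
        graph[src][dst] = count / out_total[src]
    return dict(graph)


# Item 2 is followed once by item 3 and once by item 4,
# so each of its outgoing edges receives weight 0.5.
graph = build_session_graph([1, 2, 3, 2, 4])
print(graph)
```

Normalizing per source node keeps the outgoing weights of every node summing to 1, which is one common choice for session graphs; other schemes (for example, normalizing by incoming edges) would fit the description equally well.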

3.4. External Feature Extraction

The session contains the items clicked by the user within a period of time. To predict the user's behavior, in particular which item the user will click, we need to extract a representation of the session. At the same time, users' interests change over time, so the temporal characteristics of the session should also be extracted. Therefore, we use a GRU combined with an attention mechanism to represent these external features.

GRU controls the flow of information through gates. It uses two gates, merging the input and forget gates of the LSTM into a single update gate. The update gate, computed with the sigmoid function, determines the ratio between the previous value and the current one.

The reset gate establishes whether the current candidate state depends on the previous state of the network and, if so, the weight of this dependency.

We also need an intermediate value, the memory value, which is determined by the memory value in the previous state and the current input value. It relies on an aggregator function, AGG, for which we use element-wise multiplication of the vectors. The state value of the hidden layer is then the weighted combination of the memory value at the current moment and the previous state value. Each of these computations has its own corresponding weight matrix.
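To make the gate description concrete, the following is a minimal sketch of one step of a standard GRU cell in plain Python. It is not taken from the paper: the names `gru_step`, `W_z`, `W_r`, and `W_h` are hypothetical, bias terms are omitted, each weight matrix is assumed to act on the concatenation of the previous state and the current input, and the convention that the update gate weights the candidate (rather than the previous state) is one of two common choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU update; every weight matrix acts on the concatenation [h, x]."""
    concat = h_prev + x                              # list concatenation
    z = [sigmoid(a) for a in matvec(W_z, concat)]    # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]    # reset gate
    # AGG as element-wise multiplication of reset gate and previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate ("memory") value from the gated previous state and the input.
    h_cand = [math.tanh(a) for a in matvec(W_h, gated + x)]
    # Weighted combination of previous state and candidate value.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is simply half of the previous state.
zero_weights = [[0.0] * 4 for _ in range(2)]
h_new = gru_step([1.0, -1.0], [0.4, -0.2], zero_weights, zero_weights, zero_weights)
print(h_new)
```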

After the GRU extracts all the features, the session can be denoted as .

3.5. Internal Feature Extraction

Following the notations used in Section 3, we use to denote the raw input items sequence. We use a multi-channel convolutional neural network to extract the rich internal item features. Besides, we feed into the neural network for further processing.

We expand the original two-dimensional embedding vector into a higher-dimensional vector by first performing a convolution operation, which lets different convolution kernels extract different internal features. Our approach is loosely based on RGB image processing: we capture features of different dimensions with a multi-channel convolutional neural network and use them to build the feature maps. We set the convolution kernels to [1, 1, 1, 1], [1, 1, 1, 2], and [1, 3, 1, 1], respectively, so that each kernel yields one convolution channel result.

Then, we merge these feature maps with a linear transformation function. Once the features are extracted, we perform a dimension reduction to obtain the vector used in subsequent processing.
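The idea of running several kernels over the item embeddings and merging the resulting feature maps linearly can be illustrated with the sketch below. It rests on several assumptions that the text does not confirm: the kernel shapes are read in the TensorFlow filter layout [height, width, input channels, output channels], zero padding keeps the embedding width unchanged, the kernel values and merge weights are made up, and the linear transformation is reduced to a weighted sum of feature maps. The helper names `conv_1xk` and `multi_channel_features` are hypothetical.

```python
def conv_1xk(matrix, kernel):
    """Slide a 1 x k kernel along every row, zero-padded to keep the width (odd k)."""
    pad = len(kernel) // 2
    result = []
    for row in matrix:
        padded = [0.0] * pad + list(row) + [0.0] * pad
        result.append([
            sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(row))
        ])
    return result


def multi_channel_features(embeddings, channels, merge_weights):
    """Apply every kernel of every channel, then merge all feature maps linearly."""
    maps = [conv_1xk(embeddings, kernel) for kernels in channels for kernel in kernels]
    rows, cols = len(embeddings), len(embeddings[0])
    return [
        [sum(w * m[i][j] for w, m in zip(merge_weights, maps)) for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical kernel values; only the shapes follow the text.
channels = [
    [[1.0]],            # [1, 1, 1, 1]: one 1x1 kernel
    [[0.5], [2.0]],     # [1, 1, 1, 2]: two 1x1 kernels
    [[1.0, 1.0, 1.0]],  # [1, 3, 1, 1]: one 1x3 kernel
]
embeddings = [[1.0, 2.0, 3.0]]  # one item with a 3-dimensional embedding
merged = multi_channel_features(embeddings, channels, [0.25, 0.25, 0.25, 0.25])
print(merged)
```

In a trained model the kernel values and merge weights would be learned parameters; fixed numbers are used here only so that the data flow is easy to follow.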

On the other hand, we extract the last item as , which is denoted as ; that is, .

3.6. Joint Feature Extraction

To maximize the representation of information, we aggregate the external and internal features together. First, we aggregate all node embedding vectors with the last item, using learnable parameters that control the weights of the item vectors.

Then, we aggregate the result into the final feature. After this, we compute the hybrid embedding vector by applying a transformation to the combination of the last clicked item and the external features; a matrix compresses the two combined embedding vectors into the latent space.

3.7. Probability Prediction

After obtaining the representation of the session, we calculate a similarity score for each candidate item, using the corresponding dimension conversion matrix.

The scores are then passed through the softmax function, which yields the probability of each item being the next click of a given user in the context of the given session.

We adopt cross-entropy as the loss function, computed between the predicted probabilities and the items the user actually clicked in the session.

The recommendation model can be trained through the backpropagation algorithm, and then the parameters in the model can be updated. In this process, we use the Adam optimizer [36] to train the parameters in the RMMCNN model.
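The prediction step can be sketched as follows: raw scores are turned into probabilities with softmax, the loss is the cross-entropy against the item that was actually clicked, and the top-K items form the candidate list. This is a minimal sketch under stated assumptions: the scores are made-up numbers, the target is treated as one-hot (a single clicked item), and the function names are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, clicked_index):
    """Cross-entropy against a one-hot target: -log of the clicked item's probability."""
    return -math.log(probs[clicked_index])


def top_k(probs, k):
    """Indices of the k most probable items, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


scores = [2.0, 0.5, 1.0, -1.0]   # hypothetical similarity scores for four items
probs = softmax(scores)
loss = cross_entropy(probs, clicked_index=2)
candidates = top_k(probs, k=2)
print(probs, loss, candidates)
```

In training, this loss would be averaged over a batch and minimized with an optimizer such as Adam, as described above.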

4. Experiments and Analysis

In this part, we conduct three groups of experiments on two real-world datasets: Yoochoose, from the RecSys Challenge 2015, and Diginetica, from the CIKM Cup 2016. The first group compares the recommendation performance of different models, the second group compares the recommendation performance of different session embedding methods, and the third group compares the recommendation performance under different evaluation criteria. Our experiments are based on TensorFlow 1.4.0 and Python 3.6.

4.1. Experiment Settings
4.1.1. Datasets

Yoochoose contains a series of click events of users on e-commerce websites, and these click events can be used to predict whether the user intends to click a certain product. Diginetica contains a large amount of information such as searches, logs, product data, and transaction data. In this paper, we only use transaction data.

Considering the existence of some noisy data in the sessions, we filter out sessions of length 1 and items that have been clicked fewer than 5 times, following [20, 21]. Then, we divide each of the two datasets into a training dataset and a test dataset. The reader is referred to Table 1 for a more detailed description. Because the Yoochoose dataset is quite large, we sort its sequences by time and keep the most recent 1/64 and 1/4 fractions of the entire sequence. We refer to them as Yoochoose 1/64 and Yoochoose 1/4.
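The filtering rule can be sketched as below. This is an illustrative sketch, not the preprocessing script used in the experiments: the function name `filter_sessions` is hypothetical, and the order of the steps (drop short sessions, count item clicks, drop rare items, drop sessions that became too short) is an assumption.

```python
from collections import Counter


def filter_sessions(sessions, min_item_clicks=5, min_session_len=2):
    """Remove sessions of length 1 and items clicked fewer than `min_item_clicks` times."""
    # Step 1: drop sessions that are too short to contain a transition.
    sessions = [s for s in sessions if len(s) >= min_session_len]

    # Step 2: count how often each item is clicked in the remaining sessions.
    counts = Counter(item for s in sessions for item in s)

    # Step 3: remove rare items, then drop sessions that became too short.
    cleaned = []
    for s in sessions:
        kept = [item for item in s if counts[item] >= min_item_clicks]
        if len(kept) >= min_session_len:
            cleaned.append(kept)
    return cleaned


# "c" and "d" are rare, so the sessions containing them shrink or disappear.
raw = [["a", "b"]] * 5 + [["a", "c"], ["d"]]
print(filter_sessions(raw))
```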

4.1.2. Data Availability Statement

Two datasets are used in this paper: Yoochoose and Diginetica. The Yoochoose dataset comes from the 2015 ACM RecSys Challenge. It consists of click events performed by users during typical sessions on an e-commerce website. The data files include training data files and test data files. The former contain click events and purchase events, and the latter contain click events only. Each click event contains the session ID, timestamp, item ID, and item category. The Yoochoose dataset can be accessed at https://2015.recsyschallenge.com/challenge.html. The Diginetica dataset comes from CIKM Cup 2016. The dataset contains product data and transaction data. In this paper, we only use transaction data. The Diginetica dataset can be obtained from the following link: https://competitions.codalab.org/competitions/11161#learn_the_details-data2.

4.2. Baselines

To measure the performance of the proposed model, we compare it with the following baseline algorithms:
(i) NARM [37] uses an attention mechanism over the hidden states to enhance the original information and emphasize the main purpose of the user in the current session. It is a neural attention recommender that addresses the lack of user purpose analysis in session-based recommendation. NARM uses a hybrid encoder to model the user's sequential behavior, capture the user's main purpose in the session, and merge this information into the final representation of user behavior.
(ii) STAMP [20] combines the current session information with the last clicked item in the current session. It mainly addresses user behavior prediction based on anonymous sessions. It considers the impact of the user's current operation on the next clicked item as a tradeoff with the short-term memory model. It combines the short-term attention model with the original long-term memory model to extract the current and long-term user interests and generates the final interest of the user.
(iii) SRGNN [21] uses graph neural networks and GRU units to generate latent vector representations of nodes. This approach addresses the problem of accurate user vector generation. SRGNN models user behavior as graph-structured data and captures item transition information with a graph neural network. Then, the final embedding vector is generated through a linear transformation, and recommendations are made based on the user's clicked item sequence and the last clicked item in the session.

4.3. Evaluation Metrics

We use the widely adopted Precision (P) and Mean Reciprocal Rank (MRR) metrics to evaluate the performance of the RMMCNN model.

P@20: P@K measures predictive accuracy in recommendation systems. It is the proportion of test cases for which the correct item appears in the top-K ranking list, that is, the count of hits in the top-K list divided by the total count of test data.

MRR@20: MRR@K measures the accuracy of recommended clicked items, taking into account their position in the list ordered by predicted probability. It is the average, over the test data, of the reciprocal of the ranking position of the user's clicked item in the recommendation list. Given K = 20, if the correct item is ranked in a position greater than 20, its reciprocal rank is set to zero.
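Both metrics can be computed in a few lines. The sketch below assumes that each test case provides a ranked list of recommended item IDs and a single target item; the function name `precision_and_mrr_at_k` and the toy data are hypothetical.

```python
def precision_and_mrr_at_k(ranked_lists, targets, k=20):
    """Return (P@K, MRR@K) over all test cases.

    ranked_lists[i] holds the recommended item IDs for test case i, best first;
    targets[i] is the item the user actually clicked next.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1
            # Ranks are 1-based; a target outside the top K contributes zero.
            reciprocal_rank_sum += 1.0 / (top_k.index(target) + 1)
    n = len(targets)
    return hits / n, reciprocal_rank_sum / n


# Toy example with K = 2: only the first test case is a hit, at rank 2.
ranked_lists = [[3, 1, 2], [5, 4, 6], [7, 8, 9]]
targets = [1, 6, 0]
p_at_2, mrr_at_2 = precision_and_mrr_at_k(ranked_lists, targets, k=2)
print(p_at_2, mrr_at_2)
```

The function assumes at least one test case; an empty test set would need separate handling.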

4.4. Parameter Settings

We set the latent vectors' dimensionality to 100 for the two datasets, following the settings of [20, 21, 37]. All parameters are initialized from a Gaussian distribution with a mean of 0 and a fixed standard deviation. The initial learning rate is set to 0.001 and the batch size is set to 100. Recall that we use the Adam optimizer [36] to update all parameters. For NARM, we set the mini-batch size to 512, the learning rate to 0.001, and the number of epochs to 30. For STAMP, the learning rate decay is set to 1 and the latent vectors' dimensionality is set to 100. For SRGNN, the learning rate decays by 0.1 after every 3 epochs and the dimensionality is also 100. These key variables are shown in Table 2.

4.5. Experimental Results

In this section, we compare RMMCNN in various aspects. Firstly, we compare RMMCNN with state-of-the-art methods. Then, we compare the performance of RMMCNN methods with variants of session embeddings. Finally, we compare the performance of RMMCNN with different evaluation metrics.

4.5.1. Comparison with Baseline Methods

We compare our proposed model with the baseline models NARM, STAMP, and SRGNN. The results show that RMMCNN outperforms all of them. The overall results are given in Figures 2 and 3, and a more detailed overview can be found in Table 3. RMMCNN aggregates the external and internal features of the session into the final representation and makes recommendations based on both. Compared with the current mainstream recommendation methods based on neural networks, RMMCNN shows excellent performance. NARM captures the user's overall interest through a recurrent neural network, and STAMP adds the last clicked event to the information extraction of the recommended item for the purpose of information enhancement. These neural network-based methods achieve better recommendation performance than other methods. Figure 2 shows that the performance of STAMP is very low on the Diginetica dataset; this may be because STAMP only uses the transition between the user's last clicked item and the user's historical clicked items, which may not be enough to predict user session behavior. SRGNN [21] considers the sequence characteristics of the user's clicked items together with the user's last click event, combining the global characteristics with the local characteristics of the last click event for recommendation.

Our proposed model, RMMCNN, considers not only the external characteristics but also the internal characteristics of the session. Therefore, it combines the sequential characteristics of the session, the internal item characteristics of the session, and the last clicked event of the user to make recommendations. Simultaneously, RMMCNN uses an attention mechanism to automatically learn the weight representation. Thus, RMMCNN has obtained a richer user session representation and can make better recommendations for different user behaviors.

4.5.2. Comparison with Variants of Session Embeddings

We conduct separate experiments on the components of the model to verify the rationality of the way RMMCNN connects them, evaluating the internal and external recommendation methods individually. We define the sequence feature of the session as an external feature and the relationship between users and items as an internal feature. We call the resulting variants RMMCNN-EXTER, RMMCNN-INTER, and RMMCNN-JOINT:
(i) RMMCNN-EXTER: We first model the session as a directed graph and then extract the sequence features of the session through the GRU unit. Finally, we obtain different weight representations through the attention network.
(ii) RMMCNN-INTER: We obtain the embedding vector of the session, which represents the internal relationship, through a multichannel convolutional neural network. This yields richer embedded vector information.
(iii) RMMCNN-JOINT: We combine the external vector with the internal vector of the session to generate the final session embedding representation.

Figures 4 and 5 show that the recommendation performance of RMMCNN-JOINT is better than that of the other two variants. This is because RMMCNN-JOINT embeds more user information, so the content representation used for recommendation is also richer. More details can be found in Table 4.

4.5.3. Comparison with Different Evaluation Metrics

In order to further measure the recommendation performance of different methods, we use different evaluation metrics to evaluate them. As illustrated in Figure 6, the results show that as the value of K increases, the recommendation accuracy rate and recommendation performance decrease, which indicates that the ideal recommendation result ranking position is not high. On the other hand, from the results, we can see that the recommendation performance improves as the model complexity increases. This shows that as the complexity of the model increases, the embedded information extracted by the model is richer, and the recommendation results rank higher.

We tested on the Yoochoose and Diginetica datasets. On the Yoochoose dataset, P@10 is 60.32% and P@20 is 70.94%, both better than those of the baseline models. On the Diginetica dataset, P@10 is 36.64%, which is lower than the 39.89% of NARM. However, P@20 is 51.82%, compared with 50.73% for SRGNN, indicating that on the Diginetica data most correct RMMCNN recommendations are ranked between positions 10 and 20.

For P@10 in Table 5, NARM is superior to the other methods. This is mainly due to NARM's hybrid encoder attention mechanism. On the Diginetica dataset, this attention mechanism is more complex than those in the other models and aggregates more representation information, so its performance is significantly better than that of the other recommendation models. However, as the value of K increases, the recommendation performance of NARM falls behind that of other, more complex models. We argue that more complex models introduce more dimensions of representation information, which is much better than simply using the attention mechanism.

5. Conclusions

In this paper, we first propose a novel multichannel convolution model to capture the rich information of items for recommendation. Then, we use the attention mechanism to adaptively obtain the features of the user's clicked item sequence and combine the internal and external features of the session to jointly generate the final session vector embedding. Once we have the vector, we apply a linear transformation followed by the softmax function, which converts the result into a probability value between 0 and 1. We have compared our model experimentally with state-of-the-art recommender systems on two real datasets, and both Precision and MRR are improved. In the future, we will further study richer representations of item embeddings to make more accurate recommendations.

Data Availability

The data used can be found at https://2015.recsyschallenge.com/challenge.html and https://competitions.codalab.org/competitions/11161#learn_the_details-data2.

Conflicts of Interest

There are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by Major Project of the National Natural Science Foundation of China (no. 51935002) and the National Key Research and Development Program of China (no. 2018YFC0831903).