Abstract

Personalized recommender systems, as effective approaches for alleviating information overload, have received substantial attention in the last decade. Learning effective latent factors plays the most important role in recommendation methods. Several recent works extracted latent factors from user-generated content such as ratings and reviews, but suffered from the sparsity problem and the unbalanced distribution problem. To tackle these problems, we enrich the latent representations by incorporating both user-generated content and item raw content. Deep neural networks have emerged as very appealing tools for learning effective representations in many applications. In this paper, we propose a novel deep neural architecture named DeepFusion to jointly learn user and item representations from numerical ratings, textual reviews, and item metadata. In this framework, we utilize multiple types of deep neural networks that are best suited for each type of heterogeneous input and introduce an extra layer to obtain the joint representations for users and items. Experiments conducted on the Amazon product data demonstrate that our approach outperforms multiple state-of-the-art baselines. We provide further insight into the design choices and hyperparameters of our recommendation method. In addition, we explore the relative importance of various types of item metadata for improving the rating prediction performance in personalized product recommendation, which is extremely valuable for feature extraction in practice.

1. Introduction

With the explosive growth of the Internet and the number of available products, it is difficult for customers to deal with the large amount of available information. To alleviate information overload [1], personalized product recommendation has been adopted by e-commerce websites to present products that best meet customers’ needs and expectations. The success of many companies, such as Amazon, eBay, Yelp, and Netflix, is partly due to their accurate and personalized product recommender systems [2].

Collaborative filtering (CF) [3–6] is one of the most successful recommendation techniques in both industry and academia. The basic idea of the technique is that people who share similar ratings tend to have similar preferences. However, CF-based methods easily suffer from the cold-start problem [7] when items have only a few ratings, which severely deteriorates the accuracy of recommendation.

Recently, researchers have found that additional data sources beyond ratings are extremely helpful in personalized recommendation. However, most existing recommender systems take into account only user-generated content such as textual reviews in user/item profiling, while ignoring the raw content of items. First, user-generated content is very sparse in many applications. For example, in the Netflix dataset [8], the observed ratings cover only approximately 1 percent of all possible user-movie pairs. Second, the distribution of user-item interaction data is highly unbalanced. According to Anderson [9], due to the long-tail effect, only a few users interact with a large number of items, while most users rarely, or never, interact with items. Besides, these additional data sources come in very different and heterogeneous forms, which makes it difficult to fuse them in a unified way.

Fortunately, deep learning techniques have shown powerful representation learning performance and outstanding scalability, which has shed light on this problem. Several deep learning-based recommender systems have been proposed. Yet, most of them [10–14] were restricted to limited data sources or learned the latent representations of users and items independently. As a result, these approaches cannot achieve fine-grained modeling of user preferences and item features. In addition, the metadata of a product (item), i.e., price, brand, title, and description, plays an important role in users' buying behaviors on e-commerce websites. However, to the best of our knowledge, no study has examined the relative importance of the various types of item metadata.

In this paper, we propose a novel personalized recommendation method based on deep learning and multiview fusion, called DeepFusion, for the rating prediction task. In this framework, each kind of data source is considered as a view, and different views describe different aspects of user preferences and item features. Because these data sources are heterogeneous, we utilize multiple types of deep neural networks, i.e., multilayer perceptrons (MLP) and convolutional neural networks (CNN), to make the best use of these inputs. Then, the representations from each view are mapped to a shared semantic space with a merged layer to obtain the integrated user/item representations. Finally, a multilayer perceptron layer is introduced to capture the complex relations between users and items. For model learning, the loss function is defined as the error between the predicted rating and the actual rating, and the model parameters are adjusted via backpropagation. Our proposed model is evaluated on the Amazon product data and compared with classic and state-of-the-art recommendation approaches. The experimental results demonstrate that DeepFusion significantly outperforms all baselines.

The main contributions of our work are summarized as follows:
(i) We propose a novel personalized recommendation method based on deep learning and multiview fusion, called DeepFusion, for the rating prediction task in product recommendation. The method is capable of incorporating user-generated content and item raw content, including numerical ratings, textual reviews, and item metadata, in a unified space.
(ii) We utilize multiple types of deep neural networks that are best suited for each type of heterogeneous data source to jointly learn user and item representations, which helps to tackle the sparsity problem and the unbalanced distribution problem.
(iii) We conduct a series of extensive experiments on real-world datasets. The experimental results demonstrate that our approach outperforms all the baselines. We further study the impact of the design choices and hyperparameters of our recommender system.
(iv) To the best of our knowledge, we are the first to explore the relative importance of product (item) metadata, i.e., price, brand, title, and description, on user buying behaviors on e-commerce websites.

The remainder of the paper is organized as follows. In Section 2, we present an overview of related work. In Section 3, we describe our proposed model in depth. Then, we present the experimental results and analyses in Section 4. Finally, we conclude our work and introduce the future work in Section 5.

2. Related Work

There have been extensive works on recommender systems, with a myriad of publications. In this section, we briefly review a representative set of approaches that are most related to our proposed approach.

2.1. Additional Data Sources for Recommender Systems

In recent years, numerous works have been proposed to exploit additional data sources for personalized recommendation. A popular research line is the joint modeling of numerical ratings and textual reviews for recommendation. Textual reviews are able to express user opinions towards various item features. McAuley and Leskovec [15] proposed the hidden factors as topics (HFT) model. This model extracted latent topics from reviews via the latent Dirichlet allocation topic model [16] and associated topics with rating dimensions. Chambua et al. [17] introduced the linguistic similarity between review texts and incorporated it into the probabilistic matrix factorization (PMF) model [18]. Cheng et al. [19] emphasized the importance of considering users’ varying attention on different aspects and applied the aspect-aware topic model (ATM) to the review text to estimate the aspect attention weights of a user towards an item. Although the mixture of aspects discovered by topic-based methods may describe a corpus fairly well, aspects often consist of unrelated or loosely related concepts [20]. Moreover, because they fail to preserve the original order of words and ignore their semantic meaning, the above methods cannot successfully model a given review. To tackle these limitations, researchers have paid extensive attention to neural network methods. Zheng et al. [11] presented deep cooperative neural networks (DeepCoNN) for learning user behaviors and item properties by using two parallel CNNs. Chen et al. [12] aimed at exploiting the usefulness of reviews and developed a neural attention regression model for predicting ratings and selecting highly useful reviews simultaneously. Cheng et al. [21] developed a novel aspect-aware recommender model named A3NCF, which can capture a user’s special attention on each aspect of the targeted item with an attention network.

In addition to textual reviews, additional data sources such as tags, item descriptions, item images, and user social networks have been used as supplemental information for sparse ratings. Ma et al. [22] investigated the combination of tags and genre information for identifying user interests via an augmented matrix factorization approach. Kim et al. [23] utilized a CNN to capture contextual information of item description documents and integrated it into the PMF method with consideration of the Gaussian noise. Cheng et al. [24] applied a proposed multimodal aspect-aware topic model (MATM) to textual reviews and item images to model user preferences and item features from different aspects. Huang et al. [25] constructed a hybrid multigroup coclustering recommendation framework to cluster users and items into multiple categories simultaneously, which fully utilized various data sources including ratings and user social networks. Qian et al. [26] fused three representative types of heterogeneous information, namely, ratings, user social networks, and user review sentiments, to comprehensively analyze user features. Zheng et al. [27] considered the evolving nature of user preferences over time and developed a time-sensitive and tag-aware recommendation framework. Bougiatiotis and Giannakopoulos [28] presented a content-based movie recommender system that was based on textual information and audio and visual channels.

2.2. Deep Learning for Recommender Systems

Recently, deep learning has achieved tremendous success in recommender systems. In this section, we review several representative related approaches. There are extensive works that combine neural network structures with collaborative filtering. Li et al. [29] learned effective latent representations via a deep architecture for CF, which coupled PMF with marginalized denoising stacked autoencoders (mDAs). Wang and Wang [30] integrated the deep belief network and the probabilistic graphical model and used the hybrid model to learn features from audio content. Wu et al. [31] designed a collaborative denoising autoencoder (CDAE) for top-N recommendation by training on a corrupted version of the known user-item interactions.

Due to the strong performance of deep learning on feature representation and combination, many deep models have been developed for learning the latent representations of users and items. Wang et al. [32] proposed collaborative deep learning (CDL), which learned a deep representation from movie content by using the generalized stacked autoencoder (SAE) model. Yu et al. [33] proposed an interactive attention mechanism to learn the latent representations of users and items and provided interpretable item recommendation. Zhou et al. [34] developed the deep interest network (DIN) for adaptively learning the representation of user interests via a local activation unit.

Several recent studies focused on the interactions between features. He et al. [35] proposed a general neural collaborative filtering (NCF) approach for modeling nonlinear interactions between users and items. After that, they [36] developed the neural factorization machine (NFM), which seamlessly combined the linearity of the factorization machine (FM) [37] and the nonlinearity of the neural network to capture second-order and higher-order feature interactions between users and items. Cheng et al. [38] designed the wide&deep learning model for enhancing the memorization and generalization performances of recommender systems. Guo et al. [39] presented DeepFM, an end-to-end learning model that emphasizes both low- and high-order feature interactions. Chambua et al. [40] established a hybrid recurrent neural network-long short-term memory (RNN-LSTM) architecture for learning user preferences with item aspects. Zhou et al. [41] proposed a deep interest evolution network (DIEN) for click-through rate prediction, which can capture temporal interests via an interest extractor layer.

Recently, several deep learning recommendation methods with multiview fusion have been proposed. Chen et al. [42] argued that, in multimedia recommendation, there exists item- and component-level implicitness that blurs the underlying user preferences and proposed an attention mechanism in CF. In their model, a CNN was used to extract image features and video features. Zhang et al. [43] used three heterogeneous data sources, including ratings, reviews, and image information, to jointly model the user and item representations based on deep representation learning architectures. Gan et al. [44] adopted a convolutional neural network to extract hidden features from item descriptions and then fused them with tag information. Elkahky et al. [45] proposed a content-based cross-domain recommender system, which learned rich user and item features according to user web browsing histories and search queries. Tal and Liu [46] presented a textual and contextual embedding-based neural recommender (TCENR) for point-of-interest (POI) recommendation, in which multiple types of deep neural networks were utilized to analyze additional data sources including social networks, geospatial locations, and textual reviews. Guo et al. [47] presented a multimodal representation learning method to predict user preferences based on multimodal content, including visual features, text features, audio features, and user interaction history, for short video understanding and recommendation. Chen et al. [48] proposed a novel neural architecture for fashion recommendation based on both image region-level features and user review information. Cui et al. [49] presented a visual and textual recurrent neural network (VT-RNN), which simultaneously learned the sequential latent vectors of users' interests and captured content-based representations. Although these models achieve better performance than approaches that model ratings alone, they learn the latent representations of users and items independently. Thus, it is difficult for them to effectively model user preferences and item features.

3. DeepFusion Approach

3.1. Overview of DeepFusion

The architecture of our proposed method DeepFusion is illustrated in Figure 1. The model consists of three components: Reviews Modeling, Ratings Modeling, and Item Metadata Modeling. The Reviews Modeling component, which is presented in the left part of the figure, utilizes two parallel convolutional neural networks to extract rich, relevant textual features of users and items by jointly considering their corresponding reviews. Shown in the middle of Figure 1, the Ratings Modeling component uses a neural architecture of matrix factorization (MF) to acquire pair-dependent latent features on the basis of numerical ratings. Shown on the right side of Figure 1, the Item Metadata Modeling component introduces four deep neural networks to learn complementary item features based on item raw content, i.e., price, brand, title, and description. The outputs of the three components are further mapped to a shared semantic space with a merged layer to obtain the integrated user preferences and item features. Then, we develop an MLP architecture to capture the complex relations between users and items and predict ratings. Finally, the loss function is defined as the error between the predicted rating and the actual rating, and the model parameters are adjusted via a backpropagation process. Key notations adopted in this paper are summarized in Table 1.

3.2. Reviews Modeling

To capture the underlying meanings of textual reviews, we use the Reviews Modeling component to improve the model’s coverage. This process is conducted with two similar CNNs [11]: one network for users and one network for items. In the following, we describe the user network in detail.

The first layer is the word embedding layer, which receives user reviews as the input in their original order and outputs a $d$-dimensional distributed vector for each word. In this paper, user reviews refer to all the reviews that were written by user $u$. As a widely used word representation in information retrieval, the bag-of-words model is based on the one-hot representation [50] and is usually used to transform a word into a feature vector. However, the one-hot representation of a word, which ignores semantic and grammatical interpretation, tends to suffer from the curse of dimensionality [51]. To alleviate this problem, we resort to Word2Vec [52], a widely used natural language processing method, to convert the dictionary of words into formal and uniform vectors. The output of the word embedding layer can be represented as

$E_u = \phi(w_1) \oplus \phi(w_2) \oplus \cdots \oplus \phi(w_n)$, (1)

where $w_t$ denotes the $t$-th word of the user review, $n$ represents the length of the user review, $\phi(\cdot)$ is a lookup function that returns the corresponding word vector, and $\oplus$ denotes the concatenation of word vectors.
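To make the embedding step concrete, the following is a minimal PyTorch sketch of such a lookup layer; the vocabulary size, embedding dimension, and review length are illustrative assumptions, and in practice the embedding weights would be initialized from pretrained Word2Vec vectors.

import torch
import torch.nn as nn

# Illustrative sizes; not the exact settings of the paper.
vocab_size, embed_dim, review_len = 5000, 300, 80

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
# embedding.weight.data.copy_(torch.from_numpy(pretrained_word2vec_matrix))  # optional pretrained init

# A batch of user reviews, already tokenized and mapped to word indices (padded to review_len).
word_ids = torch.randint(1, vocab_size, (4, review_len))   # batch of 4 users
E_u = embedding(word_ids)                                  # shape: (4, review_len, embed_dim)
print(E_u.shape)   # each row concatenates the word vectors of one user's reviews, as in equation (1)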

The next layer is the convolution layer, which consists of $m$ neurons. Each neuron $j$ is associated with a convolution kernel $K_j$, where $c$ is the window size. The convolution kernel performs multiple convolution operations on the word vectors and adds a bias to obtain the feature map. The features of neuron $j$ can be defined as

$z_j = f(E_u * K_j + b_j)$, (2)

where the symbol $*$ denotes the convolution operation, $b_j$ is the bias, and $f$ is a nonlinear activation function. According to [33], we employ the rectified linear unit (ReLU), which has been widely used in neural networks, as the activation function:

$f(x) = \max(0, x)$. (3)

Then, we use the max pooling operation to select the maximum value as the significant feature for extraction. The pooling layer not only reduces the dimension of the data but also retains the representative features. Suppose that $z_j = [z_{j,1}, z_{j,2}, \ldots, z_{j,n-c+1}]$ are the features of neuron $j$ over the sliding windows; the convolutional result is reduced to a fixed-size feature:

$o_j = \max(z_{j,1}, z_{j,2}, \ldots, z_{j,n-c+1})$. (4)

After applying similar operations for all $m$ neurons, the final output of the pooling layer is the concatenation of the various features of the neurons, and the output is

$O_u = [o_1, o_2, \ldots, o_m]$. (5)

As expressed in equation (6), the output $O_u$ of the pooling layer is fed to the fully connected layer, which comprises a weight matrix $W$ and a bias term $g$ and uses ReLU as the activation function. The outputs of the fully connected layer are the user preferences or item features based on textual reviews, denoted as $X_u$ and $Y_i$, respectively:

$X_u = f(W O_u + g)$. (6)
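The following PyTorch sketch illustrates one possible implementation of this review tower (convolution, ReLU, max pooling, and the fully connected layer); the number of neurons, window size, and output dimension are illustrative assumptions rather than the paper's exact settings.

import torch
import torch.nn as nn

class ReviewTower(nn.Module):
    # Sketch of the user-side review network (equations (2)-(6)); the item network is identical.
    def __init__(self, embed_dim=300, num_neurons=100, window=3, out_dim=16):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, num_neurons, kernel_size=window)  # m convolution kernels
        self.fc = nn.Linear(num_neurons, out_dim)                          # W, g in equation (6)

    def forward(self, E):                              # E: (batch, review_len, embed_dim)
        z = torch.relu(self.conv(E.transpose(1, 2)))   # equations (2)-(3): convolution + ReLU
        o = z.max(dim=2).values                        # equation (4): max pooling over the windows
        return torch.relu(self.fc(o))                  # equation (6): X_u (or Y_i on the item side)

X_u = ReviewTower()(torch.randn(4, 80, 300))
print(X_u.shape)   # (4, 16)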

3.3. Ratings Modeling

Matrix factorization is one of the most popular model-based CF methods and can learn linear interactions between users and items. Inspired by [35], we develop a full neural treatment of MF for deriving pair-dependent latent representations on the basis of user-item rating pairs, which is named as the Ratings Modeling component.

As illustrated in Figure 1, the inputs of this neural network are the identities of users. The embedding layer projects the sparse one-hot representation of a user ID onto a dense vector via a lookup function $\psi(\cdot)$. Let $P$ be the ID embedding matrix of users. Then, the corresponding dense vector of user $u$ is defined as

$p_u = \psi(P, u)$. (7)

The next layer is a full connection layer whose output can be regarded as the latent vector $U_u$ for user $u$ in the context of numerical ratings. Here, we still use ReLU as the activation function:

$U_u = f(W_1 p_u + b_1)$. (8)

The same process is adopted by the item network with corresponding layers, and we can acquire the latent vector $V_i$ for item $i$ in the context of numerical ratings.
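A minimal PyTorch sketch of the user side of this component is given below; the number of users and the embedding and output dimensions are illustrative, and the item side would mirror it with item IDs.

import torch
import torch.nn as nn

class RatingsTower(nn.Module):
    # Sketch of the Ratings Modeling component for users (equations (7)-(8)).
    def __init__(self, num_users=1000, id_embed_dim=32, out_dim=16):
        super().__init__()
        self.id_embedding = nn.Embedding(num_users, id_embed_dim)  # lookup psi(.) in equation (7)
        self.fc = nn.Linear(id_embed_dim, out_dim)                  # full connection in equation (8)

    def forward(self, user_ids):                   # user_ids: (batch,)
        p_u = self.id_embedding(user_ids)          # dense ID embedding p_u
        return torch.relu(self.fc(p_u))            # latent vector U_u in the rating context

U_u = RatingsTower()(torch.tensor([0, 7, 42]))
print(U_u.shape)   # (3, 16)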

3.4. Item Metadata Modeling

The above two components are built based on user-generated content and easily suffer from the sparsity problem and the unbalanced distribution problem. Therefore, we develop the Item Metadata Modeling component to model item raw content. Different types of item metadata describe different aspects of item features. Considering feasibility and availability, we use four types of product (item) metadata including price, brand, title, and description as supplemental information for item features.

We regard “price” and “brand” as structured indicators, which are typically easy to understand. Thus, the network for price is analogous to that for brand, and they differ only in terms of their inputs. In the first layer of the price network, an embedding function maps the price range into a corresponding $d$-dimensional vector $e_{pr}$. Then, we use a full connection layer to learn the representation of the price feature $M_{pr}$:

$M_{pr} = f(W_{pr} e_{pr} + b_{pr})$. (9)

Similarly, we use the brand category as the input of the brand network and acquire the brand feature representation $M_{br}$.

In addition, title and description texts typically reflect an overall profile of an item and are written in natural language. Therefore, we use two parallel convolutional neural networks to learn the title and description feature representations $M_{ti}$ and $M_{de}$, respectively. These CNNs are similar to the user network described in Section 3.2.

After that, to map the above four aspect features into a unified feature space, we concatenate them directly:

$M_i' = M_{pr} \oplus M_{br} \oplus M_{ti} \oplus M_{de}$. (10)

To prevent high dimensionality of the feature vector and emphasize the most important features of the item, we feed $M_i'$ into a full connection layer to obtain the item features $M_i$ in the context of the item metadata:

$M_i = f(W_m M_i' + b_m)$. (11)
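The sketch below shows one way to assemble this metadata component along the lines described above; the numbers of price ranges and brands, the embedding sizes, and the small text-CNN settings are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    # Small CNN used for title and description texts (same structure as the review network).
    def __init__(self, embed_dim=300, num_neurons=100, out_dim=16):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, num_neurons, kernel_size=3)
        self.fc = nn.Linear(num_neurons, out_dim)

    def forward(self, E):                                  # E: (batch, text_len, embed_dim)
        o = torch.relu(self.conv(E.transpose(1, 2))).max(dim=2).values
        return torch.relu(self.fc(o))

class MetadataTower(nn.Module):
    # Sketch of the Item Metadata Modeling component (equations (9)-(11)).
    def __init__(self, num_price_ranges=27, num_brands=500, embed_dim=16, out_dim=16):
        super().__init__()
        self.price_emb = nn.Embedding(num_price_ranges, embed_dim)
        self.brand_emb = nn.Embedding(num_brands, embed_dim)
        self.price_fc = nn.Linear(embed_dim, out_dim)     # M_pr, equation (9)
        self.brand_fc = nn.Linear(embed_dim, out_dim)     # M_br (analogous)
        self.title_cnn = TextCNN(out_dim=out_dim)         # M_ti
        self.desc_cnn = TextCNN(out_dim=out_dim)          # M_de
        self.merge_fc = nn.Linear(4 * out_dim, out_dim)   # equation (11)

    def forward(self, price_id, brand_id, title_emb, desc_emb):
        m_pr = torch.relu(self.price_fc(self.price_emb(price_id)))
        m_br = torch.relu(self.brand_fc(self.brand_emb(brand_id)))
        m_ti = self.title_cnn(title_emb)
        m_de = self.desc_cnn(desc_emb)
        m_cat = torch.cat([m_pr, m_br, m_ti, m_de], dim=1)    # equation (10)
        return torch.relu(self.merge_fc(m_cat))               # item metadata features M_i

M_i = MetadataTower()(torch.tensor([3, 9]), torch.tensor([12, 40]),
                      torch.randn(2, 20, 300), torch.randn(2, 60, 300))
print(M_i.shape)   # (2, 16)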

3.5. Multiview Fusion and Prediction

To effectively use multiview information in recommender systems, we fuse the above three components by merging the outputs of their last layers. The final representation for a user consists of two data sources, user-item numerical ratings and user-item textual reviews, while the final representation for an item additionally comprises the item metadata: $\tilde{x}_u = X_u \oplus U_u$ and $\tilde{y}_i = Y_i \oplus V_i \oplus M_i$.

Then, to better capture the complex relations between users and items, we utilize an MLP architecture [34], which is a widely used technique with excellent scalability. We concatenate $\tilde{x}_u$ and $\tilde{y}_i$ into a single vector and feed it into additional hidden layers to predict the rating $\hat{r}_{u,i}$.

The number of hidden layers can be customized to better model the latent structure of user-item interactions. In our model, we use two hidden layers with 32 and 16 hidden units.
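Putting the pieces together, a minimal sketch of the merged layer and the prediction MLP (with the 32- and 16-unit hidden layers mentioned above) could look as follows; the per-component feature size is an illustrative assumption.

import torch
import torch.nn as nn

class FusionPredictor(nn.Module):
    # Concatenates the component outputs into user/item representations and feeds them
    # through two hidden layers (32 and 16 units) to produce a rating.
    def __init__(self, feat_dim=16):
        super().__init__()
        # user side: review + rating features; item side: review + rating + metadata features
        in_dim = 2 * feat_dim + 3 * feat_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, X_u, U_u, Y_i, V_i, M_i):
        x_u = torch.cat([X_u, U_u], dim=1)             # fused user representation
        y_i = torch.cat([Y_i, V_i, M_i], dim=1)        # fused item representation
        return self.mlp(torch.cat([x_u, y_i], dim=1)).squeeze(-1)   # predicted rating

parts = [torch.randn(4, 16) for _ in range(5)]
print(FusionPredictor()(*parts).shape)   # (4,)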

3.6. Learning

According to [12, 53], we adopt the squared loss as the objective function, which is commonly used in rating prediction problems. Suppose $\hat{r}_{u,i}$ and $r_{u,i}$ represent the predicted value and the ground-truth value of user $u$ on item $i$; then the objective function can be defined as

$L = \sum_{(u,i) \in T} (\hat{r}_{u,i} - r_{u,i})^2$, (12)

where $T$ denotes the set of instances for training.

Then, we optimize the model via adaptive moment estimation (Adam) [54] and adjust the model parameters via a backpropagation process. Because Adam adapts the learning rate automatically, the deep learning model converges quickly. In addition, to avoid overfitting, we adopt the dropout [55] strategy: after obtaining the merged vector, we randomly drop $\rho$ percent of the neurons and their connections, where $\rho$ is the dropout ratio.
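A minimal training-step sketch under these choices (squared loss, Adam, dropout) is shown below; the stand-in network, batch size, learning rate, and dropout ratio are illustrative, not the paper's exact settings.

import torch
import torch.nn as nn

# "model" stands in for the full DeepFusion network in this sketch.
model = nn.Sequential(nn.Linear(80, 32), nn.ReLU(), nn.Dropout(p=0.5),   # dropout ratio rho
                      nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()          # squared error between predicted and actual ratings

merged = torch.randn(64, 80)      # a batch of merged user-item vectors
ratings = torch.randint(1, 6, (64,)).float()

model.train()
optimizer.zero_grad()
loss = criterion(model(merged).squeeze(-1), ratings)   # equation (12) averaged over the batch
loss.backward()                   # backpropagation
optimizer.step()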

4. Experiments and Evaluation

4.1. Dataset

To evaluate the performance of our model in terms of rating prediction, we conduct extensive experiments on three real-world datasets collected from the Amazon product data (http://jmcauley.ucsd.edu/data/amazon): Musical_Instruments, Automotive, and Sports_and_Outdoors. These datasets consist of users’ explicit ratings on items, on a scale of 1 to 5, and textual reviews for various products. In summary, each rating record is a four-tuple (userID, itemID, rating, review). Each item contains itemID, price, brand, title, and description. Items that do not have metadata information such as the price, title, or description documents are removed from the dataset. Users and items that have fewer than 5 ratings are also removed. The detailed statistics are presented in Table 2.

We preprocess the textual documents, including reviews, titles, and descriptions, according to [12]. Since the length of the text and the size of the vocabulary exhibit a long-tail effect [9], we only keep $p$ percent of the length of reviews, where $p$ is set to 0.85, and keep $q$ percent of the size of the vocabulary, where $q$ is set to 0.7. We project the prices into specified price ranges with intervals of 10 and classify items for which the price exceeds 260 into the same class. Items that lack the brand attribute are classified as “other,” as we argue that users who purchase those items do not pay much attention to the brand attribute.
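A small pandas sketch of the price and brand preprocessing described above; the column names and example values are hypothetical.

import pandas as pd

items = pd.DataFrame({"itemID": [1, 2, 3],
                      "price": [12.5, 310.0, 47.0],
                      "brand": ["Fender", None, "Shimano"]})

# Project prices into ranges of width 10; everything above 260 falls into the same capped class.
items["price_range"] = (items["price"].clip(upper=260) // 10).astype(int)

# Items without a brand attribute are assigned to the "other" class.
items["brand"] = items["brand"].fillna("other")

print(items)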

4.2. Evaluation Criteria

We adopt the root mean square error (RMSE) as the evaluation criterion, which is a standard evaluation metric for rating prediction in recommender systems [12]. Given a predicted rating $\hat{r}_{u,i}$ and a ground-truth rating $r_{u,i}$ from user $u$ to item $i$, the RMSE score is computed by

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,i)} (\hat{r}_{u,i} - r_{u,i})^2}$. (13)

We also use the mean absolute error (MAE) to evaluate our model, which has been widely used in previous studies [18]. The MAE score is calculated as

$\mathrm{MAE} = \frac{1}{N} \sum_{(u,i)} |\hat{r}_{u,i} - r_{u,i}|$, (14)

where $N$ denotes the number of ratings between users and items. A lower RMSE (equation (13)) and a lower MAE (equation (14)) correspond to better recommendation performance.
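For reference, the two metrics can be computed as in the following NumPy sketch.

import numpy as np

def rmse(pred, truth):
    # Root mean square error over N predicted/ground-truth rating pairs (equation (13)).
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.sqrt(np.mean((pred - truth) ** 2))

def mae(pred, truth):
    # Mean absolute error over the same pairs (equation (14)).
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth))

print(rmse([4.2, 3.1, 5.0], [4, 3, 5]), mae([4.2, 3.1, 5.0], [4, 3, 5]))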

4.3. Baseline Methods

To evaluate the performance of our proposed DeepFusion method, we select four comparison algorithms and describe them briefly as follows:
(i) PMF [18]: probabilistic matrix factorization based on ratings is a standard rating prediction method that initializes the latent factors of users and items from a Gaussian distribution.
(ii) NeuMF [35]: neural matrix factorization based on ratings is a state-of-the-art model that fuses generalized matrix factorization (GMF) and an MLP to jointly model the linear and nonlinear interactions between user preferences and item features.
(iii) ConvMF [56]: convolutional matrix factorization is based on ratings and textual description documents. ConvMF is a strong baseline that integrates a CNN into PMF to improve the rating prediction accuracy.
(iv) DeepCoNN [11]: the deep cooperative neural network method, based on ratings and textual reviews, utilizes two parallel neural networks to jointly learn user preferences and item features and enables these two latent factors to interact with each other in a manner similar to a factorization machine. Since DeepCoNN was evaluated against strong topic models, such as collaborative topic regression (CTR) [57], hidden factors as topics (HFT) [15], and collaborative deep learning (CDL) [32], and demonstrated superior performance, we do not repeat those comparisons in this paper.

4.4. Experiment Details

Each dataset is split randomly into a training set (80%), a validation set (10%), and a test set (10%). The training set contains at least one rating for every user and item and is used to train our model. The validation set is used to tune hyperparameters and to perform early stopping during the training phase. The test set is used for the final performance comparison.

According to the results of parameter tuning, we set the number of latent factors to $k = 50$ for PMF and select its regularization parameters in the same way. For ConvMF, we leverage the reviews on the items as item description documents, set the latent dimensions of the user and item factors to 50, and likewise tune the regularization parameters on the validation set. For the deep learning-based methods, NeuMF, DeepCoNN, and DeepFusion, the learning rate, batch size, and dropout ratio are chosen on the validation set, and the number of latent factors is set to $k = 16$ (see Section 4.6.1). For the CNNs in DeepFusion, we reuse most of the hyperparameter settings that were presented by the authors of DeepCoNN, including the number of convolutional neurons and the window size. We use Word2Vec as the pretrained word embedding model and initialize the word latent vectors with pretrained 300-dimensional word embeddings, which were trained on more than 100 billion words from Google News [41].

4.5. Performance Evaluation

The performances of our proposed algorithm and all baselines on the three datasets are reported in Tables 3 and 4. The experiments are repeated three times, and the averages are reported, with the best results indicated in bold.

Among the methods that rely only on ratings, NeuMF outperforms PMF on both evaluation criteria. The main limitation of PMF is that it learns the latent factors via a global optimization strategy and predicts an unknown rating by the dot product of the targeted user and item latent factors. As a result, the performance can be severely compromised locally for individual users or items. In contrast, NeuMF combines the linearity of MF and the nonlinearity of deep neural networks to learn user and item latent features. The hybrid architecture accurately models the interactions between them and thus offers better performance.

In addition, the experimental results demonstrate that all review-based rating prediction methods (ConvMF, DeepCoNN, and DeepFusion) outperform PMF. This is because a rating only reflects the overall satisfaction or judgment of a user towards an item, so relying solely on ratings makes it hard for PMF to explicitly and accurately model user and item features. Textual reviews indicate user opinions and emotions towards items’ various features, and the review-based methods can therefore provide a more fine-grained analysis. Moreover, DeepCoNN and DeepFusion outperform ConvMF. ConvMF only utilizes a CNN to acquire item features from reviews, while user preferences are learned solely from numerical ratings. As a result, ConvMF achieves weaker rating prediction.

Overall, our method DeepFusion outperforms all the baseline methods. The reasons are as follows. First, our approach jointly learns user preferences and item features from multiview data: numerical ratings, textual reviews, and item metadata, which support our hypothesis that incorporating user-generated content and item raw content will improve recommendation performance. Second, our model utilizes multiple types of deep neural networks that are best suited for each type of heterogeneous data sources, which is beneficial to tackle the sparsity problem and the unbalanced distribution problem. Third, our method adopts a multilayer perceptron architecture to learn the complex relationships between users and items and thus obtains a better rating prediction.

4.6. Model Analysis

In this section, we discuss the effects of hyperparameters and several design choices on our model’s performance.

4.6.1. Impact of $k$

Our method encodes the feature representations of users and items into two $k$-dimensional latent vectors, where $k$ denotes the number of latent factors. To empirically study the effect of $k$, we compare the performances of DeepFusion on the validation set with $k$ ranging over {4, 8, 12, 16, 32, 50, 64}. Figure 2 plots the RMSE and MAE of our model as functions of $k$ on the three datasets. Increasing $k$ from 4 to 16 improves the performance, while increasing $k$ from 16 to 64 does not boost the performance but rather causes degradation. Hence, a relatively low or high value of $k$ may cause underfitting or overfitting. Therefore, a proper value of $k$ can enhance the performance of the recommender system, and $k = 16$ is the optimal setting for our experiments.

4.6.2. Impact of the Merged Layer

To further investigate the performance of the merged layer in capturing intricate relations between users and items, we compare DeepFusion with variants that are based on three types of latent feature interactions: DeepFusion-dp, DeepFusion-lmf, and DeepFusion-fm. These three variants are summarized as follows:
(i) DeepFusion-dp: we use a simple dot product of the latent features of users and items as the rating predictor (sketched after this list).
(ii) DeepFusion-lmf: the merged layer enables the latent features of users and items to interact with each other, similarly to latent matrix factorization (LMF) [6].
(iii) DeepFusion-fm: we utilize the factorization machine [37] in place of the original neural prediction layer (see the sketch after this list).
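For illustration, the sketch below shows a dot-product predictor and the second-order interaction term of a factorization-machine predictor; it is a simplified sketch of these two variants (the LMF-style variant is omitted), and the tensor shapes and factor sizes are illustrative assumptions.

import torch

def dot_product_head(x_u, y_i):
    # DeepFusion-dp style: predict the rating as the dot product of user and item vectors.
    return (x_u * y_i).sum(dim=1)

def fm_second_order(features, v):
    # DeepFusion-fm style second-order FM term:
    # 0.5 * sum_f [ (sum_j v_{j,f} x_j)^2 - sum_j (v_{j,f} x_j)^2 ]
    # `features` is the concatenated user-item vector, `v` holds one factor vector per input dimension.
    vx = features.unsqueeze(2) * v.unsqueeze(0)          # (batch, dim, factors)
    square_of_sum = vx.sum(dim=1) ** 2
    sum_of_square = (vx ** 2).sum(dim=1)
    return 0.5 * (square_of_sum - sum_of_square).sum(dim=1)

x_u, y_i = torch.randn(4, 16), torch.randn(4, 16)
v = torch.randn(32, 8)                                   # factor vectors for the 32 input dimensions
print(dot_product_head(x_u, y_i).shape, fm_second_order(torch.cat([x_u, y_i], 1), v).shape)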

As shown in Figure 3, DeepFusion-fm outperforms DeepFusion-dp and DeepFusion-lmf on three datasets. This is because the factorization machine models not only the first-order interactions, but also the second-order interactions between the representations of users and items. In addition, the factorization machine automatically selects helpful features and thus offers better performance. More importantly, our original method DeepFusion significantly outperforms all variant methods. The results demonstrate that the multilayer perceptron architecture effectively learns the hidden intricate relationships between users and items and models the nonlinear interactions between them. Therefore, the multilayer perceptron architecture was integrated into our final model.

4.6.3. Impact of Multiview Fusion

Does multiview fusion boost the recommendation performance? To answer this question, we investigate the impact of the three independent learning components, Reviews Modeling, Ratings Modeling, and Item Metadata Modeling, on the performance of our proposed method. We compare DeepFusion with its three variants: DeepFusion-Reviews, DeepFusion-Ratings, and DeepFusion-Metadata. These three variants are summarized as follows:
(i) DeepFusion-Reviews: the Reviews Modeling component is omitted; namely, this variant consists only of Ratings Modeling and Item Metadata Modeling.
(ii) DeepFusion-Ratings: similar to DeepFusion-Reviews, but the Ratings Modeling component is excluded instead.
(iii) DeepFusion-Metadata: the Item Metadata Modeling component is excluded.

Figure 4 plots the prediction accuracy of the three variants and DeepFusion in terms of RMSE values. First, DeepFusion outperforms DeepFusion-Reviews on the three datasets. This is because the semantic and syntactic information of textual reviews compensates for a shortage of ratings. Moreover, this result demonstrates that textual reviews cover rich information, which is beneficial to reveal user preferences and item features.

Second, DeepFusion-Ratings performs worse than DeepFusion. Hence, applying the interaction-based learning process on the basis of user-item pairs is conducive to rating prediction. The pair-dependent latent representations complement the independent review and item metadata learning approaches. Meanwhile, MF easily captures linear interactions between users and items and is thus effective in improving the prediction performance.

Finally, DeepFusion-Metadata performs the worst by far on all three datasets. The results show that the item metadata information effectively reflects the comprehensive characteristics of items. Since user-item interaction data are extremely sparse in e-commerce recommender systems, approaches that only use user-generated content easily suffer from poor performance. By incorporating item metadata, richer features can be acquired for items that have few ratings and reviews, which is conducive to tackling the sparsity problem and the unbalanced distribution problem. Since the three components complement each other, our proposed model DeepFusion can reap the benefits of additional data sources and realize superior rating prediction accuracy.

Furthermore, to explore the relative importance of various types of product (item) metadata in promoting the rating prediction performance, we conduct a set of experiments using DeepFusion-Price, DeepFusion-Brand, DeepFusion-Title, and DeepFusion-Des. These variants are summarized as follows:
(i) DeepFusion-Price: the price attribute is neglected.
(ii) DeepFusion-Brand: the brand attribute is neglected.
(iii) DeepFusion-Title: the title texts are neglected.
(iv) DeepFusion-Des: the description texts are neglected.

Figure 5 illustrates the reductions in RMSE achieved by DeepFusion compared to DeepFusion-Price, DeepFusion-Brand, DeepFusion-Title, and DeepFusion-Des on the three datasets. In terms of RMSE, DeepFusion-Brand performs the worst among these variants. Hence, the item brand attribute plays the most important role in product recommendation. Furthermore, this phenomenon indicates that customers pay more attention to the brand attribute of products; therefore, businesses should make efforts to enhance brand images. In addition, although DeepFusion-Des and DeepFusion-Title slightly outperform DeepFusion-Brand, they are both outperformed by DeepFusion. The results demonstrate that incorporating textual titles and textual descriptions into our model promotes the rating prediction accuracy. Textual titles and descriptions typically reflect an overall profile of an item and facilitate acquiring the item features. Finally, DeepFusion-Price outperforms the other variants but is outperformed by DeepFusion; hence, the price attribute contributes slightly to boosting the recommendation accuracy. In summary, different types of product (item) metadata have different degrees of influence on user buying behaviors in e-commerce. The brand attribute plays the most important role in product recommender systems, followed by textual descriptions and textual titles, and the price attribute has a relatively slight effect on the rating prediction performance.

5. Conclusions

In this paper, we proposed a novel personalized recommendation method based on deep neural networks and multiview fusion, called DeepFusion, for the rating prediction task in product recommendation. The model is capable of incorporating user-generated content and item raw content, including numerical ratings, textual reviews, and item metadata, in a unified space. Meanwhile, we designed three neural network components to jointly learn user and item representations. Finally, a multilayer perceptron layer was employed to capture the complex relations between users and items. Evaluated on the Amazon product data, our proposed DeepFusion model achieved better performance than all the baselines. In addition, we further explored the relative importance of product (item) metadata on user buying behaviors on e-commerce websites.

In the future, we intend to evaluate our proposed method over additional datasets. In addition, we will consider incorporating more heterogeneous data sources such as item images and user social networks towards a unified, multiview informed practical recommender system.

Data Availability

The Amazon product data supporting the findings of this study are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partly supported by the National Science Foundation of China (nos. 71871019 and 71471016) and by the Fundamental Research Funds for the Central Universities under Grant no. FRF-TP-18-013B1.