Computational Intelligence and Neuroscience

Volume 2016, Article ID 5968705, 11 pages

http://dx.doi.org/10.1155/2016/5968705

## Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Received 22 July 2015; Revised 12 October 2015; Accepted 13 October 2015

Academic Editor: Cheng-Jian Lin

Copyright © 2016 Bingkun Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

E-commerce is developing rapidly. Learning from and making good use of the myriad reviews written by online customers has become crucial to success in this market, which calls for increasingly accurate sentiment classification of these reviews. Finer-grained review rating prediction is therefore preferred over rough binary sentiment classification. Current review rating prediction methods fall mainly into two types. The first comprises methods based on review text content, which focus almost exclusively on the text and seldom consider the reviewers and items that appear in other relevant reviews. The second comprises methods based on collaborative filtering, which extract information from previous records in the reviewer-item rating matrix but ignore the review text content. Here we propose a framework for review rating prediction that effectively combines the two, and we further propose three specific methods under this framework. Experiments on two movie review datasets demonstrate that our review rating prediction framework performs better than these previous methods.

#### 1. Introduction

Web 2.0 and e-commerce have given rise to an explosion of online reviews. In turn, intelligently learning the sentiment and opinions expressed in these reviews is key to success in the current wave of e-commerce. Binary (positive-negative) classification of these reviews has been quite common, but it increasingly fails to meet accuracy requirements [1]. For instance, when several items all fall into the positive category, binary classification cannot predict which of them will be selected. Yet even small differences between review ratings can lead to large differences in sales volume. The review rating prediction research in [2] shows that consumers are often willing to pay 20% to as much as 99% more for a 5-star product than for a 4-star product.

Current review rating prediction (RRP) methods fall mainly into two types. The first, based on review text content, adopts the perspective of natural language processing. Researchers transform review text into feature vectors and then employ a multiclass classifier or a regression model to predict review ratings [3–7]. This type of method ignores the relationship between the customers and the items. The second, based on collaborative filtering (CF), adopts the perspective of recommender systems [8, 9]. Researchers employ $k$-nearest neighbor methods [10, 11] or matrix factorization methods [12–17] to extract information from the previous reviewer-item rating matrix for review rating prediction. This type of method exploits no information from the review text.

To bring in more information and achieve finer-grained review rating prediction, we propose a framework that combines review text content with the previous reviewer-item rating matrix, and we devise three specific methods under this framework. We then conduct experiments on two movie review datasets to examine the effectiveness of the framework and the three methods. The results show that the methods under this framework by and large improve the performance of RRP.

The rest of the paper is organized as follows. Section 2 introduces related research on RRP. Section 3 describes RRP based on review text content and RRP based on CF. Section 4 presents our framework and the three methods under it. Section 5 reports experimental results on two movie review datasets. Finally, Section 6 concludes the paper and points out our future research direction.

#### 2. Related Work

Research on sentiment classification is mainly divided into two areas [18, 19]: positive-negative binary classification and fine-grained RRP. For binary classification of reviews, some creative methods are discussed in [20, 21]. For RRP, many results have been proposed [3–17], but the accuracy of RRP still cannot meet real-world demands.

At present, there are mainly two ways of achieving finer-grained RRP. The first comprises methods based on review text content (MBRTC), which mine information from the review text by identifying and quantifying a variety of text features and then employ a regression model to predict the review rating [7, 22]. For example, Qu et al. [7] treat RRP as a feature engineering problem and extract various features, such as words, patterns, syntactic structures, and semantic topics, from the review text to improve performance. Wang et al. [22] proposed methods that predict review ratings from review content while weighting the strong social relations of reviewers. Specifically, they incorporate the characteristics of reviewers’ social relations, as regularization constraints, into content-based methods. The main problem of MBRTC is that it relies mainly on review text content and does not use the information in the reviewer-item rating matrix.

Some research based on review text content also takes into account the characteristics of the items or the reviewers [5, 23]. Wang et al. [23] noticed that the score associated with a review cannot be fully determined by the review content itself, since review content is not an absolute measure of sentiment. A tough reviewer may use harsh words for all items, even items that he or she rates highly, and different items have to meet different basic requirements, so simply analyzing the review content is not enough. Li et al. [5] proposed a method that incorporates reviewer and item information into review text content. They consider the personal characteristics of the reviewers when mining review content and use tensor factorization techniques to learn the parameters of a regression model and predict review ratings. This method considers only the effect of the reviewer and the item on the review text content and then uses the review text content to predict the review rating, so it remains a method based on review text content.

The second comprises methods based on CF (MBCF), which can be further divided into two categories. The first category computes similarities to find the $k$-nearest neighbor reviewers or items and bases the prediction on them [11, 12]. The second category uses a latent factor model to perform matrix factorization. Several low-dimensional matrix factorization techniques are presented in [24–26]. Koren et al. [12–14] proposed several enhanced matrix factorization methods that generate promising results by incorporating heterogeneous information into the objective function. Koren [12] built a combined model by merging the matrix factorization and neighborhood models and improved recommendation accuracy by extending the models to exploit both explicit and implicit feedback from users. Koren [14] proposed a method that models time-changing behavior throughout the life span of the data and improved recommendation performance. In [27], researchers extended the matrix factorization objective function with the social network information of reviewers. In [28], Shi et al. proposed a context-aware movie recommendation algorithm based on joint matrix factorization (JMF). They jointly factorize the user-item matrix containing general movie ratings and other contextual movie similarity matrices to integrate contextual information into the recommendation process.

Some research combining ratings and text reviews has already been applied to recommender systems [29, 30]. For example, Cremonesi et al. proposed a hotel recommender algorithm (Interleave), which provides recommendations based on text reviews and ratings [29]. Levi et al. proposed a recommender system that combines reviews and ratings to recommend hotels [30]. But as far as we know, methods that combine ratings and text reviews have not been applied to review rating prediction. Unlike the existing methods, which focus on recommender systems, we focus on review rating prediction and propose a general framework and three specific methods based on review text content and the reviewer-item matrix.

Unlike the above methods, we propose a framework that combines MBRTC and MBCF to bring in more information and improve prediction accuracy. We also present three specific methods under this framework. Finally, the experimental results verify the effectiveness of the proposed framework and methods.

#### 3. RRP Based on Review Text Content and RRP Based on Reviewer-Item Rating Matrix

Consider an online review site with a set of items, a set of reviews about those items, the reviewers who wrote the reviews, and the review ratings that correspond to the reviews. Our goal is to predict the rating of each review. In this section, we introduce the two existing types of RRP methods: one based on review text content and the other based on collaborative filtering.

##### 3.1. RRP Based on Review Text Content

Review text content is a very important information source for RRP. Current RRP methods based on review text content mainly use the vector space model (VSM) to represent review text and then use a linear regression model to predict the review rating. Specifically, there are four steps. First, the online review text is preprocessed, which includes term segmentation, part-of-speech tagging, and frequency statistics. Second, treating words, phrases, and $n$-grams as candidate features, a feature selection method chooses the features that best express the review text content to compose the feature set. Third, each online review is expressed as a vector with one dimension per feature in the feature set, that is, an instantiated value of the feature set. Fourth, a linear regression model over these review vectors is adopted to predict the review rating, so the predicted rating is a linear function of the review’s feature vector.
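The second and third steps can be illustrated with a minimal sketch. It assumes whitespace tokenization, unigram and bigram features, and a document-frequency threshold as a stand-in for feature selection; the function names and the toy reviews are hypothetical and are not taken from the methods cited above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_terms(text):
    """Lowercase, split on whitespace, and emit unigrams plus bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)


def select_features(reviews, min_doc_freq=2):
    """Keep terms found in at least min_doc_freq reviews.

    Assumption: a document-frequency cutoff stands in for whatever
    feature selection method is actually used.
    """
    doc_freq = Counter()
    for text in reviews:
        doc_freq.update(set(extract_terms(text)))
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)


def vectorize(text, features):
    """Express one review as a term-count vector over the feature set."""
    counts = Counter(extract_terms(text))
    return [counts[f] for f in features]


# Hypothetical toy reviews, for illustration only.
reviews = ["great movie great cast", "boring movie", "great cast"]
features = select_features(reviews)
vectors = [vectorize(r, features) for r in reviews]
print(features)
print(vectors)
```

Each review becomes a vector whose length equals the size of the feature set, which is the input that the regression step expects.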

To learn the parameter vector of this model from a training set in which both the review vectors and their ratings are available, a least-squares error loss is used as the objective function to be minimized.

A regularization term on the parameter vector, weighted by a regularization coefficient, is included in the objective to avoid overfitting. To estimate the parameters, a simple stochastic gradient descent algorithm is adopted to solve the optimization problem: the parameters are updated once for each observed rating.

The size of each update is controlled by the learning rates. Once the parameter vector has been learned, the model can be applied to the feature vector of any review to predict its rating.
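As a rough illustration of this training procedure, the sketch below fits a linear model by stochastic gradient descent on a squared error loss with an L2 penalty. It assumes a single weight vector with no bias term and one shared learning rate; the names `lr`, `reg`, and `epochs` and the toy data are hypothetical choices, not values from the paper.

```python
def predict(weights, x):
    """Linear model: the predicted rating is the inner product of weights and features."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_sgd(X, y, lr=0.05, reg=0.01, epochs=500):
    """Minimize squared error plus an L2 penalty with per-sample updates.

    Assumptions: one shared learning rate (lr), one regularization
    coefficient (reg), and weights initialized to zero.
    """
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, r in zip(X, y):
            err = r - predict(weights, x)  # residual for this observed rating
            # Step along the negative gradient of 0.5*err^2 + 0.5*reg*||w||^2.
            weights = [w + lr * (err * xi - reg * w) for w, xi in zip(weights, x)]
    return weights


# Hypothetical feature vectors and ratings, generated from weights [3, 1].
X = [[1, 0], [0, 1], [1, 1]]
y = [3.0, 1.0, 4.0]
weights = train_sgd(X, y)
print(weights)
```

Because of the L2 penalty, the learned weights land slightly below the values that would fit the toy ratings exactly, which is the intended effect of regularization.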

##### 3.2. RRP Based on Collaborative Filtering

RRP now plays an essential role in recommender systems. At present, RRP for recommender systems is mainly based on collaborative filtering and involves two kinds of methods. One uses the $k$-nearest neighbors to predict the rating of the current object. The other uses matrix factorization.

###### 3.2.1. RRP Based on the $k$-Nearest Neighbor Model

RRP based on the $k$-nearest neighbor model includes the reviewer-based method and the item-based method. With the reviewer-item rating matrix available, a typical reviewer-based approach predicts a reviewer’s rating on a target item by aggregating the previous ratings on that item from the $k$ nearest reviewers. The predicted rating of a given reviewer on a given item is therefore formulated from three quantities: the set of nearest neighbors of that reviewer, the similarity between the reviewer and each neighbor, and each neighbor’s rating on the item.

To learn the model parameters from a training set in which the reviewer-item ratings are available, we solve an optimization problem, namely, minimizing a squared error loss function.

A regularization term on the parameters, weighted by a regularization coefficient, is included to avoid overfitting. A simple stochastic gradient descent algorithm is then adopted to solve the optimization problem: the parameters are updated once for each observed rating.

The size of each update is controlled by the learning rates. Once the parameters have been learned, the model can be applied to predict the review rating.
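The reviewer-based aggregation can be sketched as follows. This is a simplified, parameter-free variant: it assumes cosine similarity over co-rated items and a similarity-weighted average of the neighbors’ ratings, and it omits the learned parameters described above. The function names and the toy ratings are hypothetical.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two reviewers over the items both have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_knn(ratings, reviewer, item, k=2):
    """Similarity-weighted average of the k most similar reviewers' ratings on item."""
    candidates = [
        (cosine_similarity(ratings[reviewer], ratings[v]), ratings[v][item])
        for v in ratings
        if v != reviewer and item in ratings[v]
    ]
    neighbors = sorted(candidates, reverse=True)[:k]  # highest similarity first
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None  # no usable neighbor has rated this item
    return sum(sim * r for sim, r in neighbors) / total_sim


# Hypothetical reviewer -> {item: rating} data, for illustration only.
ratings = {
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 5, "b": 3, "c": 4},
    "u3": {"a": 1, "b": 5, "c": 2},
}
print(predict_knn(ratings, "u1", "c", k=2))
```

In this toy data, reviewer `u2` agrees with `u1` on every co-rated item, so the prediction for item `c` is pulled toward the rating given by `u2`.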

###### 3.2.2. RRP Based on Matrix Factorization

Matrix factorization (MF) is one of the most popular methods in recommender systems. The core idea of MF is to find a small number of latent features that relate to the preferences of reviewers and to use them to fit the observed ratings. A typical model associates each reviewer with a vector of reviewer factors and each item with a vector of item factors. The predicted rating is the inner product of the reviewer’s factor vector and the item’s factor vector.

To compute the two sets of parameters, the reviewer factors and the item factors, we follow the least-squares error principle and minimize a regularized objective function.

Regularization terms on the reviewer factors and the item factors, weighted by a regularization coefficient, serve to avoid overfitting. To estimate the two sets of factors, a simple gradient descent algorithm is applied to solve the optimization problem: both are updated for each observed rating.

The size of each update is controlled by the learning rates. Once the reviewer factors and the item factors have been learned, their inner product can be applied to predict the review rating.
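A minimal sketch of this factorization, under stated assumptions, is given below. It uses plain per-rating gradient updates with one shared learning rate and one regularization coefficient, two latent factors, and no bias terms; the names `train_mf` and `rmse`, the hyperparameter values, and the toy ratings are all hypothetical.

```python
import math
import random


def dot(p, q):
    """Inner product of a reviewer factor vector and an item factor vector."""
    return sum(a * b for a, b in zip(p, q))


def train_mf(ratings, n_factors=2, lr=0.05, reg=0.02, epochs=300, seed=0):
    """Learn reviewer factors P and item factors Q from (reviewer, item, rating) triples."""
    rng = random.Random(seed)
    # Sorted keys keep the random initialization reproducible across runs.
    reviewers = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for u in reviewers}
    Q = {i: [rng.uniform(0.1, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            p, q = P[u], Q[i]
            err = r - dot(p, q)  # residual for this observed rating
            for f in range(n_factors):
                p_f, q_f = p[f], q[f]  # keep old values so both updates use them
                p[f] += lr * (err * q_f - reg * p_f)
                q[f] += lr * (err * p_f - reg * q_f)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the inner-product predictions."""
    return math.sqrt(sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings) / len(ratings))


# Hypothetical observed ratings, for illustration only.
ratings = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4), ("u2", "c", 1),
           ("u3", "a", 1), ("u3", "b", 2), ("u3", "c", 5)]
P, Q = train_mf(ratings)
print(rmse(ratings, P, Q))
print(dot(P["u1"], Q["c"]))  # predicted rating for an unobserved reviewer-item pair
```

After training, the same inner product gives a prediction for any reviewer-item pair, including pairs with no observed rating.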

#### 4. RRP by Combining Review Text Content and Reviewer-Item Rating Matrix

##### 4.1. Problem Description

To illustrate the problem we study in this paper, a toy example about reviewers, items, review text content, and review rating is shown in Table 1. From the toy example, we can get three types of information: the user-item rating matrix, review text content with a corresponding rating, and review text content without a corresponding rating. The problem we study is how to effectively predict the missing rating of each review in the user-item rating matrix. In this section, we propose a new RRP framework that combines the reviewer-item rating matrix (RIRM) with review text content (RTC). That is, we want to find a function that maps the RIRM and the RTC to the predicted review ratings.
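One possible way to hold the three types of information in memory is sketched below. The `Review` record, the helper names, and the sample reviews are hypothetical and do not reproduce Table 1; the sketch shows only the data layout and not the prediction function itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    reviewer: str
    item: str
    text: str
    rating: Optional[int]  # None marks a missing rating that must be predicted


# Hypothetical reviews, for illustration only.
reviews = [
    Review("u1", "i1", "a moving story with a great cast", 5),
    Review("u1", "i2", "slow and predictable", 2),
    Review("u2", "i1", "enjoyable but too long", None),
]


def build_rating_matrix(reviews):
    """Reviewer-item rating matrix as nested dicts, observed ratings only."""
    matrix = {}
    for rv in reviews:
        if rv.rating is not None:
            matrix.setdefault(rv.reviewer, {})[rv.item] = rv.rating
    return matrix


def reviews_to_predict(reviews):
    """(reviewer, item) pairs whose review text exists but whose rating is missing."""
    return [(rv.reviewer, rv.item) for rv in reviews if rv.rating is None]


print(build_rating_matrix(reviews))
print(reviews_to_predict(reviews))
```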