Abstract

The rapid ongoing development of e-commerce and interest-based websites makes it increasingly pressing to evaluate objects’ true quality before recommendation. An object’s quality is often calculated from its historical information, such as selection records or rating scores. Usually, high-quality products obtain higher average ratings than low-quality products, regardless of rating biases or errors. However, many empirical cases demonstrate that consumers may be misled by rating scores given by unreliable users or by deliberate tampering. In this case, users’ reputation, that is, their ability to rate reliably and precisely, makes a big difference in the evaluation process. Thus, one of the main challenges in designing reputation systems is eliminating the effects of users’ rating bias. To give an objective evaluation of each user’s reputation and to uncover an object’s intrinsic quality, we propose an iterative balance (IB) method to correct users’ rating biases. Experiments on two datasets show that the IB method is a highly self-consistent and robust algorithm and that it can accurately quantify movies’ actual quality and users’ rating stability. Compared with existing methods, the IB method has a higher ability to find the “dark horses,” that is, not so popular yet good movies, among the Academy Awards nominees.

1. Introduction

The fast development of the Internet and related infrastructures has created vast opportunities for people to date, read, shop, and enjoy entertainment online [1–3]. As people come to rely more and more on the Internet, they also expose themselves to additional risk: disinformation and rumors mislead people into making wrong decisions. For example, some sellers on e-commerce websites manipulate information in order to present low-quality products in a good light. How to effectively disentangle truth from falsehood and protect individuals from malicious deception is a critical problem, especially for companies that provide information services or products online [4–7]. Reputation systems arose from the need of Internet users to gain trust in the individuals they transact with online [8, 9]. Additionally, reputation systems enable users and customers to better understand the provided information, products, and services [10, 11]. Reputation systems may also help users decide whether or not to buy specific services or goods of which they have no prior experience or that they have never purchased before [12–14].

Reputation systems use a collection of historical rating records and attributes of users and items to calculate their reputation/quality levels, which are usually represented in the form of scores. Most e-commerce and interest-based websites employ some kind of reputation system to differentiate the quality of services, products, or entities before recommendation or information push. For example, Netflix, which provides a DVD rental service, allows users to vote on movies and then computes the reputation score of each movie. Since ratings have a large influence on users’ online purchasing decisions and on online digital content distribution, various algorithms have been proposed to give objective evaluations. Laureti et al. [15] proposed an iterative refinement (IR) method in which a user’s reputation, that is, rating stability, is inversely proportional to the difference between the user’s ratings and the corresponding objects’ estimated quality; the estimated quality of each object and the reputation of each user are iteratively updated until convergence is reached. Zhou et al. [16] proposed a robust ranking algorithm in which a user’s reputation is calculated from the Pearson correlation between the user’s ratings and the objects’ estimated quality; compared with the IR method, this method shows higher robustness against spammer attacks. More recently, Liao et al. [17] added a reputation redistribution process to the iterative ranking measure, which effectively increases the weight of votes cast by highly reputable users and reduces the weight of users with a low reputation when estimating the quality of objects. There are also other algorithms built on Bayesian theory [18], belief theory [19], the flow model [20], or fuzzy logic concepts [21]. Most previous methods are based directly on ratings while neglecting the fact that users may have a personal bias when they score an object. We have empirically investigated four benchmark datasets obtained from two video-providing websites, MovieLens [22] and Netflix [23], and found that each user has a certain magnitude of rating error which decreases the prediction accuracy of ratings [24]. In order to eliminate the effect of this rating error on the evaluation results, we propose a new algorithm called the iterative balance (IB) method. Experiments on MovieLens and Netflix datasets show that the IB method is a highly self-consistent and robust algorithm; it can accurately quantify a user’s reputation and a movie’s quality. Compared with state-of-the-art methods, the IB method has a greater ability to find the “dark horses” for the Academy Awards.

This paper is organized as follows. In Section 2, we introduce the representation of rating systems and the general framework of iterative ranking algorithms, and we then describe our IB method together with several well-known iterative algorithms used for comparison. In Section 3, four benchmark datasets and several evaluation metrics are described. In Section 4, we show the performance of the IB method in terms of accuracy and robustness. The last section draws conclusions and discusses the potential relevance and applications of the IB method.

2. Materials and Methods

2.1. Bipartite Network Representation of Rating Systems

Bipartite networks are commonly used to represent the relationships between two groups of entities, such as actors and movies, goods and customers, books and readers, or publications and authors; only links between the two groups are allowed. Here, we use a bipartite network to represent a rating system, which includes the set of users (denoted by $U$), the set of objects (denoted by $O$), and the ratings between users and objects (denoted by $R$). A link in the bipartite network connecting user $i$ and object $\alpha$ represents a historical rating $r_{i\alpha} \in R$. Figure 1 gives a simple example of how to construct a bipartite network from a set of rating data. The original data shown in Figure 1(a) contain seven rating records made by four users on four movies; the ratings are given on an integer scale from 1 star to 5 stars (i.e., worst to best). Figure 1(b) shows the corresponding bipartite network, where users are represented by circles and objects by squares, and each user is connected with the movies that he or she has rated. The set of users who have rated object $\alpha$ is denoted by $U_\alpha$, and the set of objects rated by user $i$ is denoted by $O_i$. The degree of object $\alpha$ is the number of users in $U_\alpha$, and the degree of user $i$ is the number of objects in $O_i$.
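For concreteness, the following minimal Python sketch (with made-up rating triples) shows how the two neighborhood sets $O_i$ and $U_\alpha$ and the corresponding degrees can be built from raw rating records; all names and the toy data are illustrative only.

```python
from collections import defaultdict

# Hypothetical rating records: (user, movie, rating) on a 1-5 integer scale.
ratings = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
           ("u2", "m3", 2), ("u3", "m2", 4), ("u3", "m4", 5), ("u4", "m4", 1)]

objects_of_user = defaultdict(dict)   # O_i: movies rated by each user
users_of_object = defaultdict(dict)   # U_alpha: users who rated each movie

for user, movie, score in ratings:
    objects_of_user[user][movie] = score
    users_of_object[movie][user] = score

# Degrees are simply the numbers of neighbors in the bipartite network.
user_degree = {u: len(movies) for u, movies in objects_of_user.items()}
object_degree = {m: len(users) for m, users in users_of_object.items()}
print(user_degree, object_degree)
```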

2.2. Iterative Ranking Framework

As a matter of fact, items have a set of qualities, based on a set of traits. A user’s aggregate rating reflects the quality of those traits, plus an individual weighting that reflects the user’s value system. A user’s reputation is the accuracy with which he rates those traits, independently of his individual weighting of them. For convenience, this paper deals with the case of a single overall quality per object. Let $Q_\alpha$ and $R_i$ denote the quality of object $\alpha$ and the reputation of user $i$, respectively. Note that, when users’ biases and mistakes are absent, any two users would rate any object the same, according to the intrinsic quality of the object. The most straightforward way to quantify an object’s quality is to aggregate the historical ratings that the object received. Averaging over all ratings (abbreviated as AR) is the simplest method, which mathematically reads

$$Q_\alpha = \frac{1}{|U_\alpha|} \sum_{i \in U_\alpha} r_{i\alpha}.$$

Obviously, in this form the ratings from different users contribute equally to $Q_\alpha$. However, the ratings of users with a high reputation are more reliable than the ratings of users with a low reputation. Therefore a weighted form to calculate the quality of an object was proposed:

$$Q_\alpha = \sum_{i \in U_\alpha} w_i\, r_{i\alpha},$$

where $w_i$ is usually the normalized reputation score of user $i$ among the raters of $\alpha$, that is, $w_i = R_i / \sum_{j \in U_\alpha} R_j$.
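As an illustration, here is a small Python sketch of the two quality estimates just described (the plain average AR and the reputation-weighted form). Normalizing the reputations over the raters of each object, and falling back to the plain average when those reputations are all zero, are choices made for this sketch rather than prescriptions from the text; it reuses the objects_of_user/users_of_object dictionaries built above.

```python
def average_rating(users_of_object):
    """AR estimate: the arithmetic mean of all ratings an object received."""
    return {m: sum(r.values()) / len(r) for m, r in users_of_object.items()}

def weighted_quality(users_of_object, reputation):
    """Reputation-weighted quality estimate: each rating is weighted by the
    rater's reputation, normalized over the raters of that object."""
    quality = {}
    for m, raters in users_of_object.items():
        total = sum(reputation[u] for u in raters)
        if total == 0:  # degenerate case: fall back to the plain average
            quality[m] = sum(raters.values()) / len(raters)
        else:
            quality[m] = sum(reputation[u] * r for u, r in raters.items()) / total
    return quality
```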

KR is “a very crude approach” to evaluating the reputation of users in the system: a user’s reputation is taken to be directly proportional to the number of items he or she has rated. The underlying assumption is that a user with more experience, that is, one who has rated more items, has a higher ability to rate reliably and precisely. Because this assumption is not reliable in general, KR is not discussed further in this paper.

There are also three iterative ways to calculate each user’s reputation score $R_i$. Laureti et al. [15] presented an iterative refinement method (abbreviated as IR), which takes a user’s reputation to be inversely proportional to the mean-squared error between the user’s rating records and the quality of the rated objects, namely,

$$R_i \propto \left[\frac{1}{|O_i|}\sum_{\alpha \in O_i}\left(r_{i\alpha}-Q_\alpha\right)^2\right]^{-1}.$$

After normalization, we obtain the weights $w_i = R_i / \sum_{j \in U_\alpha} R_j$ used in the weighted quality estimate above.
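A minimal sketch of this reputation rule, assuming the simple inverse mean-squared-error form described above; the small eps guarding against division by zero is an implementation choice of the sketch, not part of the quoted definition.

```python
def ir_reputation(objects_of_user, quality, eps=1e-6):
    """IR-style update: a user's reputation is inversely proportional to the
    mean-squared error between his ratings and the current quality estimates."""
    reputation = {}
    for u, rated in objects_of_user.items():
        mse = sum((r - quality[m]) ** 2 for m, r in rated.items()) / len(rated)
        reputation[u] = 1.0 / (mse + eps)  # eps avoids division by zero
    return reputation
```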

Zhou et al. [16] proposed a correlation-based iterative method (abbreviated as CR), in which a user’s reputation is derived from the Pearson correlation [25] between the user’s rating records and the corresponding objects’ quality:

$$\mathrm{corr}_i = \frac{\sum_{\alpha \in O_i}\left(r_{i\alpha}-\bar r_i\right)\left(Q_\alpha-\bar Q_i\right)}{\sqrt{\sum_{\alpha \in O_i}\left(r_{i\alpha}-\bar r_i\right)^2}\ \sqrt{\sum_{\alpha \in O_i}\left(Q_\alpha-\bar Q_i\right)^2}},$$

where $\bar r_i$ and $\bar Q_i$ are, respectively, the mean rating given by user $i$ and the mean estimated quality of the objects he has rated. The reputation score $R_i$ is then defined from this correlation, with larger positive correlations yielding larger reputations, and normalizing the reputations over the raters of each object gives the weights used in the quality estimate.
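The sketch below illustrates the correlation-based idea: the reputation is computed from the Pearson correlation between a user's ratings and the current quality estimates of the rated objects. Clipping negative correlations to zero (so that the weights stay nonnegative) is an assumption of this sketch.

```python
import math

def cr_reputation(objects_of_user, quality):
    """CR-style update: reputation from the Pearson correlation between a
    user's ratings and the estimated quality of the objects he rated."""
    reputation = {}
    for u, rated in objects_of_user.items():
        rs = list(rated.values())
        qs = [quality[m] for m in rated]
        n = len(rs)
        mr, mq = sum(rs) / n, sum(qs) / n
        cov = sum((r - mr) * (q - mq) for r, q in zip(rs, qs))
        sr = math.sqrt(sum((r - mr) ** 2 for r in rs))
        sq = math.sqrt(sum((q - mq) ** 2 for q in qs))
        corr = cov / (sr * sq) if sr > 0 and sq > 0 else 0.0
        reputation[u] = max(corr, 0.0)  # negative correlations give no weight
    return reputation
```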

More recently, Liao et al. [17] proposed a reputation redistribution process (abbreviated as IARR) to improve the validity by enhancing the influence of highly reputed users. The reputation of the CR method is then redistributed as

$$R_i' = R_i^{\theta}\,\frac{\sum_{j} R_j}{\sum_{j} R_j^{\theta}},$$

where $\theta$ is a tunable parameter controlling the influence of reputation. Obviously, when $\theta = 0$, $R_i'$ is a constant value for all users; when $\theta = 1$, IARR reduces to the CR method. In this paper, we use the value of $\theta$ suggested by its proposers [17]. Liao et al. also presented a similar algorithm, called IARR2, which introduces penalty factors into IARR. IARR2 assumes that a user is more reliable if he rates many objects while still keeping a high reputation, and similarly for objects; accordingly, both the quality estimate and the redistributed reputation are modified by penalty factors (see [17] for the explicit forms). In summary, under the framework of iterative models, the final results of the four algorithms are obtained in four steps:

(i) Initialize the reputation of users. The IR method uses a different initialization from the other three algorithms. (We have checked the results when IR is initialized in the same way as the other three algorithms; the results are exactly the same. To follow the original paper of the IR method, we keep its original initialization in our experiments.)

(ii) Estimate the quality of each object with the weighted form above, where the weights come from the normalized reputations of the respective method (IR, CR, or IARR); IARR2 computes the quality from the IARR reputations with its additional penalty factor.

(iii) Update the reputation of each user according to the corresponding rule: the inverse mean-squared error for IR, the correlation-based definition for CR, the redistributed reputation for IARR, and the penalized redistribution for IARR2.

(iv) Repeat steps (ii) and (iii) until the change of the quality estimates between successive iterations is less than a small threshold, and then terminate the iteration.
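The following Python skeleton sketches this generic loop, reusing the helper functions above; the uniform initialization, the tolerance, and the iteration cap are illustrative choices, not values taken from the paper.

```python
def iterate_until_stable(objects_of_user, users_of_object,
                         update_reputation, estimate_quality,
                         tol=1e-4, max_iter=200):
    """Generic iterative framework: alternate the weighted quality estimate
    with the chosen reputation update until the quality scores stabilize."""
    reputation = {u: 1.0 for u in objects_of_user}                    # step (i)
    quality = None
    for _ in range(max_iter):
        new_quality = estimate_quality(users_of_object, reputation)   # step (ii)
        reputation = update_reputation(objects_of_user, new_quality)  # step (iii)
        if quality is not None:                                       # step (iv)
            change = sum(abs(new_quality[m] - quality[m]) for m in quality)
            if change < tol:
                quality = new_quality
                break
        quality = new_quality
    return quality, reputation

# Example wiring, e.g. for an IR-style run:
# quality, reputation = iterate_until_stable(objects_of_user, users_of_object,
#                                            ir_reputation, weighted_quality)
```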

2.3. Iterative Balance Model

The above three methods neglect the fact that the ratings of different users may be biased by personal interests and criteria. This bias can be measured by the standard deviation and the skewness of the user’s rating records. Consider a set of users and a set of objects, where each user $i$ has a certain magnitude of rating error and each object $\alpha$ has an intrinsic quality $Q_\alpha$ that is unknown to the users. The magnitude of rating error indicates how inaccurate the rating score is, and it can push the rating either below or above the intrinsic quality. The rating of user $i$ on object $\alpha$ can then be written as

$$r_{i\alpha} = Q_\alpha + \delta_{i\alpha},$$

where $\delta_{i\alpha}$ is the rating error. Here, we assume that the distribution of the rating error has zero mean. For an arbitrary user $i$, the magnitude of his error can be measured by the standard deviation $\mathrm{SD}_i$, which reads

$$\mathrm{SD}_i = \sqrt{\frac{1}{|O_i|}\sum_{\alpha \in O_i}\left(r_{i\alpha}-\bar r_\alpha\right)^2},$$

where $\bar r_\alpha$ is the average score of all ratings on object $\alpha$. Furthermore, we also compute the skewness of the rating records, $\mathrm{SK}_i$, which captures the asymmetry of the distribution of the user’s rating deviations about their mean:

$$\mathrm{SK}_i = \frac{\frac{1}{|O_i|}\sum_{\alpha \in O_i}\left(r_{i\alpha}-\bar r_\alpha\right)^3}{\mathrm{SD}_i^{3}},$$

where $\mathrm{SK}_i$ can come in the form of “negative skewness” or “positive skewness,” depending on whether the user’s rating records are skewed to the left (negative skew) or to the right (positive skew) of the average rating records.
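The two statistics can be computed per user as in the following sketch, where the deviations are taken against each object's average rating; the particular third-standardized-moment form used for SK is an assumption of the sketch.

```python
def user_bias_statistics(objects_of_user, avg_rating):
    """Per-user SD and SK of the deviations between the user's ratings and
    the average ratings of the corresponding objects."""
    stats = {}
    for u, rated in objects_of_user.items():
        devs = [r - avg_rating[m] for m, r in rated.items()]
        n = len(devs)
        sd = (sum(d * d for d in devs) / n) ** 0.5
        sk = (sum(d ** 3 for d in devs) / n) / sd ** 3 if sd > 0 else 0.0
        stats[u] = (sd, sk)
    return stats
```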

We empirically analyze four benchmark user-movie datasets: three of them are samples from MovieLens, named M1, M2, and M3, and the fourth is from Netflix, named NF (see Table 1 for basic statistics of the datasets). For each dataset, we investigate the distributions of SD and SK over users, shown in Figures 2(a) and 2(b), respectively. Both SD and SK approximately follow normal distributions, whose parameters are estimated via the maximum likelihood method. Because of this personal rating bias, we propose an iterative balance model that eliminates the bias in order to better quantify each user’s reputation. The model assumes that ratings follow the error model introduced above, and its process can be described as follows:

(i) Initialize the quality of each object with the arithmetic average of its ratings (the AR estimate above) to obtain the initial quality estimates.

(ii) Update the reputation of each user according to his rating bias, which measures how far user $i$’s ratings deviate from the current quality estimates; obviously, the lower this bias is, the higher the reputation of the user.

(iii) Update the quality of each object from the bias-corrected ratings of the users who have rated it, where each rating is adjusted against the direction of the user’s bias given by the sign function $\mathrm{sgn}(x)$, which returns $1$ if $x>0$, $-1$ if $x<0$, and $0$ if $x=0$.

(iv) Continue the iteration process of steps (ii) and (iii) until the change of the quality estimates is less than a small threshold, and then terminate the iteration. The final stable values of $Q_\alpha$ and $R_i$ are used to quantify the intrinsic quality of object $\alpha$ and the reputation of user $i$, respectively.
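To make the loop concrete, the sketch below implements one plausible reading of the balance idea: each rating is corrected against the signed bias of its user (here taken as sign(SK)·SD measured against the current quality estimates) before the quality is re-averaged, and the reputation is taken to decrease with the remaining bias. The specific correction, the reputation formula, and the stopping parameters are assumptions of this illustration, not necessarily the exact update rules of the IB method.

```python
import math

def iterative_balance(objects_of_user, users_of_object, tol=1e-4, max_iter=200):
    """Illustrative sketch of the iterative balance idea: estimate quality,
    measure each user's bias, correct ratings against the signed bias, repeat."""
    quality = {m: sum(r.values()) / len(r) for m, r in users_of_object.items()}
    bias = {}
    for _ in range(max_iter):
        # (ii) per-user bias: SD and SK of deviations from the current quality
        for u, rated in objects_of_user.items():
            devs = [r - quality[m] for m, r in rated.items()]
            n = len(devs)
            sd = (sum(d * d for d in devs) / n) ** 0.5
            sk = (sum(d ** 3 for d in devs) / n) / sd ** 3 if sd > 0 else 0.0
            bias[u] = (sd, sk)
        # (iii) quality from bias-corrected ratings
        new_quality = {}
        for m, raters in users_of_object.items():
            corrected = [r - math.copysign(1.0, bias[u][1]) * bias[u][0]
                         if bias[u][1] != 0 else r
                         for u, r in raters.items()]
            new_quality[m] = sum(corrected) / len(corrected)
        # (iv) stop once the quality estimates are stable
        change = sum(abs(new_quality[m] - quality[m]) for m in quality)
        quality = new_quality
        if change < tol:
            break
    # an assumed reputation: the smaller the residual bias, the higher the score
    reputation = {u: 1.0 / (sd + 1e-6) for u, (sd, sk) in bias.items()}
    return quality, reputation
```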

3. Data and Metric

3.1. Datasets

To test the performance of our IB method, we consider four benchmark datasets sampled from MovieLens [22] and Netflix [23]. MovieLens is an online movie recommendation website that invites users to rate movies; Netflix also provides a DVD rental service in which users can vote on movies. The first three datasets are sampled from MovieLens with different sizes and are named M1, M2, and M3. The fourth dataset is a random sample of the whole record of user activities on http://Netflix.com. The rating scale for both MovieLens and Netflix runs from one (i.e., worst) to five (i.e., best). Based on the users’ historical records, we construct a user-movie bipartite network: if user $i$ selects movie $\alpha$ and rates it, a link between them is established. The statistical features of the four networks constructed from the four datasets are summarized in Table 1. In this paper, we consider only users and objects whose degree exceeds a minimum threshold.

3.2. Evaluation Metrics

To evaluate the performance of the IB method, we employ the mean-squared error (MSE) to measure the algorithm’s accuracy in quantifying users’ reputation and precision to evaluate its accuracy in identifying good movies. Besides accuracy, we also investigate the robustness of our method, which is measured by the MSE and the Kendall’s tau ($\tau$) coefficient [26].

A good method should give higher reputation scores to users with lower error magnitudes. $\mathrm{MSE}_i$ represents the scoring stability of user $i$ and reads

$$\mathrm{MSE}_i = \frac{1}{|O_i|}\sum_{\alpha \in O_i}\left(r_{i\alpha}-Q_\alpha^{*}\right)^2,$$

where $Q_\alpha^{*}$ is the intrinsic quality of object $\alpha$, that is, the final quality value produced by the algorithm. Since comparisons usually focus on the top-ranked users, we consider the average MSE of the $L$ users with the highest reputation:

$$\mathrm{MSE}(L) = \frac{1}{L}\sum_{i \in \mathrm{top}\text{-}L}\mathrm{MSE}_i.$$

A lower MSE value indicates higher accuracy.
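A direct computation of this metric, assuming quality values and reputations produced by any of the methods above (function and variable names are illustrative):

```python
def mse_top_users(objects_of_user, final_quality, reputation, top_l):
    """Average MSE of the top_l users with the highest reputation, measured
    against the final quality values (lower means higher accuracy)."""
    mse = {u: sum((r - final_quality[m]) ** 2 for m, r in rated.items()) / len(rated)
           for u, rated in objects_of_user.items()}
    top = sorted(mse, key=lambda u: reputation[u], reverse=True)[:top_l]
    return sum(mse[u] for u in top) / len(top)
```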

The accuracy of measuring object quality is evaluated by comparison with the movies nominated for the annual Academy Awards [27] and Golden Globe Awards [28]. These nominated movies serve as the benchmark good movies in the evaluation. A good algorithm will rank the benchmark movies higher than others; therefore we use precision to evaluate an algorithm’s ability to find good movies. Instead of considering all movies, we focus on the top $L$ places of the ranking. Precision is then defined as

$$P(L) = \frac{N_b(L)}{L},$$

where $N_b(L)$ is the number of benchmark movies appearing in the top $L$ places of the ranking list. A higher precision corresponds to better performance.
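In code, precision at the top $L$ places is simply the hit rate of benchmark movies among the first L entries of the ranked list (a sketch with illustrative names):

```python
def precision_at_l(ranked_movies, benchmark_movies, l):
    """Fraction of the top-l ranked movies that belong to the benchmark set
    of nominated movies (higher is better)."""
    hits = sum(1 for m in ranked_movies[:l] if m in benchmark_movies)
    return hits / l
```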

The robustness is measured by the Kendall’s tau ($\tau$) coefficient [26]. For a dataset, each method gives a ranked list of objects. If movie A is better than movie B in dataset M1, then a robust algorithm will also rank movie A higher than movie B in dataset M2 (or M3). To measure the robustness, we consider the objects common to two datasets (i.e., M1 and M2, M1 and M3, M2 and M3) and extract the subranking of the common objects from each original ranked list. Assume there are $n$ common objects and that the quality scores of object $\alpha$ in the two lists are denoted by $Q_\alpha^{(1)}$ and $Q_\alpha^{(2)}$, respectively. The Kendall’s tau rank correlation coefficient counts the difference between the number of concordant pairs and the number of discordant pairs, which reads

$$\tau = \frac{2}{n(n-1)}\sum_{\alpha < \beta}\mathrm{sgn}\left(Q_\alpha^{(1)}-Q_\beta^{(1)}\right)\,\mathrm{sgn}\left(Q_\alpha^{(2)}-Q_\beta^{(2)}\right),$$

where $\mathrm{sgn}(x)$ is the sign function, which returns $1$ if $x>0$, $-1$ if $x<0$, and $0$ if $x=0$. A positive contribution corresponds to a concordant pair and a negative contribution to a discordant pair. The higher the $\tau$ value is, the more robust the algorithm is. In the ideal case, $\tau = 1$ indicates that the two ranking lists are exactly the same.
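The coefficient can be computed directly from the pair counts as below (an O(n^2) sketch; with many tied scores this plain form differs slightly from the tie-corrected tau-b returned by scipy.stats.kendalltau):

```python
def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two quality-score lists over the same objects."""
    sign = lambda x: (x > 0) - (x < 0)
    n = len(scores_a)
    if n < 2:
        return 1.0  # degenerate case: nothing to compare
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sign(scores_a[i] - scores_a[j]) * sign(scores_b[i] - scores_b[j])
    return 2.0 * total / (n * (n - 1))
```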

4. Results

4.1. Accuracy for Quantifying Users’ Reputation

Figure 3 shows the MSE(L) values of the IB method for different values of L (see Section 3.2), together with other representative algorithms for comparison. However, the penalty factor in IARR2 greatly amplifies the values of the users’ reputation and the objects’ quality, which makes the MSE value of IARR2 much larger than that of the other methods; if the IARR2 curve were plotted in Figure 3, all the other curves would be squeezed into nearly straight lines, so the MSE result of IARR2 is not presented here. We observe that, as L increases, the MSE value of the IB method is always the lowest, indicating that the IB method is a good measure for quantifying user reputation. Besides, we also investigate the correlation between users’ reputation scores and their personal MSE values. Table 2 shows the Kendall’s tau correlation coefficient between the two ranking lists generated, respectively, by ranking users decreasingly by their reputation scores (the higher, the better) and increasingly by their MSE values (the lower, the better). For all four datasets, the IB method yields the highest value, indicating that our IB method is highly self-consistent.

4.2. Accuracy for Identifying Good Objects

Firstly, how does one define good objects or, more specifically, good films? This is a well-known and highly controversial issue, and opinions vary from person to person. According to a collection of answers on https://Quora.com, many people judge a film by how much it entertains and/or moves the audience, how strongly the audience relates to it, or how strongly it makes the audience emote. As the saying goes, “each reader creates his own Hamlet.” Here we adopt the movies that are most interesting, most appealing, and most exciting as the benchmark good films, and we believe that selecting the movies nominated for either the Academy Awards [27] or the Golden Globe Awards [28] is an authoritative choice. We use precision to calculate the accuracy of identifying good movies. In Table 3, we summarize the number of nominated movies in the three MovieLens datasets.

Note that users’ behavior on movie rating websites changes over time, particularly before and after a movie wins an award at a famous ceremony such as the Academy Awards or the Golden Globe Awards. The Academy Awards were first presented in 1929, while the Golden Globe Awards were first presented in 1944. However, the two data sources used in this paper, Netflix and MovieLens, were created in recent decades, which means that most of the rating scores were given after the movies had received their awards. The datasets we obtained therefore limit our exploration of the rating dynamics over time; we will study this problem in future work.

Figure 4 shows the precision of five methods, namely, IB, AR, CR, IR, and IARR, in identifying good movies. For all methods, the precision decreases as L increases. Generally speaking, our IB method performs around the average of the compared methods, and in some cases it performs best; for example, in the M3 dataset the IB method performs best when evaluated against the Academy Awards but is outperformed by the IR method when evaluated against the Golden Globe Awards.

Each method generates a ranked list in which the top-ranked movies are predicted to be the nominated movies. Comparing the nominated movies correctly predicted by the different methods, we find that our IB method is good at finding niches (i.e., unpopular yet good movies). This ability to find novel movies is important, since finding popular movies is much easier than digging out niches. The niches usually constitute the so-called “long tail” market, which is considered promising and profitable; for instance, Netflix finds that, in aggregate, “unpopular” movies are rented more than popular movies, and it therefore offers a large number of niche movies on its website. The novelty of a movie can be measured by its degree, namely, how many users have rated it. An algorithm’s novelty score is defined as the average degree of the nominated movies in its ranking list; the lower, the better. We compare the novelty scores of the five methods; the results are shown in Figure 5, as sketched below. In all presented cases, the IB method always yields the lowest novelty score, indicating that the IB method has a higher ability to find “dark horses” (i.e., niches, not so popular yet good movies).
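A sketch of this novelty score, computed over the benchmark movies that a method places in its top-l list (names are illustrative; object_degree is the per-movie number of ratings as built earlier):

```python
def novelty(ranked_movies, benchmark_movies, object_degree, l):
    """Average degree of the benchmark movies found in the top-l list; a lower
    value means the method surfaces less popular ('dark horse') movies."""
    hits = [m for m in ranked_movies[:l] if m in benchmark_movies]
    return sum(object_degree[m] for m in hits) / len(hits) if hits else float("nan")
```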

Table 4 shows the movies nominated for an Academy Award that are identified by our IB method within the top-100 places but do not appear in the corresponding lists of the other methods in the M2 dataset. The average number of ratings of these 13 movies is 455, much lower than the average number of ratings of all nominated movies in the M2 dataset (i.e., 717; see Table 3). Moreover, among the 13 movies, only three have been rated more than 717 times. We have also checked that the results of the other four methods overlap heavily, while our IB method yields results that differ considerably from the rest. The results on the other datasets are similar, so we do not present the detailed information. In the M1 dataset, there are 27 nominated movies that are predicted correctly by the IB method but cannot be identified by the other four methods; their average number of ratings is 132, smaller than the average over all nominated movies in the M1 dataset (i.e., 175; see Table 3). In the M3 dataset, there are 23 nominated movies that cannot be identified by the other four methods; their average number of ratings is 3245, smaller than the average over all nominated movies in the M3 dataset (i.e., 3942; see Table 3).

4.3. Robustness

Besides accuracy, robustness is another important aspect to consider when selecting algorithms. Robustness usually refers to an algorithm’s ability to counteract malicious activities; here we consider an algorithm’s robustness across different datasets. The intrinsic quality of an object does not change between sampled datasets. If an algorithm says that object A is better than B on sampled dataset 1 but that B is better than A on sampled dataset 2, then the algorithm is not robust, because it generates inconsistent results on different samples. Therefore, instead of adding artificial ratings to probe robustness, we apply the MSE and the Kendall’s tau ($\tau$) coefficient to measure the consistency of the results on different sampled datasets; M1, M2, and M3 serve as ready-made samples for this experiment. Firstly, we calculate the object quality scores with the AR, IB, CR, IR, and IARR methods on the three datasets. For the three datasets there are three pairs to compare, namely, M1 versus M2, M2 versus M3, and M1 versus M3. For each pair we consider the objects common to the two datasets and compute the difference between their quality scores. Let $Q_\alpha^{(s)}$ and $Q_\alpha^{(t)}$ denote the quality scores of object $\alpha$ in the two datasets $s$ and $t$ ($s \neq t$); then

$$\mathrm{MSE} = \frac{1}{n}\sum_{\alpha=1}^{n}\left(Q_\alpha^{(s)} - Q_\alpha^{(t)}\right)^2,$$

where $n$ is the number of common objects between datasets $s$ and $t$. The results are shown in Table 5: in all three cases, the IB method has the lowest MSE value. Moreover, we use the Kendall’s tau ($\tau$) coefficient to analyze the correlation between the two ranked lists of common objects in each pair. Table 6 shows that the Kendall’s tau of the IB method is the highest among all five methods. In other words, the two ranked lists of the same objects given by the IB method on different datasets are more consistent than those given by the other four methods, indicating that IB is more robust.
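The cross-dataset consistency check reduces to a mean-squared difference over the shared objects, as in this short sketch (quality_a and quality_b stand for the score dictionaries one method produces on two samples):

```python
def cross_dataset_mse(quality_a, quality_b):
    """Mean-squared difference between the quality scores assigned to the
    objects shared by two sampled datasets (lower = more consistent/robust)."""
    common = set(quality_a) & set(quality_b)
    return sum((quality_a[m] - quality_b[m]) ** 2 for m in common) / len(common)
```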

5. Conclusions

Building online reputation systems is important for companies that provide services or products online (e.g., the Taobao e-business platform for goods [29], Netflix for movies, Amazon for books and other products, and Pandora for music [30]). Since the reputation scores generated by such systems are usually used to assist users who want to buy or select something of which they have no prior experience, finding a good ranking method is important. A good method should be both effective (i.e., reflect intrinsic values) and efficient (i.e., simple to calculate); additionally, it must be robust against tampering. Users’ rating bias can greatly degrade an algorithm’s performance with respect to all three criteria. Motivated by the need to eliminate user bias for better evaluation, we proposed an iterative balance (IB) method to identify each user’s reputation and each object’s quality in online rating systems. Firstly, we empirically studied the standard deviation and the skewness of users’ rating scores and found that each user has a certain magnitude of rating error. Then, we introduced an equation to correct this rating error during the iterative process. We applied the mean-squared error (MSE) to measure the algorithm’s accuracy in quantifying each user’s reputation and precision to evaluate its accuracy in identifying good objects; the algorithm’s robustness was measured using both MSE and the Kendall’s tau coefficient. Experiments on four benchmark datasets show that the IB method is a highly self-consistent and robust algorithm. Compared with other state-of-the-art methods, the IB method has a higher ability to identify niche items (i.e., unpopular yet good objects); for example, results on the MovieLens datasets show that the IB method is good at finding the “dark horses” for the Academy Awards. We believe our study may find wider practical applications, such as helping online e-business platforms to identify tampering, integrating objects’ quality scores into recommender systems to improve recommendation accuracy, and generally improving user experience. Furthermore, it may also help generate higher-quality evaluation reports for seller reference.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

All the authors contributed equally to this work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grants nos. 11622538, 61673150) and the Zhejiang Provincial Natural Science Foundation of China (Grant no. LR16A050001). Z. Ren is thankful for the NSFC-Zhejiang Joint Fund under Grant no. U1509220.