#### Abstract

Item-based collaborative filtering algorithms play an important role in modern commercial recommendation systems (RSs). To improve recommendation performance, normalization is commonly used as a basic component of the predictor models. Among the many normalization methods, subtracting the baseline predictor (BLP) is the most popular one. However, the BLP uses a statistical constant that does not take the context into account. We found that slightly scaling the different components of the BLP separately can dramatically improve performance. This paper proposes several normalization methods based on baseline predictors that are scaled according to different context information. The experimental results show that using a context-aware scaled baseline predictor for normalization indeed yields better recommendation performance in terms of RMSE, MAE, precision, recall, and nDCG.

#### 1. Introduction

The abundance of information available on the Internet makes it increasingly difficult for people to find what they want, especially in the electronic commerce domain. As a consequence, building personalized information selection models is becoming crucial. Among the many information selection technologies, recommendation systems have developed rapidly owing to their adoption by most well-known online shopping companies [1, 2].

Recommendation algorithms have been studied extensively, and most of them belong to two main categories. Content-based recommendation systems recommend items according to the users’ past preferences [3–5], whereas collaborative recommendation systems make recommendations based on the preferences of similar neighbors [6–9]. Recommendation systems based purely on content tend to suffer from limited content analysis and overspecialization. Defining appropriate item features is very difficult in many situations, and because these features depend heavily on the users’ history, such systems cannot discover latent profiles for recommendation.

Collaborative filtering (CF) approaches overcome some of the limitations of content-based ones. Items whose content is unavailable or difficult to obtain can still be recommended to users through the feedback of other users. CF approaches can also recommend items with very different content, as long as other users have already shown interest in these items. Among collaborative recommendation approaches, methods based on nearest neighbors still enjoy great popularity owing to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations [10–12]. CF models try to capture the interactions between users and items that produce the different rating values. However, many of the observed rating values are due to effects associated with either users or items, independently of their interaction. A principal example is that typical CF data exhibit large user and item biases, that is, systematic tendencies for some users to give higher ratings than others and for some items to receive higher ratings than others.

Item-based collaborative filtering [13, 14] is much more accurate than the user-based variant [15, 16] when the number of items is larger than the number of users. E-commerce businesses typically offer a huge number of products, far exceeding the number of users. However, the average number of common ratings is very small, because most users are interested in only a few items. User-based collaborative filtering systems easily suffer from overfitting in this situation. Item-based collaborative filtering algorithms therefore play an important role in modern commercial recommendation systems (RSs). This paper aims to improve recommendation performance using a novel rating normalization strategy.

When it comes to assigning a rating to an item, each user has a personal scale. Even if an explicit definition of each possible rating is supplied, some users might be reluctant to give high or low scores to items they liked or disliked. Different rating normalization schemes have been designed for different reasons [17–19]. In addition, as noted above, many of the observed rating values are due to effects associated with either users or items, independently of their interaction. We therefore not only convert individual ratings to a more universal scale but also consider the user and item biases.

The baseline predictor (BLP), which combines the overall average rating with the user or item biases, brings these factors into the normalization. For item-based collaborative filtering systems, however, the BLP is always a statistical constant that cannot adapt to the context [20–23]. We found that the recommendation performance can be improved if the different parts of the BLP are slightly scaled within a limited range. In this paper, we propose several novel context-aware scaled baseline predictors (CASBLP) for item-based collaborative filtering normalization, each of which considers different context information. The experimental results show that CASBLP can significantly improve the prediction performance, as measured by Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), precision, recall, and Normalized Discounted Cumulative Gain (nDCG).

The rest of this paper is organized as follows. We present the details of CASBLP in Section 2 and show experimental results in Section 3. Finally, we conclude the paper in Section 4.

#### 2. Description of Models

##### 2.1. Baseline Predictor for Item-Based CF Normalization

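A general neighborhood-based collaborative filtering recommendation using BLP normalization is defined as follows:

$$\hat{r}_{ui} = b_{ui} + \hat{r}^{\mathrm{NN}}_{ui}.$$

$\hat{r}^{\mathrm{NN}}_{ui}$ is the rating predictor based on the nearest neighbors. $b_{ui}$ is the baseline predictor, which is usually defined as

$$b_{ui} = \mu + b_u + b_i.$$

Denote by $\mu$ the average rating. The parameters $b_i$ and $b_u$ indicate the observed deviations of item $i$ and user $u$, respectively, from the average.

For item-based CF, we do not use the user biases, because similar items are used as neighbors. So the BLP in item-based CF is

$$b_{ui} = \mu + b_i.$$

$\hat{r}^{\mathrm{NN}}_{ui}$ is replaced by the following formula:

$$\hat{r}^{\mathrm{NN}}_{ui} = \frac{\sum_{j \in S^k(i) \cap I_u} s_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i) \cap I_u} |s_{ij}|}.$$

$S^k(i)$ is the set of the $k$ most similar items to the item $i$, and $I_u$ is the set of items the user $u$ has rated.

There are many different similarity weight functions $s_{ij}$. In this paper, we use two popular ones, Cosine and Pearson’s Correlation, which are defined, respectively, as

$$s^{\cos}_{ij} = \frac{\sum_{u \in U_{ij}} r_{ui}\, r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2}\,\sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}, \qquad s^{\mathrm{PC}}_{ij} = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}},$$

where $U_{ij}$ is the set of users who rated both item $i$ and item $j$, and $\bar{r}_i$ is the mean rating of item $i$.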
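The prediction scheme of this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy ratings, the function names (`item_columns`, `baseline`, `cosine`, `predict`), and the choice to compute Cosine similarity over co-rating users only are all assumptions made for the example. It uses the item-based baseline (global mean plus item bias) and normalizes each neighbor rating by subtracting that neighbor's baseline.

```python
from math import sqrt

# Toy data (illustrative only): user -> {item: rating}
RATINGS = {
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 2.0, "c": 5.0},
    "u3": {"a": 1.0, "b": 5.0},
    "u4": {"b": 4.0, "c": 2.0},
}

def item_columns(ratings):
    """Invert user -> item -> rating into item -> user -> rating."""
    cols = {}
    for user, row in ratings.items():
        for item, r in row.items():
            cols.setdefault(item, {})[user] = r
    return cols

def baseline(cols):
    """Item-based BLP parts: global mean mu and item biases b_i."""
    all_r = [r for col in cols.values() for r in col.values()]
    mu = sum(all_r) / len(all_r)
    b = {i: sum(col.values()) / len(col) - mu for i, col in cols.items()}
    return mu, b

def cosine(col_i, col_j):
    """Cosine similarity over users who rated both items (an assumption)."""
    common = set(col_i) & set(col_j)
    if not common:
        return 0.0
    num = sum(col_i[u] * col_j[u] for u in common)
    den = sqrt(sum(col_i[u] ** 2 for u in common)) * sqrt(
        sum(col_j[u] ** 2 for u in common)
    )
    return num / den if den else 0.0

def predict(ratings, u, i, k=20):
    """Baseline of (u, i) plus the similarity-weighted normalized neighbor ratings."""
    cols = item_columns(ratings)
    mu, b = baseline(cols)
    # k most similar items to i, then keep only those the user has rated
    ranked = sorted(
        ((cosine(cols[i], cols[j]), j) for j in cols if j != i), reverse=True
    )[:k]
    neigh = [(s, j) for s, j in ranked if j in ratings[u]]
    num = sum(s * (ratings[u][j] - (mu + b[j])) for s, j in neigh)
    den = sum(abs(s) for s, _ in neigh)
    return mu + b[i] + (num / den if den else 0.0)
```

Calling `predict(RATINGS, "u3", "c")` gives the estimate for an item the user has not rated. Dropping the `mu + b[j]` subtraction and the `mu + b[i]` offset would correspond to prediction without normalization.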

##### 2.2. Motivation of Scaling Baseline Predictor

The baseline predictor introduces information that is independent of the neighborhood influence, but it is always set as a constant. We found that slightly scaling the baseline predictor can give better prediction accuracy. However, using a single scaling factor for all of its components (the mean rating $\mu$, the user bias $b_u$, and the item bias $b_i$) is not a good idea. Figure 1 shows an example in which the RMSE decreases when the whole baseline predictor is scaled by a single factor on a small MovieLens dataset.

From Figure 1, the best scaling factor is 0.6, at which we get the lowest RMSE. However, from another perspective, such as the Top-$N$ measures, using the same scaling factor is not a good choice. Figure 2 shows that scaling the BLP in this way does not improve precision or recall.

For recommendation systems, the Top-$N$ measures are more important than RMSE. To improve both RMSE and the Top-$N$ measures, we should not use the same scaling factor for every component of $b_{ui}$:

$$b_{ui} = \lambda_1 \mu + \lambda_2 b_u + \lambda_3 b_i. \tag{6}$$

Determining these three parameters is very difficult, and, unlike matrix factorization models, neighborhood-based CF (NBCF) models cannot learn unknown parameters by training. In this paper, we provide several context-aware scaling factors. Before describing the details, we first change (6) to another representation. The item-based baseline predictor can also be written as

$$b_{ui} = \mu + \frac{\sum_{v \in U_i} (r_{vi} - \mu)}{|U_i|},$$

where $U_i$ is the set of users who rated item $i$. The scaled version of the baseline predictor can then be written as

$$b_{ui} = \alpha \mu + \frac{\sum_{v \in U_i} (r_{vi} - \mu)}{|U_i| + \beta}.$$

Here, we use the denominator to control the scaling of the bias. In fact, $\beta$ is the Bayesian mean damping term [24], which biases the item means toward the global mean $\mu$. Our task is to determine $\alpha$ and $\beta$ according to the context information.
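The effect of the two scaling factors can be illustrated with a small Python sketch. It assumes that the scaled predictor has the form "scale times global mean, plus the sum of deviations from the global mean divided by the rating count plus a damping term", which is a reading of the description above rather than the authors' exact formula. The function name `scaled_item_baseline` and the parameter names `alpha` and `beta` are hypothetical.

```python
def scaled_item_baseline(item_ratings, mu, alpha=1.0, beta=0.0):
    """Scaled item-based baseline (assumed form):
    alpha * mu + sum(r - mu) / (n + beta).

    alpha scales the global mean; beta is a damping term that shrinks
    the item bias toward zero, most strongly for rarely rated items.
    """
    n = len(item_ratings)
    if n + beta == 0:
        return alpha * mu  # no ratings and no damping: bias is undefined, use 0
    bias = sum(r - mu for r in item_ratings) / (n + beta)
    return alpha * mu + bias

# With alpha = 1 and beta = 0 this is simply the item's mean rating.
unscaled = scaled_item_baseline([5.0, 4.0, 3.0], mu=3.5)
# Damping pulls the estimate toward the global mean.
damped = scaled_item_baseline([5.0, 4.0, 3.0], mu=3.5, beta=3.0)
# Shrinking alpha lowers the contribution of the global mean.
shrunk = scaled_item_baseline([5.0, 4.0, 3.0], mu=3.5, alpha=0.8, beta=3.0)
```

Under this assumed form, setting `alpha` to 0 leaves only the damped item bias, which matches the later remark that a prediction is still possible from the item biases alone.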

Recommendation is a rather special machine learning problem, because the user-item matrix is almost always very sparse. When data is sparse, we need other sources of knowledge to help the learning algorithm. Mining the context information is one way of adding knowledge to recommendation algorithms.

##### 2.3. Context-Aware Scaled Baseline Predictors

We consider several context situations to determine the scaling baseline predictors: ratings distribution, categories distribution, timestamp distribution, and links distribution. At first, we denote by the set of all the items and by the set of all the users.

The rating distribution aware (RDA) method scales the baseline predictors in terms of ratings distribution. The values of ratings are usually discrete. Denote by the set of possible rating values, where .

Denote by the set of rating records of which the value is : is the user, represents the item, and means the rating of rated by . Also, denotes the set of users whose ratings contain , and denotes the set of items which are rated using the value . Now we sort all , and let be the set of order by descent according to the size of sets: , where . Denote by all the rating records. The scaling factors of RDA are evaluated as Here, we use the largest . If the sizes of some sets are equal and the number of candidates is larger than , we randomly select the sets of the same size.

Like RDA, the category distribution aware (CDA) method scales the baseline predictors in terms of the category distribution. Items in a recommendation system usually carry labels that indicate particular attributes. In MovieLens, for example, the movies are labeled with genres. Each label corresponds to a category, and each item belongs to at least one category.

Suppose we have categories, and denote by the set of these different categories, where . Denote by the set of items belonging to . is a descent ordered set according to the size of set: , where . For CDA, the scaling factors are expressed asNote that to determine we use as the numerator and as the denominator. The difference is that the items always belong to multiple categories.

There is always a timestamp record for each rating. The timestamp distribution aware (TDA) method scales baseline predictor in terms of timestamp distribution. Suppose that the element of is a 4-tuple, where . The meanings of , , and are the same as in . is just the timestamp when rated by the score . The format of is usually a Unix timestamp. We change to “yy-mm-dd” format . That means the base unit of time is the day, and now .

Let be the set of rating records, of which the reduced timestamp is . Like the previous two methods, we create a descent ordered set according to the size of , where .

We select the first elements of to compose another truncated set . Denote by the set of distinct users of the rating records belonging to . The scaling factors of TDA are expressed as
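The grouping steps of TDA described above (reduce each Unix timestamp to a day, group the rating records by day, order the groups by size, keep the largest ones, and collect their distinct users) can be sketched as follows. This covers only the preprocessing; the scaling-factor formula itself is not reproduced here. The function names `top_days` and `distinct_users`, the tuple layout `(user, item, rating, unix_ts)`, and the use of UTC for the day boundary are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def top_days(records, n):
    """Group (user, item, rating, unix_ts) records by day and
    return the n largest groups as (day, records) pairs."""
    by_day = defaultdict(list)
    for rec in records:
        # Reduce the Unix timestamp to a "yy-mm-dd" day key (UTC assumed)
        day = datetime.fromtimestamp(rec[3], tz=timezone.utc).strftime("%y-%m-%d")
        by_day[day].append(rec)
    # Descending order by group size; ties keep first-seen order
    ordered = sorted(by_day.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ordered[:n]

def distinct_users(day_groups):
    """Distinct users appearing in the selected day groups."""
    return {rec[0] for _, recs in day_groups for rec in recs}

records = [
    ("u1", "a", 4.0, 0),
    ("u2", "a", 3.0, 10),
    ("u1", "b", 5.0, 20),
    ("u3", "c", 2.0, 86400),  # one day later
]
busiest = top_days(records, 1)
users = distinct_users(busiest)
```

The same group-sort-truncate pattern applies to the rating-value and category distributions, with the grouping key changed accordingly.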

The links distribution aware (LDA) method scales the baseline predictor in terms of the links distribution. The links are the relationships between users and items, which make up a rating network. No two users are linked, and no two items are linked, so the network is bipartite. Equation (13) and Figure 3 show an example of a rating network:

Only when the rating between and is larger than or equal to can we connect and . The degree of the user is expressed as and for the item . We create two descent ordered sets and according to the degrees. It is obvious that and . But, for convenience, we use different symbols. That is, and . There is a unique mapping from to and from to . For and , we have and . We put the ordered degrees of users and items into two sets, respectively: and , where and .
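The construction of the rating network and the ordered degree lists can be sketched as follows. A user and an item are linked only when the rating reaches a threshold, and the degree of a node is its number of links. The function name `ordered_degrees`, the record layout `(user, item, rating)`, and the threshold value used in the example are assumptions; nodes without any link are simply omitted in this sketch.

```python
from collections import Counter

def ordered_degrees(records, threshold):
    """Build the bipartite rating network implicitly and return the
    user degrees and item degrees, each sorted in descending order."""
    user_deg, item_deg = Counter(), Counter()
    for user, item, rating in records:
        if rating >= threshold:  # link exists only for high enough ratings
            user_deg[user] += 1
            item_deg[item] += 1
    return (
        sorted(user_deg.values(), reverse=True),
        sorted(item_deg.values(), reverse=True),
    )

records = [
    ("u1", "a", 4.0),
    ("u1", "b", 5.0),
    ("u2", "a", 2.0),  # below the threshold: no link
    ("u2", "b", 4.0),
]
user_degrees, item_degrees = ordered_degrees(records, threshold=3.0)
```

Because every link joins exactly one user and one item, the two degree lists always have the same sum, which is the number of links in the network.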

For LDA, we have two ways of evaluating the scaling factors. When considering the degrees of the users, the method is called LDAU, which is expressed as

Also, when considering the degrees of the items, the method is called LDAI, and the scaling factors are expressed as

Unlike the other methods, LDA controls and using different distributions. For , we use the top and top degrees, but for , we use the top degrees and the average degree.

#### 3. Experiments

##### 3.1. Experimental Settings

We use the MovieLens latest dataset in our experiments, which includes 100,000 ratings and 6,100 tag applications applied to 10,000 movies by 700 users [25]. The dataset consists of four files: links, movies, ratings, and tags. We use these files to obtain the different kinds of context information. We compare several methods in our experiments; their names and meanings are shown in Table 1.

The total methods compared are defined in Table 1. There are two similarity weight functions in our experiments: Cosine and Pearson’s Correlation. The neighborhood sizes of item-based models are all set to 20, while they are 100 for user-based models. Values of in (11)~(15) are the same, 6 in default. The values of are also the same for these different methods, 20 in default. We randomly split the dataset into 5 parts and use cross-validation to train and test the models.

For the Top-$N$ metrics (e.g., precision and recall), we randomly select 100 items from the test set as candidates, excluding those that appear in the training set. Only items rated 3.5 or above are recommended.

Neighborhood-based collaborative filtering models always incur a high memory cost. We therefore run the different neighborhood-based CF algorithms on a machine with 16 GB of RAM.

##### 3.2. Experimental Metrics

Five metrics are used in our experiments: precision, recall, RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and nDCG (Normalized Discounted Cumulative Gain).

For a test dataset, denote by TP the set of recommended items in which the users are really interested, by FP the set of recommended items in which the users are not interested, by FN the set of items that are not recommended but in which the users are interested, and by TN the set of items that are not recommended and in which the users are not interested. Precision and recall are defined, respectively, as follows:

$$\text{Precision} = \frac{|TP|}{|TP| + |FP|}, \qquad \text{Recall} = \frac{|TP|}{|TP| + |FN|}.$$
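As a concrete illustration, precision and recall can be computed from predicted and true ratings as sketched below. The sketch assumes that an item is recommended when its predicted rating reaches the threshold of 3.5 mentioned in the experimental settings, and that the same threshold on the true rating decides whether the user is really interested; the latter is an assumption for the example, as is the function name `precision_recall`.

```python
def precision_recall(predicted, actual, threshold=3.5):
    """predicted, actual: item -> rating for one user's candidate items."""
    recommended = {i for i, r in predicted.items() if r >= threshold}
    relevant = {i for i, r in actual.items() if r >= threshold}
    tp = len(recommended & relevant)   # recommended and interesting
    fp = len(recommended - relevant)   # recommended but not interesting
    fn = len(relevant - recommended)   # interesting but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

predicted = {"a": 4.2, "b": 3.6, "c": 3.0, "d": 3.5}
actual = {"a": 5.0, "b": 2.0, "c": 4.0, "d": 3.5}
precision, recall = precision_recall(predicted, actual)
```

In this toy case three items are recommended and three are relevant, with two in common, so both metrics equal two thirds.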

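The recommendation system generates predicted ratings $\hat{r}_{ui}$ for a test set $\mathcal{T}$ of user-item pairs $(u, i)$ for which the true ratings $r_{ui}$ are known. The RMSE and MAE between the predicted and actual ratings are given by

$$\text{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} (\hat{r}_{ui} - r_{ui})^2}, \qquad \text{MAE} = \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} |\hat{r}_{ui} - r_{ui}|.$$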
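Both error metrics follow directly from their standard definitions. The sketch below uses plain lists of predicted and true ratings in matching order; the function names `rmse` and `mae` are chosen for illustration.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error over paired ratings."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(errors) / len(errors))

def mae(predicted, actual):
    """Mean absolute error over paired ratings."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

predicted = [3.0, 4.0, 5.0]
actual = [4.0, 4.0, 3.0]
rmse_value = rmse(predicted, actual)
mae_value = mae(predicted, actual)
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same data.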

Recommendation systems usually present the user with a list of recommendations, imposing a certain natural browsing order. In many cases, we are not interested in predicting an explicit rating or selecting a set of recommended items, as with the previous metrics; rather, we are interested in ordering items according to the user’s preferences. nDCG is a measure from information retrieval in which positions are discounted logarithmically. Assuming that each user $u$ has a “gain” $g_{ui}$ from being recommended an item $i$, the average Discounted Cumulative Gain (DCG) over a set of users $U$ for a list of $J$ items is defined as

$$\text{DCG} = \frac{1}{|U|} \sum_{u \in U} \sum_{j=1}^{J} \frac{g_{u i_j}}{\max(1, \log_b j)},$$

where $i_j$ is the item at position $j$ of the list and the logarithm base $b$ is a free parameter, typically between 2 and 10. A logarithm with base 2 is commonly used to ensure that all positions are discounted. nDCG is just the normalized version of DCG:

$$\text{nDCG} = \frac{\text{DCG}}{\text{DCG}^{*}},$$

where $\text{DCG}^{*}$ is the ideal DCG. The value of nDCG ranges from 0 to 1; the larger the value, the better the performance.
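A single-user version of this metric can be sketched as follows, assuming the common form in which the gain at position j is divided by max(1, log_b j) and the ideal DCG is obtained by sorting the same gains in descending order. The function names `dcg` and `ndcg` are illustrative; averaging over users is left to the caller.

```python
from math import log

def dcg(gains, base=2.0):
    """Discounted cumulative gain of a ranked list of gains (position 1 first)."""
    return sum(g / max(1.0, log(j, base)) for j, g in enumerate(gains, start=1))

def ndcg(gains, base=2.0):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal else 0.0

perfect = ndcg([3.0, 2.0, 1.0])   # already in the ideal order
reversed_order = ndcg([1.0, 2.0, 3.0])
```

A list that is already sorted by gain scores exactly 1, and any other ordering of the same gains scores at most 1.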

##### 3.3. Experimental Results

We slightly modify the format of the MovieLens dataset and import it into a MySQL database. The coefficients of the BLP can then be conveniently calculated with SQL statements. All of the coefficients of the CASBLP methods are shown in Table 2.

The experimental comparison results are shown in Table 3 (using Cosine similarity) and Table 4 (using Pearson’s Correlation). Cosine appears to perform better than Pearson’s Correlation in our experiments. A possible reason is that, although each user has a personal rating scale, the rating matrix is so sparse that these scale differences do not become the major issue. When data is sparse, Cosine is usually a good choice.

From Table 3, we can see that without any normalization scheme (NoBP) all of the metrics are much worse than with the other methods. Even the unscaled BLP is clearly better than NoBP: precision increases by about 7%, recall increases by more than 10%, and RMSE decreases by 4%. It is surprising that, using only the simple unscaled BLP, the MAE improves by 15% and the nDCG increases by more than 20%. Because the recommendation order has great commercial significance, normalization is an important improvement for a recommendation system. Our context-aware scaled BLP normalization schemes yield a further improvement, mainly in precision and recall. In both Tables 3 and 4, CASBLP normalization has almost the same RMSE, MAE, and nDCG as USBP and is sometimes even slightly worse. However, for a commercial recommendation system, what users care about is whether the RS recommends what they really need. Product sales would benefit from even a 1% improvement in precision or recall. The precision of our CASBLP schemes increases by about 5% and the recall by about 8%, which is a great improvement from the commercial perspective.

An important question is whether the coefficients we used are close to their optimal values. We therefore vary the scaling factor of the mean rating from 0 to 1 and the damping term from 0 to 200 and observe the changes in performance. Figures 4–6 show the impact of the scaling factors on RMSE, precision, and recall, respectively.

For all three metrics, the optimum of the damping term is near 20, at which the RMSE is lowest and the precision and recall are highest. Interestingly, any shrinking of the mean-rating scaling factor improves precision and recall, even when it is set to zero. However, shrinking it causes a slightly higher RMSE, except at values near 0.8.

This means that the mean-rating term controls the accuracy of the rating prediction, but once it has been shrunk, the bias term plays the crucial role in item recommendation. A possible cause is that the mean rating is computed over all the users and thus carries global information, while the biases are computed from only a few similar neighbors and thus carry local information. For personalized recommendation systems, the local information is much more important, and an ordinary average prediction has little meaning. That is why, even if we set the mean-rating scaling factor to 0 and use only the item biases, we still get a passable prediction performance.

The neighbor size is an important factor in neighborhood-based recommendation systems, whether item-based or user-based. We increase the neighbor size geometrically from 5 to 320. Figures 7, 8, and 9 show the resulting changes in precision, recall, and RMSE, respectively.

What we see in Figure 9 is consistent with what we concluded from Tables 3 and 4. With either the scaled or the unscaled BLP, we get similar RMSE values, all of which are much lower than that of the NoBP scheme. As the neighbor size grows, all the RMSE values stabilize.

What surprised us are the results for precision and recall. As the neighbor size grows, both metrics increase until they reach stable values, except for the NoBP scheme, whose precision and recall decrease to stable values. A possible reason is that, without normalization, the prediction lacks personalization, so a larger neighborhood introduces many more decoys to choose from.

Figures 7 and 8 also show results that are consistent with Tables 3 and 4. By only slightly changing the coefficients of the BLP, we get higher precision and recall than with the unscaled BLP scheme and NoBP, especially for larger neighbor sizes.

#### 4. Conclusions

Rating normalization is an important step when designing collaborative filtering recommendation systems, especially item-based ones, which play a key role in online commercial business. Using the baseline predictor for normalization takes both global and local information into account. Although we found that balancing the two can improve the recommendation performance, there is no clear way of determining the weights of these two sources of information. In this paper, we proposed several context-aware scaled BLP schemes, which compute the weights of the mean rating and the biases, respectively, in terms of different context information. The experiments not only verified the advantage of the scaled BLP but also revealed the different roles of each part of the BLP. This paper studied only the BLP normalization of item-based collaborative filtering systems on a single MovieLens dataset. User-based and matrix factorization models are quite different from item-based ones; we will explore them in future work using different and larger recommendation datasets.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work is supported by the National Natural Science Foundation of China (no. 61602399 and no. 61502410) and Shandong Provincial Natural Science Foundation, China (no. ZR2016FB22).