Mathematical Problems in Engineering

Volume 2015, Article ID 380472, 13 pages

http://dx.doi.org/10.1155/2015/380472

## Improving Top-*N* Recommendation Performance Using Missing Data

^{1}Beijing Research Center for Information Technology in Agriculture, Beijing 100097, China

^{2}School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Received 23 April 2015; Accepted 26 August 2015

Academic Editor: Jean-Charles Beugnot

Copyright © 2015 Xiangyu Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Recommender systems have become increasingly important in addressing the information explosion problem. Data sparsity is a main challenge in this area: massive numbers of unrated items constitute missing data, with only a few observed ratings. Most studies treat missing data as unknown information and use only observed data to learn models and generate recommendations. However, data are missing not at random. Part of the missing data arises because users choose not to rate the items, and this part provides negative examples of user preferences. Utilizing this information is expected to improve the performance of recommendation algorithms. Unfortunately, negative examples are mixed with unlabeled positive examples in missing data, and the two are hard to distinguish. In this paper, we propose three schemes to utilize the negative examples in missing data. The schemes are then combined with SVD++, a state-of-the-art matrix factorization recommendation approach, to generate recommendations. Experimental results on two real datasets show that our proposed approaches achieve better top-*N* performance than the baselines in both accuracy and diversity.

#### 1. Introduction

In the current age of information overload, it is becoming increasingly hard for people to find relevant content. Recommender systems have been introduced to help people retrieve potentially useful information from a huge set of choices. Conventional recommendation methods are based on users’ rating values, which are considered indications of users’ preference levels towards the rated items. Recommender systems estimate the ratings of items that the target user has not rated, based on the rating history, and recommend the top-*N* items with the highest predicted ratings. Rating prediction approaches of this kind have achieved significant success. Recently, there has been growing interest in improving recommender systems in terms of ranking performance, as it seems to better approximate the true task [1, 2]. As a result, some researchers treat the recommendation problem as a ranking prediction problem and directly optimize a ranking objective to learn their recommendation algorithms.

Most of these approaches, whether rating prediction or ranking prediction, are trained and tested on observed ratings only. Their effectiveness rests on an implicit assumption that the ratings in the available data are missing at random. If this assumption is not satisfied, the missing data mechanism cannot be ignored in general and has to be modeled precisely to obtain correct results. Indeed, some recent works find that the data are missing not at random [3–5]. Marlin et al. [3] provide evidence that low ratings are much more likely to be missing from the observed data than high ratings in the Yahoo! LaunchCast data. This may be a consequence of the fact that users are free to choose which items to rate. Steck [4] works on training and testing recommender systems on data missing not at random and shows that accounting for missing ratings can improve the top-*N* performance of a simple matrix factorization model. Therefore, missing data, that is, items that have not been rated by the active users, carry useful information about user preferences.

We agree with the idea that data are missing not at random. However, our assumption differs from those in [3–5], which consider missing data mainly as negative ratings. In our opinion, missing data contain many negative examples of users’ preferences, while observed data are positive ones. In our previous work [6], we found that rating behavior itself is evidence of user preference, regardless of whether the rating values are high or low. This indicates that users are free to choose which items to rate in the context of recommender systems. Therefore, much of the missing data arises because users choose not to rate the items. For example, someone who dislikes horror films will not watch “Evil Dead” and will not rate it either. For a given user, the entire item data can be split into two sets: the positive item set, which contains the items that the user chooses to rate, and the negative item set, which contains the items that the user deliberately chooses not to rate. Observed data are part of the positive item set. The rest of the positive item set combines with the negative item set to form the missing data. The goal of recommender systems is to identify the items in the positive item set that have not been rated yet and to recommend them to users. As a result, it is important to determine which item set an item belongs to. Unfortunately, only positive examples are available for classification; negative examples are mixed with some positive ones in missing data.

An intuitive approach to solving this problem is to identify the negative examples and use them together with the positive ones to learn a recommendation model. In this paper, we propose three schemes to obtain negative examples from missing data and combine them with an existing recommendation approach into a unified model. The new model is expected to distinguish between positive and negative items better than the original model and to recommend more items that users may rate. As a result, the new model is expected to improve top-*N* recommendation performance.

The first scheme considers all missing data as negative examples, with a different confidence level from that of the positive ones. The other two sample some negative examples from missing data with a stochastic method or a neighbor-based method. All the schemes utilize the information from missing data. To verify their effectiveness, we combine the schemes with SVD++ [7] (a state-of-the-art rating prediction approach using a matrix factorization model) to form new models. Our experiments demonstrate that the new models achieve significant improvement over SVD++ in top-*N* recommendation.

The remainder of the paper is organized as follows. We review related literature in Section 2. The schemes to deal with missing data are introduced in Section 3. The improvements of SVD++ are proposed in Section 4. Section 5 introduces some popular evaluation metrics. Experiments are carried out on MovieLens and EachMovie datasets in Section 6 to compare the proposed approaches with existing ones. Finally, we conclude the paper in Section 7.

#### 2. Related Work

In this section, the literature review is divided into four parts. The first covers conventional rating prediction recommendation algorithms. The second covers studies on ranking prediction recommendation approaches. The third covers recent works on nonrandom missing data. The last focuses on one-class collaborative filtering, whose underlying idea is similar to that of our proposed schemes.

##### 2.1. Rating Prediction Approaches

Recommendation techniques have been studied for many years. Conventional recommendation approaches are based on rating prediction. They are used to provide personalized recommendations that help people cope with the information explosion problem. Collaborative filtering (CF) is a very popular technique because it relies on swarm intelligence rather than on analyzing the content of the candidate items. Furthermore, it can be easily adapted from one domain to another. CF algorithms can be divided into two classes: memory-based and model-based [8, 9].

Memory-based algorithms are heuristic methods that make rating predictions based on the entire collection of items previously rated by users [10, 11]. They rest on the basic assumption that people who agreed in their preferences for certain items in the past tend to agree again in the future [12]. The level of agreement can be measured by similarity. Based on the similarity calculation, recommender systems predict ratings for unknown items using an adjusted weighted sum of known ratings and recommend items with high predicted values [11].
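As an illustration only, one common textbook form of such an adjusted weighted sum is shown below. The notation is an assumption introduced here and is not taken from this paper: $\bar{r}_u$ is the mean rating of user $u$, $\mathcal{N}_i(u)$ is the set of neighbors of $u$ who rated item $i$, and $\mathrm{sim}(u,v)$ is the chosen user similarity.

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in \mathcal{N}_i(u)} \mathrm{sim}(u,v)\,\left(r_{vi} - \bar{r}_v\right)}{\sum_{v \in \mathcal{N}_i(u)} \left|\mathrm{sim}(u,v)\right|}$$

Subtracting each neighbor's mean rating compensates for users who rate systematically higher or lower than others.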

Model-based CF is the other typical class of CF methods. Model-based algorithms use the collection of ratings to learn a model, typically with statistical machine-learning methods, which is then used to make rating predictions. These approaches typically design appropriate loss functions and optimization procedures to learn their models by minimizing the error between predicted ratings and actual ones. Examples of such techniques include Bayesian clustering [9], matrix factorization [7], and topic models [13].

SVD++ [7] is a model-based CF approach that uses a matrix factorization technique. It treats implicit feedback as a complement to explicit feedback and uses both together to build recommendation models by minimizing prediction errors. It is a state-of-the-art rating prediction approach and serves as the foundation of our improvement.
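For reference, the SVD++ prediction rule is usually written as follows. The notation follows the standard SVD++ formulation rather than anything defined in this section, so the symbols should be read as assumptions here: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are user and item factor vectors, $N(u)$ is the set of items for which user $u$ provided implicit feedback, and $y_j$ is the implicit-feedback factor vector of item $j$.

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\left(p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j\right)$$

The second term inside the parentheses is where implicit feedback enters: the user representation is shifted by the items the user has interacted with, regardless of the rating values.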

##### 2.2. Ranking Prediction Approaches

Unlike these rating prediction approaches, some studies treat the recommendation problem directly as a ranking problem. They propose models for ranking prediction that model user preferences with respect to a set of items rather than the rating scores of individual items.

Weimer et al. [14] present a method (CofiRank) that uses Maximum Margin Matrix Factorization and takes the maximization of NDCG as its optimization target. The approach is adaptable to different scores. Since the optimization target of CofiRank is a listwise one, the approach scales well on collaborative filtering tasks.

Liu and Yang [15] measure the similarity between users based on the correlation between their rankings of the items rather than the rating values. Based on the preferences of similar users, they propose collaborative filtering algorithms for ranking items with either a greedy strategy or a random walk model.

Liu et al. [2] propose a probabilistic latent preference analysis (pLPA) model to make ranking predictions. From a user’s observed ratings, they extract his/her preferences in the form of pairwise comparisons of items, which are modeled by a mixture distribution based on the Bradley-Terry model. An EM algorithm for fitting the corresponding latent class model and a method for predicting the optimal ranking are described.

Koren and Sill [16] propose a collaborative filtering recommendation framework (OrdRec), which views user feedback on products as ordinal rather than taking the more common numerical view. Their approach is based on a pointwise ordinal model, which allows it to scale linearly with data size. OrdRec is also an improvement of SVD++. It is used as a comparison approach in our experiments to verify the effectiveness of our proposed approaches in the top-*N* recommendation task.

##### 2.3. Nonrandom Missing Data

Most conventional collaborative filtering approaches use observed ratings only and expect that a model optimized on observed ratings is an unbiased estimate of a model optimized on the entire data. These approaches rest on an implicit assumption that the ratings absent from the observed data are missing at random. However, this assumption may not hold. Some recent works have found that data are not missing at random [3–5].

Marlin et al. [3] find that low ratings are much more likely to be missing from observed data than high ratings in the Yahoo! LaunchCast data. This is evidence that data are missing not at random. Steck [4, 5] works on training and testing recommender systems on data missing not at random. He assumes that the relevant rating values are missing at random and that the other ratings are missing with higher probability. Based on this assumption, he presents two performance measures that can be estimated, under mild assumptions, without bias from data even when ratings are missing not at random. In addition, he proposes an appropriate surrogate measure for training models, referred to as AllRank. This measure considers both observed and missing data. It improves the top-*N* performance of a simple matrix factorization model by accounting for missing ratings.

Cremonesi et al. [1] propose an improvement of matrix factorization that treats all missing values in the user rating matrix as 0, referred to as PureSVD. This approach achieves better top-*N* performance than even more detailed and sophisticated latent factor models. The result demonstrates that treating missing data as 0 is much more effective than simply ignoring them, which is further evidence that data are missing not at random.

##### 2.4. One-Class Collaborative Filtering

In some recommendation contexts, the training data consist simply of binary data reflecting a user’s actions. Researchers treat these problems as one-class collaborative filtering (OCCF) problems. In these problems, users’ action data are usually extremely sparse (only a small fraction are positive examples); therefore ambiguity arises in the interpretation of the nonpositive examples. Negative examples and unlabeled positive examples are mixed together and usually cannot be distinguished. Pan et al. [17] propose two frameworks to solve OCCF problems. One is based on weighted low rank approximation; the other is based on negative example sampling. Li et al. [18] exploit rich user information to improve recommendation accuracy in OCCF problems. They propose two ways to incorporate such user information into OCCF models: one is to linearly combine scores from different sources, and the other is to embed user information into collaborative filtering. Rendle et al. [19] consider missing data as a mixture of real negative feedback and missing positive values and present a generic optimization criterion (BPR) for personalized ranking, which is the maximum posterior estimator derived from a Bayesian analysis of the OCCF problem.

The schemes that we propose in the next section, which deal with the missing data of recommender systems by weighting or sampling, are similar in idea to those in OCCF. However, the recommendation context differs: OCCF focuses on binary recommendation problems with implicit feedback, while our schemes focus on classical recommendation problems with explicit feedback. Furthermore, the neighbor-based sampling scheme proposed in Section 3.3 can exploit the advantages of *k*NN methods while sampling negative examples.

#### 3. Schemes to Deal with Missing Data

In the context of recommender systems, users are free to choose which items to rate. As a result, the observed rating data can indicate users’ preferences. In the survey of Marlin et al. [3] using Yahoo! LaunchCast data, 93.9% of users report that they very often rate an item that they love, while only 36.5% of users report that they rate an item about which they are “neutral” with the same frequency. The survey concerns ratings for songs, a context in which rating takes little time. If the context changes to a very time-consuming or costly one, such as movies or e-commerce, the proportion of users who choose to rate an item about which they are “neutral” should be lower. Therefore, there are two types of items for a given user. The first type comprises the items that the user wants to rate; they form the positive item set. The second type comprises the items that the user does not care about and does not want to rate; they form the negative item set. The observed data contain the rated items and are part of the positive item set. The rest of the positive item set combines with the negative item set to form the missing data. In this paper, we consider the items in the positive item set as positive examples and the items in the negative item set as negative examples.

Based on this partition, Steck considers that the rating distribution differs between the two item sets [5]. He tries to model the difference to improve a simple matrix factorization approach in the top-*N* recommendation task [4]. In his opinion, negative ratings with low values have a high probability of being missing. Therefore, he imputes a small value ($r_m$) for all missing data and uses a weighting parameter ($w_m$) to control the effect of missing data. In this way, the improved models using missing data can gain better top-*N* performance than the original matrix factorization model using observed data only (in that work, AllRank-Regression with tuned values of $r_m$ and $w_m$ gains the best top-*N* performance; it is used as a comparison approach in our experiments).

The main idea of Steck [4, 5] is that most missing data are negative ratings, so the difference between the two item sets lies in their rating distributions. In contrast, in our opinion, most missing data are negative examples: the positive examples and the negative examples belong to two different item sets. The goal of a recommender system is to identify the unrated positive examples. As a result, it is necessary to determine which item set an item belongs to. Unfortunately, only positive examples are explicit in the recommendation context; negative examples are mixed with some positive examples in missing data. To solve this problem, we try to identify the negative examples and use them together with the positive ones to learn a recommendation model. As in [4], we use an imputed value ($r_m$) for negative examples in order to model both positive and negative examples in one unified model. Unlike in Steck [4], our $r_m$ represents negative examples, which actually belong to a different item set from the positive examples. Therefore, the value of $r_m$ should lie outside the range of the rating scale so that negative examples can be distinguished from positive ones by rating value. (The typical value of $r_m$ is 0. The impact of different values of $r_m$, including values within the range of the rating scale, is examined experimentally in Section 6.)

In the rest of this section, three schemes are introduced to deal with missing data.

##### 3.1. Weighting Scheme

Weighting Scheme (WS) considers all missing data as negative examples, with a different confidence level from that of the positive ones. The weighting value indicates the confidence level, which determines how strongly missing data are considered negative examples. The weighting function can be written as

$$W_{ui} = \begin{cases} 1, & (u,i) \in R^{o}, \\ w_m, & \text{otherwise}, \end{cases}$$

where $W_{ui}$ is the weighting value for user $u$ on item $i$, $R^{o}$ is the observed data, and $w_m$ is a uniform confidence threshold for all missing data. If user $u$ has rated item $i$, the item is a positive example and the weighting value is set to 1. Otherwise, $i$ is considered a negative example with confidence level $w_m$. In this scheme, all missing data are imputed with the value $r_m$. This can be formalized as

$$\tilde{r}_{ui} = \begin{cases} r_{ui}, & (u,i) \in R^{o}, \\ r_m, & \text{otherwise}, \end{cases}$$

where $\tilde{r}_{ui}$ is the data used for learning recommendation models and $r_{ui}$ is the rating value in the observed data.
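The weighting and imputation steps of WS can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_ws_inputs`, the dictionary layout of the observed ratings, and the default values of the imputed value (`r_m`) and the missing-data confidence (`w_m`) are all hypothetical.

```python
def build_ws_inputs(observed, n_users, n_items, r_m=0.0, w_m=0.1):
    """Return (weights, targets) as dense n_users x n_items nested lists.

    observed: dict mapping (user, item) -> rating for the rated pairs.
    r_m: value imputed for every missing entry (assumed default: 0).
    w_m: confidence weight for every missing entry (hypothetical default).
    """
    # Start by treating every cell as a missing entry (negative example).
    weights = [[w_m] * n_items for _ in range(n_users)]
    targets = [[r_m] * n_items for _ in range(n_users)]
    for (u, i), rating in observed.items():
        weights[u][i] = 1.0      # observed pair: positive example, full weight
        targets[u][i] = rating   # keep the observed rating value
    return weights, targets


# Toy data: 2 users, 3 items, 3 observed ratings.
observed = {(0, 0): 5.0, (0, 2): 3.0, (1, 1): 4.0}
weights, targets = build_ws_inputs(observed, n_users=2, n_items=3)
print(weights)
print(targets)
```

Dense nested lists keep the sketch readable; with realistic data sizes a sparse representation would be needed, since the number of missing entries dominates.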

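With WS, a recommendation approach aims to find a prediction model that minimizes a weighted Frobenius loss function:

$$L = \sum_{u} \sum_{i} W_{ui} \left(\tilde{r}_{ui} - \hat{r}_{ui}\right)^{2},$$

where $W_{ui}$ is the weighting value for user $u$ on item $i$, $\tilde{r}_{ui}$ is an entry of the re-constructed matrix $\tilde{R}$, which contains both observed data and imputed ratings, and $\hat{r}_{ui}$ is the rating predicted by the recommender system.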
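To make the weighted loss concrete, the sketch below evaluates it on a toy 2 × 2 example. The function name, the weight of 0.5 for missing entries, and the prediction values are made up for illustration, and no regularization term is included because the text above does not mention one.

```python
def weighted_frobenius_loss(targets, predictions, weights):
    """Sum over all (user, item) cells of weight * (target - prediction)^2."""
    total = 0.0
    for t_row, p_row, w_row in zip(targets, predictions, weights):
        for t, p, w in zip(t_row, p_row, w_row):
            total += w * (t - p) ** 2
    return total


# Toy example: 2 users x 2 items; cells (0, 0) and (1, 1) are observed.
targets = [[5.0, 0.0], [0.0, 4.0]]       # observed ratings; missing cells imputed with 0
weights = [[1.0, 0.5], [0.5, 1.0]]       # 1 for observed, 0.5 (hypothetical) for missing
predictions = [[4.0, 1.0], [2.0, 4.0]]   # made-up model outputs

loss = weighted_frobenius_loss(targets, predictions, weights)
print(loss)  # 1*1 + 0.5*1 + 0.5*4 + 1*0 = 3.5
```

Lowering the weight of missing cells reduces how strongly the model is pushed towards the imputed value for items that may in fact be unlabeled positives.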

Broadly speaking, WS can be considered the same as AllRank-Regression in [4]. The main difference lies in how missing data are interpreted: as negative ratings (AllRank-Regression) or as negative examples (WS). In addition, PureSVD, proposed in [1], is a special case of using WS in an SVD approach with the imputed value $r_m = 0$ and the missing-data weight $w_m = 1$.

##### 3.2. Random Sampling Scheme

WS considers all missing data as negative examples. This assumption roughly holds in most cases. However, its main drawback is that the computational cost is very high, especially because recommender systems target information overload settings, in which the set of missing data is massive. A sampling scheme can alleviate this problem to a certain degree by considering only some of the missing data as negative examples, which is quite different from WS.

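In this subsection, we propose a random sampling scheme (RSS), which samples some negative examples from missing data with a stochastic method. In RSS, a fraction $p$ of the missing data is randomly selected as negative examples ($R^{n}$). These negative examples, imputed with the value $r_m$, are combined with the observed rating data $R^{o}$ to form the re-constructed matrix $\tilde{R}$ for RSS. This can be formalized as

$$\tilde{r}_{ui} = \begin{cases} r_{ui}, & (u,i) \in R^{o}, \\ r_m, & (u,i) \in R^{n}, \end{cases}$$

where $r_{ui}$ is the rating value in the observed data, and entries that belong to neither $R^{o}$ nor $R^{n}$ remain missing.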

RSS uses the re-constructed matrix $\tilde{R}$ to optimize the recommendation model and generate recommendations. Therefore, the size of $\tilde{R}$ is a major factor in the computational cost for different recommendation approaches. As the size of the observed data $R^{o}$ is constant, the computational cost mainly depends on the size of the sampled negative example set $R^{n}$. When the sampling fraction $p$ is 1, $R^{n}$ is the entire set of missing data, and the computational cost of RSS is similar to that of WS. When $p$ is 0, $R^{n}$ is an empty set, and the computational cost of RSS is similar to that of the original recommendation approach, which does not consider the effect of missing data. When $p$ is between 0 and 1, the computational cost of RSS decreases as $p$ decreases. The experimental results will show that RSS gains the best performance when $p$ is 0.2, which indicates that the computational cost of RSS is much lower than that of WS.
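The sampling step of RSS can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the function names, the dictionary layout, the sampling fraction argument `p`, and the imputed value `r_m` are assumptions, and the fixed random seed is used only to make the example repeatable.

```python
import random


def sample_negatives(observed, n_users, n_items, p, rng):
    """Pick a fraction p of the missing (user, item) pairs uniformly at random."""
    missing = [(u, i) for u in range(n_users) for i in range(n_items)
               if (u, i) not in observed]
    k = round(p * len(missing))
    return rng.sample(missing, k)


def build_rss_data(observed, negatives, r_m=0.0):
    """Combine observed ratings with sampled negatives imputed as r_m."""
    data = dict(observed)
    for pair in negatives:
        data[pair] = r_m
    return data


# Toy data: 3 users x 4 items, 4 observed ratings, hence 8 missing pairs.
observed = {(0, 0): 5.0, (0, 2): 3.0, (1, 1): 4.0, (2, 3): 2.0}
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
negatives = sample_negatives(observed, n_users=3, n_items=4, p=0.5, rng=rng)
train = build_rss_data(observed, negatives)
print(len(negatives), len(train))  # 4 sampled negatives, 8 training entries
```

Enumerating every missing pair is acceptable for a toy example only; at realistic scale one would draw random (user, item) pairs and reject those already rated, so that the cost stays proportional to the number of samples.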

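RSS focuses on utilizing missing data and does not modify the training process of the recommendation approach. Therefore, RSS learns the prediction model by minimizing an unweighted Frobenius loss function, as most recommendation approaches do. It can be written as

$$L = \sum_{(u,i) \in R^{o} \cup R^{n}} \left(\tilde{r}_{ui} - \hat{r}_{ui}\right)^{2},$$

where $R^{o}$ is the observed data, $R^{n}$ is the sampled negative example set, $\tilde{r}_{ui}$ is the corresponding entry of the re-constructed matrix, and $\hat{r}_{ui}$ is the rating predicted by the recommender system.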

Note that the re-constructed matrix $\tilde{R}$ should be rebuilt in each learning step, because the sampling scheme is stochastic; drawing a fresh sample each time reduces the effect of randomness.
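The per-step rebuilding can be sketched as a generator that draws a fresh negative sample for every learning step and an unweighted squared-error loss over the resulting entries. All names, the toy data, and the constant placeholder model are hypothetical; a real model would update its parameters after each step.

```python
import random


def unweighted_loss(data, predict):
    """Sum of squared errors over every (user, item) pair present in data."""
    return sum((value - predict(u, i)) ** 2 for (u, i), value in data.items())


def resampled_training_sets(observed, missing, p, r_m, n_steps, seed=0):
    """Yield one training dict per learning step, with freshly drawn negatives."""
    rng = random.Random(seed)
    k = round(p * len(missing))
    for _ in range(n_steps):
        data = dict(observed)
        for pair in rng.sample(missing, k):
            data[pair] = r_m   # sampled negative example, imputed with r_m
        yield data


def predict(u, i):
    """Placeholder model: always predicts 3.0 (a real model would be trained)."""
    return 3.0


observed = {(0, 0): 5.0, (1, 1): 4.0}
missing = [(0, 1), (1, 0), (0, 2), (1, 2)]

losses = [unweighted_loss(data, predict)
          for data in resampled_training_sets(observed, missing,
                                              p=0.5, r_m=0.0, n_steps=3)]
print(losses)  # each step: (5-3)^2 + (4-3)^2 + 2*(0-3)^2 = 23.0
```

The loss is identical across steps here only because the placeholder model is constant; with a trained model, each step would see a different set of negatives and a different loss.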

##### 3.3. Neighbor-Based Sampling Scheme

A sampling scheme can reduce the computational cost of the weighting scheme. However, with a stochastic method, the missing positive examples and the negative ones have the same chance of being selected as negative examples. In this subsection, we propose a neighbor-based sampling scheme (NSS) to increase the chance that negative examples are selected and to decrease the chance that positive ones are selected. Unlike RSS, which samples with a stochastic method, NSS samples negative examples from missing data using swarm intelligence.

NSS is based on the assumption that similar users have similar tendencies regarding negative examples. As in neighbor-based CF, for a given user, items that have rarely been rated by his/her neighbors are very likely to be negative examples. Accordingly, NSS finds the most similar users as neighbors for each user and then selects the items that have not been rated by the user’s neighbors. In this way, negative examples have a greater chance of being selected than positive ones. After that, NSS randomly samples some of the selected items as negative examples. The sampled result is a negative example set ($R^{n}$). The details of NSS are described in Algorithm 1.
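The description above can be illustrated with a rough sketch. It is not a transcription of Algorithm 1: the Jaccard similarity over sets of rated items, the function names, and the parameters `n_neighbors` and `n_samples` are assumptions made only for this example.

```python
import random


def jaccard(a, b):
    """Similarity of two users measured on the sets of items they rated."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nss_negatives(rated, user, all_items, n_neighbors, n_samples, rng):
    """Sample negatives for `user` from items that no neighbor (nor the user) rated."""
    # Rank the other users by similarity and keep the closest ones as neighbors.
    others = [v for v in rated if v != user]
    others.sort(key=lambda v: jaccard(rated[user], rated[v]), reverse=True)
    neighbors = others[:n_neighbors]
    # Items rated by the user or by any neighbor are excluded from the candidates.
    covered = set(rated[user])
    for v in neighbors:
        covered |= rated[v]
    candidates = sorted(set(all_items) - covered)
    return rng.sample(candidates, min(n_samples, len(candidates)))


rated = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "c"},
    "u4": {"e", "f"},
}
items = ["a", "b", "c", "d", "e", "f", "g"]
negatives = nss_negatives(rated, "u1", items, n_neighbors=2, n_samples=2,
                          rng=random.Random(0))
print(negatives)  # two items drawn from e, f, g
```

In this toy data the neighbors of `u1` are `u3` and `u2`, so item `d`, which a neighbor rated, is never sampled as a negative, whereas a purely random sampler could pick it.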