Mathematical Problems in Engineering

Volume 2016 (2016), Article ID 9535808, 10 pages

http://dx.doi.org/10.1155/2016/9535808

## A Probability-Based Hybrid User Model for Recommendation System

^{1}School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China

^{2}Beijing Institute of Astronautic System Engineering, Beijing 310027, China

Received 13 July 2015; Revised 16 December 2015; Accepted 20 December 2015

Academic Editor: Matteo Gaeta

Copyright © 2016 Jia Hao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

With the rapid development of information and communication technology, the amount of available information and knowledge is increasing exponentially, which causes the well-known information overload phenomenon. This problem is more serious in product design corporations because over half of the valuable design time is consumed by knowledge acquisition, which greatly extends the design cycle and weakens competitiveness. Therefore, recommender systems have become very important in the domain of product design. This research presents a probability-based hybrid user model, which combines collaborative filtering and content-based filtering. The hybrid model utilizes user ratings and item topics or classes, both of which are available in the domain of product design, to predict knowledge requirements. A comprehensive analysis of the experimental results shows that the proposed method achieves better performance in most of the parameter settings. This work contributes a probability-based method for implementing recommender systems when only user ratings and item topics are available.

#### 1. Introduction

Many researchers believe that comprehensive utilization of accumulated knowledge is vital to maintaining a competitive advantage in today’s knowledge-based economy, particularly within research and development organizations [1–3]. According to statistics, over 90% of product design is variant design [4], which means that new requirements can be satisfied by revising accumulated design knowledge such as solution reports, 3D models, and others [5]. Therefore, the acquisition and utilization of design knowledge are of great importance for completing design tasks. However, knowledge acquisition usually takes up over half of the design time, which significantly reduces design efficiency and extends the design cycle. This is caused by the separation between design knowledge and designers: designers inevitably have to ask for or search for the required information and design knowledge. Therefore, a recommender system is urgently needed to decrease the time spent on design knowledge acquisition.

To manage design knowledge, a number of design knowledge management systems, such as design experience knowledge bases, 3D model bases, and standard parts bases, are deployed in many organizations. These knowledge management systems have recorded the information needed for implementing a recommender system. On one hand, in most knowledge management systems, a set of carefully predefined knowledge topics or classes is used to organize the design knowledge in the knowledge base, the most commonly used being the product structure (Bill of Material (BOM)). On the other hand, many knowledge management systems record the ratings that designers have given to the design knowledge. Before this research work, we developed a recommender system for designers based on the method by Al-Shamri and Bharadwaj [6], which combines user ratings and product topics, and deployed the system in a practical environment. After a period of operation, a requirement arose to further improve the performance of the recommender system. Therefore, this research work is an attempt to improve the performance of recommender systems that are based on the topics and ratings of design knowledge. In this work, we present a probability-based hybrid user model and, based on it, propose the corresponding similarity calculation method and recommendation method. The experimental results show that this model obtains better performance in most parameter settings. The main contributions of this research include (1) combining the ratings and topics by a simple probability formula and (2) providing a new choice for implementing a recommender system when only ratings and topics are available.

The remainder of this paper is structured as follows. In Section 2, the related techniques are reviewed and analyzed. Section 3 presents the proposed method for constructing the probability-based hybrid recommendation model. Section 4 gives a comprehensive experimental analysis. Finally, Section 5 concludes the paper and discusses future work.

#### 2. Related Works

In modern enterprises, tremendous amounts of data and information are stored, and more is generated than ever before. One side effect for the staff is information overload, which means that the staff are provided with more information than they can process efficiently [7–9]. Recommender systems have emerged as a research area in response to this problem [10]. According to [7], there are four kinds of filtering techniques that recommender systems may leverage: the content-based method (CB) [11], the collaborative filtering method (CF) [12], the knowledge-based method (KB) [13], and the hybrid method [14]. Each method has its own advantages and disadvantages; for example, the CF method has the advantage of being independent of the items’ content and the disadvantage of suffering from sparse rating data. To overcome the disadvantages and combine the advantages of different methods, the hybrid recommender system was proposed [15]. A hybrid recommender system combines multiple recommendation techniques to produce better performance [7, 16]. According to [15], there are seven strategies for combining different recommender techniques: weighted, mixed, switching, feature combination, feature augmentation, cascade, and metalevel.

In recent years, CF has become one of the most widely implemented recommender methods [6, 17, 18]. Researchers in the area tend to combine it with the CB method to achieve high scalability while maintaining relatively high prediction accuracy [6, 10, 19]. For example, Rastin and Zolghadri Jahromi [20] proposed a hybrid method that integrates collaborative filtering and content-based methods. They consider two users to be similar if their ratings on items with similar context are similar. Yao et al. [21] built a hybrid user model combining a collaborative filter and a content-based filter to dynamically recommend Web services. Ronen et al. [22] presented a framework for automatically selecting informative content-based features; this framework is independent of the type of recommender system, which means that the method generalizes well to different recommender systems. Barragáns-Martínez et al. [23] developed a Web 2.0 TV recommendation system named queveo.tv. They proposed a hybrid method that combines content-filtering techniques with collaborative filtering, and social network information is also integrated to improve recommendation performance. Berkovsky et al. [24] proposed a method to merge user models from different systems and use the merged user model to make recommendations. The same authors transformed this merged user model from a collaborative filtering system to a content-based recommendation system [25]. Degemmis et al. [26] proposed a content-collaborative hybrid recommender that computes similarities between users based on their content-based profiles instead of comparing their rating styles, so that the content-based user profiles play a key role in the hybrid recommender. Many of the existing studies rely on the description content of the items, which is unavailable in many situations, especially in the product design domain.

When the description content of the items is unavailable, knowledge related to users or items can be leveraged for the recommender system. Ontology-based or semantic-driven recommender systems form a research branch that utilizes knowledge about items for recommendation [7, 16]. In these methods, an ontology is used to model a large amount of related knowledge for the recommender system [27]. If knowledge related to users or items can be obtained, this approach can improve the performance of recommender systems. Moreno et al. [28] developed a Web-based system that provides personalized recommendations of touristic activities in the region of Tarragona, in which an ontology is used to classify and label the touristic activities for a further reasoning process.

The above-mentioned methods do not fit the situation we are facing well. On one hand, most of the hybrid methods that combine content-based and collaborative filtering require description content, which is unavailable in our problem. On the other hand, we do not have complicated knowledge about items apart from the predefined simple topics or classes, and ontology construction is itself an extremely complex process that requires many experts to work collaboratively. Therefore, building an ontology for the recommender system is not a priority in this research work. In the literature, Al-Shamri and Bharadwaj [6] developed an approach that uses the ratings and topics to build a hybrid model, which meets our requirement and situation well. Based on their work, we develop a new probability-based method, and the experimental results show that the newly developed method is better in most situations. We therefore contribute a method to the research community that can be used when only the topics or classes and the ratings are available.

#### 3. Proposed Method

##### 3.1. Symbols Definition

We first define the symbols used in the following sections. Let $U$ be the set of all users, let $I$ be the set of all items, and let $T$ be the set of all topics. We define $R$ as the two-dimensional user-item rating matrix, in which the element $r_{u,i}$ represents the rating of user $u$ on item $i$. The value of $r_{u,i}$ is either on a numerical ordinal scale (e.g., 1 to 5) or zero, which indicates an unknown rating (not rated). Here, $S$ is defined as the set of possible rating scores. The matrix $R$ can be read either by row or by column. The row vector $R_u$ represents the ratings of user $u$ on all items, and $I_u$ is defined as the set of items rated by user $u$. The column vector $R^i$ represents the ratings of item $i$ by all users. We define $IT$ as a two-dimensional item-topic matrix, in which the element $it_{i,t}$ is set to 1 when item $i$ belongs to topic $t$ and 0 otherwise. The row vector $IT_i$ represents the topics to which item $i$ belongs. One item may belong to more than one topic, which means $\sum_{t \in T} it_{i,t} \geq 1$. The following list summarizes the symbols used in this paper:

- $I$: the set of items.
- $|I|$: the number of items.
- $U$: the set of users.
- $|U|$: the number of users.
- $T$: the set of topics.
- $|T|$: the number of topics.
- $S$: the set of scores.
- $|S|$: the number of scores.
- $R$: the rating matrix.
- $R_u$: the row vector of $R$.
- $R^i$: the column vector of $R$.
- $IT$: the item and topic matrix.
- $UT$: the user and topic matrix.

##### 3.2. Database Validation

In many situations, items are classified into one or more specific topics according to their content; the number of topics is normally far less than the number of items. In this paper, we have attempted to formulate a hybrid user model for item recommendation by combining the rating data and the topic data. Before we develop the hybrid user model, one question should be addressed first: whether the topics could be used to express the interests of the users.

For a given database $D$, which includes the rating matrix $R$ and the item-topic matrix $IT$, we cannot be sure that user interests can be expressed by the item topics. An obvious example is that when most of the items belong to a small number of topics, the topics are unable to distinguish user interests. Therefore, we need to analyze and confirm whether the topics are capable of expressing user interests before building the user model.

We determine the relationship between users and topics using the user-topic matrix $UT$; each element $ut_{u,t}$ of the matrix indicates the total score that user $u$ has given to topic $t$. The row vector $UT_u$ represents the total scores that user $u$ has given to every topic. For a specific user $u$, we define $C_u$ as the metric to measure the coverage of the user’s interests (formula (1)), where $T_u$ is the set of topics that the user may be interested in (those whose total score is higher than that of the other topics). The set $T_u$ meets the condition in formula (2), where $\theta$ is the threshold value that determines how many interested topics the user has. Based on that, we define $C$ as the metric to measure the topics’ capability of expressing the interests of all users (formula (3)). When most of the $C_u$ and $C$ are small compared with $|T|$, the number of topics, we infer that the users’ interests are concentrated on a small number of topics. In other words, the topics are able to distinguish the users’ interests.
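The user-topic totals described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `user_topic_totals` and `interest_topic_count` are hypothetical, and the selection rule used here (take the top-scoring topics until their share of the user's total score reaches a threshold `theta`) is an assumed reading of the validation idea, not necessarily the exact condition used in the paper.

```python
def user_topic_totals(ratings, item_topics, num_topics):
    """Sum each user's ratings per topic.

    ratings: {user: {item: score}}, unrated items are simply absent.
    item_topics: {item: [topic indices the item belongs to]}.
    Returns {user: [total score for topic 0, topic 1, ...]}.
    """
    totals = {}
    for user, rated in ratings.items():
        row = [0] * num_topics
        for item, score in rated.items():
            for t in item_topics[item]:
                row[t] += score
        totals[user] = row
    return totals


def interest_topic_count(row, theta):
    """ASSUMED rule: number of top-scoring topics needed so that their
    cumulative share of the user's total score reaches theta."""
    total = sum(row)
    if total == 0:
        return 0
    covered, count = 0, 0
    for value in sorted(row, reverse=True):
        covered += value
        count += 1
        if covered >= theta * total:
            break
    return count


# Toy data (hypothetical): three items, three topics, two users.
ratings = {"u1": {"a": 5, "b": 4, "c": 1}, "u2": {"a": 2, "c": 5}}
item_topics = {"a": [0], "b": [0, 1], "c": [2]}

ut = user_topic_totals(ratings, item_topics, num_topics=3)
counts = {u: interest_topic_count(row, theta=0.6) for u, row in ut.items()}
mean_count = sum(counts.values()) / len(counts)
```

If the per-user counts and their mean are small relative to the number of topics, the topics separate user interests reasonably well in the sense discussed above.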

##### 3.3. User Model Construction

In the traditional CF method, the user model is simply represented by a rating vector in which all ratings that the user has given to items are recorded. Similarly, we could use a row vector of the user-topic matrix $UT$ to represent the user model, which records, for every topic, the sum of the ratings that the user has given to all items belonging to that topic. However, this method is unreasonable because the sums of ratings may be very close although the individual ratings are different. For example, user_{1} rates 5 items that belong to topic_{1} and the scores are 1, 4, 5, 1, and 1. User_{2} also rates 5 items that belong to topic_{1} and the scores are 5, 1, 1, 3, and 2. Although the sum of the ratings is 12 for both users, we cannot conclude that the two users have similar interests.

Based on the above analysis, we adopt probability to construct the user model. The basic idea is that two similar users should have similar probabilities of giving a specific score on every topic. For example, suppose the probabilities that user_{1} gives scores 1, 2, 3, 4, and 5 to items belonging to topic_{1} are 10%, 10%, 20%, 40%, and 20%, respectively; if user_{2} and user_{1} have similar interests, user_{2} should have similar probabilities of giving each score to items belonging to topic_{1}.

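In this work, a matrix is used to represent the user model ($M_u$), as shown in formula (4). $M_u$ is a $|T| \times |S|$ matrix in which the row vectors indicate the probability that the user gives a specific score on each topic:

$$M_u = \begin{bmatrix} p_{u,t_1,s_1} & \cdots & p_{u,t_1,s_{|S|}} \\ \vdots & \ddots & \vdots \\ p_{u,t_{|T|},s_1} & \cdots & p_{u,t_{|T|},s_{|S|}} \end{bmatrix}, \tag{4}$$

where $p_{u,t,s}$ is the probability that user $u$ gives score $s$ to topic $t$. The value of $p_{u,t,s}$ is calculated according to the Bayes formula:

$$p_{u,t,s} = P(s \mid t, u)\, P(t \mid u), \tag{5}$$

where $P(s \mid t, u)$ is the probability that user $u$ gives score $s$ when the item belongs to topic $t$, and $P(t \mid u)$ is the probability that an item rated by user $u$ belongs to topic $t$. Both $P(s \mid t, u)$ and $P(t \mid u)$ can be determined from the rating matrix $R$ and the item-topic matrix $IT$.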
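As a minimal sketch of how such a user model could be estimated from counts, the Python below builds one topic-by-score probability matrix for a single user. The function name `build_user_model` is hypothetical, and the estimators are assumptions: the score probability given a topic is taken as a relative frequency among the user's rated items in that topic, and the topic probability as the fraction of the user's rated items that belong to the topic.

```python
def build_user_model(user_ratings, item_topics, num_topics, scores=(1, 2, 3, 4, 5)):
    """Return a num_topics x len(scores) matrix (list of lists).

    Entry [t][k] estimates P(score = scores[k] | topic t, user) * P(topic t | user).
    user_ratings: {item: score} for ONE user.
    item_topics: {item: [topic indices]}.
    Both probabilities are ASSUMED to be simple relative frequencies.
    """
    counts = [[0] * len(scores) for _ in range(num_topics)]
    for item, score in user_ratings.items():
        for t in item_topics[item]:
            counts[t][scores.index(score)] += 1

    n_rated = len(user_ratings)
    model = []
    for t in range(num_topics):
        n_t = sum(counts[t])                      # rated items in topic t
        p_t = n_t / n_rated if n_rated else 0.0   # P(t | u)
        row = [(c / n_t) * p_t if n_t else 0.0 for c in counts[t]]
        model.append(row)
    return model


# The two users from the example in the text: same rating sum (12),
# different score distributions on a single topic.
item_topics = {f"i{k}": [0] for k in range(1, 6)}
user1 = dict(zip(item_topics, [1, 4, 5, 1, 1]))
user2 = dict(zip(item_topics, [5, 1, 1, 3, 2]))

model1 = build_user_model(user1, item_topics, num_topics=1)
model2 = build_user_model(user2, item_topics, num_topics=1)
```

With these toy inputs the two users receive clearly different rows even though their rating sums are identical, which is the motivation for a probability-based model rather than a sum-based one.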

##### 3.4. Neighborhood Calculation

In this research work, we use CF as the method to make predictions. Therefore, the calculation of the neighborhood is a critical step. The size of the neighborhood can be fixed or floating [6]. A fixed neighborhood selects a fixed number of users that have the highest similarities with the active user, while a floating neighborhood is determined by a similarity threshold. We use a fixed neighborhood size in the experimental section.
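The two neighborhood strategies can be illustrated with a short Python sketch. The function names and the toy similarity values are hypothetical; the similarity scores are assumed to have been computed beforehand by whatever measure is in use.

```python
def fixed_neighborhood(similarities, k):
    """Pick the k users most similar to the active user.

    similarities: {user: similarity to the active user}.
    """
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return [user for user, _ in ranked[:k]]


def floating_neighborhood(similarities, threshold):
    """Pick every user whose similarity reaches the threshold."""
    return [user for user, sim in similarities.items() if sim >= threshold]


# Hypothetical similarities of four candidate neighbors to one active user.
sims = {"u2": 0.8, "u3": 0.5, "u4": 0.9, "u5": 0.1}

top2 = fixed_neighborhood(sims, k=2)
above_half = floating_neighborhood(sims, threshold=0.5)
```

A fixed size keeps the cost of prediction predictable, whereas a threshold lets the neighborhood grow or shrink with how many genuinely similar users exist.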

The essential problem in calculating the neighborhood is the measurement of similarity between two user models. The similarity measurement depends heavily on the user model, which means that different user models require corresponding similarity measurements. In this work, the user model is represented by a probability matrix, which is very different from the traditional method of representing the user model as a rating vector. When the user model is a rating vector, the similarity can be computed easily by many existing methods, such as Euclidean distance, Manhattan distance, and cosine similarity.

In this work, the basic idea of measuring the similarity of two user models is to count the topics in which both users are interested, as shown in formula (6). For example, if user_{1} and user_{2} share five topics while user_{1} and user_{3} share three topics, we consider the similarity between user_{1} and user_{2} to be higher than the similarity between user_{1} and user_{3}:

$$sim(u, v) = \frac{|T^{+}_u \cap T^{+}_v|}{|T^{+}_u \cup T^{+}_v|}, \tag{6}$$

where $sim(u, v)$ is the similarity of user $u$ and user $v$; $T^{+}_u$ represents the interest topics of user $u$; and $|T^{+}_u \cup T^{+}_v|$ is the total number of interest topics of $u$ and $v$. The set $T^{+}_u$ is computed by the following formula:

$$T^{+}_u = \{\, t \in T : PP_{u,t} \geq \delta \,\}, \tag{7}$$

where $\delta$ is a threshold used to determine whether a specific topic is an interest topic. When $\delta$ is high, it is more difficult for a topic to be selected as an interest topic. $PP_{u,t}$ is the sum of the positive probabilities of user $u$ on topic $t$.

Here, a positive probability is the probability that the user gives a high score on a topic, and it reflects the degree to which the user may be interested in that topic. For example, if the probability of a high score on topic $t$ is high, we may infer that the user is interested in topic $t$. On the other hand, a passive probability is the probability that the user gives a low score on a topic, and it reflects the probability that the user dislikes that topic. For example, if the probability of a low score on topic $t$ is high, we believe that the user dislikes topic $t$. In this research work, only the positive probabilities are considered when calculating the similarity of two user models. In general, the high scores range from 3 to 5 when the scores range from 1 to 5.
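The similarity idea of this subsection can be sketched as follows. This is an illustrative reading rather than a definitive implementation: the names `interest_topics` and `similarity` are hypothetical, a topic is assumed to count as an interest topic when its summed positive probability is at least the threshold, and the shared-topic count is assumed to be normalized by the size of the union of both users' interest topics (a Jaccard-style ratio).

```python
def interest_topics(model, delta, scores=(1, 2, 3, 4, 5), positive_scores=(3, 4, 5)):
    """Topics whose summed positive probability is at least delta.

    model: topic-by-score probability matrix (list of lists) for one user.
    """
    cols = [scores.index(s) for s in positive_scores]
    return {t for t, row in enumerate(model) if sum(row[c] for c in cols) >= delta}


def similarity(model_u, model_v, delta):
    """ASSUMED normalization: shared interest topics over the union."""
    a = interest_topics(model_u, delta)
    b = interest_topics(model_v, delta)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical user models over three topics and scores 1..5.
model_u = [
    [0.0, 0.0, 0.1, 0.2, 0.1],  # topic 0: mostly high scores
    [0.3, 0.1, 0.0, 0.0, 0.0],  # topic 1: only low scores
    [0.0, 0.0, 0.0, 0.1, 0.1],  # topic 2: high scores
]
model_v = [
    [0.1, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.2, 0.2, 0.0],
    [0.1, 0.1, 0.0, 0.0, 0.0],
]

topics_u = interest_topics(model_u, delta=0.15)
topics_v = interest_topics(model_v, delta=0.15)
sim_uv = similarity(model_u, model_v, delta=0.15)
```

Raising `delta` shrinks both interest sets, so the measure becomes stricter about what counts as a shared interest.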

##### 3.5. Recommendation

The last step is to produce the recommendations, that is, a list of the top-$N$ items that would most interest the user. The primary task of recommendation is to predict the scores that the active user might give to unrated items according to the neighbor set; the $N$ items with the highest predicted scores form the recommendation. In classical CF, the score can be predicted using the average of the neighbors’ ratings. However, the average value ignores the different rating habits of different users. A widely used alternative is the weighted sum [29, 30], which is also known as Resnick’s prediction formula [31]. Therefore, in this work, Resnick’s prediction formula is adopted for all compared methods to validate the presented method.
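For reference, the commonly cited form of Resnick's prediction adds a similarity-weighted average of the neighbors' mean-centered ratings to the active user's own mean rating. The sketch below follows that standard form; the function names, the dictionary layout, and the fallback to the user's mean when no neighbor has rated the item are assumptions made for illustration.

```python
def mean_rating(user_ratings):
    """Average of one user's known ratings ({item: score})."""
    return sum(user_ratings.values()) / len(user_ratings)


def resnick_predict(active, item, ratings, similarities):
    """Predict the active user's score for an item.

    ratings: {user: {item: score}}.
    similarities: {neighbor: similarity to the active user}.
    Falls back to the active user's mean if no neighbor rated the item.
    """
    base = mean_rating(ratings[active])
    numerator = 0.0
    denominator = 0.0
    for neighbor, sim in similarities.items():
        if item in ratings[neighbor]:
            deviation = ratings[neighbor][item] - mean_rating(ratings[neighbor])
            numerator += sim * deviation
            denominator += abs(sim)
    if denominator == 0:
        return base
    return base + numerator / denominator


# Hypothetical ratings and neighbor similarities.
ratings = {
    "a": {"x": 4, "y": 2},
    "b": {"x": 5, "y": 3, "z": 4},
    "c": {"x": 2, "z": 1},
}
sims_to_a = {"b": 0.5, "c": 0.25}

pred_z = resnick_predict("a", "z", ratings, sims_to_a)
pred_unseen = resnick_predict("a", "w", ratings, sims_to_a)
```

Centering each neighbor's rating on that neighbor's own mean is what compensates for users who systematically rate high or low.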

#### 4. Experiments

##### 4.1. Evaluation Database and Performance Metrics

The research question in this work arose from the product design domain, and the solution discussed in this paper aims to deal with the information overload problem in that domain. The best way to validate the method would therefore be to use a database from the product design domain. However, there is still no commonly accepted database for validating recommender systems in this domain. Therefore, we turned to a widely used database that has characteristics similar to those of the problem in the product design domain. Another reason for using a widely used database is that we believe the proposed method has the potential to deal with problems in many different domains.

After considering many different databases, MovieLens was selected because the ratings and topics of items are recorded properly. MovieLens (http://www.movielens.umn.edu) is a widely used evaluation database. The database consists of 100,000 ratings, 943 users, and 1682 items. All ratings use the following numerical scale: (1) bad, (2) average, (3) good, (4) very good, and (5) excellent. Each user in this database has rated at least 20 items. The dataset also contains the genre (topic) features of the items that our method needs. A single item can belong to one or more movie genres, which include action, adventure, animation, children’s, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western. To avoid the influence of cold start items and cold start users, we extracted a denser database from the original database by considering only items that had been rated more than 20 times. Table 1 details the statistics of the extracted database.
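The extraction of the denser subset can be sketched as a simple filter over rating triples. The function name `dense_subset` and the in-memory list of `(user, item, score)` tuples are assumptions for illustration; loading the actual MovieLens files is not shown.

```python
from collections import Counter


def dense_subset(ratings, min_count=20):
    """Keep only ratings of items that were rated MORE than min_count times.

    ratings: list of (user, item, score) tuples.
    """
    counts = Counter(item for _, item, _ in ratings)
    return [row for row in ratings if counts[row[1]] > min_count]


# Toy data (hypothetical): one item with 21 ratings, one with exactly 20.
toy = [(u, "popular", 3) for u in range(21)] + [(u, "rare", 4) for u in range(20)]
kept = dense_subset(toy, min_count=20)
kept_items = {item for _, item, _ in kept}
```

Because the condition is strictly "more than 20", an item with exactly 20 ratings is dropped in this sketch.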