Abstract

In social networks, similar users are assumed to prefer similar items, so searching for the similar users of a target user plays an important role in most collaborative filtering methods. Existing collaborative filtering methods use user ratings of items to search for similar users. Nowadays, abundant social information is produced on the Internet, such as user profiles, social relationships, behaviors, and interests. Using user ratings of items alone is not sufficient to recommend wanted items or to find similar users. In this paper, we propose a new collaborative filtering method based on social information fusion. Our method first uses social information fusion to search for similar users and then updates the user ratings of items for recommendation using those similar users. Experiments show that our method outperforms existing methods based on user ratings of items and that using social information fusion to search for similar users is an effective approach for collaborative filtering in recommender systems.

1. Introduction

Importing social information into collaborative filtering has proved to be an effective way to improve the performance of recommender systems and alleviate the data sparseness problem caused by cold start [13]. Take microblogging, the Chinese Twitter-style service, as an example: most existing collaborative filtering methods use user ratings of items (microblogs) to search for similar users. When the total number of microblogs far exceeds the number of microblogs that have been rated, data sparseness arises, which makes searching for similar users based only on microblogs (items) unreliable.

To address these issues, we propose a new collaborative filtering method that uses social information fusion to search for similar users and then updates user ratings of items using those similar users, which alleviates the data sparsity problem. Finally, we recommend items based on the updated user ratings of items.

The remainder of this paper is organized as follows. In Section 2, we discuss the related work. Then we formulate the social information fusion in Section 3. The process of recommendation is given in Section 4. Finally, we show our experimental results in Section 5 and draw a conclusion in Section 6.

2. Related Work

The user-based collaborative filtering recommendation algorithm assumes that similar users (neighbors) share similar interests [4]. It first finds the neighbors of the target user and then predicts the target user's rating of an item from the neighbors' ratings, thereby completing the recommendation. To find neighbors, the traditional method treats each user's ratings of items as a high-dimensional vector and uses cosine similarity to select the users most similar to the target user as neighbors. This method effectively uses the feedback of other users. However, in real environments the number of items is far greater than the number of users, and the number of items a user can browse is limited, so the user-item rating matrix is sparse, which makes it difficult to find similar users. With such sparse data, the small number of available ratings is not enough to support recommendation over a large item set.
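As an illustration, the traditional cosine-similarity neighbor search described above can be sketched in a few lines of Python; the data layout and function names are illustrative assumptions, not taken from the literature discussed here:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two equal-length rating vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(target, ratings, k):
    # Rank all other users by cosine similarity to the target's rating vector.
    scores = [(user, cosine_similarity(ratings[target], vec))
              for user, vec in ratings.items() if user != target]
    scores.sort(key=lambda s: s[1], reverse=True)
    return scores[:k]

ratings = {
    "u1": [1, 0, 1, 0],
    "u2": [1, 0, 1, 1],
    "u3": [0, 1, 0, 0],
}
print(nearest_neighbors("u1", ratings, 1))  # u2 is the closest neighbor
```

Note that when the rating vectors are mostly zeros, as the paragraph above argues, these similarity scores become unreliable, which motivates the social information fusion of Section 3.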

To alleviate this problem, literature [5] proposes a simple recommendation algorithm that fully exploits the similarity information among users and items and the intrinsic structural information of the user-item matrix. This method constructs a new representation that preserves affinity and structure information in the user-item rating matrix and then performs the recommendation task; experimental results show its effectiveness. Recently, linear sparse and low-rank representations of the user-item matrix have been applied to produce Top-N recommendations. Literature [6] addresses the limitations of the nuclear norm by mimicking the behavior of the true rank function and substantially improves recommendation quality. Literature [7] reduces the item space dimension based on singular value decomposition (SVD), but dimension reduction also loses information, which makes the effect difficult to guarantee. Many scholars have begun to study the similarity between users from user attributes and social attributes. Literature [8] adopts deep search to find trust relationships among users and regards people with strong trust relationships as similar users. Literature [9] presents Matchmaker, a collaborative filtering friend recommendation system based on personality matching, with good results. Literature [10] studies link recommendation in weblogs and similar social networks and proposes an approach combining collaborative recommendation using the link structure of a social network with content-based recommendation using mutually declared interests. Literature [11] utilizes users' interaction relationships to measure relationship strength. Literature [12] fuses users' attribute similarity and interaction relationships for user similarity calculation. In addition, the analysis of user relationships is also often used for friend recommendation [13].

Our study finds that every piece of social information reflects an aspect of the user. This paper proposes a novel method based on social information fusion instead of the method based on user ratings of items to search neighbors.

3. Fusion of Social Information

Based on the experimental data provided by KDD CUP 2012 track1, this paper divides a user u's social information into personal description information (Profile), relationship information (Follow), behavior information (Action), and interest information (Interest). Therefore, the social information of user u is expressed as

Info(u) = {Profile(u), Follow(u), Action(u), Interest(u)}

The similarity between users u and v is expressed as the weighted fusion

sim(u, v) = w_1·sim_Profile(u, v) + w_2·sim_Follow(u, v) + w_3·sim_Action(u, v) + w_4·sim_Interest(u, v)

where the weights w_i are determined experimentally in Section 5.2.

The description of each piece of information and its similarity calculation are as follows.
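As a sketch of this fusion, assuming a simple weighted sum of the four per-type similarities (the weight values here are illustrative, not the experimentally determined ones):

```python
def fused_similarity(sims, weights):
    # sims and weights are keyed by information type; the weights are
    # assumed to form a convex combination (sum to 1).
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * sims[k] for k in sims)

# Illustrative per-type similarities and weights for one user pair.
sims = {"profile": 0.6, "follow": 0.2, "action": 0.8, "interest": 0.5}
weights = {"profile": 0.2, "follow": 0.3, "action": 0.3, "interest": 0.2}
print(fused_similarity(sims, weights))
```

The per-type similarities that feed this fusion are defined in Sections 3.1 through 3.4.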

3.1. Similarity of Personal Description Information

Personal description information includes gender (sex), age (age), and tags (tag). The similarity of personal description information is defined as

sim_Profile(u, v) = a_1·sim_sex(u, v) + a_2·sim_age(u, v) + a_3·sim_tag(u, v)

where the coefficients a_i are fusion weights determined experimentally.

(1) Gender similarity sim_sex. Gender is represented by numbers: 1 means male, 2 means female, and 0 means unknown. The similarity of gender is defined as follows:

sim_sex(u, v) = 1 if sex_u = sex_v and the gender is known (nonzero); otherwise sim_sex(u, v) = 0.

(2) Age similarity sim_age. Age is recorded as year of birth, so we first convert a user's birth year into an actual age. We also consider that people's cognitive ability changes with age: for a younger pair and an older pair with the same age difference, the cognitive gap within the younger pair is usually smaller than within the older pair. The similarity of age is defined accordingly, as follows:

(3) Tag similarity sim_tag. Tags are expressed in words. Since different users have different tags, we use the proportion of the two users' common tags among their total tags as the evaluation. The formula is as follows:

sim_tag(u, v) = |T_u ∩ T_v| / |T_u ∪ T_v|

where T_u denotes the tag set of user u.
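A minimal sketch of the gender and tag similarity calculations above; the treatment of unknown gender as never-similar is an assumed convention:

```python
def sex_similarity(a, b):
    # 1 when both genders are known (nonzero) and equal, 0 otherwise
    # (an assumed convention for the 0 = unknown code).
    return 1.0 if a == b and a != 0 else 0.0

def tag_similarity(tags_a, tags_b):
    # Proportion of shared tags over all distinct tags (Jaccard coefficient).
    a, b = set(tags_a), set(tags_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

print(sex_similarity(1, 1))                                   # 1.0
print(tag_similarity({"music", "film"}, {"music", "sport"}))  # 1 common of 3 total
```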

3.2. Similarity of Relationship Information

Relationship information contains followee information (the users one follows) and follower information (fans). Relationship information similarity is expressed as

sim_Follow(u, v) = b_1·sim_followee(u, v) + b_2·sim_follower(u, v)

where b_1 and b_2 are fusion weights.

Followee and follower information are mutual: if user u follows user v, then v is one of u's followees and, at the same time, u is one of v's followers. So followee and follower similarities are calculated in the same way. Because of the large number of users in the data, high-dimensional cosine similarity calculation should be avoided, so we use the proportion of the two users' common followees (or followers) among their total as the evaluation. The formula is as follows:

sim_followee(u, v) = |F_u ∩ F_v| / |F_u ∪ F_v|

where F_u denotes the followee (respectively, follower) set of user u.

3.3. Similarity of Behavior Information

Behavior information includes comment counts (Comment), mention frequency (At), and forwarding frequency (Retweet). The similarity of behavior information is expressed as follows:

sim_Action(u, v) = c_1·sim_At(u, v) + c_2·sim_Retweet(u, v) + c_3·sim_Comment(u, v)

where the c_i are fusion weights with c_1 + c_2 + c_3 = 1.

The above three types of data have the same form, so they are evaluated in the same way. Taking comment counts as an example, we create a comment matrix whose dimension is the number of users: each row records the number of comments a user gave to every other user, and each column records the number of comments a user received. User u's comments to other users are recorded as the vector C_u. The similarity of two users is calculated using cosine similarity; the formula is as follows:

sim_Comment(u, v) = (C_u · C_v) / (‖C_u‖ ‖C_v‖)
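A small illustration of building the comment matrix from interaction records and comparing two rows with cosine similarity; the record format (from-user, to-user, count) is an assumption:

```python
import math

def comment_vectors(comments, users):
    # comments: list of (from_user, to_user, count) records.
    # Each user's row holds the comments they gave to every other user.
    index = {u: i for i, u in enumerate(users)}
    vecs = {u: [0.0] * len(users) for u in users}
    for src, dst, n in comments:
        vecs[src][index[dst]] += n
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

users = ["a", "b", "c"]
logs = [("a", "c", 3), ("b", "c", 1), ("a", "b", 1)]
vecs = comment_vectors(logs, users)
print(cosine(vecs["a"], vecs["b"]))
```

The same construction applies to mention and forwarding counts, since the three data types share this form.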

3.4. Similarity of Interest Information

The experimental data provide keywords and the corresponding keyword weights, both extracted from users' microblog comments, and a classification of all keywords. From this information, we calculate the similarity of users' interest information.

As shown in Figure 1, each keyword is included in one or more categories, so the weight of every keyword can be reflected in the classification. We establish an interest classification matrix whose dimension is the number of classifications; every user then obtains a corresponding interest classification vector through the calculation below. The similarity of interest is calculated using cosine similarity.

The calculation process is as follows.

Step 1. Set the keyword set as K, the user-keyword weights as w(u, k), and the classification collection as C. All of these are provided directly by the experimental data.

Step 2. Set the user-classification weight to be calculated as W(u, c); it is computed as W(u, c) = Σ_{k ∈ c} w(u, k) / n_c, where n_c is the number of keywords included in classification c.

Step 3. Regard the vector I_u = (W(u, c_1), ..., W(u, c_m)) as user u's interest vector.
Then, the similarity of interest information between two users is calculated as follows:

sim_Interest(u, v) = (I_u · I_v) / (‖I_u‖ ‖I_v‖)
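The three steps above can be sketched as follows; normalizing each class's summed weight by the number of keywords the class contains follows one reading of Step 2 and should be treated as an assumption:

```python
import math

def interest_vector(keyword_weights, class_keywords):
    # keyword_weights: {keyword: weight} for one user.
    # class_keywords: {class id: set of keywords in that class}.
    # A class's entry is the user's summed keyword weight in the class,
    # divided by the number of keywords the class contains (assumed reading).
    vec = []
    for c in sorted(class_keywords):
        kws = class_keywords[c]
        total = sum(keyword_weights.get(k, 0.0) for k in kws)
        vec.append(total / len(kws) if kws else 0.0)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical keywords spread over three classes.
classes = {0: {"goal", "match"}, 1: {"match", "movie"}, 2: {"stock"}}
u = interest_vector({"goal": 2.0, "match": 1.0}, classes)
v = interest_vector({"match": 2.0, "stock": 1.0}, classes)
print(u, v, cosine(u, v))
```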

4. The Process of Recommendation

User-based collaborative filtering recommendation creates a rating matrix from user ratings of items and predicts the target user's possible ratings of unrated items according to that matrix. The recommendation sequence is then generated from the predicted ratings. The sparseness of the rating matrix in real data is one of the key factors affecting the accuracy of the whole recommendation. The matrix filling technique [14] offers one way to address the sparsity problem. For unrated content, early work set the value to the average of other users' ratings of the item [15], which does not fundamentally alleviate the sparseness of the rating matrix.

4.1. The Introduction

The experimental data record whether a user likes a target item, and each item contains several keyword attributes. In the data, "1" means like and "-1" means dislike. We use "1" or "-1" as the user's rating for an item and "0" for unrated items.

As shown in Figure 2, one user likes Items 1 and 2, and another user likes Item 3. The initial ratings are shown in Table 1.

The data in the matrix are sparse, and the two users have no common rated items, but there is a connection between the items they like. Taking Item 1 as an example, we divide the first user's rating for Item 1 into three equal parts according to the number of keywords in Item 1, so that word A, word B, and word D each occupy 1/3 of the rating. Apart from its link to Item 1, word D also appears in Item 3 and Item 4. The 1/3 share held by word D is therefore split equally in two and reflected onto keyword D of Item 3 and Item 4; that is, keyword D of Item 3 and of Item 4 each receives a 1/6 rating. Other items are processed in the same way. The processed ratings of the items are shown in Table 2.

This matrix filling process passes a rating through the implicit relationships between items to other related items, producing implicit ratings. At the same time, since many original ratings share the same value (1 or -1), personal preferences cannot otherwise be distinguished; after this process, they can be.

4.2. The Process

Regard the set of items as I and the set of keywords as W.

Create an array R to save the reflected (implicit) ratings.

Step 1. Select a rated item i from I (its rating r(i) is 1 or -1); extract its keyword set W_i from W; each keyword's share of the rating is r(i) / |W_i|.

Step 2. Select a keyword w from W_i; find the set I_w of the other items that contain w, with i ∉ I_w.

Step 3. If I_w is not empty, add the implicit rating r(i) / (|W_i| · |I_w|) to the entry of R corresponding to each element in I_w.

Step 4. If any keyword in W_i remains unprocessed, continue from Step 2.

Step 5. If any rated item in I remains unprocessed, continue from Step 1.

Step 6. Add the original ratings to R.

Finally, the values in R are the user's new ratings of items. Putting every user's R together, we get the new rating matrix.
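The filling procedure above can be sketched as follows, reproducing the 1/6 example of Section 4.1; the data layout and function name are illustrative assumptions:

```python
def propagate(ratings, item_words):
    # ratings: {(user, item): original rating in {1, -1}}.
    # item_words: {item: set of keywords the item contains}.
    # A rated item's score is split evenly over its keywords; each keyword's
    # share is then divided evenly among the other items containing that word.
    filled = {}
    for (user, item), r in ratings.items():
        words = item_words[item]
        share = r / len(words)
        for w in words:
            targets = [j for j, ws in item_words.items() if j != item and w in ws]
            for j in targets:
                key = (user, j)
                filled[key] = filled.get(key, 0.0) + share / len(targets)
    # Final step: add the original ratings on top of the implicit ratings.
    for key, r in ratings.items():
        filled[key] = filled.get(key, 0.0) + r
    return filled

# The worked example of Section 4.1: Item 1 (rated 1) contains words A, B, D;
# word D also appears in Items 3 and 4, which each receive 1/3 / 2 = 1/6.
item_words = {1: {"A", "B", "D"}, 3: {"D", "E"}, 4: {"C", "D"}}
filled = propagate({("u", 1): 1}, item_words)
print(filled)
```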

4.3. The Recommendation

According to the methods in Section 3, we get the neighboring user set N(u) of user u. Next, we process every user's ratings by the calculation method above to obtain the new ratings. The formula for user u's predicted rating of item i is as follows:

pred(u, i) = r̄_u + Σ_{v ∈ N(u)} sim(u, v) · (r(v, i) − r̄_v) / Σ_{v ∈ N(u)} |sim(u, v)|

where r̄_u is user u's average rating of items, r(v, i) is user v's rating for item i, and sim(u, v) is the similarity between users u and v. Finally, the items with the highest predicted ratings are taken as the recommendation set.
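A minimal sketch of this mean-centered weighted prediction; the neighbor data layout is an assumption:

```python
def predict(target_avg, neighbors, item):
    # neighbors: list of (similarity, neighbor_avg, {item: rating}) tuples.
    # Weighted deviation-from-mean prediction, normalized by total |similarity|.
    num = sum(s * (r.get(item, 0.0) - avg) for s, avg, r in neighbors)
    den = sum(abs(s) for s, _, _ in neighbors)
    return target_avg if den == 0 else target_avg + num / den

neighbors = [
    (0.9, 0.2, {"i1": 1.0}),   # a close neighbor who liked i1
    (0.4, 0.0, {"i1": -1.0}),  # a weaker neighbor who disliked i1
]
print(predict(0.1, neighbors, "i1"))
```

The closer neighbor's positive deviation dominates, so the predicted rating rises above the target user's average.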

5. Experiments

This section validates the ideas presented in this paper through two sets of experiments. The experimental content includes (1) the effect of the user's social information on finding similar users; (2) the collaborative filtering method based on the matrix filling algorithm.

5.1. The Experimental Data

The experimental data in this paper are derived from the data provided in the KDD CUP 2012 track1. The data were extracted from Tencent Weibo over a period of time and cover various kinds of users' social information. These include user information (gender, age, number of microblogs, and tags), relationship information, behavior information (forwarding information, mention information (@), and comment information), keyword information (keywords, weights, and classifications), and the items accepted and rejected by users together with their ratings. Since the data content is encrypted, direct data screening is inconvenient. We screened the data as follows.

Step 1. Randomly select 20 users whose age information is between 15 and 40 years from all users and put them in the queue.

Step 2. Dequeue a user from the queue, use the user as an experimental user, and put the friends he follows into the queue.

Step 3. Repeat the second step until the number of experimental users reaches the target number.
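The three screening steps above amount to a breadth-first (snowball) expansion along follow links; the data layout here is an illustrative assumption:

```python
from collections import deque

def snowball(seeds, follows, target):
    # follows: {user: [users they follow]}. Expand breadth-first from the
    # seed users until the experimental group reaches the target size.
    queue = deque(seeds)
    selected = []
    seen = set(seeds)
    while queue and len(selected) < target:
        u = queue.popleft()
        selected.append(u)               # Step 2: dequeue an experimental user
        for v in follows.get(u, []):
            if v not in seen:            # enqueue the friends they follow
                seen.add(v)
                queue.append(v)
    return selected

follows = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(snowball(["a"], follows, 3))  # ['a', 'b', 'c']
```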

After the above process is completed, we get 10001 experimental users. The related information includes 846077 item rating records, 821666 relationship records, 381208 behavior records, 16267430 keyword records, and 363 keyword classifications.

5.2. The Influence of the User’s Social Information to Similar Users

First, we pair the 10001 users, every two users forming a group, which gives 50,005,000 similarity computing groups. For each group, according to the calculation process described in this paper, age similarity, gender similarity, tag similarity, followee similarity, fan similarity, forwarding similarity, mention similarity, comment similarity, rating similarity, and interest similarity are calculated, as shown in Figures 3-6.

As can be seen from the data in Figure 3, the ratio of male to female in the experimental data is close to 1:1, and the age differences are moderate, but the tags differ considerably. From the data in Figure 4, fan information reflects the similarity between users less well than followee information does. As shown in Figure 5, forwarding similarity accounts for a large proportion of behavior similarity. As shown in Figure 6, rating information can also be regarded as a kind of user interest, but compared with the interest extracted from the keywords and their classifications in the experimental data, it does not reflect the similarity of interest between users well. This is because the number of microblogs is large and users cannot browse all the related content, so many ratings cannot be obtained. Therefore, analyzing user interests from keywords and their classifications within a user's limited page views can alleviate this problem.

According to the similarity information calculated above, we use personal description information, relationship information, behavior information, and interest information to select similar users and then find the fusion parameters within each piece of information. Similar users are evaluated against the user's own followers, as follows: select one user from the experimental users and calculate that user's similarity with every other user. The descending sequence of similarity values is taken as the user's similar-user sequence, and the user's own followers are used as the correct set. The similar-user sequence is cut at Top-N for N from 5 to 40, and, combined with the user set, the MAP (Mean Average Precision) is calculated [16]. The formula is as follows:

MAP = (1/M) · Σ_u (1/R_u(N)) · Σ_{i=1..R_u(N)} i / rank_i(u)

In this formula, M represents the number of experimental users; R_u(N) represents the number of elements of user u's correct set that appear in the Top-N similar users; Top-N denotes the first N entries of the similar-user sequence; and rank_i(u) represents the ranking of the i-th correct element in user u's similar-user sequence.
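One common reading of this MAP calculation can be sketched as follows; treat the exact averaging convention as an assumption:

```python
def average_precision(ranked, correct):
    # At the i-th correct hit (1-based), add i / rank of that hit,
    # then average over the number of hits found in the ranked list.
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in correct:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked similar-user list, correct set), one per user.
    return sum(average_precision(r, c) for r, c in runs) / len(runs)

runs = [(["b", "a", "c"], {"a", "c"}), (["x", "y"], {"y"})]
print(mean_average_precision(runs))
```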

Taking behavior information as an example, according to the fusion calculation method described in this paper, the coefficients of the three kinds of data included in behavior information are varied in the range 0.1 to 0.8, ensuring that each coefficient is greater than 0 and that the three coefficients sum to 1. When Top-N is 40, the average MAP value over all experimental users is shown in Figure 7.

The horizontal axis enumerates the coefficient combinations; from top to bottom, the digits are the coefficients of mention information, forwarding information, and comment information. As the graph shows, finding similar users based on behavior information is mainly affected by the similarity of forwarding information: each peak occurs when the forwarding coefficient takes a larger value, and each trough occurs when the forwarding coefficient is smallest.

The maximum MAP value is obtained when the value on the X-axis is "253", that is, coefficients of 0.2, 0.5, and 0.3 for mention, forwarding, and comment information, respectively. So the fusion formula for behavior information is expressed as

sim_Action(u, v) = 0.2·sim_At(u, v) + 0.5·sim_Retweet(u, v) + 0.3·sim_Comment(u, v)

In the same way, we give the weight of personal description information and relationship information, as shown in Table 3.

Next, we conduct experiments to find similar users according to the combined personal description information, relationship information, behavior information, and interest information, as well as according to rating information. In the experiments, Top-N is varied from 5 to 40, and the average MAP value of the experimental users is shown in Figure 8.

As can be seen from the figure, as Top-N grows, calculating similar users according to social information performs better than the other methods. The MAP value calculated from rating information is low, so it is not suitable to find similar users based only on item ratings. Behavior information performs better when Top-N is small.

The experimental data show that, because user ratings are sparse, implicit ratings alone are not enough to yield reasonable similar users; similar users found from relationship information and behavior information are better than those found from rating information. Following the fusion idea and a large amount of calculation, a fused similar-user formula is obtained, in which the coefficients of behavior information, relationship information, interest information, and personal description information are 0.36, 0.49, 0.11, and 0.44, respectively. We then compare with the traditional method of finding neighbor users according to ratings. The comparison result is shown in Figure 9, with Top-N ranging from 5 to 40.

5.3. The Collaborative Filtering Recommendation Experiment Based on the User Similarity Degree

The experimental data provide 845,727 user ratings of items, involving 3,775 related items. According to the method of finding neighbor users in Section 5.2, we selected 40 neighboring users for every user, and each user together with the corresponding neighboring users forms the rating matrix of items. We then process the rating matrix with the sparsity processing method of Section 4; the result is shown in Table 4.

According to the processed rating matrix, we re-predict the current user's ratings of items with the formula in Section 4.3. We use recall (R), precision (P), and F1 as the evaluation basis. Recall is the ratio of the number of correct recommendations to the theoretical maximum number of recommendable items; precision is the ratio of the number of correct recommendations to the total number of recommendations; F1 balances recall and precision. The formulas are as follows:

R = N_hit / N_like, P = N_hit / N_rec, F1 = 2PR / (P + R)

N_hit indicates the number of items in the recommendation set whose original user rating is "1"; N_like indicates the number of all items the user originally rated "1"; N_rec indicates the number of items in the recommendation set. We then sort the user's predicted ratings and select the recommendation set according to Top-k. Three cases were selected for comparison in the experiment.
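A minimal sketch of these three metrics:

```python
def evaluate(recommended, liked):
    # Precision: hits over the size of the recommendation set.
    # Recall: hits over all items the user actually liked (rated "1").
    # F1: harmonic mean of precision and recall.
    hits = len(set(recommended) & set(liked))
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(liked) if liked else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = evaluate(["i1", "i2", "i3"], ["i1", "i4"])
print(p, r, f1)
```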

(1) Recommending directly without matrix sparsity processing.

(2) Recommending after matrix sparsity processing.

(3) Only considering the users' positive ratings when building the rating matrix, and then performing the recommendation after matrix sparsity processing. Figure 10 shows the variation of the F1 value for Top-k between 5 and 45.

According to the experimental results, we can get the following:

(1) The recommendation effect after sparsity processing of the rating matrix is better than without processing. When Top-k is 5, the two are basically the same; as Top-k grows, the difference becomes obvious, and it is most pronounced when Top-k is between 30 and 45. This indicates that, without matrix sparsity processing, items with more original ratings of "1" cannot be selected as recommendation results.

(2) Negative ratings have a certain influence on the recommendation effect. A rating greater than zero is positive, a rating less than zero is negative, and zero indicates that the user has not rated. In the data, we found that negative ratings account for a larger proportion than positive ratings, and after the rating vector processing, more rating values are negative. To reflect the impact of negative ratings, we removed them and conducted recommendation experiments considering only positive ratings. As can be seen from Figure 10, the results considering both positive and negative ratings are better than those considering only positive ratings.

6. Conclusion

In this paper, based on the KDD CUP 2012 track1 data set, we study users' social information and propose a new collaborative filtering method based on social information fusion. Experiments show that the new method outperforms the method based on user ratings of items and that social information fusion is an effective feature for recommender systems. In the future, we will study how to extract more effective social information fusion features with deep learning methods.

Data Availability

The data is a public dataset that everyone can download from https://www.kaggle.com/c/kddcup2012-track1. The data names are given in Section 3.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61672040) and the North China University of Technology Startup Fund.