Security and Privacy Protection of Social Networks in Big Data Era
SHMF: Interest Prediction Model with Social Hub Matrix Factorization
With the development of social networks, microblogging has become a major social communication tool. Microblogs carry a great deal of valuable information, such as personal preferences, public opinion, and marketing signals. Consequently, research on user interest prediction in microblogs has practical significance. However, extracting information associated with a user’s interest orientation from constantly updated blog posts is not easy. Existing prediction approaches based on probabilistic factor analysis use the blog posts published by a user to predict that user’s interests. These methods are not very effective for users who post little but browse a lot. In this paper, we propose a new prediction model, called SHMF, which uses social hub matrix factorization. SHMF constructs the interest prediction model by combining the information in blog posts published by both the user and the direct neighbors in the user’s social hub. The proposed model predicts user interest by integrating the user’s historical behavior and temporal factors as well as the user’s friendships, thus achieving more accurate forecasts of the user’s future interests. Experimental results on Sina Weibo show the efficiency and effectiveness of the proposed model.
1. Introduction
Online microblog systems such as Sina Weibo, Twitter, and Facebook provide a convenient platform for users to share information. The number of social media users has grown exponentially over the last decade. A recent snapshot of the Facebook friendship network indicated that it has over 1 billion users. These social networks are becoming not only an effective means of connecting friends but also powerful information dissemination and marketing platforms for spreading ideas, fads, and political opinions.
Microblogs contain a vast amount of information, and the topics of users and user groups change over time and with hot topics at home and abroad. In this context, research on user interest prediction is useful in network marketing, public opinion analysis, and even public security. Generally, interest prediction aims to generate the potential topics of a user at the next time point according to the user’s historical blog posts. Unfortunately, blog posts are almost always short texts, so both the user-keyword matrix and the user-topic matrix of microblogs are very sparse. Moreover, in the prediction model, the contents of the related matrices change with many factors, such as time and friendships in the social hub. Therefore, interest prediction is still a challenging problem.
It should be noted that user interest prediction differs from user interest detection, as the latter mainly focuses on mining users’ current interests. Interest prediction remains a relatively understudied problem that poses two main challenges. First, user interest in microblogs changes over time. In a time-aware prediction model, the user’s temporal preference is an important aspect, and long-term and short-term preferences lead to different prediction results. Second, user interest is a dynamic phenomenon; it may migrate as the topics of the user’s social hub migrate. In the real world, capturing a user’s friendships and their topics is difficult.
Recently, a lot of models for prediction have been investigated [2–4]. A typical method exploits the probabilistic matrix factorization (PMF) technique to learn latent features for users and topics. These kinds of algorithms are mostly based on the blog posts published by user to predict his interest.
In fact, we observed several interesting phenomena. Some users publish few blog posts but browse many; we call them silent type users. Such users may have very explicit interests but be cautious about expressing their ideas, and they do publish their opinions at an appropriate moment. However, existing prediction models often fail to predict their interests. Another kind of user expands his or her social hub by following new friends whose topics are of interest. We call them interactive type users. In other words, the interest of such users can be represented to some extent by the interests of the direct neighbors in their social hubs. Prediction models that ignore this interactive property therefore tend to produce incomplete forecasts.
To overcome the shortcomings of existing work, and building on our observations about microblogs, this paper proposes a social hub matrix factorization-based model for user interest prediction in microblogs, called SHMF. SHMF incorporates the impact of the user’s social hub on the user’s interests to improve the quality of prediction. The experimental results on a Sina Weibo dataset show that our approach improves the prediction accuracy and the performance efficiency.
The rest of this paper is organized as follows. The related work is discussed in Section 2. Some preliminary knowledge and research are introduced in Section 3. We present our proposed model in Section 4 and give the implementation details in Section 5. In Section 6, we describe the real datasets we used in our experiments. Our experiments are reported in Section 7. Finally, we conclude the paper and present some directions for future work in Section 8.
2. Related Work
With regard to user interest prediction in microblogs, there is a series of mature methods based on probabilistic matrix factorization, which is built on the probabilistic graphical model. A probabilistic graphical model can concisely express complex probability distributions, efficiently compute marginal and conditional distributions, and conveniently learn the parameters and hyperparameters of a probability model. Probabilistic matrix factorization based on this model is often used for user interest prediction and recommendation.
In 2008, Salakhutdinov and Mnih proposed the probabilistic matrix factorization (PMF) method to address the inability of traditional collaborative filtering algorithms to handle recommendation on large sparse datasets and cold start. Experiments on the Netflix dataset demonstrate the effectiveness of PMF on large, sparse, and imbalanced data. In the same year, Ma et al. applied PMF to social networks and social recommendation and analyzed the complexity and prediction accuracy of this method in detail. In 2010, considering the characteristics of social networks, Jamali and Ester proposed a social matrix factorization (SocialMF) model that takes the social trust relationships between users into account. This model broadened the application prospects of PMF in social recommendation. Later, Sun et al. proposed a method to model users’ temporal behavior and combined it with SocialMF to predict Weibo users’ interests; their experimental results show that this way of modeling is more effective than traditional recommendation algorithms based on label information. Taking into account the fact that user interest changes over time, Bao et al. introduced a temporal and social PMF-based (TS-PMF) method to predict users’ interests in microblogs, which achieves higher accuracy than previous interest prediction methods.
When establishing a Weibo user interest prediction model, the above studies neglect the impact that blogs posted by others in the user’s social hub have on the user’s future interests and behavior. To address this problem, we propose a new user interest prediction model (SHMF) based on PMF, which combines the user’s historical behavior, the user’s social trust relationships, and the impact of the information in the user’s social hub on the user’s future interests. We also design experiments on a real Sina microblog dataset to show that this prediction model and its algorithm outperform the previous prediction models in top-$k$ accuracy.
In this section, we give the notations that will be used in the following discussions. In prediction model, we have a set of users and a set of topics in a microblog dataset.
The users’ interests are expressed by a user-topic matrix, each entry of which records whether the corresponding user has published posts on the corresponding topic. We divide users’ historical data into time points and construct a set of user-topic matrices, one per time point, to represent users’ interests over time. Furthermore, considering the impact of a user’s social hub on his or her interests, we construct a corresponding set of social hub-topic matrices from the blogs posted by the friends in the user’s social hub.
In microblogs, each user can follow others in whom he or she is interested, so users’ friendships can be described as a user-user matrix whose entries denote which user has followed which. The social hub of a user is the set of users followed by that user, which gives one social hub per user, and each user mainly reads the blogs posted by the friends in his or her social hub. Obviously, different users’ social hubs overlap. Users’ social hubs can therefore be described as a hub-hub matrix, each entry of which is set from the number of users in the intersection of two hubs and the number of users in one of those hubs.
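As a rough illustration of how social hubs and a hub-hub matrix could be derived from follow relationships, the sketch below assumes that each hub-hub entry is the size of the intersection of two hubs divided by the size of the first hub. That ratio, the function names `build_hubs` and `hub_hub_matrix`, and the toy follow pairs are assumptions made for illustration and are not taken from the paper.

```python
def build_hubs(follows):
    """Map each user to the set of users they follow (their social hub)."""
    hubs = {}
    for follower, followee in follows:
        hubs.setdefault(follower, set()).add(followee)
        hubs.setdefault(followee, set())  # users who follow nobody get an empty hub
    return hubs


def hub_hub_matrix(hubs):
    """Assumed overlap ratio |hub_i & hub_j| / |hub_i| for every ordered pair (i, j)."""
    users = sorted(hubs)
    matrix = {}
    for i in users:
        for j in users:
            if i == j or not hubs[i]:
                # Diagonal entries and empty hubs are set to 0 in this sketch.
                matrix[(i, j)] = 0.0
            else:
                matrix[(i, j)] = len(hubs[i] & hubs[j]) / len(hubs[i])
    return matrix


# Hypothetical (follower, followee) pairs.
follows = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "e"), ("b", "d")]
hubs = build_hubs(follows)
H = hub_hub_matrix(hubs)
```

Under this assumed definition the matrix is asymmetric: a small hub fully contained in a larger one gets a ratio of 1 toward the larger hub, while the larger hub gets a smaller ratio in return.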
Generally, a user interest prediction model generates a user-interest matrix for the next time segment. The basic matrix factorization (MF) approach finds a low-rank approximation of the original matrix and uses it as the predictive matrix. It has proven effective at learning the latent characteristics of users and topics and at predicting scores from these latent characteristics. The conditional probability of the known scores is given by (1).
In (1), the latent characteristics of users and topics are represented by two feature matrices whose column vectors are the user-latent and topic-latent feature vectors, respectively, all of the same dimension. The predicted score for a user-topic pair is obtained from the inner product of the corresponding user and topic vectors. Each known score is modeled by a Gaussian distribution centered on this prediction, and an indicator function, equal to 1 for the known user-topic pairs and 0 otherwise, restricts the product to known entries. A logistic function is applied to the inner product, which bounds the prediction between 0 and 1.
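For reference, the description above matches the standard PMF likelihood of Salakhutdinov and Mnih, which can be written as follows. The symbol names ($R$ for the user-topic matrix, $U$ and $V$ for the user and topic feature matrices, $N$ users, $M$ topics, $\sigma_R^2$ for the variance, $I^R_{ij}$ for the indicator) are assumed here, since the paper's own notation did not survive in this copy.

```latex
p(R \mid U, V, \sigma_R^2)
  = \prod_{i=1}^{N} \prod_{j=1}^{M}
    \left[ \mathcal{N}\!\left( R_{ij} \,\middle|\, g\!\left(U_i^{T} V_j\right), \sigma_R^2 \right) \right]^{I^{R}_{ij}},
\qquad
g(x) = \frac{1}{1 + e^{-x}}
```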
In fact, the relations among users in a social network play an important role in users’ behaviors [9, 10]. Specifically, a user tends to become increasingly similar to his or her friends. The SocialMF model incorporates this social influence into the MF approach for prediction by adding the user-user relationship matrix, as given in (2).
Figure 1 shows the graphical model corresponding to (2). In Figure 1, the edges among the latent feature vectors of users represent the trust relationships among users, each weighted by the degree of trust that one user places in another.
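The social influence idea can be sketched as a penalty that pulls each user's latent vector toward the trust-weighted average of the vectors of the users he or she trusts, which is the form used in the SocialMF literature. The function name `social_penalty`, the dictionary layout, and the toy values are assumptions for illustration, and this sketch simply skips users with no trusted neighbors.

```python
def social_penalty(user_vecs, trust):
    """Sum over users of the squared distance between a user's latent vector
    and the trust-weighted combination of the trusted users' vectors."""
    total = 0.0
    for user, vec in user_vecs.items():
        neighbors = trust.get(user, {})
        if not neighbors:
            continue  # assumption: users with no trusted neighbors add no penalty
        dim = len(vec)
        combined = [
            sum(weight * user_vecs[other][d] for other, weight in neighbors.items())
            for d in range(dim)
        ]
        total += sum((vec[d] - combined[d]) ** 2 for d in range(dim))
    return total


# Hypothetical 2-dimensional latent vectors and trust weights.
user_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
trust = {"a": {"b": 0.5, "c": 0.5}}
penalty = social_penalty(user_vecs, trust)
```

Adding such a term to the loss, scaled by a coefficient, is what makes the latent vectors of connected users drift toward each other during training.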
The user-topic matrices in the PMF and SocialMF models are constructed from the user’s historical behavior information and do not take the influence of time into account. The TS-PMF model, in contrast, incorporates the change of user interest over time and applies an exponential decay function to the user-topic matrices. TS-PMF is designed to use users’ sequential interest matrices and the users’ friendship matrix to predict users’ interests in the near future. At each time point, the conditional distribution of the observed items is similar to that in (1):
The exponential decay function used to analyze the change of user interest is computed as follows:
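Because the decay formula itself is not reproduced in this copy, the following is only a generic sketch of exponential time decay, not the paper's exact function. It assumes weights of the form exp(-age / kernel), normalized to sum to 1, where `kernel` is a hypothetical stand-in for the kernel parameter listed in Notations.

```python
import math


def decay_weights(num_points, kernel):
    """Normalized weights exp(-age / kernel); the most recent time point has age 0."""
    raw = [math.exp(-(num_points - 1 - s) / kernel) for s in range(num_points)]
    total = sum(raw)
    return [w / total for w in raw]


def decayed_average(matrices, kernel):
    """Blend a chronological list of equally sized matrices with decayed weights."""
    weights = decay_weights(len(matrices), kernel)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, matrices)) for c in range(cols)]
        for r in range(rows)
    ]


# Three hypothetical time points; older matrices receive smaller weights.
weights = decay_weights(3, kernel=2.0)
blended = decayed_average([[[0.0]], [[1.0]]], kernel=1.0)
```

A smaller kernel value makes the weights fall off faster, so the blended matrix tracks short-term interest more closely; a larger value keeps more long-term history.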
The user’s latent feature vector is affected by his historical interests and his friends’ interests. Therefore, the conditional distribution probability of users’ latent features can be expressed like this:
Now, through a Bayesian inference, we have the following equation for the posterior probability over latent features of users and topics:
Maximizing the log of the posterior distribution with regard to the user and topic latent features is equivalent to minimizing the following sum-of-squared-errors objective function (a local optimum of the objective function can be found by performing gradient descent):
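To make the optimization step concrete, here is a minimal sketch of gradient descent on a plain PMF-style objective: squared error through a logistic function plus L2 regularization. It deliberately omits the temporal and social terms of TS-PMF, and the names `objective` and `gradient_step`, the toy matrix, and the hyperparameter values are all assumptions chosen for illustration.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def objective(R, U, V, lam):
    """0.5 * squared error over known entries + 0.5 * lam * squared norms."""
    err = 0.0
    for (i, j), r in R.items():
        pred = logistic(sum(a * b for a, b in zip(U[i], V[j])))
        err += (r - pred) ** 2
    reg = sum(x * x for vec in U + V for x in vec)
    return 0.5 * err + 0.5 * lam * reg


def gradient_step(R, U, V, lam, lr):
    """One full-batch gradient descent update of U and V in place."""
    grad_U = [[lam * x for x in vec] for vec in U]
    grad_V = [[lam * x for x in vec] for vec in V]
    for (i, j), r in R.items():
        pred = logistic(sum(a * b for a, b in zip(U[i], V[j])))
        # Chain rule: d/dx of 0.5 * (r - g(x))^2 is -(r - g(x)) * g(x) * (1 - g(x)).
        coeff = -(r - pred) * pred * (1.0 - pred)
        for d in range(len(U[i])):
            grad_U[i][d] += coeff * V[j][d]
            grad_V[j][d] += coeff * U[i][d]
    for vecs, grads in ((U, grad_U), (V, grad_V)):
        for vec, grad in zip(vecs, grads):
            for d in range(len(vec)):
                vec[d] -= lr * grad[d]


random.seed(0)
# Hypothetical known entries: (user index, topic index) -> score in [0, 1].
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 1): 1.0, (2, 0): 1.0}
dim = 2
U = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(3)]
V = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(2)]
before = objective(R, U, V, lam=0.01)
for _ in range(200):
    gradient_step(R, U, V, lam=0.01, lr=0.5)
after = objective(R, U, V, lam=0.01)
```

Only a local optimum is reached, as the text notes, so the result depends on the random initialization of the latent vectors.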
4. Social Hub User Interest Prediction Model
In this section, we present our model, SHMF, which incorporates the impact of the user’s social hub into the MF approach for prediction. SHMF combines the user’s historical behavior, social trust relationships, and the blog articles posted by friends in the user’s social hub.
Independence Hypothesis. Information of blogs posted in users’ social hub influences users’ interests independently.
Based on the above hypothesis, we have
Therefore, the conditional distribution probability of users’ latent features can be expressed as follows:
Through a Bayesian inference, we have the following equation for the posterior probability over latent features of users and topics:
The log of the posterior distribution for SHMF at a given time point is given by
Maximizing the log of the posterior distribution with regard to the latent features is equivalent to minimizing the following sum-of-squared-errors objective function:
In (12), the two remaining quantities can be computed by (7). SHMF interest prediction thus amounts to performing a symmetrical calculation on the loss function, once for the user’s own posts and once for the posts in the user’s social hub. Here we introduce a weighting parameter that indicates the importance of the user’s social hub information to the user’s interest. One extreme setting of this parameter corresponds to considering only the user’s personal posting behavior, and the opposite extreme corresponds to considering only the user’s social hub information. Thus, the loss function can be computed as follows:
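One simple way to realize such a weighting is a convex combination of the two loss terms. The sketch below assumes that form, with a hypothetical weight `alpha` in [0, 1], where 0 keeps only the user's own posting behavior and 1 keeps only the social hub information; the paper's actual loss may combine the terms differently.

```python
def combined_loss(user_loss, hub_loss, alpha):
    """Assumed convex combination of the personal and social hub loss terms."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * user_loss + alpha * hub_loss


# Hypothetical loss values for the two symmetric parts of the model.
only_user = combined_loss(2.0, 4.0, alpha=0.0)
only_hub = combined_loss(2.0, 4.0, alpha=1.0)
mixed = combined_loss(2.0, 4.0, alpha=0.25)
```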
To reduce the computational complexity, stochastic gradient descent is used to find a local optimum of the loss function, as shown in (14), where the derivative term is the first-order derivative of the logistic function and the norms are Frobenius norms.
SHMF model provides an effective way to predict users’ interests. The procedure of prediction will be described with two algorithms in Section 5. All the notations used throughout the paper are summarized in Notations.
To evaluate the effectiveness and efficiency of our approach, we implemented a prototype system of user interest prediction. According to SHMF model and its variant, we provide two algorithms with different parameters and procedures.
5.1. Architecture Overview
The architecture of our implementation is illustrated in Figure 2. We first use the LDA topic model to label the topics of the microblog dataset automatically. Meanwhile, we use the sequential behaviors of users to build a set of user-topic matrices and a set of social hub-topic matrices. Next, we capture the social relationships between users to obtain a user-user matrix, and we obtain a hub-hub matrix in the same way. Finally, these matrices are input to the SHMF model to generate the prediction result.
SHMF integrates the user’s historical behavior, the user’s social trust relationships, and the impact of the information in the user’s social hub. The process of predicting users’ interests with SHMF is described in Algorithm 1.
6.1. Experimental Data
We used the dataset from 1 May 2016 to 31 May 2016, which we downloaded from Sina Weibo. This dataset includes more than 20 million microblog messages, time-stamps, and user-to-user relationships.
6.2. User Selection
The basic idea of traditional collaborative filtering is that similar users make similar choices, or that similar options are chosen by similar groups of users. In recent years, social recommendation has gradually attracted researchers’ attention. Its basic idea, motivated by social influence [12, 13], is that associated users affect each other, so a user’s interest is largely influenced by the users associated with him or her.
Given the computational cost, the selection of users is very important in microblog user interest prediction. Within a month, different users post very different numbers of microblogs: some post only one, whereas others post tens of thousands. For users who post very few microblogs in a month, neither their personal microblog information nor their social hub microblog information is sufficient to describe their interests. Users who post a very large number of microblogs in a month are mostly official accounts of enterprises and institutions or commercial procurement services, and predicting interests for them is meaningless. We therefore perform a statistical analysis on the dataset from Sina Weibo and find that most users post 100 or fewer microblogs, as shown in Figures 3(a) and 3(b), which are histograms of the number of users by number of blog posts. In this paper, we select users who post 20 to 100 microblogs as subjects; there are about one million such users. After applying neighbor computing and stratified sampling, we select the information of 1402 users as the experimental object.
6.3. Automatically Classify Blogs’ Topics Posted by Users
After obtaining the users’ blog information, we train an LDA model and use it to automatically classify the blogs posted by users and the blogs posted by others in each user’s social hub. The number of topics is chosen by perplexity: according to the curve of perplexity against number of topics shown in Figure 4, the perplexity reaches its lowest point at 23 topics, so we use 23 topics.
7. Experiments and Analysis
In this section, the effectiveness and efficiency of our SHMF model are evaluated. We conduct experiments on an Intel Core i7 processor with 4 cores running at 3.60 GHz, 24 GB of memory, and a 1 TB hard disk. The programs are run on Windows 7 Professional and Anaconda 4.1.1 (64-bit).
We first present evaluation metrics used throughout our experiments. Next, we employ the variable-controlling approach to adjust the parameters of SHMF model and the other three models. Then the prediction accuracy and the performance overhead of our model are compared with results of the other models. Finally, we will analyze the experimental results.
Because users’ blog posting behavior is highly uncertain, the recall rate has little practical significance for this problem, and in real life users pay more attention to the top-$k$ topics in which they are most interested. Therefore, in this paper, the precision of top-$k$ is used as the model evaluation criterion:
The two quantities in this definition are the number of users in the test set and the total number of interest topics predicted correctly in the top-$k$ prediction results for all users in that test set.
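The metric can be sketched as follows. Since the formula itself is missing from this copy, the normalization used here (correct topics divided by $k$ times the number of test users) is an assumption, as are the function name `top_k_precision` and the toy scores.

```python
def top_k_precision(scores, relevant, k):
    """Correct topics across every user's top-k list, divided by k * number of users."""
    hits = 0
    for user, topic_scores in scores.items():
        ranked = sorted(topic_scores, key=topic_scores.get, reverse=True)[:k]
        hits += sum(1 for topic in ranked if topic in relevant.get(user, set()))
    return hits / (k * len(scores))


# Hypothetical predicted scores and the topics each user actually posted about.
scores = {
    "u1": {"sports": 0.9, "music": 0.4, "tech": 0.7},
    "u2": {"sports": 0.2, "music": 0.8, "tech": 0.5},
}
relevant = {"u1": {"sports", "music"}, "u2": {"tech"}}
p_at_1 = top_k_precision(scores, relevant, 1)
p_at_2 = top_k_precision(scores, relevant, 2)
```

In this toy case one of the two top-1 predictions is correct, and two of the four top-2 predictions are correct, so both values are 0.5.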
7.2. Model Selection and Parameter Setting
We set up three comparison experiments, PMF, SocialMF, and TS-PMF, because these three methods are frequently used to predict users’ interests and belong to the same theoretical framework as the SHMF model proposed in this paper. We then set up an experiment for the proposed SHMF model.
First, the variable-controlling approach is used to tune the parameters, and then we compare the top-$k$ accuracy and average accuracy of the models.
(1) PMF Model. The PMF model has three parameters in this paper: the regularization term coefficients in the loss function, whose default value is 0.01 before tuning, and the dimension of the latent features, which is generally less than the rank of the original matrix. The control variable method is used to set the parameters by fixing the other values and changing one at a time, and we then plot the impact of each parameter on top-$k$ accuracy. To reduce the computational complexity, we constrain the parameter setting before tuning. The top-$k$ accuracy varies with the parameters as shown in Figure 5.
According to Figure 5, we can obtain a set of parameter values that makes the model perform better on the top-1, top-3, top-5, and top-10 accuracy.
(2) SocialMF Model. The SocialMF model has four parameters in this paper: the regularization term coefficients in the loss function and the dimension of the latent features, which is generally less than the rank of the original matrix. To reduce the computational complexity, we reuse the setting from the first experiment for the parameters shared with PMF and fix a default value for the remaining coefficient before tuning. The control variable method is used to set the parameters by fixing the other values and changing one at a time, and we then plot the impact of each parameter on top-$k$ accuracy. The top-$k$ accuracy varies with one tuned parameter as shown in Figure 6(a) and with the other as shown in Figure 6(b).
Based on Figure 6, we can obtain a set of parameter values that makes the model perform better on the top-1, top-3, top-5, and top-10 accuracy.
(3) TS-PMF Model. The TS-PMF model has six parameters: three regularization term coefficients in the loss function, the dimension of the latent features, which is generally less than the rank of the original matrix, and the parameters of the forgetting function. To reduce the computational complexity, we reuse the setting from the first experiment for the shared parameters and fix default values for the others before tuning. The control variable method is used to set the parameters by fixing the other values and changing one at a time, and we then plot the impact of each parameter on top-$k$ accuracy. The top-$k$ accuracy varies with the tuned parameters as shown in Figure 7.
From Figure 7, we can obtain a set of parameter values that makes the model perform better on the top-1, top-3, top-5, and top-10 accuracy.
(4) SHMF Model. The SHMF model has ten parameters. To reduce the computational complexity, we rely on the independence hypothesis, namely that the information in blogs posted by users and the information in blogs posted by others in the user’s social hub influence the user’s future interest independently, and set the corresponding parameters in accordance with the third experiment. After fixing further parameters, we actually need to consider only five. The parameters involved are the regularization term coefficients in the loss function, the dimension of the latent features, which is generally less than the rank of the original matrix, the parameters of the forgetting function, and the weight that indicates how important the user’s social hub information is to the user’s interest. At one extreme of this weight only the user’s personal posting behavior is considered, and the SHMF model degrades to the TS-PMF model; at the other extreme only the user’s social hub information is considered. The control variable method is used to set the parameters by fixing the other values and changing one at a time, and we then plot the impact of each parameter on top-$k$ accuracy. The top-$k$ accuracy varies with the tuned parameters as shown in Figure 8.
According to Figure 8, we can obtain a set of parameter values that makes the model perform better on the top-1, top-3, top-5, and top-10 accuracy.
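The control variable procedure used for all four models can be sketched as a one-at-a-time search. The function `one_at_a_time_search`, the parameter names `reg` and `dim`, and the toy scoring function are hypothetical; the sketch also assumes that the best value found for one parameter is kept while the next one is tuned, which the paper does not state explicitly.

```python
def one_at_a_time_search(evaluate, defaults, grids):
    """Tune each parameter in turn while the others stay at their current values."""
    best = dict(defaults)
    for name, candidates in grids.items():
        scored = []
        for value in candidates:
            trial = dict(best, **{name: value})
            scored.append((evaluate(trial), value))
        # Keep the candidate with the highest score before moving on.
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best


def toy_accuracy(params):
    # Stand-in for a real train-and-evaluate run; it peaks at reg=0.1, dim=10.
    return -((params["reg"] - 0.1) ** 2) - ((params["dim"] - 10) / 100.0) ** 2


best = one_at_a_time_search(
    toy_accuracy,
    {"reg": 0.01, "dim": 5},
    {"reg": [0.001, 0.01, 0.1, 1.0], "dim": [5, 10, 20, 40]},
)
```

This kind of search costs one evaluation per candidate value rather than one per combination, which is why it scales to ten parameters, but it can miss interactions between parameters that a full grid search would find.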
7.3. Experimental Results and Analysis
(1) Comparison of Accuracy. After tuning the parameters of the five models, we compare the strengths and weaknesses of the different models. Because different choices of $k$ for top-$k$ accuracy lead to different results, this paper takes the arithmetic mean of the top-1, top-3, top-5, and top-10 accuracy as the average accuracy, as shown in the following equation:
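Written out from the verbal definition above, the average accuracy is the following arithmetic mean. The symbols $P@k$ for top-$k$ accuracy and $\overline{P}$ for the average are assumed names, since the paper's own symbols are not reproduced in this copy.

```latex
\overline{P} = \frac{P@1 + P@3 + P@5 + P@10}{4}
```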
After tuning the model parameters in the five experiments, the average accuracy of the five models under their best parameters is shown in Table 1.
It can be seen from Table 1 that the algorithm SHMF proposed in this paper improves the average accuracy by over 1.3% compared to algorithm PMF and algorithm SocialMF and the average accuracy of the algorithm SHMF is 0.76% higher than the algorithm TS-PMF.
(2) Executive Efficiency Analysis. To assess execution efficiency, we use the best parameters, set the number of iterations to 100, and record the run time, as shown in Table 2.
Table 2 shows that the running time of SHMF is the longest, nearly three times that of PMF. This is because SHMF has a higher computational complexity, which increases its run time.
(3) Result Analysis. Comparing the four groups of experiments reveals the differences and relations among PMF-based algorithms in microblog user interest prediction. In the first comparative experiment, we use the basic probabilistic matrix factorization algorithm and obtain an average accuracy of 17.35%. In the second comparative experiment, the social trust relationship is added to the probabilistic matrix factorization algorithm, yet the average accuracy is almost the same as that of the basic algorithm. This is mainly because, in constructing the dataset, we select users whose post counts fall within a certain range and then determine their social trust relationships according to statistical characteristics, rather than using all or as many social trust relationships as possible for each user, in order to consider both the similarity of behavior and the mutual influence among users. This construction makes the social trust matrix sparse, so its impact is relatively small. We nevertheless adopt it because we do not focus only on the correlation between users. The average accuracy of the third comparative experiment is higher than that of the first two, which supports the rationale of modeling users’ short-term interest as changing over time. In the last experiment, the proposed SHMF algorithm improves the average accuracy by nearly one percentage point, indicating that the user’s social hub information does affect the user’s interest in microblogs and verifying the effectiveness of the algorithm.
8. Conclusions and Future Work
Building on prior work on predicting microblog users’ interests, this paper analyzes the information in microblog users’ social hubs and puts forward the SHMF model, which improves the top-$k$ accuracy and the average accuracy over PMF, SocialMF, and TS-PMF. This lays the foundation for follow-up research. At the same time, analyzing the information in a user’s social hub can help with the cold-start problem of predicting the interests of users who do not often post blogs. This method could have broad application in social platform recommendation. However, there are still some shortcomings in execution efficiency: when the amount of data is particularly large, the running time is too long, which needs to be improved in future work.
For future work on microblog users’ interest prediction, further research on the representation of interest should be carried out, since a more accurate representation determines the upper limit of interest prediction. In the prediction algorithm, more techniques, such as Bayesian analysis, should be added to address the multiparameter problem by analyzing the relationships between the parameters and their actual meanings.
Notations
|:||The user-topic matrix in time|
|:||The user’s social hub-topic matrix in time|
|:||The user-user matrix|
|:||The hub-hub matrix|
|:||The users’ latent feature space in time|
|:||The topics’ latent feature space in time|
|:||The users’ latent feature space in social hub in time|
|:||The topics’ latent feature space in social hub in time|
|:||The final users’ latent feature space in time|
|:||The final topics’ latent feature space in time|
|:||The mean matrix of with spherical Gaussian priors in time|
|:||The mean matrix of with spherical Gaussian priors in time|
|:||The mean matrix of with spherical Gaussian priors in time|
|:||The mean matrix of with spherical Gaussian priors in time|
|:||A weight that indicates how important the whole previous time points are to the current one|
|:||The kernel parameter|
|:||The dimension of latent feature space|
|:||A weight that indicates how important the user’s social hub information is to the user’s interest|
|:||The impact of the users’ latent feature vectors on users’ interests|
|:||The impact of the social hubs’ latent feature vectors on users’ interests|
|:||The impact of the topics of the blogs posted by users on users’ interests|
|:||The impact of the topics of the blogs posted by others in users’ social hub on users’ interests|
|:||The impact of the users’ relationships on users’ interests|
|:||The impact of the social hubs’ relationships on users’ interests.|
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (31371340) and the National Key Technologies Research and Development Program of China (no. 2016YFB0502604).
R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” in NIPS 2008, vol. 20.
R. R. Sinha and K. Swearingen, “Comparing recommendations made by online systems and friends,” in DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, 2001.
J. L. Herlocker, J. A. Konstan, and J. Riedl, “Explaining collaborative filtering recommendations,” ACM Transactions on Information Systems, vol. 22, no. 1, pp. 5–53, 2001.