Security and Privacy Protection of Social Networks in Big Data Era
A Stable-Matching-Based User Linking Method with User Preference Order
1. Introduction
Over the past decade, as net-services have grown exponentially, the number of anonymous users has grown with them. As of the third quarter of 2016, Facebook had 1.79 billion active users, which means that more than half of the roughly 3 billion Internet users use Facebook at least once per month. About 65% of them, or about 1.18 billion users, log in at least once a day. However, some traditional social network sites are now facing a significant slowdown in development. According to the Twitter 2016 Q3 results, its monthly active users reached 317 million, an average growth rate of only about 3%, whereas the monthly active users of the image-based social network Instagram have already exceeded 600 million. This change shows that, over time, users' interest in net-services has become divided. Therefore, net-service providers also aim to develop different social services for users' various interests.
Nowadays, each net-service often has its own mode of information sharing to maintain its social relationships. These distinct modes attract different user groups; for example, a user may select Twitter to share information publicly, choose Facebook for his or her own circles, and prefer Instagram for sharing travel scenery and food. On these net-service platforms, users typically present a uniquely identified nickname along with other attribute tags, such as profile information, hobbies, friendships, and events. If these accounts can be effectively linked to a particular user, the user can be understood comprehensively, which not only can significantly improve his or her experience of a recommender system but also can support a better anonymity protection policy. In network security, when detecting malicious attackers who hold multiple accounts on different platforms, linking makes it possible to integrate cross-media information and greatly improves detection ability. User linking therefore has important practical significance.
However, because of the anonymity protection policies of net-service providers, and because users choose to share different information on different platforms, the multiple accounts of a particular user often show no strong correlation. This large number of accounts that are not directly linked makes it difficult to understand the user comprehensively. Existing studies analyze the user's naming conventions [5, 6], profile [7, 8], writing style, behavior, social relations [11, 12], and so on, and then link the user's multiple accounts by statistical and machine learning methods. These methods model the characteristics of a vast number of accounts and have achieved some success on experimental datasets. In reality, however, sparse network data do not yield enough account features, and the behavior behind these accounts is always changing, so it is hard to describe with a stable mathematical model. Moreover, real human behavior is neither random nor entirely rational. Therefore, considering the mutual influence among multiple accounts from different net-services, the user linking problem can be regarded as a cooperative game problem in a bilateral (multilateral) market: how to formulate a cooperation (linking) strategy in the markets (net-services) that enhances the interest (linking results) of the whole candidate account set.
In recent years, research on Game-Theoretic Machine Learning has progressed. Some researchers have constructed a Game-Theoretic Machine Learning framework that uses a Markov model to study and predict users' behavior [13–15]; others use a cooperative game approach to evaluate and select features for machine learning. These methods indicate that game theory can improve traditional machine learning. Therefore, this paper proposes a stable-matching-based game theory method for user linking with user preference order and prior knowledge. The main contributions of this paper are as follows.
(1) We propose a novel method based on stable matching game theory for the analysis of user linking.
(2) We input already linked user accounts as prior knowledge to enhance the result of user linking.
(3) In experiments on the LifeSpec project dataset provided by Microsoft Research Asia, our method is about 21.6% higher in accuracy than the traditional user linking methods, with a further improvement of about 7.8% after inputting the prior knowledge.
2. Problem Formulation
In this section, the related concepts and formal descriptions of user linking are given. For the convenience of description, this paper focuses on two heterogeneous networks. Symbols used in this paper are shown in Table 1.
User. A user $u_i$ can be represented as $u_i = \{a_i^k\}$, where $a_i^k$ is the account of user $u_i$ on network $G^k$ and $\mathbf{x}_i^k$ is the feature vector of that account. For convenience, we focus on two heterogeneous networks, so $u_i = \{a_i^s, a_i^t\}$, where $a_i^s$ is the account from the source network $G^s$ and $a_i^t$ is the account from the target network $G^t$.
Account Set. An account set contains the accounts that can be extracted from a particular network. So $A^s = \{a_1^s, \ldots, a_n^s\}$ is the source network account set and $A^t = \{a_1^t, \ldots, a_m^t\}$ is the target network account set, where $n$ and $m$ are the numbers of users of the two networks.
Accounts Pair. An accounts pair $(a_i^s, a_j^t)$ is a tuple consisting of any account $a_i^s$ of user $u_i$ from the source network $G^s$ and any account $a_j^t$ of user $u_j$ from the target network $G^t$.
Identification of Accounts Pair. The identification $y_{ij} \in \{0, 1\}$ of the pair $(a_i^s, a_j^t)$ is defined so that $y_{ij} = 1$ when the pair consists of accounts of the same user and $y_{ij} = 0$ otherwise.
Problem Description. Given the source network $G^s$ and the target network $G^t$, we extract the candidate account sets $A^s$ and $A^t$ and combine every account of one network with every account of the other into an accounts pair, which gives $n \times m$ pairs. Finally, a linking algorithm is used to find all the pairs whose identification is $y_{ij} = 1$, namely, to link the accounts from the two heterogeneous networks $G^s$ and $G^t$.
The challenges addressed in this paper are as follows:
(1) Traditional user linking techniques often try to maximize some objective function so that the whole candidate account set gets the best result. However, since the behavior of a user's different accounts is often neither rational nor stable, and the sparse features of accounts can influence the linking result significantly, the traditional methods do not always achieve an ideal result on large-scale sparse datasets. Within cooperative game theory, user linking is actually an attempt to find matched players in a bilateral market. In this paper, we combine game theory with the user's preferences by using stable matching theory and the Pairwise approach, and finally link users through the cooperation between accounts.
(2) Traditional methods often link a user's accounts by calculating the “similarity” between different accounts using certain types of characteristics. However, in the real world, the multiple accounts of a user on various platforms tend to reflect different needs of the user, so the “similarity” is often so small that many accounts cannot be linked. Because the “user linking problem” and the “linking similar users problem” are different, we input some linked users as prior knowledge, thereby enhancing the result of user linking.
The following section will detail how to solve these two problems.
3. Stable User Linking with User Preference Order
User linking is essentially a multiclass classification problem in which accounts are categorized according to the user they belong to. However, because an ideal solution to a multiclass problem is usually difficult to obtain, in this paper we adopt the Pairwise idea: accounts are combined into pairs, and the pairs are classified according to whether they are linked. The user linking problem is thereby converted into a binary classification problem, and the probability of each account pair under each category can be calculated. According to this probability, we construct the user preference order set, convert the question into “how to select the best target account in one's preference order set,” and try to improve the result by inputting prior knowledge. We therefore present a three-phase approach to the user linking problem:
(1) Constructing the user preference order set: calculating the posterior probability of each pair with an SVM model trained on the training set, and sorting the pairs of each account by this probability to construct its user preference order set.
(2) User linking based on stable matching: applying a stable matching algorithm to the user preference order sets of the source and target account sets to obtain all the stable links among accounts.
(3) User linking based on prior knowledge: inputting prior knowledge into the stable matching algorithm to obtain the reinforced user linking algorithm.
3.1. Constructing User Preference Order Set
Following the Pairwise idea, user linking is first converted into a binary classification problem; the classification probability is then used to construct the account preference order set, which is defined as follows.
User Preference Order Set. For a source account $a_i^s$, the sequence of accounts from the target account set $A^t$, ordered by how likely each is to be linked with $a_i^s$, is called the preference order set of $a_i^s$. The order reflects which target account is more likely to be linked.
In recent years, many studies have shown that the Support Vector Machine (SVM) performs well on binary classification problems [20, 21]. Since SVM is very sensitive to features, selecting proper features is vital. Traditional methods hand-craft many features from information such as naming habits, personal profiles, writing style, user behavior trajectories, and social relations. However, because network data are incomplete and heterogeneous, the user features that can be acquired are very limited and need to be completed. By using account labels instead, we avoid the difficulty of filtering and completing features.
In real net-services, some platforms provide labels that simply and clearly reflect the characteristics of user accounts, but others do not. For the latter, labels can be constructed directly from an account's historical text by using a topic model such as LDA. Label extraction by topic models has matured in recent years and is not repeated here.
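In this paper, we treat each account together with its labels as a bag-of-words model and then calculate the following features between the feature vector $\mathbf{x}_i$ of a source account $a_i^s$ and the feature vector $\mathbf{x}_j$ of a target account $a_j^t$:
(1) Cosine similarity: $\cos(\mathbf{x}_i, \mathbf{x}_j) = \dfrac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert}$.
(2) Number of common labels: $\lvert L_i \cap L_j \rvert$, where $L_i$ and $L_j$ are the label sets of the two accounts.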
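The two pair features can be sketched in a few lines of Python. This is only an illustration: the function names, the use of `Counter` as the bag-of-words container, and the choice to return 0.0 for an account without labels are assumptions, not details taken from the published code.

```python
import math
from collections import Counter


def cosine_similarity(labels_a, labels_b):
    """Cosine similarity between two label-frequency vectors (bags of words)."""
    dot = sum(count * labels_b.get(label, 0) for label, count in labels_a.items())
    norm_a = math.sqrt(sum(c * c for c in labels_a.values()))
    norm_b = math.sqrt(sum(c * c for c in labels_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an account without labels is treated as having zero similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def common_label_count(labels_a, labels_b):
    """Number of distinct labels that occur in both accounts."""
    return len(set(labels_a) & set(labels_b))


def pair_features(labels_a, labels_b):
    """Two-dimensional feature vector of an accounts pair."""
    return [cosine_similarity(labels_a, labels_b), common_label_count(labels_a, labels_b)]


# Hypothetical accounts: label -> frequency of that label.
book_account = Counter({"sci-fi": 3, "history": 1})
movie_account = Counter({"sci-fi": 2, "comedy": 2})
features = pair_features(book_account, movie_account)
```

Because both features depend only on label overlap, a pair whose accounts share no label gets the feature vector `[0.0, 0]`, whichever users the accounts belong to.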
With the features above, an SVM can be trained on the training data and then used to classify the test data. However, in large-scale data, labels are sparse and accounts of different users may still be somewhat similar, so many pairs cannot be classified accurately. These noisy accounts have a great impact on classification when the standard SVM is used. In fact, user linking is a nondeterministic classification problem: some samples cannot be assigned to a category with certainty, and their membership can only be expressed as a probability. To address this issue, following the sigmoid-fitting method proposed by Platt, we calculate each pair's posterior probability of being linked ($y = 1$) given the SVM output $f$:
$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)},$$
where $f$ is the unthresholded output of the Support Vector Machine and the two parameters $A$ and $B$ are set by maximum likelihood estimation on the training set. This posterior probability reflects the likelihood that one account will be linked to another target account. According to the posterior probability, we construct the user preference order set as follows.
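To make the sigmoid fitting concrete, the following minimal sketch fits the two sigmoid parameters by maximizing the likelihood of labeled SVM outputs. It is a simplification under stated assumptions: it uses plain gradient descent and raw 0/1 targets, whereas Platt's procedure and the LibSVM implementation use smoothed targets and a more careful optimizer. The function names, learning rate, step count, and toy scores are all hypothetical.

```python
import math


def platt_probability(f, A, B):
    """Sigmoid mapping of an unthresholded SVM output f to P(y = 1 | f)."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)  # numerically stable branch for large positive z
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def fit_platt(scores, labels, learning_rate=0.1, steps=2000):
    """Fit A and B by gradient descent on the negative log-likelihood."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, labels):
            p = platt_probability(f, A, B)
            # With z = A*f + B, the derivative of the negative log-likelihood
            # with respect to z is (y - p).
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= learning_rate * grad_A / n
        B -= learning_rate * grad_B / n
    return A, B


# Hypothetical SVM decision values and labels (1 = linked, 0 = nonlinked).
scores = [-2.0, -1.0, -0.5, 0.3, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1, 1]
A, B = fit_platt(scores, labels)
```

With this parameterization a useful fit has a negative slope parameter, so larger SVM outputs map to higher linking probabilities.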
Following the Pairwise idea, the training set and the test set of pairs are constructed between the account sets $A^s$ and $A^t$, the feature vector of each pair is built from the two features above, and a Support Vector Machine model is trained on the training set. For a particular test account $a_i^s$, we calculate the posterior probability of each pair $(a_i^s, a_j^t)$, where $a_j^t$ comes from the target network $G^t$, given the SVM output for that pair. Finally, we obtain the user preference order set of $a_i^s$ by sorting the target accounts in descending order of this probability.
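The sorting step can be sketched as follows. The sketch assumes that the posterior probabilities are already available in a dictionary keyed by accounts pair, that target accounts rank source accounts by the same probability (the text does not state how the target side is ordered), and that ties are broken by account name. Pairs whose features could not be computed are simply absent, which yields an incomplete order set.

```python
def build_preference_orders(pair_probability):
    """Turn {(source, target): probability} into preference orders for both sides."""
    by_source, by_target = {}, {}
    for (source, target), p in pair_probability.items():
        by_source.setdefault(source, []).append((p, target))
        by_target.setdefault(target, []).append((p, source))

    def ranked(candidates):
        # Highest probability first; ties broken by account name (assumption).
        return [name for p, name in sorted(candidates, key=lambda c: (-c[0], c[1]))]

    source_prefs = {s: ranked(c) for s, c in by_source.items()}
    target_prefs = {t: ranked(c) for t, c in by_target.items()}
    return source_prefs, target_prefs


# Hypothetical posterior probabilities for book (b*) and movie (m*) accounts.
probabilities = {("b1", "m1"): 0.2, ("b1", "m2"): 0.9, ("b2", "m1"): 0.7}
source_prefs, target_prefs = build_preference_orders(probabilities)
```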
The following section describes how to link user accounts by user preference order set.
3.2. User Linking Based on Stable Matching
Through the conversion in Section 3.1, user linking turns into “how to select the best target account in one's preference order set” so that the whole candidate account set achieves the best performance. In this paper, we use stable matching theory to solve this problem. Stable matching theory was proposed by Gale and Shapley, who used cooperative game theory to solve the linking problem between entities in a bilateral market; for this line of work, Shapley shared the 2012 Nobel Prize in Economics. The theory has been widely used in many practical scenarios, such as school choice (matching students and schools), housing allocation (matching people and houses), and job searching (matching employees and employers). The core of the theory lies in reaching a stable state, which means that, at the end of linking, there does NOT exist ANY pair of entities in the bilateral market that prefer each other to their currently linked targets. If the source network and the target network are regarded as a bilateral market, user accounts can be seen as the entities of that market. The problem of “how to select the best target account in one's preference order set” is then converted to “how to find a cooperation (linking) strategy in the markets (networks) that maximizes the interest (linking results).” Therefore, based on the idea of stable matching, we link accounts according to their preference order sets.
Broken Account Pair. Suppose an account $a_i^s$ is linked to $a_j^t$ and an account $a_k^s$ is linked to $a_l^t$. If there is a pair $(a_i^s, a_l^t)$ such that $a_i^s$ ranks $a_l^t$ above $a_j^t$ in its preference order set and $a_l^t$ ranks $a_i^s$ above $a_k^s$ in its preference order set, then $(a_i^s, a_l^t)$ is called a broken account pair because it breaks the currently linked pairs.
Stable Matching. If there does NOT exist ANY broken account pair at the end of linking, the entire linking is called a stable matching.
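The stability condition can be checked directly from the two definitions. The sketch below is illustrative only: the function name and data layout are hypothetical, and it additionally assumes that an unlinked account prefers any account in its preference order set to staying unlinked, a case the definitions above do not cover.

```python
def find_broken_pairs(links, source_prefs, target_prefs):
    """Return every (source, target) pair that would break the current links."""
    linked_source = {t: s for s, t in links.items()}

    def prefers(prefs, candidate, current):
        if candidate not in prefs:
            return False  # an account never prefers someone outside its order set
        if current is None or current not in prefs:
            return True   # assumption: any listed account beats no listed partner
        return prefs.index(candidate) < prefs.index(current)

    broken = []
    for s, prefs in source_prefs.items():
        for t in prefs:
            if links.get(s) == t:
                continue
            if prefers(prefs, t, links.get(s)) and prefers(
                target_prefs.get(t, []), s, linked_source.get(t)
            ):
                broken.append((s, t))
    return broken


# Hypothetical preference order sets, most preferred account first.
source_prefs = {"s1": ["t2", "t1"], "s2": ["t1", "t2"]}
target_prefs = {"t1": ["s1", "s2"], "t2": ["s1", "s2"]}

unstable = find_broken_pairs({"s1": "t1", "s2": "t2"}, source_prefs, target_prefs)
stable = find_broken_pairs({"s1": "t2", "s2": "t1"}, source_prefs, target_prefs)
```

In this toy market the first linking is broken by the pair of `s1` and `t2`, which prefer each other to their current partners, while the second linking has no broken pair and is therefore a stable matching.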
The Gale-Shapley (GS) deferred acceptance algorithm can achieve a stable matching in a bilateral market. However, the standard GS algorithm requires that both sides of the market contain the same number $n$ of entities and that the preference order set of each entity also has size $n$. That is to say, “the number of bilateral market entities is the same” and “each preference order set is complete.” These two restrictions are difficult to meet: the two account sets rarely have the same size, and because of missing attributes some feature vectors cannot be calculated, so a complete order set cannot be obtained. We therefore make two adaptations.
(1) Fake account: an account that does NOT actually exist is called a fake account. In a linking process, fake accounts are added to the smaller account set until the two sets have the same size, and when linking is completed all the pairs that contain a fake account are excluded.
(2) Uncompleted user preference order set: a user preference order set that does NOT include ALL the accounts in the target network is called an uncompleted user preference order set. In a linking process, if $a_j^t$ is not in the user preference order set of $a_i^s$, we directly deny this link.
According to this, we propose a stable-matching-based user linking method with user preference order (Stable User Linking with Preference order, SULP) as shown in Algorithm 1.
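A minimal sketch of deferred acceptance with the two adaptations is given here for orientation; it is not a transcription of Algorithm 1. It assumes that both sides hold preference order sets, that source accounts propose, and that leaving an account unlinked is equivalent to linking it to a fake account that is discarded afterwards. The function and variable names are hypothetical.

```python
from collections import deque


def stable_link(source_prefs, target_prefs):
    """Deferred acceptance over possibly incomplete preference order sets.

    Returns {source: target} for every stable link that was found.
    """
    # rank[t][s] is the position of source s in target t's order set.
    rank = {t: {s: i for i, s in enumerate(prefs)} for t, prefs in target_prefs.items()}
    next_choice = {s: 0 for s in source_prefs}
    holder = {}  # target -> source that currently holds it
    free = deque(source_prefs)

    while free:
        s = free.popleft()
        prefs = source_prefs[s]
        while next_choice[s] < len(prefs):
            t = prefs[next_choice[s]]
            next_choice[s] += 1
            if s not in rank.get(t, {}):
                continue  # t does not list s: the link is denied
            current = holder.get(t)
            if current is None or rank[t][s] < rank[t][current]:
                holder[t] = s
                if current is not None:
                    free.append(current)  # the displaced account proposes again
                break
        # An account that exhausts its order set stays unlinked, which plays
        # the role of being matched to a discarded fake account (assumption).

    return {s: t for t, s in holder.items()}


# Hypothetical market with three source accounts and two target accounts.
source_prefs = {"s1": ["t1", "t2"], "s2": ["t1"], "s3": ["t2", "t1"]}
target_prefs = {"t1": ["s2", "s1"], "t2": ["s1", "s3"]}
links = stable_link(source_prefs, target_prefs)
```

In this example `s3` ends up unlinked: `t2` prefers `s1`, and `t1` does not list `s3` at all, so that link is denied outright.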
Through Algorithm 1, this paper combines the user preference order with the stable matching of cooperative game theory to achieve user linking. The next section describes how to strengthen the result of this method.
3.3. User Linking Based on Prior Knowledge
Consistent with the traditional linking methods, the method proposed above is still based on the similarity of account features. However, as network platforms become functionally specialized, users usually express different interests through their accounts on different platforms, and these interests are likely to have little similarity. Therefore, user linking is not only “how to link accounts by similarity” but also “how to identify and link the accounts which are dissimilar but belong to the same user.” The latter is extremely challenging, and existing research has not yet offered an effective solution. In this paper, we input some users' linked accounts as prior knowledge to strengthen the user linking method proposed in Section 3.2.
Because the preference order set of an entity in the bilateral market is built from feature similarity, the above method cannot adequately reflect the correlation information among different accounts. To add correlation information through prior knowledge, we define the prior candidate account set as follows.
Prior Candidate Account Set. For an account $a_i^s$, given its linked account $a_j^t$, $a_j^t$ is called a prior candidate account of $a_i^s$. In the matching process, suppose $a_i^s$ is about to be matched with an account $a_k^t$. If $a_k^t$ is NOT a prior candidate account of $a_i^s$, then, regardless of the preference order set, $a_i^s$ is linked to $a_j^t$. If $a_k^t$ IS a prior candidate account of $a_i^s$, the order of the preference set is followed.
Based on the definition above, we further propose a reinforced algorithm (EXtended Stable User Linking with Preference order, EXSULP) based on prior knowledge. Only the improved part is shown in Algorithm 2.
According to the algorithm, we input the already linked accounts as prior knowledge, further strengthening the possible correlation between the accounts. Finally, all the eligible pairs $(a_i^s, a_j^t)$ are taken as the final result of user linking between the source network $G^s$ and the target network $G^t$.
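One way to read the prior-knowledge rule is that accounts with a known link are fixed first and the remaining accounts are matched by deferred acceptance. The sketch below follows that reading; it is an interpretation under stated assumptions, not a transcription of Algorithm 2. It assumes the prior links are one-to-one, and all names are hypothetical.

```python
from collections import deque


def stable_link_with_prior(source_prefs, target_prefs, prior_links):
    """Deferred acceptance in which known links override the preference orders."""
    links = dict(prior_links)                # known links are accepted as they are
    reserved = set(prior_links.values())     # targets already claimed by prior knowledge
    rank = {
        t: {s: i for i, s in enumerate(prefs)}
        for t, prefs in target_prefs.items()
        if t not in reserved
    }
    next_choice = {s: 0 for s in source_prefs}
    holder = {}  # target -> source that currently holds it
    free = deque(s for s in source_prefs if s not in prior_links)

    while free:
        s = free.popleft()
        prefs = source_prefs[s]
        while next_choice[s] < len(prefs):
            t = prefs[next_choice[s]]
            next_choice[s] += 1
            if t in reserved or s not in rank.get(t, {}):
                continue  # reserved by a prior link, or t does not list s
            current = holder.get(t)
            if current is None or rank[t][s] < rank[t][current]:
                holder[t] = s
                if current is not None:
                    free.append(current)
                break

    links.update({s: t for t, s in holder.items()})
    return links


# Hypothetical market; the prior knowledge states that s3 and t2 are the same user.
source_prefs = {"s1": ["t1", "t2"], "s2": ["t1"], "s3": ["t2", "t1"]}
target_prefs = {"t1": ["s2", "s1"], "t2": ["s1", "s3"]}
links = stable_link_with_prior(source_prefs, target_prefs, {"s3": "t2"})
```

Compared with matching on preferences alone, fixing the link between `s3` and `t2` also changes the outcome for `s1`, which can no longer take `t2`; this is how prior knowledge about one account propagates to the others.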
4. Experiments
In this section, based on the LifeSpec dataset provided by Microsoft Research Asia, we use the standard SVM, the SVM based on cooperative game theory, and the reinforced SVM based on prior knowledge, respectively, to analyze user linking. The experiment code is publicly available on GitHub: https://github.com/Observerspy/UserStableMatching.
4.1. Dataset Description
LifeSpec is a computational framework developed by Microsoft Research Asia for discovering and hierarchically categorizing urban lifestyles. The LifeSpec dataset is composed of data from tens of millions of users about check-ins, movie comments, book comments, music comments, and behavior. In this paper, we attempt to link users by taking the books set as the source network $G^s$ and the movies set as the target network $G^t$.
As shown in Table 2, we selected a total of 62,558 different users.
(1) Books dataset: contains 34,942 different accounts, 523,064 books, and 2,118,400 comments; each record contains the title, author, publisher, date of issue, number of pages, price, packaging, labels, user ratings, and other information.
(2) Movies dataset: contains 41,823 different accounts, 82,868 movies, and 8,397,846 comments; each record contains the name, director, screenwriter, starring actors, category, country, duration, release date, labels, user ratings, and other information.
The total number of pairs in this dataset is 1,461,379,266 (34,942 × 41,823). Because, in such a large-scale dataset, the ratio of positive to negative instances is often more skewed than 1 : 10000, we balanced it to about 1 : 1 by random undersampling.
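Random undersampling to a ratio of about 1 : 1 can be sketched as follows. The function name, the fixed seed, and the 1/0 label encoding are assumptions made for the illustration; the published code may sample differently.

```python
import random


def undersample(pairs, labels, ratio=1.0, seed=0):
    """Keep all positive pairs and a random subset of the negative pairs.

    ratio is the desired number of negatives per positive.
    """
    rng = random.Random(seed)
    positives = [p for p, y in zip(pairs, labels) if y == 1]
    negatives = [p for p, y in zip(pairs, labels) if y == 0]
    keep = min(len(negatives), int(round(ratio * len(positives))))
    sampled = rng.sample(negatives, keep)
    data = [(p, 1) for p in positives] + [(p, 0) for p in sampled]
    rng.shuffle(data)
    return data


# The pair count reported above is the product of the two account-set sizes.
total_pairs = 34942 * 41823

# Hypothetical toy data: 3 linked pairs among 103 candidate pairs.
pairs = [("b%d" % i, "m%d" % i) for i in range(103)]
labels = [1, 1, 1] + [0] * 100
balanced = undersample(pairs, labels)
```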
4.2. Performance of User Linking Methods
We took the labels of books and movies as the account features and the frequency of each label as the feature value. Because the dimension of the input feature vector is small, we use a Gaussian kernel SVM with the cost value set to 1 and the remaining parameters left at their defaults, evaluated by ten repetitions of 10-fold cross-validation. The Support Vector Machine and the posterior probability calculations are provided by the LibSVM tools. The compared methods are summarized as follows.
(1) SVM_Label: the baseline method, which uses SVM to do a link/nonlink classification only in the label feature space.
(2) SULP: the stable-matching-based user linking method with user preference order proposed in Section 3.2.
(3) EXSULP: the extended user linking method proposed in Section 3.3.
Because the user linking problem is concerned only with the correct links (positive instances), we select precision $P$, recall $R$, and the $F_1$ value as the evaluation metrics. The average result of ten repetitions of 10-fold cross-validation is shown in Table 3.
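For reference, the three metrics can be computed over sets of links as in the sketch below, which uses the standard definitions of precision, recall, and their harmonic mean. The function name and the example links are hypothetical.

```python
def link_metrics(predicted_links, true_links):
    """Precision, recall, and F1 computed on positive (linked) pairs only."""
    predicted, actual = set(predicted_links), set(true_links)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical output: three predicted links, two of them correct, four true links.
predicted = [("s1", "t1"), ("s2", "t2"), ("s3", "t4")]
actual = [("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")]
precision, recall, f1 = link_metrics(predicted, actual)
```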
The results show that the two methods proposed in this paper surpass the baseline method on precision $P$, recall $R$, and $F_1$: SULP improves accuracy by about 21.6%, and adding the prior knowledge yields a further increase of about 7.8%. Whereas other studies use a large amount of users' personal information, texts, behaviors, and so on, we achieve this precision using only the labels as features. Moreover, different from other stable matching methods, we remove the two restrictions that “the number of bilateral market entities must be the same” and “the preference order set is complete.” Therefore, on complex and sparse real datasets, the method proposed in this paper can be considered to have better practical significance.
4.3. Analysis of Prior Knowledge
The experiment above shows that prior knowledge can improve the performance of user linking. Clearly, the proportion of prior knowledge in the whole data influences the final linking results. Therefore, we analyze the EXSULP algorithm by taking a part of the incorrect classification results obtained from the SULP algorithm (2158 in total) as prior knowledge and varying the proportion of this prior knowledge to observe its effect.
Expansion Rate. The expansion rate represents the ability of the EXSULP algorithm to extend the linking results.
The result is shown in Figure 1.
Figure 1: (a) Effect on the $P$, $R$, and $F_1$ values. (b) Effect on the expansion rate.
The results show that the $P$, $R$, and $F_1$ values increase steadily as the proportion of prior knowledge increases. The linking result can thus be considered to improve in proportion to the amount of prior knowledge, with precision enhanced by up to about 7.8%. The expansion rate shows that the results of the algorithm gradually stabilize as the scale of prior knowledge increases. This experiment shows that prior knowledge can enhance the correlation among accounts, illustrating the effectiveness of our method.
4.4. Case Study
We choose four linked results to display and analyze in Table 4. The top 10 coexisting labels are given (translated into English, with the names of works in italics); items 1–3 are correct links and item 4 is a wrong link.
As can be seen from Table 4, because of the semantics of the labels, accounts can be correctly linked when their coexisting labels are specific enough. This further illustrates that the problem of user linking can be solved according to each user's specific and unique interest labels. However, as shown in item 4, when the labels are abstract and general terms, the accounts cannot be linked correctly. When the labels of a linked account that is input as prior knowledge contain such abstract and general terms, the prior knowledge can effectively reduce the misclassification made by a classifier that relies on calculated features.
5. Conclusion
In this paper, we have studied the user linking problem and proposed a stable-matching-based method with user preference order. We relaxed the restrictions of the traditional stable matching algorithm and enhanced the result of user linking by inputting prior knowledge. Experiments show that, on a real dataset, our method achieves a good result using only the website labels as features, which demonstrates the effectiveness of the approach. In future research, we will further study how to extract accurate and efficient features from sparse data and how to enhance the correlation between different accounts.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The work is supported by the National Natural Science Foundation of China (Grant nos. 61309007, U1636219) and the National Key Research and Development Program of China (Grant no. 2016YFB0801303). The dataset is provided by Microsoft Research Asia.
References
Facebook Inc, “Facebook quarterly report [sections 13 or 15(d)],” https://www.sec.gov/Archives/edgar/data/1326801/000132680116000087/0001326801-16-000087-index.htm.
Twitter, “Twitter quarterly report [sections 13 or 15(d)],” https://www.sec.gov/Archives/edgar/data/1418091/000156459016026749/0001564590-16-026749-index.htm.
Instagram Inc, “600 million and counting.”
R. Zafarani and H. Liu, “Connecting corresponding identities across communities,” in Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media (ICWSM '09), pp. 354–357, San Jose, Calif, USA, May 2009.
T. Iofciu, P. Fankhauser, F. Abel, and K. Bischoff, “Identifying users across social tagging systems,” in Proceedings of the 5th Annual Conference on Weblogs and Social Media (ICWSM '11), Barcelona, Spain, July 2011.
Y. Zhong, N. J. Yuan, W. Zhong, F. Zhang, and X. Xie, “You are where you go: inferring demographic attributes from location check-ins,” in Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM '15), pp. 295–304, ACM, Shanghai, China, February 2015.
N. J. Yuan, F. Zhang, D. Lian, K. Zheng, S. Yu, and X. Xie, “We know how you live: exploring the spectrum of urban lifestyles,” in Proceedings of the 1st ACM Conference on Online Social Networks (COSN '13), pp. 3–14, ACM, Boston, Mass, USA, 2013.
J. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
Y. Chen and O. Kesten, “From boston to shanghai to deferred acceptance: theory and experiments on a family of school choice mechanisms,” in Proceedings of the International Conference on Auctions, Market Mechanisms and Their Applications, pp. 58–59, Springer, New York, NY, USA, April 2011.
P. Guillen and O. Kesten, On-campus housing: theory vs. experiment.
X. Kong, J. Zhang, and P. S. Yu, “Inferring anchor links across multiple heterogeneous social networks,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 179–188, ACM, San Francisco, Calif, USA, 2013.