Security and Communication Networks

Volume 2018, Article ID 5923156, 8 pages

https://doi.org/10.1155/2018/5923156

## Identifying Fake Accounts on Social Networks Based on Graph Analysis and Classification Algorithms

^{1}Department of Computer, Borujerd Branch, Islamic Azad University, Borujerd, Iran

^{2}Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran

^{3}Computer Engineering Department, Science and Research Branch, Islamic Azad University, Tehran, Iran

^{4}Computer Science, University of Human Development, Sulaymaniyah, Iraq

Correspondence should be addressed to Mohammad Ebrahim Shiri; shiri@aut.ac.ir

Received 7 April 2018; Accepted 28 June 2018; Published 5 August 2018

Academic Editor: Tom Chen

Copyright © 2018 Mohammadreza Mohammadrezaei et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Social networks have become popular due to their ability to connect people around the world and to share videos, photos, and messages. One of the security challenges in these networks, which has become a major concern for users, is the creation of fake accounts. In this paper, a new model based on the similarity between users’ friend networks is proposed in order to discover fake accounts in social networks. Similarity measures such as common friends, cosine, Jaccard, L1-measure, and weight similarity were calculated from the adjacency matrix of the corresponding graph of the social network. To evaluate the proposed model, all steps were implemented on a Twitter dataset. It was found that the Medium Gaussian SVM algorithm predicts fake accounts with a high area under the curve (AUC = 1) and a low false positive rate (0.02).

#### 1. Introduction

The use of social networks such as Facebook, Twitter, Google+, Instagram, and LinkedIn is on the rise [1]. Individuals and organizations use social networks to express their views, advertise their products, and announce the future policies of their companies and organizations. As the use of social networks expands, malicious users seek to violate the privacy of other users and abuse their names and credentials by creating fake accounts, which has become a concern for users. Hence, social network providers are trying to detect malicious users and fake accounts in order to eliminate them from social networking environments. Creating fake accounts in social networks causes more damage than any other cybercrime.

Removing fake accounts has attracted the attention of many researchers; thus, extensive research has been carried out on the identification of fake accounts in social networks. Different approaches are proposed in [2], [3], [4], and [5] to find fake accounts based on attribute similarity, similarity of friend networks, profile analysis over a time interval, and attribute similarity together with IP address. Kontaxis et al. [6] proposed a scalable approach which can be used to discover a group of fake accounts made by a single user. Their main technique was supervised machine learning to classify clusters of malicious or legitimate accounts. Conti et al. [4] provided a framework for discovering fake accounts based on the growth rate of the social network graph and the interaction of regular users on the network with their friends. Gurajala et al. [7] used map-reduce techniques and pattern recognition approaches to discover fake profiles. To distinguish fake from genuine accounts, the rate of followers and of friends collected per day was used for each account. In [8], they used a combination of pattern matching (on screen names) and analysis of update times in their methodology to discover fake accounts. Kagan et al. [9] offered an unsupervised two-layer meta-classifier method that can detect anomalous nodes in a complex network by using properties extracted from the graph topology. They also showed that the proposed algorithm can detect fake users and recognize influential users in the network. Boshmaf et al. [10] provided a robust and scalable defense system called “Íntegro” which, by ranking users, places fake accounts at the lowest ranks. Sakariyah et al. examined the four main categories of malicious accounts on social networks. Cao et al. [11] introduced a forwarding-message tree with six effective features which is used to investigate the relationships between accounts and detect suspicious accounts. The problems in discovering fake accounts in previous research are stated below:

The use of similarity measures that do not consider the strength of the friendship network shared among users [3]; we believe that the more connected the shared friendship network of two users is, the greater their similarity.

Due to the high volume of information, the use of machine learning techniques leads to the overfitting problem [6].

In some previous works, in order to implement the proposed methods, some normal users were assumed to be fake, because the number of fake users in datasets is much lower than that of normal users. This assumption is fundamentally flawed and disputes the logic of learning [3]. The aim of this article is to provide a model that solves these problems and improves detection efficiency. This paper improves the efficiency of detecting fake accounts on social networks with a method that preprocesses the data using similarity measures capturing the strength of relationships among an account's friends, applies feature extraction to prevent the overfitting problem, and generates artificial fake accounts with resampling methods to balance the dataset.

In the proposed method, the similarity matrices between accounts were calculated from the graph adjacency matrix; then the PCA algorithm was used for feature extraction, followed by SMOTE for data balancing. The linear SVM, Medium Gaussian SVM, and logistic regression algorithms were then used to classify the nodes. Finally, the performance of this method was evaluated using various classifier algorithms.

The remainder of this paper is organized as follows: graph analysis and similarity measures are reviewed in Section 2. Section 3 reviews resampling, principal component analysis, and machine learning concepts. The methodology is described in Section 4; Section 5 presents the experiments on the Twitter dataset and the performance results. Conclusions and future work are presented in Section 6.

#### 2. Graph Analysis and Similarity Types

Graph analysis is used in many applications, such as circuit diagram representation, shape detection, image matching, and social network analysis. Most social network problems are solved by analyzing the network's graph; graph similarity measures therefore reduce the complexity of graph analysis problems by using different techniques. Some of these measures are defined below.

A social network maps into a graph G = (N, E), where the set of nodes N represents the social network users and the set of edges E ⊆ N × N represents the relationships. In addition, the dot sign is used to refer to a particular component of a graph.

(1) Adjacency matrix: $A$ denotes the sparse adjacency matrix of G. If $(v, u)$ is an edge in G, then $A(v, u) = 1$; otherwise, $A(v, u) = 0$.

(2) Friendship graph (FG): Given the social network graph G and a node $v$, the friendship graph of $v$ is the subgraph containing all vertices directly connected to $v$, as defined in (1) [12]:
$$FG(v).N = \{u \in G.N \mid (v, u) \in G.E\}, \qquad FG(v).E = \{(u, w) \in G.E \mid u, w \in FG(v).N\} \tag{1}$$
where $FG(v).N$ and $FG(v).E$ denote the vertices directly connected to $v$ and the relationships among those vertices, respectively.

(3) Common friends (CF): One of the similarity measures in social networks is the number of shared friends. Given a social network G and two nodes $v, u$, all vertices on a path of length two between the two nodes are their common friends, as shown in (2) [13, 14]:
$$CF(v, u) = |FG(v).N \cap FG(u).N| \tag{2}$$

(4) Total friends (TF): It counts the distinct friends of the two nodes $v$ and $u$, as shown in (3) [12]:
$$TF(v, u) = |FG(v).N \cup FG(u).N| \tag{3}$$

(5) Jaccard similarity (JS): The Jaccard coefficient represents the similarity between sample sets; here it is the ratio of the common friends of the two nodes to all of their friends, as shown in (4) [13]:
$$JS(v, u) = \frac{|FG(v).N \cap FG(u).N|}{|FG(v).N \cup FG(u).N|} \tag{4}$$

(6) Cosine similarity: Another similarity measure between nodes is the cosine similarity, which measures the similarity between the two friend sets normalized by their sizes, as shown in (5) [12]:
$$Cos(v, u) = \frac{|FG(v).N \cap FG(u).N|}{\sqrt{|FG(v).N| \cdot |FG(u).N|}} \tag{5}$$

(7) L1 norm similarity: This measure divides the overlapping part of the two nodes' friend sets by the product of their sizes, as shown in (6) [12]:
$$L1(v, u) = \frac{|FG(v).N \cap FG(u).N|}{|FG(v).N| \cdot |FG(u).N|} \tag{6}$$

(8) Edge weight measure: First, a weight is calculated for each of the two end vertices, as shown in (7) and (8) [15]:
$$W(u) = \frac{1}{|FG(u).N|} \tag{7}$$
$$W(v) = \frac{1}{|FG(v).N|} \tag{8}$$
Then, the weight of the edge between the two vertices $u$ and $v$ is calculated in two ways: the total weight, defined in (9) as the sum of the two weights, and the weight coefficient, defined in (10) as their product:
$$W_{sum}(u, v) = W(u) + W(v) \tag{9}$$
$$W_{coef}(u, v) = W(u) \times W(v) \tag{10}$$
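The similarity measures above can be computed for all node pairs at once from the adjacency matrix. The sketch below is a hypothetical NumPy illustration, not the authors' implementation; in particular, the per-node weight 1/|FG(v).N| used for the edge weight measure is our reading of the text, labeled as an assumption.

```python
import numpy as np

# Toy undirected social network graph, no self-loops
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

CF = A @ A                                # common friends: paths of length two
deg = A.sum(axis=1)                       # |FG(v).N| for each node v
TF = deg[:, None] + deg[None, :] - CF     # total (distinct) friends
with np.errstate(divide="ignore", invalid="ignore"):
    JS = np.where(TF > 0, CF / TF, 0.0)   # Jaccard similarity
    COS = CF / np.sqrt(np.outer(deg, deg))  # cosine similarity
    L1 = CF / np.outer(deg, deg)            # L1-measure

W = 1.0 / deg                   # assumed per-node weight, 1 / |FG(v).N|
W_sum = W[:, None] + W[None, :]  # total weights
W_prod = np.outer(W, W)          # weight coefficient

print(CF[0, 3], JS[0, 3])  # nodes 0 and 3 share friends 1 and 2
```

Each matrix here is one of the similarity matrices that the proposed method later stacks into the feature set for classification.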

#### 3. Introduction of Resampling, Principal Component Analysis, and Machine Learning

##### 3.1. Resampling

One of the problems in data classification is the unbalanced distribution of data, in which some classes contain many more items than others. This problem arises most often in two-class applications, where one class has far more items than the other. Resampling means changing the distribution of the training sample set by processing the data, and there are several resampling approaches for improving classification performance by balancing the dataset [16]. Resampling may balance the class distribution by removing samples of the majority class (undersampling) or by increasing samples of the minority class (oversampling). Because ordinary oversampling approaches replicate minority-class samples from the original data, they may increase noise and processing time, cause overfitting, and decrease efficiency. A third approach, the Synthetic Minority Oversampling Technique (SMOTE), generates artificial minority-class samples based on the similarity of features among minority-class items. In the proposed model, because node-similarity features are used and no information should be removed, the SMOTE method was chosen.

Chawla et al. [17] proposed the SMOTE algorithm. This algorithm randomly creates items of the minority class based on a certain rule and combines these new samples with the original dataset to produce a new training set. In the oversampling process, different minority-class samples play different roles: samples on the margin of the minority class contribute more than those at its center, and may improve the decision boundary and the classification rate for minority-class samples.
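A minimal sketch of the SMOTE idea follows, assuming the standard formulation: interpolate between a minority-class sample and one of its k nearest minority-class neighbors. Production code would normally use a library implementation such as imbalanced-learn; this version exists only to show the rule.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between a random minority sample and one of its k nearest
    minority-class neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = X_min.shape[0]
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                              # random minority sample
        nb = neighbors[i, rng.integers(min(k, n - 1))]   # random near neighbor
        gap = rng.random()                               # interpolation factor
        synthetic[j] = X_min[i] + gap * (X_min[nb] - X_min[i])
    return synthetic

X_min = np.random.default_rng(1).normal(size=(20, 3))
X_new = smote(X_min, n_new=40)
print(X_new.shape)  # (40, 3)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies inside the minority class's bounding box, which is what distinguishes SMOTE from replication-based oversampling.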

##### 3.2. Principal Component Analysis

Principal component analysis (PCA) is one of the classical multivariate methods, and perhaps the oldest and most popular one [18]. Multivariate analysis has a fundamental role in data analysis: multivariate datasets contain many variables or observed modes, and each observation can have multiple dimensions. Because a multidimensional space is often difficult to perceive, PCA reduces the dimensions of the observations by combining indices and grouping similar observations [18, 19]. PCA is one of the most valuable results of applied linear algebra and is used abundantly in all forms of analysis, because it is a simple, nonparametric method for extracting relevant information from a complex dataset. In this method, the variables of a multivariate space are summarized into a set of uncorrelated components, each of which is a linear combination of the original variables. The uncorrelated components obtained are the principal components, derived from the eigendecomposition of the covariance or correlation matrix of the original variables. The method is mainly used to reduce the number of variables and to find the relational structure among them. The principal components capture the largest variance in the dataset and are mutually uncorrelated. One of the most important issues in PCA is the selection of the number of components, and several criteria, both formal and informal, have been proposed. In the informal approach, a precision appropriate for the data and the desired results is determined first, and the number of components is then selected based on the cumulative percentage of explained variation, typically between 80 and 90% of the total. A formal method for choosing the number of PCs, known as Kaiser's rule, retains the components whose eigenvalues are greater than one.
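The two component-selection rules above can be sketched as follows, assuming PCA via eigendecomposition of the sample covariance matrix; this is an illustration, not tied to the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # one strongly correlated column

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of covariance
order = np.argsort(eigvals)[::-1]        # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
# informal rule: keep components covering ~90% of total variation
k_cum = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
# Kaiser's rule (appropriate for standardized data): eigenvalues > 1
k_kaiser = int((eigvals > 1).sum())

scores = Xc @ eigvecs[:, :k_cum]         # projected, uncorrelated components
print(k_cum, k_kaiser, scores.shape)
```

The correlated column inflates one eigenvalue and collapses another toward zero, so both rules drop at least one component, which is exactly the redundancy-removal behavior described above.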

##### 3.3. Machine Learning

Most fake-account detection methods train classifiers with machine learning algorithms. These classifiers are based on various social network attributes such as attribute similarity, friend-network similarity, and IP address analysis. The machine learning classifiers used in the proposed model are introduced below.

###### 3.3.1. Support Vector Machine

The support vector machine (SVM), proposed in [20], is a learning algorithm based on statistical learning theory. SVM implements the principle of structural risk minimization, which minimizes the empirical error and the complexity of the learner at the same time and achieves good generalization performance in classification and regression tasks. The goal of SVM classification is to construct the optimal hyperplane with the largest margin; in general, the larger the margin, the lower the generalization error of the classifier.

In this article, SVM was trained with linear and Gaussian kernels. The Gaussian kernel places normal curves around the data points and sums them, so that the decision boundary can be defined by a level-set condition, such as the curve where the sum exceeds 0.5.
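The difference between the two kernels can be illustrated on data that no hyperplane separates, for example points inside versus outside a ring. This is a hypothetical example; the class names and parameters are scikit-learn's, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 200 points inside a disk of radius 1, 200 points in a ring of radius 2-3
r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 200 + [1] * 200)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)  # Gaussian kernel

print("linear accuracy:", linear.score(X, y))
print("rbf accuracy:", rbf.score(X, y))
```

The linear kernel cannot do much better than chance on concentric classes, while the Gaussian kernel separates them almost perfectly, which is why the Medium Gaussian SVM is a natural candidate when the fake/normal boundary is nonlinear in the similarity features.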

###### 3.3.2. Logistic Regression

Given a set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ of m training samples with $x^{(i)} \in \mathbb{R}^{n}$ as feature inputs and $y^{(i)} \in \{0, 1\}$ as labels, logistic regression can be shown as
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-\alpha^{T} x}}$$
where $\alpha \in \mathbb{R}^{n}$ are the model parameters.

Without regularization, logistic regression finds the parameters using the maximum likelihood criterion, while with regularization there is a tradeoff between goodness of fit and model complexity, so fewer variables enter the model [21].

#### 4. The Proposed Method

Based on the characteristics of the fake account detection problem, our proposed method is introduced in this section. First, the adjacency matrix of the social network graph was computed. Then, the friend-network similarity measures between nodes (social network users) were calculated, and a similarity matrix was built for each of the defined measures, such as common friends, Jaccard similarity, cosine similarity, and the other measures. At the end of this step, several matrices representing the similarity between the nodes were obtained.

In such cases, the data are not balanced: about 98-99% of the accounts belong to the majority class (normal users), and a classifier trained on such data tends to ignore the minority class (fake users) and to inflate its overall accuracy by labeling all accounts as normal. To solve this problem, SMOTE was used to balance the data. The method of creating an artificial fake user is shown in Table 1.