Abstract

Web services are among the key communication software services on the Internet, and web phishing is one of the many security threats they face. Web phishing aims to steal private information, such as usernames, passwords, and credit card details, by impersonating a legitimate entity, which leads to information disclosure and property damage. This paper focuses on applying a deep learning framework to detect phishing websites. We first design two types of features for web phishing: original features and interaction features. We then present a detection model based on Deep Belief Networks (DBN). Tests using real IP flows from an ISP (Internet Service Provider) show that the DBN-based detection model achieves a true positive rate of approximately 90% and a false positive rate of 0.6%.

1. Introduction

A web service is a communication protocol and software between two electronic devices over the Internet [1]. Web services extend the World Wide Web infrastructure to provide methods for an electronic device to connect to other electronic devices [2]. Web services are built on top of open communication protocols and technologies such as TCP/IP, HTTP, Java, HTML, and XML. The web service is one of the greatest inventions of mankind so far, and it is also the most profound manifestation of the influence of computers on human beings [3].

With the rapid development of the Internet and the increasing popularity of electronic payment in web services, Internet fraud and web security have gradually become a main concern of the public [4]. Web phishing is one form of such fraud: it uses social engineering techniques through short messages, emails, and WeChat [5] to induce users to visit fake websites and give up sensitive information such as private account details, payment tokens, and credit card information.

The first phishing attack, on AOL (America Online), can be traced back to early 1995 [6]. A phisher successfully obtained AOL users' personal information. Such theft may not only lead to the abuse of credit card information but also make attacks on the online payment system entirely feasible.

Phishing activity in early 2016 was the highest recorded since monitoring began in 2004. The total number of phishing attacks in 2016 was 1,220,523, a 65 percent increase over 2015. In the fourth quarter of 2004, there were 1,609 phishing attacks per month. In the fourth quarter of 2016, there was an average of 92,564 phishing attacks per month, an increase of 5,753% over 12 years [7]. According to the 3rd Microsoft Computing Safer Index Report, released in February 2014, the annual worldwide impact of phishing could be as high as $5 billion [8]. With the prevalence of networks, phishing has become one of the most serious security threats in modern society, which makes detecting and defending against web phishing an urgent and essential research task. Web phishing detection is crucial for both private users and enterprises [9].

Several solutions to combat phishing have been proposed, including specific legislation and technologies. From a technical point of view, phishing detection generally falls into the following categories: detection based on a blacklist [10] and whitelist, detection based on Uniform Resource Locator (URL) features [11], detection based on web content, and detection based on machine learning. Blacklist-based antiphishing is simple, but it cannot find new phishing websites. URL-based detection analyzes the features of a URL. The URL of a phishing website may look very similar to that of the real website to the human eye, but the two differ in IP address. Content-based detection usually refers to detecting phishing sites through page elements, such as form information, field names, and resource references.

In this paper, we focus on a detection model that uses a deep learning framework. The main contributions are as follows:
(i) We present two feature types for web phishing detection: original features and interaction features. The original features are direct features of the URL, including special characters in the URL and the age of the domain. The interaction features describe the interaction between websites, including the in-degree and out-degree of the URL.
(ii) We introduce DBN to detect web phishing. We discuss the training process of DBN and obtain appropriate parameters for detecting web phishing.
(iii) We use real IP flow data from an ISP to evaluate the effectiveness of the DBN-based detection model. The True Positive Rate (TPR) under different parameters is analyzed; our TPR is approximately 90%.

The paper is organized as follows. Related works are discussed in Section 2. The detection model and algorithm are discussed in Section 3. DBN is tested and evaluated in Section 4. The conclusion is drawn in Section 5.

2. Related Work

Researchers have conducted a lot of work in security [12–18], including secure routing [19–21], intrusion detection [22–27], intrusion prevention [28], and smart grid security [29]. Different from research problems in wireless networks [30–60] and energy networks [61–64], web phishing is the attempt to acquire sensitive information such as usernames, passwords, and credit card details, often for malicious reasons, by masquerading as a trustworthy website on the Internet. Researchers have presented several solutions to detect web phishing, as follows.

When judging whether a specific website is a phishing site, the most direct way is to use a whitelist or blacklist: we look up the URL in a database and decide. Pawan Prakash et al. [10] presented two ways to detect phishing websites with a blacklist. The first includes five heuristics that enumerate simple combinations of known phishing sites to discover new phishing URLs. The second consists of an approximate matching algorithm that dissects a URL into multiple components that are matched individually against entries in the blacklist. Many well-known browsers such as Firefox [65] and Chrome [66] also use a self-built or third-party blacklist and whitelist to identify whether a URL is a phishing site. This method is very accurate, but the blacklist or whitelist usually relies on manual maintenance and review. Consequently, these methods are not real time and may cost a lot of time and effort.

Another way to detect phishing is to analyze the features of the URL. For example, a phishing URL sometimes looks similar to the URL of a famous site or contains special characters. Samuel Marchal et al. [11] used the concept of intra-URL relatedness and evaluated it using features extracted from the words that compose a URL, based on query data from the Google and Yahoo search engines. These features are then used in machine-learning-based classification to detect phishing URLs in a real data set. This method is efficient and economical because it uses preexisting knowledge of the URL, giving fast detection at low cost. However, the characteristics of phishing cannot be fully exploited from the URL alone, because the essence of the scheme is to defraud by means of web content. Phishing attackers are very likely familiar with URL-based checks and can easily tailor their URLs to avoid detection; therefore this method results in a lower detection rate if only the URL is checked.

Content-based detection usually refers to detecting phishing sites through page elements, such as form information, field names, and resource references. Anthony Fu et al. [67] proposed an approach that detects phishing web pages using the Earth Mover’s Distance (EMD) to measure web page visual similarity. The accuracy of this method is high, but its downside is the need to collect large amounts of data as a priori knowledge.

With the popularity of machine learning, phishing detection has increasingly focused on machine learning algorithms. This approach integrates URL text features, domain name features, and web content features into a unified basis for detection. W. Chu et al. [68] presented a machine-learning-based phishing detection algorithm using only lexical and domain features. J. Ma et al. [69] described an approach to automatically classifying URLs as either malicious or benign based on supervised learning across both lexical and host-based features. In general, the essence of these methods is to map all the features of a phishing website into the same space and then use machine learning and data mining algorithms to detect phishing.

3. The Phishing Detection Model Based on DBN

3.1. Phishing Feature Extraction and Definition

First, we obtain real traffic flow from an ISP. The data set includes traffic flow captured over 40 minutes and over 24 hours. We construct a graph structure from the traffic flow and analyze the characteristics of web phishing from the view of the graph.

Each record contains the following fields:
(i) AD: user node number.
(ii) IP: user IP address.
(iii) Access time.
(iv) URL: Uniform Resource Locator, the web address accessed.
(v) Request page source.
(vi) UA: user browser type.
(vii) Server address accessed.
(viii) Cookie: user cookie.

A graph is a mathematical structure used to model pairwise relations between objects. It is also a very direct way to describe the relationships between nodes in a network, and the relationships between nodes on the Internet can be expressed in the same way. Therefore, we construct a graph to store the real traffic flow data and describe the relationships between the nodes in the traffic flow.

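Given an undirected graph $G = (V, E)$, the vertex set $V$ includes two kinds of node: (i) user nodes (AD) and (ii) accessed URL nodes. Each edge in $E$ denotes an access relationship between a user node and a URL node, or a link between two URL nodes.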

The vertices of the graph are as follows:
(i) A user node has one attribute: total access times (vertex out-degree).
(ii) A URL node has two attributes: total accessed times (vertex in-degree) and website registration time.

The edges of the graph have the following attributes:
(i) The number of visits: the number of occurrences of the edge, that is, the number of times an AD has accessed a URL or the number of direct links between two URLs, depending on the corresponding vertex types.
(ii) Cookie: the cookie field in the access record.
(iii) UA: the User Agent in the access record.
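The graph construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record keys `ad`, `url`, and `referer` are hypothetical names for the user node, accessed address, and request page source fields, and edges are stored as ordered pairs so that in-degree and out-degree can be counted, even though the paper describes the graph as undirected.

```python
from collections import Counter, defaultdict

def build_access_graph(records):
    """Aggregate access records into vertex attributes and edge visit counts.

    Each record is a dict with the hypothetical keys "ad" (user node),
    "url" (accessed address) and "referer" (request page source, or None).
    """
    edge_visits = Counter()          # (source, target) -> number of visits
    user_out_degree = Counter()      # user node -> total access times
    url_in_degree = Counter()        # URL node -> total accessed times
    linked_from = defaultdict(set)   # URL node -> URLs that link to it
    links_to = defaultdict(set)      # URL node -> URLs it links to

    for rec in records:
        ad, url, referer = rec["ad"], rec["url"], rec.get("referer")
        edge_visits[(ad, url)] += 1
        user_out_degree[ad] += 1
        url_in_degree[url] += 1
        if referer and referer != url:
            # Assumption: a referer that differs from the target is a URL-to-URL link.
            edge_visits[(referer, url)] += 1
            linked_from[url].add(referer)
            links_to[referer].add(url)

    return {
        "edge_visits": edge_visits,
        "user_out_degree": user_out_degree,
        "url_in_degree": url_in_degree,
        "linked_from": linked_from,
        "links_to": links_to,
    }
```

The returned counters hold the vertex attributes (access counts per user and per URL) and the per-edge visit counts; the cookie and UA edge attributes could be attached in the same loop.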

3.2. Feature Definition

We define two kinds of features to detect web phishing: original features and interaction features.

3.2.1. Original Feature

Phishing URLs show some distinctive features, such as special characters. We define these URL features as original features, as follows:
(i) The URL contains special characters, such as @ or Unicode characters, that are not allowed in a normal URL.
(ii) The URL contains too many dots; a normal URL has fewer than four dots.
(iii) The age of the domain is too short; for example, the age of a normal domain is more than 3 months.

To quantify the above characteristics, every feature value is binary, that is, either 0 or 1. Intuitively, the more 1s that appear among the features, the higher the likelihood that the site is a phishing site.
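As a rough illustration of how the three original features could be turned into binary values, consider the following sketch. The function name, the treatment of any non-ASCII character as "Unicode", the 90-day reading of "3 months", and the caller-supplied registration date (for example, from a WHOIS lookup) are all assumptions made for this example, not details given in the paper.

```python
from datetime import date

def original_features(url, registered_on, today, max_dots=3, min_age_days=90):
    """Return the three binary original features for one URL.

    max_dots=3 encodes "a normal URL has fewer than four dots";
    min_age_days=90 approximates "more than 3 months" (assumption).
    """
    # Feature 1: special characters such as '@' or non-ASCII (Unicode) characters.
    has_special = int("@" in url or any(ord(ch) > 127 for ch in url))
    # Feature 2: too many dots in the URL.
    too_many_dots = int(url.count(".") > max_dots)
    # Feature 3: the domain was registered too recently.
    young_domain = int((today - registered_on).days < min_age_days)
    return [has_special, too_many_dots, young_domain]
```

A feature vector of `[1, 1, 1]` would then be the most suspicious case, matching the intuition that more 1s indicate a more likely phishing site.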

3.2.2. Interaction Feature

There are some features in the graph $G$, such as access frequency. We define the features derived from node relationships as interaction features, as follows:
(i) The in-degree of a URL node from other URL nodes is very small. In general, normal websites do not link to phishing sites; phishing sites are accessed directly.
(ii) The out-degree of a URL node is very small. To obtain personal private information, phishing sites are usually terminal websites and do not link to other sites.
(iii) The frequency of access to a URL node from a user node is one. In general, a user accesses a phishing site only once and does not return.
(iv) When a user node accesses a URL node, the user's browser is not a mainstream browser. Well-known browser vendors often build in a plug-in that filters phishing sites, so a user with an unknown browser is more likely to reach a phishing site.
(v) There is no cookie on the user side. A phishing site does not leave its cookie with the user.
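The five interaction features can likewise be expressed as binary values once per-URL statistics have been collected from the graph. The sketch below is hypothetical: the dictionary keys, the threshold for "very small", and the list of mainstream browser names are assumptions chosen only to make the example concrete.

```python
# Assumption: substrings that identify a mainstream browser in the User Agent.
MAINSTREAM_BROWSERS = ("Chrome", "Firefox", "Safari", "Edge", "MSIE")

def interaction_features(stats, degree_threshold=1):
    """Return the five binary interaction features for one URL node.

    stats is a dict with hypothetical keys:
      in_links, out_links  -> number of URL nodes linking to / linked from it
      max_visits_per_user  -> largest visit count of any single user
      user_agent, cookie   -> fields taken from the access record
    """
    return [
        int(stats["in_links"] <= degree_threshold),     # (i) very small in-degree
        int(stats["out_links"] <= degree_threshold),    # (ii) very small out-degree
        int(stats["max_visits_per_user"] == 1),         # (iii) visited only once
        int(not any(name in stats["user_agent"]         # (iv) not a mainstream browser
                    for name in MAINSTREAM_BROWSERS)),
        int(not stats["cookie"]),                       # (v) no cookie
    ]
```

In practice the thresholds would need to be tuned on labeled data; the values here only illustrate the binary encoding.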

3.3. Detection Based on DBN

DBN is a deep learning model composed of stacked layers, each of which is a restricted Boltzmann machine (RBM) that contains a layer of visible units representing the data and a layer of hidden units [70].

DBN can extract phishing features from a data set. The key to training a DBN is how to determine its parameters. Following Hinton and Salakhutdinov [71], we select Contrastive Divergence (CD) as the training algorithm, which calculates the gradient through $k$ steps of Gibbs sampling [72]. The pseudocode of $k$-step CD (CD-$k$) is in Algorithm 1.

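Algorithm 1: $k$-step contrastive divergence (CD-$k$).
Require: Visible Layer $V$, Hidden Layer $H$, with $n$ visible units and $m$ hidden units
Ensure: Gradient Approximation $\Delta w_{ij}$, $\Delta a_i$, $\Delta b_j$ for $i \in \{1, \dots, n\}$, $j \in \{1, \dots, m\}$
for $i \in \{1, \dots, n\}$, $j \in \{1, \dots, m\}$ do
  Initialize $\Delta w_{ij} = \Delta a_i = \Delta b_j = 0$
end for
for each $v \in V$ do
  $v^{(0)} \leftarrow v$
  for $t \in \{0, \dots, k-1\}$ do
    for $j \in \{1, \dots, m\}$ do
      Sample $h_j^{(t)} \sim P(h_j \mid v^{(t)})$
    end for
    for $i \in \{1, \dots, n\}$ do
      Sample $v_i^{(t+1)} \sim P(v_i \mid h^{(t)})$
    end for
  end for
  for $i \in \{1, \dots, n\}$, $j \in \{1, \dots, m\}$ do
    $\Delta w_{ij} \leftarrow \Delta w_{ij} + P(h_j = 1 \mid v^{(0)})\, v_i^{(0)} - P(h_j = 1 \mid v^{(k)})\, v_i^{(k)}$
    $\Delta a_i \leftarrow \Delta a_i + v_i^{(0)} - v_i^{(k)}$
    $\Delta b_j \leftarrow \Delta b_j + P(h_j = 1 \mid v^{(0)}) - P(h_j = 1 \mid v^{(k)})$
  end for
end for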
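For readers who prefer executable code, the following is a minimal pure-Python sketch of $k$-step contrastive divergence for a binary RBM. It is not the GPU implementation used in the experiments; the function names and the choice of `W[i][j]` for the weight between visible unit `i` and hidden unit `j`, `a` for the visible offsets, and `b` for the hidden offsets are assumptions made for this illustration.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hidden_probs(v, W, b):
    # P(h_j = 1 | v) for every hidden unit j.
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    # P(v_i = 1 | h) for every visible unit i.
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    # Draw one binary state per unit from its activation probability.
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k(batch, W, a, b, k=1, rng=None):
    """Accumulate the CD-k gradient approximation over a batch of binary vectors."""
    rng = rng or random.Random(0)
    n, m = len(a), len(b)
    dW = [[0.0] * m for _ in range(n)]
    da = [0.0] * n
    db = [0.0] * m
    for v0 in batch:
        vk = list(v0)
        for _ in range(k):                      # k steps of Gibbs sampling
            h = sample(hidden_probs(vk, W, b), rng)
            vk = sample(visible_probs(h, W, a), rng)
        p0 = hidden_probs(v0, W, b)
        pk = hidden_probs(vk, W, b)
        for i in range(n):
            da[i] += v0[i] - vk[i]
            for j in range(m):
                dW[i][j] += p0[j] * v0[i] - pk[j] * vk[i]
        for j in range(m):
            db[j] += p0[j] - pk[j]
    return dW, da, db
```

The returned sums are usually divided by the batch size before being used in a parameter update.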

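$W$ is the weight matrix of all edges, $a$ and $b$ are, respectively, the offset vectors of the visible and hidden layers, and Sample denotes Gibbs sampling [72]. We can obtain a set of parameters with this algorithm. For a training sample $v^{(0)}$ and the sample $v^{(k)}$ obtained after $k$ Gibbs steps, the gradient of the log-likelihood $\ln L(\theta)$, with $\theta = (W, a, b)$, is approximated as

$$\frac{\partial \ln L(\theta)}{\partial w_{ij}} \approx P(h_j = 1 \mid v^{(0)})\, v_i^{(0)} - P(h_j = 1 \mid v^{(k)})\, v_i^{(k)}, \quad (1)$$

$$\frac{\partial \ln L(\theta)}{\partial a_i} \approx v_i^{(0)} - v_i^{(k)}, \quad (2)$$

$$\frac{\partial \ln L(\theta)}{\partial b_j} \approx P(h_j = 1 \mid v^{(0)}) - P(h_j = 1 \mid v^{(k)}). \quad (3)$$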

First, we set the initialization parameters. The weight matrix $W$ follows the normal distribution $N(0, 0.01)$. We set the visible layer offset as $a_i = \log \frac{p_i}{1 - p_i}$, where $p_i$ is the probability that visible unit $i$ is in the active state. For the original features, we can determine the characteristics of nonphishing sites, calculate the ratio of nonphishing sites, and then take its complement to obtain $p_i$. We set the offset vector of the hidden layer to 0. After initialization, we start the training process; the pseudocode is in Algorithm 2.

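Algorithm 2: Training of the RBM.
Require: Period $T$, Learning Rate $\eta$, Momentum $\rho$, Visible Layer $V$, Hidden Layer $H$, Number of visible and hidden layer units $n$, $m$, Offset Vector $a$, $b$, Weight Matrix $W$
Ensure: $W$, $a$, $b$
Initialize $\delta W = 0$, $\delta a = 0$, $\delta b = 0$
for $t = 1, \dots, T$ do
  Calling CD-$k$ to generate $\Delta W$, $\Delta a$, $\Delta b$
  $\delta W \leftarrow \rho\, \delta W + \eta\, \Delta W$; $W \leftarrow W + \delta W$
  $\delta a \leftarrow \rho\, \delta a + \eta\, \Delta a$; $a \leftarrow a + \delta a$
  $\delta b \leftarrow \rho\, \delta b + \eta\, \Delta b$; $b \leftarrow b + \delta b$
end for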
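The initialization step that precedes the training loop might look as follows in Python. This is a sketch under assumptions: the value 0.01 is interpreted as the standard deviation of the weight distribution (the paper does not say whether it is a standard deviation or a variance), the visible offset uses the log-odds of the activation probability, and probabilities are clamped slightly away from 0 and 1 to keep the logarithm finite.

```python
import math
import random

def init_rbm(active_prob, n_hidden, std=0.01, seed=0):
    """Initialize RBM parameters from per-unit activation probabilities.

    active_prob[i] is the estimated probability that visible unit i is active.
    Returns (W, a, b): weights, visible offsets, hidden offsets.
    """
    rng = random.Random(seed)
    eps = 1e-6
    a = []
    for p in active_prob:
        p = min(max(p, eps), 1.0 - eps)       # avoid log(0) and division by zero
        a.append(math.log(p / (1.0 - p)))     # visible offset = log-odds of activation
    # Small zero-mean Gaussian weights; std=0.01 is an assumed interpretation.
    W = [[rng.gauss(0.0, std) for _ in range(n_hidden)] for _ in active_prob]
    b = [0.0] * n_hidden                      # hidden offsets start at zero
    return W, a, b
```

A unit that is active in half of the training samples gets an offset of zero, while rarely active units start with a negative offset.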

The iteration period $T$ and the step count $k$ of CD-$k$ do not need to be large. Hinton [71] showed that the algorithm can achieve good results even with $k = 1$. The learning rate $\eta$ is related to gradient ascent in the maximum likelihood approximation for the Restricted Boltzmann Machine (RBM).

In order to maximize the log-likelihood $\ln L(\theta)$, we use the iteration in (4):

$$\theta \leftarrow \theta + \eta \frac{\partial \ln L(\theta)}{\partial \theta}. \quad (4)$$

The learning rate $\eta$ determines the convergence speed of the algorithm. The larger the learning rate, the faster the convergence, but a large rate gives no guarantee of a good result; that is, the algorithm is less stable. A smaller learning rate makes the algorithm stable but slows convergence, so the algorithm runs for a long time. To solve this problem, the algorithm introduces a momentum term $\rho$, associated with the direction of the last parameter change, to avoid premature convergence. The iterative formula is as follows:

$$\delta\theta \leftarrow \rho\, \delta\theta + \eta \frac{\partial \ln L(\theta)}{\partial \theta}, \qquad \theta \leftarrow \theta + \delta\theta. \quad (5)$$
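The effect of a momentum term can be illustrated with a small, generic gradient-ascent loop. The sketch below uses the common form in which the previous parameter change is scaled by the momentum and added to the current gradient step; the function name and the toy objective are hypothetical and stand in for the RBM log-likelihood.

```python
def momentum_ascent(grad, theta, lr=0.1, momentum=0.5, steps=100):
    """Gradient ascent with momentum on a list of parameters.

    grad(theta) must return the gradient of the objective at theta.
    """
    delta = [0.0] * len(theta)                  # last parameter change
    for _ in range(steps):
        g = grad(theta)
        # New change = momentum * previous change + learning rate * gradient.
        delta = [momentum * d + lr * gi for d, gi in zip(delta, g)]
        theta = [t + d for t, d in zip(theta, delta)]
    return theta

def toy_gradient(theta):
    # Gradient of the concave toy objective -(x - 1)^2 - (y + 2)^2.
    return [-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)]
```

Calling `momentum_ascent(toy_gradient, [0.0, 0.0])` moves the parameters toward the maximizer `(1, -2)` of the toy objective.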

The number of nodes in the hidden layers is chosen empirically, based on the training results and experience. The classic training process of DBN is described in Hinton’s paper [71]. We present the training process as follows:
(i) Step 1: initialize the set of original features and the set of interaction features, and use the combined set as the input of the bottom layer. The DBN then trains the first layer and obtains the output of its hidden layer.
(ii) Step 2: the output of the previous layer is used as the input feature of the next layer, and the DBN obtains that layer's output.
(iii) Step 3: repeat Step 2 until the top layer is reached.
(iv) Step 4: fine-tune the weight matrix $W$.
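Steps 1 to 3 amount to greedy layer-wise training, which can be sketched compactly in pure Python. The code below is a simplified illustration, not the implementation used in the experiments: each layer is trained with one-step contrastive divergence using reconstruction probabilities, there is no momentum, the fine-tuning of Step 4 is omitted, and all function names are hypothetical.

```python
import math
import random

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def _up(v, W, b):
    # Hidden activation probabilities given a visible vector.
    return [_sigmoid(b[j] + sum(vi * W[i][j] for i, vi in enumerate(v)))
            for j in range(len(b))]

def _down(h, W, a):
    # Visible reconstruction probabilities given a hidden vector.
    return [_sigmoid(a[i] + sum(hj * W[i][j] for j, hj in enumerate(h)))
            for i in range(len(a))]

def train_rbm(data, n_hidden, epochs=20, lr=0.1, rng=None):
    """Train one RBM layer with CD-1 and per-sample updates."""
    rng = rng or random.Random(0)
    n = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n)]
    a, b = [0.0] * n, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            p0 = _up(v0, W, b)
            h0 = [1 if rng.random() < p else 0 for p in p0]
            v1 = _down(h0, W, a)                # reconstruction (probabilities)
            p1 = _up(v1, W, b)
            for i in range(n):
                a[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (p0[j] * v0[i] - p1[j] * v1[i])
            for j in range(n_hidden):
                b[j] += lr * (p0[j] - p1[j])
    return W, a, b

def train_dbn(data, layer_sizes, **kwargs):
    """Greedy layer-wise training: each layer's output feeds the next layer."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(inputs, n_hidden, **kwargs)
        layers.append((W, a, b))
        inputs = [_up(v, W, b) for v in inputs]  # Step 2: output becomes next input
    return layers, inputs
```

The second return value holds the top-layer features for each training sample, which is what a downstream classifier would consume.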

The fine-tuning step is key to the DBN training process, because it yields better features from the data set. Fine-tuning can be unsupervised or supervised: Backpropagation is a supervised method [73], and the wake-sleep algorithm is an unsupervised one [74]. We use the supervised method, because we can label the data in advance using blacklists.

Since the entire DBN can be seen as a feature extraction process, the output of the top RBM can be seen as a feature in a new space. These features can then be used as the input of a common machine learning algorithm. Although the output of the top RBM can be fed directly to a classifier without further processing, backpropagating the classification error under supervised conditions yields finer-grained features. Y. Tang [75] describes a case in which the top classifier is a Support Vector Machine (SVM). It is not difficult to see that other binary classifiers are also feasible. In addition, we found in practice that feeding the top classifier both the original input features and the features extracted by DBN gives a better classification result. This paper therefore chooses SVM as the binary classifier and uses the DBN features together with the original features as the SVM input.

H. Wang and B. Raj [76] give the time complexity of deep learning models, including DBN. S. Bahrampour et al. [77] conducted a comparative study of five deep learning frameworks, namely, Caffe, Neon, TensorFlow, Theano, and Torch. Their experimental results show that the gradient computation time of TensorFlow increases from 14 ms to 23 ms as the batch size increases from 32 to 1024.

4. Test and Analysis

4.1. Test Data and Evaluation Criterion

The test data come from an ISP and are composed of two data sets. The small data set includes real traffic flow for 40 minutes. The big data set includes real traffic flow for 24 hours. After preprocessing, we obtain the total number of records and the numbers of unique IPs, unique ADs, and unique URLs, as shown in Table 1.

This work is a classical application of a binary classification model. In binary classification, the results are usually marked as Positive (P) or Negative (N); in this paper, the corresponding node is either a phishing site or not a phishing site. Comparing the classification results with the a priori facts yields the following four categories:
(i) True Positive (TP): the node is actually P and is classified as P.
(ii) False Positive (FP): the node is actually N but is classified as P.
(iii) True Negative (TN): the node is actually N and is classified as N.
(iv) False Negative (FN): the node is actually P but is classified as N.

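From these four categories, we derive the following evaluation criteria:
(i) Accuracy (ACC): $\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}$
(ii) True Positive Rate (TPR, Recall): $\mathrm{TPR} = \frac{TP}{TP + FN}$
(iii) False Positive Rate (FPR, Fall-Out): $\mathrm{FPR} = \frac{FP}{FP + TN}$
(iv) Positive Predictive Value (PPV, Precision): $\mathrm{PPV} = \frac{TP}{TP + FP}$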

In this paper, we use TPR as the evaluation criterion.
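The four criteria follow directly from the confusion counts, as the short sketch below shows. The function name and the guard against empty denominators are choices made for this example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, TPR, FPR and PPV from the four confusion counts."""
    def ratio(num, den):
        # Return 0.0 when a denominator is empty instead of raising an error.
        return num / den if den else 0.0

    return {
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "TPR": ratio(tp, tp + fn),   # recall
        "FPR": ratio(fp, fp + tn),   # fall-out
        "PPV": ratio(tp, tp + fp),   # precision
    }
```

For instance, with the made-up counts `tp=90, fp=6, tn=994, fn=10` (illustrative only, not taken from the experiments), the TPR is 90/100 and the FPR is 6/1000.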

4.2. Experimental Environment and Parameter Setup

In this paper, DBN experiments are conducted in stand-alone mode. The hardware environment includes an Intel i5-4570 quad-core CPU, 16 GB of memory, and an Nvidia GeForce GTX 760 graphics card. Deep learning algorithms often require high computational performance, and many popular deep learning libraries use the GPU to increase computation speed.

GPUMLib [78] is a GPU machine learning library. It uses C++ and Compute Unified Device Architecture (CUDA) and supports Backpropagation (BP), Multiple Backpropagation (MBP), the Autonomous Training System (ATS) for creating BP and MBP networks, the Neural Selective Input Model (NSIM) for BP and MBP, RBM, SVM, and other computationally intensive machine learning algorithms.

An SVM model can be seen as shallow feature extraction (with a single hidden layer). For DBN, we select at least two layers in order to enhance the feature selection effect, although too many layers lead to overfitting. The declaration of the main DBN module is given in Listing 1.

DBN(
    HostArray<int> & layers,
    HostMatrix<cudafloat> & inputs,
    cudafloat initialLearningRate,
    cudafloat momentum = DEFAULT_MOMENTUM,
    bool useBinaryValuesVisibleReconstruction = false,
    cudafloat stdWeights = STD_WEIGHTS
);

The parameters are explained as follows:
(i) layers: the number of nodes per layer. Because the visible layer has a total of 10 different variables as its feature set, we select 10 as the number of visible layer nodes.
(ii) inputs: the matrix to be trained.
(iii) initialLearningRate: the learning rate.
(iv) momentum: the momentum that corrects the learning rate. We select the default value.
(v) useBinaryValuesVisibleReconstruction: whether to use binary values to reconstruct the visible layer. We select the initial value false.
(vi) stdWeights: the upper and lower bounds used when initializing the weight matrix.

The number of DBN layers is one of the key parameters of the DBN algorithm. In this paper, we do not fix it, because the number of layers is varied as a test parameter for the DBN. We set the number of nodes in each layer to 10. The learning rate is set to 0.1 for faster learning. The momentum is set to its default value.

4.3. Experiment and Analysis

Three parameters affect the accuracy: the number of DBN layers, the number of iterations per layer, and the number of nodes in the hidden layers. L. McAfee [79] shows that when the number of iterations and the number of hidden layer nodes exceed a certain threshold, the precision of the algorithm reaches a higher level. As the number of iterations or hidden layer nodes increases further, the detection rate drops slightly, possibly because of overfitting. Therefore, we first set relatively large numbers of iterations and hidden layer nodes.

Figure 1 shows how TPR varies with the number of layers. With two layers, TPR peaks at about 89%. As the number of layers increases, TPR decreases slightly. The reason is that too many layers lead to overfitting. Therefore, the best number of layers is two.

Figure 2 shows how TPR varies with the number of iterations. When the number of iterations reaches 200, the detection rate is above 80%. The highest detection rate is achieved at about 250 iterations. After that, the accuracy of the algorithm decreases as the number of iterations increases. Moreover, the more iterations per layer, the longer the overall run time of the algorithm. Therefore, the best number of iterations is 250.

Figure 3 shows how TPR varies with the number of hidden units. TPR increases significantly, to above 85%, when the number of hidden units reaches 20. The detection rate does not change much at 30 hidden units. At 40 hidden units, the detection rate again increases significantly and reaches nearly 90%. Beyond that, the detection rate rises only slightly as the number of nodes increases: with 80 hidden units it is slightly higher than 90%, but overall it does not change significantly past 40 hidden units. As the number of hidden layer nodes increases, the running time also increases significantly. Therefore, the number of hidden units should be 40.

Table 2 compares the TPR with and without BP. We find that fine-tuning with BP does not improve the TPR; instead, it reduces the detection rate and increases running time. A possible reason is that BP causes a degree of overfitting when the input dimension is small. It is also possible that the parameters of the BP algorithm are not appropriate. Therefore, we do not use BP in detection.

After training on the small data set and obtaining the parameters, we use DBN to detect phishing websites in the big data set. The results show that 17,672 nodes were identified as phishing websites, with a detection rate of 89.2% and an FPR of 0.6%. Because the big data set cannot be fully labeled, these results are indicative only.

5. Conclusions

In this paper, we analyze the features of phishing websites and present two types of features for web phishing detection: original features and interaction features. We then introduce DBN to detect phishing websites and discuss the detection model and algorithm for DBN. We train DBN on the small data set and obtain appropriate parameters for detection. Finally, we test DBN on the big data set, where the TPR is approximately 90%.

Data Availability

The test data used to support the findings of this study have not been made available because these data belong to the ISP (Internet Service Provider).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61571290, 61831007, and 61431008), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informationization under Grant U1509219, the Shanghai Municipal Science and Technology Project under Grants 16511102605 and 16DZ1200702, and NSF Grants 1652669 and 1539047.