Abstract

With the development of Internet technology, network security is under diverse threats. In particular, attackers can spread malicious uniform resource locators (URL) to carry out attacks such as phishing and spam. The research on malicious URL detection is significant for defending against these attacks. However, there are still some problems in the current research. For instance, malicious features cannot be extracted efficiently. Some existing detection methods are easy to evade by attackers. We design a malicious URL detection model based on a dynamic convolutional neural network (DCNN) to solve these problems. A new folding layer is added to the original multilayer convolution network. It replaces the pooling layer with the k-max-pooling layer. In the dynamic convolution algorithm, the width of feature mapping in the middle layer depends on the vector input dimension. Moreover, the pooling layer parameters are dynamically adjusted according to the length of the URL input and the depth of the current convolution layer, which is beneficial to extracting more in-depth features in a wider range. In this paper, we propose a new embedding method in which word embedding based on character embedding is leveraged to learn the vector representation of a URL. Meanwhile, we conduct two groups of comparative experiments. First, we conduct three contrast experiments, which adopt the same network structure and different embedding methods. The results prove that word embedding based on character embedding can achieve higher accuracy. We then conduct the other three experiences, which use the same embedding method proposed in this paper and use different network structures to determine which network is most suitable for our model. We verify that the model designed in this paper has the highest accuracy (98%) in detecting malicious URL through these experiences.

1. Introduction

Hackers often use spam and phishing [1, 2] to trick users into clicking malicious URL, the Trojans will be implanted into the victims’ computers, or the victims’ sensitive information will be leaked. The technology of malicious URL detection can help users identify malicious URL and prevent users from being attacked by malicious URL. Traditionally, research on malicious URL detection adopts blacklist-based methods to detect malicious URL. This method has some unique advantages. It has high speed, has low false-positive rate, and is easy to realize. However, nowadays, the domain generation algorithm (DGA) can generate tens of thousands of different malicious domain names every day, which cannot be detected effectively by the traditional blacklist-based methods.

Researchers have been using a machine learning technique to identify malicious URL. However, these methods often need to extract the features manually, and attackers can design these features to avoid being identified. Faced with today’s complex network environment, designing a more effective malicious URL detection model becomes a research focus.

This paper proposes a malicious URL detection model based on a DCNN. It adopts word embedding based on the character embedding method to extract features automatically and learn the URL’s expression. Meanwhile, we verify the validity of the model through a series of contrast experiments.

In this study, our innovations and contributions are as follows:(1)This paper proposes a malicious URL detection model based on a DCNN. The dynamic convolution algorithm adds a new folding layer to the original multilayer convolution structure. It replaces the pooling layer with the k-max-pooling layer. In the dynamic convolution algorithm, the width of feature mapping in the middle layer depends on the vector input dimension. Moreover, the pooling layer parameters are dynamically adjusted according to the length of the URL input and the depth of the current convolution layer, which helps extract more in-depth features in a wider range.(2)In the stage of feature extraction and representation, the features are extracted from the URL sequence. The extracted features are integrated into a vector, and the vector is processed directly by the convolutional neural network to learn the classification model. This method not only simplifies the process of feature extraction, it does not depend on extracting features manually, but also combines the advantages of character embedding and word embedding. Word embedding can obtain word sequence information, which cannot be obtained by character embedding. Character embedding can process special characters and unfamiliar words in the URL. The dictionary and vector dimension are also not too big. The combination can save memory space and express the URL more effectively, which will help extract information from the URL.(3)To prove the feasibility of the model proposed in this paper, we did a lot of comparative experiments. As for the embedding method, we conduct three contrast experiments to verify that word embedding based on character embedding achieves higher accuracy than word embedding and character embedding. We also perform three contrast experiments and prove that leveraging the network structure consisting of a DCNN and different fields extracted from the URL can achieve a better effect.

The rest of this paper is organized as follows. In Section 2, we introduce the research status of malicious URL detection methods. In Section 3, we present the malicious URL detection model and its main modules. In Section 4, we conduct experiments on the malicious URL detection model and test the embedding methods. Finally, we offer a brief discussion in Section 5.

1.1. Related Work

At present, the methods [35] of detecting malicious URL can be roughly divided into traditional detection methods based on blacklist and detection methods based on machine learning. Literatures [6, 7] introduce the detection method based on a blacklist. Although this method is simple and efficient, it cannot detect the newly generated malicious URL, which has severe limitations. Literature [8] points out that attackers can generate various malicious domain names through a random seed to effectively evade the traditional detection method based on a blacklist.

In literatures [911], researchers have applied machine learning technology to detect malicious URL. Machine learning learns the prediction model based on statistical properties and classifies a URL as a malicious URL or a benign URL. This method attempts to analyze URL and their relevant websites or web page information to extract the features. The features extracted by this method are often divided into two types, static features and dynamic features. Literature [12] obtains lexical information in URL strings, information about hosts, and sometimes HTML and JavaScript content. Literature [13] extracts a series of network traffic-related features from URL, and the support vector machine (SVM) is adopted for detection. Literature [14] proposes three methods of feature processing to optimize the classification effect. While the above methods have shown good performance, there are still some limitations. Traditional detection methods based on machine learning often require extract features manually [15, 16]. Attackers can avoid being detected by these detection methods by designing these features, making it very difficult to maintain the detection system based on traditional machine learning. Additionally, in large-scale malicious URL detection, a trained model may lose some useful information from URL.

Referring to the idea of text classification [17, 18], many researchers have proposed a variety of methods based on deep learning [19] models to detect malicious URL and judge whether a URL is malicious only according to the strings contained in the URL. These methods can automatically extract valid information in the URL. For example, literature [20] uses the cyclic neural network model at the character level to classify URL generated by DGA. Literature [21] proposes the method of extreme machine learning to detect malicious URL. Combining n-gram model with deep learning, literature [22] takes the advantages of character-level semantic features to detect whether DGA generates the URL. A variety of deep learning architectures for malicious URL detection are listed in the literature [23], including the structure of single-layer long short term memory(LSTM) [24], the structure of bidirectional LSTM [25], the combined structure of CNN and LSTM [26, 27], and the deep convolution structure [28]. On this basis, literature [29] designs a keyword-based malicious URL detection model, which combines word embedding and GRU model. Literature [30] analyzes the structures and features of different URL, extracts more features, and proposes a semisupervised training model for URL’s multiclassification. Literature [31] extracts URL domain name features and instantaneous features of redirection attacks and optimizes the neural network structure to improve detection accuracy. In recent years, it has become a new research direction to detect malicious URL directly. Literature [32] takes the original URL as the input of malicious URL detection system, transforms URL into feature vectors by character embedding technology, and then uses the convolutional neural network for training, which significantly reduces the dimension of data and the amount of computation. Additionally, it can help achieve a good classification effect. In literature [30], word embedding is used to transform URL in each message into vector-matrix, which is then inputted into the convolutional neural network for analysis. Literature [33] improves the detection algorithm and adds a convolution branch on the CNN structure to extract in-depth character-level features. The disadvantage of this method is that it adopts character embedding or word embedding alone, and it is difficult to extract features in both characters and phrases. Literatures [34, 35] adopt a parallel convolutional neural network to detect malicious URL. They combine character embedding and word embedding, improve the word embedding method in the vector embedding stage, and extract the URL’s character and phrase features, which improve the detection effect. However, one of their disadvantages is that the fixed CNN structure is used to detect URL. The model parameters cannot be adjusted according to the input vector’s dimension, so it is difficult to extract the in-depth features in a wide range.

Based on DCNN, this paper designs a malicious URL detection model to solve the abovementioned problem. We will detail this in the following chapters. The materials and methods section should contain sufficient detail so that all procedures can be repeated. It may be divided into headed subsections if several methods are described.

2. Malicious URL Detection Model

Our paper proposes a malicious URL detection model based on convolutional neural networks. The construction of the model is shown in Figure 1. The model mainly includes three modules: vector embedding module, dynamic convolution module, and block extraction module. In the following, we will introduce each module and the detection process in detail.

3. Vector Embedding Module

In our model, a URL is inputted into the embedding layer, and we use word embedding based on character embedding to transform the URL into a vector expression. Moreover, the URL will be input into the DCNN for feature extraction.

The vector embedding module represents the input URL sequence as a suitable vector to facilitate the subsequent process. In the beginning, URL vector representation is initialized randomly. It is then inputted into the embedding layer used in the subsequent training and the most appropriate URL vector expression is obtained during the training process. This module uses an advanced word embedding method based on character embedding. The module extracts the phase information from the URL and the character-level information from the word. The information extracted will be used in subsequent training to obtain the most appropriate vector expression of the URL, and then the vector expression is inputted into the subsequent convolutional layer.

An example of a word embedding method based on character embedding is shown in Figure 2, where represents the embedding dimension of characters and words, represents the number of words contained in a URL string, and represents the number of characters contained in each word in the URL string. In the example, we convert each URL to an id sequence of word and character, respectively. Then, the embedding layer obtains word embedding matrix and character embedding matrix during the subsequent training. The word ID sequence uses the word embedding matrix to obtain the matrix expression at the word level. The character ID sequence uses the character embedding matrix to obtain multiple word matrix expressions based on character embedding. The multiple word matrices will be merged and compressed into a matrix representation of the URL at the word level. The matrix representation and the matrix representation are added. We will get the final representation of the URL.

3.1. Dynamic Convolution Module

The dynamic convolution module is to extract features automatically from the input data. The processing procedure of data includes 1D convolution, folding, and dynamic pooling. The DCNN can adjust parameters and extract features in a wider range based on the input length and the current convolutional layer.

When the DCNN is training, the network’s upper layer’s output is inputted into the network’s next layer. The URL is inputted into the input layer, and it is converted to a suitable vector expression in the embedding layer. Then, the first convolution layer starts to extract features. After the data are outputted from the convolutional layer, the data tensor dimension is compressed by the folding layer and then inputted to the pooling layer for dynamic pooling. After several rounds of convolution-folding-pooling, the data are finally inputted into the fully connected layer for training, and the result is finally outputted from the output layer.

3.2. Block Extraction Module

The block extraction module extracts different fields such as subdomain name, domain name, and domain name suffix from URL and encodes them as the second data branch of the detection model. In the embedding layer, the URL is converted to an appropriate vector. After passing through the embedding layer, the second data branch is merged with the first data branch, and the merged result is inputted to the fully connected layer for training. When the block extraction module extracts different fields, it can separate the top-level domain name or national domain name from the URL string.

The block extraction module can distinguish between generic top-level domains and national top-level domains. Therefore, the model can make full use of essential information. It takes the different fields in the URL as different features, which enriches the fully connected layer’s input.

3.3. Detection Process

The detection process is as follows. First, domain name, subdomain name, and domain name suffix are sequentially extracted from URL. In the first branch of the detection model, we pad each URL to a fixed length, of which every word is marked with a specific number. The entire URL is represented as a sequence of numbers. Then, the sequences are inputted to the embedding layer and trained together with other layers. These sequences will learn the appropriate vector expression during the training process. The data stream outputted from the embedding layer is subsequently inputted into a DCNN. The output passes through a convolution layer, a folding layer, and a pooling layer in two successive rounds. In the flatten layer, the data stream is flattened. It then waits for connecting with data from the other branch, where domain name, subdomain name, and domain name suffix are marked. The different main domain name, subdomain name, or domain name suffix in each field are encoded as an independent expression. Then, the marked data are inputted into the three newly added embedding layers and obtain the appropriate vector expression. After that, the information is transformed into a suitable shape in the reshape layer and merged with the first branch’s data. The two branches’ outputs are combined and jointly inputted to the fully connected layer for training. We use the DCNN to extract features automatically and use different fields in URL as different features to detect malicious URL jointly. After the dropout layer, the result is outputted into the output layer.

This model can fully obtain the information carried by the different fields in the URL string and enrich the input of the fully connected layer, which improves the detection effect.

In summary, the branch of processing data is added for expanding the input of the detection model. When training in the fully connected layer, the features are extracted automatically by the convolutional neural network and extracted artificially from the URL field. The detection model can effectively utilize critical information in the URL, such as top-level domain names and national domain names, to achieve higher accuracy and recall. Accuracy is vital, especially for detecting models, because if the accuracy is low, normal web pages may be classified as malicious websites and will be blocked.

4. Experiment and Evaluation

4.1. Experimental Environment

The experimental environment is based on the Windows operating system. The processor is i5-7500, the memory is 8 GB, the programming language is Python 3.6, and the deep learning framework is TensorFlow.

4.2. Comparative Experiments between Different Embedding Methods

Our paper adopts the word embedding method based on character embedding, which considered the advantages and disadvantages of word embedding and character embedding. The advantages of word embedding based on character embedding are as follows:(1)This method can effectively express rare words. It takes full use of character-level and word-level information. Therefore, this method can accurately represent rare words in URL.(2)This method can reduce the scale of the embedded matrix and reduce memory space. Meanwhile, it helps to convert a URL to a more accurate expression.(3)This method can convert new words that do not exist in training set into more accurate vectors, thereby extracting character information.

Based on different embedding methods, we conducted three comparative experiments. Experiment 1 adopted character embedding. Experiment 2 utilized word embedding as an embedding method. Experiment 3 used word embedding based on character embedding as an embedding method. These experiments used malicious DGA URL as the training set, and the network structure was stacked CNN. We measured the accuracy, F1-score, precision, and recall ratio to evaluate the results.

Attackers can communicate with the control center through a malicious DGA domain name. A malicious DGA domain name can be treated as a string during detection. However, compared with the real malicious URL, the malicious DGA domain name contains fewer characters.

The experimental result is shown in Table 1. We find that word embedding based on character embedding achieves the highest accuracy among the above two embedding methods through these experiments. The accuracy of word embedding based on character embedding is 0.958; the accuracy of character embedding and word embedding are only 0.923 and 0.954, respectively. The recall of word embedding based on character embedding achieves 0.976, which is higher than character embedding and word embedding. We also draw ROC curves and AUC curves among the three experiments. It is shown in Figure 3.

4.2.1. Network Structure Experiment

(i) Experimental Dataset. We collect a large amount of URL data from github.com, uci.edu, and kaggle.com to verify the model’s validity and feasibility. Besides, we select the top 1 million domain names, published on the Alexa website on May 17, 2019, as standard URL. Alexa ranks URL in terms of website traffic, including the site’s top-level domains and subdomains. The dataset division is shown in Table 2; the distributions of the dataset are shown in Figures 46.

(ii) Experiment Setting. We designed three comparative experiments. We tried to use different network structures to determine the best solution for our model. Experiment 1 utilized the stacked CNN. Experiment 2 only leverages DCNN. Experiment 3 adopted DCNN, and it extracted different fields from the URL to participate in training. We set the same experimental parameters for these three experiments. Besides, each URL’s length was set to 200 words, and the vector embedding dimension was 32. The DCNN included two convolutional layers. The number of convolution kernels was set to 128. The DCNN was finally trained by one fully connected layer, and it adopted the Adam optimization algorithm. The learning rate was 0.001, and the drop rate of the dropout layer was 0.5. In the process of the experiment, we adopted batch training, and each batch contains 100 URL.

(iii) Experimental Results and Analysis. We measured the accuracy, F1-score, precision, and recall ratio to evaluate the results. The final results of the detection model in this paper are shown in Table 3. The accuracy reaches 0.987, the precision reaches 0.993, the F1-score reaches 0.987, and the recall ratio is 0.981. The accuracy and loss are shown in Figures 7 and 8. As the number of iterations increases, the training accuracy increases continuously, and the fitting degree of the model is ideal. At the same time, the loss continues to decrease.

We also list the experimental results of other comparative experiments, as shown in Table 4; the network structure DCNN + extracting fields obtained a better effect than the other two network structures. The accuracy reaches 0.987, and the precision is 0.993. The ROC curves are drawn, and the AUC value is calculated to facilitate comparison and analysis. They are shown in Figure 9. We can see the TPR of DCNN + extraction is the first to reach 0.9993.

From the results, it can be seen that the best detection effect is obtained in Experiment 3. The high accuracy indicates that benign samples are less likely to be misjudged as malicious samples. Experiments show that the URL can be adequately expressed, critical features can be extracted to obtain better detection results using DCNN and the word embedding based on character embedding. Extracting different fields from the URL can make full use of keywords in the URL to improve detection accuracy and precision. The abovementioned experiments verify the validity of the detection model in this paper.

5. Conclusions

This paper aims to design a new malicious URL detection model based on deep learning. We designed a word embedding method based on character embedding, and the vector expression of a URL is learned automatically by combining character embedding with word embedding. DCNN for malicious URL detection is designed. According to the length of the input vector and the depth of the current convolution layer, the pooling layer parameters are dynamically adjusted to extract features in a wider range. We coordinated the relationship between different modules and adjusted network parameters. Besides, the real URL and malicious DGA URL in the real network are collected, and a series of experiments are designed and conducted. The results and various indicators are compared and analyzed to demonstrate the validity of the detection model.

The detection model achieves the expected effect in experiments. However, considering that the network traffic in the test environment and the real network are different, and with the development of the Internet, types of malicious URL are more diverse. It is necessary to timely update the model in the actual scenario. Therefore, to better adapt to the requirements of various complex application scenarios, we plan to study how to simplify the detection model’s architecture and shorten the training time while keeping the detection performance unchanged in the future.

Data Availability

The URL data used to support the findings of this study have been deposited in the malicious and benign URL dataset: https://gitee.com/blackwall/UrlDetect/blob/master/urldetect/%E5%AE%9E%E9%AA%8C%E4%BB%A3%E7%A0%81.zip.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors gratefully acknowledge the Key Laboratory of Network Assessment Technology of the Institute of Information Engineering. This research was financially supported by the Key Lab of Information Network Security, Ministry of Public Security (C19614), China Postdoctoral Science Foundation (2019M650606), and First-Class Discipline Construction Project of Beijing Electronic Science and Technology Institute (3201012).