Abstract

The number of malicious websites is increasing yearly, and many companies and individuals worldwide have suffered losses as a result. Detecting malicious websites is therefore a task that demands continuous development. In this study, a joint neural network algorithm model combining the attention mechanism, bidirectional independent recurrent neural network (Bi-IndRNN), and capsule network (CapsNet) is proposed. The word vector tool word2vec trains static embedding vector features of uniform resource locators (URLs) at the character and word levels. The algorithm also extracts texture fingerprint features that capture the content differences among the binary files of different malicious web URLs. The extracted features are then fused and input into the joint neural network algorithm model. First, the multihead attention mechanism and Bi-IndRNN are used to extract contextual semantic features by adjusting weights. Second, CapsNet with dynamic routing is used to extract deep semantic information. Finally, a sigmoid classifier is used for classification. This study extracts more comprehensive features by using different methods from different angles. The experimental results show that the method proposed in this study improves the classification accuracy of malicious web page detection compared with previous approaches.

1. Introduction

With the continuous improvement of the network environment, Internet applications have penetrated deeply into all aspects of life. At the same time, the vast user base of Internet applications has attracted many network attackers seeking profit through malware, spam, and phishing websites. According to Check Point's 2020 report [1], more than 100,000 malicious websites are used every day around the world to steal users' personal information or damage users' systems. Kaspersky's report [2] stated that the number of malicious URLs identified by web antivirus components in 2020 was 173 million. The report also mentioned that malicious URLs accounted for 66.07% of the 20 most active malicious programs. As more and more malicious websites emerge, individuals and companies worldwide suffer immeasurable losses.

The web page represented by a malicious URL contains malicious interactive code, such as HTML tags [3], JavaScript (JS) [4], and Cascading Style Sheets (CSS) [5]. The attacker writes source code containing malicious JS tags into the website, so the malicious code executes while the user is visiting it. For example, a remote download program may run in the background when a user clicks on an advertisement into which a hacker has implanted malicious code, ultimately taking control of the user's terminal to collect personal information. In addition, phishing websites are a main battlefield for malicious URLs. The attacker establishes an illegal site and lures users onto malicious web pages through inducements and other means to carry out malicious acts such as network fraud. To lower the user's guard, the attacker constructs these websites to be so similar to legitimate websites that they are indistinguishable by the human eye. In such a network environment, accelerating the development of malicious URL detection has become an essential task of network security.

So far, many malicious URL detection methods have been proposed. In previous research on malicious website detection, researchers usually manually extract one or more of the following features: web content features (HTML and JavaScript code), host information features (WHOIS), lightweight features (the web URL itself), and visualization features, and then input them into a machine learning or heuristic learning system to detect malicious websites. For example, Kumar et al. [6] used an HTML parser and a JavaScript simulator to extract web content features and input them into a heuristic system. Chu et al. [7] used domain-related information as the main feature and applied machine learning for detection. However, the feature engineering required by machine learning techniques is cumbersome and relies on the subjective judgment of researchers. The emergence of deep learning has addressed this problem well. Ren et al. [8] extracted character-level word embeddings of URLs to identify malicious URLs effectively. Peng et al. [9] added texture fingerprint features on top of URL and host information and then used a deep learning model for detection. This study focuses only on URL features and uses deep learning techniques to detect malicious websites.

Designers generally compose URLs from meaningful words to facilitate memory, and even meaningless words usually convey information in their character sequences. Therefore, we use word embedding and character embedding technology to extract the semantic features of URLs. Since URLs generated by the same tool or organization have similar structures, we also extract URL texture fingerprint features (Section 3). A joint neural network algorithm model is then proposed to capture URL features. First, the attention mechanism is used to give higher weight to key features. Second, we use an improved independently recurrent neural network (IndRNN) [10], the bidirectional IndRNN (Bi-IndRNN), to encode the fused feature information. Finally, CapsNet is used to extract high-level semantic features. Experiments show that the stacked CapsNet makes significant progress and that the joint model is a valuable exploration. The innovations of this study are summarized as follows:
(1) We constructed a joint neural network algorithm model that combines the attention mechanism, Bi-IndRNN, and CapsNet for malicious URL detection.
(2) To obtain more specific and natural features, we integrated different malicious URL feature information to extract combined semantic and image information.
(3) A series of comparative experiments show that the joint model proposed in this study achieves better performance than some state-of-the-art methods.

The remainder of this study is organized as follows. Section 2 introduces the contributions of previous researchers to malicious URL detection, Section 3 introduces the details of the proposed method, Section 4 presents the experimental results and analysis, and Section 5 concludes this study.

2. Related Work

The main aim of malicious URL detection is to distinguish malicious URLs from benign URLs. Methods proposed by previous researchers for malicious URL detection mainly fall into the following categories: blacklist-, rule-, machine learning-, and deep learning-based detection.

2.1. Blacklist

The blacklist-based method detects malicious websites, marks them, and stores them in a database containing relevant information about each malicious URL. A globally distributed URL blacklist service system based on P2P technology was proposed in [11]: contributors share blacklist information on storage nodes, and the client uses a plug-in to preserve the user's normal browsing experience. Fukushima et al. [12] proposed a blacklist system based on the reputations of the IP address blocks and registrars used by attackers. To discover more malicious websites actively, some researchers have proposed methods to expand the blacklist by analyzing the features of malicious websites. Akiyama et al. [13] used the structural neighborhoods of existing malicious URLs to find unknown malicious websites and verified them to expand the URL blacklist. Prakash et al. [14] proposed a prediction system composed of multiple heuristic components to generate new URLs; regular expressions and hash maps are then used to approximately match each URL and verify whether it is malicious. Compared with passively submitting URLs to the blacklist, this method can discover and verify malicious URLs from the same malicious source, but its limitations are also apparent: it cannot find newly emerging malicious domains, that is, it lacks generalization ability. Although the blacklist-based approach is easy to operate, data storage and updating face challenges when a considerable number of malicious websites are added every day.

2.2. Rule Matching

Researchers have proposed rule matching methods to solve the above problem, using selected features to formulate rules that filter malicious URLs. Cao et al. [15] proposed a rule matching method called Automated Individual Whitelist (AIWL), which uses a Naive Bayes classifier to automatically operate and maintain a list of login user interfaces (LUIs) that users are familiar with; this method warns users when they visit untrusted websites or submit confidential information to them. Nguyen et al. [16] proposed a system that calculates six heuristic values, similar to the Levenshtein distance between the domain name and the Google search engine's spelling suggestion, and weights and sums these values to determine whether a site is a phishing website based on a threshold. Liu and Zhang [17] proposed a two-round phishing page check. The first round checks the domain name, URL, and e-mail of the current page; if the score exceeds a threshold, the page is directly identified as a phishing page. Otherwise, the second round checks the password fields, links, and pictures; if no check exceeds the threshold, the page is regular. However, this method is only used in the financial field. Shekokar et al. [18] proposed a two-stage phishing page detection scheme. The first stage uses the LinkGuard algorithm to analyze the difference between visual links (links rendered by the browser) and actual links (hidden in the HTML). The second stage compares the similarity between suspicious web page snapshots and legitimate web pages by calculating the discrete cosine transform. Although rule-based methods do not need to maintain a vast database of malicious websites, they cannot detect unknown malicious URLs, because rule construction relies on existing malicious URLs and analyzing malicious web pages requires much subjective experience. The rule-based approach can find the more obvious malicious websites, but nowadays the features of malicious web pages are diversified, and many rule-based methods are powerless against them.

2.3. Machine Learning

As big data becomes more and more widespread, machine learning, with its generalization ability and resistance to real attacks, has become the mainstream detection method for malicious URLs. To implement a self-learning model, researchers must have enough malicious website data. Known sites are used to train the algorithm model, and unknown sites are classified by the trained model; after these steps, the model has certain dynamic detection capabilities. Shahrivari et al. [19] proposed a method that uses feature engineering to construct a dataset of 30 features extracted from the URL, web page content, and host information; 12 machine learning methods, such as random forest and decision tree, are then used to detect phishing websites. Crisan et al. [20] used word embeddings to represent URL information and improved the performance of naive Bayes, logistic regression, and SVM models by adding general domain-specific features. This method avoids selecting features from complex page content and simplifies data processing. However, machine learning methods require much feature design, and once these features are known to malicious website designers, the security settings are easy to bypass. Singhal et al. [21] used machine learning to classify malicious websites and proposed concept drift detection to find differences in data distribution between the feature vectors of the old training dataset and a newly collected dataset, the purpose being to prevent attackers from bypassing the detection rules by changing the URL once they realize features are extracted from it. The method proposed by Eshete et al. [22] uses machine learning algorithms for training and customizes the corresponding algorithms to further improve generalization ability. First, seven machine learning algorithms are trained by extracting 39 features in three categories: URL, page source, and social reputation. Then, the web page category is determined by a confidence-weighted majority vote classification algorithm.

Although machine learning can improve detection accuracy and has a certain generalization ability, manually extracting features remains a time-consuming and labor-intensive task that captures only shallow features.

2.4. Deep Learning

Unlike machine learning, deep learning can automatically extract high-dimensional features from preprocessed data. Having stood the test of time, deep learning has also become a mainstream malicious URL detection method: it broke the deadlock of traditional machine learning algorithms by extracting features automatically, freeing up the time spent on manual feature engineering. Wei et al. [23] proposed a malicious URL detection method using a CNN: character-level features are first extracted from the URL, and the CNN is then used for feature extraction and classification. Bahnsen et al. [24] proposed a malicious URL feature extraction and classification method based on long short-term memory (LSTM) networks. This method analyzes 14 URL vocabulary features, such as subdomain length and URL entropy, to build the feature engineering and uses LSTM for classification. Their experimental results show that malicious web page detection based on URL vocabulary features is more feasible than full content analysis. Jiang et al. [25] proposed an online detection scheme based on a deep neural network: URL and DNS information is mapped into vectors, and a CNN automatically extracts malicious features and trains a classification model. Moreover, this model can be fine-tuned to make its predictions more accurate. Das et al. [26] compared simple RNN, simple LSTM, and CNN-LSTM architectures for malicious URL classification. After comparing accuracy, precision, and recall, the CNN-LSTM architecture performed better than the other two. The lesson of this research is that different models embody different ideas for feature extraction, and it is advisable to optimize feature extraction by fusing models.

In general, deep learning technology has significantly improved the performance of malicious URL detection. Our method can process data faster than previous approaches, which is essential when applying malicious URL detection in practice. In addition, the fused texture fingerprint features enable the model to process URLs with complex structures, and feature fusion gives the model better recognition accuracy. Experimental results show that our method improves the performance of malicious URL detection and classification.

3. Our Approach

3.1. Feature Analysis

The malicious behavior of malicious websites generally manifests in the URL and the website content. However, the method proposed in [27] bypassed the website content, extracted features directly from the URL for classification, and achieved good experimental results. Inspired by this, this study focuses only on the website URL. To grasp the global features of malicious URLs, we extract the texture fingerprint features of website URLs. However, this type of feature is only superficial and does not fully reflect the essential attributes of URLs, so we also statically analyze the website URL to extract its semantic features. In total, this study extracts two types of features: texture fingerprint features and semantic features.

3.1.1. Semantic Feature

By analyzing the malicious URLs published by PhishTank [28] and OpenPhish [29], we found that the creators of some phishing websites usually imitate the content of regular pages and bind a similar domain name, such as "http://www.amazzonn.online." Some special characters, such as "@" and "-", may also be used to confuse users. Furthermore, attackers confuse users by lengthening strings of meaningless characters or increasing the depth of the domain name (that is, the number of "." characters), as in "mlwdkaflzkpqccqdaxjuqlltyexdfcfuzufo-dot-cryptic-now-290917.ey.r.appspot.com." We therefore extracted URL word features. First, the input URL is tokenized: the string is decomposed into its constituent words. The tokenization process is illustrated in Figure 1, and a minimal sketch follows below.
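A minimal Python sketch of this tokenization step is shown below. The exact delimiter set is our assumption, since the paper does not enumerate the split symbols:

```python
import re

def tokenize_url(url: str):
    """Split a URL into its constituent words.

    The delimiter set (slashes, dots, dashes, query separators, etc.)
    is an illustrative assumption.
    """
    tokens = re.split(r"[/\.\-_?=&:@%~+]", url.lower())
    return [t for t in tokens if t]  # drop empty fragments

print(tokenize_url("http://www.amazzonn.online/login?user=a"))
# ['http', 'www', 'amazzonn', 'online', 'login', 'user', 'a']
```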

To enable the tokenized data to be processed by a computer, the words obtained in the above step must be embedded, converting them into numerical vectors containing the grammatical and semantic information of the words. The specific method is to embed the tokenized data into a V × D matrix and update it through backpropagation, where V is the size of the vocabulary and D is the dimension of the word embedding. Although word2vec yields vectors for most words, meaningless words and symbols would confuse our model, so we also extracted URL character features; the process is similar to word embedding. At this stage, we have extracted embeddings at two granularity levels from the website URL: word level and character level.
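As a hedged illustration, both embeddings can be trained with gensim's word2vec implementation; the window and min_count values are assumptions, while the 130-dimensional vector size matches the feature dimension selected in Section 4.4:

```python
from gensim.models import Word2Vec

urls = ["http://www.amazzonn.online/login", "https://example.com/index.html"]

# Word-level corpus: each URL becomes a list of tokens (tokenize_url above).
word_corpus = [tokenize_url(u) for u in urls]
# Character-level corpus: each URL becomes a list of single characters.
char_corpus = [list(u) for u in urls]

word_model = Word2Vec(word_corpus, vector_size=130, window=5, min_count=1)
char_model = Word2Vec(char_corpus, vector_size=130, window=5, min_count=1)

vec = word_model.wv["login"]  # one 130-dimensional word embedding
```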

3.1.2. Texture Fingerprint Feature

We also extracted visual features from the URL. In the experiments of Wang et al. [30], it was concluded that malicious web pages from the same family have similar texture fingerprints. In previous studies, Su et al. [31] and Yang and Wen [32] proved the validity of grayscale images for deep learning models. Inspired by these conclusions, we also converted the URLs into grayscale images: the URL's binary content, read as 8-bit unsigned integers, is mapped into effective texture fingerprint features spanning the grayscale image's gray value range.

Specifically, as shown in Figure 2, the original data are read in binary form, with every 8 bits read as a basic unit (padded with 0 if the last read is shorter than 8 bits). Each basic unit is then converted to an unsigned integer, guaranteeing that each value lies in the range [0, 255]. Each integer is mapped to a pixel of a grayscale image and represents that pixel's gray value, where "0" means pure black and "255" means pure white. Finally, the gray values are stored in a fixed-width matrix.
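The conversion can be sketched as follows; the matrix width of 16 is our assumption, as the paper fixes the width but does not report its value:

```python
import numpy as np

def url_to_grayscale(url: str, width: int = 16) -> np.ndarray:
    """Map a URL's bytes onto a fixed-width grayscale matrix."""
    data = np.frombuffer(url.encode("utf-8"), dtype=np.uint8)
    # Each byte is already an 8-bit unsigned integer in [0, 255];
    # pad the final partial row with zeros, as described above.
    pad = (-len(data)) % width
    data = np.concatenate([data, np.zeros(pad, dtype=np.uint8)])
    return data.reshape(-1, width)

img = url_to_grayscale("http://www.amazzonn.online")
print(img.shape)  # (2, 16) for this 26-byte URL
```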

3.1.3. Feature Fusion

To further improve detection accuracy, the three features, that is, the character-level embedding vector, the word-level embedding vector, and the texture fingerprint feature, are fused. Given a sequence $S$, let $w_i$ represent the $i$th word, $c_j$ represent the $j$th character of $w_i$, and $p_k$ represent the $k$th pixel of the grayscale image. The joint vector $x$ can be expressed as

$$x = \left[e(w_i), e(c_j), e(p_k)\right],$$

where "$[\,]$" represents vector concatenation and $e(\cdot)$ denotes the embedding of its argument. After the features are fused, they are sent to the model for training and prediction.
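A toy numpy sketch of this concatenation; the placeholder vectors stand in for the real embeddings and texture features produced above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholders; in the full pipeline these come from word2vec and the
# grayscale conversion sketched earlier.
e_word = rng.normal(size=130).astype(np.float32)  # word-level embedding
e_char = rng.normal(size=130).astype(np.float32)  # character-level embedding
e_tex = rng.normal(size=130).astype(np.float32)   # texture fingerprint feature

# "[ ]" in the formula above is plain vector concatenation.
x = np.concatenate([e_word, e_char, e_tex])       # joint vector, shape (390,)
```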

3.2. Framework of the Model

A deep learning framework for detecting malicious URLs based on Bi-IndRNN and CapsNet is proposed in this study; its main structure is shown in Figure 3. First, the displayable characters and words are embedded into a multidimensional feature space by the character embedding and word embedding components, and the texture fingerprint feature of the URL is extracted simultaneously. The selected features are then merged and input to the attention mechanism, which assigns probability weights to the mixed features so as to obtain the features with higher weights. Next, an improved IndRNN called Bi-IndRNN extracts features from the long sequences: the features produced by the attention mechanism are input into Bi-IndRNN, and the features extracted by Bi-IndRNN are in turn input into CapsNet to build high-level feature information. Finally, the sigmoid classifier calculates the probability.

3.2.1. Attention Mechanism

The contribution of each joint vector to the feature expression of malicious URLs differs. Because the attention mechanism can give higher weight to key features, highlighting their impact on downstream models, we stacked the attention mechanism on top of the joint vector. Bahdanau et al. [33] first used the attention model in machine translation. The main task of the attention mechanism is to extract, from a large number of inputs, the information most critical to the model, simulating human attention so as to improve training efficiency while minimizing feature loss. At a macro level, the attention model can be understood as a mapping from a query to a series of key-value pairs. In essence, the attention mechanism performs a weighted summation over the values, where the query and keys are used to calculate the weight coefficient of each value.

In this study, multihead attention is introduced to structure a subset of the URL high-dimensional features. Multihead attention is also based on a query, key, and value, represented by $Q, K, V \in \mathbb{R}^{n \times d}$ (where $n$ represents the number of URL features and $d$ represents the dimension of the URL features), which are obtained by applying linear projections. Unlike general attention, multihead attention uses scaled dot-product attention to calculate the attention score. Let $X$ represent the URL fusion feature matrix and $x_i$ its $i$th feature vector; the attention model is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$

The key to multihead attention is to apply the above attention multiple times, where the number of heads $h$ is the number of times the attention is performed. However, the linear projections $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ calculated by each head are different. Taking the $i$th of $h$ heads as an example,

$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right),$$

where $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d/h}$. After the $h$ calculations, the results are concatenated to obtain the attention of all URL feature vectors:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O}.$$

Finally, the weighted sum of the input joint vector and the obtained attention yields the input of the next layer, the feature vector $a$. With these calculations, we can determine which information is most critical when Bi-IndRNN processes the current task; giving this important information a higher weight extracts as much task-relevant information as possible from the URL joint vector.
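A minimal numpy sketch of these equations, with randomly initialized projection matrices standing in for learned weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multihead_attention(X, Wq, Wk, Wv, Wo):
    """Concatenate h scaled dot-product heads, then project with Wo."""
    heads = [scaled_dot_product_attention(X @ Wq[i], X @ Wk[i], X @ Wv[i])
             for i in range(Wq.shape[0])]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d, h = 8, 32, 4                 # n feature vectors of dimension d, h heads
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(h, d, d // h)) * 0.1
Wk = rng.normal(size=(h, d, d // h)) * 0.1
Wv = rng.normal(size=(h, d, d // h)) * 0.1
Wo = rng.normal(size=(d, d)) * 0.1
out = multihead_attention(X, Wq, Wk, Wv, Wo)   # shape (n, d)
```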

3.2.2. Bi-IndRNN

A recurrent neural network (RNN) can effectively process data with sequential characteristics. However, RNN training faces vanishing and exploding gradients due to long-distance dependence. As a variant of the RNN, the long short-term memory (LSTM) network makes it easier to retain information from many steps earlier, but it does not guarantee that gradients will not vanish or explode. To break through this situation, Li et al. [10] proposed the independently recurrent neural network. This method effectively mitigates vanishing and exploding gradients because it works well with ReLU and other nonlinear activation functions and can regulate gradient backpropagation through time. The IndRNN unit structure is shown in Figure 4. The hidden layer of IndRNN can be described as

$$h_t = \sigma\left(W x_t + u \odot h_{t-1} + b\right),$$

where $W$, $u$, and $b$ represent the input weight, recurrent weight, and bias, respectively, $\odot$ denotes the Hadamard product, and $x_t$ denotes the input vector at time $t$.
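A hedged numpy sketch of this recurrence; the initialization choices are illustrative:

```python
import numpy as np

def indrnn_forward(x_seq, W, u, b):
    """IndRNN recurrence: h_t = relu(W x_t + u * h_{t-1} + b).

    u is a per-neuron vector, so the Hadamard product keeps each
    neuron's recurrence independent of the others.
    """
    h = np.zeros(W.shape[0])
    states = []
    for x_t in x_seq:
        h = np.maximum(0.0, W @ x_t + u * h + b)  # ReLU activation
        states.append(h)
    return np.stack(states)                       # (T, hidden)

rng = np.random.default_rng(0)
T, d_in, d_hid = 10, 32, 64
x = rng.normal(size=(T, d_in))
W = rng.normal(size=(d_hid, d_in)) * 0.1
u = rng.uniform(-1.0, 1.0, size=d_hid)  # |u| <= 1 keeps the recurrence stable
b = np.zeros(d_hid)
H = indrnn_forward(x, W, u, b)          # (10, 64)
```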

However, IndRNN can only obtain features from forward information when processing sequences. To enable the model to integrate feature information better and achieve stronger modeling capability, an improved IndRNN called Bi-IndRNN is used in this study. Bi-IndRNN extends IndRNN with the idea of the bidirectional recurrent neural network (BRNN): at each time t, the input is given simultaneously to two independent IndRNN units running in the forward and backward directions, and the output is jointly determined by the two unidirectional IndRNN units.

The joint vector describes semantic and visual information, including important text structure and the spatial position distribution between characters. To give the content represented by the joint vector a more robust information representation, we use the Bi-IndRNN model to extract features from it. Given the feature sequence $a = (a_1, \ldots, a_n)$ extracted from the fusion features by multihead attention, in the Bi-IndRNN we implemented, the forward IndRNN reads the feature sequence from $a_1$ to $a_n$, and the backward IndRNN reads it from $a_n$ to $a_1$. The hidden states can be expressed as

$$\overrightarrow{h_t} = \mathrm{IndRNN}\left(a_t, \overrightarrow{h}_{t-1}\right), \qquad \overleftarrow{h_t} = \mathrm{IndRNN}\left(a_t, \overleftarrow{h}_{t+1}\right).$$

Next, we combine these two vectors, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, as the output of Bi-IndRNN. In this way, each hidden state carries information about the entire sequence, concentrated around the $t$th element of the input vector. The feature vectors extracted by Bi-IndRNN are then input into the capsule network to extract deeper features.
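Continuing the IndRNN sketch above, the bidirectional pass runs a second cell over the reversed sequence and concatenates the per-step states (again with illustrative initialization):

```python
# Separate weights for the backward direction.
W_b = rng.normal(size=(d_hid, d_in)) * 0.1
u_b = rng.uniform(-1.0, 1.0, size=d_hid)

H_fwd = indrnn_forward(x, W, u, b)                  # reads a_1 .. a_n
H_bwd = indrnn_forward(x[::-1], W_b, u_b, b)[::-1]  # reads a_n .. a_1, realigned
H_bi = np.concatenate([H_fwd, H_bwd], axis=-1)      # (T, 2 * d_hid)
```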

3.2.3. Capsule Network

This study introduces the capsule network to build advanced feature information; the extracted feature data show great advantages when the capsule network is stacked on top of the Bi-IndRNN layer. To address some defects of the convolutional neural network and adapt to new deep learning tasks, Sabour et al. [34] proposed the capsule network in 2017. The capsule network is also a kind of neural network; the difference from an ordinary neural network is that its neurons are vectors instead of scalars, and each dimension of these vectors represents an attribute of the object. The capsule network therefore retains pose information and the spatial relationships between objects to the greatest extent. As part of the overall model, the structure of the capsule network is shown in Figure 3. First, the features extracted by Bi-IndRNN are input to a standard convolution layer:

$$F = f\left(X \ast W + b\right),$$

where $\ast$ denotes the convolution operation (element-wise multiplication of the filter with each input window), $b$ denotes the bias, and $W$ denotes the convolutional filter, whose size is denoted by $k$. That is, the convolution operation slides the filter over the given input to extract features and collects them in a feature map.

Next is the capsule layer, which converts the feature map into capsules through a group-convolution operation, producing capsule vectors $p_i \in \mathbb{R}^{d}$, where $d$ denotes the dimension of a capsule vector and $p_i$ represents the $i$th capsule vector. The nonlinearity is the squash function, expressed by the following formula:

$$\mathrm{squash}(s) = \frac{\|s\|^{2}}{1 + \|s\|^{2}} \cdot \frac{s}{\|s\|}.$$
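A direct numpy translation of the squash function:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Scale a capsule vector's length into [0, 1) while keeping its direction."""
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)
```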

Each capsule $i$ in the $l$th layer of the network predicts the output of each capsule $j$ in the $(l+1)$th layer separately:

$$\hat{u}_{j|i} = W_{ij}\, u_i.$$

Then, the weighted sum of all prediction vectors yields the high-level capsule $v_j$:

$$v_j = \mathrm{squash}\left(\sum_i c_{ij}\, \hat{u}_{j|i}\right),$$

where $c_{ij}$ is the coupling coefficient obtained by the dynamic routing algorithm. The capsule network can retain the most valuable information to the greatest extent, preserve it completely, and pass it to the upper capsule.
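A minimal sketch of routing-by-agreement in the sense of Sabour et al. [34], reusing the squash function above; the shapes and iteration count are illustrative:

```python
import numpy as np

def dynamic_routing(u_hat, iterations=3):
    """Route prediction vectors u_hat to higher-level capsules.

    u_hat: (num_lower, num_upper, dim) prediction vectors u_{j|i}.
    """
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))          # routing logits
    for _ in range(iterations):
        # Coupling coefficients c_ij: softmax of b over upper capsules.
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        s = np.sum(c[..., None] * u_hat, axis=0)  # weighted sum per upper capsule
        v = squash(s)                             # squash (defined above)
        b += np.sum(u_hat * v[None], axis=-1)     # agreement update
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 2, 16))              # 32 lower caps -> 2 upper caps
v = dynamic_routing(u_hat)                        # (2, 16)
```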

Finally, the result is input into the sigmoid classifier to obtain the final probability. With this, our model completes the detection of malicious URLs.

4. Results and Discussion

4.1. Experimental Set-Up

In Table 1, the attention layers, Bi-IndRNN layers, and CapsNet layers represent the number of attention, Bi-IndRNN, and CapsNet layers, respectively. The attention units and Bi-IndRNN units represent the number of multihead attention and Bi-IndRNN hidden layer units. The head denotes the number of heads of multihead attention. The capsule numbers and capsule dimensions denote the number and dimension of capsules, respectively. Our model uses the Adam optimizer with a default learning rate of 0.001.

4.2. Dataset

The dataset in this study consists of benign and malicious instances. We obtained a collection of benign URLs from the top rankings of Alexa verified by Google Safe Browsing, and the collection of malicious URLs was obtained from public websites, such as host-file.net and phishtank.com. In the end, 32,378 benign URLs and 33,549 malicious URLs were obtained.

4.3. Evaluation Indicators

We use five-fold cross-validation: the dataset is divided equally into five parts, four of which are used as training data and one as test data, and the experiments are carried out in turn. Accuracy (ACC), precision (P), recall (R), and F score (F) are used to evaluate the classification results. Before evaluation, we count the number of samples correctly classified as malicious (TP) and benign (TN) and the number incorrectly classified as malicious (FP) and benign (FN). The metrics are calculated as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2PR}{P + R}.$$
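These definitions translate directly to code:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute ACC, P, R, F from binary labels (1 = malicious)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / len(y_true)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return acc, p, r, f

print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```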

4.4. The Influence of Model Parameters on Experimental Results

During model training, we found that the model parameters have a clear influence on the experimental results: appropriate parameter settings positively affect both training and classification. To determine these parameters and obtain the optimal classification results, we test variable parameters, such as the feature types and feature dimensions, on the same dataset and determine the optimal values based on the evaluation indicators.

To determine which feature type gives the best classification performance, we first tested the three feature types, character embedding, word embedding, and texture fingerprint features, separately, and then tested the three features combined. The results are given in Table 2. Character and word embedding used alone already perform well, reaching recall rates of 99.82% and 99.89%, respectively. In contrast, the texture fingerprint classifier is slightly weaker, with a recall rate of 97.48%. The table also shows that although character embedding alone achieves a good accuracy of 99.74%, combining the three features brings a stable improvement across all evaluation indicators.

The dimension of the feature vector also has a noticeable impact on the experimental results. We used different dimensions as the variable parameter and compared six groups of feature dimensions. As shown in Figure 5, the feature dimension increases by ten from each group to the next. When the feature dimension increased from 90 to 130, all evaluation indicators increased, but continuing to increase the dimension gave unsatisfactory results: from 130 to 140, all indicators decrease except for a slight rise in recall, and the drop in precision is more pronounced. We therefore set the feature dimension to 130.

4.5. The Necessity of Model Components

In this part, several sets of experiments were designed to verify the effectiveness of each part of the model. Comparing three attention mechanisms, we found that all of them perform well. As shown in Figure 6, the accuracy of self-attention, hierarchical attention, and multihead attention reaches 99.75%, 99.67%, and 99.78%, respectively. However, multihead attention improves markedly on all evaluation indicators, with a recall rate of 99.90%, which helps the detection and classification of malicious URLs. Multihead attention was therefore used as a component of the model in this study.

Our model combines the attention mechanism, Bi-IndRNN, and CapsNet components. To verify the effectiveness of each component, three other models were designed:
(i) Attention-based IndRNN (AIR): extract feature information and classify through the attention-based IndRNN sequential model without CapsNet.
(ii) Attention-based CapsNet (ACaps): extract feature information and classify through the attention-based CapsNet sequential model without IndRNN.
(iii) IndRNN + CapsNet (IRCaps): use the IndRNN and CapsNet sequential models for detection and classification without the attention mechanism.

Such a comparative experiment allows us to see the contribution of each component in the model. As given in Table 3, the AIR model without CapsNet is lower in accuracy, precision, and recall than the model used in this study, which shows the effectiveness of the capsule network. For ACaps without IndRNN, the result is similar to the AIR model, and all evaluation indicators are also lower than our model's. In addition, the IRCaps model removes the attention mechanism, and the result is as we expected: its performance falls short of our model's because it cannot select the more helpful information, which also illustrates the necessity of the attention mechanism.

To verify that our proposed model is more suitable for malicious URL detection and classification, a set of experiments was designed to compare it with methods using other deep learning models. The experimental results are shown in Table 4. In this experiment, we fixed the hyperparameters and input the same dataset into the different models under the same experimental environment to verify the improvement of our model.

These methods all perform well on malicious URL detection and classification. Wanda and Jiang [35] also use character embedding technology and a single CNN architecture to extract features and classify, with a precision of 99.7%, though it is slightly inferior in accuracy and F value. Bahnsen et al. [24] and Liang et al. [36] used LSTM and Bi-LSTM models, respectively; the Bi-LSTM model, with an accuracy of 99.74%, performs slightly better than LSTM. In Wang's [37] method, the host features and URL information features are fused and Bi-IndRNN is used for detection and classification, finally obtaining a recall rate of 99.93%. Separate experiments with single attention and IndRNN models come to similar results. Furthermore, CapsNet can retain more features, which plays to its advantages; a single CapsNet [38] surpasses the other single models on some evaluation indicators. The model proposed in this study combines these advantages and outperforms the previous models on all evaluation indicators. Comparing LSTM with Bi-LSTM and IndRNN with Bi-IndRNN shows that a bidirectional network performs better than a unidirectional one. Compared with the other models, our model improves on all indicators, with accuracy and recall reaching 99.78% and 99.98%.

4.6. The Cost of the Model

To study the time cost of the proposed model, we conducted comparative experiments on the methods proposed by previous researchers. The experiment is divided into five groups, each using early stopping to prevent overfitting, so the number of training epochs differs between models. The experiment measured the average time required for a single epoch of each model, the average total time required to train a complete model, the trainable parameters of each model, and the test accuracy. The hardware parameters used in the experiment are given in Table 5, and the time cost results are given in Table 6.

The experimental results show that the classic models can also achieve good results in a short time. For example, the attention-based Bi-LSTM model called AB-LSTM proposed in [8] reaches a test accuracy of 99.69%. In pursuit of higher accuracy, researchers have proposed more complex models to detect malicious URLs. The TException method proposed in [39] uses multiple TException blocks, composed of 1D convolution, batch normalization, max-pooling, and ReLU layers, together with deep neural network (DNN) layers, to process character-level and word-level URL features. This method uses multiple batch normalization layers to speed up training, but this also reduces the expressive power of the subsequent activation functions, so the accuracy improvement is limited. Both the attention-based CNN-LSTM (ATT-CNN-LSTM) method proposed in [40] and the CNN and attention-based hierarchical RNN (ATT-CNN-HRNN) method proposed in [41] combine CNN and RNN techniques, which effectively extract relevant features and achieve malicious URL detection. Table 6 shows that, compared with the other new methods, our method does require a longer time to train a single epoch, which is caused by the routing algorithm in the internal loop of the capsule network. However, the smaller number of trainable parameters makes our method converge faster; its total training time is of the same order of magnitude as the other advanced algorithms, and it achieves higher test accuracy.
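As a hedged illustration of the early stopping setup mentioned above (the framework, the stand-in model, and the patience value are our assumptions; the paper does not specify them), a Keras-style configuration might look like this:

```python
import numpy as np
from tensorflow.keras import Sequential, layers
from tensorflow.keras.callbacks import EarlyStopping

# A stand-in model; the real joint model is described in Section 3.
model = Sequential([layers.Input(shape=(390,)),
                    layers.Dense(64, activation="relu"),
                    layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])  # Adam's default learning rate is 0.001

x_train = np.random.rand(1000, 390).astype("float32")
y_train = np.random.randint(0, 2, size=(1000, 1))

# Early stopping halts training once validation loss stops improving,
# so the epoch count differs per model; patience=5 is an assumption.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2,
          epochs=100, callbacks=[early_stop], verbose=0)
```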

5. Conclusions

This study proposed a joint neural network algorithm model combining the attention-based bidirectional independent recurrent network (Bi-IndRNN) and capsule network (CapsNet) to identify and detect malicious URLs. The experiments show that this method detects malicious URLs significantly better than single deep neural networks and shallow neural networks. The key steps of this study are to train the word vector model word2vec to obtain URL word and character vector features, extract the texture fingerprint features of the URL, and fuse the three features; to extract key features based on the weights of the multihead attention mechanism and Bi-IndRNN; and finally to use the capsule network to build high-dimensional features and classify them. In addition, under the same experimental environment, we compared different feature types and dimensions, different model components, and different algorithm models. In summary, the method proposed in this article can effectively improve the detection efficiency and accuracy of malicious URLs.

Although the method in this study performed well, it can still be improved. In follow-up work, we will consider integrating dynamic and static features to verify their effectiveness. At the same time, we will continue to update the model, integrate new components into the system, and optimize its time cost to achieve a better method.

Data Availability

The data used to support the findings of this study have been deposited in the GitHub repository (https://github.com/yipeng-liu-rep/malicious-url-data).

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the Xinjiang Autonomous Region Key R&D Project (2021B01002), National Natural Science Foundation of China (U2003208), and CERNET Innovation Project (NGII20190412).