Abstract

The increasing number of cyberattacks has made the cybersecurity situation more serious. Thus, it is urgent to use cyber threat intelligence to deal with the complex and changing cyber environment. However, cyber threat intelligence usually exists in an unstructured form, and a huge amount of data poses a great challenge to security analysts. To this end, this paper proposes a novel threat intelligence information extraction system combining multiple models, which contains four key steps: entity extraction, coreference resolution, relation extraction, and knowledge graph construction. In the entity extraction task, a multihead self-attention mechanism is adopted to extract the dependency relationships between words. In the coreference resolution task, contextual information and mention embedding are fused to improve the mention representation. Meanwhile, features of different dimensions are extracted using a convolutional neural network. In the relation extraction task, additional features such as part of speech, mention width, entity type, and distance of entity pairs are incorporated to improve the embedding representation. Finally, a knowledge graph is constructed to explicitly present entities and their relationships. Experimental results indicate that compared with the baseline model, the F1 score of our model is improved by at least 8.87, 9.82, and 10.56 on entity extraction, coreference resolution, and relation extraction, respectively. The knowledge graph in Neo4j demonstrates the effectiveness of our system.

1. Introduction

Modern IT infrastructures are under different degrees of cyberattack. To address this problem, it is necessary to continuously monitor devices, collect and process information, and generate security alerts. To better understand the threat situation and coordinate the response to unknown threats, security experts have proposed using Cyber Threat Intelligence (CTI) for cyber defense. In 2013, Gartner first pointed out that threat intelligence is knowledge about existing or upcoming threats against an asset, including scenarios, mechanisms, indicators, revelations, and actionable recommendations that can provide subjects with threat response strategies. Threat intelligence and intrusion detection systems [1–3] differ in two ways. On the one hand, cyber threat intelligence is created after an attack or defense and provides informational support for responding to threats, whereas intrusion detection systems raise alerts while an attack is being performed. On the other hand, threat intelligence can be obtained from security vendor bulletins, hacker forums, social media, and information security vulnerability databases, and it helps to understand the behavior of cyberattacks and how they happen. However, due to the complex composition of the Internet, variable attacker behaviors, and an increasing number of security devices, the volume of threat intelligence has grown geometrically. Meanwhile, cyber threat intelligence is usually written in natural language, with related entities scattered throughout an article and intricate relationships between entities. This poses challenges to intelligence analysis, utilization, and sharing. The huge amount of alert data puts a lot of pressure on security analysts; as a result, many alarms are left unprocessed and become junk data. Therefore, analyzing and processing threat information has become a key problem to be solved.

Manual analysis of threat intelligence requires cybersecurity expertise and is time-consuming and laborious, so it is difficult to cope with the increasing number of cyberattacks. Given its importance, much research effort has been devoted to extracting structured knowledge from unstructured threat intelligence, and the extraction process mainly involves four key techniques: entity extraction, coreference resolution, relation extraction, and knowledge graph construction.

Automated analysis of threat intelligence faces the following challenges: (1) unlike entities in the general domain, entities in the threat intelligence domain have strong domain characteristics; for instance, threat entities such as hacker organizations, attack techniques, and malware are difficult to identify directly with a general-domain entity extraction model; (2) in a threat intelligence document, an entity may appear multiple times, i.e., there are multiple mentions, and determining whether the mentions refer to the same entity requires fully utilizing contextual information and extracting semantic knowledge; (3) threat intelligence documents have complex structures and relatively long sentences, and the relationships between entities usually have to be inferred across multiple sentences; (4) there is a lack of publicly available threat intelligence datasets.

To overcome the above-mentioned challenges, this paper proposes a novel threat intelligence information extraction system that combines multiple models and involves four steps: entity extraction, coreference resolution, relation extraction, and knowledge graph construction. Zhou et al. [4] also proposed an extraction system for APT threat intelligence, but they could only extract related entities. Vulcan [5] extracted descriptive or static CTI data from unstructured text and determined their semantic relationships. However, their definitions of entities and relationships in threat intelligence are not comprehensive. The main contributions of this paper are summarized as follows: (1) A hybrid model is proposed to implement information extraction in the threat intelligence domain. The model converts unstructured threat intelligence text into a structured form and generates a knowledge graph. The knowledge graph is stored in the Neo4j graph database to explicitly present the entities in threat intelligence and the relationships between them, thus providing knowledge support and decision support for security analysts to understand attack events and make defense deployments. (2) A novel entity extraction model called entity extraction with multihead attention and POS (EEMAP) is proposed. The model obtains vector representations that are important to entities through a multihead self-attention mechanism. Then, it fuses these vector representations with the feature vectors generated by a recurrent neural network. Next, the fused results are input to a linear layer to obtain sequence labels, and the entities in the text are extracted. (3) A novel coreference resolution model called Coreference Resolution with CNN and POS (CRCP) is proposed. In this model, the mention representation is augmented by fusing contextual information and mention embedding. Also, a convolutional neural network is introduced to extract mention features of different dimensions, which can effectively compensate for the low recall rate of traditional coreference resolution methods. (4) To solve the problem of lacking publicly available datasets for threat intelligence, this paper collects and annotates 227 threat intelligence documents from security vendor bulletins, blogs, and forums. The experimental results in this paper validate the effectiveness of our model.

2. Related Work

This section mainly reviews the existing work that is closely related to our study. Specifically, advanced models related to entity extraction, coreference resolution, and relation extraction are presented.

2.1. Entity Extraction

Entity extraction is a fundamental task of natural language processing, which aims to extract and categorize people, places, organizations, and entities with specific meanings from unstructured text and organize them into semi-structured or structured information. After this, other techniques can be used to analyze and understand the text. In essence, entity extraction is a sequence labeling task.

The early research [6] on entity extraction mainly used rule-based or dictionary-based methods, and experts built many rule templates and adopted string matching as the main approach for entity extraction. This approach can achieve satisfactory results on fixed pattern datasets, but it has poor generalization ability and cannot adapt to changing data.

Statistical machine learning can obtain better results for the cybersecurity entity extraction task. This method does not require the manual definition of rule templates and has good portability. Statistical machine learning models include the hidden Markov model (HMM), maximum entropy model (MEM), and conditional random field (CRF) model. Joshi et al. [7] proposed to extract entities, concepts, and relationships from cybersecurity blogs and announcements using the CRF model to construct a custom cybersecurity ontology. Mulwad et al. [8] employed a support vector machine (SVM) as a classifier to identify attack means and results. Bridges et al. [9] evaluated the MEM on different security corpora to avoid the overfitting problem when training the model. The above methods are more robust than rule-based methods, but they require manual mining of text features, so it is difficult for these methods to achieve satisfactory results on datasets of small sizes.

With the rise of deep learning in various fields, neural networks are continuously improved and have achieved excellent performance in entity extraction tasks. For example, Chiu and Nichols [10] used a hybrid model of BiLSTM and CNN to detect character-level and token-level features. Then, they developed a new dictionary encoding model and a matching algorithm. Dionísio et al. [11] first used CNN to determine whether the collected tweets contained security information related to the property of IT infrastructure, followed by extracting named entities with BiLSTM to obtain security alerts. Wu et al. [12] investigated new attack patterns using a dependency analysis model to extract tactics, technology, and entities in e-commerce threat intelligence. Gasmi et al. [13] employed a BiLSTM-CRF model to extract cybersecurity concepts and entities, and they compared the model with three LSTM-based models. Zhao et al. [14] proposed an IOC identification method based on the multigranular attention mechanism and constructed a heterogeneous information network to model the dependencies among IOCs to improve accuracy.

2.2. Coreference Resolution

Coreference resolution identifies whether mention pairs refer to the same entity. In document-level relation extraction tasks, an entity may appear many times, i.e., there are multiple mentions. If two mentions refer to the same entity, they are said to be “coreferential.” For example, in the sentence “The malware also silently downloads and installs a known malicious app named ister59.apk (detected as Android.Reputation.3) from the following URL,” ister59.apk and Android.Reputation.3 refer to the same malware. Coreference resolution is widely used in various tasks such as relation extraction [15], event extraction [16, 17], multiparty conversation [18], and abstract meaning generation [19].

2.3. Relation Extraction

Relation extraction aims to extract relational facts from text. Early studies [20–23] focused on predicting the relationship between two entities within a single sentence. However, an increasing number of relational facts need to be extracted from multiple sentences, that is, through document-level relation extraction.

Document-level relation extraction approaches are mainly divided into two categories: graph-based methods and transformer-based methods. Specifically, graph-based methods construct document graphs to intuitively model entity structures. Zeng et al. [24] constructed mention-level and entity-level document graphs and proposed a new path inference mechanism to infer relationships between entities. Sun et al. [25] presented a dual-channel hierarchical graph convolutional neural network called DHGCN to model token-level, mention-level, and entity-level complex interactions between different semantics in a document. Transformer-based methods employ pretrained models (BERT, RoBERTa, ERNIE, etc.) to obtain the representation of each word in a document. Yuan et al. [26] exploited an intersentence attention mechanism to dynamically obtain multiple key sentence features and designed a gating function to combine sentence-level features with document-level features. Zhang et al. [27] introduced a U-shaped segmentation module to capture local and global information by predicting entity-level relationship matrices. To solve the multientity and multilabel problems in document-level relation extraction, Zhou et al. [28] proposed to enhance entity embedding by introducing local context pooling and to reduce the optimization overhead by replacing global thresholding with adaptive thresholding.

3. Methodology

3.1. Model Framework

This paper proposes a threat intelligence information extraction system combining multiple models. The proposed system consists of named entity recognition, coreference resolution, document-level relation extraction, and knowledge graph construction. The structure of the system is shown in Figure 1. First, the input is converted into POS-enhanced embeddings using the pretrained model BERT and the Python library NLTK. Then the embeddings are fed successively into the named entity recognition, coreference resolution, and document-level relation extraction models. After that, the outputs are arranged into triples and inserted into the knowledge graph.

3.2. Encoding Layer

Different from the traditional encoding layer using random word embedding, in this paper, BERT is introduced to provide rich semantic knowledge, and part-of-speech embedding is integrated to further improve the representation ability of mention embedding.

The pretrained model BERT is used as the encoder, and the special tags “[CLS]” and “[SEP]” are added at the beginning and end of the document, respectively. For each mention in the document, a special mention marker tag is inserted at its beginning and end.

As shown in Figure 1, the given document is first input into the tokenizer to obtain the tokenized document $D = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ represents the word at position $i$. In this section, BERT-base is used to encode the document and obtain the contextual representation $H$:

$$H = \mathrm{BERT}(D) = \{h_1, h_2, \ldots, h_n\}$$

where $h_i \in \mathbb{R}^{d_h}$, and $d_h$ is the dimension of the hidden size.

The POS sequence $P = \{p_1, p_2, \ldots, p_n\}$ of the document is obtained by NLTK. Based on this, the POS embedding matrix is constructed as follows:

$$E^{pos} = \{e^{pos}_1, e^{pos}_2, \ldots, e^{pos}_n\}$$

where $e^{pos}_i \in \mathbb{R}^{d_p}$, and $d_p$ is the dimension of the POS embedding.

For each token, the POS-enhanced word representation is generated by fusing the contextual embedding and the POS embedding:

$$x_i = [h_i \, ; \, e^{pos}_i]$$

where $x_i \in \mathbb{R}^{d_h + d_p}$, and $[\,\cdot\,;\,\cdot\,]$ indicates the concatenation operation.
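A minimal sketch of this encoding step is given below. It assumes bert-base-uncased via Hugging Face transformers, NLTK's default Penn Treebank tagger, and an illustrative POS embedding dimension of 32 (the dimension actually used in the paper is a tuned hyperparameter); the sub-token-to-word alignment and the tag2id mapping are our own additions for illustration.

```python
# Sketch of the POS-enhanced encoding layer (Section 3.2). The POS embedding
# dimension (32) and the tag2id mapping are illustrative assumptions.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import torch
import torch.nn as nn
import nltk
from transformers import BertTokenizerFast, BertModel

NUM_POS_TAGS = 46   # number of POS tags used in the paper (Section 4.2)
POS_DIM = 32        # d_p, assumed value for illustration

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
pos_embedding = nn.Embedding(NUM_POS_TAGS + 1, POS_DIM)  # +1 for unknown tags

def encode(document: str, tag2id: dict) -> torch.Tensor:
    """Return POS-enhanced token representations x_i = [h_i ; e_i^pos]."""
    words = nltk.word_tokenize(document)
    pos_tags = [tag for _, tag in nltk.pos_tag(words)]

    enc = tokenizer(words, is_split_into_words=True,
                    return_tensors="pt", truncation=True)
    with torch.no_grad():
        h = bert(**enc).last_hidden_state.squeeze(0)   # (n_subtokens, 768)

    # Each sub-token inherits the POS tag of the word it belongs to;
    # special tokens ([CLS], [SEP]) get the "unknown" index.
    word_ids = enc.word_ids()
    pos_ids = torch.tensor([
        tag2id.get(pos_tags[w], NUM_POS_TAGS) if w is not None else NUM_POS_TAGS
        for w in word_ids
    ])
    e_pos = pos_embedding(pos_ids)                     # (n_subtokens, POS_DIM)
    return torch.cat([h, e_pos], dim=-1)               # (n_subtokens, 768 + POS_DIM)
```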

3.3. Entity Extraction

To obtain vector representations that are important for entities, our entity extraction model integrates a multihead self-attention mechanism. This mechanism can learn dependencies between any two words and assign different weights to each token representation to obtain key information. Multiple attention heads can be used to learn features in different representation subspaces, thus significantly improving the model performance. Specifically, in this paper, the sequence of POS-enhanced token representations $X = \{x_1, x_2, \ldots, x_n\}$ is used as input to the attention layer to obtain contextually significant embeddings $A = \{a_1, a_2, \ldots, a_n\}$ of the current word:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$$\mathrm{head}_j = \mathrm{Attention}(QW_j^{Q}, KW_j^{K}, VW_j^{V})$$

$$A = \mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_m]\,W^{O}$$

where $Q$, $K$, and $V$ are the query, key, and value sequences (all derived from $X$), respectively; $d_k$ is the dimension of the key sequence; and $m$ is the number of attention heads.

This paper introduces BiLSTM to obtain historical and future information about the current word. Previous work [29] demonstrated the effectiveness of BiLSTM in capturing contextual semantic information. BiLSTM consists of a forward LSTM layer, a backward LSTM layer, and a concatenation layer. Each LSTM contains a set of recurrently connected subnetworks called memory blocks. At each time step, the LSTM memory block updates its state based on the hidden vector of the previous moment, the memory cell vector of the previous moment, and the current input word embedding.

Feature vectors $g_i$ are obtained by inputting the POS-enhanced token representation sequence into the BiLSTM.

Then, the important contextual embeddings $a_i$ are fused with the BiLSTM feature vectors $g_i$, and the fusion results are input into a linear classifier to obtain the sequence labels:

$$\hat{y}_i = \mathrm{softmax}\!\left(W[a_i \, ; \, g_i] + b\right)$$
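The tagging head described above can be sketched as follows. The input dimension (800 = 768 + 32) and BiLSTM hidden size are assumptions; the 5 attention heads and 13 entity tags follow Section 4.

```python
# Minimal sketch of the EEMAP tagging head: multihead self-attention over the
# POS-enhanced representations, a parallel BiLSTM, fusion by concatenation,
# and a linear classifier over the entity tags. Layer sizes are assumptions.
import torch
import torch.nn as nn

class EEMAP(nn.Module):
    def __init__(self, input_dim=800, lstm_dim=100, num_heads=5, num_tags=13):
        super().__init__()
        self.attention = nn.MultiheadAttention(input_dim, num_heads,
                                               batch_first=True)
        self.bilstm = nn.LSTM(input_dim, lstm_dim, bidirectional=True,
                              batch_first=True)
        self.classifier = nn.Linear(input_dim + 2 * lstm_dim, num_tags)

    def forward(self, x):                  # x: (batch, seq_len, input_dim)
        a, _ = self.attention(x, x, x)     # contextually weighted embeddings a_i
        g, _ = self.bilstm(x)              # BiLSTM feature vectors g_i
        fused = torch.cat([a, g], dim=-1)  # fuse the two views of each token
        return self.classifier(fused)      # per-token tag logits

# Usage with the encoder sketched in Section 3.2:
# logits = EEMAP()(encode(doc, tag2id).unsqueeze(0))
# tags = logits.argmax(dim=-1)
```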

3.4. Coreference Resolution

Coreference resolution identifies whether two mentions refer to the same entity. This paper proposes a model that treats the coreference resolution task as a binary classification problem. The model obtains the POS-enhanced token representations of each mention and averages them to obtain the mention vector: for a mention $m$ spanning tokens $s$ to $e$,

$$v_m = \frac{1}{e - s + 1}\sum_{i=s}^{e} x_i$$

CNN extracts deep features through a sliding window to alleviate the problem of long-distance dependence. A convolution layer contains multiple filters, each of which performs convolution operations on the word vectors through its convolutional kernel. The mention representations are fed into the convolution layer to obtain features of different dimensions, followed by downscaling and compression using a pooling layer to remove redundant information and prevent overfitting. This paper adopts max pooling, i.e., the maximum feature value is selected from the feature values obtained by each filter in the convolution layer, and the rest of the features are discarded.

After obtaining the pooled feature vector of mention pairs, the tanh activation function is further used to calculate the label probability, i.e., whether two mentions refer to the same entity.

In practice, the sequence labels obtained from the entity extraction model introduced in Section 3.3 are used to extract the corresponding mentions, which are then input into the coreference resolution model to predict whether the mentions point to the same entity.
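A hedged sketch of the CRCP scorer is shown below: each mention is represented by the average of its POS-enhanced token vectors, the stacked pair is passed through a 1-D convolution with max pooling, and a tanh-activated score indicates whether the two mentions corefer. The 200 output channels follow Section 4.2; the kernel size and input dimension are assumptions.

```python
# Sketch of the CRCP coreference scorer. Kernel size and input dimension are
# illustrative assumptions; the 200 output channels follow Section 4.2.
import torch
import torch.nn as nn

class CRCP(nn.Module):
    def __init__(self, input_dim=800, channels=200, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(input_dim, channels, kernel_size, padding=1)
        self.scorer = nn.Linear(channels, 1)

    @staticmethod
    def mention_vector(tokens):            # tokens: (span_len, input_dim)
        return tokens.mean(dim=0)          # average of the mention's token vectors

    def forward(self, mention_a, mention_b):   # each: (span_len, input_dim)
        pair = torch.stack([self.mention_vector(mention_a),
                            self.mention_vector(mention_b)])  # (2, input_dim)
        feats = self.conv(pair.T.unsqueeze(0))    # (1, channels, 2)
        pooled = feats.max(dim=-1).values         # max pooling over positions
        return torch.tanh(self.scorer(pooled))    # coreference score
```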

3.5. Document-Level Relation Extraction

Document-level relation extraction determines whether there are corresponding relationships between entities. In our model, it is treated as a multilabel classification problem. Additional features are fused in the entity representation to fully utilize the document information.

Specifically, the POS-enhanced token representation of the marker tag in front of a mention is used as the mention representation. Experiments have shown that the width of a mention is an important piece of information about an entity. Thus, a width embedding matrix is trained and fused with the mention representation to generate a width-enhanced representation:

$$\tilde{m}_j = [x_{m_j} \, ; \, e^{w}_{j}]$$

where $e^{w}_{j} \in \mathbb{R}^{d_w}$, $d_w$ is the dimension of the width embedding, and $m_j$ is the $j$-th mention of the entity.

For an entity $e_k$ with mentions $\{m_1, m_2, \ldots, m_{N_k}\}$, it is necessary to integrate the mention-level representations to obtain the entity-level representation. Traditional methods usually adopt maximum pooling, which performs well when the mention pair expresses the relation clearly. However, in practical scenarios, the relation between the mention pairs of different entities is fuzzy. This paper adopts a smoothed version of maximum pooling, i.e., LogSumExp pooling, to obtain an entity-level representation:

$$h_{e_k} = \log \sum_{j=1}^{N_k} \exp\!\left(\tilde{m}_j\right)$$
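As a concrete illustration, this pooling amounts to a single torch.logsumexp call over the stacked width-enhanced mention representations:

```python
import torch

def entity_embedding(mention_embs: torch.Tensor) -> torch.Tensor:
    """LogSumExp pooling: (num_mentions, dim) -> (dim,) entity representation."""
    return torch.logsumexp(mention_embs, dim=0)
```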

The multihead attention matrix of the encoder BERT is used, where $A^{(h)}_{ij}$ denotes the attention score from token $i$ to token $j$ in the $h$-th head. The attention on the marker tag before a mention is regarded as the attention score of that mention. Meanwhile, the entity-level attention score $a^{(h)}_{e_k}$ is obtained by averaging all mention attentions of the same entity. Besides, the important contexts for a specific entity pair $(e_s, e_o)$ are located by the attention matrix, and the local context embedding is calculated as follows:

$$q^{(s,o)} = \sum_{h=1}^{m} a^{(h)}_{e_s} \circ a^{(h)}_{e_o}, \qquad c^{(s,o)} = H^{\top}\, \frac{q^{(s,o)}}{\mathbf{1}^{\top} q^{(s,o)}}$$
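A sketch of this computation, following the description above (and the localized context pooling of ATLOP [28], on which it builds), might look as follows; the argument names and index handling are our own illustrative choices.

```python
import torch

def local_context(H, attn, marker_pos_s, marker_pos_o):
    """
    H:    (seq_len, hidden) token representations from the encoder.
    attn: (num_heads, seq_len, seq_len) attention of BERT's last layer.
    marker_pos_s / marker_pos_o: positions of the marker tags in front of the
    mentions of the subject / object entity (lists of token indices).
    """
    # Entity-level attention per head: average over that entity's mentions.
    a_s = attn[:, marker_pos_s, :].mean(dim=1)   # (num_heads, seq_len)
    a_o = attn[:, marker_pos_o, :].mean(dim=1)   # (num_heads, seq_len)
    q = (a_s * a_o).sum(dim=0)                   # tokens attended by both entities
    a = q / (q.sum() + 1e-12)                    # normalize to a distribution
    return H.T @ a                               # local context embedding c_(s,o)
```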

Experiments have shown that the distance between entities and the entity type also impact the effect of relation extraction. Thus, in this paper, a distance embedding matrix and a type embedding matrix are introduced and merged into the entity representation. Based on this, the representation encoding of a specific entity pair can be expressed as follows:

$$z_s = \tanh\!\left(W_s\,[h_{e_s} \, ; \, c^{(s,o)} \, ; \, e^{d}_{so} \, ; \, e^{t}_{s}]\right), \qquad z_o = \tanh\!\left(W_o\,[h_{e_o} \, ; \, c^{(s,o)} \, ; \, e^{d}_{so} \, ; \, e^{t}_{o}]\right)$$

where $e^{d}_{so} \in \mathbb{R}^{d_d}$ and $e^{t}_{s}, e^{t}_{o} \in \mathbb{R}^{d_t}$; $d_d$ and $d_t$ are the dimensions of the distance embedding and type embedding, respectively; $d_{so}$ indicates the distance between the first mentions of entities $e_s$ and $e_o$; and $t_s$ and $t_o$ are the types of $e_s$ and $e_o$, respectively.

To reduce the computational overhead, in this paper, the entity representations are divided into $k$ semantic groups of the same size. Then, the entity representations are fused to obtain the following entity pair representation:

$$z_s = [z_s^{1}; \ldots; z_s^{k}], \qquad z_o = [z_o^{1}; \ldots; z_o^{k}], \qquad g^{(s,o)}_r = \sum_{i=1}^{k} z_s^{i\top} W_r^{i}\, z_o^{i}$$
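An equivalent way to implement this grouped fusion is to take per-group outer products of the two entity representations and score the flattened result with a single linear layer, as sketched below; the dimensions and the number of relation types are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroupedPairScorer(nn.Module):
    """Grouped bilinear fusion of entity-pair representations (sketch)."""
    def __init__(self, dim=768, num_groups=64, num_relations=10):
        super().__init__()
        assert dim % num_groups == 0
        self.k, self.block = num_groups, dim // num_groups
        self.bilinear = nn.Linear(dim * self.block, num_relations)

    def forward(self, z_s, z_o):                 # z_s, z_o: (batch, dim)
        b = z_s.size(0)
        zs = z_s.view(b, self.k, self.block)
        zo = z_o.view(b, self.k, self.block)
        # Outer product within each group, flatten, then one linear scoring layer.
        pair = torch.einsum("bkd,bke->bkde", zs, zo).reshape(b, -1)
        return self.bilinear(pair)               # per-relation logits
```

Applying the nonlinear activation mentioned below to these logits yields the per-relation probabilities.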

Finally, the nonlinear activation function is employed to calculate the specific relationship probability.

3.6. Knowledge Graph Construction

Ontology is fundamental for knowledge graph construction. In this paper, the threat intelligence ontology previously studied by our team is introduced, and it is shown in Figure 2.

The scattered and heterogeneous security data are organized by the abovementioned information extraction models. Then, a threat intelligence knowledge graph is constructed based on the ontology, displaying entities and their relationships visually to provide data analysis results for threat modeling and attack reasoning in cybersecurity.

4. Experiment

4.1. Datasets

Due to the lack of publicly available information extraction datasets in the field of threat intelligence, this paper crawled 227 threat intelligence documents for preprocessing and manual annotation. The documents were then divided into a training set of 151 documents and a test set of 76 documents. The training set contains 1610 entities and 949 relationships. The distributions of entities and relationships are shown in Figure 3.

4.2. Experiment Settings

In the experiment, the pretrained model BERT was used as the encoder for threat intelligence documents, and its contextual representation was obtained with a hidden size of 768. To achieve better performance, an early stopping strategy was applied: training stops when the performance does not improve within 20 epochs. Our model uses the AdamW optimizer with a linear warmup strategy to improve the convergence speed. The learning rate was set to 5e-5 for BERT and 1e-4 for the other layers. All models were trained on an Nvidia GeForce RTX 3090 GPU using PyTorch 1.7.1.
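A sketch of this optimization setup is shown below, using get_linear_schedule_with_warmup from Hugging Face transformers; the warmup step count and the "bert" name filter are assumptions.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, warmup_steps=100):
    # Separate learning rates: 5e-5 for the BERT encoder, 1e-4 for other layers.
    grouped = [
        {"params": [p for n, p in model.named_parameters() if "bert" in n],
         "lr": 5e-5},
        {"params": [p for n, p in model.named_parameters() if "bert" not in n],
         "lr": 1e-4},
    ]
    optimizer = torch.optim.AdamW(grouped)
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    return optimizer, scheduler
```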

In the entity extraction task, 46 parts of speech and 13 entity tags were introduced. The dimensions of the POS embedding and the BiLSTM feature vector were fixed as hyperparameters, and the number of attention heads was set to 5 (see Section 4.3.1).

In the coreference resolution task, the number of output channels of the convolution layer was set to 200.

In the relation extraction task, the dimensions of the width embedding, distance embedding, and type embedding, the number of semantic groups, and the number of attention heads were likewise fixed as hyperparameters.

According to the previous work, precision, recall, and F1 score were used as evaluation metrics for model performance in the above-mentioned three tasks. Meanwhile, in the entity extraction task, exact-match accuracy (i.e., all tokens in the entity are predicted correctly) was introduced as an additional evaluation metric.
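For concreteness, the exact-match check can be sketched as below; it assumes gold entity spans as (start, end) token ranges and per-token tag sequences.

```python
def exact_match_accuracy(entity_spans, gold_tags, pred_tags):
    """An entity counts as correct only if every one of its tokens is tagged correctly."""
    correct = sum(
        all(pred_tags[i] == gold_tags[i] for i in range(start, end))
        for start, end in entity_spans
    )
    return correct / max(len(entity_spans), 1)
```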

4.3. Result Analysis
4.3.1. Performance on Entity Extraction

Table 1 presents the comparison between our model EEMAP-BERT and the baselines on the entity extraction task, where EEMAP-WE indicates that the pretrained model BERT is replaced by random word embedding as the encoder. It can be seen from Table 1 that compared to baselines BiLSTM [11] and BiLSTM-CRF [13], our model obtained significantly better precision, recall, F1 score, and exact-match accuracy. Specifically, the F1 score was improved by 9.94 and 8.87, respectively. Then, ablation experiments were performed to analyze the effect of each module on the model performance.

First, the POS embedding was removed (labeled as NoPOS), and it was observed that the F1 score decreased by 0.67 and the exact-match accuracy decreased by 0.74. In particular, in the threat intelligence domain, entities are mainly nouns and verbs, so the POS embeddings of these two classes have a significant impact on the model performance.

Then, the multihead self-attention layer was removed (labeled as NoAttention). It was observed that the F1 score decreased by 0.42 and the exact-match accuracy decreased by 0.56. The experimental results indicate that the multihead self-attention mechanism helps to locate vital context and capture long-distance interdependent features.

Finally, both the POS embedding and the multihead self-attention layer were removed (labeled as No POS + No Attention). It can be seen that the performance was significantly degraded: the F1 score decreased by 2.76, and the exact-match accuracy decreased by 3.45.

Additionally, to explore the influence of the encoder on performance, random word embedding and BERT were compared. The results indicate that the model performance was significantly reduced when random word embedding was used: the F1 score dropped by 9.65, and the exact-match accuracy dropped by 13.01. This shows that a pretrained model trained on a large corpus can learn universal language representations, which contributes to better generalization performance and faster convergence on the target task.

It can be seen from Figure 4 that when the number of attention heads is 5, both the F1 score and the exact-match accuracy reach their highest values, so the number of heads was set to 5.

To further investigate the ability of the POS embedding and attention mechanism to fit different data types, Figure 5 shows the single-category fine-grained performance in detail.

4.3.2. Performance on Coreference Resolution

Table 2 presents the performance of our model on the coreference resolution task, where CRCP-WE indicates that the pretrained model BERT is replaced by random word embedding as the encoder. As shown in Table 2, compared to E2E-CR [30], our model CRCP-BERT improved the F1 score by 9.82. Similar to the experiment on the entity extraction task, ablation experiments were performed on each module.

Firstly, the POS embedding was removed (labeled as NoPOS) to investigate its effect on the performance of coreference resolution. It can be seen that the F1 score decreased by 1.04 after the POS embedding was removed. If two mentions have different parts of speech, they generally do not refer to the same entity. Therefore, POS embedding can be used as an auxiliary feature to obtain potential semantic information.

Then, the CNN module was removed (labeled as NoCNN), and it can be seen that the model performance was significantly degraded, with the F1 score decreased by 2.78. This is mainly because CNN can extract information from different dimensions and effectively identify the corresponding features.

Finally, both the POS embedding and the CNN were removed (labeled as “No POS + No CNN”); the model performance was further degraded, and the F1 score was reduced by 4.15.

Additionally, it can be observed that when random word embedding was used as the model encoder, the model achieved the highest precision, but the recall and the F1 score decreased by 9.51 and 1.40, respectively. The analysis of the results indicated that after using random word embedding, the introduced part-of-speech embedding and CNN module might cause the model to overfit, resulting in a decrease in the overall performance.

4.3.3. Performance on Relation Extraction

To tackle the class imbalance of the dataset, we adopt random oversampling to duplicate minority-class samples before training our model. Specifically, tokens in the copies are replaced by their synonyms to create new samples.
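A hedged sketch of this augmentation step, using WordNet synonyms via NLTK, is given below; the replacement probability and the choice of the first synset are illustrative assumptions.

```python
# Requires: nltk.download("wordnet")
import random
from nltk.corpus import wordnet

def synonym_augment(tokens, replace_prob=0.1):
    """Copy a minority-class sample and replace some tokens with WordNet synonyms."""
    out = []
    for tok in tokens:
        synsets = wordnet.synsets(tok)
        if synsets and random.random() < replace_prob:
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()
                      if l.name().lower() != tok.lower()]
            out.append(random.choice(lemmas) if lemmas else tok)
        else:
            out.append(tok)
    return out
```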

Table 3 presents the performance of our model (labeled as DRE) on the relation extraction task. Compared to GAIN [24] and ATLOP [28], our model improved the F1 score by 11.74 and 10.56, respectively. Then, ablation experiments were conducted to verify the effects of four features, namely parts-of-speech, mention width, entity type, and distance of entity pairs, on the model’s performance. The experimental results indicated that the performance of the model decreased significantly after the four modules were removed, which validates the good effect of introducing features on the model performance.

The influence of each feature module on the F1 score was investigated further, as shown in Figure 6.

4.3.4. Knowledge Graph Construction

The threat intelligence text was input into the above-trained models, and the corresponding relation triples were generated through the processes of entity extraction, coreference resolution, and relation extraction. Then, the neo4j-admin command was executed to insert the triples into the Neo4j graph database and obtain the threat intelligence knowledge graph. Partial results are illustrated in Figure 7.
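The paper uses the neo4j-admin bulk import; for small sets of triples, an equivalent way to populate the graph is the official neo4j Python driver with Cypher MERGE statements, as sketched below (connection details, the Entity label, and the example triple are placeholders).

```python
from neo4j import GraphDatabase

# Example triples produced by the extraction pipeline (illustrative).
triples = [("ister59.apk", "SAME_AS", "Android.Reputation.3")]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for head, rel, tail in triples:
        # Relationship types cannot be parameterized in Cypher, hence the f-string.
        session.run(
            f"MERGE (h:Entity {{name: $head}}) "
            f"MERGE (t:Entity {{name: $tail}}) "
            f"MERGE (h)-[:`{rel}`]->(t)",
            head=head, tail=tail,
        )
driver.close()
```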

5. Conclusion

This paper designs a hybrid model to implement information extraction in the field of threat intelligence, including entity extraction, coreference resolution, relation extraction, and knowledge graph construction. First, the multihead self-attention mechanism is introduced into the entity extraction model to obtain important contextual vectors. Meanwhile, in the coreference resolution model, the contextual information is fused with mention embedding. Furthermore, the features of multiple dimensions are acquired by a convolutional neural network. In the relation extraction model, parts of speech, mention width, entity type, and the distance between entity pairs are introduced as additional features to improve the representation. Experimental results indicate that, compared to baselines, our model improves the F1 score by at least 8.87, 9.82, and 10.56 on the tasks of entity extraction, coreference resolution, and relation extraction, respectively. Finally, a threat intelligence knowledge graph is constructed to illustrate the potential semantic connections between entities. In summary, our model can automatically extract knowledge from multiple documents and display the internal relationships between crucial elements. It lays a firm foundation for situational awareness and attack detection. In future work, we will further refine entity and relationship classification and collect more examples to extend our dataset. Also, we will perform knowledge reasoning through a knowledge graph to obtain more knowledge.

Data Availability

The experimental dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (grant nos. 61501515 and 61601515).