Abstract

E-mail systems and online social media platforms are ideal places for news dissemination, but a serious problem is the spread of fraudulent news headlines. Fraudulent news headlines were previously detected mainly through laborious manual review; with the total number of news headlines reaching 1.48 million, manual review becomes practically infeasible. The attention mechanism is well suited to processing news headline text data. In this paper, we propose models based on LSTM and an attention layer, which fit the context of news headlines efficiently and can detect fraudulent news headlines quickly and accurately. Based on the multi-head attention mechanism, which eschews recurrent units and reduces sequential computation, we build a Mini-Transformer Deep Learning model to further improve the classification performance.

1. Introduction

With the rapid development of the Internet, Internet security faces various potential threats. The rise of the Advanced Persistent Threat (APT) has confronted traditional network defense systems with increasingly severe challenges. According to statistics, social engineering is the main technique that attackers use to launch APT attacks, so research on defending against social engineering attacks is of practical significance. Cutting off attack chains, detecting attacks, and isolating attackers is the fastest and most effective way to defend against social engineering attacks.

Currently, in most social engineering attacks, the attacker's essential operation is to distribute fraudulent news headlines on e-mail systems and online social media platforms, such as Instant Messaging services (e.g., QQ, WeChat, WhatsApp, Facebook Messenger, and Line) or microblogs (e.g., Twitter and Weibo). Fraudulent news headlines often carry malicious links preset by attackers. Curious users who see such headlines often click directly on the malicious links to learn more about the detailed contents, which leads to serious consequences, including theft of personal privacy, stolen accounts and passwords, and even huge asset losses.

According to the Symantec Internet Security Threat Report [1] (ISTR Volume 23), 71.4% of targeted attacks on companies in 2017 involved the use of spear-phishing e-mails. Therefore, e-mail remains the main vector by which social engineering attacks reach companies through their employees.

In summary, it is of great importance to analyze and detect fraudulent news headlines, which has a profound impact on Internet security and on the defense system against social engineering attacks.

In recent years, Deep Learning models such as Long Short-Term Memory (LSTM) [2], the attention layer [3], and the Transformer [4] have demonstrated outstanding advantages in solving Natural Language Processing (NLP) problems. In this paper, for the classification of news headline text data, we add one extra attention layer to the LSTM model and achieve a slight increase in accuracy. In addition, based on multi-head attention, we build a Mini-Transformer without complex recurrent or convolutional neural networks to improve the classification performance (i.e., accuracy, precision, recall, and F1 score) dramatically.

2. Related Work

Although a considerable amount of literature has been published on Internet social engineering, the emerging security issues with e-mail systems and online social media platforms have still not been addressed adequately. Moreover, since the operational principle of social engineering attacks has not been clearly revealed, it is difficult to construct an effective defense system.

Castillo et al. [5] raised the issue of fake information detection on Twitter. To examine newsworthy topics on Twitter, they evaluated various classification algorithms and analyzed four types of features (message, user, topic, and propagation). Their automatic method classified the credibility of Twitter messages and achieved high precision and recall.

Ma et al. [6] utilized Recurrent Neural Networks (RNN), including LSTM and the Gated Recurrent Unit (GRU), to process massive text data. They proposed a novel method that learns continuous representations of microblog events to identify rumors on Twitter and Weibo more quickly and accurately.

Guo et al. [7] investigated the relevant characteristics of social media and utilized the attention mechanism to analyze the massive news and messages on microblogs. They designed an efficient classification scheme that can detect rumors more accurately.

Song et al. [8] combined LSTM with the attention mechanism and proposed a novel method of sentiment lexicon embedding for aspect-level sentiment analysis, which better represents sentiment words' semantic relationships and improves the sentiment classification performance.

Vaswani et al. [4] proposed a new network architecture, the Transformer, based solely on the attention mechanism, which is not only superior in machine translation quality but also more parallelizable and therefore requires significantly less time to train.

Our work focuses on the news headlines spread on e-mail systems and online social media platforms. We develop a set of models to detect massive numbers of fraudulent news headlines using LSTM and the attention mechanism. To further improve the classification performance, we build the Mini-Transformer, which consists of multi-head attention layers and fully connected dense layers rather than recurrent unit layers (i.e., LSTM and GRU layers).

3. Methodology

In this section, we first briefly revisit LSTM [2]. We then present the formulation of the attention layer proposed by Bahdanau et al. [3]. Finally, we show how we use the multi-head attention mechanism to build the Mini-Transformer.

3.1. Long Short-Term Memory (LSTM) Networks

LSTM is able to process variable-length input sequences through recursive operations [2]. With the ability to maintain hidden states and fit the variations of contextual information across relevant time steps, LSTM is well suited for classifying news headline text data.

Unlike the traditional Vanilla RNN unit, whose hidden state is overwritten at every time step, the LSTM unit maintains a long memory cell state $c_t$ at time step $t$. Given an input sequence $x = (x_1, \ldots, x_T)$ with length $T$, where the $x_t$ are real number vectors with dimension $d$, the LSTM produces a hidden state sequence $h = (h_1, \ldots, h_T)$ with length $T$, where the $h_t$ are real number vectors with dimension $d_h$, and a long memory cell state sequence $c = (c_1, \ldots, c_T)$ with length $T$, where the $c_t$ are also real number vectors with dimension $d_h$. From $t = 1$ to $T$, the algorithm iterates as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t]),$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t]),$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t]),$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t]),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$h_t = o_t \odot \tanh(c_t),$$

where $W_f$, $W_i$, $W_c$, and $W_o$ are the weight matrices for the forget gate, input gate, long memory cell, and output gate, respectively. The operator "$\cdot$" denotes the dot-product between a matrix and a vector. The operator "$\odot$" denotes element-wise multiplication (the Hadamard product) between two vectors. $\sigma$ is the logistic sigmoid function, and $\tanh$ is the hyperbolic tangent function.

In the LSTM unit, the forget gate $f_t$ controls how much of the existing memory $c_{t-1}$ is removed, the input gate $i_t$ controls how much of the new memory $\tilde{c}_t$ is added, and the output gate $o_t$ determines the amount of output memory. By removing part of the existing memory $c_{t-1}$ and adding part of the new memory $\tilde{c}_t$, the long-term memory cell $c_t$ is updated. The LSTM unit is illustrated in Figure 1.

After all iterative steps of the algorithm from $t = 1$ to $T$, the last hidden state vector $h_T$ is used to generate the real number output $\hat{y}$ via a fully connected dense layer whose activation is the logistic sigmoid function.
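To make this concrete, the following is a minimal sketch of such an LSTM classifier in tf.keras (the framework used in Section 5). The hyper-parameter values mirror those reported in Section 5; reserving one extra embedding index for padding is our own assumption.

```python
import tensorflow as tf

VOCAB_SIZE = 7998   # length of the word dictionary (Section 5); index 0 assumed reserved for padding
T = 15              # maximum news headline length
D = 25              # word vector dimension d
D_H = 16            # LSTM hidden state dimension d_h

def build_lstm_model():
    inputs = tf.keras.Input(shape=(T,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE + 1, D)(inputs)        # word embedding layer
    h_T = tf.keras.layers.LSTM(D_H)(x)                              # last hidden state h_T
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(h_T)   # sigmoid output layer
    return tf.keras.Model(inputs, outputs)
```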

3.2. Attention Layer

In 2014, Bahdanau et al. [3] introduced the attention mechanism to the NLP field for the first time, jointly performing modeling, transduction, and alignment on the machine translation task.

The LSTM layer needs to return all hidden states $h_1, \ldots, h_T$ as the input of the attention layer. In the attention layer, attention weight scores $e_t$ are computed from $v$, $W$, and the input sequence $h = (h_1, \ldots, h_T)$. The $e_t$ are real numbers reflecting the importance of each state $h_t$. As trainable parameters, $v$ is a real number vector with dimension $d_a$, and $W$ is a real number matrix with shape $(d_a, d_h)$. From $t = 1$ to $T$, the algorithm iterates as follows:

$$e_t = v^{\top} \tanh(W h_t),$$

where $v^{\top}$ is the transpose of $v$.

For normalization such that $\sum_{t=1}^{T} \alpha_t = 1$, the softmax function is applied to generate the attention weights $\alpha_t$. From $t = 1$ to $T$, the algorithm iterates as follows:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)},$$

where $\exp$ is the exponential function.

We take a weighted sum of all states $h_t$ to compute an expected state $s$ with dimension $d_h$, which plays a role similar to $h_T$. The formula reads as follows:

$$s = \sum_{t=1}^{T} \alpha_t h_t.$$

The weighted sum state $s$ is then used to generate the real number output $\hat{y}$ via a fully connected dense layer whose activation is the logistic sigmoid function.
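As an illustration, the attention layer described above can be sketched as a custom tf.keras layer. The parameter names W, v, and d_a follow the formulation in this subsection; the initializers and the einsum-based implementation are our own assumptions.

```python
import tensorflow as tf

class AttentionLayer(tf.keras.layers.Layer):
    """Bahdanau-style attention over all LSTM hidden states, as described above."""
    def __init__(self, d_a, **kwargs):
        super().__init__(**kwargs)
        self.d_a = d_a

    def build(self, input_shape):
        d_h = int(input_shape[-1])
        # Trainable parameters: W with shape (d_a, d_h) and v with dimension d_a
        self.W = self.add_weight(name="W", shape=(self.d_a, d_h), initializer="glorot_uniform")
        self.v = self.add_weight(name="v", shape=(self.d_a,), initializer="glorot_uniform")

    def call(self, h):                                        # h: (batch, T, d_h), all hidden states
        u = tf.tanh(tf.einsum("btd,ad->bta", h, self.W))      # tanh(W h_t) for every time step
        e = tf.einsum("bta,a->bt", u, self.v)                 # scores e_t = v^T tanh(W h_t)
        alpha = tf.nn.softmax(e, axis=-1)                     # normalized weights alpha_t
        return tf.einsum("bt,btd->bd", alpha, h)              # weighted sum state s

# Usage sketch: states = LSTM(16, return_sequences=True)(x); s = AttentionLayer(d_a=32)(states)
```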

3.3. Multi-Head Attention

In 2017, Vaswani et al. [4] introduced the multi-head attention mechanism, which consists of several attention heads running in parallel. They then built the Transformer without any recurrence or convolution to improve machine translation quality; in addition, the Transformer is more parallelizable and therefore requires significantly less time to train.

In this paper, we propose a simplified Transformer, called Mini-Transformer, for the classification of news headline text data. Mini-Transformer is composed of multi-head attention layers and eschews recurrence or convolution.

For single-head dot-product attention, given an input sequence $x = (x_1, \ldots, x_T)$ with length $T$, where the $x_t$ are real number vectors with dimension $d$, we generate $Q$ (Query), $K$ (Key), and $V$ (Value) with the trainable parameter matrices $W^{Q}$, $W^{K}$, and $W^{V}$. They are all real number matrices with shape $(d, d_k)$, where $d_k$ denotes the dimension of an attention head. Writing the input sequence as a matrix $X$ with shape $(T, d)$, the formulas are as follows:

$$Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}.$$

After generating $Q$, $K$, and $V$ with shape $(T, d_k)$, we compute a single dot-product attention head as follows:

$$\operatorname{head} = \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$

The above single-head dot-product attention outputs a real number matrix with shape $(T, d_k)$. For multi-head attention, we employ $h$ parallel attention heads. Due to the reduced dimension $d_k$ of each head, the total computational cost is about the same as that of single-head attention with full dimensionality, but multi-head attention is more parallelizable for GPU training. The formulas are as follows:

$$\operatorname{head}_i = \operatorname{Attention}(X W_i^{Q}, X W_i^{K}, X W_i^{V}), \quad i = 1, \ldots, h,$$
$$\operatorname{MultiHead}(X) = \operatorname{Concat}(\operatorname{head}_1, \ldots, \operatorname{head}_h) W^{O},$$

where $W^{O}$ is a trainable output projection matrix with shape $(h \cdot d_k, d)$.

Finally, multi-head attention outputs a real number matrix with shape $(T, d)$, as depicted in Figure 2.
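For clarity, the following sketch expresses the scaled dot-product attention formula directly in TensorFlow; the trailing comment notes how the built-in tf.keras.layers.MultiHeadAttention layer, which we take to be an equivalent parallel-head implementation, can be applied to the embedded headline sequence.

```python
import tensorflow as tf

def dot_product_attention(Q, K, V, d_k):
    """Computes softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    scores = tf.matmul(Q, K, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))
    weights = tf.nn.softmax(scores, axis=-1)       # attention weights over the T positions
    return tf.matmul(weights, V)                   # weighted combination of the values

# For the h parallel heads, the built-in layer can be used directly, e.g.:
# mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
# out = mha(x, x)   # self-attention over the embedded headline sequence x
```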

4. Fraudulent News Headline Detection

Our work focuses on classifying massive news headline data into the fraudulent class (label 1) and the true class (label 0). There are three Deep Learning models, based on LSTM, LSTM with an attention layer, and the Mini-Transformer, respectively. The proposed scheme consists of a labeled data source, text data preprocessing, and the training, testing, and evaluation of the Deep Learning models.

4.1. Scheme Flow Chart

The flow chart of our proposed scheme for fraudulent news headline detection is shown in Figure 3.

4.2. Labeled Dataset

Three data sources are used in this paper. All of them are publicly available at Kaggle, the world’s largest data science community [9].

If the length of a news headline is greater than or equal to 7, the news headline is considered valid. For balanced sampling, there are a total of 1,481,814 news headlines, including 736,009 items with label 1 and 745,805 items with label 0.

The fraudulent news headline dataset is The Examiner - Spam Clickbait Catalog [10]. The original source is the pseudo news site The Examiner. At one point, the site was the 10th largest site on mobile and attracted twenty million unique visitors per month. However, The Examiner no longer exists; Kaggle keeps the last record. Our work focuses on the fraudulent news headlines from January 1, 2013, to December 31, 2015, a total of 736,009 fraudulent news headlines (with a class label of 1).

True news headline datasets are A Million News Headlines [11] and News Category Dataset [12], a total of 745,805 true news headlines (with a class label of 0).

For A Million News Headlines, the original source is the Australian Broadcasting Corporation. The dataset includes the entire corpus of articles published by the ABC news website. With a volume of two hundred articles per day and a good focus on international news, it captures every event of significance. It contains a total of 577,264 true news headlines from February 19, 2003, to December 31, 2019.

For News Category Dataset, the original source is HuffPost. Each news headline has a corresponding category (e.g., parenting, style and beauty, entertainment, wellness, and politics). It contains a total of 168,541 true news headlines from January 28, 2012, to May 26, 2018.

4.3. Text Data Preprocessing

We preprocess the original labeled news headline text data, including deleting repeated news headlines, removing unnecessary English symbols (i.e., ( ) ' " , . ? : - ! #), removing redundant space characters, applying NLTK lemmatization [13], truncating news headlines that are too long, padding news headlines that are too short, and converting uppercase letters to lowercase.
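A minimal sketch of these preprocessing steps, assuming Python's re module and NLTK's WordNetLemmatizer, is given below; the exact cleaning rules in our pipeline may differ in detail.

```python
import re
from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
MAX_LEN = 15                               # maximum news headline length

def preprocess(headline):
    headline = headline.lower()                                   # uppercase -> lowercase
    headline = re.sub(r"[()'\",.?:\-!#]", " ", headline)          # drop unnecessary symbols
    headline = re.sub(r"\s+", " ", headline).strip()              # remove redundant spaces
    words = [lemmatizer.lemmatize(w) for w in headline.split()]   # NLTK lemmatization
    return words[:MAX_LEN]                                        # truncate overly long headlines
```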

Data with a time order is generally called a sequence, and news headline text data are typical sequences. The representation of the news headlines is a two-dimensional string array with shape $(N, T)$, where $N$ is the total number of news headlines and $T$ is the maximum length of a news headline. For example, the two-dimensional string array can be as follows:
(i) ['tom', 'act', 'a', 'cat', 'o', …]: 0
(ii) ['jerry', 'act', 'a', 'mouse', 'e', …]: 0
(iii) ['goofy', 'act', 'the', 'sanguine', 'dog', …]: 0
(iv) ['jerry', 'act', 'the', 'hypothetical', 'cat', …]: 1,
where label 0 denotes the true class and label 1 denotes the fraudulent class.

We calculate the frequency of each English word in the two-dimensional string array so as to identify high-frequency words and generate a high-frequency word dictionary. Notably, in the procedure of generating the high-frequency word dictionary, we ignore extremely short words, mark stopwords [14] uniformly with tag 1, and mark low-frequency words uniformly with tag 2. For example, the word dictionary can be as follows:
(i) Stopword: 'a' -> 1, 'the' -> 1, …
(ii) Low-frequency word: 'sanguine' -> 2, 'hypothetical' -> 2, …
(iii) High-frequency word: 'act' -> 3, 'cat' -> 4, 'jerry' -> 5, 'dog' -> 6, 'goofy' -> 7, 'mouse' -> 8, 'tom' -> 9, …,
where 'a' and 'the' are stopwords marked uniformly with tag 1, and 'sanguine' and 'hypothetical' are low-frequency words marked uniformly with tag 2.

The original news headline is composed of several words. To facilitate the training and testing of the Deep Learning models, we map each word string to the corresponding integer based on the generated word dictionary, so the news headline two-dimensional string array is converted to a two-dimensional integer array with shape $(N, T)$, as sketched after the following example. The two-dimensional integer array can be as follows:
(i) [9, 3, 1, 4, …]: 0
(ii) [5, 3, 1, 8, …]: 0
(iii) [7, 3, 1, 2, 6, …]: 0
(iv) [5, 3, 1, 2, 4, …]: 1,
where 'o' and 'e' are extremely short, unnecessary words (only one letter) that have been ignored, tag 1 denotes all stopwords, tag 2 denotes all low-frequency words, and tags greater than or equal to 3 denote the corresponding high-frequency words.
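The dictionary construction and word-to-integer mapping can be sketched as follows; NLTK's English stopword list and the helper names build_dictionary and encode are our own assumptions for illustration, not part of the released pipeline.

```python
from collections import Counter
from nltk.corpus import stopwords          # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
MIN_FREQ = 140                             # high-frequency threshold used in Section 5

def build_dictionary(headlines):           # headlines: list of preprocessed word lists
    counts = Counter(w for h in headlines for w in h
                     if len(w) > 1 and w not in STOPWORDS)
    high_freq = sorted(w for w, c in counts.items() if c >= MIN_FREQ)
    return {w: i + 3 for i, w in enumerate(high_freq)}       # high-frequency tags start at 3

def encode(headline, dictionary):
    tags = []
    for w in headline:
        if len(w) <= 1:
            continue                        # ignore extremely short words
        if w in STOPWORDS:
            tags.append(1)                  # all stopwords share tag 1
        else:
            tags.append(dictionary.get(w, 2))   # low-frequency words share tag 2
    return tags
```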

4.4. Deep Learning Model Structure

In this section, we propose three Deep Learning models, all of which contain word embedding layers and output layers; their structures are shown in Figure 4.

Word2Vec [15] provides a simple and effective method for the vectorized representation of words, which can be employed in the word embedding task. In the word embedding layer, each integer in the news headline two-dimensional integer array is converted to a real number vector with dimension $d$. Eventually, the two-dimensional integer array is converted to a real number array with shape $(N, T, d)$, i.e., each piece of news headline text data is converted to a word vector input sequence $x = (x_1, \ldots, x_T)$.

In the output layer, the vector $h_T$ (from LSTM), the vector $s$ (from the attention layer), or the vector with dimension $d_f$ returned from the final dense layer in the Mini-Transformer is converted to a real number $\hat{y}$ via a dense layer whose activation is the logistic sigmoid function. If $\hat{y}$ is greater than the threshold 0.5, the prediction is set to 1; otherwise, it is set to 0.

In the Mini-Transformer, we employ two blocks, each consisting of a multi-head attention sublayer and a fully connected dense sublayer without bias. It is worth noting that the activation function of the dense sublayer is the Rectified Linear Unit [16] (ReLU), but the final dense layer has no activation function.
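A minimal sketch of this Mini-Transformer structure in tf.keras is given below. The hyper-parameters follow Section 5; the pooling over time steps before the final dense layer and the exact block wiring are our own assumptions, and the authoritative structure is the one shown in Figure 4.

```python
import tensorflow as tf

def build_mini_transformer(vocab_size=7998, T=15, d=25, num_heads=8, d_k=64, d_f=16):
    inputs = tf.keras.Input(shape=(T,), dtype="int32")
    x = tf.keras.layers.Embedding(vocab_size + 1, d)(inputs)              # word embedding layer
    for _ in range(2):                                                    # two Mini-Transformer blocks
        x = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_k)(x, x)
        x = tf.keras.layers.Dense(d_f, activation="relu", use_bias=False)(x)   # bias-free ReLU sublayer
    x = tf.keras.layers.GlobalAveragePooling1D()(x)                       # assumed pooling over time steps
    x = tf.keras.layers.Dense(d_f)(x)                                     # final dense layer, no activation
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)           # sigmoid output layer
    return tf.keras.Model(inputs, outputs)
```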

5. Experimental Settings and Results

If the frequency of a word is greater than or equal to 140, the word is considered a high-frequency word. From the word frequency statistics, the total number of high-frequency words is 7,996, so the length of the word dictionary is 7,998, including the uniform tags for low-frequency words and stopwords.

To configure the Deep Learning models for training, in tf.keras.Model.compile, we set optimizer = Adam(learning_rate = 0.0002) and loss = BinaryCrossentropy().

For the word embedding layer, the dimension of word vectors $d$ is 25 and the maximum length of news headlines $T$ is 15. For LSTM and the attention layer, the dimension of hidden states $d_h$ is 16 and the attention dimension $d_a$ is 32.

In the Mini-Transformer, for the multi-head attention sublayer, the number of dot-product attention heads $h$ is 8 and the dimension $d_k$ of each head is 64; for the final dense layer, the output dimension $d_f$ is 16, which is the same as $d_h$.

After shuffling, 80% of the original labeled dataset is split into the training set and 20% into the test set for cross-validation. For batch training [17], we combine consecutive news headlines of this text dataset into batches, with the batch size set to 768.
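These settings can be sketched as follows, reusing the build_mini_transformer function from the sketch in Section 4.4; the placeholder arrays X and y stand in for the encoded headlines and their labels and are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the encoded headlines (shape (N, 15)) and labels in {0, 1}
X = np.random.randint(0, 7999, size=(100000, 15))
y = np.random.randint(0, 2, size=(100000,))

model = build_mini_transformer()            # or the LSTM / LSTM + attention models from Section 4.4
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0002),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])

# Shuffled 80/20 split, batch size 768, 10 epochs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
model.fit(X_train, y_train, batch_size=768, epochs=10, validation_data=(X_test, y_test))
```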

For LSTM, LSTM with attention layer and Mini-Transformer, test set accuracy and loss curves in 10 epochs are shown in Figure 5; accuracy, precision, recall, and F1 score are shown in Table 1.

For further comparison with the models from [18], we conducted several contrast experiments by employing three regular Machine Learning models: logistic regression [19], linear support vector machine (Linear SVM) [20], and random forest [21].

For logistic regression, the primal formulation is solved with the liblinear solver (dual = False). For Linear SVM, the algorithm is selected to solve the primal optimization problem (dual = False). For random forest, the minimum number of samples required to split an internal node is 50 (min_samples_split = 50). The other hyper-parameters are the scikit-learn defaults.
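A sketch of the three baseline models with the non-default hyper-parameters mentioned above, all other settings left at the scikit-learn defaults:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

log_reg = LogisticRegression(solver="liblinear", dual=False)   # primal formulation, liblinear solver
linear_svm = LinearSVC(dual=False)                             # solve the primal optimization problem
random_forest = RandomForestClassifier(min_samples_split=50)   # min samples to split an internal node

# Each baseline is then fitted on the same encoded headlines, e.g. log_reg.fit(X_train, y_train)
```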

From Table 1, the three regular Machine Learning models do not achieve good classification results; this may be because they are too simplistic to process massive news headline data.

Compared with LSTM, the Mini-Transformer achieves an obvious accuracy improvement in classification performance (0.9%–1.0%). However, LSTM with an attention layer achieves only a slight accuracy improvement (<0.1%); this may be because the maximum length of news headlines is so short that a general attention layer cannot sufficiently reflect the importance of all hidden states returned from the LSTM layer.

6. Conclusion

Existing work has not focused on fraudulent news headline detection. In this paper, we have compared the classification performance of the mainstream LSTM network and the general attention mechanism for fraudulent news headline detection using massive news headline data, which is helpful for research on the defense system against social engineering attacks.

In addition, drawing on this experience, we have built a more advanced Deep Learning model, the Mini-Transformer, which further improves the classification performance.

There is still room to optimize the proposed Deep Learning models. For future work, we can employ Bidirectional Encoder Representations from Transformers (BERT) as an NLP pre-training method. Additionally, adversarial training and virtual adversarial training may be beneficial for improving the classification performance.

Data Availability

The data used to support the findings of this study are available from The Examiner - Spam Clickbait Catalog (Kaggle): https://www.kaggle.com/therohk/examine-the-examiner, A Million News Headlines (Kaggle): https://www.kaggle.com/therohk/million-headlines, and News Category Dataset (Kaggle): https://www.kaggle.com/rmisra/news-category-dataset.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Key R&D Program of China (2017YFB0802805 and 2017YFB0801701), the National Natural Science Foundation of China (Grant No. U1936120), and the Basic Research Program of State Grid Shanghai Municipal Electric Power Company (52094019007F).