Abstract

Search engines are critical in people's daily life because they determine the quality of the information people obtain through searching. Fierce competition for ranking in search engines benefits neither users nor search engines. Existing research mainly studies the content and links of websites; however, none of these techniques focuses on semantic analysis of links and anchor text for detection. In this paper, we propose a web spam detection method that extracts novel feature sets from the homepage source code and uses the random forest (RF) as the classifier. The novel feature sets are extracted from the homepage's links, hypertext markup language (HTML) structure, and the semantic similarity of its content. We conduct experiments on the WEBSPAM-UK2007 and UK-2011 datasets using five-fold cross-validation. In addition, we design three sets of experiments to evaluate the performance of the proposed method. The proposed method with the novel feature sets is compared against different indicators and outperforms other methods, with a precision of 0.929 and a recall of 0.930. The experimental results show that the proposed model can effectively detect web spam.

1. Introduction

With the rapid development of the network, web applications have become increasingly popular in recent years, and among them search engines are one of the most common tools people use to obtain information every day [1]. As the most popular search engine worldwide, Google processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide in 2012 [2]. Data [3, 4] indicate that 85% of Internet users find websites through search engines and 90% of Internet users do not go past the first three pages of search results. Because most users only access the first page of results, spammers design pages carefully to improve their rankings. The literature [5] gives a brief definition of web spamming: in short, web spamming is black-hat search engine optimization (SEO) that deceives search engines to increase the ranking of a page in search engine results, and such pages are called web spam. Evidently, spammers try to deceive search engines and attract end users to click on web spam sites. Web spam pages not only reduce the effectiveness and efficiency of search engine results, since they take much time to process, but may also be full of malicious content and links. Lina et al. [6] present threats and related attacks associated with web spam. Although search engine companies have employed various methods to counter spam [7], it remains a challenge to keep pace with the growth of black-hat SEO techniques and spam pages. Therefore, it is of great significance to detect web spam efficiently and accurately.

Many researchers and experts have studied spam in this field. Several researchers have relied on feature extraction from the text and links of web pages [8, 9]. Others detect web spam by crawling and observing the different versions of a web page returned to search engines and to ordinary users [10, 11], and there are also methods based on spam purposes and user access logs [12]. Detection methods have evolved from using simple statistical characteristics to decide whether a web page is spam to automatic detection with machine learning and deep learning, and both the efficiency and accuracy of detection have improved continuously. We are motivated by previous work in web spam detection and cybersecurity, which has proved the viability of detecting web spam with a combination of machine learning and effective features, and we use similar insights to support the discovery of web spam based on novel features. Our method differs from previous studies in that we extract from the source code not only link-based statistical features but also semantic features based on text content analysis and structural features based on the structure of the web page. Additionally, in terms of real-world applications, the proposed method could be deployed in the browser: judging each web page in a user's search results with the proposed method can provide constructive conclusions to users and browser manufacturers. Because the proposed method includes semantics-based features, it is helpful for detecting spam in web pages into which links to spam content are easily injected.

In this paper, we propose a method using the machine learning algorithm RF that combines feature extraction and feature selection to classify whether a web page is spam or not. Note that, as a binary classification problem, the classifier aims to distinguish a web page as spam or nonspam. The main contributions of this paper are as follows:
(1) This paper considers some previously undescribed features for web spam detection. We extract three novel feature subsets by studying the homepage's links, texts, and structure based on statistical and semantic similarity analysis. The experimental results show that the novel features rank high in importance and are effective.
(2) This paper applies a feature selection method to the precomputed features related to the homepage to reduce computational cost and improve accuracy. We introduce the random forest algorithm to build the web spam detection model. The method can automatically distinguish web spam from a normal page based on the website homepage.
(3) We evaluate the proposed method with comprehensive evaluation metrics for binary classification problems. Our method achieves an F1 score of 92.9%, which is higher than that of existing methods. The experimental results show that our method can effectively detect web spam.

The rest of this paper is organized as follows. Section 2 presents related work regarding web spam detection. Section 3 describes the proposed approach in detail, and Section 4 evaluates the proposed method and the results of our experiments. Finally, we discuss our conclusions and future work.

2. Related Work

Web spam is often categorized into four classes: content spam, link spam, cloaking, and redirection. Several researchers and experts have presented corresponding methods to combat each kind of web spam.

There were many research methods from different perspectives in the early stages. For example, Piskorski et al. [13] explored the utility of content-based linguistic features by computing 208 linguistic attributes. Benczúr et al. [14] conducted commercial intent analysis because, unlike the ordinary methods that depend on the website itself, the authors believed much web spam was created for commercial purposes. Bíró et al. [15] applied an extension of latent Dirichlet allocation (LDA), a linked LDA technique, to web spam classification, in which topics are propagated along links so that a linked document directly influences the words in the linking document. Liu et al. [16] analyzed web spam with user behavior, where user visiting patterns of spam pages were studied and three user behavior features were proposed to separate web spam from ordinary pages. Luca et al. [17] studied the spectrum of black-hat cloaking techniques that target browser, network, or contextual cues to detect organic visitors; their anticloaking system can detect whether a web page would split the view of content returned to two or more distinct browsing profiles.

As machine learning developed, a variety of popular machine learning algorithms combined with various feature engineering methods have been applied to detect web spam. Machine learning techniques are more flexible than other methods; some difficult problems can be solved and higher accuracy becomes attainable. Liu et al. [18] used a sentiment analysis model based on topic-enhanced word embedding to obtain more complete textual context information, where a document-topic distribution matrix is used to extract document features. Mohammadi et al. [19] proposed a method that improves the support vector machine (SVM) algorithm by using two nonlinear kernels in a twin support vector machine (MKTWSVM), which was evaluated on both the UK-2006 and UK-2007 datasets; the authors used a language-model approach and qualified-link analysis for detection. Fdez-Glez et al. [20] proposed a new framework that combines different techniques and is particularly suitable for filtering spam content on web pages. Mei et al. [21] proposed an improved PageRank algorithm based on web page differentiation (DPR), which evaluates a page's authority according to the number of its links and assigns corresponding weights according to its authoritativeness when distributing PageRank values; they combined DPR with K-means and designed a differentiation-page-based K-means algorithm. Jelodar et al. [22] presented a systematic framework based on the chi-squared automatic interaction detector algorithm and a modified string matching algorithm; the authors used the modified Knuth–Morris–Pratt algorithm to extract features from the Alexa Top 500 Global Sites and Bing search engine results for 500 queries and then generated a tree model with useful attributes that can detect web spam. Asdaghi and Soleimani [23] proposed a new backward elimination feature selection approach with the Naive Bayes (NB) classifier.

Many experts have also used neural networks and deep learning algorithms to detect web spam. Moraes et al. [24] presented a performance evaluation of different artificial neural network models used to automatically classify and filter real samples of web spam based on their content. Li et al. [25] introduced deep belief networks combined with the synthetic minority oversampling technique (SMOTE) and the denoising autoencoder (DAE) algorithm to improve the classification performance on web spam. In [26], the authors presented a framework called FS2RNN, a feature selection scheme using recurrent neural networks (RNNs), for the classification of spam nodes; the dataset is preprocessed before applying RNNs, with principal component analysis (PCA) used for dimension reduction and recursive feature elimination (RFE) used for feature selection. Belahcen et al. [27] addressed the web spam detection problem with the graph neural network (GNN) architecture, which can act as a mixed transductive-inductive model able to classify pages by using both the explicit memory of the classes assigned to the training examples and the information stored in the network parameters.

In addition to traditional e-mail spam and web spam, many scholars study spam on social media, called social spam, such as spam in blogs, tweets, and YouTube videos. Fu et al. [28] presented a framework that detects spammers by measuring how careful a user is when he or she is about to follow a potential spammer. Samsudin et al. [29] proposed a framework that extracts features from data collected from the YouTube spam dataset to detect spam in YouTube comments. To deal with users who are affected by social spam, Ezpeleta et al. [30] focused on mood analysis and other content-based analysis techniques. These studies provide heuristics that we can apply to the problem addressed in this paper.

3. Proposed Method

In this section, we discuss the framework of the proposed method, give a comprehensive description of the process of mining the novel features, and determine the classification algorithm trained for the web spam detection model presented in this paper. The framework of the proposed method is depicted in Figure 1. The input is the web pages of various websites. The output is a list of web pages with predicted classification scores, where a higher score indicates that a web page is more likely to be web spam. The proposed method is composed of three components: preprocessing, features, and the detection model. The number in brackets is the number of features. Next, we describe the functionality and specific implementation of each component in detail.

3.1. Data Augmentation

We design our method based on the WEBSPAM-UK2007 dataset [31]. Among the labeled samples in the original dataset, the proportion of spam is only 6%, so one of the main challenges we face is that the data are very imbalanced. Machine learning algorithms are undoubtedly data-driven approaches, which means that the performance of the model is highly related to the data. Therefore, to augment the data, we extract more original data from the results of previous studies on this dataset. In summary, we select labeled data from the high-accuracy labeling results of previous studies on the dataset and use them as labels for our data. Detailed information about data augmentation is given in Section 4.1.

3.2. WARC Parser

First of all, the dataset is structured, but the complex structured data cannot be applied directly because it contains unnecessary information, so we need to process the raw data. The original HTML documents of the web pages in each host are arranged in sequence and stored in the Web Archive (WARC) format proposed by the Internet Archive. The WARC format is an extension of the ARC file format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Figure 2 shows a code snippet of a WARC file. Each capture in a WARC file is preceded by a one-line header (line 1) that very briefly describes the harvested content and its length. After the one-line header, the HTTP protocol response headers (lines 5 to 16) are recorded, followed by multiple lines of the HTML document (lines 18 to 43). We develop a WARC parser to separate the blocks into individual HTML documents one by one and store them in different folders according to the domain extracted from the uniform resource locator (https://chato.cl/webspam/datasets/uk2007/contents/excerpt.txt) field of the one-line header. Since the domain of each host is different, the folder name is the domain name.
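For illustration only, the following minimal sketch (not the parser developed for the paper) splits a WARC file into per-domain HTML files using the warcio library; the input path, output layout, and file-naming scheme are assumptions.

```python
from pathlib import Path
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def split_warc_by_domain(warc_path: str, out_dir: str) -> None:
    """Store each HTTP response payload under a folder named after its domain."""
    out = Path(out_dir)
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            if not url:
                continue
            domain = urlparse(url).netloc
            folder = out / domain
            folder.mkdir(parents=True, exist_ok=True)
            # Use the URL path as a sanitized file name, "_root_" for "/"
            name = urlparse(url).path.strip("/").replace("/", "_") or "_root_"
            (folder / f"{name}.html").write_bytes(record.content_stream().read())

if __name__ == "__main__":
    split_warc_by_domain("uk2007-sample.warc.gz", "parsed_hosts")  # hypothetical paths
```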

3.3. Homepage Extraction and Check

A website contains at least one web page, and some websites contain up to several hundred pages. The result a user obtains by searching for keywords in a search engine is just one web page, and it is challenging to obtain all the web pages of the website to which that page belongs. In other words, getting all the web pages is not easy, but getting the homepage is relatively simple. Moreover, the homepage is the core of a website and covers the main content the website wants to convey to its visitors. For example, the websites of some companies, governments, and schools display related information such as history, main business, and contact information on the homepage. Statistics show that web spam pages are more inclined to improve their rankings in search engine results pages, especially for homepages. Figure 3 shows that the percentage of websites whose homepage has the largest PageRank value among all pages on the website is higher for spam websites than for nonspam websites. This indicates that when spammers create a website, they intentionally give the homepage the highest ranking. Some well-known algorithms for calculating page rankings include PageRank, Page Score, and TrustRank [32]. Therefore, it is very representative to check whether the homepage is spam, and to some extent, the homepage can represent whether the entire website is spam.

The next step is to determine which HTML document is the homepage of a website. In the process of parsing HTML documents from WARC files, we roughly judge whether a web page is a homepage from the URL path. We set the name of every HTML document to its path name and store it under the website to which it belongs. We use some simple rules: for example, a URL whose path is only the root path "/" can be taken as the homepage, and a first-level path with a distinct keyword such as "index," "home," or "homepage" is also taken as the homepage. Of course, the web pages under some hosts do not match any of these rules, so a manual check is required.
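A minimal sketch of such rule-based homepage detection is shown below; the exact keyword list and rule set are illustrative assumptions rather than the paper's full implementation.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the text mentions "index", "home", and "homepage".
HOMEPAGE_KEYWORDS = ("index", "home", "homepage")

def looks_like_homepage(url: str) -> bool:
    """Return True if the URL path suggests the page is the site's homepage."""
    path = urlparse(url).path.lower().rstrip("/")
    if path == "":                      # root path "/"
        return True
    parts = path.lstrip("/").split("/")
    if len(parts) == 1:                 # first-level path only
        stem = parts[0].split(".")[0]   # drop extensions such as .html or .php
        return stem in HOMEPAGE_KEYWORDS
    return False

print(looks_like_homepage("http://example.co.uk/"))           # True
print(looks_like_homepage("http://example.co.uk/index.html")) # True
print(looks_like_homepage("http://example.co.uk/news/2007"))  # False
```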

3.4. Features

We extract some novel features from the source code of the web page and, together with existing features, divide them into four categories: homepage link features, semantic similarity features, homepage structure complexity features, and existing features. Although features based on links, content, and structure have been used in previous papers, in this paper we study these features from a different perspective.

3.4.1. Link Features

It is necessary to consider link characteristics because spammers deliberately set up a large number of hyperlinks between spam websites that point to each other, which increases the clickthrough rate and the PageRank value of a website's homepage. There are some specific tactics, such as inserting hyperlinks in the homepage that point to essential or well-known websites to attract other pages to point back to their own pages, or using information hiding techniques to publish valuable information on the Internet while adding text or hyperlinks that are invisible to users and point to spam homepages. These practices all increase the incoming links, also called the indegree, of the homepages of spam websites. This raises the PageRank value of the homepage and advances the spam page in search engine rankings, whereas normal hosts are built in a reasonable and standardized way, and the host owner will not deliberately inflate the homepage's incoming links. Many previous studies only considered the total number of links, without considering external links and cross links separately, but the impact of these two kinds of links on the construction of web spam is different. Based on this observation, this paper extracts these two features separately. As discussed above, we propose two link features, the number of external links and the number of cross links, based on all the links in the homepage (a counting sketch follows the list).
(1) Number of external links: external links are defined as hyperlinks that point at an external domain, which means any domain other than the domain the link exists on. We compare the domain of each link in the homepage with the domain to which the homepage belongs and count the number of external links.
(2) Number of cross links: in contrast to external links, cross links, also called internal links, are links that, from within a website, point to another page belonging to the same website. The number of cross links is obtained by subtracting the number of external links from the total number of links.
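As an illustration, a minimal sketch for counting these two features with BeautifulSoup is shown below; the helper name and the treatment of relative links are assumptions.

```python
from urllib.parse import urlparse, urljoin

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def count_link_features(html: str, homepage_url: str):
    """Return (number of external links, number of cross links) for a homepage."""
    homepage_domain = urlparse(homepage_url).netloc
    soup = BeautifulSoup(html, "html.parser")
    external, cross = 0, 0
    for a in soup.find_all("a", href=True):
        # Relative hrefs resolve to the homepage's own domain.
        link_domain = urlparse(urljoin(homepage_url, a["href"])).netloc
        if link_domain and link_domain != homepage_domain:
            external += 1
        else:
            cross += 1
    return external, cross

html = '<a href="/about">About</a><a href="http://other.example.com">Ad</a>'
print(count_link_features(html, "http://example.co.uk/"))  # (1, 1)
```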

3.4.2. Semantic Similarity Features

Generally, for content spam, the primary technique is to repeat the same or similar keywords in large numbers. Some web pages directly copy the content of standard, high-quality websites, so when users search for a specific keyword, these plagiarized websites also obtain relatively high rankings. Users cannot tell that such a page is spam from the restricted snippet displayed in the search engine results alone. These pages add anchor texts that link to marketing and authority websites, or even malicious websites such as gambling and pornographic websites, to entice users to click and earn profits. This malicious behavior can be challenging to detect. Many web pages with interactive functions are easily exploited by spammers; for example, in the comments section of a blog, it is easy to evade censorship and spread malicious links that entice users to click. In previous studies of content-based web spam, researchers mainly focused on the topics and keywords of the entire website. Detecting web spam from the entire content of the website is not accurate enough and allows web spam that uses this technique to evade detection; moreover, this partial web spam technique has not yet been widely studied. After analysis and manual checking, we observe that semantic analysis between the anchor text and the current web page is helpful for web spam detection. Therefore, we extract two semantic similarity features, namely, the similarity of texts and the similarity of links.
(1) Similarity of texts: this feature represents the semantic similarity between the textual descriptions (anchor texts) of the external links in the page and the textual description of the web page itself. It reflects the similarity between the anchor text of the links inserted in the web page and the main content of the web page.
(2) Similarity of links: this feature represents the semantic similarity between the domain of each link in the page and the page's domain. It reflects the similarity between the links inserted in this web page and the page's domain.

In this paper, we choose the word mover's distance (WMD) as the metric to measure semantic similarity. The WMD is a novel distance function between text documents presented in [33]. In a supervised learning task, semantic similarities are useful for classification. WMD measures the difference between two texts by calculating the minimum distance that the word vectors of one text must "move" to reach the word vectors of another text. As shown in Figure 4, after removing stop words (not bold), the remaining words are embedded as vectors in the vector space. The WMD distance between the two short texts is the minimum cumulative distance for the word vectors in short text 1 to travel to short text 2. Since the WMD distance uses the word-level semantic information represented by word2vec [34], it achieves good results in short-text semantic distance calculation, which makes it suitable for computing the semantic similarity features in this paper. The smaller the WMD value, the more similar the two short texts. To automatically extract the semantic similarity features from the homepage's HTML document, we propose an algorithm that completes the task in three steps: short text cleaning, representing words as vectors, and computing the WMD distance. The pseudocode of this algorithm is shown in Algorithm 1.

Require:
(1) hp: homepage of each domain
(2) hp_domain: homepage's domain
(3) Initialize the lists hp_text, ext_links, anchor_texts, and set sim_text, sim_link to null
(4) if (hp is not null) then
(5)  hp_text := ExtractText(hp)
(6)  /* extract hp's title, keywords, description */
(7)  ext_links := CollectLinks(hp)
(8)  /* collect all external links in hp */
(9)  if (hp_text and ext_links are not null) then
(10)   for each link in ext_links do
(11)    anchor_texts := anchor_texts + ExtractLinkText(link)
(12)    link_domains := link_domains + ExtractDomain(link)
(13)    /* extract link's anchor text and domain */
(14)   end for
(15)   sim_text = WMD(hp_text, anchor_texts)
(16)   /* compute the WMD distance between hp's text and the external links' anchor texts */
(17)   sim_link = WMD(hp_domain, link_domains)
(18)   /* compute the WMD distance between hp's domain and the external links' domains */
(19)  else if (hp_text is not null and ext_links is null) then
(20)   sim_text = sim_link = 0
(21)  else
(22)   sim_text = sim_link = 0
(23)  end if
(24) end if
(25) return sim_text, sim_link

As described in Algorithm 1, line 1 and line 2 show that the homepage and the domain name of each website are required. Line 3 creates three lists: hp_text contains the homepage text, ext_links contains all external links, and anchor_texts stores the anchor text corresponding to each external link.

Line 4 to line 24 show the process of calculating the semantic similarity features. There are five functions in the proposed method:
(i) ExtractText(): this function extracts the homepage's title, keywords, and description in meta tags
(ii) CollectLinks(): this function extracts all external links of the homepage
(iii) ExtractLinkText(): it extracts the anchor text of each link in turn
(iv) ExtractDomain(): this function extracts the domain of each link in turn
(v) WMD(): it calculates the WMD distance between the homepage text and each anchor text of the external links and the WMD distance between the homepage domain and each external link domain

Next, we introduce each step in detail.

Step 1. Short Text Cleaning. The HTML document of the homepage contains much information, but only a few parts are used to extract the WMD features. To extract the title, the keywords and description in the meta fields, the external links, and the text description of every external link from the homepage of each website, we first parse the HTML tags with the Beautiful Soup Python library [35] and convert HTML entities to characters with the html Python library. Moreover, we remove punctuation and stop words from the raw texts; we use the stop word list from the NLTK library, which contains 127 English words. Besides, to ensure a hyperlink is an external link, it is necessary to extract each website's domain; note that external links include neither relative paths nor links under the same domain, as mentioned in Section 3.4.1. Then, we splice the content of the homepage's title, keywords, and description meta tags together as hp_text. We also process two parts for each external link: the link itself and the anchor text of the link. Some websites contain more than one external link while some contain none; we push all the external links into the list ext_links. For each link's anchor text, we likewise push the anchor texts corresponding to each link into anchor_texts in turn, separated by spaces.

Step 2. Represent Words as Vectors. Since models accept numerical input only and the words in the short texts are natural language such as English, these words need to be converted into numerical form, that is, embedded in a mathematical space. The vectors mapped to real numbers are called word vectors, the embedding method is called word embedding, and word2vec is one such word embedding method. Word2vec transforms text into vectors in a multidimensional vector space and represents semantic similarity between texts as similarity in the vector space. We use a pretrained model, Google News [36], for the word vectors; it contains 3 million pretrained English word embeddings.

Step 3. Computing WMD Distance. After obtaining the word vectors of the original short texts and links, we calculate the WMD distance to represent the similarity, which is illustrated as follows:

\[
\mathrm{WMD}(D, D') = \min_{\mathbf{T} \ge 0} \sum_{i,j} T_{ij}\, c(i, j),
\]
where \(\mathbf{T}\) is a sparse flow matrix in which \(T_{ij}\) is the weight of word \(i\) in document \(D\) moved to word \(j\) in document \(D'\), and \(c(i, j)\) is the Euclidean distance between word \(i\) and word \(j\), as equation (3) shows:
\[
c(i, j) = \lVert x_i - x_j \rVert_2,
\]
where \(x_i\) and \(x_j\) are the word vectors of word \(i\) and word \(j\) after embedding, respectively.

The sum of the weight moved from word \(i\) to the other document \(D'\) is equal to the weight \(d_i\) of word \(i\) in the first document \(D\), and the sum of the weight moved onto word \(j\) is equal to the weight \(d'_j\) of word \(j\) in document \(D'\):
\[
\sum_{j} T_{ij} = d_i, \qquad \sum_{i} T_{ij} = d'_j.
\]

\(d_i\) can be calculated by equation (5), where \(\mathrm{count}_i\) is the number of times word \(i\) appears in document \(D\):
\[
d_i = \frac{\mathrm{count}_i}{\sum_{k} \mathrm{count}_k}.
\]

In the case of this article, document \(D\) corresponds to the homepage text hp_text and document \(D'\) corresponds to the links and anchor texts of the external links. Because there are multiple short texts, we compare the semantic similarity of each external link with the homepage and then calculate the averages by equations (6) and (7), respectively:
\[
\mathrm{sim\_text} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{WMD}(\mathrm{hp\_text}, \mathrm{anchor\_text}_k), \qquad
\mathrm{sim\_link} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{WMD}(\mathrm{hp\_domain}, \mathrm{link\_domain}_k),
\]
where \(N\) is the number of external links in the homepage.
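For illustration, the sketch below computes these two features with the gensim implementation of WMD and the pretrained Google News vectors; the file path, tokenization, and per-link averaging are assumptions consistent with the description above, not the paper's exact code.

```python
from gensim.models import KeyedVectors  # pip install gensim (WMD also needs POT/pyemd)
from nltk.corpus import stopwords       # pip install nltk; nltk.download('stopwords')

STOP = set(stopwords.words("english"))

def tokenize(text: str) -> list:
    """Lowercase, split on whitespace, and drop English stop words."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP]

# Pretrained Google News embeddings (about 3 million 300-dimensional vectors).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True  # hypothetical local path
)

def semantic_similarity_features(hp_text, hp_domain, ext_links):
    """ext_links: list of (anchor_text, link_domain) pairs from the homepage."""
    if not ext_links:
        return 0.0, 0.0
    text_dists, link_dists = [], []
    for anchor_text, link_domain in ext_links:
        text_dists.append(vectors.wmdistance(tokenize(hp_text), tokenize(anchor_text)))
        link_dists.append(vectors.wmdistance(tokenize(hp_domain), tokenize(link_domain)))
    # Average WMD over all external links (smaller means more similar).
    return sum(text_dists) / len(text_dists), sum(link_dists) / len(link_dists)
```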

3.4.3. Structure Complexity Features

It is necessary to consider the characteristics of the homepage's document object model (DOM) structure because we discover that many web pages use the same templates, such as domain parking services, personal blog websites, and some government websites; there are certain regularities. According to MDN [37], a web page is a document whose structure and content are represented as nodes and objects by the DOM. Structural features can reflect the complexity of web pages. Similar web pages have a similar DOM structure, and web spam is often built from a well-designed template, unlike normal web pages, which have more varied characteristics. Previous methods of identifying web spam mainly focused on differences in content and links, but in this paper we also consider the structural characteristics of the web page. Previous methods have also studied the structure of HTML, for example by extracting the number of <a> tags and <img> tags; however, this may not be very effective for some specific websites, such as online shopping websites and stock picture websites, which have an obvious tendency toward certain types of HTML tags. We not only analyze a certain type of tag but also analyze the complexity of the web page's structure from a more general perspective. We analyze three features: the number and diversity of HTML tags and the depth of element nodes. Consider, for example, a domain parking service, where the parking platform often has a fixed template; we aim to identify such web pages. Thus, it is necessary to study the web page's structure, and we extract the following structural features from the source code of the homepage (a small extraction sketch follows the list).
(1) Number of HTML tags: the DOM represents an HTML document as a tree structure of tags, the DOM tree. We only consider the element-type nodes, that is, HTML tags. By traversing the DOM, we calculate the total number of tags contained in the homepage.
(2) Diversity of HTML tags: different types of websites have different distributions of tags. For example, the web pages in some link farms contain a large number of hyperlinks, which are implemented by the <a> tag; there are many image tags in online stores, while personal blogs have many paragraph tags. We also count the distinct tag types appearing on each homepage.
(3) Depth of element nodes: by traversing each branch of the DOM tree, we calculate the maximum depth of the DOM tree of the homepage. It reflects the complexity of the DOM tree structure and hence the complexity of the homepage's HTML structure.
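A minimal sketch of extracting these three structural features with BeautifulSoup is shown below; the function name and traversal details are illustrative assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def structure_complexity_features(html: str):
    """Return (total number of tags, number of distinct tag types, maximum DOM depth)."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all(True)               # every element-type node
    n_tags = len(tags)
    diversity = len({t.name for t in tags})  # distinct tag names

    def depth(node) -> int:
        children = [c for c in node.children if isinstance(c, Tag)]
        return 1 + max((depth(c) for c in children), default=0)

    max_depth = max((depth(c) for c in soup.children if isinstance(c, Tag)), default=0)
    return n_tags, diversity, max_depth

html = "<html><body><div><p>hi</p><a href='#'>x</a></div></body></html>"
print(structure_complexity_features(html))  # (5, 5, 4)
```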

3.4.4. Precomputed Features

There are 277 precomputed features in total, which are categorized into 4 sets: the direct features set with 2 features, the content-based features set with 96 features, the link-based features set with 41 features, and the transformed link-based features set with 138 features, which are obtained by mathematically transforming the link-based features. If all features were used in the experiments, the high dimensionality would undoubtedly consume considerable resources and cause long execution times. In fact, many features are redundant; by removing redundant features, both efficiency and accuracy can be improved. Therefore, we only take the features related to the homepage into consideration.

First of all, by checking the meaning of each feature, we preliminarily filter out 106 precomputed features from these four feature sets that are all about the homepage (hp), without considering the page with the maximum PageRank (mp) value. Then we select features from these 106 features. We use a new backward elimination approach, Smart-BT, proposed by Asdaghi and Soleimani [23] to accomplish feature selection. This method differs from sequential backward elimination in that it measures the impact on the classification result of eliminating a set of features rather than a single feature. In summary, we extract 7 new features and select 14 features from the existing features, 21 features in total, which are input into the detection model.

3.5. Classification Algorithm

Judging whether a web page is spam is a problem with unclear boundaries; to some extent it is a subjective issue, so classifying web pages as spam or nonspam is challenging. Because of the apparent differences between spam and nonspam web pages in some features, we can use these features to build machine learning models that allow experts or researchers to identify web spam quickly and reduce losses on the ground. Classic algorithms, such as NB, logistic regression (LR), SVM, RF, convolutional neural networks (CNNs), RNNs, and long short-term memory (LSTM), have different advantages and disadvantages. In [38, 39], a large-scale empirical comparison between these machine learning methods is presented. A CNN performs excellently on data with spatial structure, such as image data. An RNN is more suitable for sequential content, such as text, but it suffers from the vanishing gradient problem and struggles with long sequences. LSTM, a special case of the RNN, can avoid the vanishing gradient problem of conventional RNNs.

Considering that the dataset is imbalanced, that the features in the feature set are independent, and the cost of different methods, the RF [40], which combines a multitude of decision trees, is more suitable for the problem solved in this paper. As an ensemble learning method for classification, RF overcomes the weak generalization ability of a single decision tree, since it predicts a sample by having many decision trees vote for the final result. Furthermore, there are several advantages to selecting RF as the classifier: it is inherently easy to interpret and understand, it is easy to implement, and it costs less than deep learning. Therefore, we chose RF as the automatic classifier in this paper.

4. Experiments and Evaluation

In this section, we first describe the experimental environment and detail the source and composition of the dataset used in this paper. Then, the metrics used to measure the performance of the proposed model are discussed, and finally, the experimental results are analyzed.

The experimental studies are conducted on the Ubuntu operating system. The homepage extraction and data preprocessing are implemented with several Python libraries. The model building and classification are implemented with scikit-learn [41] and Keras [42] with the TensorFlow backend [43]. The experimental environment configuration is shown in Table 1.

Since this paper focuses on the extraction and effects of the novel features, the parameters in the detection model should be set or adjusted as little as possible: the simpler the machine learning model, the more likely it is that good experimental results do not rely on specific samples. In the detection model RF, we only set the number-of-trees parameter "n_estimators" to 100 empirically; in fact, the default value of "n_estimators" changed from 10 to 100 in scikit-learn v0.22. The other hyperparameters are left at their default values. The advantage is that the classifier is not tuned for a specific dataset but has generalization capability, so there is reason to believe the proposed method is not overly data dependent and is easy for new users to apply: when encountering a new dataset, one only needs to extract the features proposed in this paper according to the method in Section 3 and input them into the classifier. However, machine learning depends on data, and different data types suit different models, as mentioned in Section 3.5. For data with similar regularities, the proposed method has generalization ability. Firstly, different web spam pages have similar characteristics, such as too many links for link-based web spam or a large number of repetitions of text content for content-based web spam. Secondly, cross-validation is used to evaluate the prediction performance of the model, especially the performance of the trained model on new data; by repeatedly dividing the dataset, cross-validation reduces overfitting to a certain extent and better evaluates the generalization quality of the model.
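For reference, a minimal sketch of this configuration in scikit-learn is shown below; the synthetic data are a stand-in for the 21 selected features, and only n_estimators is set while everything else keeps its default value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the real 21-feature matrix described in Section 3.
X, y = make_classification(n_samples=1000, n_features=21, weights=[0.77, 0.23], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Only the number of trees is set explicitly; all other hyperparameters keep
# their scikit-learn defaults, as described in the text.
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # higher score = more likely spam
labels = (scores > 0.5).astype(int)        # 0.5 threshold, as in Section 4.3
```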

4.1. Dataset

We run our experiments on the WEBSPAM-UK2007 dataset [31], a large collection of 105,896,555 pages in 114,529 hosts based on a crawl of the "uk" domain conducted in May 2007; it was also used as the Web Spam Challenge 2008 dataset. Although the dataset is large, only a few hosts were labeled by a group of volunteers. As shown in Table 2, among the 6479 labeled samples, we discard the data labeled "undecided," because it means the volunteers were uncertain whether they were spam or nonspam, and we discard the data without content features. We also delete the data whose features are incomplete. As a result, 5797 samples remain.

As Table 2 depicts, the number of spam samples is 321 and the proportion of spam is only 6%; the ratio of nonspam samples to spam samples is nearly 16 : 1, which means the data are very imbalanced. In such scenarios, machine learning models cannot learn the characteristic behavior of the minority spam class, so classifying samples accurately as spam or nonspam presents considerable challenges. To address this issue, we re-extract 1215 additional samples, after removing duplicates with the original dataset, from the results given by the top three teams [44] in the Web Spam Challenge 2008. These samples were consistently labeled as "spam" by all three teams, so to a certain extent, these labels can be considered reliable. Although we extracted more labeled data, these data were already 13 years old, so we also consider the newer UK-2011 dataset [45], which was derived from the WEBSPAM-UK2007 dataset. After deduplication, we find that all pages on many websites are invalid, where "invalid" means the source code of these pages has no content or is meaningless; typical situations are "301 Moved Permanently," "Object Moved," "This IP has been banned," and "302 Found." Since these pages have no research value, they are deleted. In addition, we also remove pages that are not in English. In the end, as Table 2 depicts, we have 6189 pages consisting of 4745 nonspam pages and 1444 spam pages.

We acknowledge that conducting the study with a 13-year-old dataset has certain limitations, such as whether it still suits today's rapidly developing web applications. We consider the use of the dataset meaningful for the following reasons.
(1) This dataset is a standard dataset in the field of web spam research, and its labels were judged by multiple scholars, so its labels can be considered authoritative.
(2) Although 13 years have passed, we are still in the web 2.0 era. Developers build web pages, especially web spam, with technologies that were common then and remain common now, which indicates the dataset is still representative. Moreover, during our manual check, we found that some websites are still active.

4.2. Experimental Design

We conduct three sets of experimental studies to fully evaluate the performance of our model: (1) we first evaluate the performance of the proposed method and verify the effectiveness of the novel features; (2) we compare the performance of our method with some popular web spam detection systems; and (3) we use hypothesis testing to verify the validity of our method and analyze the importance of the features.

4.2.1. Experiment for Performance of Proposed Model

We first examine the performance of the RF model on the dataset and compare the results with benchmark models. Considering that the dataset is small, especially for the minority class, and that different samples or different partitions of the dataset may make the results optimistically biased, we adopt cross-validation for training. As a potent tool in machine learning and deep learning, cross-validation ensures that every page in the dataset is used during the experiments, making full use of the data and keeping the experimental results less biased. Thus, we apply 5-fold cross-validation to train all detection models. We input all the features, comprising the existing features and the novel features, into the classification approaches to determine whether a page is spam. We also investigate benchmark traditional machine learning algorithms such as NB, LR, and SVM and benchmark deep learning algorithms including CNN, RNN, and LSTM as basic comparative experiments. Secondly, we compare the classification performance of each model with only the existing features and with all features.
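A sketch of this evaluation protocol with scikit-learn is shown below; the synthetic feature matrix and labels stand in for the 21 features and spam/nonspam labels described earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Stand-in data; in the paper X holds the selected features and y the spam labels.
X, y = make_classification(n_samples=1000, n_features=21, weights=[0.77, 0.23], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100)

# 5-fold cross-validation with the metrics used in Section 4.3.
results = cross_validate(clf, X, y, cv=cv, scoring=["precision", "recall", "f1", "roc_auc"])
for metric in ("precision", "recall", "f1", "roc_auc"):
    print(metric, results[f"test_{metric}"].mean())
```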

4.2.2. Experiment for Comparison of Detection Rate

We also compare against some state-of-the-art methods. For example, Mittal and Juneja [46] presented a mutual-information-based feature selection method that selects content-based features and uses an SVM classifier to distinguish web spam, Makkar et al. [26] used PCA and RFE to process link-based features and fed the features into an RNN classifier to detect spam, and Asdaghi and Soleimani [23] proposed an effective feature selection method to select fewer features and fed them into an NB model to achieve high performance. We use the same method as these papers to divide the dataset into training, validation, and test sets. To make a comparison on the dataset used in this paper, we reproduce these experiments.

4.2.3. Experiment for Validity of the Proposed Features

As the LR model is one of the models considered and the best model is RF, followed by LR, we believe that reporting the logistic regression results with statistical inference is useful for more than one reason. We use statsmodels [47] for the estimation of many different statistical models. Firstly, it verifies that some of the features identified in Section 3.4 can actually be used to detect spam and demonstrates which variables are the most important in this regard. Secondly, it can be used as a benchmark against which the other models can be compared. We use the open-source package "pROC" provided by Robin et al. [48] to compare the area under the curve (AUC) of two different models. It helps us compare the superiority of different models more rigorously, especially when the p value is less than 0.05, making the results more convincing.
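A minimal sketch of such a logit regression with statsmodels is shown below; the synthetic data and column names are assumptions standing in for the features of Section 3.4.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Stand-in features; in the paper these would be the selected homepage features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)),
                 columns=["n_external_links", "sim_text", "max_dom_depth"])
y = (rng.random(500) < 0.2).astype(int)  # synthetic spam/nonspam labels

# Logit regression with an intercept; the summary reports coefficients,
# standard errors, and p values as in Table 6.
model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.summary())
```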

We also use two methods for feature importance analysis in RF: mean decrease impurity (MDI) and mean decrease accuracy (MDA), also known as permutation importance (PI). The two main problems of impurity-based feature importance are that it is biased towards high-cardinality features and that the impurity-based importances are computed on the training set, so it is not certain that the features are also useful on the test set; MDA is an alternative that can mitigate these limitations. Because of cross-validation, we add up the ranked results for each feature and calculate the average importance score.
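For illustration, both importance measures can be obtained from scikit-learn as sketched below; the synthetic data are a stand-in for the paper's feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=21, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# MDI: impurity-based importances, computed on the training set.
mdi = clf.feature_importances_

# MDA / permutation importance: drop in score when a feature is shuffled on the test set.
mda = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

print(mdi.argsort()[::-1][:5])                   # top-5 features by MDI
print(mda.importances_mean.argsort()[::-1][:5])  # top-5 features by MDA
```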

4.3. Evaluation Metrics

In binary classification problems, the most popular performance evaluation indicators are accuracy, precision, recall, and F1 score, which are described in detail as follows.

Accuracy is the number of correct predictions over the total number of predictions of the model:
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]

We judge a sample with a prediction score greater than the threshold as positive (the spam class), where the threshold is 0.5. Here, true positive (TP) is the number of spam samples that are correctly classified, true negative (TN) is the number of nonspam samples that are correctly identified, false negative (FN) is the number of spam samples that are mistakenly classified as nonspam, and false positive (FP) is the number of nonspam samples that are mistakenly classified as spam.

Recall, also called the true positive rate (TPR), is the proportion of spam samples that are correctly identified as spam among all spam samples, defined in equation (9). Precision is the proportion of true spam predictions over all samples predicted as spam, defined in equation (10):
\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}.
\]

F1 score is the harmonic average of precision and recall and is defined in the following equation:
\[
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]

False positive rate (FPR) is defined as follows:
\[
\mathrm{FPR} = \frac{FP}{FP + TN}.
\]

The receiver operating characteristic (ROC) curve is a widely used indicator for performance evaluation in classification problems because it has an outstanding property: the ROC curve remains unchanged when the distribution of positive and negative samples in the test set changes. The AUC, computed from the TPR and FPR (equations (9) and (12), respectively), also represents the classifier's performance: the larger the AUC value, the better the classifier is at detecting web spam.

4.4. Results and Analysis
4.4.1. Classifier Performance Result

We show the performance of the binary classifiers in detecting web spam in Table 3, which reports the results of the different baseline models using the evaluation indicators described in Section 4.3, and Figure 5 illustrates a graphical view of their performance. The table and figure show that the RF model yields the best performance in all aspects of our experiments, with a precision of 0.929 and a recall of 0.930, and has the largest area under the curve. The trees of the RF algorithm are independent during the training process, and the final result is obtained by the vote of all trees; for imbalanced datasets, RF can balance the errors [49]. LR follows closely, and SVM, CNN, and RNN perform well but relatively worse, while NB and LSTM perform slightly worse still. NB is relatively simple and more sensitive to the minority class data, and LSTM is suited to longer sequence data, so the dataset in this paper does not highlight its advantages. The results of the experiment on the performance of the proposed model demonstrate that the RF model can use the features effectively, and we conclude that the selected novel features combined with the chosen RF classifier yield better results. Table 4 shows the results for the existing features, the novel features, and all features under the different evaluation indicators. Without the novel features, the results are inferior to those with the novel features, which means that the novel features we extracted are practical.

4.4.2. Comparison Experiment Result

In this paper, we reproduce three representative methods as comparative experiments, as explained in Section 4.2. These three studies were chosen for comparison because they are relatively new, their results perform well, and they all use the same dataset as this paper. As shown in Table 5, which reports the results of the three state-of-the-art methods, paper [46] achieves an F1 score of 0.883 on our dataset, which is lower than our method by nearly 5%. Our method uses fewer features and achieves better results. We conclude that our proposed method performs better than these methods.

4.4.3. Validity Verification Result

Table 6 shows the regression results for the features used in this paper, including each feature's coefficient, standard error, and p value from the logit regression analysis. It can be seen that the novel features contribute significantly to the model.

“Two ROC curves are ‘paired’ if they derive from multiple measurements on the same sample,” as described by Robin et al. [48]. Thus, we compare the ROC curve of RF with the ROC curves of the other models, respectively. We use the "roc.test()" command from the "pROC" package in R to calculate the p value. All paired ROC curve p values are less than 2.2e-16, which is far less than 0.05. We can therefore say that the RF model (AUC = 0.957) has an AUC that is significantly greater than that of the second-best model (AUC = 0.902), and the results are not accidental, which shows that our method is correct and effective.

As can be observed in Figure 6, Figure 6(a) is the result of using MDI and Figure 6(b) is the result of using MDA. The results of the two RF feature importance ranking methods are not exactly the same. The marked features on the Y-axis are the novel features, and it is clear that the novel features extracted in this paper rank near the top overall. The advantage of the features proposed in this paper is that they are convenient to extract, whereas extracting the precomputed existing features requires more stringent conditions such as the construction of a web graph. The novel features are general and easily accessible.

5. Conclusions and Discussion

Based on current research, this paper proposes a new method to distinguish web spam. We introduce a set of novel features about the homepage, which we manually checked. Meanwhile, we use the feature selection algorithm Smart-BT [23] to reduce the dimension of the precomputed existing features so that the method's computational cost decreases. Then, we use the RF model to identify web spam efficiently. The experimental results show that this method reaches a state-of-the-art level compared with other methods. Besides, the model with the novel features, which are effective for web spam detection, is superior to and more valid than the model with only the existing features. Since this paper takes only the homepage into account, and obtaining all pages of a website is usually not easy, the method is general and extensible. We acknowledge that some biases of our dataset might affect the results. Our method may not work well as web spam evolves, because the boundary between spam and nonspam is likely to blur. Also, we only analyze the source code statically, without considering dynamic parts such as JavaScript code, so our method has limitations for web spam that uses dynamic techniques, for example, cloaking and redirection web spam. The proposed method only focuses on the homepage of a website without confirming whether the website returns different content to users and to search engines, so there is a certain error in detecting this type of web spam. Moreover, many malicious websites redirect to other pages to improve rankings; redirection can be achieved in many ways, such as the redirection field in the meta tag and dynamic scripts in JavaScript. The proposed method does not pay attention to JavaScript code, so its detection of redirection web spam is not accurate enough.

In the future, mining more efficient features based on static and dynamic analysis and using a classifier with higher accuracy would be an interesting direction, and this is the direction we will consider next.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant no. 61902265, the Sichuan Science and Technology Program under Grant nos. 2020YFG0047 and 2020YFG0076, and the Fundamental Research Funds for the Central Universities.