Review Article

An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach

Table 1

Percentage of different spam signals.

Spam signals | Spam vs. nonspam ratio (odds ratio) | Percentage of sites with spam signals and marked as spam (%) | Percentage of sites with spam signals and marked as nonspam (%)
Single page website | 5.21 | 8.87 | 5.39
Thin content | 4.08 | 26.67 | 5.09
No contact information | 6.78 | 19.28 | 2.07
Presence of spammy keywords | 11.32 | 23.59 | 2.32
No SSL certificates | 13.01 | 27.30 | 3.83
No links to social media accounts | 9.43 | 36.18 | 10.98
External outgoing links | 4.60 | 21.47 | 6.75
Content to links ratio | 3.78 | 36.48 | 5.89
The ratio of incoming links | 9.17 | 20.06 | 1.96
External links in navigation | 12.52 | 16.08 | 5.89
A few internal links | 3.06 | 32.76 | 6.60
URL length | 6.90 | 21.69 | 3.63
Numerals in domain name | 7.36 | 35.44 | 4.69
Top-level domains | 11.17 | 21.29 | 11.18
Huge proportion of anchor text | 2.15 | 26.78 | 7.66
Site markup proportion | 4.41 | 10.36 | 8.87
Broken links | 1.19 | 17.50 | 7.96
Favicon | 2.45 | 11.37 | 5.09
Page not found 404 error | 2.39 | 15.27 | 9.47
Meta description length | 1.72 | 13.78 | 5.59
Length of the title | 2.70 | 10.70 | 3.88