Review Article
An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach
Table 1
Percentage of different spam signals.
| Spam signals | Spam vs. nonspam ratio (odds ratio) | Percentage of sites with spam signals and marked as spam (%) | Percentage of sites with spam signals and marked as nonspam (%) |
| --- | --- | --- | --- |
| Single page website | 5.2 | 18.87 | 5.39 |
| Thin content | 4.08 | 26.67 | 5.09 |
| No contact information | 6.78 | 19.28 | 2.07 |
| Presence of spammy keywords | 11.32 | 23.59 | 2.32 |
| No SSL certificates | 13.01 | 27.30 | 3.83 |
| No links to social media accounts | 9.43 | 36.18 | 10.98 |
| External outgoing links | 4.60 | 21.47 | 6.75 |
| Content to links ratio | 3.78 | 36.48 | 5.89 |
| The ratio of incoming links | 9.17 | 20.06 | 1.96 |
| External links in navigation | 12.52 | 16.08 | 5.89 |
| A few internal links | 3.06 | 32.76 | 6.60 |
| URL length | 6.90 | 21.69 | 3.63 |
| Numerals in domain name | 7.36 | 35.44 | 4.69 |
| Top-level domains | 11.17 | 21.29 | 11.18 |
| Huge proportion of anchor text | 2.15 | 26.78 | 7.66 |
| Site markup proportion | 4.41 | 10.36 | 8.87 |
| Broken links | 1.19 | 17.50 | 7.96 |
| Favicon | 2.45 | 11.37 | 5.09 |
| Page not found 404 error | 2.39 | 15.27 | 9.47 |
| Meta description length | 1.72 | 13.78 | 5.59 |
| Length of the title | 2.70 | 10.70 | 3.88 |
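The odds-ratio column of Table 1 can be related to the two percentage columns with a short calculation. The sketch below assumes the standard definition of an odds ratio, (p_spam / (1 - p_spam)) / (p_nonspam / (1 - p_nonspam)), where each p is the corresponding percentage divided by 100. The source does not state how its odds-ratio column was derived, so values computed this way are not guaranteed to match that column. The function name `odds_ratio` and the `rows` dictionary are illustrative only.

```python
def odds_ratio(pct_spam: float, pct_nonspam: float) -> float:
    """Odds ratio of a signal, given the percentage of spam sites and
    the percentage of nonspam sites that exhibit it.

    Assumption: the standard odds-ratio definition; the article does
    not spell out its own formula.
    """
    if not (0 < pct_spam < 100 and 0 < pct_nonspam < 100):
        raise ValueError("percentages must lie strictly between 0 and 100")
    p_spam = pct_spam / 100.0
    p_nonspam = pct_nonspam / 100.0
    odds_spam = p_spam / (1.0 - p_spam)          # odds among spam sites
    odds_nonspam = p_nonspam / (1.0 - p_nonspam)  # odds among nonspam sites
    return odds_spam / odds_nonspam


# A few (spam %, nonspam %) pairs copied from Table 1.
rows = {
    "Single page website": (18.87, 5.39),
    "Thin content": (26.67, 5.09),
    "No contact information": (19.28, 2.07),
}

for signal, (pct_spam, pct_nonspam) in rows.items():
    print(f"{signal}: {odds_ratio(pct_spam, pct_nonspam):.2f}")
```

An odds ratio above 1 means the signal is more common among sites marked as spam than among sites marked as nonspam; the larger the value, the stronger the signal discriminates under this definition.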