Research Article
Detecting Web Spam Based on Novel Features from Web Page Source Code
Table 6
Logit regression results.
| Type | Feature | Coefficients | Std. error | Pr(>\|z\|) |
| --- | --- | --- | --- | --- |
| Selected existing features | HST_19 | 0.7254 | 0.338 | 0.032 |
| | HST_20 | 2.3106 | 0.420 | 0.000 |
| | outdegree_hp | 0.0066 | 0.003 | 0.015 |
| | pagerank_hp | −9.712e+04 | 4.2e+05 | 0.817 |
| | truncatedpagerank_1_hp | −1.043e+05 | 1.35e+06 | 0.939 |
| | truncatedpagerank_2_hp | −1.079e+05 | 2.35e+06 | 0.963 |
| | truncatedpagerank_3_hp | −1.101e+05 | 4.28e+06 | 0.980 |
| | truncatedpagerank_4_hp | −1.084e+05 | 2.74e+06 | 0.968 |
| | L_outdegree_hp | −0.0425 | 0.002 | 0.000 |
| | L_pagerank_hp | 0.0472 | 0.012 | 0.000 |
| | L_truncatedpagerank_1_hp | 6.4183 | 0.665 | 0.000 |
| | L_truncatedpagerank_2_hp | −8.9288 | 1.766 | 0.000 |
| | L_truncatedpagerank_3_hp | 3.5543 | 2.380 | 0.135 |
| | L_truncatedpagerank_4_hp | −0.9848 | 1.369 | 0.472 |
| Novel features | Diversity of HTML tags | −0.0523 | 0.011 | 0.000 |
| | Depth of element nodes | −0.0123 | 0.002 | 0.000 |
| | Number of HTML tags | −0.043 | 0.000 | 0.000 |
| | Number of external links | 0.0460 | 0.003 | 0.000 |
| | Number of cross links | 0.0173 | 0.002 | 0.000 |
| | Similarity of texts | −0.0816 | 0.029 | 0.004 |
| | Similarity of links | 1.7910 | 0.249 | 0.000 |
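As a reading aid for the Pr(>|z|) column, the following minimal sketch shows how such a value is conventionally obtained from a coefficient and its standard error with a two-sided Wald z-test. It assumes the table reports standard Wald p-values, which the source does not state explicitly. The function names `wald_p_value` and `odds_ratio` are illustrative and do not come from the article, and reading exp(coefficient) as an odds multiplier for the spam class assumes spam is coded as the positive outcome.

```python
import math


def wald_p_value(coefficient: float, std_error: float) -> float:
    """Two-sided p-value for H0: coefficient = 0, using z = coefficient / std_error."""
    z = coefficient / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds per unit increase of the feature."""
    return math.exp(coefficient)


# Values copied from the table rows for HST_19 and pagerank_hp.
p_hst19 = wald_p_value(0.7254, 0.338)
p_pagerank = wald_p_value(-9.712e4, 4.2e5)
or_hst19 = odds_ratio(0.7254)

print(f"HST_19:      p = {p_hst19:.3f}, odds ratio = {or_hst19:.2f}")
print(f"pagerank_hp: p = {p_pagerank:.3f}")
```

With the rounded inputs from the table, this reproduces the reported values of 0.032 for HST_19 and 0.817 for pagerank_hp. Rows whose standard errors are heavily rounded in the table (for example, 0.003 or 0.000) may not reproduce exactly from the printed figures.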