Research Article

Detecting Web Spam Based on Novel Features from Web Page Source Code

Table 6

Logit regression results.

Type                        Feature                    Coefficient   Std. error   Pr(>|z|)

Selected existing features  HST_19                     0.7254        0.338        0.032
                            HST_20                     2.3106        0.420        0.000
                            outdegree_hp               0.0066        0.003        0.015
                            pagerank_hp                −9.712e+04    4.2e+05      0.817
                            truncatedpagerank_1_hp     −1.043e+05    1.35e+06     0.939
                            truncatedpagerank_2_hp     −1.079e+05    2.35e+06     0.963
                            truncatedpagerank_3_hp     −1.101e+05    4.28e+06     0.980
                            truncatedpagerank_4_hp     −1.084e+05    2.74e+06     0.968
                            L_outdegree_hp             −0.0425       0.002        0.000
                            L_pagerank_hp              0.0472        0.012        0.000
                            L_truncatedpagerank_1_hp   6.4183        0.665        0.000
                            L_truncatedpagerank_2_hp   −8.9288       1.766        0.000
                            L_truncatedpagerank_3_hp   3.5543        2.380        0.135
                            L_truncatedpagerank_4_hp   −0.9848       1.369        0.472

Novel features              Diversity of HTML tags     −0.0523       0.011        0.000
                            Depth of element nodes     −0.0123       0.002        0.000
                            Number of HTML tags        −0.043        0.000        0.000
                            Number of external links   0.0460        0.003        0.000
                            Number of cross links      0.0173        0.002        0.000
                            Similarity of texts        −0.0816       0.029        0.004
                            Similarity of links        1.7910        0.249        0.000
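
As a reading aid for the Pr(>|z|) column, the following minimal sketch recomputes a two-sided p-value from a coefficient and its standard error. It assumes the column reports the usual Wald z-test (z = coefficient / standard error, compared against a standard normal), which the article does not state explicitly. The helper name `wald_p_value` is hypothetical, and because the table values are rounded, recomputed p-values can differ in the last digit.

```python
import math


def wald_p_value(coef: float, std_err: float) -> float:
    """Two-sided p-value for a Wald z-test (assumed test; not stated in the article)."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# (coefficient, standard error) pairs copied from three rows of Table 6.
rows = {
    "HST_19": (0.7254, 0.338),
    "pagerank_hp": (-9.712e4, 4.2e5),
    "L_truncatedpagerank_3_hp": (3.5543, 2.380),
}

for name, (coef, se) in rows.items():
    print(f"{name}: z = {coef / se:.3f}, p = {wald_p_value(coef, se):.3f}")
```

Under this assumption, the recomputed p-values for these three rows agree with the tabulated ones up to rounding. The very large standard errors on the raw PageRank-type features explain why their coefficients are not significant despite their magnitude.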