Research Article

WSF2: A Novel Framework for Filtering Web Spam

Algorithm 5

WSF2 filter definition combining the C5.0 and SVM classifiers with regular-expression rules.
web_body HAS_GRATIS_ON_BODY eval("[gG][rR][aA][tT][iI][sS]")
describe HAS_GRATIS_ON_BODY Finds whether the web page contains references to “Gratis” in its content.
score HAS_GRATIS_ON_BODY +

web_body HAS_GORGEOUS_ON_BODY eval("[gG][oO][rR][gG][eE][oO][uU][sS]")
describe HAS_GORGEOUS_ON_BODY Finds whether the web page contains references to “Gorgeous” in its content.
score HAS_GORGEOUS_ON_BODY +

web_body HAS_FOXHOLE_ON_BODY eval("[fF][oO][xX][hH][oO][lL][eE]")
describe HAS_FOXHOLE_ON_BODY Finds whether the web page contains references to “Foxhole” in its content.
score HAS_FOXHOLE_ON_BODY +

web_body HAS_TRANSEXUAL_ON_BODY eval("[tT][rR][aA][nN][sS][eE][xX][uU][aA][lL]")
describe HAS_TRANSEXUAL_ON_BODY Finds whether the web page contains references to “Transexual” in its content.
score HAS_TRANSEXUAL_ON_BODY +

web_body HAS_GODDAM_ON_BODY eval("[gG][oO][dD][dD][aA][mM]")
describe HAS_GODDAM_ON_BODY Finds whether the web page contains references to “Goddam” in its content.
score HAS_GODDAM_ON_BODY +

web_body HAS_SLUTTY_ON_BODY eval("[sS][lL][uU][tT]{1,2}[yY]")
describe HAS_SLUTTY_ON_BODY Finds whether the web page contains references to “Slutty” in its content.
score HAS_SLUTTY_ON_BODY +

web_body HAS_UNSECUR_ON_BODY eval("[uU][nN][sS][eE][cC][uU][rR]")
score HAS_UNSECUR_ON_BODY +

web_body HAS_BUSINESSOPPORTUNITY_ON_BODY eval("[bB][uU][sS][iI][nN][eE][sS]{1,2}[ ][oO][pP]{1,2}[oO][rR][tT][uU][nN][iI][tT][yY]")
describe HAS_BUSINESSOPPORTUNITY_ON_BODY Finds whether the web page contains references to “Business Opportunity” in its content.
score HAS_BUSINESSOPPORTUNITY_ON_BODY 5

web_body HAS_GAY_ON_BODY eval("[gG][aA][yY]")
describe HAS_GAY_ON_BODY Finds whether the web page contains references to “Gay” in its content.
score HAS_GAY_ON_BODY 5

web_body HAS_CHEAP_ON_BODY eval("[cC][hH][eE][aA][pP]")
describe HAS_CHEAP_ON_BODY Finds whether the web page contains references to “Cheap” in its content.
score HAS_CHEAP_ON_BODY 5

web_body HAS_BLONDE_ON_BODY eval("[bB][lL][oO][nN][dD][eE]")
describe HAS_BLONDE_ON_BODY Finds whether the web page contains references to “Blonde” in its content.
score HAS_BLONDE_ON_BODY 5

web_body HAS_BARGAIN_ON_BODY eval("[bB][aA][rR][gG][aA][iI][nN]")
describe HAS_BARGAIN_ON_BODY Finds whether the web page contains references to “Bargain” in its content.
score HAS_BARGAIN_ON_BODY 5

web_body HAS_RESORT_ON_BODY eval("[rR][eE][sS][oO][rR][tT]")
describe HAS_RESORT_ON_BODY Finds whether the web page contains references to “Resort” in its content.
score HAS_RESORT_ON_BODY 5

web_body HAS_VENDOR_ON_BODY eval("[vV][eE][nN][dD][oO][rR]")
describe HAS_VENDOR_ON_BODY Finds whether the web page contains references to “Vendor” in its content.
score HAS_VENDOR_ON_BODY 5

web_features SVM check_svm()
describe SVM Classifies a web page as spam using a Support Vector Machine classifier
score SVM 5

web_features TREE_00 check_tree(0.0, 0.25)
describe TREE_00 Classifies a web page as spam if the C5.0 probability is between 0.0 and 0.25
score TREE_00 -1

web_features TREE_25 check_tree(0.25, 0.50)
describe TREE_25 Classifies a web page as spam if the C5.0 probability is between 0.25 and 0.50
score TREE_25 3

web_features TREE_50 check_tree(0.50, 0.75)
describe TREE_50 Classifies a web page as spam if the C5.0 probability is between 0.50 and 0.75
score TREE_50 4

web_features TREE_75 check_tree(0.75, 1.00)
describe TREE_75 Classifies a web page as spam if the C5.0 probability is between 0.75 and 1.00
score TREE_75 5

required_score 5
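
To make the scoring logic of the listing concrete, the following Python sketch shows one way such a filter definition might be evaluated: each rule that fires contributes its score, and the page is labelled spam when the total reaches required_score. It is only an illustration under stated assumptions and not the WSF2 implementation. The names BODY_RULES, TREE_RULES, tree_rule, and classify are hypothetical; the "+" score is assumed to mean a definitive spam score (modelled here as infinity); the C5.0 probability ranges are assumed to be half-open intervals [low, high), with 1.00 assigned to the last bucket; and the SVM verdict and C5.0 probability are passed in as plain arguments rather than computed.

```python
import re
from math import inf

# Hypothetical in-memory form of three web_body rules from the listing:
# (rule name, regular expression, score).
BODY_RULES = [
    # ASSUMPTION: "+" is read as a definitive spam score, modelled as infinity.
    ("HAS_SLUTTY_ON_BODY", r"[sS][lL][uU][tT]{1,2}[yY]", inf),
    ("HAS_BUSINESSOPPORTUNITY_ON_BODY",
     r"[bB][uU][sS][iI][nN][eE][sS]{1,2}[ ][oO][pP]{1,2}"
     r"[oO][rR][tT][uU][nN][iI][tT][yY]", 5),
    ("HAS_CHEAP_ON_BODY", r"[cC][hH][eE][aA][pP]", 5),
]

# The four check_tree rules: (rule name, low, high, score).
TREE_RULES = [
    ("TREE_00", 0.00, 0.25, -1),
    ("TREE_25", 0.25, 0.50, 3),
    ("TREE_50", 0.50, 0.75, 4),
    ("TREE_75", 0.75, 1.00, 5),
]

SVM_SCORE = 5
REQUIRED_SCORE = 5


def tree_rule(prob):
    """Return (name, score) of the TREE_xx rule whose range contains prob.

    ASSUMPTION: ranges are half-open [low, high); prob == 1.00 falls
    into the last bucket so that every probability fires exactly one rule.
    """
    for name, low, high, score in TREE_RULES:
        if low <= prob < high or (high == 1.00 and prob == 1.00):
            return name, score
    raise ValueError(f"probability out of range: {prob}")


def classify(body, tree_prob, svm_says_spam=False):
    """Sum the scores of all fired rules and compare with REQUIRED_SCORE.

    tree_prob and svm_says_spam stand in for the outputs of the C5.0 and
    SVM classifiers, which are not modelled in this sketch.
    """
    fired = {}
    for name, pattern, score in BODY_RULES:
        if re.search(pattern, body):
            fired[name] = score
    if svm_says_spam:
        fired["SVM"] = SVM_SCORE
    name, score = tree_rule(tree_prob)
    fired[name] = score
    total = sum(fired.values())
    return total >= REQUIRED_SCORE, total, fired
```

Under these assumptions, a page containing "cheap" with a C5.0 probability of 0.1 scores 5 - 1 = 4 and stays below the threshold, whereas the same page with a probability of 0.3 scores 5 + 3 = 8 and is labelled spam. This shows how TREE_00, the only rule with a negative score, can offset a single keyword match.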