Research Article

WSF2: A Novel Framework for Filtering Web Spam

Table 1

Description of inheritance relationships and classes in the WSF2 framework.

| Parent class | Inherited class | Method signature | Method description |
|---|---|---|---|
| parser_t | web_header | get_params(): header_t | Gathers relevant information from the response header of the retrieved web site (e.g., status of the HTTP response or web-content encoding). |
| parser_t | web_body | get_content(): const char | Extracts the content of the body of each web page (ignoring the HTML tags). |
| parser_t | web_body_stemmed | get_stem_words(): stem_t | Reduces words to their root form, returning a list of stemmed terms together with their occurrence counts within a web page. |
| parser_t | web_ext_domains | get_domains(): domain_t | Returns information about the domains linked from a given web site. |
| parser_t | web_features | get_features(): features_t | Builds a vector that contains a set of features extracted from the header and the content of the retrieved web site. |
| parser_t | corpus_features | get_features(): features_t | Loads all the preprocessed features of a corpus into an internal format used by our framework. |
| function_t | c5.0 | check_tree(cont: const char): int | Runs the C5.0 decision tree over the content of the web domain. |
| function_t | regex | eval(cont: const char): int | Checks whether a specific regular expression matches the web content. |
| function_t | svm | check_svm(vector: features_t): int | Runs the SVM algorithm over the features extracted from the web domain. |
| eventhandler_t | c5.0_autolearn | c5.0_learn(cont: features_t) | Executes the learning tasks for the C5.0 tree. |
| eventhandler_t | svm_autolearn | svm_learn(cont: features_t) | Performs the learning method for the SVM algorithm. |
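
To make the parent/child relationships in Table 1 more concrete, the following is a minimal Python sketch of two of the listed pairs: web_body as a parser_t subclass and regex as a function_t subclass. It is an illustration under stated assumptions, not the WSF2 implementation. The Python class names (ParserT, WebBody, FunctionT, Regex), the constructor arguments, and the convention that eval returns 1 on a match and 0 otherwise are all hypothetical; Table 1 only gives the method names and their return types.

```python
import re
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class ParserT:
    """Hypothetical stand-in for parser_t: holds one retrieved web page."""

    def __init__(self, raw_html: str) -> None:
        self.raw_html = raw_html


class _TextCollector(HTMLParser):
    """Collects text nodes and discards tags (script/style text is not filtered)."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


class WebBody(ParserT):
    """Sketch of web_body: get_content() returns the page text without HTML tags."""

    def get_content(self) -> str:
        collector = _TextCollector()
        collector.feed(self.raw_html)
        collector.close()
        # Join the text nodes and collapse runs of whitespace.
        return " ".join(" ".join(collector.chunks).split())


class FunctionT(ABC):
    """Hypothetical stand-in for function_t: a filtering technique applied to content."""

    @abstractmethod
    def eval(self, cont: str) -> int:
        ...


class Regex(FunctionT):
    """Sketch of regex: eval() reports whether a pattern matches the web content."""

    def __init__(self, pattern: str) -> None:
        self.pattern = re.compile(pattern, re.IGNORECASE)

    def eval(self, cont: str) -> int:
        # Assumed convention: 1 for a match, 0 otherwise.
        return 1 if self.pattern.search(cont) else 0


if __name__ == "__main__":
    page = "<html><body><h1>Cheap pills</h1><p>Buy now</p></body></html>"
    content = WebBody(page).get_content()
    print(content, Regex(r"cheap\s+pills").eval(content))
```

Keeping the parsers separate from the filtering functions mirrors the split between parser_t and function_t in Table 1: a function_t subclass only sees the content (or feature vector) that a parser has already extracted.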