Research Article
WSF2: A Novel Framework for Filtering Web Spam
Table 1
Description of inheritance relationships and classes in the WSF2 framework.
| Parent class | Inherited class | Method signature | Method description |
| --- | --- | --- | --- |
| parser_t | web_header | get_params(): header_t | Gathers relevant information from the response header of the retrieved web site (e.g., status of the HTTP response or web-content encoding). |
| parser_t | web_body | get_content(): const char | Extracts the content of the body of each web page (ignoring the HTML tags). |
| parser_t | web_body_stemmed | get_stem_words(): stem_t | Reduces the words to their root form, returning a list of stem terms together with their occurrence count inside a web page. |
| parser_t | web_ext_domains | get_domains(): domain_t | Returns information about the domains linked from a given web site. |
| parser_t | web_features | get_features(): features_t | Builds a vector that contains a set of features extracted from the header and the content of the retrieved web site. |
| parser_t | corpus_features | get_features(): features_t | Loads all the preprocessed features of a corpus into an internal format used by our framework. |
| function_t | c5.0 | check_tree(cont: const char): int | Runs the C5.0 decision tree over the content of the web domain. |
| function_t | regex | eval(cont: const char): int | Verifies whether a specific regular expression matches the web content. |
| function_t | svm | check_svm(vector: features_t): int | Executes the SVM algorithm over the features extracted from the web domain. |
| eventhandler_t | c5.0_autolearn | c5.0_learn(cont: features_t) | Executes the learning tasks for the C5.0 tree. |
| eventhandler_t | svm_autolearn | svm_learn(cont: features_t) | Performs the learning tasks for the SVM algorithm. |
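As a rough illustration of the inheritance relationships listed in Table 1, the following minimal C++ sketch shows how a filtering technique such as regex might derive from function_t. Table 1 gives only class names and method signatures, so everything else here is an assumption: the placement of eval on the base class, the 1/0 return convention, the case-insensitive matching, the use of std::regex, and the hypothetical names regex_filter and count_matches.

```cpp
#include <memory>
#include <regex>
#include <string>
#include <vector>

// Hypothetical base class for filtering techniques (function_t in Table 1).
// Assumption: every technique exposes one evaluation method on the base class.
struct function_t {
    virtual ~function_t() = default;
    // Assumed convention: returns 1 if the technique matches the content, 0 otherwise.
    virtual int eval(const char* cont) const = 0;
};

// Sketch of the regex technique: checks whether a pattern occurs in the web content.
class regex_filter : public function_t {
public:
    explicit regex_filter(const std::string& pattern)
        : pattern_(pattern, std::regex::icase) {}  // case-insensitive is an assumption

    int eval(const char* cont) const override {
        return std::regex_search(cont, pattern_) ? 1 : 0;
    }

private:
    std::regex pattern_;
};

// Hypothetical helper: counts how many registered techniques match the content.
int count_matches(const std::vector<std::unique_ptr<function_t>>& filters,
                  const char* cont) {
    int hits = 0;
    for (const auto& f : filters) {
        hits += f->eval(cont);
    }
    return hits;
}
```

Keeping the techniques behind a common base pointer is one way a framework of this kind could add classifiers such as c5.0 or svm without changing the code that invokes them.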