|
Features | No. | Feature identifier | Description |
|
URL features | F1 | Scheme | Scheme of URL (HTTP or HTTPS) |
F2 | Domain | Domain of URL |
F3 | top_domain | Top-level domain of URL |
F4 | second_domain | Second-level domain of URL |
F5 | domain_level | Depth of domain level |
F6 | domain_len | Length of domain |
F7 | behind_domain_len | Length of path |
F8 | dash_count | Number of “-” in URL |
F9 | num_count | Number of digits in URL |
F10 | slash_count | Number of “/” in URL |
F11 | special_symbol_count | Number of “@,” “_,” “%,” and “#” in URL |
F12 | top_char | The character that appears most frequently in URL |
F13 | top_symbol | The symbol that appears most frequently in URL |
F14 | sens_words_url | Sensitive words (i.e., “secure,” “account,” “login,” “signing,” and “confirm”) in URL |
F15 | url_word_top3 | Top three words with the highest word frequency in URL |
|
Host features | F16 | valid_days | Valid days of domain |
F17 | registrant_country | Registrant country of domain |
F18 | A | IP in A record for domain (A.B.C.D) |
F19 | A_1 | IP in A record for domain (A.B.C) |
F20 | A_2 | IP in A record for domain (A.B) |
F21 | A_IP_num | Number of IP in A record for domain |
F22 | CNAME | CNAME in CNAME record for domain |
|
Web resources’ features | F23 | tag_count | Number of specific tags in HTML source code (i.e., “link,” “script,” “img,” and “form”) |
F24 | sens_words_html | Sensitive words (i.e., “secure,” “account,” “login,” “signing,” and “confirm”) in HTML text |
F25 | brand_words_html | Brand names in HTML text |
F26 | tfidf_top3 | Top three words with the highest tf-idf in HTML text |
F27 | html_text_symbol | Number of symbols in HTML text (unicode FF00-FFEF) |
F28 | icon_str | Hex string converted from the .ico file |
|
OCR features | F29 | sens_words_ocr | Sensitive words in web page OCR results |
F30 | brand_words_ocr | Brand names in web page OCR results |
|
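As an illustration of how some of the lexical URL features in the table (F1 to F11 and F14) might be computed, the following is a minimal sketch using only the Python standard library. The function name `url_features` and the exact definitions are assumptions: the top-level and second-level domains are taken naively as the last two dot-separated labels (no public suffix list is consulted), `domain_level` is assumed to be the number of labels in the host name, and `behind_domain_len` follows the table's description as the length of the URL path.

```python
from urllib.parse import urlsplit

# Sensitive words exactly as listed for F14 in the table.
SENSITIVE_WORDS = ("secure", "account", "login", "signing", "confirm")


def url_features(url: str) -> dict:
    """Compute a subset of the lexical URL features (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # lower-cased host name, no port
    labels = host.split(".") if host else []
    lowered = url.lower()

    return {
        "scheme": parts.scheme,                                    # F1
        "domain": host,                                            # F2
        # Assumption: naive split, ignores multi-label suffixes such as co.uk.
        "top_domain": labels[-1] if labels else "",                # F3
        "second_domain": labels[-2] if len(labels) > 1 else "",    # F4
        # Assumption: depth is the number of dot-separated labels.
        "domain_level": len(labels),                               # F5
        "domain_len": len(host),                                   # F6
        "behind_domain_len": len(parts.path),                      # F7
        "dash_count": url.count("-"),                              # F8
        "num_count": sum(ch.isdigit() for ch in url),              # F9
        "slash_count": url.count("/"),                             # F10
        "special_symbol_count": sum(url.count(s) for s in "@_%#"), # F11
        "sens_words_url": [w for w in SENSITIVE_WORDS if w in lowered],  # F14
    }


if __name__ == "__main__":
    example = "http://secure-login.example.com/account/confirm.php?id=42"
    for name, value in url_features(example).items():
        print(f"{name}: {value}")
```

Host features (F16 to F22) and the web resource and OCR features require WHOIS, DNS, page content, and screenshot data, so they are outside the scope of this sketch.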