Research Article

Edge-Based Detection and Classification of Malicious Contents in Tor Darknet Using Machine Learning

Algorithm 3: Data cleaning algorithm.
Input: Web Set W
Output: Corpus set C (C[0], ..., C[n])
(1) for k = 0 TO n do
(2)   content = obtain the HTML content of W[k]
(3)   use the “lxml()” function to parse the content, then remove HTML tags, scripts, and so on
(4)   text = preserve the text content displayed on the page
(5)   for c in text do
(6)     if c = ‘ or c = ‘ then
(7)       c = ‘’
(8)     end if
(9)   end for
(10)  Lowercase all English words
(11)  for c in text do
(12)    if c is a punctuation mark or a number then
(13)      c = ‘’
(14)    end if
(15)  end for
(16)  PorterStemmer(text) // Unify all word forms
(17)  word_list = text.split(‘ ’)
(18)  for each index i of word_list do
(19)    if word_list[i] ∈ stopwords or len(word_list[i]) is outside the range 2 to 12 then
(20)      delete word_list[i]
(21)    end if
(22)  end for
(23)  SET C[k] = word_list.join(‘ ’) // Words are concatenated into a string
(24) end for
(25) return C
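
The steps of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses only the standard library, so html.parser stands in for lxml, and the stemmer is passed in as a callable (identity by default) in place of PorterStemmer. The names clean_page, clean_corpus, and STOPWORDS, the tiny stopword list, the inclusive 2 to 12 length bounds, and the choice to replace whitespace, punctuation, and digits with a single space are all assumptions made for the example.

```python
import re
import string
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list; the paper's list is not given.
STOPWORDS = frozenset({"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"})


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_page(html, stem=lambda w: w, stopwords=STOPWORDS, min_len=2, max_len=12):
    """Clean one page: strip markup, normalize, stem, and filter words."""
    # Steps 2-4: parse the HTML and keep only the displayed text.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)

    # Steps 5-9 (assumption): collapse newlines, tabs, and runs of spaces.
    text = re.sub(r"\s+", " ", text)

    # Step 10: lowercase.
    text = text.lower()

    # Steps 11-15 (assumption): replace punctuation and digits with a space.
    table = str.maketrans({ch: " " for ch in string.punctuation + string.digits})
    text = text.translate(table)

    # Steps 16-17: split into words and stem each one.
    words = [stem(w) for w in text.split()]

    # Steps 18-22: drop stopwords and words outside the length bounds
    # (bounds treated as inclusive here, which is an assumption).
    kept = [w for w in words if w not in stopwords and min_len <= len(w) <= max_len]

    # Step 23: join the remaining words back into one string.
    return " ".join(kept)


def clean_corpus(pages, **kwargs):
    """Apply clean_page to every page and return the corpus as a list."""
    return [clean_page(page, **kwargs) for page in pages]
```

For example, clean_page("<p>The Hidden Market sells 42 items!</p>") yields "hidden market sells items". To follow step 16 more closely, a Porter stemmer such as NLTK's PorterStemmer().stem can be passed as the stem argument.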