Mobile Information Systems

Research Article

Edge-Based Detection and Classification of Malicious Contents in Tor Darknet Using Machine Learning

Data cleaning algorithm.

Input: Web Set
Output: Corpus set (one cleaned text string per web page)
(1)	for each web page in Web Set do
(2)	content = obtain the HTML content of the web page
(3)	use the “lxml()” function to parse the content, then remove HTML tags, scripts, and so on
(4)	text = preserve the text content displayed on the page
(5)	for each character char in text do
(6)	if char = ‘’ or char = ‘’ then
(7)	char = ‘’
(8)	end if
(9)	end for
(10)	Lowercase all English words
(11)	for each character char in text do
(12)	if char is a punctuation mark or a digit then
(13)	char = ‘’
(14)	end if
(15)	end for
(16)	PorterStemmer(text) // Reduce each word to its stem
(17)	word_list = text.split(‘ ’)
(18)	for i = 1 TO len(word_list) do
(19)	if word_list[i] ∈ stopwords or len(word_list[i]) < 2 or len(word_list[i]) > 12 then
(20)	delete word_list[i]
(21)	end if
(22)	end for
(23)	SET the corpus entry for the web page = word_list.join(‘ ’) // Words are concatenated into a string
(24)	end for
(25)	return Corpus set
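
The steps of the data cleaning algorithm can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions. To stay within the standard library, `html.parser` is used in place of lxml. The stemmer is passed in as a callable and defaults to the identity function, so a Porter stemmer (for example from NLTK) would have to be supplied to reproduce step (16). The characters replaced in steps (5) to (9) are assumed to be newline and tab. The word-length rule is read as "drop words shorter than 2 or longer than 12 characters". The names `clean_page`, `clean_corpus`, `_VisibleText`, and `STOPWORDS`, as well as the short stopword list, are hypothetical.

```python
import string
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


# Hypothetical stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

# Translation table that deletes every ASCII punctuation character and digit.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def clean_page(html, stopwords=STOPWORDS, stem=lambda w: w,
               min_len=2, max_len=12):
    """Clean one page; the step numbers refer to the algorithm listing."""
    parser = _VisibleText()              # steps (2)-(4): keep displayed text
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Steps (5)-(9): newline and tab are ASSUMED to be the replaced characters.
    text = text.replace("\n", " ").replace("\t", " ")
    text = text.lower()                  # step (10)
    text = text.translate(_DROP)         # steps (11)-(15)
    # Steps (16)-(17): stem, then split. split() without arguments also
    # discards the empty tokens that split(" ") would produce.
    words = [stem(w) for w in text.split()]
    # Steps (18)-(22): drop stopwords and words outside the length bounds.
    kept = [w for w in words
            if w not in stopwords and min_len <= len(w) <= max_len]
    return " ".join(kept)                # step (23)


def clean_corpus(pages, **kwargs):
    """Apply clean_page to every page and return the corpus as a list."""
    return [clean_page(page, **kwargs) for page in pages]
```

With NLTK installed, calling `clean_corpus(pages, stem=PorterStemmer().stem)` would plug in the Porter stemmer; stemming is applied per token here, before the stopword and length filters, which follows the order of the listing.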