Review Article

Threats from the Dark: A Review over Dark Web Investigation Research for Cyber Threat Intelligence

Table 1

A comparison of reviewed literature according to their goal, approaches, used methods and tools, case studies, results, and possible limitations.

Reference
Research goal
Approach
Used methods and tools
Case study
Results
Limitations

(Sapienza et al. 2017) [24]
Detecting cyber threats from Surface, Deep, and Dark Webs
(i) Focused Crawling
(ii) Text mining
(iii) Semantic context
(i) Amazon EC2
(ii) Elasticsearch
(iii) Twitter API
(iv) REST-based API
(i) Twitter posts of 69 cybersecurity experts
(ii) 200 Dark Web and Deep Web hacking forums and markets
Generating warnings about specific threats and malware mentioned in Dark Web hacker forums
(i) Short list of security experts
(ii) Lack of dictionaries causing false alarms
(iii) Unhandled terms that can be abbreviated or altered intentionally by hackers
(iv) The generated warnings are not early enough

(Almukaynizi et al. 2017) [25]
Predicting vulnerability exploits
(i) Crawling
(ii) Binary classification
(i) SVM
(ii) RF
(iii) NB
(iv) LOG-REG
(i) NVD
(ii) CVE
(iii) ExploitDB
(iv) Zero Day Initiative
(v) Dark Web and Deep Web marketplaces and forums (in different languages)
(vi) Symantec attack signatures (ground truth)
Predicting exploits with a high true-positive rate and low false-positive rate
(i) The system does not consider the particularity of each organization, its specific needs, and utilized platforms, and thus it depends on fixed thresholds for all exploitation types
(ii) Limited ground truth

(Almukaynizi et al. 2018) [26]
Predicting Cyber Attacks
(i) Focused Crawling
(ii) Association Rules
(iii) Causal Reasoning concept
(iv) Logic Programming
(i) Annotated Probabilistic Temporal Logic (APT-logic)
(ii) Point frequency function (pfr)
(i) CYR3CON Dark Web marketplaces and forums
(ii) NVD (CVE, CPE)
Generating timely warnings about cyber threats
(i) Short average warning lead time before the attack (3 days)
(ii) Lack of performance evaluation in terms of time consumed (an n × n matrix of −1 × lead time is used, with no clear evaluation for larger numbers of warnings)

(Williams et al. 2018) [27]
Proactive detection of cyber threats from forum attachments
(i) Crawling
(ii) Classification
(iii) Visualization
(i) Python libraries (requests_html, BeautifulSoup, Keras)
(ii) RNN
(iii) Standard RNN
(iv) Gated recurrent unit (GRU) RNN
(v) LSTM RNN
(vi) Tableau
Ten hacking forums on the Dark Web in different languages
(i) Detecting trending and emerging hacking exploits
(ii) Detecting top active authors and top active forums
(iii) Classifying exploits and attachments
(iv) Analyzing author activities by year and exploit
Attachments uploaded to third-party platforms are excluded; thus, the system cannot be generalized to forums that prevent direct attachments within posts, and it can miss valuable insights within the same forum

(Narayanan et al. 2018) [28]
Detecting cyber event patterns and predicting future cyberattacks
(i) Ontology enrichment
(ii) Knowledge graph representation and reasoning
(iii) Association Rules
(iv) Clustering
(i) Unified cybersecurity ontology (UCO)
(ii) Named-entity recognizer (NER)
(iii) RDF, OWL
(iv) Semantic web rule language (SWRL)
(v) Hidden Markov model (HMM)
(vi) JENA reasoner
(i) Structured information (threat intelligence sources like US-CERT and Talos)
(ii) Plain text (blogs, Twitter, Reddit, Dark Web forums)
(iii) CVE
(i) Constructing an enriched cybersecurity knowledge graph to detect cybersecurity event patterns and predict future cyberattacks
(ii) Reducing the cognitive load on the analyst
(iii) Providing a solution for information incompleteness
(i) Fewer indicators cause less confidence for attacks that do not follow the seven steps of the intrusion kill chain
(ii) Multiple ontologies can produce more accurate results; the approach does not consider the need for a special ontology of hackers’ special technical terms and abbreviations, or the use of foreign languages

(Tavabi et al. 2018) [29]
Predicting vulnerability exploits
(i) Focused Crawling
(ii) Language Embeddings
(iii) Classification
(i) Paragraph Vector
(ii) SVM
(iii) Radial basis function (RBF)
(iv) RF
(i) Deep Web and Dark Web sites in 17 different languages
(ii) ExploitDB, NVD (CVE), attack signatures from Symantec antivirus and Intrusion Detection Systems, and Exploits database by Metasploit (ground truth)
(i) Achieving low-dimensional space
(ii) Better classification performance with embeddings
(i) Lack of ground truth
(ii) Sparse data with the higher dimensionalities of feature space
(iii) Needs enriched representations for features in other languages

(Arnold et al. 2019) [30]
Detecting and predicting vulnerability exploits (breached data)
(i) Crawling
(ii) Classification
(i) SNA
(iii) Graph building, annotation, and visualization
(i) Python libraries
(ii) Gephi
5 largest markets and 3 major forums on the Dark Web
Better performance by integrating SNA on text from both forums and markets on the Dark Web to detect and predict exploits
(i) Unhandled inconsistency of listing names led to few classification results; thus, a manual search using SQL queries was needed
(ii) Lack of real-world evaluation

(Ampel et al. 2020) [31]
Classifying hacker exploit source codes
(i) Crawling
(ii) Deep transfer learning (DTL) techniques
(i) CBiLSTM models
(ii) Transferred Embedding
(iii) Convolutional and BiLSTM layers
(i) Hacker forums: 8 English, 3 Russian
(ii) 1 marketplace (English)
(iii) Public repositories (English): Seebug, ExploitDB, Packet Storm, Metasploit, Vulnerlab, Zeroscience
Better labeling of exploit source code with DTL than non-DTL techniques
(i) Low accuracy in some experiments, which could be improved by considering more features from metadata

(Koloveas et al. 2021) [32]
Identifying, analyzing, and sharing information about cyber threats
(i) Focused and topical crawling
(ii) Social media monitors
(iii) Classification
(iv) NoSQL storage
(v) Predictive and suggested search
(vi) Visualization
(i) NYU’s ACHE crawler
(ii) SMILE classifier
(iii) MongoDB
(iv) SVM, RF, NB, K-NN, DT, LOG-REG, CNN
(v) Gensim (Word2Vec)
(vi) spaCy
(vii) MySQL
(viii) MISP’s UI
(ix) PyMISP library
(x) Stack Exchange data dump
(i) Crawled platforms on Surface, Deep, and Dark Web
(ii) Integrated datasets: KB-Cert Notes by Carnegie Mellon University, ExploitDB, VulDB, 0 day Today, NVD (CPE, CVE ID), JVN (JVN iPedia, CPE, CVE ID)
A hybrid CTI tool to detect, identify, analyze, search, and share information about cyber threats
(i) Downloading entire webpages can cause a heavy load on storage
(ii) Some domain-specific terms used in nontechnical text are missed by the named-entity recognizer
(iii) Low-quality seed pages for the topical crawler classification model caused false negatives and missed potentially important relevant out-links
(iv) Low rates of precision and recall in the social media monitoring system

(Samtani et al. 2017) [33]
Classifying hacker assets and detecting key hackers
(i) Crawling
(ii) Classification
(iii) Web, data, and text mining
(iv) Source code topic extraction
(v) SNA
(i) SVM
(ii) LDA
(iii) Bipartite networks
(iv) RapidMiner LIBSVM package
(i) 7 Dark Web hacking forums (English and Russian)
(i) Identifying hacker-disseminated tools in Dark Web hacking forums, their types, and their functionality features
(ii) Detecting key hackers
(i) Downloading full webpages can cause a heavy load on storage
(ii) Some preprocessing procedures may lose the semantics or names of specific threats
(iii) Limited data for understanding relationships among hackers, leading to low density and short average path lengths in the constructed graph, so some key hackers can be missed

(Grisham et al. 2017) [34]
Proactive detection of mobile malware attachments and key hackers
(i) Crawling
(ii) Text classification
(iii) Neural network
(iv) SNA
(i) Keras
(ii) LSTM RNN
(iii) Adam optimizer
4 Dark Web hacker forums in different languages
Identifying mobile malware attachments and key authors from Dark Web hackers' forums
(i) Concentrating only on key hackers who post attachments can miss important key hackers who interact with other hackers' attachments or run the attached malware
(ii) Lower rates of precision and recall for the model on mobile malware attachments than on nonmobile malware ones

(Pastrana et al. 2018) [35]
Detecting cyber threats and key actors, predicting potential future key actors, and analyzing the evolution of actors' interests and knowledge
(i) SNA
(ii) Clustering
(iii) Topic Analysis
(iv) Prediction
(v) Classification
(i) SNA network metrics
(ii) NLP
(iii) Linear SVM
(iv) K-means
(v) LOG-REG
(vi) LDA
Hackforums from CrimeBB dataset
(i) Detecting key actors and their relationships
(ii) Identifying actors’ behavior pathways and interest transition
(iii) Detecting potential cybercrime actors
(i) Manual search of key actors, thus the approach may not be generalized or scalable
(ii) Manually analyzing the activity of neighboring key actors
(iii) Low key-actor prediction rate
(iv) Low-resource language corpora may not be adequate for applying NLP tools
(v) Lack of validation of the prediction results

(Biswas, Mukhopadhyay, and Gupta 2018) [36]
Analyzing hacker behavior, clarifying hacker roles
(i) Text Mining
(ii) Sentiment Analysis
(iii) Classification
(i) TF-IDF with overlap score measure
(ii) LOG-REG
(iii) SentiStrength
HackHound forum, retrieved from the University of Arizona Hacker database
(i) Discovering predictors in hacker behavior to detect leaders in the community
(ii) Building a hacker dialect lexicon
(iii) Generating a role-based hacker classification model
(iv) Better accuracy
(i) The results may be affected by the language styles used in the specific platform under study; thus, the proven hypotheses need to be validated on other platforms
(ii) Low rate of precision and recall for some predicted hacker roles

(Marin, Shakarian, and Shakarian 2018) [37]
Detecting key hackers in Dark Web hacking forums
(i) CA
(ii) SNA
(iii) Seniority Analysis
(iv) Classification
(v) Prediction
(i) Genetic Algorithms
(ii) LR
(iii) RF
(iv) SVM
3 hacker forums on Dark Web (English)
(i) Identifying key hackers
(ii) Generalizing the model on other forums that do not have reputation systems
(iii) Achieving better performance with a hybrid approach and combined features
(i) The compared forums differ widely in reputation values (134, 102, and 37), which may affect the results of the training and testing conducted alternately between them (as shown in the study results when Forum 3 is used for testing).
(ii) Low rates of identified key hackers (0.52 as the highest value)
(iii) Lack of validation on the same forum: the approach was trained and tested on different platforms but not on the same platform

(Marin et al. 2018) [38]
Predicting hackers’ future post topics
Sequential rule mining
TRuleGrowth algorithm
A popular hacking forum on Dark Web
Detecting members’ adoption behavior of topics posted after getting influenced by their peers
The approach does not justify using hourly granularity for generating sequential rules and predictions: hourly granularity yields twice as many rules as daily granularity, with low precision rates. This can be misleading, because the increase in rule numbers may stem from rules generated for hours within the same day (and such rules are not applicable, readable, or adequately visualizable)

(Deb, Lerman, and Ferrara 2018) [39]
Predicting future cyber events
(i) Sentiment Analysis
(ii) Time-series prediction
(i) VADER
(ii) LIWC15
(iii) SentiStrength
(iv) ARIMA
(v) Apache Lucene-based Elasticsearch search engine
113 hacking forums in English on the Surface and Dark webs, provided by CYR3CON
(i) Predicting cyberattacks weeks before the event
(ii) Exploring the relationship between community behavior and cyber activity
(iii) Determining the forums with more predictive power than other forums
(i) Low performance of the sentiment signal system for low frequencies (small numbers of events)
(ii) Performance differs by attack type; thus, the system's performance needs to be validated on other types of attacks
(iii) Low precision and recall rates for some dominated months

(Zenebe et al. 2019) [40]
Proactive detection of cyber threats and identifying key hackers
(i) Classification
(ii) Prediction
(iii) Visualization
(i) IBM Watson Analytics
(ii) WEKA
(iii) RF
(iv) RT
(v) NB
University of Arizona’s Artificial Intelligence Lab dataset
Detecting trending topics and key actors
(i) Low-quality comparison of top authors across all three studied forums combined rather than within each forum, which can miss the influencing power of each forum on its own
(ii) Overfitting to the exploits that account for most of the samples in the dataset
(iii) Low accuracy for exploits with a small number of entities
(iv) Does not classify irrelevant posts (nonthreat), which can affect the accuracy of classification results

(Marin, Almukaynizi, and Shakarian 2019) [41]
Predicting cyber threats, learning hackers’ strategies
(i) Association rules
(ii) Causal reasoning
(iii) Logic programming
(i) Annotated probabilistic temporal logic (APT-logic)
(ii) Existential frequency function (efr)
(i) 53 Dark Web hacking forums retrieved from CYR3CON, in different languages
(ii) NVD (CVE, CPE)
(iii) ExploitDB
(iv) 230 records from an enterprise's logs (ground truth)
(i) Detecting hackers' attack strategies
(ii) Predicting near-future cyberattacks
(i) Ground-truth incident data came from a single enterprise, providing only a small number of incidents for proper testing. Therefore, the approach needs validation on other enterprises, as each one may face different types of attacks
(ii) Most CVE entries are not frequently mentioned in hacking posts
(iii) In the CPE mapping adopted to address the previous issue, most CPEs have only a few associated CVEs, which cannot form complete ground-truth test data
(iv) Low performance of the designed algorithm for higher numbers of predicates
(v) Short warning interval before predicted attacks (3 days)

(Sarkar et al. 2019) [42]
Predicting real-world cyberattacks through analyzing forums discussion posts and replies
(i) Classification
(ii) Prediction
(i) TS
(ii) LOG-REG
(i) 53 Dark Web forums
(ii) CVE
(iii) CPE
(i) Predicting cyberattacks by analyzing the activities of expert hackers through reply networks
(ii) Better results by analyzing the network paths than with PageRank or the number of posts per user
(i) The model was trained and tested on data from a single enterprise, which can limit the incident samples and attack types; thus, the system needs to be validated on other organizations' data

(Huang et al. 2021) [43]
Detecting key hackers from hacking forums on the Dark Web
(i) Crawling
(ii) CA
(iii) SNA
(iv) TM
(i) LDA
(ii) Topic-specific PageRank
(iii) SNA graph construction
5 hacking forums
(i) Achieving a higher forum coverage rate than applying CA or SNA alone
(ii) Identifying key hackers based on their topic preferences and activeness
(i) Training the LDA model on all of the analyzed forums together can affect the results of influencing key hackers in each forum separately, as interests and influencing power differ from one forum to another
(ii) Manually validating the resulting key hackers (top 5) may be impractical for a larger number of key hackers
(iii) No real-time identification of key hackers, as they are identified using historical data

(Deliu et al. 2017) [44]
Detecting cyber threats with more accuracy (comparing performances)
(i) Classification
(ii) Word embeddings
(i) CNN
(ii) SVM
(iii) DT
(iv) K-NN
(v) word2vec
(vi) GloVe
(vii) scikit-learn Python library
Nulled.IO
SVM and CNN lead to better performance
(i) Analyzing post content without titles; titles provide useful abstractions of the posts and aid classification
(ii) The resulting comparison of the algorithms’ performances was applied to one particular case study, which cannot be generalized, as algorithms perform differently on different datasets. Thus, the system needs validation on other platforms.

(Deliu et al. 2018) [45]
Detecting cyber threats from hacker forum posts
(i) Classification
(ii) TM
(i) SVM
(ii) LDA
Nulled.IO
Reducing the time consumed by TM by employing classification first
(i) Some minority topics required manual searching because LDA could not extract them; thus, the approach cannot be generalized to datasets with partially sparse data

(Koloveas et al. 2019) [46]
Crawling only the content relevant to a specific hacking topic (IoT)
(i) Crawling
(ii) Classification
(iii) Semantic language modeling
(i) ACHE Crawler
(ii) SVM
(iii) MongoDB
(iv) Gensim
(i) Websites and forums on the Surface Web
(ii) Hacking forums and marketplaces on the Dark Web
(iii) Stack Exchange data dump
Directing the crawler to fetch only relevant content
(i) Downloading whole HTML pages can cause heavy load on storage with useless data
(ii) The approach seems time-consuming: it harvests a massive volume of websites (about 22K per hour), but only a low percentage (1%) of them were judged to contain actionable CTI after manual checking by experts, with higher (unspecified) percentages for the Social and Dark Webs.
(iii) The crawler depends on the link relevance (words or alt-text of the URL) to decide whether to visit the corresponding website or not, which can miss some valuable sources that are relevant but do not specifically describe the content in the URL

(Queiroz et al. 2019) [47]
Enhancing classification methods of hacker discussions
(i) Word Embeddings
(ii) Sentence Embeddings
(iii) Classification
(i) Word2vec
(ii) GloVe
(iii) Sent2vec
(iv) InferSent
(v) SentEncoder
(vi) SVM
(vii) CNN
(viii) scikit-learn
(ix) Keras API for TensorFlow
5 datasets including forums, microblogs, and hacker marketplaces from Surface, Deep, and Dark Webs
Experimental results: SEMB improves SVM, WEMB improves CNN
(i) High false-positive rate causing low recall; thus, the approach resorted to oversampling, increasing the number of positive instances to improve recall
(ii) Classifying the datasets into three classes (Yes, No, and Undecided) does not seem to be justified, as instances classified under the Undecided class were included afterward in the Yes class, which may lead to noise or unclear messages classified as threats

(Johnsen and Franke 2019) [48]
Detecting cyber threats, identifying members’ roles and interests
(i) Text preprocessing techniques
(ii) TM
(i) Several preprocessing techniques
(ii) Python pandas package
(i) LDA
(iii) Scikit-learn package
Nulled.IO
(i) Understanding what the forum is about
(ii) Understanding members' interests and roles
(iii) Improving results quality by reducing vocabulary size
(i) Very low hyperparameter values can lead to very slow convergence, which is unsuitable for real-time CTI
(ii) Lack of validation of how interpretable the generated topics are for human analysts
(iii) The subject-user-centric construction does not yield significant results
(iv) The results focus on the majority of users, which are members with little experience or newbies, while overlooking the highly professional ones

(Ebrahimi et al. 2020) [49]
Semisupervised labeling for cyber threat detection from Dark Web marketplaces
(i) Transductive learning
(ii) Semisupervised labeling
(iii) Heuristics (lexical and structural marketplace characteristics)
(iv) Crawling
(i) LSTM
(ii) Transductive SVM (TSVM)
(iii) El-Gato
(iv) Sindhwani’s implementation (for TSVM)
(v) Context3.0 library (for LSTM)
7 Dark Web marketplaces
(i) Reducing manual labeling
(ii) Reducing false positives and negatives in identifying cyber threats
(i) For the lexical characteristics, the approach depends on market naming rules that prevent vendors from purposely including irrelevant words in their listing titles. Therefore, the system does not handle misleading names in markets that do not enforce such a rule.
(ii) The extensive tests conducted to find the optimal hyperparameter values for best performance do not show how the system will dynamically keep pace with the evolution of the market, changes in labeling, or newly added labels
(iii) The approach needs to be validated on markets in other languages

(Nunes, Shakarian, and Simari 2018) [50]
Early detection of potentially targeted systems: platforms, vendors, products
(i) Logical argumentation
(ii) Classification
(i) DeLP
(ii) SVM
(iii) RF
(iv) NB
(v) DT
(vi) LOG-REG
(i) 302 forums and marketplaces on the Dark Web in different languages
(ii) NVD
(iii) CVE
(iv) CPE
(i) Improving classification performance by reducing the possible labels with argumentation
(ii) Identifying potential at-risk systems (platforms, vendors, and products)
(i) Low rate of precision and recall for vendor and product components
(ii) Lack of sufficient data for training
(iii) Misclassification of newly discovered vulnerabilities for new products not known as at-risk systems before

(Ebrahimi et al. 2018) [51]
Detecting cyber threats from non-English hacker marketplaces on the Dark Web without translating the language
(i) Cross-Lingual Representation Modeling
(ii) Deep learning
(i) Deep Cross-Lingual Knowledge Transfer (CLKT)
(ii) Bidirectional Long Short-Term Memory (BiLSTM)
Dark Web marketplaces, 7 English and 1 Russian
(i) Achieving better performance than monolingual or translated models
(ii) Reducing false positives and false negatives
(i) Lack of handling of short texts (short product titles)
(ii) Lack of validation on other languages and platforms (forums)

(Schäfer et al. 2019) [52]
Detecting trending topics in hacker forums, defining relationships between actors and forums
(i) Crawling
(ii) Unsupervised TM
(iii) Time series
(iv) Knowledge graph construction
(i) Chrome browser Puppeteer
(ii) Scala
(iii) Apache Spark analytics framework
(iv) Elasticsearch
(v) Walktrap community-finding algorithm
(vi) LDA
Seven forums, 3 on Dark Web and 4 on Deep Web
A CTI platform that performs real-time tasks:
(i) Inferring relationships between authors and forums
(ii) Extracting trending topics
(iii) Inferring relationships among threads, actors, messages, and topics
(iv) Detecting overlapping actors across forums
(i) Translating languages can cause a loss in the semantics and sentiments of the language
(ii) Downloading whole webpages can cause a heavy load on storage with unnecessary data
(iii) The analyzed forums are easy to access and the assets are free to acquire without excessive measures of authentication or specific user privileges; thus, the approach does not handle the forums with such difficult measures

(Ebrahimi et al. 2020) [53]
Increasing the capabilities of multilingual cyber threat detection, cross-language cyber threat knowledge representation
(i) Crawling
(ii) Language-invariant representation
(iii) Classification
(i) LSTM
(ii) CLKT
(iii) Generative adversarial networks (GAN)
(iv) BiLSTM
(v) NB
(vi) SVM
(vii) RF
(viii) K-NN
(ix) Gated Recurrent Unit (GRU)
(x) Bidirectional Gated Recurrent Unit (BiGRU)
(xi) CNN
4 hacking forums on Dark Web (1 English, 1 Russian, and 2 French)
Improving the performance of classical ML and deep learning methods
(i) Lack of labeled ground-truth data
(ii) Low rates of accuracy and precision for some languages
(iii) The small volume of data used for training and testing can affect the performance
(iv) It needs validation on other languages and multilingual platforms

(Dong et al. 2018) [54]
Generating warnings about newly emerged threats and new releases of existing threats from Dark Web marketplaces
(i) Crawling
(ii) Classification
(iii) Text mining
(i) Scrapy
(ii) Elasticsearch
(iii) Multilayer Perceptron (MLP) Classifier
(i) 8 Dark Web marketplaces
(ii) AlienVault OTX (for existing threat lists)
Detecting new threats emerging from Dark Web markets, and new releases of existing threats
(i) High false positives due to foreign-language words, words specific to the Dark Web (original words), misspellings, compound words, and proper names

(Marin et al. 2018) [55]
Detecting communities of malware and exploit vendors from hacking-related offerings in Dark Web marketplaces
(i) Clustering
(ii) SNA
(iii) Community Detection
(iv) Community validation
(i) K-means with cosine similarity
(ii) Louvain heuristic method
(iii) Adjusted rand index (ARI)
20 Dark Web marketplaces (English)
Detecting communities of vendors according to their products and shared categories
(i) Lack of ground-truth data to validate detected communities with real-world communities
(ii) Only cross-validation used to justify the suggested hypothesis

Convolutional neural networks (CNN), defeasible logic programming (DeLP), decision trees (DT), K-nearest neighbors (K-NN), latent Dirichlet allocation (LDA), logistic regression (LOG-REG), linear regression (LR), long short-term memory (LSTM), naive Bayes (NB), natural language processing (NLP), random forest (RF), recurrent neural network (RNN), random tree (RT), social network analysis (SNA), support vector machine (SVM), time series (TS), National Vulnerability Database (NVD), Common Vulnerabilities and Exposures (CVE), and Common Platform Enumeration (CPE).