Security and Communication Networks

Research Article

Similarity Digest Search: A Survey and Comparative Analysis of Strategies to Perform Known File Filtering Using Approximate Matching

Similarity digest search strategies: characteristics.


Strategy	Tools	Main technology	Input	Output ( - threshold)	Match decision	Insert/remove elements	Owning database

Brute force (sdhash)	sdhash	Bloom filters	sdhash digest	Digest	Bloom filter comparison	✓/✓	×

Brute force (ssdeep)	ssdeep	Rolling Hash	ssdeep digest	Digest	Edit distance	✓/✓	×

Brute force (TLSH)	TLSH	LSH	TLSH digest	Digest	Header/body distance	✓/✓	×

DHTnil	Nilsimsa	DHT (chord) + Voronoi diagram	Bit vector	Number of matches	Adapted Euclidean distance	✓/✓	×

iCTPH	ssdeep	DHT (chord) + iDistance	ssdeep digest	Number of matches	Edit distance	✓/✓	×

F2S2	ssdeep	Indexing (n-grams) + hash table	ssdeep digest	Candidates sharing an n-gram with the queried digest	Edit distance	✓(*)/✓	✓

MRSH-NET	sdhash, mrsh-v2	Single, huge Bloom filter	Object features	Yes/No (consecutive features found in the filter )	Bloom filter matches	×/×	✓

BF-based tree	sdhash, mrsh-v2	Bloom filter tree structure	Object features	Candidate with highest number of features found in the filter	Bloom filter matches	×/×	✓

MRSH-CF	sdhash, mrsh-v2	Cuckoo filter	Object features	Yes/No (consecutive features found in the filter )	Cuckoo filter matches	×/✓	✓

Observation: (*) the data set is allowed to grow beyond the structure's real capacity, at the cost of performance.
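To make the contrast between the brute-force rows and the F2S2 row concrete, the following is a minimal sketch of n-gram indexing over a hash table, not the F2S2 implementation itself. It assumes digests are plain strings (the block-size fields of real ssdeep digests are ignored), uses an n-gram length of 7 as an illustrative default, and substitutes plain Levenshtein distance for the weighted edit distance that ssdeep applies. The names `NGramIndex`, `candidates`, and `search` are hypothetical.

```python
from collections import defaultdict

N = 7  # assumed n-gram length, chosen for illustration


def ngrams(digest, n=N):
    """Return the set of all length-n substrings of a digest string."""
    return {digest[i:i + n] for i in range(len(digest) - n + 1)}


def edit_distance(a, b):
    """Plain Levenshtein distance (a stand-in for ssdeep's weighted variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class NGramIndex:
    """Hash table mapping each n-gram to the ids of digests containing it."""

    def __init__(self, n=N):
        self.n = n
        self.table = defaultdict(set)  # n-gram -> set of digest ids
        self.digests = {}              # digest id -> digest string

    def insert(self, digest_id, digest):
        self.digests[digest_id] = digest
        for gram in ngrams(digest, self.n):
            self.table[gram].add(digest_id)

    def remove(self, digest_id):
        digest = self.digests.pop(digest_id)
        for gram in ngrams(digest, self.n):
            self.table[gram].discard(digest_id)
            if not self.table[gram]:
                del self.table[gram]

    def candidates(self, query):
        """Ids of stored digests sharing at least one n-gram with the query."""
        found = set()
        for gram in ngrams(query, self.n):
            found |= self.table.get(gram, set())
        return found

    def search(self, query, max_distance):
        """Run the edit-distance comparison on the candidates only."""
        hits = []
        for digest_id in self.candidates(query):
            d = edit_distance(query, self.digests[digest_id])
            if d <= max_distance:
                hits.append((digest_id, d))
        return sorted(hits)
```

The point of the sketch is that the expensive match decision (edit distance) runs only on digests returned by `candidates`, whereas a brute-force search would compare the query against every stored digest. Insertion and removal only touch the hash-table buckets of the affected n-grams, which is consistent with the ✓/✓ entry in the table.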