Research Article

Similarity Digest Search: A Survey and Comparative Analysis of Strategies to Perform Known File Filtering Using Approximate Matching

Table 1

Similarity digest search strategies: characteristics.

| Strategy | Tools | Main technology | Input | Output ( - threshold) | Match decision | Insert/remove elements | Owning database |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Brute force (sdhash) | sdhash | Bloom filters | sdhash digest | Digest | Bloom filter comparison | ✓/✓ | × |
| Brute force (ssdeep) | ssdeep | Rolling hash | ssdeep digest | Digest | Edit distance | ✓/✓ | × |
| Brute force (TLSH) | TLSH | LSH | TLSH digest | Digest | Header/body distance | ✓/✓ | × |
| DHTnil | Nilsimsa | DHT (Chord) + Voronoi diagram | Bit vector | Number of matches | Adapted Euclidean distance | ✓/✓ | × |
| iCTPH | ssdeep | DHT (Chord) + iDistance | ssdeep digest | Number of matches | Edit distance | ✓/✓ | × |
| F2S2 | ssdeep | Indexing (n-grams) + hash table | ssdeep digest | Candidates sharing the same n-gram queried | Edit distance | ✓()/✓ | |
| MRSH-NET | sdhash, mrsh-v2 | Single, huge Bloom filter | Object features | Yes/No (consecutive features found in the filter) | Bloom filter matches | ×/× | |
| BF-based tree | sdhash, mrsh-v2 | Bloom filter tree structure | Object features | Candidate with highest number of features found in the filter | Bloom filter matches | ×/× | |
| MRSH-CF | sdhash, mrsh-v2 | Cuckoo filter | Object features | Yes/No (consecutive features found in the filter) | Cuckoo filter matches | ×/✓ | |

Observation: (): the data set may grow beyond its real capacity, at the cost of performance.
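As a rough illustration of the Yes/No output listed in Table 1 for the single Bloom filter strategy (the MRSH-NET row), the following Python sketch stores the features of known objects in one filter and reports a match when enough consecutive query features are found. It is a simplified sketch under stated assumptions, not the MRSH-NET implementation: fixed-size chunks stand in for the feature extraction of sdhash or mrsh-v2, and the names BloomFilter, chunk_features, longest_run, and is_known, as well as the parameters size_bits, num_hashes, and min_run, are hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte-string features (illustrative only)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, feature):
        # Derive num_hashes bit positions from one SHA-256 digest,
        # using 4 bytes of the digest per position.
        digest = hashlib.sha256(feature).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.size_bits

    def add(self, feature):
        for pos in self._positions(feature):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, feature):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(feature))


def chunk_features(data, size=8):
    """Assumption: fixed-size chunks stand in for real feature extraction."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def longest_run(features, bloom):
    """Length of the longest run of consecutive features found in the filter."""
    best = run = 0
    for feature in features:
        run = run + 1 if feature in bloom else 0
        best = max(best, run)
    return best


def is_known(features, bloom, min_run):
    """Yes/No decision: True if at least min_run consecutive features match."""
    return longest_run(features, bloom) >= min_run


# Usage: insert the features of one known object, then query.
bloom = BloomFilter()
known_features = chunk_features(bytes(range(256)))
for f in known_features:
    bloom.add(f)

print(is_known(known_features, bloom, min_run=8))
```

The answer is only Yes or No: because all objects share one filter, a match says that the query resembles something in the set, not which object it resembles. Requiring a run of consecutive matches, rather than a single hit, is one way to keep isolated Bloom filter false positives from triggering a match.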