Research Article

Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics

Algorithm 1: SigNCD duplicate detection.
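Require: document list $D$; similarity threshold $\tau$; number of threads $t$; compressor $C$.
Ensure: duplicate set $R$
$R \leftarrow \emptyset$, $S \leftarrow \emptyset$
function DUPDETECT($D$, $\tau$, $t$, $C$)
    for all documents $d$ in $D$ using $t$ threads in parallel do
        $d' \leftarrow$ preprocessing $d$ to filter out noisy information
        $s \leftarrow$ signature of $d'$
        $c_s \leftarrow$ the length of compressed $s$
    end for
    sort all $s$ in $S$ by $c_s$ in ascending order
    for all $s_i$ in $S$ using $t$ threads in parallel do
        if $s_i$ in $R$ then
            continue
        end if
        $R \leftarrow R \cup$ FINDDUP($s_i$, $S$, $\tau$)
    end for
    return $R$
end function

function FINDDUP($s_i$, $S$, $\tau$)
    $R_i \leftarrow \emptyset$
    $j \leftarrow$ the index of the boundary object of the matching partition of $s_i$ on $S$
    for all $s_k$ in $S[i+1..j]$ do
        if $s_k$ in $R$ then
            continue
        end if
        $ncd \leftarrow \mathrm{NCD}(s_i, s_k)$
        if $ncd < \tau$ then
            $R_i \leftarrow R_i \cup \{s_k\}$
            $R \leftarrow R \cup \{s_k\}$
        end if
    end for
    return $R_i$
end function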
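
The two-stage flow of Algorithm 1 (compute a signature and its compressed length per document, sort by compressed length, then compare each signature only against a bounded range of neighbours) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: `zlib` stands in for the compressor, `preprocess` and `signature` are hypothetical placeholders for the paper's noise filtering and signature extraction, and all function names are illustrative. The distance is the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The pruning bound is also an assumption: if the compressor is monotone, so that C(xy) >= C(y), then NCD(x, y) >= 1 - C(x)/C(y) whenever C(x) <= C(y), and only signatures with C(y) < C(x) / (1 - tau) can fall below the threshold. For simplicity the sketch parallelizes only the first stage and runs the comparison stage sequentially.

```python
import bisect
import re
import zlib
from concurrent.futures import ThreadPoolExecutor


def preprocess(text: str) -> str:
    """Placeholder noise filter: lowercase, turn non-alphanumeric runs into spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def signature(text: str) -> bytes:
    """Hypothetical signature: keep only tokens of four or more characters."""
    return " ".join(w for w in text.split() if len(w) >= 4).encode("utf-8")


def clen(data: bytes) -> int:
    """Compressed length under zlib, standing in for the compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes, cx: int, cy: int) -> float:
    """Normalized compression distance, given precomputed C(x) and C(y)."""
    return (clen(x + y) - min(cx, cy)) / max(cx, cy)


def dup_detect(docs, tau=0.5, threads=4):
    """Return the indices (into docs) of documents flagged as near duplicates.

    Requires 0 <= tau < 1.
    """
    def prepare(i):
        s = signature(preprocess(docs[i]))
        return clen(s), i, s

    # Stage 1: signatures and compressed lengths in parallel, sorted ascending.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        items = sorted(pool.map(prepare, range(len(docs))))
    lengths = [c for c, _, _ in items]

    # Stage 2 (sequential here): compare each signature with later ones only.
    duplicates = set()
    for pos, (cx, i, sx) in enumerate(items):
        if i in duplicates:
            continue
        # Assumed pruning bound: candidates must have C(y) < C(x) / (1 - tau).
        bound = bisect.bisect_left(lengths, cx / (1.0 - tau))
        for cy, j, sy in items[pos + 1:bound]:
            if j in duplicates:
                continue
            if ncd(sx, sy, cx, cy) < tau:
                duplicates.add(j)
    return duplicates
```

Calling `dup_detect(docs, tau=0.5, threads=4)` on a list of strings returns a set of positions in `docs`; the first document of each matching group in sorted order is kept as the representative and is not included. Because real compressors are only approximately monotone, the bound may occasionally prune a pair that a full pairwise comparison would have reported.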