Research Article

Flow Chart Generation-Based Source Code Similarity Detection Using Process Mining

Algorithm 2

CDFC mining algorithm based on heuristic process mining.
Input: instrumentation output sequences obtained by running the source code, Seq
Output: CDFC = (V, E)
(1)Denote the directly-follows threshold as Tf, the dependence threshold as Td, the directly-follows count as num[][] = 0, the dependence as d[][], the instrumentation output set as IO, the node set of CDFC as V, and the edge set of CDFC as E.
(2)for each trace in Seq do //traverse the instrumentation output sequence of each run of the code
(3)for i = 0; i < trace.size-1; i++ do //traverse adjacent instrumentation outputs in the sequence
(4)  num[trace[i]][trace[i+1]]++ //count how often trace[i+1] directly follows trace[i]
(5)  if trace[i] not exist in V then
(6)   add trace[i] to V;
(7)  end if
(8)end for
(9)end for
(10)for each io1 in IO do
(11)for each io2 in IO do
(12)  if io1 == io2 then //the two instrumentation outputs are the same
(13)   d[io1][io2] = num[io1][io2]/(num[io1][io2] + 1) //calculate the dependence of an instrumentation output on itself
(14)   if num[io1][io2] ≥ Tf and d[io1][io2] ≥ Td then
(15)    add (io1, io2) to E
(16)   end if
(17)  end if
(18)  if io1 != io2 then
(19)   d[io1][io2] = (num[io1][io2] - num[io2][io1])/(num[io1][io2] + num[io2][io1] + 1) //calculate the dependence between two distinct instrumentation outputs
(20)   if num[io1][io2] ≥ Tf and d[io1][io2] ≥ Td then
(21)    add (io1, io2) to E
(22)   end if
(23)  end if
(24)end for
(25)end for
(26)return CDFC = (V, E)
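
The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `mine_cdfc`, the parameter names `tf` and `td`, and the sample traces are hypothetical. The pseudocode does not say how IO is populated, so the sketch assumes IO is the set of all distinct outputs appearing in the traces. As in steps (5)-(7), only `trace[i]` is added to V, so the final output of a trace enters V only if it also appears earlier in some trace.

```python
from collections import Counter


def mine_cdfc(seq, tf, td):
    """Sketch of Algorithm 2: return (V, E) mined from the traces in seq."""
    num = Counter()  # num[(a, b)]: how often b directly follows a
    nodes = set()    # V
    for trace in seq:
        for i in range(len(trace) - 1):
            num[(trace[i], trace[i + 1])] += 1
            nodes.add(trace[i])  # steps (5)-(7): only trace[i] is added

    # Assumption: IO is every distinct output seen in any trace.
    io = {out for trace in seq for out in trace}

    edges = set()    # E
    for a in io:
        for b in io:
            ab, ba = num[(a, b)], num[(b, a)]
            if a == b:
                d = ab / (ab + 1)              # step (13): self-loop dependence
            else:
                d = (ab - ba) / (ab + ba + 1)  # step (19): pairwise dependence
            if ab >= tf and d >= td:
                edges.add((a, b))
    return nodes, edges


if __name__ == "__main__":
    # Hypothetical instrumentation output sequences from three runs.
    traces = [
        ["start", "A", "B", "end"],
        ["start", "A", "A", "B", "end"],
        ["start", "A", "B", "end"],
    ]
    v, e = mine_cdfc(traces, tf=2, td=0.5)
    print(sorted(v))
    print(sorted(e))
```

With these sample traces, the pair (A, A) occurs only once, so it falls below the directly-follows threshold of 2 and no self-loop is added. The other three adjacent pairs each occur three times with a dependence of 3/4 and are kept as edges.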