Flow Chart Generation-Based Source Code Similarity Detection Using Process Mining
Algorithm 2
CDFC mining algorithm based on heuristic process mining.
Input: instrumentation output sequences obtained by running the source code, Seq
Output: CDFC = (V, E)
(1)
Denote the directly-follows threshold as Tf, the dependence threshold as Td, the directly-follows counts as num[][] (initialized to 0), the dependence values as d[][], the instrumentation output set as IO, the node set of CDFC as V, and the edge set of CDFC as E.
(2)
for each trace in Seq do // traverse the instrumentation output sequence produced by each run of the code
(3)
for i = 0; i < trace.size - 1; i++ do // traverse each pair of adjacent instrumentation outputs in the sequence
(4)
num[trace[i]][trace[i+1]]++ // count how often trace[i+1] directly follows trace[i]
(5)
if trace[i] does not exist in V then
(6)
add trace[i] to V;
(7)
end if
(8)
end for
(9)
end for
(10)
for each io1 in IO do
(11)
for each io2 in IO do
(12)
if io1 == io2 then // the two instrumentation outputs are the same
(13)
d[io1][io2] = num[io1][io2] / (num[io1][io2] + 1) // calculate the dependence of an instrumentation output on itself
(14)
if num[io1][io2] ≥ Tf and d[io1][io2] ≥ Td then
(15)
add (io1, io2) to E
(16)
end if
(17)
end if
(18)
if io1 != io2 then
(19)
d[io1][io2] = (num[io1][io2] - num[io2][io1]) / (num[io1][io2] + num[io2][io1] + 1) // calculate the dependence between two distinct instrumentation outputs