Abstract

This paper addresses the preprocessing of event sequences issued from cyclic discrete event processes, which perform activities continuously and in which the delimitation of jobs or cases is not explicit. Because of the iterative behaviour, the sequences include several occurrences of the same events, so discovery methods conceived for workflow nets (WFN) cannot process them. To handle this issue, a novel technique is presented for splitting a set of long event traces S = {Sk} (|S| ≥ 1) exhibiting the behaviour of cyclic processes. The aim of this technique is to obtain from S a log λ = {σi} of event traces representing the same behaviour, which can be processed by methods that discover WFN. The procedures derived from this technique have polynomial-time complexity.

1. Introduction

In discrete event processes, modelling is essential for designing management or control systems or analysing processes in operation. In the latter case, automated modelling of discrete-event processes from the recorded system behaviour is a valuable resource for process reengineering. In the areas of business process and manufacturing systems, automated modelling is an active research matter; in the first area, it is called process discovery [1], while in the second one it is named process identification [2].

1.1. Automated Modelling

In both areas, the aim is to build discrete-event models from records of event data generated by the processes; such event data are captured in the form of event sequences or traces, which reveal the actual process behaviour. The models must represent clearly sequential and concurrent behaviours; finite automata and Petri nets (PNs) are the formalisms mostly used.

The event traces, collected in the event log, come from management information systems [3–6] or process controllers [2, 7]. In each type of process, the logs are represented in different formats. In business processes, the event logs are composed of large multisets of traces; each trace describes a process execution called a case. In manufacturing processes, the activities are performed continuously and iteratively; the delimitation of jobs or cases is not explicit. Thus, the event logs are composed of a few (usually one) very long sequences.

1.2. Event Log Preprocessing

Extracting the iterative subsequences from long task sequences is a way to isolate the executions of the t-components of the workflow net (WFN) to be discovered, which allows the long sequences to be split into multiple traces. This approach allows diverse WFN discovery techniques to be applied to event logs drawn from manufacturing processes.

Existing discovery methods for WFN cannot always process long sequences from cyclic processes, in particular when the initial events occur again in the sequence due to the iterative behaviour of the process; the obtained models are less readable or, in some cases, wrong. Consider the log S = {abcdabcecd}; the model discovered using an existing method [8] is shown in Figure 1(a). Conversely, when the single sequence of the log is split into λ = {abcd, abcecd}, the same discovery method yields the WFN in Figure 1(b); the extended WFN replays S.

Splitting or partitioning an event log is a strategy used for several purposes: trace clustering [9, 10], reduction of the surplus language for fault diagnosis [11], model simplification [12], discovering unobservable behaviour [13], and model refinement [14]. Methods dealing with the problem of sequence segmentation for improving the translation from Japanese to English have also been proposed [15, 16].

1.3. Contribution

In this paper, a novel technique for splitting long task sequences issued from highly repetitive cyclic processes into subsequences is proposed. To the best of our knowledge, there are no methods addressing the stated problem. The method processes a reduced set of long event traces S = {Sk} (|S| ≥ 1) and obtains a log λ = {σi} of event traces representing the same behaviour. The purpose of this processing is to apply WFN discovery algorithms, in particular, those dealing with the silent transitions.

The paper is organised as follows: Section 2 presents the notation on PNs, WFNs, and the splitting problem; Section 3 describes the splitting trace method; Section 4 presents the implementation and tests; finally, Section 5 presents the conclusions.

2. Background and Problem Statement

This section presents the basic concepts and notation of ordinary PNs and WFNs used in this paper. For further details, the reader can consult the study by van der Aalst et al. [1]. Additionally, the sequence splitting problem is formulated.

2.1. Petri Nets

Definition 1. An ordinary PN structure G is a bipartite digraph represented by the three-tuple G = (P, T, F), where P = {p1, p2, …, p|P|} and T = {t1, t2, …, t|T|} are finite sets of nodes named places and transitions, respectively, and F ⊆ (P × T) ∪ (T × P) is a relation representing the arcs between the nodes.
For any node x ∈ P ∪ T, •x = {y | (y, x) ∈ F} and x• = {y | (x, y) ∈ F}. The incidence matrix of G is C = [cij], where cij = −1 if (pi, tj) ∈ F and (tj, pi) ∉ F; cij = 1 if (tj, pi) ∈ F and (pi, tj) ∉ F; cij = 0 otherwise.
The places in P can be empty or marked by one or more tokens. A marking M: P → ℕ determines the number of tokens within the places, where ℕ is the set of nonnegative integers. A marking M, usually denoted by a vector in ℕ^|P|, describes the current state of the modelled system.

Definition 2. A Petri net system or Petri net (PN) is the pair N = (G, M0), where G is a PN structure and M0 is an initial marking. R(G, M0) denotes the set of all reachable markings from M0.

Definition 3. A PN system is 1-bounded or safe iff, for any Mi ∈ R(G, M0) and any p ∈ P, Mi(p) ≤ 1. A PN system is live iff, for every reachable marking Mi ∈ R(G, M0) and every t ∈ T, there is an Mk ∈ R(G, Mi) such that t is enabled in Mk.

Definition 4. A t-invariant Yi of a PN is a nonnegative integer solution to the equation CYi = 0. The support of Yi (t-support), denoted as <Yi>, is the set of transitions whose corresponding elements in Yi are positive. Yi is minimal if its support does not include the support of another t-invariant. A t-component G(Yi) is the subnet of the PN induced by <Yi>: G(Yi) = (Pi, Ti, Fi), where Pi = •<Yi> ∪ <Yi>•, Ti = <Yi>, and Fi = ((Pi × Ti) ∪ (Ti × Pi)) ∩ F.
Given a t-invariant Yi and an initial marking M0 that enables some ti ∈ <Yi>, once ti is fired, M0 can be reached again by firing only transitions in <Yi>.
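
As an illustration of Definitions 1 and 4, the following Python sketch builds the incidence matrix of a small net and checks the condition CY = 0. The two-place cyclic net and the helper names incidence_matrix, is_t_invariant, and support are hypothetical and chosen only for this example; the trivial zero vector is excluded here by choice.

```python
def incidence_matrix(places, transitions, arcs):
    """C[i][j] is -1 if only (p_i, t_j) is an arc, +1 if only (t_j, p_i) is, else 0."""
    arcs = set(arcs)
    return [[int((t, p) in arcs) - int((p, t) in arcs) for t in transitions]
            for p in places]


def is_t_invariant(C, Y):
    """True if Y is a nonnegative, nonzero integer vector with C * Y = 0."""
    nonnegative = all(y >= 0 for y in Y)
    balanced = all(sum(c * y for c, y in zip(row, Y)) == 0 for row in C)
    return nonnegative and any(Y) and balanced


def support(Y, transitions):
    """Transitions whose entries in Y are positive (the t-support)."""
    return {t for t, y in zip(transitions, Y) if y > 0}


# Hypothetical cyclic net: p1 -> t1 -> p2 -> t2 -> p1
places = ["p1", "p2"]
transitions = ["t1", "t2"]
arcs = [("p1", "t1"), ("t1", "p2"), ("p2", "t2"), ("t2", "p1")]
C = incidence_matrix(places, transitions, arcs)
```

In this net, firing t1 and then t2 restores the initial marking, which is what the vector Y = (1, 1) expresses.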

2.2. Workflow Nets

Definition 5. A WorkFlow net (WFN) N is a subclass of PN with the following properties [1]: (i) it has two special places, i and o. Place i is a source place: •i = ∅, and place o is a sink place: o• = ∅. (ii) If a transition te is added to the PN connecting place o to place i, then the resulting PN (called extended WFN) is strongly connected.

Definition 6. A WFN (N, M0) is said to be sound iff, for any marking Mi ∈ R(N, M0), o ∈ Mi → Mi = [o] and [o] ∈ R(N, Mi), and (N, M0) contains no dead transitions. A sound extended WFN is live and bounded. A WFN can represent a process behaviour by associating task labels to some transitions.

Definition 7. A labelled WFN is a four-tuple (N, M0, Σ, L), where Σ is a finite set of task labels and L: T → Σ ∪ {ε} is the labelling function. Transitions labelled with ε are called silent or unobservable; otherwise, they are called observable. Additionally, ∀ ti, tj ∈ T, ti ≠ tj, if L(ti), L(tj) ∈ Σ then L(ti) ≠ L(tj); i.e., two transitions cannot have the same label from Σ.

Definition 8. Let Σ be a finite set of task labels; an event log λ is a set of traces σi = A1A2…Ak ∈ Σ*, |σi| = k, where Aj refers to the task at position j.

2.3. The Problem of Sequence Splitting

Definition 9. Given a set of long event traces S = {Sk}, where Sk ∈ T* and |S| ≥ 1, representing the behaviour of a cyclic discrete event process, the aim is to obtain a set λ = {σi} of task traces by splitting the Sk, such that the concatenation of traces in λ represents the same behaviour expressed in S, i.e., an extended WFN discovered from λ must replay S.

Assumptions. A1. The sequences Sk are arbitrarily long; they capture all the possible actual behaviour of the process. Such sequences are generated by an unknown, live, and 1-bounded cyclic PN. It means that the process is well behaved; there are no deadlocks nor buffer overflows during the recording of traces.
A2. In every Sk all the tasks occur at least twice.
A3. Sk are recorded from the initial state. Thus, the first tasks are known.

Example 1. Consider the log S = {S1} on Σ = {A, B, C, D, E, F, G, H}, where S1 = HDEGADBEFDECABDECHDEFDEGADEBFDECHDEGABDECABDEFDECHDEFDEGHDEFDEGADBECADEFBDECHDEGADEBCHDEFDEGADEFDBECADEFDEBCHDEGHDEG. A suitable splitting technique should determine λ = {σ1, σ2, σ3, σ4, σ5, σ6, σ7, σ8, σ9, σ10, σ11}, where σ1 = ABDEC, σ2 = ADBEC, σ3 = ADEBC, σ4 = ABDEFDEC, σ5 = ADBEFDEC, σ6 = ADEBFDEC, σ7 = ADEFBDEC, σ8 = ADEFDBEC, σ9 = ADEFDEBC, σ10 = HDEG, and σ11 = HDEFDEG, which represent executions of the WFN depicted in Figure 2. The extended WFN replays S1.
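
The relation between S1 and λ in Example 1 can be checked with a short Python sketch that looks for a segmentation of S1 into traces of λ by dynamic programming. This is only a consistency check on the string level, not a replay on the WFN; the function name segment is a hypothetical helper introduced here.

```python
def segment(sequence, log):
    """Return one segmentation of `sequence` into traces of `log`, or None."""
    n = len(sequence)
    back = [None] * (n + 1)       # back[e] = start of a trace that ends at e
    reachable = [False] * (n + 1)
    reachable[0] = True
    for end in range(1, n + 1):
        for trace in log:
            start = end - len(trace)
            if start >= 0 and reachable[start] and sequence[start:end] == trace:
                reachable[end], back[end] = True, start
                break
    if not reachable[n]:
        return None
    parts, end = [], n
    while end > 0:                # walk the back pointers to recover the traces
        parts.append(sequence[back[end]:end])
        end = back[end]
    return parts[::-1]


# S1 and the log of Example 1
S1 = ("HDEGADBEFDECABDECH"
      "DEFDEGADEBFDECHDEGABDECABDEFDECHDEFDEGHDEFDEGADBECADEFBDECHDEGADEB"
      "CHDEFDEGADEFDBECADEFDEBCHDEGHDEG")
LOG = ["ABDEC", "ADBEC", "ADEBC", "ABDEFDEC", "ADBEFDEC", "ADEBFDEC",
       "ADEFBDEC", "ADEFDBEC", "ADEFDEBC", "HDEG", "HDEFDEG"]
parts = segment(S1, LOG)
```

Every trace of λ appears at least once in the segmentation found, and concatenating the segments gives back S1.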

3. The Splitting Technique

3.1. Strategy

Every Sk ∈ S is parsed by searching for subsequences in Sk that have the same alphabet; such subsequences are represented by a macrotask θj, which replaces them in all the Sk that contain them. This operation is repeated until all the sequences in S are formed only by macrotasks.

Example 2. Consider the event log S of Example 1. Then, using the strategy described above, the output of the method is λ = θ1∪ θ2∪ θ3∪ θ4 where θ1 = {HDEG}, θ2 = {ABDEC, ADBEC, ADEBC}, θ3 = {HDEFDEG}, θ4 = {ABDEFDEC, ADBEFDEC, ADEBFDEC, ADEFBDEC, ADEFDBEC, ADEFDEBC}. Figure 2 shows the WFN obtained from λ.
The main steps of the technique are the following. First, an initial splitting of Sk, induced by the first task, is performed. Then, the subsequences of Sk are analysed to obtain the macrotasks θi, which replace them in Sk.

3.2. Basic Operators

Several operators for handling task traces are introduced below.

Definition 10. Let λ be an event log over Σ and let a be a task in Σ; for every trace σk = x1x2…xn ∈ λ and a ∈ σk: (i) τ(xi, σk) provides the name of the task at position xi in σk; (ii) First(S′) gets the first subsequence of the list S′; (iii) (X) gets the set of tasks (alphabet) used in the object X; (σk) and (λ) get the set of tasks in a trace σk and in λ, respectively.

Definition 11. Let λ be an event log and σk = x1x2…xn ∈ λ a trace. A macrotask θ = {σ1, σ2, …, σn} is a set of traces such that (σ1) = (σ2) = … = (σn).
Notice that (σ1) is the support of a t-invariant of the extended WFN to build.

Definition 12. Let S′ = {σ1, σ2, …, σn} be a list of subtraces, σ = t1t2…tm ∈ S′ be a subtrace, and i ∈ {1, …, m−1} and j ∈ {2, …, m} be indexes. Then, the operator delSet(S′, σ, θ, i, j) deletes the tasks in σ from i to j and replaces them with the symbol of the macrotask θ in S′.

3.3. Splitting Procedures
3.3.1. First Splitting

In the processing of Sk, the subsequences to consider are those delimited by a given task symbol T along Sk. This search is started using the first symbol of Sk; then, a list of sequences S′ is formed by all the subsequences of Sk starting with T.

The algorithm that splits the sequence S into shorter subsequences delimited by the occurrences of the first task is presented below.

Algorithm 1. FirstSplit(S, T)
Input: S, T // The sequence S and its first task T.
Output: S′ // The list of subsequences whose first task is T.
σ ← ∅; S′ ← ∅;
∀ ti ∈ S:
 If ti = T and i ≠ 1 then:
  S′ ← S′ ∪ {σ}; // σ is appended to S′.
  σ ← T; // A new subsequence starts with T.
 else:
  σ ← σ ∙ ti;
S′ ← S′ ∪ {σ}; // The last subsequence is appended.
Return S′

Remark. The computational complexity of Algorithm 1 is O(|S|).

Example 3. Consider the log S = {S1} on Σ = {A, B, C, D, E, F, G, H, I, J}, where S1 = ABCHIJDEFGDJABBCABCDEGHIJDEGDJDEFGABCDEFGABBCHIJHIJ is obtained from the execution of the model depicted in Figure 3(a). Then, FirstSplit(S1, A) gets S′ = {ABCHIJDEFGDJ, ABBC, ABCDEGHIJDEGDJDEFG, ABCDEFG, ABBCHIJHIJ}.
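
The first splitting of Example 3 can be reproduced with the following Python sketch. It assumes that every task is encoded as a single character of a string; the function name first_split is a hypothetical stand-in for the first-splitting procedure.

```python
def first_split(sequence, first_task):
    """Split `sequence` before every occurrence of `first_task` except the first symbol."""
    parts, current = [], ""
    for i, task in enumerate(sequence):
        if task == first_task and i != 0:
            parts.append(current)   # close the current subsequence
            current = task          # a new subsequence starts with the first task
        else:
            current += task
    parts.append(current)           # the last subsequence
    return parts


S1 = "ABCHIJDEFGDJABBCABCDEGHIJDEGDJDEFGABCDEFGABBCHIJHIJ"
S_prime = first_split(S1, S1[0])
```

The loop visits each task once, in line with the linear complexity stated in the remark above.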

3.3.2. Determining Macrotasks

Afterward, the subtrace σ1 of S′ with the smallest alphabet is chosen and added to the macrotask θ1; such a subtrace is replaced by θ1 in S′.

Based on (σ1) in θ1, the remaining subtraces σr that have the same alphabet can be found and then added to θ1. Replacing these subtraces by θ1 in S′ may split the remaining subtraces and thus create new subsequences.

This operation is performed again on S′ without considering θ1, then obtaining θ2, which is included in S′ as explained before. In every iteration, new macrotasks θs are created and replaced in S′. This process is performed until S′ is formed only by macrotasks. The traces in all the macrotasks form the event log.

Now, the procedures (Algorithms 2 and 3) to replace a macrotask θ in S′ and delete the corresponding subsequences are presented below.

Algorithm 2. replaceSeq(S′, σ, θ)
Input: S′, σ, θ
Output: S′, θ
∀ σi ∈ S′:
 σ′ ← ∅; start ← 0; end ← 0; first ← 0;
 ∀ tj ∈ σi: // Tracking the symbols of σi.
  If tj ∈ (σ) then:
   If first = 0 then:
    start ← j; first ← 1;
   σ′ ← σ′ ∙ tj;
  else:
   If (σ′) = (σ) then: // All the tasks in (σ) are in (σ′).
    end ← j − 1;
    θ ← θ ∪ {σ′}; // A new subtrace is added to the macrotask θ.
    S′ ← delSet(S′, σi, θ, start, end); // Def. 12: deletes the tasks in σi from start to end and replaces them with θ in S′.
   σ′ ← ∅; first ← 0; // The tracking restarts after every interruption.
 If (σ′) = (σ) then: // The tracked subtrace reaches the end of σi.
  θ ← θ ∪ {σ′};
  S′ ← delSet(S′, σi, θ, start, |σi|);
Return S′, θ

Remark. The computational complexity of Algorithm 2 is O(|S′|.|σ|).

Example 4. Consider S′ = {σ1, σ2, σ3, σ4, σ5} from Example 3, where σ1 = ABCHIJDEFGDJ, σ2 = ABBC, σ3 = ABCDEGHIJDEGDJDEFG, σ4 = ABCDEFG, and σ5 = ABBCHIJHIJ, and the shortest subtrace σ = σ2. We replace with θ1 every occurrence in S′ of a subtrace with the alphabet of σ and split the subsequences where the replacement is made. So, we obtain S′ = {θ1, HIJDEFGDJ, θ1, θ1, DEGHIJDEGDJDEFG, θ1, DEFG, θ1, HIJHIJ} and θ1 = {ABBC, ABC}.
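
The replacement step of Example 4 can be sketched in Python as follows. This is one possible reading of the replacement procedure, not the authors' implementation: tasks are assumed to be single characters, a macrotask is represented in the list by an integer identifier, and maximal runs of tasks belonging to the alphabet of σ are replaced when their alphabet equals that of σ. The name replace_seq is hypothetical.

```python
def replace_seq(sub_traces, sigma, macro_id):
    """Replace runs over the alphabet of `sigma` by `macro_id`; return (new list, traces)."""
    alphabet = set(sigma)
    traces, result = [], []
    for item in sub_traces:
        if not isinstance(item, str):        # already a macrotask reference
            result.append(item)
            continue
        run_start, last_cut = None, 0
        # The trailing None closes a run that reaches the end of the subtrace.
        for j, task in enumerate(list(item) + [None]):
            if task in alphabet:
                if run_start is None:
                    run_start = j
            elif run_start is not None:
                run = item[run_start:j]
                if set(run) == alphabet:     # same alphabet as sigma
                    if run_start > last_cut:
                        result.append(item[last_cut:run_start])
                    result.append(macro_id)
                    if run not in traces:
                        traces.append(run)
                    last_cut = j
                run_start = None
        if last_cut < len(item):
            result.append(item[last_cut:])   # remaining tasks form a new subtrace
    return result, traces


S_prime = ["ABCHIJDEFGDJ", "ABBC", "ABCDEGHIJDEGDJDEFG", "ABCDEFG", "ABBCHIJHIJ"]
new_S_prime, theta1 = replace_seq(S_prime, "ABBC", 1)
```

Here the integer 1 stands for θ1, and theta1 collects the distinct subtraces that were replaced.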

The procedure below (Algorithm 3) summarises the splitting process.

Algorithm 3. Splitting of S
Input: S
Output: S′
T ← ∅; σ ← ∅; S′ ← ∅; i ← 1;
T ← τ(x1, S); // Def. 10: gets the first task in S.
S′ ← FirstSplit(S, T); // Alg. 1: splits S at every occurrence of T.
While ∃ tj ∈ (S′) | tj ∈ (S): // S′ still contains tasks of S that are not represented by a macrotask.
 σmin ← First(S′); // Def. 10: gets the first subsequence in S′.
 ∀ σ ∈ S′:
  If |(σ)| < |(σmin)| then:
   σmin ← σ;
 θi ← {σmin}; // The macrotask is built from the subsequence with the smallest alphabet.
 (S′, θi) ← replaceSeq(S′, σmin, θi); // Alg. 2: replaces all the occurrences of σmin in S′.
 i ← i + 1;
Return S′

Remark. The computational complexity of Algorithm 3 is O(|S′|.|σ|).

Example 5. Consider the log S = {ABCHIJDEFGDJABBCABCDEGHIJDEGDJDEFGABCDEFGABBCHIJHIJ} from Example 3. The splitting technique proceeds as follows.
(1) The first splitting is: S′ = {ABCHIJDEFGDJ; ABBC; ABCDEGHIJDEGDJDEFG; ABCDEFG; ABBCHIJHIJ}.
(2) Then, we get the subtrace with the shortest alphabet, σ = σ2 = ABBC; the macrotask θ1 = ABBC is created, and all the occurrences of the tasks in the alphabet of θ1 are replaced by the macrotask in S′, creating new subtraces and adding the occurrences to θ1; that is: S′ = {θ1; HIJDEFGDJ; θ1; θ1; DEGHIJDEGDJDEFG; θ1; DEFG; θ1; HIJHIJ}, where θ1 = {ABBC, ABC}.
(3) Next, the shortest subtrace is σ = DEFG; the macrotask θ2 = DEFG is created and replaced in S′, yielding: S′ = {θ1; HIJ; θ2; DJ; θ1; θ1; DEGHIJDEGDJ; θ2; θ1; θ2; θ1; HIJHIJ}.
(4) Then, the shortest subtrace is σ = DJ; the macrotask θ3 = DJ is created and replaced in S′, producing: S′ = {θ1; HIJ; θ2; θ3; θ1; θ1; DEGHIJDEG; θ3; θ2; θ1; θ2; θ1; HIJHIJ}.
(5) Next, the shortest subtrace is σ = HIJ; the macrotask θ4 = HIJ is created and replaced in S′, obtaining: S′ = {θ1; θ4; θ2; θ3; θ1; θ1; DEG; θ4; DEG; θ3; θ2; θ1; θ2; θ1; θ4; θ4}.
(6) Then, the shortest subtrace is σ = DEG; the macrotask θ5 = DEG is created and replaced in S′, creating: S′ = {θ1; θ4; θ2; θ3; θ1; θ1; θ5; θ4; θ5; θ3; θ2; θ1; θ2; θ1; θ4; θ4}.
(7) Finally, the set of macrotasks is {θ1, θ2, θ3, θ4, θ5}, whose subtraces form the event log λ = {ABBC, ABC, DEFG, DJ, HIJ, DEG}, which is replayed by the WFN (without the transition x) shown in Figure 3(a). This WFN is easily transformed into the cyclic PN shown in Figure 3(b).
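
The steps of Example 5 can be reproduced end to end with the Python sketch below. It is a simplified reading of the procedure under stated assumptions, not the authors' tool: tasks are single alphanumeric characters; the shortest remaining subtrace (by length, first one on ties) is selected at each iteration, as in the example; an occurrence is a run over the alphabet of σ that starts with the first task of σ and is cut when that task reappears, which is how HIJHIJ yields two occurrences of HIJ. The names replace_runs and split_sequence are hypothetical.

```python
import re


def replace_runs(item, sigma, macro_id, found):
    """Replace in `item` the runs whose alphabet equals that of `sigma`."""
    alphabet, first = set(sigma), sigma[0]
    others = "".join(sorted(alphabet - {first}))
    # Assumption: a run starts with the first task of sigma and stops when it reappears.
    pattern = f"{first}[{others}]*" if others else first
    pieces, pos = [], 0
    for m in re.finditer(pattern, item):
        if set(m.group()) == alphabet:
            if m.start() > pos:
                pieces.append(item[pos:m.start()])
            pieces.append(macro_id)
            found.append(m.group())
            pos = m.end()
    if pos < len(item):
        pieces.append(item[pos:])
    return pieces


def split_sequence(sequence):
    """Return the final list of macrotask identifiers and the macrotasks themselves."""
    first = sequence[0]
    items = re.findall(f"{first}[^{first}]*", sequence)    # first splitting
    macrotasks = []
    while any(isinstance(x, str) for x in items):
        sigma = min((x for x in items if isinstance(x, str)), key=len)
        macro_id, found, new_items = len(macrotasks) + 1, [], []
        for x in items:
            new_items.extend(replace_runs(x, sigma, macro_id, found)
                             if isinstance(x, str) else [x])
        if not found:   # fallback that guarantees progress: replace sigma as a whole
            found = [sigma]
            new_items = [macro_id if x == sigma else x for x in items]
        macrotasks.append(list(dict.fromkeys(found)))      # distinct traces, in order
        items = new_items
    return items, macrotasks


S = "ABCHIJDEFGDJABBCABCDEGHIJDEGDJDEFGABCDEFGABBCHIJHIJ"
items, macrotasks = split_sequence(S)
```

In the returned list, the integer k stands for θk, so the list can be compared directly with the final S′ of step (6), and the union of the macrotasks gives the event log λ of step (7).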

Property. Algorithm 3 processes efficiently an event sequence S yielding a set S′ which contains subsequences corresponding to the segmentation of S.

Proof. The procedure builds iteratively S′ and converges toward a set including only macrotasks. The concatenation of the subsequences represented by the macrotasks in the order they are obtained yields the sequence S. Since all the involved algorithms are polynomial-time, the processing is efficient.

4. Implementation and Tests

The algorithms to split a long trace into several traces have been implemented as a software tool. Besides testing the software on sequences and verifying the correct splitting, an extended test scheme, described below, was defined.

4.1. Testing Scheme

The correctness of the splitting procedure is verified in a controlled manner through a rediscovery scheme, using artificial event logs, which are generated as follows. First, a known extended WFN that may contain silent transitions is created and executed in the PN editor PIPE [17]; this WFN contains a transition te that allows the cyclic behaviour in the net, so that long sequences are obtained. Then, the obtained string is processed to delete the occurrences of the task te and of the silent transitions labelled with ε. Finally, the long string is saved in a text file, which is the input of the implemented method.

The developed tool processes the text file that contains the long sequence and splits it into several traces, which are saved in a text file; such traces represent the behaviour of the initial WFN. This text file can be used as input to a process discovery technique [18] to obtain a WFN, which is compared with the one used to generate the log. The discovered WFN is stored as an XML file, which can be drawn by PIPE. The test scheme followed is shown in Figure 4.

4.2. Experiments

Several case studies using WFN with different structure and size were conducted using the software tool. The following examples are more significant due to their structure rather than the size.

4.2.1. Test 1

An execution of the software tool is presented in Figure 5. In Figure 5(a), the extended WFN edited in PIPE is shown; the artificial log is drawn from this net. The artificial log, composed of one sequence of length 1,045, is shown in Figure 5(b). In Figure 5(c), the split log with 11 traces obtained by the execution of the implemented tool is displayed. Then, the WFN discovered by applying the classification method to the split log is displayed in Figure 5(d).

4.2.2. Test 2

A second test is presented in Figure 6. In Figure 6(a), the extended WFN is shown. The artificial log, with a length of 3,937, is shown in Figure 6(b). In Figure 6(c), the log with six traces obtained by the execution of the implemented tool is displayed. The WFN obtained using the split log and the classification method is displayed in Figure 6(d).

4.2.3. Test 3

In Figure 7, a third test is presented. In Figure 7(a), the extended WFN is shown. The artificial log, with a length of 10,093, is shown in Figure 7(b). In Figure 7(c), the log with eight traces obtained by the execution of the implemented tool is displayed. The WFN obtained using the split log and the classification method is displayed in Figure 7(d).

5. Conclusions

A technique for splitting long event sequences exhibiting the behaviour of cyclic processes has been presented. The result of the processing is an event log from which a WFN can be discovered. Long event sequences are drawn from highly repetitive processes, such as automated manufacturing systems where the initial state is known, but the delimitation of jobs or cases is not specified.

Although there are discovery methods that deal with the sequences of cyclic processes, this preprocessing technique allows applying many discovery algorithms that build WFN, particularly those that deal with silent transitions [18–20]. In this paper, the method in [18] has been used in the tests to rediscover the models that generate the long sequences.

The event logs obtained from the splitting technique contain traces capturing silent behaviour, represented in the discovered WFN by silent transitions of types skip, redo, switch, and finalise. However, these traces do not always allow initialise silent transitions to be discovered; this remains a subject of future research.

Data Availability

No underlying data were collected or produced in this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The author Yolanda Alvarez-Pérez is supported by CONACYT, Mexico, Ph.D. Grant No. 778009.