Abstract
Infrequent behaviors of a business process are behaviors that occur only in very exceptional cases; their occurrence frequency is low because the conditions they require are rarely fulfilled. Hence, a strong coupling relationship exists between infrequent behavior and data flow. Furthermore, some infrequent behaviors may reveal very important information about the process. Thus, not all infrequent behaviors should be disregarded as noise, and identifying infrequent but correct behaviors in the event log is vital to process mining from the perspective of data flow. Existing process mining approaches construct a process model from frequent behaviors in the event log, mostly concentrating on control flow only, without considering infrequent behavior and data flow information. In this paper, we focus on data flow to extract infrequent but correct behaviors from logs. For an infrequent trace, frequent patterns and interactive behavioral profiles are combined to find out which part of the behavior in the trace occurs with low frequency. Conditional dependency probability is then used to analyze the influence strength of the data flow information on infrequent behavior. Correspondingly, an approach for identifying effective infrequent behaviors based on frequent patterns under data awareness is proposed. Subsequently, an optimization approach for mining process models with infrequent behaviors that integrates data flow and control flow is also presented. Experiments on synthetic and real-life event logs show that, compared with other approaches, the proposed approach can distinguish effective infrequent behaviors from noise. The proposed approaches greatly improve the fitness of the mined process model without significantly decreasing its precision.
1. Introduction
The purpose of process mining is to extract useful knowledge from event logs recorded by IT systems of enterprises to discover, monitor, and enhance the actual business process [1, 2]. One of the important research areas is process discovery, which automatically infers process models from event logs. The goal of process discovery is to find the “best” process model, that is, the one that describes the recorded real executions as closely as possible. The four important metrics for measuring the “best” model are fitness, precision, generalization, and simplicity [3].
Unfortunately, real-life event logs often contain both noise and infrequent behavior [2, 4, 5]. In general, noise refers to behavior that does not conform to the process specification and/or its correct execution, such as traces that record incomplete process behavior, recording errors, and erroneous executions of the process [5]. In contrast, effective infrequent behavior is a possible execution behavior in very exceptional cases, such as fraudulent behavior in insurance [6], risk problems in system operations [7], and escape problems of spacecraft systems. Some infrequent behaviors may be important behaviors that cannot be discarded in system operation. Early process discovery algorithms [8–13] assumed that event logs accurately record system behavior and therefore have significant limitations in real life. Most of the recent discovery algorithms [14–23] support noise filtering but ignore infrequent behaviors of business processes. Very few discovery algorithms [24–27] consider infrequent behaviors, and those that do still regard them as noise. This may lead to important information being discarded. As a result, the derived models have difficulty in accurately describing the real behavior of systems. Therefore, one important challenge in process discovery is to distinguish infrequent behavior from noise in event logs.
Few approaches address infrequent behavior, and most of them focus on the control-flow perspective. Existing approaches determine whether a behavior is infrequent or frequent only by considering the frequency of activities or directly-follows relations. For infrequent behavior, they rarely analyze whether it is related to data flow and instead remove it directly as noise. However, in real system operation, whether some execution paths are taken may depend on contextual data information, such as available resources, execution time, and execution status. As the required conditions (that is, specific data information) are rarely fulfilled, some paths are executed infrequently. These infrequent behaviors are therefore caused by their special required conditions: once the conditions are fulfilled, the corresponding infrequent behavior will inevitably occur. We call these infrequent behaviors effective infrequent behaviors or correct infrequent behaviors. For instance, an airbag deploying in a car requires a suitable speed and angle of impact. Theoretically, the airbag can only be deployed when the impact on a fixed object is within 60° in front of the vehicle and the car speed is higher than 30 km/h. Compared with normal driving activities, the frequency of airbag deployment is lower. Obviously, there is a coupling relationship between such infrequent behaviors and the data information of the event. Moreover, airbag deployment is an important behavior for system operation. Therefore, existing approaches that filter low-frequency behavior based on control flow and treat it as noise are not appropriate. Identifying these effective infrequent behaviors from the perspective of data flow and integrating control-flow and data-flow information in process discovery play an important role in process model optimization, business process improvement, resource allocation adjustment, and so on.
This paper analyzes the coupling relationship between infrequent behavior and data flow. It quantifies the influence strength of data information on behavioral dependencies between events, which provides a reliable basis for the identification of effective infrequent behavior. We conduct a series of experiments to compare our approach with existing approaches on synthetic and real-life event logs and discuss the results. The experimental results indicate that the proposed approach can identify more infrequent but useful behaviors than the state-of-the-art mining technique and greatly increases the fitness of the process model without significantly decreasing its precision, thereby optimizing the process model. The main contributions of the paper are as follows:
(1) An analysis approach based on frequent patterns and interactive behavioral profiles is proposed to identify which parts of the trace are infrequent
(2) To quantify the strong influence of data information on the behavioral dependence between activities, a conditional dependency probability measurement approach is introduced
(3) An effective infrequent behavior recognition approach based on frequent patterns under data awareness is presented, along with an optimization approach for mining process models with infrequent behaviors that integrates data flow and control flow
The remainder of this paper is structured as follows. Section 2 discusses the related work. Section 3 introduces the problem with an example. Section 4 presents the notations and the required concepts. Section 5 proposes an effective infrequent behavior recognition approach based on frequent patterns under data awareness. After that, an optimization approach for mining of the process model with infrequent behaviors that integrates data flow and control flow is also given. Section 6 evaluates how well the proposed approach works on synthetic and real-life event data. Finally, Section 7 concludes the paper and discusses future work.
2. Related Work
Many researchers have proposed a range of process mining algorithms. However, process mining algorithms still face many problems, such as short loops [28], indirect dependency relationships [29], duplicated transitions, invisible transitions, noise, and infrequent behaviors [30]. Some early mining algorithms, such as the α-algorithm [8] and its improved variants [9, 10], the ILP mining algorithm [11], the inductive miner algorithm [12], and the domain-based mining algorithm [13], disregarded noise in the event log. Clearly, they have great limitations in real life. Most of the recent mining algorithms support noise filtering [14–23]. The first discovery algorithm to handle noise was the heuristics miner [14]. The heuristics miner considers the frequencies of the basic ordering relations when computing the strength of causal relations. The true dependency between two events (such as concurrency, exclusion, and causality) is determined by the strength of the causal relations. Algorithms derived from it have also been proposed [15, 16]. Existing noise-filtering approaches are based on frequencies [14–18], machine-learning techniques [19, 20], genetic algorithms [21], or probabilistic models [22, 23]. All of those approaches focus on the control-flow perspective when filtering noise, without considering data flow information, and exclude infrequent but useful behaviors. The literature [31–33] specifically studied noise processing in event logs but still did not address infrequent behaviors.
The literature on infrequent behavior is still very scarce, and it has mainly focused on the control-flow perspective [24–27]. In terms of control flow, the literature [24] proposed the WoMine-i algorithm, which retrieves infrequent behavior patterns from a process model, including structures with sequences, selections, parallels, and loops. However, in general, we do not have a reference model in real life and only have logs that record what actually happened. Hence, it is difficult for this approach to improve the quality of the discovered model. In [25], the inductive miner infrequent algorithm was proposed, which adds infrequent behavior filters to all steps of the IM algorithm, such that infrequent behavior is filtered by adopting an eventually-follows graph. In [26], a minimum anomaly-free automaton (AFA) based on the whole event log and a given threshold was constructed. Subsequently, all events that did not fit the AFA were removed from the filtered event log, which led to the removal of individual events rather than entire traces from the log. However, this technique cannot detect some typical anomalies, such as incomplete traces. The approaches in [25, 26] filter infrequent behaviors based on the frequency of the directly-follows relation between activity pairs, only from the perspective of control flow, and neglect the dependence between some infrequent behaviors and data flow. In [27], the authors proposed a generic noise-filtering approach suitable for arbitrary process discovery algorithms. The approach uses the conditional occurrence probability to calculate the likelihood of an activity occurring after a subsequence. The disadvantage of this approach is that the log is interpreted as a set of sequences, whereas structural information, such as concurrent and loop structures, is not considered. For concurrent and loop structures, the same structure corresponds to multiple or even infinitely many different subsequences. Additionally, the distinction between noise and infrequent behavior is ignored, and both are directly deleted as noise.
In terms of data flow, in [34], the data-aware heuristic miner (DHM) was proposed, which combines data flow and control flow. A classification technique is used to find the data attributes between activities, which can reveal conditional infrequent behavior from the event log to distinguish infrequent behavior from noise effectively. There are two limitations to this approach. First, according to the condition of data dependence, only the dependency strength of two different directly-follows relations between activity a and activity b is computed, but the probability of activity b directly following activity a compared to other activities is not considered. Second, only conditional directly-follows dependencies between activities are discovered, and the conditional dependencies of more complex patterns cannot be discovered. Recent work on declarative process discovery [35–37] considered the data perspective. Declarative process models representing the discovery results from execution logs are given. In [37], the authors present an automated discovery of a declarative process model with data conditions. Clustering techniques in conjunction with a rule mining technique and redescription mining techniques are used to discover constraints between two activities, respectively. However, similar to association rule mining, only sets of rules or constraints rather than full process models are returned.
As a consequence, existing process discovery algorithms have three problems in recognizing infrequent behavior. First, most of them focus on the control-flow perspective to recognize infrequent behavior and ignore the coupling relationship between infrequent behavior and data conditions. Second, they directly remove all infrequent behaviors as noise when discovering the process model, so that infrequent but useful behaviors are also excluded. Third, these approaches only consider the directly-follows dependency of activity pairs, while the dependencies of more complex patterns are ignored. For the reasons stated above, this paper proposes an effective infrequent behavior recognition approach based on frequent patterns under data awareness. Subsequently, an optimization approach for mining process models with infrequent behaviors that integrates data flow and control flow is provided.
3. Motivation
A business process of booking tickets in a train ticket reservation system is illustrated as an example. Here, only the business process of ordering the ticket is considered, and a series of business processes generated by refunds and changes are not considered. Assuming that only 1,000 records are extracted from the system log, the trace sequences and the frequency of their occurrence are shown in Table 1. The event names corresponding to the activities are shown in Table 2.
Assuming a frequency threshold of 100, the inductive miner-infrequent (IMi) algorithm [25] infers the initial process model after removing the infrequent traces , as shown in Figure 1.
In a real system, , , and are three effective infrequent traces, corresponding to the following situations: the total number of contact names related to the logged-in user exceeds 15; the time interval between confirming the order and the payment success is more than 30 minutes; and the user has an unpaid order and the waiting time for payment has not timed out. These traces occur infrequently because the corresponding conditions are rarely fulfilled. Obviously, these behaviors are infrequent but correct. However, the IMi algorithm disregards them as noise when constructing the process model, so the resulting process model cannot truly describe the actual operation of the system.
In trace , it is not difficult to find an infrequent activity K′, which has an indirect data-dependent relationship with the frequent activity F. In trace , there is a low-frequency activity pair TQ, which is caused by an indirect data dependency between R and T. The reason that trace occurs infrequently is the same as that for the trace . It is obvious that there is a coupling relation between these infrequent behaviors and the particular data dependency. To capture infrequent but useful behavior, an effective infrequent behavior recognition approach based on frequent patterns under data awareness is proposed in this paper according to the indirect data dependency between events. Furthermore, an optimized process model integrating data flow and control flow is obtained by incorporating the infrequent behavior into the resulting process model, which increases the fitness of the process model and captures the important behaviors of the system more accurately.
4. Preliminaries
This section gives basic definitions of several terms used in this paper. The events in the log represent activities, the event log is a collection of traces, and the same trace may appear multiple times in the event log, with each trace corresponding to the execution of a process. The event log typically stores considerable additional information about the event, such as the active execution resource (such as people or devices) and the timestamp of the event execution.
Definition 1. (process model [38]). A process model is a quadruple with(1) and as a nonempty set of place and transition, respectively(2), , and (3) as the flow relation of (4), with (5) as the structure type of the process model for sequence, selection, parallel, and loop
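Definition 2. (weak order (log) [39]). Let $L$ be the event log. Let $A_L$ be an activity set of $L$. The weak order relation $\succ_L \subseteq A_L \times A_L$ contains all pairs $(x, y)$ such that there exists a trace $\sigma = n_1, \ldots, n_m$ in $L$ with $j \in \{1, \ldots, m-1\}$ and $j < k \le m$ for which $n_j = x$ and $n_k = y$ hold.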
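Definition 3. (behavioral profile (log) [40]). Let $L$ be the event log. Let $A_L$ be an activity set of $L$, and let $\succ_L$ be the weak order relation of Definition 2. A pair $(x, y) \in A_L \times A_L$ is in at most one of the following relations:
(1) The strict order relation $x \rightarrow_L y$, iff $x \succ_L y$ and $y \nsucc_L x$
(2) The exclusiveness relation $x +_L y$, iff $x \nsucc_L y$ and $y \nsucc_L x$
(3) The interleaving order relation $x \parallel_L y$, iff $x \succ_L y$ and $y \succ_L x$
The set $BP_L = \{\rightarrow_L, +_L, \parallel_L\}$ is the behavioral profile of $L$.
Note that we say that a pair $(x, y)$ is in reverse strict order, denoted by $x \rightarrow_L^{-1} y$, if and only if $y \rightarrow_L x$.
Definition 3 indicates that if no trace in the log contains both the activity $x$ and the activity $y$, then $x +_L y$. If there are two different traces such that $x$ precedes $y$ in one and $y$ precedes $x$ in the other, or there is a single trace in which both orders occur, then $x \parallel_L y$. If there is a trace in which $x$ precedes $y$ and there is no trace in which $y$ precedes $x$, then $x \rightarrow_L y$.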
Definition 4. (causal behavioral profile (log) [40]). Let $L$ be the event log. Let $A_L$ be an activity set of $L$. $BP_L$ is the behavioral profile of $L$.
(1) A pair $(x, y) \in A_L \times A_L$ is in the co-occurrence relation $\gg_L$, iff for every trace $\sigma \in L$, $x \in \sigma$ implies that $y \in \sigma$
(2) The set $CBP_L = BP_L \cup \{\gg_L\}$ is the causal behavioral profile of $L$
Clearly, the co-occurrence relation compensates for the optionality of the strict order relation. A causality holds between two activities $x$ and $y$ if they are in strict order and any trace in the log that contains the activity $x$ must also contain the activity $y$.
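As an illustration of Definitions 2 to 4, the following minimal Python sketch derives the weak order relation, the behavioral profile, and the co-occurrence relation from a small log. It follows the standard formulation of behavioral profiles in [39, 40]; the function names, the relation labels ("->", "<-", "||", "+"), and the three-trace toy log are assumptions made for this example only and do not come from the paper.

```python
from itertools import product


def weak_order(log):
    """All pairs (x, y) such that x occurs before y in at least one trace."""
    pairs = set()
    for trace in log:
        for j in range(len(trace) - 1):
            for k in range(j + 1, len(trace)):
                pairs.add((trace[j], trace[k]))
    return pairs


def behavioral_profile(log):
    """Map every activity pair to '->', '<-', '||' or '+' (Definition 3)."""
    activities = sorted({a for trace in log for a in trace})
    wo = weak_order(log)
    profile = {}
    for x, y in product(activities, repeat=2):
        fwd, bwd = (x, y) in wo, (y, x) in wo
        if fwd and bwd:
            profile[(x, y)] = "||"   # interleaving order
        elif fwd:
            profile[(x, y)] = "->"   # strict order
        elif bwd:
            profile[(x, y)] = "<-"   # reverse strict order
        else:
            profile[(x, y)] = "+"    # exclusiveness
    return profile


def co_occurrence(log):
    """Pairs (x, y) such that every trace containing x also contains y."""
    activities = {a for trace in log for a in trace}
    return {
        (x, y)
        for x in activities
        for y in activities
        if all(y in trace for trace in log if x in trace)
    }


# Hypothetical toy log (not the log used in the paper).
toy_log = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]
bp = behavioral_profile(toy_log)
co = co_occurrence(toy_log)
```

In this toy log, b and c are interleaved because both orders are observed, b and e are exclusive because no trace contains both, and a is in strict order with every other activity.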
5. An Optimization Approach for Mining of Process Models with Infrequent Behaviors Integrating Data Flow and Control Flow
This section describes an approach for identifying effective infrequent behaviors from the perspective of data dependency and gives an algorithm to reconstruct optimized process models integrating data flow and control flow by incorporating effective infrequent behavior. Section 5.1 presents some relevant definitions and an algorithm for an effective infrequent behavior recognition approach based on frequent patterns under data awareness. Section 5.2 gives an optimization approach for mining of the process model with infrequent behaviors integrating data flow and control flow. The research framework of the proposed approach is shown in Figure 2.
5.1. An Effective Infrequent Behavior Recognition Approach Based on Frequent Patterns under Data Awareness
In this section, Section 5.1.1 first introduces some definitions related to the proposed approach, such as pattern, subsequence matching a pattern, interactive behavioral profile, and conditional dependency probability. Then, Section 5.1.2 elaborates on how to identify effective infrequent behaviors by using frequent patterns, interactive behavioral profiles, and conditional dependency probabilities.
5.1.1. The Relation between Infrequent Behavior and Data Dependency
Prior to presenting the filtering approach, we present some basic notations used throughout the paper. Let denote the set of all possible activities and let denote a set of finite sequences over . A finite sequence of length over is a function: , alternatively written as , where for . The empty sequence is written as . The concatenation of sequences and is written as . A sequence is a subsequence of sequence if and only if we can write as , where both and are allowed to be , i.e., is a subsequence of itself. The beginning activity of a finite trace is written as , and the end activity of a finite trace is written as with . The set of all beginning activities and all ending activities in the event log is written as and , respectively, where and .
Consider the event log , which includes five traces, where , , , , and (the superscript of a trace indicates the number of times the trace appears in the log). Figure 3 shows the directly-follows graph of log . In the directly-follows graph, each node represents an activity, and an edge represents the directly-follows relationship between two activities in a trace. indicates the start node of the log, double circles indicate the end node of the log, a line with a double-sided arrow indicates that two activities are in a concurrent relationship, and a line with a single-sided arrow indicates a directly-follows relationship.
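To make the construction of a directly-follows graph concrete, the sketch below counts directly-follows edges, start activities, and end activities from a log given as trace variants with frequencies. The log contents and function names are hypothetical. Treating a pair of activities as concurrent when both directions are observed is a simplifying assumption of this sketch (it cannot tell concurrency from a short loop) and is not necessarily the rule used to draw Figure 3.

```python
from collections import Counter


def directly_follows_graph(log):
    """Build a directly-follows graph from {trace_variant: frequency}.

    Returns three Counters: edge frequencies, start-activity frequencies,
    and end-activity frequencies.
    """
    edges, starts, ends = Counter(), Counter(), Counter()
    for trace, freq in log.items():
        if not trace:
            continue  # an empty trace contributes nothing
        starts[trace[0]] += freq
        ends[trace[-1]] += freq
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += freq
    return edges, starts, ends


def concurrent_pairs(edges):
    """Assumed heuristic: a and b are concurrent if a->b and b->a both occur."""
    return {frozenset((a, b)) for (a, b) in edges if a != b and (b, a) in edges}


# Hypothetical log: each variant maps to the number of times it was observed.
toy_log = {
    ("a", "b", "c", "d"): 10,
    ("a", "c", "b", "d"): 8,
    ("a", "e", "d"): 2,
}
edges, starts, ends = directly_follows_graph(toy_log)
concurrent = concurrent_pairs(edges)
```

Here the edge (a, b) carries frequency 10, every trace starts in a and ends in d, and the only pair flagged as concurrent is {b, c}.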
According to Definition 3, we obtain the behavioral profile between activities in the log shown in Figure 4. Clearly, and are distinct traces, but in fact they are behaviorally equivalent, as activity and activity are in an interleaving order relation. The same is true for traces and . Therefore, traditionally treating the trace as a sequence is too imprecise. We consider sequences of equivalent behavior to be the same, even if the sequences themselves differ. To analyze the frequency and correctness of subsequences included in a trace from the behavior perspective, we define a pattern that considers all types of structures: sequence, selection, concurrency, and loop.
Definition 5. (pattern). Given the event log, let be an activity set of . Let be a directly-follows graph of log L. All vertices of are written as , and all edges of are written as . When a connected subgraph satisfies the following two conditions, we call it a pattern of log :(1), (2), ,
Definition 5 indicates that the pattern represents part of the behavior of the trace in the event log, and any activity in the pattern has the same behavioral profile relation with other activities that are not in this pattern. For convenience, the vertices of the pattern are written as , the edges of the pattern are written as , and the pattern to which the activity belongs is written as .
In the example provided in Figure 5, according to Definition 5, for , , and , then . For and , then .
To better express the interaction behavior between patterns, the concept of an interactive behavioral profile is introduced as follows.
Definition 6. (interactive successor relationship and interactive input (or output) arc). Given the event log , let be an activity set of and be one of the patterns of the log . For, we denote ≼_{I} as an interactive successor relationship between the activity and the activity , if and such that . And we say that the activity has an interactive input arc and the activity has an interactive output arc, respectively.
Definition 6 indicates that when the activity and the activity belong to different patterns and a trace exists in forms of , there exists an interactive successor relationship between and . For instance, in Figure 5, there are two interactive successor relationships between patterns and : ≼_{I} and ≼_{I} .
An activity is referred to as the entry node of pattern if or has an interactive input arc. Similarly, an activity is referred to as the exit node of the pattern if or has an interactive output arc. As a pattern may have multiple entry nodes, or multiple exit nodes, the set of entry nodes of pattern is written as , and the set of exit nodes of pattern is written as .
In a pattern, indicates the interactive input arc of node a and indicates the interactive output arc of node a.
According to Definition 5, the directly-follows graph of the aforementioned log can be divided into three highly cohesive, loosely coupled subpatterns, as shown in Figures 5(a)–5(c).
Definition 7. (subsequence matching a pattern). Let be an event log over a set of activities . Let pattern be a subpattern of log . A subsequence in the trace is said to match a pattern , denoted as , when with and holds, where denotes a path from node to node , and denotes the sequence of activities that consists of all nodes on the path from node to node .
Definition 7 illustrates that a subsequence is considered to match the pattern when it corresponds to a substring consisting of all nodes on the path from the entry node to the exit node in the pattern . The pattern that matches a subsequence is denoted as .
For example, the subsequences belonging to the traces in log match patterns and , respectively. Different subsequences that match the same pattern are considered behaviorally equivalent, i.e., although and are two different sequences, they are considered to be behaviorally equivalent. The same holds for and .
Definition 8. (interactive behavioral profile (pattern)). Given the event log $L$, let $A_L$ be an activity set of $L$ and $\preceq_I$ be an interactive successor relationship. The interactive behavioral profile is the 3-tuple $(\rightarrow_I, \parallel_I, +_I)_L$ defined for activities $x, y \in A_L$ by
(i) $x \rightarrow_I y$ iff $x \preceq_I y$ and $y \npreceq_I x$
(ii) $x \parallel_I y$ iff $x \preceq_I y$ and $y \preceq_I x$
(iii) $x +_I y$ iff $x \npreceq_I y$ and $y \npreceq_I x$
Also, we say that a transition pair $(x, y)$ is in reverse strict order of the interaction, denoted by $x \rightarrow_I^{-1} y$, if and only if the pair $(y, x)$ satisfies the strict order of the interaction, i.e., $y \rightarrow_I x$.
According to Definition 8, the interactive behavioral profile of patterns , , and is shown in Figure 6.
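The interactive behavioral profile can be sketched in the same way as the ordinary behavioral profile, restricted to activities of different patterns. The sketch below assumes that the partition of activities into patterns is already known and is passed in as a dictionary; it also assumes that an interactive successor relationship holds when one activity is directly followed by an activity of another pattern in some trace, which is our reading of Definition 6. The names, the relation labels, and the toy data are hypothetical.

```python
def interactive_successors(log, pattern_of):
    """Pairs (x, y) in different patterns where y directly follows x in a trace."""
    succ = set()
    for trace in log:
        for x, y in zip(trace, trace[1:]):
            if pattern_of[x] != pattern_of[y]:
                succ.add((x, y))
    return succ


def interactive_profile(log, pattern_of):
    """Classify every cross-pattern activity pair by the three relations."""
    succ = interactive_successors(log, pattern_of)
    profile = {}
    for x in pattern_of:
        for y in pattern_of:
            if pattern_of[x] == pattern_of[y]:
                continue  # only interactions between different patterns
            fwd, bwd = (x, y) in succ, (y, x) in succ
            if fwd and bwd:
                profile[(x, y)] = "||_I"
            elif fwd:
                profile[(x, y)] = "->_I"
            elif bwd:
                profile[(x, y)] = "<-_I"
            else:
                profile[(x, y)] = "+_I"
    return profile


# Hypothetical traces and a hypothetical assignment of activities to patterns.
toy_log = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]
toy_patterns = {"a": "P1", "b": "P2", "c": "P2", "e": "P3", "d": "P4"}
ibp = interactive_profile(toy_log, toy_patterns)
```

Pairs inside the same pattern, such as (b, c), are deliberately left out, since their ordering is already described by the pattern itself.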
An infrequent trace occurs at low frequencies either because it contains low-frequency events or because it contains low-frequency subsequences, along with a large number of high-frequency subsequences. How can the frequency of occurrence of subsequences or activities be determined? To solve this problem, the activity frequency and pattern frequency are both given below.
Definition 9. (activity frequency [24]). Given the event log , let be an activity set of . The frequency of an activity is defined as
Given a frequency threshold , an activity is frequent iff .
A pattern can reflect the structural behavior relationship between activities. Since multiple different subsequences with the same behavior can correspond to the same pattern, it is more accurate to measure their frequency by using the frequency of patterns.
Definition 10. (pattern frequency). Let be an event log over a set of activities . Let pattern be a subpattern of log . The frequency of a pattern is defined as
Given frequency thresholds , a pattern is frequent iff .
Theorem 1. Let be an event log over a set of activities . Let pattern be a subpattern of log . Given a subsequence , is frequent iff is frequent.
Proof. According to Definition 12, we can easily obtain this conclusion.
Frequency-based filtering techniques only consider direct dependencies between activity pairs, yet in some infrequent traces the directly-follows relation between every activity pair is frequent. For instance, <a, b> and <b, c> may be frequent activity pairs in the log, while the subsequence <a, b, c> composed of <a, b> and <b, c> is a low-frequency subsequence. Using only the direct dependency between activity pairs will not identify <a, b, c> as an infrequent subsequence. Therefore, filtering infrequent behavior based only on the occurrence frequency of single activity pairs is too imprecise. In this case, it is necessary to compute the probability that a certain activity directly occurs after a longer preceding subsequence. Definitions 11 and 12 compute the number of conditional occurrence times and the conditional dependency probability, respectively, of the activity directly following the subsequence when a data dependency condition exists between the subsequence and the activity.
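The gap between pair frequency and subsequence frequency can be reproduced with a few lines of Python. The log below is a hypothetical example constructed so that <a, b> and <b, c> are each frequent while <a, b, c> is rare; the threshold of 100 and the helper name are likewise assumptions of this sketch.

```python
def count_contiguous(log, subsequence):
    """Frequency-weighted number of traces containing `subsequence` contiguously."""
    n = len(subsequence)
    total = 0
    for trace, freq in log.items():
        if any(trace[i:i + n] == subsequence for i in range(len(trace) - n + 1)):
            total += freq
    return total


# Hypothetical log: trace variant -> number of occurrences.
toy_log = {
    ("a", "b", "d"): 400,
    ("e", "b", "c"): 400,
    ("a", "b", "c"): 5,
}
threshold = 100  # assumed frequency threshold

freq_ab = count_contiguous(toy_log, ("a", "b"))
freq_bc = count_contiguous(toy_log, ("b", "c"))
freq_abc = count_contiguous(toy_log, ("a", "b", "c"))

pairs_look_frequent = freq_ab >= threshold and freq_bc >= threshold
subsequence_is_infrequent = freq_abc < threshold
```

A filter that only inspects the two pairs would accept the trace <a, b, c>, even though the three-activity subsequence appears in only 5 of 805 traces.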
Definition 11. (conditional occurrence times). Given a subsequence , an activity , and dependency conditions , we write to represent the conditional occurrence times of a subsequence with the latest attribute values directly followed by an activity a under dependency condition ; it is defined as
where represents the pattern that the subsequence matches.
For example, consider traces and in the previously mentioned log , and let (1) the activity follow directly after the subsequence with the latest attribute values in the trace , and (2) the activity also follow directly after the subsequence with the latest attribute values in the trace . According to Definition 11, . Definition 11 thus counts the conditional occurrences over all behaviorally equivalent subsequences (arising, for example, from concurrency or loops) that are directly followed by the same activity under the same conditions.
Behavioral dependencies between activities in real-world systems may be affected by direct or indirect data dependencies between activities. To capture the strength of the behavioral dependence they cause, Definition 12 gives the concept of conditional dependency probability between a subsequence and an activity, based on the literature [20].
Definition 12. (conditional dependency probability). Let be an event log over a set of activities . Given a subsequence , an activity , and dependency conditions , we write to represent a conditional dependency probability of the subsequence with the latest attribute value directly followed by the activity under dependency conditions ; it is defined as
where represents other activities except activity and .
Obviously, the value of is a real number in (−1, 1). When the dependency condition has the latest attribute value , the higher the value of , the more likely the activity directly follows the subsequence . If an infrequent subsequence has a higher value of , it can be judged to be a correct infrequent behavior.
For a given conditional dependency probability threshold , is considered to be a reasonable subsequence under the current data dependency iff .
Definition 13. (special data dependency). Given an event log L, a subsequence , an activity , and dependency conditions or . or is regarded as special data dependency if the following two conditions are met:
where is a frequency threshold, is the total number of traces in the log, and represents the dependency condition has the latest attribute value .
5.1.2. An Effective Infrequent Behavior Recognition Approach Based on Frequent Patterns under Data Awareness
Definition 12 in Section 5.1.1 quantifies the strength of data dependency on the behavioral dependency between the activity and the subsequence, which provides a basis for the identification of effective infrequent behaviors. This section provides an effective infrequent behavior recognition approach based on frequent patterns under data awareness.
For the frequent traces in the log, a number of behaviorally highly cohesive and loosely coupled patterns can be constructed from their directly-follows graph and behavioral profile. It is easy to determine that these subpatterns are frequent patterns. Since an infrequent trace often contains some frequent behavior in addition to infrequent behavior, there may be a direct or indirect data dependency between them. To make full use of this dependency and accurately capture its impact on behavioral relationships, Algorithm 1 first finds out which part of the behavior in the trace occurs with low frequency. It then checks whether special data context information exists in the context of the infrequent behavior. If it exists, the conditional dependency probability is used to analyze the influence strength of the data flow information on the infrequent behavior. If its value is greater than a certain threshold, the infrequent behavior is considered to be an infrequent but correct behavior; otherwise, it is considered to be noise.
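The decision rule described in this paragraph can be summarized by the following sketch. It is not Algorithm 1 itself: the check for a special data dependency (Definition 13) and the conditional dependency probability (Definition 12) are passed in as callables, and the two toy stand-ins below return made-up values purely to exercise the control flow. Treating behavior without any special data context as noise is also an assumption of this sketch.

```python
def classify_infrequent_behavior(subsequence, activity, context,
                                 has_special_dependency, cdp, threshold):
    """Return 'effective' or 'noise' for one infrequent behavior.

    has_special_dependency and cdp are caller-supplied callables standing in
    for Definition 13 and Definition 12; both are hypothetical interfaces.
    """
    if not has_special_dependency(subsequence, activity, context):
        return "noise"  # assumed: no special data context means noise
    if cdp(subsequence, activity, context) > threshold:
        return "effective"
    return "noise"


# Toy stand-ins with invented values (not the paper's formulas).
def toy_special(subsequence, activity, context):
    return context.get("contacts", 0) > 15


def toy_cdp(subsequence, activity, context):
    return 0.9 if activity == "K'" else 0.1


verdict_rare_but_valid = classify_infrequent_behavior(
    ("F",), "K'", {"contacts": 20}, toy_special, toy_cdp, threshold=0.8)
verdict_no_context = classify_infrequent_behavior(
    ("F",), "K'", {"contacts": 3}, toy_special, toy_cdp, threshold=0.8)
verdict_weak_dependency = classify_infrequent_behavior(
    ("F",), "Z", {"contacts": 20}, toy_special, toy_cdp, threshold=0.8)
```

Keeping the two measures as parameters makes it explicit that the classification depends only on whether a special data dependency exists and on how the probability compares with the threshold.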

Step 1–Step 11 in Algorithm 1 analyze the validity of the infrequent trace in terms of the infrequent activity, where Step 1–Step 3 determine whether there is an infrequent activity in the trace and if it exists, Step 4–Step 11 determine the correctness of the infrequent activity occurrence from a data dependence perspective. Step 12–Step 13 analyze the validity of the infrequent trace in terms of the infrequent subsequence. Step 12 divides the trace into several subsequences according to the activity set in the frequent subpatterns. The divided subsequence is either a frequent subsequence or an infrequent subsequence. If the divided subsequence includes a smaller infrequent subsequence, Step 16–Step 21 analyze the correctness of the infrequent subsequence from the perspective of data dependence. If the divided subsequences are all legal subsequences, Step 21–Step 33 determine whether the interaction behavior between the subsequences is reasonable according to the interaction behavior profile of the frequent subpatterns. If it is unreasonable, the correctness of infrequent interaction between them is judged from the data dependency perspective.
5.2. An Optimization Approach for Mining of Process Models with Infrequent Behaviors Integrating Data Flow and Control Flow
Algorithm 1 analyzes the effectiveness of infrequent behavior based on direct or indirect data dependencies between events. On this basis, Algorithm 2 gives an optimization approach for mining process models with infrequent behaviors that integrates data flow and control flow. First, the initial process model based on the control flow is constructed from the frequent traces by using the IMi mining algorithm. Then, Algorithm 1 is used to identify all the effective infrequent behaviors in the event log. Finally, an optimized process model integrating data flow and control flow is constructed by incorporating all the effective infrequent behaviors into the initial process model.

Step 1–Step 4 in Algorithm 2 preprocess the traces in the event log according to their occurrence frequency and divide the log into two sets, one containing all frequent traces and the other containing the infrequent traces that need to be analyzed. The initial process model is built by applying the IMi mining algorithm on the event log in Step 5. Incomplete traces that do not start or end normally are deleted and simultaneously added to the set in Step 6–Step 11. Step 12–Step 14 use Algorithm 1 to determine whether each trace in the infrequent set is an effective infrequent trace and divide these traces into two subsets: the effective infrequent trace set and the noise set. An optimized process model fusing control flow and data flow is obtained in Step 15 by adding the behaviors in the effective infrequent trace set to the initial process model.
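The preprocessing part of these steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: traces are simplified to tuples of activity names, the function name `split_log`, the relative frequency threshold, and the sets of normal start and end activities are hypothetical, and incomplete traces are assumed to go to the noise set.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

TraceVariant = Tuple[str, ...]


def split_log(log: List[TraceVariant], freq_threshold: float,
              start_acts: Set[str], end_acts: Set[str]
              ) -> Dict[str, List[TraceVariant]]:
    """Partition trace variants into frequent, infrequent, and noise.

    Assumptions: a variant is frequent if its share of all traces reaches
    freq_threshold; variants that do not start or end with a normal activity
    are treated as incomplete and placed in the noise set.
    """
    counts = Counter(log)
    total = len(log)
    result: Dict[str, List[TraceVariant]] = {
        "frequent": [], "infrequent": [], "noise": []}
    for variant, count in counts.items():
        incomplete = (not variant
                      or variant[0] not in start_acts
                      or variant[-1] not in end_acts)
        if incomplete:
            result["noise"].append(variant)
        elif count / total >= freq_threshold:
            result["frequent"].append(variant)
        else:
            result["infrequent"].append(variant)
    return result


# Toy log with one frequent variant, one infrequent variant, and one incomplete trace.
toy_log = [("a", "b", "c")] * 8 + [("a", "d", "c")] + [("a", "b")]
parts = split_log(toy_log, freq_threshold=0.2,
                  start_acts={"a"}, end_acts={"c"})
```

The traces in `parts["infrequent"]` are the ones that would then be passed to the data-dependency analysis, while the frequent variants would serve as input to the control-flow discovery algorithm.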
6. Evaluations and Results
In this section, we report controlled experiments on synthetic and real-life event logs that compare our approach with existing approaches, and we discuss the results. First, in Section 6.1, we illustrate the steps of the proposed infrequent behavior identification approach using the synthetic event log shown in Section 3 and then report the number of infrequent behaviors correctly identified by our approach and by other approaches. Then, in Section 6.2, we compare the proposed approach with other approaches to measure the quality of the discovered process model when different levels of infrequent behavior are injected into the real-life logs. These experiments were performed on an Intel i7-6500 processor (2.50 GHz) with 8 GB of RAM.
6.1. Synthetic Dataset
To verify the correctness of Algorithm 1, the event log given in Section 3 is taken as an example to show how Algorithm 1 identifies effective infrequent behavior. First, the causal behavioral profile is obtained from the frequent traces in the event log, as shown in Figure 7 (note that the subscript L of the behavioral profile is omitted here). The six maximal frequent patterns obtained from the behavioral profile in Figure 7 are presented in Figure 8. The corresponding interactive behavior profiles between them are shown in Figure 9.
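As an illustration of how a behavioral profile can be derived from frequent traces, the following Python sketch computes the basic relations from the weak order relation (activity a occurs somewhere before activity b in at least one trace). It is a simplified sketch: the co-occurrence relation of the causal behavioral profile is omitted, traces are assumed to be tuples of activity names, and the function names and relation labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def weak_order(traces: List[Tuple[str, ...]]) -> Set[Tuple[str, str]]:
    """Collect all pairs (a, b) such that a occurs before b in some trace."""
    relation: Set[Tuple[str, str]] = set()
    for trace in traces:
        for i, a in enumerate(trace):
            for b in trace[i + 1:]:
                relation.add((a, b))
    return relation


def behavioral_profile(traces: List[Tuple[str, ...]]
                       ) -> Dict[Tuple[str, str], str]:
    """Classify each pair of distinct activities by its weak order relations."""
    activities = sorted({a for trace in traces for a in trace})
    order = weak_order(traces)
    profile: Dict[Tuple[str, str], str] = {}
    for a, b in combinations(activities, 2):
        forward, backward = (a, b) in order, (b, a) in order
        if forward and backward:
            profile[(a, b)] = "interleaving"
        elif forward:
            profile[(a, b)] = "strict order"          # a always before b
        elif backward:
            profile[(a, b)] = "reverse strict order"  # b always before a
        else:
            profile[(a, b)] = "exclusive"             # never in the same trace
    return profile


# Toy frequent traces: b and c are concurrent, d is an alternative to both.
frequent_traces = [("a", "b", "c", "e"), ("a", "c", "b", "e"), ("a", "d", "e")]
profile = behavioral_profile(frequent_traces)
```

Pairs in strict order or interleaving tend to belong to the same cohesive pattern, whereas exclusive pairs mark alternative branches; how the paper groups activities into maximal frequent patterns is defined in its earlier sections and is not reproduced here.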
For the infrequent traces , since the end activity of trace is an abnormal end activity, it is easy to determine that it is noise. In the event log, the attributes of some activities and their attribute values of infrequent traces , and are shown in Table 3 to Table 7, respectively. The conditional dependence probabilities between certain activities and subsequences are computed according to Algorithm 1, as displayed in Table 8. The results show that when the condition dependence threshold , the traces , and are considered to be an effective infrequent behavior using the proposed approach.
Subsequently, we evaluate the ability of the proposed approach to identify effective infrequent behaviors in comparison with the IMi algorithm [17], the FM algorithm [19], and the DHM algorithm [20]. Table 9 indicates that the proposed approach correctly identifies more effective infrequent behaviors than the other approaches, whereas the DHM algorithm may mistake an incorrect infrequent trace for a correct one.
Finally, the optimized process model fusing control flow and data flow is constructed by incorporating these infrequent behaviors into the process model, as shown in Figure 10. The transitions in process model are unobservable activities representing data flow that have been added for routing purposes only and do not appear in the event log. Measured with the approach proposed in [41], the fitness of the model improves to 0.993, compared with 0.939 for the initial model.
Algorithm 1 uses two thresholds: a frequency threshold for activities and a conditional dependency probability threshold. The former differentiates frequent activities from infrequent activities, while the latter differentiates effective infrequent behaviors from non-effective infrequent behaviors. In practice, the performance of identifying effective infrequent behaviors is mainly affected by the conditional dependency probability threshold. To illustrate how varying this threshold affects the identification, we designed an experiment that measures the amount of effective infrequent behavior correctly identified by our technique. Here, we use the previous synthetic log and incrementally inject infrequent behaviors to evaluate the effect of different threshold levels on Algorithm 1. As shown in Figure 11, the rate of effective infrequent behavior correctly recognized generally decreases as the threshold increases. When the threshold is set to a high value, two kinds of infrequent behavior cannot be identified correctly: behaviors that happen very rarely (i.e., only once or twice), and behaviors whose data dependency conditions are shared by other traces containing recorded errors.
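The qualitative effect of the conditional dependency probability threshold can be sketched in Python as follows. The probability values below are purely illustrative and are not taken from the experiments; the function name `recognition_rate` is hypothetical, and the sketch assumes that a behavior is recognized when its probability is strictly greater than the threshold.

```python
from typing import List


def recognition_rate(probabilities: List[float], threshold: float) -> float:
    """Share of effective infrequent behaviors whose probability exceeds the threshold."""
    if not probabilities:
        return 0.0
    recognized = sum(1 for p in probabilities if p > threshold)
    return recognized / len(probabilities)


# Illustrative conditional dependency probabilities of four behaviors that
# are assumed to be truly effective (hypothetical values).
effective_probs = [0.95, 0.9, 0.7, 0.5]

# Recognition rate for increasing threshold values.
rates = {t: recognition_rate(effective_probs, t) for t in (0.4, 0.6, 0.8)}
```

Because a behavior is dropped as soon as the threshold passes its probability, the recognition rate can only stay constant or decrease as the threshold grows, which matches the general trend described above.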
6.2. Real-Life Dataset
We designed a simulation experiment using the claims data package provided by an insurance company platform. The data come from the company's Insurance Service Platform (Log Data1) and include 980 cases, 13,280 events, 27 activities, and 12 attributes. For Log Data1, we compare the proposed approach, the IMi algorithm, and the DHM algorithm on precision [42] and fitness [41] to evaluate the quality of the discovered model when 1% to 9% infrequent behavior is injected into the event log. In many cases, there is a trade-off between these two metrics. To balance them, the F-score is often used to combine fitness and precision through their harmonic mean, F-score = 2 × fitness × precision / (fitness + precision). In Figures 12–14, the abscissa corresponds to the ratio of injected infrequent behavior, and the ordinate corresponds to the fitness, precision, and F-score value, respectively.
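The combination of fitness and precision into a single score can be sketched in Python as follows, using the standard harmonic mean of two values. The function name `f_score` and the input values in the example are hypothetical and are not results from the experiments.

```python
def f_score(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision; returns 0.0 if both are zero."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)


# Illustrative values only: a model with high fitness but moderate precision.
example = f_score(0.9, 0.6)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a model cannot obtain a high F-score by maximizing fitness at the expense of precision, or vice versa.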
Figure 12 shows that the proposed approach can find more infrequent behaviors than the IMi and DHM approaches and that it significantly improves the fitness of the discovered model. Since the IMi algorithm filters infrequent behaviors based only on frequency from the control-flow perspective, many infrequent behaviors are disregarded as noise. Therefore, the fitness of the resulting model is relatively low. As the amount of infrequent behavior increases, the overall fitness of all three approaches declines.
Figure 13 indicates that the precision of the model obtained by the proposed approach is higher than that of the others when less infrequent behavior is injected. This may be because adding data flow to the resulting model reduces additional behavior. The precision of the DHM algorithm is relatively low. Although it can find more infrequent behavior, it adds more control flow to the resulting causal net without data flow information, which makes the discovered model more complicated. As the amount of infrequent behavior increases, the overall precision of all three approaches shows a downward trend.
Figure 14 shows that the F-score of the proposed approach is generally superior to that of the IMi and DHM approaches. Because an increase in infrequent behavior may lead to a small decrease in precision, the F-score of the proposed approach is slightly lower than that of the others in some cases.
The experimental results on synthetic and real logs show that our approach noticeably improves the fitness of the discovered process model without significantly reducing its precision. Thus, our approach is promising in that it can preserve the effective infrequent behavior that represents important information about the system when discovering the process model. Hence, the proposed approach provides better support for enterprise business improvement.
7. Conclusions and Future Work
In this paper, an effective infrequent behavior recognition approach based on frequent patterns under data awareness is presented. It analyzes the coupling between infrequent behavior and data dependency information and uses the conditional dependency probability to quantify the influence strength between them. This approach provides a qualitative and quantitative analysis for the identification of effective infrequent behavior and captures long-term dependencies between an activity and a frequent pattern, not only directly-follows data dependencies. Moreover, an optimization approach for mining process models with infrequent behaviors that integrates data flow and control flow is provided. We compared the proposed approach with other techniques, showing that our approach discovers infrequent behavior that other techniques cannot detect. Furthermore, the evaluation on synthetic and real-life event logs indicates that incorporating infrequent but correct behavior, by adding appropriate data flow and control flow to the resulting process model, greatly improves the fitness of the discovered process model without significantly reducing its precision.
In the future, the proposed approach will be applied to more application fields, and the various factors that lead to the occurrence of infrequent behavior from a data-flow perspective will be further studied. Association rules will be used to reveal data dependencies between activities to provide a better basis for the recognition of infrequent behaviors.
Data Availability
The data used to support the findings of this study were supplied by an insurance company under license and so cannot be made freely available. Requests for access to these data should be made to [email protected].
Conflicts of Interest
The authors declare that they have no potential conflicts of interest.
Acknowledgments
This work was partially supported by the National Natural Science Foundation, China (nos. 61572035 and 61402011), the Leading Backbone Talent Project in Anhui Province, China (2020112), the Natural Science Foundation of Anhui Province, China (no. 2008085QD178), Anhui Province Academic and Technical Leader Foundation (no. 2019H239), Anhui Province College Excellent Young Talents Fund Project of China (gxyqZD2020020), and the Open Project of the Key Laboratory of Embedded System and Service Computing Ministry of Education (no. ESSCKF201804).