Abstract

Business process models are required to be in line with frequently changing regulations, policies, and environments. In the field of intelligent modeling, organisations are concerned with automated business process compliance checking, because manual verification is time-consuming and inefficient. There are two key issues in business process compliance checking. One is the definition of a business process retrieval language that can be employed to capture compliance rules; the other is the efficient evaluation of these rules. Traditional syntax-based retrieval approaches cannot deal with various important requirements of compliance checking in practice. Although a retrieval language that is based on semantics can overcome the drawbacks of syntax-based ones, it suffers from the well-known state space explosion problem. In this paper, we define a semantics-based process model query language by simplifying a property specification pattern system without affecting its expressiveness. We use this language to capture semantics-based compliance rules and constraints. We also propose a feasible approach that avoids the state space explosion as far as possible during compliance checking. A tool is implemented to evaluate the efficiency. An experiment conducted on three model collections illustrates that our technology is very efficient.

1. Introduction

Business process models are valuable intellectual assets capturing the ways organisations conduct their business. Business process management evolves increasingly fast due to changing environments and emerging technologies. As a result, organisations accumulate huge numbers of business process models, some of which are highly complex. For example, Haier, one of the largest Chinese consumer electronics manufacturers, has gathered more than 4,000 process models over the years from various domains, including purchase, financing, distribution, and service. In this context, support for business process management, for example, for the purposes of knowledge discovery and process reuse, faces real challenges. One of these challenges, which is important for maintaining a competitive advantage, concerns business process compliance checking, that is, making sure that business processes are in line with frequently changing business environments and legal regulations. This problem has also gradually emerged as an important branch of intelligent modeling. Two key issues must be addressed for automated business process compliance checking. One is a retrieval language that can be employed to capture compliance rules; the other is the efficient evaluation of these rules.

In recent years, several query languages have been proposed to retrieve process models in repositories, such as BP-QL [1] and BPMN-Q [2]. In [3], BPMN-Q was also used to capture compliance rules. However, these languages are based on the syntax (structure) of process models rather than on their semantics. In the syntax of a process model, a directed path connecting a task A and a task B does not mean that during execution task A will always occur before task B. Let us consider, for example, the three process models in Figure 2. In these models, rectangles represent tasks, arcs represent sequential dependencies between tasks, and diamonds represent choices (if a diamond has one input arc and multiple output arcs) or merges (if a diamond has multiple input arcs and one output arc). These models represent three variants of a business process for opening an account in the BPMN notation [4]. These three variants could specify the way an account is opened in three different states in which the bank conducts its business and could be part of a repository of hundreds, even thousands, of process models for all states in which the bank operates. Next, let us take BPMN-Q as an example to illustrate the drawback of syntax-based languages. A rule written in BPMN-Q uses a directed edge connecting two activities to represent that these two activities are executed in order (in just some executions of a process). For example, the BPMN-Q query shown in Figure 2 can specify the compliance rule that task “receive customer request” is followed by task “analyse customer credit history” in some process executions. But if an analyst needs to retrieve processes in which task “receive customer request” occurs before task “analyse customer credit history” in every process execution, BPMN-Q cannot capture this kind of requirement.
Thus, after executing the query in Figure 2, we would retrieve the first and the third processes, since in both process (a) and process (c) there exists at least one process execution in which, if task “receive customer request” occurs, then task “analyse customer credit history” eventually occurs. However, according to the requirement, process (c) does not belong to the result, because it has an execution in which task “receive customer request” does not precede task “analyse customer credit history” (the process execution in which task “open VIP account” is run). As a result, the problem of BPMN-Q is that people cannot know whether all of the process executions of a resulting process satisfy the requirement or just some of them do. This issue is very important in the reuse of business processes, automatic modeling, and verification. For example, when reusing business processes, people often need to know whether there are process executions that fail to satisfy a requirement, with the goal of checking the reason and modifying these process executions. Therefore, in order to yield a correct result, we have to explore every process execution of every process in a repository, which is an inherently semantics-based task.

As we can see from the example, syntax-based retrieval languages are not powerful enough. Retrieval technologies based on semantics are in line with process execution and are therefore more intuitive to ordinary users who are not necessarily experts in business process management (BPM). A semantics-based process model query language should satisfy two requirements: (1) it can specify various semantic relationships between tasks; (2) it can explicitly specify whether these relationships hold in just some process executions or in every process execution.

In light of the above, in this paper we aim to address two questions. The first is how many semantic relationships between tasks are sufficient. The second is how to evaluate semantics-based compliance rules efficiently, given that the evaluation requires exploring every process execution of a process model and therefore suffers from the well-known state space explosion problem.

In [5], a property specification pattern system (SPS) was proposed for finite-state verification by Dwyer et al. SPS consists of 5 basic patterns and 5 scopes, which results in 5 × 5 = 25 LTL formulae. In this paper, we significantly simplify SPS without affecting its expressiveness through formal logic reasoning. After this simplification, there are only 3 basic LTL formulae from which the remaining formulae can be deduced. A retrieval language for expressing semantics-based compliance rules is based on this simplified SPS. With respect to the evaluation of the semantics of process models, we propose a feasible technology which can extract every execution of a business process model. In this way, the state space explosion can be avoided as far as possible. We achieve this by adopting the theory of complete finite prefixes (CFP) [6] and its improvements [7]. Moreover, a tool is implemented to evaluate the performance of our technology over three collections of Petri nets. Two of the three collections are obtained from practice, and the third, a much larger one, is generated artificially.

The remainder of this paper is organized as follows. In Section 2, we simplify SPS to define a process model retrieval language for specifying compliance rules. Section 3 presents the basic concepts of Petri nets, unfolding, and CFP. In Section 4, we detail the mechanism of efficient semantics-based compliance checking. Next, in Section 5, we illustrate the tool implementation and report on the performance evaluation over three process model collections. Finally, we discuss related work in Section 6 and conclude the paper in Section 7.

2. Language

As discussed in Section 1, a language is needed for specifying semantics-based compliance rules. This language should be powerful enough without being too complex. In this section, we simplify the SPS to obtain a core pattern system from which the remaining patterns and scopes of SPS can be derived. Then we present the formal definition of a new query language, namely, “a semantics-based process query language” (ASBPQL), based on this core pattern system.

2.1. LTL Formulae

Linear temporal logic (LTL) is a widely used formalism for specifying properties of concurrent, finite-state systems. In this subsection, we use LTL to reason about the core of SPS.

Definition 1 (linear temporal logic formulae). The formulae of linear temporal logic are built from a finite set of atomic propositions $AP$, the logical operators $\neg$, $\vee$, and $\wedge$, and the temporal modal operators $\mathbf{X}$ and $\mathbf{U}$. Formally, the set of LTL formulae over $AP$ can be inductively defined as follows: (i) both true and false are LTL formulae; (ii) for all $p \in AP$, $p$ and $\neg p$ are LTL formulae; (iii) if $\varphi$ and $\psi$ are both LTL formulae, then $\varphi \vee \psi$, $\varphi \wedge \psi$, $\mathbf{X}\varphi$, $\varphi\,\mathbf{U}\,\psi$, and $\varphi\,\mathbf{R}\,\psi$ are LTL formulae.

The operator $\mathbf{X}$ is read as “next” and denotes that its argument holds in the next state. The operator $\mathbf{U}$ is read as “until” and means that its first argument has to hold until its second argument is true, where it is required that the second argument holds eventually (some literature also defines the weak until operator $\mathbf{W}$, which is related to the strong until operator $\mathbf{U}$ through the following equivalences: $\varphi\,\mathbf{W}\,\psi \equiv (\varphi\,\mathbf{U}\,\psi) \vee \Box\varphi$, $\varphi\,\mathbf{U}\,\psi \equiv \Diamond\psi \wedge (\varphi\,\mathbf{W}\,\psi)$). The operator $\mathbf{R}$ is read as “releases” and is the dual of $\mathbf{U}$. In addition, two derived operators are in common use: (i) $\Diamond$ is read as “eventually” and requires that its argument be true eventually, that is, at some state in the future; (ii) $\Box$ is read as “always” and requires that its argument be true at all future states.

2.2. Simplification

SPS consists of 5 basic patterns (the other three patterns are defined based on them) and 5 scopes, as shown in Figure 3. The intents of the 5 basic patterns are as follows: (i) Absence, a given task never occurs within a scope; (ii) Universality, a given task occurs throughout a scope; (iii) Existence, a given task occurs at least once within a scope; (iv) Precedence, a task $S$ occurs before a task $P$ within a scope; (v) Response, a task $P$ must be followed by a task $S$ within a scope.

The meanings of the five scopes are as follows: (i) Global means the entire extent of a process execution; (ii) Before $R$ means the extent up to an occurrence of the given task $R$ within a process execution; (iii) After $Q$ means the extent after an occurrence of the given task $Q$ within a process execution; (iv) Between $Q$ and $R$ means the part of a process execution between an occurrence of the task $Q$ and an occurrence of the task $R$; (v) After $Q$ until $R$ is similar to the scope Between $Q$ and $R$ except that the designated part of a process execution continues even if the task $R$ does not occur.

As shown in Table 1, for each scope there is an LTL formula corresponding to a pattern, which results in 25 formulae.
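
For orientation, the five LTL formulae commonly given for the Absence pattern in the pattern system of [5] can be written as below. This is only a sketch under assumptions: it uses the conventional letters of that pattern system ($P$ for the constrained task, $Q$ and $R$ for the scope delimiters), with $\Box$ for “always”, $\Diamond$ for “eventually”, $\mathbf{U}$ for “until”, and $\mathbf{W}$ for “weak until”; the notation used in Table 1 may differ.

```latex
\begin{align*}
\text{Global:}                  &\quad \Box\,\neg P \\
\text{Before } R:               &\quad \Diamond R \rightarrow (\neg P \;\mathbf{U}\; R) \\
\text{After } Q:                &\quad \Box\,(Q \rightarrow \Box\,\neg P) \\
\text{Between } Q \text{ and } R: &\quad \Box\,\bigl((Q \wedge \neg R \wedge \Diamond R) \rightarrow (\neg P \;\mathbf{U}\; R)\bigr) \\
\text{After } Q \text{ until } R: &\quad \Box\,\bigl((Q \wedge \neg R) \rightarrow (\neg P \;\mathbf{W}\; R)\bigr)
\end{align*}
```

The only difference between the last two formulae is the use of weak until, which lets the constrained interval stay open when $R$ never occurs.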

Next, we provide proofs that the SPS can be simplified from 5 patterns and 5 scopes to only 3 patterns (Absence, Existence, and Precedence) and 1 scope (After $Q$ until $R$). This significantly reduces the number of formulae from 25 to 3.

First, we take pattern Absence as an example to prove that scope Before $R$ can be derived from scope After $Q$ until $R$. According to the semantics of LTL, if $Q$ is always true, that is, $\Box Q$, scope Before $R$ can be derived from scope After $Q$ until $R$. Now we prove that this proposition holds.

Proposition 2. Consider .

Proof. By contradiction, assume :(1) (by assumption),(2) (given), (3) (given), (4) (by (1)), (5) (by (4)), (6) (by (5)), (7) (by (6)), (8) (by (7)), (9) (by (7)), (10) (by (3)), (11) (by (2)), (12) (by (10), (11)), (13) (by (9), (12)), (14) (by (13)), (15) (by (9)), (16) (by (15)).
By (14), (16), we get a contradiction. So, we conclude that proposition holds.

Next, if $R$ is always false, that is, $\Box\neg R$, we can prove that for pattern Absence the formula corresponding to scope After $Q$ can be derived from the formula corresponding to scope After $Q$ until $R$.

Proposition 3. Consider .

Proof. One has(1) (given),(2)   (given),(3)   (by (1)),(4)   (by (2)),(5)   (by (3), (4)),(6)   (by (5)),(7)   (by (1)),(8)   (by (7)),(9)   (by (8)),(10)   (by (9)),(11)   (by (6), (10)),(12)   (by (11)).

Next, we prove that if $R$ holds eventually, that is, $\Diamond R$, we can derive the formula corresponding to scope Between $Q$ and $R$ from the formula corresponding to scope After $Q$ until $R$.

Proposition 4. Consider .

Proof. One has (1) (given), (2) by (1), (3) (given), (4) (by (2), (3)), (5) (by (4)), (6) (by (1), (5)).

Now we have proved that the formulae corresponding to three scopes (After $Q$, Before $R$, and Between $Q$ and $R$) can be derived from the formulae corresponding to scope After $Q$ until $R$. If $Q$ always holds and $R$ never holds, that is, $\Box Q \wedge \Box\neg R$, the formula corresponding to scope Global can be derived from that of scope After $Q$ until $R$. This proof is straightforward, and for reasons of space we do not present it in this paper.
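
A possible sketch of the omitted argument for pattern Absence is given below. It is not taken from the paper: it assumes the conventional formulae of [5] (task $P$, scope delimiters $Q$ and $R$), in which Absence in scope After $Q$ until $R$ reads $\Box((Q \wedge \neg R) \rightarrow (\neg P\,\mathbf{W}\,R))$ and Absence in scope Global reads $\Box\neg P$.

```latex
\begin{align*}
&(1)\;\; \Box Q \wedge \Box\neg R
   && \text{(given)} \\
&(2)\;\; \Box\bigl((Q \wedge \neg R) \rightarrow (\neg P \;\mathbf{W}\; R)\bigr)
   && \text{(given)} \\
&(3)\;\; \Box(Q \wedge \neg R)
   && \text{(by (1))} \\
&(4)\;\; \Box(\neg P \;\mathbf{W}\; R)
   && \text{(by (2), (3))} \\
&(5)\;\; \Box\bigl((\neg P \;\mathbf{U}\; R) \vee \Box\neg P\bigr)
   && \text{(by (4), definition of } \mathbf{W}\text{)} \\
&(6)\;\; \Box\,\neg(\neg P \;\mathbf{U}\; R)
   && \text{(by (1), since } \mathbf{U} \text{ requires } R \text{ eventually)} \\
&(7)\;\; \Box\,\Box\neg P
   && \text{(by (5), (6))} \\
&(8)\;\; \Box\neg P
   && \text{(by (7))}
\end{align*}
```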

Next, we prove that only patterns Absence, Existence, and Precedence are core patterns and that the remaining patterns in SPS can be derived from these three. Firstly, when we replace $P$ in the formulae corresponding to pattern Absence with $\neg P$, the formulae corresponding to pattern Universality are obtained; patterns Absence and Universality are duals of each other. Next, we present the explicit proofs of the derivation of pattern Response from patterns Absence and Existence in scope After $Q$ until $R$. Lemmas 5 and 6 will be used in this reasoning.

Lemma 5. Consider .

Proof. One has

Lemma 6. Consider .

Proof. By Lemma 5,

Proposition 7. Consider .

Proof. One has  (1) (given), (2) (by (1)), (3) (given), (4) (by (3)), (5)   (by (2), (4)), (6) (by Lemma 6), (7) (by (6)).

Finally, we obtain a simplified pattern system that consists of only 3 patterns (Absence, Existence, and Precedence) and one scope (After $Q$ until $R$), as shown in Figure 4. As we can see, this simplified pattern system is far more concise than SPS.

2.3. Syntax

Based on the simplified SPS, we can define the basic relationships between tasks in ASBPQL. One is Existence, and the other is Precedence. Two other relationships are in very common use in business process management: Exclusive and Concurrence. As discussed in Section 1, after defining the basic semantic relationships between tasks, we have to determine whether these relationships hold in just some process executions or in every process execution of a business process. Combining these considerations, we define 6 basic predicates to capture the occurrence of tasks and the relationships between tasks in some or every process execution. The first two basic predicates, posoccur and alwoccur, capture the occurrence of a given task $t$ in some or every process execution of a process model. These two basic predicates are based on pattern Existence: (1) posoccur: there exist some executions of the process model in which at least one instance of $t$ occurs; (2) alwoccur: in every execution of the process model, at least one instance of $t$ occurs.

The next two basic predicates, exclusive and concur, capture the exclusive and concurrent relationships between two tasks $t_1$ and $t_2$, respectively. Note that these two basic predicates do not assume that an instance of $t_1$ and $t_2$ should eventually occur: (3) exclusive: $t_1$ and $t_2$ are both executable tasks (i.e., not dead tasks) of the process model, and in every process execution it is never possible that an instance of $t_1$ and an instance of $t_2$ both occur; (4) concur: $t_1$ and $t_2$ are both executable tasks of the process model, $t_1$ and $t_2$ are not causally related, and in every execution, if an instance of $t_1$ occurs, then an instance of $t_2$ occurs and vice versa.

The last two basic predicates, alwpred and pospred, capture the basic relationship Precedence between two tasks $t_1$ and $t_2$ in every or some process executions of a given process model, respectively: (5) alwpred: in every process execution of the process model, an instance of $t_1$ occurs before an instance of $t_2$; (6) pospred: there exist some process executions of the process model in which an instance of $t_1$ occurs before an instance of $t_2$.

Finally, we define ASBPQL by BNF grammar. A Query in ASBPQL is a Condition. The result of the Query is those process models that satisfy the Condition. A Condition can consist of with the intended semantics what the basic predicate specifies, a with the intended semantics what the basic predicate specifies, and a , with the intended semantics that all process models satisfying that particular relation between tasks must be retrieved, or it can be recursively defined as a binary or unary Condition through the application of logical operators, that is, or . Specifically, a disjunction retrieves the union of the process models of the conditions involved, while a conjunction retrieves the intersection. The negation of a condition retrieves the process models that do not satisfy the condition. A task can be defined as its label which is a string as follows:

Using ASBPQL, we can capture the semantics-based compliance rules in which we are interested, including the relationships between tasks and the occurrence of tasks in some or every process execution. For example, the rule “A” pospred “B” and “B” alwpred “C” means that we want to search for all process models in which task A occurs before task B in some process execution and task B occurs before task C in every process execution.
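
The intended meaning of the predicates can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the process executions of a model are already available as explicit lists of task labels; the function names mirror the predicate names, while `precedes`, `query`, and the two toy models are hypothetical. The predicate concur is omitted because it additionally needs the causal relation between tasks.

```python
from typing import Sequence

Trace = Sequence[str]  # one process execution as an ordered sequence of task labels


def precedes(trace: Trace, a: str, b: str) -> bool:
    """True if some instance of `a` occurs strictly before some instance of `b`."""
    seen_a = False
    for task in trace:
        if seen_a and task == b:
            return True
        if task == a:
            seen_a = True
    return False


def posoccur(traces: Sequence[Trace], t: str) -> bool:
    return any(t in trace for trace in traces)


def alwoccur(traces: Sequence[Trace], t: str) -> bool:
    return all(t in trace for trace in traces)


def pospred(traces: Sequence[Trace], a: str, b: str) -> bool:
    return any(precedes(trace, a, b) for trace in traces)


def alwpred(traces: Sequence[Trace], a: str, b: str) -> bool:
    return all(precedes(trace, a, b) for trace in traces)


def exclusive(traces: Sequence[Trace], a: str, b: str) -> bool:
    # Both tasks must be executable, but never occur in the same execution.
    both_live = posoccur(traces, a) and posoccur(traces, b)
    return both_live and not any(a in tr and b in tr for tr in traces)


# Toy repository: each model is given by its (hypothetical) set of executions.
repository = {
    "model_1": [["A", "B", "C"], ["B", "C"]],
    "model_2": [["A", "B", "C"], ["A", "C"]],
}


def query(traces: Sequence[Trace]) -> bool:
    # "A" pospred "B" and "B" alwpred "C"
    return pospred(traces, "A", "B") and alwpred(traces, "B", "C")


matches = [name for name, traces in repository.items() if query(traces)]
```

Here `model_2` is rejected because one of its executions contains no occurrence of B before C, which is exactly the distinction between “some” and “every” execution.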

3. Petri Nets and Unfoldings

In this section, we discuss the basic concepts of Petri nets and unfolding on which we base our work. For more details, readers can refer to [8] for an in-depth introduction to Petri nets and to [6, 7, 9, 10] for unfolding and its related definitions.

3.1. Petri Nets

Petri nets are a formal notation system which can be employed to specify workflow systems (see, e.g., [11, 12]). Petri nets are also used as a formal foundation for defining the semantics of other process modeling languages or for reasoning about process models specified in these languages, for example, BPMN [13], BPEL [14, 15], and EPCs [16]. A formal definition of Petri nets is presented as follows.

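Definition 8 (Petri nets). A Petri net is a tuple $N = (P, T, F)$, where (i) $P$ is a finite set of places; (ii) $T$ is a finite set of transitions, with $P \cap T = \emptyset$ and $\forall t \in T: {}^\bullet t \neq \emptyset \wedge t^\bullet \neq \emptyset$; (iii) $F \subseteq (P \times T) \cup (T \times P)$ is a finite set of directed arcs representing the flow relation, connecting transitions and places together.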

The conditions that the sets of places and transitions should be finite and that every transition has at least one input place and at least one output place derive from [7]. For notational convenience we adopt a commonly used notation, where ${}^\bullet x$ represents all the inputs of a node $x$ (which can be a place or a transition) and $x^\bullet$ captures all its outputs.

Next, a labeled Petri net is basically a Petri net with annotated transitions and the annotation does not affect the semantics of the net.

Definition 9 (labeled Petri nets). A labeled Petri net is a tuple $(P, T, F, A, L)$, where (i) $(P, T, F)$ is a Petri net; (ii) $A$ is a finite set of task names; (iii) $L: T \to A \cup \{\tau\}$ is a label mapping function for the transitions, where $\tau$ is a silent action (i.e., an action not visible to the outside world).

A marking of a Petri net is an assignment of tokens to its places. A marking represents a state of the net, and a transition, if enabled, may change a marking into another marking by firing, thus capturing a state change.

Definition 10 (marking, enabling, and firing of a transition). Let $N = (P, T, F)$ be a Petri net. (i) A marking $M$ of $N$ is a mapping $M: P \to \mathbb{N}$. A marking may be represented as a collection of pairs of places and token counts or as a vector (in that case we drop places that do not have any tokens assigned to them). A labeled Petri net system is a labeled Petri net with an initial marking, usually represented as $M_0$. (ii) Markings can be compared with each other: $M_1 \geq M_2$ if and only if for all $p \in P$, $M_1(p) \geq M_2(p)$. Similarly, one can define $>$, $<$, $\leq$, $=$. (iii) A transition $t$ is enabled in a marking $M$, denoted as $M \xrightarrow{t}$, if and only if the following holds: $M \geq {}^\bullet t$. (iv) A transition $t$ that is enabled in a marking $M$ may fire and change marking $M$ into $M' = (M - {}^\bullet t) + t^\bullet$. This is denoted as $M \xrightarrow{t} M'$.

The markings of a Petri net system and the transition relation between these markings constitute a state space. In this paper we consider $n$-bounded Petri net systems (noting that such systems are always finite), which is necessary for the application of unfoldings.

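Definition 11 (reachability and boundedness). Let $\Sigma = (N, M_0)$ be a Petri net system. (i) A marking $M$ is called reachable if a transition sequence $\sigma = t_1 t_2 \ldots t_k$ exists such that $M_0 \xrightarrow{t_1} M_1 \xrightarrow{t_2} \cdots \xrightarrow{t_k} M$, which may also be denoted as $M_0 \xrightarrow{\sigma} M$ or, if the choice of $\sigma$ does not really matter, $M_0 \xrightarrow{*} M$. (ii) $\Sigma$ is called a finite Petri net system if and only if its set of reachable markings is finite. (iii) $\Sigma$ is called $n$-bounded if and only if for every reachable marking $M$ and every place $p \in P$: $M(p) \leq n$.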
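
The notions of enabling, firing, reachability, and boundedness can be made concrete with a small Python sketch. It is an illustration only, not the tool described in this paper: a net is assumed to be given by two hypothetical dictionaries `PRE` and `POST` that map each transition to its input and output places, and the reachable markings are enumerated naively, which is exactly the exploration that becomes intractable for highly concurrent nets.

```python
from collections import deque
from typing import Dict, Set, Tuple

Marking = Dict[str, int]                   # place -> number of tokens
Frozen = Tuple[Tuple[str, int], ...]       # hashable form, empty places dropped


def freeze(m: Marking) -> Frozen:
    return tuple(sorted((p, k) for p, k in m.items() if k > 0))


def enabled(pre: Dict[str, Set[str]], m: Marking, t: str) -> bool:
    # t is enabled if every input place holds at least one token.
    return all(m.get(p, 0) >= 1 for p in pre[t])


def fire(pre: Dict[str, Set[str]], post: Dict[str, Set[str]],
         m: Marking, t: str) -> Marking:
    if not enabled(pre, m, t):
        raise ValueError(f"transition {t} is not enabled")
    new = dict(m)
    for p in pre[t]:
        new[p] -= 1                        # consume one token per input place
    for p in post[t]:
        new[p] = new.get(p, 0) + 1         # produce one token per output place
    return new


def reachable_markings(pre, post, m0: Marking, limit: int = 10_000) -> Set[Frozen]:
    """Breadth-first enumeration of the state space (aborts beyond `limit`)."""
    seen = {freeze(m0)}
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for t in pre:
            if enabled(pre, m, t):
                nxt = fire(pre, post, m, t)
                key = freeze(nxt)
                if key not in seen:
                    if len(seen) >= limit:
                        raise RuntimeError("state space too large or unbounded")
                    seen.add(key)
                    queue.append(nxt)
    return seen


def is_bounded(pre, post, m0: Marking, n: int) -> bool:
    return all(k <= n for m in reachable_markings(pre, post, m0) for _, k in m)


# Toy net: t1 forks p0 into p1 and p2; t2 and t3 then run concurrently.
PRE = {"t1": {"p0"}, "t2": {"p1"}, "t3": {"p2"}}
POST = {"t1": {"p1", "p2"}, "t2": {"p3"}, "t3": {"p4"}}
M0 = {"p0": 1}
```

Even in this toy net the two concurrent transitions yield both interleavings, so five markings are reachable from only three transitions.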

3.2. Unfolding

It is well known that Petri nets may suffer from the state space explosion problem [17]. As such, a naive exploration of the state space, especially in the context of a Petri net which allows highly concurrent behaviour, may not be tractable. In order to deal with this, McMillan [6] proposed a state space search technique based on the use of unfolding (this technique was later improved by Esparza et al. [7] and is discussed in the next subsection). Unfoldings are applied to $n$-bounded (called $n$-safe in [7]) Petri net systems and provide a method of searching the state space of concurrent systems without considering all possible interleavings of concurrent events. The concept of unfolding was first introduced by Nielsen et al. [9] and later elaborated upon by Engelfriet [10] using the term branching processes. In the following we introduce the necessary concepts and notations to make this paper self-contained and to be able to build upon this theory. Most of these definitions are adopted from [7].

Firstly, various types of relationship may hold between pairs of nodes in a Petri net.

Definition 12 (node relations (based on [7])). Let $N = (P, T, F)$ be a Petri net. (i) $F^{+}$ is the irreflexive transitive closure of $F$, while $F^{*}$ is its reflexive transitive closure. The partial orders defined by these closures are denoted as $<$ and $\leq$, respectively. Hence, for example, $x < y$ if and only if $(x, y) \in F^{+}$, and we say that $x$ causally precedes $y$. (ii) If $x < y$ or $y < x$, then $x$ and $y$ are causally related. (iii) Nodes $x$ and $y$ are in conflict, denoted by $x \# y$, if and only if there exist distinct transitions $t_1, t_2 \in T$ such that ${}^\bullet t_1 \cap {}^\bullet t_2 \neq \emptyset$, $t_1 \leq x$, and $t_2 \leq y$. A node $x$ is in self-conflict if and only if $x \# x$. (iv) Nodes $x$ and $y$ are concurrent, denoted as $x$ co $y$, if and only if $x$ and $y$ are neither causally related nor in conflict.
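
The three node relations can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in this paper: the class name `NodeRelations` and its methods are hypothetical, the net is assumed to be given as a set of arcs plus the set of transition names, and the relations are computed by plain graph search without any caching.

```python
from typing import Dict, Iterable, Set, Tuple


class NodeRelations:
    def __init__(self, arcs: Iterable[Tuple[str, str]], transitions: Iterable[str]):
        self.succ: Dict[str, Set[str]] = {}
        self.pred: Dict[str, Set[str]] = {}
        for x, y in arcs:
            self.succ.setdefault(x, set()).add(y)
            self.pred.setdefault(y, set()).add(x)
        self.transitions = set(transitions)

    def precedes(self, x: str, y: str) -> bool:
        """x < y: there is a non-empty directed path from x to y."""
        seen: Set[str] = set()
        stack = list(self.succ.get(x, ()))
        while stack:
            node = stack.pop()
            if node == y:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.succ.get(node, ()))
        return False

    def leq(self, x: str, y: str) -> bool:
        return x == y or self.precedes(x, y)

    def causally_related(self, x: str, y: str) -> bool:
        return self.precedes(x, y) or self.precedes(y, x)

    def in_conflict(self, x: str, y: str) -> bool:
        # Two distinct transitions sharing an input place, one leading to x
        # and the other leading to y.
        for t1 in self.transitions:
            for t2 in self.transitions:
                shared = self.pred.get(t1, set()) & self.pred.get(t2, set())
                if t1 != t2 and shared and self.leq(t1, x) and self.leq(t2, y):
                    return True
        return False

    def concurrent(self, x: str, y: str) -> bool:
        return not self.causally_related(x, y) and not self.in_conflict(x, y)


# Toy net: e1 and e2 compete for condition c0, while e3 is independent of both.
rel = NodeRelations(
    arcs=[("c0", "e1"), ("e1", "c1"), ("c0", "e2"), ("e2", "c2"),
          ("c3", "e3"), ("e3", "c4")],
    transitions=["e1", "e2", "e3"],
)
```

In the toy net, `c1` and `c2` inherit the conflict between `e1` and `e2`, whereas `e1` and `e3` are concurrent.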

The unfolding of a Petri net is an occurrence net, usually infinite but with a simple, acyclic structure.

Definition 13 (occurrence net (based on [7])). An occurrence net is a net $O = (C, E, F)$, where (i) $C$ is a set of conditions; (ii) $E$ is a set of events, with $C \cap E = \emptyset$; (iii) $F \subseteq (C \times E) \cup (E \times C)$ such that (1) for all $c \in C$, $|{}^\bullet c| \leq 1$, (2) $F$ is acyclic; that is, $F^{+}$ is a strict partial order, and (3) for all $x \in C \cup E$ the set of nodes $y$ for which $y < x$ is finite; (iv) no node is in self-conflict; that is, for all $x \in C \cup E$, $\neg(x \# x)$.

We also adopt the notion of $\mathit{Min}(O)$, as in [7], to denote the set of minimal elements of $C \cup E$ with respect to the strict partial order $<$. As for transitions in Petri nets, we only consider events that have at least one input and at least one output condition. The minimal elements are therefore conditions only and can intuitively be seen as an initial marking of the net.

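Definition 14 (branching process (based on [10])). A branching process of a Petri net system $\Sigma = (N, M_0)$, with $N = (P, T, F)$, is a pair $\beta = (O, h)$, where (i) $O = (C, E, F')$ is an occurrence net; (ii) $h$ is a homomorphism which, following [10], means that (a) $h: C \cup E \to P \cup T$; (b) $h(C) \subseteq P$ and $h(E) \subseteq T$; that is, conditions are mapped to places and events to transitions; (c) for every $e \in E$, $h$ restricted to ${}^\bullet e$ is a bijection between ${}^\bullet e$ and ${}^\bullet h(e)$, and $h$ restricted to $e^\bullet$ is a bijection between $e^\bullet$ and $h(e)^\bullet$; (d) $h$ restricted to $\mathit{Min}(O)$ is a bijection between $\mathit{Min}(O)$ and $M_0$; (iii) for all $e_1, e_2 \in E$, if ${}^\bullet e_1 = {}^\bullet e_2$ and $h(e_1) = h(e_2)$, then $e_1 = e_2$.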

Note that the definition allows for infinite branching processes. In [10] it is shown that, up to isomorphism, every net system has a unique maximal branching process. For a net system $\Sigma$, this unique process is referred to as the unfolding of $\Sigma$ and is denoted as $\mathit{Unf}$. For example, in Figure 5 the Petri net in (a) can be unfolded into the occurrence net in (b). Note that in Figure 5(b) all (condition/event) nodes are identified by integers and annotated by the corresponding place or transition identifiers in Figure 5(a).

3.3. Complete Finite Prefix

The unfolding of a Petri net is infinite when the net is cyclic, as, for example, in Figure 5(b). In [6], McMillan proposed an algorithm for the construction of a so-called truncated unfolding, which is a finite initial part of an unfolding and contains as much reachability information as the unfolding itself but may be much larger than necessary. In [7], Esparza et al. referred to this truncated unfolding as a complete finite prefix (CFP) and proposed an improved algorithm for computing a minimal CFP. For example, as illustrated in Figure 5(c) (the dashed arcs should be ignored for the moment), $\mathit{Fin}$ is a minimal CFP of $\mathit{Unf}$. Note that in Figure 5(c) the tuple of conditions positioned next to an event node represents the marking of the net upon the occurrence of that event.

The main theoretical notions required to understand the concepts of a CFP are that of configuration and local configuration of events. Firstly, a configuration represents a possible partially ordered run of the net.

Definition 15 (configuration [7]). A configuration $C$ of an occurrence net is a set of events, that is, $C \subseteq E$, satisfying the following two conditions: (i) $C$ is causally downward closed, that is, $e \in C \wedge e' \leq e \Rightarrow e' \in C$; (ii) $C$ is conflict free, that is, $\neg(e \# e')$ for all $e, e' \in C$.

Given a configuration $C$, the set of places marked after firing its events represents a reachable marking, which is denoted by $\mathit{Mark}(C)$. In other words, $\mathit{Mark}(C)$ is the marking reached by firing the configuration $C$. For example, in the unfolding Unf in Figure 5(b) we have .

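Definition 16 (cut [7]). Let $\Sigma$ be a Petri net system, and let $\mathit{Unf} = (O, h)$ be its unfolding. The set of conditions associated with a configuration $C$ of $\mathit{Unf}$ is called a cut and is defined as $\mathit{Cut}(C) = (\mathit{Min}(O) \cup C^\bullet) \setminus {}^\bullet C$. A cut uniquely defines a reachable marking in $\Sigma$: $\mathit{Mark}(C) = h(\mathit{Cut}(C))$.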

The concepts thus far can be used to introduce the unfolding algorithm. In [7] a branching process of a Petri net system is specified as a collection of nodes. These nodes are either conditions or events. A condition is a pair where is the input event of , while an event is a pair where is a transition and is its input conditions. A set of conditions of a branching process is a coset if its elements are pairwise in corelation. For example, in Figure 5(b) each of the node sets , , , , , and is a coset.

During the process of unfolding, the collection of nodes increases, and the function $\mathit{PE}$ (which denotes the possible extensions) is applied to determine the nodes to be added. The possible extensions are given in the form of event pairs $(t, X)$, where $X$ is a coset of conditions of the branching process $\beta$ and $t$ is a transition of $\Sigma$ such that (1) $h(X) = {}^\bullet t$, and (2) no event $e$ exists for which $h(e) = t$ and ${}^\bullet e = X$. In the unfolding algorithm, nodes from the set of possible extensions are added to the unfolding of the net until this set is empty (i.e., there are no more extensions).

In the complete finite prefix approach, it is observed that a finite prefix of an unfolding may contain all reachability-related information. The key to obtaining a CFP is to identify those events at which we can cease unfolding (e.g., events 12, 41, and 42 in $\mathit{Fin}$ in Figure 5(c)) without loss of reachability information. Such events are referred to as cut-off events, and they are defined in terms of an adequate order on configurations.

Definition 17 (adequate order [7]). Let $\Sigma$ be a Petri net system, and let $\lhd$ be a partial order on the finite configurations of one of its branching processes; then $\lhd$ is an adequate order if and only if (i) $\lhd$ is well founded; (ii) for all configurations $C_1$ and $C_2$, $C_1 \subset C_2 \Rightarrow C_1 \lhd C_2$; (iii) the order $\lhd$ is preserved in the context of finite extensions; that is, if $C_1 \lhd C_2$ and $\mathit{Mark}(C_1) = \mathit{Mark}(C_2)$, then if we extend $C_1$ with $E$ to $C_1'$, and we extend $C_2$ to $C_2'$ by using an extension isomorphic to $E$, then $C_1' \lhd C_2'$.

The last clause of this definition is not fully formalised here as it requires a certain amount of formalism, and we hope that the idea is sufficiently clear from an intuitive point of view. We refer the reader to [7] for a complete formal definition of this notion. Note that, as pointed out in [7], the order is essentially a parameter to the approach.

The concept of local configuration captures the idea of all preceding events to an event such that these events form a configuration.

Definition 18 (local configuration [7]). Let $O = (C, E, F)$ be an occurrence net; the local configuration of an event $e \in E$, denoted $[e]$, is the set of events $e'$, where $e' \in E$, such that $e' \leq e$.

Definition 19 (cut-off event [7]). Let $\Sigma$ be a Petri net system, let $\beta$ be one of its branching processes, and let $\lhd$ be an adequate order on the configurations of $\beta$; then an event $e$ is a cut-off event if and only if $\beta$ contains a local configuration $[e']$ for which $\mathit{Mark}([e]) = \mathit{Mark}([e'])$ and $[e'] \lhd [e]$.
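
The interplay of local configurations, markings, and cut-off events can be illustrated with the following Python sketch. It rests on several assumptions that are not taken from this paper: the prefix is stored in a hypothetical `Prefix` record, configurations are compared by their size (the order used in McMillan's original algorithm [6], which is one possible adequate order), and the comparison of an event with the initial marking is not handled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class Prefix:
    pre: Dict[str, Set[str]]    # event -> input conditions
    post: Dict[str, Set[str]]   # event -> output conditions
    initial: Set[str]           # minimal conditions, i.e. Min(O)
    place_of: Dict[str, str]    # homomorphism h restricted to conditions

    def producer(self, condition: str) -> Optional[str]:
        for event, outputs in self.post.items():
            if condition in outputs:
                return event
        return None             # initial conditions have no input event

    def local_configuration(self, event: str) -> Set[str]:
        """[e]: the event itself plus all events that causally precede it."""
        config: Set[str] = set()
        stack = [event]
        while stack:
            e = stack.pop()
            if e in config:
                continue
            config.add(e)
            for c in self.pre[e]:
                p = self.producer(c)
                if p is not None:
                    stack.append(p)
        return config

    def mark(self, config: Set[str]) -> Tuple[str, ...]:
        """Mark(C) = h((Min(O) | C's outputs) minus C's inputs), as a sorted tuple."""
        produced = set(self.initial)
        consumed: Set[str] = set()
        for e in config:
            produced |= self.post[e]
            consumed |= self.pre[e]
        return tuple(sorted(self.place_of[c] for c in produced - consumed))

    def is_cutoff(self, event: str) -> bool:
        mine = self.local_configuration(event)
        target = self.mark(mine)
        for other in self.pre:
            if other == event:
                continue
            theirs = self.local_configuration(other)
            # Size-based order: a smaller local configuration comes first.
            if len(theirs) < len(mine) and self.mark(theirs) == target:
                return True
        return False


# Unfolding fragment of the cyclic net p0 -t1-> p1 -t2-> p0.
prefix = Prefix(
    pre={"e1": {"c0"}, "e2": {"c1"}, "e3": {"c2"}},
    post={"e1": {"c1"}, "e2": {"c2"}, "e3": {"c3"}},
    initial={"c0"},
    place_of={"c0": "p0", "c1": "p1", "c2": "p0", "c3": "p1"},
)
```

In this fragment, event `e3` reaches the same marking as the earlier event `e1`, so unfolding can stop at `e3`.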

Without loss of reachability information, we can cease unfolding from an event $e$ if $e$ takes the net to a marking which can be caused by some other, earlier event $e'$. So in Figure 5(c), we remove the part after event 12 from $\mathit{Unf}$ because it is isomorphic to that after event 11; that is, the continuation after event 12 is essentially the same as the continuation after event 11. For a proof of this approach we refer to [7].

4. Evaluation

In this section, we demonstrate how the basic predicates introduced in Section 2 can be derived for Petri nets based on the process executions extracted from CFPs.

4.1. Annotating Complete Finite Prefix

In this work, the repository of process models is captured in terms of CFPs. All predicates between tasks are determined by examining the possible firing sequences in the CFP of each process model. To facilitate our algorithms for determining these predicates (presented in the next subsection), we would like to represent the continuation from cut-off events slightly more explicitly in a CFP. The idea is that for each of the cut-off events $e$ in a CFP we mark out some other, earlier event $e'$ that can lead to the same marking as $e$ (i.e., $\mathit{Mark}([e']) = \mathit{Mark}([e])$ and $[e'] \lhd [e]$). We refer to $e'$ as the continuation event of $e$ in the CFP. We then annotate the CFP with links that connect each cut-off event to its continuation event.

Definition 20 (notations of continuation events and cut-off events). Let be a Petri net system, with , and let , with , be an unfolding of ; then we define the following:(i) for any reachable marking of . If is clear from the context, we will simply omit it and write (a similar convention holds for the remainder of this definition, and is not introduced explicitly anymore);(ii)continuation which refers to the continuation node in for a reachable marking . It is defined as the unique event such that for all , if then ;(iii)cutoff continuation which denotes the set of cut-off events for a reachable marking .

Definition 21 (annotated complete finite prefix). Let be a Petri net system, and denotes a CFP of that is annotated with links from cut-off events to their continuation events, shortly referred to as an annotated CFP: , where(i) is the CFP of ; (ii) is a set of links defined as , and if and only if , then there is a reachable marking such that   continuation and .

Example 22. Consider as shown in Figure 5(c). For this annotated CFP, .

To generate an annotated CFP, we propose a slight adaptation of the algorithm for computing a CFP for a -safe net system in [7]. This adapted algorithm is presented as Algorithm 1. Based on Definition 21, the data structure for the representation of an annotated CFP comprises that of a CFP in [7] (written ) and a set of links (written ). is the set of events that can be added to a branching process (i.e., possible extensions of ), as defined in [7]. Application of yields an event which satisfies the following condition taken from [7]: and is minimal with respect to . The predicate is an abbreviation of , the condition used in [7]. Next, returns the result of whether or not is a cut-off event of (as in [7]), and during its application, the corresponding continuation event for is returned in the local variable , so that it does not need to be determined again when adding links. Note that we use as an abbreviation for and for .

Input: An -safe Petri net system
Output: Fin an annotated CFP of
begin
Fin ;
Fin ;
(Fin );
;
while     do
   );
  If   ( )  then
   Fin ;
    Fin );
   if   , Fin ,   then
     ;
    Fin ;
  else  

4.2. Determining the Basic Predicates

In Section 2, we defined a set of six basic predicates based on process execution semantics. Checking whether such a predicate holds requires, in principle, exploring all process executions. Since different process executions result from choices in a process model, we propose to preprocess the annotated CFP of each process model (Algorithm 2) as follows: first, we transform the CFP into a set of conflict-free CFPs (specified by function GetAllExecutions in Algorithm 3); then, we convert each resulting CFP into a directed bipartite graph, or bigraph (specified by AnnotatedCFP2Bigraph in Algorithm 5).

function  
Input: An annotated CFP where and
Output: A set of bigraphs
begin
;
:= GetAllExecutions( );
for     do
   A

function  
Input: An annotated CFP where and
Output: A set of annotated CFPs
begin
;
;
 /* compute CFPs from each of the co-sets of leaf conditions */
 CS:= GetLeafCondCoSets( );
for     do
   ComputeCFP ;
 /* generate annotated CFPs from the above (conflict-free) CFPs */
:= ;
repeat
  Select ;
   ;
   := GetCutoffEvents( );
   := FALSE; /* the flag changes to TRUE if there are CFP updates */
  while     do
   Select ;
    := GetContinuationEvent( );
   if     then
     ;
   else
     := GetUpdatedCFPs( ); /* see Algorithm 4 */
     ;
     ;
     := GetLinks_to ;
    for           do
     
     ;
     ;
     := TRUE; /* set the flag to TRUE upon CFP updates */
     ; /* add to the remaining CFPs for link annotations */
    ;
  if   then
    ;
  
until   ;

In Algorithm 3, GetLeafCondCoSets yields all co-sets of leaf conditions in the input CFP. By traversing the input CFP backwards (without considering the set of links) from each of these co-sets, ComputeCFPs produces a set of CFPs as a decomposition of the input CFP. These CFPs are free of conflicts because the leaf conditions in each co-set are pairwise in the co relation. For illustration, Figure 6 depicts the set of conflict-free CFPs obtained by decomposing the annotated CFP in Figure 5(c) via GetLeafCondCoSets and ComputeCFPs.

Next, we convert the link annotations of the input CFP into link annotations for each of the conflict-free CFPs that result from the above decomposition. If such a CFP does not contain a cut-off event, it has no link annotation and remains as it is. Otherwise, for a CFP with cut-off events, there are two cases to consider, depending on whether a cut-off event in the CFP links to a continuation event within or outside this CFP. If the CFP contains both events, the link is added directly to the link annotations of the CFP. If the CFP contains the cut-off event but not the continuation event, we update the CFP (specified by function GetUpdatedCFPs in Algorithm 4) and the link annotations until no link crosses two different CFPs.

function  
Input: A CFP , a set of CFPs , a (cut-off) event , a (continuation) event
Output: A set of (updated) CFPs
begin
;
 /* get   ready by removing the successor conditions of (in ) */
:= iSuccessors( );
;
;
 /* retrieve and process the CFPs that contain   in    */
 for     do
  /* remove from   the part before ,   itself, and the outgoing edges of  */
     GetSubCFP_to ;
   ;
   ;
     iSuccessors ;
  /* connect the above (updated)   and   to   */
   ;
   ;
     InitialConditions ;
  

Function  
Input: An annotated CFP where and
Output: A directed bigraph = ( : condition nodes, : event nodes, : directed edges)
begin
;
;
;
for   do
     iSuccessors ;
     iSuccessors ;
   ;
   ;
   ;

Algorithm 4 specifies how to update a CFP in which a cut-off event links to a continuation event outside the CFP. The basic idea is to identify, among the set of conflict-free CFPs, those that contain the continuation event, and to replace the part before and including the continuation event in each such CFP with the part before and including the cut-off event in the original CFP. This results in as many updated CFPs as there are CFPs containing the continuation event. Since the continuation event is replaced by the cut-off event in the updated CFPs and is no longer used, the link annotations need to be updated as well.

Back in Algorithm 3, we retrieve the links that lead to the continuation event, other than the link from the cut-off event being processed, and replace the continuation event with the cut-off event in these links. Accordingly, the flag is set to TRUE, signaling that CFP updates have been applied, and the updated CFPs are added to the set of remaining CFPs for processing of link annotations. For a given CFP, if all its cut-off events are processed without CFP updates (i.e., the flag remains FALSE), the set of links computed from this processing is added as the CFP’s link annotations. This procedure for converting link annotations is repeated until no CFPs remain. For illustration, Figure 7 depicts the set of conflict-free annotated CFPs obtained by decomposing the annotated CFP in Figure 5(c) with Algorithm 3. Note that Figures 7(d)-7(f) show the three updated CFPs that result from combining the part before and including cut-off event 12 in the CFP in Figure 6(d) with the part after continuation event 11 in each of the CFPs in Figures 6(a)-6(c), respectively, and then replacing continuation event 11 with event 12 in the corresponding CFPs.

Finally, Algorithm 5 specifies how to convert an annotated CFP into a directed bigraph. The transformation is straightforward: the events in the CFP become event nodes in the bigraph, the conditions become condition nodes, the arcs become directed edges, and each link is converted into edges leading from the cut-off event to each of the immediate successors (conditions) of the corresponding continuation event. For illustration, Figure 8 depicts an example of converting an annotated CFP to a directed bigraph.
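
The conversion just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the BeehiveZ implementation: the class AnnotatedCFP, its field names, and the function annotated_cfp_to_bigraph are hypothetical, and nodes are represented by plain strings.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedCFP:
    """Hypothetical container for an annotated CFP (names are assumptions)."""
    conditions: set  # condition nodes
    events: set      # event nodes
    arcs: set        # (source, target) pairs between conditions and events
    links: set       # (cut-off event, continuation event) pairs


def annotated_cfp_to_bigraph(cfp):
    """Return (condition_nodes, event_nodes, edges) of the directed bigraph."""
    edges = set(cfp.arcs)  # every arc becomes a directed edge
    for cutoff, continuation in cfp.links:
        # A link becomes edges from the cut-off event to each immediate
        # successor (condition) of its continuation event.
        for source, target in cfp.arcs:
            if source == continuation:
                edges.add((cutoff, target))
    return set(cfp.conditions), set(cfp.events), edges


cfp = AnnotatedCFP(
    conditions={"c0", "c1", "c2"},
    events={"e1", "e2"},
    arcs={("c0", "e1"), ("e1", "c1"), ("c1", "e2"), ("e2", "c2")},
    links={("e2", "e1")},  # e2 is a cut-off event whose continuation is e1
)
condition_nodes, event_nodes, edges = annotated_cfp_to_bigraph(cfp)
```

In this toy input, the link from e2 to e1 adds the edge (e2, c1), which closes a cycle in the resulting bigraph.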

During preprocessing, we first generate a CFP from a Petri net, and then we extract one or more bigraphs from the CFP. As we only add link information in an annotated CFP, the complexity of the adapted CFP generation algorithm (cf. Algorithm 1) is the same as that of the original CFP algorithm, which is exponential in the number of arcs of the Petri net [7]. The complexity of generating a bigraph from a CFP (cf. Algorithm 2) is linear in the size of the CFP, since the latter is traversed depth-first in reverse order (i.e., starting from a leaf condition).

Now we define the algorithms for determining the 6 basic predicates. First, we introduce two common functions: RetrieveBigraphs which returns the set of bigraphs for a process model ( ) from the above preprocessing, and RetrieveAllEvents which returns the set of event nodes for (i.e., labeled with) a task ( ) in a bigraph ( ). Each such bigraph represents a possible execution of the corresponding process, and each event node labeled with a task identifier in the bigraph captures an occurrence of the corresponding task in that process execution. For a short notation, an event node labeled with task is hereafter referred to as an -event node.

Algorithms 6 and 7 specify how to evaluate the two unary predicates. Predicates posoccur or alwoccur of task in process model can be determined by checking the presence of a -event node in any or all bigraphs of . Based on the fact that the set of bigraphs of process model is each free of choices, the exclusive relation between two tasks and is determined by checking in every bigraph of if there are both a -event node and a -event node, as specified in Algorithm 8. In Algorithm 9, the concur relation between and in holds if and only if in each bigraph of either (1) there are no - and -event nodes at all, or (2) there are both an -event node and an -event node, and no directed path exists between the two nodes.

function  POSOCCUR
Input: A taskID , a process model
Output: A boolean value
begin
  RetrieveBigraphs ;
 return (RetrieveAllEvents

function  ALWOCCUR
Input: A taskID , a process model
Output: A boolean value
begin
  RetrieveBigraphs ;
 return (RetrieveAllEvents

function  EXCLUSIVE
Input: Two taskID and , a process model
Output: A boolean value
begin
  RetrieveBigraphs ;
 return (RetrieveAllEvents RetrieveAllEvents

function  CONCUR
Input: Two taskID and , a process model
Output: A boolean value
begin
  RetrieveBigraphs ;
 return (RetrieveAllEvents   RetrieveAllEvents
      RetrieveAllEvents   RetrieveAllEvents
      NoDirectedPath   NoDirectedPath ))
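
As a rough illustration of Algorithms 6 to 9, the following Python sketch evaluates the four predicates over a list of bigraphs. The Bigraph class and the helper names events_of and has_path are hypothetical. Two points are assumptions rather than statements from the algorithms: exclusive is taken to hold when no bigraph contains event nodes of both tasks, and concur requires that no directed path exists in either direction between any pair of matching event nodes.

```python
from dataclasses import dataclass


@dataclass
class Bigraph:
    """Hypothetical bigraph: adjacency map plus task labels of event nodes."""
    succ: dict   # node -> set of immediate successor nodes
    label: dict  # event node -> task identifier


def events_of(task, graph):
    """Event nodes labeled with the given task (cf. RetrieveAllEvents)."""
    return {node for node, name in graph.label.items() if name == task}


def has_path(graph, source, target):
    """True if a non-empty directed path leads from source to target."""
    stack, seen = list(graph.succ.get(source, ())), set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.succ.get(node, ()))
    return False


def posoccur(task, bigraphs):
    return any(events_of(task, g) for g in bigraphs)


def alwoccur(task, bigraphs):
    return all(events_of(task, g) for g in bigraphs)


def exclusive(t1, t2, bigraphs):
    # Assumption: no single execution contains both tasks.
    return not any(events_of(t1, g) and events_of(t2, g) for g in bigraphs)


def concur(t1, t2, bigraphs):
    for g in bigraphs:
        a, b = events_of(t1, g), events_of(t2, g)
        if not a and not b:
            continue          # case (1): neither task occurs
        if not a or not b:
            return False      # only one of the two tasks occurs
        if any(has_path(g, x, y) or has_path(g, y, x) for x in a for y in b):
            return False      # case (2) violated: the events are ordered
    return True


g1 = Bigraph(succ={"a1": {"c1"}, "c1": {"b1"}}, label={"a1": "A", "b1": "B"})
g2 = Bigraph(succ={"a2": {"c2"}, "x2": {"c3"}}, label={"a2": "A", "x2": "C"})
```

Here g1 and g2 stand for two executions of one toy model: task A occurs in both, B only in the first, and C only in the second, unordered with respect to A.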

Next, the remaining algorithms are defined for the basic predicates capturing causal relationships between tasks. Evaluation of each such predicate is based on the result of evaluating the corresponding intermediate predicate in individual process executions. Given a process model, predicate alwpred holds only when its intermediate predicate (i.e., Pred) holds in all process executions of the model, while predicate pospred holds as long as its intermediate predicate (i.e., Pred) holds in at least one process execution of the model. To capture these semantics, the algorithms combine the results of the intermediate predicate over the set of bigraphs of the model using logical conjunction (for predicate alwpred) or logical disjunction (for predicate pospred). Algorithm 10 specifies the evaluation of predicate alwpred, and Algorithm 11 specifies the evaluation of pospred.

function  ALWPRED
Input: Two taskID and , a process model
Output: A boolean value
begin
  RetrieveBigraphs ;
 return

function  POSPRED
Input: Two taskID and , a process model
Output: A boolean value
begin
  RetrieveBigraphs ;
 return
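
The relationship between the two predicates and their intermediate predicate can be sketched in Python as follows. The function names and the signature of the intermediate predicate are hypothetical; the predicate is passed in as a parameter so that the sketch does not depend on how Pred is evaluated. The toy predicate used below, which treats an execution as a plain list of task names, is only a stand-in for illustration.

```python
def alwpred(t1, t2, bigraphs, pred):
    """Conjunction: the intermediate predicate must hold in every execution."""
    return all(pred(t1, t2, g) for g in bigraphs)


def pospred(t1, t2, bigraphs, pred):
    """Disjunction: the intermediate predicate must hold in some execution."""
    return any(pred(t1, t2, g) for g in bigraphs)


def toy_pred(t1, t2, execution):
    """Stand-in only: t1 appears before t2 in a list of task names."""
    return (t1 in execution and t2 in execution
            and execution.index(t1) < execution.index(t2))


executions = [["A", "B", "C"], ["B", "A", "C"]]
always_a_before_c = alwpred("A", "C", executions, toy_pred)
sometimes_a_before_b = pospred("A", "B", executions, toy_pred)
always_a_before_b = alwpred("A", "B", executions, toy_pred)
```

Note that, with this formulation, alwpred holds vacuously for a model without any bigraphs, whereas pospred does not.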

Let us move on to the algorithms for evaluation of intermediate predicates Pred. Consider an execution of process model and two tasks and in . Algorithm 12 specifies the evaluation of Pred. In this algorithm, refers to and to in the previous discussion, and function Precedes (which we will shortly describe in more detail) is used to evaluate causal relationship between two specific task occurrences.

function  PRED
Input: Two taskID and , a bigraph
Output: A boolean value
begin
  RetrieveAllEvents ;
RetrieveAllEvents ;
 return

Finally, we introduce the definition of function Precedes. In Algorithm 13, function Precedes determines if a given -event node eventually precedes a given -event node in bigraph (representing a process execution). Following a typical graph search algorithm, it traverses bigraph from the -event node (via recursively calling itself) until reaching the -event node ( ), the end of the graph (iSuccessors where iSuccessors denotes the immediate successors of node in graph ), or a node that was visited before ( where stores the set of visited nodes). Also, we consider that the Precedes relationship is irreflexive; that is, a task occurrence cannot have a Precedes relationship with itself. Hence, when and refer to the same task occurrence ( ), Precedes returns a negative result.

Function  Precedes
Input: A bigraph , a node , a event node , a set of nodes (the set of visited nodes)
Output: A boolean value
Begin
if     then
  return FALSE;
else
  if     then
   return TRUE;
  else
   if     then
    return FALSE;
   else
    return
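
A minimal Python sketch of such a reachability check is given below. It is an iterative variant that uses an explicit stack instead of the recursion of Algorithm 13, and it assumes a hypothetical adjacency-map representation of the bigraph (node to set of immediate successors). The visited set guards against the cycles introduced by link edges.

```python
def precedes(succ, source, target):
    """True if a directed path leads from source to target in the bigraph.

    succ maps each node to the set of its immediate successors.
    The relation is irreflexive: a node never precedes itself.
    """
    if source == target:
        return False
    visited = set()
    stack = list(succ.get(source, ()))
    while stack:
        node = stack.pop()
        if node == target:
            return True            # reached the target event node
        if node in visited:
            continue               # already explored, avoid looping on cycles
        visited.add(node)
        stack.extend(succ.get(node, ()))  # empty when the graph ends here
    return False


# Toy bigraph with a cycle: e1 -> c1 -> e2 -> c1
succ = {"e1": {"c1"}, "c1": {"e2"}, "e2": {"c1"}}
```

In this toy graph, e1 precedes e2, but e2 does not precede e1, and the search terminates despite the cycle between c1 and e2.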

A basic predicate is evaluated by traversing, breadth first, each bigraph of each process model in the repository; this operation is linear in the size of a bigraph. Hence, the complexity of evaluating a single rule (cf. Algorithms 6, 7, 8, 9, 10, 11, and 12) is linear in the product of three quantities: the total number of bigraphs in the repository, the number of basic predicates in the compliance rule, and the size of the largest bigraph in the repository.

It should be noted that for our purposes the adapted CFP generation algorithm and bigraph extraction algorithm are applied to computing the basic predicates over a repository of process models specified as Petri nets. Hence, these operations are performed when inserting a Petri net in the repository. This means that the cost of evaluating a rule is not determined by the complexity of these two algorithms, as the computation of the basic behavioural relations would already have taken place (so essentially we trade space for time).

5. Experiments

In this section, we first describe the implementation of ASBPQL in a software tool, and then we report on the performance of ASBPQL which we measured using this tool.

5.1. Implementation

To evaluate the performance of ASBPQL, we implemented a tool, ASBPQL Querier, that supports compliance checking for business process models with ASBPQL. A screenshot of ASBPQL Querier is shown in Figure 9. The tool is part of the BeehiveZ toolset v3.0. BeehiveZ is an open-source BPM analysis system based on Java (BeehiveZ can be downloaded from http://code.google.com/p/beehivez/downloads/list).

The architecture of the ASBPQL Querier and of the process model repository with which the ASBPQL Querier interacts inside BeehiveZ is illustrated in Figure 10. The core of the ASBPQL Querier is the query engine: it takes as input the compliance rules produced by users via the query editor and generates as output the results of compliance checking via the query results display. The query editor uses the syntax of ASBPQL. Using this syntax, users can easily specify the semantic relationships in which they are interested. For example, “A” alwpred “B” and “C” concur “F” means that the user wants to retrieve all process models in which, in every execution, task A precedes task B and task C occurs in parallel with task F.

Under the hood, the query engine uses an internal parser which converts each query statement into a grammar tree. This parser is built with JavaCC (http://javacc.java.net/), a widely used open-source parser generator and lexical analyzer generator for Java. Grammar trees are then used by the evaluator to identify all process models in the repository that satisfy the requirements of a given query. To do so, the evaluator needs access to the collection of process models stored in the process model repository in Petri net format, as well as to the directed bigraphs which have been constructed from the annotated CFPs of each Petri net by the annotated CFP decomposer using Algorithm 2. The generation of annotated CFPs is performed by the annotated CFP generator using Algorithm 1. In an annotated CFP, conditions, events, and directed arcs are represented as nodes of doubly linked lists, which support in particular fast insertion of nodes and backward traversal.
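
To make the role of the grammar tree more concrete, the following Python sketch evaluates a conjunction of basic predicates against a single model. The nested-tuple tree shape, the function evaluate, and the stub predicate table are all assumptions made for illustration; they do not reflect the tree produced by the JavaCC-generated parser or the BeehiveZ evaluator.

```python
def evaluate(tree, model, predicates):
    """Evaluate a query tree against one model.

    tree is either ("and", subtree, ...) or (predicate_name, operand, ...).
    predicates maps a predicate name to a function taking the operands
    followed by the model and returning a boolean.
    """
    operator, *operands = tree
    if operator == "and":
        return all(evaluate(sub, model, predicates) for sub in operands)
    return predicates[operator](*operands, model)


# Stub predicates: each model simply lists the relations that hold in it.
predicates = {
    "alwpred": lambda a, b, model: (a, b) in model["alwpred"],
    "concur": lambda a, b, model: (a, b) in model["concur"],
}

# Hypothetical tree for the query: "A" alwpred "B" and "C" concur "F"
query = ("and", ("alwpred", "A", "B"), ("concur", "C", "F"))

model_1 = {"alwpred": {("A", "B")}, "concur": {("C", "F")}}
model_2 = {"alwpred": {("A", "B")}, "concur": set()}
matching = [name for name, model in (("m1", model_1), ("m2", model_2))
            if evaluate(query, model, predicates)]
```

In an actual evaluator, the stub predicate functions would be replaced by evaluations over the precomputed bigraphs of each model.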

Moreover, for efficiency reasons, we keep an inverted index for every node label that appears in the set of annotated CFPs. We use Apache Lucene to manage these indexes (http://lucene.apache.org/). Specifically, for each label we record all processes which contain that label in some node. Based on this index, once a compliance rule is issued, the tool can quickly select the set of candidate models that contain the labels used in the rule. The rest of the models are ignored since they are not relevant to the current rule. This step typically reduces the search scope and improves the tool’s performance. Furthermore, an advantage of using inverted indexes is that they can be easily updated when a node label changes in the repository. For more details on this index, we refer to previous work [18].
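
The filtering step can be illustrated with a small in-memory inverted index in Python. This is a stand-in for illustration only: the tool itself manages its indexes with Apache Lucene, and the function names build_index and candidate_models are hypothetical. The sketch also assumes that a candidate model must contain every label used in the rule.

```python
from collections import defaultdict


def build_index(models):
    """Map each label to the set of model identifiers that contain it.

    models maps a model identifier to an iterable of its node labels.
    """
    index = defaultdict(set)
    for model_id, labels in models.items():
        for label in labels:
            index[label].add(model_id)
    return index


def candidate_models(index, rule_labels):
    """Models that contain all labels used in a rule (assumed semantics)."""
    postings = [index.get(label, set()) for label in rule_labels]
    return set.intersection(*postings) if postings else set()


models = {
    "m1": ["A", "B"],
    "m2": ["A", "C"],
    "m3": ["A", "B", "C"],
}
index = build_index(models)
candidates = candidate_models(index, ["A", "B"])
```

Only the models returned by candidate_models would then be passed on to predicate evaluation.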

5.2. Performance Measurements

We prepared a set of eight sample rules using various ASBPQL basic predicates and measured the evaluation of each of these rules over three process model collections. The first two collections are real-life repositories: the SAP R/3 reference model, consisting of 604 EPC models, and the IBM BIT library, consisting of 1,128 Petri nets. The SAP dataset is used by SAP consultants to deploy the SAP enterprise resource planning system within organizations [19]. The IBM BIT library includes five collections (A, B1, B2, C1, and C2) of process models from various domains, including insurance and banking [20]. The third dataset contains 10,000 artificially-generated models. (This dataset is available at http://code.google.com/p/beehivez/downloads/list.)

Since the SAP dataset is represented in the EPC notation, we first transformed these models into Petri nets using ProM (http://www.processmining.org/). This resulted in 591 Petri nets for the SAP dataset (13 SAP reference models could not be mapped into Petri nets through ProM). In the resulting dataset there are 4,439 transitions, of which 1,494 are uniquely labeled (33% of the total), while in the IBM dataset there are 9,083 transitions, of which 946 are uniquely labeled (10% of the total). The structural characteristics of the three datasets used in the experiments are reported in Table 2. In particular, we can see that the SAP and IBM collections have models of comparable size based on the average number of their elements (transitions, places, and arcs).

We generated the third dataset using BeehiveZ based on the reduction rules from [8]. The number of nodes per model follows a normal distribution. Specifically, the number of transitions per model ranges from 1 to 50 (average 24.85), the number of places from 1 to 47 (average 16.81), and the number of arcs from 2 to 162 (average 63.22). The labels of transitions were randomly chosen from a fixed label set comprising the characters “A–Z” and “a–z” and the digits “0–9”, each label consisting of a single character or digit. In total, this led to 248,493 transitions in this dataset, with 62 unique labels (corresponding to 0.026% of the total number of transitions). Since, as mentioned earlier, we maintain an inverted index over task labels, we deliberately chose a very small set of unique labels relative to the total number of transitions in order to increase the number of models that can potentially satisfy a rule; this yields a more precise measurement of the efficiency of executing a compliance rule. All models used in the experiments are bounded Petri nets, which is a requirement for unfolding according to [21].

We conducted our tests on a machine with an Intel Core i7-2600 CPU @ 3.4 GHz and 8 GB RAM, running Windows 7 Ultimate and JDK 6. The heap memory for the JVM was set to 1 GB. We executed each compliance rule twelve times and measured each response time. We then discarded the highest and lowest response times for each rule and computed the average response time over the remaining ten values. The test rules and the response times for the three datasets are reported in Table 3.
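
The averaging procedure can be written down in a few lines of Python. The function name trimmed_average and the sample values are hypothetical and serve only to illustrate the computation; they are not measurements from the experiments.

```python
def trimmed_average(response_times_ms):
    """Average after discarding the single highest and lowest values."""
    if len(response_times_ms) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(response_times_ms)[1:-1]  # drop the minimum and the maximum
    return sum(kept) / len(kept)


# Twelve made-up response times: ten regular runs and two outliers.
sample_runs = [10.0] * 10 + [1.0, 100.0]
average_ms = trimmed_average(sample_runs)
```

Discarding the extremes makes the reported average less sensitive to one-off effects such as JVM warm-up or garbage collection pauses.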

In particular, to are used to test the unary basic predicates posoccur and alwoccur, and and are for the concur and exclusive predicates, while to are for causal relation predicates. For readability, in the table we use fictitious labels for transitions (e.g., ). The real labels from the three datasets, can be found in the Appendix.

The second and third columns of Table 3 show, for each rule, the number of models selected by BeehiveZ’s inverted index (“candidate models”) and the number of models that actually satisfy the rule (“returned models”). These numbers are very low for the SAP and IBM datasets (e.g., one rule yields six candidate models in the SAP dataset, of which only two satisfy the rule), due to the high number of unique labels within these collections (see Table 2). However, as expected, these numbers grow significantly in the artificially generated collection (for example, one rule yields 552 candidate models, of which 72 satisfy the rule).

The last column of Table 3 shows the response times for executing the sample queries. These times are in the order of milliseconds for the SAP and IBM datasets (average 15 ms and 254.7 ms) and less than one second for the artificial dataset (average 850.4 ms). This suggests that query evaluation scales well to very large datasets. That said, our technique shifts computation time from compliance checking to model insertion. In other words, most of the time is spent generating the CFPs rather than executing the compliance checking. Specifically, the overall time for building the set of CFPs and the corresponding bigraphs for the three datasets is 12.6 min (SAP dataset), 28.5 min (IBM), and 8.1 hours (artificial dataset). However, since we build annotated CFPs incrementally as we insert each Petri net into the repository, in practice the time for creating a single CFP is very short: only 1.28 s on average for a model from the SAP dataset, 1.52 s for a model from the IBM dataset, and 2.92 s for a model from the artificial dataset. These times are reasonable since repository users typically insert or remove single process models, or small groups thereof, rather than entire model collections at once.
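
The per-model averages follow from dividing the overall build time by the number of Petri nets in each dataset (591 for SAP, 1,128 for IBM, and 10,000 for the artificial collection). The short Python check below reproduces the arithmetic; the function name seconds_per_model is hypothetical.

```python
def seconds_per_model(total_seconds, model_count):
    """Average build time per model, rounded to two decimals."""
    return round(total_seconds / model_count, 2)


sap = seconds_per_model(12.6 * 60, 591)            # 12.6 minutes, 591 nets
ibm = seconds_per_model(28.5 * 60, 1128)           # 28.5 minutes, 1,128 nets
artificial = seconds_per_model(8.1 * 3600, 10000)  # 8.1 hours, 10,000 nets

print(sap, ibm, artificial)  # prints: 1.28 1.52 2.92
```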

As expected, the storage size of the CFPs (including the label indexes) and corresponding bigraphs can be large. While it is only 26.8 MB for the SAP dataset and 18.1 MB for the IBM dataset, this value gets to 3.38 GB for the artificial dataset. However, this space is still acceptable considering that in organizational settings dedicated servers are typically employed to host process model repositories, rather than single desktop machines.

6. Related Work

Given the importance of query languages for business process models, the Business Process Management Initiative (BPMI) planned in 2004 to define a standard process model query language. While such a standard has never been published, two major research efforts have been dedicated to the development of query languages for process models. One is known as BP-QL [1], a graphical query language based on an abstract representation of BPEL and supported by a formal model of graph grammars for processing of queries. BP-QL queries process specifications written in BPEL rather than their possible executions, and it ignores the run-time semantics of certain BPEL constructs such as conditional execution and parallel execution.

The other effort, namely, BPMN-Q [2, 3], is also a visual query language; it extends a subset of the BPMN modelling notation and supports graph-based query processing. Similar to BP-QL, BPMN-Q only captures the structural (i.e., syntactical) relationships between tasks. BPMN-Q uses a directed path (enhanced by operators like ≪leads to≫ and ≪precedes≫) connecting two activities to capture the requirement that they occur in order. The processing of BPMN-Q queries includes several steps. In short, the BPMN-Q query engine searches for the process models that contain subgraphs that structurally match a query, reduces these subgraphs (removing elements that are not relevant to the query), translates the reduced subgraphs into Petri nets, and then calculates the corresponding reachability graph for each Petri net. Next, the query is translated into a temporal logic formula, which is fed into a model checker together with the reachability graphs generated from the Petri nets. Finally, the model checker outputs the process models that satisfy the query. Although part of the evaluation of BPMN-Q queries is based on LTL formulae, one of the most important steps is subgraph matching, which is entirely structure based. For example, for the BPMN-Q query in Figure 2, the subgraph obtained from the process model (c) in Figure 1 is shown in Figure 11. If only this subgraph is considered, the BPMN-Q query holds, but this is not the case for the process execution in which task “open VIP account” occurs. Accordingly, as discussed in Section 1, the main problem of BPMN-Q is that it cannot answer the question of whether, for the resulting processes, the requirements of a query hold in every process execution or in just some process executions. BPMN-Q only returns process models where the requirements hold in some process executions, rather than in every process execution.
A comparison between ASBPQL and BPMN-Q is shown in Figure 12 where empty cells mean that the corresponding requirements cannot be captured by BPMN-Q. In [22], the authors explore the use of an information retrieval technique to derive similarities of activity names and develop an ontological expansion of BPMN-Q to tackle the problem of querying business processes that are developed with different terminologies. A framework of tool support for querying process model repositories using BPMN-Q and its extensions is presented in [23]. In [24], the authors proposed an indexing mechanism to improve the efficiency of evaluating BPMN-Q queries.

ASBPQL provides three distinguishing features compared to the previous languages. First, its abstract syntax and semantics have been purposefully defined to be independent of a specific process modelling language (such as BPEL or BPMN). This allows ASBPQL and its query evaluation technique to be implemented for a variety of process modelling languages. Second, ASBPQL can express various temporal-ordering relations (precedence/succession, concurrence, and exclusivity) between individual tasks, between an individual task and a set of tasks, and between different sets of tasks (in some or every process execution). Third, these rich querying constructs are evaluated over the execution semantics of process models, rather than their structural relationships. In fact, structural characteristics alone are not able to capture all possible order relations among tasks which can occur during execution, in particular with respect to cycles and task occurrences (recall the discussions in Section 1).

In earlier work [25], we made an initial attempt at defining a query language based on the execution semantics of process models. The language was written in linear temporal logic (LTL) and only supported precedence/succession relations among individual tasks (not sets of tasks). Queries in this language are evaluated directly on annotated CFPs (i.e., TPCFPs in [25]), rather than on the directed bigraphs which are built by decomposing the annotated CFPs (a directed bigraph represents an execution of a process). As a result, this language only returns the process models which satisfy the requirements in some process executions, rather than in every execution. In addition, using LTL formulae as queries is not very user friendly for ordinary users. In [26], the authors proposed a business query language (BQL) to capture four types of relations (Exist, ParallelWith, Exclude, and Precede). A query in BQL returns processes of which some executions satisfy these four types of relations. Furthermore, BQL has the drawback that its formal semantics has not been defined.

In addition to the development of a specific process model query language, other techniques are available in the literature which can be useful for querying process model repositories. In [27, 28] the authors focus on querying the content of business process models based on metadata search. In [29], an XML-based process query language, IPM-PQM, was designed to express search requirements. IPM-PQM can express four types of search conditions: Process-Has-Attribute, Process-Has-Activity, Process-Has-Subprocess, and Process-Has-Transition. IPM-PQM is a typical structure-based process querying technology. The VisTrails system [30] allows users to query scientific workflows by example and to refine workflows by analogies. WISE [31] is a workflow information search engine which supports keyword search on workflow hierarchies. In [32] the authors use graph reduction techniques to find a match to the query graph in the process graph for querying process variants; however, the approach works on acyclic graphs only. In [33–36], a group of similarity-based techniques has been proposed which can be used to support process querying. In previous work, we designed a technique to query process model repositories based on an input Petri net [18]. In [37], the authors introduced an execution-log-based query language which enables users to find elements and their relationships in process logs. In [38, 39], an approach that supports “static” and “dynamic” querying of processes has been presented. For static querying, this approach searches for matching processes which contain specified context elements, such as business functions, roles, and resources. This is based on keyword matching. For dynamic querying, similar to BPMN-Q, it tries to find process models where the requirements hold in just some process instances. In [40], the authors proposed an approach to searching business process models. This approach induces relationships between activities from their labels; it provides an approximate process model search mechanism. Finally, in [41], the notion of behavioural profile of a process model is defined, which captures dedicated behavioural relations like exclusiveness or potential occurrence of activities. These behavioural relations are derived from the structure of the unfolding of a process model. However, the main foundation of the behavioural profile is the weak order (two transitions are in weak order if there exists a process execution in which one occurs before the other). Thus, for the reasons mentioned above, the behavioural profile only provides an approximation of a process model’s behaviour which just holds in some process executions, whereas we can precisely determine whether or not a process model satisfies a given query in every process execution. Moreover, the efficient computation of this approach requires process models to be sound free-choice Petri nets, whereas our query evaluation technique only requires Petri nets to be bounded, in order to unfold them.

7. Conclusions

In this paper, we simplified SPS by logical reasoning to define a concise and expressive retrieval language for specifying semantics-based compliance rules. We also contributed an efficient technique based on unfolding to explore the semantics of process models. With this technique, we can extract every independent execution from business process models without suffering from the well-known state space explosion. The language and its evaluation have been implemented as a component of the process analysis tool BeehiveZ. We also conducted experiments over three large datasets to evaluate the efficiency of our technique. The performance measurements show that the technique can efficiently cope with very large datasets (the artificial collection comprises 10,000 process models).

In the future, we will introduce a graphical interface for querying in order to make BeehiveZ more intuitive.

Appendix

In Table 4, we provide the mapping between the fictitious labels used in Table 3 and the real labels used in the SAP, IBM, and artificial datasets.

Acknowledgments

Song, Wang, and Wen are supported by the National Basic Research Program of China (2009CB320700), the National High-Tech Development Program of China (2012AA040904), the Project of National Natural Science Foundation of China (90718010, 61003099), the Program for New Century Excellent Talents in University of China, and the Ministry of Education and China Mobile Research Foundation (MCM20123011).