Special Issue: Recent Advances in Information Technology
Research Article | Open Access
Simplified Process Model Discovery Based on Role-Oriented Genetic Mining
Process mining is the automated acquisition of process models from event logs. Although many process mining techniques have been developed, most of them are based on control flow. Meanwhile, the existing role-oriented process mining methods focus on the correctness and integrity of roles while ignoring the role complexity of the process model, which directly impacts the understandability and quality of the model. To address these problems, we propose a genetic programming approach to mine the simplified process model. Using a new metric of process complexity in terms of roles as the fitness function, we can find simpler process models. The new role complexity metric of process models is designed from role cohesion and coupling and applied to discover roles in process models. Moreover, the higher fitness derived from the role complexity metric also provides a guideline for redesigning process models. Finally, we conduct a case study and experiments to show that, compared with related studies, the proposed method is more effective for streamlining the process.
1. Introduction
Information systems are mostly driven by the process model. Therefore, the process model is a key factor for the system to run effectively. Process mining techniques aim to automatically generate process models by analyzing the event log, which assists in the redesign of process models.
Process mining first appeared in the field of software engineering. It was proposed by Cook from New Mexico State University in 1995 [1]. Then, Agrawal et al. started to apply process mining to process management in 1998 [2], using directed graphs to represent the association between different activities in business processes. Instead of using directed graphs, van der Aalst used the workflow net, a subclass of Petri nets, to represent process models. Based on this work, some scholars extended process mining algorithms to handle business logic, including sequence, parallel, and circular relationships [3]. Compared with other process mining algorithms, genetic mining, proposed by de Medeiros et al., is a global search algorithm that deals with noise effectively [4].
The structure of process models is often complex. They consist of circular, parallel, choice, and hidden structures, and current process mining algorithms are not well developed in dealing with these structures. Process mining aims to find, from the process execution log, the model which most closely matches the actual behavior of business processes. But as the process becomes more complicated, there will be large numbers of alternative process models. Finding the process model with low complexity is therefore necessary for process improvement. For example, Tian proposed a process mining algorithm combining genetic algorithms and the simulated annealing algorithm [5]. By analyzing process participants, the algorithm built a causal relationship matrix mapping process instances to the population chromosomes and mined the process model effectively. In fact, those methods are very complex, and the mined process model may be very complicated too. Then, some researchers employed the complexity metric of control flow and computed the structural complexity of process models to guide process redesign [6]. However, these approaches only pay attention to the complexity of the process model from the perspective of control flow. It remains necessary to discover process models with low complexity, especially from the organizational view, and to streamline the collaboration between process actors.
At present, most process mining methods are based on process activities. They neglect the fact that the process depends on the collaboration between multiple roles. Though some scholars have begun to extract knowledge from the role perspective, their studies on the relationship between process roles are not complete and are confined to discussing the interaction among organizational entities [7, 8]. Actually, the relationship between roles is very complicated, so it is hard to uncover hidden information, and the role complexity of business processes is ignored.
The remainder of the paper is organized as follows. Section 2 introduces the role complexity metrics. The role complexity fitness is presented in Section 3, and a case study is conducted in Section 4. Moreover, we conduct comparative experiments in Section 5. Finally, Section 6 concludes the paper.
2. The Role Complexity of Process Models
The complexity of business processes describes process models from different perspectives. It indicates whether a business process model has the right size and a clear structure and whether it is easy to understand and reasonably modular. Therefore, it is necessary to design process models with low complexity.
Previous studies focus on control flow, which is composed of activities and their relationships. For example, Cardoso discussed the complexity metric of control flow through experiments [9]. In addition, Vanderfeesten et al. proposed cohesion and coupling metrics for process design [10]. However, a process is the integration of participants (roles), resources, objectives, information, business rules, and so on. Control flow is just one of the factors affecting process complexity, and more researchers have begun to analyze business processes from different aspects.
2.1. The Role Cohesion Metric
The role cohesion analyzes the closeness of multiple activities performed by one role. It assumes that the activities performed by the same role are closely related. For example, if the activities performed by a role are based on the same data or require similar capacities, then the role may have greater role cohesion and be more efficient in taking the activities. The role cohesion metric is categorized into the following types.
(1) The role activity cohesion is to assess the interaction between roles in terms of control flow. The shorter the interval is, the higher the role activity cohesion is. Herein, the interval between two activities is defined as the number of activities between them, and the distance between two activities $a_i$ and $a_j$ in an execution sequence is derived from the number of activities between them. Actually, there may be several execution sequences containing $a_i$ and $a_j$; in that case, the distance between $a_i$ and $a_j$ is defined over all the execution sequences which contain both.
We can examine the interval of every two activities performed by a role to measure the role activity cohesion. So, the role activity cohesion of a role $r$ is defined in terms of the distances between the activities performed by $r$, the maximum distance (max) between activities of the process, and the number of possible combinations of two activities of $r$. The role activity cohesion thus reflects the distance of activities performed by $r$: the shorter the distance is, the higher the role activity cohesion of $r$ is.
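As a minimal sketch of the interval described above (the number of activities between two activities in one execution sequence), the following Python helpers count it for a single trace and collect the per-trace values over a log. The function names and the toy log are illustrative assumptions; the sketch uses the first occurrence of each activity and deliberately leaves the aggregation over several execution sequences to the caller.

```python
def interval(trace, a, b):
    """Number of activities strictly between the first occurrences of
    two distinct activities a and b in one execution sequence."""
    i, j = trace.index(a), trace.index(b)
    return abs(i - j) - 1


def intervals_over_log(log, a, b):
    """Per-trace intervals for every execution sequence containing both a and b."""
    return [interval(t, a, b) for t in log if a in t and b in t]


# Hypothetical log with three execution sequences.
log = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["A", "D"],
]
print(interval(log[0], "A", "D"))         # 2 activities lie between A and D
print(intervals_over_log(log, "B", "D"))  # the last trace has no B, so it is skipped
```
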
(2) The role data cohesion measures the cohesion between roles in terms of data. It analyzes the frequency of using different data.
For an activity $a$, consider the input data set of $a$, which is necessary for $a$, and the output data set of $a$; together they make up the data set related to $a$. The role data cohesion of a role $r$ is defined from the number of elements in these data sets for the activities performed by $r$.
The role data cohesion indicates the proportion of input and output data that the activities performed by $r$ have in common. The more they share the same data, the higher the role data cohesion of $r$ is.
(3) The role ability cohesion measures the cohesion between roles in terms of the abilities needed. It computes what abilities are required for the role to perform different activities. The role ability cohesion of $r$ is defined from the set of abilities necessary to perform each activity of $r$ and the number of elements in that set.
The role ability cohesion shows the kinds of common abilities required by different activities performed by $r$. The more they share, the higher the role ability cohesion of $r$ is.
As a whole, the role cohesion metric is computed from the role activity cohesion, the role data cohesion, and the role ability cohesion.
2.2. The Role Coupling Metric
The role coupling metric indicates the degree of association between activities taken by different roles. If there are several kinds of connections between activities performed by two roles, and one role is connected with more roles, it has a greater role coupling metric.
(1) The role activity coupling shows the degree of association between activities performed by different roles in a process. If activities by different roles are connected, these roles are interrelated. There are several kinds of connections corresponding to different degrees of association.
Assume that role $r$ is responsible for activity $a_i$ and that activity $a_j$ is not performed by $r$. Let $m$ and $n$ represent the outdegree and indegree, respectively, of the connector between $a_i$ and $a_j$. We can define the coupling weight as follows through the connection form between $a_i$ and $a_j$.
(i) If $a_i$ and $a_j$ are directly connected, then they are coupled, so the coupling weight between them is 1.
(ii) If $a_i$ and $a_j$ are connected through an AND connector, then they are also coupled, so the coupling weight between them is 1.
(iii) If $a_i$ and $a_j$ are connected through an OR connector, then the coupling weight between them equals the probability of coupling between them.
(iv) If $a_i$ and $a_j$ are connected through an XOR connector, then the probability of coupling and the coupling weight between them are both $1/mn$.
(v) If $a_i$ and $a_j$ are not connected, they cannot be coupled, so the coupling weight between them is 0.
The role activity coupling of $r$ is defined in terms of connected($a_i$, $a_j$), which represents the coupling weight between $a_i$ and $a_j$; Arc, which stands for the set of arcs in the process model; and the number of elements in Arc. The larger the resulting value is, the higher the role activity coupling of $r$ is.
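The coupling weights listed above can be sketched as a small lookup function. This is only an illustration: the function name, the string labels for connection types, and the `or_weight` parameter are assumptions. In particular, the OR-connector weight is taken as an input rather than computed, because this sketch does not presume its closed form.

```python
def coupling_weight(connection, m=1, n=1, or_weight=None):
    """Coupling weight between two activities performed by different roles.

    connection: "direct", "AND", "OR", "XOR", or "none" (labels are illustrative).
    m, n: outdegree and indegree of the connector (used for XOR only).
    or_weight: probability of coupling for an OR connector, supplied by the caller.
    """
    if connection in ("direct", "AND"):
        return 1.0
    if connection == "XOR":
        return 1.0 / (m * n)
    if connection == "OR":
        if or_weight is None:
            raise ValueError("an OR connector needs an explicit coupling probability")
        return or_weight
    if connection == "none":
        return 0.0
    raise ValueError(f"unknown connection type: {connection!r}")


print(coupling_weight("AND"))          # 1.0
print(coupling_weight("XOR", 2, 2))    # 0.25
print(coupling_weight("none"))         # 0.0
```
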
(2) The role coupling is related not only to the role activity coupling, but also to the number of roles associated with a role. If a role is associated with more roles, it may be complicated. So, the role relation coupling of $r$ is defined in terms of the number of roles associated with $r$ and the number of roles in the process. The larger the former is, the higher the role relation coupling of $r$ is. The role coupling metric is computed from the role activity coupling and the role relation coupling.
The lower the role cohesion is and the higher the role coupling is, the more complex the role is. Therefore, the role complexity of a role is defined from its role coupling and its role cohesion.
As each role differs in importance, the role complexity of each role should be weighted according to its importance. Then, from the weight of each role and its role complexity, we can get the role complexity of a business process. The weight of each role $r$ can be defined as

$$w_r = \frac{1}{3}\left(\frac{\mathrm{num}_r}{\mathrm{NUM}} + \frac{\mathrm{time}_r}{\mathrm{TIME}} + \frac{\mathrm{cost}_r}{\mathrm{COST}}\right), \tag{10}$$

where $\mathrm{num}_r$, $\mathrm{time}_r$, and $\mathrm{cost}_r$ represent the number of activities, time, and cost to perform activities by $r$, respectively, and NUM, TIME, and COST represent the number of activities, time, and cost to perform activities of the business process, respectively. In (10), 1/3 is to ensure that the sum of weights of all the roles is 1. The role complexity of the business process is defined as

$$C = \sum_{r \in R} w_r \, C_r, \tag{11}$$

where $C_r$ is the role complexity of the role $r$, $w_r$ represents its weight, and $R$ is the set of roles in the process.
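A minimal sketch of this weighting, reading the weight of a role as one third of the sum of its shares of activities, time, and cost, and the process role complexity as the weighted sum of the per-role complexities. The role names, the numbers, and the assumption that the process totals are the sums over all roles are illustrative and not taken from the case study.

```python
def role_weights(roles):
    """Weight of each role: mean of its shares of activities, time, and cost."""
    total_num = sum(r["num"] for r in roles.values())
    total_time = sum(r["time"] for r in roles.values())
    total_cost = sum(r["cost"] for r in roles.values())
    return {
        name: (r["num"] / total_num + r["time"] / total_time + r["cost"] / total_cost) / 3
        for name, r in roles.items()
    }


def process_role_complexity(roles):
    """Weighted sum of the per-role complexities."""
    weights = role_weights(roles)
    return sum(weights[name] * r["complexity"] for name, r in roles.items())


# Hypothetical roles; "complexity" stands for a role complexity computed elsewhere.
roles = {
    "manager": {"num": 3, "time": 10.0, "cost": 50.0, "complexity": 2.0},
    "worker": {"num": 1, "time": 30.0, "cost": 50.0, "complexity": 4.0},
}
print(role_weights(roles))
print(process_role_complexity(roles))
```

Because each of the three shares sums to 1 over all roles, the factor 1/3 makes the weights themselves sum to 1.
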
3. The Fitness Function of Role-Oriented Process Mining
In 2005, van der Aalst first introduced the genetic algorithm to process mining (genetic mining). In genetic mining, an individual is a candidate process model, and the fitness function evaluates how well it is able to represent the actual process.
The fitness function is used to evaluate the fitness of every individual and to guide the search process of genetic programming. In order to mine the simplified business process model, we introduce the complexity fitness into the fitness function.
3.1. Role Complexity Fitness
We define $C$ as the role complexity of a process individual; $C_{\min}$ and $C_{\max}$ stand for the minimum and maximum role complexity values, respectively, in a generation of the population. So, the role complexity fitness is defined as

$$PF_{\mathrm{complex}} = \frac{C - C_{\min}}{C_{\max} - C_{\min}}. \tag{12}$$
PFcomplex describes the relative role complexity of individuals in the same population in (12). When the role complexity value of an individual is the maximum, the fitness value of role complexity is 1. When the role complexity value of individual reaches the minimum, the fitness value of role complexity is 0. The smaller the PFcomplex of the individual is, the lower its relative complexity is.
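The behaviour described here (1 at the maximum role complexity of the generation, 0 at the minimum) corresponds to a min-max normalization, which can be sketched as follows. The function name is an assumption, and returning 0 when all individuals have the same complexity is a choice made for this sketch only. The bounds 393.74 and 25.54 are the generation maximum and minimum quoted in the case study.

```python
def complexity_fitness(c, c_min, c_max):
    """Relative role complexity of an individual within its generation (0 = simplest)."""
    if c_max == c_min:
        # Degenerate generation: every individual is equally complex (sketch-only choice).
        return 0.0
    return (c - c_min) / (c_max - c_min)


C_MIN, C_MAX = 25.54, 393.74
print(complexity_fitness(C_MAX, C_MIN, C_MAX))  # 1.0 for the most complex individual
print(complexity_fitness(C_MIN, C_MIN, C_MAX))  # 0.0 for the simplest individual
```
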
3.2. Fitness Function
The basic principle of the fitness function is that a process model should match the event logs as much as possible. So the precision is defined in terms of the number of roles in a process model, the participant set of each role, and the cosine similarity between pairs of participants.
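The cosine similarity between two participants can be sketched as below. Representing each participant by the number of times he or she executed each activity is an assumption of this sketch, as are the function name and the sample profiles; the precision formula itself is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two activity-frequency profiles given as dicts."""
    dot = sum(count * v.get(activity, 0) for activity, count in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a participant with no recorded activity is similar to no one
    return dot / (norm_u * norm_v)


# Hypothetical profiles: activity name -> number of executions in the log.
p1 = {"design": 10, "inspect": 6, "apply": 5}
p2 = {"design": 8, "inspect": 7}
p3 = {"produce": 8}
print(cosine_similarity(p1, p2))  # high: the two participants share most activities
print(cosine_similarity(p1, p3))  # 0.0: no activity in common
```
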
In order to discover simple process models, we add the role complexity fitness alongside the precision. As mining process models with correct roles is the nature of process mining, the complexity fitness should have the lower weight. The individuals that are complex are punished. Assuming that the weights of the precision and the complexity fitness are $w_1$ and $w_2$, respectively, the complete fitness combines the two weighted terms. The fitness is thus affected not only by the correct recognition of roles in the process model, but also by the role complexity of the model, so the role complexity of mined process models becomes lower.
The basic idea of genetic mining is as follows. First, event logs are collected, and the activities performed by each participant are analyzed. Then, the initial population is created. After that, the fitness of every individual in the population is computed according to the fitness function (14). If the fitness does not satisfy the termination condition, the population evolves iteratively through the genetic operations, including selection, crossover, and mutation. Each genetic operation passes the individuals that have higher fitness values in the population on to the next generation. This loop continues until the optimal solution is found. In Section 4, we resort to a case study to discuss the procedure of genetic mining in detail.
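The loop described above can be sketched generically for bit-string chromosomes. Everything specific in this sketch is an assumption: the function names, the parameter defaults, the fixed crossover and mutation probabilities, and above all the toy fitness, which merely counts matching bits against a fixed string and stands in for the fitness function of the paper.

```python
import random


def evolve(fitness, length, pop_size=20, keep=10, generations=50,
           p_cross=0.7, p_mut=0.1, target=1.0, seed=0):
    """Evolve bit strings: selection, one-bit crossover, and one-bit mutation."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) >= target:       # termination condition
            break
        parents = pop[:keep]                # selection: keep the fittest individuals
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = a
            if rng.random() < p_cross:      # crossover: take one bit from the other parent
                i = rng.randrange(length)
                child = a[:i] + b[i] + a[i + 1:]
            if rng.random() < p_mut:        # mutation: flip one random bit
                i = rng.randrange(length)
                child = child[:i] + ("1" if child[i] == "0" else "0") + child[i + 1:]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Toy stand-in fitness: share of bits equal to a fixed 21-bit string.
TARGET = "010001000011000010000"


def toy_fitness(chromosome):
    return sum(x == y for x, y in zip(chromosome, TARGET)) / len(TARGET)


best = evolve(toy_fitness, len(TARGET))
print(best, toy_fitness(best))
```

Keeping the selected parents unchanged in the next generation means the best fitness never decreases from one generation to the next.
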
4. Case Study
We first give a process mining experiment mentioned in [11] and compare its process mining algorithm with ours.
In that experiment, the interaction between roles and role identification are analyzed by using a genetic algorithm to achieve optimal role identification. On the one hand, it shows the degree of similarity in the activities executed by participants. On the other hand, it indicates the similarity of performing internal activities of the participants of a certain role and the interaction between participants. Table 1 shows a fragment of the workflow logs. The first column gives the process instance number, the second column gives the activity, and the third column gives the participant who performed the activity in the second column.
The matrix $M$ shows the role situation of a process model. If $m_{ij}$ is 1, the participants $p_i$ and $p_j$ undertake the same role. If $m_{ij}$ is 1 and $i = j$, then $p_i$ undertakes the role alone. In the matrix, one group of participants undertakes a role together, one participant undertakes a role alone, and two further participants undertake a role together. We can encode the process model by concatenating the values of each row in $M$, and the chromosome is 010001000011000010000.
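One way to read this encoding is sketched below. It rests on assumptions that are not stated in the text: that there are six participants, that the 21 bits are the upper triangle of a symmetric 6 x 6 matrix including the diagonal (6 + 5 + 4 + 3 + 2 + 1 = 21), read row by row, and that participants linked directly or indirectly belong to the same role. Function names and the 0-based participant indices are illustrative.

```python
def decode(chromosome, n):
    """Rebuild a symmetric n x n 0/1 matrix from its row-wise upper triangle."""
    if len(chromosome) != n * (n + 1) // 2:
        raise ValueError("chromosome length does not match n")
    matrix = [[0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = int(chromosome[k])
            k += 1
    return matrix


def role_groups(matrix):
    """Group participants that are linked, directly or through other participants."""
    n = len(matrix)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in group:
                continue
            group.add(x)
            stack.extend(j for j in range(n)
                         if j != x and matrix[x][j] and j not in group)
        seen |= group
        groups.append(sorted(group))
    return groups


m = decode("010001000011000010000", 6)
print(role_groups(m))  # [[0, 1, 5], [2], [3, 4]] under the assumed layout
```
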
In workflow logs, the situation of activities by each actor is as follows: executes 10 times, 6 times, and 5 times separately. executes 2 times, 2 times, 5 times, and 6 times separately. executes 8 times. executes activity 8 times, 7 times, and 6 times separately. executes activity 7 times, 7 times, 8 times, 6 times, 10 times, and 2 times separately. executes 5 times, 3 times, 3 times, 5 times, 4 times, and 1 times separately. Tables 2 and 3 show the data and abilities required for each activity in the process separately.
The precision value and the role complexity of the individual represented by the matrix are then computed.
The maximum role complexity value in this generation is 393.74, and the minimum one is 25.54. The complexity fitness value of the individual is computed relative to these two bounds.
The complexity fitness is supposed to have a lower weight than the precision fitness. In this paper, we assume that the weight of the precision fitness is 0.7 and that of the complexity fitness is 0.3, and the fitness value of the individual is computed with these weights.
After that, genetic operations are performed. First, we use the selection operator, which retains the process models that have higher fitness values. Herein, we choose 15 process models to get the corresponding chromosomes in each generation. Then, we use the crossover operator. For example, when the model with the code 010000000001000010001 exchanges its 16th bit with the model with the code 010001000011000110000, the new chromosomes 010000000001000110001 and 010001000011000010000 are produced. The occurrence probability of crossover depends on two constants, which we assume to be 0.7 and 0.1; on the maximum and minimum fitness values in this generation, which are 0.87 and 0.52, respectively; and on the fitness value of the individual, which is 0.632.
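The crossover example above amounts to exchanging a single bit between two chromosomes, which can be sketched as follows. The function name is an assumption, and the position is 1-based to match the text; the adaptive crossover probability is not modelled here.

```python
def swap_bit(a, b, position):
    """Exchange the bit at the given 1-based position between two chromosomes."""
    i = position - 1
    return a[:i] + b[i] + a[i + 1:], b[:i] + a[i] + b[i + 1:]


parent_a = "010000000001000010001"
parent_b = "010001000011000110000"
child_a, child_b = swap_bit(parent_a, parent_b, 16)
print(child_a)  # 010000000001000110001
print(child_b)  # 010001000011000010000
```
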
The mutation operation changes one bit of the chromosome at random, from 0 to 1 or from 1 to 0. Its probability of occurrence is defined in terms of two constants, which we assume to be 0.09 and 0.01, respectively.
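A minimal sketch of this mutation step, flipping one randomly chosen bit. The function name and the optional random generator argument are assumptions; the decision whether to mutate at all (the mutation probability) is left to the caller.

```python
import random


def mutate(chromosome, rng=random):
    """Flip exactly one randomly chosen bit of a bit-string chromosome."""
    i = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


original = "010001000011000010000"
mutant = mutate(original, random.Random(42))
print(original)
print(mutant)  # differs from the original in exactly one position
```
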
As mentioned above, we can get the role situation matrix through role-oriented genetic mining below:
As seen from and share a role,, and undertake a role, and both and undertake a role alone. It groups the participants into four roles: business manager, technical staff , technical staff , and production workers. Figure 1 is the role-activity diagram through our role-oriented process mining method.
In order to verify the effectiveness of our method, we compare the role complexity of process models mined by our algorithm and by the algorithm proposed in [11]. For Figure 1, we can calculate the role complexity of the two technical staff roles by our algorithm. Their cohesion and coupling complexity are shown in Table 4.
In the compared algorithm, the two technical staff roles are taken as one role. The role complexity of that role is 7.82. Clearly, the role complexity of that role is far greater than the sum of the role complexity of the two technical staff roles in Figure 1. The reason is that the technical staff is mainly responsible for designing samples and inspection, as well as raw material application and confirmation. These two kinds of activities are different and require different data and abilities, which leads to low role activity cohesion, role data cohesion, and role ability cohesion. This ultimately brings about the high role complexity. By means of our method, the work of the technical staff is split, which ensures that the process model is correct and has low role complexity at the same time. The result shows that our method performs better in discovering simplified role-based process models. In fact, the roles have high cohesion and low coupling.
The role mining algorithms proposed in [12, 13] obtained the role hierarchy through the combination of permissions, based on participants and their permissions. These algorithms identified roles based on permissions and ignored the difference between the activities of participants. Phalp and Shepperd measured the role's complexity by surveying internal activities and interactions between roles [14]. They did not give full consideration to cohesion and coupling between roles. In addition, [15] considered the complexity of the application of resources, but it ignored the internal cohesion of roles. In comparison, our method treats the similarity between activities by different participants as the basis for identifying roles, and it is based on genetic mining, so it deals with noise in workflow logs more effectively. Additionally, it measures the complexity of roles through cohesion and coupling in terms of activities and resources. Therefore, the role complexity makes the process model correct and simple.
5. Experiments
In order to analyze the performance of the proposed algorithm, we select some event logs produced by 8 workflow models, shown partly in Table 1, to perform some experiments.
As can be seen from Figure 2, when the population size is small, the fitness value is low. And when the population size is bigger than 200, the fitness value no longer increases. So, the population size is set to 200.
As shown in Figure 3, the time spent by the algorithm increases with the maximum number of iterations. When the maximum number of iterations is small, the algorithm stops before finding the optimal solution, so the solution is questionable. When the maximum number of iterations reaches 5000, the time spent by the algorithm remains stable, which means that the optimal solution is found within 5000 iterations. So, the maximum number of iterations is set to 5000.
Except for processes 3 and 5, the role complexity of process models mined by our method is lower than that by I-GA in Figure 4. The reason is that the role complexity of process models is not considered in I-GA. In processes 3 and 5, the role complexity of the process models may not be reducible any further. Therefore, the algorithm we propose can reduce the role complexity of mined process models.
In Figure 5, we can see that the fitness values of the process models mined by the two methods are relatively close. That means that, though our method considers role complexity, it has little adverse effect on the fitness.
Through these experiments, we can see that the algorithm performs better when mining simpler process models, because it uses the role complexity of process models. Therefore, it can reduce the role complexity when mining process models.
6. Conclusions
In this paper, we combine genetic programming with the role complexity and propose a role-oriented process mining approach. The advantage of our method is that it mines process models that are not only correct but also simple. In the future, we will consider the relationship between process roles more comprehensively and reduce the role complexity further. Besides, we can improve the efficiency of model mining through an improved genetic algorithm.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported by the Natural Science Foundation of China (no. 71071038). Many thanks are due to Hongzhi Hu for her helpful work to the corresponding author Weihui Dai of this paper.
References
[1] J. E. Cook, “Process discovery and validation through event-data analysis,” Tech. Rep. CU-CS-817-96, University of Colorado, Boulder, Colo, USA, 1996.
[2] R. Agrawal, D. Gunopulos, and F. Leymann, “Mining process models from workflow logs,” in Proceedings of the 6th International Conference on Extending Database Technology, pp. 469–483, Valencia, Spain, 1998.
[3] W. M. P. van der Aalst and B. F. van Dongen, “Discovering workflow performance models from timed logs,” in Proceedings of the 1st International Conference on Engineering and Deployment of Cooperative Information Systems, pp. 45–63, Beijing, China, 2002.
[4] A. K. A. de Medeiros, A. J. M. M. Weijters, and W. M. P. van der Aalst, “Genetic process mining: an experimental evaluation,” Journal of Data Mining and Knowledge Discovery, vol. 14, no. 2, pp. 245–304, 2007.
[5] K. Tian and Z. Qingxin, “Study of workflow reconstruction based on hybrid genetic algorithm,” Computer Science, vol. 34, no. 1, pp. 103–105, 2007.
[6] K. B. Lassen and W. M. P. van der Aalst, “Complexity metrics for Workflow nets,” Information and Software Technology, vol. 51, no. 3, pp. 610–626, 2009.
[7] C. A. Ellis, A. J. Rembert, K. H. Kim et al., “Beyond workflow mining,” in Business Process Management, vol. 4102 of Lecture Notes in Computer Science, pp. 49–64, Springer, 2006.
[8] M. Song and W. M. P. van der Aalst, “Towards comprehensive support for organizational mining,” Decision Support Systems, vol. 46, no. 1, pp. 300–317, 2008.
[9] J. Cardoso, “Process control-flow complexity metric: an empirical validation,” in Proceedings of the IEEE International Conference on Services Computing (SCC '06), pp. 167–173, IEEE Computer Society, Chicago, Ill, USA, September 2006.
[10] I. Vanderfeesten, H. A. Reijers, and W. M. P. van der Aalst, “Evaluating workflow process designs using cohesion and coupling metrics,” Computers in Industry, vol. 59, no. 5, pp. 420–437, 2008.
[11] W. D. Zhao, W. H. Dai, A. H. Wang et al., “Role-activity diagrams modeling based on workflow mining,” in Proceedings of the World Congress on Computer Science and Information Engineering (CSIE '09), pp. 301–305, Los Angeles, Calif, USA, April 2009.
[12] M. Kuhlmann, D. Shohat, and G. Schimpf, “Role mining: revealing business roles for security administration using data mining technology,” in Proceedings of the 8th ACM Symposium on Access Control Models and Technologies, pp. 179–186, ACM, June 2003.
[13] J. Schlegelmilch and U. Steffens, “Role mining with ORCA,” in Proceedings of the 10th ACM Symposium on Access Control Models and Technologies, pp. 168–176, ACM, June 2005.
[14] K. Phalp and M. Shepperd, “Quantitative analysis of static models of processes,” Journal of Systems and Software, vol. 52, no. 2-3, pp. 105–112, 2000.
[15] A. Kumar, R. M. Dijkman, and M. Song, “Optimal resource assignment in workflows for maximizing cooperation,” July 2013, http://www.personal.psu.edu/faculty/a/x/axk41/coop13.pdf.
Copyright © 2014 Weidong Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.