Abstract

Scientific workflow is a valuable tool for various complicated large-scale data processing applications. In recent years, the increasingly growing number of scientific processes available necessitates the development of recommendation techniques to provide automatic support for modelling scientific workflows. In this paper, with the help of heterogeneous information network (HIN) and tags of scientific workflows, we organize scientific workflows as a HIN and propose a novel scientific workflow similarity computation method based on metapath. In addition, the density peak clustering (DPC) algorithm is introduced into the recommendation process and a scientific workflow recommendation approach named HDSWR is proposed. The effectiveness and efficiency of our approach are evaluated by extensive experiments with real-world scientific workflows.

1. Introduction

Scientific workflows are an effective and important means of handling data-intensive, computation-intensive, and collaboration-intensive scientific problems in many large-scale complex systems and applications from domains such as physics, astronomy, chemistry, bioinformatics, and life sciences [1–3]. In practice, many scientific workflows have been successfully deployed and executed on clouds. Recently, with the rapid development of smart user devices and edge computing, a number of studies have been carried out to construct and execute workflows in a cloud-edge collaborative manner [4, 5].

Scientific workflow modelling plays an important role in complex scientific workflow applications, but it is a complex and error-prone process. In recent years, more and more scientific workflows have been published on the Web and shared in repositories such as CrowdLabs, SHIWA, Galaxy, and myExperiment [6, 7]. People can leverage and repurpose parts of existing scientific workflows for specific complex applications rather than constructing new ones from scratch. However, as the number of scientific workflows grows, finding suitable ones among a sea of candidates becomes a new problem for scientists and engineering personnel. Although process retrieval methods can help by retrieving similar scientific workflow fragments from repositories, much manual work is still required. Consequently, to provide better automatic support, it is necessary to build effective scientific workflow recommendation techniques, which are fundamental to the reuse and repurposing of existing scientific workflows.

In scientific workflow repositories, various types of data can be used for recommendation, including scientific workflow structure and annotation. However, the tags of scientific workflows are usually neglected by existing scientific workflow recommendation methods. In fact, these tags contain much valuable information, and different underlying logical relations among scientific workflows can be explored through them. For example, many tags in the myExperiment repository are shared by multiple scientific workflows, which indicates partial similarity relations among these workflows. Therefore, integrating tags with other information about scientific workflows is a promising way to generate more accurate recommendations.

On the other hand, the heterogeneous information network (HIN) has proved to be a powerful modelling method for incorporating heterogeneous types of information, and it has been successfully applied in recommender systems [8, 9]. Motivated by the HIN-based recommendation idea and the data characteristics of scientific workflow repositories, we integrate multiple types of scientific workflow data into a HIN and use a metapath-based technique to measure the similarity and distance between scientific workflows. In this way, multiple metapaths can be combined with the semantic description information of scientific workflows to obtain more accurate similarity values.

With these observations, in this paper, we propose a heterogeneous information network-based approach for recommending scientific workflows to scientists and engineering personnel. In our approach, different data objects and underlying logical relations on scientific workflows are organized as a HIN, according to which the similarity between scientific workflows is evaluated. In addition, to facilitate the reuse and repurposing of current scientific workflows, the density peak clustering (DPC) algorithm [10] is introduced and used to group candidates into clusters. Our main contributions are summarized as follows:
(1) We propose a new representation form of scientific workflow based on HIN, which is enriched through the incorporation of multiple types of data, including tags and the logical relations of such data.
(2) We build a metapath-based method to assess the similarity between scientific workflows, where the similarity is calculated according to the objects of tag, description, activity, and subscientific workflow involved in scientific workflows.
(3) We present a HIN- and DPC-based scientific workflow recommendation approach named HDSWR to generate more accurate recommendations and, on this basis, to facilitate the reuse and repurposing of current scientific workflows for scientists and engineering personnel.
(4) We provide two real-world datasets with tags on scientific workflows for experiments.

The remainder of this paper is organized as follows. Section 2 describes the related studies. Section 3 introduces some notations and basic definitions used in the paper. Section 4 presents the scientific workflow similarity computation method. In Section 5, we propose the HDSWR approach. Then, we evaluate our method in Section 6. Section 7 concludes this paper.

2. Related Work

In this section, we briefly review related work on workflow models, workflow recommendation, and HIN.

A workflow model is fundamental for various workflow applications. In practice, workflows can be modelled by different tools such as directed acyclic graphs (DAGs), Petri nets, event-driven process chains (EPCs), the business process execution language (BPEL), or the fairly complex business process modelling notation (BPMN) language which has over 100 symbols [11]. However, modelling workflows is always a knowledge-intensive and laborious task. To improve workflow modelling, methods such as workflow mining [12] have been proposed to discover workflow models from event logs. However, similar to process retrieval, much manual work is still involved.

In recent years, some workflow recommendation approaches have been proposed. Current techniques can be mainly classified into two types: business workflow (process) recommendation and scientific workflow recommendation.

In the business process management domain, business workflows are usually modelled with block structures, including sequential, alternative, parallel, and iterative structures. So far, only a limited number of business workflow recommendation methods have been proposed to serve different purposes, which can be classified into complete process recommendation and process fragment (node) recommendation [13]. For example, Zhang et al. [14] leveraged workflow provenance to recommend a set of nodes for a partial workflow. Li et al. [15] adopted minimum depth-first-search codes and string edit distances for representing and recommending business workflow fragments. Deng et al. [13] developed a recommendation system that generates a sorted candidate node set, using a subgraph mining method to extract patterns from process repositories. Wang et al. [16] utilized the properties of business process repositories and proposed a representation-learning-based recommendation method.

Scientific workflows are based on the automation of scientific processes, which are typically composed of multiple scientific programs or Web services. Compared with business workflows, scientific workflows have a strong focus on the dataflow to support a variety of data-intensive applications, in which the control structure simply describes the partial ordering of tasks. Therefore, scientific workflows are usually modelled with unstructured DAGs, which conceptually use a set of nodes and edges instead of complex block structures. However, similar to business workflow recommendation, there are two kinds of work in scientific workflow recommendation. For instance, Zhang et al. [17] used the term unit of work (UoW) to represent a collection of services (i.e., fragments of a scientific workflow) chained together, based on which a UoW-driven scientific workflow recommendation framework and three algorithms for UoW mining and recommendation were proposed. Cheng et al. [18, 19] converted a scientific workflow into a layer hierarchy in a tree style, where the hierarchical relations specify the links between a scientific workflow, its subworkflows, and its activities. Based on it, a semantic similarity computation algorithm considering the layer hierarchy and the description of scientific workflows was proposed for clustering and recommending appropriate scientific workflows. Krzywucki and Polak [20] utilized semantic-type comparison to evaluate the similarity of scientific workflows. Bergmann et al. [21] proposed a semantic workflow graph-based method for modelling scientific workflow similarity and developed a search-based algorithm for workflow similarity computation. Starlinger et al. [7] presented a layer decomposition approach for the comparison and similarity search of scientific workflows. Mohan et al. [22] developed several folksonomy-based scientific workflow recommendation strategies and implemented them in a prototype system.

HIN is a newly emerging direction in recommender systems and a good candidate for improving the accuracy of recommendations. However, to the best of our knowledge, HIN has been largely neglected in the workflow recommendation literature. So far, most HIN-based recommendation methods consider metapath-based similarity. For example, Sun et al. [8] investigated a similarity search problem in HIN and introduced the concept of metapath-based similarity. Zhao et al. [23] introduced the concept of metagraph to incorporate more complex semantics for HIN-based recommendation. Shi et al. [24] developed a metapath-based random walk strategy and proposed a HIN embedding-based recommendation algorithm. On the other hand, scientific workflows in repositories have rich tag information, which is seldom exploited by existing workflow recommendation methods. Some research work related to tags has been done in the domain of Service Computing [25], and other related research work on service recommendation was carried out in [26]. Our previous work in [27] made preliminary use of scientific workflow tags for recommendation. In this paper, we further organize scientific workflows and their relations as a HIN to calculate the similarity of scientific workflows and generate more accurate recommendations.

3. Preliminaries

To make our approach well understood, we first introduce HIN and relevant concepts in this section. The notations we will use throughout this paper are summarized in Table 1.

Definition 1. (Scientific Workflow [18]). A scientific workflow sw is a tuple (nm, sw_dsc, sw_D, sw_A, sw_L, sw_T), where nm and sw_dsc are the name and text description of sw, respectively. sw_D is the set of subscientific workflows that sw invokes. sw_A is the activity set of sw. sw_L denotes a set of links connecting activities and subscientific workflows in sw. sw_T is a set of tags on sw.
Generally, a subscientific workflow can be regarded as a scientific workflow [7]. For example, in the myExperiment repository, a subscientific workflow is stored as an independent scientific workflow.

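Definition 2. (Heterogeneous Information Network [24, 28]). A heterogeneous information network is defined as a directed graph $G = (V, E)$ with an object-type mapping function $\phi: V \rightarrow B$ and a link-type mapping function $\psi: E \rightarrow R$, where $B$ and $R$ denote the sets of object types and link types, respectively, satisfying $|B| + |R| > 2$.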

Definition 3. (HIN-Based Scientific Workflow Representation). The scientific workflow can be organized and represented as a heterogeneous information network, which contains five object types: scientific workflow (denoted as SW), tag (denoted as T), activity (denoted as A), subscientific workflow (denoted as D), and description (denoted as dsc). Each scientific workflow can link with a set of tags, a set of activities, and a set of subscientific workflows, and a description.

Example 1. An example of HIN-based scientific workflow representation is shown in Figure 1, which consists of two real-world scientific workflows named Chemical2URIs (https://www.myexperiment.org/workflows/97.html) (denoted as $sw_1$) and DFCUAM (https://www.myexperiment.org/workflows/4700.html) (denoted as $sw_2$).
The $sw_1$ links with a text description ($dsc_1$), three tags (annotation, chemspider, and cheminformatics), two activities (REST_Service and Xpath_Service), and two subscientific workflows (CNTCI and workflow40).
The $sw_2$ links with a text description ($dsc_2$), three tags (cheminformatics, chemspider, and metabolomics), and two activities (SearchByMass and GetCompoundDetails).
Besides, $sw_1$ and $sw_2$ are linked by the two tags that they share (cheminformatics and chemspider). Similarly, if some objects of subscientific workflow, activity, or description are shared by two scientific workflows, there exists a link relation between these two scientific workflows.
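
The representation in Definition 3 and Example 1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the `Workflow` class, the `build_hin` helper, and the edge encoding are hypothetical, the links sw_L are omitted, and the description strings are placeholders rather than the real myExperiment texts.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical container mirroring Definition 1 (links sw_L omitted)."""
    name: str
    description: str
    tags: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    subworkflows: set = field(default_factory=set)


def build_hin(workflows):
    """Return typed edges (workflow name, object type, object) of the HIN."""
    edges = set()
    for sw in workflows:
        edges.add((sw.name, "dsc", sw.description))
        edges.update((sw.name, "T", t) for t in sw.tags)
        edges.update((sw.name, "A", a) for a in sw.activities)
        edges.update((sw.name, "D", d) for d in sw.subworkflows)
    return edges


# The two workflows of Example 1; descriptions are placeholders.
sw1 = Workflow("Chemical2URIs", "dsc_1 (placeholder)",
               tags={"annotation", "chemspider", "cheminformatics"},
               activities={"REST_Service", "Xpath_Service"},
               subworkflows={"CNTCI", "workflow40"})
sw2 = Workflow("DFCUAM", "dsc_2 (placeholder)",
               tags={"cheminformatics", "chemspider", "metabolomics"},
               activities={"SearchByMass", "GetCompoundDetails"})

hin = build_hin([sw1, sw2])
shared_tags = sw1.tags & sw2.tags  # objects through which sw1 and sw2 are linked
print(sorted(shared_tags))
```

Storing typed edges in a flat set keeps the sketch short; a graph library could be used instead, but nothing in the text requires one.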

Definition 4. (Network Schema [24, 28]). The network schema, denoted as $S = (B, R)$, is a meta template for a heterogeneous information network $G = (V, E)$ with the object-type mapping function $\phi: V \rightarrow B$ and the link-type mapping function $\psi: E \rightarrow R$, which is a directed graph defined over object types $B$ and link types $R$.
According to Definition 4, we can construct a HIN-based scientific workflow representation schema, which is shown in Figure 2. There are five types of objects: scientific workflow (SW), tag (T), activity (A), subscientific workflow (D), and description (dsc). Besides, there exist four types of links between objects to represent different relations:
(1) A link relation between a scientific workflow and a tag.
(2) A link relation between a scientific workflow and an activity.
(3) A link relation between a scientific workflow and a subscientific workflow.
(4) A link relation between a scientific workflow and a description. This link relation is one-way, because a specific text description belongs to a specific scientific workflow.

Definition 5. (Metapath [8, 24]). A metapath $p$ is a path defined on a network schema $S = (B, R)$ and is represented in the form of $B_1 \xrightarrow{R_1} B_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} B_{l+1}$, and thus defines a composite relationship $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between two object types $B_1$ and $B_{l+1}$, where $\circ$ denotes the composition operator on relations.
According to Definition 5 and the HIN-based scientific workflow representation schema, we can construct four types of metapaths, which are shown in Figure 3:
(1) Metapath $p_1$: if a tag is shared by two scientific workflows $sw_i$ and $sw_j$, we can use the metapath SWTSW (Scientific Workflow → Tag → Scientific Workflow) to indicate a cotag relation between $sw_i$ and $sw_j$.
(2) Metapath $p_2$: if an activity is shared by two scientific workflows $sw_i$ and $sw_j$, we can use the metapath SWASW (Scientific Workflow → Activity → Scientific Workflow) to denote a coactivity relation between $sw_i$ and $sw_j$.
(3) Metapath $p_3$: if a subscientific workflow is shared by two scientific workflows $sw_i$ and $sw_j$, we can use the metapath SWDSW (Scientific Workflow → Sub-Scientific Workflow → Scientific Workflow) to denote a relation between $sw_i$ and $sw_j$ on a subscientific workflow.
(4) Metapath $p_4$: if a description is shared by two scientific workflows $sw_i$ and $sw_j$, we can use the metapath SWdscSW (Scientific Workflow → dsc → Scientific Workflow) to denote a relation between $sw_i$ and $sw_j$ on a description.
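
To make the metapath notion concrete, the following sketch enumerates the instances of a metapath of the form Scientific Workflow → X → Scientific Workflow between pairs of workflows. It is an illustration only: the dictionary layout and the function name `metapath_instances` are assumptions introduced here, and the data are the two workflows of Example 1.

```python
def metapath_instances(links, obj_type):
    """Enumerate instances (sw_a, object, sw_b) of the metapath SW -> obj_type -> SW.

    `links` maps a workflow name to {object type: set of objects}.
    """
    names = sorted(links)
    instances = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            objs_a = links[a].get(obj_type, set())
            objs_b = links[b].get(obj_type, set())
            for obj in sorted(objs_a & objs_b):
                instances.append((a, obj, b))
    return instances


links = {
    "Chemical2URIs": {"T": {"annotation", "chemspider", "cheminformatics"},
                      "A": {"REST_Service", "Xpath_Service"},
                      "D": {"CNTCI", "workflow40"}},
    "DFCUAM": {"T": {"cheminformatics", "chemspider", "metabolomics"},
               "A": {"SearchByMass", "GetCompoundDetails"},
               "D": set()},
}

swtsw = metapath_instances(links, "T")  # cotag relations (metapath SWTSW)
swasw = metapath_instances(links, "A")  # coactivity relations (metapath SWASW)
print(swtsw)
print(swasw)
```

For these two workflows, only the SWTSW metapath has instances, which matches the observation in Example 1 that the workflows are linked through their two shared tags.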

4. Similarity Computation for Scientific Workflows

Based on the basic definitions mentioned above, we propose a novel scientific workflow similarity computation method in this section. It mainly consists of four steps.

Step 1: Construct three adjacency matrices on the objects of tag, activity, and subscientific workflow.

According to the objects of tag, activity, and subscientific workflow involved in the scientific workflows, we can construct three adjacency matrices, denoted as SWT, SWA, and SWD, respectively. Each row of an adjacency matrix corresponds to a specific scientific workflow. Each column of SWT, SWA, and SWD corresponds to a specific tag, activity, and subscientific workflow, respectively. Each entry is 1 or 0, which denotes whether the object belongs to the scientific workflow.

Besides, for computational convenience, we use the feature vector $T_i$ to represent the relation between the scientific workflow $sw_i$ and all the tag objects involved, which corresponds to a row in the adjacency matrix SWT. Likewise, we use the feature vectors $A_i$ and $D_i$ to represent the relations between $sw_i$ and the objects of activity and subscientific workflow involved, which correspond to a row in the adjacency matrices SWA and SWD, respectively.

Step 2: Calculate the similarity on the metapaths.

As mentioned in Section 3, there exist four types of metapaths. The similarity strength of $sw_i$ and $sw_j$ on the metapath SWTSW can be calculated by the following equation:

$$s_T(sw_i, sw_j) = T_i \cdot T_j^{\top}. \tag{1}$$

In equation (1), $T_i$ and $T_j$ are the feature vectors of scientific workflows $sw_i$ and $sw_j$ on tags, respectively, and $T_j^{\top}$ is the transpose of the feature vector $T_j$. The more tags $sw_i$ and $sw_j$ have in common, the greater the inner product of $T_i$ and $T_j$, and thus the higher the similarity between $sw_i$ and $sw_j$ on tags.

Likewise, the similarity strengths of $sw_i$ and $sw_j$ on the metapaths SWASW and SWDSW can be obtained by equations (2) and (3), respectively, where the meaning of the notations is similar to that in equation (1):

$$s_A(sw_i, sw_j) = A_i \cdot A_j^{\top}, \tag{2}$$

$$s_D(sw_i, sw_j) = D_i \cdot D_j^{\top}. \tag{3}$$

Based on equation (1), we can also obtain the values of $s_T(sw_i, sw_i)$ and $s_T(sw_j, sw_j)$. To normalize the similarity strength, we use the ratio between $s_T(sw_i, sw_j)$ and the larger of $s_T(sw_i, sw_i)$ and $s_T(sw_j, sw_j)$ to represent the similarity between scientific workflows $sw_i$ and $sw_j$ with respect to the metapath SWTSW, which is described as follows:

$$sim_T(sw_i, sw_j) = \frac{s_T(sw_i, sw_j)}{\max\{s_T(sw_i, sw_i),\ s_T(sw_j, sw_j)\}}. \tag{4}$$

Analogously, the similarity between scientific workflows $sw_i$ and $sw_j$ with respect to the metapaths SWASW and SWDSW is described as follows:

$$sim_A(sw_i, sw_j) = \frac{s_A(sw_i, sw_j)}{\max\{s_A(sw_i, sw_i),\ s_A(sw_j, sw_j)\}}, \qquad sim_D(sw_i, sw_j) = \frac{s_D(sw_i, sw_j)}{\max\{s_D(sw_i, sw_i),\ s_D(sw_j, sw_j)\}}. \tag{5}$$

Step 3: Calculate the similarity value on the descriptions of scientific workflows.

The doc2vec model can learn a fixed-length feature from a variable-length text [29]. Therefore, we utilize the doc2vec model to form the paragraph vectors $v_i$ and $v_j$ for the descriptions of scientific workflows $sw_i$ and $sw_j$, respectively. The normalized cosine similarity between $v_i$ and $v_j$ is then calculated as the similarity value on the descriptions of $sw_i$ and $sw_j$, which is described as:

$$sim_{dsc}(sw_i, sw_j) = \frac{v_i \cdot v_j}{\|v_i\| \, \|v_j\|}. \tag{6}$$

In equation (6), the notations $\|v_i\|$ and $\|v_j\|$ represent the norms of the paragraph vectors $v_i$ and $v_j$, respectively.

Step 4: Summarize different similarity values.

To fuse the different similarities of scientific workflows obtained by the above steps, we introduce a weighting mechanism, which is described as:

$$Sim(sw_i, sw_j) = \alpha \cdot sim_T(sw_i, sw_j) + \beta \cdot sim_A(sw_i, sw_j) + \gamma \cdot sim_D(sw_i, sw_j) + \delta \cdot sim_{dsc}(sw_i, sw_j). \tag{7}$$

In equation (7), α, β, γ, and δ are the weight coefficients satisfying α + β + γ + δ = 1.
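
The four steps above can be sketched in a few lines of pure Python. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the assignment of α, β, γ, and δ to tags, activities, subscientific workflows, and descriptions is an assumption, the zero-denominator handling is an assumption, and the description vectors are made-up stand-ins for doc2vec paragraph vectors.

```python
import math


def strength(u, v):
    """Inner product of two 0/1 feature vectors (similarity strength, eqs. (1)-(3))."""
    return sum(x * y for x, y in zip(u, v))


def path_similarity(u, v):
    """Strength normalized by the larger self-strength (eqs. (4)-(5)).

    Returning 0.0 when both vectors are all-zero is an assumption.
    """
    denom = max(strength(u, u), strength(v, v))
    return strength(u, v) / denom if denom else 0.0


def cosine(p, q):
    """Cosine similarity of two paragraph vectors (eq. (6))."""
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in q))
    return sum(x * y for x, y in zip(p, q)) / norm if norm else 0.0


def fused_similarity(f_i, f_j, weights):
    """Weighted sum of the four similarities (eq. (7)); weights should sum to 1."""
    alpha, beta, gamma, delta = weights  # assumed order: tag, activity, sub-workflow, dsc
    return (alpha * path_similarity(f_i["T"], f_j["T"])
            + beta * path_similarity(f_i["A"], f_j["A"])
            + gamma * path_similarity(f_i["D"], f_j["D"])
            + delta * cosine(f_i["dsc"], f_j["dsc"]))


# Toy feature vectors; the "dsc" entries are made-up paragraph vectors.
f_i = {"T": [1, 1, 1, 0], "A": [1, 1, 0, 0], "D": [1, 1], "dsc": [0.2, 0.4]}
f_j = {"T": [0, 1, 1, 1], "A": [0, 0, 1, 1], "D": [0, 0], "dsc": [0.1, 0.2]}
sim = fused_similarity(f_i, f_j, (0.25, 0.25, 0.25, 0.25))
print(round(sim, 4))
```

In this toy case the two workflows share two of three tags, no activities, and no subscientific workflows, so only the tag and description terms contribute to the fused value.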

5. HDSWR Approach

To improve the accuracy and efficiency of scientific workflow recommendation, we propose an approach named HDSWR. In this section, we provide an overview of the HDSWR and introduce its related function algorithms in detail.

5.1. Overview of the HDSWR Approach

The proposed HDSWR approach is shown in Algorithm 1, which consists of four steps:
Step 1 (line 1): we construct a matrix Matrix to denote the similarity values between the scientific workflows in the list SW, which may come from a scientific workflow repository. All the scientific workflows in SW are organized as a HIN for similarity computation.
Step 2 (line 2): we adopt the density peak clustering (DPC) algorithm [10] to group all the scientific workflows in SW into multiple clusters, where the similarity values in Matrix are used to obtain the distances between scientific workflows and $C$ denotes the set of clusters on scientific workflows.
Step 3 (lines 3-4): according to the textual descriptions in the requirement of scientists and engineering personnel, i.e., requirement.dscs, we search for and choose appropriate objects of activity and subscientific workflow involved in SW, where $D_s$ and $A_s$ denote a set of subscientific workflows and a set of activities, respectively. Then, a HIN-based sample scientific workflow $sw_s$ can be constructed (line 4).
Step 4 (line 5): according to the sample scientific workflow $sw_s$, we first select an appropriate group of scientific workflows in the set $C$ by the similarity values between $sw_s$ and the different clusters. Then, a list $L_{rec}$ is generated for recommendation, where the number of scientific workflows in the list is determined by the parameter rec_K.

Algorithm 1: HDSWR.
Input:
(i) SW: a list of scientific workflows.
(ii) requirement: a modelling requirement given as a triple, whose component dscs is a list of descriptions on activities and subscientific workflows.
(iii) α, β, γ, δ: parameters for similarity computation.
(iv) rec_K: a parameter on the number of recommended scientific workflows.
Output:
(i) L_rec: a list of recommended scientific workflows.
(1) Matrix ← ComputeSimilarity(SW, α, β, γ, δ)
(2) C ← DPCClustering(Matrix, SW)
(3) D_s, A_s ← GetActivity_SubWF(requirement.dscs)
(4) sw_s ← construct a sample scientific workflow with D_s, A_s, and the other two components of requirement
(5) L_rec ← RecommendSWs(sw_s, C, rec_K, α, β, γ, δ)
(6) return L_rec

5.2. Similarity Computation

Assessing workflow similarity is important for workflow recommendation. Its main purpose is to measure the distances between workflows. Based on the scientific workflow similarity computation method introduced in Section 4, the function ComputeSimilarity is described as Algorithm 2.

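Algorithm 2: ComputeSimilarity.
Input:
(i) SW: a list of scientific workflows.
(ii) α, β, γ, δ: weight coefficients.
Output:
(i) Matrix: the final similarity matrix of SW.
(1) SWT ← construct the adjacency matrix of SW on tag objects
(2) SWA ← construct the adjacency matrix of SW on activity objects
(3) SWD ← construct the adjacency matrix of SW on sub-scientific workflow objects
(4) for each scientific workflow sw_i in SW do
(5)   for each scientific workflow sw_j in SW do
(6)     obtain the feature vectors T_i, A_i, D_i and T_j, A_j, D_j from SWT, SWA, SWD
(7)     calculate s_T(sw_i, sw_j), s_A(sw_i, sw_j), s_D(sw_i, sw_j)
(8)     calculate sim_T(sw_i, sw_j), sim_A(sw_i, sw_j), sim_D(sw_i, sw_j), sim_dsc(sw_i, sw_j)
(9)     calculate Sim(sw_i, sw_j)
(10)    Matrix[i][j] ← Sim(sw_i, sw_j)
(11)  end for
(12) end for
(13) return Matrix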

In Algorithm 2, three adjacency matrices on the scientific workflow list SW are constructed first (lines 1–3). Then, the feature vectors of scientific workflows $sw_i$ and $sw_j$ are used to compute the similarity strengths on metapaths by equations (1)–(3) (lines 6-7), based on which the similarities between $sw_i$ and $sw_j$ with respect to the metapaths can be obtained by equations (4)–(6) (line 8). Finally, the similarity value is obtained by equation (7) (line 9) and stored in the matrix Matrix for further clustering and recommendation (line 10).

Example 2. The scientific workflows Chemical2URIs ($sw_1$) and DFCUAM ($sw_2$) in Figure 1 can be used as an example. As illustrated by Figure 1, there are four tags (annotation, chemspider, cheminformatics, and metabolomics) involved in $sw_1$ and $sw_2$. Therefore, as shown in Figure 4(a), the corresponding value on each of these four tags in the adjacency matrix SWT is 1 or 0 with respect to $sw_1$ and $sw_2$, where the value 0 denotes that the tag does not belong to the scientific workflow. Similarly, the matrix SWA in Figure 4(b) shows the corresponding values on the activities of $sw_1$ and $sw_2$, and the matrix SWD in Figure 4(c) shows the corresponding values on the subscientific workflows. Besides, the feature vectors on tags, activities, and subscientific workflows are also illustrated in Figure 4.
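
The matrices described in Example 2 can be reproduced with a short sketch. The helper `adjacency` and the dictionary layout are assumptions made for illustration, and the column order (alphabetical) may differ from the order used in Figure 4.

```python
def adjacency(workflows, key):
    """Build a 0/1 matrix with one row per workflow and one column per object."""
    columns = sorted(set().union(*(sw[key] for sw in workflows)))
    rows = [[1 if obj in sw[key] else 0 for obj in columns] for sw in workflows]
    return columns, rows


# The two workflows of Figure 1 (tags and sub-scientific workflows only).
workflows = [
    {"name": "Chemical2URIs",
     "tags": {"annotation", "chemspider", "cheminformatics"},
     "subworkflows": {"CNTCI", "workflow40"}},
    {"name": "DFCUAM",
     "tags": {"cheminformatics", "chemspider", "metabolomics"},
     "subworkflows": set()},
]

tag_cols, SWT = adjacency(workflows, "tags")
sub_cols, SWD = adjacency(workflows, "subworkflows")

T1, T2 = SWT
D1, D2 = SWD
# Similarity strength and its normalized value on the tag metapath.
s_T = sum(a * b for a, b in zip(T1, T2))
sim_T = s_T / max(sum(T1), sum(T2))  # for 0/1 vectors, the self-strength is the row sum
# Same computation on the sub-scientific workflow metapath.
s_D = sum(a * b for a, b in zip(D1, D2))
sim_D = s_D / max(sum(D1), sum(D2))
print(tag_cols, SWT, sim_T, sim_D)
```

With two shared tags out of three per workflow, the tag similarity is 2/3, while the similarity on subscientific workflows is 0 because DFCUAM invokes none.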

5.3. DPC-Based Clustering of Scientific Workflows

To improve the efficiency of recommendation, we introduce the clustering strategy proposed in [10, 30], by which the scientific workflows are grouped and divided into different clusters for further recommendation. Different from the work in [10, 30], we choose the density peak clustering (DPC) algorithm [10] as our clustering method, because it can effectively identify clusters with different distribution shapes and it is rarely affected by noise points. Based on the DPC algorithm, the function DPCClustering can be described as Algorithm 3.

Algorithm 3: DPCClustering.
Input:
(i) Matrix: a similarity matrix of scientific workflows.
(ii) SW: a list of scientific workflows.
Output:
(i) C: the set of generated scientific workflow clusters.
(1) Dist ← initialize the distance matrix according to Matrix
(2) dc ← select a value from Dist so that the number of values below it is around 1% to 2% of the total number of values in Dist
(3) for each scientific workflow sw_i in SW do
(4)   ρ_i ← 0
(5)   for each scientific workflow sw_j in SW do
(6)     if Dist[i][j] < dc then
(7)       ρ_i ← ρ_i + 1
(8)     end if
(9)   end for
(10) end for
(11) for each scientific workflow sw_i do
(12)   δ_i ← max_j Dist[i][j]
(13)   for each scientific workflow sw_j do
(14)     if ρ_j > ρ_i and Dist[i][j] < δ_i then
(15)       δ_i ← Dist[i][j]
(16)     end if
(17)   end for
(18) end for
(19) C ← clustering scientific workflows by the DPC algorithm with the local density values such as ρ_i and the relative distance values such as δ_i
(20) return C

In Algorithm 3, we first initialize the distance matrix Dist according to the matrix Matrix (line 1) and set the value of the cutoff distance dc according to the rule of thumb introduced in [10] (line 2). Then, we calculate the local density values of scientific workflows (lines 3–10) and their relative distance values (lines 11–18). Finally, we apply the DPC algorithm to divide the scientific workflows into different clusters (line 19), where each cluster in $C$ can be denoted as a group of scientific workflows with one scientific workflow as its cluster center.
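
The density and relative-distance computations of the DPC algorithm can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the text: the number of clusters is given in advance, cluster centers are the points with the largest product of local density and relative distance, density ties are broken by index, and the toy distance matrix is built from points on a line rather than from workflow similarities.

```python
def dpc(dist, dc, n_clusters):
    """Simplified density peak clustering on a symmetric distance matrix."""
    n = len(dist)
    # Local density: number of other points closer than the cutoff distance dc.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc) for i in range(n)]
    # Process points from densest to sparsest; ties are broken by index (assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            # Densest point: relative distance is its largest distance to any point.
            delta[i] = max(dist[i])
        else:
            # Relative distance: distance to the nearest point of higher density.
            parent[i] = min(order[:pos], key=lambda j: dist[i][j])
            delta[i] = dist[i][parent[i]]
    # Centers: largest rho * delta (a common heuristic, assumed here).
    centers = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Each remaining point joins the cluster of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            if parent[i] == -1:
                nearest = min(centers, key=lambda c: dist[i][c])
                labels[i] = labels[nearest]
            else:
                labels[i] = labels[parent[i]]
    return rho, delta, centers, labels


# Toy data: six points on a line forming two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
rho, delta, centers, labels = dpc(dist, dc=1.5, n_clusters=2)
print(rho, centers, labels)
```

In the recommendation setting, `dist` would be derived from the similarity matrix of the scientific workflows, and each center would play the role of a cluster center scientific workflow.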

5.4. Retrieval of Appropriate Activities and Subscientific Workflows

According to the modelling requirement of scientists and engineering personnel, we can search in the scientific workflow list and get appropriate activities and subscientific workflows, which can be used to construct a sample scientific workflow and guide the recommendation process. Such procedure is performed by the function GetActivity_SubWF, which is described as Algorithm 4.

Algorithm 4: GetActivity_SubWF.
Input:
(i) requirement.dscs: a list of descriptions on activities and subscientific workflows.
(ii) SW: a list of scientific workflows.
Output:
(i) D_s: a set of subscientific workflows.
(ii) A_s: a set of activities.
(1) D_s ← ∅, A_s ← ∅
(2) for each dsc in dscs do
(3)   max_a ← 0, max_d ← 0
(4)   for each sw in SW do
(5)     for each activity a in sw do
(6)       sim ← cosine_sim(doc2vec(dsc), doc2vec(a))
(7)       if sim > max_a then
(8)         max_a ← sim
(9)         best_a ← a
(10)      end if
(11)    end for
(12)    for each sub-scientific workflow d in sw do
(13)      sim ← cosine_sim(doc2vec(dsc), doc2vec(d))
(14)      if sim > max_d then
(15)        max_d ← sim
(16)        best_d ← d
(17)      end if
(18)    end for
(19)  end for
(20)  if max_a > max_d then
(21)    append best_a to A_s
(22)  else
(23)    append best_d to D_s
(24)  end if
(25) end for
(26) return D_s, A_s

In Algorithm 4, because the descriptions requirement.dscs provided in the requirement are related to activities or subscientific workflows, the best matching result on each description in requirement.dscs may be an activity or a subscientific workflow. Therefore, we calculate the similarity values on activities and subscientific workflows, respectively, where the working procedure of the function cosine_sim in lines 6 and 13 is similar to that of equation (6).

Besides, for each description in requirement.dscs, we search for the best matching activity (lines 5–11) and the best matching subscientific workflow (lines 12–18), and then choose the better of the two for constructing a sample scientific workflow (lines 20–24).
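
The retrieval procedure can be illustrated with the sketch below. To keep it self-contained, a bag-of-words count vector is swapped in for the doc2vec embedding, so the similarity values differ from those a trained doc2vec model would give. The data layout and the component descriptions are hypothetical, and skipping a description when nothing matches is an assumption.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for doc2vec: a bag-of-words count vector (an assumption)."""
    return Counter(text.lower().split())


def cosine_sim(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def get_activity_subwf(dscs, workflows):
    """For each description, keep the best matching activity or sub-workflow."""
    sub_wfs, activities = [], []
    for dsc in dscs:
        query = embed(dsc)
        best_a, max_a, best_d, max_d = None, 0.0, None, 0.0
        for sw in workflows:
            for name, text in sw["activities"].items():
                sim = cosine_sim(query, embed(text))
                if sim > max_a:
                    max_a, best_a = sim, name
            for name, text in sw["subworkflows"].items():
                sim = cosine_sim(query, embed(text))
                if sim > max_d:
                    max_d, best_d = sim, name
        if max_a > max_d:
            activities.append(best_a)
        elif best_d is not None:  # nothing is appended if no component matches
            sub_wfs.append(best_d)
    return sub_wfs, activities


# Hypothetical component descriptions, used only for illustration.
workflows = [{
    "activities": {"SearchByMass": "search compounds by mass",
                   "GetCompoundDetails": "get compound details"},
    "subworkflows": {"CNTCI": "map chemical name to chemspider id"},
}]
dscs = ["map a chemical name to a chemspider identifier", "search by mass"]
sub_wfs, activities = get_activity_subwf(dscs, workflows)
print(sub_wfs, activities)
```

The first description matches the subscientific workflow best and the second matches an activity best, so each description contributes exactly one component to the sample scientific workflow.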

5.5. Generation of Scientific Workflow Candidate List

Once a sample scientific workflow is constructed, we can generate a list of the scientific workflows that are most relevant to it. The whole procedure is described as Algorithm 5.

Algorithm 5: RecommendSWs.
Input:
(i) sw_s: a sample scientific workflow.
(ii) C: the set of scientific workflow clusters.
(iii) rec_K: a hyper-parameter to control the number of recommended scientific workflows.
(iv) α, β, γ, δ: weight coefficients.
Output:
(i) L_rec: a list of recommended scientific workflows.
(1) construct the feature vectors on the activities, tags, and sub-scientific workflows of sw_s
(2) max_sim ← 0 and c* ← ∅
(3) for each cluster c in C do
(4)   sw_c ← choose the cluster center scientific workflow of c
(5)   construct the feature vectors on the activities, tags, and sub-scientific workflows of sw_c
(6)   calculate s_T(sw_s, sw_c), s_A(sw_s, sw_c), s_D(sw_s, sw_c)
(7)   calculate sim_T(sw_s, sw_c), sim_A(sw_s, sw_c), sim_D(sw_s, sw_c)
(8)   calculate Sim(sw_s, sw_c)
(9)   sim ← Sim(sw_s, sw_c)
(10)  if max_sim < sim then
(11)    max_sim ← sim
(12)    c* ← c
(13)  end if
(14) end for
(15) L_rec ← choose the top rec_K% most similar scientific workflows in c*
(16) return L_rec

Algorithm 5 mainly consists of three steps:
Step 1 (line 1): as introduced before, we construct the feature vectors of the sample scientific workflow $sw_s$ on the objects of activity, subscientific workflow, and tag.
Step 2 (lines 3–14): we first compute the similarity between $sw_s$ and the cluster center scientific workflow of each cluster (lines 4–8), following the method introduced in Section 4. Then, a cluster is selected as $c^{*}$ if the similarity value between its cluster center and $sw_s$ is the largest among all the clusters (lines 9–13).
Step 3 (line 15): after the cluster $c^{*}$ is determined, the rec_K% of scientific workflows in $c^{*}$ that are most similar to $sw_s$ are chosen as candidate scientific workflows and recommended in a list.
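
The three steps can be sketched as follows. This is an illustration under stated assumptions: the similarity function is passed in as a parameter, and here a tag-only similarity stands in for the full fused similarity; the rounding rule for rec_K% (round up, at least one workflow) is an assumption; and the workflow names and most tags are hypothetical.

```python
import math


def recommend(sample, clusters, rec_k, similarity):
    """Pick the cluster whose center is most similar to the sample,
    then return the top rec_k percent of its members.

    `clusters` is a list of (center, members) pairs.
    """
    best_members, max_sim = [], -1.0
    for center, members in clusters:
        sim = similarity(sample, center)
        if sim > max_sim:
            max_sim, best_members = sim, members
    ranked = sorted(best_members, key=lambda sw: similarity(sample, sw), reverse=True)
    # Rounding up and keeping at least one workflow is an assumption.
    top_n = max(1, math.ceil(len(ranked) * rec_k / 100)) if ranked else 0
    return ranked[:top_n]


# Hypothetical tag sets; only the chemistry-related tags appear in Figure 1.
TAGS = {
    "sample": {"chemspider", "cheminformatics"},
    "sw1": {"annotation", "chemspider", "cheminformatics"},
    "sw2": {"cheminformatics", "chemspider", "metabolomics", "mass"},
    "sw3": {"genomics", "blast"},
    "sw4": {"genomics"},
}


def tag_similarity(a, b):
    """Tag-only stand-in for the fused similarity: shared tags over the larger tag set."""
    ta, tb = TAGS[a], TAGS[b]
    return len(ta & tb) / max(len(ta), len(tb))


clusters = [("sw1", ["sw1", "sw2"]), ("sw3", ["sw3", "sw4"])]
top_half = recommend("sample", clusters, 50, tag_similarity)
whole_cluster = recommend("sample", clusters, 100, tag_similarity)
print(top_half, whole_cluster)
```

Because only cluster centers are compared with the sample before ranking, the number of similarity computations grows with the number of clusters plus the size of the selected cluster, rather than with the size of the whole repository.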

5.6. An Example on Textual Descriptions

So far, research studies for recommending whole scientific workflows typically adopt the scientists’ requirements for recommendation. For example, Cheng et al. [18] used a layer hierarchy with respect to the scientist’s requirement. In our approach, we mainly adopt textual descriptions with respect to the scientist’s requirement. For ease of illustration, the scientific workflow in Figure 1 is used as an example on textual descriptions.

Example 3. As illustrated by Figure 1, there exist a subscientific workflow named CNTCI, which is short for Chemical_Name_To_Chemspider_ID, and a subscientific workflow named Workflow40 in the scientific workflow Chemical2URIs ($sw_1$). We can get the textual description of $sw_1$, i.e., “This workflow will map a chemical name or identifier to uniform resource identifiers (URIs). First the ChemSpider web service is used to map the chemical name to a ChemSpider identifier, then the ChemSpider identifier is mapped to URIs via the Open PHACTS platform.”
According to the textual description of $sw_1$, we can use the doc2vec model to learn the sequence relationship between the subscientific workflows CNTCI and Workflow40. Furthermore, in this way, similar structural information involved in scientific workflows can also be obtained and used for the retrieval of appropriate activities and subscientific workflows, some of which can be performed with the function cosine_sim in Algorithm 4. Similarly, logical relationships among the components of scientific workflows can also be clearly described in the scientist’s requirement. Therefore, although these structural features are not explicitly expressed in the form of HIN, they are implicitly considered and used in our proposed approach for generating more accurate recommendations.

6. Experiments

In this section, a series of experiments are performed to answer two questions: (1) Compared with the state-of-the-art scientific workflow recommendation techniques, does our approach have better performance? (2) What is the performance of our HDSWR approach in the presence of different parameters and datasets used for recommendation?

All experiments are performed on a computer with an Intel(R) Core(TM) i5-7300HQ CPU @ 2.50 GHz and 8 GB memory, running Windows 10, JDK 1.8.0, and Python 3.5. Next, we focus on experimental evaluations of these two questions.

6.1. Datasets

The myExperiment is a widely used scientific workflow repository supporting the publication and sharing of scientific workflows. It also allows scientists to search for scientific workflows related to their research and then reuse and repurpose them according to their distinct needs [31]. There are various types of scientific workflows in the myExperiment, such as Taverna 1 and Taverna 2. We crawled related data on the Taverna 2 type of scientific workflows from the myExperiment and created two datasets named SW#80 and SW#236 accordingly. The datasets used in our experiments are publicly accessible from GitHub via the website: https://github.com/yixinxunwu/myExperiment.

As Table 2 shows, the SW#80 dataset includes 80 scientific workflows with 229 activities, 125 tags, and 85 subscientific workflows, where the number of activities contained in each scientific workflow is in the range of 3 to 20. The SW#236 dataset includes 236 scientific workflows with 430 activities, 310 tags, and 243 subscientific workflows, where the number of activities contained in each scientific workflow is in the range of 2 to 30.

6.2. Evaluation Metrics

To evaluate the quality of scientific workflow recommendations, we adopt the precision and recall measures used in [18] and the F1 score used in [16] as our evaluation metrics, which are defined by equations (8)–(10), respectively.

In equations (8)–(10), one list contains the scientific workflows generated by the recommendation algorithm (the recommended list), and the other contains the scientific workflows that are expected (the expected list). Similar to the work in [18], we generate the expected list by selecting the top exc_K% most similar scientific workflows in a dataset. The remaining two symbols denote the numbers of scientific workflows in the recommended list and the expected list, respectively.
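As an illustration of how these metrics can be computed, the following sketch assumes the usual set-based definitions: precision is the share of recommended workflows that are expected, recall is the share of expected workflows that are recommended, and F1 is their harmonic mean. The names rec_list and exp_list and the workflow identifiers are illustrative stand-ins, not the paper's notation or data.

```python
from typing import Hashable, Sequence, Tuple


def precision_recall_f1(rec_list: Sequence[Hashable],
                        exp_list: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a recommended and an expected list."""
    rec, exp = set(rec_list), set(exp_list)
    hits = len(rec & exp)  # workflows that are both recommended and expected
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(exp) if exp else 0.0
    # F1 is the harmonic mean; it is defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical workflow identifiers, for illustration only.
rec_list = ["w1", "w2", "w3", "w4"]
exp_list = ["w2", "w4", "w7"]
precision, recall, f1 = precision_recall_f1(rec_list, exp_list)
print(precision, recall, f1)
```

In this toy case two of the four recommended workflows are expected and two of the three expected workflows are recommended, so precision is 1/2, recall is 2/3, and F1 is 4/7.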

6.3. Methods Used for Experiments

The scientific workflow recommendation methods used for experiments are as follows:
(i) LH [18]: this method first converts a scientific workflow into a hierarchy that captures the relationships among scientific workflows, subscientific workflows, and activities. Thus, the similarity assessment between scientific workflows becomes the similarity assessment between their hierarchies.
(ii) LHWT [27]: this method first transforms a scientific workflow into a hierarchy, as described in [18]. Because tag information labels the functional semantics of a scientific workflow, it is useful in similarity computation. Hence, this method also uses tag information in scientific workflow recommendation.
(iii) HDSWR: this is our proposed recommendation approach. In our experiments, some parameters for HDSWR are set as follows: , , and .

6.4. Comparison with Related Scientific Workflow Recommendation

As described in Section 6.2, the evaluation metrics are based on the list of recommended scientific workflows and the list of expected scientific workflows, which are determined by the parameters rec_K% and exc_K%, respectively. Therefore, we study the impact of rec_K% and exc_K% on the different recommendation methods with the SW#80 dataset.

To investigate the impact of rec_K% on recommendation precision and recall, exc_K% is fixed at 10% and rec_K% is set to 4%, 6%, …, 30% (step size 2%). As shown in Figures 5(a) and 5(b), HDSWR and LHWT achieve higher precision and recall than LH. This is because some functions implemented in scientific workflows are not mentioned in their descriptions but only in their tags [27]. As a result, without tags it is difficult for these scientific workflows to be gathered into the appropriate clusters. When tag information is considered, they are reaggregated into the appropriate clusters. This demonstrates that the functional semantics of tags have a great impact on scientific workflow recommendation. We also find that HDSWR is superior to LHWT in precision and recall, because HDSWR applies metapaths to capture the weak semantics between scientific workflows and thus achieves recommendation based on higher-level semantics than LHWT.

When rec_K% is set to a relatively small value (e.g., 4% or 6%), the precision and recall of the methods are extremely close. This indicates that the scientific workflows most similar to the sample scientific workflow are recommended to scientists regardless of the recommendation method. When rec_K% is set to a relatively large value, the precision of the methods drops greatly, as shown in Figure 5(a), because many unrelated scientific workflows that do not appear in the expected list are recommended. Meanwhile, the recall of the methods is relatively stable in Figure 5(b), because the expected list is determined by exc_K%, which is fixed. Furthermore, the recall of HDSWR becomes stable once rec_K% reaches 14%, which shows that most scientific workflows in the expected list have been identified and recommended to scientists by HDSWR at that point. The recall of LHWT becomes stable when rec_K% is 18%, and the recall of LH does not become stable until rec_K% is 22%.
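The interplay between a growing recommended list and a fixed expected list can be reproduced on a toy example. The sketch below uses a synthetic ranking of 80 workflow identifiers (the same size as SW#80, but not the paper's data) and assumes that a list of K% is obtained by integer floor of K% of the dataset size; both the ranking and the rounding rule are assumptions made for illustration.

```python
from typing import List, Set, Tuple

DATASET_SIZE = 80  # same size as SW#80; the rankings below are synthetic
EXC_K = 10         # exc_K% is fixed at 10%, as in the experiment


def top_percent(ranking: List[int], percent: int) -> List[int]:
    """Top `percent`% of a ranking, using integer floor (assumed rounding rule)."""
    return ranking[: len(ranking) * percent // 100]


# Synthetic reference ranking: ids sorted by true similarity to the sample.
reference_ranking = list(range(DATASET_SIZE))
expected: Set[int] = set(top_percent(reference_ranking, EXC_K))  # ids 0..7

# Synthetic recommender ranking: the eight expected ids are interleaved with
# three unrelated ids (79, 78, 77), followed by all remaining ids.
recommender_ranking = [0, 1, 2, 79, 3, 4, 78, 5, 77, 6, 7] + list(range(8, 77))


def sweep() -> List[Tuple[int, float, float]]:
    """Precision and recall for rec_K% = 4%, 6%, ..., 30%."""
    rows = []
    for rec_k in range(4, 31, 2):
        recommended = top_percent(recommender_ranking, rec_k)
        hits = len(expected.intersection(recommended))
        rows.append((rec_k, hits / len(recommended), hits / len(expected)))
    return rows


for rec_k, precision, recall in sweep():
    print(f"rec_K={rec_k:2d}%  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy ranking every expected workflow has been retrieved by rec_K% = 14%, so recall stays at 1 from that point on, while each further increase of rec_K% only adds unrelated workflows and lowers precision.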

To study the impact of exc_K% on recommendation precision and recall, rec_K% is fixed at 10% and exc_K% is set to 4%, 6%, …, 30%. As shown in Figures 5(c) and 5(d), the precision and recall of HDSWR are higher than those of LH and LHWT, for the reasons given above. When exc_K% is set to a relatively large value, the expected list contains many scientific workflows, while the recommended list is fixed. Therefore, the precision of the methods is stable. However, because the discrepancy between the recommended list and the expected list keeps increasing, the recall of all methods declines.

To show the difference in recommendation performance more intuitively, we use the F1 score. Figures 6(a) and 6(b) show the impact of rec_K% or exc_K% on F1. The differences between HDSWR, LHWT, and LH are small in the first two groups (i.e., when the value of rec_K% is 4% and 6%, respectively), which indicates that the scientific workflows most similar to the sample scientific workflow are easily recommended by every method. As rec_K% or exc_K% increases, the differences between the methods become distinct, and the difference between HDSWR and the other methods becomes obvious. This demonstrates that HDSWR captures the similarity semantics between scientific workflows effectively and thus promotes reasonable clustering of scientific workflows. When rec_K% exceeds 24% and exc_K% exceeds 22%, the differences between the methods become stable, which indicates that the choice of recommendation method has little further effect once an excessive number of scientific workflows are recommended.

6.5. Detailed Analysis of the Proposed Approach

In this part, we conduct a series of experiments to analyse the details of our proposed method.

6.5.1. Impact of Clustering Method

As described in Section 5.3, HDSWR requires the DPC clustering algorithm to group scientific workflows into appropriate clusters and thereby assist scientific workflow recommendation. Therefore, the impact of the clustering algorithm on scientific workflow recommendation is worth studying. In our previous work [27], the SNN (Shared Nearest Neighbour) clustering algorithm [30] was used to cluster scientific workflows. In this study, the DPC clustering algorithm is used instead. Figure 7 compares the performance of the two clustering algorithms, DPC and SNN, on the dataset SW#80.

As shown in Figure 7, the overall recommendation performance ranking is DPC > SNN. SNN performs poorly because it treats some data points below the density threshold, as well as points within their neighbourhoods, as noise. In contrast, DPC assigns every scientific workflow to a cluster and achieves better recommendation performance than SNN.
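For readers unfamiliar with DPC, the following is a minimal sketch of the generic density peak clustering procedure on a precomputed distance matrix, using a cutoff kernel for the local density. It is not the HDSWR implementation: the cutoff distance d_c, the selection of centres as the n_clusters points with the largest product of density and separation, and the one-dimensional toy data are all assumptions made for illustration. In HDSWR the distances would presumably be derived from the metapath-based similarity.

```python
from typing import List


def density_peak_clustering(dist: List[List[float]], d_c: float,
                            n_clusters: int) -> List[int]:
    """Cluster points given a symmetric distance matrix; returns one label per point."""
    n = len(dist)
    # Local density: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    rank = {i: pos for pos, i in enumerate(order)}

    delta = [0.0] * n            # distance to the nearest denser point
    nearest_denser = [-1] * n    # index of that point
    for pos, i in enumerate(order):
        if pos == 0:
            # The densest point has no denser neighbour; use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:pos], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j

    # Centres: largest rho * delta; the densest point always ranks first.
    centres = sorted(range(n), key=lambda i: (-rho[i] * delta[i], rank[i]))[:n_clusters]
    labels = [-1] * n
    for label, i in enumerate(centres):
        labels[i] = label
    # Every other point joins the cluster of its nearest denser neighbour,
    # which has already been labelled because points are visited by density.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels


# Toy data: six points on a line forming two well-separated groups.
positions = [0.0, 1.0, 2.0, 50.0, 51.0, 52.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
print(density_peak_clustering(toy_dist, d_c=1.5, n_clusters=2))
```

Note that every point receives a label, including sparse ones, which is the property that distinguishes this procedure from clustering methods that discard low-density points as noise.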

6.5.2. Impact of the Size of Datasets

To study the impact of dataset size on the recommendation performance of the methods, we conduct a series of experiments with the three methods on the dataset SW#236, which contains a relatively larger amount of data. The experimental setting is the same as that for the dataset SW#80.

As shown in Figures 8(a) and 8(b), the HDSWR approach has better recommendation performance than the other methods, both on the dataset SW#80 with a small amount of data and on the dataset SW#236 with a relatively large amount of data. This indicates that the HDSWR approach is robust and that recommendation performance can be effectively improved by considering the attribute information of scientific workflows. In addition, we find that the gap between the recommendation performance of the LHWT and HDSWR approaches on the dataset SW#236 is smaller than that on the dataset SW#80.

6.5.3. Comparison of the Time Efficiency

To evaluate the time efficiency of the HDSWR approach, we conduct a series of experiments with the datasets SW#236 and SW#80. Table 3 shows the average running time (in seconds) of the three methods on the two datasets.

As shown in Table 3, the HDSWR approach has a shorter running time than the other methods. In fact, similarity computation accounts for most of the running time of all three methods, while clustering takes little time. The LHWT method is built on the LH method and simply adds extra label information to the similarity computation; therefore, LHWT needs more running time than LH. In contrast, the similarity computation adopted by the HDSWR approach is based on the HIN, which is entirely different from that of the other methods, and it effectively reduces the time spent handling the various kinds of information used for similarity computation.

7. Conclusion

In this paper, we aim to provide automatic support for the reuse and modelling of scientific workflows. Specifically, we use a heterogeneous information network to organize and represent the relations between scientific workflows, and we consider tag, description, activity, and subscientific workflow objects for scientific workflow recommendation. We propose a novel scientific workflow similarity computation method based on metapaths. In addition, we present a scientific workflow recommendation approach named HDSWR, in which the density peak clustering algorithm groups scientific workflows into clusters and a list of scientific workflows is ranked and recommended according to the requirements of scientists and engineering personnel. As future work, we intend to study how machine learning methods [32–35] can be applied to automatically tune some parameters of HDSWR and yield better performance. Furthermore, we will address related privacy problems in view of the latest research [36–41].

Data Availability

The datasets used in our experiments are publicly accessible via the following website: https://github.com/yixinxunwu/myExperiment.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The research in this paper was supported by the National Key Research and Development Project of China (Nos. 2018YFB1702600 and 2018YFB1702602), National Natural Science Foundation of China (Nos. 61772193, 61402167, 61872139, and 61876062), Hunan Provincial Natural Science Foundation of China (Nos. 2017JJ4036 and 2018JJ2139), and Research Foundation of Hunan Provincial Education Department of China (Nos. 17K033 and 19A174).