Abstract

In hybrid cloud environments, reasonable data placement strategies are critical to the efficient execution of scientific workflows. Due to varying loads, bandwidth fluctuations, and network congestion between different data centers, as well as the dynamics of hybrid cloud environments, the data transmission time is uncertain. This uncertainty poses huge challenges to efficient data placement for scientific workflows. However, most traditional data placement solutions focus on deterministic cloud environments, which leads to excessive data transmission time for scientific workflows. To address this problem, we propose an adaptive discrete particle swarm optimization algorithm based on the fuzzy theory and genetic algorithm operators (DPSO-FGA) to minimize the fuzzy data transmission time of scientific workflows. The DPSO-FGA can rationally place scientific workflow data while meeting the requirements of data privacy and the capacity limitations of data centers. Simulation results show that the DPSO-FGA can effectively reduce the fuzzy data transmission time of scientific workflows in hybrid cloud environments.

1. Introduction

With the widespread application of Big Data technologies, the amount of data generated in modern network environments is increasing rapidly. Therefore, traditional distributed computing modes such as grid computing may not meet the requirements of massive data processing. In recent years, cloud computing has emerged as a research hotspot [1-5], where hybrid cloud environments offer the advantages of high sharing, high availability, and customization. Specifically, a hybrid cloud environment is composed of data centers distributed across different geographical locations, including multiple private and public data centers [6]. On the one hand, the public cloud provides high reliability and large capacity through resource sharing. On the other hand, the private cloud offers high flexibility and security, which guarantees data privacy during the work process.

Due to the complexity of the work process and increasing data volumes, scientific research with strict work steps can no longer be managed manually. To address this problem, the workflow technology was proposed [7], where scientific workflows [8] can be used to manage, monitor, and execute these scientific processes. However, the amount of data involved in scientific workflows is commonly huge, and the data may need to be stored in data centers in different geographical locations and transferred across data centers during the operation of scientific workflows. Therefore, it has become a research hotspot to effectively perform data placement for scientific workflows in hybrid cloud environments under limited bandwidth conditions with the goal of reducing data transmission time [9-12].

Some researchers have contributed to addressing the problem of data placement for scientific workflows. Yuan et al. [13] proposed a data placement method based on k-means clustering, which utilized the dependency among data and considered load balancing among data centers. Reddy et al. [14] designed an entropy-based data placement strategy for enhancing the map-reduce performance in Hadoop clusters, where the k-means clustering algorithm was used to group different datasets. However, this method was not suitable for hybrid cloud environments with data centers of different capacities. Li et al. [15] designed a data placement solution for hybrid data centers, which can reduce the data transmission time. Also for hybrid data centers, the data placement strategy proposed in [16] reduced the data transmission volume and the number of data transmissions across data centers. However, these studies did not consider some essential factors in data placement, such as differences between data centers (e.g., capacities and bandwidth) and bandwidth fluctuations. Zheng et al. [17] and Cui et al. [18] developed data placement schemes based on the genetic algorithm (GA), which may easily fall into local optima during operation. As for optimization objectives in data placement, Liu et al. [19] set the number of transmissions across data centers as the objective, Deng et al. [20] and Zhao et al. [21] targeted the data transmission volume, and Chen et al. [22] aimed to reduce the transmission costs. However, these methods did not involve the network bandwidth and its fluctuations, and thus, it is hard for them to map the data transmission time from their models to real-world network environments [23].

Moreover, most traditional data placement strategies assume deterministic environments. However, uncertainty is an essential feature of network environments and may have a significant impact on data transmission [24]. Due to varying loads between data centers, bandwidth fluctuations, network congestion, and other hardware characteristics, the data transmission time may change even if the same data are transmitted between fixed data centers. Therefore, uncertainty should be considered when building the data placement model for scientific workflows. In response to this uncertainty, the fuzzy theory has emerged as an effective tool [25]. Sun et al. [26] and Lei [27] fuzzified the processing time, completion time, and deadline, and then studied job scheduling under specific constraints. Based on the analytic hierarchy process (AHP) model, a data placement strategy was proposed in [28] to select the most suitable storage sites, which applied fuzzy comprehensive evaluation to candidate data centers for different users. However, these works did not address the data placement problem of scientific workflows, and the quantities they fuzzified and optimized were not the data transmission time.

To address the above problems, we propose an effective data placement strategy for scientific workflows in hybrid cloud environments. The main contributions of this paper are summarized as follows:
(i) We define and model the data placement problem for scientific workflows in hybrid cloud environments. Specifically, we fuzzify the data transmission time into triangular fuzzy numbers and regard it as the optimization objective of the proposed model.
(ii) Based on the problem definitions and modeling, we propose the DPSO-FGA to reduce the fuzzy data transmission time while considering the uncertainty of data transmission time, the different numbers and capacities of private data centers, and network bandwidth limitations, so that the method can adapt well to real-world network environments.
(iii) We validate the effectiveness of the proposed DPSO-FGA with various scientific workflows in hybrid cloud environments, showing that it outperforms the classic CFRA and CFGA methods in terms of fuzzy data transmission time.

The rest of this paper is organized as follows. Section 2 defines the data placement problem for scientific workflows in hybrid cloud environments. Section 3 discusses the proposed DPSO-FGA in detail. Section 4 presents the performance evaluation of the proposed method with simulation experiments. Finally, Section 5 concludes this paper and outlines future work.

2. Problem Definitions and Modeling

2.1. Problem Definitions

Definition 1. Hybrid cloud environment
A hybrid cloud environment consists of public and private data centers, where each private data center has a certain capacity, while each public data center has no capacity limitation. Thus, a hybrid cloud environment DC is defined as the union of the set of public data centers and the set of private data centers, where dci represents the i-th data center and Vi indicates the maximum capacity of dci. Specifically, the capacity of a public data center is unlimited, while a private data center reserves some storage space with the upper limit Vi. Each data center dci also carries an attribute marking whether it is public or private: a public data center can be used to store only public data, while a private data center can be used to store both public and private data. For any two data centers dci and dcj, bij represents the network bandwidth between them, which is assumed to be known and to fluctuate within a certain range.
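For illustration, Definition 1 can be represented with a simple data structure. The following Python sketch is ours: the names (DataCenter, is_private, bandwidth) and the toy capacities and bandwidths are illustrative assumptions, not the paper's notation.

from dataclasses import dataclass, field
from typing import Dict
import math

@dataclass
class DataCenter:
    """One data center dc_i of the hybrid cloud environment DC (Definition 1)."""
    index: int
    is_private: bool               # private data centers may also hold private datasets
    capacity: float = math.inf     # V_i; public data centers are treated as unlimited
    bandwidth: Dict[int, float] = field(default_factory=dict)   # bandwidth[j] = b_ij

# A toy hybrid cloud: one public data center and two capacity-limited private ones.
dc1 = DataCenter(index=1, is_private=False)
dc2 = DataCenter(index=2, is_private=True, capacity=500.0, bandwidth={1: 20.0})
dc3 = DataCenter(index=3, is_private=True, capacity=800.0, bandwidth={1: 30.0, 2: 150.0})
DC = {dc.index: dc for dc in (dc1, dc2, dc3)}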

Definition 2. Scientific workflow
A scientific workflow is a data-intensive application consisting of tasks and datasets, where a task may be related to multiple datasets and a dataset may also be related to multiple tasks. There is a data dependency relationship between tasks, where the output datasets of a task may be the input datasets of other tasks. Meanwhile, there is also a sequential relationship between tasks, where a task may only be executed after all its predecessor tasks have been executed. After all the tasks are completed, the scientific workflow ends. In particular, a task without a predecessor is a beginning task, and a task without a successor is an ending task. Moreover, datasets can be divided into initial and generated datasets, where the original input datasets of a scientific workflow are the initial datasets and the datasets produced during the running process are the generated datasets. Also, datasets can be divided into private and public datasets, where private datasets can only be stored in private data centers and the tasks using them as input datasets must also be scheduled to the same data centers. By contrast, public datasets have no restriction on storage locations. Therefore, a scientific workflow is defined as a directed acyclic graph (DAG), denoted by G = (T, E, DS), where T is the set of tasks in G, E is the set of data dependencies between different tasks in G, and DS is the set of datasets in G. Specifically, tc represents the c-th task and eij indicates the data dependency between tasks ti and tj, where eij = 1 indicates that ti is the direct predecessor of tj. Moreover, dsl is the l-th dataset, Ii is the set of input datasets of ti, Oi is the set of output datasets of ti, and DC(ti) is the data center executing ti. Furthermore, each dataset dsi has a size, gti is the number of the task that generates dsi (gti of an initial dataset is 0), and lci is the serial number of the data center storing dsi.
It should be noted that the placement of privacy datasets in scientific workflows needs to satisfy three logical rules. Specifically, for a task ti in the hybrid cloud environment DC, when the set of its input or output datasets {Ii, Oi} contains a privacy dataset dsi, the following holds:
(i) Rule 1. The privacy dataset dsi cannot be transmitted across data centers, so the data center executing any task that uses dsi as an input or output is fixed.
(ii) Rule 2. The storage location of dsi must be consistent with the execution location of the task that uses it.
(iii) Rule 3. Each privacy dataset can only be stored in a private data center, and its storage location cannot be changed.
According to Definition 2, private datasets can only be stored in private data centers, and their storage locations cannot be changed (Rule 3). As shown in Rule 1, private datasets cannot be transmitted across data centers, and thus, the data center for executing a task that uses such a dataset as an input or output must be fixed. As stated in Rule 2, the locations of private datasets for fixed tasks must be consistent with the execution locations of the tasks, so the privacy datasets cannot be stored in other data centers; otherwise, the tasks cannot be executed. A sketch of checking these rules for a candidate placement is given below.
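These rules can be checked programmatically for a candidate placement. The sketch below is a minimal illustration under the assumption that a placement maps dataset names to records holding their size, privacy flag, and storage location; the names Dataset, Task, and violates_privacy_rules are ours, not the paper's.

from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Dataset:
    name: str
    size: float
    is_private: bool
    location: int            # lc_i: serial number of the data center storing the dataset

@dataclass
class Task:
    name: str
    inputs: List[str]        # names of the input datasets I_i
    outputs: List[str]       # names of the output datasets O_i

def violates_privacy_rules(task: Task, placement: Dict[str, Dataset],
                           exec_dc: int, private_dcs: Set[int]) -> bool:
    """Return True if executing `task` on data center `exec_dc` breaks the rules:
    every private dataset must stay in a private data center, and a task that reads
    or writes a private dataset must run on the data center holding that dataset."""
    for name in task.inputs + task.outputs:
        ds = placement[name]
        if ds.is_private:
            if ds.location not in private_dcs:
                return True    # private data may only reside in private data centers
            if ds.location != exec_dc:
                return True    # the fixed task must run where its private data lives
    return False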

Definition 3. Fuzzy data transmission time
When optimizing uncertainty problems, three types of theories are commonly used: the probability theory, the gray theory, and the fuzzy theory. Specifically, the probability theory applies to sampling problems with massive samples, the gray theory is suitable for problems with fewer samples, and the fuzzy theory can be used to solve problems whose concepts have unclear extensions [29]. For the data placement problem of scientific workflows, the fuzzy theory can be regarded as an effective tool because the problem has no clear boundary and its constraints involve uncertainty.
In past research, the data transmission time was usually defined as the ratio of the dataset size to the bandwidth between data centers, without considering other essential factors such as bandwidth fluctuations. However, the data transmission time is uncertain in real-world network environments. In response to this uncertainty, triangular fuzzy numbers from the fuzzy theory are introduced to represent the data transmission time. For each independent data transmission process, the mapping {dci, dsk, dcj} indicates that the dataset dsk is transmitted from the data center dci to dcj, and its fuzzy data transmission time is represented as a triangular fuzzy number (tl, tm, tu), where tl and tu are the lower and upper bound elements of the triangular fuzzy number, respectively, and tm is its most likely value. When tl = tm = tu, the triangular fuzzy number degenerates into a real number. Moreover, the membership function μ(x) indicates the degree to which an element x belongs to the fuzzy interval; when μ(x) = 1, the element x completely belongs to the interval. The membership function of the triangular fuzzy number is defined as μ(x) = (x − tl)/(tm − tl) for tl ≤ x ≤ tm, μ(x) = (tu − x)/(tu − tm) for tm < x ≤ tu, and μ(x) = 0 otherwise.

Definition 4. Calculation of fuzzy numbers
(1) The model involves addition and comparison operations between fuzzy numbers. For two triangular fuzzy numbers s = (s1, s2, s3) and t = (t1, t2, t3), these operations are defined as follows:
(i) Addition operation (used to calculate the fuzzy data transmission time): s + t = (s1 + t1, s2 + t2, s3 + t3).
(ii) Comparison operation (used to compare fuzzy completion times and choose suitable values). Following the literature [30], three ranking criteria are defined for a triangular fuzzy number; two fuzzy numbers are first compared by the first criterion, then by the second criterion if the first values are equal, and finally by the third criterion.
(2) The model involves addition, subtraction, multiplication, division, fuzzification, and defuzzification operations between fuzzy and real numbers. For a triangular fuzzy number s = (s1, s2, s3) and a real number t, these operations are defined as follows:
(i) Addition and subtraction operations: s ± t = (s1 ± t, s2 ± t, s3 ± t).
(ii) Multiplication and division operations: s × t = (s1 t, s2 t, s3 t) and s ÷ t = (s1/t, s2/t, s3/t) for t > 0.
(iii) Fuzzification and defuzzification operations. On the one hand, the fuzzification operation, following the literature [26], turns a crisp value into a triangular fuzzy number whose lower and upper bounds are obtained by scaling the crisp value with given fuzzy parameters. On the other hand, the defuzzification operation is commonly used to quantitatively compare fuzzy numbers and analyze results. Li [31] defined the mean and standard deviation of fuzzy numbers under the uniform distribution and the proportional distribution, where the proportional distribution is suitable for the uncertainty problem of data transmission time. For a triangular fuzzy number s, the mean reflects the most likely value of the fuzzy number under the probability measure, the standard deviation reflects the degree of uncertainty of the fuzzy number, and a weight balances the two in the defuzzified value.
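For illustration, the fuzzy-number operations used by the model can be sketched in Python as follows. The componentwise addition and scaling follow the standard triangular fuzzy arithmetic described above, while rank_value and the fuzzification coefficients a1 and a2 are illustrative stand-ins for the exact criteria of [30, 31] and the parameters of [26], which may differ.

from dataclasses import dataclass

@dataclass(frozen=True)
class TriFuzzy:
    """Triangular fuzzy number (lower bound, most likely value, upper bound)."""
    low: float
    mid: float
    up: float

    def __add__(self, other: "TriFuzzy") -> "TriFuzzy":
        # Componentwise addition, used to accumulate fuzzy transmission times.
        return TriFuzzy(self.low + other.low, self.mid + other.mid, self.up + other.up)

    def scale(self, t: float) -> "TriFuzzy":
        # Multiplication by a non-negative real number t.
        return TriFuzzy(self.low * t, self.mid * t, self.up * t)

    def rank_value(self) -> float:
        # Illustrative ranking/defuzzification value (weighted average of the bounds).
        return (self.low + 2.0 * self.mid + self.up) / 4.0

    def __lt__(self, other: "TriFuzzy") -> bool:
        return self.rank_value() < other.rank_value()

def fuzzify(t: float, a1: float = 0.85, a2: float = 1.2) -> TriFuzzy:
    # Illustrative fuzzification of a crisp time t, with assumed coefficients a1 < 1 < a2.
    return TriFuzzy(a1 * t, t, a2 * t)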

Definition 5. Data placement strategy
The purpose of effective data placement is to reduce the data transmission time while meeting the order of task execution, the privacy settings of datasets, and the capacity constraints of data centers. A task can only be executed after all the datasets it requires have been transmitted to the same data center. Moreover, the time of scheduling a task to a data center is much shorter than the data transmission time [32], and thus, the model focuses on the data placement strategy. Before executing a task, the data center with the least fuzzy data transmission time is chosen to schedule the task. Therefore, the data placement strategy S is defined by the mapping M between the set of datasets DS and the set of data centers DC, where {dci, dsk, dcj} indicates that the dataset dsk is transmitted from the data center dci to dcj and has a corresponding fuzzy data transmission time. The total fuzzy data transmission time during the operation of scientific workflows is the sum of the fuzzy transmission times of all such transfers, where eijk ∈ {0, 1} indicates whether the transfer {dci, dsk, dcj} occurs during this time (eijk = 1 for yes and eijk = 0 for no). A sketch of this accumulation is given below.
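As a minimal sketch of how the total fuzzy data transmission time of a placement could be accumulated, assuming the crisp time size/bandwidth is fuzzified with illustrative coefficients (the paper's exact fuzzy parameters are not reproduced here):

from typing import Dict, List, Tuple

Fuzzy = Tuple[float, float, float]   # (lower, most likely, upper) transmission time

def total_fuzzy_time(transfers: List[Tuple[int, str, int]],
                     sizes: Dict[str, float],
                     bandwidth: Dict[Tuple[int, int], float],
                     a1: float = 0.85, a2: float = 1.2) -> Fuzzy:
    """Sum the fuzzy transmission times of all cross-data-center transfers.
    Each transfer (i, ds, j) moves dataset ds from data center i to data center j."""
    low = mid = up = 0.0
    for i, ds, j in transfers:
        if i == j:
            continue                          # no transmission within one data center
        t = sizes[ds] / bandwidth[(i, j)]     # crisp time = size / bandwidth
        low, mid, up = low + a1 * t, mid + t, up + a2 * t
    return (low, mid, up)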

2.2. Modeling

According to the above definitions, the data placement problem for scientific workflows is modeled based on the fuzzy theory, with the objective of minimizing the total fuzzy data transmission time subject to the capacity constraints of data centers: for each private data center dci, the total size of the datasets stored in it must not exceed its maximum capacity Vi, where uij ∈ {0, 1} indicates whether the dataset dsj is stored in the data center dci (uij = 1 for yes and uij = 0 for no).

3. Effective Data Placement for Scientific Workflows Based on DPSO-FGA

In light of the advantages of the particle swarm optimization (PSO) algorithm, the genetic algorithm (GA), and the fuzzy theory, we propose an adaptive discrete particle swarm optimization algorithm based on the fuzzy theory and genetic algorithm operators (DPSO-FGA) to implement effective data placement for scientific workflows, with the goal of minimizing the fuzzy data transmission time.

3.1. PSO Algorithm

The PSO algorithm originated from the literature [33] and was inspired by the regularity observed in the flocking behavior of birds. Based on the information exchanged between individuals, the movement of the entire population gradually becomes orderly, and eventually the optimal solution can be obtained, where a candidate solution of the optimization problem is called a "particle". When running the PSO algorithm, a fixed-size particle swarm is randomly initialized, and each particle keeps iterating and updating by tracking the best solution found by itself and by the population. The update of particles contains two parts as follows.
(i) Speed update: vi(t + 1) = ω·vi(t) + c1·r1·(pi(t) − xi(t)) + c2·r2·(pg(t) − xi(t)).
(ii) Location update: xi(t + 1) = xi(t) + vi(t + 1).

where ω is the inertia weight, c1 and c2 are the individual and population cognition factors, r1 and r2 are random numbers in [0, 1], pi(t) is the historically best position of particle i, pg(t) is the historically best position of the population, and the detailed definitions of the symbols can be found in [29].
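For reference, a minimal sketch of the classic continuous PSO update of equations (13) and (14) is given below; the parameter values are illustrative, and the DPSO-FGA in Section 3.4 replaces these algebraic updates with discrete mutation and crossover operators.

import random
from typing import List, Tuple

def pso_step(x: List[float], v: List[float],
             p_best: List[float], g_best: List[float],
             w: float = 0.7, c1: float = 1.5, c2: float = 1.5) -> Tuple[List[float], List[float]]:
    """One velocity (speed) and position (location) update of the classic PSO."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        r1, r2 = random.random(), random.random()
        vel = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)   # speed update
        new_v.append(vel)
        new_x.append(xi + vel)                                     # location update
    return new_x, new_v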

3.2. Fitness Function

A fitness function needs to be defined so that particles can track the optimal solution during the update process. As the optimization goal, the fuzzy data transmission time is used to define the fitness function: F(S) represents the fitness of the data placement strategy S encoded by the particle Xi and equals the total fuzzy data transmission time produced by Xi.

If the total size of the datasets placed in each data center does not exceed its maximum capacity, the particle is a feasible solution; otherwise, it is infeasible. When selecting between a feasible and an infeasible solution, the feasible one is directly selected. When selecting between two feasible solutions, the particle with the smaller fitness value is selected. When selecting between two infeasible solutions, the particle with the smaller fitness value is also selected because it is more likely to become a feasible solution in subsequent iterations.
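This selection rule can be sketched as follows, assuming the fitness has already been defuzzified to a real number for comparison; the names Particle and better are ours.

from typing import NamedTuple

class Particle(NamedTuple):
    fitness: float      # defuzzified total fuzzy transmission time F(S)
    feasible: bool      # True if no data center exceeds its maximum capacity

def better(a: Particle, b: Particle) -> Particle:
    """Prefer feasible over infeasible; otherwise prefer the smaller fitness."""
    if a.feasible != b.feasible:
        return a if a.feasible else b
    return a if a.fitness <= b.fitness else b

# Example: a feasible particle beats an infeasible one with a smaller fitness value.
print(better(Particle(120.0, True), Particle(95.0, False)))   # -> Particle(120.0, True)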

3.3. Particle Encoding

The particle encoding needs to meet three principles: completeness, nonredundancy, and soundness [34]. Specifically, particles are discretely encoded in n dimensions [35], where n represents the number of datasets involved in the scientific workflow. Therefore, the structure of particle i at the t-th iteration is defined as Xi(t) = (xi1(t), xi2(t), ..., xin(t)), where xik(t) represents the placement location (i.e., the data center number) of the k-th dataset at the t-th iteration.

As an example of particle encoding, consider particle number 3 at the 10th iteration, with 10 datasets and 4 data centers. The quantiles of privacy datasets are fixed, and the data centers used for storing them cannot be changed during the subsequent update process. An illustrative encoding of this kind is sketched below.
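A minimal sketch of such an encoding is given below; the fixed privacy positions and the generated values are illustrative and do not reproduce the paper's original example.

import random

NUM_DATASETS = 10        # encoding dimension n
NUM_DATA_CENTERS = 4     # data centers are numbered 1..4
FIXED = {1: 2, 6: 3}     # illustrative privacy datasets: dataset index -> fixed data center

def random_particle() -> list:
    """One discretely encoded particle: the k-th quantile is the number of the data
    center storing the k-th dataset; privacy datasets keep their fixed locations."""
    encoding = [random.randint(1, NUM_DATA_CENTERS) for _ in range(NUM_DATASETS)]
    for k, dc in FIXED.items():
        encoding[k] = dc
    return encoding

print(random_particle())   # e.g. [3, 2, 1, 4, 2, 1, 3, 4, 2, 1]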

3.4. Particle Update

The traditional PSO update, shown in equations (13) and (14), exhibits some weaknesses in practical operation, such as limited search ability, a small explored solution space, and premature convergence to local optima. To address these problems, the crossover and mutation operations of the GA are introduced into the update. Since a certain proportion of datasets are privacy datasets, the data centers used for storing them cannot be changed during the update process.

For the inertia part of the traditional update, the mutation operation is introduced: controlled by a random factor, the particle either keeps its current encoding or is mutated as Ai(t + 1) = Mu(Xi(t)), where the mutation operation Mu() randomly changes one quantile of the encoded particle within its range of valid values. It should be noted that the quantiles of privacy datasets cannot be mutated. Moreover, for an infeasible particle, the quantile selected for mutation should be one that makes the particle infeasible, i.e., the location of an overloaded data center. For instance, in a mutation based on equation (17), the 2nd quantile is selected by the mutation operation and the corresponding data center number changes from 2 to 3.

For the individual and population cognition parts of the traditional update, crossover operations are introduced: controlled by random factors, the particle is crossed with its historically best position pi(t) and with the historically best position of the population pg(t), i.e., Bi(t + 1) = Cp(Ai(t + 1), pi(t)) and Xi(t + 1) = Cg(Bi(t + 1), pg(t)). The crossover operations Cp() and Cg() randomly select two quantiles of the encoded particles Ai(t + 1) and Bi(t + 1) and exchange the values at the same positions with pi(t) and pg(t), respectively. It should be noted that the storage locations of privacy datasets cannot be changed when crossing. For instance, in a crossover based on the above definition, the crossover happens at the 4th and 5th quantiles and the serial numbers of the data centers on these quantiles change from 2 to 4.

In summary, each particle is updated by first applying the mutation operation to its inertia part and then applying the individual and population crossover operations in sequence, subject to the corresponding random factors. A sketch of this update process is given below.
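A minimal sketch of the particle update is given below. It assumes that the inertia weight and the cognition factors act as trigger probabilities for the mutation and crossover operators, which is our reading of the random factors described above; the exact trigger conditions in the paper may differ.

import random
from typing import List, Set

def mutate(x: List[int], fixed: Set[int], num_dcs: int) -> List[int]:
    """Mu(): randomly change one quantile that does not belong to a privacy dataset."""
    y = list(x)
    k = random.choice([i for i in range(len(y)) if i not in fixed])
    y[k] = random.randint(1, num_dcs)
    return y

def crossover(x: List[int], best: List[int], fixed: Set[int]) -> List[int]:
    """Cp()/Cg(): copy a randomly chosen segment of quantiles from the personal or
    population best into the particle, skipping the quantiles of privacy datasets."""
    y = list(x)
    a, b = sorted(random.sample(range(len(y)), 2))
    for i in range(a, b + 1):
        if i not in fixed:
            y[i] = best[i]
    return y

def update_particle(x, p_best, g_best, w, c1, c2, fixed, num_dcs):
    """One DPSO-FGA-style update: mutation for the inertia part, then crossover with
    the personal best and the population best, each triggered by a random factor."""
    a = mutate(x, fixed, num_dcs) if random.random() < w else list(x)
    b = crossover(a, p_best, fixed) if random.random() < c1 else a
    return crossover(b, g_best, fixed) if random.random() < c2 else b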

3.5. Mapping from Particles to Data Placement Results

Algorithm 1 shows the mapping from encoded particles to data placement results, where G represents the scientific workflow, DC indicates the hybrid cloud environment, X denotes the encoded particles, and S expresses the data placement strategy.

Procedure dataPlacement(G, DC, X)
Input: G, DC, X
Output: the total fuzzy data transmission time and the corresponding data placement strategy S
1:  Initialization: set the current storage capacity of data centers dccur(i) to 0 and the fuzzy data transmission time to (0, 0, 0)
2:  for each dsi of DSini //Determine whether the particle would cause a data center to be overloaded
3:   dccur(X[i]) += size(dsi) //Place the dataset dsi in the data center dcX[i]
4:   ifdccur(X[i]) > VX[i]
5:    return this particle is infeasible
6:   end if
7:  end for
8:  for j = 1 to |T|//Determine whether the data center is overloaded during task execution
9:   Place the task tj in the data center dcj with the least fuzzy data transmission time
10:   if dccur(j) + sum(Ij) + sum(Oj) > Vj
11:    return this particle is infeasible
12:   end if
13:   Place the output dataset Oj of tj into the corresponding data center
14:   Update the storage capacity of the data center
15:  end for
16:  for j = 1 to |T|//Calculate the fuzzy transmission time of the corresponding data placement
17:   Find the data centers that store the input datasets Ij of task tj
18:   Calculate the fuzzy data transmission time generated by transmitting the input datasets Ij to the data center of tj
19:   Add this fuzzy data transmission time to the total fuzzy data transmission time
20:  end for
21:  Output the total fuzzy data transmission time and the corresponding data placement strategy
End procedure

The execution steps of Algorithm 1 are listed as follows:
Step 1 (line 1). Initialize the current storage capacity of each data center dccur(i) to 0 and the fuzzy data transmission time to (0, 0, 0).
Step 2 (lines 2∼7). Place each initial dataset into the data center given by the particle encoding and update dccur(i). If dccur(i) exceeds the maximum capacity of the data center, the solution corresponding to the particle is infeasible, and the current operation is stopped and returned.
Step 3 (lines 8∼15). During the task traversal, the data center dcj with the smallest fuzzy data transmission time is always selected to place the task tj. If the solution corresponding to the particle is infeasible (i.e., the sum of dccur(j), sum(Ij), and sum(Oj) exceeds the maximum capacity of the data center), the current operation is stopped and returned. Otherwise, the output dataset Oj of the task tj is placed into the corresponding data center and the storage capacities of the data centers are updated.
Step 4 (lines 16∼20). Traverse all tasks, calculate the fuzzy data transmission time for each dataset that needs to be transmitted across data centers, and sum them to obtain the total fuzzy data transmission time.
Step 5 (line 21). Output the total fuzzy data transmission time and the corresponding data placement strategy.

3.6. Model Parameters

The inertia weight ω in equation (13) has a direct influence on the convergence of the PSO algorithm [36] and affects the search speed of particles in the solution space. Thus, we propose a new method that adaptively adjusts the value of ω based on the quality of the solution corresponding to the current particle, i.e., the degree of difference between the current particle and the historically optimal particle, where d(Xi(t), pg(t)) represents the degree of difference between the current particle Xi(t) and the historically optimal particle pg(t) of the current population (i.e., the number of quantiles with different values). In the early stage of training, d(Xi(t), pg(t)) is usually large and ω takes a large value; it is then necessary to expand the search range of particles in the solution space, in order to find the optimal solution and avoid prematurely falling into a local optimum. In the later stage of training, d(Xi(t), pg(t)) becomes small and ω takes a small value; it is then better to narrow the search range of particles and accelerate their convergence toward the optimal solution.
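A minimal sketch of such an adaptive weight is given below; the linear scaling between w_min and w_max is an assumption for illustration, not the paper's exact formula.

from typing import List

def difference(x: List[int], g_best: List[int]) -> int:
    """d(X_i(t), p_g(t)): number of quantiles on which the particle differs from the
    historically optimal particle of the population."""
    return sum(1 for a, b in zip(x, g_best) if a != b)

def adaptive_weight(x: List[int], g_best: List[int],
                    w_min: float = 0.4, w_max: float = 0.9) -> float:
    """Large weight when the particle is far from the population best (wide search),
    small weight when it is close (faster convergence)."""
    d = difference(x, g_best)
    return w_min + (w_max - w_min) * d / max(len(x), 1)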

Moreover, the individual and population cognition factors (i.e., c1 and c2) are defined by using the gradient descent method [37].

4. Performance Evaluation

4.1. Parameter and Environment Settings

The scientific workflow model comes from five different scientific fields [38], including CyberShake, Epigenomics, Inspiral, Montage, and Sipht. Each scientific field has a scientific workflow with a different number of tasks, and each scientific workflow has a unique task structure, number of datasets, and computational requirements [39]. Specifically, a medium-sized (about 50 tasks) workflow in each field is selected for experiments, and the parameter and environment settings are shown in Table 1.

Moreover, some extra settings are as follows:
(i) Maximum capacity: a datum capacity is fixed, and the maximum capacity of the three private data centers is set to 2.6 times the datum capacity.
(ii) Bandwidth (M/s) between data centers: the bandwidth between dc1 and {dc2, dc3, dc4} is set to {10, 20, 30}, the bandwidth between dc2 and {dc3, dc4} is set to {150, 150}, and the bandwidth between dc3 and dc4 is set to 100.
(iii) Proportion of privacy datasets: due to the differences in datasets among various workflows, the proportions of privacy datasets in the scientific workflows CyberShake, Epigenomics, Inspiral, Montage, and Sipht are set to {0.25, 0.2, 0.2, 0.2, 0.02}, respectively.
(iv) Fuzzy parameter: based on the fuzzy theory, the data transmission time T is fuzzified into a triangular fuzzy number whose lower and upper bounds are obtained by scaling T with the fuzzy parameters. A sketch of these settings is given below.
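For illustration, these settings might be encoded as follows; the datum capacity and the fuzzification coefficients are placeholders, since the paper's exact values for them are not reproduced here.

# Bandwidth (M/s) between the four default data centers, stored symmetrically.
BANDWIDTH = {
    (1, 2): 10, (1, 3): 20, (1, 4): 30,
    (2, 3): 150, (2, 4): 150,
    (3, 4): 100,
}
BANDWIDTH.update({(j, i): b for (i, j), b in list(BANDWIDTH.items())})

# Proportion of privacy datasets per scientific workflow.
PRIVACY_RATIO = {
    "CyberShake": 0.25, "Epigenomics": 0.2, "Inspiral": 0.2,
    "Montage": 0.2, "Sipht": 0.02,
}

DATUM_CAPACITY = 1000.0                    # placeholder datum capacity
PRIVATE_CAPACITY = 2.6 * DATUM_CAPACITY    # maximum capacity of each private data center

def fuzzify(t: float, a1: float = 0.85, a2: float = 1.2):
    """Fuzzify a crisp transmission time T into a triangular fuzzy number; a1 and a2
    are assumed coefficients standing in for the paper's fuzzy parameters."""
    return (a1 * t, t, a2 * t)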

4.2. Comparison Algorithms

The proposed DPSO-FGA is compared with the constraint fuzzy randomized algorithm (CFRA) and the constraint fuzzy greedy algorithm (CFGA), which improve the performance of the randomized algorithm (RA) and the greedy algorithm in data placement. The CFRA and CFGA rely on the fuzzy theory while considering some essential conditions, including the application scenarios of scientific workflows, privacy settings, and capacity constraints. These conditions refer to meeting the maximum capacity requirements of data centers and the proportion of private datasets during the data placement process.
(i) The steps of CFRA:
Step 1. Set privacy datasets and the maximum capacity of data centers, initialize parameters, and keep the same values as the corresponding parameters in the DPSO-FGA.
Step 2. Generate a random population that meets the conditions according to the discrete encoding method in the DPSO-FGA. The population contains a certain number of individuals, and each individual represents a candidate solution for data placement.
Step 3. Define the fitness function as the fuzzy data transmission time of the solution corresponding to the individual encoding.
Step 4. Calculate the fitness value of each individual and compare it with the current best individual of the population. If the current individual is better, update the best individual of the population with the current one.
Step 5. End the traversal and output the best individual with its fitness value.
(ii) The steps of CFGA:
Step 1. Set privacy datasets and the maximum capacity of data centers, initialize parameters, and keep the same values as the corresponding parameters in the DPSO-FGA.
Step 2. Design the data placement greedily. According to the task execution sequence of the scientific workflow, traverse the datasets that have not yet been deployed for tasks. If the current task has already been placed, the dataset is placed in the same data center. If the current task has not been placed but one of its datasets has already been placed, the dataset is placed in the same data center as the placed dataset. If the current task has not been placed and none of its datasets have been placed, the dataset is placed in the data center with the smallest fuzzy data transmission time.
Step 3. Calculate the fuzzy data transmission time of the current data placement strategy and output the strategy.

4.3. Experimental Results and Analysis

To avoid the randomness of results, 10 independent experiments are carried out on five scientific workflows under different environment settings. Table 2 records the average fuzzy data transmission time of different algorithms under various scientific workflows.

In the subsequent experimental results, the fuzzy data transmission time is defuzzified, in order to make the comparison between the algorithms more intuitive, where ∂ is set to 1.

Figure 1 shows the defuzzification results for the fuzzy data transmission time of different algorithms under various scientific workflows, where the names of the scientific workflows are indicated by their first letter.

From the perspective of algorithms, the DPSO-FGA outperforms the CFRA and CFGA. This is because the CFGA may easily fall into a local optimum through its greedy choices during execution and thus ignores the global performance. Moreover, the overall performance of the CFRA is better than that of the CFGA since the search space of the CFRA is larger and it does not fall into a local optimum, and thus, the CFRA can obtain a good solution when the algorithm runs for a long time. However, the CFRA does not consider the fitness of the current particle when a solution is generated, and thus, its performance is worse than that of the DPSO-FGA. From the perspective of workflows, the data transmission time of the same algorithm on various scientific workflows differs significantly. Although all these scientific workflows contain about 50 tasks, the number of dataset usages varies greatly. For example, CyberShake uses datasets only about 70 times, while Sipht uses datasets up to 4000 times, which results in the different data transmission times between them.

As the number of private data centers in a hybrid cloud environment may change, the performance of the DPSO-FGA needs to be evaluated with different numbers of private data centers. Thus, we change the number of private data centers without modifying the other default settings. Specifically, the three algorithms are tested when the number of private data centers is set to {3, 5, 6, 8, 10}, where the bandwidth between newly added private data centers and public data centers is set to 20 M/s and the bandwidth between other private data centers is set to 120 M/s. The experimental results are shown in Figure 2.

From the perspective of algorithms, the DPSO-FGA outperforms the CFRA and CFGA, for the reasons analyzed for Figure 1. From the perspective of private data centers, as the number of private data centers increases, the data transmission time of all three algorithms also increases. This is because, with an increasing number of private data centers, the privacy datasets, which are randomly set according to the privacy proportion, are dispersed and fixed in more private data centers. Therefore, the fixed tasks that require these private datasets need to be executed in more scattered locations, which leads to increasing data transmission time.

Since the maximum capacity of private data centers is regarded as a constraint, the sensitivity of the DPSO-FGA to this constraint needs to be evaluated. Specifically, the CyberShake is selected as the scientific workflow for experiments, the multiple of datum capacity is set to {2, 2.6, 3, 5, 8}, and the rest of the settings remain default. The experimental results are shown in Figure 3.

When the maximum capacity of private data centers increases and the bandwidth between data centers remains the same, each data center is able to store more datasets and the datasets required for executing tasks become more concentrated. Therefore, the data transmission time of the DPSO-FGA is reduced. Specifically, the fastest decline in data transmission time happens when the maximum capacity is 2 to 3 times the datum capacity, and the slowest decline happens when it is 5 to 8 times the datum capacity. This is because, when the maximum capacity of data centers is small, the available space is small and the placement locations of datasets are restricted; thus, the maximum capacity has a significant impact on the data transmission time. When the maximum capacity of data centers becomes larger, each data center can store more datasets, and it becomes easy to meet the operational requirements of scientific workflows; therefore, the maximum capacity has little further effect on the data transmission time.

Finally, the performance of the DPSO-FGA is evaluated under different bandwidths between data centers. Specifically, CyberShake is selected as the scientific workflow for the experiments, the multiple of the bandwidth between data centers relative to the default one is set to {0.5, 0.8, 1.5, 3, 5}, and the rest of the settings remain default. The experimental results are shown in Figure 4. The data transmission time decreases greatly as the bandwidth increases, which indicates that bandwidth changes between data centers shorten the transmission time but do not significantly affect the data placement strategy itself.

5. Conclusions

In this paper, we propose a DPSO-FGA-based data placement method for scientific workflows in hybrid cloud environments. Based on the fuzzy theory, the DPSO-FGA fuzzifies the data transmission time to adapt to real-world network environments while considering the characteristics of hybrid cloud environments, bandwidth fluctuations, capacity limitations of private data centers, and dependencies between different scientific workflow tasks. Simulation results demonstrate the effectiveness of the proposed DPSO-FGA method. In the future, we will study the impact of other essential factors on the proposed method, such as different proportions of private datasets in scientific workflows and different capacities of private data centers. Moreover, in scenarios where the data transmission time is less critical, such as business network environments, the data transmission costs between different clouds should also be regarded as a prioritized optimization goal. Therefore, a comprehensive model for minimizing both the fuzzy data transmission time and costs will be researched.

Data Availability

The data used to support the findings of this study are produced by a public workflow generator available at https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Zheyi Chen and Xu Zhao contributed equally to this work. Zheyi Chen and Xu Zhao developed the model, carried out the parameter estimations, and planned as well as performed the experiments. Zheyi Chen wrote the main part of the manuscript, while Xu Zhao and Bing Lin provided support for writing materials. Bing Lin also took part in the design and evaluation of the model. Zheyi Chen and Bing Lin reviewed the manuscript. All the authors read and approved the final manuscript.

Acknowledgments

This work was supported by the National Key R&D Program of China (Grant no. 2018YFB1004800), Natural Science Foundation of China (Grant nos. 61672159, 41801324, and 61972165), Natural Science Foundation of Fujian Province (Grant nos. 2019J01286, 2019J01244, and 2018J01619), Young and Middle-Aged Teacher Education Foundation of Fujian Province (Grant no. JT180098), Open Foundation of Engineering Research Center of Big Data Application in Private Health Medicine, Fujian Province University (Grant no. KF2020001), Talent Program of Fujian Province for Distinguished Young Scholars in Higher Education, and China Scholarship Council (no. 201706210072). The authors sincerely thank Dr. Jia Hu and Dr. Geyong Min for providing useful advice that greatly improved this paper.