The amount of energy needed to operate high-performance computing systems has been increasing rapidly in recent years, and energy consumption has attracted a great deal of attention. Moreover, high energy consumption inevitably leads to failures and reduces system reliability. However, there has been considerably less work on the simultaneous management of system performance, reliability, and energy consumption on heterogeneous systems. In this paper, we first build the precedence-constrained parallel application and energy consumption models. Then, we deduce the relation between reliability and processor frequency and obtain approximate values of its parameters by the least squares curve fitting method. Thirdly, we establish a task execution reliability model and formulate the reliability and energy aware scheduling problem as a linear program. Lastly, we propose a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm to solve this problem, which achieves a good tradeoff among system performance, reliability, and energy consumption with lower complexity. Our extensive simulation study clearly demonstrates the tradeoff performance of the proposed heuristic algorithm.

1. Introduction

For a long time, energy consumption was simply ignored in the performance evaluation of large-scale parallel computing systems. However, the Intelligence (DCDi) Industry Census reported that the amount of electricity consumed by global data centers ran up to 40 GW in 2013 and was still increasing [1]. According to the latest Top 500 supercomputer ranking, the power consumption of the top-ranked supercomputer, “Tianhe-2,” is 17.808 MW, and the average power consumption of the top systems in the ranking list is 6.2939 MW [2]. Thus, it is obvious that high energy cost is a key concern in designing and deploying heterogeneous systems.

On the other hand, computing systems are a group of heterogeneous processors connected via a high-speed network that supports the execution of parallel applications. For example, the top-ranked supercomputer “Tianhe-2” consists of Intel Xeon® E5-2692 12C 2.200 GHz and Intel Xeon Phi 31S1P (MIC) [2]. The number of transistors integrated into today’s Intel Xeon EX processor is on the order of a billion, and its power consumption exceeds 130 W [3]. This implies that single-processor reliability may worsen, eventually degrading the reliability of the whole heterogeneous system. Furthermore, modern large-scale computing systems usually have a very large number of processors, such as “Tianhe-2” with 3,120,000 cores and “Titan” with 560,640 cores [2]. One of the main problems in this situation is system reliability, which decreases drastically as the number of processor cores increases [4]. Even when a single processor’s one-hour reliability is very high, such as 0.999999, the system’s MTTF (mean time to failure) drops to a matter of hours as the system size approaches 10,000 cores [4]. This motivates the main problem of this paper, which is the simultaneous management of system performance, reliability, and energy consumption.
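The scaling effect described above can be checked with a short calculation. The sketch below assumes independent, identical cores with exponentially distributed failures and a system that fails as soon as any core fails (a series system); these assumptions and the function names are illustrative and are not taken from [4].

```python
import math


def failure_rate_from_reliability(reliability: float, hours: float = 1.0) -> float:
    """Constant failure rate (per hour) implied by a reliability over `hours`."""
    return -math.log(reliability) / hours


def system_mttf_hours(core_reliability_1h: float, cores: int) -> float:
    """MTTF of a series system of identical, independent cores (assumed model)."""
    lam = failure_rate_from_reliability(core_reliability_1h)
    # Failure rates of independent exponential components add up.
    return 1.0 / (cores * lam)


if __name__ == "__main__":
    for n in (1, 100, 10_000):
        print(n, "cores -> MTTF of", round(system_mttf_hours(0.999999, n), 1), "hours")
```

Under these assumptions, a one-hour core reliability of 0.999999 gives a system MTTF of roughly 100 hours at 10,000 cores, even though a single core would run for about a million hours between failures.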

In recognition of this, we first build a reliability and energy aware task scheduling architecture that includes the precedence-constrained parallel application model and the energy consumption model on heterogeneous systems. Then, we propose a single-processor failure rate model based on the DVFS technique and deduce the application reliability of the system. Finally, to solve this problem, we propose a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm, which adopts a novel scheduling objective, RE. The overall objective of this paper is to achieve a good tradeoff among performance, reliability, and energy consumption.

The rest of the paper is organized as follows: the related work is summarized in Section 2. We describe the task scheduling system model in Section 3. In Section 4, we provide a system reliability model. To solve this problem, a heuristic reliability and energy aware task scheduling algorithm is proposed in Section 5. In Section 6, we verify the performance of the proposed algorithm by comparing the results obtained from performance evaluation. Finally, we summarize the contributions and make some remarks on further research in Section 7.

2. Related Work

A high-performance parallel application running on a computing system is usually composed of intercommunicating tasks, which are scheduled to run on different processors in the system. In most cases, the main objective of a scheduling strategy is to map the interacting program tasks onto processors and order their executions so that task precedence requirements are satisfied and, at the same time, the minimum schedule length (makespan) is achieved. The problem of finding the optimal schedule is NP-complete in general [5–9]. Many scheduling algorithms have been proposed to deal with this problem, for example, the dynamic-level scheduling (DLS) algorithm [6] and the heterogeneous earliest-finish-time (HEFT) algorithm [5, 8, 10, 11].

As energy consumption has become an important issue in designing large-scale computing systems in the last few years, many techniques, including dynamic voltage-frequency scaling (DVFS), dynamic powering on/off, slack reclamation, resource hibernation, and memory optimizations, have been investigated and developed to reduce energy consumption [12–14]. DVFS, a technique in which a processor runs at a less-than-maximum frequency when it is not fully utilized in order to conserve power, is perhaps the most appealing method for reducing energy consumption [14, 15]. Most of the early DVFS-based research focused on single processors in embedded and real-time computing systems [14, 16, 17]. Recently, there has been a significant amount of work on task scheduling for heterogeneous systems using DVFS-enabled techniques. For instance, Rountree et al. focused on energy optimization of MPI programs in HPC environments and proposed a linear programming (LP) approach, which incorporates allowable time delays, communication slack, and memory pressure into its scheduling using DVFS (i.e., slack reclamation) [18]. Rizvandi et al. proposed a method to find the best processor frequencies to obtain the optimal energy consumption [19]. Lee and Zomaya addressed the problem of scheduling precedence-constrained parallel applications on multiprocessor computer systems; their scheduling decisions are made using the relative superiority (RS) metric, devised as a novel objective function [20]. In [21], Zong et al. proposed two energy-efficient scheduling algorithms (EAD and PEBD), based on a duplication strategy, for parallel tasks on homogeneous clusters.

All of this work demonstrated that dynamically adjusting the processor’s voltage and frequency can effectively reduce system energy consumption. However, recent research has shown that scaling the processor’s voltage and frequency makes nanoscale semiconductor circuits more susceptible to cosmic ray radiation, electromagnetic interference, and alpha particles, which increases the unreliability of the processor [22–24]. Thus, it is worthwhile to incorporate reliability into DVFS-based energy aware scheduling. Recently, Zhu et al. focused on reducing energy consumption while preserving system reliability for periodic real-time tasks [25, 26]. They proposed a reliability model in which a processor’s reliability decreases as its voltage and frequency are scaled from maximum to minimum, and they incorporated the reliability requirements into heuristic energy aware task scheduling strategies. However, their techniques are not suitable for precedence-constrained parallel applications on heterogeneous systems based on DVFS-enabled processors.

Many studies have dealt with reliability on heterogeneous systems. For example, Dogan and Özgüner introduced three reliability cost functions, incorporated them into the dynamic level (DL), and proposed a reliable dynamic level scheduling algorithm (RDLS) [27]; the goal was to minimize not only the execution time but also the failure probability of the application. In our previous work [8], we proposed a scheduling algorithm that considers the execution reliability of tasks. Qin and Jiang investigated a dynamic and reliability-cost-driven (DRCD) scheduling algorithm for precedence-constrained tasks in heterogeneous clusters [28]. Unfortunately, those works did not consider energy consumption or the effect of scaling the processor’s voltage and frequency on reliability. In recognition of this, we focus on reliability and energy consumption on DVFS-enabled heterogeneous systems.

3. System Models

3.1. Scheduling Architecture

Various task scheduling architectures have been proposed in the literature [5, 8, 9, 14, 28, 29]. However, energy consumption and system reliability are not effectively incorporated into their scheduling. In this paper, we propose a reliability and energy aware task scheduling architecture, as depicted in Figure 1(a). It is assumed that all parallel applications, along with information provided by the user, are submitted to the system by a special user command. First, each parallel application is converted into a task DAG by the Task DAG Model. Then, the estimated energy consumption of the tasks, which are executed on the DVFS-enabled heterogeneous processors, is computed by the Energy Consumption Estimator. At the same time, the Reliability Analysis component computes the processors’ reliability at the different frequencies to obtain the reliability of the whole system. Finally, the Scheduler schedules tasks based on the above task energy consumption and system reliability.

3.2. Heterogeneous Systems

The target system used in this work consists of a set of heterogeneous processors/machines [5, 8, 9, 14, 29], which are connected by high-speed interconnects, such as Infiniband and Myrinet. Each DVFS-enabled processor can adjust its operational voltage and frequency [14]. Therefore, each processor can execute at a discrete set of frequency-voltage pairs, indexed by the processor’s operation level [14, 30]. For example, the quad-core AMD Phenom II supports four different frequencies (0.8 GHz, 2.1 GHz, 2.5 GHz, and 3.2 GHz), with supply voltages spanning a corresponding range [30]. Since clock frequency transition overheads take a negligible amount of time (e.g., 10 μs–150 μs), these overheads are not considered in our study.

The failure of a heterogeneous processor is assumed to follow a Poisson process, and each processor has a constant failure rate [8, 9, 29]. In particular, each processor has a baseline failure rate, which is its failure rate when it works at normal voltage and frequency [8, 9, 27, 29]. These failure rates can be derived from system profiling, system logs, and statistical prediction techniques [31]. For demonstration purposes, we illustrate two heterogeneous processors, each with its own set of frequency levels; their parameters are listed in Table 1.

3.3. Applications Model

The precedence-constrained tasks of a parallel application are usually represented as a Directed Acyclic Graph (DAG) with four components [5, 8–10, 29]. The first is the set of tasks, each of which can be scheduled on any available DVFS-enabled processor. The second is the set of edges representing the precedence relation, which defines a partial order on the task set: an edge from one task to another implies that the first task must finish before the second can start execution. The third is the communication matrix, which gives the communication time between each pair of tasks. The fourth is the computation matrix, in which each entry gives the estimated time to execute a task on a given processor at a given frequency, for every operation level up to the maximal operation level in the system. The communication and computation costs can be evaluated by building a historic table and using code profiling or statistical prediction techniques [31]. Figure 1(b) shows a parallel application DAG, Table 2 lists the task execution times on the two heterogeneous DVFS-enabled processors listed in Table 1, and Table 3 lists the communication times among these tasks.

Generally, the common objective of task scheduling is to map precedence-constrained tasks onto processors and obtain a minimum schedule length (also called the makespan) [10, 11]. Before presenting the schedule length, it is necessary to define two scheduling attributes of a task. The first is the earliest execution start time of the task on a DVFS-enabled processor at a given frequency, which is constrained by the task precedence relation and the available time of the processor [5, 8–10, 29]. The second is the earliest execution finish time of the task on that processor at that frequency, which is the earliest start time plus the execution time of the task on that processor at that frequency.

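In this paper, a binary indicator equals 1 if a task is scheduled on a given processor at a given frequency and 0 otherwise. The schedule length is then defined as the maximum earliest finish time over all scheduled task assignments.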
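In symbols, the finish time and the schedule length can be sketched as below. The notation is assumed here for illustration and may differ from the original: $EST$ and $EFT$ are the earliest start and finish times, $w_{i,j,k}$ is the execution time of task $t_i$ on processor $p_j$ at frequency $f_{j,k}$, and $x_{i,j,k}$ is the assignment indicator.

```latex
\begin{align}
EFT(t_i, p_j, f_{j,k}) &= EST(t_i, p_j, f_{j,k}) + w_{i,j,k}, \\
\text{makespan} &= \max_{i,j,k} \bigl\{ x_{i,j,k} \cdot EFT(t_i, p_j, f_{j,k}) \bigr\},
\qquad x_{i,j,k} \in \{0, 1\}.
\end{align}
```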

3.4. Energy Model

The energy consumption of a computing system depends mainly on its memory, disks, CPUs, and other components. This paper considers only DVFS-enabled CPUs, which consume the largest proportion of energy in the system [14, 19, 20, 32]. The power consumption of a DVFS-enabled microprocessor based on complementary metal-oxide semiconductor (CMOS) logic circuits mainly consists of static power and dynamic power dissipation [25, 26]. The static power is a constant: it is the power used to maintain basic circuits and keep the clock running. While the processor is executing, it additionally draws a frequency-independent active power and the dynamic power dissipation; a binary mode indicator equals 1 if the processor is in execution mode and 0 otherwise. The dynamic power dissipation is the most significant factor in processor power consumption and can be estimated from the switched capacitance, the supply voltage, the processor’s working frequency, and a circuit-dependent constant [14, 16, 19, 20, 32]. An example of such processor parameters is listed in Table 1.

The energy consumed by a task running on a DVFS-enabled processor at a given frequency is determined by the task execution time and the processor power consumption: it is the product of the execution time and the dynamic power dissipation of the processor at that frequency (see (4)). Thus, for an application, the energy consumption is the sum of the energy consumption of all its tasks.

At the same time, in heterogeneous systems all processors remain powered on, in either sleep or execution mode. That is to say, every processor of the system consumes static power all the time. Thus, the energy consumption of the computing system is the sum of the static energy of all processors and the dynamic energy, that is, the application energy consumption.

Obviously, the system energy consumption is greater than the application energy consumption. In this paper, one of our main objectives is to minimize the system energy consumption.
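The energy model of this section can be illustrated with a minimal sketch. It assumes the common CMOS form for dynamic power (switched capacitance times voltage squared times frequency) and charges static power for every processor over the whole schedule length; the class and function names, and the numbers in the usage example, are hypothetical and not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    freq: float  # working frequency of this operation level
    volt: float  # supply voltage of this operation level


def dynamic_power(c_ef: float, level: Level) -> float:
    """Dynamic power, assuming the common CMOS form C_ef * V^2 * f."""
    return c_ef * level.volt ** 2 * level.freq


def task_energy(c_ef: float, level: Level, exec_time: float) -> float:
    """Task energy: dynamic power at the chosen level times execution time."""
    return dynamic_power(c_ef, level) * exec_time


def application_energy(assignments) -> float:
    """Sum of task energies; `assignments` holds (c_ef, Level, exec_time) tuples."""
    return sum(task_energy(c, lv, t) for c, lv, t in assignments)


def system_energy(static_powers, makespan: float, assignments) -> float:
    """Static energy of every processor over the schedule plus application energy."""
    return sum(static_powers) * makespan + application_energy(assignments)


if __name__ == "__main__":
    tasks = [(1.0, Level(2.0, 1.0), 3.0), (1.0, Level(1.0, 0.5), 4.0)]
    print(application_energy(tasks), system_energy([0.5, 0.5], 4.0, tasks))
```

Because the static term is nonnegative and paid by idle processors too, the system energy in this sketch is never smaller than the application energy, which matches the observation above.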

4. System Reliability Analysis and Problem Statement

In this section, we first provide the failure rate model of a single DVFS-enabled processor. Then, we analyze the reliability of heterogeneous systems. Finally, we formulate reliability and energy aware task scheduling as a linear programming problem.

4.1. Single DVFS-Enabled Processor Failure Rate

Among the various sources of unreliability in a semiconductor circuit processor, it is predicted that the failure rate due to cosmic ray radiation-induced soft errors dominates all other reliability issues [24]. A transient fault occurs when a high-energy particle, such as an alpha particle or a neutron, strikes a sensitive region in a semiconductor device and flips the logical state of the struck node [33]. Most modern DVFS-enabled processors integrate billions of transistors on a single chip, which increases the number of sensitive devices in submicron technologies that are vulnerable to soft errors and consequently raises the Soft Error Rate (SER) [34]. These phenomena become more and more serious with the continued scaling of the processor’s voltage and frequency [23, 25].

Traditionally, a modern DVFS-enabled processor’s reliability has been modeled by a Poisson distribution with a constant failure rate when it works at normal voltage and frequency [8, 9, 27, 29, 35]. Moreover, it has been shown that DVFS has a direct and negative effect on failure rates: blindly applying DVFS to scale the supply voltage and processing frequency for energy savings may cause significant degradation in the processor’s reliability [23, 25, 26]. Therefore, for the DVFS-enabled heterogeneous processors considered in this paper, the failure rate at a reduced frequency (and the corresponding voltage) is modeled in terms of the failure rate at the normal processing frequency (and the normal voltage). Prior research on the effect of voltage on processor reliability has revealed that failure rates generally increase as the processing frequency (and supply voltage) is scaled away from the normal level [24, 36]. On the other hand, fault rates are exponentially related to the circuit’s critical charge (which is the threshold voltage). Thus, the failure rate is modeled as an exponential function of the processing frequency, in which the exponent is the parameter of the threshold voltage, a constant represents the sensitivity of fault rates to frequency scaling, and the minimum and maximum frequencies delimit the scaling range.

In order to obtain precise values of the two model parameters, we use the least squares curve fitting method [37]. Taking the natural logarithm of both sides of (9) gives (10), and a suitable change of variables turns (10) into a linear relation. Thus, we can obtain approximate values of the two parameters by the least squares linear fitting method.
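The fitting step can be sketched as follows. The code assumes an exponential model of the form used in the DVFS reliability literature [25, 26], namely a failure rate equal to a baseline rate times 10 raised to d(f_max − f)/(f_max − f_min); this specific form, the parameter names `lambda0` and `d`, and the synthetic data are assumptions made for illustration, not the paper's fitted values.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_failure_model(freqs, rates, f_min, f_max):
    """Fit rate = lambda0 * 10 ** (d * (f_max - f) / (f_max - f_min)).

    Taking logarithms linearizes the model: ln(rate) = ln(lambda0) + d * ln(10) * x,
    with x = (f_max - f) / (f_max - f_min).
    """
    xs = [(f_max - f) / (f_max - f_min) for f in freqs]
    ys = [math.log(r) for r in rates]
    intercept, slope = fit_line(xs, ys)
    return math.exp(intercept), slope / math.log(10)


if __name__ == "__main__":
    freqs = [0.8, 2.1, 2.5, 3.2]  # GHz
    rates = [1e-6 * 10 ** (2.0 * (3.2 - f) / (3.2 - 0.8)) for f in freqs]
    print(fit_failure_model(freqs, rates, f_min=0.8, f_max=3.2))
```

On noise-free synthetic data the fit recovers the generating parameters; with measured failure rates the same routine returns the least squares estimates.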

4.2. Application Reliability Analysis

Assume that a task executes on a heterogeneous DVFS-enabled processor at a given frequency during the time interval between its start execution time and its finish time [5, 8, 9, 29]. Because failures follow a Poisson process, the task execution reliability is the exponential of the negative product of the processor’s failure rate at that frequency and the length of this interval.

For a task of the application on a processor at a given frequency, its reliability is determined by the reliability of all of its immediate parent tasks (its direct predecessors) and its own execution reliability on that processor at that frequency.

For the entry task of the application, which has no predecessors, the reliability is simply its execution reliability on the processor and frequency to which it is assigned.

Generally, an application has one exit task. The reliability of the application is equal to the reliability of the exit task.

This is the other objective of this paper: we try to improve the application reliability. From the above analysis, we know that allocating tasks with shorter execution times to more reliable processors might be a good heuristic for increasing reliability.
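The reliability analysis can be sketched in a few lines. The execution reliability follows directly from the Poisson failure assumption; how parent reliabilities are combined is an assumption here (a plain product), and the function names and data layout are hypothetical.

```python
import math
from functools import lru_cache


def execution_reliability(failure_rate: float, start: float, finish: float) -> float:
    """Probability of no failure during [start, finish] under a Poisson process."""
    return math.exp(-failure_rate * (finish - start))


def application_reliability(preds, exec_rel, exit_task):
    """Reliability of the exit task.

    preds:    dict mapping a task to its direct predecessors
    exec_rel: dict mapping a task to its own execution reliability
    Assumed combination: a task's reliability is its execution reliability
    multiplied by the reliabilities of all its direct predecessors.
    """
    @lru_cache(maxsize=None)
    def reliability(task):
        value = exec_rel[task]
        for parent in preds.get(task, ()):
            value *= reliability(parent)
        return value

    return reliability(exit_task)


if __name__ == "__main__":
    preds = {"t2": ("t1",), "t3": ("t2",)}
    exec_rel = {"t1": 0.9, "t2": 0.8, "t3": 0.5}
    print(application_reliability(preds, exec_rel, "t3"))
```

Note that with this literal product an ancestor shared by several paths is counted once per path, so the value is a conservative estimate on DAGs that fork and rejoin.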

4.3. Problem Statement

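As the simultaneous management of scheduling performance, system reliability, and energy consumption is the main problem of this paper, we formulate it as a multiobjective problem: minimize the schedule length and the system energy consumption and maximize the application reliability, subject to the task precedence constraints.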

5. Proposed Reliability-Energy Aware Scheduling Algorithm

This section presents a Reliability-Energy Aware Scheduling algorithm for heterogeneous systems, called REAS, which aims at achieving lower energy consumption, higher reliability, and a shorter schedule length. Its scheduling decisions are made using a hybrid metric that includes energy consumption, reliability, and schedule length, devised as a novel objective function. The pseudocode of the algorithm is shown in Algorithm 1. The algorithm consists of three main phases, as described in the following sections.

Algorithm 1: The REAS algorithm.

Input: The task DAG of the parallel application
Output: The schedule as task/processor-frequency pairs
Compute the level of each task in the DAG
Sort the tasks into a scheduling list in nonincreasing order of level
while the scheduling list is not empty do
    Remove the first task from the scheduling list
    Set the minimum earliest finish time and the minimum task energy consumption to the maximum value
    for each processor-frequency pair in the system do
        Compute the earliest finish time using (22)
        if the earliest finish time is below the current minimum then
            Update the minimum earliest finish time
        end
        Compute the task energy consumption using (5)
        if the task energy consumption is below the current minimum then
            Update the minimum task energy consumption
        end
    end
    Set the minimum RE to the maximum value
    for each processor-frequency pair in the system do
        Compute the earliest finish time using (22)
        Compute the task energy consumption using (5)
        Compute the metric RE using (24)
        if RE is below the current minimum then
            Update the minimum RE and record this processor-frequency pair
        end
    end
    Assign the task to the recorded processor-frequency pair
    Update the processor execution finish time
end
Slack reclamation:
for each task in the scheduled task-processor pairs do
    Compute the task slack time using (25)
    for each frequency of the task’s processor do
        Compute the optimal frequency using (26)
    end
    Reassign the task and update the corresponding data
end
Compute the schedule length, application reliability, and system energy consumption

5.1. Task Priorities Phase

This step is essential for list scheduling algorithms. A task processing list is generated by sorting the tasks in decreasing order of some predefined rank function [5, 6, 8–10, 29]. Here, we use the average computation cost of each task, taken over the processor-frequency pairs of the system.

In this research, we use the task level as the rank function. The level of a task is the sum of the path weights from that task to the exit task. We can compute this value recursively by traversing the DAG from the exit task; the recursion runs over the set of immediate successors of the task and includes the average reliability overhead of the task.

For the exit task, which has no successors, the level is given directly by (21).

Basically, the level is the length of the critical path from a task to the exit task, including the average computation cost and reliability overhead of the task. For example, considering the application DAG in Figure 1(b), the heterogeneous system parameters in Table 1, the task execution time matrix in Table 2, and the communication matrix in Table 3, the task level values, recursively computed by (19) and (21), are shown in Table 4.
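The recursive computation of task levels can be sketched as below. The exact way the paper combines the terms is not reproduced here; the sketch assumes the usual upward-rank form, in which a task's level is its average computation cost plus its average reliability overhead plus the largest value of communication time plus successor level. All names and the sample data are hypothetical.

```python
def task_levels(succs, avg_cost, avg_overhead, comm):
    """Compute a level for every task by recursion from the exit task upward.

    succs:        dict mapping a task to its immediate successors
    avg_cost:     dict mapping a task to its average computation cost
    avg_overhead: dict mapping a task to its average reliability overhead
    comm:         dict mapping an edge (task, successor) to its communication time
    """
    levels = {}

    def level(task):
        if task not in levels:
            # An exit task has no successors, so its tail is zero.
            tail = max(
                (comm[(task, s)] + level(s) for s in succs.get(task, ())),
                default=0.0,
            )
            levels[task] = avg_cost[task] + avg_overhead[task] + tail
        return levels[task]

    for task in avg_cost:
        level(task)
    return levels


def scheduling_list(levels):
    """Tasks sorted by nonincreasing level."""
    return sorted(levels, key=lambda task: -levels[task])


if __name__ == "__main__":
    succs = {"a": ("b", "c"), "b": ("d",), "c": ("d",)}
    cost = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}
    overhead = {t: 0.5 for t in cost}
    comm = {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "d"): 2.0, ("c", "d"): 1.0}
    print(scheduling_list(task_levels(succs, cost, overhead, comm)))
```

Sorting by nonincreasing level always places a task before its successors, because a task's level strictly exceeds that of each successor whenever costs are positive, so the resulting list respects the precedence constraints.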

5.2. Task Assignment Phase

In this phase, tasks are assigned to the processors with the earliest execution finish time, high reliability, and minimum task energy consumption. However, for heterogeneous systems, these performance metrics conflict most of the time. Here, we introduce a novel objective, RE, which achieves a good tradeoff among these metrics. We first redefine the earliest execution finish time of a task on a processor at a given frequency so that it includes the reliability overhead of the task on that processor at that frequency.

On the other hand, we consider the minimum earliest execution finish time and the minimum task energy consumption over all processors of the heterogeneous system. The novel metric RE of a task on a processor at a given frequency is defined from these quantities, with a weight on the task earliest execution finish time. If the task execution time is more important than energy consumption, we can give the weight a higher value; otherwise, we give it a lower value. Moreover, the scheduling objective of this problem is to minimize both schedule length and energy consumption. Thus, in each task assignment step, we look for the minimum RE and assign the task to the corresponding processor-frequency pair.
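The assignment step can be illustrated with a small sketch. The form of the metric used below is an assumption: a weighted sum of the finish time and the energy, each normalized by its minimum over all candidates, so that both terms are dimensionless. The function name, the `weight` parameter, and the sample numbers are hypothetical.

```python
def select_pair(candidates, weight):
    """Pick the processor-frequency pair with the smallest combined metric.

    candidates: dict mapping (processor, frequency) -> (finish_time, energy),
                with strictly positive finish times and energies
    weight:     weight in [0, 1] on the finish-time term
    Assumed metric: weight * eft / eft_min + (1 - weight) * energy / energy_min.
    """
    eft_min = min(eft for eft, _ in candidates.values())
    energy_min = min(energy for _, energy in candidates.values())

    def metric(pair):
        eft, energy = candidates[pair]
        return weight * eft / eft_min + (1.0 - weight) * energy / energy_min

    best = min(candidates, key=metric)
    return best, metric(best)


if __name__ == "__main__":
    candidates = {
        ("p1", 3.2): (10.0, 8.0),
        ("p1", 2.1): (14.0, 4.0),
        ("p2", 2.5): (12.0, 5.0),
    }
    for w in (0.5, 1.0):
        print(w, select_pair(candidates, w))
```

With equal weights the slower but cheaper pair wins in this example, while a weight of 1 reduces the rule to picking the earliest finish time.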

5.3. Slack Reclamation

Tasks of a parallel application may have some slack time in their execution, due primarily to communication events, for example, “multidimensional” intertask communication (or intertask data dependencies), and these processor slacks are an obvious source of energy wastage. Slack reclamation has been studied as a way to reduce energy consumption using the slack left by completed task instances. The idea behind slack reclamation is to exploit the slack time to slow down the execution speed of the remaining tasks [12, 20]. In this paper, we adopt this technique to reduce energy consumption after making the scheduling decision. The slack time of a task is defined from its earliest start time and its earliest finish time in the scheduled processor-frequency pairs.

If the slack time of a task is positive, we can scale down its execution frequency to save energy. The optimal frequency is determined relative to the originally scheduled processor-frequency pair. In the last step, we reassign the task to the optimal frequency.
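A minimal sketch of the frequency choice during slack reclamation is given below. It assumes that the execution time of a task scales as work divided by frequency, that energy follows the common CMOS dynamic power form, and that the task stays on its original processor; the function name, the voltage values, and the work units are hypothetical.

```python
def optimal_level(levels, orig_index, work, slack, c_ef=1.0):
    """Choose the operation level that minimizes energy within the slack.

    levels:     list of (frequency, voltage) pairs of the task's processor
    orig_index: index of the level chosen by the scheduler
    work:       amount of work, so that execution time = work / frequency (assumed)
    slack:      nonnegative slack time available to the task
    """
    f_orig, _ = levels[orig_index]
    deadline = work / f_orig + slack

    def energy(level):
        freq, volt = level
        # Dynamic power C_ef * V^2 * f times execution time work / f.
        return c_ef * volt ** 2 * freq * (work / freq)

    # The original level is always feasible, so this list is never empty.
    feasible = [lv for lv in levels if work / lv[0] <= deadline + 1e-12]
    return min(feasible, key=energy)


if __name__ == "__main__":
    levels = [(0.8, 0.9), (2.1, 1.1), (2.5, 1.25), (3.2, 1.4)]
    for slack in (0.0, 1.0, 6.5):
        print(slack, optimal_level(levels, orig_index=3, work=6.4, slack=slack))
```

With no slack the task keeps its original frequency; as the slack grows, lower frequency-voltage pairs become feasible and the one with the lowest energy is selected.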

6. Experimental Results and Discussion

In this section, we compare the performance, energy consumption, and system reliability using our REAS algorithm with three existing scheduling algorithms: DLS [6], RDLS [27], and ECS [20]. The experiments are performed on the synthetic randomly generated precedence-constrained parallel application graphs as described below. The performance metrics chosen for the comparison are the schedule length (2) and (22), systems energy consumption (7), and application reliability (16).

To test the performance of these algorithms, we have developed a discrete event simulation environment of heterogeneous systems with DVFS-enabled processors using C++. The simulated system includes Intel® Core Duo, Intel Xeon, AMD Athlon, 1 TI DSP, and Tesla GPU processors, most of which are Intel processors. The systems are interconnected by Infiniband, a switched fabric communications link primarily used in high-performance computing. For the Infiniband configuration, the switch considered is the Mellanox InfiniScale™ III SDR and the NIC is the Mellanox ConnectX™ IB Dual Copper Card [21]. Other parameters of the model are set as follows. The failure rates of processors (in failures/hr) are assumed to be uniformly distributed over a fixed range [8, 9, 28], and the transmission rate of the links (in Mbits/sec) is assumed to be fixed.

6.1. Randomly Generated Application Graphs

These experiments use three common DAG characteristics to generate parallel application graphs [5, 8, 9, 29]:
(i) DAG Size. It is the number of tasks in the application DAG.
(ii) Communication-Computation Ratio (CCR). It is the ratio of communication time to computation time. A small value means the application is computation-intensive; a large value indicates that the application is communication-intensive [5, 8–10, 29].
(iii) Out-Degree. It is the out-degree of a task node.

In the experimental setting, DAGs are generated from the above parameters for two DAG sizes. Task weights, expressed in execution cycles, are drawn randomly from a uniform distribution around a fixed average. Edge weights are also drawn from a uniform distribution, whose mean is derived from the CCR and the average task weight. Different kinds of parallel applications can be produced by choosing various CCR values [5, 8–10, 29]. In these experiments, we varied the CCR over a reasonable range.
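A generator of this kind can be sketched as follows. The sketch is not the simulator used in the experiments: the uniform ranges (half to one and a half times the mean), the rule that edges only point from lower to higher task indices, and all parameter names are assumptions chosen to keep the example short and the graph acyclic.

```python
import random


def random_dag(n_tasks, ccr, max_out_degree, mean_weight, seed=None):
    """Generate a random task graph from DAG size, CCR, and out-degree.

    Returns (weights, edges): a list of task weights and a dict mapping
    an edge (u, v) with u < v to its communication weight.
    """
    rng = random.Random(seed)
    weights = [
        rng.uniform(0.5 * mean_weight, 1.5 * mean_weight) for _ in range(n_tasks)
    ]
    # The mean communication weight follows from the CCR definition.
    mean_comm = ccr * mean_weight
    edges = {}
    for u in range(n_tasks - 1):
        later = range(u + 1, n_tasks)
        k = rng.randint(1, min(max_out_degree, len(later)))
        # Edges only go to higher-numbered tasks, so no cycle can form.
        for v in rng.sample(later, k):
            edges[(u, v)] = rng.uniform(0.5 * mean_comm, 1.5 * mean_comm)
    return weights, edges


if __name__ == "__main__":
    weights, edges = random_dag(10, ccr=1.0, max_out_degree=3, mean_weight=100.0, seed=1)
    print(len(weights), "tasks,", len(edges), "edges")
```

Every task except the last has at least one successor, so the last task acts as the single exit task; the graph may still have several entry tasks.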

6.2. Various Weight of REAS Algorithm

In the first set of experiments, we evaluate the effect of the weight on the REAS algorithm. Figure 2 shows the simulation results of scheduling the two DAG sizes with CCR = 1 as the weight is varied in fixed steps. We observe from Figure 2 that, as the weight increases, the schedule length and energy consumption decrease, while the application reliability stays at almost the same level. This is reasonable: with a high weight, REAS bases its decisions mostly on task execution time, which makes its schedule length shorter and lets it consume less energy. However, beyond a certain weight, the performance of REAS changes very little. Thus, we fix the weight in the experiments below.

6.3. Random Task Performance Results

For the set of randomly generated parallel applications, the results are shown in Figures 3 and 4, where each data point is the average of the data obtained in 1,000 experiments. In this set of experiments, the weight of the metric RE (see (24)) is set so that the REAS algorithm places the same weight on task execution time and energy consumption. The effect of varying this weight is examined in Section 6.2.

We observe from Figure 3(a) that REAS outperforms RDLS and ECS with respect to schedule length and that the schedule length increases as the CCR increases. The average schedule length of the REAS algorithm is shorter than that of both RDLS and ECS, and this improvement becomes more obvious as the CCR increases, as can be seen for CCR = 5. However, REAS is inferior to DLS in terms of schedule length. Figure 3(b) reveals that REAS consumes less energy on average than RDLS, ECS, and DLS. Figure 3(c) shows that REAS outperforms RDLS, ECS, and DLS in terms of the average application reliability.

This is mainly due to the fact that the REAS algorithm schedules tasks according to the novel objective RE, which achieves an effective tradeoff among task execution time, energy consumption, and task execution reliability. In contrast, the DLS algorithm focuses only on optimizing the task execution time, while the actual execution time of REAS includes the task scheduling time and the reliability overhead. Thus, the scheduling solution generated by DLS achieves the shortest schedule length, but it consumes more energy and has lower reliability. The RDLS algorithm schedules tasks considering their execution reliability and ignores task energy consumption. The ECS algorithm optimizes both schedule length and energy consumption, but its solution incurs more task execution reliability overhead. Thus, the REAS algorithm outperforms RDLS and ECS in terms of schedule length and outperforms RDLS, ECS, and DLS in terms of energy consumption and reliability. Another interesting observation is that RDLS and DLS are better than ECS in terms of reliability. This is mainly because the tasks scheduled by RDLS and DLS always execute at the normal frequency of the processor, which has the highest reliability among all processor frequencies.

The improvements in scheduling performance can also be seen in Figures 3(d), 3(e), and 3(f) for the other DAG size. These results also show that REAS outperforms RDLS and ECS in terms of the average schedule length. REAS also outperforms RDLS, ECS, and DLS in terms of both the average energy consumption and the average application reliability.

We also simulate heterogeneous systems with Intel Xeon and AMD Athlon processors; the other configurations are the same as before. Figure 4 shows the results of randomly generated tasks on this heterogeneous computing platform. The results show that REAS outperforms RDLS, ECS, and DLS in terms of average schedule length and energy consumption. However, REAS is inferior to RDLS in terms of application reliability.

6.4. Application Graphs of Real-World Problem

Using real applications to test the performance of algorithms is very common [5, 8–10, 29]. In this section, we also simulate a real-world digital signal processing (DSP) problem; the details can be found in [5, 8–10, 29]. From Figure 5, we can conclude that REAS is also better than RDLS, ECS, and DLS.

7. Conclusions and Future Work

In the past few years, with the rapid development of heterogeneous systems, the high price of energy, system performance, reliability, and various environmental issues have forced the high-performance computing sector to reconsider some of its old practices with the aim of creating more sustainable systems. In this paper, we attempt the simultaneous management of system performance, reliability, and energy consumption. To achieve this goal, we first built a reliability and energy aware task scheduling architecture, which mainly includes heterogeneous systems, a parallel application DAG model, and an energy consumption model. Then, we proposed a relationship between execution reliability and the processor’s voltage/frequency and deduced approximate values of its parameters by the least squares curve fitting method. Thirdly, we established a parallel application execution reliability model and formulated the reliability and energy aware scheduling problem as a linear program. Finally, to solve this problem, we proposed a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm based on a novel scheduling objective, RE, which jointly considers task execution time, energy consumption, and reliability.

The performance of REAS algorithm is evaluated with an extensive set of simulations and compared to three of the best existing scheduling algorithms for heterogeneous systems: the RDLS, ECS, and DLS algorithms. The comparison is also performed on the synthetic randomly generated precedence-constrained parallel application DAG. The simulation experiment results clearly confirm the superior performance of REAS algorithm over the other three, particularly in energy saving.

This work is one of the first attempts to consider the simultaneous management of system performance, reliability, and energy consumption on high-performance computing systems. Future studies in this domain are twofold. Firstly, it will be interesting to extend our model to multidimensional computing resources, such as interconnections, memory access, and I/O activities. Secondly, in this paper, the failures occurring on resources of systems are assumed to follow Poisson process. Other reliability models can also be used in further studies.

Competing Interests

The authors declare that they have no competing interests.


Acknowledgments

This research was partially funded by the National Science Foundation of China (Grant no. 61370098), Hunan Provincial Natural Science Foundation of China (Grant no. 2015JJ2078), National High-Tech R&D Program of China (2015AA015303), Key Technology Research and Development Programs of Guangdong Province (2015B010108006), and a project supported by the Science Foundation for Postdoctorate Research from the Ministry of Science and Technology of China (Grant no. 2014M552134).