Abstract
The energy needed to operate high-performance computing systems has been rising rapidly for several years, and energy consumption has attracted a great deal of attention. Moreover, high energy consumption inevitably induces failures and reduces system reliability. However, considerably less work has addressed the simultaneous management of system performance, reliability, and energy consumption on heterogeneous systems. In this paper, we first build models of precedence-constrained parallel applications and of energy consumption. Then, we derive the relation between reliability and processor frequency and obtain approximate values of its parameters by least squares curve fitting. Thirdly, we establish a task execution reliability model and formulate the reliability and energy aware scheduling problem as a linear programming problem. Lastly, we propose a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm to solve this problem, which achieves a good tradeoff among system performance, reliability, and energy consumption with lower complexity. Our extensive simulation-based performance evaluation demonstrates the tradeoff achieved by the proposed heuristic algorithm.
1. Introduction
For a long time, energy consumption was simply ignored in the performance evaluation of large-scale parallel computing systems. However, the DCD Intelligence (DCDi) Industry Census reported that the electricity consumed by global data centers reached 40 GW in 2013 and was still increasing [1]. According to the latest Top 500 supercomputer ranking, the power consumption of the first-ranked supercomputer, “Tianhe-2”, is 17.808 MW, and the average power consumption of the top systems in the ranking list is 6.2939 MW [2]. Thus, high energy cost is clearly a key concern in designing and operating heterogeneous systems.
On the other hand, such computing systems are groups of heterogeneous processors connected via a high-speed network that supports the execution of parallel applications. For example, the top-ranked supercomputer “Tianhe-2” consists of Intel Xeon® E5-2692 12C 2.200 GHz processors and Intel Xeon Phi 31S1P (MIC) coprocessors [2]. The number of transistors integrated into today’s Intel Xeon EX processor is in the billions, and its power consumption exceeds 130 W [3]. This raises the possibility of worsening single-processor reliability, which eventually degrades the reliability of the whole heterogeneous system. Furthermore, modern large-scale computing systems usually have a large number of processors, such as “Tianhe-2” with 3,120,000 cores and “Titan” with 560,640 cores [2]. One of the main problems in this situation is system reliability, which decreases drastically as the number of processor cores increases [4]. Even when a single processor’s one-hour reliability is very high, such as 0.999999, the system’s MTTF (Mean Time to Failure) drops to less than 100 hours as the system size approaches 10,000 cores [4]. This motivates the main problem of this paper, which is the simultaneous management of system performance, reliability, and energy consumption.
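The drop in system MTTF described above can be checked with a few lines of arithmetic. The sketch below assumes independent cores with a constant failure rate (exponentially distributed failures) and a system that fails as soon as any single core fails; the function name is illustrative and not taken from the cited work.

```python
import math

def system_mttf_hours(core_reliability_1h: float, n_cores: int) -> float:
    """MTTF of a series system of identical, independent cores (assumed model)."""
    # R(1 h) = exp(-lambda * 1 h)  =>  lambda = -ln R(1 h), in failures/hour.
    core_failure_rate = -math.log(core_reliability_1h)
    # Failure rates of independent components in series add up.
    system_failure_rate = n_cores * core_failure_rate
    return 1.0 / system_failure_rate

mttf = system_mttf_hours(0.999999, 10_000)
print(f"System MTTF: {mttf:.3f} hours")
```

Under these assumptions the result lands just under 100 hours, which illustrates how quickly a very reliable component stops being reliable at scale.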
In recognition of this, we first build a reliability and energy aware task scheduling architecture that includes models of precedence-constrained parallel applications and of energy consumption on heterogeneous systems. Then, we propose a single-processor failure rate model based on the DVFS technique and derive the application reliability of the system. Finally, to solve this problem efficiently, we propose a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm, which adopts a novel scheduling objective, RE. The overall objective of this paper is to achieve a good tradeoff among performance, reliability, and energy consumption.
The rest of the paper is organized as follows: the related work is summarized in Section 2. We describe the task scheduling system model in Section 3. In Section 4, we provide a system reliability model. To solve this problem, a heuristic reliability and energy aware task scheduling algorithm is proposed in Section 5. In Section 6, we verify the performance of the proposed algorithm by comparing the results obtained from performance evaluation. Finally, we summarize the contributions and make some remarks on further research in Section 7.
2. Related Work
A high-performance parallel application running on a computing system is usually composed of intercommunicating tasks, which are scheduled to run on different processors in the system. In most cases, the main objective of a scheduling strategy is to map the interacting program tasks onto processors and order their executions so that task precedence requirements are satisfied and, at the same time, the minimum schedule length (makespan) is achieved. The problem of finding the optimal schedule is NP-complete in general [5–9]. Many scheduling algorithms have been proposed to deal with this problem, for example, the dynamic-level scheduling (DLS) algorithm [6] and the heterogeneous earliest-finish-time (HEFT) algorithm [5, 8, 10, 11].
As energy consumption has become an important issue in designing large-scale computing systems in the last few years, many techniques, including dynamic voltage-frequency scaling (DVFS), dynamic powering on/off, slack reclamation, resource hibernation, and memory optimizations, have been investigated and developed to reduce energy consumption [12–14]. DVFS, a technique in which a processor runs at a less-than-maximum frequency when it is not fully utilized in order to conserve power, is perhaps the most appealing method for reducing energy consumption [14, 15]. Most early DVFS-based research focused on single processors in embedded and real-time computing systems [14, 16, 17]. Recently, there has been a significant amount of work on task scheduling for heterogeneous systems using DVFS-enabled techniques. For instance, Rountree et al. focused on energy optimization of MPI programs in HPC environments and proposed a linear programming (LP) formulation, which incorporates allowable time delays, communication slack, and memory pressure into its scheduling using DVFS (i.e., slack reclamation) [18]. Rizvandi et al. proposed a method to find the best processor frequencies to obtain the optimal energy consumption [19]. Lee and Zomaya addressed the problem of scheduling precedence-constrained parallel applications on multiprocessor computer systems; their scheduling decisions are made using the relative superiority (RS) metric, devised as a novel objective function [20]. In [21], Zong et al. proposed two energy-efficient scheduling algorithms (EAD and PEBD) for parallel tasks on homogeneous clusters based on a duplication strategy.
All of this work demonstrated that dynamically adjusting the processor’s voltage and frequency can effectively reduce system energy consumption. However, recent research has shown that scaling the processor’s voltage and frequency makes nanoscale semiconductor circuits more susceptible to cosmic ray radiation, electromagnetic interference, and alpha particles, which increases the unreliability of the processor [22–24]. Thus, it is worthwhile to incorporate reliability into DVFS-based energy aware scheduling. Recently, Zhu et al. focused on reducing energy consumption while preserving system reliability for periodic real-time tasks [25, 26]. They proposed a reliability model in which a processor’s reliability decreases as its voltage and frequency are scaled from maximum to minimum, and they incorporated the reliability requirements into heuristic energy aware task scheduling strategies. However, their techniques are not suitable for precedence-constrained parallel applications on heterogeneous systems based on DVFS-enabled processors.
Many studies have dealt with reliability on heterogeneous systems. For example, Dogan and Özgüner introduced three reliability cost functions that were incorporated into the dynamic level (DL) and proposed a reliable dynamic level scheduling algorithm (RDLS) [27]; the goal was to minimize not only the execution time but also the failure probability of the application. In our previous work [8], we proposed a scheduling algorithm that considers the execution reliability of tasks. Qin and Jiang investigated a dynamic and reliability-cost-driven (DRCD) scheduling algorithm for precedence-constrained tasks in heterogeneous clusters [28]. Unfortunately, those works did not consider energy consumption or the effect of scaling the processor’s voltage and frequency on reliability. In recognition of this, we focus on reliability and energy consumption on DVFS-enabled heterogeneous systems.
3. System Models
3.1. Scheduling Architecture
Various task scheduling architectures have been proposed in the literature [5, 8, 9, 14, 28, 29]. However, energy consumption and system reliability are not effectively incorporated into scheduling in these architectures. In this paper, we propose a reliability and energy aware task scheduling architecture, as depicted in Figure 1(a). It is assumed that all parallel applications, along with information provided by the user, are submitted to the system by a special user command. First, each parallel application is decomposed into a task DAG by the Task DAG Model. Then, the estimated energy consumption of the tasks, which are executed on the DVFS-enabled heterogeneous processors, is computed by the Energy Consumption Estimator. At the same time, the reliability analysis computes each processor’s reliability at its different frequencies to obtain the reliability of the whole system. Finally, the Scheduler schedules tasks based on the above task energy consumption and system reliability.
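Figure 1: (a) Reliability and energy aware task scheduling architecture. (b) An example parallel application DAG.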
3.2. Heterogeneous Systems
The target system used in this work consists of a set of heterogeneous processors/machines [5, 8, 9, 14, 29], which are connected by high-speed interconnects such as Infiniband and Myrinet. Each DVFS-enabled processor can adjust its operational voltage and frequency [14]. Each processor therefore runs at one of a discrete set of frequency-voltage pairs, indexed by the processor’s operation level [14, 30]. For example, the quad-core AMD Phenom II supports four different frequencies (0.8 GHz, 2.1 GHz, 2.5 GHz, and 3.2 GHz) and a corresponding range of supply voltages [30]. Since clock frequency transitions take a negligible amount of time (e.g., 10 μs–150 μs), these overheads are not considered in our study.
Failures of a heterogeneous processor are assumed to follow a Poisson process, and each processor has a constant failure rate [8, 9, 29]. In particular, each processor is characterized by the failure rate it exhibits when working at its normal voltage and frequency [8, 9, 27, 29]. These failure rates can be derived from system profiling, system logs, and statistical prediction techniques [31]. For demonstration purposes, we consider two heterogeneous processors, each with its own set of frequency levels; their parameters are listed in Table 1.
3.3. Applications Model
The precedence-constrained tasks of a parallel application are usually represented as a Directed Acyclic Graph (DAG) [5, 8–10, 29] with four components. The first is the set of tasks, each of which can be scheduled on any available DVFS-enabled processor [5, 8–10, 29]. The second is the set of edges, which represents the precedence relation that defines a partial order on the task set: an edge from one task to another implies that the former must be finished before the latter can start execution [5, 8–10, 29]. The third is the communication matrix, which gives the communication time between each pair of dependent tasks. The fourth is the computation matrix, each entry of which gives the estimated time to execute a task on a given processor at a given frequency, for every operation level up to the maximal level in the system. The communication cost and computation cost can be evaluated by building a historic table and using code profiling or statistical prediction techniques [31]. Figure 1(b) shows a parallel application DAG, Table 2 lists the task execution times on the two heterogeneous DVFS-enabled processors listed in Table 1, and the communication times among these tasks are listed in Table 3.
Generally, the common objective of task scheduling is to map precedence-constrained tasks onto processors so as to obtain a minimum schedule length (also called the makespan) [10, 11]. Before presenting the schedule length, it is necessary to define two scheduling attributes of a task: its earliest execution start time and its earliest execution finish time. The earliest execution start time of a task on a DVFS-enabled processor at a given frequency is constrained by the task precedence relation and by the time at which the processor becomes available [5, 8–10, 29]. The earliest execution finish time of a task on a processor at a given frequency is its earliest execution start time plus its execution time on that processor at that frequency.
In this paper, a binary assignment variable equals 1 if a task is scheduled on a given processor at a given frequency and 0 otherwise. The schedule length is then the largest finish time over all scheduled tasks.
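As an illustration of these scheduling attributes, the following sketch computes earliest start times, earliest finish times, and the schedule length for a fixed task-to-processor assignment. It is a minimal sketch under stated assumptions: the frequency is already folded into the execution time table, communication between tasks on the same processor is assumed to cost nothing, and all names and numbers are hypothetical rather than taken from Tables 1 to 3.

```python
def schedule_length(order, exec_time, comm, preds):
    """order: list of (task, proc) pairs in a precedence-respecting order."""
    proc_free = {}  # processor -> time at which it becomes available
    placed = {}     # task -> (processor, finish time)
    for task, proc in order:
        # Data from each predecessor arrives at its finish time plus the
        # communication time (assumed zero when both share a processor).
        ready = 0.0
        for p in preds.get(task, []):
            p_proc, p_finish = placed[p]
            delay = 0.0 if p_proc == proc else comm[(p, task)]
            ready = max(ready, p_finish + delay)
        est = max(ready, proc_free.get(proc, 0.0))  # earliest start time
        eft = est + exec_time[(task, proc)]         # earliest finish time
        placed[task] = (proc, eft)
        proc_free[proc] = eft
    makespan = max(finish for _, finish in placed.values())
    return makespan, placed

# Hypothetical diamond-shaped DAG: A -> B, A -> C, B -> D, C -> D.
preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 2, ("C", "D"): 2}
exec_time = {("A", "p0"): 2, ("B", "p0"): 3, ("C", "p1"): 4, ("D", "p0"): 1}
order = [("A", "p0"), ("B", "p0"), ("C", "p1"), ("D", "p0")]
makespan, placed = schedule_length(order, exec_time, comm, preds)
print(makespan, placed)
```

In this example task D must wait for the data sent by C from the other processor, so the communication delay, not the processor availability, determines the schedule length.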
3.4. Energy Model
The energy consumption of a computing system depends on its memory, disks, CPUs, and other components. This paper considers only DVFS-enabled CPUs, which consume the largest proportion of the system’s energy [14, 19, 20, 32]. The power consumption of a DVFS-enabled microprocessor based on complementary metal-oxide semiconductor (CMOS) logic circuits mainly consists of static power, frequency-independent active power, and dynamic power dissipation [25, 26]. The static power is a constant: it is the power used to maintain basic circuits and keep the clock running. A mode indicator records whether the processor is in execution mode or not. The dynamic power dissipation is the most significant factor of processor power consumption and can be estimated from the switched capacitance, the supply voltage, the processor’s working frequency, and a circuit-dependent constant [14, 16, 19, 20, 32]. Examples of such processor parameters are listed in Table 1.
The energy consumed by a task running on a DVFS-enabled processor at a given frequency is determined by the task execution time and the processor power consumption, namely, the dynamic power dissipation of the processor at that frequency (see (4)). Thus, the energy consumption of an application is the sum of the energy consumption of all of its tasks.
At the same time, in heterogeneous systems all processors are powered on and are in either sleep mode or execution mode. That is, every processor of the system consumes static power all the time. Thus, the energy consumption of the computing system is the sum of the static power consumption of all processors and the dynamic energy consumption of the application.
Obviously, the system energy consumption is greater than the application energy consumption. In this paper, one of our main objectives is to minimize the system energy consumption.
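The energy accounting described in this subsection can be sketched as follows. This is a minimal sketch, not the paper's exact equations: it assumes the common CMOS form in which dynamic power is proportional to capacitance times voltage squared times frequency, a task energy equal to execution time times dynamic power, and the same static power for every processor. All names, units, and numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Level:
    voltage: float    # supply voltage (V)
    frequency: float  # clock frequency (GHz)

def dynamic_power(level, capacitance=1.0, k=1.0):
    # Assumed form: P_d = k * C * V^2 * f (k is a circuit-dependent constant).
    return k * capacitance * level.voltage ** 2 * level.frequency

def application_energy(assignments):
    """assignments: list of (execution_time, Level), one entry per task."""
    return sum(t * dynamic_power(level) for t, level in assignments)

def system_energy(assignments, n_procs, static_power, makespan):
    # Every processor draws static power for the whole schedule length.
    static_part = n_procs * static_power * makespan
    return static_part + application_energy(assignments)

assignments = [(3.0, Level(1.0, 2.0)), (4.0, Level(0.5, 1.0))]
app_e = application_energy(assignments)
sys_e = system_energy(assignments, n_procs=2, static_power=0.1, makespan=5.0)
print(app_e, sys_e)
```

Because the static term is added for every processor over the whole schedule, the system energy always exceeds the application energy, and a shorter schedule reduces the static share.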
4. System Reliability Analysis and Problem Statement
In this section, we first provide the failure rate model of a single DVFS-enabled processor. Then, we analyze the reliability of heterogeneous systems. Finally, we formulate reliability and energy aware task scheduling as a linear programming problem.
4.1. Single DVFS-Enabled Processor Failure Rate
Among the various sources of unreliability in a semiconductor processor, the failure rate due to soft errors induced by cosmic ray radiation is predicted to dominate all other reliability issues [24]. A transient fault occurs when a high-energy particle, such as an alpha particle or a neutron, strikes a sensitive region in a semiconductor device and flips the logical state of the struck node [33]. Most modern DVFS-enabled processors integrate billions of transistors on a single chip, which increases the number of sensitive devices in submicron technologies that are vulnerable to soft errors and consequently raises the Soft Error Rate (SER) [34]. These phenomena become more and more serious with the continued scaling of the processor’s voltage and frequency [23, 25].
Traditionally, the reliability of a DVFS-enabled processor has been modeled by a Poisson distribution with a constant failure rate that applies when the processor works at its normal voltage and frequency [8, 9, 27, 29, 35]. Moreover, it has been shown that DVFS has a direct and negative effect on failure rates: blindly applying DVFS to scale the supply voltage and processing frequency for energy savings may cause significant degradation in the processor’s reliability [23, 25, 26]. Therefore, for the DVFS-enabled heterogeneous processors considered in this paper, the failure rate at a reduced frequency (and the corresponding voltage) is modeled in terms of the failure rate at the normal processing frequency (and the corresponding normal voltage). Prior studies of the effect of supply voltage on processor reliability have revealed that failure rates generally increase as the processing frequency (and supply voltage) is scaled away from the normal operating point [24, 36]. In addition, fault rates are exponentially related to the circuit’s critical charge (which is related to the threshold voltage). The resulting failure rate model (9) therefore has two parameters, an exponent related to the threshold voltage and a constant representing the sensitivity of fault rates to frequency scaling, and it is defined over the range between the minimum and maximum frequency.
To obtain accurate values of these two parameters, we use the least squares curve fitting method [37]. Taking the natural logarithm of both sides of (9) gives (10), and a change of variables then turns (10) into a linear relation. Thus, approximate values of the two parameters can be obtained by least squares linear fitting.
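The log-then-fit procedure can be illustrated with a short sketch. Since the exact form of (9) is not reproduced here, the code assumes a generic two-parameter exponential model, rate = a * exp(b * f), which becomes the straight line ln(rate) = ln(a) + b * f after taking logarithms. The function name, the sample frequencies, and the synthetic failure rates are all hypothetical.

```python
import math

def fit_log_linear(freqs, rates):
    """Fit rate = a * exp(b * f) by ordinary least squares on ln(rate)."""
    xs = list(freqs)
    ys = [math.log(r) for r in rates]  # linearise the model
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # estimate of b
    intercept = mean_y - slope * mean_x    # estimate of ln(a)
    return math.exp(intercept), slope

# Synthetic, noise-free data generated from a = 1e-5 and b = -2.0.
freqs = [0.8, 2.1, 2.5, 3.2]
rates = [1e-5 * math.exp(-2.0 * f) for f in freqs]
a, b = fit_log_linear(freqs, rates)
print(a, b)
```

With measured failure rates the recovered parameters would only approximate the underlying ones, and the quality of the fit depends on how well the exponential form matches the data.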
4.2. Application Reliability Analysis
Assume that a task is processed during a time interval on a heterogeneous DVFS-enabled processor at a given frequency, where the interval runs from the task’s execution start time to its finish time [5, 8, 9, 29]. Since failures follow a Poisson process with a constant rate, the task execution reliability is the exponential of the negative product of the processor’s failure rate at that frequency and the length of the interval.
For a task of the application on a processor at a given frequency, its reliability is determined jointly by the reliabilities of all of its immediate parent tasks (its direct predecessors) and its own execution reliability on that processor at that frequency.
For the entry task of the application, which has no predecessors, the reliability is simply its execution reliability on the processor and frequency at which it is executed.
Generally, an application has one exit task. The reliability of the application is equal to the reliability of the exit task.
Improving the application reliability is the other objective of this paper. From the above analysis, we know that allocating tasks with shorter execution times to more reliable processors might be a good heuristic for increasing reliability.
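The reliability analysis can be sketched in a few lines. The sketch assumes that execution reliability under a constant failure rate is exp(-rate * duration) and that a task's reliability is the product of its own execution reliability and the reliabilities of its direct predecessors; the product rule is an assumption made for illustration, and the task names and numbers are hypothetical.

```python
import math

def execution_reliability(failure_rate, start, finish):
    # Constant failure rate: probability of no failure during [start, finish].
    return math.exp(-failure_rate * (finish - start))

def application_reliability(tasks, preds, exit_task):
    """tasks: task -> (failure_rate, start, finish) on its assigned processor."""
    cache = {}

    def reliability(task):
        if task not in cache:
            r = execution_reliability(*tasks[task])
            # Assumed combination rule: multiply by each direct predecessor.
            for p in preds.get(task, []):
                r *= reliability(p)
            cache[task] = r
        return cache[task]

    return reliability(exit_task)

# Hypothetical three-task chain A -> B -> C, failure rates in failures/hour.
tasks = {"A": (1e-5, 0.0, 10.0), "B": (1e-5, 10.0, 30.0), "C": (1e-5, 30.0, 60.0)}
preds = {"B": ["A"], "C": ["B"]}
app_rel = application_reliability(tasks, preds, "C")
print(app_rel)
```

For a chain, the result is the reliability of running for the total execution time, which makes the heuristic in the text concrete: shorter execution times and lower failure rates both raise the final value.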
4.3. Problem Statement
As simultaneous management of scheduling performance, system reliability, and energy consumption is the main problem of this paper, we formulate it as follows:
5. Proposed Reliability-Energy Aware Scheduling Algorithm
This section presents a Reliability-Energy Aware Scheduling algorithm for heterogeneous systems, called REAS, which aims at achieving lower energy consumption, high reliability, and shorter schedule length. Its scheduling decisions are made using a hybrid metric that includes energy consumption, reliability, and schedule length, devised as a novel objective function. The pseudocode of the algorithm is shown in Algorithm 1. The algorithm consists of three main phases, described in the following sections.

5.1. Task Priorities Phase
This phase is essential for list scheduling algorithms. A task processing list is generated by sorting the tasks in decreasing order of a predefined rank function; several such functions have been used in the literature [5, 6, 8–10, 29]. Here, the rank is based on the average computation cost of each task.
The rank of a task is the sum of the path weights from the task to the exit task. It is computed recursively by traversing the DAG from the exit task, using the set of immediate successors of each task and the average reliability overhead of each task.
For the exit task, the recursion terminates, since the exit task has no successors.
Basically, the rank is the length of the critical path from a task to the exit task, including the average computation cost and reliability overhead of the task. For example, considering the application DAG in Figure 1(b), the heterogeneous system parameters in Table 1, the task execution time matrix in Table 2, and the communication matrix in Table 3, the task rank values computed recursively by (19) and (21) are shown in Table 4.
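The recursive rank computation can be sketched as below. It assumes an upward-rank style recursion in which the rank of a task is its average computation cost plus its average reliability overhead plus the largest value, over its immediate successors, of communication time plus successor rank. The reliability overhead is taken as a given input, and all names and values are hypothetical rather than those of Tables 1 to 4.

```python
from functools import lru_cache

def make_rank(avg_cost, avg_rel_overhead, succs, comm):
    @lru_cache(maxsize=None)
    def rank(task):
        # Longest remaining path through any immediate successor;
        # the exit task has no successors, so the tail is zero.
        tail = max(
            (comm[(task, s)] + rank(s) for s in succs.get(task, ())),
            default=0.0,
        )
        return avg_cost[task] + avg_rel_overhead[task] + tail
    return rank

# Hypothetical diamond-shaped DAG: A -> B, A -> C, B -> D, C -> D.
succs = {"A": ("B", "C"), "B": ("D",), "C": ("D",)}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 2, ("C", "D"): 2}
avg_cost = {"A": 2.0, "B": 3.0, "C": 4.0, "D": 1.0}
avg_rel_overhead = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}

rank = make_rank(avg_cost, avg_rel_overhead, succs, comm)
order = sorted(avg_cost, key=rank, reverse=True)  # task processing list
print(order, [round(rank(t), 2) for t in order])
```

Sorting by decreasing rank always places a task before its successors, because each successor's rank is contained in its predecessor's rank, so the resulting list respects the precedence constraints.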
5.2. Task Assignment Phase
In this phase, tasks are assigned to the processors that offer the earliest execution finish time, high reliability, and minimum task energy consumption. However, in heterogeneous systems these performance metrics conflict most of the time. Here, we introduce a novel objective, RE, which achieves a good tradeoff among these metrics. We first redefine the earliest execution finish time of a task on a processor at a given frequency so that it includes the reliability overhead of the task on that processor at that frequency.
We then take the smallest earliest execution finish time and the minimum task energy consumption over all processors of the heterogeneous system as reference values. The RE metric of a task on a processor at a given frequency combines the task’s earliest execution finish time and its energy consumption, using a weight on the earliest execution finish time. If task execution time is more important than energy consumption, the weight is given a higher value; otherwise, it is given a lower value. Since the scheduling objective is to minimize both schedule length and energy consumption, in each task assignment step we look for the minimum RE and assign the task to the corresponding processor and frequency.
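The assignment step can be sketched as a weighted selection over candidate processor-frequency pairs. The exact form of the metric is not reproduced here, so the sketch assumes a weighted sum of finish time and energy, each normalised by the best value over all candidates; the function names, the weight name `theta`, and the candidate values are hypothetical.

```python
def re_metric(eft, energy, eft_min, energy_min, theta):
    # Assumed form: weighted sum of normalised finish time and energy.
    return theta * eft / eft_min + (1.0 - theta) * energy / energy_min

def pick_assignment(candidates, theta=0.5):
    """candidates: (processor, frequency) -> (finish time, energy)."""
    eft_min = min(eft for eft, _ in candidates.values())
    energy_min = min(energy for _, energy in candidates.values())
    return min(
        candidates,
        key=lambda pf: re_metric(*candidates[pf], eft_min, energy_min, theta),
    )

candidates = {
    ("p0", 3.2): (10.0, 8.0),  # fastest, most energy
    ("p0", 0.8): (30.0, 3.0),  # slowest, least energy
    ("p1", 2.5): (12.0, 4.0),  # balanced
}
balanced = pick_assignment(candidates, theta=0.5)
time_only = pick_assignment(candidates, theta=1.0)
energy_only = pick_assignment(candidates, theta=0.0)
print(balanced, time_only, energy_only)
```

With equal weights the balanced candidate wins, while the extreme weights reduce the choice to a pure finish-time or pure energy decision, which is the behaviour the weight is meant to control.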
5.3. Slack Reclamation
Tasks of a parallel application may have some slack time in their execution, due primarily to communication events such as “multidimensional” intertask communication (or intertask data dependencies), and these processor slacks are an obvious source of energy wastage. Slack reclamation has been studied as a way to reduce energy consumption using the slack left by completed task instances. The idea behind slack reclamation is to exploit the slack time to slow down the execution speed of the remaining tasks [12, 20]. In this paper, we adopt this technique to reduce energy consumption after the scheduling decision has been made. The slack time of a task is defined in terms of the task’s earliest start time and earliest finish time on its scheduled processor-frequency pair.
If the slack time of a task is greater than zero, we can scale down its execution frequency to save energy. The optimal frequency is chosen relative to the original scheduled processor-frequency pair. In the last step, we reassign the task to this optimal frequency.
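A slack reclamation step of this kind can be sketched as follows. The sketch assumes that the task may use any frequency whose execution time still fits within the original execution time plus the slack, and that among those frequencies the one with the lowest energy (execution time times power) is chosen. The selection rule, names, and numbers are illustrative assumptions.

```python
def reclaim_slack(exec_time_at, power_at, original_freq, slack):
    """Return the frequency that minimises energy within the time budget."""
    budget = exec_time_at[original_freq] + slack
    # The original frequency is always feasible, so the list is never empty.
    feasible = [f for f in exec_time_at if exec_time_at[f] <= budget]
    return min(feasible, key=lambda f: exec_time_at[f] * power_at[f])

# Hypothetical per-frequency (GHz) execution times and power draws.
exec_time_at = {3.2: 10.0, 2.5: 12.8, 2.1: 15.2, 0.8: 40.0}
power_at = {3.2: 8.0, 2.5: 5.0, 2.1: 4.0, 0.8: 1.0}

no_slack = reclaim_slack(exec_time_at, power_at, original_freq=3.2, slack=0.0)
small_slack = reclaim_slack(exec_time_at, power_at, original_freq=3.2, slack=3.0)
larger_slack = reclaim_slack(exec_time_at, power_at, original_freq=3.2, slack=6.0)
print(no_slack, small_slack, larger_slack)
```

Because the time budget never exceeds the original finish time plus the slack, the reassignment does not delay any successor task and leaves the schedule length unchanged.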
6. Experimental Results and Discussion
In this section, we compare the performance, energy consumption, and system reliability of our REAS algorithm with those of three existing scheduling algorithms: DLS [6], RDLS [27], and ECS [20]. The experiments are performed on synthetic, randomly generated precedence-constrained parallel application graphs, as described below. The performance metrics chosen for the comparison are the schedule length ((2) and (22)), the system energy consumption (7), and the application reliability (16).
To test the performance of these algorithms, we have developed a discrete event simulation environment in C++ for heterogeneous systems with DVFS-enabled processors. The simulator includes Intel® Core™ Duo, Intel Xeon, AMD Athlon, 1 TI DSP, and Tesla GPU, and is mostly based on Intel processors. The systems are interconnected by Infiniband, a switched fabric communications link primarily used in high-performance computing. For the Infiniband configuration, the switch considered is the Mellanox InfiniScale™ III SDR and the NIC is the Mellanox ConnectX™ IB Dual Copper Card [21]. Other parameters of the model are set as follows. The failure rates of processors are assumed to be uniformly distributed over a fixed range of failures/hr [8, 9, 28]; the transmission rates of links are assumed to be fixed.
6.1. Randomly Generated Application Graphs
These experiments use three common DAG characteristics to generate parallel application graphs [5, 8, 9, 29]:
(i) DAG Size. It is the number of tasks in the application DAG.
(ii) Communication-Computation Ratio (CCR). It is the ratio of communication time to computation time. A small value means the application is computation-intensive; a large value indicates that the application is communication-intensive [5, 8–10, 29].
(iii) Out-Degree. It is the out-degree of a task node.
In the experimental setting, DAGs are generated based on the above parameters for two DAG sizes. Task weights, measured in execution cycles, are generated randomly from a uniform distribution around a fixed mean. Edge weights are also generated from a uniform distribution based on a given mean. Different types of parallel applications can be produced by giving various CCR values [5, 8–10, 29]. In these experiments, we varied the CCR over a reasonable range.
6.2. Various Weight of REAS Algorithm
In the first set of experiments, we evaluate the effect of the weight on the REAS algorithm. Figure 2 shows the simulation results for the two DAG sizes with CCR = 1 as the weight is varied in fixed steps. We observe from Figure 2 that the schedule length and energy consumption decrease, while the application reliability stays at almost the same level, as the weight increases. This is reasonable: with a high weight, REAS bases its decisions mostly on task execution time, which shortens the schedule and consumes less energy. However, once the weight exceeds a certain value, the performance of REAS changes little. The weight is therefore fixed in the experiments below.
Figure 2: Schedule length, energy consumption, and application reliability of REAS with CCR = 1 under varying weight (panels (a) to (c)).
6.3. Random Task Performance Results
For the set of randomly generated parallel applications, the results are shown in Figures 3 and 4, where each data point is the average of the data obtained in 1,000 experiments. In this set of experiments, we fix the weight of the metric (see (24)) in the REAS algorithm so that REAS gives the same weight to task execution time and energy consumption. The effect of varying the weight is examined in Section 6.2.
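Figure 3: Average schedule length, energy consumption, and application reliability versus CCR for randomly generated applications: (a)–(c) first DAG size; (d)–(f) second DAG size.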
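Figure 4: Results for randomly generated applications on the heterogeneous platform with Intel Xeon and AMD Athlon processors (panels (a) to (c)).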
We observe from Figure 3(a) that REAS outperforms RDLS and ECS with respect to schedule length and that the schedule length increases as the CCR increases. On average, the schedule length of the REAS algorithm is shorter than that of both RDLS and ECS. This improvement becomes more obvious as the CCR increases, for example, at CCR = 5. However, REAS is inferior to DLS in terms of schedule length. Figure 3(b) reveals that REAS consumes less energy on average than RDLS, ECS, and DLS. Figure 3(c) shows that REAS outperforms RDLS, ECS, and DLS in terms of the average application reliability.
This is mainly due to the fact that the REAS algorithm schedules tasks according to the novel objective RE, which achieves an effective tradeoff among task execution time, energy consumption, and task execution reliability. In contrast, the DLS algorithm focuses only on optimizing the task execution time, where the actual execution time includes the task scheduling time and reliability overhead. Thus, the scheduling solution generated by DLS achieves the shortest schedule length, but it consumes more energy and has lower reliability. The RDLS algorithm schedules tasks by considering their execution reliability and ignores task energy consumption. The ECS algorithm optimizes both schedule length and energy consumption, but its solution incurs more task execution reliability overhead. Thus, the REAS algorithm outperforms RDLS and ECS in terms of schedule length and outperforms RDLS, ECS, and DLS in terms of energy consumption and reliability. Another interesting observation is that RDLS and DLS are better than ECS in terms of reliability. This is mainly because the tasks in RDLS and DLS solutions are always executed at the normal frequency of the processor, which has the highest reliability among all processor frequencies.
The same improvements in scheduling performance can be seen in Figures 3(d), 3(e), and 3(f) for the second DAG size. These results also show that REAS outperforms RDLS and ECS in terms of the average schedule length, and that it outperforms RDLS, ECS, and DLS in terms of both the average energy consumption and the average application reliability.
We also simulate heterogeneous systems with Intel Xeon and AMD Athlon processors; the other configurations are the same as before. Figure 4 shows the results for randomly generated tasks on this heterogeneous computing platform. The results show that REAS outperforms RDLS, ECS, and DLS in terms of average schedule length and energy consumption. However, REAS is inferior to RDLS in terms of application reliability.
6.4. Application Graphs of Real-World Problem
Using real applications to test the performance of algorithms is very common [5, 8–10, 29]. In this section, we also simulate a real-world digital signal processing (DSP) problem, the details of which can be found in [5, 8–10, 29]. From Figure 5, we can conclude that REAS also performs better than RDLS, ECS, and DLS.
Figure 5: Results for the real-world DSP application (panels (a) to (c)).
7. Conclusions and Future Work
In the past few years, with the rapid development of heterogeneous systems, the high price of energy, system performance, reliability, and various environmental issues have forced the high-performance computing sector to reconsider some of its old practices with the aim of creating more sustainable systems. In this paper, we attempt the simultaneous management of system performance, reliability, and energy consumption. To achieve this goal, we first built a reliability and energy aware task scheduling architecture, which mainly includes the heterogeneous system model, the parallel application DAG model, and the energy consumption model. Then, we proposed a relationship between execution reliability and the processor’s voltage/frequency and obtained approximate values of its parameters by least squares curve fitting. Thirdly, we established a parallel application execution reliability model and formulated the reliability and energy aware scheduling problem as a linear programming problem. Finally, to solve this problem efficiently, we proposed a heuristic Reliability-Energy Aware Scheduling (REAS) algorithm based on a novel scheduling objective, RE, which jointly considers task execution time, energy consumption, and reliability.
The performance of the REAS algorithm is evaluated with an extensive set of simulations and compared to three of the best existing scheduling algorithms for heterogeneous systems: the RDLS, ECS, and DLS algorithms. The comparison is performed on synthetic, randomly generated precedence-constrained parallel application DAGs. The simulation results confirm the advantage of the REAS algorithm over the other three, particularly in energy saving.
This work is one of the first attempts to consider the simultaneous management of system performance, reliability, and energy consumption on high-performance computing systems. Future studies in this domain are twofold. Firstly, it will be interesting to extend our model to multidimensional computing resources, such as interconnections, memory access, and I/O activities. Secondly, in this paper, the failures occurring on the resources of the system are assumed to follow a Poisson process. Other reliability models can also be used in further studies.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This research was partially funded by the National Science Foundation of China (Grant no. 61370098), Hunan Provincial Natural Science Foundation of China (Grant no. 2015JJ2078), National HighTech R&D Program of China (2015AA015303), Key Technology Research and Development Programs of Guangdong Province (2015B010108006), and a project supported by the Science Foundation for Postdoctorate Research from the Ministry of Science and Technology of China (Grant no. 2014M552134).