Abstract

Due to the ever-growing requirements of high-performance data computation, multiprocessor systems have been proposed to overcome the bottlenecks of uniprocessor systems. Developing efficient multiprocessor systems requires effective exploration of design choices such as application scheduling, mapping, and architecture design; fault tolerance in multiprocessors must also be addressed. With the advent of nanometer process technology for chip manufacturing, the realization of multiprocessors on a single chip (MpSoC) is an active field of research. Developing efficient low-power, fault-tolerant task scheduling and mapping techniques for MpSoCs requires optimized algorithms that consider the various scenarios inherent in multiprocessor environments. Therefore, there is a need for a simulation framework to explore and evaluate new algorithms for multiprocessor systems. This work proposes a modular framework for the exploration and evaluation of various design algorithms for MpSoC systems, along with new multiprocessor task scheduling and mapping algorithms for MpSoCs; these algorithms are evaluated using the developed simulation framework. The paper also proposes a dynamic fault-tolerant (FT) scheduling and mapping algorithm for robust application processing. The proposed algorithms treat power as one of the design constraints to be optimized. The framework for heterogeneous multiprocessor simulation was developed in SystemC/C++. Various design variations were implemented and evaluated using standard task graphs, and performance evaluation metrics are reported and discussed for the different design scenarios.

1. Introduction

Current uniprocessor systems guarantee high processor utilization; however, they fall short of meeting critical real-time demands that require (1) high data throughput, (2) fast processing time, (3) low energy and power consumption, and (4) fault tolerance. Present-day applications that run on embedded or uniprocessor systems, such as multimedia, digital signal processing, image processing, and wireless communication, require an ever-growing amount of resources and speed for data processing and storage, which can no longer be satisfied by uniprocessor systems [1]. The increasing performance demand dictates the need for concurrent task processing on multiprocessor systems. Multiprocessor SoC (MpSoC) systems have emerged as a viable alternative that addresses the bottlenecks of uniprocessor systems. To improve processing performance, multiprocessor systems incorporate a number of processors or processing elements (PEs). A typical MpSoC system is composed of the following major components: (1) multiple processing elements (PEs) that perform task execution, (2) multiple memory elements (MEs) that are used for data storage, and (3) a communication network that interconnects the processors and memories in a manner defined by the topology. Heterogeneity in MpSoCs [2] is described as variation in the functionality and flexibility of the processor or memory elements. Heterogeneous processors in MpSoCs can be programmable or dedicated (nonprogrammable) processors. Likewise, memory elements can be heterogeneous with respect to their memory type, size, access mode, and clock cycles. To transfer data to and from the processing and memory elements, the network component may include buses, switches, crossbars, and routers.

Present-day MpSoC architectures employ most of the components discussed above [3]. In fact, to effectively meet today's complex design requirements, modern multiprocessor systems are real time, preemptive, and heterogeneous [4]. For data exchange, memory elements in MpSoCs often follow either a distributed memory architecture (message passing) or a global shared memory architecture [5–7].

An efficient MpSoC is based upon several key design choices, also known as design space exploration: network topology selection, good routing policy, efficient application scheduling, mapping, and the overall framework implementation. A formal categorization of MpSoC design issues presented in [8] is summarized below. Design methodologies in multiprocessor systems (MpSoC and NoC) have been classified as follows: (1) application characterization, scheduling, and mapping; (2) processing, memory, switching and communication architecture modeling, development, and analysis; and (3) simulation, synthesis, and emulation framework.

Major design features that need to be addressed for multiprocessors are task scheduling, mapping, and fault tolerance. This work concentrates on developing a simulation framework for multiprocessors on a single chip (MpSoC) that simplifies design space exploration by offering comprehensive simulation support. The simulation framework allows the user to optimize the system design by evaluating multiple design choices in scheduling, mapping, and fault tolerance for the given design constraints. TILE and TORUS topologies, which are common layouts in multiprocessing environments, are supported in this framework.

Using the developed MpSoC simulation framework, we propose and evaluate (1) an efficient task scheduling strategy, the performance driven scheduling (PDS) algorithm, based on a Simulated Annealing heuristic that determines an optimal schedule for a given application; (2) a new mapping policy, Homogeneous Workload Distribution (HWD), which maps tasks to processors by considering the processors' runtime status; and (3) in an effort to implement a robust system, a simple fault-tolerant (FT) algorithm that reconfigures the task execution sequence of the system in the event of processor failure. Heterogeneity in processor architecture is a key design feature in multiprocessor systems and is therefore modeled in this framework. Even though many earlier multiprocessor designs solve the scheduling and mapping problems together, our framework considers them individually for fault tolerance.

In summary, the research presented in this paper addresses the following issues and solutions: (1) a comprehensive MpSoC framework, implemented in C++ and SystemC, for effective exploration of design choices such as application scheduling, mapping, and architecture design; (2) evaluation of MpSoC architecture design choices differing in the number and type of processing elements (PEs) and in topology; (3) a heuristic scheduling algorithm based on Simulated Annealing; (4) a mapping algorithm that distributes the workload among the available PEs; and (5) a centralized fault-tolerant scheme to handle faults in PEs.

This paper is organized as follows. Section 2 briefly summarizes previous works in the field of multiprocessor systems. Section 3 provides definitions and terminologies used in this work. Section 4 presents the framework description as well as the proposed algorithms and their implementation in task scheduling, mapping, and fault tolerance techniques. Section 5 discusses the experimental setup for simulation and evaluations. Section 6 discusses the detailed simulation results and the analysis of the algorithms on benchmarks. Finally, Section 7 summarizes the conclusions.

2. MpSoC Design Literature Review

Extensive work has been done in the field of MpSoC- and NoC-based systems. Earlier research targeted a variety of solutions, including processor architecture, router design, routing and switching algorithms, scheduling, mapping, fault tolerance, and application execution in multiprocessors, to address performance and area demands. The reader can refer to [8–10] for an extensive enumeration of NoC research problems and proposed solutions. In this section, we concentrate on previously proposed scheduling, mapping, and fault-tolerant algorithms. It has been shown that earlier scheduling algorithms such as FCFS and Least Laxity [7] fail to qualify as acceptable choices because they are unable to meet hard deadline requirements for real-time applications. Thai [7] proposed the Earliest Effective Deadline First (EEDF) scheduling algorithm. This work demonstrated that increasing the number of processors to improve scheduling success increased the size of the architecture, which makes task scheduling more difficult. Stuijk et al. [11] proposed dynamic scheduling and routing strategies that increase resource utilization by exploiting all the scheduling freedom on the multiprocessors. This work considered application dynamism to schedule applications on multiprocessors for both regular and irregular tiled topologies. Hu and Marculescu [12] propose an algorithm based on list scheduling where mapping is performed to optimize energy consumption and performance. Varatkar and Marculescu [13] propose an ILP-based scheduling algorithm that minimizes interprocess communication in NoCs. Zhang et al. [14] propose approximate algorithms for power minimization of EDF and RM schedules using voltage and frequency scaling. Other researchers [15–17] have also used dynamic voltage scaling in conjunction with scheduling to meet low-power requirements.

Finding an optimal solution in the scheduling and mapping problem domain for a set of real-time tasks has been shown to be an NP-hard problem [11]. Carvalho et al. [18] proposed a congestion-aware mapping heuristic to dynamically map tasks on heterogeneous MpSoCs. This heuristic initially assigns master tasks to selected processors, and the slave tasks are then allocated in a congestion-minimizing manner to reduce communication traffic. The Sesame project [4] proposed a multiprocessor model that maps applications together with the communication channels onto the architecture. This model tries to minimize an objective function that considers power consumption, application execution time, and the total cost of the architectural design. Many researchers have discussed static and dynamic task allocation for deterministic and nondeterministic tasks extensively [5]. Static mapping has also been shown to be an inefficient strategy because it does not take into account application dynamism, which is a core design issue in multiprocessor systems. Ahn et al. [19] suggest that real-time applications should behave deterministically even in unpredictable environments so that the tasks can be processed in real time. Ostler and Chatha [20] propose an ILP formulation for application mapping on network processors. Harmanani and Farah [21] propose an efficient mapping and routing algorithm, based on simulated annealing, that minimizes blocking and increases bandwidth. Mapping algorithms that minimize communication cost, energy, bandwidth, and latency, either individually [22–24] or as multiobjective criteria [25, 26], have also been proposed.

Task migration, as opposed to dynamic mapping, has been proposed in [4] to overcome performance bottlenecks. Task migration may improve performance in shared memory systems but may not be feasible in distributed memory architectures due to the high volume of interprocessor data communication [5].

Previous research in [5, 6, 27] has demonstrated reliability issues regarding application processing, data routing, and communication link failure. Zhu and Qin [28] demonstrated a system-level design framework for a tiled architecture that constructs a reliable computing platform from potentially unreliable chip resources. This work proposed a fault model in which faults may have transient or permanent behaviour. Transient errors are monitored by a dedicated built-in checker processor that effectively corrects the fault at runtime. Nurmi et al. [6] demonstrated error tolerance methods through time, space, and data redundancy. Although this is a classic and costly approach with respect to system resources and bandwidth, the work demonstrated that redundancy is one potential approach to addressing faulty scenarios. Holsmark et al. [29] proposed a fault-tolerant, deadlock-free routing algorithm (APSRA) for heterogeneous multiprocessors. APSRA stores routing information in memory to monitor faulty routes and allows dynamic reconfigurability. Pirretti et al. [30] and Bertozzi et al. [31] propose NoC routing algorithms that can route tasks around the fault and sustain network functionality. Research on router design to improve reliability [32], buffer flow control [33], and error correction techniques [34, 35] for NoCs has also been reported. Koibuchi et al. [36] proposed a lightweight fault-tolerant mechanism called default backup path (DBP), a solution targeting nonfaulty routers and healthy PEs; they design NoC routers with special buffers and switch matrix elements to reroute packets through dedicated paths. Bogdan et al. [37] propose an on-chip stochastic communication paradigm based on a probabilistic broadcasting scheme that takes advantage of the large bandwidth available on chip. In their method, messages/packets are diffused across the network using spreaders and are guaranteed to be received by the destination tiles/processors.

A number of EDA research groups are studying different aspects of MpSoC design, some of which include the following. (1) Polis [18] is a framework based on both microcontrollers and ASICs for software and hardware implementation. (2) Metropolis [18] is an extension of Polis that proposed a unified modeling structure for simulating computation models. (3) Ptolemy [18] is a flexible framework for simulating and prototyping heterogeneous systems. (4) Thiele et al. [1] proposed a simulation approach based on the distributed operation layer (DOL), which considers concurrency, scalability, mapping optimization, and performance analysis in an effort to yield faster simulation time. (5) NetChip is a NoC synthesis environment primarily composed of two tools, namely, SUNMAP [38] and the xpipes compiler [39]. (6) NNSE (Nostrum NoC Simulation Environment) [40] is a SystemC-based NoC simulation environment initially used for the Nostrum NoC. (7) ARTS [41] is a system-level framework to model networked multiprocessor systems on chip (MpSoC) and evaluate the cross-layer causality between the application, the operating system (OS), and the platform architecture. (8) StepNP [42] is a system-level exploration platform for network processing built in SystemC. It enables the creation of multiprocessor architectures with models of interconnects (functional channels, NoCs), processors (simple RISC), memories, and coprocessors. (9) Genko et al. [43] present a flexible emulation environment implemented on an FPGA that is suitable to explore, evaluate, and compare a wide range of NoC solutions with very limited effort. (10) Coppola et al. [44] propose an efficient, open-source research and development framework, the On-Chip Communication Network (OCCN), for the specification, modeling, and simulation of on-chip communication architectures. (11) Zeferino and Susin [45] present SoCIN, a scalable network based on a parametric router architecture to be used in the synthesis of customized low-cost NoCs; the architecture of SoCIN and its router are described, and some synthesis results are presented.

The following are the works most closely related to the algorithms proposed in this paper. Orsila et al. [46] investigated a Simulated Annealing-based mapping algorithm to optimize energy and performance on heterogeneous multiprocessor architectures. Specifically, the cost function considered for SA optimization includes processor area, operating frequency, and switching activity factor. The work compares the performance of their SA mapping algorithm across different heterogeneous architectures. In a similar work by the same authors [47], the SA mapping algorithm is adapted to optimize the usage of on-chip memory buffers. Their work employs a B-level scheduler after the mapping process to schedule tasks on each processor.

The proposed framework discussed in this paper employs SA-based scheduling and workload balanced mapping algorithms. Our algorithms also optimize scheduling and mapping cost functions that include processor utilization, throughput, buffer utilization, and port traffic along with processor power. In addition, our algorithms also consider task deadline constraints. Both scheduling and mapping algorithms adopted in our work perform global optimization.

Paterna et al. [48] propose two workload allocation/mapping algorithms for energy minimization: one based on integer linear programming and a two-stage heuristic based on linear programming and bin-packing policies. Teodorescu and Torrellas [49] and Lide et al. [50, 51] explore workload-based mapping algorithms for independent and dependent tasks, respectively. The work by Zhang et al. is based on ILP optimization for execution time and energy minimization. Our proposed mapping algorithm differs from the above workload-constrained algorithms by considering the buffer size usage and the execution time of the tasks in the buffer.

Works by Ababei and Katti [52] and Derin et al. [53] have used remapping strategies for NoC fault tolerance. In the solution proposed by Ababei and Katti [52], fault tolerance in NoCs is achieved by adaptive remapping; they consider single and multiple processor (PE) failures, optimize remapping energy and communication costs, and compare against an SA-based remapper. Derin et al. [53] propose an online remapping strategy called the local nonidentical multiprocessor scheduling heuristic (LNMS), based on integer linear programming (ILP), whose cost function minimizes communication and execution time. Our proposed NoC fault-tolerant algorithm employs rescheduling and remapping strategies to optimize processor performance and communication cost; task deadline constraints are also considered during rescheduling and remapping.

With current design trends moving towards multiprocessor systems for high-performance and embedded applications, the need for comprehensive design space exploration techniques has gained importance. Many research contributions to hardware modules and memory synchronization that optimize power, performance, and area for multiprocessor systems are being developed. Algorithmic development for multiprocessor systems, such as multiprocessor scheduling and mapping, has likewise yielded significant results. Therefore, a comprehensive and generic framework for a multiprocessor system is required to evaluate these algorithmic contributions. This work addresses that requirement by developing a comprehensive MpSoC evaluation framework for evaluating a range of scheduling, mapping, and fault-tolerant algorithms. The framework is also built on generic models or representations of applications (task graphs), architectures, processors, topologies, communication costs, and so forth, resulting in a comprehensive design exploration environment.

The main contributions of this work can be summarized as follows. (1) We propose a unique scheduling algorithm for MpSoC systems, the performance driven scheduling algorithm, which is based on Simulated Annealing and optimizes an array of performance metrics: task execution time, processor utilization, processor throughput, buffer utilization, port traffic, and power. Its performance is compared to that of the classical earliest deadline first scheduling algorithm. (2) We propose a mapping algorithm that succeeds the scheduling algorithm, the homogeneous workload distribution algorithm, which distributes tasks evenly to the available processors in an effort to balance the dynamic workload across the processors. Its performance is compared with other classical mapping algorithms. (3) We propose a fault-tolerant scheme that takes effect upon PE failure. The proposed fault-tolerant (FT) scheme performs updated scheduling and mapping of the tasks assigned to the failed PE when the PE fault is detected. A unique contribution of this work is that it addresses both the pending and the executing tasks on the faulty PE. The latency overhead of the proposed FT algorithm is evaluated experimentally.

3. Definitions and Theory

Simulation Model. The simulation model consists of three submodels (Figure 1): (1) application, (2) architecture, and (3) topology. The application model represents the application to be executed on the given MpSoC architecture and connected by the topology specified.

Application Model (Task Graphs). A task graph is a directed acyclic graph (DAG) that represents an application. A task graph is denoted as TG = (V, E), where V is a set of nodes and E is a set of edges. A node v_i in V represents a task, V = {v_1, v_2, ..., v_n}, and an edge e_ij in E represents the communication dependency between tasks v_i and v_j. A weighted edge, if specified, denotes the communication cost incurred to transfer data from the source task to the destination task. Each task can exist in a processor in one of the following states: idle, pending, ready, active, running, or complete. The task graph is derived by parsing the high-level description of an application.
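As an illustration, the application model above can be sketched in C++ as follows; the type names (Task, Edge, TaskGraph) and fields are hypothetical for this sketch, not the framework's API:

```cpp
#include <vector>

// Sketch of the application model: task states as named in the text,
// tasks as DAG nodes, and weighted edges for communication dependencies.
enum class State { Idle, Pending, Ready, Active, Running, Complete };

struct Task {
    int id;
    double wcet;     // worst-case execution time et_i
    double deadline; // dl_i
    State state;     // current task state
};

struct Edge {
    int src, dst;     // communication dependency t_src -> t_dst
    double comm_cost; // weighted edge, if specified
};

struct TaskGraph {
    std::vector<Task> nodes; // V
    std::vector<Edge> edges; // E
};

// a two-task demo application: t0 feeds data to t1
TaskGraph make_demo_graph() {
    TaskGraph tg;
    tg.nodes = {{0, 2.0, 10.0, State::Idle}, {1, 3.0, 12.0, State::Idle}};
    tg.edges = {{0, 1, 1.0}};
    return tg;
}
```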

Architecture Model. The architecture of the MpSoC system consists of the processors, hardware components, memory, and interconnects as defined by the topology model. The MpSoC architecture is represented as a set AR = (P, M), where P = {p_1, ..., p_k} is a set of processors and M = {m_1, ..., m_l} is a set of memory units. Each processor or memory is defined by the following characteristics: type, power/task, execution time/task, clock rate, and preemption overhead, as well as by the set of tasks each processor can execute along with their respective execution times and power costs. The implemented heterogeneous MpSoC consists of processors of the same architecture, but each processor can execute a given set of tasks at different costs.

Topology Model (Topology Graph). Topology describes the physical layout of hardware components in a system. The topology model is represented as a topology graph NG = (P, T, C) that consists of the set of processors P, the topology type T (TILE or TORUS), and the communication cost C. This communication cost is the same as that defined in the application model.

Scheduling. Scheduling is the process of assigning an execution start time to every task in an application such that each task executes within its deadline. For an application A = {t_1, t_2, ..., t_n}, where t_i is a task instance in the application, task scheduling, denoted as S(t_i), is defined as an assignment of an execution start time st_i for the task in the time domain such that st_i + et_i <= dl_i, where et_i is the worst-case execution time of the task on a specific processor, dl_i is the deadline of the task t_i, and st_i is the execution start time of task t_i. Scheduling of an application is denoted as S(A) = {S(t_1), S(t_2), ..., S(t_n)}. If there exists a precedence relation between any two tasks such that the execution of a task t_i should occur before the execution of task t_j, represented as t_i < t_j, then st_j >= st_i + et_i and st_j + et_j <= dl_j, where st_j is the start time of task t_j, and et_j and dl_j are the execution time and deadline of task t_j, respectively.
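The two constraints above (each task meets its own deadline, and precedences order start times) can be checked with a small sketch; the names Sched and schedule_valid are illustrative, not part of the framework:

```cpp
#include <utility>
#include <vector>

// Checks (a) st_i + et_i <= dl_i for every task, and
// (b) st_j >= st_i + et_i for every precedence t_i -> t_j.
struct Sched { double st, et, dl; }; // start time, WCET, deadline

bool schedule_valid(const std::vector<Sched>& s,
                    const std::vector<std::pair<int, int>>& prec) {
    for (const auto& t : s)
        if (t.st + t.et > t.dl) return false;  // deadline missed
    for (const auto& p : prec)
        if (s[p.second].st < s[p.first].st + s[p.first].et)
            return false;                      // precedence violated
    return true;
}
```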

Performance Index (PI). PI is the cumulative cost function used to determine the optimal schedule through the Simulated Annealing technique. PI quantifies the performance of the system, taking into consideration the averaged values of processor execution time (ETIME), utilization (UTL), throughput (THR), buffer utilization/usage (BFR), port traffic (PRT), and power (PWR). PI is evaluated by a cost function expressed as PI = k1*ETIME + k2*UTL + k3*THR + k4*BFR + k5*(1/PWR) + k6*(1/PRT), where k1, ..., k6 are normalizing constants.

The Simulated Annealing procedure compares the performance of successive design choices based on the PI. Execution time, processor utilization, processor throughput, and buffer utilization are to be maximized, so their values enter the cost function directly; power and port traffic are to be minimized, so their reciprocal values are taken. The coefficient values in the cost equation are set according to the desired performance goals of the system.
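Assuming the weighted-sum form given above (direct terms for metrics to be maximized, reciprocal terms for power and port traffic), the PI evaluation can be sketched as:

```cpp
// Hypothetical sketch of the PI cost function: k[0]..k[5] are the
// normalizing constants k1..k6. Power and port traffic enter
// reciprocally because they are minimized; the rest enter directly.
double perf_index(double etime, double utl, double thr, double bfr,
                  double pwr, double prt, const double k[6]) {
    return k[0] * etime + k[1] * utl + k[2] * thr + k[3] * bfr
         + k[4] / pwr + k[5] / prt;
}
```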

Optimization of the Scheduling Algorithm. The problem of determining an optimal schedule is to determine the optimal start time and execution time of all tasks in the application: minimize COST(S(A)) subject to st_n + et_n <= DL, where et_i is the worst-case execution time of each task t_i, DL is the deadline of the application, st_n and et_n are the start time and worst-case execution time, respectively, of the last task in the application, and COST is the cost function of scheduling the tasks.

Mapping. Mapping is the process of assigning each task in an application to a processor such that the processor executes the task and satisfies the task deadline as well as other constraints (power, throughput, completion time), if any.

For an application A = {t_1, t_2, ..., t_n}, where t_i is a task instance in the application, mapping a task, denoted as M(t_i), is defined as an assignment of a processor p_j to the task t_i: M(t_i) = p_j. Application mapping is defined as the mapping of all tasks t_i in the application to processors and is denoted as M(A) = {M(t_1), M(t_2), ..., M(t_n)}.

Optimization of the Mapping Algorithm. The problem of determining an optimum mapping is to determine the optimum allocation of all tasks in the application to the processors such that the total execution time of all tasks is minimized: minimize COST_map(M(A)), where COST_map is the cost function for optimal mapping of the tasks on the processors. Other factors for optimum mapping include maximizing processor execution time and throughput, minimizing interprocessor communication delay, and balancing the workload homogeneously across the processors in order to improve utilization.
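A workload-balancing mapping in this spirit can be sketched by dispatching each ready task to the processor with the smallest pending workload (summed execution time of its buffered tasks); pick_processor and the workload model are illustrative assumptions, not the HWD algorithm itself:

```cpp
#include <cstddef>
#include <vector>

// Sketch: workload[j] holds the summed execution time of the tasks
// already queued in processor j's buffer; a new task goes to the
// least-loaded processor.
std::size_t pick_processor(const std::vector<double>& workload) {
    std::size_t best = 0;
    for (std::size_t j = 1; j < workload.size(); ++j)
        if (workload[j] < workload[best]) best = j;
    return best;
}
```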

Power Modeling. In this research, the scheduling and mapping algorithms model heterogeneous processor execution and power requirements as follows. (1) Switch power is consumed whenever a task is routed from an input port to an output port. (2) Processor power is modeled with respect to idle-mode (which also covers the pending, ready, active, and complete states) and execution-mode (running) power, both of which are functions of the task execution time et_i. We also assume that the power consumed by a task is proportional to its execution time, P(t_i) = P_unit * et_i (and similarly for the idle mode), where P_unit is the unit power consumed per unit time of task processing. The total power rating for a processor is then expressed as PWR = sum over i = 1..N of P(t_i), where N is the number of tasks executed by the processor. (3) Whenever a task t_i is executed on processor p_j, the execution time and the power consumed by processor p_j are updated as ETIME_j = ETIME_j + et_i and PWR_j = PWR_j + P(t_i). (4) Whenever a task is transferred between switches, the communication delay is added to the Total Task Communication Delay parameter and is expressed as comm_delay = n_hops * task_size * tau, where n_hops is the number of switch hops, task_size is the size of the task defined as a function of its execution time in time units, and tau is the time constant required to transfer a single time-unit task over one hop. (5) For each task t_i, the Task Slack Time and Average Slack Time are calculated as slack_i = dl_i - (rel_i + et_i) and avg_slack = (sum over i of slack_i)/N_total, where dl_i is the deadline, rel_i is the release time, and N_total is the total number of tasks processed.
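Items (3) through (5) above amount to simple bookkeeping, which can be sketched as follows (struct and function names are illustrative):

```cpp
struct ProcStats { double etime = 0.0; double pwr = 0.0; };

// (3) update a processor's accumulated execution time and power
void on_task_executed(ProcStats& p, double et, double task_power) {
    p.etime += et;
    p.pwr   += task_power;
}

// (4) communication delay for one task transfer across the network
double comm_delay(int n_hops, double task_size, double tau) {
    return n_hops * task_size * tau;
}

// (5) task slack time: deadline minus completion (release + execution)
double slack(double deadline, double release, double et) {
    return deadline - (release + et);
}
```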

Fault-Tolerant System. A fault-tolerant system describes a robust system that is capable of sustaining application processing in the event of component failure. A fault-tolerant algorithm is a process that performs scheduling and mapping on a set of unprocessed tasks for the available processors.

Let an application with a set of tasks A = {t_1, ..., t_n} be scheduled, denoted by S(A), and mapped on the set of processors P = {p_1, ..., p_k}, denoted as M(A). A fault is defined as a nonrecurring, permanent failure of a processor at time instance T_f during the execution of a task. The application after the occurrence of the fault is expressed as A' such that A' is a subset of A, containing the tasks that had not been dispatched to processors by the mapper when the processor failure occurred. The set of processors after the failure is denoted by P' = P - {p_f}, where p_f is the failed processor. The proposed fault-tolerant MpSoC framework determines the updated scheduling S'(A') and updated mapping M'(A') for the set of unprocessed tasks in the application subset A', such that st'_m + et'_m <= DL, where st'_m and et'_m are the start time and execution time, respectively, of the last scheduled task in A'.

Optimization of the Fault-Tolerant Strategy. Due to the dynamic mapping methodology adopted in the framework, only the tasks associated with the failed processor (the task that was executing on it and the tasks already dispatched by the mapper to it) need to be rescheduled, along with the tasks that had not yet been dispatched by the mapper.

Let us consider an application with a list of tasks A_nd that have not been dispatched to processors at the occurrence of the fault. Also, let A_buf be the list of tasks in the buffer of the failed processor, and let t_f be the task that was being executed by the failed processor at the time of failure. The task list of the fault-tolerant algorithm at time T_f is represented as A' = A_nd U A_buf U {t_f}. It is also assumed that all tasks being executed by the nonfailed processors complete their execution even after the time of failure T_f. Thus, the FT algorithm performs the updated scheduling and updated mapping on the task set A', satisfying the deadline constraint DL of the application A.
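The construction of the rescheduling set A' = A_nd U A_buf U {t_f} can be sketched as follows (ft_task_set is a hypothetical helper; task IDs stand in for task objects):

```cpp
#include <vector>

// Collect the undispatched tasks, the tasks stranded in the failed
// processor's buffer, and the task it was executing into one list
// for the FT scheduler/mapper to process.
std::vector<int> ft_task_set(const std::vector<int>& not_dispatched,
                             const std::vector<int>& failed_buffer,
                             int executing_task) {
    std::vector<int> a_prime = not_dispatched;
    a_prime.insert(a_prime.end(), failed_buffer.begin(), failed_buffer.end());
    a_prime.push_back(executing_task);
    return a_prime;
}
```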

4. Framework Modeling and Implementation

The proposed framework adopts a “bottom-up” modular design methodology that gives the benefit of component reusability. Behavioral component models were designed and stored in hardware and software libraries. The framework was implemented in SystemC 2.2/C++ and compiled with g++-4.3 under Linux. The framework is divided into three major components: input or user interface module, core MpSoC simulation module, and the output or performance evaluation module. The hardware modules, such as the processor, switch, dispatcher unit, and multiplexer, were implemented in SystemC; the software modules, such as input parser, scheduler, and task interface classes, were implemented in C++.

Input Module. User inputs such as the number of processors, topology, task graph, processor types, buffer size, switching technology, fault-tolerant mode, and scheduling/mapping specifications are read either through a command-line interface or from a GUI.

Core Simulation Module. The core simulation module is depicted in Figure 2, which shows the software/hardware libraries and control flow between the modules. The simulation module consists of two components: the software and hardware libraries. Software libraries include software components of the framework, for instance, the input parser, scheduler, data store, and task interface. Hardware libraries include hardware components implemented in SystemC, such as the processor, the mapper, the dispatcher, and multiplexers. The processors are implemented as behavioural processors capable of computing the various tasks in the task graphs. The communication cost for a task is dependent on the size, and thus the execution duration, of the task (Section 3). Thus, communication latency is modelled as a function of task size. Heterogeneity is modelled by varying the task processing cost (processing time/task and power/task). All processors are capable of executing all possible operations but with different costs. Processor failure is permanent once it occurs, and only single processor failure is considered. Due to the modular implementation of the simulation module, architecture, topology, routing, scheduling, and mapping methods can be easily changed in MpSoCs to conduct extensive evaluation of design space exploration.

Output Module. The simulation terminates by evaluating the system performance module, which retrieves the various performance variables from the individual processors, such as the total execution time, power, throughput, port traffic, number of tasks processed, and buffer usage. These values are displayed on the output GUI and stored in the result log file.

4.1. Performance Driven Scheduling Algorithm

This work also proposes an efficient, offline, performance driven scheduling algorithm based on the Simulated Annealing technique. The performance index (PI), determined by a cost function, is used to identify the optimal schedule. The performance index is a cumulative factor of (1) processor execution time, (2) processor utilization, (3) processor throughput, (4) processor power, (5) processor port traffic, and (6) processor buffer utilization. The problem of determining the optimal schedule is defined as determining the schedule with the maximal performance index. The Simulated Annealing algorithm performs (1) random scheduling and mapping of tasks, (2) simulation, and (3) capturing, averaging, and normalizing of performance variables; finally, it calculates the performance index. These steps are repeated several times in an effort to increase the PI. During the iterations, if a better PI is found, the state is saved, and the simulation repeats until the temperature reaches the threshold value.
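The accept/reject step of such an annealing loop typically follows the classic Metropolis criterion; a sketch, assuming PI is maximized and a geometric cooling schedule (names and the cooling factor are assumptions, not the framework's exact probFunc):

```cpp
#include <cmath>

// Metropolis acceptance for a maximized objective: always keep a better
// PI; keep a worse one with probability exp(delta / temp). 'u' is a
// uniform random draw in [0, 1).
bool accept_solution(double pi_new, double pi_cur, double temp, double u) {
    if (pi_new >= pi_cur) return true;
    return u < std::exp((pi_new - pi_cur) / temp);
}

// geometric cooling schedule (alpha < 1), a common choice
double cool(double temp, double alpha) { return temp * alpha; }
```

Iteration stops once cool() drives the temperature below the cooling threshold, matching the outer loop of Algorithm 1.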

For the problem considered, the normalizing coefficients used to calculate the PI were determined empirically after conducting several runs on all benchmarks. Execution time, utilization, throughput, and power are key performance indices and are therefore given significant weights (higher coefficient values) during normalization. Port traffic and power are to be minimized; consequently, their respective cost terms are computed reciprocally. The pseudocode for the performance driven scheduling (PDS) and HWD mapping algorithms is given in Algorithm 1.

for each task in ATG do {
    setDeadline();                      // set the longest-path-maximum-delay deadline
    task.random_mapToPE();              // initial solution
}
while (simulation_temp > cooling_threshold) do {
    for (temperature_length) {
        for (task_sub_iteration_length) {
            while (each task is assigned a release time) {
                genRTL();                        // generate the RTL with communication cost
                t_sch = sel_rand(RTL);           // select a random task from the RTL
                t_sch.genRTW();                  // generate the release time window
                t_sch.rel_time = sel_rand(RTW);  // set the release time
            }
            sc_start(time);         // initialize and run the SystemC simulation
            calcObjFunc();          // calculate the objective function value
            probFunc();             // accept or reject the solution
        }
        for each task in ATG do {
            task.calc_HWDcost();    // calculate the workload of the processors
            task.HWD_mapToPE();     // new mapping solution
        }
    }
    calc_simulation_temp();         // calculate the new simulation temperature
}
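The objective-function and acceptance steps of the annealing loop above (calcObjFunc and probFunc) can be sketched in plain C++. The weights in the performance index and the Metropolis-style acceptance rule are illustrative assumptions for this sketch, not the exact coefficients or probability function used in the framework.

```cpp
#include <cassert>
#include <cmath>

// Illustrative performance index: weighted combination of normalized metrics.
// Weights are hypothetical; the paper's coefficients were tuned per benchmark.
// Metrics to be minimized (time, power, traffic) contribute reciprocally,
// as described in the text.
double performanceIndex(double execTime, double utilization, double throughput,
                        double power, double portTraffic, double bufferUse) {
    return 0.25 / execTime + 0.25 * utilization + 0.2 * throughput +
           0.15 / power + 0.1 / portTraffic + 0.05 * bufferUse;
}

// Metropolis-style acceptance: always keep an improvement; accept a worse
// schedule with probability exp(delta / temperature), which shrinks as the
// simulation temperature cools toward the threshold.
bool acceptSolution(double newPI, double currentPI, double temperature,
                    double uniformRandom01) {
    if (newPI >= currentPI) return true;
    return uniformRandom01 < std::exp((newPI - currentPI) / temperature);
}
```

At high temperature, almost any move is accepted, which lets the scheduler escape local optima; near the cooling threshold only improvements survive.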

4.2. Multiprocessor Task Mapping Strategy

Multiprocessor environments grant the freedom of dynamic task mapping; the mapping algorithm must therefore specify where each task should be processed. Task mapping in heterogeneous systems is accomplished in two steps: (1) binding and (2) placement. First, the task binding procedure attempts to find the assignment of tasks that leads to lower power and faster execution time. During task placement, the mapper attempts to allocate any two tasks that exchange data frequently as close to each other as possible; placing communicating tasks on adjacent or nearby processors reduces communication latency, switch traffic, and power dissipation.

Static mapping assigns a particular task to a predetermined processing element offline, regardless of the dynamic state of the system components. This, however, does not satisfy MpSoC requirements, because processing tasks on a heterogeneous architecture introduces workload irregularity across the system resources (PEs). Irregular workload distribution degrades system utilization. To overcome this problem, tasks should be mapped dynamically, based on the current workload on the processors; the processor workload is measured by the number of tasks in the processor's local memory.

The dynamic mapper in the proposed framework performs task allocation according to the configured policy. To take the heterogeneity of the processors into consideration, additional power and execution latency costs have been introduced. This reflects the fact that processing a particular task on different types of processors incurs different costs (power consumption and execution time); efficient heterogeneous mapping algorithms are tailored to minimize this additional cost.
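The heterogeneity cost described above can be sketched as a per-processor-type lookup table consulted by the mapper. The class name, the cost fields, and the weighting between time and power are assumptions for illustration, not the framework's actual cost model.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical cost entry: executing the same task operation on a different
// processor type incurs a different (execution time, power) cost.
struct TaskCost {
    double timeUs;   // execution latency in microseconds
    double powerUw;  // power in microwatts
};

class HeterogeneousCostModel {
public:
    void setCost(const std::string& peType, int taskOp, TaskCost c) {
        table_[{peType, taskOp}] = c;
    }
    // Combined cost a mapper could minimize; the 0.5 power weight is an
    // illustrative assumption.
    double mappingCost(const std::string& peType, int taskOp) const {
        const TaskCost& c = table_.at({peType, taskOp});
        return c.timeUs + 0.5 * c.powerUw;
    }
private:
    std::map<std::pair<std::string, int>, TaskCost> table_;
};
```

With such a table, a mapper can compare candidate processor types for a task and prefer the one with the lowest combined cost.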

In the framework, various task allocation heuristics were implemented. Given a scheduled task, the task mapping follows one of the policies below.
(1) Next Available Mapping (NXT AVLB). A scheduled task is mapped on processor Pi, and the next scheduled task is mapped on processor Pi+1, provided Pi and Pi+1 are neighbors as defined by the topology. This policy assigns tasks to successive processors sequentially but considers neither the heterogeneity of the tasks nor the processor workload.
(2) Homogenous Workload Distribution Mapping (HWD). HWD considers the workload on each processor and maps a scheduled task on the processor that has the minimum workload.
(3) Random Mapping (RANDOM). Random task allocation is used by the PD algorithm to distribute the workload randomly across the available processors.

HWD Mapping Approach. Minimizing the idle time of each processor, or maximizing its processing time, improves processor utilization. To reduce idle time, tasks should be distributed evenly across the available processors. The proposed Homogenous Workload Distribution (HWD) algorithm distributes tasks evenly to the available processors in an effort to balance the dynamic workload. The algorithm is a two-step process: (1) probing the individual processor buffers through dedicated snooping channels that connect each processor to the mapper and (2) mapping each scheduled task on the processor that has the lowest workload. Workload is a measure of the number of tasks in the processor buffer and their respective execution times; the processor with the fewest queued tasks and the least pending execution time has the lowest workload, and the mapper assigns the next scheduled task to it.

HWD Mapping Algorithm. Let P = {p1, p2, …, pn} be the set of processors, let W = {w1, w2, …, wn} be the dynamic task workloads given by the number of tasks in the respective processor buffers, and let n be the number of processors. Let T = {t1, t2, …, tm} be the set of tasks, and let the symbol "→" denote "mapping." Then, for each scheduled task t ∈ T, t → pk, where pk ∈ P and k is chosen such that wk = min(w1, w2, …, wn).

This approach has three advantages: (1) it minimizes the chance of buffer overflow, (2) it reduces the task waiting time in the processor buffer, and (3) it increases utilization because idle processors are assigned scheduled tasks.
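The HWD selection rule above reduces to an argmin over the per-processor workloads reported through the snooping channels. The following is a minimal sketch under that assumption; the function name and the availability mask (useful for excluding a failed PE) are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal HWD sketch: workload[i] holds processor i's current workload
// (e.g., queued tasks' total pending execution time, as in the text).
// Returns the index of the least-loaded available PE, or -1 if none.
int hwdSelectProcessor(const std::vector<double>& workload,
                       const std::vector<bool>& available) {
    int best = -1;
    for (std::size_t i = 0; i < workload.size(); ++i) {
        if (!available[i]) continue;
        if (best < 0 || workload[i] < workload[best]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Each scheduled task is mapped to the returned PE and its workload entry is incremented by the task's execution time, so subsequent selections naturally balance the load.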

4.3. Fault-Tolerant Methodology

System designs often have flaws that cannot be identified at design time, and it is almost impossible to anticipate all faulty scenarios beforehand because processor failure may occur due to unexpected runtime errors. Real-time systems are critically affected if no fault-tolerance scheme is implemented. Omitting fault detection and recovery strategies is often unaffordable in mission-critical systems because an error may lead to data discrepancies and deadline violations.

Processor failure introduces inevitable overall performance degradation due to (1) the reduced computing power of the system and (2) the overhead involved in applying the fault recovery schemes. Recovery procedures involve tracking the task execution history and reconfiguring the system using the available processors, which enables application processing to resume from the point at which execution was interrupted. During processor failure, application processing is temporarily suspended, and error recovery procedures are applied before application processing resumes.

Fault-Tolerant Model. Fault modeling can have several dimensions that define the fault occurrence time, the fault duration, the location of failure, and so forth. The proposed framework adopts the following fault modeling parameters.
(1) Processor failure is assumed to be permanent; that is, the fault duration is infinite. On failure, the address of the failed processor is completely removed from the mapper address table.
(2) The location specifies where the failure occurred. For tile and torus topologies, the location is represented as a two-dimensional coordinate address.
(3) During task processing, any processor may fail at any time instance. Consequently, the time of failure is modeled as a random time instance.

In the event of failure, the following procedures are executed.
(1) The mapper removes the failed processor ID from the address table.
(2) Tasks that are scheduled or mapped but not yet dispatched are rescheduled and remapped; both the task release time and the task address must be recomputed. This is because the execution instance of a task depends on the finish time of its predecessor task, which may have been suspended in the failed processor.
(3) Tasks already dispatched to the failed processor may be either (a) in the processor's buffer or (b) in the middle of execution when the failure occurred. These tasks are migrated back to the mapper and are rescheduled and remapped. The scheduler can perform this rescheduling concurrently while the nonfaulty processors execute their respective tasks.
(4) Tasks dispatched to other, nonfailed processors are not affected.
Due to the dynamic property of the mapper and the fault-tolerant implementation, the cost incurred in the event of processor failure is minimal; the overhead is only 2.5% of the total task execution time. The only penalty incurred during processor failure is the cost of task migration and the respective rescheduling and remapping procedures.
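The recovery steps above can be sketched as a single routine over the mapper's state. The struct fields and function names are hypothetical simplifications: real tasks carry release times, addresses, and dependency links, but the control flow mirrors steps (1)-(4).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical task record held by the mapper.
struct Task {
    int id;
    int assignedPe;   // -1 means unbound (awaiting remap)
    bool dispatched;
};

// Sketch of the recovery procedure. Returns the IDs of tasks that must be
// rescheduled and remapped.
std::vector<int> recoverFromFailure(std::vector<Task>& tasks,
                                    std::vector<int>& addressTable,
                                    int failedPe) {
    // (1) Remove the failed PE from the mapper's address table.
    addressTable.erase(
        std::remove(addressTable.begin(), addressTable.end(), failedPe),
        addressTable.end());
    // (2)+(3) Collect tasks bound to the failed PE (queued or executing)
    // and return them to the scheduler; (4) tasks on healthy PEs are
    // left untouched.
    std::vector<int> toReschedule;
    for (Task& t : tasks) {
        if (t.assignedPe == failedPe) {
            t.assignedPe = -1;
            t.dispatched = false;
            toReschedule.push_back(t.id);
        }
    }
    return toReschedule;
}
```

Because only tasks bound to the failed PE are touched, the migration penalty stays proportional to that PE's queue, consistent with the small overhead reported in the text.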

5. Experimental Setup

The heterogeneous MpSoC framework in this work was developed using the C++ and SystemC programming languages. Architectural components such as the processing elements (PEs), memory elements, routers, and network interfaces are implemented in SystemC and provide cycle-accurate simulation. The processing elements are implemented as simple 32-bit processors that perform basic logic and arithmetic operations. The heterogeneous PE structure in our MpSoC is realized by altering the type of operations and the execution times of the tasks each PE can execute. The router implementation consists of a switching matrix with input port buffers. Tasks and data are transmitted between PEs using each PE's data-serializing module. A typical data unit transmitted between PEs consists of task commands, task data (if any), and the destination PE IDs (addresses). The data is propagated through the MpSoC system using the routing algorithm.
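The serialized data unit described above (task command, destination, payload) can be sketched as a simple flit sequence. The field order and 32-bit word layout are assumptions for illustration, not the framework's exact wire format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative serialization of a task packet into 32-bit flits:
// [command][destination PE ID][payload length][payload words...].
std::vector<uint32_t> serializeTask(uint32_t command, uint32_t destPe,
                                    const std::vector<uint32_t>& data) {
    std::vector<uint32_t> flits;
    flits.push_back(command);
    flits.push_back(destPe);
    flits.push_back(static_cast<uint32_t>(data.size()));
    flits.insert(flits.end(), data.begin(), data.end());
    return flits;
}
```

A router only needs to inspect the header flits to forward the packet, while the receiving PE's deserializer reads the length field to reassemble the payload.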

The scheduling and mapping algorithms are implemented in C++ and are interfaced with the SystemC architecture implementation. The application in this work is modeled as a task graph. Each task graph consists of processor operations, their execution times, communication costs, and deadlines; the application as a whole also has a deadline. The scheduling algorithm parses the application (task graph) and assigns start times for the tasks on the processors based on the optimization goals described in the proposed performance driven scheduling algorithm. The scheduled tasks are mapped to their respective processors in the MpSoC system based on the proposed homogenous workload-mapping algorithm. The task information, comprising the task operations, task data, and destination PEs (path/address of successor PEs), is assembled and forwarded to the initial PE by the dispatcher module. The performance evaluation (monitor) module records cycle-accurate timing signals in the architecture during task execution and communication.

During the simulation of a fault-tolerant scenario, the monitor module triggers a PE failure event at a predefined time instance. The dispatcher is informed of the PE failure, and, based on the mapper table, the currently processing task and the tasks in the buffer of the failed PE are determined. An updated task graph is constructed, and scheduling and mapping of the remaining unexecuted tasks are performed. Tasks currently executing on other PEs are undisturbed. The reassigned tasks are inserted into the PE buffers so as to satisfy the deadlines.

The implemented MpSoC system has been evaluated extensively by modifying the architecture and topology. The system was evaluated using a set of synthetic task graphs with 50–3000 tasks. The architectural modifications include changes in the types of PEs in the system, mesh and torus topology variations, and the number of PEs (9, 16, 32). The MpSoC system has been used to evaluate the proposed scheduling, mapping, and fault-tolerant algorithms, which are compared with their respective classical algorithms. The performance of the proposed algorithms, along with variations in architecture and topology, has been evaluated and presented. The performance metrics considered in this work include processor throughput, processor utilization, execution time/task and power/task, buffer utilization, and port utilization. A detailed summary of the results and their significance is given in Section 6.

6. Simulation Results and Analysis

To evaluate the proposed framework, design variations in the scheduling, mapping, and fault-tolerant algorithms have been considered. The application model adopted is similar to the standard task graph (STG) [50, 51] and the E3S benchmark suite [54], which contain random and application-specific benchmarks. The simulation was conducted with many STG benchmarks having task sizes of 50 through 3000. Each task graph is characterized by the number of tasks, task dependencies, task size (code size), communication cost (latency for interprocessor transfer), and application deadline. The processor architecture model is derived from the E3S benchmark, where each processor is characterized by its type, the tasks it can execute, task execution time, task size (code size), and task power. The processor also has general characteristics such as type, overhead cost (preemption cost), task power/time, and clock rate.

The simulation results presented in this paper follow a four-field legend: the first field denotes the topology (torus or tile), the second denotes the scheduling algorithm (EDF-SA or PD), the third is + or − for the presence or absence of the fault-tolerance algorithm, and the fourth denotes the number of PEs in the MpSoC system. Performance evaluations of architectural and topological variations were performed but are not included in this paper; only performance results related to the proposed algorithms are presented.

6.1. Simulation Scenarios

To demonstrate the capabilities of the proposed framework and the algorithms, we performed several simulations on task graph sets and evaluated their effectiveness over the following design abstractions.
(i) Architectural. Architecture-based evaluation emphasizes the effect of different PEs in the MpSoC framework. The current implementation of the framework consists of processors of the same architecture that can execute various tasks; that is, each processor has a different set of tasks it can execute. The MpSoC architecture was modified by changing the number of PEs, the types of PEs, and the topology. It is important to consider the effect of topology in architecture evaluation because the topology of the MpSoC also dictates the performance of the individual PEs. The effect of such architecture changes is measured in terms of average execution time/PE/task (ETIME) in μsec, average PE utilization (UTL), power/task of PEs (PWR) in μWatts, and power delay product (PDP), as shown in Table 1.
(ii) Scheduling. Evaluation of scheduling algorithms over various MpSoC architectures was conducted. The performance of earliest deadline first based simulated annealing (EDF-SA) and the new performance driven (PD) algorithm is compared (Table 2). The performance metrics for evaluating the scheduling algorithms are the same as those for the architecture evaluations. The standardized SA costs are also shown.
(iii) Mapping. Evaluation of mapping algorithms over various MpSoC architectures was conducted using the performance driven scheduling policy. The performance of the proposed HWD, Next Available (NA), and random mapping algorithms is compared in Table 3.
(iv) Fault Tolerant. Evaluation and comparison of fault-tolerant (FT) and non-fault-tolerant (NFT) implementations over various MpSoC architectures were conducted, and results for two test cases are shown in Tables 4 and 5.

6.2. Simulation Results and Explanation

Table 1 shows the average performance metrics obtained for the various architectural variations possible in the developed MpSoC framework. The architectural variations include the type of topology (tile or torus), the scheduling algorithm (EDF-SA or PDS), the mapping algorithm (HWD), the fault-tolerant scheme (NFT or FT), and the number of PEs. Table 2 specifically compares the performance of the EDF-SA and PDS scheduling algorithms. Comparison results on total execution time (ECT) in μsec, average processor utilization, and average throughput in bytes/μsec are presented; metrics of lesser significance have been omitted. Comparisons between the EDF-SA and PDS algorithms for processor utilization, throughput, and buffer usage across various architectural and topological scenarios are shown in Figures 3, 4, and 5. Based on these tables and figures, it can be seen that utilization, throughput, and buffer usage increase for only a marginal increase in execution time, thereby improving the overall performance cost of the MpSoC system under the PDS algorithm. Table 3 shows the evaluation of the various mapping algorithms; it is clear that HWD minimizes buffer usage during task execution. The units used in the tables are as follows: execution time (ETIME) in μsec, processor utilization (UTL) as a ratio, throughput (THR) in bytes/nsec, port traffic (PTR) in bytes/nsec, power (PWR) in μWatts, and buffer usage (BFR) in number of 32-bit registers.

The proposed fault-tolerant scheme has been evaluated on several benchmarks, and two test cases are presented in Tables 4(a), 4(b), 5(a), and 5(b). For test case 1 (Table 4), a PE fault on PE ID 9 at 100 ns is simulated. Table 4(a) presents the general performance characteristics comparing the simulation scenarios without and with the PE fault. Table 4(b) presents a detailed performance comparison of the significantly affected PEs without and with the fault-tolerant scheme, along with the difference and percentage change (Δ/%) between the two scenarios. Similarly, Tables 5(a) and 5(b) present the second test case for a task graph with 500 tasks operating on a tile topology with 81 PEs; the PE fault is introduced on PE ID 80 at 146 ns. Only the variations in the performance metrics of significantly affected PEs for the nonfault and fault simulations are shown. The difference in performance metrics (execution time, port traffic, and power) incurred with the fault-tolerant scheme is insignificant compared to the original nonfault simulation metrics. For example, in Table 5(b), the execution time and processor utilization of the faulty processor drop from 397 μs to 91 μs and from 0.36 to 0.08, respectively, after the fault occurs. The tasks assigned to the faulty processor (PE 80) are reassigned to another processor of the same type using the fault-tolerant algorithm discussed, and the resulting increases in execution time and utilization are shown in the table. A marginal increase in cost on the other processors has also been observed.

7. Conclusions

Designing efficient MpSoC systems requires close scrutiny of the various design choices in order to optimize performance characteristics. Towards optimizing the scheduling, mapping, and fault-tolerant strategies, a fault-tolerant heterogeneous MpSoC simulation framework was developed in SystemC/C++ to provide a comprehensive tool for designing and verifying the proposed methodologies. Three algorithms were proposed: a performance driven (PD) scheduling algorithm based on the Simulated Annealing technique; a strategic Homogenous Workload Distribution (HWD) mapping algorithm, which considers dynamic processor workload; and a fault-tolerant (FT) methodology to deliver a robust application processing system.

Extensive simulation and evaluation of the framework, as well as of the proposed and classical algorithms, were conducted using task graph benchmarks, and results on all performance indices for different scenarios were evaluated and discussed. In the scheduling space, the PD heuristic showed better overall performance than EDF, specifically for a small number of processors. Fault-tolerant evaluations showed that throughput, buffer utilization, execution time/task, and power/task are not significantly affected even after a processor failure occurs; the fault-tolerant scheme showed only a small decrease in processor utilization. Tile topology showed better utilization and throughput, whereas torus topology performed significantly better with respect to execution time/task and power/task. Comparisons across processor counts showed a proportional decrease in utilization, execution time, and power as the number of processors increased, while throughput and buffer utilization remained almost identical. Executing highly heterogeneous tasks resulted in higher power and latency costs. Finally, the proposed HWD algorithm evenly distributed the execution workload among the processors, which improves overall performance, specifically processor and buffer utilization.

The proposed algorithms on the MpSoC framework are currently evaluated only with synthetic task graphs; we would like to evaluate them on application benchmarks as well. The PE implemented is a basic 32-bit behavioral processor; we would like to adopt industry-standard processors such as ARM or OpenRISC as PEs. Also, during scheduling, communication cost values are derived from the task execution times; future work that considers actual communication costs needs to be explored. Work related to router and buffer failures will also be considered.