Abstract

Embedded systems are widely used today in different digital signal processing (DSP) applications that usually require high computation power and tight constraints. The design space to be explored depends on the application domain and the target platform. A tool that helps explore different architectures is required to design such an efficient system. This paper proposes an architecture exploration framework for DSP applications based on Particle Swarm Optimization (PSO) and genetic algorithms (GA) techniques that can handle multiobjective optimization problems with several hybrid forms. A novel approach for performance evaluation of embedded systems is also presented. Several cycle-accurate simulations are performed for commercial embedded processors. These simulation results are used to build an artificial neural network (ANN) model that can predict performance/power of newly generated architectures with an accuracy of 90% compared to cycle-accurate simulations with a very significant time saving. These models are combined with an analytical model and static scheduler to further increase the accuracy of the estimation process. The functionality of the framework is verified based on benchmarks provided by our industrial partner ON Semiconductor to illustrate the ability of the framework to investigate the design space.

1. Introduction

Over the past few decades, the demand for embedded digital signal processing (DSP) systems has been increasing constantly. These systems are used in numerous applications such as portable audio players, wireless communication sets, and intelligent hearing aid devices, just to name a few. Due to the nature of these devices, they are usually implemented using System on Chip (SoC) technology. DSP applications are complex, parallel in nature, and time consuming to develop. Designers are usually faced with different conflicting design objectives, such as low power, low cost, flexibility, and high performance. SoC embedded systems are usually equipped with a heterogeneous multiprocessor architecture in which different components are integrated on a single chip. These components range from fully programmable processors to dedicated hardware blocks. The designer has to select the proper components to optimize the different design objectives. Fully programmable processors could be selected for flexibility in supporting multiple applications and system extension while dedicated hardware accelerators are used to optimize hard constraints such as time and power dissipation.

Several architectures are available to implement a given DSP application, and selecting a suitable near-optimal architecture for the application is a very challenging task. A tool that helps the designer select a near-optimal architecture is of great interest because it can reduce development time and hence time to market. Architecture exploration tools explore the design space to find an optimal or near-optimal solution for a given application. During the exploration phase, several architectures are generated and evaluated to estimate the normally conflicting constraints and objectives. Multiobjective optimization (MOO) techniques can be used for such exploration. The literature shows that multiobjective evolutionary algorithms (MOEA) [1] and Particle Swarm Optimization (PSO) [2] are efficient and robust enough to explore the complex design space of heterogeneous embedded systems. The majority of these algorithms make use of the Pareto-dominance concept to assign a single fitness value to each individual in the population; this value is used to select a set of nondominated solutions, called the Pareto front. The generated architectures need to be evaluated to direct the search towards promising points in the design space. Because several architectures need to be evaluated, the evaluation technique used should give quick results with a good level of accuracy. A more accurate evaluation can then be performed at a lower level of abstraction for the selected candidates. Analytical and statistical evaluation methods are usually used to give quick results. Analytical models for each component in the architecture are used to estimate the overall performance of the architecture [3]. The evaluation results are fed back to the optimizer to accept or reject the generated architecture.
This paper proposes an efficient architecture exploration framework for DSP applications based on a combination of metaheuristic techniques (PSO and GA) that can handle multiobjective optimization problems with several hybrid forms. A technique for performance evaluation of embedded systems is also presented. The proposed design exploration technique is applicable to certain embedded systems, specifically those that include general purpose processors and specialized coprocessors.
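
To make the Pareto-dominance concept concrete, the following minimal sketch filters a set of candidate architectures down to its nondominated subset. It assumes that every objective is to be minimized; the function names and the (execution time, power) values are hypothetical and are not taken from the framework itself.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (execution time, power) estimates for five candidate architectures.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0), (8.0, 8.0)]
front = pareto_front(candidates)
print(front)  # [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0)]
```

Here (11.0, 6.0) is dominated by (10.0, 5.0) and (8.0, 8.0) is dominated by (8.0, 7.0), so only three candidates remain as trade-off points.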

1.1. Main Contributions

Different optimization and evaluation techniques are investigated in the literature. Selecting an effective optimization technique for architecture exploration is one of the main goals of this paper. The main contributions of this paper can be summarized as follows.
(i) An evaluation technique based on static task scheduling and ANN modeling of multiprocessor systems is developed to estimate the performance of different architectures generated during the optimization process. The static scheduler utilizes several algorithms to estimate the number of clock cycles required to run a specific application on a multiprocessor system. The training data are collected by simulating the execution of several applications on each processor element and by using tools provided by the processor vendor for power estimation [4]. This evaluation technique is integrated with an optimization engine developed for multiobjective optimization [5] to facilitate the creation of several architectures that meet the designer's requirements.
(ii) A complete framework is developed that makes use of the tools and algorithms described above for architecture exploration of embedded DSP systems. The optimization engine is used to search the design space and propose different multiprocessor architectures. The framework is responsible for searching the design space to find a set of near-optimal solutions that meet the user requirements. The framework includes a set of tools to model the application in the form of a Directed Acyclic Graph (DAG), to model different architectures, and to manipulate different system components.

The remainder of this paper is organized as follows. Section 2 provides essential background on multiobjective optimization techniques and the architecture exploration process. Section 3 presents related work on frameworks proposed for design exploration along with techniques for searching the design space.
Section 4 gives an overview of the methodology used to develop and test each single component of the framework. A methodology for evaluating embedded processors is introduced in Section 5. This methodology is combined with a static task scheduler, the optimization engine, and other tools to build the architecture exploration framework that is presented in Section 6. Finally, Section 7 concludes the paper and presents future work.

2. Background

Architecture exploration is the problem of searching the design space of a given application to find an optimal or suboptimal hardware implementation. The application is normally described using a software model, and the main objectives of the exploration tool are to construct an architecture and map the software model to the proposed hardware.

2.1. Problem Definition

The problem of architecture exploration is illustrated in Figure 1.

The application is modeled using a set of software blocks represented in a high-level modeling tool or language. The high-level description of the application is then converted into a flow graph. Each software block in the graph has several attributes, such as the block size, application load, and timing constraints. The goal of architecture exploration is to specify the cores, the mapping of each software block onto a core, and the interfacing between each pair of cores. A core can be a dedicated hardware module, a general purpose processor, or an application-specific processor chosen from a core library. The combination of core selection, mapping, and interfacing results in different architectures. The resulting architectures are then evaluated against the given application constraints. The most appropriate architecture should meet the overall application constraints. An architecture exploration tool is required to explore and evaluate as many designs as possible. The efficiency of the tool is measured by the total number of architectures identified, how close the resulting architecture is to the optimal architecture, and the speed of the search process.
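
As a rough illustration of how quickly this design space grows, the sketch below enumerates every mapping of a few software blocks onto a small core library. The block and core names are hypothetical, and interfacing choices are ignored; with n blocks and m candidate cores there are already m^n mappings before interfacing is considered.

```python
from itertools import product
from typing import Dict, Iterator, List


def enumerate_mappings(blocks: List[str], cores: List[str]) -> Iterator[Dict[str, str]]:
    """Yield every assignment of software blocks to cores (len(cores) ** len(blocks) in total)."""
    for choice in product(cores, repeat=len(blocks)):
        yield dict(zip(blocks, choice))


# Hypothetical software blocks and core library, for illustration only.
blocks = ["fir_filter", "fft", "gain_control"]
cores = ["dsp_core", "risc_core"]

mappings = list(enumerate_mappings(blocks, cores))
print(len(mappings))  # 2 cores ** 3 blocks = 8 mappings
print(mappings[0])    # the first mapping places every block on "dsp_core"
```

Even this toy case shows why exhaustive enumeration stops being practical: 20 blocks on 4 cores would already give 4^20 (about 10^12) mappings.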

2.2. System Evaluation

The evaluation of the generated architectures is the most important phase of the exploration process. The goal of the evaluation process is to extract performance measures of the evaluated architecture. These measures can include speed, area, and power consumption. Based on these measures an architecture can be accepted or rejected. Accurate evaluation will efficiently guide the exploration towards the optimal solution; however, it might be a time-consuming process. Evaluation can be performed at different levels of abstraction. Each level provides a different accuracy, ranging from the transistor level (more accurate and more complex) to the system level (less accurate but simpler and hence faster) [6]. At each level of abstraction, a model should be provided for each core that gives information about its power consumption, performance, and area. These models form the core library that is used during exploration.

In general, the evaluation of embedded systems can be classified into three categories: cycle-accurate, statistical, or analytical. Simulators that perform cycle-accurate simulation of the processor and peripherals tend to give an accurate evaluation at the expense of a significant amount of CPU time. The speed of the simulation process can be enhanced using transaction-level models for embedded processors and the communication subsystem. Communication subsystem components are modeled as channels and are presented to execution modules using interface classes. Statistical simulators, on the other hand, use statistical information gathered from the profiled and estimated running time of the application on the given architecture. This type of simulator gives an adequate evaluation in reasonable time. In the analytical evaluation scheme, analytical models exist for each computational unit in the embedded architecture. Analytical techniques provide fast evaluation of a given architecture, although with a low level of accuracy. The accuracy can be enhanced to represent a more realistic environment at the expense of more computation time. The analytical scheme is more suitable for architecture exploration (AE) tools since a large number of architectures must be evaluated in a short period of time. Cycle-accurate simulation can be used in later stages when more accuracy is required for fine tuning of the resulting architecture.

Power consumption is a critical issue in the design process of embedded systems. Estimating the power consumption of the proposed architectures during the architecture exploration is a very important task, as it enables the evaluation of an important measure of the system performance. During architecture exploration, it is not possible to perform physical measurements as the system is still under investigation. A power estimation technique should be employed to quickly estimate the performance of several architectures generated during the search process [7].

In this paper we present a novel approach to evaluate an embedded architecture based on artificial neural networks (ANNs) for processor modeling and static task scheduling for overall system performance [8, 9].

2.3. Optimization Algorithms

The literature shows that architecture exploration is an NP-complete problem [10]; no exact method is known that finds an optimal solution in polynomial time. Heuristic methods are therefore used instead to find a good suboptimal solution quickly. Simulated annealing, Tabu search, and genetic algorithms are a few examples of such metaheuristics [11]. The goal of these methods is to quickly obtain a near-optimal solution while avoiding getting trapped in local minima.

In an architecture exploration application, different conflicting objectives are required to be optimized such as speed, area, power consumption, and flexibility. For example, increasing the system performance results in an increase in the system dynamic power consumption. An efficient heuristic search technique that is able to handle the different trade-offs of these objectives will be helpful in many real-world applications such as design space exploration of embedded systems. In these systems, multiobjective evolutionary algorithms were the main optimization technique that proved to be efficient [11].

Two multiobjective genetic algorithms commonly found in the literature can be employed in architecture exploration: the Strength Pareto Evolutionary Algorithm (SPEA) [12] and the Nondominated Sorting Genetic Algorithm II (NSGA-II) [13]. The framework presented in this paper is based on a hybrid multiobjective optimization engine developed in [5]. This engine combines genetic algorithms (GA) and Particle Swarm Optimization (PSO) to perform multiobjective optimization using the Strength Pareto algorithm.
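
The strength-based fitness idea behind SPEA can be sketched as follows. This is a simplified illustration of the original SPEA assignment as it is commonly described: each archive member receives a strength proportional to the number of population members it dominates, and each population member receives a fitness of one plus the strengths of the archive members that dominate it (lower is better). It is not the hybrid GA/PSO engine of [5], and the function names and objective vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dominates(a: Vector, b: Vector) -> bool:
    """Pareto dominance for minimization objectives."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def spea_fitness(archive: List[Vector], population: List[Vector]) -> Tuple[List[float], List[float]]:
    """Return (archive strengths, population fitness values); lower is better."""
    n = len(population)
    # Strength of an archive member: share of the population it dominates.
    strengths = [sum(dominates(a, p) for p in population) / (n + 1) for a in archive]
    # Fitness of a population member: 1 + strengths of all archive members dominating it.
    fitness = [
        1.0 + sum(s for a, s in zip(archive, strengths) if dominates(a, p))
        for p in population
    ]
    return strengths, fitness


# Hypothetical (time, power) vectors: a nondominated archive and a current population.
archive = [(8.0, 7.0), (10.0, 5.0)]
population = [(11.0, 6.0), (9.0, 9.0), (12.0, 8.0)]
strengths, pop_fitness = spea_fitness(archive, population)
print(strengths)    # [0.5, 0.5]
print(pop_fitness)  # [1.5, 1.5, 2.0]
```

The last population member is dominated by both archive members and therefore receives the worst fitness, which is how the single fitness value steers selection towards the Pareto front.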

3. Related Work

Architecture exploration for embedded systems aims to find a suitable hardware architecture for a given application; the architecture includes the computation resources that will be used, the mapping of different application modules onto the computation resources, and the communication resources used between different components. The exploration methodology defines the modeling method of the application, the level of abstraction at which the exploration is carried out (usually system level) [11], and the goal of the exploration (architecture definition, mapping, or communication mapping). In [14] the concept of orthogonalization of concerns is introduced. This paradigm states that the separation of various aspects of the design allows more effective exploration of alternative solutions. An essential aspect of this design methodology is the separation between (i) function and architecture and (ii) communication and computation. In [6] a methodology is introduced for the architecture exploration of a parametrized platform. To reduce the search time, a graph is built to reflect the dependency between different parameters. The Y-chart strategy is employed to separate application design from the architecture. A mapping stage is used to map the application onto hardware.

3.1. Power Modeling and Estimation of Embedded Processors

The evaluation of embedded systems plays an important role in the architecture exploration process. Every candidate architecture generated during the search process should be evaluated to measure its performance compared to other solutions. Accurate evaluation guides the search process towards a near-optimal solution. However, accurate simulation is usually time consuming: gate-level or cycle-accurate simulation produces an accurate evaluation at the cost of long simulation time. Evaluation at a higher level of abstraction can reduce the evaluation time significantly at the cost of reduced accuracy. In this section, several approaches for modeling embedded processors are presented.

A combined simulation and analytical estimation framework is introduced in [15] for network processor architectures. The authors propose an analytical model for network processors that can be used for performance simulation and evaluation of such systems. Their analytical model is based on a real-time calculus that can be used in the analysis of various system properties, timing, and loads of different components. A methodology for estimating power consumption at a high level of abstraction is presented in [16]. It is based on a layered approach for estimating power consumption. This approach shows enough accuracy and fast estimation, which makes it suitable as a guide for exploring the algorithmic and architectural space. Additionally, a preliminary floor-plan is used to improve the accuracy of the estimates. The work is based on the industry standard SPIRIT for specifying IPs and platforms. This work is in its early stages, and its speed at estimating power consumption has not been verified for use in architecture exploration. The work presented in [17] makes use of an instruction-level model to estimate power consumption for soft-core processors implemented in FPGAs. This model provided an estimation of power consumption with a maximum error of 4.78% for several test applications. However, this work does not include analysis of the effect of the software model on power consumption. The relation between the application and the power consumption of the embedded processor is required during architecture exploration: each solution proposes a different application mapping, and with every new architecture the software load changes and, as a result, so does the power consumption. An energy consumption modeling technique for embedded systems based on a microcontroller was demonstrated in [18]. The software tasks running on the embedded system are profiled, and their characteristics are extracted from profiling data. 
Direct measurement is used to capture energy consumption on the actual circuit and relate it to the software characteristics to build the energy consumption model. This work was designed and tested to work with a microcontroller to build an analytical model that can be used to estimate power consumption. Generalizing this work for the use with other architectures is not verified. Design space exploration with high level synthesis is tackled in [19] using a learning-based method. The method is based on Random-Forest learning model and is effective in finding an approximate Pareto set of RTL designs.

The work in [20] proposes a table-less power model for complex circuits that uses neural networks to learn the relationship between power dissipation and input/output signal statistics. This approach was used to model the system at circuit level. The neural network showed good accuracy for modeling the relationship, demonstrating the ability of ANNs to model complex digital circuits without detailed circuit information. The work in [21] presents a simulation model for GPUs in order to facilitate architecture exploration. The work is based on the cycle-accurate simulator presented in [22] and the power models for multicore architectures presented in [23]. The power model is created based on actual measurements that are used along with activity factors to estimate the power consumption of the GPU running a specific kernel. While this provides good accuracy, the simulation time, as with any cycle-accurate simulator, is too long, which makes it unsuitable for architecture exploration. Several approaches found in the literature investigated the issue of estimating power consumption. The common objective of these approaches is to provide a reasonable trade-off between accuracy and estimation time. The use of an analytical model based on instruction set analysis is a common approach for power estimation; however, it suffers from low estimation accuracy. The work in [24] presents a mapping approach that computes multiple energy-throughput trade-off points (mappings) at design time and uses one of these points at run-time based on the desired throughput and current resource availability while optimizing for the overall energy consumption. In this paper we propose the use of a power estimation approach based on ANNs. This approach provides reasonable estimation accuracy at a speed suitable for architecture exploration.

3.2. Architecture Exploration Frameworks

Different frameworks found in the literature support architecture exploration. A group of these frameworks is geared to the exploration of a specific parametrized architecture. A parametrized architecture is a predefined architecture that has fixed main computation units and allows parameters such as memory size, bus width, cache size, and cache associativity to be changed [6, 25]. The goal of such an architecture exploration framework is to choose system parameters that meet the application requirements. Another group of frameworks deals with a generalized architecture model (i.e., no predefined architecture). The framework is responsible for generating a suitable architecture, including all the system components and parameters, for the given application. These frameworks may also include tools for profiling and simulation. Some frameworks are limited to exploration of the memory hierarchy of the architecture. Other frameworks are dedicated to exploring the communication subsystem of embedded systems. In this case the computation subsystem and the mapping of different software blocks to the computation subsystem are defined, and the framework is used to optimize the communication between different computation modules according to the software model.

A framework for architecture exploration of embedded systems is introduced in [26]. A selection of heuristic methods is used to approximate Pareto-optimal curves. The WATTCH framework [7] is used for simulation and evaluation of the different resulting architectures. The framework is used to optimize a superscalar microprocessor-based system. In [27] a simulation approach is presented, and design space exploration is formulated as a stochastic simulation optimization problem. A novel multiobjective genetic algorithm is proposed to enhance solution quality. However, the methodology is limited to soft real-time systems. A framework for architecture exploration of an SoC system designed for GPS hand-held devices is introduced in [28]. The framework is used to explore the design space of a multiprocessor configurable chip (SPP chip-set). Two performance metrics are used for system evaluation: execution time and power consumption. Trace-driven simulation is used for evaluating the candidate architectures. The main idea of this work is to use a high level of abstraction during the first phase, which performs the real search. This approach does not guarantee the optimality of the selected architectures. It might give satisfying results for the chip-set and the specific applications it is designed for. The architecture exploration methodology introduced in [25] is used to build the Platune framework, which is also based on the parametrized multiprocessor SoC introduced in [6]. The framework includes a set of simulators for the CPU, cache, and memory for architecture evaluation. The methodology is based on finding the dependency between the parameters of the SoC, so applying the methodology to other architectures may not be possible. Also, when the architecture is not defined, efficiently exploring the design space becomes more challenging.

The Spade framework is introduced in [29]. It is used with heterogeneous signal processing systems to quickly build models of architectures at a high level of abstraction. The evaluation is performed through cosimulation of the application and the architecture using the trace-driven simulation. The Sesame framework is introduced in [30]. In this framework multiobjective optimization evolutionary algorithms (MOEA) are used for architecture exploration. The architecture exploration problem is formulated in [31]. The evaluation model for Sesame framework and the use of MOEA are introduced in [1]. This framework includes a complete set of tools that can be used for the architecture exploration of heterogeneous systems. The main disadvantage of this framework is the time required for the exploration process. The performance estimation process is based on cycle-accurate simulation which is time consuming. Searching the design space is a basic task that has to be performed efficiently in architecture exploration. Exhaustive search techniques are used to search the design space by covering all possible solutions in the design space at the expense of high computation time [32]. In the Platune framework [25] exhaustive search is used to search design space clusters separately. These clusters are created by investigating the different design parameters and their dependency. Searching each cluster exhaustively takes a much shorter time than exploring the complete design space, and all clusters can be searched in parallel.

Due to the huge computational effort required by exhaustive search, several heuristic methods are used in architecture exploration to speed up the search process. Local search is used in [32] to explore the memory hierarchy for embedded systems. The authors introduce an iterative local search algorithm based on sensitivity analysis of the objective function with respect to the design parameters. The sensitivity analysis measures the change of the objective function with respect to each of the design parameters. It is used to move the starting point of the search close to suspected global minima. Sensitivity analysis can be used with other search techniques to improve their performance [6]. Most local search approaches suffer from getting stuck in local minima, and therefore the quality of the final result depends on the choice of starting point.

4. Architecture Exploration: Overall Methodology

The work presented in this paper focuses on two basic topics: (i) multiobjective optimization of architecture exploration to effectively search the solution space and (ii) performance estimation of embedded systems at the system level. The research in performance estimation involves modeling both the application and the architecture at a level of abstraction suitable for the exploration process. Different algorithms and techniques are proposed for both topics, which are integrated to build an architecture exploration framework for DSP embedded systems. Figure 2 presents the main modules within the proposed architecture exploration framework. Each module is tested individually to ensure its accuracy. These components are then integrated into a single framework that is validated and verified. The proposed design exploration framework can be applied to a variety of embedded systems, specifically those that include specialized coprocessors and general purpose processors.

During architecture exploration several solutions are generated; physically implementing each solution to measure its performance and power consumption is not practical. For this reason, simulators can be used to estimate these values. However, cycle-accurate simulation with high accuracy is a very time-consuming process. Since a large number of solutions requires evaluation, estimating performance via simulation might not be the best route to pursue. In this paper we propose a methodology to model embedded processors that provides a quick estimation of embedded system performance in a reasonable amount of time. The estimation is based on information extracted from cycle-accurate simulations of different applications.

4.1. Input/Output of Framework

The architecture exploration framework proposed in this paper takes a software application as an input and proposes a near-optimal architecture that can be used to implement the user application. The system accepts the input application in the form of a Directed Acyclic Graph (DAG). Accordingly, the designer needs to partition the application into tasks and provide information about the tasks and the dependencies between them. The DAG describes one application execution cycle, starting from accepting input from the user and ending with the generation of output, after which the next cycle begins.

The proposed solutions are composed of different components selected from a core library provided by the designer. Each solution includes the following information.
(i) Different components of the system, including embedded processors and the communication subsystem connecting them. These components are selected by the system to meet the designer's requirements.
(ii) The total memory size and operating frequency of each embedded processor.
(iii) The mapping of the software application on each processor. The system specifies task assignments to processors.
(iv) A complete static task schedule. This schedule specifies when each task will start and complete execution, taking into account task dependency provided by the DAG.
(v) An estimation of the total time required to complete one system cycle, total power consumption of the system, individual power consumption of each processor, total area of the system, and the percentage of the total system flexibility.
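
One way to picture the information carried by each proposed solution is as a small record type. The sketch below is only an illustration under assumed names: the classes, fields, units, and the example values are hypothetical and do not describe the framework's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProcessorInstance:
    core_name: str        # entry in the designer's core library (hypothetical name)
    memory_kb: int        # total memory size
    frequency_mhz: float  # operating frequency


@dataclass
class ScheduledTask:
    task: str
    processor: int        # index into Solution.processors
    start: float          # start time within one system cycle
    finish: float         # completion time within one system cycle


@dataclass
class Estimates:
    cycle_time: float
    total_power: float
    power_per_processor: List[float]
    area: float
    flexibility_pct: float


@dataclass
class Solution:
    processors: List[ProcessorInstance]
    interconnect: str             # communication subsystem connecting the processors
    mapping: Dict[str, int]       # task name -> processor index
    schedule: List[ScheduledTask]
    estimates: Estimates

    def is_consistent(self) -> bool:
        """Check that tasks map to existing processors and the schedule agrees with the mapping."""
        valid = all(0 <= p < len(self.processors) for p in self.mapping.values())
        agrees = all(
            self.mapping.get(s.task) == s.processor and s.start <= s.finish
            for s in self.schedule
        )
        return valid and agrees


# Hypothetical two-processor solution, for illustration only.
example = Solution(
    processors=[ProcessorInstance("core_a", 64, 100.0), ProcessorInstance("core_b", 32, 50.0)],
    interconnect="shared_bus",
    mapping={"filter": 0, "fft": 1},
    schedule=[ScheduledTask("filter", 0, 0.0, 4.0), ScheduledTask("fft", 1, 5.0, 8.0)],
    estimates=Estimates(8.0, 30.0, [20.0, 10.0], 3.5, 100.0),
)
print(example.is_consistent())  # True
```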

In our study we focused on a set of general purpose processors. However, the flow presented in this paper can easily support hardware accelerators with minor modifications. With general purpose processors, there are no constraints on the type of tasks that can be executed on a specific processor. This is in contrast to hardware accelerators, which are more specialized and accordingly introduce some restrictions. These restrictions limit the mapping of the application onto the hardware, as some tasks cannot be mapped to a hardware accelerator. On the other hand, the limited functionality of a hardware accelerator reduces the complexity of the model required.

The models developed in this paper assume a distributed memory structure. Each processor is connected to a local memory through the processor local bus. The model developed for each processor includes the performance of the processor and the memory as well. Shared memory structure can be accommodated in the model as well but will require further investigation related to synchronization mechanisms to manage the shared memory.

4.2. Methodology for Modeling Embedded Processors

In order to model an embedded processor, several data points are required to capture the relations between the processor, application and the specific performance measure. This paper focuses on building a statistical model for both the performance and power consumption of each processor. For each processor investigated in this paper, C/C++ compilers are provided by the vendor, which enables the coding of several benchmarks and running them. A set of benchmarks is executed on each processor using the tools provided by the vendor. Each run is used to form a point in the relation between application/architecture pair to performance/power consumption. Each new data point increases the accuracy of the model generated. For this reason several tests are conducted for each architecture/application combination to get a sufficient number of points to accurately build the model. This process is required for each processor that needs to be included in the core library.

4.2.1. Collecting Data Points

Data-sets collected from the simulation process of several embedded processors are used to build different models at various levels of abstraction. These models are introduced to tackle the problem of performance estimation with various levels of accuracy. Each model is tested and verified for its accuracy in estimating the performance of the embedded processors. Several platforms can be used to test the proposed approaches. In this paper the Tensilica Xtensa embedded processors system [4] was used to create several architectures. The Tensilica flow also provides a set of tools to evaluate the performance and power consumption of the architectures generated. Statistical data for several embedded processors is collected and used to build several models. These models are later used to evaluate the performance of the architectures generated during the search process which are based on these embedded processors.

4.2.2. Artificial Neural Network Modeling

Artificial neural networks (ANNs) are efficient tools for classification, pattern recognition, and modeling. The ANN is a nonlinear model that can be trained to capture the embedded pattern in the data. The ANN is composed of interconnected units, neurons, with weighted links. The relation between the input and output is modeled by the ANN using an iterative process of updating the weights of these links, which is called training. The training is repeated for many epochs to select the best weights which would minimize the error level and thus enhance the accuracy of the generated model. This model can be utilized for predicting the output of new points produced by the same system. The ANN will be used to estimate both the performance and power consumption of the embedded processor being modeled. For each processor investigated using the Tensilica platform a data-set is created using the simulation results. These data-sets are used to train the ANN. Several training cycles are executed for each data-set to perform statistical analysis on the accuracy of the model generated.

4.2.3. Static Task Scheduling

A static task scheduling module has been developed to schedule tasks based on the DAG extracted from profiling the application in [8, 33]. The static task scheduler targets heterogeneous multiprocessor architectures. The use of the task scheduler is expected to increase the accuracy of estimating the number of clock cycles required to run a specific application. The task scheduler inherently takes into account the effect of the embedded processor and the communication between the different components of the architecture. Combining the task scheduler and the analytical model introduces another level of accuracy to the performance estimation process.

4.3. Objectives of Proposed Framework

The proposed framework performs multiobjective optimization using the design space specified by the DAG and the core library. The framework focuses on four objectives.

(1) Minimizing the total time required to complete one system cycle.

(2) Minimizing the total power consumption of the system.

(3) Minimizing the total area of the system.

(4) Maximizing system flexibility, which is measured by the percentage of software that can be modified and the availability of a specific feature in the system.

These four objectives need to be calculated for each processor in the proposed system. The most accurate method of accomplishing this task is to physically implement the system and take actual measurements.

5. Performance Modeling and Estimation

A smart approach for embedded systems performance estimation is presented in this section based on analytical estimation, static task scheduling, and artificial neural networks (ANNs). In order to verify these approaches, several models are developed at different levels of abstraction to cover various trade-offs between accuracy and speed of evaluation. Accuracy plays a significant role in the direction of the search process. High accuracy leads to high quality solutions. On the other hand, the time used to evaluate the performance of an embedded system is also important. The accuracy of the proposed models is evaluated by comparing the estimated performance measure to cycle-accurate simulation results.

5.1. Experimental Setup

Experiments performed are divided into two categories. The first set of experiments is used to build data-sets that relate the application to the power/performance of the architecture. The results from these simulations are used to derive an analytical model for estimating the performance and power consumption of embedded processors. The simulation results are later used to build an ANN model for the estimation of the two measures. A second set of experiments is used to verify the capability of the ANN for modeling performance.

5.1.1. Benchmarks for Building Power/Performance Data-Sets

Several benchmarks were used to generate the data-sets for each embedded processor investigated in this paper. The benchmarks were selected to tackle different loading conditions as shown in Table 1. The first set of benchmarks, ARR_OPR, is mainly used for testing different flows and verifying the accuracy of simulation. Benchmarks FIR and FFT are used to study the effect of arithmetic operations on each processor under investigation, whereas the last benchmark is mainly used to study the effect of applications with a higher data transfer rate on each processor.

The main objective of these benchmarks is to collect data points for building the data-sets. Each benchmark is compiled for the given processor, and the corresponding simulation flow is executed. The number of instructions required to complete a specific application is calculated using the profiler. Each benchmark is executed with several parameters to form different loading conditions for the processor under investigation.

All simulation results are used to form the training data-sets. The profiling results for each benchmark are used to form a model for the application in terms of individual instructions. These data-sets are used to train the ANN to build a model for the processor. For each processor, the training is repeated for 20 independent runs. The average accuracy and training time for the 20 runs are recorded for statistical analysis. Confidence intervals around the average are calculated with a CF of . All statistical results are presented using box and whisker plots for both training accuracy and training time.
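As a minimal sketch of the statistics described above, the following Python snippet computes the average of a set of per-run results and a confidence interval around it. The confidence level is left as a parameter because its value is not stated here; the sample accuracies, the function name, and the use of a normal approximation are illustrative assumptions rather than the procedure used in the experiments.

```python
import math
import statistics

def mean_confidence_interval(samples, confidence):
    """Return (mean, lower, upper) for the given confidence level.

    Assumption: a normal approximation is used for simplicity; for a small
    number of runs a Student's t critical value would be more appropriate.
    """
    n = len(samples)
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    # Two-sided critical value for the requested confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, mean - z * std_err, mean + z * std_err

# Hypothetical training accuracies (in percent) from independent runs.
accuracies = [91.2, 92.5, 90.8, 93.1, 91.9, 92.2, 90.5, 92.8]
mean, low, high = mean_confidence_interval(accuracies, confidence=0.95)
print(f"mean={mean:.2f}, interval=[{low:.2f}, {high:.2f}]")
```

The same function can be applied to training times or to exploration times collected over repeated runs.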

5.1.2. Simulations and Embedded Processor Data-Set Generation

Several architectures are investigated in this paper to test the proposed approach for modeling embedded processors. Each processor under investigation is inserted into an architecture that can be used for profiling, simulating, and estimating the power consumption and the total number of clock cycles required to complete a specific application.

Four embedded processors are configured using the Xtensa flow. This architecture is used to build four configurations shown in Table 2. For each processor, the software tools are generated automatically for compilation and profiling. The generation of the tools (compiler, profiler, and instruction set simulator) is part of the Xtensa platform.

As shown in Table 2 the first processor includes a basic configuration of the Xtensa LX processor. The second processor includes a hardware multiplier that can be used for most DSP applications. The third processor contains a MAC unit in addition to the hardware multiplier, as a multiply-accumulate operation is a basic operation for most DSP applications. The fourth processor is configured to have more cache memory than the basic configuration, as well as using a different cache access methodology. Each benchmark in Table 1 is compiled and profiled using the tool set for each processor. The energy estimator is also used to estimate the power consumption for running the benchmark on each of those processors.

5.2. Profiling and Application Modeling

The application is modeled by extracting some features from running the application after detailed profiling. These features are selected to capture the effect of the application on the embedded processor. Profiling the application is also used to build a Directed Acyclic Graph (DAG) which captures the dependency between different tasks in the application. This graph is also used to schedule tasks over multiple processors.

Based on the Xtensa flow we attempt to run each benchmark of Table 1 and collect profile data for each run. Besides testing the effect of each benchmark on the processor performance, the effect of the operating frequency is also investigated. Each benchmark is executed with different frequencies, and the power consumption is estimated. Table 3 gives a summary of the simulations performed on each processor and the total number of simulation points collected for each processor. Using these simulations, a model for each processor can be built. The total number of simulations performed is 900 for the four processors. As shown in Table 3, each simulation requires a relatively short time to complete, allowing a large number of simulations to be performed in a short period of time. This should increase the number of data points in the data-set.

5.3. Modeling Embedded Processors Using ANN

In this section a back-propagation ANN will be investigated for modeling the power consumption and performance of embedded processors. A three-layer ANN topology is used as shown in Figure 3. The weights of the inputs for each neuron are updated using the back-propagation algorithm, depending on the error between the actual and predicted outputs of the system. The training process iteratively updates the weights to minimize the error. By repeating the training for several samples of the same process, the ANN can be used to predict the outputs for samples it never encountered before [34]. The ANN module is designed and implemented using C/C++ based on the back-propagation algorithm as a general model. The design of this module is flexible such that it can be configured for any problem. The functionality of the ANN for modeling linear/nonlinear systems was tested.

All simulation results are grouped into one data-set for each processor. Each sample in the data-set represents the outcome of one simulation. Figure 3 illustrates the processor model for estimating power consumption. During the training phase, the estimation error is used to update the ANN weights. After training, the feed-forward path is used for estimating the performance of the processor during the architecture exploration.
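To make the training and feed-forward phases concrete, the following is a minimal sketch of a three-layer back-propagation network in plain Python. It is not the C/C++ module described above: the class name, the sigmoid hidden layer with a linear output, the learning rate, and the synthetic data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ThreeLayerANN:
    """Input layer -> one sigmoid hidden layer -> one linear output neuron."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden neuron has n_inputs weights plus a bias (last entry).
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        # The output neuron has n_hidden weights plus a bias (last entry).
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Feed-forward path: returns (hidden activations, prediction)."""
        hidden = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
                  for ws in self.w_hidden]
        out = sum(w * h for w, h in zip(self.w_out[:-1], hidden)) + self.w_out[-1]
        return hidden, out

    def train_sample(self, x, target, lr):
        """One back-propagation update for a single sample."""
        hidden, out = self.forward(x)
        err = out - target  # derivative of 0.5 * err**2 w.r.t. the output
        for j, h in enumerate(hidden):
            # Gradient reaching hidden neuron j (uses the pre-update output weight).
            delta_h = err * self.w_out[j] * h * (1.0 - h)
            self.w_out[j] -= lr * err * h
            for i, v in enumerate(x):
                self.w_hidden[j][i] -= lr * delta_h * v
            self.w_hidden[j][-1] -= lr * delta_h
        self.w_out[-1] -= lr * err

    def mse(self, data):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

# Hypothetical data: two normalized load features -> normalized cycle count.
rng = random.Random(1)
data = []
for _ in range(40):
    a, b = rng.random(), rng.random()
    data.append(([a, b], 0.6 * a + 0.3 * b * b))

net = ThreeLayerANN(n_inputs=2, n_hidden=4)
error_before = net.mse(data)
for epoch in range(200):
    for x, t in data:
        net.train_sample(x, t, lr=0.1)
error_after = net.mse(data)
print(f"MSE before: {error_before:.4f}, after: {error_after:.4f}")
```

After training, only `forward` is needed to obtain an estimate, which is why estimation during exploration is cheap compared to running a simulation.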

The parameters of each data-set of processors used are listed in Table 4. The testing samples are selected randomly from the main data-set. After each iteration, the testing samples are used to measure the accuracy of the training. The testing samples are not used during the training process. After training, data extracted from the other applications is used to test the accuracy of the model as will be shown in Section 6.

The ANN was able to estimate the behavior of the system with a minimum accuracy level of . Increasing the number of training epochs increases the accuracy of the training, as well as increasing the number of neurons in the hidden layer. The choice of the values for these two parameters should be related to the size of the problem, as values greater than necessary may result in over-learning, thus reducing the accuracy of the model.

Results obtained indicate that ANN modeling of an embedded processor can accurately estimate the performance of the processor with a much faster estimation time compared to accurate simulation. The estimation time is the time required to propagate the input (the application model) through the network to generate the output which is minimal compared to the simulation time.

5.4. Multiprocessor System Evaluation

The majority of the solutions generated during the search process are based on a multiprocessor system architecture. The software application is executed on the proposed platform to further improve system performance. This section presents a methodology for estimating the performance of a multiprocessor system.

Figure 4 shows the complete flow chart proposed for performance evaluation. The process starts by decoding the solution vector to determine the number of processors and the mapping of tasks to the architecture. Each processor is assigned a set of tasks. With the knowledge of task assignments, the number of cycles to execute each task is estimated using the ANN. The estimated cycle count of each task is passed to the static scheduler, combined with the architecture configuration (number of processors and their connections). The static scheduler defines the start and end times of each task on its designated processor and gives the total number of cycles required by the application for one execution cycle. The latter is one of the objectives to be optimized. The task scheduler also takes into account the communication costs required to transfer data between tasks running on different processors.

After building the task schedule running on each processor, the software load on each processor can be calculated as the sum of the software loading of each task, plus any idle time added to the schedule. This information can be used to estimate the power consumption of each processor using the ANN. Equation (1) is used to calculate the total power consumption of the system, which is the second objective to be optimized. In (1), $P_{\text{total}}$ is the total power consumption of the system, $P_i$ is the power consumption of processor $i$ estimated using the ANN, and $N$ is the number of processors in the architecture:

$$P_{\text{total}} = \sum_{i=1}^{N} P_i. \tag{1}$$

The total area of the system is calculated as a function of the number of gates of each processor used in the architecture. Equation (2) is used to calculate the total gate count as the sum of the number of gates in each processor, including the memory and cache. Each processor in the library has a specific memory size. If during exploration the processor requires more memory, the area of the extra memory is added to the initial size of the processor. This is the third objective to be optimized. In (2), $A_{\text{total}}$ is the total area of the system and $A_i$ is the total area of processor $i$, including the area of the local processor memory, for an architecture with $N$ processors:

$$A_{\text{total}} = \sum_{i=1}^{N} A_i. \tag{2}$$
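The power and area objectives described above are simple sums over the processors of a candidate architecture. The sketch below shows one way to compute them; the `ProcessorEstimate` record, its field names, and the numeric values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

@dataclass
class ProcessorEstimate:
    name: str
    power_mw: float              # power estimated for this processor (e.g. by the ANN)
    base_gates: int              # gate count of the core with its default memory
    extra_memory_gates: int = 0  # gates for memory added during exploration

def total_power(processors):
    """Sum of the estimated power of every processor in the architecture."""
    return sum(p.power_mw for p in processors)

def total_area(processors):
    """Sum of the gate counts, including any memory added during exploration."""
    return sum(p.base_gates + p.extra_memory_gates for p in processors)

# Hypothetical two-processor architecture.
system = [
    ProcessorEstimate("basic", power_mw=0.25, base_gates=40_000),
    ProcessorEstimate("mac", power_mw=0.5, base_gates=60_000, extra_memory_gates=5_000),
]
print(total_power(system), total_area(system))
```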

Flexibility is a measure of the ability of changing the system after implementation. In some cases the application needs to be updated after being delivered to the customer due to some problems in the implementation or the addition of more features. The higher the level of flexibility, the greater the ability to change the application. All the processors investigated in our study have RAM to store application data and ROM to store the application, and both are connected to the processor local bus. The content of the application memory can be replaced by a different application if required. For that reason all the processors in our study have a high level of flexibility. However, the availability of the multiplier unit or MAC unit reduces the size of the code required to implement DSP applications, because the hardware unit will be used to implement the multiplication instead of its implementation using software. This gives a processor with a dedicated hardware multiplication unit a higher level of flexibility compared to the other processors. On the other hand, increasing the size of the cache memory and the size of the main memory increases the ability to replace the application and enhances its performance. A flexibility level is assigned to each processor in the core library taking into consideration all these factors. The choice of this value is very subjective and depends on the designer's preference. These values are used to estimate the overall flexibility of the system, , using

5.4.1. Handling Design Constraints

The performance evaluation module handles two types of constraints: (i) per-task timing constraints and (ii) global design constraints. Each of these two types introduces some change to the design space being explored, which in turn affects the results of the exploration process.

(1) Per-Task Constraints. In some cases, a task is required to be completed in a specific time. The system allows imposing timing constraints on specific tasks. With the knowledge of the number of cycles required by each task on each processor, combined with the timing constraints, the operating frequency can be calculated as shown in (4). In this equation, $f_p$ is the maximum operating frequency for processor $p$, $C_i$ is the number of clock cycles required by task $i$, and $t_i$ is the timing constraint associated with task $i$. The value of $f_p$ is selected to ensure that all tasks mapped to processor $p$ meet their timing constraints:

$$f_p = \max_{i} \frac{C_i}{t_i}. \tag{4}$$

As can be seen, each processor can have its own operating frequency to reduce the amount of power consumption, although the designer can force the use of a single global clock frequency to simplify the design. In the case of multiple clock frequencies, the number of clock cycles returned by the task scheduler is calculated using the fastest clock. The clock cycles required by other processors running on slower clocks are adjusted accordingly.
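The frequency selection described above can be sketched as follows. The sketch assumes that the frequency of a processor is the largest ratio of cycle count to deadline among its constrained tasks, and that a processor with no constrained task falls back to a default frequency from the library. Function names, task names, and numbers are hypothetical.

```python
def processor_frequency_hz(task_cycles, deadlines_s, default_hz):
    """Lowest clock frequency at which every constrained task meets its deadline.

    task_cycles: cycles needed by each task mapped to this processor.
    deadlines_s: timing constraint in seconds, for the constrained tasks only.
    default_hz:  used when no task on the processor has a constraint.
    """
    required = [task_cycles[t] / deadlines_s[t] for t in task_cycles if t in deadlines_s]
    return max(required) if required else default_hz

def cycles_at_fastest_clock(cycles, processor_hz, fastest_hz):
    """Express cycles counted on a slower clock in cycles of the fastest clock."""
    return cycles * fastest_hz / processor_hz

# Hypothetical task loads (cycles) and deadlines (seconds).
p0_tasks = {"analysis": 50_000, "gain": 8_000}
p1_tasks = {"logging": 20_000}
deadlines = {"analysis": 1e-3, "gain": 0.5e-3}

f0 = processor_frequency_hz(p0_tasks, deadlines, default_hz=10e6)
f1 = processor_frequency_hz(p1_tasks, deadlines, default_hz=10e6)
fastest = max(f0, f1)
print(f0, f1, cycles_at_fastest_clock(p1_tasks["logging"], f1, fastest))
```

Forcing a single global clock corresponds to using `fastest` for every processor, which removes the need for the cycle adjustment at the cost of higher power consumption.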

(2) Global Design Constraints. Each of the four objectives being optimized can be constrained according to the designer's requirements. Without designer constraints, all solutions generated during architecture exploration are considered feasible; with constraints, some solutions might not be feasible if they do not meet the design requirements. For example, the design may require the total power consumption of the system not to exceed 1 mW. In that case, any solution whose total power consumption exceeds this value is considered infeasible. The proposed exploration framework attempts to satisfy these constraints by repeating the process until the required constraints are met by some of the solutions or a predefined number of iterations is reached. The final set of solutions is presented to the designer, and it is up to the designer to select the most suitable architecture for the given application. An alternative approach would seek to repair infeasible solutions at the expense of the quality of the objective function.
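A global constraint check of this kind can be sketched as a simple bounds test on the objective values of each candidate. The objective names, the `(lower, upper)` bound convention, and the candidate values below are assumptions made for illustration; only the 1 mW power limit comes from the example above.

```python
def is_feasible(objectives, constraints):
    """constraints maps an objective name to (lower, upper); None means unbounded."""
    for name, (lower, upper) in constraints.items():
        value = objectives[name]
        if lower is not None and value < lower:
            return False
        if upper is not None and value > upper:
            return False
    return True

# Hypothetical candidate architectures with their four objective values.
candidates = [
    {"time_ms": 0.9, "power_mw": 0.8, "area_gates": 120_000, "flexibility": 0.7},
    {"time_ms": 0.6, "power_mw": 1.4, "area_gates": 150_000, "flexibility": 0.8},
]

# Total power must not exceed 1 mW.
constraints = {"power_mw": (None, 1.0)}
feasible = [c for c in candidates if is_feasible(c, constraints)]
print(len(feasible))
```

A lower bound is useful for flexibility, which is maximized, while upper bounds fit the three minimized objectives.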

5.4.2. Feasibility of Solutions

From the discussion presented in the previous sections we can conclude that any solution generated during architecture exploration is feasible to implement. That is based on the assumption that all tasks can be executed on any embedded processor, which is true for the general purpose processors used in our study. However, there are two exceptions.

(1) As discussed in the previous section, when some constraints are required to be met for the overall architecture, the design is considered infeasible if it does not meet the required constraints. The design is still feasible to be implemented, but it does not meet the design requirements.

(2) In the case of the use of hardware accelerators with limited functionality, some tasks may not be possible to execute on that specific processor. When a task is assigned to a processor that is not capable of executing it, the implementation of the solution is not feasible.

In both cases, infeasible solutions need to be addressed for the exploration process to be completed. There are several techniques to handle infeasible solutions. Repairing the infeasible solution is one of these approaches and is used in the task scheduling phase to repair solutions if required.

5.5. Static Task Scheduling

During each cycle of the exploration process, different architectures are generated. A mapping of the software application to the hardware is also generated for each solution. The static scheduler schedules tasks to run on each processor, taking into account the total number of cycles required by each task on its processor and the task dependencies extracted from the DAG. The static scheduler places each task for execution on its designated processor and inserts idle times if required to preserve the order of execution imposed by the task dependencies. The task scheduler uses this information to calculate the total number of clock cycles required to complete one system cycle of the DAG.
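The behavior described above can be illustrated with a simplified list scheduler. This is only a sketch and not the scheduler of [8, 33]: it processes tasks in one topological order, charges a fixed number of cycles per transferred word when two dependent tasks run on different processors, and inserts idle time when a task must wait for its inputs. All names and numbers are hypothetical.

```python
from graphlib import TopologicalSorter

def static_schedule(cycles, mapping, edges, cycles_per_word):
    """Return (start, finish, makespan) in clock cycles.

    cycles:  {task: cycles on its assigned processor}
    mapping: {task: processor}
    edges:   {(src, dst): words transferred from src to dst}
    """
    preds = {task: [] for task in cycles}
    for (src, dst), words in edges.items():
        preds[dst].append((src, words))
    order = TopologicalSorter(
        {task: [src for src, _ in preds[task]] for task in cycles}
    ).static_order()

    proc_free = {}           # cycle at which each processor becomes free
    start, finish = {}, {}
    for task in order:
        proc = mapping[task]
        ready = 0
        for src, words in preds[task]:
            # Communication costs cycles only across different processors.
            comm = 0 if mapping[src] == proc else words * cycles_per_word
            ready = max(ready, finish[src] + comm)
        # Idle time is inserted when inputs arrive after the processor is free.
        start[task] = max(ready, proc_free.get(proc, 0))
        finish[task] = start[task] + cycles[task]
        proc_free[proc] = finish[task]
    return start, finish, max(finish.values())

# Hypothetical four-task DAG mapped to two processors.
cycles = {"A": 100, "B": 200, "C": 150, "D": 50}
mapping = {"A": "P0", "B": "P0", "C": "P1", "D": "P0"}
edges = {("A", "B"): 16, ("A", "C"): 16, ("B", "D"): 16, ("C", "D"): 16}
start, finish, makespan = static_schedule(cycles, mapping, edges, cycles_per_word=2)
print(start, makespan)
```

In this example task D waits for data from C on the other processor, so P0 stays idle for a few cycles before D starts; the makespan is the quantity used as the timing objective.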

The static scheduling was proposed and developed in [8, 33]. The optimization engine we present in this section was incorporated within the static scheduler to investigate the effect of task partitioning, variable task rates, and architecture configuration on the number of clock cycles required to run the application. In this section the task scheduler is used as a subcomponent of the performance evaluation module.

6. System Integration and Experimental Results

In this section, the optimization engine, the scheduler, and the performance estimator modules are integrated with the other necessary tools to perform complete architecture exploration for DSP applications.

The designer first converts the software application into a Directed Acyclic Graph (DAG) and specifies the required constraints that will be used by the system. A set of tools is provided for building the graph and task modeling (see Section 6.4.1). The core library is another input to the system; it is composed of the embedded processors the designer wishes to use in the design. In our case the processors investigated in Section 5 are used as the basic components of the library. If a specific component is required to be included in the library, a model should be built for that component. It is the responsibility of the designer to build these models for each component using the approach presented in Section 5. The DAG and the core library specify the design space of the problem to be explored.

6.1. Profiling and Directed Acyclic Graph Building

As discussed in Section 4, the input of the framework is a DAG that represents the software application. In the DAG, vertices represent software tasks, and edges represent the data flow from one task to the other; the edges are used to preserve the data dependency between tasks as shown in Figure 5.

Each task in the graph has a specific attribute that represents the task load. These attributes are extracted from the application profile, which is also used to extract data dependency between tasks. It is up to the designer to define tasks in the software application. The designer partitions the application into tasks, and, based on profiling results, the attributes of each task are defined. Table 5 shows eight attributes to model the load of the application. These attributes are used as the input of the ANN for estimating both the power consumption and clock cycles required to run the task or application. The eight parameters listed in Table 5 are calculated from profiling each task on a common processor instruction set.

After partitioning the application into specific tasks, profiling can also be used to extract data dependency between those tasks. A specific task “Task 2” is said to be dependent on “Task 1” if “Task 2” cannot start execution until “Task 1” completes execution, as shown in Figure 5. The dependency relation between the two tasks is represented by an edge in the graph. Each edge is associated with a weight value proportional to the amount of data transferred between the two tasks. A graphical tool was implemented to help the designer build the DAG.

6.2. Core Library

In Section 5 several embedded processors were investigated. Simulation was performed to build models that can be used to estimate the performance of these processors. Based on the data available, the four embedded processors investigated from the Xtensa platform are used to form an experimental core library. For each processor, the ANN model for both power consumption and clock cycles is used.

In addition to the ANN models, the area of the processor, static power consumption, maximum and minimum operating frequencies, and other important data are used to model each processor.

The communication between processors is expressed by two types of communication channels: direct link and shared bus. The effect of communication on the system performance is embedded in the static scheduler. The communication cost adds extra clock cycles if the execution of a specific task requires data transfer from one processor to the other. Accordingly, communication channels are modeled by the number of clock cycles required to transfer one word of data.

6.3. Optimization Engine

The optimization engine presented in [5] is used to search the design space composed of the core library and the software application. Four objectives are selected for optimization as explained earlier. The optimization engine is designed so that any multiobjective optimization problem can be plugged into the system with minimal modification to the developed system. All the algorithms presented in [5] can be applied to the architecture exploration problem.

6.3.1. Solution Representation

Each solution generated during the search process is represented using a vector of variables based on the format shown in Figure 6(a). The format consists of three parts; each has several variables that are modified by the optimization engine to generate new solutions. The first part represents the mapping of each task on the system processors. Each variable points to a location in the second part of the representation. The size of the first part is equal to the number of tasks in the DAG. The second part represents the processors/cores forming the architecture represented by this solution. The value of each variable refers to the processor identification number in the core library. The range of values depends on the size of the library and processors used. In the worst case, each task will be mapped to a different processor; hence the size of this part is equal to the number of tasks in the DAG. The third part represents the communication subsystem which determines the communication scheme among components. This part focuses only on intercommunication between different processors. The local processor/memory interface is assumed to be included in the processor model. The size of this part is equal to the number of edges in the DAG. The connection between each pair of processors is modeled using a single variable. The value of each variable determines the type of communication channel used for data transfer between them. If the tasks are mapped to the same processor or any pair of processors with no communication, the communication between these two tasks is ignored. Figure 6(b) shows an example for the solution representation. In this example four tasks are mapped to three processors which are connected with two types of communication channels. If the number of tasks in the graph is $n$ and the number of edges is $e$, then the vector length $L$ can be calculated by

$$L = 2n + e.$$
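The three-part representation can be decoded as sketched below. The vector contents, task names, and the numeric coding of channel types are hypothetical and do not reproduce Figure 6(b); the sketch only follows the layout described above (task mapping, processor slots, one channel variable per DAG edge).

```python
def decode_solution(vector, tasks, edges):
    """Split a solution vector into task mapping, processors, and channels.

    tasks: list of task names (n entries); edges: list of (src, dst) pairs (e entries).
    The vector must have 2 * n + e entries.
    """
    n, e = len(tasks), len(edges)
    if len(vector) != 2 * n + e:
        raise ValueError("vector length must be 2 * n + e")
    mapping_part = vector[:n]           # slot index in the second part, per task
    processor_part = vector[n:2 * n]    # core-library processor ID, per slot
    channel_part = vector[2 * n:]       # channel type, per DAG edge

    slot_of_task = dict(zip(tasks, mapping_part))
    # Only slots that some task points to become processors of the architecture.
    processors = {slot: processor_part[slot] for slot in sorted(set(mapping_part))}
    channels = {}
    for (src, dst), channel in zip(edges, channel_part):
        # Communication between tasks on the same processor is ignored.
        if slot_of_task[src] != slot_of_task[dst]:
            channels[(src, dst)] = channel
    return slot_of_task, processors, channels

tasks = ["T1", "T2", "T3", "T4"]
edges = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")]
# Hypothetical vector: 4 mapping entries, 4 processor slots, 4 channel entries.
# Assumed channel coding: 0 = direct link, 1 = shared bus.
vector = [0, 0, 1, 2,  3, 0, 2, 1,  0, 1, 1, 0]
slot_of_task, processors, channels = decode_solution(vector, tasks, edges)
print(slot_of_task, processors, channels)
```

Here four tasks end up on three processors, the fourth slot of the second part is unused, and the channel variable of the edge between T1 and T2 is ignored because both tasks share a processor.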

6.4. Experimental Setup

Several experiments are performed to validate the operation and performance of the framework. All experiments are performed on a dual-core machine running at a clock frequency of 2.16 GHz and equipped with 3 GB of memory. Several benchmarks from our industrial partner (ON Semiconductor) are profiled and used to verify the functionality of the system. Tensilica Xplorer 2.1.2 software was used to test and profile tasks. Profiling information is used to build the DAG of the given application.

Architecture exploration is performed on each benchmark to validate the functionality of the framework and the quality of the solutions produced. The testing strategy can be summarized as follows.

(1) Convergence: the purpose of this test is to investigate the convergence rate of the framework for each of the four objectives. We also study the effect of adding constraints to the application. At each iteration, a new Pareto front is formed. The average of all the solutions in the Pareto front for the four objectives is calculated and plotted. In each exploration iteration, the quality of solutions in the Pareto front should improve until it reaches some minimal value.

(2) Quality of solutions: for each benchmark some of the final solutions proposed by the framework will be investigated and compared to actual values obtained through simulations. This will include an analysis of the static schedule and the memory requirements.

(3) Execution time: the time required by the exploration framework to generate new architectures.
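The convergence measure described above relies on extracting the Pareto front and averaging each objective over it. A minimal sketch follows; it assumes every objective is expressed as a quantity to minimize, so flexibility is stored with a negative sign. The objective ordering and sample values are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

def objective_averages(front):
    """Average of each objective over all solutions in the front."""
    return tuple(sum(s[k] for s in front) / len(front) for k in range(len(front[0])))

# Hypothetical objective tuples: (time, power, area, -flexibility).
solutions = [
    (10.0, 2.0, 100.0, -0.8),
    (12.0, 1.5, 90.0, -0.7),
    (11.0, 2.5, 110.0, -0.6),  # worse than the first solution in every objective
    (9.0, 3.0, 120.0, -0.9),
]
front = pareto_front(solutions)
print(len(front), objective_averages(front))
```

Plotting these averages after every iteration gives one convergence curve per objective.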

6.4.1. Benchmarks

Two benchmarks are provided by our industrial partner ON Semiconductor. The first benchmark, ON_BM_1, for a hearing aid system, is shown in Figure 7. The system was originally implemented on the Ezairo 5900 system, which is based on two DSP processors designed by ON Semiconductor [35]. This application consists of a beam-former, filter-bank, and frequency-based noise cancellation and filtering. Figure 8 shows the corresponding DAG developed for this application based on the approach presented earlier. The second benchmark, ON_BM_2, is similar to ON_BM_1 but is designed for dual audio channel processing (stereo processing). The size of the benchmark ON_BM_2 is double the size of ON_BM_1. Accordingly, timing, power consumption, and area requirements for this application are expected to double in value.

The following paragraphs present the steps required to build the DAG from the block diagram of Figure 7. It is the task of the designer to partition the application into blocks of code. As shown in Figure 7, the application is partitioned into several DSP blocks. Each of these blocks is mapped to a task (vertex) in the DAG. The data flow between any two blocks is mapped to an edge in the DAG. For this application, audio samples are grouped into frames. The designer can choose the frame size to be processed. For this benchmark, we selected a frame size of 16 samples, which is therefore the cost assigned to every data dependency.

Building a DAG requires eliminating feedback. In Figure 7 the feedback path should be broken at the start of the feedback path. As shown in the DAG in Figure 8, tasks in the feedback path are considered inputs for the main signal path. The inputs to these tasks are only available after the first system cycle. Using this method, the DAG represents only one system cycle where some of the inputs are delayed.
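The effect of breaking the feedback path can be sketched with a small example. The block names below are hypothetical and do not correspond to Figure 7; the sketch interprets "broken at the start of the feedback path" as removing the edge that enters the feedback task, so that this task becomes an input of the graph, and assigns the frame size as the weight of every remaining edge.

```python
from graphlib import CycleError, TopologicalSorter

def topological_order(tasks, edges):
    """Return a valid execution order, or None if the graph contains a cycle."""
    predecessors = {task: set() for task in tasks}
    for src, dst in edges:
        predecessors[dst].add(src)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None

FRAME_SIZE = 16  # samples per frame, used as the cost of each data dependency

# Hypothetical block diagram with one feedback loop.
tasks = ["input", "filter", "output", "feedback"]
block_diagram = [
    ("input", "filter"),
    ("filter", "output"),
    ("output", "feedback"),   # start of the feedback path
    ("feedback", "filter"),
]

# Break the loop by removing the edge that enters the feedback path.
dag_edges = [edge for edge in block_diagram if edge != ("output", "feedback")]
edge_weights = {edge: FRAME_SIZE for edge in dag_edges}

print(topological_order(tasks, block_diagram))  # the loop prevents any ordering
print(topological_order(tasks, dag_edges))
```

After the edge is removed, the feedback task has no predecessors and is scheduled like any other input, using data produced in the previous system cycle.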

6.5. Results

As discussed in Section 6.4, several tests are conducted to validate the functionality of the framework. In this section, the results of each of these tests on both benchmarks are presented.

6.5.1. Convergence and Effect of Constraints

Three different tests are performed on benchmark ON_BM_1. In the first test, the benchmark is used with no constraints. In the second test, a timing constraint is imposed on task “Analysis_SP” to complete in 1 ms. The final test is performed with five more tasks having timing constraints imposed. The convergence rate of each of the four objectives is shown in Figures 9 and 10, where the total number of clock cycles is converted to time using the maximum clock frequency. Each test is repeated 10 times. The external archive after every iteration of each run holds the set of possible solutions for the current problem (the current Pareto front). After each iteration the average of every objective over all the solutions in the archive is calculated. Each point in the curve is the average of 10 runs.

The framework offers two clock settings for the designer. The first setting allows each processor to have its own clock. In this case, processors running noncritical tasks can run with the slowest possible frequency, while processors running critical tasks can operate with a clock frequency that is suitable for the timing constraints. The second option is to use one global clock for all processors in the system, which simplifies the design of the system. However, some processors will run with higher frequencies than necessary, thus increasing the power consumption of the system.

Figure 9(a) shows the convergence rate of execution time with multiple clocks used. As the figure shows, the final convergence value is affected by the level of constraints. When the optimization is performed with no constraints, the system converges to a higher level than that with constraints. When timing constraints are imposed on a specific task, the operating frequency of the processor changes to satisfy the required timing, which reduces the execution time. Increasing the number of tasks associated with timing constraints increases the number of processors running at higher frequencies, which further reduces the convergence level.

The convergence of power consumption and area is shown in Figures 9(b) and 9(c), respectively. As described earlier in Section 5, power consumption is directly proportional to operating frequency. When the frequency of some processors changes to satisfy a specific constraint, the average power consumption increases. The increase in power consumption is affected by the number of tasks associated with timing constraints, as the number of processors running at higher frequencies increases. The power consumption is also affected by the load of each processor: reducing the system area increases the load of each processor, which in turn increases the power consumption. The level of constraints did not affect the final level at which system area converges, although there is a slight decrease for systems with timing constraints. For flexibility (shown in Figure 9(d)), the flexibility level increases as more constraints are associated with the system, because faster processors have a higher level of flexibility.

Figure 10 shows the convergence rate when a single global clock is used. The convergence level of run-time is reduced significantly compared to the multirate system, because all the processors in the system run at the fastest required clock frequency. This also raises the level of power consumption, as shown in Figure 10(b). Comparing the global clock results in Figure 10(b) with the multiple clock results in Figure 9(b), we can see that systems running with a global clock frequency consume more power.

Power consumption is therefore significantly higher for systems running with a global clock frequency than for systems running with multiple clock frequencies.

When timing constraints are imposed on a specific task, the operating frequency of the processor changes to satisfy the required timing. Modifying the operating frequency alters the design space of the problem, as well as the suboptimal Pareto front. The higher the number of tasks with constraints, the more processors are affected. It should also be noted that if none of the tasks on a specific processor has a timing constraint, the processor clock frequency is set to the default value specified in the core library. This explains the change in the convergence level with the constraint level.
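The two clock settings can be illustrated with a minimal sketch. It is not the framework's implementation: the `Task` fields, the rule that a constrained task needs at least `cycles / deadline` Hz, and all numeric values are assumptions chosen for illustration. The only relation taken from the text is that power is treated as proportional to operating frequency.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    processor: str
    cycles: int                          # estimated clock cycles (hypothetical)
    deadline_s: Optional[float] = None   # timing constraint in seconds, if any


def per_processor_clocks(tasks: List[Task], default_hz: Dict[str, float]) -> Dict[str, float]:
    """Multiple-clock setting: each processor gets its own frequency."""
    clocks: Dict[str, float] = {}
    for t in tasks:
        # A processor with no constrained task keeps its library default.
        f = default_hz[t.processor]
        if t.deadline_s is not None:
            # Assumed rule: lowest frequency that fits the task in its deadline.
            f = max(f, t.cycles / t.deadline_s)
        clocks[t.processor] = max(clocks.get(t.processor, 0.0), f)
    return clocks


def global_clock(clocks: Dict[str, float]) -> Dict[str, float]:
    """Global-clock setting: every processor runs at the fastest required clock."""
    fastest = max(clocks.values())
    return {p: fastest for p in clocks}


def relative_power(clocks: Dict[str, float]) -> float:
    """Power in arbitrary units, assuming power is proportional to frequency."""
    return sum(clocks.values())


# Hypothetical example: only task "A" carries a 1 ms constraint.
tasks = [
    Task("A", "P1", 200_000, deadline_s=1e-3),
    Task("B", "P2", 50_000),
    Task("C", "P3", 80_000),
]
defaults = {"P1": 50e6, "P2": 50e6, "P3": 50e6}

multi = per_processor_clocks(tasks, defaults)
glob = global_clock(multi)
print(multi, relative_power(multi))
print(glob, relative_power(glob))
```

In this toy case only P1 is raised above its default under the multiple-clock setting, whereas the global clock raises all three processors, which is why the proportional power figure is larger.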

6.5.2. Exploration Time

The exploration time for both benchmarks is illustrated in Figures 11(a) and 11(b) for an average of 10 runs. Each bar in these figures shows the maximum and the minimum values for the exploration time and the confidence interval around the average value with a CF of . The average exploration time is in the range of 140 to 220 seconds and does not depend on the level of constraints or the application. This indicates that application size does not have a significant effect on exploration time. Because the optimization engine uses the same parameters for all benchmarks, the exploration time is affected only by the time spent evaluating different architectures. The evaluation time depends on the time required to calculate the output of the ANN and the multiprocessor evaluation system, which is almost constant with respect to application size. In total, 10,000 solutions are evaluated during the exploration process. This demonstrates that the performance evaluation techniques presented provide an accurate estimation of system performance in a reasonable amount of time.
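The per-bar statistics described above (minimum, maximum, mean, and a confidence interval over 10 runs) could be computed as in the following sketch. The run times are hypothetical, and the 95% level with a Student t critical value for 9 degrees of freedom is an assumption made only so that the example is complete; it is not taken from the experiments.

```python
import statistics
from typing import Dict, List


def summarize_runs(times_s: List[float], t_critical: float) -> Dict[str, object]:
    """Return min, max, mean, and a t-based confidence interval for the mean."""
    n = len(times_s)
    mean = statistics.mean(times_s)
    # Half-width of the interval: t * s / sqrt(n), with s the sample std. dev.
    half_width = t_critical * statistics.stdev(times_s) / n ** 0.5
    return {
        "min": min(times_s),
        "max": max(times_s),
        "mean": mean,
        "ci": (mean - half_width, mean + half_width),
    }


# Hypothetical exploration times in seconds for 10 runs.
runs = [152.0, 168.0, 175.0, 181.0, 190.0, 160.0, 205.0, 148.0, 172.0, 199.0]

# Assumed confidence level: 95%, two-sided, 9 degrees of freedom.
T_95_DOF9 = 2.262

summary = summarize_runs(runs, T_95_DOF9)
print(summary)
```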

6.5.3. Solution Quality

In this section, we attempt to validate the quality of the architectures proposed by the framework. Four points that cover different trade-offs among the objectives under investigation are selected from the Pareto front. For each candidate solution, the details provided by the framework are presented. These include the operating frequency, power consumption, and execution schedule. Table 6 lists the candidate points investigated in this section.
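For readers unfamiliar with the term, a Pareto front is the set of solutions that no other solution dominates. The sketch below shows a generic non-dominated filter for minimized objectives; it is not the optimization engine of the framework, and the objective tuples (execution time, power, area) are hypothetical values. An objective that is maximized, such as flexibility, would need to be negated before being used with this filter.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical candidates: (execution time, power, area), all to be minimized.
candidates = [
    (1.0, 5.0, 3.0),
    (2.0, 3.0, 2.0),
    (2.5, 3.5, 2.5),   # worse than (2.0, 3.0, 2.0) in every objective
    (3.0, 1.0, 1.0),
]

front = pareto_front(candidates)
print(front)
```

Each remaining point represents a different trade-off, which is why the final choice is left to the designer.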

The first candidate solution investigated is for benchmark ON_BM_1 with no constraints imposed. Therefore, all processors are expected to run at their minimum frequency as defined in the core library. The solution is shown in Figure 12. Three processors are proposed to run the benchmark shown in Figure 7.

Task assignments and a detailed timing schedule of the execution of each task are shown in Figure 13. The schedule shows the execution of tasks on each processor. For each task, the start and end times are reported. The value on the -axis is the number of cycles elapsed from the start of the execution, counted using the fastest clock frequency in the system. Given the operating frequency, the execution time of each task and the total execution time can be calculated. As can be seen, some processors go into an idle state when the next task to be executed depends on tasks running on other processors. Idle time is reflected by NOP operations in the processor load. The execution time of each task includes the time required to transfer data from one processor to another using the communication subsystem.
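The conversion from schedule cycles to execution time, and the amount of idle time per processor, can be sketched as follows. The schedule entries, task names, and the 100 MHz fastest clock are hypothetical; the sketch only assumes, as stated above, that start and end values are counted in cycles of the fastest clock in the system and that tasks on one processor do not overlap.

```python
from typing import Dict, List, Tuple

# One entry per task: (task name, start cycle, end cycle), fastest-clock cycles.
Schedule = Dict[str, List[Tuple[str, int, int]]]


def cycles_to_seconds(cycles: int, fastest_hz: float) -> float:
    """Convert a cycle count on the fastest clock into seconds."""
    return cycles / fastest_hz


def makespan(schedule: Schedule) -> int:
    """Total execution length: the latest end cycle over all processors."""
    return max(end for tasks in schedule.values() for _, _, end in tasks)


def idle_cycles(schedule: Schedule) -> Dict[str, int]:
    """Cycles each processor spends idle (modelled as NOPs) within the makespan."""
    total = makespan(schedule)
    return {
        proc: total - sum(end - start for _, start, end in tasks)
        for proc, tasks in schedule.items()
    }


# Hypothetical three-processor schedule.
schedule: Schedule = {
    "P1": [("T1", 0, 400), ("T4", 600, 1000)],
    "P2": [("T2", 0, 600)],
    "P3": [("T3", 0, 300), ("T5", 1000, 1200)],
}
FASTEST_HZ = 100e6  # assumed fastest clock in the system

idle = idle_cycles(schedule)
total_time_s = cycles_to_seconds(makespan(schedule), FASTEST_HZ)
print(idle, total_time_s)
```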

To verify the accuracy of the solution, code is written for each processor in the system using code blocks for each task. Empty loops that insert NOP instructions are used to simulate idle time where required. An example code fragment is shown in Figure 14. The “Idle” function inserts NOP instructions to put the processor in an idle state for the specified number of cycles. We ignore the communication cost because it is not significant compared to the total execution time. This code is simulated using Xplorer IDE to calculate the performance of the processor running this portion of the application. This process is repeated for each processor in the system.

For each processor, the total number of cycles and the average power consumption are reported and compared to the estimated values reported by the architecture exploration framework. Table 7 shows the comparison results between the simulation results and the estimated values for each processor in the system. The comparison shows an overall error of for power consumption and for the overall system clock cycles. This illustrates both the accuracy of the multiprocessor evaluation module and the performance of the architecture exploration framework.
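A comparison of this kind reduces to a relative error between estimated and simulated values. The sketch below is one plausible way to compute it; the per-processor numbers are hypothetical, and taking the overall error as the mean of the per-processor errors is an assumption, since the aggregation used for the tables is not spelled out here.

```python
from typing import Dict, Tuple


def percent_error(estimated: float, simulated: float) -> float:
    """Absolute error of the estimate relative to the simulated value, in percent."""
    return abs(estimated - simulated) / simulated * 100.0


def overall_error(values: Dict[str, Tuple[float, float]]) -> float:
    """Assumed aggregation: mean of the per-processor percentage errors."""
    errors = [percent_error(est, sim) for est, sim in values.values()]
    return sum(errors) / len(errors)


# Hypothetical (estimated, simulated) clock cycles per processor.
cycles = {
    "P1": (95_000.0, 100_000.0),
    "P2": (52_000.0, 50_000.0),
    "P3": (27_000.0, 30_000.0),
}

for proc, (est, sim) in cycles.items():
    print(proc, round(percent_error(est, sim), 2))
print("overall", round(overall_error(cycles), 2))
```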

The second candidate solution is also for benchmark ON_BM_1 with no constraints. However, this solution is composed of two processors, PCach and PBasc, connected using one common bus for data transfer. The load of the application is balanced between the two processors, with a minimal increase in execution time compared to Solution 1. This comes with the advantage of much lower power consumption and smaller area. A comparison between the simulation and the framework estimated values is shown in Table 8. The comparison shows an overall error of for power consumption and overall error for clock cycles between the estimated and simulated values. Solutions 1 and 2 show two possible trade-offs for the problem. It is up to the designer to select the solution that best matches the design requirements.

The third solution is for benchmark ON_BM_1 where the task “Analysis_SP” has a timing constraint of 1 ms. The simulation results of this system are shown in Table 9 where the clock cycles reported are scaled for both P1 and P2 for the different clock frequencies. The simulated clock cycles for P3 are higher than the estimated value. There are two possible explanations for this difference: (i) during evaluation, the clock cycles of each task are estimated separately from the other tasks. Since there is an estimation error up to the source of this error can be due to error accumulation; (ii) the second possible cause is the effect of idle time on clock cycles. The idle time effect is added to the processor load by increasing the number of “NOP” operations for each processor so that they generate an equal idle time. This process can increase the estimation error. The overall error for power consumption is , which is within the 10% error range.

The final solution investigated is for benchmark ON_BM_2. This benchmark is composed of 40 tasks, and for this solution six of these tasks are associated with 1 ms timing constraints. Table 10 shows a detailed comparison between the simulation and estimation of each processor in the system. There is a significant difference in the clock cycle estimation for some of the processors. As can be seen, these processors run only three tasks and are idle most of the time. As discussed for Solution 3, substantial idle time may affect the estimation of clock cycles.

Several solutions are generated during architecture exploration. Each solution has certain features that match the designer's requirements and others that may not. It is the task of the designer to select the solution that best fits those requirements. The performance estimation module is shown to be efficient in evaluating the performance of multiprocessor architectures. The level of estimation error is acceptable for the task of architecture exploration.

7. Conclusions and Future Work

The demand for high-speed, low-power implementations of multicore systems is increasing due to the high demand for embedded mobile computing and high-speed computation. Various applications require different architectures depending on the software workload, implementation constraints, and project budget. Exploring the design space to find a suitable implementation of a specific application is a complex process. We identified two main components of such a process: modeling embedded multicore systems (power, performance, etc.) and searching the design space. The first component should be accurate and fast, while the second component should enable efficient exploration.

In this paper, we target both components by introducing an architecture exploration framework for embedded DSP applications. The framework explores the design space for a specific application, modeled using software workloads and DAGs, to find a suitable multicore embedded system that runs the application efficiently.

An optimization engine based on multiobjective PSO/GA is developed and used to search the design space. Each solution generated by the optimization engine is evaluated by estimating its performance, power consumption, area, and flexibility. A novel approach for modeling an embedded processor based on artificial neural networks (ANNs) is also presented. The generated models achieve an accuracy between 85% and 95% compared to the simulation results. Several experiments are used to validate the quality of the solutions proposed by the framework. The results show an average accuracy of 90% for both power consumption and the total clock cycles required to execute the application, and demonstrate the effectiveness of the proposed framework.