EURASIP Journal on Embedded Systems
Volume 2009 (2009), Article ID 867362, 14 pages
doi:10.1155/2009/867362
Research Article

Improving the Performance of Bus Platforms by Means of Segmentation and Optimized Resource Allocation

1ABB Corporate Research, Automation Networks Department, SE-72178 Västerås, Sweden
2Department of Information Technology, University of Turku and TUCS, FIN-20014 Turku, Finland

Received 8 August 2008; Revised 11 January 2009; Accepted 5 April 2009

Academic Editor: Leonel Sousa

Copyright © 2009 T. Seceleanu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Consider a processor organization consisting of a number of client modules and server modules (jointly called devices), like memory units and arithmetic-logic processing units. Suppose that these devices are interconnected with a bus which is segmented in such a way that devices connected to a particular segment can communicate in parallel to the data transfer operations going on in the other segments. This is achieved by a control logic which is able to reserve a continuous subsequence of the segments necessary to establish a path from the source to the target device. Given the frequency of data transfer operations between the devices, our task is to determine an efficient segmentation and segment-to-device assignment of this on-chip architecture. This task is formulated as an optimization problem which considers the amount of data transfer operations performed via the bus segments. The problem turns out to be NP hard but we propose efficient local search-based heuristics for it. The heuristics are applied to sample cases, and the outcome is an improved performance in terms of a shorter execution time.

1. Introduction

The growing diversity of devices within the boundaries of a modern system-on-a-chip (SOC) brings up a great number of possible interfaces. System design and performance are often limited by the complexity of the interconnection between the modules and blocks that are integrated into these devices. Furthermore, different data transfer speeds are required as well as parallel transmission. A conventional bus structure is not suitable for such designs. This is because only one module can transmit at a time, and the signaling speed on the bus is restricted by the large capacitive load [1] caused by the interfaces of the attached modules and the long bus wires.

A possible solution to the above problems is the use of a segmented bus platform, combined with a globally asynchronous locally synchronous (GALS) system architecture. In this paper, a group of modules is synchronized to a local clock, whereas interactions between such groups are arranged asynchronously. Hence, clock signal routing and clock skew are no longer system-level design problems; they are confined to each locally synchronous module.

Premises
Segmented buses have been proposed in the past for multicomputer architectures [2–4]. More recent approaches apply segmentation in the context of single-chip devices.
To the best of our knowledge, the first attempt to introduce the partitioned bus concept in the design of digital systems is by Ewering [5]. The structure resembles a dual rail pipelined scheme, where functional units are placed between two buses. Symmetrically placed switches connect the bus segments.

An illustrative analysis focused on segmented bus design is described by Jone et al. [6]. The system is implemented as an ASIC, with specific characteristics of physical interconnect and of the communication structure. The communication infrastructure allows tree-like constructs, unlike the partitioned bus approach (also in an ASIC style) taken in [5].

The segmented bus platform of the present paper was initially introduced in [7], where the platform is viewed from an asynchronous design perspective. Intuition was used there in order to build a segmented bus structure and to compare it with a nonsegmented implementation. The synchronous platform is described in [8]; arbitration policies are addressed in [9, 10].

We consider here the resource allocation procedure for applications running on the segmented bus platform (SB) described in [8]. By a reasonable organization of the hardware components and of the bus segments, one can increase the degree of parallelism of data transfers and in this way possibly improve the overall system performance, expressed as the time required to perform the tasks specified at the application level (evaluated in the number of clock “ticks”). On the other hand, each extra segment means a new switch for allowing the connectivity of the respective segment to the rest of the platform. A balance between parallelism and complexity of the system is therefore to be found. The success of an SB implementation depends on the profile of the accesses between the hardware units, on the organization of the segments, and on the assignment of the units to the segments.

The idea in the present paper is to organize the component devices and the segments in such a way that the number of parallel data transfers is maximized. We maximize the possibilities for parallel transfers by minimizing the number of requests using any single bus segment (since such traffic is necessarily sequential). We evaluate and try to minimize the communication costs of data transfers to obtain an optimal device-to-segment allocation, in terms of performance. The cost is assumed to depend linearly on the amount of data transferred locally (within a segment) and globally (inter-segment communication). The objective here is to keep the inter-segment data transfers of each segment low. Our approach assumes that the application flow has been analyzed, and the communication patterns have been extracted. This is followed by binding functionality to devices, such that a device-to-device communication matrix can be built. We may then start considering how the performance is affected by the bus segmentation and resource allocation. We express the device-to-segment allocation problem as a min-max optimization problem and show its NP hardness. To find reasonable (although suboptimal) solutions, we propose a generic local search algorithm which performs a set of exchange operations on the current candidate solution in order to proceed toward better solutions. In practical tests, we work with synthetic data to be able to characterize the platform without binding it to a specific (set of) application(s). It turns out that applications with biased (that is, uneven) traffic perform better on an SB platform. The algorithms developed here are implemented in the SBTool application, which returns the optimal allocation parameters, based on the communication matrix input.

Paper Overview
The rest of the paper is organized as follows. We continue in Section 2 by exploring existing approaches to segmented bus architectures. In Section 3 we give a short description of the segmented bus concept and the operation modes on such a platform. The problem of segmenting the bus is described in Section 4. Section 5 discusses the time complexity of the problem and introduces a device-to-segment allocation algorithm using local search operations. The behaviour of the proposed algorithms is evaluated with theoretical traffic loads by means of two examples of the device-to-segment allocation, in Section 6.1. Two further examples are analyzed, from implementation perspectives, in Sections 6.2 and 6.3. The paper is concluded in Section 7.

2. Related Work

The on-chip multiprocessor domain has recently ceased to exist only in theory, or at the level of microcomputer architectures. The most popular concept for such systems is today the network-on-chip (NOC) paradigm [11]; see Jantsch and Tenhunen [12] for a discussion on the benefits and challenges of NOC systems.

The SB and the NOC approaches share several advantages, such as modularity, reusability, predictability, and adaptability as well as a set of disadvantages, such as an increased configuration process, loss of optimality, and communication latency. Still, due to the reduced complexity of the SB platform, compared to an NOC system, and to its linear, compared to the two-dimensional structural aspect, the former is closer to the traditional bus-based design experience.

The main differences between the two architectures reside in the centralized versus the distributed arbitration and routing policies. As data-traffic congestion is expected in both architectures, the SB solutions come in the shape of carefully designed arbitration policies, while NOCs benefit mostly from two packet traffic coordination schemes: guaranteed throughput (GT), with bounded latency at data stream levels, and best-effort (BE), with no given guarantee on the arrival time. However, in the context of computer networks, Rexford and Shin [13] report that combining GT and BE traffic is a fundamentally hard issue. Avasare et al. [14] address routing policies for NOCs with centralized control, in order to improve BE traffic characteristics. Such solutions bring NOC closer to the communication management of the segmented buses.

Moreover, at present day design complexity, NOCs do not always provide the huge predicted impact on the design process. With the exception detailed by Delorme and Houzet [15], even for relatively complex applications such as Motion-JPEG decoder [14] or MPEG-2 encoder [16], the number of processing nodes (routers plus the attached processing devices) is quite low (4 and 2, resp.), while the “element interconnect bus”—a bus architecture which, as our SB, allows parallel transmissions—has successfully been employed by Pham et al. in the implementation of a complex “cell processor” [17].

Jone et al. [6] consider the mathematical principles necessary for a sound bus partitioning and aspects of an ASIC-style implementation. The target technology is decisive in building the architecture, and cost functions, as direct connections between communicating devices are possible. The power consumption of the segmented bus is lowered by minimizing the switch capacitance (i.e., effective capacitance) on each bus line. This is the sum of the products of load capacitance and switching frequency. The method produces an optimal segment tree by using a multiterminal network flow formulation of the problem.

Wang et al. [18] study the memory usage and device allocation on segmented buses. Their partitioning schemes emerge from employing a Data Transfer and Storage Exploration methodology, for system level memory management. Hence, the segmentation/partitioning issues are not the focus of their study.

Srinivasan et al. [19] give a method for minimizing the power consumption of their segmented bus platform. They (as do we) have different operating frequencies at each bus segment. The cited study, however, does not offer a clear description of the practical implementation issues, and of the architectural features of the platform.

Lahiri et al. [20] discuss the impact of communication protocols on the optimal segmentation problem. Their segmented bus architecture is memoryless. The approach introduces a simulation-based trace extraction, which is used to indicate the communication patterns in processing.

Current Study Approach
In comparison to the above research efforts, our problem setting is different in several aspects. Some of them are depicted here as follows.
(i) The selection of FPGAs (versus ASIC [5, 6, 21], etc.) as the implementation technology imposes specific constraints related to the placement of devices on the platform. Strict localization of the clock domains is extremely important in FPGA implementations, due to the restrictions on routing global signals (such as clocks). Therefore, we use the “LogicLocks” feature of Altera design tools [22] in order to group together devices operating in the same clock domain. A tree-like structure would imply the adjacency of at least three such regions around a single border unit. Given the geometry of the regions and the restrictions on placement, this is most often hard (or even impossible) to implement. Hence, we restrict ourselves to a linear organization of bus segments (extensible to a circular arrangement); thus, we do not allow a tree-like segment organization.
(ii) Our objective is to maximize the parallelization and, at the same time, to minimize the frequency of inter-segment transactions, as opposed to minimizing the overall power consumed by the bus segments, as in [6, 21].
(iii) We do not fix (by a relaxation of the problem) the device topology but allow a free search for the order of the devices.

More generally, we recognize that the bus segmentation problem is clearly a combinatorial optimization problem. While in such problems methods like local search, simulated annealing, and genetic algorithms are typically the best ones, we omit the latter, since simulated annealing and local search methods are very natural options to apply for this particular problem.

The approach taken in [19] provides a range of frequencies that are coded into the details of the genetic algorithms developed to solve the allocation problem. In contrast, we take a more liberal view and do not restrict our models to a given range of frequencies. These will result in the process of selection for the functional modules (IPs) and must be selected to suit the application(s) at hand, being thus a later step in the design methodology.

Compared to [20], we consider a model where communication instances are not correlated, allowing for consideration of multiple application contexts.

3. Segmented Bus Architecture

A segmented bus is a bus which is partitioned into two or more segments. Each segment acts as a normal bus for the associated modules and operates in parallel with other segments. Neighboring segments can be dynamically linked to each other in order to establish a connection between modules located in different segments. In this case, all dynamically connected segments act as a single bus. The first step in the design is to organize a communication scheme that allows the components of a system to efficiently transfer data over the shared bus.

A bus-based system consists of three kinds of components (subsystems): masters, slaves, and arbiters. A master is a device that requests services from other devices, the slaves. Only one master at a time may transfer data on the bus, thus there is a need for arbitration. In a conventional single-bus approach, a master-slave connection reserves the whole bus, regardless of the relative placement of these devices. The SB approach allows a connection to reserve only a small portion of the bus, while other devices may use the remaining segments.

The SB platform is conceived as having a single central arbitration (CA) unit and local segment arbitration (SA) units. The SA decides which master within the segment will get access to the bus in the following transfer burst. If a specific master requires an inter-segment connection, the request is forwarded to the CA, which performs the same operation at the bus level, deciding which segments need to be dynamically connected to establish a link between the granted master and the target slave. Hence, the interface components between adjacent segments, the segment bridges (or border units), are controlled (opened and closed) by the CA; see Figure 1 for a high level diagram of the SB system.

Figure 1: The SB architecture.

Operations on a Segmented Bus
From a local arbitration standpoint, the operation on a specific segment may proceed in three modes. These depend on the location of the granted master and the target slave, taking a local arbitration unit as a reference point. Thus, we have (i) a local master-local slave situation, which means that the master and the slave are both situated in the same segment with the SA, (ii) a local master-external slave situation: only the granted master resides in the same segment as the SA, and (iii) an external master-local/external slave situation: only the target slave possibly resides in the same segment as the SA.

In all these situations, the master connects to the slave after a four-phase signaling protocol between the master and the corresponding SA has been executed. The SA also monitors the communication, by counting the number of data words being transferred from the master, in cases (i) and (ii) above.

In the case (ii), the master signals the request for another segment by correspondingly selecting the slave address. First lines of this address, which encode the target segment number, are also read by the SA which forwards the request to the CA, in order to obtain passage to the slave. While the master is waiting for the response from the CA, another master may obtain the bus control for an intra-segment local operation. Whenever the acknowledgment from the CA arrives, and the possible local operation has been completed, the SA passes the bus control to the requesting master which then accesses the remote target slave through a number of dynamically connected bus segments.

Notice that all the components in the SB implementation are mutually asynchronous devices. Therefore, communication between them follows rules posed by the applied handshake protocols that must consider also the necessary synchronization elements. A more detailed block description of segment components and signals is given in Figure 2, while the protocol and functional descriptions can be found elsewhere [8].

Figure 2: The segment control elements.

The performance speedup of the SB platform is based on the overlaps between local activities in different segments and between inter-segment transfers and local activities. Arbitration processing is not an issue from a time perspective, unless the SA or the CA were idling prior to a decision; otherwise, arbitration procedures also overlap with transaction activities.

4. Problem Statement

Consider a specific case of a bus with $n_s = 3$ segments and $n = 8$ devices, as in Figure 3. For example, a data transfer between $D_4$ and $D_6$ reserves segment 2 only. On the other hand, a transfer between $D_2$ and $D_8$ reserves all three segments. The traffic between devices is defined by a device-to-device communication matrix $C = (c_{i,j};\ 1 \le i, j \le n)$ giving the amount of data transfer requests per time unit between each device pair $(i, j)$; see Table 1. Denote the total traffic by $C_{\mathrm{sum}} = \sum_{i,j} c_{i,j}$.

Table 1: An example of a communication matrix $C$. Entry $c_{i,j}$ is the amount of data transfers per time unit from source $i$ to target $j$.
Figure 3: A segmented bus with 8 devices divided into 3 segments.

For each segment $k$ ($k = 1, 2, \ldots, n_s$) we can calculate the total amount of data transfers over that segment as the sum of transfers which have

(1) source and target device in segment $k$ ($t_{k,1}$),
(2) source in segment $k$, target elsewhere ($t_{k,2}$),
(3) target in segment $k$, source elsewhere ($t_{k,3}$), or
(4) source in segment $i$ and target in segment $j$, where $i < k < j$ or $i > k > j$ ($t_{k,4}$).

Here $t_{k,j}$ denotes the amount of data transfers per time unit in case $j = 1, \ldots, 4$. Figure 4 shows the different cases of data transfers for the 2nd segment in the case of 3 segments. In the figure, the numbers 1 to 4 refer to the indices $j$ of $t_{k,j}$.

Figure 4: Data transfers reserving the segment $k = 2$.

Let $T_k$ ($k = 1, 2, \ldots, n_s$) denote the sum of transfers for segment $k$ as defined above:

$$T_k = \sum_{j=1}^{4} t_{k,j}. \tag{1}$$

Suppose further that there are $n$ devices, $D_1, \ldots, D_n$, and let $A_i$ be the segment number ($1 \le A_i \le n_s$) to which device $i$ is allocated. Thus, in Figure 3 we have the device-segment allocation $A = (A_1, \ldots, A_8) = (1, 1, 2, 2, 1, 2, 3, 3)$.

We define the segment $k$ related traffic load (or simply cost) $T_k(A)$ for an allocation $A$ in terms of access frequencies $c_{i,j}$ ($1 \le i, j \le n$) as

$$\begin{aligned}
t_{k,1}(A) &= \sum_{A_i = A_j = k} c_{i,j}, \\
t_{k,2}(A) &= \sum_{A_i = k,\ A_j \ne k} c_{i,j}, \\
t_{k,3}(A) &= \sum_{A_i \ne k,\ A_j = k} c_{i,j}, \\
t_{k,4}(A) &= \sum_{A_i < k < A_j \ \text{or}\ A_i > k > A_j} c_{i,j}.
\end{aligned} \tag{2}$$

Problem 1 (multisegmented bus device allocation problem (MSDA)). Suppose that the frequencies of device-to-device communications are given by a matrix $C$. Denote by $T_k(A)$, as calculated by (1) and (2), the sum of data transfers for segment $k$ with the device-to-segment allocation $A = (A_1, A_2, \ldots, A_n)$. The cost of allocation $A$ is

$$T(A) = \max_{1 \le k \le n_s} T_k(A). \tag{3}$$

In the MSDA problem we want to find, for a fixed number of segments $n_s$, a segment allocation $A^{*}$ for which the largest sum of data transfer operations of any segment (i.e., the cost) is minimal:

$$T(A^{*}) = \min_{A} T(A). \tag{4}$$

The allocation in Figure 3, for the example in Table 1, is a solution for (4) giving $T(A) = 489$.
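The definitions (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not the SBTool implementation: the function names and the 4-device matrix are hypothetical (the data of Table 1 is not reproduced here), and the sketch relies on the observation that a transfer from device $i$ to device $j$ reserves every segment between $A_i$ and $A_j$ inclusive, which covers the four cases of (2) at once.

```python
from typing import List, Sequence


def segment_loads(C: Sequence[Sequence[int]], A: Sequence[int], ns: int) -> List[int]:
    """Return [T_1, ..., T_ns] for allocation A (segments numbered 1..ns)."""
    loads = [0] * ns
    n = len(A)
    for i in range(n):
        for j in range(n):
            lo, hi = min(A[i], A[j]), max(A[i], A[j])
            # The transfer i -> j occupies segments lo..hi, i.e. it is counted
            # in t_{k,1}, t_{k,2}, t_{k,3} or t_{k,4} for exactly these k.
            for k in range(lo, hi + 1):
                loads[k - 1] += C[i][j]
    return loads


def cost(C: Sequence[Sequence[int]], A: Sequence[int], ns: int) -> int:
    """T(A) of (3): the load of the busiest segment."""
    return max(segment_loads(C, A, ns))


# Hypothetical traffic: devices 0,1 talk mostly to each other, as do devices 2,3.
C = [
    [0, 10, 1, 0],
    [10, 0, 0, 2],
    [1, 0, 0, 20],
    [0, 2, 20, 0],
]
print(segment_loads(C, (1, 1, 2, 2), 2))  # loads of the two segments
print(cost(C, (1, 1, 1, 1), 1))           # unsegmented bus: all traffic on one segment
```

On this toy matrix, grouping the two tightly coupled device pairs into separate segments lowers the cost below that of the unsegmented bus.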

Segment Traffic Load
Previously, we expressed the traffic load in terms of interdevice communications. This made the formulae dependent on the allocation of devices to segments. We get a simple form of the traffic load of each segment, if we suppose that the device-to-segment allocation is given by the vector $A$. We can then calculate, from $A$ and the device-to-device communication matrix $C$, a segment traffic load matrix $Q$ consisting of elements $q_{ij}$ ($1 \le i, j \le n_s$):

$$q_{ij} = \sum_{\substack{A_k = i,\ A_l = j \\ 1 \le k, l \le n}} c_{k,l}. \tag{5}$$

This gives the traffic load of the segment $k$ as

$$T_k = \sum_{i=1}^{k} \sum_{j=k}^{n_s} q_{ij} + \sum_{i=k}^{n_s} \sum_{j=1}^{k} q_{ij} - q_{kk} = \sum_{i=1}^{k} \sum_{j=k}^{n_s} \left(q_{ij} + q_{ji}\right) - q_{kk}. \tag{6}$$

The term $q_{kk}$ is subtracted in the above formula because it would otherwise be counted twice in the sum expression.
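As a small check of (5) and (6), the sketch below aggregates a device-level matrix into $Q$ and evaluates the segment loads from $Q$ alone. The function names and the 4-device example are assumptions made for illustration only.

```python
def segment_matrix(C, A, ns):
    """Q of (5): q[i][j] sums c[k][l] over devices k in segment i+1, l in segment j+1."""
    Q = [[0] * ns for _ in range(ns)]
    for k, row in enumerate(C):
        for l, c in enumerate(row):
            Q[A[k] - 1][A[l] - 1] += c
    return Q


def loads_from_Q(Q):
    """T_k of (6) for every segment, using 0-based indices internally."""
    ns = len(Q)
    loads = []
    for k in range(ns):
        total = sum(Q[i][j] + Q[j][i] for i in range(k + 1) for j in range(k, ns))
        loads.append(total - Q[k][k])  # q_kk was added twice in the double sum
    return loads


# Hypothetical 4-device traffic matrix and a 2-segment allocation.
C = [
    [0, 10, 1, 0],
    [10, 0, 0, 2],
    [1, 0, 0, 20],
    [0, 2, 20, 0],
]
A = (1, 1, 2, 2)
Q = segment_matrix(C, A, 2)
print(Q)
print(loads_from_Q(Q))
```

Working on $Q$ is convenient when the allocation is fixed and only the segment-level picture matters, since $Q$ has $n_s^2$ entries rather than $n^2$.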

Example 1. In order to understand the effect of segmentation on the traffic load, we temporarily make the simplifying assumption $q_{ij} = v$ (constant) for all $i, j$. This means that all segment pairs communicate with the same frequency (consider an extreme case where each segment consists of only one device and all device pairs communicate uniformly). This case helps us to observe how much the segmentation as such can improve (or worsen) the situation. We then have

$$T_k = \sum_{i=1}^{k} \sum_{j=k}^{n_s} 2v - v = 2vk\,(n_s - k + 1) - v = v\left(2kn_s - 2k^2 + 2k - 1\right). \tag{7}$$

Because traffic between two segments $S_i$ and $S_j$ (assume $i < j$) has to pass the segments between these two ($S_{i+1}, \ldots, S_{j-1}$), the total traffic load becomes larger in the middlemost segment(s).

It is interesting to note that the traffic load of the middlemost segment (assume $n_s$ is even) is

$$T_{n_s/2} \approx 2v \cdot \frac{n_s}{2} \cdot \frac{n_s}{2} - v = v\left(\frac{n_s^2}{2} - 1\right). \tag{8}$$

This indicates that, for a fixed $v$, the load of the middlemost segment increases with the square of $n_s$. However, when the overall traffic load $X = \sum_{i,j} q_{ij}$ is constant, then $v(n_s) = X / n_s^2$, since there are $n_s^2$ different segment-to-segment routes in the bus (direction and self-routing are considered). In the limit,

$$\lim_{n_s \to \infty} T_{n_s/2} = \lim_{n_s \to \infty} \frac{X}{n_s^2}\left(2 \cdot \frac{n_s}{2}\left(\frac{n_s}{2} + 1\right) - 1\right) = \frac{X}{2}. \tag{9}$$

In other words, half of the traffic crosses over the middlemost segment in such an extreme (bad) case. In the same way we observe that

$$\lim_{n_s \to \infty} T_1 = \lim_{n_s \to \infty} T_{n_s} = 0. \tag{10}$$
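The closed form of the uniform case and its limiting behaviour can be checked numerically. The sketch below is illustrative only; the helper names are hypothetical, and exact rational arithmetic is used so that no rounding enters the comparison.

```python
from fractions import Fraction


def load_uniform(k, ns, v):
    """T_k from the double sum in (6) when every q_ij equals v."""
    return sum(2 * v for _i in range(1, k + 1) for _j in range(k, ns + 1)) - v


def load_closed_form(k, ns, v):
    """The closed form v(2 k n_s - 2 k^2 + 2 k - 1)."""
    return v * (2 * k * ns - 2 * k * k + 2 * k - 1)


def middle_share(ns):
    """T_{ns/2} / X for a fixed total load X, i.e. with v = X / ns^2 (ns even)."""
    return Fraction(load_closed_form(ns // 2, ns, 1), ns * ns)


def edge_share(ns):
    """T_1 / X for a fixed total load X."""
    return Fraction(load_closed_form(1, ns, 1), ns * ns)


# The double sum and the closed form agree for every segment index.
for ns in (2, 4, 8, 16):
    for k in range(1, ns + 1):
        assert load_uniform(k, ns, 3) == load_closed_form(k, ns, 3)

for ns in (2, 8, 64, 1024):
    print(ns, float(middle_share(ns)), float(edge_share(ns)))
```

As $n_s$ grows, the share of the middlemost segment approaches one half from above while the share of an end segment tends to zero, in line with the two limits stated above.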

Now consider three cases for $n_s$: (a) $n_s = 1$, (b) $n_s = 2$, and (c) $n_s = n$. Assume that all segments have an equal number $n / n_s$ of devices, and there is a fixed traffic $c_{i,j} = v$ between all devices. In case (a), the whole traffic of load $n^2 v$ happens in one segment. In case (b), the traffic load within each segment is $(n/2)(n/2)v$, and the traffic load crossing the segment border is $n(n/2)v$. Thus in case (b) the traffic load of each segment, $(3/4)n^2 v$, is 75% of that in case (a). In case (c) each node has its own segment, and the traffic load of the middlemost segment is approximately $2(n/2)(n/2)v = n^2 v / 2$. Thus, for even traffic patterns, segmenting the bus can decrease the traffic load by at most 50%, and in the case $n_s = 2$ by 25%. Notice that for nonuniform traffic patterns the benefits can be much greater.

5. Algorithms for Solving Segmentation

Next, we propose algorithms for solving the MSDA Problem 1. In Section 5.1, we prove that solving (4) optimally is an NP-hard problem. Thus, we are forced to look for heuristics for the problem. Such solutions are considered in Section 5.2. The algorithms described in the following paragraphs form the basis for the development of SBTool, a command line application designed to solve problems related to allocation and segmentation for the SB platform.

5.1. NP Completeness

The proof of the next theorem is based on a reduction from the Integer Partition problem, which is known to be NP complete [23].

Problem 2 (Integer Partition Problem). Given a set of $n$ integers, $a_1, a_2, \ldots, a_n$, partition them into two subsets such that the sums of the subsets are equal.

Theorem 1. Bus segmentation Problem 1 is NP hard.

Sketch of Proof
The reduction from a given Integer Partition problem to the bus segmentation problem is done so that for each integer $a_i$, $1 \le i \le n$, we form nodes $S_i$ and $T_i$, define that node $S_i$ wants to make $a_i$ requests to $T_i$, set the number of bus segments to be two, and set $L_0 = \frac{1}{2} \sum_{i=1}^{n} a_i$. (To be exact, here, one should consider the decision version of the bus segmentation problem. A predefined limit $L_0$ is given in this problem, and it is asked whether an allocation can be found such that $\max_k T_k \le L_0$.) Now, suppose that there exists an algorithm solving our Problem 1 optimally. An optimal placement clearly is such that $S_i$-$T_i$ pairs are located in the same segment, and there is no cross-traffic between the segments. Moreover, the cost of an optimal solution is as close to half of the sum of the total traffic as possible. If there is a solution for Problem 2, then an optimal solution for Problem 1 is such a solution. Thus, an optimal solution straightforwardly gives a solution to the Integer Partition problem, too. Since the reduction can be done in polynomial time, Problem 1 is NP hard.

To determine the NP completeness of the decision version of the MSDA problem, it is sufficient to notice that its decision version belongs to NP.
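The construction used in the proof sketch can be made concrete on tiny inputs. The following Python sketch builds the two-segment MSDA instance from a list of integers and answers the decision question by brute force; it is exponential in the number of devices and is meant only to illustrate the reduction. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def cost(C, A, ns):
    """T(A): the largest segment load (segments numbered 1..ns)."""
    loads = [0] * ns
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            for k in range(min(A[i], A[j]), max(A[i], A[j]) + 1):
                loads[k - 1] += c
    return max(loads)


def partition_to_msda(a):
    """Devices 0..n-1 play the role of S_i, devices n..2n-1 the role of T_i."""
    n = len(a)
    C = [[0] * (2 * n) for _ in range(2 * n)]
    for i, ai in enumerate(a):
        C[i][n + i] = ai  # S_i makes a_i requests to T_i
    return C, 2, Fraction(sum(a), 2)  # matrix, number of segments, limit L_0


def has_allocation_within(C, ns, limit):
    """Decision version: is there an allocation A with max_k T_k(A) <= limit?"""
    n = len(C)
    return any(cost(C, A, ns) <= limit
               for A in product(range(1, ns + 1), repeat=n))


for a in ([1, 5, 6], [3, 1, 1, 2, 2, 1], [1, 1, 3]):
    C, ns, L0 = partition_to_msda(a)
    print(a, has_allocation_within(C, ns, L0))
```

An instance answers "yes" exactly when the integers can be split into two subsets of equal sum, since any cross-segment request is charged to both segments and pushes the larger load above $L_0$.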

5.2. Heuristic Solutions

Since solving the Problem 1 optimally is NP hard, we look for efficient heuristic solutions. The proposed heuristics start with a random initial device-to-segments allocation set by:

(i) InitRandomly: random initial order of devices and randomly set segment borders (code not shown).

5.2.1. Greedy Local Search Methods

Algorithm 1 is a basic greedy local search algorithm for solving Problem 1. Besides the device-to-device communication matrix $C$ and the number of segments $n_s$, it receives as its parameters the iteration bound $b$, a method InitFunc to give the initial setting, and a method ModifyFunc to generate a new allocation. New allocations are generated until $b$ nonimproving allocations have been generated in sequence; an allocation that improves the current setting replaces it. Algorithm 1 returns the final device-to-segments mapping.

Algorithm 1: Greedy local search with iteration bound.
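Since the pseudocode itself is not reproduced here, the following Python sketch shows one plausible reading of the greedy local search with iteration bound. It is an assumption-laden illustration rather than the authors' Algorithm 1: the function names, the way InitRandomly places borders (every segment non-empty), the move operation used as ModifyFunc, and the toy matrix are all hypothetical.

```python
import random
from bisect import bisect_right


def cost(C, A, ns):
    """T(A): the largest segment load (segments numbered 1..ns)."""
    loads = [0] * ns
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            for k in range(min(A[i], A[j]), max(A[i], A[j]) + 1):
                loads[k - 1] += c
    return max(loads)


def init_randomly(n, ns, rng):
    """Random device order plus ns - 1 random borders (assumes n >= ns)."""
    order = list(range(n))
    rng.shuffle(order)
    borders = sorted(rng.sample(range(1, n), ns - 1))  # cut positions in the order
    A = [0] * n
    for pos, dev in enumerate(order):
        A[dev] = 1 + bisect_right(borders, pos)  # number of borders at or before pos
    return A


def move_randomly(A, ns, rng):
    """Candidate allocation: one random device goes to a random segment."""
    B = list(A)
    B[rng.randrange(len(B))] = rng.randint(1, ns)
    return B


def greedy_local_search(C, ns, b, init_func, modify_func, rng):
    """Accept improving candidates; stop after b consecutive non-improving ones."""
    A = init_func(len(C), ns, rng)
    best = cost(C, A, ns)
    misses = 0
    while misses < b:
        cand = modify_func(A, ns, rng)
        cand_cost = cost(C, cand, ns)
        if cand_cost < best:
            A, best, misses = cand, cand_cost, 0
        else:
            misses += 1
    return A, best


C = [[0, 10, 1, 0], [10, 0, 0, 2], [1, 0, 0, 20], [0, 2, 20, 0]]
A, best = greedy_local_search(C, 2, 100, init_randomly, move_randomly, random.Random(1))
print(A, best)
```

Restarting the search from several random initial allocations and keeping the best result corresponds to the number-of-attempts parameter used in the experiments.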

Algorithm SB-Local-Exhaustive-Search (Local exhaustive search) is similar to Algorithm 1. The only difference is that it tries all possible allocations that can be generated from the current setting by using ModifyFunc, and the best of those is chosen, if it is better than the original allocation. The current allocation is modified in that way as long as a better allocation is found. A potential problem with SB-Local-Exhaustive-Search is that the number of possible allocations can be too large to be checked. This is the case, when 𝑛 and 𝑛 𝑠 are large and/or ModifyFunc includes many elementary operations to derive new allocations. The pseudocode of Algorithm SB-Local-Exhaustive-Search (omitted) is an obvious modification of Algorithm 1.
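A best-improvement variant along these lines might look as follows. This is a hedged sketch, not the omitted pseudocode: the neighbourhood is assumed to consist of all single moves and all single swaps, and the function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def cost(C, A, ns):
    """T(A): the largest segment load (segments numbered 1..ns)."""
    loads = [0] * ns
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            for k in range(min(A[i], A[j]), max(A[i], A[j]) + 1):
                loads[k - 1] += c
    return max(loads)


def neighbours(A, ns):
    """Every allocation reachable from A by one move or one swap."""
    n = len(A)
    for i in range(n):
        for s in range(1, ns + 1):
            if s != A[i]:
                B = list(A)
                B[i] = s
                yield B
    for i, j in combinations(range(n), 2):
        if A[i] != A[j]:
            B = list(A)
            B[i], B[j] = B[j], B[i]
            yield B


def local_exhaustive_search(C, ns, start):
    """Repeatedly jump to the best neighbour while it improves the cost."""
    A, best = list(start), cost(C, start, ns)
    while True:
        cand = min(neighbours(A, ns), key=lambda B: cost(C, B, ns), default=None)
        if cand is None or cost(C, cand, ns) >= best:
            return A, best
        A, best = cand, cost(C, cand, ns)


C = [[0, 10, 1, 0], [10, 0, 0, 2], [1, 0, 0, 20], [0, 2, 20, 0]]
print(local_exhaustive_search(C, 2, [1, 2, 1, 2]))
```

Each round evaluates on the order of $n \cdot n_s + n^2$ candidates in this neighbourhood, which illustrates why the exhaustive variant becomes expensive for large $n$ and $n_s$ or for richer modification operations.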

Algorithms SB-Greedy-Local-Search and SB-Local-Exhaustive-Search calculate the goodness of the current setting by Algorithm 2, which simply evaluates the objective function of (3) from the segment loads $T_k(A)$.

Algorithm 2: Goodness function.

5.2.2. Algorithms for Generating the Next Allocation

Swapping Devices Randomly
Algorithm Swap-Randomly picks two devices at random and swaps their places on the bus. Observe that swapping does not change the number of devices allocated for each segment, and thus the goodness of this method highly depends on how well the segment borders have been set initially.

Moving a Device Randomly to Another Segment
Algorithm Move-Randomly moves a randomly chosen device to a randomly chosen segment. Observe that a swap consists of two move operations, and thus in principle Move-Randomly could be used in local search methods instead of Swap-Randomly. In practice, there can be situations where a swap improves the cost whereas no single move operation does.

Random Swaps and/or Moves
Algorithm Swaps-Moves-Randomly performs a sequence of $x$ random swap/move operations for a given device-to-segment allocation. The type of operation (swap or move) is chosen randomly with equal probability in each iteration round. In our experiments, we use Swaps-Moves-Randomly1, which performs a single random swap or move.
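The three modification operations can be sketched as small pure functions on an allocation vector. The signatures below are hypothetical, and the sketch makes no attempt to keep segments non-empty after a move, which the text does not specify.

```python
import random


def swap_randomly(A, rng):
    """Exchange the segments of two randomly chosen devices."""
    B = list(A)
    i, j = rng.sample(range(len(B)), 2)
    B[i], B[j] = B[j], B[i]
    return B


def move_randomly(A, ns, rng):
    """Move one randomly chosen device to a randomly chosen segment (1..ns)."""
    B = list(A)
    B[rng.randrange(len(B))] = rng.randint(1, ns)
    return B


def swaps_moves_randomly(A, ns, x, rng):
    """Apply x operations, each a swap or a move with equal probability."""
    B = list(A)
    for _ in range(x):
        if rng.random() < 0.5:
            B = swap_randomly(B, rng)
        else:
            B = move_randomly(B, ns, rng)
    return B


rng = random.Random(0)
A = [1, 1, 2, 3]
print(swap_randomly(A, rng))
print(move_randomly(A, 3, rng))
print(swaps_moves_randomly(A, 3, 1, rng))  # x = 1: a single random swap or move
```

A swap leaves the number of devices per segment unchanged, whereas a move changes two segment sizes at once; combining both lets the search adjust the segment borders as well as the device order.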

6. Experimental Results

In Section 6.1 we study the goodness of the proposed heuristic algorithms by measuring how quickly the algorithms will find the global optimum. As the problem space is huge, two rather small sample problems are used, and the exhaustive search method is used to find the global optima for the two problems.

In Sections 6.2 and 6.3 we apply the approach defined in the previous sections to two other examples. The first one is based on a synthetic communication matrix, and the second one analyzes the specification of a (simplified) stereo mp3 decoder (layer III) [24]. The first example, while not being concrete, explores a large problem space. On the other hand, the concrete application offers the opportunity to test our methodology on a real example, albeit with a less complex communication matrix. In both situations (Sections 6.2 and 6.3), we employed the “LogicLocks” feature of Altera design tools [22] for “locking” together devices operating in the same clock domain. Manual placement of such structures may be required, for placing blocks on the same hierarchical level close to each other, when necessary. This helps provide the best solutions for clock signal distribution.

6.1. Evaluation of Algorithms

Experiments are made with 3 heuristic methods.

(i) $\mathit{LocalExhaustive}_1$
SB-Local-Exhaustive-Search is applied with the procedures InitRandomly and Swaps-Moves-Randomly1. This means that the algorithm studies all neighboring points of the current search space point (solution) and advances to the one giving the biggest gain. The algorithm has an additional parameter, the number of attempts, $\#a$, which gives the number of randomly chosen starting points. In the experiments, $\#a = 50$ unless stated otherwise.

(ii) $\mathit{LocalGreedy}_M$
Algorithm 1 is applied with the procedures InitRandomly and Move-Randomly. The parameter $b$ (maximal number of consecutive nonimproving search space positions) has value 1000 in the experiments unless stated otherwise. The parameter $\#a$ has value 50.

(iii) $\mathit{LocalGreedy}_{M,S}$
This algorithm is the same as $\mathit{LocalGreedy}_M$ but now Swaps-Moves-Randomly1 is used instead of Move-Randomly. Again, $\#a$ is applied.

The test problems case-1 and case-2 (Tables 2 and 3) are so small that they can be solved optimally with an exhaustive search method; see Tables 4 and 5 for results with different $n_s$ values. Because the search is exhaustive, the results are the optimal $T(A)$ values of (4). Without segmentation, in both cases the communication cost $T$ would be 100.

Table 2: Communication matrix $\mathbf{C}$ of test case-1 with $n = 6$.
Table 3: Communication matrix $\mathbf{C}$ of test case-2 with $n = 8$.
Table 4: Optimal solutions for case-1 (symbol “ ” marks segment border).
Table 5: Optimal solutions for case-2.

In theory, $\mathit{LocalExhaustive1}$ also finds the optimal solution in all cases, given that enough randomly chosen starting points ($\#a$) are used. For case-1, we made one set of experiments with a randomly chosen seed that yields a random sequence of starting positions. Optimal results were then achieved for $n_s = 2, \ldots, 6$ after 7, 1, 13, 24, and 67 attempts, respectively. For case-2 and $n_s = 2, \ldots, 8$, the optimal solution was achieved after 2, 5, 3, 3, 45, 11, and 82 attempts, respectively. Since the number of possible starting positions is huge (approximately $\binom{n + n_s}{n_s}$; see the rightmost column of Table 5), it is notable that only a modest number of attempts is needed to reach the global optimum. For example, when $n = 8$ and $n_s = 6$, our exhaustive search studies 191520 allocations for case-2, whereas $\#a = 45$ random starting points, covering 2295 allocations in total, were enough for $\mathit{LocalExhaustive1}$. For $n_s = 7$ and $\#a = 11$, it was sufficient to evaluate 275 allocations (out of 141120 possible different allocations) to find the global optimum.
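The two allocation counts quoted above (191520 and 141120) coincide with the number of ways to distribute $n$ distinguishable devices over $n_s$ ordered segments so that no segment stays empty. That reading is our assumption, since the text does not state how the exhaustive search enumerates allocations. The function name `count_allocations` is hypothetical. A short check by inclusion-exclusion:

```python
from math import comb

def count_allocations(n_devices, n_segments):
    """Number of assignments of n_devices distinguishable devices to
    n_segments ordered segments with no empty segment
    (inclusion-exclusion over the number of segments left empty)."""
    return sum(
        (-1) ** k * comb(n_segments, k) * (n_segments - k) ** n_devices
        for k in range(n_segments + 1)
    )

print(count_allocations(8, 6))  # 191520
print(count_allocations(8, 7))  # 141120
```

Under this reading, the heuristic evaluated about 1.2% (2295 of 191520) and about 0.2% (275 of 141120) of the respective search spaces before reaching the optimum.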

Similar observations can be made for $\mathit{LocalGreedyM}$ and $\mathit{LocalGreedyM,S}$. Table 6 gives some values for $b$ and $\#a$ that yield an optimal result. The number of evaluated allocations is given in the column marked with $\#s$. The results in the table reflect only one experiment. The main observation remains the same: modest values for $b$ and $\#a$ (yielding modest total numbers of studied allocations) allow the heuristics to find the global optimum.

Table 6: Situations where heuristic methods produced optimal solutions for case-2.

6.2. Simulation Results for a Rather Large Synthetic Example

Consider a situation (case-3) where there are 16 devices ($D_0, \ldots, D_{15}$) and the communication matrix $C$ is as shown in Table 7. The first column identifies the masters and the first row the slaves. The master takes care of requesting access to the bus in order to send data as specified by the communication matrix, while the slaves receive data from masters.

Table 7: Test case-3 with $n = 16$.

We solved the segmentation problem of case-3 by the exhaustive search and the $\mathit{LocalGreedyM,S}$ algorithm; see Table 8 for results with 2 to 8 segments. In cases $n_s = 2, \ldots, 4$ (exhaustive search), the result is globally optimal. In cases $n_s = 5, \ldots, 8$, the heuristic method was applied. The parameters (the iteration bound $b = 2000, \ldots, 3000$ and the number of random starting positions for searching $\#a = 3000$) were set so that computations took approximately one minute. During that time, the algorithm typically evaluated approximately $10^7$ (different) device-to-segment allocations. For cases $n_s = 2, \ldots, 4$, the heuristics also found a global optimum.

Table 8: Solutions for case-3; “*” = optimal solutions, “ ” = segment borders.

In order to observe the effect of the bus segmentation on the performance factors, we implemented the 3-segment solution of Table 8. The 3-segment solution is one of the best (Table 8), and the complexity of the implementation is not too demanding. Then, we compared the simulation output with a similar implementation on a single bus platform. In the next lines, we describe the setup for the simulation system.

System Model—The Segmented Bus
We can characterize a segment by the amount of data it has to send locally, or externally to some of the other segments.
For the three-segment architecture (Table 8), master devices send data (1) locally, (2) externally to one of the other segments, and (3) to the other one. The data to be transferred is generated by a counter associated with each of the masters. For a model of this system, see the upper part of Figure 5.

Figure 5: Simulation model for the three-segment (above) and single (below) bus architectures.

System Model—The Nonsegmented Bus
The corresponding “single-bus” model is represented in the lower part of Figure 5. In order to preserve the relative size of the implementation (for future studies referring to power consumption evaluation, for instance), the system contains the same number of devices as in the segmented bus approach. Hence, even though we can only talk about local transfers, we still have nine masters and nine slaves.

Platform Parameters
The communication on the SB platform is built around a store-and-forward scheme. A data packet contains both the data provided by the master and information regarding the target address (slave ID) and source address (master ID) [8]. Thus, within the target segment, the respective slave identifies itself as the intended repository of the packet and identifies the device that sent the data, for possible further communication. In the current version of the platform, each of these IDs is stored in a different word at the beginning of the packet. Hence, each data packet has 2 additional locations, apart from the actual data load. The same packet format is specified for the single bus implementation, too. For the sample case of Figure 5, we let the packet size be 25 + 2 (data + address locations).
Regarding clock frequency, one has to specify four values: segment 0 runs at 91 MHz, segment 1 at 98 MHz, segment 2 at 89 MHz, while the central arbitration unit operates at a 90 MHz clock frequency. We assigned to the single bus clock the fastest of the above frequencies, 98 MHz. The frequency values have been assigned arbitrarily, but the highest one is the lowest which guarantees that clock and data signals are delivered to registers such that the required setup and hold times are met, given the selection of the FPGA device.
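The packet format described under Platform Parameters (two ID words followed by the data load) can be modeled in a few lines. This is only an illustrative sketch: the class name `Packet`, the helper `header_overhead`, and in particular the order of the two header words (target first, then source) are assumptions, since the text only states that each ID occupies its own word at the beginning of the packet.

```python
from dataclasses import dataclass
from typing import List

HEADER_WORDS = 2  # one word for the slave ID, one for the master ID

@dataclass
class Packet:
    slave_id: int       # target address
    master_id: int      # source address
    payload: List[int]  # actual data load, one entry per word

    def to_words(self) -> List[int]:
        # Header order (target, then source) is an assumption.
        return [self.slave_id, self.master_id] + list(self.payload)

    @classmethod
    def from_words(cls, words: List[int]) -> "Packet":
        return cls(words[0], words[1], list(words[2:]))

def header_overhead(data_words: int) -> float:
    """Fraction of a packet spent on the two address words."""
    return HEADER_WORDS / (data_words + HEADER_WORDS)

p = Packet(slave_id=3, master_id=1, payload=list(range(25)))
print(len(p.to_words()))              # 27
print(round(header_overhead(25), 3))  # 0.074
```

With the 25 + 2 packet of the sample case, roughly 7% of every packet is addressing information.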

Simulation Results
The whole system was simulated at the postsynthesis level in the Modelsim environment [25]. For the segmented bus solution, the results show a 26% increase in performance compared to the execution on the single bus implementation (2.23 milliseconds compared to 2.82 milliseconds, the time required for all the masters to send their data packets).
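The relation between the two measured times and the quoted percentage can be checked with a short calculation. The function names `speedup` and `time_reduction` are hypothetical; the sketch only shows that the 26% figure corresponds to the ratio of the two execution times, while the execution time itself shrinks by about 21%.

```python
def speedup(t_single, t_segmented):
    """Ratio of single-bus to segmented-bus execution time."""
    return t_single / t_segmented

def time_reduction(t_single, t_segmented):
    """Fraction of the single-bus execution time that is saved."""
    return (t_single - t_segmented) / t_single

# Times in milliseconds, as reported for the case-3 simulation.
print(round(speedup(2.82, 2.23), 3))         # 1.265
print(round(time_reduction(2.82, 2.23), 3))  # 0.209
```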

6.3. MP3 Decoder Example

Next, we illustrate the application of the device-to-segment allocation algorithm on an actual application model but we abstract from the details of the arbitration schemes and the implementation of the actual devices.

We have selected a (simplified) stereo MP3 decoder (layer III) [24] to exemplify our allocation algorithms. The application is well suited for packet-based communication, with interleaved communication and processing times. We remind the reader that our research task here is to assess the impact of using the SB platform on the execution time. Hence, we will not use actual figures and modules for the functional components of the MP3 example. We model these units as counters, running up to various limits such that various execution times are emulated.

The MP3 example specification is given in Figure 6. In brief, process $P_0$ represents frame decoding, $P_1$/$P_8$ scaling on the left/right channel, $P_2$/$P_9$ dequantizing left/right, and so on. The first component of a transition label between two processes specifies the number of packets to be transferred from source to destination, while the second component specifies the order in which traffic is organized. Based on this, programmes for both the SAs and the CA are conceived [26]. The communication matrix corresponding to the diagram in Figure 6 is illustrated in Table 9. The communication is here organized in packets of 36 data words plus 2 address words.

Table 9: The communication matrix $\mathbf{C}$ for the MP3 decoder example.
Figure 6: Application diagram for a (simplified) MP3 decoder.

We have run the allocation algorithm SB-Greedy-Local-Search for a setup of two to four segments in a linear topology. The costs ($T(A)$) associated with different settings of $n_s$ are given in Table 10. The results show a relatively large improvement in performance (around 40%) brought by a segmented bus platform, but also that the gain vanishes with an increasing number of segments. This is due to the highly unbalanced traffic requirements of the application: many of the processes do not exchange any data at all.

Table 10: Allocations and associated cost results for the MP3 example.

Simulation Results
Performance-wise, the simulation of the implemented example validates the results predicted by running the algorithm of the previous sections. Compared to a traditional bus solution (965982550 ps), segmentation gives an improvement of approximately 40% (681652710 ps).
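As a worked check of the two simulated execution times (our arithmetic, using only the figures quoted above), the ratio of the times and the relative saving are:

```latex
\[
\frac{965982550\ \text{ps}}{681652710\ \text{ps}} \approx 1.417,
\qquad
1 - \frac{681652710\ \text{ps}}{965982550\ \text{ps}} \approx 0.294 .
\]
```

The quoted improvement of approximately 40% thus matches the ratio of execution times (a speedup of about 1.42), while the execution time itself is reduced by about 29%.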

6.4. General Discussion

We have used the simulation settings described in Sections 6.2 and 6.3 in order to analyze the platform from several points of view. In these trials, we noticed the influence of the packet size (the upper bound of latency is computed based on the packet size in [8]), the performance worsening effect of balanced traffic, and the impact of various individual device processing times. We summarize the conclusions of these experiments as follows.

Algorithm versus Implementation Results: Example 6.2
The differences between the results of the example in Section 6.2 and the data specified in Table 8 originate from the fact that the introduced device-to-segment allocation algorithms analyze an ideal situation, where there is no inter-segment delivery latency. This is because we cannot ensure a fixed value for this latency but only a bound for it. Moreover, both the communication loads and the size of the data packets affect the performance more than the segment-to-segment delay [8]. However, these values are dictated by the application (as in Section 6.3) or by design decisions.
Similar simulation models, based on synthetic data generation, have been used by Lahiri et al. [20]. There, the counters considered in Section 6.2 are replaced by “stochastic traffic generators.” This kind of model may be considered a weakness of the analysis, as a specific application could be considered instead. However, the model we used brings us closer to a multiapplication environment, where packets coming from different applications are not related in precedence.

Algorithm versus Implementation Results: Example 6.3
The results offered in Section 6.3 are consistent with the data in Table 10. This is due to the existence of processing times, which reduce the importance of the communication overheads. In order to assess the impact of the device processing time on the performance figures, we have used synthetic values for the former. We noticed that, for processing times (counted in clock ticks) larger than the packet size, the improvements remain close to the figures offered by the algorithm. Dropping the processing times below the packet size threshold dramatically diminishes the advantages of the platform; for values less than half of the packet size, we actually worsen the overall execution time.
Considering the above, the inclusion of the processing time and of the packet size in future versions of the algorithm comes as a necessity.

Impact of Topology
A circular geometry of the system (segment “0” also connected to segment “$n - 1$”) offers a further improvement of performance. The resource allocation algorithm introduced here can easily be applied in this case, too. Simulation results indicate a further 10% improvement compared to the linear topology.
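The benefit of closing the chain of segments into a ring can be illustrated by comparing how many segment borders a transfer crosses in each topology. The sketch below is an assumption-laden illustration: the function names are hypothetical, and routing a transfer along the shorter direction of the ring is our assumption, as the text does not specify the routing policy for the circular case.

```python
def linear_hops(src, dst):
    """Segment borders crossed between two segments on a linear bus."""
    return abs(src - dst)

def ring_hops(src, dst, n_segments):
    """Segment borders crossed when the last segment also connects to
    segment 0 and the shorter direction is taken (assumed policy)."""
    d = abs(src - dst)
    return min(d, n_segments - d)

def ring_path(src, dst, n_segments):
    """Segments reserved for a transfer, walking the shorter way round."""
    d = (dst - src) % n_segments
    step = 1 if d <= n_segments - d else -1
    path = [src]
    while path[-1] != dst:
        path.append((path[-1] + step) % n_segments)
    return path

print(linear_hops(0, 3), ring_hops(0, 3, 4))  # 3 1
print(ring_path(0, 3, 4))                     # [0, 3]
```

A transfer between the two end segments, which reserves the whole bus in the linear case, occupies only two segments on the ring, leaving the remaining segments free for parallel local traffic.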

Power Consumption
At the moment, exact figures for power consumption are not available, mainly due to the lack of appropriate tools for dealing with multiple clock domains. Accurate estimations, such as the ones offered by Hsieh and Pedram [21], which describe a bus structure at transistor level, are difficult to propose, as here the analysis is done at higher design levels. The same applies when considering the work by Jone et al. [6]. Still, our communication-based metrics for system performance match the power-based metrics of [6]. Hence, within segment limits, our approach will also help decrease the power consumption. However, additional power is spent in SB due to the involvement of border FIFOs and of the CA, as we briefly discuss next.
The use of available tools (Modelsim, and Altera’s “PowerPlay”) allows only for a different setup for analysis, consisting of using a single clock signal for the SB implementation. The results showed a 2% increase of the power consumption in the case of the SB system. The respective figures are derived from the implementation of the border units, synchronizer, and CA modules, as the simulated platform contained all the elements necessary to run a multiple clock platform, in order to truly match the switching activity of the multiple clock implementation.
To deepen the analysis, we have “isolated” and run appropriate power consumption tests for the border units, in the context of the MP3 example. On the basis of these tests, we conclude that, while the static power consumed by one border unit is approximately 25% of the whole design, the figures of the dynamic power consumption are only up to 3% of the corresponding whole system values. One should remember that the rest of the design is composed of arbiters and counters—hardly energy hungry devices. Hence, the comparisons we obtained are actually quite promising. Consequently, when the design is instantiated with the actual functional devices, one may expect real benefits in power consumption aspects from the SB platform. This is due to the fact that the relative (to the whole system) static power consumed by the border units will decrease, while the dynamic part will remain the same, in actual figures (hence, also decreasing with respect to the whole system).

Furthermore, the experiment avoids capitalizing on one of the important advantages offered by the SB platform, that is, the employment of different clock domains. Given the improvement in performance, the frequencies on the SB can be lowered such that the same overall execution time figure is achieved. Accordingly, we may deduce an approximate overall reduction in power consumption of 20% for the example of Section 6.2 and of 35% for the example of Section 6.3. This approximation, however, refers to the dynamic power consumption only. The static power consumption will definitely be higher in the case of the SB, its relative value depending on the contribution of the border units to the overall system area.
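As a first-order reading of the frequency-lowering argument for the example of Section 6.2 (our assumption, not a model given in the text), suppose that dynamic power at a fixed supply voltage scales linearly with clock frequency and that execution time scales inversely with it. Slowing the segmented bus until it merely matches the single-bus execution time then gives:

```latex
\[
P_{\mathrm{dyn}} \propto \alpha\, C\, V^{2} f,
\qquad
\frac{f'}{f} = \frac{2.23\ \text{ms}}{2.82\ \text{ms}} \approx 0.79,
\qquad
1 - \frac{f'}{f} \approx 0.21 .
\]
```

This is in line with the reduction of approximately 20% quoted for that example. The sketch ignores the additional power drawn by the border units and the CA, as well as any supply voltage scaling.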

7. Conclusions

The problem of multisegment device allocation was considered from a general, implementation-independent point of view. Optimal location of devices on the bus segments was formalized as an organizational problem, where the objective was to minimize the maximal traffic load caused by the devices. This model supposes that there is an available control logic which connects as many bus segments as needed for a particular data transfer operation.

The problem was shown to be NP hard by a reduction from the set partitioning problem. Small problem instances can be solved by exhaustive enumeration of the various device-to-segment allocations, but the cost of this method explodes when the number of devices and segments grows. One must then search for suboptimal solutions. For these cases, a generic local search algorithm was proposed. The algorithm advances by local search operations performed in a greedy manner. Practical tests show that near optimal or even optimal solutions can be found heuristically in reasonable time.

Future Work
As shown by the practical implementation of the segmentation results, the expected performance figures are affected by the size of the data packets and by the processing time of individual devices. The resulting ideal solutions stand, however, as a basis for architectural selection, to be completed by application specific analysis. Application level issues were not within the scope of the present paper. The allocation algorithm and the simulations we performed offer a necessary support to the further development of the design methodology for the SB platform.

The SBTool is also subject to an extension process. Requirements regarding power consumption and area of the devices will be considered as design constraints by the tool. We are currently studying a further extension towards the analysis of NOC systems, as their structure and operation details come close to the SB platform.

Possibly one of the most urgent issues to be addressed by forthcoming research concerns the dynamic arbitration policies, as opposed to the current static solution. Such research will affect the way arbitration schemes are applied at both SA and CA levels.

Acknowledgments

The authors would like to thank the anonymous reviewers for their unusually careful and constructive comments which truly helped them to improve the quality of this paper. The research of the first author has been partially supported by the DOMES Project at the University of Turku (no. 8123518, 2008-2010), funded by the Academy of Finland (application no. 123518/2007).

References

  1. W. J. Dally and J. W. Poulton, Digital System Engineering, Cambridge University Press, Cambridge, UK, 1998.
  2. C. Katsinis, “A segmented-shared-bus multicomputer architecture,” in Proceedings of the 9th International Conference on Parallel and Distributed Computing and Systems (PDCS '97), Washington, DC, USA, October 1997.
  3. R. Krishnamurti and E. Ma, “An approximation algorithm for scheduling tasks on varying partition sizes in partitionable multiprocessor systems,” IEEE Transactions on Computers, vol. 41, no. 12, pp. 1572–1579, 1992.
  4. C.-H. Yeh and B. Parhami, “Design of high-performance massively parallel architectures under pin limitations and non-uniform propagation delay,” in Proceedings of the 2nd Aizu International Symposium on Parallel Algorithms/Architecture Synthesis (AISPAS '97), pp. 58–65, Aizu-Wakamatsu, Japan, March 1997.
  5. C. Ewering, “Automatic high level synthesis of partitioned busses,” in Proceedings of IEEE International Conference on Computer-Aided Design (ICCAD '90), pp. 304–307, Santa Clara, Calif, USA, November 1990.
  6. W.-B. Jone, J. S. Wang, H.-I. Lu, I. P. Hsu, and J.-Y. Chen, “Design theory and implementation for low-power segmented bus systems,” ACM Transactions on Design Automation of Electronic Systems, vol. 8, no. 1, pp. 38–54, 2003.
  7. T. Seceleanu, J. Plosila, and P. Liljeberg, “On-chip segmented bus: a self timed approach,” in Proceedings of the 15th Annual IEEE International ASIC/SOC Conference, pp. 216–221, Rochester, NY, USA, September 2002.
  8. T. Seceleanu, “The SegBus platform—architecture and communication mechanisms,” Journal of Systems Architecture, vol. 53, no. 4, pp. 151–169, 2007.
  9. T. Seceleanu, T. Knuutila, and O. Nevalainen, “Starvation-free arbitration policies for the segmented-bus platform,” in Proceedings of International Symposium on Signals, Circuits and Systems (ISSCS '05), vol. 1, pp. 67–70, Iasi, Romania, July 2005.
  10. T. Seceleanu, S. Stancescu, and V. Lazarescu, “Distributed arbitration for the segmented-bus platform,” in Proceedings of International Symposium on Signals, Circuits and Systems (ISSCS '05), vol. 1, pp. 63–66, Iasi, Romania, July 2005.
  11. A. Jantsch and H. Tenhunen, Eds., Networks on Chip, Kluwer Academic Publishers, Hingham, Mass, USA, 2002.
  12. A. Jantsch and H. Tenhunen, “Will networks on chip close the productivity gap?,” in Networks on Chip, A. Jantsch and H. Tenhunen, Eds., pp. 3–18, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2002.
  13. J. Rexford and K. G. Shin, “Support for multiple classes of traffic in multicomputer routers,” in Proceedings of the 1st International Workshop on Parallel Computer Routing and Communication (PCRCW '94), vol. 853 of Lecture Notes In Computer Science, pp. 116–130, Springer, Seattle, Wash, USA, May 1994.
  14. P. Avasare, V. Nollet, J.-Y. Mignolet, D. Verkest, and H. Corporaal, “Centralized end-to-end flow control in a best-effort network-on-chip,” in Proceedings of the 5th ACM International Conference on Embedded Software (EMSOFT '05), pp. 17–20, Jersey City, NJ, USA, September 2005.
  15. J. Delorme and D. Houzet, “A complete 4G radio communication application mapping onto a 2D mesh NoC architecture,” in Proceedings of IEEE North-East Workshop on Circuits and Systems (NEWCAS '06), pp. 93–96, Gatineau, Canada, June 2006.
  16. H. G. Lee, U. Y. Ogras, R. Marculescu, and N. Chang, “Design space exploration and prototyping for on-chip multimedia applications,” in Proceedings of the 43rd Annual Conference on Design Automation (DAC '06), pp. 137–142, San Francisco, Calif, USA, July 2006.
  17. D. C. Pham, T. Aipperspach, D. Boerstler, et al., “Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor,” IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 179–196, 2006.
  18. H. Wang, A. Papanikolaou, M. Miranda, and F. Catthoor, “A global bus power optimization methodology for physical design of memory dominated systems by coupling bus segmentation and activity driven block placement,” in Proceedings of the Conference on Asia and South Pacific Design Automation (ASP-DAC '04), pp. 759–761, Yokohama, Japan, January 2004.
  19. S. Srinivasan, L. Li, and N. Vijaykrishnan, “Simultaneous partitioning and frequency assignment for on-chip bus architectures,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '05), vol. I, pp. 218–223, Munich, Germany, March 2005.
  20. K. Lahiri, A. Raghunathan, and S. Dey, “Design space exploration for optimizing on-chip communication architectures,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 6, pp. 952–961, 2004.
  21. C.-T. Hsieh and M. Pedram, “Architectural power optimization by bus splitting,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '00), pp. 612–616, Paris, France, March 2000.
  22. Altera Corporation, Quartus II Design Book, Altera, San Jose, Calif, USA, 2007.
  23. M. R. Garey and D. S. Johnson, Computers and Intractability, W.H. Freeman, San Francisco, Calif, USA, 1979.
  24. C. Park, J. Jung, and S. Ha, “Extended synchronous dataflow for efficient DSP system prototyping,” Design Automation for Embedded Systems, vol. 6, no. 3, pp. 295–322, 2002.
  25. ModelSim Simulator, http://www.model.com.
  26. D. Truscan, J. Lilius, T. Seceleanu, and H. Tenhunen, “A model-based design process for the SegBus distributed architecture,” in Proceedings of the 15th IEEE International Conference and Workshop on the Engineering of Computer-Based Systems (ECBS '08), pp. 307–316, Belfast, UK, March-April 2008.