Abstract

We propose new 3D 2-layer and 3-layer NoC architectures that utilize homogeneous regular mesh networks on a separate layer and one or two heterogeneous floorplanning layers. These architectures combine the benefits of compact heterogeneous floorplans and of regular mesh networks. To demonstrate these benefits, a design methodology that integrates floorplanning, routers assignment, and cycle-accurate NoC simulation is proposed. The implementation of the NoC on a separate layer offers an additional area that may be utilized to improve the network performance by increasing the number of virtual channels, buffers size, or mesh size. Experimental results show that increasing the number of virtual channels rather than the buffers size has a higher impact on network performance. Increasing the mesh size can significantly improve the network performance under the assumption that the clock frequency is given by the length of the physical links. In addition, the 3-layer architecture can offer significantly better network performance compared to the 2-layer architecture.

1. Introduction

3D integration is emerging as an attractive solution to the problem of increasing global interconnect delay in integrated systems [1-4]. The main advantage of 3D integration technologies is that the footprint area of the chip is smaller than in the 2D case. Because the connections between device layers can be realized with short, low-delay through-silicon vias (TSVs), the average interconnect delay is significantly shorter. However, 3D integration technologies face challenges related to thermal issues.

Network-on-Chip (NoC) represents a new design paradigm for increasingly complex Systems-on-Chip (SoC), and since the idea of routing packets instead of wires was proposed [5-7], it has grown into a rich research topic [8-10]. The NoC concept replaces design-specific global on-chip wires with a generic on-chip interconnection network realized by specialized routers that connect generic processing elements (PEs), such as processors, ASICs, FPGAs, and memories, to the network and facilitate communication between them. The benefits of NoC-based SoC design include scalability, predictability, and higher bandwidth with support for concurrent communications.

The scalability and predictability of NoCs enable designers to design increasingly complex systems, with large numbers of IP/cores and lower communication latencies for many applications. In such scenarios, where flexibility and predictability are primary concerns, homogeneous regular networks (see Figure 1(a)) are preferred. However, homogeneous NoC topologies have limitations in that communication locality is poorly supported, and the utilization of network resources is low. Moreover, designs with IP/cores with different sizes are not well suited to implementations based on regular mesh NoC topologies. Therefore, when area and performance are more important, application-specific heterogeneous irregular networks (see Figure 1(b)) are preferred. However, the design of these networks is more difficult and specialized routing algorithms are necessary to prevent deadlock.

Most of the previous works assumed equal area for all tiles. This assumption simplifies the design process due to the regularity of the mesh NoC topologies. However, assuming tiles with equal area in cases where IP/cores have different sizes is unrealistic. To address this problem for systems with heterogeneous floorplans and to exploit the benefits of 3D integration, in this paper, we propose new 3D NoC architectures. The proposed architectures utilize homogeneous networks on a separate layer and heterogeneous floorplans on different layers. Our objective is to combine the benefits of compact heterogeneous floorplans with those of regular homogeneous mesh networks.

2. Related Work

In this section, we discuss recent works on 3D NoCs and studies that utilize floorplanning information during the design and optimization of NoCs. The reader is referred to [11, 12] for recent surveys of NoCs. Nanophotonic and wireless NoCs have also been recently proposed as alternative solutions to 2D architectures. However, they pose several issues including scaling and integration of photonic devices and power dissipation of mm-wave transceivers [13].

Performance benefits of 3D NoC topologies were investigated analytically in [14] and experimentally in [15]. The transition from 2D to 3D NoC architectures is done by equally distributing tiles onto the device layers of the 3D architecture in [16, 17]. A different approach is to reduce the footprint of each tile by implementing the processing element and the router in a distributed fashion across layers [18]. By leveraging long wires to connect remote intralayer nodes, a low-diameter 3D network is studied in [19]. Efficient application-specific 3D NoC topology synthesis algorithms are studied in [20]. A novel layer-multiplexed 3D network architecture with vertical demultiplexing and multiplexing links is proposed in [21]. The scalability of 3D NoCs in terms of throughput, latency, and area overhead is studied in [22]. The per-flow worst-case communication performance in 2D and 3D regular mesh NoCs with four layers is investigated in [23]. The first demonstration of a fabricated 3D NoC is reported in [24]. Previous works focused on homogeneous regular NoCs and floorplans where all PEs have equal size. In this paper, the proposed architectures provide a design solution to applications where PEs are heterogeneous with different sizes. Hence, the proposed 3D NoC architectures represent an alternative solution to heterogeneous irregular NoC topologies.

Floorplanning information is used in the area-wirelength calculations from [25, 26] or during mapping [27]. A floorplanner is used to compute link power consumption and to detect timing violations in application-specific NoCs in [28]. The NoC synthesis approach from [28] was extended to designing custom 3D network topologies in [29]. Slicing floorplanning is used in the design methods for custom NoC topology synthesis studied in [30-32]. Frequently communicating modules are placed next to each other using a floorplanner in the network synthesis heuristic studied in [33]. A physical planner is used in [34] during topology design to reduce power consumption on wires. Previous works use floorplanning either for wirelength and power estimations or to find a single placement, which then remains fixed throughout the topology synthesis process. However, other floorplanning solutions may represent better starting placements for the synthesis process. To address this problem, the methodology proposed in this paper explores multiple floorplans to increase the chance of finding the optimal initial placement.

3. Contribution

In this paper, we propose novel 3D NoC architectures and implement an automated design-space exploration tool. Our main contributions can be summarized as follows.

(i) We propose and study two 3D NoC architectures (2-layer and 3-layer architectures) that utilize a homogeneous network on a separate layer and heterogeneous floorplans on different layers. In this way, the network regularity is maintained for flexibility and delay predictability while the IP/cores can have arbitrary sizes. This approach avoids design difficulties due to IP/core size irregularities that are typically addressed by specialized routing algorithms [35].

(ii) For the 2-layer architecture, we propose the use of a floorplanning and routers assignment-based design methodology for the placement of IP/cores on the first layer and the minimization of their connections to the NoC routers located on the second layer. In the case of the 3-layer architecture, the design methodology also includes a partitioning step. The floorplanning of the two partitions on layers 1 and 3 is done using a newly modified floorplanner capable of handling vertical constraints. One advantage of the separation between IP/cores and network is that once the best floorplan is found, one can improve the system performance by focusing on the network alone. The second layer has an additional available area that may be utilized to increase the number of routers or their complexity (e.g., increase the number of virtual channels and/or the buffers size). In addition, network interfaces (NIs), which are important components of NoC-based systems, also may be placed on the second layer, thereby reducing the footprint area of the floorplanning layers.

(iii) We implemented a versatile software framework to investigate the benefits of the proposed 3D architectures. It integrates an efficient B*Tree-based floorplanner with a cycle-accurate NoC simulator for maximum confidence in the experimental results.

Preliminary results on the 2-layer NoC architecture were reported in [36]. In this paper, we also propose the second 3-layer NoC architecture that aims at further reducing the footprint area of the chip and at improving the average flit latency. Due to the differences from the first proposed architecture, we also modify the design methodology by introducing an additional partitioning step and enhancing the floorplanning algorithm to handle vertical constraints.

4. 3D Architectures

4.1. 2-Layer Architecture

The 2-layer architecture has two device layers. The first layer is used entirely for the heterogeneous irregular IP/cores, while the second layer is dedicated to the homogeneous regular NoC (see Figure 2). This approach simplifies the design process in that it separates the floorplanning optimization from the network topology synthesis. The goal of the floorplanning step is to find the best floorplan with minimal white space. The second device layer accommodates the regular mesh network. In this way, the network regularity is maintained for flexibility and delay predictability, while the IP/cores can have arbitrary sizes. In addition, a simple packet routing algorithm can be used such as the deterministic XY routing.
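Deterministic XY routing on a regular mesh can be illustrated with a minimal sketch. The function name `xy_route` and the use of (x, y) grid coordinates for routers are assumptions made for illustration; this is not the simulator's code.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst (inclusive) under XY routing.

    src and dst are (x, y) router coordinates in the mesh (hypothetical convention).
    """
    x, y = src
    path = [(x, y)]
    # Travel along the X dimension first until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the Y dimension until the destination row is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Because every packet finishes its X hops before taking any Y hop, the route between a given source and destination is always the same, which is what makes the scheme deterministic and simple to implement on a regular mesh.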

4.2. 3-Layer Architecture

The 3-layer architecture has three device layers. The second layer is again dedicated to implementing the NoC, while layers 1 and 3 are used for IP/cores placement (see Figure 2). This architecture aims at reducing the footprint area of the chip, which in turn leads to shorter physical links, hence improving the network performance (overall average flit latency). In both proposed architectures, the vertical connections between IP/cores and their assigned routers are realized using through silicon vias (TSV). Routers connected to IP/cores have five ports, while the rest of the routers have only four ports.

One advantage of the proposed 3D NoC architectures is that the 3D fabrication will be simpler compared to 3D architectures with more than three layers [17], as the misalignment is only between two or three layers. In addition, the thermal management of fewer layers also will be simpler. Moreover, the extra space available on the second layer may be used to increase the number of routers or their complexity. The additional area may be utilized to implement fault/error tolerance techniques such as error-correcting codes to address crosstalk issues or could be allocated to additional wires to increase the bandwidth of physical links and therefore improve the overall network performance. Alternatively, the extra area also may be utilized to implement thermal monitoring and management schemes [37], to implement buffers for pipelining the physical links, or to incorporate reconfigurability capabilities [38].

5. Proposed Design Methodology

The proposed design methodology is presented in Figure 3. The input to the design flow is the application represented as a communication task graph (CTG) whose tasks have been mapped to floating IP/cores using existing mapping algorithms [39]. By floating it is meant that the location of these tasks is yet to be determined during the floorplanning step. In addition, the user can specify several control parameters, including the number of different floorplans to be explored, the number M of best floorplans recorded in the best M list and evaluated later using the integrated cycle-accurate simulator, and the mesh size. The main steps of the proposed design methodology are described in the following sections.

5.1. Partitioning of the CTG

This step is done only for the 3-layer architecture. Because the IP/cores are placed on two layers (1 and 3) in this case, we first partition the CTG of each application into two subgraphs (see Figure 4), which will be placed by the floorplanner in the next step. The bipartitioning is done such that the total area of the cores in each partition is balanced while the number of arcs cut is minimized (an arc represents a source-destination communication pair in the communication task graph). The two partitions have to be balanced to minimize the footprint area of the 3D chip, which is determined by the larger of the two partitions' accumulated block areas. Minimizing the cut size between the two partitions results in highly connected cores being placed on the same layer. This, as observed in our experiments, helps these cores to be floorplanned closer to each other, which in turn leads to better overall latencies.

For this step, we use the well-known hMetis partitioner [40, 41]. hMetis is a multilevel move-based partitioner, which can achieve balanced and minimum cutsize partitions very efficiently.
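The objective of this step (balanced area, minimum number of cut arcs) can be made concrete with a small exhaustive sketch. This is not hMetis and does not use its multilevel algorithm; it simply enumerates all splits of a small CTG. The function name `bipartition` and the `tolerance` parameter (allowed area imbalance as a fraction of the total area) are assumptions for illustration.

```python
from itertools import combinations

def bipartition(areas, arcs, tolerance=0.2):
    """Exhaustive balanced min-cut bipartition for small graphs.

    areas: dict core -> area; arcs: list of (src, dst) communication pairs.
    Returns (part_a, part_b, cut) or None if no split meets the balance bound.
    """
    cores = sorted(areas)
    total = sum(areas.values())
    first, rest = cores[0], cores[1:]  # pin one core to avoid mirrored duplicates
    best = None
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            part_a = {first, *extra}
            area_a = sum(areas[c] for c in part_a)
            # |area_a - area_b| must stay within the allowed imbalance.
            if abs(2 * area_a - total) > tolerance * total:
                continue
            cut = sum(1 for s, d in arcs if (s in part_a) != (d in part_a))
            if best is None or cut < best[2]:
                best = (part_a, set(cores) - part_a, cut)
    return best
```

The enumeration grows exponentially with the number of cores, which is why a multilevel heuristic partitioner is the practical choice for real testcases.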

5.2. Exploration of Floorplans and Recording of the Best Floorplans

The integrated floorplanner is based on the B*Tree representation from [42]. It employs a simulated annealing-based algorithm with a cost function that combines area and wirelength (the user can specify the weight, alpha, that balances the two). Connections between cores are weighted by the communication volume (available from the CTG), so that the resulting floorplanning solution first minimizes the connections with higher communication volume.
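A cost function of this kind can be sketched as follows. The exact form used by the floorplanner is not given in the text; the linear combination alpha * area + (1 - alpha) * wirelength, the absence of normalization, the center-to-center Manhattan wirelength, and the name `floorplan_cost` are all assumptions for illustration.

```python
def floorplan_cost(blocks, nets, alpha=0.25):
    """Weighted area/wirelength cost of a placed floorplan (illustrative form).

    blocks: dict name -> (x, y, width, height) of each placed core.
    nets: list of (core_a, core_b, communication_volume).
    """
    # Bounding box of the floorplan, assuming the lower-left corner is at (0, 0).
    width = max(x + w for x, y, w, h in blocks.values())
    height = max(y + h for x, y, w, h in blocks.values())
    area = width * height

    def center(name):
        x, y, w, h = blocks[name]
        return x + w / 2, y + h / 2

    # Manhattan distance between core centers, weighted by communication volume,
    # so that heavily communicating pairs dominate the wirelength term.
    wirelength = 0.0
    for a, b, volume in nets:
        (ax, ay), (bx, by) = center(a), center(b)
        wirelength += volume * (abs(ax - bx) + abs(ay - by))
    return alpha * area + (1 - alpha) * wirelength
```

In practice the two terms are usually normalized before being combined, since area and wirelength have different units and magnitudes; the sketch omits this for brevity.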

A number of different floorplans are generated by running the floorplanner multiple times with the selected weights for area and wirelength and with different seeds for the internal random number generator. During this step, the M best floorplans are recorded in the best M list. The selection is made according to the chosen criterion of smaller area or shorter total wirelength, which is related to the total communication volume inside the application. Both the number of runs and M have default values; however, they also may be specified by the user. A typical result after floorplanning is shown in Figure 5.
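The multi-seed exploration and the best M list can be sketched as below. The real B*Tree floorplanner is replaced by a toy stand-in, and the names `explore_floorplans`, `toy_floorplanner`, `n_runs`, and `m_best` are hypothetical.

```python
import heapq
import random

def explore_floorplans(run_floorplanner, n_runs, m_best, criterion):
    """Run the floorplanner once per seed and keep the m_best results.

    criterion maps a floorplan to the value being minimized
    (for example its area or its total wirelength).
    """
    candidates = [run_floorplanner(seed) for seed in range(n_runs)]
    return heapq.nsmallest(m_best, candidates, key=criterion)

def toy_floorplanner(seed):
    """Stand-in for a real floorplanner: returns random area/wirelength figures."""
    rng = random.Random(seed)
    return {"seed": seed,
            "area": rng.uniform(90.0, 110.0),
            "wirelength": rng.uniform(40.0, 60.0)}
```

Calling `explore_floorplans(toy_floorplanner, 20, 3, lambda f: f["area"])` returns the three smallest-area results of twenty runs, ordered from best to worst; passing `lambda f: f["wirelength"]` switches the selection criterion.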

For the 3-layer architecture, the floorplanning step is different. In this case, cores are placed on layers 1 and 3 as dictated by the result of the partitioning step. To do that, we have modified the floorplanning algorithm to be able to handle vertical constraints. As a result, this step is split into two substeps. (i) In the first substep, the first partition is floorplanned on layer 1 using the original version of the floorplanning algorithm (this is similar to the 2-layer architecture). (ii) During the second substep, the second partition is floorplanned on layer 3 using the modified floorplanner. In this case, connections between cores of this partition and cores of the first partition (already placed and fixed on layer 1) act as vertical constraints for the floorplanning process on layer 3. Vertical constraints aim at minimizing the overall wirelength of the top-level application. Intuitively, cores on layer 3 connected in the top-level communication task graph to cores on layer 1 should be overlapping or vertically aligned to shorten the communication distance via the network on layer 2. A typical result after floorplanning for the 3-layer architecture is shown in Figure 6.

5.3. Routers Assignment

In this step, each floorplan from the list of best floorplans undergoes the routers assignment step. The regular mesh NoC is constructed on layer 2. This square regular mesh network utilizes the minimum number of routers that can guarantee at least one router for each IP/core. This topology is referred to as the direct topology. However, the mesh can optionally be expanded to a larger number of routers in both directions. Since we deal with heterogeneous floorplans, it is not possible to guarantee the presence of routers at the locations of IP/core corners (or even the IP/core layout). Therefore, some of the IP/cores will have to use extra-links to connect to the assigned routers. These extra-links introduce additional delays (included inside the cycle-accurate simulator) that affect the overall performance.
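The size of the direct topology follows from the requirement of a square mesh with at least one router per IP/core. The sketch below computes the mesh side and one possible router placement; placing routers at the centers of equal tiles, as well as the names `direct_mesh_side` and `router_positions`, are assumptions for illustration.

```python
import math

def direct_mesh_side(num_cores, extra=0):
    """Side n of the smallest n x n mesh with at least one router per core.

    extra optionally expands the mesh by that many routers in both directions.
    """
    if num_cores < 1:
        raise ValueError("at least one core is required")
    # Integer ceil(sqrt(num_cores)) without floating-point rounding.
    return math.isqrt(num_cores - 1) + 1 + extra

def router_positions(side, chip_width, chip_height):
    """Router (x, y) coordinates at the centers of side x side equal tiles."""
    return [((col + 0.5) * chip_width / side, (row + 0.5) * chip_height / side)
            for row in range(side) for col in range(side)]
```

For example, 10 cores need a 4 x 4 mesh, which leaves six routers without an attached IP/core.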

The goal of the routers assignment step is to associate each IP/core with a router from the regular mesh on layer 2 such that the total wirelength of the extra-links between each IP/core and its assigned router is minimized. This is a linear assignment problem solved by using the efficient Kuhn-Munkres algorithm [43]. The algorithm utilizes a bipartite graph (see Figure 7) with two sets of nodes: left-nodes representing the application IP/cores and right-nodes representing the routers of the regular mesh NoC. Edges connect each node from one set to all nodes in the other set. Edge weights are proportional to the Manhattan distance between the IP/core and routers. In this way, we treat the assignment of all IP/cores simultaneously and achieve an overall minimal total length of the extra-links. This step is the same for both 2-layer and 3-layer architectures. The examples from Figures 5 and 6 also show the result of the routers assignment step.
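The assignment can be sketched with a compact textbook implementation of the Kuhn-Munkres (Hungarian) algorithm applied to a Manhattan-distance cost matrix. This is an illustrative version and not the tool's code; the names `hungarian` and `assign_routers` are hypothetical, and cores and routers are reduced to single (x, y) points.

```python
def hungarian(cost):
    """Minimum-cost assignment of each row to a distinct column (rows <= columns).

    Returns a list whose i-th entry is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    match = [0] * (m + 1)        # match[j] = row assigned to column j (1-based)
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Update potentials so that at least one new edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip the assignments along the augmenting path.
        while j0 != 0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    assignment = [None] * n
    for j in range(1, m + 1):
        if match[j] != 0:
            assignment[match[j] - 1] = j - 1
    return assignment

def assign_routers(cores, routers):
    """cores, routers: lists of (x, y). Returns the router index for each core."""
    cost = [[abs(cx - rx) + abs(cy - ry) for rx, ry in routers]
            for cx, cy in cores]
    return hungarian(cost)
```

With cores at (0, 0) and (3, 3) and routers at the four corners of a 4 x 4 square, `assign_routers([(0, 0), (3, 3)], [(0, 0), (4, 0), (0, 4), (4, 4)])` returns `[0, 3]`, a total extra-link length of 2.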

5.4. NoC Simulation

In the last step, each of the best NoC topologies is verified using the integrated cycle-accurate simulator. The simulator is an adapted version of the one studied in [37]. We use the following default values for the NoC topology: packet size of 5 flits with each flit being 64 bits wide, input buffer size of 12 flits, and two virtual channels. We use XY routing and wormhole flow control, which is known to be very efficient and to require little hardware overhead. The cycle-accurate simulator is always run until all injected flits have reached their destinations, and the average latency is computed after allowing for an initial 1000 warm-up cycles. The router architecture is similar to the one presented in [44]. The final average flit latency, which is obtained during this step, is recorded for each of the floorplans from the best M list. The NoC topology with the best overall latency is selected as the final result.
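The latency statistic can be sketched as follows. How the simulator treats the warm-up period is not detailed in the text; excluding flits injected before the warm-up ends is one common interpretation and is an assumption here, as is the name `average_flit_latency`.

```python
def average_flit_latency(flits, warmup_cycles=1000):
    """Average latency in cycles over flits injected after the warm-up period.

    flits: iterable of (injection_cycle, arrival_cycle) pairs.
    Returns NaN when no flit was injected after the warm-up.
    """
    latencies = [arrival - injection
                 for injection, arrival in flits
                 if injection >= warmup_cycles]
    if not latencies:
        return float("nan")
    return sum(latencies) / len(latencies)
```

Discarding the warm-up period keeps the initially empty network, in which early flits see no contention, from biasing the average downward.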

Finally, we note that ideally, one would use the routers assignment and the cycle-accurate simulation inside the optimization loop of the simulated annealing-based floorplanning algorithm. (The concept of unifying different design flow steps to better explore the design solution space has been applied successfully, for example, to mapping and routing in [45].) However, this becomes computationally too expensive due to the long CPU runtimes required by the cycle-accurate simulator.

6. Experimental Results

6.1. Experimental Setup and Testcases

We implemented the proposed design methodology, which integrates the partitioner, the floorplanner, the routers assignment, the NoC cycle-accurate simulator, and the GUI, using C++. The tool can be downloaded from [46]. In our experiments, we used six testcases whose characteristics are shown in Table 1. In this table, we also present the size of the direct topologies. We constructed these testcases from the classic MCNC testcases, whose area was scaled to achieve an average size that is typical for NoCs reported in the literature [24, 47-49]. The initial connectivity between the modules was used to compute the communication volume in the communication task graph associated with each testcase floorplan.

For the simulated annealing-based floorplanning step, we used an alpha value of 0.25, which in our experiments proved to be a good balance between area and wirelength while the aspect ratio of the resulting floorplan was close to 1. In the NoC simulation step, each testcase was subject to uniform traffic with packets injected at each source router at a rate proportional to the communication volume of the corresponding source-destination communication pair.
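Injection rates proportional to communication volume can be derived as in the sketch below. Normalizing by the largest volume, so that the busiest pair injects at `base_rate`, is an assumption for illustration; the text only states that the rate is proportional to the volume.

```python
def injection_rates(volumes, base_rate):
    """Per-pair packet injection rates proportional to communication volume.

    volumes: dict (source, destination) -> communication volume from the CTG.
    base_rate: rate assigned to the pair with the largest volume (assumption).
    """
    peak = max(volumes.values())
    return {pair: base_rate * volume / peak for pair, volume in volumes.items()}
```

Sweeping `base_rate` upward then scales all pairs together, which is one way to produce latency versus injection rate curves up to network saturation.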

Because in our methodology the length of the physical links between the network routers varies with the network size, we estimate the link delay by extrapolating the physical link delay from [37] using a simple Elmore delay formula [50]. The same delay estimation technique was applied to the extra-links between IP/cores and routers, which were assumed to be L-shaped (with negligible via delay between metal layers). We do, however, consider the delay of the through silicon vias (TSVs) between two device layers of the 3D architectures. We estimated the TSV delay by technology projection [51] using the delay data from [17]. Based on the analyses in [17, 52], we assume that the area required by TSVs is negligible and that TSVs can be accommodated within the white space available in typical floorplans. The CPU runtime is approximately 30 minutes (Linux machine, 2.5 GHz, 2 GB memory) for the largest testcase ami49.
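The dependence of link delay on link length can be illustrated with the standard Elmore delay of a driver, a distributed RC wire, and a load. The specific formula and parameter values used for the extrapolation are not given in the text, so the function name `elmore_wire_delay`, its parameters, and the units are assumptions.

```python
def elmore_wire_delay(r_per_len, c_per_len, length, r_driver=0.0, c_load=0.0):
    """Elmore delay of a driver, a distributed RC wire, and a capacitive load.

    r_per_len, c_per_len: wire resistance and capacitance per unit length.
    The result is in units of resistance times capacitance (e.g. ohm * farad = s).
    """
    r_wire = r_per_len * length
    c_wire = c_per_len * length
    # The driver sees the whole wire and load capacitance; the distributed wire
    # resistance sees, on average, half of its own capacitance plus the load.
    return r_driver * (c_wire + c_load) + r_wire * (c_wire / 2.0 + c_load)
```

With negligible driver resistance and load capacitance, the delay grows with the square of the length, so halving the link length reduces the wire delay to a quarter.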

6.2. Exploration of the 2-Layer Architecture

In the first part of the experiments, several variations are applied to the default network specifications. The purpose of these experiments is to identify the optimal network that minimizes the average flit latency.

6.2.1. Varying the Virtual Channels Count

We start by investigating the impact of increasing the number of virtual channels. We can afford to do that because routers are expected to be smaller than the average core size (roughly 20% of the total cores area), which means that on layer 2 there is extra area that may be utilized to further improve the NoC performance. In this experiment, we varied the number of virtual channels between 2 and 6.

The results for the average flit latency (as reported by the cycle-accurate NoC simulator) are shown in Figures 8, 9, 10, 11, 12, and 13. We observe that the average flit latency generally improves with the increase of the number of virtual channels. We also plot (see Figure 14) the normalized latency (with respect to the latency achieved using the default of 2 virtual channels) for the packet injection rate when the network saturation occurs. We find that the optimal number of virtual channels is different for different testcases.

These results are expected because the overall congestion in the network is intuitively reduced if the number of virtual channels multiplexed in the time-domain over a physical channel is increased. However, it is also evident that increasing the number of virtual channels more than necessary can have a negative impact on performance. This is explained as follows: as the network gets loaded with injected packets, the average amount of stalling due to arbitration inside routers (with increasingly more virtual channels) may increase and affect negatively the latency.

6.2.2. Varying the Buffers Size

In this experiment, instead of increasing the number of virtual channels, we increase the buffers size. Since we assume the area occupied by routers to be roughly 20% of the total cores area, we increase the area of each router to up to 5x by increasing both the input and output buffer sizes of each port (buffers occupy most of the area inside the router architecture).

Due to space limitations, we report (see Figure 15) only the normalized latency (with respect to the latency achieved using the minimum buffer size) for the packet injection rate when the network saturation occurs. We observe that the average flit latency improves with the increase of the buffer size. However, once a certain buffer size is reached (which is roughly the 3X data point except for xerox) further increasing the buffer size does not improve the latency. We note that in general, latency is improved more by increasing the number of virtual channels rather than increasing the buffers size.

6.2.3. Varying the Mesh Size

In this experiment, we investigate the impact of increasing the size of the regular mesh network on the average flit latency. Increasing the number of routers in both dimensions for the same testcase translates into shorter physical links between routers. As a result, the entire system can be clocked at higher frequencies, which significantly improves the saturation throughput. For example, the result of this experiment for testcase ami49 is shown in Figure 16. The results of the rest of the testcases are similar; we do not include the rest of the plots here due to space limitations.

It has to be noted that this result is achieved under the assumption that the delay of the router pipeline is a clock period, which is given by the delay of the physical link. This assumption is made in [37], from where we adopted the NoC simulator, and can be validated by using speculation and lookahead as discussed in [8]. If, however, this assumption is removed, the router pipeline should incur a delay equal to two, three, or four clock cycles (to account for typical operations including routing computation, virtual channel allocation, switch allocation, and switch traversal), depending on the type of flit (head, body or tail) and the degree of speculation and lookahead [8]. On the other hand, if the router pipeline is assumed to incur a constant and fixed delay—irrespective of the physical link length—then the system clock frequency will be given by the maximum between the delay of the physical link and the delay of the router pipeline. In this case, the impact of increasing the mesh size will be less significant, and it could actually lead to an increase of the flit latency as studied in [36].
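The two clocking assumptions discussed above can be compared with a small sketch. Taking the link length as the chip width divided by the mesh side, and scaling a reference link delay quadratically with length, are simplifying assumptions; the names `link_delay` and `clock_period` are hypothetical.

```python
def link_delay(chip_width, mesh_side, ref_delay, ref_length):
    """Delay of one mesh link, scaled quadratically from a reference link."""
    pitch = chip_width / mesh_side  # assumed router-to-router distance
    return ref_delay * (pitch / ref_length) ** 2

def clock_period(link, router_pipeline=None):
    """Clock period under the two assumptions discussed in the text.

    With router_pipeline=None the period is set by the physical link alone;
    otherwise it is the maximum of the link and router pipeline delays.
    """
    if router_pipeline is None:
        return link
    return max(link, router_pipeline)
```

Under the first assumption, doubling the mesh side from 4 to 8 on the same chip reduces the period to a quarter in this model. Once a fixed router pipeline delay exceeds the link delay, further increases in mesh size no longer shorten the period while the hop count keeps growing, which is consistent with the reduced benefit noted above.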

6.3. Comparison of the 2-Layer and 2D Architectures

In order to compare the proposed 2-layer architecture against a traditional approach, we construct the 2D architecture by artificially expanding the IP/cores by 20% [12, 16] to account for the space occupied by routers and network interfaces (implemented within the core boundaries on the same layer).

The simulation results are shown in Figures 17, 18, 19, 20, 21, and 22. We observe that the performance of the 2-layer architecture is better for each testcase. For the 2-layer architecture, the additional delays incurred due to the TSVs negatively impact the flit latency. However, the footprint area is smaller (as cores are smaller), and therefore the physical links are shorter, which leads to significantly smaller flit latencies. In addition, if we exploit the extra area on the top layer of the 2-layer architecture as described in the previous sections, then the latency can be further improved as illustrated by the 3D 2-layer best data points from Figures 17-22.

We also note that the performance of the 2D architecture may be improved by designing custom NoCs similar to those studied in [28, 30]. However, the main goal of this paper is not to show that regular homogeneous 3D NoCs are better than custom 2D NoCs, which is unlikely, but to propose 3D architectures as alternatives to regular 2D NoCs and explore their performance when the number of virtual channels, buffers size, and mesh size is varied to use the extra space available on layer 2.

6.4. Comparison of the 2-Layer and 3-Layer Architectures

In the last part of our experiments, we compare the average flit latencies of the 2-layer and 3-layer architectures. The simulation results, using the default values for mesh size, number of virtual channels, and buffers size, are shown in Figures 23, 24, 25, 26, 27, and 28. As expected, because the physical links are shorter in the 3-layer architecture, the clock frequency at which the system can work is higher. Hence, the average flit latency is improved significantly. Therefore, we conclude that the 3-layer architecture offers better network performance than the 2-layer architecture. However, the design methodology for the 3-layer architecture additionally requires the partitioning step and the modification of the floorplanning algorithm. It may also require more complex thermal management due to the increased integration density, and the 3D fabrication technology will be more complex due to the alignment of three layers.

7. Conclusion and Future Work

In this paper, we proposed 3D 2-layer and 3-layer NoC architectures that utilize homogeneous networks on a separate layer and heterogeneous floorplans on different layers. A design methodology that consists of floorplanning, routers assignment, and cycle-accurate NoC simulation was implemented and utilized to investigate the new architectures. Experimental results showed that increasing the number of virtual channels rather than the buffers size is more effective in improving the NoC performance. In addition, increasing the mesh size can significantly improve the NoC performance under the assumption that the clock frequency is given by the length of the physical links. Moreover, the 3-layer architecture can offer significantly better NoC performance compared to the 2-layer architecture.

As future work, we plan to address the problems of energy consumption and thermal profile optimization [53, 54] possibly in a unified fashion inside the floorplanning algorithm. The floorplanning step will be modified to consider the allocation of white space and TSVs planning under area constraints.

Acknowledgment

This paper was supported by the Electrical and Computer Engineering Department at North Dakota State University (NDSU).