Abstract
Low-cost FPGAs have a number of Configurable Logic Blocks (CLBs) comparable to that of resource-rich FPGAs but far fewer routing tracks. For CAD tools, this makes it harder to map a circuit successfully onto low-cost FPGAs. Instead of switching to resource-rich FPGAs, designers can employ depopulation-based clustering techniques, which underuse CLBs and thereby improve routability by spreading the logic over the architecture. However, all depopulation-based clustering algorithms to date increase critical path delay. In this paper, we present a timing-driven nonuniform depopulation-based clustering technique, TNDPack, that targets critical path delay and channel width constraints simultaneously. TNDPack adjusts the CLB capacity based on the criticality of the Basic Logic Element (BLE). Results show that TNDPack reduces minimum channel width by 11.07% while increasing the number of CLBs by 13.28% compared to TVPack. More importantly, TNDPack decreases critical path delay by 2.89%.
1. Introduction
Field-programmable gate arrays (FPGAs) were first introduced in the 1980s. While they are less efficient than ASICs, FPGAs are becoming more popular because of their low non-recurring engineering cost and fast time-to-market. Currently, commercial FPGAs can be categorized into low-cost and resource-rich families. As shown in Table 1, the low-cost FPGA family (Spartan) has a number of Configurable Logic Blocks (CLBs) comparable to that of the resource-rich family (Virtex), but less memory and fewer multipliers and routing tracks. Limited interconnect resources increase the probability that nets are routed through longer paths or become unroutable due to congestion. When nets take longer paths for the sake of routability, critical path delay may also increase. We may solve these problems by migrating to a resource-rich FPGA device, which has more routing resources, at 7× the price. To avoid this, the FPGA CAD flow must improve routability as well as timing performance to make the low-cost device a feasible option.
The FPGA CAD flow includes four stages: technology mapping to form a netlist of logic blocks, clustering to combine blocks into CLBs, placement to allocate a physical position to each CLB, and routing to define paths for all nets in the design. Clustering is the foundation of layout and has a strong influence on area efficiency, timing, and power [1]. Figure 1 categorizes clustering techniques. Based on the target utilization objective, we identify two types of clustering techniques: those targeting maximum logic utilization and those targeting less than maximum logic utilization. Most clustering approaches fully populate CLBs with an optimization goal of routability (area), timing, or power.
However, maximum logic utilization may cause routing congestion in some parts of the FPGA. A CLB contains N basic logic elements (BLEs), where a typical BLE used in many academic studies consists of a 4-input LUT, a flip-flop, and a MUX that selects the output from either the LUT or the flip-flop. A group of CLBs is strongly connected if they share a large number of nets. After placement, such CLBs appear close together in a specific region of the FPGA. Filling these CLBs to the limit (N) increases the demand on the interconnect resources in this region, which must route the connections among them. As a result, channel width requirements for such regions become higher than for others. This leads to an increase in peak channel width, and hence the design requires more routing resources.
It has long been known that better channel widths can be achieved as CLBs are depopulated. First proposed in [2], depopulation-based clustering techniques can lower peak channel width and improve routability. Instead of targeting maximum logic utilization, depopulation underuses CLBs by not filling them to capacity. The regions with strongly connected CLBs are spread over a larger area of the FPGA. This reduces the demand for routing resources; hence, in such a region, more resources become available to route the connections.
However, depopulation leads to more external connections among CLBs and typically results in an increase in critical path delay, because inter-CLB delays are much larger than intra-CLB delays [1]. For example, the latest depopulation-based clustering technique [3] decreases minimum channel width by 15% while increasing the number of CLBs by 17.32% compared to TVPack [4]. Additionally, total area increases by 5%, along with a 7% increase in critical path delay. All depopulation-based clustering algorithms [3, 5, 6] increase critical path delay while enhancing routability.
In this paper, we propose the first depopulation-based clustering approach that takes timing into account. As categorized in Figure 1, we develop a seed-based, routability- and timing-driven nonuniform depopulation technique, TNDPack. We adjust the capacity of the CLB under construction based on the criticality of the BLE under consideration. For example, we cluster the nets on the critical path to full capacity. That way, we reduce the inter-CLB delay, which helps decrease critical path delay. Meanwhile, we depopulate on the paths with low criticality to avoid routing congestion and hence reduce the channel width requirements. To realize this idea, we modify both the algorithm flow and the cost function of TVPack. Results show that TNDPack decreases minimum channel width by 11.07% while increasing the number of CLBs by 13.28% compared to TVPack. More importantly, as opposed to the trend we see in other depopulation techniques, TNDPack decreases critical path delay by 2.89%. With the new technique, instead of moving to a resource-rich FPGA (in the case of Spartan 3s500e versus Virtex 2vp7 of Table 1), designers may, for example, move to Spartan 3s1200e and pay 2× instead of 7× the cost. Furthermore, this paper serves as a guide to understanding the effects of depopulation on area and delay performance for FPGAs.
The rest of the paper is organized as follows. Section 2 reviews the related work on depopulation-based clustering techniques. Section 3 introduces our clustering technique, TNDPack. Section 4 presents and analyzes our experimental results. Section 5 compares our work with both depopulation-based and non-depopulation-based approaches. Section 6 presents our conclusion and future work.
2. Related Work
Several depopulation techniques were proposed previously. We categorize them into two types (Algorithm 1): uniform depopulation [5, 6] and nonuniform depopulation [3, 7]. Uniform depopulation sets a fixed “upper limit” per CLB, and each CLB is filled to that “upper limit” capacity. In nonuniform depopulation, the “upper limit” varies among CLBs. Let us assume that the cluster size is 8. While a uniform depopulation scheme may use a fixed “upper limit” of 6 for all CLBs, a nonuniform scheme will result in a CLB distribution with sizes from 1 to 8. Nonuniform depopulation sets a very low “upper limit” to prioritize routability in a congested area and sets a high “upper limit” in a congestion-free area to save CLBs. Therefore, compared to the uniform scheme, nonuniform depopulation results in better routability in congested areas and higher CLB utilization in uncongested areas.
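The difference between the two schemes can be illustrated with a minimal sketch. The greedy filler below, the choice of 48 BLEs, and the rule that the first three CLBs lie in a congested region are all assumptions made for illustration; they are not taken from any of the cited techniques.

```python
from typing import Callable, List

CLUSTER_SIZE = 8  # CLB capacity assumed in the text's example


def pack(num_bles: int, limit_for_clb: Callable[[int], int]) -> List[int]:
    """Fill CLBs one after another; limit_for_clb(i) is the upper limit of CLB i."""
    sizes: List[int] = []
    remaining = num_bles
    while remaining > 0:
        used = min(limit_for_clb(len(sizes)), remaining)
        sizes.append(used)
        remaining -= used
    return sizes


def uniform_limit(_index: int) -> int:
    # Uniform depopulation: the same upper limit (6 of 8) for every CLB.
    return 6


def nonuniform_limit(index: int) -> int:
    # Nonuniform depopulation (hypothetical rule): the first three CLBs are
    # treated as congested and get a low limit; the rest may fill to capacity.
    return 4 if index < 3 else CLUSTER_SIZE


uniform_sizes = pack(48, uniform_limit)
nonuniform_sizes = pack(48, nonuniform_limit)
```

In this toy case both schemes use the same number of CLBs, but the nonuniform scheme concentrates the unused BLEs where congestion is assumed to be.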

Tom and Lemieux [7] propose the first nonuniform depopulation methodology. They use 20 MCNC benchmark circuits [8] and connect them with three different topologies (independent, pipelined, and clique). Each topology represents an SoC. Each benchmark is an IP block and uses its own “upper limit.” Results show that the SoC design with the help of the depopulation technique requires less channel width compared to TVPack. However, total area increases, while critical path delay remains similar to that of TVPack. Tom's approach [7] is a good study in terms of showing the potential benefit of nonuniform depopulation. However, the methodology determines the “upper limit” for each IP block manually based on the congestion inside that IP block, and no algorithm is given.
The nonuniform depopulation technique Un/DoPack was proposed by Tom et al. in [3]. This technique runs the FPGA CAD flow twice. The first iteration is the regular CAD flow. In the second iteration, the clustering stage uses the layout result of the first iteration and depopulates the congested regions. While reducing the channel width, Un/DoPack, like the other depopulation-based clustering approaches, incurs an increase in total area and critical path delay.
3. TNDPack
In this section, we describe our seed-based, routability- and timing-driven nonuniform depopulation clustering technique, TNDPack. We present the pseudocode and notable implementation insights.
3.1. Algorithm Flow
TNDPack chooses the seed block based on criticality first. The first block that is clustered into a CLB is called the seed block of this CLB. Then TNDPack packs more blocks into the CLB by following the nonuniform depopulation clustering scheme.
We define two strategies for depopulation:
(i) BLE-limit: limit the number of BLEs used in a CLB [3, 6, 7],
(ii) input-limit: limit the number of inputs used for a CLB [5].
TNDPack employs either the “BLE-limit” or the “input-limit” strategy to achieve a variable utilization level. We evaluate the performance of each and present their effects on minimum channel width and critical path delay separately. In this paper, the “utilization level” measures the amount of resources used by a CLB in terms of the number of BLEs or inputs, where a “high utilization level” means most resources are used. For the “BLE-limit” strategy, utilization level refers to the number of BLEs used in a CLB, and for the “input-limit” strategy, utilization level refers to the number of inputs used by a CLB.
Algorithm 1 shows the pseudocode for TNDPack. First, the algorithm computes the criticality of each block (Ln. 1) and sorts the blocks by criticality (Ln. 2).
Then TNDPack begins to fill CLBs. The algorithm keeps clustering blocks into CLBs until no unclustered block is left (Ln. 3). In each iteration, we have the following.
(1) TNDPack packs the seed block that has the maximum criticality (Ln. 4). We take the criticality of the seed block to represent the criticality of the net. Therefore, we determine the “maximum utilization level” based on the ranking of the seed's criticality value (Ln. 5). In this paper, the “maximum utilization level” means the maximum number of BLEs or inputs that are allowed to be used for a CLB. If the ranking of the seed's criticality value is high, the algorithm sets the “maximum utilization level” to a high value to decrease the critical path delay. If the ranking is low, the “maximum utilization level” is set to a low value to reduce the routing requirement.
Table 2 shows a “maximum utilization table” (MUT) used to look up the value of the “maximum utilization level” for the “BLE-limit” strategy. This MUT allows more BLEs to be used in a CLB for a seed with a 95% criticality ranking than for a seed with a 50% criticality ranking. We explain how to generate the MUT in Section 4.2.
(2) Then, TNDPack clusters blocks until the CLB under consideration reaches its maximum utilization level (Ln. 6). Figure 2 shows the algorithm flow for packing one block into the CLB (Ln. 7–Ln. 20).
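The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pseudocode of Algorithm 1: the `Block` type, the three-partition table in `mut_lookup` (boundaries at 40% and 95%, levels 6, 7, 8), the percentile ranking, and the simple gain function are hypothetical, and the candidate block list with its threshold (CBT) is omitted.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set

ALPHA = 0.75  # criticality/routability balance (assumed value)
UBT = 4       # unrelated block threshold (assumed value)


@dataclass(frozen=True)
class Block:
    name: str
    criticality: float
    nets: FrozenSet[str]


def mut_lookup(rank_percent: float) -> int:
    """Hypothetical MUT: criticality ranking (percent) -> maximum utilization level."""
    if rank_percent >= 95.0:
        return 8
    if rank_percent >= 40.0:
        return 7
    return 6


def gain(block: Block, clb_nets: Set[str]) -> float:
    """Illustrative gain: criticality plus the fraction of the block's nets shared."""
    shared = len(block.nets & clb_nets)
    return ALPHA * block.criticality + (1 - ALPHA) * shared / max(len(block.nets), 1)


def cluster(blocks: List[Block]) -> List[List[Block]]:
    # Sort blocks by criticality, most critical first.
    order = sorted(blocks, key=lambda b: b.criticality, reverse=True)
    unclustered = list(order)
    clbs: List[List[Block]] = []
    while unclustered:
        # Step (1): the most critical unclustered block becomes the seed, and
        # its criticality ranking selects the CLB's maximum utilization level.
        seed = unclustered.pop(0)
        at_or_below = sum(1 for b in order if b.criticality <= seed.criticality)
        limit = mut_lookup(100.0 * at_or_below / len(order))
        clb, clb_nets = [seed], set(seed.nets)
        # Step (2): add blocks until the CLB reaches that level.
        while len(clb) < limit and unclustered:
            related = [b for b in unclustered if b.nets & clb_nets]
            if related:
                pick = max(related, key=lambda b: gain(b, clb_nets))
            elif len(clb) < UBT:
                pick = unclustered[0]  # most critical unrelated block
            else:
                break  # leave the remaining BLEs unused
            unclustered.remove(pick)
            clb.append(pick)
            clb_nets |= pick.nets
        clbs.append(clb)
    return clbs
```

With eight highly critical blocks on one net and two low-criticality blocks on another, this sketch fills the first CLB to capacity and places the remaining two blocks in a second, partially filled CLB.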
3.2. Cost Function
The cost function in TNDPack considers criticality (in terms of delay) and routability simultaneously (1), similar to the clustering cost function of TVPack [4]. The “α” parameter balances criticality against routability. The criticality is defined in [4] and is calculated based on the sensitivity of a connection to the delay of the whole circuit. TNDPack introduces the current utilization level as a factor in the routability component of the TVPack clustering cost function. As the current utilization level increases, the probability of sharing inputs and outputs increases, and therefore so does the value of the routability component. TNDPack gradually scales up the routability part while paying informed attention to criticality.
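For orientation, the attraction function of TVPack [4] that this cost function builds on can be written as below, where B is the block under consideration, C is the cluster under construction, and G is a normalization factor. The second line is only a schematic of how a utilization-dependent factor could enter the routability term: the function f and the symbol u (current utilization level) are placeholders assumed here, and the line should not be read as the authors' equation (1).

```latex
\begin{align*}
\text{Attraction}_{\text{TVPack}}(B) &= \alpha \cdot \text{Criticality}(B)
  + (1-\alpha)\cdot \frac{\lvert \text{Nets}(B)\cap \text{Nets}(C)\rvert}{G} \\
\text{Gain}_{\text{sketch}}(B) &= \alpha \cdot \text{Criticality}(B)
  + (1-\alpha)\cdot f(u)\cdot \frac{\lvert \text{Nets}(B)\cap \text{Nets}(C)\rvert}{G}
\end{align*}
```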
3.3. Unrelated Block Clustering
In this section, we explain how to cluster an unrelated block. TNDPack tries to cluster the related block with the highest gain value first. TNDPack allows clustering an unrelated block only if no related block is available and the current utilization level is less than the “unrelated block threshold” (UBT) (Ln. 13–Ln. 14). This rule avoids clustering very few unrelated blocks into a CLB and the inter-CLB delay that may follow. This rule also reduces the connections between CLBs to improve routability. For example, as shown in Figure 3, we want to cluster two nets into two CLBs, in the order of Net 1 followed by Net 2. In Figure 3(a), after Net 1 is clustered, a block of Net 2 is clustered into CLB 1. This introduces an inter-CLB delay for Net 2 and a connection between CLB 1 and CLB 2. As an alternative solution, in Figure 3(b), all blocks of Net 2 are clustered into CLB 2. In this solution, there is no inter-CLB delay or connection between CLBs. Compared to Figure 3(b), the solution in Figure 3(a) requires more routing resources and has a larger delay. Therefore, if few BLEs are left available in a CLB and no related block is available, it is wiser to leave the BLEs unused.
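The UBT rule can be summarized in a few lines. The function below is a hedged sketch: its name, its arguments, and the representation of related blocks as a name-to-gain mapping are assumptions for illustration, not part of the TNDPack implementation.

```python
from typing import Dict, Optional, Sequence


def choose_next_block(
    related_gains: Dict[str, float],
    unrelated: Sequence[str],
    current_utilization: int,
    ubt: int,
) -> Optional[str]:
    """Pick the next block for the CLB under construction, or None to close it."""
    # Prefer the related block with the highest gain.
    if related_gains:
        return max(related_gains, key=related_gains.get)
    # An unrelated block is allowed only while the CLB is still below the UBT.
    if unrelated and current_utilization < ubt:
        return unrelated[0]
    # Otherwise leave the remaining BLEs unused.
    return None
```

For instance, with a UBT of 4, a CLB that already holds 6 BLEs and has no related candidates is closed rather than padded with an unrelated block.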
Typically, clustering techniques modify the cost function [4, 5, 9], the algorithm flow [10], or both [11–13]. Here we summarize how the well-known approaches enhance the clustering flow and highlight where our approach stands relative to them.
TRPack [9] uses the same algorithm flow as TVPack and modifies the cost function. TRPack modifies the routability part in the cost function by taking into account the individual contributions of both shared and nonshared nets between the CLB under construction and the block under consideration. TRPack improves minimum channel width compared to TVPack.
iRAC [5] develops a new method to choose the seed block. This technique chooses the unclustered block with the most used inputs and minimum connectivity as the seed block. iRAC then clusters each BLE into the CLB under construction using a new cost function that is based on the weight of the intersecting net and its pins that are already in the CLB. Furthermore, it uses uniform depopulation with the input-limit strategy. The algorithm flow is similar to that of TVPack. However, with these modifications, iRAC achieves a large reduction in the number of external nets, which leads to a reduction in minimum channel width.
The latest clustering technique, HDPack [13], uses a global placer to determine approximate BLE locations. The algorithm then uses this placement information (physical information) in the clustering cost function. HDPack further incorporates a pre-packing step. However, most of the improvement comes from clustering with physical information; the pre-packing step adds little improvement over the modified cost function.
In summary, as shown in Algorithm 1 and Figure 2, we adjust the cost function of TVPack to pay informed attention to routability and timing by taking utilization level into account. We also modify the clustering algorithm significantly by
(i) adjusting the “maximum utilization level” at run time with the maximum utilization table (MUT);
(ii) forming the “candidate block list” with the candidate block threshold (CBT);
(iii) setting the “unrelated block threshold” (UBT) for clustering.
4. Experimental Results
4.1. Methodology
We implement TNDPack based on TVPack and conduct several experiments with the 20 largest MCNC benchmarks. We examine the performance of our proposed clustering technique and explore the effects of two depopulation strategies (“BLE-limit” and “input-limit”). Table 3 lists the main architecture parameters used in the experiments, where segment length is the number of CLBs that a wire segment spans, Fc describes the flexibility of connection blocks, and Fs describes the flexibility of switch blocks [14]. Figure 4 shows the CAD flow. The VPR version used in the experiments is v4.30.
As opposed to [3], our method runs the CAD flow once. The technology-mapped circuit and the architecture description are the inputs to the clustering stage. TNDPack carries out clustering, and VPR [15] handles placement and routing. We obtain the number of used CLBs, minimum channel width, and critical path delay for performance comparison against [4, 5, 9, 13].
4.2. Tuning the Parameters for BLE-Limit
We tune various parameters in our algorithm to identify the configuration that gives the best performance. To find suitable values of “α”, “UBT”, “MUT”, and “CBT”, we performed a set of experiments following the CAD flow described in Section 4. Here we discuss only the parameter tuning study based on the “BLE-limit” strategy. The parameter values for the “input-limit” strategy rely on the observations made for “BLE-limit.” We discuss this in Section 4.3.
(i) “α”: This coefficient balances the tradeoff between routability and delay. Marquardt et al. [4] show that a value of 0.75 results in the most area- and delay-efficient design. We believe that α behaves in our cost function as it does in [4]. Therefore, we varied α among 0.6, 0.7, and 0.75 in our experiments.
(ii) “UBT”: The unrelated block threshold controls whether an unrelated block may be clustered into a CLB. We tried the values 2, 4, and 6 for this parameter. During our preliminary experiments, we observed that a large value led to CLBs with too many unrelated BLEs, whereas a small value led to underutilization of CLBs. Therefore we fix UBT to 4.
(iii) “MUT”: The maximum utilization table sets the maximum utilization level for a CLB. We divide the 0% to 100% range into 2 to 5 partitions. The maximum utilization level for the partition with the highest range is set to the CLB capacity, 8, and this value descends by 1 with each lower-ranked partition. Figure 5 shows the criticality value distribution of the netlist “elliptic.” We observe that 40% of the CLBs have criticality less than 0.12; beyond that, criticality values are more or less evenly distributed up to 0.82 (the 40% to 95% range). We also observe that few CLBs have a criticality value larger than 0.82. We capture the nature of this distribution in the MUT. First, we set the upper boundary of the partition with the lowest range near 40% and the lower boundary of the partition with the highest range near 95%. We then partition the 40% to 95% range evenly based on the MUT size. Table 4(a) shows an MUT with three partitions. We also adjust the boundaries by 5% or 10% to derive alternative MUTs, as shown in Table 4(b). If any of the MUTs results in good performance, we fine-tune that MUT by adjusting the boundaries by 1% or 3% (Table 4(c)). If not, we continue adjusting the boundaries by 5% or 10%. After finding the MUT configuration that results in good performance, we fix it and use it for all benchmarks. We do not use a different MUT for each benchmark.
(iv) “CBT”: The candidate block threshold controls whether a block may be clustered into a CLB based on its criticality. CBT ranges from 0 to 1. If the current utilization level of the CLB is low (the number of BLEs used at that time for this CLB is smaller than 6), we do not take criticality into account and set CBT to 0 to focus on routability. Otherwise, we set CBT to 0.2 or 0.4 for a current utilization level of 6 and to 0.8 or 0.9 for a current utilization level of 7.
4.3. Effect of BLE-Limit and Input-Limit Exploration
As shown in Algorithm 2, we sweep through α, MUT, and CBT within their predefined ranges to evaluate the “BLE-limit” strategy. For each configuration, we run the clustering algorithm over the 20 MCNC benchmarks and compute averages for the number of used CLBs, minimum channel width, and critical path delay. Figure 6 shows the minimum channel width and critical path delay reduction of TNDPack with “BLE-limit” relative to TVPack. The x-axis shows the increase in the number of CLBs, and the y-axis shows the reduction in minimum channel width and critical path delay. For each configuration, we generate two data points: a triangle representing channel width reduction and a diamond representing critical path delay reduction. We connect each pair of data points (a triangle and a diamond) with a link indicating that they use the same parameter configuration. We then draw solid lines through the data points with the best reduction in channel width and in critical path delay, separately, and mark the points on these lines with solid triangles and diamonds. We use these solid points for analysis in Section 4.4.

For the “input-limit” strategy, instead of sweeping all parameters, we choose sample points from Figure 6 that are on the best line (solid triangle and diamond points) and run them with the “input-limit” constraint. In our experiments, cluster size (N) and the number of inputs per CLB (I) satisfy I = 2N + 2, which yields the best area-delay product [16]. (As Table 3 shows, N = 8 and I = 18 in our architecture.) Therefore, we use this relationship and adjust the MUT and CBT used for “BLE-limit” to accommodate the “input-limit” strategy, as shown in Table 5 (converted based on Table 4(a)). Similarly, we adjust UBT to 10. Figure 7 shows the minimum channel width and critical path delay reduction of TNDPack for the “input-limit” strategy. We also use the solid points on this chart for analysis in Section 4.4.
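The conversion from BLE-based limits to input-based limits can be sketched as below, assuming the relationship I = 2N + 2 is applied to each level. The helper name and the sample BLE levels are hypothetical; the actual entries of Table 5 are not reproduced here.

```python
def inputs_for_bles(n_bles: int) -> int:
    """Map a BLE-based level to an input-based level using I = 2N + 2."""
    return 2 * n_bles + 2


# Hypothetical BLE-limit levels and their input-limit counterparts.
ble_levels = [5, 6, 7, 8]
input_levels = [inputs_for_bles(n) for n in ble_levels]

# The same mapping applied to the BLE-limit UBT of 4.
ubt_inputs = inputs_for_bles(4)
```

Under this assumption the full CLB capacity of 8 BLEs corresponds to 18 inputs, and a UBT of 4 BLEs corresponds to the input-limit UBT of 10 stated in the text.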
4.4. Evaluation of BLE-Limit and Input-Limit
Based on Figures 6 and 7, we tune various parameters in our algorithm to identify the good configurations, whose performance is shown in Figures 8 and 9. Figure 8 shows the reduction in minimum channel width and Figure 9 shows the reduction in critical path delay for TNDPack with respect to TVPack, for both the “BLE-limit” and “input-limit” strategies. The solid line represents the “BLE-limit” strategy and the dashed line represents the “input-limit” strategy. Points with the same x-value in Figures 8 and 9 are generated with the same parameter configuration. Figures 8 and 9 show the following.
Furthermore, we observe that the “BLE-limit” strategy is better than the “input-limit” strategy. We see a couple of reasons for this behavior. For example, if the criticality of the seed block is high, the algorithm sets a high value for the maximum utilization level of the CLB under construction (e.g., 16 out of 18 inputs). This affects the logic utilization in a CLB significantly: we observed cases in which only 4 out of 8 BLEs were used. In another case, for a seed with low criticality, our algorithm allows 12 inputs for that CLB. However, due to input sharing, most of the inputs were absorbed and the CLB still held 6 BLEs. Therefore, the “input-limit” technique in some cases worked against the objective of the depopulation technique.
Based on these observations, we choose the “BLE-limit”-based technique for performance comparison against other clustering techniques. As shown in Figure 8, the channel width reduction grows along with the increase in the number of CLBs. We decompose total area into logic and routing and use (2) as a model to derive the area estimate. In this paper, we regard 70% routing area as a good estimate for commercial FPGAs [5]. As used in [3], let $N_{\text{CLB}}$ be the number of CLBs and let $W$ be the channel width; then
$$\frac{A_{\text{new}}}{A_{\text{old}}} = \frac{N_{\text{CLB,new}}}{N_{\text{CLB,old}}}\left(0.3 + 0.7\,\frac{W_{\text{new}}}{W_{\text{old}}}\right), \qquad (2)$$
where new denotes values after depopulation and old denotes values before depopulation.
Among the points in Figures 8 and 9, we find that the data point with a 13.28% average increase in the number of CLBs leads to the best area-delay product. Table 6 shows the parameter values used for this data point. We run the 20 MCNC benchmarks with the configuration parameters shown in Tables 3 and 6. We then compare minimum channel width, critical path delay, and the number of CLBs with TVPack in Table 9. On average, TNDPack reduces minimum channel width by 11.07%. This results in a 4.50% area increase. On average, the critical path delay decreases by 2.89%. Spreading the logic among the available CLBs is expected to increase the critical path delay. We observe this trend for some of the benchmarks with TNDPack; however, for most of the benchmarks we observe a reduction in critical path delay.
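As a quick arithmetic check of the reported area figure, the sketch below assumes the area model scales a 30% logic share and a 70% routing share (the latter weighted by the channel width ratio) by the CLB count ratio. The function name and this exact form are assumptions; with the averages reported above they reproduce the 4.50% increase.

```python
def area_ratio(clb_ratio: float, width_ratio: float, routing_share: float = 0.70) -> float:
    """Estimated new/old area ratio for a given CLB count ratio and channel width ratio."""
    logic_share = 1.0 - routing_share
    return clb_ratio * (logic_share + routing_share * width_ratio)


# 13.28% more CLBs and 11.07% narrower minimum channel width (reported averages).
ratio = area_ratio(clb_ratio=1.1328, width_ratio=1.0 - 0.1107)
area_increase_percent = (ratio - 1.0) * 100.0
```

The same function can be used to see how quickly the area penalty grows when the CLB count increases faster than the channel width shrinks.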
4.5. Run-Time
Table 7 compares TVPack and TNDPack based on the time it takes to run the clustering stage for all 20 MCNC benchmarks. Adjusting the level of depopulation based on the criticality adds to the execution time; TNDPack therefore increases the run time of the clustering stage by 0.16 seconds on average. However, this overhead is minor relative to the execution time of the whole CAD flow. Since TNDPack generates more CLBs to be placed and routed, we also observe an increase in the execution time of the placement and routing stages.
5. Discussion
In this section, we compare TNDPack with other state-of-the-art depopulation-based clustering techniques; Table 8 summarizes the comparison.
Un/DoPack [3] is a nonuniform depopulation technique. Un/DoPack achieves up to 40% channel width reduction through aggressive depopulation with a critical path delay penalty of 20%. In contrast, TNDPack reduces critical path delay as the intensity of the depopulation increases. The trend line in Figure 8 shows that TNDPack can further improve on channel width and continue reducing the critical path delay by using more CLBs (e.g., TNDPack^{4} versus TNDPack^{5} in Table 8). However, this leads to a significant area penalty, which may prevent the designer from mapping the design onto a low-cost FPGA.
iRAC [5] is a routability-driven uniform depopulation clustering technique. iRAC achieves a 25.09% reduction in channel width. However, [5] reports its results based on a different placement algorithm, iRAP, which reduces channel width relative to VPR. We use VPR for placement. Since neither iRAC nor iRAP is publicly available, it is not feasible to make a fair comparison without implementing their algorithms. In addition, iRAC [5] does not report timing results, so it is not feasible to reach a conclusion on overall performance, which requires considering area and delay simultaneously.
6. Conclusion and Future Work
It has long been known that better channel widths can be achieved as CLBs are depopulated. However, depopulation leads to more external connections among CLBs and typically results in an increase in critical path delay. While enhancing routability through depopulation is essential for utilizing low-cost FPGAs, there is also a need to address critical path delay. We achieve this goal with TNDPack by adjusting the capacity of the CLB under construction based on the criticality of the logic block under consideration.
In this study, we show that depopulation-based clustering techniques can reduce critical path delay while also reducing the stress on routing. This is significant because it shows that depopulation-based clustering potentially allows the designer to stay with the low-cost FPGA family instead of migrating to the costly resource-rich FPGA family.
In [17, 18], Pandit introduces a wirelength prediction technique that accurately estimates post-placement individual wirelength information for a given netlist before the clustering stage. As future work, we plan to incorporate this mechanism into our clustering cost function to further improve the performance of TNDPack.