Abstract

On-chip clock networks are remarkable in their impact on the performance and power of synchronous circuits, in their susceptibility to adverse effects of semiconductor technology scaling, as well as in their strong potential for improvement through better CAD algorithms and tools. Existing literature is rich in ideas and techniques but performs large-scale optimization using analytical models that lost accuracy at recent technology nodes and have rarely been validated by realistic SPICE simulations on large industry designs. Our work offers a methodology for SPICE-accurate optimization of clock networks, coordinated to satisfy slew constraints and achieve best tradeoffs between skew, insertion delay, power, as well as tolerance to variations. Our implementation, called Contango, is evaluated on 45 nm benchmarks from IBM Research and Texas Instruments with up to 50 K sinks. It outperforms all published results in terms of skew and shows superior scalability.

1. Introduction

Accurate distribution of clock signals is a major limiting factor for high-performance integrated circuits when unintended clock skew narrows down the useful portion of the clock cycle. Historically, clock skew became one of the first victims of semiconductor scaling, when wire delay started growing in significance relative to transistor delay. H-trees, popular in the industry, offered symmetric distribution networks that guaranteed nearly equal geometric lengths from the chip's center to individual clocked elements. However, H-trees did not immediately account for different sink capacitances and uneven distribution of sinks throughout the chip and did not minimize wire capacitance. The first geometric algorithms for clock routing evaluated skew in terms of wirelength from the source to sinks and produced minimum-wirelength trees for a given sink clustering (which is not difficult to optimize) using the deferred merging and embedding (DME) principle [1]. The tree structure facilitated powerful dynamic programming, and DME algorithms were extended to (1) handle skew in terms of Elmore delay, (2) balance uneven sink capacitance, and (3) minimize wire capacitance under nonzero skew bounds [2]. The DME family of algorithms was a major research achievement, both for its mathematical insights and for its computational performance. BST-DME algorithms [3] developed in the late 1990s reduced skew to single ps in fairly large circuits, while requiring only minutes of CPU time.

Semiconductor scaling in the 1990s made clock optimization more challenging. While transistors continued scaling, interconnect lagged in performance [4]. This phenomenon boosted demands for repeaters in clock networks, raised their power profile, and complicated their synthesis. Research in delay-driven buffering of single signal nets—arguably an easier problem and on a smaller scale—has blossomed well into the late 2000s, leaving clock-tree synthesis a difficult, high-value target. As the accuracy of compact delay models for transistors and wires deteriorated, clock-network design in the industry moved to SPICE-driven optimizations [5, 6].

Clock networks were among the first circuits to suffer the impact of process, voltage and temperature variations. Systematic variations can affect paths to different sinks in different ways, making effective skew higher than nominal skew. Intradie variations may be stronger on some paths than on others, which would further increase effective skew. These challenges have motivated research at the device, circuit, and algorithm levels [7]. In general, smaller sink latencies and shorter tree paths decrease exposure to variations. Some researchers tried to increase the tolerance of buffers to CD changes and to temperature [8], some proposed to tune wires or buffers based on postsilicon measurements [9], and some developed methodologies for inserting cross-links into the trees [10–12], arguing that such links can decrease the impact of variation on skew. Existing literature tends to (1) rely on closed-form delay models during large-scale optimization, (2) frequently focus on a single optimization technique in analysis and evaluation, and (3) neglect the difficulties in modifying highly optimized clock trees. Our work seeks to address these omissions and develops a practical methodology for effective SPICE-accurate optimization, rather than elegant algorithms with provable abstract properties. With process variation in mind, microprocessor designers combine regular meshes with local or global trees [6]. However, meshes have much higher capacitance and use more power.

Our work focuses on clock-network synthesis for ASICs and SoCs, where clock frequencies are not as aggressive as in high-performance CPUs, but power is limited, especially for portable applications. In this context, tree topologies remain the most popular choice, potentially with further tuning and enhancements. The SoC context introduces another twist: layout obstacles. SoCs include numerous pre-designed blocks (CPUs, RAMs, DSPs, etc.) and datapaths. While it may be possible to route wires over such obstacles, buffer insertion is typically not allowed. One can fathom the difficulty of such optimization through comparison to signal-net routing, where obstacle-avoiding Steiner trees currently remain an active area of research [13]. Our contributions include the following:
(i) a careful analysis of design steps and optimizations for high-performance clock trees, including the range, accuracy, and substitutability of specific techniques,
(ii) notions of slowdown and speedup slack for clock trees,
(iii) tree optimizations driven by accurate delay models,
(iv) a simple and robust technique for obstacle avoidance in clock trees subject to slew constraints,
(v) a provably good sink-polarity correction algorithm,
(vi) a methodology for clock-tree optimizations that outperforms the best results at the ISPD’09 contest on every benchmark by 2.15–3.99 times, while reducing skew to 2.2–4.6 ps (Table 5). It outperforms all published results in terms of skew (Table 6). On newer Texas Instruments benchmarks with up to 50 K sinks, skew remains <11 ps.

Selecting best parameters for each benchmark can further improve results, at the cost of increased runtime. But global skew <20 ps is considered very small for ASICs and SoCs.

In the remainder of this paper, Section 2 reviews relevant previous work and the ISPD’09 CNS contest. Section 3 describes our analysis of the clock-network synthesis problem and introduces slowdown and speedup slacks. Major optimization steps are described in Section 4, and Section 5 presents empirical results.

2. Background and Prior Work

DME Algorithms
Traditionally, clock trees have been constructed with respect to simple delay models: geometric pathlength or Elmore delay. In this context, the results in [1, 14–17] show how to build zero-skew trees (ZSTs) with minimal wirelength, improving upon H-trees and fishbones.
The deferred merge embedding (DME) algorithm, which uses the concept of merging segments [1, 14, 15] to construct zero-skew trees, was extended to the bounded-skew tree (BST) problem. BST/DME algorithms [2, 3] generalize merging segments to merging regions. When BST/DME algorithms were introduced in the early 1990s, many chip designs included one large central buffer to drive clock signals through the entire chip. Today traditional clock trees cannot satisfy slew constraints in large ICs because the maximal length of unbuffered interconnect decreased significantly due to technology scaling [4]. Furthermore, the Elmore delay model used by published clock-tree optimizations lost accuracy due to resistive shielding and the impact of slew on delay.
BSTs allow one to trade off a small increase in skew for reduced total wirelength. Figure 1 shows that BSTs are shorter than ZSTs. However, BSTs are less balanced than ZSTs and Elmore delay used in BST generation is inaccurate, thus the capacitance saved on wires can be lost when compensating for skew with accurate timing analysis. After initial buffer insertion, slow sinks and fast sinks are more clustered in ZSTs. Since our skew optimization techniques exploit these clusters, BSTs need greater resources to reach near zero-skew than ZSTs. Table 1 shows the impact of BST skew bounds on final results (CLR is defined at the end of Section 2). The skew bounds during BST construction are based on Elmore delay, and the final results are based on SPICE simulations. Based on overwhelming empirical evidence against BSTs, Contango does not use them.

Obstacle-Avoiding Clock Trees
The concept of merging regions in BST/DME was extended to obstacle-avoiding trees in [18], where (i) obstacles were assumed rectangular, (ii) no routing over obstacles was allowed, and (iii) buffering was not considered. The authors noted that obstacle processing slowed down their BST/DME algorithm and hinted at more advanced geometric data structures. Unlike in [18], the ISPD’09 contest allowed routing but not buffering over obstacles, with SoCs in mind. ISPD’09 benchmarks included abutting obstacles that formed monolithic rectilinear obstacles.

Fast Buffer Insertion
L. van Ginneken introduced an algorithm for buffering RC-trees [19], which minimizes Elmore delay and runs in $O(n^2)$ time, given $n$ possible buffer locations and a buffer specification. It was not intended for clock trees and minimizes worst delay rather than skew. The $O(n \log n)$-time variant of van Ginneken's algorithm proposed in [20] is more appropriate for large trees. A key insight into van Ginneken's algorithm and its faster variant makes them applicable to our work: while trying to minimize source-to-sink latencies, these algorithms insert almost the same number of buffers on every path and therefore result in low skew if the initial tree was already balanced.
Other buffering techniques have been proposed as well, for example, a linear-time algorithm from [21] that minimizes the number of buffers while bounding capacitive load and slew rate, but does not minimize delay or skew. A dynamic program from [22] inserts a limited number of buffers subject to a maximal skew in buffer counts on source-to-sink paths. At the ISPD’09 contest, slew constraints were checked by SPICE, but capacitance limits were relatively generous. Our competitors predominantly used greedy bottom-up buffer-insertion algorithms that added each buffer as high in the tree as possible, while satisfying slew constraints. Such techniques seek to minimize capacitance as the top priority. However, we chose the (faster variant of) van Ginneken's algorithm, which seeks to minimize worst sink latency. Our rationale was that process variations can be moderated by lowering sink latency and that it is relatively easy to slow down paths that are too fast, but it is harder to speed up slow paths. It is difficult to make a rigorous comparison with slew-based buffering [23]. In particular, some of our competitors at the ISPD 2009 contest relied on it and produced relatively poor results, but others did better. In any case, our overall results compare favorably to the best published results, especially in terms of nominal skew, and we were unable to improve them further by using slew-based buffering.
The ISPD’09 clock-network synthesis contest was organized by IBM Austin Research Laboratory and based on a 45 nm technology [24]. Sink latencies and clock skew were evaluated by SPICE. The main objective was the difference between the least sink latency @1.2 V (supply) and the greatest sink latency @1 V (supply). This Clock Latency Range (CLR) metric was intended to capture the impact of multiple power modes with different supply voltages [25], but nominal skew was also recorded. The 10%–90% slew was strictly limited to 100 ps, and total power was strictly limited as well.
Several published papers were inspired by the ISPD’09 contest. Researchers from NTU proposed in [26] a dynamic nearest-neighbor algorithm (DNNA) to generate tree topology and a walk-segment breadth first search (WSBFS) for routing and buffering. To further refine the tree, they used dangling branches to adjust capacitance of wires (see our discussion in Section 4.7). Researchers from NCTU proposed in [27] a three-stage CLR-driven CTS flow based on an obstacle-avoiding balanced clock tree routing algorithm, monotonic parallel buffer insertion, as well as wire-sizing (BIWS) and wire-snaking. A dual-MST (DMST) geometric matching approach was proposed by researchers from HKPU in [28] for topology construction, along with recursive buffer insertion and a way to handle blockages. A timing-model independent buffered clock-tree synthesis was proposed in [29]. The authors proposed a branch-number plan, a cake-cutting partitioning and an embedding-region construction for nonbinary symmetrical buffered clock tree synthesis. They achieved low skew but did not explain how to generate obstacle-avoiding clock trees.

3. Problem Analysis

The design of a clock network offers a large amount of freedom in topology selection, spacing and sizing of inverters, as well as the sizing of individual wires. Traditionally, network topology is decided first. Trees offer unparalleled flexibility in optimization because latency from the root to each sink can be tuned individually, while large groups of sinks can be tuned by altering nodes and edges high up in the tree.

Composite buffers can be built by stacking up inverters in parallel and/or in series. Parallel composition decreases driver resistance, but it increases input pin capacitance, while leaving the intrinsic delay intact. The spacing of buffers is largely responsible for preventing slew violations and also affects clock skew. It is sensitive to driver resistances, the maximal capacitance (wire and input pins) that can be driven by a given composite buffer, as well as branches in the buffer's fanout, which determine the number of input pins driven. A single wire segment can be split into smaller segments, and each can be sized independently.

3.1. Optimization Objectives and Timing Analysis Techniques

Accurate clock network design is complicated by the fact that the optimization objectives are not available in closed form and take significant CPU resources to evaluate. Skew optimization requires much higher accuracy than popular Elmore-like delay models provide. For example, a 5 ps error represents only 1% of 500 ps sink latency, but 50% of 10 ps skew. Closed-form models do not capture resistive shielding in long wires, do not propagate slew with sufficient accuracy, and do not adequately account for slew's impact on delay. Newer, more sophisticated models are laborious to implement and only available in modern commercial tools. Our strategy is to use simple analytical models at the first steps of the proposed flow, namely (1) to construct zero-skew clock trees and (2) to perform initial fast buffer insertion, but to drive further optimizations by SPICE runs, Arnoldi approximation, or any other available timing analysis tool/model.

To minimize the number of time-consuming SPICE invocations, we pursued several techniques. Runtime can be significantly reduced using localization and batch-mode evaluation. During localization, one prunes large portions of the clock tree that do not affect latencies to the sinks impacted by the changes in question [12]. This does not reduce the number of SPICE calls, but rather decreases the complexity of each run. On the other hand, a batch of changes can be evaluated by a single SPICE run, as long as multiple changes do not affect the same path from root to a sink.
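The batching condition can be illustrated with a minimal sketch. It assumes a hypothetical tree representation in which `parent` maps each non-root node to its parent and an edge is identified by its child node; the function name `select_batch` is ours. A set of changes is batchable when no selected edge is an ancestor or a descendant of another selected edge, so that every root-to-sink path sees at most one change.

```python
def select_batch(parent, candidates):
    """Greedily pick edges so that no root-to-sink path holds two picked edges.

    parent:     dict mapping each non-root node to its parent node
                (hypothetical representation; an edge is named by its child).
    candidates: edges in priority order.
    """
    picked = set()
    blocked = set()  # proper ancestors of edges that were already picked
    batch = []
    for edge in candidates:
        if edge in blocked:
            continue  # a picked edge lies below this one
        ancestors = []
        node = edge
        conflict = False
        while node is not None:
            if node in picked:
                conflict = True  # this edge lies below (or equals) a picked edge
                break
            ancestors.append(node)
            node = parent.get(node)
        if conflict:
            continue
        picked.add(edge)
        blocked.update(ancestors[1:])
        batch.append(edge)
    return batch
```

Edges that are skipped in one batch can simply be carried over to the next SPICE run.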

Another avenue to streamlined SPICE-driven optimizations is to use mathematical properties of circuit delay, such as monotonicity, convexity, and linearity with respect to some parameters. Monotonicity and convexity support binary search, where an optimal value is sought on a certain interval. At each step of the search, the middle point of the interval is evaluated by SPICE (e.g., a wire can be sized half-way) and the result determines whether to recur to the left or right half-interval. Linearity enables extrapolation of multiple values based on several SPICE runs.
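The binary search described above can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used in our implementation: `evaluate` is a hypothetical callback standing in for one SPICE run that returns a latency in ps for a tuning parameter `x` (e.g., the sized fraction of a wire), and latency is assumed to be monotonically non-decreasing in `x`.

```python
def tune_by_bisection(evaluate, target, lo, hi, tol=0.5, max_runs=20):
    """Search [lo, hi] for x with evaluate(x) within tol of target.

    evaluate: hypothetical stand-in for one SPICE run; assumed to be
              monotonically non-decreasing in x.
    Returns (x, number_of_evaluations).
    """
    runs = 0
    while runs < max_runs:
        mid = 0.5 * (lo + hi)
        value = evaluate(mid)  # one expensive simulation per step
        runs += 1
        if abs(value - target) <= tol:
            return mid, runs
        if value < target:
            lo = mid  # need more of the parameter: recur on the right half
        else:
            hi = mid  # overshoot: recur on the left half
    return 0.5 * (lo + hi), runs
```

Each step halves the interval, so the number of simulations grows only logarithmically with the required resolution.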

3.2. Nominal Skew Optimization

An initial buffered clock tree is constructed early in the design flow. Assuming no slew violations, the latency $T_s$ of each sink $s$ is known from SPICE simulations (or faster techniques, such as Arnoldi-based delay calculations), at which point minimal and maximal latencies ($T_{\min}$ and $T_{\max}$) can be found (separately for rising and falling transitions, for each PVT corner). Since sink latencies are significantly larger than skew ($T_{\max} - T_{\min}$), skew can be improved by either decreasing $T_{\max}$ (speeding up the slowest sinks) or increasing $T_{\min}$ (slowing down the fastest sinks) without critical adverse effect on sink latencies.

Definition 1. Consider a clock tree and its sink $s$. The slowdown slack $\mathrm{Slack}_{\mathrm{slow}}^{s}$ (speedup slack $\mathrm{Slack}_{\mathrm{fast}}^{s}$) of $s$ is the amount in ps by which the sink latency can be unilaterally increased (decreased) without increasing clock skew. In other words, $\mathrm{Slack}_{\mathrm{slow}}^{s} = T_{\max} - T_s$ and $\mathrm{Slack}_{\mathrm{fast}}^{s} = T_s - T_{\min}$.
Slow sinks often cluster together, and so do fast sinks. Hence, clock skew can be improved by modifying a few nodes or edges high in the tree. To find the desired delay change, we propagate slack information up the tree as follows.
Let $\mathrm{Sinks}_e$ be the set of downstream sinks for edge $e$.

Definition 2. Consider a clock tree and its edge $e$. The slowdown slack $\mathrm{Slack}_{\mathrm{slow}}^{e}$ (speedup slack $\mathrm{Slack}_{\mathrm{fast}}^{e}$) of $e$ is the amount in ps by which the edge delay can be unilaterally increased (decreased) without increasing clock skew.

Lemma 1. For any edge $e$ in the tree, (i) $\mathrm{Slack}_{\mathrm{slow}}^{e} = \min_{s \in \mathrm{Sinks}_e} \mathrm{Slack}_{\mathrm{slow}}^{s}$, (ii) $\mathrm{Slack}_{\mathrm{fast}}^{e} = \min_{s \in \mathrm{Sinks}_e} \mathrm{Slack}_{\mathrm{fast}}^{s}$. Given slacks on $n$ sinks, all edge slacks can be computed in $O(n)$ time.
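The linear-time computation in Lemma 1 amounts to one bottom-up pass over the tree. A minimal sketch follows, assuming a hypothetical representation in which `children` maps each internal node to its child list, sinks are the nodes without children, and the slack stored at a node is the slack of the edge entering that node.

```python
def edge_slacks(children, latency, root):
    """Return (slow, fast): slowdown and speedup slacks in ps, keyed by node.

    children: dict node -> list of child nodes (sinks have no entry).
    latency:  dict sink -> latency in ps.
    The entry for a node is the slack of the edge entering that node.
    """
    t_max = max(latency.values())
    t_min = min(latency.values())
    # Pre-order listing: every node appears before all of its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    slow, fast = {}, {}
    # Reversed pre-order visits children before parents, so one pass suffices.
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:  # sink: Definition 1
            slow[node] = t_max - latency[node]
            fast[node] = latency[node] - t_min
        else:         # edge: minimum over downstream sinks (Lemma 1)
            slow[node] = min(slow[k] for k in kids)
            fast[node] = min(fast[k] for k in kids)
    return slow, fast
```

Each node is visited once and each child is examined once by its parent, which gives the linear bound.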

Lemma 2. For any edge $e$ and its parent in the tree, $\mathrm{Slack}_{\mathrm{slow}}^{e} \ge \mathrm{Slack}_{\mathrm{slow}}^{\mathrm{parent}(e)}$ and $\mathrm{Slack}_{\mathrm{fast}}^{e} \ge \mathrm{Slack}_{\mathrm{fast}}^{\mathrm{parent}(e)}$.

The flexibility of a tree edge is limited by each downstream sink. Therefore, for edges close to the root we often have $\mathrm{Slack}_{\mathrm{slow}}^{e} = \mathrm{Slack}_{\mathrm{fast}}^{e} = 0$. It is important to note that the validity of slacks-related calculations does not depend on the use of specific delay models or SPICE simulations. When visualizing clock trees, we color their edges with a red-green gradient, indicating low slack with red and high slack with green, as shown in Figure 4.

Lemma 2 suggests that instead of changing the delay of an edge, one can change the delay of its downstream edges by an equal amount, as long as only one delay change is applied on each root-to-sink path. When choosing between tree edges on the same path, we prefer (at early stages of optimization) to tune edges as high in the tree as possible, so as to minimize (i) the amount of change, (ii) the risk of introducing slew violations and (iii) power overhead. However, in a highly optimized tree, we tune bottom-level edges where we can better predict the impact on skew. The preference for high-level tree edges can be formalized as follows.

Proposition 1. For each edge $e$ in the tree, define $\Delta_{\mathrm{slow}}^{e} = \mathrm{Slack}_{\mathrm{slow}}^{e} - \mathrm{Slack}_{\mathrm{slow}}^{\mathrm{parent}(e)}$. If every edge is slowed down exactly by $\Delta_{\mathrm{slow}}^{e}$, the tree's skew will become zero, and both slowdown and speedup slacks will become zero.
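Proposition 1 can be checked numerically with a small sketch. It assumes an idealized model in which slowing an edge by some amount adds exactly that amount to the latency of every downstream sink; the tree representation (`children`, `latency`) and both function names are hypothetical, and the root is treated as having a parent slack of zero.

```python
def slowdown_deltas(children, latency, root):
    """Per-edge slowdown amounts: own slack minus parent's slack (edge named by child node)."""
    t_max = max(latency.values())
    slack = {}

    def visit(node):  # bottom-up slowdown slack (Lemma 1)
        kids = children.get(node, [])
        if kids:
            slack[node] = min(visit(k) for k in kids)
        else:
            slack[node] = t_max - latency[node]
        return slack[node]

    visit(root)
    delta = {}

    def assign(node, parent_slack):
        delta[node] = slack[node] - parent_slack
        for k in children.get(node, []):
            assign(k, slack[node])

    assign(root, 0)  # the root's slack is always zero
    return delta


def apply_deltas(children, latency, root, delta):
    """Idealized model: each delta adds to the latency of every downstream sink."""
    new_latency = {}

    def push(node, acc):
        acc += delta[node]
        kids = children.get(node, [])
        if not kids:
            new_latency[node] = latency[node] + acc
        for k in kids:
            push(k, acc)

    push(root, 0)
    return new_latency
```

Along any root-to-sink path the deltas telescope to the slowdown slack of the sink, so every sink ends up at the maximal latency. The recursion is only meant for small illustrative trees.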

Naturally, $\Delta_{\mathrm{fast}}^{e} = \mathrm{Slack}_{\mathrm{fast}}^{e} - \mathrm{Slack}_{\mathrm{fast}}^{\mathrm{parent}(e)}$, and a mirror statement holds. For a tree edge $e$, it is possible that $\Delta_{\mathrm{fast}}^{e} > 0$ and $\Delta_{\mathrm{slow}}^{e} > 0$, facilitating conflicting optimizations. If optimizations are not coordinated well, some edges may be sped up and some slowed down, while the overall skew is unchanged. To avoid such conflicts, one can perform rounds of speedup and rounds of slowdown, separated by SPICE-based analysis and slack update. In practice, it is easier to slow down an edge than to speed it up. Thus, any possible speedup, for example, by using stronger buffers, is performed first. Rounds of speedup and slowdown are more conveniently performed top-down, so that when an edge cannot be tuned by the desired amount, the remainder is passed to its downstream edges.
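The top-down passing of remainders can be sketched as below. This is an illustration rather than our implementation: `delta` holds the desired per-edge slowdown, `capacity` is a hypothetical per-edge bound on how much slowdown the edge can actually absorb (for example, because of slew limits), and edges are named by their child node.

```python
def top_down_slowdown(children, root, delta, capacity):
    """Distribute desired slowdowns top-down, passing any remainder downstream.

    delta:    dict node -> desired slowdown of the edge entering the node (ps).
    capacity: dict node -> maximal slowdown that edge can absorb (ps);
              missing entries are treated as 0 (hypothetical input).
    Returns (applied, unmet): applied slowdown per edge, and the slowdown
    that could not be realized at each sink.
    """
    applied, unmet = {}, {}

    def visit(node, carried):
        want = carried + delta[node]
        got = min(want, capacity.get(node, 0.0))
        applied[node] = got
        kids = children.get(node, [])
        if not kids:
            unmet[node] = want - got
        for k in kids:
            visit(k, want - got)  # every downstream path inherits the remainder

    visit(root, 0.0)
    return applied, unmet
```

Because the remainder is handed to every child edge, each root-to-sink path still accumulates its full desired slowdown whenever the capacities allow it.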

We found that after nominal skew is sufficiently optimized, both rising and falling transitions can individually limit speedup and slowdown slacks. We handle the two transitions separately and define edge slacks as the smaller of rise-slack and fall-slack. Furthermore, speedup and slowdown slacks can be computed for each process corner given (two in the ISPD’09 contest). In order to improve the multicorner CLR objective, a tree edge can be sped up conservatively by the minimum of its speedup slacks, and can be slowed down by the minimum of its slowdown slacks.

3.3. CLR Optimization

Our methodology pursues two objective functions: nominal skew and the ISPD’09 CNS contest metric, CLR, introduced above. Due to significant correlation between CLR and nominal skew, some of the optimizations in our flow target skew optimization, some target CLR, and some address both (see Table 3). In practice this approach achieves a good tradeoff between the two optimization objectives, and is representative of multi-objective optimization required in many practical settings. Recall that the CLR calculation is based on the sink latencies at two different supply voltage settings. There are two main strategies to reduce CLR. First, reducing skew directly contributes to reducing CLR until skew becomes very small (e.g., less than 5 ps). Let sink $L$ be the sink with the least sink latency @1.2 V ($T_{L}^{1.2\mathrm{V}}$) and sink $G$ be the sink with the greatest sink latency @1.0 V ($T_{G}^{1.0\mathrm{V}}$). Then $\mathrm{CLR} = T_{G}^{1.0\mathrm{V}} - T_{L}^{1.2\mathrm{V}}$. Introducing the latency of sink $G$ @1.2 V ($T_{G}^{1.2\mathrm{V}}$), we obtain $\mathrm{CLR} = (T_{G}^{1.0\mathrm{V}} - T_{G}^{1.2\mathrm{V}}) + (T_{G}^{1.2\mathrm{V}} - T_{L}^{1.2\mathrm{V}})$. We call $(T_{G}^{1.0\mathrm{V}} - T_{G}^{1.2\mathrm{V}})$ the variational part of CLR and $(T_{G}^{1.2\mathrm{V}} - T_{L}^{1.2\mathrm{V}})$ the skew part of CLR. The skew part of CLR can be reduced by skew optimization techniques. Since the corner sinks of skew are not always the same as the corner sinks of CLR (sinks $L$ and $G$), CLR needs to be measured after any skew optimization to check for CLR improvement. The second strategy for CLR optimization targets the variational component of CLR. Optimizations for the skew and variational parts of CLR are described in detail in Section 4.
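The decomposition of CLR into its variational and skew parts can be written out as a short sketch. The function name and the dictionary-based inputs are hypothetical; the latencies would come from SPICE runs at the two supply voltages.

```python
def clr_breakdown(lat_1v0, lat_1v2):
    """Split CLR into its variational and skew parts.

    lat_1v0: dict sink -> latency in ps at 1.0 V supply.
    lat_1v2: dict sink -> latency in ps at 1.2 V supply.
    """
    g = max(lat_1v0, key=lat_1v0.get)  # sink G: greatest latency at 1.0 V
    l = min(lat_1v2, key=lat_1v2.get)  # sink L: least latency at 1.2 V
    return {
        "G": g,
        "L": l,
        "clr": lat_1v0[g] - lat_1v2[l],
        "variational": lat_1v0[g] - lat_1v2[g],
        "skew_part": lat_1v2[g] - lat_1v2[l],
    }
```

The two parts always sum to CLR, which makes it easy to see which strategy has more room for improvement on a given tree.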

3.4. Coordinating Multiple Optimizations

We found that different clock-tree optimizations exhibit different strength/range and different accuracy (see Tables 3 and 4).

Our strategy in coordinating clock-tree optimizations is to start with optimizations that offer the greatest range, and then transition to optimizations with greater accuracy. Each step should decrease the main optimization objective sufficiently to be within the range of the next optimization.

4. Proposed SoC Clock-Synthesis Methodology

Our proposed clock-network synthesis methodology and its major algorithmic steps are shown in Figure 2. Contango first builds an initial tree using a ZST/DME algorithm [3] and alters it to avoid obstacles. It then uses an $O(n \log n)$-time variant of van Ginneken’s buffer insertion algorithm [20] to ensure small insertion delay and to satisfy slew constraints. A series of novel clock-tree optimizations are applied next.

4.1. Obstacle-Avoiding Clock Trees

As we pointed out in Section 2, obstacle-avoiding clock trees can be built by repairing obstacle violations in ZSTs. This approach is attractive when large obstacles abut the chip's periphery because ZSTs naturally avoid areas without clock sinks. This approach is also attractive when obstacles are small or thin enough that a buffer inserted immediately before the obstacle can drive the wire over the obstacle, so that no rerouting is necessary. A third convenient case occurs when a wire can be rerouted around the obstacle without an increase in length. Most obstacles are rectangular in shape, but such rectangles may abut, creating rectilinear-shaped obstacles. When two obstacles abut, we cannot place a buffer between them, and therefore handle them as one compound obstacle. Contango detours wires using the following algorithm, illustrated in Figure 3 for a composite obstacle.

Step 1. Identify all wires that intersect obstacles. For each point-to-point connection, perform shortest-path maze routing around the obstacles. For subtrees that cross an obstacle, find L-shaped segments that link points inside and outside the obstacle. For each L-shape, choose one of the two possible configurations that minimizes overlap with the obstacle.

Step 2. When a wire crosses an obstacle, Contango captures an entire subtree enclosed by the obstacle (see Figure 3). The total capacitance of the subtree is then measured and compared to the capacitance that can be driven by the driving buffer without risking slew violations. Subtrees that can be driven by the driving buffer do not require detours.

Step 3. For obstacles crossed by a subtree that cannot be safely driven by the driving buffer, Contango establishes a detour along the contour of the obstacle as follows. First, the entire contour is considered a detour. Then, to ensure that the clock network remains a tree, one segment is removed between tree sinks adjacent along the contour. If we were to minimize total capacitance, we would remove the longest segment of the contour between two adjacent tree sinks. However, we minimize the longest detoured source-to-sink path and, therefore, remove the segment furthest from the tree source (counting distances along the contour). In other words, we first find the sink most distant from the source along the contour and include in the detour the entire shortest path to the source. The other segment incident to the sink is removed, but the shortest path from its other end to the source is included (see Figure 3).
Modern SoC layouts are littered with obstacles, which upset regular structures such as meshes and H-trees. In the ISPD 2009 contest, such layouts required numerous detours. Detouring may significantly increase skew, but the subsequent skew optimization techniques can compensate for that.
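The capacitance test in Step 2 can be sketched as follows. All names, units, and the lumped-capacitance model are assumptions made for illustration; in particular, `max_drive_cap_ff` stands for the load that the driving buffer can handle without risking slew violations, which in practice would be calibrated by simulation.

```python
def subtree_needs_detour(wire_lengths_um, sink_caps_ff,
                         wire_cap_ff_per_um, max_drive_cap_ff):
    """Compare the lumped capacitance of an obstacle-enclosed subtree with
    the load its driving buffer can safely drive.

    Returns (needs_detour, total_cap_ff). All parameters are hypothetical.
    """
    wire_cap = wire_cap_ff_per_um * sum(wire_lengths_um)
    total_cap = wire_cap + sum(sink_caps_ff)
    return total_cap > max_drive_cap_ff, total_cap
```

Only subtrees for which the check returns true proceed to the contour-based detour of Step 3.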

4.2. Composite Inverter/Buffer Analysis

Most technology libraries support dedicated clock buffers or inverters that are larger and more reliable than those for signal nets. Industry designs usually offer at least six different sizes. Parallel composition of buffers increases driver strength, helping with slew constraints and improving robustness to variations. Yet, buffer sizes must be moderated to satisfy total power limits. For a given buffer library, we consider many possible composite buffers. Using dynamic programming, we select several nondominated configurations that can be further evaluated during buffer insertion. Algorithmic details are omitted here because the ISPD’09 contest used only two inverter types, large and small. Table 2 shows that eight parallel small inverters exhibit smaller output resistance than one large inverter, and smaller input/output capacitance. Hence, Contango used 8× small inverters instead of large inverters, in batches of 16×, 24×, and so forth. This benchmark-independent optimization, along with buffer sizing, plays an important role in our methodology.
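The notion of nondominated composite configurations can be illustrated with a brute-force filter, used here in place of the dynamic program whose details are omitted. The sketch assumes an idealized parallel composition in which output resistance divides by the number of inverters and capacitances add up. The cell parameters in the usage note are made-up round numbers chosen for illustration and are not the values from Table 2.

```python
def parallel(cell, k):
    """k identical inverters in parallel (idealized): R divides, C adds."""
    return {
        "name": f"{k}x{cell['name']}",
        "r": cell["r"] / k,
        "c_in": cell["c_in"] * k,
        "c_out": cell["c_out"] * k,
    }


def dominates(a, b):
    """a dominates b if it is no worse in every metric and better in one."""
    keys = ("r", "c_in", "c_out")
    return all(a[k] <= b[k] for k in keys) and any(a[k] < b[k] for k in keys)


def nondominated(configs):
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]
```

For example, with hypothetical cells `small = {"name": "small", "r": 400.0, "c_in": 4.0, "c_out": 6.0}` and `large = {"name": "large", "r": 60.0, "c_in": 35.0, "c_out": 80.0}`, the composite of eight small inverters has lower resistance and lower capacitances than one large inverter, so the large inverter is filtered out.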

4.3. Initial Buffer Insertion with Sizing

Given a clock tree with buffers, it is easy to increase the latency of a given sink, but it is difficult to speed up a sink. Therefore, our strategy is to first make sinks as fast as possible, and then reduce skew with wiresnaking and wiresizing. When buffers are inserted into an Elmore-balanced tree, source-to-sink paths contain practically the same numbers of buffers (can be off by one in some cases).

We adapted the $O(n \log n)$-time variant of van Ginneken's algorithm from [20]. Due to its speed, it can be launched with different inverter configurations, effectively performing simultaneous optimization across multiple parameters. Our experiments indicate that driver strength is a major factor in moderating the impact of supply-voltage variations. Therefore, to reduce the variational part of CLR, $T_{G}^{1.0\mathrm{V}} - T_{G}^{1.2\mathrm{V}}$ (Section 3.3), Contango performs fast buffer insertion with different composite buffers until it finds the best-performing solution with the strongest composite buffers within 90% of the power limit. Slew-constraint violations are not a concern at this point since minimizing delay involves avoiding high slew rate (recall that there is positive correlation between delay and slew rate). The experiments on various clock trees with initial buffer insertion suggest that even the worst slew rate is well under 60% of the slew limit. We reserve $\gamma = 10\%$ of the power budget to facilitate more accurate optimizations.
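The selection loop over composite buffers can be sketched as follows. This is a simplified illustration: `run_buffering` is a hypothetical callback that performs one fast buffer-insertion run for a configuration and reports the resulting power, and each configuration carries a hypothetical `strength` field used only for ordering.

```python
def pick_composite(configs, run_buffering, power_limit, reserve=0.10):
    """Try composite buffers from strongest to weakest and return the first
    whose buffered tree fits within (1 - reserve) of the power limit.

    run_buffering: hypothetical callback, cfg -> {"power": ...}.
    reserve:       fraction of the budget kept for later optimizations.
    """
    budget = (1.0 - reserve) * power_limit
    for cfg in sorted(configs, key=lambda c: c["strength"], reverse=True):
        result = run_buffering(cfg)  # one fast buffer-insertion run
        if result["power"] <= budget:
            return cfg, result
    return None, None  # no configuration fits the budget
```

The speed of the buffer-insertion algorithm is what makes it practical to repeat the run once per candidate configuration.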

The $O(n \log n)$ variant of van Ginneken's algorithm [20] used in our work assumes that all available clock buffers preserve polarity; therefore, the use of inverters typically leads to incorrect polarity at some sinks. The buffering algorithm can be extended to directly account for sink polarity, or it can be postprocessed by inserting additional inverters near sinks with incorrect polarity. To this end, we use the polarity-correction approach described in our conference paper [30]. In practice, it requires very few additional buffers, and its skew overhead is small enough to be compensated for by our downstream optimizations.

4.4. Buffer Sliding and Interleaving

We now discuss targeted improvement of robustness to variations in device performance. The iterative buffer sizing introduced in Section 4.5 is primarily used to reduce the variational component of CLR ($T_{G}^{1.0\mathrm{V}} - T_{G}^{1.2\mathrm{V}}$), while buffer sliding and interleaving are applied as preliminary steps. Extensive experiments suggest that the impact of variations on skew is best reduced by (i) decreasing sink latency (insertion delay), and (ii) using the strongest possible buffers. Since our initial buffer insertion algorithm focuses on the former metric with the latter metric as a secondary objective, it is possible to further improve the variational component of CLR ($T_{G}^{1.0\mathrm{V}} - T_{G}^{1.2\mathrm{V}}$) by emphasizing the latter metric. Therefore, based on the results of initial buffer insertion, Contango attempts to size buffers up.

Sizing up a single inverter increases its input pin capacitance and can lead to slew violations. To prevent such violations, it is often possible to slide the inverter up the tree to reduce upstream wire capacitance and interleave an inverter when two inverters move too far apart after sliding. The increase in downstream wire capacitance is balanced with the increase in the inverter’s driving strength. Sizing a single inverter may increase the skew and require further correction. Therefore, we focused on the top-most levels of the tree, whose impact on skew is relatively small. Given a clock source at the chip boundary, DME algorithms generate a long wire leading to the center of the chip, and the tree branches out from the center. This long wire, the tree trunk, is later populated with a chain of inverters, which can be up- or downsized without significant impact on skew because this equally affects all sinks. However, since roughly 1/3 to 1/2 of sink latency is due to the tree trunk, it accounts for a large fraction of variational impact on latency.

The trunk’s variational impact is different for voltage and process variations, and this must be accounted for during optimizations. Stronger buffers in the trunk reduce the sensitivity of latency to supply voltage (e.g., in the case of different power modes), and help optimize the CLR objective from the ISPD 2009 contest. However, process variations in the trunk do not affect skew. In the ISPD 2010 contest, process variations were included in the skew constraint, while the primary objective was to minimize total capacitance. Therefore, one of the successful strategies was to weaken the buffers in the tree trunk and make the saved capacitance available to other optimizations.

4.5. Iterative Buffer Sizing

After sliding and interleaving top-level buffers, we invoke iterative buffer sizing. First, this algorithm sizes up buffers in the tree trunk. At the $i$-th iteration of buffer sizing, Contango sizes up the composite inverters by at most $p_i = 100/(i+3)\%$. The iterations continue as long as results improve without slew violations. Buffer sizing in tree branches incurs a greater capacitance penalty. To compensate, Contango borrows capacitance by downsizing bottom-level buffers.
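The shrinking step bound can be illustrated with a minimal Python sketch. It assumes iterations are numbered from 1 and that each step uses the full bound $p_i$ (the text only states "at most"); the `evaluate` callback is a hypothetical stand-in for a SPICE run and is not part of Contango.

```python
def max_upsize_percent(i: int) -> float:
    """Upper bound p_i (in percent) on the size increase at iteration i.

    Assumption: iterations are numbered from 1, so p_1 = 25%.
    """
    if i < 1:
        raise ValueError("iterations are numbered from 1")
    return 100.0 / (i + 3)


def iterative_upsize(size: float, evaluate, max_iters: int = 20) -> float:
    """Grow `size` by p_i percent per iteration while `evaluate` accepts it.

    `evaluate(new_size)` is a hypothetical callback standing in for a SPICE
    run: it returns True if results improved without a slew violation.
    """
    for i in range(1, max_iters + 1):
        candidate = size * (1.0 + max_upsize_percent(i) / 100.0)
        if not evaluate(candidate):
            break  # keep the last accepted size
        size = candidate
    return size
```

Because $p_i$ decreases with $i$, early iterations make coarse changes and later iterations make progressively finer ones.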

However, sizing up buffers after the trunk often makes the tree unbalanced in terms of skew and creates more work for the subsequent skew optimization algorithms. For better performance of skew optimizations, typically 4 or 5 levels after the first branch are sized up by the capacitance-borrowing buffer sizing algorithm.

4.6. Iterative Top-Down Wiresizing

Before skew optimization, Contango computes slowdown slacks at every edge as described in Section 3, along with the $\Delta\mathrm{slow}_e$ parameters. Each parameter indicates the amount by which a given tree edge $e$ can be slowed down before skew would be negatively affected. Since fast sinks often cluster together, skew can be lowered by slowing down either many bottom-level wires or few wires higher in the tree. Our top-down algorithm pursues the latter, seeking to minimize tree modifications.

We build an ad hoc linear model based on the impact of downsizing a unit-length ($l_{ws}$) wire segment. Contango chooses several independent wire segments of the same length $l_{ws}$ in the middle of the tree and downsizes them to observe the impact on latencies of downstream sinks, ensuring that every sink is affected by only one downsized wire. This requires a single SPICE run and produces a single parameter $T_{ws}$, the maximal latency increase caused by downsizing a unit-length ($l_{ws}$) wire segment. When downsizing a wire, the scaling factor $k$ is calculated as $\mathrm{Slack}_e$ divided by $T_{ws}$, and a length $k \times l_{ws}$ of the wire is downsized. When $k$ is small, the latency increases almost linearly since the downsized length is much smaller than the length of the wire. Therefore, we can estimate that the maximum latency increase is at most $k \times T_{ws}$. To exploit this linearity, we limit $k$ by $k_{\max}$, which is determined experimentally by observing the threshold at which the linearity breaks down significantly. The scaling factor $k$ can also be limited by slew constraints, since wiresizing typically increases slew rate because of the increase in resistance. Even when $k < k_{\max}$ holds, Contango does not allow any downsizing on a wire whose downstream node has slew rate above 80% of the slew limit.
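The rules for choosing $k$ can be summarized in a short Python sketch. The function name, argument names, and the assumption that all times share one unit (e.g., ps) are illustrative and not taken from Contango's implementation.

```python
def downsizing_factor(slack_e: float, t_ws: float, k_max: float,
                      slew: float, slew_limit: float,
                      slew_guard: float = 0.8) -> float:
    """Return k, the number of unit lengths l_ws to downsize on one edge.

    slack_e    : slowdown slack of the edge
    t_ws       : worst-case latency increase per downsized unit length
    k_max      : experimentally chosen cap that keeps the model near-linear
    slew       : slew measured at the edge's downstream node
    slew_guard : fraction of the slew limit above which downsizing is refused
    """
    if t_ws <= 0:
        raise ValueError("t_ws must be positive")
    if slew > slew_guard * slew_limit:
        return 0.0  # downstream slew is too close to the limit
    if slack_e <= 0:
        return 0.0  # no slack to spend on this edge
    return min(slack_e / t_ws, k_max)
```

Under the linear model, the estimated worst-case latency increase is then `k * t_ws`, and the downsized length is `k * l_ws`.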

Since we selected $T_{ws}$ as the maximal latency increase from the SPICE simulation, the actual increase (calculated by SPICE) is smaller. Our modifications are intentionally conservative to avoid an excessive increase of latency, which would increase the maximal latency of the tree and consequently increase slack for the entire tree. After running SPICE, collecting sink latencies and recomputing slowdown slacks, Contango repeats top-down wiresizing to reduce skew based on current data. This process is performed iteratively until the objective function (CLR or nominal skew) stops improving. Iterative wiresizing is detailed in Algorithm 1.

$T_{ws}$ = TwsEstimation();
repeat
  SaveSolution(); ComputeWireSlacks();
  $Q = \{\mathrm{root}\}$; $RSlack = \{0\}$; $i = 0$;
  while $i < \mathrm{size}(Q)$ do
    if ($Slack[Q_i] - RSlack_i > T_{ws}$) then
      $k = (Slack[Q_i] - RSlack_i) / T_{ws}$;
      DownSize($Wire[Q_i]$, $k$); $RSlack_i$ += $k \, T_{ws}$;
    end if
    for $j = 1$ to Size($Child[Q_i]$) do
      $Q$.push($Child[Q_i][j]$); $RSlack$.push($RSlack_i$);
    end for
    ++$i$;
  end while
  SpiceSimulation();
until (no improvement or slew violation)
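One pass of the inner `while` loop of Algorithm 1 might look as follows in Python. This is a sketch under stated assumptions: the `Node` class and its fields are hypothetical, the optional `k_max` cap comes from the prose rather than the listing, and the outer SPICE-driven `repeat` loop is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    slack: float                      # slowdown slack of the wire entering this node
    children: list = field(default_factory=list)
    downsized_k: float = 0.0          # unit lengths downsized on the incoming wire


def top_down_wiresizing_pass(root: Node, t_ws: float,
                             k_max: float = float("inf")) -> None:
    """Breadth-first traversal that spends slack as high in the tree as possible."""
    queue = [root]
    rslack = [0.0]        # slack already consumed on the path from the root
    i = 0
    while i < len(queue):
        node = queue[i]
        remaining = node.slack - rslack[i]
        if remaining > t_ws:
            k = min(remaining / t_ws, k_max)
            node.downsized_k = k          # stands in for DownSize(Wire[Q_i], k)
            rslack[i] += k * t_ws
        for child in node.children:
            queue.append(child)
            rslack.append(rslack[i])      # children inherit the consumed slack
        i += 1
```

Propagating the consumed slack to children is what prevents a wire lower in the tree from being slowed down again for delay that an ancestor has already absorbed.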

4.7. Iterative Top-Down Wiresnaking

Wiresizing can reduce large skew by applying small changes, which is appropriate after the initial tree construction. An experienced clock-network designer suggested to us that a small amount of wiresnaking is often used to improve clock skew, as long as added capacitance does not significantly affect power. Wiresnaking alters a given route so as to increase its length and can be applied on fast paths.

We develop an accurate top-down wiresnaking process, which we invoke after top-down wiresizing. This step uses the same slowdown slack computation we described earlier. A SPICE simulation is performed (another accurate delay model can be used) to measure $T_{wn}$, the worst-case delay of wiresnaking with unit length $l_{wn}$. The value of $l_{wn}$ affects the accuracy of the wiresnaking algorithm; a smaller $l_{wn}$ offers greater accuracy but typically leads to more SPICE runs since skew reduction in each round of top-down wiresnaking is smaller. We set $l_{wn}$ based on empirical analysis of the 45 nm technology used at the ISPD contest before contest benchmarks became available.

The applicability of wiresnaking depends on the VLSI context. If the clock tree is competing for routing resources with signal nets, then every effort should be taken to reduce the utilization of routing resources. In particular, wiresnaking cannot be used in areas of routing congestion (also, clock trees should avoid such areas to minimize crosstalk noise). On the other hand, some ICs include abundant routing resources. This is the case for pad-limited designs and designs whose area is determined by large IP blocks. The number of available metal layers also plays a major role in the design of clock trees, and can vary dramatically between different designs, ranging from 6 to 12 layers as of 2010. In some high-performance designs, clock networks are given a dedicated metal layer, which makes wiresnaking much more attractive.

One of the top-three teams at the ISPD 2009 clock-tree routing contest (NTU [26]) used dangling wires instead of wiresnaking. Rather than elongate a route, this strategy adds a dead-end branch. The goal is to increase wire capacitance, and, therefore, increase the delay. In comparing dangling wires to wire-snaking, we note that the former does not alter the resistance that affects propagation delay. Therefore, to achieve a particular slowdown, a much longer wire branch is needed. On the positive side, the dependence of delay increase on branch length is linear, and this may allow for more accurate tuning. In other words, this technique offers a potentially greater accuracy, but smaller range because the range of such optimizations is limited by the capacitance budget. Therefore, if dangling wires are found useful, they should be used at a later stage in the optimization flow.
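The contrast between the two techniques can be illustrated with the Elmore delay model. The following Python sketch is only a first-order approximation (the optimizations in this paper rely on SPICE), and it assumes a single wire driven through an upstream resistance `r_up`, with the dangling stub attached at the driver output; all names and parameters are hypothetical.

```python
def snaking_delay_increase(r: float, c: float, r_up: float,
                           wire_len: float, c_load: float,
                           extra_len: float) -> float:
    """Elmore delay increase at the sink when the route grows by extra_len.

    r, c   : wire resistance and capacitance per unit length
    r_up   : upstream (driver) resistance
    c_load : sink capacitance at the end of the wire
    """
    def delay(length: float) -> float:
        # Driver sees all capacitance; the wire's own resistance sees half
        # of its distributed capacitance plus the load.
        return r_up * (c * length + c_load) + r * length * (c * length / 2 + c_load)

    return delay(wire_len + extra_len) - delay(wire_len)


def dangling_delay_increase(c: float, r_up: float, stub_len: float) -> float:
    """Elmore delay increase at the sink from a dead-end stub of length stub_len.

    The stub adds capacitance behind r_up only; its resistance is not on the
    source-to-sink path, so the increase is linear in stub_len.
    """
    return r_up * c * stub_len
```

In this model the snaking term grows faster than linearly with the added length, while the stub term is exactly linear and smaller for the same added length, which matches the accuracy-versus-range tradeoff described above.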

4.8. Bottom-Level Fine-Tuning and Limits to Further Optimization

After two top-down skew reduction phases, skew becomes small enough to perform bottom-level optimizations. Bottom-level wiresnaking optimizes the wires directly connected to sinks. This technique is more accurate than the top-down optimizations since each sink is tuned individually. Contango performs SPICE-driven bottom-level wiresnaking until the results stop improving. Typically the gain of bottom-level tuning is under 2 ps, but it can be a significant fraction of the remaining skew.

We found that with skew <5 ps, the corner sinks for the rising transition and the falling transition are often different.

This rise-fall divergence makes further improvements to the clock tree very difficult. Indeed, reducing rising skew by slowing down a fast sink for rising transition may increase falling skew due to excessive slowdown of a slow sink for falling transition. In the Contango flow, the average skew after bottom-level tuning is 3.21 ps on ISPD’09 CNS contest benchmarks.

Table 3 shows the improvement of CLR and skew by each optimization algorithm. Note that after iterative buffer sizing (TBSz), skew is increased but CLR does not change much. This implies that TBSz reduced the variational part of CLR ($T^{1.0\,\mathrm{V}}_{G} - T^{1.2\,\mathrm{V}}_{G}$) significantly. TBSz is performed before skew optimization, because it increases the skew part of CLR ($T^{1.2\,\mathrm{V}}_{G} - T^{1.2\,\mathrm{V}}_{L}$). The increased skew is reduced below 5 ps after our skew optimizations.
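The two parts of CLR named above can be computed from per-sink latencies as in the following Python sketch. It assumes that the subscripts $G$ and $L$ denote the greatest and least sink latency at the given supply voltage, so that the two parts sum to $T^{1.0\,\mathrm{V}}_{G} - T^{1.2\,\mathrm{V}}_{L}$; the function and argument names are illustrative.

```python
def clr_components(lat_low_v, lat_high_v):
    """Split CLR into its variational and skew parts.

    lat_low_v  : sink latencies simulated at 1.0 V
    lat_high_v : latencies of the same sinks simulated at 1.2 V
    Returns (variational, skew, clr), all in the unit of the inputs.
    """
    t_g_low = max(lat_low_v)      # greatest latency at 1.0 V
    t_g_high = max(lat_high_v)    # greatest latency at 1.2 V
    t_l_high = min(lat_high_v)    # least latency at 1.2 V
    variational = t_g_low - t_g_high
    skew = t_g_high - t_l_high
    return variational, skew, variational + skew
```

An optimization that raises the skew part while lowering the variational part by a similar amount leaves the total nearly unchanged, which is the pattern reported for TBSz.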

5. Empirical Validation

To validate our proposed techniques, we first present results on ISPD’09 benchmarks with detailed comparisons to state-of-the-art academic clock network synthesis tools according to the contest protocol, then discuss the significance of specific optimizations used by Contango, and then evaluate the scalability of our C++ implementation on larger benchmarks from our industry colleagues. We measured runtimes on a 2.4 GHz Intel QuadCore CPU running Linux, similar to CPUs used at the ISPD contest.

ISPD’09 benchmarks include seven 45 nm chips up to 17 mm × 17 mm in size, with up to 330 selected clock sinks [24]. Table 5 compares results of our software Contango to the top three teams of the ISPD’09 clock-network synthesis contest. On average, Contango reduces CLR by 2.15×, 3.99× and 2.35× versus contest results by NTU, NCTU and U. of Michigan respectively, excluding failures of NTU and NCTU on benchmarks with many obstacles. All results are within the capacitance limits, but Contango nearly exhausts the limits as a part of its strategy. On ISPD’09 benchmarks, maximum sink latency averages 1120 ps, while the average number of composite-buffer locations is 223. A clock tree built by Contango is shown in Figure 4.

More recent results for ISPD’09 benchmarks from ASPDAC’10 [26–28] are summarized in Table 6. These works include a Dynamic Nearest-Neighbor Algorithm (DNNA) for topology construction. The results in Table 6 show that Contango outperforms NTU and NCTU in skew and CLR. HKPU [28] claims a 20% advantage in CLR, but more than doubles nominal skew. Another interesting aspect of the HKPU work is that they rely on SPICE very little in their optimizations and instead use the Elmore delay model, which explains their low runtimes. The algorithms in [28] focus entirely on the optimization of nominal skew, which does not explain the results: high nominal skew and low CLR. As the authors of [28] have kindly provided their clock trees on our request, we observed that those trees use very large buffers at the top levels of the tree (including but not limited to the trunk) and small buffers toward the sinks. This strategy minimizes the impact of supply voltage variations, but makes it more difficult to optimize nominal skew given a limited capacitance budget.

Significance of Individual Optimization
Several optimizations we have implemented were superseded by more powerful techniques. For example, skew reduction by buffer insertion was unnecessary and undermined the robustness to variations. However, it can be used as a last resort when detours around obstacles introduce extremely high skew. Our wiresizing can be refined but probably not beyond the accuracy of subsequent wiresnaking. In practice, wiresnaking is very limited, so as to preserve the routability of signal wires (unless clock wiring is given a dedicated metal layer). Dangling wires, used by NTU instead of wiresnaking, would be even less acceptable.
To further study the relative significance of optimizations in Contango, we show in Table 4 the impact of removing each skew optimization step from the flow. It can be seen that each step is necessary to achieve competitive results. Removing top-down wiresizing has the greatest impact because this optimization offers the greatest range, and subsequent optimizations cannot fully compensate for its omission.

Scalability Studies
The ISPD’09 contest was limited to unrealistically small numbers of sinks due to limitations of the open-source ngSPICE software [31] it relied upon. To evaluate the scalability of our optimizations, we replaced ngSPICE with industry-standard HSPICE software [32]. (The numbers produced by ngSPICE and HSPICE were fairly close, with the main difference being runtime and scalability.) Working with a recent Texas Instruments chip sized 4.2 mm × 3.0 mm, we identified locations of 135 K sinks and randomly sampled them to create a family of benchmarks. For this experiment, our algorithm used groups of large inverters instead of groups of 8 parallel small inverters, improving runtime eightfold at the cost of increasing CLR and skew by 1–2 ps and increasing capacitance by 15%. It produced highly optimized clock trees with up to 50 K sinks. Table 7 shows that total capacitance scales linearly with the number of sinks, and skew remains in the single-digit picosecond range. The number of HSPICE runs grows very slowly, but HSPICE remains the bottleneck.

6. Conclusions

Existing literature on clock networks offers several elegant algorithms but does not describe end-to-end solutions to clock-network synthesis that can handle modern interconnect. Our work makes several contributions to this end. First, we develop specialized optimization algorithms necessary to bridge the gaps between well-known point-optimizations. Our emphasis is on robust techniques that do not require tuning and are amenable to embedding into design flows. Second, we develop an EDA methodology for integrating clock-network optimization steps. Third, we describe a robust software implementation, called Contango, that outperforms best results from the ISPD’09 contest [24] by a factor of two. (The use of two wire sizes, two inverter types, and two process corners in the ISPD’09 contest is not a limitation of our algorithms and methodology. Likewise, any accurate delay evaluator can be used, including FastSpice and Arnoldi approximations.) Fourth, we scale our implementation to large industrial clock networks.

Based on their strong empirical results, our techniques may improve timing and power of future ASICs and SoCs [5]. In CPU designs, our trees can be integrated with meshes [6]. Here, better trees may facilitate smaller meshes and reduce power consumption, which can be traded off for higher performance or longer battery life in portable applications.