Research Article  Open Access
Yu Shen Lin, Damu Radhakrishnan, "Delay Efficient 32-Bit Carry-Skip Adder", VLSI Design, vol. 2008, Article ID 218565, 8 pages, 2008. https://doi.org/10.1155/2008/218565
Delay Efficient 32-Bit Carry-Skip Adder
Abstract
The design of a 32-bit carry-skip adder to achieve minimum delay is presented in this paper. A fast carry look-ahead logic using group generate and group propagate functions is used to speed up the performance of multiple stages of ripple-carry adders. The group generate and group propagate functions are generated in parallel with the carry generation for each block. The optimum block sizes are decided by taking the critical path into account. The new architecture delivers the sum and carry outputs in fewer unit delays than existing carry-skip adders. The adder is implemented in 0.25 µm CMOS technology at 3.3 V. The critical delay for the proposed adder is 3.4 nanoseconds. The simulation results show that the proposed adder is 18% faster than the current fastest carry-skip adder.
1. Introduction
The ever-increasing demand for mobile electronic devices requires the use of power-efficient VLSI circuits. Computations in these devices need to be performed using low-power, area-efficient circuits operating at greater speed. Addition is the most basic arithmetic operation, and the adder is the most fundamental arithmetic component of the processor. Depending on the area, delay, and power consumption requirements, several adder implementations, such as ripple-carry, carry-skip, carry look-ahead, and carry-select, are available in the literature [1, 2]. The ripple-carry adder (RCA) is the simplest adder, but it has the longest delay because every sum output needs to wait for the carry-in from the previous full-adder cell. It uses O(n) area and has a delay of O(n) for an n-bit adder. The carry look-ahead adder has O(log n) delay and uses O(n log n) area. On the other hand, the carry-skip and carry-select adders have O(√n) delay and use O(n) area [3].
In this paper, we present the design of a low-power adder with less delay while using minimum hardware. The standard carry generate-propagate logic is used to reduce the critical delay of the adder, while blocks of RCAs are used for lower power consumption. In our design, the generate-propagate logic balances the delay, and the number of inputs to the skip logic limits the critical path delay. By applying our design procedure, we speed up the adder by 18% when compared to the current fastest 32-bit adder [4]. In Section 2, we discuss the previous work done in the area of high-performance adders. In Section 3, we present the design of our adder. Section 4 presents the design of a few basic CMOS cells used in the adder. In Section 5, we present the simulation results for our adder and compare it to other fast adders.
2. Theoretical Background and Previous Work
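The design of a carry-skip adder is based on the classical definition of generate and propagate signals as follows [1, 2]:

$$p_i = a_i \oplus b_i, \qquad g_i = a_i b_i,$$

where $p_i$ is the propagate signal and $g_i$ is the generate signal, and $a_i$ and $b_i$ are the input operands to the $i$th adder cell. The carry out from the $i$th adder cell is expressed as

$$c_{i+1} = g_i + p_i c_i,$$

where $c_i$ is the carry input to the $i$th cell.
Two signals, group generate and group propagate, are also defined in [1, 2] and are given by

$$G_{i:j} = g_j + p_j g_{j-1} + p_j p_{j-1} g_{j-2} + \cdots + p_j p_{j-1} \cdots p_{i+1} g_i, \qquad P_{i:j} = p_j p_{j-1} \cdots p_i,$$

where $G_{i:j}$ and $P_{i:j}$ are the group generate and group propagate signals from the $i$th cell to the $j$th cell, respectively. Then, the expression for the carry out from the whole group is given by

$$c_{j+1} = G_{i:j} + P_{i:j} c_i.$$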
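The definitions above can be checked with a short sketch. It is only an illustration under stated assumptions: the propagate signal is taken in its XOR form, and the function names (`pg_signals`, `ripple_carries`, `group_pg`, `check_definitions`) are hypothetical helpers, not names from the paper.

```python
def bit(x, i):
    """Return bit i of the non-negative integer x."""
    return (x >> i) & 1


def pg_signals(a, b, n):
    """Per-bit propagate (assumed XOR form) and generate signals, LSB first."""
    p = [bit(a, i) ^ bit(b, i) for i in range(n)]
    g = [bit(a, i) & bit(b, i) for i in range(n)]
    return p, g


def ripple_carries(p, g, c0):
    """Carries c[0..n] from the recurrence c[i+1] = g[i] + p[i]*c[i]."""
    c = [c0]
    for pi, gi in zip(p, g):
        c.append(gi | (pi & c[-1]))
    return c


def group_pg(p, g, i, j):
    """Group generate and group propagate over cells i..j (inclusive)."""
    G, P = 0, 1
    for k in range(i, j + 1):
        # A generate from lower cells survives only if cell k propagates it.
        G = g[k] | (p[k] & G)
        P &= p[k]
    return G, P


def check_definitions(n=4):
    """Exhaustively compare the signal-level carries with integer addition."""
    for a in range(2 ** n):
        for b in range(2 ** n):
            for c0 in (0, 1):
                p, g = pg_signals(a, b, n)
                c = ripple_carries(p, g, c0)
                assert c[n] == (a + b + c0) >> n
                for i in range(n):
                    for j in range(i, n):
                        G, P = group_pg(p, g, i, j)
                        # Group carry out equals G + P * (carry into cell i).
                        assert c[j + 1] == G | (P & c[i])
    return True
```

Calling `check_definitions(4)` walks through every pair of 4-bit operands and both carry-in values, so it confirms that the group-level expression agrees with the cell-by-cell recurrence for every contiguous group.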
Different adder implementations have been developed to optimize various design parameters. Most adder implementations tend to trade off performance and area. One of the earliest adder implementations of this kind was a regular parallel adder layout, also known as the Brent-Kung adder '82 [5]. It is a variation of the basic carry look-ahead adder. Brent and Kung emphasized the need for regularity in VLSI circuits to reduce design and implementation costs. They use two types of processor cells: white processors and black processors. The black processor performs the associative concatenation defined in [5], and the white processor simply transmits the data. The adder delay was calculated in terms of the number of exclusive-OR (XOR) operations performed while treating each XOR delay as one unit time. For an n-bit adder, the Brent-Kung adder has a delay of O(log n) and uses O(n log n) area.
Wei-Thompson '85 [6] proposed an area-time optimal adder design using three types of adder cells: black cells, white cells, and driver cells. The black and white cells are quite similar to the ones used in the Brent-Kung adder. They divided the n-bit adder into ascending and descending halves so as to limit the number of bits in the final stage. The concentration of the maximum number of bits was in the middle of the adder and was defined as the height of the adder. The algorithm ends up in an unbalanced binary tree with a delay of consuming an area
The ELM adder design presented in [7] computes the sum bits in parallel, thereby reducing the number of interconnects. It implements an n-bit adder as a tree of processors to directly compute the sums in O(log n) time. The area used is O(n log n). The adder design was expressed in terms of standard cells, which do not compute the carry for each stage. Instead, partial sums were computed for each stage.
Kantabutra '93 [8] presents the design of a one-level carry-skip adder using an approach that is very similar to that of Wei-Thompson. In contrast to Wei-Thompson's approach, this design ends up in a symmetrical binary tree of adders. The fan-in to the carry-skip logic increases linearly towards the middle of the adder. A two-level carry-skip adder is presented in [9], where the whole adder stage is divided into a number of sections, each consisting of a number of RCA blocks of linearly increasing length. These adders reduce the delay at the cost of an increase in area and a less regular layout.
Nagendra '96 [3] surveyed various adder designs and concluded that the ELM adder was superior in terms of area, power, delay, and power-delay product. The RCA was found to use the least power but to have the highest delay due to its carry chain. A variable-width carry-skip adder was shown to be superior to a constant-width carry-skip adder, the advantage being greater at higher precisions.
A fully static carry-skip adder designed by Chirca '04 [4] achieved lower power dissipation and higher performance. To reduce delay and power consumption, the adder is divided into variable-sized blocks that balance the inputs to the carry chain. The main principle behind this design was to utilize the lower blocks and make them work in parallel with the higher blocks. That design is a deviation from the tree approach presented in the ELM adder. A 32-bit adder implementation with a delay of 7 logic levels using carry-skip adders and ripple-carry adders was presented in [4]. This is shown in Figure 1. The logic-level delay defined in that paper is equivalent to the delay of a complex CMOS gate. Efficient AND-OR-invert (AOI) and OR-AND-invert (OAI) CMOS gates were used to reduce delay and power.
The 32-bit adder is divided into 4 adder blocks as shown in Figure 1. Carry-select adders were used in the final CS4 block, which significantly increases the hardware. The paper claims that the output will be ready with a delay of 7 logic levels, with the assumption that the critical delay path is the carry propagation path of bit. But a closer examination of the previous block CS18 reveals that the th bit of the sum output will be available only after a delay of 9 logic levels.
3. New Design for the 32-Bit Carry-Skip Adder
The 32-bit carry-skip adder design presented in this paper uses a combination of RCAs together with carry-skip logic (SKIP), carry-generate logic (CG), and group generate-propagate logic (PG). The complete adder is divided into a number of variable-width blocks. Both the carry generation and skip logic use AOI and OAI circuits. The width of each block is limited by the target delay T.
Each block is further divided into subblocks. A subblock may contain additional levels of subblocks in a recursive manner. The lowest-level subblock is formed by a number of variable-width RCAs. The adder structure is described as follows:
The 32-bit adder is divided into four blocks. A block diagram of the first three blocks is shown in Figure 2. The first block (LSB) is a full adder by itself. The carry from the first block is fed into the second block and is also fed into the skip logic. The generate and propagate functions are generated separately for each full adder in one unit time, where one unit time is defined as the delay of a complex CMOS gate with at most three transistors connected in series from the output node to any supply rail. In Figure 2, the numbers shown in parentheses represent the signal arrival times, in unit delays, at the corresponding signal leads. Since the delay of a complex CMOS gate is quadratic in its stack height, the stack height in our design is limited to 3. This implies that the maximum number of transistors (NMOS or PMOS) in any series-connected path is 3. This also restricts the maximum number of inputs to the carry-skip logic to 7. On the other hand, when the generate-propagate outputs are used for group generation and group propagation outputs, a stack height of 3 in the CMOS implementation will allow a 4-bit RCA.
The carry-generation delay from the skip logic is minimized by alternately complementing the carry outputs. Hence, the carry signals generated alternate between true and complemented form from one block to the next. For the very first 1-bit block, the carry-generation logic is more important than the sum-generation logic since the overall delay of the adder depends on the carry from this block. Hence, this block is designed to minimize the carry-out delay as much as possible. The carry out from the LSB full adder is the majority function of the two operand bits and the input carry, and an AOI gate implements it.
The second block in Figure 2 is implemented as a k-bit RCA. For any k-bit RCA, the total number of propagate and generate outputs is 2k. These 2k outputs, together with the carry from the previous block, are fed into the carry-skip logic to generate the new carry signal. The fan-in restriction of 7 on the carry-skip logic (2k + 1 ≤ 7) therefore limits the number of bits in the RCA to 3. Since the generate and propagate signals are best implemented in complementary form, the carry out from the skip logic for this block is rewritten in terms of the complemented signals. By inspection, it can then be implemented by an OR-AND-invert (OAI) gate and is available in 2 time units. The final sum output from this 3-bit RCA will be available in 4 time units. The sum outputs for this RCA are generated in one of two forms, depending on the value of the carry signal.
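The fan-in bound and the complemented-form carry can be illustrated with a minimal sketch. The exact gate-level expression used by the authors is not reproduced here; the code assumes XOR-form propagate signals and applies De Morgan's law to the standard block carry, and the names `max_rca_bits`, `carry_out_complement`, and `check_three_bit_block` are hypothetical.

```python
from itertools import product

MAX_SKIP_INPUTS = 7  # fan-in limit on the carry-skip logic quoted in the text


def max_rca_bits(max_inputs=MAX_SKIP_INPUTS):
    """Largest k with 2*k + 1 <= max_inputs (k propagates, k generates, one carry)."""
    return (max_inputs - 1) // 2


def carry_out_complement(p_n, g_n, c_in_n):
    """Complemented block carry from complemented inputs (lists are LSB first).

    Uses not(g + p*c) = not(g) * (not(p) + not(c)), applied bit by bit, so only
    AND and OR of the complemented signals are needed. Inverting this result,
    as an OR-AND-invert gate does, yields the true carry.
    """
    t = c_in_n
    for pk_n, gk_n in zip(p_n, g_n):
        t = gk_n & (pk_n | t)
    return t


def check_three_bit_block():
    """Compare against integer addition for every 3-bit operand pair."""
    for a, b, c_in in product(range(8), range(8), (0, 1)):
        p_n = [1 - (((a ^ b) >> k) & 1) for k in range(3)]
        g_n = [1 - (((a & b) >> k) & 1) for k in range(3)]
        c_out = (a + b + c_in) >> 3
        assert carry_out_complement(p_n, g_n, 1 - c_in) == 1 - c_out
    return True
```

With the limit of 7 inputs, `max_rca_bits()` gives 3, which matches the 3-bit RCA in the text.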
Now consider the third block in Figure 2. The carry signal arrives at the input of the skip logic after 2 time units. This implies that the group generate-propagate (PG) logic outputs feeding the skip logic must also be available in 2 time units. Hence, the inputs to the PG logic must be available in 1 time unit, which implies that they must be the propagate and generate signals of the full adders. The third block is divided into three subblocks (in this case, each subblock is an RCA). The maximum width of each RCA is limited to 4 bits due to the fan-in restrictions imposed on the PG block. The width of each RCA is also limited by the target delay of the 32-bit adder: the width of the first RCA is the target delay T minus the arrival delay of the carry output from the previous block. The width of all remaining higher-order RCAs in the same block will be 1 bit less because their carry inputs arrive one time unit later. The carry inputs to the second and third RCAs are generated using AOI logic.
For a target delay of 6 time units, the width of the first RCA in the third block is 4 bits and the widths of the remaining RCAs are 3 bits each. The number of RCAs in the third block is limited to 3 due to the fan-in restriction of 7 on the skip logic. Each RCA in the third block also represents a subblock of that block. The carry out from the skip logic is implemented using AOI logic.
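The widths quoted here can be reproduced with a small sketch. The rule it encodes is inferred from the worked numbers in the text (one time unit per full adder, a 4-bit cap from the fan-in limit, and later RCAs receiving their carry one unit later); it is an assumption for illustration, and `rca_widths` is a hypothetical helper rather than the authors' formula.

```python
def rca_widths(target_delay, carry_arrival, n_rcas, max_width=4):
    """Widths of the RCAs in one block, first (lowest-order) RCA first.

    Assumptions, inferred from the worked example in the text:
    - each full adder in a ripple chain adds one time unit,
    - the first RCA receives its carry at `carry_arrival`,
    - every later RCA receives its carry one time unit after that,
    - no RCA may exceed `max_width` bits (fan-in limit).
    """
    if n_rcas < 1:
        raise ValueError("a block needs at least one RCA")
    first = max(min(target_delay - carry_arrival, max_width), 0)
    later = max(min(target_delay - carry_arrival - 1, max_width), 0)
    return [first] + [later] * (n_rcas - 1)


# Third block: target delay 6, carry arrives after 2 units, three RCAs.
third_block = rca_widths(target_delay=6, carry_arrival=2, n_rcas=3)
```

Under these assumptions the third block comes out as 4, 3, and 3 bits, 10 bits in total. The same rule with a carry arriving after 3 time units and four RCAs gives 3, 2, 2, and 2 bits, the split the text reports for the first subblock of the final block.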
A detailed block diagram of the first three blocks of the 32-bit adder (an expanded view of Figure 2) is shown in Figure 3. The three blocks together form a 14-bit adder.
Next, let us consider the final (fourth) block of the 32-bit adder. This block is divided into a number of subblocks. The maximum number of subblocks is again limited to 3 due to the fan-in restrictions on the skip logic. A block diagram of the fourth block with an expanded view of its first subblock is shown in Figure 4. The subblock is further divided into RCAs. The number of inputs to the CG logic increases successively by 2 for each RCA and is limited to a maximum of 7 in any subblock. Hence, the number of RCAs in any subblock is limited either by the number of inputs to the CG block or by the number of inputs to the PG block. Therefore, subblock 0 can accommodate 4 RCAs. The carry input to the skip logic, as well as to the first RCA, arrives in 3 time units. The propagate and generate signals from each RCA will be available with a delay of 1 time unit. This implies that we can have two levels of logic inside the PG block while satisfying the time delay constraints. For the same target delay of 6 time units, the width of the first RCA is 3 bits, and the widths of the remaining RCAs are 2 bits each. Hence, the total width of this subblock is 9 bits.
Figure 5 shows the fourth block with an expanded view of its second subblock. The number of RCAs in this subblock is limited to 3 due to the condition stated earlier. With an AOI logic implementation, the carry input to the first RCA of this subblock will be available in 4 time units, thereby limiting the length of the first RCA to 2 bits. The carry inputs to the remaining RCAs in this subblock are also available in 4 time units. Thus, the maximum width of the second subblock is 6 bits.
The maximum width of the final subblock can be calculated as 4 bits. This subblock can accommodate only 2 RCAs due to the fan-in limits of the CG blocks. Hence, the total width of the fourth block is 19 bits. By combining the 4 blocks, a 33-bit adder can be implemented. The width of the final subblock can be shortened to 3 bits for a 32-bit adder.
With an OAI logic implementation, the carry out from the skip logic of the fourth block is generated in 4 time units. A detailed block diagram of the fourth block is shown in Figure 6. The final breakdown of the 32-bit adder into 4 blocks is shown in Figure 7. A reduction in hardware can be achieved by moving a subblock out of the fourth block and placing it as a separate block. This will eliminate one carry-generate logic (OAI) and additional logic.
Although our adder has already achieved the 32-bit requirement, we still have room to extend the width further while keeping the target delay the same. The schemes for the fifth and sixth blocks are shown in Figure 8. The fifth block is divided into three subblocks, structured in the same way as those of the fourth block. Since the carry fed into the fifth block has 4 unit delays, the maximum width of the first RCA will be 2 bits. The remaining RCAs will be 1 bit each. Thus, the maximum width of the fifth block will be 20 bits. The first subblock (11 bits) is divided into subblocks of 5, 3, 2, and 1 bit. The second subblock (6 bits) is divided into 3 subblocks of 3, 2, and 1 bit. Similarly, the final subblock (3 bits) is divided into subblocks of 2 and 1 bit. The first 5-bit subblock consists of a 2-bit RCA and 3 individual full adders. Individual full-adder cells form all other subblocks. The sixth block is a single-bit full adder. Thus, the total width of the adder becomes 54 bits.
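The block widths reported in this section can be tallied to confirm that they add up. The sketch below only restates numbers from the text as nested lists; two details are assumptions: the 4-bit final subblock of the fourth block is split as 2 + 2 (the text gives only its width and its RCA count), and the 54-bit total is taken to include the fourth block at its maximum width of 19 bits. The names `width`, `FIRST_FOUR_BLOCKS`, `FIFTH_BLOCK`, and `SIXTH_BLOCK` are hypothetical.

```python
def width(block):
    """Total number of bits in a block given as an int or a nested list of widths."""
    if isinstance(block, int):
        return block
    return sum(width(part) for part in block)


# Blocks listed from the LSB end; inner lists are subblocks, innermost are RCAs.
FIRST_FOUR_BLOCKS = [
    1,                                   # first block: a single full adder
    3,                                   # second block: one 3-bit RCA
    [4, 3, 3],                           # third block: three RCAs
    [[3, 2, 2, 2], [2, 2, 2], [2, 2]],   # fourth block: 9 + 6 + 4 bits (2 + 2 assumed)
]

FIFTH_BLOCK = [
    [[2, 1, 1, 1], [1, 1, 1], [1, 1], [1]],  # 11 bits, split as 5, 3, 2, 1
    [[1, 1, 1], [1, 1], [1]],                # 6 bits, split as 3, 2, 1
    [[1, 1], [1]],                           # 3 bits, split as 2, 1
]
SIXTH_BLOCK = 1  # a single full adder

max_four_block_width = width(FIRST_FOUR_BLOCKS)
extended_width = max_four_block_width + width(FIFTH_BLOCK) + SIXTH_BLOCK
```

On these numbers the first three blocks give 14 bits, the four blocks together give 33 bits, and adding the 20-bit fifth block and the 1-bit sixth block gives 54 bits. Shortening the final subblock of the fourth block by one bit yields the 32-bit adder.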
Based on the adder design procedure, we can derive a formula for calculating the maximum number of full adders in every block. The derivation uses the following quantities: T, the target delay of the n-bit adder in time units; the number of RCAs in block i; and the width of RCA j in block i.
For any block , the number of RCAs is defined by a recursive function . The recursive function is not valid for the blocks and , and the values for and when used in the recursive function are assumed to be zero. The width of an RCA is defined in terms of the target delay. The width of RCA j in any block i is defined as where min is the minimum value among and
The carry input to the first RCA of the block can be obtained directly from the previous carryskip stage. Hence, the calculation of width for the first block is done differently from the others.
The maximum number of full adders in block is given by
Table 1 lists the maximum adder size for a given target delay using our design procedure.

4. Design of Basic CMOS Cells
A few basic CMOS cells are used for the design of the adder stage. They are the AOI, OAI, and FA cells. Three different cells are used for AOI and OAI (3-input, 5-input, and 7-input). These cells are labeled AOIn and OAIn, where n refers to the number of inputs to the cell. The 3-input and 5-input cells are implemented in a straightforward manner, each as a direct realization of its Boolean expression. When the 7-input AOI and OAI cells are implemented in the same manner, the delay is prohibitive, and hence we decided to implement them as a cascade connection of a number of smaller modules. Since this reduces the stack height of the transistors connected in series from 4 to 3, the 7-input AOI and OAI cells are sped up and their propagation delay is almost the same as that of the 5-input AOI and OAI. The full-adder cell used in our design is the low-energy CMOS adder cell presented in [10].
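To make the cell descriptions concrete, here is a hedged sketch of the 3-input cells and of one possible cascade for a 7-input carry function. The Boolean forms are the conventional AOI/OAI definitions, assumed here rather than taken from the paper, and the decomposition in `carry_cascade` is only one way to bring the longest product term down from four literals to three; it is not necessarily the authors' circuit. All function names are hypothetical.

```python
from itertools import product


def aoi3(a, b, c):
    """Conventional 3-input AND-OR-invert: not(a*b + c)."""
    return 1 - ((a & b) | c)


def oai3(a, b, c):
    """Conventional 3-input OR-AND-invert: not((a + b) * c)."""
    return 1 - ((a | b) & c)


def carry_flat(g2, p2, g1, p1, g0, p0, c):
    """7-input carry as a flat sum of products; the last term has 4 literals."""
    return g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c)


def carry_cascade(g2, p2, g1, p1, g0, p0, c):
    """Same function as a cascade: a 3-input cell feeds a 5-input stage."""
    inner_n = aoi3(p0, c, g0)   # complement of g0 + p0*c
    inner = 1 - inner_n         # inverter shown explicitly for clarity
    # Longest product term now has 3 literals instead of 4.
    return g2 | (p2 & g1) | (p2 & p1 & inner)


def check_cells():
    """Exhaustive check of the AOI/OAI duality and the cascade equivalence."""
    for a, b, c in product((0, 1), repeat=3):
        # De Morgan: OAI on true inputs is the inverse of AOI on complemented inputs.
        assert oai3(a, b, c) == 1 - aoi3(1 - a, 1 - b, 1 - c)
    for bits in product((0, 1), repeat=7):
        assert carry_flat(*bits) == carry_cascade(*bits)
    return True
```

The number of literals in the longest product term stands in for the series transistor stack height, which is why splitting off the innermost term is one way to respect a stack limit of 3.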
5. Simulation
The adder was implemented using Tanner Tools Pro 11.03. L-Edit was used to generate the layout and T-Spice was used to perform the simulation. A generic 0.25 µm CMOS technology was used with a 3.3 V supply voltage. The different CMOS cells (AOI, OAI, and FA) were simulated for worst-case delays, and the delays are tabulated in Table 2. From Table 2, it may be noted that the 5- and 7-input cell delays are comparable to that of the FA, while the 3-input cells have a much smaller delay. The average power was measured by feeding 10,000 random vectors at a frequency of 500 MHz and is also shown in Table 2.

For comparison purposes, we selected two other types of adders: (i) the 32-bit carry-skip adder proposed in [4] and (ii) the 32-bit multilevel carry-skip adder proposed in [11]. The first is referred to here as the Chirca adder and the second as the Gayles adder. These adders were compared with our 32-bit adder by measuring the critical path delays. To get a more realistic estimate of the delays involved, we laid out the complete 32-bit adder stages and performed T-Spice simulation. The simulation was carried out at a frequency of 100 MHz. The simulation results are shown in Table 3. These results show that our 32-bit adder has the minimum delay of 3.4 nanoseconds, while the Gayles adder exhibited the maximum delay of 4.39 nanoseconds. The Chirca adder had a delay of 4.15 nanoseconds. Thus, our design has a speedup of 18% and 22% compared to the Chirca and Gayles adders, respectively. Our 32-bit adder was then extended to a 54-bit adder with a marginal delay increase, and these simulation results are also included in Table 3. Even this 54-bit adder is found to be faster than the 32-bit Gayles adder.
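The speedup figures can be recomputed from the three critical-path delays quoted above. The sketch assumes that speedup means the reduction in delay relative to the reference adder; the text does not state the definition, and `relative_speedup` is a hypothetical helper.

```python
def relative_speedup(reference_ns, proposed_ns):
    """Fractional delay reduction of the proposed adder versus a reference adder."""
    if reference_ns <= 0:
        raise ValueError("reference delay must be positive")
    return (reference_ns - proposed_ns) / reference_ns


PROPOSED_NS = 3.4   # proposed 32-bit adder
CHIRCA_NS = 4.15    # Chirca adder [4]
GAYLES_NS = 4.39    # Gayles adder [11]

speedup_vs_chirca = relative_speedup(CHIRCA_NS, PROPOSED_NS)
speedup_vs_gayles = relative_speedup(GAYLES_NS, PROPOSED_NS)
```

Under this definition the values are about 0.181 and 0.226, which is consistent with the quoted 18% and 22% if the percentages are truncated rather than rounded.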

The power consumption of our adder showed a marginal increase compared to the Gayles adder while outperforming the Chirca adder. Overall, our 32-bit adder achieved the lowest power-delay product.
6. Conclusions
In this paper, we presented a new 32-bit adder using carry-skip logic. The adder was implemented by dividing it into several blocks. The size of each block is limited by the delay of the carry-in signal and the final target delay. An algorithm is used to calculate the maximum size of the adder satisfying the target delay. The delay of a full adder is used as the unit of measurement in our analysis. The adder has been implemented by generating the layout with a generic 0.25 µm CMOS technology. The T-Spice simulations carried out at a frequency of 100 MHz and a supply voltage of 3.3 V showed a critical path delay of 3.4 nanoseconds. The comparison results show that our adder is faster than the Chirca and Gayles carry-skip adders. Overall, our proposed adder is 18% and 22% faster than the Chirca and Gayles adders, respectively. Furthermore, a 54-bit adder implemented using our approach can operate at almost the same speed as a 32-bit Chirca adder or Gayles adder. Even though our adder has a marginal increase in power consumption compared to the Gayles adder, overall, we achieved the lowest power-delay product.
References
[1] I. Koren, Computer Arithmetic Algorithms, A. K. Peters, Natick, Mass, USA, 2nd edition, 2002.
[2] B. Parhami, Computer Arithmetic Algorithms and Hardware Designs, Oxford University Press, Oxford, UK, 2000.
[3] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 10, pp. 689-702, 1996.
[4] K. Chirca, M. Schulte, J. Glossner et al., "A static low-power, high-performance 32-bit carry skip adder," in Proceedings of the EUROMICRO Symposium on Digital System Design (DSD '04), pp. 615-619, Rennes, France, August-September 2004.
[5] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Transactions on Computers, vol. 31, no. 3, pp. 260-264, 1982.
[6] B. W. Y. Wei, C. D. Thompson, and Y. F. Chen, "Time optimal design of a CMOS adder," in Proceedings of the 19th Annual Asilomar Conference on Circuits, Systems, and Computers, pp. 186-191, Pacific Grove, CA, USA, November 1985.
[7] T. P. Kelliher, R. M. Owens, M. J. Irwin, and T.-T. Hwang, "ELM-a fast addition algorithm discovered by a program," IEEE Transactions on Computers, vol. 41, no. 9, pp. 1181-1184, 1992.
[8] V. Kantabutra, "Designing optimum one-level carry-skip adders," IEEE Transactions on Computers, vol. 42, no. 6, pp. 759-764, 1993.
[9] V. Kantabutra, "Accelerated two-level carry-skip adders-a type of very fast adders," IEEE Transactions on Computers, vol. 42, no. 11, pp. 1389-1393, 1993.
[10] S. Goel, S. Gollamudi, A. Kumar, and M. Bayoumi, "On the design of low-energy hybrid CMOS 1-bit full adder cells," in Proceedings of the 47th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS '04), vol. 2, pp. 209-212, Hiroshima, Japan, July 2004.
[11] E. Gayles, R. M. Owens, and M. J. Irwin, "Low power circuit techniques for fast carry-skip adders," in Proceedings of the 39th IEEE Midwest Symposium on Circuits and Systems, vol. 1, pp. 87-90, Ames, Iowa, USA, August 1996.
Copyright
Copyright © 2008 Yu Shen Lin and Damu Radhakrishnan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.