Research Article  Open Access
Architectures and Arithmetic for Low Static Power Consumption in Nanoscale CMOS
Abstract
This paper focuses on leakage reduction at the architecture and arithmetic levels. A methodology for considerable reduction of the static power consumption is shown. Simulations are done in a typical 130 nm CMOS technology. Based on the simulation results, the static power consumption is estimated and compared for different filter architectures. Substantial power reductions are shown in both FIR filters and IIR filters. Three different types of architectures, namely, bit-parallel, digit-serial, and bit-serial structures, are used to demonstrate the methodology. The paper also shows that the relative power ratio is strongly dependent on the word length used; that is, the gain in power ratio is larger for longer word lengths. A static power ratio of 0.48 is shown for the bit-serial FIR filter and a power ratio of 0.11 is shown in the arithmetic part of the FIR filter. The static power ratio in the IIR filter is 0.36 in the bit-serial filter and 0.06 in the arithmetic part of the filter. It is also shown that the amount of storage, such as registers, relative to the arithmetic part affects the power ratio. The relatively lower power consumption in the IIR filter compared to the FIR filter is due to the lower use of registers.
1. Introduction
Power consumption is becoming a major obstacle in future circuit design. According to Moore’s law, adding more functionality at an exponential rate will also increase the power consumption at the same pace. From the power consumption perspective, the dynamic power consumption has long been the major concern when integrating digital CMOS circuits.
However, in today’s technologies the static power consumption is an important part of the total power consumption. In fact, in ITRS predictions [1] the static power will dominate below the 65 nm technology node. A number of methods to limit the static power consumption on the device level as well as the logic level have been shown in the literature. Examples are dual-threshold design and self-reverse biasing of transistor stacks [2]. However, to reduce the total power consumption, all abstraction levels must be considered. That is, the system, algorithm, architecture, arithmetic, logic, and device levels are all important for reducing the total power consumption. The focus of this paper is mainly static, but also dynamic, power reduction at the arithmetic and architecture levels, an area where relatively few papers are published. Results on the architectural and arithmetic levels are presented in [3, 4]. In particular, [3] shows how to trade off circuitry against transitions for an optimum power consumption in bit-parallel architectures.
This paper focuses on architectures using bit-serial, digit-serial, and bit-parallel arithmetic in order to make the best choice for reducing the static power consumption in regions where the dynamic power consumption can be neglected, that is, in low data rate applications. Such applications can be wireless and battery-operated medical applications or sensor network applications.
The idea of bit-serial arithmetic was invented a long time ago. An early paper that describes the arithmetic is [5]. Another publication related to information technology is [6]. More recently, bit-serial architectures in the area of filters and signal processing related to VLSI can be found in [7]. Digit-serial arithmetic is often an alternative to bit-serial arithmetic. The arithmetic is based on the same idea as bit-serial arithmetic; compare Figure 1(b) and 1(c). One good reason to choose digit-serial arithmetic is that the needed clock frequency is lower for the same throughput. Various aspects of digit-serial architectures are covered in [8–10].
Figure 1: (a) Bit-parallel carry ripple adder, (b) bit-serial adder, (c) 4-bit digit-serial adder.
This paper is organized as follows. Section 2 gives a short background to static power consumption and Section 3 presents the methodology for arithmetic reduction of the static power consumption. Section 4 describes the circuitry used in the simulations. In Section 5 the simulation results are presented. They are the basis for Section 6, where the results are applied to FIR filter structures, and for Section 7, where the methodology is applied to IIR filters.
2. Background
Transistor leakage is the main source of the static power consumption. It consists of many components, particularly in technologies below 50 nm [2]. Examples are gate oxide tunneling, junction band-to-band tunneling (BTBT), drain induced barrier lowering (DIBL), and gate induced drain leakage (GIDL) [2].
However, in present technology generations, for example, 65 nm, the main contributor to the leakage is the subthreshold current, which often is described by (1) [2].
From the equation, it can be seen that the leakage is dependent on the transistor width W. The total static power is thus dependent on the number of transistors on the chip; that is, it increases with the number of arithmetic blocks that are used. A good concept for reducing the static power would therefore be to use arithmetic that requires a low number of arithmetic blocks. Bit-serial arithmetic uses very few arithmetic blocks, which instead are reused many times.
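For orientation, a commonly used textbook form of the subthreshold current is sketched below. It is an assumption that this matches the structure of (1); the symbols I_{0}, n, and v_{T} are introduced here for illustration only. The point relevant to the argument is the linear dependence on the width W.

```latex
I_{\mathrm{sub}} = I_{0}\,\frac{W}{L}\,
  e^{(V_{GS}-V_{th})/(n\,v_{T})}
  \left(1 - e^{-V_{DS}/v_{T}}\right),
\qquad v_{T} = \frac{kT}{q}
```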
3. Arithmetic for Static Power Reduction
Figure 1(a) illustrates the concept of bit-parallel arithmetic, using a carry ripple adder. In the parallel adder all bits, a_{i} and b_{i}, are available to the adder at the same time. To get the bits s_{i} of the output sum, the carry, c_{i}, has to ripple through all the adder cells in the adder.
Figure 1(b) shows the corresponding bit-serial adder [11]. The signals, a_{i} and b_{i}, are provided serially, with the LSBs first, to form the corresponding output sum. The output carry c_{i} is delayed one clock cycle to be added to the next, more significant, input bits. In Figure 1(c), a 4-bit digit-serial adder is shown as an example. In this adder the word to be processed is divided into digits. It basically has the same function as the bit-serial adder, but here every fourth carry is stored.
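The three adder styles of Figure 1 can be sketched with one behavioural model. This is a minimal illustration, not the circuit used in the paper; the function names `full_adder` and `serial_add` and the `digit` parameter are assumptions introduced here. A digit size of 1 corresponds to the bit-serial adder, a digit size of n to the bit-parallel adder.

```python
def full_adder(a, b, c):
    """One-bit full adder cell: returns (sum bit, carry out)."""
    total = a + b + c
    return total & 1, total >> 1


def serial_add(a, b, n, digit=1):
    """Add two n-bit words LSB first, `digit` bits per clock cycle.

    digit=1 models a bit-serial adder, digit=n a bit-parallel ripple adder.
    Returns (sum modulo 2**n, clock cycles used, adder cells used).
    """
    if n % digit != 0:
        raise ValueError("word length must be a multiple of the digit size")
    carry = 0      # content of the carry register between clock cycles
    result = 0
    cycles = 0
    for start in range(0, n, digit):
        # Within one cycle the carry ripples through `digit` adder cells.
        for i in range(start, start + digit):
            s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
            result |= s << i
        cycles += 1  # the carry out of the digit is stored for the next cycle
    return result, cycles, digit


print(serial_add(5, 9, 12, digit=1))   # bit-serial: 1 cell, 12 cycles
print(serial_add(5, 9, 12, digit=4))   # digit-serial: 4 cells, 3 cycles
print(serial_add(5, 9, 12, digit=12))  # bit-parallel: 12 cells, 1 cycle
```

The model makes the trade-off explicit: the number of adder cells and the number of clock cycles per word multiply to the word length.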
3.1. Conceptual Discussion
In general, bit-parallel arithmetic uses n one-bit arithmetic units, adder cells, to process n-bit words, whereas bit-serial arithmetic only requires a single one-bit arithmetic unit to process n-bit words. The reduction in the number of arithmetic units is thus traded off against a higher number of clock cycles.
Bit-serial arithmetic shows essentially the same properties as bit-parallel arithmetic for the dynamic power consumption, which is illustrated by the left term in (2) and (3):
In the left term of (2), n one-bit units are used once, but in (3), only one single unit is used n times. The clock frequency is thus increased n times for the same throughput. Both circuits will thus charge and discharge the same amount of capacitance per addition. However, it should be noted that (2) and (3) are conceptual. Bit-parallel arithmetic often suffers from glitching, which increases the dynamic power, sometimes substantially. In the case of bit-serial arithmetic, the register and the AND gate are neglected for the moment. However, they will be taken into account from Section 4 onward.
The big difference appears when comparing the static power consumption in the two adders. The static power consumption is dependent on the number of transistors, as shown in (1), that is, the number of adder cells in this case. As shown in the right term of (2), the bit-parallel adder dissipates n equivalent leakage currents, whereas the bit-serial arithmetic, which only uses one single unit, only dissipates one equivalent current, as shown in (3). Bit-serial arithmetic thus has the potential to have n times lower static power consumption; for example, a 12-bit combinatorial bit-parallel datapath will have a static power consumption that is 12 times higher than that of the bit-serial datapath.
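One way to write down the conceptual comparison described for (2) and (3) is sketched below, with the dynamic power as the left term and the static power as the right term. The symbols \alpha (switching activity), C (switched capacitance of a one-bit unit), f (word rate), and I_{\mathrm{leak}} (equivalent leakage current of one unit) are assumptions introduced here; the notation used in (2) and (3) may differ.

```latex
P_{\text{bit-parallel}} = n\,\alpha\,C\,V_{DD}^{2}\,f \;+\; n\,I_{\mathrm{leak}}\,V_{DD}
\\
P_{\text{bit-serial}} = \alpha\,C\,V_{DD}^{2}\,(n f) \;+\; I_{\mathrm{leak}}\,V_{DD}
```

In this form the left terms are identical, while the right terms differ by the factor n.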
As shown in (2) and (3), the bit-serial arithmetic needs a clock frequency at bit level for the same throughput as the bit-parallel arithmetic. However, that does not mean that the bit-serial adder cannot reach the same maximum throughput. The carry has to go through the same number of adder cells in both cases, which means that the total delay for an adder operation will be about the same. In the cells used, described later in Section 5, the bit-serial addition has a simulated maximum throughput that is about half that of the bit-parallel addition. The highest throughput for the two adders is thus comparable.
If the bit-serial arithmetic still gives an inconveniently high clock frequency, an alternative is digit-serial arithmetic, where the word is divided into digits instead of single bits [12, 13]; see Figure 1(c). Digit-serial arithmetic is thus a tradeoff; it gives more leakage than bit-serial but less than bit-parallel arithmetic.
4. Circuitry
A Cadence environment with Spectre as the circuit simulator has been used for the circuit simulations. Five cells are designed: a full adder (FA), a half adder (HA), a register cell, an AND gate, and an inverter cell. The full adder and the AND gate will be used in the simulations in Section 5 and the three others will also be used later in Sections 6 and 7. The same adder cell is used in the bit-parallel and the bit-serial design. A mirror-adder cell, containing 28 transistors, is used for the full adders; see [14, page 567]. A transmission-gate register cell, using 16 transistors, is used for the registers; see [14, page 335].
All transistor lengths are minimum sized, that is, 120 nm. The transistor widths, W, used in the cells are shown in Table 1. The widths depend on the number of stacked transistors: each width is multiplied by the number of transistors stacked in series. Furthermore, due to the difference in mobility between n-channel and p-channel transistors, the p-type transistors are made wider: the minimum width W = 280 nm is doubled for a p-type transistor. Table 2 shows the simulated static power consumption in all cells at a supply voltage of 1.2 V.


5. Simulation Results
A typical 130 nm transistor model, provided by UMC, has been used in the simulations. Both bit-parallel and bit-serial adders have been simulated to show the difference in power consumption. Two different cases, with word lengths of 12 and 24 bits, are used as examples to illustrate the total power consumption. The simulations also show that the dynamic and static power consumption can be extracted separately. The simulations are performed at maximal switching activity in all cases, but without the glitching that often occurs in bit-parallel structures.
Figure 2 shows the power consumption for 12-bit additions on a logarithmic scale for both bit-parallel and bit-serial adders. The figure shows the total power consumption for throughputs up to 500 MWords/s.
To the right, at high throughputs, the diagram shows a linear increase in power consumption. The power consumption is thus dominated by the dynamic power consumption, which has a linear dependence on frequency, as expected from (2) and (3).
In Figure 2, the total power consumption dominated by the static power consumption is shown to the left in the diagram. The flat part of the diagram shows that the power is more or less constant; that is, the power consumption is dominated by the static power consumption and hence the dynamic power consumption can be neglected.
The power consumption for 24-bit additions is, in the same manner, shown in Figure 3. The total power consumption for throughputs up to 250 MWords/s is shown in the figure. In this case too, the static and dynamic power consumption can be seen separately in the two regions.
Another good measure is the energy per switching event, that is, the power-delay product (PDP), defined as P · t_{p}, where P is the power consumption and t_{p} is the propagation delay. The PDP is an invariant measure when the dynamic power consumption dominates, since the PDP is proportional to the power consumption and to the inverse of the throughput. In Table 3, the PDP is shown for higher throughputs. It can be seen that the PDP is about twice as high for the bit-serial arithmetic in that region, and that it is about two times higher when the word length is doubled.

For the lower throughputs, where the static power consumption dominates, the PDP decreases with increasing throughput. The reason is that the power consumption is constant in this case but the throughput is not. The difference in PDP is thus less illustrative in that region.
When examining the two diagrams, in Figures 2 and 3, some important observations can be made.
5.1. Higher Throughputs
It can be noted that the power consumption for the bit-serial arithmetic is higher than that for the bit-parallel arithmetic at higher throughputs, by factors of 1.95 and 1.91 for the 12-bit and the 24-bit additions, respectively. This higher dynamic power consumption is caused by the extra circuitry, in the form of one register cell and one AND gate, in the bit-serial case. Hence, the conceptual discussion of (2) and (3) gives too optimistic an estimate, but it is still a reasonable assumption.
5.2. Lower Throughputs
For lower throughputs, 500 kWords/s and less, the static power consumption starts to dominate. It can be noted that the bit-parallel power consumption is much higher than the bit-serial. For the 12-bit case, the static power consumption is simulated to be 20.3 nW and 134 nW for the bit-serial and bit-parallel adders, respectively. The bit-parallel power consumption is thus 134/20.3 = 6.6 times higher. For the 24-bit additions the corresponding power consumption is 13.4 times higher for bit-parallel additions. An important observation is thus that it becomes more and more attractive to use bit-serial arithmetic as the word length increases. It can also be noted that the power reduction is not as high as expected from (2) and (3). The reason is, again, that the register cell and the AND gate were neglected in that section. However, the power reduction is still very high.
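The gap between the ideal and the simulated ratio can be made explicit with a few lines of arithmetic. The sketch below uses only the two simulated 12-bit values quoted above. The split of the bit-serial power into a full adder part and an overhead part assumes that one full adder leaks 1/12 of the 12-bit bit-parallel adder, which is an illustrative assumption and not a figure from Table 2.

```python
P_SERIAL_NW = 20.3         # simulated bit-serial adder (FA + register + AND gate)
P_PARALLEL_12_NW = 134.0   # simulated 12-bit bit-parallel adder
WORD_LENGTH = 12

measured_ratio = P_PARALLEL_12_NW / P_SERIAL_NW
ideal_ratio = WORD_LENGTH  # expectation from the conceptual model in (2) and (3)

# Assumption: leakage is spread evenly over the 12 adder cells.
p_fa_nw = P_PARALLEL_12_NW / WORD_LENGTH
# What remains of the bit-serial adder is attributed to register + AND gate.
p_overhead_nw = P_SERIAL_NW - p_fa_nw

print(f"measured ratio: {measured_ratio:.1f} (ideal: {ideal_ratio})")
print(f"estimated FA leakage: {p_fa_nw:.2f} nW")
print(f"estimated register + AND overhead: {p_overhead_nw:.2f} nW")
```

Under this assumption, nearly half of the bit-serial adder's leakage comes from the carry register and the AND gate, which is why the simulated ratio stays well below the ideal value of 12.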
5.3. The Crossover
There is a crossover in the middle, where the dynamic and static power consumption becomes equal. There are two aspects on where this crossover will appear.
5.3.1. Switching Activity
The simulations are done with maximum switching activity. In a general application, the switching activity is often much lower, for example, down to 10% or less [4]. The crossover will in that case move up to higher throughputs, by about 1–2 orders of magnitude. It will thus be more and more important to design for low static power consumption.
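The dependence of the crossover on the switching activity can be sketched with a simple model in which the dynamic power is the activity times an energy per word times the throughput. The function name and the energy value below are hypothetical and only serve to illustrate the scaling; they are not simulation results.

```python
def crossover_throughput(p_static_w, energy_per_word_j, activity=1.0):
    """Throughput (words/s) at which dynamic power equals static power.

    Model assumption: P_dynamic = activity * energy_per_word * throughput.
    """
    return p_static_w / (activity * energy_per_word_j)


P_STATIC_W = 134e-9        # 12-bit bit-parallel adder, from Section 5.2
ENERGY_PER_WORD_J = 1e-13  # hypothetical placeholder, not a simulated value

f_full = crossover_throughput(P_STATIC_W, ENERGY_PER_WORD_J, activity=1.0)
f_tenth = crossover_throughput(P_STATIC_W, ENERGY_PER_WORD_J, activity=0.1)
print(f"crossover moves up by a factor of {f_tenth / f_full:.0f}")
```

In this model, a switching activity of 10% moves the crossover up by one order of magnitude and an activity of 1% by two, independent of the placeholder energy value.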
5.3.2. Technology
The simulations are done for a typical 130 nm technology. However, the technology choice is not important for the results presented in this paper. Denser technologies will show the same behavior, with one flat part, one linearly increasing part, and a crossover in between. The difference is that the crossover will move up in throughput, since denser technologies will have more and more leakage. In [1], the dynamic and static power consumption are predicted to be equal at the 65 nm node. The static power consumption will thus become more and more important.
5.4. Discussion on Lowering the Supply Voltage
This paper suggests a methodology to reduce the static power consumption on the arithmetic and architectural level. The focus is on low and medium throughputs, since it is at those data rates that the static power consumption dominates. However, the dominant methodology for reducing the power consumption over the last decades has been to lower the supply voltage, V_{DD}. Often the designer does not have the possibility to choose the supply voltage [3]. Fixed supply voltages and threshold voltages are often assigned at the overall system level, and in those cases the discussion in Sections 5.3.1 and 5.3.2 is valid. However, for the case where the supply voltage can be chosen freely, it is important to examine how lowering the supply affects the methodology proposed in this paper.
5.4.1. Higher Throughputs
As described in Section 5.1, the power consumption is about 2 times higher for the bit-serial architectures in the high-throughput region. The simulations also show that the maximum throughput is about 2 times higher for the bit-parallel architecture. By lowering the bit-parallel supply voltage to approximately 0.75 V_{DD}, the dynamic power consumption will be equal to the bit-serial, at the same throughput. This is illustrated in Figure 4, where we can see the extrapolated dynamic power consumption for the bit-parallel adder when it is on top of the bit-serial curve; see the “diagonal” dotted line. The maximum throughputs will be comparable as well. However, the obvious choice, in the region where the dynamic power consumption dominates, is to use bit-parallel arithmetic, since it is in general an advantage to run at lower supply voltages.
5.4.2. Lower Throughputs
As described in Section 5.2, the static power consumption dominates in this region. With the same operation, lowering the bit-parallel supply voltage to approximately 0.75 V_{DD}, the static power consumption will decrease to about 0.75 P, according to (1). The static power consumption will thus decrease with reduced supply voltage. This is also illustrated in Figure 4, where we can see the extrapolated static power consumption for the bit-parallel adder as the upper dotted horizontal line. The effect of lowering the bit-parallel supply voltage is thus that the relative gain in reduced static power consumption is lower. However, in the lower throughput region, the same operation can be applied to the bit-serial arithmetic as well. By lowering the supply for the bit-serial adder, the static power consumption will decrease at the same pace. This is also shown in Figure 4, where we can see the extrapolated static power consumption for the bit-serial adder as the lower dotted horizontal line. For lower throughputs, the relative difference in static power consumption will thus be kept if the supply voltage is reduced by the same amount. The supply voltage reduction can continue down to voltages where the reliability, that is, the noise margin, sets the limit for proper function, more or less without changing the power ratio between the bit-parallel and the bit-serial arithmetic.
6. The Concept Applied On a Hilbert Transformer
The simulation results from Sections 4 and 5 can be used to estimate the static power consumption of larger structures. An 11-tap linear-phase FIR filter, a Hilbert transformer [7], is used in this section as an example to show the methodology of reducing the static power consumption. In Section 7, the methodology will be shown on an IIR filter as well.
One characteristic for Hilbert filters is that every second coefficient is zero. They are also antisymmetric; that is, the same coefficient appears two times, where one is positive and the other one is negative. The impulse response for the Hilbert transformer is shown in Figure 5.
In Figure 6, the linear-phase FIR filter corresponding to the impulse response in Figure 5 is shown. In the figure, two samples are subtracted before each coefficient multiplication to save one multiplication, which is illustrated with the signal S in Figure 6. Six of the 8-bit coefficient values used are given in Table 4. All other coefficients are zero.
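The saving from subtracting two samples before each multiplication can be sketched as follows. The code is a behavioural illustration only: the function names are introduced here, and the coefficient values are hypothetical placeholders, not the values of Table 4. It assumes an odd number of taps with an antisymmetric impulse response, h[k] = -h[N-1-k], and a zero centre tap.

```python
def direct_fir(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def folded_antisymmetric_fir(x, half):
    """FIR with h[k] = -h[N-1-k], N = 2*len(half) + 1, and a zero centre tap.

    The two samples sharing a coefficient magnitude are subtracted first,
    so only one multiplication per coefficient pair is needed.
    """
    n_taps = 2 * len(half) + 1

    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    y = []
    for n in range(len(x)):
        acc = 0
        for k, c in enumerate(half):
            if c == 0:
                continue  # every second coefficient is zero: no hardware needed
            s = sample(n - k) - sample(n - (n_taps - 1 - k))
            acc += c * s
        y.append(acc)
    return y


# Hypothetical first half of an 11-tap antisymmetric impulse response.
half = [-3, 0, -5, 0, -16]
full = half + [0] + [-c for c in reversed(half)]
x = list(range(1, 21))
assert folded_antisymmetric_fir(x, half) == direct_fir(x, full)
```

With 11 taps, the folded form needs three subtractions and three coefficient multiplications instead of six multiplications, which corresponds to the three subtractors counted in the bit-parallel filter.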

Three different filter architectures will be used to show the power reduction methodology: one bit-parallel, one digit-serial, and one bit-serial.
6.1. The Bit-Parallel Filter Architecture
The filter has two fixed coefficient multiplications, that is, those for the C_{10} and C_{8} coefficients. The C_{6} multiplication is a multiplication with “1”, which is thus trivial. Figure 7 shows a 12-bit fixed multiplier for the C_{10} coefficient. The bits of A_{10} are denoted by a_{i} and the result by s_{i} in the figure. Note that the uppercase letters denote word-level data and lowercase letters represent bit-level data.
The coefficient C_{10} = 00001101 contains three “1”s, at the positions c_{3}, c_{2}, and c_{0}. The multiplier can be realized by using two adders, where the upper adder is used for adding A_{10} for c_{0} together with the two-step left-shifted A_{10} for c_{2}. A three-step left-shifted A_{10} is, after that, added to the result in the lower adder. Figure 8 shows the corresponding methodology to realize the multiplier for the coefficient C_{8} = 00010101.
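The shift-and-add realization can be sketched in a few lines. The function name is an assumption introduced here, and the coefficients are treated as plain integers, ignoring any fixed-point scaling used in the filter.

```python
def shift_add_multiply(a, coeff_bits):
    """Multiply integer `a` by a constant given as a binary string (MSB first).

    One left-shifted copy of `a` is added per '1' in the coefficient.
    Returns (product, number of two-input adders needed).
    """
    n = len(coeff_bits)
    terms = [a << (n - 1 - i) for i, bit in enumerate(coeff_bits) if bit == "1"]
    return sum(terms), max(len(terms) - 1, 0)


C10 = "00001101"  # ones at c3, c2, c0: A + (A << 2) + (A << 3)
C8 = "00010101"   # ones at c4, c2, c0: A + (A << 2) + (A << 4)

print(shift_add_multiply(5, C10))  # (65, 2), since 5 * 13 = 65
print(shift_add_multiply(5, C8))   # (105, 2), since 5 * 21 = 105
```

Both coefficients contain three ones, so each multiplier needs two adders, in line with the description of Figures 7 and 8.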
Both the C_{10} and the C_{8} multipliers contain 20 full adders (FAs) and 2 half adders (HAs) each in the 12-bit case. Furthermore, two adders containing 26 FAs and 2 HAs in total are needed to sum up the results from the multiplications; see Figure 6. In addition, 3 subtractors with 36 FAs in total are used. The subtractors also need 36 inverters for the two’s complement conversion. The arithmetic in the parallel filter will thus contain 102 FAs, 6 HAs, and 36 inverters, in total.
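The totals can be checked with a short tally of the counts quoted above. The dictionary keys are labels introduced here for illustration.

```python
from collections import Counter

# Cell counts for the 12-bit bit-parallel arithmetic, as quoted in the text.
multiplier = Counter(FA=20, HA=2)      # one fixed multiplier (C10 or C8)
summing_adders = Counter(FA=26, HA=2)  # both result adders together
subtractors = Counter(FA=36, INV=36)   # three 12-bit subtractors together

total = multiplier + multiplier + summing_adders + subtractors
print(dict(total))  # {'FA': 102, 'HA': 6, 'INV': 36}
```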
6.2. The Bit-Serial Filter Architecture
The arithmetic part of the bit-serial filter is shown in Figure 9. To the left, the three bit-serial subtractors are shown. In the middle, there are three multipliers, for the C_{10}, C_{8}, and C_{6} coefficients. Finally, the two adders that sum up the result are placed to the right.
Counting the leaf cells, the bit-serial arithmetic gives 9 FAs, 29 register cells, 9 AND gates, and 3 inverters, in total. The architecture is independent of the word length. However, more clock cycles are needed to process more bits.
6.3. Digit-Serial Filter Architecture
Figure 10 shows a 2-bit digit-serial architecture. The structure is similar to the structure in Figure 9. Accordingly, the flow through the architecture can be compared to the flow through the bit-serial architecture, but here two bits are processed each clock cycle. The digit-serial arithmetic contains 18 FAs, 49 register cells, 9 AND gates, and 3 inverters.
The cell count for the 12-bit structures is shown in Table 5, divided among the different cells. Note that the AND gates are only needed together with the registers that are used in the bit-serial and digit-serial adders; see Figure 1.

From Table 5 and the simulated power consumption in Table 2, we can estimate the power consumption in the filter arithmetic. The power in (4) shows the total sum of the currents in the adders, registers, and the AND gates:
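The estimate amounts to a weighted sum of cell counts and per-cell static power. The sketch below uses the cell counts quoted in Sections 6.1 to 6.3. The per-cell power values are hypothetical placeholders, since the values of Table 2 are not reproduced here, so the printed ratios are illustrative only.

```python
# Cell counts for the 12-bit FIR arithmetic, as quoted in Sections 6.1 to 6.3.
CELL_COUNTS = {
    "bit-parallel": {"FA": 102, "HA": 6, "INV": 36},
    "digit-serial": {"FA": 18, "REG": 49, "AND": 9, "INV": 3},
    "bit-serial": {"FA": 9, "REG": 29, "AND": 9, "INV": 3},
}

# Hypothetical per-cell static power in nW (placeholders, NOT Table 2).
CELL_POWER_NW = {"FA": 11.0, "HA": 6.0, "REG": 8.0, "AND": 2.0, "INV": 1.0}


def estimate_static_power(counts, cell_power):
    """Sum of (number of cells) * (static power per cell) over all cell types."""
    return sum(n * cell_power[cell] for cell, n in counts.items())


reference = estimate_static_power(CELL_COUNTS["bit-parallel"], CELL_POWER_NW)
for name, counts in CELL_COUNTS.items():
    power = estimate_static_power(counts, CELL_POWER_NW)
    print(f"{name}: {power:.0f} nW, ratio {power / reference:.2f}")
```

Replacing the placeholders with the simulated cell values would give the actual estimates; only the structure of the calculation is intended to carry over.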
The following two subsections will discuss the relative power reduction for the arithmetic part of the FIR filter, in Section 6.4, and for the complete FIR filters, in Section 6.5.
6.4. The FIR Filter Arithmetic
The power consumption in the arithmetic can thus be estimated separately by excluding the n-bit registers, as shown in (5):
Table 6 shows a substantial power reduction for the 12-bit arithmetic in the filter, where the bit-serial arithmetic only consumes 28% of the static power of the bit-parallel arithmetic. The 2-bit digit-serial arithmetic only dissipates about half the static power of the bit-parallel arithmetic.

There is a strong dependence on the word length as well. As the word length increases, the relative power reduction will be larger, which could be expected from (2) and (3). The word length dependence is shown in Figure 11, where the word length is varied from 4 bits up to 30 bits.
The lower graph shows the bit-serial arithmetic and the upper shows the digit-serial. The bit-serial arithmetic only consumes 11% and the digit-serial only 19% of the static power consumption of the bit-parallel arithmetic when the word length is 30 bits.
6.5. The Complete Filter
Based on (4), we get the static power consumption for the complete filter as shown in (6). It can be noted that the static power in the bit-serial and the digit-serial arithmetic, in (5), is independent of the word length. However, that is not the case when comparing the power for the complete filter, since there are ten n-bit registers, which vary with the word length, as shown in (6) for the 12-bit case:
In Table 7, the filters are compared. The table shows a large reduction of the static power consumption as well, even if it is not as large as when the arithmetic part is examined alone. The reason is that the ten n-bit registers have the same size in both the parallel and the serial cases; see Figure 6. There is thus no relative power gain in the n-bit registers, only in the arithmetic.

In Figure 12, the relative power reduction for the complete filter is shown. Power ratios of 0.48 and 0.52 can be noted for the bit-serial and the digit-serial filter, respectively. The difference at higher word lengths is thus not very large. The digit-serial filter can thus be worth considering in order to reduce the clock frequency.
7. The Methodology Applied on a Half-Band Filter
A third-order IIR filter structure is used here to show the static power reduction methodology. The filter structure, shown in Figure 13, is a third-order bireciprocal lattice wave digital filter [7, 15], also called a half-band filter.
The recursive structure has four adders and three registers. There is also one multiplication with the coefficient 0.5, which corresponds to a shift. By using bit-serial arithmetic, a substantial reduction of the static power consumption can be achieved.
The same number and size of registers is needed in both the bit-parallel and the bit-serial filters, since the words have to be stored in three parallel registers or fed through three equally sized serial registers. Besides the registers, four single register cells have to be added in the bit-serial case, one for each adder; see Figure 1. The number of adder cells is, however, very different. In the bit-parallel case the adders have full width, that is, the same as the word length n, but in the bit-serial case only four adder cells are needed.
The cell count for both the 12-bit and the 24-bit structures is shown in Table 8, divided into the number of FAs, register cells, and AND cells. Note that the AND gates are only needed together with the registers that are used in the bit-serial adders, as shown in Figure 1.

From Table 8 and the simulated power consumption in Table 2, we can estimate the static power consumption in the filter arithmetic, which is shown in (8) and the power consumption in the complete filter in (9):
The power in (7) shows the total sum of the different power terms in the adders, register cells, and the AND gates. The power is presented for each case in (8) and (9):
The following two subsections discuss the relative power reduction for the filter arithmetic only, in Section 7.1 and for the complete filter, in Section 7.2.
7.1. The Filter Arithmetic
The power consumption in the arithmetic can thus be estimated separately by excluding the n-bit registers, as shown in (8).
Table 9 shows a substantial power reduction for the arithmetic in the filter, especially for the 24-bit case, where the bit-serial arithmetic only consumes 7.5% of the static power of the bit-parallel arithmetic. The table also shows that there is a strong dependence on the word length. As the word length increases, the relative power reduction will be larger, which could be expected from (2) and (3).

Figure 14 shows the relative power reduction up to a word length of 30 bits. Based on the same discussion as in Section 6, the power ratio for digit-serial arithmetic is added as well. The diagram shows that the power ratio goes down to 0.06 and 0.16 for the bit-serial and the digit-serial arithmetic, respectively, compared to the bit-parallel arithmetic.
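The word length dependence of the bit-serial arithmetic ratio can be approximated from the adder simulations in Section 5.2 alone. The sketch below assumes that the arithmetic consists of four adders, that each bit-serial adder (FA, register cell, and AND gate) dissipates the simulated 20.3 nW, and that a bit-parallel adder leaks in proportion to its width, scaled from the simulated 134 nW at 12 bits. The linear scaling and the function name are assumptions made for illustration.

```python
P_BIT_SERIAL_ADDER_NW = 20.3       # FA + register cell + AND gate (Section 5.2)
P_PARALLEL_12BIT_ADDER_NW = 134.0  # 12-bit bit-parallel adder (Section 5.2)
# Assumption: bit-parallel leakage scales linearly with the adder width.
P_FA_NW = P_PARALLEL_12BIT_ADDER_NW / 12


def iir_arithmetic_ratio(word_length, adders=4):
    """Approximate static power ratio, bit-serial over bit-parallel arithmetic."""
    serial = adders * P_BIT_SERIAL_ADDER_NW
    parallel = adders * word_length * P_FA_NW
    return serial / parallel


for n in (12, 24, 30):
    print(f"{n}-bit: ratio {iir_arithmetic_ratio(n):.3f}")
```

Under these assumptions the ratio is about 0.076 at 24 bits and about 0.061 at 30 bits, close to the 7.5% and 0.06 reported for the bit-serial arithmetic. The number of adders cancels out, so the ratio depends only on the word length.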
7.2. The Total Filter
Based on (7), we get the static power for the complete filter shown in (9):
It can be noted that the static power consumption in the serial arithmetic, as in (5), is independent of the word length for this filter as well. However, as for the FIR filter, that is not the case when comparing the power reduction for the complete filter, since the n-bit registers vary with the word length; see (9).
In Table 10, the filters are compared. The table shows a large reduction of the static power consumption as well, even if it is not as large as when the arithmetic part is examined alone. The reason is that the three n-bit registers (see Figure 13) have the same size in both the bit-serial and the bit-parallel case. There is thus no relative power gain in the registers, only in the arithmetic. However, the resulting gain is still remarkably high, with a reduction down to 37%.

Figure 15 shows a diagram similar to that in Figure 14, but for the complete IIR filter. The power ratio for the complete filter goes down to 0.36 and 0.43, respectively, for the bit-serial and digit-serial filter compared to the bit-parallel filter.
Figures 14 and 15 show the power ratio for the bit-serial and the 2-bit digit-serial architectures. In Figure 16, the power ratio versus the digit size is shown for digit sizes up to 24 bits in the 24-bit filter structure. It can be seen that the power ratio increases linearly up to a digit size of 23 bits. After that the digit-serial filter is equivalent to the bit-parallel filter, with a power ratio of 1.0. The left part of the diagram, where the digit size is 1, corresponds to the bit-serial case.
8. Discussion
It has been shown that a good architectural methodology to reduce the static power consumption is to choose arithmetic that has a lower number of processing elements. Bit-serial and digit-serial arithmetic can be very useful to reduce the number of units in a VLSI design, which thus also reduces the static power consumption. The methodology is especially useful for low and medium data rates, where the static power consumption dominates.
The methodology has been tested on two different filter structures, one FIR filter and one IIR filter. Besides the power reduction, two other conclusions can be drawn.
(1) The proposed methodology gives a larger power reduction for architectures with long word lengths. The reason is that the number of cells in bit-parallel arithmetic increases linearly with the word length, whereas the number of cells in bit-serial and digit-serial arithmetic is constant for all word lengths.
(2) A good strategy when considering algorithms and architectures is to choose those that have a minimum of storage elements, such as registers. The size of the registers follows the word length and is thus independent of whether bit-parallel, bit-serial, or digit-serial arithmetic is used. This has been shown by the two filters. In this particular FIR filter the ratio between the registers and the arithmetic is rather high. The power ratio is in that case relatively high, that is, it only goes down to 0.48. On the other hand, in the IIR filter used, the ratio between the registers and the arithmetic is rather low. Here we can see a power ratio down to 0.36, which is much lower.
9. Conclusions
Based on simulation results, the static power in a Hilbert transformer and a half-band filter is investigated and compared for three different architectures, namely, bit-parallel, digit-serial, and bit-serial arithmetic. The paper shows that a substantial reduction of the static power consumption can be achieved when bit-serial and digit-serial arithmetic is used. The paper also shows that the power ratio is strongly dependent on the word length used; that is, the reduction is larger for longer word lengths. The ratio is also dependent on the ratio between the arithmetic and the storage, for example, the registers. Architectures where the arithmetic dominates will show a larger reduction in static power consumption. A static power reduction down to half is shown for the bit-serial Hilbert transformer and almost down to a third in the half-band filter. Looking at the arithmetic part only, power ratios down to 0.11 and 0.06 are shown, respectively. The overall conclusion is that it is a good idea to switch to serial arithmetic for low- and medium-speed architectures when technologies such as the 130 nm technology in this paper are used. For denser technologies such as 65 nm and below, it will be well worth switching to serial arithmetic in architectures at higher speed as well, as long as the dynamic power consumption is not dominating.
References
[1] IEEE ITRS Technology Roadmap, http://public.itrs.net.
[2] A. Agarwal, S. Mukhopadhyay, A. Raychowdhury, K. Roy, and C. H. Kim, “Leakage power analysis and reduction for nanoscale circuits,” IEEE Micro, vol. 26, no. 2, pp. 68–80, 2006.
[3] C. Schuster, C. Piguet, J.-L. Nagel, and P.-A. Farine, “An architecture design methodology for minimal total power consumption at fixed Vdd and Vth,” Journal of Low-Power Electronics, vol. 1, no. 1, pp. 1–8, 2005.
[4] C. Piguet, C. Schuster, and J.-L. Nagel, “Static and dynamic power reduction by architecture selection,” in Proceedings of the 16th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS '06), vol. 4148 of Lecture Notes in Computer Science, pp. 659–668, Montpellier, France, September 2006.
[5] R. F. Lyon, “Two's complement pipeline multipliers,” IEEE Transactions on Communications, vol. 24, no. 4, pp. 418–425, 1976.
[6] E. R. Berlekamp, “Bit-serial Reed-Solomon encoders,” IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 869–874, 1982.
[7] L. Wanhammar, DSP Integrated Circuits, Academic Press, New York, NY, USA, 1999.
[8] M. J. Irwin and R. M. Owens, “A case for digit serial VLSI signal processors,” Journal of VLSI Signal Processing, vol. 1, no. 4, pp. 321–334, 1990.
[9] R. I. Hartley and K. K. Parhi, Digit-Serial Computation, Kluwer Academic Publishers, Boston, Mass, USA, 1995.
[10] S. G. Smith and P. B. Denyer, Serial Data Computation, Kluwer Academic Publishers, Boston, Mass, USA, 1988.
[11] P. Nilsson and M. Torkelson, “A custom digital intermediate frequency filter for the American mobile telephone system,” IEEE Journal of Solid-State Circuits, vol. 32, no. 6, pp. 806–815, 1997.
[12] K. Johansson, O. Gustafsson, and L. Wanhammar, “Multiple constant multiplication for digit-serial implementation of low power FIR filters,” WSEAS Transactions on Circuits and Systems, vol. 5, no. 7, pp. 1001–1008, 2006.
[13] P. Nilsson, “Arithmetic and architectural design to reduce leakage in nanoscale digital circuits,” in Proceedings of the 18th European Conference on Circuit Theory and Design (ECCTD '07), pp. 372–375, Seville, Spain, August 2007.
[14] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, Englewood Cliffs, NJ, USA, 2003.
[15] P. Nilsson and M. Torkelson, “Method to save silicon area by increasing the filter order,” Electronics Letters, vol. 31, no. 6, pp. 439–441, 1995.
Copyright
Copyright © 2009 Peter Nilsson. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.