Selected Papers from SPL 2009 Programmable Logic and Applications
High-Speed FPGA 10's Complement Adders-Subtractors
Abstract
This paper first presents a study of classical BCD adders, from which a carry-chain type adder is redesigned to fit Xilinx FPGA platforms. Some new concepts are presented to compute the P and G functions for carry-chain optimization purposes. Several alternative designs are presented. Then, attention is given to FPGA implementations of add/subtract algorithms for 10's complement BCD numbers. Carry-chain type circuits have been designed on 4-input LUT (Virtex-4, Spartan-3) and 6-input LUT (Virtex-5) Xilinx FPGA platforms. All designs are presented with the corresponding time performance and area consumption figures. Results have been compared to straightforward implementations of a decimal ripple-carry adder and an FPGA 2's complement binary adder-subtractor using the dedicated carry logic, both carried out on the same platform. Shorter delays have been measured for decimal numbers than for binary numbers covering the same range of operands.
1. Introduction
In a number of computer arithmetic applications, decimal systems are preferred to the binary ones. The reasons come not only from the complexity of coding/decoding interfaces but mostly from the lack of precision and clarity in the results of the binary systems.
Decimal arithmetic plays a key role in data processing environments such as commercial, financial, and Internet-based applications [1–3]. Performances required by applications with intensive decimal arithmetic are not met by most of the conventional software-based decimal arithmetic libraries [1]. Hardware implementation embedded in recently commercialized general-purpose processors [3, 4] is gaining importance.
Furthermore, IEEE has recently published a new standard, IEEE 754-2008 [5], that supports the floating-point representation of decimal numbers.
At the moment, Binary Coded Decimal (BCD) is used for decimal arithmetic algorithm implementations. Although other coding systems may be of interest, BCD seems to be the best choice so far. Issues of hardware realization of decimal arithmetic units remain wide open: potential improvements are expected in algorithm concepts as well as in hardware design. This paper summarizes some new concepts about carry-chain type algorithms for adding BCD numbers. Two key ideas have been introduced: (i) the propagate and generate functions are computed from the input data instead of intermediate BCD sums, and (ii) the functions have been implemented on Xilinx Virtex-4 [6] and Virtex-5 [7] FPGA platforms, taking advantage of the 6-input LUT structure of the Virtex-5.
Signed-number addition is used as a primitive operation for computing most arithmetic functions, so it deserves particular attention. It is well known that in classical algorithms the execution time of any program or circuit is proportional to the number N of digits of the operands. In order to minimize the computation time, several ideas have been proposed in the literature [8, 9]. Most of them consist in modifying the classical algorithm so as to minimize the computation time of each carry; the time complexity may still be proportional to N, but the proportionality constant may be reduced. Moreover, within the same range, decimal addition involves a shorter carry-propagation process than straight binary code. It will be shown in the practical implementations that adding BCD digits not only saves coding interfaces but also reduces delay. Hardware consumption for BCD will be greater if coding and decoding processes are not considered; today, however, the dramatic decrease in hardware cost stimulates work on time saving.
In this paper, decimal carry-chain and ripple-carry adders have been implemented on Xilinx Virtex-4 FPGA platforms for a number of operand sizes; comparative performances are presented for binary and BCD digit operands.
Additionally, three adder-subtractor designs have been implemented on Xilinx Virtex-5 FPGA platforms for a number of operand sizes; comparative performances are presented for binary and BCD digit operands, respectively. Adder-subtractor inputs are 10's complement signed BCD numbers; a sign-change algorithm is used whenever subtraction is at hand.
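2. Base-B Ripple-Carry Adders
Consider the base-$B$ representations of two $n$-digit numbers:
$$x = x(n-1)\,B^{n-1} + x(n-2)\,B^{n-2} + \cdots + x(0)\,B^{0}, \qquad y = y(n-1)\,B^{n-1} + y(n-2)\,B^{n-2} + \cdots + y(0)\,B^{0}. \tag{1}$$
Algorithm 1 (pencil and paper) computes the $(n+1)$-digit representation of the sum $z = x + y + c_{in}$, where $c_{in}$ is an initial carry equal to 0 or 1.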
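Algorithm 1. Classic addition (ripple carry):
c(0) := c_in;
for i in 0 .. n-1 loop
if x(i) + y(i) + c(i) > B-1 then c(i+1) := 1;
else c(i+1) := 0; end if;
z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;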
As $c(i+1)$ is a function of $c(i)$, the execution time of Algorithm 1 is proportional to $n$ (Figure 1). In order to reduce the execution time of each iteration step, Algorithm 1 can be modified as shown in Section 3.
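As an illustration, the classic digit-serial addition can be sketched in Python. This is a minimal sketch, not the authors' code: the function name `ripple_carry_add` and the digit-list convention (least significant digit first) are assumptions made for this example, and the final carry is appended as the extra digit of the $(n+1)$-digit result.

```python
def ripple_carry_add(x, y, base=10, c_in=0):
    """Add two equal-length digit lists (least significant digit first)."""
    assert len(x) == len(y)
    z = []
    c = c_in                              # initial carry, 0 or 1
    for xi, yi in zip(x, y):
        s = xi + yi + c
        z.append(s % base)                # sum digit: (x(i) + y(i) + c(i)) mod B
        c = 1 if s > base - 1 else 0      # next carry depends on the previous one
    z.append(c)                           # most significant digit is the last carry
    return z


# 479 + 583 = 1062, digits listed from least to most significant
print(ripple_carry_add([9, 7, 4], [3, 8, 5]))   # [2, 6, 0, 1]
```

Each loop iteration needs the carry of the previous one, which is why the delay grows linearly with the number of digits.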
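3. Base-B Carry-Chain Adders
First define two binary functions of two $B$-valued variables, namely, the propagate ($P$) and generate ($G$) functions:
$$p(i) = P(x(i), y(i)) = \begin{cases} 1 & \text{if } x(i) + y(i) = B-1,\\ 0 & \text{otherwise;} \end{cases} \qquad g(i) = G(x(i), y(i)) = \begin{cases} 1 & \text{if } x(i) + y(i) > B-1,\\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
The next carry $c(i+1)$ can be calculated as follows:
if p(i) = 1 then c(i+1) := c(i);
else c(i+1) := g(i); end if; (3)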
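The corresponding modified Algorithm 2 is the following one.
Algorithm 2. Carry-chain addition
-- computation of the generation and propagation conditions:
for i in 0 .. n-1 loop
g(i) := G(x(i), y(i)); p(i) := P(x(i), y(i));
end loop;
-- carry computation:
c(0) := c_in;
for i in 0 .. n-1 loop
if p(i) = 1 then c(i+1) := c(i); else c(i+1) := g(i); end if;
end loop;
-- sum computation
for i in 0 .. n-1 loop
z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;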
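The three phases of the carry-chain scheme can be sketched as follows. This is an illustrative Python model under stated assumptions, not the authors' implementation: the name `carry_chain_add` and the digit-list convention (least significant digit first) are hypothetical, and the propagate and generate conditions are taken as "digit sum equals B-1" and "digit sum exceeds B-1", respectively.

```python
def carry_chain_add(x, y, base=10, c_in=0):
    """Carry-chain addition of two digit lists (least significant digit first)."""
    n = len(x)
    assert len(y) == n
    # Phase 1: propagate/generate conditions, computed from the operands only
    p = [1 if x[i] + y[i] == base - 1 else 0 for i in range(n)]
    g = [1 if x[i] + y[i] > base - 1 else 0 for i in range(n)]
    # Phase 2: carry chain, one binary selection per digit
    c = [c_in]
    for i in range(n):
        c.append(c[i] if p[i] == 1 else g[i])
    # Phase 3: digit sums, all computable once the carries are known
    z = [(x[i] + y[i] + c[i]) % base for i in range(n)]
    z.append(c[n])                        # final carry as most significant digit
    return z


print(carry_chain_add([9, 7, 4], [3, 8, 5]))   # [2, 6, 0, 1], that is 479 + 583 = 1062
```

Only phase 2 is sequential, and each of its steps is a purely binary selection regardless of the base, which is the point of separating the carry computation.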
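Comments
(1) Instruction sentence (3) is equivalent to the following Boolean equation:
$$c(i+1) = \overline{p(i)} \cdot g(i) \vee p(i) \cdot c(i). \tag{4}$$
Furthermore, if the preceding relation is used, then the definition of the generate function can be modified:
$$g(i) = \begin{cases} 1 & \text{if } x(i) + y(i) > B-1,\\ 0 & \text{if } x(i) + y(i) < B-1,\\ 0 \text{ or } 1 \text{ (don't care)} & \text{otherwise.} \end{cases} \tag{5}$$
(2) Another Boolean equation equivalent to (4) is
$$c(i+1) = g(i) \vee p(i) \cdot c(i). \tag{6}$$
If the preceding relation is used, then the definition of the propagate function can be modified:
$$p(i) = \begin{cases} 1 & \text{if } x(i) + y(i) = B-1,\\ 0 & \text{if } x(i) + y(i) < B-1,\\ 0 \text{ or } 1 \text{ (don't care)} & \text{otherwise.} \end{cases} \tag{7}$$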
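The equivalence of the two Boolean forms of the next carry, and the freedom they leave in the generate and propagate definitions, can be checked exhaustively. The sketch below is an assumption-laden illustration: it takes the multiplexer form to be "not p and g, or p and c" and the AND-OR form to be "g or (p and c)", and all function names are hypothetical.

```python
def carry_if(p, g, c):
    """Next carry by selection: propagate the incoming carry, else use g."""
    return c if p else g


def carry_mux(p, g, c):
    """Multiplexer form: (not p and g) or (p and c)."""
    return ((1 - p) & g) | (p & c)


def carry_and_or(p, g, c):
    """AND-OR form: g or (p and c)."""
    return g | (p & c)


def check_equivalence(base):
    for x in range(base):
        for y in range(base):
            s = x + y
            p, g = int(s == base - 1), int(s > base - 1)
            for c in (0, 1):
                ref = int(s + c > base - 1)          # true carry out of the digit
                assert carry_if(p, g, c) == ref
                assert carry_mux(p, g, c) == ref
                assert carry_and_or(p, g, c) == ref
                if s == base - 1:                    # mux form ignores g here
                    assert all(carry_mux(p, gx, c) == ref for gx in (0, 1))
                if s > base - 1:                     # AND-OR form ignores p here
                    assert all(carry_and_or(px, g, c) == ref for px in (0, 1))
    return True


print(check_equivalence(2), check_equivalence(10))   # True True
```

The two `all(...)` checks correspond to the don't-care cases: with the multiplexer form, g is irrelevant when the digit sum equals B-1; with the AND-OR form, p is irrelevant when the digit sum exceeds B-1.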
The structure of an $n$-digit adder with separate carry calculation is shown in Figure 2. It is based on Algorithm 2. The G-P (generate-propagate) cell calculates the generate and propagate functions (2).
The Cy.Ch (carry-chain) cell computes the next carry, that is to say
$$c(i+1) = \begin{cases} c(i) & \text{if } p(i) = 1,\\ g(i) & \text{otherwise,} \end{cases}$$
so that $g(i) = 1$ generates a carry, whatever happens upstream in the carry chain, and $p(i) = 1$ propagates the carry from level $i-1$. The mod $B$ sum cell calculates
$$z(i) = (x(i) + y(i) + c(i)) \bmod B.$$
As regards the computation time , the critical path is shaded in Figure 2. It has been assumed that .
Another interesting time is the delay from to assuming that all propagate and generate functions have already been calculated:
Comments
The carry-chain cells are binary circuits, whereas the generate-propagate and the mod B sum cells are B-ary ones.
Equation (4) can be implemented by a 2-to-1 binary multiplexer (Figure 3(a)), while (6) can be implemented by a 2-gate circuit (Figure 3(b)). In the first case, the per-digit delay of a carry-chain adder is equal to the delay of a 2-to-1 binary multiplexer, whatever the base B is.
Figure 3: (a) Carry multiplexer; (b) AND-OR circuit.
If and the carrychain cell of Figure 3(a) is used, then and can be chosen equal to, for example, . The corresponding cell for a bit binary adder is shown in Figure 4.
4. Base-10 Complement and Addition
4.1. Ten’s Complement Numeration System
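General principles of B's complement representation are available in the literature, for example in computer arithmetic books such as [8, 9]. This paper restricts itself to the 10's complement system. A one-to-one function R(x), associating a natural number to x, is defined as follows.
Every integer $x$ belonging to the range
$$-\frac{10^{n}}{2} \le x < \frac{10^{n}}{2}$$
is represented by $R(x) = x \bmod 10^{n}$, so that the integer represented in the form $x(n-1)\,x(n-2) \cdots x(0)$ is
$$x = \begin{cases} R(x) & \text{if } R(x) < 10^{n}/2,\\ R(x) - 10^{n} & \text{if } R(x) \ge 10^{n}/2. \end{cases} \tag{12}$$
The conditions (12) may be more simply expressed as
$$x = \begin{cases} R(x) & \text{if } x(n-1) < 5,\\ R(x) - 10^{n} & \text{if } x(n-1) \ge 5. \end{cases}$$
Another way to express a 10's complement number is
$$x = x'(n-1)\,10^{n-1} + x(n-2)\,10^{n-2} + \cdots + x(0)\,10^{0},$$
where $x'(n-1) = x(n-1) - 10$ if $x(n-1) \ge 5$ and $x'(n-1) = x(n-1)$ if $x(n-1) < 5$,
while the sign definition rule is the following one: if $x$ is negative, then $x(n-1) \ge 5$; otherwise $x(n-1) < 5$.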
4.2. Ten’s Complement Sign Change
Given an $n$-digit 10's complement integer $x$, the inverse $-x$ of $x$ is an $(n+1)$-digit 10's complement integer. Actually, the only case that cannot be represented with $n$ digits is $x = -10^{n}/2$, for which $-x = 10^{n}/2$ lies just outside the $n$-digit range. The computation of the representation of $-x$ is based on the following property.
Assuming $x$ to be represented as an $(n+1)$-digit 10's complement number, $-x$ may be readily computed as
$$R(-x) = \left( \sum_{i=0}^{n} \bigl(9 - x(i)\bigr)\,10^{i} + 1 \right) \bmod 10^{n+1}.$$
A straightforward inversion algorithm then consists in representing $x$ with $n+1$ digits, complementing every digit to 9, then adding 1. Observe that sign extension is obtained by adding a digit 0 to the left of a nonnegative number or a digit 9 to the left of a negative number.
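The inversion procedure (sign-extend by one digit, complement every digit to 9, add 1) can be sketched as follows. This is only an illustrative model: the helper names `encode`, `decode`, and `negate`, and the convention of listing digits from least to most significant, are assumptions of this example.

```python
def encode(value, n):
    """n-digit 10's complement digits (least significant first) of value."""
    assert -(10 ** n) // 2 <= value < (10 ** n) // 2
    r = value % (10 ** n)                 # natural number representing value
    return [(r // 10 ** i) % 10 for i in range(n)]


def decode(digits):
    """Integer represented by 10's complement digits (least significant first)."""
    r = sum(d * 10 ** i for i, d in enumerate(digits))
    return r - 10 ** len(digits) if digits[-1] >= 5 else r


def negate(digits):
    """Sign change: extend by one digit, complement to 9, then add 1."""
    ext = digits + [9 if digits[-1] >= 5 else 0]   # sign extension
    out, c = [], 1                                  # the add-1 enters as a carry
    for d in ext:
        s = (9 - d) + c
        out.append(s % 10)
        c = s // 10
    return out                                      # final carry is discarded


print(decode(negate(encode(-500, 3))))   # 500, which needs the fourth digit
```

The extra digit is what makes the corner case representable: negating the most negative 3-digit value, -500, yields +500, which does not fit in 3 digits.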
5. Base-10 Adders
5.1. Base-10 Ripple-Carry Adders
For B = 10, the classic and naïve approach [8] of ripple carry for a BCD decimal adder cell can be implemented as in Figure 5. Observe that the critical path involves the carry propagation through 7 binary adders plus a 4-bit Boolean circuit (checking whether the sum is greater than 9 or not).
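A behavioural sketch of such a BCD cell is given below: a 4-bit binary addition, a detection of sums greater than 9, and a conditional add-6 correction. The detection expression used here (carry-out, or bit 3 together with bit 2 or bit 1) is a common textbook formulation and is an assumption of this example; the exact gate-level circuit of Figure 5 is not reproduced. The name `bcd_cell` is hypothetical.

```python
def bcd_cell(x, y, c_in):
    """One BCD digit position: returns (sum digit, carry out)."""
    t = x + y + c_in                      # first binary adder, result 0..19
    s, c4 = t & 0xF, t >> 4               # 4-bit sum and its carry-out
    s3, s2, s1 = (s >> 3) & 1, (s >> 2) & 1, (s >> 1) & 1
    c_out = c4 | (s3 & (s2 | s1))         # 1 exactly when t > 9
    z = (s + 6 * c_out) & 0xF             # add 6 (mod 16) when a carry is produced
    return z, c_out


print(bcd_cell(7, 8, 1))   # (6, 1) since 7 + 8 + 1 = 16
```

In an N-digit ripple-carry adder the `c_out` of one cell feeds the `c_in` of the next, so both the binary addition and the greater-than-9 detection sit on the carry path of every digit.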
5.2. Base-10 Carry-Chain Adders
If B = 10, the carry-chain circuit remains unchanged, but the P and G functions as well as the modulo-10 sums are somewhat more complex. In base 2, the mod B sum cell appears to be a single XOR function, while the mod 10 sum cell is more complex, as suggested by Figure 5.
In base 2, the P and G cells are, respectively, synthesized by XOR and AND functions, while in base 10, P and G are now defined as follows:
$$p(i) = \begin{cases} 1 & \text{if } x(i) + y(i) = 9,\\ 0 & \text{otherwise;} \end{cases} \qquad g(i) = \begin{cases} 1 & \text{if } x(i) + y(i) > 9,\\ 0 & \text{otherwise.} \end{cases} \tag{17}$$
A straightforward way to synthesize P and G is shown in Figure 6. Nevertheless, functions P and G may be directly computed from inputs. The following formulas (18) are Boolean expressions of conditions (17),
where , and are the binary propagator, generator, and carry-kill for the jth components of the BCD digits x(i) and y(i).
The $i$th cell of the BCD carry-chain adder is shown in Figure 7. It is made of a first mod 16 adder stage, a carry-chain cell driven by the G-P functions, and an output adder stage performing a correction (adding 6) whenever the carry-out is one. Actually, a zero carry-out c(i+1) identifies that the mod 16 sum does not exceed 9 if c(i) = 0, respectively, 8 if c(i) = 1; so no corrections are needed. Otherwise, the add-6 correction applies.
The G-P functions may be computed according to Figure 6, using the outputs of the mod 16 stage, including its carry-out. With more hardware consumption, but saving time delays, formulas (18) may be used.
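The cell organisation described above (mod 16 adder stage, carry chain, output correction stage) can be modelled behaviourally as follows. This is a sketch under assumptions: P and G are computed here from their arithmetic definitions (digit sum equal to 9, digit sum greater than 9) rather than from any particular Boolean network, the output stage is taken to add the incoming carry plus 6 times the outgoing carry modulo 16, and the name `bcd_carry_chain_add` is hypothetical.

```python
def bcd_carry_chain_add(x, y, c_in=0):
    """N-digit BCD addition; digit lists are least significant digit first."""
    n = len(x)
    assert len(y) == n
    s = [(x[i] + y[i]) & 0xF for i in range(n)]       # mod 16 adder stage
    p = [int(x[i] + y[i] == 9) for i in range(n)]     # propagate condition
    g = [int(x[i] + y[i] > 9) for i in range(n)]      # generate condition
    c = [c_in]
    for i in range(n):                                # binary carry chain
        c.append(c[i] if p[i] else g[i])
    # output stage: add the incoming carry, plus 6 when a carry goes out
    z = [(s[i] + c[i] + 6 * c[i + 1]) & 0xF for i in range(n)]
    return z, c[n]


print(bcd_carry_chain_add([9, 9, 9], [1, 0, 0]))   # ([0, 0, 0], 1), that is 999 + 1 = 1000
```

The example shows a carry rippling through two propagate positions: only the multiplexer-like selection in the loop depends on the previous digit, while both adder stages work digit by digit.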
6. FPGA Implementations of the Base-10 Adders on 4-Input LUT Xilinx Platforms
The base-10 adders of Figures 5 and 7 have been implemented on 4-LUT (look-up tables with up to 4 inputs) Xilinx devices. Virtex-4, Spartan-3, and the obsolete Virtex-2, Virtex, and Spartan-2 are 4-input LUT-based FPGAs [6, 10]. In what follows, the area is expressed in LUTs. In the Xilinx Virtex-4 technology, a configurable logic block (CLB) contains 4 slices, and a slice is made of two 4-LUTs and some additional logic. VHDL models are available at [11].
6.1. Base-10 Ripple-Carry Adder
The classic implementation of the ripple-carry adder cell in FPGA implies a 4-bit adder, a 4-LUT to detect the carry condition, and a final 3-bit adder. The delay and area consumption of an N-digit ripple-carry adder are
6.2. FPGA Implementation of the Base-10 Carry-Chain Adder
In order to make the best use of the resources, the design has been achieved using relative location techniques (RLOC) [12] with low-level component instantiations. This first architecture is called GP_a.
The adding stages are implemented as shown in Figures 8(a) and 8(b), while the carry-chain structure with the G-P functions has been implemented as shown in Figure 9, where G is computed according to Figure 6, while P is computed as
equivalent to the expression of Figure 6. Figure 9 emphasizes that G depends on while P is computed from and G.
The time delay corresponding to the 4-bit adder stage (Figure 8(a)) and the output adder stage (Figure 8(b)) is given as
Both adder stages of Figures 8(a) and 8(b) need the same hardware requirement; computed in slices, the area consumption is given as
The complexity figures of the carry-chain circuit for a 4-digit unit, as shown in Figure 9, are given as
where stands for the average connection delay between two neighboring slices of the same CLB.
The overall circuit is represented in Figure 10. The overall time delay is computed from formulas (21), (22) and (24):
where stands for the average connection delay between two slices located in neighboring columns. has to be counted twice, to account for both the connection delay between the 4-bit adder and the carry chain and the one between the carry chain and the output adder.
From (23) and (25), the area requirement may be computed as
6.3. Other Implementations of Base-10 Carry-Chain Adders
Functions P and G may be directly computed from x(i) and y(i) inputs using the Boolean expression (18). Using 4-input LUTs (4-LUTs), a first implementation (Figure 11) computes
This architecture, called GP_b, is shown in Figure 11. The corresponding time and area of a carry-chain cell using this architecture are
The complete cell includes a 4-bit adder and a conditional 3-bit output adder adding 6 whenever necessary (similar to Figure 5). The overall time delay and area consumption using this carry-computation cell are:
The results in area and speed are poor compared to the GP_a implementation (obtaining GP from the results of the 4bit adder).
Another alternative is based on the use of dedicated multiplexers. Xilinx Spartan-3, Virtex-2, and Virtex-4 devices have look-up table multiplexers (muxf5, muxf6, muxf7, muxf8) in order to construct functions of 5, 6, 7, and 8 variables without using the general-purpose routing fabric.
Using this feature the circuit of Figure 12 (GP_c) can be implemented using the following relations:
The corresponding time and area of a carry-chain cell GP_c is where stands for the delay from an LUT input to a muxf6 output. The complete cell also includes a 4-bit adder and a conditional 3-bit adder. The overall delay-area for the GP_c cell is
7. FPGA Implementations of Base-10 Adders and Adders-Subtractors on 6-Input LUT Xilinx Platforms
7.1. Base-10 BCD Carry-Chain Adder
In a first version, AdI, the adding stage and correction stage are implemented as shown in Figures 8(a) and 8(b), respectively, while the carry-chain structure with the G-P functions is computed according to Figure 6.
A Xilinx Virtex-5 6-input/2-output LUT is built as two 5-input functions, while the sixth input controls a 2-to-1 multiplexer, allowing the implementation of either two 5-input functions or a single 6-input one; so the G and P functions fit in a single LUT, as shown in Figure 13.
In a second version, AdII, the carry chain is sped up thanks to a direct computation of the G-P functions, namely, using the inputs $x(i)$ and $y(i)$ instead of the intermediate sum bits. For this purpose one could use formulas (18); nevertheless, in order to minimize time and hardware consumption, the implementation of $p(i)$ and $g(i)$ is revisited as follows. Remembering that $p(i) = 1$ whenever the arithmetic sum $x(i) + y(i) = 9$, one defines a 6-input function $pp(i)$ set to be 1 whenever the arithmetic sum of the first 3 bits of $x(i)$ and $y(i)$ (the three most significant bits $x_3 x_2 x_1$ and $y_3 y_2 y_1$, read as 3-bit integers) is 4. Then $p(i)$ may be computed as
$$p(i) = pp(i) \cdot \bigl(x_0(i) \oplus y_0(i)\bigr). \tag{36}$$
On the other hand, $gg(i)$ is defined as a 6-input function set to be 1 whenever the arithmetic sum of the first 3 bits of $x(i)$ and $y(i)$ is 5 or more. So, remembering that $g(i) = 1$ whenever the arithmetic sum $x(i) + y(i) > 9$, $g(i)$ may be computed as
$$g(i) = gg(i) \vee pp(i) \cdot x_0(i) \cdot y_0(i). \tag{37}$$
As Xilinx Virtex-5 LUTs may compute 6-variable functions, $pp(i)$ and $gg(i)$ may be synthesized using 2 LUTs in parallel, while $p(i)$ and $g(i)$ are computed through an additional single LUT, as shown in Figure 14.
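The two-level decomposition can be checked exhaustively over all BCD digit pairs. In this sketch the intermediate functions are named `pp` and `gg` after the `ppa`/`gga` naming used in Section 7.2.3; the way the least significant bits are combined with them (XOR for propagate, AND for generate) is derived from the arithmetic identity x + y = 2·(upper sums) + x0 + y0 and is an assumption of this example, as is the function name `gp_direct`.

```python
def gp_direct(x, y):
    """Propagate and generate of one BCD digit pair, from the input bits only."""
    x0, y0 = x & 1, y & 1                 # least significant bits
    upper = (x >> 1) + (y >> 1)           # sum of the two 3-bit upper fields
    pp = int(upper == 4)                  # 6-input function: upper sum is 4
    gg = int(upper >= 5)                  # 6-input function: upper sum is 5 or more
    p = pp & (x0 ^ y0)                    # digit sum is exactly 9
    g = gg | (pp & x0 & y0)               # digit sum is 10 or more
    return p, g


# exhaustive comparison with the arithmetic definitions
ok = all(gp_direct(x, y) == (int(x + y == 9), int(x + y > 9))
         for x in range(10) for y in range(10))
print(ok)   # True
```

Since `pp` and `gg` each depend on six input bits and the final combination depends on `pp`, `gg`, `x0`, and `y0`, each level maps naturally onto 6-input look-up tables.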
7.2. 10's Complement BCD Carry-Chain Adder-Subtractor
To compute $X + Y$, a similar algorithm as in Section 7.1 is used. In order to compute $X - Y$, the 10's complement subtraction algorithm actually adds $-Y$ to $X$.
7.2.1. ’s Complement (ASI)
The 10's complement sign change algorithm may be implemented through a digit-wise 9's complement stage followed by an add-1 operation. It can be shown that the 9's complement binary components of a given BCD digit $y(i) = y_3(i)\,y_2(i)\,y_1(i)\,y_0(i)$, denoted here $y'_k(i)$, are expressed as
$$y'_3(i) = \overline{y_3(i)} \cdot \overline{y_2(i)} \cdot \overline{y_1(i)}, \qquad y'_2(i) = y_2(i) \oplus y_1(i), \qquad y'_1(i) = y_1(i), \qquad y'_0(i) = \overline{y_0(i)}. \tag{38}$$
For a first implementation, ASI, Figure 15 presents a 9's complement implementation using the 6-input/2-output LUTs available in the Xilinx Virtex-5 technology. An add/subtract control signal selects the mode: in subtract mode, formulas (38) apply; otherwise, each bit of $y(i)$ is passed through unchanged.
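A bit-level 9's complement stage with an add/subtract control can be sketched and verified as follows. The Boolean expressions used here are one standard formulation (invert bit 0, keep bit 1, XOR bits 2 and 1, and set bit 3 only when bits 3, 2, and 1 are all zero); whether they match the paper's formulas term by term is an assumption, as are the name `nines_complement_stage` and the convention that a control value of 1 means subtract.

```python
def nines_complement_stage(y, subtract):
    """Return 9 - y for a BCD digit y when subtract is 1, else y unchanged."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    if not subtract:
        return y                                  # add mode: pass-through
    w3 = (1 - y3) & (1 - y2) & (1 - y1)           # 1 only for y in {0, 1}
    w2 = y2 ^ y1
    w1 = y1
    w0 = 1 - y0
    return (w3 << 3) | (w2 << 2) | (w1 << 1) | w0


print([nines_complement_stage(y, 1) for y in range(10)])
# [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

Each output bit depends on at most three digit bits plus the control signal, so the whole stage fits comfortably in LUTs with six inputs and two outputs.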
The ASI circuit is similar to AdI (Figures 8 and 13), using, instead of input $y(i)$, the 9's complemented input produced by the circuit of Figure 15.
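At the behavioural level, such an adder-subtractor can be sketched as a decimal adder fed with the (conditionally) 9's complemented second operand. In this sketch the add-1 of the sign change is injected as the initial carry, which is a common arrangement but an assumption here; overflow is not handled, and the names `bcd_add_sub`, `encode`, and `decode` are hypothetical.

```python
def encode(value, n):
    """n-digit 10's complement digits (least significant first) of value."""
    r = value % (10 ** n)
    return [(r // 10 ** i) % 10 for i in range(n)]


def decode(digits):
    """Integer represented by 10's complement digits (least significant first)."""
    r = sum(d * 10 ** i for i, d in enumerate(digits))
    return r - 10 ** len(digits) if digits[-1] >= 5 else r


def bcd_add_sub(x, y, subtract):
    """x + y (subtract = 0) or x - y (subtract = 1), result modulo 10**n."""
    yy = [9 - d if subtract else d for d in y]    # digit-wise 9's complement stage
    c = 1 if subtract else 0                      # add-1 enters as the initial carry
    z = []
    for xi, yi in zip(x, yy):
        s = xi + yi + c
        z.append(s % 10)
        c = s // 10
    return z                                      # final carry is discarded


print(decode(bcd_add_sub(encode(123, 3), encode(456, 3), 1)))   # -333
```

Apart from the complement stage and the initial carry, the datapath is the same as for addition, which is consistent with the small area penalty reported for the adder-subtractor.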
7.2.2. Improving the Adder Stage
To avoid the delay produced by the 9’s complement step, this operation may be carried out within the first binary adder stage, as depicted in Figure 16, where p(i) and g(i) are computed as
7.2.3. Carry-Chain Stage Computing G and P Directly from the Input Data (ASII)
As far as addition is concerned, the P and G functions may be implemented according to formulas (36) and (37). The idea of ASII is to compute the corresponding functions in the subtract mode and then to multiplex according to the add/subtract control signal. For this reason, assuming that the operation at hand is $X \pm Y$, one defines on one hand ppa(i) and gga(i) according to Section 7.1, that is, using the straight values of Y's BCD components.
On the other hand, pps(i) and ggs(i) are defined according to the same Section 7.1 but using the 9's complemented digit as computed by the circuit shown in Figure 15. As its bits are expressed from (38), both pps(i) and ggs(i) may be computed directly from x_{k}(i) and y_{k}(i), as shown in Figure 17. Nevertheless, for subtraction, the computation of is carried out at the output LUT level. So formulas (36) and (37) are then expressed as
8. Experimental Results
8.1. Xilinx Virtex-4 Adder Implementations
The base-10 adders have been implemented on the Xilinx Virtex-4 FPGA family, speed grade -11 [6]. Synthesis and implementation have been carried out with XST (Xilinx Synthesis Technology) [13] and Xilinx ISE (Integrated Software Environment) version 10.1 [14].
Performances of different N-digit BCD adders have been compared to those of an M-bit binary carry-chain adder (implemented by XST [13] using Xilinx fast carry logic) covering the same range, that is,
$$M = \lceil N \log_{2} 10 \rceil \approx 3.32\,N.$$
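The "same range" condition can be made concrete by taking M as the smallest number of bits whose range covers all N-digit decimal values, that is, the smallest M with 2^M >= 10^N. This reading follows the later description of M as the number of bits required to cover the decimal N-digit range; the function name `bits_for_digits` is hypothetical.

```python
def bits_for_digits(n_digits):
    """Smallest M such that 2**M >= 10**n_digits (exact integer arithmetic)."""
    return (10 ** n_digits - 1).bit_length()


for n in (1, 2, 4, 8, 16, 32):
    print(n, bits_for_digits(n))
```

Using integer `bit_length` avoids the rounding issues of a floating-point logarithm; the ratio M/N approaches log2(10), roughly 3.32, so a binary carry chain is a little over three times longer than the decimal one for the same range.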
The time and hardware complexities of an M-bit ripple-carry adder implemented on the same 4-LUT-based Xilinx FPGA are given by
Formulas (26), (30), (34), and (42) show that, asymptotically, the delay of the decimal adders should be somewhat lower than that of the binary adder. Nevertheless, as shown by the experimental results, the additive values appearing in (26), (30), and (34) are not negligible for reasonable values of N; so the saving in time will mainly appear for applications where BCD-to-binary coding and decoding operations play a significant role in the overall delay.
Post place-and-route time delays and area consumptions are quoted in Tables 1 and 2, respectively, where N stands for the number of BCD digits while M stands for the number of bits required to cover the decimal N-digit range. The results presented in the tables are as follows:


Figure 18 shows the delays for the compared adders. Observe that, for the technology at hand, Table 1 and Figure 18 suggest that for N > 48 the carry-chain decimal implementation of adders is faster than the binary one for the equivalent range. Furthermore, for small numbers of digits to add (N < 40), the GP_c architecture is faster than the other decimal implementations.
8.2. Virtex-5 Adder-Subtractor Implementations
The adder-subtractor circuits have been implemented on the Xilinx Virtex-5 family with speed grade -2 [7]. Synthesis and implementation have been carried out with XST (Xilinx Synthesis Technology) [13] and Xilinx ISE (Integrated Software Environment) version 10.1 [14]. The critical parts were designed using low-level component instantiation (lut6_2, muxcy, xorcy, etc.) in order to obtain the desired behavior. Performances of different N-digit BCD adders have been compared to those of an M-bit binary carry-chain adder (implemented by XST) covering the same range, that is, such that
$$M = \lceil N \log_{2} 10 \rceil.$$
Table 3 exhibits the post place-and-route delays in ns for the decimal adder implementations AdI and AdII of Section 7.1; Table 4 exhibits the delays in ns for the decimal adder-subtractor implementations ASI and ASII of Section 7.2. Table 5 lists the consumed areas expressed in terms of 6-input look-up tables (6-input LUTs). The estimated area presented in Table 5 was empirically confirmed.



Comments
Observe that, for large operands, the decimal operations are faster than the binary ones.
The overall area overhead with respect to binary computation is not negligible. In Virtex-4, the area increases, with respect to an equal-range binary adder, by a factor between 2.4 and 5.4. In the 6-input LUT family Virtex-5, an adder-subtractor is between 3.0 and 3.9 times bigger.
9. Conclusions
The present interest in BCD arithmetic systems stimulates further research at both the algorithmic and design levels. Considering that hardware costs are becoming more affordable every day, full hardware BCD units are now very attractive, with growing potential in the near future.
This paper has developed some implementations of BCD adders and subtractors on FPGA platforms. Experimental results emphasize time performance at reasonable cost in terms of area. Compared with the binary system, the decimal implementations become faster as operand sizes grow (break-even around 50 digits).
One of the key points about delays comes from the fact that the carry-propagation computation remains binary; a faster carry-chain circuit can then be designed because, for the same operand range, the number of digits (and therefore of carries to propagate) is lower in decimal than in binary. In the carry-chain structures studied in this paper, the propagate P and generate G functions are more complex, and therefore more time and area consuming, than the binary ones; therefore, the speed improvements only appear for large enough operands. The break-even point is obviously technology dependent, so it could be expected to occur for a smaller number of digits in the near future.
The area overhead with respect to binary computation is not negligible; it is around five times in Virtex-4 and nearly four times in Virtex-5. That is mainly due to the more complex definition of the carry propagate and carry generate functions and to the final mod 10 reduction. The decreasing costs of technology make hardware consumption less central.
For BCD addition, the performance considerations on the Xilinx Virtex-5 platform are similar to those of the 4-input LUT-based Virtex-4 technology. That is, BCD addition remains faster than its binary counterpart under the same conditions.
Finally, the BCD adder/subtractor, with a relatively small penalty in area, presents time performances quite similar to those of a straight BCD adder.
Acknowledgments
This work is supported by the Universities FASTA (Mar del Plata, Argentina), UNCPBA (Tandil, Argentina), UAM (Madrid, Spain), and URV (Tarragona, Spain); it has been partially funded by the CICYT of Spain under contract TEC2007-68074-C02-02/MIC.
References
[1] M. F. Cowlishaw, “Decimal floating-point: algorism for computers,” in Proceedings of the 16th IEEE Symposium on Computer Arithmetic, pp. 104–111, June 2003.
[2] G. Jaberipur and A. Kaivani, “Binary-coded decimal digit multipliers,” IET Computers and Digital Techniques, vol. 1, no. 4, pp. 377–381, 2007.
[3] M. Cowlishaw, “General Decimal Arithmetic,” http://speleotrove.com/decimal/.
[4] F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, “The IBM Z900 decimal arithmetic unit,” in Proceedings of the Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, November 2001.
[5] IEEE Standard for Floating-Point Arithmetic (IEEE 754), IEEE, 2008.
[6] Xilinx Inc., “Virtex-4 User Guide,” April 2007, http://www.xilinx.com/.
[7] Xilinx Inc., “Virtex-5 User Guide,” 2008, http://www.xilinx.com/.
[8] J.-P. Deschamps, G. Bioul, and G. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems, John Wiley & Sons, New York, NY, USA, 2006.
[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, Oxford, UK, 2000.
[10] Xilinx Inc., Xilinx, http://www.xilinx.com/.
[11] “Decimal Arithmetic in FPGA,” http://arithmeticcircuits.org/decimal/.
[12] Xilinx Inc., Constraints Guide: ISE 9.2i, chapter 2, Relative Location (RLOC), 2008.
[13] Xilinx Inc., “XST User Guide 10.1i,” 2008, http://www.xilinx.com/.
[14] Xilinx Inc., “ISE 10.1 Documentation,” 2008, http://www.xilinx.com/.
Copyright
Copyright © 2010 G. Bioul et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.