Selected Papers from SPL 2009 Programmable Logic and Applications
High-Speed FPGA 10's Complement Adders-Subtractors
This paper first presents a study of the classical BCD adders, from which a carry-chain type adder is redesigned to fit Xilinx FPGA platforms. Some new concepts are presented to compute the P and G functions for carry-chain optimization purposes. Several alternative designs are presented. Then, attention is given to FPGA implementations of add/subtract algorithms for 10's complement BCD numbers. Carry-chain type circuits have been designed on 4-input LUT (Virtex-4, Spartan-3) and 6-input LUT (Virtex-5) Xilinx FPGA platforms. All designs are presented with the corresponding time performance and area consumption figures. Results have been compared to straight implementations of a decimal ripple-carry adder and of an FPGA 2's complement binary adder-subtractor using the dedicated carry logic, both carried out on the same platform. For large operands, shorter delays have been measured for decimal numbers than for binary numbers covering the same range.
In a number of computer arithmetic applications, decimal systems are preferred to the binary ones. The reasons come not only from the complexity of coding/decoding interfaces but mostly from the lack of precision and clarity in the results of the binary systems.
Decimal arithmetic plays a key role in data processing environments such as commercial, financial, and Internet-based applications [1–3]. The performance required by applications with intensive decimal arithmetic is not met by most conventional software-based decimal arithmetic libraries. Hardware implementations embedded in recently commercialized general-purpose processors [3, 4] are gaining importance.
Furthermore, IEEE has recently published a new standard, IEEE 754-2008, that supports floating-point representation for decimal numbers.
At the moment, Binary Coded Decimal (BCD) is used for decimal arithmetic algorithm implementations. Although other coding systems may be of interest, BCD seems to be the best choice so far. Issues of hardware realization of decimal arithmetic units remain widely open: improvements are expected in algorithm concepts as well as in hardware design. This paper summarizes some new concepts about carry-chain type algorithms for adding BCD numbers. Two key ideas have been introduced: (i) the propagate and generate functions are computed from the input data instead of intermediate BCD sums, and (ii) the functions have been implemented on Xilinx Virtex-4 and Virtex-5 FPGA platforms, taking advantage of the 6-input LUT structure of the Virtex-5 family.
Signed-number addition is used as a primitive operation for computing most arithmetic functions, so it deserves particular attention. It is well known that in classical algorithms the execution time of any program or circuit is proportional to the number N of digits of the operands. In order to minimize the computation time, several ideas have been proposed in the literature [8, 9]. Most of them consist in modifying the classical algorithm in such a way as to minimize the computation time of each carry; the time complexity may still be proportional to N, but the proportionality constant may be reduced. Moreover, for the same operand range, decimal addition involves a shorter carry-propagation process than straight binary code. The practical implementations will show that adding BCD digits not only saves coding interfaces but also reduces time delays. Hardware consumption for BCD will be greater if coding and decoding processes are not considered; nowadays, the dramatic decrease in hardware cost stimulates work on time saving.
In this paper, decimal carry-chain and ripple-carry adders have been implemented on Virtex-4 Xilinx FPGA platforms, for a number of operand sizes; comparative performances are presented for binary and BCD digit operands.
Additionally, three adder-subtractor designs have been implemented on Xilinx Virtex-5 FPGA platforms for a number of operand sizes; comparative performances are presented for binary and BCD digit operands, respectively. Adder-subtractor inputs are 10’s complement signed BCD numbers; a sign-change algorithm is used whenever subtraction is at hand.
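2. Base-B Ripple-Carry Adders
Consider the base-B representations of two n-digit numbers:
$x = x(n-1) \cdot B^{n-1} + x(n-2) \cdot B^{n-2} + \cdots + x(0) \cdot B^0$, $y = y(n-1) \cdot B^{n-1} + y(n-2) \cdot B^{n-2} + \cdots + y(0) \cdot B^0$. (1)
Algorithm 1 (pencil and paper) computes the (n+1)-digit representation of the sum $z = x + y + c_{in}$, where $c_{in}$ is an initial carry equal to 0 or 1.
Algorithm 1. Classic addition (ripple carry):
c(0) := c_in;
for i in 0 .. n-1 loop
if x(i) + y(i) + c(i) > B-1 then c(i+1) := 1; else c(i+1) := 0; end if;
z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;
z(n) := c(n);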
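As an illustration of the classic ripple-carry scheme, the following minimal Python sketch adds two base-B numbers digit by digit. The function name, the least-significant-digit-first list convention, and the default base are assumptions made for this example only; they are not taken from the paper's VHDL models.

```python
def ripple_carry_add(x, y, base=10, c_in=0):
    """Add two equal-length digit lists (least significant digit first).

    Returns n+1 digits: the n sum digits followed by the final carry.
    """
    assert len(x) == len(y) and c_in in (0, 1)
    carry = c_in
    z = []
    for xi, yi in zip(x, y):
        s = xi + yi + carry
        # each carry depends on the previous one, hence the O(n) delay
        carry = 1 if s > base - 1 else 0
        z.append(s % base)
    z.append(carry)
    return z


# 758 + 467 = 1225, digits listed least significant first
print(ripple_carry_add([8, 5, 7], [7, 6, 4]))  # [5, 2, 2, 1]
```

Because every iteration waits for the carry of the previous one, the loop mirrors the sequential dependency that makes the hardware delay proportional to the number of digits.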
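3. Base-B Carry-Chain Adders
First define two binary functions of two B-valued variables, namely, the propagate (p) and generate (g) functions:
$p(x, y) = 1$ if $x + y = B - 1$, and 0 otherwise; $g(x, y) = 1$ if $x + y > B - 1$, and 0 otherwise. (2)
The next carry c(i+1) can be calculated as follows:
if p(x(i), y(i)) = 1 then c(i+1) := c(i); else c(i+1) := g(x(i), y(i)); end if; (3)
The corresponding modified Algorithm 2 is the following one.
Algorithm 2. Carry-chain addition
-- computation of the generation and propagation conditions:
for i in 0 .. n-1 loop
g(i) := g(x(i), y(i)); p(i) := p(x(i), y(i));
end loop;
-- carry computation:
c(0) := c_in;
for i in 0 .. n-1 loop
if p(i) = 1 then c(i+1) := c(i); else c(i+1) := g(i); end if;
end loop;
-- sum computation
for i in 0 .. n-1 loop
z(i) := (x(i) + y(i) + c(i)) mod B;
end loop;
z(n) := c(n);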
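The three phases of the carry-chain scheme (generate/propagate computation, carry computation, sum computation) can be sketched in Python as follows. This is only a behavioral model under the usual definitions (propagate when the digit sum equals B-1, generate when it exceeds B-1); the function name and the digit-list convention are assumptions of this example.

```python
def carry_chain_add(x, y, base=10, c_in=0):
    """Carry-chain addition of two digit lists (least significant digit first)."""
    n = len(x)
    assert len(y) == n and c_in in (0, 1)

    # Phase 1: generate and propagate depend on the input digits only,
    # so in hardware all of them can be computed in parallel.
    g = [1 if xi + yi > base - 1 else 0 for xi, yi in zip(x, y)]
    p = [1 if xi + yi == base - 1 else 0 for xi, yi in zip(x, y)]

    # Phase 2: the carry chain is purely binary (one multiplexer per digit).
    c = [c_in]
    for i in range(n):
        c.append(c[i] if p[i] else g[i])

    # Phase 3: mod-B sums, again independent of each other once carries are known.
    z = [(x[i] + y[i] + c[i]) % base for i in range(n)]
    z.append(c[n])
    return z


# 999 + 1 = 1000: the carry generated at digit 0 is propagated through two 9s
print(carry_chain_add([9, 9, 9], [1, 0, 0]))  # [0, 0, 0, 1]
```

Only the second phase is sequential, and each of its steps is a binary operation whatever the base, which is the property exploited by the FPGA carry logic.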
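Comments. (1) Instruction sentence (3) is equivalent to the following Boolean equation:
$c(i+1) = p(i) \cdot c(i) \lor \overline{p(i)} \cdot g(i)$. (4)
Furthermore, if the preceding relation is used, then the definition of the generate function can be modified:
$g(x, y) = 1$ if $x + y > B - 1$; $g(x, y) = 0$ if $x + y < B - 1$; $g(x, y) = 0$ or 1 (don't care) otherwise. (5)
(2) Another Boolean equation equivalent to (4) is
$c(i+1) = g(i) \lor p(i) \cdot c(i)$. (6)
If the preceding relation is used, then the definition of the propagate function can be modified:
$p(x, y) = 1$ if $x + y = B - 1$; $p(x, y) = 0$ if $x + y < B - 1$; $p(x, y) = 0$ or 1 (don't care) otherwise. (7)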
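The Cy.Ch (carry-chain) cell computes the next carry, that is to say
$c(i+1) = g(i) \lor p(i) \cdot c(i)$, (8)
so that $g(i) = 1$ generates a carry, whatever happens upstream in the carry-chain, and $p(i) = 1$ propagates the carry from level $i-1$. The mod B sum cell calculates
$z(i) = (x(i) + y(i) + c(i)) \bmod B$. (9)
As regards the computation time $T_{adder}(n)$, the critical path is shaded in Figure 2. It has been assumed that $T_{sum} > T_{Cy.Ch}$:
$T_{adder}(n) = T_{G\text{-}P} + (n-1) \cdot T_{Cy.Ch} + T_{sum}$. (10)
Another interesting time is the delay $T_{carry}(n)$ from $c(0)$ to $c(n)$, assuming that all propagate and generate functions have already been calculated:
$T_{carry}(n) = n \cdot T_{Cy.Ch}$. (11)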
The carry-chain cells are binary circuits, whereas the generate-propagate and the mod B sum cells are B-ary ones.
Equation (4) can be implemented by a 2-to-1 binary multiplexer (Figure 3(a)) while (6) by a 2-gate circuit (Figure 3(b)). In the first case, the per-digit-delay of a carry-chain adder is equal to the delay of a 2-to-1 binary multiplexer, whatever the base B is.
Figure 3: (a) Carry multiplexer. (b) AND-OR circuit.
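The equivalence of the multiplexer form and the AND-OR form of the carry-chain cell, including the relaxed (don't care) definitions of the generate and propagate functions, can be checked exhaustively. The sketch below is an illustrative verification, not part of the paper; all function names are hypothetical.

```python
from itertools import product


def carry_mux(p, g, c):
    """2-to-1 multiplexer form: select the incoming carry when p = 1, else g."""
    return c if p else g


def carry_and_or(p, g, c):
    """AND-OR (2-gate) form: g OR (p AND c)."""
    return g | (p & c)


def check_carry_forms(base):
    """Return True if both forms give the true carry for every digit pair."""
    for x, y, c in product(range(base), range(base), (0, 1)):
        s = x + y
        p = int(s == base - 1)
        g = int(s > base - 1)
        expected = int(s + c > base - 1)
        # relaxed definitions: the value is free where the other signal decides
        g_relaxed = int(s >= base - 1)   # don't care when s == base - 1 (mux form)
        p_relaxed = int(s >= base - 1)   # don't care when s > base - 1 (AND-OR form)
        results = (
            carry_mux(p, g, c),
            carry_and_or(p, g, c),
            carry_mux(p, g_relaxed, c),
            carry_and_or(p_relaxed, g, c),
        )
        if any(r != expected for r in results):
            return False
    return True


print(all(check_carry_forms(b) for b in (2, 10, 16)))  # True
```

The relaxed definitions matter in practice because they leave the synthesis tool (or the designer) free to pick whichever encoding yields the smaller Boolean function.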
4. Base-10 Complement and Addition
4.1. Ten’s Complement Numeration System
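General principles of B’s complement representation are available in the literature, for example, in computer arithmetic books such as [8, 9]. This paper restricts itself to the 10’s complement system. A one-to-one function R(x), associating a natural number to x, is defined as follows.
Every integer x belonging to the range
$-10^n/2 \le x < 10^n/2$
is represented by $R(x) = x \bmod 10^n$, so that the integer represented in the form $x(n-1)\,x(n-2) \ldots x(1)\,x(0)$ is
$x = R(x)$ if $R(x) < 10^n/2$, and $x = R(x) - 10^n$ if $R(x) \ge 10^n/2$. (12)
The conditions (12) may be more simply expressed as
$x = R(x)$ if $x(n-1) < 5$, and $x = R(x) - 10^n$ if $x(n-1) \ge 5$.
Another way to express a 10’s complement number is
$x = x'(n-1) \cdot 10^{n-1} + x(n-2) \cdot 10^{n-2} + \cdots + x(0)$,
where $x'(n-1) = x(n-1) - 10$ if $x(n-1) \ge 5$ and $x'(n-1) = x(n-1)$ if $x(n-1) < 5$,
while the sign definition rule is the following one: if x is negative, then $x(n-1) \ge 5$; otherwise $x(n-1) < 5$.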
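A short Python sketch may help to make the 10’s complement numeration system concrete. It assumes the standard definition (an integer x in the range from -10^n/2 up to, but excluding, 10^n/2 is stored as x mod 10^n, and the number is negative when its most significant digit is 5 or more); the function names and the least-significant-digit-first convention are choices made for this example.

```python
def to_tens_complement(x, n):
    """Encode integer x as n decimal digits (least significant digit first)."""
    half = 10 ** n // 2
    if not -half <= x < half:
        raise ValueError("x does not fit in n digits")
    r = x % 10 ** n          # Python's % already returns a non-negative result
    return [(r // 10 ** i) % 10 for i in range(n)]


def from_tens_complement(digits):
    """Decode an n-digit 10's complement number (least significant digit first)."""
    n = len(digits)
    r = sum(d * 10 ** i for i, d in enumerate(digits))
    # sign rule: a leading digit of 5 or more marks a negative number
    return r - 10 ** n if digits[-1] >= 5 else r


print(to_tens_complement(-25, 3))       # [5, 7, 9], i.e. the digit string 975
print(from_tens_complement([5, 7, 9]))  # -25
```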
4.2. Ten’s Complement Sign Change
Given an n-digit 10’s complement integer x, the opposite −x is an (n+1)-digit 10’s complement integer. Actually, the only case in which −x cannot be represented with n digits is $x = -10^n/2$, so that $-x = 10^n/2$, that is to say, $-x = 5 \cdot 10^{n-1}$. The computation of the representation of −x is based on the following property.
Assuming x to be represented as an n-digit 10’s complement number, −x may be readily computed as
$-x = \bar{x} + 1$, where $\bar{x}$ is obtained by complementing every digit of x to 9, that is, $\bar{x}(i) = 9 - x(i)$.
A straightforward inversion algorithm then consists in representing x with n+1 digits, complementing every digit to 9, and then adding 1. Observe that sign extension is obtained by adding a digit 0 to the left of a positive number, or a digit 9 to the left of a negative number.
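The inversion algorithm just described (sign extension by one digit, digitwise complement to 9, then add 1) can be sketched as follows. The function name and the least-significant-digit-first list convention are assumptions of this example; the final carry of the add-1 step is discarded, which corresponds to working modulo 10^(n+1).

```python
def negate_tens_complement(digits):
    """Sign change of an n-digit 10's complement number; returns n+1 digits.

    Digits are listed least significant first.
    """
    # sign extension: 0 for a positive number, 9 for a negative one
    sign_digit = 9 if digits[-1] >= 5 else 0
    extended = digits + [sign_digit]

    # digitwise 9's complement
    nines = [9 - d for d in extended]

    # add 1 with a ripple carry; the carry out of the top digit is dropped
    result, carry = [], 1
    for d in nines:
        s = d + carry
        result.append(s % 10)
        carry = s // 10
    return result


print(negate_tens_complement([5, 2, 0]))  # 025 -> [5, 7, 9, 9], i.e. 9975 = -25
print(negate_tens_complement([0, 0, 5]))  # 500 = -500 -> [0, 0, 5, 0], i.e. +500
```

The second call shows why the extra digit is needed: the most negative n-digit value has no n-digit opposite.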
5. Base-10 Adders
5.1. Base-10 Ripple-Carry Adders
For B = 10, the classic and naïve ripple-carry approach for a BCD decimal adder cell can be implemented as in Figure 5. Observe that the critical path involves the carry propagation through 7 binary adders plus a 4-bit Boolean circuit (checking whether the sum is greater than 9 or not).
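Behaviorally, the classic BCD ripple-carry cell amounts to a binary addition, a greater-than-9 detection, and a conditional add-6 correction. The following Python sketch models one such cell at the arithmetic level; it does not reproduce the gate-level structure of Figure 5, and the function name is hypothetical.

```python
def bcd_digit_add(x, y, c_in):
    """One BCD ripple-carry cell: returns (sum digit, carry out)."""
    assert 0 <= x <= 9 and 0 <= y <= 9 and c_in in (0, 1)
    s = x + y + c_in                 # binary adder result, 0..19
    c_out = 1 if s > 9 else 0        # "greater than 9" detection
    z = (s + 6 * c_out) & 0xF        # add-6 correction, keeping 4 bits
    return z, c_out


print(bcd_digit_add(7, 8, 0))  # (5, 1): 7 + 8 = 15
print(bcd_digit_add(9, 9, 1))  # (9, 1): 9 + 9 + 1 = 19
```

Adding 6 skips the six unused 4-bit codes (10 to 15), so that the 4-bit result wraps around to the correct decimal digit.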
5.2. Base-10 Carry-Chain Adders
If B = 10, the carry-chain circuit remains unchanged but the P and G functions as well as the modulo-10 sums are somewhat more complex. In base 2, the mod B sum cell appears to be a single XOR function, while the mod 10 sum cell is more complex as suggested by Figure 5.
In base 2, the P and G cells are, respectively, synthesized by XOR and AND functions, while in base 10, P and G are now defined as follows:
$P(i) = 1$ if $x(i) + y(i) = 9$, and 0 otherwise; $G(i) = 1$ if $x(i) + y(i) > 9$, and 0 otherwise. (17)
A straightforward way to synthesize P and G is shown in Figure 6. Nevertheless, functions P and G may be directly computed from the inputs. The following formulas (18) are Boolean expressions of conditions (17),
where $p_j = x_j(i) \oplus y_j(i)$, $g_j = x_j(i) \cdot y_j(i)$, and $k_j = \overline{x_j(i)} \cdot \overline{y_j(i)}$ are the binary propagator, generator, and carry-kill for the jth components of the BCD digits x(i) and y(i).
The BCD carry-chain adder ith cell is shown at Figure 7. It is made of a first mod 16 adder stage, a carry-chain cell driven by the G-P functions, and an output adder stage performing a correction (adding 6) whenever the carry-out is one. Actually, a zero carry-out c(i+1) identifies that the mod 16 sum does not exceed 9 if c(i) = 0, respectively, 8 if c(i) = 1; so no corrections are needed. Otherwise, the add-6 correction applies.
The G-P functions may be computed according to Figure 6, using the outputs of the mod 16 stage, including its carry-out. At the cost of more hardware, but with shorter delays, formulas (18) may be used instead.
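The cell structure described above (a mod 16 adder stage, a binary carry chain driven by G-P, and an output stage adding 6 whenever the carry-out is one) can be modeled behaviorally as below. This sketch derives G and P from the outputs of the mod 16 stage, as a plain arithmetic reading of that option; it is an assumption-laden illustration rather than a transcription of Figures 6 and 7, and the names are hypothetical.

```python
def bcd_carry_chain_add(x, y, c_in=0):
    """N-digit BCD carry-chain adder model (digits least significant first)."""
    n = len(x)
    s4, p, g = [], [], []
    for xi, yi in zip(x, y):
        total = xi + yi
        s, cout16 = total & 0xF, total >> 4      # mod 16 sum and its carry-out
        s4.append(s)
        p.append(int(cout16 == 0 and s == 9))    # digit sum equals 9
        g.append(int(cout16 == 1 or s > 9))      # digit sum exceeds 9

    # binary carry chain: one AND-OR (or multiplexer) step per decimal digit
    c = [c_in]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))

    # output stage: add the incoming carry, plus 6 when a carry leaves the digit
    z = [(s4[i] + c[i] + 6 * c[i + 1]) & 0xF for i in range(n)]
    return z + [c[n]]


# 4599 + 5401 = 10000
print(bcd_carry_chain_add([9, 9, 5, 4], [1, 0, 4, 5]))  # [0, 0, 0, 0, 1]
```

Note that the correction decision uses the carry-out c(i+1) itself, so no separate greater-than-9 test on the final sum is needed.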
6. FPGA Implementations of the Base-10 Adders on 4-Input LUTs Xilinx Platforms
The base-10 adders of Figures 5 and 7 have been implemented on Xilinx devices based on 4-LUTs (look-up tables with up to 4 inputs). Virtex-4, Spartan-3, and the obsolete Virtex-2, Virtex, and Spartan-2 are 4-input LUT-based FPGAs [6, 10]. In what follows, the area is expressed in LUTs. In the Xilinx Virtex-4 technology, a configurable logic block (CLB) comprises 4 slices, and a slice is made of two 4-LUTs and some additional logic. VHDL models are available at .
6.1. Base-10 Ripple-Carry Adder
The classic implementation of the ripple carry adder cell in FPGA implies a 4-bit adder, a 4-LUT to detect the carry condition, and a final 3-bit adder. The delay and area consumption of an N-digit ripple carry adder are
6.2. FPGA Implementation of the Base-10 Carry-Chain Adder
In order to make the best use of the resources, the design has been achieved using relative location techniques (RLOC) with low-level component instantiations. This first architecture is called GP_a.
The adding stages are implemented as shown at Figures 8(a) and 8(b) while the carry-chain structure with the G-P functions has been implemented as shown at Figure 9 where G is computed according to Figure 6, while P is computed as
The complexity figures of the carry-chain circuit for a 4-digit unit, as shown at Figure 9, are given as
where stands for the average connection delay between two neighboring slices of the same CLB.
where stands for the average connection delays between two slices located in neighbor columns. has to be accounted twice to involve both the connection delay between the 4-bit adder and the carry-chain and the one between the carry chain and the output adder.
6.3. Other Implementations of Base-10 Carry-Chain Adders
This architecture, called GP_b, is shown in Figure 11. The corresponding time and area of a carry-chain cell using this architecture are
The complete cell includes a 4-bit adder and a conditional 3-bit output adder adding 6 whenever necessary (similar to Figure 5). The overall time delay and area consumption using this carry-computation cell is:
The results in area and speed are poor compared to the GP_a implementation (obtaining G-P from the results of the 4-bit adder).
Another alternative is based on the use of dedicated multiplexers. Xilinx Spartan 3, Virtex-2, and Virtex-4 devices have Look-Up Table multiplexers (muxf5, muxf6, muxf7, muxf8) in order to construct functions of 5, 6, 7, and 8 variables without using the general purpose routing fabric.
Using this feature the circuit of Figure 12 (GP_c) can be implemented using the following relations:
The corresponding time and area of a carry-chain cell GP_c is where stands for the delay from an LUT input to a muxf6 output. The complete cell also includes 4-bit adder and a conditional 3-bit adder. The overall delay-area for GP_c cell is
7. FPGA Implementations of Base-10 Adders and Adders-Subtractors on 6-Input LUTs Xilinx Platforms
7.1. Base-10 BCD Carry-Chain Adder
In a first version, Ad-I, the adding stage and correction stage are implemented as shown at Figures 8(a) and 8(b), respectively, while the carry-chain structure with the G-P functions is computed according to Figure 6.
The Xilinx Virtex-5 6-input/2-output LUT is built as two 5-input functions, while the sixth input controls a 2-to-1 multiplexer, so that it implements either two 5-input functions or a single 6-input one; the G and P functions therefore fit in a single LUT, as shown in Figure 13.
In a second version, Ad-II, the carry-chain is sped up thanks to a direct computation of G-P, namely, using the inputs $x(i)$ and $y(i)$ instead of the intermediate sum bits. For this purpose one could use formulas (18); nevertheless, in order to minimize time and hardware consumption, the implementation of P and G is revisited as follows. Remembering that $P = 1$ whenever the arithmetic sum $x(i) + y(i) = 9$, one defines a 6-input function $pp(i)$ set to be 1 whenever the arithmetic sum of the first 3 bits of $x(i)$ and $y(i)$ (the three most significant ones, $x_3 x_2 x_1$ and $y_3 y_2 y_1$) is 4. Then P may be computed as
$P(i) = (x_0(i) \oplus y_0(i)) \cdot pp(i)$. (36)
On the other hand, $gg(i)$ is defined as a 6-input function set to be 1 whenever the arithmetic sum of the first 3 bits of $x(i)$ and $y(i)$ is 5 or more. So, remembering that $G = 1$ whenever the arithmetic sum $x(i) + y(i) > 9$, G may be computed as
$G(i) = gg(i) \lor pp(i) \cdot x_0(i) \cdot y_0(i)$. (37)
As Xilinx Virtex-5 LUTs may compute 6-variable functions, $pp(i)$ and $gg(i)$ may be synthesized using 2 LUTs in parallel, while P and G are computed through an additional single LUT, as shown in Figure 14.
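The two-level decomposition of P and G can be checked exhaustively against the arithmetic definitions (P = 1 when the digit sum is 9, G = 1 when it exceeds 9). In the sketch below, `pp` and `gg` are the two 6-input functions of the upper three bits of each digit, and the final combination uses only `pp`, `gg`, and the two least significant bits. Reading "the first 3 bits" as bits 3 down to 1 is an assumption of this example, justified by the fact that the check only succeeds under that reading.

```python
def pg_from_inputs(x, y):
    """P and G of one BCD digit pair, computed directly from the inputs."""
    hx, hy = x >> 1, y >> 1            # bits 3..1 of each digit, as a value 0..4
    x0, y0 = x & 1, y & 1              # least significant bits
    pp = int(hx + hy == 4)             # first 6-input function
    gg = int(hx + hy >= 5)             # second 6-input function
    p = (x0 ^ y0) & pp                 # sum is 2*4 + 1 = 9
    g = gg | (pp & x0 & y0)            # sum >= 10, or 2*4 + 2 = 10
    return p, g


def check_pg():
    """Compare with the arithmetic definitions for all valid BCD digit pairs."""
    return all(
        pg_from_inputs(x, y) == (int(x + y == 9), int(x + y > 9))
        for x in range(10)
        for y in range(10)
    )


print(check_pg())  # True
```

The identity behind it is x + y = 2(hx + hy) + x0 + y0, which splits the decimal comparison into one test on the upper bits and a trivial test on the lower bits.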
7.2. 10’s Complement BCD Carry-Chain Adder-Subtractor
To compute X + Y, an algorithm similar to that of Section 7.1 is used. To compute X − Y, the 10’s complement subtraction algorithm actually adds (−Y) to X.
7.2.1. 9’s Complement (AS-I)
The 10’s complement sign change algorithm may be implemented through a digitwise 9’s complement stage followed by an add-1 operation. It can be shown that the binary components $y'_k(i)$ of the 9’s complement of a given BCD digit $y(i)$ are expressed as
$y'_3 = \overline{y_3} \cdot \overline{y_2} \cdot \overline{y_1}$, $y'_2 = y_2 \oplus y_1$, $y'_1 = y_1$, $y'_0 = \overline{y_0}$. (38)
For a first implementation, AS-I, Figure 15 presents a 9’s complement implementation using the 6-input/2-output LUTs available in the Xilinx Virtex-5 technology. An add/subtract control signal selects the mode: in subtract mode, formulas (38) apply; otherwise, the components of Y are passed through unchanged.
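A bit-level 9’s complement stage with an add/subtract control can be sketched and verified as follows. The Boolean expressions used here are one standard form of the 9’s complement of a BCD digit, stated as an assumption of this example and checked exhaustively; the function and parameter names are hypothetical.

```python
def nines_complement_stage(y, subtract):
    """Return 9 - y (bitwise) when subtract is true, else y unchanged."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    if subtract:
        c3 = (1 - y3) & (1 - y2) & (1 - y1)   # 1 only for y = 0 or 1
        c2 = y2 ^ y1
        c1 = y1                               # bit 1 is unchanged
        c0 = 1 - y0                           # bit 0 is inverted
    else:
        c3, c2, c1, c0 = y3, y2, y1, y0
    return (c3 << 3) | (c2 << 2) | (c1 << 1) | c0


print([nines_complement_stage(d, True) for d in range(10)])
# [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```

Each output bit depends on at most three input bits plus the control signal, so every pair of output bits fits comfortably in one 6-input/2-output LUT.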
7.2.2. Improving the Adder Stage
To avoid the delay produced by the 9’s complement step, this operation may be carried out within the first binary adder stage, as depicted in Figure 16, where p(i) and g(i) are computed as
7.2.3. Carry-Chain Stage Computing G and P Directly from the Input Data (AS-II)
As far as addition is concerned, the P and G functions may be implemented according to formulas (36) and (37). The idea of AS-II is to compute the corresponding functions in the subtract mode as well, and then to multiplex them according to the add/subtract control signal. For this reason, assuming that the operation at hand is X ± Y, one defines, on the one hand, ppa(i) and gga(i) according to Section 7.1, that is, using the straight values of Y’s BCD components.
On the other hand, pps(i) and ggs(i) are defined according to the same Section 7.1 but using the 9’s complemented components of Y, as computed by the 9’s complement circuit shown in Figure 15. As these components are expressed through (38), both pps(i) and ggs(i) may be computed directly from xk(i) and yk(i), as shown in Figure 17. Nevertheless, for subtraction, the computation of is carried out at the output LUT level. So formulas (36) and (37) are then expressed as
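The AS-II idea (computing add-mode and subtract-mode propagate/generate functions side by side and selecting one pair with the control signal) can be modeled behaviorally as below. Several points are assumptions of this sketch rather than statements from the paper: the add-1 of the sign change is injected as the initial carry, overflow is ignored (the result is taken modulo 10^n), and all names are hypothetical.

```python
def pg(x, y):
    """Propagate/generate of one BCD digit pair from its upper bits and LSBs."""
    hx, hy = x >> 1, y >> 1
    pp, gg = int(hx + hy == 4), int(hx + hy >= 5)
    x0, y0 = x & 1, y & 1
    return (x0 ^ y0) & pp, gg | (pp & x0 & y0)


def bcd_add_sub(x, y, subtract):
    """Compute x + y or x - y on n-digit 10's complement numbers (LSD first)."""
    n = len(x)
    c = [1 if subtract else 0]                    # assumed: add-1 enters as carry-in
    y_eff = [9 - d if subtract else d for d in y]
    for i in range(n):
        pa, ga = pg(x[i], y[i])                   # add-mode pair
        ps, gs = pg(x[i], 9 - y[i])               # subtract-mode pair
        p, g = (ps, gs) if subtract else (pa, ga) # multiplexed by the control
        c.append(g | (p & c[i]))
    return [(x[i] + y_eff[i] + c[i]) % 10 for i in range(n)]


print(bcd_add_sub([3, 2, 1], [5, 4, 0], True))   # 123 - 45 = 078 -> [8, 7, 0]
print(bcd_add_sub([5, 4, 0], [3, 2, 1], True))   # 45 - 123 = -78 -> 922 -> [2, 2, 9]
```

In hardware both pairs are evaluated in parallel, so the selection costs one multiplexer level instead of a full 9’s complement stage in front of the adder.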
8. Experimental Results
8.1. Xilinx Virtex-4 Adder Implementations
The base-10 adders have been implemented on the Xilinx Virtex-4 FPGA family, speed grade -11. Synthesis and implementation have been carried out with XST (Xilinx Synthesis Technology) and Xilinx ISE (Integrated Software Environment) version 10.1.
The performance of different N-digit BCD adders has been compared to that of an M-bit binary carry-chain adder (implemented by XST using the Xilinx fast carry logic) covering the same range, that is, with $M = \lceil N \log_2 10 \rceil \approx 3.32\,N$.
The time and hardware complexities of an M-bit ripple-carry adder implemented on the same 4-LUT based Xilinx FPGA are given by
Formulas (26), (30), (34), and (42) show that, asymptotically, the delay of the decimal carry-chain adders should be somewhat lower than that of the equal-range binary adder. Nevertheless, as shown by the experimental results, the additive terms appearing in (26), (30), and (34) are not negligible for reasonable values of N; so the saving in time will mainly appear in applications where BCD-to-binary coding and decoding operations play a significant role in the overall delay.
Post place-and-route time delays and area consumptions are quoted in Tables 1 and 2, respectively, where N stands for the number of BCD digits while M stands for the number of bits required to cover the decimal N-digit range. The results presented in the table are as follows:
Figure 18 shows the delays for the compared adders. Observe that, for the technology at hand, Table 1 and Figure 18 suggest that for N > 48 the carry-chain decimal implementation of adders is faster than the binary one for the equivalent range. Furthermore, for small numbers of digits to add (N < 40), the GP_c architecture is faster than the other decimal implementations.
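The equal-range pairing between an N-digit decimal adder and an M-bit binary adder can be made concrete with a small helper. It assumes that M is the smallest number of bits able to hold every N-digit decimal value, which is one natural reading of "the number of bits required to cover the decimal N-digit range"; the function name is hypothetical.

```python
def bits_for_digits(n_digits):
    """Smallest M such that 2**M >= 10**n_digits, computed with exact integers."""
    return (10 ** n_digits - 1).bit_length()


for n in (1, 8, 16):
    print(n, bits_for_digits(n))
# 1 4
# 8 27
# 16 54
```

The ratio M/N tends to log2(10), about 3.32, which is why the decimal carry chain has roughly three times fewer stages than the binary one for the same operand range.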
8.2. Virtex-5 Adder-Subtractor Implementations
The adder-subtractor circuits have been implemented on the Xilinx Virtex-5 family, speed grade -2. Synthesis and implementation have been carried out with XST (Xilinx Synthesis Technology) and Xilinx ISE (Integrated Software Environment) version 10.1. The critical parts were designed using low-level component instantiation (lut6_2, muxcy, xorcy, etc.) in order to obtain the desired behavior. The performance of different N-digit BCD adders has been compared to that of an M-bit binary carry-chain adder (implemented by XST) covering the same range, that is, such that $M = \lceil N \log_2 10 \rceil \approx 3.32\,N$.
Table 3 gives the post place-and-route delays in ns for the decimal adder implementations Ad-I and Ad-II of Section 7.1; Table 4 gives the delays in ns for the decimal adder-subtractor implementations AS-I and AS-II of Section 7.2. Table 5 lists the consumed areas expressed in terms of 6-input look-up tables (6-input LUTs). The estimated area presented in Table 5 was empirically confirmed.
Observe that, for large operands, the decimal operations are faster than the binary ones.
The area overhead with respect to binary computation is not negligible. In Virtex-4 the area increases, with respect to an equal-range binary adder, by a factor of between 2.4 and 5.4. In the 6-input LUT family Virtex-5, an adder-subtractor is between 3.0 and 3.9 times bigger.
The present interest in BCD arithmetic systems stimulates further research at both the algorithmic and design levels. Considering that hardware costs are becoming more affordable every day, full hardware BCD units are now very attractive, with growing potential in the near future.
This paper has developed some implementations of BCD adders and subtractors on FPGA platforms. Experimental results show good time performance at a reasonable cost in terms of area. Compared with the binary system, the decimal implementations become faster as operand sizes grow (break-even around 50 digits).
One of the key points about delays comes from the fact that the carry-propagation computation remains binary; then a faster carry-chain circuit can be designed because, for the same operand range, the number of digits (therefore of carries to propagate) is lower in decimal than in binary. In the carry-chain structures studied in this paper, the propagate P and generate G functions are more complex and therefore more time and area consuming than in the binary ones; therefore, the speed improvements only appear for large enough operands. The breakeven point is obviously technology dependent; so it could be expected to occur for a smaller number of digits in the near future.
The area overhead with respect to binary computation is not negligible; it is around five times in Virtex-4 and nearly four times in Virtex-5. That is mainly due to the more complex definition of the carry propagate and carry generate functions and to the final mod 10 reduction. The decreasing costs of technology make hardware consumption less central.
For BCD addition, the performance considerations on Xilinx Virtex-5 platform are similar to those of 4-input LUTs-based Virtex-4 technology. That is, the addition time of BCD digits remains faster than the binary counterpart in the same conditions.
Finally, the BCD adder/subtractor, with a relatively small penalty in area, presents time performances quite similar to those of a straight BCD adder.
This work is supported by the Universities FASTA, Mar del Plata, Argentina, UNCPBA Tandil, Argentina, UAM, Madrid, Spain, and URV, Tarragona, Spain; it has been partially granted by the CICYT of Spain under contract TEC2007-68074-C02-02/MIC.
M. F. Cowlishaw, “Decimal floating-point: algorism for computers,” in Proceedings of the 16th IEEE Symposium on Computer Arithmetic, pp. 104–111, June 2003.
F. Y. Busaba, C. A. Krygowski, W. H. Li, E. M. Schwarz, and S. R. Carlough, “The IBM Z900 decimal arithmetic unit,” in Proceedings of the Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1335–1339, November 2001.
IEEE Standard for Floating-Point Arithmetic (IEEE 754), IEEE, 2008.
J.-P. Deschamps, G. Bioul, and G. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems, John Wiley & Sons, New York, NY, USA, 2006.
B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, Oxford, UK, 2000.
Xilinx Inc., Xilinx, http://www.xilinx.com/.
Xilinx Inc., Constraints Guide—ISE9.2i, chapter 2, Relative Location (RLOC), 2008.