#### Abstract

This paper first presents a study of classical BCD adders, from which a carry-chain type adder is redesigned to fit Xilinx FPGA platforms. Some new concepts are presented to compute the *P* and *G* functions for carry-chain optimization purposes, and several alternative designs are described. Attention is then given to FPGA implementations of add/subtract algorithms for 10's complement BCD numbers. Carry-chain type circuits have been designed on 4-input LUT (Virtex-4, Spartan-3) and 6-input LUT (Virtex-5) Xilinx FPGA platforms. All designs are presented with the corresponding time performance and area consumption figures. Results have been compared to straightforward implementations of a decimal ripple-carry adder and of an FPGA 2's complement binary adder-subtractor using the dedicated carry logic, both carried out on the same platform. For large enough operands, shorter delays have been registered for decimal numbers than for binary numbers covering the same range.

#### 1. Introduction

In a number of computer arithmetic applications, decimal systems are preferred to the binary ones. The reasons come not only from the complexity of coding/decoding interfaces but mostly from the lack of precision and clarity in the results of the binary systems.

Decimal arithmetic plays a key role in data processing environments such as commercial, financial, and Internet-based applications [1–3]. Performances required by applications with intensive decimal arithmetic are not met by most of the conventional software-based decimal arithmetic libraries [1]. Hardware implementation embedded in recently commercialized general purpose processors [3, 4] is gaining importance.

Furthermore, *IEEE* has recently published a new standard 754-2008 [5] that supports the floating point representation for decimal numbers.

At the moment, Binary Coded Decimal (BCD) is used for decimal arithmetic algorithm implementations. Although other coding systems may be of interest, BCD seems to be the best choice so far. Issues of hardware realization of decimal arithmetic units remain widely open: improvements are expected in algorithm concepts as well as in hardware design. This paper summarizes some new concepts about carry-chain type algorithms for adding BCD numbers. Two key ideas have been introduced: (i) the propagate and generate functions are computed from the input data instead of from intermediate BCD sums, and (ii) the functions have been implemented in Xilinx Virtex-4 [6] and Virtex-5 [7] *FPGA* platforms, taking advantage of the 6-input LUT structure of the Virtex-5.

Addition of signed numbers is used as a primitive operation for computing most arithmetic functions, so it deserves particular attention. It is well known that in classical algorithms the execution time of any program or circuit is proportional to the number *N* of digits of the operands. In order to minimize the computation time, several ideas have been proposed in the literature [8, 9]. Most of them consist in modifying the classical algorithm so as to minimize the computation time of each carry; the time complexity may still be proportional to *N*, but the proportionality constant is reduced. Moreover, within the same range, decimal addition involves a shorter carry propagation process than straight binary addition. It will be shown in the practical implementations that adding BCD digits not only saves coding interfaces but also reduces time delays. Hardware consumption for BCD will be greater if coding and decoding processes are not considered; today, however, the dramatic decrease in hardware cost stimulates work on time saving.

In this paper, decimal carry-chain and ripple-carry adders have been implemented on Virtex-4 Xilinx *FPGA* platforms, for a number of operand sizes; comparative performances are presented for binary and BCD digit operands.

Additionally, three adder-subtractor designs have been implemented on Xilinx Virtex-5 FPGA platforms for a number of operand sizes; comparative performances are presented for binary and BCD digit operands, respectively. Adder-subtractor inputs are 10’s complement signed BCD numbers; a sign-change algorithm is used whenever subtraction is required.

#### 2. Base-*B* Ripple-Carry Adders

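Consider the base-*B* representations of two *n*-digit numbers:

$$x = x(n-1)\cdot B^{n-1} + x(n-2)\cdot B^{n-2} + \cdots + x(0)\cdot B^{0}, \qquad y = y(n-1)\cdot B^{n-1} + y(n-2)\cdot B^{n-2} + \cdots + y(0)\cdot B^{0}. \tag{1}$$

Algorithm 1 (pencil and paper) computes the (*n*+1)-digit representation of the sum $z = x + y + c_{in}$, where $c_{in}$ is an initial carry equal to 0 or 1.

*Algorithm 1.* Classic addition (ripple carry):

$c(0) := c_{in}$;

**for** $i$ **in** $0 .. n-1$ **loop**

**if** $x(i) + y(i) + c(i) > B - 1$ **then** $c(i+1) := 1$; **else** $c(i+1) := 0$; **end if**;

$z(i) := (x(i) + y(i) + c(i)) \bmod B$;

**end loop**;

$z(n) := c(n)$;

As $c(i+1)$ is a function of $c(i)$, the execution time of Algorithm 1 is proportional to *n* (Figure 1). In order to reduce the execution time of each iteration step, Algorithm 1 can be modified as shown in Section 3.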
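The ripple-carry scheme can be illustrated with a short Python model. This is only a behavioural sketch, not a circuit description: the function names `ripple_carry_add`, `to_digits`, and `from_digits` are hypothetical, and digits are assumed to be stored least significant first.

```python
def to_digits(value, n, base=10):
    """Split a non-negative integer into n base-`base` digits, LSD first."""
    return [(value // base**i) % base for i in range(n)]


def from_digits(digits, base=10):
    """Inverse of to_digits: rebuild the integer from LSD-first digits."""
    return sum(d * base**i for i, d in enumerate(digits))


def ripple_carry_add(x, y, base=10, c_in=0):
    """Classic digit-serial addition: each carry depends on the previous one."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    z = []
    carry = c_in
    for xi, yi in zip(x, y):
        total = xi + yi + carry
        z.append(total % base)               # sum digit of this position
        carry = 1 if total > base - 1 else 0  # carry into the next position
    z.append(carry)                           # most significant digit of the sum
    return z


if __name__ == "__main__":
    print(ripple_carry_add(to_digits(478, 3), to_digits(965, 3)))
```

The loop-carried dependency on `carry` is what makes the delay of the corresponding circuit grow linearly with the number of digits.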

#### 3. Base-*B* Carry-Chain Adders

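First define two binary functions of two *B*-valued variables, namely, the *propagate* ($p$) and *generate* ($g$) functions:

$$p(i) = \begin{cases} 1 & \text{if } x(i) + y(i) = B - 1, \\ 0 & \text{otherwise;} \end{cases} \qquad g(i) = \begin{cases} 1 & \text{if } x(i) + y(i) > B - 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

The next carry $c(i+1)$ can be calculated as follows:

**if** $p(i) = 1$ **then** $c(i+1) := c(i)$;

**else** $c(i+1) := g(i)$; **end if**; (3)

The corresponding modified Algorithm 2 is the following one.

*Algorithm 2.* Carry-chain addition

-- computation of the generation and propagation conditions:

**for** $i$ **in** $0 .. n-1$ **loop**

$g(i) := g(x(i), y(i))$; $p(i) := p(x(i), y(i))$;

**end loop**;

-- carry computation:

$c(0) := c_{in}$;

**for** $i$ **in** $0 .. n-1$ **loop**

**if** $p(i) = 1$ **then** $c(i+1) := c(i)$; **else** $c(i+1) := g(i)$; **end if**;

**end loop**;

-- sum computation:

**for** $i$ **in** $0 .. n-1$ **loop**

$z(i) := (x(i) + y(i) + c(i)) \bmod B$;

**end loop**;

$z(n) := c(n)$;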
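The three phases of the carry-chain algorithm (generate/propagate, carries, sums) can be sketched in Python as follows. This is a behavioural model under stated assumptions: the name `carry_chain_add` is hypothetical, digits are stored least significant first, and the propagate and generate conditions are taken to be "digit sum equals *B* − 1" and "digit sum exceeds *B* − 1", respectively.

```python
def carry_chain_add(x, y, base=10, c_in=0):
    """Carry-chain addition of two equal-length digit lists (LSD first)."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of digits")
    n = len(x)

    # Phase 1: propagate/generate conditions, one pair per digit position.
    # They depend only on the operand digits, so they can all be formed in parallel.
    p = [1 if xi + yi == base - 1 else 0 for xi, yi in zip(x, y)]
    g = [1 if xi + yi > base - 1 else 0 for xi, yi in zip(x, y)]

    # Phase 2: carry computation; each step is a purely binary selection.
    c = [c_in]
    for i in range(n):
        c.append(c[i] if p[i] == 1 else g[i])

    # Phase 3: sum digits, again independent of each other once carries are known.
    z = [(x[i] + y[i] + c[i]) % base for i in range(n)]
    z.append(c[n])
    return z


if __name__ == "__main__":
    print(carry_chain_add([9, 9, 9], [1, 0, 0]))
```

Only phase 2 is sequential, and its per-digit step does not depend on the base, which is the point of separating the carry calculation from the digit arithmetic.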

*Comments*

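(1) Instruction sentence (3) is equivalent to the following Boolean equation:

$$c(i+1) = \overline{p(i)}\cdot g(i) \lor p(i)\cdot c(i). \tag{4}$$

Furthermore, if the preceding relation is used, then the definition of the generate function can be modified, since the value of $g(i)$ is irrelevant whenever $p(i) = 1$:

$$g(i) = \begin{cases} 1 & \text{if } x(i) + y(i) > B - 1, \\ 0 & \text{if } x(i) + y(i) < B - 1, \\ 0 \text{ or } 1 & \text{otherwise.} \end{cases} \tag{5}$$

(2) Another Boolean equation equivalent to (4) is

$$c(i+1) = g(i) \lor p(i)\cdot c(i). \tag{6}$$

If the preceding relation is used, then the definition of the propagate function can be modified, since the value of $p(i)$ is irrelevant whenever $g(i) = 1$:

$$p(i) = \begin{cases} 1 & \text{if } x(i) + y(i) = B - 1, \\ 0 & \text{if } x(i) + y(i) < B - 1, \\ 0 \text{ or } 1 & \text{otherwise.} \end{cases} \tag{7}$$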
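The equivalence of the carry rule with its multiplexer and AND-OR forms, including the don't-care freedom each form allows, can be checked exhaustively. The Python sketch below is an illustration only; the function names are hypothetical, and the two Boolean forms are assumed to be the multiplexer expression `(not p)·g + p·c` and the AND-OR expression `g + p·c`.

```python
from itertools import product


def carry_rule(p, g, c):
    """If-then-else rule: propagate the incoming carry, else use generate."""
    return c if p else g


def carry_mux(p, g, c):
    """Multiplexer form: p selects between c and g."""
    return ((1 - p) & g) | (p & c)


def carry_and_or(p, g, c):
    """AND-OR form: generate, or propagate an incoming carry."""
    return g | (p & c)


def check_carry_forms(base):
    """Compare all three forms against the arithmetic carry for every digit pair."""
    for x, y, c in product(range(base), range(base), (0, 1)):
        s = x + y
        expected = 1 if s + c > base - 1 else 0
        p = 1 if s == base - 1 else 0
        g = 1 if s > base - 1 else 0
        assert carry_rule(p, g, c) == expected
        assert carry_mux(p, g, c) == expected
        assert carry_and_or(p, g, c) == expected
        if s == base - 1:
            # Multiplexer form: generate is a don't care when propagate is 1.
            assert all(carry_mux(1, g_dc, c) == expected for g_dc in (0, 1))
        if s > base - 1:
            # AND-OR form: propagate is a don't care when generate is 1.
            assert all(carry_and_or(p_dc, 1, c) == expected for p_dc in (0, 1))
    return True


if __name__ == "__main__":
    print(all(check_carry_forms(b) for b in (2, 10, 16)))
```

The don't-care cases are what let a binary cell use a simpler generate input with the multiplexer form.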

The structure of an *n*-digit adder with separate carry calculation is shown in Figure 2. It is based on Algorithm 2. The *G-P* (*Generate-Propagate*) cell calculates the *Generate* and *Propagate* functions (2).

The *Cy.Ch* (*carry-chain*) cell computes the next carry, that is to say

$$c(i+1) = g(i) \lor p(i)\cdot c(i),$$

so that $g(i) = 1$ generates a carry, whatever happens upstream in the carry-chain, and $p(i) = 1$ propagates the carry from level $i - 1$. The mod *B* *sum* cell calculates

$$z(i) = (x(i) + y(i) + c(i)) \bmod B.$$

As regards the computation time , the critical path is shaded in Figure 2. It has been assumed that .

Another interesting time is the delay from to assuming that all propagate and generate functions have already been calculated:

*Comments*

The carry-chain cells are binary circuits, whereas the *generate-propagate* and the mod *B sum* cells are *B*-ary ones.

Equation (4) can be implemented by a 2-to-1 binary multiplexer (Figure 3(a)) while (6) by a 2-gate circuit (Figure 3(b)). In the first case, the *per-digit-delay *of a carry-chain adder is equal to the delay of a 2-to-1 binary multiplexer, whatever the base *B* is.

**Figure 3.** (a) Carry multiplexer. (b) AND-OR circuit.

If *B* = 2 and the carry-chain cell of Figure 3(a) is used, then $p(i) = x(i) \oplus y(i)$, and $g(i)$ can be chosen equal to, for example, $x(i)$. The corresponding cell for an *n*-bit binary adder is shown in Figure 4.

#### 4. Base-10 Complement and Addition

##### 4.1. Ten’s Complement Numeration System

The general principles of *B*’s complement representation are available in the literature, for example in computer arithmetic books such as [8, 9]. This paper restricts itself to the 10’s complement system. A one-to-one function *R*(*x*), associating a natural number to *x*, is defined as follows.

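Every integer *x* belonging to the range

$$-\frac{10^{n}}{2} \le x < \frac{10^{n}}{2}$$

is represented by $R(x) = x \bmod 10^{n}$, so that the integer represented in the form $x(n-1)\,x(n-2) \cdots x(0)$ is

$$x = \begin{cases} R(x) & \text{if } R(x) < 10^{n}/2, \\ R(x) - 10^{n} & \text{if } R(x) \ge 10^{n}/2. \end{cases}$$

The conditions (12) may be more simply expressed as

$$x = \begin{cases} R(x) & \text{if } x(n-1) < 5, \\ R(x) - 10^{n} & \text{if } x(n-1) \ge 5. \end{cases}$$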

Another way to express a 10’s complement number is

where if and

while the sign definition rule is the following one: if is negative, then ; otherwise

##### 4.2. Ten’s Complement Sign Change

Given an *n*-digit 10’s complement integer *x*, the inverse −*x* of *x* is an (*n*+1)-digit 10’s complement integer. Actually, the only case that cannot be represented with *n* digits is $x = -10^{n}/2$, for which $-x = 10^{n}/2$. The computation of the representation of −*x* is based on the following property.

Assuming *x* to be represented as an *n*-digit 10’s complement number, −*x* may be readily computed as

A straightforward inversion algorithm then consists in representing *x* with *n*+1 digits, complementing every digit to 9, and then adding 1. Observe that sign extension is obtained by adding a digit 0 to the left of a positive number or a digit 9 to the left of a negative number.
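The inversion procedure (sign-extend to *n*+1 digits, complement every digit to 9, add 1) can be sketched in Python. This is a minimal illustration with hypothetical function names; it assumes the usual 10’s complement convention in which *x* lies in the range from −10^*n*/2 (included) to 10^*n*/2 (excluded), is stored as *x* mod 10^*n*, and is negative exactly when its leading digit is 5 or more. Digits are stored least significant first.

```python
def to_tens_complement(x, n):
    """n-digit 10's complement representation of x (LSD first)."""
    half = 10**n // 2
    if not -half <= x < half:
        raise ValueError("x does not fit in n digits")
    r = x % 10**n                      # Python's % is non-negative here
    return [(r // 10**i) % 10 for i in range(n)]


def from_tens_complement(digits):
    """Signed value of a 10's complement digit list (LSD first)."""
    n = len(digits)
    r = sum(d * 10**i for i, d in enumerate(digits))
    return r if digits[-1] < 5 else r - 10**n


def sign_extend(digits):
    """Append one digit on the left: 0 for non-negative, 9 for negative."""
    return digits + [0 if digits[-1] < 5 else 9]


def negate(digits):
    """Sign change: extend to n+1 digits, 9's complement each digit, add 1."""
    nines = [9 - d for d in sign_extend(digits)]
    result, carry = [], 1              # the initial carry implements the add-1
    for d in nines:
        total = d + carry
        result.append(total % 10)
        carry = total // 10
    return result                      # the final carry-out is discarded


if __name__ == "__main__":
    print(negate(to_tens_complement(123, 3)))
```

The extra digit is what makes the most negative *n*-digit value invertible: negating −500 with *n* = 3 yields the 4-digit representation of +500.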

#### 5. Base-10 Adders

##### 5.1. Base-10 Ripple-Carry Adders

For *B* = 10, the classic and naïve approach [8] of ripple-carry for a BCD decimal adder cell can be implemented as in Figure 5. Observe that the critical path involves the carry propagation through 7 binary adders plus a 4-bit Boolean circuit (checking if the sum is greater than 9 or not).

##### 5.2. Base-10 Carry-Chain Adders

If *B* = 10, the carry-chain circuit remains unchanged, but the *P* and *G* functions as well as the modulo-10 sums are somewhat more complex. In base 2, the mod *B* *sum* cell reduces to a single XOR function, while the mod 10 *sum* cell is more complex, as suggested by Figure 5.

In base 2, the *P* and *G* cells are, respectively, synthesized by XOR and AND functions, while in base 10, *P* and *G* are now defined as follows:

$$P(i) = \begin{cases} 1 & \text{if } x(i) + y(i) = 9, \\ 0 & \text{otherwise;} \end{cases} \qquad G(i) = \begin{cases} 1 & \text{if } x(i) + y(i) > 9, \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$

A straightforward way to synthesize *P *and *G *is shown at Figure 6. Nevertheless, functions *P *and *G* may be directly computed from inputs. The following formulas (18) are Boolean expressions of conditions (17),

where $p_j$, $g_j$, and $k_j$ are the binary propagator, generator, and carry-kill for the *j*th components of the BCD digits *x*(*i*) and *y*(*i*).

The BCD carry-chain adder *i*th cell is shown at Figure 7. It is made of a first mod 16 adder stage, a carry-chain cell driven by the *G-P* functions, and an output adder stage performing a correction (adding 6) whenever the carry-out is one. Actually, a zero carry-out *c*(*i*+1) identifies that the mod 16 sum does not exceed 9 if *c*(*i*) = 0, respectively, 8 if *c*(*i*) = 1; so no corrections are needed. Otherwise, the *add-*6 correction applies.
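The behaviour of one BCD carry-chain cell can be modelled in a few lines of Python. This is a sketch under assumptions rather than a description of Figure 7: the names `bcd_gp` and `bcd_cell` are hypothetical, *P* and *G* are computed arithmetically from the digit sum, and the incoming carry is assumed to be added in the output stage together with the conditional +6 correction.

```python
def bcd_gp(x, y):
    """Decimal generate/propagate for two BCD digits (values 0..9)."""
    total = x + y
    g = 1 if total > 9 else 0
    p = 1 if total == 9 else 0
    return g, p


def bcd_cell(x, y, c_in):
    """One digit position: mod 16 add, carry-chain step, conditional add-6."""
    s = (x + y) % 16                  # first stage: 4-bit (mod 16) adder
    g, p = bcd_gp(x, y)
    c_out = c_in if p else g          # carry-chain cell
    # Output stage: add the incoming carry, and 6 whenever a decimal carry leaves.
    z = (s + c_in + 6 * c_out) % 16
    return z, c_out


def bcd_add(x_digits, y_digits, c_in=0):
    """Chain the cells over LSD-first BCD digit lists."""
    z, carry = [], c_in
    for x, y in zip(x_digits, y_digits):
        digit, carry = bcd_cell(x, y, carry)
        z.append(digit)
    return z + [carry]


if __name__ == "__main__":
    print(bcd_add([8, 7, 4], [5, 6, 9]))
```

When the carry-out is 0, the 4-bit sum plus carry-in is at most 9 and passes through unchanged; when it is 1, adding 6 modulo 16 is the same as subtracting 10.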

The *G-P* functions may be computed according to Figure 6, using the outputs of the mod 16 stage, including the carry-out . With more hardware consumption, but saving time delays, formulas (18) may be used.

#### 6. FPGA Implementations of the Base-10 Adders on 4-Input LUTs Xilinx Platforms

The base-10 adders of Figures 5 and 7 have been implemented on 4-LUT (look-up tables with up to 4 inputs) Xilinx devices. Virtex-4 and Spartan-3, as well as the obsolete Virtex-2, Virtex, and Spartan-2, are 4-input LUT-based FPGAs [6, 10]. In what follows, the area is expressed in LUTs. In the Xilinx Virtex-4 technology, a configurable logic block (*CLB*) contains 4 *slices*, and a *slice* is made of two 4-LUTs and some additional logic. VHDL models are available at [11].

##### 6.1. Base-10 Ripple-Carry Adder

The classic implementation of the ripple carry adder cell in *FPGA* implies a 4-bit adder, a 4-LUT to detect the carry condition, and a final 3-bit adder. The delay and area consumption of an *N*-digit ripple carry adder are

##### 6.2. FPGA Implementation of the Base-10 Carry-Chain Adder

In order to make the best use of the resources, the design has been achieved using relative location techniques (*RLOC*) [12] with low-level component instantiations. This first architecture is called *GP_a*.

The adding stages are implemented as shown at Figures 8(a) and 8(b) while the carry-chain structure with the *G-P* functions has been implemented as shown at Figure 9 where *G* is computed according to Figure 6, while *P *is computed as

equivalent to the expression of Figure 6. Figure 9 emphasizes that *G* depends on while *P* is computed from and *G*.

The time delay corresponding to the 4-bit adder stage (Figure 8(a)) and the output adder stage (Figure 8(b)) is given as

Both adder stages of Figures 8(a) and 8(b) need the same hardware requirement; computed in slices, the area consumption is given as

The complexity figures of the carry-chain circuit for a 4-digit unit, as shown at Figure 9, are given as

where stands for the average connection delay between two neighboring *slices* of the same *CLB*.

The overall circuit is represented in Figure 10. The overall time delay is computed from formulas (21), (22) and (24):

where stands for the average connection delays between two *slices* located in neighbor columns. has to be accounted twice to involve both the connection delay between the 4-bit adder and the carry-chain and the one between the carry chain and the output adder.

From (23) and (25), the area requirement may be computed as

##### 6.3. Other Implementations of Base-10 Carry-Chain Adders

Functions *P* and *G* may be directly computed from *x*(*i*) and *y*(*i*) inputs using the Boolean expression (18). Using 4-input LUTs (4-LUTs), a first implementation (Figure 11) computes

This architecture, called *GP_b*, is shown in Figure 11. The corresponding time and area of a carry-chain cell using this architecture are

The complete cell includes a 4-bit adder and a conditional 3-bit output adder adding 6 whenever necessary (similar to Figure 5). The overall time delay and area consumption using this carry-computation cell are

The results in area and speed are poor compared to the *GP_a* implementation (obtaining *G-P* from the results of the 4-bit adder).

Another alternative is based on the use of dedicated multiplexers. Xilinx Spartan 3, Virtex-2, and Virtex-4 devices have Look-Up Table multiplexers (muxf5, muxf6, muxf7, muxf8) in order to construct functions of 5, 6, 7, and 8 variables without using the general purpose routing fabric.

Using this feature the circuit of Figure 12 (*GP_c*) can be implemented using the following relations:

The corresponding time and area of a carry-chain cell *GP_c* is
where stands for the delay from an LUT input to a muxf6 output. The complete cell also includes 4-bit adder and a conditional 3-bit adder. The overall delay-area for *GP_c* cell is

#### 7. FPGA Implementations of Base-10 Adders and Adders-Subtractors on 6-Input LUTs Xilinx Platforms

##### 7.1. Base-10 BCD Carry-Chain Adder

In a first version, *Ad*-I, the adding stage and correction stage are implemented as shown at Figures 8(a) and 8(b), respectively, while the carry-chain structure with the *G-P *functions is computed according to Figure 6.

The Xilinx Virtex-5 6-input/2-output LUT is built as two 5-input functions, while the sixth input controls a 2-to-1 multiplexer; this allows either two 5-input functions or a single 6-input function to be implemented. The *G* and *P* functions therefore fit in a single LUT, as shown in Figure 13.

In a second version, *Ad*-II, the carry chain is sped up thanks to a direct computation of the *G-P* functions, namely, using the inputs $x(i)$ and $y(i)$ instead of the intermediate sum bits. For this purpose one could use formulas (18); nevertheless, in order to minimize time and hardware consumption, the implementation of $P(i)$ and $G(i)$ is revisited as follows. Remembering that $P(i) = 1$ whenever the arithmetic sum $x(i) + y(i) = 9$, one defines $pp(i)$ as a 6-input function set to 1 whenever the arithmetic sum of the first 3 bits of $x(i)$ and $y(i)$, that is, of $x_3x_2x_1$ and $y_3y_2y_1$ read as 3-bit numbers, is 4. Then $P(i)$ may be computed as

$$P(i) = pp(i)\cdot\big(x_0(i) \oplus y_0(i)\big). \tag{36}$$

On the other hand, $gg(i)$ is defined as a 6-input function set to 1 whenever the arithmetic sum of the first 3 bits of $x(i)$ and $y(i)$ is 5 or more. So, remembering that $G(i) = 1$ whenever the arithmetic sum $x(i) + y(i) > 9$, $G(i)$ may be computed as

$$G(i) = gg(i) \lor pp(i)\cdot x_0(i)\cdot y_0(i). \tag{37}$$

As Xilinx Virtex-5 LUTs may compute 6-variable functions, $pp(i)$ and $gg(i)$ may be synthesized using 2 LUTs in parallel, while $P(i)$ and $G(i)$ are computed through an additional single LUT, as shown in Figure 14.
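The two-level decomposition can be checked exhaustively over all pairs of BCD digits. The Python sketch below is an illustration with a hypothetical function name; it assumes that the "first 3 bits" are the three most significant bits of each digit and that the remaining least significant bits are combined in the final LUT as an XOR (for *P*) and an AND (for *G*).

```python
def gp_from_inputs(x, y):
    """Decimal P and G computed directly from two BCD digits (0..9)."""
    x_hi, y_hi = x >> 1, y >> 1        # three most significant bits, as 0..4
    x0, y0 = x & 1, y & 1              # least significant bits

    # First level: two 6-input functions of the upper bits only.
    pp = 1 if x_hi + y_hi == 4 else 0
    gg = 1 if x_hi + y_hi >= 5 else 0

    # Second level: combine with the least significant bits.
    p = pp & (x0 ^ y0)                 # sum is exactly 9
    g = gg | (pp & x0 & y0)            # sum is 10 or more
    return p, g


if __name__ == "__main__":
    mismatches = [
        (x, y) for x in range(10) for y in range(10)
        if gp_from_inputs(x, y) != (int(x + y == 9), int(x + y > 9))
    ]
    print(mismatches)
```

The decomposition works because the digit sum equals twice the sum of the upper 3-bit fields plus the two least significant bits, so a sum of 9 requires the upper fields to add to 4 and exactly one low bit to be set.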

##### 7.2. 10’s Complement BCD Carry-Chain Adder-Subtractor

To compute *X* + *Y*, a similar algorithm as in Section 7.1 is used. In order to compute *X* − *Y*, the 10’s complement subtraction algorithm actually adds (−*Y*) to *X*.

###### 7.2.1. ’s Complement (AS-I)

The 10’s complement sign change algorithm may be implemented through a digitwise 9’s complement stage followed by an add-1 operation. It can be shown that the binary components $y'_3\,y'_2\,y'_1\,y'_0$ of the 9’s complement of a given BCD digit $y_3\,y_2\,y_1\,y_0$ are expressed as

$$y'_3 = \overline{y_3 \lor y_2 \lor y_1}, \qquad y'_2 = y_2 \oplus y_1, \qquad y'_1 = y_1, \qquad y'_0 = \overline{y_0}. \tag{38}$$

To compute *X* − *Y*, the 10’s complement subtraction algorithm actually adds (−*Y*) to *X*. So, for a first implementation, *AS*-I, Figure 15 presents a 9’s complement implementation using the 6-input/2-output LUTs available in the Xilinx Virtex-5 technology. An add/subtract control signal selects the operation: in subtract mode, formulas (38) apply; otherwise, the BCD digit is passed through unchanged.

The *AS*-I circuit is similar to the *Ad*-I (Figures 8 and 13) using, instead of input , the input as produced by the circuit of Figure 15.
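A bit-level 9’s complement stage with an add/subtract control can be sketched and verified in Python. The names below are hypothetical, and the Boolean expressions are one standard way of writing the 9’s complement of a BCD digit; they are an assumption for illustration, not a transcription of Figure 15.

```python
def nines_complement_bits(y):
    """9's complement of a BCD digit (0..9) using only bitwise logic."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    w0 = 1 - y0                        # invert the least significant bit
    w1 = y1                            # bit 1 is unchanged
    w2 = y2 ^ y1
    w3 = 1 - (y3 | y2 | y1)            # set only for digits 0 and 1
    return (w3 << 3) | (w2 << 2) | (w1 << 1) | w0


def operand_digit(y, subtract):
    """Digit fed to the adder: complemented in subtract mode, else unchanged."""
    return nines_complement_bits(y) if subtract else y


def add_sub(x_digits, y_digits, subtract):
    """X + Y or X - Y modulo 10**n on LSD-first digit lists.

    In subtract mode the add-1 of the sign change is supplied as the
    initial carry, so no separate incrementer is needed.
    """
    result, carry = [], 1 if subtract else 0
    for x, y in zip(x_digits, y_digits):
        total = x + operand_digit(y, subtract) + carry
        result.append(total % 10)
        carry = total // 10
    return result                      # carry-out discarded (modulo 10**n)


if __name__ == "__main__":
    print(add_sub([2, 5, 0], [7, 1, 0], subtract=True))
```

Feeding the add-1 through the carry input is why the adder-subtractor costs little more than the plain adder.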

###### 7.2.2. Improving the Adder Stage

To avoid the delay produced by the 9’s complement step, this operation may be carried out within the first binary adder stage, as depicted in Figure 16, where *p*(*i*) and *g*(*i*) are computed as

###### 7.2.3. Carry-Chain Stage Computing G and P Directly from the Input Data (AS-II)

As far as addition is concerned, the *P* and *G* functions may be implemented according to formulas (36) and (37). The idea of the *AS*-II is computing the corresponding functions in the subtract mode and then multiplexing according to the add/subtract control signal . For this reason, assuming that the operation at hand is , one defines on one hand *ppa*(*i*) and *gga*(*i*) according to Section 7.1, that is, using the straight values of *Y*’s BCD components.

On the other hand, *pps*(*i*) and *ggs*(*i*) are defined according to the same Section 7.1 but using the 9’s complemented digit of *Y*, as computed by the 9’s complement circuit shown in Figure 15. As its components are expressed from (38), both *pps*(*i*) and *ggs*(*i*) may be computed directly from $x_k(i)$ and $y_k(i)$ as shown in Figure 17. Nevertheless, for subtraction, the computation of is carried out at the output LUT level. So formulas (36) and (37) are then expressed as

#### 8. Experimental Results

##### 8.1. Xilinx Virtex-4 Adder Implementations

The base-10 adders have been implemented on the Xilinx Virtex-4 *FPGA* family, speed grade -11 [6]. Synthesis and implementation have been carried out with *XST* (Xilinx Synthesis Technology) [13] and Xilinx *ISE* (Integrated System Environment) version 10.1 [14].

Performances of different *N*-digit BCD adders have been compared to those of an *M*-bit binary carry chain adder (implemented by *XST* [13] using Xilinx fast carry logic) covering the same range, that is, as

The time and hardware complexities of an *M*-bit ripple-carry adder implemented on the same 4-LUT based Xilinx* FPGA* are given by

Formulas (26), (30), (34), and (42) show that, asymptotically, should be somewhat inferior to *. *Nevertheless, as shown by the experimental results, the additive values appearing in (26), (30), and (34) are not negligible for reasonable values of *N*; so the saving in time will mainly appear for applications where BCD-to-binary coding and decoding operations play a significant role in the overall delay.

Post place-and-route time delays and area consumptions are quoted in Tables 1 and 2, respectively, where *N* stands for the number of BCD digits while *M *stands for the number of bits required to cover the decimal *N*-digit range. The results presented in the table are as follows:

(i) *Ripple:* naïve implementation of base-10 ripple-carry (Section 5.1, Figures 1 and 5),

(ii) *PG_a:* base-10 carry chain using an adder to produce *P-G* values (Section 5.2, Figures 2, 6, 8, 9, and 10),

(iii) *PG_b:* base-10 carry chain computing directly *P-G* values (Section 6.3, Figures 2 and 11),

(iv) *PG_c:* base-10 carry chain computing directly *P-G* values using muxf5 and muxf6 (Section 6.3, Figures 2 and 12),

(v) *M-bit Binary:* base-2 carry chain adder covering the same range as an *N*-digit adder.
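The equal-range comparison between an *N*-digit decimal adder and an *M*-bit binary adder can be made concrete with a small Python helper. This is a sketch under the assumption that "covering the same range" means the smallest *M* such that 2^*M* is at least 10^*N* (unsigned operands); the function name is hypothetical.

```python
def bits_for_decimal_digits(n_digits):
    """Smallest M such that an M-bit unsigned integer can hold 10**n_digits - 1."""
    if n_digits < 1:
        raise ValueError("n_digits must be at least 1")
    # int.bit_length is exact, so no floating-point logarithm is needed.
    return (10**n_digits - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 2, 3, 8, 16):
        print(n, bits_for_decimal_digits(n))
```

Since log2(10) is roughly 3.32, a binary carry chain has a little over three times as many positions to traverse as the decimal one for the same range.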

Figure 18 shows the delays for the compared adders. Observe that, for the technology at hand, Table 1 and Figure 18 suggest that for *N* > 48 the carry-chain decimal implementation of adders is faster than the binary one for the equivalent range. Furthermore for small numbers of digits to add (*N* < 40) the *PG_c* architecture is faster than other decimal implementations.

##### 8.2. Virtex-5 Adder-Subtractor Implementations

The adder-subtractor circuits have been implemented on Xilinx Virtex-5 family with speed grade-2 [7]. The synthesis and implementation have been carried out on *XST* (Xilinx Synthesis Technology) [13] and Xilinx *ISE* (Integrated System environment) version 10.1 [14]. The critical parts were designed using low-level components instantiation (lut6_2, muxcy, xorcy, etc.) in order to obtain the desired behavior. Performances of different *N*-digit BCD adders have been compared to those of an *M*-bit binary carry chain adder (implemented by *XST*) covering the same range, that is, such that

Table 3 exhibits the postplacement and routing delays in *ns* for the decimal adder implementations *Ad*-I and *Ad*-II of Section 7.1; Table 4 exhibits the delays in *ns* for the decimal adder-subtractor implementations *AS*-I and *AS*-II of Section 7.2. Table 5 lists the consumed areas expressed in terms of 6-input look-up tables (6-input LUTs). The estimated area presented in Table 5 was empirically confirmed.

*Comments*

Observe that, for large operands, the decimal operations are faster than the binary ones.

The overall area with respect to binary computation is not negligible. In Virtex-4 the area increases, with respect to an equal-range binary adder, by a factor between 2.4 and 5.4. In the 6-input LUT family Virtex-5, an adder-subtractor is between 3.0 and 3.9 times bigger.

#### 9. Conclusions

The present interest in BCD arithmetic systems stimulates further research at both the algorithmic and design levels. Considering that hardware costs are becoming more affordable every day, full hardware BCD units are now very attractive, with a growing potential in the near future.

This paper has developed some implementations of BCD adders and subtractors on *FPGA* platforms. Experimental results show good time performance at a reasonable cost in terms of area. Compared with the binary system, the decimal implementations become faster as operand sizes grow (break-even around 50 digits).

One of the key points about delays comes from the fact that the carry-propagation computation remains binary; then a faster carry-chain circuit can be designed because, for the same operand range, the number of digits (therefore of carries to propagate) is lower in decimal than in binary. In the carry-chain structures studied in this paper, the propagate *P* and generate *G* functions are more complex and therefore more time and area consuming than in the binary ones; therefore, the speed improvements only appear for large enough operands. The breakeven point is obviously technology dependent; so it could be expected to occur for a smaller number of digits in the near future.

The area overhead with respect to binary computation is not negligible; it is around five times in Virtex-4 and nearly four times in Virtex-5. That is mainly due to the more complex definition of the *carry propagate* and *carry generate* functions and to the final mod 10 reduction. The decreasing costs of technology make hardware consumption less central.

For BCD addition, the performance considerations on Xilinx Virtex-5 platform are similar to those of 4-input LUTs-based Virtex-4 technology. That is, the addition time of BCD digits remains faster than the binary counterpart in the same conditions.

Finally, the BCD adder/subtractor, with a relatively small penalty in area, presents time performances quite similar to those of a straight BCD adder.

#### Acknowledgments

This work is supported by the Universities FASTA, Mar del Plata, Argentina; UNCPBA, Tandil, Argentina; UAM, Madrid, Spain; and URV, Tarragona, Spain. It has been partially funded by the CICYT of Spain under contract TEC2007-68074-C02-02/MIC.