Selected Papers from ReConFig 2009 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2009)
Research Article | Open Access
Malte Baesler, Sven-Ole Voigt, Thomas Teufel, "A Decimal Floating-Point Accurate Scalar Product Unit with a Parallel Fixed-Point Multiplier on a Virtex-5 FPGA", International Journal of Reconfigurable Computing, vol. 2010, Article ID 357839, 13 pages, 2010. https://doi.org/10.1155/2010/357839
A Decimal Floating-Point Accurate Scalar Product Unit with a Parallel Fixed-Point Multiplier on a Virtex-5 FPGA
Abstract
Decimal floating-point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper, we present a parallel decimal fixed-point multiplier designed to exploit the features of Virtex-5 FPGAs. Our multiplier is based on BCD recoding schemes, fast partial product generation, and a BCD-4221 carry-save adder reduction tree. Pipeline stages can be added to target low latency. Furthermore, we extend the multiplier to an accurate scalar product unit for the IEEE 754-2008 decimal64 data format in order to provide an important operation with the least possible rounding error. Compared to a previously published work, we improve the architecture of the accurate scalar product unit and migrate to Virtex-5 FPGAs. This decreases the fixed-point multiplier's latency by a factor of two and the accurate scalar product unit's latency even by a factor of five.
1. Introduction
Financial calculations are usually carried out using decimal arithmetic, because the conversion between decimal and binary numbers introduces unacceptable errors that may even violate legal accuracy requirements [1]. Therefore, commercial applications often use nonstandardized software to perform decimal floating-point arithmetic. These software implementations are usually 100 to 1000 times slower than equivalent binary floating-point operations in hardware [1]. Because of this increasing importance, specifications for decimal floating-point arithmetic have been added to the recently approved IEEE 754-2008 Standard for Floating-Point Arithmetic [2], which offers a more profound specification than the former Radix-Independent Floating-Point Arithmetic standard IEEE 854-1987 [3]. Therefore, new efficient algorithms have to be investigated, and providing hardware support for decimal arithmetic is becoming a topic of growing interest. However, most modern microprocessors still lack support for decimal floating-point arithmetic, because the additional hardware is costly. The POWER6 is the first microprocessor to implement the IEEE 754-2008 decimal floating-point format fully in hardware [4, 5], while the earlier released z9 architecture already supports decimal floating-point operations but implements them mainly in millicode [6]. Nevertheless, the POWER6 decimal floating-point unit is kept as small as possible and optimized for low cost; thus, its performance is low. It reuses registers from the binary floating-point unit, and the computing unit mainly consists of a wide decimal adder. Other floating-point operations such as multiplication and division are based on this adder, that is, they are performed sequentially.
Due to the increasing integration density of CMOS devices, Field-Programmable Gate Arrays (FPGAs) have recently become attractive for complex computing tasks, rapid prototyping, and testing of algorithms. Furthermore, today's FPGA vendors integrate additional dedicated hardwired logic, such as embedded multipliers, DSP slices, large amounts of on-chip RAM, and fast serial transceiver modules. Thus, using FPGA platforms as coprocessors is an interesting alternative to traditional and expensive VLSI designs.
Besides the four basic arithmetic floating-point operations, that is, addition (+), subtraction (−), multiplication (×), and division (/), a fifth arithmetical operation was introduced in the IEEE 754-2008 standard, the fused multiply-accumulate (MAC). This operation can help to improve the accuracy of scalar products. Unfortunately, this approach does not go far enough, as consecutively applied MAC operations, for example, in a scalar product, can still lead to totally wrong results because of cancellation. The reason is the rounding of intermediate results. For example, the summation of four suitably chosen numbers, each with 16 digits of precision, can lead to four different results, depending on the order of execution. Scalar products are calculated in many applications in which cancellation may cause serious problems or numerical overhead slows down algorithms. This includes linear system solving, least squares problems, and eigenvalue problems [7]. In order to overcome these problems, we consider another operation, the so-called accurate scalar product or accurate MAC [8], which is calculated in two steps. First, the products are computed exactly and are added to a long fixed-point register without loss of accuracy. Then, to obtain a floating-point number, the result is rounded only once. This approach guarantees an optimal scalar product with least-significant-bit accuracy. It can be shown that by providing the accurate scalar product, all operations of computer arithmetic can be performed with maximum accuracy, too [9].
Specifications for decimal arithmetic have been added to IEEE 754-2008 mainly for financial applications. Generally, these applications only use a limited range of floating-point numbers, such that cancellation errors are not an issue, and an accurate scalar product unit seems to offer no gain for decimal arithmetic. Nevertheless, the accurate scalar product unit proposed in this work might be useful because scalar product calculations and accumulations are common operations in financial mathematics, for instance, in portfolio valuation and optimization. Thus, even if cancellation is not an issue, the accurate scalar product unit speeds up these operations, because one multiplication and accumulation are computed in the pipeline in one cycle without interlocks, and the high accuracy is gained at no extra cost.
As specified by IEEE 754-2008 [2], the computation of the elementary floating-point operations +, −, ×, and / is performed by the computation of the exact (infinitely precise) result followed by a rounding to the destination format. We extend this accuracy requirement to the accurate scalar product operation. Let S = ⟨B, p, e_min, e_max⟩ denote a floating-point system, where B is the radix, p is the significand's precision, and e_min and e_max delimit the exponent's range. Moreover, ○ is a rounding operation that induces floating-point addition ⊕ and multiplication ⊙ such that a ⊕ b = ○(a + b) and a ⊙ b = ○(a · b). Then the exact scalar product can be expressed by s = Σ_{i=1..n} a_i · b_i, and the accurate floating-point scalar product by ŝ = ○(Σ_{i=1..n} a_i · b_i). For comparison, the traditional floating-point scalar product is computed by software, rounding each intermediate result. It can be expressed by s̃ = (a_1 ⊙ b_1) ⊕ (a_2 ⊙ b_2) ⊕ ⋯ ⊕ (a_n ⊙ b_n).
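The effect of intermediate rounding can be reproduced with Python's decimal module, which provides software decimal arithmetic with a selectable working precision. This is our own sketch, not part of the paper; the four summands are hypothetical values chosen to provoke cancellation at 16-digit precision.

```python
from decimal import Decimal, getcontext

getcontext().prec = 16  # decimal64 significand precision

# Hypothetical summands chosen so that cancellation occurs.
terms = [Decimal("1e16"), Decimal("1"), Decimal("-1e16"), Decimal("1")]

# Traditional scalar-product style: every intermediate sum is rounded.
rounded_each = Decimal(0)
for t in terms:
    rounded_each += t  # context rounding to 16 digits applies here

# Accurate style: exact (integer) accumulation, one final rounding.
exact = sum(int(t) for t in terms)
rounded_once = +Decimal(exact)  # unary plus applies the context rounding

print(rounded_each, rounded_once)  # -> 1 2
```

The traditionally accumulated sum loses one of the two unit terms to rounding, while the accurate evaluation preserves it.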
The novelty of the decimal fixed-point multiplier presented here is its parallel and pipelined FPGA nature: it is faster than other comparable FPGA implementations and is even time-competitive with binary multipliers implemented in FPGAs. The concept of the accurate scalar product is not new, but hardware support for the accurate binary MAC is rare, and it is rarer still for the decimal accurate MAC. Reference [9] presents a decimal accurate scalar product, but most of its components are serial and have long latencies. Contrary to this, in the new FPGA-based design presented here, we use a fast parallel decimal multiplier and a parallel accurate scalar product unit that can be pipelined to improve latency. This paper summarizes and extends the research published in [10]; in particular, it gives a more detailed introduction and description of the proposed architecture. Furthermore, we improved the speed of the decimal fixed-point multiplier by a factor of two and of the accurate scalar product unit by a factor of five, respectively. The outline is as follows. Section 2 begins with an overview of decimal fixed-point multiplication, followed by the description of our proposed parallel decimal multiplier. Section 3 shortly introduces the accurate scalar product and presents our proposed architecture. In Section 4, the accurate MAC unit is extended by the concept of working spaces, which allow a quasi-parallel use of the accurate scalar product unit. Post-place & route results are presented in detail in Section 5, and finally, in Section 6, the main contributions of this paper are summarized. Additionally, two proofs about complement calculation and the simplification of the summation of sign extensions are given in the appendix.
2. Decimal Fixed-Point Multiplier
The decimal fixed-point multiplier is the basic component of the accurate MAC unit. It computes the product X × Y of the unsigned decimal multiplicand X and multiplier Y, both natural numbers with the same precision p.
Decimal multiplication is more complex than binary multiplication due to the inefficiency of the digit representation in binary logic. It requires handling carries across decimal and binary boundaries and introduces digit correction stages. Furthermore, the number of multiplicand multiples that have to be computed is higher, because each digit ranges from 0 to 9. To reduce this complexity, several different approaches have been proposed, which are described in the following. All of them have in common that the multiplication is performed in two steps: the generation of partial products and their accumulation. However, they differ in the optimization of these steps.
For the calculation of the partial products, two approaches have been proposed. The first method generates and stores the required multiplicand multiples a priori, which are then distributed to the reduction stage through multiplexers controlled by the multiplier's digits. Since this approach requires the generation of eight multiples and some of them, for example, 3X, require a time-consuming carry propagation, Erle et al. [11, 12] proposed a reduced set of multiples {X, 2X, 4X, 5X}. All remaining multiplicand multiples can be generated by adding only two from this set. Lang and Nannarelli [13] describe a parallel design that recodes the multiplier's digit set [0, 9] into the digit sets {0, 5, 10} and {−2, −1, 0, 1, 2}, exploiting that the multiples 5X and 2X can be calculated very fast due to the absence of carry propagation. Vazquez et al. [14, 15] present three different multiplier recoding schemes. The Signed-Digit Radix-10 Recoding transforms the digit set [0, 9] into the signed digit set [−5, 5]. The drawback is the need of a carry propagate adder for the calculation of the multiple 3X. The two other recoding schemes are Signed-Digit Radix-4 Recoding and Signed-Digit Radix-5 Recoding, using the transformation sets {0, 4, 8} and {−2, −1, 0, 1, 2}, and {0, 5, 10} and {−2, −1, 0, 1, 2}, respectively. Both do not need a slow carry propagate adder for partial product generation but require a more complex partial product reduction.
The second method generates the partial products only as needed, using digit-by-digit multipliers with overlapped partial products. To reduce the many input combinations, [16] proposes a digit recoding of both operands from [0, 9] to [−5, 5]. In [17], a direct implementation of BCD digit multipliers is described. It implements a binary digit multiplication followed by a binary-product-to-BCD conversion. Compared to this, in [18], the digit-by-digit multiplier is implemented by means of the FPGA's memory; however, no digit recoding is applied.
The accumulation of the partial products consists of two stages: the fast reduction (addition) to a two-operand representation and a final carry propagate addition. Similar to binary multiplication, the accumulation of the partial products can be performed sequentially, in parallel, or by a semi-parallel approach. A sequential multiplier iteratively adds up each partial product to an accumulated sum. In [19], the accumulation is performed sequentially by decimal (3 : 2) carry-save adders and a final carry propagate adder, which leads to a short critical path delay and low area usage but a longer latency; it performs a multiplication in a number of cycles proportional to the precision p. In parallel multipliers, the area consumption is much higher, but the latency can be reduced, and the architecture can be pipelined to achieve a higher throughput. In [13], a fully parallel multiplier with digit recoding (see above) is presented. The accumulation is performed by a tree of carry-save adders and a final carry propagate adder. Vazquez et al. [14] present a new family of parallel decimal multipliers. The carry-save addition in the reduction stage uses new redundant decimal BCD-4221 and BCD-5211 digit encodings. In [20], a new method of partial product generation is introduced; together with the reduction scheme of [13] and the carry propagate addition method of [14], this design is believed to be the fastest in the literature but sacrifices area for high speed. Although the partial product reduction scheme presented in [20] is the fastest for ASIC designs, the reduction scheme presented in [14] is more appropriate for FPGA designs. The reason is that [20] is based on BCD full adders, which introduce a delay of two lookup tables per reduction stage, whereas the reduction scheme presented in [14] can be implemented with a delay of only one lookup table per reduction stage.
Contrary to the several implementations in ASICs, decimal multipliers are not often implemented in FPGAs; the few exceptions are [10, 18, 21]. The method in [21] exploits the FPGA's internal binary adders and uses decimal-to-binary conversion and vice versa. This approach is only feasible for small multipliers. The decimal multiplication in [18] is sequential and is based on digit-by-digit multipliers that are implemented by memory (BRAM or distributed RAM). It also describes a combinational multiplier design that is only applicable for small precisions. In a recent work [10], we proposed a fully combinational decimal fixed-point multiplier optimized for Xilinx Virtex-II Pro architectures [22]. It is based on fast partial product generation and a combinational fast carry-save adder tree. It can be pipelined to achieve a high throughput, which is a crucial feature for the usage in an accurate scalar product unit. In this work, we adapted the design to Xilinx Virtex-5 devices [23], and in doing so, we could double speed and throughput.
2.1. Proposed Parallel Decimal Multiplier
The proposed decimal fixed-point multiplier computes the product X × Y of the unsigned decimal multiplicand X and multiplier Y. It is fully combinational and can be pipelined. In particular, it is based on BCD recoding schemes, fast partial product generation, and a BCD-4221 carry-save adder (CSA) reduction tree, which is based on [15]. It is optimized for use on Xilinx Virtex-5 FPGAs. A decimal natural number X is called BCD coded when it can be expressed by X = Σ_{i=0..p−1} d_i · 10^i, with each digit d_i ∈ {0, …, 9} represented by four bits. Time-critical components are the BCD-8421 carry propagate adders (CPAs) that are used in partial product generation to calculate the multiplicand's triple 3X and for the final addition. The adders are proposed in [24] and are designed and placed on slice level, considering a minimum carry chain length and least possible propagation delays. Figure 1 shows an elementary BCD-8421 full adder. It consists of an adding and a correction stage using two binary 4-bit adders and a fast carry computation unit that is depicted in Figure 2. It exploits the FPGA's internal fast carry chains to minimize latency. The fast carry computation unit implements two functions on the intermediate result s of the first stage, propagate p_i = (s_i = 9) and generate g_i = (s_i > 9), and the carry signal yields c_{i+1} = g_i ∨ (p_i ∧ c_i). Altogether, the adder consumes 9 lookup tables (LUTs) per digit. In particular, the fast carry-bypass logic (carry computation unit) spans only one LUT.
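The propagate/generate carry chain of the BCD-8421 adder can be modelled digit-serially in software. The following Python sketch is ours, not from the paper; it mimics the two adder stages and checks them against ordinary integer addition.

```python
def bcd8421_add(a, b, width=16):
    """Model of a BCD-8421 carry-propagate adder: binary digit sums,
    a propagate/generate carry chain, and a decimal correction stage."""
    da = [(a // 10**i) % 10 for i in range(width)]   # digits, LSD first
    db = [(b // 10**i) % 10 for i in range(width)]
    inter = [x + y for x, y in zip(da, db)]          # first 4-bit adder stage
    c = [0]                                          # carry chain
    for t in inter:
        p, g = (t == 9), (t > 9)                     # propagate / generate
        c.append(1 if g or (p and c[-1]) else 0)
    # correction stage: add incoming carry, wrap each digit decimally
    digits = [(t + cin) % 10 for t, cin in zip(inter, c)]
    result = sum(d * 10**i for i, d in enumerate(digits))
    return result, c[-1]                             # sum and carry-out

print(bcd8421_add(9999999999999999, 1))  # -> (0, 1)
```

The carry chain never ripples through the digit sums themselves, which is exactly what the FPGA's fast carry logic exploits.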
Generally, the fixed-point multiplier consists of six functional blocks, as depicted in Figure 3. The basic idea is to generate partial products and to sum them up, which is performed by the parallel carry-save adder tree (CSAT) and the final BCD-8421 carry propagate adder (CPA). The CSAT is based on (3 : 2) CSA blocks for the BCD-4221 format. The partial products are the multiplicand's multiples and are selected via the partial product multiplexer (PPMux). Due to the multiplier recoding that transforms the multiplier's digit set [0, 9] into the signed digit set [−5, 5] [15], and a simple method to handle negative partial sums (10's complement), only five multiples (1X, 2X, 3X, 4X, 5X) have to be generated by the multiplicand multiples generator (MMGen) a priori. It can be easily proven that the 10's complement can be calculated by inverting each bit of all digits and adding one (see the appendix). The functionality of the negative digits correction (NegDC) block is explained in the following.
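The complement property rests on the fact that the BCD-4221 weights sum to 9, so inverting a digit's bits always encodes its 9's complement. The small Python check below is our own illustration, using one of several valid BCD-4221 code tables.

```python
# One valid BCD-4221 encoding per digit (bit weights 4, 2, 2, 1).
ENC = {0: 0b0000, 1: 0b0001, 2: 0b0010, 3: 0b0011, 4: 0b1000,
       5: 0b0111, 6: 0b1100, 7: 0b1101, 8: 0b1110, 9: 0b1111}

def val(code):
    """Value of a 4-bit pattern under the 4-2-2-1 weights."""
    return (4 * (code >> 3 & 1) + 2 * (code >> 2 & 1)
            + 2 * (code >> 1 & 1) + (code & 1))

# Bitwise inversion yields the 9's complement of every digit ...
assert all(val(~ENC[d] & 0xF) == 9 - d for d in range(10))

# ... so inverting all digits and adding one yields the 10's complement.
def tens_complement(digits):  # digits: most significant first
    nines = sum(val(~ENC[d] & 0xF) * 10**i
                for i, d in enumerate(reversed(digits)))
    return nines + 1

print(tens_complement([3, 4, 5, 6]))  # -> 6544 == 10**4 - 3456
```

Note that the argument is independent of which legal BCD-4221 code is chosen per digit, since any pattern and its inverse always sum to the full weight 9.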
The MMGen is similar to the generator of multiplicand multiples for SD radix-10 encoding in [15], but the decimal quaternary tree is replaced by the BCD-8421 CPA. It exploits the correlation between shift operations and constant-value multiplication. For example, a BCD-5421 coded decimal number left-shifted by one bit is equivalent to a multiplication by 2, and the result is BCD-8421 coded. Similarly, a BCD-5211 coded number left-shifted by one bit results in a multiplication by two with a BCD-4221 coded result. And finally, a BCD-8421 coded decimal number left-shifted by three bits results in a multiplication by 5, and the result is of type BCD-5421. A recoding operation is very fast and consumes two (6 : 2) LUTs per digit, whereas a constant shift operation costs nothing, because it is just a renaming of signals. Hence, with the exception of 3X, all multiples can be easily generated by simple shift operations and digit recodings. For the 3X multiple, an additional CPA is inevitable, which unfortunately limits the maximum working frequency and thus emphasizes the need for pipelining. Alternatively, the 3X multiple could be composed of two operands and be added in the following CSAT, as proposed in [12]. This would speed up the MMGen but would also double the inputs to the CSAT and significantly increase its complexity and resource consumption. Figure 4 depicts the functionality of the MMGen; it is similar to the generator of multiplicand multiples presented in [14], but we replaced the decimal quaternary tree by our BCD-8421 adder.
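The shift-and-recode trick for the 2X multiple is easy to verify in software. The sketch below is ours: it encodes the multiplicand in BCD-5421, shifts the whole bit string left by one position, and reads the result as BCD-8421.

```python
# BCD-5421 encodings (bit weights 5, 4, 2, 1).
ENC5421 = {0: 0b0000, 1: 0b0001, 2: 0b0010, 3: 0b0011, 4: 0b0100,
           5: 0b1000, 6: 0b1001, 7: 0b1010, 8: 0b1011, 9: 0b1100}

def times2(x):
    """Multiply a decimal number by two via a one-bit left shift:
    encode in BCD-5421, shift, reinterpret 4-bit groups as BCD-8421."""
    digits = [int(c) for c in str(x)]
    bits = "".join(f"{ENC5421[d]:04b}" for d in digits) + "0"  # << 1
    bits = bits.zfill(4 * (len(digits) + 1))                   # whole digits
    return int("".join(str(int(bits[i:i + 4], 2))
                       for i in range(0, len(bits), 4)))

print(times2(49))  # -> 98
```

The analogous check for the 5X multiple (BCD-8421 shifted left by three, read as BCD-5421) works the same way; in hardware both cost nothing beyond the digit recoders.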
The decimal recoding unit (DRec) depicted in Figure 3 reduces the number of multiplicand multiples that have to be computed by the MMGen, as proposed by Vazquez et al. [15]. In the first step, it transforms each multiplier digit from the digit set [0, 9] into a signed digit from [−5, 4] plus an output carry bit, which coincides with the sign signal. In the second step, the carry signal from the previous digit is added to the intermediate result, yielding a signed digit in [−5, 5]. This recoding increases the number of partial products by one (from p to p + 1) but gets along without any ripple carry; hence, it is a very fast operation.
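A behavioural model of this two-step recoding (our sketch) shows that the operand's value is preserved while every recoded digit stays within [−5, 5]:

```python
def drec(x, p=16):
    """Recode the multiplier digits [0..9] into p+1 signed digits in
    [-5..5]: step one maps each digit to [-5..4] plus a carry (= sign)
    bit; step two adds the carry of the next lower digit."""
    sd, carry = [], 0
    for i in range(p):
        d = (x // 10**i) % 10
        s = 1 if d >= 5 else 0          # outgoing carry = sign signal
        sd.append(d - 10 * s + carry)   # in [-5..5] after carry injection
        carry = s
    sd.append(carry)                    # the extra (p+1)-th partial product
    return sd                           # least significant digit first
```

Because the carry of digit i is consumed only by digit i + 1, no ripple chain forms; each output digit depends on just two adjacent input digits.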
Since the multiplier's output is of length 2p but one single partial product is of length p + 1, for 10's complement generation each partial product has to be extended and, if necessary, padded with nines. To keep the input length of the CSAT short, the negative digits correction unit (NegDC) combines the paddings of all partial products in a single word and passes it to the CSAT. This is feasible because adding several words, composed of leading nines and trailing zeros, always yields a decimal word composed of only the digits 0, 8, and 9 (see the appendix).
Moreover, as shown in Figure 5, the positions of the nines and eights can be calculated very fast by means of the FPGA's fast carry chain.
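The 0/8/9 claim is easy to check numerically. Each padding word has the form 10^L − 10^k (nines above position k, zeros below), and since every partial product starts one digit higher than the previous one, the start positions are distinct. Under that assumption, the combined word, taken modulo the accumulated width, only ever contains the digits 0, 8, and 9. This is our own sketch, not the appendix proof:

```python
import random

def combined_padding(starts, width):
    """Sum of 9-padding words 10**width - 10**k, reduced mod 10**width."""
    return sum(10**width - 10**k for k in starts) % 10**width

# Distinct start positions (as produced by successively shifted partial
# products) always yield a word over the digit set {0, 8, 9}.
random.seed(1)
for _ in range(500):
    width = 34
    starts = random.sample(range(width), random.randint(1, width))
    assert set(str(combined_padding(starts, width))) <= set("089")
```

If two paddings could start at the same position, a column sum could produce other digits, so the distinctness of the start positions is essential to the simplification.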
The reduction of the partial products is based on BCD-4221 (3 : 2) CSAs [15] that reduce three BCD-4221 digits to a sum and a carry digit, both in the BCD-4221 coding scheme. In a first version, CSA1, the carry-save adder is implemented as proposed by Vazquez et al. [15]. It consists of a 4-bit binary (3 : 2) CSA and a BCD-4221 to BCD-5211 digit recoder. By means of an implicit shift operation of the BCD-5211 coded carry digit, we obtain a multiplication by two. The block diagram of CSA1 is shown in Figure 6. It consumes six LUTs per digit overall. The drawback of this architecture is that the computation of the sum digit has a latency of one LUT, whereas the computation of the carry digit has a latency of two LUTs. To reduce the computation latency of the carry digit, we propose a new type of carry-save adder, CSA2. It consists of a 2-bit binary (3 : 2) CSA and a carry digit computation unit. The block diagram of CSA2 is shown in Figure 7. The 2-bit binary (3 : 2) CSA sums up the two least significant bits of the three input digits and generates the sum digit. The carry digit is computed from the remaining six most significant bits of the three input digits, which requires four (6 : 2) LUTs. The CSA2 method also consumes six LUTs per digit but has a lower latency than CSA1.
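Because the 4-2-2-1 weights act linearly on the bits, a plain binary (3 : 2) compressor applied to three BCD-4221 digit codes already satisfies a + b + c = s + 2·m, and the doubling of the carry digit m falls out of the BCD-5211 recoding and shift. The Python sketch below is ours, using one valid BCD-5211 code table, and verifies both facts exhaustively.

```python
def val4221(c):
    """Value of a 4-bit pattern under the 4-2-2-1 weights."""
    return (4 * (c >> 3 & 1) + 2 * (c >> 2 & 1)
            + 2 * (c >> 1 & 1) + (c & 1))

def csa_digit(a, b, c):
    """Binary (3:2) compressor on three BCD-4221 digit codes."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)  # sum, majority carry

# Every 4-bit pattern is a legal BCD-4221 digit (values 0..9), and the
# compressor preserves the digit sum with the carry weighted by two.
for a in range(16):
    for b in range(16):
        for c in range(16):
            s, m = csa_digit(a, b, c)
            assert (val4221(a) + val4221(b) + val4221(c)
                    == val4221(s) + 2 * val4221(m))

# Doubling the carry digit: recode to BCD-5211 (weights 5, 2, 1, 1),
# shift left one bit, read as BCD-4221; the top bit spills into the
# next-higher digit with weight 10.
ENC5211 = {0: 0b0000, 1: 0b0001, 2: 0b0011, 3: 0b0101, 4: 0b0111,
           5: 0b1000, 6: 0b1001, 7: 0b1011, 8: 0b1101, 9: 0b1111}
for d in range(10):
    shifted = ENC5211[d] << 1
    assert 10 * (shifted >> 4) + val4221(shifted & 0xF) == 2 * d
```

This linearity is what allows the sum digit to be produced with a single LUT level, as exploited by both CSA variants above.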
The (N : 2) CSA tree is composed of parallel and consecutively wired (3 : 2) CSAs. It reduces N decimal words to two BCD-4221 coded decimal words. The N decimal words are composed of the partial products and one summand that accounts for the sign paddings, as described previously. The CSAT is organized in stages, each of which reduces k words to ⌈2k/3⌉ words. As the ranges of the input words generally differ, the word length increases with each stage, as depicted exemplarily in Figure 8.
The redundant carry-save format of the CSAT can be further reduced by a carry propagate adder of length 2p to obtain a unique result. However, this CPA can be omitted, because the accurate scalar product unit operates on the carry-save format directly.
The maximum frequency of the fixed-point multiplier is limited on the one hand by time-critical components like the CPA and on the other hand by the FPGA's routing overhead. While the maximum propagation delay of the time-critical components can be determined in advance, the routing delay depends highly on the overall project's size. Hence, several pipeline registers can be optionally implemented by means of VHDL generic switches. For a 16-digit multiplier, this is one possible pipelining stage to buffer the input words, three for the MMGen, one for the PPMux, six for the CSAT (one for each reduction stage), and two for the final BCD-8421 conversion and CPA. Altogether, these are 11 possible pipeline registers for the BCD-4221 carry-save format output and 13 stages for the final BCD-8421 carry-propagation format output. It should be noted that the last CSA stage could be combined with the final BCD-8421 converter, as proposed in [15]. However, since the following accurate scalar product unit accumulates redundant BCD-4221 numbers, this improvement could not be applied.
3. Accurate Scalar Product
The accurate scalar product is important for applications in which cancellation may cause problems or numerical overhead slows down algorithms. It is calculated in two steps. First, the products are computed exactly and are summed up in a long fixed-point register without loss of accuracy. Then the result is rounded only once to obtain a floating-point number. Hardware support for the accurate binary scalar product is rare; the accurate decimal scalar product is even less supported by hardware. In [25], a coprocessor with an accurate binary scalar product using the concept of the long accumulator is presented. Reference [9] presents a decimal floating-point arithmetic with hardware as well as software support. It implements the concept of the accurate scalar product, but due to the given hardware restrictions, most of the components are serial and have long latencies. Contrary to this, in the new FPGA-based design presented here, we use a fast parallel multiplier and parallel shift registers. We accelerate the scalar product's accumulation by the use of carry-save adders and get rid of overflow and carry signals by the concept of carry caches. Our design is pipelined and generally requires five cycles to multiply and accumulate, at an operating frequency of more than 100 MHz.
3.1. Proposed Accurate Scalar Product Unit
The fundamental concept of the long accumulator (LA) is to provide a fixed-point register that covers the entire floating-point range of products, as well as an adder that accumulates these products without loss of accuracy (see Figure 9). When computing the scalar product (3a), individual results coming from the decimal fixed-point multiplier are shifted and added to a section of the LA. The respective section depends on the operands' exponents and is calculated by the address generator (AGen). In order to avoid time-consuming carry propagation, the central adder (CAdd) is implemented as a carry-save adder, which implies a doubling of the LA's memory to store both operands. Contrary to [9], positive as well as negative operands are accumulated in the same LA by using the 10's complement data format. To prevent time-consuming ripple-carry propagations due to sign swapping and overflow, we use a so-called carry cache (CC) that buffers any overflow signals. Contrary to a previously published paper [10], in this work, we have simplified the carry handling by removing the principle of fast carry resolution in case of a carry cache overflow. Instead, we have increased the block size of the long accumulator for the carry cache (LACC) to 16 digits, assuming that the CC will never overflow. Actually, in the worst-case scenario, it would take the CC over three years to overflow at a reasonable working frequency of 100 MHz. By applying this simplification, we could increase the operating frequency significantly. Before the final accurate scalar product can be output and stored on a temporary result stack (ResSt), the two carry-save operands of the long accumulator for operands (LAOP) and the entries of the LACC must be summed up and reduced by a final carry propagate adder (FCPA).
To do so, the entire long accumulator would have to be traversed, which is a highly inefficient step, since, due to locality, most applications normally use only a minor percentage of the LA while the remaining entries equal zero. To solve this problem, we introduced a so-called touched blocks register (TBR). During MAC operation, the TBR marks the corresponding blocks of the LA as touched, which means they are most likely unequal to zero. During final result calculation, only those blocks that have previously been marked as touched are actually addressed and read out.
The required length in digits of the long accumulator can be calculated from the significand's length p and the minimum and maximum exponent values e_min and e_max, respectively. In order to accommodate possible overflows, g additional guarding digits are provided on the left. For our design, g = 18 guarding digits are chosen. Considering a maximum working frequency of 100 MHz, it would take the LA over 300 years to overflow. Hence, 18 guarding digits are a reasonable choice. Since a multiplication doubles the significand's length and the exponent's range, the LA must hold a total number of digits of l_LA = g + 2 · (e_max − e_min + p). We implemented the MAC unit for the IEEE 754-2008 decimal64 interchange format with p = 16 digits precision. With p = 16, e_max = 384, and e_min = −383, the accumulator length results in l_LA = 1584 digits. The interchange format decimal32 with 7 digits precision is downward compatible and thus can be applied to the decimal MAC unit, too.
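The accumulator dimensions can be reproduced in a few lines of Python (our sketch, assuming the length formula l_LA = g + 2·(e_max − e_min + p) and the 16-digit segment size with three segments per line used in the memory organization below):

```python
# Long-accumulator sizing for IEEE 754-2008 decimal64 (assumed formula).
p, e_max, e_min, g = 16, 384, -383, 18

l_la = g + 2 * (e_max - e_min + p)   # total digits, products included
segments = l_la // 16                # 16-digit segments
lines = segments // 3                # three blocks (segments) per line

print(l_la, segments, lines)  # -> 1584 99 33
```

With these parameters, the accumulator divides evenly into segments and lines, which is one reason the 18 guard digits are a convenient choice.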
The LA is implemented by use of local Block SelectRAM (BRAM). It is organized in segments, each covering 16 digits. Since the shifted multiplier's result always fits into 47 digits, three arbitrary consecutive segments can be addressed, yielding a word of 48 digits. Therefore, the LA is organized in three blocks with 33 lines. It provides memory both for the long accumulator for operands (LAOP) and for the long accumulator for the carry cache (LACC). To each LAOP block, an LACC block is assigned that handles any overflow signals during accumulation. This prevents pipeline interrupts and allows the storage of negative numbers in 10's complement data format. One LA line comprises the LACC and the LAOP, each with three blocks composed of 16 digits of 4 bits. As the central adder is of (4 : 2) carry-save type with a length of 48 digits, two carry-save operands and two carry cache entries must be stored separately. The advantage of this approach is its high speed due to the absence of a ripple carry signal; the drawback is twice as much memory consumption. Since BRAM is a dual-ported memory, the two carry-save operands can be accessed simultaneously through different ports. Thus, 384 bits must be addressed in parallel, which requires 12 parallel dual-ported BRAMs with 32-bit data ports. Each BRAM has a memory depth of 1024, but both operands only need a depth of 66. The remaining memory can be used for the implementation of the so-called working spaces (WSs), which are introduced below. The LA runs at double data rate, because within one cycle, the operands and carry cache entries have to be fetched from the LA and added to the multiplier's output, and then, in the same cycle, the result has to be written back to the LA.
When a block address is not a multiple of three, the operand spans two memory lines, that is, the least significant digits (LSDs) are not located in the first block but in the second or third. The block alignment is performed by the shift register, which is therefore implemented as a cyclic shift register (see Figure 10). Alternatively, the block alignment could have been implemented between the LA and the CAdd, but this approach would have increased the longest path and reduced the overall operating frequency.
The drawback of the memory organization in lines comprising three segments is a complicated address generation, that is, the need of a division by three. An alternative solution with four blocks per line leads to an easier address calculation but also requires larger multiplexers for operand shift operations. Fortunately, the complicated division by three can be accomplished by applying an embedded binary multiplier, as described in the following.
The address generator (AGen) shown in Figure 9 transforms the input exponents into three addresses (column, block, and line address) to access the LA and to control the shift register. The line and block addresses define a segment, and the column address locates the position inside this segment. Thus, each digit in the LA can be characterized by its exponent e, which relates to the three addresses as e = 48 · A_line + 16 · A_block + A_col. The central adder can only sum up block-aligned operands. For that reason, the multiplier's result has to be shifted cyclically. The shift left amount (SLA) arises from the column and block addresses, whereas the block and line addresses are used to address the LA. Unfortunately, the memory partitioning requires a division by three to determine the line address. That division is accomplished by inverse multiplication, considering the maximum digit's exponent. This approach requires, besides logical, shift, and add operations, one additional binary fixed-value multiplication, which can be performed by the dedicated multiplier of the FPGA's DSP48E slices (see Algorithm 1).
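The division by three can be checked against a software model of the constant multiplication (our sketch; the magic constant 43691 = ⌈2^17/3⌉ is one standard choice for 16-bit inputs, which comfortably covers the segment-index range here):

```python
def line_address(segment):
    """Division by three via a DSP-style constant multiplication:
    exact for all 0 <= segment < 2**16."""
    return (segment * 43691) >> 17   # 43691 = ceil(2**17 / 3)

print(line_address(98))  # -> 32
```

A hardwired multiplier thus replaces a slow combinational divider; the shift by 17 is free in hardware, just as the constant digit shifts in the MMGen are.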

Once the result has been computed by the decimal multiplier, it enters the shift register before it is accumulated by the central adder and stored in the LA, as described above. The shift register extends the decimal multiplier's outputs from 2p = 32 digits to 48 digits and shifts the operands according to the column address. Because the decimal multiplier internally uses digit recoding combined with 10's complement representation, a carry signal might arise (whenever at least one multiplier digit is greater than or equal to five), which is discarded by the subsequent CPA but is still present as a hidden carry in the output of carry-save format. In such cases, the most significant digits (MSDs) of the extended word must be padded with nines, and the overflow has to be cleared by a subtraction of 1 in the carry cache adder (see Algorithm 2). However, the main challenge is the vast shift depth of up to 47 digits along with a large number of operands to be shifted, that is, two operands, each with four bits per digit, in a 48-digit cyclic shift register. Since serial shift registers with low resource consumption cannot be pipelined, only parallel solutions are applicable. Two solutions for parallel cyclic shift registers are analyzed; the first one is a shift register using multiplexers, and the second one applies the hardwired multipliers of the DSP48E slices. The latter is possible because a left shift by k bits complies with a multiplication by 2^k. Virtex-5 devices support the design of large multiplexers by using the dedicated F7AMUX, F7BMUX, and F8MUX multiplexers. Hence, four LUTs can be combined into a 16 : 1 multiplexer.
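The multiplier-based rotation relies on exactly this identity: multiply by 2^k, then fold the product's overflowing high half back onto the low end. A word-level Python model (ours; the 192-bit width corresponds to one 48-digit operand at 4 bits per digit):

```python
def cyclic_shift(x, k, width=192):
    """Cyclic left shift by k bits, realised as a multiplication by 2**k
    followed by folding the high half of the double-width product back."""
    product = x * (1 << k)                 # what the DSP multiplier does
    mask = (1 << width) - 1
    return (product & mask) | (product >> width)

# Sanity check against a direct rotation.
x, k, w = 0xDEADBEEF, 100, 192
assert cyclic_shift(x, k) == ((x << k) | (x >> (w - k))) & ((1 << w) - 1)
```

In hardware, the 48-position rotation is decomposed into cascaded stages, so each partial multiplication stays within the DSP48E operand widths.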

A 48-digit cyclic shift register can be implemented by three consecutively wired 16-digit shift register stages. These stages are composed of either multiplexers or multipliers. Each stage can be pipelined to obtain low latency, as shown in Figure 11. Table 1 summarizes the maximum delay and the number of LUTs used for both cyclic shift register solutions. The multiplexer-based solution is faster but requires far more LUTs, up to 6.25 times as many. Since the longest path in the accurate scalar product unit is bounded by the central adder (approximately 10 ns), the multiplier-based cyclic shift register is preferred because of its much lower resource usage.

To keep latency low, the central adder is a (4 : 2) CSA. The four inputs are the two cyclically shifted words from the decimal multiplier and two operands from the long accumulator. The central adder is composed of two sequentially arranged (3 : 2) CSA stages. Furthermore, negative numbers are applied in their 10's complement representation, which requires an additional correction term. Since the multiplier's output is of redundant carry-save type, two such correction terms are needed. For this purpose, the carry inputs of the (3 : 2) CSA stages are used. Each CSA stage also produces a carry signal that has to be absorbed by the Carry Cache described below. One (3 : 2) CSA stage comprises three 16-digit (3 : 2) CSAs that are interconnected depending on the block address; see Figure 12.
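The digit-level arithmetic of one decimal (3 : 2) carry-save stage can be sketched as follows; a (4 : 2) compressor as used by the central adder chains two such stages. This sketch models plain decimal digits rather than the BCD-4221 encoding used in the hardware.

```python
# Sketch of one decimal (3:2) carry-save addition stage on digit vectors.
# Plain decimal digits are used here; the hardware operates on BCD-4221.

def csa_3_2(a, b, c):
    """Reduce three digit vectors to a sum vector and a carry vector.

    Digits are little-endian (index 0 = least significant). The carry vector
    is returned already shifted one position left, as in a hardware CSA, so
    value(sums) + value(carries) == value(a) + value(b) + value(c).
    """
    n = len(a)
    sums = [0] * n
    carries = [0] * (n + 1)
    for i in range(n):
        t = a[i] + b[i] + c[i]
        sums[i] = t % 10        # digit of the sum word
        carries[i + 1] = t // 10  # carry into the next position
    return sums, carries
```

The point of the carry-save form is that no carry ripples across digit positions, so the stage delay is independent of the operand width.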
To handle overflow during accumulation without interfering with the pipeline, and to allow the storage of negative numbers in 10's complement format without carry propagation, we introduced the CC. It temporarily adds and stores carry and sign signals and also uses the carry-save format. Each of the LA's operand blocks is assigned a CC block, which consists of 16 digits and absorbs the two carry signals of the LA (cout1, cout2) as well as the two negative sign signals due to 10's complement (sign). Because of their size, the CC blocks cannot overflow. Finally, the CCAs also neutralize the hidden carry signal, which is weighted negatively for positive numbers but positively for negative numbers. Summarizing all factors yields the pseudocode depicted in Algorithm 2.
The final result is computed by successively reading out the LA, starting with the least significant digit (LSD), and reducing the CC's entries as well as the LA's operands by means of the CAdd. This redundant result is then summed up by the final carry propagate adder (FCPA) and stored on the result stack (ResSt). Hence, the FCPA produces a series of positive and negative floating-point numbers with fixed precision and ascending block-aligned exponents. The carry-out signal of the FCPA is fed back to the FCPA's carry input. The ResSt is composed of a dual-ported memory. On one port, the result of the FCPA is written into the memory, whereby zero entries are omitted. On the other port, the result is accessible to external components with either the greatest or the smallest number first, depending on the requirements of the subsequent data processing. For example, when a final rounding is required to fit the result into the IEEE 754-2008 data format, it is advantageous to read out the greatest number first.
As applications are usually subject to locality, only a small percentage of the LA is filled with nonzero entries. Thus, it would be very inefficient to traverse the complete LA during the final readout. For performance reasons, we introduced the so-called touched blocks register (TBR). Each time the MAC unit accesses a block in the LA, a corresponding flag in the TBR is set to indicate highly probable nonzero data. Only these previously touched blocks in the LA are considered when computing the final result. In order to reduce the complexity of the final result computation, four consecutive blocks are marked as touched instead of the three that might be expected. This method simplifies the final result computation because possible overflows are already covered and no further exceptional cases must be regarded.
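A minimal software model of the TBR bookkeeping might look as follows; the block count is a hypothetical parameter, and marking four consecutive blocks per access follows the text.

```python
# Sketch of the touched-blocks register (TBR) as a bitmask. The number of
# LA blocks is an illustrative assumption.

NUM_BLOCKS = 64  # assumed number of LA blocks

def touch(tbr, block):
    """Mark `block` and the three following blocks as touched."""
    for b in range(block, min(block + 4, NUM_BLOCKS)):
        tbr |= 1 << b
    return tbr

def touched_blocks(tbr):
    """Blocks to visit during the final readout, in ascending order."""
    return [b for b in range(NUM_BLOCKS) if (tbr >> b) & 1]
```

During readout, only the returned blocks are reduced and fed to the final carry propagate adder, so the cost scales with the number of touched blocks rather than with the full LA size.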
Both the parallel fixed-point multiplier and the accurate scalar product unit are designed to support pipelining. As already described, the fixed-point multiplier with redundant carry-save output has 11 configurable pipeline registers that can be switched on and off by VHDL generics. The accurate scalar product unit adds three further stages for the cyclic shift register and another three for the final carry propagate adder. Especially the latter are important to shorten the longest asynchronous path and to achieve high operating frequencies.
4. Working Spaces
The introduction of so-called working spaces (WS) allows the quasi-parallel use of the MAC unit, that is, several users can access the MAC unit concurrently without interfering with each other. The users can be different processors or different processes on one processor. There can even be a single process that handles more than one accurate scalar product unit, for example, to compute complex scalar products, interval scalar products, and so forth. Working spaces are realized by duplicating all memory elements and adding some multiplexers. The duplicated elements are the long accumulator with operand storage and carry cache, the touched blocks register, and the result stack. The assignment of and access to the working spaces have to be managed by a central control unit, for example, an operating system. The number of working spaces can also be set by VHDL generics; in practice, it is limited only by the available resources.
5. Synthesis Results
All circuits are modeled in VHDL. For synthesis and implementation, Xilinx ISE 10.1 [26] has been used. The fixed-point multiplier and the accurate scalar product unit have been implemented for Xilinx Virtex-5 speed grade -2 devices. First, only the fixed-point multiplier with nonredundant carry-propagate output has been implemented for several pipeline configurations; see Table 2 and Figure 13.

The results show that the minimum overall latency of about 18 ns is achieved with no pipeline registers, and the best operating frequency of 234 MHz is obtained with 10 pipeline registers. However, using 6 or more pipeline registers does not reduce the longest path delay significantly and instead increases the overall latency. The LUT usage varies only slightly across the pipeline configurations. In [18], combinational and sequential memory-based digit-by-digit multipliers are analyzed for the Xilinx Virtex-4 platform. The combinational multiplier uses 22,033 LUTs and has an overall latency of 26.9 ns; the sequential multiplier uses 1,054 LUTs and 8 BRAMs and has an overall latency of 110.5 ns. A fair speed comparison with the design proposed in this work is difficult because of the different FPGA devices. Nevertheless, the unpipelined design proposed in this work is 50% faster than the combinational multiplier of [18]. The sequential multiplier uses rather few LUTs but, contrary to the combinational multiplier, has a poor latency and cannot be pipelined. Thus, only the combinational multiplier might be suitable for an accurate scalar product unit; however, it uses considerably more LUTs than the multiplier proposed in this work.
To compare our design with multiplier designs implemented on the same FPGA chip, we have analyzed a binary multiplier on a Virtex-5 provided by the Xilinx Core Generator; see Table 3. Our architecture is faster than the DSP48E-based binary multiplier. On the other hand, our fixed-point multiplier consumes approximately twice as many LUTs as the binary LUT-based multiplier and is slower, but it has to be considered that decimal multiplication is much more complex than binary multiplication.

The accurate MAC unit has been implemented with two pipeline registers for the decimal fixed-point multiplier. Together with the three pipeline registers of the cyclic shift register, this amounts to a 5-cycle latency to compute and store the product of two operands in the long accumulator. The accurate MAC unit can be clocked at up to 100 MHz. Compared to a previously published paper [10], this is an improvement by a factor of five. In comparison, a software implementation of a single 16-digit floating-point multiplication without any long accumulator already takes 233 cycles on a high-performance processor, and even more on lower-performance architectures [27].
The resource consumption of the accurate MAC unit depends on the number of implemented working spaces. Table 4 summarizes the resource consumption for different configurations.
 
^{1}Decimal fixed-point multipliers in the accurate MAC use two pipeline registers. 
6. Conclusion
In this paper, we presented a decimal fixed-point multiplier that maps well onto FPGA architectures and can help to implement a fully IEEE 754-2008 compliant coprocessor. We analyzed its performance with respect to the number of pipeline registers. Moreover, we integrated the decimal multiplier into an MAC unit that can compute scalar products without loss of accuracy and thus can prevent numerical cancellation. Using the MAC unit on multitasking machines is supported by the concept of working spaces. Compared to a previously published paper [10], we ported our former architecture, which was designed for (4 : 1) LUT-based Xilinx Virtex-II Pro devices, to up-to-date (6 : 2) LUT-based Xilinx Virtex-5 devices. Furthermore, we improved the algorithm of the accurate scalar product unit. For the fixed-point multiplier we achieved a speedup of two, and for the entire accurate scalar product unit even a speedup of five. Even though the migration from Virtex-II to Virtex-5 devices has improved the speed of the accurate scalar product unit, the greater part of the speedup is attributable to the improved algorithm.
Appendix
Proofs
Theorem 1 (B's complement). Let B denote a radix, n the precision, and X a positive integer with n digits x_i, where 0 ≤ x_i ≤ B − 1.
If X > 0, then the B's complement of X, that is, B^n − X, can easily be computed by complementing each digit individually to B − 1 − x_i and adding 1 to the result.
Proof. Firstly, we prove the complement for a single digit; then, we calculate the complement of an n-digit number.
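The inline formulas of this theorem were lost in extraction; the following LaTeX block is a reconstruction using the standard symbols B (radix), n (precision), and digits x_i, stating the classical digitwise-complement identity the proof relies on:

```latex
% Reconstructed statement of Theorem 1 (B's complement):
\[
  B^n - X \;=\; \sum_{i=0}^{n-1} \bigl(B - 1 - x_i\bigr) B^i \;+\; 1 .
\]
% Proof sketch: complement each digit, then sum over all positions.
\begin{align*}
  \sum_{i=0}^{n-1} (B - 1 - x_i) B^i + 1
    &= \sum_{i=0}^{n-1} (B - 1) B^i - \sum_{i=0}^{n-1} x_i B^i + 1 \\
    &= (B^n - 1) - X + 1 \;=\; B^n - X .
\end{align*}
```

The identity shows why no carry propagation is needed: each digit is complemented independently, and the single +1 can be absorbed by a carry input of an adder stage.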
Theorem 2 (Sum of leading nines). The sum of words, each composed of leading 9's followed by 0's, is a decimal word whose least significant digits consist of zero or more leading 9's followed only by 0's, 8's, and 9's; that is, the digits of this part are restricted to 0, 8, and 9.
Proof. We prove the assumption by complete induction. (1) If all addends are zero, the assumption is trivially true. (2) If only a single addend is nonzero, the assumption also holds because the sum then consists of leading 9's followed by 0's. (3) Assume the hypothesis is true for the sum of the first words, considering the most significant digits of the partial sum. For the next inductive step, we add one further word and distinguish the possible cases for its number of leading 9's; in each case the resulting digits again follow the stated pattern, which finally proves the assertion.
References
 M. F. Cowlishaw, "Decimal floating-point: algorism for computers," in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH-16 '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, 2003.
 IEEE, "IEEE 754-2008 Standard for Floating-Point Arithmetic," September 2008.
 ANSI/IEEE, "ANSI/IEEE 854-1987 Standard for Radix-Independent Floating-Point Arithmetic," October 1987.
 L. Eisen, J. W. Ward et al., "IBM POWER6 accelerators: VMX and DFU," IBM Journal of Research and Development, vol. 51, no. 6, pp. 663–683, 2007.
 E. Schwarz and S. Carlough, "Power6 decimal divide," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP '07), IEEE Computer Society, 2007.
 A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, "Decimal floating-point in z9: an implementation and testing perspective," IBM Journal of Research and Development, vol. 51, no. 1/2, pp. 217–227, 2007.
 X. Li, J. W. Demmel et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152–205, 2002.
 U. Kulisch, Advanced Arithmetic for the Digital Computer: Design of Arithmetic Units, Springer, Secaucus, NJ, USA, 2002.
 G. Bohlender and T. Teufel, "BAP-SC: a decimal floating point processor for optimal arithmetic," in Computer Arithmetic: Scientific Computation and Programming Languages, pp. 31–58, B. G. Teubner, Stuttgart, Germany, 1987.
 M. Baesler and T. Teufel, "FPGA implementation of a decimal floating-point accurate scalar product unit with a parallel fixed-point multiplier," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '09), pp. 6–11, IEEE, December 2009.
 M. Erle and M. Schulte, "Decimal multiplication via carry-save addition," in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '03), 2003.
 M. A. Erle, B. J. Hickmann, and M. J. Schulte, "Decimal floating-point multiplication," IEEE Transactions on Computers, vol. 58, no. 7, pp. 902–916, 2009.
 T. Lang and A. Nannarelli, "A radix-10 combinational multiplier," in Proceedings of the 40th Asilomar Conference on Signals, Systems, and Computers (ACSSC '06), pp. 313–317, October 2006.
 A. Vázquez, E. Antelo, and P. Montuschi, "A new family of high-performance parallel decimal multipliers," in Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH '07), pp. 195–204, June 2007.
 A. Vázquez, E. Antelo, and P. Montuschi, "Improved design of high-performance parallel decimal multipliers," IEEE Transactions on Computers, vol. 59, no. 5, pp. 679–693, 2010.
 E. M. Schwarz, M. A. Erle, and M. J. Schulte, "Decimal multiplication with efficient partial product generation," in Proceedings of the 17th IEEE Symposium on Computer Arithmetic (ARITH '05), pp. 21–28, IEEE Computer Society, Washington, DC, USA, 2005.
 G. Jaberipur and A. Kaivani, "Binary-coded decimal digit multipliers," IET Computers and Digital Techniques, vol. 1, no. 4, pp. 377–381, 2007.
 G. Sutter, E. Todorovich, G. Bioul, M. Vazquez, and J.-P. Deschamps, "FPGA implementations of BCD multipliers," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '09), pp. 36–41, IEEE, December 2009.
 M. A. Erle, M. J. Schulte, and B. J. Hickmann, "Decimal floating-point multiplication via carry-save addition," in Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH '07), pp. 46–55, June 2007.
 G. Jaberipur and A. Kaivani, "Improving the speed of parallel decimal multiplication," IEEE Transactions on Computers, vol. 58, no. 11, pp. 1539–1552, 2009.
 H. C. Neto and M. P. Vestias, "Decimal multiplier on FPGA using embedded binary multipliers," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '08), pp. 197–202, September 2008.
 Xilinx, "Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet," November 2007.
 Xilinx, "Virtex-5 Family Overview," February 2009.
 M. Vazquez, G. Sutter, G. Bioul, and J.-P. Deschamps, "Decimal adders/subtractors in FPGA: efficient 6-input LUT implementations," in Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig '09), pp. 42–47, December 2009.
 C. Baumhof, Ein Vektorarithmetik-Koprozessor in VLSI-Technik zur Unterstützung des wissenschaftlichen Rechnens [A vector arithmetic coprocessor in VLSI technology for the support of scientific computing], 1996.
 Xilinx Inc., "Xilinx ISE 10.1 Design Suite Software Manuals and Help," 2008, http://www.xilinx.com.
 M. Schulte, N. Lindberg, and A. Laxminarain, "Performance evaluation of decimal floating-point arithmetic," in Proceedings of the 6th IBM Austin Center for Advanced Studies Conference, 2005.
Copyright
Copyright © 2010 Malte Baesler et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.