Selected Papers from the 2011 International Conference on Reconfigurable Computing and FPGAs (ReconFig 2011)View this Special Issue
Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs
Decimal floating point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures. The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection function of the third and fourth dividers are based on carry-propagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROM nor a multiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given.
Many applications, particularly commercial and financial applications, require decimal floating-point operations to avoid errors from conversions between binary and decimal formats. This paper presents five different decimal fixed-point dividers and analyzes their performances and resource requirements on FPGA platforms. All five architectures apply a radix-10 digit recurrence algorithm but differ in the quotient digit selection (QDS) function.
The first fixed-point divider (type1) implements a simple shift-and-subtract algorithm. It is characterized by an unsigned and nonredundant quotient digit calculation. Nine divisor multiples are precomputed, and in each iteration step nine carry-propagate subtractions are performed on the residual. Finally, the smallest, nonnegative difference is selected by a large fan-in multiplexer. This type1 implementation is characterized by a high area use.
The second divider (type2) uses a signed-digit quotient calculation with a redundancy of and operands scaling to get a normalized divisor in the range of . The quotient digit selection (QDS) function can be implemented fully by a ROM because it depends only on the two most significant digits (MSDs) of the residual as well as the divisor. The residual uses a redundant carry-save representation but, because of performance issues, the two MSDs are implemented by a nonredundant radix-2 representation.
The quotient digit selection (QDS) functions of the third and fourth divider (type3.a and type3.b) are based on comparators for the two most significand digits. The comparators consist of short binary carry-propagate adders (CPA), which can be implemented very efficiently in the FPGA’s slice structure. The corresponding comparative values depend on the divisor’s value and are precomputed and stored in a small ROM. The redundancy is ; thus, 17 binary CPAs are required. Similar to the type2 divider, the type3.a and type3.b dividers use prescaling of the divisor , a redundant carry-save representation of the residual, and a nonredundant radix-2 representation of the two MSDs. The dividers type3.a and type3.b differ only in the implementation of large fan-in multiplexers in the digit recurrence step. The type3.a divider implements multiplexers that minimize the LUT usage but have long latencies, whereas the type3.b divider implements a faster dedicated multiplexer that exploits the FPGA’s internal carry chains.
The quotient digit selection function of the last divider (type4) requires neither a ROM nor a multiplexer. It is characterized by divisor scaling () and a signed-digit redundant quotient calculation with a redundancy of . The quotient digit is decomposed into three components having values , , and . The components are computed one by one, whereby the digit selection function is constant; that is, the selection constants do not depend on the divisor’s value. Similar to the type2, type3.a, and type3.b dividers, the type4 divider uses a carry-save representation for the residual with a nonredundant radix-2 representation of the two MSDs.
The fixed-point division algorithms are implemented and analyzed on a Virtex-5 FPGA. Finally, the type2 divider, which shows the best tradeoff in area and delay, is extended to a floating-point divider that is fully IEEE 754-2008 compliant for decimal64 data format, including gradual underflow handling and all required rounding modes.
The architectures of the type1, type2, and type4 dividers have already been published in . However, this paper gives a more detailed description of the previous research and introduces two new dividers, which fill the design gap between the type1 and type2 dividers because they are based on two extreme examples of algorithms: the type1 divider implements a restoring quotient digit selection (QDS) function that requires nine decimal carry-propagate adders (CPAs) of full precision, whereas the nonrestoring QDS function of the type2 divider is implemented fully by means of a ROM with limited precision. In comparison, the new dividers implement nonrestoring QDS functions that are based on fast binary CPAs with limited precision.
The outline of this paper is given as follows: Section 2 motivates the use of decimal floating-point arithmetic and its advantage compared to binary floating-point arithmetic. The underlying decimal floating-point standard IEEE 754-2008 is introduced in Section 3. The digit recurrence division algorithms as well as the five different architectures of fixed-point dividers are presented in Section 4. These fixed-point dividers are extended to a decimal floating-point divider in Section 5. Postplace and route results are presented in Section 6, and finally in Section 7 the main contributions of this paper are summarized.
2. Decimal Arithmetic
Since its approval in 1985, the binary floating-point standard IEEE 754-1985  is the most widely used implementation of floating-point arithmetic and the dominant floating-point standard for all computers. In contrast to binary arithmetic, decimal units are more complex, require more area, and are more expensive, and the simple binary coded decimal (BCD) data format has a storage overhead of approximately 20%. Thus, at that time of approval the use of binary in preference to decimal floating-point arithmetic was justified by the better efficiency.
Most people in the world think in decimal arithmetic. These decimal numbers must be converted to binary numbers when using a computer. However, some common finite numbers can only be approximated by binary floating-point numbers. The decimal number , for example, has a periodical continued fraction . It cannot be represented exactly in a binary floating-point arithmetic with finite precision, and the conversion causes rounding errors.
As a consequence, binary floating-point arithmetic cannot be used for any calculations which do not tolerate conversion errors between decimal and binary numbers. These are, for instance, financial and business applications that even require decimal arithmetic by law . Therefore, commercial application often use nonstandardized software to perform decimal floating-point arithmetic. However, these software implementations are usually from 100 to 1000 times slower than equivalent binary floating-point operations in hardware .
Because of the increasing importance, specifications for decimal floating-point arithmetic have been added to the IEEE 754-2008 standard for floating-point arithmetic  that has been approved in 2008 and offers a more profound specification than the former radix-independent floating-point arithmetic IEEE 854-1987 . Therefore, new efficient algorithms have to be investigated, and providing hardware support for decimal arithmetic is becoming more and more a topic of interest.
IBM has responded to this market demand and integrates decimal floating-point arithmetic in recent processor architectures such as z9 , z10 , Power6 , and Power7 . The Power6 is the first microprocessor that implements IEEE 754-2008 decimal floating-point format fully in hardware, while the earlier released z9 already supports decimal floating-point operations but implements them mainly in millicode. Nevertheless, the Power6 decimal floating-point unit is as small as possible and is optimized to low cost. It reuses registers from the binary floating-point unit, and the computing unit mainly consists of a wide decimal adder. Thus, its performance is rather low. Other floating-point operations such as multiplication and division are based on this adder and are performed sequentially. The decimal floating-point units of z10 and Power7 are designed similarly to those of the Power6 [7, 9].
3. IEEE 754-2008
The floating-point standard IEEE 754-2008  has revised and merged the IEEE 754-1985 standard for binary floating-point arithmetic  and IEEE 854-1987 standard for radix-independent floating-point arithmetic . As a consequence, the choice of radices has been focused on two formats: binary and decimal. In this paper we consider only the decimal data format. A decimal number is defined by the triple consisting of the sign , significand , and exponent : with , , and . IEEE 754-2008 uses two different designators for the exponent ( and ) as well as for the significand ( and ). The exponent is applied when the significand is regarded as an integer digit and fraction field, denoted by . The exponent is applied when the significand is regarded as an integer, denoted by . The relation is given through In this paper we exclusively use the integer representation with the exponent .
Unlike binary floating-point format, the decimal floating-point number is not necessarily normalized. This leads to a redundancy, and a decimal number might have multiple representations. The set of representations is called the floating-point number’s cohort. For example, the numbers , , and are all members of the same cohort. More precisely, if a number has significant digits, the number of representations is .
IEEE 754-2008 defines three decimal interchange formats (decimal32, decimal64, and decimal128) of fixed width 32, 64, and 128 bits. As depicted in Figure 1, a floating-point number is encoded by three fields: the sign bit , the combination field , and the trailing significand field. The combination field encodes whether the number is finite, infinite, or not a number (NaN). Furthermore, in case of finite numbers the combination field comprises the biased exponent () and the most significant digit (MSD) of the significand. The remaining digits are encoded in the trailing significand field of width . The trailing significand can either be implemented as a binary integer or as a densely packed decimal (DPD) number . Binary encoding makes software implementations easier, whereas DPD encoding is favored by hardware implementations, as it is the case in this paper.
The encoding parameters for the three fixed-width interchange formats are summarized in Table 1. This paper focuses on the data format decimal64 with DPD coded significand. DPD encodes three decimal digits (four bits each) into a declet (10 bits) and vice versa . It results in an storage overhead of only 0.343% per digit.
IEEE 754-2008 defines five rounding modes. These are two modes to the nearest (round ties to even and round ties to away) and three directed rounding modes (round toward positive, round toward negative, and round toward zero) . As floating-point operations are obtained by first performing the exact operation in the set of real numbers and then mapping the exact result onto a floating-point number, rounding is required whenever all significant digits cannot be placed in a single word of length . Moreover, inexact, underflow, or overflow exceptions are signaled when necessary.
4. Decimal Fixed-Point Division
Oberman and Flynn  distinguish five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency, whereby many practical algorithms are combinations of multiple classes. Compared to binary arithmetic, decimal division is more complex. Currently, there are only a few publications concerning radix-10 division.
Wang and Schulte  describe a decimal divider based on the Newton-Raphson approximation of the reciprocal. The latency of a decimal Newton-Raphson approximation directly depends on the latency of the decimal multiplier. A pipelined multiplier has the advantage that more than one division operation can be processed in parallel; otherwise the efficiency is poor. However, the algorithm lacks remainder calculation, and the rounding is more complex. The first FPGA-based decimal Newton-Raphson dividers are presented in . The dividers as well as the underlying multipliers are sequential; hence, these dividers have a high latency.
Digit recurrence division is the most widely implemented class of division algorithms. It is an iterative algorithm with linear convergence; that is, a fixed number of quotient digits is retired every iteration step. Compared to radix-2 arithmetic, radix-10 digit recurrence division is more complex because, on the one hand, decimal logic is less efficient by itself and, on the other hand, the range of the quotient digit selection (QDS) function comprises a larger digit set. Therefore, the performance of the digit recurrence divider depends on the choice of the QDS function and the implementation of decimal logic.
Nikmehr et al.  select quotient digits by comparing the truncated residual with limited precision multiples of the divisor. Lang and Nannarelli  replace the divisor’s multiples by comparative values obtained by a look-up table and decompose the quotient digit into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required. Vázquez et al.  take a different approach: the selection constants in the QDS function are obtained of truncated multiples of the divisor, avoiding look-up tables. Therefore, the multiples are computed on-the-fly. Moreover, the digit recurrence iteration implements a slow carry-propagate adder, but an estimation of the residual is computed to make the determination of the quotient digits independent of this carry-propagate adder. The decimal divider of the Power6 microprocessor  uses extensive prescaling to bound the divisor to be greater than or equal to 1.0 but lower than 1.1112 in order to simplify the digit selection function. However, this prescaling is very costly, it requires a 2-digit multiply, and it needs overall six cycles on each operand. Furthermore, the digit selection function still requires a look-up table.
The first decimal fixed-point divider designed for FPGAs applies digit recurrence algorithm and is proposed by Ercegovac and McIlhenny [17, 18]. However, it has a poor cycle time, because it does not fit well on the slice structure of FPGAs but would probably fit well into ASIC designs with dedicated routing.
4.1. Digit Recurrence Division
In the following, we use different decimal number representations. A decimal number is called BCD- coded when can be expressed by A number representation is called redundant if one or more digits have multiple representations.
Moreover, we consider a division in which the decimal dividend and the decimal divisor are positive and normalized fractional numbers of precision , that is, . The quotient and the remainder are calculated as follows: Then the radix-10 digit recurrence is implemented by with the initial residual and with the quotient digit calculated by the selection function Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, that differ in the quotient digit selection (QDS) function and the dynamical range of the residual. A restoring divider selects the next positive quotient digit such that the next partial residual is as small as possible but still positive. The IBM z900 architecture, for instance, implements such a decimal restoring divider . By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative , and the partial residual might also be negative.
One advantage of restoring division is the enhanced performance since estimates of limited precision ( and ) might be used in the QDS function . This performance gain is achieved by using a redundant digit set, , which defines a redundancy factor Then, in each iteration step the quotient digit should be selected such that the next residual is bounded by which is called the convergence condition .
This paper presents five different decimal fixed-point dividers that are described in the following. The first divider (type1) is a restoring divider whereas the four others (type2, type3.a, type3.b, and type4) are nonrestoring dividers.
4.2. Type1 QDS Function
The QDS function of the type1 algorithm is very simple. Nine multiples of the divisor () are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest positive result is selected. The CPAs exploit the FPGA’s internal fast carry logic, as described in .
The nine multiples are precomputed and are composed of the multiples , , , and their negatives (10’s complement), which require at most one additional CPA per multiple. The multiples and can be easily computed by digit recoding and constant shift operations where (9) is read as follows. A BCD-5421 coded number left-shifted by one bit is equivalent to the corresponding BCD-8421-coded number multiplied by two. In a similar fashion we obtain a multiplication by five using (10).
The implementation of the type1 digit recurrence divider is characterized by a high area use, due to the utilization of nine parallel CPAs. The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.
4.3. Type2 QDS Function
The approach of the type2 division algorithm is based on the implementation of the QDS function fully by a ROM. This ROM is addressed by estimates of the residual and divisor. The use of estimates is feasible because a signed digit set together with a redundancy greater than is used. The estimates and are obtained by truncation; that is, only a limited number of MSDs of the residual and divisor are regarded. Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision (one integer digit and fractional digits) is bounded by . Negative residuals are represented by their 10’s complement.
The divisor is subdivided into subranges of equal width . For each subrange , the QDS function is defined by selection constants : These selection constants are bounded by selection intervals , with the containment condition 
Furthermore, the continuity condition states that every value must belong at least to one selection interval . Considering the maximum error due to truncation, the continuity condition can be expressed by If we suppose the subrange width to be constant () and consider that the minimum overlapping occurs for minimum and maximum , then we obtain the term We choose () and that leads to and . Thus, for the divisors’s estimation is required an accuracy of two fractional digits, and for the residual’s estimation is required an accuracy of one integer and one fractional digit. Furthermore, the divisor must be pre-scaled such that .
The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries . It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants is shown in Figure 2.
The two MSDs of the dividend and residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM’s size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the unsigned radix-2 representation of the divisor’s two MSDs and eight bits from the signed radix-2 representation of the residual’s two MSDs. Hence, the size of the ROM is bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed: The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations of and are truncations of and ; that is, and .
Once the quotient digit has been determined, the multiple is subtracted from the current residual to compute the next residual, as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA), as described in , except for the two MSDs in wich we apply radix-2 CPAs. Thus, the multiple is composed of two summands: where are precomputed. The multiplies and can be easily and fast computed by digit recoding and constant shift operation as shown in (9) and (10). The 10’s and 2’s complements are applied for . For both, radix-2 and BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3 : 1) radix-2 CPA for the two MSDs and a redundant radix-10 (4 : 2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.
4.4. Type3 QDS Function
The frequency limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slower compared to common FPGA logic. Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze this impact of the BRAM delays, we implement two dividers (type3.a and type3.b), which can also be realized without BRAM in the critical path.
Lang and Nannarelli  propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are pre-computed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA’s slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.
The type3 dividers are modifications of the type2 divider. The common architectural features are (i)the multiples of the divisor are compos of two components: , where , (ii)the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5), (iii)the radix-2 implementation of the two MSDs, and (iv)the 10’s and 2’s complements for . Furthermore, since the architectures of the type2 and type3 dividers are similar, with the exception of the quotient digit selection function, most of the dividers’ characteristics are also identical, including the (i)the P-D diagram, (ii)the redundancy factor , , (iii)the need of prescaling the divisor , (iv)the accuracy of the divisor’s estimation , and (v)the accuracy of the residual’s estimation . Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants are dependent on the current divisor and are stored in a ROM, which is then no more part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.
The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path is bought at a high price because the complexity is moved to the selection of the divisors multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17 : 1) multiplexer that exploits the FPGA’s fast carry chains and has a delay of only one LUT instance. In the following we name the divider with improved fast multiplexers type3.b and with traditional multiplexers type3.a. The new dedicated (17 : 1) multiplexer uses the 16-sign bits as select lines and is coded as follows: with and . In other words, bit is selected when and . The selection of two input signals is implemented in one (5 : 1) LUT, and all signals are combined by a long OR gate that is implemented exploiting the FPGA’s fast carry chain. Therefore, one (17 : 1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses much more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17 : 1) multiplexer is depicted in Figure 5.
4.5. Type4 QDS Function
The type4 divider applies a new algorithm as proposed by us in a previous paper . It is based on the decomposition of the signed quotient digit into three components having values , , and as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple in the digit recurrence iteration step is required. The divider is intended to utilize less resources than other implementations. As the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.
Quotient digits are decomposed into three components , and each component is computed by a distinct selection function. These components can hold three values so that only two comparators per component are required to distinguish between these values. Since , we get a redundancy factor of . Furthermore, the selection functions become very simple due to prescaling of the divisor (); that is, they are constant functions that do not depend on the divisor anymore:
The recurrence is then defined as follows:
The multiples and can be easily and fast computed according to (9) and (10). Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals’ estimations (, , and ). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.
First, the selection intervals for with and have to be determined, where is the smallest and is the greatest value of the selection constant such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor , we obtain
Likewise, the lower limits can be computed
Due to the redundancy factor we have overlapping regions , where more than one quotient digit component may be selected. The decision boundary of a selection function should lie inside this overlapping regions. Figure 6 shows the P-D diagram for . The figures for and look similar. As can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for and are also independent of the divisor and can also be implemented by constant functions in a similar fashion.
We now determine the required number of fractional digits for the estimates , , and . Moreover, we choose suitable selection constants and . These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by as well as . Since we use BCD-4221 carry-save redundant digit representation, for a precision of digits (one integer digit and fractional digits), we have a maximum error of . Moreover, the estimations are computed by truncations (). This means, for given positive and negative selection constants and , the following expressions must be true:
If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of digits, and each selection function is reduced to two simple 2-digit comparators.
As soon as one component of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component can hold three different values that are multiplied with the corresponding weighted divisor with . In other words, is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.
Similar to type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires less LUTs and results in faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.
4.6. Proposed Decimal Fixed-Point Divider
For each type of division algorithms presented in the preceding sections we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers we expect the input operands to be normalized.
The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples are precomputed. In the following cycles quotient digits ( if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.
The architectures of the type2, type3.a, type3.b, and type4 are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are re-scaled and five as well as two times the divisor are precomputed according to (9) and (10). Then, the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.
The digit recurrence retires in each iteration one signed quotient digit . The on-the-fly-conversion algorithm converts this signed-digit representation to the BCD-8421-coded result  and computes the quotient (Z), the quotient +1 (ZP1), and quotient −1 (ZM1). This conversion is accomplished every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.
Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both, the sticky and sign bit, require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.
5. IEEE 754-2008 Floating-Point Division
The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.
The block diagram of the divider is depicted in Figure 10. The DFP division begins with decoding of the dividend and divisor and the extraction of their signs, significands, and exponents. In the following, the significands of the operands and are regarded as integers. Therefore, the corresponding exponents are and .
According to , if one of the operands is a signaling NaN (sNAN) or quiet NaN (qNAN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN holding operand as the dividend and resets the divisor to , as depicted in Algorithm 7.
The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros for both operands is counted, and the significands are normalized by barrel shifters. Due to performance issues, the leading zeros counter exploit the FPGA’s fast carry chains, as proposed in .
The exponent of a floating-point division is first estimated by the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representations that differ in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called preferred exponent (QP).
Both exponents, QNF and QP, are positive integers and are determined as follows: The offset1 is a bias and assures the quotients to be positive since unsigned integer calculation is used in this paper. The bias is composed of The normalized fixed-point division unit then computes significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit that also detects and handles gradual underflow with zero delay overhead. One signed quotient digit is retired in each cycle, and the on-the-fly-conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result . The conversion is accomplished every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z′. Moreover, two additional registers ZM1′ and ZP1′ are provided. ZM1′ is the partial quotient decremented by one least significant digit (LSD), and ZP1′ is the partial quotient incremented by one LSD. ZM1′ is selected when the final residual in the last iteration is negative, and ZP1′ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) in each iteration step by one. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when , and the calculated integer quotient is less than : The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.
Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without loosing accuracy. The number of trailing zeros has to be counted only for Z′ because if ZM1′ is selected, the residual will be unequal to zero and anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.
The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when ; that is, there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).
Rounding is required when the number of significant digits of the quotient exceeds the length of a decimal word. The divider presented in this paper provides the five rounding modes as requested by . Rounding either selects the integer quotient or the incremented quotient . As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur, if overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round up detection is summarized in Table 3.
Finally, the result is encoded again, and the exception unit might assert six exception signals. These are division by zero, invalid operation, result is infinite, inexact, overflow, and underflow. Division by zero is asserted when the divisor equals zero but the dividend is unequal to zero. The operation is called invalid when both operands are zero, both operands are infinity, or any of the operands is either a signaling or a quiet NaN. The inexact flag is asserted when the rounding digit or the sticky bit is unequal to zero. Overflow and underflow are computed as described in the exponent calculation unit and are listed in Algorithms 8 and 9. The infinity exception is asserted when the result is greater than the largest number, that is, when either overflow or division by zero exception is asserted, or when the dividend is infinity while the divisor is a finite number.
6. Implementation Results
All dividers are modeled using VHDL and are implemented for Xilinx Virtex-5 devices with speed grade −2 using Xilinx ISE 10.1. The postplace and route results of the fixed-point dividers are listed in Table 4.
All five types of fixed-point dividers require 19 cycles to perform a division. This includes 17 cycles to determine the quotient digits and the rounding digit, one additional cycle to normalize the result (when the first quotient digit is zero), and one cycle latency to perform on-the-fly conversion and sticky bit determination. As expected, the type1 divider uses the most resources in terms of look-up tables (LUTs) and flip-flops (FFs). The type2, type3.a, and type3.b divider consumes less LUTs but require, further five (type2) or two (type3.a and type3.b) BRAMs. However, comparing the delay of the three dividers leads to an unexpected result. Contrary to decimal divider implementations in ASIC designs, on FPGA platforms the shift-and-subtract algorithm is the fastest. The reason is that the signal propagation on the FPGA’s internal fast carry chains is faster than interconnections between slices over the FPGA’s general routing matrix. The type1 divider exploits this fast carry chains, whereas type2, type3.a, type3.b, and type4 dividers only use the normal, slow slice interconnection resources. In ASIC implementations, for instance, the longest paths of the type4 divider would be much shorter than the longest path of the type1 divider, but on FPGA architectures the situation is the opposite.
One of the fastest decimal fixed-point divider on ASICs is the design of Lang and Nannarelli . They minimize the critical path in the digit recurrence by using a fast quotient digit selection function based on binary carry-propagate adders that subtract fixed values from the estimation of the current residual. These fixed values are dependent on estimations of the divisor and are precomputed and stored in a ROM. Therefore, the latency of the ROM does not contribute to the critical path of the digit recurrence. The implementation of such a divider on FPGAs would have a poor performance because the required carry-save adder with small error estimation cannot be implemented efficiently on the FPGA’s slice structure. However, the concept of removing the ROM from the critical path is applied to the type3.a and type3.b dividers, and the corresponding postplace and route results are shown in Table 4. These results point out that the type3.a utilizes a similar amount of LUTs as the type2 divider but the cycle time is much higher. The reason for this increased cycle time can be explained by the raised complexity of the multiplexers for the selection of the divisor’s multiples in the digit recurrence step. These multiplexers show a poor performance in terms of propagation delay because they are implemented as a tree with slow slice interconnections. On the contrary, the type3.b divider applies dedicated fast multiplexers, which have a propagation delay of only one LUT instance and eight fast carry chains. This dedicated fast multiplexer speeds up the divider and reduces the maximum cycle time by one nanosecond. Unfortunately, the number of used LUTs is increased dramatically by approximately 1000 LUTs.
The comparison of the type2, type3.a, type3.b, and type4 dividers in terms of speed and area (number of occupied LUTs and FFs combined) shows that the type2 algorithm is the fastest and requires the least number of slices. Hence, there is no benefit in removing the ROM’s latency from the critical path because the complexity is moved to the selection of the divisor’s multiples. The type1 divider is only 5% faster than the type2 divider, but the speed is bought at a high price because the number of occupied slices is 75% higher. Therefore, the type2 divider shows a good tradeoff in terms of area and latency and is used for the floating-point divider presented in this paper. In the following paragraph we compare the type2 implementation with other published fixed-point dividers.
Four other FPGA-based dividers are presented by Ercegovac and McIlhenny [17, 18], Deschamp and Sutter , Zhang et al. , and Véstias and Neto . These implementations are compared to our type2 divider in the following. The divider presented in [17, 18] is based on a digit recurrence algorithm that only requires limited-precision multipliers, adders, and LUTs. Furthermore, a compensation term is computed in the digit recurrence that compensates the error caused by this limited precision. The design is optimized for Virtex-5 FPGAs and has a good area characteristic (a 14-digit divider requires 1263 LUTs) but suffers from a high cycle time of 13.1 ns, which is more than 50% higher compared to the type1 or type2 design proposed in this paper. The reason for that high cycle time is (similar to the type4 divider) the long critical paths across many LUTs that cannot exploit the FPGA’s fast carry chains. Therefore, the design would probably fit better on ASICs with dedicated routing. The dividers implemented in  apply two different digit recurrence algorithms with a redundancy of . The quotient digit selection function is very complex because it requires three MSDs of the divisor and four MSDs of the residual. The cycle times are high but it must be taken into account that these designs are optimized for Virtex-4 FPGAs. Nevertheless, the number of used LUTs is very high. The divider presented in  applies a radix-100 digit recurrence algorithm with comprehensive prescaling of the dividend and divisor. It is optimized for Virtex-II Pro devices; hence, it is hard to compare it with the dividers presented in this paper. However, the critical path includes a decimal carry-save adder tree, two carry-propagate adder stages, and three multiplexer stages. It can therefore be assumed that this algorithm would also have a poor performance on Virtex-5 devices, although two quotient digits are retired in each iteration step. Véstias and Neto  propose four different decimal dividers optimized for different speed and area tradeoffs. Contrary to the circuits presented in this paper, the dividers apply the Newton-Raphson algorithm. The used multiplier is sequential; therefore, many cycles per division are required. However, each divider has a higher delay and a greater LUT usage compared to our type2 implementation.
Table 5 lists the results of the dividers presented in [17, 24, 25] together with the results of the type2 divider. However, for a fair performance comparison the precision of the type2 divider is equalized to match the number of calculated quotient digits of the corresponding divider. Furthermore it should be noted that the dividers presented in [12, 17, 24, 25] do not provide rounding support.
For the floating-point implementation, the fixed-point divider of type2 is used because type1 and type3.b consume too many resources, and type3.a and type4 are too slow. However, five additional BRAMs are required, but these are available in sufficient quantities on Virtex-5 devices. The corresponding postplace and route results for two configurations with different numbers of serialization stages are listed in Table 6. A comparison with other implementations is complicated because there are no other FPGA-based implementations of decimal floating-point dividers published yet. Therefore, we can only compare the decimal divider with binary dividers implemented on the same FPGA. In Table 7 postplace and route results of binary floating-point dividers for the double data format on a Virtex-5 provided by the Xilinx Core Generator  are listed. The dividers also provide exception signals overflow, underflow, invalid operation, and divide-by-zero. Unfortunately, the binary floating-point divider does not support gradual underflow because this feature increases the complexity. The divider with one cycle per division is an unpipelined and fully parallel design, and the divider that requires 55 cycles per division is a fully sequential design. The binary floating-point divider with 20 cycles per division is best suited for a comparison with the decimal floating-point divider because it requires a comparable number of cycles for one operation. Obviously, decimal arithmetic has a great overhead regarding the total number of used LUTs, but the cycle time and latency are similar. This resource overhead can be explained, on the one hand, by the inefficient representation of decimal values on binary logic and, on the other hand by the more complex specification of the decimal standard IEEE 754-2008. For instance, decimal floating-point division requires additional encoders, decoders, and a normalization stage.
In this paper we have presented five different radix-10 digit recurrence division algorithms. The type1 divider implements the simple shift-and-subtract algorithm. The type2 divider is based on a nonrestoring algorithm with ROM-based quotient digit selection function. The type3.a and type3.b dividers are both similar to type2 but use fast binary carry-propagate adders in the quotient digit selection function. However, the type3.a divider uses LUT-based multiplexers whereas the type3.b uses fast carry chain-based multiplexers. The type4 divider applies a new algorithm with constant digit selection functions. This type4 divider requires neither a ROM nor multiplexers to select multiples of the divisor.
We have shown that the subtract-and-shift algorithm with the worst latency in ASIC architectures is the fastest design on FPGAs, but uses the most FPGA resources in terms of LUTs. A good tradeoff between latency and area is the architecture type2 with a redundant carry-save representation of the residual, a radix-2 representation of the two MSDs, and a quotient digit selection function implemented in ROM. Furthermore, we have extended this fixed-point to a fully IEEE 754-2008 compliant decimal floating-point divider for decimal64 data format, and, finally, we have shown implementation results of all dividers.
Theorem A.1 (rounding overflow). Let us consider a decimal floating-point division with the signs , the significands , the exponents , and the precision p Then, rounding overflow, that is, an overflow that arises due to adding +1 to the least significant digit of the final result, cannot occur.
Proof. Assume to the contrary that there are a floating-point dividend X, a floating-point divisor D, and a proper rounding mode, where the decimal floating-point division produces a rounding overflow. Without loss of generality we consider in the following only positive operands (), normalized significands (), and the exponents to be zero (). In the case of rounding overflow, the digits of the quotient must be all nines, and the remainder must be unequal to zero, that is,
Condition (A.2) is fulfilled for
but this violates condition (A.3) because the remainder is zero (); that is, the result is exact, no round-up is applied, and hence no rounding overflow occurs.
Otherwise, if the divisor is greater than it follows that with . The left term of the equation has, by definition, a precision of digits while the right term of the equation has a precision of more than digits. This contradiction, finally, proves that no rounding overflow can occur.
M. Baesler, S. Voigt, and T. Teufel, “FPGA implementations of radix-10 digit recurrence fixed-point and floating-point dividers,” in Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig '11 ), pp. 13–19, IEEE Computer Society, Los Alamitos, CA, USA, December 2011.View at: Publisher Site | Google Scholar
IEEE Task P754. ANSI/IEEE 754-1985, Standard for Binary Floating- Point Arithmetic. New York, NY, USA, August 1985.
M. F. Cowlishaw, “Decimal floating-point: algorism for computers,” in Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH '03), pp. 104–111, IEEE Computer Society, Washington, DC, USA, June 2003.View at: Google Scholar
IEEE Task P754. IEEE 754-2008, Standard for Floating-Point Arithmetic. New York, NY, USA, August 2008.
ANSI/IEEE. ANSI/IEEE Std 854-1987: An American National Standard: IEEE Standard for Radix-Independent Floating-Point Arithmetic. New York, NY, USA, October 1987.
A. Y. Duale, M. H. Decker, H. G. Zipperer, M. Aharoni, and T. J. Bohizic, “Decimal floating-point in z9: an implementation and testing perspective,” IBM Journal of Research and Development, vol. 51, no. 1-2, pp. 217–227, 2007.View at: Google Scholar
S. F. Oberman and M. J. Flynn, “Division algorithms and implementations,” IEEE Transactions on Computers, vol. 46, no. 8, pp. 833–854, 1997.View at: Google Scholar
M. Véstias and H. Neto, “Revisiting the newton-raphson iterative method for decimal divisionpages,” in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '11), pp. 138–143, IEEE Computer Society Press, September 2011.View at: Google Scholar
A. Vázquez, E. Antelo, and P. Montuschi, “A radix-10 SRT divider based on alternative BCD codings,” in Proceedings of the 25th IEEE International Conference on Computer Design (ICCD '07), pp. 280–287, IEEE Computer Society Press, Los Alamitos, CA, USA, October 2007.View at: Publisher Site | Google Scholar
M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 algorithm for division with limited precision primitives,” in Proceedings of the 42nd Asilomar Conference on Signals, Systems and Computers (ASILOMAR '08), pp. 762–766, IEEE Computer Society, Pacific Grove, Calif, USA, October 2008.View at: Publisher Site | Google Scholar
M. D. Ercegovac and R. McIlhenny, “Design and FPGA implementation of radix-10 combined division/square root algorithm with limited precision primitives,” in Proceedings of the 44th Asilomar Conference on Signals, Systems and Computers (Asilomar '10), pp. 87–91, IEEE Computer Society, Pacific Grove, Calif, USA, November 2010.View at: Publisher Site | Google Scholar
M. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, Norwell, Mass, USA, 1994.
M. Baesler, S. O. Voigt, and T. Teufel, “A radix-10 digit recurrence division unit with a constant digit selection function,” in Proceedings of the 28th IEEE International Conference on Computer Design (ICCD '10), pp. 241–246, IEEE Computer Society, Los Alamitos, CA, USA, October 2010.View at: Publisher Site | Google Scholar
M. Baesler, S. O. Voigt, and T. Teufel, “An IEEE 754-2008 decimal parallel and pipelined FPGA floating-point multiplier,” in Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pp. 489–495, IEEE Computer Society, Washington, DC, USA, September 2010.View at: Publisher Site | Google Scholar
Y. Zhang, D. Chen, L. Chen et al., “Design and implementation of a readix-100 decimal division,” in Proceedings of IEEE Symposium on Circuit and System (ISCAS '09), Taibei, Taiwan, May 2009.View at: Google Scholar
Xilinx. Xilinx LogiCORE Floating-Point Operator v4.0, April 2008.