Special Issue

## Selected Papers from the 2011 International Conference on Reconfigurable Computing and FPGAs (ReconFig 2011)

View this Special Issue

Research Article | Open Access

Volume 2013 |Article ID 453173 | https://doi.org/10.1155/2013/453173

Malte Baesler, Sven-Ole Voigt, "Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs", International Journal of Reconfigurable Computing, vol. 2013, Article ID 453173, 16 pages, 2013. https://doi.org/10.1155/2013/453173

# Analysis of Fast Radix-10 Digit Recurrence Algorithms for Fixed-Point and Floating-Point Dividers on FPGAs

Accepted17 Sep 2012
Published07 Mar 2013

#### Abstract

Decimal floating point operations are important for applications that cannot tolerate errors from conversions between binary and decimal formats, for instance, commercial, financial, and insurance applications. In this paper we present five different radix-10 digit recurrence dividers for FPGA architectures. The first one implements a simple restoring shift-and-subtract algorithm, whereas each of the other four implementations performs a nonrestoring digit recurrence algorithm with signed-digit redundant quotient calculation and carry-save representation of the residuals. More precisely, the quotient digit selection function of the second divider is implemented fully by means of a ROM, the quotient digit selection function of the third and fourth dividers are based on carry-propagate adders, and the fifth divider decomposes each digit into three components and requires neither a ROM nor a multiplexer. Furthermore, the fixed-point divider is extended to support IEEE 754-2008 compliant decimal floating-point division for decimal64 data format. Finally, the algorithms have been synthesized on a Xilinx Virtex-5 FPGA, and implementation results are given.

#### 1. Introduction

Many applications, particularly commercial and financial applications, require decimal floating-point operations to avoid errors from conversions between binary and decimal formats. This paper presents five different decimal fixed-point dividers and analyzes their performances and resource requirements on FPGA platforms. All five architectures apply a radix-10 digit recurrence algorithm but differ in the quotient digit selection (QDS) function.

The first fixed-point divider (type1) implements a simple shift-and-subtract algorithm. It is characterized by an unsigned and nonredundant quotient digit calculation. Nine divisor multiples are precomputed, and in each iteration step nine carry-propagate subtractions are performed on the residual. Finally, the smallest, nonnegative difference is selected by a large fan-in multiplexer. This type1 implementation is characterized by a high area use.

The second divider (type2) uses a signed-digit quotient calculation with a redundancy of and operands scaling to get a normalized divisor in the range of . The quotient digit selection (QDS) function can be implemented fully by a ROM because it depends only on the two most significant digits (MSDs) of the residual as well as the divisor. The residual uses a redundant carry-save representation but, because of performance issues, the two MSDs are implemented by a nonredundant radix-2 representation.

The quotient digit selection (QDS) functions of the third and fourth divider (type3.a and type3.b) are based on comparators for the two most significand digits. The comparators consist of short binary carry-propagate adders (CPA), which can be implemented very efficiently in the FPGA’s slice structure. The corresponding comparative values depend on the divisor’s value and are precomputed and stored in a small ROM. The redundancy is ; thus, 17 binary CPAs are required. Similar to the type2 divider, the type3.a and type3.b dividers use prescaling of the divisor , a redundant carry-save representation of the residual, and a nonredundant radix-2 representation of the two MSDs. The dividers type3.a and type3.b differ only in the implementation of large fan-in multiplexers in the digit recurrence step. The type3.a divider implements multiplexers that minimize the LUT usage but have long latencies, whereas the type3.b divider implements a faster dedicated multiplexer that exploits the FPGA’s internal carry chains.

The quotient digit selection function of the last divider (type4) requires neither a ROM nor a multiplexer. It is characterized by divisor scaling () and a signed-digit redundant quotient calculation with a redundancy of . The quotient digit is decomposed into three components having values , , and . The components are computed one by one, whereby the digit selection function is constant; that is, the selection constants do not depend on the divisor’s value. Similar to the type2, type3.a, and type3.b dividers, the type4 divider uses a carry-save representation for the residual with a nonredundant radix-2 representation of the two MSDs.

The fixed-point division algorithms are implemented and analyzed on a Virtex-5 FPGA. Finally, the type2 divider, which shows the best tradeoff in area and delay, is extended to a floating-point divider that is fully IEEE 754-2008 compliant for decimal64 data format, including gradual underflow handling and all required rounding modes.

The architectures of the type1, type2, and type4 dividers have already been published in . However, this paper gives a more detailed description of the previous research and introduces two new dividers, which fill the design gap between the type1 and type2 dividers because they are based on two extreme examples of algorithms: the type1 divider implements a restoring quotient digit selection (QDS) function that requires nine decimal carry-propagate adders (CPAs) of full precision, whereas the nonrestoring QDS function of the type2 divider is implemented fully by means of a ROM with limited precision. In comparison, the new dividers implement nonrestoring QDS functions that are based on fast binary CPAs with limited precision.

The outline of this paper is given as follows: Section 2 motivates the use of decimal floating-point arithmetic and its advantage compared to binary floating-point arithmetic. The underlying decimal floating-point standard IEEE 754-2008 is introduced in Section 3. The digit recurrence division algorithms as well as the five different architectures of fixed-point dividers are presented in Section 4. These fixed-point dividers are extended to a decimal floating-point divider in Section 5. Postplace and route results are presented in Section 6, and finally in Section 7 the main contributions of this paper are summarized.

#### 2. Decimal Arithmetic

Since its approval in 1985, the binary floating-point standard IEEE 754-1985  is the most widely used implementation of floating-point arithmetic and the dominant floating-point standard for all computers. In contrast to binary arithmetic, decimal units are more complex, require more area, and are more expensive, and the simple binary coded decimal (BCD) data format has a storage overhead of approximately 20%. Thus, at that time of approval the use of binary in preference to decimal floating-point arithmetic was justified by the better efficiency.

Most people in the world think in decimal arithmetic. These decimal numbers must be converted to binary numbers when using a computer. However, some common finite numbers can only be approximated by binary floating-point numbers. The decimal number , for example, has a periodical continued fraction . It cannot be represented exactly in a binary floating-point arithmetic with finite precision, and the conversion causes rounding errors.

As a consequence, binary floating-point arithmetic cannot be used for any calculations which do not tolerate conversion errors between decimal and binary numbers. These are, for instance, financial and business applications that even require decimal arithmetic by law . Therefore, commercial application often use nonstandardized software to perform decimal floating-point arithmetic. However, these software implementations are usually from 100 to 1000 times slower than equivalent binary floating-point operations in hardware .

Because of the increasing importance, specifications for decimal floating-point arithmetic have been added to the IEEE 754-2008 standard for floating-point arithmetic  that has been approved in 2008 and offers a more profound specification than the former radix-independent floating-point arithmetic IEEE 854-1987 . Therefore, new efficient algorithms have to be investigated, and providing hardware support for decimal arithmetic is becoming more and more a topic of interest.

IBM has responded to this market demand and integrates decimal floating-point arithmetic in recent processor architectures such as z9 , z10 , Power6 , and Power7 . The Power6 is the first microprocessor that implements IEEE 754-2008 decimal floating-point format fully in hardware, while the earlier released z9 already supports decimal floating-point operations but implements them mainly in millicode. Nevertheless, the Power6 decimal floating-point unit is as small as possible and is optimized to low cost. It reuses registers from the binary floating-point unit, and the computing unit mainly consists of a wide decimal adder. Thus, its performance is rather low. Other floating-point operations such as multiplication and division are based on this adder and are performed sequentially. The decimal floating-point units of z10 and Power7 are designed similarly to those of the Power6 [7, 9].

#### 3. IEEE 754-2008

The floating-point standard IEEE 754-2008  has revised and merged the IEEE 754-1985 standard for binary floating-point arithmetic  and IEEE 854-1987 standard for radix-independent floating-point arithmetic . As a consequence, the choice of radices has been focused on two formats: binary and decimal. In this paper we consider only the decimal data format. A decimal number is defined by the triple consisting of the sign , significand , and exponent : with , , and . IEEE 754-2008 uses two different designators for the exponent ( and ) as well as for the significand ( and ). The exponent is applied when the significand is regarded as an integer digit and fraction field, denoted by . The exponent is applied when the significand is regarded as an integer, denoted by . The relation is given through In this paper we exclusively use the integer representation with the exponent .

Unlike binary floating-point format, the decimal floating-point number is not necessarily normalized. This leads to a redundancy, and a decimal number might have multiple representations. The set of representations is called the floating-point number’s cohort. For example, the numbers , , and are all members of the same cohort. More precisely, if a number has significant digits, the number of representations is .

IEEE 754-2008 defines three decimal interchange formats (decimal32, decimal64, and decimal128) of fixed width 32, 64, and 128 bits. As depicted in Figure 1, a floating-point number is encoded by three fields: the sign bit , the combination field , and the trailing significand field. The combination field encodes whether the number is finite, infinite, or not a number (NaN). Furthermore, in case of finite numbers the combination field comprises the biased exponent () and the most significant digit (MSD) of the significand. The remaining digits are encoded in the trailing significand field of width . The trailing significand can either be implemented as a binary integer or as a densely packed decimal (DPD) number . Binary encoding makes software implementations easier, whereas DPD encoding is favored by hardware implementations, as it is the case in this paper.

The encoding parameters for the three fixed-width interchange formats are summarized in Table 1. This paper focuses on the data format decimal64 with DPD coded significand. DPD encodes three decimal digits (four bits each) into a declet (10 bits) and vice versa . It results in an storage overhead of only 0.343% per digit.

 Parameter dec32 dec64 dec128 storage width (bits) 32 64 128 : precision (digits) 7 16 34 : min. exponent −101 −398 −6176 : max. exponent 90 369 6111 101 398 6176 : sign bit 1 1 1 : combination field (bits) 11 13 17 : trailing significand field (bits) 20 50 110

IEEE 754-2008 defines five rounding modes. These are two modes to the nearest (round ties to even and round ties to away) and three directed rounding modes (round toward positive, round toward negative, and round toward zero) . As floating-point operations are obtained by first performing the exact operation in the set of real numbers and then mapping the exact result onto a floating-point number, rounding is required whenever all significant digits cannot be placed in a single word of length . Moreover, inexact, underflow, or overflow exceptions are signaled when necessary.

#### 4. Decimal Fixed-Point Division

Oberman and Flynn  distinguish five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency, whereby many practical algorithms are combinations of multiple classes. Compared to binary arithmetic, decimal division is more complex. Currently, there are only a few publications concerning radix-10 division.

Wang and Schulte  describe a decimal divider based on the Newton-Raphson approximation of the reciprocal. The latency of a decimal Newton-Raphson approximation directly depends on the latency of the decimal multiplier. A pipelined multiplier has the advantage that more than one division operation can be processed in parallel; otherwise the efficiency is poor. However, the algorithm lacks remainder calculation, and the rounding is more complex. The first FPGA-based decimal Newton-Raphson dividers are presented in . The dividers as well as the underlying multipliers are sequential; hence, these dividers have a high latency.

Digit recurrence division is the most widely implemented class of division algorithms. It is an iterative algorithm with linear convergence; that is, a fixed number of quotient digits is retired every iteration step. Compared to radix-2 arithmetic, radix-10 digit recurrence division is more complex because, on the one hand, decimal logic is less efficient by itself and, on the other hand, the range of the quotient digit selection (QDS) function comprises a larger digit set. Therefore, the performance of the digit recurrence divider depends on the choice of the QDS function and the implementation of decimal logic.

Nikmehr et al.  select quotient digits by comparing the truncated residual with limited precision multiples of the divisor. Lang and Nannarelli  replace the divisor’s multiples by comparative values obtained by a look-up table and decompose the quotient digit into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required. Vázquez et al.  take a different approach: the selection constants in the QDS function are obtained of truncated multiples of the divisor, avoiding look-up tables. Therefore, the multiples are computed on-the-fly. Moreover, the digit recurrence iteration implements a slow carry-propagate adder, but an estimation of the residual is computed to make the determination of the quotient digits independent of this carry-propagate adder. The decimal divider of the Power6 microprocessor  uses extensive prescaling to bound the divisor to be greater than or equal to 1.0 but lower than 1.1112 in order to simplify the digit selection function. However, this prescaling is very costly, it requires a 2-digit multiply, and it needs overall six cycles on each operand. Furthermore, the digit selection function still requires a look-up table.

The first decimal fixed-point divider designed for FPGAs applies digit recurrence algorithm and is proposed by Ercegovac and McIlhenny [17, 18]. However, it has a poor cycle time, because it does not fit well on the slice structure of FPGAs but would probably fit well into ASIC designs with dedicated routing.

##### 4.1. Digit Recurrence Division

In the following, we use different decimal number representations. A decimal number is called BCD- coded when can be expressed by A number representation is called redundant if one or more digits have multiple representations.

Moreover, we consider a division in which the decimal dividend and the decimal divisor are positive and normalized fractional numbers of precision , that is, . The quotient and the remainder are calculated as follows: Then the radix-10 digit recurrence is implemented by with the initial residual and with the quotient digit calculated by the selection function Digit recurrence division is subdivided into the classes of restoring and nonrestoring algorithms, that differ in the quotient digit selection (QDS) function and the dynamical range of the residual. A restoring divider selects the next positive quotient digit such that the next partial residual is as small as possible but still positive. The IBM z900 architecture, for instance, implements such a decimal restoring divider . By contrast, the QDS function of a decimal nonrestoring divider uses a digit set that is positive as well as negative , and the partial residual might also be negative.

One advantage of restoring division is the enhanced performance since estimates of limited precision ( and ) might be used in the QDS function . This performance gain is achieved by using a redundant digit set, , which defines a redundancy factor Then, in each iteration step the quotient digit should be selected such that the next residual is bounded by which is called the convergence condition .

This paper presents five different decimal fixed-point dividers that are described in the following. The first divider (type1) is a restoring divider whereas the four others (type2, type3.a, type3.b, and type4) are nonrestoring dividers.

##### 4.2. Type1 QDS Function

The QDS function of the type1 algorithm is very simple. Nine multiples of the divisor () are subtracted in parallel from the residual by carry-propagate adders (CPAs), and the smallest positive result is selected. The CPAs exploit the FPGA’s internal fast carry logic, as described in .

The nine multiples are precomputed and are composed of the multiples , , , and their negatives (10’s complement), which require at most one additional CPA per multiple. The multiples and can be easily computed by digit recoding and constant shift operations where (9) is read as follows. A BCD-5421 coded number left-shifted by one bit is equivalent to the corresponding BCD-8421-coded number multiplied by two. In a similar fashion we obtain a multiplication by five using (10).

The implementation of the type1 digit recurrence divider is characterized by a high area use, due to the utilization of nine parallel CPAs. The corresponding algorithm of the type1 digit recurrence is shown in Algorithm 1.

 (1) Compute (2) if then and else if then and ⋮ else if then and else and

##### 4.3. Type2 QDS Function

The approach of the type2 division algorithm is based on the implementation of the QDS function fully by a ROM. This ROM is addressed by estimates of the residual and divisor. The use of estimates is feasible because a signed digit set together with a redundancy greater than is used. The estimates and are obtained by truncation; that is, only a limited number of MSDs of the residual and divisor are regarded. Furthermore, the residual is implemented in BCD-4221 carry-save representation such that the maximum error introduced by an estimation with precision (one integer digit and fractional digits) is bounded by . Negative residuals are represented by their 10’s complement.

The divisor is subdivided into subranges of equal width . For each subrange , the QDS function is defined by selection constants : These selection constants are bounded by selection intervals , with the containment condition 

Furthermore, the continuity condition states that every value must belong at least to one selection interval . Considering the maximum error due to truncation, the continuity condition can be expressed by If we suppose the subrange width to be constant () and consider that the minimum overlapping occurs for minimum and maximum , then we obtain the term We choose () and that leads to and . Thus, for the divisors’s estimation is required an accuracy of two fractional digits, and for the residual’s estimation is required an accuracy of one integer and one fractional digit. Furthermore, the divisor must be pre-scaled such that .

The P-D diagram is a visualization technique for designing a quotient digit selection function and for computing the decision boundaries . It plots the shifted residual versus the divisor. The P-D diagram for the type2 quotient digit selection function with the selection constants is shown in Figure 2.

The two MSDs of the dividend and residual address the ROM in order to obtain the signed quotient digit, which comprises 5 bits. In order to reduce the ROM’s size, the two MSDs of the divisor and the residual are implemented by using a nonredundant radix-2 representation. This leads to an address that is composed of 15 bits: seven bits from the unsigned radix-2 representation of the divisor’s two MSDs and eight bits from the signed radix-2 representation of the residual’s two MSDs. Hence, the size of the ROM is bits. Unfortunately, the use of radix-2 representation complicates the multiplication by 10, which is required according to (5). Therefore, an additional binary adder is needed: The corresponding algorithm for the calculation of the ROM entries can be derived from the P-D diagram and is depicted in Algorithm 2. It should be noted that the radix-2 representations of and are truncations of and ; that is, and .

 (1) compute and for each and according to (12a) and (12b) (2) for all with compute the quotient digits as follows: if then else if then else if then else if then else if then

Once the quotient digit has been determined, the multiple is subtracted from the current residual to compute the next residual, as stated in (5). The subtracter is a fast redundant BCD-4221 carry-save adder (CSA), as described in , except for the two MSDs in wich we apply radix-2 CPAs. Thus, the multiple is composed of two summands: where are precomputed. The multiplies and can be easily and fast computed by digit recoding and constant shift operation as shown in (9) and (10). The 10’s and 2’s complements are applied for . For both, radix-2 and BCD-4221 radix-10 representation, the complements can be computed by inverting each bit and adding one. In summary, the subtraction uses a (3 : 1) radix-2 CPA for the two MSDs and a redundant radix-10 (4 : 2) CSA for the remaining digits. The digit recurrence algorithm of the type2 divider is summarized in Algorithm 3, and its block diagram is depicted in Figure 3.

 (1) Select (2) Compute

##### 4.4. Type3 QDS Function

The frequency limiting component of the type2 divider is the digit recurrence step, which comprises the ROM (BRAM) to calculate the next quotient digit. This BRAM is slower compared to common FPGA logic. Hence, it appears to be beneficial to remove the slow BRAM from the critical path and use fast FPGA logic instead. To analyze this impact of the BRAM delays, we implement two dividers (type3.a and type3.b), which can also be realized without BRAM in the critical path.

Lang and Nannarelli  propose a divider that implements a quotient digit selection function based on CPAs and sign detection. These CPAs subtract fixed values from the current residual with limited precision. These fixed values depend on the estimates of the divisor and are pre-computed and stored in a ROM. Furthermore, the quotient digit is divided into two parts, which further simplifies the quotient digit selection function and reduces the critical path. Unfortunately, this divider cannot be implemented efficiently on FPGAs because the required carry-save adder with small error estimation shows a poor performance on the FPGA’s slice structure. Nevertheless, to investigate the impact of BRAM delays in the type2 divider, we implement another two dividers (type3.a and type3.b) that have no BRAM in the critical path but a CPA-based quotient digit selection function instead.

The type3 dividers are modifications of the type2 divider. The common architectural features are (i)the multiples of the divisor are compos of two components: , where , (ii)the fast redundant BCD-4221 carry-save adders for the digit recurrence step as stated in (5), (iii)the radix-2 implementation of the two MSDs, and (iv)the 10’s and 2’s complements for . Furthermore, since the architectures of the type2 and type3 dividers are similar, with the exception of the quotient digit selection function, most of the dividers’ characteristics are also identical, including the (i)the P-D diagram, (ii)the redundancy factor , , (iii)the need of prescaling the divisor , (iv)the accuracy of the divisor’s estimation , and (v)the accuracy of the residual’s estimation . Contrary to the type2 divider, the QDS functions of the type3 dividers are implemented by 16 carry-propagate adders with sign detection. These adders subtract fixed selection constants from the estimation of the current residual. The selection constants are dependent on the current divisor and are stored in a ROM, which is then no more part of the critical path. The selection constants are computed according to Algorithm 4. The common digit recurrence algorithm of the type3.a and type3.b dividers is shown in Algorithm 5, and the corresponding block diagram is depicted in Figure 4.

 (1) compute and for each and according to (12a) and (12b) (2) compute the selection constants : ⋮ ⋮

 (1) if else if ⋮ else if else (2) Compute

The sign signals of the binary carry-propagate adders are used to determine the corresponding quotient digit and to select the multiples of the divisor in the digit recurrence iteration step. The advantage of the type3 QDS function is its short critical path, which comprises only one 8-bit binary carry-propagate adder. However, this short critical path is bought at a high price because the complexity is moved to the selection of the divisors multiples. Hence, large fan-in multiplexers with poor latencies are required. In order to minimize the impact of these multiplexers, we designed a new dedicated (17 : 1) multiplexer that exploits the FPGA’s fast carry chains and has a delay of only one LUT instance. In the following we name the divider with improved fast multiplexers type3.b and with traditional multiplexers type3.a. The new dedicated (17 : 1) multiplexer uses the 16-sign bits as select lines and is coded as follows: with and . In other words, bit is selected when and . The selection of two input signals is implemented in one (5 : 1) LUT, and all signals are combined by a long OR gate that is implemented exploiting the FPGA’s fast carry chain. Therefore, one (17 : 1) multiplexer requires nine LUTs, and the type3.b divider with fast multiplexers uses much more LUTs than the type3.a divider (see Section 6). The block diagram of such a (17 : 1) multiplexer is depicted in Figure 5.

##### 4.5. Type4 QDS Function

The type4 divider applies a new algorithm as proposed by us in a previous paper . It is based on the decomposition of the signed quotient digit into three components having values , , and as well as a fast constant digit selection function. In this implementation, neither a ROM for the QDS function nor a multiplexer to select the multiple in the digit recurrence iteration step is required. The divider is intended to utilize less resources than other implementations. As the type2, type3.a, and type3.b dividers, the type4 divider uses signed-digit redundant quotient calculation, carry-save representation of the residual, fast BCD-4221 CSAs for the digit recurrence, and a radix-2 implementation of the MSDs.

Quotient digits are decomposed into three components , and each component is computed by a distinct selection function. These components can hold three values so that only two comparators per component are required to distinguish between these values. Since , we get a redundancy factor of . Furthermore, the selection functions become very simple due to prescaling of the divisor (); that is, they are constant functions that do not depend on the divisor anymore:

The recurrence is then defined as follows:

The multiples and can be easily and fast computed according to (9) and (10). Each selection function requires two comparators implemented as carry-save adders that subtract constant values from the residuals’ estimations (, , and ). As we will show in the following, these estimations require only the two most significant digits (MSDs) of the corresponding exact values.

First, the selection intervals for with and have to be determined, where is the smallest and is the greatest value of the selection constant such that the next residual is still bounded. Applying the convergence condition (8), the digit recurrence (19a), (19b), and (19c), and the redundancy factor , we obtain

From the recurrence , , , and replacing by the upper limit , we get Similarly, we obtain the upper limits and from the recurrence (19b) and (19c), respectively,

Likewise, the lower limits can be computed

Due to the redundancy factor we have overlapping regions , where more than one quotient digit component may be selected. The decision boundary of a selection function should lie inside this overlapping regions. Figure 6 shows the P-D diagram for . The figures for and look similar. As can hold three different values, the selection function requires two decision boundaries. Generally, it is implemented by a staircase function (see Figure 2), but for the bounded divisor it is independent of the divisor and is reduced to a constant function (see Figure 6). Fortunately, the selection functions for and are also independent of the divisor and can also be implemented by constant functions in a similar fashion.

We now determine the required number of fractional digits for the estimates , , and . Moreover, we choose suitable selection constants and . These constants depend on the maximum error due to the estimates and on the minimum overlapping width, which is given by as well as . Since we use BCD-4221 carry-save redundant digit representation, for a precision of digits (one integer digit and fractional digits), we have a maximum error of . Moreover, the estimations are computed by truncations (). This means, for given positive and negative selection constants and , the following expressions must be true:

If we choose the selection constants as listed in Table 2, all conditions are fulfilled for a precision of digits, and each selection function is reduced to two simple 2-digit comparators.

 0.9 −1.5 0.656 0.611 0.1 −0.7 0.656 0.611 0.1 −0.3 0.255 0.211

As soon as one component of the quotient digit is computed, the corresponding multiple of the divisor must be subtracted from the partial residual. Each component can hold three different values that are multiplied with the corresponding weighted divisor with . In other words, is either passed, negated, or reset. Passing and resetting the multiple of the divisor is performed at no extra cost. Negation is accomplished by bit inversion and adding +1 through the carry inputs of the carry-save adders in (19a), (19b), and (19c). In summary, this is a very fast operation and requires only two LUTs per digit.

Similar to type2 and type3 division, we implement the two MSDs of the type4 residual by using a nonredundant radix-2 representation, which requires less LUTs and results in faster quotient digit selection function. The pseudocode for the digit recurrence iteration is shown in Algorithm 6, and the corresponding block diagram is depicted in Figure 7.

 (1) select and compute (2) select and compute (3) select and compute

##### 4.6. Proposed Decimal Fixed-Point Divider

For each type of division algorithms presented in the preceding sections we have implemented a corresponding fixed-point divider, which is described in the following. For all fixed-point dividers we expect the input operands to be normalized.

The block diagram of the type1 divider is depicted in Figure 8. First, the divisor multiples are precomputed. In the following cycles quotient digits ( if the first digit is zero) are computed one by one, followed by an additional rounding digit. The quotient (Z) and the quotient +1 (ZP1) are computed on-the-fly by using two registers that are updated every cycle. The algorithm has the advantage that no additional slow decimal CPA is required to compute the incremented quotient. It is described more precisely in Section 5. The normalization of the result is also performed on-the-fly by locking the conversion when quotient digits for the significand and the rounding digit have been computed; that is, in the worst case, when the first quotient digit is zero, cycles are required. Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero.

The architectures of the type2, type3.a, type3.b, and type4 are similar in their structure but differ in their digit recurrence algorithm and scaling. The common block diagram is shown in Figure 9. First, the dividend and the divisor are re-scaled and five as well as two times the divisor are precomputed according to (9) and (10). Then, the operands are recoded because the digit recurrence iteration uses a redundant digit representation with BCD-4221 carry-save adders.

The digit recurrence retires in each iteration one signed quotient digit . The on-the-fly-conversion algorithm converts this signed-digit representation to the BCD-8421-coded result  and computes the quotient (Z), the quotient +1 (ZP1), and quotient −1 (ZM1). This conversion is accomplished every iteration step and does not need a slow CPA. The incremented and decremented quotients are required for rounding. Moreover, the quotient is also normalized on-the-fly by locking the conversion when quotient digits (including the rounding digit) have been computed. The algorithm is described explicitly in Section 5 because it is accomplished together with the gradual underflow handling of the floating-point divider.

Furthermore, the sticky bit is calculated, which is required for rounding. It is set to one whenever the final remainder is unequal to zero. Moreover, when the final remainder is negative, the quotient of the type2, type3.a, type3.b, and type4 dividers has to be adjusted by subtracting one LSD. This subtraction does not need another slow CPA because the quotient −1 (ZM1) has already been computed on-the-fly. The calculations of both, the sticky and sign bit, require the reduction of the redundant remainder by a CPA. This CPA might be subdivided into multiple smaller CPAs to keep the latency low.

#### 5. IEEE 754-2008 Floating-Point Division

The IEEE 754-2008 compliant decimal floating-point divider is an extension of the normalized fixed-point divider. The divider presented in this paper supports the interchange format IEEE 754-2008 decimal64 with DPD encoding, but it can be easily adapted to any other precision and exponent range.

The block diagram of the divider is depicted in Figure 10. The DFP division begins with decoding of the dividend and divisor and the extraction of their signs, significands, and exponents. In the following, the significands of the operands and are regarded as integers. Therefore, the corresponding exponents are and .

According to , if one of the operands is a signaling NaN (sNAN) or quiet NaN (qNAN), then the result is also a quiet NaN with the payload of the original NaN. Hence, in order to preserve the payload while using an unmodified divider, the NaN handling unit sets the NaN holding operand as the dividend and resets the divisor to , as depicted in Algorithm 7.

 else // no changes

The fixed-point dividers presented in Section 4 require normalized operands. Therefore, the number of leading zeros for both operands is counted, and the significands are normalized by barrel shifters. Due to performance issues, the leading zeros counter exploit the FPGA’s fast carry chains, as proposed in .

The exponent of a floating-point division is first estimated by the normalized quotient in fractional representation (QNF) and is updated iteratively in each digit recurrence step. Since most of the decimal floating-point numbers specified by IEEE 754-2008 have multiple representations of the same value, the result of a division might also have more than one correct representations that differ in the exponent. However, to obtain a unique and reproducible result in the case of such an ambiguity, IEEE 754-2008 defines the exponent of the result, which is called preferred exponent (QP).

Both exponents, QNF and QP, are positive integers and are determined as follows: The offset1 is a bias and assures the quotients to be positive since unsigned integer calculation is used in this paper. The bias is composed of The normalized fixed-point division unit then computes significand digits and one additional rounding digit. Additionally, the divider encompasses the on-the-fly conversion unit that also detects and handles gradual underflow with zero delay overhead. One signed quotient digit is retired in each cycle, and the on-the-fly-conversion algorithm converts this signed-digit representation into the BCD-8421-coded partial result . The conversion is accomplished every iteration step and does not need a slow CPA (see Algorithm 8). The partial quotient is stored in a register Z′. Moreover, two additional registers ZM1′ and ZP1′ are provided. ZM1′ is the partial quotient decremented by one least significant digit (LSD), and ZP1′ is the partial quotient incremented by one LSD. ZM1′ is selected when the final residual in the last iteration is negative, and ZP1′ is required by the following rounding operation. Furthermore, the exponent for the normalized significand in integer representation (QNI) is calculated (see (29)), and gradual underflow handling is performed at no extra cost (see Algorithm 8). The exponent QNI is computed by decrementing the exponent of the normalized significand in fractional representation (QNF) in each iteration step by one. The recurrence iteration terminates earlier in case of gradual underflow. The gradual underflow signal (GUF) is asserted high when , and the calculated integer quotient is less than : The extended quotient (composed of the calculated quotient and the rounding digit) must be decremented by one if the final remainder is less than zero. The sticky bit (sb) is used for rounding and is asserted high whenever the remainder is unequal to zero.

 QNI′ = QNF + 1 = ZM1′ = ZP1′ = 0 TZ′ = 0, while ( ) and (QNI′ ) { QNI′ = QNI′ − 1 } QNI = QNI′ + 1 // remove the impact of the rounding digit if GUF = 1 then underflow!

Moreover, the number of trailing zeros (TZ) is computed on-the-fly. This number indicates by how many digits the computed quotient might be shifted to the right without loosing accuracy. The number of trailing zeros has to be counted only for Z′ because if ZM1′ is selected, the residual will be unequal to zero and anyway. The number of trailing zeros is used for the selection between the preferred exponent (QP) and the normalized exponent for integer representation (QNI), as described in the following paragraph.

The exponent calculation unit selects either QP or QNI. The preferred exponent should be selected whenever possible. This is only feasible when ; that is, there is a sufficient number of trailing zeros to shift the significand to the right. Furthermore, the final exponent must satisfy the minimum and maximum exponent range. The algorithm of the exponent calculation is illustrated in Algorithm 9, where the value is used to add the IEEE 754-2008 bias and to remove offset1 again (which was introduced in (27)).

 offset2 = 783 − 398 = 385 if (QP QNI) or (QNI + TZ < QP) or (QP > ) then (a) QNI > overflow! (b) QNI = QNI − offset2 ZP1′e = ZP1 (c) QNI < underflow! else (a) QP = QP − offset2 = Z (QP − QNI) = ZP1   (QP − QNI) (b) QNI < underflow!

Rounding is required when the number of significant digits of the quotient exceeds the length of a decimal word. The divider presented in this paper provides the five rounding modes as requested by . Rounding either selects the integer quotient or the incremented quotient . As proven by Theorem A.1 in the appendix, rounding overflow cannot occur in DFP division. Rounding overflow would occur, if overflows due to rounding. The selection of the quotient depends on the rounding mode, the rounding digit (), the sticky bit (sb), and the least significant digit (LSD) of the quotient. The calculation of the round up detection is summarized in Table 3.

 Rounding mode Round-up detection Round Ties To Even > Round Ties To Away Round Toward Positive Round Toward Negative