Research Article  Open Access
Efficient Scheme for Implementing Large Size Signed Multipliers Using Multigranular Embedded DSP Blocks in FPGAs
Abstract
Modern FPGAs contain embedded DSP blocks, which can be configured as multipliers with more than one possible size. FPGA-based designs using these multigranular embedded blocks become more challenging when high speed and reduced area utilization are required. This paper proposes an efficient design methodology for implementing large size signed multipliers using multigranular small embedded blocks. The proposed approach has been implemented and tested targeting Altera's Stratix II FPGAs with the aid of the Quartus II software tool. The multipliers have been implemented for operand sizes ranging from 40 to 256 bits. Experimental results demonstrate that our design approach outperforms the standard scheme used by the Quartus II tool in terms of speed and area. On average, the delay reduction is about 20.7% and the area saving, in terms of ALUTs, is about 67.6%.
1. Introduction
Modern FPGAs offer highly sophisticated resources in the form of embedded blocks. These blocks vary in complexity from small size multipliers to core processors. Some of these blocks offer a high degree of flexibility to cover a wide range of applications. A typical example is the DSP blocks in Altera's FPGAs. These blocks can be configured to operate as 9 × 9-bit, 18 × 18-bit, or 36 × 36-bit multipliers [1]. This flexibility of operating as multigranular embedded blocks can be used to develop optimized realizations of large size computing functions, such as large size multiplications.
Arithmetic computations are needed in a wide range of applications and products. Some of these arithmetic computations deal with large size operands. Typical applications include scientific computation, cryptography, and data-intensive systems. For instance, in climate modeling and computational physics, high-precision floating-point processing is needed [2, 3], and this in turn requires large operand fixed-point multipliers. Another application which requires large size multiplications is the processing-in-memory system used in data-intensive multimedia and video applications [4].
Various techniques in the literature deal with efficient realization of signed array multiplication through optimization of partial product generation and partial product addition. The focus is primarily on reducing the delay of the critical path, and sometimes on meeting other design objectives such as power dissipation and chip area [5–7]. Most of these techniques operate at the bit level and are hence suitable for ASIC or custom implementation. When the same algorithms are mapped onto LUTs in FPGAs, the realization becomes fairly inefficient, due to the interconnect delay and the generic nature of the LUTs. With the availability of highly optimized embedded multiplier blocks and the incorporation of micro-arithmetic operations within the LUTs, the strategy for realizing multipliers has changed. For instance, an attempt has been made to realize multipliers by utilizing compressors provided by the 6-LUTs on Altera's FPGAs [1, 8]. This technique achieved satisfactory results, and sometimes outperformed implementations based on the embedded multipliers. Another scheme proposed a hybrid approach which utilizes embedded blocks and LUTs [9]. Both techniques are effective for small size multipliers; large size multipliers, however, require highly efficient structures and minimal interconnect delays. This is only achievable through the use of the embedded multipliers or DSP blocks, which are available nowadays on a myriad of FPGAs offered by vendors such as Xilinx and Altera.
For large size multiplication using FPGA devices, known algorithms normally segment the input operands based on single size embedded blocks [10, 11]. For Xilinx FPGAs, for instance, a sign-extension-based approach [12] is used for realizing the large size multiplications, which employs 18 × 18-bit embedded signed multipliers as the basic blocks. For Altera's FPGAs, a standard approach is to use single decomposition to implement the large size multipliers [1], which is also based on the embedded multipliers. However, since the embedded DSP blocks in newer Altera FPGA devices, such as Stratix II and later, can be configured as 9 × 9-bit, 18 × 18-bit, and 36 × 36-bit multipliers, it is possible to use multiple size embedded blocks as the basic units to efficiently implement large size multipliers.
In our previous work, we developed an efficient design approach for the implementation of large size unsigned multipliers [13]. A more systematic approach was presented in [14] with a set of design rules for the addition of partial products, leading to a more efficient realization. A structured methodology was then developed to implement large size 2's complement multipliers based on the Baugh-Wooley algorithm. Taking advantage of the multigranularity of the embedded DSP blocks, the authors proposed a scheme to design a highly efficient 2's complement multiplier using a new approach for sign extension [15]. In this paper, we present a design scheme for the general case to realize large size signed multipliers based on multiple size embedded blocks. We propose a divide-and-conquer-based strategy with a multilevel decomposition procedure, followed by an optimized approach for realizing the required additions of the partial products to obtain the final result. We have also dealt with special cases so that more improvements can be achieved for a range of input operands from 40 to 284 bits.
The remainder of this paper is organized as follows. Section 2 describes the architecture of large size multipliers and a new sign-extension scheme used in this paper. Section 3 presents the proposed design approach of large size multigranular-block-based signed multipliers, and a design example for a 256 × 256-bit multiplier is provided in Section 4. In Section 5, experimental results and comparisons are presented. Finally, conclusions are given in Section 6.
2. Implementation of Large Size Multipliers Based on Single Size Embedded Blocks
In this section, we describe the decomposition method of large size multiplications based on single size embedded blocks, and a new sign-extension scheme to sum the generated partial products.
2.1. Architecture of Single-Size-Embedded-Block-Based Large Size Multipliers
To implement a large size multiplication using single size embedded multipliers in FPGAs, the input operands are decomposed based on the size of the embedded blocks [14]. Assuming that the size of each 2's complement embedded block is n × n bits, each input operand is decomposed into m segments with $m = \lceil (k-1)/(n-1) \rceil$, where k is the size of the large size multiplier, and $\lceil z \rceil$ is the ceiling function of z. The operands are expressed as follows:
$$X = \sum_{i=0}^{m-1} X_i \, 2^{i(n-1)}, \qquad Y = \sum_{j=0}^{m-1} Y_j \, 2^{j(n-1)}. \tag{1}$$
In (1), the segment $X_i$ or $Y_j$ is an (n − 1)-bit positive number for i, j = 0, …, m − 2. For i = j = m − 1, the segment $X_{m-1}$ or $Y_{m-1}$ is a [k − (m − 1)(n − 1)]-bit signed number, whose size is in the range of 2 to n bits.
By multiplying the segmented inputs presented in (1), the output of the multiplier is expressed as
$$X \cdot Y = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} X_i Y_j \, 2^{(i+j)(n-1)}. \tag{2}$$
According to the optimized design approach of large size multipliers proposed in [14], all partial products in (2) can be organized as shown in Figure 1, which uses a shorthand label for each partial product $X_i Y_j$.
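The segmentation in (1) and the expansion in (2) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation; the function names `split_signed` and `multiply_by_segments` are hypothetical, and the segment count is assumed to be the ceiling of (k − 1)/(n − 1), which keeps the most significant segment between 2 and n bits.

```python
def split_signed(x, k, n):
    """Split a k-bit two's-complement integer into m segments.

    Segments 0..m-2 are unsigned (n-1)-bit values; the last one is signed.
    Assumes m = ceil((k-1)/(n-1)).
    """
    w = n - 1
    m = -(-(k - 1) // w)              # ceiling division
    segs = []
    for _ in range(m - 1):
        segs.append(x & ((1 << w) - 1))
        x >>= w                       # arithmetic shift keeps the sign
    segs.append(x)                    # signed most significant segment
    return segs


def multiply_by_segments(x, y, k, n):
    """Rebuild x*y from the weighted partial products of the segments."""
    w = n - 1
    xs, ys = split_signed(x, k, n), split_signed(y, k, n)
    return sum((xi * yj) << ((i + j) * w)
               for i, xi in enumerate(xs)
               for j, yj in enumerate(ys))
```

For a 121-bit operand and n = 36, `split_signed` returns four segments: three unsigned 35-bit values and one signed 16-bit value.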
After all operands shown in Figure 1 are obtained, multiple levels of addition are required to add all these 2's complement operands. To reduce the area and the execution delay, the set of optimization design rules proposed in [14] can be followed.
2.2. New Sign-Extension Scheme for Large Size Signed Multipliers
To save area and reduce the execution delay, our new sign-extension scheme proposed in [15] first organizes the additions following the set of design rules, and then extends the sign bits according to the resulting organized additions. For example, the first level addition of the large size multiplication adds each pair of operands that have the same size, as shown in Figure 1. The sign extension requires only one bit to take care of the carry of the addition.
After the first level addition, all operands to be added further are 2's complement numbers of different sizes. Then, at the second and subsequent levels of addition, our new sign-extension scheme extends the sign bit of the larger size operand, as shown in Figure 2, by one bit, and sign-extends the smaller size operand to the same position as that of the larger one. Moreover, to reduce the size of the adders, the least significant bits that do not overlap with the other operand are concatenated to the output of the adder.
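The concatenation of non-overlapping least significant bits can be illustrated with a short Python sketch. It is only an arithmetic model under stated assumptions: `a` is the operand whose low `shift` bits are not overlapped, `b` is the operand weighted by 2^shift, and the name `add_with_concat` is hypothetical. Python integers are unbounded, so the explicit one-bit sign extension of the hardware adder is implicit here.

```python
def add_with_concat(a, b, shift):
    """Compute a + (b << shift) with an adder that skips the low bits of a."""
    low = a & ((1 << shift) - 1)      # bits of a that b does not overlap
    high = (a >> shift) + b           # narrower signed addition
    return (high << shift) | low      # low bits are concatenated, not added
```

The adder only spans the overlapping bits, which is the source of the area saving described above.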
3. Design of Large Size Signed Multipliers Using Multigranular Embedded Blocks
In this section, we describe our proposed multilevel decomposition approach for the implementation of large size signed multiplications using multigranular embedded multipliers.
3.1. Decomposition of Large Size Multipliers Based on Multigranular Embedded Blocks
We assume that multigranularity is based on three types of building blocks of different bit widths: n × n, t × t, and p × p, where n > t > p. For Altera's FPGAs, for instance, n = 36, t = 18, and p = 9.
To optimize the design of the large size multipliers, the decomposition is first processed based on the largest size building blocks. Figure 3 illustrates the decomposition of the multiplication, where X and Y are the input operands of the multiplier to be implemented.
By multiplying the segmented inputs, three types of multipliers are required to generate the partial products: an unsigned multiplier for two (n − 1)-bit segments, a signed multiplier for the two most significant segments, and a signed multiplier for a most significant segment and an (n − 1)-bit segment.
The first type of multiplication can be implemented using an n × n-bit embedded signed multiplier with the sign bit forced to zero.
The second type of multiplication, a signed multiplier required only once, can be implemented using a p × p-bit, t × t-bit, or n × n-bit embedded signed multiplier according to the size of the most significant segment. If this size is less than or equal to p bits, then a p × p-bit embedded multiplier is used; if it is greater than p but less than or equal to t bits, then a t × t-bit embedded multiplier is required; otherwise, an n × n-bit embedded multiplier is needed.
The last type of multiplication is the signed multiplier for a most significant segment and an (n − 1)-bit segment. To efficiently implement this kind of multiplier, smaller size embedded blocks, such as t × t-bit embedded multipliers, are utilized. There are two scenarios to be considered, referred to as Range 1 and Range 2 and distinguished in Equation (5). Table 1 categorizes multipliers for bit sizes from 37 to 281 for these two situations, with the number of segments m from 2 to 8.

In Range 1, smaller size embedded multipliers can be used for the implementation based on double-level decomposition. To do this, the first level decomposition is based on (n − 2) bits instead of (n − 1) bits, since each segment needs to be decomposed further into two subsegments. Figure 4 illustrates the first level decomposition for the sizes in Range 1. For example, a 120 × 120-bit multiplier, which is one of the cases in Range 1, can be decomposed into four segments of 18, 34, 34, and 34 bits.
The second level decomposition separates each 34-bit operand into two 17-bit subsegments. Thus, the 18 × 34-bit multiplication can be implemented using 18 × 18-bit embedded multipliers.
On the other hand, for the cases in Range 2, double-level decomposition will not lead to an optimized solution, since the size of the most significant segment is then more than t bits. For example, for a 121 × 121-bit multiplier, if the 121-bit operand is first decomposed as 19, 34, 34, and 34 bits, the sizes of all operands in this multiplication are greater than t bits, so the t × t-bit embedded multiplier cannot be used. Therefore, only single-level decomposition is needed and n × n-bit embedded multipliers are required. The block size for this decomposition is equal to (n − 1) bits. For this example, the 121-bit operand is decomposed into 4 segments of 16, 35, 35, and 35 bits.
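The choice between the two ranges can be sketched in Python as follows. This is a hedged reading of the examples above rather than a transcription of Equation (5): it assumes m is the ceiling of (k − 1)/(n − 1), and that a size belongs to Range 1 exactly when an (n − 2)-bit first-level decomposition leaves a most significant segment of at most t bits. The function name `first_level_sizes` is hypothetical, and the special cases of Section 3.3 are not handled.

```python
def first_level_sizes(k, n=36, t=18):
    """Return (range name, segment sizes, most significant first)."""
    m = -(-(k - 1) // (n - 1))            # assumed number of segments
    top = k - (m - 1) * (n - 2)           # MS segment with (n-2)-bit blocks
    if top <= t:
        return "Range 1", [top] + [n - 2] * (m - 1)
    top = k - (m - 1) * (n - 1)           # fall back to (n-1)-bit blocks
    return "Range 2", [top] + [n - 1] * (m - 1)
```

With the Altera values n = 36 and t = 18, this reproduces the two decompositions quoted in the text for 120-bit and 121-bit operands.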
3.2. Implementation of Large Size Multipliers Based on Multigranular Embedded Multipliers
In this section, we describe the implementation approaches for the realization of large size multipliers for the scenarios presented in Section 3.1.
3.2.1. Implementation of Large Size Multipliers Based on Double Decomposition
Double decomposition is used for the bit sizes located in Range 1. The first level decomposition decomposes each input operand, X or Y, into m segments, the m − 1 least significant of which have (n − 2) bits each, where k is the size of the multiplication to be implemented, m is the number of segments, and n is the size of the largest embedded multiplier. After the first level decomposition, the segmented input operands are multiplied and the partial products are organized as shown in Figure 1. The partial product of the two most significant segments, $X_{m-1} Y_{m-1}$, can be implemented by t × t-bit embedded multipliers if the size of $X_{m-1}$ or $Y_{m-1}$ is greater than p bits, or by p × p-bit embedded multipliers if it is equal to or less than p bits. A partial product with a size of (n − 2) × (n − 2) bits can be implemented using the n × n-bit embedded multipliers with the two most significant bits forced to zeros. The last kind of partial product, $X_i Y_{m-1}$ or $X_{m-1} Y_j$, has (n − 2) bits in $X_i$ or $Y_j$, and less than or equal to t bits in $X_{m-1}$ or $Y_{m-1}$. For this kind, a second level decomposition is performed. The (n − 2)-bit operand is decomposed into two subsegments with (t − 1) bits each. After the second level decomposition, the segmented multiplication, $X_i Y_{m-1}$ or $X_{m-1} Y_j$, is implemented using t × t-bit embedded multipliers. Figure 5 presents an example for carrying out this multiplication. This process requires two t × t-bit embedded signed multipliers and one adder. Sign extension is performed before the addition. Also, the concatenation operation is used for the last (t − 1) bits of the partial products to reduce the size of the adder. To use signed embedded multipliers for unsigned numbers, the sign bits of the embedded multipliers for the (t − 1)-bit operands are forced to zeros.
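The second level decomposition of a t × (n − 2)-bit product can be modelled in a few lines of Python. This is a sketch of the arithmetic only, assuming t = 18 and n − 2 = 2(t − 1) = 34 as in the Altera case; the function name `mul_t_by_2t` is hypothetical and the code does not claim to mirror Figure 5 in detail.

```python
T = 18  # assumed embedded multiplier width t


def mul_t_by_2t(a, b, t=T):
    """a: signed value fitting in t bits; b: unsigned value of 2*(t-1) bits."""
    h = t - 1
    assert -(1 << (t - 1)) <= a < (1 << (t - 1))
    assert 0 <= b < (1 << (2 * h))
    # Each half has t-1 bits, so its sign bit in a t-bit port is zero.
    b_lo, b_hi = b & ((1 << h) - 1), b >> h
    p_lo = a * b_lo                       # first t x t signed multiplier
    p_hi = a * b_hi                       # second t x t signed multiplier
    low = p_lo & ((1 << h) - 1)           # concatenated, bypasses the adder
    return (((p_lo >> h) + p_hi) << h) | low
```

Only the bits of the low product above position t − 1 enter the adder, which matches the concatenation step described in the paragraph above.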
Once all the segmented partial products are generated, the required additions can be performed following the design rules presented in [14].
3.2.2. Implementation of Large Size Multipliers Based on Single Decomposition
In the case of Range 2, single decomposition is performed since the most significant segment of the input operands contains more than t bits. In this case, the decomposition is based on (n − 1) bits, where n is the largest size of the embedded blocks. After the decomposition, all partial products can be organized in the same way as shown in Figure 1. The optimized addition operations for summing these partial products are also performed based on the design rules summarized in [14].
3.3. Special Cases of the Implementation for the Large Size Multiplier
The special cases refer to multiplications in which the most significant segment of each operand has (t + r) or (n + r) bits, where r is in the range of 1 to 3. For these special cases, the multiplications which involve these small size segments are implemented using Look-Up Tables (LUTs) instead of embedded blocks. Based on our experimental analysis, for the cases with r greater than 3, the use of LUTs results in larger delay and area utilization. Table 2 lists the special cases of Range 1 and Range 2. In the following, we explain the algorithms for the designs of the special cases. The focus of these algorithms is to reduce the number of embedded blocks required in the designs.

3.3.1. Design of Special Cases of Range 1
In this special case, the most significant segment, $X_{m-1}$ or $Y_{m-1}$, is a signed number with a bit size of (t + 1) to (t + 3), and the other segments are positive with a size of (n − 2) bits. To explain the algorithm for this special case, let us refer to the signed segment as the A operand and to the other positive segment as the B operand. The A operand is decomposed into a high part and a low part: the high part has the most significant t bits of A, and the low part has the remaining bits of A, which is 1 to 3 bits. The B operand is also decomposed into two segments with (t − 1) bits each. The algorithm for the special cases of Range 1 is graphically illustrated in Figure 6, and the pseudocode is named Algorithm 1. In this case, two t × t-bit embedded multipliers are required instead of one n × n-bit embedded block.

In addition, Algorithm 1 can be extended to the partial product of the two most significant segments, $X_{m-1} Y_{m-1}$. Since both operands, $X_{m-1}$ and $Y_{m-1}$, have (t + 1), (t + 2), or (t + 3) bits, each can be decomposed into two segments. One segment has the most significant t bits and the other segment has the remaining 1, 2, or 3 bits, respectively. This multiplication is implemented by one t × t-bit embedded multiplier with some additions rather than one n × n-bit embedded multiplier.
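The idea behind the special cases can be sketched in Python. Algorithm 1 itself is not reproduced here, so the following is an assumed arithmetic model, not the authors' pseudocode: the signed operand A of (width + r) bits is split into a signed high part of `width` bits and an unsigned low part of r bits, the high part goes to embedded multipliers, and the low part is handled by shift-and-add logic standing in for the LUTs. The name `mul_special` and its parameters are hypothetical.

```python
def mul_special(a, b, r, width):
    """a: signed (width + r)-bit value; b: unsigned segment; 1 <= r <= 3."""
    assert 1 <= r <= 3
    assert -(1 << (width + r - 1)) <= a < (1 << (width + r - 1))
    a_lo = a & ((1 << r) - 1)             # r least significant bits, unsigned
    a_hi = a >> r                         # most significant `width` bits, signed
    main = a_hi * b                       # mapped to embedded multipliers
    # Small r-bit by wide multiplication done as shift-and-add (LUT logic).
    small = sum(b << i for i in range(r) if (a_lo >> i) & 1)
    return (main << r) + small
```

For Range 1 one would call it with `width` equal to t, and for Range 2 with `width` equal to n; in the Range 1 case the product `a_hi * b` can itself be split into two t × t multiplications because B has 2(t − 1) bits.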
3.3.2. Design of the Special Cases of Range 2
For these cases, the size of the most significant segments of the input operands, $X_{m-1}$ and $Y_{m-1}$, is (n + 1), (n + 2), or (n + 3) bits. These segmented operands are multiplied with the other segments, which have a size of (n − 1) bits.
Let A again denote $X_{m-1}$ or $Y_{m-1}$, and let B be any other positive segment with (n − 1) bits. The operand A is decomposed into a high part and a low part: the high part has the most significant n bits of A, and the low part has the remaining 1 to 3 bits. The multiplication of these two operands is illustrated in Figure 7, and the realization is based on the pseudocode named Algorithm 2.

In the special cases of Range 2, the partial product of the two most significant segments, $X_{m-1} Y_{m-1}$, can be implemented in a similar way as for Range 1. In this case, only one n × n-bit embedded multiplier with some additions is required.
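When both most significant segments carry r extra bits, the product expands into one wide multiplication plus three small terms. The Python sketch below illustrates this expansion under the assumption that both operands are split at the same bit position r; it is not taken from Algorithm 2, and the name `mul_top_segments` is hypothetical.

```python
def mul_top_segments(a, b, r):
    """a, b: signed values whose r low bits are split off; 1 <= r <= 3."""
    mask = (1 << r) - 1
    a_hi, a_lo = a >> r, a & mask         # signed high part, unsigned low part
    b_hi, b_lo = b >> r, b & mask
    core = a_hi * b_hi                    # the single embedded multiplication
    # The remaining terms have an r-bit factor and are cheap additions.
    return ((core << (2 * r))
            + ((a_hi * b_lo) << r)
            + ((a_lo * b_hi) << r)
            + a_lo * b_lo)
```

Each of the three correction terms has at least one factor of at most 3 bits, which is why the text can describe them as "some additions".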
4. Design Example: A 256 × 256-bit Multigranular Signed Multiplier
As a design example, this section summarizes the implementation of a 256 × 256-bit signed multiplier using multigranular embedded blocks with n = 36, t = 18, and p = 9. This design example is described in the following four steps.
Step 1 (First level decomposition). Since the size of the operands is in Range 1, two levels of decomposition are required. The first level decomposition is based on (n − 2) = 34 bits. In this case, each of the 256-bit input operands is decomposed from right to left into segments of 34 bits each. Because 256 = 7 × 34 + 18, eight segments are required and the most significant segment has 18 bits. Since all the segmented operands have more than 9 bits, the 9 × 9-bit embedded blocks cannot be used in this case.
Step 2 (Generation of the segmented partial products). After the first-level decomposition, these segmented input operands are multiplied, and the organization of all the partial products is shown in Figure 8.
Step 3 (Second level decomposition). In Figure 8, two types of basic multipliers are required. One is the 34 × 34-bit unsigned multiplier and the other is the 18 × 34-bit signed multiplier. The 34 × 34-bit unsigned multiplication is implemented using a 36 × 36-bit signed embedded multiplier with the first two bits forced to zeros, and the 18 × 34-bit multiplication has to be decomposed again.
At the second level decomposition, each 34-bit operand is split into two 17-bit subsegments. Then, the 18 × 34-bit multiplication is implemented using the process shown in Figure 5. Also, the sign bits of the embedded multipliers for the 17-bit operands are forced to zeros.
In this implementation, one 18 × 18-bit, forty-nine 34 × 34-bit, and fourteen 18 × 34-bit multipliers are required. Based on Altera's FPGAs, the total number of embedded blocks used (in terms of 9-input DSP elements) is equal to 49 × 8 + (14 × 2) × 2 + 1 × 2 = 450. However, if there is no second level decomposition, each 18 × 34-bit multiplication requires a 36 × 36-bit embedded multiplier. Under this condition, the total number of embedded 9-input DSP elements used is equal to 49 × 8 + 14 × 8 + 1 × 2 = 506, an increase of about 12.4%.
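The block count in Step 3 follows from the number of segments. The short Python sketch below reproduces it under the costs stated in the text (8 DSP elements per 36 × 36-bit multiplier and 2 per 18 × 18-bit multiplier); the function name `dsp_elements` is hypothetical, and the formula assumes a Range 1 design whose most significant segment fits an 18 × 18-bit block.

```python
def dsp_elements(m, second_level=True):
    """9-input DSP elements for an m-segment Range 1 multiplier."""
    full = (m - 1) ** 2          # 34 x 34-bit products, one 36 x 36 block each
    edge = 2 * (m - 1)           # 18 x 34-bit products
    corner = 1                   # 18 x 18-bit product of the top segments
    # With second level decomposition an edge product uses two 18 x 18
    # blocks (2 elements each); otherwise it needs one 36 x 36 block.
    edge_cost = 2 * 2 if second_level else 8
    return full * 8 + edge * edge_cost + corner * 2


with_split = dsp_elements(8)                          # 256-bit case, m = 8
without_split = dsp_elements(8, second_level=False)
increase = (without_split - with_split) / with_split
```

For m = 8 the two totals are 450 and 506, so skipping the second level decomposition costs 56 additional DSP elements.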
Step 4 (Summing the partial products). The last step is to sum all partial products shown in Figure 8 to get the final result of the large size multiplier. Using the rules proposed in [14], the additions can be performed as shown in Figure 9 based on two-input operand adders.
5. Implementation Results
The multigranular signed multipliers were implemented using Altera's FPGAs. The synthesis tool used is Quartus II version 7.2, targeting the device EP2S180F1508C3 from the Stratix II family [1]. To test the effectiveness of our proposed approach, the 9-input DSP element is used as the unit of computation. Results of our proposed design approach are compared with the standard scheme adopted by the Quartus synthesis tool [16], which uses primarily 18 × 18-bit and 9 × 9-bit embedded multipliers as building blocks. All the designs in this paper are registered at the inputs and outputs. The size of the input operands ranges from 40 to 255 bits, in steps of 5 bits from one case to the next. The case of the 256 × 256-bit multiplier, which has a range of applications, is also assessed. Moreover, some special cases are implemented to reduce the number of embedded blocks based on the algorithms presented in Section 3.3.
The proposed approach and the traditional scheme are compared based on the following metrics extracted from the implementation summary and the timing analyzer summary: (1) the clock period, (2) the number of Adaptive Look-Up Tables (ALUTs) used, and (3) the number of embedded blocks in terms of 9-input DSP elements used. All these results are presented in Figure 10. Then, the delay-ALUT product and the delay-DSP-element product are computed based on the implementation results and presented in Figure 11.
Figure 10: (a) Delay; (b) Area; (c) DSP blocks.
Figure 11: (a) Delay-ALUT product; (b) Delay-DSP-element product.
Compared to the results of the standard scheme, the proposed multigranular multiplier method yields considerable improvements in timing and area. On average, the delay is reduced by 20.7%, and the multigranular approach uses 67.6% fewer ALUTs than the standard scheme. However, in roughly 25% of the cases our approach requires more 9-input DSP elements than the standard scheme, as shown in Figure 10(c).
Moreover, the implementation results of the multiplications can be improved using the algorithms for the special cases explained earlier, which reduce the number of embedded blocks. These results are presented later in this section. Also, considering the product of the delay and the number of ALUTs as well as the product of the delay and the number of embedded blocks, a significant improvement has been achieved, as can be seen in Figure 11. The average reductions of the delay-ALUT product and the delay-DSP-element product are 74.4% and 15.8%, respectively, compared to the standard scheme.
For the case of the 256 × 256-bit multiplier, results are listed in Table 3. Compared to the standard scheme, the delay is reduced by about 21.7% and the number of ALUTs by 71.4%; however, the number of 9-input DSP elements increases by one, which translates into a 0.22% penalty. The delay-ALUT product and the delay-DSP-element product for the multigranular approach are improved by 77.6% and 21.6%, respectively, compared to the standard scheme.
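The product figures for the 256 × 256-bit case can be cross-checked from the individual percentages. The Python sketch below uses only the rounded values quoted in the text, so it can be expected to agree with the reported products only to within rounding; the helper name `product_reduction` is hypothetical.

```python
def product_reduction(delay_reduction, resource_change):
    """Relative reduction of delay x resource; resource_change is signed."""
    return 1 - (1 - delay_reduction) * (1 + resource_change)


alut = product_reduction(0.217, -0.714)    # 71.4% fewer ALUTs
dsp = product_reduction(0.217, +0.0022)    # 0.22% more DSP elements
```

The ALUT figure comes out at about 77.6%, and the DSP-element figure lands within a tenth of a percentage point of the reported 21.6%.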

For the special cases, Figure 12 illustrates the implementation results for six cases ranging from 3 to 7 segments. In this figure, "proposed general" refers to the design approach presented in Section 3.2, and "proposed special" refers to the special cases, which are implemented using the algorithms presented in Section 3.3. From Figure 12, it is clear that the number of DSP elements used in the special cases is reduced and is now exactly the same as that used in the standard approach. Although this results in an increase in the number of ALUTs, the number is still significantly lower than that used in the standard approach.
Figure 12: (a) Delay; (b) Number of ALUTs used; (c) Number of DSP elements used.
6. Conclusions
The focus of this paper is to realize large size signed multipliers using DSP blocks with multigranular embedded signed multipliers in FPGAs. Multiple decompositions are used to efficiently exploit the multigranularity offered in modern FPGAs. The effectiveness of the proposed design approach has been tested using various benchmarks and compared with a standard approach using a commercial tool. Although this tool has complete access to the features available in the DSP blocks and in the 6-LUTs of the target FPGA, our methodology outperformed the standard scheme in terms of speed and area.
References
[1] Altera, “DSP blocks in Stratix II & Stratix II GX devices,” in Stratix II Device Handbook, Volume 2, Altera, San Jose, Calif, USA, 2006.
[2] G. Govindu, L. Zhuo, S. Choi, and V. Prasanna, “Analysis of high-performance floating-point arithmetic on FPGAs,” in Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS '04), vol. 1, p. 149, Santa Fe, NM, USA, April 2004.
[3] A. Akkas and M. J. Schulte, “A quadruple precision and dual double precision floating-point multiplier,” in Proceedings of the Euromicro Symposium on Digital Systems Design (DSD '03), pp. 76–81, Belek-Antalya, Turkey, September 2003.
[4] J. Draper, J. Sondeen, and C. W. Kang, “Implementation of a 256-bit wide-word processor for the data-intensive architecture (DIVA) processing-in-memory (PIM) chip,” in Proceedings of the 28th European Solid-State Circuits Conference (ESSCIRC '02), pp. 77–80, Florence, Italy, September 2002.
[5] I. Koren, Computer Arithmetic Algorithms, A K Peters, Natick, Mass, USA, 2nd edition, 2001.
[6] R. Fried, “Minimizing energy dissipation in high-speed multipliers,” in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '97), pp. 214–219, Monterey, Calif, USA, August 1997.
[7] C. R. Baugh and B. A. Wooley, “A two's complement parallel array multiplication algorithm,” IEEE Transactions on Computers, vol. 22, no. 12, pp. 1045–1047, 1973.
[8] H. Parandeh-Afshar, P. Brisk, and P. Ienne, “Efficient synthesis of compressor trees on FPGAs,” in Proceedings of the 13th Asia and South Pacific Design Automation Conference (ASP-DAC '08), pp. 138–143, Seoul, Korea, January 2008.
[9] B. R. Lee and N. Burgess, “Improved small multiplier based multiplication, squaring and division,” in Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '03), pp. 91–97, Napa, Calif, USA, April 2003.
[10] G. Quan, J. P. Davis, S. Devarkal, and D. A. Buell, “High-level synthesis for large bit-width multipliers on FPGAs: a case study,” in Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, vol. 1, pp. 213–218, Jersey City, NJ, USA, September 2005.
[11] N. Nedjah and L. de Macedo Mourelle, “A reconfigurable recursive and efficient hardware for Karatsuba-Ofman's multiplication algorithm,” in Proceedings of IEEE Conference on Control Applications (CCA '03), vol. 2, pp. 1076–1081, Istanbul, Turkey, June 2003.
[12] Xilinx Inc., “Xilinx DSP Design Considerations,” XtremeDSP for Virtex-4 FPGAs, UG073, (v2.2), 2006.
[13] S. Gao, N. Chabini, D. Al-Khalili, and P. Langlois, “Optimised realisations of large integer multipliers and squarers using embedded block,” IET Computers & Digital Techniques, vol. 1, no. 1, pp. 9–16, 2007.
[14] S. Gao, D. Al-Khalili, and N. Chabini, “Optimized realization of large-size two's complement multipliers on FPGAs,” in Proceedings of the 50th IEEE International Midwest Symposium on Circuits and Systems Joint with the 5th IEEE Northeast Workshop on Circuits and Systems (NEWCAS '07), pp. 494–497, Montreal, Canada, August 2007.
[15] S. Gao, N. Chabini, and D. Al-Khalili, “256 × 256-bit multiplier using multigranular embedded DSP blocks in FPGAs,” in Proceedings of the 6th International IEEE Northeast Workshop on Circuits and Systems and TAISA Conference (NEWCAS-TAISA '08), pp. 253–256, Montreal, Canada, June 2008.
[16] http://www.altera.com.
Copyright
Copyright © 2009 Shuli Gao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.