International Journal of Reconfigurable Computing

Volume 2017 (2017), Article ID 2410408, 12 pages

https://doi.org/10.1155/2017/2410408

## Efficient Realization of BCD Multipliers Using FPGAs

^{1}Department of Electrical and Computer Engineering, Royal Military College of Canada, Kingston, ON, Canada

^{2}Department of Computer Engineering, École Polytechnique de Montréal, Montréal, QC, Canada

Correspondence should be addressed to Dhamin Al-Khalili

Received 22 October 2016; Revised 2 February 2017; Accepted 9 February 2017; Published 6 March 2017

Academic Editor: Seda Ogrenci-Memik

Copyright © 2017 Shuli Gao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In this paper, a novel BCD multiplier approach is proposed. The main highlight of the proposed architecture is the generation of the partial products and parallel binary operations based on 2-digit columns. 1 × 1-digit multipliers used for the partial product generation are implemented directly by 4-bit binary multipliers without any code conversion. The binary results of the 1 × 1-digit multiplications are organized according to their two-digit positions to generate the 2-digit column-based partial products. A binary-decimal compressor structure is developed and used for partial product reduction. These reduced partial products are added in optimized 6-LUT BCD adders. The parallel binary operations and the improved BCD addition result in improved performance and reduced resource usage. The proposed approach was implemented on Xilinx Virtex-5 and Virtex-6 FPGAs with emphasis on the critical path delay reduction. Pipelined BCD multipliers were implemented for 4 × 4, 8 × 8, and 16 × 16-digit multipliers. Our realizations achieve an increase in speed by up to 22% and a reduction of LUT count by up to 14% over previously reported results.

#### 1. Introduction

The traditional approach of using binary arithmetic in a decimal system requires frontend and backend conversions. These conversions can take a significant amount of processing time and consume a large area. A more important problem is that fractional decimal numbers expressed in a binary format may lose accuracy, which can have a major impact in financial and commercial applications. To solve these problems, interest in the hardware design of decimal arithmetic is growing. This has led to the incorporation of specifications for decimal arithmetic in the IEEE 754-2008 standard for floating-point arithmetic [1]. The development of decimal operations in hardwired designs with high performance and low resource usage is expected to facilitate the implementation of various applications [2].
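The accuracy problem mentioned above can be seen in a few lines of software. The following is only an illustrative sketch using Python's standard `decimal` module, not part of the proposed hardware; the variable names are chosen for this example.

```python
from decimal import Decimal

# 0.1 and 0.2 have no exact binary floating-point representation,
# so their binary sum is not exactly 0.3.
binary_sum = 0.1 + 0.2

# The same values held as decimal numbers add exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(binary_sum)                       # 0.30000000000000004
print(binary_sum == 0.3)                # False
print(decimal_sum == Decimal("0.3"))    # True
```

Errors of this kind accumulate over repeated operations, which is why financial workloads favor decimal arithmetic.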

Multiplication is one of the more complex decimal operations. To speed up this operation, early decimal multipliers were designed at the gate level targeting ASICs. The authors in [3] proposed an improved iterative decimal multiplier approach to reduce the number of iteration cycles. To avoid a large number of decimal to binary conversions, a two-digit stage was used as the basic block for the iterative Binary Coded Decimal (BCD) multiplier. To further speed up the multiplication, parallel decimal multipliers were proposed. Binary multipliers and binary to BCD conversion were utilized to implement 1 × 1-digit multipliers, and different binary compressors were employed to reduce the multiplier results [4–6]. To avoid the binary to decimal conversion, recoding methods were used to generate the partial products of the BCD multiplier [7, 8]. A Radix10 combinational multiplier was introduced in [7], and Radix4 and Radix5 recoding methods were presented in [8]. In [9], Radix5 recoding was combined with BCD code converters using BCD4221 and BCD5211 codes to simplify the partial product generation and reduction. In the last two years, several ASIC-based designs for decimal multiplication were proposed in [10–14]. These designs used recoding methods and BCD code conversions for efficient implementation in ASICs.

Although there are a number of approaches to implement decimal multipliers in ASICs, utilizing the same methods in FPGA devices is not necessarily efficient. With recent advancements in FPGA technology, enhanced architectures, and availability of various hardware resources, the FPGA platform is recognized as a viable alternative to ASICs in many cases. To make efficient use of FPGA resources in the implementation of decimal multiplication, new algorithms and approaches have been developed. The authors in [15] implemented decimal multipliers using embedded binary multiplier blocks in FPGAs. The binary-BCD conversion was implemented using base-1000 as an intermediate base, and the result was converted to BCD using a shift-add-3 algorithm. In [16], the authors presented a double-digit decimal multiplier technique that performs 2-digit multiplications simultaneously in one clock cycle; the overall multiplication was then performed serially. In [17, 18], a 1 × 1-digit multiplier was designed directly with BCD inputs/outputs and implemented using 6-input or 4-input LUTs. To sum the results of the 1 × 1-digit multipliers, a fast carry-chain decimal adder was also proposed in [18]. These decimal-operation-based approaches avoided the conversions but at a cost in speed. Vázquez and De Dinechin implemented a BCD multiplier using a recoding technique [19]. Signed-Digit (SD) Radix5 was employed to recode one of the input operands of the multiplier for the generation of the partial products. 6-input LUTs and fast carry chains in Xilinx FPGAs were used to generate the building blocks and the decimal adders. To increase the performance, the authors in [20] implemented a parallel decimal multiplier based on the Karatsuba-Ofman algorithm. The building blocks used in the Karatsuba-Ofman algorithm were designed based on the approach proposed in [19]. Another SD-based decimal multiplier approach was proposed in [21]. The recoding was based on SD Radix10. BCD4221, 5211, and 5421 converters were used for the partial product generation. BCD4221-based compressors and adders were utilized in this approach. Although the BCD4221-based operations are similar to binary operations, the recoding and the different code conversions still incur delay and resource costs.

In this paper, we propose a new parallel binary-operation-based decimal multiplier approach. Binary operations are performed for the 1 × 1-digit multiplication and the partial product reduction based on the columns with two digits in each column. The operations for all columns are processed in parallel. After the column-based binary operations, binary to decimal conversions are required but the bit sizes of the operands to be converted are limited based on the columns. In this paper, an improved 6-LUT-based BCD adder and a 2-digit column-based binary-decimal compressor are also presented. Our proposed approach was implemented in Xilinx Virtex-5 and Virtex-6 FPGAs. The results are compared with Radix-recoding-based approaches using a BCD4221 coding scheme. The proposed approach achieves improved FPGA performance in part because of the parallel binary operations and small size conversions.
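The column-based idea can be sketched in software as follows. This is only one plausible reading of the description above, under stated assumptions: the grouping rule (column index `(i + j) // 2`, with a factor of 10 for odd `i + j`) and the function names `digits_lsd_first`, `column_sums`, and `multiply_via_columns` are hypothetical, and the final recombination with ordinary integers stands in for the compressor and BCD adders described later in the paper.

```python
def digits_lsd_first(n: int, width: int) -> list:
    """Decimal digits of n, least significant digit first, padded to width."""
    return [(n // 10**i) % 10 for i in range(width)]


def column_sums(x: int, y: int, n_digits: int) -> list:
    """Accumulate 1 x 1-digit products into 2-digit-wide columns.

    Each product a[i] * b[j] is an ordinary binary product (at most 81).
    Assumption: it is assigned to column (i + j) // 2, and scaled by 10
    when i + j is odd, so every column carries weight 100**k.
    """
    a = digits_lsd_first(x, n_digits)
    b = digits_lsd_first(y, n_digits)
    cols = [0] * n_digits  # i + j ranges over 0 .. 2n-2, giving n columns
    for i in range(n_digits):
        for j in range(n_digits):
            k, odd = divmod(i + j, 2)
            cols[k] += a[i] * b[j] * (10 if odd else 1)
    return cols


def multiply_via_columns(x: int, y: int, n_digits: int) -> int:
    """Recombine the independent column sums into the full product."""
    cols = column_sums(x, y, n_digits)
    return sum(c * 100**k for k, c in enumerate(cols))


print(column_sums(12, 34, 2))           # [108, 3]
print(multiply_via_columns(12, 34, 2))  # 408
```

Each column sum depends only on its own digit products, so all columns can be evaluated in parallel, and only the bounded column sums need binary to decimal conversion.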

The organization of this paper is as follows. Section 2 presents optimized building blocks required by the BCD multiplication. The proposed multiplier architecture and the schemes of the partial product generation and reduction are presented in Section 3. The implementation results of the 4 × 4, 8 × 8, and 16 × 16-digit BCD multipliers are reported in Section 4. Conclusions are given in Section 5.

#### 2. Proposed Building Blocks for the Realization of BCD Multiplication

In this section, proposed schemes for an improved 6-input LUTs-based BCD adder and a mixed binary-decimal compressor are presented. These schemes will be utilized as the basic building blocks to construct our proposed BCD multipliers presented in Section 3.

##### 2.1. 6-Input LUTs-Based 1-Digit BCD Adder

The 6-input LUTs-based 1-digit BCD adder is based on the use of 6-input LUTs and MUX-XOR networks in FPGAs. It is an improved version of the architecture presented in [19].

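Assume that the input operands of the adder are $A = a_3a_2a_1a_0$ and $B = b_3b_2b_1b_0$ in BCD8421 format. The input operands are decomposed as

$$A = 2A^{(3)} + a_0, \qquad B = 2B^{(3)} + b_0, \tag{1}$$

where $A^{(3)} = a_3a_2a_1$ and $B^{(3)} = b_3b_2b_1$ are the upper three bits of each operand. Then, with carry-in $c_{in}$, the addition is presented as

$$A + B + c_{in} = 2\left(A^{(3)} + B^{(3)}\right) + \mathrm{FA}(a_0, b_0, c_{in}). \tag{2}$$

In (2), $a_0$ or $b_0$ has the binary set $\{0, 1\}$, and the full adder $\mathrm{FA}$ has two outputs, the carry $c_1$ and the sum $s_0$. The three-bit sum in (2) is computed by the function $f$, a three-bit adder with the add-3 correction merged, which can be expressed as

$$f\left(A^{(3)}, B^{(3)}\right) = \begin{cases} A^{(3)} + B^{(3)}, & A^{(3)} + B^{(3)} < 5, \\ A^{(3)} + B^{(3)} + 3, & A^{(3)} + B^{(3)} \geq 5. \end{cases} \tag{3}$$

In (3), $f$ cannot be $[0101]_2$, $[0110]_2$, or $[0111]_2$ because of the +3 correction. Also, since the maximal value of $A^{(3)}$ and $B^{(3)}$ is $[100]_2$, the maximal value of $f$ is $[1011]_2$. The function $f$ has 6 inputs; therefore, it can be efficiently mapped in a single level of 6-input LUTs.

To calculate the final result in BCD format, the carry $c_1$ of the full adder must be added to $f$. As a special case, an add-3 correction must be considered if $f = [0100]_2$ and $c_1 = 1$ to achieve a correct final result. Table 1 is the truth table for the final correction.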
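The behavior of this adder can be checked with a small software model. The following is a minimal sketch, assuming the decomposition into the upper three bits and the least significant bit described above; the names `f` and `bcd_digit_add` are hypothetical, and the model describes arithmetic behavior only, not the LUT and MUX-XOR mapping.

```python
def f(a3: int, b3: int) -> int:
    """Three-bit adder with the add-3 correction merged."""
    s = a3 + b3
    return s + 3 if s >= 5 else s


def bcd_digit_add(a: int, b: int, c_in: int = 0) -> tuple:
    """Add two BCD digits and a carry-in; return (carry_out, sum_digit)."""
    assert 0 <= a <= 9 and 0 <= b <= 9 and c_in in (0, 1)
    a3, a0 = a >> 1, a & 1      # upper three bits and LSB of A
    b3, b0 = b >> 1, b & 1      # upper three bits and LSB of B

    # Full adder on the least significant bits.
    t = a0 + b0 + c_in
    c1, s0 = t >> 1, t & 1

    # Add the full-adder carry to f, with the special-case add-3
    # correction when f = 4 and the carry is 1.
    upper = f(a3, b3) + c1
    if f(a3, b3) == 4 and c1 == 1:
        upper += 3

    # 5-bit result: bit 4 is the decimal carry, bits 3..0 the BCD digit.
    value = 2 * upper + s0
    return value >> 4, value & 0xF


print(bcd_digit_add(5, 5))      # (1, 0)
print(bcd_digit_add(9, 9, 1))   # (1, 9)
```

Checking all 200 input combinations against `divmod(a + b + c_in, 10)` confirms that the single special case is sufficient for this model.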