International Journal of Reconfigurable Computing

Volume 2015 (2015), Article ID 518272, 16 pages

http://dx.doi.org/10.1155/2015/518272

## High Efficiency Generalized Parallel Counters for Look-Up Table Based FPGAs

Department of Computer Science & Engineering, National Institute of Technology Srinagar, Jammu and Kashmir 190006, India

Received 29 June 2015; Revised 1 September 2015; Accepted 20 September 2015

Academic Editor: Martin Margala

Copyright © 2015 Burhan Khurshid and Roohie Naaz Mir. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Generalized parallel counters (GPCs) are used in constructing high speed compressor trees. Prior work has focused on utilizing the fast carry chain and mapping the logic onto Look-Up Tables (LUTs). This mapping is not optimal in the sense that the LUT fabric is not fully utilized. This results in low efficiency GPCs. In this work, we present a heuristic that efficiently maps the GPC logic onto the LUT fabric. We have used our heuristic on various GPCs and have achieved an improvement in efficiency ranging from 33% to 100% in most of the cases. Experimental results using Xilinx 5th-, 6th-, and 7th-generation FPGAs and Stratix IV and V devices from Altera show a considerable reduction in resources utilization and dynamic power dissipation, for almost the same critical path delay. We have also implemented GPC-based FIR filters on 7th-generation Xilinx FPGAs using our proposed heuristic and compared their performance against conventional implementations. Implementations based on our heuristic show improved performance. Comparisons are also made against filters based on integrated DSP blocks and inherent IP cores from Xilinx. The results show that the proposed heuristic provides performance that is comparable to the structures based on these specialized resources.

#### 1. Introduction

Multioperand addition is an important operation in many arithmetic circuits. It is frequently used in many applications like filtering [1], motion estimation [2], array multiplication [3–7], and so forth. Compressor trees form the basic elements in multioperand additions. Compressor trees based on carry save adders (CSA) typically provide higher speeds due to the avoidance of long carry chains. Dadda [3] and Wallace [7] trees are CSA based compressor trees which are frequently used in application specific integrated circuit (ASIC) design. However, the introduction of fast carry chains in FPGAs has made ripple carry addition faster than the carry save addition. Evidently CSA based compressor trees are not well suited for implementation involving FPGAs [8].

Prior work on compressor tree synthesis using FPGAs has used GPCs as basic constituent element. It has been demonstrated that the usage of GPCs can lead to a considerable reduction in the critical path delay with comparable resource utilization [8–14]. Initial attempts in this regard were made by Parandeh-Afshar et al. [8–11]. In [9] they claim to report the first method that synthesizes compressor trees on FPGAs. The proposed heuristic constructs compressor trees from a library of GPCs that can be efficiently implemented on FPGAs. Their latter work [11] focuses on further reducing the combinational delay and any increase in area by formulating the mapping of GPCs as an integer linear programming (ILP) problem. The authors reported an average reduction in delay by 32% and area by 3% when compared to an adder tree. In [10] focus is on reducing the combinational delay by using embedded fast carry chains. This concept was further extended in [8] and a delay reduction of 33% and 45% was achieved in Xilinx Virtex-5 and Altera Stratix-III FPGAs, respectively.

Matsunaga et al. [12, 14] also formulated the mapping of GPCs as an ILP with speed and power as optimization goals. Their results show a 28% reduction in GPC count when compared to [9]. A reduction in GPC count results in reduction of compression stages thereby reducing the delay and power consumption.

Recent attempts from Kumm and Zipf [15, 16] focus on exploiting the low-level structure of Xilinx FPGAs to develop novel GPCs with high compression ratios and efficient resource utilization. Both general purpose LUT fabric and specialized carry chains have been used for synthesizing resource-efficient delay-optimal GPCs.

All the abovementioned approaches (except [9]) focus on exploiting the fast carry chain embedded in modern FPGAs. The idea is to use the fast carry chain to connect the adjacent logic cells and bypass the programmable routing network to reduce delay [10]. This results in reduced critical path delay and hence increased speed. The mapped GPCs, however, suffer from poor efficiency. In this paper, we use a heuristic that improves the efficiency of the mapped GPCs by reducing the number of LUTs required to map the GPC logic. Our heuristic mainly targets 6-input LUTs from Xilinx FPGAs that can implement a single 6-input function or two 5-input functions with shared inputs. However, the same heuristic can be used for Altera FPGAs that support decomposition of 6-input LUTs into dual 4-input LUTs, when used in the arithmetic or shared-arithmetic mode. Additionally in both devices the fast carry chain is used to handle the carry rippling so that there is no increase in the critical path delay.

The rest of the paper is organized as follows: Section 2 briefly introduces the basic preliminaries about the GPCs, the Xilinx and Altera LUT architecture, and the terminology used in this paper. Section 3 discusses the heuristic used to synthesize different GPCs. Synthesis and implementation are carried out in Section 4. Conclusions are drawn in Section 5 and references are listed at the end.

#### 2. Preliminaries and Terminology

A compressor tree is a circuit that takes , -bit unsigned operands and generates two output values, sum () and carry (), such that

A generalized parallel counter computes the sum of bits having different weights. A GPC is traditionally represented as a tuple (), where denotes the number of input bits of weight and is the number of output bits. The upper limit on the value of GPC is given by

The efficiency of a GPC is measured by the number of reduced bits in relation to the hardware resources and is given bywhere is the number of input bits; is the number of output bits; and is the number of LUTs used. As an example, a (1, 4, 1, 5; 5) GPC has five input bits of weight 0; one input bit of weight 1; four input bits of weight 2; and one input bit of weight 3. The upper limit on the output value is 31 and five bits are required to represent the output.

Logic synthesis is concerned with hardware realization of a desired functionality with minimum possible cost. The* cost* of a circuit is a measure of its speed, resource utilization, power consumption, or any combination of these. A* Boolean network* is a directed acyclic graph (DAG) that represents a combinational function. Logic gates, primary inputs (PIs), and primary outputs (POs) within this network are represented by* nodes*. A node may have zero or more predecessor nodes known as* fan-in* nodes. Similarly a node may drive zero or more successor nodes known as* fan-out* nodes. A network is said to be* k-bounded* if the fan-in of every node does not exceed . Each node implements a* local function*. A* global function* is implemented by connecting the logic implemented by individual nodes. The* level* of the node is the length of the longest path from any PI node to . Network* depth* is defined as the largest level of a node in the network. The critical path delay and area of a circuit are measured by the depth and number of LUTs, respectively. The transformation of a Boolean network into targeted logic elements gives the* circuit-netlist*. For FPGAs the targeted element is a -LUT.

Altera Stratix IV and V FPGAs have Adaptive Logic Module (ALM) as the basic logic cell. The LUT resources within each ALM are divided into two adaptive LUTs (ALUTs). Normal operating mode uses a combination of these ALUTs within an ALM to implement functions with up to eight different inputs [17]. However, Altera supports specialized arithmetic and shared-arithmetic modes for each Stratix ALM for arithmetic extensive applications. In these modes each individual LUT can implement two 4-input functions with shared inputs. The arithmetic and shared-arithmetic modes also enable the use of fast carry chains that result in efficient implementation of different arithmetic functions. A typical Stratix IV ALM in arithmetic mode driving the carry chain is shown in the schematic of Figure 1.