International Journal of Reconfigurable Computing

Volume 2016, Article ID 8972065, 11 pages

http://dx.doi.org/10.1155/2016/8972065

## On-Chip Reconfigurable Hardware Accelerators for Popcount Computations

Department of Electronics, Telecommunications and Informatics/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal

Received 26 November 2015; Accepted 21 February 2016

Academic Editor: Eduardo Marques

Copyright © 2016 Valery Sklyarov et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Popcount computations are widely used in such areas as combinatorial search, data processing, statistical analysis, and bio- and chemical informatics. In many practical problems the size of the initial data is very large and an increase in throughput is important. The paper suggests two types of hardware accelerators that are (1) designed in FPGAs and (2) implemented in Zynq-7000 all programmable systems-on-chip, with the algorithms that use popcounts partitioned between software running on the ARM Cortex-A9 processing system and advanced programmable logic. A three-level system architecture that includes a general-purpose computer, the problem-specific ARM, and reconfigurable hardware is then proposed. The results of experiments and comparisons with existing benchmarks demonstrate that although the throughput of popcount computations is increased in FPGA-based designs interacting with general-purpose computers, communication overheads (in experiments with PCI Express) are significant, and actual advantages can be gained if not only popcount but also other types of relevant computations are implemented in hardware. The comparison of software/hardware designs for Zynq-7000 all programmable systems-on-chip with pure software implementations in the same Zynq-7000 devices demonstrates an increase in performance by a factor ranging from 5 to 19 (taking into account all the communication overheads between the programmable logic and the processing system).

#### 1. Introduction

Popcount (short for “population count,” also called Hamming weight) of a binary vector is the number of ones in the vector. It is also defined for any vector (not necessarily binary) as the number of the vector’s nonzero elements. In many practical applications, the execution time of popcount computations over vectors has a significant impact on the overall performance of systems that use the results of such computations. Popcounts are needed in many different areas; a few examples are given below.

Let us consider the covering problem, which can be formulated on sets or matrices. Let the input be a binary incidence matrix. For each row, consider the subset of columns covered by that row (i.e., the row has the value 1 in all columns of the subset). The minimal row cover is composed of the minimal number of such subsets that together cover all the matrix columns. Clearly, for the rows of such a cover there is at least one value 1 in each column of the matrix. Different algorithms have been proposed to solve the covering problem, such as greedy heuristics [1, 2] and a very similar method [3]. Highly parallel algorithms are described in [4], where it is suggested that the given matrix (set) be unrolled in such a way that all its rows and columns are saved in FPGA registers. Note that more than a hundred thousand such registers are available even in recent low-cost devices. This technique permits all rows and columns to be accessed and processed concurrently, with the Hamming weights of all the rows and columns counted in parallel.
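To illustrate how popcounts drive such algorithms, the following is a minimal software sketch of a greedy row-cover heuristic. It is not the method of [1–4]; it assumes that each matrix row is stored as a Python integer used as a bitmask, and the names `popcount` and `greedy_row_cover` are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of ones in the binary representation of a non-negative integer."""
    return bin(x).count("1")


def greedy_row_cover(rows: list[int], num_cols: int) -> list[int]:
    """Return the indices of the rows chosen by a simple greedy heuristic.

    Each row is a bitmask (assumption): bit j set means the row covers column j.
    At every step the row covering the most still-uncovered columns is taken.
    The result is a cover, but it is not guaranteed to be minimal.
    """
    uncovered = (1 << num_cols) - 1  # all columns are uncovered at the start
    chosen = []
    while uncovered:
        # The popcount of (row AND uncovered) is the number of new columns gained.
        best = max(range(len(rows)), key=lambda i: popcount(rows[i] & uncovered))
        gain = rows[best] & uncovered
        if gain == 0:
            raise ValueError("some column is not covered by any row")
        chosen.append(best)
        uncovered &= ~gain
    return chosen


if __name__ == "__main__":
    matrix_rows = [0b0111, 0b1100, 0b0001, 0b1000]
    print(greedy_row_cover(matrix_rows, num_cols=4))
```

In software the popcounts inside the `max` call are evaluated one row at a time; the unrolled hardware approach described above evaluates them for all rows concurrently.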

In recent years genetic data analysis has become a very important research area and the size of the data to be processed has increased significantly. For example, a 37 GB array is created to represent the genotypes of 1000 individuals [5]. Storing such large arrays requires a huge memory. Genotype data can be compressed into succinct structures [6] and then analyzed in such applications as BOOST [7] and BiForce [8]. Further advances in the use of succinct data structures for genomic encoding are provided in [5]. The methods proposed in [5] compute popcounts intensively for very large data sets, and it is underlined there that a further performance increase may be possible with hardware accelerators for popcount algorithms. Similar problems arise in numerous bioinformatics applications such as [5–12]. For instance, in [9], a Hamming distance filter for oligonucleotide probe candidate generation is built to select candidates below a given threshold. The Hamming distance $d(a, b)$ between two vectors $a$ and $b$ is the number of positions in which they differ. Since $d(a, b) = w(a \oplus b)$, where $w$ denotes the Hamming weight and $\oplus$ the bitwise XOR, the distance can easily be found.
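The relation between Hamming distance and popcount can be sketched in a few lines. The example below assumes that vectors are held as Python integers; `filter_candidates` is a hypothetical helper that only mimics the idea of a threshold filter and is not the filter from [9].

```python
def popcount(x: int) -> int:
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ: popcount of a XOR b."""
    return popcount(a ^ b)


def filter_candidates(reference: int, candidates: list[int], threshold: int) -> list[int]:
    """Keep the candidates whose distance to the reference is below the threshold."""
    return [c for c in candidates if hamming_distance(reference, c) < threshold]


if __name__ == "__main__":
    print(hamming_distance(0b1011, 0b0010))
    print(filter_candidates(0b1111, [0b1111, 0b1110, 0b0000], threshold=2))
```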

Similarity search is widely used in chemical informatics to predict and optimize the properties of existing compounds [13, 14]. A fundamental problem is to find all the molecules whose fingerprints have a Tanimoto similarity no less than a given value. It is shown in [15] that this problem can be transformed into a Hamming distance query. Many processing cores have the relevant instructions; for instance, POPCNT (population count) [16] and VCNT (Vector Count Set Bits) [17] are available for Intel and ARM Cortex microchips, respectively. Operations like POPCNT and VCNT are needed in numerous applications and can be applied to very large sets of data (see, e.g., [14]).
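As an illustration of why popcounts dominate such queries, the sketch below computes the Tanimoto similarity of two binary fingerprints in two equivalent ways: directly from the popcounts of the bitwise AND and OR, and from the Hamming weights and the Hamming distance. It assumes fingerprints stored as Python integers and the common definition of Tanimoto similarity for bit vectors (shared ones divided by ones in either vector). It is not the transformation from [15], and the value returned for two empty fingerprints is a convention chosen here.

```python
def popcount(x: int) -> int:
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: ones shared by a and b over ones present in either."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # assumed convention for two all-zero fingerprints
    return popcount(a & b) / union


def tanimoto_from_hamming(a: int, b: int) -> float:
    """Same value expressed through Hamming weights and the Hamming distance.

    With wa = w(a), wb = w(b) and d = w(a XOR b):
        w(a AND b) = (wa + wb - d) / 2,   w(a OR b) = (wa + wb + d) / 2.
    """
    wa, wb, d = popcount(a), popcount(b), popcount(a ^ b)
    if wa + wb + d == 0:
        return 1.0  # same assumed convention as above
    return (wa + wb - d) / (wa + wb + d)


if __name__ == "__main__":
    print(tanimoto(0b1101, 0b0111), tanimoto_from_hamming(0b1101, 0b0111))
```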

Popcount computations are needed in many other areas as well. For example, to recognize identical web pages, Google uses SimHash to get a 64-dimension vector for each web page. Two web pages are considered near-duplicates if their vectors are within Hamming distance 3 [18, 19]. Examples of other applications are digital filtering [20], matrix analyzers [21], piecewise multivariate functions [22], pattern matching/recognition [23, 24], cryptography (finding the matching records) [25], and many others.

The paper shows that popcount computations can be done in FPGAs significantly faster than in software. The following new contributions are provided:

(i) Highly parallel methods in FPGA-based systems which are faster than existing alternatives.

(ii) A hardware/software codesign technique implemented and tested in recent all programmable systems-on-chip from the Xilinx Zynq-7000 family.

(iii) Data exchange between software and hardware modules through high-performance interfaces in such a way that the implemented burst mode enables run-time popcount computations to be combined with data transfer, avoiding any additional delay.

(iv) The results of experiments and comparisons demonstrating an increase in throughput compared with the best known hardware and software alternatives.
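The idea behind contribution (iii) can be pictured with a purely software analogy: the running popcount is updated as each burst of data arrives, so no separate pass over the complete data set is needed after the transfer. The sketch below is only a model under that assumption. The name `streaming_popcount` and the representation of bursts as `bytes` objects are hypothetical and say nothing about the actual interfaces used in the designs.

```python
from typing import Iterable


def streaming_popcount(bursts: Iterable[bytes]) -> int:
    """Accumulate the popcount burst by burst, as the data arrives."""
    total = 0
    for burst in bursts:
        # Interpret the burst as one long binary vector and count its ones.
        total += bin(int.from_bytes(burst, "little")).count("1")
    return total


if __name__ == "__main__":
    incoming = (chunk for chunk in [b"\xff\x00", b"\x0f"])
    print(streaming_popcount(incoming))
```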

The remainder of the paper contains six sections. Section 2 presents a brief overview of related work and analyzes highly parallel circuits for popcount computations. Section 3 suggests system architectures for the two proposed design techniques. Particular solutions with experiments and comparisons are presented in Sections 4 and 5: Section 4 is dedicated to FPGA-based designs and Section 5 to designs based on all programmable systems-on-chip from the Xilinx Zynq-7000 family. Section 6 discusses the results. Section 7 concludes the paper.

#### 2. Related Work

State-of-the-art hardware implementations of popcount computations have been exhaustively analyzed in [26–30]. The results were presented in the form of charts in [26, 28, 30] that compare the cost and the latency of four selected methods. The basic ideas of these methods are summarized below:

(1) Parallel counters from [26] are tree-based circuits that are built from full adders.

(2) The designs from [27] are based on sorting networks, which have known limitations; in particular, the occupied resources increase considerably as the number of source data items grows.

(3) Counting networks [28] eliminate the propagation delays in carry chains that appear in [26] and give very good results, especially for pipelined implementations. However, they occupy many general-purpose logic slices, which are also heavily used by the majority of practical applications that frequently run in parallel with popcount computations.

(4) The designs in [30] are based on digital signal processing (DSP) slices embedded in FPGAs and either use a very small number of logic slices or do not use them at all.
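To make the first of these ideas concrete, the following is a small behavioural model of a counter assembled from full adders. It is a software sketch only and does not reproduce the circuits of [26]. It assumes the usual view of a full adder as a block that takes three bits of equal weight and produces a sum bit of the same weight and a carry bit of double weight. The names `full_adder` and `parallel_counter` are hypothetical.

```python
from collections import deque


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) for three input bits of equal weight."""
    total = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return total, carry


def parallel_counter(bits: list[int]) -> int:
    """Count the ones in `bits` using only full adders.

    Bits are grouped in columns by weight (column k holds bits worth 2**k).
    Each column is reduced with full adders until one bit remains, which
    becomes bit k of the result. Carries move to column k + 1.
    """
    columns = {0: deque(bits)}
    result = 0
    weight = 0
    while weight in columns:
        column = columns[weight]
        while len(column) > 1:
            a = column.popleft()
            b = column.popleft()
            c = column.popleft() if column else 0  # tie the unused input to 0
            s, carry = full_adder(a, b, c)
            column.append(s)
            columns.setdefault(weight + 1, deque()).append(carry)
        if column and column[0]:
            result |= 1 << weight
        weight += 1
    return result


if __name__ == "__main__":
    print(parallel_counter([1, 0, 1, 1, 0, 1, 1]))
```

The model is sequential, whereas in a circuit all full adders of one level operate at the same time.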

Different software implementations in general-purpose computers and application-specific processors are also very broadly discussed [14, 29, 31]. A number of benchmarks are given in [14], which will be used later for comparisons. Because hardware circuits permit a high degree of parallelism, they are faster, as we demonstrate in Sections 4 and 5. Besides, popcount computations for long vectors, which are required in a number of applications, involve multiple data exchanges with memory. These can be avoided in FPGA-based solutions, where the implemented circuits can easily be customized for any vector size.

We suggest here novel designs for popcount computations that give better performance than the best known alternatives. All the results will be thoroughly evaluated and compared with existing solutions on available benchmarks (such as [14]).

FPGAs operate at a lower clock frequency than nonconfigurable application-specific integrated circuits, so broad parallelism is evidently required to compete with potential alternatives. We therefore use circuits that process as many bits of a given binary vector in parallel as possible.

One feasible approach is based on the frequently researched sorting networks [27, 31]. However, they are very resource consuming [32]. In [28] a similar technique was used for parallel vector processing with noncomparison operations. The proposed circuits are targeted mainly towards various counting operations and are called counting networks. In contrast to competing designs based on parallel counters [26], counting networks do not involve the carry propagation chain needed for the adders in [26]. Thus, the delays are reduced, as is clearly shown in [28]. The networks in [28] are easily parameterizable and scalable, allowing thousands of bits to be processed in combinational circuits. Besides, a pipeline can easily be created.

A competitive circuit can be built directly from FPGA look-up tables (LUTs) using the methods of [33]. A LUT with $n$ inputs and $m$ outputs can be configured to implement arbitrary Boolean functions of $n$ variables. In recent FPGAs (e.g., the Xilinx 7 series and the Altera Stratix V family), $n$ is most often 6 and $m$ is either 1 or 2. If we consider the FPGA generations of the last decade, we can see that these values ($n$, in particular) have periodically increased. Clearly, $\lceil \log_2(n+1) \rceil$ single-output LUTs with $n$ inputs can be configured to calculate the popcount of an $n$-bit vector, one LUT per bit of the result. It is important to note that the delay is very small (e.g., in Xilinx 7 series FPGAs it is less than 1 ns). The idea is to build a network of LUTs that can calculate the popcount of a vector of arbitrary size. For filtering problems that appear, in particular, in genetic data analysis, this weight is compared with either a fixed threshold or the popcount of another binary vector, which is found in the same way.
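The LUT-based idea can be modelled in software as follows. This is a sketch under stated assumptions rather than the network of [33]: a block of six-input LUTs is represented by a 64-entry table, the input vector is a Python integer, and the partial counts are combined by ordinary pairwise additions, whereas a hardware network could also implement the later stages in LUTs. The names `POPCOUNT_LUT` and `lut_popcount` are hypothetical.

```python
LUT_INPUTS = 6  # six-input LUTs, the typical size mentioned in the text

# Table with one entry per input combination; the entry is the popcount (0..6).
# The result needs 3 bits, so in hardware each bit would be produced by a
# separate single-output six-input LUT.
POPCOUNT_LUT = [bin(i).count("1") for i in range(1 << LUT_INPUTS)]


def lut_popcount(vector: int, size: int) -> int:
    """Popcount of the `size` low-order bits of `vector`, found chunk by chunk."""
    vector &= (1 << size) - 1  # ignore any bits beyond the declared size
    mask = (1 << LUT_INPUTS) - 1

    # First level: every 6-bit chunk is looked up independently. In hardware
    # all these look-ups would happen at the same time.
    partial = [POPCOUNT_LUT[(vector >> shift) & mask]
               for shift in range(0, size, LUT_INPUTS)]

    # Following levels: combine the partial counts pairwise, as in a tree.
    while len(partial) > 1:
        partial = [sum(partial[i:i + 2]) for i in range(0, len(partial), 2)]
    return partial[0] if partial else 0


if __name__ == "__main__":
    print(lut_popcount(0b101101101, size=9))
```

For filtering, the returned weight would then be compared with a fixed threshold or with the weight of a second vector obtained from the same function.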

The experiments in [28, 30, 33] show that counting networks and LUT- and DSP-based circuits are the fastest of the alternative methods, and we will base popcount computations on a combination of them.

#### 3. System Architectures

The data to be processed are kept in memories with a capacity of up to tens of GB [4, 5]. Thus, very large volumes of data have to be transmitted to the counter (which computes popcounts), and this process involves communication time that can exceed the processing time. We suggest the following two design techniques, targeted to FPGAs and to all programmable systems-on-chip (APSoCs) [34]:

(1) FPGA-based accelerators for general-purpose computers with the architecture shown in Figure 1. The complexity of recent FPGAs permits the complete system (or a large subsystem of it) to be implemented entirely in hardware, with accelerators (like those computing popcounts) as system components.

(2) An APSoC responsible for solving a relatively independent problem and potentially interacting with a general-purpose computer, as shown in Figure 2.