Selected Papers from the 17th Reconfigurable Architectures Workshop (RAW2010)
Nikolaos Alachiotis, Alexandros Stamatakis, "A Vector-Like Reconfigurable Floating-Point Unit for the Logarithm", International Journal of Reconfigurable Computing, vol. 2011, Article ID 341510, 12 pages, 2011. https://doi.org/10.1155/2011/341510
A Vector-Like Reconfigurable Floating-Point Unit for the Logarithm
Abstract
The use of reconfigurable computing for accelerating floating-point intensive codes is becoming common due to the availability of DSPs in new-generation FPGAs. We present the design of an efficient, pipelined floating-point datapath for calculating the logarithm function on reconfigurable devices. We integrate the datapath into a stand-alone LUT-based (lookup table) component, the LAU (Logarithm Approximation Unit). We extended the LAU by integrating two architecturally independent, LAU-based datapaths into a larger component, the VLAU (vector-like LAU). The VLAU produces 2 results/cycle, while occupying the same amount of memory as the LAU. Under single precision, one LAU is 12 and 1.7 times faster than the GNU and Intel Math Kernel Library (MKL) implementations, respectively. The LAU is also 1.6 times faster than the FloPoCo reconfigurable logarithm architecture. Under double precision, one LAU is 20 and 2.6 times faster than the respective GNU and MKL functions and 1.4 times faster than the FloPoCo logarithm. The VLAU is approximately twice as fast as the LAU, under both single and double precision.
1. Introduction
The use of FPGAs as accelerators for compute-intensive codes is driven by their potential for implementing deeply pipelined architectures and for executing hundreds of operations in parallel. As the devices become larger, new fabrics, in particular DSPs, allow a wider range of applications, in particular floating-point intensive codes, to be efficiently executed and accelerated on FPGAs.
A large number of scientific applications rely on the frequent and efficient computation of the logarithm function. For instance, multimedia codes need to estimate log-likelihood scores for Gaussian mixture models [1], and bioinformatics programs for evolutionary reconstruction under the maximum likelihood model [2] need to compute log-likelihood scores of evolutionary trees. The logarithm is also commonly used to avoid numerical underflow (especially in statistics) by replacing multiplications with additions.
Many of the applications that rely on the logarithm are either highly compute-intensive, such as the phylogenetic likelihood function, which represents an important computational kernel in computational biology [3, 4], or exhibit real-time constraints, such as real-time image processing [5] or skin segmentation algorithms [6]. Irrespective of the specific type of application, the deployment of reconfigurable logic (FPGAs) is a common technique to speed up applications, prototype hardware designs, or meet the real-time requirements of time-critical applications.
When an FPGA is used for accelerating floating-point intensive applications, a thorough exploration of the performance and precision tuning parameter space for the required arithmetic operators can lead to significant performance improvements. In fact, implementations of simple floating-point operators such as adders or multipliers may require fairly complex reconfigurable architectures. Furthermore, the amount of hardware resources used by a floating-point operator generally increases with precision/accuracy requirements. However, the accuracy requirements of a specific operator may depend on the application at hand.
In previous work on calculating the logarithm function using reconfigurable logic [7], we focused on the design of a pipelined Logarithm Approximation Unit (LAU). We demonstrated that the LAU is sufficiently accurate for computing the phylogenetic maximum likelihood (ML) function on a reconfigurable coprocessor for RAxML [8]. RAxML is a widely used bioinformatics code for reconstructing evolutionary trees (evolutionary histories, or simply phylogenies) from DNA or protein data under the ML criterion. The LAU architecture utilizes lookup tables (LUTs) for calculating the logarithm and can be conveniently adjusted to provide the desired application-specific accuracy. The LAU is based on the ICSILog approximation method (Vinyals and Friedland [9]), which is available as open-source code. Unless stated otherwise, we use the term LUT in this paper to refer to the lookup table required by the ICSILog approximation method rather than to the low-level hardware LUTs on the FPGA device.
As already mentioned, several computational units must be placed on the FPGA and operate in parallel to efficiently exploit the available computational resources. Thus, the resource requirements of a component/unit (e.g., a simple floating-point operator or a more complex arithmetic function) need to be minimized to allow for placing several instances on the chip that then operate in parallel. The input/output (I/O) requirements can be accommodated by parallel I/O ports, for instance, by organizing the embedded memory blocks of a device into several smaller parallel blocks that provide sufficient throughput with respect to the arrangement of the parallel components. In Figure 1, we provide a representative example of a potential arrangement of logarithm components and the respective parallel execution of the logarithm function. The block diagram depicts a fine-grained parallelization of log-likelihood score computations for a given evolutionary tree topology under a likelihood-based model (used, e.g., in maximum likelihood or Bayesian phylogeny programs).
In the current paper, we present a significantly extended and optimized vector-like LAU implementation. The vector-like Logarithm Approximation Unit (VLAU) can calculate two logarithms within the same clock cycle. Using the VLAU is more resource-efficient than instantiating two independent LAUs in parallel. The underlying idea of the VLAU consists of exploiting the dual-port configuration option of embedded memory blocks. This option allows for sharing LUTs between two otherwise completely independent LAU-based pipelines. Furthermore, a detailed analysis of resource requirements and performance impact with respect to the latency of the LAU has been conducted for the single and double precision versions. We also extended the C implementation of the ICSILog algorithm (International Computer Science Institute) to support double precision (DP) arithmetic.
Throughout the paper, we denote IEEE 754 single precision arithmetic as SP and IEEE 754 double precision arithmetic as DP. We denote the single precision software implementation of ICSILog (version 0.6 beta) as SPICSILog and our DP software implementation as DPICSILog. By SPLAU, DPLAU, SPVLAU, and DPVLAU we denote the SP and DP FPGA implementations of the LAUs and VLAUs, respectively.
The DPICSILog C code as well as the hardware descriptions of the LAUs and VLAUs (including all available latency variants) are available as open-source code for download at http://wwwkramer.in.tum.de/exelixis/logFPGA.tar.bz2. The default hardware configuration, which supports both Virtex 4 and Virtex 5 FPGAs, uses an LUT with 4,096 entries. We also provide several COE files for different LUT sizes, such that the LAU/VLAU can be conveniently reconfigured and adapted to the precision required by the respective target application.
The remainder of this paper is organized as follows. Section 2 describes the underlying ideas of the ICSILog algorithm. In Section 3, we review related work on logarithmic units for FPGAs. The LAU architecture is described in Section 4, and the VLAU architecture is introduced in Section 5. In Section 6, we present speed and accuracy measurements for LAUs and VLAUs with an LUT size of 4,096 entries and provide a detailed evaluation of LAU implementations with latencies ranging between 5 and 22 clock cycles. We also analyze the performance and resource utilization of the FPLog implementations and assess the numerical stability of RAxML in software using DPICSILog. We conclude in Section 7.
2. The ICSILog Algorithm
The underlying idea of the ICSILog algorithm consists of increasing the speed of the logarithm computation by using an LUT that resides entirely in the CPU cache. The algorithm exploits the floating-point number representation of the IEEE 754 standard. An IEEE floating-point number consists of three fields: the sign (sgn), the exponent (exp), and the mantissa (man). The decimal floating-point value of a number (num) is represented by the sign, followed by the product of the mantissa and the factor 2^{exp}:

num = (−1)^{sgn} × man × 2^{exp}.

In order to calculate the logarithm of num, one can use the multiplicative property of the logarithmic function and decompose the computation as follows:

log(num) = log(man × 2^{exp}) = exp × log(2) + log(man).
Since the real-valued logarithm is only defined for positive numbers, the sign bit can be discarded. The factor by which exp is multiplied is a constant and only depends on the base of the logarithm; one may use log_{e}(2), log_{2}(2), or log_{10}(2), for instance. Thus, the calculation of the logarithm for an arbitrary base x only requires the constant log_{x}(2) and an appropriately initialized full-size LUT (comprising all log_{x}(man) values) for the base x.
The calculation of the first part of the sum requires the floating-point representation of the decimal value of the exponent field. One can use the Xilinx floating-point operator (FPO) [10] to obtain this value. However, we use a faster LUT-based method (with a separate LUT that is exclusively used for this conversion) to obtain the floating-point value, which is described in Section 4. In Section 6, we provide a performance comparison between the floating-point operator provided by Xilinx and our approach. Once the floating-point value of the exponent is available, the first operand of the final addition is calculated by conducting the multiplication with the constant floating-point value.
The calculation of the second part of the sum, that is, the logarithm of the mantissa, requires the use of an LUT. A naïve LUT would thus need to contain all precomputed values of log(man), which requires 32 MB of memory for the SP number range. Vinyals and Friedland found that the usage of a 32 MB full-size LUT only yields insignificant performance improvements with respect to the GNU implementation [9]. To improve performance and reduce LUT size, they deploy a quantized mantissa, such that the LUT entirely fits into cache memory. In Figure 2, we provide a schematic outline of the Vinyals and Friedland algorithm at the bit level.
The mantissa LUT is indexed using the most significant bits of the 23-bit mantissa under SP and of the 52-bit mantissa under DP, respectively. A quantization parameter specifies the number of least significant bits of the mantissa that are ignored by the quantization process; this parameter can thus be used to appropriately adapt the accuracy and LUT size to the specific requirements of an application. A detailed study of the accuracy loss that is induced by quantizing the mantissa can be found in [9]. The trade-off between accuracy and embedded memory hardware resources will be discussed in Section 6.
3. Related Work
A thorough bibliographical search revealed that alternative implementations of fast logarithm algorithms mostly represent special-purpose solutions that are tailored to a specific application or hardware platform; that is, there is a lack of a generally applicable solution.
Dedicated software implementations that entail approximation algorithms for the logarithmic function have been developed for accelerating multimedia applications [9, 11]. In 2001, de Soras proposed and made available an algorithm called fast log [11]. This algorithm computes a third-order Taylor series approximation of the logarithm for any given IEEE 754 floating-point number. The algorithm is fast but lacks accuracy in certain number ranges [9]. The LUT-based approach of ICSILog, which we implemented in reconfigurable logic in the LAU and VLAU components, is as fast as fast log but provides better accuracy [9].
De Dinechin et al. have developed FloPoCo (floating-point cores), an open-source arithmetic core generator for FPGAs [12]. The logarithmic unit generated by FloPoCo (FPLog) supports SP (SPFPLog), DP (DPFPLog), and user-defined number formats. The FPLog units can be configured to yield exactly the same results as the respective GNU functions; hence, accuracy comparisons between our LAU and FPLog are identical to comparisons between the LAU and the GNU library. The algorithms and implementation techniques that are deployed in FloPoCo for generating the FPLog unit are described in [13, 14].
Section 6 includes a direct comparison between the LAU and FPLog units (using the most recent version 2.0.0 of FloPoCo) in terms of speed and resource utilization on a Virtex 5 FPGA. Note that the FPLog input format slightly differs from the IEEE 754 standard: two additional bits in every input number indicate whether the input should be treated as a special number (zero, NaN, ±inf) or as a normal number. Thus, in order to integrate the FPLog component into a design that complies with the IEEE 754 standard, this dedicated input number format requires additional logic (which can also be separately generated by FloPoCo) to appropriately set these bits. Furthermore, a common FPGA design paradigm is event-driven architectures. Unfortunately, the FPLog interface does not provide any ports for validity signals, that is, signals that indicate whether the current values at the input and output ports are valid. Consequently, FPLog is harder to integrate with event-driven architectures.
National Instruments [15] has designed a high-throughput natural logarithm function for FPGAs. The specific design is only available commercially, and only a limited amount of information is provided regarding unit performance. The implementation only supports fixed-point arithmetic, and the input arguments must be unsigned and lie within a restricted input range; for numbers outside this range, the unit generates undefined results. The interface of the logarithm component provides all necessary validity signal ports for easy integration with and use in event-driven environments. The CORDIC algorithm (COordinate Rotation DIgital Computer [16]) is deployed for the specific implementation, and the user can set the desired accuracy level by defining the number of iterations of the CORDIC algorithm. Because this logarithm implementation is not available (not even for a short evaluation period), we were not able to conduct a performance comparison with the LAU/VLAU architectures.
Tropea [17] has also presented an area-optimized FPGA implementation to compute the base-N logarithm function. An important aspect of this implementation is that it can be mapped to FPGAs from any vendor. Performance results for Xilinx [18] and Actel [19] FPGAs are provided in [17]. The error analysis in [17] reveals that highly accurate results can be obtained while using only a small fraction of the overall hardware resources. The unit utilizes the multiplicative normalization method [20] to calculate the logarithm, and several configurations with various precision levels are evaluated.
Recently, Chrysos et al. [21] presented a general reconfigurable architecture for a bioinformatics algorithm that uses Interpolated Markov Models (IMMs) for gene finding, known as the Glimmer algorithm. The Glimmer algorithm also requires logarithm calculations; the respective hardware architecture contains 6 logarithm unit instances that operate in parallel. The design of the logarithm component in the Glimmer architecture [21] deploys a similar strategy as the LAU.
In 2008, Raygoza-Panduro et al. [22] presented an automatically generated mathematical unit. The hardware description is automatically generated by a Java program and can be synthesized. The framework supports a wide range of complex arithmetic operations. The mathematical unit was used to implement a sliding-mode controller for a magnetic levitation system. The system provides operators for the natural logarithm and the base-10 logarithm. The automatically generated mathematical unit was mapped to a Virtex II FPGA; the logarithm functions (natural and base-10) occupy only 1% of the available FPGA slices and 1% of the available LUTs on the device. The resource efficiency of the unit appears to be mainly induced by the usage of bit-width-reduced floating-point arithmetic; that is, only 3 bits are used for the exponent field and only 14 bits for the mantissa field, respectively.
4. The LAU Architecture
In the following, we describe the design of a reconfigurable architecture for the ICSILog algorithm. In Figure 3, we provide the block diagram of the top-level unit.
The leftmost module is the special_case_detector. As the name suggests, this module assesses whether the LAU input is valid or not. Special cases are negative numbers, NaN, −inf, and +inf, as defined by the IEEE standard. Since the logarithm is not defined for negative numbers, the result is NaN. For NaN and −inf inputs, the result is defined as NaN as well. For a +inf input, the unit returns +inf. The module consists of comparators, logic gates, and pipeline registers that detect the special-case inputs and produce the corresponding output. The module also outputs a selection signal for the final 2-to-1 multiplexer (bottom left in Figure 3) that is connected to the output port of the LAU.
To the right of the special_case_detector in Figure 3, we have integrated a group of modules that operate on the exponent bits of the input. These modules compute the first operand of the addition that returns the approximation of the logarithm.
Initially, the decimal value of the exponent field needs to be transformed into a floating-point number. The straightforward approach to implement this operation is to use the Xilinx FPO [10] (fixed-to-float) operator. However, we deploy an LUT-based approach to carry out this transformation more efficiently. The exp_LUT lookup table in Figure 3 is used for this purpose. Note that this LUT is a special component of our hardware implementation and should not be confused with the mantissa LUT of the ICSILog algorithm (man_LUT). Details on the performance and resource trade-offs between our approach and the alternative design using the Xilinx FPO are provided in Section 6.4.
Internally, all operations are conducted under SP. For the SPLAU, the exp_LUT contains 128 entries (2^{8}/2), while for the DPLAU, there are 1,024 entries (2^{11}/2), where 8 and 11 are the number of bits that represent the exponent field of an SP and a DP value, respectively. The reason why the size of the exp_LUT can be reduced by 50% is explained in the next paragraph. Each entry of the exp_LUT contains a total of 9 bits in the SPLAU and 13 bits in the DPLAU. The first 3 bits under SP and the first 4 bits under DP are the least significant bits of the exponent field of the floating-point number representation we intend to construct. The remaining 6 (SP) and 9 bits (DP) are the most significant bits of the mantissa. The remaining bits of the exponent field are always set to 10000 for SP and 1000 for DP. Note that, at this point, an SP value is being constructed for the DPLAU as well. The remaining bits of the mantissa are all set to zero.
One can observe that there is a correspondence between the decimal values of the exponent field and the exponents themselves. For DP, while the decimal value d ranges from 0 to 2,047, the exponent ranges from −1,023 to +1,024. This correspondence can be used to reduce the size of exp_LUT by 50%, by only storing the bits required to represent floating-point numbers in the range 0–1,023. To support the full range (0–2,047), we use additional logic. More specifically, the 11-bit exponent value d is transformed into a 10-bit index for exp_LUT by subtracting d from 2,046. For example, an 11-bit value d in which the most significant bit is set would index a lookup table entry >1,023. Hence, 2,046 − d provides the distance from the last entry of a lookup table with 1,024 entries and thus yields the correct 10-bit index for an exp_LUT with half the size. The most significant bit of the exponent field (discarded from the index) becomes the sign of the newly constructed floating-point value. After this transformation, the resulting floating-point number becomes the first operand of the multiplication; the second operand is the constant log_{x}(2). The overall result produced by this part of the architecture is the first operand of the final addition: exp × log_{x}(2).
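The index transformation can be illustrated with a small C sketch. The names are ours, and the sketch rests on one assumption that is consistent with the subtraction described above but not spelled out in the text: entry i of the halved table corresponds to the exponent magnitude 1,023 − i.

```c
#include <stdint.h>

/* Fold the biased 11-bit DP exponent value d (0..2,046) into a 10-bit
 * index: d itself when the MSB is clear, 2,046 - d when it is set. */
static uint16_t fold_index(uint16_t d)
{
    return (d & 0x400) ? (uint16_t)(2046 - d) : d;
}

/* Magnitude of the unbiased exponent, |d - 1,023|, assuming entry i
 * of the halved table holds the value 1,023 - i (our assumption). */
static int exponent_magnitude(uint16_t d)
{
    return 1023 - (int)fold_index(d);
}

/* The discarded MSB supplies the sign: MSB clear -> exponent below bias. */
static int exponent_is_negative(uint16_t d)
{
    return (d & 0x400) == 0 && d != 1023;
}
```

Both halves of the biased range thus resolve to the same 1,024-entry table, with the sign recovered separately from the discarded MSB.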
The man_LUT module in Figure 3 is the standard quantized LUT of the ICSILog algorithm and contains precomputed logarithm values. We used ICSILog to generate the contents of man_LUT. As previously described, the most significant bits of the mantissa are used for indexing the man_LUT. Each entry of the table (for SP and DP values) consists of an SP floating-point number. As outlined in Section 6, one can increase the accuracy of the LAU by increasing the size of man_LUT. For example, with a man_LUT of size 4,096, only the 12 most significant bits of the mantissa field of the input value are used for indexing. Both lookup tables (exp_LUT and man_LUT) are enhanced by a construct_sp_fp_value unit. These units consist of logic gates, registers, and multiplexers, which are used to construct the correct floating-point representations from the respective LUT entries. Finally, the sum of the two values generated by exp_LUT, man_LUT, and the respective construct_sp_fp_value units returns an approximation of the logarithm that is identical to the ICSILog software.
As already mentioned, all operations are conducted under SP. Thus, for the SPLAU, the result is simply the output of the final adder. For DP, the result is transformed into DP by appropriately adapting the bit indices of the SP representation: the least significant bits of the mantissa are set to zero, and a bit extension of the most significant bits of the exponent is conducted while maintaining its sign.
The usage of SP arithmetic, even for the DPLAU, does not affect the precision of the output because of the approximation strategy that is being used. DP accuracy would only be affected if a man_LUT with more than 2^{23} entries were used (23 is the number of bits in the mantissa field of SP numbers in the IEEE standard). In this case, the mantissa LUT would require 32 MB of memory; currently, there is no FPGA available with such a large amount of embedded memory. Clearly, the savings in terms of FPGA resources (embedded memory and DSP slices) obtained by internally using SP in our LAU design are significant. Note that, in our DPICSILog software implementation, we transformed the entire algorithm to DP, because the SP algorithm with a type cast from float to double in C was slower than a direct implementation under DP.
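The SP-to-DP widening of the result can be sketched at the bit level. The function name is ours; the sketch covers normalized numbers only (zero, denormals, inf, and NaN are excluded and handled separately in the LAU by the special-case logic):

```c
#include <stdint.h>
#include <string.h>

/* Widen an SP value to DP by bit manipulation: keep the sign, rebias
 * the 8-bit exponent from 127 to 1,023, and pad the 23-bit mantissa
 * with 29 zero LSBs. Normalized inputs only. */
static double widen_sp_to_dp(float f)
{
    uint32_t s;
    memcpy(&s, &f, sizeof s);

    uint64_t sign = (uint64_t)(s >> 31) << 63;
    uint64_t exp  = ((uint64_t)((s >> 23) & 0xFFu) - 127 + 1023) << 52;
    uint64_t man  = (uint64_t)(s & 0x7FFFFFu) << 29;

    uint64_t bits = sign | exp | man;
    double out;
    memcpy(&out, &bits, sizeof out);
    return out;
}
```

For normalized inputs this produces exactly the same value as the C cast `(double)f`, which is what makes the purely bit-level widening in hardware lossless.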
5. The VLAU Architecture
An additional optimization can be applied to the LAU architecture (Section 4) when several parallel LAUs shall be placed on an FPGA device. This optimization is based on a special feature of embedded memory blocks in new-generation FPGAs, which can be configured as so-called dual-port memories.
Each memory block provides two fully independent ports that yield access to a shared memory space. An appropriate reconfiguration of the LAU lookup tables (exp_LUT, man_LUT) for usage as dual-port ROMs (Read-Only Memories) allows two independent LAUs to use the same memory blocks for lookups.
Figure 4 depicts the optimized VLAU (vector-like LAU) architecture. The shared memory area in the middle of Figure 4 (denoted as shared LUTs) contains the exp_LUT and man_LUT lookup tables. The two LAU-based pipelines are located to the left and the right of the shared LUTs in Figure 4. These two pipelines are exact copies of the LAU architecture (Section 4), but the LUTs have been moved to a shared memory area. The individual LAU pipelines are architecturally completely independent from each other, since they only share a read-only memory area. Each LAU pipeline only accesses one of the two ports of the shared LUTs.
The VLAU architecture is well suited for vector processing, since it can accommodate the computation of two logarithms in one cycle. Because the same pipeline design is used for the LAU and the VLAU, a two-pipeline VLAU is as fast as two independent LAUs. The main advantage of the VLAU over two independent LAUs is that the VLAU only requires 50% of the memory blocks.
The FPGA-based coprocessor for gene finding by Chrysos et al. [21] represents an example of an architecture that could potentially benefit from the memory-efficient VLAU implementation. The Glimmer architecture is memory-intensive, and the attained level of parallelism was limited by the number of available embedded memory blocks in the device (personal communication with G. Chrysos; June 14th, 2010). The deployment of VLAUs can thus help reduce the number of memory blocks required for computing the logarithm and thereby increase the degree of parallelism in the Glimmer architecture.
6. Experimental Results
Initially, we verified the functionality of the LAU/VLAU architectures (Section 6.1). In the following two sections, we investigate the behavior of RAxML [8] using DPICSILog (Section 6.2) and assess the accuracy of the implementation (Section 6.3). Thereafter, Section 6.4 provides a detailed resource usage and performance evaluation for LAUs with various latencies and for the VLAU architecture. We also compare performance and resource utilization with the FloPoCo logarithm [12]. A thorough run-time comparison between the LAU/VLAU architectures and the respective software implementations (GNU and MKL [23]) is presented in Section 6.5. Note that all results in Section 6 refer to Xilinx reports as obtained after the implementation process (post-place-and-route).
6.1. Verification
In order to verify the correctness of the proposed architectures, we conducted extensive post-place-and-route simulations as well as tests on an actual FPGA. As simulation tool, we used ModelSim 6.3f by Mentor Graphics. For hardware verification, we used the HTG-V5-PCIE development platform equipped with a Xilinx Virtex 5 SX95T-1 FPGA.
Initially, the advanced verification tool ChipScope Pro Analyzer was used to monitor the output ports of the SP/DPLAUs and SP/DPVLAUs, and the expected signals for given input numbers were tracked. Thereafter, an experimental PC-FPGA platform was set up. We use Gigabit Ethernet for communication between the PC and the FPGA board, based on the optimized unit for direct UDP/IP-based PC-FPGA communication that we recently made available [24]. A Java test application was used on the PC side to generate random SP and DP input values (using the standard java.util.Random class), organize the numbers into bytes, and transmit them to the FPGA. On the FPGA side, the floating-point representations were reconstructed from the incoming bytes and forwarded to the LAU/VLAU components. The logarithms of the inputs were then sent from the FPGA back to the Java test application on the PC and printed to screen.
6.2. DPICSILog in a RealWorld Application
We integrated DPICSILog into RAxML [8], a widely used tool for inferring phylogenies (evolutionary trees) from molecular data that has been developed in our group. The vast majority of logarithm invocations is conducted when the log-likelihood scores of alternative tree topologies are computed. We found that an LUT with 4,096 entries is sufficient to guarantee numerical stability of RAxML and to yield accurate results (see below). Table 1 indicates the respective log-likelihood scores for tree searches using the GNU and DPICSILog implementations on DNA datasets with 40, 90, 150, and 218 organisms (sequences) as well as a protein dataset with 140 organisms. Based on standard statistical significance tests for comparing log-likelihood scores of phylogenetic trees, as implemented in the CONSEL tool suite [25], we found that the score differences among the respective trees are not statistically significant. In other words, the trees computed (under the same starting conditions) using the GNU implementation and DPICSILog (LUT size: 4,096) cannot be statistically distinguished from each other. Hence, DPICSILog with an LUT size of 4,096 provides sufficient application- and domain-specific accuracy for RAxML.

6.3. Accuracy Assessment
Initially, we used the standard C rand() function to generate benchmarks with random numbers in order to measure the average error introduced by the logarithm approximation, as a function of LUT size, with respect to the GNU function. The results are provided in Table 2. We used the ICSILog software to generate the contents of man_LUT, such that our implementation yields exactly the same results as ICSILog. From Table 2, we deduce that an LUT with 4,096 entries represents a good trade-off between accuracy and LUT size for our purposes (developing a hardware architecture for RAxML), since an LUT of this size only requires 3 block RAMs (36 Kb each). For a medium-size new-generation FPGA like the Xilinx Virtex 5 SX95T, 3 block RAMs correspond to only 1% of the total block memory available. As discussed in [9], the size of the LUT doubles for every additional correct bit in the mantissa. Clearly, a specific target application as well as a global view of the entire reconfigurable system that will use the LAU is required to determine the ideal man_LUT size. Since the software implementation is available as open-source code, it is easy to assess the required mantissa LUT size a priori, that is, before modifying the reconfigurable architecture. For instance, the overall RAxML hardware architecture requires a large amount of memory and reconfigurable fabric for other purposes; therefore, we chose to minimize the hardware resources consumed by the logarithmic function to the largest possible extent.

Finally, for a man_LUT with 4,096 entries, we also measured the minimum, maximum, average, and mean squared error between the GNU SP and DP library functions, the respective logarithmic approximation implementations (SP/DPICSILog, DPLAU), and the SP/DPMKL library functions. Table 3 provides these errors for 10^{6} random input numbers ranging from 10^{−20} to 10^{20}.
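The four metrics reported in Table 3 can be computed with a straightforward harness over paired reference/approximation samples; the struct and function names below are ours:

```c
#include <math.h>

/* Minimum, maximum, average, and mean squared absolute error between
 * a reference implementation and an approximation over n samples. */
typedef struct { double min, max, avg, mse; } err_stats_t;

static err_stats_t error_stats(const double *ref, const double *approx, int n)
{
    err_stats_t s = { INFINITY, -INFINITY, 0.0, 0.0 };
    for (int i = 0; i < n; i++) {
        double e = fabs(ref[i] - approx[i]);
        if (e < s.min) s.min = e;
        if (e > s.max) s.max = e;
        s.avg += e;          /* accumulate, normalize below */
        s.mse += e * e;
    }
    s.avg /= n;
    s.mse /= n;
    return s;
}
```

In the setting of Table 3, `ref` would hold GNU library results and `approx` the ICSILog/LAU results for the same 10^{6} random inputs.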

6.4. Performance Assessment versus Hardware
The LAUs and VLAUs were mapped to a Xilinx Virtex 5 SX95T-2 FPGA. In Figure 5, we provide resource usage and performance data for LAU implementations (SP on the left and DP on the right) with different latencies. We tested different latency-specific configuration settings for the generated Xilinx floating-point adders and multipliers. The variation of these settings allowed us to generate LAU implementations with latencies that range between 5 and 22 clock cycles. Note that all measurements in this section refer to LAUs and VLAUs with a man_LUT size of 4,096 entries.
The respective clock frequencies of the LAUs were obtained using the Xilinx tools (ADVANCED 1.53 speed file) and are also provided in Figure 5. The clock frequencies were obtained from the static timing report, and the default Xilinx Balanced optimization strategy was selected. All implementations (SPLAU, DPLAU, SPVLAU, and DPVLAU) are fully pipelined with a throughput of one result per clock cycle per pipeline datapath. Since the LAU comprises a single pipeline datapath, it achieves a throughput of one result per clock cycle, while the VLAU (with two independent pipeline datapaths) can compute two results per clock cycle.
In Figure 6, we provide the clock frequencies of the SPLAU and DPLAU for man_LUT sizes ranging between 512 and 32,768 entries. The frequency reduction with increasing LUT size is due to the additional logic (mostly block RAMs for the LUT) that is required by the LAU. The number of block RAMs required doubles for every bit that is added to the mantissa index of man_LUT. The increase in other reconfigurable resources is significantly lower; for example, an LAU with a 32,768-entry man_LUT occupies 700% more 36 Kb block RAMs than an LAU with a 4,096-entry man_LUT, while requiring only 15% more slices and 9% more slice LUTs.
In Table 4, we compare the hardware resources used by our custom LUT-based module and by the Xilinx FPO [10] (configured in fixed-to-float mode) for transforming the exponent value into a floating-point value. The numbers in parentheses next to the names in the first row of Table 4 denote the latency (number of clock cycles) of the alternative configurations. Since the LUT-based approach has a latency of two cycles, we configured the floating-point operator to have the same latency and integrated it into the LAU. We also added an 11-bit subtractor, such that the LAU produces correct results. The clock frequency of the LAU using the floating-point operator was 60 MHz lower than with our LUT-based approach. The LUT-based module occupies one 18 Kb BRAM, while the FPO solution does not use BRAM memory. For some applications, trading some memory for a substantially higher clock speed is acceptable, since it can yield a higher overall clock frequency and thereby improved overall system performance. When the FPO is configured with the maximum latency of 6 cycles, the resulting LAU is only 5 MHz faster than with our LUT-based approach. However, the total latency of the LAU then increases by 4 cycles, and the FPO requires a larger amount of hardware resources (see Table 4).
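The idea behind the LUT-based exponent conversion can be sketched as follows: instead of a generic fixed-to-float converter, every possible biased exponent is precomputed as a float and read out in a single table access. The sketch uses SP widths; the exact contents of the LAU's table are not given in the text, so this is an assumed but representative construction.

```python
# Sketch of a LUT replacing a fixed-to-float operator: map each of the
# 256 possible biased SP exponents to the SP bit pattern of (e - bias).
import struct

BIAS = 127
EXP2FLOAT = [struct.unpack('>I', struct.pack('>f', float(e - BIAS)))[0]
             for e in range(256)]        # 256-entry, 32-bit-wide table

def exponent_as_float_bits(biased_exp):
    """One 'table read': biased exponent in, SP-encoded float(e - bias) out."""
    return EXP2FLOAT[biased_exp & 0xFF]
```

In hardware this is a single small ROM access per cycle, which is why it reaches a higher clock frequency than a multi-stage conversion circuit at the cost of one BRAM.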

To the best of our knowledge, the only other open-source logarithm implementation for FPGAs is provided by the FloPoCo framework [12]. All FloPoCo operators can be fully parameterized, that is, the user can select the desired precision of the result and define the desired performance parameters. To conduct a fair comparison between the LAU/VLAU and the FPLog units, we used the latest release of FloPoCo (version 2.0.0) and mapped the LAUs/VLAUs and FPLogs to the same FPGA (Virtex-5 SX95T-2). Table 5 provides a performance and resource usage comparison after the implementation process (post-place-and-route). The reduced FPLog implementations (denoted as (red.) in the table) offer the same accuracy as the LAU/VLAU implementations, while the full-precision FPLog implementations (denoted as (full)) yield the same results as the GNU function.

In addition, all available Xilinx optimization strategies were explored to determine the most efficient strategy for each implementation. The available optimization strategies for Virtex-5 devices are Balanced, Area Reduction, Minimum Runtime, Power Optimization, and Timing Performance. For each implementation/architecture, Table 5 only provides the data for the best optimization strategy with respect to clock frequency. The SP-LAU occupies slightly less hardware than the full (high-precision) SP-FPLog, while the DP-LAU requires significantly fewer resources than the full DP-FPLog. When the SP/DP FPLogs are configured to yield the same accuracy as the SP/DP LAUs, the FPLog implementations are more resource efficient but exhibit significantly lower maximum clock frequencies. Thus, reduced-precision FPLog units are more likely to lie on the critical path when embedded into a larger, more complex architecture that needs to calculate logarithms. The VLAUs outperform all other implementations in terms of throughput, since they produce 2 results per cycle. To achieve performance comparable to the VLAU with the FPLog (red.) implementation, two FPLog (red.) instances are required. A single VLAU is therefore more resource efficient than two FPLog (red.) units, since it occupies fewer slices and slice LUTs than the two instances combined, while the consumption of all other resources (slice registers, BRAMs, and DSPs) is the same. Moreover, the single VLAU still outperforms the two FPLog (red.) instances with respect to clock frequency.
As far as the logarithm unit by Chrysos et al. [21] is concerned, two LUTs are deployed, but they are not initialized as efficiently as in our LAU. Consequently, additional operations, that is, a concatenation, a floating-point multiplication, a float-to-fixed conversion, and a fixed-point subtraction, are required to calculate the LUT index. The respective LUT entry is then used to compute the final output, which also represents an approximation of the logarithm function. Since the paper by Chrysos et al. [21] focuses on the overall architecture for Glimmer, only limited information is provided on the implementation and performance of the logarithm unit.
Finally, the configurations presented by Tropea [17] were mapped to a Virtex-4 LX15-12 FPGA. The highest clock frequency reported in [17] is 191 MHz (the architecture has not been made available by the author). Thus, to conduct a fair comparison, we also mapped the LAU to a Virtex-4 LX15-12 FPGA and obtained clock frequencies of 345 MHz for the SP version and 344 MHz for the DP version.
6.5. Performance Assessment versus Software
We also compared LAU and VLAU performance to a wide range of software implementations: the SP/DP GNU logarithms logf()/log(), the SP/DP MKL logarithms vsLn()/vdLn(), and the SP/DP ICSILog algorithms. As hardware platform, we used a Virtex-5 SX95T FPGA (speed grade −2) with a single arithmetic component, that is, only one LAU or one VLAU was instantiated at a time. The software implementations were executed on an Intel Core 2 Duo T9600 processor running at 2.8 GHz with 6 MB of L2 cache. All software (SP/DP-ICSILog) and hardware (SP/DP-LAU) implementations we tested used a mantissa LUT with 4,096 entries.
For the software tests, we used the GNU gcc compiler (version 4.3.2) as well as the Intel icc compiler (version 11.1) in order to fully exploit the capabilities of the Intel CPU. With gcc, we only used −O1 for optimization because, under the more aggressive optimizations (−O2 and −O3), the current SP-ICSILog version yields an average error that is 10^{5} times larger than the error obtained by compiling the code with −O1. Thus, the aggressive gcc compiler optimizations applied under −O2 and −O3 yield numerically unstable code. When icc is used, SP-ICSILog produces the expected average error, which is in the range of 10^{−5}, for all optimization levels (−O1, −O2, −O3). When −O2 or −O3 is used with icc, SP-ICSILog is only 1.09 times faster on average than the GNU math library; with −O1, it is on average 4.5 times faster.
Initially, we used the GNU gcc compiler (version 4.3.2, with −O1) and measured the execution times for 10^{3} up to 10^{8} invocations of the GNU library SP function as well as of SP-ICSILog. Note that we used the most recent version of the SP-ICSILog algorithm, which is faster than the initial release of the ICSILog software. According to the benchmark made available by the authors, the current version is approximately 1.7 times faster than the initial version (when compiled with gcc and −O1). Table 6 shows the execution times for the GNU implementation, SP-ICSILog, the SP-LAU, and the SP-VLAU. The SP-LAU is 12 times faster than the GNU function and 2.2 times faster than SP-ICSILog, while the SP-VLAU is 23 times faster than the GNU function and 4.1 times faster than SP-ICSILog.

As already mentioned, the standard release of ICSILog only provides an SP logarithm function. Furthermore, it does not provide built-in error detection/correction for special-case inputs such as NaN, inf, −inf, or negative numbers, which is critical for applications like RAxML. In order to conduct a fair performance evaluation of the DP-LAU, we therefore reimplemented the ICSILog algorithm to support DP inputs and invalid-input detection. Our new DP version of ICSILog (DP-ICSILog) is only 1.5 times slower than the official SP release by Vinyals and Friedland. DP-ICSILog is also freely available for download together with the LAU/VLAU architectures.
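The special-case handling referred to above can be sketched as a thin wrapper around any approximation routine. This is an illustrative model of the checks a C-library-compatible log() performs, not the actual DP-ICSILog code; the wrapper and its name are hypothetical.

```python
# Sketch: invalid-input detection around a logarithm approximation,
# mirroring the special cases of the C library log() function.
import math

def checked_log(x, approx=math.log):
    """Return log(x) with C-library-style handling of special inputs."""
    if math.isnan(x):
        return float('nan')          # NaN propagates
    if x < 0.0:
        return float('nan')          # log of a negative number (incl. -inf)
    if x == 0.0:
        return float('-inf')         # log(0) -> -inf
    if x == float('inf'):
        return float('inf')          # log(+inf) -> +inf
    return approx(x)                 # normal case: defer to the approximation
```

Performing these checks in software is cheap, but in a pipelined hardware unit the equivalent comparisons must be folded into the datapath so that throughput is not affected.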
For assessing DP performance, we used gcc (−O1) and measured execution times for 10^{3} up to 10^{8} invocations of the GNU, DP-ICSILog, DP-LAU, and DP-VLAU logarithm functions (Table 7). The DP-LAU is 20 times faster than the GNU math library and 3.1 times faster than DP-ICSILog, which in turn is up to 6.5 times faster than the GNU implementation. The DP-VLAU is 40 times faster than the GNU math library implementation and 6 times faster than DP-ICSILog.

For our second set of experiments, we used the Intel icc compiler (version 11.1, optimization flag −O1). We also tested the fast logarithm implementation provided by the Intel Math Kernel Library (MKL [23]) for 10^{6} to 10^{8} invocations on random input numbers, as in the preceding experiments.
Tables 8 and 9 provide the execution times for the SP and DP MKL, ICSILog, and LAU implementations, respectively. The SP-LAU is 1.7 times faster than the MKL logarithm and 1.8 times faster than SP-ICSILog, while the respective speedups for the SP-VLAU are 3.3 and 3.5. Unfortunately, a detailed description of the MKL logarithm implementation is currently not available. The DP-LAU is 2.6 times faster than the respective MKL implementation and 2.8 times faster than DP-ICSILog, which is almost as fast as the DP-MKL function (speedups vary between 0.8 and 0.9). The DP-VLAU is 5.2 times faster than the MKL implementation and 5.6 times faster than DP-ICSILog.


As already mentioned, SP-ICSILog becomes unstable when the optimization flags −O2 or −O3 are used with gcc. Therefore, we only assessed the performance impact of using −O2 and −O3 with gcc on DP-ICSILog. We compare DP-ICSILog execution times with all alternative DP implementations: DP-GNU, DP-MKL, and DP-LAU. Table 10 provides the execution times for DP-GNU and DP-ICSILog for 10^{3} to 10^{8} invocations of the gcc-compiled code. The DP-LAU is 19 times faster than the GNU math library and 2.7 times faster than DP-ICSILog, which in turn is up to 7 times faster than the GNU implementation (for −O2 as well as −O3). The DP-VLAU is 37.5 times faster than the GNU math library and 5.3 times faster than DP-ICSILog. Table 11 provides the respective DP execution times for the same experimental setup, but using the Intel icc compiler instead. The DP-LAU is 2.2 times faster than the respective MKL implementation and 2.5 times faster than DP-ICSILog, which is as fast as the DP-MKL function; speedups between DP-ICSILog and the DP-MKL function vary between 0.83 and 0.98 for both optimization levels −O2 and −O3. Finally, the DP-VLAU is 4 times faster than the MKL function and 5 times faster than DP-ICSILog.


7. Conclusion and Future Work
We presented an architecture that efficiently calculates an approximation of the logarithm in reconfigurable logic under SP and DP arithmetic and uses only 2% of the computational resources on medium-size FPGAs. The SP/DP-LAUs (LUT size: 4,096) as well as the DP software are freely available for download. To the best of our knowledge, this represents the only IEEE-754-compatible open-source implementation of a resource-efficient logarithm approximation unit in reconfigurable logic. Since the accuracy demands on such a basic unit strongly depend on the target application, we also make available several COE files that can be used to initialize LUTs of various sizes and hence easily adapt the LAUs to the desired accuracy level. Apart from the increase in block-RAM usage for holding the mantissa LUT, the amount of required hardware resources will only slightly increase if the LUT size is increased (see Section 6.4), and the clock speed will only slightly decrease (see Figure 6).
Finally, we designed a memory-efficient VLAU architecture that exploits the dual-port option of embedded memory blocks. The VLAU utilizes two pipelined LAU datapaths but requires only one instance of the read-only lookup tables. This allows a VLAU to calculate two results per cycle while requiring half the LUT memory of two independent parallel LAUs. The VLAU can therefore be used for designing large architectures that require the computation of logarithms on vectors. The SP/DP-VLAUs are freely available for download.
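The memory-sharing scheme can be modeled at the cycle level as follows: two datapaths each issue one lookup per cycle, and both are served by the two ports of a single shared table instead of two private copies. This is a purely illustrative Python model of the dual-port idea; the function name and per-cycle interface are assumptions.

```python
# Cycle-level sketch of the VLAU's dual-port LUT sharing: one table,
# two read ports, two results per 'clock cycle'.
import math

LUT = [math.log(1.0 + i / 4096) for i in range(4096)]   # one shared man_LUT

def vlau_cycle(idx_a, idx_b):
    """One cycle: port A and port B read the same shared table in parallel."""
    return LUT[idx_a], LUT[idx_b]                       # two results per cycle
```

Two independent LAUs would need two copies of LUT for the same throughput; the dual-port BRAM makes the second copy unnecessary, halving table memory.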
Acknowledgment
Part of this work was funded under the auspices of the Emmy Noether program of the German Science Foundation (DFG).
References
D. Ververidis and C. Kotropoulos, “Gaussian mixture modeling by exploiting the Mahalanobis distance,” IEEE Transactions on Signal Processing, vol. 56, no. 7, pp. 2797–2811, 2008.
J. Felsenstein, “Evolutionary trees from DNA sequences: a maximum likelihood approach,” Journal of Molecular Evolution, vol. 17, no. 6, pp. 368–376, 1981.
N. Alachiotis, E. Sotiriades, A. Dollas, and A. Stamatakis, “Exploring FPGAs for accelerating the phylogenetic likelihood function,” in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS '09), pp. 1–8, Rome, Italy, May 2009.
N. Alachiotis, A. Stamatakis, E. Sotiriades, and A. Dollas, “A reconfigurable architecture for the phylogenetic likelihood function,” in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL '09), pp. 674–678, Prague, Czech Republic, September 2009.
V. M. Preciado, Real-Time Wavelet Transform for Image Processing on the Cellular Neural Network Universal Machine, vol. 2085 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2001.
B. De Ruijsscher, G. N. Gaydadjiev, J. Lichtenauer, and E. Hendriks, “FPGA accelerator for real-time skin segmentation,” in Proceedings of the IEEE/ACM/IFIP Workshop on Embedded Systems for Real Time Multimedia (ESTIMEDIA '06), pp. 93–97, October 2006.
N. Alachiotis and A. Stamatakis, “Efficient floating-point logarithm unit for FPGAs,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW '10), Atlanta, Ga, USA, April 2010.
A. Stamatakis, “RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models,” Bioinformatics, vol. 22, no. 21, pp. 2688–2690, 2006.
O. Vinyals and G. Friedland, “A hardware-independent fast logarithm approximation with adjustable accuracy,” in Proceedings of the 10th IEEE International Symposium on Multimedia (ISM '08), pp. 61–65, December 2008.
Xilinx, “Floating-Point Operator v4.0,” July 2009, http://www.xilinx.com/support/documentation/ip_documentation/floating_point_ds335.pdf.
L. de Soras, “Fast log() Function,” July 2009, http://www.flipcode.com/cgibin/fcarticles.cgi?show=63828.
F. de Dinechin, C. Klein, and B. Pasca, “Generating high-performance custom floating-point pipelines,” in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL '09), pp. 59–64, September 2009.
J. Detrey and F. de Dinechin, “Parameterized floating-point logarithm and exponential functions for FPGAs,” Microprocessors and Microsystems, Special Issue on FPGA-Based Reconfigurable Computing, vol. 31, no. 8, pp. 537–545, 2007.
J. Detrey and F. de Dinechin, “A parameterizable floating-point logarithm operator for FPGAs,” in Proceedings of the 39th Asilomar Conference on Signals, Systems and Computers, pp. 1186–1190, IEEE Signal Processing Society, November 2005.
National Instruments, “High Throughput Natural Logarithm Function,” http://www.ni.com/.
J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Transactions on Electronic Computers, pp. 330–334, 1959.
S. E. Tropea, “FPGA implementation of base-N logarithm,” in Proceedings of the 3rd Southern Conference on Programmable Logic (SPL '07), pp. 27–32, February 2007.
Xilinx, http://www.xilinx.com/.
Actel, January 2009, http://www.actel.com/.
B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2000.
G. Chrysos, E. Sotiriades, I. Papaefstathiou, and A. Dollas, “An FPGA-based coprocessor for gene finding using Interpolated Markov Model (IMM),” in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL '09), pp. 683–686, August 2009.
J. J. Raygoza-Panduro, S. Ortega-Cisneros, J. Rivera, and A. de la Mora, “Design of a mathematical unit in FPGA for the implementation of the control of a magnetic levitation system,” International Journal of Reconfigurable Computing, vol. 2008, Article ID 634306, 9 pages, 2008.
Intel, “Intel Math Kernel Library Reference Manual,” http://software.intel.com/enus/articles/intelmkl/.
N. Alachiotis, S. A. Berger, and A. Stamatakis, “Efficient PC-FPGA communication over Gigabit Ethernet,” in Proceedings of the International Conference on Embedded Software and Systems (ICESS '10), pp. 1727–1734, Bradford, UK, 2010.
H. Shimodaira and M. Hasegawa, “CONSEL: for assessing the confidence of phylogenetic tree selection,” Bioinformatics, vol. 17, no. 12, pp. 1246–1247, 2002.
Copyright
Copyright © 2011 Nikolaos Alachiotis and Alexandros Stamatakis. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.