#### Abstract

We propose a Domain-Specific Architecture for elementary function computation to improve throughput while reducing power consumption as a model for more general applications: support fine-grained parallelism by eliminating branches, and eliminate the duplication required by coprocessors by decomposing computation into instructions which fit existing pipelined execution models and standard register files. Our example instruction architecture (ISA) extension supports scalar and vector/SIMD implementations of table-based methods of calculating all common special functions, with the aim of improving throughput by (1) eliminating the need for tables in memory, (2) eliminating all branches for special cases, and (3) reducing the total number of instructions. Two new instructions are required, a table lookup instruction and an extended-precision floating-point multiply-add instruction with special treatment for exceptional inputs. To estimate the performance impact of these instructions, we implemented them in a modified Cell/B.E. SPU simulator and observed an average throughput improvement of 2.5 times for optimized loops mapping single functions over long vectors.

#### 1. Introduction

Elementary function libraries are often called from performance-critical code sections and hence contribute greatly to the efficiency of numerical applications, and the performance of these and libraries for linear algebra largely determine the performance of important applications. Current hardware trends impact this performance as(i)longer pipelines and wider superscalar dispatch favour implementations which distribute computation across different execution units and present the compiler with more opportunities for parallel execution but make branches more expensive;(ii)Single-Instruction-Multiple-Data (SIMD) parallelism makes handling cases via branches very expensive;(iii)memory throughput and latency which are not advancing as fast as computational throughput hinder the use of lookup tables;(iv)power constraints limit performance more than area.The last point is interesting and gives rise to the notion of “dark silicon” in which circuits are designed to be un- or underused to save power. The consequences of these thermal limitations versus silicon usage have been analyzed [1], and a number of performance-stretching approaches have been proposed [2] including the integration of specialized coprocessors.

Our proposal is less radical: instead of adding specialized coprocessors, add novel fully pipelined instructions to existing CPUs and GPUs, use the existing register file, reuse existing silicon for expensive operations, for example, fused multiply-add operations, and eliminate costly branches but add embedded lookup tables which are a very effective use of dark silicon. In the present paper, we demonstrate this approach for elementary function evaluation, that is, libm functions and especially vector versions of them.

To optimize performance, our approach takes the successful accurate table approach of Gal et al. [3, 4] coupled with algorithmic optimizations [5, 6] and branch and table unifications [7] to reduce all fixed-power-, logarithm-, and exponential-family functions to a hardware-based lookup followed by a handful of floating-point operations, mostly fused multiply-add instructions evaluating a single polynomial. Other functions like pow require multiple such basic stages, but no functions require branches to handle exceptional cases, including subnormal and infinite values.

Although fixed powers (including square roots and reciprocals) of most finite inputs can be efficiently computed using Newton-Raphson iteration following a software or hardware estimate [8], such iterations necessarily introduce NaN intermediate values, which can only be corrected using additional instructions (branches, predication, or selection). Therefore, our proposed implementations avoid iterative methods.

For evaluation of the approach, the proposed instructions were implemented in a Cell/B.E. [9] SPU simulator, and algorithms for a standard math function library were developed that leverage these proposed additions. Overall, we found that the new instructions would result in an average 2.5 times throughput improvement on this architecture versus current published performance results (Mathematical Acceleration Subsystem, 5.0, IBM). Given the simple data dependency graphs involved, we expect similar improvements from implementing these instructions on all two-way SIMD architectures supporting fused multiply-add instructions. Higher-way SIMD architectures would likely benefit more.

In the following, the main approach is developed, and the construction of two representative functions, and , is given in detail, providing insight by example into the nature of the implementation. In some sense these represent the hardest case; although the trigonometric functions require multiple tables, and there is some computation of the lookup keys, the hardware instructions themselves are simpler. For a complete specification of the algorithms used, see [10].

#### 2. New Instructions

Driven by hardware floating-point instructions, the advent of software pipelining and shortening of pipeline stages favoured iterative algorithms (see, e.g., [11]); the long-running trend towards parallelism has engendered a search for shared execution units [12] and in a more general sense a focus on throughput rather than low latency. In terms of algorithmic development for elementary functions, this makes combining short-latency seed or table value lookups with standard floating-point operations attractive, exposing the whole computation to software pipelining by the scheduler.

In proposing Instruction Set Architecture (ISA) extensions, one must consider four constraints:(i)the limit on the number of instructions imposed by the size of the machine word, and the desire for fast (i.e., simple) instruction decoding,(ii)the limit on arguments and results imposed by the architected number of ports on the register file,(iii)the limit on total latency required to prevent an increase in maximum pipeline depth,(iv)the need to balance increased functionality with increased area and power usage.As new lithography methods cause processor sizes to shrink, the relative cost of increasing core area for new instructions is reduced. The net cost may even be negative if the new instructions can reduce code and data size, thereby reducing pressure on the memory interface (which is more difficult to scale).

To achieve a performance benefit, ISA extensions should do one or more of the following:(i)reduce the number of machine instructions in compiled code,(ii)move computation away from bottleneck execution units or dispatch queues, or(iii)reduce register pressure.

Considering the above limitations and ideals, we propose adding two instructions, the motivation for which follows below: **d** = fmaX** a b c**: an extended-range floating-point multiply-add, with the first argument having 12 exponent bits and 51 mantissa bits, and nonstandard exception handling; **t**1 = lookup** a b f t**: an enhanced table lookup based on one or two arguments, and containing immediate argument specifying the function number and the sequence of the lookup, for example, the first lookup used for range reduction or the second lookup used for function reconstruction.

It is easiest to see them used in an example. Figure 1 describes the data flow graph (omitting register constants), which is identical for a majority of the elementary functions. The correct lookup is specified as an immediate argument to lookup, the final operation being fma for the log functions and fm otherwise. All of the floating-point instructions also take constant arguments which are not shown. For example, the fmaX takes an argument which is .

The dotted box in Figure 1 represents a varying number of fused multiply-adds used to evaluate a polynomial after the multiplicative range reduction performed by the fmaX. The most common size for these polynomials is order three, so the result of the polynomial (the left branch) is available four floating-point operations later (typically about 24–28 cycles) than the result . The second lookup instruction performs a second lookup, for example, the looks up and substitutes exceptional results (, NaN) when necessary. The final fma or fm instruction combines the polynomial approximation on the reduced interval with the table value.

The gray lines indicate two possible data flows for three possible implementations:(i)the second lookup instruction is a second lookup, using the same input;(ii)the second lookup instruction retrieves a value saved by the first lookup (in final or intermediate form) from a FIFO queue;(iii)the second lookup instruction retrieves a value saved in a slot according to an immediate tag which is also present in the corresponding first lookup.In the first case, the dependency is direct. In the second two cases the dependency is indirect, via registers internal to the execution unit handling the lookups.

All instruction variations have two register inputs and one or no outputs, so they will be compatible with existing in-flight instruction and register tracking. On lean in-order architectures, the variants with indirect dependencies—(ii) and (iii)—reduce register pressure and simplify modulo loop scheduling. This would be most effective in dedicated computational cores like the SPUs in which preemptive context switching is restricted.

The variant (iii) requires additional instruction decode logic but may be preferred over (ii) because tags allow lookup instructions to execute in different orders, and for wide superscalar processors, the tags can be used by the unit assignment logic to ensure that matching lookup instructions are routed to the same units. On Very Long Instruction Word machines, the position of lookups could replace or augment the tag.

In low-power environments, the known long minimum latency between the lookups would enable hardware designers to use lower power but longer latency implementations of most of the second lookup instructions.

To facilitate scheduling, it is recommended that the FIFO or tag set be sized to the power of two greater than or equal to the latency of a floating-point operation. In this case, the number of registers required will be less than twice the unrolling factor, which is much lower than what is possible for code generated without access to such instructions. The combination of small instruction counts and reduced register pressure eliminate the obstacles to inlining of these functions.

We recommend that lookups be handled by either a load/store unit, or, for vector implementations with a complex integer unit, by that unit. This code will be limited by floating-point instruction dispatch, so moving computation out of this unit will increase performance.

##### 2.1. Exceptional Values

A key advantage of the proposed new instructions is that the complications associated with exceptional values (0, , NaN, and values with over- or underflow at intermediate stages) are internal to the instructions, eliminating branches and predicated execution.

Iterative methods with table-based seed values cannot achieve this in most cases because(1)in and cases the iteration multiplies by producing a NaN;(2)to prevent over/underflow for high and low input exponents, matched adjustments are required before* and* after polynomial evaluation or iterations.By using the table-based instruction twice, once to look up the value used in range reduction and once to look up the value of the function corresponding to the reduction, and introducing an extended-range floating-point representation with special handling for exceptions, we can handle both types of exceptions without extra instructions.

In the case of finite inputs, the value , such thatreturned by the first lookup is a normal* extended-range* value. In the case of subnormal inputs,* extended-range* are required to represent this lookup value because normal IEEE value would saturate to . Treatment of the large inputs which produce IEEE subnormal as their approximate reciprocals can be handled (similar to normal inputs) using the extended-range representation. The extended-range number is biased by , and the top binary value () is reserved for and NaNs and is reserved for similar to IEEE floating point. When these values are supplied as the first argument of fmaX, they override the normal values, and fmaX simply returns the corresponding IEEE bit pattern.

The second lookup instruction returns an IEEE double except when used for divide, in which case it also returns an extended-range result.

In Table 2, we summarize how each case is handled and describe it in detail in the following section. Each cell in Table 2 contains the value used in the reduction, followed by the corresponding function value. The first is given as an extended-range floating-point number which trades one bit of stored precision in the mantissa with a doubling of the exponent range. In all cases arising in this library, the extra bit would be one of several zero bits, so no actual loss of precision occurs. For the purposes of elementary function evaluation, subnormal extended-range floating-point numbers are not needed, so they do not need to be supported in the floating-point execution unit. As a result, the modifications to support extended-range numbers as inputs are minor.

Take, for example, the first cell in the table, recip computing for a normal positive input. Although the abstract values are both , the bit patterns for the two lookups are different, meaning that must be representable in both formats. In the next cell, however, for some subnormal inputs, is representable in the extended range, but not in IEEE floating point, because the addition of subnormal numbers makes the exponent range asymmetrical. As a result, the second value may be saturated to . The remaining cells in this row show that, for input, we return from both lookups, but for inputs the first lookup returns and the second lookup returns . In the last column we see that, for negative inputs, the returned values change the sign. This ensures that intermediate values are always positive and allows the polynomial approximation to be optimized to give correctly rounded results on more boundary cases. Both lookups return quiet NaN outputs for NaN inputs.

Contrast this with the handling of approximate reciprocal instructions. For the instructions to be useful as approximations inputs should return approximations and vice versa, but if the approximation is improved using Newton-Raphson, then the multiplication of the input by the approximation produces a NaN which propagates to the final result.

The other cases are similar in treating and inputs specially. Noteworthy variations are that multiplicatively shifts subnormal inputs into the normal range so that the normal approximation can be used and then additively shifts the result of the second lookup to compensate, and returns and for subnormal inputs, because the polynomial approximation produces the correct result for the whole subnormal range.

In Table 2, we list the handling of exceptional cases. All exceptional values detected in the first argument are converted to the IEEE equivalent and are returned as the output of the fmaX, as indicated by subscript (for final). The subscripted NaNs are special bit patterns required to produce the special outputs needed for exceptional cases. For example, when fmaX is executed with as the first argument (one of the multiplicands) and the other two arguments are finite IEEE values, the result is (in IEEE floating-point format): If the result of multiplication is an and the addend is with the opposite sign, then the result is zero, although normally it would be a NaN. If the addend is a NaN, then the result is zero. For the other values, indicated by “” in Table 2, fmaX operates as a usual fused multiply-accumulate except that the first argument (a multiplicand) is an extended-range floating-point number. For example, the fused multiplication and addition of finite arguments saturate to in the usual way.

Finally, for exponential functions, which return fixed finite values for a wide range of inputs (including infinities), it is necessary to override the range reduction so that it produces an output which results in a constant value after the polynomial approximation. In the case of exponential, any finite value which results in a nonzero polynomial value will do, because the second lookup instruction returns or and multiplication by any finite value will return as required.

*Lookup Internals*. The lookup instructions perform similar operations for each of the elementary functions we have considered. The function number is an immediate argument. In assembly language each function could be a different macro, while in high level languages the pair could be represented by a single function returning a tuple or struct.

A simplified data flow for the most complicated case, , is represented in Figure 2. The simplification is the elimination of the many single-bit operations necessary to keep track of exceptional conditions, while the operations to substitute special values are still shown. Critically, the diagram demonstrates that the operations around the core lookup operations are all of low complexity. The graph is explained in the following, where letter labels correspond to the blue colored labels in Figure 2. This representation is for variant (ii) or (iii) for the second lookup and includes a dotted line on the centre-right of the figure at (a), delineating a possible set of values to save at the end of the first lookup where the part of the data flow below the line is computed in the second lookup instruction.

Starting from the top of the graph, the input (b) is used to generate two values (c) and (d), and in the case of . The heart of the operation is two lookup operations (e) and (f), with a common index. In implementation (i) the lookups would be implemented separately, while in the shared implementations (ii) and (iii), the lookups could be implemented more efficiently together.

Partial decoding of subnormal inputs (g) is required for all of the functions except the exponential functions. Only the leading nonzero bits are needed for subnormal values, and only the leading bits are needed for normal values, but the number of leading zero bits (h) is required to properly form the exponent for the multiplicative reduction. The only switch (i) needed for the first lookup output switches between the reciprocal exponents valid in the normal and subnormal cases, respectively. Accurate range reduction for subnormal requires both extreme end points, for example, and , because these values are exactly representable. As a result, two exponent values are required, and we accommodate this by storing an exponent bit (j) in addition to the 51 mantissa bits.

On the right hand side, the lookup (e) for the second lookup operation also looks up a 4-bit rotation, which also serves as a flag. We need 4 bits because the table size implies that we may have a variation in the exponent of the leading nonzero bit of up to 11 for nonzero table values. This allows us to encode in bits the floating mantissa used to construct the second lookup output. This table will always contain a 0 and is encoded as a in the bitRot field. In all other cases, the next operation concatenates the implied for this floating-point format. This gives us an effective 31 bits of significance (l), which is then rotated into the correct position in a 42-bit fixed point number. Only the high-order bit overlaps the integer part of the answer generated from the exponent bits, so this value needs to be padded. Because the output is an IEEE float, the contribution of the (padded) value to the mantissa of the output will depend on the sign of the integer exponent part. This sign is computed by adding 1 (m) to the biased exponent, in which case the high-order bit is 1 if and only if the exponent is positive. This bit (n) is used to control the sign reversal of the integer part (o) and the sign of the sign reversal of the fractional part, which is optimized by padding (p) after xoring (q) but before the (r) required to negate a two’s-complement integer.

The integer part has now been computed for normal inputs, but we need to switch (s) in the value for subnormal inputs which we obtain by biasing the number of leading zeros computed as part of the first step. The apparent 75-bit add (t) is really only 11 bits with 10 of the bits coming from padding on one side. This fixed-point number may contain leading zeros, but the maximum number is ((maximum integer part) − (smallest nonzero table value)) = 22, for the tested table size. As a result the normalization (u) only needs to check for up to 22 leading zero bits, and if it detects that number set a flag to substitute a zero for the exponent (v) (the mantissa is automatically zero). The final switches substitute special values for and a quiet NaN.

If the variants (ii) and (iii) are implemented, either the hidden registers must be saved on context/core switches, or such switches must be disabled during execution of these instructions, or execution of these instructions must be limited to one thread at a time.

#### 3. Evaluation

Two types of simulations of these instructions were carried out. First, to test accuracy, our existing Cell/B.E. functional interpreter, see [13], was extended to include the new instructions. Second, we simulated the log lookups and fmaX using logical operations on bits, that is, a hardware simulation without timing, and verified that the results match the interpreter.

*Performance*. Since the dependency graphs (as typified by Figure 1) are close to linear and therefore easy to schedule optimally, the throughput and latency of software-pipelined loops will be essentially proportional to the number of floating-point instructions. Table 1 lists the expected throughput for vector math functions with and without the new instructions. Figure 3 demonstrated the relative measured performance improvements. Overall, the addition of hardware instructions to the SPU results in a mean 2.5× throughput improvement for the whole function library. Performance improvements on other architectures will vary but would be similar, since the acceleration is primarily the result of eliminating instructions for handling exceptional cases.

*Accuracy*. We tested each of the functions by simulating execution for 20000 pseudorandom inputs over their natural ranges or for trigonometric functions and comparing each value to a 500-digit-precision Maple [14] reference. Table 1 shows a maximum 2.051 ulp error, with many functions close to correct rounding. This is well within the OpenCL specifications [15] and shows very good accuracy for functions designed primarily for high throughput and small code size. For applications requiring even higher accuracy, larger tables could be used and polynomials with better rounding properties could be searched for using the lattice-basis-reduction method of [16].

#### 4. Conclusion

We have demonstrated considerable performance improvements for fixed power, exponential, and logarithm calculations by using novel table lookup and fused multiply-add instructions in simple branch-free accurate-table-based algorithms. Performance improved less for trigonometric functions, but this improvement will grow with more cores and/or wider SIMD. These calculations ignored the effect of reduced power consumption caused by reducing instruction dispatch and function calls and branching and reducing memory accesses for large tables, which will mean that these algorithms will continue to scale longer than conventional ones.

For target applications, just three added opcodes pack a lot of performance improvement, but designing the instructions required insights into the algorithms, and even a new algorithm [7] for the calculation of these functions. We invite experts in areas such as cryptography and data compression to try a similar approach.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors thank NSERC, MITACS, Optimal Computational Algorithms, Inc., and IBM Canada for financial support. Some work in this paper is covered by US Patent 6,804,546.