This paper presents a novel approach to implementing multiplication in Galois Fields GF(2^N). Elements of GF(2^N) can be represented as polynomials of degree less than N over GF(2). Operations are performed modulo an irreducible polynomial of degree N over GF(2). Our approach splits a Galois Field multiply into two operations, polynomial-multiply and polynomial-remainder over GF(2). We show how these two operations can be implemented using the same hardware. Further, we show that in many cases several polynomial-multiply operations can be combined before needing to do a polynomial-remainder. The Sandblaster 2.0 is a SIMD architecture. It has SIMD variants of the poly-multiply and poly-remainder instructions. We use a Reed-Solomon encoder and decoder to demonstrate the performance of our approach. Our new approach achieves a speedup of 11.5x, compared to the 8x of the standard SIMD processor.

1. Introduction

Galois Field arithmetic is widely used in applications such as error-correcting codes and cryptography. Generally, the Galois Fields used are GF(2^N) for some N. Elements of GF(2^N) may be represented as polynomials of degree less than N over GF(2). Operations are performed modulo some polynomial P, where P is an irreducible polynomial of degree N over GF(2). P is known as the prime polynomial. The multiplication of two elements X and Y can be accomplished by multiplying their polynomial representations, and then computing the remainder modulo P.

Conventionally, these polynomials are represented as binary numbers, where the i-th term x^i is represented by setting the i-th bit to 1 or 0 depending on whether that term is present. Thus, the polynomial x^4 + x + 1 would be represented as 10011.

Addition of GF(2^N) values in this representation is straightforward; it is simply an exclusive-or (xor) of the two binary numbers. Galois Field multiplication (GFM), however, is much more complicated. It involves the following steps.

(i) Do a polynomial-multiply of two inputs.
(ii) Do a polynomial-remainder of the product modulo a third input, the prime polynomial.
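The two steps above can be sketched in a few lines of Python. This is a minimal illustration in the conventional right-aligned representation, not the hardware algorithm; the function names `poly_mul`, `poly_rem`, and `gf_mul` are hypothetical. The sample values are the GF(2^6) example used later in the paper (101100 times 011011 modulo 1001001).

```python
def poly_mul(a: int, b: int) -> int:
    """Carry-less product of two polynomials over GF(2), bit i = coefficient of x^i."""
    result = 0
    while b:
        if b & 1:          # term x^k present in b: add (xor) a * x^k
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_rem(value: int, prime: int) -> int:
    """Remainder of value modulo prime over GF(2), by polynomial long division."""
    degree = prime.bit_length() - 1
    while value.bit_length() > degree:
        shift = value.bit_length() - 1 - degree
        value ^= prime << shift   # cancel the current leading term
    return value


def gf_mul(a: int, b: int, prime: int) -> int:
    """Galois Field multiply as poly-multiply followed by poly-remainder."""
    return poly_rem(poly_mul(a, b), prime)


product = poly_mul(0b101100, 0b011011)           # intermediate product
result = gf_mul(0b101100, 0b011011, 0b1001001)   # reduced modulo x^6 + x^3 + 1
print(format(product, "011b"), format(result, "06b"))
```

With these inputs the intermediate product is 01111010100 and the remainder is 101010, matching the values quoted for Table 1.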

In software, GF multiplications are usually performed using Look-up Tables (LUTs). For large N, the LUT becomes rather large, requiring a prohibitively large memory. The processing time also becomes prohibitive at high data rates.

To further complicate issues, processors need to be able to handle Galois Fields of different lengths. Consequently, several processors have added instructions for Galois Field Multiplication (GFM). The most representative is the TI C64x DSP.

A general purpose GFM instruction needs 4 inputs: the two multiplicands, the prime polynomial, and the length. At least 3 of these, the two multiplicands and the prime polynomial, must be in registers. Since most instruction sets do not provide for 4 input fields, a GFM instruction generally uses a special-purpose register that provides either the length or the prime, or both. For instance, the GMPY4 operation in the TI C64x DSP uses the GFPGFR special purpose register to specify the length and prime polynomial [1].

In this paper, we describe an approach that implements GFM using two instructions, one of which implements polynomial-multiply over GF(2), and the other of which implements the polynomial-remainder over GF(2). In the Sandblaster 2.0 architecture [2], these instructions are called gfmul and gfnormi, respectively. Both of these instructions use two register inputs. The gfnormi additionally has a third immediate input, the length of the polynomial.

The Sandblaster 2.0 has a 16-way SIMD unit; consequently, we also have 16 way variants of polynomial-multiply and polynomial-remainder instructions called rgfmul and rgfnorm. Additionally, the SIMD unit supports polynomial-multiply-and-add and polynomial-multiply-and-reduce instructions called rgfmac and rgfmulred.

It turns out that it is possible to specify the gfmul and gfnormi operations in such a fashion that we can use almost identical hardware to implement both functions. Consequently, there is no hardware overhead to split the GFM operation into two operations.

The Galois Field sum of several GFM operations can be simplified to the polynomial sum (i.e., xor) of several polynomial-multiplies, followed by a single polynomial remainder. This is quite common in several algorithms that use Galois Field arithmetic. In those cases, we can implement the sum of N GFM operations using N polynomial-multiplies and 1 polynomial-remainder, incurring a 1 instruction overhead because of our split implementation.
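The simplification works because the remainder operation is linear over GF(2): the remainder of an xor of products equals the xor of the individual remainders. The following sketch, with hypothetical helper names and the conventional right-aligned representation, compares a sum of N full GFMs against N poly-multiplies followed by one poly-remainder.

```python
def poly_mul(a: int, b: int) -> int:
    """Carry-less product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_rem(value: int, prime: int) -> int:
    """Remainder modulo prime over GF(2)."""
    degree = prime.bit_length() - 1
    while value.bit_length() > degree:
        value ^= prime << (value.bit_length() - 1 - degree)
    return value


def gf_sum_of_products_naive(xs, ys, prime):
    """N poly-multiplies and N poly-remainders."""
    acc = 0
    for x, y in zip(xs, ys):
        acc ^= poly_rem(poly_mul(x, y), prime)
    return acc


def gf_sum_of_products_deferred(xs, ys, prime):
    """N poly-multiplies, xor-accumulated, then a single poly-remainder."""
    acc = 0
    for x, y in zip(xs, ys):
        acc ^= poly_mul(x, y)
    return poly_rem(acc, prime)


PRIME = 0b1001001  # x^6 + x^3 + 1, the prime polynomial of the paper's GF(2^6) example
xs = [0b101100, 0b000111, 0b110001, 0b011010]
ys = [0b011011, 0b111000, 0b000101, 0b100001]
assert gf_sum_of_products_naive(xs, ys, PRIME) == gf_sum_of_products_deferred(xs, ys, PRIME)
```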

Section 2 of this paper describes the GFM instructions in the Sandblaster 2.0 architecture. Section 3 focuses on the implementation of the operations. Section 4 examines the performance of GFM instructions in the context of Reed-Solomon encoding/decoding. We conclude in Section 5.

2. Galois Field Multiply Instructions

The Sandblaster scalar unit has 16 32-bit general purpose registers. Like most RISC architectures, at most 2 registers can be read and 1 written per integer operation. An integer operation has fields to specify up to 3 registers. An extended immediate variant of the instruction can additionally provide up to 12 bits of immediate data.

2.1. Polynomial Representation

In the customary binary representation of polynomials over GF(2), the bits are right-aligned, with the LSB representing the coefficient of term x^0, and bit i representing the coefficient of term x^i. By contrast, we left-align the coefficients, so that the coefficient of the largest term of the polynomial is represented by the MSB. For a polynomial of degree N, bit 31-i of the general purpose register represents the coefficient of term x^(N-i).

Note that in this representation, without knowing the length of the polynomial, we cannot be sure which polynomial is represented by a specific number. For instance, 0xb000_0000 could be interpreted as x^3 + x + 1 if the polynomials are of degree 3, or as x^5 + x^3 + x^2 if the polynomials are of degree 5.
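The ambiguity can be illustrated with a small sketch of the left-aligned layout. The helpers `left_align` and `terms` are hypothetical names introduced only for this illustration; they assume a 32-bit register with bit 31 holding the coefficient of the highest term.

```python
REG_BITS = 32


def left_align(poly: int, degree: int) -> int:
    """Move a right-aligned polynomial so that x^degree lands on bit 31."""
    return (poly << (REG_BITS - 1 - degree)) & 0xFFFFFFFF


def terms(reg: int, degree: int) -> list:
    """Exponents present when reg is read as a left-aligned polynomial of this degree."""
    return [degree - i
            for i in range(degree + 1)
            if (reg >> (REG_BITS - 1 - i)) & 1]


reg = 0xB0000000
print(terms(reg, 3))   # exponents of x^3 + x + 1
print(terms(reg, 5))   # exponents of x^5 + x^3 + x^2
```

The same register value decodes to two different polynomials, which is why the length has to be supplied separately.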

We picked this representation to make it easier to compute the remainder. Since the high-order term of the divisor and dividend is left-aligned, we can start subtraction without requiring any shifting to line up the start of the polynomials.

For correctness, it is assumed that all unused bits in the register are 0. Both polynomial-multiply and -remainder are implemented so that they leave their results left-aligned with unused bits as 0.

There is one wrinkle about the representation. We assume that the polynomial remainder is performed with a left-aligned divisor so that the MSB is always 1. In this case, representing the leading coefficient is redundant. So, we do not represent the leading bit of a divisor polynomial. Instead, the MSB represents the coefficient of the second highest term, x^(N-1). For instance, the divisor polynomial x^6 + x^3 + 1 is represented as 0x2400_0000; that is, with the leading 6 bits being 001001.
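As a small sketch of this encoding, the hypothetical helpers below drop the leading 1 of a right-aligned prime polynomial and left-align the remaining bits in a 32-bit register, then reverse the process when the degree is known.

```python
def encode_divisor(prime: int) -> int:
    """Drop the leading 1 of the prime polynomial and left-align the rest in 32 bits."""
    degree = prime.bit_length() - 1
    low_bits = prime ^ (1 << degree)          # remove the x^degree term
    return (low_bits << (32 - degree)) & 0xFFFFFFFF


def decode_divisor(reg: int, degree: int) -> int:
    """Rebuild the right-aligned prime polynomial from its register form."""
    return (1 << degree) | (reg >> (32 - degree))


rp = encode_divisor(0b1001001)                # x^6 + x^3 + 1
print(hex(rp))                                # 0x24000000
```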

2.2. Polynomial Operations

The poly-multiply instruction in the Sandblaster architecture, gfmul, has the following format:

gfmul rc,ra,rb

It does a polynomial multiplication of the upper-most 8 bits of ra with the upper-most 8 bits of rb, and writes the 15-bit result of the poly-multiply in the upper bits of the target register rc. The remaining bits of rc are zeroed.

The poly-remainder instruction in the Sandblaster architecture, gfnormi, has the format

gfnormi rt,rc,rp,J

The dividend is the 32-bit number formed by the upper 16 bits of rc right-padded with 0. The divisor is the 17-bit number formed by prepending a 1 to the upper 16 bits of rp. J is an immediate operand ranging from 0 to 7. The gfnormi instruction performs J+1 poly-division steps, leaving the remainder in the 16-(J+1) upper-most bits of the target register rt.

2.3. Galois Field Multiplication

Implementing a GFM over GF(2^K) with a (K+1)-bit prime polynomial P uses the following setup:

(i) the product inputs are stored in the upper K bits of two registers, ra and rb,
(ii) the leading bit of P is dropped and the remaining K bits are stored in the upper K bits of a register, rp,
(iii) all unused bits are set to 0.

After executing the following code sequence, the final result of the GFM will be the upper K bits of rt:

gfmul rc,ra,rb
gfnormi rt,rc,rp,K-1

Table 1 shows an example of Galois Field multiplication over GF(2^6) of the two numbers 101100 and 011011 with the prime polynomial 1001001. This results in an intermediate product of 01111010100 with a final remainder of 101010. In Table 1, the column on the right shows how the values will be stored in the corresponding registers.

2.4. SIMD

The SIMD unit in the Sandblaster 2.0 architecture has eight SIMD registers, each holding 16 16-bit elements, as well as four accumulator registers. The instruction encoding for SIMD operations allows for 4 input fields. The SIMD unit allows 3 registers to be read and 2 to be written by an instruction.

The SIMD unit supports GFM through the rgfmul and rgfnorm instructions, which have the following format:

rgfmul vc,va,vb
rgfnorm vt,vc,vp,J

These instructions do 16 poly-multiplies/poly-remainders in parallel. Since the SIMD register elements are 16-bit wide, the rgfmul uses the upper 8 bits of each element, while the rgfnorm uses the entire 16 bits of the element. Other than that, their behavior is identical to the gfmul/gfnormi instructions.

Up to three SIMD registers can be read per cycle; we use the extra read-port to implement a poly-multiply-and-add instruction with the format:

rgfmac vc,va,vb,vs

The rgfmac instruction performs 16 poly-multiplies of the 16 elements of va and vb, and then poly-adds (xors) the products with the corresponding elements of vs.

The SIMD unit has an idiom where the 16 results of element-wise operations (such as rgfmul) are combined together to form a single value that is written to the accumulator. The poly-multiply-and-sum-reduce instruction follows this idiom:

rgfmulred act,va,vb

The 16 elements of va and vb are poly-multiplied together, and the 16 resulting products are poly-summed (xor-ed) together to form a single 16-bit value that is written to the accumulator register.
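The element-wise behavior of these SIMD instructions can be sketched as follows. This is a behavioral illustration only: the Python functions borrow the instruction names for readability, operate on lists of 16 right-aligned values, and ignore the left-aligned register layout and any encoding details.

```python
LANES = 16


def poly_mul(a: int, b: int) -> int:
    """Carry-less product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def rgfmul(va, vb):
    """16 independent poly-multiplies."""
    return [poly_mul(a, b) for a, b in zip(va, vb)]


def rgfmac(va, vb, vs):
    """16 poly-multiplies, each xor-ed with the matching element of vs."""
    return [poly_mul(a, b) ^ s for a, b, s in zip(va, vb, vs)]


def rgfmulred(va, vb):
    """16 poly-multiplies whose products are xor-reduced to a single value."""
    acc = 0
    for a, b in zip(va, vb):
        acc ^= poly_mul(a, b)
    return acc


va = list(range(1, LANES + 1))
vb = list(range(17, 17 + LANES))
print(rgfmulred(va, vb))
```

Because the reduction is a plain xor of unreduced products, a single poly-remainder applied afterwards yields the Galois Field dot product of the two vectors.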

3. Implementation

The gfnormi and gfmul instructions can be implemented by the same block with very little overhead. As can be seen from the pseudocode in Algorithm 1, the algorithms for the two involve the same computational kernel and differ in their setup and controls. They are described in detail below.

gfop (ismul, ra, rb, J)
  if (ismul)
    res = 0x00000
    div = 0x00.rb[31:24]
    N = 8
  else
    res = ra[31:17]
    div = rb[31:17]
    N = J + 1
  /* shift/xor stages */
  for (i = 0; i < 8; i++)
    if (ismul)
      isxor = ra[31 - i]
      issh = true
    else
      isxor = msb(res) && i < N
      issh = i < N
    if (isxor)
      res = (res << 1) ^ div
    else if (issh)
      res = (res << 1)
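A runnable reading of Algorithm 1 is sketched below in Python. Several details are assumptions made for this sketch rather than facts from the paper: res and div are kept 15 bits wide (following the [31:17] slices), the xor condition in the remainder case tests the top bit of res, each shift is masked to 15 bits, and the result is returned left-aligned in a 32-bit value. Under these assumptions the model reproduces the paper's GF(2^6) example with five division steps (J = 4 in this model).

```python
MASK15 = 0x7FFF  # assumed width of res and div, from the [31:17] slices


def gfop(ismul: bool, ra: int, rb: int, J: int = 0) -> int:
    """Behavioral model of the shared multiply/remainder block (a sketch)."""
    if ismul:
        res = 0
        div = (rb >> 24) & 0xFF        # 0x00 . rb[31:24]
        n = 8
    else:
        res = (ra >> 17) & MASK15      # ra[31:17]
        div = (rb >> 17) & MASK15      # rb[31:17]; leading 1 of the divisor is implicit
        n = J + 1
    for i in range(8):                 # shift/xor stages
        if ismul:
            isxor = bool((ra >> (31 - i)) & 1)
            issh = True
        else:
            isxor = bool((res >> 14) & 1) and i < n
            issh = i < n
        if isxor:
            res = ((res << 1) ^ div) & MASK15
        elif issh:
            res = (res << 1) & MASK15
    return res << 17                   # left-align the 15-bit result


ra = 0b101100 << 26                    # inputs in the upper 6 bits
rb = 0b011011 << 26
rp = 0x24000000                        # x^6 + x^3 + 1 without its leading bit
rc = gfop(True, ra, rb)                # poly-multiply
rt = gfop(False, rc, rp, J=4)          # five poly-division steps
print(hex(rc), format(rt >> 26, "06b"))
```

In this model the product register holds 0x7a800000 (01111010100 left-aligned) and the upper 6 bits of the remainder are 101010.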

3.1. gfnormi

The gfnormi instruction computes the remainder using polynomial long division. Since the values are left-aligned, we start the process at bit 31 of the dividend value. The divisor consists of a leading 1 and the upper 16 bits of the divisor register. The immediate argument to the gfnormi instruction specifies the number of divide steps executed, J+1. For example, 5 steps of the division of 011.1101.0100 by 100.1001 will proceed as follows:

01111010100000000
00000000000000000
11110101000000000
10010010000000000
11001110000000000
10010010000000000
10111000000000000
10010010000000000
01010100000000000
00000000000000000
1010100000000000

At each step, the result is xor-ed with 0 or with the divisor, depending on the leading bit being 0/1. The result is then left-shifted by 1 to ensure that the remainder after the division step is left-aligned. Note that xor-ing with 0 is the identity operation; this results in just a left-shift. This is done when the intermediate remainder is smaller than the divisor.
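The division steps can be traced with a short sketch. It follows the worked example literally: a 17-bit working value, a divisor with its leading 1 written out, an xor with the divisor when the leading bit is 1, then a left shift. The function name and the choice to pass the step count directly are assumptions of this illustration.

```python
WIDTH = 17
MASK = (1 << WIDTH) - 1


def divide_steps(dividend_bits: str, divisor_bits: str, steps: int) -> int:
    """Run `steps` xor-then-shift division steps on left-aligned bit strings."""
    res = int(dividend_bits.ljust(WIDTH, "0"), 2)
    div = int(divisor_bits.ljust(WIDTH, "0"), 2)
    for _ in range(steps):
        if res >> (WIDTH - 1):         # leading bit is 1: subtract (xor) the divisor
            res ^= div
        res = (res << 1) & MASK        # keep the partial remainder left-aligned
        print(format(res, "017b"))
    return res


remainder = divide_steps("01111010100", "1001001", 5)
print(format(remainder >> (WIDTH - 6), "06b"))   # upper 6 bits hold the remainder
```

After five steps the upper 6 bits are 101010, the remainder given for this example.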

3.2. gfmul

Each poly-multiply step needs to follow the same pattern as the poly-remainder so that much of the hardware is common. If we are going to do J+1 steps, we do the following:

(i) the partial result is initialized to all 0s,
(ii) the "divisor" at each step is one of the multiply inputs prepended with J+1 zeroes,
(iii) the controls that select whether the xor is with the divisor or with 0s are the bits of the second multiply input, starting with the upper-most bit.

The example below multiplies 101100 and 011011 using 6 steps. 101100 is used as the control input.

0000000000000000
0000000110110000
0000001101100000
0000000000000000
0000011011000000
0000000110110000
0000111011100000
0000000110110000
0001111010100000
0000000000000000
0011110101000000
0000000000000000
0111101010000000

The gfmul instruction always does 8 steps of multiply. Consequently, in the implementation, the "divisor" is prepended by 8 zeroes.
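The multiply steps can be sketched in the same style. The sketch follows the 16-bit trace of the example: the "divisor" is the multiplicand prepended with one zero per step, and each control bit selects an xor with the divisor or with 0 before the left shift. The function name and the use of bit strings are assumptions of this illustration.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def multiply_steps(control_bits: str, multiplicand_bits: str) -> int:
    """One xor-then-shift step per control bit, upper-most control bit first."""
    steps = len(control_bits)
    # Prepend `steps` zeroes, then pad on the right up to the working width.
    div = int(multiplicand_bits, 2) << (WIDTH - steps - len(multiplicand_bits))
    res = 0
    for bit in control_bits:
        if bit == "1":
            res ^= div
        res = (res << 1) & MASK
    return res


six_step = multiply_steps("101100", "011011")        # the 6-step example
eight_step = multiply_steps("10110000", "01101100")  # the same inputs padded to 8 bits
print(format(six_step, "016b"), format(eight_step, "016b"))
```

Both calls give 0111101010000000, so running a fixed 8 steps on inputs left-aligned in 8 bits leaves the same left-aligned product as the 6-step example.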

3.3. Results

The unified block that implements gfnormi and gfmul in the SB3500 consists of some setup followed by 8 stages of a computational kernel. This computation in each stage is an xor-select, as shown in Figure 1.

In the case of the polynomial remainder operation, gfnormi, the result (res) and divisor (div) values are set up from the values of the ra and rb registers. The count (N) is set to the immediate value specified in the operation plus 1. For the first N of the 8 xor-shift stages, if the MSB in res is 1, res is shifted and xor-ed with the value in div; otherwise it is just shifted by 1.

For the polynomial multiply operation, res is set to 0 and div is set to the value of the rb register prepended by 8 zeroes; the count N is always 8. The top 8 bits of the ra register are used as controls to the 8 xor-shift stages; if the corresponding bit is 1, res is shifted and xor-ed with the value in div; otherwise it is just shifted by 1.

From the diagram shown in Figure 1, it is clear that the critical path for each of the 8 xor-select stages in a naïve implementation uses two 2-1 muxes. Adding in the initial setup, this gives a total delay for the entire block of about 17 2-1 muxes. In the TSMC 65 nm low power TSMC65LP process the critical path is about 0.9 nanoseconds at an area of 2856 µm^2 using regular Vt transistors and typical timing.

The SB3500 implementation is targeted for a 1.6 nanosecond clock. It has a 2-stage execute pipeline, so the gf-op block is pipelined across 2 stages. This gives the synthesis tool 3.3 nanoseconds to implement the block. The synthesis tool used this relaxed timing to pick a power and area optimized implementation. In this implementation the gf-op block occupies approximately 2018 µm^2, not including the pipeline registers.

It is possible to implement various look-ahead schemes that would reduce the critical path at the expense of extra logic. Since we have ample slack, we did not investigate any area/speed tradeoffs.

4. Reed-Solomon

We have implemented an RS encoder/decoder that is designed to run on a SIMD architecture. The numbers presented in this section are tuned for the DVB (digital video broadcasting) standard. This standard uses RS(204,188) encoding; that is, it adds 16 check symbols to a 188-byte packet, resulting in a total code word of length 204 bytes.

4.1. Algorithm

The RS encoder used in this study does the following steps [3]:

(i) append N zeroes to a data block,
(ii) perform successive Horner reduction of the polynomial whose coefficients are the data block plus zeroes to obtain remainders,
(iii) multiply remainders by pre-calculated coefficients and sum.

All operations are over GF(2^8). The RS decoder [4] starts with syndrome computation, which computes the dot-product of the received code-word with precomputed syndrome vectors [4]. If the syndromes are all zero, then there are no errors.
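The syndrome dot-product is a natural fit for the split multiply: every term is a poly-multiply, and a single poly-remainder finishes each syndrome. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), the primitive element 0x02, and syndrome roots alpha^0 through alpha^15; the helper names are hypothetical. It checks the deferred-remainder dot-product against a Horner evaluation that reduces after every multiply.

```python
import random

PRIME = 0x11D   # assumed field polynomial for GF(2^8)
ALPHA = 0x02    # assumed primitive element


def poly_mul(a: int, b: int) -> int:
    """Carry-less product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_rem(value: int, prime: int = PRIME) -> int:
    """Remainder modulo prime over GF(2)."""
    degree = prime.bit_length() - 1
    while value.bit_length() > degree:
        value ^= prime << (value.bit_length() - 1 - degree)
    return value


def gf_mul(a: int, b: int) -> int:
    return poly_rem(poly_mul(a, b))


def syndrome_vector(root: int, length: int) -> list:
    """Precomputed powers of root; the first symbol pairs with the highest power."""
    vec = [1] * length
    for i in range(length - 2, -1, -1):
        vec[i] = gf_mul(vec[i + 1], root)
    return vec


def syndrome_dot(word, vec) -> int:
    """Xor-accumulate unreduced products, then one poly-remainder."""
    acc = 0
    for sym, v in zip(word, vec):
        acc ^= poly_mul(sym, v)
    return poly_rem(acc)


def syndrome_horner(word, root: int) -> int:
    """Reference: Horner evaluation with a full GFM per symbol."""
    s = 0
    for sym in word:
        s = gf_mul(s, root) ^ sym
    return s


random.seed(1)
word = [random.randrange(256) for _ in range(204)]
root = 1
for _ in range(16):
    assert syndrome_dot(word, syndrome_vector(root, 204)) == syndrome_horner(word, root)
    root = gf_mul(root, ALPHA)
```

For a 204-symbol word this replaces 204 poly-remainders per syndrome with one.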

Our implementation combines several techniques to improve the error decoding capability.

(i) Correct the codeword using the Peterson-Gorenstein-Zierler (PGZ) [5] algorithm.
(ii) If that fails to correct the errors, successively apply 2, 4, 6, 8 erasures, deriving an error locator polynomial, until an error locator polynomial of correct degree is derived [6, 7].
(iii) If an error locator polynomial is identified, attempt to decode the word using the Forney-Massey-Berlekamp (FMB) method [8, 9].

Again, all operations are over GF(2^8). The details of this approach have been published previously [6].

4.2. Results

We started off with an original version of the code that was designed to use Galois Field operations. This base code was then rewritten to use SIMD forms of polynomial-multiply and -remainder operations.

The experiments that were run encode one RS(204,188) packet, artificially introduce enough errors to require 8 erasures, and then decode the packet. Note that this is the worst-case decode situation; in practice 98% of all packets have all syndromes equal to zero, so no error decoding is required.

Table 2 gives the details of the results of our experiments. In the encoder, the number of poly-remainder operations is almost the same as the number of poly-multiplies. Consequently, SIMD only achieves an 8x speedup, even though the SIMD variants of the polynomial instructions perform 16 poly-operations in parallel.

4.3. End-to-end Simulation Results

For the DVB-T case, at the highest bitrate of 31.67 Mbps, the decoder is called 21 763 times per second. The total number of cycles per second spent by the processor in vector mode for the GF operations only is less than 18 MHz (a fraction of the SBX processor capabilities), compared to 277 MHz in scalar mode. The iterative decoding algorithm was tested in the end-to-end DVB-T/H simulated system, specified by ETSI EN 300 744 V1.4.1 (2001-01). The simulations were performed using the SBX simulation tools. Using our GF instructions, the total number of cycles per second consumed in the SBX processor, for the highest bit rates specified in the standards and assuming that every packet has eight errors and eight erasures, is the following: 29 MHz for the 31.67 Mbps DVB-T, and 9 MHz for the 4.4 Mbps DVB-H including the optional second RS decoder at the link level.

5. Conclusions

The method we have described implements a GFM using 2 instructions, a poly-multiply and a poly-remainder. This allows the addition of GFM to a standard architecture without the need to introduce a special purpose register for the GFM. Further, both of these instructions can be implemented using the same hardware block.

We have shown that, for some applications, it is not necessary to execute both a poly-multiply and a poly-remainder for each GFM. In the cases where the results of several GFM are added together, the products of the corresponding poly-multiplies are summed and then a single poly-remainder is used. In one specific case, only 25% of the poly-multiplies required a poly-remainder. Our simulation results indicate a speedup of 11.5x of the extended processor versus the standard processor.