Abstract
This paper presents a novel approach to implementing multiplication in Galois Fields with 2^{N} elements. Elements of GF(2^{N}) can be represented as polynomials of degree less than N over GF(2). Operations are performed modulo an irreducible polynomial of degree N over GF(2). Our approach splits a Galois Field multiply into two operations, polynomial-multiply and polynomial-remainder over GF(2). We show how these two operations can be implemented using the same hardware. Further, we show that in many cases several polynomial-multiply operations can be combined before needing to do a polynomial-remainder. The Sandblaster 2.0 is a SIMD architecture. It has SIMD variants of the poly-multiply and poly-remainder instructions. We use a Reed-Solomon encoder and decoder to demonstrate the performance of our approach. Our new approach achieves a speedup of 11.5x, compared to 8x for the standard SIMD processor.
1. Introduction
Galois Field arithmetic is widely used in applications such as error-correcting codes and cryptography. Generally, the Galois Fields used are GF(2^{N}) for some N. Elements of GF(2^{N}) may be represented as polynomials of degree less than N over GF(2). Operations are performed modulo some polynomial P, where P is an irreducible polynomial of degree N over GF(2). P is known as the prime polynomial. The multiplication of two elements X and Y can be accomplished by multiplying their polynomial representations, and then computing the remainder modulo P.
Conventionally, these polynomials are represented as binary numbers, where the term x^{i} is represented by setting the i-th bit to 1 or 0 depending on whether that term is present. Thus, the polynomial x^{4} + x + 1 would be represented as 10011.
Addition of GF(2^{N}) values in this representation is straightforward; it is simply an exclusive-or (xor) of the two binary numbers. Galois Field multiplication (GFM), however, is much more complicated. It involves the following steps.
(i) Do a polynomial-multiply of two inputs.
(ii) Do a polynomial-remainder of the product modulo a third input, the prime polynomial.
In software, GF multiplications are usually performed using Lookup Tables (LUTs). For large N, the LUT requires a prohibitively large amount of memory. The processing time also becomes prohibitive at high data rates.
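The two steps listed above can be sketched in software as follows. This is a minimal illustration using the conventional right-aligned binary representation, not the instruction-level scheme developed later in the paper; the function names `poly_multiply`, `poly_remainder`, and `gf_multiply` are hypothetical.

```python
def poly_multiply(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials (right-aligned bits)."""
    product = 0
    while b:
        if b & 1:          # this term of b is present
            product ^= a   # polynomial addition over GF(2) is xor
        a <<= 1
        b >>= 1
    return product


def poly_remainder(value: int, prime: int) -> int:
    """Remainder of value modulo prime, both GF(2) polynomials."""
    prime_bits = prime.bit_length()
    while value.bit_length() >= prime_bits:
        # Cancel the current leading term of value.
        value ^= prime << (value.bit_length() - prime_bits)
    return value


def gf_multiply(x: int, y: int, prime: int) -> int:
    """Galois Field multiply: step (i) followed by step (ii)."""
    return poly_remainder(poly_multiply(x, y), prime)


# GF(2^4) with prime polynomial x^4 + x + 1 (10011):
# x * x^3 = x^4, which reduces to x + 1.
result = gf_multiply(0b0010, 0b1000, 0b10011)   # 0b0011
```

Addition needs no helper in this representation: the sum of `0b1010` and `0b0110` is simply `0b1010 ^ 0b0110`.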
To further complicate issues, processors need to be able to handle Galois Fields of different lengths. Consequently, several processors have added instructions for Galois Field Multiplication (GFM). The most representative is the TI C64x DSP.
A general-purpose GFM instruction needs 4 inputs: the two multiplicands, the prime polynomial, and the length. At least 3 of these, the two multiplicands and the prime polynomial, must be in registers. Since most instruction sets do not provide 4 input fields, a GFM instruction generally uses a special-purpose register that provides either the length or the prime, or both. For instance, the GMPY4 operation in the TI C64x DSP uses the GFPGFR special-purpose register to specify the length and prime polynomial [1].
In this paper, we describe an approach that implements GFM using two instructions, one of which implements polynomial-multiply over GF(2), and the other of which implements polynomial-remainder over GF(2). In the Sandblaster 2.0 architecture [2], these instructions are called gfmul and gfnormi, respectively. Both of these instructions use two register inputs. The gfnormi instruction additionally has a third, immediate input, the length of the polynomial.
The Sandblaster 2.0 has a 16-way SIMD unit; consequently, we also have 16-way variants of the polynomial-multiply and polynomial-remainder instructions, called rgfmul and rgfnorm. Additionally, the SIMD unit supports polynomial-multiply-and-add and polynomial-multiply-and-reduce instructions called rgfmac and rgfmulred.
It turns out that it is possible to specify the gfmul and gfnormi operations in such a fashion that we can use almost identical hardware to implement both functions. Consequently, there is no hardware overhead to split the GFM operation into two operations.
The Galois Field sum of several GFM operations can be simplified to the polynomial sum (i.e., xor) of several polynomial-multiplies, followed by a single polynomial-remainder. Such sums are quite common in algorithms that use Galois Field arithmetic. In those cases, we can implement the sum of N GFM operations using N polynomial-multiplies and 1 polynomial-remainder, so our split implementation incurs an overhead of only 1 instruction.
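The simplification works because the polynomial remainder is linear over xor, so reducing once after summing gives the same value as reducing every product. A minimal sketch of that equivalence follows; the helper names, the sample vectors, and the prime polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) are assumptions chosen for illustration only.

```python
def poly_multiply(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials (right-aligned bits)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return product


def poly_remainder(value: int, prime: int) -> int:
    """Remainder of value modulo prime, both GF(2) polynomials."""
    prime_bits = prime.bit_length()
    while value.bit_length() >= prime_bits:
        value ^= prime << (value.bit_length() - prime_bits)
    return value


def gf_sum_per_product(xs, ys, prime):
    """N multiplies and N remainders: reduce every product, then xor."""
    total = 0
    for x, y in zip(xs, ys):
        total ^= poly_remainder(poly_multiply(x, y), prime)
    return total


def gf_sum_deferred(xs, ys, prime):
    """N multiplies and 1 remainder: xor raw products, reduce once."""
    total = 0
    for x, y in zip(xs, ys):
        total ^= poly_multiply(x, y)
    return poly_remainder(total, prime)


PRIME = 0x11D                      # assumed GF(2^8) prime polynomial
xs = [0x53, 0xCA, 0x01, 0xFF]      # arbitrary sample operands
ys = [0x02, 0x8F, 0xFF, 0x80]
same = gf_sum_per_product(xs, ys, PRIME) == gf_sum_deferred(xs, ys, PRIME)
```

For 8-bit operands every unreduced product fits in 15 bits, so the running xor also fits in 15 bits and a single final remainder is sufficient.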
Section 2 of this paper describes the GFM instructions in the Sandblaster 2.0 architecture. Section 3 focuses on the implementation of the operations. Section 4 examines the performance of GFM instructions in the context of Reed-Solomon encoding/decoding. We conclude in Section 5.
2. Galois Field Multiply Instructions
The Sandblaster scalar unit has 16 32-bit general-purpose registers. Like most RISC architectures, at most 2 registers can be read and 1 written per integer operation. An integer operation has fields to specify up to 3 registers. An extended immediate variant of the instruction can additionally provide up to 12 bits of immediate data.
2.1. Polynomial Representation
In the customary binary representation of polynomials over GF(2), the bits are right-aligned, with the LSB representing the coefficient of term x^{0}, and bit i representing the coefficient of term x^{i}. By contrast, we left-align the coefficients, so that the coefficient of the largest term of the polynomial is represented by the MSB. For a polynomial of degree N, bit 31-i of the general-purpose register represents the coefficient of term x^{N-i}.
Note that in this representation, without knowing the length of the polynomial, we cannot be sure which polynomial is represented by a specific number. For instance, 0xb000_0000 could be interpreted as x^{3} + x + 1 if the polynomial is of degree 3, or as x^{5} + x^{3} + x^{2} if the polynomial is of degree 5.
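The ambiguity can be made concrete with a small sketch of the left-aligned mapping, in which bit 31-i holds the coefficient of x^{N-i}. The helper names `to_left_aligned` and `left_aligned_terms` are hypothetical.

```python
def to_left_aligned(poly: int, degree: int) -> int:
    """Place a right-aligned polynomial of the given degree at the top
    of a 32-bit register, so the x^degree coefficient lands in bit 31."""
    return (poly << (31 - degree)) & 0xFFFFFFFF


def left_aligned_terms(register: int, degree: int) -> list:
    """Exponents whose coefficient is 1, reading bit 31-i as x^(degree-i)."""
    return [degree - i for i in range(degree + 1)
            if (register >> (31 - i)) & 1]


reg = to_left_aligned(0b1011, 3)            # x^3 + x + 1 -> 0xB0000000
as_degree_3 = left_aligned_terms(reg, 3)    # [3, 1, 0]
as_degree_5 = left_aligned_terms(reg, 5)    # [5, 3, 2]
```

The same register value decodes to two different polynomials, which is why the length has to be supplied separately.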
We picked this representation to make it easier to compute the remainder. Since the high-order terms of the divisor and dividend are left-aligned, we can start subtraction without requiring any shifting to line up the start of the polynomials.
For correctness, it is assumed that all unused bits in the register are 0. Both polynomial-multiply and polynomial-remainder are implemented so that they leave their results left-aligned with unused bits as 0.
There is one wrinkle in the representation. We assume that the polynomial remainder is performed with a left-aligned divisor, so that the MSB is always 1. In this case, representing the leading coefficient is redundant. So, we do not represent the leading bit of a divisor polynomial. Instead, the MSB represents the coefficient of the second-highest term, x^{N-1}. For instance, the divisor polynomial x^{6} + x^{3} + 1 is represented as 0x2400_0000; that is, with the leading 6 bits being 001001.
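A short sketch of this divisor encoding, assuming the polynomial is supplied in the customary right-aligned form; the function name `encode_divisor` is hypothetical.

```python
def encode_divisor(poly: int) -> int:
    """Drop the (always-1) leading coefficient of a divisor polynomial
    and left-align the remaining bits in a 32-bit register."""
    degree = poly.bit_length() - 1
    without_leading = poly ^ (1 << degree)      # clear the x^degree term
    return (without_leading << (32 - degree)) & 0xFFFFFFFF


# x^6 + x^3 + 1 is 1001001; dropping the leading 1 leaves 001001.
rp = encode_divisor(0b1001001)   # 0x24000000
```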
2.2. Polynomial Operations
The poly-multiply instruction in the Sandblaster architecture, gfmul, has the following format:
gfmul rc,ra,rb
It does a polynomial multiplication of the uppermost 8 bits of ra with the uppermost 8 bits of rb, and writes the 15-bit result of the poly-multiply into the upper bits of the target register rc. The remaining bits of rc are zeroed.
The poly-remainder instruction in the Sandblaster architecture, gfnormi, has the format
gfnormi rt,rc,rp,J
The dividend is the 32-bit number formed by the upper 16 bits of rc right-padded with 0s. The divisor is the 17-bit number formed by prepending a 1 to the upper 16 bits of rp. J is an immediate operand ranging from 0 to 7. The gfnormi instruction performs J+1 poly-division steps, leaving the remainder in the 16-(J+1) uppermost bits of the target register rt.
2.3. Galois Field Multiplication
Implementing a GFM over GF(2^{K}) with a (K+1)-bit prime polynomial P uses the following setup:
(i) the product inputs are stored in the upper K bits of two registers, ra and rb,
(ii) the leading bit of P is dropped and the remaining K bits are stored in the upper K bits of a register, rp,
(iii) all unused bits are set to 0.
After executing the following code sequence, the final result of the GFM will be the upper K bits of rt:
gfmul rc,ra,rb
gfnormi rt,rc,rp,K-1
Table 1 shows an example of Galois Field multiplication over GF(2^{6}) of the two numbers 101100 and 011011 with the prime polynomial 1001001. This results in an intermediate product of 01111010100 with a final remainder of 101010. In Table 1, the column on the right shows how the values will be stored in the corresponding registers.
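The GF(2^{6}) example can be reproduced with a small register-level model. This is a sketch under stated assumptions, not the hardware definition: `gfmul` and `gfnorm_steps` are Python stand-ins for the instructions, and `gfnorm_steps` takes the number of division steps directly (5 for this example, as in the worked division of Section 3.1) rather than modelling the encoding of the immediate operand.

```python
MASK32 = 0xFFFFFFFF


def gfmul(ra: int, rb: int) -> int:
    """Model of gfmul: poly-multiply the top 8 bits of ra and rb and
    left-align the 15-bit product in a 32-bit register."""
    a, b = ra >> 24, rb >> 24
    product = 0
    for i in range(8):
        if (b >> i) & 1:
            product ^= a << i
    return (product << 17) & MASK32


def gfnorm_steps(rc: int, rp: int, steps: int) -> int:
    """Model of the poly-remainder: rp holds the divisor with its
    leading 1 dropped, so each step shifts res and, if the bit shifted
    out was 1, xors in rp."""
    res = rc & 0xFFFF0000
    div = rp & 0xFFFF0000
    for _ in range(steps):
        msb = res >> 31
        res = (res << 1) & MASK32
        if msb:
            res ^= div
    return res


K = 6
ra = 0b101100 << (32 - K)      # first input, left-aligned
rb = 0b011011 << (32 - K)      # second input, left-aligned
rp = 0b001001 << (32 - K)      # prime 1001001 with leading bit dropped

rc = gfmul(ra, rb)                  # 0x7A800000: 01111010100 left-aligned
rt = gfnorm_steps(rc, rp, K - 1)    # 0xA8000000
result = rt >> (32 - K)             # 0b101010
```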
2.4. SIMD
The SIMD unit in the Sandblaster 2.0 architecture has eight SIMD registers, each holding 16 16-bit elements, as well as four accumulator registers. The instruction encoding for SIMD operations allows for 4 input fields. The SIMD unit allows 3 registers to be read and 2 to be written by an instruction.
The SIMD unit supports GFM through the rgfmul and rgfnorm instructions, which have the following format:
rgfmul vc,va,vb
rgfnorm vt,vc,vp,J
These instructions do 16 poly-multiplies/poly-remainders in parallel. Since the SIMD register elements are 16 bits wide, rgfmul uses the upper 8 bits of each element, while rgfnorm uses the entire 16 bits of the element. Other than that, their behavior is identical to the gfmul/gfnormi instructions.
Up to three SIMD registers can be read per cycle; we use the extra read port to implement a poly-multiply-and-add instruction with the format:
rgfmac vc,va,vb,vs
The rgfmac instruction does 16 poly-multiplies of the 16 elements of va and vb, and then poly-adds (xors) the products with the corresponding elements of vs.
The SIMD unit has an idiom where the 16 results of element-wise operations (such as rgfmul) are combined together to form a single value that is written to the accumulator. The poly-multiply-and-sum-reduce instruction follows this idiom:
rgfmulred act,va,vb
The 16 elements of va and vb are poly-multiplied pairwise, and the 16 resulting products are poly-summed (xored) together to form a single 16-bit value that is written to the accumulator register.
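The lane-wise behaviour of these three instructions can be modelled as below. This is an illustrative sketch only: SIMD registers are represented as Python lists of 16 integers, and the assumption that each 15-bit product is left-aligned within its 16-bit element follows the description of gfmul rather than a documented SIMD encoding.

```python
from functools import reduce
from operator import xor

LANES = 16


def lane_polymul(a16: int, b16: int) -> int:
    """Poly-multiply the top 8 bits of two 16-bit elements; the 15-bit
    product is left-aligned in a 16-bit result."""
    a, b = a16 >> 8, b16 >> 8
    product = 0
    for i in range(8):
        if (b >> i) & 1:
            product ^= a << i
    return (product << 1) & 0xFFFF


def rgfmul(va, vb):
    """16 independent poly-multiplies."""
    return [lane_polymul(a, b) for a, b in zip(va, vb)]


def rgfmac(va, vb, vs):
    """Poly-multiply-and-add: xor each product with the matching vs lane."""
    return [p ^ s for p, s in zip(rgfmul(va, vb), vs)]


def rgfmulred(va, vb):
    """Poly-multiply-and-sum-reduce: xor all 16 products into one value."""
    return reduce(xor, rgfmul(va, vb), 0)


va = [0xB000] + [0] * (LANES - 1)   # 101100 left-aligned in lane 0
vb = [0x6C00] * LANES               # 011011 left-aligned in every lane
products = rgfmul(va, vb)
accumulator = rgfmulred(va, vb)
```

Because only lane 0 of `va` is non-zero here, the reduction returns exactly the lane-0 product.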
3. Implementation
The gfnormi and gfmul instructions can be implemented by the same block with very little overhead. As can be seen from the pseudocode in Algorithm 1, the algorithms for the two involve the same computational kernel and differ in their setup and controls. They are described in detail below.

3.1. gfnormi
The gfnormi instruction computes the remainder using polynomial long division. Since the values are left-aligned, we start the process at bit 31 of the dividend value. The divisor consists of a leading 1 and the upper 16 bits of the divisor register. The immediate argument to the gfnormi instruction specifies the number of divide steps executed, J+1. For example, 5 steps of the division of 011.1101.0100 by 100.1001 will proceed as follows:
    01111010100000000    dividend
    00000000000000000    leading bit 0: xor with 0
    11110101000000000
    10010010000000000    leading bit 1: xor with divisor
    11001110000000000
    10010010000000000
    10111000000000000
    10010010000000000
    01010100000000000
    00000000000000000
    1010100000000000     remainder
At each step, the result is xored with 0 or with the divisor, depending on whether the leading bit is 0 or 1. The result is then left-shifted by 1 to ensure that the remainder after the division step is left-aligned. Note that xoring with 0 is the identity operation; this results in just a left-shift. This is done when the intermediate remainder is smaller than the divisor.
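The division steps above can be replayed with a short sketch that records the intermediate remainder after each step. It assumes 17-bit working values holding the full divisor including its leading 1, as in the hand calculation; the function name `division_trace` is hypothetical.

```python
def division_trace(dividend_bits: str, divisor_bits: str,
                   steps: int, width: int = 17) -> list:
    """Run `steps` xor-then-shift division steps on left-aligned
    bit strings and return the value after each step."""
    mask = (1 << width) - 1
    res = int(dividend_bits.ljust(width, "0"), 2)
    div = int(divisor_bits.ljust(width, "0"), 2)
    rows = []
    for _ in range(steps):
        if res >> (width - 1):       # leading bit is 1: subtract divisor
            res ^= div
        res = (res << 1) & mask      # keep the remainder left-aligned
        rows.append(format(res, f"0{width}b"))
    return rows


rows = division_trace("01111010100", "1001001", steps=5)
remainder = rows[-1][:6]   # upper 6 bits of the last row
```

After the fifth step the upper 6 bits hold the remainder of the GF(2^{6}) example.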
3.2. gfmul
Each poly-multiply step needs to follow the same pattern as the poly-remainder so that much of the hardware is common. If we are going to do J+1 steps, we do the following:
(i) the partial result is initialized to all 0s,
(ii) the “divisor” at each step is one of the multiply inputs prepended with J+1 zeroes,
(iii) the controls that select whether the xor is with the divisor or with 0s are the bits of the other multiply input, starting with the uppermost bit.
The example below multiplies 101100 and 011011 using 6 steps. 011011 is used as the “divisor” and 101100 is used as the control input.
    0000000000000000    initial partial result
    0000000110110000    control bit 1: xor with “divisor”
    0000001101100000
    0000000000000000    control bit 0: xor with 0
    0000011011000000
    0000000110110000
    0000111011100000
    0000000110110000
    0001111010100000
    0000000000000000
    0011110101000000
    0000000000000000
    0111101010000000    product
The gfmul instruction always does 8 steps of multiply. Consequently, in the implementation, the “divisor” is prepended by 8 zeroes.
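The multiply example can be replayed in the same xor-then-shift form as the division. The sketch below assumes 16-bit working values and takes the operands as bit strings; the function name `multiply_trace` is hypothetical.

```python
def multiply_trace(multiplicand_bits: str, control_bits: str,
                   width: int = 16) -> list:
    """Poly-multiply by repeated xor-select and shift, returning the
    partial result after each step as a bit string."""
    steps = len(control_bits)
    mask = (1 << width) - 1
    # The "divisor" is the multiplicand prepended with `steps` zeroes.
    div = int(("0" * steps + multiplicand_bits).ljust(width, "0"), 2)
    res = 0
    rows = []
    for bit in control_bits:         # uppermost control bit first
        if bit == "1":
            res ^= div
        res = (res << 1) & mask
        rows.append(format(res, f"0{width}b"))
    return rows


rows = multiply_trace("011011", "101100")
product = rows[-1][:11]   # left-aligned 11-bit product
```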
3.3. Results
The unified block that implements gfnormi and gfmul in the SB3500 consists of some setup followed by 8 stages of a computational kernel. The computation in each stage is an xor-select, as shown in Figure 1.
In the case of the polynomial remainder operation, gfnormi, the result (res) and divisor (div) values are set up from the values of the ra and rb registers. The count (N) is set to the immediate value specified in the operation plus 1. For the first N of the 8 xor-shift stages, if the MSB in res is 1, res is shifted and xored with the value in div; otherwise it is just shifted by 1.
For the polynomial multiply operation, res is set to 0 and div is set to the value of the rb register prepended by 8 zeroes; the count N is always 8. The top 8 bits of the ra register are used as controls to the 8 xor-shift stages; if the corresponding bit is 1, res is shifted and xored with the value in div; otherwise it is just shifted by 1.
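A minimal sketch of how one stage can serve both operations is given below. It is a software model under stated assumptions, not the SB3500 logic: each stage is written as xor-select followed by a left shift (the form used in the worked examples of Sections 3.1 and 3.2), the remainder path uses the full divisor with its leading 1 restored, and the names `xor_select_stage` and `gf_block` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def xor_select_stage(res: int, div: int, select: int) -> int:
    """Shared stage: xor with div when selected, then shift left by 1."""
    if select:
        res ^= div
    return (res << 1) & MASK32


def gf_block(op: str, ra: int, rb: int, count: int = 8) -> int:
    """Run the shared stage with operation-specific setup and controls."""
    if op == "mul":
        res = 0
        div = (rb >> 24) << 16                 # operand prepended by 8 zeroes
        controls = [(ra >> (31 - i)) & 1 for i in range(8)]
        count = 8                              # multiply always uses 8 stages
    elif op == "rem":
        res = ra & 0xFFFF0000                  # dividend
        div = 0x80000000 | ((rb & 0xFFFF0000) >> 1)   # restore leading 1
        controls = None                        # select comes from res itself
    else:
        raise ValueError("op must be 'mul' or 'rem'")
    for i in range(count):
        select = controls[i] if controls is not None else res >> 31
        res = xor_select_stage(res, div, select)
    return res


product = gf_block("mul", 0xB0000000, 0x6C000000)        # 0x7A800000
remainder = gf_block("rem", product, 0x24000000, count=5)  # 0xA8000000
```

Only the setup and the source of the select signal differ between the two paths; the loop body is identical, which is the property the unified hardware block relies on.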
From the diagram shown in Figure 1, it is clear that the critical path for each of the 8 xor-select stages in a naïve implementation uses two 2-1 muxes. Adding in the initial setup, this gives a total delay for the entire block of about 17 2-1 muxes. In the TSMC 65 nm low-power (TSMC65LP) process, the critical path is about 0.9 nanoseconds at an area of 2856 µm^{2} using regular-Vt transistors and typical timing.
The SB3500 as implemented is targeted for a 1.6 nanosecond clock. It has a 2-stage execute pipeline, so the gfop block is pipelined across 2 stages. This gives the synthesis tool 3.3 nanoseconds to implement the block. The synthesis tool used this relaxed timing to pick a power- and area-optimized implementation. In this implementation the gfop block occupies approximately 2018 µm^{2}, not including the pipeline registers.
It is possible to implement various lookahead schemes that would reduce the critical path at the expense of extra logic. Since we have ample slack, we did not investigate any area/speed tradeoffs.
4. Reed-Solomon
We have implemented an RS encoder/decoder that is designed to run on a SIMD architecture. The numbers presented in this section are tuned for the DVB (digital video broadcasting) standard. This standard uses RS(204,188) encoding; that is, it adds 16 check symbols to a 188-byte packet, resulting in a total code word of length 204 bytes.
4.1. Algorithm
The RS encoder used in this study does the following steps [3]:
(i) append N zeroes to a data block,
(ii) perform successive Horner reduction of the polynomial whose coefficients are the data block plus zeroes to obtain remainders,
(iii) multiply remainders by precalculated coefficients and sum.
All operations are over GF(2^{8}). The RS decoder [4] starts with syndrome computation, which computes the dot-product of the received codeword with precomputed syndrome vectors. If the syndromes are all zero, then there are no errors.
Our implementation combines several techniques to improve the error decoding capability.
(i) Correct the codeword using the Peterson-Gorenstein-Zierler (PGZ) algorithm [5].
(ii) If that fails to correct the errors, successively apply 2, 4, 6, 8 erasures, deriving an error locator polynomial, until an error locator polynomial of correct degree is derived [6, 7].
(iii) If an error locator polynomial is identified, attempt to decode the word using the Forney-Massey-Berlekamp (FMB) method [8, 9].
Again, all operations are over GF(2^{8}). The details of this approach have been published previously [6].
4.2. Results
We started off with an original version of the code that was designed to use Galois Field operations. This base code was then rewritten to use SIMD forms of the polynomial-multiply and polynomial-remainder operations.
The experiments that were run encode one RS(204,188) packet, artificially introduce enough errors to require 8 erasures, and then decode the packet. Note that this is the worst-case decode situation; in practice 98% of all packets have all syndromes equal to zero, so no error decoding is required.
Table 2 gives the details of the results of our experiments. In the encoder, the number of poly-remainder operations is almost the same as the number of poly-multiplies. Consequently, SIMD only achieves an 8x speedup, even though the SIMD variants of the polynomial instructions perform 16 poly-operations in parallel.
4.3. End-to-end Simulation Results
For the DVB-T case, at the highest bit rate of 31.67 Mbps, the decoder is called 21 763 times per second. The cycle budget spent by the processor in vector mode on the GF operations alone is less than 18 MHz (a fraction of the SBX processor capabilities), compared to 277 MHz in scalar mode. The iterative decoding algorithm was tested in the end-to-end DVB-T/H simulated system, specified by ETSI EN 300 744 V1.4.1 (2001-01). The simulations were performed using the SBX simulation tools. Using our GF instructions, the total number of cycles per second consumed in the SBX processor, for the highest bit rates specified in the standards and assuming that every packet has eight errors and eight erasures, is the following: 29 MHz for the 31.67 Mbps DVB-T, and 9 MHz for the 4.4 Mbps DVB-H including the optional second RS decoder at the link level.
5. Conclusions
The method we have described implements a GFM using 2 instructions, a poly-multiply and a poly-remainder. It allows the addition of GFM to a standard architecture without the need to introduce a special-purpose register for the GFM. Further, both of these instructions can be implemented using the same hardware block.
We have shown that, for some applications, it is not necessary to execute both a poly-multiply and a poly-remainder for each GFM. In the cases where the results of several GFMs are added together, the products of the corresponding poly-multiplies are summed and then a single poly-remainder is used. In one specific case, only 25% of the poly-multiplies required a poly-remainder. Our simulation results indicate a speedup of 11.5x of the extended processor versus the standard processor.