#### Abstract

Recently, we presented a novel Mastrovito form of the nonrecursive Karatsuba multiplier for all trinomials. Specifically, we found that the related Mastrovito matrix is very simple for an equally spaced trinomial (EST) combined with the classic Karatsuba algorithm (KA), which leads to a highly efficient Karatsuba multiplier. In this paper, we consider a new special class of irreducible trinomial, namely, . Based on a three-term KA and the shifted polynomial basis (SPB), a novel bit-parallel multiplier is derived with better space and time complexity. As a main contribution, the proposed multiplier costs about 2/3 of the circuit gates of the fastest multipliers, while its time delay matches our former result. To the best of our knowledge, this is the first time that the space complexity bound is reached without increasing the gate delay.

#### 1. Introduction

Efficient hardware implementation of finite field arithmetic, especially for , is frequently desired in coding theory and public-key cryptosystems [1, 2]. Among these arithmetic operations in , multiplication is the most important, as other more complicated field operations such as exponentiation and inversion can be carried out by iterated multiplications. Thus, it is necessary to design efficient multipliers.

The field elements are usually represented in a certain basis, such as a polynomial basis (PB), normal basis (NB), or dual basis (DB). In PB representation, multiplication consists of multiplying two polynomials and reducing the result modulo an irreducible polynomial. The choice of this irreducible polynomial is critical for performing the reduction efficiently, and irreducible trinomials are among the most common choices [3, 4]. In recent years, many bit-parallel multipliers using PB representation have been proposed for defined by irreducible trinomials, some of which can be found in [3, 5–8]. The efficiency of an architecture is evaluated by its space and time complexity. The former is expressed as the number of logic gates (XOR and AND), and the latter as the total delay of the XOR and AND gates on the critical path. Among these multipliers, the fastest bit-parallel multipliers to date are those proposed by Fan and Hasan [9] and Hariri and Reyhani-Masoleh [10]. If is defined by , the corresponding multiplier requires AND and XOR gates with time delay (for good fields, the time delay is ), where and are the circuit delay of one AND gate and one XOR gate, respectively. Besides these multipliers for general trinomials, there are also several proposals for special types of irreducible trinomials [11–13]. These multipliers usually exploit the special form of the trinomial to obtain an efficient implementation.
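The two steps of PB multiplication described above (polynomial multiplication, then reduction modulo the trinomial) can be sketched in software as follows. This is only a minimal arithmetic model, not a bit-parallel circuit: the function name `pb_mul`, the encoding of field elements as Python integers (bit i holds the coefficient of x^i), and the small example trinomial x^3 + x + 1 are illustrative assumptions and do not come from the cited multipliers.

```python
def pb_mul(a: int, b: int, m: int, k: int) -> int:
    """Multiply a and b in the field defined by x^m + x^k + 1 (0 < k < m).

    Elements are integers whose bit i is the coefficient of x^i
    (polynomial basis). Hypothetical helper for illustration only.
    """
    # Step 1: carry-less polynomial multiplication over GF(2).
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    # Step 2: reduce from the top degree down using
    # x^d = x^(d-m+k) + x^(d-m), which follows from x^m = x^k + 1.
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= (1 << d) | (1 << (d - m + k)) | (1 << (d - m))
    return prod


# Example with x^3 + x + 1: x * x^2 = x^3 = x + 1.
result = pb_mul(0b010, 0b100, 3, 1)
```

A hardware multiplier evaluates all output bits in parallel rather than looping; the loop here only makes the reduction rule explicit.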

The Karatsuba algorithm (KA) works recursively by breaking one large multiplication down into two or more submultiplications; it is a typical divide-and-conquer algorithm. The classic KA starts from a way to multiply two 2-term polynomials using three scalar multiplications. Other variations have also been investigated; more details can be found in [14–16]. The KA can be adopted to design subquadratic complexity multipliers [14, 17] or hybrid multipliers [18, 19]. In particular, there is another type of hybrid multiplier, namely, the nonrecursive Karatsuba multiplier, which applies the KA only once in the polynomial multiplication [8, 20]. These multipliers typically require about 3/4 of the circuit gates of the fastest bit-parallel multipliers, while their time delay increases by a small number of XOR gate delays. For example, the multiplier of Elia et al. [8] costs at least two more XOR gate delays.
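To make the idea of applying the KA only once concrete, the sketch below performs a single classic Karatsuba step on polynomials over GF(2) and leaves the three half-size products to a schoolbook routine. The names `clmul` and `karatsuba_once` and the integer bit-vector encoding are assumptions made for this illustration; the sketch models the arithmetic identity, not the gate-level architectures of [8, 20].

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication of polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_once(a: int, b: int, n: int) -> int:
    """One nonrecursive Karatsuba step for operands of at most 2n bits."""
    mask = (1 << n) - 1
    a0, a1 = a & mask, a >> n          # a = a1 * x^n + a0
    b0, b1 = b & mask, b >> n          # b = b1 * x^n + b0
    lo = clmul(a0, b0)
    hi = clmul(a1, b1)
    # Over GF(2), subtraction is XOR: a0*b1 + a1*b0 = (a0+a1)(b0+b1) + lo + hi.
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return lo ^ (mid << n) ^ (hi << (2 * n))
```

Three half-size products replace the four that a direct split would need, which is where the roughly 3/4 gate count comes from.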

Recently, we proposed a novel nonrecursive Karatsuba multiplier based on the Mastrovito approach [21]. It was shown that our multiplier requires only one more XOR gate delay than the fastest multipliers [9, 10]. However, it costs a few more logic gates than Elia's result. Apart from the nonrecursive Karatsuba multiplier for general trinomials, Shen and Jin [13] proposed a new Karatsuba multiplier that fully exploits the equally spaced trinomial and the classic KA to simplify the modular reduction. Consequently, the space complexity of their scheme matches Elia's result. Meanwhile, the time complexity is , which is roughly equal to the fastest results. Furthermore, we observe that the special case of our multiplier coincides with their scheme. (Here, the trinomial is an equally spaced trinomial.)

In this paper, we explore another special case of our former scheme to obtain even more efficient nonrecursive Karatsuba multipliers. Our main idea is analogous to that of Shen and Jin [13], where a special type of trinomial and a KA variation are utilized to simplify the structure of the corresponding Mastrovito matrix. More explicitly, we consider the irreducible trinomial and a three-term Karatsuba algorithm. It is demonstrated that the corresponding Mastrovito matrix can be simplified further under this condition. The shifted polynomial basis (SPB) [4] is also utilized to reduce the critical path delay further. Consequently, we propose a bit-parallel multiplier that costs approximately 2/3 of the circuit gates of the fastest bit-parallel multipliers. On the other hand, the time complexity is , which almost matches the best known results.

The rest of this paper is organized as follows. In Section 2, we briefly review the Mastrovito approach based on SPB representation and some relevant notions. In Section 3, we introduce a three-term KA formula, investigate the structure of the related Mastrovito matrix, and propose a new bit-parallel multiplier architecture. Section 4 presents a comparison between the proposed multiplier and some others. Finally, some conclusions are drawn.

#### 2. Preliminary

In this section, we briefly review some notation and algorithms used throughout this paper. Consider the finite field generated with an irreducible trinomial . Let be a root of and the set constitute a polynomial basis (PB). Therefore, every element of can be represented as a polynomial over of degree less than . The shifted polynomial basis (SPB) is a variation of the polynomial basis, obtained by multiplying the set by a certain power of .

*Definition 1 (see [4]).* Let be an integer and the ordered set be a polynomial basis of over . The ordered set is called the shifted polynomial basis with respect to .

Generally speaking, the optimal choice of for an irreducible trinomial is the degree of the middle term or that degree minus one [4]. In this case, we have and use this notation hereafter. It follows that a field element can be expressed with respect to the SPB as follows: Given two elements of under SPB representation, that is, , , the field multiplication can be performed as Obviously, the product is thus equal to Analogous to ordinary polynomial multiplication, this product can be computed by a matrix-vector multiplication , where express the coefficient vectors of and , and the matrix is given by The difference between the above matrix and the usual PB case [3] lies only in the row labels on the left side, which indicate the exponent of the indeterminate for each row.
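The SPB multiplication described above can be modelled in a few lines, under the assumption that the SPB is the set of powers x^(i-v) for i = 0, ..., m-1, as in the definition of [4]. With coefficient vectors A and B, the represented elements are A·x^(-v) and B·x^(-v), so the SPB coefficient vector of their product is A·B·x^(-v) reduced modulo the trinomial. All function names, the integer encoding, and the parameters used in the check (x^5 + x^2 + 1 with v = 2) are illustrative assumptions.

```python
def pb_mul(a: int, b: int, m: int, k: int) -> int:
    """Polynomial-basis product modulo x^m + x^k + 1 (bit i = coeff of x^i)."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= (1 << d) | (1 << (d - m + k)) | (1 << (d - m))
    return prod


def mul_x_inv(a: int, m: int, k: int) -> int:
    """Multiply by x^(-1), using x^(-1) = x^(m-1) + x^(k-1)."""
    low = a & 1
    a >>= 1
    if low:
        a ^= (1 << (m - 1)) | (1 << (k - 1))
    return a


def spb_to_pb(a: int, m: int, k: int, v: int) -> int:
    """Convert SPB coordinates (basis x^(i-v)) to PB coordinates."""
    for _ in range(v):
        a = mul_x_inv(a, m, k)
    return a


def spb_mul(a: int, b: int, m: int, k: int, v: int) -> int:
    """SPB coordinates of the product: A * B * x^(-v) mod the trinomial."""
    c = pb_mul(a, b, m, k)
    for _ in range(v):
        c = mul_x_inv(c, m, k)
    return c
```

In SPB coordinates the field identity 1 = x^(v-v) is the vector with a single 1 in position v, so `spb_mul(a, 1 << v, m, k, v)` returns `a`.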

We then reduce the above matrix in order to obtain the field product expressed in SPB representation. The reduced matrix, denoted by **M**, is called the Mastrovito matrix. Thus, the SPB field multiplication is rewritten as where denotes the coefficient vector of . The structure of **M** relies on and the modular reduction rule. In this case, we should obey the following reduction rule: However, if we directly reduce the product matrix presented in (4) using the above formulae and perform the matrix-vector multiplication, there is no difference between this computation and the general case. In the following section, we will construct a new Mastrovito matrix using a three-term Karatsuba algorithm and describe a highly efficient bit-parallel multiplier.
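As a rough illustration of the Mastrovito idea (folding the modular reduction into the matrix so that the product is one matrix-vector multiplication), the sketch below builds the reduced matrix column by column: column j holds the coefficients of A·x^(j-v) modulo x^m + x^k + 1. This is the direct construction for a general trinomial, not the simplified matrix derived in this paper; the function names, the integer encoding of columns, and the parameter v for the SPB shift are assumptions.

```python
def mul_x(a: int, m: int, k: int) -> int:
    """Multiply by x modulo x^m + x^k + 1."""
    a <<= 1
    if (a >> m) & 1:
        a ^= (1 << m) | (1 << k) | 1
    return a


def mul_x_inv(a: int, m: int, k: int) -> int:
    """Multiply by x^(-1), using x^(-1) = x^(m-1) + x^(k-1)."""
    low = a & 1
    a >>= 1
    if low:
        a ^= (1 << (m - 1)) | (1 << (k - 1))
    return a


def mastrovito_columns(a: int, m: int, k: int, v: int) -> list:
    """Columns of the reduced matrix; column j encodes A * x^(j-v) mod f."""
    col = a
    for _ in range(v):
        col = mul_x_inv(col, m, k)
    cols = []
    for _ in range(m):
        cols.append(col)
        col = mul_x(col, m, k)
    return cols


def mat_vec(cols: list, b: int) -> int:
    """Matrix-vector product over GF(2): XOR the columns selected by b."""
    c = 0
    for j, col in enumerate(cols):
        if (b >> j) & 1:
            c ^= col
    return c
```

Setting `v = 0` gives the ordinary PB Mastrovito matrix; for example, with x^3 + x + 1 the call `mat_vec(mastrovito_columns(0b010, 3, 1, 0), 0b100)` computes x · x^2.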

Moreover, one can check that the irreducible trinomial in the form of exists when where is a nonnegative integer [1]. Although irreducible trinomials of this type are not abundant, some still exist in the range of interest for practical applications.

Finally, we introduce some notation pertaining to matrices and vectors, which was already used in [21, 23] and is used extensively throughout this paper.

(i) represents the th row vector in matrix ;

(ii) represents the th column vector in matrix ;

(iii) represents the entry with position in matrix .

#### 3. Mastrovito Multiplier Using a Three-Term Karatsuba Algorithm

The Karatsuba algorithm [2] has been applied to improve the efficiency of bit-parallel multipliers for generated by an all-one polynomial (AOP) [20] and by a trinomial [8, 13, 21]. It starts from a way to multiply two two-term polynomials using three scalar multiplications, which can reduce the space complexity of the multipliers to approximately 3/4 of the original. Besides the classic algorithm, there exist several generalizations of the Karatsuba algorithm [14–16]. Here, we focus only on a simple variation, the three-term Karatsuba algorithm, which multiplies two three-term polynomials using six scalar multiplications. Given two three-term polynomials in , one can check that
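One standard form of the six-multiplication identity, which may be arranged differently from the formula intended here, can be checked numerically with the sketch below. The operands are split as A = a2·y^2 + a1·y + a0 and B = b2·y^2 + b1·y + b0 with y = x^n, and the names `clmul` and `karatsuba3_once` as well as the integer encoding are assumptions made for this illustration.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication of polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba3_once(a: int, b: int, n: int) -> int:
    """One three-term Karatsuba step for operands of at most 3n bits."""
    mask = (1 << n) - 1
    a0, a1, a2 = a & mask, (a >> n) & mask, a >> (2 * n)
    b0, b1, b2 = b & mask, (b >> n) & mask, b >> (2 * n)
    # Six sub-products instead of the nine needed by a direct split.
    d0, d1, d2 = clmul(a0, b0), clmul(a1, b1), clmul(a2, b2)
    d01 = clmul(a0 ^ a1, b0 ^ b1)
    d02 = clmul(a0 ^ a2, b0 ^ b2)
    d12 = clmul(a1 ^ a2, b1 ^ b2)
    c1 = d01 ^ d0 ^ d1            # a0*b1 + a1*b0
    c2 = d02 ^ d0 ^ d2 ^ d1       # a0*b2 + a1*b1 + a2*b0
    c3 = d12 ^ d1 ^ d2            # a1*b2 + a2*b1
    return d0 ^ (c1 << n) ^ (c2 << (2 * n)) ^ (c3 << (3 * n)) ^ (d2 << (4 * n))
```

Six products of one-third size in place of nine is what yields the approximate 2/3 gate count discussed in this paper.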

In general, Mastrovito multiplication utilizing the KA increases the time complexity. Our former result shows that a Mastrovito multiplier using the classic KA costs one more XOR gate delay than the fastest ones. However, the literature [13] indicates that this result can be further improved in some special cases, for example, the EST . In the following, we will show that for the trinomial , applying the three-term Karatsuba-like formula also simplifies the reduction operation and leads to a fast implementation.

Let be an irreducible trinomial and , be two field elements in SPB representation. We partition , into three parts, with each part consisting of bits. In order to simplify related expressions, we denote as . Then, where , , for . Then we multiply and using the three-term Karatsuba-like formula and apply the following transformation: where , , , , , . We divide (9) into two parts, and compute each part modulo independently.

##### 3.1. Computation of

We first consider the computation of in detail. Note that actually consists of three different parts: , , (the others can be obtained by shifting these parts). When is rewritten in matrix-vector form, we have For simplicity, we do not write the labels of the product matrix here, which indicate the degree of in . Note that these degrees are in the range . In the above expression, represent the coefficient vectors of , respectively. is a zero matrix, () are lower-triangular Toeplitz matrices, and () are upper-triangular Toeplitz matrices. Please note that the matrix on the right side actually contains rows and the product matrix in fact contains rows. However, the last row of the above matrix is **0**, which does not affect the result. These submatrices have the following form: for . Since the products contain terms of degrees outside the range , we have to perform the reduction operation for the product matrix in (21). According to the Mastrovito scheme, the reduction can be regarded as the construction of product matrices from using the reduction rule in (6). Let denote the Mastrovito matrix related to . We now investigate the construction details for this matrix . We have the following proposition.

Proposition 2. *The Mastrovito matrix can be constructed as where *

*Proof.* The proof is analogous to the proof of the observation in [21]. Note that the product matrix contains nonzero rows (the last row is a zero vector), each of which corresponds to a polynomial degree from to . It is easy to check that the first rows and the last rows correspond to degrees that are outside the range . Thus, we need to reduce these rows.

According to the reduction rule in (6), we have to reduce by adding them to the rows and and reduce the rows by adding them to the rows and . Obviously, the first row here is and the last rows constitute Comparing the row numbers yields the result immediately.

Based on Proposition 2, we can compute as follows:

By swapping and combining some overlapped entries, expression (16) can now be rewritten as We need only compute two submatrix-vector multiplications and add them up to obtain . Some tricks can be applied to save more logic gates. We mainly use the computation strategy presented in [7] and fully exploit the overlapped parts of the two matrices above. The computation can be divided into two steps:

(i) Perform row-vector products: in parallel. The symbol “” represents only the row-vector products related to (or ) and , . For example, represents computing the inner product , for in parallel.

(ii) Sum up all the entries of each row using a binary XOR tree. In particular, since some products in each row are zero, we first compute the following summations: using binary XOR trees and then add these results together.
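The delay contribution of step (ii) comes from the depth of the binary XOR tree. The following sketch, with the hypothetical helper name `xor_tree`, sums a list of bits pairwise level by level and reports how many XOR levels were needed; it is a software model of the tree, not the circuit itself.

```python
def xor_tree(bits: list) -> tuple:
    """Return (XOR of all bits, number of XOR levels) for a non-empty list."""
    level = list(bits)
    depth = 0
    while len(level) > 1:
        # Pair up neighbours; an unpaired last entry passes through unchanged.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth
```

Summing n terms this way uses n - 1 two-input XOR gates and ⌈log2 n⌉ levels, which is why the number of nonzero entries per row determines the XOR part of the critical path.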

*Remarks 3.* It is easy to see that the row-vector products (18) contain all the possible row-vector products in (17). In addition, since , , , and are all triangular matrices, one can easily check that each row of both and consists of at most nonzero entries. After the computation of (18) and (19), a certain number of XOR gates are required to obtain the final result. Table 1 summarizes the space and time complexity of for all the steps.

##### 3.2. Computation of

Next, we consider the computation of in detail. Since and , consist of bits, we can follow a similar line to the computation of to obtain the result. More explicitly, we rewrite in matrix-vector form: Here, () are lower-triangular Toeplitz matrices and () are upper-triangular Toeplitz matrices, which are constructed from the coefficients of and are similar to and . Vectors represent the coefficient vectors of .
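The role of the lower- and upper-triangular Toeplitz matrices can be illustrated with a generic unreduced polynomial product. In the sketch below, which assumes the usual convention that entry (i, j) of the lower block is a_(i-j) and entry (i, j) of the upper block is a_(n+i-j), stacking the two blocks gives all coefficients of the product of two n-term polynomials. The function names and the list-of-bits encoding are assumptions; the submatrices used in this paper are built from specific operand parts and are not reproduced here.

```python
def lower_toeplitz(a: list) -> list:
    """n x n lower-triangular Toeplitz matrix with entry (i, j) = a[i-j]."""
    n = len(a)
    return [[a[i - j] if i >= j else 0 for j in range(n)] for i in range(n)]


def upper_toeplitz(a: list) -> list:
    """n x n strictly upper-triangular Toeplitz matrix, entry (i, j) = a[n+i-j]."""
    n = len(a)
    return [[a[n + i - j] if j > i else 0 for j in range(n)] for i in range(n)]


def mat_vec_gf2(mat: list, b: list) -> list:
    """Matrix-vector product over GF(2): AND for products, XOR for sums."""
    return [sum(x & y for x, y in zip(row, b)) & 1 for row in mat]


def poly_mul_via_toeplitz(a: list, b: list) -> list:
    """Coefficients of a*b for degrees 0..2n-1 (the last one is always 0)."""
    return mat_vec_gf2(lower_toeplitz(a), b) + mat_vec_gf2(upper_toeplitz(a), b)


# (1 + x) * (1 + x^2) = 1 + x + x^2 + x^3
coeffs = poly_mul_via_toeplitz([1, 1, 0], [1, 0, 1])
```

The last row of the upper block is all zero, matching the remark that such a row does not affect the result.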

The reduction of modulo is simpler: we only need to eliminate the last rows by adding them to the rows labeled with and . Thus, we have

Analogous to the computation of , we first perform the row-vector products: in parallel. Then, we first compute the following summations: using binary XOR trees, and then add the related results together. Please note that each row of and consists of at most nonzero entries. We can calculate and in . Finally, we have to add all these summations to obtain the result. It costs more XOR gates with one XOR gate delay. Related space and time complexities for the computation of are summarized in Table 2.

From Tables 1 and 2, it is clear that the computations of , modulo have the same time delay, so they can be implemented in parallel. Finally, another XOR gates are needed to add the two results together, which also requires one XOR gate delay. As a consequence, the total space and time complexity of the proposed architecture are Furthermore, if where is relatively small compared with , we have . In this case, the time delay of our architecture becomes , which is almost equal to the delay of the fastest bit-parallel multipliers [9].

#### 4. Theoretic Comparison

Table 3 compares different implementation methods of bit-parallel multipliers in the fields generated by trinomials . From Table 3, we can see that our multiplier requires about 2/3 of the circuit gates of previous architectures that do not use a divide-and-conquer algorithm. On the other hand, the time complexity of the proposed multiplier is , which is very close to the fastest result. In fact, we have checked trinomials of this type with degree , , and found that 585 of them reach the bound (the others require only one more XOR gate delay).

In Table 4, we give a small example of the field defined by . It shows that, compared with other approaches, our architecture may be the best choice if space and time complexity are both considered. In addition, the comparison with the fastest Karatsuba multiplier for general trinomials [21] indicates that the space and time complexities can be reduced even further if a special KA and a special irreducible polynomial are combined.

#### 5. Conclusion

In this paper, a new Mastrovito multiplier architecture for trinomials of the form is proposed. We show that the space and time complexity of our former Mastrovito-Karatsuba multiplier can be further reduced for a special form of trinomial combined with a KA variation. This multiplier can be used in area-critical applications because it has low space complexity while maintaining a relatively low time delay. Finding more polynomials to which the proposed strategy applies is left for future work.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work is supported by the Natural Science Foundation of China (nos. 61402393, 61601396) and Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (no. AGK201607).