Research Article

Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor

Algorithm 2

Efficient matrix multiplication and accumulation.
Require: Matrix A (M × N matrix), Matrix S (L × N matrix), Matrix E (M × L matrix)
Ensure: Matrix E (M × L matrix)
1: for i from 0 to M do
2:   for j from 0 to L do
3:    sum_vect = NEON_Lane_Broadcast(0);
4:    for k from 0 to iter_k do
5:      a_vec = NEON_Vector_Load (A + i * N + k * LANES_SHORT_NUM);
6:      s_vec = NEON_Vector_Load (S + j * N + k * LANES_SHORT_NUM);
7:      sum_vect = NEON_Multiply_Accumulate(sum_vect, a_vec, s_vec);
8:    NEON_Vector_Store (sum, sum_vect);
9:    E[i * L + j] += sum[0] + sum[1] + sum[2] + sum[3] + sum[4] + sum[5] + sum[6] + sum[7];
10:    if (k == N / LANES_SHORT_NUM) && (N % LANES_SHORT_NUM)
11:      for k from N - (N % LANES_SHORT_NUM) to N do
12:         E[i * L + j] += A[i * N + k] * S[j * N + k];
13: Return E;
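
The steps of Algorithm 2 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `matmul_acc`, the use of `uint16_t` elements with arithmetic wrapping modulo 2^16, and the value `LANES_SHORT_NUM = 8` (eight 16-bit lanes in a 128-bit register) are assumptions. S is read as L rows of length N, which is the layout implied by the load address `S + j * N + k * LANES_SHORT_NUM`. On targets without NEON, a portable fallback that mimics the eight independent lanes is compiled instead.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

#define LANES_SHORT_NUM 8 /* assumed: eight 16-bit lanes per 128-bit register */

/* Hypothetical helper: E (M x L) += A (M x N) times S, where S is stored as
 * L rows of length N. All arithmetic wraps modulo 2^16 (an assumption). */
void matmul_acc(const uint16_t *A, const uint16_t *S, uint16_t *E,
                size_t M, size_t N, size_t L)
{
    const size_t iter_k = N / LANES_SHORT_NUM;

    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < L; j++) {
            const uint16_t *a_row = A + i * N;
            const uint16_t *s_row = S + j * N;
            uint16_t sum[LANES_SHORT_NUM] = {0};

#if HAVE_NEON
            /* Steps 3 to 8: lane-wise multiply-accumulate, then store. */
            uint16x8_t sum_vect = vdupq_n_u16(0);
            for (size_t k = 0; k < iter_k; k++) {
                uint16x8_t a_vec = vld1q_u16(a_row + k * LANES_SHORT_NUM);
                uint16x8_t s_vec = vld1q_u16(s_row + k * LANES_SHORT_NUM);
                sum_vect = vmlaq_u16(sum_vect, a_vec, s_vec);
            }
            vst1q_u16(sum, sum_vect);
#else
            /* Portable fallback: eight independent 16-bit accumulators. */
            for (size_t k = 0; k < iter_k; k++) {
                for (size_t l = 0; l < LANES_SHORT_NUM; l++) {
                    size_t idx = k * LANES_SHORT_NUM + l;
                    sum[l] = (uint16_t)(sum[l] + (uint32_t)a_row[idx] * s_row[idx]);
                }
            }
#endif

            /* Step 9: horizontal sum of the lanes into E[i * L + j]. */
            uint16_t acc = E[i * L + j];
            for (size_t l = 0; l < LANES_SHORT_NUM; l++)
                acc = (uint16_t)(acc + sum[l]);

            /* Steps 10 to 12: scalar tail for the N % LANES_SHORT_NUM leftovers. */
            for (size_t k = iter_k * LANES_SHORT_NUM; k < N; k++)
                acc = (uint16_t)(acc + (uint32_t)a_row[k] * s_row[k]);

            E[i * L + j] = acc;
        }
    }
}
```

Storing S row by row along the k dimension lets both operands be fetched with contiguous vector loads, so the inner loop needs no gather or transpose. The products are widened to `uint32_t` in the scalar paths only to avoid signed-integer overflow from C's integer promotion; the result is still reduced modulo 2^16, matching what the 16-bit lanes compute.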