Research Article
Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor
Algorithm 2
Efficient matrix multiplication and accumulation.
Require: Matrix A, Matrix S, Matrix E
Ensure: Matrix E
for i from 0 to M do
    for j from 0 to L do
        sum_vect = NEON_Lane_Broadcast(0);
        for k from 0 to iter_k do
            a_vec = NEON_Vector_Load(A + i*N + k*LANES_SHORT_NUM);
            s_vec = NEON_Vector_Load(S + j*N + k*LANES_SHORT_NUM);
            sum_vect = NEON_Multiply_Accumulate(sum_vect, a_vec, s_vec);
        NEON_Vector_Store(sum, sum_vect);
        E[i*L + j] += sum[0] + sum[1] + sum[2] + sum[3] + sum[4] + sum[5] + sum[6] + sum[7];
        if (k == N/LANES_SHORT_NUM) && (N % LANES_SHORT_NUM)
            for k from N - (N % LANES_SHORT_NUM) to N do
                E[i*L + j] += A[i*N + k] * S[j*N + k];
Return E;
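As an illustration only, the steps of Algorithm 2 can be sketched in C with NEON intrinsics. The sketch rests on several assumptions that the listing does not state: the elements are unsigned 16-bit values whose arithmetic wraps modulo 2^16, LANES_SHORT_NUM is 8 (one 128-bit register), and S is stored as L rows of length N, as the address S + j*N + k*LANES_SHORT_NUM suggests. The function name matmul_acc is hypothetical, and the non-NEON branch is a portable stand-in added here so that the sketch also compiles on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Assumption: eight 16-bit lanes per 128-bit NEON register. */
#define LANES_SHORT_NUM 8

/* Hypothetical helper: E[i*L + j] += sum over k of A[i*N + k] * S[j*N + k].
 * A is M x N, S is stored as L rows of length N, E is M x L.
 * All arithmetic wraps modulo 2^16. */
void matmul_acc(const uint16_t *A, const uint16_t *S, uint16_t *E,
                size_t M, size_t N, size_t L)
{
    assert(A != NULL && S != NULL && E != NULL);
    const size_t iter_k = N / LANES_SHORT_NUM;

    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < L; j++) {
            uint16_t sum[LANES_SHORT_NUM] = {0};
#if HAVE_NEON
            /* Vector part: multiply-accumulate eight lanes per iteration. */
            uint16x8_t sum_vect = vdupq_n_u16(0);
            for (size_t k = 0; k < iter_k; k++) {
                uint16x8_t a_vec = vld1q_u16(A + i * N + k * LANES_SHORT_NUM);
                uint16x8_t s_vec = vld1q_u16(S + j * N + k * LANES_SHORT_NUM);
                sum_vect = vmlaq_u16(sum_vect, a_vec, s_vec);
            }
            vst1q_u16(sum, sum_vect);
#else
            /* Portable stand-in that mimics the eight accumulator lanes. */
            for (size_t k = 0; k < iter_k; k++) {
                for (size_t l = 0; l < LANES_SHORT_NUM; l++) {
                    size_t idx = k * LANES_SHORT_NUM + l;
                    sum[l] = (uint16_t)(sum[l] +
                             (uint32_t)A[i * N + idx] * S[j * N + idx]);
                }
            }
#endif
            /* Horizontal add of the stored lanes. */
            uint16_t acc = 0;
            for (size_t l = 0; l < LANES_SHORT_NUM; l++)
                acc = (uint16_t)(acc + sum[l]);

            /* Scalar tail for the N % LANES_SHORT_NUM leftover elements. */
            for (size_t k = iter_k * LANES_SHORT_NUM; k < N; k++)
                acc = (uint16_t)(acc + (uint32_t)A[i * N + k] * S[j * N + k]);

            E[i * L + j] = (uint16_t)(E[i * L + j] + acc);
        }
    }
}
```

Starting the scalar tail at iter_k * LANES_SHORT_NUM makes the explicit remainder test of the listing unnecessary, because the tail loop simply runs zero times when N is a multiple of the lane count.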