Security and Communication Networks

Research Article

Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor

Efficient matrix multiplication and accumulation.

Require: Matrix A (M × N matrix), Matrix S (L × N matrix, one row of N elements per output column), Matrix E (M × L matrix)
Ensure: Matrix E (M × L matrix)
1: *for* i from 0 to M do
2:   *for* j from 0 to L do
3:     sum_vect = NEON_Lane_Broadcast(0);
4:     *for* k from 0 to iter_k do
5:       a_vec = NEON_Vector_Load(A + i × N + k × LANES_SHORT_NUM);
6:       s_vec = NEON_Vector_Load(S + j × N + k × LANES_SHORT_NUM);
7:       sum_vect = NEON_Multiply_Accumulate(sum_vect, a_vec, s_vec);
8:     NEON_Vector_Store(sum, sum_vect);
9:     E[i × L + j] += sum[0] + sum[1] + sum[2] + sum[3] + sum[4] + sum[5] + sum[6] + sum[7];
10:     *if* (k == N / LANES_SHORT_NUM) && (N % LANES_SHORT_NUM ≠ 0)
11:       *for* k from N − (N % LANES_SHORT_NUM) to N do
12:         E[i × L + j] += A[i × N + k] × S[j × N + k];
13: *return* E;
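
The listing above can be sketched in C as follows. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that the listing does not state: matrix elements are taken to be `uint16_t` with arithmetic wrapping modulo 2^16, `LANES_SHORT_NUM` is set to 8 (the number of 16-bit lanes in a 128-bit NEON register, matching the eight `sum[]` terms), `iter_k` is taken as N / LANES_SHORT_NUM, and S is laid out as L rows of N elements. The function name `matmul_acc` is hypothetical. On targets without NEON, a scalar loop with the same lane-wise behavior stands in for the intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define LANES_SHORT_NUM 8 /* assumed: 16-bit lanes in one 128-bit register */

/* E (M x L) += A (M x N) times S, with S stored as L rows of N elements.
 * Arithmetic wraps modulo 2^16 (an assumption, not stated in the listing). */
void matmul_acc(const uint16_t *A, const uint16_t *S, uint16_t *E,
                size_t M, size_t N, size_t L)
{
    assert(A != NULL && S != NULL && E != NULL);
    const size_t iter_k = N / LANES_SHORT_NUM; /* full vector blocks per row */

    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < L; j++) {
            uint16_t sum[LANES_SHORT_NUM] = {0};
            size_t k;
#if defined(__ARM_NEON)
            uint16x8_t sum_vect = vdupq_n_u16(0);
            for (k = 0; k < iter_k; k++) {
                uint16x8_t a_vec = vld1q_u16(A + i * N + k * LANES_SHORT_NUM);
                uint16x8_t s_vec = vld1q_u16(S + j * N + k * LANES_SHORT_NUM);
                /* Per lane: sum_vect += a_vec * s_vec */
                sum_vect = vmlaq_u16(sum_vect, a_vec, s_vec);
            }
            vst1q_u16(sum, sum_vect);
#else
            /* Portable stand-in for the NEON path: same lane-wise sums. */
            for (k = 0; k < iter_k; k++) {
                for (size_t l = 0; l < LANES_SHORT_NUM; l++) {
                    size_t idx = k * LANES_SHORT_NUM + l;
                    sum[l] = (uint16_t)(sum[l] +
                             (uint32_t)A[i * N + idx] * S[j * N + idx]);
                }
            }
#endif
            /* Horizontal reduction of the eight lane sums. */
            uint16_t acc = 0;
            for (size_t l = 0; l < LANES_SHORT_NUM; l++)
                acc = (uint16_t)(acc + sum[l]);

            /* Scalar tail for the N % LANES_SHORT_NUM leftover elements. */
            for (k = iter_k * LANES_SHORT_NUM; k < N; k++)
                acc = (uint16_t)(acc + (uint32_t)A[i * N + k] * S[j * N + k]);

            E[i * L + j] = (uint16_t)(E[i * L + j] + acc);
        }
    }
}
```

Calling `matmul_acc(A, S, E, M, N, L)` accumulates the product into E in place, so E should hold the initial error matrix beforehand. Accumulating into a local `acc` and writing E once per (i, j) is a design choice of this sketch; the listing adds to E directly in both the vector and tail steps, which gives the same result.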