Research Article

Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor

Table 1

ARM NEON intrinsic functions for the proposed method.

Operation | ARM NEON intrinsic function
Load | uint16x8_t vld1q_u16(__transfersize(8) uint16_t const * ptr);
Store | void vst1q_u16(__transfersize(8) uint16_t * ptr, uint16x8_t val);
Extracting a lane from a vector into a register | uint16_t vgetq_lane_u16(uint16x8_t vec, __constrange(0, 7) int lane);
Lane broadcast | uint16x8_t vdupq_n_u16(uint16_t value);
Vector interleave | uint16x8x2_t vzipq_u16(uint16x8_t a, uint16x8_t b);
Vector multiply accumulate | uint16x8_t vmlaq_u16(uint16x8_t a, uint16x8_t b, uint16x8_t c);
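To illustrate how the load, broadcast, multiply-accumulate, and store intrinsics listed in Table 1 can be combined, the following is a minimal sketch of a matrix-vector product over 16-bit words, with arithmetic wrapping modulo 2^16. It is not the proposed method itself: the helper name matvec8_u16, the fixed height of 8 rows, and the column-by-column storage of A are assumptions made for illustration. A scalar fallback with the same wrap-around behaviour is included so that the sketch also compiles on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (not from the paper): y = A * s (mod 2^16).
 * A has 8 rows and `cols` columns, stored column by column, so that
 * A[8*j + i] is the entry in row i, column j (an assumed layout). */
static void matvec8_u16(uint16_t y[8], const uint16_t *A,
                        const uint16_t *s, size_t cols)
{
    assert(y && A && s);
#if HAVE_NEON
    uint16x8_t acc = vdupq_n_u16(0);            /* all 8 lanes set to 0      */
    for (size_t j = 0; j < cols; j++) {
        uint16x8_t col = vld1q_u16(A + 8 * j);  /* load column j of A        */
        uint16x8_t sj  = vdupq_n_u16(s[j]);     /* broadcast s[j] to 8 lanes */
        acc = vmlaq_u16(acc, col, sj);          /* acc += col * sj, per lane */
    }
    vst1q_u16(y, acc);                          /* store the 8 results       */
#else
    /* Scalar fallback: same result, one row at a time. The cast to
     * uint32_t avoids signed overflow from integer promotion. */
    for (size_t i = 0; i < 8; i++) {
        uint16_t acc = 0;
        for (size_t j = 0; j < cols; j++)
            acc = (uint16_t)(acc + (uint32_t)A[8 * j + i] * s[j]);
        y[i] = acc;
    }
#endif
}
```

Each loop iteration consumes one column of A and one entry of s, so the eight row results stay in a single vector register until the final store. Taller matrices could be handled by calling the helper once per block of 8 rows, again as an assumption rather than as a description of the proposed method.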