Research Article

A Low-Power Scalable Stream Compute Accelerator for General Matrix Multiply (GEMM)

Algorithm 1

divide and into blocks ′ and ′ of size (PE × cache depth);
for each block of and ′ do
    prefetch ′ into cache via stream C;
    preload any W, , or ′ to stream ;
    for each row of ′ do
        stream new elements of the row of ′ via S;
        multiply-accumulate elements of ′ and ′ across PEs;
        if contains final elements of then
            shift new partial results of ′ from PEs via ;
            perform scalar operations using , and at output of and via ASE (§ 3.3);
        else
            shift new elements of ′ from PEs via to memory or cache;
        end
    end
end
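
The control flow of Algorithm 1 can be illustrated with a small software model. The sketch below is not the accelerator's implementation. It assumes the conventional GEMM form C ← αAB + βC, and the names `A`, `B`, `C`, `alpha`, `beta`, `num_pe`, and `cache_depth` are assumptions chosen for illustration rather than the paper's notation. It also assumes that each PE owns one output column, and that the cached block holds `cache_depth` rows of the second operand for `num_pe` columns.

```python
def blocked_gemm(A, B, C, alpha=1.0, beta=0.0, num_pe=4, cache_depth=4):
    """Software model of a blocked, streamed GEMM: alpha*A*B + beta*C.

    All names here are illustrative assumptions, not the paper's notation.
    A is m x k, B is k x n, C is m x n (lists of lists).
    """
    m, k, n = len(A), len(B), len(B[0])
    acc = [[0.0] * n for _ in range(m)]   # partial results kept between blocks
    out = [[0.0] * n for _ in range(m)]   # final results after scalar operations

    for j0 in range(0, n, num_pe):                # columns handled by the PEs
        cols = range(j0, min(j0 + num_pe, n))
        for k0 in range(0, k, cache_depth):       # one (PE x cache depth) block
            ks = range(k0, min(k0 + cache_depth, k))
            # "Prefetch": copy the current block of B into a local cache.
            cache = {(kk, j): B[kk][j] for kk in ks for j in cols}
            is_final_block = k0 + cache_depth >= k

            for i in range(m):                    # stream one row at a time
                for j in cols:                    # each PE multiply-accumulates
                    for kk in ks:
                        acc[i][j] += A[i][kk] * cache[(kk, j)]
                    if is_final_block:
                        # Reduction is complete: apply the scalar operations.
                        out[i][j] = alpha * acc[i][j] + beta * C[i][j]
                    # Otherwise the partial result stays in acc, standing in
                    # for the write-back to memory or cache.
    return out
```

Calling `blocked_gemm(A, B, C, alpha, beta, num_pe=1, cache_depth=2)` on small matrices forces several blocks per output element, which exercises both branches of the final-block test. The result does not depend on the block sizes, only the order in which partial sums are accumulated does.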