Research Article
A Low-Power Scalable Stream Compute Accelerator for General Matrix Multiply (GEMM)
divide and into blocks ′ and ′ of size (PE × cache depth);
for each block of ′ and ′ do
    prefetch ′ into cache via stream C;
    preload any W, , or ′ to stream ;
    for each row of ′ do
        stream new elements of the row of ′ via S;
        multiply-accumulate elements of ′ and ′ across PEs;
        if ′ contains final elements of then
            shift new partial results of ′ from PEs via ;
            perform scalar operations using , and at output of and via ASE (§ 3.3);
        else
            shift new elements of ′ from PEs via to memory or cache;
        end
    end
end
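The blocked dataflow in the listing can be illustrated in software. The following is a minimal sketch, assuming the conventional GEMM form C ← αAB + βC. The operand names `a`, `b`, `c`, the mapping of one output column to each PE, and the parameters `pe_count` and `cache_depth` are illustrative assumptions and are not taken from the accelerator design; the streams and the ASE unit are modelled only as ordinary loops and a final scalar step.

```python
def blocked_gemm(a, b, c, alpha=1.0, beta=1.0, pe_count=4, cache_depth=4):
    """Return alpha * (a @ b) + beta * c, computed block by block.

    Hypothetical software model: each block of b is pe_count columns wide
    (one column per PE) and cache_depth rows deep (the per-PE cache).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]  # partial results held in "memory"

    for j0 in range(0, n, pe_count):          # block of output columns
        j1 = min(j0 + pe_count, n)
        for k0 in range(0, k, cache_depth):   # block along the shared dimension
            k1 = min(k0 + cache_depth, k)
            # "Prefetch" the current block of b into the PE caches.
            cache = [row[j0:j1] for row in b[k0:k1]]
            final = k1 == k  # this block holds the last elements of the sum

            for i in range(m):                # stream rows of a
                for pe in range(j1 - j0):     # each PE owns one output column
                    acc = out[i][j0 + pe]     # reload the partial result
                    for kk in range(k1 - k0):
                        acc += a[i][k0 + kk] * cache[kk][pe]  # multiply-accumulate
                    if final:
                        # Scalar post-operations on the finished sum.
                        acc = alpha * acc + beta * c[i][j0 + pe]
                    out[i][j0 + pe] = acc     # shift result back out
    return out


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    c = [[1, 1], [1, 1]]
    print(blocked_gemm(a, b, c, alpha=2, beta=3, pe_count=1, cache_depth=2))
```

Choosing `pe_count=1` and `cache_depth=2` in the usage example forces several blocks along both dimensions, so partial results are written back and reloaded before the scalar step is applied on the final block.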