Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools
Figure 5
Example of how partial loop unrolling reuses fetched data. The kernel depicted in (a) requires 9 input values. Unrolling by a factor of two processes consecutive rows in parallel, as shown in (b). Although each iteration still demands 9 input values, 6 of them are shared between the two iterations, so the two windows together cover 12 distinct values rather than 18, which reduces the number of memory accesses. Because the smart buffers are active, only 4 new inputs, the last column of each window, need to be requested.
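The counts in the caption can be checked with a short sketch. It assumes a 3 x 3 window that slides one column per step and a smart buffer that keeps every value fetched in the previous step. The function names `window` and `fetch_counts` are hypothetical and are not part of any HLS tool.

```python
def window(row, col, k=3):
    """Coordinates read by a k x k window whose top-left corner is (row, col)."""
    return {(row + i, col + j) for i in range(k) for j in range(k)}


def fetch_counts(unroll, k=3):
    """Count input values for `unroll` windows on consecutive rows.

    Assumption: the windows slide one column per step, and the smart buffer
    still holds everything that was fetched for the previous step.
    """
    current = [window(r, 1, k) for r in range(unroll)]   # this step
    previous = [window(r, 0, k) for r in range(unroll)]  # one column earlier
    demanded = sum(len(w) for w in current)              # values requested by all iterations
    distinct = set().union(*current)                     # values actually needed
    buffered = set().union(*previous)                    # values already on chip
    return {
        "demanded": demanded,
        "distinct": len(distinct),
        "shared": demanded - len(distinct),
        "new_with_buffer": len(distinct - buffered),
    }


if __name__ == "__main__":
    print(fetch_counts(unroll=1))  # single window, as in (a)
    print(fetch_counts(unroll=2))  # two consecutive rows, as in (b)
```

With `unroll=2`, the sketch reports 18 demanded values, 12 distinct values, 6 shared values, and 4 new values once the buffer is taken into account, matching panel (b). With `unroll=1`, the single window demands 9 values and needs 3 new ones per step.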