Research Article

Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools

Figure 5

Example of how partial loop unrolling reuses fetched data. The kernel depicted in (a) requires 9 input values. By unrolling the loop two times, consecutive rows are processed in parallel in (b). In that case, although each iteration demands 9 input values, 6 values are shared between the two iterations, reducing the memory accesses. Because the smart buffers are active, only 4 new inputs, the last column of each window, need to be requested.
(a)
(b)
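
The counts in the caption can be checked with a small sketch. It assumes a 3 x 3 window (9 inputs), an unroll factor of 2 over consecutive rows, and windows that slide one column at a time, with the smart buffer modeled simply as "every value fetched at the previous column position is still available". The names `window_cells` and `unrolled_fetch_counts` are hypothetical and serve only this illustration; they are not taken from the article or from any tool.

```python
def window_cells(top, left, k=3):
    """Return the set of (row, col) input coordinates read by a k x k window."""
    return {(top + r, left + c) for r in range(k) for c in range(k)}


def unrolled_fetch_counts(unroll=2, k=3):
    """Count input values for `unroll` vertically adjacent k x k windows.

    Assumption: the smart buffer keeps everything fetched at the previous
    column position, so only cells not seen before must be requested.
    """
    # Windows of consecutive output rows at the same column position.
    windows = [window_cells(top=u, left=0, k=k) for u in range(unroll)]
    distinct = set().union(*windows)

    demanded = sum(len(w) for w in windows)   # inputs demanded by all iterations
    shared = demanded - len(distinct)         # duplicates removed by sharing

    # Slide every window one column to the right and see what is new.
    shifted = set().union(*(window_cells(top=u, left=1, k=k) for u in range(unroll)))
    new_with_buffer = len(shifted - distinct)

    return {
        "per_iteration": k * k,
        "demanded": demanded,
        "distinct": len(distinct),
        "shared": shared,
        "new_with_buffer": new_with_buffer,
    }


if __name__ == "__main__":
    print(unrolled_fetch_counts(unroll=2, k=3))
```

For `unroll=2` and `k=3` this reports 9 inputs per iteration, 6 shared values, and 4 new inputs per step, matching the figures quoted in the caption under the stated assumptions.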