FPGAs are known to permit huge gains in performance and efficiency for suitable applications but still require reduced design efforts and shorter development cycles for wider adoption. In this work, we compare the resulting performance of two design concepts that in different ways promise such increased productivity. As common starting point, we employ a kernel-centric design approach, where computational hotspots in an application are identified and individually accelerated on FPGA. By means of a complex stereo matching application, we evaluate two fundamentally different design philosophies and approaches for implementing the required kernels on FPGAs. In the first implementation approach, we designed individually specialized data flow kernels in a spatial programming language for a Maxeler FPGA platform; in the alternative design approach, we target a vector coprocessor with large vector lengths, which is implemented as a form of programmable overlay on the application FPGAs of a Convey HC-1. We assess both approaches in terms of overall system performance, raw kernel performance, and performance relative to invested resources. After compensating for the effects of the underlying hardware platforms, the specialized dataflow kernels on the Maxeler platform are around 3x faster than kernels executing on the Convey vector coprocessor. In our concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, the advantage of specialized dataflow kernels is reduced to around 2.5x.

1. Introduction

In order to achieve the best possible performance for application acceleration on FPGAs, the entire design process, from the selection of the algorithm, compute patterns, and data structures to the customization of operations and their precision, needs to be considered [1]. However, the usage of FPGAs is increasingly extending beyond established domains such as embedded systems and specialized high performance compute scenarios, where such a holistic design paradigm is established, towards usage scenarios in general-purpose computing and big data centers [2]. In those scenarios, where large code bases exist, where at least parts of the application are subject to ongoing changes, and where the impact of small changes in the algorithm or data representation of some part of the application cannot be easily assessed, such a comprehensive design method is often infeasible. Instead, a more pragmatic approach is required, which we denote as kernel-centric, where individual parts of the application, identified as computational hotspots and suitable for acceleration, are translated and offloaded to functionally equivalent FPGA implementations. This approach is a known concept in HW/SW codesign [3, 4] and is also widely employed in the field of GPU acceleration [5, 6]. If we want to promote more widespread utilization of FPGA accelerators with this approach, we need, besides good performance of the accelerated designs, efficient design methods for the FPGA kernels.

In this work, we evaluate and compare two very different design philosophies for the implementation of such kernels on FPGAs. The first method applied is the design of specific dataflow implementations for each individual kernel in a spatial programming language, which promises high productivity while still retaining most of the result quality that can be achieved by low level design techniques. We evaluate this method on a Maxeler platform [7], using the MaxJ [8] language to specify the kernel designs and targeting a MAX3424A Vectis [7] accelerator card. The second method in contrast utilizes for all kernels the same instruction programmable overlay architecture, a vector coprocessor with very wide vectors, which is one usage mode of Convey HC-1’s [9] application FPGAs. In this work, the vectorized kernels executed on this architecture are handwritten assembly code. With the comparison between the two methods, we primarily want to assess the resulting performance in order to find out whether such a quite generic overlay architecture as the utilized one can performance-wise be a viable solution when development time is limited. As secondary objective, we also quantify the actually required development effort and know-how.

For this comparison, we accelerated an existing stereo matching algorithm with a kernel-centric acceleration approach for both design paths. The algorithm offers an interesting mix of parallelization opportunities in some problem dimensions and dependencies in other dimensions. Because of the dependency pattern, it cannot be implemented as a single design that is fully pipelined through all of its compute stages, as other stereo matchers on FPGAs are. In this algorithm, we identified 10 runtime-intensive kernel functions and offloaded them to the target FPGAs. The group of kernels contains straightforward streaming kernels, kernels with mildly irregular data access patterns, and kernels with the dependency pattern of a dynamic programming approach. Through slight generalization of one group of kernels, we implemented 8 specific dataflow designs to execute those kernel functions, one of the designs executing an auxiliary step that is required to efficiently support the data access pattern for one pair of functions. On the vector coprocessor, the 10 kernel functions are executed by 10 directly corresponding assembly functions.

In this work, we build closely upon a previous publication, in which most aspects of the kernel-centric acceleration of stereo matching with specialized kernels in a spatial programming language are presented [11]. Additionally, earlier work using the vector overlay [12] was heavily modified to match the functionality and kernel selection on the dataflow platform.

Beyond the consolidated presentation of both design paths side by side, the specific contribution of this work is the in-depth comparison of the two approaches, compensating for the effects of the two different hardware platforms and thus allowing assessing the overheads and opportunities of using overlay architectures on FPGAs in a nontrivial and practically relevant use case.

The remainder of this paper is structured as follows. In Section 2, we first outline the general stereo matching task, before presenting in Section 3 the concrete algorithm we accelerate. We then introduce in Section 4 the two accelerator platforms and how they are programmed in this work. Before the concrete kernel implementations are described side by side for both platforms in Section 6, we outline the common acceleration principles and memory management concepts in Section 5. Comparing both systems in Section 8 requires some normalization to account for the different hardware platforms but gives us insights into the trade-offs in runtime, design effort, and tool runtimes. Section 9 discusses related work for three aspects of our paper. In Section 10, we conclude with a careful generalization of the results and outline future directions.

2. Introduction to the Stereo Matching Problem

Stereo matching is the computation of a disparity map from a pair of stereo images. The disparity specifies at every position in the image how far the displayed object or feature appears displaced between the two images due to the different positions of the two camera lenses. In earlier work, the term Stereo Correspondence has been used for the same problem [13, 14]. By inverting the disparity information and scaling it according to the geometry of the camera system, actual depth information about the scene is obtained, which is denoted by Stereo Vision and is probably the most important method for computer vision.

Applications for computer vision in general and stereo matching in particular range from automotive and industrial use cases over robot navigation [15] to 3D movie production and general 3D data acquisition. Common design goals for all types of applications are high matching quality and high processing speed, yet with varying priorities and additional constraints, for example, on image resolution, on latency or throughput, or on power and resource limitations.

Most stereo matching algorithms perform so-called dense stereo matching; that is, they compute a disparity map, containing a disparity value for each pixel. This value represents how far the object that this pixel belongs to appears shifted between the left and right stereo image. More formally, a disparity value $d$ for a pixel in the left stereo image at position $(x, y)$ signifies that the physical feature displayed by this pixel is believed to be found in the right stereo image at position $(x - d, y)$. If a corresponding right disparity image is computed, to be consistent, the corresponding position in the right disparity image, $(x - d, y)$, should also contain the same disparity value $d$, pointing back to position $(x, y)$. For this definition of disparity and consistency to be precise, the two images need to be perfectly horizontally aligned.

As auxiliary metric to compute disparities, many algorithms use a cost value $C(x, y, d)$ for each pixel $(x, y)$ at each possible disparity $d$, thus forming a three-dimensional cost volume, where a low cost signifies that it is plausible that this pixel should have the corresponding disparity.

The general sequence of modern stereo matching approaches comprises three steps [16]: first, computation of a matching cost volume, second, an optimization method which computes a disparity map from the cost volume, and, third, postprocessing of the disparity map. For computing the initial cost volume, metrics for local color similarity and for local structural similarity are commonly employed [10, 17]. To smoothen the cost volume, aggregation techniques can be employed [13, 18]. Optimization in the simplest form, often called WTA (winner-takes-all) [17], just selects the disparity with the lowest cost for each pixel: $D(x, y) = \arg\min_d C(x, y, d)$. Other approaches like belief propagation (BP) and graph cuts (GC) seek to combine low matching costs with properties like low energy of the resulting disparity map. Beyond generic image augmentation approaches, postprocessing often involves consistency checks between the disparity maps generated for left and right image and handling of the identified inconsistencies [10, 19].
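As an illustration of the WTA step, the following minimal Python sketch selects the disparity with the lowest cost for each pixel. The nested-list layout `cost_volume[y][x][d]` and the function name are assumptions made for this example only, not part of the accelerated implementation.

```python
def winner_takes_all(cost_volume):
    """Return a disparity map from cost_volume[y][x][d] (hypothetical layout)."""
    disparity_map = []
    for row in cost_volume:
        disparity_row = []
        for costs in row:
            # Index of the minimal cost; ties resolve to the smallest disparity.
            best_d = min(range(len(costs)), key=costs.__getitem__)
            disparity_row.append(best_d)
        disparity_map.append(disparity_row)
    return disparity_map


# Tiny 1x2 image with three candidate disparities per pixel.
example_volume = [[[0.9, 0.2, 0.5], [0.1, 0.4, 0.3]]]
example_disparities = winner_takes_all(example_volume)
```

For the example volume, the first pixel selects disparity 1 and the second pixel selects disparity 0.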

For many stereo methods, there exist variants which incorporate more than two input images, typically but not necessarily captured from a set of cameras placed along one horizontal line. The additional viewpoints open up additional opportunities for consistency checks among derived disparity maps and can particularly help to fill occluded areas, when they are visible in one of the additional input images. However, these variants still rely on a good underlying stereo matching algorithm, like the one utilized as case study here.

3. Stereo Matching Algorithm with Inherent Parallelism

In our work, we algorithmically follow the stereo matching implementation published by Mei et al. [10]. It follows the three basic steps outlined in Section 2, but splitting the first step of cost computation into two separate phases, we subdivide it here into a total of four phases. Figure 1 gives a high-level overview of the stereo matching sequence. In the first phase, cost initialization, two similarity metrics are applied on the input images to compute for each pixel and each possible disparity a local cost value, thus forming the first cost volumes. In the second phase, cost aggregation, the costs of neighboring pixels of the same disparity are aggregated in adaptive support regions, which are determined by color differences and absolute distances. This smoothes the original cost volumes. In the third phase, scanline optimization, an energy minimization approach is mimicked by dynamic programming along 1-dimensional scanlines. This produces a first pair of disparity maps, but also another pair of cost volumes that are used in the fourth phase, disparity refinement. This fourth phase performs a consistency check between the left and right disparity maps and applies several local optimizations for pixels which are not classified consistently.

As the most time-consuming parts and parts where the accelerated kernel functions are located, we present some details about the mechanisms of cost aggregation and scanline optimization and briefly outline the two less time-consuming steps, cost initialization and disparity refinement, which are executed on CPU in our work.

3.1. Cost Initialization

The cost initialization following Mei et al. [10] provides the first cost metric for each position and disparity based on two individual components. The first component is the absolute difference cost $C_{AD}$ for a pair of left- and right-image pixels in RGB format. This cost is defined as the absolute difference of the pixel intensities $I_i$, averaged over the three color channels:

$$C_{AD}(x, y, d) = \frac{1}{3} \sum_{i \in \{R, G, B\}} \left| I_i^{left}(x, y) - I_i^{right}(x - d, y) \right| \qquad (1)$$

The second component is the census cost $C_{census}$, computed as the Hamming distance of the census transforms of a left and corresponding right pixel. The census transform captures the local structure in a window around each pixel. As structural information, it is less sensitive to variations in lighting between the left and right image.

These two cost components are individually scaled by an exponential function that also enables the weighting of outliers and then added up to form the initial cost.
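The two cost components and their combination can be sketched in Python as follows. The sketch assumes the exponential scaling $1 - e^{-c/\lambda}$ described by Mei et al. [10]; the scaling parameters `lam_ad` and `lam_census`, the 3x3 census window, the grayscale input for the census transform, and the border clamping are illustrative assumptions and not necessarily the settings used in this work.

```python
import math


def ad_cost(left_rgb, right_rgb):
    """Absolute difference of two RGB pixels, averaged over the three channels."""
    return sum(abs(l - r) for l, r in zip(left_rgb, right_rgb)) / 3.0


def census_transform(gray, x, y, half_w=1, half_h=1):
    """Bit list marking which window neighbours are darker than the centre pixel."""
    height, width = len(gray), len(gray[0])
    centre = gray[y][x]
    bits = []
    for dy in range(-half_h, half_h + 1):
        for dx in range(-half_w, half_w + 1):
            if dx == 0 and dy == 0:
                continue
            # Clamp coordinates at the image border (an assumption of this sketch).
            ny = min(max(y + dy, 0), height - 1)
            nx = min(max(x + dx, 0), width - 1)
            bits.append(1 if gray[ny][nx] < centre else 0)
    return bits


def hamming(bits_a, bits_b):
    """Number of positions in which two census bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


def robust(cost, lam):
    """Map a non-negative cost into [0, 1]; large outliers saturate near 1."""
    return 1.0 - math.exp(-cost / lam)


def initial_cost(ad, census, lam_ad=10.0, lam_census=30.0):
    """Sum of the individually scaled cost components (illustrative parameters)."""
    return robust(ad, lam_ad) + robust(census, lam_census)
```

Because each component saturates at 1, a single strongly mismatching metric cannot dominate the combined cost, which is the outlier weighting effect mentioned above.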

3.2. Cost Aggregation

The idea of cost aggregation is to reduce the large amount of noise contained in the local cost metrics. Instead of simple smoothing, the costs for each possible disparity are aggregated over a limited area around each pixel that likely belongs to the same object in the image and thus should have similar disparity values. Therefore, aggregation areas should track object boundaries in shape and size as well as possible. However, computing individual aggregation areas for each pixel and summing up the costs inside them can be very compute-intensive. The cross-based aggregation method utilized here was first proposed by Zhang et al. [20]. The areas are defined by the lengths of four arms for each pixel, two extending to the left and right and two up and down. Two possible aggregation areas are now formed by all vertical arms that belong to pixels on the horizontal arms of each pixel and, respectively, the other way round, as illustrated in Figure 2. Horizontal first aggregation areas can cover vertical object boundaries better; vertical first aggregation is more precise for horizontal object boundaries.

For both aggregation areas, the actual aggregation can be performed in linear time with the help of integral sums. Pseudocode for the horizontal aggregation step is given in Algorithm 1. As the outer loop indicates, the step is performed independently for each disparity $d$. The first loop nest computes for each row the running sum $S$ of costs from the row’s first element to the current element. In the second loop nest, for each position $(x, y)$, the difference between two elements of the running sum is taken, with the element positions defined by the arm lengths at $(x, y)$. This difference is exactly the sum of costs in the horizontal segment around $(x, y)$ that is specified by the two arms. In Figure 2, aggregation for one disparity is illustrated for the topmost and bottommost rows of the horizontal first aggregation region, where each horizontally aggregated cost depends on two elements of the running sums. Afterwards, in the vertical aggregation step, vertically running sums (not illustrated in the figure) are computed and the aggregated costs are again computed as the difference between the running sums at two positions, here at the two positions that are marked in the example. Note that the left and upper positions, from which the respective running sums are taken, are not part of the aggregation area itself.

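Algorithm 1: Horizontal aggregation step.
Require: $C(x, y, d)$ := input cost
Require: $L(x, y)$, $R(x, y)$ := left and right arm lengths
Ensure: $A(x, y, d)$ := aggregated costs of row segment around position $(x, y)$ in disparity $d$
for all disparities $d$ do
  for all rows $y$ do  ▷ compute running sums, with $S(0, y) = 0$
   for $x = 1$ to columns do
    $S(x, y) \leftarrow S(x - 1, y) + C(x, y, d)$
   end for
  end for
  for all rows $y$ do  ▷ compute row segment costs
   for all columns $x$ do
    $A(x, y, d) \leftarrow S(x + R(x, y), y) - S(x - L(x, y) - 1, y)$
   end for
  end for
end for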
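The running-sum technique of the horizontal aggregation step can be illustrated with a short Python sketch for a single disparity. The function and parameter names are hypothetical, and the sketch assumes that all arms stay inside the image; it uses zero-based indexing with a leading sentinel element in the running sum, so that the position left of a segment is never part of the aggregated area.

```python
def aggregate_rows(cost, left_arm, right_arm):
    """Sum cost[y][x] over the horizontal arm segment around each pixel.

    cost, left_arm, and right_arm are height x width nested lists for one
    disparity (hypothetical layout); arm lengths are given in pixels.
    """
    height, width = len(cost), len(cost[0])
    aggregated = [[0.0] * width for _ in range(height)]
    for y in range(height):
        # running[x + 1] holds the sum of cost[y][0..x]; running[0] = 0 is a sentinel.
        running = [0.0] * (width + 1)
        for x in range(width):
            running[x + 1] = running[x] + cost[y][x]
        for x in range(width):
            first = x - left_arm[y][x]   # leftmost pixel of the segment
            last = x + right_arm[y][x]   # rightmost pixel of the segment
            # Difference of two running sums equals the sum over first..last.
            aggregated[y][x] = running[last + 1] - running[first]
    return aggregated


example_cost = [[1, 2, 3, 4]]
example_left = [[0, 1, 1, 1]]
example_right = [[1, 1, 1, 0]]
example_aggregated = aggregate_rows(example_cost, example_left, example_right)
```

Each pixel requires only two lookups and one subtraction regardless of its arm lengths, which is what makes the aggregation linear in the number of pixels per disparity.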

A pair of horizontal and vertical aggregation steps forms one aggregation iteration with the illustrated horizontal first aggregation region. Following Mei et al. [10], we execute a total of four such aggregation iterations, the first and third one using horizontal first regions and the second and fourth one using vertical first regions. Not mentioned by Mei et al. [10] is a normalization step after each aggregation iteration, where the aggregated cost is scaled by the respective aggregation area. This was already proposed by Zhang et al. [20], also utilized by Shan et al. [21], and we found it to be important for the result quality of our implementation.

3.3. Scanline Optimization

The scanline optimization follows Hirschmüller’s [19] semiglobal matching strategy. Global matching would perform 2-dimensional energy minimization for the entire image, minimizing the weighted sum of the energy in the final disparity image and of the involved matching costs for this disparity image. The scanline optimization mimics this idea along 1-dimensional lines but avoids costly minimization steps and instead uses a dynamic programming approach, where the previous disparity decisions along the scanline are fixed and only the energy trade-off for the current step is considered. Equation (2) outlines the basic recursion and Algorithm 2 illustrates pseudocode for one scanline direction:

$$C_r(p, d) = C_{agg}(p, d) + \min\left( C_r(p - r, d),\; C_r(p - r, d \pm 1) + P_1,\; \min_k C_r(p - r, k) + P_2 \right) - \min_k C_r(p - r, k) \qquad (2)$$

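Algorithm 2: Scanline optimization to the right.
Require: $C_{agg}(x, y, d)$ := aggregated cost
Ensure: $C_{right}(x, y, d)$ := right scanline cost
for all rows $y$ do
  for all disparities $d$ do
   $C_{right}(1, y, d) \leftarrow C_{agg}(1, y, d)$
  end for
  for $x = 2$ to columns do
   for all disparities $d$ do
    $C_{right}(x, y, d) \leftarrow C_{agg}(x, y, d) + \min\left( C_{right}(x - 1, y, d),\; C_{right}(x - 1, y, d \pm 1) + P_1,\; \min_k C_{right}(x - 1, y, k) + P_2 \right) - \min_k C_{right}(x - 1, y, k)$
   end for
  end for
end for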

The scanline cost $C_r(p, d)$ in the equation is computed along a scanline path that depends on the direction $r$, which in the pseudocode example is chosen to define a scanline to the right, with accordingly denoted scanline cost $C_{right}$. The scanline cost depends on the aggregation cost $C_{agg}(p, d)$ and a term requiring all scanline costs at the previous pixel position along the scanline path. This previous pixel position is given by $p - r$ in the equation and by $(x - 1, y)$ in the pseudocode example. This term depending on the previous position reflects the energy minimization concept, selecting either the scanline cost from the previous position at the same disparity $d$, or the scanline cost from the previous position at a neighboring disparity $d \pm 1$ plus a small penalty $P_1$, or the minimal scanline cost of all disparities at the previous position plus a larger penalty $P_2$. These paths trade off energy components added by the matching costs with energy components from the disparity profiles represented by the penalties $P_1$ and $P_2$. Not shown in the equation and pseudocode, both penalty values depend at each specific position on the color differences of the original images. Finally, for normalization, the minimal scanline cost at the previous position is subtracted.
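The recursion for a scanline to the right can be sketched in Python for a single image row. This is a simplified illustration: the penalties `p1` and `p2` are constants here, whereas they depend on local color differences in the actual algorithm, and copying the input costs for the first column is an assumption of this sketch. All names are hypothetical.

```python
def scanline_right(agg_row, p1, p2):
    """Compute scanline costs for one row, where agg_row[x][d] are aggregated costs."""
    width, num_disp = len(agg_row), len(agg_row[0])
    scan = [list(agg_row[0])]  # first column has no predecessor: copy input costs
    for x in range(1, width):
        prev = scan[x - 1]
        prev_min = min(prev)  # minimal scanline cost at the previous position
        current = []
        for d in range(num_disp):
            # Same disparity, or any disparity at the larger penalty p2.
            candidates = [prev[d], prev_min + p2]
            # Neighboring disparities at the smaller penalty p1.
            if d > 0:
                candidates.append(prev[d - 1] + p1)
            if d + 1 < num_disp:
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min normalizes the costs along the path.
            current.append(agg_row[x][d] + min(candidates) - prev_min)
        scan.append(current)
    return scan


example_scan = scanline_right([[0, 4, 8], [5, 0, 5]], p1=1, p2=3)
```

The loop over `x` carries the dependency along the scanline, while the iterations over `d` within one position and the different rows are independent of each other.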

Figure 11 serves us mainly to illustrate the compute and parallelization pattern of our implementations but also contains a numeric example of scanline computation, here of a downward scanline. For simplicity, costs are represented as integer values and with an aggregation cost of 0 for the second line. Green arrows indicate the minimization paths taken to compute the scanline costs in the second row depending on the previous row and the input aggregation costs. These green arrows reflect the best trade-off between minimization of the input costs and the scanline energy for any given position.

In the abstract description of stereo matching approaches in Section 2, the optimization step was described to yield a disparity map, yet the scanline equation as described here computes a new cost volume, now incorporating a trade-off between raw matching costs and energy of the disparity map. This is convenient, as now the results of scanline optimization steps along different directions can simply be combined by computing the average of different scanline costs. On the combined scanline costs, now a WTA optimization selects the actual disparity for each pixel.

We use four directions, up, down, left, and right, as proposed by Mei et al. [10]. Each scanline by itself produces some streaking artifacts in the direction of the scanline, because the penalty values only favor persistence of previously optimal disparities along the scanline, but not in the reverse direction. Therefore, it is important not only to utilize several different scanlines like in [22], but also to have pairs of reverse scanlines to symmetrically offset the streaking.

3.4. Disparity Refinement

The previous three phases are executed for both the left and right image, producing one disparity image for each side. As indicated earlier, their computed disparity values should match: $D_{left}(x, y) = D_{right}(x - D_{left}(x, y), y)$. Pixels for which this is not the case are classified as outliers and are treated with the refinement steps Iterative Region Voting and Proper Interpolation from Mei et al. [10]. Due to insufficient details given, we skip their Depth Discontinuity Adjustment step but again perform the subsequent Subpixel Enhancement step, which aims to reduce errors caused by the discrete disparity levels.
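The consistency check between the two disparity maps can be illustrated with the following Python sketch. The function name and the nested-list layout are hypothetical, and treating a left pixel whose match would fall outside the right image as an outlier is an assumption of this example.

```python
def find_outliers(disp_left, disp_right):
    """Return (x, y) positions whose left disparity is not confirmed by the right map."""
    height, width = len(disp_left), len(disp_left[0])
    outliers = []
    for y in range(height):
        for x in range(width):
            d = disp_left[y][x]
            xr = x - d  # matching column in the right image
            # Inconsistent if the match leaves the image or points elsewhere.
            if xr < 0 or disp_right[y][xr] != d:
                outliers.append((x, y))
    return outliers


example_outliers = find_outliers([[0, 1, 1, 2]], [[1, 1, 0, 0]])
```

In the example row, the pixels at columns 0 and 3 fail the check and would be passed on to the refinement steps.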

3.5. Software Implementation

As starting point for our acceleration, we use our own software implementation for stereo matching, which follows these concepts but offers additional features, such as different, parametrizable cost initialization metrics (for more metrics, see, e.g., [23]), an adjustable sequence of aggregation steps, and optional OpenGL visualization of aggregation areas, cost volumes, and cost metric profiles. The precision of intermediate cost values required for stable results depends highly on the actual images processed. In general, quality degradation with reduced precision is graceful, but in some cases with single-precision floating point, costs after computing differences in the aggregation step can falsely get values of 0, leading to artifacts. Thus, we use in our software implementation double precision and also require this from the FPGA acceleration. With the settings of Mei et al. [10], our implementation reaches an accuracy in the Middlebury benchmark [14] of average 5.73% bad pixels and we make sure during our acceleration process to still produce the same results.

4. Utilized FPGA Platforms and Programming Models

In this section, we introduce the two hardware platforms we target and outline how they are programmed in this work. We conclude the section with a brief comparison of the accelerator resources as used in our experiments.

4.1. Maxeler Platform and Programming Paradigm

The Maxeler platform we use [7] is illustrated in Figure 3. It comprises two 6-core (12 threads) Intel Xeon X5650 (Westmere microarchitecture) CPUs, running at 2.66 GHz, as host platform and is equipped with four MAX3424A Vectis PCIe accelerator cards, of which in this work only one is used. Each card contains a large Xilinx Virtex-6 SX475T [24] FPGA for user logic, a smaller, non-user-programmable FPGA for the PCIe interface, and 24 GB of local SDRAM memory. This local memory is called LMem and has to be read or written in bursts of 384 adjacent bytes. However, in order to come close to the possible bandwidth of around 30 GB/s (with memory controllers synthesized at 300 MHz; up to 400 MHz is supported by the DDR3 DIMMs), several bursts, either adjacent or with a fixed stride, should be accessed with a single memory command. For example, commands with only 1 burst each lead to an efficiency of only 11%, whereas with 8 consecutive bursts, an efficiency of 80% is reached. The PCIe interface on the other hand can be used to stream data from or to host memory and reaches a bandwidth of 2 GB/s. Note that the memory controller is synthesized by the Maxeler tools onto the user FPGA alongside the custom logic.

The distinctive feature of the Maxeler systems is their development environment [8], which allows programming the FPGAs with a spatial programming language, denoted by MaxJ and realized as a Java extension. The kernel functionality implemented on FPGA is integrated with the host (CPU) part of an application through calls to an API automatically generated for the specified functionality. The MaxJ language offers much higher abstraction than HDL languages like VHDL and Verilog, but much finer control on the design than when generating hardware via HLS. Conceptually, MaxJ is built around streams of data, where typically one data element per cycle is processed in a so-called hardware kernel. A sequence of operations on one or several streams is automatically translated into a corresponding compute pipeline, where pipelining may also happen inside individual operations, in particular when they utilize DSP blocks. The streams can be connected to other kernels or to LMem or via PCIe to host memory and the Maxeler toolflow automatically generates the required buffers and interfaces.

4.2. Convey HC-1 Platform with Vector Processor Overlay

The Convey HC-1 [9], illustrated in Figure 4, is a dual socket server system, where one socket is populated with a dual core Intel Xeon 5138 (Core microarchitecture) CPU, running at 2.13 GHz, while the other socket is connected to a stacked coprocessor board. The two boards communicate using the Intel Front-Side Bus (FSB) protocol. Both processing units have their own dedicated physical memory, which can be transparently accessed by the other unit through a common cache-coherent virtual address space, which distinguishes this platform from the Maxeler system. The coprocessor consists of multiple, individually programmable FPGAs. One FPGA implements the infrastructure that is shared by all coprocessor configurations. These functions include the physical FSB interface and cache coherency protocol as well as configuration and execution management for the user-programmable FPGAs. For implementing the application-specific functionality, four high-density Xilinx Virtex-5 LX330 [25] FPGAs are available. Each of the eight memory controllers is implemented on its own Virtex-5 LX150 [25] FPGA. Each controller accesses two DIMMs, which leads to an aggregated bandwidth of close to 80 GB/s with 16 memory modules. In our system configuration, custom-made scatter-gather DIMMs are installed, which allow accessing memory efficiently in 8-byte data blocks, while standard modules are designed for 64-byte block access.

The user FPGAs can be programmed with fully custom, problem-specific designs, integrated into the rest of the system by interface libraries written in Verilog. Additionally, Convey offers a number of designs, so-called Personalities, which are developed as programmable accelerators for specific classes of tasks, such as graph traversal or local string alignment and, probably with the broadest scope, the so-called Vector Personality, which we use in this work. Since it is a programmable architecture on top of the programmable FPGAs, we consider this an overlay, which comes with abstraction benefits and overheads which we want to quantify in this work.

The Vector Personality provides the functionality of a vector coprocessor that executes programs targeting its vector instruction set. It comes in two variants, optimized for single- or double-precision floating point operations; both also support integer operations, for example, for vectorized address calculations. According to our application, we use the double-precision Vector Personality. The vector instructions are implemented for up to 1024 elements. A total of 64 vector registers are available and each can store such a set of 1024 elements. Besides the usual element-wise arithmetic vector operations, the vector instruction set contains memory instructions that distinguish it from typical SIMD vector instruction set extensions for general-purpose CPUs. It can load and store vectors where the elements are individually indexed and do not need to be aligned in a continuous memory location.

Convey includes a compiler to target this Vector Personality by annotating source code with pragmas; however, we found it to be limited to simple array data structures and simple loop nesting patterns, which often requires significant code adaptations besides adding the vectorization pragmas. We fixed many of these shortcomings with the toolflow proposed in [26]; however, for the comparison of architectural overheads of the overlay, we wanted to achieve the best possible performance. Therefore, for this work, we designed all kernels by hand in assembly code, exploiting, on top of the capabilities of the automated toolflow, additional opportunities such as vector partitioning, vector register rotation, and enhanced reuse of partially computed addresses.

4.3. Comparison of FPGA Platforms

Comparing the two hardware platforms, the Convey HC-1 is a few years older, with the utilized FPGAs being one generation behind and the CPUs being two process shrinks (Intel Tick) and one microarchitectural change (Intel Tock) behind. On the other hand, when we compare a single Maxeler MAX3424A Vectis accelerator card to the coprocessor of the Convey HC-1, the latter incorporates a lot more hardware resources. Table 1 gives an overview of the accelerator hardware as used in our experiments. Together, the four FPGAs for the HC-1’s application logic contain almost 3x more LUTs and some more BRAM resources than the single application FPGA of the MAX3424A. Similarly, the peak memory bandwidth of Convey HC-1’s coprocessor is around 2.5x higher than that of the Maxeler MAX3424A accelerator. This is essentially achieved by using more memory modules. Additionally, Convey HC-1’s memory controllers are implemented on dedicated FPGAs, in contrast to the Maxeler MAX3424A platform, where the memory controller is synthesized along with the application logic onto the same FPGA. For the Convey platform, this saves space on the application FPGAs and avoids timing issues when synthesizing new user designs. Finally, even though both platforms come closest to their peak bandwidth with linear access patterns, physically a much smaller access granularity is supported in the Convey HC-1 configuration we utilize.

In Section 8, where we assess the effects of the two different approaches to kernel design, we need to compensate for the outlined differences of the hardware platforms.

5. Kernel-Centric Acceleration

The general idea of kernel-centric acceleration as followed here is to identify runtime intense kernels with acceleration potential and execute them on FPGA and to keep other possibly complex parts of the application with small contributions to the overall runtimes on CPU. In order to identify the candidate kernels, we first performed profiling on CPU. The runtimes of all kernel functions with nonnegligible runtimes, aggregated over all their invocations when they are executed more than once, are illustrated in Figures 5 and 6 for a FullHD input image pair on both CPU platforms. The kernels are sorted by the time of their first invocation, which reflects the overall sequence of cost initialization, aggregation, scanline optimization, and disparity refinement; however, there are repetition patterns spanning several of those kernels. Based on this result, we selected the 5 aggregation kernels from horSum to scale and the 5 scanline kernels from ScanUp to sumScanlines. They cover 87% of the total program runtime, which permits by Amdahl’s law a speedup of at most 7.8x.
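For reference, the bound follows from Amdahl’s law under the idealized assumption that the accelerated fraction $f$ of the runtime is reduced to zero:

$$S_{max} = \frac{1}{1 - f}, \qquad f = 0.87 \;\Rightarrow\; S_{max} = \frac{1}{0.13} \approx 7.7$$

With the rounded coverage of 87%, the formula yields about 7.7x; the stated bound of 7.8x corresponds to an unrounded coverage of roughly 87.2%.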

Since both platforms investigated in this work have physically distinct accelerator memory, whenever possible, we want to leave data in this accelerator memory when it is read or modified by several different kernels or several invocations of the same kernel. Therefore, beyond the raw execution times, possible data reuse between the kernels was considered. In case of our stereo matching implementation, the selected kernels cover all cost volume related compute steps of aggregation and scanline optimization, thus maximizing the reuse potential of data in accelerator local memory. Based on pure profiling runtimes, the final step of scanline optimization, sumScanlines, would be a less worthwhile acceleration target than, for example, the computation of census costs, but it reduces the amount of data to be transferred from accelerator memory to host memory significantly from four cost volume instances to a single one.

Both utilized target platforms require data to be moved between CPU and accelerator memory, but in different ways. The Maxeler platform [7] requires explicit data movement functionality added to each design by the designer and the accelerator memory space is entirely managed by the developer [8]. The Convey platform [9] provides a shared memory space and different API functions for allocation on and movement between physical memory locations. In order to abstract these differences away from the application side, we modified and extended the memory manager presented in [11] for the Maxeler platform. An important feature of the memory manager, particularly useful during accelerator kernel development, is to support easy switching between CPU and accelerator execution of individual kernels with all required but no unnecessary data movements.

Our means to achieve this was to express at the beginning of every kernel which data structure it uses, whether it uses it at the host CPU or the accelerator, and whether it reads or writes to this data structure. With this information, the memory manager keeps track of all data locations and initiates all required transfers prior to actual data access. In our new extended memory manager concept, we applied these kernel annotations to both the kernels remaining on CPU and the wrappers for kernels executing on FPGAs. This goes beyond the modifications required for the methods presented in [11], where only data usage on FPGAs was indicated. The extension is however advantageous to the kernel-centric acceleration concept, because it removes the only high-level application knowledge required for the previous version, where transfers from accelerator memory back to CPU had to be initiated manually, requiring changes for each accelerator kernel that is enabled or disabled during development.

Listing 1 illustrates some kernel functions using the memory manager interface. Before they actually use data, they indicate by calls to the memory manager API how (mm.reads, mm.writes) and where (locations CPU, ACC) they are going to use it. When a kernel both reads and writes data, or when it does not completely overwrite a structure, so previous data may still exist after writing, this has to be stated explicitly like in this example for the first function, using b both as input and as output. The accelerator kernels (starting with cny for Convey, max for Maxeler) are mere wrappers and subsequently invoke execution on the respective accelerator. Due to the shared address space, the Convey kernel uses the original addresses, whereas the locations in Maxeler local memory are provided by the memory manager (mm.getLMem). Just like in [11], a memory region in Maxeler local memory is allocated lazily before the first usage of some data structure in this memory.

    void cpuABtoB(double *a, double *b) {
      mm.reads(CPU, a);
      mm.reads(CPU, b);
      mm.writes(CPU, b);
      // CPU kernel code here
    }
    void cnyAtoB(double *a, double *b) {
      mm.reads(ACC, a);
      mm.writes(ACC, b);
      callCnyKernel(a, b);
    }
    void maxAtoB(double *a, double *b) {
      mm.reads(ACC, a);
      mm.writes(ACC, b);
      callMaxKernel(mm.getLMem(a), mm.getLMem(b));
    }

Listing 2 now illustrates usage of two of those kernels. First, dynamic arrays are allocated through the memory manager, per default in host CPU memory. Then, for the first kernel call on CPU in Line 4, the memory manager determines at runtime that both arrays are already in the right location and no movement is required. In this example, the second kernel, Line 5, is executed on the Maxeler accelerator. For the data it reads, c_init, accelerator memory is lazily allocated and data is moved there from host. c_agg on the other hand is only written to, so it gets allocated in accelerator memory, but no data is actually moved. Line 6 now performs another kernel call on the host CPU. c_init was not modified in accelerator memory, so the memory manager internally still has it in a shared state and no data needs to be moved. c_agg however was modified in accelerator memory and on CPU it will now be read before it is possibly overwritten, so its data is transferred back by the memory manager.

    (1)  double *c_ad = (double *) mm.alloc(size);
    (2)  double *c_init = (double *) mm.alloc(size);
    (3)  double *c_agg = (double *) mm.alloc(size);
    (4)  cpuABtoB(c_ad, c_init);
    (5)  maxAtoB(c_init, c_agg);
    (6)  cpuABtoB(c_init, c_agg);
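The location tracking described for this call sequence can be sketched as a small state machine. This is a hypothetical model for illustration only: the type MemoryManagerSketch, the track function, the three states, and the transfer counter are our assumptions and not the interface of the actual memory manager, and no real data is moved.

```cpp
#include <map>

// Hypothetical location-tracking sketch; not the API of the actual memory manager.
enum Location { CPU, ACC };
enum State { ON_CPU, ON_ACC, SHARED };  // SHARED: valid copies in both memories

struct MemoryManagerSketch {
    std::map<const void*, State> state;  // data structure -> where valid data lives
    int transfers = 0;                   // number of simulated copies

    // Allocation defaults to host CPU memory.
    void track(const void* p) { state[p] = ON_CPU; }

    // A read needs a valid copy at loc; copy only if the data lives elsewhere.
    void reads(Location loc, const void* p) {
        State& s = state[p];
        if ((loc == CPU && s == ON_ACC) || (loc == ACC && s == ON_CPU)) {
            ++transfers;
            s = SHARED;
        }
    }

    // A write makes loc the only valid location; no data is moved.
    void writes(Location loc, const void* p) {
        state[p] = (loc == CPU) ? ON_CPU : ON_ACC;
    }
};
```

Replaying the three kernel calls of the listing against this model yields exactly two transfers: c_init to the accelerator before the second call, and c_agg back to the host before the third call.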

Listing 3 repeats the same kernel pattern, just with the accelerated kernel being executed on the Convey platform instead of Maxeler. This time at the coprocessor kernel call in Line 5 no more memory is allocated since host CPU and accelerator share the same memory space. For the input data c_init, a similar data transfer is initiated as on the Maxeler platform, just using a different API with different arguments internally. For the output data c_agg, again no physical data transfer is required. For this purpose, the Convey API contains a migrate_virtual function which does not actually move any data but just lets the affected shared memory area point now to the physical accelerator memory. This function comes in two flavors, one that touches all affected memory pages to update internal state such as the TLB (Translation Lookaside Buffer) and the other one without this touching. The version with page touching guarantees the fastest raw execution time of subsequently executed accelerator kernels and thus is important for the later evaluation of kernel acceleration. On the other hand, we found the no-touch version in combination with allocation on host to yield the fastest overall matching performance, because it partially overlaps the page touching effort with actual computation. It is even slightly faster than the alternative direct allocation as accelerator memory, even though the latter would require additional a priori knowledge about the first usage location of a data structure. Therefore, we measure and evaluate both versions in our experimental section.

    (1)  double *c_ad = (double *) mm.alloc(size);
    (2)  double *c_init = (double *) mm.alloc(size);
    (3)  double *c_agg = (double *) mm.alloc(size);
    (4)  cpuABtoB(c_ad, c_init);
    (5)  cnyAtoB(c_init, c_agg);
    (6)  cpuABtoB(c_init, c_agg);

These examples conclude this section on the selection of kernels for acceleration and the concepts and infrastructure to support memory management for both platforms through a common interface.

6. Kernel Designs for Two FPGA Platforms

In this section, we present the compute and data access patterns of the identified time-consuming kernels and outline their parallelization opportunities, taking dependencies and data locality into account. Subsequently, we discuss the compute and memory access and data reuse patterns we implemented on the two accelerator platforms. The kernels for the Maxeler platform [8] are designed with a flexible amount of parallelism, which is specified by an unrolling factor prior to synthesis. The actually utilized amount of parallelism, typically low two-digit numbers, is limited either by resource or timing limitations during synthesis (HorDiff and scanline kernels) or by the known limits of the memory interface to feed the compute pipeline (all other kernels). For details of the synthesis results and bandwidth modeling, please refer to [11]. In order to hide feedback latencies in some kernels, in addition to this explicitly utilized parallelism, we also loop through different groups of work items in different clock cycles. For the Convey vector coprocessor [9], the desired amount of parallelism to be expressed by our kernel implementations is given by the size of the vector registers with up to 1024 elements. It internally contains 32 parallel function pipes and additionally makes use of further elements for latency hiding. We present for each kernel the designs for both platforms side by side to emphasize similarities and differences. We outline the designs of the first kernel in some detail whereas for the other kernels we restrict ourselves to noteworthy aspects.

6.1. Aggregation Kernels

The cost aggregation involves five different kernels: horizontal integral sums and differences, vertical integral sums and differences, and scaling. All aggregation steps are independent for each different disparity value and also for at least one of the image dimensions.

For the Maxeler platform, the independent image dimension suffices to support the required parallelism and latency hiding, so we restrict ourselves to unrolling in this dimension. The work of Shan et al. [21] suggests that utilizing disparity level parallelism in addition to image dimension parallelism might allow saving BRAM resources at the cost of additional logic utilization, which we did not investigate further for our kernels.

On the Convey platform, small image sizes do not suffice to fill the available vector size. With vector partitioning, vectors can work on several groups of data, separated by so-called partition offsets. For the aggregation kernels, we use this feature to exploit both parallelism in image dimensions inside each partition and parallelism in disparity dimensions by multiple vector partitions.

6.1.1. Horizontal Integral Sums

Section 3 already presented simplified pseudocode for the horizontal aggregation step; Listing 4 presents the corresponding function with the actual indexing used in our software implementation. There are dependencies along the rows, but we can parallelize computation by vertical unrolling, that is, computing several rows in parallel, and additionally work on independent disparity dimensions for Convey vector partitions.

    void horSum(double *in, double *out) {
      long slice = height * width;
      for (int d = 0; d <= maxD; d++) {
        for (int y = 0; y < height; y++) {
          out[d*slice + y*width] = in[d*slice + y*width];
          for (int x = 1; x < width; x++) {
            out[d*slice + y*width + x] =
              out[d*slice + y*width + x-1] + in[d*slice + y*width + x];
          }
        }
      }
    }

Figure 7 illustrates the computation pattern implemented on the Maxeler platform. The product of unrolling factor and feedback latency determines the number of rows that are in flight at the same time as one common block. The latency is given by estimates from the Maxeler tools, whereas the unrolling factor is limited either by bandwidth estimations or by synthesis results. Blocks with more rows than this product are possible but require larger buffers and provide no further advantages. After an entire block of rows is computed, the next block of rows, not shown in the illustration, is started. Finally, also not illustrated, after one entire image (a slice of the cost volume) is finished, computation continues with the next disparity. In this description, the presented compute pattern governs the required memory access pattern; however, in practice both are closely codesigned.
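The block-wise compute order can be mimicked in software by interchanging the loops of the reference function, so that a block of rows advances column by column. The following sketch covers a single image slice; the function name and the parameter blockRows, which stands in for the product of unrolling factor and feedback latency, are our assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Horizontal integral sums for one image slice, restructured so that a block
// of blockRows rows advances column by column. blockRows stands in for the
// product of unrolling factor and feedback latency (an assumed parameter).
void horSumBlocked(const std::vector<double>& in, std::vector<double>& out,
                   int width, int height, int blockRows) {
    for (int y0 = 0; y0 < height; y0 += blockRows) {
        int yEnd = (y0 + blockRows < height) ? y0 + blockRows : height;
        for (int x = 0; x < width; x++) {
            // The rows of a block are independent of each other, so this loop
            // is what hardware can spread over parallel pipes and pipeline slots.
            for (int y = y0; y < yEnd; y++) {
                std::size_t i = static_cast<std::size_t>(y) * width + x;
                out[i] = (x == 0) ? in[i] : out[i - 1] + in[i];
            }
        }
    }
}
```

The result is identical to the row-by-row order; only the sequence in which independent rows are visited changes, which is what hides the feedback latency of the running sum.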

In memory, elements are arranged in row-major order, which means that entire rows are stored in continuous memory locations one after the other, because LMem uses the same data layout as the host application to allow for the memory management outlined in Section 5. Thus, each burst of 384 bytes reads 48 subsequent double values from each row. Figure 8 illustrates the way data is read from LMem with an appropriate memory command generator. We see that, inside each block, between memory access and compute step, the data needs to be reordered from horizontal to vertical order. This is relatively easy to do with the MaxJ concept of stream offsets; however, the actual design may need considerable amounts of BRAM. Additional simpler, nonreordering buffers are required to keep the memory and compute pipelines fed.

The implementation for the Convey vector coprocessor, as indicated in the introduction to this section, not only uses the same unrolling into the independent image dimension, here vertically, but additionally can work on more than one disparity dimension in different vector partitions. Figure 9 illustrates this pattern with 4 partitions and 256 rows covered by each partition. The innermost loop runs horizontally inside the rows to reuse the vector register containing the previous integral sum as one of the two inputs for the next step. Before entering this innermost loop, for each group of rows, the number of partitions and size of the partitions are computed based on remaining dimensions and two offsets are written into configuration registers. One is the row offset between two consecutive vector elements inside each partition, and the other one is the image slice offset between two consecutive disparity levels in the cost volume. Vector load and store instructions use these offsets to determine the memory addresses of each vector element and, in this loop profiting from the small access granularity of the scatter-gather RAM, only access the specified vector elements in memory.
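The effect of the two offsets on the accessed memory locations can be sketched as follows. The address model, the function name, and the use of element offsets instead of byte addresses are assumptions for illustration and do not reproduce the actual coprocessor instruction semantics.

```cpp
#include <cstddef>
#include <vector>

// Element offsets (not byte addresses) touched by one partitioned vector access.
// Assumed model: element e of partition p maps to
//   base + p * partitionOffset + e * elementStride.
std::vector<std::size_t> partitionedOffsets(std::size_t base,
                                            std::size_t partitions,
                                            std::size_t elementsPerPartition,
                                            std::size_t elementStride,
                                            std::size_t partitionOffset) {
    std::vector<std::size_t> offsets;
    offsets.reserve(partitions * elementsPerPartition);
    for (std::size_t p = 0; p < partitions; p++)
        for (std::size_t e = 0; e < elementsPerPartition; e++)
            offsets.push_back(base + p * partitionOffset + e * elementStride);
    return offsets;
}
```

For the horizontal sums, elementStride would be the image width (one row) and partitionOffset the slice size height * width (one disparity level). Setting elementStride to 1 instead yields contiguous elements inside each partition.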

6.1.2. Vertical Integral Sums

The vertical integral sum kernel (VerSum) is orthogonal to the HorSum kernel and contains vertical dependencies. Consequently, we now unroll computation horizontally for the implementations on both platforms.

Our Maxeler compute kernel uses the same combination of unrolling and feedback latency hiding as the HorSum kernel illustrated in Figure 7, just horizontally. When we buffer entire rows instead of blocks inside each row, the compute pattern exactly fits the data layout in memory, so we can use a linear memory access pattern instead of a customized memory command generator.

Similarly, the Convey VerSum vector kernel contains the same features, vector partitioning and data reuse in the innermost loop, as the HorSum kernel, but with vectors in horizontal image dimensions. Now the memory access inside each vector partition is continuous, which is beneficial for effective memory performance. In the vector processor instruction set, the only difference is that the element stride is now set to the element size of 8 bytes.

6.1.3. Horizontal Differences

The computation of the horizontal integral sums (HorSum) is followed by the computation of horizontal differences (HorDiff). For each pixel, a left and a right arm length are required, which define the two positions in the integral cost rows to access, before the corresponding cost values are subtracted from each other. This kernel therefore has data-dependent memory access, but only with bounded offsets from a given position, which are limited by the maximal arm lengths. There are no dependencies in this kernel, so both horizontal and vertical unrolling are possible.
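A software reference for one row makes the data-dependent access explicit. This is a sketch under assumptions: the function name, the inclusive arm convention, and the handling of the left image border are ours, since the exact boundary convention of the original code is not given in the text.

```cpp
#include <vector>

// Horizontal differences for one row of integral sums.
// Assumption: the arms of pixel x span [x - left[x], x + right[x]] and stay
// inside the row, so the aggregated value is integral[hi] - integral[lo].
void horDiffRow(const std::vector<double>& integral,
                const std::vector<int>& left, const std::vector<int>& right,
                std::vector<double>& out) {
    const int width = static_cast<int>(integral.size());
    for (int x = 0; x < width; x++) {
        int hi = x + right[x];     // last position inside the region
        int lo = x - left[x] - 1;  // position just before the region
        double below = (lo >= 0) ? integral[lo] : 0.0;
        out[x] = integral[hi] - below;
    }
}
```

Each output depends only on read-only inputs, which is why the iterations can be unrolled in either image dimension.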

Since this kernel does not contain feedback, latency hiding as used on the Maxeler platform for the integral sum kernels is not needed here. With the burst-oriented Maxeler memory interface, we need to have the window of possibly required integral cost data available in local buffers. This seems straightforward when unrolling horizontally, since neighboring pixels in one row require largely overlapping areas of possible input values defined by the arms. Figure 10 illustrates the use of multiplexers for the selection of the position specified by right arms for an unrolling factor of 4 and with possible values for the arm length of 0 to 4 (in practice we use a length of up to 34 as proposed in [10]).

Figure 10 illustrates that the overlapping of the possible access windows makes the buffer very space efficient, in the example actually using only 8 registers to buffer the possible inputs for 4 parallel access operations with 5 different input options each. However, even though resource utilization would permit it, the synthesis tools were not able to route such a design with higher unrolling factors at anywhere near 100 MHz. The illustration in Figure 10 may give an intuitive idea that the high number of overlapping signal routes to the different multiplexers causes this problem.

As an alternative, we tried vertical unrolling as in the previous kernel. Here, in addition to the resource consumption of reordering between row-oriented memory access and column-oriented compute step, a buffer for the possibly accessed input elements needs to be instantiated for each parallel row. With this approach, unrolling was limited by resource consumption after synthesis. Therefore, specifically for this kernel, in our final Maxeler design, we combined horizontal and vertical unrolling, achieving the largest synthesizable design with an overall parallelism of 16.

On the Convey vector processor platform, conceptually the vector registers might provide a similar line buffer for input cost values selected by arms. However, the instruction set does not support any form of parallel access to specific indexed elements of the vector, so this is not possible. Instead, we resort to computing the address of each element of the horizontal integral sums which needs to be accessed by adding the arm length value to each respective base address. Then, these addresses are used for indexed vector load operations, which are however less efficient than the access with regular strides as outlined for the previous kernel.

We again use multiple vector partitions covering several disparity values in each computation step for efficient utilization of the vector size with small images. Since there is no dependency of the two inner loops, the parallelism in each partition can be provided either by horizontal or by vertical vectors. After implementing and measuring both alternatives, we use vertical unrolling to form the vectors. When forming horizontal vectors, several loads of input cost values inside a vector load may point to the same location. This works functionally correctly but apparently causes additional latencies in the memory interface.

6.1.4. Vertical Differences

Similar to the HorDiff kernel, the computation of vertical differences (VerDiff) does not contain any dependencies. On the Maxeler platform, horizontal unrolling does not suffer from the routing and timing difficulties of the HorDiff kernel, because now selection of arm positions is realized independently for each column. Thus, we can restrict unrolling to the horizontal dimension here and still reach the unrolling factors listed in Table 2. On Convey, we again use vector partitioning and this time unroll the vectors horizontally, following the data access pattern of the vertical summation and again avoiding indexed vector loads that contain several instances of the same address.

6.1.5. Scaling

Finally, in the scaling kernel (Scale), each aggregated value is scaled (i.e., divided by the size of its specific aggregation region). It is a straightforward streaming kernel without dependencies and on both platforms can be readily unrolled horizontally, following a linear memory access pattern. On the other hand, the division of double-precision floating point values is neither easy nor efficient to implement on the Maxeler platform and not supported in the vector instruction set of the Convey coprocessor. Fortunately, there are only a fixed number of discrete sizes that any aggregation region can have, so we can precompute the inverse values and replace division operations by multiplications with the inverse values. On the Maxeler platform, those precomputed factors are stored in BRAM and looked up locally. For each parallel function pipe, a separate block of BRAM is instantiated. Due to the indexed access pattern, the Convey vector coprocessor again cannot use the vector registers to hold those values but instead reads them with indexed vector loads from memory. Again, lookups to the same address impair performance, so on this platform we replicate the block of lookup values in memory and use the vector indices to distribute lookups to different blocks.
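The replacement of divisions by table lookups and multiplications can be sketched as follows. The function names and the representation of region sizes as integer counts are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Table of reciprocals 1/n for all region sizes 1..maxRegionSize (index 0 unused).
std::vector<double> buildInverseTable(int maxRegionSize) {
    std::vector<double> inv(maxRegionSize + 1, 0.0);
    for (int n = 1; n <= maxRegionSize; n++) inv[n] = 1.0 / n;
    return inv;
}

// Scale each aggregated cost by a table lookup and a multiplication
// instead of a floating point division.
void scale(const std::vector<double>& cost, const std::vector<int>& regionSize,
           const std::vector<double>& inv, std::vector<double>& out) {
    for (std::size_t i = 0; i < cost.size(); i++)
        out[i] = cost[i] * inv[regionSize[i]];
}
```

Note that multiplying by a rounded reciprocal can differ from a true division in the last bit of the result, which is usually acceptable for matching costs but matters when bit-exact equivalence with a dividing CPU version is checked.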

6.2. Scanline Kernels

In contrast to the aggregation, the scanline optimization is not independent for different disparity values. On the contrary, for the computation of the scanline costs of a new pixel, the minimal scanline costs of the previous pixel over all disparities need to be known. On the other hand, we also have a dependency along the scanlines, such that unrolling can only be performed orthogonally to the scanline direction.

6.2.1. Vertical Scanlines

On the Maxeler platform, we implemented a common vertical scanline compute kernel (ScanVer), suitable for both ScanUp and ScanDown kernels of the host application, switching between both modes by configuring the accompanying memory command generator for different access directions. Figure 11 illustrates the dependency pattern for downward scanline computation and how it can be unrolled horizontally, here with boxes of size 4. All yellow boxes are required as inputs to compute the red boxes. The aggregation costs are read in the required pattern as inputs, as well as the color difference information (not illustrated in the figure) needed to determine the penalty values for each row. The resulting scanline costs are written out to LMem, but for an entire disparity range also buffered locally in BRAM for reuse in the next row. Therefore, computation is performed in blocks, but not in entire rows, as this would require excessive buffer space or additional readbacks.

In our actual Maxeler implementation, due to the burst size of the LMem interface, data blocks of 48 horizontal elements are loaded from memory and processed over several cycles before proceeding to the next line. Since the previous minimum from step 0 is required to update the minimum for step 1, we reordered the datapath for the recursion of (2) to have a deeper pipeline for the computation of the individual scanline costs and a simple comparison for the selection of the current minimal scanline value. Nevertheless, similar to the integral sum kernels, we incur a feedback latency of four cycles, which for the block size of 48 limits the possible spatial unrolling factor to 12 (48/4 = 12). We also tried larger block sizes to obtain more possible compute throughput, but the resulting large designs failed to meet timing.
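The split between cost computation and minimum selection can be illustrated in software for a single pixel. The sketch assumes that the recursion has the common semi-global form with two penalties, here called p1 and p2; this form, the function name, and all parameter names are assumptions and may differ in detail from (2).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One scanline step for a single pixel over all disparities.
// Assumed form of the recursion (the usual semi-global scheme):
//   cur[d] = cost[d] + min(prev[d], prev[d-1] + p1, prev[d+1] + p1, prevMin + p2) - prevMin
// Returns the minimum of cur, which the next pixel on the scanline needs.
double scanlineStep(const std::vector<double>& cost, const std::vector<double>& prev,
                    double prevMin, double p1, double p2, std::vector<double>& cur) {
    const std::size_t n = cost.size();
    double curMin = 0.0;
    for (std::size_t d = 0; d < n; d++) {
        double best = std::min(prev[d], prevMin + p2);
        if (d > 0)     best = std::min(best, prev[d - 1] + p1);
        if (d + 1 < n) best = std::min(best, prev[d + 1] + p1);
        cur[d] = cost[d] + best - prevMin;
        // Running minimum: only this single comparison per disparity
        // feeds back into the computation of the next pixel.
        curMin = (d == 0) ? cur[d] : std::min(curMin, cur[d]);
    }
    return curMin;
}
```

In this form, each new cost needs the previous costs at three neighboring disparities plus the previous minimum, and only the running minimum has to be carried from one pixel to the next.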

On the Convey vector processor platform, we implemented two individual assembly kernels for ScanUp and ScanDown to save unnecessary selection instructions for the direction, but both implementations have identical structures. According to the dependence pattern, vectors cover entire rows or parts of rows, depending on image sizes. In contrast to the aggregation kernels, vector partitioning into different disparity dimensions is not possible.

The compute order looks very similar to the one illustrated for Maxeler in Figure 11, just with much larger blocks formed by the vectors. On this platform, the scanline costs of the previous line cannot all be buffered for the next line, so they are read back at every iteration of the vertical loop. However, only one disparity block needs to be read for every newly computed block; the other two are reused from the previous iteration of the innermost loop, the disparity loop. For example, in Figure 11, the yellow block with scanline cost 3 was newly read for the current compute step; the other two yellow scanline costs are reused, which is done efficiently by using the vector register rotation feature of the vector instruction set. Also, our code makes heavy use of vector mask generation and vector element selection to find the different possible scanline paths inside one single vector, which can be used by the vector internal streaming of the coprocessor to skip masked-out elements.

6.2.2. Horizontal Scanlines

On Maxeler, for horizontal scanlines (ScanLeft and ScanRight), the orthogonal unrolling concept from vertical scanlines with buffering the entire previous scanline costs was not applicable due to prohibitive BRAM requirements. This is because bursts were still aligned horizontally, but unrolling would have to be done vertically and additionally the buffers would have to cover all disparity dimensions. Therefore, we decided to implement an auxiliary turn kernel (Turn) that reads cost arrays in row-major data layout and writes them back to LMem in column-major data layout, or vice versa. Now we can execute horizontal scanlines by a sequence of turning input aggregation data, applying vertical scanlines and turning scanline result data back. The overhead of this turning step gets mitigated, because both horizontal scan kernels use the same turned input data by utilizing the ScanUp and ScanDown variants of the vertical scan kernels.

The Turn kernel uses 48 BRAM blocks, which are written and read with a diagonally shifted addressing scheme that allows either an entire row or an entire column of 48 values to be accessed at once. The size of the blocks to be turned has to match at least the 48 elements per burst from the LMem interface, so in contrast to all other kernels we implemented this kernel with a fixed unrolling factor of 48.
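One possible diagonally shifted addressing scheme is sketched below for an n x n tile with n banks. The concrete mapping (bank (r + c) mod n, address r) and the function name are assumptions for illustration; the actual kernel may use a different but equivalent mapping.

```cpp
#include <cstddef>
#include <vector>

// Transpose an n x n tile through n banks with diagonally shifted addressing.
// Assumed scheme: element (r, c) is stored in bank (r + c) % n at address r,
// so the n elements of any row and of any column lie in n different banks.
std::vector<double> turnTile(const std::vector<double>& tile, std::size_t n) {
    std::vector<std::vector<double>> bank(n, std::vector<double>(n, 0.0));
    // Write phase: one row per step, each element goes to a distinct bank.
    for (std::size_t r = 0; r < n; r++)
        for (std::size_t c = 0; c < n; c++)
            bank[(r + c) % n][r] = tile[r * n + c];
    // Read phase: one column per step, again hitting each bank exactly once.
    std::vector<double> turned(n * n);
    for (std::size_t c = 0; c < n; c++)
        for (std::size_t r = 0; r < n; r++)
            turned[c * n + r] = bank[(r + c) % n][r];
    return turned;
}
```

Because every row and every column touches each bank exactly once, both the write and the read phase could proceed with n parallel accesses per step when each bank has a single port.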

On the Convey coprocessor platform, the finer grained access capabilities of the memory interface allow direct implementation of the horizontal scanline kernels without prior turning. The kernel structure is very similar to the vertical one, just using row strides between the vertically unrolled vector elements as for the HorSum and HorDiff kernels.

6.2.3. Average over Scanline Directions

After computing the costs along all scanline directions, the final scanline costs for each position and disparity are computed by averaging the values of all directions. On both platforms, the resulting ScanAvg kernel is a straightforward streaming kernel with one linear output and four linear input streams. As outlined in Section 5, its particular value for the overall implementation lies in the reduction of output data size that has to be transferred back from accelerator memory to host memory.

This concludes the part of this section covering kernel designs for both platforms. On the Convey platform, the kernels were directly integrated into a heterogeneous executable by filling empty proxy kernels with the proper signature compiled by the Convey compiler with the actual assembly code providing the described functionality. On the Maxeler platform, the kernels defined in the MaxJ language are synthesized to kernel specific FPGA dataflow designs which is summarized in the following subsection.

6.3. Synthesis and Integration

As indicated, most Maxeler dataflow kernel designs are parametrizable at synthesis time with an unrolling factor, which is often constrained by several rules. It must be a whole number divisor of burst sizes, and the product of the unrolling factor and the feedback latency must not exceed burst or block sizes. The Turn kernel has a fixed size for the diagonal buffer addressing scheme. The practically possible unrolling factors are further constrained by resource utilization and our decision to aim for a clock frequency of at least 100 MHz for the datapath. We furthermore analyzed the bandwidth requirements and did not investigate unrolling factors which would exceed those by much. The first data column of Table 2 summarizes the final unrolling factors utilized for individual kernels. Out of eight individual kernels, six were able to reach or exceed the bandwidth limit. Details of this analysis can be found in [11].
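The two structural rules can be turned into a small enumeration of candidate unrolling factors. The function name and the choice to count burst and block sizes in elements are assumptions for illustration; resource, timing, and bandwidth limits are not modeled.

```cpp
#include <vector>

// Unrolling factors that satisfy the two structural rules stated above:
// the factor divides the burst size (in elements), and the product of factor
// and feedback latency does not exceed the block size.
std::vector<int> validUnrollFactors(int burstElements, int blockElements,
                                    int feedbackLatency) {
    std::vector<int> factors;
    for (int u = 1; u <= burstElements; u++)
        if (burstElements % u == 0 && u * feedbackLatency <= blockElements)
            factors.push_back(u);
    return factors;
}
```

For a burst and block size of 48 elements and a feedback latency of four cycles, the candidates are 1, 2, 3, 4, 6, 8, and 12, which matches the limit of 12 stated for the vertical scanline kernel.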

Anticipating some performance results from Section 8, in the aggregation phase, there is a high overhead for reconfiguring the FPGAs with different kernels in the sequence in which these kernels are utilized. Thus, for the five aggregation kernels that are repeated in different cycles during the application, we created a common design implementing all of their functionality in the same FPGA configuration, thus saving the reconfiguration overhead. We had some headroom for this integration, because not all of the individual kernels hit resource limitations, but we still had to decrease the unrolling factors. The final integrated aggregation design was chosen as the optimal trade-off between unrolling and achievable timing and runs at 130 MHz. The second data column of Table 2 summarizes the decreased unrolling factors. For the scanline phase, no integrated design was found that increased overall performance, not even for small images. In this phase, fewer reconfigurations are required, so the overhead that can be saved is much smaller. On the other hand, severe reductions of the unrolling factors were required to get routable designs.

In Table 3, we finally summarize the resource usage of all used dataflow kernel designs. The table highlights that the individual kernels do not hit hard limits in resource consumption; however, for HorDiff and Scan, no larger design with valid unrolling factor could successfully be synthesized. The critical resources of all kernels are either logic slices or BRAMs.

7. Experimental Setup

In this section, we first present the setup and notation for the evaluated systems and their configurations. We then discuss the generation and selection of our input data.

7.1. Evaluated Systems

After implementing and testing all described kernels on both accelerators, we integrated them into our stereo matching application and tested it in a total of six different configurations.
(1) CPU1. The entire execution is performed on the Intel Xeon X5650 CPU with Westmere microarchitecture, running at 2.66 GHz, as host CPU of the Maxeler platform [7].
(2) CPU2. The entire execution is performed on the Intel Xeon 5138 CPU with the older Core microarchitecture, running at only 2.13 GHz, as host CPU of the Convey platform [9].
(3) MaxKern. The first accelerated configuration executes the individual, maximally unrolled dataflow kernels on the Maxeler accelerator card. This design point guarantees the highest raw kernel performance but induces considerable configuration overheads, in particular during the aggregation phase. The parts of the application that are not accelerated are executed on CPU1 and the memory manager presented in Section 5 handles transfers between host and accelerator memory.
(4) MaxFused. The second configuration using the Maxeler accelerator card uses the integrated aggregation design, containing five kernels with reduced parallelism. The remainder of the execution is identical to MaxKern, including utilization of individual kernels for the scanline phase. This configuration saves a lot of reconfiguration overhead during the aggregation phase in exchange for reduced raw kernel performance.
(5) CnyVecTouch. On the Convey platform, the host parts of the application are executed on the slower CPU2 and the accelerated kernels are executed on the vector processor overlay on the FPGA accelerator. Thus, no bitstream reconfigurations are required during application runtime, but only the much smaller kernel code executed on the coprocessor is changed in the different matching phases. As coprocessor memory interleaving mode, we use a 31-31 interleaving mode, which maps memory addresses to the individual memory banks in a way that allows near-peak throughput for most possible stride patterns. For our tests, we set up the 24 GB of physical host memory and 16 GB of physical accelerator memory with a windowed memory mode with a 12 GB window of mapped coprocessor memory, 12 GB of pure host memory, and 4 GB of pure coprocessor memory. As suggested in Section 5, when no actual data has to be transferred, we use two different strategies to migrate allocated memory areas between the distinct physical memory locations. Here, with the first one, all involved pages are touched on the new location to guarantee the best raw kernel performance.
(6) CnyVecNt. With the second strategy, no-touch, the migration step is much faster and overall matching performance is a bit higher, at the cost of some increased kernel runtimes. All other settings are identical to CnyVecTouch.

7.2. Input Data

Conceptually, all accelerated configurations profit from larger image sizes and higher maximal disparity values, as parallelism and pipelining can be exploited better and overheads are amortized better by longer computation times, whereas smaller sizes may help the pure host execution by better caching opportunities. Beyond this general rule of thumb, there are different characteristics specific to either the Maxeler or the Convey accelerator platform. On Maxeler, all LMem access types need to be 384-byte aligned, so in practice we pad all data structures and memory accesses to fit these requirements. This padding is an overhead that does not occur for multiple-of-384 dimensions. On the Convey platform, for the best performance, it is important to fill the 1024 vector elements. This is trivially the case for multiple-of-1024 dimensions but in the aggregation phase can also be achieved by nicely fitting vector partitions, for example, for horizontal sums with height 256 and multiple-of-4 disparities as in our earlier illustration in Figure 9. Additionally, more subtle effects occur when the image sizes interfere with the memory interleaving mode, which defines the distribution of memory space to different memory banks. However, the mentioned 31-31 interleaving mode makes our experiments relatively robust in this regard.
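The padding requirement on the Maxeler platform amounts to rounding sizes up to the next burst boundary. The helper names below are hypothetical, and whether padding is applied per row or per data structure is an assumption made only for this sketch.

```cpp
#include <cstddef>

// Round a byte count up to the next multiple of the 384-byte LMem burst size.
std::size_t padToBurst(std::size_t bytes) {
    const std::size_t burstBytes = 384;
    return (bytes + burstBytes - 1) / burstBytes * burstBytes;
}

// Padded length of one row of doubles, in elements (48 doubles per burst).
std::size_t paddedRowElements(std::size_t width) {
    return padToBurst(width * sizeof(double)) / sizeof(double);
}
```

A row of 1920 doubles occupies exactly 40 bursts and needs no padding, whereas a row of 1024 doubles would be padded to 1056 elements under this per-row assumption.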

To summarize, absolute and relative performance significantly depend on the input dimensions for our stereo matching systems. We therefore decided to perform our measurements with a series of different input dimensions and to use standardized real-world image sizes or screen resolutions, regardless of their suitability for either architecture. In order to generate the input images, we scaled image pairs from the Middlebury benchmark set [14] to the desired resolution with cubic scaling in Gimp. The number of disparity steps to investigate is scaled according to the scaling factor of the image width. This is important because, with too low a limit on the possible disparities, matching artifacts occur, which lead to disproportionately longer runtimes of the disparity refinement step.
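The disparity scaling rule can be written down in a few lines. The following is a minimal sketch, not the authors' tooling: the function name is hypothetical, rounding to the nearest integer is an assumption, and the base dimensions (Tsukuba: 384 pixels wide with 16 disparity steps; Cones: 450 pixels wide with 60 disparity steps) are assumed from the commonly distributed Middlebury data rather than stated in this section. Under these assumptions the results agree with the input dimensions quoted in Section 8.1.

```python
def scaled_disparities(base_width: int, base_disparities: int, target_width: int) -> int:
    """Scale the number of disparity steps by the same factor as the image width.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if base_width <= 0 or target_width <= 0 or base_disparities <= 0:
        raise ValueError("all arguments must be positive")
    return round(base_disparities * target_width / base_width)


# Assumed base dimensions of the original image pairs (not given in this section).
TSUKUBA = {"width": 384, "disparities": 16}
CONES = {"width": 450, "disparities": 60}

if __name__ == "__main__":
    for target_width in (960, 1920):
        print("Tsukuba", target_width,
              scaled_disparities(TSUKUBA["width"], TSUKUBA["disparities"], target_width))
    for target_width in (512, 1280):
        print("Cones", target_width,
              scaled_disparities(CONES["width"], CONES["disparities"], target_width))
```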

We created two test series, one starting from the Tsukuba image pair, which has a low ratio of maximal disparity to image width, and one starting from the Cones image pair, which has a high ratio of maximal disparity to image width. Tables 4 and 5 show the two series of input dimensions we investigated. We scaled the two image pairs to different commonly used sizes with pixel ratios between 5 : 4 (SXGA) and 64 : 35 (EGA), most of them at 4 : 3 like the original Tsukuba pair. We selected the set of sizes in a way that the number of pixels between two consecutive sizes increases by factors between 1.08 (from UXGA to FullHD) and 1.46 (from SXGA to UXGA) and the number of elements in a cost volume increases by factors between 1.29 (from UXGA to FullHD) and 1.82 (from SXGA to UXGA). This series is currently limited by two aspects. Firstly, a maximal line width of 1920 is synthesized in one of our Maxeler kernels. Secondly, for larger input dimensions, total memory usage starts to become an issue. On the Maxeler platform, with our current implementation of the memory manager, the 24 GB of accelerator memory put a hard limit on the maximal input dimensions. On the Convey platform, we were able to execute tests with larger input dimensions, but performance was impaired by the Linux kernel starting to swap data between main memory and hard disk.
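The quoted pixel growth factors can be checked with a short calculation. This sketch assumes the standard resolutions behind the size names (SXGA 1280 × 1024, UXGA 1600 × 1200, FullHD 1920 × 1080); the dictionary and function names are hypothetical. The cost volume factors additionally depend on the rounded disparity counts from Tables 4 and 5 and are therefore not reproduced here.

```python
# Standard resolutions for the named sizes (width, height).
SIZES = {
    "SXGA": (1280, 1024),
    "UXGA": (1600, 1200),
    "FullHD": (1920, 1080),
}


def pixel_growth(smaller: str, larger: str) -> float:
    """Ratio of pixel counts between two named image sizes."""
    w0, h0 = SIZES[smaller]
    w1, h1 = SIZES[larger]
    return (w1 * h1) / (w0 * h0)


if __name__ == "__main__":
    print(round(pixel_growth("SXGA", "UXGA"), 2))
    print(round(pixel_growth("UXGA", "FullHD"), 2))
```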

8. Evaluation and Comparisons

We first present overall system performance for both platforms. For the main comparison between the approaches of specialized kernels and the reusable vector processor overlay, we focus on the raw kernel performance with both methods and abstract the underlying hardware away as far as possible. Finally, we give some estimates of the design efforts for both approaches.

8.1. Stereo Matching System Performance

Our first charts present speedups for the execution of the entire stereo matching process for different image sizes compared to pure host execution. For the two respective image series, Figures 12 and 13 show the speedups of the four configurations with accelerators, MaxKern, MaxFused, CnyVecTouch, and CnyVecNt, compared to the host execution on the faster CPU1. Since the host components of the two CnyVec versions are executed on CPU2, we also exemplarily include Figure 14, where CPU2 is used as baseline for speedups of the low-disparity test series. Host CPU agnostic speedups of the CnyVec versions should be somewhere in between the values from Figures 12 and 14.

For both test series, we see that both CnyVec configurations on the Convey HC-1 [9] can achieve speedups already at small image sizes that do not fully fill the vector registers. 512 × 384 is the first image size at which CnyVec achieves a small speedup over CPU1. The speedups increase slightly with increasing image sizes but show some variations for specific sizes that fit the vector register sizes or the memory interface a bit better or worse. At 1280 × 1024 × 171, CnyVecNt reaches a peak speedup of 1.9x over CPU1. With the slower CPU2 as baseline, the speedups are around 3x.

On the Maxeler platform [7], the MaxFused configuration with a common design for all aggregation kernels persistently outperforms the MaxKern configuration, with its individual, maximally parallel aggregation kernels. However, for small image sizes, MaxFused is still slower than CPU1 and both CnyVec configurations. In the low-disparity test series, it takes the lead over all other designs for the first time at 960 × 640 × 40. In this test series, its speedup peaks at 1920 × 1400 × 80 with 2.4x compared to CPU1. MaxFused profits from the higher disparities of the second test series, achieving a first speedup over CPU1 already at 512 × 384 × 68 and a peak speedup of 2.8x at 1280 × 1024 × 171.

Figure 15 displays some additional details of the underlying data for CPU1 and the respective faster versions for both accelerators, now showing absolute execution times and subdividing them into aggregation phase, scanline phase, and all remaining parts of the application. We see that, in the aggregation phase, only MaxFused outperforms CPU1 by a small margin. However, execution of this phase on the accelerator has the additional benefit that, afterwards, intermediate results are already in accelerator memory for the following scanline phase. During this scanline phase, now both accelerators achieve significant speedups compared to CPU1. During the execution phases remaining on the respective host, CnyVecNt notably loses some of its earlier speedups compared to CPU1, because its host code is executed on the slower CPU2.

8.2. Platform Overheads

All further comparisons are only performed with regard to the faster CPU1. We proceed with the analysis of the two accelerated phases, in this subsection on the basis of results from 1920 × 1400 × 80, and compare CPU1 to all accelerated designs. Figure 16 breaks down the total execution time of the aggregation phase, splitting the entire yellow blocks from Figure 15 into individual components. The first component is the raw execution time of the five described aggregation kernels, still summed up together. We see that this raw kernel execution time is significantly reduced on all accelerator platforms and configurations compared to CPU1, down from 47 s to between 15 s and 24 s on the accelerators, with the design with the highest parallelism, MaxKern, executing fastest.

The next component is the total time of all data transfers between host and accelerator memory, which are initiated through our memory manager. For pure CPU execution, naturally no such transfers are needed. Here, we see that part of the lower execution times observed on the Maxeler platform in comparison to the Convey platform are caused by lower transfer times, either because of faster physical interconnection or because of the overhead incurred for the realization of the shared memory space on Convey. Masking some of this overhead in the CnyVecNt configuration overcompensates for the increased raw kernel runtimes compared to CnyVecTouch. As third component, we summarized the time spent outside the five kernels selected for acceleration. In the aggregation phase, this Host Setup time includes, for example, the time to initialize the aggregation regions needed for scaling. This phase is notably slower on the Convey platform again because of the slower host CPU2.

The fourth component, reconfiguration times, only occurs on the Maxeler platform. We see that, for the MaxFused design, this overhead is negligible as only one reconfiguration is performed, whereas for the individual aggregation kernels in MaxKern, it more than eats up the additional speedups achieved in raw kernel execution times. As final component, we measured the time spent in platform specific allocation and free API calls on Convey, which turns out to be relatively minor in the two configurations observed.
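The decomposition used in this breakdown can be summarized as a simple sum of components. The sketch below is illustrative only: the class and field names are hypothetical, and the numbers are made-up placeholders, not the measurements shown in Figure 16. It merely shows how a design with faster raw kernels can still lose overall once reconfiguration time is added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseBreakdown:
    """Time components of one accelerated phase, in seconds."""
    kernel: float                 # raw kernel execution time
    transfer: float = 0.0         # host <-> accelerator data transfers
    host_setup: float = 0.0       # time outside the accelerated kernels
    reconfiguration: float = 0.0  # FPGA reconfiguration time
    alloc_free: float = 0.0       # platform-specific allocation and free calls

    def total(self) -> float:
        return (self.kernel + self.transfer + self.host_setup
                + self.reconfiguration + self.alloc_free)


# Placeholder values for illustration only (NOT measured data).
individual_kernels = PhaseBreakdown(kernel=10.0, transfer=3.0, host_setup=2.0,
                                    reconfiguration=8.0)
fused_design = PhaseBreakdown(kernel=14.0, transfer=3.0, host_setup=2.0,
                              reconfiguration=1.0)

if __name__ == "__main__":
    print("individual kernels:", individual_kernels.total())
    print("fused design:      ", fused_design.total())
```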

Figure 17 displays the same components for the scanline phase. In this phase, both Maxeler configurations execute the identical designs and thus perform identically. Due to higher computational intensity and higher data reuse, all accelerator platforms in all configurations reduce the raw kernel execution times much more than in the aggregation phase. Compared to its kernel execution times, CnyVecTouch incurs a huge overhead for data transfers, which CnyVecNt can again partially mask during kernel execution. Compared to the CPU execution times, these overheads are smaller than during the aggregation phase, thus allowing considerable overall speedups.

8.3. Kernel Performance

In order to compare the kernel specific dataflow design approach with the vector processor overlay in regard to their suitability for kernel-centric acceleration, we now look at individual kernel execution times and disregard the platform overheads discussed in the previous subsection. Figures 18 and 19 show the raw kernel execution times of the aggregation and scanline phases, again for the largest image pair, now separated into individual kernels, but summing up the execution times of all invocations of the same kernel to be comparable to the previous plots.

We again compare the faster CPU1 with all accelerated versions for completeness; however, we do not consider the data of CnyVecNt very relevant when it comes to assessing the potential of the vector overlay approach, since here the raw kernel execution times are just increased due to the partial masking of transfer overhead, which is to be excluded in this comparison anyway. Therefore, we focus on CnyVecTouch for the evaluation of the overlay architecture. For the specialized dataflow designs, on the other hand, we consider both design points, since both the existence and the absence of design trade-offs due to reconfiguration overheads can represent relevant real-world scenarios.

For the aggregation kernels in Figure 18, we see quite diverse results, with either MaxKern or CnyVecTouch achieving the best kernel runtimes. Furthermore, we see an unexpected artifact for HorSum, where MaxFused, in spite of less compute parallelism, is faster than MaxKern. Presumably both kernels are limited by effective memory bandwidth, with MaxFused generating a slightly more favorable memory access pattern. The HorDiff kernel, in turn, was already projected to be compute bound for the MaxKern design. The MaxFused design with four times less compute parallelism is around 4x slower, supporting this assumption. The VerSum and scale kernels seem to have become compute bound for the MaxFused design, showing smaller slowdowns compared to MaxKern.

A detailed discussion on the underlying effects for the comparison of dataflow and vectorized kernels needs to take into account the achieved compute parallelism, memory reuse properties as outlined in Section 6, bandwidth requirements, and the impact of memory access patterns on the achieved bandwidths. Also, during our optimizations of the vector overlay kernels, we saw that performance cannot be easily modeled as a function of compute throughput or of effective memory bandwidth but also depends on latencies and sequence of dependent instructions. Thus, detailed attribution of certain results to possibly dominating performance factors would be mostly a speculation without additional measurements. However, the effects of sensitivity to latencies and those of different arithmetic intensities caused by data reuse and design of operations are attributable to the kernel design paradigm and thus form the actual subject of our comparison. On the other hand, effects of different amounts of available compute resources and memory bandwidths distort this comparison. Thus, in the following subsection, we try to extract the former design effects by compensating for the latter hardware effects. However, first we proceed with the comparison of kernel performance.

Comparing the runtimes of scanline kernels in Figure 19, we see a more homogeneous result pattern, with the most notable observations being the difference in CPU performance in horizontal and vertical directions and that the specialized kernels dominate for the actual scanline computation whereas the vector overlay takes the lead for the subsequent summation step.

In our concrete stereo matching application, the various kernels do have their individual, well-defined contributions to the overall runtime. However, when comparing the two design methods in regard to their general suitability for kernel-centric acceleration, we want to abstract from these individual kernel weights and just profit from the variety of compute and data-usage patterns represented by different kernels. Thus, we consolidate these results into a single metric, the geometric mean of individual kernel speedup factors that each approach achieves over the reference CPU execution. We denote this metric as Kernel-Ratio, analogous to the similarly computed SPECRatio. This metric has the nice property that the choice of reference platform does not impact the relative ratio between the two other platforms.
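The Kernel-Ratio and its reference invariance can be sketched as follows. This is a minimal illustration, not the evaluation scripts used for this work: the function name is hypothetical, and the kernel names and runtimes are made-up placeholders rather than measurements.

```python
import math


def kernel_ratio(reference_times: dict, platform_times: dict) -> float:
    """Geometric mean of per-kernel speedups of a platform over a reference."""
    if reference_times.keys() != platform_times.keys():
        raise ValueError("both platforms must report the same kernels")
    speedups = [reference_times[k] / platform_times[k] for k in reference_times]
    # Geometric mean computed via the mean of logarithms.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Placeholder runtimes in seconds (NOT measured data).
cpu = {"kernel_a": 8.0, "kernel_b": 4.0, "kernel_c": 2.0}
accel_1 = {"kernel_a": 2.0, "kernel_b": 4.0, "kernel_c": 1.0}
accel_2 = {"kernel_a": 1.0, "kernel_b": 2.0, "kernel_c": 2.0}

kr_1 = kernel_ratio(cpu, accel_1)  # accelerator 1 relative to the CPU
kr_2 = kernel_ratio(cpu, accel_2)  # accelerator 2 relative to the CPU

# Reference invariance: the quotient of the two CPU-relative ratios equals
# the ratio computed directly between the two accelerators.
direct = kernel_ratio(accel_2, accel_1)

if __name__ == "__main__":
    print(kr_1, kr_2, kr_1 / kr_2, direct)
```

Because the geometric mean of quotients is the quotient of geometric means, the reference runtimes cancel out, which is why the ratio between two accelerated designs does not depend on the chosen CPU.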

Table 6 summarizes those Kernel-Ratios for three accelerated designs with reference to CPU1 for the geometric mean of all input sizes and individually for the largest problem size tested, SXGA with high disparity. The reference invariance of the Kernel-Ratio metric allows directly deriving additional ratios between the platforms in the list. So, considering the comparison of the two kernel design approaches, for all image sizes, the Kernel-Ratio of MaxKern with reference to CnyVecTouch is computed as the quotient of their respective Kernel-Ratios with reference to CPU1 from Table 6. The value for the SXGA high-disparity test is computed in the same way.

The results, when comparing the Kernel-Ratios of the two specialized dataflow kernel approaches on Maxeler with the vector overlay on Convey, are surprising. In the geometric mean, the specialized kernels are just marginally faster than the vector overlay. When trading off parallelism for the integration of several specialized kernels in MaxFused, the specialized kernels are even slightly slower than the overlay, both for all image dimensions and for high-disparity SXGA. However, as indicated above, these numbers do abstract away the data transfer and reconfiguration overheads but still contain the mismatch in available compute resources and memory bandwidths.

8.4. Hardware-Normalized Kernel-Ratio

We try to extract the effects of different kernel design approaches on the two platforms by compensating for the effects of underlying hardware. For this, we need metrics to compare the hardware platforms and approach this by looking at compute resources and memory bandwidth. As a measure of basic compute resources, we compare the number of 6-input LUTs, which are common to Virtex-5 and Virtex-6 FPGAs, and obtain a first ratio of Maxeler to Convey hardware. Similarly, we compute the ratio of theoretical peak memory bandwidths. For a more sophisticated compensation of hardware configurations, we would offset each observed kernel speedup with one of those factors, depending on whether the kernel is compute or bandwidth bound. However, since the two factors are roughly similar, we just average them into a combined hardware ratio. We multiply Maxeler to Convey Kernel-Ratios by the inverse of the combined hardware ratio in order to normalize performance to comparable hardware platform characteristics. This leads to a metric we denote as Normalized Kernel-Ratio and present in Table 7. We can summarize these results as central contribution of this work as follows:

In a diverse set of compute kernels with data parallelism, specialized dataflow kernel implementations on FPGAs are around 3x more efficient in terms of performance than a reusable vector processor overlay implemented on comparable hardware. In a concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, this advantage shrinks to around 2.5x.
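The normalization step described in this subsection amounts to a single division. The following sketch assumes hypothetical function names and uses placeholder inputs only; the actual LUT and bandwidth ratios and the resulting values in Table 7 are not reproduced here.

```python
def combined_hardware_ratio(compute_ratio: float, bandwidth_ratio: float) -> float:
    """Average of the compute-resource ratio and the peak-bandwidth ratio,
    each expressed as platform A relative to platform B."""
    if compute_ratio <= 0 or bandwidth_ratio <= 0:
        raise ValueError("ratios must be positive")
    return (compute_ratio + bandwidth_ratio) / 2.0


def normalized_kernel_ratio(kernel_ratio: float,
                            compute_ratio: float,
                            bandwidth_ratio: float) -> float:
    """Kernel-Ratio of A over B multiplied by the inverse hardware ratio of A over B."""
    return kernel_ratio / combined_hardware_ratio(compute_ratio, bandwidth_ratio)


if __name__ == "__main__":
    # Placeholder inputs (NOT the values from this work): platform A matches
    # B in kernel performance while having half of B's hardware resources.
    print(normalized_kernel_ratio(kernel_ratio=1.0,
                                  compute_ratio=0.5,
                                  bandwidth_ratio=0.5))
```

Using a single averaged factor is only reasonable while the two hardware ratios are close to each other; otherwise each kernel would need its own factor depending on whether it is compute or bandwidth bound.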

After this bold statement, we need to discuss the circumstances and limitations that govern the general applicability of these results. First of all, the utilized method of normalizing for different hardware platforms by a single compensation factor depends on the similar ratios of compute resources and bandwidths. Once those differ considerably, such scaling needs to be done on a per-kernel basis after an analysis whether compute or bandwidth would be the limiting factor. For the dataflow kernels, the foundations for such work are present in [11], but for the vector overlay, the performance bounds are hard to quantify since all our kernels are actually constrained by a combination of computation, latencies, and bandwidth. Also, after migration from one hardware platform to the other, the performance bounds can be different, requiring a more elaborate compensation step.

Secondly, we need to discuss aspects of memory bandwidth. The peak bandwidth data we utilized for our normalization already incorporates two aspects from our practical results. On the Maxeler platform, the memory controller is part of the synthesized FPGA design. The theoretical bandwidth maximum can be achieved with the memory controller clocked at 400 MHz. Due to difficulties in meeting the timing of this controller after synthesis, we targeted 300 MHz in our experiments and the peak bandwidth value used in our calculation reflects this. On the Convey platform, the memory controllers are implemented in separate FPGAs and their design is fixed, running at 300 MHz. As reported, we utilize a 31-31 interleaving scheme, which maximizes actual performance in our measurements but technically reduces peak bandwidth below the physical interface capabilities, which we also included in our numbers.

The practically realizable bandwidths of both memory interfaces depend, beyond those peak numbers, on additional influence factors, like burst sizes, strides, and granularity, which are hard to quantify without extensive tests on both platforms. However, we can qualitatively state that the efficient support for element-wise vector memory operations, in particular indexed ones, of the vector overlay depends on the capability to access individual 8-byte blocks enabled by the scatter-gather RAM modules of the Convey platform we use. So we need to constrain our Normalized Kernel-Ratio results for this design approach with the following:

The vector processor overlay requires a memory interface with sufficiently fine access granularity in order to achieve the indicated performance efficiency.

Thirdly, we want to discuss the compute resources. Our scaling method depends on the implicit assumption that performance scales linearly with available hardware. When it comes to parallel execution units that operate on unrolled data and are implemented primarily with LUTs, this assumption makes sense. However, other aspects of resource usage often do not scale linearly with compute throughput. On the one hand, some parts of the designs remain constant, for example, in our experiments, the control logic of the dataflow kernels and the scalar processing units of the vector overlay. On the other hand, resource demands of some components grow more than linearly with increased unrolling factors, for example, those of some data reordering buffers or input selection multiplexers.

Also, the FPGAs of the two utilized platforms have different ratios of additional resources such as BRAMs and DSP blocks to LUTs, which the scaling in terms of raw logic resources neglects. In particular, as seen in Table 3, the current designs of several of our dataflow kernels rely on the high ratio of BRAMs to LUTs on the Maxeler platform’s Virtex-6 SX475T FPGA. On the Virtex-5 LX330 FPGAs of the Convey platform, this ratio is lower. However, again as a qualitative statement from our design experience of the dataflow kernels, many of the BRAM resources are directly dedicated to buffering or reordering kernel inputs, outputs, and intermediate results in order to properly utilize the burst-oriented memory interface of the Maxeler platform. So, the second addendum to our Normalized Kernel-Ratio result now states the following more precisely for the other design approach:

Dataflow kernels can achieve the indicated performance efficiency even with a burst-oriented memory interface but require FPGAs with a sufficiently high ratio of BRAMs to LUTs for this.

Finally, we need to discuss clock frequencies and low-level optimization. Our dataflow kernels are generated with the Maxeler design flow, which enhances design productivity by transparently applying a number of best-practice decisions, for example, to pipelining or organization of buffers. Many of these can be modified manually, but in our designs such optimizations were mostly performed demand driven, in response to specific timing or resource problems. In order to relax the need for deep pipelining and along with it the need to very carefully optimize the balancing of pipeline stages and their physical layout, most compute paths of our dataflow kernels run at modest 100–130 MHz. For the much more widely distributed and reused vector overlay on the Convey platform, on the other hand, common sense and anecdotal evidence suggest that a huge amount of effort and expertise was invested into low-level optimizations. Consequently, this design runs at 300 MHz, which has a large impact on the performance we measured and compared in this work. We do consider this difference as characteristic for the relationship between reusable and problem-specific designs and as such not as a weakness of the comparison but nevertheless want to state this in a third addendum to our overall findings:

Our comparison premises that much more manual low-level optimization effort is put into a reusable overlay design than into problem-specific dataflow kernels.

8.5. Estimates on Design Efforts

As final step of our comparison, we want to present some empirical data about our experienced productivity when performing kernel-centric acceleration with two different design philosophies and targets. As we did not systematically track the design process and many factors which are hard to quantify impact the perceived productivity, these results need to be taken with at least a grain of salt. The design and implementation work presented here was carried out in several disjoint phases and with different levels of experience gained from other projects.

Overall we would describe the dataflow kernel design process as two phases, the first starting with some limited amount of experience in dataflow kernel design with the Maxeler toolflow, spanning the equivalent of 8–10 full-time developer weeks for conceptualization of kernels and their unrolling patterns, implementation, and many stand-alone tests in simulation, along with early synthesis results to get a feeling of the resource usage characteristics. The second phase, conducted with much additional background of the Maxeler platform, took another 6–8 weeks with focus on integration, synthesis, and optimization.

This phase was in practice prolonged by the process of waiting for synthesis results, which we tried to exclude from the above reported time span, because it to some degree depends on the amount of parallel synthesis resources and to some degree can be covered organizationally, for example, by running synthesis overnight. As an illustrative number, the total tool runtime for the final design of MaxFused was reported as 22 hours, 5 mins, 11 secs. Within this time, for the place and route step, a total of 11 different cost tables were explored, with four parallel instances running concurrently. Another special challenge was posed by one kernel instance, where the Maxeler simulation tool was not able to reproduce a memory interface related error actually encountered in hardware.

The design of the vector coprocessor kernels was also performed in two major phases. An equivalent of 4–6 full-time developer weeks was spent on first concepts and prototypical implementations with no preliminary knowledge of the concrete vector ISA, but with some general background in assembly programming. With a lot more experience with the architecture, another 6–8 weeks were spent on the final kernel designs and optimizations, including a considerable fraction of the time that was spent on exploring performance impacts of memory settings and data transfer patterns triggered through our memory manager. On this platform, assembly of a kernel design and integration into an executable was completed within seconds, allowing for much faster optimization iterations. A special challenge was posed by repeated crashes of the accelerator hardware that occurred when using the debugger for the vector coprocessor.

We summarize our subjective characterization of the design process as follows:

Designing specialized dataflow kernels with Maxeler’s spatial programming language and design flow requires some more time and some more expertise than developing assembly code for a vector processor, but not a whole lot. However, the time-consuming synthesis can add some tedious waiting to the process.

9. Related Work

We discuss related work in three different fields. First, we give an overview of other approaches for stereo matching on FPGAs, then we discuss the field of kernel-centric acceleration, and we finally present other approaches to design and evaluate overlay architectures.

Apart from our own previous work in [11, 12], stereo matching on FPGAs has been tackled with codesign of algorithm and hardware, typically implementing the entire processing pipeline without off-chip memory access. Different algorithmic approaches have been explored with different design goals in mind. For example, Tippetts et al. [15] present a complete stereo matching system with less than 10,000 LUTs and 30 BRAMs, at much lower result quality, but robust with respect to uncalibrated and unrectified images. Apart from simple pre- and postprocessing steps, their approach employs an intensity profile shape matching algorithm, which directly works on row-local intensity data.

The FPGA implementations with the highest matching accuracy reflect more of the matching patterns utilized in this work. Shan et al. [21] implemented a slightly modified variant of the presented cost aggregation for adaptive support regions on FPGAs. By aggregating only once and in a fixed order, first vertically and then horizontally, they are able to stream the required data only through on-chip buffers. Wang et al. [22] try to follow the algorithm of Mei et al. [10] in their FPGA implementation more closely. In addition to the aggregation technique of Shan et al. [21], they propose a reduced scanline optimization which runs in three downward directions, following the order in which the data is generated in the previous aggregation stage. Both implementations try to exploit parallelism both in the spatial domain of the images, working on several rows at once, and in the disparity domain of the cost volume, working on several disparity images at once.

Jin and Maruyama [27, 28] use a similar single-pass aggregation phase and winner-takes-all disparity selection and combine it with a voting scheme, denoted as fast locally consistent (FLC) [29], which is more sophisticated than the one utilized in the postprocessing step we employ. Between these two phases, intermediate disparity results are actually buffered off chip, but requiring much less bandwidth, since no volume data is stored.

These implementations come quite close to our results in terms of matching quality, with Wang et al. [22] reaching an average of 6.17% bad pixels and Jin and Maruyama [28] only 5.86% bad pixels. They are somewhat more limited in problem dimensions than our approach of working on blocks of memory, with Jin and Maruyama [27, 28] projecting a design that supports our largest test inputs to exceed the LUT and BRAM resources of their and our current hardware platforms, but suitable for large Virtex-7 FPGAs. In terms of performance, these codesigned implementations are orders of magnitude faster than our implementation, by executing fewer computation steps on volume data and by integrating the compute pipelines more tightly. Therefore, these approaches are superior when algorithmic trade-offs can be made, whereas our approach is justified when exact reproduction of results or a simpler, structured design process is required.

Such a kernel-centric design approach is also coming along with the OpenCL-to-FPGA design flows, which are gaining traction, from academic initiatives [30] to FPGA vendor toolchains [31, 32]. Even though so far the actual synthesis of FPGA designs from OpenCL kernels was the focus of this research, a defining feature of the OpenCL approach is the distinction between data-parallel kernels that are to be executed on parallel resources and a host part of the application. This host code may run on a server CPU like in our work or in [31] or on a CPU inside an SoC implemented on FPGA as in [30]. Our memory management interface could be seen as an easier, more abstract alternative to a subset of the OpenCL runtime feature set. But just like we used our interface to abstract away the underlying data transfer APIs implemented on the Maxeler and Convey platforms [8, 9], we could also add support for a target platform that internally uses the OpenCL API functionality through our interface.

FPGA overlays or architecture templates for such overlays have been researched as means to enable faster or easier manual design, faster synthesis or compilation toolflows, and faster reconfiguration on top of a reusable overlay. Coole and Stitt propose intermediate fabrics (IFs) [33], an overlay architecture of coarse grained compute resources and configurable interconnection implemented on top of an FPGA. Different IFs are designed first manually [33, 34], later also automatically based on OpenCL [35], to each support a group of kernels with similar compute demands. Their coarse grained abstraction allows orders of magnitude faster synthesis and reconfiguration than for the underlying FPGA fabric. Depending on the set of investigated kernels, the degree of specialization, and the reconfiguration properties, the authors report area overheads from 1.4x [33] over 1.8x [35] to 4.4x [34] for the overlay and assume identical clock speeds for overlay and specialized designs.

In the area of instruction programmable FPGA overlays, active academic research on vector processors [36, 37] is going on in the area of embedded computing devices as throughput-optimized alternatives to scalar soft processors. Ovtcharov et al. [38] add the concept of GPU-like multithreading to hide latencies of functional units and memory access by pipelining the execution of different threads. As proposed by Kingyens and Steffan [39] and brought forward by Convey with CHOMP [40] as successor to the vector processor utilized in this work, such a GPU-like architecture may be a promising architecture template for acceleration of server- and datacenter-scale computing tasks. From the programmer's perspective, it offers more transparent ways to exploit parallelism in multiple dimensions than with the vector partitioning approach we had to specify explicitly. From an architectural point of view, this may come at higher resource consumption, for example, for address calculations. On the other hand, the improved latency hiding promises higher performance. To the best of our knowledge, there are no published performance results for these architectures, in particular not in comparison to custom datapaths.

10. Conclusion

In this work, we have compared two design approaches for kernel-centric acceleration, specialized dataflow kernels versus an instruction programmed vector processor on FPGA with the example of a stereo matching application. We have shown that, given comparable FPGA and memory resources, the specialized dataflow kernels promise around 3x more performance than kernels executing on a fixed vector overlay, and we have analyzed three important preconditions for this result: (1) the vector processor needs a sufficiently fine grained memory interface, (2) the dataflow kernels need FPGA architectures with sufficient BRAMs for local buffers, and (3) a reusable overlay typically receives more low-level optimization than specialized kernels with a much more narrow usage scope. We have also elaborated that, in our concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, the advantage of specialized dataflow kernels is reduced to around 2.5x.

Looking forward, it will be interesting to extend such an analysis to other presynthesized or customizable overlay architectures, following GPU-like SIMT execution patterns or implementing CGRA structures on FPGA. Also, a careful analysis of whether and where the spatial programming language paradigm or the utilized toolflow might add inefficiencies could help to contextualize our results.

We have motivated the kernel-centric acceleration approach used in this work with productivity considerations and by the desire to precisely retain the desired application behavior while using FPGA resources for acceleration. Our overview of the design process hints that both the dataflow and the vector ISA abstraction may help for this process, but synthesis times are still an issue for specialized dataflow kernels. The similarly kernel-centric OpenCL design flows that currently gain traction in the FPGA community promise even more abstraction, possibly along with reduced control over the designs, and speed up the synthesis process by reusing the memory and PCIe interface as fully mapped and routed components on the FPGA. Thus, they will represent another interesting design approach to compare to.

The analysis of data transfer and platform overheads when looking at the entire application underlines that the current trend of tighter integration of FPGAs, as well as other accelerators like GPUs, into the same SoC with CPUs and a shared memory subsystem may turn out to be very valuable for kernel-centric acceleration approaches.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

This work was partially supported by the German Research Foundation (DFG) within the Collaborative Research Centre “On-The-Fly Computing” (SFB 901), the European Union Seventh Framework Programme under Grant Agreement no. 610996 (SAVE), and the Maxeler University program MAXUP.