| Technology/API | Goals (performance, energy, etc.) | Ease of programming | Ease of assessment (e.g., performance) | Ease of deployment/(auto-)tuning | Portability (between hardware, for new hardware, etc.) |
|---|---|---|---|---|---|
| OpenMP | Performance, parallelization | Relatively easy: a sequential program is parallelized by adding directives that mark parallel regions and, optionally, library calls for thread management; certain schemes are difficult to implement, e.g., those similar to Pthreads condition variables [1] | Execution times can be benchmarked easily; debugging is relatively easy | Easy; the number of threads can be set using an environment variable or at the level of a region or clause | Available for all major shared-memory environments, e.g., in gcc |
| CUDA | Performance | Proprietary API; easy to use in a basic version for a default card, more difficult for optimized code (requires stream handling and memory optimizations, including shared memory use that avoids bank conflicts and global memory coalescing) | Can be performed using cuda-gdb, the very powerful nvvp (NVIDIA Visual Profiler), or the text-based nvprof | Easy; requires CUDA drivers and software | Limited to NVIDIA cards; support for various features depends on the hardware version (the card's CUDA compute capability) and the software version |
| OpenCL | Performance | More difficult than CUDA or OpenMP because it requires much device and kernel management code; optimized code may require specialized kernels, which somewhat defeats the idea of portability | Can be benchmarked at the level of kernels; queue management functions can be used to fence benchmarked sections | Easy; requires proper drivers in the system | Portable across hybrid parallel systems, especially CPU + GPU |
| Pthreads | Performance | More difficult than OpenMP; flexible enough to implement various multithreaded schemes, including wait-notify with condition variables, e.g., for producer-consumer | Easy; a thread's code resides in designated functions and can be benchmarked there | Easy; a thread's code is executed in designated functions | Available for all major shared-memory environments |
| OpenACC | Performance | Easy, similar to OpenMP's directive-based model; however, it requires awareness of overheads and the corresponding need for optimization related to, e.g., data placement and copy overheads | Standard libraries can be used for performance assessment; gprof can be used | Requires a compiler supporting OpenACC, e.g., the PGI compiler, GCC, or accULL | Portable across compute devices supported by the software |
| Java Concurrency | Parallelization | Easy; two levels of abstraction | Easy debugging and profiling | Easy deployment on many operating systems | Portable across the majority of hardware |
| TCP/IP | Standard network connectivity | Programming can be difficult; requires knowledge of low-level network mechanisms | Debugging can be difficult; tools for time measurement are available | Usually already deployed with the OS | Portable across the majority of hardware |
| RDMA | Performance | Programming can be difficult; requires knowledge of low-level network mechanisms | Debugging can be difficult; tools for time measurement are available | Deployment can be difficult | Usually used with clusters |
| UCX | Performance | Programming can be difficult; it is a library for frameworks | Debugging can be difficult; it is quite a new solution | Deployment can be difficult | Usually used with clusters |
| MPI | Performance, parallelization | Relatively easy; high-level message-passing paradigm | Measuring execution time is easy; debugging is difficult, especially in a cluster environment | Deployment can require additional tools, e.g., drivers for advanced interconnects such as InfiniBand, or SLURM for an HPC queue system; tuning is typically based on low-level profiling | Portable; implementations available on clusters, servers, and workstations; typically used in Unix environments |
| OpenSHMEM | Performance, parallelization | Easy; synchronized data access needs attention | No dedicated debugging and profiling tools | Fairly easy deployment in many environments | Portable; implementations available on clusters, servers, and workstations; typically used in Unix environments |
| PCJ | Performance, parallelization | Easy; classes and annotations are used for object distribution | No dedicated debugging and profiling tools | Easy deployment on many operating systems | Portable across the majority of hardware |
| Apache Hadoop | Performance, large datasets | Relatively easy; high-level abstraction; requires a good understanding of the MapReduce programming model | Easy to obtain a job performance overview (web UI and logs); moderately easy debugging; central logging can be used to streamline the process | Moderately easy basic deployment; tuning performance and security for the entire Hadoop ecosystem can be very difficult | Used in clusters; available for Unix and Windows |
| Apache Spark | Performance, low disk and high RAM usage, large datasets | Relatively easy; high-level abstraction based on lambda functions on RDDs and DataFrames | Easy to obtain a job performance overview (web UI and logs); moderately easy debugging; central logging can be used to streamline the process | Easy Spark Standalone deployment; Spark on YARN deployment requires a functioning Hadoop ecosystem | Used in clusters; available for Unix and Windows |
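
The Apache Hadoop row notes that programming requires a good understanding of the MapReduce programming model. As a rough illustration of that model only, the following single-process Python sketch walks through the map, shuffle, and reduce phases for a word count. It does not use the Hadoop API, and all function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example; a real framework would distribute these phases across cluster nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# NOTE: illustrative, single-process sketch of the MapReduce model.
# All names here are hypothetical; this is not the Hadoop or Spark API.


def map_phase(
    lines: Iterable[str],
    mapper: Callable[[str], List[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs: List[Tuple[str, int]] = []
    for line in lines:
        pairs.extend(mapper(line))
    return pairs


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values that share a key (a framework does this across nodes)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(
    groups: Dict[str, List[int]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word, lowercased."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word: str, counts: List[int]) -> int:
    """Sum the partial counts emitted for one word."""
    return sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on in-memory input."""
    pairs = map_phase(lines, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, word_count_reducer)


if __name__ == "__main__":
    sample = [
        "Spark and Hadoop",
        "Hadoop runs MapReduce jobs",
        "Spark runs in memory",
    ]
    print(word_count(sample))
```

The programmer supplies only the mapper and the reducer; grouping by key is left to the framework, which is the part of the model that tends to require the most adjustment when coming from sequential code.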