Review Article

Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems

Table 2

Technologies and goals.

Table fields: Technology/API; Goals (performance, energy, etc.); Ease of programming; Ease of assessment (e.g., performance); Ease of deployment/(auto)tuning; Portability (between hardware, for new hardware, etc.).

OpenMP
    Goals: Performance, parallelization
    Ease of programming: Relatively easy; a sequential program is parallelized by adding directives to regions and, optionally, library calls for thread management; certain schemes are difficult to implement, e.g., ones similar to those built with Pthreads' condition variables [1]
    Ease of assessment: Execution times can be benchmarked easily; debugging is relatively easy
    Ease of deployment/tuning: Easy; the number of threads can be set using an environment variable, at the level of a region, or in a clause
    Portability: Available in all major shared-memory environments, e.g., in gcc

CUDA
    Goals: Performance
    Ease of programming: Proprietary API; easy to use in a basic version for a default card, more difficult for optimized code (requires stream handling and memory optimizations, including using shared memory while avoiding bank conflicts, and global memory coalescing)
    Ease of assessment: Can be performed using cuda-gdb, the very powerful nvvp (NVIDIA Visual Profiler), or the text-based nvprof
    Ease of deployment/tuning: Easy; requires CUDA drivers and software
    Portability: Limited to NVIDIA cards; support for particular features depends on the hardware version (the card's CUDA compute capability) and the software version

OpenCL
    Goals: Performance
    Ease of programming: More difficult than CUDA or OpenMP, since it requires considerable device and kernel management code; optimized code may require specialized kernels, which somewhat defeats the idea of portability
    Ease of assessment: Can be benchmarked at the level of kernels; queue management functions can be used for fencing benchmarked sections
    Ease of deployment/tuning: Easy; requires proper drivers in the system
    Portability: Portable across hybrid parallel systems, especially CPU + GPU

Pthreads
    Goals: Performance
    Ease of programming: More difficult than OpenMP; offers the flexibility to implement several multithreaded schemes, including wait-notify using condition variables, e.g., for producer-consumer
    Ease of assessment: Easy; each thread's code is placed in designated functions and can be benchmarked there
    Ease of deployment/tuning: Easy; each thread's code is executed in designated functions
    Portability: Available in all major shared-memory environments

OpenACC
    Goals: Performance
    Ease of programming: Easy, similar to OpenMP's directive-based model; however, it requires awareness of overheads and the corresponding need for optimizations related to, e.g., data placement and copy overheads
    Ease of assessment: Standard libraries, as well as gprof, can be used for performance assessment
    Ease of deployment/tuning: Requires a compiler supporting OpenACC, e.g., PGI's compiler, GCC, or accULL
    Portability: Portable across the compute devices supported by the software

Java Concurrency
    Goals: Parallelization
    Ease of programming: Easy; two levels of abstraction
    Ease of assessment: Easy debugging and profiling
    Ease of deployment/tuning: Easy deployment on many operating systems
    Portability: Portable over the majority of hardware

TCP/IP
    Goals: Standard network connectivity
    Ease of programming: Programming can be difficult; requires knowledge of low-level network mechanisms
    Ease of assessment: Debugging can be difficult; tools for time measurement are available
    Ease of deployment/tuning: Usually already deployed with the OS
    Portability: Portable over the majority of hardware

RDMA
    Goals: Performance
    Ease of programming: Programming can be difficult; requires knowledge of low-level network mechanisms
    Ease of assessment: Debugging can be difficult; tools for time measurement are available
    Ease of deployment/tuning: Deployment can be difficult
    Portability: Usually used with clusters

UCX
    Goals: Performance
    Ease of programming: Programming can be difficult; it is a library intended for frameworks
    Ease of assessment: Debugging can be difficult; it is a relatively new solution
    Ease of deployment/tuning: Deployment can be difficult
    Portability: Usually used with clusters

MPI
    Goals: Performance, parallelization
    Ease of programming: Relatively easy; high-level, message-passing paradigm
    Ease of assessment: Measurement of execution time is easy; debugging is difficult, especially in a cluster environment
    Ease of deployment/tuning: Deployment can require additional tools, e.g., drivers for advanced interconnects such as InfiniBand, or SLURM for an HPC queue system; tuning is typically based on low-level profiling
    Portability: Portable; implementations are available on clusters, servers, and workstations, typically used in Unix environments

OpenSHMEM
    Goals: Performance, parallelization
    Ease of programming: Easy; needs attention to synchronized data access
    Ease of assessment: No dedicated debugging and profiling tools
    Ease of deployment/tuning: Fairly easy deployment in many environments
    Portability: Portable; implementations are available on clusters, servers, and workstations, typically used in Unix environments

PCJ
    Goals: Performance, parallelization
    Ease of programming: Easy; classes and annotations are used for object distribution
    Ease of assessment: No dedicated debugging and profiling tools
    Ease of deployment/tuning: Easy deployment on many operating systems
    Portability: Portable over the majority of hardware

Apache Hadoop
    Goals: Performance, large datasets
    Ease of programming: Relatively easy; high-level abstraction, but requires a good understanding of the MapReduce programming model
    Ease of assessment: Easy to acquire a job performance overview (web UI and logs); moderately easy debugging; central logging can be used to streamline the process
    Ease of deployment/tuning: Moderately easy basic deployment; tuning performance and security across the entire Hadoop ecosystem can be very difficult
    Portability: Used in clusters; available for Unix and Windows

Apache Spark
    Goals: Performance, low disk and high RAM usage, large datasets
    Ease of programming: Relatively easy; high-level abstraction based on lambda functions on RDDs and DataFrames
    Ease of assessment: Easy to acquire a job performance overview (web UI and logs); moderately easy debugging; central logging can be used to streamline the process
    Ease of deployment/tuning: Easy Spark Standalone deployment; Spark on YARN requires a functioning Hadoop ecosystem
    Portability: Used in clusters; available for Unix and Windows