Review Article

Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems

Table 2

Technologies and goals.

Table fields: Technology/API; Goals (performance, energy, etc.); Ease of programming; Ease of assessment (e.g., performance); Ease of deployment/(auto)tuning; Portability (between hardware, for new hardware, etc.).

OpenMP
    Goals: Performance, parallelization
    Ease of programming: Relatively easy; a sequential program is parallelized by adding directives to regions and, optionally, library calls for thread management; certain schemes are difficult to implement, e.g., ones similar to those built with Pthreads' condition variables [1]
    Ease of assessment: Execution times can be benchmarked easily; debugging is relatively easy
    Ease of deployment/tuning: Easy; the number of threads can be set using an environment variable, at the level of a region, or in a clause
    Portability: Available in all major shared-memory environments, e.g., in gcc

CUDA
    Goals: Performance
    Ease of programming: Proprietary API; easy to use in a basic version for a default card, more difficult for optimized code (requires stream handling and memory optimizations, including using shared memory while avoiding bank conflicts, and global memory coalescing)
    Ease of assessment: Can be performed using cuda-gdb, the very powerful nvvp (NVIDIA Visual Profiler), or the text-based nvprof
    Ease of deployment/tuning: Easy; requires CUDA drivers and software
    Portability: Limited to NVIDIA cards; support for particular features depends on the hardware version (the card's CUDA compute capability) and the software version

OpenCL
    Goals: Performance
    Ease of programming: More difficult than CUDA or OpenMP, since it requires considerable device and kernel management code; optimized code may require specialized kernels, which somewhat defeats the idea of portability
    Ease of assessment: Can be benchmarked at the level of kernels; queue management functions can be used for fencing benchmarked sections
    Ease of deployment/tuning: Easy; requires proper drivers in the system
    Portability: Portable across hybrid parallel systems, especially CPU + GPU

Pthreads
    Goals: Performance
    Ease of programming: More difficult than OpenMP; offers the flexibility to implement several multithreaded schemes, including wait-notify using condition variables, e.g., for producer-consumer
    Ease of assessment: Easy; each thread's code is placed in designated functions and can be benchmarked there
    Ease of deployment/tuning: Easy; each thread's code is executed in designated functions
    Portability: Available in all major shared-memory environments

OpenACC
    Goals: Performance
    Ease of programming: Easy, similar to OpenMP's directive-based model; however, it requires awareness of overheads and the corresponding need for optimizations related to, e.g., data placement and copy overheads
    Ease of assessment: Standard libraries, as well as gprof, can be used for performance assessment
    Ease of deployment/tuning: Requires a compiler supporting OpenACC, e.g., PGI's compiler, GCC, or accULL
    Portability: Portable across the compute devices supported by the software

Java Concurrency
    Goals: Parallelization
    Ease of programming: Easy; two levels of abstraction
    Ease of assessment: Easy debugging and profiling
    Ease of deployment/tuning: Easy deployment on many operating systems
    Portability: Portable over the majority of hardware

TCP/IP
    Goals: Standard network connectivity
    Ease of programming: Programming can be difficult; requires knowledge of low-level network mechanisms
    Ease of assessment: Debugging can be difficult; tools for time measurement are available
    Ease of deployment/tuning: Usually already deployed with the OS
    Portability: Portable over the majority of hardware

RDMA
    Goals: Performance
    Ease of programming: Programming can be difficult; requires knowledge of low-level network mechanisms
    Ease of assessment: Debugging can be difficult; tools for time measurement are available
    Ease of deployment/tuning: Deployment can be difficult
    Portability: Usually used with clusters

UCX
    Goals: Performance
    Ease of programming: Programming can be difficult; it is a library intended for frameworks
    Ease of assessment: Debugging can be difficult; it is a relatively new solution
    Ease of deployment/tuning: Deployment can be difficult
    Portability: Usually used with clusters

MPI
    Goals: Performance, parallelization
    Ease of programming: Relatively easy; high-level, message-passing paradigm
    Ease of assessment: Measurement of execution time is easy; debugging is difficult, especially in a cluster environment
    Ease of deployment/tuning: Deployment can require additional tools, e.g., drivers for advanced interconnects such as InfiniBand, or SLURM for an HPC queue system; tuning is typically based on low-level profiling
    Portability: Portable; implementations are available on clusters, servers, and workstations, typically used in Unix environments

OpenSHMEM
    Goals: Performance, parallelization
    Ease of programming: Easy; needs attention to synchronized data access
    Ease of assessment: No dedicated debugging and profiling tools
    Ease of deployment/tuning: Fairly easy deployment in many environments
    Portability: Portable; implementations are available on clusters, servers, and workstations, typically used in Unix environments

PCJ
    Goals: Performance, parallelization
    Ease of programming: Easy; classes and annotations are used for object distribution
    Ease of assessment: No dedicated debugging and profiling tools
    Ease of deployment/tuning: Easy deployment on many operating systems
    Portability: Portable over the majority of hardware

Apache Hadoop
    Goals: Performance, large datasets
    Ease of programming: Relatively easy; high-level abstraction, but requires a good understanding of the MapReduce programming model
    Ease of assessment: Easy to acquire a job performance overview (web UI and logs); moderately easy debugging; central logging can be used to streamline the process
    Ease of deployment/tuning: Moderately easy basic deployment; tuning performance and security across the entire Hadoop ecosystem can be very difficult
    Portability: Used in clusters; available for Unix and Windows

Apache Spark
    Goals: Performance, low disk and high RAM usage, large datasets
    Ease of programming: Relatively easy; high-level abstraction based on lambda functions on RDDs and DataFrames
    Ease of assessment: Easy to acquire a job performance overview (web UI and logs); moderately easy debugging; central logging can be used to streamline the process
    Ease of deployment/tuning: Easy Spark Standalone deployment; Spark on YARN requires a functioning Hadoop ecosystem
    Portability: Used in clusters; available for Unix and Windows