Review Article

Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems

Table 3. Technologies and parallelism.

For each technology/API, the table lists the level of parallelism, the parallelism constructs, and the synchronization constructs available.

OpenMP
Level of parallelism: Thread teams executing selected regions of an application.
Parallelism constructs: Directives that mark a region for parallel execution, such as #pragma omp parallel, #pragma omp sections, etc.
Synchronization constructs: Several constructs allow synchronization, such as #pragma omp barrier; constructs that restrict a part of the code to a certain thread, e.g., #pragma omp master, #pragma omp single; the critical section #pragma omp critical; and directives for data synchronization, e.g., #pragma omp atomic. (A minimal C sketch follows the table.)

CUDA
Level of parallelism: Threads executing kernels in parallel. Threads are organized into a grid of blocks, each of which consists of a number of threads; both threads in a block and blocks in a grid can be organized in 1D, 2D, or 3D logical structures. Kernel execution, host-to-device copying, and device-to-host copying can be overlapped if issued into various CUDA streams.
Parallelism constructs: Invocation of a kernel function launches parallel computations by a grid of threads; execution on several GPUs in parallel is possible.
Synchronization constructs: Execution of all of the grid's threads is synchronized when the kernel completes; within a kernel, the threads of a block can be synchronized with a call to __syncthreads(); atomic functions are available for accessing global memory.

OpenCL
Level of parallelism: Work items executing kernels in parallel. Work items are organized into an NDRange of work groups, each of which consists of a number of work items; both work items in a work group and work groups in an NDRange can be organized in 1D, 2D, or 3D logical structures. Kernel execution, host-to-device copying, and device-to-host copying can be overlapped if issued into various command queues.
Parallelism constructs: Invocation of a kernel function launches parallel computations by an NDRange of work items; OpenCL allows parallel execution of kernels on various compute devices such as CPUs and GPUs.
Synchronization constructs: Execution of all of the NDRange's work items is synchronized when the kernel completes; within a kernel, work items in a work group can be synchronized with a call to barrier() with an indication of whether a local or global memory variable should be synchronized; synchronization using events is also possible; atomic operations are available for synchronization of references to global or local memory. (A minimal host-side C sketch follows the table.)

Pthreads
Level of parallelism: Threads are launched explicitly for execution of a particular function.
Parallelism constructs: A call to pthread_create() creates a thread for execution of a specific function, a pointer to which is passed as a parameter.
Synchronization constructs: Threads can be synchronized by the thread that called pthread_create() through a call to pthread_join(); mechanisms for synchronization of threads include mutexes, condition variables with wait, e.g., pthread_cond_wait(), and notify routines, e.g., pthread_cond_signal(), and barriers, e.g., pthread_barrier_wait(); the memory view is implicitly synchronized among threads upon invocation of selected functions. (A minimal C sketch follows the table.)

OpenACC
Level of parallelism: Three levels of parallelism are available: execution by gangs, one or more workers within a gang, and vector lanes within a worker.
Parallelism constructs: Parallel execution of the code within a block marked with #pragma acc parallel; parallel execution of a loop can be specified with #pragma acc loop.
Synchronization constructs: For #pragma acc parallel, an implicit barrier is present at the end of the following block if async is not present; atomic accesses are possible with #pragma acc atomic. According to the documentation [10], the user should not attempt to implement barrier synchronization, critical sections, or locks across any of gang, worker, or vector parallelism. (A minimal C sketch follows the table.)

Java Concurrency
Level of parallelism: Threads inside the same JVM.
Parallelism constructs: The main thread, created during JVM start in the main() method, is the root of other threads created dynamically using explicit constructs, e.g., new Thread(), or implicit ones, e.g., thread pools.
Synchronization constructs: Typical shared memory mechanisms such as synchronized sections or guarded blocks.

TCP/IP
Level of parallelism: Whole network nodes.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: IP addresses and ports are used for distinguishing connections/destinations; no specific constructs. (A minimal socket sketch in C follows the table.)

RDMA
Level of parallelism: Whole network nodes.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: Remote access using identifiers of the accessed memory regions.

UCX
Level of parallelism: Whole network nodes.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: Dedicated APIs for message passing and memory access.

MPI
Level of parallelism: Processes (plus threads when combined with a multithreaded API such as OpenMP or Pthreads, provided the MPI implementation supports the required thread support level).
Parallelism constructs: Processes created with mpirun at application launch, plus potentially processes created dynamically with a call to MPI_Comm_spawn or MPI_Comm_spawn_multiple.
Synchronization constructs: MPI collective routines: barrier and communication calls such as MPI_Gather, MPI_Scatter, etc. (A minimal C sketch follows the table.)

OpenSHMEM
Level of parallelism: Processes, possibly on different compute nodes.
Parallelism constructs: Processes created with oshrun at application launch.
Synchronization constructs: OpenSHMEM synchronization and collective routines: barrier, broadcast, reduction, etc. (A minimal C sketch follows the table.)

PCJ
Level of parallelism: The so-called nodes, placed in possibly separate JVMs on different compute nodes.
Parallelism constructs: The node structure is created by a main Manager node at application launch.
Synchronization constructs: PCJ synchronization and collective routines: barrier, broadcast, etc.

Apache Hadoop
Level of parallelism: A task is a single process running inside a JVM.
Parallelism constructs: An API to formulate MapReduce functions.
Synchronization constructs: Synchronization managed by YARN; API for data aggregation (the reduce operation).

Apache Spark
Level of parallelism: Executors run worker threads.
Parallelism constructs: RDD and DataFrame APIs for managing distributed computations.
Synchronization constructs: Managed by the built-in Spark Standalone cluster manager or by an external cluster manager: YARN, Mesos, etc.
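
To make the OpenMP constructs from Table 3 concrete, below is a minimal C sketch (not taken from any surveyed application) that combines #pragma omp parallel, a work-sharing loop, #pragma omp atomic, #pragma omp barrier, #pragma omp single, and #pragma omp critical. The reduction clause and the runtime calls omp_get_thread_num() are standard OpenMP features used here only for illustration; the file and variable names are arbitrary.

```c
/* Minimal OpenMP sketch: a parallel region, a work-sharing loop with a
   reduction, and the synchronization constructs listed in Table 3.
   Compile e.g. with: gcc -fopenmp omp_demo.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;
    int done = 0;                      /* shared progress counter */

    #pragma omp parallel               /* team of threads */
    {
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        #pragma omp atomic             /* data synchronization */
        done++;

        #pragma omp barrier            /* explicit synchronization point */

        #pragma omp single             /* executed by one thread only */
        printf("%d threads finished, sum = %f\n", done, sum);

        #pragma omp critical           /* one thread at a time */
        printf("thread %d leaving\n", omp_get_thread_num());
    }
    return 0;
}
```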
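
The following host-side C sketch illustrates the OpenCL flow summarized in Table 3: a kernel is built from source and launched over a 1D NDRange, and a blocking read on the in-order command queue synchronizes the host with kernel completion. The kernel name vadd, the vector-add computation, and the default device selection are illustrative assumptions, and error checking is largely omitted for brevity.

```c
/* Minimal OpenCL host-side sketch: build a vector-add kernel from source
   and launch it over a 1D NDRange of work items.
   Compile e.g. with: gcc ocl_vadd.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(a), a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(b), b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    /* 1D NDRange of N work items; work-group size chosen by the runtime. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: the host waits until the kernel and copy complete. */
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[10] = %f\n", c[10]);

    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```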
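
For Pthreads, a minimal sketch of explicit thread creation, mutex-based synchronization of a shared counter, and waiting with pthread_join() is given below; the worker function name, thread count, and iteration count are arbitrary choices for illustration.

```c
/* Minimal Pthreads sketch: explicit thread creation, a mutex protecting a
   shared counter, and pthread_join() for synchronization.
   Compile e.g. with: gcc pthreads_counter.c -lpthread */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* mutual exclusion on shared data */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    printf("thread %ld finished\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    /* Launch threads explicitly; each executes worker(). */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);

    /* The creating thread waits for each worker with pthread_join(). */
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * 100000);
    return 0;
}
```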
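
A minimal OpenACC sketch follows, showing #pragma acc parallel loop, an atomic update with #pragma acc atomic, and the implicit barrier at the end of the region (no async clause). The saxpy-like computation, the data clauses, and the compiler invocations are illustrative assumptions.

```c
/* Minimal OpenACC sketch: a parallel loop with an atomic update and an
   implicit barrier at the end of the parallel region.
   Compile e.g. with: nvc -acc acc_demo.c   (or gcc -fopenacc acc_demo.c) */
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    int count = 0;

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Gangs, workers, and vector lanes execute the iterations in parallel. */
    #pragma acc parallel loop copy(y, count) copyin(x)
    for (int i = 0; i < n; i++) {
        y[i] = 2.0f * x[i] + y[i];
        if (y[i] > 3.5f) {
            #pragma acc atomic update   /* safe concurrent increment */
            count++;
        }
    }
    /* Implicit synchronization here: no async clause, so the host waits. */

    printf("y[0] = %f, count = %d\n", y[0], count);
    return 0;
}
```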
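
Since TCP/IP itself offers no parallel programming constructs, the sketch below only shows the client side of a connection addressed by an IP address and port, as noted in Table 3. The address 127.0.0.1, port 5000, and the exchanged message are placeholders, and a server is assumed to be already listening there.

```c
/* Minimal TCP/IP client sketch using POSIX sockets: connect to a server
   identified by IP address and port, send a request, read a reply.
   Compile e.g. with: gcc tcp_client.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);       /* TCP socket */

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);                  /* destination port (placeholder) */
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);

    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) != 0) {
        perror("connect");
        return 1;
    }

    /* No built-in synchronization constructs: correctness relies on the
       application-level protocol agreed between client and server. */
    const char *msg = "hello";
    send(fd, msg, strlen(msg), 0);

    char buf[256];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("received: %s\n", buf);
    }
    close(fd);
    return 0;
}
```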
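
A minimal MPI sketch in C is shown next: processes are created at launch (e.g., mpirun -np 4 ./mpi_sum), identified by their rank within MPI_COMM_WORLD, and synchronized with the collective calls MPI_Barrier and MPI_Reduce. The reduced values and the file/compiler names are illustrative.

```c
/* Minimal MPI sketch: rank/size queries, an explicit barrier, and a
   collective reduction to rank 0.
   Compile e.g. with: mpicc mpi_sum.c -o mpi_sum */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes a partial value. */
    int local = rank + 1;
    int total = 0;

    MPI_Barrier(MPI_COMM_WORLD);                     /* explicit synchronization */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM,  /* collective communication */
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```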
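
Finally, a minimal OpenSHMEM sketch in C: PEs launched with oshrun allocate a symmetric variable, perform a one-sided put into a neighboring PE's memory, and synchronize with shmem_barrier_all(). The ring-style exchange and the oshcc compiler wrapper are illustrative assumptions.

```c
/* Minimal OpenSHMEM sketch: symmetric allocation, a one-sided put to the
   next PE, and barrier synchronization.
   Compile e.g. with: oshcc shmem_ring.c ; run with: oshrun -np 4 ./a.out */
#include <stdio.h>
#include <shmem.h>

int main(void) {
    shmem_init();

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same address is valid on every PE. */
    int *dest = (int *)shmem_malloc(sizeof(int));
    *dest = -1;

    shmem_barrier_all();                 /* allocation/initialization done on all PEs */

    /* One-sided put of my PE number into the next PE's symmetric memory. */
    int src = me;
    shmem_int_put(dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();                 /* completes the puts before reading */
    printf("PE %d received %d\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```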