Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems
Table 3
Technologies and parallelism. For each technology/API, the entry below lists the level of parallelism, the parallelism constructs, and the synchronization constructs.
OpenMP
Level of parallelism: Thread teams executing selected regions of an application.
Parallelism constructs: Directives that mark a certain region for parallel execution, such as #pragma omp parallel, #pragma omp sections, etc.
Synchronization constructs: Several constructs allow synchronization, such as #pragma omp barrier; constructs denoting that a part of the code is to be executed by a particular thread only, e.g., #pragma omp master, #pragma omp single; critical sections with #pragma omp critical; and directives for data synchronization, e.g., #pragma omp atomic.
CUDA
Level of parallelism: Threads executing kernels in parallel. Threads are organized into a grid of blocks, each of which consists of a number of threads; both threads within a block and blocks within a grid can be organized in 1D, 2D, or 3D logical structures. Kernel execution, host-to-device copying, and device-to-host copying can be overlapped if issued into different CUDA streams.
Parallelism constructs: Invocation of a kernel function launches parallel computations by a grid of threads; execution on several GPUs in parallel is possible.
Synchronization constructs: Execution of all of a grid's threads is synchronized after the kernel has completed; on the device side, threads within a block can be synchronized with a call to __syncthreads(); atomic functions are available for accessing global memory.
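The grid/block organization, the in-kernel __syncthreads() barrier, and atomic access to global memory can be sketched as below. This is a minimal CUDA example (it assumes an NVIDIA GPU with unified-memory support and is not runnable without one); the kernel performs a per-block shared-memory reduction.

```cuda
#include <cstdio>

// Each block sums its slice of the input in shared memory;
// __syncthreads() synchronizes the block's threads between reduction steps.
__global__ void block_sum(const int *in, int *out, int n) {
    __shared__ int buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(out, buf[0]);   // atomic update of global memory
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; i++) in[i] = 1;
    *out = 0;
    // 1D grid of 1D blocks: 4 blocks of 256 threads each.
    block_sum<<<4, 256>>>(in, out, n);
    cudaDeviceSynchronize();      // host waits until the kernel completes
    printf("sum=%d\n", *out);
    cudaFree(in); cudaFree(out);
    return 0;
}
```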
OpenCL
Level of parallelism: Work items executing kernels in parallel. Work items are organized into an NDRange of work groups, each of which consists of a number of work items; both work items within a work group and work groups within an NDRange can be organized in 1D, 2D, or 3D logical structures. Kernel execution, host-to-device copying, and device-to-host copying can be overlapped if issued into different command queues.
Parallelism constructs: Invocation of a kernel function launches parallel computations by an NDRange of work items; OpenCL allows parallel execution of kernels on various compute devices such as CPUs and GPUs.
Synchronization constructs: Execution of all of an NDRange's work items is synchronized after the kernel has completed; within a kernel, work items in a work group can be synchronized with a call to barrier() with flags indicating whether local or global memory is to be synchronized; synchronization using events is also possible; atomic operations are available for synchronizing accesses to global or local memory.
Pthreads
Level of parallelism: Threads launched explicitly for execution of a particular function.
Parallelism constructs: A call to pthread_create() creates a thread executing a specific function, a pointer to which is passed as a parameter.
Synchronization constructs: A thread can be waited for by the thread that called pthread_create() with a call to pthread_join(); mechanisms for thread synchronization include mutexes, condition variables with wait (pthread_cond_wait()) and notify routines (e.g., pthread_cond_signal()), and barriers (pthread_barrier_wait()); the memory view among threads is implicitly synchronized upon invocation of selected functions.
OpenACC
Level of parallelism: Three levels of parallelism are available: execution by gangs, one or more workers within a gang, and vector lanes within a worker.
Parallelism constructs: Parallel execution of the code block marked with #pragma acc parallel; parallel execution of a loop can be specified with #pragma acc loop.
Synchronization constructs: For #pragma acc parallel, an implicit barrier is present at the end of the following block if async is not specified; atomic accesses are possible with #pragma acc atomic. According to the documentation [10], the user should not attempt to implement barrier synchronization, critical sections, or locks across any of gang, worker, or vector parallelism.
Java Concurrency
Level of parallelism: Threads inside the same JVM.
Parallelism constructs: The main thread, created at JVM start for the main() method, is the root of other threads created dynamically using explicit constructs, e.g., new Thread(), or implicit ones, e.g., a thread pool.
Synchronization constructs: Typical shared-memory mechanisms such as synchronized sections or guarded blocks.
TCP/IP
Level of parallelism: Whole nodes across the network.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: IP addresses and ports are used for distinguishing connections/destinations; no specific synchronization constructs.
RDMA
Level of parallelism: Whole nodes across the network.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: Remote access using indicators of the accessed memory.
UCX
Level of parallelism: Whole nodes across the network.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: Special APIs for message passing and memory access.
MPI
Level of parallelism: Processes (plus threads when combined with a multithreaded API such as OpenMP or Pthreads, if the MPI implementation supports the required thread support level).
Parallelism constructs: Processes created with mpirun at application launch, plus potentially processes created dynamically with a call to MPI_Comm_spawn or MPI_Comm_spawn_multiple.
Synchronization constructs: MPI collective routines: barrier, communication calls such as MPI_Gather, MPI_Scatter, etc.
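A minimal sketch of the MPI process model and collective routines above (compiled with mpicc and launched with, e.g., mpirun -np 4, so it is not runnable as a plain C program): each process determines its rank, a reduction collects a value at the root, and MPI_Barrier synchronizes all processes.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes */

    /* Collective reduction: every process contributes its rank,
       the sum arrives at process 0. */
    int total = 0;
    MPI_Reduce(&rank, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Collective barrier across all processes of the communicator. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks over %d processes: %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```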
OpenSHMEM
Level of parallelism: Processes, possibly on different compute nodes.
Parallelism constructs: Processes created with oshrun at application launch.
Synchronization constructs: OpenSHMEM synchronization and collective routines: barrier, broadcast, reduction, etc.
PCJ
Level of parallelism: So-called nodes placed in possibly separate JVMs on different compute nodes.
Parallelism constructs: The node structure is created by a main Manager node at application launch.
Synchronization constructs: PCJ synchronization and collective routines: barrier, broadcast, etc.
Apache Hadoop
Level of parallelism: A task is a single process running inside a JVM.
Parallelism constructs: API for formulating MapReduce functions.
Synchronization constructs: Synchronization managed by YARN; API for data aggregation (the reduce operation).
Apache Spark
Level of parallelism: Executors running worker threads.
Parallelism constructs: RDD and DataFrame APIs for managing distributed computations.
Synchronization constructs: Managed by the built-in Spark Standalone cluster manager or by an external cluster manager such as YARN or Mesos.