Review Article

Survey of Methodologies, Approaches, and Challenges in Parallel Programming Using High-Performance Computing Systems

Table 3. Technologies and parallelism.

For each technology/API, the table lists the level of parallelism, the parallelism constructs, and the synchronization constructs available.

OpenMP
Level of parallelism: Thread teams executing selected regions of an application.
Parallelism constructs: Directives that mark a region for parallel execution, such as #pragma omp parallel, #pragma omp sections, etc.
Synchronization constructs: Several constructs allow synchronization, such as #pragma omp barrier; constructs that restrict a part of the code to a certain thread, e.g., #pragma omp master, #pragma omp single; the critical section #pragma omp critical; and directives for data synchronization, e.g., #pragma omp atomic. (A minimal C sketch follows the table.)

CUDA
Level of parallelism: Threads executing kernels in parallel. Threads are organized into a grid of blocks, each of which consists of a number of threads; both threads in a block and blocks in a grid can be organized in 1D, 2D, or 3D logical structures. Kernel execution, host-to-device copying, and device-to-host copying can be overlapped if issued into various CUDA streams.
Parallelism constructs: Invocation of a kernel function launches parallel computations by a grid of threads; execution on several GPUs in parallel is possible.
Synchronization constructs: Execution of all of the grid's threads is synchronized when the kernel completes; within a kernel, the threads of a block can be synchronized with a call to __syncthreads(); atomic functions are available for accessing global memory.

OpenCL
Level of parallelism: Work items executing kernels in parallel. Work items are organized into an NDRange of work groups, each of which consists of a number of work items; both work items in a work group and work groups in an NDRange can be organized in 1D, 2D, or 3D logical structures. Kernel execution, host-to-device copying, and device-to-host copying can be overlapped if issued into various command queues.
Parallelism constructs: Invocation of a kernel function launches parallel computations by an NDRange of work items; OpenCL allows parallel execution of kernels on various compute devices such as CPUs and GPUs.
Synchronization constructs: Execution of all of the NDRange's work items is synchronized when the kernel completes; within a kernel, work items in a work group can be synchronized with a call to barrier() with an indication of whether a local or global memory variable should be synchronized; synchronization using events is also possible; atomic operations are available for synchronization of references to global or local memory. (A minimal host-side C sketch follows the table.)

Pthreads
Level of parallelism: Threads are launched explicitly for execution of a particular function.
Parallelism constructs: A call to pthread_create() creates a thread for execution of a specific function, a pointer to which is passed as a parameter.
Synchronization constructs: Threads can be synchronized by the thread that called pthread_create() through a call to pthread_join(); mechanisms for synchronization of threads include mutexes, condition variables with wait, e.g., pthread_cond_wait(), and notify routines, e.g., pthread_cond_signal(), and barriers, e.g., pthread_barrier_wait(); the memory view is implicitly synchronized among threads upon invocation of selected functions. (A minimal C sketch follows the table.)

OpenACC
Level of parallelism: Three levels of parallelism are available: execution by gangs, one or more workers within a gang, and vector lanes within a worker.
Parallelism constructs: Parallel execution of the code within a block marked with #pragma acc parallel; parallel execution of a loop can be specified with #pragma acc loop.
Synchronization constructs: For #pragma acc parallel, an implicit barrier is present at the end of the following block if async is not present; atomic accesses are possible with #pragma acc atomic. According to the documentation [10], the user should not attempt to implement barrier synchronization, critical sections, or locks across any of gang, worker, or vector parallelism. (A minimal C sketch follows the table.)

Java Concurrency
Level of parallelism: Threads inside the same JVM.
Parallelism constructs: The main thread, created during JVM start in the main() method, is the root of other threads created dynamically using explicit constructs, e.g., new Thread(), or implicit ones, e.g., thread pools.
Synchronization constructs: Typical shared memory mechanisms such as synchronized sections or guarded blocks.

TCP/IP
Level of parallelism: Whole network nodes.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: IP addresses and ports are used for distinguishing connections/destinations; no specific constructs. (A minimal socket sketch in C follows the table.)

RDMA
Level of parallelism: Whole network nodes.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: Remote access using identifiers of the accessed memory regions.

UCX
Level of parallelism: Whole network nodes.
Parallelism constructs: Managed manually by adding and configuring hardware.
Synchronization constructs: Dedicated APIs for message passing and memory access.

MPI
Level of parallelism: Processes (plus threads when combined with a multithreaded API such as OpenMP or Pthreads, provided the MPI implementation supports the required thread support level).
Parallelism constructs: Processes created with mpirun at application launch, plus potentially processes created dynamically with a call to MPI_Comm_spawn or MPI_Comm_spawn_multiple.
Synchronization constructs: MPI collective routines: barrier and communication calls such as MPI_Gather, MPI_Scatter, etc. (A minimal C sketch follows the table.)

OpenSHMEM
Level of parallelism: Processes, possibly on different compute nodes.
Parallelism constructs: Processes created with oshrun at application launch.
Synchronization constructs: OpenSHMEM synchronization and collective routines: barrier, broadcast, reduction, etc. (A minimal C sketch follows the table.)

PCJ
Level of parallelism: The so-called nodes, placed in possibly separate JVMs on different compute nodes.
Parallelism constructs: The node structure is created by a main Manager node at application launch.
Synchronization constructs: PCJ synchronization and collective routines: barrier, broadcast, etc.

Apache Hadoop
Level of parallelism: A task is a single process running inside a JVM.
Parallelism constructs: An API to formulate MapReduce functions.
Synchronization constructs: Synchronization managed by YARN; API for data aggregation (the reduce operation).

Apache Spark
Level of parallelism: Executors run worker threads.
Parallelism constructs: RDD and DataFrame APIs for managing distributed computations.
Synchronization constructs: Managed by the built-in Spark Standalone cluster manager or by an external cluster manager: YARN, Mesos, etc.
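
To make the OpenMP constructs from Table 3 concrete, below is a minimal C sketch (not taken from any surveyed application) that combines #pragma omp parallel, a work-sharing loop, #pragma omp atomic, #pragma omp barrier, #pragma omp single, and #pragma omp critical. The reduction clause and the runtime calls omp_get_thread_num() are standard OpenMP features used here only for illustration; the file and variable names are arbitrary.

```c
/* Minimal OpenMP sketch: a parallel region, a work-sharing loop with a
   reduction, and the synchronization constructs listed in Table 3.
   Compile e.g. with: gcc -fopenmp omp_demo.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;
    int done = 0;                      /* shared progress counter */

    #pragma omp parallel               /* team of threads */
    {
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        #pragma omp atomic             /* data synchronization */
        done++;

        #pragma omp barrier            /* explicit synchronization point */

        #pragma omp single             /* executed by one thread only */
        printf("%d threads finished, sum = %f\n", done, sum);

        #pragma omp critical           /* one thread at a time */
        printf("thread %d leaving\n", omp_get_thread_num());
    }
    return 0;
}
```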
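
The following host-side C sketch illustrates the OpenCL flow summarized in Table 3: a kernel is built from source and launched over a 1D NDRange, and a blocking read on the in-order command queue synchronizes the host with kernel completion. The kernel name vadd, the vector-add computation, and the default device selection are illustrative assumptions, and error checking is largely omitted for brevity.

```c
/* Minimal OpenCL host-side sketch: build a vector-add kernel from source
   and launch it over a 1D NDRange of work items.
   Compile e.g. with: gcc ocl_vadd.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(a), a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(b), b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    /* 1D NDRange of N work items; work-group size chosen by the runtime. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: the host waits until the kernel and copy complete. */
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    printf("c[10] = %f\n", c[10]);

    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```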
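
For Pthreads, a minimal sketch of explicit thread creation, mutex-based synchronization of a shared counter, and waiting with pthread_join() is given below; the worker function name, thread count, and iteration count are arbitrary choices for illustration.

```c
/* Minimal Pthreads sketch: explicit thread creation, a mutex protecting a
   shared counter, and pthread_join() for synchronization.
   Compile e.g. with: gcc pthreads_counter.c -lpthread */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* mutual exclusion on shared data */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    printf("thread %ld finished\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    /* Launch threads explicitly; each executes worker(). */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);

    /* The creating thread waits for each worker with pthread_join(). */
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * 100000);
    return 0;
}
```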
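
A minimal OpenACC sketch follows, showing #pragma acc parallel loop, an atomic update with #pragma acc atomic, and the implicit barrier at the end of the region (no async clause). The saxpy-like computation, the data clauses, and the compiler invocations are illustrative assumptions.

```c
/* Minimal OpenACC sketch: a parallel loop with an atomic update and an
   implicit barrier at the end of the parallel region.
   Compile e.g. with: nvc -acc acc_demo.c   (or gcc -fopenacc acc_demo.c) */
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    int count = 0;

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Gangs, workers, and vector lanes execute the iterations in parallel. */
    #pragma acc parallel loop copy(y, count) copyin(x)
    for (int i = 0; i < n; i++) {
        y[i] = 2.0f * x[i] + y[i];
        if (y[i] > 3.5f) {
            #pragma acc atomic update   /* safe concurrent increment */
            count++;
        }
    }
    /* Implicit synchronization here: no async clause, so the host waits. */

    printf("y[0] = %f, count = %d\n", y[0], count);
    return 0;
}
```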
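
Since TCP/IP itself offers no parallel programming constructs, the sketch below only shows the client side of a connection addressed by an IP address and port, as noted in Table 3. The address 127.0.0.1, port 5000, and the exchanged message are placeholders, and a server is assumed to be already listening there.

```c
/* Minimal TCP/IP client sketch using POSIX sockets: connect to a server
   identified by IP address and port, send a request, read a reply.
   Compile e.g. with: gcc tcp_client.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);       /* TCP socket */

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);                  /* destination port (placeholder) */
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);

    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) != 0) {
        perror("connect");
        return 1;
    }

    /* No built-in synchronization constructs: correctness relies on the
       application-level protocol agreed between client and server. */
    const char *msg = "hello";
    send(fd, msg, strlen(msg), 0);

    char buf[256];
    ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("received: %s\n", buf);
    }
    close(fd);
    return 0;
}
```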
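
A minimal MPI sketch in C is shown next: processes are created at launch (e.g., mpirun -np 4 ./mpi_sum), identified by their rank within MPI_COMM_WORLD, and synchronized with the collective calls MPI_Barrier and MPI_Reduce. The reduced values and the file/compiler names are illustrative.

```c
/* Minimal MPI sketch: rank/size queries, an explicit barrier, and a
   collective reduction to rank 0.
   Compile e.g. with: mpicc mpi_sum.c -o mpi_sum */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes a partial value. */
    int local = rank + 1;
    int total = 0;

    MPI_Barrier(MPI_COMM_WORLD);                     /* explicit synchronization */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM,  /* collective communication */
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}
```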
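
Finally, a minimal OpenSHMEM sketch in C: PEs launched with oshrun allocate a symmetric variable, perform a one-sided put into a neighboring PE's memory, and synchronize with shmem_barrier_all(). The ring-style exchange and the oshcc compiler wrapper are illustrative assumptions.

```c
/* Minimal OpenSHMEM sketch: symmetric allocation, a one-sided put to the
   next PE, and barrier synchronization.
   Compile e.g. with: oshcc shmem_ring.c ; run with: oshrun -np 4 ./a.out */
#include <stdio.h>
#include <shmem.h>

int main(void) {
    shmem_init();

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same address is valid on every PE. */
    int *dest = (int *)shmem_malloc(sizeof(int));
    *dest = -1;

    shmem_barrier_all();                 /* allocation/initialization done on all PEs */

    /* One-sided put of my PE number into the next PE's symmetric memory. */
    int src = me;
    shmem_int_put(dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();                 /* completes the puts before reading */
    printf("PE %d received %d\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```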