Abstract
This study reports the GPU parallelization of complex threedimensional software for nonlinear analysis of concrete structures. It focuses on coupled thermomechanical analysis of complex structures. A coupled FEM/DEM approach (CDEM) is given from a fundamental theoretical viewpoint. As the modeling of a large structure by means of FEM/DEM may lead to prohibitive computation times, a parallelization strategy is required. With the substantial development of computer science, a GPUbased parallel procedure is implemented. A comparative study between the GPU and CPU computation results is presented, and the runtimes and speedups are analyzed. The results show that dramatic performance improvements are gained from GPU parallelization.
1. Introduction
Many authors have studied the resolution of the nonlinear concrete structure problems, focusing rigorously on the threedimensional (3D) elastoplastic problem, such as Roca and Marí [1] and Spacone et al. [2]. These authors used a finite element model. Other authors (e.g., Rousseau et al. [3], Villard et al. [4]) proposed a coupled FEM/DEM approach with regard to nonlinear material analysis of structures. As is shown, the modeling of a large structure by means of FEM or DEM may lead to prohibitive computation times. Thus, some authors (e.g., Sziveri et al. [5], Romero et al. [6]) developed CPUbased parallel procedures. With significant development of computer science, the implementation of a GPUbased parallel procedure becomes possible.
A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to display [7]. GPU has become an integral part of today’s mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart [8]. Over the past decade, there has been a marked increase in the performance and capabilities of GPUs. As a result, computing on a GPU card has been a hot topic for scientific research in different numerical areas [9–21]. The techniques, which enable GPUs to efficiently undertake largescale scientific computing tasks, can be summarized as follows [21].(i)A GPU card consists of a number of multiprocessors (as shown in Figure 1). On each processor, there are several hundred coresident threads that can execute integer, single and double precision calculations simultaneously.(ii)Finegrained parallelization with a very large number of threads is the kernel principal of GPU computing.(iii)The memory access technique is specially designed in a GPU to accelerate the memory access speed of the threads. To achieve near peak memory access performance, memory coalescing is considered in the design of the data structure and the computational arrangement for computations.(iv)The release of Compute Unified Device Architecture (CUDA) makes it much more straightforward to implement GPU code.
CUDA is a parallel computing platform and programming model (as shown in Figure 2) created by NVIDIA and implemented by the GPUs that they produce [22]. It enables dramatic increases in computing performance by harnessing the power of the GPU. It is the most popular programming toolkit in computational sciences. It has already been used by researchers to parallelize their codes of different numerical methods, for example, Molecular Dynamic (MD) [9, 10], Lattice Boltzmann Method (LBM) [11, 12], Boundary Element Method (BEM) [13], Finite Element Method (FEM) [14–16], Discrete Element Method (DEM) [17, 18], Finite Difference Method [19], Moving Particle Semiimplicit (MPS) method [20], and Distinct Lattice Spring Model (DLSM) [21]. The overall speedups of GPU parallel codes compared to serial CPU counterparts are reported to be from 10x to 100x [9–21]. This is promising for both scientific computations and engineering practices.
To model the nonlinear response of complex structures, we will parallelize the coupled FEM/DEM code on a GPU with CUDA. In the first section, the coupled FEM/DEM approach with its basic formulations is introduced. Then, the GPU implementation and optimization techniques are presented. Furthermore, the parallel FEM/DEM code is tested on computers equipped with modern GPU cards. Finally, conclusions regarding the GPU parallelization are drawn.
2. Coupled Finite/Discrete Element Model
2.1. Geometric Representation
Li et al. [24, 25] proposed a Continuumbased Discrete Element Method (CDEM) which in fact is a Coupled Finite/Discrete Element Method and has a variety of applications. A detailed description about CDEM will be introduced in the following context.
CDEM is a combination of Finite Element Method (FEM) and Discrete Element Method (DEM). Figure 3 shows the geometric domains used by the CDEM approach. It contains two kinds of elements, blocks (i.e., A, B, C, and D) and contacts (Figures 3(c) and 3(d)). A discrete block consists of one or more FEM elements, all of which share the same nodes and faces. In this paper, it is assumed that a discrete block consists of only one FEM element. All three element types are shown in Figure 4. A contact contains several normal and tangent springs (Figure 5); each connects nodes of neighboring blocks. Inside the block, the FEM is used, while for the interface, the DEM is adopted.
(a) The original domain
(b) The FEM domain
(c) The FEM/DEM domain
(d) The DEM domain
(a) 4node tetrahedron
(b) 6node wedge
(c) 8node hexahedron
2.2. Governing Equations
Block stress analysis means figuring out stress of every single block and the interaction between different blocks. For every block in the model, it needs to satisfy the following governing equations.
Equilibrium equation:
Straindisplacement relationship:
Constitutive law:
Boundary conditions:
Initial conditions: In (1)~(5), represents the firstorder partial derivative of stress tensor versus coordinate; stands for the body force; is density; denotes the acceleration; denotes the velocity; is damping ratio; , are both the firstorder partial derivatives of displacement versus coordinate; , are strains; is stress; is stressstrain tensor.
It should be mentioned that various constitutive models in (3) can be used in the block, including the linear elastic model, DruckerPrager model, the block breakage model, and the discrete spring model.
2.3. Thermal Analysis
The equilibrium equation of heat transfer can be written as
The initial condition is
The boundary conditions are
In (6)~(8), represents age; is thermal conductivity; stands for density; is specific heat capacity; is temperature; is the initial temperature of concrete; is the fixed boundary temperature; represents heat flux; denotes ambient temperature in natural convection conditions and adiabatic temperature of boundary layer in forced convection conditions; is heat transfer coefficient in surface.
By using the variation formulation, the equilibrium equation of heat transfer in (7) can be transformed into the following matrix form: where is the matrix for specific heat capacity, is the matrix for heat conductivity, and is the heat flux vector.
A time forward difference scheme is used for thermal analysis. Thus, the time difference equation can be formulated as where stands for the temperature vector at time step ; is the time step.
2.4. Dynamic Relaxation Method
By using the variation formulation, the equilibrium equation of momentum in (1) can be transformed into the following matrix form in an element: where represents the diagonal mass matrix; represents the damping matrix; represents the stiffness matrix; represents the vector of displacement; represents the vector of external force.
In our approach, global stiffness matrix is not assembled. Instead, (11) is iterated element by element, using the dynamic relaxation method. In every element, its displacement satisfies (11) strictly. If the solution of every element satisfies (11), then the overall solution made up of element solutions must satisfy the assembled equation. In fact, this is beneficial for the parallel implementation.
In time domain, an explicit iteration technique is applied. In this technique, the acceleration is iterated by the central difference scheme, while the velocity is iterated by the unilateral difference scheme. The schemes can be written as where , , and , respectively, represent the acceleration, the velocity, and the displacement of the th node of the th element, at the th time step.
The explicit iteration technique can be formulated from (12):
The process of explicit iteration using the dynamic relaxation method is illustrated in Figure 6. During the iteration, convergence is reached when the total magnitude of the kinetic energy is minimized.
3. GPU Implementation
3.1. GPU Implementation with CUDA
In Section 2, we refer to the CDEM approach, in which global stiffness matrix is not assembled. An elementbyelement strategy is employed to solve the equilibrium equation. In this sense, the CDEM is suitable to run on a GPU since calculations involved in the CDEM are all performed independently for the elements and the nodes.
In GPU parallelization, heterogeneous computing methodology is usually used. This means that the serial part of the code executes in a host (CPU) thread, while the parallel part executes in a large amount of device (GPU) threads. The device can be viewed as a virtual computer that has its own separate memory space. It is ideally suited to perform element and nodal calculations across hundreds of threads. Figure 7 shows a flow chart of the GPU implementation of the CDEM. One principle for the implementation is to replace the original CPUbased calculation functions with CUDA kernels. A kernel is a function that runs on the device. Another principle is to integrate as many CUDA kernels into a large one as possible. This is to minimize the delay of kernel launching. The kernels should be integrated only when they can be parallelized. Serial and dependent kernels cannot be integrated together.
An example is given to demonstrate the GPU implementation with CUDA. In the example, the CPU functions DoElems and DoNodes are first replaced by corresponding CUDA kernels. Then, the CUDA kernels are integrated into one CUDA kernel DoKernel. The additional qualifier “__global__” is used to define a CUDA kernel. Embellished with <<<nBlocks, nThreads>>>, the DoKernel function will be run in threads in parallel (see Algorithms 1 and 2).


As illustrated in Figure 7, stress and thermal fields are analyzed on two GPUs concurrently. Such multiGPU technique ensures higher efficiency when we perform multifield simulation for complex structures. All of the data in two fields will be first sent to the device. During the calculation, only thermal stress will be read from one GPU and sent to another. The communication is realized by using the CUDA function cudaMemcpy. The data transfer from the device to the host only happens when the simulation results need to be output. Since the data are kept on the device for the whole simulation process, memory transfer between host and device is largely reduced. Data transfer from GPU2 to GPU1 only happens under necessary circumstances, which will not affect the overall efficiency.
3.2. GPU Parallel Algorithm and Procedure
The GPU parallel algorithm is realized by GPU kernels. In Section 3.1, we have discussed the implementation of one GPU kernel. Now all the kernels will be presented to demonstrate the detailed process (see Algorithm 3).

The kernels GenerateStiffness() and GenerateConductivity() generate element stiffness matrix and element conductivity matrix, respectively. They use similar algorithms. To generate element matrix, we perform the Gaussian integrations over each element by using the following formula: where is strain matrix; is stressstrain tensor; is the determinant of Jacobian matrix; , , and are weights in Gaussian integrations.
The strain matrix is calculated by GPU. For linear problems, is calculated and stored in GPU memory before iteration. For memoryconsuming problems (i.e., nonlinear or big problems), is calculated when iteration is taking place. Since the GPU calculation of matrix is fast, it costs little time. For different elements, the parallel algorithms to generate matrix are different. The algorithm is shown below.
Algorithm of Generation of Matrix. (1) For each 4node tetrahedron element, there is one or four Gaussian integration points. Each element is divided into four threads. If there is one Gaussian integration point, the first thread performs the Gaussian integration and the weight is unit. If there are four Gaussian integration points, each of the four threads performs its own Gaussian integration and each of the four weights is 0.25.
(2) For each 6node wedge element, there are six Gaussian integration points. Every 8 threads are one group. In one group, only the first six threads perform the Gaussian integrations, while the left two threads do nothing.
(3) For each 8node hexahedron element, there are eight Gaussian integration points. Each element is divided into eight threads. Each thread performs the Gaussian integration of one Gaussian point.
The kernel DoKernel() does nodal and element calculation of stress field. The algorithm in DoKernel() can be summarized as follows.
Algorithm of DoKernel(). (1) Calculate external nodal forces , which include tractions and body forces.
(2) Calculate internal nodal forces: (in static problems, ).
(3) Calculate total nodal forces: .
(4) Calculate nodal accelerations by Newton’s Second Law: .
(5) Calculate nodal velocities and displacements by (13).
(6) Repeat 1~5 until convergence is reached.
Steps 1~3 are calculations in terms of elements, while steps 4~5 are in terms of nodes. In parallel algorithm, all elements and nodes are distributed into nBlocks × nThreads threads. Thus, the DoKernel() function will be run in nBlocks × nThreads threads in parallel.
It is worth mentioning that a CUDA library function __syncthreads() must be called between Steps 1 and 2. The matrix multiplication is performed by GPU. It demands that the displacement vector be synchronized. Thus, the function __syncthreads() is used to keep all the displacement components at the same level.
Boundary conditions need special treatments. For boundary, no calculation is done. The displacements are set as zero. For traction boundary, tractions are equivalent to nodal forces. Each thread performs the calculation of one node.
Convergence is reached if the kinetic energy is lower than a threshold. The kinetic energy is defined as.
3.3. Performance Optimization
The parallel procedure is implemented and tested by using NVIDIA CUDA Toolkit 3.0. The 8node hexahedron element is adopted. The 6node wedge element can be viewed as a degenerated element from 8node hexahedron. The 4node tetrahedron element owns half the number of nodes, compared with the 8node hexahedron element. They share a proportional (1/2) relationship with each other. Thus, a proportional optimization strategy can be used. In this sense, once the 8node hexahedron element is optimized, the 4node tetrahedron element is optimized as well by multiplying a proportion of 1/2.
To optimize the parallel procedure, the number of threads per block needs to be carefully chosen. As is recommended by NVIDIA [29], the best results will be achieved when the number of threads per block is a multiple of 64. For ideal performance, this number is recommended to be over 192. In practice, the best performance is achieved when the number of elements per block is 16 or 32. Equations (16) [18] and (17) show the relationship between the number of threads per block and the number of elements per block .
8node/6node elements:
4node element:
The effective bandwidth of each memory space depends significantly on the memory access pattern. As described in [29], global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a halfwarp (during the execution of a single read or write instruction) can be coalesced into a single memory transaction of 32, 64, or 128 bytes. To achieve maximum global memory bandwidth, the data structures are realigned to thread. Besides, shared memory is used to store nodal variables, for the shared memory space is much faster than the local and global memory spaces, and nodal variables are reused among threads [18].
4. Performance Tests
4.1. Test Platform
In all our tests, three computers, one Intel Xeon E5506 at 2.13 GHz with NVIDIA GeForce GTX580, one Intel Xeon E526090 at 2.40 GHz with NVIDIA GeForce GTX670, and one Intel Core i5 2320 at 3.20 GHz with NVIDIA GeForce GTX680, are used. All the three computers run a 64bit version of Windows 7 with the CUDA driver 3.0 as the compiler. Details of the GPU cards are summarized in Table 1.
4.2. Test Model
The model of a dam is tested by the following steps.(i)First, the gravity field is simulated to test the accuracy of the GPU procedure, to get the runtime distribution of different GPU kernels, and to obtain the concerning speedups.(ii)Then, the thermal field is calculated with a distribution of thermal stress.(iii)Finally, the fracture field is determined by the superposition of the original gravity and the newlyinduced thermal stress.
The model of the dam is presented in Figure 8(a). The boundary conditions for gravity simulation are displacement constraints. The front and back faces ( directions) and the left and right faces ( directions) are all constrained with zero displacements in normal directions. The bottom face ( direction) is constrained with zero displacements in all directions. The thermal boundary conditions are illustrated in Figure 8(b), and the temperatures used in boundary conditions are presented in Table 2.
(a) Isotropic view (11166 hexahedron elements)
(b) The thermal boundary conditions
4.3. GPU Analysis
4.3.1. Accuracy
The results from the CPU and the GPU procedures are not identical though they look the same to each other (see Figure 9). A comparison between the results is illustrated in Figure 10. This comparison shows that the results from different GPUs are the same, but they have a little difference compared with the CPU result. Further, the errors between different procedures are shown in Table 3. The average error is about 2%.
(a) displacement by CPU Core i5 2320 
(b) displacement by GPU GTX580 
(c) displacement by GPU GTX670 
(d) displacement by GPU GTX680 
(a)
(b)
(c)
(d)
The reason why errors happen is that the parallel algorithm has a different accumulation order from the serial one. The accumulation occurs in the nodal force redistribution function. The CPU procedure accumulates the nodal forces one by one in the order of the elements, while the GPU version has a different order, which depends on the generation of nodal force group. Besides, the Arithmetic Logical Units (ALUs) on CPU and GPU have different relative error bounds.
4.3.2. Runtime
Figure 11 illustrates the runtimes of different GPU kernels. It shows that the iteration kernel of dynamic relaxation takes possession of over 95% of the runtime, while other kernels take possession of that less than 5%. The iteration kernel determines the overall runtime, which means that the iteration speedup can represent the overall speedup to some extent. The second timeconsuming is the data output kernel. An average of about 3% of runtime is consumed in data output. When big problems are calculated, this part of runtime can be large. This encourages us to reduce data transfer between host and device. Other kernels consume little. Their total runtime is less than 1%, which can be neglected in this study.
(a) GPU580, single precision
(b) GPU580, double precision
(c) GPU670, single precision
(d) GPU670, double precision
(e) GPU580, single precision
(f) GPU580, double precision
From another point of view, what makes the runtimes different on different platforms is concerned. Take the runtime of stiffness matrix generation as an example. The runtimes on different platforms are shown in Table 4. On GTX680, the runtime is the least, only 0.828582 ms, nearly the same as the GTX570. The GTX580 is the most timeconsuming GPU. The reason why this happens can only be the difference of the numbers of CUDA cores. This indicates that more CUDA cores will reduce the runtime of stiffness matrix generation. However, the number of cores is not the only decisive factor. The processor clocks determine the runtimes of dynamic relaxation iterations, which can be learned from Figure 12. The higher the clock rate is, the less the runtime of iteration is consumed. Figure 13 indicates that newer GPU models are designed for less runtime of output. All the figures presented show that double precision calculations cost more runtimes than single precision ones.
The runtime analysis helps us to choose a proper GPU model for simulations in both scientific research and engineering applications.
4.3.3. Speedup
In parallel computing, speedup refers to how much a parallel algorithm is faster than a corresponding serial algorithm [30]. Speedup is defined by the following formula: where is the runtime of the serial algorithm; is the runtime of the parallel algorithm.
In the tests, we obtain different speedups, which are presented in Figure 14, on different GPU computers using different precisions. The figure shows that speedups range from about 100 to 400. The maximum speedup reaches dramaticly 417, while the minimum reaches 102. Following the optimization strategies presented in Section 3.2, we have gained dramatic performance improvements from GPU parallelization. The figure also shows that double precision calculations consume more runtimes than single precision ones. The GTX580 is the fastest GPU among the three GPUs we present.
4.4. Other Analysis
Temperatures in summer and in winter are both simulated, as shown in Figures 15 and 16. We obtain the steady thermal fields in summer (see Figure 15) and in winter (see Figure 16), respectively. The thermal differences between two thermal fields will produce thermal stresses, which may cause cracking in concrete. The crack caused by thermal stresses is a major research topic in future work.
(a)
(b)
(c)
(d)
(a)
(b)
(c)
(d)
5. Conclusions
In this study, the CDEM is successfully accelerated by using GPUs. In CDEM, global stiffness matrix is not assembled, and an elementbyelement strategy is employed to solve the equilibrium equation. This, as well as the performance optimization, enables us to implement an efficient GPU parallel procedure. Detailed tests on accuracy, runtime, and speedup are performed on different GPUs. The results show that dramatic performance improvements are gained from GPU parallelization. A maximum speedup of 417 has been achieved. The GPU parallelization of CDEM makes it more powerful in multifield analysis of complex structures.
Conflict of Interests
The authors do not have any conflict of interests with the content of the paper.
Acknowledgments
The authors would like to gratefully acknowledge the financial support of the National Basic Research Program of China (973 Programs: Grant no. 2010CB731500, Grant no. 2013CB036406, and Grant no. 2013CB035904) and National Natural Science Foundation of China (50909105). Acknowledgments also go to the colleagues/students Chun Feng, Dong Zhou, Qingbo Zhang, and so forth, for their kind help.