Abstract
We present a parallel framework for simulating incompressible fluids with predictivecorrective incompressible smoothed particle hydrodynamics (PCISPH) on the GPU in real time. To this end, we propose an efficient GPU streaming pipeline to map the entire computational task onto the GPU, fully exploiting the massive computational power of stateoftheart GPUs. In PCISPHbased simulations, neighbor search is the major performance obstacle because this process is performed several times at each time step. To eliminate this bottleneck, an efficient parallel sorting method for this timeconsuming step is introduced. Moreover, we discuss several optimization techniques including using fast onchip shared memory to avoid global memory bandwidth limitations and thus further improve performance on modern GPU hardware. With our framework, the realism of realtime fluid simulation is significantly improved since our method enforces incompressibility constraint which is typically ignored due to efficiency reason in previous GPUbased SPH methods. The performance results illustrate that our approach can efficiently simulate realistic incompressible fluid in real time and results in a speedup factor of up to 23 on a highend NVIDIA GPU in comparison to singlethreaded CPUbased implementation.
1. Introduction
Physically based simulation methods have revolutionized special effects in computer games and interactive applications, creating incredible scenes consisting of fluids, cloth, rigid bodies, and others. Among these methods, smoothed particle hydrodynamics (SPH) is a wellknown particlebased approach with a wide application for simulating many kinds of phenomena including water, fire, and gas. This paper focuses on simulating incompressible fluids for realtime visual effects.
Many studies have been devoted to realtime fluid simulation methods in recent years. However, these techniques suffer from visible compression artifacts. Although weakly compressible SPH (WCSPH) is the most widespread technique for pressure conservation in SPH based methods, the challenge with WCSPH is that to get sufficient realism requires very small time step. To solve this problem, PCISPH is introduced to enforce incompressibility with significantly larger time steps. While each iteration step costs more, the overall time required in each time step to achieve incompressibility is less. Although this makes it a good candidate for simulating realistic fluid phenomenon, it is still too computationally intensive for the stateoftheart CPU to put into realtime applications.
The modern GPU substantially outpaces its CPU counterpart in arithmetic throughput and memory bandwidth, making it the ideal processor to boost a vast variety of data parallel applications such as SPHbased fluid simulations. Originally GPU was designed for realtime graphics rendering applications. In the past decade, it has been established as one of the major parallel processors for computationally intensive tasks. Advances in massively parallel GPU architectures have opened doors to employing PCISPHbased methods for simulating realistic fluid phenomena in real time. Modern GPUs normally have thousands of compute cores which deliver tremendous computing power. PCISPH exposes a high degree of parallelism, making it a perfect candidate for stateoftheart GPUs. In the SPH literature, several GPUaccelerated techniques have been proposed to speed up the timeconsuming physics calculation. However, efficient GPUbased PCISPH solver for realtime incompressible fluids has not been presented.
In SPHbased fluid simulation, the physical attributes of one particle are typically integrated over its dynamically changing neighbors at every time step, and neighbor search is the computationally most expensive part of the simulation, especially when PCISPH is used since this algorithm requires searching neighboring particles multiple times per simulation step due to its iterative feature. Therefore, efficient neighbor search is critical for performance in PCISPHbased fluid simulations. In literature, acceleration data structures such as uniform grids and kdtrees are commonly used to improve the neighbor search performance. In our framework, uniform grid is adopted due to the single resolution of particles. To further utilize the spatial coherence of neighboring particles and temporal coherence of a particle between two subsequent physics steps in this spatial partitioning scheme, we introduce an efficient parallel sorting method for GPUbased neighbor search. The proposed technique is then compared with previous approaches from a performance point of view.
In this paper, we design an efficient GPU pipeline for PCISPH to make use of the substantial parallelism inherent in this algorithm. By implementing this pipeline using CUDA on modern GPUs, we achieve one order of magnitude improvement over a singlethreaded CPUbased algorithm. Figure 1 illustrates our first example where the interaction of incompressible fluids and solid bunny is modelled using our GPU pipeline in real time.
(a)
(b)
(c)
(d)
Our main contributions are summarized as follows:(i)an efficient GPU streaming pipeline for realtime incompressible fluid simulation,(ii)a new parallel sorting method for efficient SPH neighbor search,(iii)several CUDA optimization techniques for the proposed method, such as using fast onchip shared memory, constant cache, and readonly cache to reduce global memory data access.
Our paper is organized as follows. After related work is reviewed in Section 2, we give a brief overview of the predictive corrective incompressible SPH (PCISPH) algorithm in Section 3. Our GPU simulation framework and optimization techniques are introduced in Section 4, followed by the visual and performance results in Section 5. Finally, this paper is concluded in Section 6.
2. Related Work
SPH is a meshfree Lagrangian method which was originally invented for astrophysical applications [1, 2]. Muller et al. first introduced this technique to simulate compressible liquids [3]. Keiser et al. [4] proposed a unified SPH framework to handle interaction of compressible fluids and deforming solids. Solenthaler et al. [5] extended this unified particle model to animate melting and solidification phenomena. Adams et al. [6] presented an adaptive sampling algorithm based on SPH to improve simulation efficiency without observably affecting the fluid flow behavior. Goswami and Pajarola [7] accelerated SPH fluid simulation by using a globally adaptive time discretization method and by saving computations on the inactive fluid particles. Goswami and Batty [8] proposed a regional time stepping technique for WCSPH that can obtain a two times speedup over the traditional method.
Nevertheless, all these methods compute pressure from density using ideal gas equation which results in compressibility artifacts. To avoid spurious visual effects due to compressibility, many solutions have been presented. Cummins and Rudman [9] presented a pressure projection formulation for SPH fluids, where a pressure Poisson equation (PPE) was used to solve pressure. Premože et al. [10] introduced incompressible SPH (ISPH) method to the graphics community. Although this approach allows for large time steps, the PPE pressure computation is extremely computationally expensive per time step. To avoid timeconsuming Poisson equation solvers, Becker and Teschner [11] introduced Tait equation [12] to computer graphics to deal with nearly incompressible flows. The cost per simulation step of this method is very low, but it requires small times due to high stiffness to enforce incompressibility. To combine the advantages of both ISPH and WCSPH, that is, low cost per step and large time steps, Solenthaler and Pajarola [13] proposed PCISPH method which can handle typical incompressible scenarios with small fluid density deviations of down to 0.01% and comparatively large time steps. Ihmsen et al. [14] extended PCISPH to simulate incompressible fluids with complex boundaries. An extension of PCISPH for realizing twoway coupling with rigid bodies has been proposed in [15]. Recently, He et al. [16] presented a local Poisson SPH (LPSPH) approach based on equation of state to enforce incompressibility. Ihmsen et al. [17] proposed a formulation of the projection method which combines an SPH form of the pressure force and an SPH discretization of the continuity equation to obtain a discretized form of the PPE. Cornelis et al. [18] further integrated this method with fluidimplicitparticle (FLIP) to solve mass preservation issue in FLIP and efficiency issue in SPH.
In the past years, many solutions have been presented to take advantage of GPUs to simulate physically based effects such as SPH fluids that are computationally intensive [19]. Amada et al. [20] proposed a GPUassisted approach to solve SPH equations, in which case the physical properties calculations were carried out on the GPU, but the neighbor map creation was still executed on the CPU. Hegeman et al. [21] used a dynamic quadtree structure to accelerate the neighbor queries in SPHbased fluid simulations in 2D scenarios. Zhang et al. [22] presented an adaptive sampling algorithm for simulating and visualizing weakly compressible fluid on the GPU, which used RGBA channels to store fluid particle indices and mapped uniform grid onto a 2D texture.
However, the aforementioned techniques are difficult to implement without expert knowledge of the graphics API even with a higherlevel programming environment. This has been changed since the appearance of NVIDIA’s CUDA. Since its introduction in 2006, many research works have been carried out with this generalpurpose parallel programming model. Harada et al. [23] fully exploited the massive computational power of programmable graphics hardware by developing a method that executes all calculations including neighbor search entirely on the GPU. Harada [24] presented a method using CUDA instead of graphics APIs, which can more easily make use of multiple processors with distributed memory to accelerate particlebased fluid simulation. Green [25] adopted an index sorting approach to improve the performance of neighbor search on the GPU. Later, Goswami et al. [26] utilized shared memory to further optimize the CUDA kernels where neighboring particles information is recalculated instead of being stored on the device. Zhang et al. [27] used multiGPUs to improve the efficiency of weakly compressible fluid simulation and to overcome the device memory limitations of a single GPU.
3. Background
SPH is a meshfree technique in which fluid volume is discretized into a finite set of particles. The fluid properties at an arbitrary point are accumulated using its neighboring particles. That is, where is the physical property at particle . , , are mass, density, and physical property at particle . is the smoothing kernel in the support radius of particle . The density of fluid particle can be calculated by substituting into (1) as Similarly, we can obtain the pressure force for an arbitrary particle : and its viscosity forces: where and are the velocities of particle and its neighbor .
However, unrealistic compressibility artifacts are produced when simulating fluid using standard SPH method, since the ideal gas state equation is used to relate pressure to density. To avoid this problem, an iterative equation of state solver PCISPH (see Algorithm 1) is introduced where a predictorcorrector technique is applied to iteratively calculate fluid pressures that enforce incompressibility. At the beginning of each time step, neighbor search is first executed, followed by calculating all forces except for pressure force, which will be used to predict positions and velocities of fluid particles. Afterwards, the predictioncorrection loop is performed to propagate density fluctuations through the fluid and update pressure and pressure forces until the desired density variation constraint is satisfied; that is, is smaller than a user defined threshold . In each iteration, the pressure of particle is updated according to where is

The pressure force of particle is updated according to (3). Finally, time integration is executed to advance the positions and velocities of all particles using the total force exerted on each particle:
4. Our GPU Simulation Framework
PCISPHbased fluid simulator requires intensive numerical computations. Therefore, it is important to design an efficient strategy to map all the physical simulation steps onto the GPU to improve its performance. In this section, we introduce a GPUfriendly particle data representation and a parallel sorting technique for fast neighbour search and develop an incompressible fluid simulation pipeline fully on the GPU to take advantage of its massively parallel processing power.
4.1. Particle Data Representation
In CPUbased PCISPH method, a single particle is represented as follows: Struct Particle { float position; float velocity; float predicted_position; float external_force; float correction_pressure_force; float correction_pressure; float predicted_density; float pressure; int type; int index; }
Thus, the whole particle sets are defined as Particle particleArray[N],
where N is the total number of particles. If we access each component of Particle concurrently, then AoS makes sense because, in this situation, successive reads of all components of each particle are contiguous. However, we observe that not all physical attributes are accessed at the same time in our simulation; for example, fluid density evaluation only needs to read particle position data. In other words, scattered memory access pattern occurs. In such a simulation, it is generally better to store fluid particle properties in structure of arrays (SoA) instead of array of structures (AoS) to achieve a higher GPU memory bandwidth. This is due to the reason that a better coalesced memory access can be satisfied which results in optimum performance.
Therefore, in our implementation we patch fluid particle data in a SoA manner: Struct Particle { float4 position[N]; float4 velocity[N]; float4 predicted_position[N]; float4 external_force[N]; float4 correction_pressure_force[N]; float correction_pressure[N]; float predicted_density[N]; float pressure[N]; int type[N]; int index[N]; }
For CUDA architecture, this means that memory coalescing and successive data are contained within the same cache line, leading to optimal global memory bandwidth and high cachehit rate. Note that we apply padding strategy in our particle data structures; that is, we add a fourth dummy element to each 3dimensional vector to ensure 4byte alignment which is also critical for achieving high performance on the GPU.
4.2. Parallel Sorting Method for GPUBased Neighbor Search
In PCISPH, the particle neighbor search step is performed many times in each physics update. Thus, it is critical for improving the total performance of the simulation system that we can efficiently query particle neighbors. As with most CPUbased and GPUbased SPH methods, we use 3D uniform grid data structure to accelerate neighbor search process since all particles have the same radius in our framework.
In uniform grid, each fluid particle is assigned to one grid cell according to its spatial location. To determine the neighborhood of one particle, we only need to search for particles that locate in the same grid cell or in one of the other 26 neighboring grid cells. In other words, 27 grid cells need to be queried for every particle. Although uniform grid is one of the most widely used spatial subdivision techniques on the CPU side, the parallel construction of it on the GPU is not quite straightforward due to race conditions of particle insertion. To avoid memory conflicts on the parallel architectures such as GPU, index sort [25] is used. This approach sorts particles using parallel radix sort, which makes particles in the same grid cell also close in GPU memory. In this way, memory coherence on the CUDAcapable devices can be achieved to some extent. However, it still has the weakness that fluid particles that are spatially close are not necessarily close in GPU memory since grid cells are ordered on the basis of dimensional, dimensional, and dimensional position one after the other. This cannot fully utilize the modern GPU architectures that have a hierarchical memory system which uses caches to work as a fast data store.
To further improve the total memory bandwidth, index sort [26, 28] is used to sort particles so that the memory reads are sequential and coherent per thread. This space filling curve has a good spatial locality feature for GPU caches, leading to a high cachehit rate. However, we observe that the radix sort is done on keys and these keys are the grid cell IDs that we are sorting, and the particle positions in each grid cell are irrelevant. This means that we only need to sort grid cells instead of every particle which can avoid many duplicate key sortings. Based on this observation, we present a new sorting method that performs 1radix sort on grid cells instead of radix sort on digits. To this end, we change the size of the radix to match the number of grid cells. Thus, we can sort on the grid cells directly instead of sorting on the first key, the second key, and so on. Here we use prefix sum to figure out how many fluid particles are in each cell and thus the memory layout in the global memory. To count the number of particles in each grid cell, we use the atomic add operations. After this step, we can have the offsets of a particle from the start of its cell. At this point, we have both the memory location of each grid cell and the offset of each particle, so all particles can be put into the updated location in one step in parallel. In the context of GPUaccelerated SPH fluids, this means that a further performance improvement can be achieved.
4.3. GPU Streaming Pipeline
In this section, we present our GPU streaming pipeline for incompressible fluid simulation with PCISPH (see Figures 2 and 3).
In order to fully exploit the parallel capabilities of modern GPU, we map all physical calculation steps to CUDA kernels and particle data to streaming data representation. As illustrated in Figures 2 and 3, we use the following streaming data representations.(1)Position streams and velocity streams: these streams describe the current state of the fluid under simulation. We use double buffers for each stream; for example, position stream 0 is used to store old positions that are readonly, while position stream 1 represents new positions that are writable. At the end of each physics update, these two streams reverse their role.(2)External force stream: the stream contains the sum of viscosity force, surface tension force, boundary force, gravity force, and other forces except internal pressure force. This external force information will be used to generate predicted position stream for the fluid particles involved in the predictioncorrection loop.(3)Predicted position stream: the stream will be used to generate corrected pressure, predicted density, and density error of each particle.(4)Predicted density stream: the stream will be used to predict density variation and corrected pressure.(5)Corrected pressure stream: the stream will be used to update pressure for the pressure force calculation.(6)Corrected pressure force stream: the stream will be used for the next iteration or the time integration step.(7)Density error stream: the stream will be used to generate loop end flag that will be used to check the exit condition of the predictioncorrection loop.
The different computational stages corresponding to particle sorting, external force calculation, and various steps of predictioncorrection loop and time integration are abstracted as CUDA kernels and performed in parallel on the GPU compute cores.(1)Parallel sorting kernel: we use the method described in Section 4.2 to sort fluid particles at the beginning of each physics update.(2)External force calculation kernel: in this kernel, we accumulate each particle’s viscosity force, surface tension force, and boundary force using its position and velocity information, as well as the gravity force.(3)Position prediction kernel: at the start of each predictioncorrection loop, we use this kernel to predict particle position. The collision with boundaries should be also handled in the kernel.(4)Predicted density and pressure calculation kernel: in this kernel, predicted density is accumulated according to (2) using previously predicted position information.(5)Corrected pressure force calculation kernel: this kernel computes corrected pressure force according to (3) using the predicted density and pressure value.(6)Maximum predicted density calculation kernel: we use parallel reduction in this kernel to get the maximum predicted density which will be used to control the loop end flag.(7)Time integration kernel: we use symplectic Euler integration scheme, which is a semiimplicit method that uses explicit Euler method to update velocity but uses the newly updated velocity instead of the old value to calculate position: where and are the updated velocity and updated position of a fluid particle. is the position, is the time step, is the force, and is the mass of the particle.(8)Indices calculation kernel: at the end of each frame, we update particle index in this kernel that will be used for particle sorting at the beginning of the next frame.
4.4. Optimization Techniques
To achieve a better performance on the GPU, we use several optimization techniques. Doublebuffering techniques combined with texture memory/readonly data cache are used to read and update position and velocity values of fluid particles. To be specific, at each physical computation step, we first obtain previous attributes through CUDA texture memory/readonly data cache from one of the double buffers. Then, we use these data to calculate their new values and write into the other one of the double buffers. After that we switch these two buffers when necessary. For constant parameters such as mass, support radius, viscosity constant, gravity constant, and other constant properties, we utilize CUDA’s constant memory to store them for efficient data access, since constant cache is extremely effective due to the massive reuse of these attributes, increasing the compute to memory access ratio. This also helps to boost global memory bandwidth.
In SPH, the physical attributes at one particle such as density, pressure force, and viscosity force are approximately computed with a set of attributes at its neighboring particles. To take advantage of the fast onchip memory of GPU, shared memory is also used to load data which are shared by neighboring particles, thus reducing the global memory access request. After parallel sorting, these neighboring particle data are most likely close in global memory which further improves performance due to the coalesced memory access pattern when loading data from global memory to shared memory.
As for our shared memory implementation, each CUDA thread in a block is assigned to one fluid particle , and this thread copies all particles in the 27 neighboring grid cells of into this thread’s shared memory. Each CUDA thread outputs the updated attribute of one fluid particle. To save memory, we make use of the existing space as much as possible. For example, we use two arrays to store keyvalue pair for particle’s index value and its index in the array: one is used to store both the input unsorted array and output sorted array; the other one is used as an auxiliary array to allow ping pong computation. This auxiliary array is also used for CUDA block counts generations, which allows for simulating large scenes with more particles with the same hardware.
5. Results
In this section, we give a description of our GPU implementation and make performance comparisons with their CPUbased counterparts on several benchmarks.
5.1. Implementation
We have tested our parallel framework on an Intel Core i73770 with 3.4 GHz and two different commodity GPUs: a NVIDIA GeForce GTX 480 (Fermi architecture with compute capability 2.0) and a NVIDIA GeForce GTX 780 (Kepler architecture with compute capability 3.5). The parameters of these two GPUs are shown in Table 1. For these two NVIDIA GPUs, we used CUDA Toolkit 6.5/Visual Studio 2010/Windows 7 Ultimate 64bit as the development environment.
In our implementation, fluid surface reconstruction technique similar to [5] was used, and solid boundary particles were sampled using [29]. For all simulations, we set the time step to , the fluid particle radius to m, the rest distance between initial particles to , the support radius to , the fluid rest density to , and the density fluctuation threshold to .
5.2. Benchmarks
To test the performance of our framework, we used four different benchmarks. Each benchmark is tested on both the CPU and the GPU.(1)Solid bunny scene: in Figure 1, we simulated a bunny scene, in which water flows around it and finally calms down. This figure demonstrates that our GPU framework can realistically simulate incompressible fluids with complex boundaries in real time.(2)CBD scene: Figure 4 is a classical dambreak benchmark.(3)Bunny drop scene: in Figure 5, we simulated a scenario with a water bunny dropping into the pool.(4)Two dam breaks’ scene: in Figure 6, we simulated a scene, in which two corner breaking dams collide and generate sprays and then converge into a water pool.
(a)
(b)
(a)
(b)
(a)
(b)
For all benchmarks, our framework can simulate incompressible fluids accurately and robustly in real time on the GPU.
5.3. Performance Comparisons
In this section, we compare the performance of our GPU framework with thestateoftheart CPUbased implementations, that is, singlethreaded PCISPH implementation [13] and its multithreaded version [30].
Table 2 highlights the performance of our GPU framework and CPUbased implementations on four different benchmarks from which we can see that the speedups of our GPU implementation on GTX 480 compared with its singlethreaded counterpart were 11.4 for CBD scene, 9.4 for solid bunny, 11.2 for bunny drop, and 6.6 for two dam breaks. These speedups are even higher on GTX 780 which was 19.5 for CBD scene, 18.2 for solid bunny, 23.2 for bunny drop, and 16.8 for two dam breaks.
Figure 7 is the graphical illustration of the performance results presented in Table 2. From this figure, we can notice that our GPUbased incompressible fluid simulation framework works well on both Fermi and Kepler architectures and is also scalable to current Maxwell and future architectures. In other words, our simulation pipeline can be used to exploit the largescale parallel capabilities of current and future GPUs. Since our GPUbased pipeline will benefit from more compute cores and higher memory bandwidth, we can expect higher frame rates on future GPU architectures.
(a) CBD
(b) Solid bunny
(c) Bunny drop
(d) Two dam breaks
(e) Four benchmarks on GTX 780
Figure 8 shows execution time ratios for each step of bunny drop scene, that is, parallel sorting, block generation, external force calculation, predictioncorrection loop, time integration, and index calculation. From the figure, we can see that the predictioncorrection loop is the most timeconsuming stage in the whole simulation pipeline, which occupies of the total computational time. The second most computationally expensive stage is the external force calculation which has a percentage of . These two stages consume the vast majority of computing resources. This is because these two stages contain timeconsuming neighbor search, especially in the predictioncorrection loop which needs to perform neighbor search many times.
6. Conclusion and Future Work
We have presented an efficient GPU pipeline to utilize the inherent parallelism of the PCISPH algorithm to simulate incompressible fluids in real time. We further took advantage of shared memory on the GPU which has lower latency and much higher bandwidth than global memory. Moreover, we proposed an efficient parallel sorting method for SPH fluid simulations which can achieve a high cache hit rate and thus a higher simulation performance. Our parallel framework is able to accurately simulate incompressible fluids in real time on the GPU.
As for our future work, we will extend our pipeline to support multiple GPU system.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank Assistant Professor Dr. Prashant Goswami, Computer Graphics group at Blekinge Institute of Technology, for many insightful discussions. The authors are also thankful to the anonymous reviewers for their helpful comments. This project is supported by the National High Technology Research and Development Program of China (no. 2012AA011503), Project on the Integration of Industry, Education and Research of Guangdong Province (no. 2012B090600008), and the preresearch project (no. 51306050102). The armadillo model and bunny model are courtesy of the Stanford Computer Graphics Lab.
Supplementary Materials
The supplementary video shows the simulation results using our method, which are rendered with OpenGL and PovRay, respectively.