Scientific Programming

Volume 2016 (2016), Article ID 2360492, 14 pages

http://dx.doi.org/10.1155/2016/2360492

## A New Parallel Method for Binary Black Hole Simulations

^{1}Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China^{2}Institute of Applied Mathematics, Academy of Mathematics and Systems Science Chinese Academy of Sciences, Beijing 100190, China^{3}Center for Computation & Technology, Louisiana State University, Baton Rouge, LA 70803, USA^{4}School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

Received 1 January 2016; Revised 25 May 2016; Accepted 6 June 2016

Academic Editor: Bormin Huang

Copyright © 2016 Quan Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Simulating binary black hole (BBH) systems are a computationally intensive problem and it can lead to great scientific discovery. How to explore more parallelism to take advantage of the large number of computing resources of modern supercomputers is the key to achieve high performance for BBH simulations. In this paper, we propose a scalable MPM (Mesh based Parallel Method) which can explore both the inter- and intramesh level parallelism to improve the performance of BBH simulation. At the same time, we also leverage GPU to accelerate the performance. Different kinds of performance tests are conducted on Blue Waters. Compared with the existing method, our MPM can improve the performance from 5x speedup (compared with the normalized speed of 32 MPI processes) to 8x speedup. For the GPU accelerated version, our MPM can improve the performance from 12x speedup to 28x speedup. Experimental results also show that when only enough CPU computing resource or limited GPU computing resource is available, our MPM can employ two special scheduling mechanisms to achieve better performance. Furthermore, our scalable GPU acceleration MPM can achieve almost ideal weak scaling up to 2048 GPU computing nodes which enables our software to handle even larger BBH simulations efficiently.

#### 1. Introduction

The latest supercomputers [1] have significantly increasing number of computing nodes/cores, but many practical applications cannot achieve better performance on more computing resources because enough parallelism in the applications has not been explored. At the same time, many applications cannot take advantage of supercomputers equipped with accelerator such as GPU [2, 3] because the existing codes cannot run on GPU directly. We focus on the two problems and propose an efficient method to improve the performance of one kind of challenging scientific application (binary black hole simulations) on one typical large scale supercomputer, Blue Waters.

A binary black hole (BBH) system has two black holes in close orbit around each other. For the inspiralling BBH, the two black holes will move around each other. The number of orbits means the number of circles the black holes move. Mass ratio means the ratio of the mass of small black hole to the mass of the big black hole. BBH systems are important because they would be the strongest known gravitational wave [4] source in the universe. The most challenging and important subject in numerical relativity now is the simulation of these BBH systems. For gravitational wave detection, theoretical model for gravitational wave sources is essential for experimental data analysis. As an example, the theoretical model for BBH played an important role in GW150914 detection [5]. For ground based detectors of gravitational wave, such as LIGO, VERGO, GEO600, and KAGRA [6], the almost equal mass BBHs are the most important gravitational wave sources. For the space-based detectors, such as eLISA [7], the gravitational wave sources will include many large mass ratio BBH systems. When the mass ratio of BBH increases, the computational cost increases dramatically, roughly proportional to the fourth power of the mass ratio. Currently, the upper limit for the mass ratio of BBH which can be simulated by numerical relativity is 100. So, for the future space-based detection of gravitational wave, improving the computational ability is quite important.

Adaptive mesh refinement (AMR) [8] is widely used in BBH simulation because of its simplicity and great reduction in total computing work. It is a natural method to divide each mesh into many submeshes and execute the submeshes of the same level in parallel to improve the simulation speed. We call it Submesh based Parallel Method (SPM) in this paper. Halo zones are needed for each submesh to keep the data from its neighbors. SPM can really achieve very good parallel performance when it only uses limited computing resources. But if the time saving by parallel executing the smaller submeshes cannot compensate the communication overhead to fill the halo zones, SPM cannot scale to more computing resources to achieve better performance. The size of submeshes cannot be too small because of the increasing communication overhead and the size of refined mesh cannot be very large because of the significant increase in total computation, so the number of parallel submeshes () cannot be too large. The latest supercomputers have more and more computing nodes/cores, but the SPM cannot take advantage of more computing resources to further improve its performance. When we try to handle some challenging BBH simulations with more orbits (≥20) and larger mass ratio (≥400), more parallelism must be explored to significantly improve the simulation performance to conduct the simulations in reasonable time.

We propose a novel Mesh based Parallel Method (MPM) to explore both the inter- and intramesh parallelism in BBH simulations. MPM will explore the parallelism among different mesh levels first (intermesh parallelism). Then, it employs SPM to explore the parallelism in each mesh (intramesh parallelism). MPM has two advantages. On one hand, for given submesh size, MPM can provide much more parallelism than SPM does. On the other hand, for given number of parallel tasks, MPM can assign more data and computation to each parallel task than SPM does. In other words, the computation-to-communication ratio of MPM can be larger than the SPM. So MPM has higher parallel efficiency than the SPM does.

We develop a new mesh partitioning algorithm to optimize the load and communication among all the parallel tasks. Our mesh partitioning algorithm is employed into our MPM to achieve balanced load and reduce communication as much as possible. In order to take advantage of the GPU capability to improve the performance, we rewrite the most computationally intensive solver codes in BBH simulations on GPU and employ different kind of GPU codes optimization methods (employing coalesced memory access and shared memory, reducing data copy between CPU and GPU, and removing unnecessary synchronization) to improve its performance.

Different kinds of experiments on performance evaluation have been conducted on the large scale Blue Waters supercomputer at National Center for Supercomputing Applications (NCSA). The experimental results show that significant performance improvement can be achieved with our scalable MPM. Because the sequential code is too slow, we select the performance of 32 MPI processes with SPM algorithm as the baseline. The existing method can achieve at most 5x speedup; our MPM can achieve 8x speedup. After we port and optimize the code on GPU, the existing method can achieve at most 12x speedup; our scalable MPM can achieve about 28x speedup. Furthermore, our scalable MPM + GPU acceleration method can achieve almost ideal weak scaling up to 2048 GPU computing nodes. This means that our method can handle very large problem on large number of computing resources efficiently.

The major contribution of our work lies in two aspects: First, the proposed MPM can enable the BBH simulation to take advantage of more computing resources of supercomputers to achieve better performance. Second, the proposed MPM has been enabled by GPU to further improve the performance of BBH simulation.

#### 2. Problem Description

We will briefly introduce the BBH simulation problem and the numerical method used to solve the problem first. Then, a special mesh refinement method used in BBH simulation is given.

##### 2.1. Equations for the Problem

In order to simulate BBH systems in general relativity, the basic idea is to solve Einstein’s equations [9]. We adopt BSSN (Baumgarte-Shapiro-Shibata-Nakamura) [10] formalism which is widely accepted by the numerical relativity community. The BSSN formalism is a conformal-traceless “” formulation of the Einstein equations. In this formalism, the space-time is decomposed into three-dimensional spacelike slices, described by three-metric ; its embedding in the four-dimensional space-time is specified by extrinsic curvature and the variables, lapse , and shift vector , which specify a coordinate system. is the gravitational constant, and is the speed of light. In numerical relativity, geometrical units are usually adopted which lead . In this paper, we follow the notations of [11]. The related equation description about the problem is as follows:Here, , , , and are source terms which come from matter. For a vacuum space-time, . In the above evolution equations, is the covariant derivative associated with three-metric , and “TF” indicates the trace-free part of tensor objects. For the time evolution, we use th-order Runge-Kutta (for short “RK” in the following parts) method. Spatial derivatives are computed using th-order finite differencing. The advection terms are approximated with an upwind finite differencing stencil, and other derivative terms are approximated with a centered stencil. More details regarding the equations and implementations can be found in [12, 13].

##### 2.2. Mesh Refinement Method

Mesh refinement is a technique which allows a simulation to spend more time on the parts of the domain which are more interesting. The Berger-Oliger mesh refinement (or AMR) algorithm [8] is widely used in BBH simulations. We use Fixed Mesh Refinement (FMR) in this work instead of AMR because of the following two facts: FMR will not introduce more computation than AMR in BBH simulations with the same accuracy; FMR is easy to achieve load balancing and it is very critical to improve the performance in large scale parallel computing. The subcycling technology is also introduced because it can further reduce the total simulation steps significantly. For our FMR method with subcycling, the closer to the black hole, the finer the mesh, and the smaller the time step (typical in half), the more the evolution steps, but all the meshes at different levels will have the same grid points.

Figure 1 is an example of our FMR method with subcycling. For simplicity, we only show three mesh levels. They are the coarsest Mesh 0, the refined Mesh 1, and the finest Mesh 2. The finer mesh level will be closer to the black hole, but the coarser mesh level will cover larger region. All the mesh levels will have the same number of grid points. When Mesh 0 executes one simulation step with time step , Mesh 1 will need to execute two simulation steps to catch up with Mesh 0 because its time step () is half of Mesh 0 and so on for the following finer mesh levels. After each simulation step, the mesh should synchronize with its neighbor mesh levels to exchange the boundary data or get the updated data covered by the refined mesh level.