Research Article  Open Access
Shuaizhi Guo, Tianqi Wang, Linfeng Tao, Teng Tian, Zikun Xiang, Xi Jin, "RPRing: A Heterogeneous Multi-FPGA Accelerator", International Journal of Reconfigurable Computing, vol. 2018, Article ID 6784319, 14 pages, 2018. https://doi.org/10.1155/2018/6784319
RPRing: A Heterogeneous Multi-FPGA Accelerator
Abstract
To reduce the cost of designing new specialized FPGA boards as a direct-summation MOND (Modified Newtonian Dynamics) simulator, we propose a new heterogeneous architecture built from existing FPGA boards, called RPring (reconfigurable processor ring). This design can be expanded conveniently with any available FPGA board and requires only low communication bandwidth between FPGA boards. The communication protocol is simple and can be implemented with limited hardware/software resources. To avoid the overall performance loss caused by the slowest board, we build a mathematical model to decompose the workload among FPGAs. The division of workload is based on the logic resources, memory access bandwidth, and communication bandwidth of each FPGA chip. Our accelerator achieves two orders of magnitude speedup compared with a CPU implementation.
1. Introduction
N-body simulations have been widely used in scientific and engineering applications. Problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics require efficient N-body simulation methods [1]. The problem can be described as follows: given the initial positions and velocities of N particles, update their positions and velocities at every time step. Nonetheless, the size of an N-body simulation is always limited by the available computational resources, and the increasing need for larger system simulations requires more efficient computational methods. Many researchers have therefore been interested in faster algorithms for large-scale particle simulation and have invented efficient algorithms, such as the Barnes-Hut algorithm, which reduces the computation complexity to O(N log N), and FMM algorithms, which have a computation complexity of O(N) [2]. However, these algorithms use approximations and are complex to parallelize. The direct-summation N-body algorithm computes the interactions between particles in an accurate way and is quite convenient to parallelize. Moreover, direct summation is a fundamental building block for other algorithms [3], so many high performance direct-summation computational platforms have emerged in recent years. In the modified Newtonian dynamics simulation project of Yunnan Observatories, Chinese Academy of Sciences, we want to build a platform that makes use of all the resources we have in the lab and meets our power and performance demands at the same time.
1.1. Background
Computational solutions for N-body simulation can be categorized as CPU, GPU, ASIC, and FPGA according to the computing unit. These technologies vary in their cost, programming abstraction level, and power consumption [4]. There has been one thorough study on x86-based and PowerPC-based tuning [5] and several papers on GPU cluster implementations [2, 6]. The GRAPE ("GRAvity piPE") project built ASIC-based high performance computing solutions for gravitational force calculations, in which particle interactions were computed by an ASIC chip in the form of a fully pipelined hardwired processor dedicated to gravitational force calculation [7, 8]. Hamada et al. used the Bioler3 system to implement an FPGA-based gravitational force computing accelerator [4]. These studies have shown that CPUs' performance is limited and ASICs offer no advantages; GPUs are competitive in performance and performance per cost; the performance per Watt figure favours FPGAs [4].
1.2. Related Work and Challenge
Figure 1 shows the basic structure of a hardware-accelerated solution for N-body simulations. It consists of a host computer and an acceleration coprocessor for potential calculation. The host computer performs all other calculations, for example, position and velocity updates. More specifically, the coprocessor consists of a number of pipelines and the particle information stored in particle memory. The control/interface unit receives instructions from the host computer and controls the pipelines to acquire particle information from particle memory and calculate the potential [7, 9].
Figure 2 shows how to build a computing cluster with GRAPE. To build a cluster with 16 GRAPE boards, we need one host-interface board (HIB), five network boards (NBs), and 16 processor boards (PBs). Each NB has one uplink and four downlinks. Thus, the 16 PBs are connected to the host computer through a two-level tree network of NBs. The NBs and HIB handle the communication between the PBs and the host computer [9]. As the number of PBs increases, the demand for NBs grows rapidly. This means the interconnection overhead of building a large computing cluster is unacceptable, and the interconnection problem becomes a challenge in building large computing clusters.
1.3. Motivation
We want to work out an accurate numerical computation method based on MOND theory. MOND (Modified Newtonian Dynamics) theory is an alternative to the popular Dark Matter (DM) theory; it successfully explains the distribution of force in an astronomical object from observed distributions of baryonic matter [10].
MOND's numerical algorithm is different from the traditional N-body simulation method, so GRAPE is not suitable for our mission. MOND theory is based on potential calculation and can be described as follows: given the density distribution of baryonic matter ρ_b, figure out the final potential Φ. The final potential is influenced by the distribution of two kinds of matter, ρ_b and ρ_ph, where ρ_b is the density distribution of baryonic matter including stars and gasses and ρ_ph is the density distribution of phantom dark matter, which is the theoretical hypothesis of MOND [11]. The final gravity potential Φ is given by the classical Poisson equation:
∇²Φ = 4πG(ρ_b + ρ_ph). (1)
ρ_ph is given by the Newtonian potential Φ_N as in the following equation:
ρ_ph = (1/4πG) ∇·[ν̃(|∇Φ_N|/a₀) ∇Φ_N], (2)
where ν̃ is the MOND interpolating function and a₀ is the MOND acceleration constant.
Finally, the Newtonian potential Φ_N is given by the classical Poisson equation:
∇²Φ_N = 4πG ρ_b. (3)
Different from the traditional direct-summation N-body algorithm, MOND requires a more time-consuming potential calculation, whose computation complexity is O(N²).
With a limited project budget, we choose to use an FPGA-based direct-summation algorithm. Instead of designing new specialized boards, we reuse existing ones to reduce the overhead. The scale of MOND simulation is limited by the available computational resources. A single FPGA chip does not provide enough logic resources, so a multi-FPGA solution seems to be the only choice. The major contributions of our work are as follows:
To accelerate direct-summation N-body simulation, we propose an extensible heterogeneous multi-FPGA solution called RPring (reconfigurable processor ring) that utilizes multiple existing, different FPGA boards. The experiment shows that our implementation achieves two orders of magnitude speedup compared with a high-end CPU implementation.
To prevent the slowest board from dragging down the overall performance, we propose a model for decomposing the workload among FPGAs and optimizing logic resource allocation. To improve the whole system's performance, this model divides the workload based on the logic resources, memory bandwidth, and communication bandwidth of each FPGA board and allocates logic resources among the potential calculation pipelines, DMA/FIFOs, and other modules.
The remainder of this paper is organized as follows. Section 2 presents background information on the algorithm of MOND theory numerical simulation. The following section presents the architecture of RPring. We then present the model that guides the decomposition of workload and the allocation of logic resources. After that, we present our hardware implementation on heterogeneous multi-FPGA and compare it with implementations on various GPU boards and CPUs as well as ASIC implementations. These results are then discussed before conclusions are drawn.
2. DirectSummation Algorithm
MOND numerical simulation is a variant of N-body simulation; the calculation can be described in the following five steps [11]:
(1) With the known baryonic matter distribution ρ_b, calculate the Newtonian gravity potential Φ_N according to (3).
(2) Calculate the phantom dark matter distribution ρ_ph with (2) by finite difference.
(3) Solve the Poisson equation (1) to get the final potential Φ.
(4) Calculate the acceleration and velocity with the final potential by finite difference.
(5) Calculate the location of each particle in the next time step.
Steps (2), (4), and (5) have a computation complexity of O(N), so using the CPU to do these tasks serially will not hurt performance. Steps (1) and (3) have a computation complexity of O(N²), so we focus on accelerating them. In the direct-summation algorithm, for Steps (1) and (3), which calculate the gravity potential, we can use the solution of the Poisson equation:
Φ(r_i) = −G Σ_{j≠i} m_j / |r_i − r_j|. (4)
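The direct-summation evaluation of each particle's potential can be sketched in a short Python program (an illustrative software model, not the FPGA implementation; the function name is ours, and units are normalized so that G = 1 by default):

```python
import math

def direct_potentials(particles, G=1.0):
    """O(N^2) direct summation of the gravitational potential.

    particles: list of (x, y, z, m) tuples.
    Returns one potential value per particle, per equation (4).
    """
    n = len(particles)
    phi = [0.0] * n
    for i in range(n):
        xi, yi, zi, _ = particles[i]
        for j in range(n):
            if i == j:
                continue  # a particle does not act on itself
            xj, yj, zj, mj = particles[j]
            r = math.sqrt((xi - xj)**2 + (yi - yj)**2 + (zi - zj)**2)
            phi[i] -= G * mj / r
    return phi
```

Each output element is the sum of −G·m_j/r over all other particles, which is exactly the quantity a potential pipeline accumulates.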
Therefore, in the following sections we use FPGAs to construct potential calculation pipelines and propose the RPring solution to build a larger multi-FPGA system. It should be pointed out that this work is not limited to MOND theory numerical simulation; it can be extended conveniently to other direct-summation N-body simulations.
3. Architecture
As (4) shows, the accumulations of the potential acting on each particle by all other particles in the system are mutually independent. We can calculate several particle pairs' potentials simultaneously, so existing ASIC implementations, represented by GRAPE, put several potential calculation pipelines in one chip to find the best degree of parallelism. In Section 1, we analyzed the existing work and found its bottleneck. Now we come to RPring.
3.1. RPRing Solution
Figure 3 illustrates a ring network consisting of FPGA boards and a host computer. Each FPGA board has onboard memory and is connected to the previous/next FPGA board with a cable. Each FPGA chip contains pipelines, DMA, a memory controller, and other modules, all controlled by the protocol controller. Each potential pipeline has two input ports, one connected to the InputFIFO and the other connected to the DMAFIFO. To reuse the local particle information, we fix the data from the InputFIFO, which is received from the previous board, and traverse the local particle information, as Figure 4 shows. In the following paragraphs, we explain RPring's control flow and data flow.
3.1.1. Control Flow
As shown in Figure 3, each FPGA board's control flow has the following features:
(1) Obtain the results from the previous FPGA board and put the data into the InputFIFO.
(2) The DMA reads onboard memory to get the local particle information and puts it into the DMAFIFO.
(3) The pipelines take data from the InputFIFO and DMAFIFO, calculate the potential, and then write the result into the OutputFIFO.
(4) Read data from the OutputFIFO and send it to the next FPGA board through the output connection.
3.1.2. Data Flow
Figure 5 shows the data flow of RPring. The information of the ith particle consists of its location (x_i, y_i, z_i), mass m_i, and potential Φ_i, which is initialized to zero. When the ith particle's information reaches board 0, that FPGA board calculates the interaction between the particle and the local particles of work set 0 and stores the intermediate result in the potential field. When the ith particle's information reaches board 1, that FPGA board calculates the interaction between the particle and the local particles of work set 1 and accumulates the intermediate result in the potential field. Therefore, when the ith particle's information has flowed through the whole ring and returned to the host computer, the potential field holds the final result Φ_i. The host computer can then continue with the follow-up operations.
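The data flow above can be modeled in a few lines of Python (a software sketch of the ring with hypothetical names; each record's potential field accumulates partial sums as it visits every board's work set):

```python
import math

def ring_potentials(work_sets, G=1.0):
    """Model the RPring data flow.

    work_sets: one list of (x, y, z, m) tuples per FPGA board.
    Each particle record carries an accumulating potential field
    through every board's work set and back to the host.
    """
    everyone = [p for ws in work_sets for p in ws]
    results = []
    for (x, y, z, m) in everyone:           # host injects one record at a time
        phi = 0.0                           # potential field starts at zero
        for ws in work_sets:                # record visits each board in turn
            for (xj, yj, zj, mj) in ws:     # interact with the local work set
                r = math.sqrt((x - xj)**2 + (y - yj)**2 + (z - zj)**2)
                if r > 0.0:                 # skip the particle itself
                    phi -= G * mj / r       # accumulate the partial result
        results.append(phi)                 # record returns to the host
    return results
```

After a full trip around the ring, each record's potential equals the O(N²) direct sum over all other particles.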
The RPring solution proposed in this paper avoids the problems mentioned in Section 1. It implements an extensible heterogeneous multi-FPGA solution. Different from GRAPE's tree network, RPring uses a ring topology network. Each FPGA board's onboard memory stores a portion of the particle information. During the process, each board receives particle information from the previous board, calculates the interactions with the local particle information, and finally sends the result to the next board in the ring network. When a particle's information has flowed through all the boards in the ring, the calculation of its interactions with all other particles is finished. The advantages of this solution are as follows:
(1) In RPring, when the whole working set flows through the ring network once, the calculation of interactions is completed. The amount of data that needs to be transported between FPGA boards is reduced, and so is the demand for communication bandwidth.
(2) The ring network topology is simpler than the tree network in the GRAPE cluster. There is no need for additional network boards.
(3) The interconnection protocol is quite simple. It requires little overhead to implement, whether in software or hardware. Thus, we can save more resources for constructing potential calculation pipelines.
3.2. Potential Pipeline
The potential pipeline is designed based on the solution of the Poisson equation, (4). To make each stage of the pipeline have similar latency, (4) is rewritten as
Φ_i = −Σ_{j≠i} G · m_j / sqrt((x_i − x_j)² + (y_i − y_j)² + (z_i − z_j)²). (5)
Figure 6 shows the design of the potential pipeline. According to the complexity of the different operations, the addition, subtraction, and multiplication units are set to the same latency in cycles as the division and square root units. In our design, the coordinates x, y, z, the mass m, and the potential Φ are represented in IEEE 754 floating-point format.
3.3. System Optimization
As discussed above, the ring network may let the slowest board drag down the overall performance, so it is important to balance the time a particle's information spends flowing through each board. The processing capacity of each board varies, but by decomposing the workload according to each board's capacity we can improve the whole system's performance efficiently (see the suitable work sets in Figure 5). The following section discusses this problem in detail with a mathematical model.
4. Model
The purpose of this mathematical model is as follows: given multiple FPGA boards with known parameters, decide how to decompose the workload among them and how to choose the parameters of their potential calculation pipelines so that the whole system's maximum throughput is obtained.
4.1. Symbol Conventions
Assume that the scale of the simulation is N and that we have K FPGA boards. W_i is the workload assigned to the ith FPGA board. It can be seen from RPring's architecture that each board's processing capacity depends on the number of potential calculation pipelines and their operating frequency. Suppose that the ith FPGA board contains p_i pipelines and their operating frequency is f_i; then the performance of the board is
C_i = p_i · f_i.
To maximize the whole system's throughput, we just need to allocate the workload among the FPGA boards in proportion to their processing capacity, that is, W_i / N = p_i f_i / Σ_j p_j f_j, so the problem is converted to how to choose p_i and f_i such that Σ_i p_i f_i is maximized. Additionally, f_i is a function of p_i; that is,
f_i = f_i(p_i).
Therefore, in this model, p_i is the only free variable. Finally, the problem is rewritten as
maximize Σ_i p_i · f_i(p_i).
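Once the pipeline counts and frequencies are fixed, the proportional workload split can be sketched as follows (illustrative Python; the function name and the integer-remainder policy are our own choices, not taken from the paper):

```python
def decompose_workload(n_particles, pipes, freqs):
    """Split N particles among boards in proportion to p_i * f_i.

    pipes[i]: number of potential pipelines on board i (p_i).
    freqs[i]: operating frequency of board i (f_i).
    Returns the per-board workloads W_i, summing to n_particles.
    """
    caps = [p * f for p, f in zip(pipes, freqs)]
    total = sum(caps)
    loads = [n_particles * c // total for c in caps]
    # integer division leaves a remainder; hand it to the fastest board
    loads[caps.index(max(caps))] += n_particles - sum(loads)
    return loads
```

A board with twice the pipeline-frequency product receives twice the particles, so all boards finish a pass of the ring at roughly the same time.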
4.2. Constraint
Furthermore, there are three constraints in this model:
(1) the FPGA logic resource constraint,
(2) the memory access bandwidth constraint,
(3) the communication bandwidth constraint.
FPGA logic resource constraint: in each FPGA, the total logic resource consumption of the FIFOs, DMA, memory controller, input/output interconnection, and potential pipelines must be smaller than the maximum amount of resources the FPGA can provide. Suppose that L_i, F_i, B_i, and D_i are the amounts of LUTs, Flip-Flops, BRAM, and DSP slices that the ith FPGA provides; we use the vector A_i = (L_i, F_i, B_i, D_i) to represent them. R_fifo,i is the resource consumption of the ith board's FIFOs, and so on. Then we have
R_fifo,i + R_dma,i + R_mem,i + R_io,i + R_pipe,i ≤ A_i.
Apparently, R_fifo,i, R_dma,i, R_mem,i, R_io,i, and R_pipe,i depend on the data bandwidth the pipelines need and can be seen as functions of p_i.
Memory access bandwidth constraint: the input data of the potential pipelines come from the previous board's results and the local memory's particle information. We fix the data from the previous board and traverse the local particle information, so half of the potential pipelines' input bandwidth is borne by the memory access bandwidth. That is to say,
B_pipe,i / 2 ≤ B_mem,i. (12)
In (12), B_mem,i is the maximum memory access bandwidth of the ith board, and B_pipe,i is the sum of the input data bandwidths of all the potential pipelines in the ith board. Apparently, B_pipe,i is proportional to the number of potential pipelines, that is, B_pipe,i = c · p_i · f_i, so (12) can be written as c · p_i · f_i / 2 ≤ B_mem,i, where c is a constant.
Communication bandwidth constraint: as described in the memory access bandwidth constraint, the other half of the pipelines' input bandwidth is fed from the ring link, and each fixed particle received from the previous board is reused against many local particles, so the ith board's communication bandwidth to the previous board and to the next board should be greater than half of B_pipe,i divided by the reuse count of each fixed particle.
In conclusion, given the target function and the three constraints, when the parameters of the boards and the required functional relations are known, solving the optimization problem guides how to decompose the workload among the boards and how to choose the parameters of their potential calculation pipelines.
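Because the workload is split in proportion to each board's capacity, each board's pipeline count can be maximized independently. A brute-force search can sketch this optimization (a simplified Python model under assumed inputs; the dictionary keys and the cost model are hypothetical, and the communication constraint is omitted for brevity since it takes the same form as the memory constraint):

```python
def best_pipeline_counts(boards, max_pipes=64):
    """Brute-force search for pipeline counts maximizing throughput.

    boards: list of dicts with hypothetical keys:
      'freq'      -- f_i(p_i), mapping pipeline count to frequency
      'resources' -- (LUT, FF, BRAM, DSP) available on the chip
      'pipe_cost' -- (LUT, FF, BRAM, DSP) consumed per pipeline
      'fixed'     -- (LUT, FF, BRAM, DSP) for FIFO/DMA/controller/IO
      'mem_bw'    -- memory bandwidth budget, expressed as the
                     maximum pipeline count it can feed
    Returns the chosen p_i per board.
    """
    choice = []
    for b in boards:
        best_p, best_rate = 0, 0.0
        for p in range(1, max_pipes + 1):
            used = [f + p * c for f, c in zip(b['fixed'], b['pipe_cost'])]
            if any(u > avail for u, avail in zip(used, b['resources'])):
                break                      # logic resource constraint
            if p > b['mem_bw']:
                break                      # memory access bandwidth constraint
            rate = p * b['freq'](p)        # board throughput p_i * f_i(p_i)
            if rate > best_rate:
                best_p, best_rate = p, rate
        choice.append(best_p)
    return choice
```

The search space is tiny (tens of pipeline counts per board), so exhaustive enumeration is entirely adequate here.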
5. Implementation
In this section, we will demonstrate our implementation under RPring solution and its performance parameters in our MOND theory numerical simulation project.
Table 1 shows our existing boards and their parameters. The experiment demonstrates that, under the RPring solution, we can
(1) select software or hardware to implement the interconnection protocol based on each board's features,
(2) choose different interconnection media according to each board's resources.

Thus, this solution has good flexibility and scalability and is compatible with heterogeneous multi-FPGA systems.
In Table 1, Jetson TK1 is an NVIDIA application processor board, which is used as the host computer. Zedboard, KC705, and XUPV5 are Xilinx evaluation kits for the Zynq-7000, Kintex-7, and Virtex-5. Gemini1 is an FPGA board of our own design, used for prototyping. Figure 7 shows the top-level structure of Gemini1. Gemini1 has two XC6VLX365T Virtex-6 FPGAs connected through PCB traces. Each Virtex-6 FPGA chip has SMA connectors to transport data.
5.1. Topology
Figure 8 shows the connections between the boards and between chips. In the ring network, the Tegra K1 is used as the host computer, and the XC7Z020, XC7K325T, XC5VLX110T, and XC6VLX365T are connected through Ethernet, SMA cable, or PCB trace. Figure 9 is a photograph of the actual system.
5.2. Data Structure
The RPring particle information format is shown in Figure 10. The location, mass, and potential fields store the particle's three-dimensional coordinates, mass, and the provisional potential result, while the Tag field records which FPGA boards the particle information has passed. Once the information passes an FPGA board, the corresponding bit in the Tag field is set. When all of the bits are set, the potential field contains the final result.
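A software model of this record format might look as follows (Python sketch; the field and function names are hypothetical, and NUM_BOARDS stands for the ring length):

```python
NUM_BOARDS = 4  # boards in the ring, one Tag bit each

def make_record(x, y, z, m):
    """Particle record: location, mass, provisional potential, board tag."""
    return {'loc': (x, y, z), 'mass': m, 'phi': 0.0, 'tag': 0}

def visit_board(rec, board_id, partial_phi):
    """A board adds its partial potential and sets its bit in the Tag field."""
    rec['phi'] += partial_phi
    rec['tag'] |= 1 << board_id
    return rec

def is_final(rec):
    """The potential is final once every board's tag bit is set."""
    return rec['tag'] == (1 << NUM_BOARDS) - 1
```

The bitmask check is exactly the "all bits set" condition the host uses to decide that a returning record carries the completed potential.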
5.3. Protocol
5.3.1. Software Implementation
For an FPGA with an integrated CPU, like the Zynq-7000, the interconnection protocol can be implemented in software, as Figure 11 shows. Figure 11 shows its system architecture: multiple pipelines are instantiated in the FPGA; the two input ports of each pipeline are connected to the InputFIFO and DMAFIFO; the output port of each pipeline is connected to the OutputFIFO. The CPU creates three processes: the Input Process, Computing Process, and Output Process.
Input Process. Controls the Gigabit Ethernet to receive particle information from the previous board and stores it in the InputFIFO. The Input Process also handles retransmission, InputFIFO fullness, and so on.
Computing Process. Controls the DMA to traverse the local particle information in DDR3 and controls the pipelines to compute the potential based on the particle information in the InputFIFO and the local information in the DMAFIFO, then writes the result to the OutputFIFO.
Output Process. Controls the Gigabit Ethernet to send the results in the OutputFIFO to the next board. The Output Process also handles retransmission, the next board's InputFIFO fullness, and so on.
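The three processes can be emulated with Python threads and bounded queues standing in for the FIFOs (a behavioral sketch only; all names are hypothetical, and the Ethernet links are reduced to plain lists):

```python
import queue
import threading

input_fifo = queue.Queue(maxsize=64)   # InputFIFO, fed from the previous board
output_fifo = queue.Queue(maxsize=64)  # OutputFIFO, drained toward the next board

def input_process(rx):
    """Receive records from the previous board into the InputFIFO.
    Blocking put() models back-pressure when the FIFO is full."""
    for rec in rx:
        input_fifo.put(rec)
    input_fifo.put(None)               # end-of-stream marker

def computing_process(local):
    """Combine each incoming record with the local work set."""
    while True:
        rec = input_fifo.get()
        if rec is None:
            output_fifo.put(None)
            break
        x, y, z, m, phi = rec
        for (xj, yj, zj, mj) in local:
            r = ((x - xj)**2 + (y - yj)**2 + (z - zj)**2) ** 0.5
            if r > 0.0:
                phi -= mj / r          # accumulate the partial potential
        output_fifo.put((x, y, z, m, phi))

def output_process(tx):
    """Send finished records to the next board."""
    while True:
        rec = output_fifo.get()
        if rec is None:
            break
        tx.append(rec)
```

Running the three functions as three threads reproduces the pipeline-parallel structure: reception, computation, and transmission overlap, and the bounded queues throttle whichever stage runs ahead.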
5.3.2. Hardware Implementation
For an FPGA without an integrated CPU, like the XC7K325T, the interconnection protocol can be implemented in hardware, as Figure 12 shows. Because the interconnection medium can be Ethernet cable, SMA cable, or PCB trace, the input/output connection can be an Ethernet controller, SMA controller, or SelectIO controller. In Figure 12, a protocol FSM replaces the role of the CPU: it controls the input connection to receive data from the previous board and store the messages in the InputFIFO, controls the DMA to traverse the local data and the pipelines to finish the calculation, and controls the output connection to send the results in the OutputFIFO to the next board.
6. Experimental Result
6.1. Logical Resource Consumption
Table 2 shows the resource utilization of each FPGA board. In particular, the Zynq FPGA has a hard DDR3 controller and Gigabit Ethernet controller, and the Virtex-5 FPGA has a hard DDR2 controller and MAC, so these modules do not require extra logic resources. Virtex-5 has no existing DMA IP, so we designed a simplified version.

6.2. Communication Bandwidth Consumption
To measure the communication bandwidth consumption between boards, we add counters to record the data traffic on the interconnections. Figure 13 shows the counters in the system and the interconnections' notation. These counters update as each packet passes through the path. At the end of the whole experiment, we read out the value of each counter and divide it by the calculation time. In this way, we can confirm that no data is lost in communication and report the specific throughput of each FPGA board.
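Converting a counter reading into bandwidth is then a single division (Python sketch; the 32-byte record size in the example is an assumption for illustration, not a figure from the paper):

```python
def link_bandwidth_mb_s(byte_counter, elapsed_ms):
    """Convert a raw byte-counter reading and the run time into MB/s."""
    return (byte_counter / 1e6) / (elapsed_ms / 1e3)

# Hypothetical example: if each of 131,072 particle records were 32 bytes
# and crossed a link once during a 1297.113 ms run, that link would carry
# roughly 3.2 MB/s.
bw = link_bandwidth_mb_s(131072 * 32, 1297.113)
```

Such a figure is far below Gigabit Ethernet's capacity, which is consistent with the low bandwidth demand the ring topology is designed for.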
Table 3 shows the communication bandwidth consumption while calculating 131,072 particles' potentials, which takes 1297.113 ms. Because of the solution's ring topology, the consumption of communication bandwidth is very low.

6.3. Performance Comparison
Table 4 shows the number of potential pipelines in each FPGA board, their operating frequencies, and their performance. The conversion between GFlops and MPair/s is based on Atsushi Kawai's work [16]. Based on these results, we calculate the whole system's theoretical parameters. Finally, we list the system's experimental results.

6.3.1. Comparison with Software
We choose CPU and GPU solutions as the control groups for our work. Fabian's RAMSES code is a widely used method for MOND simulation [11]. Therefore, we choose their method as the reference software implementation and use an Intel Xeon E5-2660 to run tests at different scales. We also use RAMSES's QUMOND method together with Nitin's direct N-body kernels to simulate the same work set on NVIDIA's Tesla K80 GPU. Figure 14 shows the time taken to calculate the potential on the different platforms. Thanks to the structure of the FPGA, RPring is faster than the other solutions at small scales. At the right end, the NVIDIA results become better than RPring's because, as the number of particles increases, our platform approaches its theoretical maximum performance (503.5 of 620.1 GFlops), while the Tesla K80 has not reached its limits. The CPU solution takes much more time than RPring and the GPU, limited by its poor parallel computing capability. When simulating a system with 131,072 particles, our work is 193 times faster than the Xeon E5-2660 CPU and achieves performance similar to the Tesla K80.
6.3.2. Comparison with Hardware
Table 5 shows a number of hardware implementations of N-body simulation. Makino's GRAPE-4 cluster uses 1692 pipelines and achieves 1.08 TFlops [7]. He also built a compute cluster with GRAPE-6 chips whose peak performance is 1.349 TFlops [9]. Furthermore, his GRAPE-8 achieves 960 GFlops with just one board [8]. Reference [14] showed a GPU solution whose performance is 781 GFlops.
Some FPGA solutions are also listed in Table 5. Lienhart et al. use FPGAs to achieve 3.9 GFlops [12]. Spurzem et al.'s solution reaches 4.3 GFlops [13], Hamada et al.'s work reaches 324.2 GFlops with a board carrying 5 FPGA chips [4], and Sozzo et al.'s work delivers 46.55 GFlops [15].
7. Conclusion and Discussion
In this paper, we proposed an extensible solution, RPring, for heterogeneous multi-FPGA-based direct-summation N-body simulation, together with a model to decompose the workload among the FPGAs. RPring uses existing FPGA boards rather than newly designed specialized boards to reduce cost. The solution can be expanded conveniently with any heterogeneous FPGA boards, and the communication bandwidth requirement is quite low, so the communication protocol can be simple and consume few resources. The model considers the constraints of FPGA logic resources, memory access bandwidth, and communication bandwidth to divide the workload reasonably and optimize the whole system's performance. We also built a heterogeneous multi-FPGA system based on RPring and used it for MOND theory numerical simulation. The experimental results show that the low-cost multi-FPGA system is 193 times faster than a high-end CPU implementation and achieves performance similar to a high performance GPU.
Disclosure
An earlier version of this work was presented as a poster at 2016 IEEE 24th Annual International Symposium on FieldProgrammable Custom Computing Machines (FCCM).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was sponsored by Huawei Innovation Research Program (YB2015090102); support from Huawei Technologies Co., Ltd., is gratefully acknowledged.
References
[1] S. Bhatt, M. Chen, C.-Y. Lin et al., "Abstractions for parallel N-body simulations," in Proceedings of the Scalable High Performance Computing Conference (SHPCC-92), pp. 38–45, IEEE.
[2] T. Hamada, T. Narumi, R. Yokota, K. Yasuoka, K. Nitadori, and M. Taiji, "42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), New York, NY, USA, November 2009.
[3] S. Harfst, A. Gualandris, D. Merritt, R. Spurzem, S. P. Zwart, and P. Berczik, "Performance analysis of direct N-body algorithms on special-purpose supercomputers," New Astronomy, vol. 12, no. 5, pp. 357–377, 2007.
[4] T. Hamada, K. Benkrid, K. Nitadori, and M. Taiji, "A comparative study on ASIC, FPGAs, GPUs and general purpose processors in the O(N^2) gravitational N-body simulation," in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS '09), pp. 447–452, San Francisco, Calif, USA, July 2009.
[5] N. Arora, A. Shringarpure, and R. W. Vuduc, "Direct N-body kernels for multicore platforms," in Proceedings of the 38th International Conference on Parallel Processing (ICPP 2009), pp. 379–387, Austria, September 2009.
[6] I. Zecena, M. Burtscher, T. Jin, and Z. Zong, "Evaluating the performance and energy efficiency of N-body codes on multi-core CPUs and GPUs," in Proceedings of the 2013 IEEE 32nd International Performance Computing and Communications Conference (IPCCC 2013), USA, December 2013.
[7] J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto, "GRAPE-4: a massively parallel special-purpose computer for collisional N-body simulations," The Astrophysical Journal, vol. 480, no. 1, pp. 432–446, 1997.
[8] J. Makino and H. Daisaka, "GRAPE-8—an accelerator for gravitational N-body simulation with 20.5 Gflops/W performance," in Proceedings of the 24th International Conference for High Performance Computing, Networking, Storage and Analysis (SC '12), Salt Lake City, Utah, USA, November 2012.
[9] J. Makino, T. Fukushige, and M. Koga, "A 1.349 Tflops simulation of black holes in a galactic center on GRAPE-6," in Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, pp. 43–43, Dallas, Tex, USA, November 2000.
[10] B. Famaey and S. S. McGaugh, "Modified Newtonian dynamics (MOND): observational phenomenology and relativistic extensions," Living Reviews in Relativity, vol. 15, article 10, 2012.
[11] F. Lüghausen, B. Famaey, and P. Kroupa, "Phantom of RAMSES (POR): a new Milgromian dynamics N-body code," Canadian Journal of Physics, vol. 93, no. 2, pp. 232–241, 2014.
[12] G. Lienhart, A. Kugel, and R. Männer, "Using floating-point arithmetic on FPGAs to accelerate scientific N-body simulations," in Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2002), pp. 182–194, USA, April 2002.
[13] R. Spurzem, P. Berczik, G. Marcus et al., "Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU)," Computer Science - Research and Development, vol. 23, no. 3-4, pp. 231–239, 2009.
[14] A. Castelló, R. Mayo, and J. Planas, "Exploiting task-parallelism on GPU clusters via OmpSs and rCUDA virtualization," in Proceedings of the IEEE Trustcom/BigDataSE/ISPA, pp. 160–165, 2015.
[15] E. D. Sozzo, L. D. Tucci, and M. D. Santambrogio, "A highly scalable and efficient parallel design of N-body simulation on FPGA," in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017), pp. 241–246, USA, June 2017.
[16] A. Kawai and T. Fukushige, "$158/GFLOPS astrophysical N-body simulation with reconfigurable add-in card and hierarchical tree algorithm," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), 2006.
Copyright
Copyright © 2018 Shuaizhi Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.