The objective of this study is to evaluate the performance of Intel Xeon Phi hardware accelerators for Geant4 simulations, especially for multithreaded applications. We present the complete methodology to guide users through the compilation of their Geant4 applications on Phi processors. Then, we propose a series of benchmarks to compare the performance of Xeon CPUs and Phi processors for a Geant4 example dedicated to the simulation of electron dose point kernels, the TestEm12 example. First, we compare a distributed execution of a sequential version of the Geant4 example on both architectures before evaluating the multithreaded version of the Geant4 example. Although Phi processors demonstrated their ability to reduce computing time (by up to a factor of 3.83) when distributing sequential Geant4 simulations, we do not reach the same level of speedup with the multithreaded version of the Geant4 example.

1. Introduction

Monte Carlo simulations have become an indispensable tool for radiation transport calculations in a great variety of applications. The Geant4 simulation toolkit [1] has come into widespread use in the field of high energy physics for simulating detectors in the Large Hadron Collider (LHC) experiments and also in the field of medical physics for diagnostic applications (e.g., analysis of the radiation burden to patients), for therapeutic applications (e.g., treatment planning in radionuclide therapy and treatment verification through emission scans), and for external beam therapy, especially in emerging areas such as proton and light-ion therapy [2–4]. The toolkit has attracted increasing interest because of its great versatility. It contains a comprehensive range of physics models for electromagnetic, hadronic, and optical interactions of a large set of particles over a wide energy range. It furthermore offers a diversity of tools for defining or importing the problem geometry, for modeling complex radiation sources and detection systems, including electromagnetic fields, electronic detector responses, and time dependencies, and for exporting the required output data. The code is continuously being improved and extended with new functionalities. Since release 10.0, a multithreaded version of Geant4 makes it possible to run simulations on different threads at the event level. Given this new capability and the market entry of the novel Intel Xeon Phi hardware accelerators [5, 6], it was relevant to evaluate the computing time performance of Geant4 10.0 on these new processors.

Some authors [7–11] have already compared the performance of Intel Xeon Phi accelerators with GP-GPU (General Purpose Graphical Processing Unit) architectures; these studies converge on the need for careful and optimized programming for Intel Phi in order to achieve performance close to that of GPUs. It should be noted that porting codes to GP-GPU is not always feasible and, when it is, it demands more programming effort than for Xeon Phi hardware accelerators. Indeed, for GP-GPU, highly parallel code sections first have to be identified in order to be fully rewritten using the CUDA language [12, 13]. For Xeon Phi hardware, simply rebuilding the code with correct memory usage is enough to use it. Bernaschi et al. [7] evaluated CPU Sandy Bridge, Kepler GPU, and Xeon Phi processors for a simulation in physics. They heavily tuned their application in order to obtain good performance. For Xeon Phi hardware accelerators (5110P), they particularly worked on the memory management of the inputs and outputs used by their application and on the tuning of the C source code. When no changes are made to the application, they found that 1 GPGPU NVIDIA Kepler K20 is equivalent to 1 Phi and to 5 CPUs (Intel E5-2687W), while with strong scaling (using MPI parallelization with asynchronous communication primitives to overlap data exchanges with the computations) 1 GPU is equivalent to 1.5 to 3.3 Phi. Lyakh [8] is more nuanced regarding this kind of comparison, pointing out the difficulty of scaling an application correctly on each architecture. Nevertheless, he obtained a steady x2-x3 speedup on GP-GPU architecture (Tesla K20X) compared to Phi (5110P). For Murano et al. [9], GPUs (NVIDIA Kepler K20) are consistently faster (between 12 and 27 times faster) than Xeon Phi hardware accelerators (5110P) without any tuning of the code (but using vectorization and adapted memory management).
Other authors [14] reported unsatisfactory performance when using Intel Xeon Phi accelerators for a medical imaging benchmark using a simple data structure and massive parallelism. In a recent study [15], we investigated the performance of the 7120P Intel Xeon Phi by distributing a memory-bound Geant4 (version 9) simulation of the tracking of muons through a volcano in order to produce tomographic images. A maximum speedup of 2.56x was obtained when compared to calculations performed on a Sandy Bridge processor at 3.1 GHz.

In this paper, we describe in detail the methodology followed for the compilation of the Geant4 software and its dependencies, in order to guide any user wishing to exploit the expected computing potential of Xeon Phi for time-consuming simulations. Then, we propose as benchmarks a set of computing tests performed with the Geant4 extended example “TestEm12” in order to assess the suitability of the Xeon Phi architecture for such simulations.

2. Portability of Geant4 Software on Intel Xeon Phi Coprocessor

2.1. Xeon Phi Cluster Characteristics

The simulations have been performed on a machine having two Intel Xeon CPUs E5-2690v2 (http://ark.intel.com/) (10 cores, 12 MB Cache, 3.001 GHz, Hyper Threading, 2 threads per core), each capable of a theoretical 240.1 GFLOPS of double precision floating point instructions with 59.7 GB/sec memory bandwidth, 128 GB of memory, and four Xeon Phi hardware accelerators 5110P (60 cores, 1.052 GHz, 4 hardware threads per core), each capable of a theoretical 1010.5 GFLOPS of double precision floating point instructions with 320 GB/sec memory bandwidth and 8 GB of memory. If we consider the announced performance in GFLOPS, the maximum speedup in favor of Xeon Phi should be 4.2x. The Turbo frequency (reaching 3.6 GHz) was kept disabled to ensure the reproducibility of the benchmark computing times. The version of the dedicated Linux distribution is for Xeon Phi specific architecture and 2.6.32-504.1.3.el6.x86_64 (CentOS 6.6) for Xeon CPU. In the rest of the paper, we simply refer to “Xeon” for the Intel Xeon CPU and to “Phi” for the Xeon Phi hardware accelerator.
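The 4.2x figure follows directly from the quoted peak values. As a minimal sketch (the variable and function names are illustrative, not part of the study), the ratio can be checked as follows:

```python
# Peak figures quoted above for one E5-2690v2 CPU and one Xeon Phi 5110P.
XEON_PEAK_GFLOPS = 240.1
PHI_PEAK_GFLOPS = 1010.5
XEON_BANDWIDTH_GBS = 59.7
PHI_BANDWIDTH_GBS = 320.0


def ratio(phi_value: float, xeon_value: float) -> float:
    """Return how many times larger the Phi figure is than the Xeon figure."""
    return phi_value / xeon_value


peak_speedup = ratio(PHI_PEAK_GFLOPS, XEON_PEAK_GFLOPS)
bandwidth_ratio = ratio(PHI_BANDWIDTH_GBS, XEON_BANDWIDTH_GBS)

print(f"peak GFLOPS ratio (Phi/Xeon): {peak_speedup:.2f}")
print(f"memory bandwidth ratio (Phi/Xeon): {bandwidth_ratio:.2f}")
```

This ratio is only an upper bound derived from the announced peak performance; it says nothing about how a given application actually scales.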

2.2. Cross-Compilation of Geant4 Software and Dependencies

The present work was performed with version 10.0p01 of Geant4 [1] making use of CLHEP [16] libraries version, in order to handle pseudorandom numbers, numerics, points, vectors, planes, their transformations in 3D space, and Lorentz vectors and their transformations in simulations. The Xerces 3.1.1 and Expat 2.1.0 libraries, used to manage XML files, are also necessary to run Geant4 correctly. The Qt library used for the OpenGL visualization has not been compiled since the visualization option is not recommended when using a parallel architecture for large computational tests (see Figure 1). Source codes were compiled with the Intel C++ Composer XE 2013 compiler set to the O2 optimization level rather than O3, in order to avoid introducing uncontrolled bias when executing the codes, as specified in the reference guide for Intel compilers (Quick-Reference Guide to Optimization with Intel Compilers version 12; retrieved from https://software.intel.com/sites/default/files/compiler_qrg12.pdf). The CMake build system used is version 2.8.

Each compilation process (native or cross-compilation) has to be run on a Xeon (x86_64) processor using the Intel C Compiler (ICC) version 14.0.3 (compliant with GCC 4.8.1) in order to be later launched on the Phi coprocessor architecture. For CMake builds, the cross-compilation for Xeon Phi accelerators is activated by passing the compiler flag “-mmic” through the following CMake options:
(i) -DCMAKE_CXX_FLAGS,
(ii) -DCMAKE_C_FLAGS,
(iii) -DCMAKE_EXE_LINKER_FLAGS,
(iv) -DCMAKE_MODULE_LINKER_FLAGS,
(v) -DCMAKE_SHARED_LINKER_FLAGS.
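Because the same flag has to be repeated across several CMake options, it can be convenient to generate the argument list programmatically. The following sketch is only an illustration: the helper name `mic_cmake_args` is hypothetical, and the resulting list would still need to be combined with a `cmake` call and a source directory.

```python
from typing import List

MIC_FLAG = "-mmic"

# CMake cache variables that must all carry the -mmic flag.
CMAKE_FLAG_OPTIONS = [
    "CMAKE_CXX_FLAGS",
    "CMAKE_C_FLAGS",
    "CMAKE_EXE_LINKER_FLAGS",
    "CMAKE_MODULE_LINKER_FLAGS",
    "CMAKE_SHARED_LINKER_FLAGS",
]


def mic_cmake_args(extra_flags: str = "") -> List[str]:
    """Build one -D<option>=<flags> argument per option, each ending with -mmic."""
    flags = f"{extra_flags} {MIC_FLAG}".strip()
    return [f"-D{option}={flags}" for option in CMAKE_FLAG_OPTIONS]


args = mic_cmake_args("-ansi -fp-model precise")
for arg in args:
    print(arg)
```

When the list is passed to a process launcher as separate arguments, no additional shell quoting is needed around the flag strings.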

For libraries built using a configure script, the cross-compilation for Xeon Phi accelerators is activated by using the flag “-mmic” and by declaring the host as x86_64-unknown-linux-gnu, so that the compilation process does not refer to the Linux version currently installed on the machine hosting the compiler.

2.2.1. Definition of Environment Variables

In order to compile the Geant4 toolkit and its dependencies, it is necessary to set specific environment variables. It is mandatory to append the corresponding libraries to the LD_LIBRARY_PATH variable and the executable binaries to the PATH variable. For our compilation, we created a mic-install directory where all the libraries were downloaded, built, and installed.

All variables whose names start with G4 are used for setting the paths of the Geant4 data libraries (photon evaporation data, radioactive decay data, particle cross sections for different energy ranges, cross section data for impact ionization, nuclear shell effects, optical surface reflectance, and nuclide state properties). Note that, since the bash and tcsh Unix shells are not supported on the Phi coprocessor, environment variables have to be set using a basic sh script file; this file is then sourced using the command line “source env.sh” (Script 1).

2.2.2. Cross-Compilation of Geant4 Toolkit and Dependencies

The methodology followed for cross-compiling the Geant4 toolkit and its dependencies is inspired by a preliminary work [17] tested with a previous Geant4 version (9.6 p02), which was not developed for multithreaded execution. The methodology has been improved and simplified, especially regarding the ROOT compilation, which is included in Geant4 since the 10.0 release.

Xerces and Expat libraries have been compiled using the configuration instructions specified in Scripts 2 and 3.

# Script 2: cross-compilation of Xerces-C
cd xerces-c-3.1.1_build
CC=/opt/intel/bin/icc CFLAGS="-ansi -fp-model precise -mmic -static-intel" \
CXX=/opt/intel/bin/icpc CXXFLAGS="-ansi -fp-model precise -mmic -static-intel" \
../xerces-c-3.1.1_src/configure \
  --prefix=/mic-install/xerces/xerces-c-3.1.1_install/ \
  --host=x86_64-unknown-linux-gnu
make -j40
make install

# Script 3: cross-compilation of Expat
cd expat-2.1.0_build
CC=/opt/intel/bin/icc CFLAGS="-ansi -fp-model precise -mmic -static-intel" \
../expat-2.1.0_src/configure \
  --prefix=/mic-install/expat/expat-2.1.0_install/ \
  --host=x86_64-unknown-linux-gnu
make -j40
make install

Script 4 details the CMake instructions for the compilation of the CLHEP library.

# Script 4: cross-compilation of CLHEP
cd clhep-_build
cmake \
  -DCMAKE_CXX_COMPILER=/opt/intel/bin/icpc \
  -DCMAKE_CXX_FLAGS="-ansi -fp-model precise -mmic -static-intel -gxx-name=g++-4.4" \
  -DCMAKE_C_COMPILER=/opt/intel/bin/icc \
  -DCMAKE_C_FLAGS="-ansi -fp-model precise -mmic -static-intel -gcc-name=gcc-4.4" \
  -DCMAKE_EXE_LINKER_FLAGS="-ansi -fp-model precise -mmic -static-intel" \
  -DCMAKE_INSTALL_PREFIX=/mic-install/clhep/clhep-_install \
  -DCMAKE_MODULE_LINKER_FLAGS="-ansi -fp-model precise -mmic -static-intel" \
  -DCMAKE_SHARED_LINKER_FLAGS="-ansi -fp-model precise -mmic -static-intel" \
  ../clhep-_src/CLHEP
make -j40
make install

Finally, Script 5 describes the CMake instructions for compiling Geant4 on Xeon Phi coprocessors.

# Script 5: cross-compilation of Geant4
# (append the path to the Geant4 source directory as the last cmake argument)
cd geant4.10.0.p01_build
cmake \
  -DCMAKE_C_FLAGS="-ansi -fp-model precise -mmic" \
  -DCMAKE_CXX_FLAGS="-ansi -fp-model precise -mmic" \
  -DCMAKE_SHARED_LINKER_FLAGS="-ansi -fp-model precise -mmic -Wl,-rpath,$LD_LIBRARY_PATH -Wl,-rpath-link,$LD_LIBRARY_PATH" \
  -DCMAKE_EXE_LINKER_FLAGS="-ansi -fp-model precise -mmic -Wl,-rpath,$LD_LIBRARY_PATH -Wl,-rpath-link,$LD_LIBRARY_PATH"
make -j40
make install

Note that we used the “-fp-model precise” flag, which makes the ICC compiler comply with the IEEE 754 standard for floating point number representation and computation. This flag also prevents compiler optimizations that could introduce numerical errors with respect to this floating point standard.

2.2.3. Compilation Test

The Geant4 extended example TestEm12 (accessible at $G4INSTALL/examples/extended) was migrated to enable multithreaded computing. To compile TestEm12 for Xeon Phi coprocessors, the CMake instructions listed in Script 6 have to be used.

# Script 6: cross-compilation of the TestEm12MT example
cd TestEm12MT_build
cmake \
  -DCMAKE_CXX_FLAGS="-ansi -fp-model precise -mmic -static-intel -gxx-name=g++-4.4" \
  -DCMAKE_C_FLAGS="-ansi -fp-model precise -mmic -static-intel -gcc-name=gcc-4.4" \
  -DCMAKE_EXE_LINKER_FLAGS="-ansi -fp-model precise -mmic -static-intel" \
  ../TestEm12MT
make -j40

This example, already validated against other Monte Carlo codes [18], scores the energy deposited per source particle in thin, concentric, spherical shells around an isotropic, monoenergetic electron point source of 4 MeV (megaelectronvolts) centered in a spherical geometry. The material of the sphere was chosen to be liquid water (density 1 g·cm^-3); the radius of the sphere was set to 3 cm and the number of shells was fixed to 150. The standard electromagnetic (EM) physics list option 3, describing electron and photon interactions between ~1 keV and 100 TeV, was used for all simulations, taking into account electron impact ionization, multiple scattering, and Bremsstrahlung generation. Script 7 presents an example of a TestEm12 macro file. When using the multithreaded mode, the command “/run/numberOfThreads 10” has to be added with the desired number of threads (10 threads in this example). The macro file is launched on the coprocessors with the executable generated by the compilation, using “./exe TestEm12.mac”.

/control/verbose 0
/run/verbose 0
#
/testem/det/setMat G4_WATER
/testem/phys/addPhysics emstandard_opt3   # em physics
/run/numberOfThreads 10
/gun/particle e-
/gun/energy 4 MeV
/analysis/setFileName test
/analysis/h1/set 1 150 0. 3. cm      # edep profile
/analysis/h1/set 3 100 0. 3. cm      # true track length
/analysis/h1/set 8 120 0. 1.2 none   # normalized edep profile
/testem/applyAutomaticStepMax true
/run/beamOn 1000000

In order to verify the correct cross-compilation of Geant4 and its dependencies, we tested the reproducibility of results for the TestEm12 electromagnetic example on 1 Xeon and for its multithreaded version TestEm12MT running on 40 Xeon threads and on 240 Phi threads, with and without the compilation flag “-fp-model precise”. Figure 2 represents the energy deposited per source particle for the four test cases. Each test case has been repeated 5 times. As can be noticed, if the compilation flag “-fp-model precise” is omitted during the compilation of Geant4 and CLHEP, the energy deposition profile is shifted to significantly smaller radial distances, which leads to poor agreement with the other configurations. If we quantify the difference between the energy E_Xeon(r) at distance r calculated with the Xeon CPU and the energy E_Phi(r) at the same distance calculated with Phi compiled without “-fp-model precise”, as a percentage of the maximum value of the two calculated energy distributions, then differences between E_Xeon(r) and E_Phi(r) of up to 40% are found, whereas a threshold below 3% is usually accepted. No other compilation flags were tested since we achieved very good agreement with the “-fp-model precise” flag.
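The comparison metric described above can be sketched as follows. This is one possible reading of the text, in which the reference value is the largest energy found in either profile; the function names and the toy profiles are hypothetical and are not data from the study.

```python
from typing import List, Sequence


def percent_differences(e_xeon: Sequence[float], e_phi: Sequence[float]) -> List[float]:
    """Per-shell absolute difference, as a percentage of the global maximum
    of the two energy deposition profiles."""
    if len(e_xeon) != len(e_phi):
        raise ValueError("both profiles must have the same number of shells")
    reference = max(max(e_xeon), max(e_phi))
    return [100.0 * abs(a - b) / reference for a, b in zip(e_xeon, e_phi)]


def within_tolerance(e_xeon: Sequence[float], e_phi: Sequence[float],
                     threshold_percent: float = 3.0) -> bool:
    """True when every shell differs by less than the accepted threshold."""
    return max(percent_differences(e_xeon, e_phi)) < threshold_percent


# Toy profiles (arbitrary units), chosen only to exercise the functions.
toy_xeon = [1.0, 2.0, 0.5]
toy_phi = [1.0, 1.25, 0.5]
diffs = percent_differences(toy_xeon, toy_phi)
print(diffs, within_tolerance(toy_xeon, toy_phi))
```

Normalizing by the global maximum rather than by the local value avoids inflating the differences in shells where very little energy is deposited.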

3. Performance Evaluations

3.1. Benchmark Descriptions

Geant4 simulations, through TestEm12 extended example, have been tested on Xeon Phi accelerators using distributed (TestEm12) and multithreaded (TestEm12MT) modes. Prior to any computational tests, we profiled TestEm12 example using the Intel VTune toolkit in order to quantify memory bandwidth consumption. We could conclude that the simulation is highly compute-bound.

In this study, we consider that the “distributed” mode means launching several independent simulation instances at the same time without involving any communication between instances. Concerning the “multithreaded” mode, simulations are launched using the pthread library. In both cases, simulations are balanced regarding the number of particles equally spread among worker nodes.

For the distributed mode, the total number of particles is split between the multiple run instances as described in [19, 20]. The postprocessing time due to the merging of results has been evaluated and represents a negligible percentage (less than 0.5%) of the total execution time measured.
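The splitting and merging steps can be illustrated with a short sketch. It assumes that particles are spread as evenly as possible and that results are merged by summing histograms bin by bin; both function names are hypothetical and the actual procedure of [19, 20] may differ.

```python
from typing import List, Sequence


def split_particles(total: int, n_instances: int) -> List[int]:
    """Spread `total` particles over `n_instances` runs, differing by at most one."""
    if n_instances <= 0:
        raise ValueError("n_instances must be positive")
    base, remainder = divmod(total, n_instances)
    return [base + (1 if i < remainder else 0) for i in range(n_instances)]


def merge_histograms(partials: Sequence[Sequence[float]]) -> List[float]:
    """Sum the per-instance histograms bin by bin."""
    return [sum(bins) for bins in zip(*partials)]


shares = split_particles(10**8, 240)
print(len(shares), min(shares), max(shares), sum(shares))
print(merge_histograms([[1.0, 2.0], [3.0, 4.0]]))
```

Because the instances do not communicate, the merge is the only step where their outputs meet, which is consistent with its small share of the total run time.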

Table 1 lists the number of partitions used for speedup calculations. For each benchmark, simulations were repeated five times; mean time values have been used for all the results. Standard deviations for all measured time values do not exceed 1%. All simulations used the standard EM option 3 physics list with 4 MeV electron point sources. All times are wall clock times, measured by the internal clock of the host machine. Distributed simulations used the Mersenne Twister pseudorandom number generator (PRNG). The pseudorandom numbers were generated using a sequence splitting method ensuring a sufficiently large sequence of 3 × 10^10 nonoverlapping random numbers so that no particle event is reproduced. When using the multithreaded version of Geant4, the partitioning process was left unmodified; the Mersenne Twister PRNG was used.
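Sequence splitting assigns to each instance a contiguous, nonoverlapping block of one single pseudorandom sequence. The sketch below illustrates the idea with Python's `random` module, which is also based on the Mersenne Twister; the function name and the tiny block size are illustrative assumptions. Advancing the generator by drawing and discarding numbers, as done here, is only practical for toy block sizes, and blocks as large as the one quoted above would call for stored generator states or a jump-ahead technique instead.

```python
import random


def stream(seed: int, index: int, block_size: int) -> random.Random:
    """Return a generator positioned at the start of block `index`.

    All streams share the same seed, so block k begins exactly
    k * block_size draws into the common sequence.
    """
    rng = random.Random(seed)
    for _ in range(index * block_size):
        rng.random()  # discard draws belonging to earlier blocks
    return rng


SEED = 12345
BLOCK = 10  # toy block size for the demonstration

master = random.Random(SEED)
reference = [master.random() for _ in range(3 * BLOCK)]

block1 = [stream(SEED, 1, BLOCK).random()]
second_stream = stream(SEED, 1, BLOCK)
block1_full = [second_stream.random() for _ in range(BLOCK)]
print(block1_full == reference[BLOCK:2 * BLOCK])
```

As long as each instance draws fewer numbers than the block size, no two instances ever reuse the same part of the sequence.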

The Xeon Phi hardware was used in “native mode,” meaning that whole simulations were executed on the Xeon Phi, started directly on the Xeon Phi using SSH.

3.2. Xeon versus Phi for Distributed Geant4 Simulations (TestEm12)

Speedup was evaluated for distributed TestEm12 simulations for numbers of generated source particles going from 10^3 to 10^8. Figure 3 presents the execution times in minutes obtained for one Xeon thread and one Phi thread, together with the resulting speedup. It can be observed that the speedup remains constant (~0.38) whatever the number of particles selected. This result was expected since the ratio of clock frequencies between Xeon and Phi is about 3 (3.001 GHz for Xeon compared to 1.052 GHz for Phi).
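The order of magnitude of this single-thread speedup can be checked against the clock frequencies alone. The sketch below assumes, as a simplification, that single-thread performance scales with clock frequency only; the variable names are illustrative.

```python
XEON_GHZ = 3.001
PHI_GHZ = 1.052
MEASURED_SPEEDUP = 0.38  # one Phi thread relative to one Xeon thread

# Speedup expected if run time were inversely proportional to clock frequency.
frequency_based_speedup = PHI_GHZ / XEON_GHZ

print(f"frequency-based estimate: {frequency_based_speedup:.3f}")
print(f"measured / estimate: {MEASURED_SPEEDUP / frequency_based_speedup:.2f}")
```

The estimate of about 0.35 is close to the measured value of about 0.38, which supports the frequency-based explanation.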

In order to verify the performance claimed by Intel for Xeon and Phi [5], Figure 4 shows the execution times in minutes obtained for 10 Xeon versus 60 Phi hardware cores. It can be observed that, with a higher number of particles (10^7), we reach a speedup of 2.18. For this benchmark, it appears that Phi coprocessors are more suitable than Xeon CPUs for numbers of particles higher than 10^5.

When using the total number of threads available on 1 Xeon and 1 Phi, respectively, 20 and 240 threads (see Figure 5), we reach a speedup of 3.83 for 10^8 particles, which is close to the expected maximum speedup of 4.2 derived from the GFLOPS performance announced by Intel. Since we have selected a limited optimization level and a floating point precision flag, this result is very satisfying.
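To put the measured value in perspective, it can be expressed as a fraction of the theoretical maximum. This is a simple arithmetic sketch using the two figures quoted above; the variable names are illustrative.

```python
MEASURED_SPEEDUP = 3.83      # 240 Phi threads versus 20 Xeon threads, 10^8 particles
THEORETICAL_SPEEDUP = 4.2    # ratio of announced peak GFLOPS

fraction_of_peak = MEASURED_SPEEDUP / THEORETICAL_SPEEDUP
print(f"fraction of the theoretical speedup reached: {fraction_of_peak:.1%}")
```

The distributed runs therefore reach roughly nine tenths of the speedup suggested by the peak figures.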

3.3. Distributed (TestEm12) versus Multithreaded (TestEm12MT) Geant4 Simulations on Xeon Processors

Speedup was evaluated for three different numbers of threads: 10, 20, and 40, corresponding, respectively, to the number of hardware cores of one Xeon CPU, the number of hardware cores of 2 Xeon CPUs, and the number of threads of 2 Xeon CPUs. The goal was to evaluate the potential impact of using a multithreaded version compared to a distributed one on Xeon processors. This case study, presented in Figure 6, demonstrates that TestEm12MT, the multithreaded version of the TestEm12 Geant4 example, is slower than TestEm12 whatever the number of threads used, the distributed version being between 1.08 and 1.33 times faster.

3.4. Impact of the Number of Phi Threads for Multithreaded (TestEm12MT) Geant4 Simulations

In order to evaluate whether a high number of threads significantly reduces the execution time of TestEm12MT on Phi whatever the number of particles generated, Figure 7 shows the computing times obtained for numbers of generated source particles going from 10^5 to 10^8 and numbers of Phi threads going from 10 to 240.

We can observe that the higher the number of generated source particles is, the higher the number of threads must be to reduce the execution times. For 10^8 particles, we obtain an almost linear decrease of computing time with the number of threads (up to 60 threads), as is also shown in Table 2, which lists the speedups obtained for 10, 20, 40, 60, 120, and 240 Phi threads for 10^8 particles compared to one Phi thread. For numbers of source particles below 10^5, the initialization process that sets up the physics datasets and the geometry takes about as long as the emission and tracking of particles, which explains why, in this case, we obtain the best computing time for a lower number of threads (20 threads).
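The speedups discussed here are ratios of wall clock times relative to the single-thread run, and dividing them by the number of threads gives a parallel efficiency that shows where scaling stops being linear. The sketch below uses hypothetical timings chosen only for illustration; they are not the measured values of Table 2.

```python
from typing import Dict


def speedup(t_reference: float, t_parallel: float) -> float:
    """Speedup of a run relative to the reference (single-thread) run."""
    return t_reference / t_parallel


def efficiency(t_reference: float, t_parallel: float, n_threads: int) -> float:
    """Parallel efficiency: 1.0 means perfectly linear scaling."""
    return speedup(t_reference, t_parallel) / n_threads


# Hypothetical wall clock times in minutes, keyed by number of threads.
hypothetical_times: Dict[int, float] = {1: 600.0, 10: 60.0, 60: 12.0, 240: 8.0}

t_single = hypothetical_times[1]
for n_threads, t in sorted(hypothetical_times.items()):
    print(n_threads, speedup(t_single, t), round(efficiency(t_single, t, n_threads), 3))
```

An efficiency that stays near 1 up to the number of hardware cores and then drops is the pattern one would expect when additional hardware threads share the same cores.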

3.5. Xeon versus Phi for Multithreaded (TestEm12MT) Geant4 Simulations

The speedup was evaluated for TestEm12MT simulations using the standard EM physics list for numbers of generated source particles going from 10^5 to 10^8, comparing runs on 40 Xeon threads with the best computing time obtained on Phi threads. It can be observed that, whatever the number of particles generated, Phi provides longer execution times (see Figure 8). The speedup nevertheless increases with the number of particles generated, reaching 0.25 for 10^8 particles.

In Table 3, we list the computing times obtained on 40 Xeon threads and 960 Phi threads for 10^8 and 10^9 particles.

When using 960 Phi threads, the computing time reaches 49.9 minutes for 10^8 particles, which is 23% higher than when using 40 Xeon threads (computing time of 40.6 minutes). But for 10^9 particles the computing time on 960 Phi threads finally becomes shorter than on Xeon; for this last configuration, we obtain a speedup of 1.04.
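The 23% figure can be verified from the two computing times quoted for 10^8 particles. This is a simple arithmetic sketch; the variable names are illustrative.

```python
XEON_40_THREADS_MIN = 40.6   # 10^8 particles on 40 Xeon threads
PHI_960_THREADS_MIN = 49.9   # 10^8 particles on 960 Phi threads

# Extra time on Phi relative to Xeon, in percent.
overhead_percent = 100.0 * (PHI_960_THREADS_MIN - XEON_40_THREADS_MIN) / XEON_40_THREADS_MIN
# Speedup of Phi relative to Xeon (below 1 means Phi is slower).
phi_speedup = XEON_40_THREADS_MIN / PHI_960_THREADS_MIN

print(f"Phi is {overhead_percent:.1f}% slower, speedup = {phi_speedup:.2f}")
```

For 10^8 particles the speedup is therefore about 0.81, to be compared with the value of 1.04 reported for 10^9 particles.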

4. Conclusion

The first objective of this paper was to detail a clear and understandable methodology to compile and execute any Geant4 application on Xeon Phi accelerators. Special attention should be paid to using the compilation flag “-fp-model precise” in order to obtain results identical to those of an execution on CPUs.

Then, our ambition was to evaluate the performance of Xeon Phi accelerators for such applications, especially given the availability of the multithreaded version of the Geant4 toolkit. We remind the reader that, as a first step, no tuning of the source code was undertaken in this study. Regarding the different outcomes obtained, we may conclude that
(i) when distributing sequential Geant4 simulations (40 Xeon threads compared to 240 Phi threads), Phi (5110P at 1 GHz) are faster than Xeon CPUs (E5-2690v2 at 3 GHz), almost reaching the maximum speedup (3.83x versus 4.2x) even though limited optimization was used in order to preserve the precision of the final results;
(ii) when considering multithreaded Geant4 simulations on Xeon CPUs, this version is unfortunately slightly slower than the classical distribution of sequential Geant4 simulations, whatever the number of threads used;
(iii) even if we observe a loss of performance for the multithreaded version of Geant4 on Phi compared to Xeon CPUs, it should be noted that, when using a high number of particles in simulations (corresponding to more than 6 hours of computing on 40 Xeon threads for 10^9 particles), we finally reach a very small speedup of 1.04 using 960 Phi threads.

For the moment, we can state that the multithreaded version of Geant4 is not yet optimized to compete with a distributed submission of simulations on a farm of CPU clusters, on a cluster of Phi hardware accelerators, or on a grid infrastructure. It would certainly require drastic tuning of the source code and the suppression of any verbose display in order to make such applications fully suited to Xeon Phi architectures. One would expect the next generation of the Geant toolkit (Geant5) to address this issue.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


This research was carried out as part of the Laboratory of Excellence ClerVolc project. The authors wish to thank the Geant4 collaboration for its technical support.