Prog. Theor. Exp. Phys. 2012, 01A303 (27 pages)
DOI: 10.1093/ptep/pts029

Astrophysics with GRAPE

Junichiro Makino* and Takayuki Saitoh
Interactive Research Center of Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo 152-8551, Japan
*E-mail: [email protected]

Received May 25, 2012; Accepted July 23, 2012; Published October 4, 2012

In this paper we provide an overview of the GRAPE (GRAvity PipE) project and related developments in astrophysics. The basic idea of the GRAPE project is to develop computers specialized for N-body simulations and to use them to perform large-scale simulations which cannot be done easily on general-purpose computers. The first GRAPE system, GRAPE-1, was completed in 1989. Since then more than ten systems have been developed and used for research in many fields of astrophysics. Some GRAPE systems were specifically designed for molecular dynamics simulations. In this paper we first give a brief overview of the history of the GRAPE project, and then try to give a systematic view of the advance of the numerical methods, in particular how they have been driven by the nature of the systems to be simulated and by the advance of semiconductor and computer technology.

1. Introduction

In many simulations in astrophysics it is necessary to solve gravitational N-body problems. In some cases, such as the study of the formation of galaxies or stars, it is important to treat non-gravitational effects such as the hydrodynamical interaction, radiation, and magnetic fields, but even in these simulations the calculation of gravity is usually the most time-consuming part.

To solve the gravitational N-body problem we need to evaluate the gravitational forces on all bodies (particles) in the system from all the other particles in the system. There are many ways to do so. The simplest is to calculate all pairwise interactions, which is the most efficient for systems with a relatively small number of particles (less than 10,000) and is still widely used in many applications. When the number of particles is much larger than 10,000, one can significantly accelerate the calculation using the Barnes–Hut tree algorithm [1] or the Fast Multipole Method (FMM) [2]. Even with these methods, however, the calculation of the gravitational interaction between particles (or between particles and multipole expansions of groups of particles) is the most time-consuming part of the calculation. Therefore, one can greatly improve the speed of the entire simulation just by accelerating the calculation of the particle–particle interaction.

This is the basic idea behind GRAPE computers. Figure 1 shows the basic idea. The system consists of a host computer and special-purpose hardware, and the special-purpose hardware handles the calculation of gravitational interactions between particles. The host computer performs other calculations such as the time integration of particles, I/O, and diagnostics.

Fig. 1. Basic structure of a GRAPE system.
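As a minimal illustration of this division of labor, the following sketch (plain Python with numpy; accelerator_forces is a hypothetical stand-in for the call into the GRAPE library, not an actual API) shows the host-side loop: the host does the time integration, I/O, and diagnostics, and hands the O(N^2) force summation to the accelerator.

```python
import numpy as np

def accelerator_forces(pos, mass, eps2):
    # Stand-in for the GRAPE pipeline: all-pairs softened gravity (G = 1).
    dx = pos[None, :, :] - pos[:, None, :]          # x_j - x_i
    r2 = (dx ** 2).sum(axis=2) + eps2
    w = mass[None, :] / r2 ** 1.5
    np.fill_diagonal(w, 0.0)                        # no self-interaction
    return (w[:, :, None] * dx).sum(axis=1)

def host_loop(pos, vel, mass, dt, nsteps, eps2=1e-4):
    # Host side: leapfrog time integration; only the force evaluation
    # is "offloaded" to the special-purpose hardware.
    acc = accelerator_forces(pos, mass, eps2)
    for _ in range(nsteps):
        vel += 0.5 * dt * acc
        pos += dt * vel
        acc = accelerator_forces(pos, mass, eps2)   # the expensive part
        vel += 0.5 * dt * acc
    return pos, vel
```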
Even with the latest GRAPE-8 system, we still maintain this basic structure, in which special-purpose hardware is connected to a general-purpose computer. However, in the two decades since the completion of GRAPE-1 the overall speed has increased by a factor of nearly 10^6. This increase in speed made it necessary to develop new algorithms, in particular efficient parallel algorithms, since a large part of the speedup is realized by the increase in the number of pipeline processors. More importantly, this increase in speed makes it possible to address complex problems such as the formation of galaxies. In order to study the galaxy formation process we need to model the gas dynamics of interstellar gas; its radiative cooling; the star formation process from cool, dense gas; and the heating of gas due to supernova explosions. These processes require new algorithms. Thus, in this paper we first provide an overview of the history of the GRAPE hardware in Sect. 2, then try to give a consistent view of the algorithms used for N-body systems in Sect. 3, and then present an overview of the recent development of new algorithms for galaxy formation simulations in Sect. 4.

2. History

2.1. GRAPEs 1, 2, and 3

The GRAPE project was started in 1988. GRAPE-1 [3] was the first machine we completed, and it became operational in September 1989. It was a single-board unit on which around 100 IC and LSI chips were mounted and wire-wrapped. The pipeline processor of GRAPE-1 consisted of commercially available IC and LSI chips. GRAPE-1 was the first machine we ever built, and using commercially available chips was the only practical possibility: we did not have the knowledge or budget to design custom LSI chips.

For GRAPE-1 we used an unusually short word format, to make the hardware as simple as possible. The first subtraction of the position vectors was done in 16-bit fixed-point format, and the final accumulation of the force was done in 48-bit fixed-point format. All other operations were done in an 8-bit logarithmic format, in which 3 bits are used for the "fractional" part. This choice simplified the hardware significantly, since we could use a single 512-kbit EPROM chip to express any binary operation. For example, x^2 + y^2 was evaluated by one chip.

This use of an extremely short word format in GRAPE-1 was based on a detailed theoretical analysis of error propagation and on numerical experiments [4]. The reason why a short word format can be used is related to the fundamental nature of N-body simulation. In the simulation of "collisionless" systems, for which the simulation timescale is much shorter than the thermal relaxation time of the system, the primary source of numerical error is the artificial two-body relaxation caused by the finite number of particles, which is usually many orders of magnitude smaller than the actual number of particles in the system. Since we express the actual system with a smaller number of particles, the potential field fluctuates, and the amplitude of the fluctuation is proportional to 1/√N. This fluctuation is the cause of the two-body relaxation. We have proved that, as long as the error can be regarded as random, the numerical error due to the short word format has the same effect as the two-body relaxation itself, but with a numerical coefficient equal to the relative accuracy of the pairwise force. Thus, we need a word length just long enough that the error behaves as effectively random.
GRAPE-1 was used for studying the merging of spherical galaxies and violent relaxation.

For GRAPE-1 we used an IEEE-488 (GP-IB) interface for communication with the host. In addition, GRAPE-1 could only handle particles with the same mass. These choices were made to simplify the hardware design. As will be discussed in Sect. 3, for simulations of collisionless systems we do not use the O(N^2) direct summation method. We use fast methods like the Barnes–Hut tree algorithm or FMM (see Sect. 3 for a detailed discussion of these methods). In order to use the GRAPE hardware with these methods, it must be able to handle particles of unequal mass (physical particles and center-of-mass particles), and communication with the host computer should be much faster. GRAPE-1A was designed for this purpose. For GRAPE-1A a VME bus was used, and it provided a speed of around 4 MB/s, faster than the communication speed of GRAPE-1 by nearly two orders of magnitude.

For the simulation of collisional systems, in other words when the simulation timescale is longer than the thermal relaxation time of the system, the very short word format used for GRAPE-1 is not appropriate, since the variation of the total energy of the system can become noticeable. We developed GRAPE-2 for the simulation of collisional systems. In GRAPE-2, in order to achieve higher accuracy, commercial LSI chips for floating-point arithmetic operations, such as the TI SN74ACT8847 and Analog Devices ADSP3201/3202, were used. The pipeline of GRAPE-2 processes the three components of the interaction sequentially, accumulating one interaction every three clock cycles. This approach was adopted to reduce the circuit size. Its speed was around 40 Mflops, but it was still much faster than the workstations or minicomputers of that time. GRAPE-2 was used for many problems, including the study of the runaway growth of protoplanets and the merging of two galaxies with central massive black holes. GRAPE-2A was designed to handle interactions with arbitrary functional form, so that it could be used for molecular dynamics simulations.

GRAPE-3 was the first GRAPE computer with a custom LSI chip. The number format was a combination of the fixed-point and logarithmic formats similar to those used in GRAPE-1. The chip was fabricated with a 1 µm design rule by National Semiconductor. The number of transistors on the chip was 110 K. The chip operated at a 20 MHz clock speed, offering an overall speed of about 0.8 Gflops. Printed circuit boards with 8 chips were mass produced, for a speed of 6.4 Gflops per board. Thus, GRAPE-3 was also the first GRAPE computer to integrate multiple pipelines into a system. GRAPE-3 was also the first GRAPE computer to be manufactured and sold by a commercial company. Nearly 100 copies of GRAPE-3 have been sold to more than 30 institutes (more than 20 outside Japan).

2.2. GRAPEs 4, 5, and 6

In 1992 we started the development of GRAPE-4, with a target performance of 1 Tflops. At that time, the number of floating-point units which could be integrated into one LSI chip was around 20, and the practical clock frequency was 30 MHz. This means we could achieve around 600 Mflops per chip, so around 1,700 chips were necessary to achieve 1 Tflops. Each chip integrated one pipeline unit similar to that of GRAPE-2. This chip calculates the first time derivative of the force, so that a fourth-order Hermite scheme [5] can be used.
The chip was fabricated with a 1 µm design rule by LSI Logic, with a total transistor count of about 400 K. The completed GRAPE-4 system consisted of 1728 pipeline chips (36 PCBs, each with 48 pipeline chips). It operated on a 32 MHz clock, delivering a speed of 1.1 Tflops. Technical details of the machines from GRAPE-1 through GRAPE-4 can be found in our book [6] and references therein.

Table 1. History of the GRAPE project.

  GRAPE-1   (89/4–89/10)  310 Mflops, low accuracy
  GRAPE-2   (89/8–90/5)   50 Mflops, high accuracy (32 bit/64 bit)
  GRAPE-1A  (90/4–90/10)  310 Mflops, low accuracy
  GRAPE-3   (90/9–91/9)   18 Gflops, low accuracy
  GRAPE-2A  (91/7–92/5)   230 Mflops, high accuracy
  HARP-1    (92/7–93/3)   180 Mflops, high accuracy, Hermite scheme
  GRAPE-3A  (92/1–93/7)   8 Gflops/board; some 80 copies are used all over the world
  GRAPE-4   (92/7–95/7)   1 Tflops, high accuracy; some 10 copies of small machines
  MD-GRAPE  (94/7–95/4)   1 Gflops/chip, high accuracy, programmable interaction
  GRAPE-5   (96/4–99/8)   5 Gflops/chip, low accuracy
  GRAPE-6   (97/8–02/3)   64 Tflops, high accuracy

Fig. 2. The evolution of GRAPE and general-purpose parallel computers. The peak speed is plotted against the year of delivery. Open circles, crosses, and stars denote GRAPEs, vector processors, and parallel processors, respectively.

GRAPE-5 [7] was an improvement over GRAPE-3. It integrated two full pipelines which operate on an 80 MHz clock. Thus, a single GRAPE-5 chip offered 8 times the speed of the GRAPE-3 chip, or the same speed as an 8-chip GRAPE-3 board. GRAPE-5 was awarded the 1999 Gordon Bell Prize for price–performance. The GRAPE-5 chip was fabricated with a 0.35 µm design rule by NEC.

Table 1 summarizes the history of the GRAPE project. Figure 2 shows the evolution of GRAPE systems and general-purpose parallel computers. One can see that the evolution of GRAPE is faster than that of general-purpose computers.

GRAPE-6 was essentially a scaled-up version of GRAPE-4 [8], with a peak speed of around 64 Tflops. The peak speed of a single pipeline chip was 31 Gflops. In comparison, GRAPE-4 consisted of 1728 pipeline chips, each providing 600 Mflops. The factor of 50 increase in per-chip speed was achieved by integrating six pipelines into one chip (the GRAPE-4 chip had a single pipeline, which needed three cycles to calculate the force from one particle) and by using a three times higher clock frequency. The advances in device technology (from 1 µm to 0.25 µm) made these improvements possible. Figure 3 shows the processor chip, delivered in early 1999; the six pipeline units are visible.

Fig. 3. The GRAPE-6 processor chip.

The completed GRAPE-6 system consisted of 64 processor boards, grouped into 4 clusters of 16 boards each. Within a cluster, the 16 boards are organized in a 4-by-4 matrix with 4 host computers. They are organized so that the effective communication speed is proportional to the number of host computers. In a simple configuration, the effective communication speed would be independent of the number of host computers. The details of the network used in GRAPE-6 are given in [9].

2.3. GRAPE-DR

In 2004 we started the development of GRAPE-DR. It has an architecture quite different from that of previous GRAPE hardware. Instead of hardwired pipelines for the gravitational interaction, a GRAPE-DR processor chip integrates a large number of very simple processors which operate in a SIMD fashion. This rather drastic change in design was made to extend the application area.
At least part of the reason we tried to extend the application area was to justify the large initial cost of custom LSI chips. In 1990 the initial cost of a custom chip was around 150 K USD; in 1997 it was around 1 M USD; and in 2004 it was more than 3 M USD. The total grant necessary to complete a system is around four times the initial cost of the LSI chip, so we had to obtain a grant of 10–15 M USD. Such a large grant was impractical for a system which could solve only astrophysical N-body problems.

GRAPE-DR is an acronym for "Greatly Reduced Array of Processor Elements with Data Reduction". The last part, "Data Reduction", means that it has an on-chip tree network which can perform various reduction operations such as summation, max/min, and logical and/or. When we use GRAPE-DR as a GRAPE, this summation network is used to add the partial forces on one particle calculated on multiple processors on one chip.

The GRAPE-DR project was started in FY 2004 and finished in FY 2008. The GRAPE-DR processor chip consists of 512 simple processors which can operate at a clock speed of 500 MHz, for 512 Gflops of single-precision peak performance (256 Gflops double precision). It was fabricated by TSMC with a 90 nm process; the die size is around 300 mm^2 and the peak power consumption is around 60 W. The GRAPE-DR processor board (Fig. 4) houses 4 GRAPE-DR chips, each with its own local DRAM chips. It communicates with the host computer through a Gen1 16-lane PCI-Express interface. This card gives a theoretical peak performance of 819 Gflops (in double precision) at a clock speed of 400 MHz. The actual performance numbers are 640 Gflops for matrix-matrix multiplication, 430 Gflops for LU decomposition, and 500 Gflops for direct N-body simulation with individual timesteps (Fig. 5). These numbers are typically a factor of two or more better than the best performance numbers so far reported with GPGPUs.

Fig. 4. The GRAPE-DR processor board.

Fig. 5. The performance of the individual timestep scheme on a single-card GRAPE-DR in Gflops, plotted as a function of the number of particles.

In the case of parallel LU decomposition, the measured performance was 24 Tflops on a 64-board, 64-node system. The average power consumption of this system during the calculation was 29 kW, and thus the performance per watt is 815 Mflops/W. This number was listed as No. 1 in the Little Green 500 list of June 2010. Thus, from a technical point of view, we believe that the GRAPE-DR project was highly successful in making multi-purpose computers with the highest single-card performance and the highest performance per watt.

Fig. 6. The GRAPE-DR cluster.

2.4. PROGRAPE and GRAPE-7

Another way to reduce the high initial cost is to use FPGA (field-programmable gate array) chips. An FPGA chip is programmable in the sense that one item of hardware can be used to realize an arbitrary logic design. Conceptually, an FPGA consists of programmable logic elements connected by a programmable network. Here, a "programmable" logic element is typically just a small SRAM block which can express any combinatorial logic. A "programmable" network similarly means wires with multiplexers whose select inputs are connected to small SRAM blocks. By loading configuration data into the SRAM blocks, one can use one FPGA chip to express any logic design, as long as it fits into the chip. FPGA chips are thus mass produced, and no initial cost is necessary.
The drawback of FPGAs is that their transistor efficiency is much lower than that of a custom design, and their operating frequency is somewhat lower. Thus, the performance of an FPGA chip is typically lower than that of a custom LSI chip by a factor of ten or so. However, even with this low efficiency, a pipeline processor implemented on FPGA chips can have large advantages over software implementations on general-purpose processors. This is especially true for pipeline processors with a very short word format. Hamada [10] described PROGRAPE-1, in which we used FPGAs to implement low-accuracy GRAPE hardware. Several generations of such hardware have been built, and the latest one is GRAPE-7. Using four large FPGAs, it gives a peak performance of 830 Gflops for low-accuracy gravitational force calculations.

2.5. GRAPE-8

GRAPE-8 is the latest generation of GRAPE hardware, based on the relatively new technology called structured ASIC, which is something in between a custom LSI and an FPGA. Its design is similar to an FPGA, but its function is determined by customizing one layer of wiring and via holes. Thus, the initial development cost of a structured ASIC chip is much lower than that of a custom LSI, and yet its mass production cost (or price per logic gate) is significantly lower than that of an FPGA. In principle, therefore, structured ASIC technology can be very useful for the development of special-purpose computers such as GRAPE. For GRAPE-8 we used the N2X740 chip from eASIC Corporation. It integrates 48 pipeline processors, similar to those of GRAPE-5 but with somewhat higher accuracy, and also an additional cutoff-function unit to be used with P^3M or P^3T schemes. The GRAPE-8 processor board, with two processor chips and one interface FPGA chip, provides a speed of 960 Gflops for a power consumption of around 40 W.

2.6. GRAPEs for molecular dynamics

Molecular dynamics is rather similar to astrophysical N-body simulation, except that atoms interact through van der Waals and Coulomb forces instead of the gravitational force. Thus, pipeline processors similar to GRAPE can be used to accelerate molecular dynamics simulations. In fact, two pipeline processors for molecular dynamics simulations were built years before GRAPE-1. The first one was DMDP, which was designed as a complete hardware processor for simulation: not only the force calculation but also the integration of the orbits of atoms and the calculation of the output physical quantities were done on a specialized pipeline processor. FASTRUN was an accelerator processor for molecular dynamics simulations of protein molecules.

GRAPE-2A was the first GRAPE hardware for molecular dynamics. It uses commercial floating-point LSI chips, as in the case of GRAPE-2. With MD-GRAPE, one pipeline processor similar to that of GRAPE-2A is implemented in one LSI chip. MDM is a massively parallel development of MD-GRAPE which achieved a peak speed of 75 Tflops. Protein Explorer achieved 1 Pflops.

In the US, a specialized processor named ANTON was developed. It can be regarded as a design similar to GRAPE, but with a programmable processor and network interfaces integrated on one chip with the pipeline processors. This design was apparently chosen to reduce the communication latency between the pipeline processors and the general-purpose processor, as well as between general-purpose processors.
Thus, ANTON achieved an extremely short calculation time per MD step, around two orders of magnitude shorter than that of any other machine, including GRAPEs.

3. Algorithms I: Pure N-body dynamics

Starting with the 300-Mflops GRAPE-1, the calculation speed of GRAPE systems increased by nearly six orders of magnitude in two decades. This increase was primarily driven by the increase in the number of arithmetic units, or, in the case of most GRAPE systems, the number of pipelines. GRAPE-1 was a single-pipeline system; GRAPE-6 had 12,288 pipeline units. This increase in the degree of parallelism has been achieved by using various algorithms. Some of them are modifications of previously known ones and some are newly developed. In this section we review the algorithms used. In Sects. 3.1, 3.2, and 3.3 we discuss the algorithms in the time domain, and in Sects. 3.4 and 3.5 we discuss the algorithms in the space domain. In Sect. 3.6 we discuss the issue of parallelization. Finally, we discuss how we can combine algorithms in space and time in Sects. 3.7 and 3.8.

3.1. Individual and block timestep method

The individual timestep scheme [11,12] is the basic integration scheme for astrophysical N-body problems. It has remained the standard for more than 30 years following its invention in 1960.

Fig. 7. Schematic description of the individual timestep algorithm.

Figure 7 illustrates how the individual timestep scheme works. We consider a system of n particles. In the individual timestep scheme, each particle has its own time and timestep. We denote the time and timestep of particle i as t_i and Δt_i. We first select the particle i for which the value of t_i + Δt_i is a minimum, and then we integrate the position and velocity of this particle to its new time and update its timestep. In order to integrate particle i, we first predict the positions of all particles at time t_i + Δt_i, then calculate the force on particle i and apply the correction to the position and velocity of particle i. Thus, the traditional way to use the individual timestep scheme is to combine it with one of the multistep predictor–corrector schemes. Historically, four-step, fourth-order schemes with variable stepsize were used.

In most modern implementations of the individual timestep scheme or its variants, the Hermite integration scheme [5] is used. It is based on the Hermite interpolation method. Hermite interpolation is similar to Newton interpolation, which is the basis of the traditional variable-stepsize linear multistep scheme. In Newton interpolation we use only the values of the function f. With Hermite interpolation, we use the values of the derivatives of f in addition to f to construct the interpolation formula. In the case of a gravitational N-body system, the first derivative of the acceleration can be calculated at a small additional cost. The acceleration and its first time derivative are given by

    a_i = Σ_j G m_j r_ij / (r_ij^2 + ε^2)^{3/2},                                              (1)

    ȧ_i = Σ_j G m_j [ v_ij / (r_ij^2 + ε^2)^{3/2} − 3 (v_ij · r_ij) r_ij / (r_ij^2 + ε^2)^{5/2} ],   (2)

where

    r_ij = x_j − x_i,                                                                          (3)
    v_ij = v_j − v_i.                                                                          (4)

Here, ε is the softening parameter.

With the jerk calculated directly, the construction of higher-order integrators is simplified significantly. For example, the simplest explicit scheme is now second order in time, instead of first order.
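Equations (1)–(4) translate directly into code. The numpy sketch below (G = 1, all pairs, for illustration only; on GRAPE the same expressions are evaluated by the hardware pipeline) returns the acceleration and jerk of every particle:

```python
import numpy as np

def acc_and_jerk(pos, vel, mass, eps2):
    # a_i and adot_i of Eqs. (1)-(2), with G = 1.
    dx = pos[None, :, :] - pos[:, None, :]        # r_ij = x_j - x_i, Eq. (3)
    dv = vel[None, :, :] - vel[:, None, :]        # v_ij = v_j - v_i, Eq. (4)
    r2 = (dx ** 2).sum(axis=2) + eps2             # r_ij^2 + eps^2
    r3inv = r2 ** -1.5
    r5inv = r2 ** -2.5
    np.fill_diagonal(r3inv, 0.0)                  # exclude self-interaction
    np.fill_diagonal(r5inv, 0.0)
    rv = (dx * dv).sum(axis=2)                    # v_ij . r_ij
    acc = ((mass[None, :] * r3inv)[:, :, None] * dx).sum(axis=1)
    jerk = ((mass[None, :] * r3inv)[:, :, None] * dv
            - 3.0 * (mass[None, :] * rv * r5inv)[:, :, None] * dx).sum(axis=1)
    return acc, jerk
```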
In the following, we present the complete formulas for a two-step, fourth-order predictor-corrector scheme. The predictor is given by

    x_p = x_0 + Δt v_0 + (Δt^2/2) a_0 + (Δt^3/6) ȧ_0,    (5)

    v_p = v_0 + Δt a_0 + (Δt^2/2) ȧ_0,                    (6)

where x_p and v_p are the predicted position and velocity; x_0, v_0, a_0, and ȧ_0 are the position, velocity, acceleration, and its time derivative at time t_0; and Δt is the timestep. The corrector is given by the following formulas (see, for example, [13]):

    x_c = x_0 + (Δt/2)(v_c + v_0) − (Δt^2/12)(a_1 − a_0),       (7)

    v_c = v_0 + (Δt/2)(a_1 + a_0) − (Δt^2/12)(ȧ_1 − ȧ_0).       (8)

The predictor formulas use only "instantaneous" quantities that are calculated directly from the position and velocity at the present time. Compared to a scheme which has to keep track of values at previous timesteps, the program becomes much simpler. The merit of the Hermite scheme is, however, not just the simplicity of the formulas. The local truncation error of Hermite interpolation is several orders of magnitude smaller than that of Newton interpolation with the same order and stepsize. Therefore, the Hermite scheme allows a significantly longer timestep than that used for the Aarseth scheme. Of course, this advantage is partially offset by the additional cost needed to calculate the time derivative directly. Thus, the relative advantage depends on the computer used.

In the case of a pipeline processor, the Hermite scheme has several additional advantages. First, the word length for the calculation of the jerk can be shorter than that for the force, resulting in a significant reduction in the size of the hardware. Second, the timestep is longer for the Hermite scheme at the same accuracy, resulting in a reduction in the amount of communication between the CPU and GRAPE.

If we calculate higher-order derivatives directly, we can construct higher-order integration schemes. If we calculate the second derivative, we can construct a single-step corrector of sixth order. In order to make the order of the predictor consistent with the corrector, we need derivatives up to the third order. Thus, the predictor needs to be a two-step one, but it is easy to construct. Similarly, by calculating third-order derivatives directly, we can construct an eighth-order scheme. Nitadori et al. [14] described the formulation and performance of such sixth- and eighth-order schemes. They do have a significant advantage over the simple fourth-order scheme, in particular when high accuracy is required or when the central density of the system becomes high. In retrospect, it is a bit surprising (or shameful) that it took such a long time to recognize the advantage of high-order schemes.
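Equations (5)–(8) can be coded almost literally. The sketch below uses a single shared timestep for brevity (a real code applies this per particle within the individual or block timestep machinery of Sect. 3.6) and the acc_and_jerk routine sketched above:

```python
def hermite_step(pos, vel, mass, dt, eps2):
    acc0, jrk0 = acc_and_jerk(pos, vel, mass, eps2)
    # Predictor, Eqs. (5)-(6)
    xp = pos + dt * vel + dt**2 / 2 * acc0 + dt**3 / 6 * jrk0
    vp = vel + dt * acc0 + dt**2 / 2 * jrk0
    # Force and jerk at the predicted state
    acc1, jrk1 = acc_and_jerk(xp, vp, mass, eps2)
    # Corrector, Eqs. (7)-(8): correct the velocity first, then use it for x_c
    vc = vel + dt / 2 * (acc1 + acc0) - dt**2 / 12 * (jrk1 - jrk0)
    xc = pos + dt / 2 * (vc + vel) - dt**2 / 12 * (acc1 - acc0)
    return xc, vc
```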
3.2. Limit of the gain by the individual timestep scheme

One important recent finding is that the behavior of the timestep size and the energy error is quite different between the fourth-order scheme and higher-order schemes. In the case of the fourth-order scheme, when the core becomes small through gravothermal collapse, the energy error grows quickly. However, the growth of the error is much smaller for the higher-order schemes, even if we choose the timestep criterion so that the error per crossing time is initially the same. Of course, this difference means the calculation cost of the higher-order schemes grows a bit faster. Surprisingly, however, what makes the error of the higher-order schemes small is the shrinking of the timesteps for particles far away from the core. The timestep for particles in the core becomes smaller for both the fourth-order and higher-order schemes, but the timesteps of particles in the outer region shrink only in the case of the higher-order schemes.

What this difference tells us is that the traditional timestep criterion used for the fourth-order scheme is too insensitive to high-frequency, low-amplitude variations of the acceleration, and thus the timesteps of particles far away from the core are too long to resolve the motions of particles in the core. As a result, the error of the fourth-order scheme becomes large. On the other hand, the timestep criterion for the higher-order schemes, which uses the higher-order time derivatives, is sensitive enough to reduce the timesteps of particles far away from the core, so that they can resolve the motion of the particles in the core. Thus, in order to keep the error reasonably small, the timesteps of particles far away from the core must be small enough to resolve the motion of particles in the core.

This observation implies that there is a fundamental limit on the gain from the individual timestep algorithm. The timesteps of most of the particles must be small enough to resolve the motions of the particles with the smallest orbital timescale. The existence of this limit looks quite natural once we understand the underlying mathematics. Even so, it had been overlooked for the first half-century of the history of gravitational N-body simulation. Clearly, we have not yet fully understood a relatively simple-looking problem: What is the best way to numerically integrate the gravitational N-body problem?

3.3. Neighbor scheme

With the individual timestep algorithm, particles have their own times and timesteps. The timestep of one particle is determined so that the required accuracy is achieved. In principle, we can generalize this concept of an individual timestep to all of the N^2 interactions. Each interaction would have its own time and timestep, determined by the required accuracy. Whether or not such a scheme is practical, or even realizable, has not been well studied yet, but one can show that the ultimate gain of such a scheme is not so large. Consider a system like a star cluster with a relatively large core (not too high a central density). Interactions between most pairs of particles must be evaluated at a time interval which is a small fraction of the orbital timescale of the particles in the core. On the other hand, interactions of particles in the core need to be evaluated on the timescale in which particles move the average interparticle distance. Thus, the ratio between the shortest timescale and a typical timescale for distant pairs is O(N^{-1/3}), which is not small but can be insignificant compared to the additional overhead of the complex algorithm.

If we can achieve a significant fraction of the gain of such an ideal "pairwise individual timestep" scheme by something simpler, that might in practice be more useful than the ideal scheme. One such possibility is the neighbor scheme. In the neighbor scheme, the force on a particle is divided into two components, the neighbor force and the "regular" force (we follow the traditional naming here). Typically, around 30 "neighbor" particles, the 30 nearest neighbors, are selected, and the force from these neighbor particles is integrated with a timestep shorter than that of the force from the rest of the system. In the neighbor scheme, the list of neighbors is updated at each timestep for the regular force.
Theoretically, one should be able to achieve a reduction of the calculation cost of O(N^{-1/4}) by making the number of neighbors O(N^{3/4}). In practice the gain is smaller, for the following two reasons. First, the above argument that the timestep for distant pairs can be of the order of the orbital timescale is too optimistic, as we discussed in the previous section. Second, a number of neighbors of O(N^{3/4}) is too large, since it means we need O(N^{7/4}) words of memory.

3.4. The treecode

The basic idea of the treecode [1] is to replace the force from a group of distant particles by the force from their center of mass or by a multipole expansion. To ensure accuracy, we make the groups for distant particles large and the groups for nearby particles small. We use a tree structure to construct the appropriate grouping for each particle.

Before calculating the forces on particles, we first organize the particles into a tree structure. Barnes and Hut [1] used an octree based on the recursive subdivision of a cube into eight subcubes. We stop the recursive subdivision when a cube contains only one particle, or is empty. Figure 8 shows the Barnes–Hut tree in two-dimensional space.

Fig. 8. Barnes–Hut tree in two dimensions.

After the tree is constructed, for each node of the tree, which corresponds to a cube of a certain size, we calculate the coefficients of the multipole expansion of the gravitational force exerted by the particles in that cube. This calculation can be done using a simple recursive procedure.

The force calculation is also expressed as a recursive procedure. To calculate the force on a particle we start from the root node, which corresponds to the total system. We calculate the distance between the node and the particle (d) and compare it with the size of the node (l); see Fig. 9. If they satisfy the convergence criterion

    l/d < θ,    (9)

where θ is the accuracy parameter, we calculate the force from that node on the particle using the coefficients of the multipole expansion. If criterion (9) is not satisfied, the force is calculated as the sum of the forces from the eight sub-nodes.

Fig. 9. Opening criterion for tree traversal.

Usually, we use the distance between the particle and the center of mass of the node to determine whether the force is accurate enough. When θ is very large, this criterion can cause unacceptably large errors [15]. For most calculations, however, such a pathological situation does not occur. In addition, for the treecode with GRAPE, relatively small values of θ such as 0.5–0.6 are not too costly.
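To make the recursion concrete, here is a monopole-only sketch of the tree walk with criterion (9). The node attributes (size, com, mass, children) are hypothetical names chosen for this illustration; a production code would add higher multipole terms and the list-based traversal discussed in Sect. 3.6.3.

```python
import numpy as np

def tree_force(node, x, theta, eps2):
    # Softened monopole force at position x from the particles under `node`.
    d = np.linalg.norm(node.com - x)
    if not node.children or (d > 0.0 and node.size / d < theta):
        # Criterion (9) satisfied (or leaf node): treat the cell as one mass.
        if d == 0.0:
            return np.zeros(3)                 # the particle itself
        return node.mass * (node.com - x) / (d * d + eps2) ** 1.5
    # Otherwise open the node and sum the forces from its sub-nodes.
    return sum(tree_force(c, x, theta, eps2) for c in node.children)
```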
3.5. The fast multipole method

The basic idea of the treecode is to replace a group of particles with the multipole expansion of its gravitational field. This multipole expansion is used to evaluate the gravitational field at the positions of distant particles. Here, there is a rather clear inefficiency: the forces on two particles which are physically close to each other are evaluated independently. There should be some way to make use of the fact that the forces on physically close particles are similar. The fast multipole method (FMM) is a systematic way to exploit this fact. The basic idea of FMM is to eliminate the O(log N) factor in the force calculation cost of the treecode by evaluating the potential at a higher level of the tree. In the case of the treecode, the multipole expansion of a group of particles is simply evaluated directly at the position of each particle. In the case of FMM, it is first translated to a spherical harmonic expansion at the center of a box which contains the particle at which the potential is evaluated, and then this spherical harmonic expansion is evaluated at the position of the particle. We can add up the contributions of many groups of particles to the spherical harmonic expansion at one box. Moreover, we can shift the expansion at the center of one box to the centers of its child boxes and hierarchically add up contributions from all levels of the tree. Thus, the order of the calculation cost is reduced to O(N). FMM is therefore theoretically advantageous over the treecode, but so far it is not so widely used, probably mainly because of the complexity of its implementation.

3.6. Parallelization

3.6.1. Block timestep algorithm. The individual timestep algorithm is quite powerful in reducing the calculation cost. However, an obvious problem with the individual timestep algorithm is that it reduces the degree of parallelism to the smallest possible value: only one particle is integrated at a time. We can still evaluate the forces from many particles in parallel and take the summation, but we cannot calculate the forces on multiple particles in parallel.

If several particles share exactly the same time, we can calculate the forces on them in parallel. In addition, we need to predict the positions of the other particles only once in order to calculate the forces on these particles. The cost of the prediction therefore becomes a small fraction of that of the force calculation. In order to force several particles to share exactly the same time, we adjust the timesteps to integer powers of two. With this modification, all particles that have the same timestep also have the same time. This scheme is usually called the block timestep algorithm. It was developed by McMillan [16] in order to use the Cyber 205 more efficiently. The technical details of this algorithm are described in [17].

On average, the number of particles which share exactly the same time is O(N^{2/3}) if the system is nearly homogeneous. If the system has a small, high-density core, this number becomes smaller, depending on the details of the timestep criterion and the particle distribution. Even so, the block timestep scheme has a very large impact on the parallel efficiency of N-body simulation.
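The two ingredients of the block timestep scheme, quantization of the timestep to powers of two and selection of the block to be integrated, are small pieces of code. The sketch below shows them in isolation (dt_max and the per-particle accuracy-based timestep dt_req are assumed to be given; the integrator itself is omitted):

```python
import numpy as np

def quantize_timestep(dt_req, dt_max):
    # Largest power-of-two subdivision of dt_max that does not exceed dt_req.
    dt = dt_max
    while dt > dt_req:
        dt *= 0.5
    return dt

def next_block(t, dt):
    # Indices of all particles sharing the earliest t + dt.  Because every
    # dt is dt_max / 2^k and the particles start synchronized, particles
    # with equal t + dt also share the same t, so the forces on all of
    # them can be evaluated in parallel.
    t_next = t + dt
    t_min = t_next.min()
    return np.where(t_next == t_min)[0], t_min
```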
3.6.2. 2D parallelization and the Ninja scheme. The blockstep algorithm makes it possible to parallelize the force calculation for multiple particles on shared-memory multiprocessors and vector processors. Here we discuss parallelization on distributed-memory parallel processors.

The simplest approach is to let each processor have a complete copy of the system and calculate the forces on a fraction of the particles. In the case of the blockstep scheme, each processor selects the particles to be integrated and integrates n/p of them, where n is the number of particles in the block and p is the number of processors. After the integration is finished, each processor broadcasts its updated particles, so that all processors have all particles in the block updated. The scalability of this algorithm is rather limited, since we can use only O(n) processors. The performance is limited by the communication bandwidth and latency, and it is usually difficult to use more than 128 processors even for more than 100k particles. On the other hand, the calculation cost per crossing time is O(N^{7/3}) and that per thermal relaxation time is O(N^{10/3}). Thus, parallelization in this way does not give sufficient speedup.

One can use larger numbers of processors by using a so-called two-dimensional algorithm [18]. The basic idea of this scheme is to divide the calculation of the forces f_ij in both the i and j directions. If we have p = r^2 processors, we organize them into a two-dimensional grid. Particles are divided into r groups; processor p_ij holds both group i and group j, and calculates the force from group j on group i. The total force is obtained by taking a summation over the j direction (the direction in which j differs and i is the same). Then we can let each processor update group i. Finally, we let the diagonal processors broadcast their groups in the i direction. This parallelization effectively increases the communication bandwidth by a factor of r, and it is now possible to use O(Nn) processors. If we combine the blockstep algorithm with this two-dimensional parallelization, there is a relatively large load imbalance between processors, since the number of particles in one blockstep fluctuates.

The Ninja scheme (Nbody-i-and-j algorithm) achieves almost perfect load balancing in the following way. We again have r^2 processors, but processor p_ij holds only group j. After the processors have selected the particles to be integrated, they divide the selected particles into r groups, and processors in row i broadcast their group i in the j direction. Then each processor calculates the forces from its group on the received particles. The total forces are calculated by taking a summation, and the summed results are sent back to their original locations, so that processor p_ij obtains the total force on subgroup i of the particles to be integrated in group j. Finally, by broadcasting these summed results in the i direction, all processors obtain the forces on the particles to be integrated. With this scheme it is even possible to use more than N processors to integrate a system of N particles, and we can achieve very high parallel efficiency for relatively small numbers of particles.
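Because the index bookkeeping of the two-dimensional decomposition is easy to get wrong, a serial sketch that emulates an r x r grid and checks it against the r = 1 case may be helpful; it reproduces only the data flow (which group meets which), not the actual communication:

```python
import numpy as np

def group_forces(xi, xj, mj, eps2):
    # Softened forces exerted by particles (xj, mj) on particles xi (G = 1).
    dx = xj[None, :, :] - xi[:, None, :]
    w = mj[None, :] / ((dx ** 2).sum(axis=2) + eps2) ** 1.5
    return (w[:, :, None] * dx).sum(axis=1)

def forces_2d(pos, mass, r, eps2=1e-4):
    # Emulate an r x r grid: "processor" (i, j) holds groups i and j and
    # computes the force of group j on group i; summation over the j
    # direction gives the total force on group i.
    groups = np.array_split(np.arange(len(pos)), r)
    acc = np.zeros_like(pos)
    for gi in groups:
        acc[gi] = sum(group_forces(pos[gi], pos[gj], mass[gj], eps2)
                      for gj in groups)
    return acc

# Sanity check of the decomposition: the result must not depend on r, e.g.
# assert np.allclose(forces_2d(pos, mass, 1), forces_2d(pos, mass, 4))
```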
3.6.3. Parallel algorithms for treecode. There are a number of studies on the parallelization of the Barnes–Hut tree algorithm. Some of the very early works discussed efficient implementations on vector processors [19–22], but there were also implementations on both MIMD [23] and SIMD [24] distributed-memory computers.

Roughly speaking, a parallel implementation of the tree algorithm needs to take care of the following three steps:

(1) Distribution of particles to processors (in the case of distributed-memory machines)
(2) Construction of the tree
(3) Force calculation

In the case of shared-memory parallel machines (including vector processors with single or multiple processors), step (1) is not necessary, but in the case of distributed-memory parallel machines this step is critical. Two major algorithms are used. One is ORB (orthogonal recursive bisection) and its variants. As its name suggests, with ORB we first consider a cubic box which contains all particles in the system and divide it in the x direction, so that the number of particles, or the calculation cost, is the same on each side. Then we divide the two regions in the y direction, and then in z, and then x again, until the number of regions is the same as the number of processors (this means the number of processors must be an integer power of two). Each processor then takes care of one region. It is also possible to make the division in one dimension not a bisection but a division into an arbitrary number of regions, so that one can use systems whose number of processors is not a power of two [25].

In this scheme, the octree structure used for the force calculation is constructed after the decomposition of the space is done and the regions are assigned to processors. Thus, on each processor, both the tree construction and the force calculation can be done independently of the other processors, but each processor needs information about the particles on the other processors. One way to transfer the necessary information is to let each processor send to every other processor all the data that it needs. If the regions of two processors are far apart, all particles in one processor can be approximated as one node, and it only needs to send the data for that one node. If the two regions are close, we can construct the necessary data by tree traversal: if a node of the tree in one processor is too close to the region of the other processor, we open up that node and go down the tree; if the node is sufficiently distant, we stop the tree traversal there. In this way we can make a cut-out version of the tree, and we then send this tree structure to the other processor. The processor which receives the cut-out tree then merges it with its own tree structure. It is possible to simplify this procedure by sending not the tree structure but a list of nodes and particles [25]. In this scheme, the nodes and particles in the list are "inserted" into the tree, or the whole tree is reconstructed after all the necessary data have been received. A modern implementation is described in [26].

On modern microprocessors, traversing the tree is a relatively slow process compared to the calculation of the interactions themselves. Thus, it is desirable to somehow reduce the cost of the tree traversal, and one way to do so is to do the traversal for a group of particles instead of a single particle. To identify a group, we can use the tree structure itself. Figure 10 illustrates the opening criterion for a group of particles. Instead of calculating the distance between one box and one particle, we should calculate the minimum of the distances of all particles in the box from the other box. We use the geometrical minimum distance between the box and the center of mass of the other box as this distance [20]. In the usual implementation, we construct the list of nodes and particles which exert forces on the group of particles, and then calculate the forces from the list on the group.

Fig. 10. Modified opening criterion for treecode with GRAPE.

Once we separate the tree traversal and the force calculation, we can use vector processors or SIMD execution units quite efficiently for the force calculation. We can also use GRAPE hardware or, more recently, GPGPUs for the force calculation. In practice, if we accelerate the force calculation by a very large factor, other parts of the calculation, such as the tree construction and tree traversal, can become the bottleneck. Efficient use of cache and SIMD units for these parts is the current challenge.
3.7. BRIDGE

In principle, it is possible to combine the individual timestep (or blockstep) method and the tree algorithm. In the earliest implementation [27], the tree structure is newly constructed at each blockstep. If the range of timesteps is not too large this approach is satisfactory, but for systems like star clusters the overhead of tree construction is too large. It is also possible to partially update the tree at each timestep [28]. However, to our knowledge there is no implementation of this type on distributed-memory parallel computers.

For the last few years we have been working on methods to combine the blockstep algorithm and the tree algorithm in such a way that the resulting scheme can be efficiently parallelized. The first method is the BRIDGE scheme [29]. It is a rather specialized method to handle the evolution of star clusters in galaxies. The goal of the BRIDGE scheme is to handle both the parent galaxy and the star cluster as N-body systems, and to include their gravitational interactions in a fully consistent way. If we used the individual timestep method for the entire system, the calculation cost would become too high. On the other hand, if we used the tree algorithm with a constant stepsize for the entire system, we would have to apply some softening to the stars in the cluster, and yet we would have to use a very short timestep. Thus, we could not express important processes such as the formation of binaries and physical collisions of stars. Ideally, we want to apply the tree algorithm to the evolution of the parent galaxy and the interaction between the galaxy and the star cluster, and to apply the individual timestep algorithm to the internal dynamics of the star cluster.

Fujii et al. [29] achieved this goal by applying the idea of Hamiltonian splitting. In the BRIDGE algorithm, the potential energy of the system is split into two components as

    V = V_cl + V_rest.    (10)

Here, V is the total potential energy of the system, V_cl is the internal potential energy of the star cluster, and V_rest is the remaining part of the potential energy. Note that V_rest consists of the internal potential energy of the galaxy and the interaction potential between the galaxy and the star cluster. The Hamiltonian H = T + V is now split into

    H_1 = T + V_cl,    (11)
    H_2 = V_rest.      (12)

We can now integrate the system using a procedure similar to the second-order leapfrog scheme. At the beginning of the integration we first evaluate the acceleration due to V_rest and push the velocities of all particles by Δt a/2, where a is the calculated acceleration. Then we integrate the positions and velocities of all particles using H_1. Since H_1 contains only the internal potential energy of the star cluster, the particles of the parent galaxy move with constant velocity, while the particles in the star cluster are integrated using the individual timestep scheme, following only the internal potential. When all particles in the star cluster reach the next timestep of the leapfrog scheme, we calculate the forces due to V_rest and push the velocities of all particles by Δt a/2 to reach the new time. For the next timestep, we again push the velocities by Δt a/2. When no diagnostics or output are necessary, we can simply push the velocities by Δt a.

This BRIDGE scheme is, to our knowledge, the first algorithm with which we can handle the evolution of star clusters in their parent galaxy in a fully self-consistent way, and it has been used for many studies of the evolution of young star clusters.
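The structure of one BRIDGE step is compact enough to write down. In the sketch below, accel_rest (the acceleration due to V_rest, e.g. from a tree code) and evolve_cluster_internal (individual-timestep Hermite integration of the cluster under V_cl alone) are hypothetical stand-ins for the two parts of the split, and the particle data are assumed to be held in arrays pos and vel plus a boolean mask in_cluster:

```python
def bridge_step(pos, vel, in_cluster, dt, accel_rest, evolve_cluster_internal):
    # Kick: half-step velocity change of *all* particles due to V_rest.
    vel += 0.5 * dt * accel_rest(pos)
    # Drift under H1 = T + V_cl: galaxy particles move on straight lines,
    # cluster particles are integrated internally with individual timesteps.
    pos[~in_cluster] += dt * vel[~in_cluster]
    evolve_cluster_internal(pos, vel, in_cluster, dt)
    # Kick: second half-step due to V_rest at the new positions.
    vel += 0.5 * dt * accel_rest(pos)
    return pos, vel
```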
3.8. Particle-Particle Particle-Tree

The idea of Hamiltonian splitting can be applied in many different ways, and one possibility is to split the pairwise interaction itself into near and distant terms. This splitting is the same as that used in the P^3M (particle-particle particle-mesh) scheme or the Ewald summation method. In these schemes, a pairwise potential is split into two parts by a smooth function as

    U_ij = g(r_ij) U_ij + [1 − g(r_ij)] U_ij.    (13)

Here, g is a smooth function which satisfies the following criteria:

◦ g(0) = 1;
◦ g(∞) = 0;
◦ g is differentiable n times, where n is the order of the integrator used.

We can apply the same time integration scheme as in the BRIDGE scheme by replacing the cluster potential V_cl with the potential energy due to the g(r_ij)U_ij term (a sketch of one possible splitting function is given at the end of this section). There are many technical details concerning the actual implementation, which are discussed in [30]. Our current view is that this P^3T scheme is the practical "solution" to the fundamental limitation of the individual timestep algorithm discussed in Sect. 3.2. We simply make the timestep for the tree force calculation comparable to the orbital timescale of particles in the core. Though the calculation cost is not small, it is smaller than that of the standard blockstep scheme by a factor proportional to N / log N. This factor is likely to be much larger than the gain due to the neighbor scheme.
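For concreteness, one possible splitting function satisfying the three conditions above is given below; the quintic polynomial with a finite cutoff radius r_cut is our own illustrative choice, not necessarily the form adopted in [30]:

```python
def g_split(r, r_cut):
    # Smooth switch: g(0) = 1, g(r) = 0 for r >= r_cut, with vanishing first
    # and second derivatives at both ends (sufficient for a low-order
    # integrator).  The near part g*U is handled with short timesteps, the
    # smooth remainder (1 - g)*U with the tree and a long leapfrog step.
    if r >= r_cut:
        return 0.0
    s = r / r_cut
    return 1.0 - s**3 * (10.0 - 15.0 * s + 6.0 * s**2)
```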
4. Algorithms II: N-body+SPH for galaxy formation

Numerical simulation is a powerful tool for studying the complex processes involved in galaxy formation. Since the pioneering works by Katz and Gunn [31] and Navarro and Benz [32], a number of efforts have been undertaken [33–45]. In order to make a single galaxy, it is necessary to collect material from a volume of 1 Mpc^3 and squeeze it into several tens of kpc. In addition, cold gas forms a thin disk in which spiral arms and star-forming regions with sizes of several pc develop. Thus, we need to cover six orders of magnitude in spatial scale. Clearly, we cannot handle such a wide range with a fixed, regular grid, and schemes with adaptive resolution are necessary.

There are two major ways to achieve adaptive resolution: AMR (adaptive mesh refinement) and SPH (smoothed particle hydrodynamics). They have their own merits and demerits. The advantage of AMR over SPH is that it is a grid-based finite-difference scheme. Thus, it has significantly better shock-capturing capability. For many test problems, in particular those for instabilities such as the Kelvin–Helmholtz and Rayleigh–Taylor instabilities, the performance of grid-based schemes is better than that of SPH for comparable degrees of freedom. However, when applied to galaxy formation, AMR has one fundamental difficulty. Consider the situation where we want to resolve a high-density molecular cloud in the galactic plane with a size of 1 pc and a temperature of 20 K. The finest grid scale would be 0.1 pc, and the rotation velocity of the gas is 200 km/s. Thus, the Courant–Friedrichs–Lewy (CFL) condition requires that the timestep should not exceed 500 years. On the other hand, if we can use some Lagrangian scheme in which the grid (or particle, or whatever) moves, the timestep limit due to the CFL condition is around 0.1 million years. Also, with an Eulerian grid, the gas moves with a Mach number close to 1,000, which means we need to solve the hydrodynamics with extremely high accuracy just to conserve the shape and energy of the cloud. Whether or not one can achieve the necessary accuracy with the Eulerian AMR method remains to be seen.

Lagrangian schemes, such as N-body+SPH [27], are therefore the method of choice for simulations of galaxy formation. With particle-based methods like SPH, the number of particles used determines the resolution and accuracy of the calculation. The typical number of particles for a simulated galaxy was 10^4 in the 1990s and 10^{4–5} in the 2000s. Some recent studies used 10^{6–7} particles. Thus, the increase in the number of particles is less than a factor of 1000 in two decades. In the case of cosmological N-body simulations, the number of particles increased by a factor of 10^5 or more, from 10^6 to 10^{11}, in the same period. Thus, there was a difference of about a factor of 100 between the number of particles used for cosmological N-body simulations and that for N-body+SPH galaxy formation simulations, and this difference has itself increased by a factor of 100 in 20 years. In both types of calculation, the evaluation of the gravitational force on particles from the rest of the system is the most expensive part, and in both cases the parallel tree method discussed in the previous section has been used.

The essential reason why there is such a large difference in the number of particles used is the dependence of the local dynamical timescale on the resolution. In the case of cosmological N-body simulations, the dependence of the minimum timescale on the number of particles is fairly weak. In many simulations, the increased number of particles is used to cover a wider volume without improving the resolution. In this case, the timestep is clearly independent of the number of particles. In the case of constant volume and improved resolution, the minimum timestep is proportional to m^{0.3–0.4}, where m is the mass of the particles.

In the case of N-body+SPH simulations, the smallest timescale appears in supernova (SN) remnants. Since the minimum mass of gas particles is still many orders of magnitude larger than the actual remnant mass, the mass of gas to which we assign the explosion energy is directly proportional to the mass of the gas particles. Moreover, with gas particles of smaller mass we can resolve smaller gas clouds with higher density. The temperature of SN remnants can exceed 10^8 K, which corresponds to a sound velocity higher than 1000 km/s. Thus, if we want to use a resolution of 0.1 pc, the timestep can go below 100 years. On the other hand, with pure N-body simulations, the required timestep is around 10^5 years even for extremely high-resolution simulations. One can reduce the impact of the small timestep by means of the individual timestep method, as described in the previous section. However, as we also discussed there, the speedup one can achieve with the combination of the individual timestep algorithm and the tree algorithm is rather limited.

If we need a timestep 100 times smaller for an N-body+SPH simulation than for a pure N-body simulation, then, using the same amount of computer resources, we can use a number of particles 100 times smaller. However, the fact that the number of particles is 100 times smaller means the parallel efficiency is much worse, resulting in an even greater reduction of the available computer resources and thus a further reduction in the number of particles. In addition, there are still several fundamental problems with the SPH method when applied to galaxy formation problems.
In the rest of this section we discuss some of these issues and our proposed solutions.

4.1. The timestep limiter for individual timesteps

The individual timestep method is commonly used in simulations of galaxy formation [16,27,46]. This method allows particles to have different timesteps, and as a result the total calculation cost is reduced. Almost all known implementations of this method violate Newton's third law. However, the error due to this violation is tolerable in usual simulations.

In [47] we found that the SPH method with individual timesteps, implemented in the way described in the literature, cannot adequately handle strong explosion problems. This is because the timestep of a particle is determined explicitly, at the beginning of that timestep itself. A particle cannot respond to a strong shock if its timestep before the shock arrives is too long compared to the timescale of the shock, and thus supernova explosions pose a serious difficulty. Such an explosion generates a small amount of very hot gas (T ~ 10^8 K) inside a large clump of cold gas (T ~ 10 K). Thus, the timestep of the hot gas particles becomes 1000 times smaller than that of the cold gas around them. The hot gas wants to expand, but the cold gas around it is frozen, because the timesteps of the cold gas particles are orders of magnitude longer than the expansion timescale of the hot gas. This inconsistency results in the breakdown of the numerical integration.

To overcome this difficulty we introduced a timestep limiter which limits the difference in timesteps between neighboring particles. We denote the timestep of the ith particle as dt_i and that of a neighbor particle, with index j, as dt_j. The basic idea of our limiter is to enforce the following conditions:

    dt_i ≤ f dt_j,    (14)
    dt_j ≤ f dt_i,    (15)

where f is an adjustable parameter. We found f = 4 to give good results, from the perspective of total energy and linear momentum conservation, without a significant increase in the calculation cost. It is essential that the timestep of particle j shrinks when the timestep of its neighbor particle i suddenly shrinks by a large factor; in other words, particle i must let particle j respond to the change of its timestep.

Our implementation of the timestep limiter is as follows. To enforce a small enough difference in timesteps among neighboring particles, particles send their timesteps to their neighboring particles when they are integrated. Particles which receive timesteps compare them with their local minimum timestep, dt_lmin,j, and update the local minimum timestep if necessary. If the timestep of particle j is too long compared to its local minimum timestep (dt_j > f dt_lmin,j), it is reduced to f dt_lmin,j. Note that this reduction of the timestep of particle j is possible only if the times of the two particles, t_i and t_j, satisfy the condition t_i ≥ t_j + f dt_i. If this condition is not satisfied, the reduction of the timestep would place the new time of particle j before the current time of particle i, requiring a backward integration of the entire system. In this case, the new time of particle j is set to a value that is consistent with the current system time (t_i), and its timestep is set to the difference between the particle's current time and its new time.

Schematic pictures of the traditional individual timestep method and of our implementation can be found in Fig. 1 in [47].
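The core of the limiter fits in a few lines. The array-based sketch below follows the description above with f = 4; the variable names are ours, and the adjustment of the new time when the condition t_i ≥ t_j + f dt_i is violated is omitted:

```python
import numpy as np

def broadcast_timesteps(dt, neighbor_lists):
    # When a particle is integrated it sends its timestep to its neighbors;
    # each particle keeps the minimum value it has received, dt_lmin.
    dt_lmin = np.full_like(dt, np.inf)
    for i, neighbors in enumerate(neighbor_lists):
        for j in neighbors:
            dt_lmin[j] = min(dt_lmin[j], dt[i])
    return dt_lmin

def limit_timesteps(dt, dt_lmin, f=4.0):
    # Enforce conditions (14)-(15): shrink any timestep larger than f times
    # the smallest timestep among its neighbors.
    dt_new = dt.copy()
    too_long = dt_new > f * dt_lmin
    dt_new[too_long] = f * dt_lmin[too_long]
    return dt_new
```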
As shown in Fig. 3 in [47], the traditional implementation of individual timesteps failed to reproduce the solution for a point-like explosion. When we used the timestep limiter with the individual timesteps, the numerical solution of the SPH simulation was in good agreement with the self-similar Sedov solution. The increase in the calculation cost is negligible.
In many previous simulations of galaxy formation, the mass of gas particles is 10^5 M_sun or more, and gas cooling below 10^4 K is artificially suppressed. In this case, the reduction of the timestep in the region heated by an SN explosion is rather small, say a factor of 10, and special treatments such as that described here were not necessary. In other words, this problem is one of the numerical difficulties that arise only when one deals with low-temperature, high-density interstellar gas. Our new timestep limiter is essential in that regime, and has been adopted in many new calculations.
4.2. Asynchronous time integrator for self-gravitating fluid
In simulations of galaxies, supernova explosions in dense regions lead to the shortest timesteps. Consider the situation that an SN with energy E_SN occurs in the interstellar medium (ISM) with a temperature of T_ISM. If this SN heats the surrounding ISM from temperature T_ISM to T_SN, and if the size of the region is unchanged during this process, we can write the reduction factor of the timestep of the ISM before the SN, dt_ISM, and after the SN, dt_SN, as
\frac{dt_{\rm SN}}{dt_{\rm ISM}} \propto \left( \frac{T_{\rm ISM}}{T_{\rm SN}} \right)^{1/2} \propto E_{\rm SN}^{-1/2} \, m_{\rm SN}^{1/2} \, T_{\rm ISM}^{1/2} ,  (16)
where m_SN is the mass of the heated region, and we used the relation T_SN ∝ E_SN/m_SN. For SPH simulations, it is reasonable to set m_SN close to the resolution, ~ N_NB × m_SPH, where N_NB is the number of neighbor particles (typically 30–50) and m_SPH is the mass of an SPH particle. The shrinkage of the timestep is larger when the mass resolution is higher and the ISM temperature is lower. Therefore, a high-resolution simulation which incorporates the low-temperature ISM (< 10^4 K) requires much shorter timesteps than conventional simulations of galaxy formation with a cooling cutoff at ~ 10^4 K.
In a heated region, the thermal energy and kinetic energy become many orders of magnitude larger than the gravitational potential energy. This means that the timestep for the gravitational interaction can be much longer than that for hydrodynamics. By extending the concept of individual timesteps, we constructed a new integration scheme which allows an individual fluid particle to have different timesteps for gravitational and hydrodynamical interactions. As we stated above, particles in the heated region have the shortest timesteps. Therefore, if we assign different timesteps to the gravitational and hydrodynamical forces, we should be able to use a much longer timestep for gravity, thereby accelerating simulations by a large factor. We call this integrator FAST, which is an acronym for "Fully Asynchronous Split Time-integrator". A similar idea has been used in molecular dynamics, in which the long-range Coulomb and the short-range van der Waals forces are integrated with different timesteps [48].
FAST reduces unnecessary gravitational force evaluations in the small timesteps induced by SN explosions. Since the number of dark matter and stellar particles is usually larger than that of SPH particles in typical simulations of galaxy formation, the calculation cost of gravity is larger than that of hydrodynamics. This reduction in unnecessary evaluations of gravity can therefore improve the calculation speed significantly.
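The following Python fragment sketches the split-step structure that FAST builds on: gravity is applied as half-kicks on a long step, while hydrodynamics is sub-cycled on shorter steps in between. The fixed substep ratio, the simple leapfrog sub-integrator, and the callable names are ours; the actual FAST scheme assigns the two timesteps per particle and asynchronously, rather than globally as here.

import numpy as np

def split_step(x, v, u, dt_grav, n_sub, grav_accel, hydro_accel_du):
    """One coarse step of a gravity/hydro split integrator (schematic).
    Gravity enters as half-kicks at the ends of the coarse step;
    hydrodynamics is integrated with n_sub leapfrog substeps in between.
    grav_accel(x)           -> gravitational acceleration
    hydro_accel_du(x, v, u) -> (hydrodynamical acceleration, du/dt)
    """
    v = v + 0.5 * dt_grav * grav_accel(x)      # gravity half-kick
    dt_h = dt_grav / n_sub
    for _ in range(n_sub):                     # fine hydrodynamical substeps
        a_h, du = hydro_accel_du(x, v, u)
        v = v + 0.5 * dt_h * a_h
        u = u + 0.5 * dt_h * du
        x = x + dt_h * v
        a_h, du = hydro_accel_du(x, v, u)
        v = v + 0.5 * dt_h * a_h
        u = u + 0.5 * dt_h * du
    v = v + 0.5 * dt_grav * grav_accel(x)      # gravity half-kick
    return x, v, u

Applying gravity as half-kicks at the boundaries of the long step is the same kind of splitting used in the BRIDGE-like schemes mentioned below; the hydrodynamical part here is not exactly symplectic, since the viscous and pressure accelerations depend on the velocities.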
When used with a parallel tree algorithm, FAST offers a further advantage. The calculation of gravity for a small fraction of the particles, which occurs when a small number of particles with short timesteps are integrated, is quite expensive per particle, because the overhead of tree construction and communication dominates the computing time. The splitting of the integration steps for hydrodynamics and gravity is done in the same way as in the BRIDGE or P^3T schemes, though in the case of FAST neither the integration of hydrodynamics nor that of gravity is exactly symplectic.
In [49] we compared results obtained with the FAST method and with the conventional leapfrog method. The evolution of an SN explosion in a self-gravitating cloud solved by the FAST method is in good agreement with that obtained by the conventional leapfrog method with individual timesteps (see Figs. 5 and 6 in [49]). With the FAST method, the total number of gravity steps decreased and the calculation time was reduced. The gain in speed in this case was not so large because the fraction of particles with small timesteps was large. In order to test the performance of the FAST method under more realistic circumstances we compared the results of merger simulations, in which we took into account the radiative cooling of gas down to 10 K, star formation in the dense (n_H > 100 cm^-3) and cold (T < 100 K) phase gas, and type II SN explosions. The initial numbers of dark matter, (old) star, and gas particles were 6,930,000, 341,896, and 148,104, respectively. We used 128 cores of a Cray XT4 system. The evolution of the merging galaxies in the two schemes is essentially identical. The calculation time with the FAST method is half of that without FAST. The calculation time of gravity was reduced to one-eighth in this case, as expected from Fig. 8 in [49]. In these simulations we used a highly hand-tuned code, the Phantom-GRAPE library [50], for the calculation of gravity, but the calculation of the SPH part is not yet optimized to the same level. We should be able to further improve the performance of our simulation code by using a similarly optimized code for the SPH part.
4.3. Extension of treecode for multi-mass and multi-scale simulations
When one performs a multi-mass and multi-scale simulation, one would like to use different Plummer softenings for particles of different mass. The reason we use small-mass particles is to improve the mass resolution, and it is often necessary to reduce the softening for these particles. For the direct summation method, it is possible to use an arbitrary symmetric form for the softening length, for example [(ε_i^2 + ε_j^2)/2]^{1/2}, (ε_i + ε_j)/2, or max(ε_i, ε_j), where ε_i and ε_j are the softening lengths of particles i and j, respectively, instead of the single ε found in the usual Plummer potential, so as to satisfy Newton's third law. When one uses the tree method [1,51] with the Plummer potential, one needs to use separate trees for each of the different groups of particles with the same gravitational softening length [52], since otherwise there will be an error in the force calculation of order ε^2. The use of different trees for different groups of particles having the same softenings leads to a large increase in the calculation cost. If we want to use "individual" softening, we cannot use the tree algorithm at all. Saitoh and Makino [53] introduced a new way to handle particle-dependent softening lengths using a single tree structure.
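Before turning to the tree-based treatment, the Python sketch below illustrates the direct-summation case with the symmetric softening choices listed above; all three keep the pairwise force antisymmetric, so Newton's third law holds. The function name, the rule labels, and the use of G = 1 units are ours.

import numpy as np

def direct_sum_accel(pos, mass, eps, rule="rms"):
    """Direct-summation gravitational acceleration with pairwise
    symmetric Plummer softening.
    rule: 'rms'  -> eps_ij^2 = (eps_i^2 + eps_j^2) / 2
          'mean' -> eps_ij   = (eps_i + eps_j) / 2
          'max'  -> eps_ij   = max(eps_i, eps_j)
    """
    G = 1.0
    n = len(mass)
    acc = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            if rule == "rms":
                eps2 = 0.5 * (eps[i]**2 + eps[j]**2)
            elif rule == "mean":
                eps2 = (0.5 * (eps[i] + eps[j]))**2
            else:
                eps2 = max(eps[i], eps[j])**2
            dr = pos[j] - pos[i]
            r2 = np.dot(dr, dr) + eps2
            f = G * dr / r2**1.5        # per unit mass product, direction i -> j
            acc[i] += mass[j] * f       # equal and opposite contributions,
            acc[j] -= mass[i] * f       # so total momentum is conserved
    return acc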
Conceptually, if we use symmetric softening of the form [(ε_i^2 + ε_j^2)/2]^{1/2}, we can regard ε_i and ε_j as two additional dimensions, in addition to the three spatial dimensions. When we evaluate the force from a group of particles, we usually use a multipole expansion. If particles in the group have different softenings, we can formally expand the potential in terms of ε_j, or more precisely ε_j^2 - ε̄^2, where ε̄ is an averaged softening length defined below. Consider the following form:
\phi_{ij} = - \frac{G m_j}{(r_{ij}^2 + \epsilon_i^2 + \epsilon_j^2)^{1/2}} .  (17)
This form has been used before to model the gravitational interaction between galaxy particles of different size and mass [54,55]. The gravitational potential induced by a group of particles j = 1, ..., N with particle masses m_j and total mass M is given by
\phi_i = - \sum_{j}^{N} \frac{G m_j}{[(\mathbf{r}_i - \delta\mathbf{r}_j)^2 + \epsilon_i^2 + \epsilon_j^2]^{1/2}} .  (18)
Here, the origin of the coordinates is set to the center of mass of the particles j, and the positions of the particles j relative to this center are δr_j. By introducing an arbitrary form of the averaged softening length for the particles j, ε̄, Eq. (18) can be rewritten as
\phi_i = - \sum_{j}^{N} \frac{G m_j}{[(\mathbf{r}_i - \delta\mathbf{r}_j)^2 + \epsilon_i^2 + \bar{\epsilon}^2 + \delta(\epsilon_j^2)]^{1/2}} ,  (19)
where δ(ε_j^2) ≡ ε_j^2 - ε̄^2. The Taylor expansion of this equation up to second order in δr_j and δ(ε_j^2) is
\phi_i = - \sum_{j}^{N} \frac{G m_j}{R} \left[ 1 + \frac{\mathbf{r}_i \cdot \delta\mathbf{r}_j}{R^2} - \frac{\delta(\epsilon_j^2)}{2R^2} + \frac{3(\mathbf{r}_i \cdot \delta\mathbf{r}_j)^2 - R^2 |\delta\mathbf{r}_j|^2}{2R^4} - \frac{3(\mathbf{r}_i \cdot \delta\mathbf{r}_j)\,\delta(\epsilon_j^2)}{2R^4} + \frac{3\,[\delta(\epsilon_j^2)]^2}{8R^4} + O\!\left( \frac{|\delta\mathbf{r}_j|^3}{R^3}, \frac{[\delta(\epsilon_j^2)]^3}{R^6} \right) \right] ,  (20)
where R = (r_i^2 + ε_i^2 + ε̄^2)^{1/2} is the Plummer distance for the symmetrized potential. The first term in the brackets is the monopole moment. The second term vanishes by definition since we adopt the center of mass as the coordinate center. The third term, the lowest-order term of the expansion in δ(ε_j^2), also vanishes if we adopt the mass-weighted second moment of ε_j as the averaged softening length:
\bar{\epsilon}^2 = \langle \epsilon_j^2 \rangle \equiv \frac{\sum_{j}^{N} m_j \epsilon_j^2}{M} .  (21)
Hence, adopting the second moment of ε_j as the averaged softening length is the most favorable option. Since the multipole moment of the symmetrized potential is obtained by expanding in both δr_j and δ(ε_j^2), two criteria are necessary for the convergence of the multipole expansion. By analogy with Barnes and Hut's opening criterion, we introduce a simple, but rather loose, set of opening criteria which can easily be used in a treecode:
\eta > \frac{w}{R} ,  (22)
and
\eta_{\epsilon} > \frac{\epsilon_{\max}^2 - \epsilon_{\min}^2}{R^2} ,  (23)
as the convergence criteria of the multipole expansion. Here, w is the size of the cell, ε_max and ε_min are the maximum and minimum softening lengths of the particles in the cell, and η and η_ε are the tolerance parameters. For the convergence of the multipole expansion, η < 1/√3 and η_ε < 1 are necessary. The reason for η < 1/√3 is that the usual Barnes–Hut opening criterion causes unbounded errors in the force calculation when θ > 1/√3.
We successfully extended the tree method to multi-mass and multi-scale simulations by using the symmetrized Plummer potential and deriving the multipole moments for a group of particles. Since our method is quite simple, it is easy to apply it to any code which uses the tree method with the ordinary Plummer potential. The latest version of GRAPE, GRAPE-8, adopts this type of symmetrized Plummer potential as a standard feature. In addition, Phantom-GRAPE [50] is also equipped with this symmetrized Plummer potential (Tanikawa, private communication).
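The Python sketch below shows how the cell quantities required by Eqs. (21)–(23) might be accumulated and used as an acceptance test. The data layout and the function names are ours, not those of the actual treecode or of the GRAPE-8 implementation; a real code would of course store these moments in the tree nodes during tree construction.

import numpy as np

def cell_moments(pos, mass, eps):
    """Aggregate quantities stored for a tree cell: total mass, center of
    mass, mass-weighted mean softening squared (Eq. 21), cell size, and
    the minimum and maximum softening lengths in the cell."""
    M = mass.sum()
    com = (mass[:, None] * pos).sum(axis=0) / M
    eps2_bar = (mass * eps**2).sum() / M
    size = (pos.max(axis=0) - pos.min(axis=0)).max()
    return M, com, eps2_bar, size, eps.min(), eps.max()

def accept_cell(r_i, eps_i, com, eps2_bar, size, eps_min, eps_max,
                eta=0.5, eta_eps=0.5):
    """Return True if the monopole of the cell may be used for the force on
    a particle at r_i with softening eps_i (criteria of Eqs. 22 and 23)."""
    R2 = np.dot(r_i - com, r_i - com) + eps_i**2 + eps2_bar
    R = np.sqrt(R2)
    return (size / R < eta) and ((eps_max**2 - eps_min**2) / R2 < eta_eps)

If either criterion fails, the cell is opened and its children are examined, exactly as in the ordinary Barnes–Hut traversal.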
4.4. A new formulation of smoothed particle hydrodynamics
Recently, there have been several comparison studies between SPH and Eulerian grid codes, in which SPH was demonstrated to fail in handling the Kelvin–Helmholtz (KH) instability. The essential reason for this failure is that the standard SPH formulation uses the "smoothed" density to evaluate all other quantities, while there is a discontinuity of the density at the contact discontinuity where the KH instability develops.
In standard SPH, the smoothed estimate of a physical quantity f is expressed as
f(\mathbf{r}_i) \simeq \sum_j f_j \frac{m_j}{\rho_j} W(r_{ij}, h_i) ,  (24)
where r_ij = |r_ij| and r_ij = r_i - r_j, h_i is the kernel size of particle i, and W is the smoothing kernel. By substituting ρ for f, we obtain
\rho_i \simeq \sum_j m_j W(r_{ij}, h_i) ,  (25)
where ρ_i ≡ ρ(r_i) is the smoothed density at the position of particle i. Since the right-hand side of this equation contains no unknown quantities, it is evaluated first. The equations of motion and energy in the SPH expression are given by
\frac{d^2 \mathbf{r}_i}{dt^2} = - \sum_j m_j \left( \frac{P_i}{\rho_i^2} + \frac{P_j}{\rho_j^2} \right) \nabla \tilde{W}_{ij} ,  (26)
and
\frac{du_i}{dt} = \sum_j m_j \frac{P_i}{\rho_i^2} \mathbf{v}_{ij} \cdot \nabla \tilde{W}_{ij} ,  (27)
respectively. Here, W̃_ij is the symmetrized kernel, given by W̃_ij = (1/2)[W(r_ij, h_i) + W(r_ij, h_j)], and v_ij = v_i - v_j. Equations (25), (26), and (27) are closed with the equation of state (EOS),
P = (\gamma - 1)\rho u ,  (28)
where γ is the specific heat ratio and u is the specific internal energy.
In the standard SPH, the density is evaluated first and other quantities are calculated using the smoothed density. Consider the case that two media with uniform but different densities come into contact with a sharp density jump. Since the standard SPH evaluates the density by convolution with the kernel function, the smoothed density at the interface is always overestimated on the low-density side and underestimated on the high-density side. This error in the smoothed density propagates to the calculated pressure, resulting in a repulsive force at the interface, which is the cause of the large error in the motion.
Conceptually, the reason why the density is necessary to obtain the smoothed estimates of other quantities is that we need the volume element associated with each particle, and in the standard SPH we use m_i/ρ_i for that purpose. In principle, one can use any physical quantity to obtain the volume element. Since the difficulty at the contact discontinuity is due to the jump in density, it is desirable to use a quantity which is not discontinuous there, and the obvious candidate is the pressure P. In the case of the ideal gas,
\Delta V_j = \frac{(\gamma - 1) U_j}{P_j} ,  (29)
where U_i = m_i u_i is the internal energy of particle i, gives the volume element, and we can derive a new set of SPH equations in a similar fashion to the standard set. The final forms of the equation of motion and the energy equation are
m_i \frac{d\mathbf{v}_i}{dt} = -(\gamma - 1) \sum_j U_i U_j \left( \frac{1}{q_i} + \frac{1}{q_j} \right) \nabla \tilde{W}_{ij} ,  (30)
and
\frac{dU_i}{dt} = (\gamma - 1) \sum_j \frac{U_i U_j}{q_i} \mathbf{v}_{ij} \cdot \nabla \tilde{W}_{ij} ,  (31)
respectively (see [56] for detailed derivations of these equations). It is obvious that the equation of motion does not include the density. Instead, it includes the energy density q, whose smoothed estimate, q_i = Σ_j U_j W(r_ij, h_i), is obtained in the same way as the density in Eq. (25). Thus, this formulation should show good behavior at the contact discontinuity.
The standard SPH needs to adopt an artificial viscosity term to capture shocks. Our new SPH also adopts an artificial viscosity term, which is the same as the one used in the standard SPH. According to our tests, the artificial viscosity term used in the standard SPH, with the smoothed density estimate, works well with our new SPH.
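To make the structure of Eqs. (30) and (31) explicit, the following Python sketch evaluates their right-hand sides for a small particle set, together with the smoothed energy density that replaces Eq. (25). The Gaussian kernel, the common smoothing length, and the function names are ours, and the artificial viscosity term is omitted for brevity.

import numpy as np

def kernel(r, h):
    """Gaussian kernel in 3D (used here only for illustration)."""
    return np.exp(-(r / h)**2) / (np.pi**1.5 * h**3)

def grad_kernel(dr, h):
    """Gradient of the Gaussian kernel with respect to r_i."""
    r = np.linalg.norm(dr)
    return -2.0 * dr / h**2 * kernel(r, h)

def disph_rhs(pos, vel, m, u, h, gamma=5.0 / 3.0):
    """Accelerations and dU/dt of the energy-density formulation
    (Eqs. 30 and 31), with a common smoothing length h for all particles."""
    n = len(m)
    U = m * u
    # smoothed internal energy density, analogous to Eq. (25)
    q = np.array([sum(U[j] * kernel(np.linalg.norm(pos[i] - pos[j]), h)
                      for j in range(n)) for i in range(n)])
    acc = np.zeros_like(pos)
    dUdt = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gW = grad_kernel(pos[i] - pos[j], h)   # symmetric, since h_i = h_j
            acc[i] += -(gamma - 1.0) * U[i] * U[j] \
                      * (1.0 / q[i] + 1.0 / q[j]) * gW / m[i]
            dUdt[i] += (gamma - 1.0) * U[i] * U[j] / q[i] \
                       * np.dot(vel[i] - vel[j], gW)
    return acc, dUdt

The density never appears in the momentum equation here, which is the point of the formulation: only the smoothed energy density q, which is continuous across a contact discontinuity, enters the pressure force.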
We have performed many detailed tests of the behavior of the new formulation [56]. In this review, we show the result of a very simple test, the evolution of a hydrostatic fluid system. By definition, a hydrostatic fluid is static, which means it should not show any evolution. Figure 11 shows the evolution of a two-fluid system with a density contrast of 64 over eight sound-crossing times. In the calculation with the standard SPH (the top row), the boundary of the two fluids evolves from a square shape to a much rounder shape, and a wide empty ring structure develops between the two fluids. This unphysical evolution is induced by the artificial surface tension at the density jump. In contrast, in the calculation with our SPH method (the bottom row), the system shows virtually no evolution except for small local adjustments of the positions of particles, which would occur even for a single-fluid system.
Density jumps are everywhere in the universe. Therefore, the fact that the standard SPH can fail so miserably in modeling such a jump is rather worrisome, and this failure explains why SPH had failed in many comparison tests with grid codes. We believe our new form has the potential to replace the standard, density-based form. One might imagine that our new SPH would fail to handle strong shocks, since the pressure jump in the shock region is orders of magnitude larger than the density jump. However, as long as we adopt the smoothed density for the evaluation of the artificial viscosity, there is no critical trouble under strong-shock conditions. See also [56]. Our form can be extended to non-ideal EOSs, and we are currently testing one such extended formulation. One problem is that when the pressure becomes nearly zero, our formulation fails. In such a case, a formulation which explicitly solves the continuity equation [57] might be preferred.
Fig. 11. Evolution of a two-fluid system with a density contrast of 64. Snapshots at t = 0.1, 0.3, 0.5, 1.0, and 8.0 are shown. The red and blue points indicate the positions of particles with ρ = 64 and ρ = 1, respectively. The upper row shows the results of the standard SPH, whereas the lower row shows those of our new SPH. The particle separation is constant and the particle mass ratio is 1:64.
5. Final words
In hindsight, the 1990s was a very good period for the development of a special-purpose architecture such as GRAPE, for two reasons. First, semiconductor technology reached the point where many floating-point arithmetic units could be integrated into a chip. Second, the initial design cost of a chip was still within the reach of fairly small research projects in basic science. Now, semiconductor technology has reached the point where one can integrate thousands of arithmetic units into a chip. On the other hand, the initial design cost of a chip has become too high. The use of FPGAs and the GRAPE-DR approach are two examples of ways to tackle the problem of the increasing initial cost. However, unless one can keep increasing the budget, the GRAPE-DR approach is not viable, simply because it still means an exponential increase in the initial, and therefore total, cost of the project. On the other hand, such an increase in the budget might not be impossible, since the field of computational science as a whole is becoming more and more important. Even though a supercomputer is expensive, it is still much less expensive than, for example, particle accelerators or space telescopes. Of course, computer simulation cannot replace real experiments or observations, but computer simulations have become essential in many fields of science and technology.
In addition, there are several technologies available in between FPGAs and custom chips. One is what is called a "structured ASIC". It requires customization of typically just one metal layer, resulting in a large reduction in the initial cost. The number of gates one can fit into a given silicon area falls between those of FPGAs and custom chips. We are currently working on a new fully pipelined system based on this structured-ASIC technology. The price of the chip is not very low, but in the current plan it gives extremely good performance for very low energy consumption.
When we look back at the evolution of numerical schemes, our impression is that many of the schemes, including those we have developed, are still rather crude, and there are many possibilities for improvement, both in the direction of parallelization and in that of more sophisticated numerical schemes. We hope this review helps readers get some ideas about new directions.
Acknowledgements
This work is supported in part by a Grant-in-Aid for Scientific Research (21244020) and the Strategic Programs for Innovative Research (SPIRE) of the Ministry of Education, Culture, Sports, Science and Technology.
References
[1] J. Barnes and P. Hut, Nature 324, 446 (1986).
[2] L. Greengard and V. Rokhlin, J. Comput. Phys. 73, 325 (1987).
[3] T. Ito, J. Makino, T. Ebisuzaki, and D. Sugimoto, Comput. Phys. Commun. 60, 187 (1990).
[4] J. Makino, T. Ito, and T. Ebisuzaki, Publ. Astron. Soc. Jpn. 42, 717 (1990).
[5] J. Makino and S. J. Aarseth, Publ. Astron. Soc. Jpn. 44, 141 (1992).
[6] J. Makino and M. Taiji, Scientific Simulations with Special-Purpose Computers — The GRAPE Systems (Wiley, Chichester, UK, 1998).
[7] A. Kawai, T. Fukushige, J. Makino, and M. Taiji, Publ. Astron. Soc. Jpn. 52, 659 (2000).
[8] J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto, Astrophys. J. 480, 432 (1997).
[9] J. Makino, T. Fukushige, M. Koga, and K. Namura, Publ. Astron. Soc. Jpn. 55, 1163 (2003).
[10] T. Hamada, T. Fukushige, A. Kawai, and J. Makino, Publ. Astron. Soc. Jpn. 52, 943 (2000).
[11] S. J. Aarseth, Mon. Not. R. Astron. Soc. 126, 223 (1963).
[12] S. J. Aarseth, Direct methods for N-body simulations. In Multiple Time Scales, eds. J. U. Brackbill and B. I. Cohen (Academic Press, New York, 1985), pp. 377–418.
[13] P. Hut, J. Makino, and S. McMillan, Astrophys. J. Lett. 443, L93 (1995).
[14] K. Nitadori and J. Makino, New Astron. 13, 498 (2008).
[15] J. K. Salmon and M. S. Warren, J. Comput. Phys. 111, 136 (1994).
[16] S. L. W. McMillan, The vectorization of small-N integrators. In The Use of Supercomputers in Stellar Dynamics, eds. P. Hut and S. L. W. McMillan, Lecture Notes in Physics, Vol. 267 (Springer, Berlin, 1986), p. 156.
[17] J. Makino, Publ. Astron. Soc. Jpn. 43, 859 (1991).
[18] J. Makino, New Astron. 7, 373 (2002).
[19] L. Hernquist, Astrophys. J. Suppl. 64, 715 (1987).
[20] J. E. Barnes, J. Comput. Phys. 87, 161 (1990).
[21] L. Hernquist, J. Comput. Phys. 87, 137 (1990).
[22] J. Makino, J. Comput. Phys. 87, 148 (1990).
[23] M. S. Warren and J. K. Salmon, Astrophysical N-body simulations using hierarchical tree data structures (IEEE Comp. Soc., Los Alamitos, 1992), pp. 570–576.
[24] J. Makino and P. Hut, Comput. Phys. Rep. 9, 199 (1989).
[25] J. Makino, Publ. Astron. Soc. Jpn. 56, 521 (2004).
[26] T. Ishiyama, T. Fukushige, and J. Makino, Publ. Astron. Soc. Jpn. 61, 1319 (2009).
[27] L. Hernquist and N. Katz, Astrophys. J. Suppl. 70, 419 (1989).
[28] S. L. W. McMillan and S. J. Aarseth, Astrophys. J. 414, 200 (1993).
[29] M. Fujii, M. Iwasawa, Y. Funato, and J. Makino, Publ. Astron. Soc. Jpn. 59, 1095 (2007).
[30] S. Oshino, Y. Funato, and J. Makino, Publ. Astron. Soc. Jpn. 63, 881 (2011).
[31] N. Katz and J. E. Gunn, Astrophys. J. 377, 365 (1991).
[32] J. F. Navarro and W. Benz, Astrophys. J. 380, 320 (1991).
[33] N. Katz, Astrophys. J. 391, 502 (1992).
[34] M. Steinmetz and E. Mueller, Astron. Astrophys. 281, L97 (1994).
[35] J. F. Navarro and M. Steinmetz, Astrophys. J. 478, 13 (1997).
[36] M. Steinmetz and J. F. Navarro, Astrophys. J. 513, 555 (1999).
[37] R. J. Thacker and H. M. P. Couchman, Astrophys. J. Lett. 555, L17 (2001).
[38] M. G. Abadi, J. F. Navarro, M. Steinmetz, and V. R. Eke, Astrophys. J. 591, 499 (2003).
[39] J. Sommer-Larsen, M. Götz, and L. Portinari, Astrophys. J. 596, 47 (2003).
[40] T. R. Saitoh and K. Wada, Astrophys. J. Lett. 615, L93 (2004).
[41] F. Governato et al., Astrophys. J. 607, 688 (2004).
[42] F. Governato et al., Mon. Not. R. Astron. Soc. 374, 1479 (2007).
[43] F. Governato et al., Nature 463, 203 (2010).
[44] C. B. Brook et al., Mon. Not. R. Astron. Soc. 415, 1051 (2011).
[45] J. Guedes, S. Callegari, P. Madau, and L. Mayer, Astrophys. J. 742, 76 (2011).
[46] J. Makino, Publ. Astron. Soc. Jpn. 43, 859 (1991).
[47] T. R. Saitoh and J. Makino, Astrophys. J. Lett. 697, L99 (2009).
[48] W. B. Streett, D. J. Tildesley, and G. Saville, Mol. Phys. 35, 639 (1978).
[49] T. R. Saitoh and J. Makino, Publ. Astron. Soc. Jpn. 62, 301 (2010).
[50] A. Tanikawa, K. Yoshikawa, K. Nitadori, and T. Okamoto, arXiv:1203.4037.
[51] A. W. Appel, SIAM J. Sci. Stat. Comput. 6, 85 (1985).
[52] V. Springel, N. Yoshida, and S. D. M. White, New Astron. 6, 79 (2001).
[53] T. R. Saitoh and J. Makino, New Astron. 17, 76 (2012).
[54] S. D. M. White, Mon. Not. R. Astron. Soc. 177, 717 (1976).
[55] S. J. Aarseth and S. M. Fall, Astrophys. J. 236, 43 (1980).
[56] T. R. Saitoh and J. Makino, arXiv:1202.4277.
[57] J. J. Monaghan, J. Comput. Phys. 110, 399 (1994).