Prog. Theor. Exp. Phys. 2012, 01A303 (27 pages)
DOI: 10.1093/ptep/pts029

Astrophysics with GRAPE

Junichiro Makino* and Takayuki Saitoh
Interactive Research Center of Science, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo 152-8551, Japan
*E-mail: [email protected]

Received May 25, 2012; Accepted July 23, 2012; Published October 4, 2012

In this paper we provide an overview of the GRAPE (GRAvity PipE) project and related developments in astrophysics. The basic idea of the GRAPE project is to develop computers specialized for N-body simulations and to use them to perform large-scale simulations which cannot be done easily on general-purpose computers. The first GRAPE system, GRAPE-1, was completed in 1989. Since then more than ten systems have been developed and used for research in many fields of astrophysics. Some GRAPE systems were specifically designed for molecular dynamics simulations. In this paper we first give a brief overview of the history of the GRAPE project, and then try to give a systematic view of the advance of the numerical methods, in particular how they have been driven by the nature of the systems to be simulated and by the advance of semiconductor and computer technology.

1. Introduction

In many simulations in astrophysics it is necessary to solve gravitational N-body problems. In some cases, such as the study of the formation of galaxies or stars, it is important to treat non-gravitational effects such as the hydrodynamical interaction, radiation, and magnetic fields, but even in these simulations the calculation of gravity is usually the most time-consuming part.

To solve the gravitational N-body problem we need to evaluate the gravitational forces on all bodies (particles) in the system from all the other particles in the system. There are many ways to do so. The simplest is to calculate all pairwise interactions, which is the most efficient for systems with a relatively small number of particles (less than 10,000) and is still widely used in many applications. When the number of particles is much larger than 10,000, one can significantly accelerate the calculation using the Barnes–Hut tree algorithm [1] or the Fast Multipole Method (FMM) [2]. Even with these methods, however, the calculation of the gravitational interaction between particles (or between particles and multipole expansions of groups of particles) is the most time-consuming part of the calculation. Therefore, one can greatly improve the speed of the entire simulation just by accelerating the calculation of the particle–particle interaction.

This is the basic idea behind GRAPE computers. Figure 1 shows the basic idea. The system consists of a host computer and special-purpose hardware, and the special-purpose hardware handles the calculation of gravitational interactions between particles. The host computer performs other calculations such as the time integration of particles, I/O, and diagnostics.

Fig. 1. Basic structure of a GRAPE system.
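As a minimal illustration of this division of labor, the following sketch (plain Python with numpy; accelerator_forces is a hypothetical stand-in for the call into the GRAPE library, not an actual API) shows the host-side loop: the host does the time integration, I/O, and diagnostics, and hands the O(N^2) force summation to the accelerator.

```python
import numpy as np

def accelerator_forces(pos, mass, eps2):
    # Stand-in for the GRAPE pipeline: all-pairs softened gravity (G = 1).
    dx = pos[None, :, :] - pos[:, None, :]          # x_j - x_i
    r2 = (dx ** 2).sum(axis=2) + eps2
    w = mass[None, :] / r2 ** 1.5
    np.fill_diagonal(w, 0.0)                        # no self-interaction
    return (w[:, :, None] * dx).sum(axis=1)

def host_loop(pos, vel, mass, dt, nsteps, eps2=1e-4):
    # Host side: leapfrog time integration; only the force evaluation
    # is "offloaded" to the special-purpose hardware.
    acc = accelerator_forces(pos, mass, eps2)
    for _ in range(nsteps):
        vel += 0.5 * dt * acc
        pos += dt * vel
        acc = accelerator_forces(pos, mass, eps2)   # the expensive part
        vel += 0.5 * dt * acc
    return pos, vel
```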
Even with the latest GRAPE-8 system, we still maintain this basic structure, in which special-purpose hardware is connected to a general-purpose computer. However, in the two decades since the completion of GRAPE-1 the overall speed has increased by a factor of nearly 10^6. This increase in speed made it necessary to develop new algorithms, in particular efficient parallel algorithms, since a large part of the speedup is realized by the increase in the number of pipeline processors. More importantly, this increase in speed makes it possible to address complex problems such as the formation of galaxies. In order to study the galaxy formation process we need to model the gas dynamics of interstellar gas; its radiative cooling; the star formation process from cool, dense gas; and the heating of gas due to supernova explosions. These processes require new algorithms. Thus, in this paper we first provide an overview of the history of the GRAPE hardware in Sect. 2, then try to give a consistent view of the algorithms used for N-body systems in Sect. 3, and then present an overview of the recent development of new algorithms for galaxy formation simulations in Sect. 4.

2. History

2.1. GRAPEs 1, 2, and 3

The GRAPE project was started in 1988. GRAPE-1 [3] was the first machine we completed, and it became operational in September 1989. It was a single-board unit on which around 100 IC and LSI chips were mounted and wire-wrapped. The pipeline processor of GRAPE-1 consisted of commercially available IC and LSI chips. GRAPE-1 was the first machine we ever built, and using commercially available chips was the only practical possibility: we did not have the knowledge or budget to design custom LSI chips.

For GRAPE-1 we used an unusually short word format, to make the hardware as simple as possible. The first subtraction of the position vectors was done in 16-bit fixed-point format, and the final accumulation of the force was done in 48-bit fixed-point format. All other operations were done in an 8-bit logarithmic format, in which 3 bits are used for the "fractional" part. This choice simplified the hardware significantly, since we could use a single 512-kbit EPROM chip to express any binary operation. For example, x^2 + y^2 was evaluated by one chip.

This use of an extremely short word format in GRAPE-1 was based on a detailed theoretical analysis of error propagation and on numerical experiments [4]. The reason why a short word format can be used is related to the fundamental nature of N-body simulation. In the simulation of "collisionless" systems, for which the simulation timescale is much shorter than the thermal relaxation time of the system, the primary source of numerical error is the artificial two-body relaxation caused by the finite number of particles, which is usually many orders of magnitude smaller than the actual number of particles in the system. Since we express the actual system with a smaller number of particles, the potential field fluctuates, and the amplitude of the fluctuation is proportional to 1/√N. This fluctuation is the cause of the two-body relaxation. We have proved that, as long as the error can be regarded as random, the numerical error due to the short word format has the same effect as the two-body relaxation itself, but with a numerical coefficient equal to the relative accuracy of the pairwise force. Thus, we need a word length just long enough that the error behaves as effectively random.
GRAPE-1 was used for studying the merging of spherical galaxies and violent relaxation.

For GRAPE-1 we used an IEEE-488 (GP-IB) interface for communication with the host. In addition, GRAPE-1 could only handle particles with the same mass. These choices were made to simplify the hardware design. As will be discussed in Sect. 3, for simulations of collisionless systems we do not use the O(N^2) direct summation method. We use fast methods like the Barnes–Hut tree algorithm or FMM (see Sect. 3 for a detailed discussion of these methods). In order to use the GRAPE hardware with these methods, it must be able to handle particles of unequal mass (physical particles and center-of-mass particles), and communication with the host computer should be much faster. GRAPE-1A was designed for this purpose. For GRAPE-1A a VME bus was used, and it provided a speed of around 4 MB/s, faster than the communication speed of GRAPE-1 by nearly two orders of magnitude.

For the simulation of collisional systems, in other words when the simulation timescale is longer than the thermal relaxation time of the system, the very short word format used for GRAPE-1 is not appropriate, since the variation of the total energy of the system can become noticeable. We developed GRAPE-2 for the simulation of collisional systems. In GRAPE-2, in order to achieve higher accuracy, commercial LSI chips for floating-point arithmetic operations, such as the TI SN74ACT8847 and Analog Devices ADSP3201/3202, were used. The pipeline of GRAPE-2 processes the three components of the interaction sequentially, accumulating one interaction every three clock cycles. This approach was adopted to reduce the circuit size. Its speed was around 40 Mflops, but it was still much faster than the workstations or minicomputers of that time. GRAPE-2 was used for many problems, including the study of the runaway growth of protoplanets and the merging of two galaxies with central massive black holes. GRAPE-2A was designed to handle interactions with arbitrary functional form, so that it could be used for molecular dynamics simulations.

GRAPE-3 was the first GRAPE computer with a custom LSI chip. The number format was a combination of the fixed-point and logarithmic formats similar to those used in GRAPE-1. The chip was fabricated with a 1 µm design rule by National Semiconductor. The number of transistors on the chip was 110 K. The chip operated at a 20 MHz clock speed, offering an overall speed of about 0.8 Gflops. Printed circuit boards with 8 chips were mass produced, for a speed of 6.4 Gflops per board. Thus, GRAPE-3 was also the first GRAPE computer to integrate multiple pipelines into a system. GRAPE-3 was also the first GRAPE computer to be manufactured and sold by a commercial company. Nearly 100 copies of GRAPE-3 have been sold to more than 30 institutes (more than 20 outside Japan).

2.2. GRAPEs 4, 5, and 6

In 1992 we started the development of GRAPE-4, with a target performance of 1 Tflops. At that time, the number of floating-point units which could be integrated into one LSI chip was around 20, and the practical clock frequency was 30 MHz. This means we could achieve around 600 Mflops per chip, so around 1,700 chips were necessary to achieve 1 Tflops. Each chip integrated one pipeline unit similar to that of GRAPE-2. This chip calculates the first time derivative of the force, so that a fourth-order Hermite scheme [5] can be used.
The chip was fabricated with a 1 µm design rule by LSI Logic, with a total transistor count of about 400 K. The completed GRAPE-4 system consisted of 1728 pipeline chips (36 PCBs, each with 48 pipeline chips). It operated on a 32 MHz clock, delivering a speed of 1.1 Tflops. Technical details of the machines from GRAPE-1 through GRAPE-4 can be found in our book [6] and references therein.

Table 1. History of the GRAPE project.

  GRAPE-1   (89/4–89/10)  310 Mflops, low accuracy
  GRAPE-2   (89/8–90/5)   50 Mflops, high accuracy (32 bit/64 bit)
  GRAPE-1A  (90/4–90/10)  310 Mflops, low accuracy
  GRAPE-3   (90/9–91/9)   18 Gflops, low accuracy
  GRAPE-2A  (91/7–92/5)   230 Mflops, high accuracy
  HARP-1    (92/7–93/3)   180 Mflops, high accuracy, Hermite scheme
  GRAPE-3A  (92/1–93/7)   8 Gflops/board; some 80 copies are used all over the world
  GRAPE-4   (92/7–95/7)   1 Tflops, high accuracy; some 10 copies of small machines
  MD-GRAPE  (94/7–95/4)   1 Gflops/chip, high accuracy, programmable interaction
  GRAPE-5   (96/4–99/8)   5 Gflops/chip, low accuracy
  GRAPE-6   (97/8–02/3)   64 Tflops, high accuracy

Fig. 2. The evolution of GRAPE and general-purpose parallel computers. The peak speed is plotted against the year of delivery. Open circles, crosses, and stars denote GRAPEs, vector processors, and parallel processors, respectively.

GRAPE-5 [7] was an improvement over GRAPE-3. It integrated two full pipelines which operate on an 80 MHz clock. Thus, a single GRAPE-5 chip offered 8 times the speed of the GRAPE-3 chip, or the same speed as an 8-chip GRAPE-3 board. GRAPE-5 was awarded the 1999 Gordon Bell Prize for price–performance. The GRAPE-5 chip was fabricated with a 0.35 µm design rule by NEC.

Table 1 summarizes the history of the GRAPE project. Figure 2 shows the evolution of GRAPE systems and general-purpose parallel computers. One can see that the evolution of GRAPE is faster than that of general-purpose computers.

GRAPE-6 was essentially a scaled-up version of GRAPE-4 [8], with a peak speed of around 64 Tflops. The peak speed of a single pipeline chip was 31 Gflops. In comparison, GRAPE-4 consisted of 1728 pipeline chips, each providing 600 Mflops. The factor of 50 increase in per-chip speed was achieved by integrating six pipelines into one chip (the GRAPE-4 chip had a single pipeline, which needed three cycles to calculate the force from one particle) and by using a three times higher clock frequency. The advances in device technology (from 1 µm to 0.25 µm) made these improvements possible. Figure 3 shows the processor chip, delivered in early 1999; the six pipeline units are visible.

Fig. 3. The GRAPE-6 processor chip.

The completed GRAPE-6 system consisted of 64 processor boards, grouped into 4 clusters of 16 boards each. Within a cluster, the 16 boards are organized in a 4-by-4 matrix with 4 host computers. They are organized so that the effective communication speed is proportional to the number of host computers. In a simple configuration, the effective communication speed would be independent of the number of host computers. The details of the network used in GRAPE-6 are given in [9].

2.3. GRAPE-DR

In 2004 we started the development of GRAPE-DR. It has an architecture quite different from that of previous GRAPE hardware. Instead of hardwired pipelines for the gravitational interaction, a GRAPE-DR processor chip integrates a large number of very simple processors which operate in a SIMD fashion. This rather drastic change in design was made to extend the application area.
At least part of the reason we tried to extend the application area was to justify the large initial cost of custom LSI chips. In 1990 the initial cost of a custom chip was around 150 K USD; in 1997 it was around 1 M USD; and in 2004 it was more than 3 M USD. The total grant necessary to complete a system is around four times the initial cost of the LSI chip, so we had to obtain a grant of 10–15 M USD. Such a large grant was impractical for a system which could solve only astrophysical N-body problems.

GRAPE-DR is an acronym for "Greatly Reduced Array of Processor Elements with Data Reduction". The last part, "Data Reduction", means that it has an on-chip tree network which can perform various reduction operations such as summation, max/min, and logical and/or. When we use GRAPE-DR as a GRAPE, this summation network is used to add the partial forces on one particle calculated on multiple processors on one chip.

The GRAPE-DR project was started in FY 2004 and finished in FY 2008. The GRAPE-DR processor chip consists of 512 simple processors which can operate at a clock speed of 500 MHz, for 512 Gflops of single-precision peak performance (256 Gflops double precision). It was fabricated by TSMC with a 90 nm process; the die size is around 300 mm^2 and the peak power consumption is around 60 W. The GRAPE-DR processor board (Fig. 4) houses 4 GRAPE-DR chips, each with its own local DRAM chips. It communicates with the host computer through a Gen1 16-lane PCI-Express interface. This card gives a theoretical peak performance of 819 Gflops (in double precision) at a clock speed of 400 MHz. The actual performance numbers are 640 Gflops for matrix-matrix multiplication, 430 Gflops for LU decomposition, and 500 Gflops for direct N-body simulation with individual timesteps (Fig. 5). These numbers are typically a factor of two or more better than the best performance numbers so far reported with GPGPUs.

Fig. 4. The GRAPE-DR processor board.

Fig. 5. The performance of the individual timestep scheme on a single-card GRAPE-DR in Gflops, plotted as a function of the number of particles.

In the case of parallel LU decomposition, the measured performance was 24 Tflops on a 64-board, 64-node system. The average power consumption of this system during the calculation was 29 kW, and thus the performance per watt is 815 Mflops/W. This number was listed as No. 1 in the Little Green 500 list of June 2010. Thus, from a technical point of view, we believe that the GRAPE-DR project was highly successful in making multi-purpose computers with the highest single-card performance and the highest performance per watt.

Fig. 6. The GRAPE-DR cluster.

2.4. PROGRAPE and GRAPE-7

Another way to reduce the high initial cost is to use FPGA (field-programmable gate array) chips. An FPGA chip is programmable in the sense that one item of hardware can be used to realize an arbitrary logic design. Conceptually, an FPGA consists of programmable logic elements connected by a programmable network. Here, a "programmable" logic element is typically just a small SRAM block which can express any combinatorial logic. A "programmable" network similarly means wires with multiplexers whose select inputs are connected to small SRAM blocks. By loading configuration data into the SRAM blocks, one can use one FPGA chip to express any logic design, as long as it fits into the chip. FPGA chips are thus mass produced, and no initial cost is necessary.
The drawback of FPGAs is that their transistor efficiency is much lower than that of a custom design, and their operating frequency is somewhat lower. Thus, the performance of an FPGA chip is typically lower than that of a custom LSI chip by a factor of ten or so. However, even with this low efficiency, a pipeline processor implemented on FPGA chips can have large advantages over software implementations on general-purpose processors. This is especially true for pipeline processors with a very short word format. Hamada [10] described PROGRAPE-1, in which we used FPGAs to implement low-accuracy GRAPE hardware. Several generations of such hardware have been built, and the latest one is GRAPE-7. Using four large FPGAs, it gives a peak performance of 830 Gflops for low-accuracy gravitational force calculations.

2.5. GRAPE-8

GRAPE-8 is the latest generation of GRAPE hardware, based on the relatively new technology called structured ASIC, which is something in between a custom LSI and an FPGA. Its design is similar to an FPGA, but its function is determined by customizing one layer of wiring and via holes. Thus, the initial development cost of a structured ASIC chip is much lower than that of a custom LSI, and yet its mass production cost (or price per logic gate) is significantly lower than that of an FPGA. In principle, therefore, structured ASIC technology can be very useful for the development of special-purpose computers such as GRAPE. For GRAPE-8 we used the N2X740 chip from eASIC Corporation. It integrates 48 pipeline processors, similar to those of GRAPE-5 but with somewhat higher accuracy, and also an additional cutoff-function unit to be used with P^3M or P^3T schemes. The GRAPE-8 processor board, with two processor chips and one interface FPGA chip, provides a speed of 960 Gflops for a power consumption of around 40 W.

2.6. GRAPEs for molecular dynamics

Molecular dynamics is rather similar to astrophysical N-body simulation, except that atoms interact through van der Waals and Coulomb forces instead of the gravitational force. Thus, pipeline processors similar to GRAPE can be used to accelerate molecular dynamics simulations. In fact, two pipeline processors for molecular dynamics simulations were built years before GRAPE-1. The first one was DMDP, which was designed as a complete hardware processor for simulation: not only the force calculation but also the integration of the orbits of atoms and the calculation of the output physical quantities were done on a specialized pipeline processor. FASTRUN was an accelerator processor for molecular dynamics simulations of protein molecules.

GRAPE-2A was the first GRAPE hardware for molecular dynamics. It uses commercial floating-point LSI chips, as in the case of GRAPE-2. With MD-GRAPE, one pipeline processor similar to that of GRAPE-2A is implemented in one LSI chip. MDM is a massively parallel development of MD-GRAPE which achieved a peak speed of 75 Tflops. Protein Explorer achieved 1 Pflops.

In the US, a specialized processor named ANTON was developed. It can be regarded as a design similar to GRAPE, but with a programmable processor and network interfaces integrated on one chip with the pipeline processors. This design was apparently chosen to reduce the communication latency between the pipeline processors and the general-purpose processor, as well as between general-purpose processors.
Thus, ANTON achieved an extremely short calculation time per MD step, around two orders of magnitude shorter than that of any other machine, including GRAPEs.

3. Algorithms I: Pure N-body dynamics

Starting with the 300-Mflops GRAPE-1, the calculation speed of GRAPE systems increased by nearly six orders of magnitude in two decades. This increase was primarily driven by the increase in the number of arithmetic units, or, in the case of most GRAPE systems, the number of pipelines. GRAPE-1 was a single-pipeline system; GRAPE-6 had 12,288 pipeline units. This increase in the degree of parallelism has been achieved by using various algorithms. Some of them are modifications of previously known ones and some are newly developed. In this section we review the algorithms used. In Sects. 3.1, 3.2, and 3.3 we discuss the algorithms in the time domain, and in Sects. 3.4 and 3.5 we discuss the algorithms in the space domain. In Sect. 3.6 we discuss the issue of parallelization. Finally, we discuss how we can combine algorithms in space and time in Sects. 3.7 and 3.8.

3.1. Individual and block timestep method

The individual timestep scheme [11,12] is the basic integration scheme for astrophysical N-body problems. It has remained the standard for more than 30 years following its invention in 1960.

Fig. 7. Schematic description of the individual timestep algorithm.

Figure 7 illustrates how the individual timestep scheme works. We consider a system of n particles. In the individual timestep scheme, each particle has its own time and timestep. We denote the time and timestep of particle i as t_i and Δt_i. We first select the particle i for which the value of t_i + Δt_i is a minimum, and then we integrate the position and velocity of this particle to its new time and update its timestep. In order to integrate particle i, we first predict the positions of all particles at time t_i + Δt_i, then calculate the force on particle i and apply the correction to the position and velocity of particle i. Thus, the traditional way to use the individual timestep scheme is to combine it with one of the multistep predictor–corrector schemes. Historically, four-step, fourth-order schemes with variable stepsize were used.

In most modern implementations of the individual timestep scheme or its variants, the Hermite integration scheme [5] is used. It is based on the Hermite interpolation method. Hermite interpolation is similar to Newton interpolation, which is the basis of the traditional variable-stepsize linear multistep scheme. In Newton interpolation we use only the values of the function f. With Hermite interpolation, we use the values of the derivatives of f in addition to f to construct the interpolation formula. In the case of a gravitational N-body system, the first derivative of the acceleration can be calculated at a small additional cost. The acceleration and its first time derivative are given by

    a_i = Σ_j G m_j r_ij / (r_ij^2 + ε^2)^{3/2},                                              (1)

    ȧ_i = Σ_j G m_j [ v_ij / (r_ij^2 + ε^2)^{3/2} − 3 (v_ij · r_ij) r_ij / (r_ij^2 + ε^2)^{5/2} ],   (2)

where

    r_ij = x_j − x_i,                                                                          (3)
    v_ij = v_j − v_i.                                                                          (4)

Here, ε is the softening parameter.

With the jerk calculated directly, the construction of higher-order integrators is simplified significantly. For example, the simplest explicit scheme is now second order in time, instead of first order.
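Equations (1)–(4) translate directly into code. The numpy sketch below (G = 1, all pairs, for illustration only; on GRAPE the same expressions are evaluated by the hardware pipeline) returns the acceleration and jerk of every particle:

```python
import numpy as np

def acc_and_jerk(pos, vel, mass, eps2):
    # a_i and adot_i of Eqs. (1)-(2), with G = 1.
    dx = pos[None, :, :] - pos[:, None, :]        # r_ij = x_j - x_i, Eq. (3)
    dv = vel[None, :, :] - vel[:, None, :]        # v_ij = v_j - v_i, Eq. (4)
    r2 = (dx ** 2).sum(axis=2) + eps2             # r_ij^2 + eps^2
    r3inv = r2 ** -1.5
    r5inv = r2 ** -2.5
    np.fill_diagonal(r3inv, 0.0)                  # exclude self-interaction
    np.fill_diagonal(r5inv, 0.0)
    rv = (dx * dv).sum(axis=2)                    # v_ij . r_ij
    acc = ((mass[None, :] * r3inv)[:, :, None] * dx).sum(axis=1)
    jerk = ((mass[None, :] * r3inv)[:, :, None] * dv
            - 3.0 * (mass[None, :] * rv * r5inv)[:, :, None] * dx).sum(axis=1)
    return acc, jerk
```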
In the following, we present the complete formulas for a two-step, fourth-order predictor-corrector scheme. The predictor is given by

    x_p = x_0 + Δt v_0 + (Δt^2/2) a_0 + (Δt^3/6) ȧ_0,    (5)

    v_p = v_0 + Δt a_0 + (Δt^2/2) ȧ_0,                    (6)

where x_p and v_p are the predicted position and velocity; x_0, v_0, a_0, and ȧ_0 are the position, velocity, acceleration, and its time derivative at time t_0; and Δt is the timestep. The corrector is given by the following formulas (see, for example, [13]):

    x_c = x_0 + (Δt/2)(v_c + v_0) − (Δt^2/12)(a_1 − a_0),       (7)

    v_c = v_0 + (Δt/2)(a_1 + a_0) − (Δt^2/12)(ȧ_1 − ȧ_0).       (8)

The predictor formulas use only "instantaneous" quantities that are calculated directly from the position and velocity at the present time. Compared to a scheme which has to keep track of values at previous timesteps, the program becomes much simpler. The merit of the Hermite scheme is, however, not just the simplicity of the formulas. The local truncation error of Hermite interpolation is several orders of magnitude smaller than that of Newton interpolation with the same order and stepsize. Therefore, the Hermite scheme allows a significantly longer timestep than that used for the Aarseth scheme. Of course, this advantage is partially offset by the additional cost needed to calculate the time derivative directly. Thus, the relative advantage depends on the computer used.

In the case of a pipeline processor, the Hermite scheme has several additional advantages. First, the word length for the calculation of the jerk can be shorter than that for the force, resulting in a significant reduction in the size of the hardware. Second, the timestep is longer for the Hermite scheme at the same accuracy, resulting in a reduction in the amount of communication between the CPU and GRAPE.

If we calculate higher-order derivatives directly, we can construct higher-order integration schemes. If we calculate the second derivative, we can construct a single-step corrector of sixth order. In order to make the order of the predictor consistent with the corrector, we need derivatives up to the third order. Thus, the predictor needs to be a two-step one, but it is easy to construct. Similarly, by calculating third-order derivatives directly, we can construct an eighth-order scheme. Nitadori et al. [14] described the formulation and performance of such sixth- and eighth-order schemes. They do have a significant advantage over the simple fourth-order scheme, in particular when high accuracy is required or when the central density of the system becomes high. In retrospect, it is a bit surprising (or shameful) that it took such a long time to recognize the advantage of high-order schemes.
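Equations (5)–(8) can be coded almost literally. The sketch below uses a single shared timestep for brevity (a real code applies this per particle within the individual or block timestep machinery of Sect. 3.6) and the acc_and_jerk routine sketched above:

```python
def hermite_step(pos, vel, mass, dt, eps2):
    acc0, jrk0 = acc_and_jerk(pos, vel, mass, eps2)
    # Predictor, Eqs. (5)-(6)
    xp = pos + dt * vel + dt**2 / 2 * acc0 + dt**3 / 6 * jrk0
    vp = vel + dt * acc0 + dt**2 / 2 * jrk0
    # Force and jerk at the predicted state
    acc1, jrk1 = acc_and_jerk(xp, vp, mass, eps2)
    # Corrector, Eqs. (7)-(8): correct the velocity first, then use it for x_c
    vc = vel + dt / 2 * (acc1 + acc0) - dt**2 / 12 * (jrk1 - jrk0)
    xc = pos + dt / 2 * (vc + vel) - dt**2 / 12 * (acc1 - acc0)
    return xc, vc
```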
3.2. Limit of the gain by the individual timestep scheme

One important recent finding is that the behavior of the timestep size and the energy error is quite different between the fourth-order scheme and higher-order schemes. In the case of the fourth-order scheme, when the core becomes small through gravothermal collapse, the energy error grows quickly. However, the growth of the error is much smaller for the higher-order schemes, even if we choose the timestep criterion so that the error per crossing time is initially the same. Of course, this difference means the calculation cost of the higher-order schemes grows a bit faster. Surprisingly, however, what makes the error of the higher-order schemes small is the shrinking of the timesteps for particles far away from the core. The timestep for particles in the core becomes smaller for both the fourth-order and higher-order schemes, but the timesteps of particles in the outer region shrink only in the case of the higher-order schemes.

What this difference tells us is that the traditional timestep criterion used for the fourth-order scheme is too insensitive to high-frequency, low-amplitude variations of the acceleration, and thus the timesteps of particles far away from the core are too long to resolve the motions of particles in the core. As a result, the error of the fourth-order scheme becomes large. On the other hand, the timestep criterion for the higher-order schemes, which uses the higher-order time derivatives, is sensitive enough to reduce the timesteps of particles far away from the core, so that they can resolve the motion of the particles in the core. Thus, in order to keep the error reasonably small, the timesteps of particles far away from the core must be small enough to resolve the motion of particles in the core.

This observation implies that there is a fundamental limit on the gain from the individual timestep algorithm. The timesteps of most of the particles must be small enough to resolve the motions of the particles with the smallest orbital timescale. The existence of this limit looks quite natural once we understand the underlying mathematics. Even so, it had been overlooked for the first half-century of the history of gravitational N-body simulation. Clearly, we have not yet fully understood a relatively simple-looking problem: What is the best way to numerically integrate the gravitational N-body problem?

3.3. Neighbor scheme

With the individual timestep algorithm, particles have their own times and timesteps. The timestep of one particle is determined so that the required accuracy is achieved. In principle, we can generalize this concept of an individual timestep to all of the N^2 interactions. Each interaction would have its own time and timestep, determined by the required accuracy. Whether or not such a scheme is practical, or even realizable, has not been well studied yet, but one can show that the ultimate gain of such a scheme is not so large. Consider a system like a star cluster with a relatively large core (not too high a central density). Interactions between most pairs of particles must be evaluated at a time interval which is a small fraction of the orbital timescale of the particles in the core. On the other hand, interactions of particles in the core need to be evaluated on the timescale in which particles move the average interparticle distance. Thus, the ratio between the shortest timescale and a typical timescale for distant pairs is O(N^{-1/3}), which is not small but can be insignificant compared to the additional overhead of the complex algorithm.

If we can achieve a significant fraction of the gain of such an ideal "pairwise individual timestep" scheme by something simpler, that might in practice be more useful than the ideal scheme. One such possibility is the neighbor scheme. In the neighbor scheme, the force on a particle is divided into two components, the neighbor force and the "regular" force (we follow the traditional naming here). Typically, around 30 "neighbor" particles, the 30 nearest neighbors, are selected, and the force from these neighbor particles is integrated with a timestep shorter than that of the force from the rest of the system. In the neighbor scheme, the list of neighbors is updated at each timestep for the regular force.
Theoretically, one should be able to achieve a reduction of the calculation cost of O(N^{-1/4}) by making the number of neighbors O(N^{3/4}). In practice the gain is smaller, for the following two reasons. First, the above argument that the timestep for distant pairs can be of the order of the orbital timescale is too optimistic, as we discussed in the previous section. Second, a number of neighbors of O(N^{3/4}) is too large, since it means we need O(N^{7/4}) words of memory.

3.4. The treecode

The basic idea of the treecode [1] is to replace the force from a group of distant particles by the force from their center of mass or by a multipole expansion. To ensure accuracy, we make the groups for distant particles large and the groups for nearby particles small. We use a tree structure to construct the appropriate grouping for each particle.

Before calculating the forces on particles, we first organize the particles into a tree structure. Barnes and Hut [1] used an octree based on the recursive subdivision of a cube into eight subcubes. We stop the recursive subdivision when a cube contains only one particle, or is empty. Figure 8 shows the Barnes–Hut tree in two-dimensional space.

Fig. 8. Barnes–Hut tree in two dimensions.

After the tree is constructed, for each node of the tree, which corresponds to a cube of a certain size, we calculate the coefficients of the multipole expansion of the gravitational force exerted by the particles in that cube. This calculation can be done using a simple recursive procedure.

The force calculation is also expressed as a recursive procedure. To calculate the force on a particle we start from the root node, which corresponds to the total system. We calculate the distance between the node and the particle (d) and compare it with the size of the node (l); see Fig. 9. If they satisfy the convergence criterion

    l/d < θ,    (9)

where θ is the accuracy parameter, we calculate the force from that node on the particle using the coefficients of the multipole expansion. If criterion (9) is not satisfied, the force is calculated as the sum of the forces from the eight sub-nodes.

Fig. 9. Opening criterion for tree traversal.

Usually, we use the distance between the particle and the center of mass of the node to determine whether the force is accurate enough. When θ is very large, this criterion can cause unacceptably large errors [15]. For most calculations, however, such a pathological situation does not occur. In addition, for the treecode with GRAPE, relatively small values of θ such as 0.5–0.6 are not too costly.
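To make the recursion concrete, here is a monopole-only sketch of the tree walk with criterion (9). The node attributes (size, com, mass, children) are hypothetical names chosen for this illustration; a production code would add higher multipole terms and the list-based traversal discussed in Sect. 3.6.3.

```python
import numpy as np

def tree_force(node, x, theta, eps2):
    # Softened monopole force at position x from the particles under `node`.
    d = np.linalg.norm(node.com - x)
    if not node.children or (d > 0.0 and node.size / d < theta):
        # Criterion (9) satisfied (or leaf node): treat the cell as one mass.
        if d == 0.0:
            return np.zeros(3)                 # the particle itself
        return node.mass * (node.com - x) / (d * d + eps2) ** 1.5
    # Otherwise open the node and sum the forces from its sub-nodes.
    return sum(tree_force(c, x, theta, eps2) for c in node.children)
```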
3.5. The fast multipole method

The basic idea of the treecode is to replace a group of particles with the multipole expansion of its gravitational field. This multipole expansion is used to evaluate the gravitational field at the positions of distant particles. Here, there is a rather clear inefficiency: the forces on two particles which are physically close to each other are evaluated independently. There should be some way to make use of the fact that the forces on physically close particles are similar. The fast multipole method (FMM) is a systematic way to exploit this fact. The basic idea of FMM is to eliminate the O(log N) factor in the force calculation cost of the treecode by evaluating the potential at a higher level of the tree. In the case of the treecode, the multipole expansion of a group of particles is simply evaluated directly at the position of each particle. In the case of FMM, it is first translated to a spherical harmonic expansion at the center of a box which contains the particle at which the potential is evaluated, and then this spherical harmonic expansion is evaluated at the position of the particle. We can add up the contributions of many groups of particles to the spherical harmonic expansion at one box. Moreover, we can shift the expansion at the center of one box to the centers of its child boxes and hierarchically add up contributions from all levels of the tree. Thus, the order of the calculation cost is reduced to O(N). FMM is therefore theoretically advantageous over the treecode, but so far it is not so widely used, probably mainly because of the complexity of its implementation.

3.6. Parallelization

3.6.1. Block timestep algorithm. The individual timestep algorithm is quite powerful in reducing the calculation cost. However, an obvious problem with the individual timestep algorithm is that it reduces the degree of parallelism to the smallest possible value: only one particle is integrated at a time. We can still evaluate the forces from many particles in parallel and take the summation, but we cannot calculate the forces on multiple particles in parallel.

If several particles share exactly the same time, we can calculate the forces on them in parallel. In addition, we need to predict the positions of the other particles only once in order to calculate the forces on these particles. The cost of the prediction therefore becomes a small fraction of that of the force calculation. In order to force several particles to share exactly the same time, we adjust the timesteps to integer powers of two. With this modification, all particles that have the same timestep also have the same time. This scheme is usually called the block timestep algorithm. It was developed by McMillan [16] in order to use the Cyber 205 more efficiently. The technical details of this algorithm are described in [17].

On average, the number of particles which share exactly the same time is O(N^{2/3}) if the system is nearly homogeneous. If the system has a small, high-density core, this number becomes smaller, depending on the details of the timestep criterion and the particle distribution. Even so, the block timestep scheme has a very large impact on the parallel efficiency of N-body simulation.
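The two ingredients of the block timestep scheme, quantization of the timestep to powers of two and selection of the block to be integrated, are small pieces of code. The sketch below shows them in isolation (dt_max and the per-particle accuracy-based timestep dt_req are assumed to be given; the integrator itself is omitted):

```python
import numpy as np

def quantize_timestep(dt_req, dt_max):
    # Largest power-of-two subdivision of dt_max that does not exceed dt_req.
    dt = dt_max
    while dt > dt_req:
        dt *= 0.5
    return dt

def next_block(t, dt):
    # Indices of all particles sharing the earliest t + dt.  Because every
    # dt is dt_max / 2^k and the particles start synchronized, particles
    # with equal t + dt also share the same t, so the forces on all of
    # them can be evaluated in parallel.
    t_next = t + dt
    t_min = t_next.min()
    return np.where(t_next == t_min)[0], t_min
```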
3.6.2. 2D parallelization and the Ninja scheme. The blockstep algorithm makes it possible to parallelize the force calculation for multiple particles on shared-memory multiprocessors and vector processors. Here we discuss parallelization on distributed-memory parallel processors.

The simplest approach is to let each processor have a complete copy of the system and calculate the forces on a fraction of the particles. In the case of the blockstep scheme, each processor selects the particles to be integrated and integrates n/p of them, where n is the number of particles in the block and p is the number of processors. After the integration is finished, each processor broadcasts its updated particles, so that all processors have all particles in the block updated. The scalability of this algorithm is rather limited, since we can use only O(n) processors. The performance is limited by the communication bandwidth and latency, and it is usually difficult to use more than 128 processors even for more than 100k particles. On the other hand, the calculation cost per crossing time is O(N^{7/3}) and that per thermal relaxation time is O(N^{10/3}). Thus, parallelization in this way does not give sufficient speedup.

One can use larger numbers of processors by using a so-called two-dimensional algorithm [18]. The basic idea of this scheme is to divide the calculation of the forces f_ij in both the i and j directions. If we have p = r^2 processors, we organize them into a two-dimensional grid. Particles are divided into r groups; processor p_ij holds both group i and group j, and calculates the force from group j on group i. The total force is obtained by taking a summation over the j direction (the direction in which j differs and i is the same). Then we can let each processor update group i. Finally, we let the diagonal processors broadcast their groups in the i direction. This parallelization effectively increases the communication bandwidth by a factor of r, and it is now possible to use O(Nn) processors. If we combine the blockstep algorithm with this two-dimensional parallelization, there is a relatively large load imbalance between processors, since the number of particles in one blockstep fluctuates.

The Ninja scheme (Nbody-i-and-j algorithm) achieves almost perfect load balancing in the following way. We again have r^2 processors, but processor p_ij holds only group j. After the processors have selected the particles to be integrated, they divide the selected particles into r groups, and processors in row i broadcast their group i in the j direction. Then each processor calculates the forces from its group on the received particles. The total forces are calculated by taking a summation, and the summed results are sent back to their original locations, so that processor p_ij obtains the total force on subgroup i of the particles to be integrated in group j. Finally, by broadcasting these summed results in the i direction, all processors obtain the forces on the particles to be integrated. With this scheme it is even possible to use more than N processors to integrate a system of N particles, and we can achieve very high parallel efficiency for relatively small numbers of particles.
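Because the index bookkeeping of the two-dimensional decomposition is easy to get wrong, a serial sketch that emulates an r x r grid and checks it against the r = 1 case may be helpful; it reproduces only the data flow (which group meets which), not the actual communication:

```python
import numpy as np

def group_forces(xi, xj, mj, eps2):
    # Softened forces exerted by particles (xj, mj) on particles xi (G = 1).
    dx = xj[None, :, :] - xi[:, None, :]
    w = mj[None, :] / ((dx ** 2).sum(axis=2) + eps2) ** 1.5
    return (w[:, :, None] * dx).sum(axis=1)

def forces_2d(pos, mass, r, eps2=1e-4):
    # Emulate an r x r grid: "processor" (i, j) holds groups i and j and
    # computes the force of group j on group i; summation over the j
    # direction gives the total force on group i.
    groups = np.array_split(np.arange(len(pos)), r)
    acc = np.zeros_like(pos)
    for gi in groups:
        acc[gi] = sum(group_forces(pos[gi], pos[gj], mass[gj], eps2)
                      for gj in groups)
    return acc

# Sanity check of the decomposition: the result must not depend on r, e.g.
# assert np.allclose(forces_2d(pos, mass, 1), forces_2d(pos, mass, 4))
```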
3.6.3. Parallel algorithms for treecode. There are a number of studies on the parallelization of the Barnes–Hut tree algorithm. Some of the very early works discussed efficient implementations on vector processors [19–22], but there were also implementations on both MIMD [23] and SIMD [24] distributed-memory computers.

Roughly speaking, a parallel implementation of the tree algorithm needs to take care of the following three steps:

(1) Distribution of particles to processors (in the case of distributed-memory machines)
(2) Construction of the tree
(3) Force calculation

In the case of shared-memory parallel machines (including vector processors with single or multiple processors), step (1) is not necessary, but in the case of distributed-memory parallel machines this step is critical. Two major algorithms are used. One is ORB (orthogonal recursive bisection) and its variants. As its name suggests, with ORB we first consider a cubic box which contains all particles in the system and divide it in the x direction, so that the number of particles, or the calculation cost, is the same on each side. Then we divide the two regions in the y direction, and then in z, and then x again, until the number of regions is the same as the number of processors (this means the number of processors must be an integer power of two). Each processor then takes care of one region. It is also possible to make the division in one dimension not a bisection but a division into an arbitrary number of regions, so that one can use systems whose number of processors is not a power of two [25].

In this scheme, the octree structure used for the force calculation is constructed after the decomposition of the space is done and the regions are assigned to processors. Thus, on each processor, both the tree construction and the force calculation can be done independently of the other processors, but each processor needs information about the particles on the other processors. One way to transfer the necessary information is to let each processor send to every other processor all the data that it needs. If the regions of two processors are far apart, all particles in one processor can be approximated as one node, and it only needs to send the data for that one node. If the two regions are close, we can construct the necessary data by tree traversal: if a node of the tree in one processor is too close to the region of the other processor, we open up that node and go down the tree; if the node is sufficiently distant, we stop the tree traversal there. In this way we can make a cut-out version of the tree, and we then send this tree structure to the other processor. The processor which receives the cut-out tree then merges it with its own tree structure. It is possible to simplify this procedure by sending not the tree structure but a list of nodes and particles [25]. In this scheme, the nodes and particles in the list are "inserted" into the tree, or the whole tree is reconstructed after all the necessary data have been received. A modern implementation is described in [26].

On modern microprocessors, traversing the tree is a relatively slow process compared to the calculation of the interactions themselves. Thus, it is desirable to somehow reduce the cost of the tree traversal, and one way to do so is to do the traversal for a group of particles instead of a single particle. To identify a group, we can use the tree structure itself. Figure 10 illustrates the opening criterion for a group of particles. Instead of calculating the distance between one box and one particle, we should calculate the minimum of the distances of all particles in the box from the other box. We use the geometrical minimum distance between the box and the center of mass of the other box as this distance [20]. In the usual implementation, we construct the list of nodes and particles which exert forces on the group of particles, and then calculate the forces from the list on the group.

Fig. 10. Modified opening criterion for treecode with GRAPE.

Once we separate the tree traversal and the force calculation, we can use vector processors or SIMD execution units quite efficiently for the force calculation. We can also use GRAPE hardware or, more recently, GPGPUs for the force calculation. In practice, if we accelerate the force calculation by a very large factor, other parts of the calculation, such as the tree construction and tree traversal, can become the bottleneck. Efficient use of cache and SIMD units for these parts is the current challenge.
3.7. BRIDGE

In principle, it is possible to combine the individual timestep (or blockstep) method and the tree algorithm. In the earliest implementation [27], the tree structure is newly constructed at each blockstep. If the range of timesteps is not too large this approach is satisfactory, but for systems like star clusters the overhead of tree construction is too large. It is also possible to partially update the tree at each timestep [28]. However, to our knowledge there is no implementation of this type on distributed-memory parallel computers.

For the last few years we have been working on methods to combine the blockstep algorithm and the tree algorithm in such a way that the resulting scheme can be efficiently parallelized. The first method is the BRIDGE scheme [29]. It is a rather specialized method to handle the evolution of star clusters in galaxies. The goal of the BRIDGE scheme is to handle both the parent galaxy and the star cluster as N-body systems, and to include their gravitational interactions in a fully consistent way. If we used the individual timestep method for the entire system, the calculation cost would become too high. On the other hand, if we used the tree algorithm with a constant stepsize for the entire system, we would have to apply some softening to the stars in the cluster, and yet we would have to use a very short timestep. Thus, we could not express important processes such as the formation of binaries and physical collisions of stars. Ideally, we want to apply the tree algorithm to the evolution of the parent galaxy and the interaction between the galaxy and the star cluster, and to apply the individual timestep algorithm to the internal dynamics of the star cluster.

Fujii et al. [29] achieved this goal by applying the idea of Hamiltonian splitting. In the BRIDGE algorithm, the potential energy of the system is split into two components as

    V = V_cl + V_rest.    (10)

Here, V is the total potential energy of the system, V_cl is the internal potential energy of the star cluster, and V_rest is the remaining part of the potential energy. Note that V_rest consists of the internal potential energy of the galaxy and the interaction potential between the galaxy and the star cluster. The Hamiltonian H = T + V is now split into

    H_1 = T + V_cl,    (11)
    H_2 = V_rest.      (12)

We can now integrate the system using a procedure similar to the second-order leapfrog scheme. At the beginning of the integration we first evaluate the acceleration due to V_rest and push the velocities of all particles by Δt a/2, where a is the calculated acceleration. Then we integrate the positions and velocities of all particles using H_1. Since H_1 contains only the internal potential energy of the star cluster, the particles of the parent galaxy move with constant velocity, while the particles in the star cluster are integrated using the individual timestep scheme, following only the internal potential. When all particles in the star cluster reach the next timestep of the leapfrog scheme, we calculate the forces due to V_rest and push the velocities of all particles by Δt a/2 to reach the new time. For the next timestep, we again push the velocities by Δt a/2. When no diagnostics or output are necessary, we can simply push the velocities by Δt a.

This BRIDGE scheme is, to our knowledge, the first algorithm with which we can handle the evolution of star clusters in their parent galaxy in a fully self-consistent way, and it has been used for many studies of the evolution of young star clusters.
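The structure of one BRIDGE step is compact enough to write down. In the sketch below, accel_rest (the acceleration due to V_rest, e.g. from a tree code) and evolve_cluster_internal (individual-timestep Hermite integration of the cluster under V_cl alone) are hypothetical stand-ins for the two parts of the split, and the particle data are assumed to be held in arrays pos and vel plus a boolean mask in_cluster:

```python
def bridge_step(pos, vel, in_cluster, dt, accel_rest, evolve_cluster_internal):
    # Kick: half-step velocity change of *all* particles due to V_rest.
    vel += 0.5 * dt * accel_rest(pos)
    # Drift under H1 = T + V_cl: galaxy particles move on straight lines,
    # cluster particles are integrated internally with individual timesteps.
    pos[~in_cluster] += dt * vel[~in_cluster]
    evolve_cluster_internal(pos, vel, in_cluster, dt)
    # Kick: second half-step due to V_rest at the new positions.
    vel += 0.5 * dt * accel_rest(pos)
    return pos, vel
```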
3.8. Particle-Particle Particle-Tree

The idea of Hamiltonian splitting can be applied in many different ways, and one possibility is to split the pairwise interaction itself into near and distant terms. This splitting is the same as that used in the P^3M (particle-particle particle-mesh) scheme or the Ewald summation method. In these schemes, a pairwise potential is split into two parts by a smooth function as

    U_ij = g(r_ij) U_ij + [1 − g(r_ij)] U_ij.    (13)

Here, g is a smooth function which satisfies the following criteria:

◦ g(0) = 1;
◦ g(∞) = 0;
◦ g is differentiable n times, where n is the order of the integrator used.

We can apply the same time integration scheme as in the BRIDGE scheme by replacing the cluster potential V_cl with the potential energy due to the g(r_ij)U_ij term (a sketch of one possible splitting function is given at the end of this section). There are many technical details concerning the actual implementation, which are discussed in [30]. Our current view is that this P^3T scheme is the practical "solution" to the fundamental limitation of the individual timestep algorithm discussed in Sect. 3.2. We simply make the timestep for the tree force calculation comparable to the orbital timescale of particles in the core. Though the calculation cost is not small, it is smaller than that of the standard blockstep scheme by a factor proportional to N / log N. This factor is likely to be much larger than the gain due to the neighbor scheme.
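For concreteness, one possible splitting function satisfying the three conditions above is given below; the quintic polynomial with a finite cutoff radius r_cut is our own illustrative choice, not necessarily the form adopted in [30]:

```python
def g_split(r, r_cut):
    # Smooth switch: g(0) = 1, g(r) = 0 for r >= r_cut, with vanishing first
    # and second derivatives at both ends (sufficient for a low-order
    # integrator).  The near part g*U is handled with short timesteps, the
    # smooth remainder (1 - g)*U with the tree and a long leapfrog step.
    if r >= r_cut:
        return 0.0
    s = r / r_cut
    return 1.0 - s**3 * (10.0 - 15.0 * s + 6.0 * s**2)
```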
4. Algorithms II: N-body+SPH for galaxy formation

Numerical simulation is a powerful tool for studying the complex processes involved in galaxy formation. Since the pioneering works by Katz and Gunn [31] and Navarro and Benz [32], a number of efforts have been undertaken [33–45]. In order to make a single galaxy, it is necessary to collect material from a volume of 1 Mpc^3 and squeeze it into several tens of kpc. In addition, cold gas forms a thin disk in which spiral arms and star-forming regions with sizes of several pc develop. Thus, we need to cover six orders of magnitude in spatial scale. Clearly, we cannot handle such a wide range with a fixed, regular grid, and schemes with adaptive resolution are necessary.

There are two major ways to achieve adaptive resolution: AMR (adaptive mesh refinement) and SPH (smoothed particle hydrodynamics). They have their own merits and demerits. The advantage of AMR over SPH is that it is a grid-based finite-difference scheme. Thus, it has significantly better shock-capturing capability. For many test problems, in particular those for instabilities such as the Kelvin–Helmholtz and Rayleigh–Taylor instabilities, the performance of grid-based schemes is better than that of SPH for comparable degrees of freedom. However, when applied to galaxy formation, AMR has one fundamental difficulty. Consider the situation where we want to resolve a high-density molecular cloud in the galactic plane with a size of 1 pc and a temperature of 20 K. The finest grid scale would be 0.1 pc, and the rotation velocity of the gas is 200 km/s. Thus, the Courant–Friedrichs–Lewy (CFL) condition requires that the timestep should not exceed 500 years. On the other hand, if we can use some Lagrangian scheme in which the grid (or particle, or whatever) moves, the timestep limit due to the CFL condition is around 0.1 million years. Also, with an Eulerian grid, the gas moves with a Mach number close to 1,000, which means we need to solve the hydrodynamics with extremely high accuracy just to conserve the shape and energy of the cloud. Whether or not one can achieve the necessary accuracy with the Eulerian AMR method remains to be seen.

Lagrangian schemes, such as N-body+SPH [27], are therefore the method of choice for simulations of galaxy formation. With particle-based methods like SPH, the number of particles used determines the resolution and accuracy of the calculation. The typical number of particles for a simulated galaxy was 10^4 in the 1990s and 10^{4–5} in the 2000s. Some recent studies used 10^{6–7} particles. Thus, the increase in the number of particles is less than a factor of 1000 in two decades. In the case of cosmological N-body simulations, the number of particles increased by a factor of 10^5 or more, from 10^6 to 10^{11}, in the same period. Thus, there was a difference of about a factor of 100 between the number of particles used for cosmological N-body simulations and that for N-body+SPH galaxy formation simulations, and this difference has itself increased by a factor of 100 in 20 years. In both types of calculation, the evaluation of the gravitational force on particles from the rest of the system is the most expensive part, and in both cases the parallel tree method discussed in the previous section has been used.

The essential reason why there is such a large difference in the number of particles used is the dependence of the local dynamical timescale on the resolution. In the case of cosmological N-body simulations, the dependence of the minimum timescale on the number of particles is fairly weak. In many simulations, the increased number of particles is used to cover a wider volume without improving the resolution. In this case, the timestep is clearly independent of the number of particles. In the case of constant volume and improved resolution, the minimum timestep is proportional to m^{0.3–0.4}, where m is the mass of the particles.

In the case of N-body+SPH simulations, the smallest timescale appears in supernova (SN) remnants. Since the minimum mass of gas particles is still many orders of magnitude larger than the actual remnant mass, the mass of gas to which we assign the explosion energy is directly proportional to the mass of the gas particles. Moreover, with gas particles of smaller mass we can resolve smaller gas clouds with higher density. The temperature of SN remnants can exceed 10^8 K, which corresponds to a sound velocity higher than 1000 km/s. Thus, if we want to use a resolution of 0.1 pc, the timestep can go below 100 years. On the other hand, with pure N-body simulations, the required timestep is around 10^5 years even for extremely high-resolution simulations. One can reduce the impact of the small timestep by means of the individual timestep method, as described in the previous section. However, as we also discussed there, the speedup one can achieve with the combination of the individual timestep algorithm and the tree algorithm is rather limited.

If we need a timestep 100 times smaller for an N-body+SPH simulation than for a pure N-body simulation, then, using the same amount of computer resources, we can use a number of particles 100 times smaller. However, the fact that the number of particles is 100 times smaller means the parallel efficiency is much worse, resulting in an even greater reduction of the available computer resources and thus a further reduction in the number of particles. In addition, there are still several fundamental problems with the SPH method when applied to galaxy formation problems.
In the rest of this section we discuss some of these issues and our proposed solutions.

4.1. The timestep limiter for individual timesteps

The individual timestep method is commonly used in simulations of galaxy formation [16,27,46]. This method allows particles to have different timesteps, and as a result the total calculation cost is reduced. Almost all known implementations of this method violate Newton's third law. However, the error due to this violation is tolerable in usual simulations.

In [47] we found that the SPH method with individual timesteps, implemented in the way described in the literature, cannot adequately handle strong explosion problems. This is because the timestep of a particle is determined explicitly, at the beginning of that timestep itself. A particle cannot respond to a strong shock if its timestep before the shock arrives is too long compared to the timescale of the shock, and thus supernova explosions pose a serious difficulty. Such an explosion generates a small amount of very hot gas (T ~ 10^8 K) inside a large clump of cold gas (T ~ 10 K). Thus, the timestep of the hot gas particles becomes 1000 times smaller than that of the cold gas around them. The hot gas wants to expand, but the cold gas around it is frozen, because the timesteps of the cold gas particles are orders of magnitude longer than the expansion timescale of the hot gas. This inconsistency results in the breakdown of the numerical integration.

To overcome this difficulty we introduced a timestep limiter which limits the difference in timesteps between neighboring particles. We denote the timestep of the ith particle as dt_i and that of a neighbor particle, with index j, as dt_j. The basic idea of our limiter is to enforce the following conditions:

    dt_i ≤ f dt_j,    (14)
    dt_j ≤ f dt_i,    (15)

where f is an adjustable parameter. We found f = 4 to give good results, from the perspective of total energy and linear momentum conservation, without a significant increase in the calculation cost. It is essential that the timestep of particle j shrinks when the timestep of its neighbor particle i suddenly shrinks by a large factor; in other words, particle i must let particle j respond to the change of its timestep.

Our implementation of the timestep limiter is as follows. To enforce a small enough difference in timesteps among neighboring particles, particles send their timesteps to their neighboring particles when they are integrated. Particles which receive timesteps compare them with their local minimum timestep, dt_lmin,j, and update the local minimum timestep if necessary. If the timestep of particle j is too long compared to its local minimum timestep (dt_j > f dt_lmin,j), it is reduced to f dt_lmin,j. Note that this reduction of the timestep of particle j is possible only if the times of the two particles, t_i and t_j, satisfy the condition t_i ≥ t_j + f dt_i. If this condition is not satisfied, the reduction of the timestep would place the new time of particle j before the current time of particle i, requiring a backward integration of the entire system. In this case, the new time of particle j is set to a value that is consistent with the current system time (t_i), and its timestep is set to the difference between the particle's current time and its new time.

Schematic pictures of the traditional individual timestep method and of our implementation can be found in Fig. 1 in [47].
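The core of the limiter fits in a few lines. The array-based sketch below follows the description above with f = 4; the variable names are ours, and the adjustment of the new time when the condition t_i ≥ t_j + f dt_i is violated is omitted:

```python
import numpy as np

def broadcast_timesteps(dt, neighbor_lists):
    # When a particle is integrated it sends its timestep to its neighbors;
    # each particle keeps the minimum value it has received, dt_lmin.
    dt_lmin = np.full_like(dt, np.inf)
    for i, neighbors in enumerate(neighbor_lists):
        for j in neighbors:
            dt_lmin[j] = min(dt_lmin[j], dt[i])
    return dt_lmin

def limit_timesteps(dt, dt_lmin, f=4.0):
    # Enforce conditions (14)-(15): shrink any timestep larger than f times
    # the smallest timestep among its neighbors.
    dt_new = dt.copy()
    too_long = dt_new > f * dt_lmin
    dt_new[too_long] = f * dt_lmin[too_long]
    return dt_new
```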
As shown in Fig. 3 in [47], the traditional implementation of individual timesteps failed to reproduce the solution for a point-like explosion. When we used the timestep limiter with the individual timesteps, the numerical solution of the SPH simulation was in good agreement with the self-similar Sedov solution. The increase in the calculation cost is negligible.
In many previous simulations of galaxy formation, the mass of gas particles is 10^5 M_sun or more, and gas cooling below 10^4 K is artificially suppressed. In this case, the reduction of the timestep in the region heated by an SN explosion is rather small, say a factor of 10, and special treatments such as that described here were not necessary. In other words, this problem is one of the numerical difficulties that arise only when one deals with low-temperature, high-density interstellar gas. Our new timestep limiter is essential in that regime, and has been adopted in many new calculations.
4.2. Asynchronous time integrator for self-gravitating fluid
In simulations of galaxies, supernova explosions in dense regions lead to the shortest timesteps. Consider the situation that an SN with energy E_SN occurs in the interstellar medium (ISM) with a temperature of T_ISM. If this SN heats the surrounding ISM from temperature T_ISM to T_SN, and if the size of the region is unchanged during this process, we can write the reduction factor of the timestep of the ISM before the SN, dt_ISM, and after the SN, dt_SN, as
\frac{dt_{\rm SN}}{dt_{\rm ISM}} \propto \left( \frac{T_{\rm ISM}}{T_{\rm SN}} \right)^{1/2} \propto E_{\rm SN}^{-1/2} \, m_{\rm SN}^{1/2} \, T_{\rm ISM}^{1/2} ,  (16)
where m_SN is the mass of the heated region, and we used the relation T_SN ∝ E_SN/m_SN. For SPH simulations, it is reasonable to set m_SN close to the resolution, ~ N_NB × m_SPH, where N_NB is the number of neighbor particles (typically 30–50) and m_SPH is the mass of an SPH particle. The shrinkage of the timestep is larger when the mass resolution is higher and the ISM temperature is lower. Therefore, a high-resolution simulation which incorporates the low-temperature ISM (< 10^4 K) requires much shorter timesteps than conventional simulations of galaxy formation with a cooling cutoff at ~ 10^4 K.
In a heated region, the thermal energy and kinetic energy become many orders of magnitude larger than the gravitational potential energy. This means that the timestep for the gravitational interaction can be much longer than that for hydrodynamics. By extending the concept of individual timesteps, we constructed a new integration scheme which allows an individual fluid particle to have different timesteps for gravitational and hydrodynamical interactions. As we stated above, particles in the heated region have the shortest timesteps. Therefore, if we assign different timesteps to the gravitational and hydrodynamical forces, we should be able to use a much longer timestep for gravity, thereby accelerating simulations by a large factor. We call this integrator FAST, which is an acronym for "Fully Asynchronous Split Time-integrator". A similar idea has been used in molecular dynamics, in which the long-range Coulomb and the short-range van der Waals forces are integrated with different timesteps [48].
FAST reduces unnecessary gravitational force evaluations in the small timesteps induced by SN explosions. Since the number of dark matter and stellar particles is usually larger than that of SPH particles in typical simulations of galaxy formation, the calculation cost of gravity is larger than that of hydrodynamics. This reduction in unnecessary evaluations of gravity can therefore improve the calculation speed significantly.
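The following Python fragment sketches the split-step structure that FAST builds on: gravity is applied as half-kicks on a long step, while hydrodynamics is sub-cycled on shorter steps in between. The fixed substep ratio, the simple leapfrog sub-integrator, and the callable names are ours; the actual FAST scheme assigns the two timesteps per particle and asynchronously, rather than globally as here.

import numpy as np

def split_step(x, v, u, dt_grav, n_sub, grav_accel, hydro_accel_du):
    """One coarse step of a gravity/hydro split integrator (schematic).
    Gravity enters as half-kicks at the ends of the coarse step;
    hydrodynamics is integrated with n_sub leapfrog substeps in between.
    grav_accel(x)           -> gravitational acceleration
    hydro_accel_du(x, v, u) -> (hydrodynamical acceleration, du/dt)
    """
    v = v + 0.5 * dt_grav * grav_accel(x)      # gravity half-kick
    dt_h = dt_grav / n_sub
    for _ in range(n_sub):                     # fine hydrodynamical substeps
        a_h, du = hydro_accel_du(x, v, u)
        v = v + 0.5 * dt_h * a_h
        u = u + 0.5 * dt_h * du
        x = x + dt_h * v
        a_h, du = hydro_accel_du(x, v, u)
        v = v + 0.5 * dt_h * a_h
        u = u + 0.5 * dt_h * du
    v = v + 0.5 * dt_grav * grav_accel(x)      # gravity half-kick
    return x, v, u

Applying gravity as half-kicks at the boundaries of the long step is the same kind of splitting used in the BRIDGE-like schemes mentioned below; the hydrodynamical part here is not exactly symplectic, since the viscous and pressure accelerations depend on the velocities.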
When used with a parallel tree algorithm, FAST offers a further advantage. The calculation of gravity for a small fraction of the particles, which occurs when a small number of particles with short timesteps are integrated, is quite expensive per particle, because the overhead of tree construction and communication dominates the computing time. The splitting of the integration steps for hydrodynamics and gravity is done in the same way as in the BRIDGE or P^3T schemes, though in the case of FAST neither the integration of hydrodynamics nor that of gravity is exactly symplectic.
In [49] we compared results obtained with the FAST method and with the conventional leapfrog method. The evolution of an SN explosion in a self-gravitating cloud solved by the FAST method is in good agreement with that obtained by the conventional leapfrog method with individual timesteps (see Figs. 5 and 6 in [49]). With the FAST method, the total number of gravity steps decreased and the calculation time was reduced. The gain in speed in this case was not so large because the fraction of particles with small timesteps was large. In order to test the performance of the FAST method under more realistic circumstances we compared the results of merger simulations, in which we took into account the radiative cooling of gas down to 10 K, star formation in the dense (n_H > 100 cm^-3) and cold (T < 100 K) phase gas, and type II SN explosions. The initial numbers of dark matter, (old) star, and gas particles were 6,930,000, 341,896, and 148,104, respectively. We used 128 cores of a Cray XT4 system. The evolution of the merging galaxies in the two schemes is essentially identical. The calculation time with the FAST method is half of that without FAST. The calculation time of gravity was reduced to one-eighth in this case, as expected from Fig. 8 in [49]. In these simulations we used a highly hand-tuned code, the Phantom-GRAPE library [50], for the calculation of gravity, but the calculation of the SPH part is not yet optimized to the same level. We should be able to further improve the performance of our simulation code by using a similarly optimized code for the SPH part.
4.3. Extension of treecode for multi-mass and multi-scale simulations
When one performs a multi-mass and multi-scale simulation, one would like to use different Plummer softenings for particles of different mass. The reason we use small-mass particles is to improve the mass resolution, and it is often necessary to reduce the softening for these particles. For the direct summation method, it is possible to use an arbitrary symmetric form for the softening length, for example [(ε_i^2 + ε_j^2)/2]^{1/2}, (ε_i + ε_j)/2, or max(ε_i, ε_j), where ε_i and ε_j are the softening lengths of particles i and j, respectively, instead of the single ε found in the usual Plummer potential, so as to satisfy Newton's third law. When one uses the tree method [1,51] with the Plummer potential, one needs to use separate trees for each of the different groups of particles with the same gravitational softening length [52], since otherwise there will be an error in the force calculation of order ε^2. The use of different trees for different groups of particles having the same softenings leads to a large increase in the calculation cost. If we want to use "individual" softening, we cannot use the tree algorithm at all. Saitoh and Makino [53] introduced a new way to handle particle-dependent softening lengths using a single tree structure.
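Before turning to the tree-based treatment, the Python sketch below illustrates the direct-summation case with the symmetric softening choices listed above; all three keep the pairwise force antisymmetric, so Newton's third law holds. The function name, the rule labels, and the use of G = 1 units are ours.

import numpy as np

def direct_sum_accel(pos, mass, eps, rule="rms"):
    """Direct-summation gravitational acceleration with pairwise
    symmetric Plummer softening.
    rule: 'rms'  -> eps_ij^2 = (eps_i^2 + eps_j^2) / 2
          'mean' -> eps_ij   = (eps_i + eps_j) / 2
          'max'  -> eps_ij   = max(eps_i, eps_j)
    """
    G = 1.0
    n = len(mass)
    acc = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            if rule == "rms":
                eps2 = 0.5 * (eps[i]**2 + eps[j]**2)
            elif rule == "mean":
                eps2 = (0.5 * (eps[i] + eps[j]))**2
            else:
                eps2 = max(eps[i], eps[j])**2
            dr = pos[j] - pos[i]
            r2 = np.dot(dr, dr) + eps2
            f = G * dr / r2**1.5        # per unit mass product, direction i -> j
            acc[i] += mass[j] * f       # equal and opposite contributions,
            acc[j] -= mass[i] * f       # so total momentum is conserved
    return acc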
Conceptually, if we use symmetric softening of the form [(ε_i^2 + ε_j^2)/2]^{1/2}, we can regard ε_i and ε_j as two additional dimensions, in addition to the three spatial dimensions. When we evaluate the force from a group of particles, we usually use a multipole expansion. If particles in the group have different softenings, we can formally expand the potential in terms of ε_j, or more precisely ε_j^2 - ε̄^2, where ε̄ is an averaged softening length defined below. Consider the following form:
\phi_{ij} = - \frac{G m_j}{(r_{ij}^2 + \epsilon_i^2 + \epsilon_j^2)^{1/2}} .  (17)
This form has been used before to model the gravitational interaction between galaxy particles of different size and mass [54,55]. The gravitational potential induced by a group of particles j = 1, ..., N with particle masses m_j and total mass M is given by
\phi_i = - \sum_{j}^{N} \frac{G m_j}{[(\mathbf{r}_i - \delta\mathbf{r}_j)^2 + \epsilon_i^2 + \epsilon_j^2]^{1/2}} .  (18)
Here, the origin of the coordinates is set to the center of mass of the particles j, and the positions of the particles j relative to this center are δr_j. By introducing an arbitrary form of the averaged softening length for the particles j, ε̄, Eq. (18) can be rewritten as
\phi_i = - \sum_{j}^{N} \frac{G m_j}{[(\mathbf{r}_i - \delta\mathbf{r}_j)^2 + \epsilon_i^2 + \bar{\epsilon}^2 + \delta(\epsilon_j^2)]^{1/2}} ,  (19)
where δ(ε_j^2) ≡ ε_j^2 - ε̄^2. The Taylor expansion of this equation up to second order in δr_j and δ(ε_j^2) is
\phi_i = - \sum_{j}^{N} \frac{G m_j}{R} \left[ 1 + \frac{\mathbf{r}_i \cdot \delta\mathbf{r}_j}{R^2} - \frac{\delta(\epsilon_j^2)}{2R^2} + \frac{3(\mathbf{r}_i \cdot \delta\mathbf{r}_j)^2 - R^2 |\delta\mathbf{r}_j|^2}{2R^4} - \frac{3(\mathbf{r}_i \cdot \delta\mathbf{r}_j)\,\delta(\epsilon_j^2)}{2R^4} + \frac{3\,[\delta(\epsilon_j^2)]^2}{8R^4} + O\!\left( \frac{|\delta\mathbf{r}_j|^3}{R^3}, \frac{[\delta(\epsilon_j^2)]^3}{R^6} \right) \right] ,  (20)
where R = (r_i^2 + ε_i^2 + ε̄^2)^{1/2} is the Plummer distance for the symmetrized potential. The first term in the brackets is the monopole moment. The second term vanishes by definition since we adopt the center of mass as the coordinate center. The third term, the lowest-order term of the expansion in δ(ε_j^2), also vanishes if we adopt the mass-weighted second moment of ε_j as the averaged softening length:
\bar{\epsilon}^2 = \langle \epsilon_j^2 \rangle \equiv \frac{\sum_{j}^{N} m_j \epsilon_j^2}{M} .  (21)
Hence, adopting the second moment of ε_j as the averaged softening length is the most favorable option. Since the multipole moment of the symmetrized potential is obtained by expanding in both δr_j and δ(ε_j^2), two criteria are necessary for the convergence of the multipole expansion. By analogy with Barnes and Hut's opening criterion, we introduce a simple, but rather loose, set of opening criteria which can easily be used in a treecode:
\eta > \frac{w}{R} ,  (22)
and
\eta_{\epsilon} > \frac{\epsilon_{\max}^2 - \epsilon_{\min}^2}{R^2} ,  (23)
as the convergence criteria of the multipole expansion. Here, w is the size of the cell, ε_max and ε_min are the maximum and minimum softening lengths of the particles in the cell, and η and η_ε are the tolerance parameters. For the convergence of the multipole expansion, η < 1/√3 and η_ε < 1 are necessary. The reason for η < 1/√3 is that the usual Barnes–Hut opening criterion causes unbounded errors in the force calculation when θ > 1/√3.
We successfully extended the tree method to multi-mass and multi-scale simulations by using the symmetrized Plummer potential and deriving the multipole moments for a group of particles. Since our method is quite simple, it is easy to apply it to any code which uses the tree method with the ordinary Plummer potential. The latest version of GRAPE, GRAPE-8, adopts this type of symmetrized Plummer potential as a standard feature. In addition, Phantom-GRAPE [50] is also equipped with this symmetrized Plummer potential (Tanikawa, private communication).
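The Python sketch below shows how the cell quantities required by Eqs. (21)–(23) might be accumulated and used as an acceptance test. The data layout and the function names are ours, not those of the actual treecode or of the GRAPE-8 implementation; a real code would of course store these moments in the tree nodes during tree construction.

import numpy as np

def cell_moments(pos, mass, eps):
    """Aggregate quantities stored for a tree cell: total mass, center of
    mass, mass-weighted mean softening squared (Eq. 21), cell size, and
    the minimum and maximum softening lengths in the cell."""
    M = mass.sum()
    com = (mass[:, None] * pos).sum(axis=0) / M
    eps2_bar = (mass * eps**2).sum() / M
    size = (pos.max(axis=0) - pos.min(axis=0)).max()
    return M, com, eps2_bar, size, eps.min(), eps.max()

def accept_cell(r_i, eps_i, com, eps2_bar, size, eps_min, eps_max,
                eta=0.5, eta_eps=0.5):
    """Return True if the monopole of the cell may be used for the force on
    a particle at r_i with softening eps_i (criteria of Eqs. 22 and 23)."""
    R2 = np.dot(r_i - com, r_i - com) + eps_i**2 + eps2_bar
    R = np.sqrt(R2)
    return (size / R < eta) and ((eps_max**2 - eps_min**2) / R2 < eta_eps)

If either criterion fails, the cell is opened and its children are examined, exactly as in the ordinary Barnes–Hut traversal.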
4.4. A new formulation of smoothed particle hydrodynamics
Recently, there have been several comparison studies between SPH and Eulerian grid codes, in which SPH was demonstrated to fail in handling the Kelvin–Helmholtz (KH) instability. The essential reason for this failure is that the standard SPH formulation uses the "smoothed" density to evaluate all other quantities, while there is a discontinuity of the density at the contact discontinuity where the KH instability develops.
In standard SPH, the smoothed estimate of a physical quantity f is expressed as
f(\mathbf{r}_i) \simeq \sum_j f_j \frac{m_j}{\rho_j} W(r_{ij}, h_i) ,  (24)
where r_ij = |r_ij| and r_ij = r_i - r_j, h_i is the kernel size of particle i, and W is the smoothing kernel. By substituting ρ for f, we obtain
\rho_i \simeq \sum_j m_j W(r_{ij}, h_i) ,  (25)
where ρ_i ≡ ρ(r_i) is the smoothed density at the position of particle i. Since the right-hand side of this equation contains no unknown quantities, it is evaluated first. The equations of motion and energy in the SPH expression are given by
\frac{d^2 \mathbf{r}_i}{dt^2} = - \sum_j m_j \left( \frac{P_i}{\rho_i^2} + \frac{P_j}{\rho_j^2} \right) \nabla \tilde{W}_{ij} ,  (26)
and
\frac{du_i}{dt} = \sum_j m_j \frac{P_i}{\rho_i^2} \mathbf{v}_{ij} \cdot \nabla \tilde{W}_{ij} ,  (27)
respectively. Here, W̃_ij is the symmetrized kernel, given by W̃_ij = (1/2)[W(r_ij, h_i) + W(r_ij, h_j)], and v_ij = v_i - v_j. Equations (25), (26), and (27) are closed with the equation of state (EOS),
P = (\gamma - 1)\rho u ,  (28)
where γ is the specific heat ratio and u is the specific internal energy.
In the standard SPH, the density is evaluated first and other quantities are calculated using the smoothed density. Consider the case that two media with uniform but different densities come into contact with a sharp density jump. Since the standard SPH evaluates the density by convolution with the kernel function, the smoothed density at the interface is always overestimated on the low-density side and underestimated on the high-density side. This error in the smoothed density propagates to the calculated pressure, resulting in a repulsive force at the interface, which is the cause of the large error in the motion.
Conceptually, the reason why the density is necessary to obtain the smoothed estimates of other quantities is that we need the volume element associated with each particle, and in the standard SPH we use m_i/ρ_i for that purpose. In principle, one can use any physical quantity to obtain the volume element. Since the difficulty at the contact discontinuity is due to the jump in density, it is desirable to use a quantity which is not discontinuous there, and the obvious candidate is the pressure P. In the case of the ideal gas,
\Delta V_j = \frac{(\gamma - 1) U_j}{P_j} ,  (29)
where U_i = m_i u_i is the internal energy of particle i, gives the volume element, and we can derive a new set of SPH equations in a similar fashion to the standard set. The final forms of the equation of motion and the energy equation are
m_i \frac{d\mathbf{v}_i}{dt} = -(\gamma - 1) \sum_j U_i U_j \left( \frac{1}{q_i} + \frac{1}{q_j} \right) \nabla \tilde{W}_{ij} ,  (30)
and
\frac{dU_i}{dt} = (\gamma - 1) \sum_j \frac{U_i U_j}{q_i} \mathbf{v}_{ij} \cdot \nabla \tilde{W}_{ij} ,  (31)
respectively (see [56] for detailed derivations of these equations). It is obvious that the equation of motion does not include the density. Instead, it includes the energy density q, whose smoothed estimate, q_i = Σ_j U_j W(r_ij, h_i), is obtained in the same way as the density in Eq. (25). Thus, this formulation should show good behavior at the contact discontinuity.
The standard SPH needs to adopt an artificial viscosity term to capture shocks. Our new SPH also adopts an artificial viscosity term, which is the same as the one used in the standard SPH. According to our tests, the artificial viscosity term used in the standard SPH, with the smoothed density estimate, works well with our new SPH.
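To make the structure of Eqs. (30) and (31) explicit, the following Python sketch evaluates their right-hand sides for a small particle set, together with the smoothed energy density that replaces Eq. (25). The Gaussian kernel, the common smoothing length, and the function names are ours, and the artificial viscosity term is omitted for brevity.

import numpy as np

def kernel(r, h):
    """Gaussian kernel in 3D (used here only for illustration)."""
    return np.exp(-(r / h)**2) / (np.pi**1.5 * h**3)

def grad_kernel(dr, h):
    """Gradient of the Gaussian kernel with respect to r_i."""
    r = np.linalg.norm(dr)
    return -2.0 * dr / h**2 * kernel(r, h)

def disph_rhs(pos, vel, m, u, h, gamma=5.0 / 3.0):
    """Accelerations and dU/dt of the energy-density formulation
    (Eqs. 30 and 31), with a common smoothing length h for all particles."""
    n = len(m)
    U = m * u
    # smoothed internal energy density, analogous to Eq. (25)
    q = np.array([sum(U[j] * kernel(np.linalg.norm(pos[i] - pos[j]), h)
                      for j in range(n)) for i in range(n)])
    acc = np.zeros_like(pos)
    dUdt = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gW = grad_kernel(pos[i] - pos[j], h)   # symmetric, since h_i = h_j
            acc[i] += -(gamma - 1.0) * U[i] * U[j] \
                      * (1.0 / q[i] + 1.0 / q[j]) * gW / m[i]
            dUdt[i] += (gamma - 1.0) * U[i] * U[j] / q[i] \
                       * np.dot(vel[i] - vel[j], gW)
    return acc, dUdt

The density never appears in the momentum equation here, which is the point of the formulation: only the smoothed energy density q, which is continuous across a contact discontinuity, enters the pressure force.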
We have performed many detailed tests of the behavior of the new formulation [56]. In this review, we show the result of a very simple test, the evolution of a hydrostatic fluid system. By definition, a hydrostatic fluid is static, which means it should not show any evolution. Figure 11 shows the evolution of a two-fluid system with a density contrast of 64 over eight sound-crossing times. In the calculation with the standard SPH (the top row), the boundary of the two fluids evolves from a square shape to a much rounder shape, and a wide empty ring structure develops between the two fluids. This unphysical evolution is induced by the artificial surface tension at the density jump. In contrast, in the calculation with our SPH method (the bottom row), the system shows virtually no evolution except for small local adjustments of the positions of particles, which would occur even for a single-fluid system.
Density jumps are everywhere in the universe. Therefore, the fact that the standard SPH can fail so miserably in modeling such a jump is rather worrisome, and this failure explains why SPH had failed in many comparison tests with grid codes. We believe our new form has the potential to replace the standard, density-based form. One might imagine that our new SPH would fail to handle strong shocks, since the pressure jump in the shock region is orders of magnitude larger than the density jump. However, as long as we adopt the smoothed density for the evaluation of the artificial viscosity, there is no critical trouble under strong-shock conditions. See also [56]. Our form can be extended to non-ideal EOSs, and we are currently testing one such extended formulation. One problem is that when the pressure becomes nearly zero, our formulation fails. In such a case, a formulation which explicitly solves the continuity equation [57] might be preferred.
Fig. 11. Evolution of a two-fluid system with a density contrast of 64. Snapshots at t = 0.1, 0.3, 0.5, 1.0, and 8.0 are shown. The red and blue points indicate the positions of particles with ρ = 64 and ρ = 1, respectively. The upper row shows the results of the standard SPH, whereas the lower row shows those of our new SPH. The particle separation is constant and the particle mass ratio is 1:64.
5. Final words
In hindsight, the 1990s was a very good period for the development of a special-purpose architecture such as GRAPE, for two reasons. First, semiconductor technology reached the point where many floating-point arithmetic units could be integrated into a chip. Second, the initial design cost of a chip was still within the reach of fairly small research projects in basic science. Now, semiconductor technology has reached the point where one can integrate thousands of arithmetic units into a chip. On the other hand, the initial design cost of a chip has become too high. The use of FPGAs and the GRAPE-DR approach are two examples of ways to tackle the problem of the increasing initial cost. However, unless one can keep increasing the budget, the GRAPE-DR approach is not viable, simply because it still means an exponential increase in the initial, and therefore total, cost of the project. On the other hand, such an increase in the budget might not be impossible, since the field of computational science as a whole is becoming more and more important. Even though a supercomputer is expensive, it is still much less expensive than, for example, particle accelerators or space telescopes. Of course, computer simulation cannot replace real experiments or observations, but computer simulations have become essential in many fields of science and technology.
In addition, there are several technologies available in between FPGAs and custom chips. One is what is called a "structured ASIC". It requires customization of typically just one metal layer, resulting in a large reduction in the initial cost. The number of gates one can fit into a given silicon area falls between those of FPGAs and custom chips. We are currently working on a new fully pipelined system based on this structured-ASIC technology. The price of the chip is not very low, but in the current plan it gives extremely good performance for very low energy consumption.
When we look back at the evolution of numerical schemes, our impression is that many of the schemes, including those we have developed, are still rather crude, and there are many possibilities for improvement, both in the direction of parallelization and in that of more sophisticated numerical schemes. We hope this review helps readers get some ideas about new directions.
Acknowledgements
This work is supported in part by a Grant-in-Aid for Scientific Research (21244020) and the Strategic Programs for Innovative Research (SPIRE) of the Ministry of Education, Culture, Sports, Science and Technology.
References
[1] J. Barnes and P. Hut, Nature 324, 446 (1986).
[2] L. Greengard and V. Rokhlin, J. Comput. Phys. 73, 325 (1987).
[3] T. Ito, J. Makino, T. Ebisuzaki, and D. Sugimoto, Comput. Phys. Commun. 60, 187 (1990).
[4] J. Makino, T. Ito, and T. Ebisuzaki, Publ. Astron. Soc. Jpn. 42, 717 (1990).
[5] J. Makino and S. J. Aarseth, Publ. Astron. Soc. Jpn. 44, 141 (1992).
[6] J. Makino and M. Taiji, Scientific Simulations with Special-Purpose Computers — The GRAPE Systems (Wiley, Chichester, UK, 1998).
[7] A. Kawai, T. Fukushige, J. Makino, and M. Taiji, Publ. Astron. Soc. Jpn. 52, 659 (2000).
[8] J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto, Astrophys. J. 480, 432 (1997).
[9] J. Makino, T. Fukushige, M. Koga, and K. Namura, Publ. Astron. Soc. Jpn. 55, 1163 (2003).
[10] T. Hamada, T. Fukushige, A. Kawai, and J. Makino, Publ. Astron. Soc. Jpn. 52, 943 (2000).
[11] S. J. Aarseth, Mon. Not. R. Astron. Soc. 126, 223 (1963).
[12] S. J. Aarseth, Direct methods for N-body simulations. In Multiple Time Scales, eds. J. U. Brackbill and B. I. Cohen (Academic Press, New York, 1985), pp. 377–418.
[13] P. Hut, J. Makino, and S. McMillan, Astrophys. J. Lett. 443, L93 (1995).
[14] K. Nitadori and J. Makino, New Astron. 13, 498 (2008).
[15] J. K. Salmon and M. S. Warren, J. Comput. Phys. 111, 136 (1994).
[16] S. L. W. McMillan, The vectorization of small-N integrators. In The Use of Supercomputers in Stellar Dynamics, eds. P. Hut and S. L. W. McMillan, Lecture Notes in Physics, Vol. 267 (Springer, Berlin, 1986), p. 156.
[17] J. Makino, Publ. Astron. Soc. Jpn. 43, 859 (1991).
[18] J. Makino, New Astron. 7, 373 (2002).
[19] L. Hernquist, Astrophys. J. Suppl. 64, 715 (1987).
[20] J. E. Barnes, J. Comput. Phys. 87, 161 (1990).
[21] L. Hernquist, J. Comput. Phys. 87, 137 (1990).
[22] J. Makino, J. Comput. Phys. 87, 148 (1990).
[23] M. S. Warren and J. K. Salmon, Astrophysical N-body simulations using hierarchical tree data structures (IEEE Comp. Soc., Los Alamitos, 1992), pp. 570–576.
[24] J. Makino and P. Hut, Comput. Phys. Rep. 9, 199 (1989).
[25] J. Makino, Publ. Astron. Soc. Jpn. 56, 521 (2004).
[26] T. Ishiyama, T. Fukushige, and J. Makino, Publ. Astron. Soc. Jpn. 61, 1319 (2009).
[27] L. Hernquist and N. Katz, Astrophys. J. Suppl. 70, 419 (1989).
[28] S. L. W. McMillan and S. J. Aarseth, Astrophys. J. 414, 200 (1993).
[29] M. Fujii, M. Iwasawa, Y. Funato, and J. Makino, Publ. Astron. Soc. Jpn. 59, 1095 (2007).
[30] S. Oshino, Y. Funato, and J. Makino, Publ. Astron. Soc. Jpn. 63, 881 (2011).
[31] N. Katz and J. E. Gunn, Astrophys. J. 377, 365 (1991).
[32] J. F. Navarro and W. Benz, Astrophys. J. 380, 320 (1991).
[33] N. Katz, Astrophys. J. 391, 502 (1992).
[34] M. Steinmetz and E. Mueller, Astron. Astrophys. 281, L97 (1994).
[35] J. F. Navarro and M. Steinmetz, Astrophys. J. 478, 13 (1997).
[36] M. Steinmetz and J. F. Navarro, Astrophys. J. 513, 555 (1999).
[37] R. J. Thacker and H. M. P. Couchman, Astrophys. J. Lett. 555, L17 (2001).
[38] M. G. Abadi, J. F. Navarro, M. Steinmetz, and V. R. Eke, Astrophys. J. 591, 499 (2003).
[39] J. Sommer-Larsen, M. Götz, and L. Portinari, Astrophys. J. 596, 47 (2003).
[40] T. R. Saitoh and K. Wada, Astrophys. J. Lett. 615, L93 (2004).
[41] F. Governato et al., Astrophys. J. 607, 688 (2004).
[42] F. Governato et al., Mon. Not. R. Astron. Soc. 374, 1479 (2007).
[43] F. Governato et al., Nature 463, 203 (2010).
[44] C. B. Brook et al., Mon. Not. R. Astron. Soc. 415, 1051 (2011).
[45] J. Guedes, S. Callegari, P. Madau, and L. Mayer, Astrophys. J. 742, 76 (2011).
[46] J. Makino, Publ. Astron. Soc. Jpn. 43, 859 (1991).
[47] T. R. Saitoh and J. Makino, Astrophys. J. Lett. 697, L99 (2009).
[48] W. B. Streett, D. J. Tildesley, and G. Saville, Mol. Phys. 35, 639 (1978).
[49] T. R. Saitoh and J. Makino, Publ. Astron. Soc. Jpn. 62, 301 (2010).
[50] A. Tanikawa, K. Yoshikawa, K. Nitadori, and T. Okamoto, arXiv:1203.4037.
[51] A. W. Appel, SIAM J. Sci. Stat. Comput. 6, 85 (1985).
[52] V. Springel, N. Yoshida, and S. D. M. White, New Astron. 6, 79 (2001).
[53] T. R. Saitoh and J. Makino, New Astron. 17, 76 (2012).
[54] S. D. M. White, Mon. Not. R. Astron. Soc. 177, 717 (1976).
[55] S. J. Aarseth and S. M. Fall, Astrophys. J. 236, 43 (1980).
[56] T. R. Saitoh and J. Makino, arXiv:1202.4277.
[57] J. J. Monaghan, J. Comput. Phys. 110, 399 (1994).