Accelerating an N-Body Simulation
Anuj Kalia
Maxeler Technologies
CSE @ IITD
Dataflow for one simulation iteration (a force-computation sketch follows this list):
1. The CPU loads the particle data into the FPGA's DRAM once per iteration, i.e. every N*N cycles.
2. A new set of 4 values is read from DRAM in every cycle.
3. 16 force computations are performed per cycle, based on the 16 scalar inputs and the 4 values just read.
4. The pipeline and accumulator are described on another slide.
5. Every pipeline outputs 12 partial sums after N cycles.
6. The CPU adds the 12 partial sums together for every particle, updates the velocities and positions, and re-writes the particle data into DRAM.
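
The slides do not show the force expression itself. As a point of reference, the sketch below is a softened-gravity pairwise interaction in plain C, assuming the 4 values streamed from DRAM each cycle are (x_j, y_j, z_j, m_j); the function name, the mass parameter mj, and the use of EPS as a softening term are illustrative assumptions, not taken from the kernel source.

#include <math.h>

/* One interaction of the kind each of the 16 pipelines evaluates per cycle:
 * acceleration contribution of particle j on particle i, with softening eps. */
static void body_body_interaction(float xi, float yi, float zi,   /* particle i (scalar inputs) */
                                  float xj, float yj, float zj,   /* particle j (streamed)      */
                                  float mj, float eps,
                                  float *ax, float *ay, float *az)
{
    float dx = xj - xi;
    float dy = yj - yi;
    float dz = zj - zi;
    float dist_sqr  = dx*dx + dy*dy + dz*dz + eps*eps;
    float inv_dist  = 1.0f / sqrtf(dist_sqr);
    float inv_dist3 = inv_dist * inv_dist * inv_dist;
    float s = mj * inv_dist3;
    *ax += s * dx;
    *ay += s * dy;
    *az += s * dz;
}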
Host C code:

for (int j = 0; j < N/PAR; j++)                          /* runs N/PAR times */
{
    /* set scalar inputs */
    max_set_scalar_input(device, "RowSumKernel.N", N, FPGA_A);
    max_set_scalar_input_f(device, "RowSumKernel.EPS", EPS, FPGA_A);
    for (int p = 0; p < PAR; p++)
    {
        max_set_scalar_input_f(device, pi_x[p], px[j*PAR + p], FPGA_A);
        max_set_scalar_input_f(device, pi_y[p], py[j*PAR + p], FPGA_A);
        max_set_scalar_input_f(device, pi_z[p], pz[j*PAR + p], FPGA_A);
    }

    /* run the kernel for N cycles */
    max_run(device,
            max_output("ax", outputX, 12*PAR*sizeof(float)),
            max_output("ay", outputY, 12*PAR*sizeof(float)),
            max_output("az", outputZ, 12*PAR*sizeof(float)),
            max_runfor("RowSumKernel", N),
            max_end());

    /* sum up the partial sums */
    for (int i = 0; i < 12*PAR; i++)
    {
        ax[j*PAR + (i/12)] += outputX[i];
        ay[j*PAR + (i/12)] += outputY[i];
        az[j*PAR + (i/12)] += outputZ[i];
    }
}
/* then: update velocity, update position, load memory (write the updated particles back to DRAM) */
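
The velocity and position updates are only named in the trailing comments above. A minimal sketch of that host-side step, assuming a simple Euler integrator and hypothetical names vx, vy, vz, and dt (this code is not on the slides):

/* Hypothetical host-side update once all N/PAR kernel launches have filled ax/ay/az. */
static void update_particles(int n, float dt,
                             float *px, float *py, float *pz,
                             float *vx, float *vy, float *vz,
                             float *ax, float *ay, float *az)
{
    for (int i = 0; i < n; i++) {
        vx[i] += ax[i] * dt;            /* update velocity */
        vy[i] += ay[i] * dt;
        vz[i] += az[i] * dt;
        px[i] += vx[i] * dt;            /* update position */
        py[i] += vy[i] * dt;
        pz[i] += vz[i] * dt;
        ax[i] = ay[i] = az[i] = 0.0f;   /* clear accumulators for the next iteration */
    }
    /* the updated particle data is then re-written into the FPGA's DRAM
       ("load memory") for the next iteration */
}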
Pipeline and Accumulator:
1. One input per cycle: P_j data from DRAM.
2. Acceleration: accumulated as 12 partial sums (see the accumulator sketch below).
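
The 12 partial sums keep the pipelined floating-point adder busy: a single running sum would stall on the adder's multi-cycle latency (12 cycles would be consistent with 12 partial sums, though that number is an assumption), so the kernel rotates through 12 independent accumulators and leaves the final reduction to the CPU, as the host code above shows. A software model of the same idea:

/* Software model of a latency-hiding accumulator. ADDER_LATENCY = 12 matches
 * the 12 partial sums per pipeline on the slides; the exact adder latency on
 * the FPGA is an assumption. */
#define ADDER_LATENCY 12

float accumulate(const float *contrib, int n)
{
    float partial[ADDER_LATENCY] = {0.0f};

    /* Each partial[] slot is revisited only every ADDER_LATENCY additions,
     * so a pipelined adder never has to wait for its own previous result. */
    for (int j = 0; j < n; j++)
        partial[j % ADDER_LATENCY] += contrib[j];

    /* Final reduction of the 12 partial sums (done on the CPU in this design). */
    float sum = 0.0f;
    for (int k = 0; k < ADDER_LATENCY; k++)
        sum += partial[k];
    return sum;
}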
Resource Usage

Resource usage for the 16-fold parallel kernel @ 150 MHz:
LUTs:  156032 / 297600 (52.43%)
FFs:   166543 / 595200 (27.98%)
BRAMs: 433 / 1064 (40.70%)
DSPs:  288 / 2016 (14.29%)
Performance: Comparison

[Figure: runtime in seconds (0 to 1600) versus number of particles (0 to 450,000) for the hardware and software implementations.]
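
The exact CPU code behind the software curve is not shown. A plain O(N^2) baseline of the kind usually used in such comparisons would look like the sketch below; it reuses the body_body_interaction() sketch from earlier and is an assumption, not the measured reference code.

/* Hypothetical O(N^2) CPU baseline; self-interaction (i == j) is harmless
 * because the softening term keeps the denominator non-zero while dx=dy=dz=0. */
void compute_accelerations(int n, float eps,
                           const float *px, const float *py, const float *pz,
                           const float *m,
                           float *ax, float *ay, float *az)
{
    for (int i = 0; i < n; i++) {
        ax[i] = ay[i] = az[i] = 0.0f;
        for (int j = 0; j < n; j++)
            body_body_interaction(px[i], py[i], pz[i],
                                  px[j], py[j], pz[j],
                                  m[j], eps,
                                  &ax[i], &ay[i], &az[i]);
    }
}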
Performance: Speedup

[Figure: speedup of hardware over software (0 to 16) versus number of particles (0 to 450,000); annotated point at 38,400 particles.]