Application performance on new IBM HPCF

IFS performance on the new IBM Power6 systems at ECMWF
Deborah Salmond
13th Workshop on the use of HPC in Meteorology, 3-7 Nov 2008
Plan for Talk
• HPC systems at ECMWF
  – 2 x IBM p5 575+ clusters (Power5+)
  – New IBM p6 575 cluster (Power6)
• Preliminary performance measurements for IFS Cycle 35r1 - ECMWF’s operational weather forecasting model - on Power6 compared with Power5+
• Power6 system is available to all ECMWF internal users for research experiments
Power5+ (hpce & hpcf)              Power6 (c1a & c1b)
IBM p5 575+                        IBM p6 575
Power5+ 1.9 GHz + SMT              Power6 4.7 GHz + SMT
Peak 7.6 Gflops per core           Peak 18.8 Gflops per core
2480 cores per cluster             7936 cores per cluster
16 cores per node                  32 cores per node
Federation Switch                  QLogic Silverstorm InfiniBand Switch
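The peak figures follow from the clock rate times four floating-point operations per cycle (two fused multiply-add units per core on both chips); a quick check of that arithmetic:

# Peak per core = clock (GHz) x flops per cycle (2 FMA units -> 4 flops/cycle)
flops_per_cycle = 4
for name, clock_ghz in [("Power5+", 1.9), ("Power6", 4.7)]:
    print(name, clock_ghz * flops_per_cycle, "Gflops peak per core")
# -> Power5+ 7.6, Power6 18.8, matching the figures above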
                           Power5+        Power6         Increase
Clock                      1.9 GHz        4.7 GHz        2.5 x
Cores per cluster          2480           7936           3.2 x
Compute nodes per cluster  155            248            1.6 x
Cores per node             16 (32 SMT)    32 (64 SMT)    2 x
Memory per node            32 Gbytes      64 Gbytes      2 x (1x per core)
L2 cache per core          0.9 Mbytes     4 Mbytes       4 x
ECMWF’s next operational resolution: T1279 L91
• Horizontal grid-spacing = ~16 km
• Number of horizontal gridpoints = 2,140,704
• Timestep = 450 secs
• Flops for 10-day forecast = 8 x 10^15
[Figure: global map illustrating the T1279 resolution - colour-scale and latitude labels omitted]
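An illustrative back-of-envelope check (the per-step numbers below are derived here, not quoted above):

# Derived from the figures above: steps in a 10-day forecast and flops per step/gridpoint
steps = 10 * 86400 / 450                       # 450 s timestep -> 1920 steps
flops_per_step = 8e15 / steps                  # ~4.2e12 flops per timestep
flops_per_point = flops_per_step / 2_140_704   # ~2e6 flops per gridpoint per step
print(int(steps), f"{flops_per_step:.2e}", f"{flops_per_point:.2e}")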
ECMWF’s next operational resolution for EPS: T399
• Horizontal grid-spacing = ~50 km
• Number of horizontal gridpoints = 213,988
• Timestep = 1800 secs
• Flops for 51-member EPS = 5 x 10^15
[Figure: global map illustrating the T399 resolution - colour-scale and latitude labels omitted]
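For comparison (derived arithmetic, not quoted above), the per-member cost is small next to the deterministic T1279 run:

# Derived: cost per EPS member and whole-EPS cost relative to one T1279 forecast
flops_per_member = 5e15 / 51         # ~1e14 flops per member
eps_vs_t1279 = 5e15 / 8e15           # 51-member EPS ~ 0.6 x one T1279 10-day forecast
print(f"{flops_per_member:.1e}", f"{eps_vs_t1279:.2f}")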
IFS T1279 L91 forecast (35r1) on P6 compared with P5+ - same number of cores

       Wall time (secs)   Number of cores    Tflops   % of Peak   Speed-up
P5+    10332              640* (40 nodes)    0.77     15.9        1
P6     6610               640  (20 nodes)    1.21     9.9         1.56

* 160 MPI tasks and 8 OpenMP threads – using SMT
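The Tflops and % of Peak columns can be reproduced from the 8 x 10^15 flop count and the per-core peaks given earlier:

# Sustained Tflops = total flops / wall time; % of peak uses the per-core peaks given earlier
flops_total = 8e15                                  # 10-day T1279 forecast
peak_per_core = {"P5+": 7.6e9, "P6": 18.8e9}        # flops/s
runs = {"P5+": (10332, 640), "P6": (6610, 640)}     # wall time (s), cores
for system, (wall, cores) in runs.items():
    sustained = flops_total / wall
    pct_peak = 100 * sustained / (cores * peak_per_core[system])
    print(system, f"{sustained / 1e12:.2f} Tflops, {pct_peak:.1f}% of peak")
# -> P5+ 0.77 Tflops, 15.9% of peak; P6 1.21 Tflops, ~10% of peak (the table quotes 9.9)
# Speed-up = 10332 / 6610 = 1.56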
IFS T1279 L91 forecast (35r1) on P6 compared with P5+ - 3.2 x number of cores

       Wall time (secs)   Number of cores     Tflops   % of Peak   Speed-up
P5+    10332              640  (40 nodes)     0.77     15.9        1
P6     2541*              2048 (64 nodes)     3.15     8.2         4.06

* Some performance problems, e.g. load imbalance from ‘jitter’
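Combining the per-core-count gain from the previous comparison with the 3.2 x increase in cores gives a rough scaling efficiency (an illustrative derivation, not a figure from the table):

# Illustrative: observed speed-up vs. the value implied by perfect scaling to 3.2x the cores
same_cores_gain = 10332 / 6610             # 1.56 on the same number of cores
ideal = same_cores_gain * (2048 / 640)     # ~5.0 if scaling were perfect
observed = 10332 / 2541                    # ~4.06
print(f"{ideal:.2f} {observed:.2f} {observed / ideal:.0%} scaling efficiency")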
Variable time-step times – now mostly fixed
[Figure: time per timestep (sec) against step number (350-700) for Power5 on 1280 cores and Power6 on 4096 cores]
IFS T1279 10-day forecast on Power6 - Operations
[Figure: sustained Tflops against % of cluster for P6 and P5 at physical core counts from 320 to 4096; runs use SMT and 8 OpenMP threads]
IFS T1279 10-day forecast on Power6 - scalability to whole cluster
[Figure: sustained Tflops against % of cluster for P6 and P5, with P6 extended to 7040 physical cores]
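Reading ‘% of cluster’ as physical cores divided by the 7936 cores of one Power6 cluster (a derived interpretation, not stated on the slide), the largest run uses most of the machine:

# Derived: fraction of one Power6 cluster (7936 cores) used at each core count
for cores in (2048, 4096, 7040):
    print(cores, f"{100 * cores / 7936:.0f}% of cluster")
# -> 26%, 52%, 89%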
IFS T1279 10-day forecast on Power6 - speed-up curve
[Figure: speed-up against % of cluster for P6 compared with ideal, up to 7040 physical cores = 14080 user threads]
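With SMT there are two hardware threads per physical core, so 7040 cores give the 14080 user threads quoted; the MPI-task count below assumes the 8 OpenMP threads used for the earlier runs and is only an illustration:

cores = 7040
user_threads = cores * 2                   # SMT: 2 hardware threads per physical core -> 14080
omp_threads = 8                            # as used for the runs shown earlier (assumed here)
mpi_tasks = user_threads // omp_threads    # -> 1760, illustrative decomposition only
print(user_threads, mpi_tasks)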
DrHook on Power5+ and Power6 for T1279 forecast

Power5+:
                      TIME                                       MFLOPS/Logical core
Routine               All PEs    % Tot   Ave     Min     Max     Ave      Min      Max
CUADJTQ               27398.0    6.64    85.6    73.7    97.7     645.4    518.0    733.0
CLOUDSC               27721.8    6.72    86.6    80.5    94.6     362.3    352.0    374.0
MXMAOP                26762.5    6.49    83.6    76.8    87.6    2606.6   2342.0   2941.0
CLOUDVAR               8233.4    2.00    25.7     7.4    50.8     197.5     71.0    232.0
>MPL-TRLTOG_COMMS      7791.7    1.89    24.3     2.5    33.3       0.0      0.0      0.0
VDFEXCU                9391.3    2.28    29.3    26.5    32.9     290.6    266.0    311.0
VERINT                 9606.8    2.33    30.0    28.1    32.0    2318.5   2268.0   2365.0
LAITQM                 8075.2    1.96    25.2    21.6    32.0     989.4    852.0   1113.0
VDFMAIN                9565.0    2.32    29.9    27.8    31.2     473.1    457.0    506.0
SRTM_SPCVRT_MCICA      9265.2    2.25    29.0    27.9    30.0     278.6    267.0    292.0
>MPL-TRMTOL_COMMS      9134.5    2.22    28.5    27.2    29.4       0.0      0.0      0.0

Power6:
                      TIME                                       MFLOPS/Logical core
Routine               All PEs    % Tot   Ave     Min     Max     Ave      Min      Max
CLOUDSC               17512.5    7.23    54.7    51.4    59.4     573.5    551.2    595.6
MXMAOP                13125.1    5.42    41.0    35.1    44.6    5314.9   5124.3   5776.4
CUADJTQ                9564.7    3.95    29.9    26.4    33.5    1847.7   1446.0   2137.7
CLOUDVAR               4634.2    1.91    14.5     3.6    29.3     350.0    145.9    402.2
>MPL-TRMTOL_COMMS           -    2.64    19.9    18.1    21.2       0.0      0.0      0.0
SRTM_SPCVRT_MCICA      5665.2    2.34    17.7    17.2    18.6     456.4    433.0    470.9
VDFMAIN                5594.3    2.31    17.5    16.9    18.5     808.3    751.7    853.3
>MPL-TRLTOG_COMMS           -    1.79    13.5     3.8    17.8       0.0      0.0      0.0
LAITQM                 5455.2    2.25    17.0    16.7    17.5    1466.6   1101.9   2035.2
Speed-up per core - Computation (Power5+ -> Power6)
• Compute speed-up
  – Clock cycle = 2.47 x

Routine    Description           Gflops/core on Power6   Speed-up P5 -> P6
CUADJTQ    Math functions        3.6                     1.9
CLOUDSC    IF tests              1.1                     1.4
MXMAOP     DGEMM call            10.6                    2.2
SRTM       IF tests              0.9                     1.5
LAITQM     Indirect addressing   2.9                     1.5
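These Gflops/core figures are consistent with the DrHook MFLOPS per logical core shown earlier if the two SMT threads sharing each physical core are added together (an inferred relationship, shown here as a check):

# Assumption: Gflops per physical core ~= 2 x (DrHook MFLOPS per logical core) / 1000
drhook_p6_mflops = {"CUADJTQ": 1847.7, "CLOUDSC": 573.5, "MXMAOP": 5314.9,
                    "SRTM_SPCVRT_MCICA": 456.4, "LAITQM": 1466.6}
for routine, mflops in drhook_p6_mflops.items():
    print(routine, round(2 * mflops / 1000, 1), "Gflops/core")
# -> 3.7, 1.1, 10.6, 0.9, 2.9 - close to the table above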
Speed-up per core - Communications (Power5+ -> Power6)
• Communications speed-up
  – 8 IB links per node on Power6
  – 2 Federation links per node on Power5+
  – Increase in aggregate Bandwidth per core = 2 x (arithmetic sketched below the table)

Routine    Description                           Speed-up P5 -> P6
TRLTOG     Transposition Fourier to Grid-Point   1.89
TRMTOL     Transposition Spectral to Fourier     1.52
SLCOMM1    Semi-Lagrangian Halo                  1.44
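The 2 x aggregate bandwidth per core follows from the link and core counts per node, assuming broadly comparable bandwidth per link on the two interconnects:

# Links per core on each system (per-link bandwidth assumed broadly comparable)
p5_links_per_core = 2 / 16       # 2 Federation links, 16 cores per node
p6_links_per_core = 8 / 32       # 8 InfiniBand links, 32 cores per node
print(p6_links_per_core / p5_links_per_core)   # -> 2.0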
Power5 compared with Power6
• Very similar for users ☺
• Re-compile with -qarch=pwr6
• No ‘out-of-order execution’ on Power6
  – so SMT is more advantageous
• Some constructs are relatively slower on Power6
  – Floating point compare & branch
  – Store followed by load on the same address
HPCF at ECMWF 1978-2011 (sustained performance)
• 1978 CRAY1 (1 CPU, 40 Mflops sustained) - Vector
• 1984 CRAY XMP-2, 1986 CRAY XMP-4, 1990 CRAY YMP-8, 1992 CRAY C90-16 - Vector, Shared Memory Parallel
• 1996 Fujitsu VPP700, 2000 Fujitsu VPP5000 - Vector, MPI
• 2002 IBM Power4, 2004 IBM Power4+, 2006 IBM Power5+ (2 x 2480 cores, ~4 Tflops sustained), 2008 IBM Power6 (2 x 7936 cores, ~20 Tflops sustained) - Scalar, MPI+OpenMP -> SMT
History of IFS scalability
[Figure: log-log plot of sustained Gflop/s (1 to 10000) against number of cores (100 to 10000) for:
 CRAY T3E-1200 1998, RAPS-4 T213 L31;
 IBM p690+ 2004, RAPS-8 T799 L91;
 IBM p575+ 2006, RAPS-9 T799 L91;
 IBM Power6 p575 2008, RAPS-10+ T1279 L91]