A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture
Yun Zhang, Michael J. Voss
University of Toronto
Guansong Zhang, Raul Silvera
IBM Toronto Lab
13-Jul-17
Agenda
Background
Motivation
Previous Work
Adaptive Schedulers
IBM Power 5 Architecture
A Multi-Level Hierarchical Scheduler
Evaluation
Future Work
Simultaneous Multi-Threading Architecture
Several threads per physical processor
Threads share:
Caches
Registers
Functional Units
Power 5 SMT
(Diagram: execution resources 0 … n over clock cycles, with instructions from Thread 0 and Thread 1 interleaved on the same core)
OpenMP
A standard API for shared memory programming
Adds directives for parallel regions
Standard Loop Schedulers:
Static
Dynamic
Guided
Runtime
OpenMP API

#pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
for ( i = 0; i < 100; i ++ ) {
    for ( j = 0; j < 100; j ++ ) {
        a[i][j] = a[i][j] + b[i][j];
    }
}

An example of a parallel loop in C. (Fortran is similar.)
(Diagram: the i-j iteration space divided among threads T0 … Tn)
Motivation
OpenMP applications are designed for SMP systems, not aware of SMT/Hyper-Threading technology
Understanding and controlling the performance of OpenMP applications on SMT processors is not trivial
Important performance issues on SMP systems with SMT nodes:
Inter-thread data locality
Instruction mix
SMT-related load balance
Scaling (SPEC & NAS)
(Chart: speedup for 1-8 threads on 4 Intel Xeon processors with Hyperthreading, for ammp, apsi, art, equake, mgrid, swim, wupwise, BT, CG, EP, FT, MG, and SP; curves compare 1 thread per processor against 1-2 threads per processor)
Why do they scale poorly?
Inter-thread data locality: cache misses
Instruction mix: sharing of functional units; the benefit gained this way may outweigh the cache misses
SMT-related load balance: work loads should be balanced well both among processors and among the threads running on the same physical processor
Previous Work: Runtime Adaptive Scheduler
Hierarchical scheduling
Upper-level scheduler
Lower-level scheduler
Select the scheduler and the number of threads to run at runtime
One thread per physical processor
Two threads per physical processor
Two-Level Hierarchical Scheduler
Traditional Scheduling
(Diagram: the i-j iteration space under static scheduling, with contiguous blocks assigned to threads T0 … Tn, versus dynamic scheduling, with chunks handed out to threads Ti, Tk on demand)
Hierarchical Scheduling
(Diagram: dynamic scheduling divides the i-j iteration space among processors P0 … Pi; static scheduling then divides each processor's share between its two threads, e.g. T00/T01 and Ti0/Ti1)
Why can we benefit from runtime scheduler selection?
Many parallel loops in OpenMP applications are executed again and again.

Percentage of execution time spent in loops, by number of calls:

Benchmark   < 10 times   10-40 times   > 40 times
ammp        0%           0%            84.20%
apsi        0%           0%            82.55%
art         100%         0%            0%
equake      0.05%        0%            98.23%
mgrid       0%           0.11%         95.95%
swim        0.09%        0%            99.25%
wupwise     0.12%        0%            99.49%
BT          0%           0%            100%
CG          0.92%        3.5%          92.57%
EP          100%         0%            0%
MG          12.73%       12.87%        71.91%
SP          1.02%        0%            92.71%

Example:

for (k = 1; k < 100; k++) {
    .............
    calculate();
    .............
}

void calculate () {
    #pragma omp parallel for schedule(runtime)
    for (i = 1; i < 100; i++) {
        ...............; // calculation
    }
}
Adaptive Schedulers
Region-Based Scheduler (RBS)
Selects loop schedulers at runtime
All parallel loops in one parallel region have to use the same scheduler, which may not be the best for each loop
Loop-Based Scheduler (LBS)
Higher runtime overhead
A more accurate scheduler choice for each parallel loop
Sample from NAS2004

!$omp parallel default(shared) private(i,j,k)
!$omp do schedule(runtime)
      do j=1,lastrow-firstrow+1
        do k=rowstr(j),rowstr(j+1)-1
          colidx(k) = colidx(k) - firstcol + 1
        enddo
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do i = 1, na+1
        x(i) = 1.0D0
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do j=1, lastcol-firstcol+1
        q(j) = 0.0d0
        z(j) = 0.0d0
        r(j) = 0.0d0
        p(j) = 0.0d0
      enddo
!$omp end do nowait
!$omp end parallel

The region-based scheduler picks one scheduler that applies to all three loops; the loop-based scheduler picks a scheduler for each loop individually.
Runtime Loop Scheduler Selection
Phase 1: try each upper-level scheduler in turn, running with 4 threads (T0-T3, one per physical processor P0-P3)
(Diagrams: loop M1 run under the Static, Dynamic, and Affinity Schedulers)
Runtime Loop Scheduler Selection
Phase 2: having made a decision on the upper-level scheduler (here Affinity), try the lower-level schedulers, running with 8 threads
(Diagram: loop M1 under the Affinity Scheduler with Static as the lower-level scheduler; threads T0-T7, two per physical processor)
Sample from NAS2004

!$omp parallel default(shared) private(i,j,k)
!$omp do schedule(runtime)
      do j=1,lastrow-firstrow+1            ! selected: Static-Static, 8 threads
        do k=rowstr(j),rowstr(j+1)-1
          colidx(k) = colidx(k) - firstcol + 1
        enddo
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do i = 1, na+1                       ! selected: TSS, 4 threads
        x(i) = 1.0D0
      enddo
!$omp end do nowait
!$omp do schedule(runtime)
      do j=1, lastcol-firstcol+1           ! selected: TSS, 4 threads
        q(j) = 0.0d0
        z(j) = 0.0d0
        r(j) = 0.0d0
        p(j) = 0.0d0
      enddo
!$omp end do nowait
!$omp end parallel
Hardware Counter Scheduler
Motivation
The RBS and LBS have runtime overhead. They would work even better if we could reduce that overhead as much as possible.
Algorithm
Try different schedulers on the parallel loops of a subset of the benchmarks, using training data
Use the loop characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) and the best scheduler for that loop as input
Feed the above data to classification software (we use C4.5) to build a decision tree
Apply this decision tree to a loop at runtime: feed the runtime-collected hardware counter data as input, and get the resulting scheduler as output
Speedup on 4 Threads
(Chart: speedup on 4 Intel Xeon processors with Hyperthreading for ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), and the average; bars compare static, dynamic, guided, afs, tss, original, RBS, LBS, and HCS)
Speedup on 8 Threads
(Chart: speedup on 4 Intel Xeon processors with Hyperthreading, running 8 threads (2 per processor), for ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), and the average; bars compare static, dynamic, guided, afs, tss, original, RBS, LBS, and HCS)
IBM Power 5
Technology: 130nm
Dual processor core
8-way superscalar
Simultaneous Multi-Threaded (SMT) core
Up to 2 virtual processors
24% area growth per core for SMT
A natural extension to the Power 4 design
Single Thread
Single-thread mode has an advantage when executing unit-limited applications
Floating- or fixed-point intensive workloads
The extra resources needed for SMT provide a higher performance benefit when dedicated to a single thread
Data locality on one SMT core is better with a single thread for some applications
Power 5 Multi-Chip Module (MCM)
Or Multi-Chipped Monster
4 processor chips
2 processors per chip
4 L3 cache chips
Power5 64-way Plane Topology
Each MCM has 4 interconnected processor chips
Each processor chip has two processors on chip
Each processor has SMT technology, therefore two threads can be executed on it simultaneously
Multi-Level Scheduler
(Diagram: loop iterations are divided by a 1st-level scheduler into iterations for modules 1 … n; a 2nd-level scheduler divides each module's share into iterations for processors 1 … m; a 3rd-level scheduler divides each processor's share into iterations for threads 1 … k)
OpenMP Implementation
Outlining technique
New subroutines are created with the body of each parallel construct
The runtime routines receive the address of the outlined procedure as a parameter
Source Code:

#pragma omp parallel for shared(a,b) private(i)
for ( i = 0; i < 100; i ++ ) {
    a = a + b;
}

Outlined Functions:

long main {
    _xlsmpParallelDoSetup_TPO(...)
}

void main@OL@1 ( ... ) {
    do {
        loop body;
    } while (loop end condition is met);
    return;
}

Runtime Library:
1. Initialize work items and work shares
2. Call _xlsmp_DynamicChunkCall(...), which dispatches the outlined function:

while (still iterations left, go get some iterations for this thread)
{
    ...
    call main@OL@1(...);
    ...
}
With the hierarchical scheduler, the same source code and outlined function are used, but the runtime library's dispatch loop instead asks the hierarchy for iterations:

while (hier_sched(...))
{
    ...
    call main@OL@1(...);
    ...
}
Example hierarchy: the Root uses Guided across modules M0 and M1; Static Cyclic is used below, across processors P0/P1 within each module and threads T0 … T7.

When a thread needs work, it:
1. Looks up its parent's iteration list to see if any iterations are available; if yes, it gets some iterations from the 2nd-level scheduler and returns.
2. Otherwise, it looks one level up, grabs the lock for its group, and seeks more iterations from the upper level using the upper-level loop scheduler (a recursive function call) until it gets some iterations or the whole loop ends.
Hierarchical Scheduler
Guided as the 1st-level scheduler
Balances work loads among processors
Reduces runtime overhead
Static Cyclic as the 2nd-level scheduler
Improves cache locality
Reduces runtime overhead

Iteration space divided using standard static scheduling:
T0 T0 T0 T0 | T1 T1 T1 T1
Iteration space divided using static cyclic scheduling:
T0 T1 T0 T1 | T0 T1 T0 T1
Evaluation
IBM Power 5 System
4 Power 5 1904 MHz SMT processors
31872 MB memory
Operating System: AIX 5.3
Compiler: IBM XL C/C++, XL Fortran
Benchmark: SPEC OMP2001
Scalability of IBM Power 5 SMT Processors
(Chart: speedup for 1 through 8 threads on ammp, applu, apsi, art, equake, gafort, mgrid, swim, and wupwise; x-axis is the number of threads, y-axis is speedup from 0.00 to 6.00)
Evaluation on Power 5
(Chart: execution time normalized to the default (static) scheduler for ammp, applu, apsi, art, equake, gafort, mgrid, swim, and wupwise; bars compare static, dynamic, guided, and the hierarchical (Hier) scheduler; y-axis from 0.80 to 1.10)
Conclusion
Standard schedulers are not aware of SMT technology
OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem
Adaptive hierarchical schedulers take SMT-specific characteristics into account, which can make the OpenMP API (software) and SMT technology (hardware) work better together
The multi-level hierarchical scheduler designed for the IBM Power 5 achieves an average improvement of 3% over the default loop scheduler on SPEC OMP2001
Large improvements of 7% and 11% on some benchmarks
Improves on average over all other standard OpenMP loop schedulers by at least 2%
Future Work
Evaluate the multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCM)
Explore performance on auto-parallelized benchmarks (SPEC CPU FP)
Examine mechanisms for determining the best scheduler configuration at compile time
Explore the use of helper threads on Power 5
Cache prefetching
Thank You~

(Backup: a cache miss comparison chart will be shown here, if there is a way to calculate the overall L2 load/store miss rate generally; if not, the overhead of this optimization from the tprof data will be shown instead.)
Schedulers' Speedup on 4 Threads
(Chart: speedup for ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), and the average; bars compare static, dynamic, guided, afs, tss, and original)
Schedulers' Speedup on 8 Threads
(Chart: speedup for ammp, apsi, art, equake, mgrid, swim, wupwise, BT(W), CG, EP, MG, SP(W), and the average; bars compare static, dynamic, guided, afs, tss, and original)
Decision Tree
Only one decision tree is built offline, before executing the program
That decision tree is applied to loops at runtime without changing the tree
A decision on which scheduler to use is made with only one run of each loop, which greatly reduces runtime scheduling overhead

uops <= 3.62885e+08 :
| cachemiss <= 111979 :
| | uops > 748339 : static-4
| | uops <= 748339 :
| | | l/s <= 167693 : static-4
| | | l/s > 167693 : static-static
| cachemiss > 111979 :
| | floatpoint <= 1.52397e+07 :
| | | cachemiss <= 384690 :
| | | | uops <= 2.06431e+07 : static-static
| | | | uops > 2.06431e+07 :
| | | | | imbalance <= 1330 : afs-static
| | | | | imbalance > 1330 :
| | | | | | cachemiss <= 301582 : afs-4
| | | | | | cachemiss > 301582 : guided-static
.......
uops > 3.62885e+08 :
| l/s > 7.22489e+08 : static-4
| l/s <= 7.22489e+08 :
| | imbalance <= 32236 : static-4
| | imbalance > 32236 :
| | | floatpoint <= 5.34465e+07 : static-4
| | | floatpoint > 5.34465e+07 :
| | | | floatpoint <= 1.20539e+08 : tss-4
| | | | floatpoint > 1.20539e+08 :
| | | | | floatpoint <= 1.45588e+08 : static-4
| | | | | floatpoint > 1.45588e+08 : tss-4

(Excerpt of the C4.5 decision tree used for hardware-counter scheduling.)
(Backup: a load imbalance comparison chart will be shown here - generating.)