The Impact of Performance Asymmetry on Emerging Multicore

The Impact of Performance Asymmetry in Multicore Architectures
Saisanthosh Balakrishnan, Ravi Rajwar, Michael Upton, Konrad Lai
UW-Madison and Intel Corp.
32nd Annual International Symposium on Computer Architecture
Performance asymmetry
... difference in compute power of processors
- Architectural differences
- Micro-architectural parameters
- Other (e.g., heat: thermal throttling)
Why asymmetry now?
- CMP / many cores as commodity systems
- Run a variety of workloads
- Good serial performance and high throughput
- Optimal energy consumption
Assume an asymmetric multicore system
2
Asymmetry & MT workloads
- Scalable? Performance vs. compute power for N procs. with different configs. Need to utilize asymmetry: a configuration with fast cores (F F, S S) should perform better than all slow (S S, S S)
- Stable? Performance across the same/many runs for N procs. with the same config. Need predictable and robust performance
3
The problems
Programmers
- Algorithm, correctness, thread partitioning
- Don't reason about asymmetry
Characteristics of threads
- Partitioning, synchronization barriers, interference, lifetime
Scheduling of threads
- OS kernel, library, application, DB/Web servers, managed runtime systems (Java, .NET)
4
Contributions
Asymmetry negatively affects applications
- Studied many workloads on real hardware
- Observed unpredictable workload behavior
This can be fixed by
- Evaluating threads’ work partitioning
- Scheduling of threads with asymmetry
5
Outline
Asymmetry and Performance
Evaluation Methodology
Asymmetric Configurations
Workloads and Results
6
Evaluation methodology
Asymmetry in real hardware
- Intel 4-way 3-GHz Xeon
- Different cores run at different frequencies
- Software controlled
Benefits
- Long real-time runs (no simulations)
- Workloads are set up according to specs
- Representative of other forms of asymmetry (communication, micro-architecture, etc.)
7
Configurations
Symmetric
- all fast: F F F F
- all slow: S S S S
Asymmetric
- 1 slow: F F F S
- 2 slow: F F S S
- 3 slow: F S S S
F = Full frequency
S = one-eighth of Full frequency (in talk and paper)
S = one-fourth of Full frequency (in paper)
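To make the five configurations concrete, the sketch below tabulates their aggregate compute power. It assumes, for illustration only, that compute power is the sum of per-core frequencies with one fast core normalized to 1; the talk does not define the quantity this way, and the names `CONFIGS`, `compute_power`, and `ideal_speedup_over_all_slow` are hypothetical.

```python
from fractions import Fraction

# Core layouts from the slide: F = full frequency, S = a fraction of it.
CONFIGS = {
    "all fast": "FFFF",
    "1 slow": "FFFS",
    "2 slow": "FFSS",
    "3 slow": "FSSS",
    "all slow": "SSSS",
}

def compute_power(layout, slow_ratio=Fraction(1, 8)):
    """Sum of per-core frequencies, with one fast core normalized to 1 (assumption)."""
    return sum(Fraction(1) if core == "F" else slow_ratio for core in layout)

def ideal_speedup_over_all_slow(layout, slow_ratio=Fraction(1, 8)):
    """Upper bound on speedup if throughput tracked aggregate frequency exactly."""
    return compute_power(layout, slow_ratio) / compute_power("SSSS", slow_ratio)

for name, layout in CONFIGS.items():
    power = float(compute_power(layout))
    bound = float(ideal_speedup_over_all_slow(layout))
    print(f"{name:9s} {layout}  power={power:.3f}  ideal speedup={bound:.2f}")
```

Under this assumption the all-fast machine has eight times the aggregate frequency of the all-slow one when S is one-eighth of F. Passing `slow_ratio=Fraction(1, 4)` gives the one-fourth variant.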
8
Studying impact
- Scalability: perf. metric across configurations (all fast, 1 slow, 2 slow, 3 slow, all slow)
- Stability: perf. metric across the same or many runs (asymmetric configurations)
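The two views can be sketched numerically. The snippet below is illustrative only: the talk plots speedup over the all-slow configuration for scalability, but it does not name a stability statistic, so the coefficient of variation across runs is an assumption here. The throughput numbers are hypothetical and are not results from the talk.

```python
from statistics import mean, pstdev

def speedup_over_all_slow(perf, all_slow_perf):
    """Scalability view: performance relative to the all-slow configuration."""
    return perf / all_slow_perf

def run_to_run_variation(runs):
    """Stability view (assumed metric): coefficient of variation across runs."""
    return pstdev(runs) / mean(runs)

# Hypothetical throughput numbers, 4 runs per configuration.
runs = {
    "all slow": [100.0, 101.0, 99.0, 100.0],
    "2 slow": [450.0, 300.0, 420.0, 260.0],
}

baseline = mean(runs["all slow"])
for name, values in runs.items():
    speedup = speedup_over_all_slow(mean(values), baseline)
    variation = run_to_run_variation(values)
    print(f"{name:9s} speedup={speedup:.2f}  variation={variation:.3f}")
```

A configuration can look good on the first number and poor on the second, which is the situation the talk calls scalable but not stable.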
9
Workloads evaluated
- SPECjbb, SPECjAppServer: middle-tier business apps.; throughput parallel
- Apache, Zeus: webservers; throughput parallel
- TPC-H, SPECOMP, H.264: task-based parallelization
- PMake: embarrassingly parallel
10
Impact of asymmetry
Workload        Scalable  Stable  Fix
SPECjbb         yes       no      yes
SPECjAppServer  yes       yes     -
Apache          yes       no      yes
Zeus            no        no      no
TPC-H           yes       no      yes
SPECOMP         no        no      yes
H.264           yes       yes     -
PMake           yes       yes     -
11
Workloads: SPECjbb and SPECjAppServer
- Managed runtime system (BEA JRockit & Sun HotSpot)
- Windows 2003 and Linux
- 2 GCs: Parallel and Gen. Concurrent. Only minor GC
- Up to 20 threads
- Minimal communication
12
SPECjbb
Stable? No (yes with kernel fix)
[Figure: Stability (JRockit/Gencon GC) on 2 slow. Axes: warehouses and transactions per second (thousands); 4 runs, shown without and with the kernel fix.]
- Problem: Interference from runtime system (JVM, GC)
- Fix: Kernel scheduler moves jobs from slow to fast cores when a fast core is free
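The kernel fix can be pictured as a simple migration rule. The function below is a minimal sketch of that rule under stated assumptions, not the actual kernel patch from the talk: `migrate_to_free_fast_cores`, the `assignment` list, and the `speeds` list are hypothetical names, and each core is assumed to hold at most one task.

```python
def migrate_to_free_fast_cores(assignment, speeds):
    """Return a new core-to-task list in which tasks on slower cores are
    moved to idle faster cores.

    assignment[i] is the task on core i, or None if the core is idle.
    speeds[i] is the relative speed of core i.
    """
    result = list(assignment)
    # Idle cores, fastest first.
    idle = sorted((i for i, task in enumerate(result) if task is None),
                  key=lambda i: -speeds[i])
    # Busy cores, slowest first, so the worst-placed task moves first.
    busy = sorted((i for i, task in enumerate(result) if task is not None),
                  key=lambda i: speeds[i])
    for src, dst in zip(busy, idle):
        if speeds[dst] <= speeds[src]:
            break  # no remaining idle core is faster than this task's core
        result[dst], result[src] = result[src], None
    return result

speeds = [1.0, 1.0, 0.125, 0.125]          # the "2 slow" configuration
before = [None, "worker", "gc", None]      # a runtime thread stuck on a slow core
print(migrate_to_free_fast_cores(before, speeds))
```

Here the task on slow core 2 moves to idle fast core 0, while the task already on a fast core stays put. A real scheduler would also weigh migration cost and cache affinity, which this sketch ignores.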
13
Workloads: Apache and Zeus
- Webserver on Linux
- Thread-based vs. event-based model
- ApacheBench
- Raw perf. with static page
- Light and heavy loads
14
Apache
Scalable? Yes. Stable? No.
[Figure: Scalability & Stability (light load). Speedup over all slow for the all fast, 1 slow, 2 slow, 3 slow, and all slow configurations.]
- Problem: under light load, threads can land on either fast or slow cores
- No issues under heavy load
- Fixes: Kernel scheduler or shorter lifetime of threads
15
Zeus
Scalable? No. Stable? No.
[Figure: Scalability & Stability. Speedup over all slow for the all fast, 1 slow, 2 slow, 3 slow, and all slow configurations.]
- Under heavy and light loads: unpredictable
- Superior perf. on symmetric configs.
- Problem: Aggressive application-level scheduling
16
Workloads: SPECOMP, H.264, and PMake
- OMP: Scientific app. Loop-based parallelization. Intel Fortran, OpenMP on Linux
- H.264: Media encoding. OpenMP on Windows 2003
- PMake: Parallel Make of Linux Kernel
17
SPECOMP
Scalable? No. Stable? No.
[Figure: Scalability. Speedup over all slow for each configuration, without and with the app. fix.]
- OpenMP schedules tasks assuming equal perf. procs.
- Problem: Fast processors are held up waiting for slow ones
- Fix: Change scheduling of tasks to on-demand
- Downside: Overheads
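The difference between up-front and on-demand task assignment can be illustrated with a small makespan model. This is a sketch under stated assumptions, not OpenMP's runtime: tasks are equal-sized, core speeds are fixed, and `static_makespan`, `on_demand_makespan`, and `dispatch_overhead` are hypothetical names. In OpenMP itself, the `schedule(static)` and `schedule(dynamic)` clauses correspond to these two styles, though the talk does not say which mechanism the authors used.

```python
import heapq

def static_makespan(num_tasks, task_work, speeds):
    """Up-front partitioning: tasks are split evenly across cores,
    so the finish time is set by the slowest core's share."""
    base, extra = divmod(num_tasks, len(speeds))
    return max((base + (1 if i < extra else 0)) * task_work / speed
               for i, speed in enumerate(speeds))

def on_demand_makespan(num_tasks, task_work, speeds, dispatch_overhead=0.0):
    """On-demand scheduling: each core takes the next task when it becomes free.
    dispatch_overhead models the per-task cost of asking for work."""
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        time, core = heapq.heappop(free_at)      # earliest-free core
        time += dispatch_overhead + task_work / speeds[core]
        finish = max(finish, time)
        heapq.heappush(free_at, (time, core))
    return finish

speeds = [1.0, 1.0, 1.0, 0.125]                  # the "1 slow" configuration
print(static_makespan(64, 1.0, speeds))          # slow core holds everyone up
print(on_demand_makespan(64, 1.0, speeds))       # fast cores absorb the work
print(on_demand_makespan(64, 1.0, speeds, dispatch_overhead=0.1))
```

In this model the static split finishes only when the slow core has processed its quarter of the tasks, while on-demand assignment lets the fast cores take most of them. The third call shows the downside named on the slide: per-task dispatch cost lengthens the run.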
18
H.264 & PMake
Scalable? Yes. Stable? Yes.
[Figure: PMake and H.264. Speedup over all slow for the all fast, 1 slow, 2 slow, 3 slow, and all slow configurations.]
- H.264 slows down significantly with 1 slow proc.
- Speeds up with 1 fast proc.
- PMake scales linearly on all configurations
19
Impact of asymmetry
Workloads: SPECjbb, SPECjAppServer, Apache, Zeus, TPC-H, SPECOMP, H.264, PMake
Columns: Scalable, Stable, Fix
Problems
- Interference from runtime system; concurrent GC causes more problems (SPECjbb)
- Threads can map to fast or slow proc.; problems with light load (Apache)
- Superior perf. in symmetric system; not aware of asymm. (Zeus)
- Intra-query parallelization (TPC-H)
- Fast cores held by slow (OpenMP-based application)
Fixes
- Migrate tasks from slow to fast core if one is free
- Or, handle few requests and recycle threads
- Fix application scheduler
- Consider asymm. in query optimization engine, or reduce degree of parallelization
- Assign tasks on-demand instead of up-front (high overhead), or make OpenMP understand asymm.
- Reconsider application parallelization with asymm.
Robust cases
- Robust, multi-tier application
- Multi-programming with several tasks; threads well-balanced and abundant
20
Conclusions
Asymmetric systems
- Good for energy and performance
- But can introduce unpredictability
Software needs to understand asymmetry
- Evaluate application's work partitioning
- Scheduling of tasks. Mostly no other changes.
- Possibly feedback-based
Suitable asymmetry
- Many slow & few fast processors
21
Questions?