Core Architecture Optimization for Heterogeneous CMPs
R. Kumar, D. M. Tullsen, and N. P. Jouppi
İlker YILDIRIM
Outline
• Introduction
• Related Work
• Using Workloads for Multi-core Design
• Monotonicity vs Non-monotonicity
• Methodology
• Analysis and Results
• Conclusion
Introduction
• Multi-core processors are becoming more popular.
• They allow more flexibility in design.
• Cores can be heterogeneous.
• But how should a CMP with such heterogeneity be designed?
Introduction
• The design depends on the workloads, power and area constraints, level of threading, etc.
• Which design is best: a combination of good general-purpose cores, or a combination of specialized cores?
Related Work
• Related work on heterogeneous CMP design addresses:
• Power efficiency.
• Improved processor performance.
• None of it addresses how to arrive at such a design. Each assumes a given design as its starting point.
Using Workloads for Design
• Best design for what? For a set of applications, chosen as a representative set of workloads.
• Searching for one optimal CPU is already expensive. The search space explodes for multiprocessors.
• Assume: performance of the CMP = sum of the performance of its cores.
• Private caches.
• Consider only major blocks to be configurable.
• Consider only a fixed number (4) of cores.
Monotonicity vs Non-monotonicity
• The cores of a CMP possess monotonicity if they can be fully ordered:
• In terms of performance
• In terms of voltage/frequency
• Non-monotonicity: there is no full ordering, only a partial ordering.
• One core is good for memory-intensive jobs
• The other is good for jobs with a large number of instructions
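The full versus partial ordering above can be checked mechanically. The following is a minimal sketch, assuming that "ordered in terms of performance" means one core is at least as fast as the other on every workload. The function name, core names, and IPC numbers are hypothetical and chosen only to illustrate the two cases.

```python
from itertools import combinations


def is_monotonic(perf):
    """Return True if the cores can be fully ordered by performance.

    perf maps core name -> {workload: performance}. Core a dominates
    core b if a is at least as fast as b on every workload. The set of
    cores is monotonic if every pair is comparable in this sense.
    """
    def dominates(a, b):
        return all(perf[a][w] >= perf[b][w] for w in perf[a])

    return all(dominates(a, b) or dominates(b, a)
               for a, b in combinations(perf, 2))


# Hypothetical IPC numbers, not taken from the slides.
monotonic_cmp = {
    "small": {"memory_job": 0.4, "compute_job": 0.6},
    "big": {"memory_job": 0.7, "compute_job": 1.5},
}
non_monotonic_cmp = {
    "big_cache": {"memory_job": 0.9, "compute_job": 0.8},
    "wide_issue": {"memory_job": 0.5, "compute_job": 1.6},
}

print(is_monotonic(monotonic_cmp))      # True: "big" wins on both jobs
print(is_monotonic(non_monotonic_cmp))  # False: each core wins on one job
```

In the second case neither core dominates the other, so the pair is only partially ordered.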
Methodology
• Modeling of CPU cores
• Modeling Power and Area
• Modeling Performance
CPU cores
• 4-core multiprocessors, 0.10 micron, 1.2 V technology.
• Private L2 caches.
• In-order cores (Alpha EV5), out-of-order cores (MIPS R10000)
• Evaluate 480 cores (96 in-order, 384 out-of-order).
• Number of possible distinct 4-core combinations: over 2.2 billion.
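The 2.2 billion figure can be reproduced with a short calculation. This sketch assumes that the order of cores on the chip does not matter and that the same core type may appear more than once, i.e. it counts multisets of 4 cores drawn from 480 types. The function name is hypothetical.

```python
import math


def distinct_cmps(core_types: int, cores_per_chip: int) -> int:
    """Count multisets of `cores_per_chip` cores drawn from `core_types`.

    Combinations with repetition: C(n + k - 1, k).
    """
    return math.comb(core_types + cores_per_chip - 1, cores_per_chip)


count = distinct_cmps(480, 4)
print(f"{count:,} distinct 4-core designs")  # a little over 2.2 billion
```

Under the same assumption, the count grows quickly with the number of cores per chip, which is why the search space becomes a concern later in the talk.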
Power and Area
• Area budget = sum of the areas of the 4 cores.
• Consider only peak activity power.
• Each core ranges between 4.1 and 16.3 W of power and between 3.3 and 22 mm² of area.
• Aggregate (4 cores): 13.2 to 88 mm² of area and 16.4 to 65.2 W of power.
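A budget check along these lines can be sketched as follows. The `Core` class, the `fits_budget` function, and the 40 mm² / 30 W budgets are hypothetical. Pairing the smallest area with the lowest power in one core is also an assumption, since the slides give only the ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    area_mm2: float
    peak_power_w: float


def fits_budget(cores, area_budget_mm2, power_budget_w):
    """Check a design against both budgets by summing area and peak power."""
    total_area = sum(c.area_mm2 for c in cores)
    total_power = sum(c.peak_power_w for c in cores)
    return total_area <= area_budget_mm2 and total_power <= power_budget_w


# Range endpoints from the slide; combining them into two cores is an assumption.
smallest = Core("smallest", area_mm2=3.3, peak_power_w=4.1)
largest = Core("largest", area_mm2=22.0, peak_power_w=16.3)

# Hypothetical budgets: 40 mm^2 of area and 30 W of peak power.
print(fits_budget([smallest] * 4, 40.0, 30.0))                # True
print(fits_budget([largest] * 4, 40.0, 30.0))                 # False
print(fits_budget([largest] + [smallest] * 3, 40.0, 30.0))    # True
```

Four copies of the smallest core give the 13.2 mm² and 16.4 W lower bounds, and four copies of the largest give the 88 mm² and 65.2 W upper bounds.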
Performance
• Combination of workloads
• Processor bound
• Bandwidth bound
• All different (a,b,c,d)
• All same (a,a,a,a)
• A wide range of workloads
SPEC: The Standard Performance Evaluation Corporation is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers.
www.spec.org
Performance Evaluation
• 2.2 billion distinct CMPs using 480 distinct cores.
• Performance of a 4-core CMP = sum of the performance of each core, each with its own private L2 cache.
• Evaluation of performance for each core:
#distinct cores x #benchmarks x #cycles = 480 x 10 x 250 million cycles = 1.2 trillion cycles
• Metric: weighted speedup, the arithmetic sum of each running thread's IPC divided by its IPC on the simplest core considered.
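The weighted speedup metric can be written out directly from its definition. This is a minimal sketch: the function name and all IPC values are hypothetical, and only the benchmark names come from the slides.

```python
def weighted_speedup(thread_ipc, baseline_ipc):
    """Sum over threads of (IPC on its core) / (IPC on the simplest core).

    Both arguments map thread name -> IPC and must cover the same threads.
    """
    if thread_ipc.keys() != baseline_ipc.keys():
        raise ValueError("both mappings must cover the same threads")
    return sum(thread_ipc[t] / baseline_ipc[t] for t in thread_ipc)


# Hypothetical IPC values, for illustration only.
ipc_on_cmp = {"eon": 1.2, "mesa": 1.5, "deltablue": 0.9, "mcf": 0.3}
ipc_on_simplest = {"eon": 0.6, "mesa": 0.5, "deltablue": 0.6, "mcf": 0.2}

print(f"{weighted_speedup(ipc_on_cmp, ipc_on_simplest):.2f}")       # 8.00
print(f"{weighted_speedup(ipc_on_simplest, ipc_on_simplest):.2f}")  # 4.00
```

With four threads, a CMP built from four copies of the simplest core scores 4.0 by construction, so values above 4.0 measure the gain over that baseline.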
Analysis and Results
• Analyzing multi-core processors for a given workload
• Analyzing multi-core processors for a given budget
• Quantifying inefficiency due to monotonicity
• Varying Thread-level Parallelism
• Efficient Search Techniques
Analyzing for a Given Workload
• All different: eon, mesa, deltablue, mcf.
• Observe non-monotonicity.
[Figure: panels labeled mesa, mcf, deltablue, eon]
Analyzing for a Given Budget
• Extended analysis in two ways:
• Any combination of workloads.
• Different area and power budgets.
• All-same case: heterogeneity captures the diversity among different homogeneous workloads.
• Performance depends on the power budget.
• Heterogeneity yields specialized cores, whereas homogeneity yields envelope cores.
• Heterogeneity brings significant benefit as long as the power and area budgets are constrained.
• The diversity required depends on the available budgets: the tighter the constraints, the greater the diversity.
• There is a large difference between the best heterogeneous and the best homogeneous CMP designs.
• The best design is not multiple copies of the single best-performing core. Rather, it is a combination of tuned cores.
Quantifying inefficiency due to monotonicity
• The best non-monotonic design of 2 cores is better than:
• The best monotonic design of 2 cores, by 7.5%;
• The best homogeneous design of 2 cores, by 15.4%.
• The tighter the constraints, the higher the cost of monotonicity.
Varying Thread-Level Parallelism
• Again, heterogeneity has benefits.
• When there is less competition, performance is better.
• Again, observe the benefits of heterogeneity.
• Interestingly, for TLP=1, a design of heterogeneous, tuned cores is better than one huge monolithic core with tiny complementary cores.
Efficient Search Techniques
• A huge search space: 2.2 billion distinct core combinations and thousands of 4-thread workloads.
• Exhaustive search is not scalable: what if there are more than 10 different applications, or more than 4 cores?
• A smarter solution: hill climbing, which may get stuck at a local maximum.
• Results are still better than those of homogeneous designs.
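The hill-climbing idea can be illustrated on a toy problem. This is only a sketch under stated assumptions, not the search procedure from the paper: the core library, its area and speedup numbers, the four-thread workload, the area-only budget, and the neighbor rule (swap one core for another type) are all hypothetical.

```python
import itertools

# Hypothetical core library: name -> (area in mm^2, {workload: speedup}).
CORES = {
    "tiny": (3.0, {"cpu": 1.0, "mem": 1.0}),
    "wide": (9.0, {"cpu": 2.4, "mem": 1.1}),
    "cache": (8.0, {"cpu": 1.2, "mem": 2.0}),
    "big": (20.0, {"cpu": 2.6, "mem": 2.2}),
}
WORKLOADS = ["cpu", "cpu", "mem", "mem"]  # hypothetical 4-thread workload
AREA_BUDGET = 40.0                        # hypothetical budget


def score(design):
    """Best total speedup over all thread-to-core assignments.

    Returns -inf when the design exceeds the area budget.
    """
    if sum(CORES[c][0] for c in design) > AREA_BUDGET:
        return float("-inf")
    return max(sum(CORES[c][1][w] for c, w in zip(perm, WORKLOADS))
               for perm in itertools.permutations(design))


def neighbors(design):
    """All designs that differ from `design` in exactly one core slot."""
    for i in range(len(design)):
        for name in CORES:
            if name != design[i]:
                yield tuple(sorted(design[:i] + (name,) + design[i + 1:]))


def hill_climb(start):
    """Move to the best neighbor until no neighbor improves the score."""
    current = tuple(sorted(start))
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current
        current = best


found = hill_climb(("tiny",) * 4)
exhaustive = max(itertools.combinations_with_replacement(sorted(CORES), 4),
                 key=score)
homogeneous = max(((c,) * 4 for c in CORES), key=score)

print("hill climbing:   ", found, round(score(found), 2))
print("exhaustive best: ", exhaustive, round(score(exhaustive), 2))
print("best homogeneous:", homogeneous, round(score(homogeneous), 2))
```

With these toy numbers the climb stops at a local maximum below the exhaustive optimum, yet still above the best homogeneous design, which mirrors the two points on the slide.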
Conclusions
• How to design a good heterogeneous CMP for a given power and area budget and a set of workloads.
• The best design is not a combination of cores that are each good general-purpose cores.
• The best way is to combine tuned heterogeneous cores.
• Such tuning results in non-monotonicity, in the sense that the cores can only be partially ordered.
• Heterogeneous designs also perform better for homogeneous workloads.