CMP Design Choices

Finding Parameters that
Impact CMP Performance
Sam Koblenski and Peter McClone
Outline
Introduction
Assumptions
Plackett & Burman Analysis
  Simulation methods
  Statistical Design
  Plackett & Burman Results
Mean Value Analysis
  MVA Implementation
  MVA Results
  AMVA Implementation
  AMVA Results
Complementary Results
Conclusions
Introduction

Two-part study
Design space is huge; how can we reduce it?
Method 1
  Plackett & Burman (PB) analysis finds critical parameters
  Design uses extreme values of parameters
  Detailed architecture design can focus on a few parameters
Introduction (cont.)

Method 2
  Mean Value Analysis (MVA) model of a CMP
  Simply designed to compute throughput
  Design choices can be narrowed down quickly
  Intuition is gained and patterns/parameter relationships identified
Assumptions - PB Design



In-Order approximated as OoO with small window
Die Size = 300 mm² (16 MB Cache @ 65 nm)
L2 Cache Size expanded to fill the die
Discrete sizes: 4, 8, 12 MB
Associativity can be non-power-of-2
Core size measured in Cache Byte Equivalents (CBE):

Pipeline       Width   CBE
In-Order       1       50 kB
In-Order       4       100 kB
Out-of-Order   1       75 kB
Out-of-Order   4       250 kB
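
As a rough illustration of the area accounting above, the sketch below converts a core configuration into Cache Byte Equivalents and reports which L2 size still fits in the 16 MB die budget. The CBE figures and the 4, 8, 12 MB sizes come from the slide. Everything else is an assumption made for illustration: snapping down to the largest discrete size that fits, ignoring L1 area, using 1 MB = 1024 kB, and the function name `l2_size_mb`.

```python
# Core area in Cache Byte Equivalents (kB), keyed by (pipeline, width); values from the slide.
CBE_KB = {
    ("in-order", 1): 50,
    ("in-order", 4): 100,
    ("out-of-order", 1): 75,
    ("out-of-order", 4): 250,
}
DIE_BUDGET_KB = 16 * 1024      # 300 mm^2 die is roughly 16 MB of cache at 65 nm
L2_SIZES_MB = (4, 8, 12)       # discrete L2 sizes from the slide


def l2_size_mb(n_cores, pipeline, width):
    """Largest discrete L2 size that fits beside the cores (assumed snapping rule)."""
    remaining_kb = DIE_BUDGET_KB - n_cores * CBE_KB[(pipeline, width)]
    fitting = [s for s in L2_SIZES_MB if s * 1024 <= remaining_kb]
    return max(fitting) if fitting else None


# Two narrow in-order cores use 100 kB of the budget.
print(l2_size_mb(2, "in-order", 1))
# Sixteen 4-wide out-of-order cores use 4000 kB of the budget.
print(l2_size_mb(16, "out-of-order", 4))
```

Under these assumptions even the largest core configuration leaves room for the 12 MB option, so the discrete L2 sizes would be chosen by the experiment rather than forced by area.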
Simulation Methodology




Simics with Ruby & Opal
16P sims used cache warmup files
2P sims ran for more transactions
Attempted OLTP and JBB benchmarks

Benchmark   Processors   Transactions
OLTP        2            200
OLTP        16           100
JBB         2            20000
JBB         16           10000
Plackett & Burman Design

Motivation
  Narrow a huge design space
  Minimize simulation runs (experiments)
Preliminaries
  Performance Measure
  Extreme Parameter Values
  Number of Parameters N (a design with 4n runs supports N ≤ 4n − 1)
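
For context, a Plackett & Burman design uses a number of runs that is a multiple of four and can screen at most one fewer parameter than it has runs. A minimal sketch of that sizing rule follows. The helper name `pb_runs` is hypothetical, and doubling the run count for foldover is an optional extension, not something these slides state was used.

```python
def pb_runs(n_params, foldover=False):
    """Smallest multiple of 4 strictly greater than n_params; doubled with foldover."""
    runs = 4 * (n_params // 4 + 1)
    return 2 * runs if foldover else runs


print(pb_runs(7))                  # 8
print(pb_runs(7, foldover=True))   # 16
print(pb_runs(11))                 # 12
```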
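PB Design Example
            A     B     C     D     E     F     G   Time
            +     +     +     -     +     -     -      9
            -     +     +     +     -     +     -     11
            -     -     +     +     +     -     +      2
            +     -     -     +     +     +     -      1
            -     +     -     -     +     +     +      9
            +     -     +     -     -     +     +     74
            +     +     -     +     -     -     +      7
            -     -     -     -     -     -     -      4
            -     -     -     +     -     +     +     17
            +     -     -     -     +     -     +     76
            +     +     -     -     -     +     -      6
            -     +     +     -     -     -     +     31
            +     -     +     +     -     -     -     19
            -     +     -     +     +     -     -     33
            -     -     +     -     +     +     -      6
            +     +     +     +     +     +     +    112
Effect    191    19   111   -13    79    55   239
(+ = high value, - = low value; each effect is the sum of the run times weighted by that column's signs)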
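
The effect row of the example can be reproduced with a short script. This is a sketch under stated assumptions: it takes the design to be the standard 8-run PB matrix (cyclic shifts of the generator row `+ + + - + - -` plus an all-minus row) extended with foldover, which is consistent with the signs and effects in the example. The function names are hypothetical.

```python
def pb_matrix_with_foldover(first_row):
    """Build a PB design from its generator row, then append the sign-flipped copy."""
    n = len(first_row)
    # Cyclic right shifts of the generator row (shift 0 is the row itself).
    rows = [first_row[n - i:] + first_row[:n - i] for i in range(n)]
    rows.append([-1] * n)                          # final base run: all low values
    rows += [[-s for s in row] for row in rows]    # foldover doubles the run count
    return rows


def effects(rows, times):
    """Effect of each parameter: sum over runs of sign * measured time."""
    return [sum(row[j] * t for row, t in zip(rows, times)) for j in range(len(rows[0]))]


generator = [+1, +1, +1, -1, +1, -1, -1]           # standard generator for 8 runs
times = [9, 11, 2, 1, 9, 74, 7, 4, 17, 76, 6, 31, 19, 33, 6, 112]

design = pb_matrix_with_foldover(generator)
eff = effects(design, times)
print(eff)   # [191, 19, 111, -13, 79, 55, 239]

# Rank parameters by the magnitude of their effect; the sign only gives direction.
ranking = sorted("ABCDEFG", key=lambda name: -abs(eff["ABCDEFG".index(name)]))
print(ranking)   # ['G', 'A', 'C', 'E', 'F', 'B', 'D']
```

In this example parameters G, A and C dominate, which is the kind of short list a detailed design study would then focus on.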
PB Design Parameter Values
Parameter               Low Value (-)     High Value (+)
Number of Cores         2                 16
Pipeline Organization   In-Order          Out-of-Order
Pipeline Width          1                 4
L1 Cache Size           16 kB             128 kB
L1 Associativity        Direct Mapped     32-Way
L2 Cache Size           Die Area - Core Area (both columns)
L2 Associativity        Direct Mapped     32-Way
L2 Banks                2                 32
L2 Latency              50 Cycles         12 Cycles
L2 Directory Latency    25 Cycles         6 Cycles
Pin Bandwidth           400               10000
Memory Latency          300 Cycles        100 Cycles
PB Results



Extreme values stressed the simulator
Have not yet completed an entire set of runs
It may be necessary to build a custom L2 network for each run
PB Results for JBB
[Figure: bar chart of PB results for JBB. Vertical axis ticks run from 0 to 20. Horizontal axis lists the parameters: Cores, In/Out, Width, L1 Size, L1 Assoc, L2 Assoc, L2 Banks, L2 Latency, Directory Latency, Pin BW, Memory Latency.]
Assumptions - MVA




Distribution of time between memory requests is exponential
Processor cores exhibit the same average behavior with respect to their service times and miss rates
Doubling the size of the cache multiplies the miss rate by 1/√2
An in-order core takes approximately the same area as 50 kB of cache
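
The cache-size assumption above is the square-root rule: miss_rate(S) = miss_rate(S0) * sqrt(S0 / S). A small sketch follows. The function name is hypothetical and the baseline miss rate in the example calls is a placeholder, not a measurement from this study.

```python
import math


def scaled_miss_rate(base_miss_rate, base_size, new_size):
    """Square-root rule: each doubling of cache size scales the miss rate by 1/sqrt(2)."""
    return base_miss_rate * math.sqrt(base_size / new_size)


# Placeholder baseline: 10% miss rate at 4 MB.
print(scaled_miss_rate(0.10, 4, 8))    # one doubling
print(scaled_miss_rate(0.10, 4, 16))   # two doublings halve the miss rate
```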
MVA Design

Simple Closed Model:
MVA Design

Two phases of this model design
First: Use the exact MVA equations
  Use average time between memory accesses as an application parameter
  Solve for throughput
Second: Use Approximate MVA (AMVA)
  Use an iterative method to converge on a service time
  Solve for throughput
Exact MVA

To solve the MVA equations, we determine the mean residence time at each service center:
  R_p: processor/L1 residence time
  R_L2: L2 residence time
  R_M: memory residence time
The case with one core is trivial. Use this case to solve for additional cores:
  R_{n,p} = D_p (1 + Q_{n-1,p})
where D_p is the service demand at the processor/L1 center and Q_{n-1,p} is its mean queue length with n-1 cores in the system.
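
The recursion above can be sketched as a standard exact MVA solver for a closed, single-class network. This is a minimal sketch, not the authors' implementation. It assumes all three centers (processor/L1, L2, memory) are queueing centers, that each core contributes one circulating customer, and that there is no separate think time. The service demands in the example are made-up cycle counts.

```python
def exact_mva(demands, n_customers):
    """Exact MVA: build the n-customer solution from the (n-1)-customer one."""
    queue = [0.0] * len(demands)        # Q_{0,k} = 0 at every center
    residence = [0.0] * len(demands)
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # R_{n,k} = D_k * (1 + Q_{n-1,k})
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # X_n = n / sum_k R_{n,k}
        throughput = n / sum(residence)
        # Little's law at each center: Q_{n,k} = X_n * R_{n,k}
        queue = [throughput * r for r in residence]
    return throughput, residence, queue


# Hypothetical demands (cycles) for processor/L1, L2 and memory.
demands = [20.0, 5.0, 10.0]
for cores in (1, 2, 4):
    x, r, q = exact_mva(demands, cores)
    print(cores, round(x, 5), [round(v, 2) for v in r])
```

With one core the residence times equal the demands, which is the trivial case the slide mentions. Each additional core reuses the queue lengths from the previous step.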
Exact MVA results

Throughput was calculated using data from simulation runs
  Miss rates, number of memory requests
Results are erratic
Not consistent with simulation results
Source of the problem is most likely processor service time!
Approximate MVA Design

An iterative method can be used to converge on a service time
  Uses total R as an input parameter
  Iterative method works well with approximate MVA
Goal is to match total average residence time of a memory request
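
One way to picture this procedure is sketched below. The slides do not say which approximation was used, so the sketch assumes the common Schweitzer approximation, Q_{n-1,k} ≈ ((n-1)/n) Q_{n,k}, solved by fixed-point iteration. It then bisects on the processor service demand until the total residence time matches a target. All names and numbers are hypothetical. Returning `None` for an unreachable target is a design choice of this sketch.

```python
def schweitzer_amva(demands, n, tol=1e-10, max_iter=100_000):
    """Approximate MVA (Schweitzer): iterate queue lengths to a fixed point."""
    k = len(demands)
    queue = [n / k] * k                       # start with customers spread evenly
    for _ in range(max_iter):
        residence = [d * (1.0 + (n - 1) / n * q) for d, q in zip(demands, queue)]
        throughput = n / sum(residence)       # requires at least one positive demand
        new_queue = [throughput * r for r in residence]
        converged = max(abs(a - b) for a, b in zip(new_queue, queue)) < tol
        queue = new_queue
        if converged:
            return throughput, residence
    raise RuntimeError("AMVA fixed point did not converge")


def fit_processor_demand(target_total_r, other_demands, n, lo=0.0, hi=1e6, steps=200):
    """Bisect on the processor demand so that total residence time hits the target."""
    def total_r(d_p):
        _, residence = schweitzer_amva([d_p] + list(other_demands), n)
        return sum(residence)

    if not (total_r(lo) <= target_total_r <= total_r(hi)):
        return None                           # target unreachable in this model
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if total_r(mid) < target_total_r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical L2 and memory demands (cycles) and a hypothetical measured total R.
print(fit_processor_demand(60.0, [5.0, 10.0], n=4))
print(fit_processor_demand(10.0, [5.0, 10.0], n=4))   # below the floor set by L2 + memory
```

The second call returns `None` because even a zero processor demand leaves a total residence time above the target.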
Approximate MVA Results




Convergence using the AMVA equations does not always occur
Total measured residence time cannot be reached with this model and parameter set
Variation of input values without convergence implies flaws in the model structure
There is a complex relationship between the memory system and the rate at which a core issues requests that must be modeled
Complementary Results


Initial goal was to produce PB results to find parameters to focus on for the MVA model
Results from both approaches could cross-verify correctness
Conclusions

Simics has a STEEP learning curve
<5 weeks is not enough time for valid/any results
Refinement of a PB Design leads to long lead times on valid results
CMPs complicate the relationship between cores and the memory subsystem
Design methodologies that focus simulation runs are necessary
More results and conclusions to follow
Questions?