Challenges for Worst-Case Execution Time Analysis of Multi-core Architectures
Jan Reineke, Saarland University
IRIT, Toulouse
September 18, 2014
The Real-Time Context: Hard Real-Time Systems

- Safety-critical applications: avionics, automotive, train industries, manufacturing control
  - Example: side airbag in a car, reaction in < 10 msec
  - Example: crankshaft-synchronous tasks, reaction in < 45 µsec
- Embedded controllers must finish their tasks within given time bounds.
- Developers would like to know the Worst-Case Execution Time (WCET) to give a guarantee.
The Timing Analysis Problem

Embedded Software + Timing Requirements + Microarchitecture = ?
What does the execution time depend on?

1. The input, determining which path is taken through the program.
2. The state of the hardware platform: due to caches, pipelines, speculation, etc.
3. Interference from the environment: external interference as seen from the analyzed task on shared busses, caches, and memory.

[Diagram: a simple CPU directly attached to memory]
What does the execution time depend on?

The same three factors, now on a more complex platform:

[Diagram: a complex CPU (out-of-order execution, branch prediction, etc.) with an L1 cache and main memory, next to the simple CPU with memory]
What does the execution time depend on?

The same three factors, now on a multi-core platform:

[Diagram: several complex CPUs (out-of-order execution, branch prediction, etc.), each with a private L1 cache, sharing an L2 cache and main memory]
Example: Influence of Microarchitectural State on Access Time

    x = a + b;

compiles to:

    LOAD r2, _a
    LOAD r1, _b
    ADD  r3, r2, r1

[Figure: execution times of this sequence on the Motorola MPC5xx and the PowerPC 755, depending on the state of the memory hierarchy]

Courtesy of Reinhard Wilhelm, "Timing Analysis and Timing Predictability" tutorial, ISCA 2010.
Example: Influence of Co-running Tasks in Multicores

Radojkovic et al. (ACM TACO, 2012), on Intel Atom and Intel Core 2 Quad:
up to 14x slow-down due to interference on the shared L2 cache and the memory controller.
Challenges

1. Modeling: How to construct sound timing models?
2. Analysis: How to precisely & efficiently bound the WCET?
3. Design: How to design microarchitectures that enable precise & efficient WCET analysis?
The Modeling Challenge

How to obtain a timing model of the microarchitecture?

Timing model = a formal specification of the microarchitecture's timing.
An incorrect timing model may yield an incorrect WCET bound.
Current Process of Deriving a Timing Model

[Figure, built up over several slides: the current, manual process of deriving a timing model from the microarchitecture]

→ Time-consuming, and
→ error-prone.
1. Future Process of Deriving a Timing Model

Derive the timing model automatically from a formal specification of the microarchitecture (e.g., its VHDL model).

→ Less manual effort, thus less time-consuming, and
→ provably correct.
2. Future Process of Deriving a Timing Model

Perform measurements on the hardware and infer the model from them: derive the timing model automatically from measurements on the hardware, using ideas from automata learning.

→ No manual effort,
→ (under certain assumptions) provably correct, and
→ also useful to validate assumptions about the microarchitecture.
Proof-of-concept: Automatic Modeling of the Cache Hierarchy

- The cache model is an important part of the timing model.
- A cache can be characterized by a few parameters:
  - ABC: associativity, block size, capacity
  - replacement policy

[Diagram: a set-associative cache with A ways (associativity), blocks of size B, and N cache sets, each entry holding a tag and a data block]

chi [Abel and Reineke, RTAS 2013] derives all of these parameters fully automatically.
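For concreteness, here is a tiny sketch of how these geometry parameters determine where an address is placed in the cache. The numbers match the Core 2 Duo L1 data cache characterized on the following slides (32 KB capacity, 8 ways, i.e. 4 KB way size); the 64-byte line size is an assumption for illustration.

```python
# Sketch: how the ABC parameters determine the placement of an address.
# Capacity and associativity taken from the measurements on the next
# slides; the 64-byte line size is assumed for illustration.

CAPACITY = 32 * 1024   # C: capacity in bytes
BLOCK = 64             # B: block (line) size in bytes
ASSOC = 8              # A: associativity
SETS = CAPACITY // (BLOCK * ASSOC)   # N = C / (B * A) = 64 sets

def cache_placement(address: int):
    """Split an address into tag, set index, and block offset."""
    offset = address % BLOCK
    set_index = (address // BLOCK) % SETS
    tag = address // (BLOCK * SETS)
    return tag, set_index, offset

print(cache_placement(0x12345678))   # (74565, 25, 56)
```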
Example: Intel Core 2 Duo E6750, L1 Data Cache

[Plot: number of L1 misses (|Misses|) as a function of the size of the accessed memory region (|Size|), annotated with Capacity = 32 KB and Way Size = 4 KB]
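To make the idea behind this plot concrete, here is a small, self-contained sketch of measurement-based capacity inference. A simulated LRU cache stands in for real hardware measurements (chi measures the actual hardware via performance counters); all parameters below are assumptions for illustration. The capacity is read off as the largest region size that can still be traversed with only cold misses.

```python
# Minimal sketch of measurement-based capacity inference. An 8-way,
# 32 KB LRU cache with 64-byte lines is simulated as a stand-in for
# hardware measurements; these parameters are assumptions.

LINE, WAYS, CAPACITY = 64, 8, 32 * 1024
SETS = CAPACITY // (LINE * WAYS)

def misses_for_region(size_bytes: int, repetitions: int = 4) -> int:
    """Traverse a region of the given size several times sequentially
    and count misses in the simulated LRU cache."""
    sets = [[] for _ in range(SETS)]          # per set: block tags, MRU first
    misses = 0
    for _ in range(repetitions):
        for addr in range(0, size_bytes, LINE):
            s, tag = (addr // LINE) % SETS, addr // (LINE * SETS)
            if tag in sets[s]:
                sets[s].remove(tag)           # hit: move to MRU position
            else:
                misses += 1
                if len(sets[s]) == WAYS:
                    sets[s].pop()             # evict the LRU block
            sets[s].insert(0, tag)
    return misses

# The region fits iff only the cold misses of the first traversal occur.
sizes = [kb * 1024 for kb in range(1, 65)]
capacity = max(s for s in sizes if misses_for_region(s) <= s // LINE)
print("inferred capacity:", capacity // 1024, "KB")   # 32 KB
```

Associativity (and hence way size) can be inferred with a similar experiment that accesses only addresses mapping to the same cache set.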
Replacement Policy

The approach is inspired by methods to learn finite automata, but heavily specialized to the problem domain.

It discovered a, to our knowledge, undocumented replacement policy of the Intel Atom D525.

[Diagram: evolution of the cache state under the discovered policy, e.g. [a,b | c,d | e,f] becomes [d,c | a,b | e,f] on an access to d, and then [x,e | d,c | a,b] on a subsequent access to x]

More information: http://embedded.cs.uni-saarland.de/chi.php
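As a rough illustration of the measurement-based approach (this is not the chi algorithm itself), candidate replacement policies can be distinguished by simulating them on a probe access sequence and comparing the predicted miss counts with the miss count measured on the hardware. The probe sequence and the "measured" value below are made up.

```python
# Illustrative sketch: distinguish candidate replacement policies by
# comparing simulated miss counts on a probe sequence with a measured
# miss count. Both simulators model one fully associative set of 4
# lines; the probe and the measured value are hypothetical.

def misses(policy: str, sequence, ways: int = 4) -> int:
    state = []                       # most recently inserted/used first
    count = 0
    for block in sequence:
        if block in state:
            if policy == "LRU":      # only LRU reorders on a hit
                state.remove(block)
                state.insert(0, block)
        else:
            count += 1
            if len(state) == ways:
                state.pop()          # evict the oldest/least recent block
            state.insert(0, block)
    return count

probe = ["a", "b", "c", "d", "a", "e", "a", "b"]
candidates = {p: misses(p, probe) for p in ("LRU", "FIFO")}
measured = 6                         # hypothetical measurement (matches LRU)
matching = [p for p, m in candidates.items() if m == measured]
print(candidates, "->", matching)    # {'LRU': 6, 'FIFO': 7} -> ['LRU']
```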
Modeling Challenge: Future Work

Extend the automation to other parts of the microarchitecture:
- translation lookaside buffers, branch predictors
- shared caches in multicores, including their coherency protocols
- out-of-order pipelines?
The Analysis Challenge

Goal: a precise & efficient timing analysis based on the timing model.

It must consider all possible program inputs and all possible initial states of the hardware:

$$\mathrm{WCET}_H(P) := \max_{i \in \mathit{Inputs}} \; \max_{h \in \mathit{States}(H)} \mathrm{ET}_H(P, i, h)$$
The Analysis Challenge

$$\mathrm{WCET}_H(P) := \max_{i \in \mathit{Inputs}} \; \max_{h \in \mathit{States}(H)} \mathrm{ET}_H(P, i, h)$$

Explicitly evaluating ET for all inputs and all hardware states is not feasible in practice: there are simply too many.

⇒ Need for abstraction, and thus approximation!
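For intuition, here is a toy evaluation of the definition above on a made-up execution-time function over tiny input and state spaces. Real input and hardware-state spaces are astronomically larger, which is why the analysis must work on abstractions rather than on this kind of enumeration.

```python
from itertools import product

# Toy spaces; real input and hardware-state spaces are astronomically larger.
inputs = range(4)                 # hypothetical program inputs
states = ["cold", "warm"]         # hypothetical initial hardware states

def execution_time(i: int, h: str) -> int:
    """Made-up execution-time function standing in for ET_H(P, i, h)."""
    return 10 * i + (5 if h == "cold" else 0)

# WCET_H(P) = maximum over all inputs and all initial hardware states.
wcet = max(execution_time(i, h) for i, h in product(inputs, states))
print("WCET by exhaustive enumeration:", wcet)   # 35
```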
The Analysis Challenge: State of the Art

Caches, Branch Target Buffers:
- Precise & efficient abstractions for LRU [Ferdinand, 1999]
- Not-so-precise but efficient abstractions for FIFO, PLRU, MRU [Grund and Reineke, 2008-2011]
- Reasonably precise quantitative analyses for FIFO, MRU [Guan et al., 2012-2014]

Complex Pipelines:
- Precise but very inefficient analyses; little abstraction
- Major challenge: timing anomalies

Shared resources, e.g. busses, shared caches, DRAM:
- No realistic approaches yet
- Major challenge: interference between hardware threads
  → execution time depends on co-running tasks
  → need timing compositionality or isolation
The Design Challenge

Wanted: a multi-/many-core architecture that
- is timing-predictable, i.e. admits precise and efficient WCET analysis, and
- delivers high performance.
Approaches to Increase Predictability

Inputs = program inputs + initial state of the microarchitecture + tasks on other cores

Uncertainty about inputs → possible executions → analysis efficiency:

1. Reduce uncertainty about the inputs.
2. Reduce the influence of uncertainty on the executions.
3. Decouple analysis efficiency from the number of executions.
1. Reduce Uncertainty

a) Eliminate performance-enhancing features: e.g. branch predictors, caches, out-of-order execution, ...

[Diagram: the complex multi-core platform with L1/L2 caches reduced to a simple CPU directly attached to main memory]

If done naively, this reverses many microarchitectural developments and decreases performance.

Key question: How to reduce uncertainty without sacrificing performance?
1. Reduce Uncertainty

b) Replace features by alternatives with smaller state spaces:
- caches → scratchpad memories
- complex pipeline → simple VLIW or thread-interleaved pipeline
- DRAM controller with a closed-page policy

[Diagram: VLIW cores with private scratchpads connected to a DRAM controller with closed-page policy]
1. Reduce Uncertainty

c) Establish a fixed microarchitectural state "once in a while":
- by the microarchitecture: partially flush the pipeline at basic-block boundaries [Rochange and Sainrat, Computing Frontiers 2005]
- by a sync instruction, placed by the compiler [Maksoud and Reineke, RTNS 2014]
2) Reducing the Influence of Uncertainty on Execution Times

If a particular source of uncertainty has no influence on the execution time of the program under analysis, it is irrelevant for timing analysis.

Principal example: temporal isolation.
Temporal Isolation

Temporal isolation between cores: the timing of a program on one core is independent of the activity on the other cores.

Formally, with p_i and c_i denoting the pipeline and cache state of core i:

$$T(P_1, \langle p_1, c_1, p_2, c_2\rangle) = T(P_1, \langle p_1, c_1, p_2', c_2'\rangle) = T_{\mathit{isolated}}(P_1, \langle p_1, c_1\rangle)$$

This can be exploited in WCET analysis:

$$\mathrm{WCET}(P_1) = \max_{p_1, c_1, p_2, c_2} T(P_1, \langle p_1, c_1, p_2, c_2\rangle) = \max_{p_1, c_1} T_{\mathit{isolated}}(P_1, \langle p_1, c_1\rangle)$$
Temporal Isolation: How to achieve it?

Partition resources in space and/or time:
- the resource then appears to each client like a slower and/or smaller private resource.

Examples (a small TDMA sketch follows below):
- time-division multiple access (TDMA) arbitration on shared busses
- partitioned shared caches
- PRET DRAM controller = time division of the data/control bus + space division of the DRAM internals

Why not simply provide private resources then?
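Here is the TDMA sketch referred to above: with a static TDMA schedule, the worst-case latency of a bus access can be bounded without knowing anything about the other cores, which is exactly the temporal-isolation property. All parameters are hypothetical, and the bound assumes an access must complete within a single slot.

```python
# Sketch: worst-case bus-access latency under TDMA arbitration.
# Parameters are hypothetical; the point is that the bound depends
# only on the static schedule, never on what the other cores do.

SLOT = 10     # cycles per TDMA slot, one slot per core per round
CORES = 4     # number of cores sharing the bus
ACCESS = 4    # cycles to complete one bus access (ACCESS <= SLOT)

def worst_case_latency() -> int:
    """Worst case: the request arrives when only ACCESS - 1 cycles remain
    in the core's own slot (too few to finish), so it waits out the rest
    of its slot, then all other cores' slots, and finally performs the
    access in its next slot."""
    return (ACCESS - 1) + (CORES - 1) * SLOT + ACCESS

print(worst_case_latency())   # 37 cycles, independent of co-running tasks
```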
3) Decouple Analysis Efficiency from the Number of Executions

1. Abstraction: represent many concrete executions by few abstract ones.
   - The genius is in the choice of abstraction.
   - Usually requires some sort of monotonicity.
2. Disregard most executions:
   - e.g. follow only the local worst case.
   - Requires the absence of timing anomalies and domino effects.
3. Divide and conquer: analyze different components separately.
   - Requires timing compositionality.
Timing Anomalies

Timing anomaly = a counterintuitive scenario in which the "local worst case" does not imply the "global worst case".

Example: scheduling anomaly.

[Diagram: two schedules of tasks A-E on Resource 1 and Resource 2; depending on when C becomes ready, the schedule in which an individual task finishes earlier has the longer overall completion time]

Bounds on multiprocessing timing anomalies. R. L. Graham. SIAM Journal on Applied Mathematics, 1969. http://epubs.siam.org/doi/abs/10.1137/0117039
Timing Anomalies

Example: speculation anomaly.

[Diagram:
- Cache hit for A: the branch condition has not been evaluated yet, so B is prefetched (a miss); this memory access may induce additional cache misses for C later on.
- Cache miss for A: the branch condition has already been evaluated, so no prefetching occurs.
Thus the local worst case (the miss for A) need not lead to the global worst case.]
Timing Anomalies

Example: cache timing anomaly under FIFO replacement (2-way FIFO cache, access sequence b, c, b, d, c):

- Initial state [a, b]: b hit, c miss → [c, a], b miss → [b, c], d miss → [d, b], c miss → [c, d]. In total 4 misses (the global worst case), although the first access is a hit.
- Initial state [a, ?]: b miss → [b, a] (the local worst case), c miss → [c, b], b hit, d miss → [d, c], c hit. In total 3 misses.

Similar examples exist for PLRU and MRU. Impossible for LRU.
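The example can be checked mechanically; the following minimal simulation of a 2-way FIFO cache reproduces the 4 versus 3 misses for the two initial states from the slide (the unknown block "?" is modeled as some block x that does not occur in the sequence).

```python
# Simulate a 2-way FIFO cache on the access sequence from the slide and
# count misses for the two initial states. FIFO evicts the oldest block
# on a miss and does not reorder on hits.

def fifo_misses(initial, sequence, ways=2):
    state = list(initial)            # newest block first, oldest last
    misses = 0
    for block in sequence:
        if block not in state:
            misses += 1
            state = [block] + state[:ways - 1]   # evict the oldest block
    return misses

seq = ["b", "c", "b", "d", "c"]
print(fifo_misses(["a", "b"], seq))   # 4 misses: first access hits, rest miss
print(fifo_misses(["a", "x"], seq))   # 3 misses, despite the initial miss
```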
Timing Anomalies: Consequences for Timing Analysis

In the presence of timing anomalies, a timing analysis cannot make decisions "locally": it needs to consider all cases.

State of the art: integrated WCET analysis.
Drawback: efficiency; it may suffer from a "state explosion problem".
Timing Anomalies: Open Analysis and Design Challenges

- How to determine whether a given timing model exhibits timing anomalies?
- How to construct processors without timing anomalies?
  - Caches: LRU replacement
  - No speculation
  - Other aspects: "halt" everything upon every "timing accident" → possibly very inefficient
- How to construct a conservative timing model without timing anomalies?
  - Can we, e.g., add a "safety margin" to the local worst case?
Domino Effects

- Intuitively: domino effect = "unbounded" timing anomaly
- Examples:
  - pipelines (e.g. PowerPC 755)
  - caches (FIFO, PLRU, MRU, ...)
Domino Effects

Example: cache domino effect under FIFO replacement (2-way FIFO cache).

First block of accesses b, c, b, d, c (as in the timing-anomaly example):
- from state [a, b]: b hits, the remaining 4 accesses miss, ending in state [c, d]
- from state [a, ?]: 3 misses, ending in state [d, c]

A second block of accesses (to a, d, b, ...) brings the two caches to states [a, b] and [b, a], again with 4 misses in the first execution and 3 in the second. Since the two cache states never converge, repeating such access patterns lets the difference in the number of misses grow without bound.

Similar examples exist for PLRU and MRU. Impossible for LRU.
Domino Effects: Open Analysis and Design Challenges

Exactly as with timing anomalies:
- How to determine whether a given timing model exhibits domino effects?
- How to construct processors without domino effects?
- How to construct a conservative timing model without domino effects?
Timing Compositionality: Motivation

- Some timing accidents are hard or even impossible to statically exclude at any particular program point:
  - interference on a shared bus: depends on the behavior of tasks executed on other cores
  - interference on a cache in preemptively scheduled systems
  - DRAM refreshes
- But it may be possible to make cumulative statements about the number of these accidents.
Timing Compositionality: Intuitive Meaning

- The timing of a program can be decomposed into contributions by different "components", e.g.
  - pipeline
  - cache (non-preempted)
  - cache-related preemption delay
  - bus interference
  - DRAM refreshes
  - ...
- Example: decomposition into pipeline and cache contributions:

$$T_{\mathit{pipeline,cache}}(P, \langle p, c\rangle) \le T_{\mathit{pipeline}}(P, \langle p\rangle) + T_{\mathit{cache}}(P, \langle c\rangle)$$
Timing Compositionality: Application in Timing Analysis

Given timing compositionality, the components (here: pipeline and cache) can also be analyzed separately:

$$\begin{aligned}
\mathrm{WCET}_{\mathit{pipeline,cache}}(P) &= \max_{p,c} T_{\mathit{pipeline,cache}}(P, \langle p, c\rangle) \\
&\le \max_{p} T_{\mathit{pipeline}}(P, \langle p\rangle) + \max_{c} T_{\mathit{cache}}(P, \langle c\rangle) \\
&= \mathrm{WCET}_{\mathit{pipeline}}(P) + \mathrm{WCET}_{\mathit{cache}}(P)
\end{aligned}$$

Hahn, Reineke, Wilhelm: Towards Compositionality in Execution Time Analysis - Definition and Challenges. In CRTS 2013.
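A toy sketch of how this is exploited: with made-up per-component timing contributions standing in for T_pipeline and T_cache, the sum of the two separately computed maxima soundly (if less precisely) bounds the maximum of the combined timing.

```python
from itertools import product

# Toy compositional WCET bound: made-up per-component contributions
# over tiny state spaces standing in for T_pipeline(P,<p>) and
# T_cache(P,<c>).
pipeline_time = {"empty": 100, "full": 120}   # hypothetical
cache_time = {"cold": 80, "warm": 20}         # hypothetical

def combined_time(p: str, c: str) -> int:
    """Stand-in for T_pipeline,cache(P, <p, c>); by compositionality it
    never exceeds the sum of the two separate contributions."""
    return pipeline_time[p] + cache_time[c] - 10   # some overlap

integrated = max(combined_time(p, c)
                 for p, c in product(pipeline_time, cache_time))
compositional = max(pipeline_time.values()) + max(cache_time.values())

assert integrated <= compositional
print(integrated, "<=", compositional)   # 190 <= 200: sound, slightly less precise
```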
Timing Compositionality
Example: "Cache-aware" Response-Time Analysis [Altmeyer et al.]

In preemptive scheduling, preempting tasks may "disturb" the cache contents of preempted tasks.

[Diagram: task T1 preempting task T2 within T2's response time, between T2's arrival a_T2 and deadline d_T2]

The additional misses due to preemption are referred to as the Cache-Related Preemption Delay (CRPD).
Timing Compositionality
Example: "Cache-aware" Response-Time Analysis [Altmeyer et al.]

Timing decomposition:
- WCET of T1 without preemptions: C1
- WCET of T2 without preemptions: C2
- Additional cost of T1 preempting T2: CRPD_{1,2} = BRT (block reload time) * #additional misses

→ Response time of T2:

$$R_2 \le C_2 + \#\mathit{preemptions} \cdot (C_1 + \mathrm{CRPD}_{1,2})$$
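A small numeric evaluation of this bound with made-up task parameters (a full response-time analysis would additionally derive the number of preemptions by fixed-point iteration; here it is simply assumed):

```python
# Toy evaluation of the response-time bound
#   R2 <= C2 + #preemptions * (C1 + CRPD_1_2)
# with made-up numbers; BRT is the block reload time, i.e. the cost of
# reloading one cache block evicted by the preempting task.

C1 = 2_000               # WCET of T1 without preemptions (cycles), hypothetical
C2 = 10_000              # WCET of T2 without preemptions (cycles), hypothetical
BRT = 40                 # block reload time (cycles), hypothetical
additional_misses = 25   # bound on extra misses per preemption, hypothetical
preemptions = 3          # assumed bound on preemptions of T2 by T1

crpd_1_2 = BRT * additional_misses
R2 = C2 + preemptions * (C1 + crpd_1_2)
print("CRPD per preemption:", crpd_1_2)   # 1000 cycles
print("Response-time bound R2:", R2)      # 19000 cycles
```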
Timing Compositionality: Open Analysis and Design Challenges

- How to check whether a given decomposition of a timing model is valid?
- How to compute bounds on the cost of individual events, such as cache misses (BRT in the previous example) or bus stalls?
- How to build microarchitectures that permit a sound and precise decomposition of their timing?
  - beyond simply stalling the whole system
- Evaluate the costs and benefits of compositional analyses.
Emerging Challenge: Microarchitecture Selection & Configuration

Embedded Software with Timing Requirements + Family of Microarchitectures (= Platform) → ?
Emerging Challenge: Microarchitecture Selection & Configuration

Choices within the family of microarchitectures (= platform):
- processor frequency
- sizes and latencies of local memories
- latency and bandwidth of the interconnect
- presence of a floating-point unit
- ...
Emerging Challenge: Microarchitecture Selection & Configuration

Goal: select a microarchitecture from the family that
a) satisfies all timing requirements, and
b) minimizes cost/size/energy.

Reineke and Doerfert: Architecture-Parametric Timing Analysis. In RTAS 2014.
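A minimal sketch of the selection problem stated above, with made-up configurations, WCET bounds, and costs. Architecture-parametric timing analysis would provide the WCET bound as a function of the configuration parameters rather than as a finite table.

```python
# Toy selection: pick the cheapest configuration whose WCET bound meets
# the deadline. All numbers are hypothetical.

DEADLINE = 1_000  # timing requirement in microseconds

# (configuration, WCET bound in microseconds, cost in arbitrary units)
configurations = [
    ({"freq_mhz": 100, "spm_kb": 16}, 1_400, 1.0),
    ({"freq_mhz": 200, "spm_kb": 16},   950, 1.8),
    ({"freq_mhz": 200, "spm_kb": 32},   700, 2.5),
    ({"freq_mhz": 400, "spm_kb": 32},   400, 4.0),
]

feasible = [(cfg, wcet, cost) for cfg, wcet, cost in configurations
            if wcet <= DEADLINE]
best = min(feasible, key=lambda entry: entry[2])
print("cheapest feasible configuration:", best)
# -> ({'freq_mhz': 200, 'spm_kb': 16}, 950, 1.8)
```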
Summary: Approaches to Increase Predictability

Inputs = program inputs + initial state of the microarchitecture + tasks on other cores

1. Reduce uncertainty about the inputs, e.g. by simplifying the microarchitecture: eliminate caches, branch prediction, etc.
2. Reduce the influence of uncertainty on the possible executions, e.g. by temporal isolation.
3. Decouple analysis efficiency from the number of executions:
   - eliminate timing anomalies and domino effects,
   - achieve timing compositionality,
   e.g. by LRU replacement or by stalling the pipeline upon a cache miss.
Conclusions

Challenges remain in
- modeling,
- analysis, and
- design.

Progress has been made, based on
- machine learning,
- abstraction, and
- partitioning.

Thank you for your attention!
Some References
A Compiler Optimization to Increase the Efficiency of WCET Analysis
M. A. Maksoud and J. Reineke. In RTNS, 2014.
Architecture-Parametric Timing Analysis
J. Reineke and J. Doerfert. In RTAS, 2014.
Selfish-LRU: Preemption-Aware Caching for Predictability and Performance
J. Reineke, S. Altmeyer, D. Grund, S. Hahn, C. Maiza. In RTAS, 2014.
Towards Compositionality in Execution Time Analysis - Definition and Challenges
S. Hahn, J. Reineke, and R. Wilhelm. In CRTS, 2013.
Impact of Resource Sharing on Performance and Performance Prediction: A Survey
A. Abel, F. Benz, J. Doerfert, B. Dörr, S. Hahn, F. Haupenthal, M. Jacobs, A. H. Moin, J. Reineke,
B. Schommer, and R. Wilhelm. In CONCUR, 2013.
Measurement-based Modeling of the Cache Replacement Policy
A. Abel and J. Reineke. In RTAS, 2013.
A PRET Microarchitecture Implementation with Repeatable Timing and Competitive Performance
I. Liu, J. Reineke, D. Broman, M. Zimmer, and E.A. Lee. In ICCD, 2012.
PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation
J. Reineke, I. Liu, H.D. Patel, S. Kim, and E.A. Lee. In CODES+ISSS, 2011.