CornellTalk.v1 - UAH - Engineering

Instruction and Data Address Trace Compression
Aleksandar Milenković
(collaborative work with Milena Milenković and Martin Burtscher)
Electrical and Computer Engineering Department
The University of Alabama in Huntsville
Email: [email protected]
Web: http://www.ece.uah.edu/~milenka
http://www.ece.uah.edu/~lacasa
Outline
• Program Execution Traces
• Trace Compression
• Trace Compression in Hardware
  - Stream caches and predictors for instruction address trace compression
  - Data address stride caches for data address trace compression
• Results
• Conclusions

Program Execution Traces
• Streams of recorded events
  - Basic block traces
  - Address traces
  - Instruction words
  - Operands
• Trace uses
  - Computer architects for evaluation of new architectures
  - Computer analysts for workload characterization
  - Software developers for program tuning, optimization, and debugging

Instruction and Data Address Traces: An Example

    for(i=0; i<100; i++) {
        c[i] = s*a[i] + b[i];
        sum = sum + c[i];
    }

Dinero+ Execution Trace (Type: 0 = data read, 1 = data write, 2 = instruction fetch)

    Instruction                                      Type  Instruction Address  Data Address
    @ 0x020001f4: mov r1,r12, lsl #2                 2     0x020001f4
    @ 0x020001f8: ldr r2,[r4, r1]                    0     0x020001f8           0xbfffbe24
    @ 0x020001fc: ldr r3,[r14, r1]                   0     0x020001fc           0xbfffbc94
    @ 0x02000200: mla r0,r2,r8,r3                    2     0x02000200
    @ 0x02000204: add r12,r12,#1 (1 >>> 0)           2     0x02000204
    @ 0x02000208: cmp r12,#99 (99 >>> 0)             2     0x02000208
    @ 0x0200020c: add r6,r6,r0                       2     0x0200020c
    @ 0x02000210: str r0,[r5, r1]                    1     0x02000210           0xbfffbb04
    @ 0x02000214: ble 0x20001f4                      2     0x02000214

Trace Issues
• Trace issues
  - Capture
  - Compression
  - Processing
• Traces tend to be very large
  - In terabytes for a minute of program execution
  - Expensive to store, transfer, and use
• Effective reduction techniques:
  - Lossless
  - High compression ratio
  - Fast decompression

Trace Compression
• General-purpose compression algorithms
  - Lempel-Ziv (gzip)
  - Burrows-Wheeler transform (bzip2)
  - Sequitur
• Trace-specific compression techniques
  - Tuned to exploit redundancy in traces
  - Better compression, faster, can be further combined with general-purpose compression algorithms

Trace-Specific Compression Techniques
Lossless Compression
Instructions
Instructions + data
Link data addresses to dynamic basic block
Offset
Mache [Samples 1989], LBTC [Luo and John 2004]
Replacing an execution sequence with its identifier
- Acyclic path (WPP [Larus 1999], Time Stamped WPP [Zhang and Gupta 2001])
Control flow graph + trace of transitions
QPT [Larus 1993]
- N-tuple [Milenkovic, Milenkovic and Kulick 2003]
[Pleszkun 1994], SBC [Milenkovic and Milenkovic, 2003]
Offset + repetitions
PDATS [Johnson, Ha and Zaidi 2001]
Link data addresses to loop
Regenerate addresses
- Instruction (PDI [Johnson, Ha and Zaidi 2001])
Graph with number of repetitions in nodes
Abstract execution
[Hamou-Lhadj and Lethbridge 2002]
[Eggers, et al. 1990], [Larus 1993]
[Elnozahy 1999], SIGMA [DeRose, et al. 2002]
Value Predictor
VPC [Burtscher and Jeeradit 2003], TCGEN [Burtscher and Sam 2005]

Why Trace Compression in Hardware?
• Problem #1: capturing program traces
  - In software: trap after each instruction or taken branch
    E.g., IBM's Performance Inspector
    Slowdown > 100 times
  - Multiple cores on a single chip + more detailed information needed (e.g., time stamps of events)
• Problem #2: debugging is far from fun
  - Stop execution on breakpoints, examine the state
  - Time-consuming, difficult, may miss a critical state leading to erroneous behavior
  - Stopping the CPU may perturb the sequence of events, making your bugs disappear
=> Need an unobtrusive real-time tracing mechanism

Trace Compression in Hardware
• Goals
  - Small on-chip area and small number of pins
  - Real-time compression (never stall the processor)
  - Achieve a good compression ratio
• Solution
  - A set of compression algorithms targeting on-the-fly compression of instruction and data address traces

Exploiting Streams and Strides
• Instruction address trace compression
  - Limited number and strong temporal locality of instruction streams
  - => Replace an instruction stream with its identifier
• Data address trace compression
  - Spatial and temporal locality of data addresses
  - => Recognize regular strides

    CINT           #Streams   Max.L   Dyn.SL
    164.gzip           1437     229     13.6
    176.gcc           30162     315     11.4
    181.mcf            1181      88      7.4
    186.crafty         5347     191     13.3
    197.parser         6116     189     10.0
    252.eon            4389     169     13.7
    253.perlbmk       11542     868     11.8
    254.gap            3530     284     11.1
    255.vortex         8254     126     11.0
    300.twolf          4902     185     14.4

    CFP            #Streams   Max.L   Dyn.SL
    168.wupwise        1912     229     27.4
    171.swim           1839     707    130.8
    172.mgrid          1725    1944    420.8
    173.applu          1752    3162    462.4
    177.mesa           1938     550    18.15
    178.galgel         4153     264     21.8
    179.art             976     561      9.0
    183.equake         1355     623     27.7
    188.ammp           1810     422     38.5
    189.lucas          1414     427    113.3
    191.fma3d          5007    1158     34.3
    200.sixtrack       6515     580    170.5
    301.apsi           2989     894     50.7

Trace Compressor: System Overview
[Figure: system block diagram. System Under Test: processor core and memory; the core supplies the program counter (PC), data address (DA), and task switch signals. Trace Compressor: data address buffer, Stream Cache (SC) producing SCIT and SCMT, Predictor + byte repetition FSM, Data Address Stride Cache (DASC) producing DT and DMT, byte repetition FSM, and Trace Output Controller. The trace port leads to an External Trace Unit for Storing/Processing (PC or Intelligent Drive).]

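Stream Detector + Stream Cache
[Figure: stream detector and stream cache. The stream detector compares the current PC with the previous PC (PPC); when PC - PPC != 4, the finished stream's starting address and length (S.SA, S.L) are placed in the instruction stream buffer. The Stream Cache (SC), organized as NSET sets by NWAY ways of (SA, SL) entries, is looked up with iSet = F(S.SA, S.SL). On a hit, the entry index (iSet, iWay) goes to the Stream Cache Index Trace (SCIT). On a miss, the reserved index ’00…0’ goes to SCIT and the descriptor (SA, SL) goes to the Stream Cache Miss Trace (SCMT).]
Example: the loop stream 0x020001f4 ... 0x02000214 is recorded as (0x020001f4, 0x09)
• Iteration 0 (miss): SCIT gets 0x00, SCMT gets (0x020001f4, 0x09)
• Iterations 1 to 99 (hits): SCIT gets 0x0E
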
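The stream detection step can be sketched in software as follows. This is a minimal illustration, not the hardware design: the function name `detect_streams`, the fixed 4-byte instruction size, and the flush of the last stream at the end of the input are assumptions made for the example.

```python
def detect_streams(pcs, instr_size=4):
    """Split a sequence of PCs into (start_address, length) stream records.

    A stream ends whenever the next PC is not the previous PC plus
    instr_size, i.e. after a taken branch.
    """
    streams = []
    start, length, prev = None, 0, None
    for pc in pcs:
        if prev is not None and pc - prev != instr_size:
            # Discontinuity: close the current stream and start a new one.
            streams.append((start, length))
            start, length = pc, 0
        if start is None:
            start = pc
        length += 1
        prev = pc
    if length:
        # Assumption: flush the stream still open when the input ends.
        streams.append((start, length))
    return streams


# Three iterations of the 9-instruction loop from the example trace.
loop_body = [0x020001F4 + 4 * k for k in range(9)]
for sa, sl in detect_streams(loop_body * 3):
    print(hex(sa), sl)
```

Each loop iteration collapses to the same (SA, SL) pair, which is what makes the stream cache lookup effective.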
SC Itrace Compression
// Compress instruction stream
1. Get the next instruction stream record (S.SA, S.SL) from the instruction stream buffer;
2. Look up the stream cache with iSet = F(S.SA, S.SL);
3. if (hit)
4.   Emit the entry index (iSet and iWay concatenated) to SCIT;
5. else {
6.   Emit reserved value 0 to SCIT;
7.   Emit stream descriptor (S.SA, S.SL) to SCMT;
8.   Select an entry (iWay) in the iSet set to be replaced;
9.   Update stream cache entry: SC[iSet][iWay].Valid = 1,
     SC[iSet][iWay].SA = S.SA, SC[iSet][iWay].SL = S.SL; }
10. Update stream cache replacement indicators;
Design Decisions:
• Instruction stream buffer size
  - Large enough not to stall the processor (e.g., on consecutive very short instruction streams)
• Stream cache
  - Size
  - Associativity
  - Replacement policy
  - Mapping function

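The steps above can be sketched as a small software model. It is an illustration under stated assumptions, not the authors' implementation: the class name `StreamCache`, FIFO replacement, the 16x4 default organization, and the bit order of the emitted index (way bits above set bits, chosen so that the slide example yields 0x0E) are all assumptions. The set mapping is the one reported later in the talk, S.SA<5+n:6> xor S.L<n-1:0>.

```python
class StreamCache:
    """Set-associative cache of (SA, SL) stream descriptors, FIFO replacement."""

    def __init__(self, nset=16, nway=4):
        # Sketch assumes NSET is a power of two and at least two ways.
        assert nway >= 2 and nset & (nset - 1) == 0
        self.nset, self.nway = nset, nway
        self.entries = [[None] * nway for _ in range(nset)]
        self.victim = [0] * nset   # FIFO replacement pointer per set
        self.victim[0] = 1         # (set 0, way 0) stays unused: index 0 is reserved
        self.scit, self.scmt = [], []

    def set_index(self, sa, sl):
        # S.SA<5+n:6> xor S.L<n-1:0>, with n = log2(NSET)
        return ((sa >> 6) ^ sl) & (self.nset - 1)

    def compress(self, sa, sl):
        """Process one stream record; return True on a stream cache hit."""
        iset = self.set_index(sa, sl)
        ways = self.entries[iset]
        for iway in range(self.nway):
            if ways[iway] == (sa, sl):
                # Hit: emit the entry index (assumed layout: way bits, then set bits).
                self.scit.append(iway * self.nset + iset)
                return True
        # Miss: reserved index 0 to SCIT, full descriptor to SCMT.
        self.scit.append(0)
        self.scmt.append((sa, sl))
        iway = self.victim[iset]
        ways[iway] = (sa, sl)
        nxt = (iway + 1) % self.nway
        if iset == 0 and nxt == 0:
            nxt = 1                # never allocate the reserved entry
        self.victim[iset] = nxt
        return False


sc = StreamCache()
hits = [sc.compress(0x020001F4, 0x09) for _ in range(100)]
print(hits.count(True), [hex(x) for x in sc.scit[:3]], sc.scmt)
```

With these assumptions the 100 loop iterations cost one 5-byte miss record plus one index per iteration.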
SC Itrace Compression: An Analytical Model
\[ \mathrm{CR}(SC.I) = \frac{\mathrm{Size}(Dinero.I)}{\mathrm{Size}(SCIT) + \mathrm{Size}(SCMT)} \]
\[ \mathrm{Size}(Dinero.I) = N \cdot 4 \ \text{bytes} \]
\[ \mathrm{Size}(SCIT) = \frac{N}{SL.Dyn} \cdot \frac{\log_2(N_{SET} \cdot N_{WAYS})}{8} \ \text{bytes} \]
\[ \mathrm{Size}(SCMT) = \frac{N}{SL.Dyn} \cdot \left(1 - SC.Hit(N_{SET}, N_{WAYS})\right) \cdot 5 \ \text{bytes} \]
\[ \mathrm{CR}(SC.I) = \frac{4 \cdot SL.Dyn}{\frac{1}{8}\log_2(N_{SET} \cdot N_{WAYS}) + 5 \cdot \left(1 - SC.Hit(N_{SET}, N_{WAYS})\right)} \]
\[ \lim_{SC.Hit \to 1} \mathrm{CR}(SC.I) = \frac{32 \cdot SL.Dyn}{\log_2(N_{SET} \cdot N_{WAYS})} \]
• N_SET x N_WAYS = 256: lim CR(SC.I) = 4 x SL.Dyn
• N_SET x N_WAYS = 128: lim CR(SC.I) = 4.57 x SL.Dyn
• N_SET x N_WAYS = 64: lim CR(SC.I) = 5.33 x SL.Dyn
Legend:
• CR(SC.I) – compression ratio
• N – number of instructions
• SL.Dyn – average stream length (dynamic)
• SC.Hit(Nset,Nway) – SC hit rate
Assumptions:
• Stream length < 256 (1 byte for SL)
• 4 bytes for stream starting address

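The model is easy to evaluate numerically. The helper below is a sketch; the function name `sc_itrace_cr` and its parameter names are made up for the example, and the formula is the one stated on this slide (4-byte addresses, 1-byte stream length, index of log2(N_SET x N_WAYS) bits).

```python
from math import log2


def sc_itrace_cr(sl_dyn, hit_rate, n_set, n_ways):
    """Compression ratio CR(SC.I) predicted by the analytical model."""
    index_bytes = log2(n_set * n_ways) / 8   # SCIT bytes per stream
    miss_bytes = 5 * (1 - hit_rate)          # SCMT bytes per stream (4 B SA + 1 B SL)
    return 4 * sl_dyn / (index_bytes + miss_bytes)


# Limits for a perfect hit rate and an average stream length of 1.
for n_set, n_ways in [(64, 4), (32, 4), (16, 4)]:
    print(n_set * n_ways, round(sc_itrace_cr(1.0, 1.0, n_set, n_ways), 2))

# Effect of a 98% hit rate with an 8-bit index and SL.Dyn = 10.
print(round(sc_itrace_cr(10.0, 0.98, 64, 4), 2))
```

The second call shows how quickly the miss term dominates: even a 2% miss rate adds 0.1 byte per stream on top of the 1-byte index.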
2nd Level Itrace Compression
• Size(SCIT) >> Size(SCMT)
  - HitRate = 98%, 8-bit index => Size(SCIT) = 10*Size(SCMT)
• Redundancy in SCIT
  - Temporal and spatial locality of instruction streams
• Reduce SCIT trace
  - Global Predictor
  - N-tuple compression using Tuple History Table
  - N-tuple compression using SCIT History Buffer

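The factor of 10 follows from the analytical model, counting bytes per instruction stream: an 8-bit index costs 1 byte per stream in SCIT, while a 5-byte descriptor is written to SCMT only on the 2% of lookups that miss. As a worked equation:

```latex
\[
\frac{\mathrm{Size}(SCIT)}{\mathrm{Size}(SCMT)}
  = \frac{8/8 \ \text{byte}}{(1 - 0.98) \cdot 5 \ \text{bytes}}
  = \frac{1}{0.1}
  = 10
\]
```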
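Global Predictor Structure
[Figure: global predictor. Incoming SCIT indices (next.sid) are shifted into a History Buffer; a function F of the history produces pindex, which selects one of the predictor entries 0 ... MaxP-1. The selected entry is compared (==?) with next.sid: a hit emits ’1’ and a miss emits ’0’ to the SCIT PRED Trace, and on a miss next.sid also goes to the SCIT PRED Miss Trace.]
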
SCIT Compression
// Predict SCIT index
1. Get the incoming index, next.sid, from the SCIT trace;
2. Calculate the SCIT predictor index, pindex, using indices in the History Buffer:
   pindex = F(indices in the History Buffer);
3. Perform lookup in the SCIT Predictor with pindex;
4. if (SCIT.Predictor[pindex] == next.sid)
5.   Emit ‘1’ to SCIT PRED Trace;
6. else {
7.   Emit ‘0’ to SCIT PRED Trace;
8.   Emit next.sid to SCIT PRED Miss Trace;
9.   SCIT.Predictor[pindex] = next.sid; }
10. Shift next.sid into the History Buffer;
Design Decisions:
• Length of history buffer
• Global predictor
  - Size
  - Mapping function

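A software sketch of the predictor loop above is shown below. The class name `ScitPredictor`, the table size, and the XOR fold used as F for histories longer than one are assumptions for illustration; with a history length of 1, F reduces to the previous SCIT index.

```python
class ScitPredictor:
    """Last-value predictor for SCIT indices, indexed by recent history."""

    def __init__(self, entries=256, history_len=1):
        self.table = [0] * entries
        self.history = [0] * history_len
        self.pred_trace = []   # one hit/miss bit per SCIT index
        self.miss_trace = []   # mispredicted indices, stored in full

    def _pindex(self):
        # Hypothetical F: XOR of the indices in the history buffer.
        h = 0
        for sid in self.history:
            h ^= sid
        return h % len(self.table)

    def compress(self, next_sid):
        p = self._pindex()
        if self.table[p] == next_sid:
            self.pred_trace.append(1)
        else:
            self.pred_trace.append(0)
            self.miss_trace.append(next_sid)
            self.table[p] = next_sid
        # Shift the new index into the history buffer.
        self.history = self.history[1:] + [next_sid]


gp = ScitPredictor()
for sid in [3, 7] * 4:     # two streams alternating
    gp.compress(sid)
print(gp.pred_trace, gp.miss_trace)
```

After a short warm-up every index in the alternating pattern is predicted, so only the 1-bit hit flags remain in the output.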
Redundancy in SCIT Pred Trace
• High predictor hit rates and long runs of 0xFF bytes are expected in the predictor hit trace
• Use a simple FSM to exploit byte repetitions
[Figure: byte repetition FSM. Each PRED hit trace byte is compared (=?) with Prev.BYTE and a counter (CNT) tracks repetitions; the outputs are the SCIT PRED Header and the SCIT PRED Repetition Trace.]
// Detect byte repetitions in SCIT pred
1. Get next SCIT Pred byte, Next.BYTE;
2. if (Next.BYTE == Prev.BYTE) CNT++;
3. else {
4.   if (CNT == 0) {
5.     Emit Prev.BYTE to SCIT.REP.Trace;
6.     Emit ‘0’ to SCIT Header;
7.   } else {
8.     Emit (Prev.BYTE, CNT) pair to SCIT.REP.Trace;
9.     Emit ‘1’ to SCIT Header; }
10.  Prev.BYTE = Next.BYTE; CNT = 0; }

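The byte repetition FSM can be modeled in a few lines, together with a decoder that shows the encoding is lossless. The function names, the one-byte limit on CNT, the reset of CNT after each emit, and the final flush at end of input are assumptions of this sketch; the slide does not specify them.

```python
def brep_encode(data, max_cnt=255):
    """Return (header_bits, repetition_trace) for a byte sequence."""
    header, rep = [], []
    prev, cnt = None, 0

    def emit(byte, count):
        if count == 0:
            rep.append(byte)             # single byte, header bit 0
            header.append(0)
        else:
            rep.extend([byte, count])    # (byte, extra repetitions), header bit 1
            header.append(1)

    for b in data:
        if prev is None:
            prev = b
        elif b == prev and cnt < max_cnt:
            cnt += 1
        else:
            emit(prev, cnt)
            prev, cnt = b, 0             # assumed: CNT restarts for each new byte
    if prev is not None:
        emit(prev, cnt)                  # assumed: flush at end of input
    return header, rep


def brep_decode(header, rep):
    """Inverse of brep_encode."""
    out, i = [], 0
    for flag in header:
        if flag:
            out.extend([rep[i]] * (rep[i + 1] + 1))
            i += 2
        else:
            out.append(rep[i])
            i += 1
    return bytes(out)


raw = bytes([0xFF] * 5 + [0x7F] + [0xFF] * 2)
hdr, trace = brep_encode(raw)
print(hdr, trace, brep_decode(hdr, trace) == raw)
```

Note that CNT counts repetitions beyond the first occurrence, matching the CNT == 0 test for a lone byte in the pseudocode.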
Data Address Trace Compression
• More challenging task
• Data addresses rarely stay constant during program execution
• However, they often have a regular stride
• => Use Data Address Stride Cache (DASC) to exploit locality of memory referencing instructions and regularity in data address strides

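Data Address Stride Cache
• DASC
  - Tagless structure
  - Indexed by PC of the corresponding instruction
• Entry fields
  - LDA – Last Data Address
  - Stride
[Figure: Data Address Stride Cache (DASC) with entries 0 ... N-1, each holding LDA and Stride. The PC (e.g., 0x020001f8) selects an entry through index = G(PC); the difference DA - LDA is compared (==?) with the stored stride to produce Stride.Hit. A hit emits ’1’ to DT (data trace); a miss emits ’0’ to DT and the data address to DMT (data miss trace). Example: for data addresses 0xbfffbe24, 0xbfffbe20, 0xbfffbe1c, DT is 0, 0, 1 and DMT holds 0xbfffbe24, 0xbfffbe20.]
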
DASC Compression
// Compress data address stream
1. Get the next pair (PC, DA) from the data buffers;
2. Look up the data address stride cache with iSet = G(PC);
3. cStride = DA - DASC[iSet].LDA;
4. if (cStride == DASC[iSet].Stride) {
5.   Emit ‘1’ to DT; // 1-bit info
6. } else {
7.   Emit ‘0’ to DT;
8.   Emit DA to DMT;
9.   DASC[iSet].Stride = lsb(cStride); }
10. DASC[iSet].LDA = DA;
Design Decisions:
• Number of entries
• Index function G
• Stride length
• Data address buffer depth

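The DASC steps can be sketched as a small software model. The class name `Dasc`, the index function G (word-aligned PC modulo the number of entries), and the treatment of the stored stride as a signed value of `stride_bytes` bytes are assumptions for illustration. The example replays three addresses with a constant stride of -4 for one load instruction.

```python
class Dasc:
    """Tagless data address stride cache indexed by the instruction PC."""

    def __init__(self, entries=1024, stride_bytes=1):
        self.entries = entries
        self.bits = 8 * stride_bytes
        self.lda = [0] * entries      # last data address per entry
        self.stride = [0] * entries   # last observed (truncated) stride
        self.dt, self.dmt = [], []    # hit/miss bits and missed addresses

    def _index(self, pc):
        # Hypothetical G: drop the two alignment bits, wrap to table size.
        return (pc >> 2) % self.entries

    def _lsb(self, value):
        # Keep the low-order bits of the stride, read back as a signed value.
        value &= (1 << self.bits) - 1
        if value >= 1 << (self.bits - 1):
            value -= 1 << self.bits
        return value

    def compress(self, pc, da):
        i = self._index(pc)
        c_stride = da - self.lda[i]
        hit = c_stride == self.stride[i]
        if hit:
            self.dt.append(1)
        else:
            self.dt.append(0)
            self.dmt.append(da)
            self.stride[i] = self._lsb(c_stride)
        self.lda[i] = da
        return hit


dasc = Dasc()
for da in (0xBFFFBE24, 0xBFFFBE20, 0xBFFFBE1C):
    dasc.compress(0x020001F8, da)
print(dasc.dt, [hex(a) for a in dasc.dmt])
```

The first access misses because the entry is empty, the second misses while the stride is learned, and from the third access on each reference costs a single bit.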
DASC Dtrace Compression: An Analytical Model
\[ \mathrm{CR}(SC.D) = \frac{\mathrm{Size}(Dinero.D)}{\mathrm{Size}(DT) + \mathrm{Size}(DMT)} \]
\[ \mathrm{Size}(Dinero.D) = N_{memref} \cdot 4 \ \text{B} \]
\[ \mathrm{Size}(DT) + \mathrm{Size}(DMT) = N_{memref} \cdot \left[(1 - DASC.Hit) \cdot 4 + 0.125\right] \ \text{B} \]
\[ \mathrm{CR}(SC.D) = \frac{1}{1.03125 - DASC.Hit} \]
\[ \lim_{DASC.Hit \to 1} \mathrm{CR}(SC.D) = \frac{1}{0.03125} = 32 \]
Legend:
• CR(SC.D) – compression ratio
• Nmemref – number of memory referencing instructions
• DASC.Hit – DASC hit rate
Assumptions:
• 4 bytes for each data address

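A quick numerical check of this model, as a sketch (the function name `dasc_dtrace_cr` is made up for the example): every memory reference costs one bit in DT, and each miss adds a 4-byte address to DMT.

```python
def dasc_dtrace_cr(hit_rate):
    """Compression ratio CR(SC.D) predicted by the analytical model."""
    bytes_per_ref = (1 - hit_rate) * 4 + 0.125   # DMT share + 1 bit of DT
    return 4 / bytes_per_ref


for hit in (0.5, 0.9, 0.99, 1.0):
    print(hit, round(dasc_dtrace_cr(hit), 2))
```

The ratio is capped at 32 by the one bit per reference in DT, which is why the byte repetition stage matters for data traces with very high hit rates.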
Redundancy in DT Trace
• High predictor hit rates and long runs of 0xFF bytes are expected in DT Trace
• Use a simple FSM to exploit byte repetitions
[Figure: byte repetition FSM for the data trace. Each DT byte is compared (=?) with Prev.DT and a counter (CNT) tracks repetitions; the outputs are the Data Header (DH) and the Data Repetition Trace (DRT).]
// Detect data repetitions
1. Get next DT byte;
2. if (DT == Prev.DT) CNT++;
3. else {
4.   if (CNT == 0) {
5.     Emit Prev.DT to DRT;
6.     Emit ‘0’ to DH;
7.   } else {
8.     Emit (Prev.DT, CNT) pair to DRT;
9.     Emit ‘1’ to DH; }
10.  Prev.DT = DT; CNT = 0; }

Experimental Evaluation
• Goals
  - Assess the effectiveness of the proposed algorithms
  - Explore the feasibility of the proposed hardware implementations
  - Determine optimal size and organization of HW structures
• Workload
  - 16 MiBench benchmarks
  - ARM architecture

    Benchmark                 IC    NUS   maxSL   SL.Dyn
    cjpeg            104,607,812   1636     239    10.89
    djpeg             23,391,628   1324     206    21.81
    lame           1,285,111,635   3410     252    27.81
    tiff2bw          143,254,646   1058      43    12.79
    tiff2rgba        151,691,275   1146      75    27.54
    tiffmedian       541,260,067   1431      75    22.22
    tiffdither       832,951,018   1831      51    12.57
    mad              286,974,899   1659    1055    20.09
    sha              140,885,982    495      62    15.15
    bf_e             544,053,846    413     300     5.85
    rijndael_e       319,977,971    542     254    18.94
    ghostscript      708,090,638   6900     187     8.70
    rsynth           824,942,227   1323     180    15.77
    stringsearch       3,675,745    439      62     5.61
    adpcm_c          732,513,651    347      71    54.63
    gsm_d          1,299,270,245    845     401    11.07

Legend:
• IC – Instruction count
• NUS – Number of unique instruction streams
• maxSL – Maximum stream length
• SL.Dyn – Average stream length (dynamic)

Findings about SC Size/Organization
• Good compression ratio
  - Outperforms fast GZIP
  - High stream cache hit rates for all applications (> 98%)
  - Smaller SCs work well too
• Replacement policy
  - Pseudo-LRU vs. FIFO
• Associativity
  - 4-way is a reasonable choice
  - 8-way and 16-way desirable
• Mapping function
  - S.SA<5+n:6> xor S.L<n-1:0>, n = log2(NSET)

    CR(SC.I)
    Entries   1 way   2 ways   4 ways   8 ways
    8          16.3     17.6     17.0     15.8
    16         21.1     22.1     27.8     26.6
    32         23.9     28.0     34.4     34.0
    64         27.5     36.9     44.1     47.1
    128        29.0     47.6     54.1     57.4
    256        28.0     47.8     53.6     54.2

[Figure: CR = f(Complexity), 4-way SC; CR/MaxCR versus number of SC entries.]

Findings about Global Predictor
• Number of entries should not exceed the number of entries in SC
• Having longer histories and larger predictors gives only marginal improvements for all applications except ghostscript, blowfish, and stringsearch
• History length = 1
  - Index GPRED using the previous SCIT index

    CR(SC+GP.I) by SC organization and number of predictor entries
    SC Entries      P32      P64     P128     P256
    8x4           47.64
    16x4          72.17    81.19
    32x4          91.91   113.22   145.79
    64x4         100.32   115.09   150.54   207.64

Putting It All Together (SC+GPRED+BREP):
Itrace Compression
SC,GPRED
8x8,64 16x8,128 32x8,256 64x4,256
CR
277.1
315.0
316.7
263.7
cjpeg
492.3
539.4
443.3
287.1
djpeg
250.6
255.2
238.6
214.0
lame
1493.0
3062.2
1111.5
351.5
tiff2bw
1834.0
3592.0
3713.1
517.6
tiff2rgba
1601.2
1827.4
1229.4
649.4
tiffmedian
154.3
184.8
120.9
54.8
tiffdither
253.4
257.2
230.4
221.0
mad
322.3
322.4
339.6
348.5
sha
92.6
92.6
100.2
100.2
bf_e
285.6
290.1
298.6
142.1
rijndael_e
119.4
123.6
106.4
30.4
ghostscript
211.5
246.0
152.8
97.0
rsynth
74.9
114.0
78.5
21.8
stringsearch
29972.5 28663.9 27457.8 27456.6
adpcm_c
376.0
401.2
292.3
234.9
gsm_d
237.8
254.4
209.0
113.2
TOTAL
DEF.
I.GZ
109.6
71.8
60.5
114.1
121.3
152.8
91.1
73.5
211.4
170.4
143.8
100.6
46.7
82.1
233.1
85.4
87.5
FAST
I.GZ
54.5
39.8
128.5
83.9
20.3
92.3
46.4
37.8
54.4
41.0
12.6
39.7
30.6
32.3
107.3
59.2
47.2
BEST BEST DEF.
I.GZ I.BZ2 GZGZ
265.7
124.5 342.0
232.5
73.7 202.0
174.2
87.6
333.9
615.2
114.4 376.8
122.0 529.6 1292.7
155.5 472.9 1017.5
147.1
99.8 170.9
206.2
94.3
78.5
221.8 656.5 4112.1
182.3 352.0 4065.9
150.6 141.8 2392.9
434.5
111.2 212.5
191.2
48.0 143.2
132.8
100.6 202.5
233.6 1862.6 12764.7
507.1
87.2 165.6
321.6
112.9 172.0
Findings about DASC
Stride size




1 byte is optimal
2 byte stride improves
compression for  10%
• DASC with 1K entries is an optimal choice
• Tagged (multi-way) DASC further improves overall compression ratio
  - Increased complexity
[Figure: CR = f(Complexity); compression ratio (CR) versus number of DASC entries.]

DASC Compression Ratio

                   DASC    DASC    DASC    DASC    DASC    DASC    DEF.    FAST    BEST
    Benchmark        32      64     128     256     512    1024    D.GZ    D.GZ    D.GZ   D.BZ2  D.GZGZ
    cjpeg          3.35    4.60    5.14    5.77    6.54    7.11    5.98    4.50    6.11   18.20    9.57
    djpeg          2.81    3.57    4.28    4.96    5.22    5.29    4.22    3.78    4.22    8.62    4.92
    lame           1.20    1.52    2.81    3.82    4.49    4.88    6.56    4.01    6.63    8.80    8.60
    tiff2bw       76.31   78.04   84.28  105.04  128.84  134.23    2.14    2.55    2.10   14.28    3.07
    tiff2rgba      5.98   79.81   91.24  107.49  127.05  139.57    2.10    2.79    2.09    4.06    4.03
    tiffmedian     8.64    8.70    8.74    8.81    8.87    8.89    4.40    4.37    4.53   11.16    6.03
    tiffdither     2.61    6.08    7.21    8.69    9.65   10.06    4.51    4.41    4.51    7.87    6.77
    mad            1.30    1.59    1.96    2.07    2.35    2.64    4.08    3.60    4.22   13.47    6.97
    sha            6.58    7.94    9.38   10.79   11.36   11.36   44.91    8.36   45.61  172.71  591.69
    bf_e           1.58    1.95    2.38    2.61    2.75    2.91    7.58    4.86    7.83   16.35    9.08
    rijndael_e     1.10    1.10    1.10    1.13    1.29    2.06    4.24    3.22    4.27    7.31    4.49
    ghostscript    1.07    1.19    1.56    2.19    2.93    5.27   27.21   18.58   27.46   47.42   40.83
    rsynth         1.22    1.36    1.76    3.81    8.30   32.43   24.44   21.46   25.27   57.40   43.88
    stringsearch   1.80    2.04    2.70    4.13    4.44    5.16   11.12    8.57   11.23   15.03   11.47
    adpcm_c        3.13    3.13    3.13    3.13    3.13    3.13    6.57    3.64    7.15   12.27   11.42
    gsm_d          2.67    4.48   11.30   13.60   14.81   16.78   21.60   18.05   23.29   63.53   33.15
    TOTAL          1.66    2.04    2.80    3.77    4.67    6.12    6.78    5.51    6.90   13.29    9.70

Hardware Complexity Estimation
• CPU model
  - In-order, XScale-like
  - Vary SC and DASC parameters
• SC and DASC timings
  - SC: Hit latency = 1 clock, Miss latency = 2 clocks
  - DASC: Hit latency = 2 clocks, Miss latency = 2 clocks
• To avoid any stalls
  - Instruction stream input buffer: MIN = 2 entries
  - Data address input buffer: MIN = 8 entries
  - Results are relatively independent of SC and DASC organization

    Component                        Entries   Complexity   Bytes
    Instruction stream buffer        2         2x5          10
    Stream detector                  2         2x4          8
    Stream cache                     64x4      256x5        1280
    Global Predictor                 256       256 + 1(h)   257
    Data address buffer              8         8x8          64
    Data address stride cache        1024      1024x5       5120
    Byte repetition state machines   -         4            4

Trace Port Bandwidth Analysis
[Figure: four plots of trace port bandwidth (bits/instr.) versus instructions executed (millions), for CJPEG and MAD. The instruction trace plots compare SC, SC+PRED, and SC+PRED+BREP; the data trace plots compare TDASC and TDASC+BREP.]

Conclusions
• A set of algorithms and hardware structures for instruction and data address trace compression
  - Stream Caches + Global Predictor + Byte repetition FSM for instruction traces
  - Data Address Stride Cache + Byte repetition FSM for data traces
• Benefits
  - Enabling real-time trace compression with high compression ratio
  - Low complexity (small structures, small number of external pins)
• Analytical & simulation analysis focusing on compression ratio and optimal sizing/organization of the structures as well as real-time trace port bandwidth requirements

Laboratory for Advanced
Computer Architectures and Systems
at Alabama: Research Overview
Aleksandar Milenković
The LaCASA Laboratory
Electrical and Computer Engineering Department
The University of Alabama in Huntsville
Email: [email protected]
Web: http://www.ece.uah.edu/~milenka
http://www.ece.uah.edu/~lacasa
Secure Processors
• Computer Security is Critical
  - Software & physical attacks
  - Yesterday, Today, Tomorrow: vulnerability advisories shown on the slide
    - MMClient.exe in Indiatimes Messenger 6.0 allows remote attackers to cause a denial of service (application crash) and possibly execute arbitrary code via a long group name argument to the RenameGroup function in the MMClient.MunduMessenger.1 ActiveX object.
    - Multiple format string vulnerabilities in (1) neon 0.24.4 and earlier, and other products that use neon including (2) Cadaver, (3) Subversion, (4) OpenOffice, allow remote malicious WebDAV servers to execute arbitrary code.
    - Multiple heap-based buffer overflows in the imlib BMP image handler allow remote attackers to execute arbitrary code via a crafted BMP file.
    - Stack-based buffer overflow in the URL parsing function in Gaim before 1.3.0 allows remote attackers to execute arbitrary code via an instant message (IM) with a large URL.
    - Buffer overflow in WIDCOMM Bluetooth Connectivity Software, as used in products such as BTStackServer 1.3.2.7 and 1.4.2.10, Windows XP and Windows 98 with MSI Bluetooth Dongles, and HP IPAQ 5450 running WinCE 3.0, allows remote attackers to execute arbitrary code via certain service requests.
    - Integer overflow in pixbuf_create_from_xpm (io-xpm.c) in the XPM image decoder for gtk+ 2.4.4 (gtk2) and earlier, and gdk-pixbuf before 0.22, allows remote attackers to execute arbitrary code via certain n_col and cpp values that enable a heap-based buffer overflow.
    - Buffer overflow in the JPEG (JPG) parsing engine in the Microsoft Graphic Device Interface Plus (GDI+) component, GDIPlus.dll, allows remote attackers to execute arbitrary code via a JPEG image.
    - Multiple buffer overflows in RealOne Player, RealOne Player 2.0, RealOne Enterprise Desktop, and RealPlayer Enterprise allow remote attackers to execute arbitrary code via malformed (1) .RP, (2) .RT, (3) .RAM, (4) .RPM or (5) .SMIL files.
• Sign & Verify for Guaranteed Integrity and Confidentiality of Code
[Figure: three phases. Secure Installation: Original Code, Generate Program Keys (Key1,Key2,Key3), Encrypt I-Block, Calculate Signature, Encrypt, Signed Code. Program Loading (Secure Mode): EKey.CPU(Key1), EKey.CPU(Key2), EKey.CPU(Key3) are decrypted into the Program Keys (Key1,Key2,Key3). Secure Execution: Instruction Fetch of EKey3(I-Block), Decrypt, Calculate Signature; Signature Fetch, Decrypt; the two signatures are compared (=?) and a Signature Match yields Trusted Code.]
• Improvements
  - PMAC (Parallel MACs) for reduced cryptographic latency
  - A variation of the one-time-pad for code encryption
  - Instruction Verification Buffer for conditional execution before verification
http://www.ece.uah.edu/~lacasa/research.htm#secure_processors

Microbenchmarks for Architectural Analysis
• Small programs for uncovering architectural parameters (usually not publicly disclosed) of modern processors
• Benefits
  - Architecture-aware compiler optimization
  - Processor design evaluation and verification
  - Testing
  - Competitive analysis
• Challenge
  - Relatively simple, so their behavior can be understood
• Results
  - Microbenchmarks for BTB analysis
  - Experimental flow for outcome predictor
  - Tested on P6 and NetBurst (Northwood core)
  - Dothan (PentiumM) predictor
[Figure: microbenchmarks target the BTB (BTB Size, BTB Org., BTB Indexing, ...) and the Outcome Predictor (Local History, Global History, ...); branch related events are read from Performance Counters.]
http://www.ece.uah.edu/~lacasa/bp_mbs/bp_microbench.htm

TinyHMS
[Figure: TinyHMS concept, prototype, and software. Labeled components: PS (PDA) with User Interface, WWAN/WLAN Communication, Storage, and Interface (USB/CF); Network Coordinator (Telos) with Interface (USB/CF), ActiS Protocol, TimeSync, Messaging, Buffering, Flash Storage, Messaging Control, and Wireless Transceiver; ActiS (Tmote sky) with ActiS Application Layer, Main Control (Messaging, Fusion, Buffering), Signal Processing, Filtering/Pre-processing, Data Acquisition, Sensor Interface, IAS/ISPM ActiS Interface, TimeSync, Flash Storage, and Wireless Transceiver.]
http://www.ece.uah.edu/~lacasa/research.htm#tinyHMS
TinyHMS
[Figure: recorded signals from the Motion Sensor (TS2; accX, accY, accZ) and the ECG Sensor (TS1) with detected Heart Beat and Step events, above a timing diagram of Frame i-1 and Frame i. Each frame starts with a Beacon Message from the NC followed by slots TS1, TS2, TS3, …; a Heart Beat Event Message with Timestamp is sent in a sensor slot.]