Shadow Profiling:
Hiding Instrumentation Costs with Parallelism

Tipp Moseley, Alex Shye, Vijay Janapa Reddi, Dirk Grunwald (University of Colorado)
Ramesh Peri (Intel Corporation)
Motivation

An ideal profiler will…
1. Collect arbitrarily detailed and abundant information
2. Incur negligible overhead

A real profiler, e.g., one using Pin, satisfies condition 1, but the cost is high:
- 3X for BBL counting
- 25X for loop profiling
- 50X or higher for memory profiling

A real profiler, e.g., one using PMU sampling or code patching, satisfies condition 2, but the detail is very coarse.
Motivation
[Figure: profiling tools arranged along a detail vs. overhead spectrum]
- High detail: Pintools, Valgrind, ATOM, …
- In between: "Bursty Tracing" (sampled instrumentation), novel hardware, Shadow Profiling
- Low overhead: VTune, DCPI, OProfile, PAPI, pfmon, PinProbes, …
Goal

To create a profiler capable of collecting detailed, abundant information while incurring negligible overhead.

- Enable developers to focus on other things
The Big Idea

- Stems from fault-tolerance work on deterministic replication
- Periodically fork() and profile the "shadow" processes (see the sketch below)

[Figure: the original execution runs natively on CPU 0 and fork()s at each slice boundary; shadow copies of slices 0-4 run under profiling instrumentation on CPUs 1-3. * Assuming instrumentation overhead of 3X]
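As a conceptual illustration (not the paper's actual implementation), the sketch below shows the fork-per-slice idea in plain C++; profile_slice() and do_slice_work() are hypothetical placeholders, and the real system forks from a separate monitor rather than from the application itself.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    // Hypothetical stand-in for replaying one slice under heavyweight instrumentation.
    static void profile_slice(int slice) {
        std::fprintf(stderr, "shadow %d profiles slice %d\n", (int)getpid(), slice);
        // ... run the slice under the JIT-based profiler here ...
        _exit(0);                         // a shadow must never outlive its slice
    }

    // Hypothetical stand-in for the application's native work in one slice.
    static void do_slice_work(int slice) { (void)slice; usleep(100 * 1000); }

    int main() {
        for (int slice = 0; slice < 5; ++slice) {
            pid_t pid = fork();           // duplicate the address space at the boundary
            if (pid == 0)
                profile_slice(slice);     // child becomes the shadow for this slice
            do_slice_work(slice);         // parent keeps running at native speed
        }
        while (wait(nullptr) > 0) {}      // reap shadows that are still finishing
        return 0;
    }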
Challenges

- Threads
- Shared memory
- Asynchronous interrupts
- System calls
- JIT overhead
- Overhead vs. number of CPUs
  - Maximum speedup is the number of CPUs
  - If profiler overhead is 50X, at least 51 CPUs are needed to run in real time (probably many more)

There are too many complications to ensure deterministic replication.
Goal (Revised)

To create a profiler capable of sampling detailed traces (bursts) with negligible overhead.

- Trade abundance for low overhead
- Like SimPoints or SMARTS (but not as smart :)
The Big Idea (revised)

- Do not strive for a full, deterministic replica
  - Instead, profile many short, mostly deterministic bursts
- Profile a fixed number of instructions (see the sketch after the figure)
  - "Fake it" for system calls
  - Must not allow the shadow to side-effect the system

[Figure: revised timeline; the original execution fork()s at slice boundaries, and each shadow ("Spyware") profiles only a short burst of its slice on a spare CPU before exiting]
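A rough sketch, assuming the standard Pin instrumentation API, of how a shadow could bound a burst to a fixed instruction budget; the 10M-instruction budget and the use of PIN_ExitApplication() to end the burst are illustrative choices, not necessarily the paper's exact mechanism.

    #include "pin.H"

    static UINT64 executed = 0;
    static const UINT64 kBudget = 10000000;      // e.g., a 10M-instruction burst

    // Analysis routine: count instructions, stop the shadow once the budget is spent.
    static VOID CountOne() {
        if (++executed >= kBudget)
            PIN_ExitApplication(0);              // end of burst; the shadow exits
    }

    // Instrumentation routine: insert the counter before every instruction.
    static VOID Instruction(INS ins, VOID *) {
        INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountOne, IARG_END);
    }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();                      // never returns
        return 0;
    }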
Design Overview
[Figure: a monitor is attached to the original application; each fork() produces a shadow application that runs under profiling instrumentation]
Design Overview

- The monitor uses Pin Probes (code patching); the application runs natively
- The monitor receives a periodic timer signal and decides when to fork() (sketched below)
- After fork(), the child uses PIN_ExecuteAt() to switch Pin from Probe mode to JIT mode
- The shadow process profiles as usual, except for the handling of special cases
- The monitor logs special read() system calls and pipes the results to shadow processes
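A hedged sketch of the monitor's timer-driven fork decision in ordinary C++; the real monitor is itself a Pintool running in probe mode, and the child would hand control to the JIT-based profiler rather than exiting. The one-second period and the load cap are illustrative values.

    #include <csignal>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    static volatile std::sig_atomic_t liveShadows = 0;
    static const int kMaxShadows = 2;              // illustrative "load" limit

    // Timer handler: decide whether to spawn a new shadow at this boundary.
    static void OnTimer(int) {
        if (liveShadows >= kMaxShadows) return;    // too many bursts in flight
        pid_t pid = fork();
        if (pid == 0) {
            // Child: here the real system switches Pin from probe to JIT mode
            // and starts profiling a burst.
            _exit(0);
        }
        if (pid > 0) ++liveShadows;                // decremented on SIGCHLD (omitted)
    }

    int main() {
        std::signal(SIGALRM, OnTimer);
        struct itimerval period = {{1, 0}, {1, 0}};  // fire once per second
        setitimer(ITIMER_REAL, &period, nullptr);
        for (;;) pause();                            // stand-in for the native run
    }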
System Calls

- For SPEC CPU2000, system calls occur around 35 times per second
- Forking after each one puts a lot of pressure on CoW pages and the Pin JIT engine
- 95% of dynamic system calls can be safely handled
- Some system calls can be allowed to execute (49%):
  - getrusage, _llseek, times, time, brk, munmap, fstat64, close, stat64, umask, getcwd, uname, access, exit_group, …
System Calls

- Some can be replaced, with success assumed (39%):
  - write, ftruncate, writev, unlink, rename, …
- Some are handled specially, but execution may continue (1.8%):
  - mmap2, open(creat), mmap, mprotect, mremap, fcntl
- read() is special (5.4%):
  - For reads from pipes/sockets, the data must be logged from the original app
  - For reads from files, the file must be closed and reopened after the fork() because the OS file pointer is not duplicated
- ioctl() is special (4.8%):
  - Frequent in perlbmk
  - Behavior is device-dependent; the safest action is to simply terminate the segment and re-fork()
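As a hedged sketch of how a shadow-side system-call filter might encode these categories, the classifier below mirrors the buckets above for a few representative calls; the enum, the function, and the idea of applying it from a syscall-entry hook are illustrative, not the paper's code.

    #include <sys/syscall.h>

    // Possible fates of a system call intercepted inside a shadow process.
    enum class SyscallAction { Execute, FakeSuccess, ReplayFromLog, EndSegment };

    // Classify a call number along the lines of the categories above; a real
    // Pintool would apply this from a syscall-entry hook (not shown here).
    static SyscallAction ClassifySyscall(long nr) {
        switch (nr) {
            case SYS_times: case SYS_brk:            // side-effect free: run as-is
            case SYS_close: case SYS_munmap:
                return SyscallAction::Execute;
            case SYS_write: case SYS_ftruncate:      // externally visible writes:
                return SyscallAction::FakeSuccess;   // skip them and pretend success
            case SYS_read:                           // data replayed from the monitor's
                return SyscallAction::ReplayFromLog; // log or from a reopened file
            case SYS_ioctl:                          // device-dependent: give up
            default:                                 // unknown calls: be conservative
                return SyscallAction::EndSegment;
        }
    }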
Other Issues

- Shared memory
  - Disallow writes to shared memory
- Asynchronous interrupts (userspace signals)
  - Since we are only mostly deterministic, these are no longer an issue
  - When the main program receives a signal, pass it along to the live children (see the sketch below)
- JIT overhead
  - After each fork(), it is like Pinning a new program
  - Warmup is too slow
  - Use persistent code caching [CGO'07]
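For the signal-forwarding point above, a minimal sketch (assumed behavior, not the paper's code) of the original process relaying a user-level signal to the shadow children it knows are still alive:

    #include <signal.h>
    #include <sys/types.h>

    static const int kMaxShadows = 8;
    static pid_t liveShadows[kMaxShadows];        // filled in when shadows are forked
    static int numShadows = 0;

    // Relay a user-level signal so in-flight bursts observe roughly the same
    // asynchronous events as the original execution.
    static void ForwardSignal(int sig) {
        for (int i = 0; i < numShadows; ++i)
            kill(liveShadows[i], sig);            // kill() is async-signal-safe
    }

    static void InstallForwarding() {
        signal(SIGUSR1, ForwardSignal);
        signal(SIGUSR2, ForwardSignal);
    }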
Multithreaded Programs
Issue: fork() does not duplicate all threads

- Only the thread that called fork() survives

Solution (a simplified sketch follows):
1. Barrier all threads in the program and store their CPU state
2. Fork the process and clone new threads for those that were destroyed
   - Identical address space; only register state was really 'lost'
3. In each new thread, restore the previous CPU state
   - Modified clone() handling in the Pin VM
4. Continue execution; virtualize thread IDs for the relevant system calls
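A simplified, runnable sketch of this recipe with pthreads; here a thread's "CPU state" is reduced to a tiny struct that it checkpoints before the barrier, whereas the real system snapshots full register state and needs the modified clone() handling inside the Pin VM. All names and the amount of "work" are illustrative.

    #include <pthread.h>
    #include <unistd.h>
    #include <cstdio>

    static const int kThreads = 3;

    // Stand-in for a thread's CPU state; the real system snapshots registers.
    struct ThreadState { int id; long progress; };
    static ThreadState saved[kThreads];
    static pthread_barrier_t checkpointed, resumed;     // kThreads + 1 waiters each

    // Original process: each worker checkpoints its state, then parks until
    // the forking thread has created the shadow.
    static void* Worker(void* arg) {
        ThreadState st = { static_cast<int>(reinterpret_cast<long>(arg)), 0 };
        for (st.progress = 0; st.progress < 1000000; ++st.progress) {}  // "work"
        saved[st.id] = st;                        // 1. store "CPU state"
        pthread_barrier_wait(&checkpointed);
        pthread_barrier_wait(&resumed);           //    parked while fork() runs
        return nullptr;
    }

    // Shadow process: a respawned thread restores its saved state and resumes.
    static void* ResumeWorker(void* arg) {
        ThreadState st = *static_cast<ThreadState*>(arg);   // 3. restore state
        std::printf("shadow thread %d resumes at %ld\n", st.id, st.progress);
        return nullptr;                           // 4. continue, instrumented
    }

    int main() {
        pthread_barrier_init(&checkpointed, nullptr, kThreads + 1);
        pthread_barrier_init(&resumed, nullptr, kThreads + 1);
        pthread_t t[kThreads];
        for (long i = 0; i < kThreads; ++i)
            pthread_create(&t[i], nullptr, Worker, reinterpret_cast<void*>(i));

        pthread_barrier_wait(&checkpointed);      // all workers are quiescent
        if (fork() == 0) {                        // 2. only this thread survives
            pthread_t s[kThreads];
            for (int i = 0; i < kThreads; ++i)
                pthread_create(&s[i], nullptr, ResumeWorker, &saved[i]);
            for (int i = 0; i < kThreads; ++i) pthread_join(s[i], nullptr);
            _exit(0);
        }
        pthread_barrier_wait(&resumed);           // let the original threads go on
        for (int i = 0; i < kThreads; ++i) pthread_join(t[i], nullptr);
        return 0;
    }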
Tuning Overhead

- Load
  - Number of active shadow processes
  - Tested 0.125, 0.25, 0.5, 1.0, 2.0
- Sample size
  - Number of instructions to profile
  - Longer samples for less overhead and more data
  - Shorter samples for more evenly dispersed data
  - Tested 1M, 10M, 100M
Experiments

- Value profiling
  - Typical overhead ~100X
  - Accuracy measured by difference in invariance (defined below)
- Path profiling
  - Typical overhead 50% - 10X
  - Accuracy measured by percent of hot paths detected (2% threshold)
- All experiments use the SPEC2000 INT benchmarks with the "ref" data set
  - Arithmetic mean of 3 runs presented
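The invariance metric here is presumably the standard value-profiling one: per value site, the fraction of dynamic occurrences taken by the most frequent value; the reported error is then the difference between the shadow profile's invariance and the full profile's, averaged over sites. A sketch of that assumed definition:

    \[
    \mathrm{Inv}(s) = \frac{\max_{v} \mathrm{count}_s(v)}{\sum_{v} \mathrm{count}_s(v)},
    \qquad
    \Delta_{\mathrm{inv}} = \frac{1}{|S|} \sum_{s \in S}
        \left| \mathrm{Inv}_{\mathrm{shadow}}(s) - \mathrm{Inv}_{\mathrm{full}}(s) \right|
    \]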
Results - Value Profiling Overhead

- Overhead versus native execution
- Several configurations less than 1%
- Path profiling exhibits similar trends

[Chart: value profiling overhead (0%-9%) versus load (0.063-2), one series each for 1M, 10M, and 100M instruction samples]
Results - Value Profiling Accuracy

- All configurations within 7% of a perfect profile
- Lower is better

[Chart: difference in invariance (0%-8%) versus load (0.063-2), one series each for 1M, 10M, and 100M instruction samples]
Results - Path Profiling Accuracy
- Most configurations over 90% accurate
- Higher is better
- Some benchmarks (e.g., 176.gcc, 186.crafty, 187.parser) have millions of paths, but few are "hot"

[Chart: path profiling accuracy (78%-96%) versus load (0.06-2), one series each for 1M, 10M, and 100M instruction samples]
Results - Page Fault Increase

- Proportional increase in page faults (shadow/native)

[Chart: increase in page faults (0-200X, shadow relative to native) versus load (0.06-2), one series each for 1M, 10M, and 100M instruction samples]
Results - Page Fault Rate

- Difference in page faults per second experienced by the native application

[Chart: paging rate increase (0-12,000 page faults per second) versus load (0.063-2), one series each for 1M, 10M, and 100M instruction samples]
Future Work

- Improve stability for multithreaded programs
- Investigate the effects of different persistent code cache policies
- Compare sampling policies
  - Random (current)
  - Phase/event-based
  - Static analysis
  - Study convergence
- Apply the technique to
  - Profile-guided optimizations
  - Simulation techniques
Conclusion

- Shadow Profiling allows collection of bursts of detailed traces
  - Accuracy is over 90%
- Incurs negligible overhead
  - Often less than 1%
- With increasing numbers of cores, allows developers' focus to shift from profiling to applying optimizations