Compiler-directed Synthesis of Multifunction Loop Accelerators

Dynamic Acceleration of Multithreaded Program
Critical Paths in Near-Threshold Systems
Hyoun Kyu Cho and Scott Mahlke
University of Michigan, Ann Arbor
December 2, 2012
1
University of Michigan
Electrical Engineering and Computer Science
Critical Path
• Longest path between source and sink in DAG
2
University of Michigan
Electrical Engineering and Computer Science
Critical Path
3
2
5
2
5
3
10
14
2
2
5
9
3
3
2
[Saidi`08]
3
University of Michigan
Electrical Engineering and Computer Science
Critical Path for Multithreaded Programs
Call
Call
Call
StartLock
ArBarrier
Call
Call
ArBarrier
ArBarrier
Unlock
EndLock
T1
LvBarrier
LvBarrier
T2
T1
(a) Mutex Lock
LvBarrier
T3
T2
(b) Barrier
[Hollingsworth`98]
4
University of Michigan
Electrical Engineering and Computer Science
Scalability of Multithreaded Programs
Some benchmarks does not scale very well!
5
University of Michigan
Electrical Engineering and Computer Science
CPU Time Wasted on Synchronizations
Synchronization is major bottleneck!
6
University of Michigan
Electrical Engineering and Computer Science
Arrival Time Variation
7
University of Michigan
Electrical Engineering and Computer Science
Accelerating Critical Path
• ACS [Suleman et al. ASPLOS `09]
– Critical sections
• Voltage Boosting [Dreslinski `11]
– Transactional bottlenecks
• Booster [Miller et al. HPCA `12]
– Alleviate performance variation
– Reactive acceleration for barriers
8
University of Michigan
Electrical Engineering and Computer Science
Challenges and Opportunities of NTC
• Poor single thread performance
• Very sensitive to process variation
– Running at the slowest one leads to severe loss
– Likely to have performance heterogeneity
• Potential for bigger frequency boosting
9
University of Michigan
Electrical Engineering and Computer Science
Objectives
• Systematic way of identifying critical paths
• Dealing with performance variation
• Flexible control of core boosting
10
University of Michigan
Electrical Engineering and Computer Science
System Architecture
offline
instrumentation
Instrumented
Executable
Compilation
Target
Program
online
Intermediate
Representation
Priority
Weighted
Probabilistic
Priority
Scheduler
Monitor
Observe
Schedule
Adjust
Priority
Parallelism
Analysis
Monitoring
Logic
11
University of Michigan
Electrical Engineering and Computer Science
Lottery Scheduling
• Each thread holds a number of tickets
• Scheduler select fast mode thread by picking a ticket
• Efficient implementation of proportional-share resource
management
• Responsive, flexible control over relative execution rate
total = 20
random [0 .. 19] = 15
10
∑ = 10
∑ > 15?
no
2
5
∑ = 12
∑ > 15?
no
12
1
∑ = 17
∑ > 15?
yes
2
[Waldspurger`94]
University of Michigan
Electrical Engineering and Computer Science
Progress Monitoring
• For data parallel threads
• Slower threads are more likely to be in critical path
• Divide task into multiple smaller chunks and instrument
monitoring code
• Monitoring code reduce number of tickets
13
University of Michigan
Electrical Engineering and Computer Science
Example of Progress Monitoring
…
pthread_barrier_wait(barrier);
long PROGRESS_GRANULE = (k2 – k1) / NUM_STEPS;
for ( i = k1 ; i < k2 ; i++ ) {
float x_cost = dist(points->p[i],points->p[x],points->dim) * points->p[i].weight;
float current_cost = points->p[i].cost;
if ( x_cost < current_cost ) {
switch_membership[i] = 1;
cost_of_opening_x += x_cost – current_cost;
} else {
int assign = points->p[i].assign;
lower[center_table[assign]] += current_cost – x_cost;
}
if ( (i – k1) % PROGRESS_GRANULE == 0 )
halve_priority_tickets();
}
pthread_barrier_wait(barrier);
…
Loop Body
14
University of Michigan
Electrical Engineering and Computer Science
Priority Delegation
• Thread holding a mutex is likely to be in critical path
– Increase tickets when acquire mutex
• More likely to be in critical path if other threads are
waiting
– Temporarily transfer waiting thread’s ticket to the thread
holding mutex
15
University of Michigan
Electrical Engineering and Computer Science
Performance Evaluation
• Post processing traces
– Generated on 32-core machine
• Four 8-core Intel Xeon X7560
• 24MB L3 cache per chip
• 32GB Total memory
– Augmented progress time indication
• 1.5x, 2x, 5x, 10x acceleration for 1 fast mode core
• Varying scheduling quantum from 1us to 1ms
16
University of Michigan
Electrical Engineering and Computer Science
Speedup for Streamcluster
H/W
User OS
mode
17
University of Michigan
Electrical Engineering and Computer Science
Current Status
instrumentation
Instrumented
Executable
Compilation
Target
Program
Monitor
Observe
Schedule
Adjust
Priority
Parallelism
Analysis
Monitoring
Logic
Cores
Intermediate
Representation
Priority
Weighted
Probabilistic
Priority
Scheduler
18
Normal
Normal
Turbo
Turbo
University of Michigan
Electrical Engineering and Computer Science
Conclusion & Future Work
• Introduce S/W framework to improve multithreaded
programs’ performance using core boosting
• Combines static analysis, dynamic monitoring, and
probabilistic priority scheduling to predict critical paths
• Shows 5% ~ 27% performance improvement for
streamcluster
• Better model the tradeoff between performance and energy
• Predicting critical paths on other type of parallelism
19
University of Michigan
Electrical Engineering and Computer Science
Thank you!
20
University of Michigan
Electrical Engineering and Computer Science