An Efficient Low-Power Instruction Scheduling Algorithm for Embedded Systems
Contents
• Introduction
• Bus Power Model
• Related Works
• Motivation
• Figure-of-Merit
• Algorithm Overview
• Random Scheduling
• Schedule Selection
• Experimental Result
• Conclusion
Introduction (1/3)
• The nomadic lifestyle is spreading rapidly these days thanks to
progress in microelectronics technologies.
• Electronic equipment has not only become smaller, it has become smarter.
• Low-power electronics will play a key role in this nomadic age.
• The figure-of-merit for the nomadic age
= (Intelligence) / (Size × Cost × Power).
Electronic Equipment toward Smaller Size
Nomadic Tool
Introduction (2/3)
• ASIPs combine high programmability with application-specific
hardware structures.
• Because ASIPs offer high configurability and productivity,
they have a time-to-market advantage.
• A retargetable compiler is an essential tool for application analysis
and code generation in the design of ASIPs.
• By equipping a retargetable compiler with an efficient
scheduling algorithm, low-power code can be generated.
Compiler-in-Loop Architecture Exploration
Introduction (3/3)
• The power consumption of the ASIP instruction memory was found to
be 30% or more of the entire processor power consumption.
• Minimizing power consumption on the instruction bus is therefore critical in low-power ASIP design.
Power Distribution for ICORE
Bus Power Model (1/2)
• Bit transitions on bus lines are among the major contributors
to power consumption.
• Traditional power models use only the self-capacitance model.
• With the advance of nanometer technologies, coupling capacitance has become significant.
• As a result, how to mitigate the crosstalk problem on buses has
become an important issue.
Self capacitance model
Self & coupling capacitance model
Bus Power Model (2/2)
• Crosstalk type
• Bus power model
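The self- and coupling-capacitance model can be sketched in code. A minimal sketch, assuming the energy of one bus cycle is proportional to the number of toggled lines (self term) plus the number of adjacent line pairs that switch in opposite directions (a worst-case crosstalk pattern); the C_c/C_s ratio below is an illustrative placeholder, not a value from the source:

```python
def bit(word, i):
    """Return bit i of word (line 0 = LSB)."""
    return (word >> i) & 1

def bus_transition_cost(prev, curr, width=8, coupling_ratio=2.0):
    """Relative energy cost of driving `curr` after `prev` on a bus.

    Self term: lines that toggle (self-capacitance model).
    Coupling term: adjacent line pairs switching in opposite
    directions (coupling-capacitance model).  `coupling_ratio`
    stands in for C_c/C_s and is purely illustrative.
    """
    self_cost = sum(bit(prev, i) != bit(curr, i) for i in range(width))
    coupling_cost = 0
    for i in range(width - 1):
        d1 = bit(curr, i) - bit(prev, i)
        d2 = bit(curr, i + 1) - bit(prev, i + 1)
        if d1 * d2 == -1:  # neighbours toggle in opposite directions
            coupling_cost += 1
    return self_cost + coupling_ratio * coupling_cost
```

Opposite transitions on neighbouring lines (e.g. `01 → 10`) are charged extra, which is exactly why scheduling and recoding that avoid such patterns save power.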
Instruction Recoding
• Instruction recoding analyzes the performance pattern of the
application program and reassigns the binary codes.
• Histogram graphs are used for analysis of application
performance pattern.
• Chattopadhyay et al. obtained initial solution using MWP and
applied simulated annealing with the initial solution.
Histogram Graph
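The histogram graph can be sketched as an edge-weighted graph that counts how often each pair of opcodes appears consecutively in the executed instruction stream; the opcode names below are illustrative:

```python
from collections import Counter

def build_histogram_graph(instruction_trace):
    """Count consecutive opcode pairs in an execution trace.

    An edge (a, b) with weight w means opcode `a` was immediately
    followed by opcode `b` w times; (a, a) is a self-loop edge.
    Recoding then assigns nearby binary codes to heavy edges so
    frequent transitions toggle few bus lines.
    """
    return Counter(zip(instruction_trace, instruction_trace[1:]))

trace = ["add", "mul", "add", "mul", "mul"]
g = build_histogram_graph(trace)
```

Here `g` maps `("add", "mul")` to 2, and the self-loop `("mul", "mul")` to 1.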
Cold Schedule
• C. Su et al. first proposed cold scheduling
– How is control dependency reflected in the SCG?
– Why MST and simulated annealing as post-processing?
– TSP is a better choice
Cold Schedule
• K. Choi et al. formulated cold scheduling as a TSP
problem – a reasonable approach
• C. Lee et al. extended cold scheduling to VLIW architectures
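Cold scheduling can be illustrated with a greedy nearest-neighbour heuristic, a cheap stand-in for the TSP formulation (not the authors' exact algorithm): among the instructions whose dependences are satisfied, always emit the one whose encoding has the smallest Hamming distance to the previously emitted encoding. The encodings and dependence sets below are illustrative:

```python
def hamming(a, b):
    """Number of differing bits between two encodings."""
    return bin(a ^ b).count("1")

def cold_schedule(encodings, deps):
    """Greedy cold scheduling over one basic block.

    `encodings[i]` is instruction i's binary encoding; `deps[i]` is
    the set of instructions that must precede i.  Nearest-neighbour
    is not optimal, but it keeps consecutive encodings close in
    Hamming distance, reducing instruction-bus transitions.
    """
    n = len(encodings)
    emitted, order, last = set(), [], 0
    while len(order) < n:
        ready = [i for i in range(n)
                 if i not in emitted and deps[i] <= emitted]
        best = min(ready, key=lambda i: hamming(encodings[i], last))
        order.append(best)
        emitted.add(best)
        last = encodings[best]
    return order
```

With independent instructions encoded `0b1111`, `0b0001`, `0b0011`, the heuristic emits them in the order `[1, 2, 0]`, toggling at most two bus lines per step.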
Motivation (1/2)
                         | Recoding                   | Cold-Scheduling
Input                    | Instruction binary format  | Instruction sequence
Output                   | Recoded instruction binary | Instruction order
Optimization Scope       | Global                     | Local
Considered Inst. Field   | Partial field              | All fields
Comparison between Recoding and Cold Scheduling
Motivation (2/2)
(a) Different Scheduling Results
(b) Constructed Histogram Graphs
(c) Optimal Recoding Results
Figure-of-Merit
• Maximizing the variance of transition edge weights increases the
efficiency of recoding.
• The larger the sum of self-loop edge weights, the greater will be
the power saving effect of a code sequence.
Figure-of-merit
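One plausible reconstruction of the figure-of-merit from the two statements above (the weights $\alpha$ and $\beta$ are illustrative assumptions, not taken from the source):

$$\mathrm{FM} \;=\; \alpha \cdot \mathrm{Var}\bigl(\{\, w_e \mid e \in E_{\text{trans}} \,\}\bigr) \;+\; \beta \sum_{e \in E_{\text{self}}} w_e$$

where $E_{\text{trans}}$ are the transition edges and $E_{\text{self}}$ the self-loop edges of the global histogram graph: a large weight variance makes recoding effective, and heavy self-loops cost no bus transitions at all.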
Algorithm Overview
• The presented FM is a global function
• Global instruction scheduling is difficult to implement
• We solve the optimization problem using random schedule
gathering followed by schedule selection
Algorithm flow: Random Scheduling → Schedule Selection
Random Scheduling
Make_Schedules_for_BBs (BB_SET[ ])
begin
  for each BB in BB_SET[ ] do
  begin
    list_schedule_solution = LIST_SCHEDULE (BB);
    latency_UB = LATENCY (list_schedule_solution);
    Insert list_schedule_solution into Schedules_for_BBs[BB];
    for i = 0 until ITERATION_COUNT (BB) do
    begin
      new_schedule = RANDOM_SCHEDULE (BB);
      acceptable = False;
      if (LATENCY (new_schedule) <= latency_UB) then
      begin
        acceptable = True;
        for each schedule solution s in Schedules_for_BBs[BB] do
          if (LATENCY (s) == LATENCY (new_schedule)) then
          begin
            similarity_measure = COMPARE (s, new_schedule);
            if (similarity_measure > Threshold * LATENCY (new_schedule)) then
            begin
              acceptable = False;
              break;
            end
          end
        if (acceptable) then
          Insert new_schedule into Schedules_for_BBs[BB];
      end
    end
  end
  return Schedules_for_BBs[ ];
end
• Considerations
  – Runtime performance
  – BB size and iteration count
  – Differences (similarity) between random schedules
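The gathering loop above can be exercised with a runnable sketch. The latency model (schedule length), the random scheduler (a plain permutation that ignores dependences), and the similarity measure (number of agreeing positions) are deliberately simplified stand-ins; real versions come from the target's dependence graph and pipeline model:

```python
import random

def gather_schedules(block, iterations=200, threshold=0.5, seed=0):
    """Collect diverse random schedules of one basic block.

    Stand-ins: latency = schedule length (plays the role of the
    pipeline model); a random permutation plays RANDOM_SCHEDULE;
    similarity counts positions where two schedules agree.
    """
    rng = random.Random(seed)
    latency = len  # placeholder latency model

    def similar(a, b):
        return sum(x == y for x, y in zip(a, b))

    baseline = sorted(block)       # plays the role of list scheduling
    latency_ub = latency(baseline)
    kept = [baseline]
    for _ in range(iterations):
        cand = block[:]
        rng.shuffle(cand)
        if latency(cand) > latency_ub:
            continue               # reject slower schedules
        if all(similar(s, cand) <= threshold * latency(cand)
               for s in kept if latency(s) == latency(cand)):
            kept.append(cand)      # sufficiently different: keep it
    return kept
```

Duplicates of an already-kept schedule fail the similarity test, so `kept` grows only with genuinely different orderings, which is what gives the later selection phase room to optimize the global figure-of-merit.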
Schedule Selection (1/3)
• Problem formulation
• The global histogram graph can be decomposed into local histograms
• Hence we can consider a divide-and-conquer algorithm
Merge of Histogram Graphs
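Read in reverse, the decomposition is a merge: the global histogram graph is the edge-wise sum of the per-basic-block histograms. The plain sum below is a simplification; a real flow would first weight each block by its execution frequency:

```python
from collections import Counter

def merge_histograms(local_histograms):
    """Edge-wise sum of per-basic-block histogram graphs.

    Each local histogram maps (opcode_a, opcode_b) edges to counts;
    the global graph is their sum over all basic blocks.
    """
    total = Counter()
    for h in local_histograms:
        total.update(h)  # Counter.update adds counts edge-wise
    return total
```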
Schedule Selection (2/3)
• NP-hardness
  – To maximize the global variance, we must consider not only the
sum of the local variances but also the covariance of every pair
of local histograms
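The covariance claim follows from the standard variance decomposition for local contributions $X_1, \dots, X_n$:

$$\mathrm{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) \;=\; \sum_{i=1}^{n} \mathrm{Var}(X_i) \;+\; 2 \sum_{i<j} \mathrm{Cov}(X_i, X_j)$$

so the blocks cannot be optimized independently: the covariance terms couple every pair of local histograms, which is the source of the hardness.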
Schedule Selection (3/3)
• We used a dynamic programming method to achieve local
cost maximization via a bottom-up approach
• For further optimization, we applied simulated annealing
Greedy Selection Algorithm
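The two-phase selection can be sketched as follows: a greedy bottom-up pass picks one schedule per basic block by local cost, then simulated annealing perturbs single-block picks against a global objective. The figure-of-merit used below (variance of the merged edge weights) and the cooling schedule are simplified stand-ins, not the paper's exact FM or annealing parameters:

```python
import math
import random
from collections import Counter

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def fm(histograms, picks):
    """Stand-in global FM: variance of merged edge weights."""
    merged = Counter()
    for hs, p in zip(histograms, picks):
        merged.update(hs[p])
    return variance(list(merged.values()))

def select_schedules(histograms, steps=500, temp=1.0, seed=0):
    """Pick one schedule per basic block to maximize the global FM.

    `histograms[b][s]` is the local histogram of candidate schedule s
    for block b.  Greedy initialization by local variance, then
    simulated annealing over single-block re-picks.
    """
    rng = random.Random(seed)
    picks = []
    for hs in histograms:  # greedy bottom-up: best local cost per block
        picks.append(max(range(len(hs)),
                         key=lambda s: variance(list(hs[s].values()) or [0])))
    best, best_fm = picks[:], fm(histograms, picks)
    cur_fm = best_fm
    for step in range(steps):
        t = temp * (1 - step / steps) + 1e-9  # linear cooling (illustrative)
        b = rng.randrange(len(histograms))
        old = picks[b]
        picks[b] = rng.randrange(len(histograms[b]))
        new_fm = fm(histograms, picks)
        if new_fm >= cur_fm or rng.random() < math.exp((new_fm - cur_fm) / t):
            cur_fm = new_fm
            if new_fm > best_fm:
                best, best_fm = picks[:], new_fm
        else:
            picks[b] = old  # revert rejected move
    return best, best_fm
```

Tracking the best assignment separately from the annealing state means the greedy solution is never lost, matching the idea of annealing as a refinement on top of the bottom-up pass.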
Experimental Result
• We used PCC as our measure of performance.
Comparison of PCC Values
Conclusion
• We presented a new instruction scheduling algorithm
for low-power code synthesis
• It is an exhaustive method for generating low-power
code in application-specific domains
• However, advances in computing power make our method
practical