Code Generation
  Use registers during execution
    Whenever possible, perform computation in registers
    Memory loads and stores are much more expensive
  Need to determine the best register allocation
    For a given number of registers, minimize the number of spills
    Spill: when we run out of registers, store some register values to memory
  Need to determine the best order of instruction execution
    To satisfy the register allocation decision (which may be suboptimal)
    To reduce the number of instructions
  Instruction selection
    Map the intermediate code to the set of machine instructions that minimizes the cost of execution
  Peephole optimization

Code Generation
  Various methods for register allocation and instruction scheduling
  Tree
    Achieves optimal register allocation and instruction scheduling
  DAG (directed acyclic graph)
    Achieves local common subexpression elimination (optimal)
    Optimal register allocation and instruction scheduling is NP-hard
    Heuristic algorithms
  Global
    Global register allocation
    Has no corresponding scheduling algorithm; just follows the original instruction order

Tree Based Approach for a Basic Block
  Basic block:
    t1 := a + b
    t2 := c * d
    t3 := e + f
    t4 := t2 + t3
    y := t1 * t4
  Assumptions:
    The system has only two registers, r0 and r1
    Only y is alive at the exit of the block
    Instruction format "op reg, reg, reg/mem": the first reg receives the result
      E.g., "– a b c" means a := b – c
  Direct translation, one statement at a time:
    load r0, a
    add r0, r0, b
    store t1, r0
    load r0, c
    mul r0, r0, d
    store t2, r0
    load r0, e
    add r0, r0, f
    store t3, r0
    load r0, t2
    add r0, r0, t3
    store t4, r0
    load r0, t1
    mul r0, r0, t4
    store y, r0
  15 instructions; 10 loads (memory reads), 5 stores
  Can we use the registers more effectively?

  [Figure: expression tree for the block. Root * (y) has children + (t1, with leaves a and b) and + (t4); t4 has children * (t2, with leaves c and d) and + (t3, with leaves e and f).]

  Keeping intermediate results in registers, in the original statement order:
    load r0, a
    add r0, r0, b
    load r1, c
    mul r1, r1, d
    store t1, r0
    load r0, e
    add r0, r0, f
    add r0, r1, r0
    load r1, t1
    mul r0, r1, r0
    store y, r0
  After t2 is computed, t1 (r0) and t2 (r1) are both still needed
    But there are no more registers to compute t3
    Have to spill (choose r0)
    Need to load t1 back into r1 later
  11 instructions; 7 loads, 2 stores (1 spill)

  Evaluating t4 first and t1 last:
    load r0, c
    mul r0, r0, d
    load r1, e
    add r1, r1, f
    add r1, r0, r1
    load r0, a
    add r0, r0, b
    mul r0, r0, r1
    store y, r0
  9 instructions; 6 loads, 1 store (0 spills)
  Optimal!
  Can we always achieve optimal execution?
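The instruction counts quoted for the three translations can be checked mechanically. The following is a minimal sketch of a two-register machine simulator in Python. The `run` helper, the sample operand values, and the convention that every memory operand read counts as one load are assumptions made for illustration; they are not part of the original slides.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}
REGISTERS = {"r0", "r1"}  # the example assumes only two registers

NAIVE = """
load r0, a
add r0, r0, b
store t1, r0
load r0, c
mul r0, r0, d
store t2, r0
load r0, e
add r0, r0, f
store t3, r0
load r0, t2
add r0, r0, t3
store t4, r0
load r0, t1
mul r0, r0, t4
store y, r0
"""

ONE_SPILL = """
load r0, a
add r0, r0, b
load r1, c
mul r1, r1, d
store t1, r0
load r0, e
add r0, r0, f
add r0, r1, r0
load r1, t1
mul r0, r1, r0
store y, r0
"""

NO_SPILL = """
load r0, c
mul r0, r0, d
load r1, e
add r1, r1, f
add r1, r0, r1
load r0, a
add r0, r0, b
mul r0, r0, r1
store y, r0
"""

def run(program, memory):
    """Execute the listing; return (value of y, memory reads, memory writes)."""
    mem, regs = dict(memory), {}
    reads = writes = 0
    for line in program.strip().splitlines():
        op, rest = line.split(None, 1)
        args = [a.strip() for a in rest.split(",")]
        if op == "store":                 # store mem, reg
            mem[args[0]] = regs[args[1]]
            writes += 1
            continue
        values = []
        for operand in args[1:]:          # source operands: register or memory
            if operand in REGISTERS:
                values.append(regs[operand])
            else:
                values.append(mem[operand])
                reads += 1
        regs[args[0]] = values[0] if op == "load" else OPS[op](*values)
    return mem["y"], reads, writes

MEMORY = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7}  # arbitrary test values
for name, program in [("naive", NAIVE), ("one spill", ONE_SPILL), ("no spill", NO_SPILL)]:
    print(name, run(program, MEMORY))
```

All three listings should produce the same value of y; only the number of memory reads and writes differs, which is the point of the comparison.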
Tree based Register Allocation and Scheduling
  Construct the execution tree for a basic block
  Label the tree to obtain the register requirements
    Depth first labeling
    L(leaf) = 1 if it is an identifier
    L(leaf) = 0 if it is a constant
    L(nonleaf node) =
      If L(left child) = L(right child) then
        o L(current node) := L(left child) + 1
      Otherwise
        o L(current node) := max (L(left child), L(right child))
    Assumption: from here onwards, 3 address code needs to be of the form: op reg reg reg
  Assign registers and generate code
    Register allocation
    Instruction scheduling follows the register allocation algorithm

Tree based Register Allocation and Scheduling
  Register allocation and instruction scheduling
    The process starts from the root and recursively goes to the leaf nodes
    Each node N carries a mark lr(N): whether N is the left or the right child of its parent
    [Figure: tree for t1 := a – b, with the node "– t1" above leaves a and b]
  For each non-leaf node N with mark lr(N), where the operation op is binary
    Assume
      o The registers allocated to N are R[b+1] to R[b+k] (k registers)
    If L(left) = L(right)
      o Go to left, pass lr = left, use registers R[b+1] to R[b+k–1], store result in R[b+1]
      o Go to right, pass lr = right, use registers R[b+2] to R[b+k], store result in R[b+k]
      o If lr(N) = left gen "op R[b+1] R[b+1] R[b+k]"; else gen "op R[b+k] R[b+1] R[b+k]"
    If L(left) < L(right) (assume: left needs m registers, right needs k registers, k > m)
      o Go to right, pass lr = right, use registers R[b+1] to R[b+k], store result in R[b+k]
      o Go to left, pass lr = left, use registers R[b+1] to R[b+m], store result in R[b+1]
      o If lr(N) = left gen "op R[b+1] R[b+1] R[b+k]"; else gen "op R[b+k] R[b+1] R[b+k]"
    The case L(left) > L(right) is symmetric: go to the left child first
  For each non-leaf node N with mark lr(N), where the operation op is unary
    Assume
      o The allocated registers are R[b+1] to R[b+k] (k registers)
    If lr(N) = left then go to the child
      o Pass lr(N), and pass registers R[b+1] to R[b+k] to the child
      o Generate code: "op R[b+1] R[b+1]"
    If lr(N) = right then go to the child
      o Pass lr(N), and pass registers R[b+1] to R[b+k] to the child
      o Generate code: "op R[b+k] R[b+k]"
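The labeling rule for register requirements can be sketched in a few lines of Python. The `Node` class and the `label` function are hypothetical names introduced for illustration. The slides do not state a label for unary nodes, so the sketch assumes a unary node needs as many registers as its child, which matches the unary case passing all of its registers down.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                          # operator symbol, identifier, or constant text
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    is_constant: bool = False

def label(node: Node) -> int:
    """Number of registers needed to evaluate the subtree rooted at node."""
    if node.left is None and node.right is None:
        return 0 if node.is_constant else 1       # leaf: constant or identifier
    if node.right is None:
        return label(node.left)                   # unary node (assumption, see above)
    left, right = label(node.left), label(node.right)
    return left + 1 if left == right else max(left, right)

# The running example: y := (a + b) * ((c * d) + (e + f))
t1 = Node("+", Node("a"), Node("b"))
t2 = Node("*", Node("c"), Node("d"))
t3 = Node("+", Node("e"), Node("f"))
t4 = Node("+", t2, t3)
y = Node("*", t1, t4)

print(label(t1), label(t4), label(y))
```

Under the "op reg reg reg" assumption every identifier leaf needs a register of its own, so the example tree needs three registers rather than the two that sufficed when one operand could come from memory.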
  Leaf node x, where x is an identifier
    Assume: the allocated register is R[b+1]
    Generate code: "load R[b+1] x"

Tree based Register Allocation and Scheduling
  Compute register requirement
  Assign registers
  Generate code
  [Figure: the example tree annotated with labels and registers. Root * has label 3 and registers (r1, r2, r3). Its left child + (a, b) has label 2 and registers (r1, r2), with a in r1 and b in r2. Its right child + has label 3 and registers (r1, r2, r3); below it, * (c, d) has label 2 and registers (r1, r2), and + (e, f) has label 2 and registers (r2, r3).]
  Generated code:
    load r1, c
    load r2, d
    mul r1, r1, r2
    load r2, e
    load r3, f
    add r3, r2, r3
    add r3, r1, r3      (now r1, r2 are available)
    load r1, a
    load r2, b
    add r1, r1, r2
    mul r1, r1, r3

Global Register Allocation
  Basic approach
    Global liveness analysis
    Build the interference graph
    Graph coloring
      N colors, where N is the number of available registers
    If N-coloring is not possible
      Insert spill code into the program

Global Register Allocation
  Block level liveness analysis
  [Figure: control flow graph annotated with live sets]
    {b,c,f}  a := b+c  {a,c,f}  d := –a  {c,d,f}  e := d+f  {c,d,e,f}
    {c,d,e,f}  b := d+e  {b,c,e,f}  e := e–1  {b,c,f}
    {c,e}  f := 2*e  {c,f}
    {c,f}  b := f+c  {b}
    print(b)
  Assumption: {b} is the live set of the next block

Global Register Allocation
  Build the interference graph
    Shows which variables interfere with each other
    Principle: two variables that are alive simultaneously interfere
      They cannot be allocated to the same register
    [Figure: x is alive from its definition (x = ?) to its use (? = x); the live set {x, …} holds in between]
  Register interference graph:
    One vertex for each variable in the graph
    At each point p in the CFG
      L is the live set at p
      If two variables x and y are in L together, x should not get the same register as y
      Add an edge (x, y)

Global Register Allocation
  Build the interference graph: example
  [Figure: the CFG with live sets from the previous slide, and the resulting interference graph on the nodes a, b, c, d, e, f]

Global Register Allocation
  Graph coloring to decide register allocation
    Color the interference graph so that no two adjacent nodes have the same color
    Graph is k-colorable:
      Implies we can use k registers without needing to spill
    Deciding whether a graph is k-colorable is NP-complete
  If there are k registers available
    We do not care whether the graph is k-colorable; we can only use k registers anyway
    When coloring is not possible, spill

Global Register Allocation
  Coloring the graph with k colors
    Reasoning: if there exists a node x with fewer than k neighbors, then no matter how the neighbors are colored, there is a different color that x can use
  Heuristic approach (this step is also called simplification)
    Pick a node x with fewer than k edges
    Put x on a stack (to keep track of the coloring order)
    Remove x and its edges from the interference graph
      If the resulting graph is k-colorable then so is the original graph
    Repeat until only one node is left
    When there is no node with < k edges
      The algorithm fails

Global Register Allocation
  Color selection
    Start from the last node added to the stack
      The nodes removed later have more edges, and their colors should be decided first
    For each node, pick a color that is different from its neighbors' colors
      It is always possible to get a color
      This follows from how the node was removed

Global Register Allocation
  Coloring the example graph with 4 colors
  Simplification step
    a, b, d have < 4 edges. Choose a
    b, d have < 4 edges. Choose d
    Now all nodes have < 4 edges; remove them in arbitrary order
    [Figure: the graph after each removal and the stack, with a at the bottom and d above it]
  Selection step
    [Figure: nodes popped from the stack and colored one by one]

Global Register Allocation
  Coloring the example graph with 3 colors
    After removing a, no node has < 3 edges
    The algorithm fails!

Global Register Allocation
  Coloring algorithm failure (for k colors)
    Does not imply that it is impossible to color with k colors
    Always try to color anyway
  Example: color the graph with 3 colors
    Color the node with the highest degree first.
    The remaining nodes have the same degree. Choose any to color.
    [Figure annotations: "After removing a, no node has < 3 edges. Algorithm fails!"; "a had degree 2, no problem to color!"; "Still can find a color for this node!" (on two nodes)]

Spill
  When no way is found to color with k colors
    Choose one node to spill
    Continue to spill if necessary, until a node can be removed
  For each spilled node
    For each definition, store the value
    For each use, load the value
    Wherever the value is loaded, a register is needed anyway
  Naive approach
    Always keep extra registers for shuffling data in and out
    What a waste!
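The interference-graph construction and the simplification and selection steps described for graph coloring can be sketched as follows. The live sets are the ones listed for the example CFG; the function names and the alphabetical tie-breaking rule are assumptions made for illustration (the slides allow any node with fewer than k edges to be chosen).

```python
from itertools import combinations

# Live sets from the example CFG, one string per program point.
LIVE_SETS = ["bcf", "acf", "cdf", "cdef", "ce", "bcef", "cf", "b"]

def build_interference_graph(live_sets):
    """Variables that appear together in some live set get an edge."""
    graph = {v: set() for live in live_sets for v in live}
    for live in live_sets:
        for x, y in combinations(live, 2):
            graph[x].add(y)
            graph[y].add(x)
    return graph

def simplify_and_select(graph, k):
    """Return {node: color} using k colors, or None if simplification gets stuck."""
    work = {node: set(adj) for node, adj in graph.items()}
    stack = []
    while work:
        candidates = sorted(n for n in work if len(work[n]) < k)
        if not candidates:
            return None                  # no node with < k edges: a spill is needed
        node = candidates[0]             # tie-break alphabetically (assumption)
        stack.append(node)
        for neighbor in work.pop(node):
            work[neighbor].discard(node)
    colors = {}
    while stack:                         # selection: last removed is colored first
        node = stack.pop()
        used = {colors[n] for n in graph[node] if n in colors}
        colors[node] = min(c for c in range(k) if c not in used)
    return colors

GRAPH = build_interference_graph(LIVE_SETS)
print(simplify_and_select(GRAPH, 4))     # a valid assignment with 4 registers
print(simplify_and_select(GRAPH, 3))     # None: stuck after removing a
```

With k = 3 the sketch stops at the same point as the slides: once a is removed, every remaining node has at least three edges. In this particular graph b, c, e and f are all live together, so a spill is unavoidable with three registers.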
  Rewrite code
    Use a new temporary variable for each load; it will have a very short life and is likely to have very few edges
    Redo liveness analysis and register allocation

Spill
  Consider the example we gave earlier
    Cannot find a way to color with k = 3, so spill
    After removing a, no node has < 3 edges
    Choose to spill c
    Remove c
    Once c is spilled, coloring can be done
    Nothing else to spill
  [Figure: the interference graph on a, b, c, d, e, f before and after removing c]

Spill
  After the spill, rewrite the code and redo the allocation
  [Figure: rewritten CFG with live sets, next to the original CFG]
    store c
    {b,f}  t1 := load c  {b,f,t1}  a := b+t1  {a,f}  d := –a  {d,f}  e := d+f  {d,e,f}
    {d,e,f}  b := d+e  {b,e,f}  e := e–1  {b,f}
    {e}  f := 2*e  {f}
    {f}  t2 := load c  {f,t2}  b := f+t2  {b}
    print(b)
  [Figure: new interference graph on a, b, d, e, f, t1, t2]

Spill
  Redo the coloring for the new interference graph
    With k = 3, the graph can be colored
  When generating code
    Load c in the top block into t1's register
    Load c in the bottom block into t2's register

Pre-Color
  Some variables are pre-assigned to registers
    E.g., in C, it is possible to define register variables
      register int i
  Handle pre-assigned registers
    If the system has k registers available, and x variables are pre-assigned, then only use k–x registers for the other variables
      But this is wasteful; these registers can be reused
    Perform coloring on the IG (interference graph)
      Still using k colors
      Pre-color the variables that are pre-assigned to registers
      In the simplification phase, these nodes cannot be removed
      The simplification phase terminates when only pre-colored nodes are left
      In the selection phase, do not change the colors of pre-colored nodes

Pre-Color
  Assume that a is pre-assigned to a register
  Consider k = 4
    Pre-color a
    Terminate when only a is left
  [Figure: the example interference graph with a pre-colored, and the stack holding b, d, e, c, f]

Coalescing
  When no way is found to color with k colors
    Try coalescing before trying to spill!
  When there are copy statements, x := y, coalesce x and y
    Assign x and y to the same register
  Advantage
    Reduces unneeded copying
    Saves a register
  Requirement
    Assume no dead code
    x and y are not interfering, i.e., not connected in the IG

Coalescing
  Example (statement, then the live set after it)
    {in: j, k}
    g := M[j+12]     g, j, k
    h := k – 1       g, h, j
    f := g + h       f, j
    e := M[j+8]      e, f, j
    m := j + f       e, j, m
    b := M[j]        b, e, m
    c := e + 8       b, c, m
    d := c           b, d, m
    k := m + 4       b, d, k
    j := b           {out: d, j, k}
  [Figure: interference graph on b, c, d, e, f, g, h, j, k, m, with copy links between c and d and between b and j]
  Coloring with 3 colors
    Need to spill
  Coloring with 4 colors
    Cannot color with 3 colors
    Need to use 4 colors (4 registers)

Coalescing
  Coalescing c,d (non-interfering)
  Coalescing b,j (non-interfering)
  [Figure: the interference graph after each merge, with combined nodes "c,d" and "b,j"]
  Coloring with 3 colors
    Simplification
    Selection
      r1: f, h, k, m
      r2: c,d, e, g
      r3: b,j
  Resulting code
    {in: j, k}         r3 := load j
                       r1 := load k
    g := M[j+12]       r2 := M[r3+12]
    h := k – 1         r1 := r1 – 1
    f := g + h         r1 := r2 + r1
    e := M[j+8]        r2 := M[r3+8]
    m := j + f         r1 := r3 + r1
    b := M[j]          r3 := M[r3]
    c := e + 8         r2 := r2 + 8
    d := c             r2 := r2
    k := m + 4         r1 := r1 + 4
    j := b             r3 := r3
    {out: d, j, k}     store r2, d
                       store r3, j
                       store r1, k
  Coalescing saved two copy statements and 1 register

Coalescing
  Another example (statement, then the live set after it)
    {in: a, c}
    v := a + c       a, c, v
    t := v * c       a, c, t
    v := t * a       c, v
    b := v           b, c
    t := M[b]        b, c, t
    u := b + c       t, u
    w := t * u       {out: w}
  Try to color with 3 colors
  [Figure: interference graph on a, b, c, t, u, v, w]
  Coalescing b,v (non-interfering)
  Coloring with 3 colors after coalescing b,v
    After removing u, there is no node with degree < 3. Need to spill!
    Coalescing increases the degree of the coalesced node and can make the graph impossible to simplify
    Only coalesce when the degrees of the nodes are not increased
    But sometimes coalescing increases the degree of some nodes and still ends up saving registers

Global Register Allocation
  Code generation
    For each statement
      Replace variables by registers
      If a variable comes from outside, it should be loaded into the register first
    For the spilled variables
      Load into reserved registers if the rewrite-code approach is not used
    Store the live variables
      No need to store temporary variables
      Variables that are alive after the CFG should be stored to memory

DAG Based Instruction Scheduling
  Dag
    Used for subexpression elimination
    Used to eliminate duplicate variables
  Code:
    a := b – c
    b := a + d
    d := b – c
    a := a * d
    b := b – c
    e := a * b
  [Figure: dag for the code. a1 = b0 – c0; b1 = a1 + d0; the node "d1, b2" = b1 – c0; a2 = a1 * d1; e1 = a2 * d1]

DAG Based Instruction Scheduling
  How to generate code for dags
    Need to determine the schedule for executing the instructions
    Need to determine the register allocation
      Global register allocation
      Minimum register instruction sequence
    Based on the results, generate code

DAG Based Instruction Scheduling
  Algorithm for ordering nodes in a dag
    Start from the root
    Assign a node x an order number if
      All of x's parents already have a number
    After x has obtained a number
      Try to assign numbers to x's children
      If x's child y cannot be assigned an order number: no problem
        o y has at least one parent without an assigned number; when that parent gets its number, y will be examined for number assignment
    Since the dag is acyclic, all nodes will obtain a number and the order is correct

DAG Based Instruction Scheduling
  Ordering nodes in a dag: Example 1
    Order numbers: e1 = 1, a2 = 2, (d1, b2) = 3, b1 = 4, d0 = 5, a1 = 6, c0 = 7, b0 = 8
  Rewrite the code based on the dag (highest number first)
    load b
    load c
    a1 := b – c
    load d
    b1 := a1 + d
    d1 := b1 – c
    a2 := a1 * d1
    e1 := a2 * d1

DAG Based Instruction Scheduling
  Register allocation (statement, then the live set after it)
    load b, c          b, c
    a1 := b – c        a1, c
    load d             a1, c, d
    b1 := a1 + d       a1, b1, c
    d1 := b1 – c       a1, d1
    a2 := a1 * d1      a2, d1
    e1 := a2 * d1      e1
    … use e
  Need 3 registers
    r1: a1, a2, e1
    r2: c
    r3: b, d, b1, d1

DAG Based Instruction Scheduling
  Generate code based on
    The instruction sequence 8 7 6 5 4 3 2 1
    The register allocation
  Code:
    (8) load r3 b
    (7) load r2 c
    (6) sub r1 r3 r2
    (5) load r3 d
    (4) add r3 r1 r3
    (3) sub r3 r3 r2
    (2) mul r1 r1 r3
    (1) mul r1 r1 r3
        store e, r1

Tree Based Register Allocation
  Original code
    a := b – c
    b := a + d
    d := b – c
    a := a * d
    b := b – c
    e := a * b
  [Figure: the dag expanded into a tree, with the shared subexpressions duplicated and each node labeled with its register requirement]
  Does not really work

Simply Global Register Allocation
  Original code (statement, then the live set after it)
                       b, c, d
    a := b – c         a, c, d
    b := a + d         a, b, c
    d := b – c         a, b, c, d
    a := a * d         a, b, c
    b := b – c         a, b
    e := a * b         e
  Need to use 4 registers

Minimum Register Instruction Sequence
  Derive an instruction sequence whose register requirement is minimum
    Instructions with no data dependency can be rearranged
    But the MRIS problem is NP-hard; need to use heuristic algorithms
  Original schedule
    (a) t1 := load x
    (b) t2 := t1 + 4
    (c) t3 := t1 * 8
    (d) t4 := t1 - 4
    (e) t5 := t1 / 2
    (f) t6 := t2 * t3
    (g) t7 := t4 - t5
    (h) t8 := t6 * t7
    (i) store t8, y
    Needs 4 registers
  Properly rescheduled
    (a) t1 := load x
    (d) t4 := t1 - 4
    (e) t5 := t1 / 2
    (g) t7 := t4 - t5
    (c) t3 := t1 * 8
    (b) t2 := t1 + 4
    (f) t6 := t2 * t3
    (h) t8 := t6 * t7
    (i) store t8, y
    Only needs 3 registers
  [Figure: live range of each of t1 to t8 drawn next to both schedules]

Minimum Register Instruction Sequence
  Consider dag based MRIS
    Construct the dag for the example code (a) to (i) above
    [Figure: dag with root t8 (stored to y), children t6 and t7; t6 uses t2 and t3; t7 uses t4 and t5; t2, t3, t4, t5 all use t1, which is loaded from x; the constants 4, 8, 4, 2 are leaves]
  How to determine
    Register allocation
    Execution sequencing

Minimum Register Instruction Sequence
  Principles
    Nodes on a single path can share the same register
      E.g., t1, t2, t3 can share one register
        o r0 := r0 op x, r0 := r0 op y
      [Figure: path t1, t2, t3, where t2 also uses x and t3 also uses y; t1 is loaded from z and is also used by another node s1]
    New scheduling constraints
      Sharing a register may introduce new dependencies (execution orders) in the dag
      E.g., if t2 reuses t1's register, then all other operations that use t1 should be done first
        o E.g., s1 uses t1, so s1 should be evaluated before t2 overwrites t1's register
        o Otherwise, the value in t1's register is changed by t2 and t1 is lost
      This only happens for nodes with multiple parents (not for trees)
    A constant does not need any register

Minimum Register Instruction Sequence
  Path formation
    Number the nodes
      Number root nodes as 0
      The child node number = max (parent node number) + 1
    Find all paths until all nodes are covered
      Start from a largest-numbered node that is not covered
        o Keep going until reaching the end or a covered node
        o A node that is already in a path is marked as covered
      For an already covered node, still include it, but use ( ) to mark it
        - Because the register can only be released after ( ) is evaluated
        - It is needed for interference analysis
      If the node has multiple parents, choose the smallest-numbered parent node
        o Need to also create scheduling constraints

Minimum Register Instruction Sequence
  MRIS Algorithm
    Number the nodes
    Find all paths
      Largest uncovered node: t1
        t1 has multiple parents with the same number; choose t2
        P1 = [t1, t2, t6, t8, ()]
        t1 has multiple parents, and t1 and t2 share a register
          t3, t4, t5 have to be evaluated before t2
          Add scheduling constraints
          Redo numbering
      Largest uncovered node: t3, t4, t5; choose t3
        P2 = [t3, (t6)]
      Largest uncovered node: t4, t5; choose t4
        P3 = [t4, t7, (t8)]
      Largest uncovered node: t5
        P4 = [t5, (t7)]
      Largest uncovered node: none
    [Figure: the example dag with node numbers, updated after the scheduling constraints are added]

Minimum Register Instruction Sequence
  Interferences
    Live range of a variable x
      o The duration for which x is alive
      o Can be derived from liveness analysis
    When a path shares a register r, the live range for r spans the life of the entire path
      o r0 := r0 op x, r0 := r0 op y
        r0 is alive from the evaluation of t1 to t3
    If two variables have overlapping live ranges, they cannot share a register
    Before instructions are scheduled, the live ranges of variables are not determined
      But dependencies put constraints on live ranges
      Find out the live range constraints

Minimum Register Instruction Sequence
  Interferences
    Theorem 1
      Two paths: P = [u1, u2, …, um], Q = [v1, v2, …, vn]
      If u1 can reach vn
        Need to evaluate u1 before vn
      If v1 can reach um
        Need to evaluate v1 before um
      If both hold, the live ranges of P and Q have to overlap
        They cannot use the same register
      [Figure: two paths u1 … u5 and v1 … v3 with cross edges]
    Use this relation to construct the IG
      Not for the nodes but for the paths
      Essentially, for the registers (path = register)
    Edges due to scheduling constraints
      They enforce execution order; their impact on live ranges is the same as that of the other edges

Minimum Register Instruction Sequence
  Dag based MRIS
    P1 = [t1, t2, t6, t8, ()]
    P2 = [t3, (t6)]
    P3 = [t4, t7, (t8)]
    P4 = [t5, (t7)]
  Interferences based on Theorem 1
    P1 interferes with all other paths
      o t1 can reach all nodes
      o All nodes can reach t8
    P2 interferes with P3
      o t3 can reach t8, t4 can reach t6
    P2 does not interfere with P4
      o t5 can reach t6, but t3 does not reach t7
    P3 interferes with P4
      o t4 can reach t7, t5 can reach t8
  [Figure: the example dag with paths P1 to P4 marked, and the path interference graph]

Minimum Register Instruction Sequence
  How to schedule the paths
    The approach so far only finds the potential number of registers
    We still do not know how to schedule
  Path fusing
    A register can only be released when an entire path is done
      Done means the last node, in ( ), is evaluated
      The node in ( ) belongs to another path
      The result from the path can be stored in the register of that other path
    Check path pairs (x, y) to "fuse"
      If x can execute to completion before y starts, we say (x, y) can fuse
      Then x can release its register to y after it completes
    Which pairs to consider?
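The candidates can be found mechanically: Theorem 1 reduces path interference to two reachability queries on the dag. The sketch below is a minimal illustration in Python. The edge direction (operand to user), the explicit constraint edges t3, t4, t5 before t2, and all function names are assumptions chosen to mirror the example, not part of the original slides.

```python
from itertools import combinations

# Dag edges from an operand to the nodes that use it (example code (a) to (i)).
USES = {
    "t1": ["t2", "t3", "t4", "t5"],
    "t2": ["t6"], "t3": ["t6"],
    "t4": ["t7"], "t5": ["t7"],
    "t6": ["t8"], "t7": ["t8"],
    "t8": [],
}
# Scheduling constraints: t2 reuses t1's register, so t3, t4, t5 come first.
CONSTRAINTS = [("t3", "t2"), ("t4", "t2"), ("t5", "t2")]

# Each path is given as (first node, last node including the one in parentheses).
PATHS = {"P1": ("t1", "t8"), "P2": ("t3", "t6"), "P3": ("t4", "t8"), "P4": ("t5", "t7")}

def build_graph(uses, constraints):
    graph = {node: list(users) for node, users in uses.items()}
    for before, after in constraints:
        graph[before].append(after)      # constraint edges act like dag edges
    return graph

def reaches(graph, src, dst):
    """Depth-first search; a node trivially reaches itself."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

def interferes(graph, p, q):
    (u1, um), (v1, vn) = p, q
    return reaches(graph, u1, vn) and reaches(graph, v1, um)

GRAPH = build_graph(USES, CONSTRAINTS)
for a, b in combinations(sorted(PATHS), 2):
    verdict = "interfere" if interferes(GRAPH, PATHS[a], PATHS[b]) else "may share"
    print(a, b, verdict)
```

Only the pair that does not interfere is worth examining for fusing; every interfering pair is ruled out.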
      Eliminate impossible pairs
        o All the interfering pairs (for any of the reasons) cannot fuse

Minimum Register Instruction Sequence
  Dag based MRIS
    P1 = [t1, t2, t6, t8, ()]
    P2 = [t3, (t6)]
    P3 = [t4, t7, (t8)]
    P4 = [t5, (t7)]
  Find (x, y) to fuse
    Only P2 and P4 do not interfere
    Try (P2, P4)
      o t6 > t2 and t2 > t4 (">" meaning "is evaluated after")
      o Need to start P4 before evaluating t2 and t6
      o Not possible
    Try (P4, P2)
      o P4 can complete without starting P2
        - Completing P4 only needs P3 to be executed partially
      o Succeeded
      o Let P2 and P4 share a register
      o Add the scheduling constraint t7 < t3
  [Figure: the example dag with paths P1 to P4 marked]

Minimum Register Instruction Sequence
  Dag based MRIS
    Assign registers
      Assign according to the paths
      P1 and P3 each get one register
      P4 and P2 share one register
      Do not color the node in ( )
    Find the execution order for the dag
      Dag node ordering
    Code generation
      r1 := load x
      r3 := r1 / 2
      r2 := r1 – 4
      r2 := r2 – r3
      r3 := r1 * 8
      r1 := r1 + 4
      r1 := r1 * r3
      r1 := r1 * r2
      store y r1
  [Figure: the example dag with order numbers 1 to 8 and the registers r1, r2, r3 assigned to the paths]

Instruction Selection
  Some hardware provides a rich set of instructions
    Typically not RISC processors
    There are multiple ways to translate a set of instructions
  Example instruction set
    load r1, r2          r1 := M[r2]
    store r1, r2         M[r1] := r2
    add r1, r2           r1 := r1 + r2
    addc r1, c           r1 := r1 + c, where c is a constant
    mul r1, r2           r1 := r1 * r2
    mulc r1, c           r1 := r1 * c, where c is a constant
    movem r1, r2         M[r1] := M[r2]
    movex r1, r2, r3     M[r1] := M[r2+r3]

Instruction Selection
  Example program
    A[i+1] := B[j]
  Intermediate code
    t1 := j * 4
    t2 := B + t1
    t3 := M[t2]
    t4 := i + 1
    t5 := t4 * 4
    t6 := A + t5
    M[t6] := t3
  Assume: register ra holds the address of A
  Assume: register rb holds the address of B
  Assume: register ri holds the value of i
  Assume: register rj holds the value of j
  Translation using load and store
    mulc rj, 4
    add rb, rj
    load r1, rb
    addc ri, 1
    mulc ri, 4
    add ra, ri
    store ra, r1
  Translation using movem
    mulc rj, 4
    add rb, rj
    addc ri, 1
    mulc ri, 4
    add ra, ri
    movem ra, rb
  Translation using movex
    mulc rj, 4
    addc ri, 1
    mulc ri, 4
    add ra, ri
    movex ra, rb, rj

Instruction Selection
  Each instruction may have a different cost
    Time cost: how fast the instruction executes
    Space cost: how much space the instruction takes
  For example
    load r1, r2          cost = 3
    store r1, r2         cost = 3
    add r1, r2           cost = 1
    addc r1, c           cost = 1
    mul r1, r2           cost = 10
    mulc r1, c           cost = 10
    movem r1, r2         cost = 4
    movex r1, r2, r3     cost = 5
  Cost of the three translations
    Using load and store: 10 + 1 + 3 + 1 + 10 + 1 + 3 = 29
    Using movem: 10 + 1 + 1 + 10 + 1 + 4 = 27
    Using movex: 10 + 1 + 10 + 1 + 5 = 27
  Goal: find the translation with the minimal cost

Tree Representation
  Problem
    Some translations may have to combine non-consecutive instructions
      E.g., movex covers t2 := B + t1, t3 := M[t2] and M[t6] := t3, which are not adjacent in the intermediate code
  Solution
    Use tree-like
    representation for instruction match
      o In the tree, these instructions are consecutive
    Convert instructions to tiles
      Easier to detect the matching instructions

Instruction Selection
  Goal
    Determine the parts of the tree that can match the instruction tiles
  [Figure: expression tree for A[i+1] := B[j] (a store whose address is a + (i + 1) * 4 and whose value is a load from b + j * 4), next to the tile patterns for load, store, add, addc, mul, mulc, movem and movex]

Instruction Selection
  Desirable to achieve optimal tiling
    Get the instruction set with the least cost
    Not easy
  The maximal munch algorithm (a greedy algorithm)
    Start from the tree root and find all matching tiles
    Select the one with the maximum number of nodes
      o Can consider other criteria that include the cost of the instruction
    Go to the children and apply the algorithm recursively
    Until the tree is fully covered

Instruction Selection
  Dynamic programming for tiling
    Start from the tree root
    For each node
      If the best cost for the node has already been computed, then return it
      Otherwise:
        Cost of a node = cost of T + total costs of all children
          o T is a matching tile; for different T, the children may be different
          o Select the minimum cost among all possible matching T's
    The tiling decision is top down; the cost is computed bottom up
    Complexity: O(N*NT)
      N: the number of nodes
      NT: the maximum number of tiles for a node
      The cost of each node only needs to be computed once
        o Once computed, just return it
      After one round, the costs of the internal nodes of some tiles may not have been computed; if such a node is selected in another round, its cost will be computed then

Instruction Selection
  Cost criteria should consider modern architectures
    The best instruction set or best instruction schedule may not be useful for some modern architectures
    E.g., the best scheduling may reduce register usage, but may not allow the best pipelining or the best parallel execution
      o Pipelining is commonly used in modern processors
      o Multiple cores will become a common architecture
    For both instruction selection and instruction scheduling
      Should consider the cost to
      facilitate pipelining and/or parallel execution

Peephole Optimization
  Performed at the end of code generation
    Performed directly on the generated machine code
  Only looks at a few instructions
    Generally no more than 5
    Using a sliding window
  Eliminating redundant instructions
    Code produced by some code generators has redundancies; even after other optimization steps, there may still be easy-to-catch redundancies
  Algebraic transformation
    Strength reductions

Eliminate Redundancies
  Unnecessary load-store
    Before:
      load r0, a
      store a, r0
    After:
      load r0, a
  Eliminate jump after jump
    Before:
      if (a<b) goto L1
      ...
      L1: goto L2
    After:
      if (a<b) goto L2
      ...
      L1: goto L2

    Before:
      goto L1
      ...
      L1: if a < b goto L2
      L3:
    After:
      if (a<b) goto L2
      goto L3
      ...
      L3:

    Before:
      goto L1
      ...
      L1: return
    After:
      return
      ...
      L1: return

Eliminate Redundancies
  Eliminate jump after jump
    Source code:
      debug = 0
      ...
      if (debug) { print debugging information }
    Generated intermediate code:
      debug = 0
      ...
      if debug = 1 goto L1
      goto L2
      L1: print debugging information
      L2:
    After optimization:
      debug = 0
      ...
      if debug ≠ 1 goto L2
      print debugging information
      L2:

Strength Reduction
  Replace multiplication and division by shifts
    A := A * 4 becomes A := A << 2
      Need to take care of the overflow problem (may result in a negative number)
    A := A / 4 becomes A := A >> 2
      Need to shift by replacing the msb with the sign bit (arithmetic shift)
    But right shift has a well-known problem in two's complement representation
      o –5        111111...1111111011
      o –5 >> 1   111111...1111111101   (–5 / 2 = –2, but the result here is –3)
      o –6        111111...1111111010
      o –6 >> 1   111111...1111111101   (no problem, correct answer)
      o Fix 1: shift bit by bit, add the lsb back to the number after the shift
      o Fix 2: convert to a positive number for the shift, then convert back to negative

Code Generation Issues: Summary
  Read Chapter 8
    Sections 8.5, 8.7, 8.8, 8.10
  Run time storage allocation
  Register allocation and instruction scheduling
    For basic blocks
      Tree based
      Dag based
        o R. Govindarajan, H. Yang, J. N. Amaral, C. Zhang, G. R. Gao, "Minimum register instruction sequence problem: Revisiting optimal code generation for DAGs," IEEE International Parallel & Distributed Processing Symposium, 2001
    Global register allocation using graph coloring
  Peephole optimization
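The right-shift caveat from the Strength Reduction slide can be reproduced directly. The sketch below uses Python, whose `>>` on negative integers behaves like an arithmetic shift (it rounds toward negative infinity), so it shows the same results as the bit patterns on the slide. The function names are hypothetical; `shift_divide_trunc` follows the idea of Fix 2 (shift the magnitude, then restore the sign), and `truncating_divide` is a reference that rounds toward zero, as the slide's –5 / 2 = –2 expects.

```python
def shift_divide_floor(x: int, k: int) -> int:
    """Plain arithmetic right shift: rounds toward negative infinity."""
    return x >> k

def shift_divide_trunc(x: int, k: int) -> int:
    """Fix 2: shift the magnitude of a negative number, then negate the result."""
    return -((-x) >> k) if x < 0 else x >> k

def truncating_divide(x: int, d: int) -> int:
    """Reference division that rounds toward zero."""
    quotient = abs(x) // abs(d)
    return quotient if (x < 0) == (d < 0) else -quotient

for value in (-5, -6, 5, 6):
    print(value, shift_divide_floor(value, 1), shift_divide_trunc(value, 1))
```

For even negative values such as –6 both versions agree; the plain shift only goes wrong when a one bit is shifted out of a negative number.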