Profile-driven Inlining for Erlang

Thomas Lindgren


Inlining

Replace function call f(X1,…,Xn) with body of f/n
Optimization enabler
  – Simplify code
  – Specialize code
  – Remove "optimization fence"
Standard tool in modern compiler toolbox


Inlining

Main problem: which calls to inline?
  – Code growth reduces performance
  – Estimate code size growth
  – Select the best estimated sites subject to cost
Some static estimations:
  – f/n is small? (= inline cost is small)
  – Inlining the call to f/n enables optimization
Are we optimizing the important code?
  – Or just the convenient code?


Inlining

Dynamic estimation
  – Profile the program
  – Select the best hot call sites for inlining
Optimize the important code


Our approach

Inlining driven by profiling
Permit cross-module inlining
  – Computations often span several modules
  – Code growth measured for whole program
Cross-module optimization enabled by (i) module aggregation and
(ii) guarded conversion of remote to local calls
(will not describe this further here) [Lindgren 98]


The rest of this talk

Overview of method
Performance measurements


Inline forest

Inlinings to be done represented by forest
Nodes are inlined call sites
Leaves are call sites to be checked
(Example shows nested inlining)
[Figure: example inline forest over call sites f, g and h; callout: "Some sites are not inlined"]


Priority-based inlining

All call sites (leaves in inline forest) are placed in priority queue
  – Priority = estimated number of calls
When a call site f is inlined, the call sites in f are added to the queue
  – Priority scaled appropriately


Inlining algorithm

Preprocess code
  – call_site and size maps
  – Initialize priority queue
  – Initialize inline forest
While prio queue not empty
  – Take call site (k, f)
  – Try to inline it


Preprocessing

for each function visited k times
  – for each call site visited k' times
    set ratio(call_site) = (k'/k)
Adjust ratio so that ratio < 1.0
Self-recursive call sites := 0.0
  – (improves code quality)
maps (function -> [{call_site, ratio}])


dec_bearer_capability(__X12,__X13)
  -> {visits,200000},
     case {__X12,__X13} of
       {BbcRec,[Octet5|Rest]} ->
         {200000},
         NewBbcRec =
           case Octet5 band 31 of
             1 -> {200000}(erlang:setelement(3,BbcRec,1));
             3 -> "...";
             16 -> "...";
             24 -> "..."
           end,
         case
           if
             Octet5 band 128 == 128 -> {200000}, false;
             true -> "..."
           end
         of
           true -> "...";
           false -> {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
         end
     end.

Callouts on the code above:
  – Original code marked with number of visits
  – Special attention to function calls
  – dec_bearer_capability/2 runs 200,000 times
  – dec_bearer_capability_6 visited 200,000 times
  – ratio is (200/200) = 1.0
  – adjust ratio to 0.99


Inlining a call site

Bookkeeping phase (code gen later)
Call to f(X1,…,Xn), visited k times
k < minimum frequency? stop
tot_size + size(f) > max_size?
    skip
Otherwise,
  – tot_size += size(f)
  – for each call site g of f
      add (k * ratio, g) to priority queue
    extend node f by call sites g1,…,gn
Iterate until no call sites remain


Example

Inlining applied to decode1
  – Protocol decoding
  – Single module


decode1

Prio queue:
  decode_ie_coding_1/3 [800k]
  decode_action/1 [800k]
  dec_bearer_capability/2 [200k]
  dec_bearer_capability_6/2 [198k]
  decode_ie_heads_setup/5 [198k]
  …
Inline forest: [figure]
Call_site mapping (selected parts):
  dec_bearer_capability/2 ->
    [(dec_bearer_capability_6, 1.00)]
  decode_ie_heads_setup/5 ->
    [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8),
     (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.2),
     (decode_ie_heads_setup/5, 0.6)]
  …
Callouts:
  – 1.00: adjust to 0.99
  – decode_ie_heads_setup/5 entries: self-recursive so set to 0.0


Try to inline

decode1
Prio queue:
  decode_ie_coding_1/3 [800k]
  decode_action/1 [800k]
  dec_bearer_capability/2 [200k]
  dec_bearer_capability_6/2 [198k]
  decode_ie_heads_setup/5 [198k]
  …
Inline forest: [figure]
Call_site mapping:
  dec_bearer_capability/2 ->
    [(dec_bearer_capability_6, 0.99)]
  decode_ie_heads_setup/5 ->
    [(decode_action/1, 0.8), (decode_ie_coding/1, 0.8),
     (dec_bearer_capability, 0.2), (decode_ie_heads_setup/5, 0.0),
     (decode_ie_heads_setup/5, 0.0)]
  …


decode1

Prio queue:
  decode_action/1 [800k]
  dec_bearer_capability/2 [200k]
  dec_bearer_capability_6/2 [198k]
  decode_ie_heads_setup/5 [198k]
  …
Inline forest: [figure]


decode1

Prio queue:
  dec_bearer_capability/2 [200k]
  dec_bearer_capability_6/2 [198k]
  decode_ie_heads_setup/5 [198k]
  …
Inline forest: [figure]


decode1

Prio queue: (empty)
Inline forest: [figure]
Final result:
  – Inline dec_bearer_cap_6/2 into dec_bearer_cap/2, yielding (*)
  – Inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5
  – During inlining, one inline was rejected for too much code growth (not shown)
Now time for code generation


Code generation

Walk each inline tree from leaf to root
  – Replace inlined calls f(E1,…,En) with
    (fun(X1,…,Xn) -> E end)(E1,…,En)
  – General case: nested inlines
Simplify the resulting function
  – Apply fun to arguments (above)
  – Case-of-case
  – Case-of-if
  – …


Measurements

Used five applications
  – decode1 (small protocol decoder)
  – ldapv2 (ASN.1 encode/decode)
  – gen_tcp (send/rcv over socket)
  – beam (compiler)
  – mnesia (simulate HLR)


Benchmarks

App       Mods   Funcs   Calls   Local   Visited
gen_tcp     13     658    1546     989       202
ldapv2       5     321    1038     616       140
beam        51    2347    9669    7594      2653
mnesia      63    4207   13390    8435       984


Performance

Very preliminary
  – Code generation problems for beam and mnesia => unable to measure
  – (Probably due to name capture bug)
Did not use outlining, higher-order specialization, apply open-coding [EUC'01]
Tried only emulated code
  – Native code compilation failed


Speedup vs baseline

decode1   1.05
gen_tcp   1.04
ldapv2    1.10

Native compilation of inlined decode1 provided a net slowdown


Future work

Integrate with other optimizations
Plenty of opportunities for further source-level simplifications
Suggests new approach to module aggregation
  – (do it after inlining instead of before)
Tuning, measurements
  – Bugfixing
…


Conclusion

Profile-guided inlining speeds up real code
Whole-program, cross-module inlining probably necessary


Backup slides


Case-of-if

%% inlined, before simplify
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
  ...
  case
    if
      Octet5 band 128 == 128 -> false;
      true -> true
    end
  of
    true ->
      dec_bearer_capability_5a(NewBbcRec,Rest);
    false ->
      _0_BbcRec = NewBbcRec,
      [_0_Octet6] = Rest,
      _0_STC = case (_0_Octet6 bsr 5) band 3 of
                 0 -> 0;
                 1 -> 1
               end,
      _0_UPCC = case _0_Octet6 band 3 of
                  0 -> 0;
                  1 -> 1
                end,
      _0_NewBbcRec =
        erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC)
  end.

%% after simplify:
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
  ...
  if
    Octet5 band 128 == 128 ->
      _0_BbcRec = NewBbcRec,
      [_0_Octet6] = Rest,
      _0_STC = case (_0_Octet6 bsr 5) band 3 of
                 0 -> 0;
                 1 -> 1
               end,
      _0_UPCC = case _0_Octet6 band 3 of
                  0 -> 0;
                  1 -> 1
                end,
      _0_NewBbcRec =
        erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC);
    true ->
      dec_bearer_capability_5a(NewBbcRec,Rest)
  end.


Module merging

We want to optimize over several modules at a time
What to do about hot code loading?
  – Merge modules to aggregates
  – Convert suitable remote calls into local calls
  – Guard such calls to preserve code loading semantics
  – Annotate code regions with "origin module" to enable precise process purging
Or … extend Erlang appropriately

[Code listing on the final backup slide is not recoverable from the extraction]
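
Sketch of the bookkeeping phase

The preprocessing and "Inlining a call site" slides describe the bookkeeping only in outline, so a small sketch may help make the control flow concrete. The code below is an illustration under stated assumptions, not the implementation behind the talk: the module name inline_sketch and the functions ratio/3, plan/5, loop/7 and sort_desc/1 are hypothetical, the priority queue is approximated by a list re-sorted in descending order, and the 0.99 cap is borrowed from the dec_bearer_capability example rather than confirmed as a general constant. MinFreq is assumed to be positive, so that scaled priorities in recursive call chains eventually fall below it.

```erlang
-module(inline_sketch).
-export([ratio/3, plan/5]).

%% Ratio of one call site: visits of the site divided by visits of the
%% enclosing function. Self-recursive sites get 0.0; all other ratios are
%% capped at 0.99 (ASSUMPTION: cap value taken from the slide example).
ratio(true, _FunVisits, _SiteVisits) ->
    0.0;
ratio(false, FunVisits, SiteVisits) when FunVisits > 0 ->
    min(SiteVisits / FunVisits, 0.99).

%% Queue : [{Priority, F}], Priority = estimated number of calls
%% Sites : #{F => [{G, Ratio}]}, the call sites inside each function
%% Sizes : #{F => Size}
%% Returns {AcceptedSitesInOrder, TotalSize}. Bookkeeping only, no code gen.
plan(Queue, Sites, Sizes, MinFreq, MaxSize) ->
    loop(sort_desc(Queue), Sites, Sizes, MinFreq, MaxSize, 0, []).

loop([], _Sites, _Sizes, _MinFreq, _MaxSize, Tot, Acc) ->
    {lists:reverse(Acc), Tot};
loop([{K, _F} | _], _Sites, _Sizes, MinFreq, _MaxSize, Tot, Acc)
  when K < MinFreq ->
    %% The hottest remaining site is below the minimum frequency: stop.
    {lists:reverse(Acc), Tot};
loop([{K, F} | Rest], Sites, Sizes, MinFreq, MaxSize, Tot, Acc) ->
    Size = maps:get(F, Sizes),
    case Tot + Size > MaxSize of
        true ->
            %% Too much code growth: skip this site and continue.
            loop(Rest, Sites, Sizes, MinFreq, MaxSize, Tot, Acc);
        false ->
            %% Accept the site; call sites inside F become new candidates
            %% with priority scaled by their ratio.
            New = [{K * R, G} || {G, R} <- maps:get(F, Sites, [])],
            loop(sort_desc(New ++ Rest), Sites, Sizes, MinFreq, MaxSize,
                 Tot + Size, [{K, F} | Acc])
    end.

%% Stand-in for a priority queue: highest priority first.
sort_desc(L) ->
    lists:reverse(lists:sort(L)).
```

With Sites = #{f => [{g, 0.5}], g => []} and Sizes = #{f => 10, g => 20}, the call inline_sketch:plan([{100, f}], Sites, Sizes, 10, 100) accepts f and then g and returns {[{100,f},{50.0,g}],30}. Lowering MaxSize to 25 makes the sketch skip g for code growth and return {[{100,f}],10}.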