Profile-driven Inlining for Erlang
Thomas Lindgren
[email protected]
Inlining
Replace function call f(X1,…,Xn) with body of f/n
Optimization enabler
– Simplify code
– Specialize code
– Remove "optimization fence"

Standard tool in modern compiler toolbox
Inlining

Main problem: which calls to inline?
– Code growth reduces performance
– Estimate code size growth
– Select the best estimated sites subject to cost

Some static estimations:
– f/n is small? (= inline cost is small)
– Inlining the call to f/n enables optimization

Are we optimizing the important code?
– Or just the convenient code?
Inlining

Dynamic estimation
– Profile the program
– Select the best hot call sites for inlining

Optimize the important code
Our approach


Inlining driven by profiling
Permit cross-module inlining
– Computations often span several modules
– Code growth measured for whole program

Cross-module optimization enabled by (i) module aggregation and (ii) guarded conversion of remote to local calls
– (will not describe this further here) [Lindgren 98]
The rest of this talk
Overview of method
Performance measurements

Inline forest

Inlinings to be done represented by forest
Nodes are inlined call sites
Leaves are call sites to be checked
Some sites are not inlined
(Example shows nested inlining)
[Figure: inline forest over call sites f, g and h]
Priority-based inlining

All call sites (leaves in inline forest) are placed in priority queue
– Priority = estimated number of calls

When a call site f is inlined, the call sites in f are added to the queue
– Priority scaled appropriately
Inlining algorithm

Preprocess code
– call_site and size maps
– Initialize priority queue
– Initialize inline forest

While priority queue not empty
– Take call site (k, f)
– Try to inline it
Preprocessing

For each function visited k times
– for each call site visited k’ times, set ratio(call_site) = (k’/k)
Adjust ratio so that it is < 1.0
Self-recursive call sites := 0.0
– (improves code quality)

maps (function -> [{call_site, ratio}])
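The preprocessing rules above can be sketched in a few lines of Erlang. This is only an illustration: the module name, the shape of the profile data and the choice of 0.99 as the cap are assumptions (the slides say only that the ratio is adjusted to stay below 1.0, and show 1.0 becoming 0.99 in one example).

```erlang
-module(ratio_sketch).
-export([call_site_ratios/3, demo/0]).

%% Hypothetical profile shape (not from the talk):
%%   Caller    = {Name, Arity}
%%   FunVisits = k, the number of times the function was visited
%%   Sites     = [{Callee, SiteVisits}], SiteVisits = k' per call site
call_site_ratios(Caller, FunVisits, Sites) when FunVisits > 0 ->
    [{Callee, ratio(Caller, Callee, SiteVisits, FunVisits)}
     || {Callee, SiteVisits} <- Sites].

%% Self-recursive call sites get ratio 0.0.
ratio(Caller, Caller, _SiteVisits, _FunVisits) ->
    0.0;
%% Otherwise the ratio is k'/k, capped (assumed cap: 0.99) so that
%% it stays below 1.0.
ratio(_Caller, _Callee, SiteVisits, FunVisits) ->
    min(SiteVisits / FunVisits, 0.99).

%% The dec_bearer_capability example: 200,000 visits to the function
%% and 200,000 visits to its call to dec_bearer_capability_6/2.
demo() ->
    call_site_ratios({dec_bearer_capability, 2}, 200000,
                     [{{dec_bearer_capability_6, 2}, 200000}]).
```

Calling ratio_sketch:demo() yields the single entry for dec_bearer_capability_6/2 with ratio 0.99.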
Original code marked with number of visits

dec_bearer_capability(__X12,__X13) ->
    {visits,200000},
    case {__X12,__X13} of
        {BbcRec,[Octet5|Rest]} ->
            {200000},
            NewBbcRec = case Octet5 band 31 of
                            1 ->
                                {200000}(erlang:setelement(3,BbcRec,1));
                            3 ->
                                "...";
                            16 ->
                                "...";
                            24 ->
                                "..."
                        end,
            case if
                     Octet5 band 128 == 128 ->
                         {200000},
                         false;
                     true ->
                         "..."
                 end of
                true ->
                    "...";
                false ->
                    {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
            end
    end.
Special attention to function calls
    {200000}(erlang:setelement(3,BbcRec,1))
    {200000}(dec_bearer_capability_6(NewBbcRec,Rest))
dec_bearer_capability/2 runs 200,000 times
dec_bearer_capability_6 visited 200,000 times
– ratio is (200/200) = 1.0
– adjust ratio to 0.99
Inlining a call site

Bookkeeping phase (code gen later)
Call to f(X1,…,Xn), visited k times
k < minimum frequency? stop
tot_size + size(f) > max_size? skip
Otherwise,
– tot_size += size(f)
– for each call site g of f
  – add (k * ratio, g) to priority queue
  – extend node f by call sites g1,…,gn
Iterate until no call sites remain
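The bookkeeping loop above might look roughly like the following sketch. It is not the talk's implementation: the module name, the plain sorted list standing in for the priority queue, the map-based size and ratio tables, and the function names f, g, h are all assumptions made for illustration. The inline forest is recorded as a list of {Function, Parents} pairs, where Parents is the chain of enclosing inlined sites.

```erlang
-module(inline_sketch).
-export([plan/5, demo/0]).

%% Queue:  list of {K, Callee, Parents}; K is the estimated number of calls.
%% Sizes:  map Callee => code size (hypothetical unit).
%% Ratios: map Callee => [{CallSiteCallee, Ratio}] from preprocessing.
%% Returns {InlinedSitesInOrder, TotalSize}.
plan(Queue, Sizes, Ratios, MinFreq, MaxSize) ->
    loop(Queue, Sizes, Ratios, MinFreq, MaxSize, 0, []).

loop([], _Sizes, _Ratios, _MinFreq, _MaxSize, Tot, Acc) ->
    {lists:reverse(Acc), Tot};
loop(Queue, Sizes, Ratios, MinFreq, MaxSize, Tot, Acc) ->
    %% Take the call site with the highest estimated call count.
    [{K, F, Parents} | Rest] = lists:reverse(lists:sort(Queue)),
    Size = maps:get(F, Sizes),
    if
        K < MinFreq ->
            %% Everything left in the queue is colder still: stop.
            {lists:reverse(Acc), Tot};
        Tot + Size > MaxSize ->
            %% Too much code growth: skip this site and keep going.
            loop(Rest, Sizes, Ratios, MinFreq, MaxSize, Tot, Acc);
        true ->
            %% Inline F: its call sites become new leaves below F,
            %% with priority scaled by the call-site ratio.
            New = [{K * R, G, [F | Parents]}
                   || {G, R} <- maps:get(F, Ratios, [])],
            loop(New ++ Rest, Sizes, Ratios, MinFreq, MaxSize,
                 Tot + Size, [{F, Parents} | Acc])
    end.

%% Made-up numbers: f is hot and small, g is small, h is too large.
demo() ->
    Sizes = #{f => 10, g => 5, h => 100},
    Ratios = #{f => [{g, 0.5}, {h, 0.4}]},
    plan([{1000, f, []}], Sizes, Ratios, 100, 50).
```

In the demo, f is inlined first, then g nested inside f, while h is skipped because it would exceed the size budget of 50. A real implementation would presumably use a proper priority queue rather than re-sorting a list on every step.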
Example

Inlining applied to decode1
– Protocol decoding
– Single module
decode1

Prio queue
– decode_ie_coding_1/3 [800k]
– decode_action/1 [800k]
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …

Inline forest
[Figure: inline forest]

Call_site mapping (selected parts)
dec_bearer_capability/2 -> [(dec_bearer_capability_6, 1.00)]
– adjust to 0.99
decode_ie_heads_setup/5 ->
[(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2),
(decode_ie_heads_setup/5, 0.2), (decode_ie_heads_setup/5, 0.6)]
– the decode_ie_heads_setup/5 entries are self-recursive so set to 0.0
…
Try to inline
decode1

Prio queue
– decode_ie_coding_1/3 [800k]
– decode_action/1 [800k]
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …

Inline forest
[Figure: inline forest]

Call_site mapping
dec_bearer_capability/2 -> [(dec_bearer_capability_6, 0.99)]
decode_ie_heads_setup/5 ->
[(decode_action/1, 0.8), (decode_ie_coding/1, 0.8), (dec_bearer_capability, 0.2),
(decode_ie_heads_setup/5, 0.0), (decode_ie_heads_setup/5, 0.0)]
…
decode1

Prio queue
– decode_action/1 [800k]
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …

Inline forest
[Figure: inline forest]

decode1

Prio queue
– dec_bearer_capability/2 [200k]
– dec_bearer_capability_6/2 [198k]
– decode_ie_heads_setup/5 [198k]
– …

Inline forest
[Figure: inline forest]

decode1

Prio queue

Inline forest
[Figure: inline forest]
Final result:
– Inline dec_bearer_cap_6/2 into dec_bearer_cap/2, yielding (*)
– Inline dec_ie_coding/1, decode_action/1 and (*) into decode_ie_heads_setup/5
– During inlining, one inline was rejected for too much code growth (not shown)
Now time for code generation
Code generation

Walk each inline tree from leaf to root
– Replace inlined calls f(E1,…,En) with (fun(X1,…,Xn) -> E end)(E1,…,En)
– General case: nested inlines

Simplify the resulting function
– Apply fun to arguments (above)
– Case-of-case
– Case-of-if
– …
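As a small illustration of the replacement and the first simplification step, the sketch below shows one call in three stages. The function scale/2 and the module name are hypothetical and not from the talk; the point is only that all three forms compute the same value.

```erlang
-module(codegen_sketch).
-export([original/1, inlined/1, simplified/1, demo/0]).

%% Hypothetical callee, used only for this illustration.
scale(X, Factor) -> X * Factor.

%% Before inlining: an ordinary call.
original(N) -> scale(N + 1, 2).

%% After the replacement step: the call becomes an application of a
%% fun whose body is the body of scale/2.
inlined(N) -> (fun(X, Factor) -> X * Factor end)(N + 1, 2).

%% After "apply fun to arguments": the parameters are bound to the
%% argument expressions and the fun disappears.
simplified(N) ->
    X = N + 1,
    X * 2.

%% All three versions agree on every input.
demo() ->
    [{original(N), inlined(N), simplified(N)} || N <- [0, 1, 20]].
```

When the fun is opened up like this, the parameter names land in the caller's scope, so they have to be kept distinct from the caller's own variables. The inlined code in the backup slides carries a _0_ prefix on such variables.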
Measurements

Used five applications
– decode1 (small protocol decoder)
– ldapv2 (ASN.1 encode/decode)
– gen_tcp (send/rcv over socket)
– beam (compiler)
– mnesia (simulate HLR)
Benchmarks

App       Mods  Funcs  Calls  Local  Visited
gen_tcp     13    658   1546    989      202
ldapv2       5    321   1038    616      140
beam        51   2347   9669   7594     2653
mnesia      63   4207  13390   8435      984
Performance

Very preliminary
– Code generation problems for beam and mnesia => unable to measure
– (Probably due to name capture bug)

Did not use outlining, higher-order specialization, apply open-coding [EUC’01]
Tried only emulated code
– Native code compilation failed
Speedup vs baseline

App       Speedup
decode1      1.05
gen_tcp      1.04
ldapv2       1.10

Native compilation of inlined decode1 provided a net slowdown
Future work
Integrate with other optimizations
Plenty of opportunities for further source-level simplifications
Suggests new approach to module aggregation
– (do it after inlining instead of before)

Tuning, measurements
– Bugfixing …
Conclusion
Profile-guided inlining speeds up real code
Whole-program, cross-module inlining probably necessary

Backup slides
Case-of-if
%% inlined, before simplify
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    case if
             Octet5 band 128 == 128 ->
                 false;
             true ->
                 true
         end of
        true ->
            dec_bearer_capability_5a(NewBbcRec,Rest);
        false ->
            _0_BbcRec = NewBbcRec,
            [_0_Octet6] = Rest,
            _0_STC = case (_0_Octet6 bsr 5) band 3 of
                         0 ->
                             0;
                         1 ->
                             1
                     end,
            _0_UPCC = case _0_Octet6 band 3 of
                          0 ->
                              0;
                          1 ->
                              1
                      end,
            _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC)
    end.
%% after simplify:
dec_bearer_capability(BbcRec,[Octet5|Rest]) ->
    ...
    if
        Octet5 band 128 == 128 ->
            _0_BbcRec = NewBbcRec,
            [_0_Octet6] = Rest,
            _0_STC = case (_0_Octet6 bsr 5) band 3 of
                         0 ->
                             0;
                         1 ->
                             1
                     end,
            _0_UPCC = case _0_Octet6 band 3 of
                          0 ->
                              0;
                          1 ->
                              1
                      end,
            _0_NewBbcRec = erlang:setelement(6,erlang:setelement(5,_0_BbcRec,_0_UPCC),_0_STC);
        true ->
            dec_bearer_capability_5a(NewBbcRec,Rest)
    end.
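The case-of-if rewrite shown in the two listings can be tried in isolation with a reduced, runnable version. The module name and the atoms call_5a and inlined_body are placeholders (assumptions) standing in for the call to dec_bearer_capability_5a and for the inlined body; only the control structure follows the slides.

```erlang
-module(case_of_if_sketch).
-export([nested/1, flat/1, demo/0]).

%% Before: an 'if' computes a boolean that a 'case' immediately tests.
nested(Octet) ->
    case if
             Octet band 128 == 128 -> false;
             true -> true
         end of
        true  -> call_5a;
        false -> inlined_body
    end.

%% After case-of-if: each 'if' clause goes straight to the case
%% branch that its boolean result would have selected.
flat(Octet) ->
    if
        Octet band 128 == 128 -> inlined_body;
        true -> call_5a
    end.

%% Both versions pick the same branch for each test octet.
demo() ->
    [{Octet, nested(Octet), flat(Octet)} || Octet <- [0, 127, 128, 255]].
```

Octets with bit 7 set (128, 255) select inlined_body in both versions, and the others select call_5a, so the intermediate boolean is never materialized after the rewrite.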
Module merging


We want to optimize over several modules at a time
What to do about hot code loading?
– Merge modules to aggregates
– Convert suitable remote calls into local calls
– Guard such calls to preserve code loading semantics
– Annotate code regions with "origin module" to enable precise process purging

Or … extend Erlang appropriately
[Code slide: listing not recoverable from extraction; surviving fragments include a case on _4_Flag band 16 == 16 and the variables Bin, IEList and BrepFlag]