Semantic Reduction

Program Analysis Lecture 9 (30/5/02)
Shape Analysis via 3-Valued Logic TVLA + Applications cont.
Mooly Sagiv
Lecture notes by Eran Globen
Additional notes are in blue
Part 1: Shape Analysis via 3-Valued Logic Cont.
As we saw last class the 3-valued logic abstract interpretation is sound but not very
precise. In order to refine the abstraction and gain a more precise abstract interpretation
we use the following principle.
The Instrumentation Principle
Increases precision by storing the truth-value of some designated formulae. The value of
those extra predicates will allow us to differentiate between cells with different predicate
values. By the embedding theorem (see previous class) we know that the value of these
predicates is kept sound in the abstract world as well (this is demonstrated in Figures 1,
2)
Introduce predicate-update formulae to update the extra predicates. In order to increase
efficiency the extra predicates are updated with each step as opposed to being calculated
from scratch with each change.
The following is an example of such a predicate.
IS(v) – Is Shared – This predicate evaluates to 1 if the object v is pointed to by more than
one object.
is[sel](v) = v1,v2: sel(v1,v)  sel(v2,v)  eq(v1, v2)
x

 31
is = 0
is = 0
 71
 91
is = 0
is = 0
 u 
x
u1
is = 0
Figure 1
1
is = 0
In the following figure one of the cells is shared. This change is reflected in the
abstraction resulting in a more precise representation of the data structure.
x

31
71
91
is = 0
is = 1
is = 0
is = 0

x
u1
u
is = 0
is = 1

is = 0
Figure 2
Updating sharing x.sel:=y
Following is the predicate update formulate for is
is [sel]’(v) :=
(v1:x(v1)?
(y(v)?
v2:sel(v2, v) x(v2)
:(sel(v1, v)?
v2, v3: is[sel](v2, v3, v) 
x(v2) x(v3)
: is[sel](v))
:is[sel](v))
with the following:
is[sel](v2, v3, v) = sel(v2, v)sel(v3, v)  eq(v2, v3)
is[sel] is true if v is pointed to by v2 and v3 and v2, v3 are different cells.
2
Other Instrumentation
For doubly linked lists:
c[cdr,car](v)=v1: cdr(v, v1)car(v1, v)
c[car,cdr](v)=v1: car(v, v1)cdr(v1, v)
Notice that in doubly-linked lists there are several ways to reach a node. Still when
c[cdr, car](u) and c[car, cdr](u) are both 1, we see that u is an element of a doubly linked
list.
reachability:
r[sel](v1, v2) = sel*(v1, v2)
r[x, sel](v) = v1: x(v1)sel*(v1, v)
r[x](v) = v1: x(v1)(car|cdr)*(v1, v)
The importance of reachability will be explained later.
For sorted lists:
inOrder[sel,dle](v) = v1: sel(v, v1) dle(v, v1)
inROrder[sel,dle](v) = v1: sel(v, v1) dle(v1, v)
3
Semantic Reduction







Improves the precision of the analysis by recovering properties of the program
semantics.
Given a Galois connection (L1, , , L2)
An operation op:L2L2 is a semantic reduction
if lL2 op(l)l
and (op(l)) = (l)
Can be applied before and after basic operations
Preserves soundness
The decision to apply or not apply a given semantic reduction is mostly
performance based.
In a generic system like TVLA semantic reduction is very important as it allows
defining the meaning of program operations in a modular way by ignoring the
interdependencies between different predicates and then using a semantic
reduction to improve the precision by observing certain dependencies.
Semantic reductions need not lead to the best (induced) abstract interpretation.
That is it may be that (concrete-op((l))  (op((l)) for a concrete operation
concrete-op and a semantic reduction . The intuitive reason is that  may not take
into account all the information which is available on the concrete semantics and
how it affects the abstraction.
4
Materialization
An example of a semantic reduction is the materialization operation as shown in Figure 4.
In structure 3 the value of x(u) is ½. In structure 4 we see that the summary node has
been split into two and the value for x(u2) is 1, yielding a more precise abstraction.
Materialization “creates” the cells with a definite value for the core variable predicates so
that those values are more in focus.
Notice this operation is not formally a semantic reduction since structure 2 represents
some concrete stores that 4 does not represent. For example, structure 2 also concrete
stores in which x points to the 3rd element. This problem is avoided in TVLA by
decoupling the semantic reduction from the abstract meaning of operations
1
cdr
 u 
cdr
x
y
u1
y
cdr
3
x
y
cdr
u1
2
x := x  cdr
u1
cdr
x
cdr
x
x := x  cdr

 u 
cdr
4
cdr
 u
y
Figure 3
5
u1
u2
u3
The Focusing Principle





The purpose of the focusing principle is to increase precision.
More specifically we wish to “Bring the predicate-update formula into focus”
(Force 1/2 to 0 or 1) and only then apply the predicate-update formulae.
Generalizes materialization
Focusing is sound in the sense that concretization of structure S and focus(S) is
the same.
This is a true a semantic reduction and it can be changed by using different
formulae.
Following are a few samples of the focus operation on different formulae.
Step 1: Focus on 1(v) =  v1: x(v1)  cdr(v1,v)
x
y
cdr
u1

u
1(u) = ½
cdr
x
y



x
y



x
y

u1
cdr
 u 
1(u) = 0
cdr
cdr
u1
 u 
1(u) = 1
cdr
cdr
u1
u.1
cdr
cdr
u.0
1(u.1) = 1
Figure 4
All three right hand structures have definite values for the focused formula.
The second structure actually represents a list with only 2 cells since cdr(u1, u)=1 and
surely u1 can only point to one cell. The third structure is a list with u1 pointing to only
part of the list.
Notice that the 3 structures on the right represent exactly the same stores as the ones
represented in the left.
6
After focusing on a formula we can update the appropriate predicate. In this case x is a
pointer traveling on a list.
Step 2: Evaluate Predicate-Update Formulae
x’(v) =  v1: x(v1)  cdr(v1,v)
x
y
x
y
u1
u1
cdr

y
u
u1
u.1
u
u1
cdr
y
cdr
cdr
x
u
cdr
cdr
u1
cdr

x
cdr
y
y
u
cdr

cdr
cdr
cdr  
u1
cdr
u.1
cdr
cdr
cdr
u.0
u.0
Figure 5
As explained before summary node u in the second figure can only represent one actual
cell. In the same way we can see that summary node u.1 in the third structure can only
represent one node. Also we know that if the original structure was a simple linked list
then some of the cdr links between u.1 and u.0 are impossible. We will see these
problems handled later when coercion is introduced.
The Focus Operation
Focus: Formula(P(3-Struct) P(3-Struct)) Focus is a partial function. For some
formulae it might return an infinite number of structures and so is undefined for those
formulae. For every formula 
Focus()(X) yields structure in which  evaluates to definite values in all assignments.
Focus() is a semantic reduction but Focus()(X) may be undefined for some X
For more details about the focus operation see Tal Lev-Ami’s Master thesis at
http://www.math.tau.ac.il/~tla/thesis.zip
7
Focus on  v1: cdr(v1,v)
x
y
 u 
cdr
u1
cdr
Figure 6
Focusing on the structure in Figure 8 will result in an infinite amount of structures as
shown in Figure 10. . The main problem is that the summary node u cannot be explicitly
included since the formula evaluates to ½ on this structure. Thus, we have to keep around
all the structures that represent the same concrete elements and have definite (0 or 1)
values. TVLA detects such situation at analysis time and yields an exception. A TVLA
user needs to guarantee that her/his analysis will not reach such situations.
For example, in shape analysis we only focus on the L- and R-values of pointer
expressions and thus guarantee that infinite structures cannot occur.
x
y
x
y
u1
u1
.
.
.
Figure 7
8
As was shown in figure 6 there might be additional information about the data structures
which we can use. We can use that information to further improve our semantics so it can
rule out inconsistent structures or sharpen values of instrumentation predicates.
The Coercion Principle





Another Semantic Reduction
Can be applied after Focus or after Update or both
Increases precision by exploiting some structural properties possessed by all
stores (Global invariants) (e.g. a variable can only point to one cell).
Structural properties captured by constraints (e.g. a certain structure is a acyclic
list).
Apply a constraint solver
A sample result after applying a Constraint Solver:
cdr
x
cdr
y
u1
u.1
x
cdr
cdr
cdr
cdr
y
u.0
u1
u.1
cdr
u.0
Figure 8
After applying the constraint solver we can easily see that variable x is now pointing at
the next element in a list.
Sources of Constraints
Properties of the operational semantics [x(v1) x(v2)eq(v1, v2)]
 Domain specific knowledge
 Instrumentation predicates [is[sel](u.1) = 0]
 User supplied
Format of Constraints
p(v1, v2, …, vk) ij  vivj
p(v1, v2, …, vk) ij  vivj
The right hand side of a constraint can only contain an atomic formula.
Interpretation: If LHS is 1 so is RHS. This is a strong definition as the RHS cannot be 1/2
if the LHS is 1.
Constraints are preserved under tight embedding
Example Constraints
x(v1) x(v2)eq(v1, v2) – A variable can only point to one cell.
sel(v, v1) sel(v,v2)eq(v1, v2) – A cell can only point to one cell.
sel(v1, v) sel(v2,v)eq(v1, v2)is[sel](v) – A cell is pointed to by 2 different cells iff it
is shared.
9
Example Constraints
x(v1) x(v2)eq(v1, v2) (1)
x(v1) eq(v1, v2) x(v2) (1a)
sel(v, v1) sel(v,v2)eq(v1, v2) (2)
sel(v, v1)  eq(v1, v2)  sel(v,v2) (2a)
sel(v1, v) sel(v2,v)eq(v1, v2)is[sel](v) (3)
sel(v1, v) sel(v2,v)is[sel](v) eq(v1, v2) (3a)
sel(v1, v) eq(v1, v2) is[sel](v) sel(v2,v) (3b)
is[sel](v) sel(v1, v) sel(v2,v)eq(v1, v2) (4)
From each of the original constraints (1, 2, 3 and 4) additional constraints were created,
these constraints are the logical consequences of the original one and can be created
automatically.
Examples for applying a Constraint Solver
x
y
cdr
cdr
u1
u.1
x
cdr
cdr
cdr
y
u.0
cdr
cdr
u1
u.1
cdr
cdr
cdr
u.0
x(v1) x(v2)eq(v1, v2) (1)
x
y
cdr
cdr
u1
u.1
x
cdr
cdr
cdr
u.0
y
cdr
cdr
u1
u.1
cdr
u.0
sel(v1, v) eq(v1, v2) is[sel](v) sel(v2,v) (3b)
Figure 9
Applying constraint (1) we can see that summary node u.1 is in fact just one cell.
Applying constraint (3b) we can eliminate some of the cdr edges with indefinite predicate
values.
10
Part 2: TVLA
From “TVLA: A framework for Kleene logic based static analysis”: TVLA is a
“YACC”-like framework for automatically constructing static-analysis algorithms from
an operational semantics where the operational semantics is specified using logical
formulae.
The TVLA System


Input:
o
o
o
o
o
SOS using Predicate Logic.
Control Flow Graph.
A set of 3-valued structures at the start node.
Implementation hints.
First two are supplied in .tvp (three value program) file. Last two in a .tvs
(three value structure) file.
Output
o The set of 3-valued structures at every control flow node (labeled directed
graphs)
o Textual error messages
o Output is created in a postscript file.
Partial info on TVLA command line
tvla <tvp> <tvs> [-d]
–d sets debug mode. Using the –d flag shows the structure in different phases of
execution.
For more information see the TVLA user manual.
Some sample analyses
Creating a list of unknown (at compilation time) size.
/* list.h */
typedef struct node {
struct node *n;
int data
} *Elements;
/* create.c */
Elements create() {
Elements *f, *x;
int i, size;
for(i=0; i<size; i++) {
f = malloc(sizeof(Elements));
f->n = x;
x = f;
}
return x;
}
Running TVLA with create.tvp and an empty structure supplied by empty.tvs.
tvla create include\empty –d (in debug mode)
tvla create include\empty
11
Figure 1: Control flow graph
Figure 2.b: Location n_4
Figure 2.a: location n_5
Figure 2.b: Location Exit
Figure 1 shows the control flow graph as supplied to TVLA.
Figure 2 shows some of the structures reached during the execution. Dotted lines have a
predicate value of ½. Double edged nodes are summary nodes.
12
Reversing a linked list.
typedef struct node {
struct node *n;
int data;
} *Elements;
Elements * reverse(Elements *x) {
Elements *y, *t;
y = NULL;
while (x != NULL) {
t = y;
y = x;
x = x  n;
y  n = t;
}
return y;
}
Running TVLA on reverse.tvp with an initial structure of a list (list.tvs)
tvla reverse include\list
Figure 3
13
Figure 4a: Location n_1
Figure 4b: Location n_2
Figure 4c: Location exit
Figure 4a shows the initial structure supplied to TVLA in list.tvs.
Figure 4b shows a possible fixed point for the loop. Following the flow of the program
we can see that this is not a very precise abstraction. x and y actually point at disjoint
lists. We can also see that the program does not leak memory as in Figure 4c. Both these
problems will be remedied later when we introduce the reachability predicate.
14
Some information on tvp files
For more information see the TVLA user manual.
Structure of a tvp file
tvp files are structured much like a YACC file using %% to separate sections and %x for
declarations, actions, etc. A tvp file has three parts:
Predicate definitions – declare core predicates, instrumentation predicates and
consistency rules (constraints) here.
Action definitions – declare actions here. Actions are used later in the CFG part.
Control flow graph (CFG) – The program to be analyzed is composed of CFG nodes with
edges connecting between them and actions to be performed on these edges.
General
The tvp file goes through a C pre-processor so usual pre-processor command can be used
(e.g. #include).
Comments are in C++ style. // comments until end of line. /* comments until */.
foreach – As expected foreach performs action for each predicate in the supplied
predicate set.
A-
E-
Declarations
%s – Set declaration.
%p – Core predicate. These predicates can be declared one for each programming
language and reused with each program analyzed.
%i – Instrumentation predicate. These are more program specific predicates.
%r – Consistency rule. Most consistency rules are generated automatically but additional
rules can be defined manually.
Predicate properties
unique – This predicate is true for at most one node.
function – This predicate is a partial function.
box – display this predicate in a box.
First two properties imply consistency rules. Third property is used only for display.
Actions
Actions are used much like functions in other programming languages. The main part of
an action is the update formula.
%action – Action declaration.
%t – Action title, used when printing the action's structures.
%f – Focus formulae. Applied before the precondition.
%p – Precondition. The precondition formula is evaluated to check whether this action
should be performed.
%message – Report a message if the formula given is potentially satisfied.
%new – A mechanism for creating new nodes.
15
isNew() – A special unary predicate created when %new is used and set true only for the
nodes created in this action.
Control Flow Graph
The CFG is built from edges and nodes. A cfg.node is declared implicitly by its usage in
an edge. The syntax for a CFG edge is:
<cfg.node> <action> <cfg.node>
This creates and edge between the two nodes with action performed to move from first
node to second node.
For nodes with no semantic meaning the action uninterpreted can be used
When translating a destructive update into a CFG node it is convenient to use two
actions. First explicitly set the pointer to null then assign it the new value.
Following is a sample tvp file used for analyzing reverse.c.
/* reverse.tvp */
%s PVar {x, y, t}
#include "include\pred.tvp"
%%
#include "include\cond.tvp"
#include "include\stat.tvp"
%%
// y = NULL
n_1 Set_Null_L(y) n_2
// while (x != NULL) {
n_2 Is_Null_Var(x) exit
n_2 Is_Not_Null_Var(x) n_3
// t = y
n_3 Copy_Var_L(t, y) n_4
// y = x
n_4 Copy_Var_L(y, x) n_5
// x = x->n
n_5 Get_Next_L(x, x) n_6
// y->n = NULL
n_6 Set_Next_Null_L(y) n_7
// y->n = t
n_7 Set_Next_L(y, t) n_2
// }
Predicate definitions
Action definitions
Control Flow Graph (tvp syntax is
preceded by the commented C code)
16
Following is the code used to analyze create.c
/* create.tvp */
%s PVar {x, f}
#include "include\pred.tvp"
%%
#include "include\cond.tvp"
#include "include\stat.tvp"
%%
// for(i=0; i<size; i++) {
n_1 uninterpreted() n_2
// f = malloc(sizeof(Elements));
n_2 Malloc_L(f) n_3
// f->n = x;
n_3 Set_Next_Null_L(f) n_4
n_4 Set_Next_L(f,x) n_5
// x = f;
n_5 Copy_Var_L(x, f) n_1
n_2 uninterpreted() exit
// }
Predicates are defined in pred.tvp
/* pred.tvp */
/*************** Core Predicates *************/
foreach (z in PVar) {
%p z(v_1) unique box
}
%p n(v_1, v_2) function
/**********************************************/
/*********** Instrumentation Predicates *************/
%i is[n](v) = E(v_1, v_2) ( v_1 != v_2 & n(v_1, v) & n(v_2, v))
Conditional actions are defined in cond.tvp
/* cond.tvp */
%action uninterpreted() {
%t "uninterpreted"
}
%action Is_Not_Null_Var(x1) {
%t x1 + " != NULL"
%f { x1(v) }
%p E(v) x1(v)
}
%action Is_Null_Var(x1) {
%t x1 + " == NULL"
%f { x1(v) }
17
%p !(E(v) x1(v))
}
%action Is_Eq_Var(x1, x2) {
%t x1 + " == " + x2
%f { x1(v), x2(v) }
%p A(v) x1(v) <-> x2(v)
}
%action Is_Not_Eq_Var(x1, x2) {
%t x1 + " != " + x2
%f { x1(v), x2(v) }
%p !A(v) x1(v) <-> x2(v)
}
All other actions are defined in stat.tvp
/* stat.tvp */
%action Set_Null_L(x1) {
%t x1 + " = NULL"
{
x1(v) = 0
}
}
%action Copy_Var_L(x1, x2) {
%t x1 + " = " + x2
%f { x2(v) }
{
x1(v) = x2(v)
}
}
%action Malloc_L(x1) {
%t x1 + " = (L) malloc(sizeof(struct node)) "
%new
{
x1(v) = isNew(v)
}
}
%action Get_Next_L(x1, x2) {
%t x1 + " = " + x2 + "->" + n
%f { E(v_1) x2(v_1) & n(v_1, v) }
%message (!E(v) x2(v)) -> "an illegal dereference to\n" +
n + " component of " + x2 + "\n"
{
x1(v) = E(v_1) x2(v_1) & n(v_1, v)
}
18
}
%action Set_Next_Null_L(x1) {
%t x1 + "->" + n + " = null"
%f { x1(v) }
%message (!E(v) x1(v)) -> "an illegal dereference to\n" +
n + " component of " + x1 + "\n"
{
n(v_1, v_2) = n(v_1, v_2) & !x1(v_1)
is[n](v) = is[n](v) & (!(E(v_1) x1(v_1) & n(v_1, v)) |
E(v_1, v_2) v_1 != v_2 &
(n(v_1, v) & !x1(v_1)) & (n(v_2, v) & !x1(v_2)))
}
}
%action Set_Next_L(x1, x2) {
%t x1 + "->" + n + " = " + x2
%f { x1(v), x2(v) }
%message (E(v, v1) x1(v) & n(v, v1)) ->
"Internal Error! assume that " + x1 + "->" + n + "==NULL"
%message (!E(v) x1(v)) -> "an illegal dereference to\n" +
n + " component of " + x1 + "\n"
{
n(v_1, v_2) = n(v_1, v_2) | x1(v_1) & x2(v_2)
is[n](v) = is[n](v) | E(v_1) x2(v) & n(v_1, v)
}
}
19
Reachability



Concrete semantics which records reachability from stack (global) variables
Useful for:
o Compile-time Garbage Collection. If we can prove a structure is garbage
at compile time we can insert the call to free it.
o Saving calls to Dynamic Garbage Collection. If we can prove there are no
garbage items at a specific point we can save calls to the garbage collector.
 No memory leaks. We might be able to prove a program creates no
garbage and so no memory leaks.
o Speeding-up static analysis. Despite of making the analysis more complex
it takes less time to complete.
Instrumentation predicate
r[n, x](v) = E(v1) x(v1) & n*(v1, v) (v is reachable from x with n steps).
Analyzing create.c again with reachability, we are now able to prove that create.c does
not leak any memory.
tvla rcreate include\empty
rcreate.tvp is the reachability enabled tvp file for create.c.
Analyzing reverse.c again with reachability.
tvla rreverse include\rlist
Figure 5: location n_2
Figure 5 shows one of the possible structure at location n_2 (loop conditional) in the new
analysis for reverse. We can now see that x and y point at disjoint lists.
20
Analyzing fumble
This is a faulty version of reverse which causes a memory leak.
typedef struct node {
struct node *n;
int data;
} *Elements;
Elements* reverse(Elements *x)
{
Elements *y,*t;
y = NULL;
while (x!= NULL) {
t = xn;
y = x;
xn = y;
x = t;
}
return y;
tvla rfumble include\rlist
In the next page you can see the control flow graph (Figure 6) for rfumble above. You
can also one of the warning messages issued for location n_4 (y = NULL). Notice the
message saying “potential memory leak”, because of the soundness of the analysis we
cannot be certain a memory leak will occur in this location.
21
Figure 6: CFG for rfumble
Figure 7: Location n_4
22
Some of the code used for analyzing with reachability. The reachability predicate is
defined for each of the program core variables. r[] uses an additional predicate c[] which
checks for cycles.
/* rpred.tvp */
/*************** Core Predicates *************/
foreach (z in PVar) {
%p z(v_1) unique box
}
%p n(v_1, v_2) function
/**********************************************/
/*********** Instrumentation Predicates *************/
%i is[n](v) = E(v_1, v_2) ( v_1 != v_2 & n(v_1, v) & n(v_2, v))
foreach (z in PVar) {
%i r[n,z](v) = E(v_1) (z(v_1) & n*(v_1, v))
}
%i c[n](v) = n+(v, v)
/* rstat.tvp */
%action Set_Null_L(x1) {
%t x1 + " = NULL“
%message !(|/{A(v)(x1(v) -> r[n,z](v)) : z in PVar -{x1}}) -> "potential memory leak
of element pointed-to-by\n" + x1 + "\n"
{
x1(v) = 0
r[n, x](v) = 0
}
}
%action Copy_Var_L(x1, x2) {
%t x1 + " = " + x2
%f { x2(v) }
{
x1(v) = x2(v)
r[n,x1](v) = r[n,x2](v)
}
}
%action Malloc_L(x1) {
%t x1 + " = (L) malloc(sizeof(struct node)) "
%new
{
x1(v) = isNew(v)
r[n, x](v) = isNew(v)
}
}
23
Benefits of Reachability




Predict memory leaks
Separate disjoint data structures
More precise semantic reduction
x = x->n
TVLA Design Mistakes
 The operational semantics is written in too low level language


o No types
o No local updates to specific predicate values
o No constants and functions
o No means for modularity
o No local variables and sequencing
o Combines UI with functionality
TVP can be a high level language
TVLA might have been better named 3VLA
TVLA Experience



Quite fast on small programs
o But runs on medium programs too (Around 5k lines)
Not a panacea
o panacea \pan-uh-SEE-uh\, noun:
A remedy for all diseases, problems, or evils; a universal medicine; a cureall.
More instrumentation may lead to faster (and more precise) analysis
24