Canonical Abstraction

Program Analysis: Lecture #11
Programming Languages Semantics / Mooly Sagiv, Noam Rinetzky
Lecture #11, 3 June 2004: Shape Analysis via 3-Valued Logic + TVLA
Notes by: Michal Spivak
Introduction
In this lecture we will go over:

- Canonical abstraction
  o Instrumentation
  o Embedding
- Abstract interpretation using canonical abstraction
  o Focusing
  o Materialization
  o Semantic reduction
  o Constraint solvers
- TVLA
  o Usage
  o Examples
Canonical Abstraction
Canonical abstraction is a way to convert logical structures of unbounded size into
structures of bounded size. The abstraction enables every first-order formula to be
interpreted conservatively.
Kleene Three-Valued Logic
- 1: True
- 0: False
- 1/2: Unknown
- A join semi-lattice: 0 ⊔ 1 = 1/2
The tables below define the Boolean connectives for Kleene's three-valued logic.

Boolean connectives [Kleene]:

 ∧  |  0   1/2   1
----+----------------
 0  |  0    0    0
1/2 |  0   1/2  1/2
 1  |  0   1/2   1

 ∨  |  0   1/2   1
----+----------------
 0  |  0   1/2   1
1/2 | 1/2  1/2   1
 1  |  1    1    1
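The connectives and the join can be sketched in a few lines of Python (the function names here are illustrative): with truth values ordered 0 < 1/2 < 1, Kleene conjunction is minimum, disjunction is maximum, and the information-order join ⊔ maps disagreeing values to 1/2.

```python
# Kleene three-valued logic, a minimal sketch.
# Truth values: 0 (false), 1 (true), 0.5 (unknown).
HALF = 0.5

def k_and(a, b):
    # Kleene conjunction = minimum under the truth order 0 < 1/2 < 1
    return min(a, b)

def k_or(a, b):
    # Kleene disjunction = maximum under the truth order
    return max(a, b)

def k_not(a):
    # Negation leaves 1/2 at 1/2
    return 1 - a

def join(a, b):
    # Information-order join: agreeing values stay, disagreement gives 1/2
    return a if a == b else HALF
```

Note that `join(0, 1) == HALF` reproduces the semi-lattice equation 0 ⊔ 1 = 1/2 above.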
3-Valued Logical Structures
A 3-valued logical structure consists of a set of individuals (nodes) U^S together with,
for each k-ary predicate p, an interpretation p^S: (U^S)^k → {0, 1, 1/2}.
Example: a ternary predicate assigns each triple of individuals a truth value,
e.g. p(u1, u2, u3) ∈ {0, 1, 1/2}.
Canonical abstraction
- Partition the individuals into equivalence classes based on the values of their
  unary relations.
  - Every individual is mapped into its equivalence class.
- Collapse the other relations via join:
  p^S(u'1, ..., u'k) = ⊔ { p^B(u1, ..., uk) | f(u1) = u'1, ..., f(uk) = u'k }
- The number of abstract individuals is at most 2^A, where A is the number of
  pointer variables in the program. For example, if we have the pointers x and t,
  there can be at most 4 abstract individuals: (x(u), t(u)) ∈ {(0,0), (0,1), (1,0), (1,1)};
  every u is mapped to one of these classes.
[Figure 1: a concrete three-node list u1 -n-> u2 -n-> u3, with both x and t pointing to u1.]
As we can see in Figure 1, (x(u2), t(u2)) = (x(u3), t(u3)), meaning u2 and u3 belong to
the same equivalence class. Therefore we collapse them into one node.
[Figure 2: the abstract structure after the collapse: x and t point to u1, with dotted
n-edges from u1 to the summary node u2,3 and from u2,3 to itself.]
Figure 2 demonstrates the graph after the collapse. The dotted lines represent a relation
with value 1/2, meaning the node's n-selector MAYBE points to the next node. We mark this
as 1/2 (= 0 ⊔ 1). The value of an abstract edge is defined by applying the ⊔ operator to all the edges
that connected these nodes in the original structure (in our example, Figure 1). Note that
if there is no edge between two nodes, it is equivalent to having an edge with value 0.
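As a sketch, this collapse can be computed directly; the concrete structure of Figure 1 (x and t pointing to u1, a three-node n-list) is hard-coded below as an assumption, and the helper names are illustrative.

```python
from itertools import product

HALF = 0.5

def join(vals):
    # Join of concrete values: a single common value survives, else 1/2
    vals = set(vals)
    return vals.pop() if len(vals) == 1 else HALF

# Concrete structure of Figure 1, hard-coded as an assumption:
U = ['u1', 'u2', 'u3']
unary = {'x': {'u1'}, 't': {'u1'}}       # x and t point to u1
n = {('u1', 'u2'), ('u2', 'u3')}         # n-edges; absent pairs have value 0

# 1. Map each individual to the vector of its unary-predicate values.
def canonical_name(u):
    return tuple(int(u in unary[p]) for p in sorted(unary))

classes = {}
for u in U:
    classes.setdefault(canonical_name(u), []).append(u)

# 2. Collapse the binary relation n via join over all concrete pairs
#    mapped to the same abstract pair.
abstract_n = {
    (c1, c2): join(1 if (u1, u2) in n else 0
                   for u1, u2 in product(m1, m2))
    for c1, m1 in classes.items() for c2, m2 in classes.items()}
```

Running this gives exactly the picture of Figure 2: u2 and u3 share the class (0,0), and both the edge from u1's class into it and its self-edge come out 1/2.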
[Figure 3: the concrete structure of Figure 1 next to its abstraction, showing the
concrete n-edges from u1 (value 1 to u2, value 0 to u3) joined into a single
1/2-valued edge from u1 to u2,3.]
Figure 3 shows why we get a dotted line between u1 and u2,3. The yellow line represents
an edge with value 0. We computed 0 ⊔ 1 and got 1/2. The rest of the edges were calculated
in the same way.
Canonical Abstraction and Equality
Summary nodes are nodes that represent more than one element.
(In)equality does not have to be preserved under abstraction. Meaning, if eq(u1, u2) = 1 in
the concrete world, it is possible that eq(u1, u2) = 1/2 in the canonical abstraction.
Summary nodes are defined as nodes for which eq(u,u) = 1/2, because eq(u,u) = 1 would mean
that u equals only itself, i.e., u cannot represent more than one node.
In the figures, summary nodes are marked with a distinct (dotted/double) circle.
Canonical abstraction for the program:

    t = NULL;
    while (...) do {
        t = malloc();
        t->next = x;
        x = t;
    }
The node u2,3 in Figure 4 is a summary node. We see that to obtain the information that x
and t point to the same element, it is enough to hold the first node of the list
separately; there is no need to distinguish between the other elements of the list.
[Figure 4: the concrete list x, t -> u1 -n-> u2 -n-> u3 and, after canonical abstraction,
x, t -> u1 with 1/2-valued n-edges from u1 to the summary node u2,3 and within it.]
Canonical Abstraction of Sets
The abstraction of a set of states is a power set of canonical representation graphs. This
enables relational analysis: the ability to represent relations between two states.
We ask the question: what is this good for? Why not just smash everything into one graph?
The reason is that we would lose information. This way we can keep more information
regarding the relations between things that occurred at the same time. For
example, we know that the list has at least two elements.
One of the interesting properties of canonical abstraction is that it is storeless. There
is no memory partition: the same runtime location may be represented by different abstract
locations in different concrete states, depending on its properties (unary/binary relations).
Embedding
Embedding is basically a mapping between two structures.
- A logical structure B can be embedded into a structure S via a surjective (onto)
  function f, written B ⊑_f S, if the basic relations are preserved, i.e.,
  p^B(u1, ..., uk) ⊑ p^S(f(u1), ..., f(uk))
  The analysis on the structure S is more conservative than on structure B,
  but it is sound.
- S1 ⊑_f S2 means every concrete state represented by S1 is also represented by S2
  (S1 is more precise).
- The sets of nodes in S1 and S2 may be different.
  - Nodes (abstract locations) have no intrinsic meaning.
An example of an embedding: structure I is a concrete list x -> n1 -> n2 -> n3 -> n4 -> n5;
structure II is x -> n1 -> n2 -> n3, where n3 is a summary node. The mapping is:

f(n1) = n1
f(n2) = n2
f(n3) = n3
f(n4) = n3
f(n5) = n3

f is an onto function because every element of II has some element of I mapped to it.
Every concrete state represented by I is also represented by II.
S is called a tight embedding of B with respect to f if it does not lose unnecessary
information, meaning:
p^S(u1#, ..., uk#) = ⊔ { p^B(u1, ..., uk) | f(u1) = u1#, ..., f(uk) = uk# }
Canonical abstraction is a tight embedding.
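The embedding condition can be checked mechanically. Below is a minimal sketch: the five-node list and its three-node abstraction from the example above are hard-coded, and the 1/2-valued edges of the abstract structure are an assumption (they are what a tight embedding would produce).

```python
HALF = 0.5

def leq_info(a, b):
    # Information order on truth values: a ⊑ b iff a == b or b == 1/2
    return a == b or b == HALF

def is_embedding(f, universe_B, p_B, p_S):
    # B ⊑_f S: p_B(u1, u2) ⊑ p_S(f(u1), f(u2)) for every pair of individuals
    return all(leq_info(p_B(u1, u2), p_S(f[u1], f[u2]))
               for u1 in universe_B for u2 in universe_B)

# Concrete list n1 -> n2 -> n3 -> n4 -> n5 and its abstraction,
# where n3 in structure II is a summary node.
edges_B = {('n1', 'n2'), ('n2', 'n3'), ('n3', 'n4'), ('n4', 'n5')}
vals_S = {('n1', 'n2'): 1, ('n2', 'n3'): HALF, ('n3', 'n3'): HALF}

p_B = lambda u1, u2: 1 if (u1, u2) in edges_B else 0
p_S = lambda u1, u2: vals_S.get((u1, u2), 0)

f = {'n1': 'n1', 'n2': 'n2', 'n3': 'n3', 'n4': 'n3', 'n5': 'n3'}
```

With these definitions `is_embedding(f, ...)` holds, while redirecting f(n4) to n2 would violate the condition (a concrete 1-edge would land on an abstract 0-edge).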
Embedding Theorem
- Assume B ⊑_f S, i.e.,
  p^B(u1, ..., uk) ⊑ p^S(f(u1), ..., f(uk))
- Then every formula φ is preserved:
  - If φ = 1 in S, then φ = 1 in B.
  - If φ = 0 in S, then φ = 0 in B.
  - If φ = 1/2 in S, then φ could be 0 or 1 in B.
[The abstract structure from Figure 2: x and t point to u1, with 1/2-valued n-edges
from u1 to the summary node u2,3 and from u2,3 to itself.]
Below are formulas applied to the structure above and their values:

∃v: x(v) — 1 = Yes (x -> u1)
∃v: x(v) ∧ t(v) — 1 = Yes (x -> u1 and t -> u1)
∃v: x(v) ∧ y(v) — 0 = No (y doesn't point to anything)
∃v1,v2: x(v1) ∧ next(v1, v2) — 1/2 = Maybe (u1 maybe-> u2,3)
∃v1,v2: x(v1) ∧ next(v1, v2) ∧ next*(v2, v1) — 0 = No (there is no path back to u1)
∃v1,v2: x(v1) ∧ next*(v1, v2) ∧ next+(v2, v2) — 1/2 = Maybe (u1 maybe-> u2,3)
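The quantifier-free evaluations above can be reproduced mechanically: in Kleene logic ∧ is minimum and ∃ is the 'or' (maximum) over all assignments. A small sketch, where the encoding of the abstract structure and the helper names are assumptions (the transitive-closure formulas are omitted):

```python
HALF = 0.5

def k_and(a, b):
    # Kleene conjunction
    return min(a, b)

def exists(vals):
    # 3-valued existential: Kleene 'or' (max) over all assignments
    return max(vals, default=0)

# Abstract structure from the figure: x, t -> u1; 1/2-valued n-edges
# from u1 to the summary node u23 and from u23 to itself.
U = ['u1', 'u23']
x, t, y = {'u1': 1}, {'u1': 1}, {}
nxt = {('u1', 'u23'): HALF, ('u23', 'u23'): HALF}

def pred(d, *args):
    # Look up a unary or binary predicate value; absent entries are 0
    return d.get(args if len(args) > 1 else args[0], 0)

f1 = exists(k_and(pred(x, v), pred(t, v)) for v in U)          # ∃v: x(v) ∧ t(v)
f2 = exists(k_and(pred(x, v), pred(y, v)) for v in U)          # ∃v: x(v) ∧ y(v)
f3 = exists(k_and(pred(x, v1), pred(nxt, v1, v2))              # ∃v1,v2: x(v1) ∧ next(v1,v2)
            for v1 in U for v2 in U)
```

By the embedding theorem, the definite answers (f1 = 1, f2 = 0) carry over to every concrete structure the abstraction represents, while f3 = 1/2 stays "maybe".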
Instrumentation
One of the limitations of embedding is that information on summary nodes is lost; this
sometimes leads to useless verification results.
Ways to increase the precision include:
- User (programming language) supplied global invariants.
- Recording extra information in the concrete interpretation.
Instrumentation refines the abstraction by recording such invariants as extra predicates.
It gives us the option to keep track of a property that interests us during the
analysis and that would have been lost had we not defined the instrumentation.
Some examples:
Example 1: Heap Sharing
is(v) = ∃v1,v2: n(v1,v) ∧ n(v2,v) ∧ v1 ≠ v2
This predicate keeps track of which nodes are shared and which are not. For example, if
we are running an analysis on a linked list, we can check whether its tail is cyclic by
checking whether one of its nodes is shared.
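On a concrete structure the is(v) predicate is simple to compute; a minimal sketch, using a hypothetical list whose tail cycles back (node names are illustrative):

```python
def is_shared(v, n_edges):
    # is(v) = ∃v1, v2: n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2
    predecessors = {v1 for (v1, v2) in n_edges if v2 == v}
    return len(predecessors) >= 2

# Hypothetical list u1 -> u2 -> u3 whose tail points back to u2:
edges = {('u1', 'u2'), ('u2', 'u3'), ('u3', 'u2')}
# u2 has two distinct n-predecessors (u1 and u3), so it is shared;
# detecting a shared node is exactly how the cyclic tail is revealed.
```

In the analysis this value is computed on concrete structures and then preserved by the abstraction, as Figure 5 below illustrates.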
Figure 5
Figure 5 shows how the abstraction preserves the is-shared property, since is(u2) = 1
before and after the abstraction. If is(u2) were equal to 0, we could have "saved" an
extra node in the representation and represented u2 by the summary node.
This is a very important property for the program.
Why do we need heap sharing and not stack sharing?
The answer is simple: we don't need to keep information on stack sharing since it is kept
automatically for us by the canonical abstraction; each variable on the stack is kept in
the abstraction, along with the information on where it points. However, during the
abstraction we lose information regarding heap sharing, so it requires special handling.
Example 2: Reachability
Reachability(u): from which nodes can we reach u? This can be used, for example, to
improve the garbage collector by finding, at compile time, nodes that are not reachable
from any variable (which means they can be collected), or by finding that ALL nodes are
reachable (which means we can save a call to the garbage collector).
For example, take a look at the following figure:
[A structure with two summary nodes: one reachable from x but not from y, and one
reachable from both.]
We see that the two summary nodes represent two different properties. The first summary
node can be reached by x, and not reached by y. The second summary node can be
reached by both of them. Without the additional reachability instrumentation predicate
we would have got:
[A single summary node reachable from both x and y.]
This shows that all the nodes in the summary node may be reached from both x and y,
i.e., we lose information.
Instrumentation refines the abstraction. Choosing the instrumentation predicates depends
on the application: different properties are interesting in different applications. Adding
instrumentation improves the precision; however, it increases the worst-case analysis time.
Some more examples of instrumentation predicates:
- reachable-from-variable-x(v)
- cfb(v)
- tree(v)
- dag(v)
- inOrder(v) = ∀v1: b(v1, v) → data(v) ≤ data(v1)
"Focus"-Based Transformer (x = x->n)
The focus-based transformer is another way to increase precision. The idea is to
transform a structure into a different structure (a set of abstract structures) that
"extracts" from the summary node the elements we want to focus on. Focusing then lets us
determine whether these focused nodes' predicate values are 0 or 1 (and not 1/2).
This way we get more nodes with better precision. We apply the transformer only after
we perform the focusing. The focusing process can also be thought of as an inverse
embedding: it translates from a structure with fewer elements (less information) into a
set of structures that contain more information.
The slide above demonstrates the action. Intuitively, we start with a structure in which
x->n lies inside a summary node, meaning that after applying x = x->n, x will point to a
node described by the summary node, and thus the value of the x predicate will be 1/2.
Therefore, we "focus" on the node that x->n points to; focusing means we "extract" that
node from the summary node. After focusing we apply the formula and obtain a structure
with more information: now we know that x points to the 'next' element of y.
Performing the abstraction on the initial structure would have lost this information. It
would have resulted in the figure below:
[Abstraction of the initial structure: the node pointed to by y->next is merged into the
summary node, so x's target becomes indefinite.]
In this figure the information that x points to y->next is lost.
To get a better understanding of focusing we will define Materialization.
Materialization
(The instrumentation predicate t[n](v1, v2) means that v2 is reachable from v1 along
n-fields.)
We can see from the illustration above that we lose the information that x points to the
next node of y: x can point to any node described by the summary node.
We want to "materialize" the node, similar to an "abstract memory allocation": we divide
the summary node into two nodes and focus on the newly "allocated" one. The figure above
shows that by materializing the node that y->next points to, we keep the information that
was lost in the earlier figure, namely that x points to the next of y.
Note that here the materialization is done at the same time as the transformer is applied;
there are no two stages. Focusing differs from materialization in that it first creates
all the focused structures and only then applies the transformer.
Focusing generalizes materialization.
The focusing principle
- To increase precision:
  - "Bring the predicate-update formula into focus" (force 1/2 to 0 or 1).
  - Then apply the predicate-update formula.
- Generalizes materialization.
Example 1: ∃v1: x(v1) ∧ n(v1,v)
First stage: focus on ∃v1: x(v1) ∧ n(v1,v).
Explanation: we want this formula to evaluate to either 0 or 1 (not 1/2) for every node.
[Figure 6: x and y point to u1; a 1/2-valued n-edge leads from u1 to the summary node u,
and t[n] edges record reachability.]
Here we see that:
∃v1: x(v1) ∧ n(v1,u1) = 1 ∧ 0 = 0 (u1 does not point to itself), so there is nothing to do
here.
∃v1: x(v1) ∧ n(v1,u) = 1 ∧ 1/2 = 1/2 (u is a summary node, so n(u1,u) = 1/2), so we want
to create structures where x(v1) ∧ n(v1,u) results in either 0 or 1.
Figure 7
The three structures above give definite values (0 or 1) to this formula for every node.
In the first structure, ∃v1: x(v1) ∧ n(v1,u) = 1 ∧ 0 = 0.
The second structure shows the state where the n-selector of u1 points to all nodes in the
summary node: ∃v1: x(v1) ∧ n(v1,u) = 1 ∧ 1 = 1.
The third structure shows the state where the n-selector of u1 points to part of the list
(u.1) and the n-selector of u.1 points to the rest of the list (u.0):
∃v1: x(v1) ∧ n(v1,u.1) = 1 ∧ 1 = 1 and ∃v1: x(v1) ∧ n(v1,u.0) = 1 ∧ 0 = 0.
Notice that the three structures in Figure 7 represent the same stores as the structure in
Figure 6.
Stage 2: evaluate the predicate-update formula.
Some points to notice:
In Figure 7, the first structure represents an impossible situation: the summary node is
marked as reachable, but we can see that it is not.
After evaluating the predicate-update formulae on the second structure in Figure 7, x will
point to a summary node; but we know that x can point to only one node, therefore this
summary node actually represents just one node. The same observation holds for the
summary node u.1 in the third structure.
Semantic reduction
Sometimes we can't write the best transformer, but we can write a "better" transformer.
Semantic reduction is used when certain properties/dependencies of the program can be
observed; for example, a property such as "everything that is pointed to by a variable is
not a summary node". By observing these dependencies and applying them to the abstract
representation, we can improve the precision of the analysis by recovering properties of
the program semantics.
Definition:
- Given a Galois connection (L1, α, γ, L2), an operation op: L2 → L2 is a semantic
  reduction if the following holds:
  - ∀l ∈ L2: op(l) ⊑ l
  - γ(op(l)) = γ(l)
Meaning, basically, that op gives a more precise abstract value that still has the same
concrete value.
[Diagram: l and op(l) are elements of L2 with op(l) ⊑ l; both have the same γ-image in L1,
therefore the reduction is sound.]
The semantic reduction can be applied before and after basic operations. The decision
whether to apply it or not is based on performance considerations.
The Focus Operation
The focus operation maps one set of 3-valued structures to another set of 3-valued
structures, in which each structure has a definite value (0 or 1) for a given formula.
- Focus: Formula → (P(3-Struct) → P(3-Struct))
  Focus is a partial function: for some inputs it is undefined, namely when the number of
  structures that would have to be returned is infinite.
- For every formula φ:
  - Focus(φ)(X) yields structures in which φ evaluates to a definite value (0 or 1) in all
    assignments.
  - Focus(φ) is a semantic reduction:
    - ∀l ∈ L2: Focus(φ)(l) ⊑ l
    - γ(Focus(φ)(l)) = γ(l)
  - Focus(φ)(X) may be undefined for some X (in case it creates an infinite number of
    structures).
Some examples:
Example 1: Focus on ∃v1: x(v1) ∧ n(v1,v)
The same set of stores is represented on both sides (before and after focusing). After we
"focus", we get a set of structures where the value of the formula ∃v1: x(v1) ∧ n(v1,v) is
always definite.
Evaluating the predicate-update formula using this focus results in the figure below:
The update formula is applied on each of the structures that we received from the focus
operation.
Example 2: Focus on ∃v1: n(v1,v)
Trying to focus in this example is problematic. We are looking for structures that give
the formula the value 1 for every node u.
[Three of the infinitely many candidate structures, each with x and y pointing to u1 and
progressively more materialized nodes before a trailing summary node.]
As we can see, there can be an infinite number of structures that give this formula a
value of 1, because we are always left with a trailing summary node for which the formula
returns 1/2.
The Coercion Principle
- Another semantic reduction.
- Can be applied after Focus, after Update, or both.
- Increases precision by exploiting structural properties possessed by all stores
  (global invariants), for example that a variable can only point to one node.
- Structural properties are captured by constraints.
- Apply a constraint solver.
Example 1: Constraint: a variable can only point to one node.
The constraint solver can turn the middle summary node into a regular node, since x points
to it. Now we can determine a new property: x in fact points to the node pointed to by
y->next.
Some more example constraints:
Constraints derived from the concrete semantics:
x(v1) ∧ x(v2) → eq(v1, v2) (whatever is pointed to by a variable is not a summary node)
n(v, v1) ∧ n(v, v2) → eq(v1, v2) (every field can point to only one location)
Constraints derived from instrumentation:
n(v1, v) ∧ n(v2, v) ∧ ¬eq(v1, v2) → is(v) (heap sharing)
n*(v1, v2) → t[n](v1, v2) (reachability)
Sources of constraints:
- Properties of the operational semantics
- Domain-specific knowledge
  - Instrumentation predicates
- User supplied

Applying the constraint solver:
Constraint 1:
n(v, v1) ∧ n(v, v2) → v1 = v2, or equivalently: v1 ≠ v2 ∧ n(v, v1) → ¬n(v, v2)
We take v2 = u.0, v1 = u.1 and get that v1 ≠ v2 and n(u1, u.1) = 1, hence ¬n(u1, u.0);
therefore we can remove the edge marked in pink.
Constraint 2:
n(v1, v) ∧ n(v2, v) ∧ v1 ≠ v2 → is(v), or equivalently:
¬is(v) ∧ n(v1, v) ∧ v1 ≠ v2 → ¬n(v2, v)
We take v1 = u1, v2 = u.1, v = u.1: is(u.1) = 0, n(u1, u.1) = 1 and v1 ≠ v2, therefore
¬n(u.1, u.1), so we can remove that (self-loop) edge.
Constraint 3:
The same constraint: ¬is(v) ∧ n(v1, v) ∧ v1 ≠ v2 → ¬n(v2, v)
We take v1 = u1, v2 = u.0, v = u.1 and get ¬n(u.0, u.1); therefore we can remove the
edge between u.0 and u.1.
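Constraint 1 can be applied mechanically as a coercion step. A minimal sketch: the node names follow the example above, the edge dictionary and function name are assumptions.

```python
HALF = 0.5

def apply_function_constraint(n):
    """Coercion for n(v, v1) ∧ n(v, v2) → v1 = v2:
    a node with a definite (value-1) n-successor cannot also have a
    1/2-valued n-edge to a different node, so such edges drop to 0."""
    changed = True
    while changed:
        changed = False
        for (v, v1), val in list(n.items()):
            if val != 1:
                continue
            for (w, v2) in list(n):
                if w == v and v2 != v1 and n.get((w, v2)) == HALF:
                    del n[(w, v2)]   # edge forced to 0: remove it
                    changed = True
    return n

# u1 definitely points to u.1 and maybe points to u.0 (the pink edge):
edges = {('u1', 'u.1'): 1, ('u1', 'u.0'): HALF}
apply_function_constraint(edges)
```

After the call, only the definite edge ('u1', 'u.1') remains, mirroring the removal of the pink edge in the example.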
Constraint 4:
The constraint identifies that the middle node is in fact a single node and does not
represent several nodes.
After applying all the constraints we get a more precise result.
Summary:
Below is an explanation of how the different methods we discussed solve some difficult
issues of shape analysis (the solutions are prefixed with "Solution:").
- Destructive updating through pointers
  o p->next = q
  o Produces complicated aliasing relationships
  o Solution: track aliasing on 3-valued structures; 3-valued logic guarantees soundness
- Dynamic storage allocation
  o No bound on the size of runtime data structures
  o Solution: canonical abstraction yields finite-sized 3-valued structures
- Data-structure invariants typically only hold at the beginning and end of operations
  o Need to verify that data-structure invariants are re-established
  o Solution: query the 3-valued structures that arise at the exit
Predicate logic naturally expresses an SOS (structural operational semantics) for
languages with pointers and dynamically allocated structures; 3-valued logic provides a
sound abstract interpretation.
Next we will discuss TVLA and its applications.
TVLA: 3-Valued Logic Analyzer
http://www.cs.tau.ac.il/~tvla
TVLA is an evolving research vehicle for abstract interpretation, featuring:
- A powerful language for expressing concrete semantics.
- Automatic generation of abstract interpreters from concrete semantics.
- Tunable abstractions.
- Natural suitability for checking safety properties of heap-allocated data, and also
  other kinds of static analysis.

TVLA does not miss errors, and it generates very few false alarms; for an average
program it may not generate false alarms at all.
TVLA inputs
- TVP (Three-Valued Program)
  This file defines the semantics of the program. It consists of:
  o Predicate declarations
  o Action definitions (SOS)
    - Statements
    - Conditions
  o Control flow graph
- TVS (Three-Valued Structure)
  This is, in effect, the initializing state. It gives the initial values, i.e., the
  abstract structures the analysis starts from. This is useful for "expensive" analyses
  that cannot be run on an entire program. For example, when reversing a list, we can
  initialize the analysis to continue reversing from the "middle" rather than from the
  beginning.
Understanding the TVP file structure
The TVP file is divided into 3 parts, separated by %%:
Part 1: Declarations
Part 2: Actions
Part 3: Control Flow Graph
We use the notation <A>, to denote a comma-separated list of A's:
<var>, == <var1>, <var2>, ..., <vark>
Part 1: Declarations
%s <id> {<var>,}: set declaration. Defines the variables of the program.
Example: %s PVar {x, elem} // PVar is the set of elements containing x, elem
%p <pred-name>(<var>,) <flags>: core predicate declaration.
Example: %p n(v_1, v_2) function
%i <pred-name>(<var>,) = <formula>: instrumentation rule.
Example: %i is[n](v) = E(v_1, v_2) (v_1 != v_2) & n(v_1, v) & n(v_2, v) (E = ∃)
%r <formula> ==> <formula>: consistency rule.
Example: !dle(v_1, v_2) ==> dle(v_2, v_1) (sets a total order between every two elements;
dle = "data less than or equal")
Some of the flags used when declaring a predicate:
unique: true for at most one node; generates the constraint p(v1) & p(v2) => v1 == v2
box: display p in a box; graphical influence only.
function: partial function; generates the constraint n(v1, v2) & n(v1, v3) => v2 == v3
Part 2: Actions
%action id(<var>,) {...}: action declaration.
%t <message>: action title, used when printing the action's structures.
%f {<formula>,}: focus formulae, applied before the precondition.
%p <formula>: precondition. The precondition formula is evaluated to check whether this
action should be performed. If the formula contains free variables, the action is
performed for each assignment to these variables that potentially satisfies the formula.
%message <formula> => <message>: report the message if the formula is true.
%new: a mechanism for creating new nodes. A unary predicate called isNew(v) is created
and set to true only for the nodes created in this action.
%retain <formula>: a mechanism for getting rid of unwanted nodes. Only nodes that
satisfy the formula are retained.
Example:
%action Copy_Var_L(x1, x2) {
  %t x1 + " = " + x2 // the printed title
  %f { x2(v) } // focus: we want x2(v) NOT to be 1/2
  {
    x1(v) = x2(v)
  }
}
Part 3: Control Flow Graph
The program to be analyzed is composed of CFG nodes, with edges connecting them and
actions to be performed on these edges. A CFG node is declared implicitly by the
existence of incoming or outgoing CFG edges. The action used in a CFG edge must be
predefined in the actions section.
A CFG edge is defined as: <cfg_node> <action> <cfg_node>
If there is no action to be performed, we use "uninterpreted()".
Example:
n_1 Copy_Var_L(elem, x) n_2
A few more notes:
1) Preprocessing
The TVP file can be preprocessed using the standard C preprocessor before being parsed
by the system. The preprocessor enables file inclusion (using the #include directive),
macro expansion (using the #define directive), and conditional evaluation (using the #if,
#endif, etc. directives).
2) Comments
The TVP file supports both /* */ and // style comments.
Examples shown in class:

/* A TVP program for the null_deref function from elem_lib.c */

Part 1: declarations
%s PVar {x, elem}
#include "pred.tvp"
%%
Part 2: actions
#include "cond.tvp"
#include "stat.tvp"
%%
Part 3: control flow graph
/* elem = x; */
n_1 Copy_Var_L(elem, x) n_2
n_2 Copy_Var_L(elem, x) n_3
/* while (x != NULL) { */
n_3 Is_Null_Var(x) exit
n_3 Is_Not_Null_Var(x) n_4
/* if (elem->val == value) */
n_4 uninterpreted() exit    /* return 1; */
n_4 uninterpreted() n_5
/* elem = elem->next; */
n_5 Get_Next_L(elem, elem) n_3
/* } */
/* return 0; */

The CFG corresponds to the code:

    elem = x;
    while (x != NULL) {
        if (elem->val == value)
            return 1;
        elem = elem->next;
    }
    return 0;
Below are some of the TVLA results for running this program.
This figure is the control flow graph. The titles on the edges are the strings we gave in
the %t option.
TVLA outputs the possible abstract structures for each node. In addition, if there are
warnings, TVLA prints them for the specific node.
For example, one of the outputs for node 3 is displayed below.
This figure also includes the reachability analysis.
We see that nodes 3 and 2 are both reachable from elem and x, whereas node 1 is only
reachable from x.
One of the outputs for node 4 was a message identifying that elem may be pointing to null.
TVLA Summary
New TVLA features:
- Automatic generation of predicate-update formulae for instrumentation
- Java frontend
- Efficient representation of structures
- Better performance on large programs
TVLA experience:
- Quite fast on small programs, but runs on medium programs too
- Not a panacea
- More instrumentation may lead to faster (and more precise) analysis
- Manually updating instrumentation predicates is difficult
TVLA design mistakes:
- The operational semantics is written in too low-level a language:
  o No types
  o No local updates to specific predicate values
  o No constants and functions
  o No means for modularity
  o No local variables and sequencing
  o Combines UI with functionality
- TVP can be a high-level language
- "instrumentation" = "derived"
- TVLA → 3VLA