Flow Insensitive C++ Pointers and Polymorphism Analysis and its Application to Slicing
Paolo Tonella, Giuliano Antoniol,
Roberto Fiutem
I.R.S.T.
via. Alla Cascata
I-38050 Povo (Trento), Italy
461.314444
{tonella, antoniol, fiutem}@irst.itc.it
Ettore Merlo
Dep. Electrical and Computer Engineering
Ecole Polytechnique
C.P. 6079, Succ. Centre Ville
Quebec, Canada
514.340575s
[email protected]
ABSTRACT
Large software systems are difficult to understand and maintain. Code analysis tools can provide programmers with different views of the software which may help their understanding activity. To be applicable to real programs written in modern programming languages, these tools need to efficiently handle pointers. In the case of C++ analysis, object oriented peculiarities (like, e.g., polymorphism) have to be accounted for as well.
We propose a flow insensitive, context insensitive points-to analysis capable of dealing with the features of object oriented code. It is extremely promising because of the positive trade-off between complexity and accuracy. The integration of the points-to results with other analyses, such as reaching definitions and slicing, is also discussed in the context of our program understanding environment.
Keywords
Points-to analysis, C++, polymorphism, slicing, program understanding, flow analysis, reverse engineering.

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee.
ICSE 97 Boston MA USA
Copyright 1997 ACM 0-89791-914-9/97/05 ..$3.50

INTRODUCTION
Many flow analyses have been applied to software engineering. For example, reaching definitions analysis has been used in data flow testing [10]. Slicing has been used in maintenance, debugging, reuse, testing, reverse engineering, metrics and architectural understanding [4, 5, 6, 8, 9, 11, 12, 22]. Constant propagation has been used in user interface reengineering [2, 14].
Modern programming languages contain features like pointers and objects. Object-Oriented (OO) characteristics such as classes, inheritance, polymorphism and dynamic binding augment language expressivity and offer new possibilities to programmers. On the other hand, they represent new challenges to flow analyses, which had originally been developed for languages not including such concepts.
Different flow and context sensitive algorithms have been developed to obtain accurate results on pointer analysis, data dependences and slicing, at the expense of time and memory complexity [7, 13, 15, 16]. However, when the objective is to analyze large industrial software systems, efficiency becomes crucial. The importance of controlling the time complexity has been investigated in [3], where a solution based on a fine grained context sensitivity specification is proposed. In [20], we have presented a variable precision reaching definitions analysis with similar objectives of scalability and control.
To make pointer analysis tractable for large software systems, a flow and context insensitive Points-To Analysis (PTA) has been recently presented in [17, 18] for the C language. It has an almost linear worst case time complexity and is also very fast in practice.
The purpose of the present work is to extend the mentioned flow insensitive points-to algorithms to handle C++ peculiarities (class attributes, polymorphism, templates, dynamic binding). Flow insensitive points-to results have been compared with those obtained with a flow sensitive technique [15], to assess the practical accuracy of the proposed extensions.
The results of the OO PTA are integrated in the IRST program understanding environment, which contains different flow analyses for program and architectural understanding. Some examples of program slicing based on flow insensitive points-to analysis are presented.
The paper is organized as follows: the next section is an overview of the flow insensitive points-to analysis of C code. It is followed by the description of the proposed flow insensitive points-to algorithm for C++. Then our results are compared with those obtained through a flow and context sensitive analysis on a test suite. The following section discusses the experimental results. Finally the integration of the PTA with reaching definitions and slicing is presented. The last section presents conclusions and future work.
FLOW INSENSITIVE POINTS-TO ANALYSIS OF C CODE: AN OVERVIEW
The purpose of the PTA is to determine the points-to set of every location, i.e., the set of locations which may be pointed to by a given location (by location we mean stack and heap allocated objects, data members and variables). PTA can be performed by adopting the non-standard set of types defined in [17, 18], denoted as $\tau$-types.
$\tau$-types are not related to concrete types: they serve to model how storage is used in a program. As types may be recursive, type variables are introduced to describe mutually referencing objects. In the following, type variables are denoted as $\tau_i$, where $i$ is an integer identifying the type. Each location may contain actual data (i.e., it does not reference other locations), in which case its type is $\tau_i = ref(\bot)$, or may contain a location address, which is indicated as $\tau_i = ref(\tau_j)$.
Once the PTA has been performed, each location is associated with a type variable, say $\tau_i = ref(\tau_j)$ or $\tau_i = ref(\bot)$. $\tau_j$, referenced by $\tau_i$, models the points-to relation in that all locations of type $\tau_j$ are possibly pointed to by the locations of type $\tau_i$. The points-to set of a location $x$ is thus:
$$points\text{-}to(x) = \{\, y \mid x : \tau_i = ref(\tau_j) \,\wedge\, y : \tau_j \,\}$$
The points-to set of a location with $\tau_i = ref(\bot)$ is obviously empty. The points-to relation between locations is conveniently represented by the storage shape graph (SSG) [17], where each node is a $\tau$-type (and is labelled with all locations of that type), and a directed edge exists between $\tau_i$ and $\tau_j$ if $\tau_i = ref(\tau_j)$. Initially all locations are associated with different type variables, $\tau_i$, each initialized as $\tau_i = ref(\bot)$. Then the instructions of the program are analyzed in an arbitrary order (the analysis is both flow and context insensitive), and each time a pointer affecting statement is encountered, types are merged and referenced types are updated by means of the join() or c-join() functions. When different locations are pointed to by the same pointer, their types have to be merged. The join() function is used to unify the types to one single type, and is recursively applied to the referenced types. When the pointed to locations have not yet been encountered, the pointer is a reference to bottom, and type merging is delayed. The c-join() function inserts the type into a pending list, and a real unification is performed only if a subsequent PTA step transforms the reference to bottom into a reference to a non-bottom type [17].
The points-to algorithm infers a typing environment under which the program is well-typed, i.e., the SSG indicated by the types is a safe (conservative) description of all possible dynamic storage configurations [17]. Well-typedness can be expressed in terms of typing constraints [17], the satisfaction of which is obtained through the application of join() or c-join(), as described above.
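The merging of type variables described in this overview can be illustrated with a small union-find sketch. This is not the implementation of [17]: the struct `TypeStore`, the constant `kBottom` and all member names are hypothetical, type variables are plain integers, and member attributes and function types are left out. The code only shows how an unconditional join and a delayed (conditional) join might interact.

```cpp
#include <cassert>
#include <vector>

constexpr int kBottom = -1;  // stands for the bottom type

// Illustrative store of type variables; all names here are hypothetical.
struct TypeStore {
    std::vector<int> parent;                // union-find forest over type variables
    std::vector<int> ref;                   // referenced type, or kBottom
    std::vector<std::vector<int>> pending;  // assignments delayed by the conditional join

    // New type variable, initialised as ref(bottom).
    int fresh() {
        int t = static_cast<int>(parent.size());
        parent.push_back(t);
        ref.push_back(kBottom);
        pending.emplace_back();
        return t;
    }

    // Representative of the class of t, with path halving.
    int find(int t) {
        while (parent[t] != t) {
            parent[t] = parent[parent[t]];
            t = parent[t];
        }
        return t;
    }

    // Type referenced by t, or kBottom.
    int deref(int t) {
        int r = ref[find(t)];
        return r == kBottom ? kBottom : find(r);
    }

    // Turn a reference to bottom into a real reference and replay delayed merges.
    void set_ref(int t, int target) {
        t = find(t);
        ref[t] = find(target);
        std::vector<int> todo;
        todo.swap(pending[t]);
        for (int lhs : todo) assign(lhs, t);
    }

    // join: unify a and b, recursively unifying the types they reference.
    void join(int a, int b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        int ra = ref[a], rb = ref[b];
        std::vector<int> delayed;
        delayed.swap(pending[b]);
        parent[b] = a;
        if (ra == kBottom && rb == kBottom) {
            pending[a].insert(pending[a].end(), delayed.begin(), delayed.end());
        } else if (ra == kBottom) {
            set_ref(a, rb);
        } else if (rb == kBottom) {
            for (int lhs : delayed) assign(lhs, a);
        } else {
            join(ra, rb);
        }
    }

    // lhs = rhs: conditional join of the referenced types.
    void assign(int lhs, int rhs) {
        lhs = find(lhs);
        rhs = find(rhs);
        if (ref[rhs] == kBottom) {  // nothing to merge yet: delay
            pending[rhs].push_back(lhs);
        } else if (ref[lhs] == kBottom) {
            set_ref(lhs, ref[rhs]);
        } else {
            join(ref[lhs], ref[rhs]);
        }
    }

    // lhs = &rhs: the type referenced by lhs is joined with the type of rhs.
    void address_of(int lhs, int rhs) {
        lhs = find(lhs);
        if (ref[lhs] == kBottom) set_ref(lhs, rhs);
        else join(ref[lhs], rhs);
    }

    // True if location type y is in the points-to set of location type x.
    bool may_point_to(int x, int y) {
        int r = deref(x);
        return r != kBottom && r == find(y);
    }
};
```

Processing `p = &a; q = p; p = &b;` with this sketch merges the types of `a` and `b`, so both end up in the points-to sets of `p` and `q`, which is the loss of precision accepted in exchange for the low cost of unification.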
FLOW INSENSITIVE POINTS-TO ANALYSIS OF C++ CODE
A flow insensitive PTA method, which covers the peculiarities of modern programming languages, is presented. It is an extension of Steensgaard's work on C [17, 18]. Although the approach is not limited to a particular OO language, this paper focuses on C++ in order to illustrate results on real software developed with a widely used programming language. All OO features are considered: pointers to objects, dynamic object allocation, single and multiple inheritance, recursive data structures, recursive methods, virtual functions, dynamic binding and pointers to methods.
Since both flow insensitivity and context insensitivity are preserved in our extensions, the language control structures and the calling sequences of functions (or methods) do not affect the analysis and are therefore not taken into consideration.
    A ::= (type-id) A
        | * A
        | A [EXPR]
        | A.id
        | A->id
        | id
        | (A)
        | new CLASS
Figure 1: Grammar of the access paths. Dereference operators are: *, -> and [EXPR].
The approach presented here integrates the notion of access paths [15] and that of the non-standard set of $\tau$-types [17] with a new type inference scheme, which has been defined first to take into account C++ object members under the access path framework and second to be compatible with the low cost PTA presented in [17]. To deal with polymorphism, we integrate the $\tau$-types with the standard type information (denoted as concrete types), which can be determined by syntactic analysis of C++ code.
An access path is a unary expression [19] described by the grammar of Figure 1. It contains an arbitrary sequence of dereferences and accesses to data members. Dereferences and member accesses may be combined through the -> operator. Array component accesses are considered as a particular kind of dereference, since they can be obtained through normal dereferences as well. An access path represents a fixed location if it does not contain dereference operators (*, -> and [EXPR]) [15]. Otherwise it represents a variable location. Variable locations, depending on the possible values of the involved pointers, may be associated with different fixed locations during the execution.
It can be shown [15] that every access path which includes n > 0 dereferences is aliased to at least one fixed location object. Hence it is possible to statically associate every variable location with the set of fixed locations that may be dynamically accessed through it.
A dynamic allocation is represented as a fixed location object heap_n created for each line, n, where a new operator is encountered (i.e., heap identification by memory allocation sites). Array components are fixed location objects denoted as id[] (while a variable location leading to an array component has the form id[EXPR]), representing a generic component of the array. A fixed location can therefore be expressed by the following regular expression:
    id([])*(.id([])*)*        (1)
where star (*) stands for zero or more occurrences, while dot (.) and square brackets ([ ]) are not metacharacters.
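The distinction between fixed and variable locations can be checked mechanically. The following is a minimal sketch, assuming access paths are given as plain strings; the function name `is_fixed_location` is hypothetical and the check only looks for the three dereference operators named above, treating the empty brackets of a generic array component as part of a fixed location.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if the textual access path contains no dereference operator
// (*, -> or [EXPR]). An empty pair of brackets, as in "b[]", denotes the
// generic array component and does not count as a dereference.
bool is_fixed_location(const std::string& path) {
    for (std::size_t i = 0; i < path.size(); ++i) {
        const bool has_next = i + 1 < path.size();
        if (path[i] == '*') return false;
        if (path[i] == '-' && has_next && path[i + 1] == '>') return false;
        if (path[i] == '[' && !(has_next && path[i + 1] == ']')) return false;
    }
    return true;
}
```

Under these assumptions `a.x` and `b[]` are classified as fixed locations, while `*p`, `p->x` and `b[10]` are variable locations.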
    INSTRUCTIONS        FIXED LOCATIONS     VARIABLE LOCATIONS
    class A {           a
      X x;              a.x
      Y y;              a.y
    };
    A a;
    n: *p = new A;      heap_n, p           *p
    b[10] = 1;          b, b[]              b[10]
Figure 2: Examples of fixed and variable locations.

In Figure 2 the declaration of an object of class A generates one fixed location for the object, plus as many fixed locations as there are data members in the class. The dynamic allocation of an object at a generic line n involves the fixed location identified by the site of the dynamic allocation and the fixed location of the pointer p. While the pointer itself is a fixed location, its content, *p, is a variable location. An array assignment also involves two fixed locations, namely the array name, b, and the generic array element, b[], accessed through the variable location b[10].

The extensions to C++ involve the computation of appropriate $\tau$-types associated with access paths. The $\tau$-type of an access path is represented as an attribute (A.type), which can be determined using the syntax directed definitions of Figure 3. Explicit and implicit pointer dereferencing and array element accesses perform a deref operation on the $\tau$-type attribute. The deref operation simply returns the referenced type of a given type, i.e., if $\tau_i = ref(\tau_j)$, $deref(\tau_i)$ returns $\tau_j$. Creating (with the new operator) and assigning a dynamic object to an access path produce the construction of the relation (ref) between the access path $\tau$-type and the heap $\tau$-type corresponding to the allocation site. All the other semantic rules propagate the $\tau$-types of member attributes, and will be discussed in detail in the next section.

    A ::= (type-id) A1    {A.type = A1.type}
        | * A1            {A.type = deref(A1.type)}
        | A1 [EXPR]       {A.type = deref(A1.type)}
        | A1.id           {A.type = A1.type.id.type}
        | A1->id          {A.type = deref(A1.type).id.type}
        | id              {A.type = id.type}
        | (A1)            {A.type = A1.type}
        | new CLASS       {A.type = ref(heap_n.type)}
Figure 3: Syntax directed definitions to compute the type attribute of an access path. The type of a member attribute of a type is denoted as: type.id.type.

PTA consists of examining in turn all the instructions of the system under analysis; for each assignment instruction which affects some pointers, a $\tau$-type update action is taken. Two basic kinds of instructions involving access paths may be encountered during the PTA: assignments without or with the use of the & operator, which takes the address of a location. The corresponding update actions are defined in Figure 4.
Instruction (1) of Figure 4 is an assignment between two access paths. Types referenced by left and right hand sides are merged according to the c-join (conditional join) procedure described in [17]. In the second kind of assignment the left hand side referenced type is merged into the right hand side type according to the join procedure.

    (1) a1 = a2     ==>  c-join(deref($\tau_1$), deref($\tau_2$))
    (2) a1 = &a2    ==>  join(deref($\tau_1$), $\tau_2$)
    where:
        $\tau_1$ = a1.type
        $\tau_2$ = a2.type
Figure 4: The two basic actions to be performed when a pointer affecting statement is met.

Object Data Members Extension
Each type variable has a (possibly empty) set of member attributes, used to model data members of class objects or structures. If a class has a data member id which is accessed from an object of type $\tau_i$, a member attribute id is added to $\tau_i$:
    $\tau_i.id : \tau_k = ref(\tau_j)$
As a result $\tau_k$ is the $\tau$-type of the data member id of all locations of type $\tau_i$. Typing constraints related to the id member are introduced through the application of join or c-join to $\tau_k$, the type of $\tau_i.id$.

    class A {
      X x;
      Y y;
    };
    A a, c;

    s1: p = &a;
    s2: q = &b;
    s3: p->x = q;
    s4: p = &c;

    Step 3:
      a: $\tau_1 = ref(\bot)$
      b: $\tau_2 = ref(\bot)$
      c: $\tau_3 = ref(\bot)$
      p: $\tau_4 = ref(\tau_1)$
      q: $\tau_5 = ref(\tau_2)$
      $\tau_1.x$: $\tau_6 = ref(\tau_2)$

    Step 4:
      a: $\tau_1 = ref(\bot)$
      b: $\tau_2 = ref(\bot)$
      c: $\tau_1$
      p: $\tau_4 = ref(\tau_1)$
      q: $\tau_5 = ref(\tau_2)$
      $\tau_1.x$: $\tau_6 = ref(\tau_2)$
Figure 5: Example including the declaration of class A, objects a, c and four statements. The resulting typing environment after steps 3 and 4 is shown.

Figure 5 contains the declaration of class A and objects a, c, four instructions to be analyzed in an order which may be different from the execution one, because of flow insensitivity, and the resulting typing environment. When the statement s3 is processed during the PTA (step 3), variable c has possibly not yet been encountered in an analyzed statement (because of the flow insensitivity of the analysis). Therefore it is not possible to apply typing constraints to its member (i.e., to c.x). The constraint is applied to the type referenced by variable p, i.e., $\tau_1$, whose x member is made to point to $\tau_2$. When the last statement is analyzed (step 4), variable c will inherit the typing constraint of the member attribute x, which is associated with its new type $\tau_1$.

Effects of interprocedurality on pointers
Since the proposed PTA is context insensitive, only the points-to relations involving the passed parameters have to be taken into consideration. Parameters are handled by applying the rules of Figure 4, where each actual parameter of a method (or a function) invocation is assigned to the corresponding formal parameter of the called method.
A method may induce a typing constraint on data members of its class. As the constraint holds for every object, it is applied to a generic object _any of that class. The last step of the points-to analysis propagates the _any constraints to all the fixed locations of that class.

    class A {
    public:
      B* p;
      f(B* q) { p = q; }
    };
    A a, b;
    B c, d;
    a.f(&c);
    b.f(&d);

    _any.p: $\tau_1 = ref(\tau_2)$
    q: $\tau_3 = ref(\tau_2)$
    c: $\tau_2 = ref(\bot)$
    d: $\tau_2 = ref(\bot)$
    a: $\tau_4 = ref(\bot)$
    b: $\tau_5 = ref(\bot)$
    $\tau_4.p$: $\tau_1 = ref(\tau_2)$
    $\tau_5.p$: $\tau_1 = ref(\tau_2)$
Figure 6: Example comprising the declaration of class A, the declaration of objects a, b, and two method invocations. The resulting typing environment is shown.

An alternative choice in C++ is to consider a data member access as an access through the this pointer [19], and to consequently apply member constraints to the type referenced by this. With this alternative choice all the objects of a given class are forced to have the same type, because their address is assigned to this, yielding a lower accuracy of the points-to results. This overly restrictive constraint is avoided by using the _any fictitious object. Indeed, in Figure 6, where the two method invocations modify the points-to set of the data member p, after the application of the _any constraints, a and b are not forced to be of the same type.
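The situation of Figure 6 can be reproduced as a compilable fragment. This is only an illustration of the program being analyzed, not of the analysis itself; the `void` return type of `f` and the empty body of `B` are assumptions added to make the code compile.

```cpp
#include <cassert>

struct B {};

class A {
public:
    B* p = nullptr;
    // The assignment p = q holds for whatever object f is invoked on, which is
    // why the analysis attaches the constraint to a generic object of class A.
    void f(B* q) { p = q; }
};
```

After `a.f(&c)` and `b.f(&d)` on objects `A a, b; B c, d;`, the two objects `a` and `b` remain distinct at run time, each with its own `p` member. Unifying them through the this pointer would discard that distinction, whereas propagating the generic constraint to each fixed location keeps their types separate.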
Inheritance and Polymorphism
A derived class is defined through inheritance, a mechanism which allows the derived class to access data members and use methods from its parent class, as if they were its own. New fixed locations are available for objects of the derived class, if data members are added. Redefinition of methods is also possible, and needs special treatment only when virtual functions are involved, and therefore the calls become polymorphic. Points-to results can be used to analyze polymorphism, i.e., to associate polymorphic function calls with the set of invocable methods. The following structures are required:
  • A symbol table, associating each fixed location with its concrete type.
  • The points-to sets, associating each fixed location x to the possibly referenced fixed locations (points-to(x)).
Given a polymorphic method invocation of the form:
    ap->f(a1, ..., an);
    (*ap).f(a1, ..., an);
where *ap, a1, ..., an are access paths, the syntax directed definitions of Figure 7 are used to compute the name attribute, i.e., the set of fixed locations which are possibly referenced through a particular access path. Each fixed location has a concrete type representing the class of such a location, and the combination of all the possible concrete types of the possibly referenced fixed locations determines the set of invocable methods (the polymorphic call set).

    A ::= (type-id) A1    {A.name = A1.name}
        | * A1            {A.name = points-to(A1.name)}
        | A1 [EXPR]       {A.name = points-to(A1.name)}
        | A1.id           {A.name = A1.name.id}
        | A1->id          {A.name = points-to(A1.name).id}
        | id              {A.name = {id}}
        | (A1)            {A.name = A1.name}
Figure 7: Syntax directed definitions to compute the name attribute of an access path, which is the set of fixed locations reachable through the access path.

The example of Figure 8 includes a polymorphic call at line 8. Concrete types and points-to sets are shown, as they are required to perform the steps necessary for polymorphism resolution. To solve the virtual call (p->plot()), which is equivalent to (*p).plot(), the name of the access path *p is computed according to the second definition of Figure 7, resulting in {heap_5, a}. The call is therefore either heap_5.plot() or a.plot(), and the concrete types of the involved fixed locations are Rectangle and Circle respectively, resulting in the invocable methods Rectangle::plot(), Circle::plot().

    1  Shape *p;
    2  Circle a;
    3
    4  if (c)
    5    p = new Rectangle();
    6  else
    7    p = &a;
    8  p->plot();

    Fixed locations (and corresponding concrete type):
      a: Circle
      heap_5: Rectangle
      p: Shape*
    Points-to set of p (after processing statements 4, ..., 8):
      points-to(p) = {heap_5, a}
    Virtual calls resolution:
      p->plot()  ==>  points-to(p).plot()
                 ==>  heap_5.plot(), a.plot()
                 ==>  Rectangle::plot(), Circle::plot()
Figure 8: Example including a polymorphic call (line 8). Its resolution, based on concrete types of fixed locations and points-to sets, is shown at the bottom.

Algorithm Properties
The proposed algorithm has several interesting properties. It is safe: the points-to sets and the polymorphic call sets are conservative approximations. This means that if a points-to relation holds, it is surely included in the points-to set reported by the approximate analysis, and if a method is invocable from a polymorphic call, it is surely reported in the approximate polymorphic call set.
As the algorithm is both flow and context insensitive, it does not require any knowledge about the call graph, while other flow and context sensitive approaches require the construction of the call graph, or its abstraction, during the analysis. However, in flow sensitive analyses and in presence of polymorphism the call graph depends on the results of the ongoing analysis. Therefore while flow sensitive algorithms iteratively use the available points-to information for call graph construction and vice versa, the proposed approach is able to completely separate the points-to analysis from polymorphism resolution and no call graph construction is required.
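Because polymorphism resolution is a separate step that only consumes the points-to sets and the symbol table of concrete types, it can be sketched in a few lines. The function `resolve_virtual_call` and the string-based representation below are hypothetical; the sketch also assumes that every concrete class defines the called method itself, so lookup of inherited definitions is not modelled.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using LocationSet = std::set<std::string>;

// Polymorphic call set for pointer->method(): one entry per concrete type
// found among the fixed locations possibly referenced by the pointer.
std::set<std::string> resolve_virtual_call(
    const std::string& pointer, const std::string& method,
    const std::map<std::string, LocationSet>& points_to,
    const std::map<std::string, std::string>& concrete_type) {
    std::set<std::string> callees;
    auto pts = points_to.find(pointer);
    if (pts == points_to.end()) return callees;  // empty points-to set
    for (const std::string& location : pts->second) {
        auto type = concrete_type.find(location);
        if (type != concrete_type.end()) {
            callees.insert(type->second + "::" + method);
        }
    }
    return callees;
}
```

With the data of Figure 8, i.e. points-to(p) = {heap_5, a} and concrete types Rectangle for heap_5 and Circle for a, the call p->plot() resolves to the set containing Rectangle::plot and Circle::plot.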
Recursive data structures and recursive function calls
are inherently handled. The SSG may contain cycles,
hence it is not necessary to introduce k-limiting techniques to represent self-referential data structures. Recursive data structures are properly modeled by type
variables, the member attributes of which reference the
type variable itself. As the analysis is context insensitive, the call graph is not taken into consideration, nor
is the presence of cycles in it. Therefore methods may
be recursive or indirectly recursive.
Being flow and context insensitive, the algorithm has a low cost in terms of both space and time complexity. The worst case complexity is theoretically exponential in the program size N, as it is possible to build a program with $O(2^N)$ distinguishable locations [18]. Even if theoretically correct, it is practically meaningless to express the complexity in terms of N. A more convenient metric is the count S of all objects and variables in the program. If all classes have R or fewer data members, the time complexity is $O(R\,S\,\alpha(S, S))$, where $\alpha$ is the inverse Ackermann function, i.e., the algorithm has a quadratic worst-case running time complexity [18]. The term R is the maximum number of data members of a class, and thus depends on the depth of inheritance and on the use of multiple inheritance. Common programming practices indicate that both of these factors are limited, and do not linearly increase with program size. Consequently R turns out to be a constant and the almost linear complexity $O(S\,\alpha(S, S))$ presented in [17] is preserved in practice.
EXPERIMENTAL SETUP AND RESULTS
The framework for the development of the Points-To Analysis Tool (PTAT) is the IRST program understanding environment. It comprises the FLow ANalysis Tool (FLANT), the Architectural Recovery Tool (ART), which uses flow analysis results, and the PTAT.
FLANT provides many flow sensitive interprocedural
analyses among which reaching definitions, reachable
uses, slicing and impact. It handles recursive function
calls and allows variable precision for reaching definitions [20]. Its output is textual, graphical (when results
can be represented as graphs) or interactively provided
to the user through a customized version of the text
editor emacs. ART [9] is a reverse engineering tool
supporting the discovery of architectural information.
It reports information about architectural components
and connectors found in the source code according to
different views (system, module, task and code).
The IRST program understanding environment is multi
language and modular. Information exchange among
modules is based on an intermediate representation for
both inputs and outputs. PTAT is crucial in the environment, because FLANT analyses depend on the
points-to results, and FLANT results are used by ART.
The contribution of PTAT in the analysis framework
is in its capability to solve polymorphic calls, and to
provide a points-to set for each fixed location.
Points-to sets are used by FLANT to decide which variables are possibly defined or used in a statement. This is essential for some of the FLANT analyses, such as data dependences analysis, where taking pointers into consideration is fundamental.
PTAT is written in C++ and uses the LEDA¹ library. Its input is an intermediate representation of the source code produced by the front-end, and its output can be textual or graphical. The latter results in the visualization of the SSG.
The experiments performed to test the PTAT were aimed at measuring the number of extra points-to relations possibly introduced by the context insensitive approximation, with respect to the results from flow sensitive C++ pointer and polymorphism analysis. For this purpose the public domain test suite collected by Pande [15] was used. The test suite consists of 19 programs containing virtual calls, for which polymorphism resolution is required. For each polymorphic call our PTA provides a conservative polymorphic call set.
Program        LOC    Methods
1. greed       968    47
2. garage      143    19
3. vcircle     142    16
4. office      213    12
5. family      109    22
6. FSM          98    15
7. deriv2      313    34
8. shapes      267    34
9. deriv1      192    31
10. objects    465    59
11. simul      339    54
12. primes      46    11
13. ocean      444    64
14. NP          31     7
15. city       519    67
16. tree       217    26
17. employ     894    58
18. life       178    21
19. chess      392    43
Virtual Methods
(I
z5
4
4
3
2
16
12
13
31
12
4
10
2
2
8
25
8
12
Virtual Calls
:i
G
4
3
1
66
22
28
39
7
3
li
6
1
3
4
2
1
Table 1: Some characteristics of the test suite by Pande.
¹ LEDA is an efficient general purpose class library developed at the Max Planck Institut für Informatik, Saarbrücken, Germany.
Program
1. greed
2. garage
3. vcircls
4. office
5. family
6. FSM
7. deriv2
8 shapes
9. derivl
10 objects
11 simul
12 primes
13. ocean
14 NP
15. city
16. tree
17 employ
1S life
19. chess
l-way
100
100
100
100
100
100
28
56
50
50
33
33
20
0
0
0
0
0
0
3-way
1 l-way
Flow 1 m-Isltlv
0
0
0
0
0
0
0
0
0
0
0
0
39
33
11
30
0
0
50
67
0
0
67
20
20
0
100
0
100
100
0
0
0
0
0
0
0
> 3-way
0
0
0
0
0
0
33
0
20
0
0
0
40
0
0
0
100
100
100
l-way
F
100
100
100
100
100
100
84
83
65
50
33
33
20
0
0
0
0
0
0
1 a-way
Flow
0
0
0
0
0
8
17
25
50
67
67
40
100
100
0
0
0
0
points-to
0
0
0
0
0
0
0
0
0
0
0
0
20
0
0
0
100
100
100
Extra
(%)
0
0
0
0
0
0
60
33
40
0
0
0
14
0
0
0
0
0
0
T
Flow Insens.
CPU time (ms)
3
9
9
17
8
9
100
25
26
15
20
3
17
19
7
7
10
34
14
Table 2: Percentage of polymorphic calls solved by 1, 2, 3 or more than 3 possibly invoked methods. Extra points-to relations are measured as the ratio between flow sensitive and flow insensitive cardinalities of the sets of methods invocable from the polymorphic calls.
Table 1 shows some features of the programs in the test suite by Pande. Their size ranges from 31 to 968 LOC, and the number of methods from 7 to 67. All of them declare virtual methods and contain polymorphic calls.
Table 2 reports the results of the virtual calls resolution on the test suite. To simplify comparison with the results presented in [15] the same grouping criteria for polymorphic call sets are adopted: invocable methods are ranked in classes of cardinality 1, 2, 3 or more than 3. The last but one column contains the extra points-to relations measure of the flow insensitive PTA compared to the flow sensitive analysis. It is computed as the percentage size increase of the flow insensitive polymorphic call sets, with respect to the flow sensitive ones. The last column contains CPU times in milliseconds for the flow insensitive PTA.
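The extra points-to relations measure can be written down as a one-line computation. The sketch below assumes that the two inputs are the total cardinalities of the polymorphic call sets of one program, summed over all its polymorphic calls; this aggregation and the function name `extra_percentage` are assumptions, and the numbers in the usage note are invented for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Percentage size increase of the flow insensitive polymorphic call sets with
// respect to the flow sensitive ones. Safety of the analysis implies that the
// flow insensitive total is never smaller than the flow sensitive one.
double extra_percentage(std::size_t flow_sensitive, std::size_t flow_insensitive) {
    assert(flow_sensitive > 0 && flow_insensitive >= flow_sensitive);
    return 100.0 * static_cast<double>(flow_insensitive - flow_sensitive) /
           static_cast<double>(flow_sensitive);
}
```

For instance, hypothetical totals of 5 invocable methods for the flow sensitive analysis and 8 for the flow insensitive one give an increase of 60 %, while equal totals give 0 %.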
DISCUSSION OF RESULTS
One of the drawbacks of the flow insensitivity of this PTA, with respect to flow sensitive PTAs, is that it introduces some extra points-to relations. This is due to the fact that statement order is not taken into account, while flow sensitive analysis strategies also consider the order induced by the control flow graph (CFG).
On the other hand, context insensitivity is also a source of imprecision. Calling contexts of methods are ignored, and consequently the return points-to sets propagate to every invocation of the method, including unreachable invocations, instead of propagating only to the call associated with the current instance. As an effect of these inaccuracies the estimated points-to sets can be a superset of the real ones. Consequently the methods invocable from a polymorphic call can be more than required. However, comparing flow sensitive and flow insensitive polymorphism resolution on the mentioned test suite shows that in practice precision is very high. In addition, it is important to note that extra points-to relations are in any case safe, because flow sensitive polymorphic call sets are always included in the flow insensitive sets.
Context insensitivity is the main reason for the high loss of precision (60 %) in the 7th program, as reported in Table 2. This program computes a function derivative with respect to a variable. The function is given as a combination of operators (+, -, *, /) and operands (constants and variables). As the functions used as test cases are hard coded in the main, and they do not exploit the full range of operators, some classes are never instantiated, giving rise to unreachable methods. These methods are discarded by the context sensitive analysis, while their effect is taken into consideration by the flow insensitive algorithm. Flow and context insensitivity has some effect on three other programs (numbered 8, 9 and 13 in Table 2), and does not affect at all the remaining 15 test cases.
An explanation for the high precision obtained could be
that it is common programming
practice to not reuse
pointer variables for different purposes, according to different control flows or calling contexts: a code where a
pointer variable has a unique meaning (possibly suggested by the name chosen for it) is definitely more
readable and maintainable.
A consequence
is that a
flow and context insensitive points-to analysis exhibits
an accuracy similar to that of the flow sensitive counterpart, since the results scarcely depend on flow and
context.
On the other hand, because of flow insensitivity, the proposed algorithm has low time complexity. The trade-off
between accuracy and performance is in our opinion satisfactory, at least on the test suite under investigation.
This becomes important when industrial size software
is taken into consideration for analysis, and scalability
has to be taken into account.
Let us consider a variable x whose points-to set is:
points-to(x)
= Cs, t)
The DEF set of statement ((my-type
((my-type
SLICING C++ CODE
Slicing is a technique originally developed by [al] and
subsequently investigated by [ll], [l]. The extensions of
the system dependence graph to the 00 software, described in [13], make it possible to slice 00 code. The
importance of efficiently handling pointers and polymorphism is also stressed, as a prerequisite in a code
analysis framework for the C++ language. PTA is also
helpful in improving the accuracy of data dependences
analysis used in slicing.
+)x)->b
3 1 is computed ns:
*lx)->b = 1
==> DEF = points-to(x).b
==> DEF = {s.b,
tab)
-I
Figure 9: Example of DEF set computation.
In this section two examples of slicing produced using C++ flow insensitive PTA are presented. The first is taken from [13] and shows that, in the presented example, flow insensitive points-to results lead to a slice which is identical to the reported one. The second example further illustrates slicing with the use of flow insensitive PTA on a public domain C++ code.

A ::= (type-id) A1    {A.use = A1.use}
    | * A1            {A.use = points-to(A1.use) U A1.use}
    | A1 [EXPR]       {A.use = points-to(A1.use) U A1.use}
    | A1.id           {A.use = A1.use.id}
    | A1->id          {A.use = points-to(A1.use).id U A1.use}
    | id              {A.use = {id}}
    | (A1)            {A.use = A1.use}
Figure 10: Syntax directed definitions to compute the use attribute of an access path, which is the set of fixed locations referenced by the access path or used in the dereference chain of the access path.
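As a minimal sketch of how the use attribute of Figure 10 could be evaluated, the following C++ fragment walks an access path bottom-up. It assumes the rules read as follows: casts and parentheses pass the set through, the dereference and subscript operators add the pointed-to locations to the referencing ones, a member selection appends the member name, and an arrow selection combines both. All type and function names (AccessPath, useOf, node, and so on) are illustrative and are not taken from FLANT or PTAT.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical representation: a fixed location is a plain string ("x", "s.b").
using Loc = std::string;
using LocSet = std::set<Loc>;
using PointsTo = std::map<Loc, LocSet>;  // possible points-to relation

enum class Kind { Id, Cast, Paren, Deref, Index, Dot, Arrow };

struct AccessPath {
    Kind kind;
    std::string id;                   // used by Id, Dot and Arrow
    std::shared_ptr<AccessPath> sub;  // A1 in Figure 10, null for Id
};

std::shared_ptr<AccessPath> node(Kind k, std::string id = "",
                                 std::shared_ptr<AccessPath> sub = nullptr) {
    return std::make_shared<AccessPath>(
        AccessPath{k, std::move(id), std::move(sub)});
}

// Union of the points-to sets of every location in locs.
LocSet pointsTo(const PointsTo& pt, const LocSet& locs) {
    LocSet out;
    for (const Loc& l : locs) {
        auto it = pt.find(l);
        if (it != pt.end()) out.insert(it->second.begin(), it->second.end());
    }
    return out;
}

// Appends ".id" to every location in locs.
LocSet member(const LocSet& locs, const std::string& id) {
    LocSet out;
    for (const Loc& l : locs) out.insert(l + "." + id);
    return out;
}

// Evaluates the use attribute of an access path.
LocSet useOf(const PointsTo& pt, const AccessPath& a) {
    if (a.kind == Kind::Id) return {a.id};
    const LocSet sub = useOf(pt, *a.sub);
    LocSet out;
    switch (a.kind) {
        case Kind::Deref:
        case Kind::Index:   // referenced locations plus referencing ones
            out = pointsTo(pt, sub);
            out.insert(sub.begin(), sub.end());
            break;
        case Kind::Dot:     // no dereference involved
            out = member(sub, a.id);
            break;
        case Kind::Arrow:   // member of the pointed-to objects plus the pointer
            out = member(pointsTo(pt, sub), a.id);
            out.insert(sub.begin(), sub.end());
            break;
        default:            // Cast and Paren pass the set through
            out = sub;
            break;
    }
    return out;
}
```

With points-to(x) = {s, t}, evaluating the access path ((my_type *)x)->b in this sketch gives {x, s.b, t.b}, the same set reported in Figure 11.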
PTA provides the set of invocable methods for each polymorphic call statement and also the points-to set of every fixed location. All this information is used by the IRST data dependences analyzer. The set of invocable methods is used to determine the possible call graph, thus allowing inter-procedural analysis. The points-to set of every fixed location is essential to determine the DEF and USE sets of every instruction, which are used during reaching definitions computation. The DEF (USE) set of a statement is the set of fixed locations which are defined (used) by such a statement. As a statement uses access paths to reference the locations that are being defined (used), it is necessary to transform an access path to the set of fixed locations reachable through it. For this purpose the syntax directed definitions of Figure 7 and Figure 10 are respectively used.
When corresponding to variable locations, the fixed locations inserted in the DEF set are marked as non-killing definitions, i.e., definitions which do not override previous definitions, but are simply added during the analysis. The reason for their non-killing nature is that the points-to relation used to obtain them is a possible points-to, and not a definite points-to. A discussion about the impact of definite points-to and possible points-to on the kill set can be found in [7]. Figure 9 shows an example of DEF set computation for an assignment statement where the left hand side accesses a data member through a pointer.
The USE set computation requires different syntax directed definitions, since the USE set can be considered as the collection of the referenced fixed locations together with the fixed locations used in the dereference chain. For this reason, in Figure 10 the union between referenced and referencing fixed locations is computed at every production that includes a dereference operator. Figure 11 describes an example of USE set computation for a statement comprising the access to a data member through a pointer.

Let us consider a variable x whose points-to set is:
    points-to(x) = {s, t}
The USE set of statement y = ((my_type *)x)->b is computed as:
    y = ((my_type *)x)->b
    ==> USE = points-to(x).b U {x}
    ==> USE = {x, s.b, t.b}
Figure 11: Example of USE set computation.

Once the DEF and USE sets have been determined for each instruction, reaching definitions and data dependences (use-definition chains) can be computed. Data dependences are subsequently used by the slicer. A detailed description of the analyses performed by FLANT can be found in [20]. Data dependences determination,
based on PTA results, is safe, as a consequence of the safety of the points-to and reaching definitions analyses: if the points-to set is larger than needed, so is the resulting data dependences set, but every valid data dependence is surely reported.

    class A {
      int x;
      void f(int i);
    };
    void A::f(int i) {
10    x = i;
    }
    main() {
20    A a;
21    a.f(1);
    }
    --------------------------------------------
    void A::f(int i, int& x) {
10    x = i;                     {<x, 10>}
    }
    main() {
20    A a;
21    a.f(1, a.x);               {<a.x, 10>}
    }
Figure 12: Example of reaching definitions computation. The necessary signature extension is shown under the horizontal line, along with the resulting reaching definitions.

FLANT code slicer uses data dependences, control dependences and the invocation graph to provide the user with a slice of the program on a slicing criterion <p, x>, where p is a statement and x is a variable of interest in the program. The result is the set of statements which directly or indirectly contribute to the value of x at the statement p, and is computed as the transitive closure of data dependences and control dependences from p on x. In object oriented programming, variable references are often substituted by method calls (simply returning a value), so that an extension of the slicing criterion is required: x may be either a variable or a method [13]. If it is a variable, it must be defined or used at statement p, and if it is a method it must be invoked at p.
If data and control dependences are forward (instead of backward) traversed, the result is a forward slice on the given slicing criterion <p, x>. Forward slices include all instructions directly or indirectly affected by the value of x at p (the impact set). Once data and control dependences are available for a program, forward slices are computable as well.
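The backward and forward slices described in this section, both defined as a transitive closure over dependences, can be illustrated with a small worklist traversal. The sketch below is only an approximation under stated assumptions: statements are plain integers, data and control dependences are merged into a single map, the dependences leaving p are taken to be already restricted to the variable x of the criterion, and the starting statement is always included in the result. The names DepGraph, backwardSlice and forwardSlice are hypothetical and do not describe the FLANT implementation.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

using Stmt = int;
// deps[s] = statements that s depends on (data or control dependences).
using DepGraph = std::map<Stmt, std::set<Stmt>>;

// Transitive closure of the edges of g starting from one statement.
std::set<Stmt> closure(const DepGraph& g, Stmt start) {
    std::set<Stmt> seen{start};
    std::vector<Stmt> work{start};
    while (!work.empty()) {
        Stmt s = work.back();
        work.pop_back();
        auto it = g.find(s);
        if (it == g.end()) continue;
        for (Stmt d : it->second)
            if (seen.insert(d).second) work.push_back(d);  // first visit only
    }
    return seen;
}

// Flips every edge, so that closure() follows dependences forward.
DepGraph reversed(const DepGraph& g) {
    DepGraph r;
    for (const auto& entry : g)
        for (Stmt d : entry.second) r[d].insert(entry.first);
    return r;
}

// Statements that directly or indirectly contribute to p.
std::set<Stmt> backwardSlice(const DepGraph& deps, Stmt p) {
    return closure(deps, p);
}

// Statements that are directly or indirectly affected by p (impact set).
std::set<Stmt> forwardSlice(const DepGraph& deps, Stmt p) {
    return closure(reversed(deps), p);
}
```

Because only the direction of traversal changes, the same dependence information serves both kinds of slice, which is the point made in the text above.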
A method may induce some data dependences on the data members of its class. To map these dependences to the data members of the actual object which invokes the method, the method signature is extended and every class attribute can be considered in our approach as a parameter passed by reference. The object invoking the method passes by reference its own data members, and correspondingly receives back on them the effects of the call.
In the example of Figure 12, reaching definitions are indicated as a set of pairs <fixed location, definition identifier>. Note that, for the sake of simplicity, the definition identifier has been identified with the number of the line in which it appears. The signature of method f has been extended with the passed by reference parameter x to propagate the dependences to the calling object data member (a.x).
This approach is similar to that proposed in [13], where global variables, instead of parameters passed by reference, are used.
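The signature extension can be made concrete with a compilable variant of the Figure 12 example. This is only an illustration of the idea, not the transformation performed by the tool: the static member f_ext is a hypothetical stand-in for the extended signature, and the data member is declared public here so that the call site can pass it by reference.

```cpp
#include <cassert>

class A {
public:
    int x = 0;

    // Original method: the definition of x implicitly targets this->x.
    void f(int i) { x = i; }

    // Hypothetical extended form: the data member becomes an explicit
    // by-reference parameter, so the definition at "x = i" is visible
    // at the call site as a definition of the actual argument.
    static void f_ext(int i, int& x) { x = i; }
};
```

Calling a.f(1) and calling A::f_ext(1, a.x) have the same effect on a.x, but in the second form an analysis that only tracks parameters can relate the definition inside the method to the location a.x of the invoking object.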
Figure 13 presents an example of C++ code taken from [13]. The points-to analysis performed by PTAT gives the following results on variable e_ptr (they hold all over the program, since the analysis is flow insensitive):
    points-to(e_ptr) = {heap_36, heap_37}
and consequently the polymorphic call of line 38 is associated with the following invocable methods:
    e_ptr->go()
    ==> points-to(e_ptr).go()
    ==> heap_36.go(), heap_37.go()
    ==> AlarmElevator::go(), Elevator::go()
On the following criteria (taken from [13]), FLANT produced the following slices:
    slice<39, which_floor> = {3,4,5,12,16,17,18,19,20,22,25,26,32,33,35,36,37,38}
    slice<20, current_floor> = {3,4,5,12,16,19,20,22,25,26,32,33,35,36,37,38}
FLANT slices are identical to those presented in [13]. Again, on this example, flow insensitive points-to produced results identical to flow sensitive ones and therefore the slices turn out to be the same.
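The resolution of the polymorphic call at line 38 can be sketched as a lookup over the points-to set of the receiver: each pointed-to heap object has an allocation type, and the invoked method is the first definition found walking up the inheritance chain from that type. The code below is a hedged illustration only. The names ClassInfo, resolve and invocableMethods are hypothetical, and the pairing of heap_36 with AlarmElevator and heap_37 with Elevator is an assumption based on the order in which the results are listed above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

struct ClassInfo {
    std::string base;               // empty when the class has no base
    std::set<std::string> methods;  // methods defined in this class
};
using Classes = std::map<std::string, ClassInfo>;

// Walks up the inheritance chain until a class defining the method is found.
std::string resolve(const Classes& classes, std::string cls,
                    const std::string& method) {
    while (!cls.empty()) {
        auto it = classes.find(cls);
        if (it == classes.end()) break;
        if (it->second.methods.count(method)) return cls + "::" + method;
        cls = it->second.base;
    }
    return "";
}

// Maps the points-to set of a receiver to the set of invocable methods.
std::set<std::string> invocableMethods(
        const Classes& classes,
        const std::map<std::string, std::string>& allocType,
        const std::set<std::string>& pointsTo,
        const std::string& method) {
    std::set<std::string> out;
    for (const auto& loc : pointsTo) {
        auto it = allocType.find(loc);
        if (it == allocType.end()) continue;  // unknown allocation type
        std::string target = resolve(classes, it->second, method);
        if (!target.empty()) out.insert(target);
    }
    return out;
}
```

Since the points-to set is a possible (not definite) relation, the result is a set of candidate targets, each of which contributes an edge to the call graph.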
Figure 13 shows an example of an interactive session with FLANT. Slicing criteria can be introduced by clicking on a variable name at the desired program point (<20, current_floor> in this case). FLANT computes the slice and visualizes it in reverse (or in color when possible) on the screen. Line numbers in Figure 13 are the same as in [13], so that they can be easily compared.

Figure 13: Example of C++ code. The slice on the criterion <20, current_floor> has been highlighted by FLANT.

Figure 14: Storage shape graph of a portion of the program HMM performing the recognition task.

Slice                            Size    Ratio
Slice<170, maxstates>              85    10.5 %
Slice<1026, trellis_width>        132    16.4 %
Slice<1032, heap_hmm_cc_506>      567    70.3 %
Table 3: Slices of HMM. Line numbers in the slicing criteria refer to file hmm.cc. The absolute and relative (to the program size) extracted lines of code are given.
FLANT slicer has been applied to a second example, HMM, previously analyzed with PTAT. HMM is a C++ public domain package, about 800 LOC (excluding comments and empty lines), developed by R. Myers and J. Whitson at the University of California, Irvine, implementing a recognizer based on Hidden Markov Models. Figure 14 shows a portion of the SSG of the program as computed by PTAT. Heap locations are identified by a suffix which comprises a file identifier and the line number (e.g., heap_hmm_cc_502 corresponds to the allocation at line 502 of file hmm.cc). The points-to information was used to determine the DEF and USE sets of each instruction in the program, and subsequently to compute reaching definitions and data dependences. Once data dependences were determined, it was possible to interactively query FLANT for slices. As an example, three slices are summarized in Table 3. The third one, on heap_hmm_cc_506, is actually on a generic component of the array scalingfactors, which is a relevant HMM package data structure.
CONCLUSION AND FUTURE WORK
We have presented a flow insensitive, context insensitive points-to algorithm for C++ code which is an extension of the corresponding one for the C language described in [17, 18].
A comparison of the flow insensitive points-to results with flow sensitive ones on a C++ test suite reported in the literature shows that accuracy is high.
Under the assumption that the number of data members and the depth of inheritance grow only to a limited extent with program size, the presented extensions for C++ do not affect the original almost linear running time complexity, so the proposed approach should scale to large industrial size software.
We have also described the integration of the points-to analysis with data dependences and slicing. Some examples have been presented and discussed.
Our future work will be devoted to further evaluating the trade-off between accuracy and performance of the flow insensitive points-to analysis with respect to the flow sensitive one on a larger industrial benchmark and in the context of slicing.

ACKNOWLEDGMENT
We are grateful to Laurie Hendren for the interesting discussions and to Hemant Pande for providing the C++ test suite.

REFERENCES
[1] H. Agrawal, “On Slicing Programs With Jump Statements”, in Proc. of the ACM SIGPLAN'94 Conf. on Programming Languages Design and Implementation, SIGPLAN Notices, Vol. 29, No. 6, pp. 302-312, 1994.
[2] G. Antoniol, R. Fiutem, E. Merlo, P. Tonella, “Application and User Interface Migration From Basic to Visual C++”, Proc. of the Int. Conf. on Software Maintenance, pp. 76-85, 1995.
[3] D.C. Atkinson, W.G. Griswold, “The Design of Whole-Program Analysis Tools”, Proc. of the Int. Conf. on Software Engineering, pp. 16-27, 1996.
[4] J. Beck, D. Eichmann, “Program and Interface Slicing for Reverse Engineering”, in Proc. of the Int. Conf. on Software Engineering, pp. 509-518, 1993.
[5] J. M. Bieman, L. M. Ott, “Measuring Functional Cohesion”, IEEE Transactions on Software Engineering, Vol. 20, No. 8, pp. 644-657, August 1994.
[6] A. Cimitile, A. De Lucia, M. Munro, “Identifying Reusable Functions Using Specification Driven Program Slicing: A Case Study”, in Proc. of the Int. Conf. on Software Maintenance, pp. 124-133, 1995.
[7] M. Emami, R. Ghiya, L.J. Hendren, “Context-Sensitive Interprocedural Points-to Analysis in the Presence of Function Pointers”, Proc. of the ACM SIGPLAN'94 Conf. on Programming Language Design and Implementation, pp. 242-256, Orlando, Florida, June 20-24, 1994.
[8] R. Fiutem, E. Merlo, G. Antoniol, P. Tonella, “Understanding the Architecture of Software Systems”, 4th Workshop on Program Comprehension, pp. 187-196, Berlin, March 29-31, 1996.
[9] R. Fiutem, P. Tonella, G. Antoniol, E. Merlo, “A Cliche Based Environment to Support Architectural Reverse Engineering”, Proc. of the Int. Conf. on Software Maintenance, pp. 319-328, 1996.
[10] P. G. Frankl, E. J. Weyuker, “An Applicable Family of Data Flow Testing Criteria”, IEEE Transactions on Software Engineering, Vol. 14, No. 10, pp. 1483-1498, October 1988.
[11] K.B. Gallagher, J.R. Lyle, “Using Program Slicing in Software Maintenance”, IEEE Transactions on Software Engineering, vol. 17, n. 8, pp. 751-761, August 1991.
[12] R. Gupta, M. J. Harrold, M. L. Soffa, “An approach to Regression Testing Using Slicing”, in Proc. of the Int. Conf. on Software Maintenance, pp. 299-308, 1992.
[13] L. Larsen, M.J. Harrold, “Slicing Object-Oriented Software”, Proc. of the Int. Conf. on Software Engineering, pp. 495-505, 1996.
[14] E. Merlo, J. F. Girard, L. Hendren, R. De Mori, “Multi-Valued Constant Propagation Analysis for User Interface Reengineering”, International Journal of Software Engineering and Knowledge Engineering, vol. 5, n. 1, pp. 2-23, 1995.
[15] H.D. Pande, “Compile Time Analysis of C and C++ Systems”, Ph.D. Thesis, Rutgers University, New Brunswick, New Jersey, May 1996.
[16] H.D. Pande, W.A. Landi, B.G. Ryder, “Interprocedural Def-Use Associations for C Systems with Single Level Pointers”, IEEE Transactions on Software Engineering, vol. 20, n. 5, May 1994.
[17] B. Steensgaard, “Points-to Analysis in Almost Linear Time”, Proc. of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, St. Petersburg Beach, FL, Jan. 1996.
[18] B. Steensgaard, “Points-to Analysis by Type Inference of Programs with Structures and Unions”, Proc. of the Int. Conf. on Compiler Construction, Linköping, Sweden, Apr. 1996.
[19] B. Stroustrup, “The C++ Programming Language (2nd edition)”, Addison-Wesley, 1992.
[20] P. Tonella, R. Fiutem, E. Merlo, G. Antoniol, “Variable Precision Reaching Definitions Analysis for Software Maintenance”, IRST Technical Report 9602-01, 1996 (to appear in Proc. of the Euromicro Working Conf. on Software Maintenance and Reengineering, 1997).
[21] M. Weiser, “Program slicing”, IEEE Transactions on Software Engineering, vol. 10, n. 4, pp. 352-357, July 1984.
[22] M. Weiser, “Programmers Use Slices when Debugging”, Communications of the ACM, Vol. 25, July 1982, pp. 446-452.