The Evolution Matrix: Recovering Software Evolution using Software

Program Comprehension & Software Evolution
[Lightweight] Principles and [real] Practice
Michele Lanza
Faculty of Informatics
University of Lugano
Switzerland
Prologue
Once upon a time…
Reverse engineer 1’200’000 lines of C++ code in ca. 2300 classes
1’200’000 lines * 2 seconds per line = 2’400’000 seconds
2’400’000 seconds / 3600 = 667 hours
667 hours / 8 = 83 working days
83 days / 5 = 16 working weeks and 3 days
~ 4 months
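The estimate above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the factor 2 on the slide means two seconds of reading time per line, and that a working day has 8 hours and a working week 5 days; the function name `reading_effort` is made up for illustration.

```python
SECONDS_PER_LINE = 2   # assumption: the "* 2" on the slide read as seconds per line
HOURS_PER_DAY = 8      # working hours per day, as on the slide
DAYS_PER_WEEK = 5      # working days per week, as on the slide


def reading_effort(lines_of_code):
    """Return the time needed to read every line once, in several units."""
    seconds = lines_of_code * SECONDS_PER_LINE
    hours = round(seconds / 3600)
    days = hours // HOURS_PER_DAY
    weeks, extra_days = divmod(days, DAYS_PER_WEEK)
    return {
        "seconds": seconds,
        "hours": hours,
        "days": days,
        "weeks": weeks,
        "extra_days": extra_days,
    }


effort = reading_effort(1_200_000)
print(effort)
```

Reading alone, without any understanding, already costs about four months, which is the motivation for the lighter-weight techniques that follow.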
• Questions:
– What is the size and the overall structure of the system?
– What is the internal structure of the system and its elements?
– How did the software system become like that?
The Life Cycle of Software Systems
[Figure: Requirements, Analysis, Design, and Implementation plotted over Time]
Issues
• Tool support
• Scalability
• Flexibility
Object-Oriented Reverse Engineering
• Goal: take a (large legacy) software system and
“understand” it, i.e., construct a mental model of the
system
• Problem: the software system in question is
– Unknown, very large, and complex
– Domain- and language-specific
– Seldom documented or commented
– “In bad shape”
Object-Oriented Reverse Engineering (II)
• Constructing a mental model requires information about
the system:
– Top-down approaches
– Bottom-up approaches
– Mixed Approaches
• There is no “silver bullet” methodology
• Every reverse engineering situation is unique
• Need for flexibility, customizability, scalability, and
simplicity
Reverse Engineering Approaches
• Reading (source code, documentation, UML diagrams, comments)
• Running the software and analyzing its execution trace
• Interviewing users and developers (if available)
• Clustering
• Concept Analysis
• Software Visualization
• Software Metrics
• Slicing and Dicing
• Querying (Database)
• Data Mining
• Logic Reasoning
• …
The “Information Crystallization” Problem
• Many approaches generate too much or not
enough information
• The reverse engineer must make sense of this
information by himself
• We need the right information at the right time
..take a step back..block the ground..think
about it..
• The information needed to reverse engineer a legacy
software system resides at various levels
• We need to obtain and combine
– Coarse-grained information about the whole system
– Fine-grained information about specific parts
– Evolutionary information about the past of the system
Contents
• Polymetric Views
• Software Visualization vs. Reverse Engineering
– Coarse-grained
– Fine-grained
– Evolutionary
– Dynamic Information
• Discussion
• Demos
A Solution - The Polymetric View
• A lightweight combination of two approaches:
– Software visualization (reduction of complexity,
intuitive)
– Software metrics (scalability, assessment)
• Interactivity (iterative process, silver bullet
impossible)
• Does not replace other techniques, it complements
them:
– “Opportunistic code reading”
The Polymetric View - Principles
• Visualize software:
– entities as rectangles
– relationships as edges
• Enrich these visualizations:
– Map up to 5 software metrics on a 2D figure
– Map other kinds of semantic information on nominal colors
[Figure: entity nodes connected by relationship edges; each node carries a width metric, a height metric, a color metric, and 2 position metrics]
The Polymetric View - Example
System Complexity View
Nodes = Classes
Edges = Inheritance Relationships
Width = Number of Attributes
Height = Number of Methods
Color = Number of Lines of Code
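The mapping listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not CodeCrawler's implementation: the names `ClassMetrics`, `Node`, and `system_complexity_nodes`, the minimum node size, and the choice of a gray scale in which darker means more lines of code are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClassMetrics:
    name: str
    attributes: int      # number of attributes
    methods: int         # number of methods
    lines_of_code: int   # number of lines of code


@dataclass
class Node:
    name: str
    width: int
    height: int
    gray: int  # 0 = black, 255 = white (assumed convention: darker = more LOC)


def system_complexity_nodes(classes, min_size=4):
    """Map class metrics to node width, height, and gray value."""
    max_loc = max((c.lines_of_code for c in classes), default=0) or 1
    nodes = []
    for c in classes:
        gray = round(255 * (1 - c.lines_of_code / max_loc))
        nodes.append(
            Node(c.name, min_size + c.attributes, min_size + c.methods, gray)
        )
    return nodes


# Hypothetical input data, for illustration only.
nodes = system_complexity_nodes([
    ClassMetrics("Parser", attributes=2, methods=10, lines_of_code=500),
    ClassMetrics("Token", attributes=0, methods=1, lines_of_code=50),
])
for node in nodes:
    print(node)
```

Laying the nodes out as an inheritance tree is left out here; the sketch only covers the metric-to-shape mapping.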
The Polymetric View - Example (II)
System Complexity View
Nodes = Classes
Edges = Inheritance Relationships
Width = # attributes
Height = # methods
Color = # lines of code
Reverse engineering goals
• Get an impression (build a first raw mental
model) of the system, know the size, structure,
and complexity of the system in terms of classes
and inheritance hierarchies
• Locate important (domain model) hierarchies,
see if there are any deep, nested hierarchies
• Locate large classes (standalone, within
inheritance hierarchy), locate stateful classes and
classes with behaviour
View-supported tasks
• Count the classes, look at the displayed nodes,
count the hierarchies
• Search for node hierarchies, look at the size and
shape of hierarchies, examine the structure of
hierarchies
• Search big nodes, note their position, look for tall
nodes, look for wide nodes, look for dark nodes,
compare their size and shape, “read” their name
=> opportunistic code reading
The Polymetric View - Description
• Every polymetric view is
described according to a
common pattern
• Every view targets
specific reverse
engineering goals
• The polymetric views are
implemented in
CodeCrawler
[Figure: description template of the System Complexity View, with the sections Structural Specification (Target, Scope, Metrics, Layout), Description, Goals, Symptoms, Scenario, and Case Study]
Coarse-grained Software Visualization
• Reverse engineering question:
– What is the size and the overall structure of the system?
• Coarse-grained reverse engineering goals:
– Gain an overview in terms of size, complexity, and structure
– Assess the overall quality of the system
– Locate and understand important (domain model) hierarchies
– Identify large classes, exceptional methods, dead code, etc.
– …
Coarse-grained Polymetric Views - Example
Method Efficiency Correlation View
Nodes: Methods
Edges: none
Size: Number of method parameters
Position X: Number of lines of code (LOC)
Position Y: Number of statements (NOS)
Goals:
• Detect overly long methods
• Detect “dead” code
• Detect badly formatted methods
• Get an impression of the system in terms of
coding style
• Know the size of the system in # methods
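The goals above amount to looking for methods that fall far from the expected correlation between lines of code and statements. The following sketch does the same thing numerically instead of visually; the thresholds `long_loc` and `max_ratio`, the tag names, and the function name `classify_methods` are hypothetical and would need tuning per system.

```python
def classify_methods(methods, long_loc=50, max_ratio=3.0):
    """Tag methods that stand out in a LOC vs. NOS comparison.

    methods: iterable of (name, lines_of_code, number_of_statements).
    Returns a dict mapping method name to a list of tags.
    """
    findings = {}
    for name, loc, nos in methods:
        tags = []
        if loc > long_loc:
            tags.append("long")      # overly long method
        if nos == 0:
            tags.append("empty")     # no statements: possibly dead or stub code
        elif loc / nos > max_ratio:
            tags.append("sparse")    # many lines per statement: formatting?
        elif nos / max(loc, 1) > max_ratio:
            tags.append("dense")     # many statements per line: formatting?
        if tags:
            findings[name] = tags
    return findings


# Hypothetical measurements, for illustration only.
sample = [
    ("parse", 120, 90),
    ("noop", 3, 0),
    ("oneLiner", 1, 12),
    ("getName", 3, 1),
]
findings = classify_methods(sample)
print(findings)
```

A tag is only a hint for where to read code next; whether an "empty" method is really dead still has to be checked by hand.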
CodeCrawler Demo
Clustering the Polymetric Views
• First Contact
– System Hotspots
– System Complexity
– Root Class Detection
– Implementation Weight Distribution
• Candidate Detection
– Data Storage Class Detection
– Method Efficiency Correlation
– Direct Attribute Access View
– Method Length Distribution
• Inheritance Assessment
– Inheritance Classification
– Inheritance Carrier
– Intermediate Abstract
• Class Internal
– The Class Blueprint
Coarse-grained SV - Conclusions
• Benefits
– Views are customizable (context…) and easily
modifiable
– Simple approach, yet powerful
– Scalability
• Limits
– Visual language must be learned
Fine-grained Software Visualization
• Reverse engineering question:
– What is the internal structure of the system and its elements?
• Fine-grained reverse engineering goals:
– Understand the internal implementation of classes and class
hierarchies
– Detect coding patterns and inconsistencies
– Understand class/subclass roles
– Identify key methods in a class
– …
The Class Blueprint - Principles
• The class is divided into 5 layers: Initialization, External Interface, Internal Implementation, Accessor, Attribute
• Nodes
– Methods, Attributes, Classes
• Edges
– Invocation, Access, Inheritance
• The method nodes are positioned according to
– Layer
– Invocation sequence
The Class Blueprint - Principles (II)
[Figure: legend of the class blueprint]
• Method node metrics: # invocations, # lines
• Attribute node metrics: # external accesses, # internal accesses
• Node types: Abstract Method, Constant Method, Overriding Method, Delegating Method, Extending Method, Read Accessor, Write Accessor, Attribute
• Edge types: Method Invocation, Direct Attribute Access
The Class Blueprint - Example
• Delegate:
– Delegates functionality to other classes
– May act as a “Façade” (DP)
• Large Implementation:
– Deep invocation structure
– Several methods
– High decomposition
• Wide Interface
• Direct Access
• Sharing Entries
The Class Blueprint - A Pattern Language?
• The patterns reveal information about
– Coding style
– Coding policies
– Particularities
• We grouped them according to
– Size
– Layer distribution
– Semantics
– Call-flow
– State usage
• Moreover…
– Inheritance Context
– Frequent pattern combinations
– Rare pattern combinations
• They are all part of a pattern language
The Class Blueprint - Example (II)
• Call-flow
– Double Single Entry
– (=> split class?)
• Inheritance
– Adder
– Interface overriders
• Semantics
– Direct Access
• State Usage
– Sharing Entries
The Class Blueprint - What do we see?
CodeCrawler Demo
Fine-grained SV - Conclusions
• Benefits
– Complexity reduction
– Visual code inspection technique
– Complements the coarse-grained views
• Limits
– Visual language must be learned
– Good object-oriented knowledge required
– No information about actual functionality =>
opportunistic code reading necessary
Evolutionary Software Visualization
• Reverse engineering question:
– How did the software system become like that?
• Evolutionary reverse engineering goals:
– Understand the evolution of OO systems in terms of size and
growth rate
– Understand at which time an element, e.g., a class, has been
added or removed from the system
– Understand the evolution of single classes
– Detect patterns in the evolution of classes
– …
The Evolution Matrix - Principles
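[Figure: the Evolution Matrix. Columns are versions along the Time (Versions) axis, from First Version through Version 2 .. Version (n - 1) to Last Version; annotations mark Removed Classes, Added Classes, a Growth Phase, and a Stagnation Phase]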
The Evolution Matrix - Principles (II)
• The Evolution Matrix reveals patterns
– The evolution of the whole system
(versions, growth and stagnation
phases, growth rate, initial and final size)
– The life-time of classes (addition,
removal)
• Moreover, we enrich the
evolution matrix view with
metric information
[Figure: a class node with # methods and # attributes as its metrics]
• This allows us to see
patterns in the evolution of
classes
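The data behind such a matrix is small: one row per class, one column per version, and a pair of metrics in each cell. The sketch below builds it from per-version snapshots; the snapshot format (a dict from class name to `(# methods, # attributes)`), the function names, and the sample data are assumptions for illustration. Classes are matched by name only, so a renamed class shows up as one removal plus one addition.

```python
def evolution_matrix(versions):
    """Build rows of per-version metrics for every class.

    versions: list of dicts, oldest first, each mapping a class name to
    (number_of_methods, number_of_attributes).
    Returns a dict: class name -> list with one cell per version
    (None where the class does not exist in that version).
    """
    names = []
    for snapshot in versions:
        for name in snapshot:
            if name not in names:
                names.append(name)  # keep order of first appearance
    return {name: [snapshot.get(name) for snapshot in versions] for name in names}


def added_and_removed(versions):
    """For each version step, list the classes that were added and removed."""
    changes = []
    for previous, current in zip(versions, versions[1:]):
        added = sorted(set(current) - set(previous))
        removed = sorted(set(previous) - set(current))
        changes.append((added, removed))
    return changes


# Hypothetical history of three versions, for illustration only.
history = [
    {"A": (3, 1), "B": (5, 2)},
    {"A": (4, 1), "C": (1, 0)},
    {"A": (4, 1), "C": (2, 1)},
]
matrix = evolution_matrix(history)
changes = added_and_removed(history)
print(matrix)
print(changes)
```

The system size per version is simply the number of non-empty cells in each column, which is what makes growth and stagnation phases visible.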
The Evolution Matrix - Pattern Language
Pulsar
• Repeated modifications make it grow and shrink.
• System Hotspot: Nearly every new system version requires changes.
• No “cheap class”
Supernova
• Suddenly increases in size, possible reasons:
– Massive shift of functionality towards a class.
– Data storage class
– Developers knew what to fill in.
The Evolution Matrix - Pattern Language (II)
White Dwarf
• Lost the functionality it had and
now trundles along without real
meaning.
• Possibly dead code.
Red Giant
• A permanent god class which is
always very large
Idle
• Keeps size over several
versions.
• Possibly dead code,
possibly good code.
The Evolution Matrix - Pattern Language (III)
Dayfly
• Exists during only one or two versions.
• Perhaps an idea which was tried out and then dropped.
Persistent
• Has the same lifespan as the whole system.
• Part of the original design.
• Perhaps holy dead code which no
one dares to remove.
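The patterns of this pattern language are recognized visually, but a rough automatic pre-classification is conceivable. The sketch below labels one row of an evolution matrix, given as a list of class sizes per version (`None` where the class does not exist). All thresholds (`large`, `jump`, `collapse`) and the exact rules are assumptions made for illustration; the slides define the patterns only informally.

```python
def classify_history(sizes, large=50, jump=3.0, collapse=0.25):
    """Return the pattern labels that match one class history."""
    present = [s for s in sizes if s is not None]
    labels = []
    if not present:
        return labels
    # Lifetime patterns
    if len(present) == len(sizes):
        labels.append("Persistent")   # same lifespan as the whole system
    elif len(present) <= 2:
        labels.append("Dayfly")       # exists during one or two versions
    # Shape patterns
    deltas = [b - a for a, b in zip(present, present[1:])]
    grows = sum(d > 0 for d in deltas)
    shrinks = sum(d < 0 for d in deltas)
    if len(present) >= 3 and grows == 0 and shrinks == 0:
        labels.append("Idle")         # keeps its size over several versions
    if grows >= 2 and shrinks >= 2:
        labels.append("Pulsar")       # repeatedly grows and shrinks
    if any(a > 0 and b / a >= jump for a, b in zip(present, present[1:])):
        labels.append("Supernova")    # sudden increase in size
    if min(present) >= large:
        labels.append("Red Giant")    # always very large
    if max(present) >= large and present[-1] <= collapse * max(present):
        labels.append("White Dwarf")  # used to be large, now small
    return labels


# Hypothetical histories, for illustration only.
print(classify_history([10, 10, 10, 10]))
print(classify_history([None, 5, None, None]))
print(classify_history([5, 6, 40, 42]))
print(classify_history([20, 30, 15, 35, 18]))
print(classify_history([80, 90, 85, 95]))
print(classify_history([60, 70, 10, 10]))
```

Several labels can apply to the same class, for example a Persistent class that is also Idle, which matches the idea that the patterns combine.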
The Evolution Matrix - Example
Evolutionary Software Visualization Demo
Evolutionary SV - Conclusions
• Benefits
– Complexity reduction
• Limits
– Scalability (can be solved)
– Rename problem (can be solved)
– Relative changes hard to see (can be solved)
Run-Time Analysis Problems and Challenges
• RTA and Reverse Engineering - useful (in combination with static information)?
• Procedural RTA vs. Object-Oriented RTA
• OO RTA - Conceptual problems
– Polymorphism and late-binding
– Inheritance and incremental class definition
– Functionality (features) spread over the system
– Which trace to generate? How?
• Technical challenges and constraints
– Instrumentation problem (logging, VM patching, wrapping, ..)
– Amount, density, and noise of generated information (thousands of events in a few seconds..)
– Granularity of information (object instantiations, message sends, attribute accesses, ..)
– How much can we automate?
RTA - Questions
• Can we merge the dynamic information with static information?
• Can we use a “successful” static technique like polymetric views in RTA?
– What are the most instantiated classes?
– Are there any singletons?
– Which classes are object factories?
– What is the percentage of actually used methods in classes?
– Memory consumption?
– Speed bottlenecks?
– …
Case Study and Experiment Setup
• Case Study: Moose, our reengineering environment
– Implementation language: Smalltalk
– Age: 6 years
– Size: >250 classes and >3500 methods and a test suite of more than 280 unit
tests (a veritable legacy system ;-)
• Setup
– Code instrumentation using MethodWrappers
– Trace Scenario(s) given by the Unit test suite
– Wrapping down to method body level
• During trace-time we record events and increase counters
• Afterwards we map the counter values as metrics
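The setup above uses Smalltalk MethodWrappers. As a rough analogy only, the same idea of wrapping methods and increasing counters during a trace can be sketched in Python with a class decorator. Everything here (`instrument`, the two counters, the helpers `nmi` and `ncm`, and the `Scanner` example class) is hypothetical and is not the instrumentation used in the case study.

```python
import functools
import inspect
from collections import Counter

method_invocations = Counter()  # (class name, method name) -> number of calls
created_instances = Counter()   # class name -> number of instantiations


def _counting(class_name, method_name, func):
    """Wrap func so that every call increases a counter."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        method_invocations[(class_name, method_name)] += 1
        return func(*args, **kwargs)
    return wrapper


def instrument(cls):
    """Class decorator: count instantiations and calls of plain methods."""
    for name, attr in list(vars(cls).items()):
        if inspect.isfunction(attr) and not name.startswith("__"):
            setattr(cls, name, _counting(cls.__name__, name, attr))
    original_init = cls.__init__

    @functools.wraps(original_init)
    def counting_init(self, *args, **kwargs):
        if type(self) is cls:  # do not count subclass instances twice
            created_instances[cls.__name__] += 1
        original_init(self, *args, **kwargs)

    cls.__init__ = counting_init
    return cls


def nmi(class_name):
    """Number of method invocations recorded for a class."""
    return sum(n for (c, _), n in method_invocations.items() if c == class_name)


def ncm(class_name):
    """Number of distinct methods of a class that were called at least once."""
    return sum(1 for (c, _), n in method_invocations.items()
               if c == class_name and n > 0)


@instrument
class Scanner:
    def __init__(self, text):
        self.text = text

    def tokens(self):
        return self.text.split()

    def unused(self):
        return None


# The "trace scenario": in the case study this role is played by the unit tests.
for _ in range(3):
    Scanner("a b c").tokens()

print(created_instances["Scanner"], nmi("Scanner"), ncm("Scanner"))
```

Only counters are kept, not the sequence of events, which is the tradeoff between condensed information and trace granularity.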
Run-time Measurements
• NCM, the number of called methods
• NMI, the number of method invocations
• NCI, the number of created instances, that is the number
of times a class has been instantiated
• NCO, the number of created objects, that is the number of
‘foreign’ objects that a class’s objects instantiated
• Condensed information leads to greater scalability
• Tradeoff with granularity and sequence of a trace
• Interval of the values can be great (logarithmic scaling
useful)
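Because run-time counters can range from zero to hundreds of thousands, mapping them linearly to node sizes would make most nodes invisible next to a few huge ones. The sketch below shows one possible logarithmic mapping; the function name `log_scale` and the size limits are assumptions, and the actual scaling used in the tool may differ.

```python
import math


def log_scale(value, max_value, max_size=100.0, min_size=2.0):
    """Map a counter value to a node dimension on a logarithmic scale."""
    if value <= 0 or max_value <= 0:
        return min_size
    fraction = math.log1p(value) / math.log1p(max_value)
    return min_size + fraction * (max_size - min_size)


# 3500 and 350'000 are the instance and call counts mentioned for
# AttributeDescription in the Instance Usage Overview.
for count in (0, 10, 3500, 350_000):
    print(count, round(log_scale(count, 350_000), 1))
```

With this mapping a class with only 10 events is still clearly visible, while the largest value keeps the maximum size.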
Instance Usage Overview
Nodes: Classes
Edges: Inheritance
Metric Scale: Logarithmic
Layout: Tree
Node Width: # of Created Instances
Node Height: # of Called Methods
Node Color: # of Method Invocations
Symptoms
• Small, light: unused
• Narrow, tall: few, but used, instances
• Flat, pale: heavily instantiated, seldom used
• Flat, dark: heavily instantiated, functionality partially but heavily used
Classes
• A: CDIFScanner
• B: AttributeDescription (3500 instances, 350’000 calls!)
• C: FAMIX metamodel root
• G: Uninstantiated FAMIX classes (!)
• I: Smalltalk AST Visitor hierarchy
Creation Interaction View
Nodes: Classes
Edges: Instantiation
Metric Scale: Logarithmic
Layout: Embedded Spring
Node Width: # of Created Objects
Node Height: # of Created Instances
Node Color: # of Created Instances
Edge Width: # of Instantiations
Symptoms
• Unconnected: uninstantiated
• Connected, small: classes with few instances
• Flat, light: instance creators, seldom instantiated, possibly factories
• Narrow, dark: heavily instantiated, but do not create many other instances
• Wide, dark: heavily instantiated and used
Class Examples
A: AttributeDescription - C: VWImporter (high-level import), D: VWParseTreeEnumerator (low-level import)
E: FAMIXClass, ..Method, ..Attribute, etc. F: FAMIXAccess, FAMIXInvocation - G: MSEMeasurement (short-lived objects)
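The metrics and edges of the Creation Interaction View can be derived from a list of creation events. The sketch below assumes each event is a pair `(creator class, created class)` and reads “foreign” objects as objects of a class other than the creator's own; both the event format and that reading are assumptions. The class names in the sample come from the slide, but the event counts are invented.

```python
from collections import Counter


def creation_metrics(events):
    """Derive NCI, NCO, and instantiation edges from creation events.

    events: iterable of (creator_class, created_class) pairs.
    """
    nci = Counter()    # created instances per class (node height and color)
    nco = Counter()    # 'foreign' objects created by a class (node width)
    edges = Counter()  # (creator, created) -> number of instantiations
    for creator, created in events:
        nci[created] += 1
        if creator != created:
            nco[creator] += 1
        edges[(creator, created)] += 1
    return nci, nco, edges


# Invented events, for illustration only.
events = (
    [("VWImporter", "FAMIXClass")] * 2
    + [("VWImporter", "FAMIXAccess")] * 5
    + [("FAMIXClass", "FAMIXClass")]
)
nci, nco, edges = creation_metrics(events)
print(nci, nco, edges, sep="\n")
```

In this sample VWImporter is never instantiated but creates many objects, so it would show up as the flat, light node that the symptoms describe as a possible factory.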
Dynamic Information SV - Conclusions
• Pros
– Some new views on
software systems
– Intuitive and compact way
of presenting very large
amounts of information
– Insights into
implementation issues
– Side result: assessment of
test suite
• Cons
– Loss of granularity and order
– Suitability for optimization
domain unclear
– Probably does not really scale
up for very large systems (but
this depends on the viewer and
his/her will to interact..)
– The current approach is
intrinsically interactive
(automatisation would be
possible using advanced
metrics-based techniques like
detection strategies)
What about reality?
• Most IDEs have no or limited visualization support
• Not an industry “standard”, most developers still have vi &
emacs mentality
• Still poor usability
– May be used as “stand-alone” browsing tool, but not as part of a
development methodology
– Needs much more effort (and people) to be “sexy”
• Ongoing work must cope with the “present hypes”, such
as distributed development, eXtreme programming, etc.
Epilogue
The End
• Did we succeed after all?
• Not completely, but…
– System Hotspots View on
1’200’000 LOC of C++
– System Complexity View
on ca. 200 classes of C++
Industrial Validation - The Acid Test
• Several large, industrial case studies (NDA)
• Different implementation languages
• Severe time constraints
System     Language    Lines of Code   Classes
Z          C++         1’200’000       ~2300
Y          C++/Java    120’000         ~400
X          Smalltalk   600’000         ~2500
W          COBOL       40’000          -
Sortie     C/C++       28’000          ~70
Duploc     Smalltalk   32’000          ~230
Jun        Smalltalk   135’000         ~700
ArgoUML    Java        220’000         ~1400
Questions and Comments
Let’s do it…