WCRE 1999 / 2009

Experiments with clustering as a software remodularization method
Nicolas Anquetil
Timothy C. Lethbridge
Forewarning
Nicolas:
- After this research I became suspicious of the usefulness of clustering for remodularization. I still am.
You have been warned
(although note that Tim has a less gloomy view)
Agenda
- Background of the research
- Overview of the paper
- From then until now
- And now what?
- An analogy
- Another analogy
Background of the research
Context:
- KBRE group, U. of Ottawa, Canada
- CSER project (Consortium for Software Engineering Research)
- Pairs: university/company (U. of Ottawa/telecom company)
- Focus on real problems and/or real situations
The project: One company's PBX
- 2+ MLOC
- 2+ K files
- 10+ possible configurations
- 10+ years old (in 1999)
- 2 proprietary languages
- 1 directory
- 0 packages
Company situation:
- High turnover (18 months)
- High entry barrier (6+ months to be productive)
- Aging software (and languages)
- Configuration management difficulties
Overview of the paper
“providing solutions to help software engineers understand, restructure or migrate old software towards more modern architecture and/or languages”
Possible solution:
“Clustering is used to gather software components into modules significant to the software engineers.”
- Seminal paper by Theo Wiggerts, “Using Clustering Algorithms in Legacy Systems Remodularization”, WCRE'97
  - Summary of the literature on clustering
  - Lists all the possible choices
  - Lists some advantages and drawbacks of these choices
“Clustering is a sophisticated research domain with many methods [...] Reverse engineering is a young domain [...] Clustering has been used with no deep understanding of all the issues involved.”
“Conclusions of Wiggerts' paper are those of the literature which may not entirely hold for reverse engineering.”
- For example:
  - Living things naturally fit in an evolution tree (more or less)
  - Not so with software modularization
  - This must impact the tools we use and how we use them
- Three issues
  - What clustering algorithms to use?
    - How to compute cohesion?
  - How to describe entities?
  - How to evaluate the results?
- Algorithms
  - We tested mainly hierarchical agglomerative algorithms
  - Some tests with hill-climbing algorithms (“Bunch” tool: Mancoridis)
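A minimal sketch of hierarchical agglomerative clustering applied to files, for illustration only. The file names, the feature sets, the Jaccard similarity, the complete-linkage rule and the stopping threshold are all assumptions made for this example; they are not the configuration used in the experiments.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two feature sets: shared features / features present in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(features, threshold):
    """Complete-linkage agglomerative clustering, stopped at a similarity threshold.

    features: dict mapping a file name to its set of features (hypothetical data).
    Returns a sorted list of clusters, each a sorted list of file names.
    """
    clusters = [frozenset([name]) for name in features]

    def linkage(c1, c2):
        # Complete linkage: similarity of the least similar cross-cluster pair.
        return min(jaccard(features[x], features[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Pick the two most similar clusters and merge them, bottom-up.
        best = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return sorted(sorted(c) for c in clusters)


# Hypothetical files described by the identifiers they contain.
files = {
    "call_setup.src": {"dial", "route", "trunk"},
    "call_route.src": {"route", "trunk", "tone"},
    "ui_menu.src": {"menu", "display"},
    "ui_keys.src": {"menu", "display", "key"},
}
print(agglomerate(files, threshold=0.4))
# [['call_route.src', 'call_setup.src'], ['ui_keys.src', 'ui_menu.src']]
```

Raising the threshold stops the merging earlier and yields more, smaller clusters; the algorithm always returns some partition, whether or not it means anything to the engineers.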
- Entities
  - We clustered files (into packages)
  - Description
    - Elements contained in the files:
      - Types, variables, routines, macros, comments, identifiers
Reminder:
“Clustering algorithms do not discover some hidden structure in a system, but impose a structure on the set of entities they are given.”
Some results
- Redundancies among description schemes:
  - File, routine, variable, macro, type
  - Comments, identifiers
- Combining features (routine + variable + ...) improves the results
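One way to picture combining description schemes is to merge the per-scheme feature sets of each file into a single set, tagging each feature with its kind so that a routine and a variable with the same name stay distinct. This is only a sketch: the `combine` helper and the sample data are hypothetical, and the slide does not say how the combination was actually done.

```python
def combine(*schemes):
    """Merge several description schemes into one tagged feature set per file.

    Each scheme is a (kind, mapping) pair, where mapping is file -> set of names.
    """
    combined = {}
    for kind, scheme in schemes:
        for name, feats in scheme.items():
            combined.setdefault(name, set()).update(f"{kind}:{f}" for f in feats)
    return combined


# Hypothetical descriptions of two files under two schemes.
routines = {"a.src": {"init"}, "b.src": {"init", "run"}}
variables = {"a.src": {"count"}, "b.src": {"init"}}

merged = combine(("routine", routines), ("variable", variables))
print(sorted(merged["b.src"]))
# ['routine:init', 'routine:run', 'variable:init']
```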
- Direct/sibling links
  - Sibling more used and better
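A small sketch of the two kinds of link, under our reading of the terms: a direct link counts references going from one file to the other, while a sibling link counts the features two files have in common. The files, routines and both function names below are hypothetical.

```python
# Hypothetical data: routines called by each file, and the file defining each routine.
calls = {
    "a.src": {"log", "parse"},
    "b.src": {"log", "parse"},
    "util.src": set(),
}
defined_in = {"log": "util.src", "parse": "util.src"}


def direct_link(f1, f2):
    """Number of references going directly from one file to the other."""
    forward = sum(1 for r in calls[f1] if defined_in.get(r) == f2)
    backward = sum(1 for r in calls[f2] if defined_in.get(r) == f1)
    return forward + backward


def sibling_link(f1, f2):
    """Number of features (here, called routines) that both files share."""
    return len(calls[f1] & calls[f2])


print(direct_link("a.src", "b.src"))     # 0: neither file refers to the other
print(sibling_link("a.src", "b.src"))    # 2: both call log and parse
print(direct_link("a.src", "util.src"))  # 2: a.src calls two routines of util.src
```

In this toy case the direct links tie both files to the utility file, whereas the sibling links relate `a.src` and `b.src` to each other.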
- Avoid “sparse” descriptive features
- Avoid similarity metrics that consider absence of a feature as significant
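To see why counting absent features is a problem with sparse descriptions, compare the Jaccard coefficient (ignores features absent from both entities) with the simple matching coefficient (counts joint absence as agreement). The feature universe of 100 routines and the two files are made up for this example.

```python
def jaccard(a, b):
    """Shared features divided by features present in at least one entity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simple_matching(a, b, all_features):
    """Agreements (both present or both absent) divided by all features."""
    both_absent = len(all_features - (a | b))
    return (len(a & b) + both_absent) / len(all_features)


# Hypothetical universe of 100 routines; each file uses only two of them.
all_features = {f"routine_{i}" for i in range(100)}
x = {"routine_1", "routine_2"}
y = {"routine_3", "routine_4"}

print(jaccard(x, y))                        # 0.0
print(simple_matching(x, y, all_features))  # 0.96
```

The two files share nothing, yet simple matching rates them 96% similar because they agree on the 96 routines that neither of them uses.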
From then until now
- Raw numbers
- What extensions?
References (volume)
[Chart: references per year, 1999 to 2009; vertical axis from 0 to 18; data from Google Scholar]
References (authors)
- P. Tonella (8), F. Ricca (7), C. Girardi (5), E. Pianta (5)
- O. Maqbool (7), H.A. Babri (6)
- C. Tjortjis (5)
- N. Anquetil (5)
- S. Ducasse (5)
- K. Sartipi (4)
[data from Google Scholar]
References (venue)
- Thesis = 11
- CSMR = 6
- IWPC = 6
- WCRE = 5
- J.Soft.Maint.Evol. = 4
- J.Syst.Soft. = 4
- ICSM = 3
- ICSE = 2
- Trans.Syst.Eng. = 2
[data from Google Scholar]
Some extensions
- Clustering, how?
  - New/improved algorithms
  - New/improved distance metrics
- Clustering what?
  - New entities (and/or description)
- Clustering, why?
- Other extensions
New algorithm
- Genetic algorithm
  - [Mahdavi]
- “Combined algorithm”
  - [Saeed, Maqbool, Babri, Hassan, Sarwar]
New distance metric
- Minimization of information loss
  - [Andritsos, Tzerpos]
New entities
- Static web pages
  - [Di Lucca, Fasolino, Tramontana]
  - [Tonella, Ricca, Pianta, Girardi]
- Association rules
- Data vs. Control
- Dynamic data
- [Davey, Burd], [Sartipi, Kontogiannis], [Stroulia, Systä]
- Co-change records
  - [Maqbool, Babri]
Other extensions
- Evaluations / comparisons
  - [Tonella], [Wu, Holt], [Parsa, Bushehrian]
- Framework
- Needs of maintainers?
  - [Tjortjis, Layzell]
- Input for visualization tools
  - [Ducasse]
- Naming clusters
  - [Tzerpos], [Maqbool, Babri]
And now what?
- Back to paper's results
- Wild ideas in clustering
- Related topics
Paper's results
- Choice of (traditional) algorithm matters little
  - It will give a result
  - Not significantly better or worse than the others
- Choice of similarity metric matters little
  - As long as it does not treat the absence of a feature as a sign of similarity
- Choice of description scheme for entities matters a bit more
  - May be a source of short-term progress?
    - Using dynamic information?
Wild ideas
- Consider new entities?
  - Individual instructions?
  - Non-code: requirements, model elements, tests, ...?
- Process-wise modularization?
  - Clustering requirements, model elements, ...
Related topics
- Problem without solution?
  - Software modularization is highly subjective
  - Packages are not mutually exclusive
  - Decisions must be made that are always wrong (and always correct)
- Modularization is a logical (virtual) decomposition based on semantics
  - High cohesion, low coupling may only be an (imperfect) by-product of a pre-chosen modularization
  - Cohesion/coupling not a driving force but a secondary goal?
  - Other forces, e.g. packages of “comparable” sizes
- Typical example: Utility package
  - Low cohesion, high coupling
  - java.util
    - BitSet, Calendar, Currency, Dictionary, EventListenerProxy, Formatter, Observable, Random, ResourceBundle, Scanner, UUID, TimeZone, ...
- How to evaluate results?
  - Cohesion/coupling
    - Open question in the paper
    - Normally useless because it is the function optimized by the algorithms
  - Gold standard
    - Manually: expensive, not precise
    - Automatically: biased
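As an illustration of evaluation against a gold standard, one possible measure compares the pairs of entities that the proposed partition puts together with the pairs the reference partition puts together, and reports precision and recall over those pairs. This is a sketch under that assumption; the function names and the two partitions are hypothetical, and obtaining the gold standard remains the hard part.

```python
from itertools import combinations


def intra_pairs(partition):
    """All unordered pairs of entities placed in the same cluster."""
    return {
        frozenset(pair)
        for cluster in partition
        for pair in combinations(sorted(cluster), 2)
    }


def precision_recall(proposed, gold):
    """Pairwise precision and recall of a proposed partition against a gold standard."""
    found, expected = intra_pairs(proposed), intra_pairs(gold)
    common = found & expected
    precision = len(common) / len(found) if found else 1.0
    recall = len(common) / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical reference decomposition and clustering result.
gold = [{"a", "b", "c"}, {"d", "e"}]
proposed = [{"a", "b"}, {"c", "d", "e"}]
print(precision_recall(proposed, gold))  # (0.5, 0.5)
```

Here two of the four proposed pairs (a-b and d-e) are also in the gold standard, and two of the four gold pairs are recovered.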
- How to evaluate results?
  - Other metrics, e.g. Stability, Non-extremity [Wu]
Paper's results
- “The fact that all six algorithms are ranked low on authoritativeness suggests that they may not be mature enough for use in production on large systems undergoing evolutionary change. However ...” [Wu, Holt, 2005]
An analogy
- A short story of Belo Horizonte:
  - In 1893 a new capital is planned in the state of Minas Gerais (Brazil)
  - The architects/urbanists get inspiration from Washington D.C.
- The initial architecture:
[Figure: Planned Belo Horizonte]
- The city grew (2.5 Mhab., area=5.1 Mh.)
- Could we remodularize that?
- Analogy with software clustering:
  - The initial architecture is completely lost in the overall city
  - Regularities would allow finding only small “clusters”
  - There are large “empty” parts that are difficult to cluster (automatically)
  - A division into districts would necessarily be subjective
Another analogy
- You are a 21-year-old leaving university
  - You buy a large house because you have a good job
  - You are not well organized
  - You have a general concept that “food goes in the kitchen and clothes go in the bedroom”
  - But much of your stuff is strewn around
- Initially you do not have many things, so the disorganization doesn't matter
- After a while, you accumulate a great many worldly goods
  - You constantly can't find things
  - Your new partner starts complaining
- You realize it is time to organize things better
- You are a computer scientist so you want to apply a clustering algorithm
- But what criteria to use?
  - Things made in the same country go together?
    - Oops, the 'China' cluster is too big
  - Temporal cohesion?
    - Things used in the morning in one place, things used in the evening in another place?
      - Where does 'toothbrush' go?
- Functional cohesion
  - Everything for each recipe I make is kept together
  - But utilities (things used commonly) are separately organized as a cluster
  - Too awkward
- In the end, your approach is pragmatic:
  1. You decide from general experience on a set of general categories and storage locations
  2. You spend a weekend moving things into these locations (yes, there are thousands of things)
  3. As you proceed, you notice
     - Some things do not fit in any category
     - Some categories are not so well chosen
     - Some categories overlap
  4. You refactor the categories a bit and move things around
How can this be applied to software?
- Use a clustering tool mainly to give you a sense of the possibilities
  - Combine with other RE tools to learn about the functionality of each module as well as other properties
  - But also apply general wisdom about good software design
- Play with the parameters of the clustering tool and other RE tools, refactoring until you have achieved a remodularization that you understand
  - Ideally, tools would allow instant adjustment with good visualization
  - Retain documents describing the resulting design