The Search Landscape of
Graph Partitioning Problems
using Coupling and Cohesion as
the Clustering Criteria
Brian S. Mitchell & Spiros Mancoridis
{bmitchel,smancori}@mcs.drexel.edu
http://www.mcs.drexel.edu/~{bmitchel,smancori}
Department of Computer Science
Software Engineering Research Group
http://serg.mcs.drexel.edu
Drexel University, Philadelphia, PA, USA
10/05/2002
Software Clustering with Bunch
[Figure: the Bunch clustering pipeline. Source code (e.g., a small C program) is processed by source code analysis tools (Acacia, Chava) to produce an MDG file of modules M1 to M8. The Bunch clustering tool (Bunch GUI, clustering algorithms, clustering tools, programming API) turns the MDG file into a partitioned MDG file, which is passed to a visualization tool.]
Drexel University Software Engineering Research Group (SERG)
http://serg.mcs.drexel.edu
Software Clustering as a Search Problem
[Figure: source code is processed by source code analysis tools (Acacia, Chava) into an MDG of modules M1 to M8. The search space is the set of all MDG partitions (total = 4140 partitions for the 8 modules shown). Software clustering search algorithms explore this space and return a “GOOD” MDG partition.]
The search loop shown in the figure:
    bP = null;
    while(searching())
    {
        p = selectNext();
        if(p.isBetter(bP))
            bP = p;
    }
    return bP;
The Search Space is Enormous
The number of MDG partitions grows very quickly
as the number of modules in the system increases.
The number of ways to partition n modules into k clusters is

S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \text{ or } k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}

and the total number of partitions of an n-module MDG is the sum of S_{n,k} over k = 1, ..., n:

Modules   Partitions
1         1
2         2
3         5
4         15
5         52
6         203
7         877
8         4140
9         21147
10        115975
11        678570
12        4213597
13        27644437
14        190899322
15        1382958545
16        10480142147
17        82864869804
18        682076806159
19        5832742205057
20        51724158235372

A 15-module system is about the
limit for performing exhaustive analysis.
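The recurrence above can be checked with a few lines of Python. This is a minimal sketch; the function names `stirling2` and `partitions` are illustrative and are not part of Bunch.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to split n modules into exactly k non-empty clusters."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    # Module n either forms a new cluster (first term) or joins
    # one of the k existing clusters (second term).
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def partitions(n):
    """Total number of partitions of an n-module MDG (sum over all k)."""
    return sum(stirling2(n, k) for k in range(1, n + 1))


if __name__ == "__main__":
    print(partitions(8))   # 4140
    print(partitions(15))  # 1382958545
```

The totals reproduce the values listed on the slide, which is why exhaustive search stops being practical at roughly 15 modules.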
Our Assumption…
“Well designed software systems are
organized into cohesive clusters that are
loosely interconnected.”
We designed a measurement called MQ that
embodies our assumption
The MQ measurement balances cohesion and
coupling
We apply MQ to partitions of the MDG
Not all Partitions of the MDG are
Good Solutions
[Figure: an MDG of six modules (M1 to M6) shown with two alternative partitions, one labeled “Bad Partition!” and one labeled “Good Partition!”.]
MQ(Good Partition) > MQ(Bad Partition)
The Software Clustering Problem:
Algorithm Objectives
“Find a good partition of the MDG.”
A partition is the decomposition of a set of
elements (i.e., all the nodes of the graph) into
mutually disjoint clusters.
A good partition is a partition where:
highly interdependent nodes are grouped in the
same clusters
independent nodes are assigned to separate
clusters
The better the partition the higher the MQ
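The slides do not give the formula for MQ. The sketch below assumes the cluster-factor form described in the Bunch literature: each cluster i contributes CF_i = 2*intra_i / (2*intra_i + inter_i), where intra_i counts edges inside the cluster and inter_i counts edges crossing its boundary, and MQ is the sum of the cluster factors. The toy graph and the name `mq` are illustrative only.

```python
def mq(edges, assign):
    """Sum of cluster factors CF = 2*intra / (2*intra + inter).

    `edges` is a list of (source, target) pairs from the MDG and `assign`
    maps each module to a cluster label. The formula is an assumption
    taken from the Bunch literature, not from these slides.
    """
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = assign[u], assign[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            # A crossing edge penalizes both clusters it touches.
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    # Clusters without internal edges have a cluster factor of 0.
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())


# Toy MDG: two triangles joined by a single edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
GOOD = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
BAD = {"a": 0, "b": 0, "d": 0, "c": 1, "e": 1, "f": 1}

if __name__ == "__main__":
    print(round(mq(EDGES, GOOD), 3))  # 1.714
    print(round(mq(EDGES, BAD), 3))   # 0.571
```

Under this assumed definition the partition that keeps each triangle together scores higher, which matches the intent that cohesive, loosely coupled clusters receive a higher MQ.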
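Bunch Hill Climbing Clustering
Algorithm
[Figure: flowchart of the hill-climbing algorithm.]
Generate a random decomposition of the MDG; this is the first current partition.
Iteration step: generate the next neighbor of the current partition, measure its MQ, and compare it to the best neighboring partition found so far. If it is better, it becomes the new best neighboring partition.
The best neighboring partition for the iteration becomes the current partition of the next iteration.
At convergence, the best neighboring partition is the result.
A neighbor partition is created by altering the current partition slightly.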
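The loop in the flowchart can be sketched as a steepest-ascent hill climber. This is an illustration under stated assumptions, not Bunch's implementation: MQ is assumed to be the sum of cluster factors 2*intra / (2*intra + inter), a neighbor is assumed to be the current partition with one module moved to another (possibly new) cluster, and all function names are hypothetical.

```python
import random


def mq(edges, assign):
    """Assumed MQ: sum over clusters of 2*intra / (2*intra + inter)."""
    intra, inter = {}, {}
    for u, v in edges:
        cu, cv = assign[u], assign[v]
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
        else:
            inter[cu] = inter.get(cu, 0) + 1
            inter[cv] = inter.get(cv, 0) + 1
    return sum(2 * m / (2 * m + inter.get(c, 0)) for c, m in intra.items())


def neighbors(assign, nodes):
    """Yield partitions that differ from `assign` by moving one module."""
    labels = set(assign.values())
    fresh = max(labels) + 1  # label for a brand-new singleton cluster
    for n in nodes:
        for c in labels | {fresh}:
            if c != assign[n]:
                moved = dict(assign)
                moved[n] = c
                yield moved


def hill_climb(nodes, edges, seed=0):
    """Start from a random partition and climb until no neighbor is better."""
    rng = random.Random(seed)
    current = {n: rng.randrange(len(nodes)) for n in nodes}
    current_mq = mq(edges, current)
    while True:
        best, best_mq = None, current_mq
        for cand in neighbors(current, nodes):
            cand_mq = mq(edges, cand)
            if cand_mq > best_mq:  # new best neighboring partition
                best, best_mq = cand, cand_mq
        if best is None:           # convergence: a local optimum
            return current, current_mq
        current, current_mq = best, best_mq


if __name__ == "__main__":
    nodes = ["a", "b", "c", "d", "e", "f"]
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
    print(hill_climb(nodes, edges, seed=1))
```

The result is only a local optimum, so different seeds may return different partitions; this is the reason for running the clustering many times.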
Other Things of Interest
We have implemented a family of hill-climbing algorithms.
We also implemented an Exhaustive and a Genetic Algorithm.
Hierarchical Clustering (1):
Nested View
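[Figure: nested view of the clustering hierarchy, with four numbered callouts (1 to 4); callout 2 is marked “Default”.]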
Hierarchical Clustering (2):
Consolidated View
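[Figure: consolidated view of the clustering hierarchy, with four numbered callouts (1 to 4); callout 2 is marked “Default”.]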
Hierarchical Clustering (3):
Tree View
Observations
• The number of levels in a given system’s clustering hierarchy is bounded by O(log2 N), because Bunch places at least 2 nodes in each cluster.
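A short sketch of why the bound holds: if every cluster contains at least 2 nodes, each level has at most half as many clusters as the level below it has nodes, so N modules can be halved at most floor(log2 N) times. The helper name `max_levels` is illustrative.

```python
def max_levels(n_modules):
    """Upper bound on hierarchy depth when every cluster has >= 2 nodes."""
    levels = 0
    remaining = n_modules
    while remaining >= 2:
        remaining //= 2  # at best, pairs of nodes collapse into one cluster
        levels += 1
    return levels


if __name__ == "__main__":
    print(max_levels(8))    # 3
    print(max_levels(100))  # 6
```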
Evaluating The Software
Clustering Results
Over the past few years we have spent
a lot of time evaluating Bunch’s
software clustering results
Empirically
Semi-formally
Measuring Similarity
What We Know
Given a particular MDG, the results
produced by Bunch converge to a family
of related solutions
The search space is large, and the
probability of finding a good solution by
random sampling is infinitesimal
Software Clustering using Graph
Partitioning Techniques
Running Bunch multiple times produces a
family of related clustering results
Bunch starts with a random partition of the MDG,
and makes random moves to explore the search
space
Software Clustering using Graph
Partitioning Techniques
How related are these clustering results?
Software Clustering using Graph
Partitioning Techniques
Given that there are 27,644,437 distinct partitions
of this MDG, there is a lot of agreement…
Software Clustering using Graph
Partitioning Techniques
Why Some Modules Don’t Agree…
Library Modules
Isomorphism
Omnipresent
Module Influences
Special Modules
Isomorphic – Modules that are
connected to multiple clusters with
equal strength
Library – All edges fan-in
Driver – All edges fan-out
Omnipresent – Modules that are
strongly connected to many other
modules in the system
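The categories above can be approximated from the MDG alone. The sketch below is one possible reading, not Bunch's detection logic: the threshold for "omnipresent" (degree above 3 times the average) is an assumption, "isomorphic" is read as a tie between the two clusters a module is most strongly connected to, and the function names are hypothetical.

```python
from collections import Counter


def classify_modules(nodes, edges, omnipresent_factor=3.0):
    """Label each module as omnipresent, library, driver, or ordinary."""
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    avg_degree = 2 * len(edges) / len(nodes)
    roles = {}
    for n in nodes:
        degree = indeg[n] + outdeg[n]
        if degree > omnipresent_factor * avg_degree:  # assumed threshold
            roles[n] = "omnipresent"
        elif degree > 0 and outdeg[n] == 0:
            roles[n] = "library"   # all edges fan in
        elif degree > 0 and indeg[n] == 0:
            roles[n] = "driver"    # all edges fan out
        else:
            roles[n] = "ordinary"
    return roles


def is_isomorphic(node, edges, assign):
    """True if `node` is tied between its two most strongly connected clusters."""
    strength = Counter()
    for u, v in edges:
        if u == node and v != node:
            strength[assign[v]] += 1
        elif v == node and u != node:
            strength[assign[u]] += 1
    ranked = strength.most_common(2)
    return len(ranked) == 2 and ranked[0][1] == ranked[1][1]


if __name__ == "__main__":
    nodes = ["main", "a", "b", "util"]
    edges = [("main", "a"), ("main", "b"), ("a", "util"),
             ("b", "util"), ("a", "b")]
    print(classify_modules(nodes, edges))
```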
Clustering a System
Many Times (1)…
[Figure: results of clustering the RCS and Dot systems many times. Panels: RCS (Bunch), RCS (Random), Dot (Bunch), Dot (Random). The plots show MQ value versus number of clusters, and MQ and number of clusters versus sample (0 to 1000), for random partitions and for partitions produced by Bunch.]
Clustering a System
Many Times (2)…
[Figure: results of clustering the Swing and Bunch systems many times. Panels: Swing (Bunch), Swing (Random), Bunch (Bunch), Bunch (Random). The plots show MQ value versus number of clusters, and MQ and number of clusters versus sample (0 to 1000), for random partitions and for partitions produced by Bunch.]
Observations
• As the number of clusters increased in the random samples, MQ decreased
• Bunch converged to a consistent “family” of solutions, no matter where the random starting point was generated
• Some solutions were multi-modal
• Random solutions were consistently worse than Bunch’s solutions
Example - Detailed Results:
Bunch System
[Figure: MQ versus number of clusters for the Bunch system, with two groups of results annotated 23% and 77%. Two further panels, MQ For Random Clusters (4-8) and MQ For Random Clusters (11-16), plot MQ against sample (0 to 1000).]
The search space has some inherent structure, as random clusters constrained to the area where Bunch converged did not produce better MQ values.
Understanding the Search Space
There are characteristics of Bunch’s clustering
algorithms that are interesting:
It seems unusual that the clustering algorithms
produce consistent MQ values given the large
search space
Other approaches [spectral methods] to solving
the clustering problem using Bunch’s MQ have not
produced better clustering results
The median clustering level is a good tradeoff
between cluster size and number of clusters
Harman et al. examined using a target granularity
[GECCO’02] to bias the desired cluster sizes
Investigating the Search Space
Examined multiple systems of different
size:
15 open source systems developed in C,
C++, or Java
13 randomly generated graphs with
different properties that we wanted to
investigate
We clustered each MDG 500 times and examined
the clustering data to gain some insight into the
search space.
Example: Median Clustering
Level
[Figure: panels for swing and Kerberos v.5. Axis: cumulative MQ. Legend: levels L1 to L7 and Median.]
Example: Median Clustering
Level
[Figure: panels for telnetd and php. Axis: MQ. Legend: levels L1 to L4 and Median.]
Example: Median Clustering
Level
[Figure: panels for bash, mod_ssl, lynx, ping_libc, elm, and mailx. X axis: MQ value.]
Example: Median Clustering
Level – Random Bipartite Graphs
[Figure: panels for bip-100-1, bip-100-2, bip-100-5, bip-100-25, and bip-100-75. X axis: MQ value.]
Example: Median Clustering
Level – Random Graphs
[Figure: panels for rnd-100-1, rnd-100-2, rnd-100-5, rnd-100-25, and rnd-100-75. X axis: MQ value.]
Example: Median Clustering
Level – Random “Circle” Graphs
[Figure: panels for circle-50, circle-100, and circle-150. X axis: MQ value.]
MQ versus #Clusters
[Figure: panels for krb5, swing, bash, mod_ssl, ping_libc, php, telnetd, lynx, mailx, and elm. X axis: #Clusters. Y axis: MQ value.]
MQ versus #Clusters
[Figure: panels for bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, and cir-150. X axis: #Clusters. Y axis: MQ value.]
Internal- versus
External Edges
[Figure: panels for krb5, swing, bash, ping_libc, php, lynx, mod_ssl, telnetd, elm, and mailx. X axis: external edges. Y axis: internal edges.]
Internal- versus
External Edges
[Figure: panels for bip-100-1, bip-100-5, bip-100-25, bip-100-75, rnd-100-1, rnd-100-5, rnd-100-25, rnd-100-75, cir-50, cir-100, and cir-150. X axis: external edges. Y axis: internal edges.]
Real Systems
Similarity of Clustering Results
[Figure: bar chart of percentage (0 to 100) per system. Series: IntraEdge Agreement, Isomorphic Nodes. Systems: krb5, ping_libc, swing, lynx, mod_ssl, bunch, bash, inn, elm, php, dhcpd, joe, mailx, crond, telnetd.]
Random Systems
Similarity of Clustering Results
[Figure: bar chart of percentage (0 to 100) per system. Series: IntraEdge Agreement, Isomorphic Nodes. Systems: circle-150, circle-100, circle-50, rnd-100-75, rnd-100-25, rnd-100-5, rnd-100-2, rnd-100-1, bip-100-75, bip-100-25, bip-100-5, bip-100-2, bip-100-1.]
Real Systems
Similarity of Clustering Results
[Figure: bar chart of percentage (0 to 100) per system. Series: IntraEdge Agreement only. Systems: krb5, ping_libc, swing, lynx, mod_ssl, bunch, bash, inn, elm, php, dhcpd, joe, mailx, crond, telnetd.]
Random Systems
Similarity of Clustering Results
[Figure: bar chart of percentage (0 to 100) per system. Series: IntraEdge Agreement only. Systems: circle-150, circle-100, circle-50, rnd-100-75, rnd-100-25, rnd-100-5, rnd-100-2, rnd-100-1, bip-100-75, bip-100-25, bip-100-5, bip-100-2, bip-100-1.]
What we Learned From Studying
the Search Landscape
Not all modules are “equal” - Some modules:
Are connected to many other modules
Are connected to few other modules
Have a large fan-in
Have a large fan-out
Are uniformly connected to other system
components
Are not uniformly connected to other system
components
Some modules may have a more “natural” home than
others with respect to their assigned cluster
What we Learned From Studying
the Search Landscape
Bunch tends to converge to a consistent
solution with respect to MQ
There is a very low probability of finding one of
these partitions by random selection
The partitions found by Bunch are a very small
subset of the overall search landscape
The degree of isomorphism in the clustering
results was larger than expected
What we Learned From Studying
the Search Landscape
When examining the median level of the clustering
hierarchy we observed that all systems tend to
converge to at most 2 levels
The systems that we studied range from under 100 modules
to several thousand modules
The number of levels in the clustering hierarchy is bounded
by O(log2N)
We expect that studying systems with several hundred
thousand modules would produce results where the median
level converges to more than 2 levels.
We observed this in very sparse graphs (e.g., rnd-100-1, and
bip-100-1)
Conclusions (1)
Understanding the search landscape is
important
A single run of Bunch is helpful, but it does
not highlight modules/classes that tend to
drift between clusters
Analysis of many Bunch runs helps build a
mental model of the search landscape
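One way to analyze many runs is to check, for each MDG edge, whether its two modules are treated the same way in every run. The sketch below is a hypothetical measure for illustration; the slides do not define how their IntraEdge Agreement figure is computed, and the names `edge_agreement` and `unstable_edges` are not part of Bunch.

```python
def unstable_edges(edges, runs):
    """Edges that are intra-cluster in some runs and inter-cluster in others.

    `runs` is a list of partitions, each mapping module -> cluster label.
    """
    result = []
    for u, v in edges:
        together = [run[u] == run[v] for run in runs]
        if any(together) and not all(together):
            result.append((u, v))
    return result


def edge_agreement(edges, runs):
    """Fraction of edges classified identically (always intra or always inter)."""
    return 1 - len(unstable_edges(edges, runs)) / len(edges)


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("c", "d")]
    runs = [
        {"a": 0, "b": 0, "c": 1, "d": 1},
        {"a": 0, "b": 0, "c": 0, "d": 1},
    ]
    print(unstable_edges(edges, runs))  # [('b', 'c'), ('c', 'd')]
```

Modules that appear on many unstable edges (here, c) are candidates for the drifting modules mentioned above.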
Conclusions (2)
A best practice for program understanding
Cluster a system many times in order to
understand the search landscape
Identify and separate omnipresent, library and
supplier modules
Identify modules that tend to drift between many
subsystems
Assign them to other clusters manually, or influence the
clustering algorithm by adjusting the edge weights
Bunch supports manual and semi-automatic clustering
features to help with this type of analysis
Questions
Special Thanks To:
AT&T Research
Sun Microsystems
DARPA
NSF
US Army
SEMINAL Group