Identifying Objects Using Cluster and Concept Analysis

Arie van Deursen
Tobias Kuipers
CWI, The Netherlands
Motivation
• Legacy code incomprehensible
– Lack of structure
• Case: >100,000 LOC Banking System
– Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system
General Plan
• Find interesting data
– Data selection
– Candidate attributes
• Find interesting functionality
– Program selection (procedure)
– Candidate methods
• Combine the two
– Candidate classes
Input Selection
• Domain-related v. implementation-specific
• Persistent data stores
– Only records written to/read from file
– Refine by CRUD (Create/Read/Update/Delete)
– Records too big for one class
• Analysis of Program Call Graph
– high fan-out: control-programs
– high fan-in: low-level technical
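The call-graph filter above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the program names, the `CALLS` mapping, and the threshold values are all hypothetical, and the slides do not say which cut-offs were used.

```python
from collections import defaultdict

def fan_in_out(call_graph):
    """Return {program: (fan_in, fan_out)} for a caller -> callees mapping."""
    fan_in = defaultdict(int)
    fan_out = {}
    for caller, callees in call_graph.items():
        unique = set(callees)          # count each callee once per caller
        fan_out[caller] = len(unique)
        for callee in unique:
            fan_in[callee] += 1
    programs = set(call_graph) | set(fan_in)
    return {p: (fan_in.get(p, 0), fan_out.get(p, 0)) for p in programs}

def select_candidates(call_graph, max_fan_in, max_fan_out):
    """Drop likely control programs (high fan-out) and utilities (high fan-in)."""
    metrics = fan_in_out(call_graph)
    return sorted(p for p, (fi, fo) in metrics.items()
                  if fi <= max_fan_in and fo <= max_fan_out)

# Hypothetical call graph: MAIN drives three programs that share two utilities.
CALLS = {
    "MAIN": ["CUSTUPD", "MORTCALC", "REPORT"],
    "CUSTUPD": ["DATEFMT", "IOERR"],
    "MORTCALC": ["DATEFMT", "IOERR"],
    "REPORT": ["DATEFMT", "IOERR"],
}

# Thresholds are assumptions chosen for this toy graph.
print(select_candidates(CALLS, max_fan_in=2, max_fan_out=2))
```

With these assumed thresholds, MAIN is dropped for its fan-out and DATEFMT and IOERR for their fan-in, leaving the three middle-layer programs as candidate methods.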
Combining Data & Functionality
• Cluster analysis -- technique for finding
groups in data
– Relies on metrics to compare distance between
data items
• Concept analysis -- also finds groups in data
– Relies on maximal subsets of data items sharing
a set of features
Cluster Analysis
• Calculate a distance (similarity measure)
between every pair of data items (record fields)
• Use clustering to find a hierarchy
Field Name   P1  P2  P3  P4
NAME         1   0   0   0
TITLE        1   0   0   0
INITIAL      1   0   0   0
PREFIX       1   0   0   0
NUMBER       0   0   0   1
NUMBER-EXT   0   0   0   1
ZIPCODE      0   0   0   1
STREET       0   0   1   1
CITY         0   1   0   1
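The distance step can be made concrete with the table from this slide. The sketch below assumes Euclidean distance over the P1 to P4 columns; the slides do not name the metric, so that choice, and the helper names, are assumptions.

```python
import math

# Rows of the slide's table: one 0/1 entry per column P1..P4.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}

def distance(a, b):
    """Euclidean distance between two 0/1 rows (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(usage):
    """Distance for every ordered pair of field names."""
    return {(f, g): distance(usage[f], usage[g]) for f in usage for g in usage}

D = distance_matrix(USAGE)
print(D[("NAME", "TITLE")])    # identical rows
print(D[("ZIPCODE", "CITY")])  # rows differ in one column
print(D[("NAME", "NUMBER")])   # rows differ in two columns
```

Under this assumed metric, fields with identical rows are at distance 0, and ZIPCODE and CITY, which differ only in P2, are at distance 1.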
Dendrogram
[Figure, built up over several slides beside a repeat of the Cluster Analysis table: dendrogram with a distance axis from 0 to 1. Leaves, in order: Name, Title, Initial, Prefix, Number, Nb-Ext, Zipcode, City, Street. The leaves are added in stages: Name, Title, Initial, Prefix; then Number, Nb-Ext, Zipcode; then City; then Street. Two intermediate slides carry the annotation "Distance is 1".]
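A dendrogram like the one above records the order in which groups are merged. The following is a minimal sketch of agglomerative clustering over the same table, assuming Euclidean distance and single linkage (distance between two groups is the smallest distance between their members). Neither choice is stated on the slides, and ties are broken by input order here, so the merge order can differ from the figure.

```python
import math
from itertools import combinations

USAGE = {
    "NAME": (1, 0, 0, 0), "TITLE": (1, 0, 0, 0),
    "INITIAL": (1, 0, 0, 0), "PREFIX": (1, 0, 0, 0),
    "NUMBER": (0, 0, 0, 1), "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE": (0, 0, 0, 1), "STREET": (0, 0, 1, 1),
    "CITY": (0, 1, 0, 1),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(usage):
    """Return the merges as (distance, cluster_a, cluster_b), in merge order."""
    clusters = [frozenset([field]) for field in usage]
    merges = []

    def gap(pair):
        a, b = pair
        return min(euclidean(usage[x], usage[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Pick the closest pair of clusters; the first pair wins on ties.
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((gap((a, b)), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

for dist, a, b in single_linkage(USAGE):
    print(f"{dist:.3f}  {sorted(a)} + {sorted(b)}")
```

Cutting the resulting hierarchy at a chosen distance yields the candidate groups; where to cut is a judgment call that the clustering itself does not settle.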
Dendrogram from Real Data
[Figure: dendrogram with a distance axis from 0 to 2. Leaves, in order: OfficeName, BankCity, IntAccount, OfficeType, PaymentKind, RelationNr, ChangeDate, Amount, Account, MortSeqNr, MortNr, TitleCd, Prefix, Initial, Name, ZipCd, CountyCd, StreetNr, City, Street.]
Concept Analysis
• Relies on maximal subsets of data items
sharing a set of features
• Concept analysis finds a lattice
Field Name   P1  P2  P3  P4
NAME         x
TITLE        x
INITIAL      x
PREFIX       x
NUMBER                   x
NUMBER-EXT               x
ZIPCODE                  x
STREET               x   x
CITY             x       x
Concept Lattice
[Figure, built up over several slides beside a repeat of the Concept Analysis table: concept lattice in which each node pairs a set of features with a set of items (field names). Nodes:
 top: All Variables
 P1: Name, Title, Initial, Prefix
 P4: Number, Nb-Ext, Zipcode, Street, City
 P3 P4: Street
 P2 P4: City
 bottom: P1 P2 P3 P4]
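The lattice nodes can be recomputed from the table. The sketch below is a brute-force enumeration, suitable only for small tables: for every subset of features it collects the items that have all of them, then closes the feature set. The function names are illustrative, not taken from any concept-analysis library.

```python
from itertools import combinations

# Which features (columns P1..P4) are marked for each field in the table.
USES = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"}, "CITY": {"P2", "P4"},
}
FEATURES = {"P1", "P2", "P3", "P4"}

def common_features(items):
    """Features shared by every item (all features for the empty set)."""
    shared = set(FEATURES)
    for item in items:
        shared &= USES[item]
    return frozenset(shared)

def common_items(features):
    """Items that have every one of the given features."""
    return frozenset(i for i, marked in USES.items() if features <= marked)

def concepts():
    """All (items, features) pairs that are maximal in both directions."""
    found = set()
    for size in range(len(FEATURES) + 1):
        for subset in combinations(sorted(FEATURES), size):
            items = common_items(frozenset(subset))
            found.add((items, common_features(items)))
    return found

# Print from top (most items) to bottom (fewest items).
for items, features in sorted(concepts(), key=lambda c: -len(c[0])):
    print(sorted(features), "->", sorted(items))
```

For this table the enumeration yields six concepts, matching the nodes in the figure. One concept lies below another in the lattice when its item set is a subset of the other's.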

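Real Concept Lattice
[Figure: concept lattice from real data. Nodes are labeled with numbers (3 through 14) and letters (A through X); several nodes carry grouped labels such as ABCDEF, IJKL, MNOP, 11 12, and 13 14.]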
Concluding Remarks
• Variable Selection - Input filtering
• Records are a natural starting point in data-intensive applications
– Legacy/Cobol domain
• Records are too big: Decompose them
• Cluster analysis v. Concept analysis
Cluster v. Concept Analysis
• Multiple partitionings
– Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
– Origin of cluster decision is lost
• Concept analysis more efficient computationally
• Clustering needs more filtering
Questions
Current Approaches
• Subsystem classification techniques
– Survey: Lakhotia (’97). Don’t work for Cobol: Cimitile (’99)
• Record as data part of a class
– Newcomb & Kotik (’95) take level 01 records; Fergen
et al. (’94) compare structure of records for reuse
• Manual Methodology
– Sneed (’92) provides manual methodology for
migration of code; Sneed & Nyári (’95) derive ‘OO’
documentation from legacy code.