Identifying Objects Using Cluster and Concept Analysis
Arie van Deursen, Tobias Kuipers
CWI, The Netherlands

Motivation
• Legacy code incomprehensible
  – Lack of structure
• Case: >100,000 LOC Banking System
  – Cobol + VSAM data files
• Customer wanted OO redesign
• Data central to the system

General Plan
• Find interesting data
  – Data selection
  – Candidate attributes
• Find interesting functionality
  – Program selection (procedure)
  – Candidate methods
• Combine the two
  – Candidate classes

Input Selection
• Domain related v. Implementation specific
• Persistent data stores
  – Only records written to/read from file
  – Refine by CRUD (Create/Read/Update/Delete)
  – Records too big for one class
• Analysis of Program Call Graph
  – high fan-out: control-programs
  – high fan-in: low-level technical

Combining Data & Functionality
• Cluster analysis: technique for finding groups in data
  – Relies on metrics to compare distance between data items
• Concept analysis: for finding groups too
  – Relies on maximal subsets of data items sharing a set of features

Cluster Analysis
• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name    P1  P2  P3  P4
NAME           1   0   0   0
TITLE          1   0   0   0
INITIAL        1   0   0   0
PREFIX         1   0   0   0
NUMBER         0   0   0   1
NUMBER-EXT     0   0   0   1
ZIPCODE        0   0   0   1
STREET         0   0   1   1
CITY           0   1   0   1

Dendrogram
• Built step by step from the table above
• At distance 0: Name, Title, Initial, Prefix
• At distance 0: Number, Nb-Ext, Zipcode
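The distance calculation and clustering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name the distance metric or the linkage rule, so Euclidean distance and single linkage are assumed here, and the names USAGE, cluster_distance and agglomerate are hypothetical. Only the 0/1 table comes from the slides.

```python
from itertools import combinations
from math import dist

# Usage table from the slides: one 0/1 entry per column P1..P4 for each field.
USAGE = {
    "NAME":       (1, 0, 0, 0),
    "TITLE":      (1, 0, 0, 0),
    "INITIAL":    (1, 0, 0, 0),
    "PREFIX":     (1, 0, 0, 0),
    "NUMBER":     (0, 0, 0, 1),
    "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE":    (0, 0, 0, 1),
    "STREET":     (0, 0, 1, 1),
    "CITY":       (0, 1, 0, 1),
}

def cluster_distance(a, b):
    """Assumed single linkage: smallest Euclidean distance between members."""
    return min(dist(USAGE[x], USAGE[y]) for x in a for y in b)

def agglomerate(fields):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([f]) for f in fields]
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        history.append((cluster_distance(a, b), a | b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history

merges = agglomerate(list(USAGE))
for d, members in merges:
    # Each line is one join in the dendrogram: height, then the merged fields.
    print(f"{d:.2f}  {sorted(members)}")
```

Under these assumptions the fields with identical rows merge at distance 0 and City and Street attach at distance 1, which matches the heights shown on the dendrogram slides. A different metric or linkage rule would give different heights for the later merges.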
Dendrogram
• Distance is 1: City is added, then Street
• Leaves (distance axis 0 to 1): Name, Title, Initial, Prefix, Number, Nb-Ext, Zipcode, City, Street

Dendrogram from Real Data
[Figure: dendrogram, distance axis 0 to 2. Leaves: OfficeName, BankCity, IntAccount, OfficeType, PaymentKind, RelationNr, ChangeDate, Amount, Account, MortSeqNr, MortNr, TitleCd, Prefix, Initial, Name, ZipCd, CountyCd, StreetNr, City, Street]

Concept Analysis
• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name    P1  P2  P3  P4
NAME           x
TITLE          x
INITIAL        x
PREFIX         x
NUMBER                     x
NUMBER-EXT                 x
ZIPCODE                    x
STREET                 x   x
CITY               x       x

Concept Lattice
• Each node: set of features, set of items (field names)
  – top: All Variables
  – P1: Name, Title, Initial, Prefix
  – P4: Number, Nb-Ext, Zipcode, Street, City
  – P3 P4: Street
  – P2 P4: City
  – bottom: P1 P2 P3 P4
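The nodes of the small concept lattice can be reproduced with a brute-force sketch. It is an illustration only, not the algorithm used in the work: it closes every subset of the features P1..P4, which is feasible only for a handful of features. The names USES, items_sharing, features_shared and concepts are hypothetical, and reading P1..P4 as the programs that use each field is an assumption based on the General Plan slide.

```python
from itertools import combinations

# Field/feature table from the slides: the features (P1..P4) marked per field.
USES = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"}, "CITY": {"P2", "P4"},
}
PROGRAMS = ["P1", "P2", "P3", "P4"]

def items_sharing(features):
    """All fields that carry every feature in `features`."""
    return frozenset(f for f, progs in USES.items() if features <= progs)

def features_shared(items):
    """All features carried by every field in `items`."""
    return frozenset(p for p in PROGRAMS if all(p in USES[f] for f in items))

def concepts():
    """Enumerate (items, features) pairs by closing every feature subset."""
    found = set()
    for r in range(len(PROGRAMS) + 1):
        for subset in combinations(PROGRAMS, r):
            items = items_sharing(frozenset(subset))
            # The pair is maximal: no item or feature can be added to it.
            found.add((items, features_shared(items)))
    return found

lattice = concepts()
for items, features in sorted(lattice, key=lambda c: -len(c[0])):
    print(sorted(features), "->", sorted(items))
```

For this table the sketch yields six concepts: top, bottom, and the four nodes labelled P1, P4, P3 P4 and P2 P4 on the slide. Street and City each appear in two nodes, which a single dendrogram cut cannot express.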
Real Concept Lattice
[Figure: concept lattice computed from real data; node labels not recoverable]

Concluding Remarks
• Variable selection: input filtering
• Records are a natural starting point in data-intensive applications
  – Legacy/Cobol domain
• Records are too big: decompose them
• Cluster analysis vs. concept analysis

Cluster vs. Concept Analysis
• Multiple partitionings
  – Clustering does not show all possibilities
• Items in multiple groups
• Features and clusters
  – Origin of cluster decision is lost
• Concept analysis is computationally more efficient
• Clustering needs more filtering

Questions

Current Approaches
• Subsystem classification techniques
  – Survey: Lakhotia ('97)
  – Do not work for Cobol: Cimitile ('99)
• Record as data part of a class
  – Newcomb & Kotik ('95) take level 01 records
  – Fergen et al. ('94) compare structure of records for reuse
• Manual methodology
  – Sneed ('92) provides a manual methodology for migration of code
  – Sneed & Nyári ('95) derive 'OO' documentation from legacy code