WCRE 1999 / 2009
Experiments with clustering as a software remodularization method
Nicolas Anquetil, Timothy C. Lethbridge

Forewarning
- Nicolas: After this research I became suspicious of the usefulness of clustering for remodularization. I still am.

You have been warned
(although note that Tim has a less gloomy view)

Agenda
- Background of the research
- Overview of the paper
- From then until now
- And now what?
- An analogy
- Another analogy

Background of the research
- Context: KBRE group, U. of Ottawa, Canada
- CSER project (Consortium for Software Engineering Research)
  - Pairs: university/company (U. of Ottawa / telecom. company)
  - Focus on real problems and/or real situations

Background of the research
- The project: one company's PBX
  - 2+ MLOC
  - 2+ K files
  - 10+ possible configurations
  - 10+ years old (in 1999)
  - 2 proprietary languages
  - 1 directory
  - 0 packages

Background of the research
- Company situation:
  - High turnover (18 months)
  - High entry barrier (6+ months to be productive)
  - Aging software (and languages)
  - Configuration management difficulties

Overview of the paper
“providing solutions to help software engineers understand, restructure or migrate old software towards more modern architecture and/or languages”

Overview of the paper
- Possible solution: “Clustering is used to gather software components into modules significant to the software engineers.”

Overview of the paper
- Seminal paper by Theo Wiggerts, “Using Clustering Algorithms in Legacy Systems Remodularization”, WCRE'97
  - Summary of the literature on clustering
  - Lists all the possible choices
  - Lists some advantages and drawbacks of these choices

Overview of the paper
“Clustering is a sophisticated research domain with many methods [...] Reverse engineering is a young domain [...]
Clustering has been used with no deep understanding of all the issues involved.”

Overview of the paper
“Conclusions of Wiggerts' paper are those of the literature which may not entirely hold for reverse engineering.”

Overview of the paper
- For example:
  - Living things naturally fit in an evolution tree (more or less)
  - Not so with software modularization
- This must impact the tools we use and how we use them

Overview of the paper
- Three issues
  - What clustering algorithms to use?
  - How to compute cohesion?
  - How to describe entities?
- How to evaluate the results?

Overview of the paper
- Algorithms
  - We tested mainly hierarchical agglomerative algorithms
  - Some tests with hill-climbing algorithms (“Bunch” tool: Mancoridis)

Overview of the paper
- Entities
  - We clustered files (into packages)
- Description
  - Elements contained in the files: types, variables, routines, macros, comments, identifiers

Overview of the paper
- Reminder: “Clustering algorithms do not discover some hidden structure in a system, but impose a structure on the set of entities they are given.”

Overview of the paper
- Some results
  - Redundancies among description schemes:
    - File, routine, variable, macro, type
    - Comments, identifiers

Overview of the paper
- Some results
  - Combining features (routine + variable + ...) improves the results

Overview of the paper
- Some results
  - Direct/sibling links
  - Sibling more used and better

Overview of the paper
- Some results
  - Avoid “sparse” descriptive features
  - Avoid similarity metrics that consider absence of a feature as significant

From then until now
- Raw numbers
- What extensions?

From then until now
- References (volume)
  [Chart: references per year, 1999 to 2009; vertical axis from 0 to 18]
  [data from Google Scholar]

From then until now
- References (authors)
  - P. Tonella (8), F. Ricca (7), C. Girardi (5), E. Pianta (5)
  - O. Maqbool (7), H.A. Babri (6)
  - C. Tjortjis (5)
  - N. Anquetil (5)
  - S. Ducasse (5)
  - K. Sartipi (4)
  [data from Google Scholar]

From then until now
- References (venue)
  - Thesis = 11
  - CSMR = 6
  - IWPC = 6
  - WCRE = 5
  - J. Soft. Maint. Evol. = 4
  - J. Syst. Soft. = 4
  - ICSM = 3
  - ICSE = 2
  - Trans. Syst. Eng. = 2
  [data from Google Scholar]

From then until now
- Some extensions
  - Clustering, how?
    - New/improved algorithms
    - New/improved distance metrics
  - Clustering what?
    - New entities (and/or description)
  - Clustering, why?
  - Other extensions

From then until now
- New algorithm
  - Genetic algorithm [Mahdavi]
  - “Combined algorithm” [Saeed, Maqbool, Babri, Hassan, Sarwar]

From then until now
- New distance metric
  - Minimization of information loss [Andritsos, Tzerpos]

From then until now
- New entities
  - Static web pages [Di Lucca, Fasolino, Tramontana], [Tonella, Ricca, Pianta, Girardi]
  - Association rules
  - Data vs. control
  - Dynamic data [Davey, Burd], [Sartipi, Kontogiannis], [Stroulia, Systä]
  - Co-change records [Maqbool, Babri]

From then until now
- Other extensions
  - Evaluations / comparisons [Tonella], [Wu, Holt], [Parsa, Bushehrian]
  - Framework

From then until now
- Other extensions
  - Needs of maintainers? [Tjortjis, Layzell]
  - Input for visualization tools [Ducasse]
  - Naming clusters [Tzerpos], [Maqbool, Babri]

And now what?
- Back to paper's results
- Wild ideas in clustering
- Related topics

And now what?
- Paper's results
  - Choice of (traditional) algorithm matters little
  - It will give a result
  - Not significantly better or worse than any other

And now what?
- Paper's results
  - Choice of similarity metric matters little
  - As long as the metric does not treat absence of a feature as a sign of similarity

And now what?
- Paper's results
  - Choice of description scheme for entities matters a bit more
  - May be a source of short-term progress?
  - Using dynamic information?

And now what?
- Wild ideas
  - Consider new entities?
    - Individual instructions?
    - Non-code: requirements, model elements, tests, ...?
  - Process-wise modularization?
    - Clustering requirements, model elements, ...

And now what?
- Related topics
  - Problem without solution?
    - Software modularization is highly subjective
    - Packages are not mutually exclusive
    - Decisions must be made that are always wrong (and always correct)

And now what?
- Related topics
  - Modularization is a logical (virtual) decomposition based on semantics
  - High cohesion and low coupling may only be an (imperfect) by-product of a pre-chosen modularization
  - Cohesion/coupling not a driving force but a secondary goal?
  - Other forces, e.g. packages of “comparable” sizes

And now what?
- Related topics
  - Typical example: utility package
    - Low cohesion, high coupling
    - java.util: BitSet, Calendar, Currency, Dictionary, EventListenerProxy, Formatter, Observable, Random, ResourceBundle, Scanner, UUID, TimeZone, ...

And now what?
- Related topics
  - How to evaluate results?
    - Cohesion/coupling
      - Open question in the paper
      - Normally useless because it is the function optimized by the algorithms
    - Gold standard
      - Manually: expensive, not precise
      - Automatically: biased

And now what?
- Related topics
  - How to evaluate results?
    - Other metrics, e.g. stability, non-extremity [Wu]

And now what?
- Paper's results
“The fact that all six algorithms are ranked low on authoritativeness suggests that they may not be mature enough for use in production on large systems undergoing evolutionary change.
However ...” [Wu, Holt, 2005]

An analogy
- A short story of Belo Horizonte:
  - In 1893 a new capital is planned in the state of Minas Gerais (Brazil)
  - The architects/urbanists get inspiration from Washington D.C.

An analogy
- The initial architecture: planned Belo Horizonte

An analogy
- The city grew (2.5 million inhabitants, area = 5.1 Mh.)

An analogy
- Could we remodularize that?

An analogy
- Analogy with software clustering:
  - The initial architecture is completely lost in the overall city
  - Regularities would allow finding only small “clusters”
  - There are large “empty” parts that are difficult to cluster (automatically)
  - A division into districts would necessarily be subjective

Another analogy
- You are a 21-year-old leaving university
- You buy a large house because you have a good job
- You are not well organized
  - You have a general concept that “food goes in the kitchen and clothes go in the bedroom”
  - But much of your stuff is strewn around

Another analogy
- Initially you do not have many things, so the disorganization doesn't matter
- After a while, you accumulate very many worldly goods
  - You constantly can't find things
  - Your new partner starts complaining

Another analogy
- You realize it is time to organize things better
- You are a computer scientist so you want to apply a clustering algorithm

Another analogy
- But what criteria to use?
  - Things made in the same country go together?
    - Oops, the 'China' cluster is too big
  - Temporal cohesion? Things used in the morning in one place, things used in the evening in another place?
    - Where does 'toothbrush' go?

Another analogy
- Functional cohesion
  - Everything for each recipe I make is kept together
  - But utilities (things used commonly) are separately organized as a cluster
  - Too awkward

Another analogy
- In the end, your approach is pragmatic:
  1. You decide from general experience on a set of general categories and storage locations
  2. You spend a weekend moving things into these locations (yes, there are thousands of things)

Another analogy
  3. As you proceed, you notice:
     - Some things do not fit in any categories
     - Some categories are not so well chosen
     - Some categories overlap
  4. You refactor the categories a bit and move things around

How can this be applied to software?
- Use a clustering tool mainly to give you a sense of the possibilities
- Combine with other RE tools to learn about the functionality of each module as well as other properties
- But also apply general wisdom about good software design

How can this be applied to software?
- Play with the parameters of the clustering tool and other RE tools, refactoring until you have achieved a remodularization that you understand
- Ideally, tools would allow instant adjustment with good visualization
- Retain documents describing the resulting design
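
A minimal sketch of what "playing with the parameters" might look like, assuming a toy system. The file names, the features, and the helper functions are all invented for illustration; they come neither from the paper nor from any existing tool. The sketch contrasts a similarity metric that ignores shared absences (Jaccard) with one that counts them (simple matching), then runs a naive complete-linkage agglomerative clustering whose threshold is the parameter to play with.

```python
from itertools import combinations

# Hypothetical toy data: each file is described by the set of features
# (routines, types, variables, ...) it refers to. All names are invented.
FILES = {
    "call_setup.c": {"open_line", "ring", "Line"},
    "call_end.c": {"close_line", "ring", "Line"},
    "billing.c": {"charge", "Account"},
    "invoice.c": {"charge", "Account", "print_bill"},
    "util.c": {"log"},
}
ALL_FEATURES = set().union(*FILES.values())


def jaccard(a, b):
    """Shared features over features present in either file.

    Features absent from both files play no role.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simple_matching(a, b, universe):
    """Fraction of features on which both files agree.

    A feature absent from both files counts as agreement, which is the
    behaviour the talk advises against.
    """
    agree = sum((f in a) == (f in b) for f in universe)
    return agree / len(universe)


def agglomerate(files, threshold):
    """Naive complete-linkage agglomerative clustering (assumed variant).

    Repeatedly merges the two most similar clusters and stops when the
    best available similarity falls below `threshold`.
    """
    clusters = [frozenset([name]) for name in files]

    def link(c1, c2):
        # Complete linkage: similarity of the least similar pair of members.
        return min(jaccard(files[x], files[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: link(*p))
        if link(c1, c2) < threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    for t in (0.0, 0.4, 0.6):
        print(t, agglomerate(FILES, t))
```

With this toy data, a threshold of 0.4 groups the two call files and the two billing files and leaves util.c alone, while a threshold of 0.0 merges everything into a single cluster: the structure comes from the parameter as much as from the code. Simple matching rates util.c and billing.c at 0.625 although they share no feature, purely because both lack the same five features.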