A Methodological Framework for Socio-Cognitive Analyses of Collaborative Design of Open Source Software*

Warren Sack1, Françoise Détienne2, Jean-Marie Burkhardt2, Flore Barcellini2, Nicolas Ducheneaut3, Dilan Mahendran4

1 University of California, Santa Cruz, USA
2 INRIA, Eiffel research group, Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay, France
3 Palo Alto Research Center (PARC), 3333 Coyote Hill Road, Palo Alto, CA 94304, USA
4 University of California, Berkeley, CA 94720-2316, USA

The Open Source Software (OSS) movement has received enormous attention in the last several years. It is often characterized as a fundamentally new way to develop software that poses a serious challenge to the commercial software business that dominates most software markets today (Raymond, 2001). It is claimed, for example, that defects are found and fixed very quickly because there are “many eyeballs looking for the problems.” Code is written with more care and creativity, because developers are working only on things for which they have a real passion. All these potential advantages are said to emerge from the following characteristics of work and collaboration inside OSS projects: OSS systems are built by potentially large numbers of volunteers. Work is not assigned; people undertake the work they choose to undertake. There is no explicit system-level design, or even detailed design. There is no project plan, schedule, or list of deliverables.

* Presented at the Workshop on Distributed Collective Practices, ACM CSCW Computer-Supported Cooperative Work, Chicago, November 2004.
http://tech-web-n2.utt.fr/cscw04/

OSS represents an extreme but successful case of geographically distributed development: codesigners work in arbitrary locations, rarely or never meet face-to-face, and coordinate their design activity almost exclusively through three information spaces: the implementation space (code under CVS), the documentation space and the discussion space (Ducheneaut, 2003; Gasser et al. 2003; Mockus et al. 2002).

1. Objective

The objective of our research is to understand the specific hybrid weaving accomplished by the actors of the design process. The design process involves various types of actors: people with prescribed roles, and also elements of the three information spaces. This paper presents the methodological framework we have constructed to analyse, from a socio-cognitive perspective, the links that emerge between these elements.

There exists a wide variety of ongoing Open Source Software (OSS) projects. We chose to work on the design processes of an OSS project devoted to the development of a programming language called Python (see http://www.python.org). The Python project is particularly interesting because its designers engage in a specific design process built around Python Enhancement Proposals (PEPs), which is similar to two design processes used in conventional software projects: RFCs (requests for comments) and technical review meetings. The negotiation, refinement and editing of PEPs are akin to the RFC process that has been practiced for decades to define standards for the Internet (used especially by the Internet Engineering Task Force, IETF). PEPs are also comparable to technical review meetings (D’Astous et al, in press) as practiced in many corporate and governmental settings. The Python project is also interesting because the PEP design process can be seen as distributed across three information spaces: the implementation space, the documentation space and the discussion space.
It thus offers us interesting data to analyse the links constructed between these three spaces and the people involved in the design process. Our object of study is the hybrid weaving accomplished by the actors involved in the negotiation, elaboration, development and implementation of the PEPs.

2. Information spaces and design process in Python

PEPs are the main mechanisms for proposing new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python. A PEP is a design document providing information to the Python community, or describing a new feature for Python. It should provide a concise technical specification of the feature, a rationale for the feature and a reference implementation. Each PEP has a champion (the author of the PEP). The PEP champion should collect community feedback by posting the proposal to the comp.lang.python newsgroup (a.k.a. [email protected] mailing list). The PEP champion then emails the PEP editors, who assign PEP numbers and change their status, with a proposed title and a draft of the PEP. If the PEP editor approves, the editor assigns the PEP a number and gives it the status "Draft". The author of the PEP is then responsible for posting the PEP to the community forums, [email protected] and/or [email protected] where the PEP is discussed. Finally, it is Guido (the project leader, called the Benevolent Dictator For Life, BDFL) and his chosen consultants who may accept or reject a PEP or send it back to the author(s) for revision. Once a PEP has been accepted, the reference implementation must be completed. When the reference implementation is complete and accepted by Guido, the status will be changed to "Final". The implementation can take place. A PEP can also be assigned the status "Deferred" or "Rejected", or can be replaced by a different PEP.
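The status changes described above can be sketched as a small transition table. This is only an illustrative reading of the process, not tooling used by the Python project: the names `ALLOWED`, `can_move` and `advance` are hypothetical, and the Deferred-to-Draft transition is an assumption (a deferred proposal being picked up again).

```python
# Hypothetical sketch of PEP status transitions; not part of any Python tooling.
ALLOWED = {
    "Draft": {"Accepted", "Rejected", "Deferred"},
    "Deferred": {"Draft"},      # assumption: a deferred PEP may be reopened
    "Accepted": {"Final"},      # reference implementation completed and accepted
    "Final": {"Replaced"},      # superseded by a different PEP
    "Rejected": set(),
    "Replaced": set(),
}


def can_move(current, new):
    """Return True if the sketch allows moving from `current` to `new`."""
    return new in ALLOWED.get(current, set())


def advance(current, new):
    """Return the new status, or raise ValueError for a disallowed move."""
    if not can_move(current, new):
        raise ValueError(f"cannot move a PEP from {current!r} to {new!r}")
    return new


# Example: the nominal path of a successful proposal.
status = "Draft"
for step in ("Accepted", "Final"):
    status = advance(status, step)
```

Keeping the transitions in a single dictionary makes it easy to compare this reading against the statuses actually recorded in an archive of PEP revisions.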
The PEP workflow is as follows:

    Draft -> Accepted -> Final -> Replaced
      ^
      +----> Rejected
      v
    Deferred

We analyse the PEP design process as proceeding through three information spaces: the discussion space, the documentation space and the implementation space.

The discussion space is composed of several newsgroups and mailing lists. Most of the newsgroups are also available as mailing lists for participants who don't have Usenet access or prefer to receive messages as e-mail. The comp.lang.python newsgroup is about developing with Python, not about development of the Python interpreter itself. PEP ideas are discussed here before getting an official PEP status (or not). The Python-dev newsgroup is for work on developing Python: fixing bugs and adding new features to Python itself. Practically everyone with CVS write privileges is on python-dev, and first drafts of PEPs are posted here for review and rewriting before their public appearance on python-announce. The comp.lang.python.announce newsgroup is a forum for Python-related announcements. New modules and programs are announced, and PEPs are posted to get comments from the community. Special Interest Groups (SIGs) are smaller communities focused on a particular topic or application, such as databases. Every SIG has a mailing list. There are other mailing lists such as patches.mailing.list and the python-help mailing list.

In the documentation space, the PEP drafts are maintained as text files under CVS control. Archives of discussion are kept on python.org, sourceforge.org and gmane.org. Messages can be viewed according to several organizations: time, topics (e.g. PEP number), threads (reply-to).

In the implementation space, the PEP implementation can take place. The CVS (Concurrent Versions System) tool is used to manage changes within the source code tree.
The current version of a piece of source code is stored, as well as a record of all changes (and who made those changes) that have occurred since the preceding version and so on. While accessing the CVS repository is free, CVS write privileges are given only to a subset of Python community (developers). Figure 1 shows an overview of Python PEP process with links to the three information spaces. Figure 1: Overview of PEP -Python Enhancement Proposal process. Once a pre-PEP is accepted, it becomes a PEP which is discussed in the “discussion space”. Archives of discussion, decisions regarding a PEP and the different versions of a PEP are kept in the documentation spaces. So, status of PEPs and information on PEPs are distributed in these two spaces. Even when a PEP is accepted, it has to be reviewed by BDFL. This review can put the PEP in discussion again. Finally, a PEP can produce a new piece of code (implementation space). 3. A socio-cognitive methodological framework Following the conventions of actor-network analysis in field of science and technology studies we hereafter refer to the “elements” of an OSS project as either “actors” or “actants” and their interrelationships as a “network.” Thus, people, code archives, messages, threads and PEP documents are all actors or actants and their links and relationships of cohesion constitute an actor-network. Concurrently, we refer to the process of PEP development, and OSS development in general, as a process of hybridization – a collective process of knitting together in a cohesive manner the diverse elements of an OSS project. While it is ultimately necessary to understand the emerging or accomplished coherence attained in a PEP, we have found it methodologically more tractable to analyze the textual, material – i.e., literal – signs of coherence in PEP process. Following the linguistic (specifically systemic-functional) convention, we call these literal, textual signs of coherence cohesion. 
Consequently, our cognitive analysis of the PEP process is explicable as an investigation into the emerging cohesion between the many textual elements of an OSS project. Specifically, we have identified the following important textual elements, their parts and the relationships between them, and have developed an XML tagging schema to define them: (a) the published PEP document (usually a webpage of a very specific format that defines the final consensus); (b) email messages exchanged during the negotiation and development of a PEP; (c) threads – i.e., sequences of email message replies – elaborated via email messages; (d) code archive and editing (i.e., CVS) records.

We have examined OSS design as a set of both cognitive and social processes. Methodologically, we have combined qualitative and quantitative approaches including ethnography, discourse analysis, social network analysis, and actor-network analysis. Our work has also entailed the design and implementation of various computational tools for the analysis of email and code archives, and the testing and verification of these tools. We have employed ethnographic methods to drive a needs assessment for the design of algorithms and interfaces for analyzing the archives of OSS projects. Specifically, we have built systems for analyzing email-based discussions and CVS code archives (e.g., the work of Ducheneaut shown in Figure 3). Consequently, some of our ethnographic observations have been embodied in software useful for further examination of OSS development efforts. In turn, some of our qualitative and quantitative work has allowed us to debug, redesign and evaluate our analysis software. For example, the quotation analysis shown in the Appendix was done by hand and has forced us to redesign our threading and quotation analysis software. Depending upon which actors we focus on, our results can be understood as a social analysis or as a cognitive analysis.
Four complementary views of the socio-technical interaction network have been constructed:
- A view on how power is distributed across three information spaces (the discussion, implementation and documentation spaces) shows the social and governance structures in the design project;
- A view on the evolution of links between people and two information spaces (the discussion and implementation spaces) shows the progressive integration of people into the socio-technical network;
- A view on the dynamics in the discussion space and the links with the social structure shows how the design activity reflects the social and organizational structure of the project and people's influence on the design;
- A view on the links between the code space (architecture) and the social structure shows how the technical structure influences the social structure of the project.

3.1 Social and governance structures

Much of the focus of our work has been on understanding the diversity, interrelationships, and dimensions of the social and organizational roles played by participants in the Python project and, specifically, in the PEP process (cf., Gacek et al., 2004). Some of these roles are explicit, others are implicit. For example, the founder of the project, Guido van Rossum, is referred to playfully – but explicitly – as the Python Project’s BDFL, “Benevolent Dictator for Life.” Others in the project have explicit roles insofar as, for instance, they are assigned to lead the development or be administrators of specific parts of the project. Other roles are implicit: question-answerers in online discussions, novices seeking help, etc. We have done a long-term ethnography of the Python project (Mahendran, 2002) and roughly sketched the interrelationships between roles in the Python project using the hierarchy shown in Figure 2.

Figure 2: Sociotechnical stratification of roles in the Python project.
Figure 2 reveals a very conventional organizational structure: one leader (van Rossum) has control over the project; directly below him, in organizational power, are a few people who work directly with him and are known as the Python Lab core team; below them are members of a particular mailing list (Python-dev) who also have the power to directly change the code of the project; below them are advanced members who can comment on the project but cannot change the code; and newbies (or novices) occupy the bottom rung of the organizational hierarchy. From this description one can understand that power in the Python project is distributed across the elements of the project, which might generally be distinguished as three different “spaces”: (1) discussion spaces; (2) implementation or coding spaces; and (3) comment or documentation spaces. Project participants with more power can contribute to all of the spaces. Other participants with limited power find that certain aspects of the project are, literally, “off limits” to them. For example, not everyone can make changes to the code of the project.

Perhaps one of the most striking observations of this ethnography of Python project members concerns how they explain and talk about their roles in the project using a vocabulary of pre-industrial, craft- or artisan-based roles. The notion of the artisan is not uncommon in other free software communities. In the Python community, master/apprentice work relationships were observed to be quite common. The guild-like structure of Python, with senior developers handing off programming projects to junior developers, is striking and marks free software development off from commercial software ventures. Many of the social relations can be reduced to the trope of master and apprentice.
In short, one of our results is rather paradoxical: the hypothesized “new” and “different” structure of OSS development relies on very old ideas of production based on strict, hierarchical models of production and social/organizational roles. These old ideas are apparent in the Python project when participants talk about their own roles in the project and when detailed quantitative studies of work activities are carried out.

3.2 Integration of people into the actor-network

Following this ethnography of the Python project, we carried out quantitative studies analyzing the observable cohesion between the various elements or actors of the project. These studies of, for example, the code and email archives of the project reflect and further substantiate the social and governance structures discovered in the ethnographic work. Our analysis allows us to define and follow participants’ roles and changing status within the PEP process. We employ some standard social network metrics (e.g., measurements of centrality and connectedness) extended to allow the inclusion of non-human actants in the network (e.g., email messages and pieces of code appear as nodes in the following actor-network).

The following is taken from an automatic analysis of the corpus of messages exchanged in the Python OSS project. The automatic analysis (see Figure 3) was done using tools we had developed (Ducheneaut, 2003). The analysis shows how the participant, “Greg,” starts in January 2002 with a proposal to extend the Python language. As an outsider he needs to work his way into the center of the social and technical network of the project before his proposal has any chance of success. He managed to work his way from outsider to insider in about 10 months by contributing to the ongoing discussion and also by writing code for the project (cf., the work of Madey et al., 2004).
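As a rough illustration of what a centrality measure over a mixed network of people and code might look like, the sketch below computes normalized degree centrality on an undirected graph whose nodes are tagged as either "person" or "code". It is not the tool used for Figure 3; the function name, the node labels and the edge list are all hypothetical, and degree centrality is only one of several standard metrics that could be used.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph.

    `edges` is an iterable of (node_a, node_b) pairs. Nodes may be people
    or non-human actants such as code modules; the metric treats them alike.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    # Divide by the largest possible number of neighbours (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical data: two people linked by discussion, and links from
# people to the code modules they have modified.
edges = [
    (("person", "alice"), ("person", "bob")),
    (("person", "alice"), ("code", "module_a.py")),
    (("person", "bob"), ("code", "module_a.py")),
    (("person", "bob"), ("code", "module_b.py")),
]
scores = degree_centrality(edges)
```

Recomputing such scores on successive time slices of an archive would be one way to follow a participant's movement from the periphery toward the center of the network.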
Figure 3: Map of the progressive integration of a software designer into the social (i.e., online discussion) and technical (i.e., code) networks of Python, an Open Source Software project. The round, black nodes indicate people, the square, blue nodes indicate code. Thus, the integrated network of people and code is a sociotechnical network (i.e., an actor network), not simply a social network.

3.3 Organizational structure and citation

In the discussion space, we analyse the emerging cohesion of the email messages themselves through a cognitive analysis akin to discourse analysis in psychology and linguistics. Our research question concerns the dynamics in the discussion space and their links with the social structure.

A central aspect of coherence is how a message connects to previous messages in a discourse context. In face-to-face conversation, coherence (how a turn connects to previous turns in a dialogue) can be seen as actively constructed by participants across turn-taking. In online conversations, a message can be separated both in time and place from the message it responds to. Thus, according to a (time-based) sequential model of online conversation (messages are posted in the order received by the system), turn adjacency is disrupted, i.e. relevant responses do not occur temporally adjacent to initiating turns (Herring, 1999): this is a violation of sequential coherence (pragmatic principles of adjacency and relevance).

Prior work on online discussions (e.g. Venolia & Neustaedter, 2003; Popolov et al. 2000) assumes that the conversational structure is determined by “threading” (i.e., reply relations): a message may either denote a new conversation or be a reply to a single prior message.
This representation is most useful for analysing the interactional roles of proposers and repliers in turn-taking and for getting a picture of the centrality (versus periphery) of participants in the social network (who tends to receive the most responses to a post). However, it is not fully adequate for analysing the referential coherence of the conversation. We examine an alternative view based on quoting or citation (Yee, 2002) and on content analysis.

On the basis of content analysis, Eklundh and Rodriguez (2004) distinguish between several types of conversational linking strategies in online conversations around documents:
- Explicit references: message number (in fact, never used), name of author (e.g. “even though Fred may be right”), subject, either by quoting or paraphrasing
- Implicit references: deictic or anaphoric reference to previous messages (e.g. “as you mention”), conversational sequencing (question or response move), topic relatedness
- External references: to other documents, to group experience

Quoting is seen as a linguistic strategy used by participants to connect a comment to previous discourse contributions. Preliminary studies on the practice of quoting in online conversation (Herring, 1999; Eklundh & Rodriguez, 2004) show that it creates the illusion of adjacency: it incorporates portions of two turns within a single message. It maintains context, and later messages can retrace the history of the conversation.

We started this analysis manually, on one corpus. The second step will be to analyse further corpora with software support. We selected a corpus of 126 email messages posted to the main Python development mailing list from March 28th to April 8th, 2002 by 22 developers including 6 administrators. (This corpus corresponds to the entire discussion of PEP 279.)
We distinguish two types of cohesion that occur between the messages: (1) Reply: Email messages can be explicit responses to previously-posted messages (this is usually visible via a subject: header shared by both messages); and, (2) Quotation: Email messages frequently quote from previously-posted messages (quotations usually appear as indented or prefixed lines -- e.g., lines starting like this: >>> -- in the citing message).

From a close analysis of the discussion organized by quotation, we find that not all posters participated equally in the PEP discussion. When we quantitatively distinguish the high-frequency posters from the low-frequency posters (i.e., those who contributed many versus those who contributed few messages), we can see that high-frequency posters are mostly people who have assigned, administrative positions in the Python project. Moreover, those posters who integrated either no (i.e., zero) quotes or multiple quotes from prior messages into their responses tended to be administrators; those who used single quotes in their replies tended to be developers, not administrators. In short, this is a simple example of where analysis of the activity in the OSS project reflects the social and organizational role structure of the project (and vice versa).

Furthermore, the patterns of quotation, sequential versus branching structure, tend to be linked to the social position of the poster in the Python project (see Appendix). For example, we note that (1) the branching structure is generally initiated by a message posted by either Guido or the PEP’s Champion; (2) the sequential structure tends to show an alternation of posts by administrators and posts by developers. However, in thematic drift (as in P8) this is not observed, as Guido and the PEP’s Champion no longer participate (except when Guido stops the discussion).
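The quotation relation described above was established by hand in this study. As a hedged sketch of how it might be approximated automatically, the code below treats lines prefixed with one or more ">" characters as quotations, matches them against the non-quoted lines of earlier messages, and derives "is quoted by" edges. The function names and sample messages are hypothetical, matching is exact and line-by-line (real mail clients re-wrap quoted text), and on a Python mailing list a ">>>" interpreter prompt would be misread as a quotation.

```python
import re

# One or more ">" markers, each optionally followed by a space.
QUOTE_PREFIX = re.compile(r"^\s*(?:>\s?)+")


def split_lines(body):
    """Split a message body into (own_lines, quoted_lines)."""
    own, quoted = [], []
    for line in body.splitlines():
        match = QUOTE_PREFIX.match(line)
        if match:
            text = line[match.end():].strip()
            if text:
                quoted.append(text)
        elif line.strip():
            own.append(line.strip())
    return own, quoted


def quoted_by_edges(messages):
    """Return a set of (source_id, citing_id) pairs: source "is quoted by" citing.

    `messages` is a list of (msg_id, body) pairs in posting order.
    """
    edges = set()
    seen = []  # (msg_id, set of that message's own lines)
    for msg_id, body in messages:
        own, quoted = split_lines(body)
        for text in quoted:
            for earlier_id, earlier_own in seen:
                if text in earlier_own:
                    edges.add((earlier_id, msg_id))
        seen.append((msg_id, set(own)))
    return edges


def quote_profile(edges, msg_id):
    """Classify a message as quoting 'none', a 'single' or 'multiple' sources."""
    sources = {src for src, citing in edges if citing == msg_id}
    if not sources:
        return "none"
    return "single" if len(sources) == 1 else "multiple"


# Hypothetical three-message exchange.
messages = [
    (0, "I propose a new builtin.\nIt would be lazy."),
    (1, "> I propose a new builtin.\nWhy not a module?"),
    (2, "> Why not a module?\n>> I propose a new builtin.\nBoth could work."),
]
edges = quoted_by_edges(messages)
```

The `quote_profile` labels mirror the zero, single and multiple quote categories used above, so they could be cross-tabulated with each poster's role.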
This analysis shows again the links between the social structure and elements in the discussion space and how it shapes influence in the design process. A more fine-grained content analysis is in progress. It categorizes messages according to a coding scheme, inspired by our own previous work on collaborative design (Détienne et al, 2003; Détienne et al. in press). We distinguish between: (1) Theme: Problem addressed; (2) Activity: Prop: proposition of (alternative) solution; Agreement/disagreement (with or without arguments); Group regulation; Problem setting; Synthesis; Clarification; Explicit decision. Our objective is to analyse patterns of activities with respect to the quoting structure and the participants roles. 3. 4 Social structure and technical structure Within the field of software engineering, it has been noted that, for any large software system one can map out an "ownership architecture" (cf., Bowman and Holt, 1998). Specifically, one can chart out who "owns" -- i.e., who makes changes and extensions to -- which modules of the system. General software engineering implications for such “architectures” include, for instance, "Conway's Law”: the social structure of the project (e.g., who manages whom, who communicates with whom, etc.) has a direct influence on the structure of the software itself (e.g., its division into modules). Conway’s law (Herbsleb and Grinter, 1999) was the first explicit recognition that the communication patterns left an indelible mark upon the product built. Most OSS projects produce conventional software products (e.g., programming languages, operating systems, network clients and servers, etc.). We are exploring the possible influences of an "inverse Conway's Law" (Sack et al. 2003) that could explain how the “miracle” of organization of OSS development is not at all miraculous: the technical structure of the software might directly influence the social structure of the project. 
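To make the notion of an "ownership architecture" concrete, the following minimal sketch tallies, for each module, who made the most recorded changes. It assumes change records have already been extracted from the CVS history as (author, module) pairs; the function name and the sample records are hypothetical and are not drawn from the Python repository.

```python
from collections import Counter, defaultdict


def ownership(changes):
    """Map each module to the author with the most recorded changes.

    `changes` is an iterable of (author, module) pairs, one per change.
    Ties are resolved in favour of the author seen first.
    """
    per_module = defaultdict(Counter)
    for author, module in changes:
        per_module[module][author] += 1
    return {
        module: counts.most_common(1)[0][0]
        for module, counts in per_module.items()
    }


# Hypothetical change records.
changes = [
    ("alice", "parser"),
    ("alice", "parser"),
    ("bob", "parser"),
    ("bob", "stdlib"),
]
owners = ownership(changes)
```

Comparing such a module-to-owner map with the communication links between the same people is one way the relation between technical and social structure could be examined.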
It may be the case that OSS development methods work only because the "parceling out" of the work is well-known to most computer scientists even before the start of the project. Furthermore, this “parceling out” seems to entail the reinvention of some very old – rather than new – work roles and governance/administrative structures (e.g., a strict, top-down hierarchy and a reenactment of artisan guild roles of master, apprentice, etc., as sketched in Figure 2). We need to complete further work to test these hypotheses concerning the possible, surprising, “conservative” nature of the organization and structure of OSS development processes.

4. Discussion

While our work has uncovered some interesting possible similarities and differences between OSS design and conventional software design, we feel one of our largest accomplishments has simply been to develop a framework (a variant of actor-network analysis) for the analysis of OSS development that integrates social and cognitive dimensions. Note that our framework might be compared with analogous work currently under development in the UC system (cf., Scacchi, 2004). Our methodology has also resulted in the integration of qualitative and quantitative work and has engendered the development of automatic tools for the analysis of OSS project archives. Our joint work has given us a methodological framework and a set of practical software tools with which to expand and deepen our research in this area.

References

Bowman, I. T., & Holt, R. C. (1998) Software architecture recovery using Conway's law. Proceedings of the 1998 conference of the Centre for Advanced Studies on Collaborative research, November 1998.
D’Astous, P., Détienne, F., Robillard, P. N., & Visser, W. (in press) Changing our view on design evaluation meetings methodology: a study of software technical evaluation meetings. Design Studies.
Détienne, F., Burkhardt, J-M., & Visser, W.
(2003) Cognitive effort in collective software design: methodological perspectives in cognitive ergonomics. Proceedings of the 2nd Workshop in the Workshop Series on Empirical Software Engineering "The Future of Empirical Studies in Software Engineering", pages 17-25, Monte Porzio Catone (Rome, Italy), 29 September, 2003. Détienne, F., Martin, G., & Lavigne, E. (in press) Viewpoints in co-design : a field study in concurrent engineering. Design Studies. Ducheneaut, N. (2003) The reproduction of Open Source Software programming communities. Ph.D. Dissertation, School of Information Management and Systems, UC Berkeley, May 2003. Eklundh, K. s., & Rodriguez, H. (2004) Coherence and interactivity in text-based group discussions around web documents. Proceedings of the 37th Hawai international conference on Systems Sciences. Gacek, C., & Arief, B. (2004) The Many Meanings of Open Source, IEEE Software, 21(1), 34-40, January/February 2004. Gasser, L., Scacchi, W., Ripoche, G., & Penne, B. (2003) Understanding Continuous Design in F/OSS Projects. 16th International Conference on Software Engineering & its Applications (ICSSEA-03), December, 2003, Paris, France. Herbsleb, J. D., & Mockus, A. (2003) An empirical study of speed and communication in globally-distributed software development. IEEE Transactions on Software Engineering, 29(6). Herring, S. C. (1999) Interactional coherence in CMC. Proceedings of the 32nd Hawai international conference on Systems Sciences. Latour, B. (1987) Science in Action, Cambridge, MA: Harvard University Press. Madey, G., Freeh, V., & Tynan, R. (to appear, 2004) Modeling the F/OSS Community: A Quantative Investigation. In Koch, S., (ed.) : Free/Open Source Software Development. Idea Publishing. Mahendran, D. (2002) Serpents and Primitives: An ethnographic excursion into an Open Source community. Master’s Thesis, School of Information Management and Systems, UC Berkeley, May 2002. Mockus, A., Fielding, R.T., & Herbsleb, J. D. 
(2002) Two case studies of Open Source Software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3), 309-346.
Popolov, D., Callaghan, M., & Luker, P. (2000) Conversation space: visualizing multithreaded conversation. AVI 2000, Palermo, Italy.
Raymond, E. S. (2001) The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. Sebastopol, CA: O'Reilly. Also available at http://www.tuxedo.org/~esr/writings/cathedral-bazaar.
Sack, W., Ducheneaut, N., Mahendran, D., Détienne, F., & Burkhardt, J-M (2003) Social Architecture and Technological Determinism in Open Source Software Development, International 4S Conference: Social Studies of Science and Society, Atlanta, GA, October 2003.
Scacchi, W. (2004) Socio-Technical Interaction Networks in Free/Open Source Software Development Processes. In S.T. Acuña and N. Juristo (eds.): Peopleware and the Software Process. World Scientific Press.
Venolia, G., & Neustaedter, C. (2003) Understanding sequence and reply relationships within email conversations: a mixed-model visualization. CHI 2003, April 5-10, Florida, USA.
Yee, K-P. (2002) Zest: discussion mapping for mailing lists. CSCW 2002 (demo).

APPENDIX. Citation graph of PEP 279 discussion

Overview of the graph

This graph represents a part of the conversation about PEP 279. Each circle represents an email message, which is labeled with an arbitrary number; arrows that join messages symbolize the relation “is quoted by”. For example, the message labeled “0” is quoted by “1”, “22” and “68”. The color of each circle represents the main problem (theme) treated by the message.

Detailed view of the graph (three parts)

In the graph below, we propose a more detailed view of the same conversation, introducing time and the roles of the participants. The horizontal axis shows the day and the time at which messages were sent.
Messages are represented by a different symbol according to the role of their author in the project (BDFL, Administrators, Developers). Colors of the outlines represent the theme (design problem) addressed in the messages; colors inside symbols represent the main activity conducted via the message (agreement, disagreement, proposition, etc.). Arrows joining symbols still express the relation “is quoted by”.